
Elasticsearch for Text Analysis

March 29, 2016

Short honk: Put your code hat on. “Mining Mailboxes with Elasticsearch and Kibana” walks a reader through using open source technology to do text analysis. The example under the microscope is email, but the method will work for any text corpus ingested by Elasticsearch. The write up includes code samples and enough explanation to get the Elastic system moving forward. Visualizations are included. These make it easy to spot certain trends; for example, the top recipients of the email analyzed for the tutorial. Worth a look.
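
A chart of top recipients like the one mentioned above is usually backed by an Elasticsearch terms aggregation. Here is a minimal sketch of such a request body. It assumes a hypothetical index whose documents store the recipient address in a field named `to`; the field name and the `top_recipients` label are illustrative and not taken from the tutorial.

```python
import json

def top_recipients_query(field="to", size=10):
    """Build a request body for an Elasticsearch terms aggregation.

    `field` is a hypothetical field name; use whatever your mapping defines.
    """
    return {
        "size": 0,  # skip the matching documents, return only the buckets
        "aggs": {
            "top_recipients": {  # arbitrary label for the aggregation
                "terms": {"field": field, "size": size}
            }
        },
    }

query = top_recipients_query()
print(json.dumps(query, indent=2))
```

A body like this would be posted to the index's `_search` endpoint. Each returned bucket pairs a recipient with a document count, and those counts are what a bar chart would plot.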

Stephen E Arnold, March 29, 2016

Retraining the Librarian for the Future

March 28, 2016

The Internet is often described as the world’s biggest library containing all the world’s knowledge that someone dumped on the floor.  The Internet is the world’s biggest information database as well as the world’s biggest data mess.  In the olden days, librarians were the gateway to knowledge management, but now they need to ramp up their skills beyond the Dewey Decimal System and database searching.  Librarians need to do more, and Christian Lauersen’s personal blog explains how in “Data Scientist Training For Librarians-Re-Skilling Libraries For The Future.”

DST4L is a boot camp where librarians and other information professionals learn new skills to stay relevant.  Lauersen describes last year’s event:

“DST4L has been held three times in The States and was to be set for the first time in Europe at Library of Technical University of Denmark just outside of Copenhagen. 40 participants from all across Europe were ready to get there hands dirty over three days marathon of relevant tools within data archiving, handling, sharing and analyzing. See the full program here and check the #DST4L hashtag at Twitter.”

Over the course of three days, the participants learned about OpenRefine, a spreadsheet-like application that can be used for data cleanup and transformation.  They also learned about the benefits of GitHub and how to program using Python.  These skills are well beyond the classes taught in library graduate programs, but it is a good sign that the profession is evolving even if the academic side lags behind.

Whitney Grace, March 28, 2016
Sponsored by, publisher of the CyberOSINT monograph


Hot Data Startups to Notice

March 22, 2016

An outfit called UBM, which looks a lot like the old IDC I knew and loved, published “9 Hot Big Data and Analytics Startups to Watch.” The article is a series of separate pages. Apparently the lust for clicks is greater than the MBAs’ interest in making information easy to access. Progress in online publishing is zipping right along the information highway it seems.

What are the companies the article and UBM are describing as “hot”? I interpret the word to mean “having a high degree of heat or a high temperature” or “(of food) containing or consisting of pungent spices or peppers that produce a burning sensation when tasted.” I have a hunch the use of the word in this write up is intended to suggest big revenue producers which you must license in order to get or keep a job. Just a guess, mind you.

The companies are:

AtScale, founded in 2013

Algorithmia, founded in 2013

Bedrock Data, founded in 2012

BlueTalon, founded in 2013

Cazena, founded in 2014

Confluent, founded in 2014, founded in 2011

RJMetrics, founded in 2008

Wavefront, founded in 2013

The list is US centric. I assume none of the Big Data and analytics outfits in other countries are “hot.” I think the reason is that the research process looked at Boston, Seattle, and the Sillycon Valley pool and thought, “Close enough for horseshoes.” Just a guess, mind you.

If you are looking for the next big thing founded within the last two to eight years, the list is just what you need to make your company or organization great again. Sorry, some catchphrases are tough to purge from my addled goose brain. Enjoy the listicle. On high latency systems, the slides don’t render. Again. Do MBAs worry about this stuff? A final comment: I like the name “BlueTalon.”

Stephen E Arnold, March 22, 2016

Need a Classification Algorithm or 17?

March 21, 2016

I gave a lecture a couple of years ago about the similarity among major content processing systems. In that talk, I focused on 10 numerical recipes which our research identified in the commercial products from a number of well known intelligence platform vendors. The point of the lecture was to underscore the baked in weaknesses of platforms which use procedures taught in many universities. Outputs often vary because of the goofy decisions humans make or because the underlying data pumped into the numerical recipes is flawed.

I want to call your attention to “Implementation of 17 Classification Algorithms in R.” If you want to see how much the output of classification algorithms can differ, just fire up your system, implement these 17 methods, and check out the results. Our research reiterated to my goslings that one can select a classification algorithm to produce the type of output desired by the system engineer. Yep, put your hands on the steering wheel and drive that output pretty much where you want it to go. Do users of content processing systems know about these baked in pre-loaded destinations? Nah.
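
To make the point concrete, here is a minimal sketch in Python (not the R code from the cited write up) in which two textbook classifiers are handed the same training data and the same query point and disagree about the answer. The data set and the function names are invented for illustration.

```python
import math

# Toy training data (invented for illustration): (x, y) -> label.
TRAIN = [
    ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"), ((4.0, 4.0), "A"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
]

def nearest_neighbor(train, point):
    """Label the point with the class of its single closest training example."""
    _, label = min(train, key=lambda item: math.dist(item[0], point))
    return label

def nearest_centroid(train, point):
    """Label the point with the class whose mean position is closest."""
    groups = {}
    for coords, label in train:
        groups.setdefault(label, []).append(coords)
    centroids = {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in groups.items()
    }
    return min(centroids, key=lambda label: math.dist(centroids[label], point))

query = (3.6, 3.6)
print("1-nearest-neighbor:", nearest_neighbor(TRAIN, query))
print("nearest-centroid:  ", nearest_centroid(TRAIN, query))
```

The nearest-neighbor rule latches onto the stray "A" example at (4, 4), while the centroid rule averages that example away and sides with "B". Same data, different destination, depending on which recipe the engineer picked.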

Stephen E Arnold, March 21, 2016

Google Decides to Be Nice to

March 18, 2016

Google is renowned for its technological endeavors, beautiful office campuses, and smart employees, and also for being a company full of self-absorbed and competitive people.  While Google might have a lot of perks, it also has its dark side.  According to Quartz, Google wanted to build more productive teams, so it launched Project Aristotle to work out how. The finding: “After Years Of Intensive Analysis, Google Discovers The Key To Good Teamwork Is Being Nice.”

Project Aristotle studied hundreds of employees in different departments and analyzed their data.  The researchers wanted to find a “magic formula,” but it all boils down to one of the things taught in kindergarten: be nice.

“Google’s data-driven approach ended up highlighting what leaders in the business world have known for a while; the best teams respect one another’s emotions and are mindful that all members should contribute to the conversation equally. It has less to do with who is in a team, and more with how a team’s members interact with one another.”

The best teams are made up of members who understand and respect each other and allow each other to contribute to the conversation equally.  It is a basic human tenet and even one of the better ways to manage a relationship, according to marriage therapists around the world.  Another result of the project is dubbed “psychological safety,” where team members create an environment with the established belief they can take risks and share ideas without ridicule.

Will psychological safety be a new buzzword since Google has “discovered” that being nice works so well?  The term has been around for a while, at least since 1999.

Google’s research yields a business practice that other companies have already adopted: Costco, Trader Joe’s, Pixar, and Sassie, to name a few.  Yet why is it so hard to be nice?


Whitney Grace, March 18, 2016

Gartner and the Business Intelligence Magic Quadrant: Lots of Explaining, Lots of Subjectivity It Seems

March 13, 2016

I read a downright weird article/interview called “Big Data Discovery may put Oracle back in BI Magic Quadrant.” The title contains the magic word “may”, which does not promise to make Oracle a big dot in a Gartner Magic Quadrant, but it suggests that Gartner is doing some explaining.

As I understand the situation, the mid tier consulting firm analyzed the business intelligence sector and figured out which companies were winners and losers. Well, that’s the lingo that the original Boston Consulting Group quadrant used, and that’s how General Eisenhower used his quadrant. So those approaches override the Gartner words like niche players and visionaries. (Is it not possible for a niche player to be a visionary? Does Gartner know “Venn” to check its logic?)

The point of the write up is that Oracle, one of the big dogs in the Department of Defense’s DCGS-A and DCGS-N mash up analytics initiative, is not in the Gartner magic square thing. Nope. Deleted.

Why may be a question which some folks at Oracle have been asking. The article/interview appears to be an “explainer” to make the Gartner mid tier method appear nearer the top drawer in the cabinet of analytics collectibles.

I noted this passage:

Question: It sounds like the change isn’t coming from something Oracle did, but from Gartner.

Gartner’s R&D Big Dog, Josh Parenteau: Right, OBIEE is still there. It’s still being sold as their platform, but it does not meet the modern definition of the Magic Quadrant right now.

The acronym OBIEE means Oracle Business Intelligence Enterprise Edition, that is, Oracle analytics. You, gentle reader, knew that.

Oracle was excluded because “they didn’t fully participate,” says Parenteau. He adds:

I do think that they’re late to the game by quite a bit… For Oracle, it’s recognizing the signals a bit earlier. It’s responding to customer needs and, I think, realizing that it’s not just about product. You can have the best product in the world, but if customers don’t want to work with you because they don’t like the relationship, it’s not going to matter.

So what companies of note made the Magic Quadrant? Since I don’t pay Gartner to advise me, I checked Bing and Google to locate the 2016 Magic Quadrant for Business Intelligence. It did not take long, because this MQ report appears to be a marketing item, not a confidential study like a report about the AVATAR program.

Check out these outfits who have met the Gartner criteria, objective and subjective:

  • BeyondCore
  • Domo
  • Logi Analytics
  • Platfora
  • Sisense

Okay, some names of note.

These outfits made the list as well:

  • IBM
  • Microsoft
  • SAS

I highlighted this paragraph as particularly suggestive:

But I would say that, if you are a member of the install base of Oracle, know that they do have offerings in the space. They just didn’t have enough traction to get on the quadrant. If you have a big data Hadoop initiative going on, of course look at Big Data Discovery, because that’s exactly what it’s focused on. If you are looking for a tool to do data discovery, of course look at Visual Analyzer, which is part of the cloud service. If you have an initiative to get into the cloud, look at BICS. I wouldn’t say that, just because they’re not on the Magic Quadrant, if you’re an existing Oracle customer that you shouldn’t continue to look at them for solutions. This doesn’t mean that they are gone forever or off the MQ forever. It’s a transition. We’re in a market that is transitioning. Next year, it may be a new ball game.

Very mid tier. I liked the “you shouldn’t continue to look at them for solutions.” Are those words a positive or a negative? Worth watching the interaction of the Oracle folks and the Gartner experts.

Stephen E Arnold, March 13, 2016

Microsoft Predictions for the Oscars in 2016

March 5, 2016

I know that Microsoft has a prediction system. I don’t pay much attention to Bing or other Microsoft technology. I understand that I am an analog brontosaurus.

I noted “Microsoft Bing Correctly Guessed Almost Every Oscar Winner Last Year but It Didn’t Do As Well This Year.”

The point, for me, is that predictive systems need to be based on numerical recipes which perform in a consistent manner. One can fiddle the definition of “consistency,” but when a predictive system is driving an autonomous vehicle, identifying a treatment for death, or identifying the worthy individuals as Oscar winners—the systems have to be pretty darned accurate.

The write up points out:

In 2015, Microsoft Bing’s prediction engine nailed the Academy Awards, guessing 20 out of 24 Oscar winners. The year before that, it did even better, going 21 for 24.

But in 2016, the Bingster, according to the write up:

only guessed 71% of the winners correctly, with 17 out of 24 correct choices.
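
The percentages behind those quoted counts are easy to check. This is a small sketch using only the numbers in the write up; the year labels assume "the year before" 2015 means 2014, and the function name is illustrative.

```python
# Correct picks out of 24 categories, as quoted in the write up.
results = {2014: 21, 2015: 20, 2016: 17}
CATEGORIES = 24

def hit_rate(correct, total=CATEGORIES):
    """Return the share of correct picks as a percentage."""
    return 100.0 * correct / total

for year, correct in sorted(results.items()):
    print(f"{year}: {correct}/{CATEGORIES} = {hit_rate(correct):.1f}%")
```

The 2016 figure comes to about 70.8 percent, which rounds to the 71 percent quoted and sits a little above two thirds (66.7 percent).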

In the real world, Bing’s predictive methods can chop out some highly probable losers. That may be quite useful for some applications like narrowing down a list of potential contractors.

For certain real world applications involving risk to life and limb in some far off war zone, I am not sure the Bing predictive engine will be number one on my list of systems upon which to rely.

The write up does not share my opinion, describing the result as “pretty okay.” Well, for me, a two thirds outcome is not pretty okay. It is below average, almost C minus or D plus territory.

The consumer angle suggests that Microsoft in terms of search and content processing may be prepping to become the next Yahoo.

Stephen E Arnold, March 5, 2016

Quid Cheerleading: The Future of Search

March 4, 2016

I read “The Future of {Re}search.” (I love the curly braces.) The write up identifies the four big things in information access. Keep in mind that the write up is a rah rah for Quid, which is okay.

Here are the main points:

  • Semantic search is the next big thing
  • Visualization matters
  • Humans are part of the search process
  • Bots are the “Future of Search.” (The capitalization is from the source document.)

Quid is an interesting company. I thought that the firm was focused on analytics and nifty visualizations. Their catchphrase is “intelligence amplified,” which strikes me as similar to Palantir’s “augmented intelligence.”

If the write up is on the money, Quid is a search vendor in the same way Palantir Technologies is a search vendor.

The point about bots may catch the attention of the ever-alert Connotate folks. I think bots have been an important part of that firm’s services for many years.

So, “the next big thing”? Well, sort of.

Stephen E Arnold, March 4, 2016

Hershey Chocolate: Semi Sweet Analytics?

March 4, 2016

I am wrapping up my profile of Palantir Technologies. I located a couple of references to Palantir’s activities in the non-government markets. One of the outfits allegedly wooed by the Hobbits was Hershey chocolate. A typical reference to the Hobbits and Kisses folks was “Hershey Turns Kisses and Hugs into Hard Data.”


When I read “The Hershey Company Partners with Infosys to Build Predictive Analytics Capability using Open Source Information Platform on Amazon Web Services,” I wondered why Palantir Technologies was not featured in the write up. Praescient Analytics, near Washington, DC, can plug industrial strength predictive analytics like Recorded Future’s into a Metropolis installation without much hassle.

The write up makes clear that the chocolate outfit is going a new way. The path leads through Amazon Web Services to the Infosys Information Platform.

I find this quite a surprise. I have no doubt that Infosys has some competent folks on its team. But the questions flashing through my mind are:

  • What’s up with the Palantir system?
  • Why jump to Infosys when there are darned good outfits available in Boston and Washington, DC?
  • What’s an outsourcing firm able to deliver that specialists with deep experience in making sense of data cannot?

I never understood Mars, and now I don’t understand the makers of the York Peppermint Patty.

Perhaps this is a “whopper” of a project?

Stephen E Arnold, March 4, 2016

The FBI Uses Its Hacking Powers for Good

March 4, 2016

In a victory for basic human decency, Engadget informs us, the “FBI Hacked the Dark Web to Bust 1,500 Pedophiles.” Citing an article at Vice Motherboard, writer Jessica Conditt describes how the feds identified their suspects through a site called (brace yourself) “Playpen,” which was launched in August 2014. We learn:

Motherboard broke down the FBI’s hacking process as follows: The bureau seized the server running Playpen in February 2015, but didn’t shut it down immediately. Instead, the FBI took “unprecedented” measures and ran the site via its own servers from February 20th to March 4th, at the same time deploying a hacking tool known internally as a network investigative technique. The NIT identified at least 1,300 IP addresses belonging to visitors of the site.

“Basically, if you visited the homepage and started to sign up for a membership, or started to log in, the warrant authorized deployment of the NIT,” a public defender for one of the accused told Motherboard. He said he expected at least 1,500 court cases to stem from this one investigation, and called the operation an “extraordinary expansion of government surveillance and its use of illegal search methods on a massive scale,” Motherboard reported.

Check out this article at Wired to learn more about the “network investigative technique” (NIT). This is more evidence that, if motivated, the FBI is perfectly capable of leveraging the Dark Web to its advantage. Good to know.


Cynthia Murrell, March 4, 2016

