
An Oddly Mystical, Whimsical Listicle Combining Big Data and Search

July 4, 2015

Some listicles are clearly the work of college students after a tough beer pong tournament. Others seem as if they emanate from beyond Pluto’s orbit. I am not sure where on this spectrum between the addled and extraterrestrial the listicle in “Top 11 Open Source big Data Enterprise Search Software” falls.

Here’s the list for your contemplation. I have added some questions after each company’s name. Consult the original write up for the explanation of why these systems were included in the list. I found the write ups lacking heft or “wood,” to use a Google term.

  1. Apache Solr. Yep, uses Lucene libraries, right. Performance? Exciting sometimes.
  2. Apache Lucene Core. Ah, Lego blocks for the engineer with some aspirations for continuous employment.
  3. Elasticsearch. The leader in search and retrieval. To do big data, there are some other components required. Make sure your programming and engineering expertise are up to the job.
  4. Sphinx. Okay, workable for structured data. Work required to stuff unstructured content into this system.
  5. Constellio. Isn’t this a part time project of a consulting firm focused on Canadian government work?
  6. DataparkSearch Engine. Yikes.
  7. ApexKB. Okay, a script. For enterprise applications. Big Data? Wow.
  8. Searchdaimon ES. Useful, speedier than either Lucene or Elasticsearch. Not a big data engine without some extra work. Come to think of it. A lot of work.
  9. mnoGoSearch. Well, maybe for text.
  10. Nutch. Long in the tooth. Why not use Lucene?
  11. Xapian. Very robust. Make certain that you have programming expertise and engineering knowledge. Often ignored which is too bad. But be prepared for some heavy lifting or paying a wizard with a mental fork lift to do the job.

Now which of these systems can do “big data”? In one sense, if you are exceptionally gifted with engineering and programming skills, I suppose any of these can do tricks. As Samuel Johnson allegedly observed to his biographer:

“Sir, a woman’s preaching is like a dog’s walking on his hind legs. It is not done well; but you are surprised to find it done at all.”

On the other hand, these programs can be used as a utility within a more robust content processing system which has been purpose built to deal with large flows of structured and unstructured content. But even that takes work.

Anyone want to give Constellio a shot at processing real time Facebook posts? Anyone want to use any of these systems to solve that type of search problem? Show of hands, please?

Stephen E Arnold, July 4, 2015

Forrester: Join Us in the Revolution

June 28, 2015

Err, I am not a revolutionary. The term evokes memories and thoughts which I find uncomfortable. Revolution, Forrester, IS/ISIL/Daesh. Shiver.

“‘Big Data’ Has Lost Its Zing – Businesses Want Insight And Action” is one of those marketing, mid tier consulting pronouncements. Most of these are designed to stimulate existing customers to buy more expertise or to lure those with problems which the management team cannot solve to the door of an expert who purports to have the answer.

I highlighted this passage in pale yellow with my trusty Office Depot highlighter:

I saw it coming last year. Big data isn’t what it used to be. Not because firms are disillusioned with the technology, but rather because the term is no longer helpful. With nearly two-thirds of firms having implemented or planning to implement some big data capability by the end of 2015, the wave has definitely hit. People have bought in. But that doesn’t mean we find many firms extolling the benefits they should be seeing by now; even early adopters still have problems across the customer lifecycle.

Big Data faces challenges because users want accurate, reliable outputs. News?

Stephen E Arnold, June 28, 2015

Deep Learning System Surprises Researchers

June 24, 2015

Researchers were surprised when their scene-classification AI performed some independent study, we learn from Kurzweil’s article, “MIT Deep-Learning System Autonomously Learns to Identify Objects.”

At last December’s International Conference on Learning Representations, a research team from MIT demonstrated that their scene-recognition software was 25-33 percent more accurate than its leading predecessor. They also presented a paper describing the object-identification tactic their software chose to adopt; perhaps this is what gave it the edge. The paper’s lead author, MIT computer science and engineering associate professor Antonio Torralba, ponders the development:

“Deep learning works very well, but it’s very hard to understand why it works — what is the internal representation that the network is building. It could be that the representations for scenes are parts of scenes that don’t make any sense, like corners or pieces of objects. But it could be that it’s objects: To know that something is a bedroom, you need to see the bed; to know that something is a conference room, you need to see a table and chairs. That’s what we found, that the network is really finding these objects.”

Researchers being researchers, the team is investigating their own software’s initiative. The article tells us:

“In ongoing work, the researchers are starting from scratch and retraining their network on the same data sets, to see if it consistently converges on the same objects, or whether it can randomly evolve in different directions that still produce good predictions. They’re also exploring whether object detection and scene detection can feed back into each other, to improve the performance of both. ‘But we want to do that in a way that doesn’t force the network to do something that it doesn’t want to do,’ Torralba says.”

Very respectful. See the article for a few more details on this ambitious AI, or check out the researchers’ open-access paper here.

Cynthia Murrell, June 24, 2015

Sponsored by, publisher of the CyberOSINT monograph


MIT Discover Object Recognition

June 23, 2015

MIT did not discover object recognition, but its researchers did find that a deep-learning system designed to recognize and classify scenes can also be used to recognize individual objects. Kurzweil describes the exciting development in the article, “MIT Deep-Learning System Autonomously Learns To Identify Objects.” The MIT researchers realized that deep learning could be used for object identification while they were training a machine to identify scenes. They had compiled a library of seven million entries categorized by scene when they learned that object recognition and scene recognition could work in tandem.

“ ‘Deep learning works very well, but it’s very hard to understand why it works — what is the internal representation that the network is building,’ says Antonio Torralba, an associate professor of computer science and engineering at MIT and a senior author on the new paper.”

When the deep-learning network was processing scenes, it was fifty percent accurate compared to a human’s eighty percent accuracy. While the network was busy identifying scenes, it was also learning how to recognize objects. The researchers are still trying to work out the kinks in the deep-learning process and have decided to start over. They are retraining their networks on the same data sets, but taking a new approach to see whether scene and object recognition tie together or go in different directions.

Deep-learning networks have major ramifications, including improvements for many industries. However, will deep learning be applied to basic search? Image search still does not work well when you search by an actual image.

Whitney Grace, June 23, 2015

Big Data and Old, Incomplete Listicles

June 19, 2015

I enjoy lists of the most important companies, the top 25 vendors of a specialized service, and a list of companies I should monitor. Wonderful stuff because I encounter firms about which I have zero information in my files and about which I have heard nary a word.

An interesting list appears in “50 Big Data Companies to Follow.” The idea is that I should set up a Google Alert for each company and direct my Overflight system to filter content mentioning these firms. The problem with this post is that the information does not originate with Datamation or Data Science Center. The list was formulated by Sand Hill in a story called “Sand Hill 50 ‘Swift and Strong’ in Big Data.” The list was compiled prior to its publication in January 2014. This makes the list 18 months old. Given the speed of change in Big Data, the list is, in my opinion, stale.

A similar list, “CRN 50 Big Data Business Analytics Companies,” appears on the CRN Web site. This list seems to date from the middle of 2014, which makes it about a year old. Better but not fresh.

I did locate an update called “2015 Big Data 100: Business Analytics.” Locating a current list of Big Data companies was not easy. Presumably my search skills are sub par. Nevertheless, the list is interesting.

Here are some firms in Big Data which were new to me:

  • Guavus
  • Knime
  • Zoomdata

But the problem was that the CRN Web site presented only 46 vendors, not 100.


  • Datamation is pushing out via its feed links to old content originating on other publishers’ Web sites
  • The obscurity of the names in the list is the defining characteristic of the lists
  • Getting a comprehensive, current list of Big Data vendors is difficult. Data Science just listed 15 companies and back linked to Sand Hill. CRN displayed 46 companies but forced me to click on each listing. I could not view the entire list.

Not too useful, folks.

Stephen E Arnold, June 19, 2015

Need Confidence in Your Big Data? InfoSphere Delivers Assurances

June 17, 2015

I spotted a tweet about a white paper titled “Improve the Confidence in Your Big Data with IBM InfoSphere.” The write up was a product of Information Asset LLC, a company with which I was not familiar. The link in the tweet was dead, so I located a copy of the white paper on the IBM Web site at this link, which I verified on June 17, 2015. If it is dead when you look for the white paper, take it up with IBM, not me.

The white paper is seven pages long and explains that IBM’s InfoSphere is the hub of some pretty interesting functions; specifically:

  1. Big Data exploration
  2. Enhanced 360 [degree] view of the customer
  3. Application development and testing
  4. Application efficiency
  5. Security and compliance
  6. Application consolidation and retirement
  7. Data warehouse augmentation
  8. Operations analysis
  9. Security/intelligence extension.

I thought InfoSphere was a brand created in 2008 by IBM marketers to group IBM’s different information management software products into one basket. The Big Data thing is a new twist for me.

The white paper takes each of these nine topics and explains them one by one. I found some interesting tidbits in several of the explanations, but I have only enough energy and good humor to tackle one category, Big Data exploration.

The notion of exploring Big Data is an interesting one. I thought one normalized, queried, and reviewed results of a query. The exploration thing is foreign to me. Big Data, by definition, are, well, big. Big collections of data are difficult to explore. I formulate queries, look at results, review clusters, etc. I suppose I am exploring, but I think of the work as routine database look ups. I am so hopelessly old fashioned, aren’t I? Some outfits like Recorded Future generate reports which illustrate certain query results, but we are back to queries, aren’t we?

Here’s what I learned about InfoSphere’s capabilities. Keep in mind that InfoSphere is a collection of discrete software programs and code systems.

Data scientists need to explore and mine Big Data to uncover interesting nuggets that are relevant for better decision making. A large hospital system built a detailed model to predict the likelihood that patients with congestive heart failure would be readmitted within 30 days. Smoking status was a key variable that was strongly correlated with the likelihood of readmission. At the outset, only 25 percent of the structured data around smoking status was populated with binary yes/no answers. However, the analytics team was able to increase the population rate for smoking status to 85 percent of the encounters by using content analytics. The content analytics team was also able to use physicians’ and nurses’ notes to unlock additional information, such as smoking duration and frequency. There were a number of reasons for the discrepancy. For example, some patients indicated that they were non-smokers, but text analytics revealed the following in the doctors’ notes: “Patient is restless and asked for a smoking break,” “Patient quit smoking yesterday,” and “Quit.” IBM InfoSphere Big Insights offers strong text analytic capabilities. In addition, IBM InfoSphere Business Glossary provides a repository for key definitions such as “readmission.” IBM InfoSphere Master Data Management provides an Enterprise Master Patient Index to track readmissions for the same patient across multiple hospitals in the same network. Finally, IBM InfoSphere Data Explorer provides robust search capability across unstructured data.
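
The note mining described in that passage can be sketched in a few lines. This is a toy illustration and not IBM's method: the patterns, the function names `infer_smoking_status` and `population_rate`, and the sample encounters are all assumptions made up for the example. It shows how free-text notes might fill in or override a sparse structured smoking field.

```python
import re

# Hypothetical patterns; real clinical text analytics is far more involved.
SMOKER_PATTERNS = [
    re.compile(r"\bsmoking break\b", re.IGNORECASE),
    re.compile(r"\bquit smoking (yesterday|today|last week)\b", re.IGNORECASE),
    re.compile(r"\bcurrent smoker\b", re.IGNORECASE),
]


def infer_smoking_status(structured_value, notes):
    """Return 'yes', 'no', or None (unknown) for one patient encounter.

    Evidence of recent smoking in the notes wins over the structured field.
    Otherwise the structured value is kept as-is, including None.
    """
    for note in notes:
        if any(p.search(note) for p in SMOKER_PATTERNS):
            return "yes"
    return structured_value


def population_rate(values):
    """Fraction of encounters with a non-empty smoking status."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


# Made-up sample encounters: (structured smoking field, free-text notes).
encounters = [
    (None, ["Patient is restless and asked for a smoking break"]),
    ("no", ["Patient quit smoking yesterday"]),
    (None, ["No complaints overnight"]),
    ("no", ["Patient resting comfortably"]),
]

before = [structured for structured, _ in encounters]
after = [infer_smoking_status(s, notes) for s, notes in encounters]

print(population_rate(before))  # share populated from structured data alone
print(population_rate(after))   # share populated after reading the notes
print(after)
```

Note what the sketch is doing: matching strings in text. That is search with a pattern list attached, which is worth keeping in mind when reading the word "exploration."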

Okay, search is the operative word. I find this fascinating because IBM is working hard to convince me that Watson can ingest information and figure out what it means and then answer questions automatically. For example, if a cancer doctor does not know what treatment to use, Watson will tell her.

I must tell you that this white paper illustrates the fuzzy thinking that characterizes many firms’ approach to information challenges. Remember. The InfoSphere Big Data explorer is just one of nine capabilities of a marketing label.

Useful? Just ring up your local IBM regional office and solve nine problems with that phone call. Magic. Big insights too.

Stephen E Arnold, June 17, 2015

Paper.Li Enterprise Search Punts

June 15, 2015

Short honk: I monitor the automated “newsletter” called The Enterprise Search Daily. I am not sure how one receives this publication, but I use this url. In the last few days, there has been minimal—maybe zero—enterprise search news. The publication appears to recycle information about Big Data and text analytics. We will continue to report on the search flounders, oops, I mean, search vendors who offer enterprise search solutions. The problem is that venture backed enterprise search start ups will have to do some fancy dancing to explain why a search for enterprise search brings up items like this:


The Beyond Search team will soldier on with one comment: Enterprise search does not do Big Data without some careful wordsmithing.

Stephen E Arnold, June 15, 2015

IBM: The Me Too Principle

June 15, 2015

Short honk: Big company innovation boils down to a handful of tactics. The most common is buying another outfit and using that firm’s achievements as one’s own. This is a variation of the entitlement culture or “look what money can buy.” Another approach is to hire an innovator and have that individual build a group. An interesting example of this tactic is Microsoft’s hiring Babak Amir Parviz (yep, the fellow with musical names). Then Google hires Dr. Amirparviz. The next jump is that Dr. Parviz (same fellow now) joins Amazon. Each company inherits his wizardry. The third tactic is to imitate, which works reasonably well. Autonomy offered a “Portal in a Box” years ago. Other companies quickly followed with their own “in a box” strategy. The apex of the me too is the Google “search in a box” appliance.

Now navigate to the font of business and management expertise—the New York Times. Read if you can find it “IBM Invests to Help Open-Source Big Data Software — and Itself.” The big idea is that IBM is getting into Big Data software. You know, the trend which allowed IBM to convert a search utility like Vivisimo into the Big Data big dog. Well, apparently not. The write up states:

The company is placing a large investment — contributing software developers, technology and education programs — behind an open-source project for real-time data analysis, called Apache Spark.

And what is Spark, in case you have not been following the breathless news releases from various open source commercial players like Lucidworks (Really?)? The write up states:

But if Hadoop opens the door to probing vast volumes of data, Spark promises speed. Real-time processing is essential for many applications, from analyzing sensor data streaming from machines to sales transactions on online marketplaces. The Spark technology was developed at the Algorithms, Machines and People Lab at the University of California, Berkeley. A group from the Berkeley lab founded a company two years ago, Databricks, which offers Spark software as a cloud service. Spark, Mr. Picciano said, is crucial technology that will make it possible to “really deliver on the promise of big data.” That promise, he said, is to quickly gain insights from data to save time and costs, and to spot opportunities in fields like sales and new product development.

And IBM has lots of programmers:

IBM said it will put more than 3,500 of its developers and researchers to work on Spark-related projects. It will contribute machine-learning technology to the open-source project, and embed Spark in IBM’s data analysis and commerce software. IBM will also offer Spark as a service on its programming platform for cloud software development, Bluemix. The company will open a Spark technology center in San Francisco to pursue Spark-based innovations.

The write up explains that IBM has a plan. The gray lady puts it this way via the mouth of an estimable azure chip consulting firm:

“IBM makes its money higher up, building solutions for customers,” said Mike Gualtieri, an analyst for Forrester Research. “That’s ultimately why this makes sense for IBM.”

But… but… hasn’t IBM suffered declining revenues and profit erosion over the last three years? Irrelevant item. Set a spark to that tinder and watch the marketing collateral burn. And Vivisimo? Don’t know. Never did.

Stephen E Arnold, June 15, 2015

LinkedIn: A Pinot for a Flavor Profile with a Narrow Market

June 13, 2015

LinkedIn is the social network for professionals. The company meets the needs of individuals who want to be hired and companies looking to find individuals to fill jobs. We use the system to list articles I have written. If you examine some of the functions of LinkedIn, you may discover that sorting is a bit of a disappointment.

LinkedIn has been working hard to find technical solutions to its data management challenges. One of the company’s approaches has been to create software, make it available as open source, and then publicize the contributions.

A recent example is the article “LinkedIn Fills Another SQL-on-Hadoop Niche.” What is interesting in the write up is that the article does not make clear what LinkedIn does with this software home brew. I learned:

Pinot was designed to provide the company with a way to ingest “billions of events per day” and serve “thousands of queries per second” with low latency and near-real-time results — and provide analytics in a distributed, fault-tolerant fashion.

On the surface, it seems that Hadoop is used as a basket. Then the basket’s contents are filtered using SQL queries. But for me the most interesting information in the write up is what the system does not do; for example:

  • The SQL-like query language used with Pinot does not have the ability to perform table joins
  • The data is (sic) strictly read-only
  • Pinot is narrow in focus.
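
What a no-joins, read-only store implies for whoever feeds it can be sketched as follows. This is a hypothetical illustration in plain Python, not Pinot's API or its query language: the record fields (`member_id`, `industry`, `action`) and the helper names `denormalize` and `count_by` are assumptions. The join happens once, before loading, and every later query touches only the single flat table.

```python
from collections import Counter
from types import MappingProxyType

# Made-up source data that a relational system would keep in two tables.
members = {
    1: {"industry": "software"},
    2: {"industry": "finance"},
}
events = [
    {"member_id": 1, "action": "profile_view"},
    {"member_id": 2, "action": "profile_view"},
    {"member_id": 1, "action": "search"},
]


def denormalize(events, members):
    """Pre-join member attributes onto each event before loading.

    Rows are wrapped in MappingProxyType so they are read-only once built.
    """
    flat = []
    for event in events:
        row = dict(event)
        row.update(members[event["member_id"]])
        flat.append(MappingProxyType(row))
    return tuple(flat)


def count_by(table, group_key, **filters):
    """Single-table filter plus group-by count; no join at query time."""
    matching = (
        row for row in table
        if all(row[k] == v for k, v in filters.items())
    )
    return Counter(row[group_key] for row in matching)


table = denormalize(events, members)
views_by_industry = count_by(table, "industry", action="profile_view")
print(dict(views_by_industry))
```

The cost shows up upstream: someone has to build and maintain the flattening step, and any question that needs a second table means reloading the data.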

Has LinkedIn learned that its internal team needs more time and money to make Pinot a mash up with wider appeal? Commercial companies going open source is often a signal that the assumptions of the in house team have collided with management’s willingness to pay for a sustained coding commitment.

Stephen E Arnold, June 13, 2015

Mongo the Destroyer and JSON and the Datanauts Team Up

June 12, 2015

Hadoop fans, navigate to “A Better Mousetrap: A JSON Data Warehouse Takes on Hadoop.” There are a couple of very interesting statements in this write up. Those who do the Hadoop the loop know that certain operations are sloooow. Other operations are not efficient for certain types of queries. One learns about these Hadoop the Loops over time, but the issues are often a surprise to the Hadoop/Big Data cheerleaders.

The article reports that SonarW may have a good thing with its Mongo and JSON approach. For example, I highlighted:

In other words, Hadoop always tries to maximize resource utilization. But sometimes you need to go grab something real quick and you don’t need 100 nodes to do it.

That means the SonarW approach might address some sharply focused data analysis tasks. I also noted:

What could work to SonarW’s advantage is its simplicity and lower cost (starting at $15,000 per terabyte) compared to traditional data warehouses and MPP systems. That might motivate even non-MongoDB-oriented companies to at least kick the tires.

Okay, good. One question crossed my mind: Will SonarW’s approach provide cost and performance capabilities that offer some options to XML folks thinking JSON thoughts?

I think SonarW warrants watching.

Stephen E Arnold, June 12, 2015
