Lucid Imagination: Open Source Search Reaches for Big Data
September 30, 2011
We are wrapping up a report about the challenges “big data” poses to organizations. Perhaps the most interesting outcome of our research is that very few search and content processing systems can cope with the digital information some organizations must process. Three examples merit listing before I comment on open source search and “big data”.
The first example is the challenge of filtering information produced within the organization and by the organization’s staff, contractors, and advisors. We learned in the course of our investigation that the promises of processing updates to Web pages, price lists, contracts, sales and marketing collateral, and other routine information are largely unmet. One of the problems is that these disparate content types have different update and change cycles. The most widely used content management system, based on our research results, is SharePoint, and SharePoint is not able to deliver a comprehensive listing of content without significant latency. Fixes are available, but these are engineering tasks which consume resources. Cloud solutions do not fare much better, once again due to latency. The bottom line is that, for information produced within an organization, employees are mostly unable to locate information without a manual double check. Latency is the problem. We did identify one system which delivered documented latency of 10 to 15 minutes across disparate content types. The solution is available from Exalead; the other vendors’ systems were not able to match this performance in putting fresh, timely information produced within an organization in front of system users. Shocked? We were.
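To make the latency point concrete, here is a minimal, hypothetical sketch of how one might measure indexing lag: note when a document changes, poll the search system until the document becomes retrievable, and report the gap. The SearchClient interface and its method are illustrative assumptions, not any vendor’s actual API.

```java
import java.time.Duration;
import java.time.Instant;

public class FreshnessProbe {

    // Hypothetical client interface; a real probe would wrap SharePoint search,
    // a Solr HTTP endpoint, or whatever system is being evaluated.
    interface SearchClient {
        boolean isRetrievable(String documentId) throws Exception;
    }

    /** Polls the search system until the changed document shows up, then reports the lag. */
    static Duration measureIndexLatency(SearchClient client, String documentId,
                                        Instant changedAt, Duration pollInterval) throws Exception {
        while (!client.isRetrievable(documentId)) {
            Thread.sleep(pollInterval.toMillis());
        }
        return Duration.between(changedAt, Instant.now());
    }
}
```

Run against a live system, a probe along these lines would document whether “fresh” content appears in minutes or in hours.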
Reducing latency in search and content processing systems is a major challenge. Vendors often lack the resources required to solve a “hard problem,” so “easy problems” are positioned as the key to improving information access. Is latency a popular topic? A few vendors do address the issue; Digital Reasoning and Exalead are examples.
Second, when organizations tap into content produced by third parties, the latency problem becomes more severe. There is the issue of the inefficiency and scaling cost of frequent index updates. But the larger problem is that once an organization “goes outside” for information, additional variables are introduced. In order to process the broad range of content available from publicly accessible Web sites or the specialized file types used by certain third party content producers, connectors become a factor. Most search vendors obtain connectors from third parties. These work pretty much as advertised for common content sources such as Lotus Notes. However, when one of the targeted sources, such as a commercial news service or a third-party research firm, makes a change, the content acquisition system cannot acquire content until the connectors are “fixed.” No problem, as long as the company needing the information is prepared to wait. In my experience, broken connectors mean another variable. Again, no problem unless critical information needed to close a deal is overlooked.
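As a rough illustration of why connectors are brittle, consider a sketch of a connector that assumes a fixed layout in the source it acquires. The field names and the ConnectorException are hypothetical; the point is that the hard-coded assumptions break the moment the provider changes its output, and acquisition stops until someone rewrites the mapping.

```java
import java.util.Map;

public class NewsFeedConnector {

    static class ConnectorException extends Exception {
        ConnectorException(String message) { super(message); }
    }

    /**
     * Maps one record from a (hypothetical) third-party news feed into the
     * fields the indexing pipeline expects. The hard-coded field names are the
     * fragile part: if the provider renames "headline" to "title", every
     * record fails here until the connector is updated.
     */
    Map<String, String> toIndexableRecord(Map<String, String> sourceRecord) throws ConnectorException {
        String headline = sourceRecord.get("headline");
        String body = sourceRecord.get("story_text");
        if (headline == null || body == null) {
            throw new ConnectorException("Source layout changed; expected fields are missing");
        }
        return Map.of("title", headline, "content", body);
    }
}
```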
Third, many organizations are seeking to make sense of flows of social content. In our research, we learned that a majority of those with whom we spoke expressed interest in Facebook, Twitter, and other social content. However, few firms were actually using social content; those that did elected to tap into a subset of the stream via a third party provider like DataSift and to process the content with a third party solution optimized for that content stream. This is a workable approach, but it underscores the problem of making needed information available to employees or analysts who need timely data for a business decision.
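The “subset of the stream” pattern can be pictured with a short, generic sketch: filter incoming posts against a watch list of terms before handing them to an indexing or analytics pipeline. The SocialPost record and the in-memory list stand in for whatever feed a provider actually delivers; this is an illustration of the pattern, not any provider’s API.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SocialStreamFilter {

    record SocialPost(String author, String text) { }

    /** Keeps only the posts that mention at least one watched term (case-insensitive). */
    static List<SocialPost> relevantPosts(List<SocialPost> incoming, List<String> watchTerms) {
        return incoming.stream()
                .filter(post -> watchTerms.stream()
                        .anyMatch(term -> post.text().toLowerCase().contains(term.toLowerCase())))
                .collect(Collectors.toList());
    }
}
```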
Net net.
It is easy to talk about low latency and real time content processing, but it remains expensive and mostly an unmet goal. Many vendors focus on jumping over less formidable hurdles.
We found the write-up “Lucid Imagination Brings Open Source Search to Big Data” quite interesting. I have not associated open source search and big data in the manner set forth in the Fierce write-up. Here’s a Lucid statement which caught my attention:
One of the biggest issues for companies adopting open source software is that they often lack the polished installation packages of the more commercial software packages. LucidWorks 2.0 provides new setup and management tools via an administrative console that in its words, “streamlines configuration, deployment and operations” for IT when setting up, deploying and managing the application.
This is indeed a point. Installation of some open source software is impossible without appropriate technical expertise. However, installation is manageable by those with enough knowledge to decide to use open source software in the first place. Managers may not be able to do it, but managers can hire people who can. My experience is that those who are interested in downloading Lucene/Solr from Apache.org presumably have some technical chops.
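For readers wondering what “technical chops” means in practice, here is a minimal sketch of indexing and querying a single document with the Lucene Java API as it looked in the 3.x releases current at the time of writing. Package and class names change across Lucene versions, so treat this as illustrative rather than copy-and-paste ready.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        // In-memory index; a production deployment would use a disk-based Directory.
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_34, analyzer));

        // Index one document with a couple of analyzed fields.
        Document doc = new Document();
        doc.add(new Field("title", "Quarterly price list", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("body", "Updated pricing for the third quarter", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Parse a query and print the titles of matching documents.
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        Query query = new QueryParser(Version.LUCENE_34, "body", analyzer).parse("pricing");
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title"));
        }
        searcher.close();
    }
}
```

Even this toy example assumes familiarity with analyzers, field options, and query parsing, which supports the point that open source search adopters tend to bring their own expertise.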
More important, in my opinion, is that open source search and content processing must become more capable with regard to the three examples I identified above. A query run on incomplete or stale information is likely to undermine a decision at some point. All too often in my research, I encounter search results which do not reflect the most current information in digital form.
Search is going to climb Big Data Mountain. Aren’t specialized tools and systems needed?
I am okay with improved installation and deployment, but vendors will have to work overtime to beat the Google Search Appliance or Blossom Software’s hosted solution for speedy, painless, search-today roll-outs. In short, the Lucid write-up narrows the problem with “open source search” in the enterprise to installation. Several firms offer robust yet different approaches to open source search; for instance:
- Digital Reasoning. The company has proprietary software, but it has teamed with other firms so that it can deliver a comprehensive solution to massive flows of content in a near zero latency implementation. Yep, near zero latency. Installation is part of a search solution, but it is only the starting point. The end point is making sure the outputs are usable and reflect the freshest possible data in a form appropriate to the user of the system.
- FLAX. This open source solution delivers a range of features. Installation is reliable, and dozens of organizations have found the FLAX approach flexible, scalable, and cost effective. Its open source nature eliminates the “license handcuffs” imposed by some commercial search system vendors, which limit an organization’s degrees of freedom.
- SearchBlox. This firm’s approach seems quite similar to the one described in the Fierce article. Amazon highlights the company’s use of Amazon Web Services in a case study. We are not sure how much revenue SearchBlox is generating, but the company says it is “a leading provider of enterprise search solutions based on Apache Lucene. Over 300 customers in 30 countries use SearchBlox to power their website, intranet and custom search. SearchBlox Software, Inc. was founded in 2003 with the aim to develop commercial search products based on Apache Lucene.”
One negative in the open source search world is that Tesuji, an open source search vendor in Hungary, is repositioning its operations. Despite the strong interest in open source search, the Tesuji approach did not gain sufficient traction.
Our view is that open source search is a viable option for many organizations with the appetite for a solution that sidesteps license fees and permits by-the-drink technical support. However, most of today’s search solutions have far more significant challenges to address. One may argue with the price tags paid for Fast Search & Transfer, Exalead, and Autonomy, but there is a reason for them. IBM has embraced open source search and wrapped it with its particular array of add-ons and special services. The problem with open source search, as well as with other search solutions, is that once deployed the systems exhibit the latency problem, the connector glitch problem, and the single point of access problem.
My view is that promoting a single attribute like easy installation echoes Oracle’s “secure enterprise search” marketing approach: a single factor is presented as a way to capture market interest. We think that the issues associated with content acquisition, system latency, index updates, interface, and a single point of access to needed information are ultimately more important than installation hassles or even the open source idea itself. I will explore open source search in my next Online Magazine column. The connector issue is of particular interest to me, almost as important as the latency challenge. After all, who wants to run a query and get results which do not include the most current information?
Stephen E Arnold, September 30, 2011
Sponsored by Pandia.com