ElasticSearch: Was Google Right about Simplicity?

November 13, 2012

When the Google Search Appliance became available nine or 10 years ago, I was the victim of a Google briefing. The eager Googler showed me the functions of the original Google Search Appliance. I was not impressed. As I wrote in The Google Legacy, the GSA was a “good start” and showed promise.

But one thing jumped out at me. Google’s product planners had identified the key weakness, or maybe “flaw,” in most of the enterprise search solutions available a decade ago: complexity. No single Googler could install Autonomy, Endeca, Fast Search & Transfer, or Convera without help from the company. Once the system was up and running, not even a Googler could tune the system, perform reliable hit boosting, or troubleshoot indexers which could not update. Not surprisingly, most of the flagship enterprise search systems ran up big bills for the licensees. One vendor went down in flames because there were not enough engineers to keep the paying customers happy. So ended an era of complexity with the Google Search Appliance.

I may have been wrong.

I just read “Indexing BigData with ElasticSearch.” If you are not familiar with ElasticSearch (formerly Compass), think about the Compass search engine and the dozens of companies surfing on Lucene/Solr to get in the search game. Even IBM uses Lucene/Solr to slash development costs and free up expensive engineers for more value-added work like the wrappers that allow Watson to win a TV game show. I have completed an analysis of 13 open source search vendors for IDC, and some of these profiles are available for only $3,500 each. See http://www.idc.com/getdoc.jsp?containerId=236511 for an example.

Is your search system as easy to learn to ride as a Big Wheel toy? If not, there may be some scrapes and risks ahead. In today’s business climate, who wants to incur additional risks or costs in pursuit of a short cut only a developer can appreciate? Not me or the CFOs I know. A happy quack to http://www.bigwheeltricycle.net/ for this image.

The write up explains how to perform Big Data indexing with ElasticSearch. I urge you to read the write up. Consider this key passage:

The solution finally appeared in the name of ElasticSearch, an open-source Java based full text indexing system, based on the also open-source Apache Lucene engine, that allows you to query and explore your data set as you collect it. It was the ideal solution for us, as doing BigData analysis requires a distributed architecture.

Sounds good. With a fresh $10 million in funding, ElasticSearch seems poised to revolutionize the world of enterprise search, big data, and probably business intelligence, search based applications, and unified information access. Why not? Most open source vendors exercise considerable license in an effort to differentiate themselves from next-generation solutions such as CyberTap, Digital Reasoning, and others pushing the envelope of findability technology.
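
For readers who have not poked at ElasticSearch, the “query and explore your data set as you collect it” idea boils down to a small REST interface. Here is a minimal sketch, assuming a single ElasticSearch node on the default port 9200; the index name, field names, and log record are illustrative, not taken from the write up:

    import json
    import urllib.request

    ES = "http://localhost:9200"

    def post(url, payload):
        # Send a JSON body to ElasticSearch and return the parsed response.
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    # Index a record the moment it is collected; ElasticSearch assigns an id.
    doc = {"host": "web-01", "status": 500, "message": "timeout talking to backend"}
    print(post(ES + "/logs/event", doc))

    # Query the same data moments later, no batch rebuild required.
    query = {"query": {"query_string": {"query": "status:500 AND timeout"}}}
    print(post(ES + "/logs/event/_search", query))

No schema wrangling and no consultant on site. That, more than the funding, is the pitch.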


LucidWorks Solr 4 Training

November 9, 2012

I learned yesterday that LucidWorks will host a one-day intensive Solr training session. The full-day session covers what’s new in Solr 4.0, including a functional overview and deep dive into SolrCloud, followed by an expert panel discussion and open lab/workshop. LucidWorks is the leader in enterprise open source search solutions. The company’s technology, engineering team, and customer service set the company apart.

There will be a Boot Camp training event in Reston, Virginia, on November 14, 2012. Join Erik Hatcher and Erick Erickson to learn how Solr 4.0 dramatically improves scalability, performance, and flexibility. An overhauled Lucene underneath sports near real-time (NRT) capabilities, allowing indexed documents to be rapidly visible and searchable. Lucene’s improvements also include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage. These Lucene improvements automatically make Solr much better, and Solr magnifies these advances with “SolrCloud.”
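
To make the feature list a bit more concrete, here is a rough sketch of what near real-time indexing and the faster fuzzy querying look like from the client side. It assumes a stock Solr 4.0 instance with the example core on the default port 8983; the field names are illustrative and would need to match your schema:

    import json
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr/collection1"

    # Add a document with a soft commit, which makes it searchable almost
    # immediately without the cost of a full hard commit.
    docs = [{"id": "doc-1", "title": "Enterprise search appliance overview"}]
    req = urllib.request.Request(
        SOLR + "/update?softCommit=true&wt=json",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

    # Fuzzy query: "applience~1" tolerates one edit, so the typo still matches.
    params = urllib.parse.urlencode({"q": "title:applience~1", "wt": "json"})
    with urllib.request.urlopen(SOLR + "/select?" + params) as resp:
        print(json.loads(resp.read()))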

Paul Doscher, president of LucidWorks, told me:

Attendees will learn how to use SolrCloud to transform your existing Solr application into a highly scalable, fault tolerant solution with distributed indexing and search capabilities. The session will include demonstrations of SolrCloud in action. Some of the details covered will include configuring and tuning your own cluster. The presenters will detail how Solr 4.0 can be used as a NoSQL store, how to do near real time search at scale, and provide some tips and technical tips for maintaining a Solr cluster over the long term. This session will put the attendee on the path to becoming knowledgeable in SolrCloud configuration, scaling, monitoring and tuning. One of the highlights of the session is a review of the differences between the previous versions of Solr.

The training event features a breakfast in the morning and a happy hour after the session. You can sign up at http://goo.gl/voA7r. This strikes me as a must attend event.

Stephen E Arnold, November 9, 2012

Open Source Search: The Me Too Method Is Thriving

November 5, 2012

In the first three editions of The Enterprise Search Report (2003 to 2007), which my team and I wrote, we made it clear that the commercial enterprise search vendors were essentially a bunch of me-too services.

The diagrams for the various systems were almost indistinguishable. Some vendors used fancy names for their systems, and others stuck with the same nomenclature used in the SMART system. I pointed out that every enterprise search system has to perform certain basic functions: content acquisition, indexing, query processing, and administration. But once those building blocks were in place, most of the two dozen vendors I profiled added wrappers which created a “marketing differentiator.” Examples ranged from Autonomy’s emphasis on neuro-linguistic processing to Endeca’s metadata for facets to Vivisimo’s building a single results list from federated content.
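
Stripped of those wrappers, the common core is small. A toy sketch of the first three building blocks, content acquisition, indexing, and query processing, follows; real systems differ in scale, connectors, and plumbing, not in this basic shape:

    from collections import defaultdict

    # Content acquisition: in a real system this is crawlers and connectors;
    # here it is simply a dictionary of documents.
    documents = {
        1: "enterprise search is complex",
        2: "open source search is free",
    }

    # Indexing: build an inverted index mapping each term to the documents
    # that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    # Query processing: intersect the postings for each query term.
    def search(query):
        postings = [index[term] for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    print(search("enterprise search"))   # -> {1}

Administration, the fourth block, is everything that keeps the other three healthy: security, monitoring, and index maintenance.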


The rota fortunae of the medieval software licensee. A happy quack to http://www.artlex.com/ArtLex/Ch.html

The reality was that it was very difficult for the engineers and marketers of these commercial vendors to clearly differentiate their systems from dozens of look-alikes. With the consolidation of the commercial enterprise search sector in the last 36 months, the proprietary vendors have not changed the plumbing. What is new and interesting is that many of them are now “analytics,” “text mining,” or “business intelligence” vendors.

The High Cost of Re-Engineering

The key to this type of pivot is what I call “wrappers” or “add ins.” The idea is that an enterprise search system is similar to the old Ford and GM assembly lines of the 1970s. The cost for changing those systems was too high. The manufacturers operated them “as is”, hoping that chrome and options would give the automobiles a distinctive quality. Under the paint and slightly modified body panels, the cars were essentially the same old vehicle.

Commercial enterprise search solutions are similar today, and none has been overhauled or re-engineered in a significant way. That is okay. When a company licenses an enterprise search solution from Microsoft or Oracle, the customer is getting the brand and the security which comes from an established enterprise search vendor.

Let’s face it. The RECON or SDC Orbit system is usable without too much hassle by a high school student today. The precision and recall are in the 80 to 85 percent range. The US government has sponsored a text retrieval program for many years. The results of the tests are not widely circulated. However, I have heard that the precision and recall scores mostly stick in the 80 to 85 percent range. Once in a while a system will perform better, but search technology has, in my opinion, hit a glass ceiling. The commercial enterprise search sector is like the airline industry. The old business model is not working. The basic workhorse of the airline industry delivers the same performance as a jet from the 1970s. The big difference is that the costs keep on going up and passenger satisfaction is going down.
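
A quick aside on those two numbers, since they get tossed around loosely. Precision is the share of returned documents that are relevant; recall is the share of relevant documents that actually get returned. A toy calculation with made-up figures in that 80 to 85 percent neighborhood:

    # Hypothetical query: 100 documents in the collection are actually relevant.
    relevant_in_collection = 100

    # The system returns 100 hits, of which 83 are relevant.
    returned = 100
    relevant_returned = 83

    precision = relevant_returned / returned                 # 0.83
    recall = relevant_returned / relevant_in_collection      # 0.83

    print(f"precision={precision:.2f} recall={recall:.2f}")

Pushing either number much higher without dragging down the other is where the glass ceiling shows up.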

Open Source: Moving to Center Stage

But I am not interested in commercial enterprise search systems. The big news is the emergence of open source search options. Until recently, open source search was not mainstream. Today, open source search solutions are mainstream. IBM relies on Lucene/Solr for some of its search functions. IBM also owns Web Fountain, STAIRS, iPhrase, Vivisimo, and the SPSS Clementine technology, among others. IBM is interesting because it has used open source search technology to reduce costs and tap into a source of developer talent. Attivio, a company which just raised $42 million in additional venture funding, relies on open source search. You can bet your bippy that the investors want Attivio to turn a profit. I am not sure the financial types dive into the intricacies of open source search technology. Their focus is on the payoff from the money pumped into Attivio. Many other commercial content processing companies rely on open source search as well.

The interesting development is the emergence of pure play search vendors built entirely on the Lucene/Solr code. Anyone can download this “joined at the hip” software from the Apache Foundation. We have completed an analysis of a dozen of the most interesting open source search vendors for a big-time consulting firm. What struck the ArnoldIT research team was:

  1. The open source search vendors are following the same path as the commercial enterprise search vendors. The systems are pretty much indistinguishable.
  2. The marketing “battle” is being fought over technical nuances which are of great interest to developers and, in my opinion, almost irrelevant to the financial person who has to pay the bills.
  3. The significant differentiators among the dozen companies we analyzed boil down to the companies’ financial stability, full-time staff, value-adding proprietary enhancements, customer support, training, and engineering services.

What this means is that the actual functionality of these open source search systems is similar to the enterprise proprietary solutions. In the open source sector, some vendors specialize by providing search for a Big Data environment or for remediating the poor search system in MySQL and its variants. Other companies sell a platform and leave the Lucene/Solr component as a utility service. Others just take the Lucene/Solr and go forward.

The Business View

In a conversation with Paul Doscher, president of LucidWorks, I learned that his organization is working through the Project Management Committee (PMC) Group of the Lucene/Solr project within the Apache Software Foundation to build the next-generation search technology. The effort is to help transform people’s ability to turn data into decision making information.

This next-generation search technology is foundational in developing a big data technology stack to enable enterprises to reap the rewards of the latest wave of innovation.

The key point is that figuring out which open source search system does what is now as confusing and time consuming as figuring out the difference between the proprietary enterprise search systems was 10 years ago.

Will there be a fix for me-toos in enterprise search? I think that some technology will be similar and probably indistinguishable to non-experts. What is now raising the stakes is that search systems are viewed as utilities. Customers want answers, visualizations, and software which predicts what will happen. In my opinion, this is search with fuzzy dice, 20-inch chrome wheels, and a 200-watt sound system.

The key points of differentiation for me will remain the company’s financial stability, its staff quality, its customer service, its training programs, and its ability to provide engineering services to licensees who require additional help. In short, the differentiators may boil down to making systems pay off for licensees, not marketing assertions.

In the rush to cash in on organizations’ need to cut costs, open source search is now the “new” proprietary search solution. Buyer beware? More than ever. The Wheel of Fortune in search is spinning again. Who will be a winner? Who will be a loser? Place your bets. I am betting on open source search vendors with the service and engineering expertise to deliver.

Stephen E Arnold, November 5, 2012

The Decline of PCs and Search?

November 4, 2012

I worked through “The Slow Decline of PCs and the Fast Rise of Smartphones/Tablets Was Predicted in 1993.” The main point is that rocket scientist, cook, and patent expert Nathan P. Myhrvold anticipated the shift from desktop computers to more portable form factors. Years earlier I remember a person from Knight Ridder pitching a handheld gizmo which piggybacked on the Dynabook. When looking for accurate forecasts and precedents, those with access to a good library, commercial databases, and the Web can ferret up many examples of the Nostradamus approach to research. I am all for it. Too many people today do not do hands-on research. Any exercise of this skill is to be congratulated.

Here’s the main point of the write up in my opinion:

His memo is amazingly accurate. Note that his term “IHC” (Information Highway Computer) could be roughly equated with today’s smartphone or tablet device, connecting to the Internet via WiFi or a cellular network. In his second last paragraph, Myhrvold predicts the winners will be those who “own the software standards on IHCs” which could be roughly equated with today’s app stores, such as those on iOS (Apple), Android (Google, Amazon) and Windows 8 (Microsoft). The only thing you could say he possibly didn’t foresee would be the importance of hardware design in the new smartphone and tablet industry.

Let’s assume that Mr. Myhrvold was functioning in “I Dream of Jeannie” mode. Now let’s take that notion of a big change coming quickly and apply it to search. My view is that traditional key word search was there and then—poof—without a twitch of the soothsayer’s nose, search was gone.

Look at what exists today:

  1. Free search which can be downloaded from more than a dozen pretty reliable vendors plus the Apache Foundation. Install the code and you have state-of-the-art search, facets, etc.
  2. Business intelligence. This is search with grafted-on analytics. I think of this as Frankensearch, but I am old and live in rural Kentucky. What do you expect?
  3. Content processing. This is data management with some search functions and a bunch of parsing and tagging. Indexing is good, but the cost of humans is too high for many government intelligence organizations. So automation is the future.
  4. Predictive search. This is the Google angle. You don’t need to do anything, including think too much. The system does the tireless nanny job.

So is search in demise mode? Yep. Did anyone predict it? I would wager one thin dime that any number of azure chip consultants will have documents in their archive which show that the death of search was indeed predicted. One big outfit killed a “magic carpet tile” showing the search industry and then brought it back.

So search is not dead. Maybe it was Mark Twain who said, “The reports of my death have been greatly exaggerated.” Just like PCs, mainframes, and key word search?

Stephen E Arnold, November 4, 2012

The Fragmentation of Content Analytics

October 29, 2012

I am in the midst of finalizing a series of Search Wizards Speak interviews with founders or chief technology officers of some interesting analytics vendors. Add to this work the briefings I have attended in the last two weeks. Toss in a conference which presented a fruit bowl of advanced technologies which read, understand, parse, count, track, analyze, and predict who will do what next.

Wow.

From a distance, the analytics vendors look the same. Up close, each is distinct. Pick up the wrong shard and a cut finger or worse may result.

A happy quack to www.thegreenlivingexpert.com

Who would have thought that virtually every company engaged in indexing would morph into next-generation, Euler-crazed, Gauss-loving number crunchers? If the names Euler and Gauss do not resonate with you, you are in for tough sledding in 2013. Math speak is the name of the game.

There are three very good reasons for repackaging Vivisimo as a big data and analytics player. I choose Vivisimo because I have used it as an example of IBM’s public relations mastery. The company developed a deduplication feature which was and is, I assume, pretty darned good. Then Vivisimo became a federated search system, nosing into territory staked out by Deep Web Technologies. Finally, when IBM bought Vivisimo for about $20 million, the reason was big data and similarly bright, sparkling marketing lingo. I wanted to mention Hewlett Packard’s recent touting of Autonomy as an analytics vendor or Oracle’s push to make Endeca a business analytics giant. But IBM gets the nod. Heck, it is a $100 billion a year outfit. It can define an acquisition any way it wishes. I am okay with that.


The Google Search Appliance Adds Bells and Whistles

October 18, 2012

A version of this article appears on the www.citizentekk.com Web site.

The Google Search Appliance is getting along in years. A couple of weeks ago (October 2012), Google announced that Version 7.0 of the Google Search Appliance GB-7007 and the GB-9009 was available. The features of the new system are long overdue in my opinion. Among the new features are two highly desirable enhancements: better security controls and faceted browsing. But the killer feature, in my opinion, is support of the Google Translate application programming interface.

Microsoft will have to differentiate the now aging SharePoint Search 2013 from a Google Search Appliance. Why? GSA Version 7 can be plugged into a SharePoint environment, and the system will, without much fuss, index the SharePoint content. Plug and play is not what SharePoint Search 2013 delivers. The fast deployment of a GSA remains one of its killer features. Simplicity and ease of use are important. When one adds Google magic, the GSA Version 7 can be another thrust at Microsoft’s enterprise business.

See http://www.bluepoint.net.au/google-search/gsa-product-model

Google has examined competitive search solutions and, in my opinion, made some good decisions. For example, a user may add a comment to a record displayed in a results list. The idea of allowing enterprise users to add value to a record was a popular feature of Vivisimo Velocity. But since IBM acquired Vivisimo, that company has trotted down the big data trail.

Endeca has for more than 12 years offered licensees of its systems point-and-click navigation. An Endeca search solution can slash the time it takes for a user to pinpoint content related to a query. Google has made the GSA more Endeca-like while retaining the simplified deployment which characterizes an appliance solution.

As I mentioned in the introduction, one of the most compelling features of the Version 7 GSAs is direct support for Google Translate. Organizations increasingly deal with mixed language documents. Product and market research will benefit from Google’s deep support of languages. At last count, Google Translate supported more than 60 languages, excluding Latin and Pig Latin. Now Google is accelerating its language support due to its scale and data sets. Coupled with Google’s smart software, the language feature may be tough for other vendors to match.
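
The GSA wires translation in behind the scenes, but the underlying service is also reachable directly through the Translate API. For flavor, here is a rough sketch of that kind of call using the public v2 REST endpoint; the API key and the text are placeholders, and this is not how the appliance itself is configured:

    import json
    import urllib.parse
    import urllib.request

    API_KEY = "YOUR_API_KEY"  # placeholder; issued via the Google APIs console

    params = urllib.parse.urlencode({
        "key": API_KEY,
        "q": "Der Bericht ist fertig.",
        "source": "de",
        "target": "en",
    })

    url = "https://www.googleapis.com/language/translate/v2?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read())

    # The translated text sits under data -> translations in the response.
    print(data["data"]["translations"][0]["translatedText"])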

Enterprise searchers want to be able to examine a document quickly. To meet this need, Google has implemented in-line document preview. A user can click on a hit and see a rendering of the document without having to launch the native applications. A PDF in a results list appears without waiting the seconds it takes for Adobe Reader or FoxIt to fetch and display the document.

What’s not to like? The GSA GB-7007 and GB-9009 deliver most of the most-wanted features to make content searchable regardless of source. If a proprietary file type must be indexed, Google provides developers with enough information to get the content into a form which the GSA can process. Failing that, Google partners and third-party vendors can deliver specialized connectors quickly.


How a Sitemap Can Enhance a Web Presence

October 17, 2012

Business2Community covers the importance of Web site indexing in its piece, “How To Build Your Own Sitemap in Five Minutes, and Why You Need To.” In it, the author discusses different processes for creating an effective sitemap. He begins with an introduction:

When you ask most business owners and beginning online marketers what a ‘Sitemap’ is, you usually get two responses. ‘What’s that?’ or ‘That’s just too complicated for us.’ Sitemaps for your website aren’t impossible to make, and they certainly aren’t a waste of time. To understand why you need to make your own Sitemap today, you need to understand what they are and how they work.
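
The format itself is refreshingly small: an XML file listing the pages a crawler should know about, per the sitemaps.org protocol. A minimal sketch that writes one by hand; the URLs are placeholders:

    from datetime import date

    # A handful of pages to expose to crawlers; placeholders, not a real site.
    pages = ["http://www.example.com/", "http://www.example.com/about"]

    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url in pages:
        lines.append("  <url>")
        lines.append("    <loc>" + url + "</loc>")
        lines.append("    <lastmod>" + date.today().isoformat() + "</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")

    with open("sitemap.xml", "w") as handle:
        handle.write("\n".join(lines))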

The author then goes on to recommend various tools and techniques for effectively creating a sitemap. However, there are other solutions that not only automatically generate sitemaps, but also automatically crawl and index any organization’s site in order to enable effective Web site search. One highly awarded option is Fabasoft Mindbreeze InSite. Fabasoft Mindbreeze takes the guesswork out of indexing and mapping, delivering solid results with little effort. Explore how Fabasoft Mindbreeze might enhance your organization’s online presence today.

Emily Rae Aldridge, October 17, 2012

Sponsored by ArnoldIT.com, developer of Augmentext.

Automation to Cure Duplicate Content Issues

October 15, 2012

Search Engine Land is shining a light on a common Web site search problem: duplicate content issues. Read the full report in “An Automated Tool To Eliminate Duplicate Content Issues.”

The author begins:

BloomReach announced a new software product named Dynamic Duplication Reduction (DDR) that aims to eliminate duplicate content issues on web sites.  Typically, software tools are known to cause duplicate content issues but this tool promises to reverse it.  The tool deeply crawls your web pages and continuously interprets all content on a site. It will automatically discover and act on duplicate pages.
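
The announcement does not spell out how DDR works internally. For flavor only, one generic way to catch exact duplicates is to normalize each page’s text, hash it, and group pages whose hashes collide. The sketch below is a toy illustration of that idea, not BloomReach’s method:

    import hashlib
    from collections import defaultdict

    # Placeholder pages; a real crawler would fetch and strip the HTML first.
    pages = {
        "/widgets": "Blue widget, best price, free shipping.",
        "/widgets?ref=email": "Blue widget, best price,  free shipping.",
        "/gadgets": "Red gadget, new for 2012.",
    }

    groups = defaultdict(list)
    for url, text in pages.items():
        # Normalize whitespace and case so trivial differences do not hide duplicates.
        normalized = " ".join(text.lower().split())
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)

    duplicates = [urls for urls in groups.values() if len(urls) > 1]
    print(duplicates)   # -> [['/widgets', '/widgets?ref=email']]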

They say an ounce of prevention is worth a pound of cure, and in this case the prevention needed is effective Web site indexing. Fabasoft Mindbreeze InSite quickly crawls and indexes all Web site content, delivering search results based on relevancy. Misspellings are even corrected with InSite, and duplication is prevented. Fabasoft Mindbreeze is a longstanding leader in third party solutions for the enterprise. InSite is quickly becoming the icing on the cake of this industry leader.

Emily Rae Aldridge, October 15, 2012

Sponsored by ArnoldIT.com, developer of Augmentext.

Get A Comprehensive Search Strategy Plan from Aspire

October 12, 2012

People tend to doubt the power of a good search application. They take it for granted that all out-of-the-box and Internet search engines are as accurate as Google (arguably the most powerful in the public eye). The truth of the matter is most businesses are losing productivity because they have not harnessed the true potential of search. Search Technologies, a leading IT company that specializes in search engine implementation, managed services, and consulting, is the innovator behind Aspire:

“Aspire is a powerful framework and application platform for acquiring both structured and unstructured data from just about any content source, processing / enriching that content, and then publishing it to the search engine or business analytics tool of your choice.”

Aspire uses a built-in indexing pipeline and proprietary code maintained to Search Technologies’ high standards. It is based on Apache Felix, the leading open source implementation of the OSGi standard. OSGi is built for Java and supported by IT companies worldwide. Aspire can gather documents from a variety of resources, including relational databases, SharePoint, file systems, and many more. The metadata is captured and can then be enriched, combined, reformatted, or normalized to whatever the business needs before it is submitted to search engines, document repositories, or business analytics applications. Aspire performs content processing that cleans and repackages data for findability.

“Almost all structured data is originally created in a tightly controlled or automated way.

By contrast, unstructured content is created interactively by individual people, and is infinitely variable in its format, style, quality and structure.  Because of this, content processing techniques that were originally developed to work with structured data simply cannot cope with the unpredictability and variability of unstructured content.”

By implementing a content processing application like Aspire, unstructured content is “scrubbed,” then enriched, for better search results. Most commercial search engines do not have the same filters that separate relevant content from the bad. The results displayed to the user are thus of poor quality and of little to no use. They try to resolve the problem with custom coding and updates for every new data source that pops up, which is tedious. Aspire fixes these tired coding problems by using automated metadata extraction and manipulation outside the search engine.
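
To make the “scrub, then enrich” idea concrete, here is a toy sketch of the sort of transformation a content processing stage performs before a document reaches the search engine. The field names and the enrichment rule are illustrative and have nothing to do with Aspire’s actual components:

    import re

    def enrich(record):
        # Normalize fields pulled from different repositories into one shape.
        doc = {
            "title": record.get("Title") or record.get("subject") or "Untitled",
            "author": (record.get("Author") or "unknown").strip().lower(),
            "body": re.sub(r"\s+", " ", record.get("body", "")).strip(),
        }
        # Simple enrichment: pull out years mentioned in the body as a facetable field.
        doc["years"] = sorted(set(re.findall(r"\b(?:19|20)\d{2}\b", doc["body"])))
        return doc

    raw = {"subject": "Quarterly review", "Author": " J. Smith ",
           "body": "Results for 2011  and   2012 are attached."}
    print(enrich(raw))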

As powerful as commercial search engines are, they can often lack the refined quality one gets from a robust ISV. Aspire does not follow the same search technology path as its competitors; rather, Search Technologies has designed a new, original solution to provide its clients with a comprehensive search strategy plan to help improve productivity, organization, and data management.

Remember: Search Technologies is sponsoring a meetup at the October 2012 Enterprise Search Summit. More information is available at http://www.meetup.com/DC-Metro-Enterprise-Search-Network/

Iain Fletcher, October 12, 2012

Sponsored by ArnoldIT.com, developer of Augmentext

Salesforce Incorporates Coveo Enterprise Search

September 22, 2012

ITWorldCanada announces, “Coveo Brings Enterprise Search to Salesforce.com.” The Canadian company will contribute its indexing engine and business intelligence tools to the Salesforce.com cloud. Coveo for Salesforce, which can pull together, index, and analyze unstructured data from multiple sources, will be fully integrated into the popular online customer relationship management (CRM) platform.

The write up tells us:

“Louis Tetu, CEO of Coveo, said the product  is the first tool of its kind that is integrated directly into Salesforce. ‘We are enabling an entirely new paradigm to federate information on demand,’ he said. ‘And that paradigm means that we don’t have to move data, we’re just pointing…secure indexes to that information.’

“Users of the technology that need information delivered in real-time, such as customer-facing companies, will be able to get it rapidly — within 100 milliseconds —  he added. This will help solve the common problem of consumers dealing with contact centres that cannot pull up their information in a reasonable period of time.”

Yes, that is a real plus. Tetu went on to emphasize that this is no small development: his company has conquered the considerable challenges of operating securely in the cloud. He mentions they also make a special effort to ensure new users can dive in as easily as possible.

Coveo was founded in 2005 by some members of the team which developed Copernic Desktop Search. Coveo takes pride in solutions that are agile and easy to use yet scalable, fast, and efficient.

Cynthia Murrell, September 22, 2012

Sponsored by ArnoldIT.com, developer of Augmentext
