Why Enterprise Search Fails

July 12, 2016

I participated in a telephone call before the US holiday break. The subject was the likelihood that a potential investment in an enterprise search technology would be a winner. I listened for most of the 60-minute call. I offered a brief example of the over-promise and under-deliver problems which plagued Convera and Fast Search & Transfer, and several of the people on the call asked, “What’s a Convera?” I knew then that today’s whiz kids are essentially reinventing the wheel.

I wanted to capture three ideas which I jotted down during that call. My thought is that at some future time, a person wanting to understand the incredible failures that enterprise search vendors have tallied will have three observations to consider.

No background is necessary. You don’t need to read about throwing rocks at the Google bus, search engine optimization, or any of the craziness about search making Big Data a little pussycat.

Enterprise Search: Does a Couple of Things Well When Users Expect Much More

Enterprise search systems ship with filters or widgets which convert source text into a format that the content processing module can index. The problem is that images, videos, audio files, content from wonky legacy systems, and proprietary file formats like IBM i2’s ANB files do not lend themselves to indexing by a standard enterprise search system. The buyers or licensees of the enterprise search system do not understand the one-trick-pony nature of text retrieval. When the system is deployed, therefore, consternation follows confusion when content is not “in” the enterprise search system and so cannot be found. There are systems which can deal with a wide range of content, but these systems are marketed in a different way and often cost millions of dollars a year to set up, maintain, and operate.
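To make the one-trick-pony point concrete, here is a minimal sketch of the filter dispatch at the heart of a typical indexing pipeline. The handlers and formats are invented for illustration; real products ship dozens of filters, but the failure mode is the same:

```python
def read_plain(path):
    """Stand-in for a real format filter which extracts indexable text."""
    with open(path, errors="ignore") as f:
        return f.read()

# Known text formats get a filter; everything else never enters the index.
TEXT_FILTERS = {
    ".txt": read_plain,
    ".csv": read_plain,
    # No entries for .mp4 video, .wav audio, or IBM i2 .anb chart files.
}

def index_document(path, index):
    ext = path[path.rfind("."):].lower() if "." in path else ""
    handler = TEXT_FILTERS.get(ext)
    if handler is None:
        return False  # the document simply is not "in" the system
    index[path] = handler(path)
    return True
```

No filter, no text; no text, no findability. That is the gap between what the licensee expects and what the system delivers.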


Net net: Vendors do not explain the limitations of text search. Licensees do not take the time or have the desire to understand what an enterprise search system can actually do. Marketers obfuscate in order to close the deal. Failure is a natural consequence.

Data Management Needed

The disconnect boils down to what digital information the licensee wants to search. Once that universe is defined, the system into which the data will be placed must be resolved. No data management, no enterprise search. The reason is that licensees and the users of an enterprise search system assume that “all” or “everything” – web content, email, even outputs from an AS/400 Ironside – is available any time. Baloney. Few organizations have the expertise or the appetite to figure out what is where, how much there is, how frequently each type of data changes, and the formats used. I can hear you saying, “Hey, we know what we have and what we need. We don’t need a stupid, time consuming, expensive inventory.” There you go. Failure is a distinct possibility.
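For those who do want the stupid, time consuming, expensive inventory, the first pass can be as unglamorous as the sketch below: walk the file store and tally what is where, how much, and how fresh. (A real inventory also has to cover databases, email systems, and cloud repositories, which this toy does not.)

```python
import os
from collections import Counter
from datetime import datetime, timezone

def inventory(root):
    """Tally formats, total bytes, and the newest modification time
    under a directory tree: the raw material for a data inventory."""
    formats, bytes_total, newest = Counter(), 0, None
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable files are themselves a finding
            formats[ext] += 1
            bytes_total += st.st_size
            mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
            newest = mtime if newest is None else max(newest, mtime)
    return formats, bytes_total, newest
```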


Net net: Hope springs eternal. When problems arise, few know what’s where, who’s on first, and why I Don’t Know is on third.


Enterprise Search Is Stuck in the Past

July 4, 2016

Enterprise search is one of the driving forces behind an enterprise system because the entire purpose of the system is to encourage collaboration and help users find information quickly. While enterprise search is an essential tool, according to Computer Weekly’s article “Beyond Keywords: Bringing Initiative To Enterprise Search,” the feature is stuck in the past.

Enterprise search is due for an upgrade. The amount of enterprise data has increased, but the underlying information management system remains the same. Structured data is easy to fit into the standard information management system; it is the unstructured data, however, that holds the most valuable information. Unstructured information is hard to categorize, but natural language processing is being used to add context. Ontotext combined natural language processing with a graph database, allowing the content indexing to make more nuanced decisions.

We need to level up the basic keyword searching to something more in-depth:

“Search for most organisations is limited: enterprises are forced to play ‘keyword bingo’, rephrasing their question multiple times until they land on what gets them to their answer. The technologies we’ve been exploring can alleviate this problem by not stopping at capturing the keywords, but by capturing the meaning behind the keywords, labeling the keywords into different categories, entities or types, and linking them together and inferring new relationships.”

In other words, enterprise search needs the addition of semantic search in order to add context to the keywords. A basic keyword search returns every result that matches the keyword phrase, but a context-driven search infers the intent behind the keyword phrases. This is really not anything new when it comes to enterprise or any other kind of search. Semantic search is context-driven search.
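To illustrate the difference, consider a toy example (the concept table is invented): a semantic layer maps surface terms to canonical concepts before matching, so a query and a document can meet even when they share no keywords.

```python
# Invented micro-vocabulary mapping surface terms to canonical concepts.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "sedan": "vehicle",
    "attorney": "lawyer", "counsel": "lawyer",
}

def keyword_match(doc, query):
    """Literal matching: the query string must appear in the document."""
    return query.lower() in doc.lower()

def semantic_match(doc, query):
    """Concept matching: terms are canonicalized before comparison."""
    doc_concepts = {CONCEPTS.get(t.strip(".,"), t.strip(".,"))
                    for t in doc.lower().split()}
    return CONCEPTS.get(query.lower(), query.lower()) in doc_concepts

doc = "The automobile was recovered by counsel."
print(keyword_match(doc, "car"))   # False: no literal match
print(semantic_match(doc, "car"))  # True: both map to "vehicle"
```

Production systems like Ontotext’s do far more, labeling entities and inferring relationships in a graph, but the principle is the same: match meaning, not strings.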

 

Whitney Grace,  July 4, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Amazon AWS Jungle Snares Some Elasticsearch Functions

July 1, 2016

Elastic’s Elasticsearch has become one of the go-to open source search and retrieval solutions. Based on Lucene, the system has put the heat on some of the other open source-centric search vendors. However, search is a tricky beastie.

Navigate to “AWS Elasticsearch Service Woes” to get a glimpse of some of the snags which can poke holes in one’s ripstop hiking garb. The problems are not surprising. One does not know what issues will arise until a search system is deployed and the lucky users are banging away with their queries or a happy administrator discovers that Button A no longer works.

The write up states:

We kept coming across OOM issues due the JVMMemoryPresure spiking and inturn the ES service kept crapping out. Aside from some optimization work, we’d more than likely have to add more boxes/resources to the cluster which then means more things to manage. This is when we thought, “Hey, AWS have a service for this right? Let’s give that a crack?!”. As great as having it as a service is, it certainly comes with some fairly irritating pitfalls which then causes you to approach the situation from a different angle.

One approach is to use index templates to deal with shard management in AWS Elasticsearch. Sample templates are provided in the write up. The fix does not address every issue. The article also provides a link to a reindexing tool called es-tool.
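For readers who have not wrangled one, an index template pins down settings such as shard and replica counts for any new index whose name matches a pattern. Here is a minimal sketch using Python’s requests library; the endpoint, pattern, and sizing are placeholders (and the domain’s access policy must allow the request), so crib from the samples in the write up rather than from me:

```python
import requests

# Placeholder index pattern and sizing; tune to your own data volumes.
template = {
    "template": "logs-*",        # applies to any new index matching the pattern
    "settings": {
        "number_of_shards": 3,   # fixed at index creation, so set it up front
        "number_of_replicas": 1,
    },
}

# Placeholder AWS Elasticsearch domain endpoint.
endpoint = "https://search-mydomain.us-east-1.es.amazonaws.com"
resp = requests.put(endpoint + "/_template/logs_template", json=template)
print(resp.status_code, resp.text)
```

Because shard counts cannot be changed on a live index, getting them right in a template is far cheaper than reindexing after the JVM starts gasping.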

The most interesting comment in the article in my opinion is:

In hindsight I think it may have been worth potentially sticking with and fleshing out the old implementation of Elasticsearch, instead of having to fudge various things with the AWS ES service. On the other hand it has relieved some of the operational overhead, and in terms of scaling I am literally a couple of clicks away. If you have large amounts of data you pump into Elasticsearch and you require granular control, AWS ES is not the solution for you. However if you need a quick and simple Elasticsearch and Kibana solution, then look no further.

My takeaway is to do some thinking about the strengths and weaknesses of Amazon AWS before chopping through the Bezos cloud jungle.

Stephen E Arnold, July 1, 2016

Google Search: Retrievers Lose. Smart Software Wins

June 28, 2016

I scanned a number of write ups about Google’s embrace of machine learning and smart software. I supplement my Google queries with the results of other systems. Some of these have their own index; for example, Yandex.ru and Exalead. Others are metasearch engines which suck in results and do some post processing to help answer the users’ questions. Others are disappointing, and I check them out when I have a client who is willing to pay for stone flipping; for example, DuckDuckGo, iSeek, or the estimable Qwant. (I love quirky spelling too.)

I read “RankBrain Third Most Important Factor Determining Google Search Results.” Here’s the quote I noted:

Google is characteristically fuzzy on exactly how it improves search (something to do with the long tail? Better interpretation of ambiguous requests?) but Jeff Dean [former AltaVista wizard] says that RankBrain is “involved in every query,” and affects the actual rankings “probably not in every query but in a lot of queries.” What’s more, it’s hugely effective. Of the hundreds of “signals” Google search uses when it calculates its rankings (a signal might be the user’s geographical location, or whether the headline on a page matches the text in the query), RankBrain is now rated as the third most useful. “It was significant to the company that we were successful in making search better with machine learning,” says John Giannandrea. “That caused a lot of people to pay attention.” Pedro Domingos, the University of Washington professor who wrote The Master Algorithm, puts it a different way: “There was always this battle between the retrievers and the machine learning people,” he says. “The machine learners have finally won the battle.”

I have noticed in the last year that I am unable to locate certain documents when I use the words and phrases which had served me well before smart software became the cat’s pajamas.

One recent example was my need to locate a case example about a German policeman’s trials and tribulations with the Dark Web. When I first located this document, I was trying to verify an anecdote shared with me after one of my intelligence community lectures.

I had the document in my file, and I pulled it up on my monitor. The document in question is the work of an outfit and person labeled “Lars Hilse.” The title of the write up is “Dark Web & Bitcoin: Global Terrorism Threat Assessment.” The document was published in April 2013 with an update issued in November 2013. (That document was the source of, or maybe confirmed, the anecdote about the German policeman and his Dark Web research.)

For my amusement, I wondered if I could use the new and improved Google Web search to locate the document. I displayed section 4.8 on my screen. The heading of the section is “Extortion (of Law Enforcement Personnel).”

I entered the phrase into Google without quotes. Here’s the first page of results:

[Screenshot: first page of Google results for the unquoted phrase]

None of the hits points to the document containing the five-word phrase.
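A plausible, much simplified explanation (the toy code below is mine, not Google’s method): an unquoted query is treated as a bag of terms to be matched and reweighted by signals such as RankBrain, not as a contiguous phrase, so pages which never contain the exact wording can crowd out the one that does.

```python
def bag_of_words_match(doc, query):
    """Loose matching: every query term appears somewhere in the document."""
    doc_terms = set(doc.lower().replace(".", " ").split())
    return all(term in doc_terms for term in query.lower().split())

def phrase_match(doc, query):
    """Strict matching: the exact phrase appears contiguously."""
    return query.lower() in doc.lower()

doc = "Reports of extortion now worry law enforcement personnel in Germany."
query = "extortion of law enforcement personnel"

print(bag_of_words_match(doc, query))  # True: all five terms occur somewhere
print(phrase_match(doc, query))        # False: the phrase never appears intact
```

Wrapping the phrase in quotation marks pushes Google toward the strict behavior; my experiment, as noted, ran without quotes.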


Enterprise Search Vendor Sinequa Partners with MapR

June 8, 2016

In the world of enterprise search and analytics, everyone wants in on the clients who have flocked to Hadoop for data storage. Virtual Strategy shared an article announcing “Sinequa Collaborates With MapR to Power Real-Time Big Data Search and Analytics on Hadoop.” Sinequa, a firm specializing in big data, has been certified on the MapR Converged Data Platform. The interoperation of Sinequa’s solutions with MapR will enable actionable information to be gleaned from data stored in Hadoop. We learned:

“By leveraging advanced natural language processing along with universal structured and unstructured data indexing, Sinequa’s platform enables customers to embark on ambitious Big Data projects, achieve critical in-depth content analytics and establish an extremely agile development environment for Search Based Applications (SBA). Global enterprises, including Airbus, AstraZeneca, Atos, Biogen, ENGIE, Total and Siemens have all trusted Sinequa for the guidance and collaboration to harness Big Data to find relevant insight to move business forward.”

Beyond all the enterprise search jargon in this article, the collaboration between Sinequa and MapR appears to offer an upgraded service to customers. As we all know at this point, unstructured data indexing is key to data intake. When it comes to output, however, the solutions that matter will be those that support informed business decisions.

 

Megan Feil, June 8, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Speculation About Beyond Search

June 2, 2016

If you are curious to learn more about the purveyor of the Beyond Search blog, you should check out Singularity’s interview “Stephen E Arnold On Search Engine And Intelligence Gathering.” By way of background, Arnold is a specialist in content processing, indexing, and online search, as well as the author of seven books and monographs. His past employment includes Booz, Allen & Hamilton (Edward Snowden was a contractor for this company), the Courier Journal & Louisville Times, and Halliburton Nuclear. He worked on the US government’s Threat Open Source Intelligence Service and developed the cost analysis, technical infrastructure, and security for FirstGov.gov.

Singularity’s interview covers a variety of topics and, of course, includes Arnold’s direct sense of humor:

“During our 90 min discussion with Stephen E. Arnold we cover a variety of interesting topics such as: why he calls himself lucky; how he got interested in computers in general and search engines in particular; his path from college to Halliburton Nuclear and Booze, Allen & Hamilton; content and web indexing; his who’s who list of clients; Beyond Search and the core of intelligence; his Google Trilogy – The Google Legacy (2005), Google Version 2.0 (2007), and Google: The Digital Gutenberg (2009); CyberOSINT and the Dark Web Notebook; the less-known but major players in search such as Recorded Future and Palantir; Big Brother and surveillance; personal ethics and Edward Snowden.”

When you listen to experts in certain fields, you always get a different perspective than the one the popular news outlets give. Arnold offers a unique take on search as well as the future of Internet security, especially the future of the Dark Web.

 

Whitney Grace, June 2, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Search Sink Hole Identified and Allegedly Paved and Converted to a Data Convenience Store

May 20, 2016

I try to avoid reading more than one write up a day about alleged revolutions in content processing and information analytics. My addled goose brain cannot cope with the endlessly recycled algorithms dressed up in Project Runway finery.

I read “Ryft: Bringing High Performance Analytics to Every Enterprise,” and I was pleased to see a couple of statements which resonated with my dim view of information access systems. There is an accompanying video in the write up. I, as you may know, gentle reader, am not into video. I prefer reading, which is the old-fashioned way to suck up useful factoids.

Here’s the first passage I highlighted:

Any search tool can match an exact query to structured data—but only after all of the data is indexed. What happens when there are variations? What if the data is unstructured and there’s no time for indexing? [Emphasis added]

The answer to the question is increasing costs for sales and marketing. The early warnings of amped-up baloney are the presentations given at conferences and pumped out via public relations firms. (No, Buffy, no, Trent, I am not interested in speaking with the visionary CEO who hired you.)

I also highlighted:

With the power to complete fuzzy search 600X faster at scale, Ryft has opened up tremendous new possibilities for data-driven advances in every industry.

I circled the 600X. Gentle reader, I struggle to comprehend a 600X increase in content processing speed. Dear Mother Google has invested in creating a new chip to get around the limitations of our friend von Neumann’s approach to executing instructions. I am not sure Mother Google has this nailed because Mother Google, like IBM, announces innovations without too much real world demonstration of the nifty “new” things.
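For context on why the claim raises my eyebrows, here is the naive baseline (my own toy code, not Ryft’s method): without an index, fuzzy matching means scanning and scoring every document on every query, so the cost grows with the corpus.

```python
from difflib import SequenceMatcher

def fuzzy_scan(corpus, query, threshold=0.8):
    """Brute force fuzzy matching: nothing to build up front, but every
    query pays to score every document in the corpus."""
    hits = []
    for doc_id, text in corpus.items():
        score = SequenceMatcher(None, query.lower(), text.lower()).ratio()
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

corpus = {1: "Fuzzy serch at scale", 2: "Exact search after indexing"}
print(fuzzy_scan(corpus, "fuzzy search at scale"))  # the typo still matches
```

An inverted index turns exact lookup into a cheap probe after an expensive build. Ryft’s pitch is fuzzy matching at probe-like speed with no build. That is the part I would want demonstrated on real data flows.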

I noted this statement too:

For the first time, you can conduct the most accurate fuzzy search and matching at the same speed as exact search without spending days or weeks indexing data.

Okay, this strikes me as a capability I would embrace if I could get over or around my skepticism. I was able to take a look at the “solution” which delivers the astounding performance and information access capability. Here’s an image from Ryft’s engineering professionals:

[Ryft architecture diagram]

Notice that we have Spark and pre-built components. I assume there are myriad other innovations at work.

The hitch in the git-along is that, in order to deal with certain real world information processing challenges, the inputs come from disparate systems, each generating substantial data flows in real time.

Here’s an example of a real world information access and understanding challenge, which, as far as I know, has not been solved in a cost effective, reliable, or usable manner.


Image source: Plugfest 2016 Unclassified.

This unclassified illustration makes clear that the little things in the sky pump out lots of data into operational theaters. Each stream of data must be normalized and then converted to actionable intelligence.

The assertion about 600X sounds tempting, but my hunch is that the latency in normalizing, transferring, and processing will not meet the need for real time, actionable, accurate outputs when someone is shooting at a person with a hardened laptop in a threat environment.

In short, perhaps the spark will ignite a fire of performance. But I have my doubts. Hey, that’s why I spend my time in rural Kentucky where reasonable people shoot squirrels with high power surplus military equipment.

Stephen E Arnold, May 20, 2016

Big Data and Value

May 19, 2016

I read “The Real Lesson for Data Science That is Demonstrated by Palantir’s Struggles · Simply Statistics.” I love write ups that plunk the word statistics near simple.

Here’s the passage I highlighted in money green:

… What is the value of data analysis?, and secondarily, how do you communicate that value?

I want to step away from the Palantir Technologies example and consider a broader spectrum of outfits tossing around the jargon “big data,” “analytics,” and synonyms for smart software. One doesn’t communicate value. One finds a person who needs a solution and crafts the message to close the deal.

When a company and its perceived technology catch the attention of allegedly informed buyers, a bandwagon effect kicks in. Talk inside an organization leads to mentions in internal meetings. The vendor whose products and services are the subject of these comments begins to hint at bigger and better things at conferences. Then a real journalist may catch a scent of “something happening” and write an article. Technical talks at niche conferences generate wonky articles, usually without the dates or footnotes which would make sense to someone without access to commercial databases. If a social media breeze whips up the smoldering interest, then a fire breaks out.

A start up should be so clever, lucky, or tactically gifted as to pull off this type of wildfire. But when it happens, big money chases the outfit. Once money flows, the company and its products and services become real.

The problem with companies processing a range of data is that there are some friction-inducing processes that are tough to coat with Teflon. These include:

  1. Taking different types of data, normalizing it, indexing it in a meaningful manner, and creating metadata which is accurate and timely (a sketch of this step follows the list).
  2. Converting numerical recipes, many with built-in threshold settings and chains of calculations, into marching band order able to produce recognizable outputs.
  3. Figuring out how to provide an infrastructure that can sort of keep pace with the flows of new data and the updates/corrections to the already processed data.
  4. Generating outputs that people in a hurry or in a hot zone can use to positive effect; for example, in a war zone, not getting killed when the visualization is not spot on.
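Item 1 alone hides most of the friction. Here is a minimal sketch of what “normalizing” means before any analytics can run; the source formats and field names are invented for illustration:

```python
from datetime import datetime, timezone

# Invented examples of two disparate source records.
email_record = {"From": "analyst@example.com",
                "Date": "19 May 2016 14:02:11 +0000", "Body": "..."}
sensor_record = {"src": "uav-7", "ts": 1463666531, "payload": "..."}

def normalize(record, kind):
    """Map disparate source records onto one schema and attach metadata."""
    if kind == "email":
        when = datetime.strptime(record["Date"], "%d %b %Y %H:%M:%S %z")
        origin, text = record["From"], record["Body"]
    elif kind == "sensor":
        when = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
        origin, text = record["src"], record["payload"]
    else:
        raise ValueError("no normalizer for " + kind)  # the exception pile
    return {
        "origin": origin,
        "timestamp": when.isoformat(),
        "text": text,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

print(normalize(email_record, "email")["timestamp"])
```

Every new source type adds another branch and another chance for the metadata to be neither accurate nor timely.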

The write up focuses on a single company and its alleged problems. That’s okay, but it understates the problem. Most content processing companies run out of revenue steam. The reason is that the licensees or customers want the systems to work better, faster, and more cheaply than predecessor or incumbent systems.

The vast majority of search and content processing systems are flawed, expensive to set up and maintain, and really difficult to use in a way that produces high reliability outputs over time. I would suggest that the problem bedevils a number of companies.

Some of those struggling with these issues are big names. Others are much smaller firms. What’s interesting to me is that the trajectory content processing companies follow is a well worn path. One can read about Autonomy, Convera, Endeca, Fast Search & Transfer, Verity, and dozens of other outfits and discern what’s going to happen. Here’s a summary for those who don’t want to work through the case studies on my Xenky intel site:

Stage 1: Early struggles and wild and crazy efforts to get big name clients

Stage 2: Making promises that are difficult to implement but which are essential to capture customers looking actively for a silver bullet

Stage 3: Frantic building and deployment accompanied with heroic exertions to keep the customers happy

Stage 4: Closing as many deals as possible either for additional financing or for licensing/consulting deals

Stage 5: The early customers start grousing and the momentum slows

Stage 6: Sell off the company or shut down like Delphes, Entopia, Siderean Software and dozens of others.

The problem is not technology, math, or Big Data. The force which undermines these types of outfits is the difficulty of making sense out of words and numbers. In my experience, the task is a very difficult one for humans and for software. Humans want to golf, cruise Facebook, emulate Amazon Echo, or, like water, find the path of least resistance.

Making sense out of information when someone is lobbing mortars at one is a problem which technology can only solve in a haphazard manner. Hope springs eternal and managers are known to buy or license a solution in the hopes that my view of the content processing world is dead wrong.

So far I am on the beam. Content processing requires time, humans, and a range of flawed tools which must be used by a person with old-fashioned human thought processes and procedures.

Value is in the eye of the beholder, not in zeros and ones.

Stephen E Arnold, May 19, 2016

Enterprise Search: The Valiant Fight On

May 17, 2016

I read “VirtualWorks and Language Tools Announce Merger.” I ran across Language Tools several years ago. The company was working to create components for Elasticsearch’s burgeoning user base. The firm espoused natural language processing as a core technology. NLP is useful, but it imposes some computational burdens on some content processing functions. Elasticsearch works pretty well, and there are a number of companies optimizing, integrating, and creating widgets to make life with Elasticsearch better, faster, and presumably more impressive than the open source system is.

This news release highlights the fact that VirtualWorks and Language Tools have merged. The financial details are not explicit, and it appears that the company founded by a wizard from Citrix will make Language Tools the R&D hub for the Florida-based VirtualWorks operation.

According to the story:

The combined organization brings together best of breed core technologies in the areas of enterprise search, data management, text analytics, discovery techniques and analytics to enable the development of new and exciting next generation applications in the business intelligence space.

VirtualWorks is, or was, a SharePoint-centric solution. Like other search vendors, the company uses connectors to suck data into a central indexing point. Users then search the content and have access to it without having to query separate systems.
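A minimal sketch (connector names and documents invented) of the pattern the news release describes: one index, one query, many repositories.

```python
# Each connector yields documents from one repository; a single index
# receives them all so users issue one query instead of several.
def sharepoint_connector():
    yield {"id": "sp-1", "source": "sharepoint", "text": "Q2 planning deck"}

def fileshare_connector():
    yield {"id": "fs-9", "source": "fileshare", "text": "Q2 budget export"}

central_index = {}
for connector in (sharepoint_connector, fileshare_connector):
    for doc in connector():
        central_index[doc["id"]] = doc  # one indexing point for everything

# One query spans every connected system.
hits = [d for d in central_index.values() if "q2" in d["text"].lower()]
print([(d["source"], d["id"]) for d in hits])
```

The loop is the easy part; writing and maintaining a connector for every wonky repository behind the firewall is not.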

This idea has fueled enterprise search since the days of Verity, Autonomy, Fast Search, Convera, et al. The real money today seems to be in the consulting and engineering services required to make enterprise search useful.

SharePoint is certainly widely used, and it is fraught with interesting challenges. Will the lash up of these two firms generate the type of revenue once associated with Autonomy and Fast Search & Transfer?

My hunch is that enterprise search continues to be a tough market. There are functional solutions to locating information available as open source or at comparatively modest license fees. I am thinking of dtSearch and Maxxcat. Both of these work well within Microsoft-centric environments.

Stephen E Arnold, May 17, 2016

Facebook and Humans: Reality Is Not Marketing

May 16, 2016

I read “Facebook News Selection Is in Hands of Editors Not Algorithms, Documents Show.” The main point of the story is that Facebook uses humans to do work. The idea is that algorithms do not seem to be a big part of picking out what’s important.

The write up comes from a “real” journalism outfit. The article points out:

The boilerplate about its [Facebook’s]  news operations provided to customers by the company suggests that much of its news gathering is determined by machines: “The topics you see are based on a number of factors including engagement, timeliness, Pages you’ve liked and your location,” says a page devoted to the question “How does Facebook determine what topics are trending?”

After reading this, I thought of Google’s poetry created by its artificial intelligence system. Here’s the line which came to mind:

I started to cry. (Source: Quartz)

I vibrate with the annoyance bubbling under the surface of the newspaper article. Imagine. Facebook has great artificial intelligence. Facebook uses smart software. Facebook open sources its systems and methods. The company says it is at the cutting edge of replacing humans with objective procedures.

The article’s belief in baloney is fried and served cold on stale bread. Facebook uses humans. The folks at real journalism outfits may want to work through articles like “Different Loci of Semantic Interference in Picture Naming vs. Word-Picture Matching Tasks” to get a sense of why smart systems go wandering.

So what’s new? Palantir Technologies uses humans to index content. Without that human input, the “smart” software does some useful work, but humans are part of the workflow.

Other companies use humans too. But the marketing collateral and the fizzy presentations at fancy conferences paint a picture of a world in which cognitive, artificially intelligent, smart systems do the work that subject matter experts used to do. Humans, like indexers and editors, are no longer needed.

Now reality pokes its rose-tinted fingertips into the real world.

Let me be clear. My unhappiness with the verbiage generated about smart software comes down to one simple fact.

Most of the smart software systems require humans to fiddle at the beginning when a system is set up, while the system operates to deal with exceptions, and after an output is produced to figure out what’s what. In short, smart software is not that smart yet.

There are many reasons, but the primary one is that the math and procedures underpinning many of the systems with which I am familiar are immature. Smart software works well when certain caveats are accepted. For example, the vaunted Watson must be trained. Watson, therefore, is not that much different from the training Autonomy baked into its IDOL system in the mid 1990s. Palantir uses humans for one simple reason. Figuring out what’s important to a team under fire works much better if the humans with skin in the game provide indexing terms and identify important points, like local names for stretches of highway where bombs can be placed without too much hassle.

Dig into any of the search and content processing systems and you find expenditures for human work. Companies licensing smart systems which index automatically face significant budget overruns, operational problems because of lousy outputs, and piles of exceptions to either ignore or deal with. The result is that the smoke and mirrors of marketers speaking to people who want a silver bullet are not exactly able to perform like the carefully crafted demonstrations.

IBM i2 Analyst’s Notebook requires humans. Fast Search (now an earlobe in SharePoint) requires humans. Coveo’s system requires humans. Attivio’s system requires humans. OpenText’s suite of search and content processing requires humans. Even Maxxcat benefits from informed set up and deployment. Out of the box, dtSearch can index, but one needs to know how to set it up and make it work in a specific Microsoft environment. Every search and content processing system that asserts that it is automatic is spackling flawed wallboard.

For years, I have given a lecture about the essential sameness of search and content processing systems. These systems use the same well known and widely taught mathematical procedures. The great breakthroughs at SRCH2 and similar firms amount to optimization of certain operations. But the whiziest system is pretty much like other systems. As a result, these systems perform in a similar manner. These systems require humans to create term lists, lookup tables of aliases for persons of interest, hand-crafted taxonomies to represent the chunk of reality the system is supposed to know about, and other “libraries” and “knowledgebases.”
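What does such a lookup table buy? A toy example (the entries are invented for illustration): without the human-maintained aliases, a query on one surface form misses every document using the other.

```python
# A hand-built alias table of the sort human subject matter experts maintain.
ALIASES = {
    "the colonel": "Ivan Petrov",           # alias for a person of interest
    "i. petrov": "Ivan Petrov",
    "route irish": "Baghdad Airport Road",  # local name for a stretch of road
}

def expand_query(query):
    """Expand a query with its canonical form so documents using either
    surface form can match. No algorithm supplies these pairs; people do."""
    terms = {query}
    canonical = ALIASES.get(query.lower())
    if canonical:
        terms.add(canonical)
    return terms

print(expand_query("Route Irish"))
# {'Route Irish', 'Baghdad Airport Road'}
```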

Watson is a source of amusement to me precisely because the human effort required to make a smart system work is never converted into cost and time statements. People assume Watson won Jeopardy because it was smart. People assume Google knows what ads to present because Google’s software is so darned smart. People assume Facebook mines its data to select news for an individual. Sure, there is automation of certain processes, but humans are needed. Omit the humans and you get the crazy Microsoft Tay system, which humans taught to be crazier than some US politicians.

For decades I have reminded those who listened to my lectures not to confuse what they see in science fiction films with reality. Progress in smart software is evident. But the progress is very slow, hampered by the computational limits of today’s hardware and infrastructure. Just like real time, the concept is easy to say but quite expensive and difficult to implement in a meaningful way. There’s a reason millisecond access to trading data costs so much that only certain financial operations can afford the bill. Smart software is the same.

How about less outrage from those covering smart software and more critical thinking about what’s required to get a system to produce a useful output? In short, more info and less puffery, more critical thinking and less sawdust. Maybe I imagined it, but both the Google and Tesla self-driving vehicles have crashed, right? Humans are essential because smart software is not as smart as those who believe in unicorns assume. Demos, like TV game shows, require pre and post production, gentle reader.

What happens when humans are involved? Isn’t bias part of the territory?

Stephen E Arnold, May 16, 2016

