Parsing Document: A Shift to Small Data

November 14, 2019

DarkCyber spotted “Eigen Nabs $37M to Help Banks and Others Parse Huge Documents Using Natural Language and Small Data.” The folks chasing the enterprise search pot of gold may need to pay attention to solving specific problems. Eigen uses search technology to identify the important items in long documents. The idea is “small data.”

The write up reports:

The basic idea behind Eigen is that it focuses what co-founder and CEO Lewis Liu describes as “small data”. The company has devised a way to “teach” an AI to read a specific kind of document — say, a loan contract — by looking at a couple of examples and training on these. The whole process is relatively easy to do for a non-technical person: you figure out what you want to look for and analyze, find the examples using basic search in two or three documents, and create the template which can then be used across hundreds or thousands of the same kind of documents (in this case, a loan contract).

Interesting, but the approach seems similar to identifying several passages in a text and submitting these to a search engine. This used to be called “more like this.” But today? Small data.
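For readers who have forgotten what “more like this” looked like, here is a minimal sketch in Python. It is not Eigen’s method, which the write up does not describe in detail. It assumes plain TF-IDF weighting and cosine similarity, and the function names and the three-document corpus are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF: a term found in every document still gets a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
             for t, c in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(example_passages, corpus):
    """Return corpus indexes ranked by similarity to the example passages."""
    query = " ".join(example_passages)
    vectors = tfidf_vectors(corpus + [query])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors[:-1])]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical corpus: two loan-related passages and one unrelated note.
corpus = [
    "The borrower shall repay the loan principal and interest on the maturity date.",
    "The quarterly sales meeting is moved to Tuesday afternoon.",
    "Interest on the loan accrues daily and the borrower pays it monthly.",
]
ranking = more_like_this(["borrower repays loan interest"], corpus)
print(ranking)  # the unrelated sales note (index 1) comes last
```

The user picks a few passages, the system finds documents that look like them. Whether one calls that a template or a query is a marketing decision.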

With the cloud coming back on premises and big data becoming user identified small data, what’s next? Boolean queries?

DarkCyber hopes so.

Stephen E Arnold, November 14, 2019

Curious about Semantic Search the SEO Way?

November 12, 2019

DarkCyber is frequently curious about search: Semantic, enterprise, meta, multi-lingual, Boolean, and the laundry list of buzzwords marshaled to allow a person to find an answer.

If you want to get a Zithromax Z-PAK of semantic search talk, navigate to “Semantic Search Guide.” One has to look closely at the url to discern that this “objective” write up is about search engine optimization or SEO. DarkCyber affectionately describes SEO as the “relevance” killer, but that’s just our old-fashioned self refusing to adapt to the whizzy new world.

The link will point to a page with a number of links. These include:

  • Target audience and contributions
  • The knowledge graph explained
  • The evolution of search
  • Using Google’s entity search tool
  • Getting a Wikipedia listing

DarkCyber took a look at the “Evolution of Search” segment. We found it quirky but interesting. For example, we noted this passage:

Now we turn to the heart of full-text search. SEOs tend to dwell on the indexing part of search or the retrieval part of the search, called the Search Engine Results Pages (SERPs, for short). I believe they do this because they can see these parts of the search. They can tell if their pages have been crawled, or if they appear. What they tend to ignore is the black box in the middle. The part where a search engine takes all those gazillion words and puts them in an index in a way that allows for instant retrieval. At the same time, they are able to blend text results with videos, images and other types of data in a process known as “Universal Search”. This is the heart of the matter and whilst this book will not attempt to cover all of this complex subject, we will go into a number of the algorithms that search engines use. I hope these explanations of sometimes complex, but mostly iterative algorithms appeal to the marketer inside you and do not challenge your maths skills too much. If you would like to take these ideas in in video form, I highly recommend a video by Peter Norvig from Google in 2011:

Oh, well. This is one way to look at universal search. But Google has silos of indexes. The system after 20 plus years does not federate results across indexes. Semantic search? Yeah, right. Search each index, scan results, cut and paste, and then try to figure out the dates and times. Semantic search does not do time particularly well.

Important. Not to the SEO. Search babble may be more compelling.

If this approach is your cup of tea, inLinks has the hot water you need to understand why finding information is not what it seems.

Stephen E Arnold, November 12, 2019

The Key to Millions: Enterprise Search?

November 11, 2019

I thought the world was crazier than ever when enterprise search became the focal point of a multi-billion dollar deal and a multi-year lawsuit. The open source search movement picked up steam as companies shifted their attention from proprietary search and retrieval solutions to those maintained by a “community.” Search became a utility which many information technology professionals found a Bermuda Triangle for careers.


Our research prior to the publication of the three volumes of the Enterprise Search Report I wrote and our subsequent work on next generation search solutions revealed these problems:

  1. Enterprise search implies one size fits all. Information retrieval needs vary by business unit, department, and individuals. When one pokes around a large organization, one finds numerous search and information access systems. One size? Nope.
  2. Users look for information in the enterprise search system and cannot locate it. The reasons vary, but the universal gripe is, “I can’t locate the document I just saved.” The notion of real time is not one that fits into most organizations’ information infrastructure. Cost is one big reason. What looks good in a demo does not work in the “real world” of a company.
  3. Silos. The implications of “enterprise” suggest that a significant amount of information will be available to a user of the search system. Nothing could be further from the reality. Legal keeps some documents under lock and key. Personnel? The same approach. Research? No data goes out of the lab or the researcher’s workstation. On and on.
  4. Changes that are not captured. The top sales professional changes his presentation right before giving a talk to seal a big deal. The changes are not indexed because the sales professional has to do the contract. Missing info? Yes.
  5. Untracked digital information. Enterprise search has been neither quick nor adept at handling social media posts (authorized or unauthorized), interviews, videos produced in lieu of a written report, and similar information objects. Try to find key facts from these content collections. Give up yet?

I could extend this list, but I don’t have the energy. Few are interested in what caused Entopia to go out of business. No one I have spoken with in the last five years cares about why Fast Search & Transfer self destructed. No one cares.

I read “Want to Earn Millions? Launch an AI Based Enterprise Search Startup.” That’s a path to fame and riches. The write up states:

Enterprise search engines based on artificial intelligence systems are taking off fast. Cognitive search systems using NLP can include structured data contained in databases and even nontraditional enterprise information like pictures, video, sound, and machine information, for example, from the internet of things (IoT) gadgets, to bring contextual results in the actual business context.

Sounds good. How about this?

For startups and venture investing, the trend is clear. One prime example of this trend is the world’s leading space agency- NASA has enormous data ever since it was created in 1958. Now, the agency is working to make its data increasingly accessible for rocket designers and researchers. It is redesigning search and analytics abilities utilizing AI and natural language processing (NLP) systems created by a company known as Sinequa which is collaborating with the agency to deploy a worldwide knowledge management ability.

Amazing. NASA, which helped move forward technologies like RECON because engineers could not locate key documents, is looking at technology which has wobbled from search to intelligence and back again.

A quick reality check, gentle reader, please.

One can download open source search and retrieval software and get decent results. But there are firms which have goosed the “money” in enterprise search to astronomical levels:

  • Algolia, $100 million
  • Coveo, $200 million
  • LucidWorks, $150 million
  • ThoughtSpot, $248 million.

Now let’s think about Autonomy. At its height, the company reported revenues of about $800 million. HP paid $10.3 billion. After a short period of time, HP realized its massive sales and marketing system could not generate enough new sales and sustainable revenue to keep the Autonomy business an alleged winner.

How will these companies pitching enterprise search generate sufficient revenue to pay back their investors, fund research and development, add filters and other components needed to deal with today’s content flows, and support their existing systems as licensees try to make search work like investigative software?

The answer is, “The odds are quite unappealing.”

  • Enterprise search has been available for half a century with some of the old school systems still available from OpenText in the guise of BRS Search
  • Dissatisfaction with enterprise search systems generally runs about 50 to 70 percent in most organizations with such a system
  • Costs of keeping an enterprise search and retrieval system continue to creep up despite the advent of managed services like those available from Amazon and others

Where are the customers?

That’s the question the article ignores.

Customers are likely to be just as tough to convince to use an enterprise solution as they have been for decades.

Net net: Enterprise search may not be the spring chicken the write up describes. Enterprise search has a history. And history is about to repeat itself. When the Autonomy matter is resolved, there may be a new search drama to follow.

Keep in mind that Google couldn’t make enterprise search work. But these cash stuffed outfits can? Maybe? Well, probably not.

Stephen E Arnold, November 11, 2019

Google: Bert Search Is Here. Where Is Ernie Advertising?

November 10, 2019

Google wants to stay at the top of search, so they are constantly developing new technology to keep their search algorithms ahead of the competition. Fast Company shares the latest on Google’s search technology in the article, “Google Just Got Better At Understanding Your Trickiest Searches.” Search queries power all of Google searches and the problem for search algorithms is understanding which words in the query are the most important. Another issue is that the algorithms need to understand how the words relate to one another. The relationship between keywords and their intent is subtle, particularly with all the subtle meanings in the English language.

Google’s newest search algorithm endeavor is dubbed BERT, short for Bidirectional Encoder Representations from Transformers. What does that mean?

“We non-AI scientists don’t have to worry about what encoders, representations, and transformers are. But the gist of the idea is that BERT trains machine language algorithms by feeding them chunks of text that have some of the words removed. The algorithm’s challenge is to guess the missing words—which turns out to be a game that computers are good at playing, and an effective way to efficiently train an algorithm to understand text. From a comprehension standpoint, it helps “turn keyword-ese into language,” said Google search chief Ben Gomes.”

Apparently the more text fed into a search, the better BERT can interpret its meaning. Google search scientists tested BERT by feeding the algorithm an endless stream of text from the search engine results. The “bidirectional” in BERT’s name comes from how the algorithm interprets data. Traditional search algorithms read English search queries from left to right, while the bidirectional BERT weighs the words on both sides of each term.
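The fill-in-the-blank game is easier to picture with a toy. The Python sketch below is not BERT: there is no neural network and no Google code in it. It only shows, under our own assumptions, how a sentence can be turned into “guess the missing word” examples and what “both sides” of a gap means. The function names, the sample sentence, and the fixed mask position are invented for illustration; production systems pick the gaps at random.

```python
MASK = "[MASK]"


def make_masked_examples(tokens, mask_positions):
    """Return (masked_tokens, position, target_word) triples, one per gap."""
    examples = []
    for pos in mask_positions:
        masked = list(tokens)          # copy so the original sentence is untouched
        masked[pos] = MASK
        examples.append((masked, pos, tokens[pos]))
    return examples


def context_window(masked_tokens, pos, width=2):
    """Collect the words on BOTH sides of the gap: the 'bidirectional' part."""
    left = masked_tokens[max(0, pos - width):pos]
    right = masked_tokens[pos + 1:pos + 1 + width]
    return left, right


# Hypothetical sentence; position 3 ("medicine") is hidden from the learner.
tokens = "can you get medicine for someone at the pharmacy".split()
masked, pos, target = make_masked_examples(tokens, [3])[0]
print(" ".join(masked))             # can you get [MASK] for someone at the pharmacy
print(context_window(masked, pos))  # (['you', 'get'], ['for', 'someone'])
print(target)                       # medicine
```

A strictly left-to-right reader would see only “you get” before guessing. The words after the gap are what narrow the guess.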

The average user will not recognize that BERT has altered their search results, but it will be beneficial to them. BERT will not have the same far-reaching impact as universal search and knowledge graph, but it does give Google a competitive advantage.

The Wall Street Journal did some Google related sleuthing. The focus is advertising. You can read the story and look at the very millennial diagram in “How Google Edged Out Rivals and Built the World’s Dominant Ad Machine: A Visual Guide.” You will have to pay to learn what the diagram shown below means. You will also have to do some homework to figure out how advertising and search / retrieval are connected. That’s important to some. But that diagram is remarkable. It uses Google colors too.


Whitney Grace, November 10, 2019

Search System Bayard

November 1, 2019

Looking for an open source search and retrieval tool written in Rust and built on top of Tantivy (Lucene?)? Point your browser to GitHub and grab the files. The readme file highlights these features:

  • Full-text search/indexing
  • Index replication
  • Bringing up a cluster
  • Command line interface.

DarkCyber has not tested it, but a journalist contacted us on October 31, 2019, and was interested in the future of search. I pointed out that there are free and open source options.

What people want to buy, however, is something that does not alienate two thirds of the search system’s users the first day the software is deployed.

Surprised? You may not know what you don’t know, but, gentle reader, you are an exception.

Stephen E Arnold, November 1, 2019

Metasearch Engine Changes Hands

October 28, 2019

In 1998 a Wall Street professional founded Ixquick. As I recall, the developer was David Bodnick. Like other search developers, selling was better than pumping ads and trying to compete in the world of the digital library card catalog. Ixquick’s buyer was Surfboard Holding BV.

Metasearch engines like DuckDuckGo send queries to other search engines and present a list of semi-deduplicated results. Dogpile and Vivisimo were other metasearch engines. The Ixquick twist was privacy. I don’t want to go into the notion of privacy in an ad supported search system in this item.
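What “semi-deduplicated” means in practice can be sketched in a few lines of Python. This is not how Ixquick, Dogpile, or any named service works; it is a minimal illustration that assumes each upstream engine returns a ranked list of URLs. The function names and the example.com addresses are invented, and the normalization is deliberately rough: it ignores the scheme and the query string, which is one reason the result is only “semi” clean.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Rough comparison key: lowercase host without 'www.', path without trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_results(result_lists):
    """Interleave ranked lists round-robin, keeping the first copy of each URL."""
    seen = set()
    merged = []
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):
                key = normalize_url(results[rank])
                if key not in seen:
                    seen.add(key)
                    merged.append(results[rank])
    return merged


# Hypothetical result lists from two upstream engines.
engine_a = ["https://example.com/a", "https://example.org/b"]
engine_b = ["http://www.example.com/a/", "https://example.net/c"]
print(merge_results([engine_a, engine_b]))
# ['https://example.com/a', 'https://example.org/b', 'https://example.net/c']
```

Round-robin interleaving treats every engine’s top hit as equally good. That is a design choice, not a relevance model.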

DarkCyber noted a Reddit post that reveals System1 (Privacy One Group) now owns the service. Note the word privacy. As I said, I am not going to explain for the umpteenth time why free Web search or free services of any type may have a different notion of privacy than someone in Harrod’s Creek, Kentucky.

Should I explain the issues related to metasearch systems? Nope. Just like the privacy thing. No one understands and no one cares.

Stephen E Arnold, October 28, 2019

Google NLP Search: Fortune Loves It. Simple Queries Reveal Shortcomings

October 25, 2019

I read “Google Says Its Latest Tech Tweak Provides Better Search Results. Here’s How.” DarkCyber enjoys Fortune Magazine’s how to explanations. They are just. So. Wonderful.

We learned:

Google’s goal is to make it easier for users, who often don’t know how to enter queries for the information they want. Since its search engine debuted in 1997, Google has focused on getting its technology to better understand natural language to produce relevant results even in cases where users enter a misspelled word or a query that is off target. With the latest change, Google will also now consider the sequential order in which words are placed in a search, instead of returning results based on a “mixed bag” of keywords.

Yes, but what about tuning search to advertising? What about ignoring bound phrases? What about Boolean logic? What about words like “terminal” which have different, often difficult to disambiguate meanings?
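For those who skipped the bound phrase era, a small Python sketch shows the difference between a “mixed bag” of keywords and a phrase that respects word order. It says nothing about Google’s internals. It assumes a toy positional index built by splitting on whitespace, and the function names and three sample documents are invented for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions]} so word order is preserved."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def boolean_and(index, words):
    """Documents containing every word, in any order (the 'mixed bag')."""
    postings = [set(index.get(w, {})) for w in words]
    return set.intersection(*postings) if postings else set()


def phrase(index, words):
    """Documents where the words appear side by side in the given order."""
    hits = set()
    for doc_id in boolean_and(index, words):
        starts = index[words[0]][doc_id]
        if any(all(p + i in index[w][doc_id] for i, w in enumerate(words))
               for p in starts):
            hits.add(doc_id)
    return hits


# Hypothetical documents built around the ambiguous word "terminal".
docs = [
    "the airport terminal was crowded",
    "a terminal illness diagnosis",
    "the crowded terminal at the airport",
]
index = build_index(docs)
print(sorted(boolean_and(index, ["airport", "terminal"])))  # [0, 2]
print(sorted(phrase(index, ["airport", "terminal"])))       # [0]
```

Neither function knows which “terminal” the user meant. Word order narrows the field; disambiguation is a separate problem.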

Fortune jumps over these questions.

Try this query on the “new” Google:

What companies compete with Subsentio?

What about this one?

Amazon law enforcement products

Not what I had in mind. I was thinking about QLDB and digital currency deanonymization.

Sorry, Google. Not yet. Personalization does not work either, by the way. (You know. Examine the search history, etc. etc.)

Fortune, check out where Google’s ad revenue comes from. Just a small clue to put Google search in its context.

Stephen E Arnold, October 25, 2019

Dumais on Search: Bell Labs Roots Are Thriving

October 23, 2019

We just love a genuine Search guru, and Dr. Susan Dumais is one of the best. The illustrious Dr. Dumais is now a Microsoft Technical Fellow and Deputy Lab Director of MSR AI. If you wanted to know the history of information retrieval, she would be the one to hear tell about it, and now you can, courtesy of the Microsoft Research Podcast. Both the 38-minute podcast itself and a transcript are posted at “HCI, IR and the Search for Better Search with Dr. Susan Dumais.” The good doctor describes what motivates her in her work:

“I think there are two commonalities and themes in my work. One is topical. So, as you said, I’m really interested in understanding problems from a very user-centric point-of-view. I care a lot about people, their motivations, the problems they have. I also care about solving those problems with new algorithms, new techniques and so on. So, a lot of my work involves this intersection of people and technology, thinking about how work practices co-evolve with new technological developments. And so thematically, that’s an area that I really like. I like this ability to go back and forth between understanding people, how they think, how they reason, how they learn, how they find information, and finding solutions that work for them. In the end, if something doesn’t work for people, it doesn’t work. In addition to topically, I approach problems in a way that is motivated, oftentimes, by things that I find frustrating. We may talk a little bit later about my work in latent semantic indexing, but that grew out of a frustration with trying to learn the Unix operating system. Work I’ve done on email spam, grew out of a frustration in mitigating the vast amount of junk that I was getting. So, I tend to be motivated by problems that I have now, or that I anticipate that our customers, and people will have in general, given the emerging technology trends.”

She and host Gretchen Huizinga go on to discuss the evolution of search technology over the last twenty years, beginning with the first HTML page crawlers that indexed but a couple thousand queries per day. They also cover Dumais’ work over the years to build bridges, provide context in search, and bring changing content into the equation. We hope you will check out the intriguing and informative interview for yourself, dear reader.

Cynthia Murrell, October 23, 2019

Algolia: Cash Funding Hits $184 Million

October 15, 2019

Exalead was sucked into Dassault Systèmes. Then former Exaleaders abandoned ship. Algolia benefited from some Exalead experience. But unlike Exalead, Algolia embraced venture funding with cash provided by Accel, Point Nine Capital, Storm Ventures, and Y Combinator, among others.

DarkCyber noted “Algolia Finds $110M from Accel and Salesforce for Its Search-As-a-Service, Used by Slack, Twitch and 8K Others.” The write up reports that the company has “closed a Series C of $110 million, money that it plans to invest in R&D around its search technology, including doubling down on voice, and further global expansion in Europe, North America and Asia Pacific.”

The write up adds:

Having Salesforce as a strategic backer in this round is notable: the CRM giant currently does not have a native search product in its wide range of cloud-based services for enterprises, instead opting for endorsed integrations with third parties, such as Algolia competitor Coveo. The plan will be to further integrate with Salesforce although no products to speak of as of yet.

The challenge will be to go where few search and retrieval systems have gone before.

Some people have forgotten the disappointments and questionable financial tricks promising search vendors delivered to stakeholders and customers.

With venture firms looking for winners, returns of 20 percent will not deliver what the sources of the funds expect. The good old days of a 17X return may have cooled, but generating an 8X or 12X return may be a challenge.


In the course of our researching and writing the Enterprise Search Report from 2003 to 2006 and our subsequent work, several “themes” or “learnings” surfaced:

  1. Good enough search is now the order of the day; that is, an organization-wide search system does not meet the needs of many operating units. Examples range from the legal department to research and development to engineering and the drawings plus data embedded in product manufacturing systems to information under security umbrellas with real time data and video content objects. Therefore, the “one solution” approach dissipates like morning fog.
  2. Utility search from outfits like Amazon is “good enough.” This means that a developer using Amazon blockchain services and workflow tools may use the search functions available from Amazon. Maybe Amazon will buy Algolia, but for the foreseeable future, search is a tag-along function, not a driver of the big money apps which Amazon is aiming toward.
  3. Search, regardless of vendor, must spend significant sums to enrich the functions of the system. Natural language processing, predictive analytics, entity extraction, and other desired functions are moving targets. Adding and tuning these capabilities becomes expensive. And if the experiences of Autonomy and Fast Search & Transfer are representative, the costs become difficult to control.

DarkCyber hopes that Algolia can adapt to these research factoids. If not, search and retrieval may be rushing toward a disconnect between revenues, sustainable profits, and investor expectations.

The wheel of fortune is spinning. Where will it stop? On a winner or a loser? This is a difficult question to answer, and one which Attivio, BA-Insight, Coveo, Elastic, IBM Watson, Lucidworks, Microsoft, Sinequa, Voyager Search, and others have been trying to answer with millions of dollars, thousands of engineering hours, and massive investments in marketing. I am not including the search vendors positioned as policeware and intelware; for example, BAE NetReveal, Diffeo, LookingGlass, Palantir Technologies, and Shadowdragon, among others.

Worth monitoring the trajectory of Algolia.

Stephen E Arnold, October 15, 2019

Amazon: Elasticsearch Bounced and Squished

October 14, 2019

DarkCyber noted “AWS Elasticsearch: A Fundamentally-Flawed Offering.” The write up criticizes Amazon’s implementation of Elasticsearch. Amazon hired some folks from Lucidworks a few years ago. But under the covers, Lucene thrums along within Amazon and a large number of other search-and-retrieval companies, including those which present themselves as policeware. There are many reasons: [a] good enough, [b] no one company fixes the bugs, [c] good enough, [d] comparatively cheap, [e] good enough. Oh, one other point: Not under the control of one company like those good, old fashioned solutions like STAIRS III, Fulcrum (remember that?), or Delphes (the francophone folks).

This particular write up is unlikely to earn a gold star from Amazon’s internal team. The essay states:

I’m currently working on a large logging project that was initially implemented using AWS Elasticsearch. Having worked with large-scale mainline Elasticsearch clusters for several years, I’m absolutely stunned at how poor Amazon’s implementation is and I can’t fathom why they’re unable to fix or at least improve it.

I think the tip off is the phrase “how poor Amazon’s implementation is…”

The section Amazon Elasticsearch Operation provides some color to make vivid the author’s viewpoint; for example:

On Amazon, if a single node in your Elasticsearch cluster runs out of space, the entire cluster stops ingesting data, full stop. Amazon’s solution to this is to have users go through a nightmare process of periodically changing the shard counts in their index templates and then reindexing their existing data into new indices, deleting the previous indices, and then reindexing the data again to the previous index name if necessary. This should be wholly unnecessary, is computationally expensive, and requires that a raw copy of the ingested data be stored along with the parsed record because the raw copy will need to be parsed again to be reindexed. Of course, this also doubles the storage required for “normal” operation on AWS. [Emphasis in the original essay.]

The wrap up for the essay is clear from this passage:

I cannot fathom how Amazon decided to ship something so broken, and how they haven’t been able to improve the situation after over two years.

DarkCyber’s team formulated several observations. Let’s look at these in the form of questions and trust that some young sprites will answer them:

  1. Will Amazon make its version of Elasticsearch proprietary?
  2. Are these changes designed to “pull” developers deeper into the AWS platform, making departure more difficult or impossible for some implementations?
  3. Are the components the author of the essay finds objectionable designed to generate more revenue for Amazon?

Stephen E Arnold, October 14, 2019
