Semantic Search Hoohah: Hakia

June 8, 2015

My Overflight system snagged an update to an article about Hakia written in 2006. Hakia, as you may know, was a semantic search system. I ran an interview with Riza C. Berkan in 2008. You can find that Search Wizards Speak interview here.

Hakia went quiet months ago. The author of “Hakia Is a True Semantic Search Engine” posted a sentence that said: “Hakia, unfortunately, failed and went out of business.”

I reviewed that nine-year-old article this morning and highlighted several passages. They are important because they illustrate how easy it is to create a word picture that does not match reality. Search engine developers find describing their vision a heck of a lot easier than converting talk into sustainable revenues.

Let’s run down three of the passages proudly displaying my blue highlighter’s circles and arrows. The red text is from the article describing Hakia and the blue text is what the founder of Hakia said in the Search Wizards Speak interview.

Passage One

So a semantic search engine doesn’t have to address every word in the English language, and in fact it may be able to get by with a very small core set of words. Let’s say that we want to create our own semantic search engine. We can probably index most mainstream (non-professional) documents with 20-30,000 words. There will be a few gaps here and there, but we can tolerate those gaps. But still, the task of computing relevance for millions, perhaps billions of documents that use 30,000 words is horrendously monumental. If we’re going to base our relevance scoring on semantic analysis, we need to reduce the word-set as much as possible.

This passage is less about Hakia and more about the author’s perception of semantic search. Does this explanation resonate with you? For me, many semantic methods are computationally burdensome. As a result, the systems are often sluggish and unable to keep pace with new and updated content.
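To make the author’s point about a reduced word set concrete, here is a minimal sketch in Python. It assumes a hand-built concept map and a crude overlap score; none of this is Hakia’s actual algorithm. It only illustrates why scoring against a few core terms is cheaper than scoring against every surface word.

# Toy illustration of the "reduced word-set" idea in the passage above.
# The concept map and the scoring rule are assumptions for illustration,
# not Hakia's method.
from collections import Counter

CONCEPT_MAP = {
    "physician": "doctor", "doctor": "doctor", "md": "doctor",
    "cardiac": "heart", "heart": "heart",
}

def reduce_to_core(text):
    """Collapse surface words onto core terms and count them."""
    tokens = text.lower().split()
    return Counter(CONCEPT_MAP[t] for t in tokens if t in CONCEPT_MAP)

def relevance(query, document):
    """Score overlap of core-term counts; a stand-in for real semantic scoring."""
    q, d = reduce_to_core(query), reduce_to_core(document)
    overlap = sum(min(q[term], d[term]) for term in q)
    return overlap / max(sum(q.values()), 1)

print(relevance("cardiac physician", "the doctor examined a heart murmur"))  # 1.0

The sketch also hints at the cost problem the author glosses over: the concept map has to cover tens of thousands of words, and every document must be pushed through it before scoring, which is where the sluggishness comes from.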

Here’s how Dr. Riza C. Berkan, a nuclear engineer and math whiz, explained semantics:

With semantic search, this is not a problem. We will extract everything of the 500 words that is relevant content. That is why Google has a credibility problem. Google cannot guarantee credibility because its system relies on link statistics. Semantic methods do not rely on links. Semantic methods use the content itself. For example, hakia QDexed approximately 10 million PubMed pages. If there are 100 million questions, hakia will bring you to the correct PubMed page 99 percent of the time, whereas other engines will bring you perhaps 25 percent of the time, depending on the level of available statistics. For certain things, the big players do not like awareness. Google has never made, and probably never will make, credibility important. You can do advanced search and do “site: sitename” but that is too hard for the user; less than 0.5% of users ever use advanced search features.

Passage Two

What I believe the founders of Hakia have done is borrow the concept of Lambda Calculus from compiler theory to speed the process of reducing elements on pages to their conceptual foundations. That is, if we assume everyone writes like me, then most documents can be reduced to a much smaller subset of place-holders that accurately convey the meaning of all the words we use.

Okay, but in my Search Wizards Speak interview, the founder of Hakia said:

We can analyze 70 average pages per second per server. Scaling: The beauty of QDexing is that QDexing grows with new knowledge and sequences, but not with new documents. If I have one page, two pages or 1,000 pages of the OJ Simpson trial, they are all talking about the same thing, and thus I need to store very little of it. The more pages that come, the more the quality of the results increase, but only with new information is the amount of QDex stored information increased. At the beginning, we have a huge steep curve, but then, processing and storage are fairly low cost. The biggest cost is the storage, as we have many many QDex files, but these are tiny two to three Kb files. Right now, we are going through news, and we are showing a seven to 10 minute lag for fully QDexing news.

No reference to a type of calculus that thrills Googlers. In fact, a review of the patent shows that well-known methods are combined in what appears to be an interesting way.
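The QDexing description above, an index that grows with new knowledge rather than with new documents, can be approximated with a content-keyed store. The sketch below is my reading of that claim, not the patented method; the sentence-level extraction and the SHA-1 keys are assumptions for illustration.

# Sketch of a content-keyed index in the spirit of the QDexing description:
# storage grows with distinct "knowledge sequences," not with document count.
# The extraction rule (sentence split) and the hashing are assumptions,
# not Hakia's implementation.
import hashlib
from collections import defaultdict

class ContentKeyedIndex:
    def __init__(self):
        self.sequences = {}                # key -> stored sequence text
        self.postings = defaultdict(set)   # key -> ids of documents containing it

    def add_document(self, doc_id, text):
        """Index a document; return how many new sequences were stored."""
        new = 0
        for sentence in (s.strip() for s in text.split(".") if s.strip()):
            key = hashlib.sha1(sentence.lower().encode()).hexdigest()
            if key not in self.sequences:
                self.sequences[key] = sentence
                new += 1
            self.postings[key].add(doc_id)
        return new

index = ContentKeyedIndex()
index.add_document("page1", "The trial began Monday. The verdict surprised observers.")
index.add_document("page2", "The trial began Monday. The verdict surprised observers.")
print(len(index.sequences))  # 2 stored sequences, despite two documents

The design point is that the key comes from the content, not the document, so the thousandth page about the same trial adds pointers but almost no storage, which matches the founder’s claim about small, slowly growing QDex files.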

Passage Three

Documents can still pass value by reference in a semantic index, but the mechanics of reference work differently. You have more options, so less sophisticated writers who don’t embed links in their text can have just as much impact on another document’s importance as professional search optimization copywriters. Paid links may not become a thing of the past very quickly, but you can bet your blog posts that buying references is going to be more sophisticated if this technology takes off. That is what is so exciting about Hakia. They haven’t just figured out a way to produce a truly semantic search engine. They have just cut through a lot of the garbage (at a theoretical level) that permeates the Web. Google AdSense arbitragers who rely on scraping other documents to create content will eventually find their cash cows drying up. The semantic index will tell Hakia where the original content came from more often than not.

Here’s what the founder says in the Search Wizards Speak interview:

With semantic search, this is not a problem. We will extract everything of the 500 words that is relevant content. That is why Google has a credibility problem. Google cannot guarantee credibility because its system relies on link statistics. Semantic methods do not rely on links. Semantic methods use the content itself. For example, hakia QDexed approximately 10 million PubMed pages. If there are 100 million questions, hakia will bring you to the correct PubMed page 99 percent of the time, whereas other engines will bring you perhaps 25 percent of the time, depending on the level of available statistics. For certain things, the big players do not like awareness. Google has never made, and probably never will make, credibility important. You can do advanced search and do “site: sitename” but that is too hard for the user; less than 0.5% of users ever use advanced search features.

The key fact is that Hakia failed. The company tried to get traction with health and medical information. The vocabulary for scientific, technical, and medical content is less poetic than the writing in business articles and general blog posts. Nevertheless, customers and users did not bite.

Notice that the author of the article did not come to grips with the specific systems and methods Hakia used. The write-up “sounds good” but lacks substance. The founder’s explanation reveals his confidence in what “should be,” not what was and is.

My point: Writing about search is difficult. Founders see the world one way; those writing about search interpret the descriptions in terms of their knowledge.

Where can one get accurate, objective information about search? The options are limited and have been for decades. Little wonder that search remains a baffler to many people.

Stephen E Arnold, June 8, 2015

Comments

2 Responses to “Semantic Search Hoohah: Hakia”

  1. messef on June 9th, 2015 11:07 am

    The founder and chairman of hakia, Dr. Pentti Kouri, died unexpectedly in the middle of the startup phase. Following his death, the company pulled out from consumer search market, and went into enterprise search, scoring with giants like Boeing. This article is not only inaccurate with the facts, it is also disrespectful to the memory of Dr. Pentti Kouri. I recommend a better research before writing an article.

  2. Need Semantic Search: Lucidworks Asserts It Is the Answer by Golly : Stephen E. Arnold @ Beyond Search on July 3rd, 2015 9:57 am

    […] In June I pointed to an article which had been tweeted as “new stuff.” Wrong. Navigate to “Semantic Search Hoohah: Hakia”; you will learn that Hakia is a quiet outfit. Quiet as in no longer on the Web. Maybe […]
