hakia
An Interview with Riza C. Berkan
by Avi Deitcher, New York City, August 11, 2008
hakia's offices are located in lower Manhattan. The beehive of activity makes it clear that hakia is attracting attention. Like Powerset, hakia is one of the information retrieval companies pushing the boundaries of semantic and linguistic text processing. The idea is to craft systems and methods that can figure out what content means. The "old style" approach to search is string matching: the user types words, and the system returns a list of documents containing those words. While better than turning handles on a microfilm reader, a smarter approach is needed.
hakia is a general purpose semantic search engine. "Semantic" search translates to concept and meaning matches. The challenge is to make software understand what a document means and, equally important, understand what information a user wants. With the strong backlash against keyword information retrieval systems, the race is on to crack this tough problem. Microsoft acquired the semantic search developer Powerset earlier in 2008 for a reported price of more than $100 million. hakia's system may be next in line for a high-profile acquisition.
I spoke with Dr. Riza C. Berkan in the firm's Manhattan offices. The full text of my interview with Dr. Berkan appears below:
What's your background? Where did hakia originate?
I am a nuclear engineer. I have physics degrees – BS and PhD – and I worked as a subcontractor to the US government for 10 years. While I was working for the government, my PhD dissertation was all about controls of and for information. How do you control nuclear systems – signals coming this way, commands coming that way? What kind of computation system do you need to handle all kinds of cases?
The academic discipline for that is called Artificial Intelligence, and the branch is called fuzzy logic. Fuzzy logic is a new type of mathematics, called "computing with words." In other words, if you and I are negotiating something, we will never negotiate in black and white, never with exact numbers.
We will use softer notions like, "really good product, marvelous job," while you will say, "I don't want to pay too much." How can you make a computational system handle vague concepts like "too much," "too little," and so on? If you look closely, 90% of all computations are really fuzzy, especially those involving humans. It is the mathematics of gray areas.
My area is that, and we were designing computation systems where a computer can really make decisions from fuzzy inputs, because the input might be that the situation is "bad," or there is "so much" pressure in the tank, or "not enough" people in the basement, or pipes are "going full speed." The computer can take all of these things, make a fuzzy computation, and decide to ring alarm bells or trigger other responses.
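To make the idea concrete, here is a minimal Python sketch of a fuzzy decision of the kind Dr. Berkan describes. The membership ramps and the 0.7 alarm threshold are illustrative assumptions, not anything from hakia's or his government work.

    def pressure_too_high(psi, low=80.0, high=120.0):
        """Fuzzy membership for "so much pressure": 0.0 at or below low,
        1.0 at or above high, a linear ramp in between."""
        if psi <= low:
            return 0.0
        if psi >= high:
            return 1.0
        return (psi - low) / (high - low)

    def not_enough_people(people, needed=5):
        """Fuzzy membership for "not enough people in the basement"."""
        return max(0.0, 1.0 - people / needed)

    def alarm(psi, people, threshold=0.7):
        """Combine the vague inputs with a fuzzy OR (the max) and decide."""
        danger = max(pressure_too_high(psi), not_enough_people(people))
        return danger >= threshold, danger

    print(alarm(110.0, 2))  # (True, 0.75): ring the alarm bells

The point is that the inputs never have to be crisp; the computation operates on degrees of truth in the gray areas.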
I wrote a book in 1997 – Fuzzy System Design Principles, published by IEEE. At this point we are in the 1990s; AltaVista.com is getting traffic and media attention. Search is just beginning.
At this very time, I was also designing systems to analyze documents.
When I was dealing with that, I met Victor Raskin, who is now the Distinguished Professor of Linguistics at Purdue University. Victor and I thought of proposing something to the US Government for them to use. Victor is the father of Ontological Semantics. He has been working on this area for more than 30 years.
He introduced me to the subject, and I liked it. We went to the government, and we wanted to convince them that this is a good thing for them to use. Before too long, I left the government's employ, and started entrepreneurial adventures. I met Pentti Kouri who has been my partner for the last 10 years or so.
Our idea was that the search engines could be made better. We watched. We waited. We were expecting so many improvements. These did not come, so we started hakia.
By improvements, you mean away from keywords and towards semantics?
Yes. So many improvements were possible, yet nothing was sticking; high-profile vendors like AskJeeves, Google, Microsoft, and Yahoo were not moving. So Pentti and I started the company, and Victor came on as a consultant.
It is interesting, because back then, at least one company tried to be semantic, and that was AskJeeves. They did not succeed.
I do not know exactly what went wrong with them. We spoke with some people in their team. Early on, AskJeeves did something very unwise with their strategy. I think the marketing team went too far ahead before the product was anywhere.
So you think they promised, but couldn't deliver?
There is a well-known documentary called startup.com. Somewhere in the documentary, they show AskJeeves being installed, and it was terrible, not producing answers. I remember it. It was the biggest overpromise in IT history. That is how AskJeeves blew its opportunity.
A good semantic system takes a very long time; it is a very difficult challenge, and we still have a ways to go. Of course, Silicon Valley does not like these long-term projects. Our project is very much like biotechnology, with a very long time to market. In biotechnology, you start today and are ready in 15 years.
So how do your investors have so much patience?
That is one of the keys that makes hakia unique from everyone else. Our investors are entrepreneurs on their own, who started something and made it big, and understand what it takes.
So it is not standard venture capital type investments pushing for a rapid exit?
Of course, we have plans like that, but our investors have all built companies of their own and come from the IT world, so they understand what it takes to build this. For example, my partner Pentti was the first to put together the Nokia fund and was on the board of Nokia. Going into this business, we knew it would take a long time, so it doesn't fit well with the classical Silicon Valley venture capital time frame, which looks to get out within four to five years at most.
Powerset dominated the news for a while, but they have little to show for their service. Why have demos to date been confined to very modest content?
Powerset specializes in one aspect of semantic search. They do syntactic meaning extraction from the content. Turning that into a search engine is a whole different ballgame.
We built the data centers and everything from scratch to make semantic search actually scalable. Powerset has been around for a short period of time, and the company is not even using its own technology. Powerset licensed technology from Xerox PARC. Powerset's engineers then had to patch the Xerox code and make it work for Powerset's specific situation. They did a good job in a short amount of time.
If you want broad semantic search, you have to develop the platform to support it, as we have. You cannot simply use an index and convert it to semantic search.
For example, if someone calls your name, you do not have an index in your brain that you go through; you simply know it. The active dataset within our system is called QDex, and the process QDexing, so we have direct connections, rather than taking index intersections and ranking them statistically, as an index-based engine does.
What hakia does is go through an actual learning process similar to that of the human brain. We take the page and its content, and we create queries and answers that can be asked of that page, which are then ready before the query comes.
Are you trying to anticipate the questions that could be asked about a particular document, rather than understanding the question as it comes?
Yes, yes, exactly.
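As a rough illustration of what query anticipation could look like, here is a toy sketch. The QDex format and hakia's actual query-generation rules are not public; this version is hand-fed fact triples, where a real system would extract them semantically. The key property it shows is that questions are prepared at processing time, so retrieval becomes a direct lookup rather than an index intersection.

    from collections import defaultdict

    qdex = defaultdict(list)  # anticipated question -> pages that answer it

    def qdex_page(url, facts):
        """'facts' are (subject, relation, value) triples; generate the
        questions this page can answer before any user asks them."""
        for subject, relation, value in facts:
            qdex[f"what is the {relation} of {subject}?"].append(url)
            qdex[f"is {value} the {relation} of {subject}?"].append(url)

    qdex_page("example.org/patella", [("patella", "common name", "kneecap")])

    # The answer is ready before the query arrives: a direct lookup.
    print(qdex["what is the common name of patella?"])  # ['example.org/patella']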
Is this what you have then patented? What are the features of the patent?
Our patent has a couple of pieces. The system has three components; one of them is what we call the Semantic Rank Algorithm. Certain parts are being kept as trade secrets and not being patented. We are choosing to take the risk of copying in order not to expose them. We will open-source it some day, which will give people a chance to use the QDexing platform to create their own solutions.
Interesting. What is your timeframe for that?
Probably some time in 2009. I don't want to set that date in stone, though.
Semantic technology has been available for a number of years. Why is it becoming the "hot" or "now" technology at this time?
Frankly, I think it is because of us and the media hype around Powerset. We came out at the same time, telling the same story, and people understood it. I think we stirred something, touched everyone's fear in the big search companies. If there were no fear, and no truth to it, there would be no follow-up.
There is much more to do in search; there is a level of suffering and discontent with the current solutions, and a level of disestablishmentarianism. With that combination, as we came out over the last two years, we really stirred things up.
Do you think the delivery will be enough this time around for the technology to successfully reach a plateau and grow?
Only time will tell; it is difficult to predict. But from our perspective, I think this is the right time. In the 1990s, when AskJeeves came out, it was a novel idea, but nobody knew or had heard of natural language processing.
The VCs and IT world had no idea what these claims were all about. In the last 10 to 15 years, people have become very educated, have seen glimpses of success and implementation, and have come to believe in it and expect it. There is an element of everyone being more educated. Still, there is a long way to go.
What is it that you think people are looking for from semantic technology? What is that discontent you describe? Is it a richer interface? Better content? What are people really looking for?
Good question. There are two things.
First is the interface. The era of "ten blue links" like those on the Google results page is coming to an end. Even Google is getting ready to change it. Everything is changing: cars have new models, new telephones come out, and search results will become more visual and more interactive. However, if you just pretty up the interface and do nothing to support it, it will go nowhere.
The second coming of Ask.com did a much better interface, yet the pretty face alone did not deliver what users wanted. If you make something pretty, you need to provide valuable and interesting results to go along with it; then you have the killer combination.
From the content side, the situation is that you have a large number of users going to Google and doing searches. However, many people, especially professionals (doctors, lawyers, engineers, among others) who are looking for good, credible information, do not use it because the results lack credibility.
If you listen to Google, the PR people there will tell you they want to satisfy you "at all costs." Thus, if they cannot bring you a credible page for your medical query, the results may come from the commercial page for the local massage parlor. Google is not expected to be an expert in each of these fields, and neither are you, yet you get these results.
I think the next phase of search will have credibility rankings. For example, for medical searches, first you will see government results – FDA, National Institutes of Health, National Science Foundation – then commercial results – WebMD – then some doctor in Southern California, and then user contributed content. You could give users such results with every search; for example, searching for Madonna, you would first get her site, then her official fan site, and eventually fan Web logs.
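A tiered credibility ranking of the kind Dr. Berkan describes could be sketched as a secondary sort key layered over relevance. The tiers and the domain-matching rules below are invented for illustration; they are not hakia's algorithm.

    TIERS = [
        lambda url: url.split("/")[0].endswith(".gov"),  # government sources first
        lambda url: "webmd.com" in url,                  # commercial references next
        lambda url: "/dr-" in url,                       # individual practitioners
    ]                                                    # everything else: user content

    def tier(url):
        """Return the index of the first matching tier; unmatched URLs
        (fan blogs, forums) fall into the last tier."""
        for rank, matches in enumerate(TIERS):
            if matches(url):
                return rank
        return len(TIERS)

    def rank(results):
        """'results' are (url, relevance) pairs; sort by credibility tier,
        then by relevance within each tier."""
        return sorted(results, key=lambda r: (tier(r[0]), -r[1]))

    hits = [("fanblog.example.com/knee", 0.90),
            ("www.fda.gov/knee-implants", 0.80),
            ("www.webmd.com/knee-pain", 0.85)]
    print(rank(hits))  # FDA first, WebMD second, the fan blog last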
If you say the two sides are form and content, then from the form perspective, you are looking at how well the results are presented; from the content perspective, you are looking at two things: how relevant the results are and how credible the sources are, right?
Let me back up and explain why credible results are an issue for Google. Google's crawler has to go and harvest information in order to rank search results. Let's do a back-of-the-envelope calculation.
On average, there are ten outgoing links and 500 words of content on every page. Out of 500 words, on average, only ten or so are wrapped with links. In other words, big chunks of content on every page are not statistically sampled by Google and similar search engines.
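Running those averages through the arithmetic shows how little of a page the link graph actually touches. The ten-and-500 figures are Dr. Berkan's; the percentages simply follow from them.

    words_per_page = 500   # average content words on a page
    linked_words = 10      # words wrapped in anchor text, on average

    sampled = linked_words / words_per_page
    print(f"{sampled:.0%} of the words carry link context")            # 2%
    print(f"{1 - sampled:.0%} of the content is never sampled via links")  # 98%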
Because of that, when they go, for example, to PubMed, Google is going to harvest only a small fraction of the information, even though a huge amount of it is relevant. So if you ask a question about something longer and more complex, Google will fetch information from wherever the Google crawler sampled. Many times, this is another site, which may not be as credible as the National Library of Medicine. Google gets a result, but it may be chosen just to bring you something relevant, not because it is the most useful information for you.
With semantic search, this is not a problem. We extract everything in the 500 words that is relevant content. That is why Google has a credibility problem: Google cannot guarantee credibility because its system relies on link statistics.
Semantic methods do not rely on links. Semantic methods use the content itself. For example, hakia QDexed approximately 10 million PubMed pages. If there are 100 million questions, hakia will bring you to the correct PubMed page 99 percent of the time, whereas other engines will bring you perhaps 25 percent of the time, depending on the level of available statistics.
For certain things, the big players do not like to raise awareness. Google has never made, and probably never will make, credibility important. You can do an advanced search with "site:sitename," but that is too hard for the user; fewer than 0.5% of users ever use advanced search features.
What about Google's Programmable Search Engine, invented by a former IBM researcher?
First, it may not be feasible for Google to use this technology. If Google could use the technology, Google would have it in operation. Second, you have to start from scratch to do this type of content processing. You cannot use the huge Google index to do this discovery and context analysis.
Let me explain why.
Let us assume you search for how to deal with a headache. From a keyword perspective, "headache" can mean many things: an annoyance, a migraine, a medical condition, slang, and so on.
Without semantics, you would need to cross-link the index associations in multiple ways, but you cannot do that within the index, or you would need to duplicate the records – all matches for car would also go to automobile – and the index gets orders of magnitude bigger and unwieldy very fast, without giving significantly better results. To embody all semantic associations in the index, you would have to increase your index size a millionfold!
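A toy inverted index makes the blow-up visible: copying every posting list onto each of its term's synonyms multiplies storage by the size of the synonym set, and that is with only one hop of association. The synonym table here is a hypothetical stand-in for the far larger web of real semantic associations.

    index = {"car": {1, 5}, "headache": {2}}  # term -> document IDs

    synonyms = {"car": ["automobile", "vehicle"],
                "headache": ["migraine", "cephalalgia"]}

    def expand(index, synonyms):
        """Duplicate every posting list under each synonym of its term."""
        expanded = {term: set(docs) for term, docs in index.items()}
        for term, docs in index.items():
            for alias in synonyms.get(term, []):
                expanded.setdefault(alias, set()).update(docs)
        return expanded

    big = expand(index, synonyms)
    print(len(index), "->", len(big))  # 2 -> 6 terms from one hop alone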
There is simply no way to stick all the semantic associations in Google's index. For this reason, Google would have to start processing content from scratch using a new platform. I have never seen anything like it from them. They have the cash, but do they have the skills and the vision?
Either they are fooling everyone and have a hidden building somewhere where they are building it in secret, or they are not doing it, and there is nothing visible in public – reports, pronouncements, and so on – to indicate that they grasp the magnitude of this.
What do you see as major trends in information processing in the next 9-12 months, whether within search or outside?
I think major trends are twofold, with one ahead of the other.
The first is visuals. The Web is changing from being a static page with text and a few pictures to a totally visual experience, actually intersecting with television, where your computer is basically your television.
The second is that the Internet will become more of a conversational environment. We now have chats, email, social networks. You are going to come to a point where advertisements become more visual and more conversational at the same time. You will see an audiovisual ad for a car, and when you click on it, you will talk to a salesperson, or an avatar for that person.
I think that these will come together with search being embedded in the conversations and visuals.
Meaning searching within the conversation?
Actually, I mean making a search engine part of your conversation.
Let me give you an example: You are talking to a friend on Skype, discussing something, and you will invite hakia to join your conversation to ask questions. Being semantic, hakia will understand the conversation, the humor, be able to actually partake in that conversation. We see these as what we call "virgin areas" of search. No one has touched these functions yet.
What are you doing with Yahoo's BOSS service, and given that Yahoo's search results are not necessarily particularly useful, what are the advantages to you of working with their system?
Yahoo helps us in certain ways, acting as more of an accelerator when we go and crawl the Web for certain queries. When you input a query to hakia, we don't keep any records of who you are. We do take note of the fact that a query was entered. We use this as a learning input. Think of it as an indication that this query may now be more important.
We then want to QDex for that query. There are two ways we could go about doing it: we could crawl the entire Web exhaustively, or we could use an accelerator to get us that content, which is exactly what Yahoo is doing for us. For us, Yahoo BOSS is a quick way to get at raw data, rather than at the index itself.
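The flow might look something like the sketch below, with an external search API standing in as the accelerator. The boss_search function is a hypothetical placeholder; the real Yahoo BOSS endpoints and hakia's processing pipeline are not shown.

    import requests  # third-party HTTP library, used here for the page fetches

    def boss_search(query):
        """Hypothetical stand-in for a Yahoo BOSS call that would return
        candidate URLs for the query; the real API details are omitted."""
        return []  # stubbed out in this sketch

    def qdex_on_demand(query, process_page):
        """Note the query as a learning signal, then use the accelerator
        to fetch raw pages instead of exhaustively crawling the Web."""
        for url in boss_search(query):                 # accelerator: candidate URLs
            html = requests.get(url, timeout=10).text  # raw content, not an index
            process_page(url, html)                    # hand off to semantic analysis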
You offer a health vertical. There is already a way to index using MeSH. What are you using to go "beyond MeSH?"
Health is a very important topic, and it is a good showcase for hakia.
We have a complete medical ontology that we built ourselves. We are completely independent of any other indexing. We spent two and a half years building it.
What we have is much more flexible and more complex than MeSH.
What do you mean? Isn't MeSH the standard in the health field?
Well, we look not only at synonyms but also at context and relationships. For example, MeSH might understand that "kneecap" and "patella" are the same. hakia understands that "joint connecting thigh to shin" means the same thing. Our technology will understand what an orthopedic problem means.
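The difference between a synonym list and an ontology can be caricatured in a few lines; the concept entries below are invented for illustration and are not hakia's medical ontology.

    # A synonym list only says that two strings are interchangeable.
    mesh_like = {"kneecap": "patella"}

    # An ontology also carries typed relationships, so a paraphrase can
    # resolve to a concept without any string overlap.
    ontology = {
        "patella": {"synonyms": ["kneecap"], "is_a": "bone", "part_of": "knee"},
        "knee": {"is_a": "joint", "connects": ["thigh", "shin"]},
    }

    def resolve(phrase):
        """Toy resolution: match a paraphrase against relations, so
        'joint connecting thigh to shin' finds the knee concept."""
        for concept, props in ontology.items():
            parts = props.get("connects", [])
            has_type = props.get("is_a", "") in phrase
            if parts and has_type and all(p in phrase for p in parts):
                return concept
        return None

    print(resolve("joint connecting thigh to shin"))  # knee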
Scaling is a major hurdle. The average $200 million company produces 400 million or more documents per year. How are you scaling just the processing of data? And what is your actual QDex throughput in, say, megabytes per minute?
Yes, throughput: We can analyze 70 average pages per second per server. Scaling: The beauty of QDexing is that QDexing grows with new knowledge and sequences, but not with new documents. If I have one page, two pages or 1,000 pages of the OJ Simpson trial, they are all talking about the same thing, and thus I need to store very little of it.
The more pages that come, the more the quality of the results increases, but the amount of stored QDex information grows only with new information.
At the beginning, we have a huge, steep curve, but then processing and storage are fairly low cost. The biggest cost is storage, as we have many, many QDex files, but these are tiny two to three KB files. Right now, we are going through news, and we are showing a seven to ten minute lag for fully QDexing news.
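Translating the stated rate into the megabytes-per-minute figure the question asked about requires an assumption about average page size; the 10 KB of text per page below is my assumption, not a hakia number.

    pages_per_second = 70      # per server, as stated above
    page_kb = 10               # assumed average text per page (my assumption)

    mb_per_minute = pages_per_second * 60 * page_kb / 1024
    print(f"~{mb_per_minute:.0f} MB of input analyzed per minute per server")  # ~41

    qdex_file_kb = 2.5         # "two to three KB" per QDex file, per the interview
    out_mb = pages_per_second * 60 * qdex_file_kb / 1024
    print(f"~{out_mb:.0f} MB of QDex output per minute per server")  # ~10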
You have a mobile search system. How is it doing in terms of usage?
That's a good question. Now that the iPhone and BlackBerry are out, there is some usage. For low-end phones, we partnered with a company called Berggi that makes low-end phones email-enabled, but that usage is largely insignificant.
I don't believe these low-end phones are made for Web and email usage. Overall, mobile usage is still minor compared to hakia.com, although we have not done any marketing. I can say that we are exploring this area.
What is the business model? Where do you make money and what is the exit?
The model is advertising. Right now we use a third-party system, but I can tell you that we are working on our own: we have built a semantic advertising system, which will be ready by the end of the year, and we expect it to have a bigger impact than the search itself. You will be able to push ads onto hakia.com as well as elsewhere. You will be able to match your ad to the query using the same semantic principles.
Are you going to sell your company like Powerset and Fast Search did?
No. We are an IPO track company, although we don't have a target date as of yet.
ArnoldIT Comment by Stephen E. Arnold
hakia offers a next-generation search system. The company has been expanding its services. More importantly, with Powerset disappearing into the maw of Microsoft, hakia is one of the semantic search systems squarely in the public eye. You can explore the beta version of the system yourself at www.hakia.com. Unlike most companies in the information retrieval business, hakia is one of a small number with fresh systems and methods. The company is also able to identify specific technical and procedural weaknesses in well-known, high-profile firms. The company and its technology merit close attention.
August 12, 2008