Bitext
An Interview with Antonio S. Valderrábanos
Madrid – the Prado, Real Madrid at the Bernabéu, restaurants that open at 11 pm – is one of the most interesting cities in Spain. I'm sitting with Sr. Valderrábanos, the urbane founder of Bitext. His company has caught the attention of government agencies and commercial organizations in Europe. In the United States, dtSearch uses the Bitext technology to add natural language processing to its widely-used search and retrieval system. The two technologies – Bitext and dtSearch – are sold separately. However, they integrate seamlessly.
There are less pleasant things to do than sit in Casa Luis (cocina española), a short walk from the Bitext offices, and talk about text processing.
Sr. Valderrábanos enjoys technology and tapas. Over drinks in Casa Luis, where the staff makes every effort to keep our drinks refreshed, I probed into Bitext's growing visibility.
Madrid, like Paris, seems to have come alive in search and information retrieval. Zed is here. The university is a hotbed of innovation. I thought football was the core competency of Madrileños?
Ah, football. Maybe first. But technology is a very close second. At Bitext, technology and customer support come first most of the time. Also, Madrid attracts many people from outside of Spain. It's a dynamic and attractive city, right?
Yes, it reminds me of parts of São Paulo, Brazil.
Yes, there are some similarities. We too are enjoying a booming economy. There are many small companies being created around software.
What fueled your interest in search? Search is a pretty tough business in which to generate large revenues quickly.
With the amount of digital information growing, I thought there were new, interesting ways to help people deal with this problem.
Search is about text handling, on both the query side and the document side. Bitext is formed by a group of people who specialize in text processing and linguistics, so search is a very natural market for our solutions. Our goal is to complement search engines, giving them the ability to handle text according to its content rather than its form, which is how most applications, including search engines, treat it. We are interested in all forms of search, including search in databases and Geographical Information Systems (NaturalGIS).
You came to my attention because of your deal with dtSearch. What functions are you adding to that search and retrieval system?
Our NaturalFinder product adds a natural language interface to dtSearch, so users can enter a word or a phrase and get results without worrying about Boolean operators. We also provide advanced spelling functionality beyond the standard word level: our technology can correct words in context.
Our technology can be used to integrate a thesaurus with dtSearch. This gives dtSearch a way to handle semantic relations like similar meaning (synonymy) and the ability to handle words that sound alike but have different spellings. We also handle other semantic relationships like the one between city and Madrid, where Madrid is a type of city. This is called hyponymy.
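To make the thesaurus idea concrete, here is a minimal sketch of query expansion with synonyms and hyponyms before a phrase is handed to a keyword engine. The thesaurus entries and function names are invented for illustration; they are not Bitext's actual data or API.

```python
# Toy thesaurus: synonyms and hyponyms for a handful of terms.
THESAURUS = {
    "city": {"synonyms": ["town", "municipality"],
             "hyponyms": ["Madrid", "Barcelona", "Seville"]},
}

def expand_term(term):
    """Return the term plus any synonyms and hyponyms the thesaurus knows."""
    entry = THESAURUS.get(term.lower(), {})
    return [term] + entry.get("synonyms", []) + entry.get("hyponyms", [])

def build_query(user_phrase):
    """Turn a plain phrase into a Boolean query a keyword engine can run."""
    groups = []
    for word in user_phrase.split():
        variants = expand_term(word)
        groups.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(groups)

print(build_query("city museums"))
# (city OR town OR municipality OR Madrid OR Barcelona OR Seville) AND (museums)
```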
Did dtSearch want to add natural language processing to their search system?
I think dtSearch wanted to respond to new customers' interests by being open to integrating features from third parties that would make search much easier for users.
Some users find putting queries in a search box a frustrating battle of wits. If the user hits on the right phrase, the system rewards the user with the information needed. dtSearch wanted to reduce key word frustration. Also, our technology improves precision and relevance of queries.
From a broad perspective, how did that deal come about?
We were looking for a search engine manufacturer, so we could integrate our technology with their engine and test the advantages that language technology can provide.
The people from dtSearch were extremely helpful and we ended up partnering with them. As far as we know, they have one of the most effective search engines. Their customer base is very large with many developers incorporating the dtSearch system into third-party applications. So far, we haven't found limits to issues like query length and complexity or index size (as we have found in other popular engines). dtSearch scales nicely.
There are a number of companies offering "understanding" functions for search. These range from the aging Inxight to start-ups like Radar Networks' Twine. What sets Bitext apart?
Bitext has set up a unique offering for the search engine industry: the functionality described above for NaturalFinder can be added to any search engine. Our technology makes other search and retrieval systems deliver more relevant results and a better user experience.
We have Bitext integrated with the Google Search Appliance, dtSearch, Oracle Secure Enterprise Search, Memex, Lucene, Search Point, and others.
Our technology works with these systems no matter what operating system or platform the licensee prefers to use. We support Microsoft Windows, Linux, and UNIX. Bitext can run on the desktop, on an Intranet, or in a Web search system.
Our API [application programming interface] allows a partner or a direct licensee to do integration in a non-intrusive way. We support different languages, essential in today's world.
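As a sketch of what this kind of non-intrusive integration can look like, the code below puts a small natural language front end in front of an unmodified keyword engine. The class and method names are hypothetical stand-ins, not Bitext's actual API, and the rewriting step is deliberately trivial.

```python
class ExistingEngine:
    """Stand-in for any keyword engine (dtSearch, Lucene, the Google Search Appliance, ...)."""
    def search(self, boolean_query):
        print("engine receives:", boolean_query)
        return []

class NlpFrontEnd:
    """Rewrites natural language input into the engine's query syntax, leaving the engine untouched."""
    def __init__(self, engine):
        self.engine = engine

    def ask(self, natural_language):
        return self.engine.search(self._rewrite(natural_language))

    def _rewrite(self, text):
        # Placeholder for real linguistic analysis; here we only drop stop words.
        stop = {"how", "do", "i", "my", "the", "a", "an", "to"}
        kept = [w for w in text.lower().split() if w not in stop]
        return " AND ".join(kept)

front_end = NlpFrontEnd(ExistingEngine())
front_end.ask("How do I recycle my laptop")  # engine receives: recycle AND laptop
```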
You have been active in academic conferences. Will you begin exposing Bitext to the more commercial search conferences in 2008? If so, which ones?
This year we will participate in the 30th European Conference on Information Retrieval in Glasgow. We will participate in some of the Search Engine Strategies meetings, the Search Engine Summit, and several others as well.
In general, we are going to become more visible in 2008. The conferences are a good way to meet people and learn more about opportunities for our NLP technologies.
On your Web site, I found a demonstration of your system "hooked into" Microsoft Live.com. What functionality do you add that a Microsoft customer can't get from Fast Search & Transfer, which Microsoft will own sometime in 2008?
As far as I know, the main text-understanding feature that Fast Search – in its present form – provides is automatic categorization or clustering. We can provide many other functionalities to a Fast ESP [enterprise search platform], Microsoft SharePoint, or any other search-and-retrieval system. NaturalFinder, as I mentioned, acts like a turbo-charger for these systems, extending their usefulness and, we think, adding functions users want.
For large customers, we can also provide highly specialized consulting services on topics like concept extraction, named entity recognition, multilingual search, text categorization, controlled language technology, and natural language access to databases. We are doing this now for the Ministry of Defense in Spain. We see increasing interest in our technology as a way to reduce the dependence of many enterprise systems on a Boolean query. Users want a more natural way to find the information they need to do their work.
In your opinion will Microsoft compete with partners who have developed SharePoint search solutions?
It may certainly happen, although we are not experts in these topics. We think that the Microsoft tie-up will create many new opportunities, for the simple reason that Microsoft marketing is very good at stimulating interest and demand, and more visibility for search benefits everyone. We are not a replacement for Microsoft technology; we are a company that adds value to Microsoft or any other search system. I think your American MBAs call this "agility". Bitext has technical agility.
You landed a deal with Spain's Ministry of Public Administration. Will you provide some detail about what you are doing to add usability and functionality to its search system?
Yes, the ministry requested proposals last year for a solution that will give citizens a way to ask questions online. The questions will be something along the lines of "How do I recycle my laptop?"
SITESA-Grupo EP was awarded the contract. That company is developing a system for the ministry that uses the Google Search Appliance and our NaturalFinder. The goal is to build a federated search engine for public administration.
In this new system, NaturalFinder adds new functionality to the Google Search Appliance. We think that many citizens will find this easy-to-use single point of access to public administration information a significant improvement over the usual key word system.
Citizens will be able to ask their questions online in natural language, like "How do I pay my taxes?" We understand that the service will be called Red 060. I think you have this type of service available as part of the USA.gov system. Please, take a look at Red 060.
We are also developing an e-commerce portal for the Spanish company Captalis. According to the client, it's the most ambitious project for the Spanish market. Like the Red 060 system, users will be able to locate information without having to guess combinations of key words. In large organizations, we're seeing a great deal of interest in moving away from this key word guessing game of traditional search and retrieval.
Does your system handle structured information as well as unstructured information?
Since we can hook our solutions to any application that can be queried, we can handle any kind of information that the application we integrate with can handle. For example, we offer NaturalFinder for unstructured information. We have NaturalAssistant for semi-structured information (XML, FAQs, and the like) and for online self-service. Our NaturalGIS supports structured geographic information. Not long ago we struck a deal with ESRI Spain and are doing joint development. We also have what we call NaturalSQL, a development resulting from two consulting projects, to provide natural language access to databases. I know there are some companies in the US working in this area as well, but we have a very interesting solution which may be of interest to publishers and eCommerce companies in North America.
Of course, each of these solutions translates from natural language to the formal language of the target application.
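As a toy illustration of that translation step (natural language in, a formal query out), the snippet below maps one narrow family of questions onto SQL over an invented schema. A production system such as NaturalSQL would rely on full linguistic analysis rather than a regular expression; the pattern, table, and column names here are assumptions made only for this example.

```python
import re

def to_sql(question):
    """Translate one narrow question pattern into SQL over a hypothetical schema."""
    match = re.match(r"how many (\w+) are in (\w+)\??$", question.strip().lower())
    if match:
        entity, place = match.groups()
        # Table and column names are invented for the sketch.
        return "SELECT COUNT(*) FROM {0} WHERE city = '{1}'".format(entity, place.capitalize())
    raise ValueError("question not understood: " + question)

print(to_sql("How many museums are in Madrid?"))
# SELECT COUNT(*) FROM museums WHERE city = 'Madrid'
```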
What content processing approach does your system use?
That's a good question. We make it very clear to anyone who contacts us that we don't handle the indexing itself with our system. Our approach is to say, "Okay, you have a perfectly good key word indexing system. We add value to that system in ways that make users happier and without getting rid of the system in which you have invested significant time and money."
We integrate, complement, turbo-charge.
Think of our system as adding features and functions to whatever system our customer uses. Instead of having to rely on key words, our technology allows the user to interact in a more natural way with the content.
The Italian vendor Expert System includes a knowledge base. This helps make its system "smart". What is your approach to figuring out what text means? Statistical, linguistic, hybrid?
Our approach is 99 percent linguistic, although we may complement it with statistical techniques in certain contexts. Scientifically, our approach is based on the findings and techniques of computational linguistics. Our approach has some specific components that allow our technology to deliver the NLP functions, the concepts, and the natural interaction users have with the system.
So, and I don't want to sound like a marketing video on YouTube, we have something we call DataLexica. This is a lexical database that includes roots and linguistic information like part of speech, gender, number, and verb tense.
We also have created DataNet. This is a lexical-semantic database that includes synonyms, hyponyms, and other useful "meaning" information. Looking up a synonym, in our experience, delivers some performance and control benefits. Letting the system "discover" meaning creates some performance challenges, which we have avoided.
We also use rules. These are in our DataGrammar. This is a rule database where information about linguistic structures is stored. Rules allow precise control, and these rules can be strict or relaxed.
We also have a spelling component – DataSpell – that automatically corrects a word the user has mistyped. I think the US word is autocorrect.
Our system, then, is constructed from these building blocks – bloques de edificio. We can mix and match these components as needed to meet the customer's requirements.
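A minimal sketch of that building-block idea follows: separate spelling, lexical, and semantic components composed into a pipeline. The component names come from the interview, but the data, methods, and behavior are invented stand-ins, and DataGrammar's rules are omitted to keep the example short.

```python
class DataSpell:
    """Stand-in spelling component: fix common mistypings before analysis."""
    corrections = {"restarant": "restaurant"}
    def apply(self, tokens):
        return [self.corrections.get(t, t) for t in tokens]

class DataLexica:
    """Stand-in lexical component: attach roots and part-of-speech information."""
    lexicon = {"restaurants": {"root": "restaurant", "pos": "noun", "number": "plural"}}
    def apply(self, tokens):
        return [dict(form=t, **self.lexicon.get(t, {"root": t})) for t in tokens]

class DataNet:
    """Stand-in lexical-semantic component: attach synonyms keyed by the root."""
    synonyms = {"restaurant": ["eatery", "bistro"]}
    def apply(self, analyses):
        for analysis in analyses:
            analysis["synonyms"] = self.synonyms.get(analysis["root"], [])
        return analyses

def analyze(phrase, components):
    """Run a phrase through the components, mixed and matched as needed."""
    data = phrase.lower().split()
    for component in components:
        data = component.apply(data)
    return data

print(analyze("restaurants", [DataSpell(), DataLexica(), DataNet()]))
# [{'form': 'restaurants', 'root': 'restaurant', 'pos': 'noun',
#   'number': 'plural', 'synonyms': ['eatery', 'bistro']}]
```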
Mobile search, so far, has been a matter of simplifying key word entry. Yahoo, for example, is using more "pick from this list" type of interfaces. What approach to mobile search are you taking with Bitext?
We are watching for a development company that has a robust on-the-fly translation system for human speech. When we find that company, we think search problems in mobile environments will become very similar to those of the Web. Mobile users don't want to type very much text. Tiny keyboards are okay for short messages, but not for long or complex messages.
In the US, there's a battle between "invisible functions" like Google's popularity algorithms and more explicit functions like Endeca's "guided navigation" where point-and-click asks the user to take an action. Which approach is more likely to be more useful for mobile search? For public facing consumer sites? For Intranet search?
As your question suggests, there will be different battlefields in this mobile search war, with different requirements for each.
I think mobile search has to make easy what I call the "lazy-user" approach. The invisible functions will be important in mobile and also for public facing consumer sites. In these contexts, users expect the system to be "smart"; that is, to do more of the work for the user. We tend to expect a greater effort from the other side.
It's a bit different for an Intranet search installation. The users want some functions to be smart like remembering preferences. But, in general, the user has to have access to a more "proactive-user" approach. Business information needs can change suddenly, and a system has to help the user in this circumstance.
It comes down to the position the user takes in front of the system. As customers we tend to be "lazy" and let our "smart software helper" do the hard work. But as information workers, in Intranet environments behind a firewall, we need different tools.
In any case, I like Endeca's approach. It's a flexible and smart way of establishing a dialog between the search application and the user during the retrieval process.
Without giving away any trade secrets, what are some of the new features and functions that you will include in Bitext in the next release?
I don't want to reveal too many of our innovations just yet. I can say that we will add entity recognition and extraction very soon. We have customers who want to be able to identify people, places, things, and dates. Watch for an announcement very soon about this function. At the same time, we will be adding some tracking and alerting functions as well.
Also, we want to bring more flexibility to automatic text classification. Our approach will give the users more control over opening and closing categories. Many systems are too rigid for the fast-changing nature of information that our customers have to manage. We think a hybrid approach will be very useful.
Our customers speak many languages, and we continue to focus on our support for different languages. For example, we are one of the few text processing companies to offer support for Basque, a language that poses very particular challenges. Very soon we will offer support for Catalan and Galician. I think you know that we support Spanish, English, and are working on French, German, Italian, and Portuguese.
In the US, search vendors find themselves caught in the recession. Will the US economy's troubles have an impact on your business?
Hopefully not! Our mobiles are buzzing now. Email inquiries are flowing into us every day. But you never know about global finance.
We are betting that our solutions provide ease of access, with the associated cost savings and user satisfaction. In the context of a recession, these features should attract even more customers. Besides, we offer both Spanish and English, the two main languages of the market.
Let's wrap up. I want some tapas. What do you see as the major trends in search and retrieval?
That could take a long time to answer. I see you are paying particular attention to Casa Luis's camarones, yes? Let me be quick.
I think the future will want one single interface to different information sources, whether documents or databases or some combination of data from many different systems.
Of course, the interface will be natural language, the simplest and most effective way of communicating for humans. We will certainly not want to bother with different applications and formal languages – so no key word queries, Boolean statements, SQL strings, or forms. People want to get the information they need without hurdles. The user doesn't want to have to decide where to search. No one in a hurry wants to figure out whether to use the GIS system, the corporate databases, or a Web blog. No one – including me – wants to stop and answer this question, "Okay, how do I better exploit this particular source?"
So we will need richer indexes. The same way we tag documents (or text chunks) with metadata now, we will tag each word in indexes. Currently, search engines just store words in a list, what we have been calling an index. Most systems add some "physical" information; for example, this word appears in these documents and in these particular positions.
In order to make retrieval effective, we need to tag each word and expression in an index with its semantic or linguistic information. It would be extremely useful to be able to tag the word "breaking" in "breaking news" as a synonym of "latest" rather than as a synonym of "fragile", as we would do in a context-independent environment, like a thesaurus.
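Here is one way to picture such an enriched index entry: each posting carries not only document and position data but also linguistic tags resolved in context. The fields and values below are illustrative assumptions, not a specification of Bitext's index format.

```python
# A hypothetical "richer index" posting list for the word "breaking".
enriched_index = {
    "breaking": [
        {
            "doc_id": 17,
            "positions": [4],
            "lemma": "break",
            "pos": "adjective",   # as used in "breaking news"
            "sense": "latest",    # synonym resolved from the context
        },
        {
            "doc_id": 42,
            "positions": [12],
            "lemma": "break",
            "pos": "verb",
            "sense": "fracture",  # as used in "breaking a glass"
        },
    ],
}

# A query about "latest news" could then match document 17 but not 42,
# because the context-resolved sense tag disambiguates the occurrences.
matches = [p["doc_id"] for p in enriched_index["breaking"] if p["sense"] == "latest"]
print(matches)  # [17]
```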
These are hard problems. As I said, we want to find a partner, expert in indexing matters, and develop a joint solution.
ArnoldIT Comment
Bitext offers a "beyond search" solution that in many ways is quite different from the approach taken by the more than 150 vendors competing in behind-the-firewall search. Bitext positions its technology as adding functionality to an existing system. Many vendors advocate a "rip and replace" approach. Not Bitext. Bitext can add natural language functionality or support for structured and unstructured information to your existing search solution. Bitext's architecture allows the firm's technology to integrate with almost any enterprise application. This agility makes it possible to add a more intuitive Web-based self-help function to an existing customer support operation. More information can be found at the Bitext web site. ArnoldIT says, "Take a hard look at Bitext." An added benefit is a chance to have a meeting in Madrid at Casa Luis.
Stephen E. Arnold, April 7, 2008