An Interview with David Hawking
On my last trip to Australia, I made an effort to meet with the companies developing search and content processing technology. The Australian government, which hosted my visit, helped me set up meetings. After a bit of emailing to and fro, I was able to meet with Dr. David Hawking, an expert in information retrieval, in Canberra, not far from Funnelback's offices.
This year, Funnelback shifted into high gear, and Dr. Hawking, long associated with Funnelback and its predecessor, Panoptic, joined the company full time.
I met Dr. Hawking at Artespresso Restaurant in Canberra, not far from the Parliament House. The chef moved from Sydney and has captured the palates of computer wizards like Dr. Hawking and wandering Kentuckians alike. After lunch, we sat on a balcony overlooking a garden and talked about Funnelback, a search system based on technology developed, in part, at the prestigious Australian National University or ANU. The full text of the interview appears below:
Thanks for taking time to meet with me.
Not at all. I try to combine a discussion of Funnelback and a free lunch whenever I can.
What's your background?
I grew up in Beechworth, Victoria (population then 3,700 plus 1,000 patients in a psychiatric hospital and 200 prisoners in a jail!). In 1974, I was in the first Computer Science Honours class at the Australian National University (ANU). I worked at ANU for many years in many roles: Tutor in Computer Science, manager of student computing, developer of text processing software, system and network administrator, building designer, and researcher.
In the eighties I was a campus evangelist for networks, laser printing, microcomputers, and UNIX workstations. In 1991, I believe I became the only person in the world ever to receive hardware training on both the Connection Machine CM2 and Fujitsu AP1000 parallel supercomputers. In the same year I wrote a text retrieval system (PADRE - the PArallel Document Retrieval Engine) for the AP1000 which evolved over the years and resulted in a number of scientific publications.
Yes, I have seen your name in ACM and IEEE bibliographies for years. I must admit I haven't read many of your papers. You were involved with the US government text retrieval conferences as well, right?
Yes, in early 1994, I participated in TREC for the first time and joined a team in the ACSys Cooperative Research Centre which in 1993 had built a prototype of PASTIME, a multimedia Web site for the Australian federal parliament. Amazingly for the era, PASTIME converted Hansard transcripts in WordPerfect to fully searchable HTML in which pattern-driven hyperlinks were inserted at viewing time. When you clicked on a search result you could view and listen to the corresponding segment of video, provided you had the relevant MOSAIC browser plug-in and a fast network connection.
What's the family tree of CSIRO - Panoptic - Funnelback?
ACSys was a partnership between ANU and Australia's premier science research agency (CSIRO). The government funding agency insisted that it achieve research commercialization objectives. In 1998, my ACSys colleague, the late Dr Paul Thistlewaite, put forward a plan to commercialize our research in the form of a suite of intranet tools, with PADRE as a key component. Unfortunately, he suddenly died of cancer in February 1999, and I was left to oversee the implementation of his plan. So far only the search part of it, but give me time. Peter Bailey (now at Live.Com) put together crawler and interface components, Vijay Boyapati (WhizBang, then Google) worked on a text filter framework, and we launched the first production service in July 1999.
Soon after, Francis Crimmins (now R&D manager at Funnelback) wrote a new crawler from scratch and packaged up the system (now called P@NOPTIC) with a cut-down RedHat Linux and Apache so it could be easily installed on a suitable PC. Our thinking was that by positioning our system as a "search appliance" we could control the entire software environment and thus avoid a whole lot of messy integration and support issues.
We sold one appliance (hardware included), but most potential customers were resistant to the idea and wanted to install our software on their servers.
In 2000, ACSys came to the end of its funding, and development of P@NOPTIC was continued by CSIRO.
Nick Craswell (now at Microsoft Research) developed an amazingly powerful capability to customize P@NOPTIC, enabling it to be used in some highly creative applications. Nick also ported P@NOPTIC to Solaris, developed an image search prototype, and together with Trystan Upstill (now at Google) contributed significantly to Web ranking algorithms. Brett Matson (now Funnelback CEO) ported the system to other varieties of Linux and to Windows and developed a generic database adapter.
Although revenue grew rapidly in percentage terms, the team was downsized (to three people) and we were told to find new applications for our research. We were reprieved at the last minute by a rush of customer orders and by winning three important awards.
By December 2005, we had built a thriving business whose potential was being stifled by the constraints of a research organisation. A spin off was created and the name Funnelback was adopted for the company and the product.
Around 1997, the link between PADRE and parallel supercomputers was broken and engineering was undertaken to improve the amount of text which could be indexed and searched on a single workstation. In 1999, at TREC-9, I demonstrated query processing over an index of 18.5 million Web pages using a laptop with only a Pentium-II and 128MB of RAM.
What's the purpose of Funnelback as a company? Is it a spin out that must survive without subsidies? Is it a sponsored entity?
Funnelback's purpose is to deliver high-quality, high-value solutions to customers, to grow, and to deliver return on investment to shareholders.
It is not sponsored or subsidized and must survive on its own.
Funnelback is now at Version 8. How has the system changed over the last two or three years?
In the last couple of years we've added a whole raft of capabilities: adaptors for a number of additional enterprise applications; faceted navigation and e-commerce features; search result mining (Fluster); a simple multi-platform graphical installer; geographical search; a better administration interface and more sophisticated reports; improved scalability; improved and more adaptive ranking; Web 2.0 features; improved tunability; and multilingual search, including support for CJKT languages (Chinese, Japanese, Korean, and Thai).
There's much more in our innovation pipeline.
Many search systems run into problems when the volume of content rises and the frequency of updating indexes changes from monthly or weekly to hourly or near real time. How does Funnelback scale?
Funnelback scalability is illustrated by a couple of recent internal projects.
For example, we showed that Funnelback could index and search 320 million (short) database records on a single machine.
We also demonstrated that Funnelback could index and search the UK2006 Web collection (80 million pages, two terabytes of text, four billion links) on a single $2,000 server. Accurate answers were returned via the web interface in less than 0.5 sec.
Naturally, there is a trade-off between speed and functionality. Throughput from given hardware can be increased by limiting the use of advanced features.
For very large data sets, Funnelback can share crawling and indexing across a cluster of machines. For very high query volumes, machines or clusters can be replicated to multiply throughput.
What differentiates Funnelback from other enterprise search systems?
That's a good question. Funnelback differentiates in the following areas:
- Search ranking quality and level of tunability
- Geospatial query processing (that is, search for records in close proximity to the user's location, or rank by proximity)
- Folksonomy tagging of search results
- Simple to configure and effective search result clustering and faceted navigation
- Customizable workflows
- Ability to offer highly customised SaaS search solutions
- Support for Windows, Solaris and Linux
And we make an effort to offer responsive and quality customer support.
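The geospatial item in the list above (restrict results to records near the user, or rank by proximity) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the record layout (`name`, `lat`, `lon`), the function names, and the use of the haversine great-circle formula are hypothetical choices, not a description of how Funnelback implements the feature.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rank_by_proximity(records, user_lat, user_lon, max_km=None):
    """Return record names ordered nearest-first, optionally within max_km.

    `records` is a hypothetical list of dicts with 'name', 'lat', 'lon' keys.
    """
    scored = [
        (haversine_km(user_lat, user_lon, r["lat"], r["lon"]), r) for r in records
    ]
    if max_km is not None:
        # Proximity filter: drop records farther away than the radius.
        scored = [(d, r) for d, r in scored if d <= max_km]
    scored.sort(key=lambda pair: pair[0])
    return [r["name"] for _, r in scored]
```

Passing `max_km` gives the "records in close proximity" behaviour; omitting it gives pure ranking by distance. A production system would normally use a spatial index rather than computing the distance to every record.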
Most vendors assure me that their systems include smart software, linguistic processes, semantics, and statistical processes. What is the fabric of Funnelback?
The core of a search tool is its ranking function. The output of that function must accurately predict the satisfaction of the people submitting queries. It must be configurable to work in specialised applications and able to be tuned to particular data sets or requirements.
Our ranking function combines many sources of evidence, such as the content and structure of the document, internal and external metadata, Web site structure, link structure, user interaction data, publication dates, and external textual annotations such as anchortext and tags.
In order to achieve accurate and reliable rankings, we try to extract as much useful evidence as possible and weight it according to its value and reliability. All of the components of the system must be well engineered and standards compliant while tolerant of the standards violations prevalent on most Web sites. Data gathering components must be able to recognize duplicate and near-duplicate content and sites. Text filters must reliably extract text content and metadata for indexing. Link analysis tools must be able to recognize subtle affiliations as well as obvious ones. The indexer must know what to index and what not, and so on.
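As a rough illustration of combining several sources of evidence and weighting each by its value and reliability, here is a minimal Python sketch. The evidence names, the weights, and the choice of a simple linear weighted sum are all assumptions made for illustration; the interview does not describe Funnelback's actual ranking formula.

```python
# Hypothetical evidence weights (illustrative values only). A higher weight
# means the evidence source is treated as more valuable or more reliable.
WEIGHTS = {
    "content": 0.5,     # match against document text and structure
    "anchortext": 0.3,  # match against text of links pointing at the document
    "link": 0.2,        # link-structure evidence
}


def combined_score(evidence, weights=WEIGHTS):
    """Weighted sum of per-source evidence scores.

    `evidence` maps a source name to a score in [0, 1]; a source that is
    missing for a document contributes nothing.
    """
    return sum(w * evidence.get(name, 0.0) for name, w in weights.items())


def rank(docs, weights=WEIGHTS):
    """Return document ids ordered by descending combined score."""
    return sorted(docs, key=lambda d: combined_score(docs[d], weights), reverse=True)
```

Tuning for a particular data set, in this simplified picture, amounts to adjusting the entries in `WEIGHTS`; for example, raising the `anchortext` weight on a collection where link text is known to be trustworthy.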
The best ranking system in the world (quite possibly ours) won't deliver the results people want if they submit inappropriate queries.
So a bad query makes it difficult for a search system?
Yes. But we provide a kit of tools to help people refine and improve their queries, or to find good results despite query deficiencies.
For example, we make it possible to present to the user partially matching results. We offer a spelling correction feature, essentially asking "did you mean?"
Funnelback supports thesaurus expansion and annotation matching. We also offer two different levels of stemming.
Our system also supports dynamic faceting of search results, what some people are now describing as "assisted navigation". The term is not that important. Users can scan a series of suggestions and links, picking those that are helpful. We also offer search result mining (known as "Fluster").
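The "did you mean?" idea mentioned above can be sketched in a few lines of Python. This is a minimal sketch that assumes the suggestion comes from comparing each query term with the indexed vocabulary by string similarity; the function name, the similarity cutoff, and the use of the standard library's `difflib` are illustrative choices, not Funnelback's method.

```python
import difflib


def did_you_mean(query, vocabulary, cutoff=0.75):
    """Return a corrected query string, or None if no correction is suggested.

    `vocabulary` is a hypothetical collection of terms known to the index.
    Terms already in the vocabulary are kept; unknown terms are replaced by
    their closest vocabulary entry when one is similar enough.
    """
    corrected = []
    changed = False
    for term in query.lower().split():
        if term in vocabulary:
            corrected.append(term)
            continue
        matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])
            changed = True
        else:
            # No plausible correction: leave the term as typed.
            corrected.append(term)
    return " ".join(corrected) if changed else None
```

A real spelling corrector would usually also weigh how frequent each candidate term is in the collection and in query logs, so that common words are preferred over rare ones.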
What are your plans for expanding the company's sales presence in the US and Western Europe?
We're not in a position to announce anything just yet. I will let you know once our plans have solidified. Obviously these are large markets, and we need to look at each closely. Stay tuned, please.
Many enterprise search systems describe their technology as a platform. The idea is to get a customer to build information applications on top of a search framework. Is this approach part of the Funnelback vision?
Definitely. As one example, Funnelback has been used by the Australian Defence Force Academy to build a plagiarism detection system. We have further plans in this area, but can't make specific announcements at this stage. I'm sure you understand.
Yes, I am in Australia courtesy of the Australian National Police. Search vendors are shifting from broad brush solutions to niche or vertical versions of their technology. Is Funnelback offering niche or vertical builds of its system for customer support, business intelligence, financial services, or other niche markets?
We obviously want to continue with our successful generic search product but newly developed capabilities mean we can support a number of new verticals. We're looking at taking on sales and marketing staff to pursue some of these opportunities and could end up developing vertical-specific products. We're also looking at OEM deals.
What's the procedure for customizing a Funnelback installation?
Customization can be carried out either by Funnelback's team of experts or by the customer. Funnelback includes an intuitive Web-based administration interface for configuration, user interface customization, and viewing query reports. No programming skills are required for the majority of configuration tasks, but deeper integrations can be achieved by developing specific interfaces to work with various enterprise applications such as content management systems or portal applications.
I recall reading that Funnelback includes a large number of connectors to Lotus Notes, Documentum and other enterprise software. Did your team write these connectors, or are you using third party solutions? How do I create a new connector or modify an existing connector?
Funnelback includes a connector for the Tower Software (recently acquired by Hewlett Packard) TRIM EDRMS. Like some other vendors, we have integrated third-party connectors from Persistent Systems. Funnelback's workflow engine allows custom-developed "gathering" components to collect content from repositories for indexing. We can also index repositories such as those for Documentum and Lotus Notes. You can also interface Funnelback with repositories to enable appropriate filtering of sensitive content and indexing of rapidly changing content.
What will be coming in the next release of Funnelback?
The next major release (9.0) is scheduled for the first half of 2009. We hope to make it an exciting release to celebrate the product's tenth anniversary. Feature proposal and prioritization discussions for Funnelback 9.0 are already under way but we usually get great suggestions from our customers at our annual User Group Forums. The next one is scheduled for November 7, 2008.
As you look forward, what are the major trends that you see emerging in the next nine to 12 months?
Personally, I still want to focus on the fundamental goal of enterprise search. That is, to provide tools which help employees and customers to complete their tasks faster and better (i.e., search utility). There's still scope for improvement, particularly in the complex information environments typical of medium to large organisations around the world.
That means grappling with the hard problem of measuring how well tools actually work for users. Without this focus you can end up with demo gimmicks that don't actually add value. What task is the searcher performing? Do the search results contain all the information they need to complete that task? Is that information in optimum form for quick access and understanding? If the system is falling short, where is it falling short? What tool or feature could overcome the failing? Would it have a downside? Can we build it? How can we design it so that people understand how to use it?
Funnelback, though well established in Australia, has a lower profile in the United States. Like Autonomy, Funnelback relies on sophisticated algorithms. The present version of the search system includes a number of useful features. These allow the licensee to offer users suggestions, summaries, and advanced search features. Users of the system in Canada and the United Kingdom report positive experiences with the system and have high praise for the Funnelback technical support team. Funnelback, like other search systems developed in Australia and New Zealand, must contend with distance and time zones. ISYS Search Software and SLI Systems have been successful in expanding their market in North America and elsewhere. I expect Funnelback to become more aggressive in its marketing in 2008 and 2009. You can find more information about the system here.