An Interview with François Schiettecatte
ArnoldIT.com worked with François Schiettecatte on a project for a client in New York. Prior to working with my team, he had contributed to a component of the human genome analysis effort and was providing search and content services to a number of clients. Mr. Schiettecatte and I lost track of one another until recently. He is now working on a real-time search project, which is in stealth mode. He consented to an interview on May 29, 2009. The full text of the interview appears below.
What's your background?
I have a degree in Systems & Management from City University in London, UK. The degree involved very little computer science, limited to Basic and SPSS (used to manage herds of sheep, if you must know). I got my first programming job in 1989, working on the Genome Database project, first at the Imperial Cancer Research Fund in London and then at Johns Hopkins Hospital in Baltimore. After that I did consulting around information search and retrieval for 12 years, during which I developed and co-marketed ScienceServer (a Web-based electronic journal browser) with Elsevier Science, which I later sold to them. In spring 2003 I co-founded Feedster, a Web log search engine, where I developed the crawler and the search engine and did a lot of work around scalability. More recently (in the fall of 2008) I was co-founder and CTO at MyRoar, an NLP-based search engine focused on financial information, which I have since left.
What was the trigger in your career that made search and retrieval a focal point?
This occurred when I was working at Johns Hopkins Hospital in 1990. One of the databases our lab supported was the Online Mendelian Inheritance in Man, a collection of documents cataloging inherited genetic traits in humans, which I indexed with a full text retrieval system called WAIS that had just been released by Brewster Kahle of Thinking Machines. I originally downloaded WAIS because I was very interested in full text retrieval, even though my work was heavily focused on very structured data. I quickly got hooked: the concept of full text retrieval was very interesting to me, and the source code was available, which meant I (and others) could tinker with it. A small community quickly sprang up around the software, and we started to exchange patches that added functionality. I took it upon myself to put together the first release of FreeWAIS, which included the most popular patches.
Do you have a search related project underway? If so, will you describe it?
I returned to a project I was working on before I joined MyRoar. This is a research project which I started in mid-2007, after I left Feedster. I wanted to tackle the issue of information filtering and remixing, allowing users to select feeds, slice and remix them, and repackage them into new feeds centered around very specific subjects (or channels, as I prefer to call them). The impetus for this project was a need on my part to deal with the deluge of data that came in through my feed reader. There are a very large number of very good feeds available, but subscribing to all of them generated so many posts that I was spending more time choosing what to read than doing the actual reading, and I felt there had to be a better way to tackle the issue. The second impetus was that I wanted to try out some ideas around scalability.
The number of new companies entering the search and content processing "space" is increasing. What's your view on too many hungry mouths and too few chocolate chip cookies?
This view suggests a zero-sum environment, and I don't agree with that premise. You can apply different search approaches to different data sets, for example traditional search as well as NLP search to the same set of documents. And certain data sets will lend themselves more naturally to one type of search than to another. Of course, user needs are key in deciding which approaches work best for which data. I would also add that we have only begun to tackle search and that there is much more to be done, and new companies are usually the ones willing to bring new approaches to the market.
What are the functions that you want to deliver to your customers?
On the surface my project is a Web log search engine, but beyond that I want to offer users the ability to create, share, and subscribe to specific channels of information. My hunch is that 80% of users will be passive consumers of channels, but 20% will create and share channels with others, bringing their domain expertise to the channel creation process.
What are two or three of the key features you are / will be implementing?
The project I am currently working on is to satisfy a personal "itch," as I mentioned above, but underneath it I wanted to test a number of ideas for scalability in several different areas. I wanted to see if it was possible to build scalability into a system from the outset without incurring overhead. I also wanted to test some ideas on how to scale various components of the system, such as the data repository, the search engine, load, throughput, etc. Of course, the proof will come once I start running it on Web-sized data sets, which should be in a couple of months or so, but so far tests are promising.
There's a push to create mash-ups--that is, search results that deliver answers or reports. What's your view of this trend?
I think this is a very good trend. Right now the general model is to plug in a number of terms, get a search result page, scan a number of candidate documents, and repeat until a satisfying result is obtained, all of which is very costly in terms of time. It would be a real time saver to have the search engine synthesize data into a table or report for the more complicated searches. Google Squared and Wolfram Alpha are steps in that direction. At MyRoar we experimented with searches that would aggregate data into reports, so you could ask "What is the Medicaid shortfall for the States?" and it would list the shortfall in dollars for each state in a nice table.
Are you supporting other vendors' systems or are you a stand-alone solution?
The system I am working on is designed to be a Web-based solution for consumers of data. The main differentiator is the creation of channels, which I can keep to myself or share with others.
Semantic systems have been getting quite a bit of coverage, yet the Powerset technology and other semantic players like Hakia.com have been slow out of the gate. What's your view on semantics and natural language processing? Are these technologies ready for prime time?
MyRoar is in the same space as Powerset and Hakia, so I can speak to that, with the caveat that my background in NLP is relatively shallow. First, NLP is complex primarily because language itself is complex; there is much more to it than just extracting tokens. Second, parsing for NLP is very costly in terms of CPU time. At MyRoar we were able to make a dent in the cost of parsing, but it was still very costly. While this is practical for smaller collections, it is currently not very practical for Web-sized collections. However, in time the technology will get better and machines will get faster. In fact, we are already seeing NLP-like behavior on certain types of searches from the major search engines.
A number of vendors have shown me very fancy interfaces. The interfaces take center stage and the information within the interface gets pushed to the background. Are we entering an era of eye candy instead of results that are relevant to the user?
Eye candy is attractive, but I feel users are sophisticated enough not to be taken in by it, and they have been getting more sophisticated with time. The gradual increase in average terms per search over the past 10 to 15 years and the sparseness of current search engine result pages compared to the very busy ones we had a decade ago both suggest that users value substance over eye candy.
What is it that you think people are looking for from semantic technology?
Two things, I think. The first is improved precision. More data to search usually means more possible answers to a search, which means that I have to scan more results to arrive at the answer; improved precision will go a long way toward addressing that issue. A more pedestrian way to put this is: "I don't care if there are about a million results, I just want the one result." The second is having the search engine take the extra step of extracting data from the search results and synthesizing that data into a meaningful table or report. This is more complicated, but it has the potential to really save time in the long run.
What are the hot trends in search for the next 12 to 24 months?
There are two hot trends to watch for in search. One is data extraction and synthesis, which I talked about above; I think Google Squared and Wolfram Alpha are great first steps in that direction. The other is real-time search, where searches are run against incoming streams of data.
Where can people get more information?
Stephen E. Arnold, June 2, 2009