Indexing Dynamic Databased Content
April 20, 2008
In the last week, there’s been considerable discussion of what is now called “deep Web” content. The idea is that some content requires the user to enter a query. The system processes the query and generates a search result from a database. This function is easier to illustrate than explain in words.
Look at the screen shot below. I have navigated to the Southwest Airlines Web page and entered a query for flights from Louisville, Kentucky, to Baltimore, Maryland.
Here’s what the system shows me:
If you do a search on Google, Live.com, or Yahoo, you won’t see the specific listing of flights shown below:
Google’s announcement that it will begin to index forms means that Google wants to make these types of data available in its index. There will be some push back from Web site owners and data producers going forward.
Who Is in the Game?
I would wager a bowl of burgoo (Kentucky squirrel stew) that you don’t know too much about these two search and content processing vendors:
- BrightPlanet. You can learn quite a bit about the company’s harvesting technology from its Web site at http://www.brightplanet.com. If you work in the US government, you are probably using some BrightPlanet content, but you may not have instant name recognition for the firm’s services. In a nutshell, BrightPlanet has technology that allows it to index content on sites like Southwest Airlines’ Web site as well as those sites that require a user name and password. The company has been in business since 2000, when I first learned about the firm.
- Deep Web Technologies. This company is located near Los Alamos, New Mexico, and it was founded by one of the wizards behind the Verity search engine. You can read about this company at its Web site, http://www.deepwebtech.com. You can also see the company’s system in action by navigating to Science.gov at http://www.science.gov. The owner of this company is a friend of mine, and I want to keep the cheerleading in check, but I think this outfit has some nifty technology.
There are other companies working in this sector, and I will run an exclusive interview with one of the individuals who is among the world’s elite in acquiring information from these types of online resources.
The point is that Google’s announcement comes late in the game, which is not unusual in the consumer indexing space. Google’s mastery of forms, however, marks a turning point in this branch of search and content processing; namely, it’s now in the spotlight. BrightPlanet and Deep Web Technologies have been unable to capture the type of visibility that Google achieves.
Expect more scrutiny, debate, and exploration of these technologies in the coming weeks and months.
A Hard Problem
Indexing the content that requires user input is different from indexing a plain vanilla Web site such as mine at http://www.arnoldit.com. I have created a site map that makes it dead simple for a software spider to find the information on my Web site’s pages. In fact, most visitors to my Web site find me from a search engine’s result list, so my splash page here http://www.arnoldit.com/ is probably not needed even though my flapping duck is particularly meaningful to me. (How did I get that duck to stay in the circle?)
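To show why a site map makes a static site so easy for a spider, here is a minimal Python sketch. It assumes a site map in the common sitemaps.org XML format; the sample document and the example.com URLs are placeholders I made up for illustration, not the contents of any real site map.

```python
import xml.etree.ElementTree as ET

# Hypothetical site map; example.com and these pages are placeholders.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/articles.html</loc></url>
  <url><loc>http://www.example.com/contact.html</loc></url>
</urlset>"""

# Namespace used by sitemaps.org-style documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in a site map document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


if __name__ == "__main__":
    for url in urls_from_sitemap(SITEMAP_XML):
        print(url)
```

Every page the spider needs is named in one file. No form has to be filled in and no query has to be guessed.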
A dynamic site–one requiring the user to fill in a form or submit a formal query–is different. The information is not spread out on the table for anyone to see. The information about the flights to Baltimore from Louisville appears only after I pick cities, days, and dates and submit the specifications to Southwest’s computer. The computer then scurries to the Southwest database, pulls out only what’s needed to match my query, and then creates a specific content display for me.
There’s nothing for Google, Microsoft, or Yahoo to index until the Southwest system receives a completed form. The problem, therefore, is much harder than indexing a static HTML page.
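One way to picture the problem is to imagine a crawler that must invent the completed forms itself. The sketch below is a simplified illustration under my own assumptions, not a description of how Google, BrightPlanet, or Deep Web Technologies actually works. It assumes a form that submits with GET, and the action URL, the field names (origin, destination, date), and the candidate values are all hypothetical.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical form endpoint and fields; none of these come from a real site.
FORM_ACTION = "http://www.example.com/flights"
FIELDS = {
    "origin": ["SDF", "BWI"],        # Louisville and Baltimore airport codes
    "destination": ["BWI", "SDF"],
    "date": ["2008-04-21"],
}


def candidate_queries(action_url, field_values):
    """Build one GET URL for every combination of form-field values."""
    names = list(field_values)
    urls = []
    for combo in product(*(field_values[name] for name in names)):
        params = dict(zip(names, combo))
        urls.append(action_url + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    urls = candidate_queries(FORM_ACTION, FIELDS)
    print(len(urls), "candidate result pages to fetch")
    for url in urls:
        print(url)
```

Two cities and one date already yield four candidate pages, including nonsense trips from a city to itself. With hundreds of cities and a year of dates the number of combinations explodes, which is part of what makes this harder than following links on a static page.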
Today, more than half of the Web sites created each month are dynamic, and the ratio of static to dynamic sites is changing. I can’t reproduce the data I obtained from one of my clients, but I can highlight two facts. First, the number of dynamic sites is growing more rapidly than it was at this time last year. At some point in 2009, static sites will be in the minority, essentially becoming brochureware that no one will pay much attention to. Second, the people operating dynamic sites want to protect their data from aggregators. Once structured data have been sucked out of a dynamic site, the value of the information decreases sharply. The creator of dynamic data faces the real possibility that the aggregator can monopolize the traffic to that data. While some Web site operators may not object, others–particularly airlines, commercial database producers, and for-fee aggregators like specialist networks–may be quite annoyed.
Deep Web indexing then triggers three issues that have been out of sight and out of mind for many. First, there is the business threat of losing control of traffic and data. Deep Web Technologies and BrightPlanet have business models that forge partnerships. Newcomers may not have these firms’ processes and policies in place. Second, in the past the hassle of creating scripts to log in, get result pages, and keep these up to date has been a barrier to large-scale deep Web indexing. As the structured data become easier to obtain, the costs of acquiring this previously hard-to-get information begin to drop. The economics then change. Third, structured data for dynamic Web pages are often little more than directories or collections of facts; for example, the departure times of the Amtrak train from Baltimore to New York. Right now, getting this information from Amtrak is a massive hassle. The Amtrak system is–shall we say–of modest functionality. Sharp Web traffic changes can have significant impacts on the organizations operating these Web sites. The search engine optimization industry exists to drive traffic to sites. If the traffic goes elsewhere, who knows what will happen? Layoffs? More SEO? More complex access procedures?
Room for Specialists
Indexing dynamic content inside an organization is a necessary function. Most vendors of industrial-strength search systems assert that their software handles both structured (forms-based) data and unstructured information (Word and PowerPoint files). The reality is that these systems are primitive compared to what’s required to handle forms-based content in the wild. You will want to look closely at systems that can indeed handle structured data with aplomb. Some of the companies you will want to check out include Dieselpoint and MarkLogic, to name just two.
There are other specialist vendors who can provide useful functions for managing forms-based content. You will want to become more familiar with these firms’ systems.
- Mercado Software, a specialist in structured content for ecommerce sites. More information is at http://www.mercado.com
- Northern Light, a company offering tools that can handle static and dynamic business information. More information is available at www.nlsearch.com.
- Progress Software’s EasyAsk tools. You can learn more at http://www.easyask.com
Other vendors may be found in my Beyond Search study.
I believe that there will be room for specialists in dynamic data.
With more dynamic functions, the need for tools increases. I think it unlikely that a single firm will be able to alter quickly the fragmented landscape in which dynamic data exist. More likely, a spate of new companies will come on the scene and create new opportunities for themselves and others.
The search and content processing scene remains one that is poorly understood by many pundits. Its nooks and crannies hold many surprises. The challenges associated with acquiring, processing, and repurposing content are significant ones. I’m tempted to use the word intractable, but eventually dynamic data will be tamed. We need to keep in mind that most people mean text when speaking about search and content processing.
Our tools and systems for text are getting better, but I think of them as little better than stone hatchets. We need to push innovation to resolve the problems of text. The real challenges of audio, video, and dynamic compound data await.
As we flounder with text, the amount of non-text content continues to rise. Search remains an interesting problem.
Stephen Arnold, April 20, 2008