Indexing Dynamic Databased Content
April 20, 2008
In the last week, there’s been considerable discussion of what is now called “deep Web” content. The idea is that some content requires the user to enter a query. The system processes the query and generates a search result from a database. This function is easier to illustrate than to explain in words.
Look at the screen shot below. I have navigated to the Southwest Airlines Web page and entered a query for flights from Louisville, Kentucky, to Baltimore, Maryland.
Here’s what the system shows me:
If you do a search on Google, Live.com, or Yahoo, you won’t see the specific listing of flights shown below:
Google’s announcement that it will begin to index forms means that Google wants to make these types of data available in its index. There will be some push back from Web site owners and data producers going forward.
Who Is in the Game?
I would wager a bowl of burgoo (Kentucky squirrel stew) that you don’t know too much about these two search and content processing vendors:
- Bright Planet. You can learn quite a bit about the company’s harvesting technology from its Web site at http://www.brightplanet.com. If you work in the US government, you are probably using some BrightPlanet content, but you may not have instant name recognition for the firm’s services. In a nutshell, BrightPlanet has technology that allows it to index content on sites like Southwest Airlines’ Web site as well as sites that require a user name and password. The company has been in business since 2000, which is when I learned about the firm.
- Deep Web Technologies. This company is located near Los Alamos, New Mexico, and it was founded by one of the wizards behind the Verity search engine. You can read about this company at its Web site, http://www.deepwebtech.com. You can also see the company’s system in action by navigating to Science.gov at http://www.science.gov. The owner of this company is a friend of mine, and I want to keep the cheerleading in check, but I think this outfit has some nifty technology.
There are other companies working in this sector, and I will run an exclusive interview with one of the individuals who is among the world’s elite in acquiring information from these types of online resources.
The point is that Google’s announcement comes late in the game, which is not unusual in the consumer indexing space. Google’s mastery of forms, however, marks a turning point in this branch of search and content processing; namely, it’s now in the spotlight. Bright Planet and Deep Web Technologies have been unable to capture the type of visibility that Google achieves.
Expect more scrutiny, debate, and exploration of these technologies in the coming weeks and months.
A Hard Problem
Indexing the content that requires user input is different from indexing a plain vanilla Web site such as mine at http://www.arnoldit.com. I have created a site map that makes it dead simple for a software spider to find the information on my Web site’s pages. In fact, most visitors to my Web site find me from a search engine’s result list, so my splash page here http://www.arnoldit.com/ is probably not needed even though my flapping duck is particularly meaningful to me. (How did I get that duck to stay in the circle?)
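To make the contrast concrete, here is a minimal sketch of what a spider has to do when a site map is available. The sitemap location is an assumption for illustration; any file that follows the sitemaps.org convention works the same way.

```python
# Minimal sketch: a spider enumerating a static site's pages from its sitemap.
# The sitemap URL is a hypothetical placeholder following the usual convention.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://www.arnoldit.com/sitemap.xml"  # assumed location
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def list_sitemap_urls(sitemap_url):
    """Return every <loc> URL listed in the sitemap."""
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return [loc.text for loc in tree.iter(SITEMAP_NS + "loc")]

if __name__ == "__main__":
    for url in list_sitemap_urls(SITEMAP_URL):
        print(url)  # each of these pages can be fetched and indexed directly
```

Every page the spider needs is listed up front; there is nothing to guess.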
A dynamic site, one requiring the user to fill in a form or submit a formal query, is different. On a static site, the information is spread out on the table for anyone to see. On a dynamic site, the information about the flights to Baltimore from Louisville appears only after I pick cities, days, and dates and submit the specifications to Southwest’s computer. The computer then scurries to the Southwest database, pulls out only what’s needed to match my query, and then creates a specific content display for me.
There’s nothing for Google, Microsoft, or Yahoo to index until the Southwest system receives a completed form. The problem, therefore, is much harder than indexing a static HTML page.
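Here is a rough sketch of the crawler’s dilemma in code. The endpoint and field names are hypothetical stand-ins, not Southwest’s actual form, but the shape of the problem is the same: until a filled-in form reaches the server, there is no page to index.

```python
# Rough sketch: the result page exists only after a completed form is submitted.
# The endpoint and field names below are hypothetical stand-ins.
import urllib.parse
import urllib.request

SEARCH_URL = "https://example-airline.com/flight-search"  # hypothetical endpoint

def fetch_flight_listing(origin, destination, travel_date):
    """POST the same fields a user would fill in and return the generated page."""
    form_fields = {
        "origin": origin,            # e.g. "SDF" (Louisville)
        "destination": destination,  # e.g. "BWI" (Baltimore)
        "date": travel_date,         # e.g. "2008-05-01"
    }
    data = urllib.parse.urlencode(form_fields).encode("utf-8")
    with urllib.request.urlopen(SEARCH_URL, data=data) as response:
        return response.read().decode("utf-8", errors="replace")

# A plain GET of SEARCH_URL returns only the empty form, which is all an
# ordinary spider ever sees.
```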
Today, more than half of the Web sites created each month are dynamic, and the ratio of static to dynamic sites is changing. I can’t reproduce the data I obtained from one of my clients, but I can highlight two facts. First, the number of dynamic sites is growing more rapidly than it was at this time last year. At some point in 2009, static sites will be in the minority, essentially becoming brochureware that no one will pay much attention to. Second, the people operating dynamic sites want to protect their data from aggregators. Once structured data have been sucked out of a dynamic site, the value of the information decreases sharply. The creator of dynamic data faces the real possibility that the aggregator can monopolize the traffic to that data. While some Web site operators may not object, others, particularly airlines, commercial database producers, and for-fee aggregators like specialist networks, may be quite annoyed.
Deep Web indexing, then, triggers three issues that have been out of sight and out of mind for many. First, there is the business threat of losing control of traffic and data. Deep Web Technologies and BrightPlanet have business models that forge partnerships. Newcomers may not have these firms’ processes and policies in place. Second, in the past the hassle of creating scripts to log in, get result pages, and keep them up to date has been a barrier to large-scale deep Web indexing. As structured data become easier to obtain, previously harder-to-get information is easier to get, and costs of acquisition begin to drop. Economics then change. Third, structured data for dynamic Web pages are often little more than directories or collections of facts; for example, the departure times of the Amtrak train from Baltimore to New York. Right now, getting this information from Amtrak is a massive hassle. The Amtrak system is, shall we say, of modest functionality. Sharp Web traffic changes can have significant impacts on the organizations operating these Web sites. The search engine optimization industry exists to drive traffic to sites. If the traffic goes elsewhere, who knows what will happen? Layoffs? More SEO? More complex access procedures?
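To see why the economics shift once the scripting hassle goes away, consider what a bare-bones “surfacing” script has to do: try plausible combinations of field values, submit them, and keep the result pages that look substantive. The endpoint, field names, and station codes below are illustrative assumptions, not any real carrier’s interface.

```python
# Illustrative sketch of form "surfacing": probe value combinations and keep
# the result URLs worth handing to an ordinary crawler. All names are made up.
import itertools
import urllib.parse
import urllib.request

BASE_URL = "https://example-transit.com/schedule"  # hypothetical GET-based form
STATIONS = ["BAL", "NYP", "PHL", "WAS"]            # sample field values to try

def surface_result_urls(base_url, stations, min_bytes=2000):
    """Yield result-page URLs whose responses look large enough to index."""
    for origin, destination in itertools.permutations(stations, 2):
        query = urllib.parse.urlencode({"from": origin, "to": destination})
        url = f"{base_url}?{query}"
        try:
            with urllib.request.urlopen(url) as response:
                if len(response.read()) >= min_bytes:
                    yield url  # pass this URL to the regular crawl queue
        except OSError:
            continue           # dead combinations are simply skipped

# The cost driver is combinatorics: every additional form field multiplies the
# number of probe requests, which is why this used to be expensive at scale.
```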
Room for Specialists
Indexing dynamic content inside an organization is a necessary function. Most vendors of industrial-strength search systems assert that their software handles both structured (forms-based) data and unstructured information (Word and PowerPoint files). The reality is that these systems are primitive compared to what’s required to handle forms-based content in the wild. You will want to look closely at systems that can indeed handle structured data with aplomb. Some of the companies you will want to check out include Dieselpoint and MarkLogic, to name just two.
There are other specialist vendors who can provide useful functions for managing forms-based content. You will want to become more familiar with these firms’ systems.
- Mercado Software, a specialist in structured content for ecommerce sites. More information is at http://www.mercado.com
- Northern Light, a company offering tools that can handle static and dynamic business information. More information is available at www.nlsearch.com.
- Progress Software’s EasyAsk tools. You can learn more at http://www.easyask.com
Other vendors may be found in my Beyond Search study.
I believe that there will be room for specialists in dynamic data.
Looking Ahead
With more dynamic functions, the need for tools increases. I think it unlikely that a single firm will be able to quickly alter the fragmented landscape in which dynamic data exist. More likely, a spate of new companies will come on the scene and create new opportunities for themselves and others.
The search and content processing scene remains one that is poorly understood by many pundits. Its nooks and crannies hold many surprises. The challenges associated with acquiring, processing, and repurposing content are significant ones. I’m tempted to use the word intractable, but eventually dynamic data will be tamed. We need to keep in mind that most people mean text when speaking about search and content processing.
Our tools and systems for text are getting better, but I think of them as little better than stone hatchets. We need to push innovation to resolve the problems of text. The real challenges of audio, video, and dynamic compound data await.
As we flounder with text, the amount of non-text content continues to rise. Search remains an interesting problem.
Stephen Arnold, April 20, 2008
Comments
11 Responses to “Indexing Dynamic Databased Content”
Hello Stephen,
I’m just up the road in Indianapolis and have a patent pending on an invisible “deep” Web search utility. Would enjoy chatting with you soon. Paul Thompson at Dartmouth is the coauthor and CTO.
ISEN is a database of databases with a unique serial number and robust metadata (120 MODS fields) assigned to each. We plan to use a social network of librarians to accomplish the ongoing collection and maintenance of the ISEN registry.
Thanks for writing. I would like to learn more and maybe put up a short profile of what you and your team are doing. I don’t include an exhaustive list of vendors in these Web log posts. In the first three editions of Enterprise Search Report I profiled about 32 vendors. In Beyond Search, my most recent study, I profiled an additional 24 vendors. I have a list of more than 300 companies in this space, and I would be delighted to add your firm to the list.
Stephen Arnold, April 21, 2008, 11:09 am
You didn’t mention that the content managers, i.e., entities hosting deep Web data, can create dynamic site maps based on the dynamic content, which allows the enterprise crawlers easy access to the deep Web content. Fairly trivial to do and very effective in getting this data into the enterprise search engine indexes.
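For what it is worth, a bare-bones sketch of that idea, generating sitemap entries from the records that back the dynamic pages (the table, column, and URL pattern are illustrative only):

```python
# Illustrative sketch: build a sitemap from the database behind dynamic pages
# so a crawler can reach each record without filling in a form.
import sqlite3
from xml.sax.saxutils import escape

def build_sitemap(db_path, url_template):
    """Emit one <url> entry per database record that backs a dynamic page."""
    entries = []
    conn = sqlite3.connect(db_path)
    for (record_id,) in conn.execute("SELECT id FROM flights"):  # assumed table
        loc = escape(url_template.format(id=record_id))
        entries.append(f"  <url><loc>{loc}</loc></url>")
    conn.close()
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>"
    )

# Example: build_sitemap("content.db", "https://example.com/flight?id={id}")
```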
Good article; however, I disagree with the blanket statement “Second, the people operating dynamic sites want to protect their data from aggregators.” Not necessarily true. Take for instance your query of ‘flights from Louisville, Kentucky, to Baltimore, Maryland’. I would think that any Southwest Airlines shareholder would be quite happy if I typed this into Google and got the Southwest page of flights, with a handy link to purchase tickets. Your statement is probably true in some cases; therefore the question to answer is “What is the main objection to the public finding data available in publicly available systems (given that the front page to the Web is Google, MSN, or Yahoo), and does this objection describe a majority of deep Web content owners?”
Content for another article?
I appreciate your post. In these Web log essays, I often try to make one point. I certainly agree with your comments. In the back of my mind are the Guha patent applications. These contain provisions for discovering fields and missing data if a Web site does not provide programmatic instructions. Southwest is an interesting case. Under the airline’s present management, I’m not seeing SWA flights in the aggregation services I use.
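For readers who have not looked at those filings, the flavor of the field-discovery step can be suggested in a few lines; the URL below is a placeholder, and the actual patent methods go well beyond listing input names.

```python
# Toy sketch of form-field discovery: parse a page and list the inputs a
# crawler would need to fill in. The URL below is a hypothetical placeholder.
from html.parser import HTMLParser
import urllib.request

class FormFieldLister(HTMLParser):
    """Collect the name attributes of <input>, <select>, and <textarea> tags."""
    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            name = dict(attrs).get("name")
            if name:
                self.fields.append(name)

def discover_form_fields(url):
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = FormFieldLister()
    parser.feed(html)
    return parser.fields

# Example: discover_form_fields("https://example-airline.com/flight-search")
```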
Stephen Arnold, May 2, 2008
Watch for an interview with one of the “deep Web” gurus, Abe Lederman. His interview for the Search Wizards Speak series on the ArnoldIT.com Web site will run either Monday or Tuesday, June 9 or 10. Mr. Lederman is preparing for a trip, but he assures me that I will have some final inputs before he wallows in the luxury of a 20 hour flight on one of the Trailways of the air.
Stephen Arnold, June 5, 2008
Stephen, great article.
Just one thought: would it not be better if we let the Web sites take more responsibility for which direction a search leads?
I feel the Web sites know more than the search engines about what information, static or dynamic, is stored within their systems. So why not have the user go to the Web site to take advantage of a special promotion instead of first going to a search engine?
Your comments…
Thanks
Nitiin
Nitiin,
Thank you for writing. There may be “tension” between the Web site and the indexing system. The Web site has one or a limited set of data. The Web site sees those data as having significant value. The indexing system sees the Web site as a farmer with corn; essentially, data are data. The indexing system, therefore, creates value by harvesting the Web site and creating a meta-construct that has higher fungible value than the single Web site’s data. Therefore, we have tension. Lots of tension.
Stephen Arnold, August 2, 2008
Hi!
I read your article as well as the comments with a lot of enthusiasm.
In fact, I’m a student working on a project regarding how to index dynamic documents and what kind of algorithm to use for indexing this type of document.
Your suggestions are highly welcome.
Thanks