ProQuest: A Typo or Marketing?

June 10, 2011

I was poking around with the bound phrase “deep indexing.” I had a briefing from a start up called Correlation Concepts. The conversation focused on the firm’s method of figuring out relationships among concepts within text documents. If you want to know more about Correlation Concepts, you can get more information from the firm’s Web site at http://goo.gl/gnBz6.

I mentioned Dr. Zbigniew Michalewicz’s work in mereology and genetic algorithms to Correlation Concepts and also referenced the deep extraction methods developed by Dr. David Bean at Attensity. I also commented on some of the methods disclosed in Google’s open source content. But Google has become less interesting to me as new approaches have become known to me. Deep extraction requires focus, and I find it difficult to reconcile focus with the paint gun approach Google is now taking in disciplines far removed from my narrow area of interest.

Image caption: A typo is a typo. An intentional mistake may be a joke or maybe disinformation. Source: http://thiiran-muru-arul.blogspot.com/2010/11/dealing-with-mistakes.html

After the interesting demo given to me by Correlation Concepts, I did some patent surfing. I use a number of tools to find, crunch, and figure out which crazily worded filing relates to other, equally crazily worded documents. I don’t think the patent system is much more than an exotic work of fiction and fancy similar to Spenser’s The Faerie Queene.

Deep indexing is important. In some cases, key word indexing does not capture the “aboutness” of a document. As metadata becomes more important, indexing outfits have to cut costs. Human indexers are like tall grass in an upscale subdivision. Someone is going to trim that surplus. In indexing, humans get pushed out for fancy automated systems. Initially more expensive than humans, the automated systems don’t require retirement, health care, or much management. The problem is that humans still index certain content better than automated systems. Toss out high quality indexing and insert algorithmic methods, and you get search results which can vary from indexing update to indexing update.

Okay for a novice. Not okay for a professional who knows how to formulate a precise Boolean query. Deep indexing promises to make concepts king and permit fancy short cuts like automated “See Alsos” and “Suggested Searches”, facets, and one click sorting on machine tagged content. When an algorithm parses a document, some systems can display a “synthetic report” composed of snippets from different sources or different content types. Some searchers don’t want to do old fashioned research. Some searchers just want an answer which is “good enough”. Ah, intellectual rigor gets crushed in the wheels and gears of smart software. But how smart? And who has the smartest system and method for deep indexing? I don’t know.
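To make the “aboutness” point concrete, here is a minimal sketch in Python. It is not any vendor’s method. The three toy documents, the concept tags, and the function names are all assumptions invented for illustration. It builds one inverted index from the words in the text and a second one from indexer-assigned concept tags, then runs the same Boolean AND lookup against each.

```python
import re
from collections import defaultdict

# Toy corpus (hypothetical). "text" is the document body; "concepts" stands in
# for tags a human indexer or a deep indexing system might assign.
DOCS = {
    1: {"text": "The new sedan gets better mileage than last year's car.",
        "concepts": {"automobiles", "fuel economy"}},
    2: {"text": "Automobile makers report record sales in Detroit.",
        "concepts": {"automobiles", "sales"}},
    3: {"text": "Patent filings for search software rose sharply.",
        "concepts": {"patents", "information retrieval"}},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def build_indexes(docs):
    """Return (keyword_index, concept_index), each mapping a term to doc ids."""
    keyword_index = defaultdict(set)
    concept_index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in tokenize(doc["text"]):
            keyword_index[term].add(doc_id)
        for concept in doc["concepts"]:
            concept_index[concept].add(doc_id)
    return keyword_index, concept_index


def boolean_and(index, *terms):
    """Return the ids of documents that match every term (Boolean AND)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


keyword_index, concept_index = build_indexes(DOCS)

# The key word lookup misses document 1, which says "sedan" and "car".
print(sorted(boolean_and(keyword_index, "automobile")))            # [2]
# The concept lookup finds both documents about automobiles.
print(sorted(boolean_and(concept_index, "automobiles")))           # [1, 2]
# A precise Boolean query narrows the concept hits.
print(sorted(boolean_and(concept_index, "automobiles", "sales")))  # [2]
```

The concept results are only as good as the tags. If an automated tagger assigns different concepts after each indexing update, the same Boolean query returns different documents, which is the instability described above.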

What caught my attention was a news story from the CBS Detroit affiliate called “ProQuest Deep Indexing Gets US Patent.” The story I read was dated April 26, 2011. Here’s the passage that caught my attention:

Deep Indexing, the subject of U.S. Patent No. 5,950,196,  is now available in the all-new ProQuest platform, allowing the innovation to be used across a much broader range of data.  “Deep Indexing significantly accelerates discovery in serious research and is just one example of the kind of technology leadership that’s resident across the ProQuest enterprise,” said Marty Kahn, ProQuest CEO. “The creation of a single, unified platform enables us to leverage this kind of innovation across the breadth of ProQuest content, rather than confining it to a handful of databases.” ProQuest’s Deep Indexing creates metadata from the elements within illustrations so these graphics — including table, charts, photos, drawings, etc. — can be searched for relevant content.  Before the debut of ProQuest’s new unified platform, the technology was available for scientific and technology journals. Deep Indexing now becomes one of the platform’s advanced content management tools that can be used across ProQuest data.

I am a curious person. I noted that the patent number 5,950,196 was not one I recalled seeing for 2011 patents. I snagged a copy of the patent document and learned some interesting things:

  1. The patent was filed in 1997 and is assigned to Pallavi Pyreddy and W. Bruce Croft (one of the big guns in content processing). The patent was granted in 1999, not April 2011. My hunch is that there was either a typographic error or the patent was purchased by ProQuest. Maybe there is some marketing magic in the announcement as well.
  2. The 5,950,196 document does not communicate “deep indexing” to me. I think it is closer to what the old PROMT File 16 database did with humans; that is, look for a table and put it in a form that made it possible to retrieve tables with data on specific topics. There are methods to handle this situation, but in 1999 this was indeed rocket science.
  3. The actual ProQuest patents are like a Slim Fast diet. I located US7,765,199 and US7,376,709. I also noted applications US20100318561 and US20070219970. The “inventor” Matt Dunie does not seem to be part of the ProQuest or Cambridge team at this time. He was, it appears from the patent documents, a spark plug of innovation. I am not sure who is the “inventor” at ProQuest today. Without more data, I had to assume that the patent number in the Detroit CBS story is probably either a typo or an error that originated somewhere in the information chain leading to the Detroit reporter.

When I looked at the news stories about the alleged patent, most just said ProQuest has a patent for deep indexing. Too bad the base document, which seems to be a ProQuest document called “ProQuest’s Deep Indexing Earns Patent”, did not provide more detailed bibliographic data. (Is this an effort to streamline the write up, an indication that an indexing company does not include the details needed to locate a document, or an attempt at obfuscating a technical claim?)

If ProQuest licensed the referenced patent, that’s one thing. If ProQuest is alleging that it invented the system and method in US5,950,196, that strikes me as a little inaccurate. Somewhere along the line, the facts seem to be tangled in marketing thorns or just lousy reporting, careless writing, or a 21st-century positioning play.

And what about the “real” journalist’s inclusion of what seems to be an off point patent number? Yikes!

Back to the Correlation Concepts point. The notion of “deep indexing” is one of considerable interest. A lawyer will have to figure out who has which patent. I am inclined to accept Correlation Concepts’ assertion that its approach is unique. Too bad I could not pin down an alleged patent that may have an impact on the promising start up, Correlation Concepts.

I learned one thing: indexing companies may not be able to provide the type of information I need to locate a full text document. No problem. There are alternatives with indexing behaviors upon which I can rely. Google is slipping a bit in my opinion. EBSCO seems pretty good. Some of the others? Management churn, cost cutting, and automated processes are making findability more difficult for me.

Thank goodness I am old and too tired to do much more than issue a cautionary, “Honk.”

Stephen E Arnold, June 10, 2011

Sponsored by ArnoldIT.com, the resource for enterprise search information and current news about data fusion. Check out the new vertical information service on the automobile industry and the news service about Internet investment challenges.
