Google Nails Duplicate Detection Invention

December 3, 2009

I know that most of my two or three readers do not give a goose feather for duplicate detection. Pretty boring stuff. Google result lists seem to be just one list with few repeating objects. Even in the Google News service, identical stories rarely slip through the digital net.

The ever reliable USPTO has granted a patent to the Google for its duplicate detection method. If you want to know a bit more about the Google approach, you will want to download US7,627,613, “Duplicate Document Detection in a Web Crawler System”. Before my pals at various search and content processing companies email me to explain that their duplicate detection is better, save that energy. No one at the Beyond Search goose pond is asserting “better”. The Google invention deals with scale, petabytes of digital crapola deduped quickly and reasonably effectively. The “scale” idea is one clue to Google’s technology. The challenges of scale are not well understood unless you have to figure out what to do with trillions of instances of digital crapola.

Google says in its glorious prose:

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.

Notice what’s left out? Now read the patent document. Notice what’s left out? Google does not make explicit how these separate inventions interlock. Those interlocks are sort of important, particularly if you are a competitor and one of your 20-somethings says, “That’s obvious. I can code that up myself.” Scale. Remember scale. Remember that Google can convert speech to text and then dedupe those outputs too. Scale. Performance. Cost. Useful Google concepts all.
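
A toy version of the steps in the abstract does fit in a few lines of Python. What follows is a minimal sketch under stated assumptions, not Google's method: the `DedupIndex` class, the SHA-256 fingerprint over normalized text, and the numeric `score` standing in for the "query independent metric" are all hypothetical choices made for illustration. The patent abstract does not say how "same content" is decided or which metric is used.

```python
import hashlib


class DedupIndex:
    """Toy in-memory duplicate index: one duplicate set per content fingerprint."""

    def __init__(self):
        # fingerprint -> {url: score} for every document sharing that content
        self._sets = {}

    @staticmethod
    def fingerprint(content):
        # Assumption: "same content" means identical text after lowercasing
        # and collapsing whitespace. The abstract does not define this.
        normalized = " ".join(content.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def add(self, url, content, score):
        """Merge a newly crawled document into its duplicate set and
        return the URL of that set's current representative."""
        dup_set = self._sets.setdefault(self.fingerprint(content), {})
        dup_set[url] = score
        # Assumption: the representative is the highest-scoring URL;
        # ties go to the lexicographically smallest URL.
        return min(dup_set, key=lambda u: (-dup_set[u], u))


index = DedupIndex()
first = index.add("http://a.example/story", "Google  nails dedupe.", 0.2)
second = index.add("http://b.example/copy", "google nails dedupe.", 0.9)
third = index.add("http://c.example/other", "Something else entirely.", 0.1)
```

The second document matches the first after normalization and carries a higher score, so it takes over as the representative; the third starts a duplicate set of its own. The sketch lives in one dictionary on one machine. It says nothing about the interlocks or about doing this across trillions of documents, which is the point of the word "scale".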

Stephen Arnold, December 3, 2009

I wish to disclose to the National Constitution Center that I was not paid to write this essay with its implicit reference to the constitutional right of Google competitors to misunderstand the notion of “scale” in Google’s weird vocabulary.
