Duplicates and Deduplication

December 29, 2008

In 1962, I was in Dr. Daphne Swartz’s Biology 103 class. I still don’t recall how I ended up amidst the future doctors and pharmacists, but there I was, sitting next to my nemesis Camille Berg. She and I competed for the top grades in every class we shared. I recall that Miss Berg knew that there were five variations of twinning: three dizygotic and two monozygotic. I had just turned 17 and knew about the Doublemint Twins. I had some catching up to do.

Duplicates continue to appear in data just as the five types of twins did in Bio 103. I find it amusing to hear and read about software that performs deduplication; that is, the machine process of determining which item is identical to another. The simplest type of deduplication is to take a list of numbers and eliminate any that are identical. You probably encountered this type of task in your first programming class. Life gets a bit trickier when the values are expressed in different ways; for example, a mixed list with binary, hexadecimal, and real numbers plus a few more interesting variants tossed in for good measure. Deduplication becomes a bit more complicated.
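
Here is a minimal sketch of that normalize-then-compare step. I am assuming the mixed values arrive as strings in notations Python already understands (0b… for binary, 0x… for hexadecimal, plain decimals and reals); the sample values are invented.

```python
def normalize(value: str) -> float:
    """Parse binary (0b...), hexadecimal (0x...), or decimal/real notation."""
    text = value.strip().lower()
    if text.startswith("0b"):
        return float(int(text, 2))
    if text.startswith("0x"):
        return float(int(text, 16))
    return float(text)


def dedupe(values):
    """Keep the first occurrence of each distinct numeric value."""
    seen = set()
    unique = []
    for raw in values:
        number = normalize(raw)
        if number not in seen:
            seen.add(number)
            unique.append(raw)
    return unique


# Four different spellings of the number ten collapse into one entry.
print(dedupe(["10", "0b1010", "0xA", "10.0", "12"]))  # -> ['10', '12']
```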

At the other end of the scale, consider the challenge of examining two collections of electronic mail seized from a person of interest’s computers. There is the email from her laptop. And there is the email that resides on her desktop computer. Your job is to determine which emails are identical, prepare a single deduplicated list of those emails, generate a file of emails and attachments, and place the merged and deduplicated list on a system that will be used for eDiscovery.
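
A common first pass, and only a first pass, is to fingerprint each message by hashing a normalized body plus the bytes of its attachments and then merge the two mailboxes on that fingerprint. The Message record below is a made-up stand-in, not a real eDiscovery or mail-parsing API.

```python
import hashlib
from dataclasses import dataclass, field


# Hypothetical, simplified email record; a real collection would come
# from parsing PST or MBOX files, not from hand-built objects like this.
@dataclass
class Message:
    sender: str
    subject: str
    body: str
    attachments: list = field(default_factory=list)  # raw attachment bytes


def fingerprint(msg: Message) -> str:
    """Hash the normalized body and attachment bytes, ignoring volatile
    headers (received dates, routing) that differ between mailboxes."""
    h = hashlib.sha256()
    h.update(msg.sender.lower().encode())
    h.update(" ".join(msg.body.split()).encode())  # collapse whitespace
    for blob in sorted(msg.attachments):
        h.update(hashlib.sha256(blob).digest())
    return h.hexdigest()


def merge_collections(laptop, desktop):
    """Return one deduplicated list drawn from both mailboxes."""
    merged = {}
    for msg in list(laptop) + list(desktop):
        merged.setdefault(fingerprint(msg), msg)
    return list(merged.values())
```

A fingerprint like this catches only byte-for-byte and whitespace-level matches. The harder cases come next.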

Here are some of the challenges that you will face once you answer the question, “What’s a duplicate?” You have two allegedly identical emails and their attachments. One email is dated January 2, 2008; the other is dated January 3, 2008. You examine each email and find that the only difference between the two is the inclusion of a single slide in one of the two PowerPoint decks. What do you conclude:

  1. The two emails are not identical, so you include both emails and both attachments.
  2. The earlier email is the accurate one, so you exclude the later email.
  3. The later email is the accurate one, so you exclude the earlier email.

Now consider that you have 10 million emails to process. We have to go back to our definition of a duplicate and apply the rules for that definition to the collection of emails. If we get this wrong, there could be legal consequences. A system that simply generates a file of emails whenever a mathematical process has determined that a record is different may be too crude to deal with the problem in the context of eDiscovery. Math helps, but it is not likely to handle the onerous task of identifying near matches and the reasoning required to determine which email is “the” email.
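
To make the “math is too crude” point concrete, here is a toy comparison, with invented slide titles standing in for the two PowerPoint decks: an exact hash declares the attachments different the moment one slide is added, a similarity ratio only says they are close, and neither number tells you which deck is authoritative.

```python
import difflib
import hashlib

# Invented slide titles standing in for the two PowerPoint decks.
deck_jan_2 = ["Intro", "Q4 revenue", "Roadmap", "Summary"]
deck_jan_3 = ["Intro", "Q4 revenue", "Roadmap", "New hires", "Summary"]


def digest(slides):
    return hashlib.sha256("\n".join(slides).encode()).hexdigest()


exact_match = digest(deck_jan_2) == digest(deck_jan_3)
similarity = difflib.SequenceMatcher(None, deck_jan_2, deck_jan_3).ratio()

print(exact_match)            # False: one added slide breaks the exact match
print(round(similarity, 2))   # 0.89: the decks are "nearly" the same
# The numbers stop here. Deciding whether the January 2 or January 3
# email is "the" email is a judgment the arithmetic cannot make.
```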

[Image: twin babies] Which is Jill? Which is Jane? Parents keep both. Does data work like this? Source: http://celebritybabies.typepad.com/photos/uncategorized/2008/04/02/natalie_grant_twins.jpg

Here’s another situation. You are merging two files of credit card transactions. You have data from an IBM DB2 system and data from an Oracle system. The company wants to transform these data, deduplicate them, normalize them, and merge them to produce one master “clean” data table. No, you can’t Google for an offshore service bureau; you have to perform this task yourself. In my experience, the job is going to be tricky. Let me give you one example. You identify two records that agree in field names and data for a single row in Table A and Table B. But you notice that the telephone number varies by a single digit. Which is the correct telephone number? You do a quick spot check and find that half of the entries from Table B have this variant, or you can flip the analysis around and say that half of the entries in Table A vary from Table B. How do you determine which records are duplicates?
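
Here is a sketch of that spot check, assuming the rows have already been pulled out of DB2 and Oracle into plain dictionaries. The field names and the one-digit phone rule are illustrative choices, not a prescription.

```python
def phones_close(a: str, b: str) -> bool:
    """Treat two phone numbers as a near match if they differ in at most
    one digit once punctuation is stripped (an illustrative threshold)."""
    da = [c for c in a if c.isdigit()]
    db = [c for c in b if c.isdigit()]
    if len(da) != len(db):
        return False
    return sum(x != y for x, y in zip(da, db)) <= 1


def classify(row_a: dict, row_b: dict) -> str:
    """Compare a Table A row with a Table B row field by field."""
    if row_a == row_b:
        return "exact duplicate"
    mismatched = {k for k in row_a if row_a.get(k) != row_b.get(k)}
    if mismatched == {"phone"} and phones_close(row_a["phone"], row_b["phone"]):
        return "near duplicate: phone differs by one digit"
    return "different records"


print(classify(
    {"name": "J. Smith", "merchant": "Acme", "phone": "502-555-0143"},
    {"name": "J. Smith", "merchant": "Acme", "phone": "502-555-0148"},
))
# -> 'near duplicate: phone differs by one digit' -- the code can flag
#    the conflict, but it cannot tell you which phone number is correct.
```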

Now you have to deduplicate search results. Let’s assume you are using your whizbang SharePoint system and MOSS (Microsoft Office SharePoint Server) search. You run a query for documents by Tom Smith. You get a list of results, and you see that there are many different Tom Smith documents and some of them are identical. The variations come from different departments, carry different titles, or contain emendations embedded in a Word file with change tracking enabled. The documents carry different dates. Which is the original document? Which is the best and final document?
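
One way such a result list can be folded, sketched under the assumption that each hit exposes its extracted text and a little metadata (the field names are invented; this is not how MOSS actually models results). Exact fingerprints of the body text catch the identical copies; the tracked-changes variants still need the near-duplicate treatment discussed below.

```python
import hashlib
from collections import defaultdict


def body_key(text: str) -> str:
    """Fingerprint a document on its normalized body text only, so the same
    memo filed under different titles, departments, or dates collapses
    into a single group."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode()).hexdigest()


def fold_results(hits):
    """hits: dicts with 'title', 'department', 'date', and extracted 'text'."""
    groups = defaultdict(list)
    for hit in hits:
        groups[body_key(hit["text"])].append(hit)
    # Keeping the latest copy in each group is one possible tie-breaker;
    # whether that is the "best and final" version is a human decision.
    return [max(group, key=lambda h: h["date"]) for group in groups.values()]
```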

Let’s step back:

  • One class of duplicates exists within content produced by a single individual.
  • A second class of duplicates requires determining which data item is the correct one.
  • A third class of duplicates involves figuring out, from many different versions, the best and final document.

These are routine problems, and none is trivial, particularly when you have to deal with big data; that is, tens of millions of records. You can try clever new approaches such as numerical recipes that perform “near” duplicate detection; one example is the method described by Martin Theobald, Jonathan Siddharth, and Andreas Paepcke of Stanford University. Alternatively you can call a vendor, buy a commercial package, and use its output. Vendors offering deduplication systems include Sepaton and Data Domain.
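
For flavor, here is a bare-bones version of the shingling-plus-Jaccard idea that most near-duplicate recipes build on. It is a classroom sketch, not the Stanford method or any vendor’s product, and the pairwise loop would never survive contact with 10 million records; real systems use tricks like minhashing to avoid comparing every pair.

```python
def shingles(text: str, k: int = 4) -> set:
    """Break a document into overlapping k-word windows ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles divided by all shingles."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def near_duplicates(docs, threshold: float = 0.8):
    """Yield pairs of named documents whose shingle sets overlap heavily.
    The 0.8 cutoff is an arbitrary illustrative choice."""
    fingerprints = [(name, shingles(text)) for name, text in docs]
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            score = jaccard(fingerprints[i][1], fingerprints[j][1])
            if score >= threshold:
                yield fingerprints[i][0], fingerprints[j][0], score
```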

Enterprise search vendors offer deduplication functions, but the methods vary from vendor to vendor. In our tests, Vivisimo’s deduplication method does a very good job of eliminating duplicates and near duplicates from search result sets. But the Vivisimo system was not designed to handle the needs of an eDiscovery specialist. eDiscovery vendors Clearwell Systems and Iron Mountain Stratify take different approaches to the problem, but both shift some responsibility to the humans managing the system. The “humans” are often attorneys or trained specialists, and no eDiscovery vendor wants to get between the two parties in a legal matter.

Google generates lists of results that are either deduplicated (a standard Google query for Britney Spears) or partially deduplicated (any listing of similar stories in Google News). The company has some interesting inventions regarding duplicate detection; for example, US6615209, US6658423, and the more recent US2008/0044016. These documents are available from the USPTO without charge.

The future, however, is not a software process. Google’s approach, in my opinion, embeds the deduplication methods into its data centers; that is, what I call the Googleplex of servers and storage devices that make Google work. As a result, Google points to an embedded approach that I think other vendors will pursue aggressively.

Not long ago, I had an exchange with a bright, eager lad from a major search vendor. I asserted that Intel was interested in certain search, indexing, and content processing technologies to create an appliance. The bright, eager lad told me that Intel was not interested in search. Maybe? Maybe not? I think Intel and virtualization vendors want to poke into putting deduplication and similar content processing functions into firmware or into appliances. Here’s why:

  1. Appliance-based content processing can be made to go really fast. If you don’t believe me, check out the throughput for the Exegy content processing appliance.
  2. The appliance’s rules can be locked down and explained to someone. Administrative controls allow some tuning, but the idea is that the appliance delivers a baseline of functionality. Some competitors point out the limitations of appliance-based search solutions. Flip it around: the appliance-based solution delivers a known operation. That’s a benefit, not a liability, in certain situations.
  3. Appliances can be stacked on top of one another; that is, scaling can be accomplished by adding gizmos. The existing server infrastructure is not touched. The outputs of the appliance can be written to disk or just pumped into an existing enterprise system.

As a result, I try to keep my radar aimed at outfits like Intel, Cisco, some Japanese outfits, and the F5 Networks types of companies.

How does this impact enterprise search? Two ways. First, I think enterprise search vendors may face competition from third parties who are attacking the issue of deduplication with a utility-type solution. This may squeeze the dollars available for the core search and content processing vendor. The search vendor will either have to beef up duplicate detection or face more price pressure and more camels with their noses under their tents. Second, the volume of digital content continues to go up. Organizations have to come to grips with the need to store one instance of certain data. The promiscuity of most storage policies creates increasing data management costs and vulnerabilities.

My final comment: watch the data deduplication niche. I think some interesting developments will become more evident in 2009. Wait, do you hear Google’s footsteps with the beast carrying a Google Search Appliance and a network service to perform this work for licensees?

Stephen Arnold, December 29, 2008

