Mysteries of Online 8: Duplicates

February 24, 2009

In print, duplicates are the province of scholars and obsessives. In the good old days, I would sit in a library with two books. I would look at the data in one book, then hunt through the other until I located the same or similar information. Then I would examine each entry to see if I could find differences. Once I located a major difference such as a number, a quotation, or an argument of some type, I would write down that information on a 5×8 note card. I had a forensics scholarship along with some other cash for guessing accurately on objective tests. To get the forensics grant, I had to participate in cross-examination debate, extemporaneous speaking, and just about any other crazy Saturday time waster my “coaches” demanded.

Not surprisingly, mistakes or variances in books, journals, and scholarly publications were not of much concern to some of the students who attended the party school that accepted an addled goose with thick glasses. There were rewards for spending hours looking for information and then chasing down variances. I recall that our debate team, which was reasonably good if you liked goose arguments, was up against a team from Dartmouth College. I was listening when I heard a statement that did not match what I had located in a government reference document and in another source. The opponent from Dartmouth had erroneously presented the information. I gave a short rebuttal. I still remember the look of nausea that crossed our opponent’s face when she realized that I presented what I had found in my hours of manual checking and reminded the judges that distorting information suggests an issue with the argument. We won.


For most people, the notion of two individuals having the same source is an example of duplicate information. Upon closer inspection, duplication does not mean identical in gross features. Duplication drills down to the details of the information: determining which item of information is at variance, figuring out why, and deciding which is the most likely version of the duplicate.

That’s when the fun begins in traditional research. An addled goose can do this type of analysis. Brains are less important than persistence and a tolerance for dull, tedious work. As a result, finding duplicative information and then figuring out variances was not something the typical college sophomore spent much time doing.

Enter computer systems.

Now the issue of duplicates and variances takes an interesting twist. There is now a mindless system that can look at information and find identical items. In the happy world of computer systems, a duplicate is an object that is identical in every respect, right down to the punctuation in the last sentence of the last paragraph. A single variance means that the two information objects (documents) are different.
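
To make the computer’s literal-mindedness concrete, here is a minimal sketch (my own illustration, not drawn from any vendor’s system) showing that a single missing punctuation mark is enough to make two otherwise identical documents compare as different, whether byte for byte or by content hash:

```python
import hashlib

doc_a = "Quarterly revenue rose to $4.2 million."
doc_b = "Quarterly revenue rose to $4.2 million"  # identical text, missing the final period

# Byte-for-byte comparison: any variance at all means "different"
print(doc_a == doc_b)  # False

# The same verdict via content hashes
digest_a = hashlib.sha256(doc_a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(doc_b.encode("utf-8")).hexdigest()
print(digest_a == digest_b)  # False: one missing character changes the digest entirely
```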

The result is that documents alike in most respects but different in one or more features are considered different. The problem is that a human can look at two versions of a story on two different online services and recognize that the stories are “about” the same subject. A quick scan of the two articles will allow most online news sucking humans to determine if these documents are pretty much alike. If there is a difference such as a picture in one article and no picture in another, even a speed reader will spot the difference. If the facts are generally in line in each article, the reader concludes that the two articles are the “same”.

Yikes, the news sucking human has identified a duplicate and that process does not match what the computer does. The human operates at a higher level of abstraction than the average computer. A human discards certain variances and hooks into important differences as the human perceives them. Remember, this is not a scholar who will be using a different mental flight path.

No one wants to read dozens of identical “hits” that are essentially the “same” article repeated over and over. So, online systems have devised methods to eliminate duplicates from certain hit lists. Other systems identify duplicates and then group them together so the news sucking human can read one version of the story and then dig into other variants as time and inclination permit.
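
One way such grouping can work, sketched here under my own assumptions rather than as a description of any particular service’s method, is to break each article into overlapping word sequences (“shingles”) and cluster together hits whose shingle sets overlap heavily:

```python
def shingles(text, k=4):
    """Break a document into overlapping k-word sequences."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Overlap between two shingle sets: 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def group_hits(hits, threshold=0.7):
    """Greedily group hits whose shingle overlap meets the threshold."""
    groups = []  # each entry: (representative shingle set, [member articles])
    for hit in hits:
        sig = shingles(hit)
        for rep_sig, members in groups:
            if jaccard(sig, rep_sig) >= threshold:
                members.append(hit)
                break
        else:
            groups.append((sig, [hit]))
    return [members for _, members in groups]

# The user sees one story per group; the variants sit behind it for those who care to dig.
```

The threshold is the policy decision in disguise: set it too high and the hit list fills with near-identical stories, set it too low and genuinely different articles get lumped together.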

There are, therefore, different meanings to the notion of duplicate content.

On one hand, there is the computer system that can easily match two information objects bit for bit and determine whether the documents are identical. There are shortcuts that make this a relatively speedy process today.
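
One common shortcut, sketched under my own assumptions about a file-based collection, is to avoid pairwise comparison altogether: hash each object once and let identical digests point to identical content.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def exact_duplicates(folder):
    """Group files by content hash in one pass; identical digests mean identical bytes."""
    by_digest = defaultdict(list)
    for path in Path(folder).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Only digests shared by more than one file point to exact duplicates
    return [paths for paths in by_digest.values() if len(paths) > 1]
```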

Then there is a human who looks at two articles and can determine, often without much thought, that the articles are duplicates. But humans can make mistakes, because certain documents that look identical in broad features may have some important minor variances. These “minor variances” can mean the difference between going to jail and avoiding jail, the perp walk, and the awkward reentry explanations.
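
When the stakes are that high, the minor variance has to be surfaced mechanically rather than trusted to a skim. A small illustration using Python’s standard difflib (the payment records are invented for the example) isolates the single changed value in two otherwise identical documents:

```python
import difflib

version_a = """Payee: Smith Holdings
Invoice: 10432
Amount: $25,000.00
Approved by: J. Doe"""

version_b = """Payee: Smith Holdings
Invoice: 10432
Amount: $250,000.00
Approved by: J. Doe"""

# Print only the lines that differ between the two "identical" documents
for line in difflib.unified_diff(version_a.splitlines(), version_b.splitlines(),
                                 fromfile="version_a", tofile="version_b", lineterm=""):
    print(line)
```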

Yikes again. Duplicate content, variances, and similarities are suddenly not such a backwater or semi-conscious perceptual operation. Like so many digital information issues, duplicates often play no role in production. When those involved turn their attention to the issue, duplicates and the task of deduplication take center stage. A steep learning curve presents itself. Duplicates quickly become a cost consideration. In most electronic publishing systems, duplicates and duplicate detection become the focal point of trade-offs, compromises, and shortcuts.

Once again, the uninformed, the trophy generation computer science grad, and the Peter Principle-in-action manager find themselves in scramble mode.

What You Can Buy

A search on Googzilla will reveal that a number of companies offer deduplication tools. Read the fine print. Deduplication for structured data works reasonably well. Deduplication for unstructured or semi-structured data is a different animal. Companies often bundle deduplication routines with other content transformation products. You will want to get trial versions of the software and run tests on content with known duplicates. The performance and efficacy of the vendor’s system can then be determined. The more casual your tests, the greater the risk that you will be unable to answer questions accurately in an adversarial situation. The more stringent your tests, the more informed your answers will be. Guessing is probably not a good idea.
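
One way to keep such a test from being casual, sketched here with hypothetical document identifiers rather than any vendor’s actual output, is to run the tool against a corpus whose duplicate pairs you have already verified by hand and score how many it finds and how many it invents:

```python
def score_dedup(reported_pairs, known_pairs):
    """Score a tool's reported duplicate pairs against a hand-checked answer key."""
    reported = {frozenset(pair) for pair in reported_pairs}
    known = {frozenset(pair) for pair in known_pairs}
    true_hits = reported & known
    precision = len(true_hits) / len(reported) if reported else 0.0
    recall = len(true_hits) / len(known) if known else 0.0
    return precision, recall

# Hypothetical run: the tool finds two of the three known pairs and invents one false match
reported = [("doc1", "doc2"), ("doc4", "doc5"), ("doc7", "doc9")]
known = [("doc1", "doc2"), ("doc4", "doc5"), ("doc3", "doc8")]
print(score_dedup(reported, known))  # roughly (0.67, 0.67)
```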

The ArnoldIT.com approach is to use our custom tools. We know what these scripts can and cannot do. There may be “better” systems available, but when the time window is open one or two inches, the system we know makes my comfort level creep up a notch. Deduplication is dependent on definitions, content collection characteristics, and context. One size does not fit every content foot. Custom tweaking is sometimes required.

Points to Consider

I am dipping into my notes from the 1979 to 1980 period so spare me the complaints that the information is old. (Check out the editorial policy here if you are a newcomer to this Web log.)

  1. What is the policy for duplicates for the particular online system that will be deployed? eDiscovery has one angle of attack; a news service has another.
  2. What is the method that will be used to identify duplicates, near duplicates, and distinct objects? Keep in mind that money, technical expertise, software and machine resources are needed to implement the grand plans for duplicates.
  3. Will the user understand what he / she is looking at when deduplicated results are displayed? Confused users are not a positive. On the other hand, if you get the duplicate method wrong, the user may not go away quietly. Think lawyers suing an organization.
  4. What is the policy regarding errors? Two documents identical in every respect except for one value in a table of payments fall into what category? A duplicate or a mistake? The answer to this question depends upon the context and the parties involved.
  5. What does the system do with duplicates? Delete them? Archive them? There are risks associated with some approaches to archiving and retention; one conservative option is sketched after this list.
  6. When an error becomes known, how will the error be corrected? The policy and the method used require some careful thought.
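
For the question in item five, one conservative option (a sketch under my own assumptions, not any product’s documented behavior) is to keep a single canonical copy in the hit list and archive every other instance with a note about where it came from, so nothing is destroyed:

```python
from dataclasses import dataclass, field

@dataclass
class DuplicateGroup:
    """One searchable canonical copy plus an archived record of every other instance."""
    canonical_id: str
    archived: list = field(default_factory=list)  # (doc_id, source, date_seen) tuples

    def archive(self, doc_id, source, date_seen):
        # Nothing is deleted; the duplicate simply stops appearing in hit lists.
        self.archived.append((doc_id, source, date_seen))

group = DuplicateGroup(canonical_id="report-2009-017")
group.archive("report-2009-041", "shared drive", "2009-02-12")
group.archive("report-2009-044", "email attachment", "2009-02-18")
print(len(group.archived))  # 2 instances retained, 1 copy shown to users
```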

Wrap Up

The issue of duplicative information is not easily confined. The decisions about duplicates made for the Business Dateline database may not be appropriate for today’s online user. Most organizations do not pay much attention to version control, emails that have been doctored by their recipients, text in Word files that the user thinks has been stricken from the “final” version, copies of images in asset management systems, and consultant reports that have been purchased and then copied near and far within an organization so dozens or scores of instances of an “eyes only” document are floating around. There are duplicate Web sites and single Web sites with identical content. The list of duplicate issues can be extended.

The challenges range from basic definitions of duplicates to matters of policy to software methods. Hovering over the issue is a human’s ability to look at two documents and say, “Why are you showing me duplicates?”

Stephen Arnold, February 24, 2009
