Info Fragmentation

January 31, 2010

I don’t want to tackle a big philosophical issue in this blog. I do want to point out that while Google has been explaining that it is not a country, Amazon and Macmillan have agreed to disagree. You can read “Amazon and Macmillan Go to War: Readers and Writers are the Civilian Casualties” for a good rundown.

The point is that online services have for decades been chopping out content when problems arise. The fact is that most online users are clueless about what constitutes an online information system’s content holdings. Researchers jump online, run a query, and grab the results. The perception is that the citation list is complete. A student will run a Google query and assume that Google has everything he or she needs to write a killer essay in 15 minutes for an overworked high school teacher. Attorneys are also falling into the trap of assuming that a body of content is complete and accurate. Wrong, dudes and dudettes, wrong, wrong, wrong.

I can hear the azure chip consultants and the self appointed search experts gasp in horror. This hypothetical reaction from folks who like to watch videos is not surprising because most people do not do detailed bibliographic and collection analysis. When these cuties encounter someone who does this type of work, a miasma of confusion settles over their brows. Here’s the scoop:

  1. A company gets rights to specific information. The publisher changes staff; the database publisher gets an email saying, “The deal must be reworked.” The database publisher doesn’t offer more money or customer names or meet some other requirement. The publisher tells the online vendor to remove the content. This the database producer does, and very few people know that info has disappeared. The only way to track this type of publisher-vendor change is to hope that it becomes a big news item like the Amazon-Macmillan squabble.
  2. An online system has a glitch at loading time. The data * never * make it into the online system. Because most users do not check the online version against a hard copy, few notice. Heck, at the old Dialog when “gentle Ben” screwed up a file load, we had to tell Dialog that its system spit a hair ball. After denials and excuses, the Dialog tape would be reloaded and all was well. Not every database producer performed this quality check. I can hear the owners of ABI/INFORM snorting now. “Quality. We know quality.” Righto.
  3. A user looks in the wrong place for information. Google yaps about universal search, but when you need to find info on Google, you have to know the ins and outs of the news archive, the caches, and the specialty indexes. Skip the manual exercise of running the same query across different indexes, and you will miss info. This happens on most public facing, free systems. Do you run exhaustive queries? I didn’t think so.
  4. Latency. Do you know what this means? Well, in a Web index it means that the spider pings a server and the server doesn’t respond. The spider, impatient lass that she is, moves on. Maybe the spider will come back. Maybe not. This means that if an updated content object resides on a system with latency (that is, a really slow system), the content may not be indexed. Ah, ha. Now how do you as a content provider fix this problem? If you don’t know about it, you may not have a quick fix.
  5. Malformed information. A whiz kid does a post and inserts all types of fancy stuff. If you use a template developed by third parties for your online service, your cute little widget may “kill” the page. The indexing system can’t “see” the page, so the content does not get indexed.
  6. Corrections. I bet you think that when content is online it is the last, best, and final version. Wrong. Most online services * do not * update a static file indexed at a prior time when a correction to that original article appears in print or on a data feed. Don’t believe me? Take a newspaper hard copy that carries a correction to a previous story and run some queries on any online service. Now look for that correction online in the original article. My team did the first database to put corrections into online business news. This was expensive and difficult. No one noticed. I think that the new owners of Business Dateline may have forgotten the original correction part of the editorial cycle.
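
The hard copy versus online check from point 2 can be roughed out in a few lines. What follows is a minimal sketch, not anyone's production method. It assumes a hypothetical publication that appears Monday through Friday and a hypothetical list of issue dates that made it into the online file. The function names, the dates, and the weekday schedule are all invented for illustration.

```python
from datetime import date, timedelta


def expected_issue_dates(start, end, weekdays=(0, 1, 2, 3, 4)):
    """List the dates on which a publication should have appeared.

    The Monday-to-Friday default (0 = Monday) is an assumption for
    this sketch, not a fact about any real title.
    """
    dates = []
    day = start
    while day <= end:
        if day.weekday() in weekdays:
            dates.append(day)
        day += timedelta(days=1)
    return dates


def find_coverage_gaps(expected, loaded):
    """Return expected issue dates that are absent from the online file."""
    loaded_set = set(loaded)
    return [d for d in expected if d not in loaded_set]


# Hypothetical two-week window taken from the print checklist.
expected = expected_issue_dates(date(2010, 1, 4), date(2010, 1, 15))

# Hypothetical online holdings: two issues never made it through the load.
dropped = {date(2010, 1, 7), date(2010, 1, 12)}
loaded = [d for d in expected if d not in dropped]

gaps = find_coverage_gaps(expected, loaded)
for d in gaps:
    print("Missing from online file:", d.isoformat())
```

The same comparison works for article counts per issue or for page ranges. The point is that someone has to hold an independent record of what should be there before a gap can be seen at all.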

There are other reasons why content disappears and then magically comes back when another change takes place. As people do less rigorous research, the cluelessness about comprehensive, accurate collections increases. Know a librarian. Most can help in this department in my experience.

Stephen E Arnold, January 31, 2010

This is a no fee write up. When I give my SLA spotlight talk in June I will demand a free Diet Pepsi. That’s compensation, and I will report this to the Library of Congress, an outfit moving into open source software. I thought collection management was important too.

Comments

One Response to “Info Fragmentation”

  1. Allen Harkleroad on January 31st, 2010 6:19 am

    This news makes me so glad that I am self-published and don’t rely on large book publishers for income.
