Google Wants the World’s Knowledge–Yep, All of It

July 13, 2008

ABC News featured Gregory Lamb’s essay “Could Google Monopolize Human Knowledge.” If you are interested in Google’s scanning project or the prospect of Google monopolizing knowledge, click here.

Between you and me, I don’t know what knowledge is, so I think I am supposed to be flexible in interpreting the title of Mr. Lamb’s essay.

The core of the argument is that Google has cash. The company is scanning books in a shoddy way. And Microsoft, once involved in this expensive game, dropped out. Brewster Kahle is scanning books and looking for funding to continue with his scanning project.

So, if Google keeps on scanning, Google will have page images, crappy ASCII versions of the source documents, and lots of users. I am not doing justice to Mr. Lamb’s analysis.

One point that encapsulated the argument was:

So far, Google isn’t aggressively trying to make money off its book pages, though a few ads and links to buy hard copies from the publisher do appear. Keeping users inside Google’s online “universe” seems to be the company’s long-term motive.

The operative phrase is “long-term motive”. I know that the “don’t be evil” catchphrase clashes with the company’s obligations to Wall Street and stakeholders. In fact, Mr. Lamb cites academics’ annoyance that Google’s optical character recognition sucks. That’s a useful fact because it underscores the lack of understanding some–maybe ABC News, journalists, and University of Virginia professors–bring to commercial information processing.

For several years, I labored in the vineyards at Bell+Howell. The company was one of the leaders in scanning. Converting a paper document to an image file and its accompanying ASCII text is tricky. I am not going into the mechanics of bursting (chopping up source documents in order to get pages that can be fed by the stack into a scanning device), making sure the pages are not misfed and therefore crooked, checking the order of the auto numbered images to make sure that the images are OCR’ed in the correct order, and verifying that a group of scanned images comprising a logical document is properly linked to the ASCII file. This stuff is trivial and too bothersome for amateurs to explore.
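
For readers who want a feel for just one of those checks, here is a minimal sketch of verifying the order of auto numbered images before OCR. It is not how Bell+Howell or Google does it. The file naming pattern (`page_0001.tif`) and the function name `check_page_sequence` are assumptions made up for illustration.

```python
import re


def check_page_sequence(filenames):
    """Return (missing, out_of_order) page numbers for a scanned batch.

    Assumes (hypothetically) that each file name carries its
    auto-assigned page number as the first run of digits,
    for example "page_0001.tif".
    """
    numbers = []
    for name in filenames:
        match = re.search(r"(\d+)", name)
        if match is None:
            raise ValueError(f"no page number in {name!r}")
        numbers.append(int(match.group(1)))

    if not numbers:
        return [], []

    # A page is out of order if its number is lower than the page
    # that came through the scanner immediately before it.
    out_of_order = [
        numbers[i] for i in range(1, len(numbers))
        if numbers[i] < numbers[i - 1]
    ]

    # A page is missing if its number never shows up between the
    # lowest and highest numbers in the batch.
    expected = set(range(min(numbers), max(numbers) + 1))
    missing = sorted(expected - set(numbers))
    return missing, out_of_order


batch = ["page_0001.tif", "page_0002.tif", "page_0004.tif",
         "page_0003.tif", "page_0006.tif"]
missing, out_of_order = check_page_sequence(batch)
print("missing:", missing, "out of order:", out_of_order)
```

A check like this only flags the problem. A human still has to pull the stack, find the misfed page, and rescan it.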

The core issue is that libraries lack the funds to manage their vertical file collections. When an author dies or a photographer falls out of a helicopter, the heirs often gather up manuscripts, notebooks, negatives, and pictures. A quick trip to the local university library allows the heirs to get a tax deduction and get the stuff out of the kids’ garage. Libraries are choking on this primary material. The Library of Congress has a warehouse full of important primary material and it lacks the funds to catalog and make the hard copy materials available to researchers. Scanning of important materials such as those found in the American Memory project is not funded by government money. The librarians have to do fund raising.

University libraries are in worse financial shape. Public libraries, if you can believe it, are further down the drain.

And publishers? These folks are fighting for survival. If a bright young Radcliffe Institute for Advanced Study post doc gets the idea to scan a book on the publisher’s back list, our take charge, eager beaver will be flipping burgers at the McDonald’s on Times Square.

Let’s review some facts, always painful to folks like those in the news business:

  1. Scanning sucks. Optical character recognition sucks more. Fixing lousy ASCII requires human editors because software still is not infallible. 97% accuracy means three errors per 100 words. If an insect gets trapped in the scanner, accuracy can be adversely affected because the source image has a big bug over the text. The OCR engine can’t figure out what’s under the bug, so 97% drops to 96%. The fix is fuzzy algorithms, trigrams, and other tricks to make lousy ASCII useful. I have been in the information processing business for a long time. OCR sucks less today, but it still sucks.
  2. Scanning is expensive. If Google quits scanning, who is going to do the work and pay the bill? My hunch is that if we asked graduate school professors to work one day a week to scan the primary material in their institution’s library, the request would be received with derisive scorn. Scanning is messy, dirty, tedious, fraught with errors, and dull, dull work. Operating a scanner and performing the human sheep herding is tough work. Volunteers from the UVa?
  3. Google is using the book project in several ways. The good news is that making book search available is useful to scholars. If you look at the fees levied by our friends at Ebsco, ProQuest, Reed Elsevier, and Thomson Reuters–Google’s “free” looks pretty good to me. The bad news is that few people outside of Google understand what the book scanning project provides to Google. And I am not going to include that item in a free Web log post. Google isn’t scanning because it’s cheap. There are technical and economic reasons the company is investing in the project, haphazard as it is.
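
To put numbers on the first fact, here is a minimal sketch of the error arithmetic and of the trigram trick for making lousy ASCII useful. It assumes accuracy is measured per word, as in the 97% example above. The function names, the space padding convention, and the 0.3 threshold are illustrative assumptions, not anyone's production method.

```python
def expected_word_errors(word_count, accuracy):
    """Expected number of wrong words at a given word-level accuracy."""
    return word_count * (1.0 - accuracy)


def trigrams(word):
    """Set of three-character slices of a word, padded with spaces
    so that the start and end of the word also form trigrams."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of two words' trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def best_match(ocr_word, vocabulary, threshold=0.3):
    """Return the vocabulary word most similar to a damaged OCR word,
    or None if nothing scores at or above the threshold."""
    best_word, best_score = None, 0.0
    for candidate in vocabulary:
        score = trigram_similarity(ocr_word, candidate)
        if score > best_score:
            best_word, best_score = candidate, score
    return best_word if best_score >= threshold else None


# Three errors per 100 words adds up fast on a 300-page book
# with roughly 400 words per page (both figures are assumptions).
print(round(expected_word_errors(300 * 400, 0.97)))

# "knowlcdge" is the kind of damage OCR produces when an "e" reads as "c".
print(best_match("knowlcdge", ["knowledge", "college", "google"]))
```

The point of the trigram approach is that a search for “knowledge” can still hit a page where the OCR engine produced “knowlcdge”, without a human editor touching the file. It does not fix the text. It just makes the bad text findable.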

Perhaps Kirtas Technologies, maker of five- and six-digit scanning and OCR systems, will dip into the company’s vast cash surplus and do the job right? The reality is that Kirtas won’t scan one page unless it is doing a demo or getting paid to do the work. It’s easy to criticize; it is harder to do this work when you have to write the checks from the Kirtas bank account. Based on my information, Google has a bit more financial headroom than Kirtas.

Observations

  1. Mr. Lamb has a well written essay that contributes to the spate of essays bashing Google. This is a mini trend, and I think the criticism will increase as it dawns on pundits that Google has been beavering away without competition for a decade. Now that the company is the dominant search system and primary online advertising engine, it’s time to point out the many flaws in Google. Sorry, OCR is what it is. Google is what it is.
  2. The complexity of certain Google activities is not well understood. Overlooking the economics of scanning is an important omission in the essay. A question to ask is, “If not Google, who?” I don’t have many names on my short list. Obviously the Bill and Melinda Gates Foundation wasn’t ready to pick up the thrown Microsoft ball.
  3. Google is a very different enterprise. I marvel at how Wall Street sees Google in terms of quarterly ad revenue. I am amazed at analyses of one tiny Google initiative. Google is a game changer, and the book project is a tiny component in a specific information initiative. Do any of the Beyond Search Web log readers know what it is? If you do, use the comments section to fill me in.

Google has reader pull. I look forward to more “flows like water” analyses of the GOOG. Over time, one of the reports will further our understanding of Googzilla. Film at 11.

Stephen Arnold, July 13, 2008

Comments

5 Responses to “Google Wants the World’s Knowledge–Yep, All of It”

  1. Jim on July 29th, 2008 7:55 am

    Kirtas is busy cutting 35 jobs, leaving them with just 100 employees. They are still using taxpayer funds to prop up the company.

    http://www.13wham.com/news/local/story.aspx?content_id=d8573cf3-e2d2-49c2-adf4-5aa7d6eb251d

  2. Stephen E. Arnold on July 29th, 2008 9:10 am

    Jim,

    Thanks for the link. I had heard that the lower cost high speed scanning gear coupled was having an impact on some US vendors.

    Stephen Arnold, July 28, 2008

  3. Share Quotes : on October 26th, 2010 1:34 am

    i always thought that ABC news is even better than CNN when delivering up to date news.

  4. Health and Medicine Forum · on November 12th, 2010 12:37 pm

    abc news is of course one of the most reputable news sources these days

  5. Jayna Preskar on June 4th, 2011 10:03 pm

    It’s satisfying to see a writer with high standards like yours. I know this from reading your write-up. You’ve got plenty of useful details.
