Arnold at NFAIS: Google Books, Scholar, and Good Enough

June 26, 2009

Speaker’s introduction: The text that appears below is a summary of my remarks at the NFAIS Conference on June 26, 2009, in Philadelphia. I talk from notes, not a written manuscript, but it is my practice to create a narrative that summarizes my main points. I have reproduced this working text for readers of this Web log. I find that it is easier to put some of my work in a Web log than it is to create a PDF and post that version of a presentation on my main Web site, www.arnoldit.com. I have skipped the “who I am” part of the talk and jumped straight to the core of the presentation.

Stephen Arnold, June 26, 2009

In the past, epics were a popular form of entertainment. Most of you have read the Iliad, possibly Beowulf, and some of you Gilgamesh. One convention is that these complex literary constructs begin in the middle, or what my grade school teacher called “in medias res.”

That’s how I want to begin my comments about Google’s scanning project, an epic usually referred to as Google Books. Then I want to go back to the beginning of the story and then jump ahead to what is happening now. I will close with several observations about the future. I don’t work for Google, and my efforts to get Google to comment on topics are ignored. I am not an attorney, so my remarks have zero legal foundation. And I am not a publisher. I write studies about information retrieval. To make matters even more suspect, I do my work from rural Kentucky. From that remote location, I note that Amazon is concerned about Google Books, probably because Google seeks to enter the eBook sector. This story is about good enough; that is, in a project so large and so sweeping, perfection is not possible. Pages are skewed. Insects are scanned. Coverage is hit and miss. But what other outfit is prepared to spend what it takes to scan books?

Let’s begin in the heat of the battle. Google is fighting a number of things. Google finds itself under scrutiny from publishers and authors. These are the entities with whom Google signed a “truce” of sorts regarding the scanning of books. Increasingly, libraries have begun to express concern that Google may not be doing the type of preservation job needed to keep the source materials in a suitable form for scholars. Regulators have taken an interest in the matter because of the publicity swirling around a number of complicated business and legal issues.

These issues threaten Google with several new challenges.

First, since its founding in 1998, Google has enjoyed what I would call positive relationships with users, stakeholders, and most of its constituents. The Google Books matter is now creating what I would describe as “rising tension”. If the tension escalates, a series of battles can erupt in the legal arena. As you know, battle is risky when two heroes face off in a sword fight. Fighting in a legal arena is in some ways more risky and more dangerous.

Second, the friction of these battles can distract Google from other business activities. Google, as some commentators have noted, including myself in Google: The Digital Gutenberg, may be vulnerable to new types of information challenges. One example is Google’s absence from the real time indexing sector where Facebook, Twitter, Scoopler.com, and even Microsoft seem to be outpacing Google. Distractions like the Google Books matter could exclude Google from an important new opportunity.

Finally, Google’s approach to its projects is notable because the scope of the project makes it hard for most people to comprehend. Scanning books takes exabytes of storage. Converting images to ASCII, transforming the text (that is, adding structure tags), and then indexing the content takes a staggering amount of computing resources.
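The three steps named above (image to text, text to tagged structure, tagged structure to index) can be pictured with a toy Python sketch. This is an illustration only and not a description of Google’s actual pipeline: it assumes the OCR step has already produced plain text, it assumes pages are separated by form feed characters, and the names tag_structure, build_index, and search are hypothetical.

```python
import re
from collections import defaultdict


def tag_structure(ocr_text):
    """Add minimal structure: split OCR output into numbered pages.

    Assumption for this sketch: the OCR step marks page breaks with a
    form feed character.
    """
    pages = ocr_text.split("\f")
    return [{"page": n, "text": page.strip()} for n, page in enumerate(pages, start=1)]


def build_index(book_id, pages, index=None):
    """Build an inverted index mapping each word to (book_id, page) pairs."""
    if index is None:
        index = defaultdict(set)
    for page in pages:
        for word in re.findall(r"[a-z0-9]+", page["text"].lower()):
            index[word].add((book_id, page["page"]))
    return index


def search(index, term):
    """Return the sorted (book_id, page) pairs on which the term appears."""
    return sorted(index.get(term.lower(), set()))


# Usage: two "scanned" pages from one book.
pages = tag_structure("Sing, goddess, the anger\fof Achilles")
index = build_index("iliad", pages)
hits = search(index, "Achilles")
```

Even in this toy form, every word on every page produces an index entry, which hints at why the real task consumes so much storage and computing power.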

Figure: Inputs to outputs, an idea that was shaped between 1999 and 2001. © Stephen E. Arnold, 2009

Google has been measured and slow in its approach. The company works with large libraries, provides copies of the scanned material to its partners, and has tried to keep moving forward. Microsoft and Yahoo, database publishers, the Library of Congress, and most libraries have ceded the scanning of books work to Google.

Now Google finds itself having to juggle a large number of balls.

Now let’s go back in time.

I have noticed that most analysts peg the Google Books project as starting right before the initial public offering in 2004. That’s not what my research has revealed. Google’s interest in scanning the contents of books reaches back to 2000.

In fact, an analysis of Google’s patent documents and technical papers for the period from 1998 to 2003 reveals that the company had explored knowledge bases, content transformation, and mashing up information from a variety of sources. In addition, the company had examined various security methods, including methods to prevent certain material from being easily copied or repurposed.

The idea, which I described in The Google Legacy (written in 2003 and 2004 and published in early 2005), was to gather a range of information, process that information using mathematical methods to produce useful outputs like search results for users, and generate information about the information. The word usually given to value added indexing is metadata. I prefer the less common but more accurate term meta indexing.

The scanning part was to take information “locked in” paper or analog form and convert it to a digital form. Google realized that it could not populate its knowledge bases by buying scanning services from commercial sources. That would be too expensive and create a dependency. Traditional database producers and publishers were not in a financial or technical position to undertake a project that would attempt to convert books into digital form. Google seemed to have hit a dead end.

After two or three years of preliminary investigation and engineering research, Google hired Wayne Rosing. Mr. Rosing was and still is a technical wizard, although he no longer works full time at Google. He brought to the company expertise in optical character recognition. He joined Google from Caere, where he was instrumental in that firm’s scanning and OCR technology. He also brought pragmatic engineering expertise, which contributed to the development by Google suppliers of specialized scanning equipment, sophisticated algorithms to deal with the curvature of thick volumes, and work flow processes that allow Google to “drop in” a scanning operation so that a library’s operation is not significantly disrupted.

The project began filling Google’s servers with book information, meta indexing, and digitized content that supplemented Google’s own knowledge bases. The knowledge bases contain its Web index, information about Google’s systems, and the information from scanning books.

As I reported in The Google Legacy in 2005, the notion of knowledge bases at Google was important because “smart software” looks at the knowledge bases or their “values” in order to make “decisions”. Google deals with mathematics and the knowledge bases exist as collections of meaningful values, data, and “digital envelopes”. When you search for spears, the system displays information about “Britney Spears”. You have Google’s knowledge bases to thank or blame for that approach.

Since its beginnings in the 1999 to 2000 period, the Google Books project has expanded to include magazines. You can see this yourself. Navigate to Google Books and click on a link to a magazine. The magazine covers appear, and you can see a very rich mash up of information about a particular issue. I like to think of this as what the Union List of Serials should have been. But as with most Google services, Google has stepped in and done a job that a publisher like Bowker or Cengage could have done. Google has moved into a sector where I think the Library of Congress or the British Museum should have taken the lead. We know the publishers and the national libraries did not do the job. Google began and now is playing a role that easily could have been played by other organizations. These organizations did not. Google did.

The difference, of course, is that Google operates in a one-to-one world. The Google computing infrastructure eliminates the multiple, serialized steps between a user and an answer. I can see that the Google approach makes it possible for a person with a Web log to use Google as a new publishing medium.

When you look at a typical Google Books page, you see a number of functions. I don’t have time to work through each of these. But the Google Books system includes a way for Google to sell a book, provide information, and display to the user sections that include the search term. I think the system is quite usable, but it is, in a sense, the tip of the iceberg. The information resides within the Google infrastructure so Google can add features, bells, and whistles with little delay and only an incremental cost.

I call this power leveling because Google just does the work. Most observers fail to explain exactly what Google’s strategy is in a particular initiative. To illustrate: What’s the long term contribution of Google Wave to Google Books?

Now let’s move to the dénouement for this epic battle. I don’t know how Google will resolve its many challenges with its Google Books project. I do know from the research for my 2007 study Google Version 2.0 that Google had completed most of the core functionality for its globe-wrapping computing infrastructure. Since 2006, Google has been accelerating its application development. The company now has an information application platform that makes it possible for the company to play one or more roles in the global information industry. The company can be a primary publisher as it is with its Knol and Web log services. The company can be a content distributor as it is with its iGoogle service and its YouTube.com product. In short, as I describe in Google: The Digital Gutenberg, a publisher can use Google to run a proprietary information business using only Google and making Google a partner in the venture.

Google has patent documents that describe how the partners can control virtually every aspect of an information business. The invention is described in terms of video content, but the system and method, as the patent authors disclose, can be applied to other media, as is evident to “one skilled in the art.”

Google offers input forms for its Local service that provide a free “yellow page” type listing, knowledge to Google’s knowledge bases, and useful information to Google users via tethered or mobile computing devices. I have mentioned Knol, which is a type of user built encyclopedia. Google publishes a large amount of information via its more than 70 Web logs, which you can search without charge on my Overflight service. Google even makes it possible for me to have a multimedia ad about myself, hooked into Google Maps and a Google Profile. Educators can weave these services together to provide a rich instruction service to middle school students. You can create a Google “magazine” individualized for you. You can watch a Google channel on YouTube.com. Even the Pope has a Vatican channel on YouTube.com. Google and education is emerging as the “next big thing”. You can follow education in terms of Google Apps or in its new Digital Education Portal.

Let me give one example of how Google’s smart software can make use of the knowledge bases. Consider a Google software agent examining a “fact table” in a knowledge base. You can see these “fact tables” in Google Base or Google Fusion. When the software agent doesn’t know whether a value is “within range”, the software agent can look in a library for another mathematical method. If that does not resolve the issue, the agent can consult the Google knowledge bases. The idea is that Google refines the “values” in its fact table over time. This is just good engineering, not a science fiction writer’s notion of a super human brain.
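The fallback sequence in this example (try one method, then another from a library, then the knowledge bases, and refine the fact table along the way) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Google’s implementation: the two range methods, the check_value function, and the shape of the fact table and knowledge base are all hypothetical.

```python
from statistics import mean, stdev


def range_by_minmax(values):
    """First method: the span of values already seen (needs two or more)."""
    if len(values) < 2:
        return None  # undecided: not enough evidence
    return min(values), max(values)


def range_by_stdev(values):
    """Another method from the 'library': mean plus or minus two standard deviations."""
    if len(values) < 3:
        return None  # undecided: not enough evidence
    m, s = mean(values), stdev(values)
    return m - 2 * s, m + 2 * s


def check_value(attribute, value, fact_table, method_library, knowledge_base):
    """Return (verdict, source) and refine the fact table with accepted values.

    verdict is True or False when some source could decide, or None when
    neither the methods nor the knowledge base could resolve the question.
    """
    observed = fact_table.get(attribute, [])
    bounds, source = None, "unresolved"
    for method in method_library:
        bounds = method(observed)
        if bounds is not None:
            source = method.__name__
            break
    if bounds is None:
        # No method could decide, so consult the (hypothetical) knowledge base.
        bounds = knowledge_base.get(attribute)
        if bounds is None:
            return None, "unresolved"
        source = "knowledge_base"
    low, high = bounds
    verdict = low <= value <= high
    if verdict:
        # Refine the fact table over time with each accepted value.
        fact_table.setdefault(attribute, []).append(value)
    return verdict, source


# Usage: one known page count, plus a knowledge base with a broad plausible range.
fact_table = {"page_count": [300]}
knowledge_base = {"page_count": (50, 1500)}
library = [range_by_minmax, range_by_stdev]
first = check_value("page_count", 320, fact_table, library, knowledge_base)
```

In this sketch the first check has to fall back to the knowledge base because the fact table holds only one value. Once the accepted value has been added, later checks can be decided from the fact table alone, which is the sense in which the values are refined over time.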

The outputs of these systems are interesting. Google does not provide much detail, but one tantalizing example became available in 2007. The Google system, according to patent document 20070198481, generates a dossier about the subject of the user’s query, in this case Michael Jackson, the pop star.

In closing, will there be a sequel to this epic battle between Google and its challengers? I have no idea. I can offer three closing observations:

First, Google has a system that works a bit like Lego blocks. Services, even information, can be snapped together. It is, therefore, imperative that those who want to understand Google look beyond advertising, Web search, and the squabble over Google Books. The company can morph without warning. This makes Google a very formidable competitor. How long would it take Google to become a publisher and resolve copyright by asking me to “publish” my next study for Google, for distribution by Google, and for monetization by Google? In my case, not long at all. My traditional publishers are struggling and their woes impact my financial future. Maslow’s hierarchy comes into play, not a love of tradition.

Second, those fighting Google have to recognize that Google is not a small company. Forget the lava lamps. Google can be a dominant force in certain battles. Without resources, fighting Google can be a difficult proposition. Viacom has been chasing Google for years. What’s the status? Stalled by legal maneuvers. This is an arena for those with considerable funds, lawyers, and stamina. European legal challenges may be contentious. Google Books is not deep linking. Google Books is a large dataspace.

Third, I am pragmatic. For years, I have been urging publishers to surf on Google. Now “wave” has another meaning. Google’s newest technology can engulf some organizations. For some, Google presents an opportunity for a thrilling ride. For libraries faced with funding pressures, Google offers one way to obtain digital instances. For scholars, something good enough may have to do. For others, Google represents a powerful force that can change landscapes. Like some natural forces, Google operates slowly. Are we discerning what is truly significant about Google Books? Are we watching a minor feature, not the major thrust of the activity? I am trying to get the right perspective. Are you?

Stephen Arnold, June 26, 2009

Comments

4 Responses to “Arnold at NFAIS: Google Books, Scholar, and Good Enough”

  1. Dave Kellogg on June 27th, 2009 12:10 am

    Great post, I wish I saw the speech.

    The biggest “aha moment” I’ve had on Google Book Search was during the Mark Logic user conference where I interviewed Dan Clancy and he said this is something I want to be remembered for. While we can analyze projects from a corporate level and their strategic importance, there is also an individual level and Google — thanks to its massive resources — has provided a platform where they can give talented people the resources and opportunity to work on things that motivate them at very deep levels.

    That’s powerful and cool — and dangerous if you’re on the other side.

  2. Google and Scientific Tagging : Beyond Search on June 28th, 2009 3:01 am

    […] my talk on June 26, 2007 for NFAIS, a question came from one of the participants in the Webcast of my […]

  3. Bibliotheken Twitter Web 2.0 en Zoeken: overzicht juni 2009 « Dee’tjes on July 3rd, 2009 5:22 am

    […] Google Books, Scholar, and Good Enough (Steve Arnold at NFAIS: Beyond Search) is worth reading […]
