
Arnold at NFAIS: Google Books, Scholar, and Good Enough

June 26, 2009

Speaker’s introduction: The text that appears below is a summary of my remarks at the NFAIS Conference on June 26, 2009, in Philadelphia. I talk from notes, not a written manuscript, but it is my practice to create a narrative that summarizes my main points. I have reproduced this working text for readers of this Web log. I find that it is easier to put some of my work in a Web log than it is to create a PDF and post that version of a presentation on my main Web site, www.arnoldit.com. I have skipped the “who I am” part of the talk and jump into the core of the presentation.

Stephen Arnold, June 26, 2009

In the past, epics were a popular form of entertainment. Most of you have read the Iliad, possibly Beowulf, and some Gilgamesh. One convention is that these complex literary constructs begin in the middle or what my grade school teacher called “in medias res.”

That’s how I want to begin my comments about Google’s scanning project, an epic usually referred to as Google Books. Then I want to go back to the beginning of the story and then jump ahead to what is happening now. I will close with several observations about the future. I don’t work for Google, and my efforts to get Google to comment on topics are ignored. I am not an attorney, so my remarks have zero legal foundation. And I am not a publisher. I write studies about information retrieval. To make matters even more suspect, I do my work from rural Kentucky. From that remote location, I note that Amazon is concerned about Google Books, probably because Google seeks to enter the eBook sector. This is a story of “good enough”; that is, in a project so large and so sweeping, perfection is not possible. Pages are skewed. Insects are scanned. Coverage is hit and miss. But what other outfit is prepared to spend to scan books?

Let’s begin in the heat of the battle. Google is fighting a number of things. Google finds itself under scrutiny from publishers and authors. These are the entities with whom Google signed a “truce” of sorts regarding the scanning of books. Increasingly libraries have begun to express concern that Google may not be doing the type of preservation job to keep the source materials in a suitable form for scholars. Regulators have taken an interest in the matter because of the publicity swirling around a number of complicated business and legal issues.

These issues threaten Google with several new challenges.

First, since its founding in 1998, Google has enjoyed what I would call positive relationships with users, stakeholders, and most of its constituents. The Google Books matter is now creating what I would describe as “rising tension”. If the tension escalates, a series of battles can erupt in the legal arena. As you know, battle is risky when two heroes face off in a sword fight. Fighting in a legal arena is in some ways more risky and more dangerous.

Second, the friction of these battles can distract Google from other business activities. Google, as some commentators have noted, including myself in Google: The Digital Gutenberg, may be vulnerable to new types of information challenges. One example is Google’s absence from the real time indexing sector where Facebook, Twitter, Scoopler.com, and even Microsoft seem to be outpacing Google. Distractions like the Google Books matter could exclude Google from an important new opportunity.

Finally, Google’s approach to its projects is notable because the scope of the project makes it hard for most people to comprehend. Scanning books takes exabytes of storage. Converting images to ASCII, transforming the text (that is, adding structure tags), and then indexing the content takes a staggering amount of computing resources.


Inputs to outputs, an idea that was shaped between 1999 and 2001. © Stephen E. Arnold, 2009

Google has been measured and slow in its approach. The company works with large libraries, provides copies of the scanned material to its partners, and has tried to keep moving forward. Microsoft and Yahoo, database publishers, the Library of Congress, and most libraries have ceded the work of scanning books to Google.

Now Google finds itself having to juggle a large number of balls.

Now let’s go back in time.

I have noticed that most analysts peg the Google Books project as starting right before the initial public offering in 2004. That’s not what my research has revealed. Google’s interest in scanning the contents of books reaches back to 2000.

In fact, an analysis of Google’s patent documents and technical papers for the period from 1998 to 2003 reveals that the company had explored knowledge bases, content transformation, and mashing up information from a variety of sources. In addition, the company had examined various security methods, including methods to prevent certain material from being easily copied or repurposed.

The idea, which I described in The Google Legacy (which I wrote in 2003 and 2004 with publication in early 2005), was to gather a range of information and process that information using mathematical methods in order to produce useful outputs like search results for users and to generate information about the information. The word given to describe value added indexing is metadata. I prefer the less common but more accurate term meta indexing.


Library Teaches Search - More Instruction Needed

June 22, 2009

My recollection is that libraries taught search as far back as 1980. I recall that either database vendors would run demonstrations or that librarians skilled in the use of online would provide guidance to those who asked. I recall running a class in ABI/INFORM at Chicago Public Library and there was an overflow crowd of both staff and research minded patrons. I was delighted, therefore, to see an article in the Sacramento Bee that described the Sutter Library’s classes in finding health and medical information online. The class is a reminder to me that:

  1. Librarians and information professionals often know how to search and have an interest in sharing that knowledge
  2. Patrons are smart enough to know that despite the marketing hype and the pundits’ assertions that search is a “done deal” additional instruction attracts people and finds its way into The Sacramento Bee

We have a long way to go before information professionals will be relics of a long gone time. The people who tell me that they “know how to search” and “can locate almost anything online” are kidding themselves. I think I am a reasonably good researcher. But if you spend time monitoring how I find information, you will learn quickly that I turn to experts who make my search skills look primitive. Even my nifty Overflight system pales in comparison with the type of information that my research team generates by:

  • Knowing what content is located where
  • Understanding the editorial method behind or absent from certain online systems
  • Leveraging hard-to-manipulate resources such as information from government repositories, specialized services, and individual experts.

I would like to see more libraries move aggressively into online instruction, market those programs, and raise the level of expertise. Most of the people who claim to be experts at search are clueless about how bad their skills are. Among the worst offenders are self appointed search experts who have trouble figuring out when something is likely to be baloney and when something is just plain wrong. Enterprise search, content management, and text mining are three disciplines where better research will be most beneficial in my opinion. Then we need critical thinking skills. Schools have dropped the ball. Maybe libraries can help in this area as well? Search procurement teams will be well served if the team has one or more librarians in the huddle.

Stephen Arnold, June 22, 2009

SirsiDynix Search Plus Discovery for Libraries

May 24, 2009

Brainware landed a deal to provide search and discovery to SirsiDynix. After a bit of poking around, I learned that SirsiDynix wanted to move beyond key word search and provide users of its library systems with discovery functions. “Discovery”, as used in this sense, refers to giving a person looking for information easy-to-use methods to look for related information and suggested information also germane to the user’s query. Endeca hooked up with Ebsco to provide “guided navigation” to Ebsco customers. Most online public access catalogs and library-centric search systems match the users’ query terms or force the user to search by entering an author’s name. Change, at long last, seems to be coming to the library for search of an institution’s textual information. I wrote about some of the Brainware system’s capabilities in my 2008 study “Beyond Search” for the Gilbane Group here. I also did a short write up about Brainware in this Web log in early 2008 here.

A reader alerted me to an announcement here that SirsiDynix will roll out an enhanced enterprise search and discovery system to over 30 libraries. You can read that announcement here. The system includes such features as:

  • Trigram analysis, or “fuzzy logic” which evaluates each trigram in a word to allow for typos, diacritics and more: a first in the library search and discovery market
  • “Did you mean” suggestions which are based on terms in the catalog (rather than a generic third-party dictionary)
  • Dynamic search suggestions
  • Delivery of saved searches through an RSS web feed
  • Email and print options for search results
  • Built-in “Library Favorites”
  • The capability for libraries to define their own “Favorites”, profiles, languages and filters.
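For readers who have not bumped into trigram matching, here is a minimal sketch of the general idea behind the first two bullets. It is not Brainware’s or SirsiDynix’s implementation, which I have not seen. The function names, the Jaccard overlap score, and the 0.3 threshold are my assumptions for illustration. The sketch breaks each word into overlapping three-character chunks, folds diacritics to plain letters, and offers a “did you mean” suggestion drawn from the catalog’s own terms rather than a generic dictionary.

```python
import unicodedata


def fold(text):
    """Lowercase and strip diacritics so that 'Dvořák' matches 'Dvorak'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def trigrams(word):
    """Return the set of overlapping three-character chunks in a word.

    The word is padded with spaces so that its first and last letters
    contribute chunks of their own.
    """
    padded = f"  {fold(word)} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of two trigram sets: 1.0 is identical, 0.0 is disjoint."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def did_you_mean(query, catalog_terms, threshold=0.3):
    """Suggest the closest catalog term, or None if nothing is close enough.

    The threshold is an arbitrary illustrative value, not a vendor setting.
    """
    scored = [(similarity(query, term), term) for term in catalog_terms]
    best_score, best_term = max(scored, default=(0.0, None))
    return best_term if best_score >= threshold else None


if __name__ == "__main__":
    catalog = ["Gutenberg", "Gilgamesh", "Beowulf"]
    print(did_you_mean("Gutenburg", catalog))
```

Because the suggestion list is built from terms already in the catalog, a misspelled query can only be steered toward something the library actually holds. That is the appeal of the catalog-based approach over a third-party dictionary.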

You can test the Brainware powered “enterprise” service at the Wells County Public Library here.

The library market has been under severe price competition. This information sector is coming under more and more pressure from Google. The world’s largest search provider has been slowly expanding its services, including the controversial Google Books program. So far, specialized vendors of library information systems have been able to maintain their grip on today’s slippery economic one lane highway. The impact of Google on this market will be interesting to observe.

Stephen Arnold, May 24, 2009

Google and Libraries

May 1, 2009

The USPTO must be clearing backlogs. A flurry of Google patent documents became available. Several were uninteresting (floating data centers, query expansion), but one struck me as having some disruptive potential. I refer to Library Citation Integration, US7526475. You can get the document from the USPTO at http://www.uspto.gov. The abstract stated:

An online search system generates an index of documents using index information received from a library. Some documents have restricted access; some documents may not be available online. The search system provides links to documents in the library as well as other sites based on a search, and may include link resolvers received from the library. The search system provides access links to the link resolvers if an identifier, such as a user identification or IP address, matches an affiliation list from the library.

Why? Think for a moment about the commercial database vendors, the online public access catalog vendors, and the companies building content for institutional use. I thought the pointing function to items in the OCLC system was interesting. This invention gives the Google an opportunity to stomp, should it choose to do so, in some other vineyards. Who will be squashed into fine wine? I don’t drink, so I might not be affected. Those in the library ecosystem might have a different view.
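The last sentence of the abstract quoted above describes a simple gate: show a library’s link resolver only when the searcher’s identifier matches that library’s affiliation list. Here is a rough sketch of how such a check might look. It is my own illustration, not code from the patent document or from Google. The class, the function names, the resolver URL, and the sample addresses are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Library:
    """A library's affiliation list plus its link resolver (all hypothetical)."""
    name: str
    resolver_url: str
    affiliated_networks: list = field(default_factory=list)  # CIDR strings
    affiliated_users: set = field(default_factory=set)       # user IDs


def is_affiliated(library, user_id=None, ip=None):
    """True if the user ID or the IP address appears on the affiliation list."""
    if user_id is not None and user_id in library.affiliated_users:
        return True
    if ip is not None:
        address = ipaddress.ip_address(ip)
        return any(
            address in ipaddress.ip_network(network)
            for network in library.affiliated_networks
        )
    return False


def access_links(document_id, libraries, user_id=None, ip=None):
    """Build resolver links only for libraries the searcher is affiliated with."""
    return [
        f"{library.resolver_url}?id={document_id}"
        for library in libraries
        if is_affiliated(library, user_id=user_id, ip=ip)
    ]


if __name__ == "__main__":
    campus = Library(
        name="Example University Library",
        resolver_url="https://resolver.example.edu/openurl",
        affiliated_networks=["192.0.2.0/24"],
        affiliated_users={"patron42"},
    )
    # On-campus address: the resolver link is offered.
    print(access_links("doc1", [campus], ip="192.0.2.10"))
    # Off-campus address with no user ID: nothing is offered.
    print(access_links("doc1", [campus], ip="198.51.100.7"))
```

The point of the sketch is that the search system, not the library, decides at query time which restricted links a person sees. That is why the vendors who sit between the library and its patrons may want to pay attention.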

Stephen Arnold, May 1, 2009

Amsterdam Breathes New Life into Old Information Institution

April 19, 2009

A happy quack to the reader who sent me the link to Andrew Keen’s “Digital Dutch Masterpieces” here. The article points out that libraries can be both old and new media. He wrote:

at the Amsterdam public library. Instead of the dustiness and crustiness of the typical 20th century library, visitors to Amsterdam’s central public library will find not only books, but a restaurant as well as a children’s theatre and a public radio and television studio. The library, which is open every day from 10.00 am to 10.00 pm, also holds a series of cultural festivals – such as the upcoming week of poetry – which it then broadcasts on the Internet. Amsterdam library’s website epitomizes its innovative approach to the 21st curation of knowledge. The website features its own customized search engine, the “aquabrowser”, which has integrated the library’s books, CDs and DVDs as well as a rich archive of Amsterdam’s history and culture. Equally innovatively, the website provides those who use it within the walls of the library itself open access to all its digital content.

I did not resonate with the assertion that the library has a “return on investment”. That phrase has a specific meaning in financial circles. I think that the Amsterdam effort returns significant social value. One hopes other libraries absorb the lessons of this case.

Stephen Arnold, April 19, 2009

Potential Trouble for LexisNexis and Westlaw

March 2, 2009

Most online surfers don’t click to Reed Elsevier’s LexisNexis or Thomson Reuters Westlaw. The reason? These commercial services charge money–quite a lot of money–to access legal documents. Executives at both firms can deliver compelling elevator pitches about the added value each company brings to legal documents. In the pre-crash era, legal indexing was a manual process. Then the cost crunch arrived so both outfits are trying to slap software against the thorny problem of making sense of court documents, rulings, and assorted effluvia of America’s legal factories. I may write about how these two quasi US outfits have monopolized for fee legal information about American law for lawyers and government agencies. Both Reed and Thomson then turn around and sell access to these documents to the agencies that created them in the first place. I wonder if the good senator is aware of this aspect of commercial online services’ business practices?

What’s the trouble? I bet you thought I was going to mention Google. Wrong. Google is on the edge of indexing legal information in a more comprehensive way. But the right now trouble is Senator Joe Lieberman. Wired reported that the good senator wondered why public documents are not available without a charge. You can read the story “Lieberman Asks, Why Are Court Docs Still Behind Paid Firewall?” here. Senator Lieberman’s question may lead to a hearing. The process could, in my opinion, start a chain reaction that further erodes the revenue Reed Elsevier and Thomson Reuters derive from public documents. Somewhere in the chain, the Google will beef up the legal content in its Uncle Sam service here.

At their core, Reed Elsevier and Thomson Reuters are traditional publishing and information companies. As such, their business model is fragile. Within the present financial pressure cooker, the Lieberman question could blow the lid off these two organizations’ for fee legal businesses. If government agencies shift to a service provided by Google, Microsoft, or Yahoo, I think these two dead tree outfits will crash to the forest floor.

What is the likelihood of this downside scenario? I would put it at better than 60 percent. Have another view? Share it, please. Set the addled goose straight.

Stephen Arnold, March 2, 2009

Another British Library Fear

January 28, 2009

Nick Farrell’s “British Library Fears Loss of History” reminded me that libraries are struggling for relevance in a Google-centric world. You can find his Register story here. For me the most interesting comment was:

The British Library has established a department dedicated to the collection of all these digital materials which are stored on your computer in the same way that it stores books, newspapers, documents, maps, personal letters.

I find categorical affirmatives quite amusing. The UK is collecting email and mobile data. Now the British Library wants “all” of a couple of types of digital information. Right now, the only outfit in a position to capture “all” information is Google, not a country, a company.

Libraries find themselves asked to provide shelter, job hunting, and coffee shop duties. One library expressed an interest in mobile furniture and off site book storage. The idea was that users of the library did not need some books right away.

The fear is well founded. Google will allay that fear in my opinion.

Stephen Arnold, January 28, 2009

New Google Study Announced

January 21, 2009

In July 2007, I vowed, “No more Google studies.” I was tired. Now I am just about finished with my third analysis of Google’s technology and business strategy. The two are intertwined. My publisher (Harry Collier, Infonortics Ltd.) has posted some preliminary information here about the forthcoming monograph, Google: The Digital Gutenberg. If you are curious how a Web search engine can be a digital Gutenberg, you will find this analysis of Google’s newest information technology useful. None of the information in this monograph has appeared in the more than 1,200 posts on this Web log, in my two previous Google studies, nor in my more than 200 publicly available articles, columns, and talks.

In short, the monograph will contain new information.

If you are involved in traditional media as a distributor, producer, content creator, aggregator, reseller, indexer, or user–you will find the monograph useful. You may get a business idea or two. If you are the nervous type, the monograph will give some ideas on which to chew. This study represents more than one year of research and analysis. I don’t pay much attention to the received wisdom about Google. I do focus almost exclusively on the open source information about Google’s technology using journal articles, presentations, and patent documents. The result is a look at Google that is quite different from the Google is an advertising agency approach that continues to dominate discourse. Even the recent chatter about Google’s semantic technology is old hat if you read my previous Google monographs. In short, I think this third study provides a solid look at what Google will be unveiling in the period between mid 2009 and the end of 2010. Here are the links to my two earlier studies.

  • The Google Legacy. Describes how Google’s search system became an application platform. You know this today, but my analysis appeared in early 2005.
  • Google Version 2.0. Explores Google’s semantic technology and the company’s innovations that greased the skids for applications, enterprise solutions, and disintermediation of commercial database publishers. A recent podcast broke the old news just a few days ago. Suffice it to say that most pundits were unaware of the scope and scale of Google’s semantic innovations. Cluelessness is reassuring, just not helpful when trying to assess a competitive threat in my opinion.

I don’t have the energy to think about a fourth Google study, but this trilogy does provide a reasonably comprehensive view of Google’s technical infrastructure. I know from feedback from Googlers that the information about some of Google’s advanced technology is not widely known among Google’s rank and file employees. Google’s top wizards know, but these folks are generally not too descriptive about Google’s competitive strengths. Most pundits are happy to get a Google mouse pad or maybe a Google baseball hat. Not me. I track the nitty gritty and look past the glow of the lava lamps. I don’t even like Odwalla strawberry banana juice.

Stephen Arnold, January 21, 2009

Google: Betting on Demographics

January 21, 2009

A reader groused about my poking fun at the British Library research reports. You can read these Swiftian essays here and here. Libraries are in a tough spot. With the financial crisis expanding, libraries are now the go to place to get warm and look for a job. Most libraries depend on a funding authority for money. As those authorities find themselves short of cash, libraries find themselves fighting for enough dough to keep staff and pay the heating and electricity bills. Book and journal acquisitions are lower on the list. Therefore, libraries have to justify their monetary needs. The British Library and the other national libraries are leading a charge for the relevance and importance of buildings stuffed with people looking for work. Oh, yes, these libraries want to collect dead tree outputs of publishers, pictures so these can be placed on Yahoo’s Flickr service, and electronic information so a user can access these data. The problem with this picture is that the Google has become the global library. National libraries are becoming more like branch offices of Google. Now librarians get annoyed when I point out that:

  • Google is indexing books, magazines, Web sites, and Web logs
  • Google is indexing government information
  • Google is offering a job service that few know anything about but you can read about this in my forthcoming study of Google due in April 2009 from Infonortics (I’m sure an entitlement generation blogger will jump on this item and write about it before my study comes out. Imitation is a form of flattery I suppose, but it is more of a character trait of the trophy children in my opinion)
  • Google is gathering videos.

What are libraries doing? Well, I don’t think libraries are in step with what users’ information needs are. I think college professors, mayors, and government officials have their views of libraries. Street people hanging out in the Louisville Free Public Library probably have a different view, however.

I thought of this problem when I read the ZDNet article by Zack Whittaker, “Can We Rely Entirely on Google and Wikipedia?” here. The core of the write up is that Mr. Whittaker doesn’t need the library. He’s of the opinion that Google and Wikipedia provide enough information to write “essays and research”. He does a very good job of comparing what’s available for free and what’s available from a library. To be fair, he does point out that his university library has some utility. But the online services are able to deliver “more than the full library”. The best combination is Internet access and access to a university library.

Now what’s this mean? Mr. Whittaker looks to be about 22 or 23 years old. But what about the kids who are 11 or 12 years old? I think the individuals in this younger demographic chunk will be more comfortable with iPhone and netbook form factors. Libraries may be a very foreign experience. But libraries have to shift into gear or their budgets will continue to shrink. That solves the problem for the 11 to 12 year old. The library will be like a lounge, with no fungible information artifacts required. A connection to the network and access are sufficient.

Stephen Arnold, January 21, 2009

Google’s Knol Milestone

January 18, 2009

Everyone in the drainage ditch in Harrod’s Creek, Kentucky, thinks Knol is a Wikipedia clone. This addled goose begs to differ. This addled goose thinks Knol is a way for the Google to obtain “knowledge” about topics and the experts who contribute to a Knol (a unit of knowledge). Sure, Knol can be used like Wikipedia, but the addled goose thinks the Knol is more, much, much more.

At any rate, the Google announced on January 16, 2009, after the goose tucked its head under its wing for the week, that there are now 100,000 Knols. What this goose found interesting was the headline: “100,000th Knol Published.” I love that word “published”. Google emphasizes that it is not a publisher, but it is interesting to me how the word turns up. You can read the story here.

The blog post contains some interesting insights into Knol; for example, people from 197 countries visit Knol “on an average day.” The interface is available in eight languages. Visitors are editing Knols.

Now how long will it take Knol to reach one million entries?

Stephen Arnold, January 18, 2009
