Google Blocked from Indexing a UK Newspaper

June 2, 2010

Short honk: I may have missed the item “Murdoch Blocks Google from Indexing London Times Articles.” News Corp. may be testing different approaches to making content available to the Google’s search robots. The Wall Street Journal approach seems more stringent than the London Times approach. My view is that traffic will drop. The revenue from for-fee sign-ups will take time to ramp up. The margins enjoyed in the salad days of newspapers may be difficult, expensive, and time-consuming to rebuild. This will be interesting to watch. Google has time on its side, however. On the side of News Corp. are the many legal hassles that Google faces. Legal eagles may change Google’s methods, helping to make News Corp. the winner again. On the other hand, Google may win, and News Corp. may end up in a worse mess than its management envisioned.

Stephen E Arnold, June 2, 2010

Freebie

FBLite: An Indication of Next Generation Web Indexing?

May 10, 2010

I wrote my May column for Information Today about user-intermediated Web indexes. As you know, Google indexes via brute force and smart software. The approach was state of the art because it combined some AltaVista.com magic with a “clever” dose of algorithmic goodness. The problem is that as the volume of content to be indexed goes up, costs become an issue even for deep-pocket outfits like Google. Consider the economic payoff of tapping into a pool of URLs identified by those in a membership network as seeds for a Web index. There may be some cost savings because brute force, although sort of fun in a computer nerd playpen, can be sidestepped. At some point, the value of the sites in StumbleUpon.com and Delicious.com will become more widely known. Now Facebook is in the selective indexing foyer, and the company may become more aggressive in Web search. Until Facebook makes its intentions more obvious, you may find FBLite.com interesting. I can envision more robust services, but FBLite.com points the way to the arsenal that could blow up Google. A single hit from FBLite or even Facebook won’t devastate the Google. But keep in mind that Apple, assorted lawyers, and even countries are aiming their lasers at the Mountain View company. Why index the entire Web when users can identify potentially high-value URLs? A useful set of pointers without the brute force costs. Social methods with financial payoffs. Could this be a next generation Web indexing method with more legs than the Google spider?
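
To make the economics concrete, here is a hedged sketch of how a user-intermediated seed list might work: instead of brute-force discovery, the crawler starts from URLs that several members of a social service have already bookmarked. The sample data, the voting threshold, and the function names below are assumptions for illustration, not any real Facebook, StumbleUpon, or Delicious API.

```python
# Hypothetical sketch of a user-intermediated crawl seed list: instead of
# brute-force discovery, start the index from URLs that members of a
# social service have already flagged. The data source and the scoring
# rule are illustrative assumptions, not any real API.

from collections import Counter

def build_seed_list(user_bookmarks: list[list[str]], min_votes: int = 2) -> list[str]:
    """Keep only URLs that several users independently bookmarked,
    on the theory that agreement signals a high-value page."""
    votes = Counter(url for bookmarks in user_bookmarks for url in set(bookmarks))
    return [url for url, count in votes.most_common() if count >= min_votes]

if __name__ == "__main__":
    bookmarks = [
        ["http://example.com/a", "http://example.com/b"],
        ["http://example.com/a", "http://example.com/c"],
        ["http://example.com/a", "http://example.com/b"],
    ]
    # Only the pages multiple members agree on go into the crawl queue.
    print(build_seed_list(bookmarks))
    # ['http://example.com/a', 'http://example.com/b']
```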

Stephen E Arnold, May 10, 2010

A freebie.

Indexing Craziness

March 15, 2010

I read “Folksonomy and Taxonomy – do you have to choose?,” which takes the position that a SharePoint administrator can use a formal controlled term list or just let users slap their own terms into an index field. The buzzword for letting users index documents is “folksonomy,” a 20-something invention. The key segment for me in the SharePoint-centric Jopx blog was:

The way that SharePoint 2010 supports the notion of promoting free tags into a managed taxonomy demonstrates that a folksonomy can be used as a source to define a taxonomy as well.

Let me try to save you a lot of grief. Indexing must be normalized. The idea is to use certain terms to retrieve documents with reasonable reliability. Humans who are not trained indexers do a lousy job of applying terms. Even professional indexers working in production settings fall into some well-known ruts. For example, unless care is exercised in managing the term list and making it available, humans will work from memory. The result is indexing that is wrong about 15 percent of the time. Machine indexing, when properly tuned, can hit that rate. The problem is that the person looking for information assumes that indexing is 100 percent accurate. It is not.

The idea behind a controlled term list is that it is logically consistent. When a change is made, such as adding “webinar” as a related term to “seminar,” a method exists to keep the terms consistent, and a system is in place to update the index terms for the corpus.
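
To make the mechanics concrete, here is a minimal sketch, in Python, of what a controlled term list does: user-supplied tags are mapped to preferred terms, and the whole corpus can be re-tagged when the vocabulary changes. The terms and synonym pairs are illustrative examples, not drawn from any particular product or thesaurus.

```python
# Minimal sketch of a controlled vocabulary that normalizes free tags.
# The terms and synonym mappings are illustrative examples only.

CONTROLLED_TERMS = {"seminar", "purchase order", "invoice"}

# Map variants and related terms to their preferred (controlled) form.
SYNONYMS = {
    "webinar": "seminar",        # related term mapped to the preferred label
    "web seminar": "seminar",
    "po": "purchase order",
    "p.o.": "purchase order",
}

def normalize_tag(raw_tag: str) -> str | None:
    """Return the controlled term for a user-supplied tag, or None if unknown."""
    tag = raw_tag.strip().lower()
    if tag in CONTROLLED_TERMS:
        return tag
    return SYNONYMS.get(tag)

def reindex_corpus(documents: dict[str, list[str]]) -> dict[str, list[str]]:
    """Re-apply the current vocabulary to every document's tags so the
    whole corpus stays consistent after the term list changes."""
    reindexed = {}
    for doc_id, tags in documents.items():
        normalized = {normalize_tag(t) for t in tags}
        reindexed[doc_id] = sorted(t for t in normalized if t)
    return reindexed

if __name__ == "__main__":
    docs = {"doc1": ["Webinar", "PO"], "doc2": ["seminar", "invoice"]}
    print(reindex_corpus(docs))
    # {'doc1': ['purchase order', 'seminar'], 'doc2': ['invoice', 'seminar']}
```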

When there is a mix of indexing methods, the likelihood of having a mess is pretty high. The way around this problem is to throw an array of “related” links in front of the user and invite the user to click around. This approach to discovery entertains the clueless but leads to the potential for rat holes and wasted time.

Most organizations don’t have the appetite to create a controlled term list and keep it current. The result is something I encounter frequently: a mix of these methods:

  1. A controlled term list from someplace (old Oracle or Convera term list, a version of the ABI/INFORM or some other commercial database controlled vocabulary, or something from a specialty vendor)
  2. User assigned terms; that is, uncontrolled terms. (This approach works when you have big data like Google but it is not so good when there are little data, which is how I would characterize most SharePoint installations.)
  3. Indexes based on parsing the content.

A user may enter a term such as “Smith purchase order” and get a bunch of extra work. Users are not too good at searching, and this patchwork of indexing terms ensures that some users will have to do the Easter egg drill; that is, look for the specific information needed. When it is located, some users like me make a note card and keep it handy. No more Easter egg hunts for that item for me.

What about third party SharePoint metadata generators? These generate metadata but they don’t solve the problem of normalizing index terms.

SharePoint and its touting of metadata as the solution to search woes are interesting. In my opinion, the approach implemented within SharePoint will make it more difficult for some users to find data, not easier. And, in my opinion, the resulting index term list will be a mess. What happens when a search engine uses these flawed index terms? The search results force the user to look for information the old-fashioned way.

Stephen E Arnold, March 15, 2010

A free write up. No one paid me to write this article. I will report non payment to the SharePoint fans at the Department of Defense. Metadata works first time every time at the DoD I assume.

Bing and Slow Indexing

January 8, 2010

Short honk: I noticed a couple of years ago that for certain queries, Microsoft was faster than Google at displaying results. A bit of sleuthing revealed two things. First, Microsoft was caching aggressively. One of the people with whom I spoke suggested that Microsoft cached everything and as close to users as possible. We don’t have a data center in Harrod’s Creek, but close enough. Second, Microsoft was indexing only Web sites that were known to generate hits for popular queries. Unpopular Web sites at that time were skipped in order to speed up indexing. At a gig at a certain large software company, a person in the know about Microsoft search told me that the expensive speed-ups were a thing of the past. Microsoft was in Google territory. Sounded reasonable.

Flash forward to the here and now. Read “Microsoft Admits that Bing Is Slow at Indexing.” If this article is correct, Microsoft has not been able to resolve certain turtle-like characteristics of its Google killer, Bing.com. For me, the most interesting comment was this quote allegedly made by a person in the know:

It is well known in the industry that MSNbot is fairly slow. I suggest reading our FAQs stickied at the top of the indexing forum to get some ideas of what to do.

Yikes. No wonder Google is pushing the “speed angle” as one of its marketing themes.

Stephen E. Arnold, January 8, 2010

Unpaid was I. I wrote for free. I must report to the Superfund Basic Research Program. What? You expected poetry?

De-Indexing: Word of the Week

November 25, 2009

Short honk: I gave up learning new words. Addled geese don’t have much to say or think. Cogito ergo quack. I read Taranfx.com’s “Microsoft to Pay for De-Indexing from Google.” The story was not new for me, but I did latch onto the word “de-indexing”. Great word. Maybe it will lead to great riches?

Stephen Arnold, November 25, 2009

I want to disclose to the Department of Agriculture that this turkey of a write up was not something I wrote for money or a leg. The word “de-indexing” is a turkey.

Google Books and Lousy Indexing

September 6, 2009

Thomas Claburn’s “Google Books Metadata Includes Millions of Errors” disclosed some dirty metadata laundry from the Google Books project. Mr. Claburn reported:

A metadata provider gave Google a large number of book records from Brazil that list 1899 as a default publication date, resulting in about 250,000 misdated books from this one source.

Mr. Claburn rounded up additional information that suggests the error problem is orders of magnitude larger than some expect. The good news is that Google is working to correct errors. The bad news is that Google, like other commercial database producers, generates products and services that users perceive to be “right”. In reality, there are quite a few flaws in electronic products. Mistakes in print can be seen and easily shared with others. Electronic mistakes often behave differently and in many cases will go uncorrected for a long time, maybe forever, without anyone knowing what’s amiss or what the impact of the mistake is when smart software sucks up errors as fact. Whizzy new systems that generate reliability and provenance “tags” can be easily fooled. The repercussions of these types of propagated errors are going to be interesting to understand.

Stephen Arnold, September 6, 2009

New York Times: Two Indexing Methods

July 31, 2009

Teragram, a unit of SAS, provides software that automatically indexes content for the New York Times’s Web site. I saw a tweet on my Overflight service that pointed out that the newspaper uses humans to create the New York Times Index, a more traditional index. You can find the tweet here. If true, why won’t the Teragram system do both jobs? When financial corsets get yanked tighter, something has to give. My thought is that if the tweet is accurate, is the redundancy cost effective? An indication that neither approach works particularly well? Is there a political logic, not a financial logic, at work?

Stephen Arnold, July 31, 2009

Twitter Link Indexing

June 5, 2009

Today after my talk at the Gilbane content management conference in San Francisco, a person mentioned that Twitter was indexing links in Tweets. I said that I included this information in my Twitter Web log posts. But when I looked at my posts, I found that I had not been explicit. You can get more info at http://www.domaintweeter.com.

Stephen Arnold, June 5, 2009

Drunk Men’s Web Indexing Analysis

April 25, 2009

A happy quack to the reader who sent me a link to Drunk Men’s Web robot analysis. You can find the article here. The data come from 2005 and 2006 and may not be spot on for 2009. The main point of the write up is that the Google gets more out of its approach to Web crawling. The payoff from PageRank appears to be a way to get around the need to index certain sites as thoroughly as Yahoo does. Microsoft’s Web robot does not appear to be on a par with either Google’s or Yahoo’s.

Stephen Arnold, April 25, 2009

SharePoint and Indexing a Business Data Catalog

March 27, 2009

SharePoint user? If so, you may want to read and save “Business Data Catalog (BDC) incremental Crawls and How to Test” here. The article understates the performance issues but provides some useful tips. For me, the most important comment was:

But how does the indexer know which BDC records have changed? For it to know this we have to implement a property in our Entity called the __BdcLastModifiedTimestamp. Nice name huh! Now a small admission also. Whenever we describe the IdEnumerator method we always say that it only returns the primary key fields for an entity. This is generally true – except for when you want to implement an incremental crawl. If you want to do this, your IdEnumerator method must also return a DateTime field that will indicate to the indexer when it was last modified. The indexer can then compare this to the previous LastModified value it holds and if it is different, it can index the entire row of data.
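
The logic in that passage boils down to a timestamp comparison. Here is a rough sketch, in Python rather than actual SharePoint or BDC code, of what the indexer does with the keys and last-modified values the enumerator hands back; the helper names are hypothetical.

```python
# Rough sketch of the incremental-crawl logic the quote describes: the
# enumerator returns each record's key plus a last-modified timestamp,
# and the indexer re-indexes only the rows whose timestamp changed since
# the last crawl. Names such as id_enumerator and fetch_row are illustrative.

from datetime import datetime

def incremental_crawl(id_enumerator, fetch_row, previous_timestamps: dict):
    """Return the rows that need re-indexing and the updated timestamp map."""
    changed_rows = []
    new_timestamps = {}
    for record_id, last_modified in id_enumerator():
        new_timestamps[record_id] = last_modified
        # Re-index only when the stored timestamp differs (or the row is new).
        if previous_timestamps.get(record_id) != last_modified:
            changed_rows.append(fetch_row(record_id))
    return changed_rows, new_timestamps

if __name__ == "__main__":
    # Fake data source standing in for the Business Data Catalog entity.
    rows = {
        1: ("Widget order", datetime(2009, 3, 1)),
        2: ("Invoice",      datetime(2009, 3, 20)),
    }
    enumerate_ids = lambda: [(rid, ts) for rid, (_, ts) in rows.items()]
    fetch = lambda rid: rows[rid][0]

    seen = {1: datetime(2009, 3, 1), 2: datetime(2009, 3, 15)}  # from the last crawl
    changed, seen = incremental_crawl(enumerate_ids, fetch, seen)
    print(changed)  # ['Invoice']  -- only the row whose timestamp moved
```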

If this seems like a bit of extra work for a routine task, you are correct. Updating an index should be a click or two; then the system happily ensures that the index is fresh. SharePoint is a work in progress. I assume that when Fast ESP is available, these strange manual workarounds will no longer be needed. One can hope for the basics.

Stephen Arnold, March 27, 2009
