Citation Metrics: Another Sign the US Is Lagging in Scholarship

August 31, 2008

Update: August 31, 2008. Mary Ellen Bates provides more color on the “basic cable” problem for professional information. Worth reading here. Econtent does an excellent job on these topics, by the way.

Original Post

A happy quack to the reader who called my attention to Information World Review’s “Numbers Game Hots Up.” This essay appeared in February 2008 and I overlooked it. For some reason, I am plagued by writers who use the word “hots” in their titles. I am certain Tracey Caldwell is a wonderful person and kind to animals. She does a reasonable job of identifying problems in citation analysis. Dr. Gene Garfield, the father of this technique, would be pleased to know that Ms. Caldwell finds his techniques interesting. The point of the long essay, which you can read here, is that some publishers’ flawed collections yield incorrect citation counts. For me, the most interesting point in the write up was this statement:

The increasing complexity of the metrics landscape should have at least one beneficial effect: making people think twice before bandying about misleading indicators. More importantly, it will hasten the development of better, more open metrics based on more criteria, with the ultimate effect of improving the rate of scientific advancement.

Unfortunately, traditional publishers are not likely to do much that is different from what the firms have been doing since commercial databases became available. The reason is money. Publishers long to make enough money from electronic services to enjoy the profit margins of the pre digital era. But digital information has a different cost basis from the 19th century publishing model. The result is reduced coverage and a reluctance to move too quickly to embrace content produced outside of the 19th century model.

Services that use other methods to determine link metrics exist in another world. If you analyze traditional commercial information, the Web dimension is either represented modestly or ignored. Ms. Caldwell’s analysis looks at the mountain tops, but it does not explore the valleys. In those crevices is another story; namely, researchers who rely on commercial databases are likely to find themselves lagging behind those researchers in countries where commercial databases are simply too expensive for most researchers to use. A researcher who relies on a US or European commercial database is likely to get only an incomplete picture.

Stephen Arnold, August 31, 2008

Google: A Great Place to Work

August 31, 2008

If you want to refresh your memory about how wonderful Google is to employees, you will want to read the Red Orbit “Way of Life in the Google Complex” here. After a summer of transparency, the writer–possibly a Googler or a PR maven in disguise–reprises the wonders of Google. You get a reference to the lava lamp. You get a reminder about the grand piano. You get it all. The writer leaves out entertainment like Tony Bennett at lunch, but you can revel in remarks like this one:

Most of the walls and dividers are made of glass so that rather than becoming a labyrinth of cubicles the buildings remain open and light is easily filtered through.

Yes, metaphorical transparency. That’s a nice rhetorical touch. Plus, I think it’s super that the author knows that the GOOG spends $72 million a year on these and other amenities.

Now that the summer of transparency is nearing its end, Google’s fall campaign seems to be back to its wild and crazy math club ethos.

What a relief for me. I was growing tired of technical explanations, Google management’s advice to other companies about innovation, and talks that run the Google game plan. I hope that lovable Googler Cyrus somebody who told me and then others that a Google patent application drawing in one of my lectures was a Photoshop fake keeps retelling that fib. The lousy patent illustration was crafted by a Google wizard, not me. But Googlers don’t know what their own employer puts in its patent documents. Who wants reality to intrude on Google’s presentation of its world?

Reality, when viewed through lava lamps, is often different from “regular” reality, at least for me. Google’s lawsuits, Gmail outages, and plans for outer space made the summer of 2008 interesting to me as I watched this most important company enter its 11th year in business. Red Orbit’s write up is a useful glimpse into the world that Google wants me to believe exists. Do Googlers sleep on those fluffy animals instead of going home? Let me know if you have some insights.

Stephen Arnold, August 31, 2008

Google Maps Attract Flak

August 31, 2008

Google inked a deal with GeoEye to deliver 0.5 meter resolution imagery. One useful write up appears in Softpedia here. The imagery is not yet available but will be when the GeoEye-1 satellite begins streaming data. The US government limits commercial imagery resolution. The Post Chronicle here makes this comment, illustrating the keen insight of traditional media:

Google did not have any direct or indirect financial interest in the satellite or in GeoEye, nor did it pay to have its logo emblazoned on the rocket. [emphasis added]

In my opinion, Google will fiddle the resolution to comply. Because GeoEye-1 was financed in part by a US government agency, my hunch is that Google will continue to provide geographic services to the Federal government and its commercial and Web users. The US government may get the higher resolution imagery. The degraded resolution will be for the hoi polloi.

Almost coincident with news of this lash up, Microsoft’s UK MSN ran “UK Map Boss Says Google Wrecking Our Heritage.” You can read this story here. The lead paragraph to this story sums up the MSN view:

A very British row appears to be brewing after the president of the British Cartographic Society took aim at the likes of Google Maps and accused online mapping services of ignoring valuable cultural heritage. Mary Spence attacked Google, Multimap and others for not including landmarks like stately homes and churches.

The new GeoEye imagery will include “valuable cultural heritage” as well as cows in the commons and hovels in Herfortshire.

Based on my limited knowledge of British security activities, I would wager a curry that Google’s GeoEye maps will be of some use to various police and intelligence groups working for Queen and country. Microsoft imagery in comparison will be a bit low resolution I surmise. MSN UK will keep me up to date on this issue I hope.

Stephen Arnold, August 31, 2008

No Google Killer Yet

August 31, 2008

I think it is still August 30, 2008, here in the hollow. My newsreader delivered to me a September 1, 2008, article. DMNews is getting a jump on publishing in order to make a picnic. The authors are a team–Ellen Keohane and Mary Elizabeth Hurn. You can read the article here.

The main point of the article is that Google is the leader in search. There were two interesting points for me.

First, the authors identified a search engine of which I knew not–UBExact. The url is http://www.ubexact.com. I ran one test query. I selected the geographic search for Louisville and entered the term “blacktop resurfacing”. The system generated zero results. I will check it out in a couple of months.

Second, the duo made a comment I found intriguing:

And, as with Wikia Search, Mahalo, OrganizedWisdom.com and Scour.com, UBExact also uses humans to improve the search experience. Human editors are contracted to eliminate spam, malicious content, unwanted ads and dead links and pages, Stephenson said. In addition to vetting content, the contractors also organize Web sites based on content so users can search on UBExact by category.

Humans are expensive, and it will be interesting to see if privacy and click fraud impair Google. Oracle SES10g pitched security. Customers did not value security, and I’m not sure if UBExact’s hooks will either. Agree? Disagree? Let me know.

Stephen Arnold, September 1, 2008

Business Intelligence * Hots * Up

August 30, 2008

I was not going to read this VNU article. The phrase “hots up” annoyed me. One of the engineers in my rural Kentucky redoubt told me to take a look. His thought was that VNU was running a content free news story. I did. The story is by Rosalie Marshall, who is probably a warm and caring individual. The article is not completely content free. There’s a sales pitch tucked inside the sentences. And the title reveals a keen sensitivity to language; to wit, “Business Intelligence Hots Up.” I wish I could turn a phrase like that. The main idea of the article is a synopsis of findings by a research firm. The idea is that business intelligence is what customers desire. There’s a reference to Endeca, a company that has been trying to get more traction in business intelligence for several years. IBM gets a mention. Even Google warrants a comment too. For me the most important point in the article is the notion that business intelligence is becoming important. My thought is that search has not delivered. Vendors now chase revenues with business intelligence pitches. If you want to read this “hots up” stuff, click here.

Stephen Arnold, August 30, 2008

Microsoft: Another Search Buy

August 30, 2008

Microsoft has gathered another search system for its information retrieval basket. There’s a good summary on Yahoo News from Thomson Reuters. I will provide this link, but it will go dead in a short span of time. Click here to see if you can access “Microsoft Buys Ciao.com to Boost E-Shopping Search” by Georgina Prodhan and a carton of contributors. The idea is for Microsoft to do a better job with shopping search. Microsoft lags Google, and the hope is to narrow the gap between Microsoft and the GOOG. For me the most important point in the article is:

Microsoft’s Mangelaars acknowledged the distance Microsoft had to cover, especially given the commercial edifice rapidly being built by online advertisers whose models depend on Google’s particular view of the Web. “It’s a race,” he said, “but we also believe it’s very early days in search technology.”

In my opinion, I am tired of hearing that it is early days for search. Search has been around since the 1960s. Sure, I’m on record saying, “Search sucks.” But whether search sucks or not is irrelevant when one company has a 70 percent share and a competitor has been trying to catch up for a decade. Leap frog, not me too, is needed.

Stephen Arnold, August 30, 2008

Enterprise Search Storage Estimator

August 30, 2008

Solrhack has posted a handy “rule of thumb” estimator for capacity planning. You can read the article and see the formula here. The article is called “Enterprise Search Capacity Planning.” Keep in mind that the multiplier can vary. Some vendors with excellent compression methods can generate indexes that are one quarter to one half the size of the total corpus processed. If you know of other tools like this, please, use the comments section of this Web log to share them. I will add them to the ArnoldIT.com and the New Idea Engineering page of search tools. Oh, a happy quack to Solrhack as well.
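
For readers who want to play with the arithmetic, here is a minimal sketch of a rule of thumb estimator. It assumes the estimate is nothing more than corpus size times a multiplier. It is not the Solrhack formula, which you should read at the source. The function names and the 1,000 GB corpus are hypothetical; the one quarter to one half range comes from the compression comment above.

```python
def estimate_index_size_gb(corpus_gb, multiplier):
    """Estimate index size as corpus size times a multiplier (an assumption)."""
    if corpus_gb < 0:
        raise ValueError("corpus_gb must not be negative")
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return corpus_gb * multiplier


def estimate_index_range_gb(corpus_gb, low=0.25, high=0.5):
    """Return (low, high) index size estimates for a range of multipliers.

    The defaults reflect the one quarter to one half range mentioned for
    vendors with excellent compression; other systems may need larger values.
    """
    return (
        estimate_index_size_gb(corpus_gb, low),
        estimate_index_size_gb(corpus_gb, high),
    )


if __name__ == "__main__":
    # Hypothetical corpus of 1,000 GB of processed content.
    low_gb, high_gb = estimate_index_range_gb(1000)
    print(f"Estimated index size: {low_gb:.0f} GB to {high_gb:.0f} GB")
```

Swap in the multiplier your own vendor quotes; the point is only that the planning number moves in step with the multiplier.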

Stephen Arnold, August 30, 2008

Why Dataspaces Matter

August 30, 2008

My posts have been whipping super-wizards into action. I don’t want to disappoint anyone over the long American “end of summer” holiday. Let’s consider a problem in information retrieval and then answer in a very brief way why dataspaces matter. No, this is not a typographical error.

Set Up

A dataspace is somewhat different from a database. Databases can be within a dataspace, but other information objects, garden variety metadata, and new types of metadata which I like to call meta metadata, among others can be encompassed. These are represented in an index. For our purpose, we don’t have to worry about the type of index. We’re going to look up something in any of the indexes that represent our dataspace. You can learn more about dataspaces in the IDC report #213562, published on August 28, 2008. It’s a for fee write up, and I don’t have a copy. I just contribute; I don’t own these analyses published by blue chip firms.

Now let’s consider an interesting problem. We want to index people, figure out what those people know about, and then generate results to a query such as “Who’s an expert on Google?” If you run this query on Google, you get a list of hits like this.

[Screenshot: Google results for the expert query]

This is not what I want. I require a list of people who are experts on Google. Does Live.com deliver this type of output? Here’s the same query on the Microsoft system:

[Screenshot: Live.com results for the same query]

Same problem.

Now let’s try the query on Cluuz.com, a system that I have written about a couple of times. Run the query “Jayant Madhavan” and I get this:

[Screenshot: Cluuz.com results for “Jayant Madhavan”]

I don’t have an expert result list, but I have a wizard and direct links to people Dr. Madhavan knows. I can make the assumption that some of these people will be experts.

If I work in a company, the firm may have the Tacit system. This commercial vendor makes it possible to search for a person with expertise. I can get some of this functionality in the baked in search system provided with SharePoint. The Microsoft method relies on the number of documents a person known to the system writes on a topic, but that’s better than nothing. I could if I were working in a certain US government agency use the MITRE system that delivers a list of experts. The MITRE system is not one whose screen shots I can show, but if you have a friend in a certain government agency, maybe you can take a peek.
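
To make the document counting idea concrete, here is a rough sketch that ranks people by how many documents they wrote on a topic. This is my illustration of the general approach, not SharePoint's actual algorithm. The document records, the field names `author` and `topics`, and the function name are all hypothetical.

```python
from collections import Counter


def rank_experts(documents, topic):
    """Rank authors by the number of documents they wrote on a topic.

    Each document is a dict with an "author" string and a "topics" list.
    Matching is a case-insensitive comparison of topic labels.
    """
    wanted = topic.lower()
    counts = Counter()
    for doc in documents:
        if wanted in (label.lower() for label in doc["topics"]):
            counts[doc["author"]] += 1
    # most_common() sorts by count, highest first.
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical documents known to the system.
    docs = [
        {"author": "alice", "topics": ["Google", "search"]},
        {"author": "bob", "topics": ["google"]},
        {"author": "alice", "topics": ["Google"]},
        {"author": "carol", "topics": ["SharePoint"]},
    ]
    for author, count in rank_experts(docs, "Google"):
        print(author, count)
```

The weakness is easy to see in the sketch: a prolific writer outranks a quiet wizard, and anyone who writes nothing is invisible.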

None of these systems really do what I want.

Enter Dataspaces

The idea for a dataspace is to process the available information. Some folks call this transformation, and it really helps to have systems and methods to transform, normalize, parse, tag, and crunch the source information. It also helps to monitor the message traffic for some of that meta metadata goodness. An example of meta metadata is an email. I want to index who received the email, who forwarded the email to whom and when, and any cutting or copying of the information in the email to which documents and the people who have access to said information. You get the idea. Meta metadata is where the rubber meets the road in determining what’s important regarding information in a dataspace.
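
The email example can be sketched as a small record. This is a sketch under my own assumptions, not the data model of any particular dataspace system. The class names, the field names, and the `people_exposed` helper are hypothetical; they cover only the items listed above: recipients, forwards with a time stamp, and copies into documents along with who can read those documents.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Set


@dataclass
class Forward:
    """Who forwarded the email to whom, and when."""
    sender: str
    recipient: str
    when: datetime


@dataclass
class CopyEvent:
    """Text cut or copied from the email into a document."""
    document: str
    readers: List[str] = field(default_factory=list)


@dataclass
class EmailMetaMetadata:
    """Meta metadata kept for one email in the dataspace index."""
    message_id: str
    recipients: List[str] = field(default_factory=list)
    forwards: List[Forward] = field(default_factory=list)
    copies: List[CopyEvent] = field(default_factory=list)

    def people_exposed(self) -> Set[str]:
        """Everyone who received, forwarded, or can read the information."""
        people = set(self.recipients)
        for fwd in self.forwards:
            people.add(fwd.sender)
            people.add(fwd.recipient)
        for copy in self.copies:
            people.update(copy.readers)
        return people


if __name__ == "__main__":
    # Hypothetical message traffic for one email.
    record = EmailMetaMetadata(
        message_id="msg-001",
        recipients=["ann"],
        forwards=[Forward("ann", "ben", datetime(2008, 8, 30, 9, 15))],
        copies=[CopyEvent("plan.doc", readers=["cho", "ben"])],
    )
    print(sorted(record.people_exposed()))
```

A real system would capture these events from the message traffic; the sketch shows only what one index entry might hold.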

Read more

The Enterprise Search Thrill Ride

August 29, 2008

Summer’s ending, and the search engine thrill ride is accelerating. Before you fire up your personal computer and send me an email asking for juicy details, appreciate that I can only comment in a broad way, making observations at a high level. If you have an appetite for more information, you will have to dip into your piggy bank and engage me to show up and discuss the state of the industry in a setting less chatty than this Web log.

Every amusement park has a thrill ride. Kids love these roller coasters, bungee jumps, and spinning barrels. Adults or people with an aversion to fear are generally content to watch. Once in a great while, a thrill ride goes wrong. The thrill seekers can be injured and once in a while killed.

Search and content processing companies are, in a sense, a thrill ride. The launch of a company is filled with anticipation. Then the company chugs along and usually gets a sale, and the process repeats itself. At the end of the ride, the company speeds along and in most cases the ride ends with the employees displaying big smiles. When a ride goes wrong, the employees aren’t so chipper, but the lawyers often show sly grins.

[Image: blurred roller coaster]

I am quite confident that the September to December 15, 2008, period will be quite exciting for me. First, the search and content processing sector of the enterprise software market is poised for change. Second, a number of companies will have to make their numbers or face the prospect of enduring the lash of venture capitalists’ whips, changing careers, or closing up for good. Third, the GOOG is beginning to move slowly forward in the enterprise sector. Even if Google’s management insists “We’re just running a beta test”, those “beta tests” will be disruptive for established search and content processing vendors. Fourth, newcomers to the North American market will make their presence felt to a greater degree than in the first six months of 2008. Newcomers often become irritants with their promise of better, faster, or cheaper. Of course, the customer may pick two of these claims, but incumbents have to waste time and money deflecting these competitive challenges. Finally, superplatforms–big enterprise software vendors–have to protect their turf. I expect significant pressure from these firms to add another variable to the search and content sector. After all, what can a company do when Microsoft bundles an incrementally improving search and retrieval system with a widely used server product like SharePoint?

Read more

Growth of Electronic Information

August 29, 2008

Larry Borsato, writing for the Industry Standard, presents some interesting information about the growth of electronic information. You can read his article “Information Overload on the Web, and Searching for the Right Sifting Tool” here. The most startling item was this statement:

IBM predicts that in the next couple of years, information will double every 11 hours [PDF].

The article runs down the problems encountered when looking for information using various search services. He’s right. Search is a problem. But that doubling of information every 11 hours underscores the opportunity that exists for a person or company with an information access solution.

Stephen Arnold, August 29, 2008
