January 13, 2009

Search engines often stumble when indexing certain content types. I avoid Flash, Flex, and Silverlight myself, but there are 20-somethings who want to make my browser work like the local movie theatre. Here in Harrod’s Creek, Kentucky, we are getting new films every week or so. Most are still black and white. But the Flash, Flex, and Silverlight crowd goes for color, sound, and the big screen. Well, I should say that Flash and Flex go for the big screen. Silverlight does not, if the data presented by Rich Internet Application Statistics are correct. You can find the information here. The url is one that might be gone when you read this. The data suggest that search vendors will be focusing on indexing Flash and Flex. Looking at the pie charts, the Adobe crowd has 90 percent penetration. Silverlight is chugging along in the 15 percent range. Well, the good news is that Microsoft Fast can probably index Silverlight content.

Stephen Arnold, January 13, 2009


Melzoo: Googzilla Killer or Googzilla Snack

January 13, 2009

A happy quack to the reader who sent me the link to the Melzoo.com Web search site. I poked around and located on VNUnet an article providing an overview of the service. You can read “MelZoo Takes on Google with Split Screen Search” here. The system is a metasearch engine like Ixquick.com and Vivisimo’s Clusty.com. The metasearch technology is not the hook for Melzoo. The company generates an image of the Web site. I first saw this type of preview when I reviewed Girafa.com for my column in Information World Review five, maybe six years ago. Melzoo asserts here:

This preview feature has an enormous impact on the ‘quality of traffic’ delivered to advertisers: the traditional search engines are offering typically only text as a teaser. Chances are that users who enjoy the luxury of a detailed thumbnail preview, will be a lot more selective in visiting the sites they are interested in. This results in a higher effectiveness of use. The chances of “conversion” (i.e. from hit to buy) is currently estimated 5 times higher than with traditional search engines.

I think the vertical metasearch available from Deep Web Technologies is more useful for my work. You can see one of the DWT vertical federated search systems here.

The VNU write up made me sit up and take notice with this assertion about Melzoo.com:

“MelZoo has improved the experience of browsing the Internet in a totally different way. For years people have used an old technique – text only – to browse the web. MelZoo has revolutionized the way users will browse the web,” said MelZoo chief executive Alex De Backer. “In addition MelZoo is a welcome novelty for the advertisers, as it offers higher quality visitors at a lower cost.”

There are some issues associated with metasearch. These include latency, being blocked, or having to pay the source of the hits for the privilege of using its results. I will keep my eye on Melzoo.com.

Stephen Arnold, January 12, 2009

Deep Web Technologies’ Vertical Search for Business Information

January 13, 2009

In the early 1990s, Verity was the dominant enterprise search system. IBM’s confused approach to STAIRS and the complexity of STAIRS derivatives created a market opportunity. Verity took it. Verity’s founders have continued to innovate in search. I was delighted to speak with Abe Lederman (that interview is here) and learn about the innovations his company has made. Deep Web Technologies (DWT) tames the tangled world of US government scientific information. You can explore the Science.gov site here. Now, Mr. Lederman and his team have turned their attention to the needs of the person looking for substantive business information. The company’s new business search system–Biznar–débuted in October 2008.

DWT has identified about 60 business oriented Web sites and federates these sources in near real time. In addition to this core list, Biznar takes a user’s query and retrieves results from other Web indexing services. The system then blends the results, producing a results list that is designed to answer business questions. On this select source list are such publications as:

  • Business Week
  • Money Magazine
  • Motley Fool
  • US Patent & Trademark Office
  • Wall Street Journal.
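
The federate-then-blend flow described above can be sketched roughly as follows. This is a minimal illustration, not DWT's implementation: the source names, the sample hits, the scores, and the `search_*` stand-in functions are all hypothetical, and real connectors would call remote services instead of returning canned lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for source connectors. Each returns (title, score)
# pairs, with scores assumed to be on a comparable 0..1 scale.
def search_magazine(query):
    return [("Bankruptcy liability basics", 0.9), ("Chapter 11 primer", 0.4)]

def search_patents(query):
    return [("Liability tracking method", 0.7)]

def search_web_index(query):
    return [("Bankruptcy liability basics", 0.6), ("Debt forum thread", 0.2)]

SOURCES = {
    "magazine": search_magazine,
    "patents": search_patents,
    "web_index": search_web_index,
}

def federated_search(query, sources=SOURCES):
    """Query every source in parallel, then blend the hits into one ranked list."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        raw = {name: future.result() for name, future in futures.items()}

    blended = {}
    for name, hits in raw.items():
        for title, score in hits:
            entry = blended.setdefault(
                title, {"title": title, "score": 0.0, "sources": []}
            )
            # Keep the best score seen and remember every source that returned the hit.
            entry["score"] = max(entry["score"], score)
            entry["sources"].append(name)

    # Rank by score; ties go to hits returned by more sources.
    return sorted(
        blended.values(),
        key=lambda e: (e["score"], len(e["sources"])),
        reverse=True,
    )

results = federated_search("bankruptcy liability")
for r in results:
    print(f'{r["score"]:.1f}  {r["title"]}  [{", ".join(r["sources"])}]')
```

Showing the contributing sources next to each hit mirrors the per-item source display in the Biznar results list. A production system would also need timeouts for slow sources and a way to normalize scores that are not directly comparable.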

Sample Query

Let’s look at a test query. I used Biznar to obtain information about “bankruptcy liability”. The system generated a result list with 1,706 entries. I ran the same query on Google.com, which returned a result list containing more than 9,400,000 results. Obviously no human could examine a fraction of these 9,400,000 results. Google advertises that it is good by virtue of indexing a lot of content. Biznar focuses on a meaningful result set of 1,700 items.

But for most people, 1,700 items are too many. Biznar makes it easy to navigate the results. Look at the results page below:


You see a two column display. The larger column presents a traditional results list with several useful enhancements:

  1. You see a star rating that provides an indication of the importance of the result for this specific query.
  2. The source is displayed for each item; for example, Google Blog Search, Google Scholar, the New York Times, etc.
  3. The link includes a snippet of the content in the document that matches the query.


New York Times Asserts It Is Indeed Hip Riding the Word Train

January 12, 2009

The New York Times is trapped within a mindset, wrapped in a culture, and under a layer of costs. New York Magazine is doing its part to show how trendy and agile this aging swan really is. You can read “The New Journalism: Goosing the Gray Lady.” The notion of goosing a dowager is an image that makes this addled goose cringe. The folks working on this project will benefit from the experience when they seek another job or chat up a venture capitalist for some dough. For me, the dead tree crowd, including the goosed gray lady, is struggling to find a solution to the journalistic equivalent of Fermat’s last theorem. Trying is good.

Stephen Arnold, January 12, 2009

Crazy Stats: Interesting Yet Hardly Web 2.0

January 12, 2009

I think the clever wordsmiths who snagged the Web 2.0 meme are blowing smoke. Losing money is not a business model. Nevertheless, I enjoyed this list of Web 2.0 statistics. I think the word “statistics” as used by TheFutureBuzz.com means “unverifiable factoids”. The article, “49 Amazing Social Media, Web 2.0, and Internet Stats”, is here. Three of the unsubstantiated factoids that caught my attention were:

  • Google’s one trillion urls. Impossible to verify. Ranks with Amazon’s assertions about the number of objects managed in its AWS service. More PR fluff than factual bedrock.
  • The 70 million videos on Google. Nice assertion, no verification.
  • 133 million Web logs indexed by Technorati. Yep, but how many have been orphaned? The total number of Web logs remains a mystery.

If you love these types of factoids, TheFutureBuzz.com article is for you.

Stephen Arnold, January 12, 2009

Lousy Economy, Google Gains Share

January 12, 2009

Barron’s reported here that Google gained market share in Web search in the US in December 2008. The source of the data is Hitwise.com. I think these data understate Google’s actual market share, but when the Wall Street Journal’s progeny asserts 72 percent market share, it must be true. The question is, “What will Microsoft and Yahoo do to gain ground?” The answer is, in my opinion, “Not much they can do.” Search is not a priority at either Microsoft or Yahoo. Sure, both outfits say search is job one, but the GOOG is built on search. Search is an add on, a pair of foam dice hanging from a bigger vehicle’s rear view mirror at Microsoft and Yahoo. Time is running out to catch up. Time to leapfrog.

Stephen Arnold, January 12, 2009

More Social Network Issues

January 12, 2009

Social search, social networks, and social pitfalls–the cheerleaders don’t want the social bandwagon to be delayed, but trouble looms. Google’s Orkut made clear the issues that can arise when a social network becomes the playground of some interesting people in Brazil. Now you can read “(Under)mining Privacy in Social Networks” here by a trio of Googlers. The Google write up identifies some obvious flaws; for example, exposing information unintentionally. But the more significant part of the paper, in my opinion, is the set of references to merging social graphs. The dataspace drum beats are getting louder.

Stephen Arnold, January 12, 2009

Ask.com’s Search Technology Advances

January 12, 2009

Ask.com keeps trying. On January 8, 2009, the company announced “Semantic Search technology Advances from Ask.com.” You can read the company’s statement here. The company asserts:

In October last year we introduced our proprietary DADS (Direct Answers from Databases), DAFS (Direct Answers from Search), and AnswerFarm technologies, which are breaking new ground in the areas of semantic, web text, and answer farm search technologies. Specifically, the increasing availability of structured data in the form of databases and XML feeds has fueled advances in our proprietary DADS technology.  With DADS, we no longer rely on text-matching simple keywords, but rather we parse users’ queries and then we form database queries which return answers from the structured data in real time.  Front and center. Our aspiration is to instantly deliver the correct answer no matter how you phrased your query.
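
The parse-then-query idea in the Ask.com statement can be sketched in a few lines. This is a toy under stated assumptions, not Ask.com's DADS: the `shows` table, its contents, the question patterns, and the `direct_answer` function are all hypothetical, and a real system would need far richer parsing than two regular expressions.

```python
import re
import sqlite3

# Hypothetical structured data: a tiny in-memory table of TV shows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (title TEXT, host TEXT, network TEXT)")
conn.execute(
    "INSERT INTO shows VALUES ('The Daily Show', 'Jon Stewart', 'Comedy Central')"
)

# Hypothetical question patterns, each mapped to the column that answers it.
PATTERNS = [
    (re.compile(r"^who hosts (?P<title>.+?)\??$", re.I), "host"),
    (re.compile(r"^what network (?:airs|carries) (?P<title>.+?)\??$", re.I), "network"),
]

def direct_answer(question):
    """Parse the question, build a parameterized SQL query, return an answer or None."""
    for pattern, column in PATTERNS:
        match = pattern.match(question.strip())
        if match:
            # The column name comes from PATTERNS above, never from user input;
            # the user-supplied title is passed as a bound parameter.
            row = conn.execute(
                f"SELECT {column} FROM shows WHERE lower(title) = lower(?)",
                (match.group("title"),),
            ).fetchone()
            return row[0] if row else None
    # No pattern matched; a real system would fall back to text search here.
    return None

print(direct_answer("Who hosts the daily show?"))
print(direct_answer("What is an information manifold?"))
```

The sketch answers only the questions its patterns anticipate and returns nothing for everything else, which is where the hard part of "no matter how you phrased your query" lives.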

The idea is that a user–assuming there is enough traffic to make the site viable in 2009–can enter a query any way he or she wishes. The Ask.com system will figure out the query and provide a Direct Answer. Let’s check out the system.

My first query was, “What’s the daily show?” The system responded with the top result “The Daily Show with Jon Stewart.” Good. My second query was, “What is a dataspace’s application?” The system responded by asking me the question, “What is a data spaces application?” The first result was a link to Sourceforge’s information about EQUIP2. Sorry, the correct answer was in my mind a link to the ACM papers about dataspaces. My third query was, “What is an information manifold?” This is no trick question because there is a technical paper with a title that contains the bound phrase “information manifold.” The Ask.com system asked me, “What is an information mannford?” I don’t know what a “mannford” is.

For the types of questions a middle school student might ask, the new system will work pretty well. For popular culture topics, the system will probably be better than some I have examined this week. For the types of queries I have about technologies that address the known weaknesses of traditional semantic processing, Ask.com won’t help me too much. That’s good. Knowing what questions to ask allows me to feed my goslings. Ask.com won’t put me out of a job this year. One final point: I clicked on “mannford”. It’s a city in Oklahoma. No dataspaces among that state’s wide open spaces. Look west, young search, look to Mountain View, California.

Stephen Arnold, January 12, 2009

Xsearch CEO Norbert Weitkämper Interviewed

January 12, 2009

Weitkämper Technology–based in Staffelsee, Germany–is a search and content processing vendor with a low profile in North America. The firm offers a multi-source search suite that incorporates proprietary technology to deliver fast content and query processing. The company’s XSEARCH package can be customized to focus on a client’s specific needs. It offers nine components: Clustering Engine, Suggest, DidYouMean, Summarizer, Linguistic Engine, Federated Search, Facet Navigator, Entity Extractor and Intelligent Classifier.

Norbert Weitkämper, an industrial engineer, was dissatisfied with the search results available from commercial products. He developed Xsearch after working in electronic publishing. He told Search Wizards Speak:

As we are specialized on search for more than a decade our package is very well tuned; not only for speed but also for content for example. We will combine our new HitEngine with our established technologies like Linguistic, Did-You-Mean, clustering, synonyms and ontologies, or our personal ranking mechanisms. They are already released, we just have to melt them together.

He added:

For the complex roman languages our linguistic engine with its morphologic analysis is a big advantage, because algorithmic approaches like Bayesian or Porter, which are doing a good job for English, are a miserable failure.

On the subject of semantic analysis, Mr. Weitkämper said:

Semantic analysis is much more difficult for European languages than for English. We are already able to integrate thesauri or ontologies. I have not seen any system yet which meets the requirements for semantic analysis – at least when you have a closer look into the system. But storing information in a quick and accessible way is even more important for this approach, as you have to consider much more than only keywords and positions. So I can imagine that our optimized index structure may help also in this field to achieve adequate results in an acceptable amount of time.

More information about the company is available at its Web site, http://www.weitkamper.com. The full text of the interview with Mr. Weitkämper is at http://www.arnoldit.com/search-wizards-speak/xsearch.html.

Stephen Arnold, January 12, 2009

British Library Debunks Myth of a Google Generation

January 11, 2009

Libraries are fighting for money and a role in the digital world. The plight of white shoe publishers is well known. Newspapers, once the life blood of information, are now stuffed with soft news or, what’s worse, old information. The shift from desktop boat anchor computers to sleek hand held devices is moving forward. Flagship PC vendors like Dell Computers are in a fight for Wall Street respectability. The television and motion picture pashas believe that the fate of the traditional music publishing business is not theirs.

On January 16, 2008 (the date and the information come from this source), the British Library press room issued or issues or will issue “Pioneering Research Shows Google Generation Is a Myth.” The news release summarizes the study Information Behaviour of the Researcher of the Future. Here’s the link I located, but it did not work without some clicking around. The report strikes me as something developed in an alternate universe where the Googleplex and its information system are small potatoes indeed.

bl image

He does not exist, but this member of the Google generation made it to the cover of the British Library’s myth-debunking study. In the future, this lad will be retrieving information from a mobile device, no PC or library required, thinks this addled goose.

The study was, according to the press release,

Commissioned by the British Library and JISC (Joint Information Systems Committee), the study calls for libraries to respond urgently to the changing needs of researchers and other users. Going virtual is critical and learning what researchers want and need crucial if libraries are not to become obsolete, it warns. “Libraries in general are not keeping up with the demands of students and researchers for services that are integrated and consistent with their wider Internet experience”, says Dr Ian Rowlands, the lead author of the report.

Now this paragraph seems to suggest that “something” has happened and that libraries must “respond urgently to the changing needs of researchers and other users.” My hunch is that libraries are not surfing on the Google but paddling along trying to keep Googzilla’s spiky back in view.

bl study

Most of these curves head south, right? © British Library 2009 and presumably in the universe which I inhabit.

The news release also suggests libraries must turn to “Page 2.0”, which I presume is another silly reference to the made up world of Search 2.0, Enterprise 2.0, and Web 2.0. The news release from the future ends with the mysterious phrase “The panel:”.

Keep in mind that I am writing this notice on January 11, 2009, at 9:30 am Eastern time. The news release is from the future. It has a date of January 16, 2009. One would think that the British Library, operating outside the normal space time continuum, could do more than tell me that the myth of the Google generation does not exist. Clever headline aside, libraries must define a role for themselves before funding dwindles even more. University libraries might be grandfathered into the institutional budget. Other types? Might be a tough sale.

In my opinion, what does not exist among some in the library profession is a firm grip on the here and now. I am 65, and I think the Google generation exists. I wish it were not so, but it exists and the world one hopes will be better for the generation’s presence. Libraries seem to exist in a medieval world. Even Shakespeare is in step with the shift from paper to digital information. Consider Hamlet’s statement from one of the versions of the play crafted from Shakespeare’s foul papers:

Let us go in together,
And still your fingers on your lips, I pray.
The time is out of joint—O cursèd spite,
That ever I was born to set it right!
Nay, come, let’s go together.

No myth this, sprites.

Stephen Arnold, January 11, 2009

