Deep Web Technologies’ Vertical Search for Business Information
January 13, 2009
In the early 1990s, Verity was the dominant enterprise search system. IBM’s confused approach to STAIRS and the complexity of STAIRS derivatives created a market opportunity. Verity took it. Verity’s founders have continued to innovate in search. I was delighted to speak with Abe Lederman (that interview is here) and learn about the innovations his company has made. Deep Web Technologies (DWT) tames the tangled world of US government scientific information. You can explore the Science.gov site here. Now, Mr. Lederman and his team have turned their attention to the needs of the person looking for substantive business information. The company’s new business search system–Biznar–débuted in October 2008.
DWT has identified about 60 business-oriented Web sites and federates these sources in near real time. In addition to this core list, Biznar takes a user’s query and retrieves results from other Web indexing services. The system then blends the results, producing a results list that is designed to answer business questions. On this select source list are such publications as:
- Business Week
- Money Magazine
- Motley Fool
- US Patent & Trademark Office
- Wall Street Journal.
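The blending step described above can be pictured with a small sketch. This is not DWT's method, which the write-up does not describe; it is a minimal illustration that assumes each source returns its own relevance score, scales scores per source so they are comparable, and keeps the best copy of any duplicate URL. The names `Hit` and `blend` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str   # e.g. a publication or a Web index (hypothetical labels)
    url: str
    title: str
    score: float  # source-specific relevance; higher is better


def blend(result_lists):
    """Merge per-source result lists into one ranked list.

    Assumption: scores from different sources are not comparable, so each
    list is scaled by its own top score before merging.
    """
    best = {}  # url -> (normalized score, hit)
    for hits in result_lists:
        if not hits:
            continue
        top = max(h.score for h in hits)
        for h in hits:
            norm = h.score / top if top > 0 else 0.0
            # Keep only the highest-scoring copy of a duplicate URL.
            if h.url not in best or norm > best[h.url][0]:
                best[h.url] = (norm, h)
    # Sort by score descending, then by URL so ties are deterministic.
    ranked = sorted(best.values(), key=lambda pair: (-pair[0], pair[1].url))
    return [(round(norm, 3), hit) for norm, hit in ranked]
```

A real federated system also has to cope with slow or failed sources and with sources that return no scores at all; the sketch ignores both.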
Sample Query
Let’s look at a test query. I used Biznar to obtain information about “bankruptcy liability”. The system generated a result list with 1,706 entries. I ran the same query on Google.com, which returned a result list containing more than 9,400,000 results. Obviously no human could examine a fraction of these 9,400,000 results. Google advertises that it is good by virtue of indexing a lot of content. Biznar focuses on a meaningful result set of 1,700 items.
But for most people, 1,700 items are too many. Biznar makes it easy to navigate the results. Look at the results page below:
You see a two column display. The larger column presents a traditional results list with several useful enhancements:
- You see a star rating that provides an indication of the importance of the result for this specific query
- The source is displayed for each item; for example, Google Blog Search, Google Scholar, the New York Times, etc.
- The link includes a snippet of the content in the document that matches the query.
New York Times Asserts It Is Indeed Hip Riding the Word Train
January 12, 2009
The New York Times is trapped within a mindset, wrapped in a culture, and under a layer of costs. New York Magazine is doing its part to show how trendy and agile this aging swan really is. You can read “The New Journalism: Goosing the Gray Lady.” The notion of goosing a dowager is an image that makes this addled goose cringe. The folks working on this project will benefit from the experience when they seek another job or chat up a venture capitalist for some dough. For me, the dead tree crowd, including the goosed gray lady, is struggling to find a solution to the journalistic equivalent of Fermat’s last theorem. Trying is good.
Stephen Arnold, January 12, 2009
Crazy Stats: Interesting Yet Hardly Web 2.0
January 12, 2009
I think the clever wordsmiths who snagged the Web 2.0 meme are blowing smoke. Losing money is not a business model. Nevertheless, I enjoyed this list of Web 2.0 statistics. I think the word “statistics” as used by TheFutureBuzz.com means “unverifiable factoids”. The article, “49 Amazing Social Media, Web 2.0, and Internet Stats,” is here. Three of the unsubstantiated factoids that caught my attention were:
- Google’s one trillion URLs. Impossible to verify. Ranks with Amazon’s assertions about the number of objects managed in its AWS service. More PR fluff than factual bedrock.
- The 70 million videos on Google. Nice assertion, no verification.
- 133 million Web logs indexed by Technorati. Yep, but how many have been orphaned? The total number of Web logs remains a mystery.
If you love these types of factoids, TheFutureBuzz.com article is for you.
Stephen Arnold, January 12, 2009
Lousy Economy, Google Gains Share
January 12, 2009
Barron’s reported here that Google gained market share in Web search in the US in December 2008. The source of the data is Hitwise.com. I think these data understate Google’s actual market share, but when the Wall Street Journal’s progeny asserts 72 percent market share, it must be true. The question is, “What will Microsoft and Yahoo do to gain ground?” The answer is, in my opinion, “Not much they can do.” Search is not a priority at either Microsoft or Yahoo. Sure, both outfits say search is job one, but the GOOG is built on search. Search is an add on, a pair of foam dice hanging from a bigger vehicle’s rear view mirror at Microsoft and Yahoo. Time is running out to catch up. Time to leapfrog.
Stephen Arnold, January 12, 2009
Microsoft: Signs of Increased Distress
January 12, 2009
Silicon.com ran an interesting interview with Steve Ballmer, the chief executive officer of the $65 billion software giant here. “What Does Ballmer Worry About? Google, Google, Google” reminded me of Psych 103. In that required course, the professor, who was sort of wacky in a tweedy way, described a range of obsessions, fetishes, and aberrations. I memorized these for a test and then hit the delete key. Ina Fried’s interview reminded me of that class, but you will have to read the article to see if your long term memory vibrates. The interview touched upon Vista, which does not interest me. On page two the subject of search arises. Now I am hooked. Mr. Ballmer is quoted as having said:
I mean, look, this is not something that changes overnight. Everybody wants us to snap our fingers. We have a good competitor, and yet at the same time, we see real opportunities to improve the search experience, to differentiate, but it’s not going to happen overnight. We’re going to have to keep working and working; innovating product-wise, marketing, branding, distribution, and we’re going to have to be patient about it.
I think I have read this quote with the word patience before. I keep hearing that Microsoft needs time and that pundits and Wall Street must have patience. I personally have run out of patience. The last unexpected crash of a Windows Server gobbled the last dollop of mine.
The other key point in the interview is the phrase “Google, Google, Google.” That’s repetitive as well in my view.
Let me offer three observations:
First, Google is a decade old. Its dominance in Web search has been evident for years. I noted the gap between Google and its Web search competitors in my 2005 The Google Legacy. I reinforced that argument with an analysis of some of Google’s less high profile but very important technical investments. I can’t buy “patience” and I think time is running out. The gap in Web search is not a couple of percentage points. The gap is somewhere in the country mile range.
Second, the fixation on Google is not healthy. Google is part of the Microsoft challenge, but it is not the sole cause. Microsoft has reached a point where it must split into separate companies or realign much as IBM did when it abandoned the PC front to Microsoft and embraced consulting. IBM’s a $100 billion outfit today, but it is not what it was in 1980, not by another country mile.
Third, the economic pressures on organizations with on premises software are increasing. As a result, an outfit like Google is poised to skim the cream from some market sectors. With the loss of “easy win” customers, Microsoft is going to face its own pressures: financial, managerial, and technical.
Oh, I remember the obsession–attention deficit hyperactivity disorder. Time for Microsoft to focus is my opinion.
Stephen Arnold, January 10, 2009
More Social Network Issues
January 12, 2009
Social search, social networks, and social pitfalls–the cheerleaders don’t want the social bandwagon to be delayed but trouble looms. Google’s Orkut made clear the issues that can arise when a social network becomes the playground of some interesting people in Brazil. Now you can read “(Under)mining Privacy in Social Networks” here by a trio of Googlers. The Google write up identifies some obvious flaws; for example, exposing information unintentionally. But the more significant part of the paper, in my opinion, is the set of references to merging social graphs. The dataspace drum beats are getting louder.
Stephen Arnold, January 12, 2009
Ask.com’s Search Technology Advances
January 12, 2009
Ask.com keeps trying. On January 8, 2009, the company announced “Semantic Search technology Advances from Ask.com.” You can read the company’s statement here. The company asserts:
In October last year we introduced our proprietary DADS (Direct Answers from Databases), DAFS (Direct Answers from Search), and AnswerFarm technologies, which are breaking new ground in the areas of semantic, web text, and answer farm search technologies. Specifically, the increasing availability of structured data in the form of databases and XML feeds has fueled advances in our proprietary DADS technology. With DADS, we no longer rely on text-matching simple keywords, but rather we parse users’ queries and then we form database queries which return answers from the structured data in real time. Front and center. Our aspiration is to instantly deliver the correct answer no matter how you phrased your query.
The idea is that a user–assuming there is enough traffic to make the site viable in 2009–can enter a query any way he or she wishes. The Ask.com system will figure out the query and provide a Direct Answer. Let’s check out the system.
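To make the “parse the query, then form a database query” idea concrete, here is a toy sketch. It is not Ask.com’s DADS technology, whose internals the statement does not describe. It assumes a single hand-written question pattern and a tiny in-memory SQLite table; the table, the pattern, and the function names are all hypothetical.

```python
import re
import sqlite3

# One hand-written question pattern (an assumption for illustration only).
PATTERN = re.compile(
    r"^what(?:'s| is) the capital of (?P<name>[\w\s]+?)\??$",
    re.IGNORECASE,
)


def build_db():
    """Create a tiny structured source to answer from."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE countries (name TEXT PRIMARY KEY, capital TEXT)")
    conn.executemany(
        "INSERT INTO countries VALUES (?, ?)",
        [("france", "Paris"), ("germany", "Berlin")],
    )
    return conn


def direct_answer(conn, question):
    """Return an answer from structured data, or None if nothing applies."""
    match = PATTERN.match(question.strip())
    if match is None:
        return None  # fall back to ordinary keyword search
    name = match.group("name").strip().lower()
    # Parameterized query: the parsed entity is never spliced into the SQL.
    row = conn.execute(
        "SELECT capital FROM countries WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

A question outside the pattern, such as “What is an information manifold?”, returns `None` here, which is the point at which a real system would have to fall back on text search.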
My first query was, “What’s the daily show?” The system responded with the top result “The Daily Show with Jon Stewart.” Good. My second query was, “What is a dataspace’s application?” The system responded by asking me the question, “What is a data spaces application?” The first result was a link to Sourceforge’s information about EQUIP2. Sorry, the correct answer was in my mind a link to the ACM papers about dataspaces. My third query was, “What is an information manifold?” This is no trick question because there is a technical paper with a title that contains the bound phrase “information manifold.” The Ask.com system asked me, “What is an information mannford?” I don’t know what a “mannford” is.
For the types of questions a middle school student might ask, the new system will work pretty well. For popular culture topics, the system will probably be better than some I have examined this week. For the types of queries I have about technologies that address the known weaknesses of traditional semantic processing, Ask.com won’t help me too much. That’s good. Knowing what questions to ask allows me to feed my goslings. Ask.com won’t put me out of a job this year. One final point: I clicked on “mannford”. It’s a city in Oklahoma. No dataspaces among that state’s wide open spaces. Look west, young search, look to Mountain View, California.
Stephen Arnold, January 12, 2009
Xsearch CEO Norbert Weitkämper Interviewed
January 12, 2009
Weitkämper Technology–based in Staffelsee, Germany–is a search and content processing vendor with a low profile in North America. The firm offers a multi-source search suite that incorporates proprietary technology to deliver fast content and query processing. The company’s XSEARCH package can be customized to focus on the client’s specific need. It offers nine components: Clustering Engine, Suggest, DidYouMean, Summarizer, Linguistic Engine, Federated Search, Facet Navigator, Entity Extractor and Intelligent Classifier.
Norbert Weitkämper, an industrial engineer, was dissatisfied with the search results available from commercial products and developed Xsearch after working in electronic publishing. He told Search Wizards Speak:
As we are specialized on search for more than a decade our package is very well tuned; not only for speed but also for content for example. We will combine our new HitEngine with our established technologies like Linguistic, Did-You-Mean, clustering, synonyms and ontologies, or our personal ranking mechanisms. They are already released, we just have to melt them together.
He added:
For the complex roman languages our linguistic engine with its morphologic analysis is a big advantage, because algorithmic approaches like Bayesian or Porter, which are doing a good job for English, are a miserable failure.
On the subject of semantic analysis, Mr. Weitkämper said:
Semantic analysis is much more difficult for European languages than for English. We are already able to integrate thesauri or ontologies. I have not seen any system yet which meets the requirements for semantic analysis – at least when you have a closer look into the system. But storing information in a quick and accessible way is even more important for this approach, as you have to consider much more than only keywords and positions. So I can imagine that our optimized index structure may help also in this field to achieve adequate results in an acceptable amount of time.
More information about the company is available at its Web site, http://www.weitkamper.com. The full text of the interview with Mr. Weitkämper is at http://www.arnoldit.com/search-wizards-speak/xsearch.html.
Stephen Arnold, January 12, 2009
British Library Debunks Myth of a Google Generation
January 11, 2009
Libraries are fighting for money and a role in the digital world. The plight of white shoe publishers is well known. Newspapers, once the lifeblood of information, are now stuffed with soft news or, what’s worse, old information. The shift from desktop boat anchor computers to sleek hand held devices is moving forward. Flagship PC vendors like Dell Computers are in a fight for Wall Street respectability. The television and motion picture pashas believe that the fate of the traditional music publishing business is not theirs.
On January 16, 2008 (the date and the information come from this source), the British Library press room issued or issues or will issue “Pioneering Research Shows Google Generation Is a Myth.” The news release summarizes the study Information Behaviour of the Researcher of the Future. Here’s the link I located but it did not work without some clicking around. The report strikes me as something developed in an alternate universe where the Googleplex and its information system are small potatoes indeed.
He does not exist, but this member of the Google generation made it to the cover of the British Library debunking the myth study. In the future, this lad will be retrieving information from a mobile device, no PC or library required thinks this addled goose.
The study was, according to the press release,
Commissioned by the British Library and JISC (Joint Information Systems Committee), the study calls for libraries to respond urgently to the changing needs of researchers and other users. Going virtual is critical and learning what researchers want and need crucial if libraries are not to become obsolete, it warns. “Libraries in general are not keeping up with the demands of students and researchers for services that are integrated and consistent with their wider Internet experience”, says Dr Ian Rowlands, the lead author of the report.
Now this paragraph seems to suggest that “something” has happened and that libraries must “respond urgently to the changing needs of researchers and other users.” My hunch is that libraries are not surfing on the Google but paddling along trying to keep Googzilla’s spikey back in view.
Most of these curves head south, right? © British Library 2009 and presumably in the universe which I inhabit.
The news release also suggests libraries must turn to “Page 2.0”, which I presume is another silly reference to the made up world of Search 2.0, Enterprise 2.0, and Web 2.0. The news release from the future ends with the mysterious phrase “The panel:”.
Keep in mind that I am writing this notice on January 11, 2009, at 9:30 am Eastern time. The news release is from the future. It has a date of January 16, 2009. One would think that the British Library, operating outside the normal space time continuum, could do more than tell me that the myth of the Google generation does not exist. Clever headline aside, libraries must define a role for themselves before funding dwindles even more. University libraries might be grandfathered into the institutional budget. Other types? Might be a tough sale.
In my opinion, what does not exist among some in the library profession is a firm grip on the here and now. I am 65, and I think the Google generation exists. I wish it were not so, but it exists and the world, one hopes, will be better for the generation’s presence. Libraries seem to exist in a medieval world. Even Shakespeare is in step with the shift from paper to digital information. Consider Hamlet’s statement from one of the versions of the play crafted from Shakespeare’s foul papers:
Let us go in together,
And still your fingers on your lips, I pray.
The time is out of joint—O cursèd spite,
That ever I was born to set it right!
Nay, come, let’s go together.
No myth this, sprites.
Stephen Arnold, January 11, 2009
Microsoft’s Data Robustness
January 11, 2009
The “we may go out of business” Seattlepi.com Web site ran a story with the cruel title “Microsoft’s Servers Overloaded by Interest in Windows 7.” You can read this sort of weird headline and its accompanying story here. The story makes clear that Microsoft’s investments in its data centers were not up to the load imposed by the faithful downloading Windows 7.
The misstep was described as a “borkfest” by Lifehacker here. This goose isn’t sure what a borkfest is, but he can make a guess. Gina Trapani’s article nails the problem. She wrote:
If lack of infrastructure to handle an insane traffic spike over a few hours was truly the problem (even though these were conditions Microsoft created), there are lots of alternatives they could’ve used that would have kept their servers up. In fact, users have been happily downloading and distributing the Windows 7 beta build 7000 now for weeks using an efficient file-sharing protocol called BitTorrent.
When the GOOG streamed its live concert test last year, the Googlers tapped Akamai. Did Microsoft use its own content delivery network? Did Microsoft contract out the job? Whoever handled the job may want to check out another line of work in my opinion. Seattlepi.com quotes a Microsoft Web log. I noted this sentence: “We are adding some additional infrastructure support to the Microsoft.com properties before we post the public beta.” Good idea.
Stephen Arnold, January 11, 2009