How Yahoo Will Catch Google in Search

August 25, 2008

Here’s an interview you must read. On August 25, 2008, the Financial Express (India) here published an interview with Yahoo’s super wizard, Prabhakar Raghavan. Dr. Raghavan is the head of research at Yahoo, a Stanford professor, and a highly regarded expert in search, database, and associated technologies. He’s even the editor of computer science and mathematics journals. A fellow like this can leap over Google’s headquarters and poke out Googzilla’s right eye. The interview, conducted by Pragati Verma, provides a remarkable look inside the plans Yahoo has to regain control of Web search.

There were a number of interesting factoids that caught my attention in this interview. Let me highlight a few.

First, Yahoo insists that the cost of launching Web search is $300 million. Dr. Raghavan, who is an expert in things mathematical, said:

Becoming a serious search player requires a massive capital investment of about $300 million. We are trying to remove all barriers to entry for software developers, who have ideas about how to improve search.

The idea is to make it easy for a start up to tap into the Yahoo Web index and create new services. The question nagging at me is, “If Web search costs only $300 million, why hasn’t Yahoo made more progress?” I use Yahoo once in a while, but I find that its results are not useful to me. When I search Yahoo stores, I have a heck of a time finding what I need. What has Yahoo been doing since 1998? Answer: losing market share to Google and spending a heck of a lot more than a paltry $300 million while losing ground.
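
For context, tapping the index through Yahoo’s BOSS program amounted to calling a REST endpoint. Here is a minimal sketch of building such a request URL, assuming the v1-era endpoint shape and a hypothetical application ID; the service itself was retired long ago, so treat this as an illustration, not a working recipe:

```python
from urllib.parse import quote

def build_boss_url(query: str, appid: str, count: int = 10) -> str:
    # Assemble a BOSS v1-style web search request URL (2008-era shape).
    return (
        "http://boss.yahooapis.com/ysearch/web/v1/"
        + quote(query)
        + "?appid=" + appid
        + "&format=json&count=" + str(count)
    )

# A start up would fetch this URL and build its service on the JSON results.
url = build_boss_url("semantic search", "DEMO_APP_ID")
```

The point of the program was exactly this low barrier: one HTTP request, no crawler, no index of your own.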

Second, Google can lose share to search start ups. Dr. Raghavan said:

According to comScore data, Google had a 62% share of the US search market in May, while we had 21% and MSN 9%. Our prediction models suggest that Google could lose a big chunk of its market share, as BOSS partners and players come in.

My question is, “If Google is vulnerable, why haven’t other well funded search systems, Microsoft for example, made any headway?” The notion that lots of little mosquitoes can hobble Googzilla is not supported by Yahoo’s many search efforts. These range from Mindset to InQuira, from Flickr search to the deal with IBM. Chatter and projections aside, Google’s share is increasing, and I don’t see much zing from the services using Yahoo’s index so far.

Finally, people don’t want to search. I agree. There is a growing body of evidence that key word search is generally a hassle. Dr. Raghavan said:

Users don’t really want to search. They want to spend time on their work, personal lives and entertainment. They come to search engines only to get their tasks done. We will move search to this new paradigm of getting the task done….

My question is, “How is Yahoo, with its diffused search efforts, its jumble of technologies, and its inability to make revenue progress without a deal from Google, going to reverse its trajectory?” I wish Yahoo good luck, but the company has not had much success in the last year or so.

Yahoo lost its way as a directory, as a search system, and as a portal. I will wait to see how Yahoo can turn its “pushcart full of odds and ends” into a Formula One racer.

Stephen Arnold, August 25, 2008


Linguistic Agents Unveils RoboCrunch

August 21, 2008

Linguistic Agents, based in Jerusalem, has rolled out RoboCrunch.com, a semantic search engine. You can read the company news release here. Like Powerset and Hakia, Linguistic Agents’ technology makes it possible to locate information by asking the system a question. The platform, the company says, enables software to respond to and act upon natural human language in the fashion most intuitive for users.

As of August 21, 2008, the system operates with two functions:

  1. Natural Language Inquiries are transformed to Advanced Search Queries
  2. Results are semantically sorted by relevance.
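
A toy illustration of step one, turning a natural language inquiry into a keyword-style query. This is my own sketch, not Linguistic Agents’ method, and the stopword list is invented for the example:

```python
# Invented stopword list for illustration only.
STOPWORDS = {"what", "who", "where", "which", "how", "are", "is", "the",
             "a", "an", "of", "to", "by", "that", "about", "do", "does"}

def to_search_query(question: str) -> str:
    # Drop question words and stopwords; keep the content-bearing terms.
    words = [w.strip("?.,!").lower() for w in question.split()]
    return " ".join(w for w in words if w and w not in STOPWORDS)

query = to_search_query("What are the documents by Guha about the semantic Web?")
```

A real system would do far more (parsing, disambiguation, query expansion), but the input/output contract is the same: question in, advanced search query out.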

The developer, founded in 1999, plans to change the present method of Web navigation by using its advanced semantic technology to better understand users’ information requests. The company says, “Linguistic Agents has developed an integrative language platform that is based on the most current research in the field of theoretical linguistics.”

I have written about this company before. Check out the demo. Let me know your impressions.

Stephen Arnold, August 21, 2008

Attensity Lassos Brands with BzzAgent Tie Up

August 20, 2008

Attensity, a text analytics and content processing company, applies its “deep extraction” methods to law enforcement and customer support tasks. The company has formed a partnership with BzzAgent. You can find out more about this firm here. This Boston-based firm specializes in the delightfully named art of WOM, shorthand for “word of mouth” marketing. The company’s secret sauce is more than 400,000 WOM volunteers. Attensity’s technology can process BzzAgent’s inputs and deliver useful brand cues. Helen Leggatt covered the deal in “Marketers to Get ‘Unrivaled Insights’ into WOM.” You can read this interesting article here. For me, the most interesting point in Ms. Leggatt’s article was:

Each month, BzzAgent’s volunteers submit around 100,000 reports. Attensity’s text analytics technology will analyze the data contained within these reports to identify “facts, sentiment, opinions, requests, trends, and trouble spots”.
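
To make the idea concrete, here is a toy version of report tagging: a lexicon-based tally plus a crude request detector. This is my own illustration, not Attensity’s “deep extraction” technology, and the lexicons are invented:

```python
# Invented lexicons for illustration; real systems use far richer resources.
POSITIVE = {"love", "great", "recommend", "fantastic"}
NEGATIVE = {"broke", "hate", "disappointed", "awful"}

def tag_report(text: str) -> dict:
    # Tally lexicon hits and spot a question-style request.
    words = {w.strip(".,!?").lower() for w in text.split()}
    return {
        "positive": len(words & POSITIVE),
        "negative": len(words & NEGATIVE),
        "is_request": text.rstrip().endswith("?"),
    }

report = tag_report("I love the flavor, but the cap broke.")
```

Run over 100,000 reports a month, even a tally this crude starts to surface trends and trouble spots; Attensity’s extraction goes well beyond it.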

Like other content processing companies, Attensity is looking for ways to expand into new markets with its extraction and analytic technology. Is this a sign of vitality, or is it a hint that content processing companies are beginning to experience a slowdown in other market sectors? Anyone have thoughts on this type of market friction?

Stephen Arnold, August 20, 2008

Search Engine Optimization Meets Semantic Search

August 19, 2008

I’ve been sitting in the corn fields of Illinois for the last six days. I have been following the SES (Search Engine Strategies) Conference via the Web. If you have read some of my previous posts about the art of getting traffic to a Web page, you know my views of SEO. In a word, “baloney.” Web sites without content want to get traffic. The techniques used range from link trading to meta tag spamming. With Google the venturi for 70 percent of Web search, SES is really about spoofing Google. Google goes along with this stuff because the people without traffic will probably give AdWords a go when the content-free tricks don’t work reliably.

I was startled when I read the summary of the panel “Semantic Search: How Will It Change Our Lives?” The write up I saw was by Thomas McMahon, and it seemed better than the other posts I looked at this evening. You can read it here. The idea behind the panel is that “semantic search” goes beyond key words.

This has implications for people who stuff content free Web pages with index terms. Google indexes using words, and sometimes the meta tags play a role as well. If semantic search catches on, people will not search by key words; people will ask questions. The idea is that instead of typing Google +”semantic Web” +Guha, I would type, “What are the documents by Ramanathan Guha that pertain to the semantic Web?” Dr. Guha helped write the standard document several years ago. He’s a semantic Web guru, maybe the Yoda of the semantic Web?

image

Source: http://www.kimrichter.com/Blog/uploaded_images/snakeoil_1-794216.jpg

Participating in this panel were Powerset (Xerox PARC technology plus some original code), Hakia (original technology and a robust site), Ask.com (I’m not sure where it falls on the semantic scale since the rock band wizard from Rutgers cut out), and Yahoo (poor, fragmented Yahoo).

The elephant in the room but not on the panel is Google, a serious omission in my opinion. Microsoft R&D has some hefty semantic talent as well, also not on the panel.

In my opinion the semantic revolution is going to make life more difficult for the SEO folks. Semantic methods require content. Content free Web sites are going to be struggling for traffic unless several actions are taken:

  1. Create original, compelling information. I just completed an analysis of a successful company’s Web site. It was content free. It had zero traffic. The shortcut to traffic is content. The client lacks the ability to create content and doesn’t understand that people who create content charge money for their skills. If you don’t have content, go to item two below.
  2. Buy ads. Google’s traffic is sufficiently high that an ad with appropriate key words will get some hits. Buying ads is something SES attendees understand. Google understands it. You may need to pump $20,000 per month into Googzilla’s maw, but you will get traffic.
  3. Combine items one and two.
  4. Buy a high traffic Web site and shoehorn a message into it. There are some tasty morsels available. Go direct and eliminate the hassle and delay of building an audience. Acquire one.

Most SEO consulting is snake oil, and expensive snake oil at that. The role of semantic methods will be similar to plumbing: important, but like the pipes that carry water, I don’t have to see them. The pipes perform a function. Semantics and SEO are a bit of an odd couple.

Stephen Arnold, August 19, 2008

Facets “Lite”: Discovery Navigation for Thunderbird

August 18, 2008

David Huynh, a research scientist at MIT, posted a brief description of Seek 1.0 in March 2008. This software plug-in allows a user to locate information in Thunderbird email. In eCommerce and enterprise search, Endeca has successfully positioned itself as one of the leaders in point-and-click interfaces. The idea is that during content processing, the system identifies concepts, entities, and relationships. A user has the option of plugging a word into a search box or browsing categories or other objects displayed. The user can scan a list of hot links, click on one, and begin examining information. Key word search is useful, but if the user does not know the terms to use, the browse feature becomes a useful way to locate information.
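
The point-and-click idea reduces to two operations: count how the collection breaks down by some facet, and narrow the collection when the user clicks a value. Here is a minimal sketch of my own, not Seek’s or Endeca’s code, with invented sample messages:

```python
from collections import Counter

def facet_counts(messages, facet):
    # How many messages fall under each value of the chosen facet.
    return Counter(m[facet] for m in messages)

def drill_down(messages, facet, value):
    # Narrow the working set, as if the user clicked a facet value.
    return [m for m in messages if m[facet] == value]

# Invented sample data standing in for processed email metadata.
MESSAGES = [
    {"sender": "alice", "tag": "work"},
    {"sender": "bob", "tag": "work"},
    {"sender": "alice", "tag": "personal"},
]

work_mail = drill_down(MESSAGES, "tag", "work")
```

The counts give the user the browsable categories; each click re-runs the counts on the narrowed set. That loop is the whole faceted browsing experience.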

The Seek 1.0 component, according to Dr. Huynh’s Web log here, is “an extension for Mozilla Thunderbird that provides faceted browsing features to let you search through your email more efficiently.” Commercial systems can be expensive; Dr. Huynh’s is available here. Endeca is most likely aware of Dr. Huynh’s activities, and Dr. Huynh lists one of Endeca’s research scientists in his “blogroll”.

Here’s a snippet of the interface:

image

After installing the component, navigate to the Thunderbird Tools menu and click on Seek. You are good to go.

Dr. Huynh says:

It is thus important that everyone be able to deal with data themselves: gather data, sift through data, integrate data, interpret data, make informed conclusions, and present their findings to their peers and to the world.

For me the importance of Seek is that the system is sufficiently lightweight to run on most notebook computers. Furthermore, the interface integrates well with Thunderbird, so users don’t have to understand metadata to make use of the system. Finally, for now, the system is making discovery interfaces available to a broader range of email users.

Is there a downside? The system does take some time to process content. I didn’t notice significant latency, but I have a fire breather and you may have an asthmatic gizmo. We have not subjected the component to crash recovery testing; that is, is it possible to restore indexes in the event of a problem? We will get to that in the days ahead. Finally, there are a number of commercial systems gearing up to enhance, improve, and search email. At this point it’s not clear whether these services will confuse users, which can create traction problems for interesting projects like Seek.

A happy quack to Dr. Huynh and the rest of the technical Jedi knights at the MIT Haystack Group. If you want to know more about Dr. Huynh, his cv is here.

Stephen Arnold, August 18, 2008

Transinsight: Bio-Science Search

August 17, 2008

Earlier this year, I watched several “webinars” (man, I hate that term) about life science search. One company was in Denmark. Another outfit was in Michigan. A third company was the German firm Transinsight. Semantic content processing allows assisted navigation to complement the search box. The idea is that a user will recognize useful information. A key word search puts the burden on the user to find the “right” query to get the system to disgorge the needed information.

The company has a demo to showcase its technology. GoPubMed here allows you to locate information without entering and refining queries. The interface offers some useful options; for example, here are the discovered topics and statistics for 1,000 documents about oncology.

stats display

The company’s customers include Elsevier, BASF, Unilever, and the Max-Planck-Institut for Biochemistry, among others. The privately held firm has revenues estimated to be about $3.0 million per year. Venture funding has been provided by High Tech Gruenderfonds.

On August 15, 2008, Transinsight announced a deal with Abcam, a specialist in antibodies and reagents, to develop a search solution for antibody targets. You can read more about Abcam here. In today’s search lingo, the new service will be a “vertical search system.” A news release about the new system is here.

The important points about Transinsight and its announcement include:

  • The semantic technology originated in Germany
  • The system pushes beyond the point and click interfaces available for less specialized content with the addition of the illustrated statistics function in the screenshot
  • The technology appropriately handles the six or seven synonyms a single gene name may have. Although complex, the application is not a “boil the ocean” solution.
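
On the synonym point, the core trick is normalizing every variant to one canonical symbol before counting or matching, so statistics do not fragment across spellings. Here is a toy sketch with an invented synonym table; real systems draw on curated ontologies rather than a hand-written dictionary:

```python
# Illustrative synonym table only; production systems use curated resources.
GENE_SYNONYMS = {
    "p53": "TP53", "trp53": "TP53",
    "her2": "ERBB2", "neu": "ERBB2",
}

def canonical_gene(term: str) -> str:
    # Map any known synonym to its canonical symbol; pass others through.
    return GENE_SYNONYMS.get(term.lower(), term.upper())
```

With this normalization in place, a topic statistics display can treat “p53” and “TP53” as one topic instead of two.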

A happy quack to Transinsight and the Beyond Search reader who provided the link to Transinsight.

Stephen Arnold, August 17, 2008

Daticon’s Invenio

August 14, 2008

eDiscovery continues to bubble despite the lousy economy in North America. Several weeks ago we started the update procedure for our eDiscovery vendors. I made a mental note to post a short item about Daticon, a company supporting organizations engaged in electronic discovery. You can learn more about this company here. What interests me is the firm’s search technology, called Invenio. The technology is based on a neural network, and when I reviewed the system, some of its features reminded me of an outfit called Dolphin Search, but I may be wrong on this point. If Invenio is Dolphin Search, let me know.

Invenio is integrated with Daticon’s management tools. These tools are among the most fine grained I have seen. Once deployed, a manager can track most of the metrics associated with processing, reviewing, and screening email, documents, and other content associated with eDiscovery processes.

Here’s a representative display of system metrics.

dashboard

There are similarities between Daticon’s approach and that of other eDiscovery specialists such as Stratify and Zantaz. Daticon bundles eDiscovery with a work flow, data capture, metrics, and a range of content processing functions.

The search and content processing system supports concept searching, duplicate detection and removal, email threading, non-text objects, and case management tools. Essentially this is a case management function that allows analysis of activities associated with a matter.
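
Of the functions listed, email threading is the easiest to illustrate: follow each message’s in-reply-to reference back to the root of its conversation and group accordingly. A simplified sketch of my own, not Daticon’s implementation, using invented message records:

```python
def thread_messages(messages):
    # Map each reply to its parent, then collect messages under their roots.
    parent = {m["id"]: m["in_reply_to"] for m in messages if m.get("in_reply_to")}

    def root(mid):
        # Walk up the reply chain until a message with no parent is found.
        while mid in parent:
            mid = parent[mid]
        return mid

    threads = {}
    for m in messages:
        threads.setdefault(root(m["id"]), []).append(m["id"])
    return threads

MSGS = [
    {"id": "a", "in_reply_to": None},
    {"id": "b", "in_reply_to": "a"},
    {"id": "c", "in_reply_to": "b"},
    {"id": "d", "in_reply_to": None},
]
threads = thread_messages(MSGS)
```

In eDiscovery, threading matters because a reviewer can read one conversation as a unit instead of scattered messages.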

The company makes an interesting series of demonstrations available. I did not have to register to get walk throughs of the Invenio system. Try them yourself by clicking here.

Stephen Arnold, August 14, 2008

Lexalytics and Infonic Go Beyond Sentiment and Get Hitched

August 14, 2008

I learned about Lexalytics when I was researching Fast Search & Transfer. While I was writing the Enterprise Search Report, Fast Search introduced a function that would report on the sentiment in documents or email. The idea is particularly important in customer support. A flow of email that turns sour can be identified by sentiment analysis software. Fast Search’s approach was interesting to me because it tied into the company’s alert feature.

Founded in 2000, Infonic here is a publicly traded company (previously named Corpora plc) listed on the London Stock Exchange’s AIM market as LON:IFNC. The company offers geo-replication and document management solutions and also develops text analytics and sentiment technology. The firm’s Geo-Replicator software uses data compression and synchronization technology to replicate data between servers and laptops and from server to server. The firm’s Document Manager software permits scanning, search, and retrieval of processed content. The company’s text analytics software product is called Sentiment.

At the end of July 2008, the two companies announced that the sentiment units would be merged. The new unit will be based in the UK and named Lexalytics Limited. I profiled the company in my new study for the Gilbane Group here. Lexalytics software performs entity extraction, sentiment analysis, document summarization and thematic extraction. Information about Lexalytics is here.
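
Of the functions listed, document summarization is often done extractively: score each sentence by how frequent its words are across the whole document and keep the top scorers. A toy sketch of that generic approach, not Lexalytics’ method:

```python
import re
from collections import Counter

def summarize(text, n=1):
    # Rank sentences by the average document-wide frequency of their words.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))

    def score(s):
        words = re.findall(r"[A-Za-z]+", s)
        return sum(freq[w.lower()] for w in words) / (len(words) or 1)

    return sorted(sentences, key=score, reverse=True)[:n]
```

Sentences packed with the document’s dominant vocabulary float to the top; commercial systems layer entity extraction and sentiment on top of a base like this.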

According to the two companies,

The rationale behind combining the businesses is to pool the expertise and complementary products of the parties in this specialist area and to drive joint growth in sales, utilizing Infonic’s global sales capabilities.

The new company has a value estimated at $40 million. Jeff Catlin, founder of Lexalytics, will be the managing director of the new company.

Sentiment analysis is moving to the mainstream. The addled goose wishes the new sentimental outfits good luck. Oh, one final point: watch for more consolidation in the text analytics space. The market is a frosty place for some search and content processing vendors at this time.

Stephen Arnold, August 14, 2008

MarkLogic: The Army’s New Information Access Platform

August 13, 2008

You probably know that the US Army has nicknames for its elite units: Screaming Eagle, Big Red One, and my favorite, “Hell on Wheels.” Now some HUMINT, COMINT, and SIGINT brass may create a MarkLogic unit with its own flash. Based on the early reports I have, the MarkLogic system works.

Based in San Carlos (next to Google’s Postini unit, by the way), MarkLogic announced that the US Army Combined Arms Center or CAC in Ft. Leavenworth, Kansas, has embraced MarkLogic Server. BCKS, shorthand for the Army’s Battle Command Knowledge System, will use this next-generation content processing and intelligence system for the Warrior Knowledge Base. Believe me, when someone wants to do you and your team harm, access to the most timely, on point information is important. If Napoleon were based at Ft. Leavenworth today, he would have this unit report directly to him. Information, the famous general is reported to have said, is nine tenths of any battle.

Ft. Leavenworth plays a pivotal role in the US Army’s commitment to capture, analyze, share, and make available information from a range of sources. MarkLogic’s technology, which has the Department of Defense Good Housekeeping Seal of Approval, delivers search, content management, and collaborative functions.

img 813a

An unclassified sample display from the US Army’s BCKS system. Thanks to MarkLogic and the US Army for permission to use this image.

The system applies metadata based on the DOD Discovery Metadata Specification (DDMS). The content is managed automatically by applying metadata properties such as the ‘Valid Until’ date. The system uses the schema standard used by the DOD community. The MarkLogic Server manages the work flow until the file is transferred to archives or deleted by the content manager. MarkLogic points to savings in time and money. My sources tell me that the system can reduce the risk to service personnel. So, I’m going to editorialize and say, “The system saves lives.” More details about the BCKS are available here. Dot Mil content does move, so click today. I verified this link at 0719, August 13, 2008.
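
The ‘Valid Until’ idea boils down to a lifecycle rule evaluated against each record’s metadata. Here is a toy illustration of such a policy; the action names and the 30 day review window are my invention, not DDMS or MarkLogic behavior:

```python
from datetime import date

def lifecycle_action(valid_until: date, today: date) -> str:
    # Toy policy: expired content is archived; content nearing its
    # 'Valid Until' date is flagged for the content manager's review.
    if today > valid_until:
        return "archive"
    if (valid_until - today).days <= 30:
        return "flag-for-review"
    return "keep"
```

A server evaluating a rule like this on every record is what “the content is managed automatically” means in practice.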

Stephen Arnold, August 13, 2008

hakia’s Founder Riza Berkan on Search

August 12, 2008

Dr. Riza Berkan, founder of hakia, a company engaged in search and content processing, reveals the depth of engineering behind the firm’s semantic technology. Dr. Berkan said here:

If you want broad semantic search, you have to develop the platform to support it, as we have. You cannot simply use an index and convert it to semantic search.

With its unique engineering foundation, the hakia system goes through a learning process similar to that of the human brain. Dr. Berkan added:

We take the page and content, and create queries and answers that can be asked to that page, which are then ready before the query comes.

He emphasized that “there is a level of suffering and discontent with the current solutions”. He continued:

I think the next phase of the search will have credibility rankings. For example, for medical searches, first you will see government results – FDA, National Institutes of Health, National Science Foundation – then commercial – WebMD – then some doctor in Southern California – and then user contributed content. You give users such results with every search; for example, searching for Madonna, you first get her site, then her official fan site, and eventually fan Web logs.
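
Dr. Berkan’s credibility ranking can be pictured as a two-level sort: source tier first, relevance within a tier second. A toy sketch; the tier labels and scores are invented, and hakia’s actual method is not public in this detail:

```python
# Invented tier ordering: lower number means higher credibility.
TIER = {"government": 0, "commercial": 1, "professional": 2, "user": 3}

def rank_by_credibility(results):
    # Sort by source tier first, then by raw relevance within each tier.
    return sorted(results, key=lambda r: (TIER[r["tier"]], -r["score"]))

results = [
    {"url": "fan-blog", "tier": "user", "score": 0.9},
    {"url": "webmd", "tier": "commercial", "score": 0.5},
    {"url": "nih.gov", "tier": "government", "score": 0.4},
]
ranked = rank_by_credibility(results)
```

Note that the government page outranks a more “relevant” fan page; that inversion is exactly what a credibility ranking buys you.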

You can read the full text of the interview with Dr. Riza Berkan on the ArnoldIT.com Web site in the Search Wizards Speak series. The interview was conducted by Avi Deitcher for ArnoldIT.com.

Stephen Arnold, August 12, 2008
