Google and Guha: The Semantic Steamroller
April 17, 2009
I hear quite a lot about semantic search. I try to provide some color on selected players. By now, you know that I recycle in this Web log, and this article is no exception. The difference is that few people pay much attention to patent documents. In general, these are less popular than a printed dead tree daily paper, but in my opinion quite a bit more exciting. But that’s what makes me an addled goose, and you a reader of free Web log posts.
You will want to snag a copy of US20090100036 from our ever efficient USPTO. Please, read the instructions for running a query on the USPTO system. I don’t provide free support for public facing, easy to use, elegant interfaces such as that available from the Federal government.
The “eyes” of Googzilla. From US20090100036, Figure 21, Cyrus, in case you want to see what your employer is doing these days.
The title of the document is “Methods and Systems for Classifying Search Results to Determine Page Elements” by a gaggle of Googlers, one of whom is Ramanathan Guha. If you read my Google Version 2.0 or the semantic white paper I wrote for Bear Stearns when it was respected and in business, you know that Dr. Guha is a bit of a superstar in my corner of the world. The founder of Epinions.com and a blue chip wizard with credentials (Semantic Web RDF, Babelfish, Open Directory, etc.) that will take away the puffery of newly minted search consultants, Dr. Guha invented, wrote up, and filed five major inventions. These five set forth the Programmable Search Engine. You will have to chase down one of my for fee writings to get more detail about how the PSE meshes with Google’s data management inventions. If you are IBM or Microsoft, you will remind me that patents are products and that Google is not doing anything particularly new. I love those old eight track tapes, don’t you.
The new invention is the work of Tania Bedrax-Weiss, Patrick Riley, Corin Anderson, and Ramanathan Guha. Dr. Guha’s name is spelled “Ramanthan” in the patent snippet I have. Fish & Richardson, Google’s go-to search patent attorney, may have submitted it correctly in October 2007, but it emerged from the USPTO on April 16, 2009, with the spelling error.
The application is a 33 page long document, which is beefy by Google’s standard. Google dearly loves brevity so the invention is pushing into Gone with the Wind length for the GOOG. The Fish & Richardson synopsis said:
This invention relates to determining page elements to display in response to a search. A method embodiment of this invention determines a page element based on a search result. The method includes: (1) determining a set of result classifications based on the search result, wherein each result classification includes a result category and a result score; and (2) determining the page element based on the set of result classifications. In this way, a classification is determined based on a search result and page elements are generated based on the classification. By using the search result, as opposed to just the query, page elements are generated that corresponds to a predominant interpretation of the user’s query within the search results. As result, the page elements may, in most cases, accurately reflect the user’s intent.
Got that? If you did not, you are not alone. The invention makes sense in the context of a number of other Google technical initiatives ranging from the non hierarchical clustering methods to the data management innovations you can spot if you poke around Google Base. I noted classification refinement, snippets, and “signal” weighting. If you are in the health biz, you might want to check out the labels in the figures in the patent application. If you were at my lecture for Houston Wellness, I described some of Google’s health related activities.
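For readers who did not get it, here is a minimal sketch of the two steps in the synopsis: classify each search result into scored categories, then pick a page element from the predominant category. This is my illustration, not Google's implementation. The class name, the score-summing rule, the threshold, and the page element labels are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ResultClassification:
    category: str  # result category, e.g. "health" (hypothetical label)
    score: float   # result score: confidence that the result fits the category


def aggregate(results):
    """Sum category scores across all search results for one query."""
    totals = defaultdict(float)
    for classifications in results:
        for c in classifications:
            totals[c.category] += c.score
    return dict(totals)


def pick_page_element(results, elements, threshold=1.0):
    """Return the page element mapped to the predominant category, or None."""
    totals = aggregate(results)
    if not totals:
        return None
    category, score = max(totals.items(), key=lambda kv: kv[1])
    if score < threshold:  # assumed cutoff: no clear interpretation, no element
        return None
    return elements.get(category)


# Hypothetical classifier output: one list of classifications per search result.
results_for_query = [
    [ResultClassification("health", 0.9), ResultClassification("shopping", 0.1)],
    [ResultClassification("health", 0.7)],
    [ResultClassification("shopping", 0.4)],
]
page_elements = {"health": "symptom-refinement-links", "shopping": "product-ads"}

chosen = pick_page_element(results_for_query, page_elements)
print(chosen)
```

The point of the sketch is the one the synopsis makes: the page element is driven by what the results are about, not by the query string alone.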
On the surface, you may think, “Page parsing. No big deal.” You are not exactly right. Page parsing at Google scale, the method, and the scores complement Google’s “dossier” function about which Sue Feldman and I wrote in our September 2008 IDC client only report. This is IDC paper 213562.
What does a medical information publisher need with those human editors anyway?
Stephen Arnold, April 17, 2009
True Knowledge: Semantic Search System
April 16, 2009
A happy quack to the readers who sent me a link to this ZDNet Web log post called “True Knowledge API Lies at the Heart of Real Business Model” here. I had heard about True Knowledge (The Internet Answer Engine) a while back, but I tucked away the information until a live system became available. I had heard that the computer scientist spark plug of True Knowledge (William Tunstall-Pedoe) has been working on the technology for about 10 years. The company’s Web site is www.trueknowledge.com, and it contains some useful information. You can sign up for a beta account, read Web log posts, and get some basic information about the system.
About one year ago, the Financial Times’s Web log here reported:
Another Semantic Web company looking for cash: William Tunstall-Pedoe of True Knowledge says he needs $10m in venture capital to back the next stage of his Cambridge (UK)-based company, which is trying to build a sort of “universal database” on the Web.
In April 2009, the company is raising its profile with an API that allows developers to make Web sites smarter.
Interface. © True Knowledge
The company said:
True Knowledge is a pioneer in a new class of Internet search technology that’s aimed at dramatically improving the experience of finding known facts on the Web. Our first service – the True Knowledge Answer Engine – is a major step toward fulfilling a longstanding Internet industry goal: providing consumers with instant answers to complex questions, with a single click.
The company’s proprietary technology allows a user to ask questions and get an answer. Quite a few companies have embraced the “semantic” approach to content processing. The reason is that traditional search engines require that the person with the question find the magic combination that delivers what’s needed. The research done by Martin White and my team, among others, makes clear that about two thirds of the users of a key word search system come away empty handed, annoyed, or both. True Knowledge and other semantic-centric vendors see significant opportunities to improve search and generate revenue.
Architecture block diagram. © True Knowledge
Paul Miller, the author of the ZDNet article, wrote:
True Knowledge is certainly interesting, and frequently impressive. It remains to be seen whether a Platform proposition will set them firmly on the road to riches, or if they’ll end up finding more success following the same route as Powerset and getting acquired by an existing (enterprise?) search provider.
ZDNet wrote a similar article in July 2007 here. Venture Beat mentioned True Knowledge here in July 2008 in a story that referenced Cuil.com (former Googlers) and Powerset (now part of Microsoft’s search cornucopia). Hakia.com was not mentioned even though at that time in 2008, Hakia.com was ramping up its PR efforts. Venture Beat mentioned Metaweb, another semantic start up that obtained $42 million in 2008, roughly eight times the funding of True Knowledge. (Metaweb’s product is Freebase, an open, shared database of the world’s information. More here.) You will want to read Venture Beat’s April 13, 2009, follow up story about True Knowledge here. This article contains an interesting influence diagram.
I don’t know enough about the appetite of investors for semantic search systems to offer an opinion. What I found interesting was:
- The company has roots in Cambridge University where computational approaches are much in favor. With Autonomy and Lemur Consulting working in the search sector, Cambridge is emerging as one of the hot spots in search
- The language and word choice used to describe the system here reminded me of some Google research papers and the work of Jennifer Widom at Stanford University. If there are some similarities, True Knowledge may be more than a question answering system
- The company received an infusion of $4.0 million in a second round of funding completed in mid 2008. Octopus Ventures provided an earlier injection of $1.2 million in 2007.
- The present push is to make the technology available to developers so that the semantic system can be “baked in” to other applications. The notion is a variant of that used in the early days of Verity’s OEM and developer push in the late 1980s. The API account is offered without charge.
- There’s a True Knowledge Facebook page here.
I recall seeing references to a private beta of the system. I can’t locate my notes from my 2007 trips to the UK, but I think that may have been the first time I heard about the system. I did locate a link to a demo video here, dated late 2007. That video explains that the information is represented in a way “that computers can understand”. I made a note to myself about this because this type of function in 2007 was embodied in the Guha inventions for the Google Programmable Search Engine.
The API allows systems to ask questions. The developer can formulate a query and see the result. Once the developer has the query refined, the True Knowledge system makes it easy for the developer to include the service in another application. The idea, I noted, was to make enterprise software systems smarter. The system performs reasoning and inference. The system generates answers and a reading list. The system can handle short queries, performing accurate disambiguation; that is, figuring out what the user meant. The system also makes it possible for a user to provide information to the system, in effect a Wikipedia type of function. The approach is a clever way for the user to teach the True Knowledge system.
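To make “reasoning and inference” concrete, here is a toy sketch of answering a question from stored facts by chaining them together. It is not the True Knowledge API, which I have not seen in enough detail to reproduce. The fact triples, the `is_in` relation, and the question format are hypothetical and chosen only for illustration.

```python
# Hypothetical fact store: (subject, relation, object) triples.
FACTS = {
    ("Cambridge", "is_in", "England"),
    ("England", "is_in", "United Kingdom"),
    ("United Kingdom", "is_in", "Europe"),
}


def is_in(place, region, facts=FACTS, seen=None):
    """Infer containment by chaining 'is_in' facts (a transitive lookup)."""
    if seen is None:
        seen = set()
    if (place, "is_in", region) in facts:
        return True
    for subj, rel, obj in facts:
        if subj == place and rel == "is_in" and obj not in seen:
            seen.add(obj)  # guard against cycles in user-contributed facts
            if is_in(obj, region, facts, seen):
                return True
    return False


def answer(question):
    """Answer questions of the assumed form 'Is X in Y?'."""
    text = question.strip().rstrip("?")
    if not text.startswith("Is ") or " in " not in text:
        return "Cannot parse question"
    place, region = text[3:].split(" in ", 1)
    return "Yes" if is_in(place, region) else "No known facts support that"


print(answer("Is Cambridge in Europe?"))
```

No stored fact says that Cambridge is in Europe. The answer comes from combining three facts, which is the sort of step a key word index cannot take. A user teaching the system would amount to adding triples to the store.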
Hakia: Taking on Google
April 12, 2009
I find these stories about search systems that will challenge Google fascinating. One of the more recent ones I saw was an April 6, 2009, article “Is Hakia.com the Search Engine That Is Going to Challenge Google?” which appeared in My Questions, a South African Web log here. The story provides a useful summary of the features of the Hakia semantic system. I ran an interview with one of the Hakia founders, Riza Berkan, in August 2008. You can read that exclusive interview here. The point that jumped out at me in the My Questions’ write up was this comment:
The results are ranked according to the relevant site and the categories that they belong to.
There is a growing interest in the authority of a source. The role that a subject matter expert, a Ph.D. committee, or a reference librarian once played has to make its way into software. The present financial climate and the inefficiency of finding a reliable way to validate a source make human methods highly variable. Software, with its machine like consistency, seems to offer a solution. Hakia has probed this issue and includes this component in its search results ranking.
Another comment that caught my attention was:
Hakia is a very good search engine but it still has a lot of ground to cover before it can take over much of the market the Google has. We will only have to see with time how the market receives it.
I think Hakia has much to commend it. My recollection is that the company’s processing of health and medical information was quite useful. In my experience, semantic processes often work more quickly and reliably when processing content that is about a specific subject area. But technology continues to improve and some vendors, like Autonomy, emphasize that their systems can adapt to a changing flow of content. I have been around a long time, and I think that “drift” remains a challenge for many search and content processing vendors.
The effort of carpetbaggers and azure chip consultants to sell taxonomy as a silver bullet is pragmatic. With a managed list of terms or categories, the content can be put in a pigeonhole. There may be drift, but the categories act as a red herring for other indexing flaws.
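A small sketch shows how the pigeonhole approach works and where drift bites. The term list, the categories, and the sample documents are made up for the example. Real taxonomy tools are more elaborate, but the failure mode is the same.

```python
# Hypothetical managed term list: category -> trigger terms.
TAXONOMY = {
    "cardiology": {"heart", "cardiac", "artery"},
    "oncology": {"tumor", "cancer", "chemotherapy"},
}


def pigeonhole(text, taxonomy=TAXONOMY):
    """Return the sorted categories whose trigger terms appear in the text."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return sorted(cat for cat, terms in taxonomy.items() if words & terms)


docs = [
    "Cardiac stents keep a blocked artery open.",
    # Vocabulary has drifted: an oncology topic the term list never anticipated.
    "New immunotherapy trials report promising results.",
]

tagged = [pigeonhole(d) for d in docs]
print(tagged)
```

The first document lands in its pigeonhole. The second matches nothing because the managed list has not kept pace with the content. A tidy category display does not reveal that gap to the user.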
With the deteriorating financial climate, many search vendors will be forced to retrench or exit the business. Each week I hear rumors about companies that are either for sale, seeking investors, or preparing to close their doors. I will have to follow up with Hakia to see if the company still wants to challenge Google.
Stephen Arnold, April 12, 2009
Digital Gutenberg Study Completed
April 10, 2009
Infonortics Ltd. received the manuscript for Google: The Digital Gutenberg yesterday, April 10, 2009. The monograph is the third in my series of Google analyses. The topics addressed in this new study include:
- Google’s content automation methods
- A discussion of dataspace functions, the report or dossier system, and the content-that-follows system
- A description of Google’s increasing impact on education, scholarly publishing, and commercial online
The information in the study comes from open sources such as Google’s presentations, technical reports, and US government filings to the SEC and USPTO. I have revised and updated some of the information I wrote for Bear Stearns, Trust Company of the West, and IDC for this study as well as included completely new material that, as far as I know, has not been described in detail elsewhere. I am often asked, “Does Google cooperate with you and provide information?” The answer is, “No.” The Google ignores me, making sure my “authoritative” score is near the bottom of the barrel. I have remarked on many occasions that Google would like to see this goose cooked. Google professionals off the record express their surprise at what their employer is doing. Google is not into opening its technical kimono for researchers of my ilk. Compartmentalization is useful I suppose.
Why Google and Publishing
I narrowed the focus to publishing for three reasons:
First, Google finds itself in the news because some newspapers have become critical of Google’s pointing to content produced by third parties. What I have tried to do is explain that Google’s technology processes information and provides access. One of my findings is that Google has shown considerable restraint in the use of its inventions. If my research data are correct, Google could be more active as a content generator than it has chosen to be. Google, for this reason, has “potential energy”; that is, without much additional investment, the company could produce more content objects.
Second, Google’s technical infrastructure plus its software adds up to create a “digital Gutenberg”; that is, an individual could create a Knol (fact based essay on a subject), create a business listing in another Google service, and create a Web log on the Knol’s topic. The “author” or user uses Google as a giant information factory. Inputs go in and traffic “finds” the information. There are different ways to monetize this manufacturing and distribution system. Google has created its own version of Ford’s River Rouge integrated facility.
Third, Google is following what users click on. As a result, it is important to track the demographic behaviors of Google customers, advertisers, licensees, and users. The users, not Google management, help determine where Google goes and what Google does. Competitors who attempt to predict Google’s next action are likely to be off base unless those analyses are anchored in demographic and usage data. Another finding is that Google is relying on demographics to carry its “River Rouge” and “digital Gutenberg” capabilities into different markets.
Google did not open its kimono to me. The open source intelligence methods yielded that data in this study. You can see one of my tools here.
Differences in Digital Gutenberg
In my first two studies, I explained in detail Google’s systems and methods. I include a couple of Google equations in this new study. I make brief references to patent documents and technical papers, but my editor and I have worked to make this study more accessible to the general business reader. I lack the capacity to write a “Sergey and Larry eat pizza” monograph. Frankly, technology, not pizza, interests me. I suppose I am as mechanistic and data centric as some Googlers.
Also, I don’t take sides. Google is neither good nor evil. The companies affected by Google’s waves of innovation are just average companies. Google, however, thrives on sophisticated technology and data. In my encounters with Googlers, most would prefer to talk about a function instead of the color of a sofa. The companies criticizing Google lack Google’s techno-centrism. I point out that Google’s actions and public statements make perfect sense to someone who is Googley. Those same statements when heard by those who operate mostly from subjective information come across as arrogant or, in some cases, pretty wacky.
The conclusion to the study is a discussion of one of Google’s most important initiatives in its 10-year history: the Google App Engine. That surprised some of the people whom I asked to read early drafts of the manuscript. The App Engine is the culmination of many thousands of hours of engineering, and it will make its presence felt across the many business sectors into which Google finds itself thrust.
You can see an early version of the study’s table of contents here. (And, yes, I know the Chinese “invented movable wood block printing”. I used “Gutenberg” as a literary convenience.)
Who Should Read This Monograph?
My mom never read any of my monographs. She looked at my first study, written decades ago, and said, “Dull.” Today, I am still writing dull stuff, but the need to understand what is happening and will happen in electronic information is escalating.
At a minimum, I think the contents of the Digital Gutenberg would be of interest to companies engaged in traditional media; that is, publishing, video and motion picture production, and broadcasting. Others who may find the monograph a useful reference include:
- Analysts, consultants, and pundits who track Google
- Google’s competitors and soon-to-be competitors
- Lawyers who are on the prowl for Google-related information
- Entrepreneurs who want to find out how to “surf on Google”
- Government regulators eager to find out whether the existing net of regulations has hooked Google
- People who want to work at Google because some of Google’s most exciting innovations are not well known.
Cirilab: Entity Extraction
April 6, 2009
I took a quick look at Cirilab in order to update my files about entity extraction vendors.
Cirilab develops practical search, retrieval and categorization software designed to increase organizational productivity by effectively harnessing key knowledge resources. Cirilab offers a range of advanced analysis and organization applications and tools.
I learned about the company when another consultant sent me links to several online demonstrations of Cirilab’s technology. I located an older but useful discussion of the Cirilab technology here. You can explore a Wikipedia entry about Winston Churchill here and a document navigator of Sir Winston’s writings here. The engine generating these demos is called the KGE or Knowledge Generation Engine. The idea is that KGE can process unstructured text and generate insights into that text.
Source: http://www.cirilab.com/TSMAP/Cirilab_Library/Literature/Winston_Churchill/WikiKMapPage/index.htm
The company’s enterprise solutions include vertical builds of the KGE:
- Publishing. The Web Ready Publishing service allows an organization to take unstructured data in WordPerfect, Word, Adobe PDF, HTML, and even Text files, and publish it in a Web Ready Publishing format so that it is instantly available to your customers in a thematically navigable format.
- Pharma. Cirilab can “read” the documents and therefore allow “mining” of existing data.
- Legal. KGE permits discovery of information.
- Security and intelligence. Cirilab products provide unique insights into this information not otherwise available.
The company offers a range of desktop products. These are excellent ways to learn about the features and functions of Cirilab’s KGE system.
More recently, Cirilab has succeeded in developing and bringing to market a core suite of technologies known as KOS (Knowledge Object Suite) based on its Multidimensional Semantic Spatial Indexing Technology.
You can register and receive a free, thematic map of your Web site. The company is located in Ottawa, Ontario. You can get more information here.
Stephen Arnold, April 6, 2009
Cazoodle: Semantic Search
April 3, 2009
A happy quack to the reader who sent me a link to Euwyn’s “Cazoodle – Semantic Data-aware Search” here. Developed by Chambana wizards, Cazoodle “looks to create semantic data-aware search for various verticals, starting with apartments, events, and shopping (electronics, for the most part).” Euwyn makes clear that Cazoodle is a vertical search engine; that is, the content focuses on a specific topic such as apartments. Cazoodle said:
[It is] a startup company from the University of Illinois at Urbana-Champaign (UIUC), aims to enable “data-aware” search– to access the vast amount of structured information beyond the reach of current search engines. The company is co-founded by Prof. Kevin C. Chang and his research team of graduate and undergraduate students, with the support of the University and technology transfer from the MetaQuerier research at UIUC. Cazoodle is located at EnterpriseWorks, an incubator facility of the University, on the Research Park of UIUC in Champaign, Illinois.
The company seems to be going in the same direction as Classifieds.com, a Web start up that I found quite interesting. Cazoodle delivers a “semantic data-aware search.” I ran a query for an apartment in Urbana, where I worked on my PhD many years ago. The Cazoodle results looked like this:
The service looks interesting, demonstrating that dataspaces can be useful. I detected a few Google influences as well. Click here to try the beta search.
Stephen Arnold, April 3, 2009
Missing the Kosmix Story
March 31, 2009
I read many stories about the search engines that will be the “next Google.” The editors at Forbes.com like these write ups as well. The most recent one begins with the old saw “life after Google”. You can read “Life after Google: What’s the Next Hot Search Engine?” here. Mr. Buley tips his hat toward Cuil.com, Dr. Anna Patterson’s whack at Google’s carotid. There’s a brief glimpse of Aardvark, a social service that expects Web surfers to formulate and type questions into a search box. With the average query in the 2.3 word range, I think we know how successful that approach will be in crippling the GOOG. Finally, Mr. Buley swallows a bite of Kosmix PR goodness. Kosmix is a mash up service, more like a smart portal with Google results and probably a half dozen or more other sources of information. The key point of the write up is that the world does not need another Google. What the world needs is a mash up, point and click, we think for you service probably a lot like Kosmix. The most interesting comment in the write up, a sentence lost on the Forbes editor who crafted the headline, was:
And even if the new search engines persuade users to try more than just Google, they still face the prospect of Google moving into their turf. Blog search used to be a separate market segment in search, with several companies battling to dominate. After Google added blog search to its main search menu, there was the predictable shake-out. Of course, this also means that should any of these companies become a success inside their niche, they would become a Google acquisition target — which may be all the motivation any of them need. “I think it’s fair to say that the conventional search game is over,” says Kosmix’s Rajaraman. “But that doesn’t mean the Internet game is over.”
I wonder if Mr. Buley tugged the threads that connect Cuil.com and Kosmix.com to Google? Cuil.com indexed some quite interesting Google content in its prelaunch run up. Kosmix’s Anand Rajaraman has demonstrated in his Web log pretty useful Google access in my opinion. Not just anyone gets a chance to hob nob with Peter Norvig. That might be a more interesting angle to pursue. Ah, if I were not an addled goose and so old and tired.
Stephen Arnold, March 30, 2009
Cuil.com Gets Better
March 30, 2009
I did a fly over of the Cuil.com Web site. What triggered an overflight was a Google patent; specifically, US20090070312, “Integrating External Related Phrase Information into a Phrase-Based Indexing Information Retrieval System”. The application was filed in September 2007, and the USPTO spit it out on March 12, 2009. I discussed a chain of Dr. Patterson’s inventions in my 2007 study Google Version 2.0 here. Although Dr. Patterson is no longer a full-time Googler, the tendrils of her research from Xift to Cuil pass through the GOOG. When I looked at Cuil.com today (March 29, 2009), I ran my suite of test queries. Most of them returned more useful and accurate results than my first look at the system in July 2008 here.
Several points I noticed:
- The mismatching of images to hits has mostly been corrected. The use of my logo for another company, which was in the search engine optimization business, was annoying. No more. That part of the algorithm soup has been filtered.
- The gratuitous pornography did not pester me again. I ran my favorites such as pr0n and similar code words. There were some slips which some of my more young at heart readers will eagerly attempt to locate.
- The suggested queries feature has become more useful.
- My old chestnut “enterprise search” flopped. The hits were to sources that are not particularly useful in my experience. The Fast Forward conference is no more, but there’s a link to the now absorbed user group. The link to the enterprise search summit surprised me. The conference has been promoting like crazy despite the somewhat shocking turn out last year in San Jose, so it’s obvious that flooding information into sites fools the Cuil.com relevancy engine.
- The Explore by Category is now quite useful. One can argue if it is better than the “improved” Endeca. I think Cuil.com’s automated and high-speed method may be more economical to operate. Dr. Patterson and her team deserve a happy quack.
I am delighted to see that the improvements in Cuil.com are coming along nicely. Is the system better than Google’s or Microsoft’s Web search system? Without more testing, I don’t think I can make a definitive statement. I am certain that there will be PhD candidates or ASIS members who will rise to fill this gap in my understanding.
I have, however, added the Cuil.com system to my list of services to ping when I am looking for information.
Stephen Arnold, March 30, 2009
Storage a Problem for Most Organizations
March 30, 2009
Most people don’t know too much about Kroll, a unit of a diversified financial services firm. I was surprised, therefore, to see a public story about a survey conducted by this ultra low profile outfit. The article was “Storage Practices Don’t Match Policies” in IDM.Net, an Australian Web log here. The point of the write up was that in the Kroll survey storage policies were not particularly well conceived. The most important comment in the write up was:
The survey found that 40 percent of individuals stated that their company has a policy regarding where data should be stored. However, the survey results also revealed that 61 percent of respondents “usually” save to a local drive instead of a company network.
Makers of automated back up systems will rejoice. Attorneys suing an organization with lousy back up practices are probably dancing in the streets. Where there are informal collections of data, there is gold for the eDiscovery prospector.
If you want to know more about Kroll, click here and read the Search Wizards Speak interview with David Chaplin, one of the developers of Engenium, an interesting software system for extracting nuggets from these data gold mines.
Stephen Arnold, March 30, 2009
Search Marketing AI Engine
March 26, 2009
I receive a handful of mirthful email each month. One reader alerted me to a Web site that makes it easy to generate search, content processing, and text mining marketing documents. I am not sure if this site is funny ha ha or funny painful. Please, decide for yourself. Navigate to the Corporate Gibberish Generator here. I generated the following text for a made up company called Enterprise Search Consulting Actualizers (ESCA). The system generated this:
Enterprise Search Consulting Specialists is the industry leader of world-class synergies. We apply the proverb “Look before you leap” not only to our channels but our capability to benchmark. We will extend our capability to syndicate without reducing our aptitude to syndicate. Do you have a scheme to become B2C2B? We have proven we know that it is better to morph strategically than to synthesize dynamically. The aptitude to disintermediate intuitively leads to the power to monetize efficiently. Quick: do you have a reconfigurable game plan for coping with new schemas? A company that can synthesize courageously will (at some unknown point of time) be able to incubate courageously. The metrics for raw bandwidth are more well-understood if they are not strategic. A company that can maximize elegantly will (at some point in the future) be able to expedite correctly. Your budget for reinventing should be at least twice your budget for monetizing.
Some of the news releases I receive seem to make use of this system or one that is similar. I bookmarked this gem.
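I have not seen the generator's source, but the effect is easy to approximate. Here is a minimal sketch, assuming the technique is nothing more than random slot filling in sentence templates. The word lists and templates are mine, borrowed from the sample output above.

```python
import random

VERBS = ["synthesize", "monetize", "disintermediate", "benchmark"]
ADVERBS = ["strategically", "dynamically", "courageously", "efficiently"]
NOUNS = ["schemas", "channels", "synergies", "bandwidth"]

TEMPLATES = [
    "{company} is the industry leader of world-class {noun}.",
    "It is better to {verb} {adverb} than to {verb2} {adverb2}.",
    "Do you have a game plan for coping with new {noun}?",
]


def gibberish(company, sentences=3, seed=None):
    """Build marketing filler by dropping random words into templates."""
    rng = random.Random(seed)  # a fixed seed makes the output repeatable
    out = []
    for _ in range(sentences):
        template = rng.choice(TEMPLATES)
        # Unused keyword arguments are ignored by str.format.
        out.append(template.format(
            company=company,
            noun=rng.choice(NOUNS),
            verb=rng.choice(VERBS),
            verb2=rng.choice(VERBS),
            adverb=rng.choice(ADVERBS),
            adverb2=rng.choice(ADVERBS),
        ))
    return " ".join(out)


print(gibberish("ESCA", sentences=4, seed=1))
```

A dozen lines of word lists is all it takes. That may explain why so many releases read alike.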
Stephen Arnold, March 26, 2009