NetBase and Content Intelligence
April 30, 2009
Vertical search is alive and well. Technology Review described NetBase’s Content Intelligence here. The story, written by Erica Naone, was “A Smarter Search for What Ails You”. Ms. Naone wrote:
organizes searchable content by analyzing sentence structure in a novel way. The company created a demonstration of the platform that searches through health-related information. When a user enters the name of a disease, he or she is most interested in common causes, symptoms, and treatments, and in finding doctors who specialize in treating it, says Netbase CEO and cofounder Jonathan Spier. So the company’s new software doesn’t simply return a list of documents that reference the disease, as most search engines would. Instead, it presents the user with answers to common questions. For example, it shows a list of treatments and excerpts from documents that discuss those treatments. The Content Intelligence platform is not intended as a stand-alone search engine, Spier explains. Instead, Netbase hopes to sell it to companies that want to enhance the quality of their results.
NetBase (formerly Accelovation) has developed a natural language processing system. Ms. Naone reported:
NetBase’s software focuses on recognizing phrases that describe the connections between important words. For example, when the system looks for treatments, it might search for phrases such as “reduce the risk of” instead of the name of a particular drug. Tellefson notes that this isn’t a matter of simply listing instances of this phrase, rather catching phrases with an equivalent meaning. Netbase’s system uses these phrases to understand the relationship between parts of the sentence.
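To make the phrase idea concrete, here is a minimal sketch in Python. It is not NetBase's method, which the article does not describe in detail. The phrase list, the `extract_treatment` function name, and the sample sentences are all my assumptions. The sketch treats several differently worded phrases as one "treats or prevents" relation and pulls out the words on either side.

```python
import re

# Hypothetical list of phrases that signal the same relation.
# A real system would need far more than these four patterns.
TREATMENT_PHRASES = [
    r"reduces? the risk of",
    r"lowers? the (?:risk|chance|likelihood) of",
    r"helps? prevent",
    r"is used to treat",
]

# subject + relation phrase + object, with an optional "may" or "can".
_RELATION = re.compile(
    r"(?P<subject>[\w\s-]+?)\s+(?:may |can )?(?P<phrase>"
    + "|".join(TREATMENT_PHRASES)
    + r")\s+(?P<object>[\w\s-]+)",
    re.IGNORECASE,
)


def extract_treatment(sentence):
    """Return (treatment, condition) if the sentence matches, else None."""
    match = _RELATION.search(sentence)
    if match is None:
        return None
    return match.group("subject").strip(), match.group("object").strip()


if __name__ == "__main__":
    for text in [
        "Aspirin may reduce the risk of heart attack.",
        "Regular exercise helps prevent diabetes",
        "Aspirin is a drug.",
    ]:
        print(text, "->", extract_treatment(text))
```

A pattern list like this only catches wording someone thought to write down. The quoted passage suggests NetBase goes further and catches phrases with an equivalent meaning, which a handful of regular expressions cannot do.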
At this point in the write up, I heard echoes of other vendors with NLP, semantics, bound phrase identification, etc. Elsevier has embraced the system for its illumin8 service. You can obtain more information about this Elsevier service here. Illumin8 asked me, “What if you could become an expert in any topic in a few minutes?” Wow!
The NetBase explanation of content intelligence is:
… understanding the actual “meaning” of sentences independent of custom lexicons. It is designed to handle myriads of syntactical sentence structures – even ungrammatical ones – and convert them to logical form. Content Intelligence creates structured semantic indexes from massive volumes of content (billions of web-pages and documents) used to power question-and-answer type of search experiences.
NetBase asserts:
Because NetBase doesn’t rely on custom taxonomies, manual annotations or coding, the solutions are fully automated, massively scalable and able to be rolled-out in weeks with a minimal amount of effort. NetBase’s semantic index is easy to keep up-to-date since no human editing or updates to controlled vocabulary are needed to capture and index new information – even when it includes new technical terms.
Let me offer several observations:
- The application of NLP to content is not new and it imposes some computational burdens on the search system. To minimize those loads, NLP is often constrained to content that contains a restricted terminology; for example, medicine, engineering, etc. Even with a narrow focus, NLP remains interesting.
- “Loose” NLP can squirm around some of the brute force challenges, but it is not yet clear if NLP methods are ready for center stage. Sophisticated content processing often works best out of sight, delivering to the user delightful, useful ways to obtain needed information.
- A number of NLP systems are available today; for example, Hakia. Microsoft snapped up PowerSet. One can argue that some of the Inxight technology, acquired first by Business Objects and then by the software giant SAP, qualifies as NLP. To my knowledge, none of these has scored a hat trick in revenue, customer uptake, and high volume content processing.
You can get more information about NetBase here. You can find demonstrations and screenshots. A good place to start is here. According to TechCrunch:
NetBase has been around for a while. Originally called Accelovation, it has raised $9 million in two rounds of venture funding over the past four years, has 30 employees…
In my files, I had noted that the funding sources included Altos Ventures and ThomVest, but these data may be stale or just plain wrong. I don’t have enough information about NetBase to offer substantive comments. NLP requires significant computing horsepower. I need to know more about the plumbing. Technology Review provided the sizzle. Now we need to know about the cow from which the prime rib comes.
Stephen Arnold, April 30, 2009
Google Base Tip
April 23, 2009
Google Base is not widely known among the suits who prowl up and down Madison Avenue. For those who are familiar with Google Base, the system is a portent of Googzilla’s data management capabilities. You can explore the system here. Ryan Frank’s “Optimizing Your Google Base Feeds” here provides some useful information for those who have discovered that Google Base is a tool for Google employment ads, real estate, and other types of structured information. Mr. Frank wrote:
It is also important to note that Google Base uses the information from Base listings for more than just Google OneBox results. This data may also be displayed in Google Product Search (previously Froogle), organic search results, Google Maps, Google Image Search and more. That adds up to a variety of exposure your site could potentially receive from a single Google Base listing.
Interesting, right? Read the rest of his post for some useful information about this Google service.
Stephen Arnold, April 23, 2009
Personalized Network Searching: Google after People Search
April 22, 2009
The hounds of the Internet are chasing Google’s “Search for Me on Google”. I can’t add to that outpouring of insight about technology that is exciting today but dated by Google time standards. I can, however, direct your attention to US 7,523,096, “Methods and Systems for Personalized Network Searching.” You can download this patent from the USPTO. The document was published on April 21, 2009, and was filed on December 3, 2003. You may want to read the background of the invention and scan the claims. The diagrams are standard Google fare, leaving much to the reader who must bring an understanding of other Google subsystems to the analysis. To put the Search on Me discussion into context, here’s the abstract for the granted patent, now almost six years old:
Systems and methods for personalized network searching are described. A search engine implements a method comprising receiving a search query, determining a personalized result by searching a personalized search object using the search query, determining a general result by searching a general search object using the search query, and providing a search result for the search query based at least in part on the personalized result and the general result. The search engine may utilize ratings or annotations associated with the previously identified uniform resource locator to locate and sort results.
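The abstract is dense, so here is a rough Python sketch of the general idea as I read it. It is not Google's implementation. The term-count scoring, the `boost` weight, the function names, and the sample URLs are all assumptions made for illustration. The same query runs against a personal collection and a general one, and the two result sets are blended, with optional user ratings nudging the order.

```python
def search(index, query):
    """Return {url: score} for pages that contain every query term.

    The score is a plain term count, a stand-in for real ranking.
    """
    terms = query.lower().split()
    results = {}
    for url, text in index.items():
        words = text.lower().split()
        if all(term in words for term in terms):
            results[url] = sum(words.count(term) for term in terms)
    return results


def personalized_search(query, personal_index, general_index,
                        ratings=None, boost=2.0):
    """Blend personal and general results into one ranked list of URLs.

    `boost` (assumed) weights hits from the personal collection.
    `ratings` (assumed) maps a URL to a score the user gave it earlier.
    """
    ratings = ratings or {}
    personal = search(personal_index, query)
    general = search(general_index, query)
    combined = {}
    for url in set(personal) | set(general):
        combined[url] = (general.get(url, 0)
                         + boost * personal.get(url, 0)
                         + ratings.get(url, 0))
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(combined, key=lambda url: (-combined[url], url))


if __name__ == "__main__":
    general_index = {
        "a.example": "python snake facts",
        "b.example": "python programming tutorial",
    }
    personal_index = {"b.example": "python programming tutorial"}
    print(personalized_search("python", personal_index, general_index))
```

In this toy run the bookmarked page outranks the general page even though both mention the query once. That is the behavior the abstract points to when it says the result is based "at least in part" on both the personalized and the general result.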
This is an important invention attributed to Stephen Lawrence and Greg Badros. Both have made substantive contributions to Google in the past. You may want to examine the current people search and then check out the dossier invention that I have written about elsewhere. There are some interesting enhancements to the core dossier technology in the future. My assertion is that Google moves slowly. When these “innovations” roll out, some are surprised. The GOOG leaves big footprints in my experience. Where’s Pathfinder when one needs him?
Stephen Arnold, April 22, 2009
GEFCO and Exalead: Win International Prize for Innovation
April 21, 2009
Congratulations to GEFCO, and by extension, Exalead, for winning the Grand Prix et Trophée de l’innovation prize in recognition of innovation in business information management. The trophy was presented on April 7, 2009, by CIO-online.com, Le Monde Informatique and IT News Info. There’s a video of the awards at http://www.trophees-cio.com/ and a PDF profile of the winners and projects at CIO Online.
A leading European provider of vehicle transport, logistics, and other transportation services, GEFCO earned its award thanks to Exalead, a leader of search based business application solutions and information access in the enterprise and on the web. GEFCO won the CIO-online.com trophy for its new vehicle track and trace service built on Exalead’s CloudView platform. (You can read about CloudView here.)
GEFCO uses Exalead CloudView to drive a search based application engine and real time operational tools for reporting, query, and analysis of its database of vehicles delivered, logistics, and spare parts management.
ArnoldIT.com interviewed Paul Doscher, U.S. CEO of Exalead, in January 2009, and Mr. Doscher spoke of their partnership with GEFCO then. He stated:
GEFCO is using Exalead to track their vehicles. GEFCO’s new ‘Track and Trace’ application is built upon Exalead’s flagship platform that offers powerful search functionality and can provide up-to-the-minute information from an extremely large data set. You can read the entire interview on the Search Wizards Speak service here.
Jessica Bratcher, April 21, 2009
Semantic Roll Up: The Effect of Financial Compression
April 21, 2009
A flurry of emails arrived today about the tie up among several companies with good reputations but profiles that are lower than those enjoyed by Autonomy and Endeca. You can read the official news announcement here about the deal among Attensity, Empolis GmbH, and Living-e AG. The conflation is called The Attensity Group. Here’s a snapshot of each company based on the information I ratted out of my files in the midst of new carpet, painting, and hanging new boxer dog pictures:
- Attensity. Deep text processing. Started in the intel community. Probed marketing. Acted as ring master for the tie up.
- Empolis GmbH. (Link was dead when I checked it on April 20, 2009.) A distribution and archiving system and file based content transformation. Orphaned after parent Bertelsmann faced up to the realities facing the dead tree crowd. Now positions itself in knowledge management.
- Living-e AG. Provides software products that enable efficient information exchange. Web content management, behavior analysis. Founded in 2003 as WebEdition Software GmbH.
The news release refers to the deal as a “market powerhouse”. This is the type of phrase that gets me to push the goslings to the computer terminals to do some company monitoring.
It’s too early for me to make a call about the product line up the company will offer. Should be interesting. Some pundits will make an attempt to presage the future. Not this silly goose. The customers will decide, not the mavens.
Stephen Arnold, April 21, 2009
Google and Guha: The Semantic Steamroller
April 17, 2009
I hear quite a lot about semantic search. I try to provide some color on selected players. By now, you know that I recycle in this Web log, and this article is no exception. The difference is that few people pay much attention to patent documents. In general, these are less popular than a printed dead tree daily paper, but in my opinion quite a bit more exciting. But that’s what makes me an addled goose, and you a reader of free Web log posts.
You will want to snag a copy of US20090100036 from our ever efficient USPTO. Please, read the instructions for running a query on the USPTO system. I don’t provide free support for public facing, easy to use, elegant interfaces such as that available from the Federal government.
The “eyes” of Googzilla. From US20090100036, Figure 21, Cyrus, in case you want to see what your employer is doing these days.
The title of the document is “Methods and Systems for Classifying Search Results to Determine Page Elements” by a gaggle of Googlers, one of whom is Ramanathan Guha. If you read my Google Version 2.0 or the semantic white paper I wrote for Bear Stearns when it was respected and in business, you know that Dr. Guha is a bit of a superstar in my corner of the world. The founder of Epinions.com and a blue chip wizard with credentials (Semantic Web RDF, Babelfish, Open Directory, etc.) that will take away the puffery of newly minted search consultants, Dr. Guha invented, wrote up, and filed five major inventions. These five set forth the Programmable Search Engine. You will have to chase down one of my for fee writings to get more detail about how the PSE meshes with Google’s data management inventions. If you are IBM or Microsoft, you will remind me that patents are products and that Google is not doing anything particularly new. I love those old eight track tapes, don’t you?
The new invention is the work of Tania Bedrax-Weiss, Patrick Riley, Corin Anderson, and Ramanathan Guha. His name is spelled “Ramanthan” in the patent snippet I have. Fish & Richardson, Google’s go-to search patent attorney, may have submitted it correctly in October 2007, but it emerged from the USPTO on April 16, 2009, with the spelling error.
The application is a 33 page long document, which is beefy by Google’s standard. Google dearly loves brevity so the invention is pushing into Gone with the Wind length for the GOOG. The Fish & Richardson synopsis said:
This invention relates to determining page elements to display in response to a search. A method embodiment of this invention determines a page element based on a search result. The method includes: (1) determining a set of result classifications based on the search result, wherein each result classification includes a result category and a result score; and (2) determining the page element based on the set of result classifications. In this way, a classification is determined based on a search result and page elements are generated based on the classification. By using the search result, as opposed to just the query, page elements are generated that corresponds to a predominant interpretation of the user’s query within the search results. As result, the page elements may, in most cases, accurately reflect the user’s intent.
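For readers who think better in code than in patent prose, here is a small Python sketch of the two steps in the synopsis. It is an illustration built on my own assumptions, not the method in the application. The keyword lists, the element names, the scoring formula, and the `threshold` value are all hypothetical. Step one gives each search result a set of (category, score) classifications. Step two picks a page element from the combined classifications.

```python
from collections import defaultdict

# Hypothetical categories and the words that signal them.
CATEGORY_KEYWORDS = {
    "health": {"symptom", "treatment", "disease", "doctor"},
    "shopping": {"price", "buy", "store", "sale"},
}

# Hypothetical page element to show for each category.
PAGE_ELEMENTS = {
    "health": "health_info_panel",
    "shopping": "product_listing_box",
}


def classify_result(snippet):
    """Step 1: return a list of (category, score) for one search result."""
    words = set(snippet.lower().split())
    classifications = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits:
            classifications.append((category, hits / len(keywords)))
    return classifications


def choose_page_element(snippets, threshold=0.25):
    """Step 2: pick the element for the predominant category, if any."""
    totals = defaultdict(float)
    for snippet in snippets:
        for category, score in classify_result(snippet):
            totals[category] += score
    if not totals:
        return None
    best = max(totals, key=totals.get)
    average = totals[best] / len(snippets)
    return PAGE_ELEMENTS[best] if average >= threshold else None


if __name__ == "__main__":
    results = [
        "flu symptom and treatment guide",
        "find a doctor for flu treatment",
        "buy flu medicine at low price",
    ]
    print(choose_page_element(results))
```

The decision comes from the results, not from the query string. The query "flu" says nothing about intent, but two of the three results look like health content, so the health element wins. That is the "predominant interpretation" the synopsis mentions.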
Got that? If you did not, you are not alone. The invention makes sense in the context of a number of other Google technical initiatives ranging from the non hierarchical clustering methods to the data management innovations you can spot if you poke around Google Base. I noted classification refinement, snippets, and “signal” weighting. If you are in the health biz, you might want to check out the labels in the figures in the patent application. If you were at my lecture for Houston Wellness, I described some of Google’s health related activities.
On the surface, you may think, “Page parsing. No big deal.” You are not exactly right. Page parsing at Google scale, the method, and the scores complement Google’s “dossier” function about which Sue Feldman and I wrote in our September 2008 IDC client only report. This is IDC paper 213562.
What does a medical information publisher need with those human editors anyway?
Stephen Arnold, April 17, 2009
True Knowledge: Semantic Search System
April 16, 2009
A happy quack to the readers who sent me a link to this ZDNet Web log post called “True Knowledge API Lies at the Heart of Real Business Model” here. I had heard about True Knowledge, billed as The Internet Answer Engine, a while back, but I tucked away the information until a live system became available. I had heard that the computer scientist spark plug of True Knowledge (William Tunstall-Pedoe) has been working on the technology for about 10 years. The company’s Web site is www.trueknowledge.com, and it contains some useful information. You can sign up for a beta account, read Web log posts, and get some basic information about the system.
About one year ago, the Financial Times’s Web log here reported:
Another Semantic Web company looking for cash: William Tunstall-Pedoe of True Knowledge says he needs $10m in venture capital to back the next stage of his Cambridge (UK)-based company, which is trying to build a sort of “universal database” on the Web.
In April 2009, the company is raising its profile with an API that allows developers to make Web sites smarter.
Interface. © True Knowledge
The company said:
True Knowledge is a pioneer in a new class of Internet search technology that’s aimed at dramatically improving the experience of finding known facts on the Web. Our first service – the True Knowledge Answer Engine – is a major step toward fulfilling a longstanding Internet industry goal: providing consumers with instant answers to complex questions, with a single click.
The company’s proprietary technology allows a user to ask questions and get an answer. Quite a few companies have embraced the “semantic” approach to content processing. The reason is that traditional search engines require that the person with the question find the magic combination that delivers what’s needed. The research done by Martin White and my team, among others, makes clear that about two thirds of the users of a key word search system come away empty handed, annoyed, or both. True Knowledge and other semantic-centric vendors see significant opportunities to improve search and generate revenue.
Architecture block diagram. © True Knowledge
Paul Miller, the author of the ZDNet article, wrote:
True Knowledge is certainly interesting, and frequently impressive. It remains to be seen whether a Platform proposition will set them firmly on the road to riches, or if they’ll end up finding more success following the same route as Powerset and getting acquired by an existing (enterprise?) search provider.
ZDNet wrote a similar article in July 2007 here. Venture Beat mentioned True Knowledge here in July 2008 in a story that referenced Cuil.com (former Googlers) and Powerset (now part of Microsoft’s search cornucopia). Hakia.com was not mentioned even though at that time in 2008, Hakia.com was ramping up its PR efforts. Venture Beat mentioned Metaweb, another semantic start up that obtained $42 million in 2008, roughly eight times the funding of True Knowledge. (Metaweb’s product is Freebase, an open, shared database of the world’s information. More here.) You will want to read Venture Beat’s April 13, 2009, follow up story about True Knowledge here. This article contains an interesting influence diagram.
I don’t know enough about the appetite of investors for semantic search systems to offer an opinion. What I found interesting was:
- The company has roots in Cambridge University where computational approaches are much in favor. With Autonomy and Lemur Consulting working in the search sector, Cambridge is emerging as one of the hot spots in search.
- The language and word choice used to describe the system here reminded me of some Google research papers and the work of Jennifer Widom at Stanford University. If there are some similarities, True Knowledge may be more than a question answering system.
- The company received an infusion of $4.0 million in a second round of funding completed in mid 2008. Octopus Ventures provided an earlier injection of $1.2 million in 2007.
- The present push is to make the technology available to developers so that the semantic system can be “baked in” to other applications. The notion is a variant of that used in the early days of Verity’s OEM and developer push in the late 1980s. The API account is offered without charge.
- There’s a True Knowledge Facebook page here.
I recall seeing references to a private beta of the system. I can’t locate my notes from my 2007 trips to the UK, but I think that may have been the first time I heard about the system. I did locate a link to a demo video here, dated late 2007. That video explains that the information is represented in a way “that computers can understand”. I made a note to myself about this because this type of function in 2007 was embodied in the Guha inventions for the Google Programmable Search Engine.
The API allows systems to ask questions. The developer can formulate a query and see the result. Once the developer has the query refined, the True Knowledge system makes it easy for the developer to include the service in another application. The idea, I noted, was to make enterprise software systems smarter. The system performs reasoning and inference. The system generates answers and a reading list. The system can handle short queries, performing accurate disambiguation; that is, figuring out what the user meant. The system made it possible for a user to provide information to the system, in effect a Wikipedia type of function. The approach is a clever way for the user to teach the True Knowledge system.
RapidMiner: Open Source Data Mining
April 11, 2009
A happy quack to the reader who reminded me that Google Apps supports Java. If you are interested in data mining, you may want to catch up with RapidMiner, an open source data mining system. RapidMiner drinks Java, so you may want to think about ways to make use of Google Apps and RapidMiner. The person who wrote me wanted some information about this idea.
My April 2009 column for KMWorld talks about Google Apps, but I don’t have any information about hooking RapidMiner into Google Apps. In fact, I had not thought about it.
RapidMiner is “the world-wide leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks.” There is an enterprise version plus consulting services available.
You can download the RapidMiner community edition here. The documentation is quite good. You can snag a copy of those documents here. The community edition offers a number of features, and it is extensible. Here’s an example of a data output from RapidMiner:
You can find a useful discussion by Michael Wurst of the open source version at Nemoz.org here. This write up provides some useful examples that show one way to hook RapidMiner into a Java application. What is quite useful is the code sample for using the text classifier on a chunk of text. RapidMiner’s classification component is called RapidMinerTextClassifier.
There are some limitations to the Google Apps implementation of Java, but I think the person who wrote me has an interesting idea. The notion of combining sophisticated RapidMiner operations with Google Apps struck me as interesting. If you have any interesting examples of this type of hybridization, use the comments section of this Web log to pass along the information.
Stephen Arnold, April 11, 2009
Cirilab: Entity Extraction
April 6, 2009
I took a quick look at Cirilab in order to update my files about entity extraction vendors.
Cirilab develops practical search, retrieval and categorization software designed to increase organizational productivity by effectively harnessing key knowledge resources. Cirilab offers a range of advanced analysis and organization applications and tools.
I learned about the company when another consultant sent me links to several online demonstrations of Cirilab’s technology. I located an older but useful discussion of the Cirilab technology here. You can explore a Wikipedia entry about Winston Churchill here and a document navigator of Sir Winston’s writings here. The engine generating these demos is called the KGE or Knowledge Generation Engine. The idea is that KGE can process unstructured text and generate insights into that text.
Source: http://www.cirilab.com/TSMAP/Cirilab_Library/Literature/Winston_Churchill/WikiKMapPage/index.htm
The company’s enterprise solutions include vertical builds of the KGE:
- Publishing. The Web Ready Publishing service allows an organization to take unstructured data in WordPerfect, Word, Adobe PDF, HTML, and even Text files, and publish it in a Web Ready Publishing format so that it is instantly available to your customers in a thematically navigable format.
- Pharma. Cirilab can “read” the documents and therefore allow “mining” of existing data.
- Legal. KGE permits discovery of information.
- Security and intelligence. Cirilab products provide unique insights into this information not otherwise available.
The company offers a range of desktop products. These are excellent ways to learn about the features and functions of Cirilab’s KGE system.
More recently, Cirilab has succeeded in developing and bringing to market a core suite of technologies known as KOS (Knowledge Object Suite) based on its Multidimensional Semantic Spatial Indexing Technology.
You can register and receive a free, thematic map of your Web site. The company is located in Ottawa, Ontario. You can get more information here.
Stephen Arnold, April 6, 2009
Google Leximo Tie Up
April 2, 2009
Leximo is a social dictionary; specifically, “a Multilingual User Collaborated Dictionary that lets you search, discover and share your words with the World.” Google snapped up the company. You can read the Leximo manifesto here. One of the tenets is:
Open community-based and user-friendly functions promote participation, accountability and trust.
What’s Google need a dictionary for? In my opinion, the GOOG wants a flow of new words plus definitions to fatten up its existing knowledgebases. I am confident the idealism of Leximo will persist at the GOOG.
Stephen Arnold, April 2, 2009