Evvie 2009 Winners: David Evans and Martin Baumgärtel

May 4, 2009

Stephen E. Arnold of ArnoldIT.com (http://www.arnoldit.com) announced the Evvie “best paper award” for 2009 at Infonortics’ Boston Search Engine Meeting on April 28.

The 2009 Evvie Award went to Dr. David Evans of Just Systems Evans Research for “E-Discovery: A Signature Challenge for Search.” The paper explains the principal goals and challenges of E-Discovery techniques. The second place award went to Martin Baumgärtel of bioRASI for “Advanced Visualization of Search Results: More Risks or More Chances?”, which addressed the gap between breakthroughs in visualization and actual application of techniques.


Stephen Arnold (left) is pictured with Dr. David Evans of Just Systems Evans Research (right).

The Evvie is given in honor of Ev Brenner, one of the leaders in online information systems. The award was established after Brenner’s death in 2006. Brenner had served on the program committee for the Boston Search Engine Meeting since its inception almost 20 years earlier. Everett Brenner is generally regarded as one of the “fathers” of commercial online databases. He worked for the American Petroleum Institute and served as a mentor to many of the innovators who built the commercial online industry.


Martin Baumgärtel (left) and Dr. David Evans discuss their recognition at the 2009 Boston Search Engine Meeting.

Mr. Brenner had two characteristics that made his participation a signature feature of each year’s program: he was willing to tell a speaker or paper author to “add more content,” and after a presentation, he would ask the presenter one or more penetrating questions that made a complex subject clearer.

The Boston Search Engine Meeting, held each year in Boston, attracts search professionals, search vendors, and experts interested in content processing, text analysis, and search and retrieval. Ev, as he was known to his friends, demanded excellence in presentations about information processing.

Sponsored by Stephen E. Arnold (ArnoldIT.com), this award goes to the speaker who best exemplifies Ev’s standards of excellence. The selection committee consists of the program committee, assisted by Harry Collier (conference operator) and Stephen E. Arnold.

This year’s judges were Jill O’Neill (NFAIS), Sue Feldman (IDC Content Technologies Group), and Anne Girard (Infonortics Ltd.).

Mr. Arnold said, “This award is one way for us to respect his contributions and support his lifelong commitment to excellence.”

The recipients receive a cash prize and an engraved plaque. Information about the conference is available on the Infonortics, Ltd. Web site at www.infonortics.com and here. More information about the award is here. Information about ArnoldIT.com is here.

The Beeb and Alpha

April 30, 2009

I am delighted that the BBC, the once noncommercial entity, has a new horse to ride. I must admit that when I think of the UK and a horse to ride, my mind echoes with the sound of Ms. Sperling saying, “Into the valley of death rode the 600”. The story here carries a title worthy of the Google-phobic Guardian newspaper: “Web Tool As Important as Google.” The subject is the Wolfram Alpha information system, which is “the brainchild of British-born physicist Stephen Wolfram”.

Wolfram Alpha is a new content processing and information system that uses a “computational knowledge engine”. There are quite a few new search and information processing systems. In fact, I mentioned two of these in recent Web log posts: NetBase here and Veratect here.


Can Wolfram Alpha or another search start-up Taser the Google?

My reading of the BBC story turned up a hint that Wolfram Alpha may have a bit of “fluff” sticking to its ones and zeros. Nevertheless, I sensed a bit of glee that Google is likely to face a challenge from a math-centric system.

Now let’s step back:

First, I have no doubt that the Wolfram Alpha system will deliver useful results. Not only does Dr. Wolfram have impeccable credentials, but he is also letting math do the heavy lifting. The problem with most NLP and semantic systems is that humans are usually needed to figure out certain things about the “meaning” of and in information. Like Google, Dr. Wolfram lets the software machines grind away.

Second, in order to pull off an upset of Google, Wolfram Alpha will need some ramp-up momentum. Think of the search system as a big airplane. The commercial version of the big airplane has to be built, made reliable, and then supported. Once that’s done, the beast has to taxi down a big runway, build up speed, and then get aloft. Once aloft, the airplane must operate and then get back to the ground for fuel, upgrades, etc. The Wolfram Alpha system is in its early stages.

Third, Google poses a practical problem for Wolfram Alpha and for Microsoft, Yahoo, and the others in the public search space. Google keeps doing new things. In fact, Google doesn’t have to do big things. Incremental changes are fine. Cumulatively these increase Google’s lead or its “magnetism”, if you will. So competitors are going to have to find a way to leapfrog Google. I don’t think any of the present systems have the legs for this jump, including Wolfram Alpha, because it is not yet a commercial-grade offering. When it is, I will reassess my present view. What competitors are doing is repositioning themselves away from Google. Instead of getting sand kicked in one’s face on the beach, the competitors are swimming in the pool at the country club. Specialization makes it easier to avoid Googzilla’s hot breath.

To wrap up, I hope Wolfram Alpha goes commercial quickly. I want access to its functions and features. Before that happens, I think that the Beeb and other publishing outfits will be rooting for the next big thing in the hope that one of these wizards can Taser the Google. For now, the Tasers are running on a partial charge. The GOOG does not feel them.

Stephen Arnold, May 1, 2009

NetBase and Content Intelligence

April 30, 2009

Vertical search is alive and well. Technology Review described NetBase’s Content Intelligence here. The story, written by Erica Naone, was “A Smarter Search for What Ails You”. Ms. Naone wrote:

… organizes searchable content by analyzing sentence structure in a novel way. The company created a demonstration of the platform that searches through health-related information. When a user enters the name of a disease, he or she is most interested in common causes, symptoms, and treatments, and in finding doctors who specialize in treating it, says Netbase CEO and cofounder Jonathan Spier. So the company’s new software doesn’t simply return a list of documents that reference the disease, as most search engines would. Instead, it presents the user with answers to common questions. For example, it shows a list of treatments and excerpts from documents that discuss those treatments. The Content Intelligence platform is not intended as a stand-alone search engine, Spier explains. Instead, Netbase hopes to sell it to companies that want to enhance the quality of their results.

NetBase (formerly Accelovation) has developed a natural language processing system. Ms. Naone reported:

NetBase’s software focuses on recognizing phrases that describe the connections between important words. For example, when the system looks for treatments, it might search for phrases such as “reduce the risk of” instead of the name of a particular drug. Tellefson notes that this isn’t a matter of simply listing instances of this phrase, rather catching phrases with an equivalent meaning. Netbase’s system uses these phrases to understand the relationship between parts of the sentence.
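To make the idea concrete, here is a minimal sketch of phrase-equivalence spotting in Java. Everything in it — the pattern list, the class name, the example sentence — is my own invention for illustration; it is not NetBase code, and NetBase parses sentence structure rather than matching surface strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy relation spotter: several surface phrases are treated as equivalent
// signals of a "treats" relationship. Invented for illustration only.
public class TreatmentRelationSpotter {

    private static final Pattern TREATS = Pattern.compile(
        "(reduces? the risk of|is used to treat|alleviates|relieves)\\s+(\\w[\\w ]*)",
        Pattern.CASE_INSENSITIVE);

    // Return the text that follows any treatment phrase in the sentence.
    public static List<String> findTreatedConditions(String sentence) {
        List<String> conditions = new ArrayList<>();
        Matcher m = TREATS.matcher(sentence);
        while (m.find()) {
            conditions.add(m.group(2).trim());
        }
        return conditions;
    }

    public static void main(String[] args) {
        String s = "Aspirin reduces the risk of heart attack in some patients.";
        System.out.println(findTreatedConditions(s));
        // Prints: [heart attack in some patients]
    }
}
```

The alternation in the pattern is the whole point: “reduces the risk of” and “is used to treat” collapse into one relation. Note that a real system would also have to trim the object (“heart attack”, not “heart attack in some patients”), which is exactly where parsing sentence structure beats regular expressions.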

At this point in the write-up, I heard echoes of other vendors offering NLP, semantics, bound phrase identification, etc. Elsevier has embraced the system for its illumin8 service. You can obtain more information about this Elsevier service here. Illumin8 asked me, “What if you could become an expert in any topic in a few minutes?” Wow!

The NetBase explanation of content intelligence is:

… understanding the actual “meaning” of sentences independent of custom lexicons. It is designed to handle myriads of syntactical sentence structures – even ungrammatical ones – and convert them to logical form. Content Intelligence creates structured semantic indexes from massive volumes of content (billions of web-pages and documents) used to power question-and-answer type of search experiences.

NetBase asserts:

Because NetBase doesn’t rely on custom taxonomies, manual annotations or coding, the solutions are fully automated, massively scalable and able to be rolled-out in weeks with a minimal amount of effort. NetBase’s semantic index is easy to keep up-to-date since no human editing or updates to controlled vocabulary are needed to capture and index new information – even when it includes new technical terms.
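Restated without the marketing gloss: parse sentences once into relational facts, index the facts, and answer questions by lookup instead of returning documents. A toy fact index along those lines appears below; the data structure and all the names in it are mine, not NetBase’s, whose actual structures are not public.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy "semantic index": (subject, relation, object) facts stored at index
// time, queried at question time. Invented for illustration only.
public class FactIndex {

    // relation -> subject -> objects, e.g. "treats" -> "aspirin" -> {"fever"}
    private final Map<String, Map<String, Set<String>>> facts = new HashMap<>();

    public void add(String subject, String relation, String object) {
        facts.computeIfAbsent(relation, r -> new HashMap<>())
             .computeIfAbsent(subject, s -> new HashSet<>())
             .add(object);
    }

    // "What does aspirin treat?" becomes a direct lookup, not a document search.
    public Set<String> query(String subject, String relation) {
        return facts.getOrDefault(relation, Map.of())
                    .getOrDefault(subject, Set.of());
    }

    public static void main(String[] args) {
        FactIndex index = new FactIndex();
        index.add("aspirin", "treats", "headache"); // facts an extractor might emit
        index.add("aspirin", "treats", "fever");
        System.out.println(index.query("aspirin", "treats"));
        // Prints: [headache, fever] (in no guaranteed order)
    }
}
```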

Let me offer several observations:

  • The application of NLP to content is not new, and it imposes some computational burdens on the search system. To minimize those loads, NLP is often constrained to content with a restricted terminology; for example, medicine, engineering, etc. Even with a narrow focus, NLP remains interesting.
  • “Loose” NLP can squirm around some of the brute force challenges, but it is not yet clear whether NLP methods are ready for center stage. Sophisticated content processing often works best out of sight, delivering to the user delightful, useful ways to obtain needed information.
  • A number of NLP systems are available today; for example, Hakia. Microsoft snapped up Powerset. One can argue that some of the Inxight technology, acquired first by Business Objects and then by the software giant SAP, is NLP technology. To my knowledge, none of these has scored a hat trick in revenue, customer uptake, and high-volume content processing.

You can get more information about NetBase here. You can find demonstrations and screenshots. A good place to start is here. According to TechCrunch:

NetBase has been around for a while. Originally called Accelovation, it has raised $9 million in two rounds of venture funding over the past four years, has 30 employees…

In my files, I had noted that the funding sources included Altos Ventures and Thomvest, but these data may be stale or just plain wrong. I don’t have enough information about NetBase to offer substantive comments. NLP requires significant computing horsepower, and I need to know more about the plumbing. Technology Review provided the sizzle. Now we need to know about the cow from which the prime rib comes.

Stephen Arnold, April 30, 2009

Google Base Tip

April 23, 2009

Google Base is not widely known among the suits who prowl up and down Madison Avenue. For those who are familiar with Google Base, the system is a portent of Googzilla’s data management capabilities. You can explore the system here. Ryan Frank’s “Optimizing Your Google Base Feeds” here provides some useful information for those who have discovered that Google Base is a tool for Google employment ads, real estate, and other types of structured information. Mr. Frank wrote:

It is also important to note that Google Base uses the information from Base listings for more than just Google OneBox results. This data may also be displayed in Google Product Search (previously Froogle), organic search results, Google Maps, Google Image Search and more. That adds up to a variety of exposure your site could potentially receive from a single Google Base listing.

Interesting, right? Read the rest of his post for some useful information about this Google service.

Stephen Arnold, April 23, 2009

Personalized Network Searching: Google after People Search

April 22, 2009

The hounds of the Internet are chasing Google’s “Search for Me on Google”. I can’t add to that outpouring of insight about technology that is exciting today but dated by Google time standards. I can, however, direct your attention to US 7,523,096, “Methods and Systems for Personalized Network Searching.” You can download this patent from the USPTO. The document was published on April 21, 2009, and was filed on December 3, 2003. You may want to read the background of the invention and scan the claims. The diagrams are standard Google fare, leaving much to the reader, who must bring an understanding of other Google subsystems to the analysis. To put the “Search for Me” discussion into context, here’s the abstract for the granted patent, now almost six years old:

Systems and methods for personalized network searching are described. A search engine implements a method comprising receiving a search query, determining a personalized result by searching a personalized search object using the search query, determining a general result by searching a general search object using the search query, and providing a search result for the search query based at least in part on the personalized result and the general result. The search engine may utilize ratings or annotations associated with the previously identified uniform resource locator to locate and sort results.
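Strip away the patent prose and the claim describes blending two ranked lists: one from a general index, one from the user’s own rated or annotated URLs. Here is a toy illustration of that flow in Java. The stand-in search functions, scores, and weighting scheme are my assumptions; the patent does not prescribe any of them.

```java
import java.util.*;

// Toy illustration of the claimed flow: run the query against a personalized
// object and a general object, then blend. All scoring here is invented.
public class PersonalizedBlend {

    static Map<String, Double> searchGeneral(String query) {
        // Stand-in for the general index: URL -> relevance score.
        return Map.of("example.com/a", 0.9, "example.com/b", 0.7);
    }

    static Map<String, Double> searchPersonalized(String query, String user) {
        // Stand-in for the user's rated/annotated URLs.
        return Map.of("example.com/b", 1.0);
    }

    static List<String> blend(String query, String user, double personalWeight) {
        Map<String, Double> scores = new HashMap<>(searchGeneral(query));
        // Boost anything the personalized object also returned.
        searchPersonalized(query, user).forEach((url, s) ->
            scores.merge(url, personalWeight * s, Double::sum));
        return scores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(blend("search", "someuser", 0.5));
        // Prints: [example.com/b, example.com/a] -- the rated URL rises to the top.
    }
}
```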

This is an important invention attributed to Stephen Lawrence and Greg Badros. Both have made substantive contributions to Google in the past. You may want to examine the current people search and then check out the dossier invention about which I have written elsewhere. Some interesting enhancements to the core dossier technology are coming in the future. My assertion is that Google moves slowly. When these “innovations” roll out, some are surprised. The GOOG leaves big footprints, in my experience. Where’s Pathfinder when one needs him?

Stephen Arnold, April 22, 2009

GEFCO and Exalead Win International Prize for Innovation

April 21, 2009

Congratulations to GEFCO, and by extension Exalead, for winning the Grand Prix et Trophée de l’innovation prize in recognition of innovation in business information management. The trophy was presented on April 7, 2009, by CIO-online.com, Le Monde Informatique, and IT News Info. There’s a video of the awards at http://www.trophees-cio.com/ and a PDF profile of the winners and projects at CIO Online.

A leading European provider of vehicle transport, logistics, and other transportation services, GEFCO earned its award thanks to Exalead, a leader in search-based business application solutions and information access in the enterprise and on the Web. GEFCO won the CIO-online.com trophy for its new vehicle track-and-trace service built on the Exalead CloudView platform. (You can read about CloudView here.)

GEFCO uses Exalead CloudView to drive a search-based application engine and real-time operational tools for reporting, query, and analysis of its databases covering vehicles delivered, logistics, and spare parts management.

ArnoldIT.com interviewed Paul Doscher, U.S. CEO of Exalead, in January 2009, and Mr. Doscher spoke of their partnership with GEFCO then. He stated:

GEFCO is using Exalead to track their vehicles. GEFCO’s new ‘Track and Trace’ application is built upon Exalead’s flagship platform that offers powerful search functionality and can provide up-to-the-minute information from an extremely large data set.

You can read the entire interview on the Search Wizards Speak service here.

Jessica Bratcher, April 21, 2009

Semantic Roll Up: The Effect of Financial Compression

April 21, 2009

A flurry of emails arrived today about the tie-up among several companies with good reputations but profiles lower than those enjoyed by Autonomy and Endeca. You can read the official news announcement here about the deal among Attensity, Empolis GmbH, and Living-e AG. The conflation is called The Attensity Group. Here’s a snapshot of each company based on the information I ratted out of my files in the midst of new carpet, painting, and hanging new boxer dog pictures:

  • Attensity. Deep text processing. Started in the intel community. Probed marketing. Acted as ringmaster for the tie-up.
  • Empolis GmbH. (Link was dead when I checked it on April 20, 2009.) A distribution and archiving system and file-based content transformation. Orphaned after parent Bertelsmann faced up to the realities facing the dead-tree crowd. Now positions itself in knowledge management.
  • Living-e AG. Provides software products that enable efficient information exchange. Web content management, behavior analysis. Founded in 2003 as WebEdition Software GmbH.

The news release refers to the deal as a “market powerhouse”. This is the type of phrase that gets me to push the goslings to the computer terminals to do some company monitoring.

It’s too early for me to make a call about the product lineup the company will offer. Should be interesting. Some pundits will make an attempt to presage the future. Not this silly goose. The customers will decide, not the mavens.

Stephen Arnold, April 21, 2009

Google and Guha: The Semantic Steamroller

April 17, 2009

I hear quite a lot about semantic search. I try to provide some color on selected players. By now, you know that I recycle in this Web log, and this article is no exception. The difference is that few people pay much attention to patent documents. In general, these are less popular than a printed dead-tree daily paper, but in my opinion quite a bit more exciting. But that’s what makes me an addled goose, and you a reader of free Web log posts.

You will want to snag a copy of US20090100036 from our ever-efficient USPTO. Please read the instructions for running a query on the USPTO system. I don’t provide free support for public-facing, easy-to-use, elegant interfaces such as the one available from the Federal government.


The “eyes” of Googzilla. From US20090100036, Figure 21, Cyrus, in case you want to see what your employer is doing these days.

The title of the document is “Methods and Systems for Classifying Search Results to Determine Page Elements” by a gaggle of Googlers, one of whom is Ramanathan Guha. If you read my Google Version 2.0 or the semantic white paper I wrote for Bear Stearns when it was respected and in business, you know that Dr. Guha is a bit of a superstar in my corner of the world. The founder of Epinions.com and a blue-chip wizard with credentials (Semantic Web RDF, Babelfish, Open Directory, etc.) that will take away the puffery of newly minted search consultants, Dr. Guha invented, wrote up, and filed five major inventions. These five set forth the Programmable Search Engine. You will have to chase down one of my for-fee writings to get more detail about how the PSE meshes with Google’s data management inventions. If you are IBM or Microsoft, you will remind me that patents are not products and that Google is not doing anything particularly new. I love those old eight-track tapes, don’t you?

The new invention is the work of Tania Bedrax-Weiss, Patrick Riley, Corin Anderson, and Ramanathan Guha. His name is spelled “Ramanthan” in the patent snippet I have. Fish & Richardson, Google’s go-to search patent attorney, may have submitted it correctly in October 2007, but it emerged from the USPTO on April 16, 2009, with the spelling error.

The application is a 33-page document, which is beefy by Google’s standards. Google dearly loves brevity, so the invention is pushing into Gone with the Wind length for the GOOG. The Fish & Richardson synopsis said:

This invention relates to determining page elements to display in response to a search. A method embodiment of this invention determines a page element based on a search result. The method includes: (1) determining a set of result classifications based on the search result, wherein each result classification includes a result category and a result score; and (2) determining the page element based on the set of result classifications. In this way, a classification is determined based on a search result and page elements are generated based on the classification. By using the search result, as opposed to just the query, page elements are generated that correspond to a predominant interpretation of the user’s query within the search results. As a result, the page elements may, in most cases, accurately reflect the user’s intent.

Got that? If you did not, you are not alone. The invention makes sense in the context of a number of other Google technical initiatives, ranging from the non-hierarchical clustering methods to the data management innovations you can spot if you poke around Google Base. I noted classification refinement, snippets, and “signal” weighting. If you are in the health business, you might want to check out the labels in the figures in the patent application. I described some of Google’s health-related activities in my lecture for Houston Wellness.
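For those who want the mechanism without the patent prose, here is a hedged sketch of the two-step method the synopsis describes: score categories over the result set, then let the dominant category pick the page furniture. The categories, scores, and element names are invented for illustration, and Google’s actual system surely uses far more signals.

```java
import java.util.*;

// Sketch of the synopsis's two steps: (1) aggregate per-result classifications
// into category totals, (2) map the dominant category to a page element.
// All category and element names are invented for illustration.
public class PageElementChooser {

    record Classification(String category, double score) {}

    // Step 1: total the scores for each result category.
    static Map<String, Double> scoreCategories(List<Classification> perResult) {
        Map<String, Double> totals = new HashMap<>();
        for (Classification c : perResult) {
            totals.merge(c.category(), c.score(), Double::sum);
        }
        return totals;
    }

    // Step 2: choose a page element from the top-scoring category.
    static String chooseElement(Map<String, Double> totals) {
        String top = Collections.max(totals.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        return switch (top) {
            case "health"  -> "symptom-and-treatment panel";
            case "local"   -> "map onebox";
            case "product" -> "shopping results block";
            default        -> "plain result list";
        };
    }

    public static void main(String[] args) {
        List<Classification> results = List.of(
            new Classification("health", 0.8),
            new Classification("health", 0.6),
            new Classification("product", 0.3));
        System.out.println(chooseElement(scoreCategories(results)));
        // Prints: symptom-and-treatment panel
    }
}
```

The point of classifying the results rather than the query alone: an ambiguous query inherits the predominant interpretation of what actually came back.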

On the surface, you may think, “Page parsing. No big deal.” You are not exactly right. Page parsing at Google scale, the method, and the scores complement Google’s “dossier” function, about which Sue Feldman and I wrote in our September 2008 IDC client-only report. This is IDC paper 213562.

What does a medical information publisher need with those human editors anyway?

Stephen Arnold, April 17, 2009

True Knowledge: Semantic Search System

April 16, 2009

A happy quack to the readers who sent me a link to this ZDNet Web log post called “True Knowledge API Lies at the Heart of Real Business Model” here. I had heard about True Knowledge — The Internet Answer Engine — a while back, but I tucked away the information until a live system became available. I had heard that the computer scientist spark plug of True Knowledge (William Tunstall-Pedoe) has been working on the technology for about 10 years. The company’s Web site is www.trueknowledge.com, and it contains some useful information. You can sign up for a beta account, read Web log posts, and get some basic information about the system.

About one year ago, the Financial Times’s Web log here reported:

Another Semantic Web company looking for cash: William Tunstall-Pedoe of True Knowledge says he needs $10m in venture capital to back the next stage of his Cambridge (UK)-based company, which is trying to build a sort of “universal database” on the Web.

In April 2009, the company is raising its profile with an API that allows developers to make Web sites smarter.


Interface. © True Knowledge

The company said:

True Knowledge is a pioneer in a new class of Internet search technology that’s aimed at dramatically improving the experience of finding known facts on the Web. Our first service – the True Knowledge Answer Engine – is a major step toward fulfilling a longstanding Internet industry goal: providing consumers with instant answers to complex questions, with a single click.

The company’s proprietary technology allows a user to ask questions and get an answer. Quite a few companies have embraced the “semantic” approach to content processing. The reason is that traditional search engines require the person with the question to find the magic combination of words that delivers what’s needed. The research done by Martin White and my team, among others, makes clear that about two-thirds of the users of a keyword search system come away empty-handed, annoyed, or both. True Knowledge and other semantic-centric vendors see significant opportunities to improve search and generate revenue.


Architecture block diagram. © True Knowledge

Paul Miller, the author of the ZDNet article, wrote:

True Knowledge is certainly interesting, and frequently impressive. It remains to be seen whether a Platform proposition will set them firmly on the road to riches, or if they’ll end up finding more success following the same route as Powerset and getting acquired by an existing (enterprise?) search provider.

ZDNet wrote a similar article in July 2007 here. Venture Beat mentioned True Knowledge here in July 2008 in a story that referenced Cuil.com (former Googlers) and Powerset (now part of Microsoft’s search cornucopia). Hakia.com was not mentioned even though at that time Hakia.com was ramping up its PR efforts. Venture Beat also mentioned Metaweb, another semantic start-up, which obtained $42 million in 2008, roughly eight times the funding of True Knowledge. (Metaweb’s product is Freebase, an open, shared database of the world’s information. More here.) You will want to read Venture Beat’s April 13, 2009, follow-up story about True Knowledge here. This article contains an interesting influence diagram.

I don’t know enough about the appetite of investors for semantic search systems to offer an opinion. What I found interesting was:

  • The company has roots in Cambridge University, where computational approaches are much in favor. With Autonomy and Lemur Consulting working in the search sector, Cambridge is emerging as one of the hot spots in search.
  • The language and word choice used to describe the system here reminded me of some Google research papers and the work of Jennifer Widom at Stanford University. If there are some similarities, True Knowledge may be more than a question answering system.
  • The company received an infusion of $4.0 million in a second round of funding completed in mid-2008. Octopus Ventures provided an earlier injection of $1.2 million in 2007.
  • The present push is to make the technology available to developers so that the semantic system can be “baked in” to other applications. The notion is a variant of that used in the early days of Verity’s OEM and developer push in the late 1980s. The API account is offered without charge.
  • There’s a True Knowledge Facebook page here.

I recall seeing references to a private beta of the system. I can’t locate my notes from my 2007 trips to the UK, but I think that may have been the first time I heard about the system. I did locate a link to a demo video here, dated late 2007. That video explains that the information is represented in a way “that computers can understand”. I made a note to myself about this because this type of function in 2007 was embodied in the Guha inventions for the Google Programmable Search Engine.

The API allows systems to ask questions. The developer can formulate a query and see the result. Once the developer has the query refined, the True Knowledge system makes it easy to include the service in another application. The idea, I noted, is to make enterprise software systems smarter. The system performs reasoning and inference. It generates answers and a reading list. It can handle short queries, performing accurate disambiguation; that is, figuring out what the user meant. It also makes it possible for a user to provide information to the system, in effect a Wikipedia type of function. That approach is a clever way for the user to teach the True Knowledge system.
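I have not seen the API documentation, so the sketch below shows only the general shape of a question-answering call from Java. The host name, path, and parameter names are placeholders, not True Knowledge’s documented interface; developers should work from the company’s own developer pages.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

// Hypothetical client for a question-answering API of this kind. The host,
// path, and parameter names are placeholders, NOT True Knowledge's
// documented interface.
public class AnswerEngineClient {

    public static String ask(String apiKey, String question) throws Exception {
        String url = "https://api.example.com/ask"          // placeholder host
                + "?key=" + URLEncoder.encode(apiKey, "UTF-8")
                + "&q=" + URLEncoder.encode(question, "UTF-8");
        URLConnection conn = new URL(url).openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString(); // raw answer payload (XML or JSON)
    }

    public static void main(String[] args) throws Exception {
        System.out.println(ask("demo-key", "How tall is the Eiffel Tower?"));
    }
}
```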

RapidMiner: Open Source Data Mining

April 11, 2009

A happy quack to the reader who reminded me that Google Apps supports Java. If you are interested in data mining, you may want to catch up with RapidMiner, an open source data mining system. RapidMiner drinks Java, so you may want to think about ways to make use of Google Apps and RapidMiner. The person who wrote me wanted some information about this idea.

My April 2009 column for KMWorld talks about Google Apps, but I don’t have any information about hooking RapidMiner into Google Apps. In fact, I had not thought about it.

RapidMiner is “the world-wide leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks.” There is an enterprise version plus consulting services available.

You can download the RapidMiner community edition here. The documentation is quite good. You can snag a copy of those documents here. The community edition offers a number of features, and it is extensible. Here’s an example of a data output from RapidMiner:

[Screenshot: sample RapidMiner output]

You can find a useful discussion by Michael Wurst of the open source version at Nemoz.org here. This write-up provides some useful examples that show one way to hook RapidMiner into a Java application. What is quite useful is the code sample for using the text classifier on a chunk of text. RapidMiner’s classification component is called RapidMinerTextClassifier.
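Wurst’s example wires the classifier straight into application code. The general embedding pattern RapidMiner documents is to initialize the library and run a process definition saved from the GUI. A minimal sketch follows, with class and method names as I recall them from the RapidMiner developer materials — verify them against the release you download, and note that the process file name is hypothetical.

```java
import java.io.File;
import com.rapidminer.Process;
import com.rapidminer.RapidMiner;
import com.rapidminer.operator.IOContainer;

// Minimal embedding sketch. Class and method names follow my reading of the
// RapidMiner developer materials and are NOT verified against a specific
// release; check the documentation for the version you install.
public class RapidMinerEmbed {
    public static void main(String[] args) throws Exception {
        RapidMiner.init();                               // boot the library
        // Load a process built in the RapidMiner GUI, e.g. a text
        // classification pipeline saved as classify.xml (hypothetical file).
        Process process = new Process(new File("classify.xml"));
        IOContainer results = process.run();             // execute the pipeline
        System.out.println(results);                     // inspect the outputs
    }
}
```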

There are some limitations to the Google Apps implementation of Java, but I think the person who wrote me has an interesting idea. The notion of combining sophisticated RapidMiner operations with Google Apps strikes me as worth exploring. If you have any interesting examples of this type of hybridization, use the comments section of this Web log to pass along the information.

Stephen Arnold, April 11, 2009
