
Featured

Taxonomy Turmoil: Good Enough May Be Too Much

For years, I have posted a public indexing Overflight. You can examine the selected outputs at this Overflight link. (My non public system is more robust, but the public service is a useful temperature gauge for a slice of the content processing sector.)

When it comes to indexing, most vendors provide keyword, concept tagging, and entity extraction. But are these tags spot on? No, most are good enough.

[Image: “good enough” cartoon]

A happy quack to Jackson Taylor for this “good enough” cartoon. The salesman makes it clear that good enough is indeed good enough in today’s marketing enabled world.

I chose about 50 companies that asserted their systems performed some type of indexing or taxonomy function. I learned that the taxonomy business is “about to explode.” I find that to be either an interesting investment tip or a statement that is characteristic of content processing optimists.

Like search and retrieval, plugging in “concepts” or other index terms is a utility function. For example, if one indexes each word in an article appearing in this blog, the article might be about another subject. For example, in this post, I am talking about Overflight, but the real topic is the broader use of metadata in information retrieval systems. I could assign the term “faceted navigation” to this article as a way to mark this article as germane to point and click navigation systems.
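The difference between indexing each word and assigning a concept term can be sketched in a few lines of Python. This is a toy illustration, not how Overflight or any vendor's system works: the stop word list, the trigger words, and the `keyword_index` and `concept_tags` function names are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical stop words and concept rules, invented for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and", "for", "on", "by"}
CONCEPT_RULES = {
    "faceted navigation": {"metadata", "taxonomy", "facet", "facets"},
}


def keyword_index(text):
    """Count every non-stop word: the 'index each word' approach."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def concept_tags(text):
    """Assign a concept term when any of its trigger words appears,
    even if the term itself never occurs in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(term for term, triggers in CONCEPT_RULES.items() if words & triggers)


sample = "Overflight reports on taxonomy vendors and the use of metadata in retrieval."
keywords = keyword_index(sample)
tags = concept_tags(sample)

print(keywords.most_common(3))  # the words that literally occur in the text
print(tags)                     # the assigned concept terms
```

The keyword index only knows the words on the page, so “faceted” never shows up in it. The concept rule attaches “faceted navigation” anyway, which is the kind of judgment a human indexer or a smarter system is supposed to supply.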

If you examine the “reports” Overflight outputs for each of the companies, you will discover several interesting things, as I did on February 28, 2015, when I assembled this short article.

  1. Mergers and purchases of failed vendors at fire sale prices are taking place. Examples include Lucidea’s purchase of Cuadra and InMagic. Both of these firms are anchored in traditional indexing methods and seemed to be within a revenue envelope until their sell out. Business Objects acquired Inxight and then SAP acquired Business Objects. Bouvet acquired Ontopia. Teradata acquired Revelytix.
  2. Moving indexing into open source. Thomson Reuters acquired ClearForest and made most of the technology available as OpenCalais. OpenText, a rollup outfit, acquired Nstein. SAS acquired Teragram. Smartlogic acquired Schemalogic. (A free report about Schemalogic is available at www.xenky.com/vendor-profiles.)
  3. A number of companies just failed, shut down, or went quiet. These include Active Classification, Arikus, Arity, Forth ICA, MaxThink, Millennium Engineering, Navigo, Progris, Protege, punkt.net, Questans, Quiver, Reuse Company, Sandpiper,
  4. The indexing sector includes a number of companies my non public system monitors; for example, the little known Data Harmony with six figure revenues after decades of selling really hard to traditional publishers. Conclusion: Indexing is a tough business to keep afloat.

There are numerous vendors who assert their systems perform indexing, entity, and metadata extraction. More than 18 of these companies are profiled in CyberOSINT, my new monograph. Oracle owns Triple Hop, RightNow, and Endeca. Each of these acquired companies performs indexing and metadata operations. Even the mashed potatoes search solution from Microsoft includes indexing tools. The proprietary XML data management vendor MarkLogic asserts that it performs indexing operations on content stored in its repository. Conclusion: More cyber oriented firms are likely to capture the juicy deals.

So what’s going on in the world of taxonomies? Several observations strike me as warranted:

First, none of the taxonomy vendors are huge outfits. I suppose one could argue that IBM’s Lucene based system is a billion dollar baby, but that’s marketing peyote, not reality. Perhaps MarkLogic, which is struggling toward $100 million in revenue, is the largest of this group. But the majority of the companies in the indexing business are small. Think in terms of a few hundred thousand in annual revenue to $10 million with generous accounting assumptions.

What’s clear to me is that indexing, like search, is a utility function. If a good enough search system delivers good enough indexing, then why spend for humans to slog through the content and make human judgments? Why not let Google funded Recorded Future identify entities, assign geo codes, and extract meaningful signals? Why not rely on Haystax or RedOwl or any one of the more agile firms to deliver higher value operations?

I would assert that taxonomies and indexing are important to those who desire the accuracy of a human indexed system. This assumes that the humans are subject matter specialists, the humans are not fatigued, and the humans can keep pace with the flow of changed and new content.

The reality is that companies focused on delivering old school solutions to today’s problems are likely to lose contracts to companies that deliver what the customer perceives as a higher value content processing solution.

What can a taxonomy company do to ignite its engines of growth? Based on the research we performed for CyberOSINT, the future belongs to those who embrace automated collection, analysis, and output methods. Users may, if they so choose, provide guidance to the system. But the days of yore, when monks with varying degrees of accuracy created catalog sheets for the scriptoria, have been washed to the margin of the data stream by today’s content flows.

What’s this mean for the folks who continue to pump money into taxonomy centric companies? Unless the cyber OSINT drum beat is heeded, the failure rate of the Overflight sample is a wake up call.

Buying Apple bonds might be a more prudent financial choice. On the other hand, there is an opportunity for taxonomy executives to become “experts” in content processing.

Stephen E Arnold, February 28, 2015

Interviews

Interview with Dave Hawking Offers Insight into Bing, FunnelBack and Enterprise Search

The article titled To Bing and Beyond on IDM provides an interview with Dave Hawking, an award-winner in the field of information retrieval and currently a Partner Architect for Bing. In the somewhat lengthy interview, Hawking answers questions on his own history, his work at Bing, natural language search, Watson, and Enterprise Search, among other things. At one point he describes how he arrived in the field of information retrieval after studying computer science at the Australian National University, where the first search engine he encountered was the library’s card catalogue. He says,

“I worked in a number of computer infrastructure support roles at ANU and by 1991 I was in charge of a couple of supercomputers…In order to do a good job of managing a large-scale parallel machine I thought I needed to write a parallel program so I built a kind of parallel grep… I wrote some papers about parallelising text retrieval on supercomputers but I pretty soon decided that text retrieval was more interesting.”

When asked about the challenges of Enterprise Search, Hawking went into detail about the complications that arise due to the “diversity of repositories” as well as issues with access controls. The importance of Hawking’s work in search technology can’t be overstated, from his contributions to the Text Retrieval Conferences, CSIRO, and FunnelBack to his academic achievements.

Chelsea Kerwin, December 09, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Latest News

Good News for Textbook Publishers

I read “Students Reject Digital textbooks.” Textbook publishers have embraced slicing and dicing with alacrity. The idea is that a new textbook or collection...

March 4, 2015

Silobreaker Forms Cyber Partnership with Norwich University

I learned that cyber OSINT capable Silobreaker has partnered with Norwich University. Norwich, the oldest private military college in the US, has a sterling reputation...

March 4, 2015

Opening Watson to the Masses

IBM is struggling financially, and one of the ways it hopes to pull itself out of the swamp is to find new applications for its supercomputers and software....

March 4, 2015

OpenText Innovation Tour

The post on the CEO Blog on OpenText titled Innovation Tour 2015 Kicks Off announces the 15 city tour from a company acquiring technology, not developing it. The...

March 4, 2015

Enterprise Search: Is Keyword Search a Lycra-Spandex Technology?

I read a series of LinkedIn posts about why search may be an enterprise application flop. To access the meanderings of those who believe search is a young Bruce...

March 3, 2015

Google Loon Looms

Which is grabbing more traffic: Facebook or stories about Google Loon? We know that Google has found social media a slippery fish. To address the lack of grippiness...

March 3, 2015

An Incomplete History of the Semantic Web

The article on the blog Realizing Semantic Web titled Semantic Web – Story So Far explores where exactly credit is due for the current state of Semantic Web technology....

March 3, 2015

Preview of SharePoint 2016 Available at Ignite Event

Customers were excited to hear that SharePoint 2016 would be unveiled this year, and even more excited to know that Microsoft is extending on-premises support. Now...

March 3, 2015

Google Street View: Priorities in Brazil Become Clearer

I used to live in the Cambui district of Campinas, Brazil. The street view of our home was snapped by Street View in 2011. Cambui is a dynamic area, but obviously...

March 2, 2015

Google: Just the Facts, Folks

Short honk: I read “Google Wants to Rank Websites Based on Facts Not Links.” The article could be a jumping off point for some dictionary excitement. The article...

March 2, 2015