CyberOSINT banner

Taxonomy Turmoil: Good Enough May Be Too Much

February 28, 2015

For years, I have posted a public indexing Overflight. You can examine the selected outputs at this Overflight link. (My non public system is more robust, but the public service is a useful temperature gauge for a slice of the content processing sector.)

When it comes to indexing, most vendors provide keyword, concept tagging, and entity extraction. But are these tags spot on? No, most are good enough.


A happy quack to Jackson Taylor for this “good enough” cartoon. The salesman makes it clear that good enough is indeed good enough in today’s marketing enabled world.

I chose about 50 companies that asserted their systems performed some type of indexing or taxonomy function. I learned that the taxonomy business is “about to explode.” I find that to be either an interesting investment tip or a statement that is characteristic of content processing optimists.

Like search and retrieval, plugging in “concepts” or other index terms is a utility function. For example, if one indexes each word in an article appearing in this blog, the article might be about another subject. For example, in this post, I am talking about Overflight, but the real topic is the broader use of metadata in information retrieval systems. I could assign the term “faceted navigation” to this article as a way to mark this article as germane to point and click navigation systems.

If you examine the “reports” Overflight outputs for each of the companies, you will discover several interesting things as I did on February 28, 2015 when I assembled this short article.

  1. Mergers or buying failed vendors at fire sale prices are taking places. Examples include Lucidea’s purchase of Cuadra and InMagic. Both of these firms are anchored in traditional indexing methods and seemed to be within a revenue envelope until their sell out. Business Objects acquired Inxight and then SAP acquired Business Objects. Bouvet acquired Ontopia. Teradata acquired Revelytix
  2. Moving indexing into open source. Thomson Reuters acquired ClearForest and made most of the technology available as OpenCalais. OpenText, a rollup outfit, acquired Nstein. SAS acquired Teragram. Smartlogic acquired Schemalogic. (A free report about Schemalogic is available at
  3. A number of companies just failed, shut down, or went quiet. These include Active Classification, Arikus, Arity, Forth ICA, MaxThink, Millennium Engineering, Navigo, Progris, Protege,, Questans, Quiver, Reuse Company, Sandpiper,
  4. The indexing sector includes a number of companies my non public system monitors; for example, the little known Data Harmony with six figure revenues after decades of selling really hard to traditional publishers. Conclusion: Indexing is a tough business to keep afloat.

There are numerous vendors who assert their systems perform indexing, entity, and metadata extraction. More than 18 of these companies are profiled in CyberOSINT, my new monograph. Oracle owns Triple Hop, RightNow, and Endeca. Each of these acquired companies performs indexing and metadata operations. Even the mashed potatoes search solution from Microsoft includes indexing tools. The proprietary XML data management vendor MarkLogic asserts that it performs indexing operations on content stored in its repository. Conclusion: More cyber oriented firms are likely to capture the juicy deals.

So what’s going on in the world of taxonomies? Several observations strike me as warranted:

First, none of the taxonomy vendors are huge outfits. I suppose one could argue that IBM’s Lucene based system is a billion dollar baby, but that’s marketing peyote, not reality. Perhaps MarkLogic which is struggling toward $100 million in revenue is the largest of this group. But the majority of the companies in the indexing business are small. Think in terms of a few hundred thousand in annual revenue to $10 million with generous accounting assumptions.

What’s clear to me is that indexing, like search, is a utility function. If a good enough search system delivers good enough indexing, then why spend for humans to slog through the content and make human judgments. Why not let Google funded Recorded Future identify entities, assign geo codes, and extract meaningful signals? Why not rely on Haystax or RedOwl or any one of more agile firms to deliver higher value operations.

I would assert that taxonomies and indexing are important to those who desire the accuracy of a human indexed system. This assumes that the humans are subject matter specialists, the humans are not fatigued, and the humans can keep pace with the flow of changed and new content.

The reality is that companies focused on delivering old school solutions to today’s problems are likely to lose contracts to companies that deliver what the customer perceives as a higher value content processing solution.

What can a taxonomy company do to ignite its engines of growth? Based on the research we performed for CyberOSINT, the future belongs to those who embrace automated collection, analysis, and output methods. Users may, if the user so chooses, provide guidance to the system. But the days of yore, when monks with varying degrees of accuracy created catalog sheets for the scriptoria have been washed to the margin of the data stream by today’s content flows.

What’s this mean for the folks who continue to pump money into taxonomy centric companies? Unless the cyber OSINT drum beat is heeded, the failure rate of the Overflight sample is a wake up call.

Buying Apple bonds might be a more prudent financial choice. On the other hand, there is an opportunity for taxonomy executives to become “experts” in content processing.

Stephen E Arnold, February 28, 2015

Partition the Web to Manage It

February 22, 2015

I noted that the mid February 2015 Forbes article did not get much coverage. “US Defense Giant Raytheon: We Need To Divide The Web To Secure It” contains a suggestion that could, if implemented, force changes upon Bing, Google, and other Web indexing outfits.

Here’s the passage I highlighted in lovely ice blue:

But some, including Michael Daly, chief technology officer for cyber security at US defense giant Raytheon, believe that the web needs to be divided into communities. As more critical devices, from insulin pumps to cars, connect to the internet, the more likely a genuinely destructive digital attack will occur. To stop this from happening, some people just shouldn’t be allowed into certain corners of the web, according to Daly.

There are some interesting implications in this notion.

Stephen E Arnold, February 22, 2015

Yes, Watson, an Accurate Description

February 22, 2015

I noted a post in the Atlantic. The article is “IBM’s Watson Memorized the Entire Urban Dictionary, Then His Overlords Had to Delete It.” I am not sure how a machine memorizes, but that is an issue for another day. The point is that Watson responded to one researcher’s query with a one word curse. Key point: Watson is supposed to do marvelous smart things. Without context, Watson is more likely to perform just like any other Lucene and home brew script system. Here’s the passage I noted because it underscores the time and expense of making a unicorn work just like a real animal:

Watson couldn’t distinguish between polite language and profanity — which the Urban Dictionary is full of. Watson picked up some bad habits from reading Wikipedia as well. In tests it even used the word “bullshit” in an answer to a researcher’s query. Ultimately, Brown’s 35-person team developed a filter to keep Watson from swearing and scraped the Urban Dictionary from its memory.

IBM, you never fail to entertain me.

Stephen E Arnold, February 22, 2015

Mondeca: Joey Erodes a Brand

February 8, 2015

Protecting the “name” of a company is important. I pointed out that several content processing vendors were losing control of their name in terms of a Bing or Google query. I noticed another vendor finding itself in the same pickle.

Navigate to YouTube. Run a query for Mondeca. What the query “mondeca” triggers is a spate of videos to Joey Mondeca.


Now try the Twitter search for the string “mondeca”. Here’s what I see:


Mondeca, the French smart content outfit, may want to turn its attention to dealing with Joey Mondeca’s social media presence. On the other hand, maybe findability is not a priority.

A Google query for “mondeca” returns links to the company. But Joey is moving up in the results list. When vendors lose control of a name as Brainware did, getting back that semantic traction is difficult. Augmentext has a solution.

Stephen E Arnold, February 9, 2015

Which Auto Classification Method Works?

February 1, 2015

Unfortunately A. Lancichinetti, et al do not deliver a consumer reports type analysis in “High Reproducibility and High Accuracy Method for Automated Topic Classification.” The paper does raise some issues that keyword search vendors with add on categorization do not often explain to licensees. In a nutshell, classification is often chugging along with 65 to 80 percent accuracy. Close enough for horseshoes, Gustav Dirichlet  is here. A couple of thoughts:

  • Dirichlet  died in 1822. Yep, another of the math guys from the past.
  • The method is described as “state of the art.”
  • Bounded content sets of sci tech information yield more accurate classification.

How do these methods work on WhatsApp messages in gang slang?

Stephen E Arnold, February 1, 2015

Zaizi: Search and Content Consulting

January 13, 2015

I received a call about Zaizi and the company’s search and content services. The firm’s Web site is at Based on the information in my files, the company appears to be open source centric and an integrator of Lucene/Solr solutions.

What’s interesting is that the company has embraced Mondeca/Smartlogic jargon; for example, content intelligence. I find the phrase interesting and an improvement over the Semantic Web lingo.

The idea is that via indexing, one can find and make use of content objects. I am okay with this concept; however, what’s being sold is indexing, entity extraction, and classification of content.

The issue facing Zaizi and the other content intelligence vendors is that “some” content intelligence and slightly “smarter” information access is not likely to generate the big bucks needed to compete.

Firms like BAE and Leidos as well as the Google/In-Tel-Q backed Recorded Future offer considerably more than indexing. The need is to process automatically, analyze automatically, and generate outputs automatically. The outputs are automatically shaped to meet the needs of one or more human consumers or one or more systems.

Think in terms of taking outputs of a next generation information access system and inputting the “discoveries” or “key items” into another system. The idea is that action can be taken automatically or provided to a human who can make a low risk, high probability decision quickly.

The notion that a 20 something is going to slog through facets, keyword search, and the mind numbing scan results-open documents-look for info approach is decidedly old fashioned.

You can learn more about what the next big thing in information access is by perusing CyberOSINT: Next Generation Information Access at

Stephen E Arnold, January 14, 2015

Google and Removed Links for Pirated Content

January 5, 2015

I read “Google Received 345 Million Pirate link Removal Requests in 2014.” In 2008, Google received 62 requests. In 2014, Google received requests to remove 345,169,134 links. As the article points out, that’s around a million links a day.

The notion of a vendor indexing “all” information is a specious one. More tricky is that one cannot find information if it is blocked from the public index. How will copyright owners find violators? Is there an index for Dark Net content?

My thought is that finding information today is more difficult than it was when I was in college. Sixty years of progress.

Stephen E Arnold, January 5, 2015

Faceted Search: From the 1990s to Forever and Ever

January 4, 2015

Keyword retrieval is useful. But it is not good for some tasks. In the mid 1990s, Endeca’s founders “invented” a better way. The name that will get you a letter from a lawyer is “guided navigation.” The patents make clear the computational procedure required to make facets work.

The more general name of the feature is “faceted navigation.” For those struggling with indexing, faceted navigation “exposes” the users to content options. This works well if the domain is reasonably stable, the corpus small, and the user knows generally what he or she needs.

To get a useful review of this approach to findability, check out “Faceted Navigation.” Now five years old, the write up will add logs to the fires of taxonomy. However, faceted search is not next generation information access. Faceted navigation is like a flintlock rifle used by Lewis and Clark. Just don’t try to kill any Big Data bears with the method. And Twitter outputs? Look elsewhere.

Stephen E Arnold, January 4, 2014

Finding Books: Not Much Has Changed

December 1, 2014

Three or four years ago I described what I called “the book findability” problem. The audience was a group of confident executives trying to squeeze money from an old school commercial database model. Here’s how the commercial databases worked in 1979.

  1. Take content from published sources
  2. Create a bibliographic record, write or edit the abstract included with the source document
  3. Index it with no more than three to six index terms
  4. Digitize the result
  5. Charge a commercial information utility to make it available
  6. Get a share of the revenues.

That worked well until the first Web browser showed up and individuals and institutions began making information available online. There are a number of companies that still use variations of this old school business model. Examples include newspapers that charge a Web browser user for access to content to outfits like LexisNexis, Ebsco, Cambridge Scientific Abstracts, and other outfits.


As libraries and individuals resist online fees, many of the old school outfits are going to have to come up with new business models. But adaptation will not be easy. Amazon is in the content business. Why buy a Cliff’s Notes-type summary when there are Amazon reviews? Why pay for news when a bit of sleuthing will turn up useful content from outfits like the United Nations or off the radar outfits like World News at Tech information is going through a bit of an author revolt. While not on the level of protests in Hong Kong, a lot of information that used to be available in research libraries or from old school database providers is available online. At some point, peer reviewed journals and their charge the author business models will have to reinvent themselves. Even recruitment services like LinkedIn offer useful business information via

One black hole concerns finding out what books are available online. A former intelligence officer with darned good research skills was not able to locate a copy of my The New Landscape of Search. You can find it here for free.

I read “Location, Location: GPS in the Medieval Library.” The use of coordinates to locate a book on a shelf or hanging from a wall anchored by a chain is not new to those who have fooled around with medieval manuscripts. Remember that I used to index medieval sermons in Latin as early as 1963.

What the write up triggered was the complete and utter failure of indexing services to make an attempt to locate, index, and provide a pointer to books regardless of form. The baloney about indexing “all” information is shown to be a toothless dragon. The failure of the Google method and the flaws of the Amazon, Library of Congress, and commercial database providers is evident.

Now back to the group of somewhat plump, red face confident wizards of commercial database expertise. The group found my suggestion laughable. No big deal. I try to avoid the Titanic type operations. I collected my check and hit the road.

There are domains of content that warrant better indexing. Books, unfortunately, is one set of content that makes me long for the approach that put knowledge in one place with a system that at least worked and could be supplemented by walking around and looking.

No such luck today.

Stephen E Arnold, December 1, 2014

More Metadata: Not Needed Metadata

November 21, 2014

I find the metadata hoo hah fascinating. Indexing has been around a long time. If one wants to dig into the complexities of metadata, you may find the table from helpful:


Mid tier consulting firms often do not use the products or systems their “experts” recommend. Consultants in indexing do create elaborate diagrams that make my eyes glaze over.

Some organizations generate metadata without considering what is required. As a result, outputs from the systems can present mind boggling complex options to the user. A report displaying multiple layers  of metadata can be difficult to understand.

My thought is that before giving the green light to promiscuous metadata generation, some analysis and planning may be useful. The time lost trying to figure out which metadata is relevant to a particular issue can be critical.

But consultants and vendors are indeed impressed with flashy graphics. Too many times no one has a clue what the graphics are trying to communicate. The worst offenders are companies that sell visual sizzle to senior managers. The goal is a gasp from the audience when the Hollywood style visualizations are presented. Pass the popcorn. Skip the understanding.

Stephen E Arnold, November 21, 2014

Next Page »