Zaizi: Search and Content Consulting

January 13, 2015

I received a call about Zaizi and the company’s search and content services. The firm’s Web site is at Based on the information in my files, the company appears to be open source centric and an integrator of Lucene/Solr solutions.

What’s interesting is that the company has embraced Mondeca/Smartlogic jargon; for example, content intelligence. I find the phrase interesting and an improvement over the Semantic Web lingo.

The idea is that via indexing, one can find and make use of content objects. I am okay with this concept; however, what’s being sold is indexing, entity extraction, and classification of content.

The issue facing Zaizi and the other content intelligence vendors is that “some” content intelligence and slightly “smarter” information access is not likely to generate the big bucks needed to compete.

Firms like BAE and Leidos as well as the Google/In-Tel-Q backed Recorded Future offer considerably more than indexing. The need is to process automatically, analyze automatically, and generate outputs automatically. The outputs are automatically shaped to meet the needs of one or more human consumers or one or more systems.

Think in terms of taking outputs of a next generation information access system and inputting the “discoveries” or “key items” into another system. The idea is that action can be taken automatically or provided to a human who can make a low risk, high probability decision quickly.

The notion that a 20 something is going to slog through facets, keyword search, and the mind numbing scan results-open documents-look for info approach is decidedly old fashioned.

You can learn more about what the next big thing in information access is by perusing CyberOSINT: Next Generation Information Access at

Stephen E Arnold, January 14, 2015

Google and Removed Links for Pirated Content

January 5, 2015

I read “Google Received 345 Million Pirate link Removal Requests in 2014.” In 2008, Google received 62 requests. In 2014, Google received requests to remove 345,169,134 links. As the article points out, that’s around a million links a day.

The notion of a vendor indexing “all” information is a specious one. More tricky is that one cannot find information if it is blocked from the public index. How will copyright owners find violators? Is there an index for Dark Net content?

My thought is that finding information today is more difficult than it was when I was in college. Sixty years of progress.

Stephen E Arnold, January 5, 2015

Faceted Search: From the 1990s to Forever and Ever

January 4, 2015

Keyword retrieval is useful. But it is not good for some tasks. In the mid 1990s, Endeca’s founders “invented” a better way. The name that will get you a letter from a lawyer is “guided navigation.” The patents make clear the computational procedure required to make facets work.

The more general name of the feature is “faceted navigation.” For those struggling with indexing, faceted navigation “exposes” the users to content options. This works well if the domain is reasonably stable, the corpus small, and the user knows generally what he or she needs.

To get a useful review of this approach to findability, check out “Faceted Navigation.” Now five years old, the write up will add logs to the fires of taxonomy. However, faceted search is not next generation information access. Faceted navigation is like a flintlock rifle used by Lewis and Clark. Just don’t try to kill any Big Data bears with the method. And Twitter outputs? Look elsewhere.

Stephen E Arnold, January 4, 2014

Finding Books: Not Much Has Changed

December 1, 2014

Three or four years ago I described what I called “the book findability” problem. The audience was a group of confident executives trying to squeeze money from an old school commercial database model. Here’s how the commercial databases worked in 1979.

  1. Take content from published sources
  2. Create a bibliographic record, write or edit the abstract included with the source document
  3. Index it with no more than three to six index terms
  4. Digitize the result
  5. Charge a commercial information utility to make it available
  6. Get a share of the revenues.

That worked well until the first Web browser showed up and individuals and institutions began making information available online. There are a number of companies that still use variations of this old school business model. Examples include newspapers that charge a Web browser user for access to content to outfits like LexisNexis, Ebsco, Cambridge Scientific Abstracts, and other outfits.


As libraries and individuals resist online fees, many of the old school outfits are going to have to come up with new business models. But adaptation will not be easy. Amazon is in the content business. Why buy a Cliff’s Notes-type summary when there are Amazon reviews? Why pay for news when a bit of sleuthing will turn up useful content from outfits like the United Nations or off the radar outfits like World News at Tech information is going through a bit of an author revolt. While not on the level of protests in Hong Kong, a lot of information that used to be available in research libraries or from old school database providers is available online. At some point, peer reviewed journals and their charge the author business models will have to reinvent themselves. Even recruitment services like LinkedIn offer useful business information via

One black hole concerns finding out what books are available online. A former intelligence officer with darned good research skills was not able to locate a copy of my The New Landscape of Search. You can find it here for free.

I read “Location, Location: GPS in the Medieval Library.” The use of coordinates to locate a book on a shelf or hanging from a wall anchored by a chain is not new to those who have fooled around with medieval manuscripts. Remember that I used to index medieval sermons in Latin as early as 1963.

What the write up triggered was the complete and utter failure of indexing services to make an attempt to locate, index, and provide a pointer to books regardless of form. The baloney about indexing “all” information is shown to be a toothless dragon. The failure of the Google method and the flaws of the Amazon, Library of Congress, and commercial database providers is evident.

Now back to the group of somewhat plump, red face confident wizards of commercial database expertise. The group found my suggestion laughable. No big deal. I try to avoid the Titanic type operations. I collected my check and hit the road.

There are domains of content that warrant better indexing. Books, unfortunately, is one set of content that makes me long for the approach that put knowledge in one place with a system that at least worked and could be supplemented by walking around and looking.

No such luck today.

Stephen E Arnold, December 1, 2014

More Metadata: Not Needed Metadata

November 21, 2014

I find the metadata hoo hah fascinating. Indexing has been around a long time. If one wants to dig into the complexities of metadata, you may find the table from helpful:


Mid tier consulting firms often do not use the products or systems their “experts” recommend. Consultants in indexing do create elaborate diagrams that make my eyes glaze over.

Some organizations generate metadata without considering what is required. As a result, outputs from the systems can present mind boggling complex options to the user. A report displaying multiple layers  of metadata can be difficult to understand.

My thought is that before giving the green light to promiscuous metadata generation, some analysis and planning may be useful. The time lost trying to figure out which metadata is relevant to a particular issue can be critical.

But consultants and vendors are indeed impressed with flashy graphics. Too many times no one has a clue what the graphics are trying to communicate. The worst offenders are companies that sell visual sizzle to senior managers. The goal is a gasp from the audience when the Hollywood style visualizations are presented. Pass the popcorn. Skip the understanding.

Stephen E Arnold, November 21, 2014

Is This How Right to Be Forgotten Works?

November 21, 2014

I read “This Is How Google Handles Right to Be Forgotten Requests.” I must admit that the process strikes me as impressive. To explain hitting delete takes about 1,000 words.

One of Google’s problems, though, is that the process is often one-sided because a decision is based on information supplied by one person using a simple Web form.

The article explains what the GOOG allegedly does. There is one omissions. Some content has disappeared and it is difficult to figure out if a form was filed. For information about this, navigate to “Telegraph Stories Affected by EU ‘Right to Be Forgotten’”.

For a “real” journalism outfit, Computerworld is presenting the world as it understands it. Does the description match the Google reality? Good question.

Stephen E Arnold, November 21, 2014

Telegraph Breivik Text Forgotten

November 13, 2014

Short honk: I assume that the Telegraph is not happy with Google’s removal from its index of Breivik content. You can read the allegedly accurate story here. It seems to be a good idea to eliminate information about a convicted mass murderer of young people. As I have said before, it is tough to look for information when the index contains no entry. Ah, progress.

Stephen E Arnold, November 13, 2014

Concept Searching: More Smart Content Rah Rah

September 23, 2014

I read “Concept Searching Taxonomy Workflow Tool solving Migration, Security, and Records Management Challenges.” This is a news release and it can disappear at any time. Don’t hassle me if it is a goner. The write up walks me rapidly into the smart content swamp. The idea is that content without indexing is dumb content. Okay. Lots of folks are pitching the smart versus dumb content idea now.

The fix? Concept Searching provides a smart tool to make content intelligent; that is, include index terms. For the youngster at heart, “indexing” is the old school word for metadata.

The company’s news announcement asserts:

conceptTaxonomyWorkflow serves as a strategic tool, managing enterprise metadata to drive business processes at both the operational and tactical levels. It provides administrators with the ability to independently manage access, information management, information rights management, and records management policy application within their respective business units and functional areas, without the need for IT support or access to enterprise-wide servers. By effectively and accurately applying policy across applications and content repositories, conceptTaxonomyWorkflow enables organizations to significantly improve their compliance and information governance initiatives.

The product name is indeed a unique string in the Google index. The company asserts that the notion of a workflow is strategic. Not only is workflow strategic, it is also tactical. For some, this is a two for one deal that may be heard to resist. The tool allows administrators to perform what appears to be tasks I think of “editorial policy” or as the young at heart say, information governance.

The only issue for me is that the organizations with which I am familiar have pretty miserable information governance methods. What I find is that organizations have Balkanized methods for dealing with digital information. Examples of poor information governance fall readily to hand. The US court system removed public documents only to reinstate them. The IRS in the US cannot locate email. And when the IRS finds an archive of the email, the email cannot be searched. And, of course, there is Mr. Snowden. How many documents did he remove from NSA servers?

The notion that the CTW tool makes it possible to “apply policy across applications and content repositories” sounds absolutely fantastic to a person with indexing experience. There is a problem. Many organizations do not understand an editorial policy or are willing to do much more than react when something goes off the tracks. The reality is that the appetite for meaningful action is often not in commercial enterprises or government entities. Budgets remain tight. Reducing information technology budgets is often a more important goal than improve information technology.

What’s this mean?

My hunch is that Concept Searching is offering a product for an organization that [a] has an editorial policy in place or [b] wants to appear to be taking meaning steps toward useful information governance.

The president of Concept Searching is taking a less pragmatic approach to selling this tool. Martin Garland, according to the company story, states:

Managing metadata and auto-classifying to taxonomies provides high value in applications such as search, text analytics, and business social. But many forward thinking organizations are now looking to leverage their enterprise metadata and use it to improve business processes aligned with compliance and information governance initiatives. To accomplish this successfully, technologies such as conceptTaxonomyWorkflow must be able to qualify metadata and process the content based on enterprise policies. A key benefit of the product is its ease of use and rapid deployment. It removes the lengthy application development cycle and can be used by a large community of business specialists as well as IT.

The key benefit, for me, is that a well conceived and administered information policy eliminates risks of an information misstep. I would suggest that the Snowden matter was a rather serious misstep.

One assumes that companies have information policies, stand behind them, and keep them current. This strikes me as a quite significant assumption.

A similar message is now being pushed by Smartlogic, TEMIS, WAND, and other “indexing” companies.

Are these products delivering essentially similar functionality? Is any system indexing with less than a 10 percent error rate? Are those with responsibility for figuring out what to do with the flood of digital information equipped to enforce organization wide policies? And once installed, will the organization continue to commit resources to support tools that manage indexing? What happens if Microsoft Azure Search and Delve deliver good enough indexing and controls?

These are difficult questions to answer. Based on the pivoting content processing vendors are doing, most companies selling information solutions are trying to find a way to boost revenues in an exhausting effort to maintain stable cash flows.

Does anyone make an information governance tool that keeps track of what information retrieval companies market?

Stephen E Arnold, September 23, 2014

Mondeca: Content IQ

September 23, 2014

I reacted strongly to the IDC report about the knowledge quotient. IDC, as you know, is the home of the fellow who sold my content on Amazon without written permission. I learned that Mondeca is using a variant of “knowledge quotient.” This company’s approach taps the idea of the intelligence quotient of content.

I interpret content with a high IQ in a way that is probably not what Mondeca intended. Smart content is usually content that conveys information that I find useful. Modena, like other purveyors of indexing software, uses the IQ to refer to content that is indexed in a meaningful way. Remember if the users do not use the index terms, assigning these terms to a document does not help a user. Effective indexing helps the user find content. In the good old days of specialist indexing, users had to learn the indexing vocabulary and conventions. Today users just pump 2.7 words into a search box and feel lucky.

Like vendors of automated indexing systems and software, humans have to get into the mix.

One twist Modena brings to the content IQ notion is a process that helps a potential licensee answer the question, “How smart is your content?” For me, poorly indexed content is not smart. The content is simply poorly indexed.

I navigated to the “more information” link on the Content IQ page and learned that answering the question costs 5000 Euros, roughly $6,000.

Like the knowledge quotient play, smart content and allied jargon make an effort to impart a halo of magic around a pretty obvious function. I suppose that in today’s market, clarity is not important. Marketing magic is needed to create a demand for indexing.

I believe professionally administered indexing is important. I was one of the people responsible for creating the ABI/INFORM controlled vocabulary revision and the reindexing of the database in 1981. Our effort involved controlled terms, company name fields, and a purpose built classification system.

Some day automated systems will be able to assign high value index terms without humans. I don’t think that day has arrived. To create smart content, have smart people write it. Then get smart, professional indexers to index it. If a software system can contribute to the effort, I support that effort. I am just not comfortable with the “smart software” trend that is gaining traction.

Stephen E Arnold, September 23, 2014

Luxid: Positioning Adjustments

September 23, 2014

Luxid, based in Paris, offers an automatic indexing service. The company has focused on the publishing sector as well a number of other verticals. The company uses the phrase “semantic content enrichment” to describe the companies indexing. The more trendy phrase is “metatagging,” but I prefer the older term.

The company also uses the term “ontology” along with references to semantic jargon like “triples.” The idea is that a licensee can select a module that matches an industry sector. WAND, a competitor, offers a taxonomy library. The idea is that much of the expensive and intellectually demand work needed to build a controlled vocabulary from scratch is sidestepped.

The positioning that I find interesting is that Luxid delivers “NLP enabled ontology management workflow.” The idea is that once the indexing system is installed, the licensee can maintain the taxonomy using the provided interface. This is another way of saying that administrative tools are included. Another competitor, Smartlogic, uses equally broad and somewhat esoteric terms to describe what are essential indexing operations.

Like other search and content processing vendors, Luxid invokes the magic of Big Data. Luxid asserts, “Streamlined, Big Data architecture offers improved scalability and robust integration options.” The point that indexing processes often stub toes is the amount of human effort and machine processing time required to keep and index updated and populate the new content across already compiled indexes. Scalability can be addressed with more resources. More resources often means increased costs, a challenge for any indexing system that deals with regular content, not just Big Data.

Will the revised positioning generate more inquiries and sales leads? Possibly. I find the wordsmithing content processing vendors use fascinating. The technology, despite the academic jargon, has been around since the days of Data Harmony and other aging methods.

The key points, in my view, is that Luxid offers a story that makes sense. The catnip may be the jargon, the push into publishing which is loath to spend for humans to create indexes, and the packaging of vocabularies into “Skill Cartridges.”

I anticipate that some of Luxid’s competitors will emulate the Luxid terminology. For many years, much of the confusion about which content processing does what can be traced to widespread use of jargon.

Stephen E Arnold, September 22, 2014

Next Page »