Concept Searching: More Smart Content Rah Rah

September 23, 2014

I read “Concept Searching Taxonomy Workflow Tool solving Migration, Security, and Records Management Challenges.” This is a news release and it can disappear at any time. Don’t hassle me if it is a goner. The write up walks me rapidly into the smart content swamp. The idea is that content without indexing is dumb content. Okay. Lots of folks are pitching the smart versus dumb content idea now.

The fix? Concept Searching provides a smart tool to make content intelligent; that is, include index terms. For the young at heart, “indexing” is the old school word for metadata.

The company’s news announcement asserts:

conceptTaxonomyWorkflow serves as a strategic tool, managing enterprise metadata to drive business processes at both the operational and tactical levels. It provides administrators with the ability to independently manage access, information management, information rights management, and records management policy application within their respective business units and functional areas, without the need for IT support or access to enterprise-wide servers. By effectively and accurately applying policy across applications and content repositories, conceptTaxonomyWorkflow enables organizations to significantly improve their compliance and information governance initiatives.

The product name is indeed a unique string in the Google index. The company asserts that the notion of a workflow is strategic. Not only is workflow strategic, it is also tactical. For some, this is a two for one deal that may be hard to resist. The tool allows administrators to perform tasks I think of as “editorial policy” or, as the young at heart say, information governance.

The only issue for me is that the organizations with which I am familiar have pretty miserable information governance methods. What I find is that organizations have Balkanized methods for dealing with digital information. Examples of poor information governance fall readily to hand. The US court system removed public documents only to reinstate them. The IRS in the US cannot locate email. And when the IRS finds an archive of the email, the email cannot be searched. And, of course, there is Mr. Snowden. How many documents did he remove from NSA servers?

The notion that the CTW tool makes it possible to “apply policy across applications and content repositories” sounds absolutely fantastic to a person with indexing experience. There is a problem. Many organizations do not understand an editorial policy or are not willing to do much more than react when something goes off the tracks. The reality is that the appetite for meaningful action is often absent in commercial enterprises and government entities. Budgets remain tight. Reducing information technology budgets is often a more important goal than improving information technology.
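To make the idea concrete, here is a minimal sketch of what “apply policy across applications and content repositories” boils down to: metadata drives rules, and rules drive actions. The rule names, metadata fields, and retention periods below are my own illustrative assumptions, not Concept Searching’s implementation.

```python
from datetime import date, timedelta

# Hypothetical policy rules keyed on index terms. The terms, retention
# periods, and access levels are invented for illustration.
POLICY_RULES = {
    "personnel-record": {"retain_years": 7, "access": "hr-only"},
    "contract": {"retain_years": 10, "access": "legal"},
    "marketing": {"retain_years": 2, "access": "public"},
}

def apply_policy(document):
    """Return the policy actions for a document based on its index terms."""
    actions = []
    for term in document.get("index_terms", []):
        rule = POLICY_RULES.get(term)
        if rule is None:
            continue  # unindexed or unrecognized term: no policy fires
        expires = document["created"] + timedelta(days=365 * rule["retain_years"])
        actions.append({
            "term": term,
            "access": rule["access"],
            "delete_after": expires,
        })
    return actions

doc = {"id": "doc-001", "created": date(2014, 1, 15),
       "index_terms": ["contract"]}
print(apply_policy(doc))
```

Note what the sketch makes visible: a document with no index terms falls through every rule. That is why “dumb” content defeats even a capable workflow tool.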

What’s this mean?

My hunch is that Concept Searching is offering a product for an organization that [a] has an editorial policy in place or [b] wants to appear to be taking meaningful steps toward useful information governance.

The president of Concept Searching is taking a less pragmatic approach to selling this tool. Martin Garland, according to the company story, states:

Managing metadata and auto-classifying to taxonomies provides high value in applications such as search, text analytics, and business social. But many forward thinking organizations are now looking to leverage their enterprise metadata and use it to improve business processes aligned with compliance and information governance initiatives. To accomplish this successfully, technologies such as conceptTaxonomyWorkflow must be able to qualify metadata and process the content based on enterprise policies. A key benefit of the product is its ease of use and rapid deployment. It removes the lengthy application development cycle and can be used by a large community of business specialists as well as IT.

The key benefit, for me, is that a well conceived and administered information policy eliminates risks of an information misstep. I would suggest that the Snowden matter was a rather serious misstep.

One assumes that companies have information policies, stand behind them, and keep them current. This strikes me as a quite significant assumption.

A similar message is now being pushed by Smartlogic, TEMIS, WAND, and other “indexing” companies.

Are these products delivering essentially similar functionality? Is any system indexing with less than a 10 percent error rate? Are those with responsibility for figuring out what to do with the flood of digital information equipped to enforce organization wide policies? And once installed, will the organization continue to commit resources to support tools that manage indexing? What happens if Microsoft Azure Search and Delve deliver good enough indexing and controls?

These are difficult questions to answer. Based on the pivoting content processing vendors are doing, most companies selling information solutions are trying to find a way to boost revenues in an exhausting effort to maintain stable cash flows.

Does anyone make an information governance tool that keeps track of what information retrieval companies market?

Stephen E Arnold, September 23, 2014

Mondeca: Content IQ

September 23, 2014

I reacted strongly to the IDC report about the knowledge quotient. IDC, as you know, is the home of the fellow who sold my content on Amazon without written permission. I learned that Mondeca is using a variant of “knowledge quotient.” This company’s approach taps the idea of the intelligence quotient of content.

I interpret content with a high IQ in a way that is probably not what Mondeca intended. Smart content is usually content that conveys information that I find useful. Mondeca, like other purveyors of indexing software, uses IQ to refer to content that is indexed in a meaningful way. Remember: if users do not use the index terms, assigning those terms to a document does not help them. Effective indexing helps the user find content. In the good old days of specialist indexing, users had to learn the indexing vocabulary and conventions. Today users just pump 2.7 words into a search box and feel lucky.

Vendors of automated indexing systems and software notwithstanding, humans have to get into the mix.

One twist Mondeca brings to the content IQ notion is a process that helps a potential licensee answer the question, “How smart is your content?” For me, poorly indexed content is not smart. The content is simply poorly indexed.

I navigated to the “more information” link on the Content IQ page and learned that answering the question costs 5000 Euros, roughly $6,000.

Like the knowledge quotient play, smart content and allied jargon make an effort to impart a halo of magic around a pretty obvious function. I suppose that in today’s market, clarity is not important. Marketing magic is needed to create a demand for indexing.

I believe professionally administered indexing is important. I was one of the people responsible for creating the ABI/INFORM controlled vocabulary revision and the reindexing of the database in 1981. Our effort involved controlled terms, company name fields, and a purpose built classification system.

Some day automated systems will be able to assign high value index terms without humans. I don’t think that day has arrived. To create smart content, have smart people write it. Then get smart, professional indexers to index it. If a software system can contribute to the effort, I support that effort. I am just not comfortable with the “smart software” trend that is gaining traction.

Stephen E Arnold, September 23, 2014

Luxid: Positioning Adjustments

September 23, 2014

Luxid, based in Paris, offers an automatic indexing service. The company has focused on the publishing sector as well as a number of other verticals. The company uses the phrase “semantic content enrichment” to describe its indexing. The more trendy phrase is “metatagging,” but I prefer the older term.

The company also uses the term “ontology” along with references to semantic jargon like “triples.” The idea is that a licensee can select a module that matches an industry sector. WAND, a competitor, offers a taxonomy library. The idea is that much of the expensive and intellectually demanding work needed to build a controlled vocabulary from scratch is sidestepped.

The positioning that I find interesting is that Luxid delivers “NLP enabled ontology management workflow.” The idea is that once the indexing system is installed, the licensee can maintain the taxonomy using the provided interface. This is another way of saying that administrative tools are included. Another competitor, Smartlogic, uses equally broad and somewhat esoteric terms to describe what are essentially indexing operations.

Like other search and content processing vendors, Luxid invokes the magic of Big Data. Luxid asserts, “Streamlined, Big Data architecture offers improved scalability and robust integration options.” The point at which indexing processes often stub their toes is the amount of human effort and machine processing time required to keep an index updated and to propagate new content across already compiled indexes. Scalability can be addressed with more resources. More resources often mean increased costs, a challenge for any indexing system that deals with regular content, not just Big Data.

Will the revised positioning generate more inquiries and sales leads? Possibly. I find the wordsmithing content processing vendors use fascinating. The technology, despite the academic jargon, has been around since the days of Data Harmony and other aging methods.

The key point, in my view, is that Luxid offers a story that makes sense. The catnip may be the jargon, the push into publishing (a sector loath to spend for humans to create indexes), and the packaging of vocabularies into “Skill Cartridges.”

I anticipate that some of Luxid’s competitors will emulate the Luxid terminology. For many years, much of the confusion about which content processing system does what has been traceable to the widespread use of jargon.

Stephen E Arnold, September 23, 2014

Governance Information for Office 365

August 18, 2014

Short honk: I am not sure what governance means. But search and content processing vendors bandy about words better than Serena Williams hits tennis balls. If you are governance hungry and use Office 365, Concept Searching has an “online information source” for you. More information is at http://bit.ly/1teO5fA. My hunch is that you will be encouraged to license some smart software to index documents. What happens when there is no Internet connection? Oh, no big deal.

Stephen E Arnold, August 18, 2014

Attensity Leverages Biz360 Invention

August 4, 2014

In 2010, Attensity purchased Biz360. The Beyond Search comment on this deal is at http://bit.ly/1p4were. One of the goslings reminded me that I had not instructed a writer to tackle Attensity’s July 2014 announcement “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology.” PR-type “stories” can disappear, but for now you can find a description of “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology” at http://reut.rs/1qU8Sre.

My researcher showed me a hard copy of US 8,645,395, and I scanned the abstract and claims. The abstract, like those of many search and content processing inventions, seemed somewhat similar to other text parsing systems and methods. The application was filed in April 2008, two years before Attensity purchased Biz360, a social media monitoring company. Attensity, as you may know, is a text analysis company founded by Dr. David Bean. Dr. Bean employed various “deep” analytic processes to figure out the meaning of words, phrases, and documents. My limited understanding of Attensity’s methods suggested to me that Attensity’s Bean-centric technology could process text to achieve a similar result. I had a phone call from AT&T regarding the utility of certain Attensity outputs. I assume that the Bean methods required some reinforcement to keep pace with customers’ expectations about Attensity’s Bean-centric system. Neither the goslings nor I are patent attorneys. So after you download 395, seek out a patent attorney and get him/her to explain its mysteries to you.

The abstract states:

A system for evaluating a review having unstructured text comprises a segment splitter for separating at least a portion of the unstructured text into one or more segments, each segment comprising one or more words; a segment parser coupled to the segment splitter for assigning one or more lexical categories to one or more of the one or more words of each segment; an information extractor coupled to the segment parser for identifying a feature word and an opinion word contained in the one or more segments; and a sentiment rating engine coupled to the information extractor for calculating an opinion score based upon an opinion grouping, the opinion grouping including at least the feature word and the opinion word identified by the information extractor.

This invention tackles the Mean Joe Greene of content processing from the point of view of a quite specific type of content: a review. Amazon has quite a few reviews, but the notion of a “shaped” review is a thorny one. (See, for example, http://bit.ly/1pz1q0V.) The invention’s approach identifies words with different roles; some words are “opinion words” and others are “feature words.” By hooking a “sentiment engine” to this indexing operation, the Biz360 invention can generate an “opinion score.” The system uses item, language, training model, feature, opinion, and rating modifier databases. These, I assume, are either maintained by subject matter experts (expensive), smart software working automatically (often evidencing “drift” so results may not be on point), or a hybrid approach (humans cost money).
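To see how the pieces named in the abstract fit together, here is a minimal sketch of such a pipeline. The word lists are toy stand-ins for the patent’s databases, and I skip the lexical category assignment a real segment parser performs; none of this is Attensity’s actual code.

```python
import re

# Toy stand-ins for the patent's knowledge bases. Real systems maintain
# these lists; keeping them current is the expensive part.
FEATURE_WORDS = {"battery", "screen", "price"}
OPINION_WORDS = {"great": 1.0, "good": 0.5, "poor": -0.5, "terrible": -1.0}
MODIFIERS = {"very": 1.5, "somewhat": 0.5}

def split_segments(review_text):
    """Segment splitter: break unstructured text into sentence segments."""
    return [s.strip() for s in re.split(r"[.!?]", review_text) if s.strip()]

def score_review(review_text):
    """Extract (feature, opinion) groupings and compute an opinion score."""
    groupings, score = [], 0.0
    for segment in split_segments(review_text):
        words = segment.lower().split()
        features = [w for w in words if w in FEATURE_WORDS]
        for i, word in enumerate(words):
            if word in OPINION_WORDS and features:
                # Rating modifier: a preceding word can scale the opinion.
                weight = MODIFIERS.get(words[i - 1], 1.0) if i > 0 else 1.0
                s = OPINION_WORDS[word] * weight
                groupings.append((features[0], word, s))
                score += s
    return groupings, score

print(score_review("The battery is very good. The screen is terrible."))
# -> ([('battery', 'good', 0.75), ('screen', 'terrible', -1.0)], -0.25)
```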

The Attensity/Biz360 system relies on a number of knowledge bases. How are these updated? What is the latency between identifying new content and updating the knowledge bases to make the new content available to the user or a software process generating an alert or another type of report?

The 20 claims embrace the components working as a well oiled content analyzer. The claim I noted is that the system’s opinion score uses a positive and negative range. I worked on a sentiment system that made use of a stop light metaphor: red for negative sentiment and green for positive sentiment. When our system could not figure out whether the text was positive or negative we used a yellow light.

The approach, used for a US government project a decade ago, relied on a very simple metaphor to communicate a situation without scores, values, and scales. Image source: http://bit.ly/1tNvkT8
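For what the stoplight idea looks like in code, here is a minimal sketch; the thresholds are my own assumptions, not those of the government project.

```python
def stoplight(opinion_score, uncertainty_band=0.2):
    """Map a signed opinion score to a traffic light color.

    Scores inside the uncertainty band are 'yellow': the system
    cannot tell whether the text is positive or negative.
    """
    if opinion_score > uncertainty_band:
        return "green"   # positive sentiment
    if opinion_score < -uncertainty_band:
        return "red"     # negative sentiment
    return "yellow"      # indeterminate

print(stoplight(0.75), stoplight(-1.0), stoplight(0.1))
# -> green red yellow
```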

Attensity said, according to the news story cited above:

By splitting the unstructured text into one or more segments, lexical categories can be created and a sentiment-rating engine coupled to the information can now evaluate the opinions for products, services and entities.

Okay, but I think that the splitting of text into segments was a function of iPhrase and of search vendors converting unstructured text into XML and then indexing the outputs.

Jonathan Schwartz, General Counsel at Attensity, is quoted in the news story as asserting:

“The issuance of this patent further validates the years of research and affirms our innovative leadership. We expect additional patent issuances, which will further strengthen our broad IP portfolio.”

Okay, this sounds good, but the invention took place prior to Attensity’s owning Biz360. Attensity, therefore, purchased the invention of folks who did not work at Attensity in the period prior to the filing in 2008. I understand that companies buy other companies to get technology and people. I find it interesting that Attensity’s work “validates” Attensity’s research and “affirms” Attensity’s “innovative leadership.”

I would word what the patent delivers and Attensity’s contributions differently. I am no legal eagle or sentiment expert. I would prefer less marketing razzle dazzle, but I am in the minority on this point.

Net net: Attensity is an interesting company. Will it be able to deliver products that make the licensees’ sentiment scores move in a direction that leads to sustaining revenue and generous profits? With the $90 million in funding the company received in 2014, the 14-year-old company will have some work to do to deliver a healthy return to its stakeholders. Expert System, Lexalytics, and others are racing down the same quarter mile drag strip. Which firm will be the winner? Which will blow an engine?

Stephen E Arnold, August 4, 2014

Training Your Smart Search System

August 2, 2014

With the increasing chatter about smart software, I want to call to your attention this article, “Improving the Way Neural Networks Learn.” Keep in mind that some probabilistic search systems have to be trained on content that closely resembles the content the system will index. The training is important, and training can be time consuming. The licensee has to create a training set of data that is similar to what the software will index. Then the training process is run, and a human checks the system outputs and makes “adjustments.” If the training set is not representative, the indexing will be off. If the human makes corrections that are wacky, then the indexing will be off. When the system is turned loose, the resulting index may return outputs that are not what the user expected, or the outputs are incorrect. Whether the system’s users know enough to recognize incorrect results varies from human to human.
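To make the train, check, adjust loop concrete, here is a minimal sketch. A toy bag-of-words classifier stands in for the vendor’s probabilistic engine; the sample content and the threshold in the comments are illustrative assumptions.

```python
from collections import Counter

# Toy training set: (document text, correct index term). In a real
# deployment this set must resemble the content the system will index.
TRAINING = [
    ("quarterly earnings revenue profit", "finance"),
    ("court ruling patent lawsuit", "legal"),
    ("server outage network latency", "it"),
]

def train(samples):
    """Build per-term word counts from the training set."""
    model = {}
    for text, label in samples:
        model.setdefault(label, Counter()).update(text.split())
    return model

def classify(model, text):
    """Assign the index term whose vocabulary best overlaps the text."""
    words = text.split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))

def check_accuracy(model, held_out):
    """The human review step: compare outputs against known-good labels."""
    correct = sum(classify(model, t) == label for t, label in held_out)
    return correct / len(held_out)

model = train(TRAINING)
# If accuracy on held-out content drops below roughly 0.85, retraining
# or human "adjustments" are needed; drift degrades accuracy over time.
print(check_accuracy(model, [("patent court filing", "legal")]))
```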

If you want to have a chat with your vendor regarding the time required to train or re-train a search system relying on sample content, print out this article. If the explanation does not make much sense to you, you can document off-point query result sets, complain to the search system vendor, or initiate a quick fix. Note that quick fixes involve firing the humans believed to be responsible for the system, initiating a new search procurement, or pretending that the results are just fine. I suppose there are other options, but I have encountered these three approaches seasoned with either legal action or verbal grousing to the vendor. Even when the automated indexing is tuned within an inch of its life, accuracy is likely to start out in the 85 to 90 percent range and then degrade.

Training can be a big deal. Ignoring the “drift” that occurs when the smart software has been taught or learned something that distorts the relevance of results can produce some sharp edges.

Stephen E Arnold, August 2, 2014

Meme Attention Deficit

April 27, 2014

I read “Algorithm Distinguishes Memes from Ordinary Information.” The article reports that algorithms can pick out memes. A “meme”, according to Google, is “an element of a culture or system of behavior that may be considered to be passed from one individual to another by nongenetic means, especially imitation.” The passage that caught my attention is:

Having found the most important memes, Kuhn and co studied how they have evolved in the last hundred years or so. They say most seem to rise and fall in popularity very quickly. “As new scientific paradigms emerge, the old ones seem to quickly lose their appeal, and only a few memes manage to top the rankings over extended periods of time,” they say.

The factoid that reminded me how far smart software has yet to travel is:

To test whether these phrases are indeed interesting topics in physics, Kuhn and co asked a number of experts to pick out those that were interesting. The only ones they did not choose were: 12. Rashba, 14. ‘strange nonchaotic’ and 15. ‘in NbSe3’. Kuhn and co also checked Wikipedia, finding that about 40 per cent of these words and phrases have their own corresponding entries. Together this provides compelling evidence that the new method is indeed finding interesting and important ideas.

Systems produce outputs that are not yet spot on. I concluded that scientists, like marketers, like whizzy new phrases and ideas. Jargon, it seems, is an important part of specialist life.

Stephen E Arnold, April 27, 2014

US Government Content Processing: A Case Study

March 24, 2014

I know that the article “Sinkhole of Bureaucracy” is a single case example. Nevertheless, the write up tickled my funny bone. With fancy technology, USA.gov, and the hyper modern content processing systems used in many Federal agencies, reality is stranger than science fiction.

This passage snagged my attention:

inside the caverns of an old Pennsylvania limestone mine, there are 600 employees of the Office of Personnel Management. Their task is nothing top-secret. It is to process the retirement papers of the government’s own workers. But that system has a spectacular flaw. It still must be done entirely by hand, and almost entirely on paper.

One of President Obama’s advisors is quoted as describing the manual operation as “that crazy cave.”

And the fix? The article asserts:

That failure imposes costs on federal retirees, who have to wait months for their full benefit checks. And it has imposed costs on the taxpayer: The Obama administration has now made the mine run faster, but mainly by paying for more fingers and feet. The staff working in the mine has increased by at least 200 people in the past five years. And the cost of processing each claim has increased from $82 to $108, as total spending on the retirement system reached $55.8 million.

One of the contractors operating the system is Iron Mountain. You may recall that this outfit has a search system and caught my attention when Iron Mountain sold the quite old Stratify (formerly Purple Yogi) automatic indexing system to Autonomy.

My observations:

  1. Many systems have a human component that managers ignore, do not know about, or lack the management horsepower to address. When search systems or content processing systems generate floods of red ink, human processes are often the culprit.
  2. The notion that modern technology has permeated organizations is false. The cost friction in many companies is directly related to small decisions that grow like a snowball rolling down a hill. When these processes reach the bottom, the mess is no longer amusing.
  3. Moving significant information from paper to a digital form and then using those data in a meaningful way to answer questions is quite difficult.

Do managers want to tackle these problems? In my experience, keeping up appearances and cost cutting are more important than old fashioned problem solving. In a recent LinkedIn post I pointed out that automatic indexing systems often require human input. Forgetting about those costs produces problems that are expensive to fix. Simple indexing won’t bail out the folks in the cave.

Stephen E Arnold, March 24, 2014

SchemaLogic Profile Available

December 3, 2013

A new profile is available on the Xenky site today. SchemaLogic is a controlled vocabulary management system. The system combines traditional vocabulary management with an organization wide content management system specifically for indexing words and phrases. The analysis provides some insight into how a subsystem can easily boost the cost of a basic search system’s staff and infrastructure.

Taxonomy became a chrome trimmed buzzword almost a decade ago. Indexing has been around a long time, and indexing has a complete body of practices and standards for the practitioner to use when indexing content objects.

Just what an organization needs to make sense of its text, images, videos, and other digital information/data. At a commercial database publishing company, more than a dozen people can be involved in managing a controlled term list and classification coding scheme. When a term is misapplied, finding a content object can be quite a challenge. If audio or video are misindexed, the content object may require a human to open, review, and close files until the required image or video can be located. Indexing is important, but many MBAs do not understand the cost of indexing until a needed content object cannot be found; for example, in a legal discovery process related to a patent matter. A happy quack to http://swissen.in/swictingsys.php for the example of a single segment of a much larger organization centric taxonomy. Consider managing a controlled term list with more than 20,000 terms and a 400 node taxonomy across a Fortune 500 company or for the information stored in your laptop computer.

Even early birds in the search and content processing sector like Fulcrum Technologies and Verity embraced controlled vocabularies. A controlled term list contains forms of words and phrases and often the classification categories into which individual documents can be tagged.

The problem was that lists of words had to be maintained. Clever poobahs and mavens created new words to describe allegedly new concepts. Scientists, engineers, and other tech types whipped up new words and phrases to help explain their insights. And humans, often loosey goosey with language, shifted meanings. For example, when I was in college a half century ago, there was a class in “discussion.” Today that class might be called “collaboration.” Software often struggles with these language realities.
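A controlled vocabulary absorbs this kind of drift by mapping variant forms to a preferred term. Here is a minimal sketch of the data structure; every entry is invented for illustration.

```python
# A controlled term list maps preferred terms to their variant forms
# and classification codes. All entries are invented examples.
CONTROLLED_TERMS = {
    "collaboration": {"variants": ["discussion", "group work", "teamwork"],
                      "category": "MGT-240"},
    "microcomputers": {"variants": ["personal computers", "pcs"],
                       "category": "CMP-110"},
}

# Invert the list so an indexer (human or machine) can normalize
# whatever form appears in a document to the preferred term.
VARIANT_TO_PREFERRED = {
    v: preferred
    for preferred, entry in CONTROLLED_TERMS.items()
    for v in entry["variants"]
}

def normalize(term):
    """Map a candidate term to its controlled form, or flag it."""
    term = term.lower()
    if term in CONTROLLED_TERMS:
        return term
    return VARIANT_TO_PREFERRED.get(term, f"UNAUTHORIZED: {term}")

print(normalize("discussion"))   # -> collaboration
print(normalize("synergy"))      # -> UNAUTHORIZED: synergy
```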

What happens when “old school” search and content processing systems try to index documents?

The systems can “discover” terms and apply them. Vendors with “smart software” use a range of statistical and linguistic techniques to figure out entities, bound phrases, and concepts. Other approaches include sucking in dictionaries and encyclopedias. The combination of a “knowledgebase” like Wikipedia and other methods works reasonably well.
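Here is a minimal sketch of the dictionary-driven flavor of term discovery; the gazetteer entries are invented, and real systems layer statistical and linguistic methods on top of this kind of lookup.

```python
# Tiny gazetteer sucked in from a reference source; entries invented.
GAZETTEER = {
    "ibm": ("company", "International Business Machines"),
    "paris": ("place", "Paris, France"),
    "enterprise search": ("concept", "enterprise search"),
}

def discover_terms(text):
    """Find candidate index terms by matching gazetteer entries,
    longest phrases first, against the document text."""
    found = []
    lowered = text.lower()
    for surface in sorted(GAZETTEER, key=len, reverse=True):
        if surface in lowered:
            entity_type, canonical = GAZETTEER[surface]
            found.append((canonical, entity_type))
    return found

doc = "IBM demonstrated enterprise search at its Paris office."
print(discover_terms(doc))
# -> [('enterprise search', 'concept'),
#     ('Paris, France', 'place'),
#     ('International Business Machines', 'company')]
```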


New Version of Media Mining Indexer (6.2) from SAIL LABS Technology

November 9, 2013

The release titled “SAIL LABS Announces New Release of Media Mining Indexer 6.2,” issued by SAIL LABS Technology on August 5, 2013, provides some insight into the latest version of the Media Mining Indexer. SAIL LABS Technology considers itself an innovator in creating solutions for vertical markets and in enhancing technologies for advanced language understanding.

The newest release offers such features as:

“Improved named entity detection of names via unified lists across languages… improved topic models for all languages… improved text preprocessing for Greek, Hebrew, Italian, Frasi, US and international English…support of further languages: Catalan, Swedish, Portuguese, Bahasa (Indonesia), Italian, Farsi and Romanian…improved communication with Media Mining Server to relate recognized speakers to their respective profiles.”

Gerhard Backfried, Head of Research at SAIL LABS, called the latest release a “quantum leap forward,” citing the system’s tractability, constancy, and ability to respond to clients’ needs. The flagship product is based on SAIL LABS speech recognition technology, which has won awards, and offers a suite of components for multimedia processing and the transformation of audio and video data into searchable information. The features include the ability to convert speech to text accurately with Automatic Speech Recognition and the ability to detect different speakers with Speaker Change Detection.

Chelsea Kerwin, November 09, 2013

Sponsored by ArnoldIT.com, developer of Augmentext
