Finding Books: Not Much Has Changed

December 1, 2014

Three or four years ago I described what I called “the book findability” problem. The audience was a group of confident executives trying to squeeze money from an old school commercial database model. Here’s how the commercial databases worked in 1979.

  1. Take content from published sources
  2. Create a bibliographic record, write or edit the abstract included with the source document
  3. Index it with no more than three to six index terms
  4. Digitize the result
  5. Charge a commercial information utility to make it available
  6. Get a share of the revenues.

That worked well until the first Web browser showed up and individuals and institutions began making information available online. There are a number of companies that still use variations of this old school business model. Examples range from newspapers that charge a Web browser user for access to content to outfits like LexisNexis, Ebsco, and Cambridge Scientific Abstracts.

As libraries and individuals resist online fees, many of the old school outfits are going to have to come up with new business models. But adaptation will not be easy. Amazon is in the content business. Why buy a Cliff’s Notes-type summary when there are Amazon reviews? Why pay for news when a bit of sleuthing will turn up useful content from outfits like the United Nations or off the radar outfits like World News at www.wn.com? Tech information is going through a bit of an author revolt. While not on the level of protests in Hong Kong, a lot of information that used to be available in research libraries or from old school database providers is available online. At some point, peer reviewed journals and their charge the author business models will have to reinvent themselves. Even recruitment services like LinkedIn offer useful business information via Slideshare.com.

One black hole concerns finding out what books are available online. A former intelligence officer with darned good research skills was not able to locate a copy of my The New Landscape of Search. You can find it here for free.

I read “Location, Location: GPS in the Medieval Library.” The use of coordinates to locate a book on a shelf or hanging from a wall anchored by a chain is not new to those who have fooled around with medieval manuscripts. Remember that I used to index medieval sermons in Latin as early as 1963.

What the write up triggered was a reminder of the complete and utter failure of indexing services to make an attempt to locate, index, and provide a pointer to books regardless of form. The baloney about indexing “all” information is shown to be a toothless dragon. The failure of the Google method and the flaws of the Amazon, Library of Congress, and commercial database providers are evident.

Now back to the group of somewhat plump, red faced, confident wizards of commercial database expertise. The group found my suggestion laughable. No big deal. I try to avoid the Titanic type operations. I collected my check and hit the road.

There are domains of content that warrant better indexing. Books, unfortunately, are one set of content that makes me long for the approach that put knowledge in one place with a system that at least worked and could be supplemented by walking around and looking.

No such luck today.

Stephen E Arnold, December 1, 2014

More Metadata: Not Needed Metadata

November 21, 2014

I find the metadata hoo hah fascinating. Indexing has been around a long time. If you want to dig into the complexities of metadata, you may find the table from InfoLibCorp.com helpful:

image

Mid tier consulting firms often do not use the products or systems their “experts” recommend. Consultants in indexing do create elaborate diagrams that make my eyes glaze over.

Some organizations generate metadata without considering what is required. As a result, outputs from the systems can present mind bogglingly complex options to the user. A report displaying multiple layers of metadata can be difficult to understand.

My thought is that before giving the green light to promiscuous metadata generation, some analysis and planning may be useful. The time lost trying to figure out which metadata is relevant to a particular issue can be critical.

But consultants and vendors are indeed impressed with flashy graphics. Too many times no one has a clue what the graphics are trying to communicate. The worst offenders are companies that sell visual sizzle to senior managers. The goal is a gasp from the audience when the Hollywood style visualizations are presented. Pass the popcorn. Skip the understanding.

Stephen E Arnold, November 21, 2014

Is This How Right to Be Forgotten Works?

November 21, 2014

I read “This Is How Google Handles Right to Be Forgotten Requests.” I must admit that the process strikes me as impressive. To explain hitting delete takes about 1,000 words.

One of Google’s problems, though, is that the process is often one-sided because a decision is based on information supplied by one person using a simple Web form.

The article explains what the GOOG allegedly does. There is one omission. Some content has disappeared and it is difficult to figure out if a form was filed. For information about this, navigate to “Telegraph Stories Affected by EU ‘Right to Be Forgotten’”.

For a “real” journalism outfit, Computerworld is presenting the world as it understands it. Does the description match the Google reality? Good question.

Stephen E Arnold, November 21, 2014

Telegraph Breivik Text Forgotten

November 13, 2014

Short honk: I assume that the Telegraph is not happy with Google’s removal from its index of Breivik content. You can read the allegedly accurate story here. It seems to be a good idea to eliminate information about a convicted mass murderer of young people. As I have said before, it is tough to look for information when the index contains no entry. Ah, progress.

Stephen E Arnold, November 13, 2014

Concept Searching: More Smart Content Rah Rah

September 23, 2014

I read “Concept Searching Taxonomy Workflow Tool solving Migration, Security, and Records Management Challenges.” This is a news release and it can disappear at any time. Don’t hassle me if it is a goner. The write up walks me rapidly into the smart content swamp. The idea is that content without indexing is dumb content. Okay. Lots of folks are pitching the smart versus dumb content idea now.

The fix? Concept Searching provides a smart tool to make content intelligent; that is, include index terms. For the youngster at heart, “indexing” is the old school word for metadata.

The company’s news announcement asserts:

conceptTaxonomyWorkflow serves as a strategic tool, managing enterprise metadata to drive business processes at both the operational and tactical levels. It provides administrators with the ability to independently manage access, information management, information rights management, and records management policy application within their respective business units and functional areas, without the need for IT support or access to enterprise-wide servers. By effectively and accurately applying policy across applications and content repositories, conceptTaxonomyWorkflow enables organizations to significantly improve their compliance and information governance initiatives.

The product name is indeed a unique string in the Google index. The company asserts that the notion of a workflow is strategic. Not only is workflow strategic, it is also tactical. For some, this is a two for one deal that may be hard to resist. The tool allows administrators to perform what appear to be tasks I think of as “editorial policy” or, as the young at heart say, information governance.

The only issue for me is that the organizations with which I am familiar have pretty miserable information governance methods. What I find is that organizations have Balkanized methods for dealing with digital information. Examples of poor information governance fall readily to hand. The US court system removed public documents only to reinstate them. The IRS in the US cannot locate email. And when the IRS finds an archive of the email, the email cannot be searched. And, of course, there is Mr. Snowden. How many documents did he remove from NSA servers?

The notion that the CTW tool makes it possible to “apply policy across applications and content repositories” sounds absolutely fantastic to a person with indexing experience. There is a problem. Many organizations do not understand an editorial policy or are unwilling to do much more than react when something goes off the tracks. The reality is that the appetite for meaningful action is often not present in commercial enterprises or government entities. Budgets remain tight. Reducing information technology budgets is often a more important goal than improving information technology.

What’s this mean?

My hunch is that Concept Searching is offering a product for an organization that [a] has an editorial policy in place or [b] wants to appear to be taking meaningful steps toward useful information governance.

The president of Concept Searching is taking a less pragmatic approach to selling this tool. Martin Garland, according to the company story, states:

Managing metadata and auto-classifying to taxonomies provides high value in applications such as search, text analytics, and business social. But many forward thinking organizations are now looking to leverage their enterprise metadata and use it to improve business processes aligned with compliance and information governance initiatives. To accomplish this successfully, technologies such as conceptTaxonomyWorkflow must be able to qualify metadata and process the content based on enterprise policies. A key benefit of the product is its ease of use and rapid deployment. It removes the lengthy application development cycle and can be used by a large community of business specialists as well as IT.

The key benefit, for me, is that a well conceived and administered information policy eliminates risks of an information misstep. I would suggest that the Snowden matter was a rather serious misstep.

One assumes that companies have information policies, stand behind them, and keep them current. This strikes me as a quite significant assumption.

A similar message is now being pushed by Smartlogic, TEMIS, WAND, and other “indexing” companies.

Are these products delivering essentially similar functionality? Is any system indexing with less than a 10 percent error rate? Are those with responsibility for figuring out what to do with the flood of digital information equipped to enforce organization wide policies? And once installed, will the organization continue to commit resources to support tools that manage indexing? What happens if Microsoft Azure Search and Delve deliver good enough indexing and controls?

These are difficult questions to answer. Based on the pivoting content processing vendors are doing, most companies selling information solutions are trying to find a way to boost revenues in an exhausting effort to maintain stable cash flows.

Does anyone make an information governance tool that keeps track of what information retrieval companies market?

Stephen E Arnold, September 23, 2014

Mondeca: Content IQ

September 23, 2014

I reacted strongly to the IDC report about the knowledge quotient. IDC, as you know, is the home of the fellow who sold my content on Amazon without written permission. I learned that Mondeca is using a variant of “knowledge quotient.” This company’s approach taps the idea of the intelligence quotient of content.

I interpret content with a high IQ in a way that is probably not what Mondeca intended. Smart content is usually content that conveys information that I find useful. Mondeca, like other purveyors of indexing software, uses the IQ to refer to content that is indexed in a meaningful way. Remember that if the users do not use the index terms, assigning these terms to a document does not help a user. Effective indexing helps the user find content. In the good old days of specialist indexing, users had to learn the indexing vocabulary and conventions. Today users just pump 2.7 words into a search box and feel lucky.
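The point about users not using the index terms can be checked with a rough measurement: what share of real queries contain at least one word from the controlled vocabulary? The Python below is a minimal sketch under stated assumptions. The function name, the sample queries, and the sample vocabulary are hypothetical, and matching on single lowercase words is a simplification that ignores multi-word index terms and stemming.

```python
def vocabulary_hit_rate(queries, index_terms):
    """Return the fraction of queries that contain at least one index term.

    Hypothetical helper: matching is on single lowercase words only.
    """
    vocab = {term.lower() for term in index_terms}
    if not queries:
        return 0.0
    hits = sum(
        1 for query in queries
        if any(word in vocab for word in query.lower().split())
    )
    return hits / len(queries)


# Made-up example: two index terms, three short user queries.
sample_terms = ["Airfare", "Tourism"]
sample_queries = ["cheap flights", "airfare pricing", "travel deals"]
rate = vocabulary_hit_rate(sample_queries, sample_terms)
```

A low rate suggests the assigned terms are not the words users type, which is the gap the paragraph above describes.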

As with other vendors’ automated indexing systems and software, humans have to get into the mix.

One twist Mondeca brings to the content IQ notion is a process that helps a potential licensee answer the question, “How smart is your content?” For me, poorly indexed content is not smart. The content is simply poorly indexed.

I navigated to the “more information” link on the Content IQ page and learned that answering the question costs 5000 Euros, roughly $6,000.

Like the knowledge quotient play, smart content and allied jargon make an effort to impart a halo of magic around a pretty obvious function. I suppose that in today’s market, clarity is not important. Marketing magic is needed to create a demand for indexing.

I believe professionally administered indexing is important. I was one of the people responsible for creating the ABI/INFORM controlled vocabulary revision and the reindexing of the database in 1981. Our effort involved controlled terms, company name fields, and a purpose built classification system.

Some day automated systems will be able to assign high value index terms without humans. I don’t think that day has arrived. To create smart content, have smart people write it. Then get smart, professional indexers to index it. If a software system can contribute to the effort, I support that effort. I am just not comfortable with the “smart software” trend that is gaining traction.

Stephen E Arnold, September 23, 2014

Luxid: Positioning Adjustments

September 23, 2014

Luxid, based in Paris, offers an automatic indexing service. The company has focused on the publishing sector as well as a number of other verticals. The company uses the phrase “semantic content enrichment” to describe the company’s indexing. The more trendy phrase is “metatagging,” but I prefer the older term.

The company also uses the term “ontology” along with references to semantic jargon like “triples.” The idea is that a licensee can select a module that matches an industry sector. WAND, a competitor, offers a taxonomy library. The idea is that much of the expensive and intellectually demanding work needed to build a controlled vocabulary from scratch is sidestepped.

The positioning that I find interesting is that Luxid delivers “NLP enabled ontology management workflow.” The idea is that once the indexing system is installed, the licensee can maintain the taxonomy using the provided interface. This is another way of saying that administrative tools are included. Another competitor, Smartlogic, uses equally broad and somewhat esoteric terms to describe what are essentially indexing operations.

Like other search and content processing vendors, Luxid invokes the magic of Big Data. Luxid asserts, “Streamlined, Big Data architecture offers improved scalability and robust integration options.” The point where indexing processes often stub their toes is the amount of human effort and machine processing time required to keep an index updated and populate the new content across already compiled indexes. Scalability can be addressed with more resources. More resources often mean increased costs, a challenge for any indexing system that deals with regular content, not just Big Data.

Will the revised positioning generate more inquiries and sales leads? Possibly. I find the wordsmithing content processing vendors use fascinating. The technology, despite the academic jargon, has been around since the days of Data Harmony and other aging methods.

The key point, in my view, is that Luxid offers a story that makes sense. The catnip may be the jargon, the push into publishing which is loath to spend for humans to create indexes, and the packaging of vocabularies into “Skill Cartridges.”

I anticipate that some of Luxid’s competitors will emulate the Luxid terminology. For many years, much of the confusion about which content processing system does what has been traceable to widespread use of jargon.

Stephen E Arnold, September 22, 2014

Governance Information for Office 365

August 18, 2014

Short honk: I am not sure what governance means. But search and content processing vendors bandy about words better than Serena Williams hits tennis balls. If you are governance hungry and use Office 365, Concept Searching has an “online information source” for you. More information is at http://bit.ly/1teO5fA. My hunch is that you will learn to license some smart software to index documents. What happens when there is no Internet connection? Oh, no big deal.

Stephen E Arnold, August 18, 2014

Attensity Leverages Biz360 Invention

August 4, 2014

In 2010, Attensity purchased Biz360. The Beyond Search comment on this deal is at http://bit.ly/1p4were. One of the goslings reminded me that I had not instructed a writer to tackle Attensity’s July 2014 announcement “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology.” PR-type “stories” can disappear, but for now you can find a description of “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology” at http://reut.rs/1qU8Sre.

My researcher showed me a hard copy of 8,645,395, and I scanned the abstract and claims. The abstract, like many search and content processing inventions, seemed somewhat similar to other text parsing systems and methods. The invention was filed in April 2008, two years before Attensity purchased Biz360, a social media monitoring company. Attensity, as you may know, is a text analysis company founded by Dr. David Bean. Dr. Bean employed various “deep” analytic processes to figure out the meaning of words, phrases, and documents. My limited understanding of Attensity’s methods suggested to me that Attensity’s Bean-centric technology could process text to achieve a similar result. I had a phone call from AT&T regarding the utility of certain Attensity outputs. I assume that the Bean methods required some reinforcement to keep pace with customers’ expectations about Attensity’s Bean-centric system. Neither the goslings nor I are patent attorneys. So after you download 395, seek out a patent attorney and get him/her to explain its mysteries to you.

The abstract states:

A system for evaluating a review having unstructured text comprises a segment splitter for separating at least a portion of the unstructured text into one or more segments, each segment comprising one or more words; a segment parser coupled to the segment splitter for assigning one or more lexical categories to one or more of the one or more words of each segment; an information extractor coupled to the segment parser for identifying a feature word and an opinion word contained in the one or more segments; and a sentiment rating engine coupled to the information extractor for calculating an opinion score based upon an opinion grouping, the opinion grouping including at least the feature word and the opinion word identified by the information extractor.

This invention tackles the Mean Joe Greene of content processing from the point of view of a quite specific type of content: a review. Amazon has quite a few reviews, but the notion of a “shaped” review is a thorny one. (See, for example, http://bit.ly/1pz1q0V.) The invention’s approach identifies words with different roles; some words are “opinion words” and others are “feature words.” By hooking a “sentiment engine” to this indexing operation, the Biz360 invention can generate an “opinion score.” The system uses item, language, training model, feature, opinion, and rating modifier databases. These, I assume, are either maintained by subject matter experts (expensive), smart software working automatically (often evidencing “drift” so results may not be on point), or a hybrid approach (humans cost money).
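The pipeline in the abstract (segment splitter, segment parser, information extractor, sentiment rating engine) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the patented method. The word lists, the modifier weights, and every function name below are hypothetical stand-ins for the feature, opinion, and rating modifier databases the patent mentions.

```python
import re

# Hypothetical toy lexicons standing in for the patent's databases.
FEATURE_WORDS = {"battery", "screen", "camera"}
OPINION_WORDS = {"great": 1.0, "good": 0.5, "poor": -0.5, "terrible": -1.0}
MODIFIERS = {"very": 1.5, "not": -1.0}


def split_segments(text):
    """Segment splitter: break a review into clause-like lists of words."""
    parts = re.split(r"[.!?;,]|\bbut\b|\band\b", text.lower())
    return [part.split() for part in parts if part.strip()]


def parse_segment(words):
    """Segment parser: tag each word with a crude lexical category."""
    tagged = []
    for word in words:
        if word in FEATURE_WORDS:
            tagged.append((word, "FEATURE"))
        elif word in OPINION_WORDS:
            tagged.append((word, "OPINION"))
        elif word in MODIFIERS:
            tagged.append((word, "MODIFIER"))
        else:
            tagged.append((word, "OTHER"))
    return tagged


def extract_grouping(tagged):
    """Information extractor: pair the first feature word with the first
    opinion word in a segment, plus any modifiers. Returns None if the
    segment lacks either a feature word or an opinion word."""
    features = [w for w, tag in tagged if tag == "FEATURE"]
    opinions = [w for w, tag in tagged if tag == "OPINION"]
    modifiers = [w for w, tag in tagged if tag == "MODIFIER"]
    if features and opinions:
        return features[0], opinions[0], modifiers
    return None


def rate_review(text):
    """Sentiment rating engine: one signed opinion score per feature."""
    scores = {}
    for words in split_segments(text):
        grouping = extract_grouping(parse_segment(words))
        if grouping is None:
            continue
        feature, opinion, modifiers = grouping
        score = OPINION_WORDS[opinion]
        for modifier in modifiers:
            score *= MODIFIERS[modifier]
        scores[feature] = score
    return scores


scores = rate_review("The battery is very good, but the screen is not great.")
```

Even this toy version shows where the cost sits: the output is only as good as the word lists, and someone or something has to keep those lists current.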

image

The Attensity/Biz360 system relies on a number of knowledge bases. How are these updated? What is the latency between identifying new content and updating the knowledge bases to make the new content available to the user or a software process generating an alert or another type of report?

The 20 claims embrace the components working as a well oiled content analyzer. The claim I noted is that the system’s opinion score uses a positive and negative range. I worked on a sentiment system that made use of a stop light metaphor: red for negative sentiment and green for positive sentiment. When our system could not figure out whether the text was positive or negative we used a yellow light.

image

The approach used for a US government project a decade ago used a very simple metaphor to communicate a situation without scores, values, and scales. Image source: http://bit.ly/1tNvkT8
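The stop light idea reduces to a small mapping from a signed score to one of three colors. The Python below is a minimal sketch and not the code from the government project. The function name and the 0.2 cutoff for "cannot tell" are assumptions chosen only for illustration.

```python
def stop_light(score, threshold=0.2):
    """Map a signed sentiment score to a stop light color.

    Hypothetical rule: scores near zero, or a missing score, mean the
    system could not decide, so the light is yellow.
    """
    if score is None or abs(score) < threshold:
        return "yellow"
    return "green" if score > 0 else "red"


lights = [stop_light(s) for s in (0.75, -1.0, 0.1, None)]
```

The user sees a color rather than a number, which is the whole appeal of the metaphor.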

Attensity said, according to the news story cited above:

By splitting the unstructured text into one or more segments, lexical categories can be created and a sentiment-rating engine coupled to the information can now evaluate the opinions for products, services and entities.

Okay, but I think that the splitting of text into segments was a function of iPhrase and search vendors converting unstructured text into XML and then indexing the outputs.

Jonathan Schwartz, General Counsel at Attensity, is quoted in the news story as asserting:

“The issuance of this patent further validates the years of research and affirms our innovative leadership. We expect additional patent issuances, which will further strengthen our broad IP portfolio.”

Okay, this sounds good but the invention took place prior to Attensity’s owning Biz360. Attensity, therefore, purchased the invention of folks who did not work at Attensity in the period prior to the filing in 2008. I understand that companies buy other companies to get technology and people. I find it interesting that work done outside Attensity “validates” Attensity’s research and “affirms” Attensity’s “innovative leadership.”

I would word what the patent delivers and Attensity’s contributions differently. I am no legal eagle or sentiment expert. I do like less marketing razzle dazzle, but I am in the minority on this point.

Net net: Attensity is an interesting company. Will it be able to deliver products that make the licensees’ sentiment score move in a direction that leads to sustaining revenue and generous profits? With the $90 million in funding the company received in 2014, the 14-year-old company will have some work to do to deliver a healthy return to its stakeholders. Expert System, Lexalytics, and others are racing down the same quarter mile drag strip. Which firm will be the winner? Which will blow an engine?

Stephen E Arnold, August 4, 2014

Training Your Smart Search System

August 2, 2014

With the increasing chatter about smart software, I want to call to your attention this article, “Improving the Way Neural Networks Learn.” Keep in mind that some probabilistic search systems have to be trained on content that closely resembles the content the system will index. The training is important, and training can be time consuming. The licensee has to create a training set of data that is similar to what the software will index. Then the training process is run, and a human checks the system outputs and makes “adjustments.” If the training set is not representative, the indexing will be off. If the human makes corrections that are wacky, then the indexing will be off. When the system is turned loose, the resulting index may return outputs that are not what the user expected or the outputs are incorrect. Whether the system’s users know enough to recognize incorrect results varies from human to human.

If you want to have a chat with your vendor regarding the time required to train or re-train a search system relying on sample content, print out this article. If the explanation does not make much sense to you, you can document off-target query result sets, complain to the search system vendor, or initiate a quick fix. Note that quick fixes involve firing humans believed to be responsible for the system, initiating a new search procurement, or pretending that the results are just fine. I suppose there are other options, but I have encountered these three approaches seasoned with either legal action or verbal grousing to the vendor. Even when the automated indexing is tuned within an inch of its life, accuracy is likely to start out in the 85 to 90 percent range and then degrade.

Training can be a big deal. Ignoring the “drift” that occurs when the smart software has been taught or has learned something that distorts the relevance of results can produce some sharp edges.
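One way to document accuracy and drift is to compare the system’s index terms with a human-checked sample at regular intervals. The Python below is a minimal sketch under stated assumptions. The function names, the exact-match scoring rule, and the sample data are hypothetical, and the 0.85 floor is borrowed from the low end of the 85 to 90 percent range mentioned above.

```python
def indexing_accuracy(system_terms, human_terms):
    """Share of sampled documents whose system-assigned terms exactly
    match the human-checked terms (order ignored)."""
    if not human_terms:
        raise ValueError("the human-checked sample is empty")
    correct = sum(
        1 for doc_id, expected in human_terms.items()
        if set(system_terms.get(doc_id, ())) == set(expected)
    )
    return correct / len(human_terms)


def batches_below_floor(accuracy_history, floor=0.85):
    """Return the positions of the sample batches that fell below the floor."""
    return [i for i, acc in enumerate(accuracy_history) if acc < floor]


# Made-up sample: four documents checked by a human indexer.
human = {
    "d1": ["search"],
    "d2": ["indexing", "metadata"],
    "d3": ["taxonomy"],
    "d4": ["governance"],
}
system = {
    "d1": ["search"],
    "d2": ["metadata", "indexing"],
    "d3": ["ontology"],
    "d4": ["governance"],
}
accuracy = indexing_accuracy(system, human)
flagged = batches_below_floor([0.90, 0.88, 0.84, 0.80])
```

A record like this gives the licensee something concrete to put in front of the vendor when the results start to slide.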

Stephen E Arnold, August 2, 2014
