Surprising Sponsored Search Report and Content Marketing

July 28, 2014

Content marketing hath embraced the mid tier consulting firms. IDC, an outfit that used my information without my permission from 2012 until July 2014, has published a study about “knowledge.” I was not able to view the entire report, but the executive summary was available for download at http://bit.ly/1l10sGH. (Verified at 11 am, July 25, 2014) If you have some extra money, you may want to pay an IDC scale fee to learn about “the knowledge quotient.”

I am looking forward to the full IDC report, which promises to be as amusing as a recent Gartner report about search. The idea of rigorous, original research and an endorsement from a company like McKinsey or Boston Consulting Group is a Holy Grail of marketing. McKinsey and BCG (what I call blue chip firms), while not perfect, produce client smiles for most of their engagements.

Consulting, however, does not have an American Bar Association or other certification process to “certify” a professional’s capabilities. In fact, at Booz, Allen I learned that Halliburton NUS, a nuclear consulting and services shop, was in the eyes of Booz, Allen a “grade C.” Booz, Allen, like Bain and SRI, were grade A firms. I figured if I were hired at Booz, Allen I could pick up some A-level attributes. Consultants not trained by one of the blue chip firms had to work harder, smarter, and more effectively. Slack off, and a consulting firm lower on the totem pole was unlikely to claw its way to the top. When a consulting firm has been a grade C for decades, it is highly unlikely that the blue chip outfits will worry too much about these competitors.

Who funded this particular IDC report, 249643ES? The fact that I was able to download the report from one of the companies listed as a “sponsor” suggests that Smartlogic and nine other companies were underwriting the rigorous research. You can download the report (verified at 2:30 pm, July 25, 2014) at this link. Hasten to do it, please.

In the consulting arena, multi-client studies come in different flavors or variants. At Booz, Allen & Hamilton, the 1976 Study of World Economic Change was paid for by a number of large banks. We did not write about these banks. We delivered previously uncollected information in a Booz, Allen package. The boss was William Simon, former secretary of the US Treasury. He brought a certain mindset and credibility to our project.

The authors of the IDC report are Dave Schubmehl and Dan Vesset. Frankly I don’t know enough about these “experts” to compare them to William Simon. My hunch is that Mr. Simon’s credentials might have had a bit more credibility. We supplemented the Booz, Allen team with specialists from Claremont College, where Peter Drucker was grooming some quite bright business analysts. In short, the high caliber Booz, Allen professionals, the Claremont College whiz kids, and William Simon combined to generate a report with a substantive information payload.

Based on my review of the Executive Summary of “The Knowledge Quotient,” direct comparisons with the Booz, Allen report or even reports from some of the mid tier firms’ analyses in my files are difficult to make. I can, however, highlight a handful of issues that warrant further consideration. Let’s look at three areas where the information highway may be melting in the summer heat.

1. A Focus on Knowledge and the Notion of a Quotient

I do a for fee column for Knowledge Management Magazine. I want to be candid. I am not sure that I have a solid understanding of what the heck “knowledge” is. I know that a quotient is the result obtained by dividing one number by another number.  I am not able to accept that an intangible like “knowledge” can be converted to a numeric output. Lard on some other abstractions like “value” and the entire premise of the report is difficult to take seriously.


Well, quite a few companies did take the idea seriously, and we need to look at the IDC material to get a feel for what a survey of 2,155 organizations and in depth interviews with 11 organizations “discovered.” The fact that there are 11 sponsors and 11 in depth interviews suggests that the sample is not an objective one as far as the interviews are concerned. But I may be wrong. Is that a signal that this IDC report is a marketing exercise dressed up as an objective report?

2. The Old Chestnut Makes an Appearance

A second clue is the inclusion of a matrix that reminded me of an unimaginative variation on the Boston Consulting Group’s 1970 tool. The BCG approach used market share or similar “hard” data about products and business units. A version of the BCG quadrant appears below:

[image: a version of the BCG quadrant]

IDC’s “experts” may be able to apply numbers to nebulous concepts. I would not want to try to pull off this legerdemain. The Schubmehl and Vesset version for IDC strikes me as somewhat spongy; for example, how does one create a quotient for knowledge when parameterizing “socialization” or “culture”? Is the association with New Age and pop culture intentional?

3. The Sponsors: An Eclectic Group United by Sponsoring IDC?

The third tip off to the focus of the report is the sponsors themselves. The 11 companies are an eclectic group, including a giant computer services firm (IBM), a handful of small companies with little or no corporate profile, and an indexing company that delivers training, services, and advice.

4. A Glimpse of the Takeaways

Fourth, the Executive Summary highlights what appear to be important takeaways from the year long research effort. For example, KQ leaders have their expectations exceeded presumably because these KQ savvy outfits have licensed one or more of the study sponsors’ products. The Executive Summary references a number of case studies. As you may know, positive case studies about search and content processing are not readily available. IDC promises a clutch of cases.

And IDC on pages iv and v of the Executive Summary uses a bullet list and some jargon to give a glimpse of high KQ outfits’ best practices. The idea is that if content is indexed and searchable, there are some benefits to the companies.

After 50 years, I assume IDC has this type of work nailed. I would point out that IDC used my information in its for fee reports from August 2012 until July 2014. My attorney was successful in getting IDC to stop connecting my name and that of my researchers with one of IDC’s top billing analysts. I find surfing on my content and name untoward. But again there are substantive differences between blue chip consulting firms and those lower on the for fee services totem pole.

I wonder if the full report will contain positive profiles of the sponsoring organizations. Be prepared to pay a lot for this “knowledge quotient” report. On the other hand, some of the sponsors may provide you with a copy if you have a gnawing curiosity about the buzzwords and jargon the report embraces; for example, analytics.

Some potential readers will have to write a big check. For example, to get one of the IDC reports with my name on it from 2012 to July 2014, the per report price was $3,500. I would not be surprised if the sticker for this KQ report is even higher. Based on the Executive Summary, KQ looks like a content marketing play. The “inclusions” are the profiles of the sponsors.

I will scout around for the Full Monty, and I hope it is fully clothed and buttoned up. Does IDC have a William Simon to ride herd on its “experts”? From my experience, IDC’s rigorousness is quite different. For example, IDC’s Dave Schubmehl used my information and attached himself to my name. Is this the behavior of a blue chip?

Stephen E Arnold, July 28, 2014

Pre Oracle InQuira: A Leader in Knowledge Assessment?

July 28, 2014

Oracle purchased InQuira in 2011. One of the writers for Beyond Search reminded me that Beyond Search covered the InQuira knowledge assessment marketing ploy in 2009. You can find that original article at http://bit.ly/WYYvF7.

InQuira’s technology is an option in the Oracle RightNow customer support system. RightNow was purchased by Oracle in 2011. For those who are the baseball card collectors of enterprise search, you know that RightNow purchased Q-Go technology to make its customer support system more intuitive, intelligent, and easier to use. (Information about Q-Go is at http://bit.ly/1nvyW8G.)

InQuira’s technology is not cut from a single chunk of Styrofoam. InQuira was formed in 2002 by fusing the Answerfriend, Inc. and Electric Knowledge, Inc. systems. InQuira was positioned as a question answering system. For years, Yahoo relied on InQuira to deliver answers to Yahooligans seeking help with Yahoo’s services. InQuira also provided the plumbing to www.honda.com. InQuira hopped on the natural language processing bandwagon and beat the drum until it layered on “knowledge” as a core functionality. The InQuira technology was packaged as a “semantic processing engine.”

InQuira used its somewhat ponderous technology along with AskJeeves’ type short cuts to improve the performance of its system. The company narrowed its focus from “boil the ocean search” to a niche focus. InQuira wanted to become the go to system for help desk applications.

InQuira’s approach involved vocabularies. These were similar to the “knowledge bases” included with some versions of Convera. InQuira, according to my files, used the phrase “loop of incompetence.” I think the idea was that traditional search systems did not allow a customer support professional to provide an answer that would make customers happy the majority of the time. InQuira before Oracle emphasized that its system would provide answers, not a list of Google style hits.

The InQuira system can be set up to display a page of answers in the form of sentences snipped from relevant documents. The idea is that the InQuira system eliminates the need for a user to review a laundry list of links.
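
To make the snippet style answer display concrete, here is a minimal sketch of the general idea: score each sentence by its term overlap with the query and surface the best matches. The function and sample data are invented for illustration; this is not InQuira’s actual method, which layers on linguistic analysis and curated vocabularies.

    import re

    def best_sentences(query, documents, top_n=3):
        """Return the sentences sharing the most terms with the query.

        A crude stand-in for answer extraction. Real systems add
        linguistic analysis, vocabularies, and ranking models.
        """
        query_terms = set(re.findall(r"\w+", query.lower()))
        scored = []
        for doc in documents:
            for sentence in re.split(r"(?<=[.!?])\s+", doc):
                overlap = len(query_terms & set(re.findall(r"\w+", sentence.lower())))
                if overlap:
                    scored.append((overlap, sentence))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [sentence for _, sentence in scored[:top_n]]

    docs = ["Reset the router by holding the button for ten seconds. The light blinks twice.",
            "Billing questions are handled by the accounts team."]
    print(best_sentences("how do I reset the router", docs))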

The word lists and knowledge bases require maintenance. Some tasks can be turned over to scripts, but other tasks require the ministrations of a human who is a subject matter expert or a trained indexer. The InQuira concept knowledge bases also require care and feeding to deliver on point results. I would point out that this type of knowledge care is more expensive than a nursing home for a 90 year old parent. A failure to maintain the knowledge bases usually results in indexing drift and frustrated users. In short, the systems are perceived as not working “like Google.”

Why is this nitty gritty important? InQuira shifted from fancy buzzwords as the sharp end of its marketing spear to the more fuzzy notion of knowledge. The company, beginning in late 2008, put knowledge first and the complex, somewhat baffling technology second. To generate sales leads, InQuira’s marketers hit on the idea of a “knowledge assessment.”

The outcome of the knowledge marketing effort was the sale of the company to Oracle in mid 2011. At the time of the sale, InQuira had an adaptor for Oracle Siebel. Oracle appears to have had a grand plan to acquire key customer support search and retrieval functionality. Armed with technology that was arguably better than the ageing Oracle SES system, Oracle could create a slam dunk solution for customer support applications.

Since the acquisition, many search vendors have realized that some companies were not ready to write a Moby Dick sized check for customer support search. Search vendors adopted the lingo of InQuira and set out to make sales to organizations eager to reduce the cost of customer support and avoid the hefty license fees some vendors levied.

What I find important about InQuira is this:

  1. It is one of the first search engines to be created by fusing two companies that were individually not able to generate sustainable revenue.
  2. InQuira’s tactic to focus on customer support and then add other niche markets brought more discipline to the company’s message than the “one size fits all” that was popular with Autonomy and Fast Search.
  3. InQuira figured out that search was not a magnetic concept. The company was one of the first to explain its technology, benefits, and approach in terms of a nebulous concept; that is, knowledge. Who knows what knowledge is, but it does seem important, right?
  4. The outcome of InQuira’s efforts made it possible for stakeholders to sell the company to Oracle. Presumably this exit was a “success” for those who divided up Oracle’s money.

Net net: Shifting search and content processing to knowledge is a marketing tactic. Will it work in 2014 when search means Google? Some search vendors who have sold their soul to venture capitalists in exchange for millions of jump start dollars hope so.

My thought is that knowledge won’t sell information retrieval. Once a company installs a search system, users can find what they need or not. Fuzzy does not cut it when users refuse to use a system, scream for a Google Search Appliance, or create a work around for a doggy system.

Stephen E Arnold, July 28, 2014

Sponsors of Two Content Marketing Plays

July 27, 2014

I saw some general information about allegedly objective analyses of companies in the search and content processing sector.

The first report comes from the Gartner Group. The company has released its “magic quadrant” which maps companies by various allegedly objective methods into leaders, challengers, niche players, and visionaries.

The most recent analysis includes these companies:

Attivio
BA Insight
Coveo
Dassault Exalead
Exorbyte
Expert System
Google
HP Autonomy IDOL
IBM
IHS
Lucid Works
MarkLogic
Mindbreeze
Perceptive ISYS Search
PolySpot
Recommind
Sinequa

There are several companies in the Gartner pool whose inclusion surprises me. For example, Exorbyte is primarily an eCommerce company with a very low profile in the US compared to Endeca or New Zealand based SLI Systems. Expert System is a company based in Italy. This company provides semantic software which I associated with mobile applications. IHS (Information Handling Services) provides technical information and a structured search system. MarkLogic is a company with XML data management software that has landed customers in publishing and the US government. With an equally low profile is Mindbreeze, a home brew search system funded by Microsoft-centric Fabasoft. Dassault Exalead, PolySpot, and Sinequa are French companies offering what I call “information infrastructure.” Search is available, but the approach is digital information plumbing.

The IDC report, also allegedly objective, is sponsored by nine companies. These outfits are:

Attivio
Coveo
Earley & Associates
HP Autonomy IDOL
IBM
IHS
Lexalytics
Sinequa
Smartlogic

This collection of companies is also eclectic. For example, Earley & Associates does indexing training and consulting and does not have a deep suite of enterprise software. IHS (Information Handling Services) appears in the IDC report as a knowledge centric company. I think I understand the concept. Technical information in Extensible Markup Language and a mainframe-style search system allow an engineer to locate a specification or some other technical item like the SU 25. Lexalytics is a sentiment analysis company. I do not consider figuring out if a customer email is happy or sad the same as Coveo’s customer support search system. Smartlogic is interesting because the company provides tools that permit unstructured content to be indexed. Some French vendors call this process “fertilization.” I suppose that for purists, indexing might be just as good a word.

What unifies these two lists are the companies that appear in both allegedly objective studies:

Attivio
Coveo
HP
IBM
IHS (Information Handling Services)
Sinequa

My hunch is that the six companies appearing in both lists are in full bore, pedal to the metal marketing mode.

Attivio and Coveo have ingested tens of millions in venture funding. At some point, investors want a return on their money. The positioning of these two companies’ technologies as search and the somewhat unclear knowledge quotient capability suggest that implicit endorsement by mid tier consulting firms will produce sales.

The appearance of HP and IBM on each list is not much of a surprise. The fact that Oracle Endeca is not in either report suggests that Oracle has other marketing fish to fry. Also, the fact that Elasticsearch, arguably the game changer in search and content processing, is in neither pool may be evidence that Elasticsearch is too busy to pursue “expert” analysts laboring in the search vineyard. On the other hand, Elasticsearch may have its hands full dealing with demands of developers, prospects, and customers.

IHS has not had a high profile in either search or content processing. The fact that Information Handling Services appears signals that the company wants to market its mainframe style and XML capable system to a broader market. Sinequa appears comfortable with putting forth its infrastructure system as both search and a knowledge engine.

I have not seen the full reports from either mid tier consulting firm. My initial impression of the companies referenced in the promotional material for these recent studies is that lead generation is the hoped for outcome of inclusion.

Other observations I noted include:

  1. The need to generate leads and make sales is putting multi-company reports back on the marketing agenda. The revenue from these reports will be welcomed at IDC and Gartner, I expect. The vendors who are on the hook for millions in venture funding are hopeful that inclusion in these reports will shake the money trees from Boston to Paris.
  2. The language used to differentiate and describe the companies referenced in these two studies is unlikely to clarify the differences between similar companies or make clear the similarities. From my point of view, there are few similarities among the companies referenced in the marketing collateral for the IDC and Gartner studies.
  3. The message of the two reports appears to be “these companies are important.” My thought is that because IDC and Gartner assume their brand conveys a halo of excellence, the companies in these reports are, therefore, excellent in some way.

Net net: Enterprise search and content processing has a hurdle to get over: Search means Google. The companies in these reports have to explain why Google is not the de facto choice for enterprise search and then explain how a particular vendor’s search system is better, faster, cheaper, etc.

For me, a marketer or search “expert” can easily stretch search to various buzzwords. For some executives, customer support is not search. Customer support uses search. Sentiment analysis is not search. Sentiment analysis is a signal for marketers or call center managers. Semantics for mobile phones, indexing for SharePoint content, and search for a technical data sheet are quite different from eCommerce, business intelligence, and business process engineering.

A fruit cake is a specific type of cake. Each search and content processing system is distinct and, in my opinion, not easily fused into the calorie rich confection. A collection of systems is a lumber room stuffed with different objects that don’t have another place in a household.

The reports seem to make clear that no one in the mid tier consulting firms or the search companies knows exactly how to position, explain, and verify that content processing is the next big thing. Is it?

Maybe a Google Search Appliance is the safe choice? IBM Watson does recipes, and HP Autonomy connotes high profile corporate disputes.

Elasticsearch, anyone?

Stephen E Arnold, July 27, 2014

Search and Data-Starved Case Studies

July 19, 2014

LinkedIn discussions fielded a question about positive search and content processing case studies. I posted a link to a recent paper from Italy (you can find the url at this link).

My Overflight system spit out another case study. The publisher is Hewlett Packard and the example involves Autonomy. The problem concerns the UK’s National Health Service and its paperless future. You can download the four page document at http://bit.ly/1wIsifS.

The Italian case study focuses on cheerleading for the Google Search Appliance. The HP case study promotes the Autonomy IDOL system applied to medical records.

The HP Autonomy document caught my attention because it uses a buzzword I first heard at Booz, Allen & Hamilton in 1978. Harvey Poppel, then a BAH partner, coined the phrase “paperless office.” The idea caught on. Mr. Poppel, who built a piano, snagged some ink in Business Week. That was a big deal in the late 1970s. Years later I met Alan Siegel, a partner at a New York design firm. He was working on promotion of the Federal government’s paperless initiative. About 10 years ago, I spent some time with Forrest (Woody) Horton, who was a prominent authority on the paperless office. Across the decades, talk about paperless offices generated considerable interest. These interactions about paperless environments have spanned 36 years. Paper seems to be prevalent wherever I go.

When I read the HP Autonomy case study, I thought about the efforts of some quite bright individuals directed at eliminating hard copy documents. There are reports, studies, and analyses about the problems of finding information in paper. I expected a reference to hard data, or at least some hard data in the document itself. The context for the paperless argument would have captured my attention.

The HP Autonomy case study talks about an integrator’s engineers using IDOL to build a solution. The product is called Evolve and:

It used 28 years of information management expertise to improve efficiency, productivity and regulatory compliance. The IDOL analytics engine was co-opted into Evolve because it automatically ingests and segments medical records and documents according to their content and concepts, making it easier to find and analyze specific information.

The wrap up of the case study is a quote that is positive about the Kainos Evolve system. No big surprise.

After reading the white paper, three thoughts crossed my mind.

First, the LinkedIn member seeking positive search and content processing case studies might not find the IDOL case study particularly useful. The information reads more like an essay from an ad agency’s in-house magazine.

Second, the LinkedIn person wondered why there were so few positive case studies about successful search and content processing installations. I think there are quite a few white papers, case studies, and sponsored content marketing articles crafted along the lines of the HP Autonomy case study. The desire to give the impression that the product encounters no potholes scrubs out the details so useful to a potential licensee.

Third, the case study describes a mandated implementation. So the Evolve product is in marketing low gear. The enthusiasm for implementing a new product shines brightly. Does the glare from the polish obscure a closer look?

At a minimum, I would have found the following information helpful even if presented in bullet points or tabular form:

  1. What was the implementation time? What days, weeks, or months of professional work were required to get the system up and running?
  2. What was the project’s initial budget? Was the project completed within the budget parameters?
  3. What is the computing infrastructure required for the installation? Was the infrastructure on premises, cloud, or hybrid?
  4. What is the latency in indexing and query processing?
  5. What connectors were used “as is”? Were new connectors required? If yes, how long did it take to craft a functioning connector?
  6. What training did users of the system require?

Information at this level of detail is difficult to obtain. In my experience, most search and content processing systems require considerable attention to detail. Take a short cut, and the likelihood of an issue rises sharply.

Obviously neither the vendor nor the licensee wants information about schedule shifts, cost over- or under-runs, and triage expenses to become widely known. The consequence of this jointly enforced fact void helps create case studies that are little more than MBA jargon.

Little wonder the LinkedIn member’s plea went mostly ignored. Paper is unlikely to disappear because lawyers thrive on hard copies. When litigation ensues, the paperless office and the paperless medical practice becomes a challenge.

Stephen E Arnold, July 19, 2014

What Most Search Vendors Cannot Pull Off

July 19, 2014

I recently submitted an Information Today column that reported on Antidot’s tactical play to enter the US market. One of the fact checkers for the write up alerted me that most of the companies I identified were unknown to US readers. Test yourself. How many of these firms do you recognize? How many of them provide information retrieval services?

  • A2ia
  • Albert (originally AMI Albert and AMI does not mean friend)
  • Dassault Exalead
  • Datops
  • EZ2Find
  • Kartoo
  • Lingway
  • LUT Technologies
  • Pertimm
  • Polyspot
  • Quaero
  • Questel
  • Sinequa

How did you do? The point is that French vendors of information retrieval and content processing technology find themselves in a crowded boat. Most of the enterprise search vendors have flamed out or resigned themselves to pitching to venture capitalists that their technology is the Next Big Thing. A lucky few sell out and cash in; for example, Datops. Others are ignored or forgotten.

The same situation exists for vendors of search technology in other countries. Search is a tough business, even for former Googlers: Marissa Mayer was the boss when Yahoo’s share of the Web search market sagged below 10 percent. In the same time period, Microsoft increased Bing’s share to about 14 percent. Google dogpaddled and held steady. Other Web search providers make up the balance of the market players. Business Insider reported:

This is a big problem for Yahoo since its search business is lucrative. While Yahoo’s display ad business fell 7% last quarter, revenue from search was up 6% on a year-over-year basis. Revenue from search was $428 million compared to $436 million from its display ad business.

Now enterprise search vendors have been trying to use verbal magic to unlock consistently growing revenue. So far only two vendors have been able to find a way to open the revenue vault’s lock. Autonomy tallied more than $800 million in revenue at the time of its sale to Hewlett Packard. The outcome of that deal was a multi-billion dollar write off and many legal accusations. One thing is clear through the murky rhetoric the deal produced. Hewlett Packard had zero understanding of search and has been looking for a scapegoat to slaughter for its corporate decision. This is not helping the search vendors chasing deals.

Google converted Web search into a $60 billion revenue stream. The core idea for online advertising originated with the pay-to-play company GoTo, which morphed into Overture, which was THEN acquired by Yahoo. Think of the irony. Yahoo had the technology that makes Google a one trick, but very lucrative revenue pony. But, to be fair, Google Web search is not the enterprise search needed to locate a factoid for a marketing assistant. Feed this query “show me the versions of the marketing VP’s last product road map” to a Google appliance and check the results. The human has to do some old fashioned human-type work. To find this information with a Google Search Appliance or any other information retrieval engine for that matter is tricky. Basic indexing cannot do the job, so most marketing assistants hunt manually through files, folders, and hard copies looking for the Easter egg.

Many of the pioneering search engines tried explaining their products and services using euphemisms. There was question answering, content intelligence, smart content, predictive retrieval, entity extraction, and dozens and dozens of phrases that sound fine but are very difficult to define; for example, knowledge management and the phrase “enterprise search” itself or “image recognition” or “predictive analytics”, among others.

I had a hearty chuckle when I read “Don’t Sell a Product, Sell a Whole New Way of Thinking.” Search has been available for at least 50 years. Think RECON, Orbit, Fulcrum Technologies, BASIS, Teratext, and other artifacts of search and retrieval. Smart folks even cooked up the computationally challenged Delphes system, the metasearch system Vivisimo, and the essentially unknown Quertle.

A romp through these firms’ marketing collateral, PowerPoints, and PDFs makes clear that no buzzword has been left untried. Buyers did not and do not know what the systems actually delivered. This is evidence that search vendors have not been able to “sell a whole new way of thinking.”

No kidding. The synonyms search marketers have used in order to generate interest and hopefully a sale are a catalog of information technology jargon. Here is a short list of some of the terms from the 1990s:

  • Business intelligence
  • Competitive intelligence
  • Content governance
  • Content management
  • Customer support then customer relationship management.
  • Knowledge management
  • Neurodynamics
  • Text analytics

If I accept the Harvard analysis, the failing of enterprise search is not financial fiddling and jargon. As you may recall, Microsoft paid $1.2 billion for Fast Search & Transfer. The investigation into allegations of financial fancy dancing was resolved recently with one executive facing a possible jail term and employment restrictions. There are other companies that tried to blend search with content only to find that the combination was not quite like peanut butter and jelly. Do you use Factiva or Ebsco? Did I hear a “what?” Other companies embraced slick visualizations to communicate key information at a glance. Do you remember Grokker? There was semantic search. Do you recollect Siderean Software?

One success story was Oingo, renamed Applied Semantics. Google understood the value of mapping words to ads and purchased the company to further its non search goals of generating ad revenue.

According to the HBR:

To find the shift, ask yourself a few questions. What was the original insight that led to the innovation? Where do you feel people “don’t get it” about your solution? What is the “aha” moment when someone turns from disinterested to enthusiastic?

Those who code up search systems are quite bright. Is this pat formula of shifting thinking the solution to the business challenges these firms face? Consider these vendors:

Attivio. Founded by Fast Search & Transfer alums, the company has ingested more than $35 million in venture funding. The company’s positioning is “an actionable 360 degree view of anything you need.” Okay. Dassault Exalead used the same line several years ago.

Coveo. The company has tapped venture firms for more than $30 million since the firm’s founding in 2004. Coveo uses the phrase “enterprise search” and wraps it in knowledge workers, customer service, engineering, and CRM. The idea is that Coveo delivers solutions tailored to specific business functions and employee roles.

SRCH2. This is a Xoogler founded company that, like Perfect Search before it, emphasizes speed. The pitch is that its alternative is better than open source search solutions.

Lucid Works. Like Vivisimo, Lucid Works has embraced Big Data and the cloud. The only slow downs Lucid has encountered have been turnover in CEOs, marketing, and engineering professionals. The most recent hurdle to trip up Lucid is the interest in ElasticSearch, fat with almost $100 million in venture funding and developers from the open source community.

IBM Watson. Based on open source and home grown technology, IBM’s marketers have showcased Watson on Jeopardy and garnered headlines for the $1 billion investment IBM is making in its “smart” information processing system. The most recent demonstration of Watson was producing a recipe for Bon Appetit readers.

Amazon’s search approach is to provide it as a service to those using Amazon Web Services. Search is, in my mind, just a utility for Amazon. Amazon’s search system on its eCommerce site is not particularly good. Want to NOT out (exclude) books not yet available on the system? Well, good luck with that query.

After I stopped chuckling, I realized that the Harvard article is less concerned with precision and recall than advocating deception, maybe cleverness. No enterprise search vendor has approached Autonomy’s revenues with the sole exception of Google’s licensing of the wildly expensive Google Search Appliance. At the time of its sale to Oracle, Endeca was chugging along at an estimated $150 million in revenue. Oracle paid about $1 billion for Endeca. With that benchmark, name another enterprise search vendor or eCommerce search vendor that has raced past Endeca. For the majority of enterprise search vendors, revenues of $3 to $10 million represent very significant achievements.

An MBA who takes over an enterprise search company may believe that wordsmithing will make sales. Sure, some sales may result, but will the revenue be sustainable? Most enterprise search sales are a knee jerk reaction to problems with the incumbent search system.

Without concrete positive case studies, talking about search is sophistry. There are comparatively few specific return on investment analyses for enterprise search installations. I provided a link to a struggling LinkedIn person about an Italian library’s shift from the 1960s BASIS system to a Google Search Appliance.

Is enterprise search an anomaly in business software? Will the investment firms get their money back from their investments in search and retrieval?

Ask a Harvard MBA steeped in the lore of selling a whole new way of thinking. Ignore 50 years of search history. Success in search is difficult to achieve. Duplicity won’t do the job.

Stephen E Arnold, July 19, 2014

Jepsen-Testing Elasticsearch for Safety and Data Loss

July 18, 2014

The article titled Call Me Maybe: Elasticsearch on Aphyr explores potential issues with Elasticsearch. Jepsen is a section of Aphyr that tests the behaviors of different technology and software under types of network failure. Elasticsearch is built on the solid Java indexing library Apache Lucene. The article begins with an overview of how Elasticsearch scales through sharding and replication.

“The document space is sharded–sliced up–into many disjoint chunks, and each chunk allocated to different nodes. Adding more nodes allows Elasticsearch to store a document space larger than any single node could handle, and offers quasilinear increases in throughput and capacity with additional nodes. For fault-tolerance, each shard is replicated to multiple nodes. If one node fails or becomes unavailable, another can take over…Because index construction is a somewhat expensive process, Elasticsearch provides a faster database backed by a write-ahead log.”
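
For readers who want to see where those knobs live, here is a minimal sketch of setting shard and replica counts at index creation time through the standard Elasticsearch REST API. The host, index name, and counts are placeholders, not values from the Aphyr tests.

    import json
    import requests

    # Create an index whose document space is cut into five primary shards,
    # each shard carrying one replica for fault tolerance. Host and index
    # name are illustrative; adjust for your own cluster.
    settings = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}

    response = requests.put(
        "http://localhost:9200/documents",
        data=json.dumps(settings),
        headers={"Content-Type": "application/json"},
    )
    print(response.json())  # {'acknowledged': True} on success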

Over a series of tests (with results summarized by delightful Barbie and Ken doll memes), the article decides that while version control may be considered a “lost cause,” Elasticsearch handles inserts superbly. For more information on how Elasticsearch behaved through speed bumps, building a nemesis, nontransitive partitions, needless data loss, random and fixed transitive partitions, and more, read the full article. It ends with recommendations for Elasticsearch and for users, and concedes that the post provides far more information on Elasticsearch than anyone would ever desire.

Chelsea Kerwin, July 18, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Will Germany Scrutinize Google Web Search More Closely?

July 14, 2014

Several years ago, I learned a hard-to-believe factoid. In Denmark, 99 percent of referrals to a major financial service firm’s Web site came via Google. Figuring prominently was Google.de. My contact mentioned that the same traffic flow characterized the company’s German affiliate; that is, if an organization wanted Web traffic, Google was then the only game in town.

I no longer follow the flips and flops of Euro-centric Google killers like Quaero. I have little or no interest in assorted German search revolutions whether from the likes of the Weitkämper Clustering Engine or the Intrafind open source play or the Transinsight Enterprise Semantic Intelligence system. Although promising at one time, none of these companies offers an information retrieval system that could supplant Google for German language search. Toss in English and the other languages Google supports, and the likelihood of a German Google killer decreases.

I read “Germany Is Looking to Regulate Google and Other Technology Giants.” I found the write up interesting and thought provoking. I spend some time each day contemplating the search and content processing sectors. I don’t pay much attention to the wider world of business and technology.

The article states:

German officials are planning to clip the wings of technology giants such as Google through heavier regulation.

That seems cut and dried. I also noted this statement:

The German government has always been militant in matters of data protection. In 2013, it warned consumers against using Microsoft’s Windows 8 operating system due to perceived security risks, suggesting that it provided a back door for the US National Security Agency (NSA). Of course, this might have had something to do with the fact that German chancellor Angela Merkel was one of the first high-profile victims of NSA surveillance, with some reports saying that the NSA hacked her mobile phone for over a decade.

My view is that search and content processing may be of particular interest. After all, who wants to sit and listen to a person’s telephone calls? I would convert the speech to text and hit the output with one of the many tools available to attach metadata, generate relationship maps, and tug out entities like code words and proper names. Then I would browse the information using an old fashioned tabular report. I am not too keen on the 1959 Cadillac tail fin visualizations that 20 somethings find helpful, but to each his or her own I say.

Scrutiny of Google’s indexing might reveal some interesting things to the team assigned to ponder Google from macro and micro levels. The notion of timed crawls, the depth of crawls, the content parsed and converted to a Guha type semantic store, the Alon Halevy dataspace, and other fascinating methods of generating meta-information might be of interest to the German investigate-the-US-vendors team.

My hunch is that scrutiny of Google is likely to lead to increased concern about Web indexing in general. That means even the somewhat tame Bing crawler and the other Web indexing systems churning away at “public” sites’ content may be of interest.

When it comes to search and retrieval, ignorance and bliss are bedfellows. Once a person understands the utility of the archives, the caches, and the various “representations” of the spidered and parsed source content, bliss may become FUD (a version of IBM’s fear, uncertainty and doubt method). FUD may create some opportunities for German search and retrieval vendors. Will these outfits be able to respond or will the German systems remain in the province of Ivory Tower thinking?

In the short term, life will be good for the law firms representing some of the non German Web indexing companies. I wonder, “Is the Google Germany intercept matter included in the young attorneys’ legal education in Germany?”

Stephen E Arnold, July 14, 2014

Search, Not Just Sentiment Analysis, Needs Customization

July 11, 2014

One of the most widespread misperceptions in enterprise search and content processing is “install and search.” Anyone who has tried to get a desktop search system like X1 or dtSearch to do what the user wants with his or her files and network shares knows that fiddling is part of the desktop search game. Even a basic system like Sow Soft’s Effective File Search requires configuring the targets to query for every search in multi-drive systems. The work arounds are not for the casual user. Just try making a Google Search Appliance walk, talk, and roll over without the ministrations of an expert like Adhere Solutions. Don’t take my word for it. Get your hands dirty with information processing’s moving parts.

Does it not make sense that a search system destined for serving a Fortune 1000 company requires some additional effort? How much more time and money will an enterprise class information retrieval and content processing system require than a desktop system or a plug-and-play appliance?

How much effort is required for these tasks? There is work to get the access controls working as the ever alert security manager expects. Then there is the work needed to get the system to access, normalize, and process content for the basic index. Then there is work for getting the system to recognize, acquire, index, and allow a user to access the old, new, and changed content. Then one has to figure out what to tell management about rich media, content for which additional connectors are required, the method for locating versions of PowerPoints, Excels, and Word files. Then one has to deal with latencies, flawed indexes, and dependencies among the various subsystems that a search and content processing system includes. There are other tasks as well like interfaces, work flow for alerts, yadda yadda. You get the idea of the almost unending stream of dependent, serial “thens.”

When I read “Why Sentiment Analysis Engines need Customization”, I felt sad for licensees fooled by marketers of search and content processing systems. Yep, sad as in sorrow.

Is it not obvious that enterprise search and content processing is primarily about customization?

Many of the so called experts, advisors, and vendors illustrate these common search blind spots:

ITEM: Consulting firms that sell my information under another person’s name, ensuring that clients are likely to get a wild and wooly view of reality. Example: Check out IDC’s $3,500 version of information based on my team’s work. Here’s the link for those who find that big outfits help themselves to expertise and then identify a person with a fascinating employment and educational history as the AUTHOR.

[image: Amazon listing for the $3,500 IDC report based on my team’s work]

See http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=idc%20attivio

In this example from http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=idc%20attivio, notice that my work is priced at seven times that of a former IDC professional. Presumably Mr. Schubmehl recognized that my value was greater than that of an IDC sole author and priced my work accordingly. Fascinating because I do not have a signed agreement giving IDC, Mr. Schubmehl, or IDC’s parent company the right to sell my work on Amazon.

This screen shot makes it clear that my work is identified as that of a former IDC professional, a fellow from upstate New York, an MLS on my team, and a Ph.D. on my team.

[image: Amazon screen shot identifying the authors]

See http://amzn.to/1ner8mG.

I assume that IDC’s expertise embraces the level of expertise evident in the TechRadar article. Should I trust a company that sells my content without a formal contract? Oh, maybe I should ask this question, “Should you trust a high profile consulting firm that vends another person’s work as its own?” Keep that $3,500 price in mind, please.

ITEM: The TechRadar article is written by a vendor of sentiment analysis software. His employer is Lexalytics / Semantria (once a unit of Infonics). He writes:

High quality NLP engines will let you customize your sentiment analysis settings. “Nasty” is negative by default. If you’re processing slang where “nasty” is considered a positive term, you would access your engine’s sentiment customization function, and assign a positive score to the word. The better NLP engines out there will make this entire process a piece of cake. Without this kind of customization, the machine could very well be useless in your work. When you choose a sentiment analysis engine, make sure it allows for customization. Otherwise, you’ll be stuck with a machine that interprets everything literally, and you’ll never get accurate results.

When a vendor describes “natural language processing” with the phrase “high quality” I laugh. NLP is a work in progress. But the stunning statement in this quoted passage is:

Otherwise, you’ll be stuck with a machine that interprets everything literally, and you’ll never get accurate results.

Amazing, a vendor wrote this sentence. Unless a licensee of a “high quality” NLP system invests in customizing, the system will “never get accurate results.” I quite like that categorical never.
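
Customization itself is not mysterious. A toy version of the idea, a lexicon based scorer with per domain overrides, fits in a few lines. This is a generic sketch of the technique, not Lexalytics / Semantria’s API; the lexicon and scores are invented.

    # Baseline lexicon: "nasty" reads as negative by default (invented scores).
    BASE_LEXICON = {"nasty": -1.0, "great": 1.0, "broken": -1.0}

    def sentiment_score(text, overrides=None):
        """Sum word scores; the overrides dict is where domain tuning happens."""
        lexicon = dict(BASE_LEXICON)
        if overrides:
            lexicon.update(overrides)
        return sum(lexicon.get(word, 0.0) for word in text.lower().split())

    review = "that bass line is nasty"
    print(sentiment_score(review))                  # -1.0: literal reading
    print(sentiment_score(review, {"nasty": 1.0}))  # +1.0: slang praise after tuning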

ITEM: Sentiment analysis is a single, usually complex component of a search or content processing system. A person on the LinkedIn enterprise search group asked the few hundred “experts” in the discussion group for examples of successful enterprise search systems. If you are a member in good standing of LinkedIn, you can view the original query at this link. [If the link won’t work, talk to LinkedIn. I have no idea how to make references to my content on the system work consistently over time.] I pointed out that enterprise search success stories are harder to find than reports of failures. Whether the flop is at the scale of the HP/Autonomy acquisition or a more modest termination like Overstock’s dumping of a big name system, the “customizing” issue is often present. Enterprise search and content processing is usually:

  • A box of puzzle pieces that requires time, expertise, and money to assemble in a way that attracts and satisfies users and the CFO
  • A work in progress that must be made to work so users are happy, and in a manner that does not force another search procurement cycle, the firing of the person responsible for the search and content processing system, or the legal fees related to the invoices submitted by the vendor whose system does not work. (Slow or no payment of license and consulting fees to a search vendor can be fatal to the search firm’s health.)
  • A source of friction among those contending for infrastructure resources. What I am driving at is that a misconfigured search system makes some computing work S-L-O-W. Note: the performance issue must be addressed for appliance-based, cloud, or on premises enterprise search.
  • Money. Don’t forget money, please. Remember the CFO’s birthday. Take her to lunch. Be really nice. The cost overruns that plague enterprise search and content processing deployments and operations will need all the goodwill you can generate.

If sentiment analysis requires customizing and money, take out your pencil and estimate how much it will cost to make NLP and sentiment analysis work. Now do the same calculation for relevancy tuning, index tuning, optimizing indexing and query processing, etc.

The point is that folks who get a basic keyword search and retrieval system to work pile on the features and functions. Vendors whip up some wrapper code that makes it possible to do a demo of customer support search, eCommerce search, voice search, and predictive search. Once the licensee inks the deal, the fun begins. The reason one major Norwegian search vendor crashed and burned is that licensees balked at paying bills for a next generation system that was not what the PowerPoint slides described. Why has IBM embraced open source search? Is one reason to trim the cost of keeping the basic plumbing working reasonably well? Why are search vendors embracing every buzzword that comes along? I think that search as an enterprise function has become a very difficult thing to sell, make work, and turn into an evergreen revenue stream.

The TechRadar article underscores the danger for licensees of over hyped systems. The consultants often surf on the expertise of others. The vendors dance around the costs and complexities of their systems. The buzzwords obfuscate.

What makes this article by the Lexalytics’ professional almost as painful as IDC’s unauthorized sale of my search content is this statement:

You’ll be stuck with a machine that interprets everything literally, and you’ll never get accurate results.

I agree with this statement.

Stephen E Arnold, July 11, 2014

Information Manipulation: Accountability Pipe Dream

July 5, 2014

I read an article with what I think is the original title: “What does the Facebook Experiment Teach us? Growing Anxiety About Data Manipulation.” I noted that the title presented on Techmeme was “We Need to Hold All Companies Accountable, Not Just Facebook, for How They Manipulate People.” In my view, this mismatch of titles is a great illustration of information manipulation. I doubt that the writer of the improved headline is aware of the irony.

The ubiquity of information manipulation is far broader than Facebook twirling the dials of its often breathless users. Navigate to Google and run this query:

cloud word processing

Note anything interesting in the results list displayed for me on my desktop computer:

[image: Google results for the query “cloud word processing”]

The number one ad is for Google. In the first page of results, Google’s cloud word processing system is listed three more times. I did not spot Microsoft Office in the cloud except in item eight: Is Google Docs Making Microsoft Word Redundant?

For most Google search users, the results are objective. No distortion evident.

Here’s what Yandex displays for the same query:

[image: Yandex results for the same query]

No Google word processing and no Microsoft word processing whether in the cloud or elsewhere.

When it comes to searching for information, the notion that a Web indexing outfit is displaying objective results is silly. The Web indexing companies are in the forefront of distorting information and manipulating users.

Flash back to the first year of the Bush administration when Richard Cheney was vice president. I was in a meeting where we considered a request to make sure that the vice president’s office Web site would appear prominently in FirstGov.gov hits. This, gentle reader, is a request that calls for hit boosting. The idea is to write a script or configure the indexing plumbing to make darned sure a specific url or series of documents appears when and where they are required. No problem, of course. We created a stored query for the Fast Search & Transfer search system and delivered what the vice president wanted.

This type of results manipulation is more common than most people accept. Fiddling Web search, like shaping the flow of content on a particular semantic vector, is trivial. Search engine optimization is a fools’ game compared with the tried and true methods of weighting or just buying real estate on a search results page or a Web site from a “real” company.
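
The mechanics are not exotic. The sketch below re-ranks a result list so a favored url floats to the top. It is a generic illustration with made up scores and urls, not the stored query we built in Fast Search & Transfer.

    def boost(results, favored_url, bonus=100.0):
        """Re-rank (score, url) pairs, padding the score of the favored url."""
        adjusted = [(score + bonus if url == favored_url else score, url)
                    for score, url in results]
        return sorted(adjusted, reverse=True)

    hits = [(12.4, "http://example.gov/energy"),
            (9.1, "http://example.gov/vp-office"),
            (7.7, "http://example.gov/forms")]
    print(boost(hits, "http://example.gov/vp-office"))  # vp-office now ranks first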

The notion that disinformation, reformation, and misinformation will be identifiable, rectified, and used to hold companies accountable is not just impossible. The notion itself reveals how little awareness there is of how the actual methods of digital content injection work.

How much of the content on Facebook, Twitter, and other widely used social networks is generated by intelligence professionals, public relations “professionals,” and folks who want to be perceived as intellectual luminaries? Whatever your answer, what data do you have to back up your number? At a recent intelligence conference in Dubai, one specialist estimated that half of the traffic on social networks is shaped or generated by law enforcement and intelligence entities. Do you believe that? Probably not. So good for you.

Amusing, but as someone once told me, “Ignorance is bliss.” So, hello, happy idealists. The job is identifying, interpreting, and filtering. Tough, time consuming work. Most of the experts prefer to follow the path of least resistance and express shock that Facebook would toy with its users. Be outraged. Call for action. Invent an algorithm to detect information manipulation. Let me know how that works out when you look for a restaurant and it is not findable from your mobile device.

Stephen E Arnold, July 5, 2014

AeroText: A New Breakthrough in Entity Extraction

June 30, 2014

I returned from a brief visit to Europe to an email asking about Rocket Software’s breakthrough technology AeroText. I poked around in my archive and found a handful of nuggets about the General Electric Laboratories’ technology that migrated to Martin Marietta, then to Lockheed Martin, and finally in 2008 to the low profile Rocket Software, an IBM partner.

When did the text extraction software emerge? Is Rocket Software AeroText a “new kid on the block”? The short answer is that AeroText is pushing 30, maybe 35 years young.

Digging into My Archive of Search Info

As far as my archive goes, it looks as though the roots of AeroText are anchored in the 1980s. Yep, that works out to an innovation about the same age as the long in the tooth ISYS Search system, now owned by Lexmark. Over the years, the AeroText “product” has evolved, often in response to US government funding opportunities. The precursor to AeroText was an academic exercise at General Electric. Keep in mind that GE makes jet engines, so GE at one time had a keen interest in anything its aerospace customers in the US government thought was a hot tamale.


The AeroText interface circa mid 2000. On the left is the extraction window. On the right is the document window. From “Information Extraction Tools: Deciphering Human Language,” IT Pro, November-December 2004, page 28.

The GE project, according to my notes, appeared as NLToolset, although my files contained references to different descriptions such as Shogun. GE’s team of academics and “real” employees developed a bundle of tools for its aerospace activities and in response to Tipster. (As a side note, in 2001, there were a number of Tipster related documents in the www.firstgov.gov system. But the new www.usa.gov index does not include that information. You will have to do your own searching to unearth these text processing jump start documents.)

The aerospace connection is important because the Department of Defense in the 1980s was trying to standardize on markup for documents. Part of this effort was processing content like technical manuals and various types of unstructured content to figure out who was named, what part was what, and what people, places, events, and things were mentioned in digital content. The utility of NLToolset type software lay in the cost reduction associated with documents and the intelligence value of processed information.

The need for a markup system that worked without 100 percent human indexing was important. GE got with the program and appears to have assigned some then-young folks to the project. The government speak for this type of content processing involves terms like “message understanding” or MU, “entity extraction,” and “relationship mapping.” The outputs of an NLToolset system were intended for use in other software subsystems that could count, process, and perform other operations on the tagged content. Today, this class of software would be packaged under a broad term like “text mining.” GE exited the business, which ended up in the hands of Martin Marietta. When the technology landed at Martin Marietta, the suite of tools was used in what was called, in the late 1980s and early 1990s, the Louella Parsing System. When Lockheed and Martin merged to form the giant Lockheed Martin, Louella was renamed AeroText.

Over the years, the AeroText system competed with LingPipe, SRA’s NetOwl and Inxight’s tools. In the heyday of natural language processing, there were dozens and dozens of universities and start ups competing for Federal funding. I have mentioned in other articles the importance of the US government in jump starting the craziness in search and content processing.

In 2005, I recall that Lockheed Martin released AeroText 5.1 for Linux, but I have lost track of the open source versions of the system. The point is that AeroText is not particularly new, and as far as I know, the last major upgrade took place in 2007 before Lockheed Martin sold the property to Rocket Software. At the time of the sale, AeroText incorporated a number of subsystems, including a useful time plotting feature. A user could see tagged events on a timeline, a function long associated with the original version of i2’s Analyst’s Notebook. A US government buyer can obtain AeroText via the GSA because Lockheed Martin seems to be a reseller of the technology. Before the sale to Rocket, Lockheed Martin followed SAIC’s push into Australia. Lockheed signed up NetMap Analytics to handle Australia’s appetite for US government accepted systems.

AeroText Functionality

What does AeroText purport to do that caused the person who contacted me to see a 1980s technology as the next best thing to sliced bread?

AeroText is an extraction tool; that is, it has capabilities to identify and tag entities at somewhere between 50 percent and 80 percent accuracy. (See NIST 2007 Automatic Content Extraction Evaluation Official Results for more detail.)

The AeroText approach uses knowledgebases, rules, and patterns to identify and tag pre-specified types of information. AeroText references patterns and templates, both of which assume the licensee knows beforehand what is needed and what will happen to processed content.
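
To show what a rule and knowledgebase approach looks like in miniature, here is a skeletal tagger: a gazetteer of known names plus surface patterns. This is a hypothetical sketch of the general technique, not AeroText’s code; the entries and patterns are invented, and production knowledgebases are far larger and hand curated.

    import re

    # Gazetteer: pre-specified names mapped to entity types (invented entries).
    GAZETTEER = {"Lockheed Martin": "ORGANIZATION", "General Electric": "ORGANIZATION"}

    # Rules: surface patterns for entity types the licensee expects to find.
    RULES = [
        (re.compile(r"\b(?:Mr|Ms|Dr)\.\s+[A-Z][a-z]+"), "PERSON"),
        (re.compile(r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|"
                    r"August|September|October|November|December)\s+\d{4}\b"), "DATE"),
    ]

    def tag(text):
        entities = [(name, label) for name, label in GAZETTEER.items() if name in text]
        for pattern, label in RULES:
            entities.extend((m.group(0), label) for m in pattern.finditer(text))
        return entities

    print(tag("Dr. Smith joined Lockheed Martin on 12 March 2008."))
    # [('Lockheed Martin', 'ORGANIZATION'), ('Dr. Smith', 'PERSON'), ('12 March 2008', 'DATE')]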

In my view, the licensee has to know what he or she is looking for in order to find it. This is a problem captured in the famous snippet, “You don’t know what you don’t know” and the “unknown unknowns” variation popularized by Donald Rumsfeld. Obviously without prior knowledge the utility of an AeroText-type of system has to be matched to mission requirements. AeroText pounded the drum for the semantic Web revolution. One of AeroText’s key functions was its ability to perform the type of markup the Department of Defense required of its XML. The US DoD used a variant called DAML, or DARPA Agent Markup Language. Natural language processing, Louella, and AeroText collected the dust of SPARQL, unifying logic, RDF, OWL, ontologies, and other semantic baggage as the system evolved through time.

Also, staff (headcount) and on-going services are required to keep a Louella/AeroText-type system generating relevant and usable outputs. AeroText can find entities, figure out relationships like person to person and person to organization, and tag events like a merger or an arrest “event.” In one briefing about AeroText I attended, I recall that the presenter emphasized that AeroText did not require training. (The subtext for those in the know was that Autonomy required training to deliver actionable outputs.) The presenter did not dwell on the need for manual fiddling with AeroText’s knowledgebases, and I did not raise this issue.

