IDC and BA Insight: Cartoons and Keyword Search

January 31, 2015

I kid you not. I received a spam mail from an outfit called BA Insight. The spam was a newsletter published every three months. You know that regular flows of news are what ring Google’s PageRank chimes, right?

Here’s the missive:


The lead item is an invitation to:

Unstructured content – email, video, instant messages, documents and other formats accounts for 90% of all digital information.

View the IDC Infographic:
Unlock the Hidden Value of Information

With my fully protected computer, I boldly clicked on the link. I don’t worry too much about keyword search vendors’ malware, but prudence is a habit my now deceased grandma drummed into me.

Here’s what greeted me:


Yep, a giant infographic cartoon stuffed with assertions and a meaningless chunk of jargon: knowledge quotient. Give me cyber OSINT any day.

The concept presented in this fascinating marketing play is that unstructured content has value waiting to be delivered. I learned:

This content is locked in variety locations [sic] and applications made up of separate repositories that don’t talk to each other—e.g., EMC Documentum,, Google Drive, SharePoint, et al.

Now it looks to me as if the word “of” has been omitted between “variety” and “locations.” I also think that EMC Documentum has a new name. Oh, well. Let’s move on.

The key point in the cartoon is that “some organizations can and do unlock information’s hidden value. Organizations with a high knowledge quotient.”

I thought I addressed this silly phrase in this write up.

Let me be clear. IDC is the outfit that sold my information on Amazon without my permission. More embarrassing to me was the fact that the work was attributed to a fellow named Dave Schubmehl, who is one of IDC’s premier search experts, if not the premier one. Scary, I believe. Frightful.

What’s the point?

The world of information access has leapfrogged outfits like BA Insight and “experts” like IDC’s pride of pontificators.

The future of information access is automated collection, analysis, and reporting. You can learn about this new world in CyberOSINT: Next Generation Information Access. No cartoons but plenty of screenshots that show what the outputs of NGIA systems deliver to users who need to reduce risk and make decisions of considerable importance and time sensitivity.

In the meantime, if you want cartoons, flip through the New Yorker. More intelligent fare I would suggest.

How do you become a knowledge quotient leader? In my opinion, not by licensing a keyword search system or buying information from an outfit that surfs on my research. Just a thought.

Stephen E Arnold, January 31, 2015

CyberOSINT and the Associated Press

January 31, 2015

Remember the days when there were Associated Press stringers? Remember when the high value AP service was information gathered at state capitols? Remember when humans did this work?

Enter cyber information or as I dub this stuff Cyber OSINT.

Navigate to “AP’s Robot Journalists Are Writing Their Own Stories Now.” I would have added the subtitle “And the Obituaries of Stringers”. The idea is simple: Smart software assembles sentences that comprise a “news story.” Here’s the passage I noted:

Philana Patterson, an assistant business editor at the AP tasked with implementing the system, tells us there was some skepticism from the staff at first. “I wouldn’t expect a good journalist to not be skeptical,” she said. Patterson tells us that when the program first began in July, every automated story had a human touch, with errors logged and sent to Automated Insights to make the necessary tweaks. Full automation began in October, when stories “went out to the wire without human intervention.” Both the AP and Automated Insights tell us that no jobs have been lost due to the new service. We’re also told the automated system is now logging in fewer errors than the human-produced equivalents from years past.

The shift from humans to software is just beginning. To get a glimpse of how industrial strength systems perform far more sophisticated operations automatically, you will want to read CyberOSINT: Next Generation Information Access.

Forget traditional search and information gathering. The world has shifted. You know it when a stodgy, collectively owned outfit like the AP goes public with cyber tools.

When will the enterprise search vendors flogging consulting services and keyword systems figure it out? Perhaps search and indexing companies are the heirs to the cluelessness of news gathering organizations.

Stephen E Arnold, January 31, 2015

EMC: Another Information Sideshow in the Spotlight

January 31, 2015

An information sideshow is enterprise software that presents itself as the motor, transmission, and differential for the organization. Get real. The main enterprise applications are accounting, database management systems, sales management, and systems that manage real stuff (ERP, PLM, etc.).

Applications that purport to manage Web content or organize enterprise wide information and data are important but the functions concern overhead positions except in publishing companies and similar firms.

Since the Web became everyone’s passport to becoming an expert online professional, Web content management systems blossomed and flamed out. Anyone using Broadvision or Sagemaker?

Documentum is a content management system. It is mandated or was mandated as the way to provide information to support the antics of the Food and Drug Administration and some other regulated sectors. The money from FDA’s blessing does not mean that Documentum is in step with today’s digital demands. In fact, for some applications, systems like Documentum are good for the resellers and integrators. Users often have a different point of view. Do you love OpenText, MarkLogic, and other proprietary content management systems? Remember XyVision?

Several years ago, I had a fly over of a large EMC Documentum project. When I was asked to take a look, a US government entity had been struggling for three years to get a Documentum system up and running. I think one of the resellers and consultants was my old pal IBM, which sells its own content management systems, by the way. At the time I was working with the Capitol Police (yep, another one of those LE entities that few people know much about). Think investigation.

I poked around the system, reviewed some US government style documentation, and concluded that the in-process system would require more investment and time to get up and toddling, not walking, mind you, just toddling. I bailed and worked on projects that sort of really worked, mostly in other governmental entities.

After that experience, I realized that “content management” was a bit of a charade, not too different from Web servers and enterprise search. The frenzy for Web stuff made it easy for vendors of proprietary systems to convince organizations to buy bespoke, proprietary content management systems. Wow.

The outfits that are in the business of creating content know about editorial policies. Licensees of content management systems often do not. But publishing expertise is irrelevant to many 20 somethings, failed webmasters, self appointed experts, and confused people looking for a source of money.

The world is chock a block with content management systems. But there is a difference today, and the shift from proprietary systems to open source systems puts vendors of proprietary systems in a world of sales pain. For some outfits, CMS means SharePoint (heaven help me).

For other companies CMS means open source CMS systems. No license fees. No restrictions on changes. But CMS still requires expensive ministrations from CMS experts. Just like enterprise search.

I read “EMC Reports Mixed Results, Fingers Axe: Reduction in Force Planned.” For me this passage jumped out of the article:

The Unified Backup and Recovery segment includes mid-range VNX arrays and it had a storming quarter too, with 2,000 new VNX customers. VCE also added a record number of new customers. RSA grew at a pedestrian rate in the quarter, four per cent year-on-year with the Information Intelligence Group (Documentum, etc) declining eight per cent; this product set has never shone.

So, an eight percent decline. Not good. Like enterprise search, this proprietary content management product has a long sales cycle and after six months of effort, the client may decide to use an open source solution. Joomla anyone? My hunch is that the product set will emit as many sparklies as the soot in my fireplace chimney.

CMS is another category of software for which cyber OSINT methods point the way to the future. Automated systems capture what humans do and operate on that content automatically. Allowing humans to index, tag, copy, date, and perform other acts of content violence leads to findability chaos.

In short, EMC Documentum is going to face some tough months. Drupal anyone?

Stephen E Arnold, January 31, 2015

Google Glass: The Future Some Time

January 30, 2015

A couple of years ago I wrote an unpublished report for a big time investment bank. The subject? Google Glass. The client was a rah rah believer in headgear that provided Terminator style inputs.

I worked through the Google research papers, tracked down a wizard who now works at Amazon, and summarized the supporting technologies required to do Terminator type stuff. I did not come away from the exercise in a state of high energy.

I kept my personal opinion to myself, got paid, and moved on to cyber OSINT related topics.

I just read “Google Is Resetting Its Google Glass Strategy.” I wish the company well. I personally think that this augmenting technology will be handled in the manner described in exquisite detail by novelist and rocket scientist Alastair Reynolds, not with wonky “wearables.” (No, I don’t want something on my trifocals. No, I don’t want a watch the size of two or three stacked Oreo cookies.)

Here’s the write up’s take on the future of Glass:

Google is still expected to release a second generation version of Google Glass sometime in the future, but it was unclear what that might involve. Now we know Google is going back to the drawing board to rethink the Glass programme there’s no way of predicting what will be coming next. At the very least we hope it’ll be a little bit less expensive.

What is important to me about Glass is that it shows how thin the intellectual veneer is at the Google X Labs thing. Elon Musk conceived satellites for Internet access even though that idea was moved along by Equatorial Communications 20 years ago. X Labs pushed balloons. Didn’t these fly over Paris in the 1700s?

Glass is significant because it illustrates:

  • Poor management
  • The consequences of senior management getting involved with staff
  • Marketing that permitted the phrase “glasshole” to become part of the vernacular
  • Technology that was a demo of something that ran out of battery power quickly, had a dorky user interface, and added creepiness to Google’s stockpile of creepiness.

No Glass redux for me. How about getting back to relevant search and products/services that generate revenue?

Stephen E Arnold, January 30, 2015

IBM Flogs Watson as a Lawyer and a Doctor

January 30, 2015

After the disappointing and somewhat negative IBM financial reports, the Watson PR machine has lurched into action. Watson, as you may know, is the next big thing in content processing. Lucene plus home brew code converts search into an artificial intelligence powerhouse. Well, that’s what the Watson cheerleaders want me to believe. I wonder if cheerleading correlates with making sales of more than $1 billion in the next quarter or two or three or four or five.

I read two news items. One is indicative of the use of Watson on a bounded content set, not the big, wide, wonderful world of real time data flows. The other is somewhat troubling but not particularly surprising.

To business.

IBM Watson is now a lawyer. Navigate to “Meet Ross, the IBM Watson-Powered Lawyer.” The idea is that systems from LexisNexis and Thomson Reuters are not what lawyers or the thrifty legal searcher want. Nope, Watson converts to a lawyer more easily than a graduate of a third tier law school chases accident victims. According to the write up:

University of Toronto team launches a cognitive computing application that helps lawyers conduct world-class case research.

If I understand the write up, Watson is a search system equipped with the magical powers that allowed the machine and software to win a TV game show. Is post production allowed in the court room? I know that post plays a part in prime time TV. Just asking.

A couple of thoughts. The current line up of legal research systems is struggling to keep revenues and make profits. The reason for the squeeze is that law firms are having some difficulty returning to the salad days of the Ling-Temco-Vought era. Lawyers are getting fired. Lawyers are suing law schools with allegations of false advertising about the employment picture for the newly minted JDs. Lawyers are becoming human resource, public relations, and school counselors. Others are just quitting. I know one Duke Law lawyer who has worked at several of the world’s most highly regarded law firms. Know what the Duke Law degree is doing for money? Running a health club. Interesting development for those embarking on a law degree.

Will Watson generate significant revenue and a profit from its legal research prowess? The answer, in my opinion, is, “No.” What is going to happen is that efficacy of Watson’s usefulness on a bounded set of legal content can be compared to the outputs from the smart system offered by Thomson Reuters and the decidedly less smart system from LexisNexis. For an academic, this comparison will be an endless source of reputational zoomitude. For the person needing legal advice, hire an attorney. These folks advertise on TV now and offer 24×7 hotlines and toll free numbers.

The second item casts a shadow over my skeptical and extremely tiny intellectual capability. Navigate to “This Medical Supercomputer Isn’t a Pacemaker, IBM Tells Congress.” Excluding classified and closed hearings about next generation intelligence systems, this may be the first time a Lucene recycler is pitching Congress about search and retrieval. The write up says:

The effort to protect decision support tools like Watson from Food and Drug Administration regulation is part of a proposal by the Republican chairman of the House Energy and Commerce Committee, Michigan’s Fred Upton. Called the 21st Century Cures initiative, it’s a major overhaul in the pharmaceutical and medical-device world, and the possibility of its passage is boosted by Republican control of both chambers of Congress. Upton’s bill would give the FDA two years to come up with a verification process for what it calls “medical software.” Such programs wouldn’t require the strict approval process faced by makers of medical devices like heart stents. Another set of products defined as “health software” wouldn’t require FDA oversight at all.

I think an infusion of US government money will provide some revenue to the game show winner. Go for it. Remember I used to work at Halliburton Nuclear and Booz, Allen & Hamilton. But in terms of utility I think that if the Golden Fleece Award were still around, Watson might get a quick look by the 20 somethings filtering the government funding of interesting projects.

Net net: Watson is going to have to vie with HP Autonomy for the billions in revenue from their content processing technologies. Perhaps IBM should take a closer look at i2 and Cybertap? Those IBM owned content processing systems may deliver more value than the keyword centric, super smart Watson system. Just a suggestion from rural Kentucky.

The gray side of the cloud is that IBM may actually get government money. Will Watson bond with Mr. Obama’s health programs? That is an exciting notion.

Stephen E Arnold, January 30, 2015

Autonomy: Leading the Push Beyond Enterprise Search

January 30, 2015

In “CyberOSINT: Next Generation Information Access,” I describe Autonomy’s math-first approach to content processing. The reason is that after the veil of secrecy was lifted with regard to the signal processing methods used for British intelligence tasks, Cambridge University became one of the hot beds for the use of Bayesian, Laplacian, and Markov methods. These numerical recipes proved to be both important and controversial. Instead of relying on manual methods, humans selected training sets, tuned the thresholds, and then turned the smart software loose. Math is not required to understand what Autonomy packaged for commercial use: Signal processing separated noise in a channel and allowed software to process the important bits. Thank you, Claude Shannon and the good Reverend Bayes.
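For readers who want to see the "select a training set, tune a threshold, turn the software loose" idea in miniature, here is a rough sketch in Python. It is a textbook naive Bayes scorer, not Autonomy's method, which lives in the patent documents. The training sentences, the "signal" and "noise" labels, the function names, and the 0.8 threshold are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, human-selected training set: a few labeled snippets.
TRAINING = [
    ("shipment of restricted components routed through shell company", "signal"),
    ("wire transfer to shell company flagged by analyst", "signal"),
    ("cafeteria menu for the week", "noise"),
    ("parking lot closed for repairs this week", "noise"),
]

# Hypothetical, human-tuned threshold on the posterior probability.
THRESHOLD = 0.8


def train(labeled_docs):
    """Count words per label and documents per label."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def prob_signal(text, word_counts, doc_counts):
    """Posterior probability of the 'signal' label via Bayes' rule
    with add-one smoothing over the training vocabulary."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # prior
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize the log scores into probabilities.
    peak = max(log_scores.values())
    weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
    return weights["signal"] / sum(weights.values())


MODEL = train(TRAINING)


def is_signal(text, threshold=THRESHOLD):
    """Once trained and tuned, the software decides without a human in the loop."""
    return prob_signal(text, *MODEL) >= threshold
```

Calling `is_signal("transfer routed through shell company")` passes the threshold, and `is_signal("cafeteria closed this week")` does not. The human judgment sits in two places only: which examples go into `TRAINING` and where `THRESHOLD` is set. That is the part some academics considered tainted.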

What did Autonomy receive for this breakthrough? Not much but the company did generate more than $600 million in revenues about 10 years after opening for business. As far as I know, no other content processing vendor has reached this revenue target. Endeca, for the sake of comparison, flat lined at about $130 million in the year that Oracle bought the Guided Navigation outfit for about $1.0 billion.

For one thing the British company BAE (British Aerospace) licensed the Autonomy system and began to refine its automated collection, analysis, and report systems. So what? The UK became by the late 1990s the de facto leader in automated content activities. Was BAE the only smart outfit in the late 1990s? Nope, there were other outfits who realized the value of the Autonomy approach. Examples range from US government entities to little known outfits like the Wynyard Group.

In the CyberOSINT volume, you can get more detail about why Autonomy was important in the late 1990s, including the name of the university professor who encouraged Mike Lynch to make contributions that have had a profound impact on intelligence activities. For color, let me mention an anecdote that is not in the 176 page volume. Please keep in mind that Autonomy was, like i2 (another Cambridge University spawned outfit), a client prior to my retirement. IBM owns i2 and i2 is profiled in CyberOSINT in Chapter 5, “CyberOSINT Vendors.” I would point out that more than two thirds of the monograph contains information that is either not widely available or not available via a routine Bing, Google, or Yandex query. For example, Autonomy does not make publicly available a list of its patent documents. These contain specific information about how to think about cyber OSINT and moving beyond keyword search.

Some Color: A Conversation with a Faux Expert

In 2003 I had a conversation with a fellow who was an “expert” in content management, a discipline that is essentially a step child of database technology. I want to mention this person by name, but I will avoid the inevitable letter from his attorney rattling a saber over my head. This person publishes reports, engages in litigation with his partners, kowtows to various faux trade groups, and tries to keep secret his history as a webmaster with some Stone Age skills.

Not surprisingly this canny individual had little good to say about Autonomy. The information I provided about the Lynch technology, its applications, and its importance in next generation search were dismissed with a comment I will not forget, “Autonomy is a pile of crap.”

Okay, that’s an informed opinion for a clueless person pumping baloney about the value of content management as a separate technical field. Yikes.

In terms of enterprise search, Autonomy’s competitors criticized Lynch’s approach. Instead of a keyword search utility that was supposed to “unlock” content, Autonomy delivered a framework. The framework operated in an automated manner and could deliver keyword search, point and click access like the Endeca system, and more sophisticated operations associated with today’s most robust cyber OSINT solutions. Enterprise search remains stuck in the STAIRS III and RECON era. Autonomy was the embodiment of the leap from putting the burden of finding on humans to shifting the load to smart software.


A diagram from Autonomy’s patents filed in 2001. What’s interesting is that this patent cites an invention by Dr. Liz Liddy with whom the ArnoldIT team worked in the late 1990s. A number of content experts understood the value of automated methods, but Autonomy was the company able to commercialize and build a business on technology that was not widely known 15 years ago. Some universities did not teach Bayesian and related methods because these were tainted by humans who used judgments to set certain thresholds. See US 6,668,256. There are more than 100 Autonomy patent documents. How many of the experts at IDC, Forrester, Gartner, et al have actually located the documents, downloaded them, and reviewed the systems, methods, and claims? I would suggest a tiny percentage of the “experts.” Patent documents are not what English majors are expected to read.

That’s important and little appreciated by the mid tier outfits’ experts working for IDC (yo, Dave Schubmehl, are you ramping up to recycle the NGIA angle yet?), Forrester (one of whose search experts told me at a MarkLogic event that new hires for search were told to read the information on my Web site, like that was a good thing for me), Gartner Group (the conference and content marketing outfit), Ovum (the UK counterpart to Gartner), and dozens of other outfits who understand search in terms of selling received wisdom, not insight or hands on facts.


Google: Is the One Trick Pony Limping?

January 30, 2015

You should work through the Googley report about the GOOG’s financial results. I would suggest purging your mind of thoughts about Apple’s revenue and Google’s involvement with the Apple Board of Directors. I would also suggest sponging the data about Amazon’s cloud and prime gains, not to mention the world’s smartest man’s delivering a profit.

Properly prepared, now we can consider “Google Inc. Announces Fourth Quarter and Fiscal Year 2014 Results.” There are two attachments, which you may want to peruse as well. For me the key point in the write up was this passage:

Other Revenues – Other revenues were $1.95 billion, or 11% of total revenues, in the fourth quarter of 2014.  This represents a 19% increase over fourth quarter 2013 other revenues of $1.65 billion.

The way I interpret this sentence is that after a decade of real effort, Google has been able to generate a couple of billion dollars in revenue from non-ad, non-search, and non-network related activities. In the early days, Google earned zero money from anything. Then the company stumbled upon in a moment of inspiration the methods of GoTo, Overture, and Yahoo. After a legal flap, Google emerged with a business model; that is, pay to play for traffic and traffic.
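A quick back-of-the-envelope check on the quoted numbers, sketched in Python. The inputs are the rounded figures from the passage above. The variable names are mine, and the implied totals are my arithmetic, not numbers from Google's release.

```python
# Rounded figures from the quoted passage, in billions of dollars.
other_q4_2014 = 1.95
other_q4_2013 = 1.65
other_share_of_total = 0.11  # "11% of total revenues"

# Year-over-year growth in "other" revenues.
growth = (other_q4_2014 - other_q4_2013) / other_q4_2013

# Total revenue implied by the 11 percent share, and what is left over
# for the ad, search, and network businesses.
implied_total = other_q4_2014 / other_share_of_total
implied_core = implied_total - other_q4_2014

print(f"Other revenue growth: {growth:.1%}")
print(f"Implied total revenue: ${implied_total:.1f} billion")
print(f"Implied ad and related revenue: ${implied_core:.1f} billion")
```

With the rounded inputs the growth works out to a bit over 18 percent rather than the stated 19 percent. I assume the release computed its percentage from unrounded figures. Either way, the point stands: roughly nine dollars in ten still come from the pay to play business.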

Several thoughts:

  1. Google is a money machine. The company has to find a way to generate more of the stuff in order to maintain its reputation as Googzilla. In my view, Loon balloons and related initiatives are the supporting cast. Another Broadway hit or three are needed.
  2. The financials do not touch upon the management and interpersonal storms buffeting the company. One Google professional was the focal point of a TV news program involving yachts, alcohol, a person without a degree from Cal Tech or INSEAD, and interestingly enough a banned substance. There was the dust up about Glass, inter company extracurricular activities involving a high profile Googler, and the departure of a nano-tech whiz to Amazon’s digital jungle. Then there were management changes. So much in just 12 months.
  3. Finally, there were the company’s business decisions that roiled the Google Earth world, the procedural shifts for APIs, and the rise of irrelevancy in search results. The grand and glorious vision of “the world’s information” dimmed as book scanning seemed to fizzle. Somewhere I have a list of orphaned services. I will start a new list for fiscal 2015-2016 and use a bigger note card.

I find Google fascinating. I began work on The Google Legacy in 2003, Google Version 2.0 in 2005 when the company was approaching its miracle year, and Google: The Digital Gutenberg in 2008. [Alas, the unstable finances of the publisher of these still useful analyses put these volumes out of print. Publishers are also fascinating, almost like Oedipus.] After these three monographs, I was able to state with some conviction that Google had to find a way to monetize mobile at the same profit level as old school desktop search or find new revenue streams. It was obvious that the Google Search Appliance was not going to be a big winner.

Google remains an important company. Many MBAs live and die by Google’s apparent dominance of all things nifty. For me these financial results suggest that Google may need an overhaul in its senior management. The vision thing is just not ringing my bells.

I no longer can do a query on Google to answer this question, “What’s next for Google?” I think I know after 15 years of watching. More ads, more thrashing, and more Loon balloons. I sort of miss getting those nifty tchotchkes at conferences. My LED illuminated Google pin has gone dim. My Google mousepad has worn out. My Google T Shirt has faded.

Mobile online access has arrived, and it is more of a threat than desktop searchers realize.

Stephen E Arnold, January 30, 2015

Linguistic Analysis and Data Extraction with IBM Watson Content Analytics

January 30, 2015

The article on IBM titled Discover and Use Real-World Terminology with IBM Watson Content Analytics provides an overview of domain-specific terminology through the linguistic facets of Watson Content Analytics. The article begins with a brief reminder that most data, whether in the form of images or texts, is unstructured. IBM’s linguistic analysis focuses on extracting relevant unstructured data from texts in order to make it more useful and usable in analysis. The article details the processes of IBM Watson Content Analytics:

“WCA processes raw text from the content sources through a pipeline of operations that is conformant with the UIMA standard. UIMA (Unstructured Information Management Architecture) is a software architecture that is aimed at the development and deployment of resources for the analysis of unstructured information. WCA pipelines include stages such as detection of source language, lexical analysis, entity extraction… Custom concept extraction is performed by annotators, which identify pieces of information that are expressed as segments of text.”

The main uses of WCA are exploring insights through facets as well as extracting concepts in order to apply WCA analytics. The latter might include excavating lab analysis reports to populate patient records, for example. If any of these functionalities sound familiar, it might not surprise you that IBM bought iPhrase, and much of this article is reminiscent of iPhrase functionality from about 15 years ago.
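The staged pipeline the article describes can be pictured with a small sketch. This is plain Python, not the UIMA or WCA API. The `Document` and `Annotation` classes, the stage functions, the stopword heuristic for language detection, and the lab-result pattern are all hypothetical stand-ins for the language detection, lexical analysis, and custom annotator stages named in the quote.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str    # e.g. "token" or "lab_result"
    begin: int   # start offset of the text segment
    end: int     # end offset of the text segment
    value: str   # the covered text


@dataclass
class Document:
    text: str
    language: str = "unknown"
    annotations: list = field(default_factory=list)


def detect_language(doc):
    """Toy language detection: look for a few common English words."""
    english_hints = {"the", "of", "and", "was", "is"}
    words = set(re.findall(r"[a-z]+", doc.text.lower()))
    doc.language = "en" if words & english_hints else "unknown"
    return doc


def tokenize(doc):
    """Toy lexical analysis: mark each word or number as a token."""
    for m in re.finditer(r"\w+(?:\.\w+)?", doc.text):
        doc.annotations.append(Annotation("token", m.start(), m.end(), m.group()))
    return doc


# Hypothetical custom concept: "<test> was|is <number> <unit>".
LAB_PATTERN = re.compile(
    r"(hemoglobin|glucose)\s+(?:was|is)\s+(\d+(?:\.\d+)?)\s*(g/dL|mg/dL)",
    re.IGNORECASE,
)


def extract_lab_results(doc):
    """Custom annotator: tag segments of text that state a lab result."""
    for m in LAB_PATTERN.finditer(doc.text):
        doc.annotations.append(Annotation("lab_result", m.start(), m.end(), m.group()))
    return doc


PIPELINE = [detect_language, tokenize, extract_lab_results]


def run_pipeline(text):
    """Pass the raw text through each stage in order."""
    doc = Document(text)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


def lab_results(doc):
    """Turn lab_result annotations into fields for a patient record."""
    record = {}
    for ann in doc.annotations:
        if ann.kind == "lab_result":
            m = LAB_PATTERN.match(ann.value)
            record[m.group(1).lower()] = (float(m.group(2)), m.group(3))
    return record
```

Running `lab_results(run_pipeline("The hemoglobin was 13.5 g/dL and glucose is 98 mg/dL."))` yields a dictionary keyed by test name. Each stage only adds annotations to the shared document, which is why stages can be swapped or extended without rewriting the others.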

Chelsea Kerwin, January 30, 2015

Sponsored by, developer of Augmentext

SLI Management Shifts

January 30, 2015

SLI Systems is one of the top SaaS providers for Internet retailers, and the New Year brings changes for the New Zealand company. SLI Systems is known for its site search software, analysis, and strategies to turn Web site visitors into customers. Reseller News announced that “Global Sales VP Leaves As SLI Systems Completes Sales Leadership Transition” and it leaves the company with more changes than expected.

SLI Systems was reorganizing its leadership when the Vice President of Global Sales and Business Development, Ed Hoffman, decided to take his leave. Hoffman left to pursue other business interests. Chief Revenue Officer and President of North America Neil Thomas will take over sales, business development personnel, strategy, and operations.

“‘This move completes the sales leadership transition that we began last year by appointing Neil to his position,’ says Shaun Ryan, CEO, SLI Systems. ‘We are grateful to Ed for his contribution to SLI’s growth and customer satisfaction since 2003, and we wish him well in his future endeavors.’”

There does not appear to be any bad blood between Hoffman and SLI Systems. SLI Systems is moving forward to improve sales and services for 2015.

Whitney Grace, January 30, 2015
Sponsored by, developer of Augmentext

Enterprise Search Lacks NGIA Functions

January 29, 2015

Users Want More Than Hunting through a Rubbish

CyberOSINT: Next Generation Information Access is, according to Ric Manning, the publisher of Stephen E Arnold’s new study, now available. You can order a copy at the Gumroad online store or via the link on


One of the key chapters in the 176 page study of information retrieval solutions that move beyond search takes you under the hood of an NGIA system. Without reproducing the 10 page chapter and its illustrations, I want to highlight two important aspects of NGIA systems.

When a person requires information under time pressure, traditional systems pose a problem. The time required to figure out which repository to query, craft a query or take a stab at what “facet” (category) may contain the information, scan the outputs the system displays, open a document that appears to be related to the query, and then figure out exactly what item of data is the one required makes traditional search a non starter in many work situations. The bottleneck is the human’s ability to keep track of which digital repository contains what. Many organizations have idiosyncratic terminology, and users in one department may not be familiar with the terminology used in another unit of the organization.


Register for the seminar on the Telestrategies’ Web site.

Traditional enterprise search systems trip and skin their knees over the time issue and over the “locate what’s needed issue.” These are problems that have persisted in search box oriented systems since the days of RECON, SDC Orbit, and Dialcom. There is little a manager can do to create more time. Time is a very valuable commodity and it often determines what type of decision is made and how risk laden that decision may be.

There is also little one can do to change how a bright human works with a system that forces a busy individual to perform iterative steps that often amount to guessing the word or phrase to unlock what’s hidden in an index or indexes.

Little wonder that convincing a customer to license a traditional keyword system continues to bedevil vendors.

A second problem is the nature of access. There is news floating around that Facebook has been able to generate more ad growth than Google because Facebook has more mobile users. Whether Facebook or Google dominates social mobile, the key development is “mobile.” Workers need information access from devices which have smaller and different form factors from the multi core, 3.5 gigahertz, three screen workstation I am using to write this blog post.
