September 1, 2014
Last week I had a conversation with a publisher who has a keen interest in software that “knows” what content means. Armed with that knowledge, a system can then answer questions.
The conversation was interesting. I mentioned my presentations for law enforcement and intelligence professionals about the limitations of modern and computationally expensive systems.
Several points crystallized in my mind. One of these is addressed, in part, in a diagram created by a person interested in machine learning methods. Here’s the diagram created by SciKit:
The diagram is designed to help a developer select from different methods of performing estimation operations. The author states:
Often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems. The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.
First, notice that there is a selection process for choosing a particular numerical recipe. Now who determines which recipe is the right one? The answer is the coding chef. A human exercises judgment about a particular sequence of operations that will be used to fuel machine learning. Is that sequence of actions the best one, the expedient one, or the one that seems to work for the test data? The answer to these questions determines a key threshold for the resulting “learning system.” Stated another way, “Does the person licensing the system know if the numerical recipe is the most appropriate for the licensee’s data?” Nah. Does a mid tier consulting firm like Gartner, IDC, or Forrester dig into this plumbing? Nah. Does it matter? Oh, yeah. As I point out in my lectures, the “accuracy” of a system’s output depends on this type of plumbing decision. Unlike a backed up drain, flaws in smart systems may never be discerned. For certain operational decisions, financial shortfalls or the loss of an operation team in a war theater can be attributed to one of many variables. As decision makers chase the Silver Bullet of smart, thinking software, who really questions the output in a slick graphic? In my experience, darned few people. That includes cheerleaders for smart software, azure chip consultants, and former middle school teachers looking for a job as a search consultant.
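To make the plumbing point concrete, here is a minimal sketch in plain Python. It is not taken from the SciKit diagram or from any vendor's system. The toy data and the three "recipes" (a majority-class guesser, a nearest-centroid rule, and a one-nearest-neighbor rule) are assumptions chosen purely for illustration. The same data go in; the reported "accuracy" depends on which recipe the coding chef picked.

```python
from collections import Counter

# Hypothetical toy data: label "a" sits at both ends of a line,
# label "b" sits in the middle. Each item is (value, label).
TRAIN = [(0, "a"), (1, "a"), (2, "a"), (10, "a"), (11, "a"), (12, "a"), (13, "a"),
         (5, "b"), (6, "b"), (7, "b")]
TEST = [(0.5, "a"), (10.5, "a"), (11.5, "a"), (5.5, "b"), (6.0, "b"), (6.4, "b")]


def fit_majority(train):
    """Recipe 1: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_centroid(train):
    """Recipe 2: predict the label whose mean value is closest to x."""
    sums = {}
    for x, y in train:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    centroids = {y: total / count for y, (total, count) in sums.items()}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def fit_one_nearest_neighbor(train):
    """Recipe 3: predict the label of the single closest training point."""
    return lambda x: min(train, key=lambda pair: abs(x - pair[0]))[1]


def accuracy(predict, test):
    """Fraction of test items whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in test) / len(test)


RECIPES = {
    "majority": fit_majority,
    "nearest_centroid": fit_nearest_centroid,
    "one_nearest_neighbor": fit_one_nearest_neighbor,
}

# Same training data, same test data, three different recipes.
results = {name: accuracy(fit(TRAIN), TEST) for name, fit in RECIPES.items()}

if __name__ == "__main__":
    for name, score in results.items():
        print(f"{name}: {score:.2f}")
```

On this contrived data the three recipes produce three different scores. A licensee who sees only the final number has no way to tell which recipe produced it, or whether another would have suited the data better.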
Second, notice the reference to a “rough guide.” The real guide is understanding of how specific numerical recipes work on a set of data that allegedly represents what the system will process when operational. Furthermore, there are plenty of mathematical methods available. The problem is that some of the more interesting procedures lead to increased computational cost. In a worst case, the more interesting procedures cannot be computed on available resources. Some developers know about P=NP and Big O. Others know to use the same nine or ten mathematical procedures taught in computer science classes. After all, why worry about math based on mereology if the machine resources cannot handle the computations within time and budget parameters? This means that most modern systems are based on a set of procedures that are computationally affordable, familiar, and convenient. Does this similarity of procedures matter? Yep. The generally squirrely outputs from many very popular systems are perceived as completely reliable. Unfortunately, the systems are performing within a narrow range of statistical confidence. Stated more harshly, the outputs are just not particularly helpful.
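The cost argument can be sketched with a few lines of arithmetic. The growth rates below are the standard textbook Big O classes. The budget of ten million operations is a hypothetical figure, not a measurement of any real system, and the function names are mine.

```python
import math


def operations(n):
    """Rough operation counts for n items under four common growth rates (n >= 1)."""
    return {
        "linear O(n)": n,
        "linearithmic O(n log n)": int(n * math.log2(n)),
        "quadratic O(n^2)": n ** 2,
        "exponential O(2^n)": 2 ** n,
    }


def affordable(n, budget):
    """Names of the growth rates whose operation count fits within the budget."""
    return [name for name, ops in operations(n).items() if ops <= budget]


BUDGET = 10 ** 7  # hypothetical ceiling on operations per job

if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, affordable(n, BUDGET))
```

With ten items every procedure fits the budget. At a thousand items the exponential procedure drops out, and at a hundred thousand the quadratic one goes too. The affordable, familiar procedures are the ones left standing, which is the convenience the paragraph above describes.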
In my conversation with the publisher, I asked several questions:
- Is there a smart system like Watson that you would rely upon to treat your teenaged daughter’s cancer? Or, would you prefer the human specialist at the Mayo Clinic or comparable institution?
- Is there a smart system that you want directing your only son in an operational mission in a conflict in a city under ISIS control? Or, would you prefer the human-guided decision near the theater about the mission?
- Is there a smart system you want managing your retirement funds in today’s uncertain economy? Or, would you prefer the recommendations of a certified financial planner relying on a variety of inputs, including analyses from specialists in whom your analyst has confidence?
When I asked these questions, the publisher looked uncomfortable. The reason is that the massive hyperbole and marketing craziness about fancy new systems creates what I call the Star Trek phenomenon. People watch Captain Kirk talking to devices, transporting himself from danger, and traveling between far flung galaxies. Because a mobile phone performs some of the functions of the fictional communicator, it sure seems as if many other flashy sci-fi services should be available.
Well, this Star Trek phenomenon does help direct some research. But in terms of products that can be used in high risk environments, the sci-fi remains a fiction.
Believing and expecting are different from working with products that are limited by computational resources, expertise, and informed understanding of key factors.
Humans, particularly those who need money to pay the mortgage, ignore reality. The objective is to close a deal. When it comes to information retrieval and content processing, today’s systems are marginally better than those available five or ten years ago. In some cases, today’s systems are less useful.
August 5, 2014
I have mentioned recent “expert analyses” of the enterprise search and content marketing sector. In my view, these reports are little more than gussied up search engine optimization (SEO) and content marketing plays. See, for example, this description of the IDC report about “knowledge quotient”. Sounds good, right? So does most content marketing and PR generated by enterprise search vendors trying to create sustainable revenue and sufficient profits to keep the investors on their boats, in their helicopters, and on the golf course. Disappointing revenues are not acceptable to those with money who worry about risk and return, not their mortgage payment.
Some content processing vendors are in need of sales leads. Others are just desperate for revenue. The companies with venture money in their bank account have to deliver a return. Annoyed funding sources may replace company presidents. This type of financial blitzkrieg has struck BA Insight and LucidWorks. Other search vendors are in legal hot water; for example, one Fast Search & Transfer executive and two high profile Autonomy Corp. professionals. Other companies tap dance from buzzword to catchphrase in the hopes of avoiding the fate of Convera, Delphes, or Entopia. The marketing beat goes on, but the revenues for search solutions remain a challenge. How will IBM hit $10 billion in Watson revenues in five or six years? Good question, but I know the answer. Perhaps accounting procedures might deliver what looks like a home run for Watson. Perhaps the Jeopardy winner will have to undergo Beverly Hills-style plastic surgery? Will the new Watson look like today’s Watson? I would suggest that some artificiality could be discerned.
Last week, one of my two or three readers wrote to inform me that the phrase “knowledge quotient” is a registered trademark. One of my researchers told me that when one uses the phrase “knowledge quotient,” one should include the appropriate symbol. Omission can mean many bad things, mostly involving attorneys:
Another one of the goslings picked up the vaporous “knowledge quotient” and poked around for other uses of the word. Remember. I encountered this nearly meaningless quasi academic jargon in the title of an IDC report about content processing, authored by the intrepid expert Dave Schubmehl.
According to one of my semi reliable goslings, the phrase turned up in a Portland State University thesis. The authors were David Clitheroe and Garrett Long.
The trademark was registered in 2004 by Penn State University. Yep, that’s the university which I associate with an unfortunate management “issue.” According to Justia, the person registering the phrase “knowledge quotient” was a Penn State employee named Gene V J Maciol.
So we are considering a chunk of academic jargon cooked up to fulfill a requirement to get an advanced degree in sociology in 1972. That was about 40 years ago. I am not familiar with sociology or the concept knowledge quotient.
I printed out the 111 page document and read it. I do have some observations about the concept and its relationship to search and content processing. Spoiler alert: Zero, none, zip, nada, zilch.
The topic of the sociology paper is helping kids in trouble. I bristled at the assumptions implicit in the write up. Some cities had sufficient resources to help children. Certain types of facilities are just super. I assume neither of the study’s authors was in a reformatory, orphanage, or insane asylum.
Anyway the phrase “knowledge quotient” is toothless. It means, according to page 31:
the group’s awareness and knowledge of the [troubled youth or orphan] home.
And the “quotient” part? Here it is in all its glory:
A knowledge quotient reflects the group’s awareness and knowledge of the home.
August 2, 2014
Editor’s note: These three companies are involved in search and content processing. The opinion piece considers the question, “Is management unable to ensure standard business processes working in some businesses today?” Links have been inserted to open source information that puts some of the author’s comments in context. Comments about this essay may be posted using the Comments function for this blog.
Forgetting to Put Postage on Lots of Letters
I read “HP to Pay $32.5 Million to Settle Claims of Overbilling USPS.” (Keep in mind you may have to pony up some cash to access this article. Mr. Murdoch needs cash to buy more media properties. Do your part!)
The main point of the story, told by “real” journalists, is that the company failed “to comply with pricing terms.” The “real” news story asserts:
The DOJ also alleged H-P made misrepresentations during the negotiation of the contract with the USPS regarding its pricing and its plans to ensure it would provide the required most favored customer pricing.
I suppose any company can overlook putting postage on an envelope. When that happened to me in my day of snail mail activity, my local postmistress Claudette would give me a call and I would go to the Harrod’s Creek post office and buy a stamp.
I am no big time manager, but I understood that snail mail required a stamp. If you are a member of the House or Senate, the rules are different, but even the savvy Congressperson makes sure the proper markings appear on the absolutely essential missives.
My mind, which I admit is not as agile as it was when I worked at Halliburton Nuclear Utility Services, drew a dotted line between this seemingly trivial matter of goofing on an administrative procedure and the fantastic events still swirling around Hewlett Packard’s purchase of Autonomy, a vendor of search and content processing software.
A number of questions flapped slowly across my mind:
- Is HP management becoming careless with trivial matters like paying $11 billion for a company generating about $800 million in revenue and forgetting to pay the US post office?
- Is the thread weaving together such HP events as the mobile operating system affair, the HP tablet, the fumbling of the Alta Vista opportunity, and the apparent administrative goofs like the Autonomy purchase and this alleged postage stamp licking a matter of flawed administrative processes?
- What does the stamp sticking, Autonomy litigating, and alleged eavesdropping say about the company’s “git ‘er done” approach?
The attitude may apply to confident senior managers with incentives to produce revenue. Image source: http://profileengine.com/groups/profile/420722222/larry-the-cable-guy-for-president
I don’t think too much about Hewlett Packard. I do wonder if HP is an isolated actor or if companies with search interests are focusing on priorities that seem to be orthogonal to what I understand to be appropriate corporate behavior. One isolated event is highly suggestive.
But what do similar events suggest? In this short essai, I want to summarize two events. Both of these are interesting. For me, I see a common theme connecting the HP stamp licking and the two macro events. The glue fixing these in my mind is what seems to be a failure of management to pay attention to details.
But first, let’s go back in time for a modest effort penned by Edmund Spenser.
July 31, 2014
At lunch yesterday, several search aware people discussed a July 2014 Gartner study. One of the folks had a crumpled image of the July 2014 “magic quadrant.” This is, I believe, report number G00260831. Like other mid tier consulting firms, Gartner works hard to find something that will hook customers’ and prospects’ attention. The Gartner approach is focused on companies that purport to have enterprise search systems. From my vantage point, the Gartner approach is miles ahead of the wild and illogical IDC report about knowledge, a “quotient,” and “unlocking” hidden value. See http://bit.ly/1rpQymz. Now I have not fallen in love with Gartner. The situation is more like my finding my content and my name for sale on Amazon. You can see what my attorney complained about via this link, http://bit.ly/1k7HT8k. I think I was “schubmehled,” not outwitted.
I am the really good looking person. Image source: http://bit.ly/1rPWjN3
Where the IDC report lacks comprehensiveness with regard to vendors, Gartner mentions quite a few companies allegedly offering enterprise search solutions. You must chase down your local Gartner sales person for more details. I want to summarize the points that surfaced in our lunch time pizza fest.
First, the Gartner “study” includes 18 or 19 vendors. Recommind is on the Gartner list even though a supremely confident public relations “professional” named Laurent Ionta insisted that Recommind was not in the July 2014 Gartner report. I called her attention to report number G00260831 and urged her to use her “bulldog” motivation to contact her client and Gartner’s experts to get the information from the horse’s mouth as it were. (Her firm is www.lewispr.com and it is supposed to be the Digital Agency of the Year and on the Inc 5000 list of the fastest growing companies in America.) I am impressed with the accolades she included in her emails to me. The fact that this person who may work on the Recommind account was unaware that Gartner pegged Recommind as a niche player seemed like a flub of the first rank. When it comes to search, not even those in the search sector may know who’s on first or among the chosen 19.
To continue with my first take away from lunch, there were several companies that those at lunch thought should be included in the Gartner “analysis.” As I recall, the companies to which my motley lunch group wanted Gartner to apply their considerable objective and subjective talents were:
- ElasticSearch. This in my view is the Big Dog in enterprise search at the moment. The sole reason is that ElasticSearch has received an injection of another $70 million to complement the $30 odd million it had previously gathered. Oh, ElasticSearch is a developer magnet. Other search vendors should be so popular with the community crowd.
- Oracle. This company owns and seems to offer Endeca solutions along with RightNow/InQuira natural language processing for enterprise customer support, the fading Secure Enterprise Search system, and still popping and snapping Oracle Text. I did not mention to the lunch crowd that Oracle also owns Artificial Linguistics and Triple Hop technology. This information was, in my view, irrelevant to my lunch mates.
- SphinxSearch. This system is still getting love from the MySQL contingent. Imagine no complex structured query language syntax to find information tucked in a cell.
There are some other information retrieval outfits that I thought of mentioning, but again, my free lunch group does not know what it does not know. Like many folks who discuss search with me, learning details about search systems is not even on the menu. Even when the information is free, few want to confuse fantasy with reality.
The second take away is that the rationale for putting most vendors in the niche category puzzled me. If a company really has an enterprise search solution, how is that solution a niche? The companies identified as those who can see where search is going are, as I heard, labeled “visionaries.” The problem is that I am not sure what a search visionary is; for example, how does a French aerospace and engineering firm qualify as a visionary? Was HP a visionary when it bought Autonomy, wrote off $8 billion, and initiated litigation against former colleagues? How does this Google supplied definition apply to enterprise search:
able to see visions in a dream or trance, or as a supernatural apparition?
The final takeaway for me was the failure to include any search system from China, Germany, or Russia. Interesting. Even my down on their heels lunch group was aware of Yandex and its effort in enterprise search via a Yandex appliance. Well, internationalization only goes so far I suppose.
I recall hearing one of my luncheon guests say that IBM was, according to the “experts” at Gartner, a niche player. Gentle reader, I can describe IBM many ways, but I am not sure it is a niche player like Exorbyte (eCommerce mostly) and MarkLogic (XML data management). Nope, IBM’s search embraces winning Jeopardy, creating recipes with tamarind, and curing assorted diseases. And IBM offers plain old search as part of DB2 and its content management products plus some products obtained via acquisition. Cybertap search, anyone? When someone installs what used to be OmniFind, I thought IBM was providing an enterprise class information retrieval solution. Guess I am wrong again.
Net net: Gartner has prepared the ground for a raft of follow on analyses. I would suggest that you purchase a copy of the July 2014 Gartner search report. You may be able to get your bearings so you can answer these questions:
- What are the functional differences among the enterprise search systems?
- How does the HP Autonomy “solution” compare to the pre-HP Autonomy solution?
- What is the cost of a Google Search Appliance compared to a competing product from Maxxcat or Thunderstone? (Yep, two more vendors not in the Gartner sample.)
- What causes a company to move from being a challenger in search to a niche player?
- What makes both a printer company and a Microsoft-centric solution qualified to match up with Google and HP Autonomy in enterprise search?
- What are the licensing costs, customizing costs, optimizing costs, and scaling costs of each company’s enterprise search solution? (You can find the going rate for the Google Search Appliance at www.gsaadvantage.gov. The other 18? Good luck.)
I will leave you to your enterprise search missions. Remember. Gartner, unlike some other mid-tier consulting firms, makes an effort to try to talk about what its consultants perceive as concrete aspects of information retrieval. Other outfits not so much. That’s why I remain confused about the IDC KQ (knowledge quotient) thing, the meaning of hidden value, and unlocking. Is information like a bike padlock?
Stephen E Arnold, July 31, 2014
July 28, 2014
Shortly after writing the first draft of Google: The Digital Gutenberg, “Enterprise Findability without the Complexity” became available on the Google Web site. You can find this eight page polemic at http://bit.ly/1rKwyhd or you can search for the title on—what else?—Google.com.
Six years after the document became available, Google’s anonymous marketer/writer raised several interesting points about enterprise search. The document appeared just as the enterprise search sector was undergoing another major transformation. Fast Search & Transfer struggled to deliver robust revenues and a few months before the Google document became available, Microsoft paid $1.2 billion for what was another enterprise search flame out. As you may recall, in 2008, Convera was essentially non operational as an enterprise search vendor. In 2005, Autonomy bought the once high flying Verity and was exerting its considerable management talent to become the first enterprise search vendor to top $500 million in revenues. Endeca was flush with Intel and SAP cash, passing on other types of financial instruments due to the economic downturn. Endeca lagged behind Autonomy in revenues and there was little hope that Endeca could close the gap between it and Autonomy.
Secondary enterprise search companies were struggling to generate robust top line revenues. Enterprise search was not a popular term. Companies from Coveo to Sphinx sought to describe their information retrieval systems in terms of functions like customer support or database access to content stored in MySQL. Vivisimo donned a variety of descriptions, culminating in its “reinvention” as a Big Data tool, not a metasearch system with a nifty on the fly clustering algorithm. IBM was becoming more infatuated with open source search as a way to shift development and bug fixes to a “community” working for the benefit of other like minded developers.
Google’s depiction of the complexity of traditional enterprise search solutions. The GSA is, of course, less complex—at least on the surface exposed to an administrator.
Google’s Findability document identified a number of important problems associated with traditional enterprise search solutions. To Google’s credit, the company did not point out that the majority of enterprise search vendors (regardless of the verbal plumage used to describe information retrieval) were either losing money or engaged in a somewhat frantic quest for financing and sales.
Here are the issues Google highlighted:
- Users of search systems are frustrated
- Enterprise search is complex. Google used the word “daunting”, which was and still is accurate
- Few systems handle file shares, Intranets, databases, content management systems, and real time business applications with aplomb. Of course, the Google enterprise search solution does deliver on these points, asserted Google.
Furthermore, Google provides integrated search results. The idea is that structured and unstructured information from different sources is presented in a form that Google called “integrated search results.”
Google also emphasized a personalized experience. Due to the marketing nature of the Findability document, Google did not point out that personalization was a feature of information retrieval systems lashed to an alert and work flow component. Fulcrum Technologies offered a clumsy option for personalization. iPhrase improved on the approach. Even Endeca supported roles, important for the company’s work at Fidelity Investments in the UK. But for Google, most enterprise search systems were not personalizing with Google aplomb.
Google then trotted out the old chestnuts gleaned from a lunch discussion with other Googlers and sifting competitors’ assertions, consultants’ pronouncements, and beliefs about search that seemed to be self-evident truths; for example:
- Improved customer service
- Speeding innovation
- Reducing information technology costs
- Accelerating adoption of search by employees who don’t get with the program.
Google concluded the Findability document with what has become a touchstone for the value of the Google Search Appliance. Kimberly Clark, “a global health and hygiene company,” reduced administrative costs for indexing 22 million documents. The costs of the Google Search Appliance, the consultant fees, and the extras like GSA fail over provisions were not mentioned. Hard numbers, even for Google, are not part of the important stuff about enterprise search.
One interesting semantic feature caught my attention. Google does not use the word knowledge in this 2008 document.
- Was Google unaware of the fusion of information retrieval and knowledge?
- Does the Google Search Appliance deliver a laundry list of results, not knowledge? (A GSA user has to scan the results, click on links, and figure out what’s important to the matter at hand, so the word “knowledge” is inappropriate.)
- Why did Google sidestep providing concrete information about costs, productivity, and the value of indexing more content that is allegedly germane to a “personalized” search experience? Are there data to support the implicit assertion “more is better”? Returning more results may mean that the poor user has to do more digging to find useful information. What about a few, on point results? Well, that’s not what today’s technology delivers. It is a fiction about which vendors and customers seem to suspend disbelief.
With a few minor edits, for example a genuflection to “knowledge,” this 2008 Findability essay is as fresh today as it was when Google output its PDF version.
First, the freshness of the Findability paper underscores the staleness and stasis of enterprise search in the past six years. If you scan the free search vendor profiles at www.xenky.com/vendor-profiles, explanations of the benefits and functions of search from the 1980s are also applicable today. Search, the enterprise variety, seems to be like a Grecian urn which “time cannot wither.”
Second, the assertions about the strengths and weaknesses of search were and still are presented without supporting facts. Everyone in the enterprise search business recycles the same cant. The approach reminds me of my experience questioning a member of a sect. The answer “It just is…” is simply not good enough.
Third, the Google Search Appliance has become a solution that costs as much as, if not more than, other big dollar systems. Just run a query for the Google Search Appliance on www.gsaadvantage.gov and check out the options and pricing. Little wonder that low cost solutions, whether they are better or worse than expensive systems, are in vogue. Elasticsearch and Searchdaimon can be downloaded without charge. A hosted version is available from Qbox.com and is relatively free of headaches and seven figure charges.
Net net: Enterprise search is going to have to come up with some compelling arguments to gain momentum in a world of Big Data, open source, and once burned twice shy buyers. I wonder why venture / investment firms continue to pump money into what is same old search packaged with decades old lingo.
I suppose the idea that a venture funded operation like Attivio, BA Insight, Coveo, or any other company pitching information access will become the next Google is powerful. The problem is that Google does not seem capable of making its own enterprise search solution into another Google.
This is indeed interesting.
Stephen E Arnold, July 28, 2014
July 24, 2014
“Myths and Misreporting About Malaysia Airlines Flight 17” is an interesting article. I found the examples of misinformation, disinformation, and reformation thought provoking. The write up spotlights a few examples of fake or distorted information about an airline’s doomed flight.
As I considered the article and its appearance in a number of news alerting services, I shifted from the cleverness of the content to a larger and more interesting issue. From the revelations about software that can alter inputs to an online survey (see this link) to fake out “real” news, determining what’s sort of accurate from what’s totally bogus is becoming more and more difficult. I have professional researchers, librarians, and paralegals at my disposal. Most people do not. No longer surprising to me is the email from one of the editors working to fact check my for fee columns. The questions range from “Did IBM Watson invent a recipe with tamarind in its sauce?” to “Do you have a source for the purchase price of Vivisimo?” Now I include online links for the facts and let the editors look up my source without the intermediating email. Even then, there is a sense of wonderment when an editor expresses surprise that what he or she believed is, in fact, either partially true, bogus, or unexpected. Example: “Why do French search vendors feel compelled to throw themselves at the US market despite the historically low success rates?” The answer is anchored in [a] French tax regulations, [b] French culture, particularly when a scruffy entrepreneur from the wrong side of the educational tracks tries to connect with a French money source from the right side of the educational tracks, [c] the lousy financial environment for certain high technology endeavors, and [d] selling to the big US markets looks like a slam dunk, at least for a while.
The reason for the disconnect between factoids and information manipulation boils down to a handful of factors. Let me highlight several:
First, the need for traffic to Web sites (desktop, mobile, app instances, etc.) is climbing up the hierarchy of business / personal needs. You want traffic today? The choices are limited. Pay Google $25,000 or more a month. Pay an SEO (search engine optimization) “expert” whatever you can negotiate. Create content, do traditional marketing, and trust that the traffic follows the “if you build it they will come” pipedream. Most folks just whack at getting traffic and use increasingly SEOized headlines as a low cost way of attracting attention. Think headlines from the National Enquirer in the 1980s.
Second, Google has to pump lots of money into plumbing, infrastructure, moon shots, and operational costs (three months at the Stanford Psych unit, anyone?). At the same time, mobile is getting hot. Two problems plague the sunny world of the GOOG. [a] Revenue from mobile ads is less than from traditional ads. Therefore, Google has to find a way to keep that 2006 style revenue flowing. Because there is a systemic shift, the GOOG needs money. One way to get it is to think about Adwords as a machine that needs tweaking. How does one sell Adwords to those who do not buy enough today? You ponder the question, but it involves traffic to a Web site. [b] Google gets bigger so the “think cheap” days of yore are easier to talk about than deliver. A 15 year old company is getting more and more expensive to run. The upcoming battles with Amazon and Samsung will not be cheap. The housing developments, the Loon balloons, the jet fleet, smart people, and other oddments of the company are money pits. If the British government can fiddle traffic, is it possible that others have this capability too?
Third, marketing, an easy whipping boy or girl as the case may be. After spending lots and lots on Web sites and apps, some outfits’ CFOs are asking, “What do we get for this spending?” In order to “prove” their worth and stop the whipping, marketers have kicked into overdrive. Baloney, specious, half baked, crazy, and recycled content is generated by the terabyte drive. The old fashioned ideas about verification, accuracy, and provenance are kicked to the side of the road.
Net net: running a query on a search engine, accepting the veracity of a long form article, or just finding out what happened at an event is very difficult. The fixes are not palatable to some people. Others are content to believe that their Internet or Internet search engine dispenses wisdom like the oracle at Delphi. Who knew the “oracles” relied on confusing entrances, various substances, and stage tricks to get their story across.
We now consult digital Delphis. How is that working out when you search for information to address a business problem, find a person who can use finger manipulation to relax a horse’s muscle, or determine if a company is what its Web site says it is?
Stephen E Arnold, July 24, 2014
July 10, 2014
Editor’s Note: This is information that did not make Stephen E Arnold’s bylined article in Information Today. That forthcoming Information Today story is about French search and content processing companies entering the US market. Spoiler alert: The revenue opportunities and taxes appear to be better in the US than in France. Maybe a French company will be the Next Big Thing in search and content processing. Few French companies have gained significant search and retrieval traction in the US in the last few years. Arguably, the most successful firm is the image recognition outfit called A2iA. It seems that the efforts of French information retrieval companies to enter the US market have been lengthy, expensive, and difficult. One French company is trying a different approach, and that’s the core of the Information Today story.
In 1999, I learned about a Swiss enterprise search system. The working name, according to my Overflight archive, was AMI Albert. The “AMI” did not mean friend. AMI is shorthand for Automatic Message Interpreter.
Flash forward to 2014. Note that a Google query for “AMI” may return hits for AMI International, a defense oriented company, as well as hits for American Megatrends, Advanced Metering Infrastructure, ambient intelligence, the Association Montessori International, and dozens of other organizations sharing the acronym. In an age of Google, finding a specific company can be a challenge, and the shared acronym may inhibit some potential customers’ ability to locate a specific vendor. (This is a problem shared by Thunderstone, for example. The game company makes it tough to locate information about the search appliance vendor.)
Basic search interface as of 2011.
Every time I update my files, I struggle to get specific information. Invariably I get an email from an AMI Software sales person telling me, “Yes, we are growing. We are very much a dynamic force in market intelligence.”
The UK Web site for the firm is www.amisw.co.uk. The French language Web site for the company is http://www.amisw.com/fr/. And the English language version of the French Web site is at http://www.amisw.com/fr/. The company’s blog is at http://www.amisw.com/fr/blog/, but the content is stale. The most recent update as of July 7, 2014, is from December 2013. The company seems to have shifted its dissemination of news to LinkedIn, where more than 30 AMI employees have a LinkedIn presence. The blog is in French. The LinkedIn postings are in English. Most of the AMI videos are in French as well.
Advanced Search Interface as of 2011.
The Managing Director, according to www.amisw.com/fr, is Alain Beauvieux. The person in charge of products is Eric Fourboul. The UK sales manager is Mike Alderton.
Mr. Beauvieux is a former IBMer and worked at LexiQuest, which was formerly Erli, S.A. LexiQuest (Clementine) was acquired by SPSS. SPSS was, in turn, acquired by IBM, joining other long-in-the-tooth technologies marketed today by IBM. Eric Fourboul is a former Dassault professional, and he has some Microsoft DNA in his background.
June 30, 2014
I returned from a brief visit to Europe to an email asking about Rocket Software’s breakthrough technology AeroText. I poked around in my archive and found a handful of nuggets about the General Electric Laboratories’ technology that migrated to Martin Marietta, then to Lockheed Martin, and finally in 2008 to the low profile Rocket Software, an IBM partner.
When did the text extraction software emerge? Is Rocket Software AeroText a “new kid on the block”? The short answer is that AeroText is pushing 30, maybe 35 years young.
Digging into My Archive of Search Info
As far as my archive goes, it looks as though the roots of AeroText are anchored in the 1980s. Yep, that works out to an innovation about the same age as the long in the tooth ISYS Search system, now owned by Lexmark. Over the years, the AeroText “product” has evolved, often in response to US government funding opportunities. The precursor to AeroText was an academic exercise at General Electric. Keep in mind that GE makes jet engines, so GE at one time had a keen interest in anything its aerospace customers in the US government thought was a hot tamale.
The AeroText interface circa the mid 2000s. On the left is the extraction window. On the right is the document window. From “Information Extraction Tools: Deciphering Human Language,” IT Pro, November/December 2004, page 28.
The GE project, according to my notes, appeared as NLToolset, although my files contained references to different descriptions such as Shogun. GE’s team of academics and “real” employees developed a bundle of tools for its aerospace activities and in response to Tipster. (As a side note, in 2001, there were a number of Tipster related documents in the www.firstgov.gov system. But the new www.usa.gov index does not include that information. You will have to do your own searching to unearth these text processing jump start documents.)
The aerospace connection is important because the Department of Defense in the 1980s was trying to standardize on markup for documents. Part of this effort was processing content like technical manuals and various types of unstructured content to figure out who was named, what part was what, and what people, places, events, and things were mentioned in digital content. The utility of NLToolset type software was for cost reduction associated with documents and the intelligence value of processed information.
The need for a markup system that worked without 100 percent human indexing was important. GE got with the program and appears to have assigned some then-young folks to the project. The government speak for this type of content processing involves terms like “message understanding” or MU, “entity extraction,” and “relationship mapping.” The outputs of an NLToolset system were intended for use in other software subsystems that could count, process, and perform other operations on the tagged content. Today, this class of software would be packaged under a broad term like “text mining.” GE exited the business, which ended up in the hands of Martin Marietta. When the technology landed at Martin Marietta, the suite of tools was used in what was called, in the late 1980s and early 1990s, the Louella Parsing System. When Lockheed and Martin Marietta merged to form the giant Lockheed Martin, Louella was renamed AeroText.
Over the years, the AeroText system competed with LingPipe, SRA’s NetOwl, and Inxight’s tools. In the heyday of natural language processing, there were dozens and dozens of universities and start ups competing for Federal funding. I have mentioned in other articles the importance of the US government in jump starting the craziness in search and content processing.
In 2005, I recall that Lockheed Martin released AeroText 5.1 for Linux, but I have lost track of the open source versions of the system. The point is that AeroText is not particularly new, and as far as I know, the last major upgrade took place in 2007 before Lockheed Martin sold the property to Rocket Software. At the time of the sale, AeroText incorporated a number of subsystems, including a useful time plotting feature. A user could see tagged events on a timeline, a function long associated with the original version of i2’s Analyst’s Notebook. A US government buyer can obtain AeroText via the GSA because Lockheed Martin seems to be a reseller of the technology. Before the sale to Rocket, Lockheed Martin followed SAIC’s push into Australia. Lockheed signed up NetMap Analytics to handle Australia’s appetite for US government accepted systems.
What does AeroText purport to do that caused the person who contacted me to see a 1980s technology as the next best thing to sliced bread?
AeroText is an extraction tool; that is, it has capabilities to identify and tag entities at somewhere between 50 percent and 80 percent accuracy. (See NIST 2007 Automatic Content Extraction Evaluation Official Results for more detail.)
The AeroText approach uses knowledgebases, rules, and patterns to identify and tag pre-specified types of information. AeroText references patterns and templates, both of which assume the licensee knows beforehand what is needed and what will happen to processed content.
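To make the “knowledgebases, rules, and patterns” idea concrete, here is a minimal Python sketch of that general style of extraction. It is not AeroText code and does not reflect AeroText’s actual formats or interfaces. The knowledgebase entries, the pattern rules, and the function name `extract_entities` are all assumptions made up for illustration.

```python
import re

# Hypothetical knowledgebase: entity type -> names the licensee listed in advance.
KNOWLEDGEBASE = {
    "ORGANIZATION": ["Lockheed Martin", "Rocket Software", "General Electric"],
    "PERSON": ["Donald Rumsfeld"],
}

# Hypothetical pattern rules: entity type -> regular expression.
PATTERNS = {
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?\s(?:million|billion)"),
}


def extract_entities(text):
    """Return (start, end, entity_type, surface_text) tuples in document order."""
    found = []
    # Knowledgebase lookup: only names listed ahead of time can be tagged.
    for entity_type, names in KNOWLEDGEBASE.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.end(), entity_type, match.group()))
    # Pattern rules: anything matching a pre-written expression is tagged.
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), entity_type, match.group()))
    return sorted(found)


sample = "Lockheed Martin sold AeroText to Rocket Software in 2008."
for start, end, entity_type, surface in extract_entities(sample):
    print(f"{entity_type:<13} {surface}")
```

In this toy run the string “AeroText” is not tagged, because nobody put it in the knowledgebase or wrote a pattern for it. The sketch finds only what it was told to look for.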
In my view, the licensee has to know what he or she is looking for in order to find it. This is a problem captured in the famous snippet, “You don’t know what you don’t know” and the “unknown unknowns” variation popularized by Donald Rumsfeld. Obviously, without prior knowledge, the utility of an AeroText-type of system has to be matched to mission requirements. AeroText pounded the drum for the semantic Web revolution. One of AeroText’s key functions was its ability to perform the type of markup the Department of Defense required of its XML. The US DoD used a variant called DAML or DARPA Agent Markup Language. Natural language processing, Louella, and AeroText collected the dust of SPARQL, unifying logic, RDF, OWL, ontologies, and other semantic baggage as the system evolved through time.
Also, staff (headcount) and on-going services are required to keep a Louella/AeroText-type system generating relevant and usable outputs. AeroText can find entities, figure out relationships like person to person and person to organization, and tag events like a merger or an arrest “event.” In one briefing about AeroText I attended, I recall that the presenter emphasized that AeroText did not require training. (The subtext for those in the know was that Autonomy required training to deliver actionable outputs. The presenter did not dwell on the need for manual fiddling with AeroText’s knowledgebases, and I did not raise this issue.)
June 11, 2014
The news of the $70 million injected into Elasticsearch caused me to check out Crunchbase and some other sources of funding data. I looked at a handful of search and content processing vendors in the departures lounge. I am supposed to be retired, but Zurich beckons.
How large is the market for search and content processing software and services? As a former laborer in the vineyards of Halliburton Nuclear and Booz, Allen & Hamilton, I learned that the answer is, “You can charge as much as you want when the customer is in a corner.” The flipside of this adage is, “You can’t charge as much when there are many low cost options.”
In my view, search—regardless of the window dressing slapped on decades old systems and methods—is sort of yesterday. One of the goslings posted a list of Hewlett Packard’s verbal arabesques to explain IDOL search as everything EXCEPT search. The HP verbal arabesques make my point:
Search is not going to generate big money going forward.
Is search (regardless of the words used to describe it) a money pit like the one the Tom Hanks motion picture made vivid?
For that reason, I am wondering what investors are thinking as they pump money into search and content processing companies. The largest revenue generator in the search sector is either Google or Autonomy. Google, as you may know, is in the online advertising business. Search is a Trojan horse. Search is free and the clicks trigger the GoTo/Overture mechanism that caused Google’s moment of inspiration. Before the Google IPO, Google ponied up some dough to Yahoo regarding alleged borrowing of pay to play methods.
Autonomy focused on the enterprise. Between 1996 and October 2011, Sir Michael Lynch grew the company to about $1 billion in revenues. HP’s prescient and always interesting management paid $10.3 billion for Autonomy and then wrote off $8 billion, aimed allegations at the company, and, in general, made it clear that HP was essentially a printer ink business with what seems to be great faith in IDOL, DRE, and assorted rich media tools.
More recently, IBM, the subject of an entertaining analysis, The Decline and Fall of IBM by Robert X. Cringely, suggested that Watson would grow to be a $10 billion in revenue business. Not a goal to ignore. The fact is that Watson is a collection of home grown widgets and open source search technology. I think Watson’s last search contribution was creating a recipe for a tamarind flavored sauce. IBM is probably staffed with folks smarter than I. But a billion dollar bet with a goal of building a revenue stream 10 to 12 times greater than Autonomy’s in one third the time? Wowza.
Let’s do some simple addition in the elegant United lounge.
Let’s assume that IBM and HP actually generate the billions necessary to recover the cost of IDOL and hit the crazy IBM goal of $10 billion in four or five years. To make the math simple, skip interest, the cost of assuaging stakeholders, and the money needed to close deals that total $20 to $25 billion. HP pumps up Autonomy to $10 or $11 billion and IBM tallies another $10 to $12 billion.
So, HP and IBM need or want to build $10 billion or more in revenues from their respective search and content processing ventures. I estimated that the market for “search” was about $1.3 billion in 2006. I am not too sure that market has grown by a significant factor since the economic headwinds began blowing through carpetland.
Now consider the monies invested in some search and content processing companies.
Attensity (sentiment analysis), $90 million
BA Insight (Microsoft centric, search and business intelligence), $14.5 million
Content Analyst (text analysis, SAIC technology), $7.0 million
Coveo (originally all Microsoft all the time, now kitchen sink vendor), $34.7 million
Digital Reasoning (text analysis, no shipping product), $4.2 million
EasyAsk (natural language processing, several owners), $20 million
Elasticsearch (open source search and consulting), $104 million
Hakia (semantic search), $23.5 million
MarkLogic (XML data management and kitchen sink apps), $73.6 million
Recorded Future (text analysis of Web content), $20.9 million
Recommind (similar to Autonomy method), $15 million
Sinequa (proprietary search and widgets), $5.3 million
X1 (search and new management), $12.2 million
ZyLab (search and licensed visualizations), $2.4 million
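The simple addition can be run directly. The sketch below totals the funding figures exactly as listed above and compares the sum with the roughly $1.3 billion market estimate for 2006 mentioned earlier. The variable names are mine, and the comparison assumes the listed amounts are complete and comparable, which they may not be.

```python
# Funding figures from the list above, in millions of US dollars.
funding_musd = {
    "Attensity": 90.0,
    "BA Insight": 14.5,
    "Content Analyst": 7.0,
    "Coveo": 34.7,
    "Digital Reasoning": 4.2,
    "EasyAsk": 20.0,
    "Elasticsearch": 104.0,
    "Hakia": 23.5,
    "MarkLogic": 73.6,
    "Recorded Future": 20.9,
    "Recommind": 15.0,
    "Sinequa": 5.3,
    "X1": 12.2,
    "ZyLab": 2.4,
}

# The 2006 market size estimate cited earlier in the post, in millions.
MARKET_2006_MUSD = 1300.0

total_musd = round(sum(funding_musd.values()), 1)
share_of_market = total_musd / MARKET_2006_MUSD

print(f"Companies listed: {len(funding_musd)}")
print(f"Total invested: ${total_musd} million")
print(f"Share of the 2006 market estimate: {share_of_market:.0%}")
```

The fourteen companies listed account for about $427 million in investment, or roughly one third of that 2006 market estimate.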
May 24, 2014
Most people don’t know that I lived in Brazil in the period before the sheep’s foot rollers crunched through the Brazilian rain forest. The environmental adjustment was due to the need to prepare for the massive Trans Amazon Highway. When the project began to take shape, preparations had to be made. By the time Rodovia Transamazônica became “official,” decades of political and economic preparation had been underway. By the mid 1950s, the need for BR 153 was evident to anyone who tried to go west from any major Brazilian city. It was an airplane or weeks, maybe months, of multi-modal transportation. Need to get across a stream? Chop down trees and put up a “bridge.”
Pretty darned effective I learned first hand. Source: http://bit.ly/1r3uFMY
I recall riding in a Caterpillar bulldozer equipped with two sets of sheep’s foot rollers. Push through the jungle and then drag the rollers over the trees, slow moving animals, and the occasional native’s house, and you are ready to get down to road building. My father, never the environmentally sensitive type, explained that heavy equipment and bulldozing were beautiful: fast, cheap, effective, and potent. And even I, as a child, understood that the natives had to find their future elsewhere. Once the heavy equipment rolled through, the old ways were toast.
I fondly recalled these early lessons from my father, the giant US company for whom he labored as Managing Director, and the stunned look on the faces of the people who lived in the forest and scrubland as we rolled through. In my mind’s eye, I imagine the Hachette professionals have that same look: A mixture of surprise, anger, and confusion. The heavy equipment drivers just shifted gears and crushed forward.
I read “As Publishers Fight Amazon, Books Vanish.” Interesting because the company appears to be bulldozing its way through traditional book publishing. My thought is that when the bulldozers finish, the old way is either gone or too expensive to continue. Savvy natives packed up and moved to favelas and reinvented themselves. Some were entrepreneurs and others tried to recapture a life in a transformed environment.
Digital bulldozers transform business process landscapes with speed and brutal efficiency. My father would have been proud of this approach to business. His one regret would be that Amazon’s corporate colors were not the flashy yellow and black that he so loved.
There were a couple of points in the “real” journalism article I noted. Let me highlight each and make a short comment.
First, “The literary community is fearful and outraged, and practically begging for government intervention.” My thought, “Once the forest has been bulldozed, it is tough to regrow.”
Second, “But the real prize is control of e-books, the future of publishing.” My thought, isn’t the future clear? Hasn’t Amazon won? If it had not won, why then the surprise that the bulldozer crushed traditional business processes the way the bulldozer took out the natives’ houses?
Third, the statement “If this is the new American way [attributed to writer and former advertising professional James Patterson], then maybe it has to be changed—by law, if necessary—immediately, if not sooner.” Catchy statement, but I thought, isn’t it too late? Regrowing that jungle and moving the natives back is a somewhat tough task.
Fourth, Amazon allegedly has been making it tough to buy a biography critical of former Wall Street quant Jeff Bezos. My father did not give interviews either. Guess what? The highway was built through the gut of the Amazon.
And the parable?
Once the landscape is changed, going back gets tough. Modern life is not congruent to Rousseau’s fantasy.
Parts of the Transamazonian experience look like Paramus, New Jersey. Image source: http://bit.ly/1kdwdPz
Amazon, like Google, has been operating for many years, pursuing the same goals, using the mechanisms of online, and building support from people who spend money.
Maybe governments are more powerful than Amazon, Google, and Facebook? The reality, however, is that the bulldozers have already rolled through. The dispossessed, annoyed, and confused can talk. It is going to be very difficult to restore the jungle and the previous way of life.
By the way, search doesn’t work too well on Amazon to begin with. Not being able to find a book is par for Amazon’s course. Bad search helps sales and Amazon’s imperative. I have learned to live with it. Perhaps the publishers, authors, and real journalists should follow my example. Adapt and move on. Yelling at a bulldozer driver and throwing rocks doesn’t change reality.
Stephen E Arnold, May 23, 2014