February 6, 2015
One of Vivisimo’s founders, Jerome Pesenti, seems to be the voice of IBM Watson. Vivisimo was a metasearch system with hit clustering. The company went through several management arabesques and was sold to IBM in 2012. Vivisimo pitched its system as a federated search engine. The configuration method, as I recall, required Jerome level input. In one installation, I learned that the Vivisimo system hit a wall when 250,000 documents were processed. There were work arounds, but these too required humans who knew the ins and outs of Vivisimo.
I recall that prior to the sale of Vivisimo to IBM, Vivisimo shifted to a government consulting services focus. Many search vendors in the hay day of the buy outs followed this path. License fees were not generating the cash the spreadsheet jockeys funding outfits like Endeca, Exalead, and Vivisimo envisioned. No problem. Some organizations wanted proprietary content processing systems and figured that it was time to sell out. The Big Dog of sell outs was Hewlett Packard’s $11 billion purchase of Autonomy. Vivisimo fetched about $20 million or one year’s projected revenue according to the stockholder familiar with the deal suggested.
Fast forward two or three years and Vivisimo is now Watson. Oh, Vivisimo is also a Big Data solution, not a metasearch engine. I assume the index limits have been addressed. I am thinking about IBM Watson for two reasons:
- IBM is going through a staff reduction. I assume this action was determined by querying the super smart Watson system
- I read “Five New Services Expand IBM Watson Capabilities to Images, Speech, and More,” an IBM in house marketing article.
To my surprise there was a significant shift in Watson marketing; to wit, there are now links to demos of IBM’s text to speech service, image recognition service, relationship analysis service, and something called tradeoff analytics. Now demos are helpful. So is the Watson “great video” about concept insights.
I ran the suggested query for “quantum physics.” Remember I used to work at Halliburton Nuclear Services. Here’s what I saw:
I noticed that each of the experts in the human resources database use the word “quantum” to describe their background.
I then ran a query for “tamarind,” one of the ingredients in a barbeque sauce created by Watson during its recipe phase. Here’s what I saw:
There is no recipe, nor is there an IBM person listing the barbeque recipe as his or her work. I was surprised. No tamarind wizard in the data set.
I asked myself, “Can’t I do this with Elasticsearch?” The answer my mind generated was, “No. No. No. You silly oaf. Watson uses Lucene but it is much, much more.”
How confident are the Watson workers who have dodged IBM layoffs?
What happens if Watson with Vivisimo, iPhrase, WebFountain, and assorted Almaden semantic goodies are aced by Hewlett Packard Autonomy or—heaven forbid—Amazon?
Will Dr. Pesenti be able to build a business that is orders of magnitude larger than Vivisimo’s revenue?
Interesting stuff. Not CyberOSINT level work, but interesting. I wonder why the i2 and related technologies are not pushed more aggressively. i2 works. (Note: I was a consultant to i2 prior to IBM’s purchase of the company.)
Stephen E Arnold, February 6, 2015
February 2, 2015
I find the complaints about Google’s inability to handle time amusing. On the surface, Google seems to demote, ignore, or just not understand the concept of time. For the vast majority of Google service users, Google is no substitute for the users’ investment of time and effort into dating items. But for the wide, wide Google audience, ads, not time, are more important.
Does Google really get an F in time? The answer is, “Nope.”
In CyberOSINT: Next Generation Information Access I explain that Google’s time sense is well developed and of considerable importance to next generation solutions the company hopes to offer. Why the craw fishing? Well, Apple could just buy Google and make the bitter taste of the Apple Board of Directors’ experience a thing of the past.
Now to temporal matters in the here and now.
CyberOSINT relies on automated collection, analysis, and report generation. In order to make sense of data and information crunched by an NGIA system, time is a really key metatag item. To figure out time, a system has to understand:
- The date and time stamp
- Versioning (previous, current, and future document, data items, and fact iterations)
- Times and dates contained in a structured data table
- Times and dates embedded in content objects themselves; for example, a reference to “last week” or in some cases, optical character recognition of the data on a surveillance tape image.
For the average query, this type of time detail is overkill. The “time and date” of an event, therefore, requires disambiguation, determination and tagging of specific time types, and then capturing the date and time data with markers for document or data versions.
A simplification of Recorded Future’s handling of unstructured data. The system can also handle structured data and a range of other data management content types. Image copyright Recorded Future 2014.
Sounds like a lot of computational and technical work.
In CyberOSINT, I describe Google’s and In-Q-Tel’s investments in Recorded Future, one of the data forward NGIA companies. Recorded Future has wizards who developed the Spotfire system which is now part of the Tibco service. There are Xooglers like Jason Hines. There are assorted wizards from Sweden, countries the most US high school software cannot locate on a map, and assorted veterans of high technology start ups.
An NGIA system delivers actionable information to a human or to another system. Conversely a licensee can build and integrate new solutions on top of the Recorded Future technology. One of the company’s key inventions is numerical recipes that deal effectively with the notion of “time.” Recorded Future uses the name “Tempora” as shorthand for the advanced technology that makes time along with predictive algorithms part of the Recorded Future solution.
January 30, 2015
In “CyberOSINT: Next Generation Information Access,” I describe Autonomy’s math-first approach to content processing. The reason is that after the veil of secrecy was lifted with regard to the signal processing`methods used for British intelligence tasks, Cambridge University became one of the hot beds for the use of Bayesian, LaPlacian, and Markov methods. These numerical recipes proved to be both important and controversial. Instead of relying on manual methods, humans selected training sets, tuned the thresholds, and then turned the smart software loose. Math is not required to understand what Autonomy packaged for commercial use: Signal processing separated noise in a channel and allowed software to process the important bits. Thank you, Claude Shannon and the good Reverend Bayes.
What did Autonomy receive for this breakthrough? Not much but the company did generate more than $600 million in revenues about 10 years after opening for business. As far as I know, no other content processing vendor has reached this revenue target. Endeca, for the sake of comparison, flat lined at about $130 million in the year that Oracle bought the Guided Navigation outfit for about $1.0 billion.
For one thing the British company BAE (British Aerospace Engineering) licensed the Autonomy system and began to refine its automated collection, analysis, and report systems. So what? The UK became by the late 1990s the de facto leader in automated content activities. Was BAE the only smart outfit in the late 1990s? Nope, there were other outfits who realized the value of the Autonomy approach. Examples range from US government entities to little known outfits like the Wynyard Group.
In the CyberOSINT volume, you can get more detail about why Autonomy was important in the late 1990s, including the name of the university8 professor who encouraged Mike Lynch to make contributions that have had a profound impact on intelligence activities. For color, let me mention an anecdote that is not in the 176 page volume. Please, keep in mind that Autonomy was, like i2 (another Cambridge University spawned outfit) a client prior to my retirement.) IBM owns i2 and i2 is profiled in CyberOSINT in Chapter 5, “CyberOSINT Vendors.” I would point out that more than two thirds of the monograph contains information that is either not widely available or not available via a routine Bing, Google, or Yandex query. For example, Autonomy does not make publicly available a list of its patent documents. These contain specific information about how to think about cyber OSINT and moving beyond keyword search.
Some Color: A Conversation with a Faux Expert
In 2003 I had a conversation with a fellow who was an “expert” in content management, a discipline that is essentially a step child of database technology. I want to mention this person by name, but I will avoid the inevitable letter from his attorney rattling a saber over my head. This person publishes reports, engages in litigation with his partners, kowtows to various faux trade groups, and tries to keep secret his history as a webmaster with some Stone Age skills.
Not surprisingly this canny individual had little good to say about Autonomy. The information I provided about the Lynch technology, its applications, and its importance in next generation search were dismissed with a comment I will not forget, “Autonomy is a pile of crap.”
Okay, that’s an informed opinion for a clueless person pumping baloney about the value of content management as a separate technical field. Yikes.
In terms of enterprise search, Autonomy’s competitors criticized Lynch’s approach. Instead of a keyword search utility that was supposed to “unlock” content, Autonomy delivered a framework. The framework operated in an automated manner and could deliver keyword search, point and click access like the Endeca system, and more sophisticated operations associated with today’s most robust cyber OSINT solutions. Enterprise search remains stuck in the STAIRS III and RECON era. Autonomy was the embodiment of the leap from putting the burden of finding on humans to shifting the load to smart software.
A diagram from Autonomy’s patents filed in 2001. What’s interesting is that this patent cites an invention by Dr. Liz Liddy with whom the ArnoldIT team worked in the late 1990s. A number of content experts understood the value of automated methods, but Autonomy was the company able to commercialize and build a business on technology that was not widely known 15 years ago. Some universities did not teach Bayesian and related methods because these were tainted by humans who used judgments to set certain thresholds. See US 6,668,256. There are more than 100 Autonomy patent documents. How many of the experts at IDC, Forrester, Gartner, et al have actually located the documents, downloaded them, and reviewed the systems, methods, and claims? I would suggest a tiny percentage of the “experts.” Patent documents are not what English majors are expected to read.”
That’s important and little appreciated by the mid tier outfits’ experts working for IDC (yo, Dave Schubmehl, are you ramping up to recycle the NGIA angle yet?) Forrester (one of whose search experts told me at a MarkLogic event that new hires for search were told to read the information on my ArnoldIT.com Web site like that was a good thing for me), Gartner Group (the conference and content marketing outfit), Ovum (the UK counterpart to Gartner), and dozens of other outfits who understand search in terms of selling received wisdom, not insight or hands on facts.
January 29, 2015
I read “Automated Systems Replacing Traditional Search.” The write up asserts:
Stephen E. Arnold, search industry expert and author of the “Enterprise Search Report” and “The New Landscape of Search,” has announced the publication of “CyberOSINT: Next-Generation Information Access.” The 178-page report explores the tools and methods used to collect and analyze content posted in public channels such as social media sites. The new technology can identify signals that provide intelligence and law enforcement analysts early warning of threats, cyber attacks or illegal activities.
According to Robert Steele, co-founder of USMC Intelligence Activity:
NGIA systems are integrated solutions that blend software and hardware to address very specific needs. Our intelligence, law enforcement, and security professionals need more than brute force keyword search.
According to Dr. Jerry Lucas, president of Telestrategies, which operates law enforcement and training conferences in the US and elsewhere:
This is the first discussion of the innovative software that makes sense of the flood of open source digital information. Law enforcement, security, and intelligence professionals will find this an invaluable resource to identify ways to deal with Big Data.
The report complements the Telestrategies ISS seminar on CyberOSINT. Orders for the monograph, which costs $499, may be placed at www.xenky.com/cyberosint. Information about the February 19, 2015, seminar held in the DC area is at this link.
The software and methods described in the study has immediate and direct applications to commercial entities. Direct orders may be placed at http://gum.co/cyberosint.
Don Anderson, January 29, 2015
January 29, 2015
Enterprise search has been useful. However, the online access methods have changed. Unfortunately, most enterprise search systems and the enterprise applications based on keyword and category access have lagged behind user needs.
The information highway is littered with the wrecks of enterprise search vendors who promised a solution to findability challenges and failed to deliver. Some of the vendors have been forgotten by today’s keyword and category access vendors. Do you know about the business problems that disappointed licensees and cost investors millions of dollars? Are you familiar with Convera, Delphes, Entopia, Fulcrum Technologies, Hakia, Siderean Software, and many other companies.
A handful of enterprise search vendors dodged implosion by selling out. Artificial Linguistics, Autonomy, Brainware, Endeca, Exalead, Fast Search, InQuira, iPhrase, ISYS Search Software, and Triple Hop were sold. Thus, their investors received their money back and in some cases received a premium. The $11 billion paid for Autonomy dwarfed the billion dollar purchase prices of Endeca and Fast Search and Transfer. But most of the companies able to sell their information retrieval systems sold for much less. IBM acquired Vivisimo for about $20 million and promptly justified the deal by describing Vivisimo’s metasearch system as a Big Data solution. Okay.
Today a number of enterprise search vendors walk a knife edge. A loss of a major account or a misstep that spooks investors can push a company over the financial edge in the blink of an eye. Recently I noticed that Dieselpoint has not updated its Web site for a while. Antidot seems to have faded from the US market. Funnelback has turned down the volume. Hakia went offline.
A few firms generate considerable public relations noise. Attivio, BA Insight, Coveo, and IBM Watson appear to be competing to become the leaders in today’s enterprise search sector. But today’s market is very different from the world of 2003-2004 when I wrote the first of three editions of the 400 page Enterprise Search Report. Each of these companies is asserting that their system provides business intelligence, customer support, and traditional enterprise search. Will any of these companies be able to match Autonomy’s 2008 revenues of $600 million. I doubt it.
The reason is not the availability of open source search. Elasticsearch, in fact, is arguably better than any of the for fee keyword and concept centric information retrieval systems. The problems of the enterprise search sector are deeper.
January 21, 2015
Taking a deep breath I read the article. Here’s what I deduced: Word and presumably PowerPoint will get some new features:
While Office 365 offers e-discovery and information governance capabilities, Equivio develops machine learning technologies for both, meaning an integration is expected to make them “even more intelligent and easy to use.” Microsoft says the move is in line with helping its customers tackle “the legal and compliance challenges inherent in managing large quantities of email and documents.”
The Fast Search & Transfer technology is not working out? The dozens of SharePoint content enhancers are not doing their job? The grammar checker is not doing its job?
What is different is that Word is getting more machine learning:
Equivio uses machine learning to let users explore large, unstructured sets of data. The startup’s technology leverages advanced text analytics to perform multi-dimensional analyses of data collections, intelligently sort documents into themes, group near-duplicates, and isolate unique data.
Like Microsoft’s exciting adaptive menus, the new system will learn what the user wants.
Is this a next generation information access system? Is Microsoft nosing into Recorded Future territory?
Nope, but the desire to covert what the user does into metadata seems to percolate in the Microsoft innovation coffee pot.
If Microsoft pulls off this shotgun marriage, I think more pressure will be put on outfits like Content Analyst and Smartlogic.
Stephen E Arnold, January 21, 2015
January 11, 2015
I am not too keen on immortality. My view is that stuff dies. Age appropriate behavior means accepting the lot of mortal man.
But some folks want to extend their lives; others hope to live forever like the nano-stuff creatures in Alastair Reynolds’ novels.
I associate the live longer and collect stock options approach with Ray Kurzweil, the Google big thinker and music inventor. Well, I learned something in “IBM Watson’s Lab to Tackle Aging Issues.” Now Watson with its chugging heart of Lucene has lifetimes of revenue to generate before some activist investors put a bit in this pony’s mouth.
The write up says:
IBM Korea will build a cognitive computing center in Seoul to help tackle an aging society with technology. “IBM submitted a letter of intent to the Seoul Metropolitan Government last month to set up a Watson lab to study smart-aging technology,” IBM Korea said…
I found this statement remarkable because IBM has not turned Lucene and home-grown scripts into a multi billion dollar revenue stream. On the other hand, it has helped the delis close to the IBM Watson facility in Manhattan prospect.
IBM has taken major steps to develop Watson as a new business line for future success. Watson has made achievements in diagnostic medicine and cancer treatment.
The approach involves the phone and microwave company Samsung and various universities, start ups, and public relation professionals in South Korea.
I assume more details will be revealed in Technology Review, a publication that covers Watson’s twists and turns in exquisite, marketing detail.
If you want to get on the anti-ageing train, board in South Korea. Like the projected $10 billion in revenue from a Lucene based system, let me know how those crow’s feet fly.
Stephen E Arnold, January 11, 2015
December 30, 2014
Natural language processing is becoming a popular analytical tool as well as a quicker way for search and customer support. Dragon Nuance is at the tip of everyone’s tongue when NLP enters a conversation, but there are other products with their own benefits. Code Project recently reviewed three of NLP in, ”A Review Of Three Natural Language Processors, AlchemyAPI, OpenCalais, And Semantria.”
Rather than sticking readers with plain product reviews, Code Project explains what NLP is used for and how it accomplishes it. While NLP is used for vocal commands, it can do many other things: improve SEO, knowledge management, text mining, text analytics, content visualization and monetization, decision support, automatic classification, and regulatory compliance. NLP extracts entities aka proper nouns from content, then classifies, tags, and provides a sentiment score to give each entity a meaning.
In layman’s terms:
“…the primary purpose of an NLP is to extract the nouns, determine their types, and provide some “scoring” (relevance or sentiment) of the entity within the text. Using relevance, one can supposedly filter out entities to those that are most relevant in the document. Using sentiment analysis, one can determine the overall sentiment of an entity in the document, useful for determining the “tone” of the document with regards to an entity — for example, is the entity “sovereign debt” described negatively, neutrally, or positively in the document?”
NLP categorizes the human element in content. Its usefulness will become more apparent in future years, especially as people rely more and more on electronic devices for communication, consumerism, and interaction.
December 18, 2014
A smaller big data sector that specializes in text analysis to generate content and reports is burgeoning with startups. Venture Beat takes a look out how one of these startups, Narrative Science, is gaining more attention in the enterprise software market: “Narrative Science Pulls In $10M To Analyze Corporate Data And Turn It Into Text-Based Reports.”
Narrative Science started out with software that created sport and basic earnings articles for newspaper filler. It has since grown into help businesses in different industries to take their data by the digital horns and leverage it.
Narrative Science recently received $10 million in funding to further develop its software. Stuart Frankel, chief executive, is driven to help all industries save time and resources by better understanding their data
“ ‘We really want to be a technology provider to those media organizations as opposed to a company that provides media content,’ Frankel said… ‘When humans do that work…it can take weeks. We can really get that down to a matter of seconds.’”
From making content to providing technology? It is quite a leap for Narrative Science. While they appear to have a good product, what is it they exactly do?
December 12, 2014
I need this in my office. I will dump my early 1940s French posters and go for logos.
Navigate to this link: http://bit.ly/1sdmBL0. You will be able to download a copy of an infographic (poster) that summarizes “The Current State of Machine Intelligence.” There are some interesting editorial decisions; for example, the cheery Google logo turns up in deep learning, predictive APIs, automotive, and personal assistant. I quite liked the inclusion of IBM Watson in artificial intelligence—recipes with tamarind and post-video editing game show champion. I found the listing of Palantir as one of the “intelligence tools” outfits. Three observations:
- I am not sure if the landscape captures what machine intelligence is
- The categories, while brightly colored, do not make clear how a core technology can be speech recognition but not part of the “rethinking industries” category
- Shouldn’t Google be in every category?
I am confident that mid tier consultants and reputation surfers like Dave Schubmehl will find the chart a source of inspiration. Does Digital Reasoning actually have a product? The company did not make the cut for the top 60 companies in NGIA systems. Hmmm. Live and learn.
Stephen E Arnold, December 12, 2014