
SAS Text Miner Provides Valuable Predictive Analytics

March 25, 2015

If you are searching for predictive analytics software that provides in-depth text analysis with advanced linguistic capabilities, you may want to check out “SAS Text Miner.” Predictive Analytics Today runs down the features of SAS Text Miner and details how it works.

It is user-friendly software with data visualization, flexible entity options, document theme discovery, and more.

“The text analytics software provides supervised, unsupervised, and semi-supervised methods to discover previously unknown patterns in document collections.  It structures data in a numeric representation so that it can be included in advanced analytics, such as predictive analysis, data mining, and forecasting.  This version also includes insightful reports describing the results from the rule generator node, providing clarity to model training and validation results.”

SAS Text Miner includes other features that draw on automatic Boolean rule generation to categorize documents, and other rules can be exported as Boolean rules. Data sets can be made from a directory or crawled from the Web. The visual analysis feature highlights the relationships between discovered patterns and displays them using a concept link diagram. SAS Text Miner has received high praise as predictive analytics software, and it might be the solution your company is looking for.
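A Boolean categorization rule of the kind mentioned above can be pictured as a set of required terms plus a set of forbidden terms. The sketch below is illustrative Python only. It is not SAS Text Miner's rule syntax or output, and the rule names, terms, and the `categorize` helper are invented for the example.

```python
import re

# Hypothetical rules: a document matches a category when it contains every
# term in "all_of" and none of the terms in "none_of".
RULES = {
    "billing": {"all_of": {"invoice", "payment"}, "none_of": {"refund"}},
    "returns": {"all_of": {"refund"}, "none_of": set()},
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def categorize(text, rules=RULES):
    """Return the sorted names of every rule the text satisfies."""
    tokens = tokenize(text)
    return sorted(
        name
        for name, rule in rules.items()
        if rule["all_of"] <= tokens and not (rule["none_of"] & tokens)
    )

print(categorize("The invoice shows a late payment."))
print(categorize("Please issue a refund for this invoice payment."))
```

Because each rule is just a pair of term sets, a person can read and edit it directly, which is the usual argument for exporting rules in Boolean form.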

Whitney Grace, March 25, 2015
Stephen E Arnold, Publisher of CyberOSINT at

Aberdeen Consulting Labors to Pump up the Watson Balloon

March 7, 2015

I read “IBM Watson and Answering the Questions of the World with Cognitive Computing.” Darned amazing mid tier consulting firm dream spinning it is. Here’s the paragraph I noted, which is a quote from an IBM Watson guru named Rob High, the chief technical officer for Watson:

“…We’re going to see this cognitive computing capability be brought down deeper into things that we do on a daily basis. I think we’re going to find this to be the dominant form of computing in the future, especially given that to personalize all of those things is not something you can conceivably do if you had to program all the logic around that for each individual person. These systems are only going to be able to achieve that kind of personalized value if they’re able to learn, learn about you, learn about your way of interpreting the world and the way that you envision the world and the priorities that are important to you, that perhaps you find useful in how you conduct your life. That’s why I think that’s the role that cognitive computing is going to have for us, is to provide that degree of personalization.”

I can hear personalized trumpet fanfares…almost, maybe. I think I hear the heavy breathing of the number one trumpeter.

Now where’s the real sound, the sound of cash registers ringing as companies spend for Watson’s wizardry?

Stephen E Arnold, March 7, 2015

Opening Watson to the Masses

March 4, 2015

IBM is struggling financially, and one of the ways it hopes to pull itself out of the swamp is to find new applications for its supercomputers and software. One way it is trying to cash in on Watson is to create cognitive computing apps. EWeek alerts open source developers, coders, and friendly hackers that IBM released a bunch of beta services: “13 IBM Services That Simplify The Building Of Cognitive Watson Apps.”

IBM now allows all software geeks the chance to add their own input to cognitive computing. How?

“Since its creation in October 2013, the Watson Developer Cloud (WDC) has evolved into a community of over 5,000 partners who have unlocked the power of cognitive computing to build more than 6,000 apps to date. With a total of 13 beta services now available, the IBM Watson Group is quickly expanding its developer ecosystem with innovative and easy-to-use services to power entirely new classes of cognitive computing apps—apps that can learn from experience, understand natural language, identify hidden patterns and trends, and transform entire industries and professions.”

The thirteen new IBM services involve language, text processing, analytical tools, and data visualization. These services can be applied to a wide range of industries and fields, improving the way people work and interact with their data. While it’s easy to imagine the practical applications, it remains to be seen how they will actually be used.

Whitney Grace, March 04, 2015
Sponsored by, developer of Augmentext

IBM Watson Offers Demos

February 6, 2015

One of Vivisimo’s founders, Jerome Pesenti, seems to be the voice of IBM Watson. Vivisimo was a metasearch system with hit clustering. The company went through several management arabesques and was sold to IBM in 2012. Vivisimo pitched its system as a federated search engine. The configuration method, as I recall, required Jerome level input. In one installation, I learned that the Vivisimo system hit a wall when 250,000 documents were processed. There were work arounds, but these too required humans who knew the ins and outs of Vivisimo.

I recall that prior to the sale of Vivisimo to IBM, Vivisimo shifted to a government consulting services focus. Many search vendors in the heyday of the buy outs followed this path. License fees were not generating the cash the spreadsheet jockeys funding outfits like Endeca, Exalead, and Vivisimo envisioned. No problem. Some organizations wanted proprietary content processing systems and figured that it was time to sell out. The Big Dog of sell outs was Hewlett Packard’s $11 billion purchase of Autonomy. Vivisimo fetched about $20 million, or one year’s projected revenue, according to a stockholder familiar with the deal.

Fast forward two or three years and Vivisimo is now Watson. Oh, Vivisimo is also a Big Data solution, not a metasearch engine. I assume the index limits have been addressed. I am thinking about IBM Watson for two reasons:

  1. IBM is going through a staff reduction. I assume this action was determined by querying the super smart Watson system.
  2. I read “Five New Services Expand IBM Watson Capabilities to Images, Speech, and More,” an IBM in house marketing article.

To my surprise there was a significant shift in Watson marketing; to wit, there are now links to demos of IBM’s text to speech service, image recognition service, relationship analysis service, and something called tradeoff analytics. Now demos are helpful. So is the Watson “great video” about concept insights.

I ran the suggested query for “quantum physics.” Remember I used to work at Halliburton Nuclear Services. Here’s what I saw:


I noticed that each of the experts in the human resources database uses the word “quantum” to describe his or her background.

I then ran a query for “tamarind,” one of the ingredients in a barbeque sauce created by Watson during its recipe phase. Here’s what I saw:


There is no recipe, nor is there an IBM person listing the barbeque recipe as his or her work. I was surprised. No tamarind wizard in the data set.

I asked myself, “Can’t I do this with Elasticsearch?” The answer my mind generated was, “No. No. No. You silly oaf. Watson uses Lucene but it is much, much more.”

How confident are the Watson workers who have dodged IBM layoffs?

What happens if Watson with Vivisimo, iPhrase, WebFountain, and assorted Almaden semantic goodies is aced by Hewlett Packard Autonomy or, heaven forbid, Amazon?

Will Dr. Pesenti be able to build a business that is orders of magnitude larger than Vivisimo’s revenue?

Interesting stuff. Not CyberOSINT level work, but interesting. I wonder why the i2 and related technologies are not pushed more aggressively. i2 works. (Note: I was a consultant to i2 prior to IBM’s purchase of the company.)

Stephen E Arnold, February 6, 2015

Recorded Future: Google and Cyber OSINT

February 2, 2015

I find the complaints about Google’s inability to handle time amusing. On the surface, Google seems to demote, ignore, or just not understand the concept of time. For the vast majority of Google service users, Google is no substitute for the users’ investment of time and effort into dating items. But for the wide, wide Google audience, ads, not time, are more important.

Does Google really get an F in time? The answer is, “Nope.”

In CyberOSINT: Next Generation Information Access I explain that Google’s time sense is well developed and of considerable importance to next generation solutions the company hopes to offer. Why the craw fishing? Well, Apple could just buy Google and make the bitter taste of the Apple Board of Directors’ experience a thing of the past.

Now to temporal matters in the here and now.

CyberOSINT relies on automated collection, analysis, and report generation. In order to make sense of data and information crunched by an NGIA system, time is a really key metatag item. To figure out time, a system has to understand:

  • The date and time stamp
  • Versioning (previous, current, and future document, data items, and fact iterations)
  • Times and dates contained in a structured data table
  • Times and dates embedded in content objects themselves; for example, a reference to “last week” or in some cases, optical character recognition of the data on a surveillance tape image.

For the average query, this type of time detail is overkill. The “time and date” of an event, therefore, requires disambiguation, determination and tagging of specific time types, and then capturing the date and time data with markers for document or data versions.
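The disambiguation step can be sketched in a few lines: a relative phrase such as “last week” only becomes a usable date range once it is anchored to the document’s own date stamp. This is a minimal Python illustration under stated assumptions, not a description of how Google or any NGIA vendor does it. The function name, the tiny phrase list, and the Monday-through-Sunday week convention are all assumptions.

```python
from datetime import date, timedelta

def resolve_relative(phrase, doc_date):
    """Map a relative time phrase to an inclusive (start, end) date range.

    doc_date is the document's own date stamp, which anchors the phrase.
    Returns None for phrases this toy resolver does not know.
    """
    phrase = phrase.strip().lower()
    if phrase == "today":
        return doc_date, doc_date
    if phrase == "yesterday":
        day = doc_date - timedelta(days=1)
        return day, day
    if phrase == "last week":
        # Assumption: weeks run Monday through Sunday.
        this_monday = doc_date - timedelta(days=doc_date.weekday())
        start = this_monday - timedelta(days=7)
        return start, start + timedelta(days=6)
    return None

# The same phrase yields different ranges for different document versions.
for stamp in (date(2015, 2, 2), date(2015, 3, 25)):
    print(stamp, "->", resolve_relative("last week", stamp))
```

The point of the example is the dependency: without a trustworthy date stamp and version marker on the source item, the embedded phrase cannot be tagged with a specific time.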


A simplification of Recorded Future’s handling of unstructured data. The system can also handle structured data and a range of other data management content types. Image copyright Recorded Future 2014.

Sounds like a lot of computational and technical work.

In CyberOSINT, I describe Google’s and In-Q-Tel’s investments in Recorded Future, one of the data forward NGIA companies. Recorded Future has wizards who developed the Spotfire system which is now part of the Tibco service. There are Xooglers like Jason Hines. There are assorted wizards from Sweden, countries most US high school sophomores cannot locate on a map, and assorted veterans of high technology start ups.

An NGIA system delivers actionable information to a human or to another system. Conversely a licensee can build and integrate new solutions on top of the Recorded Future technology. One of the company’s key inventions is numerical recipes that deal effectively with the notion of “time.” Recorded Future uses the name “Tempora” as shorthand for the advanced technology that makes time along with predictive algorithms part of the Recorded Future solution.


Autonomy: Leading the Push Beyond Enterprise Search

January 30, 2015

In “CyberOSINT: Next Generation Information Access,” I describe Autonomy’s math-first approach to content processing. The reason is that after the veil of secrecy was lifted with regard to the signal processing methods used for British intelligence tasks, Cambridge University became one of the hot beds for the use of Bayesian, Laplacian, and Markov methods. These numerical recipes proved to be both important and controversial. Instead of relying on manual methods, humans selected training sets, tuned the thresholds, and then turned the smart software loose. Math is not required to understand what Autonomy packaged for commercial use: Signal processing separated noise in a channel and allowed software to process the important bits. Thank you, Claude Shannon and the good Reverend Bayes.
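The workflow described above (a human picks a training set, tunes a threshold, then lets the software sort content) can be illustrated with a textbook naive Bayes classifier. The Python below is a generic sketch and makes no claim to reproduce Autonomy’s patented methods. The labels, the four-document training set, and the 0.7 threshold are invented for the example.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count words per label from (text, label) training pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def posterior(text, model):
    """Return P(label | text) per label, using add-one smoothing."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}

def classify(text, model, threshold=0.7):
    """Return the best label, or None when confidence is below the threshold."""
    probs = posterior(text, model)
    label = max(probs, key=probs.get)
    return label if probs[label] >= threshold else None

# Human-selected training set (hypothetical).
training_set = [
    ("wire transfer fraud alert", "signal"),
    ("suspicious wire transfer", "signal"),
    ("lunch menu update", "noise"),
    ("office party lunch", "noise"),
]
model = train(training_set)
print(classify("wire transfer", model))      # confident enough to label
print(classify("quarterly report", model))   # unseen words, falls below threshold
```

The threshold is the human judgment call in this sketch: raise it and more items go unlabeled, lower it and the software commits to weaker guesses.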

What did Autonomy receive for this breakthrough? Not much but the company did generate more than $600 million in revenues about 10 years after opening for business. As far as I know, no other content processing vendor has reached this revenue target. Endeca, for the sake of comparison, flat lined at about $130 million in the year that Oracle bought the Guided Navigation outfit for about $1.0 billion.

For one thing the British company BAE (British Aerospace Engineering) licensed the Autonomy system and began to refine its automated collection, analysis, and report systems. So what? The UK became by the late 1990s the de facto leader in automated content activities. Was BAE the only smart outfit in the late 1990s? Nope, there were other outfits who realized the value of the Autonomy approach. Examples range from US government entities to little known outfits like the Wynyard Group.

In the CyberOSINT volume, you can get more detail about why Autonomy was important in the late 1990s, including the name of the university professor who encouraged Mike Lynch to make contributions that have had a profound impact on intelligence activities. For color, let me mention an anecdote that is not in the 176 page volume. (Please, keep in mind that Autonomy was, like i2, another Cambridge University spawned outfit, a client prior to my retirement.) IBM owns i2 and i2 is profiled in CyberOSINT in Chapter 5, “CyberOSINT Vendors.” I would point out that more than two thirds of the monograph contains information that is either not widely available or not available via a routine Bing, Google, or Yandex query. For example, Autonomy does not make publicly available a list of its patent documents. These contain specific information about how to think about cyber OSINT and moving beyond keyword search.

Some Color: A Conversation with a Faux Expert

In 2003 I had a conversation with a fellow who was an “expert” in content management, a discipline that is essentially a step child of database technology. I want to mention this person by name, but I will avoid the inevitable letter from his attorney rattling a saber over my head. This person publishes reports, engages in litigation with his partners, kowtows to various faux trade groups, and tries to keep secret his history as a webmaster with some Stone Age skills.

Not surprisingly this canny individual had little good to say about Autonomy. The information I provided about the Lynch technology, its applications, and its importance in next generation search was dismissed with a comment I will not forget, “Autonomy is a pile of crap.”

Okay, that’s an informed opinion for a clueless person pumping baloney about the value of content management as a separate technical field. Yikes.

In terms of enterprise search, Autonomy’s competitors criticized Lynch’s approach. Instead of a keyword search utility that was supposed to “unlock” content, Autonomy delivered a framework. The framework operated in an automated manner and could deliver keyword search, point and click access like the Endeca system, and more sophisticated operations associated with today’s most robust cyber OSINT solutions. Enterprise search remains stuck in the STAIRS III and RECON era. Autonomy was the embodiment of the leap from putting the burden of finding on humans to shifting the load to smart software.


A diagram from Autonomy’s patents filed in 2001. What’s interesting is that this patent cites an invention by Dr. Liz Liddy with whom the ArnoldIT team worked in the late 1990s. A number of content experts understood the value of automated methods, but Autonomy was the company able to commercialize and build a business on technology that was not widely known 15 years ago. Some universities did not teach Bayesian and related methods because these were tainted by humans who used judgments to set certain thresholds. See US 6,668,256. There are more than 100 Autonomy patent documents. How many of the experts at IDC, Forrester, Gartner, et al have actually located the documents, downloaded them, and reviewed the systems, methods, and claims? I would suggest a tiny percentage of the “experts.” Patent documents are not what English majors are expected to read.

That’s important and little appreciated by the mid tier outfits’ experts working for IDC (yo, Dave Schubmehl, are you ramping up to recycle the NGIA angle yet?) Forrester (one of whose search experts told me at a MarkLogic event that new hires for search were told to read the information on my Web site like that was a good thing for me), Gartner Group (the conference and content marketing outfit), Ovum (the UK counterpart to Gartner), and dozens of other outfits who understand search in terms of selling received wisdom, not insight or hands on facts.


Enterprise Search Pressured by Cyber Methods

January 29, 2015

I read “Automated Systems Replacing Traditional Search.” The write up asserts:

Stephen E. Arnold, search industry expert and author of the “Enterprise Search Report” and “The New Landscape of Search,” has announced the publication of “CyberOSINT: Next-Generation Information Access.” The 178-page report explores the tools and methods used to collect and analyze content posted in public channels such as social media sites. The new technology can identify signals that provide intelligence and law enforcement analysts early warning of threats, cyber attacks or illegal activities.

According to Robert Steele, co-founder of USMC Intelligence Activity:

NGIA systems are integrated solutions that blend software and hardware to address very specific needs. Our intelligence, law enforcement, and security professionals need more than brute force keyword search.

According to Dr. Jerry Lucas, president of Telestrategies, which operates law enforcement and training conferences in the US and elsewhere:

This is the first discussion of the innovative software that makes sense of the flood of open source digital information. Law enforcement, security, and intelligence professionals will find this an invaluable resource to identify ways to deal with Big Data.

The report complements the Telestrategies ISS seminar on CyberOSINT. Orders for the monograph, which costs $499, may be placed at Information about the February 19, 2015, seminar held in the DC area is at this link.

The software and methods described in the study have immediate and direct applications to commercial entities. Direct orders may be placed at

Don Anderson, January 29, 2015

Enterprise Search Problems: Why NGIA Systems Push Beyond Traditional Information Access Methods

January 29, 2015

Enterprise search has been useful. However, the online access methods have changed. Unfortunately, most enterprise search systems and the enterprise applications based on keyword and category access have lagged behind user needs.

The information highway is littered with the wrecks of enterprise search vendors who promised a solution to findability challenges and failed to deliver. Some of the vendors have been forgotten by today’s keyword and category access vendors. Do you know about the business problems that disappointed licensees and cost investors millions of dollars? Are you familiar with Convera, Delphes, Entopia, Fulcrum Technologies, Hakia, Siderean Software, and many other companies?


A handful of enterprise search vendors dodged implosion by selling out. Artificial Linguistics, Autonomy, Brainware, Endeca, Exalead, Fast Search, InQuira, iPhrase, ISYS Search Software, and Triple Hop were sold. Thus, their investors received their money back and in some cases received a premium. The $11 billion paid for Autonomy dwarfed the billion dollar purchase prices of Endeca and Fast Search and Transfer. But most of the companies able to sell their information retrieval systems sold for much less. IBM acquired Vivisimo for about $20 million and promptly justified the deal by describing Vivisimo’s metasearch system as a Big Data solution. Okay.

Today a number of enterprise search vendors walk a knife edge. A loss of a major account or a misstep that spooks investors can push a company over the financial edge in the blink of an eye. Recently I noticed that Dieselpoint has not updated its Web site for a while. Antidot seems to have faded from the US market. Funnelback has turned down the volume. Hakia went offline.

A few firms generate considerable public relations noise. Attivio, BA Insight, Coveo, and IBM Watson appear to be competing to become the leaders in today’s enterprise search sector. But today’s market is very different from the world of 2003-2004 when I wrote the first of three editions of the 400 page Enterprise Search Report. Each of these companies is asserting that its system provides business intelligence, customer support, and traditional enterprise search. Will any of these companies be able to match Autonomy’s 2008 revenues of $600 million? I doubt it.

The reason is not the availability of open source search. Elasticsearch, in fact, is arguably better than any of the for fee keyword and concept centric information retrieval systems. The problems of the enterprise search sector are deeper.


Microsoft, Text Analytics, and Writing

January 21, 2015

I read the marvelously named “Microsoft Acquires Text Analysis Startup Equivio, Plans to Integrate Machine Learning Tech into Office 365: Equivio Zoom In. Find Out.”

Taking a deep breath I read the article. Here’s what I deduced: Word and presumably PowerPoint will get some new features:

While Office 365 offers e-discovery and information governance capabilities, Equivio develops machine learning technologies for both, meaning an integration is expected to make them “even more intelligent and easy to use.” Microsoft says the move is in line with helping its customers tackle “the legal and compliance challenges inherent in managing large quantities of email and documents.”

The Fast Search & Transfer technology is not working out?  The dozens of SharePoint content enhancers are not doing their job? The grammar checker is not doing its job?

What is different is that Word is getting more machine learning:

Equivio uses machine learning to let users explore large, unstructured sets of data. The startup’s technology leverages advanced text analytics to perform multi-dimensional analyses of data collections, intelligently sort documents into themes, group near-duplicates, and isolate unique data.
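Grouping near-duplicates, one of the operations named in the description above, is commonly approximated by comparing overlapping word sequences. The sketch below uses word shingles and Jaccard similarity in plain Python. It is a generic illustration, not Equivio’s or Microsoft’s algorithm, and the shingle size, the 0.5 threshold, and the sample documents are arbitrary assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def group_near_duplicates(docs, threshold=0.5):
    """Greedily group documents; each joins the first group whose
    representative (its first member) is at least `threshold` similar."""
    groups = []  # list of (representative shingle set, [document indexes])
    for index, doc in enumerate(docs):
        doc_shingles = shingles(doc)
        for rep, members in groups:
            if jaccard(rep, doc_shingles) >= threshold:
                members.append(index)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append((doc_shingles, [index]))
    return [members for _, members in groups]

docs = [
    "please review the attached contract before friday",
    "please review the attached contract before monday",
    "lunch menu for the quarterly meeting",
]
print(group_near_duplicates(docs))
```

In an e-discovery setting the payoff is that a reviewer reads one representative per group instead of every lightly edited copy of the same email.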

Like Microsoft’s exciting adaptive menus, the new system will learn what the user wants.

Is this a next generation information access system? Is Microsoft nosing into Recorded Future territory?

Nope, but the desire to convert what the user does into metadata seems to percolate in the Microsoft innovation coffee pot.

If Microsoft pulls off this shotgun marriage, I think more pressure will be put on outfits like Content Analyst and Smartlogic.

Stephen E Arnold, January 21, 2015

Shades of Ray Kurzweil: Watson to Crack Ageing

January 11, 2015

I am not too keen on immortality. My view is that stuff dies. Age appropriate behavior means accepting the lot of mortal man.

But some folks want to extend their lives; others hope to live forever like the nano-stuff creatures in Alastair Reynolds’ novels.

I associate the live longer and collect stock options approach with Ray Kurzweil, the Google big thinker and music inventor. Well, I learned something in “IBM Watson’s Lab to Tackle Aging Issues.” Now Watson with its chugging heart of Lucene has lifetimes of revenue to generate before some activist investors put a bit in this pony’s mouth.

The write up says:

IBM Korea will build a cognitive computing center in Seoul to help tackle an aging society with technology. “IBM submitted a letter of intent to the Seoul Metropolitan Government last month to set up a Watson lab to study smart-aging technology,” IBM Korea said…

I found this statement remarkable because IBM has not turned Lucene and home-grown scripts into a multi billion dollar revenue stream. On the other hand, it has helped the delis close to the IBM Watson facility in Manhattan prosper.

IBM has taken major steps to develop Watson as a new business line for future success. Watson has made achievements in diagnostic medicine and cancer treatment.

The approach involves the phone and microwave company Samsung and various universities, start ups, and public relations professionals in South Korea.

I assume more details will be revealed in Technology Review, a publication that covers Watson’s twists and turns in exquisite, marketing detail.

If you want to get on the anti-ageing train, board in South Korea. Like the projected $10 billion in revenue from a Lucene based system, let me know how those crow’s feet fly.

Stephen E Arnold, January 11, 2015
