
SAS Text Miner Provides Valuable Predictive Analytics

March 25, 2015

If you are searching for predictive analytics software that provides in-depth text analysis with advanced linguistic capabilities, you may want to check out “SAS Text Miner.” Predictive Analytics Today runs down the features of SAS Text Miner and details how it works.

It is user-friendly software with data visualization, flexible entity options, document theme discovery, and more.

“The text analytics software provides supervised, unsupervised, and semi-supervised methods to discover previously unknown patterns in document collections.  It structures data in a numeric representation so that it can be included in advanced analytics, such as predictive analysis, data mining, and forecasting.  This version also includes insightful reports describing the results from the rule generator node, providing clarity to model training and validation results.”

SAS Text Miner’s other features draw on automatic Boolean rule generation to categorize documents, and the rules can be exported as Boolean rules. Data sets can be built from a directory or crawled from the Web. The visual analysis feature highlights the relationships between discovered patterns and displays them in a concept link diagram. SAS Text Miner has received high praise as predictive analytics software, and it might be the solution your company is looking for.
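To make the idea of Boolean rule categorization concrete, here is a minimal sketch in Python. It is not SAS Text Miner's implementation or rule syntax: the nested-tuple rule format, the `RULES` table, and the function names are all hypothetical, and the rules are written by hand rather than generated automatically.

```python
import re

def tokenize(text):
    """Lowercase a document and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(rule, tokens):
    """Evaluate a nested Boolean rule against a set of tokens.

    A rule is either a plain term (str) or a tuple of the form
    ("AND", r1, r2, ...), ("OR", r1, r2, ...), or ("NOT", r).
    """
    if isinstance(rule, str):
        return rule in tokens
    op, *args = rule
    if op == "AND":
        return all(matches(a, tokens) for a in args)
    if op == "OR":
        return any(matches(a, tokens) for a in args)
    if op == "NOT":
        return not matches(args[0], tokens)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical, hand-written rules; a real system would generate these.
RULES = {
    "finance": ("AND", ("OR", "debt", "bond"), ("NOT", "football")),
    "sports": ("OR", "football", "league"),
}

def categorize(text):
    """Return the sorted list of categories whose rule matches the text."""
    tokens = tokenize(text)
    return sorted(cat for cat, rule in RULES.items() if matches(rule, tokens))
```

Because each rule is plain data, it can be inspected, edited, or exported, which is the practical appeal of Boolean rules over an opaque model.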

Whitney Grace, March 25, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com

Recorded Future: Google and Cyber OSINT

February 2, 2015

I find the complaints about Google’s inability to handle time amusing. On the surface, Google seems to demote, ignore, or just not understand the concept of time. For the vast majority of Google service users, Google is no substitute for the users’ investment of time and effort into dating items. But for the wide, wide Google audience, ads, not time, are more important.

Does Google really get an F in time? The answer is, “Nope.”

In CyberOSINT: Next Generation Information Access I explain that Google’s time sense is well developed and of considerable importance to next generation solutions the company hopes to offer. Why the crawfishing? Well, Apple could just buy Google and make the bitter taste of the Apple Board of Directors’ experience a thing of the past.

Now to temporal matters in the here and now.

CyberOSINT relies on automated collection, analysis, and report generation. In order to make sense of data and information crunched by an NGIA system, time is a really key metatag item. To figure out time, a system has to understand:

  • The date and time stamp
  • Versioning (previous, current, and future document, data items, and fact iterations)
  • Times and dates contained in a structured data table
  • Times and dates embedded in content objects themselves; for example, a reference to “last week” or in some cases, optical character recognition of the data on a surveillance tape image.

For the average query, this type of time detail is overkill. The “time and date” of an event, therefore, requires disambiguation, determination and tagging of specific time types, and then capturing the date and time data with markers for document or data versions.
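A rough sketch of what tagging specific time types can look like follows. It is an illustration only, not how Recorded Future or Google handles time: the phrase table, the tag names, and the `tag_times` function are assumptions made for the example, and only a few relative phrases are covered.

```python
from datetime import date, timedelta

# Hypothetical table of relative phrases and their offsets from the
# document's own date stamp. A real system would need far more coverage.
RELATIVE_OFFSETS = {
    "today": timedelta(days=0),
    "yesterday": timedelta(days=-1),
    "last week": timedelta(weeks=-1),
}

def tag_times(text, published):
    """Return time tags for a document: its date stamp plus any
    relative phrases resolved against that date stamp."""
    tags = [{"type": "document_timestamp", "value": published.isoformat()}]
    lowered = text.lower()
    for phrase, offset in RELATIVE_OFFSETS.items():
        if phrase in lowered:
            tags.append({
                "type": "embedded_relative",
                "phrase": phrase,
                "value": (published + offset).isoformat(),
            })
    return tags
```

For a document stamped February 2, 2015, the phrase “last week” resolves to the week-earlier date of January 26, 2015. Keeping the tag type alongside the value is what lets downstream analysis tell a date stamp apart from a date mentioned inside the content.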

image

A simplification of Recorded Future’s handling of unstructured data. The system can also handle structured data and a range of other data management content types. Image copyright Recorded Future 2014.

Sounds like a lot of computational and technical work.

In CyberOSINT, I describe Google’s and In-Q-Tel’s investments in Recorded Future, one of the data forward NGIA companies. Recorded Future has wizards who developed the Spotfire system, which is now part of the Tibco service. There are Xooglers like Jason Hines. There are assorted wizards from Sweden, countries that most US high school students cannot locate on a map, and assorted veterans of high technology start ups.

An NGIA system delivers actionable information to a human or to another system. Alternatively, a licensee can build and integrate new solutions on top of the Recorded Future technology. Among the company’s key inventions are numerical recipes that deal effectively with the notion of “time.” Recorded Future uses the name “Tempora” as shorthand for the advanced technology that makes time, along with predictive algorithms, part of the Recorded Future solution.


Autonomy: Leading the Push Beyond Enterprise Search

January 30, 2015

In “CyberOSINT: Next Generation Information Access,” I describe Autonomy’s math-first approach to content processing. The reason is that after the veil of secrecy was lifted with regard to the signal processing methods used for British intelligence tasks, Cambridge University became one of the hot beds for the use of Bayesian, Laplacian, and Markov methods. These numerical recipes proved to be both important and controversial. Instead of relying on manual methods, humans selected training sets, tuned the thresholds, and then turned the smart software loose. Math is not required to understand what Autonomy packaged for commercial use: Signal processing separated noise in a channel and allowed software to process the important bits. Thank you, Claude Shannon and the good Reverend Bayes.
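For readers who want to see the workflow of selecting a training set, tuning a threshold, and turning the software loose, here is a small naive Bayes sketch. It is a textbook technique offered as an illustration, not Autonomy's proprietary method; the training sentences, the labels, and the 0.7 threshold are all made up for the example.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Split text into lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def posterior(text, model):
    """Return P(label | text) per label: naive Bayes, add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # prior
        denom = sum(counts.values()) + len(vocab)
        for w in tokens(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((counts[w] + 1) / denom)
        log_scores[label] = score
    peak = max(log_scores.values())
    unnormalized = {l: math.exp(s - peak) for l, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {l: v / norm for l, v in unnormalized.items()}

def classify(text, model, threshold=0.7):
    """Return the most probable label, or None if it is below the
    human-chosen confidence threshold."""
    probs = posterior(text, model)
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None

# Hypothetical training set selected by a human.
TRAINING = [
    ("encrypted signal intercepted", "signal"),
    ("signal traffic analysis report", "signal"),
    ("static hiss and crackle", "noise"),
    ("random static on the line", "noise"),
]
model = train(TRAINING)
```

The threshold is the human judgment call mentioned above: raise it and the software abstains more often, lower it and the software labels more aggressively.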

What did Autonomy receive for this breakthrough? Not much but the company did generate more than $600 million in revenues about 10 years after opening for business. As far as I know, no other content processing vendor has reached this revenue target. Endeca, for the sake of comparison, flat lined at about $130 million in the year that Oracle bought the Guided Navigation outfit for about $1.0 billion.

For one thing, the British company BAE (British Aerospace) licensed the Autonomy system and began to refine its automated collection, analysis, and report systems. So what? By the late 1990s the UK had become the de facto leader in automated content activities. Was BAE the only smart outfit in the late 1990s? Nope, there were other outfits who realized the value of the Autonomy approach. Examples range from US government entities to little known outfits like the Wynyard Group.

In the CyberOSINT volume, you can get more detail about why Autonomy was important in the late 1990s, including the name of the university professor who encouraged Mike Lynch to make contributions that have had a profound impact on intelligence activities. For color, let me mention an anecdote that is not in the 176 page volume. (Please keep in mind that Autonomy was, like i2, another Cambridge University spawned outfit, a client prior to my retirement.) IBM owns i2 and i2 is profiled in CyberOSINT in Chapter 5, “CyberOSINT Vendors.” I would point out that more than two thirds of the monograph contains information that is either not widely available or not available via a routine Bing, Google, or Yandex query. For example, Autonomy does not make publicly available a list of its patent documents. These contain specific information about how to think about cyber OSINT and moving beyond keyword search.

Some Color: A Conversation with a Faux Expert

In 2003 I had a conversation with a fellow who was an “expert” in content management, a discipline that is essentially a step child of database technology. I want to mention this person by name, but I will avoid the inevitable letter from his attorney rattling a saber over my head. This person publishes reports, engages in litigation with his partners, kowtows to various faux trade groups, and tries to keep secret his history as a webmaster with some Stone Age skills.

Not surprisingly this canny individual had little good to say about Autonomy. The information I provided about the Lynch technology, its applications, and its importance in next generation search was dismissed with a comment I will not forget, “Autonomy is a pile of crap.”

Okay, that’s an informed opinion for a clueless person pumping baloney about the value of content management as a separate technical field. Yikes.

In terms of enterprise search, Autonomy’s competitors criticized Lynch’s approach. Instead of a keyword search utility that was supposed to “unlock” content, Autonomy delivered a framework. The framework operated in an automated manner and could deliver keyword search, point and click access like the Endeca system, and more sophisticated operations associated with today’s most robust cyber OSINT solutions. Enterprise search remains stuck in the STAIRS III and RECON era. Autonomy was the embodiment of the leap from putting the burden of finding on humans to shifting the load to smart software.

image

A diagram from Autonomy’s patents filed in 2001. What’s interesting is that this patent cites an invention by Dr. Liz Liddy with whom the ArnoldIT team worked in the late 1990s. A number of content experts understood the value of automated methods, but Autonomy was the company able to commercialize and build a business on technology that was not widely known 15 years ago. Some universities did not teach Bayesian and related methods because these were tainted by humans who used judgments to set certain thresholds. See US 6,668,256. There are more than 100 Autonomy patent documents. How many of the experts at IDC, Forrester, Gartner, et al have actually located the documents, downloaded them, and reviewed the systems, methods, and claims? I would suggest a tiny percentage of the “experts.” Patent documents are not what English majors are expected to read.

That’s important and little appreciated by the mid tier outfits’ experts working for IDC (yo, Dave Schubmehl, are you ramping up to recycle the NGIA angle yet?), Forrester (one of whose search experts told me at a MarkLogic event that new hires for search were told to read the information on my ArnoldIT.com Web site, like that was a good thing for me), Gartner Group (the conference and content marketing outfit), Ovum (the UK counterpart to Gartner), and dozens of other outfits who understand search in terms of selling received wisdom, not insight or hands on facts.


Enterprise Search Problems: Why NGIA Systems Push Beyond Traditional Information Access Methods

January 29, 2015

Enterprise search has been useful. However, the online access methods have changed. Unfortunately, most enterprise search systems and the enterprise applications based on keyword and category access have lagged behind user needs.

The information highway is littered with the wrecks of enterprise search vendors who promised a solution to findability challenges and failed to deliver. Some of the vendors have been forgotten by today’s keyword and category access vendors. Do you know about the business problems that disappointed licensees and cost investors millions of dollars? Are you familiar with Convera, Delphes, Entopia, Fulcrum Technologies, Hakia, Siderean Software, and many other companies?


A handful of enterprise search vendors dodged implosion by selling out. Artificial Linguistics, Autonomy, Brainware, Endeca, Exalead, Fast Search, InQuira, iPhrase, ISYS Search Software, and Triple Hop were sold. Thus, their investors received their money back and in some cases received a premium. The $11 billion paid for Autonomy dwarfed the billion dollar purchase prices of Endeca and Fast Search and Transfer. But most of the companies able to sell their information retrieval systems sold for much less. IBM acquired Vivisimo for about $20 million and promptly justified the deal by describing Vivisimo’s metasearch system as a Big Data solution. Okay.

Today a number of enterprise search vendors walk a knife edge. A loss of a major account or a misstep that spooks investors can push a company over the financial edge in the blink of an eye. Recently I noticed that Dieselpoint has not updated its Web site for a while. Antidot seems to have faded from the US market. Funnelback has turned down the volume. Hakia went offline.

A few firms generate considerable public relations noise. Attivio, BA Insight, Coveo, and IBM Watson appear to be competing to become the leaders in today’s enterprise search sector. But today’s market is very different from the world of 2003-2004 when I wrote the first of three editions of the 400 page Enterprise Search Report. Each of these companies is asserting that its system provides business intelligence, customer support, and traditional enterprise search. Will any of these companies be able to match Autonomy’s 2008 revenues of $600 million? I doubt it.

The reason is not the availability of open source search. Elasticsearch, in fact, is arguably better than any of the for fee keyword and concept centric information retrieval systems. The problems of the enterprise search sector are deeper.


Garbling the Natural Language Processors

December 30, 2014

Natural language processing is becoming a popular analytical tool as well as a quicker route to search and customer support. Dragon Nuance is on the tip of everyone’s tongue when NLP enters a conversation, but there are other products with their own benefits. Code Project recently reviewed three NLP products in “A Review Of Three Natural Language Processors, AlchemyAPI, OpenCalais, And Semantria.”

Rather than sticking readers with plain product reviews, Code Project explains what NLP is used for and how it accomplishes it. While NLP is used for vocal commands, it can do many other things: improve SEO, knowledge management, text mining, text analytics, content visualization and monetization, decision support, automatic classification, and regulatory compliance. NLP extracts entities aka proper nouns from content, then classifies, tags, and provides a sentiment score to give each entity a meaning.

In layman’s terms:

“…the primary purpose of an NLP is to extract the nouns, determine their types, and provide some “scoring” (relevance or sentiment) of the entity within the text.  Using relevance, one can supposedly filter out entities to those that are most relevant in the document.  Using sentiment analysis, one can determine the overall sentiment of an entity in the document, useful for determining the “tone” of the document with regards to an entity — for example, is the entity “sovereign debt” described negatively, neutrally, or positively in the document?”
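The entity-level sentiment scoring described in the quote can be sketched in a few lines. This is a toy under stated assumptions, not the method used by AlchemyAPI, OpenCalais, or Semantria: the word lists are invented, the entities are supplied by the caller rather than extracted, and each sentence's polarity is simply credited to every entity it mentions.

```python
import re
from collections import defaultdict

# Hypothetical miniature sentiment lexicon.
POSITIVE = {"strong", "improved", "gain"}
NEGATIVE = {"crisis", "default", "weak"}

def entity_sentiment(text, entities):
    """Score each entity by the polarity of the sentences that mention it.

    Polarity of a sentence = positive lexicon hits minus negative hits.
    Entities that never appear in the text are left out of the result.
    """
    scores = defaultdict(int)
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        words = set(re.findall(r"[a-z]+", lowered))
        polarity = len(words & POSITIVE) - len(words & NEGATIVE)
        for entity in entities:
            if entity.lower() in lowered:
                scores[entity] += polarity
    return dict(scores)

def label(score):
    """Map a numeric score to a coarse tone label."""
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

Run against a sentence such as “Sovereign debt fears deepened the crisis,” the entity “sovereign debt” picks up a negative score, which is the kind of tone determination the reviewers describe.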

NLP categorizes the human element in content. Its usefulness will become more apparent in future years, especially as people rely more and more on electronic devices for communication, consumerism, and interaction.

Whitney Grace, December 30, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Narrative Science Gets Money to Crunch Numbers

December 18, 2014

A smaller big data sector that specializes in text analysis to generate content and reports is burgeoning with startups. Venture Beat takes a look at how one of these startups, Narrative Science, is gaining more attention in the enterprise software market: “Narrative Science Pulls In $10M To Analyze Corporate Data And Turn It Into Text-Based Reports.”

Narrative Science started out with software that created sports and basic earnings articles for newspaper filler. It has since grown into helping businesses in different industries take their data by the digital horns and leverage it.

Narrative Science recently received $10 million in funding to further develop its software. Stuart Frankel, chief executive, is driven to help all industries save time and resources by better understanding their data.

“ ‘We really want to be a technology provider to those media organizations as opposed to a company that provides media content,’ Frankel said… ‘When humans do that work…it can take weeks. We can really get that down to a matter of seconds.’”

From making content to providing technology? It is quite a leap for Narrative Science. While they appear to have a good product, what exactly is it they do?

Whitney Grace, December 18, 2014

Machine Intelligence on One Big Poster

December 12, 2014

I need this in my office. I will dump my early 1940s French posters and go for logos.

Navigate to this link: http://bit.ly/1sdmBL0. You will be able to download a copy of an infographic (poster) that summarizes “The Current State of Machine Intelligence.” There are some interesting editorial decisions; for example, the cheery Google logo turns up in deep learning, predictive APIs, automotive, and personal assistant. I quite liked the inclusion of IBM Watson in artificial intelligence: recipes with tamarind and post-video editing game show champion. I also noted the listing of Palantir as one of the “intelligence tools” outfits. Three observations:

  1. I am not sure if the landscape captures what machine intelligence is
  2. The categories, while brightly colored, do not make clear how a core technology can be speech recognition but not part of the “rethinking industries” category
  3. Shouldn’t Google be in every category?

I am confident that mid tier consultants and reputation surfers like Dave Schubmehl will find the chart a source of inspiration. Does Digital Reasoning actually have a product? The company did not make the cut for the top 60 companies in NGIA systems. Hmmm. Live and learn.

Stephen E Arnold, December 12, 2014

Lexalytics Positions Semantria in Europe

December 12, 2014

Analytics outfit Lexalytics is going all-in on their European expansion. The write-up, “Lexalytics Expands International Presence: Launches Pain-Free Text Mining Customization” at Virtual-Strategy Magazine tells us that the company has boosted the language capacity of their recently acquired Semantria platform. The text-analytics and sentiment-analysis platform now includes Japanese, Arabic, Malay, and Russian in its supported-language list, which already included English, French, German, Chinese, Spanish, Portuguese, Italian, and Korean.

Lexalytics is also setting up servers in Europe. Because of upcoming changes to EU privacy law, we’re told companies will soon be prohibited from passing data into the U.S. Thanks to these new servers, European clients will be able to use Semantria’s cloud services without running afoul of the law.

Last summer, the company courted Europeans’ attention by becoming a sponsor of the 2014 Enterprise Hackathon in Prague. The press release tells us:

“All participants of the Hackathon were granted unlimited access and support to the Semantria API during the event. Nearly every team tried Semantria during the 36 hours they had to build a program that could crunch enough data to be used at the enterprise level. Redmore says, ‘We love innovative, quick development events, and are always looking for good events to support. Please contact us if you have a hackathon where you can use the power of our text mining solutions, and we’ll talk about hooking you up!’”

Lexalytics is proud to have been the first to offer sentiment analysis, auto theme detection, and Wikipedia integration. Designed to integrate with third-party applications, their text analysis software chugs along in the background at many data-related organizations. Founded in 2003, Lexalytics is headquartered in Amherst, Massachusetts.

Cynthia Murrell, December 12, 2014


Blast toward the Moon With Rocket Software

December 8, 2014

YouTube informational videos are great. They are short, snappy, and often help people retain more information about a product than reading the “about” page on a Web site. Rocket Software has its own channel and the video “Rocket Enterprise Search And Text Analytics” packs a lot of details into 2.49 minutes. The video is described as:

“We provide an integrated search platform for gathering, indexing, and searching both structured and unstructured data, making the information that you depend on more accessible, useful, and intelligent.”

How does Rocket Software defend that statement? The video opens with a prediction that by 2020 data usage will have increased to forty trillion gigabytes. It explains that data is the new enterprise currency and that it needs to be kept organized, then it drops into a plug for the company’s software. The company compares itself to competitors by saying that Rocket Software makes enterprise search and text analytics as simple as a download, after which the system is up and running. Other enterprise search systems require custom coding, but Rocket Software explains that it offers these options out of the box. Plus, it is a cheaper product that does not sacrifice quality.

Software usage these days is about functionality and ease of use for powerful software. Rocket Software states it offers this. Try putting it to the test.

Whitney Grace, December 08, 2014

eDigital Research and Lexalytics Team Up on Real Time Text Analytics

November 11, 2014

Through the News section of their website, eDigitalResearch announces a new partnership in, “eDigitalResearch Partner with Lexalytics on Real-Time Text Analytics Solution.” The two companies are integrating Lexalytics’ Salience analysis engine into eDigital’s HUB analysis and reporting interface. The write-up tells us:

“By utilising and integrating Lexalytics Salience text analysis engine into eDigitalResearch’s own HUB system, the partnership will provide clients with a real-time, secure solution for understanding what customers are saying across the globe. Able to analyse comments from survey responses to social media – in fact any form of free text – eDigitalResearch’s HUB Text Analytics will provide the power and platform to really delve deep into customer comments, monitor what is being said and alert brands and businesses of any emerging trends to help stay ahead of the competition.”

Based in Hampshire, U.K., eDigitalResearch likes to work closely with their clients to produce the best solution for each. The company began in 1999 with the launch of the eMysteryShopper, a novel concept at the time. As of this writing, eDigitalResearch is looking to hire a developer and senior developer (in case anyone here is interested.)

Founded in 2003, Lexalytics is proud to have brought the first sentiment analysis engine to market. Designed to integrate with third-party applications, their text analysis software is chugging along in the background at many data-related companies. Lexalytics is headquartered in Amherst, Massachusetts.

Cynthia Murrell, November 11, 2014

