Lexalytics Offers Tunable Text Mining
May 13, 2015
Want to do text mining without some of the technical hassles? If so, you will want to read about Lexalytics, “the industry’s most tunable and configurable text mining technology.” Navigate to “Lexalytics Unveils Industry’s First Wizard for Text Mining and Sentiment Analysis.” I learned that text mining can be fun, easy, and intuitive. I highlighted this quote from the news story as an indication that one does not need to understand exactly what’s going on in the text mining process:
“Before, our customers had to understand the meaning of things like ‘alpha-numeric content threshold’ and ‘entities confidence threshold,'” Jeff continued. “Lexalytics provides the most knobs to turn to get the results exactly as you want them, and now our customers don’t even have to think about them.”
Text mining, the old-fashioned way, required understanding of what was required, what procedures were appropriate, and ability to edit or write scripts. There are other skills that used to be required as the entry fee to text mining. The modern world of interfaces allows anyone to text mine. Do users understand the outputs? Sure. Perfectly.
As I read the story, I recalled a statement in “A Review of Three Natural Language Processors, AlchemyAPI, OpenCalais, and Semantria.” Here is the quote I noted in that July 2014 write up by Marc Clifton:
I find the concept of Natural Language Processing intriguing and that it holds many possibilities for helping to filter and analyze the vast and growing amount of information out there on the web. However, I’m not quite sure exactly how one uses the output of an NLP service in a productive way that goes beyond simple keyword matching. Some people will of course be interested in whether the sentiment is positive or negative, and I think the idea of extracting concepts (AlchemyAPI) and topics (Semantria) are useful in extracting higher level abstractions regarding a document. NLP is therefore an interesting field of study and I believe that the people who provide NLP services would benefit from the feedback of users to increase the value of their service.
Perhaps the feedback was, “Make this stuff easy to do.” Now the challenge is to impart an understanding of what a text mining system outputs. That might be a bit more difficult.
Stephen E Arnold, May 13, 2015
Hoping to End Enterprise Search Inaccuracies
May 1, 2015
Enterprise search is limited by how well users tag their content and by the preloaded taxonomies. According to TechTarget’s Search Content Management blog, text analytics might be the key to turning around poor enterprise search performance: “How Analytics Engines Could Finally Relieve Enterprise Pain.” Text analytics turns out to be only part of the solution. Someone had the brilliant idea to apply text analytics to classification issues in enterprise search, shifting search from reactive to user input to proactive to search queries.
In general, analytics search engines work like this:
“The first is that analytics engines don’t create two buckets of content, where the goal is to identify documents that are deemed responsive. Instead, analytics engines identify documents that fall into each category and apply the respective metadata tags to the documents. Second, people don’t use these engines to search for content. The engines apply metadata to documents to allow search engines to find the correct information when people search for it. Text analytics provides the correct metadata to finally make search work within the enterprise.”
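To make the quoted workflow concrete, here is a minimal sketch in Python, assuming a made-up set of categories and trigger terms. The names `CATEGORY_TERMS`, `tag_document`, and `search_by_tag` are hypothetical illustrations, not anything from the TechTarget article or a vendor product. The point is the division of labor: one step applies metadata tags for every category a document falls into, and a separate search step reads those tags.

```python
from dataclasses import dataclass, field

# Hypothetical category rules: each category maps to a set of trigger terms.
CATEGORY_TERMS = {
    "contracts": {"agreement", "indemnify", "termination"},
    "finance": {"invoice", "budget", "revenue"},
    "hr": {"benefits", "onboarding", "payroll"},
}


@dataclass
class Document:
    doc_id: str
    text: str
    tags: set = field(default_factory=set)


def tag_document(doc, rules=CATEGORY_TERMS, min_hits=1):
    """Attach every category whose terms appear at least min_hits times."""
    words = [w.strip(".,;:!?\"'()").lower() for w in doc.text.split()]
    for category, terms in rules.items():
        hits = sum(1 for w in words if w in terms)
        if hits >= min_hits:
            doc.tags.add(category)
    return doc


def search_by_tag(docs, tag):
    """The search step only reads the metadata the tagging step wrote."""
    return [d.doc_id for d in docs if tag in d.tags]


docs = [
    Document("d1", "The invoice exceeds the approved budget."),
    Document("d2", "Either party may request termination of the agreement."),
    Document("d3", "Payroll and the budget review are due Friday."),
]
for d in docs:
    tag_document(d)

print(search_by_tag(docs, "finance"))
```

A document can land in more than one category (d3 above is tagged both hr and finance), which is the "no two buckets" idea in the quote. Real systems would replace the term lists with trained classifiers, but the tagging-then-searching split stays the same.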
Supposedly, they are fixing the tagging issue by removing the biggest cause of error: humans. Microsoft caught on to how much profit this could generate, so they purchased Equivio in 2014 and integrated the FAST Search platform into SharePoint. Since Microsoft is doing it, every other tech company will copy and paste their actions in time. Enterprise search is full of faults, but it has improved greatly. Big data trends have improved search quality, but tagging continues to be an issue. Text analytics search engines will probably be the newest big data field for development. Hint for developers: work on an analytics search product, launch it, and then it might be bought out.
Whitney Grace, May 1, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Twitter Plays Hard Ball or DataSift Knows the End Is in Sight
April 11, 2015
I read “Twitter Ends its Partnership with DataSift – Firehose Access Expires on August 13, 2015.” DataSift supports a number of competitive and other intelligence services with its authorized Twitter stream. The write up says:
DataSift’s customers will be able to access Twitter’s firehose of data as normal until August 13th, 2015. After that date all the customers will need to transition to other providers to receive Twitter data. This is an extremely disappointing result to us and the ecosystem of companies we have helped to build solutions around Twitter data.
I found this interesting. Plan now or lose that authorized firehose. Perhaps Twitter wants more money? On the other hand, maybe DataSift realizes that for some intelligence tasks, Facebook is where the money is. Twitter is a noise machine. Facebook, despite its flaws, is anchored in humans, but the noise is increasing. Some content processes become more tricky with each business twist and turn.
Stephen E Arnold, April 11, 2015
Predicting Plot Holes Isn’t So Easy
April 10, 2015
According to The Paris Review’s blog post “Man In Hole II: Man In Deeper Hole,” Matthew Jockers created an analysis tool to predict archetypal book plots:
“A rough primer: Jockers uses a tool called “sentiment analysis” to gauge “the relationship between sentiment and plot shape in fiction”; algorithms assign every word in a novel a positive or negative emotional value, and in compiling these values he’s able to graph the shifts in a story’s narrative. A lot of negative words mean something bad is happening, a lot of positive words mean something good is happening. Ultimately, he derived six archetypal plot shapes.”
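The word-by-word scoring in the quote can be sketched in a few lines of Python. This is a toy under stated assumptions, not Jockers’s tool: the eight-word `LEXICON`, the function names, and the three-segment split are all invented for illustration.

```python
# Hypothetical toy lexicon; real sentiment lexicons score thousands of words.
LEXICON = {"joy": 1, "love": 1, "hope": 1, "won": 1,
           "grief": -1, "lost": -1, "fear": -1, "died": -1}


def word_scores(text):
    """Give each word +1, -1, or 0 according to the lexicon."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return [LEXICON.get(w, 0) for w in words]


def plot_shape(text, segments=3):
    """Cut the word scores into consecutive segments and sum each one."""
    scores = word_scores(text)
    size = max(1, len(scores) // segments)
    chunks = [scores[i * size:(i + 1) * size] for i in range(segments - 1)]
    chunks.append(scores[(segments - 1) * size:])  # last chunk takes the remainder
    return [sum(c) for c in chunks]


story = ("He found love and hope. Then he lost everything and felt "
         "grief and fear. At last he won joy.")
print(plot_shape(story))                       # high, low, high: a man in a hole
print(sum(word_scores("Oh joy, another funeral.")))
```

The second call shows the weakness: a purely word-level score reads “Oh joy, another funeral.” as positive, because the lexicon has no notion of tone or context.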
Academics, however, found some problems with Jockers’s tool. Is it possible to assign every word an emotional value, and can all plots really take basic forms? The problem is that words are as nuanced as human emotion, perspectives change in an instant, and sentiments are subjective. How would the tool rate sarcasm?
All stories have been broken down into seven basic plots, so why should it not be possible to do the same for book plots? Jockers already identified six basic book plots, and there are some who are curiously optimistic about his analysis tool. It does raise the question of whether the tool will stanch authors’ creativity or make English professors derive even more subjective meaning from Ulysses.
Whitney Grace, April 10, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Attensity Adds Semantic Markup
April 3, 2015
You have been waiting for more markup. I know I have, and that is why I read “Attensity Semantic Annotation: NLP-Analyse für Unternehmensapplikationen.”
So your wait and mine—over.
Attensity, a leader in figuring out what human discourse means, has rolled out a software development kit so you can do a better job with customer engagement and business intelligence. Attensity offers Dynamic Data Discovery. Unlike traditional analysis tools, Attensity does not focus on keywords. You know, what humans actually use to communicate.
Attensity uses natural language processing in order to identify concepts and issues in plain language. I must admit that I have heard this claim from a number of vendors, including long forgotten systems like DR LINK, among others.
The idea is that the SDK makes it easier to filter data to evaluate textual information and identify issues. Furthermore the SDK performs fast content fusion. The result is, as other vendors have asserted, insight. There was a vendor called Inxight which asserted quite similar functions in 1997. At one time, Attensity had a senior manager from Inxight, but I assume the attribution of functions is one of Attensity’s innovations. (Forgive me for mentioning vendors which some 20 somethings know quite well.)
If you are dependent upon Java, Attensity is an easy fit. I assume that if you are one of the 150 million plus Microsoft SharePoint outfits, Attensity integration may require a small amount of integration work.
According to Attensity, the benefit of its markup approach is that the installation is on site and therefore secure. I am not sure about this because security is person dependent, so cloud or on site, security remains an “issue” different from the ones Attensity’s system identifies.
Attensity, like Oracle, provides a knowledge base for specific industries. Oracle made term lists available for a number of years. Maybe since its acquisition of Artificial Linguistics in the mid 1990s?
Attensity supports five languages. For these five languages, Attensity can determine the “tone” of the words used in a document. Presumably a company like Bitext can provide additional language support if Attensity does not have these ready to install.
Vendors continue to recycle jargon and buzzwords to describe certain core functions available from many different vendors. If your metatagging outfit is not delivering, you may want to check out Attensity’s solution.
Stephen E Arnold, April 3, 2015
SAS Text Miner Provides Valuable Predictive Analytics
March 25, 2015
If you are searching for predictive analytics software that provides in-depth text analysis with advanced linguistic capabilities, you may want to check out “SAS Text Miner.” Predictive Analytics Today runs down the features of SAS Text Miner and details how it works.
It is user-friendly software with data visualization, flexible entity options, document theme discovery, and more.
“The text analytics software provides supervised, unsupervised, and semi-supervised methods to discover previously unknown patterns in document collections. It structures data in a numeric representation so that it can be included in advanced analytics, such as predictive analysis, data mining, and forecasting. This version also includes insightful reports describing the results from the rule generator node, providing clarity to model training and validation results.”
SAS Text Miner includes other features that draw on automatic Boolean rule generation to categorize documents, and other rules can be exported as Boolean rules. Data sets can be made from a directory or crawled from the Web. The visual analysis feature highlights the relationships between discovered patterns and displays them using a concept link diagram. SAS Text Miner has received high praise as predictive analytics software, and it might be the solution your company is looking for.
Whitney Grace, March 25, 2015
Recorded Future: Google and Cyber OSINT
February 2, 2015
I find the complaints about Google’s inability to handle time amusing. On the surface, Google seems to demote, ignore, or just not understand the concept of time. For the vast majority of Google service users, Google is no substitute for the users’ investment of time and effort into dating items. But for the wide, wide Google audience, ads, not time, are more important.
Does Google really get an F in time? The answer is, “Nope.”
In CyberOSINT: Next Generation Information Access I explain that Google’s time sense is well developed and of considerable importance to next generation solutions the company hopes to offer. Why the craw fishing? Well, Apple could just buy Google and make the bitter taste of the Apple Board of Directors’ experience a thing of the past.
Now to temporal matters in the here and now.
CyberOSINT relies on automated collection, analysis, and report generation. In order to make sense of data and information crunched by an NGIA system, time is a really key metatag item. To figure out time, a system has to understand:
- The date and time stamp
- Versioning (previous, current, and future document, data items, and fact iterations)
- Times and dates contained in a structured data table
- Times and dates embedded in content objects themselves; for example, a reference to “last week” or in some cases, optical character recognition of the data on a surveillance tape image.
For the average query, this type of time detail is overkill. The “time and date” of an event, therefore, requires disambiguation, determination and tagging of specific time types, and then capturing the date and time data with markers for document or data versions.
A simplification of Recorded Future’s handling of unstructured data. The system can also handle structured data and a range of other data management content types. Image copyright Recorded Future 2014.
Sounds like a lot of computational and technical work.
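To give a feel for that work, here is a minimal sketch in Python of tagging time types for one document, assuming only three of the cases above: the document’s own stamp, explicit dates in the text, and relative references such as “last week.” This is not Recorded Future’s method. The `RELATIVE` table, the `tag_times` function, and the tag names are hypothetical, and resolving “last week” to a single day is a deliberate oversimplification.

```python
import re
from datetime import date, timedelta

# Hypothetical table of relative phrases and the offset each one implies.
RELATIVE = {
    "yesterday": timedelta(days=1),
    "last week": timedelta(weeks=1),
}
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def tag_times(text, stamped):
    """Return (time_type, resolved_date, source_text) tuples for one document."""
    tags = [("document_stamp", stamped, stamped.isoformat())]
    for match in ISO_DATE.finditer(text):
        year, month, day = map(int, match.groups())
        try:
            tags.append(("explicit", date(year, month, day), match.group(0)))
        except ValueError:
            continue  # looks like a date but is not one, e.g. 2015-13-45
    lowered = text.lower()
    for phrase, delta in RELATIVE.items():
        if phrase in lowered:
            # A relative phrase only has meaning against the document stamp.
            tags.append(("relative", stamped - delta, phrase))
    return tags


tags = tag_times(
    "The outage began last week and was fixed on 2015-01-30.",
    date(2015, 2, 2),
)
for tag in tags:
    print(tag)
```

Even this toy shows why the date and time stamp matters: “last week” cannot be resolved at all without knowing when the document itself was written. Versioning, structured tables, and OCR of surveillance imagery are left out entirely.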
In CyberOSINT, I describe Google’s and In-Q-Tel’s investments in Recorded Future, one of the data forward NGIA companies. Recorded Future has wizards who developed the Spotfire system which is now part of the Tibco service. There are Xooglers like Jason Hines. There are assorted wizards from Sweden, countries that most US high school students cannot locate on a map, and assorted veterans of high technology start ups.
An NGIA system delivers actionable information to a human or to another system. Conversely a licensee can build and integrate new solutions on top of the Recorded Future technology. One of the company’s key inventions is numerical recipes that deal effectively with the notion of “time.” Recorded Future uses the name “Tempora” as shorthand for the advanced technology that makes time along with predictive algorithms part of the Recorded Future solution.
Autonomy: Leading the Push Beyond Enterprise Search
January 30, 2015
In “CyberOSINT: Next Generation Information Access,” I describe Autonomy’s math-first approach to content processing. The reason is that after the veil of secrecy was lifted with regard to the signal processing methods used for British intelligence tasks, Cambridge University became one of the hot beds for the use of Bayesian, Laplacian, and Markov methods. These numerical recipes proved to be both important and controversial. Instead of relying on manual methods, humans selected training sets, tuned the thresholds, and then turned the smart software loose. Math is not required to understand what Autonomy packaged for commercial use: Signal processing separated noise in a channel and allowed software to process the important bits. Thank you, Claude Shannon and the good Reverend Bayes.
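For readers who want to see the “select a training set, tune a threshold, turn the software loose” pattern in miniature, here is a sketch in Python of a textbook naive Bayes text classifier with add-one smoothing. It is an assumption-laden illustration of the general Bayesian idea, not Autonomy’s patented method. The training examples, the labels “signal” and “noise,” and the function names are all hypothetical.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (text, label). Returns per-label word and document counts."""
    word_counts, doc_counts = {}, Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def posterior(text, model):
    """Naive Bayes posterior per label, using add-one (Laplace) smoothing."""
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    peak = max(log_scores.values())  # subtract the peak before exponentiating
    weights = {k: math.exp(v - peak) for k, v in log_scores.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


def classify(text, model, threshold=0.8):
    """Return the best label only when its probability clears the human-set threshold."""
    probs = posterior(text, model)
    label, prob = max(probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else "unsure"


# The human contribution: a hand-picked training set and a tuned threshold.
model = train([
    ("threat report on network intrusion", "signal"),
    ("intrusion alert on server", "signal"),
    ("lunch menu for friday", "noise"),
    ("friday parking notice", "noise"),
])
print(classify("network intrusion alert", model))
print(classify("friday report", model))
print(classify("friday report", model, threshold=0.6))
```

The threshold is where human judgment enters: lower it and the software commits to more calls, raise it and more items fall into “unsure.” That is the knob the critics mentioned later in this write up considered a taint.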
What did Autonomy receive for this breakthrough? Not much but the company did generate more than $600 million in revenues about 10 years after opening for business. As far as I know, no other content processing vendor has reached this revenue target. Endeca, for the sake of comparison, flat lined at about $130 million in the year that Oracle bought the Guided Navigation outfit for about $1.0 billion.
For one thing the British company BAE (British Aerospace Engineering) licensed the Autonomy system and began to refine its automated collection, analysis, and report systems. So what? The UK became by the late 1990s the de facto leader in automated content activities. Was BAE the only smart outfit in the late 1990s? Nope, there were other outfits who realized the value of the Autonomy approach. Examples range from US government entities to little known outfits like the Wynyard Group.
In the CyberOSINT volume, you can get more detail about why Autonomy was important in the late 1990s, including the name of the university professor who encouraged Mike Lynch to make contributions that have had a profound impact on intelligence activities. For color, let me mention an anecdote that is not in the 176 page volume. (Please keep in mind that Autonomy was, like i2 (another Cambridge University spawned outfit), a client prior to my retirement.) IBM owns i2 and i2 is profiled in CyberOSINT in Chapter 5, “CyberOSINT Vendors.” I would point out that more than two thirds of the monograph contains information that is either not widely available or not available via a routine Bing, Google, or Yandex query. For example, Autonomy does not make publicly available a list of its patent documents. These contain specific information about how to think about cyber OSINT and moving beyond keyword search.
Some Color: A Conversation with a Faux Expert
In 2003 I had a conversation with a fellow who was an “expert” in content management, a discipline that is essentially a step child of database technology. I want to mention this person by name, but I will avoid the inevitable letter from his attorney rattling a saber over my head. This person publishes reports, engages in litigation with his partners, kowtows to various faux trade groups, and tries to keep secret his history as a webmaster with some Stone Age skills.
Not surprisingly this canny individual had little good to say about Autonomy. The information I provided about the Lynch technology, its applications, and its importance in next generation search was dismissed with a comment I will not forget, “Autonomy is a pile of crap.”
Okay, that’s an informed opinion for a clueless person pumping baloney about the value of content management as a separate technical field. Yikes.
In terms of enterprise search, Autonomy’s competitors criticized Lynch’s approach. Instead of a keyword search utility that was supposed to “unlock” content, Autonomy delivered a framework. The framework operated in an automated manner and could deliver keyword search, point and click access like the Endeca system, and more sophisticated operations associated with today’s most robust cyber OSINT solutions. Enterprise search remains stuck in the STAIRS III and RECON era. Autonomy was the embodiment of the leap from putting the burden of finding on humans to shifting the load to smart software.
A diagram from Autonomy’s patents filed in 2001. What’s interesting is that this patent cites an invention by Dr. Liz Liddy with whom the ArnoldIT team worked in the late 1990s. A number of content experts understood the value of automated methods, but Autonomy was the company able to commercialize and build a business on technology that was not widely known 15 years ago. Some universities did not teach Bayesian and related methods because these were tainted by humans who used judgments to set certain thresholds. See US 6,668,256. There are more than 100 Autonomy patent documents. How many of the experts at IDC, Forrester, Gartner, et al have actually located the documents, downloaded them, and reviewed the systems, methods, and claims? I would suggest a tiny percentage of the “experts.” Patent documents are not what English majors are expected to read.
That’s important and little appreciated by the mid tier outfits’ experts working for IDC (yo, Dave Schubmehl, are you ramping up to recycle the NGIA angle yet?), Forrester (one of whose search experts told me at a MarkLogic event that new hires for search were told to read the information on my ArnoldIT.com Web site like that was a good thing for me), Gartner Group (the conference and content marketing outfit), Ovum (the UK counterpart to Gartner), and dozens of other outfits who understand search in terms of selling received wisdom, not insight or hands on facts.
Enterprise Search Problems: Why NGIA Systems Push Beyond Traditional Information Access Methods
January 29, 2015
Enterprise search has been useful. However, the online access methods have changed. Unfortunately, most enterprise search systems and the enterprise applications based on keyword and category access have lagged behind user needs.
The information highway is littered with the wrecks of enterprise search vendors who promised a solution to findability challenges and failed to deliver. Some of the vendors have been forgotten by today’s keyword and category access vendors. Do you know about the business problems that disappointed licensees and cost investors millions of dollars? Are you familiar with Convera, Delphes, Entopia, Fulcrum Technologies, Hakia, Siderean Software, and many other companies?
A handful of enterprise search vendors dodged implosion by selling out. Artificial Linguistics, Autonomy, Brainware, Endeca, Exalead, Fast Search, InQuira, iPhrase, ISYS Search Software, and Triple Hop were sold. Thus, their investors received their money back and in some cases received a premium. The $11 billion paid for Autonomy dwarfed the billion dollar purchase prices of Endeca and Fast Search and Transfer. But most of the companies able to sell their information retrieval systems sold for much less. IBM acquired Vivisimo for about $20 million and promptly justified the deal by describing Vivisimo’s metasearch system as a Big Data solution. Okay.
Today a number of enterprise search vendors walk a knife edge. A loss of a major account or a misstep that spooks investors can push a company over the financial edge in the blink of an eye. Recently I noticed that Dieselpoint has not updated its Web site for a while. Antidot seems to have faded from the US market. Funnelback has turned down the volume. Hakia went offline.
A few firms generate considerable public relations noise. Attivio, BA Insight, Coveo, and IBM Watson appear to be competing to become the leaders in today’s enterprise search sector. But today’s market is very different from the world of 2003-2004 when I wrote the first of three editions of the 400 page Enterprise Search Report. Each of these companies is asserting that its system provides business intelligence, customer support, and traditional enterprise search. Will any of these companies be able to match Autonomy’s 2008 revenues of $600 million? I doubt it.
The reason is not the availability of open source search. Elasticsearch, in fact, is arguably better than any of the for fee keyword and concept centric information retrieval systems. The problems of the enterprise search sector are deeper.
Garbling the Natural Language Processors
December 30, 2014
Natural language processing is becoming a popular analytical tool as well as a quicker route to search and customer support. Dragon Nuance is on the tip of everyone’s tongue when NLP enters a conversation, but there are other products with their own benefits. Code Project recently reviewed three NLP services in “A Review Of Three Natural Language Processors, AlchemyAPI, OpenCalais, And Semantria.”
Rather than sticking readers with plain product reviews, Code Project explains what NLP is used for and how it accomplishes it. While NLP is used for vocal commands, it can do many other things: improve SEO, knowledge management, text mining, text analytics, content visualization and monetization, decision support, automatic classification, and regulatory compliance. NLP extracts entities aka proper nouns from content, then classifies, tags, and provides a sentiment score to give each entity a meaning.
In layman’s terms:
“…the primary purpose of an NLP is to extract the nouns, determine their types, and provide some “scoring” (relevance or sentiment) of the entity within the text. Using relevance, one can supposedly filter out entities to those that are most relevant in the document. Using sentiment analysis, one can determine the overall sentiment of an entity in the document, useful for determining the “tone” of the document with regards to an entity — for example, is the entity “sovereign debt” described negatively, neutrally, or positively in the document?”
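The quoted description (find entities, score relevance, score sentiment) can be illustrated with a toy in Python. This sketch does not call AlchemyAPI, OpenCalais, or Semantria. The `ENTITIES` list, the `SENTIMENT` lexicon, and the `score_entities` function are hypothetical stand-ins, and relevance here is simply each entity’s share of the sentences that mention any known entity.

```python
import re

# Hypothetical inputs: a known-entity list and a tiny sentiment lexicon.
ENTITIES = ["sovereign debt", "central bank"]
SENTIMENT = {"crisis": -1, "risk": -1, "default": -1,
             "stable": 1, "confidence": 1}


def score_entities(text, min_relevance=0.0):
    """Return relevance and sentiment for each entity mentioned in the text."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    mentions = {e: [s for s in sentences if e in s] for e in ENTITIES}
    total = sum(len(hits) for hits in mentions.values()) or 1
    results = {}
    for entity, hits in mentions.items():
        relevance = len(hits) / total
        if not hits or relevance < min_relevance:
            continue  # filter out absent or low-relevance entities
        # Sentiment: sum lexicon values over the sentences that mention the entity.
        sentiment = sum(SENTIMENT.get(word, 0)
                        for sentence in hits
                        for word in re.findall(r"[a-z]+", sentence))
        results[entity] = {"relevance": relevance, "sentiment": sentiment}
    return results


text = ("Sovereign debt fuels the crisis. Investors see default risk in "
        "sovereign debt. The central bank remains stable.")
scores = score_entities(text)
print(scores)
print(list(score_entities(text, min_relevance=0.5)))
```

In this toy, “sovereign debt” comes out negative and more relevant than “central bank,” and raising `min_relevance` drops the weaker entity. Commercial services discover entities and types themselves rather than matching a fixed list.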
NLP categorizes the human element in content. Its usefulness will become more apparent in future years, especially as people rely more and more on electronic devices for communication, consumerism, and interaction.
Whitney Grace, December 30, 2014
Sponsored by ArnoldIT.com, developer of Augmentext