Digging for Data Gold
April 1, 2014
Tech Radar has an article that suggests an idea we have never heard before: “How Text Mining Can Help Your Business Dig Gold.” Be mindful that was a sarcastic comment. It is already common knowledge that text mining is advantageous tool to learn about customers, products, new innovations, market trends, and other patterns. One of big data’s main scopes is capturing that information from an organization’s data. The article explains how much data is created in a single minute from text with some interesting facts (2.46 million Facebook posts, wow!).
It suggests understanding the type of knowledge you wish to capture and finding software with a user-friendly dashboard. It ends on this note:
“In summary, you need to listen to what the world is trying to tell you, and the premier technology for doing so is “text mining.” But, you can lean on others to help you use this daunting technology to extract the right conversations and meanings for you.”
The entire article is an overview of what text mining can do and how it is beneficial. It does not go further than basic explanations or how to mine the gold in the data mine. That will require further reading. We suggest a follow up article that explains how text mining can also lead to fool’s gold.
Whitney Grace, April 01, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Twenty Electric Text Analytics Platforms
March 11, 2014
Butler Analytics collected a list of “20+ Text Analytics Platforms” that delve through the variety of text analytics platforms available and what their capabilities are. According to the list, text analytics has not reached its full maturity yet. There are three main divisions in the area: natural language processing, text mining, and machine learning. Each is distinct and each company has their own approach to using these processes:
“Some suppliers have applied text analytics to very specific business problems, usually centering on customer data and sentiment analysis. This is an evolving field and the next few years should see significant progress. Other suppliers provide NLP based technologies so that documents can be categorized and meaning extracted from them. Text mining platforms are a more recent phenomenon and provide a mechanism to discover patterns that might be used in operational activities. Text is used to generate extra features which might be added to structured data for more accurate pattern discovery. There is of course overlap and most suppliers provide a mixture of capabilities. Finally we should not forget information retrieval, more often branded as enterprise search technology, where the aim is simply to provide a means of discovering and accessing data that are relevant to a particular query. This is a separate topic to a large extent, although again there is overlap.”
Reading through the list shows the variety of options users have when it comes to text analytics. There does not appear to be a right or wrong way, but will the diverse offerings eventually funnel
down to few fully capable platforms?
Whitney Grace, March 11, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Elsevier Gingerly Opens Papers to Text Mining
February 25, 2014
Even organizations not known for their adaptability can change, but don’t expect it to be rapid when it happens. Nature announces, “Elsevier Opens Its Papers to Text-Mining.” Researchers have been feeling stymied for years by the sluggish, case-by-case process through which academic publishers considered requests to computationally pull information from published papers. Now that the technical barriers to processing such requests are being remedied (at Elsevier now and expected at other publishers soon), some say the legal restrictions being placed on text mining are too severe. Reporter Richard Van Noorden writes:
“Under the arrangements, announced on 26 January at the American Library Association conference in Philadelphia, Pennsylvania, researchers at academic institutions can use Elsevier’s online interface (API) to batch-download documents in computer-readable XML format. Elsevier has chosen to provisionally limit researchers to 10,000 articles per week. These can be freely mined — so long as the researchers, or their institutions, sign a legal agreement. The deal includes conditions: for instance, that researchers may publish the products of their text-mining work only under a licence that restricts use to non-commercial purposes, can include only snippets (of up to 200 characters) of the original text, and must include links to original content.”
Others are concerned not that the terms are too restrictive, but that there are any terms at all. The article goes on:
“But some researchers feel that a dangerous precedent is being set. They argue that publishers wrongly characterize text-mining as an activity that requires extra rights to be granted by licence from a copyright holder, and they feel that computational reading should require no more permission than human reading. ‘The right to read is the right to mine,’ says Ross Mounce of the University of Bath, UK, who is using content-mining to construct maps of species’ evolutionary relationships.”
Not to be left out of the discussion, governments are making their own policies. The U.K. will soon make text mining for non-commercial use exempt from copyright, so any content a Brit has paid for they will have the right to mine. Amid concerns about stifled research, the European Commission is also looking into the issue.
Meanwhile, some have already made the most of Elsevier’s new terms. For example, the European consortium the Human Brain Project is using it to work through technical issues in their project: the pursuit of a supercomputer that recreates everything we know about the human brain.
Cynthia Murrell, February 25, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Getting a Failing Grade in Artificial Intelligence: Watson and Siri
February 12, 2014
I read “Gödel, Escher, Bach: An Eternal Golden Braid” in 1999 or 2000. My reaction was, “I am glad I did not have Dr. Douglas R. Hofstadter critiquing my lame work for the PhD program at my university. Dr. Hofstadter’s intellect intimidated me. I had to look up “Bach” because I knew zero about the procreative composer of organ music. (Heh, heh)
Imagine my surprise when I read “Why Watson and Siri Are Not Real AI” in Popular Mechanics magazine. Popular Mechanics is not my first choice as an information source for analysis of artificial intelligence and related disciplines. Popular Mechanics explains saws, automobiles, and gadgets.
But there was the story, illustration with one of those bluish Jeopardy Watson photographs. The write up is meaty because Popular Mechanics asked Dr. Hofstadter questions and presented his answers. No equations. No arcane references. No intimidating the fat, ugly grad student.
The point of the write up is probably not one that IBM and Apple will like. Dr. Hofstadter does not see the “artificial intelligence” in Watson and Siri as “thinking machines.” (I share this view along with DARPA, I believe.)
Here’s a snippet of the Watson analysis:
Watson is basically a text search algorithm connected to a database just like Google search. It doesn’t understand what it’s reading. In fact, read is the wrong word. It’s not reading anything because it’s not comprehending anything. Watson is finding text without having a clue as to what the text means. In that sense, there’s no intelligence there. It’s clever, it’s impressive, but it’s absolutely vacuous.
I had to look up vacuous. It means, according to the Google “define” function: “having or showing a lack of thought or intelligence; mindless.” Okay, mindless. Isn’t IBM going to build a multi-billion dollar a year business on Watson’s technology? Isn’t IBM delivering a landslide business to the snack shops adjacent its new Watson offices in Manhattan? Isn’t Watson saving lives in Africa?
The interview uses a number of other interesting words; for example:
- Hype
- Silliest
- Shambles
- Slippery
- Profits
Yet my favorite is the aforementioned—vacuous.
Please, read the interview in its entirety. I am not sure it will blunt the IBM and Apple PR machines, but kudos to Popular Mechanics. Now if the azure chip consultants, the failed Webmasters turned search experts, and the MBA pitch people would shift from hyperbole to reality, some clarity would return to the discussion of information retrieval.
Stephen E Arnold, February 11, 2014
From Scanning to eDiscovery to Fraud Triangle Analytics
February 9, 2014
Search and content processing vendors are innovating for 2014. The shift from a back office function like scanning to searching and then “solutions” is a familar path for companies engaged in information retrieval.
I read a 38 page white paper explaining a new angle—fraud triangle analytics. You can get a copy of the explanation by navigating to http://bit.ly/1o6YpnXi and going through the registration process.
The ZyLab concept is that three factors usually surface when fraud exists. These are a payoff, an opportunity, and “ the mindset of the fraudster that justifies them to commit fraud.”
ZyLab’s system uses content analytics, discovery, sentiment analysis, metatagging, faceted search, and visualization to help the analyst chase down the likelihood of fraud. ZyLab weaves in the go-to functions for attorneys from its system. Four case examples are provided, including the Enron matter.
Unlike some search vendors, ZyLab is focusing on a niche. Law enforcement is a market that a number of companies are pursuing. A number of firms offer similar tools, and the competition in this sector is increasing. IBM, for example, has products that perform or can be configured to perform in a somewhat similar manner.
IBM has the i2 product and may be in the process of acquiring a company that adds dramatic fraud detection functionality to the i2 product. This rumored acquisition adds content acquisition different from traditional credit card statements and open source content (little data or big data forms).
As some commercial markets for traditional search and content processing, some vendors are embracing the infrastructure or framework approach. This is a good idea, and it is one that has been evident since the days of Fulcrum Technologies’ launch and TeraText’s infrastructure system. Both date from the 1980s. (My free analysis of the important TeraText system will appear be available on the Xenky.com Web site at the end of this month.)
At ZyLab, search is still important, but it is now a blended set of software package with the FTA notion. As the world shifts to apps and predictive methods, it is interesting to watch the re-emergence of approaches popular with vendors little known by some today.
Stephen E Arnold, February 9, 2014
Government Buys into Text Analytics
February 7, 2014
What do you make of this headline from All Analytics: “Text And The City: Municipalities Discover Text Analytics”? Businesses have been using text mining software for awhile and understand the insights it can deliver to business decisions. The same goes for law firms that must wade through piles of litigation. Are governments really only catching onto text mining software now?
The article reports on several examples where municipal governments have employed text mining and analytics. Law enforcement agencies are using it to identify key concepts to deliver quick information to officials. The 311 systems, known as the source of local information and immediate contact with services, is another system that can benefit from text analytics, because it can organize and process the information faster and more consistently.
There are many ways text analytics can be helpful to local governments:
“Identifying root causes is a unique value proposition for text analytics in government. It’s one thing to know something happened — a crime, a missed garbage collection, a school expulsion — and another to understand where the problem started. Conventional data often lacks clues about causes, but text reveals a lot.”
The bigger question is will local governments spend the money on these systems? Perhaps, but analytic software is expensive and governments are pressured to find low-cost solutions. Expertise and money are in short supply on this issue.
Whitney Grace, February 07, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Free Stopword List
January 29, 2014
A happy quack to the reader who alerted me to www.libertypages.com. The site provides a downloadable list of stopwords. You can find the link at http://bit.ly/1fnubsY. It appears that this original list was generated by Dr. Gerald Salton. A quick scan of the list suggests that some updating may be needed. The Liberty Pages Web site redirects to Lextek, developers of Onix. I have a profile of the Onix system. Once the Autonomy IDOL and TeraText profiles are on the Xenky site, I will hunt around for my Lextek analysis. The company is still in business, operating out of a home in Provo, Utah.
Stephen E Arnold, January 29, 2014
IBM Wrestling with Watson
January 8, 2014
“IBM Struggles to turn Watson into Big Business” warrants a USA Today treatment. You can find the story in the hard copy of the newspaper on page A 1 and A 2. I saw a link to the item online at http://on.wsj.com/1iShfOG but you may have to pay to read it or chase down a Penguin friendly instance of the article.
The main point is that IBM targeted $10 billion in Watson revenue by 2023. Watson has generated less than $100 million in revenue I presume since the system “won” the Jeopardy game show.
The Wall Street Journal article is interesting because it contains a number of semantic signals, for example:
- The use of the phrase “in a ditch” in reference to a a project at the University of Texas M.D. Anderson Cancer Center
- The statement “Watson is having more trouble solving real-life problems”
- The revelation that “Watson doesn’t work with standard hardware”
- An allegedly accurate quote from a client that says “Watson initially took too long to learn”
- The assertion that “IBM reworked Watson’s training regimen”
- The sprinkling of “could’s” and “if’s”
I came away from the story with a sense of déjà vu. I realized that over the last 25 years I have heard similar information about other “smart” search systems. The themes run through time the way a bituminous coal seam threads through the crust of the earth. When one of these seams catches fire, there are few inexpensive and quick ways to put out the fire. Applied to Watson, my hunch is that the cost of getting Watson to generate $10 billion in revenue is going to be a very big number.
The Wall Street Journal story references the need for humans to learn and then to train Watson about the topic. When Watson goes off track, more humans have to correct Watson. I want to point out that training a smart system on a specific corpus of content is tricky. Algorithms can be quite sensitive to small errors in initial settings. Over time, the algorithms do their thing and wander. This translates to humans who have to monitor the smart system to make sure it does not output information in which it has generated confidence scores that are wrong or undifferentiated. The Wall Street Journal nudges this state of affairs in this passage:
In a recent visit to his [a Sloan Kettering oncologist] pulled out an iPad and showed a screen from Watson that listed three potential treatments. Watson was less than 32% confident that any of them were [sic] correct.
Then the Wall Street Journal reported that tweaking Watson was tough, saying:
The project initially ran awry because IBM’s engineers and Anderson’s doctors didn’t understand each other.
No surprise, but the fix just adds to the costs of the system. The article revealed:
IBM developers now meet with doctors several times a week.
Why is this Watson write up intriguing to me? There are four reasons:
First, the Wall Street Journal makes clear that dreams about dollars from search and content processing are easy to inflate and tough to deliver. Most search vendors and their stakeholders discover the difference between marketing hyperbole and reality.
Second, the Watson system is essentially dependent on human involvement. The objective of certain types of smart software is to reduce the need for human involvement. Watching Star Trek and Spock is not the same as delivering advanced systems that work and are affordable.
Third, the revenue generated by Watson is actually pretty good. Endeca hit $100 million between 1998 and 2011 when it was acquired by Oracle. Autonomy achieved $800 million between 1996 and 2011 when it was purchased by Hewlett Packard. Watson has been available for a couple of years. The problem is that the goal is, it appears, out of reach even for a company with IBM’s need for a hot new product and the resources to sell almost anything to large organizations.
Fourth, Watson is walking down the same path that STAIRS III, an early IBM search system, followed. IBM embraced open source to help reduce the cost of delivering basic search. Now IBM is finding that the value-adds are more difficult than key word matching and Boolean centric information retrieval. When a company does not learn from its own prior experiences in content processing, the voyage of discovery becomes more risky.
Net net: IBM has its hands full. I am confident that an azure chip consultant and a couple of 20 somethings can fix up Watson in a nonce. But if remediation is not possible, IBM may vie with Hewlett Packard as the pre-eminent example of the perils of the search and content processing business.
Stephen E Arnold, January 8, 2014
HP and Its New IDOL Categorizer
January 1, 2014
I read “Analytics for Human Information: Optimize Information Categorization with HP IDOL.” I noticed that HP did not reference the original reference to the 1998 categorization technology in its write up. From my point of view, news about something developed 15 years ago and referenced in subsequent Autonomy collateral is not something fresh to me. In fact, presenting the categorizer as something “amazing” suggests a superficial grasp of the history of IDOL technology which dates from the late 1980s and early 1990s. It is fascinating how some “experts” in content processing reinvent the wheel and display their intellectual process in such an amusing way. Is it possible to fool oneself and others? Remarkable.
Update, January 1, 2014, 11 am Eastern:
Hewlett Packard is publicizing IDOL’s automatic categorization capability. As a point of fact, this function has been available for 15 years. Here’s a description from a 2001 Autonomy IDOL Server Technical Brief, 2001.
DOL server can automatically categorize data with no requirement for manual input whatsoever. The flexibility of Autonomy’s Categorization feature allows you to precisely derive categories using concepts found within unstructured text. This ensures that all data is classified in the correct context with the utmost accuracy. Autonomy’s Categorization feature is a completely scalable solution capable of handling
high volumes of information with extreme accuracy and total consistency. Rather than relying on rigid rule based category definitions such as Legacy Keyword and Boolean Operators, Autonomy’s infrastructure relies on an elegant pattern matching process based on concepts to categorize documents and automatically insert tag data sets, route content or alert users to highly relevant information pertinent to the users profile. This highly efficient process means that Autonomy is able to categorize upwards of four million documents in 24 hours per CPU instance, that’s approximately one document, every 25 milliseconds. Autonomy hooks into virtually all repositories and data formats respecting all security and access entitlements, delivering complete reliability. IDOL server accepts a category or piece of content and returns categories ranked by conceptual similarity. This determines for which categories the piece of content is most appropriate, so that the piece of content can subsequently be tagged, routed or filed accordingly.
Stephen E Arnold, January 1, 2014
A Non Search Person Explains Why Search Is a Lost Cause
December 16, 2013
The author of “2013: the Year ‘the Stream’ Crested” is focused on tapping into flows of data. Twitter and real time “Big Data” streams are the subtext for the essay. I liked the analysis. In one 2,500 word write up, the severe weaknesses of enterprise and Web search systems are exposed.
The main point of the article is that “the stream”—that is, flows of information and data—is what people want. The flow is of sufficient volume that making sense of it is difficult. Therefore, an opportunity exists for outfits like The Atlantic to provide curation, perspective, and editorial filtering. The write up’s code for this higher-value type of content process is “the stock.”
The article asserts:
This is the strange circumstance that obtained in 2013, given the volume of the stream. Regular Internet users only had three options: 1) be overwhelmed 2) hire a computer to deploy its logic to help sort things 3) get out of the water.
The take away for me is that the article makes clear that search and retrieval just don’t work. Some “new” is needed. Perhaps this frustration with search is the trigger behind the interest in “artificial intelligence” and “machine learning”? Predictive analytics may have a shot at solving the problem of finding and identifying needed information, but from what I have seen, there is a lot of talk about fancy math and little evidence that it works at low cost in a manner that makes sense to the average person. Data scientists are not a dime a dozen. Average folks are.
Will the search and content processing vendors step forward and provide concrete facts that show a particular system can solve a Big Data problem for Everyman and Everywoman? We know Google is shifting to an approach to search that yields revenue. Money, not precision and recall, is increasingly important. The search and content vendors who toss around the word “all” have not been able to deliver unless the content corpus is tightly defined and constrained.
Isn’t it obvious that processing infinite flows and changes to “old” content are likely to cost a lot of money. Google, Bing, and Yandex search are not particularly “good.” Each is becoming a system designed to support other functions. In fact, looking for information that is only five or six years “old” is an exercise in frustration. Where has that document “gone.” What other data are not in the index. The vendors are not talking.
In the enterprise, the problem is almost as hopeless. Vendors invent new words to describe a function that seems to convey high value. Do you remember this catchphrase: “One step to ROI”? How do you think that company performed? The founders were able to sell the company and some of the technology lives on today, but the limitations of the system remain painfully evident.
Search and retrieval is complex, expensive to implement in an effective manner, and stuck in a rut. Giving away a search system seems to reduce costs? But are license fees the major expense? Embracing fancy math seems to deliver high value answers? But are the outputs accurate? Users just assume these systems work.
Kudos to Atlantic for helping to make clear that in today’s data world, something new is needed. Changing the words used to describe such out of favor functions as “editorial policy”, controlled terms, scheduled updates, and the like is more popular than innovation.
Stephen E Arnold, December 16, 2013