April 27, 2014
I read “Algorithm Distinguishes Memes from Ordinary Information.” The article reports that algorithms can pick out memes. A “meme”, according to Google, is “an element of a culture or system of behavior that may be considered to be passed from one individual to another by nongenetic means, especially imitation.” The passage that caught my attention is:
Having found the most important memes, Kuhn and co studied how they have evolved in the last hundred years or so. They say most seem to rise and fall in popularity very quickly. “As new scientific paradigms emerge, the old ones seem to quickly lose their appeal, and only a few memes manage to top the rankings over extended periods of time,” they say.
The factoid that reminded me how far smart software has yet to travel is:
To test whether these phrases are indeed interesting topics in physics, Kuhn and co asked a number of experts to pick out those that were interesting. The only ones they did not choose were: 12. Rashba, 14. ‘strange nonchaotic’ and 15. ‘in NbSe3’. Kuhn and co also checked Wikipedia, finding that about 40 per cent of these words and phrases have their own corresponding entries. Together this provides compelling evidence that the new method is indeed finding interesting and important ideas.
Systems produce outputs that are not yet spot on. I concluded that scientists, like marketers, like whizzy new phrases and ideas. Jargon, it seems, is an important part of specialist life.
Stephen E Arnold, April 27, 2014
April 23, 2014
Small time analytics isn’t really as startup-y as people may think anymore. These companies are in high demand and are pulling in some serious cash. We discovered just how much and how serious from a recent Cambridge Science Park article, “Cambridge Text Analytics Linguamatics Hits $10m in Sales.”
According to the story:
Linguamatics’ sales showed strong growth and exceeded ten million dollars in 2013, it was announced today – outperforming the company’s targeted growth and expected sales figures. The increased sales came from a boost in new customers and increased software licenses to existing customers in the pharmaceutical and healthcare sectors. This included 130 per cent growth in healthcare sales plus increased sales in professional services.
This earning potential has clearly grabbed the attention of investors. This is feeding a cycle of growth, which is why the Linguamaticses of the world can rake in impressive numbers. Just the other day, for example, Tech Circle reported on a microscopic Mumbai big data company that landed $3m in investments. They say it takes money to make money, and right now the world of big data analytics has that cycle down pat. It won’t last forever, but it’s fun to watch while it does.
Patrick Roland, April 23, 2014
April 1, 2014
Tech Radar has an article that suggests an idea we have never heard before: “How Text Mining Can Help Your Business Dig Gold.” Be mindful that was a sarcastic comment. It is already common knowledge that text mining is an advantageous tool for learning about customers, products, innovations, market trends, and other patterns. One of big data’s main purposes is capturing that information from an organization’s data. The article explains how much text data is created in a single minute, with some interesting facts (2.46 million Facebook posts, wow!).
It suggests understanding the type of knowledge you wish to capture and finding software with a user-friendly dashboard. It ends on this note:
“In summary, you need to listen to what the world is trying to tell you, and the premier technology for doing so is “text mining.” But, you can lean on others to help you use this daunting technology to extract the right conversations and meanings for you.”
The entire article is an overview of what text mining can do and how it is beneficial. It does not go beyond basic explanations or show how to mine the gold in the data mine. That will require further reading. We suggest a follow-up article that explains how text mining can also lead to fool’s gold.
March 11, 2014
Butler Analytics compiled a list, “20+ Text Analytics Platforms,” that surveys the variety of text analytics platforms available and their capabilities. According to the list, text analytics has not yet reached full maturity. There are three main divisions in the area: natural language processing, text mining, and machine learning. Each is distinct, and each company has its own approach to using these processes:
“Some suppliers have applied text analytics to very specific business problems, usually centering on customer data and sentiment analysis. This is an evolving field and the next few years should see significant progress. Other suppliers provide NLP based technologies so that documents can be categorized and meaning extracted from them. Text mining platforms are a more recent phenomenon and provide a mechanism to discover patterns that might be used in operational activities. Text is used to generate extra features which might be added to structured data for more accurate pattern discovery. There is of course overlap and most suppliers provide a mixture of capabilities. Finally we should not forget information retrieval, more often branded as enterprise search technology, where the aim is simply to provide a means of discovering and accessing data that are relevant to a particular query. This is a separate topic to a large extent, although again there is overlap.”
Reading through the list shows the variety of options users have when it comes to text analytics. There does not appear to be a right or wrong way, but will the diverse offerings eventually funnel down to a few fully capable platforms?
February 25, 2014
Even organizations not known for their adaptability can change, but don’t expect it to be rapid when it happens. Nature announces, “Elsevier Opens Its Papers to Text-Mining.” Researchers have been feeling stymied for years by the sluggish, case-by-case process through which academic publishers considered requests to computationally pull information from published papers. Now that the technical barriers to processing such requests are being remedied (at Elsevier now and expected at other publishers soon), some say the legal restrictions being placed on text mining are too severe. Reporter Richard Van Noorden writes:
“Under the arrangements, announced on 26 January at the American Library Association conference in Philadelphia, Pennsylvania, researchers at academic institutions can use Elsevier’s online interface (API) to batch-download documents in computer-readable XML format. Elsevier has chosen to provisionally limit researchers to 10,000 articles per week. These can be freely mined — so long as the researchers, or their institutions, sign a legal agreement. The deal includes conditions: for instance, that researchers may publish the products of their text-mining work only under a licence that restricts use to non-commercial purposes, can include only snippets (of up to 200 characters) of the original text, and must include links to original content.”
Others are concerned not that the terms are too restrictive, but that there are any terms at all. The article goes on:
“But some researchers feel that a dangerous precedent is being set. They argue that publishers wrongly characterize text-mining as an activity that requires extra rights to be granted by licence from a copyright holder, and they feel that computational reading should require no more permission than human reading. ‘The right to read is the right to mine,’ says Ross Mounce of the University of Bath, UK, who is using content-mining to construct maps of species’ evolutionary relationships.”
Not to be left out of the discussion, governments are making their own policies. The U.K. will soon make text mining for non-commercial use exempt from copyright, so Brits will have the right to mine any content they have paid for. Amid concerns about stifled research, the European Commission is also looking into the issue.
Meanwhile, some have already made the most of Elsevier’s new terms. For example, the European consortium the Human Brain Project is using them to work through technical issues in its project: the pursuit of a supercomputer that recreates everything we know about the human brain.
Cynthia Murrell, February 25, 2014
February 12, 2014
I read “Gödel, Escher, Bach: An Eternal Golden Braid” in 1999 or 2000. My reaction was, “I am glad I did not have Dr. Douglas R. Hofstadter critiquing my lame work for the PhD program at my university.” Dr. Hofstadter’s intellect intimidated me. I had to look up “Bach” because I knew zero about the procreative composer of organ music. (Heh, heh)
Imagine my surprise when I read “Why Watson and Siri Are Not Real AI” in Popular Mechanics magazine. Popular Mechanics is not my first choice as an information source for analysis of artificial intelligence and related disciplines. Popular Mechanics explains saws, automobiles, and gadgets.
But there was the story, illustrated with one of those bluish Jeopardy Watson photographs. The write up is meaty because Popular Mechanics asked Dr. Hofstadter questions and presented his answers. No equations. No arcane references. No intimidating the fat, ugly grad student.
The point of the write up is probably not one that IBM and Apple will like. Dr. Hofstadter does not see the “artificial intelligence” in Watson and Siri as “thinking machines.” (I share this view along with DARPA, I believe.)
Here’s a snippet of the Watson analysis:
Watson is basically a text search algorithm connected to a database just like Google search. It doesn’t understand what it’s reading. In fact, read is the wrong word. It’s not reading anything because it’s not comprehending anything. Watson is finding text without having a clue as to what the text means. In that sense, there’s no intelligence there. It’s clever, it’s impressive, but it’s absolutely vacuous.
I had to look up vacuous. It means, according to the Google “define” function: “having or showing a lack of thought or intelligence; mindless.” Okay, mindless. Isn’t IBM going to build a multi-billion dollar a year business on Watson’s technology? Isn’t IBM delivering a landslide business to the snack shops adjacent to its new Watson offices in Manhattan? Isn’t Watson saving lives in Africa?
The interview uses a number of other interesting words; for example:
Yet my favorite is the aforementioned—vacuous.
Please, read the interview in its entirety. I am not sure it will blunt the IBM and Apple PR machines, but kudos to Popular Mechanics. Now if the azure chip consultants, the failed Webmasters turned search experts, and the MBA pitch people would shift from hyperbole to reality, some clarity would return to the discussion of information retrieval.
Stephen E Arnold, February 11, 2014
February 9, 2014
Search and content processing vendors are innovating for 2014. The shift from a back office function like scanning to searching and then “solutions” is a familiar path for companies engaged in information retrieval.
I read a 38 page white paper explaining a new angle—fraud triangle analytics. You can get a copy of the explanation by navigating to http://bit.ly/1o6YpnXi and going through the registration process.
The ZyLab concept is that three factors usually surface when fraud exists. These are a payoff, an opportunity, and “the mindset of the fraudster that justifies them to commit fraud.”
ZyLab’s system uses content analytics, discovery, sentiment analysis, metatagging, faceted search, and visualization to help the analyst chase down the likelihood of fraud. ZyLab weaves in the go-to functions for attorneys from its system. Four case examples are provided, including the Enron matter.
Unlike some search vendors, ZyLab is focusing on a niche. Law enforcement is a market that a number of companies are pursuing. A number of firms offer similar tools, and the competition in this sector is increasing. IBM, for example, has products that perform or can be configured to perform in a somewhat similar manner.
IBM has the i2 product and may be in the process of acquiring a company that adds dramatic fraud detection functionality to the i2 product. This rumored acquisition adds content acquisition different from traditional credit card statements and open source content (little data or big data forms).
As some commercial markets for traditional search and content processing change, some vendors are embracing the infrastructure or framework approach. This is a good idea, and it is one that has been evident since the days of Fulcrum Technologies’ launch and TeraText’s infrastructure system. Both date from the 1980s. (My free analysis of the important TeraText system will be available on the Xenky.com Web site at the end of this month.)
At ZyLab, search is still important, but it is now part of a blended set of software packages built around the FTA notion. As the world shifts to apps and predictive methods, it is interesting to watch the re-emergence of approaches popular with vendors little known by some today.
Stephen E Arnold, February 9, 2014
February 7, 2014
What do you make of this headline from All Analytics: “Text And The City: Municipalities Discover Text Analytics”? Businesses have been using text mining software for a while and understand the insights it can deliver to business decisions. The same goes for law firms that must wade through piles of litigation. Are governments really only catching on to text mining software now?
The article reports on several examples where municipal governments have employed text mining and analytics. Law enforcement agencies are using it to identify key concepts and deliver information quickly to officials. The 311 system, known as the source of local information and immediate contact with services, is another that can benefit from text analytics, because the software can organize and process the information faster and more consistently.
There are many ways text analytics can be helpful to local governments:
“Identifying root causes is a unique value proposition for text analytics in government. It’s one thing to know something happened — a crime, a missed garbage collection, a school expulsion — and another to understand where the problem started. Conventional data often lacks clues about causes, but text reveals a lot.”
The bigger question is will local governments spend the money on these systems? Perhaps, but analytic software is expensive and governments are pressured to find low-cost solutions. Expertise and money are in short supply on this issue.
Whitney Grace, February 07, 2014
January 29, 2014
A happy quack to the reader who alerted me to www.libertypages.com. The site provides a downloadable list of stopwords. You can find the link at http://bit.ly/1fnubsY. It appears that this original list was generated by Dr. Gerard Salton. A quick scan of the list suggests that some updating may be needed. The Liberty Pages Web site redirects to Lextek, developers of Onix. I have a profile of the Onix system. Once the Autonomy IDOL and TeraText profiles are on the Xenky site, I will hunt around for my Lextek analysis. The company is still in business, operating out of a home in Provo, Utah.
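For readers who have not worked with a stopword list, here is a minimal Python sketch of how one is typically applied before indexing or analysis. The handful of words in SAMPLE_STOPWORDS and the function name remove_stopwords are my own illustrative assumptions; they are not taken from the Liberty Pages list or from Onix, and a real list runs to hundreds of entries.

```python
import re

# Illustrative stopwords only (an assumption for this sketch);
# a real list, such as the one linked above, is far longer.
SAMPLE_STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "that"}


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop any stopword."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(remove_stopwords("The site provides a downloadable list of stopwords."))
# prints ['site', 'provides', 'downloadable', 'list', 'stopwords']
```

The point of the exercise is that high-frequency function words carry little retrieval value, so a system can skip them. Which words belong on the list is a judgment call, which is one reason an older list may need updating.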
Stephen E Arnold, January 29, 2014
January 8, 2014
“IBM Struggles to turn Watson into Big Business” warrants a USA Today treatment. You can find the story in the hard copy of the newspaper on pages A 1 and A 2. I saw a link to the item online at http://on.wsj.com/1iShfOG but you may have to pay to read it or chase down a Penguin friendly instance of the article.
The main point is that IBM targeted $10 billion in Watson revenue by 2023. Watson has generated less than $100 million in revenue, I presume, since the system “won” the Jeopardy game show.
The Wall Street Journal article is interesting because it contains a number of semantic signals, for example:
- The use of the phrase “in a ditch” in reference to a project at the University of Texas M.D. Anderson Cancer Center
- The statement “Watson is having more trouble solving real-life problems”
- The revelation that “Watson doesn’t work with standard hardware”
- An allegedly accurate quote from a client that says “Watson initially took too long to learn”
- The assertion that “IBM reworked Watson’s training regimen”
- The sprinkling of “could’s” and “if’s”
I came away from the story with a sense of déjà vu. I realized that over the last 25 years I have heard similar information about other “smart” search systems. The themes run through time the way a bituminous coal seam threads through the crust of the earth. When one of these seams catches fire, there are few inexpensive and quick ways to put out the fire. Applied to Watson, my hunch is that the cost of getting Watson to generate $10 billion in revenue is going to be a very big number.
The Wall Street Journal story references the need for humans to learn and then to train Watson about the topic. When Watson goes off track, more humans have to correct Watson. I want to point out that training a smart system on a specific corpus of content is tricky. Algorithms can be quite sensitive to small errors in initial settings. Over time, the algorithms do their thing and wander. This translates to humans who have to monitor the smart system to make sure it does not output information in which it has generated confidence scores that are wrong or undifferentiated. The Wall Street Journal nudges this state of affairs in this passage:
In a recent visit to his [a Sloan Kettering oncologist] pulled out an iPad and showed a screen from Watson that listed three potential treatments. Watson was less than 32% confident that any of them were [sic] correct.
Then the Wall Street Journal reported that tweaking Watson was tough, saying:
The project initially ran awry because IBM’s engineers and Anderson’s doctors didn’t understand each other.
No surprise, but the fix just adds to the costs of the system. The article revealed:
IBM developers now meet with doctors several times a week.
Why is this Watson write up intriguing to me? There are four reasons:
First, the Wall Street Journal makes clear that dreams about dollars from search and content processing are easy to inflate and tough to deliver. Most search vendors and their stakeholders discover the difference between marketing hyperbole and reality.
Second, the Watson system is essentially dependent on human involvement. The objective of certain types of smart software is to reduce the need for human involvement. Watching Star Trek and Spock is not the same as delivering advanced systems that work and are affordable.
Third, the revenue generated by Watson is actually pretty good. Endeca hit $100 million between 1998 and 2011 when it was acquired by Oracle. Autonomy achieved $800 million between 1996 and 2011 when it was purchased by Hewlett Packard. Watson has been available for a couple of years. The problem is that the goal is, it appears, out of reach even for a company with IBM’s need for a hot new product and the resources to sell almost anything to large organizations.
Fourth, Watson is walking down the same path that STAIRS III, an early IBM search system, followed. IBM embraced open source to help reduce the cost of delivering basic search. Now IBM is finding that the value-adds are more difficult than key word matching and Boolean centric information retrieval. When a company does not learn from its own prior experiences in content processing, the voyage of discovery becomes more risky.
Net net: IBM has its hands full. I am confident that an azure chip consultant and a couple of 20 somethings can fix up Watson in a nonce. But if remediation is not possible, IBM may vie with Hewlett Packard as the pre-eminent example of the perils of the search and content processing business.
Stephen E Arnold, January 8, 2014