May 5, 2014
SAS is a well-recognized player in IT game as a purveyor of data, security, and analytics software. In modern terms they are a big player in big data and in order to beef up their offerings we caught word that SAS had updated its Text Miner. SAS Text Miner is advertised as a way for users to not only harness information in legacy data, but also in Web sites, databases, and other text sources. The process can be used to discover new ideas and improve decision-making.
SAS Text Miner a variety of benefits that make it different from the standard open source download. Not only do users receive the license and tech support, but Text Miner offers the ability to process and analyze knowledge in minutes, an interactive user interface, and predictive and data mining modeling techniques. The GUI is what will draw in developers:
“Interactive GUIs make it easy to identify relevance, modify algorithms, document assignments and group materials into meaningful aggregations. So you can guide machine-learning results with human insights. Extend text mining efforts beyond basic start-and-stop lists using custom entities and term trend discovery to refine automatically generated rules.”
Being able to modify proprietary software is a deal breaker these days. With multiple options for text mining software, being able to make it unique is what will sell it.
April 27, 2014
I read “Algorithm Distinguishes Memes from Ordinary Information.” The article reports that algorithms can pick out memes. A “meme”, according to Google, is “an element of a culture or system of behavior that may be considered to be passed from one individual to another by nongenetic means, especially imitation.” The passage that caught my attention is:
Having found the most important memes, Kuhn and co studied how they have evolved in the last hundred years or so. They say most seem to rise and fall in popularity very quickly. “As new scienti?c paradigms emerge, the old ones seem to quickly lose their appeal, and only a few memes manage to top the rankings over extended periods of time,” they say.
The factoid that reminded me how far smart software has yet to travel is:
To test whether these phrases are indeed interesting topics in physics, Kuhn and co asked a number of experts to pick out those that were interesting. The only ones they did not choose were: 12. Rashba, 14. ‘strange nonchaotic’ and 15. ‘in NbSe3′. Kuhn and co also checked Wikipedia, finding that about 40 per cent of these words and phrases have their own corresponding entries. Together this provides compelling evidence that the new method is indeed finding interesting and important ideas.
Systems produce outputs that are not yet spot on. I concluded that scientists, like marketers, like whizzy new phrases and ideas. Jargon, it seems, is an important part of specialist life.
Stephen E Arnold, April 27, 2014
April 23, 2014
Small time analytics isn’t really as startup-y as people may think anymore. These companies are in high demand and are pulling in some serious cash. We discovered just how much and how serious from a recent Cambridge Science Park article, “Cambridge Text Analytics Linguamatics Hits $10m in Sales.”
According to the story:
Linguamatics’ sales showed strong growth and exceeded ten million dollars in 2013, it was announced today – outperforming the company’s targeted growth and expected sales figures. The increased sales came from a boost in new customers and increased software licenses to existing customers in the pharmaceutical and healthcare sectors. This included 130 per cent growth in healthcare sales plus increased sales in professional services.
This earning potential has clearly grabbed the attention of investors. This, is feeding a cycle of growth, which is why the Linguamaticses of the world can rake in impressive numbers. Just the other day, for example, Tech Circle reported on a microscopic Mumbai big data company that landed $3m in investments. They say it takes money to make money and right now, the world of big data analytics has that cycle down pat. It won’t last forever, but it’s fun to watch as it does.
Patrick Roland, April 23, 2014
April 1, 2014
Tech Radar has an article that suggests an idea we have never heard before: “How Text Mining Can Help Your Business Dig Gold.” Be mindful that was a sarcastic comment. It is already common knowledge that text mining is advantageous tool to learn about customers, products, new innovations, market trends, and other patterns. One of big data’s main scopes is capturing that information from an organization’s data. The article explains how much data is created in a single minute from text with some interesting facts (2.46 million Facebook posts, wow!).
It suggests understanding the type of knowledge you wish to capture and finding software with a user-friendly dashboard. It ends on this note:
“In summary, you need to listen to what the world is trying to tell you, and the premier technology for doing so is “text mining.” But, you can lean on others to help you use this daunting technology to extract the right conversations and meanings for you.”
The entire article is an overview of what text mining can do and how it is beneficial. It does not go further than basic explanations or how to mine the gold in the data mine. That will require further reading. We suggest a follow up article that explains how text mining can also lead to fool’s gold.
March 11, 2014
Butler Analytics collected a list of “20+ Text Analytics Platforms” that delve through the variety of text analytics platforms available and what their capabilities are. According to the list, text analytics has not reached its full maturity yet. There are three main divisions in the area: natural language processing, text mining, and machine learning. Each is distinct and each company has their own approach to using these processes:
“Some suppliers have applied text analytics to very specific business problems, usually centering on customer data and sentiment analysis. This is an evolving field and the next few years should see significant progress. Other suppliers provide NLP based technologies so that documents can be categorized and meaning extracted from them. Text mining platforms are a more recent phenomenon and provide a mechanism to discover patterns that might be used in operational activities. Text is used to generate extra features which might be added to structured data for more accurate pattern discovery. There is of course overlap and most suppliers provide a mixture of capabilities. Finally we should not forget information retrieval, more often branded as enterprise search technology, where the aim is simply to provide a means of discovering and accessing data that are relevant to a particular query. This is a separate topic to a large extent, although again there is overlap.”
Reading through the list shows the variety of options users have when it comes to text analytics. There does not appear to be a right or wrong way, but will the diverse offerings eventually funnel
down to few fully capable platforms?
February 25, 2014
Even organizations not known for their adaptability can change, but don’t expect it to be rapid when it happens. Nature announces, “Elsevier Opens Its Papers to Text-Mining.” Researchers have been feeling stymied for years by the sluggish, case-by-case process through which academic publishers considered requests to computationally pull information from published papers. Now that the technical barriers to processing such requests are being remedied (at Elsevier now and expected at other publishers soon), some say the legal restrictions being placed on text mining are too severe. Reporter Richard Van Noorden writes:
“Under the arrangements, announced on 26 January at the American Library Association conference in Philadelphia, Pennsylvania, researchers at academic institutions can use Elsevier’s online interface (API) to batch-download documents in computer-readable XML format. Elsevier has chosen to provisionally limit researchers to 10,000 articles per week. These can be freely mined — so long as the researchers, or their institutions, sign a legal agreement. The deal includes conditions: for instance, that researchers may publish the products of their text-mining work only under a licence that restricts use to non-commercial purposes, can include only snippets (of up to 200 characters) of the original text, and must include links to original content.”
Others are concerned not that the terms are too restrictive, but that there are any terms at all. The article goes on:
“But some researchers feel that a dangerous precedent is being set. They argue that publishers wrongly characterize text-mining as an activity that requires extra rights to be granted by licence from a copyright holder, and they feel that computational reading should require no more permission than human reading. ‘The right to read is the right to mine,’ says Ross Mounce of the University of Bath, UK, who is using content-mining to construct maps of species’ evolutionary relationships.”
Not to be left out of the discussion, governments are making their own policies. The U.K. will soon make text mining for non-commercial use exempt from copyright, so any content a Brit has paid for they will have the right to mine. Amid concerns about stifled research, the European Commission is also looking into the issue.
Meanwhile, some have already made the most of Elsevier’s new terms. For example, the European consortium the Human Brain Project is using it to work through technical issues in their project: the pursuit of a supercomputer that recreates everything we know about the human brain.
Cynthia Murrell, February 25, 2014
February 12, 2014
I read “Gödel, Escher, Bach: An Eternal Golden Braid” in 1999 or 2000. My reaction was, “I am glad I did not have Dr. Douglas R. Hofstadter critiquing my lame work for the PhD program at my university. Dr. Hofstadter’s intellect intimidated me. I had to look up “Bach” because I knew zero about the procreative composer of organ music. (Heh, heh)
Imagine my surprise when I read “Why Watson and Siri Are Not Real AI” in Popular Mechanics magazine. Popular Mechanics is not my first choice as an information source for analysis of artificial intelligence and related disciplines. Popular Mechanics explains saws, automobiles, and gadgets.
But there was the story, illustration with one of those bluish Jeopardy Watson photographs. The write up is meaty because Popular Mechanics asked Dr. Hofstadter questions and presented his answers. No equations. No arcane references. No intimidating the fat, ugly grad student.
The point of the write up is probably not one that IBM and Apple will like. Dr. Hofstadter does not see the “artificial intelligence” in Watson and Siri as “thinking machines.” (I share this view along with DARPA, I believe.)
Here’s a snippet of the Watson analysis:
Watson is basically a text search algorithm connected to a database just like Google search. It doesn’t understand what it’s reading. In fact, read is the wrong word. It’s not reading anything because it’s not comprehending anything. Watson is finding text without having a clue as to what the text means. In that sense, there’s no intelligence there. It’s clever, it’s impressive, but it’s absolutely vacuous.
I had to look up vacuous. It means, according to the Google “define” function: “having or showing a lack of thought or intelligence; mindless.” Okay, mindless. Isn’t IBM going to build a multi-billion dollar a year business on Watson’s technology? Isn’t IBM delivering a landslide business to the snack shops adjacent its new Watson offices in Manhattan? Isn’t Watson saving lives in Africa?
The interview uses a number of other interesting words; for example:
Yet my favorite is the aforementioned—vacuous.
Please, read the interview in its entirety. I am not sure it will blunt the IBM and Apple PR machines, but kudos to Popular Mechanics. Now if the azure chip consultants, the failed Webmasters turned search experts, and the MBA pitch people would shift from hyperbole to reality, some clarity would return to the discussion of information retrieval.
Stephen E Arnold, February 11, 2014
February 9, 2014
Search and content processing vendors are innovating for 2014. The shift from a back office function like scanning to searching and then “solutions” is a familar path for companies engaged in information retrieval.
I read a 38 page white paper explaining a new angle—fraud triangle analytics. You can get a copy of the explanation by navigating to http://bit.ly/1o6YpnXi and going through the registration process.
The ZyLab concept is that three factors usually surface when fraud exists. These are a payoff, an opportunity, and “ the mindset of the fraudster that justifies them to commit fraud.”
ZyLab’s system uses content analytics, discovery, sentiment analysis, metatagging, faceted search, and visualization to help the analyst chase down the likelihood of fraud. ZyLab weaves in the go-to functions for attorneys from its system. Four case examples are provided, including the Enron matter.
Unlike some search vendors, ZyLab is focusing on a niche. Law enforcement is a market that a number of companies are pursuing. A number of firms offer similar tools, and the competition in this sector is increasing. IBM, for example, has products that perform or can be configured to perform in a somewhat similar manner.
IBM has the i2 product and may be in the process of acquiring a company that adds dramatic fraud detection functionality to the i2 product. This rumored acquisition adds content acquisition different from traditional credit card statements and open source content (little data or big data forms).
As some commercial markets for traditional search and content processing, some vendors are embracing the infrastructure or framework approach. This is a good idea, and it is one that has been evident since the days of Fulcrum Technologies’ launch and TeraText’s infrastructure system. Both date from the 1980s. (My free analysis of the important TeraText system will appear be available on the Xenky.com Web site at the end of this month.)
At ZyLab, search is still important, but it is now a blended set of software package with the FTA notion. As the world shifts to apps and predictive methods, it is interesting to watch the re-emergence of approaches popular with vendors little known by some today.
Stephen E Arnold, February 9, 2014
February 7, 2014
What do you make of this headline from All Analytics: “Text And The City: Municipalities Discover Text Analytics”? Businesses have been using text mining software for awhile and understand the insights it can deliver to business decisions. The same goes for law firms that must wade through piles of litigation. Are governments really only catching onto text mining software now?
The article reports on several examples where municipal governments have employed text mining and analytics. Law enforcement agencies are using it to identify key concepts to deliver quick information to officials. The 311 systems, known as the source of local information and immediate contact with services, is another system that can benefit from text analytics, because it can organize and process the information faster and more consistently.
There are many ways text analytics can be helpful to local governments:
“Identifying root causes is a unique value proposition for text analytics in government. It’s one thing to know something happened — a crime, a missed garbage collection, a school expulsion — and another to understand where the problem started. Conventional data often lacks clues about causes, but text reveals a lot.”
The bigger question is will local governments spend the money on these systems? Perhaps, but analytic software is expensive and governments are pressured to find low-cost solutions. Expertise and money are in short supply on this issue.
Whitney Grace, February 07, 2014
January 29, 2014
A happy quack to the reader who alerted me to www.libertypages.com. The site provides a downloadable list of stopwords. You can find the link at http://bit.ly/1fnubsY. It appears that this original list was generated by Dr. Gerald Salton. A quick scan of the list suggests that some updating may be needed. The Liberty Pages Web site redirects to Lextek, developers of Onix. I have a profile of the Onix system. Once the Autonomy IDOL and TeraText profiles are on the Xenky site, I will hunt around for my Lextek analysis. The company is still in business, operating out of a home in Provo, Utah.
Stephen E Arnold, January 29, 2014