March 6, 2014
If you are conducting research on natural language processing software but have run short of resources, take a look at Connexor’s “NLP Library,” which we recently located. Connexor is a company that develops text analysis software components, solutions, and services. They are experts in their line of work and are keen to help people utilize their data to its full extent. Connexor explains that:
“Connexor components have turned out to be necessary in many types of software products and solutions that need linguistic intelligence in text analytics tasks. We work with software houses, service providers, system integrators, resellers and research labs, in the fields of education, health, security, business and administration. We have customers and partners in over 30 countries.”
The company’s NLP Library includes bibliographic citations for articles. We can assume that Connexor employees wrote these articles. They cover a variety of subjects dealing with natural language processing and text evaluation, and they even touch on emotion extraction from text. These articles are a handy resource, especially if you need up-to-date research. There is only one article for 2014, but the year is still young and more are probably on the way.
February 12, 2014
The presentation on SlideShare titled “Got Chaos? Extracting Business Intelligence from Email with Natural Language Processing and Dynamic Graph Analysis” discusses the work by Digital Reasoning and Paragon Science. Digital Reasoning asserts that it is an Oracle for human language data. There are color-coded sentences that illustrate the abilities of natural language processing, from recognizing people and location words to entities related to a single concept and associated entities. The presentation consists of many equations, but the overview explains,
“In this presentation, O’Reilly author and Digital Reasoning CTO Matthew Russell along with Dr. Steve Kramer, founder and chief scientist at Paragon Science, discuss how Digital Reasoning processed the Enron corpus with its advanced Natural Language Processing (NLP) technology – effectively transforming it into building blocks that are viable for data science. Then, Paragon Science used dynamic graph analysis inspired from particle physics to tease out insights from the data.”
Ultimately, the point of the entire process was to gain a better understanding of how the Enron catastrophe could be avoided in other enterprises. It is difficult to say whether Digital Reasoning is imitating IBM Watson or if IBM Watson is imitating Digital Reasoning. At any rate, it sounds familiar. Didn’t Autonomy, TeraText, and other firms push into this sector decades ago?
Chelsea Kerwin, February 12, 2014
February 3, 2014
Machine translation can be a wonderful thing, but one key language has garnered less consideration than other widely used languages. Though both Google and Babylon have made good progress [pdf] on Arabic translation, folks at The Stanford Natural Language Processing Group know there is plenty of room for improvement. These scientists are working to close that gap with their Arabic Natural Language Processing project.
The page’s overview tells us:
“Arabic is the largest member of the Semitic language family and is spoken by nearly 500 million people worldwide. It is one of the six official UN languages. Despite its cultural, religious, and political significance, Arabic has received comparatively little attention by modern computational linguistics. We are remedying this oversight by developing tools and techniques that deliver state-of-the-art performance in a variety of language processing tasks. Machine translation is our most active area of research, but we have also worked on statistical parsing and part-of-speech tagging. This page provides links to our freely available software along with a list of relevant publications.”
The page holds a collection of useful links. There are software links, beginning with their statistical Stanford Arabic Parser. There are also links to eight papers, in pdf form, that either directly discuss Arabic or use it as an experimental subject. Anyone interested in machine translation may want to bookmark this helpful resource.
Cynthia Murrell, February 03, 2014
January 22, 2014
Did you know that there was an open source version of ClearForest called Calais? Neither did we, until we read about it in the article posted on OpenCalais called “Calais: Connect. Everything.” Along with a short instructional video, there is a text explanation of how the software works. The OpenCalais Web Service automatically creates rich semantic metadata, using natural language processing, machine learning, and other methods to analyze submitted content. A list of tags is generated and returned to the user for review, and the user can then paste them onto other documents.
The metadata can be used in a variety of ways for improvement:
“The metadata gives you the ability to build maps (or graphs or networks) linking documents to people to companies to places to products to events to geographies to… whatever. You can use those maps to improve site navigation, provide contextual syndication, tag and organize your content, create structured folksonomies, filter and de-duplicate news feeds, or analyze content to see if it contains what you care about.”
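To make the “maps” idea concrete, here is a minimal Python sketch of linking documents through shared tags. It does not call the OpenCalais Web Service; the document ids, the tags, and the function names are all hypothetical and stand in for whatever metadata a tagging service might return.

```python
from collections import defaultdict

# Hypothetical tagging output: document id -> list of (entity type, entity name).
# These tags are invented for illustration; they are not real OpenCalais output.
tagged_docs = {
    "doc1": [("Company", "Thomson Reuters"), ("Person", "Jane Doe")],
    "doc2": [("Company", "Thomson Reuters"), ("City", "New York")],
    "doc3": [("Person", "Jane Doe"), ("City", "New York")],
}


def build_entity_index(docs):
    """Map each (type, name) entity to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index


def related_documents(doc_id, docs, index):
    """Return the other documents that share an entity with doc_id,
    keyed by document id, with the shared entities as values."""
    related = defaultdict(set)
    for tag in docs[doc_id]:
        for other in index[tag]:
            if other != doc_id:
                related[other].add(tag)
    return dict(related)


index = build_entity_index(tagged_docs)

# Which documents mention a given company?
print(sorted(index[("Company", "Thomson Reuters")]))

# Which documents are linked to doc1, and through which entities?
print(related_documents("doc1", tagged_docs, index))
```

An index like this is the sort of structure that could sit behind “related articles” navigation or duplicate filtering, though a production system would presumably need entity disambiguation that this sketch ignores.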
The OpenCalais Web Service relies on a dedicated community to keep making progress and pushing the application forward. Calais takes the same approach as other open source projects, except this one is powered by Thomson Reuters.
January 8, 2014
“IBM Struggles to turn Watson into Big Business” warrants a USA Today treatment. You can find the story in the hard copy of the newspaper on pages A1 and A2. I saw a link to the item online at http://on.wsj.com/1iShfOG but you may have to pay to read it or chase down a Penguin friendly instance of the article.
The main point is that IBM targeted $10 billion in Watson revenue by 2023. Watson has generated less than $100 million in revenue, I presume, since the system “won” the Jeopardy game show.
The Wall Street Journal article is interesting because it contains a number of semantic signals, for example:
- The use of the phrase “in a ditch” in reference to a project at the University of Texas M.D. Anderson Cancer Center
- The statement “Watson is having more trouble solving real-life problems”
- The revelation that “Watson doesn’t work with standard hardware”
- An allegedly accurate quote from a client that says “Watson initially took too long to learn”
- The assertion that “IBM reworked Watson’s training regimen”
- The sprinkling of “could’s” and “if’s”
I came away from the story with a sense of déjà vu. I realized that over the last 25 years I have heard similar information about other “smart” search systems. The themes run through time the way a bituminous coal seam threads through the crust of the earth. When one of these seams catches fire, there are few inexpensive and quick ways to put out the fire. Applied to Watson, my hunch is that the cost of getting Watson to generate $10 billion in revenue is going to be a very big number.
The Wall Street Journal story references the need for humans to learn and then to train Watson about the topic. When Watson goes off track, more humans have to correct Watson. I want to point out that training a smart system on a specific corpus of content is tricky. Algorithms can be quite sensitive to small errors in initial settings. Over time, the algorithms do their thing and wander. This translates to humans who have to monitor the smart system to make sure it does not output information in which it has generated confidence scores that are wrong or undifferentiated. The Wall Street Journal nudges this state of affairs in this passage:
In a recent visit to his [a Sloan Kettering oncologist] pulled out an iPad and showed a screen from Watson that listed three potential treatments. Watson was less than 32% confident that any of them were [sic] correct.
Then the Wall Street Journal reported that tweaking Watson was tough, saying:
The project initially ran awry because IBM’s engineers and Anderson’s doctors didn’t understand each other.
No surprise, but the fix just adds to the costs of the system. The article revealed:
IBM developers now meet with doctors several times a week.
Why is this Watson write up intriguing to me? There are four reasons:
First, the Wall Street Journal makes clear that dreams about dollars from search and content processing are easy to inflate and tough to deliver. Most search vendors and their stakeholders discover the difference between marketing hyperbole and reality.
Second, the Watson system is essentially dependent on human involvement. The objective of certain types of smart software is to reduce the need for human involvement. Watching Star Trek and Spock is not the same as delivering advanced systems that work and are affordable.
Third, the revenue generated by Watson is actually pretty good. Endeca hit $100 million between 1998 and 2011 when it was acquired by Oracle. Autonomy achieved $800 million between 1996 and 2011 when it was purchased by Hewlett Packard. Watson has been available for a couple of years. The problem is that the goal is, it appears, out of reach even for a company with IBM’s need for a hot new product and the resources to sell almost anything to large organizations.
Fourth, Watson is walking down the same path that STAIRS III, an early IBM search system, followed. IBM embraced open source to help reduce the cost of delivering basic search. Now IBM is finding that the value-adds are more difficult than key word matching and Boolean centric information retrieval. When a company does not learn from its own prior experiences in content processing, the voyage of discovery becomes more risky.
Net net: IBM has its hands full. I am confident that an azure chip consultant and a couple of 20 somethings can fix up Watson in a nonce. But if remediation is not possible, IBM may vie with Hewlett Packard as the pre-eminent example of the perils of the search and content processing business.
Stephen E Arnold, January 8, 2014
December 20, 2013
Partnerships develop when companies each possess a strength and then combine forces to build a beneficial relationship. The CogBlog, Cognition’s Semantic NLP Blog, announced a new relationship in the post, “Cognition To Power Grabbit’s Online Recommendation Engine.” Cognition is a leading name in semantic analysis and language processing, and Grabbit is the developer of a cloud-hosted suite of Web services. Together they have formed a strategic partnership that will combine Cognition’s natural language processing technology with Grabbit’s patent-pending system for making online recommendations of products, content, and people. The idea behind pairing the two technologies is that the semantic software would analyze social media content and Grabbit’s software would then make product recommendations based on the data.
The article states:
“Cognition provides a powerful set of semantic tools to power Grabbit’s new web services. The scope of Cognition’s Semantic Map is more than double the size of any other computational linguistic dictionary for English, and includes more than ten million semantic connections that are comprised of semantic contexts, meaning representations, taxonomy and word meaning distinctions. The Map encompasses over 540,000 word senses (word and phrase meanings); 75,000 concept classes (or synonym classes of word meanings); 8,000 nodes in the technology’s ontology or classification scheme; and 510,000 word stems (roots of words) for the English language. Cognition’s lexical resources encode a wealth of semantic, morphological and syntactic information about the words contained within documents and their relationships to each other. These resources were created, codified and reviewed by lexicographers and linguists over a span of more than 25 years.”
Why do I get the feeling that online shopping is going to get even more complicated? Personal qualms aside, Cognition and Grabbit are not the first companies that come to mind when it comes to social media analytics and e-commerce. This partnership is not the first endeavor to cash in on Internet sales.
Whitney Grace, December 20, 2013
December 11, 2013
I read “Natural language Processing in the Kitchen.” The post was particularly relevant because I had worked through “The Main Trick in Machine Learning.” The essay does an excellent job of explaining coefficients (what I call, for ease of recall, “thresholds”). The idea is that machine learning requires a human to make certain judgments. Autonomy IDOL uses Bayesian methods and the company has for many years urged licensees to “train” the IDOL system. Not only that, successful Bayesian systems, like a young child, have to be prodded or retrained. How much and how often depends on the child. For Bayesian-like systems, the “how often” and “how much” vary by the licensees’ content contexts.
Now back to the Los Angeles Times’ excellent article about indexing and classifying a small set of recipes. Here’s the quote to note:
Computers can really only do so much.
When one jots down the programming and tuning work required to index recipes, keep in mind “The Main Trick in Machine Learning.” There are three important lessons I draw from the boundary between these two write ups:
- Smart software requires programming and fiddling. At the present time (December 2013), this reality is as it has been for the last 50 years, maybe more.
- The humans fiddling with or setting up the content processing system have to be pretty darned clever. The notion of “user friendliness” is strongly disabused by these two articles. Flashy graphics and marketers’ cooing are not going to cut the mustard or the sirloin steak.
- The properly set up system with filtered information processed without some human intervention hits 98 percent accuracy. The main point is that relevance is a result of humans, software, and consistent, on point content.
How many enterprise search and content processing vendors explain that a failure to put appropriate resources toward the search or content processing implementation guarantees some interesting issues? Among them: systems will routinely deliver results that are not germane to the user’s query.
The roots of dissatisfaction with incumbent search and retrieval systems are not the systems themselves. In my opinion, most are quite similar, differing only in relatively minor details. (For examples of the similarity, review the reports at Xenky’s Vendor Profiles page.)
How many vendors have been excoriated because their customers failed to provide the cash, time, and support necessary to deliver a high-performance system? My hunch is that the vendors are held responsible for failures that are predestined by licensees’ desire to get the best deal possible and their belief that magic just happens without the difficult, human-centric work that is absolutely essential for success.
Stephen E Arnold, December 11, 2013
December 10, 2013
Natural language processing software is a boon to physicians who are required to keep immaculate documentation. Hispanic Business reports that the “Huntsman Cancer Institute uses Linguamatics I2E To Automatically Extract Insights From Clinical Pathology Documents.” The Huntsman Cancer Institute (HCI) is located at the University of Utah. By using the Linguamatics I2E natural language processing software, HCI will turn its unstructured data in EMRs into actionable information to conduct better research and seek new insights in cancer treatments and outcomes.
The article states:
“HCI is using Linguamatics I2E with its in-house clinical informatics infrastructure to extract discrete data from the unstructured text contained in surgical, pathology, radiology, and clinical notes related to hematology oncology disease areas such as Leukemia and Lymphoma. The resulting data is loaded into an integrated biobanking, clinical research, and genomic annotation platform. This enables HCI’s clinicians and principal investigators to harness the richest possible set of data for research into patient outcomes, comparative effectiveness, and genetic drivers of disease. Analysis at this scale can find information that would often be missed when reading documents one at a time. In addition HCI has a better range and quality of data to support clinical trial matching and increase numbers of patients on trials.”
There is a wealth of medical information available in unstructured data, and it is one of the biggest markets for big data. Medical professionals spend hours studying patient records. I2E gives medical professionals analytics that free up their time and improve both research processes and patient outcomes.
Whitney Grace, December 10, 2013
October 18, 2013
For those who know the open-source programming language Ruby, NLP is a script away. Sitepoint shares some basic techniques in, “Natural Language Processing with Ruby: N-Grams.” This first piece in a series begins at the beginning; developer Nathan Kleyn writes:
“Natural Language Processing (NLP for short) is the process of processing written dialect with a computer. The processing could be for anything – language modeling, sentiment analysis, question answering, relationship extraction, and much more. In this series, we’re going to look at methods for performing some basic and some more advanced NLP techniques on various forms of input data. One of the most basic techniques in NLP is n-gram analysis, which is what we’ll start with in this article!”
Kleyn explains his subject clearly, with plenty of code examples so we can see what’s going on. He goes into the following: what it means to split strings of characters into n-gram chunks; selecting a good data source (he sends readers to the comprehensive Brown Corpus); writing an n-gram class; extracting sentences from the Corpus; and, finally, n-gram analysis. The post includes links to the source code he uses in the article.
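For readers who do not work in Ruby, here is a minimal Python sketch of the n-gram idea. It is not Kleyn’s code: the `ngrams` function name and the sample sentence are our own assumptions, and a short made-up sentence stands in for the Brown Corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumed sample text; a real analysis would read sentences from a corpus.
sentence = "the quick brown fox jumps over the quick dog"
words = sentence.split()

# Word-level bigrams, counted to find the most frequent pair.
bigram_counts = Counter(ngrams(words, 2))
print(bigram_counts.most_common(1))  # [(('the', 'quick'), 2)]

# The same function works on characters: trigrams of a single word.
print(ngrams(list("ruby"), 3))  # [('r', 'u', 'b'), ('u', 'b', 'y')]
```

Working on a list of tokens rather than a raw string keeps the function indifferent to whether the units are words or characters, which is the distinction the article draws when it talks about splitting strings into chunks.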
In the next installment, Kleyn intends to explore Markov chaining, which uses probability to approximate language and generate “pseudo-random” text. This series may be just the thing for folks getting into, or considering, the natural language processing field.
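As a rough illustration of what Markov chaining involves, the Python sketch below builds a first-order chain from a tiny sample and walks it to produce text. This is our own assumption of the general technique, not a preview of Kleyn’s installment; the function names, the sample sentence, and the fixed random seed are all hypothetical.

```python
import random
from collections import defaultdict


def build_chain(words):
    """Map each word to the list of words seen immediately after it.
    Repeats are kept, so frequent followers are more likely to be picked."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Walk the chain from `start`, choosing each next word at random
    from the observed followers, for at most `length` words."""
    output = [start]
    for _ in range(length - 1):
        followers = chain.get(output[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        output.append(rng.choice(followers))
    return " ".join(output)


# Assumed sample text; real use would train on a large corpus.
text = "the cat sat on the mat and the cat ran off"
chain = build_chain(text.split())

rng = random.Random(0)  # seeded so repeated runs give the same text
print(generate(chain, "the", 8, rng))
```

Every adjacent pair of words in the output appeared somewhere in the sample, which is why the result tends to look locally plausible even when the whole sentence is nonsense.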
Cynthia Murrell, October 18, 2013
September 27, 2013
The article titled “Multimodal Natural Language Interface for Faceted Search” In Patent Application Approval Process on Hispanic Business reveals that inventors in California have applied for a patent on their natural language interface. The inventors are quoted in the article as claiming that the problem of users implementing a “successful query” revolves around an issue of transparency in the criteria of the search being held. The inventors, Farzad Ehsani and Silke Maren Witt-Ehsani, filed their patent application in February of 2013, and the application was made available online early in September of 2013. The article states,
“Solving this problem requires an interface that is natural for the user while producing validly formatted search queries that are sensitive to the structure of the data, and that gives the user an easy and natural method for identifying and modifying search criteria. Ideally, such a system should select an appropriate search engine and tailor its queries based upon the indexing system used by the search engine. Possessing this ability would allow more efficient, accurate and seamless retrieval of appropriate information.”
This quote from the inventors continues on to address the current methods which do not meet the expectations of users in terms of selecting the best search engine and data repository as well as not formulating the search query in the appropriate manner.
Chelsea Kerwin, September 27, 2013