OpenCalais Has Big Profile Users

April 2, 2014

OpenCalais is an open source project that creates rich semantic data by applying natural language processing and other analytical methods through a Web service interface. That is a simple explanation of a powerful piece of software. OpenCalais was originally part of ClearForest, which Thomson Reuters acquired in 2007. Instead of marketing OpenCalais as proprietary software, Reuters allowed it to remain open. OpenCalais has since become valued open source metadata software, used on everything from blogs to specialized museum collections.
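As a sketch of what submitting content to such a Web service looks like, the snippet below builds an HTTP POST request in Python. The endpoint URL, API key, and header names here are invented placeholders; the actual OpenCalais service defines its own URL, authentication scheme, and response format.

```python
import urllib.request

# Placeholder endpoint and header names for illustration only; the
# real OpenCalais service defines its own URL, auth header, and formats.
ENDPOINT = "https://api.example.com/extract"
API_KEY = "YOUR-API-KEY"

def build_tagging_request(text: str) -> urllib.request.Request:
    """Package a document for submission to an entity-tagging Web service."""
    return urllib.request.Request(
        ENDPOINT,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "X-Api-Key": API_KEY,  # assumed authentication header name
            "Accept": "application/json",
        },
        method="POST",
    )

req = build_tagging_request("Thomson Reuters acquired ClearForest in 2007.")
# urllib.request.urlopen(req) would submit the text and return the
# service's semantic tags (entities, facts, events) for review.
```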

There are many notable users of OpenCalais, and a sample can be found in “The List Of OpenCalais Implementations Grows.”

OpenCalais is excited about the new additions to the list:

“Add 10 to the list of innovative sites and services that use OpenCalais to reduce costs, deliver compelling content experiences and mine the social web for insight. See our press release for more details on each. We are thrilled to recognize the following new sites and services that are changing the way we engage with news and the social Web. They join a growing number of others in media, publishing, blogging, and news aggregation who use OpenCalais.”

Among them are The New Republic, Al Jazeera’s English blogging news networks, Slate Magazine’s blogging network, and I*heart* Sea. Not only do news Web sites use OpenCalais, but news aggregation apps do as well, including Feedly, DocumentCloud, and OpenPublish. Expect the list to grow even longer and consider OpenCalais for your own metadata solution.

Whitney Grace, April 02, 2014
Sponsored by, developer of Augmentext

IBM Watson: Now a Foodie

March 17, 2014

One of my two or three readers sent me a link to “IBM’s New Food Truck Uses a Supercomputer to Dream Up All Their ‘Surprising’ Recipes.” For a collection of code wrappers and Lucene, Watson is a versatile information processing system. Instead of an online demo of Web indexing, I learned about “surprising recipes.”

The initiative to boost Watson toward its $10 billion revenue goal involves the Institute of Culinary Education. The idea is that IBM and ICE deliver “computational creativity” to create new recipes. Julia Child would probably have resisted computerizing her food activities. Her other, less well-known activities would have eagerly accepted Watson’s inputs.

The article quotes IBM as saying:

“Creating a recipe for a novel and flavorful meal is the result of a system that generates millions of ideas out of the quintillions of possibilities,” IBM writes. “And then predicts which ones are the most surprising and pleasant, applying big data in new ways.”

The article even includes a video. Apparently the truck made an appearance at South by Southwest. From my cursory research, the Watson truck was smart enough to be elsewhere when the allegedly inebriated driver struck attendees near the pivot point of Austin’s nightlife.

The IBM marketing professionals are definitely clear-headed and destined for fame as the food truck gnaws its way into the $10 billion revenue objective. Did IBM researchers ask Watson if this was an optimal use of its computational capabilities? Did Watson contribute to the new Taco Bell loaded beefy nacho grillers? ¡Ay, caramba!

Stephen E Arnold, March 17, 2014

A Roundup Of NLP

March 6, 2014

If you are currently conducting research on natural language processing software but have run short of resources, we located Connexor’s “NLP Library.” Connexor is a company that develops text analysis software components, solutions, and services. They are experts in their line of work and are keen to help people utilize their data to its full extent. Connexor explains that:

“Connexor components have turned out to be necessary in many types of software products and solutions that need linguistic intelligence in text analytics tasks. We work with software houses, service providers, system integrators, resellers and research labs, in the fields of education, health, security, business and administration. We have customers and partners in over 30 countries.”

The company’s NLP Library includes bibliographic citations for articles, which we can assume Connexor employees wrote. They cover a variety of subjects dealing with natural language processing and text evaluation, and even touch on emotion extraction from text. These articles are a handy resource, especially if you need up-to-date research. There is only one article for 2014, but the year is still young and more are probably on the way.

Whitney Grace, March 06, 2014
Sponsored by, developer of Augmentext

Digital Reasoning and Paragon Science Promote Natural Language Processing and Graph Analysis

February 12, 2014

The presentation on SlideShare titled “Got Chaos? Extracting Business Intelligence from Email with Natural Language Processing and Dynamic Graph Analysis” discusses the work by Digital Reasoning and Paragon Science. Digital Reasoning asserts that it is an oracle for human language data. Color-coded sentences illustrate the abilities of natural language processing, from recognizing people and location words to entities related to a single concept and associated entities. The presentation consists of many equations, but the overview explains,

“In this presentation, O’Reilly author and Digital Reasoning CTO Matthew Russell along with Dr. Steve Kramer, founder and chief scientist at Paragon Science, discuss how Digital Reasoning processed the Enron corpus with its advanced Natural Language Processing (NLP) technology – effectively transforming it into building blocks that are viable for data science. Then, Paragon Science used dynamic graph analysis inspired from particle physics to tease out insights from the data.”

Ultimately the point of the entire process was to gain a better understanding of how the Enron catastrophe could be avoided in other enterprises. It is difficult to say whether Digital Reasoning is imitating IBM Watson or IBM Watson is imitating Digital Reasoning. At any rate, it sounds familiar; didn’t Autonomy, TeraText, and other firms push into this sector decades ago?

Chelsea Kerwin, February 12, 2014

Sponsored by, developer of Augmentext

Stanford NLP Group Tackles Arabic Machine Translation

February 3, 2014

Machine translation can be a wonderful thing, but one key language has garnered less consideration than other widely used languages. Though both Google and Babylon have made good progress [pdf] on Arabic translation, folks at The Stanford Natural Language Processing Group know there is plenty of room for improvement. These scientists are working to close that gap with their Arabic Natural Language Processing project.

The page’s overview tells us:

“Arabic is the largest member of the Semitic language family and is spoken by nearly 500 million people worldwide. It is one of the six official UN languages. Despite its cultural, religious, and political significance, Arabic has received comparatively little attention by modern computational linguistics. We are remedying this oversight by developing tools and techniques that deliver state-of-the-art performance in a variety of language processing tasks. Machine translation is our most active area of research, but we have also worked on statistical parsing and part-of-speech tagging. This page provides links to our freely available software along with a list of relevant publications.”

The page holds a collection of useful links. There are software links, beginning with their statistical Stanford Arabic Parser. There are also links to eight papers, in pdf form, that either directly discuss Arabic or use it as an experimental subject. Anyone interested in machine translation may want to bookmark this helpful resource.

Cynthia Murrell, February 03, 2014

Sponsored by, developer of Augmentext

Learn About the Open Source Alternative to ClearForest

January 22, 2014

Did you know that there was an open source version of ClearForest called Calais? Neither did we, until we read about it in the article posted on OpenCalais called “Calais: Connect. Everything.” Along with a short instructional video is a text explanation of how the software works. The OpenCalais Web Service automatically creates rich semantic metadata, using natural language processing, machine learning, and other methods to analyze submitted content. A list of tags is generated and returned to the user for review, and then the user can paste them onto other documents.

The metadata can be used in a variety of ways for improvement:

“The metadata gives you the ability to build maps (or graphs or networks) linking documents to people to companies to places to products to events to geographies to… whatever. You can use those maps to improve site navigation, provide contextual syndication, tag and organize your content, create structured folksonomies, filter and de-duplicate news feeds, or analyze content to see if it contains what you care about.”
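That “maps linking documents to people to companies” idea can be sketched with nothing more than Python dictionaries. The document tags below are invented for illustration; a real deployment would use the tags the Web service returns.

```python
from collections import defaultdict
from itertools import combinations

# Invented sample: documents with the entity tags a tagging
# service might return for each one.
docs = {
    "doc1": {"Thomson Reuters", "ClearForest", "New York"},
    "doc2": {"Thomson Reuters", "OpenCalais"},
    "doc3": {"ClearForest", "OpenCalais", "New York"},
}

# Inverted index: entity -> documents mentioning it
# (useful for navigation, filtering, and de-duplication).
entity_to_docs = defaultdict(set)
for doc_id, entities in docs.items():
    for entity in entities:
        entity_to_docs[entity].add(doc_id)

# Co-occurrence graph: how often two entities appear in the same document.
edge_weights = defaultdict(int)
for entities in docs.values():
    for a, b in combinations(sorted(entities), 2):
        edge_weights[(a, b)] += 1

print(entity_to_docs["OpenCalais"])   # documents tagged with OpenCalais
print(edge_weights[("ClearForest", "Thomson Reuters")])
```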

The OpenCalais Web Service relies on a dedicated community to keep making progress and pushing the application forward. Calais takes the same approach as other open source projects, except this one is powered by Thomson Reuters.

Whitney Grace, January 22, 2014
Sponsored by, developer of Augmentext

IBM Wrestling with Watson

January 8, 2014

“IBM Struggles to Turn Watson into Big Business” warrants a USA Today treatment. You can find the story in the hard copy of the newspaper on pages A1 and A2. I saw a link to the item online, but you may have to pay to read it or chase down a Penguin-friendly instance of the article.

The main point is that IBM targeted $10 billion in Watson revenue by 2023. Watson has generated less than $100 million in revenue, I presume, since the system “won” the Jeopardy game show.

The Wall Street Journal article is interesting because it contains a number of semantic signals, for example:

  • The use of the phrase “in a ditch” in reference to a project at the University of Texas M.D. Anderson Cancer Center
  • The statement “Watson is having more trouble solving real-life problems”
  • The revelation that “Watson doesn’t work with standard hardware”
  • An allegedly accurate quote from a client that says “Watson initially took too long to learn”
  • The assertion that “IBM reworked Watson’s training regimen”
  • The sprinkling of “coulds” and “ifs”

I came away from the story with a sense of déjà vu. I realized that over the last 25 years I have heard similar information about other “smart” search systems. The themes run through time the way a bituminous coal seam threads through the crust of the earth. When one of these seams catches fire, there are few inexpensive and quick ways to put out the fire. Applied to Watson, my hunch is that the cost of getting Watson to generate $10 billion in revenue is going to be a very big number.

The Wall Street Journal story references the need for humans to learn and then to train Watson about the topic. When Watson goes off track, more humans have to correct Watson. I want to point out that training a smart system on a specific corpus of content is tricky. Algorithms can be quite sensitive to small errors in initial settings. Over time, the algorithms do their thing and wander. This translates to humans who have to monitor the smart system to make sure it does not output information in which it has generated confidence scores that are wrong or undifferentiated. The Wall Street Journal nudges this state of affairs in this passage:

In a recent visit, [a Sloan Kettering oncologist] pulled out an iPad and showed a screen from Watson that listed three potential treatments. Watson was less than 32% confident that any of them were [sic] correct.
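The human oversight described here is often implemented as a confidence gate: anything the system scores below a threshold goes back to a person. The snippet below is a generic Python sketch of that pattern (not IBM’s actual mechanism), with invented example data echoing the three low-confidence treatment suggestions.

```python
def triage(predictions, threshold=0.7):
    """Split model outputs into auto-accepted and human-review queues.

    `predictions` is a list of (answer, confidence) pairs. Anything
    scored below `threshold` is routed to a human reviewer, which is
    the kind of ongoing oversight the article describes.
    """
    accepted, review = [], []
    for answer, confidence in predictions:
        (accepted if confidence >= threshold else review).append(answer)
    return accepted, review

# Invented example: three suggestions, none above 32% confidence.
suggestions = [("treatment A", 0.31), ("treatment B", 0.22), ("treatment C", 0.18)]
accepted, review = triage(suggestions)
# All three fall below the threshold, so all three need human review.
```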

Then the Wall Street Journal reported that tweaking Watson was tough, saying:

The project initially ran awry because IBM’s engineers and Anderson’s doctors didn’t understand each other.

No surprise, but the fix just adds to the costs of the system. The article revealed:

IBM developers now meet with doctors several times a week.

Why is this Watson write up intriguing to me? There are four reasons:

First, the Wall Street Journal makes clear that dreams about dollars from search and content processing are easy to inflate and tough to deliver. Most search vendors and their stakeholders discover the difference between marketing hyperbole and reality.

Second, the Watson system is essentially dependent on human involvement. The objective of certain types of smart software is to reduce the need for human involvement. Watching Star Trek and Spock is not the same as delivering advanced systems that work and are affordable.

Third, the revenue generated by Watson is actually pretty good. Endeca hit $100 million between 1998 and 2011 when it was acquired by Oracle. Autonomy achieved $800 million between 1996 and 2011 when it was purchased by Hewlett Packard. Watson has been available for a couple of years. The problem is that the goal is, it appears, out of reach even for a company with IBM’s need for a hot new product and the resources to sell almost anything to large organizations.

Fourth, Watson is walking down the same path that STAIRS III, an early IBM search system, followed. IBM embraced open source to help reduce the cost of delivering basic search. Now IBM is finding that the value-adds are more difficult than keyword matching and Boolean-centric information retrieval. When a company does not learn from its own prior experiences in content processing, the voyage of discovery becomes more risky.

Net net: IBM has its hands full. I am confident that an azure chip consultant and a couple of 20-somethings can fix up Watson in a nonce. But if remediation is not possible, IBM may vie with Hewlett Packard as the pre-eminent example of the perils of the search and content processing business.

Stephen E Arnold, January 8, 2014

Grabbing onto a Partnership

December 20, 2013

Partnerships develop when companies each possess a strength and combine forces to build a beneficial relationship. The CogBlog, Cognition’s Semantic NLP Blog, announced a new relationship in the post “Cognition To Power Grabbit’s Online Recommendation Engine.” Cognition is a leading name in semantic analysis and language processing, and Grabbit is the developer of a cloud-hosted suite of Web services. Together they have formed a strategic partnership that will combine Cognition’s natural language processing technology with Grabbit’s patent-pending system for making online recommendations of products, content, and people. The idea behind pairing the two technologies is that the semantic software analyzes social media content and Grabbit’s software then makes product recommendations based on the data.

The article states:

“Cognition provides a powerful set of semantic tools to power Grabbit’s new web services. The scope of Cognition’s Semantic Map is more than double the size of any other computational linguistic dictionary for English, and includes more than ten million semantic connections that are comprised of semantic contexts, meaning representations, taxonomy and word meaning distinctions. The Map encompasses over 540,000 word senses (word and phrase meanings); 75,000 concept classes (or synonym classes of word meanings); 8,000 nodes in the technology’s ontology or classification scheme; and 510,000 word stems (roots of words) for the English language. Cognition’s lexical resources encode a wealth of semantic, morphological and syntactic information about the words contained within documents and their relationships to each other. These resources were created, codified and reviewed by lexicographers and linguists over a span of more than 25 years.”

Why do I get the feeling that online shopping is going to get even more complicated? Personal qualms aside, Cognition and Grabbit are not the first companies that come to mind when it comes to social media analytics and e-commerce. This partnership is not the first endeavor to cash in on Internet sales.

Whitney Grace, December 20, 2013

Sponsored by, developer of Augmentext

Quote to Note: NLP and Recipes for Success and Failure

December 11, 2013

I read “Natural Language Processing in the Kitchen.” The post was particularly relevant because I had worked through “The Main Trick in Machine Learning.” The essay does an excellent job of explaining coefficients (what I call, for ease of recall, “thresholds”). The idea is that machine learning requires a human to make certain judgments. Autonomy IDOL uses Bayesian methods, and the company has for many years urged licensees to “train” the IDOL system. Not only that, successful Bayesian systems, like a young child, have to be prodded or retrained. How much and how often depends on the child. For Bayesian-like systems, the “how often” and “how much” vary with the licensee’s content contexts.
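The “train, then retrain” idea behind Bayesian systems can be shown with a toy classifier built from scratch. The recipes and labels below are invented for illustration; the smoothed word counts play the role of the coefficients, and feeding the system more (and better) examples is exactly the human training work described above.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """A minimal Bayesian text classifier. The 'coefficients' are just
    smoothed per-label word counts learned from training examples."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.label_counts:
            # log prior + log likelihood with add-one (Laplace) smoothing
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy corpus: two recipes, two labels.
nb = TinyNaiveBayes()
nb.train("flour sugar butter eggs vanilla", "dessert")
nb.train("chicken garlic onion stock", "entree")
print(nb.classify("sugar vanilla eggs"))  # -> dessert
```

With only two training examples, the classifier is fragile; as the post notes, someone has to keep prodding and retraining it as the content changes.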

Now back to the Los Angeles Times’ excellent article about indexing and classifying a small set of recipes. Here’s the quote to note:

Computers can really only do so much.

When one jots down the programming and tuning work required to index recipes, keep in mind “The Main Trick in Machine Learning.” There are three important lessons I draw from the boundary between these two write-ups:

  1. Smart software requires programming and fiddling. At the present time (December 2013), this reality is as it has been for the last 50 years, maybe more.
  2. The humans fiddling with or setting up the content processing system have to be pretty darned clever. The notion of “user friendliness” is strongly disabused by these two articles. Flashy graphics and marketers’ cooing are not going to cut the mustard or the sirloin steak.
  3. A properly set up system processing filtered information does not hit 98 percent accuracy without some human intervention. The main point is that relevance is a result of humans, software, and consistent, on-point content.

How many enterprise search and content processing vendors explain that a failure to put appropriate resources toward the search or content processing implementation guarantees some interesting issues? Among them: systems will routinely deliver results that are not germane to the user’s query.

The roots of dissatisfaction with incumbent search and retrieval systems are not the systems themselves. In my opinion, most are quite similar, differing only in relatively minor details. (For examples of the similarity, review the reports at Xenky’s Vendor Profiles page.)

How many vendors have been excoriated because their customers failed to provide the cash, time, and support necessary to deliver a high-performance system? My hunch is that the vendors are held responsible for failures that are predestined by licensees’ desire to get the best deal possible and believe that magic just happens without the difficult, human-centric work that is absolutely essential for success.

Stephen E Arnold, December 11, 2013

Language Software Joins Battle Against Cancer

December 10, 2013

Natural language processing software is a boon to physicians who are required to keep immaculate documentation. Hispanic Business reports that the “Huntsman Cancer Institute Uses Linguamatics I2E To Automatically Extract Insights From Clinical Pathology Documents.” The Huntsman Cancer Institute (HCI) is located at the University of Utah. By using the Linguamatics I2E natural language processing software, HCI will turn the unstructured data in its EMRs into actionable information to conduct better research and seek new insights into cancer treatments and outcomes.

The article states:

“HCI is using Linguamatics I2E with its in-house clinical informatics infrastructure to extract discrete data from the unstructured text contained in surgical, pathology, radiology, and clinical notes related to hematology oncology disease areas such as Leukemia and Lymphoma. The resulting data is loaded into an integrated biobanking, clinical research, and genomic annotation platform. This enables HCI’s clinicians and principal investigators to harness the richest possible set of data for research into patient outcomes, comparative effectiveness, and genetic drivers of disease. Analysis at this scale can find information that would often be missed when reading documents one at a time. In addition HCI has a better range and quality of data to support clinical trial matching and increase numbers of patients on trials.”
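The unstructured-to-structured step described in the quote can be illustrated in a few lines of Python. To be clear, I2E uses full natural language processing, not the toy regular expressions below, and the sample note is invented; this only shows the idea of turning free text into discrete fields.

```python
import re

# Invented sample note; real clinical text is far messier.
note = """Diagnosis: Diffuse large B-cell lymphoma.
Specimen: lymph node, left axilla.
Ki-67 proliferation index: 85%."""

# Toy patterns that pull discrete fields out of unstructured prose.
patterns = {
    "diagnosis": r"Diagnosis:\s*(.+?)\.",
    "specimen": r"Specimen:\s*(.+?)\.",
    "ki67_percent": r"Ki-67 proliferation index:\s*(\d+)%",
}

# Build a structured record ready to load into a research database.
record = {}
for field, pattern in patterns.items():
    match = re.search(pattern, note)
    if match:
        record[field] = match.group(1)

print(record)
```

Run over thousands of notes at once, this kind of extraction surfaces patterns that, as the quote says, “would often be missed when reading documents one at a time.”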

There is a wealth of medical information locked in unstructured data, and it is one of the biggest markets for big data. Medical professionals spend hours studying patient records. I2E gives them analytics that free their time, improve research processes, and improve patient outcomes.

Whitney Grace, December 10, 2013

Sponsored by, developer of Augmentext
