December 11, 2013
I read “Natural Language Processing in the Kitchen.” The post was particularly relevant because I had worked through “The Main Trick in Machine Learning.” The essay does an excellent job of explaining coefficients (what I call, for ease of recall, “thresholds”). The idea is that machine learning requires a human to make certain judgments. Autonomy IDOL uses Bayesian methods, and the company has for many years urged licensees to “train” the IDOL system. Not only that: a successful Bayesian system, like a young child, has to be prodded or retrained. How much and how often depends on the child. For Bayesian-like systems, the “how often” and “how much” vary with the licensee’s content contexts.
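To make the coefficient-and-threshold idea concrete, here is a toy naive Bayes text classifier sketched in Python. It is not Autonomy IDOL’s implementation; the labels, training documents, and decision threshold below are all invented to show where the human judgment enters.

```python
from collections import Counter
from math import log

def train(docs):
    """Count word frequencies per label; these counts drive the 'coefficients'."""
    counts, totals = {}, Counter()
    for text, label in docs:
        counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[label][word] += 1
        totals[label] += 1
    return counts, totals

def score(text, label, counts, totals):
    """Log-probability of the text under one label (add-one smoothing)."""
    vocab = {w for c in counts.values() for w in c}
    n = sum(counts[label].values())
    s = log(totals[label] / sum(totals.values()))
    for word in text.lower().split():
        s += log((counts[label][word] + 1) / (n + len(vocab)))
    return s

def classify(text, counts, totals, threshold=0.0):
    """The threshold is a human judgment: how much more likely must
    'relevant' be before we act on it? Tuning it is the 'training' step."""
    margin = (score(text, "relevant", counts, totals)
              - score(text, "irrelevant", counts, totals))
    return "relevant" if margin > threshold else "irrelevant"

docs = [
    ("braise the beef with stock", "relevant"),
    ("whisk eggs and fold gently", "relevant"),
    ("quarterly earnings beat estimates", "irrelevant"),
    ("stock prices fell sharply", "irrelevant"),
]
counts, totals = train(docs)
print(classify("braise the stock slowly", counts, totals))
```

The `threshold` parameter is exactly the kind of knob a human has to set and then revisit as the licensee’s content drifts, which is why Bayesian systems need periodic retraining.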
Now back to the Los Angeles Times’ excellent article about indexing and classifying a small set of recipes. Here’s the quote to note:
Computers can really only do so much.
When one jots down the programming and tuning work required to index recipes, keep “The Main Trick in Machine Learning” in mind. There are three important lessons I draw from reading these two write-ups side by side:
- Smart software requires programming and fiddling. At the present time (December 2013), this reality is as it has been for the last 50 years, maybe more.
- The humans fiddling with or setting up the content processing system have to be pretty darned clever. The notion of “user friendliness” is firmly dispelled by these two articles. Flashy graphics and marketers’ cooing are not going to cut the mustard or the sirloin steak.
- A properly set up system processing filtered information can hit 98 percent accuracy without ongoing human intervention. The main point is that relevance is the result of humans, software, and consistent, on-point content.
How many enterprise search and content processing vendors explain that a failure to put appropriate resources toward the search or content processing implementation guarantees some interesting issues? Among them: systems will routinely deliver results that are not germane to the user’s query.
The roots of dissatisfaction with incumbent search and retrieval systems are not the systems themselves. In my opinion, most are quite similar, differing only in relatively minor details. (For examples of the similarity, review the reports at Xenky’s Vendor Profiles page.)
How many vendors have been excoriated because their customers failed to provide the cash, time, and support necessary to deliver a high-performance system? My hunch is that the vendors are held responsible for failures that are predestined by licensees’ desire to get the best deal possible and their belief that magic just happens without the difficult, human-centric work that is absolutely essential for success.
Stephen E Arnold, December 11, 2013
December 10, 2013
Natural language processing software is a boon to physicians who are required to keep immaculate documentation. Hispanic Business reports that the “Huntsman Cancer Institute uses Linguamatics I2E To Automatically Extract Insights From Clinical Pathology Documents.” The Huntsman Cancer Institute (HCI) is located at the University of Utah. By using the Linguamatics I2E natural language processing software, HCI will turn the unstructured data in its electronic medical records (EMRs) into actionable information to conduct better research and seek new insights into cancer treatments and outcomes.
The article states:
“HCI is using Linguamatics I2E with its in-house clinical informatics infrastructure to extract discrete data from the unstructured text contained in surgical, pathology, radiology, and clinical notes related to hematology oncology disease areas such as Leukemia and Lymphoma. The resulting data is loaded into an integrated biobanking, clinical research, and genomic annotation platform. This enables HCI’s clinicians and principal investigators to harness the richest possible set of data for research into patient outcomes, comparative effectiveness, and genetic drivers of disease. Analysis at this scale can find information that would often be missed when reading documents one at a time. In addition HCI has a better range and quality of data to support clinical trial matching and increase numbers of patients on trials.”
There is a wealth of medical information locked in unstructured data, and it is one of the biggest markets for big data. Medical professionals spend hours studying patient records. I2E gives them analytics that free up their time, improve research processes, and improve patient outcomes.
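As a rough illustration of what turning free-text clinical notes into discrete data involves, consider this toy extractor. It is pattern-based and invented for this post; Linguamatics I2E uses full linguistic analysis, and the field names and patterns below are assumptions, not I2E’s.

```python
import re

# Toy patterns; a real system like I2E uses full natural language
# processing, not regular expressions. Fields are illustrative only.
PATTERNS = {
    "diagnosis": re.compile(r"diagnosis[:\s]+([a-z ]+?)(?:\.|,|$)", re.I),
    "stage": re.compile(r"stage\s+(I{1,3}V?|IV)", re.I),
    "marker": re.compile(r"(CD\d+)\s+positive", re.I),
}

def extract(note):
    """Pull discrete fields out of free-text clinical prose."""
    record = {}
    for field, pattern in PATTERNS.items():
        m = pattern.search(note)
        if m:
            record[field] = m.group(1).strip()
    return record

note = ("Diagnosis: diffuse large B cell lymphoma. "
        "Stage III disease, CD20 positive by flow cytometry.")
print(extract(note))
```

The structured record that comes out is what makes large-scale analysis, biobanking, and trial matching possible; reading one note at a time, a human would miss these patterns across thousands of documents.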
Whitney Grace, December 10, 2013
October 18, 2013
For those who know the open-source programming language Ruby, NLP is a script away. Sitepoint shares some basic techniques in, “Natural Language Processing with Ruby: N-Grams.” This first piece in a series begins at the beginning; developer Nathan Kleyn writes:
“Natural Language Processing (NLP for short) is the process of processing written dialect with a computer. The processing could be for anything – language modeling, sentiment analysis, question answering, relationship extraction, and much more. In this series, we’re going to look at methods for performing some basic and some more advanced NLP techniques on various forms of input data. One of the most basic techniques in NLP is n-gram analysis, which is what we’ll start with in this article!”
Kleyn explains his subject clearly, with plenty of code examples so we can see what’s going on. He goes into the following: what it means to split strings of characters into n-gram chunks; selecting a good data source (he sends readers to the comprehensive Brown Corpus); writing an n-gram class; extracting sentences from the Corpus; and, finally, n-gram analysis. The post includes links to the source code he uses in the article.
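For readers who want the flavor without opening the Ruby source, the core n-gram step can be sketched in a few lines (shown here in Python rather than Kleyn’s Ruby):

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of size n across the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the quick brown fox jumps over the lazy dog".split()
bigrams = ngrams(sentence, 2)
print(bigrams[:3])  # the first three adjacent word pairs

# Counting n-gram frequencies is the usual next step in the analysis.
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(1))
```

In a real corpus such as the Brown Corpus, the frequency counts become meaningful: common bigrams reveal collocations and provide the statistics that later techniques build on.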
In the next installment, Kleyn intends to explore Markov chaining, which uses probability to approximate language and generate “pseudo-random” text. This series may be just the thing for folks getting into, or considering, the natural language processing field.
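The Markov chaining idea can also be sketched briefly, again in Python rather than Ruby. This toy version records which words follow which, then walks the table to generate pseudo-random text:

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=8, seed=0):
    """Walk the chain, picking a random observed successor at each step."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        successors = chain.get(words[-1])
        if not successors:
            break
        words.append(rng.choice(successors))
    return " ".join(words)

tokens = "the cat sat on the mat and the cat ran".split()
chain = build_chain(tokens)
print(generate(chain, "the"))
```

Every transition in the output occurs somewhere in the training text, which is why the result reads as plausible but ultimately meaningless “pseudo-random” prose.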
Cynthia Murrell, October 18, 2013
September 27, 2013
The article titled “Multimodal Natural Language Interface for Faceted Search” in Patent Application Approval Process on Hispanic Business reveals that inventors in California have applied for a patent on their natural language interface. The inventors claim that the problem of users formulating a “successful query” comes down to a lack of transparency in the search criteria. The inventors, Farzad Ehsani and Silke Maren Witt-Ehsani, filed their patent application in February 2013, and it was made available online in early September 2013. The article states,
“Solving this problem requires an interface that is natural for the user while producing validly formatted search queries that are sensitive to the structure of the data, and that gives the user an easy and natural method for identifying and modifying search criteria. Ideally, such a system should select an appropriate search engine and tailor its queries based upon the indexing system used by the search engine. Possessing this ability would allow more efficient, accurate and seamless retrieval of appropriate information.”
The inventors’ statement continues on to address current methods, which fail to meet users’ expectations both in selecting the best search engine and data repository and in formulating the search query in the appropriate manner.
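A toy sketch suggests what translating free text into facet-aware queries involves. The facet names and vocabulary below are invented for illustration; the patent describes inferring such structure from the search engine’s own indexing system.

```python
# Invented facet vocabulary; a real system would derive this from the
# target search engine's index rather than hard-code it.
FACET_VOCAB = {
    "color": {"red", "blue", "black"},
    "size": {"small", "medium", "large"},
    "brand": {"acme", "globex"},
}

def parse_query(text):
    """Split a natural language query into facet filters plus free keywords."""
    facets, keywords = {}, []
    for word in text.lower().split():
        for facet, values in FACET_VOCAB.items():
            if word in values:
                facets[facet] = word
                break
        else:
            keywords.append(word)
    return facets, keywords

print(parse_query("large red acme jacket"))
```

The output separates structured filters (size, color, brand) from residual keywords, which is the kind of validly formatted, structure-sensitive query the inventors say current interfaces fail to produce.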
Chelsea Kerwin, September 27, 2013
August 8, 2013
Projections and opportunities are often forecast for emerging technologies, and natural language processing is no exception. We took a look back at an article from earlier in the year posted on Semantic Web: “Looking Ahead to a User Experience Transformed by Conversational Interfaces and NLP.” According to this article, software that is able to understand human intention will play a vital role in transforming business processes and search technology.
IBM distinguished engineer Currie Boyle is quoted as stating the following:
This ecosystem change is happening in the industry…discussing the desire for business dialogue management systems to try to determine the intent of a user seeking information and the intent of the author who wrote it, and matching the two by that intent, even if they don’t share the same words in common to express it. The applications range from consumer conversational and context-aware systems to business professionals finding answers in structured or unstructured data via natural language interfaces to boosting call contact center performance with dialogue management.
Expert System solutions offer precise analytics using their core semantic search technologies. Their linguistic analysis capabilities enhance the extraction and application of data in the natural language interface.
Megan Feil, August 8, 2013
July 30, 2013
Enterprise organizations are increasingly loosening the leash on mobility and this is causing an emphasis on cross-device search. CMS Wire ran an article on the subject called, “Cross-Device Search: The Next Step in Mobile Search Delivery.” The author discusses the known issues with mobile search within the consumer sector and points to natural search interfaces as a remarkable technology that will be one of the building blocks of mobile search.
Both collaborative search and cross-device search stick out as technologies that many companies will begin needing to utilize more and more frequently.
The article does a good job summing up where mobile search delivery will begin:
In 2011 Greg Nudelman wrote Designing Search: UX Strategies for eCommerce Success, which has a strong mobile focus, and there is an excellent chapter on mobile search in the recent book Designing the Search Experience by Tony Russell-Rose and Tyler Tate. There is a consensus that natural search interfaces will be an important feature of mobile search design.
Natural search interfaces, or natural language interfaces (as others call this technology), are a vital piece of technology delivered by many innovative companies like Expert System. One of their solutions, Cogito Answers, uses a natural language interface to understand users’ intent and deliver information and answers quickly and accurately with a single click.
Megan Feil, July 30, 2013
July 28, 2013
I saw a flurry of links to a news release titled “New Patented Text Analytics Analytics Approach [sic]” about a text analytics package. The company receiving the patent is OdinText / Anderson Analytics. The company asserts that it provides a text analytics system for market research professionals. I was intrigued by an “analytics analytics approach.”
The news story describes US 8,475,498, “Natural Language Text Analytics.” The abstract states:
A method of text analytics includes filtering a plurality of unfiltered records having unstructured data into at least a first group and a second group. The first group and said second group each include at least two records and the first group is different than the second group. The method includes determining a first proportion of occurrence for a term by comparing a first number of records having at least one occurrence of the term in the first group to a first total number of records in the first group, determining a second proportion of occurrence for the term by comparing a second number of records having at least one occurrence of the term in said second group to a second total number of records in the second group, and comparing the first proportion of occurrence to the second proportion of occurrence to yield a resultant comparison occurrence.
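The abstract’s calculation is simple enough to sketch directly. The groups and terms below are invented; the method is just the claimed comparison of per-group occurrence proportions.

```python
def proportion_of_occurrence(records, term):
    """Fraction of records in a group containing at least one occurrence of the term."""
    hits = sum(1 for r in records if term in r.lower().split())
    return hits / len(records)

# Two filtered groups of records with unstructured text (invented data).
group_one = ["service was great", "great food", "slow service"]
group_two = ["terrible food", "great value", "food was cold", "cold coffee"]

p1 = proportion_of_occurrence(group_one, "great")  # 2 of 3 records
p2 = proportion_of_occurrence(group_two, "great")  # 1 of 4 records

# The "resultant comparison occurrence" of the claim, read here as a
# simple difference between the two proportions.
comparison = p1 - p2
print(round(comparison, 3))
```

Note that the claim counts records containing the term at least once, not total occurrences, which is why the helper tests membership rather than summing counts.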
Anderson Analytics’ Web site says:
We Focus on Getting Accurate and Relevant Data. Quality research starts with quality data, and the best answers come from well thought out questions. Whether we are working with internal business data or gathering primary research, we make sure that projects are of correct and sufficient scope to accurately address the business need.
I scanned the document and thought about Ramanathan Guha’s programmable search engine and context server invention; for example, US 8,316,040 and its related inventions from 2007 forward. The Guha system and method are quite different from the Odin/Anderson system and method.
If you are an NLP-savvy marketer, you may want to take a closer look at OdinText. The system “overcomes, alleviates, and/or mitigates one or more of the aforementioned [references a list of known NLP search problems] and other deleterious effects of prior art.”
Google and Dr. Guha, you may have some work to do.
Stephen E Arnold, July 28, 2013
Sponsored by Xenky
July 16, 2013
In Search Engine Journal we came across a recent article outlining two important topics in the search arena today, “The Difference Between Semantic Search and Semantic Web.” The post presents definitions for each and delves into the numerous distinctions between these terms.
Pulling from Cambridge Semantics, the article asserts that the Semantic Web is a set of technologies that store and query information, usually numbers and dates. Textual data is not typically stored in large quantities.
We thought their simple explanation of semantic search was a good starting point for those learning about the technology:
Semantic Search is the process of typing something into a search engine and getting more results than just those that feature the exact keyword you typed into the search box. Semantic search will take into account the context and meaning of your search terms. It’s about understanding the assumptions that the searcher is making when typing in that search query.
We also appreciate that the article refers to semantic search as a concept that is not new but is currently gaining much traction. Essentially, semantic search mirrors the process people use when reading: text is analyzed and context is developed so that a rich understanding emerges. Many innovative technologies are emerging from this concept. For example, solutions from Expert System offer precise analytics using their core semantic search technologies. Their linguistic analysis capabilities enhance the extraction and application of data in the natural language interface.
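The keyword-versus-meaning distinction can be shown with a deliberately naive sketch. The synonym table is hand-made and tiny; real semantic search engines model context and meaning far more richly than any lookup table can.

```python
# Minimal contrast: literal keyword matching vs. a (very) naive
# semantic expansion through an invented synonym table.
SYNONYMS = {
    "film": {"film", "movie", "picture"},
    "buy": {"buy", "purchase", "order"},
}

def expand(word):
    """Return the synonym group for a word, or the word itself."""
    for group in SYNONYMS.values():
        if word in group:
            return group
    return {word}

def keyword_match(query, doc):
    """Every query word must appear verbatim in the document."""
    return all(w in doc.lower().split() for w in query.lower().split())

def semantic_match(query, doc):
    """Every query word, or a synonym of it, must appear in the document."""
    words = set(doc.lower().split())
    return all(expand(w) & words for w in query.lower().split())

doc = "purchase this movie online"
print(keyword_match("buy film", doc), semantic_match("buy film", doc))
```

The keyword matcher misses the document entirely, while the expanded matcher finds it; that gap, scaled up to models of context and searcher assumptions, is the promise of semantic search.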
Megan Feil, July 16, 2013
May 29, 2013
“Why Are We Still Waiting for Natural Language Processing,” an article on The Chronicle of Higher Education, explores the 21st century’s failure, so far, to produce natural language processing, or NLP: the ability of computers to process natural human language. The steps required are explained in the article:
“In the 1980s I was convinced that computers would soon be able to simulate the basics of what (I hope) you are doing right now: processing sentences and determining their meanings.
To do this, computers would have to master three things. First, enough syntax to uniquely identify the sentence; second, enough semantics to extract its literal meaning; and third, enough pragmatics to infer the intent behind the utterance, and thus discerning what should be done or assumed given that it was uttered.”
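The three layers in the quote can be laid out schematically. The functions below are trivial stand-ins, invented to show the division of labor; each real stage is an open research problem that these few lines do not solve.

```python
def parse_syntax(utterance):
    """Syntax: recover sentence structure (here, just subject-verb-object)."""
    subject, verb, *rest = utterance.rstrip("?.").split()
    return {"subject": subject, "verb": verb, "object": " ".join(rest)}

def literal_semantics(parse):
    """Semantics: map the structure to a literal proposition."""
    return f"{parse['subject']} {parse['verb']} {parse['object']}"

def pragmatics(utterance, meaning):
    """Pragmatics: infer the intent behind the utterance."""
    return "answer the question" if utterance.endswith("?") else "record the fact"

utterance = "you sell umbrellas?"
parse = parse_syntax(utterance)
meaning = literal_semantics(parse)
print(pragmatics(utterance, meaning))
```

The pragmatics stage here keys off a single question mark; real conversation, as the article goes on to note, offers no such reliable signal, which is exactly where systems fall down.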
Currently, typing a question into Google can return exactly the opposite of the information you are seeking. This is because the system is unable to infer: natural conversation is full of gaps and assumptions that we are all trained to leap over without failing. According to the article, the one company that seemed to be coming close to this technology was Powerset in 2008. After its deal with Microsoft, however, the site now only redirects to Bing, a Google clone. Maybe NLP, like Big Data, business intelligence, and predictive analytics, is just a buzzword with marketing value.
Chelsea Kerwin, May 29, 2013
April 6, 2013
One of my two or three readers sent me a link to a LinkedIn post in the Information Access and Search Professionals section of the job hunting and consultant networking service. LinkedIn owns Slideshare (a hosting service for those who are comfortable communicating with presentations) and Pulse (an information aggregation service which plays the role of a selective dissemination of information service via a jazzy interface).
The posting which the reader wanted me to read was “How Natural Language Processing Will Change E Commerce Search Forever.” Now that is a bold statement. Most of the search systems we have tested feature facets, prediction, personalization, hit boosting for specials and deals, and near real time inventory updating.
The company posting the information put a version of the LinkedIn information on the Web at Inbenta.
The point of the information is to suggest that Inbenta can deliver more functionality which is backed by what is called “search to buy conversions.” In today’s economy, that’s catnip to many ecommerce site owners who—I presume—use Endeca, Exalead, SLI, and EasyAsk, among others.
I am okay with a vendor like Inbenta or any of the analytics hustlers asserting that one type of cheese is better than another. In France alone, there are more than 200 varieties and each has a “best”. When it comes to search, there is no easy way to do a tasting unless I can get my hands on the fungible Chevrotin.
Search, like cheese, has to be experienced, not talked about. A happy nibble to Alpes gourmet at http://www.alpesgourmet.com/fromage-savoie-vercors/1008.php
In the case of this Inbenta demonstration, I am enjoined to look at two sets of results from the Grainger.com site. The problem is I cannot read the screenshots. I am not able to determine if the present Grainger.com site is the one used for the “before” and “after” examples.
Next I am asked to look at queries from PCMall.com. Again, I could not read the screenshots. The write up says:
Again, the actual details of the search results are not important; just pay attention that both are very different. But in both cases, wasn’t what we searched basically the same thing? Why are the results so different?
The same approach was used to demonstrate that Amazon’s ecommerce search is doing some interesting things. Amazon is working on search at this time, and I think the company realizes that its system for ecommerce and for the hosted service leaves something out of the cookie recipe.
My view is that if a vendor wants to call attention to differences, perhaps these simple guidelines would eliminate the confusion and frustration I experience when I try to figure out what is going on, what is good and bad, and how the outputs differ:
First, provide a link to each of the systems so I can run the queries and look at the results myself. I did not buy into the Watson Jeopardy promotion because in television, magic takes place in some editing studios. Screenshots that I can neither read nor replicate open the door to similar suspicions.
Second, to communicate the “fix,” I need more than an empty data table. A list of options does not help me. We continue to struggle with systems which describe a “to be” future yet cannot deliver a “here and now” result. I had a long and winding call with an analytics vendor in Nashville, Tennessee, which followed a similar, abstract path in explaining what the company’s technology does. If one cannot show functionality, I don’t have time to listen to science fiction.
Third, the listing of high profile sites is useful for search engine optimization, but not for making crystal clear the whys and wherefores of a content processing system. Specific information is needed, please.
To wrap up, let me quote from the Inbenta essay:
By applying these techniques on e-commerce website search, we have accomplished the following results in the first few weeks.
- Increase in conversion ratio: +1.73%
- Increase in average purchase value: +11%
Okay, interesting numbers. What is the factual foundation of them? What method was used to calculate the deltas? What was the historical base of the specific sites in the sample?
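One reason the numbers are hard to interpret: a delta like “+1.73%” means different things depending on whether it is absolute percentage points or a relative lift over the baseline. With an invented 2 percent baseline (Inbenta discloses none), the two readings diverge sharply:

```python
# All figures below are invented to illustrate the ambiguity; Inbenta's
# release gives no baseline, so neither reading can be checked.
baseline_rate = 0.020    # hypothetical: 2.0% of visitors converted before
after_rate = 0.0373      # hypothetical: if the "+1.73%" is absolute points

absolute_delta = after_rate - baseline_rate            # 1.73 points
relative_lift = absolute_delta / baseline_rate * 100   # percent over baseline

print(f"absolute: +{absolute_delta * 100:.2f} points")
print(f"relative: +{relative_lift:.1f}%")
```

An absolute gain of 1.73 points on a 2 percent baseline would be a near-doubling of conversions; a relative lift of 1.73 percent would be barely noticeable. Without the baseline and the calculation method, the reported figure supports either story.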
In a world in which vendors and their pet consultants jump forward with predictions, assertions, and announcements of breakthroughs, some simple facts can be quite helpful. I am okay with self-promotion, but when asked to examine comparisons, I have to be able to run the queries myself. Without that important step, I am skeptical, just as I was with the sci-fi fancies of the folks who put marketing before substance.
Stephen E Arnold, April 6, 2013
Sponsored by Augmentext