October 18, 2013
For those who know the open-source programming language Ruby, NLP is a script away. SitePoint shares some basic techniques in “Natural Language Processing with Ruby: N-Grams.” This first piece in a series begins at the beginning; developer Nathan Kleyn writes:
“Natural Language Processing (NLP for short) is the process of processing written dialect with a computer. The processing could be for anything – language modeling, sentiment analysis, question answering, relationship extraction, and much more. In this series, we’re going to look at methods for performing some basic and some more advanced NLP techniques on various forms of input data. One of the most basic techniques in NLP is n-gram analysis, which is what we’ll start with in this article!”
Kleyn explains his subject clearly, with plenty of code examples so we can see what’s going on. He covers what it means to split strings of characters into n-gram chunks; selecting a good data source (he sends readers to the comprehensive Brown Corpus); writing an n-gram class; extracting sentences from the corpus; and, finally, n-gram analysis. The post includes links to the source code he uses in the article.
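For those who want the flavor before clicking through, here is a minimal Ruby sketch of the core idea, ours rather than Kleyn’s actual class:

```ruby
# Split a token list into overlapping n-gram chunks.
# Enumerable#each_cons yields every consecutive run of n elements.
def ngrams(tokens, n)
  tokens.each_cons(n).to_a
end

tokens = "the quick brown fox jumps".split
p ngrams(tokens, 2)
# => [["the", "quick"], ["quick", "brown"], ["brown", "fox"], ["fox", "jumps"]]
```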
In the next installment, Kleyn intends to explore Markov chaining, which uses probability to approximate language and generate “pseudo-random” text. This series may be just the thing for folks getting into, or considering, the natural language processing field.
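Markov chaining sounds exotic, but the mechanics fit in a few lines. The sketch below is our own illustration, not Kleyn’s forthcoming code:

```ruby
# Map each word to the words observed after it, then walk the table,
# picking successors at random; duplicates in the lists supply the
# probabilities.
def build_chain(tokens)
  chain = Hash.new { |hash, key| hash[key] = [] }
  tokens.each_cons(2) { |a, b| chain[a] << b }
  chain
end

def generate(chain, start_word, max_words)
  word = start_word
  output = [word]
  (max_words - 1).times do
    successors = chain[word]
    break if successors.empty?
    word = successors.sample
    output << word
  end
  output.join(" ")
end

corpus = "the cat sat on the mat and the cat slept".split
puts generate(build_chain(corpus), "the", 8)
```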
Cynthia Murrell, October 18, 2013
September 27, 2013
The article “‘Multimodal Natural Language Interface for Faceted Search’ in Patent Application Approval Process,” on Hispanic Business, reveals that inventors in California have applied for a patent on their natural language interface. The inventors are quoted as claiming that the obstacle to a “successful query” is a lack of transparency about the criteria a search actually applies. The inventors, Farzad Ehsani and Silke Maren Witt-Ehsani, filed their patent application in February of 2013, and the patent was made available online early in September of 2013. The article states,
“Solving this problem requires an interface that is natural for the user while producing validly formatted search queries that are sensitive to the structure of the data, and that gives the user an easy and natural method for identifying and modifying search criteria. Ideally, such a system should select an appropriate search engine and tailor its queries based upon the indexing system used by the search engine. Possessing this ability would allow more efficient, accurate and seamless retrieval of appropriate information.”
The inventors go on to say that current methods fall short of user expectations: they neither select the best search engine and data repository nor formulate the search query in the appropriate manner.
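The application’s actual mechanism is not spelled out in the article, so the Ruby fragment below is only our toy illustration of the underlying problem: turning free text into a structured, facet-aware query. Every facet name here is invented for the example:

```ruby
# Hypothetical mapping from a natural-language request to search
# facets; a real system would draw facet names from the data source's
# own indexing scheme.
def to_faceted_query(text)
  facets = {}
  facets[:color]     = $1.downcase if text =~ /\b(red|blue|black)\b/i
  facets[:max_price] = $1.to_i     if text =~ /under \$?(\d+)/i
  facets[:category]  = "shoes"     if text =~ /\bshoes?\b/i
  facets
end

p to_faceted_query("Red running shoes under $80")
# => {:color=>"red", :max_price=>80, :category=>"shoes"}
```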
Chelsea Kerwin, September 27, 2013
May 29, 2013
“Why Are We Still Waiting for Natural Language Processing?” an article in The Chronicle of Higher Education, explores the 21st century’s failure so far to produce natural language processing, or NLP: the ability of computers to process natural human language. The article explains the steps required,
“In the 1980s I was convinced that computers would soon be able to simulate the basics of what (I hope) you are doing right now: processing sentences and determining their meanings.
To do this, computers would have to master three things. First, enough syntax to uniquely identify the sentence; second, enough semantics to extract its literal meaning; and third, enough pragmatics to infer the intent behind the utterance, and thus discerning what should be done or assumed given that it was uttered.”
Currently, typing a question into Google can yield exactly the opposite of the information you are seeking. This is because the engine cannot infer: natural conversation is full of gaps and assumptions that humans are trained to leap across without failing. According to the article, the one company that seemed to be coming close to this technology was Powerset, in 2008. After a deal with Microsoft, however, its site now simply redirects to Bing, a Google clone. Maybe NLP, like Big Data, business intelligence, and predictive analytics, is just a buzzword with marketing value.
Chelsea Kerwin, May 29, 2013
November 21, 2012
I continue to learn about companies with high-value content processing technologies. The challenge in real-time translation, if one believes the Google marketing, is now in “game over” mode. The winner, of course, is Google. Other firms can head to the showers and maybe think about competing in another business sector.
But some of that Google confidence may be based on assumptions about Google’s language processing expertise, not more recent systems and methods. I know. This is “burn at the stake” information to a Googler.
However, I saw a demonstration which made clear to me that Google’s “kitchen sink” approach to handling speech input and near real-time translation may not be in step with other firms’ approaches. The company with some quite interesting translation technology and a commitment to easy integration is IMT Holdings. The privately held company’s product is Rosoka.
IMT Holdings, Corp. was founded in 2007; the firm’s background is in US government contracting. In the course of that work, co-founder Mike Sorah saw that existing natural language processing (NLP) tools were not able to handle the volumes and complexities of the data the firm needed to process. In December of 2011, IMT began actively marketing its NLP technology.
I was able after some telephone tag and email to interview Mike Sorah, one of the co-founders of IMT and one of the wizards behind the Rosoka technology.
Mr. Sorah told me:
Many of the existing NLP tools claim to be multilingual, but what they mean is that they have linguistic knowledge bases usually acquired from vendors who provide dictionaries and libraries that make NLP an issue for many licensees. But most of the NLP systems don’t process documents that contain English and Chinese or English and Spanish. In the world of our clients, mixed language documents are important. These have to be processed as part of the normal stream, not put in an exception folder and maybe never processed or processed after a delay of hours or days.
The Rosoka system is different from other NLP and translation systems on the market at this time. He asserted:
In most multilingual NLP systems, the customer needs to know before they process the document what language the document is so they can load the appropriate language-specific knowledge base. What we did via our proprietary Rosoka algorithms was to take a multilingual look at the world. Our system automatically understands that a document may be in English or Chinese, or even English and Spanish mixed. The language angle is huge. We randomly sample the Twitter stream and have been tweeting what the top 10 languages of the week are. English varies between 35 and 45 percent of the tweets. Every language that Rosoka can process is included. Our multilingual support is not sold as separate, add-on functionality.
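Rosoka’s algorithms are proprietary, so the Ruby fragment below is merely our own crude illustration of the problem Mr. Sorah describes: tagging tokens by Unicode script instead of assuming one language per document:

```ruby
# Naive per-token script detection. A real system such as Rosoka goes
# far beyond this, but it shows why loading a single language-specific
# knowledge base fails on mixed documents.
def script_of(token)
  case token
  when /\p{Han}/      then :han
  when /\p{Cyrillic}/ then :cyrillic
  else                     :latin
  end
end

"Meeting with the 北京 office re Москва".split.each do |token|
  puts "#{token} -> #{script_of(token)}"
end
```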
You can read the full text of the interview with Mike Sorah in the ArnoldIT.com Search Wizards Speak series at this link. More information about IMT and Rosoka is available from the firm’s Web site, http://www.imtholdings.com.
Stephen E. Arnold, November 21, 2012
August 14, 2012
The New York Times has published an extensive account of the natural-language tragedy, “Goldman Sachs and the $580 Million Black Hole.” The five-page article is a very interesting read. The gist, though, is simple enough: Goldman Sachs failed to look out for their client’s best interests. What a surprise.
You have probably heard of the natural language software NaturallySpeaking, developed by Dragon Systems. Dragon Systems is, at heart, the enterprising Jim and Janet Baker, who spent almost twenty years building their innovative software and their company. In fact, their work is considered to have advanced speech technology much faster than anyone expected. Some of it might even have made its way into Apple’s Siri.
When it came time to reap their rewards, the pair turned to Goldman Sachs for advice on the over-half-billion-dollar deal. Back in 1999, it still seemed like a good idea to trust the prominent investment firm. It wasn’t. Reporter Loren Feldman summarizes the trouble:
“With Goldman Sachs on the job, the corporate takeover of Dragon Systems in an all-stock deal went terribly wrong. Goldman collected millions of dollars in fees — and the Bakers lost everything when Lernout & Hauspie was revealed to be a spectacular fraud. . . . Only later did the Bakers learn that Goldman Sachs itself had at one point considered investing in L.& H. but had walked away after some digging into the company.
“This being Wall Street, a lot of money is now at stake. In federal court in Boston, the Bakers are demanding damages, including interest and legal fees, that could top $1 billion.”
Not only did Goldman direct their own dollars away from L.& H., the suit alleges, they also failed to scrutinize L.& H. for their client when Dragon’s CFO pointed out troubling signs. It turns out that the person in charge of such investigations had left Goldman and not been replaced. Oops. That didn’t keep Goldman from keeping the $5 million consultation fee. Naturally.
Meanwhile, companies who picked up pieces of the Bakers’ technology at auction after L.& H. fell have gone on to develop them into lucrative commodities. The couple was left with neither their invention nor any fraction of the money it was worth.
The case is expected to be decided sometime this November. Feldman burrowed into the wealth of legal filings surrounding the case to craft this article. He has found eye-opening insights into Goldman Sachs’ culture and practices. The piece is worth reading for that reason alone.
It is also a moving tale about a tech- and language-savvy couple who put in the time, effort, passion, and smarts to build their business, and who are now fighting to regain what is rightfully theirs. I wish them luck.
Cynthia Murrell, August 14, 2012
July 9, 2012
Wikipedia is a go-to source for quick answers outside the classroom, but many don’t realize it is an ever-evolving information source. Geekosystem’s article “Wikistats Show You What Parts Of Wikipedia Are Changing” provides a visual way to see what is changing within Wikipedia.
The article explains the program:
“Utilizing technology from Datasift, a social data platform with a specialization in real-time streams, Wikistats lists some clear, concise information you can use to see how Wikipedia is flowing and changing out from under you. Using Natural Language Processing, Wikistats is able to suss realtime trends and updates. In short, Wikistats will show you what pages are being updated the most right now, how many edits they get by how many unique users, and how many lines are being added vs. how many are being deleted.”
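The quote boils down to straightforward bookkeeping over a stream of edit events. A toy Ruby version, with a hand-made event list standing in for Datasift’s real-time feed:

```ruby
# Tally edits, unique editors, and lines added/deleted per page.
events = [
  { page: "Ruby", user: "alice", added: 4, deleted: 1 },
  { page: "Ruby", user: "bob",   added: 2, deleted: 5 },
  { page: "NLP",  user: "alice", added: 7, deleted: 0 }
]

stats = Hash.new { |h, k| h[k] = { edits: 0, users: [], added: 0, deleted: 0 } }
events.each do |e|
  s = stats[e[:page]]
  s[:edits]   += 1
  s[:users]   |= [e[:user]]  # array union keeps editors unique
  s[:added]   += e[:added]
  s[:deleted] += e[:deleted]
end

stats.each do |page, s|
  puts "#{page}: #{s[:edits]} edits by #{s[:users].size} users, " \
       "+#{s[:added]}/-#{s[:deleted]} lines"
end
```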
This program calculates well-defined reports on Wikipedia’s traffic, and the chart in the article drives the point home; Wiki frequenters might find it surprising. The report in this case shows the reality that Wikipedia is an overflowing pool of information.
We are not saying Wikipedia is unreliable, but one should never rely solely on one information source. The chart simply provides a visual way to see what is changing within Wikipedia and helps users understand how data flows. This program’s potential for real-time use on other sites could be tremendous.
Jennifer Shockley, July 9, 2012
July 2, 2012
It is possible to teach an old dog new tricks, according to Semanticweb.com’s article “FirstRain Spotlights Semantics Across Domains.” Semantic approaches work well for a targeted domain because one can train the NLP engine to recognize the key terms that apply there. The downside is that today’s business world is vast, and training limited to specific domains cannot always scale.
FirstRain has opened a unique version of a semantic obedience school. From the article:
“Affinity scoring must be a breakthrough for classes of information where there is a lot of ambiguity, and the cool thing about it is that you can actually apply it in a way to create a virtuous self-improving spiral that works across massively different information domains. When you set up the correct feedback loop of affinity scoring and don’t encode to different domains, but let it swing across those you are trying to match things to, you can create a self-learning system.”
The new system derived by FirstRain is capable of retraining the most stubborn of semantics and inspiring functionality. By creating adaptable semantics, the company has taught an already workable system to handle a variety of information even more efficiently. The semantic obedience school could very well be the next big thing in the business world if all goes as planned. The new routine seems feasible, so has FirstRain cracked the tough training nut of cross-domain semantics?
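FirstRain has not published its algorithm, so the sketch below is purely our reading of the quote above: one shared weight table, nudged up or down by feedback on each match, with nothing encoded per domain. The class and numbers are invented:

```ruby
# Hypothetical self-improving affinity scorer: every confirmed or
# rejected match adjusts the weights of the words that produced it.
class AffinityScorer
  def initialize
    @weights = Hash.new(1.0)  # one weight table across all domains
  end

  def score(doc_words, topic_words)
    (doc_words & topic_words).sum { |word| @weights[word] }
  end

  def feedback(shared_words, correct)
    factor = correct ? 1.1 : 0.9
    shared_words.each { |word| @weights[word] *= factor }
  end
end

scorer = AffinityScorer.new
doc    = %w[merger bank quarterly revenue]
topic  = %w[bank revenue]
puts scorer.score(doc, topic)       # => 2.0
scorer.feedback(doc & topic, true)  # confirmed match boosts those words
puts scorer.score(doc, topic)       # => 2.2
```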
Jennifer Shockley, July 2, 2012
June 11, 2012
A new “cool” vendor has been announced in the Cool Vendors in Analytics and Business Intelligence, 2012 report by Gartner, Inc.
According to the article, “EasyAsk Named ‘Cool Vendor’ by Leading Analyst Firm,” EasyAsk’s Siri-like mobile app for corporate data is one to note. The app, named Quiri, combines voice and NLP to provide a usable, and apparently “cool,” user experience. A video demonstration of the product is available here. The article states:
“Quiri offers users Siri-like built-in speech recognition and natural language processing, allowing users to conveniently speak their business questions and get immediate answers to business questions. Users tap a microphone button, speak a request and Quiri retrieves the answer from existing corporate data.
EasyAsk eCommerce search and merchandising software – available on-premise or as a service (SaaS) – leads the industry in customer conversion by providing the right products on the first page, every time.”
We find this to be an interesting angle for a product spotlight. We aren’t sure if this is a pay-to-play write-up or an objective analysis. We also aren’t sure what “cool” means when referring to a product’s usability, but look forward to seeing more from EasyAsk.
Andrea Hayden, June 11, 2012
Sponsored by PolySpot
June 5, 2012
More explanations of how Google’s smart system becomes so intelligent; not much illumination on precision and recall, however. Google’s Research Blog hosts a post from a Google research team titled “From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas.” The researchers begin by laying out the primary Google challenge:
“Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing.”
Researchers Valentin Spitkovsky and Peter Norvig go on to detail some of the techniques they have used, including building on the traditional encyclopedia model, much like Wikipedia. They then get into technical particulars like language strings and inverted indexes; see the article for more. For in-depth detail, see the team’s paper, “A Cross-Lingual Dictionary for English Wikipedia Concepts.”
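At its core, the resource the paper describes is a large weighted mapping between surface strings and Wikipedia concepts, usable in both directions. A toy Ruby illustration, with made-up entries and probabilities:

```ruby
# Forward dictionary: surface string -> candidate concepts with scores.
DICTIONARY = {
  "jaguar"  => { "Jaguar_Cars" => 0.55, "Jaguar_(animal)" => 0.40 },
  "big cat" => { "Jaguar_(animal)" => 0.30, "Lion" => 0.45 }
}

# Inverted form: concept -> the strings that can refer to it.
INVERTED = Hash.new { |h, k| h[k] = [] }
DICTIONARY.each do |string, concepts|
  concepts.each_key { |concept| INVERTED[concept] << string }
end

p DICTIONARY["jaguar"].max_by { |_, prob| prob }  # most likely concept
p INVERTED["Jaguar_(animal)"]                     # strings naming the concept
```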
Cynthia Murrell, June 5, 2012
Sponsored by PolySpot
May 31, 2012
The current version of Semantic-Knowledge’s Tropes is now available for download at no cost. This useful tool has been benefiting business for over a decade and has yet to outlive its usefulness.
Semantic-Knowledge has been in business since 1994, providing business customers with the means to increase ROI through simplified natural language processing software, including a semantic search engine, text analysis, intelligent desktop search, text mining, and classification systems.
Tropes will perform different types of text analysis, but the overall purpose is to classify, analyze, and examine text. A basic summary of the program:
“Content analysis consists in revealing the framework of a text, i.e. its meaning. This necessarily implies two things. First, there must be a theoretical conception of the text: this must describe both the textual organization of the things that are said and the structural organization of the thought-processes of the people who say them. Secondly, it implies the use of a tool derived from this theoretical conception and rigorously excludes the subjectivity of the investigator, at least until the analysis is finished.”
Tropes offers considerable time savings and enhances strategic data, so it can help businesses yield an exceptional return on investment (ROI). Since Tropes is no longer a commercial product, users can now experiment with this text analysis software without the cost incurred during its initial release.
Jennifer Shockley, May 31, 2012
Sponsored by PolySpot