November 21, 2012
I continue to learn about companies with high-value content processing technologies. The challenge in real-time translation, if one believes the Google marketing, is now in “game over” mode. The winner, of course, is Google. Other firms can head to the showers and maybe think about competing in another business sector.
But some of that Google confidence may rest on assumptions about Google's language processing expertise, not on more recent systems and methods. I know. This is "burn at the stake" information to a Googler.
However, I saw a demonstration which made clear to me that Google's "kitchen sink" approach to handling speech input and near-real-time translation may not be in step with other firms' approaches. The company with some quite interesting translation technology and a commitment to easy integration is IMT Holdings. The privately held company's product is Rosoka.
IMT Holdings, Corp. was founded in 2007; its background is in US government contracting. In the course of the firm's work, co-founder Mike Sorah saw that existing Natural Language Processing (NLP) tools were not able to handle the volumes and complexity of the data the firm needed to process. In December 2011, IMT began actively marketing its NLP technology.
I was able after some telephone tag and email to interview Mike Sorah, one of the co-founders of IMT and one of the wizards behind the Rosoka technology.
Mr. Sorah told me:
Many of the existing NLP tools claim to be multilingual, but what they mean is that they have linguistic knowledge bases, usually acquired from vendors who provide dictionaries and libraries, that make NLP an issue for many licensees. But most of the NLP systems don't process documents that contain English and Chinese or English and Spanish. In the world of our clients, mixed language documents are important. These have to be processed as part of the normal stream, not put in an exception folder and maybe never processed or processed after a delay of hours or days.
The Rosoka system is different from other NLP and translation systems on the market at this time. He asserted:
In most multilingual NLP systems, the customer needs to know, before processing a document, what language the document is in so the appropriate language-specific knowledge base can be loaded. What we did via our proprietary Rosoka algorithms was to take a multilingual look at the world. Our system automatically understands that a document may be in English or Chinese, or even English and Spanish mixed. The language angle is huge. We randomly sample the Twitter stream and have been tweeting what the top 10 languages of the week are. English varies between 35 and 45 percent of the tweets. Every language that Rosoka can process is included. Our multilingual support is not sold as separate, add-on functionality.
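The mixed-language handling Mr. Sorah describes can be illustrated with a toy sketch: identify the language segment by segment instead of assuming one language per document. This is not Rosoka's actual method; the stop-word lists and scoring below are invented for illustration.

```python
# Toy per-segment language identification (not Rosoka's actual algorithm).
# Each segment of a mixed-language document is scored against small
# stop-word lists, so English and Spanish text in one document are
# tagged independently rather than forcing a single-language guess.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that"},
    "es": {"el", "la", "de", "que", "y", "en", "los"},
}

def guess_language(segment):
    """Return the language whose stop words best overlap the segment."""
    words = set(segment.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

doc = [
    "The report is in the shared folder.",
    "El informe está en la carpeta compartida.",
]
print([guess_language(s) for s in doc])  # → ['en', 'es']
```

A production system would, of course, use far richer models than stop-word overlap, but the structural point stands: language identification is applied per segment, so mixed documents stay in the normal processing stream.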
You can read the full text of the interview with Mike Sorah in the ArnoldIT.com Search Wizards Speak series at this link. More information about IMT and Rosoka is available from the firm’s Web site, http://www.imtholdings.com.
Stephen E. Arnold, November 21, 2012
August 14, 2012
The New York Times has published an extensive account of a natural-language tragedy, "Goldman Sachs and the $580 Million Black Hole." The five-page article is a very interesting read. The gist, though, is simple enough: Goldman Sachs failed to look out for its client's best interests. What a surprise.
You have probably heard of the natural language software NaturallySpeaking, developed by Dragon Systems. Dragon Systems is, at heart, the enterprising Jim and Janet Baker, who spent almost twenty years building their innovative software and their company. In fact, their work is considered to have advanced speech technology much faster than anyone expected. Some of it might even have made its way into Apple’s Siri.
When it came time to reap their rewards, the pair turned to Goldman Sachs for advice on the over-half-billion-dollar deal. Back in 1999, it still seemed like a good idea to trust the prominent investment firm. It wasn’t. Reporter Loren Feldman summarizes the trouble:
“With Goldman Sachs on the job, the corporate takeover of Dragon Systems in an all-stock deal went terribly wrong. Goldman collected millions of dollars in fees — and the Bakers lost everything when Lernout & Hauspie was revealed to be a spectacular fraud. . . . Only later did the Bakers learn that Goldman Sachs itself had at one point considered investing in L.& H. but had walked away after some digging into the company.
“This being Wall Street, a lot of money is now at stake. In federal court in Boston, the Bakers are demanding damages, including interest and legal fees, that could top $1 billion.”
Not only did Goldman direct their own dollars away from L.& H., the suit alleges, they also failed to scrutinize L.& H. for their client when Dragon's CFO pointed out troubling signs. It turns out that the person in charge of such investigations had left Goldman and not been replaced. Oops. That didn't keep Goldman from keeping the $5 million consultation fee. Naturally.
Meanwhile, companies who picked up pieces of the Bakers’ technology at auction after L.& H. fell have gone on to develop them into lucrative commodities. The couple was left with neither their invention nor any fraction of the money it was worth.
The case is expected to be decided sometime this November. Feldman burrowed into the wealth of legal filings surrounding the case to craft this article. He has found eye-opening insights into Goldman Sachs’ culture and practices. The piece is worth reading for that reason alone.
It is also a moving tale about a tech- and language-savvy couple who put in the time, effort, passion, and smarts to build their business, and who are now fighting to regain what is rightfully theirs. I wish them luck.
Cynthia Murrell, August 14, 2012
July 9, 2012
Wikipedia is a go-to source for quick answers outside the classroom, but many don't realize it is an ever-evolving information source. Geekosystem's article "Wikistats Show You What Parts Of Wikipedia Are Changing" provides a visual way to see what is changing within Wikipedia.
The program is explained this way:
“Utilizing technology from Datasift, a social data platform with a specialization in real-time streams, Wikistats lists some clear, concise information you can use to see how Wikipedia is flowing and changing out from under you. Using Natural Language Processing, Wikistats is able to suss realtime trends and updates. In short, Wikistats will show you what pages are being updated the most right now, how many edits they get by how many unique users, and how many lines are being added vs. how many are being deleted.”
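The kind of aggregation the quote describes can be sketched in a few lines. The edit events below are invented, and this is not Datasift's or Wikistats' actual pipeline; it simply shows edits per page, unique editors, and lines added versus deleted being tallied from a stream of changes.

```python
from collections import defaultdict

# Hypothetical edit events; a real system would consume a live stream of
# Wikipedia changes. Each event: (page, user, lines_added, lines_deleted).
events = [
    ("Python", "alice", 5, 1),
    ("Python", "bob", 2, 2),
    ("NLP", "alice", 3, 0),
    ("Python", "alice", 1, 4),
]

stats = defaultdict(lambda: {"edits": 0, "users": set(), "added": 0, "deleted": 0})
for page, user, added, deleted in events:
    s = stats[page]
    s["edits"] += 1          # total edits per page
    s["users"].add(user)     # unique editors
    s["added"] += added      # lines added vs. deleted
    s["deleted"] += deleted

# Most-edited pages first, as a Wikistats-style view would show them.
for page, s in sorted(stats.items(), key=lambda kv: -kv[1]["edits"]):
    print(page, s["edits"], len(s["users"]), s["added"] - s["deleted"])
```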
Viewing one of the Wikistats charts is enlightening. The program generates well-defined reports on Wikipedia's traffic, and Wikipedia frequenters might find the numbers surprising: the reports show that Wikipedia is an overflowing pool of information.
We are not saying Wikipedia is unreliable, but one should never rely solely on a single information source. The charts simply provide a visual way to see what is changing within Wikipedia and help users understand how data flows. This program's potential for real-time use on other sites could be tremendous.
Jennifer Shockley, July 9, 2012
July 2, 2012
It is possible to teach an old dog new tricks, according to Semanticweb.com's article "FirstRain Spotlights Semantics Across Domains." Semantic approaches work well for a targeted domain because one can train the NLP engine to recognize that domain's key terms. The downside is that today's business world is vast, and domain-specific training does not always scale.
FirstRain has opened a unique version of a semantic obedience school as:
“Affinity scoring must be a breakthrough for classes of information where there is a lot of ambiguity, and the cool thing about it is that you can actually apply it in a way to create a virtuous self-improving spiral that works across massively different information domains. When you set up the correct feedback loop of affinity scoring and don’t encode to different domains, but let it swing across those you are trying to match things to, you can create a self-learning system.”
FirstRain's new system can retrain even the most stubborn semantics and inspire new functionality. By making its semantics adaptable, the company has taught an already workable system to handle a wider variety of information even more efficiently. The semantic obedience school could well be the next big thing in the business world if all goes as planned. The new routine seems feasible, so has FirstRain cracked the tough training nut of cross-domain semantics?
Jennifer Shockley, July 2, 2012
June 11, 2012
A new "cool" vendor has been announced in the Cool Vendors in Analytics and Business Intelligence, 2012 report from Gartner, Inc.
According to the article, “EasyAsk Named ‘Cool Vendor’ by Leading Analyst Firm,” EasyAsk’s Siri-like mobile app for corporate data is one to note. The app, named Quiri, combines voice and NLP to provide a usable, and apparently “cool,” user-experience. A video demonstration of the product is available here. The article states:
“Quiri offers users Siri-like built-in speech recognition and natural language processing, allowing users to conveniently speak their business questions and get immediate answers to business questions. Users tap a microphone button, speak a request and Quiri retrieves the answer from existing corporate data.
EasyAsk eCommerce search and merchandising software – available on-premise or as a service (SaaS) – leads the industry in customer conversion by providing the right products on the first page, every time.”
We find this to be an interesting angle for a product spotlight. We aren’t sure if this is a pay-to-play write-up or an objective analysis. We also aren’t sure what “cool” means when referring to a product’s usability, but look forward to seeing more from EasyAsk.
Andrea Hayden, June 11, 2012
Sponsored by PolySpot
June 5, 2012
More explanation of how Google's smart system becomes so intelligent; not much illumination on precision and recall, however. Google's Research Blog hosts a post from a Google research team titled "From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas." They begin by laying out the primary Google challenge:
“Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing.”
Researchers Valentin Spitkovsky and Peter Norvig go on to detail some of the techniques they have used, including building on the traditional encyclopedia model, much like Wikipedia. They then get into some technical particulars like language strings and inverted indexes; see the article for more. Or for in-depth detail, see the team's paper, "A Cross-Lingual Dictionary for English Wikipedia Concepts."
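The dictionary the researchers describe can be pictured with a minimal sketch: surface strings map to candidate Wikipedia concepts with link probabilities, and disambiguation starts from the most probable candidate. The entries and probabilities below are invented for illustration, not taken from the paper.

```python
# Minimal sketch of a string-to-concept dictionary (entries invented,
# not Google's actual data). Each surface string maps to Wikipedia
# concepts with an estimated probability that the string links there.
dictionary = {
    "jaguar": [("Jaguar_(animal)", 0.60),
               ("Jaguar_Cars", 0.35),
               ("Jacksonville_Jaguars", 0.05)],
    "python": [("Python_(programming_language)", 0.70),
               ("Pythonidae", 0.30)],
}

def link(mention):
    """Return candidate concepts for a mention, most probable first."""
    candidates = dictionary.get(mention.lower(), [])
    return sorted(candidates, key=lambda c: -c[1])

print(link("Jaguar")[0][0])  # → Jaguar_(animal)
```

In the real system the probabilities come from billions of observed anchor texts and cross-lingual links; the lookup structure, however, is essentially this: string in, ranked concepts out.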
Cynthia Murrell, June 5, 2012
Sponsored by PolySpot
May 31, 2012
The current version of Semantic-Knowledge's Tropes is now available for download at no cost. This useful tool has been benefiting businesses for over a decade and has yet to outlive its usefulness.
Semantic-Knowledge has been in business since 1994, providing business customers with the means to increase ROI through simplified Natural Language Processing software, including a semantic search engine, text analysis, intelligent desktop search, and text mining and classification systems.
Tropes will perform different types of text analysis, but the overall purpose is to assign, analyze, and examine text. A basic summary of the program:
“Content analysis consists in revealing the framework of a text, i.e. its meaning. This necessarily implies two things. First, there must be a theoretical conception of the text: this must describe both the textual organization of the things that are said and the structural organization of the thought-processes of the people who say them. Secondly, it implies the use of a tool derived from this theoretical conception and rigorously excludes the subjectivity of the investigator, at least until the analysis is finished.”
Tropes offers considerable time savings and enhances strategic data. It can therefore help businesses yield an exceptional return on investment (ROI). Since Tropes is no longer a commercial product, users can now experiment with this text analysis program without the cost incurred during its initial release.
Jennifer Shockley, May 31, 2012
Sponsored by PolySpot
May 26, 2012
I have had to look up the antecedents for InQuira again. I wanted to create this post to make it easy to reference these two firms which were combined to create InQuira. InQuira was acquired by Oracle Corp. in that company’s push to address its long-standing search and content processing issues. I have in my Overflight system the 2006 InQuira marketing collateral which, I noticed, provides a crib sheet for the many enterprise search vendors piling into the customer support segment. What’s interesting is that customer support is one of the sectors where open source search is getting some attention.
The antecedents of InQuira were:
- Answerfriend. The company had software which could understand text. In 2000, the company landed Accenture as a customer. Answerfriend pivoted on its natural language processing technology. Allegedly Answerfriend could handle both structured and unstructured data. Sound familiar in 2012?
- Electric Knowledge Inc. This also was an NLP shop. The technology was based on computational linguistic technology. This company had licensed its technology to Bank of America, an outfit which has had a long history of trying to find a search system which meets its requirements.
InQuira was created in 2002. The notion of hooking together two separate vendors to do the 1+1=3 thing has been used more recently by Lexalytics and Attensity.
At one time, InQuira was the answer system used by Yahoo’s customer support service. I encountered this when I tried to cancel a Yahoo service. The InQuira service was not too helpful to me. I just killed the credit card and solved the problem.
The marketing pitch of InQuira is as fresh today as it was in 2002. How much progress has there been in search and content processing in the last decade? Could the marketing collateral for a 2002 Oldsmobile be used without any changes? Probably not. Search has a limited supply of jargon, and it gets recycled endlessly in my opinion.
Stephen E Arnold, May 26, 2012
Sponsored by Polyspot
May 7, 2012
Social media is swarming with sound bites about social media. We recently came across a bit of information about Bitext's recent SIG brainstorming meeting, which prompted further investigation into the company. As the name implies, Bitext is concerned with bits of text, or, as we know it: unstructured content.
The event was a big success, with attendance turning out to be double what they expected. Social media and business strategies were discussed, particularly in relation to the firm's primary concern: semantics.
Among the firm's several offerings, which span consulting services and research and development, NaturalFinder stood out as having value on par with other semantically enriched search technology:
"NaturalFinder is the essential complement for any Internet or intranet search engine as it allows users to query in natural language (Spanish, English, French…) without using Booleans or wildcards. Thanks to its linguistic technology, users can focus on typing their queries in their own words, as if talking to another person. NaturalFinder will return all relevant documents and more documents than traditional search engines, which are based on keywords."
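One way such a natural-language front end can be approximated, as a toy sketch rather than Bitext's actual implementation, is to strip function words from the question before handing it to a keyword engine; real linguistic technology goes far beyond this, but it shows where a Boolean-free query layer sits.

```python
# Toy sketch (not NaturalFinder's implementation): reduce a natural-
# language question to the content words a keyword engine can use.
FUNCTION_WORDS = {
    "what", "is", "the", "a", "an", "of", "in", "for",
    "how", "do", "i", "to", "where", "can", "find",
}

def to_keywords(question):
    """Drop function words so only content terms reach the engine."""
    tokens = question.lower().strip("?").split()
    return [t for t in tokens if t not in FUNCTION_WORDS]

print(to_keywords("Where can I find the quarterly sales report?"))
# → ['quarterly', 'sales', 'report']
```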
It is clear here that technology is continuing to adapt to the larger trend of pervasive informal language. First, we saw unstructured content, as opposed to traditional structured content, utilized for business analytics. Now, we are creating tools that allow search engines to mimic human intelligence.
Megan Feil, May 7, 2012
Sponsored by Ikanow
March 20, 2012
WillQuitSmoking.com recently shared a video about a new analytics system at a healthcare facility in Austin, Texas, which is using the technology to save lives by managing large amounts of unstructured data.
According to the article, “Seton Healthcare Uses IBM Content and Predictive Analytics to Improve Care & Lower CHF Readmissions,” Seton Healthcare relies on IBM Content and Predictive Analytics to identify high-risk congestive heart failure (CHF) patients for interventive care and to avoid preventable readmissions.
The article states:
Natural language processing enables analysis of both structured (ie lab results) and unstructured data (ie physician notes, discharge summaries), opening the door to rich clinical and operational insights that were hidden in inaccessible free text files. Seton can now identify trends and patterns in patient care and outcomes, uncovering sometimes obscure correlations or disparities buried in years of medical records; these can dramatically improve diagnosis and treatment.
One of the reasons this article is really cool is that you can learn by watching a video, not a live, online demo of the technology. Yep, we think movies are much better than live systems. Are videos easier to control than a game show? Yep. Yep. Yep.
Jasmine Ashton, March 20, 2012
Sponsored by Pandia.com