November 13, 2014
Short honk: Microsoft offers an online translation service. It was called Bing once. That name has gone the way of the dodo. Details are here: “Bing Translator Picks Up an Update, Drops Bing Name and Adds Offline Translation for Vietnamese.” Just Bing it, but make sure you know the current name. Is this what MBAs learn today?
Stephen E Arnold, November 13, 2014
November 11, 2014
Three or four days ago I received a LinkedIn message that a new thread had been started on the Enterprise Search Engine Professionals group. You will need to be a member of LinkedIn and do some good old fashioned brute force search to locate the thread with this headline, “Enterprise Search with Chinese, Spanish, and English Content.”
The question concerned a LinkedIn user information vacuum job. A member of the search group wanted recommendations for a search system that would deliver “great results with content outside of English.” Most of the intelligence agencies have had this question in play for many years.
The job hunters, consultants, and search experts who populate the forum do not step forth with intelligence agency type responses. In a decision making environment where inputs in a range of languages are the norm for the risk averse, the suggestions offered to the LinkedIn member struck me as wide of the mark. I wouldn’t characterize the answers as incorrect. Uninformed or misinformed are candidate adjectives, however.
One suggestion offered to the questioner was a request to define “great.” Like love and trust, great is fuzzy and subjective. The definition of “great”, according to the expert asking the question, boils down to “precision, mainly that the first few results strike the user as correct.” Okay, the user must perceive results as “correct.” But as ambiguous as this answer remains, the operative term is precision.
In search, precision is not fuzzy. Precision has a definition that many students of information retrieval commit to memory and then include in various tests, papers, and public presentations. For a workable definition, see Wikipedia’s take on the concept or L. Egghe’s “The Measures Precision, Recall, Fallout, and Miss As a function of the Number of Retrieved Documents and Their Mutual Interrelations,” Universiteit Antwerp, 2000.
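For readers who want the textbook definition in concrete form, here is a minimal sketch in Python. The function names, the document identifiers, and the relevance judgments are made up for illustration; the formulas are the standard ones (precision is relevant retrieved over retrieved, recall is relevant retrieved over relevant). The "first few results" notion from the thread maps to precision at k.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Precision computed over only the first k ranked results."""
    return precision(list(ranked)[:k], relevant)


# Hypothetical ranked result list and hand-made relevance judgments.
ranked = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4", "d8"}

print(precision(ranked, relevant))          # 0.6  (3 of 5 retrieved are relevant)
print(recall(ranked, relevant))             # 0.75 (3 of 4 relevant were retrieved)
print(precision_at_k(ranked, relevant, 3))  # 2 of the top 3 are relevant
```

Note that every number here depends on someone knowing which documents are relevant. Without those judgments, "precision" is a perception, not a measurement.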
In simple terms, the system matches the user’s query. The results are those that the system determines contain identical or statistically close matches to the user’s query. Old school brute force engines relied on string matching. Think RECON. More modern search systems toss in term matching after truncation, nearness of the terms used in the user query to the occurrence of terms in the documents, and dozens of other methods to determine likely relevant matches between the user’s query and the document set’s index.
With a known corpus like ABI/INFORM in the early 1980s, a trained searcher testing search systems can craft queries for that known result set. Then as the test queries are fed to the search system, the results can be inspected and analyzed. Running test queries was an important part of our analysis of a candidate search system; for example, the long-gone DIALCOM system or a new incarnation of the European Space Agency’s system. Rigorous testing and analysis make it easy to spot dropped updates or screw ups that routinely find their way into bulk file loads.
Our rule of thumb was that if an ABI/INFORM index contained a term, a high precision result set on SDC ORBIT would include a hit containing that term. If the result set did not contain a match, it was pretty easy to pinpoint where the indexing process started dropping files.
However, when one does not know what’s been indexed, precision drifts into murkier areas. After all, how can one know if a result is on point if one does not know what’s been indexed? One can assume that a result set is relevant via inspection and analysis, but who has time for that today? That’s the danger in the definition of precision in what the user perceives. The user may not know what he or she is looking for. The user may not know the subject area or the entities associated consistently with the subject area. Should anyone be surprised when the user of a system has no clue what a system output “means”, whether the results are accurate, or whether the content is germane to the user’s understanding of the information needed?
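The difference between brute force string matching and the more modern methods can be sketched in a few lines of Python. This is a toy illustration, not how RECON or any commercial engine worked: the fixed-length prefix is a crude stand-in for truncation, the window size is a simple stand-in for term nearness, and all function names and the sample text are invented for the example.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(query, document):
    """Brute force: the query string must appear verbatim in the document."""
    return query.lower() in document.lower()


def truncate(term, length=5):
    """Crude truncation: keep only a fixed-length prefix of the term."""
    return term[:length]


def truncated_match(query, document, length=5):
    """True when every truncated query term occurs among the document's truncated terms."""
    doc_stems = {truncate(t, length) for t in tokenize(document)}
    return all(truncate(t, length) in doc_stems for t in tokenize(query))


def proximity_window(query, document, length=5):
    """Size (in tokens) of the smallest window holding all query terms, or None."""
    doc_tokens = [truncate(t, length) for t in tokenize(document)]
    positions = []
    for stem in {truncate(t, length) for t in tokenize(query)}:
        hits = [i for i, tok in enumerate(doc_tokens) if tok == stem]
        if not hits:
            return None
        positions.append(hits)
    if not positions:
        return None
    # Brute force over position combinations; fine for short toy documents.
    return min(max(combo) - min(combo) + 1 for combo in product(*positions))


doc = "The indexing process dropped several files during the bulk load."
query = "dropping files"

print(exact_match(query, doc))       # False: the literal string is absent
print(truncated_match(query, doc))   # True: "dropp" and "files" both occur
print(proximity_window(query, doc))  # 3: "dropped several files"
```

A real system would combine signals like these into a relevance score. The point of the sketch is only that each added method changes which documents count as a match.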
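The rule of thumb lends itself to a simple check, sketched below in Python. Everything here is hypothetical: the term list, the record numbers, and the toy search function stand in for a known index and a candidate system. The idea is only that when the index says a record carries a term, a query on that term should return the record, and any record that fails to come back is a candidate dropped file.

```python
def find_dropped_records(known_index, search):
    """Compare a known index against what a search system returns.

    known_index: dict mapping a term to the set of record ids known to carry it.
    search: callable that takes a term and returns an iterable of record ids.
    Returns a dict mapping each term to the sorted ids that failed to come back.
    """
    missing = {}
    for term, expected in known_index.items():
        returned = set(search(term))
        lost = set(expected) - returned
        if lost:
            missing[term] = sorted(lost)
    return missing


# Hypothetical known index: which records carry which controlled terms.
known_index = {
    "leasing": {101, 102},
    "merger": {102, 103},
}

# Hypothetical bulk load in which record 103 never made it into the system.
loaded_records = {
    101: ["leasing"],
    102: ["leasing", "merger"],
}


def toy_search(term):
    """Stand-in for the candidate system: return ids of loaded records carrying the term."""
    return [rec_id for rec_id, terms in loaded_records.items() if term in terms]


print(find_dropped_records(known_index, toy_search))  # {'merger': [103]}
```

The check works only because the corpus is known in advance, which is exactly what is missing in the situation described next.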
Against this somewhat drab backdrop, the suggestions offered to the LinkedIn person looking for a search engine that delivers precision over non-English content or, more accurately, content that is not in the primary language of the person doing a search, are revelatory.
Here are some responses I noted:
- Hire an integrator (Artirix, in this case) and let that person use the open source Lucene based Elasticsearch system to deliver search and retrieval. Sounds simplistic. Yep, it is a simple answer that ignores source language translation, connectors, index updates, and methods for handling the pesky issues related to how language is used. Figuring out what a source document means in a language with which the user is not fluent is fraught with challenges. Forget dictionaries. Think about the content processing pipeline. Search is almost the caboose at the end of a very long train.
- Use technology from LinguaSys. This is a semantic system that is probably not well known outside of a narrow circle of customers. This is a system with some visibility within the defense sector. Keep in mind that it performs some of the content processing functions. The technology has to be integrated into a suitable information retrieval system. LinguaSys is the equivalent of adding a component to a more comprehensive system. Another person mentioned BASIS Technologies, another company providing multi language components.
- Rely on LucidWorks. This is an open source search system based on SOLR. The company has spun the management revolving door a number of times.
- License Dassault’s Exalead system. The idea is worth considering, but how many organizations are familiar with Exalead or willing to embrace the cultural approach of France’s premier engineering firm? After years of effort, Exalead is not widely known in some pretty savvy markets. But the Exalead technology is not 100 percent Exalead. Third party software delivers the goods, so Exalead is an integrator in my view.
- Embrace the Fast Search & Transfer technology, now incorporated into Microsoft SharePoint. Unmentioned is the fact that Fast Search relied on a herd of human linguists in Germany and elsewhere to keep its 1990s multi lingual system alive and well. Fast Search, like many other allegedly multi lingual systems, relies on rules, and these have to be written, tweaked, and maintained.
So what did the LinkedIn member learn? The advice offers one popular approach: Hire an integrator and let that company deliver a “solution.” One can always fire an integrator, sue the integrator, or go to work for the integrator when the CFO tries to cap the cost of a system that must please a user who may not know the meaning of nus in Japanese from a now almost forgotten unit of Halliburton.
The other approach is to go open source. Okay. Do it. But as my analysis of the Danish Library’s open source search initiative in Online suggested, the work is essentially never done. Only a tolerant government and lax budget oversight make this avenue feasible for many organizations with a search “problem.”
The most startling recommendation was to use Fast Search technology. My goodness. Are there not other multi lingual capable search systems dating from the 1990s available? Autonomy, anyone?
Net net: The LinkedIn enterprise search threads often underscore one simple fact:
Enterprise search is assumed to be one system, an app if you will.
One reason for the frequent disappointment with enterprise search is this desire to buy an iPad app, not engineer a constellation of systems that solve quite specific problems.
Stephen E Arnold, November 11, 2014
January 17, 2014
The article promoting LinguistNow on the Language I/O website is titled “Fast, Human Translation of CRM Content.” If you are in need of an alternative to Google Translate for Oracle Applications, this is the article for you. LinguistNow is offered as a product suite by Language I/O that is capable of plugging directly into Oracle-RightNow and Salesforce CRM platforms.
The article explains:
“Our CRM-agnostic GoLinguist server can integrate with any CRM that exposes the right set of APIs. We also provide ready-to-deploy LinguistNow add-ins specifically for RightNow CX/CRM. Within Oracle RightNow and Salesforce, LinguistNow allows you to request translations of answer and incident content with the click of a mouse.”
LinguistNow also automates machine and human translation processes for Help Desk and FAQ email content. This method of quickly and automatically exporting and importing translatable content will reduce not only response times for clients but also the risk of human error that increases with every step that a user must perform manually. The article also includes a user testimonial from the VP of SurveyMonkey, who claims that LinguistNow is responsible for saving the company tens of thousands and making the company more efficient. Demos and pricing are available through the contact page.
Chelsea Kerwin, January 17, 2014
May 29, 2013
We came across a 4,600 word news release about the language translation software market. The study has more than 400 pages and covers a wide range of topics, including mobile phone translation systems. We worked on the Topeka Capital Markets’ Google voice report. We are biased because Google seems to have a significant technology and resource edge. As we worked through the news release we did see a list of the firms which WinterGreen discusses.
A notable translation helper, the Rosetta Stone. A happy quack to the British Museum at www.britishmuseum.org.
I want to snag the list because it had some surprises as well as both familiar and unfamiliar firms in the inventory. Here’s what I noticed in the news release:
ABBYY Lingvo (http://www.lingvo-online.ru/en)
Alchemy CATALYST (http://www.alchemysoftware.com/)
AppTek HMT (now a unit of SAIC. http://www.saic.com)
Cognition Technologies (www.cognition.com)
Duolingo (more of a learning system. http://duolingo.com/)
Google (ah, the GOOG)
Hewlett Packard (maybe www.autonomy.com)
IBM WebSphere Translation Server (try http://goo.gl/hGS2R)
Kilgray Translation Technologies (http://kilgray.com/)
Language Engineering (http://www.lec.com)
Language Weaver (Now part of SDL. See http://goo.gl/IH3mg)
Lingo24 (An agency. See http://www.lingo24.com/)
Lionbridge (crowdsourcing and integrator at http://www.lionbridge.com/)
Mission Essential Personnel (humans for rent at http://www.lionbridge.com/)
Plunet BusinessManager (A management system at http://www.plunet.com/us/)
Proz.com (humans for rent at http://www.proz.com)
RWS Legal Translation (http://www.rws.com/EN/)
Reverso (Free. See http://www.reverso.net/text_translation.aspx?lang=EN)
SDL Trados (Part of SDL. See http://www.trados.com/en/)
Sail Labs (http://www.sail-labs.com/)
Softissimo (Services and software. http://www.softissimo.com/softissimo.asp?lang=IT)
Symbio Software (http://www.symbio.com/)
Translations.com (Services and software. http://www.translations.com/)
Translators without Borders (Humans for rent. http://translatorswithoutborders.org/)
Veveo (More semantics than translation. http://corporate.veveo.net/)
Vignette (Open Text. http://www.opentext.com)
Word Magic Technology (I could not locate.)
WorldLingo (Rent a human. http://goo.gl/dhiu)
Of these 30 or so companies, there were some which struck me as a surprise. Hewlett Packard, for example, owns Autonomy. I suppose that other units of Hewlett Packard have translation capabilities, but were these licensed or home grown? Also, the inclusion of Vignette is interesting. I must admit that I don’t hear much about Vignette as a translation system. The list makes translation look robust. The key players boil down to a handful of companies. I did not spot firms in the translation services or software business in China, India, Japan, or Russia, but I may have missed these firms in the WinterGreen news release describing the report.
If you want to buy a copy of the report, which I assume has paragraphs unlike the news release, point your browser at http://goo.gl/97e2s and have your credit card ready. The report is about US$7,500.
Stephen E Arnold, May 29, 2013
Sponsored by Augmentext
February 4, 2013
A happy quack to the reader who sent me the story “App Seeks to Translate Entire Web.” The article appeared in the February 4, 2013 USA Today, page 3A. This is a publication one of my library contractors calls “McPaper.” Well, if the Economist can use McDonald’s as a method of pegging currency buying power, I am okay with a report about a start up which wants to translate the “entire Web.” The Web is dynamic, so the latency issue is one which someone needs to consider. You can find a version of the story at this link, but I don’t know how long it will be alive.
The idea is that Duolingo is a “massive scale online collaboration” tool. You can do the Rosetta Stone type language learning or you can rely on the system to translate the Web. The Web site for this challenge to Google’s little appreciated online translation capability is www.duolingo.com. The system is linked to Carnegie Mellon University and a wizard (Luis von Ahn) who was a TED speaker. Link here. (To make the short link I used the enjoyable Captcha to key “udoping squ”. Coincidence? If you are a fan of the security feature which requires a user to type letters which are messed up in order to access a Web site, then you will be primed to embrace Duolingo. The inventor of Duolingo worked on the Captcha system.)
Is Google poised to snap up this innovation? Who knows.
Stephen E Arnold, February 4, 2013
November 27, 2012
Due to the increasingly globalized workforce, it is more important than ever that data analytics providers are able to appeal to a multitude of countries and languages and corner the polyglot market. Matthew Aslett of the Too Much Information blog recently reported on this topic in the article, “The Dawn of Polyglot Analytics.”
According to Aslett, the emergence of a polyglot analytics platform exemplifies a new approach to data analytics that is based on the user’s approach to analytics rather than the nature of the data.
The article states:
Polyglot analytics explains why we are seeing adoption of Hadoop and MapReduce as a complement to existing data warehousing deployments. It explains, for example, why a company like LinkedIn might adopt Hadoop for its People You May Know feature while retaining its investment in Aster Data for other analytic use cases. Polyglot analytics also explains why a company like eBay would retain its Teradata Enterprise Data Warehouse for storing and analyzing traditional transactional and customer data, as well as adopting Hadoop for storing and analyzing clickstream, user behaviour and other un/semi-structured data, while also adopting an exploratory analytic platform based on Teradata’s Extreme Data Appliance for extreme analytics on a combination of transaction and user behaviour data pulled from both its EDW and Hadoop deployments.
One company that is currently excelling in polyglot analytics is PolySpot. In the recent blog post, “Polyspot is Polyglot,” we learned that PolySpot offers its services in over 50 languages. Language is no longer a hindrance to data management success. PolySpot warrants a close look. The company offers high value technology within the reach of most organizations’ budgets.
Jasmine Ashton, November 27, 2012
January 30, 2012
The rapid development of Web-based technologies over the last decade has created a unique opportunity to bring together the world’s scientists by making it easy for them to share research information. With the shift from US-centric, English language information to information published in other languages, researchers find that facility in one or two other languages is inadequate.
The Multilingual Challenge
Multilingual search increases the value of research output by making it available to a wider audience. Seamless federation and automated translation make available research from China, Japan, Russia, and other countries prolific in science publication to researchers who may lack facility in certain languages. In the area of patent research, multilingual search greatly broadens the scope of patent research. For English speakers, the availability of multilingual federated search exposes them to diverse perspectives from researchers in foreign countries.
For example, China’s research output is now far outpacing the rest of the world. In 2006 China’s research and development output surpassed that of Japan, the UK and Germany. At this pace, China will overtake the USA in a few years. But non US innovation is not confined to Asia and Europe. Brazil’s share of research output is growing rapidly.
Sample system output from WorldWideScience.org, powered by Deep Web Technologies’ multilingual federating system.
Deep Web Technologies (DWT) is one of the leaders in federated search. Federation requires taking a user’s query and using it to obtain search results from other indexes and search-and-retrieval systems. For example, Deep Web Technologies’ Explorit product handles this process, returning to the user a blended set of results. For the user, federation eliminates the need to frame a query for Google, Medline, USA.gov, and the NASA website. The user frames a query, sends it to Explorit and a single, relevance-ranked results list is displayed to the user.
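The federation steps described above (take one query, send it to several systems, blend the answers into one ranked list) can be sketched in Python. This is a generic illustration under stated assumptions, not a description of how Explorit works: the source functions, the scores, and the normalize-by-top-score blending rule are all invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def federate(query, sources, limit=10):
    """Send one query to every source and blend the results into one ranked list.

    sources: dict mapping a source name to a callable that takes the query and
             returns a list of (title, score) pairs on that source's own scale.
    Returns up to `limit` tuples of (title, normalized_score, source_name).
    """
    def run(item):
        name, search = item
        results = search(query)
        # Assumed blending rule: rescale each source so its best hit scores 1.0.
        top = max((score for _, score in results), default=0.0)
        return [(title, score / top if top else 0.0, name) for title, score in results]

    # Query the sources in parallel; pool.map keeps the order of `sources`.
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(run, sources.items()))

    merged = [row for batch in batches for row in batch]
    merged.sort(key=lambda row: row[1], reverse=True)
    return merged[:limit]


# Hypothetical sources that score on different scales.
def source_a(query):
    return [("A1", 10.0), ("A2", 5.0)]


def source_b(query):
    return [("B1", 0.8), ("B2", 0.6)]


results = federate("solar cells", {"a": source_a, "b": source_b})
print([title for title, _, _ in results])  # ['A1', 'B1', 'B2', 'A2']
```

The hard parts of real federation (translating the query for each source, handling slow or failed sources, and ranking results that come without comparable scores) are left out of this sketch.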
DWT has moved beyond single language federation and grown to become the leader in federated search of the deep web. This has resulted in the launch of their ground-breaking, patent pending multilingual federated search capability in June of 2011.
“We now live in a much more interconnected world where information is available in a variety of languages,” noted Abe Lederman, President and CTO of Deep Web Technologies. “Major advances in machine translation have made it possible for DWT to develop a revolutionary new Explorit product that breaks down language barriers and advances scientific collaboration and business productivity.”
May 2, 2011
“Google Translate Has Great Uses, Disastrous Misuses,” asserts Deseret News. We agree.
As writer Adam Wooten states, machine translation sites like Google Translate are wonderful for getting the gist of text in a language you don’t happen to be fluent in. In fact, there are several others that also address this need, including: GeoFluent from Lionbridge; ImTranslator; and WorldLingo.
In TMCnet’s article, “Lionbridge launches GeoFluent Real-Time Translation Platform,” writer Jai C.S. describes the new offering. This paid platform boasts “a statistical machine translation engine developed in IBM’s Watson Research Center,” and includes a multilingual chat feature. It also runs on multiple devices and promises secure exchanges. Check it out here.
ImTranslator describes their product in their own blog. This option includes components like a virtual keyboard, spell-check, a multilingual dictionary, and email and print functions. I like the idea of the “convert text into voice in 10 languages” feature. The translation itself, however, relies on Google Translate.
WorldLingo lures us in with a free translator, similar to Google’s, but it offers several pay options a la carte. These include machine translations for entire Web sites, emails, documents, and chat. Also, and this is important, the company offers the services of professional human translators.
Which brings us back to Deseret’s article. When it comes to weighty matters such as legal documents, financial information, and marketing copy, you’d better call in the humans. Check out this partial list of machine translation mistakes Wooten cites:
“A Chinese restaurant sign displayed the words ‘Translate Server Error’ above its storefront after a free translation site failed. A newspaper mistranslation repeatedly misquoted a former president of Kazakhstan as referring to the important issue of ‘passing gas.’ Israeli journalists nearly sparked an international incident when they seemed to insult a Dutch diplomat’s mother in a machine-translated message. Finally, an automatically translated furniture tag contained a racist slur that seriously offended customers in Toronto, Canada.”
For personal reading or for informal communications where all parties are aware of the limits of machine translation, these tools fit the bill: free or inexpensive, quick, and easy. The additional tools from the paid sites could come in very handy; the chat feature sticks out as one that may quickly become indispensable.
However, there’s no substitute for knowledgeable professional humans when it comes to the important stuff. It’s worth the investment of time and money, unless you don’t mind publishing something akin to the Chinese menu that offered “Stir-fried wikipedia.” Mmm, sounds tasty!
Cynthia Murrell, May 2, 2011
April 13, 2011
The Silicon India News article “Kalam Launches Language Translation Software” introduces a new program launched by former president Kalam. The Machine Translation (MT) system is designed to translate languages on the Internet and is the collaboration of 17 institutions. The new program was recently introduced at the 20th International World Wide Web Conference. We learned:
“According to Rajeev Sangal, Director, IIIT Hyderabad, the MT System was based on the computational paninian grammar (CPG), which works very well for free word order languages, and Indian languages in particular.”
India has hundreds of languages and this new technology could be a direct step towards breaking down the language barrier that exists even within the country itself. There are a limited number of languages available on the market right now but there are immediate plans to introduce more in the near future. The translation program definitely sounds promising but in a world where most people have access to the free Google translation program it seems hard to compete. Over time additional language pairs will be added. Most of the enterprise search vendors with multi-lingual support handle a number of languages. Kalam is focusing on languages often not included in the standard language pair pack.
April Holmes, April 13, 2011
November 20, 2010
As someone who worked for several years in South Korea dealing with the language barrier, I’m always interested in translation software. I’ve found that Google Translate has gotten better and better since its debut, but it still can’t truly translate accurately. Translation software still needs some human intervention, yet companies need to automate as much as possible to decrease costs.
Two recent stories reflect the intersection of the human and the machine in translation services, with the machine achieving increasing importance. SAIC Expands Human Language Technology Offerings for Federal and Commercial Customers tells of the Fortune 500 company’s acquisition of technology, intellectual property, and related assets from AppTek Partners, Applications Technology, and Media Mind. SAIC is expanding its already widely-used human translation and interpretation services through these faster automated services. In sum: “The deal will bolster SAIC’s existing portfolio of more than 70 languages and dialects, helping linguists provide enhanced translation, interpretation and analysis service to U.S. government and commercial firm decision makers.”
In related news, Lingotek Enables Users to Easily Translate SharePoint Content reports on the translation software’s new directly embedded language tool, which allows users to translate within SharePoint. The value to clients is that “By combining best-in-class machine translation solutions and real-time community translations, SharePoint users will be able to produce ‘real time’ volunteer translations in a third amount of the time while reducing large costs.” This is a boon for SharePoint users, who will no longer have to do their translation outside the program. Both these stories reflect that automated translation saves money, but I wonder if the machine will ever be able to completely replace the human factor.
Alice Wasielewski, November 20, 2010