
Be the CIA Librarian

May 3, 2016

Research is a vital tool for the US government, especially the Central Intelligence Agency, which is why it employs librarians.  The Central Intelligence Agency is one of the main forces of the US Intelligence Community, focused on gathering information for the President and the Cabinet.  The CIA is also the topic of much fictionalized speculation in stories, mostly spy and law enforcement dramas.  Given the important part the agency has played in United States history, can you imagine the files in its archives?

If you have a penchant for information, the US government, and a library degree, then maybe you should apply to the CIA’s current job opening for a librarian.  CNN Money explains one of the perks of the job is its salary: “The CIA Is Hiring…A $100,000 Librarian.”  The great salary, which CNN is quick to point out is more than the typical family income, is not the only draw.  Librarians serve as more than people who recommend decent books to read; they serve as an entry point for research and bridge the gap between understanding knowledge and applying it in the actual field.

“In addition to the cachet of working at the CIA, ‘librarians also have opportunities to serve as embedded, or forward deployed, information experts in CIA offices and select Intelligence Community agencies.’  Translation: There may be some James Bond-like opportunities if you want them.”

Most of this librarian’s job duties will probably be assisting agents with tracking down information related to intelligence missions and interpreting it.  It is just a guess, however.  Who knows, maybe the standard CIA agent totes a gun to the stacks?


Whitney Grace, May 3, 2016
Sponsored by, publisher of the CyberOSINT monograph

Online Translation: Google or Microsoft?

March 1, 2016

I have solved the translation problem. I live in Harrod’s Creek, Kentucky. Folks here speak Kentucky. No other language needed. However, gentle reader, you may want to venture into lands where one’s native language is not spoken or written. You will need online translation.

Should I forget Systran and other industrial strength solutions of yesteryear? Today the choice is Google or Microsoft if I understand “2 Main Reasons Why Google Translate Is Ahead of Microsoft and Skype.” (The link worked on February 22, 2016. If it does not work when you read this blog post, you may have to root around. That’s life in the zip zip world today.)

Reason one is that Google supports more languages than Microsoft. The total is 100 plus. The write up is sufficiently amazed to describe the language support of the Alphabet Google thing as “mind blowing.” Okay.

Reason two is that Google’s translation function works on smartphones. The write up points out:

You can hand-write, speak, type, or even take a picture of a given language and Google Translate will translate it for you. Not only this but on Android, some of the translation features are available offline. So, some features are accessible even if you do not have access to the internet.

The write up does not dig too deeply into Microsoft’s translation capability. If you are interested in Microsoft’s quite capable and useful services, navigate to the Microsoft Language Portal. Google is okay, but one service may not do the job a person who does not speak Kentucky requires.

Stephen E Arnold, February 27, 2016

When Google Translate Is Not Enough

September 16, 2015

I read a delightful article called “The British Library Is Crowdsourcing the Translation of a Mysterious 13th-Century Sword Inscription.” I am not too keen on edged weapons. Nevertheless, I am interested in becoming sharper when it comes to translation methods.

The write up states:

+NDXOXCHWDRGHDXORVI+ This inscription, engraved on a 13th-century double-edged sword owned by the British Museum, is the medieval mystery of the moment. Stumped by its cryptic engraving, last week the British Library tapped the interwebs for its crowd wisdom, asking commenters to help decode the meaning.

What makes the article entertaining is the fact that the British Library, backed with the formidable talents of British universities where linguistics absolutely thrives, is turning to the hoi polloi for assistance.

And assist did the rustics. Consult the original article for the full span of human ingenuity. Here’s the comment I enjoyed from a non rustic:

“Everything is explained in Winnie the Pooh.”

A Google search reveals more questions:



Stephen E Arnold, September 16, 2015

Captain Page Delivers the Google Translator

January 15, 2015

Well, one of the Star Trek depictions is closer to reality. Google announced a new, Microsoft-maiming translate app. You can read about this Bing body blow in “Hallo, Hola, Ola to the New, More Powerful Google Translate App.” Google has more translation goodies in its bag of floppy discs. My hunch is that we will see them when Microsoft responds to this new Google service.

The app includes an image translation feature. From my point of view, this is helpful when visiting countries that do not make much effort to provide US English language signage. Imagine that! No English signs in Xi’an or Kemerovo Oblast.

The broader impact will be on the industrial strength, big buck translation systems available from the likes of BASIS Tech and SDL. These outfits will have to find a way to respond, not to the functions, but the Google price point. Ouch. Free.

Stephen E Arnold, January 15, 2015

Machine versus Human Translations

January 7, 2015

I am fascinated with the notion of real time translation. I recall with fondness lunches with my colleagues at Ziff in Foster City. Then we talked about the numerous opportunities to create killer software solutions. Translation would be “solved”. Now 27 years later, progress has been made, just slowly.

Every once in a while an old technical cardboard box gets hauled out from under the car port. There are old ideas that just don’t have an affordable, reliable, practical solution. After rummaging in the box, the enthusiasts put it back on the shelf and move on to the next YouTube video.

I read “The Battle of the Translators: Man vs Machine.” The write up tackles Skype’s real time translation feature. Then there is a quick excursion through Google Translate.

The passage I noted was:

So, while machine translations may be great for rudimentary translations or even video calls, professional human translators are expert craftsmen, linguists, wordsmiths and proofreaders all wrapped in one. In addition to possessing cultural insight, they also are better editors who shape and perfect a piece for better public consumption, guaranteeing a level of faithfulness to the original document — a skill that not even the most cutting-edge machine translation technology is capable of doing just yet. Machine translators are simply not yet at the level of their chess-playing counterparts, which can beat humans at their own game. As long as automatic translators lack the self-awareness, insight and fluency of a professional human translator, a combination of human translation assisted by machine translation may be the optimal solution.

I include a chapter about automated translation in CyberOSINT: Next Generation Information Access. You can express interest in ordering by writing benkent2020 at yahoo dot com. In the CyberOSINT universe, machine translation exists cheek-by-jowl with humans.

For large flows of information in many different languages, there are not enough human translators to handle the workload. Machine based translations, therefore, are an essential component of most cyber OSINT systems. For certain content, a human has to make sure that the flagged item is what the smart software thinks it is.

The problem becomes one of having enough capacity to handle first the machine translation load and then the human part of the process. For many language pairs, there are not enough humans. I don’t see a quick fix for this multi-lingual talent shortfall.

The problem is a difficult one. Toss in slang, aliases, code words and phrases, and neologisms. Stir in a bit of threat with or without salt. Do the best you can with what you have.

Translation is a thorny problem. The squabbles of the math oriented and the linguistic camps are of little interest to me. Good enough translation is what we have from both machines and humans.

I don’t see a fix that will allow me to toss out the cardboard box with its musings from 30 years ago.

Stephen E Arnold, January 7, 2015

Remember Bing Translator?

November 13, 2014

Short honk: Microsoft offers an online translation service. It was called Bing once. That name has gone the way of the dodo. Details are here: “Bing Translator Picks Up an Update, Drops Bing Name and Adds Offline Translation for Vietnamese.” Just Bing it, but make sure you know the current name. Is this what MBAs learn today?

Stephen E Arnold, November 13, 2014

LinkedIn Enterprise Search: Generalizations Abound

November 11, 2014

Three or four days ago I received a LinkedIn message that a new thread had been started on the Enterprise Search Engine Professionals group. You will need to be a member of LinkedIn and do some good old fashioned brute force search to locate the thread with this headline, “Enterprise Search with Chinese, Spanish, and English Content.”

The question concerned a LinkedIn user information vacuum job. A member of the search group wanted recommendations for a search system that would deliver “great results with content outside of English.” Most of the intelligence agencies have had this question in play for many years.

The job hunters, consultants, and search experts who populate the forum do not step forth with intelligence agency type responses. In a decision making environment where inputs in a range of languages are the norm for the risk averse, the suggestions offered to the LinkedIn member struck me as wide of the mark. I wouldn’t characterize the answers as incorrect. Uninformed or misinformed are candidate adjectives, however.

One suggestion offered to the questioner was a request to define “great.” Like love and trust, great is fuzzy and subjective. The definition of “great”, according to the expert asking the question, boils down to “precision, mainly that the first few results strike the user as correct.” Okay, the user must perceive results as “correct.” But as ambiguous as this answer remains, the operative term is precision.

In search, precision is not fuzzy. Precision has a definition that many students of information retrieval commit to memory and then include in various tests, papers, and public presentations. For a workable definition, see Wikipedia’s take on the concept or L. Egghe’s “The Measures Precision, Recall, Fallout, and Miss As a Function of the Number of Retrieved Documents and Their Mutual Interrelations,” Universiteit Antwerpen, 2000.
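To make the point concrete, here is a minimal sketch of the set-based arithmetic behind precision and two of its companion measures, recall and fallout. It assumes a complete set of relevance judgments exists for the query, which is exactly what a known corpus provides. The function names `retrieval_measures` and `precision_at_k` and the sample record IDs are hypothetical, chosen only for illustration.

```python
def retrieval_measures(retrieved, relevant, collection_size):
    """Compute precision, recall, and fallout from sets of record IDs.

    Assumes `relevant` holds every relevant record for the query, which in
    practice requires a known, inspected corpus.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)       # relevant records that came back
    noise = len(retrieved - relevant)      # irrelevant records that came back
    non_relevant = collection_size - len(relevant)

    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
        "fallout": noise / non_relevant if non_relevant else 0.0,
    }


def precision_at_k(ranked, relevant, k):
    """Precision over only the first k results of a ranked list."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


# Hypothetical example: four records returned from a ten-record collection.
ranked_results = ["d1", "d2", "d3", "d4"]
known_relevant = {"d1", "d3", "d5"}
measures = retrieval_measures(ranked_results, known_relevant, collection_size=10)
first_two = precision_at_k(ranked_results, known_relevant, k=2)
```

The `precision_at_k` helper is the closest formal cousin of “the first few results strike the user as correct.” Note that it still needs `known_relevant`, a judgment made by someone who knows the corpus, not by the user's impression.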

In simple terms, the system matches the user’s query. The results are those that the system determines contain identical or statistically close matches to the user’s query. Old school brute force engines relied on string matching. Think RECON. More modern search systems toss in term matching after truncation, nearness of the terms used in the user query to the occurrence of terms in the documents, and dozens of other methods to determine likely relevant matches between the user’s query and the document set’s index.
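The difference between brute force string matching and the more modern mix of truncation plus term nearness can be sketched in a few lines. This is a toy under stated assumptions, not how RECON or any commercial engine works: the fixed five-character truncation, the scoring formula, and all function names are hypothetical, and the tokenizer handles only unaccented Latin letters, which is itself a reminder of why non-English content needs more machinery.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a-z (Latin letters only)."""
    return re.findall(r"[a-z]+", text.lower())


def truncate(term, length=5):
    """Crude truncation: keep only the first `length` characters of a term."""
    return term[:length]


def string_match(query, document):
    """Old school brute force: the query string must appear verbatim."""
    return query.lower() in document.lower()


def truncated_proximity_score(query, document, length=5):
    """Score by truncated-term matching, higher when the matches sit close together.

    Returns 0.0 if the query is empty or any query term has no truncated match.
    Only the first occurrence of each term in the document is considered.
    """
    doc_stems = [truncate(t, length) for t in tokenize(document)]
    positions = []
    for term in tokenize(query):
        stem = truncate(term, length)
        if stem not in doc_stems:
            return 0.0
        positions.append(doc_stems.index(stem))
    if not positions:
        return 0.0
    span = max(positions) - min(positions) + 1  # width of the window holding all matches
    return len(positions) / span


# Hypothetical example document and query.
doc = "The system translates foreign language documents quickly."
exact = string_match("translating documents", doc)
score = truncated_proximity_score("translating documents", doc)
```

In this example the verbatim match fails because “translating” never appears in the document, while the truncated match succeeds and is discounted because the two terms sit four positions apart.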

With a known corpus like ABI/INFORM in the early 1980s, a trained searcher testing search systems can craft queries for that known result set. Then as the test queries are fed to the search system, the results can be inspected and analyzed. Running test queries was an important part of our analysis of a candidate search system; for example, the long-gone DIALCOM system or a new incarnation of the European Space Agency’s system. Rigorous testing and analysis make it easy to spot dropped updates or screw ups that routinely find their way into bulk file loads.

Our rule of thumb was that if an ABI/INFORM index contained a term, a high precision result set on SDC ORBIT would include a hit containing that term. If the result set did not contain a match, it was pretty easy to pinpoint where the indexing process started dropping files.
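The rule of thumb translates into a simple regression check. What follows is a sketch, assuming one has a trusted list of index terms with the record IDs known to carry each term and some way to run a query against the system under test. The function name `find_dropped_terms`, the `search` callable, and the sample data are hypothetical. The check is slightly stricter than the rule of thumb: it reports every known record that fails to come back, not just terms with no hit at all.

```python
def find_dropped_terms(index_terms, search):
    """Report known records that a term query fails to return.

    `index_terms` maps each index term to the record IDs known to carry it.
    `search` is any callable that takes a term and returns a list of record IDs.
    The result maps each affected term to its sorted list of missing record IDs.
    """
    dropped = {}
    for term, expected_ids in index_terms.items():
        returned = set(search(term))
        missing = set(expected_ids) - returned
        if missing:
            dropped[term] = sorted(missing)
    return dropped


# Hypothetical known corpus: which records should come back for each term.
known_index = {"mergers": ["r1", "r2"], "pricing": ["r3"]}

# Hypothetical system under test whose bulk load silently dropped record r2.
loaded = {"mergers": ["r1"], "pricing": ["r3"]}


def toy_search(term):
    return loaded.get(term, [])


report = find_dropped_terms(known_index, toy_search)
```

Sorting the missing record IDs by load order is one way to see where the indexing process started dropping files.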

However, when one does not know what’s been indexed, precision drifts into murkier areas. After all, how can one know if a result is on point if one does not know what’s been indexed? One can assume that a result set is relevant via inspection and analysis, but who has time for that today? That’s the danger in defining precision as what the user perceives. The user may not know what he or she is looking for. The user may not know the subject area or the entities associated consistently with the subject area. Should anyone be surprised when the user of a system has no clue what a system output “means”, whether the results are accurate, or whether the content is germane to the user’s understanding of the information needed?

Against this somewhat drab backdrop, the suggestions offered to the LinkedIn person looking for a search engine that delivers precision over non-English content (or, more accurately, content that is not in the primary language of the person doing a search) are revelatory.

Here are some responses I noted:

  • Hire an integrator (Artirix, in this case) and let that person use the open source Lucene based Elasticsearch system to deliver search and retrieval. Sounds simplistic. Yep, it is a simple answer that ignores source language translation, connectors, index updates, and methods for handling the pesky issues related to how language is used. Figuring out what a source document says in a language in which the user is not fluent is fraught with challenges. Forget dictionaries. Think about the content processing pipeline. Search is almost the caboose at the end of a very long train.
  • Use technology from LinguaSys. This is a semantic system that is probably not well known outside of a narrow circle of customers. This is a system with some visibility within the defense sector. Keep in mind that it performs some of the content processing functions. The technology has to be integrated into a suitable information retrieval system. LinguaSys is the equivalent of adding a component to a more comprehensive system. Another person mentioned BASIS Technologies, another company providing multi language components.
  • Rely on LucidWorks. This is an open source search system based on SOLR. The company has spun the management revolving door a number of times.
  • License Dassault’s Exalead system. The idea is worth considering, but how many organizations are familiar with Exalead or willing to embrace the cultural approach of France’s premier engineering firm? After years of effort, Exalead is not widely known in some pretty savvy markets. But the Exalead technology is not 100 percent Exalead. Third party software delivers the goods, so Exalead is an integrator in my view.
  • Embrace the Fast Search & Transfer technology, now incorporated into Microsoft SharePoint. Unmentioned is the fact that Fast Search relied on a herd of human linguists in Germany and elsewhere to keep its 1990s multi lingual system alive and well. Fast Search, like many other allegedly multi lingual systems, relies on rules and these have to be written, tweaked, and maintained.

So what did the LinkedIn member learn? The advice offers one popular approach: Hire an integrator and let that company deliver a “solution.” One can always fire an integrator, sue the integrator, or go to work for the integrator when the CFO tries to cap the cost of a system that must please a user who may not know the meaning of nus in Japanese from a now almost forgotten unit of Halliburton.

The other approach is to go open source. Okay. Do it. But as my analysis of the Danish Library’s open source search initiative in Online suggested, the work is essentially never done. Only a tolerant government and lax budget oversight make this avenue feasible for many organizations with a search “problem.”

The most startling recommendation was to use Fast Search technology. My goodness. Are there not other multi lingual capable search systems dating from the 1990s available? Autonomy, anyone?

Net net: The LinkedIn enterprise search threads often underscore one simple fact:

Enterprise search is assumed to be one system, an app if you will.

One reason for the frequent disappointment with enterprise search is this desire to buy an iPad app, not engineer a constellation of systems that solve quite specific problems.

Stephen E Arnold, November 11, 2014

LinguistNow a Viable Alternative to Google Translate for Oracle Applications

January 17, 2014

The article promoting LinguistNow on the Language I/O website is titled “Fast, Human Translation of CRM Content.” If you are in need of an alternative to Google Translate for Oracle Applications, this is the article for you. LinguistNow is offered as a product suite by Language I/O that is capable of plugging directly into Oracle-RightNow and Salesforce CRM platforms.

The article explains:

“Our CRM-agnostic GoLinguist server can integrate with any CRM that exposes the right set of APIs. We also provide ready-to-deploy LinguistNow add-ins specifically for RightNow CX/CRM. Within Oracle RightNow and Salesforce, LinguistNow allows you to request translations of answer and incident content with the click of a mouse.”

LinguistNow also automates machine and human translation processes for Help Desk and FAQ email content. This method of quickly and automatically exporting and importing translatable content will reduce not only response times for clients but also the risk of human error, which increases with every step that a user must perform manually. The article also includes a user testimonial from the VP of SurveyMonkey, who credits LinguistNow with saving the company tens of thousands and making the company more efficient. Demos and pricing are available through the contact page.

Chelsea Kerwin, January 17, 2014

Sponsored by, developer of Augmentext

WinterGreen Translation Companies and Services

May 29, 2013

We came across a 4,600 word news release about the language translation software market. The study has more than 400 pages and covers a wide range of topics, including mobile phone translation systems. We worked on the Topeka Capital Markets’ Google voice report. We are biased because Google seems to have a significant technology and resource edge. As we worked through the news release we did see a list of the firms which WinterGreen discusses.

A notable translation helper, the Rosetta Stone. A happy quack to the British Museum at

I want to snag the list because it had some surprises as well as both familiar and unfamiliar firms in the inventory. Here’s what I noticed in the news release:

ABBYY Lingvo
Alchemy CATALYST
AppTek HMT (now a unit of SAIC)
Babylon (free)
Bitext
CallMiner
Cloudwords
Cognition Technologies
Duolingo (more of a learning system)
Google (ah, the GOOG)
Hewlett Packard (maybe)
IBM WebSphere Translation Server
Kilgray Translation Technologies
KudoZ
Language Engineering
Language Weaver (now part of SDL)
Lingo24 (an agency)
Lingotek
Lionbridge (crowdsourcing and integrator)
Mission Essential Personnel (humans for rent)
Moravia
MultiCorpora
Nuance
OpenAmplify
Plunet BusinessManager (a management system)
RWS Legal Translation
Reverso (free)
SDL Trados (part of SDL)
Sail Labs
Softissimo (services and software)
Symbio Software
Systran (services and software)
Translators without Borders (humans for rent)
Veveo (more semantics than translation)
Vignette (Open Text)
Word Magic Technology (I could not locate.)
WorldLingo (rent a human)

Of these 30 or so companies, there were some which struck me as a surprise. Hewlett Packard, for example, owns Autonomy. I suppose that other units of Hewlett Packard have translation capabilities, but were these licensed or home grown? Also, the inclusion of Vignette is interesting. I must admit that I don’t hear much about Vignette as a translation system. The list makes translation look robust. The key players boil down to a handful of companies. I did not spot firms in the translation services or software business in China, India, Japan, or Russia, but I may have missed these firms in the WinterGreen news release describing the report.

If you want to buy a copy of the report, which I assume has paragraphs unlike the news release, point your browser at and have your credit card ready. The report is about US$7,500.

Stephen E Arnold, May 29, 2013

Sponsored by Augmentext

Challenging Google Translate: Duolingo

February 4, 2013

A happy quack to the reader who sent me the story “App Seeks to Translate Entire Web.” The article appeared in the February 4, 2013 USA Today, page 3A. This is a publication one of my library contractors calls “McPaper.” Well, if the Economist can use McDonald’s as a method of pegging currency buying power, I am okay with a report about a start up which wants to translate the “entire Web.” The Web is dynamic, so the latency issue is one which someone needs to consider. You can find a version of the story at this link, but I don’t know how long it will be alive.

The idea is that Duolingo is a “massive scale online collaboration” tool. You can do the Rosetta Stone type language learning or you can rely on the system to translate the Web. The Web site for this challenge to Google’s little appreciated online translation capability is The system is linked to Carnegie Mellon University and a wizard (Luis von Ahn) who was a TED speaker. Link here. (To make the short link I used the enjoyable Captcha to key “udoping squ”. Coincidence? If you are a fan of the security feature which requires a user to type letters which are messed up in order to access a Web site, then you will be primed to embrace Duolingo. The inventor of Duolingo worked on the Captcha system.)

Is Google poised to snap up this innovation? Who knows.

Stephen E Arnold, February 4, 2013
