Optical Character Recognition for Less

July 10, 2020

Optical character recognition software was priced high, low, and in between. Sure, the software mostly worked if you like fixing four or five errors per scanned page with 100 words on it. Oh, you use small-sized type? That’s eight to 10 errors per scanned page. Good enough, I suppose.
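For anyone who wants to put a number on those per-page error counts, here is a minimal sketch of a word error rate calculation. It assumes you have a hand-corrected reference text for the scanned page; the function names are made up for illustration and do not come from any OCR product.

```python
def word_errors(reference: str, recognized: str) -> int:
    """Word-level edit distance: substitutions, insertions, and deletions."""
    ref, hyp = reference.split(), recognized.split()
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from OCR output
                            curr[j - 1] + 1,      # extra word in OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, recognized: str) -> float:
    """Errors divided by the number of words in the reference text."""
    total = len(reference.split())
    return word_errors(reference, recognized) / total if total else 0.0
```

On this measure, four or five errors on a 100 word page is a word error rate of 4 to 5 percent.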

You may want to check out EasyOCR, now available via GitHub. The information page says:

Ready-to-use OCR with 40+ languages supported including Chinese, Japanese, Korean and Thai.

Worth a look.

Stephen E Arnold, July 10, 2020

CFO Surprises: Making Smart Software Smarter

April 27, 2020

The Cost of Training NLP Models is a useful summary. However, the write up leaves out some significant costs.

The focus of the paper is to:

review the cost of training large-scale language models, and the drivers of these costs.

The cost factors discussed include:

  • The paradox of compute costs going down yet the cost of processing data goes up—a lot. The reason is that more data are needed and more data can be crunched more quickly. Zoom go the costs.
  • The unknown unknowns associated with processing the appropriate amount of data to make the models work as well as they can
  • The wide use of statistical models which have a voracious appetite for training data.

These are valid points. However, the costs of training include other factors, and these are significant as well; for example:

  1. The direct and indirect costs associated with creating training sets
  2. The personnel costs required to assess and define retraining and the information assembly required for that retraining
  3. The costs of normalizing training corpuses.

More research into the costs of smart software training and tuning is required.

Stephen E Arnold, April 28, 2020


Acquisdata: High Value Intelligence for Financial and Intelligence Analysts

March 31, 2020

Are venture capitalists, investment analysts, and other financial professionals like intelligence officers? The answer, according to James Harker-Mortlock, is, “Yes.”

The reasons, as DarkCyber understands them, are:

  • To be successful, financial professionals have to be data omnivores; that is, they need masses of data, of different types, and from continuously flowing inputs
  • Near real time or real time data streams can make the difference between a profit and a loss
  • Changing work patterns on the trading floor are forcing boutique investment firms and global giants alike to rely upon smart software to provide a competitive edge. These smart systems require data for training machine learning modules.

James Harker-Mortlock, founder of Acquisdata, told DarkCyber:

The need for high-value data from multiple sources in formats easily imported into analytic engines is growing rapidly. Our Acquisdata service provides what the financial analysts and their smart software require. We have numerous quant driven hedge funds downloading all our data every week to assist them in maintaining a comprehensive picture of their target companies and industries.

According to the company’s Web site:

Acquisdata is a fast growing digital financial publishing company. Established in 2010, we have quickly become a provider to the world’s leading financial news companies, including Thomson Reuters/Refinitiv, Bloomberg, Factset, IHS Markit, and Standard and Poor’s Capital IQ, part of McGraw Hill Financial, and ISI Emerging Markets. We also provide content to a range of global academic and business database providers, including EBSCO, ProQuest, OCLC, Research & Markets, CNKI and Thomson Reuters West. We know and understand the electronic publishing business well. Our management has experience in the electronic publishing industry going back 40 years. We aim to provide comprehensive and timely information for investors and others interested in the drivers of the global economy, primarily through our core products, the Industry SnapShot, Company SnapShot and Executive SnapShot products. Our units provide the annual and interim reports of public companies around the world and fundamental research on companies in emerging markets sectors, and aggregated data from third-party sources. In a world where electronic publishing is quickly changing the way we consume news and information, Acquisdata is at the very forefront of providing digital news and content solutions.

DarkCyber was able to obtain one of the firm’s proprietary Acquisdata Industry Snapshots. “United States Armaments, 16 March 2020” provides a digest of information about the US weapons industry. The contents of the 66-page report include news and commentary, selected news releases, research data, industry sector data, and company-specific information.


Obtaining these types of information from many commercial sources poses a problem for a financial professional. Some reports are in Word files; some are in Excel; some are in Adobe PDF image format; and some are in formats proprietary to a data aggregator. Acquisdata provides data in XML which can be easily imported into an analytic system; for example, Palantir’s Metropolis or a similar analytical tool. PDF versions of the more than 100 weekly reports are available.
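To show why XML is easier to ingest than a Word file or a PDF image, here is a minimal sketch that flattens a report into rows an analytic tool could load. DarkCyber has not seen the firm's actual schema, so the element names, attribute names, companies, and figures below are all invented placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: every element, attribute, and value here is an
# illustrative placeholder, not the vendor's real schema or data.
SAMPLE_XML = """
<snapshot industry="Example Industry" date="2020-03-16">
  <company name="Example Corp A">
    <metric name="revenue" unit="USD millions">1200.5</metric>
    <metric name="employees" unit="count">3400</metric>
  </company>
  <company name="Example Corp B">
    <metric name="revenue" unit="USD millions">310.0</metric>
  </company>
</snapshot>
"""


def load_metrics(xml_text: str) -> list:
    """Flatten the nested report into one dictionary per metric."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for company in root.findall("company"):
        for metric in company.findall("metric"):
            rows.append({
                "date": root.get("date"),
                "company": company.get("name"),
                "metric": metric.get("name"),
                "unit": metric.get("unit"),
                "value": float(metric.text),
            })
    return rows
```

The resulting list of dictionaries can be handed to a spreadsheet, a database loader, or a data frame library without any manual rekeying.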

DarkCyber’s reaction to these intelligence “briefs” was positive. The approach is similar to the briefing documents prepared for the White House.

Net net: The service is of high value and warrants a close look for professionals who need current, multi-type data about a range of company and industry investment opportunities.

You can get more information about Acquisdata at www.acquidata.com.

Stephen E Arnold, March 31, 2020

Content for Deep Learning: The Lionbridge View

March 17, 2020

Here is a handy resource. Lionbridge AI shares “The Best 25 Datasets for Natural Language Processing.” The list is designed as a starting point for those just delving into NLP. Writer Meiryum Ali begins:

“Natural language processing is a massive field of research. With so many areas to explore, it can sometimes be difficult to know where to begin – let alone start searching for data. With this in mind, we’ve combed the web to create the ultimate collection of free online datasets for NLP. Although it’s impossible to cover every field of interest, we’ve done our best to compile datasets for a broad range of NLP research areas, from sentiment analysis to audio and voice recognition projects. Use it as a starting point for your experiments, or check out our specialized collections of datasets if you already have a project in mind.”

The suggestions are divided by purpose. For use in sentiment analysis, Ali notes one needs to train machine learning models on large, specialized datasets like the Multidomain Sentiment Analysis Dataset or the Stanford Sentiment Treebank. Some text datasets she suggests for natural language processing tasks like voice recognition or chatbots include 20 Newsgroups, the Reuters News Dataset, and Princeton University’s WordNet. Audio speech datasets that made the list include the audiobooks of LibriSpeech, the Spoken Wikipedia Corpora, and the Free Spoken Digit Dataset. The collection concludes with some more general-purpose datasets, like Amazon Reviews, the Blogger Corpus, the Gutenberg eBooks List, and a set of questions and answers from Jeopardy. See the write-up for more on each of these entries as well as the rest of Ali’s suggestions in each category.

This being a post from Lionbridge, an AI training data firm, it naturally concludes with an invitation to contact them when ready to move beyond these pre-made datasets to one customized for you. Based in Waltham, Massachusetts, the company was founded in 1996 and acquired by H.I.G. Capital in 2017.

Cynthia Murrell, March 17, 2020

IslandInText Reborn: TLDRThis

March 16, 2020

Many years ago (maybe 25+), we tested a desktop summarization tool called IslandInText. [#1 below] I believe, if my memory is working today, this was software developed in Australia by Island Software. There was a desktop version and a more robust system for large-scale summarizing of text. In the 1980s, there was quite a bit of interest in automatic summarization of text. Autonomy’s system could be configured to generate a précis if one was familiar with that system. Google’s basic citation is a modern version of what smart software can do to suggest what’s in a source item. No humans needed, of course. Too expensive and inefficient for the big folks I assume.

For many years, human abstract and indexing professionals were on staff. Our automated systems, despite their usefulness, could not handle nuances, special inclusions in source documents like graphs and tables, list of entities which we processed with the controlled term MANYCOMPANIES, and other specialized functions. I would point out that most of today’s “modern” abstracting and indexing services are simply not as good as the original services like ABI / INFORM, Chemical Abstracts, Engineering Index, Predicasts, and other pioneers in the commercial database sector. (Anyone remember Ev Brenner? That’s what I thought, gentle reader. One does not have to bother oneself with the past in today’s mobile phone search expert world.)

For a number of years, I worked in the commercial database business. In order to speed the throughput of our citations to pharmaceutical, business, and other topic domains, machine text summarization was of interest to me and my colleagues.

A reader informed me that a new service is available. It is called TLDRThis. Here’s what the splash page looks like:


One can paste text or provide a URL, and the system returns a synopsis of the source document. (The advanced service generates a more in-depth summary, but I did not test this. I am not too keen on signing up without knowing what the terms and conditions are.) There is a browser extension for the service. For this URL, the system returned this summary:

Enterprise Search: The Floundering Fish!

Stephen E. Arnold Monitors Search,Content Processing,Text Mining,Related Topics His High-Tech Nerve Center In Rural Kentucky.,He Tries To Winnow The Goose Feathers The Giblets. He Works With Colleagues,Worldwide To Make This Web Log Useful To Those Who Want To Go,Beyond Search . Contact Him At Sa,At,Arnoldit.Com. His Web Site,With Additional Information About Search Is  |    Oct 27, 2011  |  Time Saved: 5 mins

  1. I am thinking about another monograph on the topic of “enterprise search.” The subject seems to be a bit like the motion picture protagonist Jason.
  2. The landscape of enterprise search is pretty much unchanged.
  3. But the technology of yesterday’s giants of enterprise search is pretty much unchanged.
  4. The reality is that the original Big Five had and still have technology rooted in the mid to late 1990s.

We noted several positive functions; for example, identifying the author and providing a synopsis of the source, even the goose feathers’ reference. On the downside, the system missed the main point of the article; that is, enterprise search has been a bit of a chimera for decades. Also, the system ignored the entities (company names) in the write up. These are important in my experience. People search for names, concepts, and events. The best synopses capture some of the entities and tell the reader to get the full list and other information from the source document. I am not sure what to make of TLDRThis’ display of a picture which makes zero sense without the context of the full article. I fed the system a PDF which did not compute, and I tried a bit.ly link which generated a request to refresh the page, not the summary.

To get an “advanced summary”, one must sign up. I did not choose to do that. I have added this site to our “follow” list. I will make a note to try and find out who developed this service.

The pricing ranges from free for basic summarization to $60 per year for Bronze level service. Among its features are 100 summaries per month and “exclusive features”. These are coming soon. The mid-level service is $10 per month. The fee includes 300 summaries a month and “exclusive features.” These are also coming soon. The Platinum service is $20 per month and includes 1,000 summaries per month. These are “better” and will include forthcoming advanced features.

Stay tuned.

[#1] In the early 1990s, search and retrieval was starting to move from the esoteric world of commercial databases to desktop and UNIX machines. IslandSoft, founded in 1993, offered a search and retrieval system. My files from this time revealed that IslandSoft’s description of its system could be reused by today’s search and retrieval marketers. Here’s what IslandSoft said about InText:

IslandInTEXT is a document retrieval and management application for PCs and Unix workstations. IslandInTEXT’s powerful document analysis engine lets users quickly access documents through plain English queries, summarize large documents based on content rather than key words, and automatically route incoming text and documents to user-defined SmartFolders. IslandInTEXT offers the strongest solution yet to help organize and utilize information with large numbers of legacy documents residing on PCs, workstations, and servers as well as the proliferation of electronic mail documents and other data. IslandInTEXT supports a number of popular word processing formats including IslandWrite, Microsoft Word, and WordPerfect plus ASCII text.

IslandInTEXT Includes:

  • File cabinet/file folder metaphor.
  • HTML conversion.
  • Natural language queries for easily locating documents.
  • Relevancy ranking of query results.
  • Document summaries based on statistical relevance from 1 to 99% of the original document—create executive summaries of large documents instantly. [This means that the user can specify how detailed the summarization was; for example, a paragraph or a page or two.]
  • Summary Options. Summaries can be based on key word selection, key word ordering, key sentences, and many more.

[For example:] SmartFolder Routing. Directs incoming text and documents to user-defined folders. Hot Link Pointers. Allow documents to be viewed in their native format without creating copies of the original documents. Heuristic/Learning Architecture. Allows InTEXT to analyze documents according to the author’s style.
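The “1 to 99%” idea in the feature list is easy to sketch. What follows is a generic word-frequency extractive summarizer with a ratio setting. It is an assumption about how such a feature might work, not a reconstruction of InText's engine. The stopword list and function name are made up for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "by", "in", "into", "is", "of", "the", "to", "was"}


def summarize(text: str, ratio: float = 0.25) -> list:
    """Return the highest scoring sentences, in their original order.

    ratio is the share of sentences to keep (0.01 to 0.99).
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if not sentences:
        return []

    def tokens(chunk):
        return [w for w in re.findall(r"[a-z']+", chunk.lower()) if w not in STOPWORDS]

    freq = Counter(tokens(text))

    def score(sentence):
        words = tokens(sentence)
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

A summarizer this simple shares the weakness noted above for TLDRThis: it rewards frequent words and does nothing special for entities such as company names.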

A page for InText is still online as of today at http://www.intext.com/. The company appears to have ceased operations in 2010. Data in my files indicate that the name and possibly the code are owned by CP Software, but I have not verified this. I did not include InText in my first edition of Enterprise Search Report, which I wrote in 2003 and 2004. The company had fallen behind market leaders Autonomy, Endeca, and Fast Search & Transfer.

I am surprised at how many search and retrieval companies today are just traveling along well worn paths in the digital landscape. Does search work? Nope. That’s why there are people who specialize, remember things, and maintain personal files. Mobile device search means precision and recall are digital dodo birds in my opinion.

Stephen E Arnold, March 16, 2020


Venntel: Some Details

February 18, 2020

Venntel in Virginia has the unwanted attention of journalists. The company provides mobile location data and services. Like many of the firms providing specialized services to the US government, Venntel makes an effort to communicate with potential government customers via trade shows, informal gatherings, and referrals.

Venntel’s secret sauce is cleaner mobile data. The company says:

Over 50% of location data is flawed. Venntel’s proprietary platform efficiently distinguishes between erroneous data and data of value. The platform delivers 100% validated data, allowing your team to focus on results – not data quality.
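Venntel's platform is proprietary, and DarkCyber does not know how it works. Purely as an illustration of what “distinguishing erroneous data” can mean, here is a minimal sketch using three generic sanity checks: coordinates in range, no 0,0 placeholder fixes, and no impossible jumps between consecutive fixes. The rules, the speed threshold, and the function names are all assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def filter_fixes(fixes, max_kmh=1000.0):
    """Keep plausible fixes from a time-sorted list of (seconds, lat, lon).

    max_kmh is an arbitrary illustrative ceiling, not an industry standard.
    """
    kept = []
    for ts, lat, lon in fixes:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # coordinates outside the valid range
        if lat == 0 and lon == 0:
            continue  # common placeholder for a missing location
        if kept:
            prev_ts, prev_lat, prev_lon = kept[-1]
            hours = (ts - prev_ts) / 3600
            if hours <= 0:
                continue  # duplicate or out-of-order timestamp
            if haversine_km(prev_lat, prev_lon, lat, lon) / hours > max_kmh:
                continue  # implied speed is implausible
        kept.append((ts, lat, lon))
    return kept
```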


NextGov reported in “Senator Questions DHS’ Use of Cellphone Location Data for Immigration Enforcement” some information about the company; for example:

  • Customers include DHS and CBP
  • Mobile and other sources of location data are available from the company
  • The firm offers software
  • Venntel, like Oracle and other data aggregators, obtains information from third-party sources; for example, marketing companies brokering mobile phone app data

Senator Ed Markey, a Democrat from Massachusetts, has posed questions to the low profile company and has requested answers by March 3, 2020.

A similar issue surfaced for other mobile data specialists. Other geo-analytic specialists work overtime to have zero public facing profile. Example, you ask? Try to chase down information about Geogence. (Bing and Google try their darnedest to change “Geogence” to “geofence.” This is a tribute to the name choice the stakeholders of Geogence have selected, and a clever exploitation of Bing’s and Google’s inept attempts to “help” their users find information.)

If you want to get a sense of what can be done with location data, check out this video which provides information about the capabilities of Maltego, a go-to system to analyze cell phone records and geolocate actions. The video is two years old, but it is representative of the basic functions. Some specialist companies wrap more user friendly interfaces and point-and-click templates for analysts and investigators to use. There are hybrid systems which combine Analyst Notebook type functions with access to email and mobile phone data. Unlike the Watson marketing, IBM keeps these important services in the background because the company wants to focus on the needs of its customers, not on the needs of “real” journalists chasing “real news.”

DarkCyber laments the fact that special services companies which try to maintain a low profile and serve a narrow range of customers are in the news.

Stephen E Arnold, February 18, 2020

Acquiring Data: Addressing a Bottleneck

February 12, 2020

Despite all the advances in automation and digital technology, humans are still required to manually input information into computers. While modern technology makes automation easier than ever, millions of hours are spent on data entry. Artificial intelligence and deep learning could be the key to ending data entry, says the VentureBeat article, “How Rossum Is Using Deep Learning To Extract Data From Any Document.”

Rossum is an AI startup based in Prague, Czech Republic, founded by Tomas Gogar, Tomas Tunys, and Petr Baudis. Rossum was started in 2017 and its client list has grown to include top tier clients: IBM, Box, Siemens, and Bloomberg. Its recent project focuses on using deep learning to end invoice data entry. Instead of relying entirely on optical character recognition (OCR), Rossum uses “cognitive data capture” that trains machines to evaluate documents like a human. Rossum’s cognitive data capture is like an OCR upgrade:

“OCR tools rely on different sets of rules and templates to cover every type of invoice they may come across. The training process can be slow and time-consuming, given that a company may need to create hundreds of new templates and rule sets. In contrast, Rossum said its cloud-based software requires minimal effort to set up, after which it can peruse a document like a human does — regardless of style or formatting — and it doesn’t rely on fully structured data to extract the content companies need. The company also claims it can extract data 6 times faster than with manual entry while saving companies up to 80% in costs.”

Rossum’s cloud-based approach to cognitive data capture differentiates it from similar platforms. Because Rossum does not need on-site installation, all of Rossum’s resources and engineering go directly to client support. It is similar to Salesforce’s software-as-a-service model established in 1999.

The cognitive data capture tool works faster than, and differently from, its predecessors:

“Rossum’s pretrained AI engine can be tried and tested within a couple of minutes of integrating its REST API. As with any self-respecting machine learning system, Rossum’s AI adapts as it learns from customers’ data. Rossum claims an average accuracy rate of around 95%, and in situations where its system can’t identify the correct data fields, it asks a human operator for feedback to improve from.”
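The “ask a human operator” step in that passage is a familiar human-in-the-loop pattern. Here is a minimal sketch of the routing logic, assuming each extracted field arrives with a confidence score. The class, the field names, and the 0.9 threshold are hypothetical and are not taken from Rossum's API.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str          # e.g. "invoice_number" (hypothetical field name)
    value: str
    confidence: float  # model's confidence, from 0.0 to 1.0


def route(fields, threshold=0.9):
    """Split fields into auto-accepted values and names needing human review.

    The threshold is an arbitrary illustrative value.
    """
    accepted = {f.name: f.value for f in fields if f.confidence >= threshold}
    needs_review = [f.name for f in fields if f.confidence < threshold]
    return accepted, needs_review
```

The operator's corrections for the fields in the review list are what a learning system would feed back as new training examples.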

Rossum is not seeking to replace human labor; instead, it wants to free up human time to focus on more complex problems.

Whitney Grace, February 12, 2020

TemaTres: Open Source Indexing Tool Updated

February 11, 2020

Open source software is the foundation for many proprietary software startups, including the open source developers themselves. Most open source software tends to lag in the matter of updates and patches, but TemaTres was recently updated, according to the blog post “TemaTres 3.1 Release Is Out! Open Source Web Tool To Manage Controlled Vocabularies.”

TemaTres is an open source vocabulary server designed to manage controlled vocabularies, taxonomies, and thesauri. The recent update includes the following:

  • “Utility for importing vocabularies encoded in MARC-XML format
  • Utility for the mass export of vocabulary in MARC-XML format
  • New reports about global vocabulary structure (ex: https://r020.com.ar/tematres/demo/sobre.php?setLang=en#global_view)
  • Distribution of terms according to depth level
  • Distribution of sum of preferred terms and the sum of alternative terms
  • Distribution of sum of hierarchical relationships and sum of associative relationships
  • Report about terms with relevant degree of centrality in the vocabulary (according to prototypical conditions)
  • Presentation of terms with relevant degree of centrality in each facet
  • New options to config the presentation of notes: define specific types of note as prominent (the others note types will be presented in collapsed div).
  • Button for Copy to clipboard the terms with indexing value (Copy-one-click button)
  • New user login scheme (login)
  • Allows to config and add Google Analytics tracking code (parameter in config.tematres.php file)
  • Improvements in standard exposure of metadata tags
  • Inclusion of the term notation or code in the search box predictive text
  • Compatibility with PHP 7.2”
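One of the new reports, the distribution of terms according to depth level, is simple to illustrate. The sketch below is written in Python rather than TemaTres's own PHP and makes no claim about how TemaTres computes its report. It assumes a vocabulary stored as a mapping from each term to its broader term; the names are hypothetical.

```python
from collections import Counter


def depth_distribution(broader: dict) -> dict:
    """Count terms at each hierarchy level.

    broader maps every term to its broader (parent) term, or to None
    for a top-level term, which sits at depth 1.
    """
    def depth(term, seen=()):
        parent = broader.get(term)
        if parent is None:
            return 1
        if parent == term or parent in seen:
            raise ValueError(f"cycle in hierarchy at {term!r}")
        return 1 + depth(parent, seen + (term,))

    return dict(Counter(depth(term) for term in broader))
```

A vocabulary with most of its terms at depth 1 or 2 is essentially a flat list, which is the sort of structural fact such a report makes visible.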

TemaTres does updates frequently, but it is monitored. The main ethos about open source is to give back as much as you take. TemaTres appears to follow this modus operandi. If TemaTres wants to promote its web image, the organization should really upgrade its Web site, fix the broken links, and provide more information on what the software actually does.

Whitney Grace, February 11, 2020

Need a Specialized String Matcher for Tracking Entities?

January 21, 2020

Specialized services are available to track strings; for example, the name of an entity (person, place, event), an email handle, or any other string. These services may not be offered to the public. A potential customer has to locate a low profile operation, go through a weird series of interactions, and then work quite hard to get a demo of the super stealthy technology. Once the “I am a legitimate customer” drill is complete, the individual wanting to use the stealthy service has to pay hundreds, thousands, or even more per month. In our DarkCyber video program we have profiled some of these businesses.

No more.


The technology and possibly a massive expansion of monitoring is poised to make tools reserved for government agencies available to anyone with an Internet connection and a credit card. Brandchirps.com provides:

Online reputation management monitoring. The idea is that when the string entered in the standing query service appears, the user will be notified. The company says:

We allow you to input your brand, your name, or other data so you make sure your reputation stays up to date.

The service tracks competitors too. The service is easy to use:

Simply enter your competitor’s names and keep track of what they are doing right, or doing wrong!
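A standing query of this kind is not exotic technology. Here is a minimal sketch of the matching step, assuming the monitored content arrives as (identifier, text) pairs. It is a generic illustration, not Brandchirps' implementation, and the function names are made up.

```python
import re


def build_matcher(keywords):
    """Compile one case-insensitive pattern that matches any whole keyword."""
    # Longest first, so "Example Brand" wins over a shorter overlapping keyword.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"(?<!\w)(?:" + pattern + r")(?!\w)", re.IGNORECASE)


def alerts(keywords, documents):
    """Return (doc_id, matched_keywords) for each document with a hit."""
    if not keywords:
        return []
    matcher = build_matcher(keywords)
    canonical = {k.lower(): k for k in keywords}
    hits = []
    for doc_id, text in documents:
        found = sorted({canonical[m.group(0).lower()] for m in matcher.finditer(text)})
        if found:
            hits.append((doc_id, found))
    return hits
```

The expensive part of a commercial service is collecting the content to scan, not the string matching itself.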

How much does the service cost? Are we talking a letter verifying that you are working for law enforcement or an intelligence agency? A six figure budget? A staff of technologists?


The cost of the service (as of January 20, 2020) is:

  • $7 per month for five keywords
  • $16 per month for 20 keywords

Several observations:

  • The cost for this service which allegedly monitors the Web and social media is very low. Government organizations strapped for cash are likely to check out this service.
  • The system does not cover the Dark Web and other “interesting” content, but that could be changed by licensing data sets from specialists, assuming legal and financial requirements of the Dark Web content aggregators can be negotiated by Brandchirps.
  • It is not clear at this time if the service monitors metadata on images and videos, podcast titles, descriptions, and metadata, or other high-value content.
  • The world of secret monitoring and alerts has become more accessible which can inspire innovators to make use of this tool in novel ways.

Net net: Brandchirps is one more example of a technique once removed from general public access that has lost its mantle of secrecy. Will this type of service force the hand of specialized vendors? Yep.

Stephen E Arnold, January 21, 2020

Why Archived Information Can Be Useful

January 11, 2020

There’s nothing like a ubiquitous service like email and systems for keeping copies of information. Online is interesting and often surprising. This thought struck DarkCyber while reading the Time Magazine article “‘This Airplane Is Designed by Clowns.’ Internal Boeing Messages Describe Efforts to Dodge FAA Scrutiny of MAX.” Here’s the passage of interest:

“This airplane is designed by clowns, who in turn are supervised by monkeys,” said one company pilot in messages to a colleague in 2016, which Boeing disclosed publicly late Thursday.

Will the clowns and monkeys protest?

Another statement which comes directly from the Guide Book for Captain Obvious Rhetoric, which may have influenced this Time Magazine editorial insight:

The communications threaten to upend Boeing’s efforts to rebuild public trust in the 737 Max…

Ah, email and magazines. One good thing, however. No references to AI, NLP, or predictive analytics appear in the write up.

Stephen E Arnold, January 11, 2020
