Acquiring Data: Addressing a Bottleneck

February 12, 2020

Despite all the advances in automation and digital technology, humans are still required to manually input information into computers. While modern technology makes automation easier than ever, millions of hours are still spent on data entry. Artificial intelligence and deep learning could be the key to ending data entry, says the VentureBeat article “How Rossum Is Using Deep Learning To Extract Data From Any Document.”

Rossum is an AI startup based in Prague, Czech Republic, founded in 2017 by Tomas Gogar, Tomas Tunys, and Petr Baudis. Its client list has grown to include top-tier names such as IBM, Box, Siemens, and Bloomberg. Its recent project focuses on using deep learning to end invoice data entry. Instead of relying entirely on optical character recognition (OCR), Rossum uses “cognitive data capture,” which trains machines to evaluate documents the way a human would. Rossum’s cognitive data capture is like an OCR upgrade:

“OCR tools rely on different sets of rules and templates to cover every type of invoice they may come across. The training process can be slow and time-consuming, given that a company may need to create hundreds of new templates and rule sets. In contrast, Rossum said its cloud-based software requires minimal effort to set up, after which it can peruse a document like a human does — regardless of style or formatting — and it doesn’t rely on fully structured data to extract the content companies need. The company also claims it can extract data 6 times faster than with manual entry while saving companies up to 80% in costs.”

Rossum’s cloud-based approach to cognitive data capture differentiates it from similar platforms. Because Rossum does not need on-site installation, all of Rossum’s resources and engineering effort go directly to client support. It is similar to the software-as-a-service model Salesforce established in 1999.

The cognitive data capture tool works faster than, and unlike, its predecessors:

“Rossum’s pretrained AI engine can be tried and tested within a couple of minutes of integrating its REST API. As with any self-respecting machine learning system, Rossum’s AI adapts as it learns from customers’ data. Rossum claims an average accuracy rate of around 95%, and in situations where its system can’t identify the correct data fields, it asks a human operator for feedback to improve from.”
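For developers who want a sense of what that integration step involves, here is a minimal sketch of calling a document-extraction REST API of this general kind. The endpoint, token, and field names are placeholders invented for illustration; they are not Rossum’s documented API.

```python
import requests

# Illustrative sketch only: the endpoint, payload fields, and response shape
# are placeholders, not Rossum's published API.
API_URL = "https://api.example-extraction-service.com/v1/documents"
API_TOKEN = "YOUR_API_TOKEN"  # hypothetical credential

def extract_invoice_fields(pdf_path: str) -> dict:
    """Upload an invoice and return the fields the engine extracted."""
    with open(pdf_path, "rb") as handle:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            files={"document": handle},
            timeout=60,
        )
    response.raise_for_status()
    # Assume the service returns extracted fields as JSON,
    # e.g. {"invoice_number": "...", "total": "...", "due_date": "..."}
    return response.json()

if __name__ == "__main__":
    fields = extract_invoice_fields("sample_invoice.pdf")
    for name, value in fields.items():
        print(f"{name}: {value}")
```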

Rossum is not seeking to replace human labor; instead, it wants to free up human time for more complex problems.

Whitney Grace, February 12, 2020

TemaTres: Open Source Indexing Tool Updated

February 11, 2020

Open source software is the foundation for many proprietary software startups, including some run by the open source developers themselves. Open source software often lags when it comes to updates and patches, but TemaTres was recently updated, according to the blog post “TemaTres 3.1 Release Is Out! Open Source Web Tool To Manage Controlled Vocabularies.”

TemaTres is an open source vocabulary server designed to manage controlled vocabularies, taxonomies, and thesauri. The recent update includes the following:

  • “Utility for importing vocabularies encoded in MARC-XML format
  • Utility for the mass export of vocabulary in MARC-XML format
  • New reports about global vocabulary structure (ex: https://r020.com.ar/tematres/demo/sobre.php?setLang=en#global_view)
  • Distribution of terms according to depth level
  • Distribution of sum of preferred terms and the sum of alternative terms
  • Distribution of sum of hierarchical relationships and sum of associative relationships
  • Report about terms with relevant degree of centrality in the vocabulary (according to prototypical conditions)
  • Presentation of terms with relevant degree of centrality in each facet
  • New options to config the presentation of notes: define specific types of note as prominent (the others note types will be presented in collapsed div).
  • Button for Copy to clipboard the terms with indexing value (Copy-one-click button)
  • New user login scheme (login)
  • Allows to config and add Google Analytics tracking code (parameter in config.tematres.php file)
  • Improvements in standard exposure of metadata tags
  • Inclusion of the term notation or code in the search box predictive text
  • Compatibility with PHP 7.2”

TemaTres may not update frequently, but the project is monitored. The main ethos of open source is to give back as much as you take, and TemaTres appears to follow this modus operandi. If TemaTres wants to promote its Web image, however, the organization should upgrade its Web site, fix the broken links, and provide more information on what the software actually does.

Whitney Grace, February 11, 2020

Need a Specialized String Matcher for Tracking Entities?

January 21, 2020

Specialized services are available to track strings; for example, the name of an entity (person, place, event), an email handle, or any other string. These services may not be offered to the public. A potential customer has to locate a low profile operation, go through a weird series of interactions, and then work quite hard to get a demo of the super stealthy technology. Once the “I am a legitimate customer” drill is complete, the individual wanting to use the stealthy service has to pay hundreds, thousands, or even more per month. In our DarkCyber video program we have profiled some of these businesses.

No more.


The technology, and possibly a massive expansion of monitoring, is poised to make tools once reserved for government agencies available to anyone with an Internet connection and a credit card. Brandchirps.com provides:

Online reputation management monitoring. The idea is that when the string entered in the standing query service appears, the user will be notified. The company says:

We allow you to input your brand, your name, or other data so you make sure your reputation stays up to date.

The service tracks competitors too. The service is easy to use:

Simply enter your competitor’s names and keep track of what they are doing right, or doing wrong!
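For readers curious how a standing-query alert like this works under the hood, here is a minimal sketch that scans a generic RSS feed for watched keywords. The feed URL, keywords, and alert step are invented for illustration and do not reflect Brandchirps’ actual implementation.

```python
import feedparser  # third-party: pip install feedparser

# Hypothetical watch list and feed; Brandchirps' own implementation is not public.
KEYWORDS = {"acme widgets", "jane doe"}
FEED_URL = "https://news.example.com/rss"  # placeholder feed

def check_feed_for_keywords(feed_url: str, keywords: set) -> list:
    """Return feed entries whose title or summary mentions a watched keyword."""
    matches = []
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        text = f"{entry.get('title', '')} {entry.get('summary', '')}".lower()
        hits = [kw for kw in keywords if kw in text]
        if hits:
            matches.append({"link": entry.get("link"), "keywords": hits})
    return matches

if __name__ == "__main__":
    for match in check_feed_for_keywords(FEED_URL, KEYWORDS):
        # A real service would email or push-notify the subscriber here.
        print(f"Alert: {match['keywords']} -> {match['link']}")
```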

How much does the service cost? Are we talking a letter verifying that you are working for law enforcement or an intelligence agency? A six-figure budget? A staff of technologists?

Nope.

The cost of the service (as of January 20, 2020) is:

  • $7 per month for five keywords
  • $16 per month for 20 keywords

Several observations:

  • The cost for this service, which allegedly monitors the Web and social media, is very low. Government organizations strapped for cash are likely to check out this service.
  • The system does not cover the Dark Web and other “interesting” content, but that could change if Brandchirps licenses data sets from specialists, assuming the legal and financial requirements of the Dark Web content aggregators can be negotiated.
  • It is not clear at this time if the service monitors metadata on images and videos; podcast titles, descriptions, and metadata; or other high-value content.
  • The world of secret monitoring and alerts has become more accessible, which can inspire innovators to make use of this tool in novel ways.

Net net: Brandchirps is one more example of a technique once removed from general public access that has lost its mantle of secrecy. Will this type of service force the hand of specialized vendors? Yep.

Stephen E Arnold, January 21, 2020

Why Archived Information Can Be Useful

January 11, 2020

There is nothing like a ubiquitous service such as email, combined with systems that keep copies of information, to make what turns up online interesting and often surprising. This thought struck DarkCyber while reading the Time Magazine article “‘This Airplane Is Designed by Clowns.’ Internal Boeing Messages Describe Efforts to Dodge FAA Scrutiny of MAX.” Here’s the passage of interest:

“This airplane is designed by clowns, who in turn are supervised by monkeys,” said one company pilot in messages to a colleague in 2016, which Boeing disclosed publicly late Thursday.

Will the clowns and monkeys protest?

Another statement comes directly from the Guide Book for Captain Obvious Rhetoric and may have influenced this Time Magazine editorial insight:

The communications threaten to upend Boeing’s efforts to rebuild public trust in the 737 Max…

Ah, email and magazines. One good thing, however. No references to AI, NLP, or predictive analytics appear in the write up.

Stephen E Arnold, January 11, 2020

Linguistics: Becoming Useful to Regular People?

January 8, 2020

Now here is the linguistic reference app I have been waiting for: IDEA’s “In Other Words.” Finally, an online resource breaks the limiting patterns left over from book-based resources like traditional dictionaries and thesauri. The app puts definitions into context by supplying real-world examples from both fiction and nonfiction works of note from the 20th and 21st centuries. It also lets users explore several types of linguistic connections. Not surprisingly, this thoroughly modern approach leverages a combination of artificial and human intelligence. Here is how they did it:

“Building on the excellent definitions written by the crowd-sourced editors at Wiktionary, IDEA’s lexicographic team wrote more than 2,700 short, digestible definitions for all common words, including ‘who,’ ‘what,’ and ‘the.’ For over 100k other words that also have Wikipedia entries, we included a snippet of the article as well. To power the app, our team created the IDEA Linguabase, a database of word relationships built on an analysis of various published and open source dictionaries and thesauri, an artificial intelligence analysis of a large corpus of published content, and original lexicographic work. Our app offers relationships for over 300,000 terms and presents over 60 million interrelationships. These include close relationships, such as synonyms, as well as broader associations and thousands of interesting lists, such as types of balls, types of insects, words for nausea, and kinds of needlework. Additionally, the app has extensive information on word families (e.g., ‘jump,’ ‘jumping’) and common usage (‘beautiful woman’ vs. ‘handsome man’), revealing words that commonly appear before or after a word in real use. In Other Words goes beyond the traditional reference text by allowing users to explore interesting facts about words and wordplay, such as common letter patterns and phonetics/rhymes.”
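To make the idea of a word-relationship store concrete, here is a toy sketch of typed term relationships of the sort the Linguabase description suggests. The entries and relation labels are invented; this is not IDEA’s actual data model.

```python
from collections import defaultdict
from typing import Optional

# Toy illustration of a typed word-relationship store, loosely modeled on the
# description of the IDEA Linguabase above; the entries and relation labels
# are invented, not taken from the actual product.
relations = defaultdict(list)

def add_relation(word: str, related: str, relation_type: str) -> None:
    relations[word].append((related, relation_type))

add_relation("jump", "leap", "synonym")
add_relation("jump", "jumping", "word_family")
add_relation("ball", "baseball", "member_of_list:types_of_balls")

def related_words(word: str, relation_type: Optional[str] = None) -> list:
    """Return related terms, optionally filtered by relation type."""
    return [w for w, rel in relations[word]
            if relation_type is None or rel == relation_type]

print(related_words("jump"))             # ['leap', 'jumping']
print(related_words("jump", "synonym"))  # ['leap']
```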

The team has endeavored to give us an uncluttered, intuitive UI that makes it quick to look up a word and easy to follow a chain of meanings and associations. Users can also save and share what they have found across devices. Be warned, though—In Other Words does not shy away from salty language; it even points out terms that were neutral in one time period and naughty in another. (They will offer a sanitized version for families and schools.) They say the beta version is coming soon and will be priced at $4.99, or $25 with a custom tutorial. We look forward to it.

Cynthia Murrell, January 8, 2020

Megaputer Spans Text Analysis Disciplines

January 6, 2020

What exactly do we mean by “text analysis”? That depends entirely on the context. Megaputer shares a useful list of the most popular types in its post, “What’s in a Text Analysis Tool?” The introduction explains:

“If you ask five different people, ‘What does a Text Analysis tool do?’, it is very likely you will get five different responses. The term Text Analysis is used to cover a broad range of tasks that include identifying important information in text: from a low, structural level to more complicated, high-level concepts. Included in this very broad category are also tools that convert audio to text and perform Optical Character Recognition (OCR); however, the focus of these tools is on the input, rather than the core tasks of text analysis. Text Analysis tools not only perform different tasks, but they are also targeted to different user bases. For example, the needs of a researcher studying the reactions of people on Twitter during election debates may require different Text Analysis tasks than those of a healthcare specialist creating a model for the prediction of sepsis in medical records. Additionally, some of these tools require the user to have knowledge of a programming language like Python or Java, whereas other platforms offer a Graphical User Interface.”

The list begins with two of the basics—Part-of-Speech (POS) Taggers and Syntactic Parsing. These tasks usually underpin more complex analysis. Concordance or Keyword tools create alphabetical lists of a text’s words and put them into context. Text Annotation Tools, either manual or automated, tag parts of a text according to a designated schema or categorization model, while Entity Recognition Tools often use knowledge graphs to identify people, organizations, and locations. Topic Identification and Modeling Tools derive emerging themes or high-level subjects using text-clustering methods. Sentiment Analysis Tools diagnose positive and negative sentiments, some with more refinement than others. Query Search Tools let users search text for a word or a phrase, while Summarization Tools pick out and present key points from lengthy texts (provided they are well organized). See the article for more on any of these categories.
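Several of these tasks map onto familiar open source tooling. As a rough illustration (not a Megaputer product), the sketch below uses spaCy to run three of them on a sample sentence: POS tagging, entity recognition, and a crude keyword-in-context concordance.

```python
import spacy  # pip install spacy; python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
text = ("Megaputer, based in Bloomington, Indiana, offers the PolyAnalyst "
        "platform for text analysis.")
doc = nlp(text)

# Part-of-speech tagging: the low-level layer most other tasks build on.
for token in doc:
    print(token.text, token.pos_)

# Entity recognition: people, organizations, and locations.
for ent in doc.ents:
    print(ent.text, ent.label_)

# A crude concordance: show each occurrence of a keyword with a context window.
def concordance(doc, keyword: str, window: int = 3):
    for i, token in enumerate(doc):
        if token.text.lower() == keyword.lower():
            left = doc[max(0, i - window):i]
            right = doc[i + 1:i + 1 + window]
            print(f"...{left.text} [{token.text}] {right.text}...")

concordance(doc, "text")
```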

The post concludes by noting that most text analysis platforms offer one or two of the above functions, but that users often require more than that. This is where the article shows its PR roots—Megaputer, as it happens, offers just such an all-in-one platform called PolyAnalyst. Still, the write-up is a handy rundown of some different text-analysis tasks.

Based in Bloomington, Indiana, Megaputer launched in 1997. The company grew out of AI research at Moscow State University and Bauman Technical University. Just a few of its many prominent clients include HP, Johnson & Johnson, American Express, and several US government offices.

Cynthia Murrell, January 02, 2020

Cambridge Analytica: Maybe a New Name and Some of the Old Methods?

December 29, 2019

DarkCyber spotted an interesting factoid in “HH Plans to Work with the Re-Branded Cambridge Analytica to Influence 2021 Elections.”

The new company, Auspex International, will keep former Cambridge Analytica director Mark Turnbull at the helm.

Who is HH? He is Hakainde Hichilema, at the time of writing an opposition leader and presidential hopeful in Zambia.

The business focus of Auspex is, according to the write up:

We’re not a data company, we’re not a political consultancy, we’re not a research company and we’re not necessarily just a communications company. We’re a combination of all four.—Ahmad Al-Khatib, a Cairo-born investor

You can obtain some information about Auspex at this url: https://www.auspex.ai/.

DarkCyber noted the use of the “ai” domain. See the firm’s “What We Believe” information at this link. It is good to have a reason to get out of bed in the morning.

Stephen E Arnold, December 29, 2019

Insight from a Microsoft Professional: Susan Dumais

December 1, 2019

Dr. Susan Dumais is a Microsoft Technical Fellow and Deputy Lab Director of MSR AI. She knows that search has evolved from discovering information to getting tasks done. In order to accomplish tasks, search queries are fundamental, and they are rooted in people’s information needs. The Microsoft Research Podcast interviewed Dr. Dumais in the episode “HCI, IR, And The Search For Better Search With Dr. Susan Dumais.”

Dr. Dumais shared that much of her work on search stems from frustrations she encountered in her own life, including trouble learning the Unix OS and vast amounts of spam. At the beginning of the podcast, she runs through the history of search and how it has changed in the past twenty years. Search has become more intuitive, especially given the work Dr. Dumais did on providing context to search.

“Host: Context in anything makes a difference with language and this is integrally linked to the idea of personalization, which is a buzz word in almost every area of computer science research these days: how can we give people a “valet service” experience with their technical devices and systems? So, tell us about the technical approaches you’ve taken on context in search, and how they’ve enabled machines to better recognize or understand the rich contextual signals, as you call them, that can help humans improve their access to information?

Susan Dumais: If you take a step back and consider what a web search engine is, it’s incredibly difficult to understand what somebody is looking for given, typically, two to three words. These two to three words appear in a search box and what you try to do is match those words against billions of documents. That’s a really daunting challenge. That challenge becomes a little easier if you can understand things about where the query is coming from. It doesn’t fall from the sky, right? It’s issued by a real live human being. They have searched for things in the longer term, maybe more acutely in the current session. It’s situated in a particular location in time. All of those signals are what we call context that help understand why somebody might be searching and, more importantly, what you might do to help them, what they might mean by that. You know, again, it’s much easier to understand queries if you have a little bit of context about it.”
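The general idea Dr. Dumais describes, treating a searcher’s recent activity as a contextual signal, can be illustrated with a toy re-ranking sketch. The scoring below is invented for demonstration and is not Microsoft’s algorithm.

```python
# Toy illustration of context-aware re-ranking, not Microsoft's algorithm.
# Each result gets a base relevance score plus a boost when its terms overlap
# with the user's recent queries (a simple stand-in for "contextual signals").

def contextual_rerank(results, recent_queries, boost=0.5):
    context_terms = {t.lower() for q in recent_queries for t in q.split()}
    reranked = []
    for title, base_score in results:
        overlap = len({t.lower() for t in title.split()} & context_terms)
        reranked.append((title, base_score + boost * overlap))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)

results = [("Jaguar car dealership", 1.0), ("Jaguar big cat facts", 1.0)]
recent_queries = ["wildlife documentaries", "big cat conservation"]

for title, score in contextual_rerank(results, recent_queries):
    print(f"{score:.1f}  {title}")  # the "big cat" result now ranks first
```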

Dr. Dumais has a practical approach to making search work for the average user. It is the accumulation of everyday tasks that shapes how search works and what it can do. She is an enlightened technical expert who understands the perspective of the end user.

Whitney Grace, November 30, 2019

Google Trends Used to Reveal Misspelled Wirds or Is It Words?

November 25, 2019

We spotted a listing of the most misspelled words in each of the USA’s 50 states. Too bad Puerto Rico. Kentucky’s most misspelled word is “ninety.” Navigate to Considerable and learn what residents cannot spell. How often? Silly kweston.

The listing includes some bafflers and may reveal what can go wrong with data from an online ad sales data collection system; for example:

  • Washington, DC (which is not a state in DarkCyber’s book) cannot spell “enough”; for example, “enuf already with these televised hearings and talking heads”
  • Idaho residents cannot spell embarrassed, which as listeners to Kara Swisher know has two r’s and two s’s. Helpful that.
  • Montana residents cannot spell “comma.” Do those in Montana use commas?
  • And not surprisingly, those in Tennessee cannot spell “intelligent.” Imagine that!

What happens if one trains smart software on these data?

Sumthink mite go awf the railz.

Stephen E Arnold, November 25, 2019

Info Extraction: Improving?

November 21, 2019

Information extraction (IE) is key to machine learning and artificial intelligence (AI), especially for natural language processing (NLP). The problem with information extraction is that while information is pulled from datasets, it often lacks context; thus it fails to properly categorize and rationalize the data. Good Men Project shares some hopeful news for IE in the article “Measuring Without Labels: A Different Approach To Information Extraction.”

Current IE relies on an AI programmed with a specific schema that states what information needs to be extracted. A retail Web site like Amazon probably uses an IE AI programmed to extract product names, UPCs, and prices, while a travel Web site like Kayak uses an IE AI to find prices, airlines, dates, and hotel names. For law enforcement officials, it is particularly difficult to design schema for human trafficking because datasets on that subject do not exist. Traditional IE methods, such as crowdsourcing, also do not work due to the sensitivity of the subject.

In order to create a reliable human trafficking dataset and prove its worth, the researchers examined the dependencies between IE extractions. A dependency works as follows:

“Consider the network illustrated in the figure above. In this kind of network, called attribute extraction network (AEN), we model each document as a node. An edge exists between two nodes if their underlying documents share an extraction (in this case, names). For example, documents D1 and D2 are connected by an edge because they share the extraction ‘Mayank.’ Note that constructing the AEN only requires the output of an IE, not a gold standard set of labels. Our primary hypothesis in the article was that, by measuring network-theoretic properties (like the degree distribution, connectivity etc.) of the AEN, correlations would emerge between these properties and IE performance metrics like precision and recall, which require a sufficiently large gold standard set of IE labels to compute. The intuition is that IE noise is not random noise, and that the non-random nature of IE noise will show up in the network metrics. Why is IE noise non-random? We believe that it is due to ambiguity in the real world over some terms, but not others.”
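To make the construction concrete, here is a minimal sketch of an attribute extraction network built with networkx. The documents and extracted names are invented examples following the D1/D2 illustration above.

```python
import networkx as nx  # pip install networkx

# Toy extractor output: document id -> set of extracted names (invented data).
extractions = {
    "D1": {"Mayank", "Chicago"},
    "D2": {"Mayank"},
    "D3": {"Chicago", "Elena"},
    "D4": {"Elena"},
}

# Build the attribute extraction network (AEN): documents are nodes, and an
# edge connects two documents whenever they share at least one extraction.
aen = nx.Graph()
aen.add_nodes_from(extractions)
docs = list(extractions)
for i, d1 in enumerate(docs):
    for d2 in docs[i + 1:]:
        shared = extractions[d1] & extractions[d2]
        if shared:
            aen.add_edge(d1, d2, shared=sorted(shared))

# Network-theoretic properties the article correlates with IE precision/recall.
print("Degree distribution:", nx.degree_histogram(aen))
print("Connected components:", [sorted(c) for c in nx.connected_components(aen)])
```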

Using the attributes of names, phone numbers, and locations, correlations were discovered. Measuring these dependencies creates a new methodology for evaluating IE systems. Network science typically relies on non-abstract interactions to test IE, but the AEN is an abstract network of IE interactions. The mistakes, in fact, allow law enforcement to use IE AI to acquire the desired information without needing a practice dataset.

Whitney Grace, November 21, 2019
