Fixing Language: No Problem

August 7, 2020

Many years ago I studied with a fellow who was the world’s expert on the morpheme –burger. Yep, hamburger, cheeseburger, dumbburger, nothingburger, and so on. He was Dr. Lev Sudek. (I think that was his last name, but after 50 years former teachers blur in my mind like a smidgen of mustard on a stupidburger.) I do recall his lecture on Indo-European languages, the importance of Sanskrit, and the complexity of Lithuanian nouns. (Why Lithuanian? Many, many inflections.) Those languages evolving or de-volving from Sanskrit or ur-Sanskrit differentiated among male, female, singular, neuter, plural, and others. I am thinking 16 for nouns, but again I am blurring the Sriracha on the Incredible burger.

This morning, as I wandered past the Memoryburger Restaurant, I spotted “These Are the Most Gender-Biased Languages in the World (Hint: English Has a Problem).” The write up points out that Carnegie Mellon analyzed languages and created a list of biased languages. What are the languages with an implicit problem regarding bias? Here is a list of the top 10 gender abusing, sexist pig languages:

  1. Danish
  2. German
  3. Norwegian
  4. Dutch
  5. Romanian
  6. English
  7. Hebrew
  8. Swedish
  9. Mandarin
  10. Persian

English is number 6, and if I understand Fast Company’s headline, English has a problem. Apparently Chinese and Persian do too, but the write up tiptoes around these linguistic land mines. Go with the Covid ridden, socially unstable, and financially stressed English speakers. Yes, ignore the Danes, the Germans, Norwegians, Dutch, and Romanians.

So what’s the fix for the offensive English speakers? The write up dodges this question, narrowing to algorithmic bias. I learned:

The implications are profound: This may partially explain where some early stereotypes about gender and work come from. Children as young as 2 exercise these biases, which cannot be explained by kids’ lived experiences (such as their own parents’ jobs, or seeing, say, many female nurses). The results could also be useful in combating algorithmic bias.

Profound indeed. But the French have a simple, logical, and “c’est top” solution: The Académie Française. This outfit is the reason why an American draws a sneer when asking where the computer store is in Nimes. The Académie Française does not want anyone trying to speak French to use a disgraced term like computer.

How’s that working out? Hashtag and Franglish are chugging right along. That means that legislating language is not getting much traction. You can read a 290 page dissertation about the dust up. Check out “The Non Sexist Language Debate in French and English.” A real thriller.

The likelihood of enforcing specific language and usage changes on the 10 worst offenders strikes me as slim. Language changes, and I am not sure the morpheme –burger expert understood decades ago how politicallycorrectburgers could fit into an intellectual menu.

Stephen E Arnold, August 7, 2020

Tick Tock Becomes Tit for Tat: The Apple and Xiao-i Issue

August 5, 2020

Okay, let’s get the company names out of the way:

  • Shanghai Zhizhen Network Technology Company is known as Zhizhen
  • Zhizhen is also known as Xiao-i
  • Apple is the outfit with the virtual assistant Siri.

Zhizhen owns a patent for a virtual assistant. In 2013, Apple was sued for violating a Chinese patent. Apple let loose a flock of legal eagles to demonstrate that its patents were in force and that a Chinese voice recognition patent was invalid. The Chinese court denied Apple’s argument.

Tick tock tick tock went the clock. Then the alarm sounded. Xiao-i owns the Chinese patent, and that entity is suing Apple.

“Apple Faces $1.4B Suit from Chinese AI Company” reports:

Shanghai Zhizhen Network Technology Co. said in a statement on Monday it was suing Apple for an estimated 10 billion yuan ($1.43 billion) in damages in a Shanghai court, alleging the iPhone and iPad maker’s products violated a patent the Chinese company owns for a virtual assistant whose technical architecture is similar to Siri. Siri, a voice-activated function in Apple’s smartphones and laptops, allows users to dictate text messages or set alarms on their devices.

But more than the money, the Xiao-i outfit “asked Apple to stop sales, production, and the use of products flouting such a patent.”

Coincidence? Maybe. The US wants to curtail TikTok, and now Xiao-i wants to put a crimp in Apple’s China revenues.

Several observations:

  • More trade related issues are likely
  • Intellectual property disputes will become more frequent. China will use its patents to inhibit American business. This is a glimpse of a future in which the loss of American knowledge value will add friction to the US activities
  • Downstream consequences are likely to ripple through non-Chinese suppliers of components and services to Apple. China is using Apple to make a point about the value of Chinese intellectual property and the influence of today’s China.

Just as China has asserted its cyber capabilities, the Apple patent dispute, regardless of its outcome, is another example of China understanding American tactics, modifying them, and using them to try to gain increased economic, technical, and financial advantage.

Stephen E Arnold, August 3, 2020

Natural Language Processing: Useful Papers Selected by an Informed Human

July 28, 2020

Nope, no artificial intelligence involved in this curated list of papers from a recent natural language conference. Ten papers are available with a mouse click. Quick takeaway: Adversarial methods seem to be a hot ticket. Navigate to “The Ten Must Read NLP/NLU Papers from the ICLR 2020 Conference.” Useful editorial effort and a clear, adult presentation of the bibliographic information. Kudos to jakubczakon.

Stephen E Arnold, July 27, 2020

Optical Character Recognition for Less

July 10, 2020

Optical character recognition software was priced high, low, and in between. Sure, the software mostly worked if you like fixing four or five errors per scanned page with 100 words on it. Oh, you use small sized type. That’s eight to 10 errors per scanned page. Good enough I suppose.
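
To put those error counts in perspective, here is a quick back-of-the-envelope sketch. It assumes each error spoils exactly one word, which is my simplification and not a claim from any OCR vendor. The function name is made up for the illustration.

```python
def word_accuracy(errors_per_page: int, words_per_page: int = 100) -> float:
    """Share of words read correctly, assuming one error ruins one word."""
    return 1 - errors_per_page / words_per_page

# Error counts mentioned above: 4 to 5 for normal type, 8 to 10 for small type.
for errors in (4, 5, 8, 10):
    print(f"{errors} errors per 100 words: {word_accuracy(errors):.0%} word accuracy")
```

Under that assumption, "mostly worked" means roughly one word in 20 needs a human fix for normal type and as many as one in 10 for small type.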

You may want to check out EasyOCR, now available via GitHub. The information page says:

Ready-to-use OCR with 40+ languages supported including Chinese, Japanese, Korean and Thai.

Worth a look.

Stephen E Arnold, July 10, 2020

CFO Surprises: Making Smart Software Smarter

April 27, 2020

The Cost of Training NLP Models is a useful summary. However, the write up leaves out some significant costs.

The paper sets out to:

review the cost of training large-scale language models, and the drivers of these costs.

The cost factors discussed include:

  • The paradox of compute costs going down yet the cost of processing data goes up—a lot. The reason is that more data are needed and more data can be crunched more quickly. Zoom go the costs.
  • The unknown unknowns associated with processing the appropriate amount of data to make the models work as well as they can
  • The wide use of statistical models which have a voracious appetite for training data.
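
The first point, cheaper compute but a bigger bill, is easy to see with a toy calculation. This is a minimal sketch; every number in it is hypothetical and invented only to show the shape of the problem. None of the figures comes from the paper.

```python
def training_cost(tokens: float, cost_per_million_tokens: float) -> float:
    """Total processing cost = data volume times unit price."""
    return tokens / 1_000_000 * cost_per_million_tokens

# Hypothetical figures: the unit price drops 10x while the data volume grows 100x.
earlier = training_cost(tokens=1e9, cost_per_million_tokens=1.00)
later = training_cost(tokens=1e11, cost_per_million_tokens=0.10)

print(f"earlier run: ${earlier:,.0f}")
print(f"later run:   ${later:,.0f}")
print(f"total bill changed by a factor of {later / earlier:.0f}")
```

As long as data volume grows faster than unit prices fall, the total goes up. That is the paradox in one line of arithmetic.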

These are valid points. However, the costs of training include other factors, and these are significant as well; for example:

  1. The direct and indirect costs associated with creating training sets
  2. The personnel costs required to assess and define retraining and the information assembly required for that retraining
  3. The costs of normalizing training corpora.

More research into the costs of smart software training and tuning is required.

Stephen E Arnold, April 28, 2020

 

Acquisdata: High Value Intelligence for Financial and Intelligence Analysts

March 31, 2020

Are venture capitalists, investment analysts, and other financial professionals like intelligence officers? The answer, according to James Harker-Mortlock, is, “Yes.”

The reasons, as DarkCyber understands them, are:

  • To be successful, financial professionals have to be data omnivores; that is, they need masses of data, different types of data, and continuously flowing inputs
  • The need for near real time or real time data streams can make the difference between making a profit and taking a loss
  • Changing work patterns on the trading floor are forcing even boutique investment firms and global giants to rely upon smart software to provide a competitive edge. These smart systems require data for training machine learning modules.

James Harker-Mortlock, founder of Acquisdata, told DarkCyber:

“The need for high-value data from multiple sources in formats easily imported into analytic engines is growing rapidly. Our Acquisdata service provides what the financial analysts and their smart software require. We have numerous quant driven hedge funds downloading all our data every week to assist them in maintaining a comprehensive picture of their target companies and industries.”

According to the company’s Web site:

Acquisdata is a fast growing digital financial publishing company. Established in 2010, we have quickly become a provider to the world’s leading financial news companies, including Thomson Reuters/Refinitiv, Bloomberg, Factset, IHS Markit, and Standard and Poor’s Capital IQ, part of McGraw Hill Financial, and ISI Emerging Markets. We also provide content to a range of global academic and business database providers, including EBSCO, ProQuest, OCLC, Research & Markets, CNKI and Thomson Reuters West. We know and understand the electronic publishing business well. Our management has experience in the electronic publishing industry going back 40 years. We aim to provide comprehensive and timely information for investors and others interested in the drivers of the global economy, primarily through our core products, the Industry SnapShot, Company SnapShot and Executive SnapShot products. Our units provide the annual and interim reports of public companies around the world and fundamental research on companies in emerging markets sectors, and aggregated data from third-party sources. In a world where electronic publishing is quickly changing the way we consume news and information, Acquisdata is at the very forefront of providing digital news and content solutions.

DarkCyber was able to obtain one of the firm’s proprietary Acquisdata Industry Snapshots. “United States Armaments, 16 March 2020” provides a digest of information about the US weapons industry. The contents of the 66 page report include news and commentary, selected news releases, research data, industry sector data, and company-specific information.

Obtaining these types of information from many commercial sources poses a problem for a financial professional. Some reports are in Word files; some are in Excel; some are in Adobe PDF image format; and some are in formats proprietary to a data aggregator. Acquisdata provides data in XML which can be easily imported into an analytic system; for example, Palantir’s Metropolis or a similar analytical tool. PDF versions of the more than 100 weekly reports are available.
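
Why does XML matter? Because a few lines of code can turn a report into rows an analytic tool can ingest. The sketch below uses Python's standard library. The element and attribute names are hypothetical; the actual Acquisdata schema is not described here, so treat the layout as an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout, invented for illustration only.
SAMPLE_XML = """
<report title="United States Armaments" date="2020-03-16">
  <item type="news"><headline>Example headline one</headline></item>
  <item type="research"><headline>Example headline two</headline></item>
</report>
"""

def load_items(xml_text: str) -> list[dict]:
    """Flatten each <item> into a plain dict that downstream tools can load."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "report": root.get("title"),
            "date": root.get("date"),
            "type": item.get("type"),
            "headline": item.findtext("headline"),
        }
        for item in root.iter("item")
    ]

for row in load_items(SAMPLE_XML):
    print(row)
```

Try doing that with a PDF image file. The point is not the particular schema but that structured markup removes the manual rekeying step.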

DarkCyber’s reaction to these intelligence “briefs” was positive. The approach is similar to the briefing documents prepared for the White House.

Net net: The service is of high value and warrants a close look for professionals who need current, multi-type data about a range of company and industry investment opportunities.

You can get more information about Acquisdata at www.acquidata.com.

Stephen E Arnold, March 31, 2020

Content for Deep Learning: The Lionbridge View

March 17, 2020

Here is a handy resource. Lionbridge AI shares “The Best 25 Datasets for Natural Language Processing.” The list is designed as a starting point for those just delving into NLP. Writer Meiryum Ali begins:

“Natural language processing is a massive field of research. With so many areas to explore, it can sometimes be difficult to know where to begin – let alone start searching for data. With this in mind, we’ve combed the web to create the ultimate collection of free online datasets for NLP. Although it’s impossible to cover every field of interest, we’ve done our best to compile datasets for a broad range of NLP research areas, from sentiment analysis to audio and voice recognition projects. Use it as a starting point for your experiments, or check out our specialized collections of datasets if you already have a project in mind.”

The suggestions are divided by purpose. For use in sentiment analysis, Ali notes one needs to train machine learning models on large, specialized datasets like the Multidomain Sentiment Analysis Dataset or the Stanford Sentiment Treebank. Some text datasets she suggests for natural language processing tasks like voice recognition or chatbots include 20 Newsgroups, the Reuters News Dataset, and Princeton University’s WordNet. Audio speech datasets that made the list include the audiobooks of LibriSpeech, the Spoken Wikipedia Corpora, and the Free Spoken Digit Dataset. The collection concludes with some more general-purpose datasets, like Amazon Reviews, the Blogger Corpus, the Gutenberg eBooks List, and a set of questions and answers from Jeopardy. See the write-up for more on each of these entries as well as the rest of Ali’s suggestions in each category.

This being a post from Lionbridge, an AI training data firm, it naturally concludes with an invitation to contact them when ready to move beyond these pre-made datasets to one customized for you. Based in Waltham, Massachusetts, the company was founded in 1996 and acquired by H.I.G. Capital in 2017.

Cynthia Murrell, March 17, 2020

IslandInText Reborn: TLDRThis

March 16, 2020

Many years ago (maybe 25+), we tested a desktop summarization tool called IslandInText. [#1 below] I believe, if my memory is working today, this was software developed in Australia by Island Software. There was a desktop version and a more robust system for large-scale summarizing of text. In the 1980s, there was quite a bit of interest in automatic summarization of text. Autonomy’s system could be configured to generate a précis if one was familiar with that system. Google’s basic citation is a modern version of what smart software can do to suggest what’s in a source item. No humans needed, of course. Too expensive and inefficient for the big folks I assume.

For many years, human abstracting and indexing professionals were on staff. Our automated systems, despite their usefulness, could not handle nuances, special inclusions in source documents like graphs and tables, lists of entities which we processed with the controlled term MANYCOMPANIES, and other specialized functions. I would point out that most of today’s “modern” abstracting and indexing services are simply not as good as the original services like ABI / INFORM, Chemical Abstracts, Engineering Index, Predicasts, and other pioneers in the commercial database sector. (Anyone remember Ev Brenner? That’s what I thought, gentle reader. One does not have to bother oneself with the past in today’s mobile phone search expert world.)

For a number of years, I worked in the commercial database business. To speed the throughput of our citations for pharmaceutical, business, and other topic domains, machine text summarization was of interest to me and my colleagues.

A reader informed me that a new service is available. It is called TLDRThis. Here’s what the splash page looks like:

One can paste text or provide a URL, and the system returns a synopsis of the source document. (The advanced service generates a more in-depth summary, but I did not test this. I am not too keen on signing up without knowing what the terms and conditions are.) There is a browser extension for the service. For this URL, the system returned this summary:

Enterprise Search: The Floundering Fish!

Stephen E. Arnold Monitors Search,Content Processing,Text Mining,Related Topics His High-Tech Nerve Center In Rural Kentucky.,He Tries To Winnow The Goose Feathers The Giblets. He Works With Colleagues,Worldwide To Make This Web Log Useful To Those Who Want To Go,Beyond Search . Contact Him At Sa,At,Arnoldit.Com. His Web Site,With Additional Information About Search Is  |    Oct 27, 2011  |  Time Saved: 5 mins

  1. I am thinking about another monograph on the topic of “enterprise search.” The subject seems to be a bit like the motion picture protagonist Jason.
  2. The landscape of enterprise search is pretty much unchanged.
  3. But the technology of yesterday’s giants of enterprise search is pretty much unchanged.
  4. The reality is that the original Big Five had and still have technology rooted in the mid to late 1990s.

We noted several positive functions; for example, identifying the author and providing a synopsis of the source, even the goose feathers’ reference. On the downside, the system missed the main point of the article; that is, enterprise search has been a bit of a chimera for decades. Also, the system ignored the entities (company names) in the write up. These are important in my experience. People search for names, concepts, and events. The best synopses capture some of the entities and tell the reader to get the full list and other information from the source document. I am not sure what to make of TLDRThis’s display of a picture which makes zero sense without the context of the full article. I fed the system a PDF which did not compute, and I tried a bit.ly link which generated a request to refresh the page, not the summary.

To get an “advanced summary”, one must sign up. I did not choose to do that. I have added this site to our “follow” list. I will make a note to try and find out who developed this service.

The pricing ranges from free for basic summarization to $60 per year for Bronze level service. Among its features are 100 summaries per month and “exclusive features”. These are coming soon. The next level service is $10 per month. The fee includes 300 summaries a month and “exclusive features.” These are also coming soon. The Platinum service is $20 per month and includes 1,000 summaries per month. These are “better” and will include forthcoming advanced features.
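
What does a summary cost? Here is a quick calculation using the fees and quotas listed above. It assumes a subscriber uses the full monthly allowance, which is my assumption. The post does not give a name for the $10 plan, so the label in the code is a placeholder.

```python
# Fees and quotas as described above; "$10 tier" is a placeholder label.
plans = {
    "Bronze": {"monthly_usd": 60 / 12, "summaries": 100},
    "$10 tier": {"monthly_usd": 10, "summaries": 300},
    "Platinum": {"monthly_usd": 20, "summaries": 1000},
}

def cost_per_summary(plan: dict) -> float:
    """Monthly fee divided by the monthly summary allowance."""
    return plan["monthly_usd"] / plan["summaries"]

for name, plan in plans.items():
    print(f"{name}: ${cost_per_summary(plan):.3f} per summary")
```

At full use, the per-summary price runs from about five cents at Bronze down to two cents at Platinum.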

Stay tuned.

[#1] In the early 1990s, search and retrieval was starting to move from the esoteric world of commercial databases to desktop and UNIX machines. IslandSoft, founded in 1993, offered a search and retrieval system. My files from this time reveal that IslandSoft’s description of its system could be reused by today’s search and retrieval marketers. Here’s what IslandSoft said about InText:

IslandInTEXT is a document retrieval and management application for PCs and Unix workstations. IslandInTEXT’s powerful document analysis engine lets users quickly access documents through plain English queries, summarize large documents based on content rather than key words, and automatically route incoming text and documents to user-defined SmartFolders. IslandInTEXT offers the strongest solution yet to help organize and utilize information with large numbers of legacy documents residing on PCs, workstations, and servers as well as the proliferation of electronic mail documents and other data. IslandInTEXT supports a number of popular word processing formats including IslandWrite, Microsoft Word, and WordPerfect plus ASCII text.

IslandInTEXT Includes:

  • File cabinet/file folder metaphor.
  • HTML conversion.
  • Natural language queries for easily locating documents.
  • Relevancy ranking of query results.
  • Document summaries based on statistical relevance from 1 to 99% of the original document—create executive summaries of large documents instantly. [This means that the user can specify how detailed the summarization was; for example, a paragraph or a page or two.]
  • Summary Options. Summaries can be based on key word selection, key word ordering, key sentences, and many more.

[For example:]

  • SmartFolder Routing. Directs incoming text and documents to user-defined folders.
  • Hot Link Pointers. Allow documents to be viewed in their native format without creating copies of the original documents.
  • Heuristic/Learning Architecture. Allows InTEXT to analyze documents according to the author’s style.
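
The "1 to 99%" summary setting is easy to picture in code. What follows is a minimal sketch of a frequency-based extractive summarizer with a percentage dial. It is a generic textbook approach, not IslandSoft's algorithm, which I have never seen. The function name and the scoring rule are my assumptions.

```python
import math
import re
from collections import Counter

def summarize(text: str, percent: int) -> str:
    """Keep roughly `percent` of the sentences, chosen by word-frequency score."""
    if not 1 <= percent <= 99:
        raise ValueError("percent must be between 1 and 99")
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return ""
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / max(len(words), 1)

    keep = max(1, math.ceil(len(sentences) * percent / 100))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:keep])  # put the winners back in document order
    return " ".join(sentences[i] for i in chosen)

sample = (
    "Search vendors promise relevance. Search vendors promise speed. "
    "The weather was pleasant. Search buyers want relevance and speed."
)
print(summarize(sample, 50))
```

A low percentage yields a sentence or two; a high one returns nearly the whole document. Note what the sketch does not do: it ignores entities, tables, and graphs, which is the gap human indexers used to fill.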

A page for InText is still online as of today at http://www.intext.com/. The company appears to have ceased operations in 2010. Data in my files indicate that the name and possibly the code are owned by CP Software, but I have not verified this. I did not include InText in my first edition of Enterprise Search Report, which I wrote in 2003 and 2004. The company had fallen behind market leaders Autonomy, Endeca, and Fast Search & Transfer.

I am surprised at how many search and retrieval companies today are just traveling along well worn paths in the digital landscape. Does search work? Nope. That’s why there are people who specialize, remember things, and maintain personal files. Mobile device search means precision and recall are digital dodo birds in my opinion.

Stephen E Arnold, March 16, 2020

 

Venntel: Some Details

February 18, 2020

Venntel in Virginia has the unwanted attention of journalists. The company provides mobile location data and services. Like many of the firms providing specialized services to the US government, Venntel makes an effort to communicate with potential government customers via trade shows, informal gatherings, and referrals.

Venntel’s secret sauce is cleaner mobile data. The company says:

Over 50% of location data is flawed. Venntel’s proprietary platform efficiently distinguishes between erroneous data and data of value. The platform delivers 100% validated data, allowing your team to focus on results – not data quality.
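
What might "distinguishing erroneous data" involve? Venntel's platform is proprietary, and the company does not say. The sketch below shows only the generic sanity checks any analyst could run on pings from a single device: coordinates in range, no (0, 0) placeholders, and no jumps that imply an impossible speed. The class, the function names, and the 1,000 km/h ceiling are my assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Ping:
    lat: float
    lon: float
    timestamp: float  # seconds since epoch

def haversine_km(a: Ping, b: Ping) -> float:
    """Great-circle distance between two pings in kilometers."""
    radius_km = 6371.0
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dphi = p2 - p1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))

def is_plausible(ping: Ping, previous: Optional[Ping], max_kmh: float = 1000.0) -> bool:
    """Reject out-of-range coordinates, (0, 0) placeholders, and impossible jumps."""
    if not (-90 <= ping.lat <= 90 and -180 <= ping.lon <= 180):
        return False
    if ping.lat == 0 and ping.lon == 0:
        return False
    if previous is not None:
        hours = (ping.timestamp - previous.timestamp) / 3600
        if hours <= 0:
            return False
        if haversine_km(previous, ping) / hours > max_kmh:
            return False
    return True

def clean(pings: list[Ping]) -> list[Ping]:
    """Keep pings that pass the checks, comparing each to the last accepted ping."""
    kept: list[Ping] = []
    for ping in sorted(pings, key=lambda p: p.timestamp):
        if is_plausible(ping, kept[-1] if kept else None):
            kept.append(ping)
    return kept

raw = [
    Ping(38.90, -77.00, 0),      # starting point
    Ping(0.0, 0.0, 60),          # placeholder coordinates
    Ping(91.0, 10.0, 120),       # latitude out of range
    Ping(35.70, 139.70, 600),    # halfway around the world in ten minutes
    Ping(38.91, -77.01, 3600),   # short, believable move
]
print(len(clean(raw)), "of", len(raw), "pings kept")
```

Real pipelines presumably go much further, but even these three checks show why a chunk of raw app-sourced location data gets tossed.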

NextGov reported in “Senator Questions DHS’ Use of Cellphone Location Data for Immigration Enforcement” some information about the company; for example:

  • Customers include DHS and CBP
  • Mobile and other sources of location data are available from the company
  • The firm offers software
  • Venntel, like Oracle and other data aggregators, obtains information from third-party sources; for example, marketing companies brokering mobile phone app data

Senator Ed Markey, a Democrat from Massachusetts, has posed questions to the low profile company and has requested answers by March 3, 2020.

A similar issue surfaced for other mobile data specialists. Other geo-analytic specialists work overtime to have zero public facing profile. An example, you ask? Try to chase down information about Geogence. (Bing and Google try their darnedest to change “Geogence” to “geofence.” This is a tribute to the name choice the stakeholders of Geogence have selected, and a clever exploitation of Bing’s and Google’s inept attempts to “help” their users find information.)

If you want to get a sense of what can be done with location data, check out this video which provides information about the capabilities of Maltego, a go-to system to analyze cell phone records and geolocate actions. The video is two years old, but it is representative of the basic functions. Some specialist companies wrap more user friendly interfaces and point-and-click templates for analysts and investigators to use. There are hybrid systems which combine Analyst’s Notebook type functions with access to email and mobile phone data. Unlike the Watson marketing, IBM keeps these important services in the background because the company wants to focus on the needs of its customers, not on the needs of “real” journalists chasing “real news.”

DarkCyber laments the fact that special services companies which try to maintain a low profile and serve a narrow range of customers are in the news.

Stephen E Arnold, February 18, 2020

Acquiring Data: Addressing a Bottleneck

February 12, 2020

Despite all the advances in automation and digital technology, humans are still required to manually input information into computers. While modern technology makes automation easier than ever, millions of hours are spent on data entry. Artificial intelligence and deep learning could be the key to ending data entry, says the VentureBeat article “How Rossum Is Using Deep Learning To Extract Data From Any Document.”

Rossum is an AI startup based in Prague, Czech Republic, founded by Tomas Gogar, Tomas Tunys, and Petr Baudis. Rossum was started in 2017, and its client list has grown to include top tier names: IBM, Box, Siemens, and Bloomberg. Its recent project focuses on using deep learning to end invoice data entry. Instead of relying entirely on optical character recognition (OCR), Rossum uses “cognitive data capture” that trains machines to evaluate documents like a human. Rossum’s cognitive data capture is like an OCR upgrade:

“OCR tools rely on different sets of rules and templates to cover every type of invoice they may come across. The training process can be slow and time-consuming, given that a company may need to create hundreds of new templates and rule sets. In contrast, Rossum said its cloud-based software requires minimal effort to set up, after which it can peruse a document like a human does — regardless of style or formatting — and it doesn’t rely on fully structured data to extract the content companies need. The company also claims it can extract data 6 times faster than with manual entry while saving companies up to 80% in costs.”

Rossum’s cloud-based approach to cognitive data capture differentiates it from similar platforms. Because Rossum does not need on-site installation, all of Rossum’s resources and engineering go directly to client support. It is similar to the software-as-a-service model Salesforce established in 1999.

The cognitive data capture tool works faster than its predecessors and differently from them:

“Rossum’s pretrained AI engine can be tried and tested within a couple of minutes of integrating its REST API. As with any self-respecting machine learning system, Rossum’s AI adapts as it learns from customers’ data. Rossum claims an average accuracy rate of around 95%, and in situations where its system can’t identify the correct data fields, it asks a human operator for feedback to improve from.”
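
The "ask a human when unsure" loop described in the quote is a common pattern, and it can be sketched in a few lines. This is not Rossum's API or code. The field structure, the confidence scores, and the 0.9 threshold are all hypothetical and serve only to show how low-confidence fields might be routed to an operator.

```python
from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by some extraction model

def route(fields: list[ExtractedField], threshold: float = 0.9):
    """Split fields into auto-accepted ones and ones needing human review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    return accepted, review

# Made-up invoice fields for illustration.
invoice = [
    ExtractedField("invoice_number", "INV-0001", 0.99),
    ExtractedField("total", "1,250.00", 0.97),
    ExtractedField("due_date", "2020-03-??", 0.42),
]
accepted, review = route(invoice)
print("auto-accepted:", [f.name for f in accepted])
print("ask a human:  ", [f.name for f in review])
```

The operator's corrections for the reviewed fields are what a learning system would feed back into training. The threshold sets the trade-off between automation rate and error risk.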

Rossum is not seeking to replace human labor; instead, it wants to free up human time to focus on more complex problems.

Whitney Grace, February 12, 2020
