Nosing Beyond the Machine Learning from Human Curated Data Sets: Autonomy 1996 to Smart Software 2019
April 24, 2019
How does one teach a smart indexing system like Autonomy’s 1996 “neurodynamic” system?* Subject matter experts (SMEs) assembled a training collection of textual information. The articles and other content would replicate the characteristics of the content the Autonomy system would process; that is, index and make searchable or analyzable. The work was important. Get the training data wrong, and the indexing system would assign metadata or “index terms” and “category names” which could cause a query to generate results the user would perceive as incorrect.
How would a licensee adjust the Autonomy “black box”? (Think of my reference to Autonomy and search as a way of approaching “smart software” and “artificial intelligence.”)
The method was to perform re-training. The approach was practical, and for most content domains, the re-training worked. It was an iterative process. Because the words in the corpus fed into the “black box” included new words, concepts, bound phrases, entities, and key sequences, several functions were integrated into the basic Autonomy system as it matured. Examples included support for term lists (controlled vocabularies) and dictionaries.
The combination of re-training and external content available to the system allowed Autonomy to deliver useful outputs.
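The train-and-re-train cycle can be sketched with a toy naive Bayes categorizer. The categories, sample documents, and update step below are invented for illustration; they are not Autonomy's actual implementation, only a minimal Bayesian stand-in.

```python
import math
from collections import defaultdict

class ToyBayesIndexer:
    """Toy naive Bayes categorizer illustrating the train / re-train cycle."""

    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.cat_counts = defaultdict(int)
        self.vocab = set()

    def train(self, documents):
        # Each document is (text, category). Calling train() again with
        # fresh SME-curated samples is the iterative "re-training" step.
        for text, category in documents:
            self.cat_counts[category] += 1
            for word in text.lower().split():
                self.word_counts[category][word] += 1
                self.vocab.add(word)

    def categorize(self, text):
        # Pick the category with the highest posterior log probability,
        # using add-one smoothing so unseen words do not zero out a score.
        total_docs = sum(self.cat_counts.values())
        best, best_score = None, float("-inf")
        for cat, doc_count in self.cat_counts.items():
            score = math.log(doc_count / total_docs)
            cat_total = sum(self.word_counts[cat].values())
            for word in text.lower().split():
                score += math.log((self.word_counts[cat][word] + 1)
                                  / (cat_total + len(self.vocab)))
            if score > best_score:
                best, best_score = cat, score
        return best

indexer = ToyBayesIndexer()
indexer.train([("mergers acquisitions stock", "finance"),
               ("court ruling appeal", "legal")])
print(indexer.categorize("stock price after mergers"))  # finance
# Vocabulary drifts; SMEs supply fresh labeled samples and re-train:
indexer.train([("ipo valuation stock", "finance")])
```

A controlled vocabulary or dictionary would slot in as an extra lookup layer before scoring; the maintenance burden described below is exactly the cost of keeping those samples and term lists current.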
Where the optimal results departed from real-world results usually boiled down to several factors, often working in concert. First, licensees did not want to pay for re-training. Second, maintenance of the external dictionaries was necessary because new entities arrive with reasonable frequency. Third, testing and organizing the refreshed training sets, plus the editorial work required to keep dictionaries shipshape, were too expensive, time consuming, and tedious.
Not surprisingly, some licensees grew unhappy with their Autonomy IDOL (integrated data operating layer) system. That, in my opinion, was not Autonomy’s fault. Autonomy explained in the presentations I heard what was required to get a system up and running and outputting results that could easily hit 80 percent or higher on precision and recall tests.
The Autonomy approach is widely used. In fact, wherever there is a Bayesian system in use, there is the demand for training, re-training, and external knowledge bases. I just took a look at Haystax Constellation. It’s Bayesian, and Haystax makes it clear that the “model” has to be trained. So what’s changed between 1996 and 2019 with regard to Bayesian methods?
Nothing. Zip. Zero.
Who Is Assisting China in Its Technology Push?
March 20, 2019
I read “U.S. Firms Are Helping Build China’s Orwellian State.” The write up is interesting because it identifies companies which allegedly provide technology to the Middle Kingdom. The article also uses an interesting phrase; that is, “tech partnerships.” Please, read the original article for the names of the US companies allegedly cooperating with China.
I want to tell a story.
Several years ago, my team was asked to prepare a report for a major US university. Our task was to try and answer what I thought was a simple question when I accepted the engagement, “Why isn’t this university’s computer science program ranked in the top ten in the US?”
The answer, my team and I learned, had zero to do with faculty, courses, or the intelligence of students. The primary reason was that the university’s graduates were returning to their “home countries.” These included China, Russia, and India, among others. In one advanced course, there was no US born, US educated student.
We documented that over a seven year period, when the undergraduates, graduate students, and post doctoral students completed their work, they had little incentive to start up companies in proximity to the university, donate to the school’s fund raising, or provide the rah rah that happy graduates often do. To see the rah rah in action, may I suggest you visit a “get together” of graduates near Stanford, an eatery in Boston, or Las Vegas during NCAA elimination weekend.
How could my client fix this problem? We were not able to offer a quick fix or even an easy fix. The university had institutionalized revenue from non US students and was, when we did the research, dependent on non US students. These students were very, very capable, and they came to the US to learn, form friendships, and sharpen their business and technical “soft” skills. These, I assume, were skills put to use to reach out to firms where a “soft” contact could be easily initiated and brought to fruition.
Follow the threads and the money.
China has been a country eager to learn in and from the US. The identification of some US firms which work with China should not be a surprise.
However, I would suggest that Foreign Policy or another investigative entity consider a slightly different approach to the topic of China’s technical capabilities. Let me offer one example. Consider this question:
What Israeli companies provide technology to China and other countries which may have some antipathy to the US?
This line of inquiry might lead to some interesting items of information; for example, a major US company which meets on a regular basis with a counterpart with what I would characterize as “close links” to the Chinese government. One colloquial way to describe the situation is like a conduit. Digging in this field of inquiry, one can learn how the Israeli company “flows” US intelligence-related technology from the US and elsewhere through an intermediary so that certain surveillance systems in China can benefit directly from what looks like technology developed in Israel.
Net net: If one wants to understand how US technology moves from the US, the subject must be examined in terms of academic programs, admissions, policies, and connections as well as from the point of view of US company investments in technologies which received funding from Chinese sources routed through entities based in Israel. Looking at a couple of firms does not do the topic justice and indeed suggests a small scale operation.
Uighur monitoring is one thread to follow. But just one.
Stephen E Arnold, March 20, 2019
Identification of Machine Generated Text: Not There Yet
March 18, 2019
“A.I. Generated Text Is Supercharging Fake News. This Is How We Fight Back” provides a rundown of projects focused on figuring out whether a sentence was written by a human or by smart software. IBM’s “visual tool” is described this way by an IBM data scientist:
“[Our current] visual tool might not be the solution to that, but it might help to create algorithms that work like spam detection algorithms,” he said. “Imagine getting emails or reading news, and a browser plug-in tells you for the current text how likely it was produced by model X or model Y.”
Okay, not there yet.
The article references XceptionNet but does not provide a link. If you want to know a bit about this approach, click this link. Interesting but not designed for text.
Net net: There is no foolproof way to determine if a chunk of content has been created:
- Entirely by a human writing to a template; for example, certain traditional news stories about a hearing or a sports score
- Entirely by software processing digital content streaming from a third party
- A combination of human and smart software.
As some individuals emerge from schools with little training in more traditional types of research and source verification, distinguishing information written by a careless or stupid human from information assembled by a semi-smart software system is likely to be difficult.
Identification of text features is tricky. Exciting opportunities exist for researchers; for example, should a search and retrieval system automatically NOT index machine generated text?
Stephen E Arnold, March 18, 2019
Text Analysis Toolkits
March 16, 2019
One of the DarkCyber team spotted a useful list, published by MonkeyLearn. Tucked into a narrative called “Text Analysis: The Only Guide You’ll Ever Need” was a list of natural language processing open source tools, programming languages, and software. Each description is accompanied by links and, in several cases, comments. See the original article for more information.
Caret
CoreNLP
Java
Keras
mlr
NLTK
OpenNLP
Python
SpaCy
Scikit-learn
TensorFlow
PyTorch
R
Weka
Stephen E Arnold, March 16, 2019
MIT Watson Widget That Allegedly Detects Machine Generated Text
March 11, 2019
The venerable IBM and the even more venerable MIT have developed a widget that allegedly detects machine generated texts. You can feed AP stories into the demo system available at this link. To keep things academic, a bogus text will have a preponderance of green highlights. Human generated texts like academic research papers have some green but more yellow, orange, and purple words. A clue for natural language generation system developers to exploit? Just a thought.
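The color coding reflects how highly a language model ranks each word in context: predictable words go green, surprising ones purple. Here is a toy sketch of that idea, using simple corpus word frequency as a stand-in for the neural language model the real demo uses; the corpus and the top_k threshold are invented.

```python
from collections import Counter

# Stand-in "language model": rank words by corpus frequency. The real
# MIT-IBM demo ranks each token by a neural language model's contextual
# probability; this tiny corpus is only for illustration.
CORPUS = ("the cat sat on the mat the dog ran to the cat "
          "a cat and a dog met near the mat").split()
RANK = {w: r for r, (w, _) in enumerate(Counter(CORPUS).most_common())}

def color_tokens(text, top_k=3):
    """Mark each token 'green' (highly ranked, predictable) or 'purple'."""
    out = []
    for tok in text.lower().split():
        rank = RANK.get(tok, len(RANK))  # unseen words rank last
        out.append((tok, "green" if rank < top_k else "purple"))
    return out

def green_fraction(text, top_k=3):
    # A high green fraction suggests text a model finds very predictable,
    # the tell the widget exploits for machine generated prose.
    colors = color_tokens(text, top_k)
    return sum(1 for _, c in colors if c == "green") / len(colors)

print(green_fraction("the cat sat on the mat"))
```

The exploitable clue mentioned above follows directly: a generator that occasionally samples low-rank words drives its green fraction down toward human-looking territory.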
Here’s the report for the text preceding this sentence. It seems that I wrote the sentence, which is semi-reassuring. I am on autopilot when dealing with smart software purporting to know whether a Twitter, Facebook, Twitch, or Discord post is generated by a human or a bot.
I deleted the digital heart because I don’t think a humanoid at either IBM or MIT generated the icon. The system does not comprehend emojis but presents one to a page visitor.
Watson, can you discern the true from the false? I have an IBM Watson ad somewhere. Perhaps I will feed its text into the system.
Stephen E Arnold, March 11, 2019
Trint Transcription
January 29, 2019
DarkCyber Annex noted “Taming a World Filled with Video and Audio, Using Transcription and AI.” The story explains a service which converts non-text information into text; stated another way, voice to text.
The seminal insight, according to ZDNet, was:
The idea was that we would align the text — the machine-generated transcript and source audio — to the spoken word and do it accurately to the millisecond, so that you could follow it like karaoke, and then we had to figure out a way to correct it. That’s where it got really interesting. What we did was, we came up with the idea of merging a text editor, like Word, to an audio-video player and creating one tool that had two very distinct functions.
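The “follow it like karaoke” effect boils down to storing a timestamp per word and looking up the word under the audio playhead. A minimal sketch, assuming a hypothetical word-level transcript format; Trint’s actual data format is not public.

```python
import bisect

# Hypothetical word-level transcript, (word, start_ms, end_ms), as an
# ASR engine might emit it; field names and timings are invented.
transcript = [
    ("taming", 0, 420),
    ("a", 420, 510),
    ("world", 510, 900),
    ("filled", 900, 1300),
    ("with", 1300, 1480),
    ("video", 1480, 2000),
]
starts = [start for _, start, _ in transcript]

def word_at(playhead_ms):
    """Return the word under the audio playhead, karaoke highlight style."""
    i = bisect.bisect_right(starts, playhead_ms) - 1
    if i < 0:
        return None
    word, start, end = transcript[i]
    return word if start <= playhead_ms < end else None

print(word_at(950))  # filled
```

Correcting a word in the text editor can leave its timestamps untouched, which is presumably how edits stay aligned “to the millisecond” without re-running the recognizer.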
Users of the service include “some of the biggest media names, such as The New York Times, ABC News, Thomson Reuters, AP, ESPN, and BBC Worldwide.”
The write up unhelpfully omits the Trint url, which is https://trint.com/.
There was no information provided about the number of languages supported, so here’s that information:
Trint currently offers language models for North American English, British English, Australian English, English – All Accents, European Spanish, European French, German, Italian, Portuguese, Russian, Polish, Finnish, Hungarian, Dutch, Danish, and Swedish.
Also, Trint is a for fee service.
One key challenge for transcription is time-linking real time streams of audio, linking text chat messages and disambiguating emojis in accompanying messages, and making sense of text displayed on the screen in a picture in picture implementation.
DarkCyber Annex does not know of a service which can deliver this type of cross linked service.
Stephen E Arnold, January 29, 2019
Automatic Text Categorization Goes Mainstream
January 10, 2019
Blogger and scaling consultant Abe Winter declares, “Automatic Categorization of Text Is a Core Tool Now.” Noting that, as of last year, companies are using automatic text categorization regularly, Winter clarifies what he is, and is not, referring to here:
“I’m talking about taking a database with short freeform text fields and automatically tagging them according to a tagged sample corpus. I’m not talking about text synthesis, anything to do with speech, automatic chat, question answering, or Alexa Skills.”
Though Winter observes the trend, he is not sure why 2018 was a tipping point. He writes:
“We’ve had some of the building blocks for this kind of text processing for decades, including the stats tools and the training corpuses. Does deep learning help? I don’t know but at minimum it helps by delivering sexy headlines that keep AI in the news, which in turn convinces business stakeholders this is something they can get behind. It wasn’t magic before and it’s not magic now; the output of these algorithms still requires some amount of quality control and manual inspection. But business leaders are now willing to admit that the old manual way of doing things also had drawbacks….”
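The task Winter describes, tagging short freeform fields against a tagged sample corpus, does not require deep learning at all. A minimal sketch using token-overlap nearest neighbor; the sample rows and tags are invented for illustration.

```python
def jaccard(a, b):
    """Token-overlap similarity between two short text fields."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Invented tagged sample corpus of short freeform database fields.
samples = [
    ("refund not received for order", "billing"),
    ("charged twice on my card", "billing"),
    ("app crashes on startup", "bug"),
    ("login page shows an error", "bug"),
]

def tag(field):
    # Nearest labeled sample wins. Real deployments add stemming,
    # smoothing, and a similarity threshold below which a record is
    # routed to the manual quality control Winter mentions.
    return max(samples, key=lambda s: jaccard(field, s[0]))[1]

print(tag("my card was charged twice"))  # billing
```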
The write-up goes on to observe that, while text categorization now works well enough for the mainstream, speech and conversation interfaces still fall short of flawless functionality. He directs our attention to this Google Duplex conversation agent demo as he alludes to some troubling trends in corporate AI deployment. He closes with a word to programmers wondering whether they should add natural language processing to their toolkits:
“The part of the question I can’t answer is how big is the job pool, how long will the bubble last and how much expertise do you need to get more money than you make now? … For myself, I’m learning the basic techniques because they feel core to my industry skill set. I’m staying open to chances to apply them and to work with experts. I’m not even at the midpoint of my career and want to stay ahead of the curve.”
It does seem that natural language processing is not about to go away any time soon.
Cynthia Murrell, January 10, 2019
Ontotext Rank
December 5, 2018
Ontotext, a text processing vendor, has posted a demonstration of its ranking technology. You can find the demos at this link. The graphic below was generated by the system on December 3, 2018, at 0900 US Eastern time. I specified the industry as information technology and the sub industry as search. Here’s what the system displayed:
A few observations:
- I specified 25 companies. The system displayed 10. I assume someone from the company will send me an email that the filters I applied did not have sufficient data to generate the desired result. Perhaps those data should be displayed?
- Neither Google Search nor Microsoft Bing appeared. Google, a search vendor, has been in the news in the countries I have visited recently.
- RightNow appeared. The company is (I thought) a unit of Oracle.
- Publishers Clearing House sells magazine subscriptions. PCH does not offer information retrieval in the sense that I understand the bound phrase.
Net net: I am not sure about the size of the data set or what the categories mean.
You need to decide for yourself whether to use this service or Google Trends or a similar “popularity” or “sentiment” analysis system.
Stephen E Arnold, December 5, 2018
Digital Reasoning: From Intelligence Centric Text Retrieval to Wealth Management
November 12, 2018
Vendors of text processing systems have had to find new ways to generate revenue. In the early days, entity extraction and social graphs drew customers from the US government and from specialized companies like Booz, Allen & Hamilton.
Today, different economic realities have forced change.
The capitalist tool published “Digital Reasoning Brings AI To Wealth Management.” The write up does little to put Digital Reasoning in context. The company was founded in 2000. The firm accepted outside financing which now amounts to about $100 million. The firm became cozy with IBM, labored in the vineyards of the star crossed Distributed Common Ground System, and then faced a fire storm of competition from companies big and small. The reason? Entity extraction and link analysis became commodities. The fancy math also migrated into a wide range of applications.
New buzzwords appeared and gained currency. These ranged from artificial intelligence (who knows what that phrase means?) to real time data analytics (yeah, what is “real time”?).
Digital Reasoning’s response is interesting. The company, like Attivio and Coveo, has nosed into customer support. But the intriguing play is that the Digital Reasoning system, which was text centric, is now packaging its system to help wealth management firms.
Is this text based?
Sure is. I learned:
For advisors, Digital Reasoning helps them prioritize which customers to focus on, which can be useful when an adviser may have 200 or more clients. At the management level, Digital Reasoning can show if the firm has specific advisors getting a lot of complaints so it can respond with training and intervention. At a strategic level, it can sift through communications and identify if customers are looking for a specific offering or type of product.
Interesting approach.
The challenge, of course, will be to differentiate Digital Reasoning’s system from those available from dozens of vendors.
Digital Reasoning has investors who want a return on their $100 million. After 18 years, time may be compressing as solutions once perceived as sophisticated become more widely available and subject to price pressure.
Rumors of Amazon’s interest in this “wealth management” sector have reached us in Harrod’s Creek. That might be another reason why the low profile Digital Reasoning is stirring the PR waters using the capitalist’s tool, Forbes Magazine, once a source of “real” news.
Stephen E Arnold, November 12, 2018
Smart Software and Clever Humans
September 23, 2018
Online translation works pretty well. If you want 70 to 85 percent accuracy, you are home free. Most online translation systems handle routine communications like short blog posts written in declarative sentences and articles written in technical jargon just fine. Stick to mainstream languages, and the services work okay.
But if you want an online system to translate my pet phrases like HSSCM or azure chip consultant, you have to attend more closely. HSSCM refers to the way in which some Silicon Valley outfits run their companies. You know. Like a high school science club which decides that proms are for goofs and football players are not smart. The azure chip thing refers to consulting firms which lack the big time reputation of outfits like Bain, BCG, Booz, etc. (Now don’t get me wrong. The current incarnations of these blue chip outfits are far from stellar. Think questionable practices. Maybe criminal behavior.) The azure chip crowd means second string, maybe third string, knowledge work. Just my opinion, but online translation systems don’t get my drift. My references to Harrod’s Creek are geocoding nightmares when I reference squirrel hunting and bourbon in cereal. Savvy?
I was, therefore, not surprised when I read “AI Company Accused of Using Humans to Fake Its AI.” The main point seems to be:
[An] interpreter accuses leading voice recognition company of ripping off his work and disguising it as the efforts of artificial intelligence.
There are rumors that some outfits use Amazon’s far from mechanical Turk or just use regular employees who can translate that which baffles the smart software.
The allegation from a former human disguised as smart software offered this information to Sixth Tone, a blog publishing the article:
In an open letter posted on Quora-like Q&A platform Zhihu, interpreter Bell Wang claimed he was one of a team of simultaneous interpreters who helped translate the 2018 International Forum on Innovation and Emerging Industries Development on Thursday. The forum claimed to use iFlytek’s automated interpretation service.
Trust me, you zippy millennials, smart software can be fast. It can be efficient. It can be less expensive than manual methods. But it can be wrong. Not just off base. Playing a different game with expensive Ronaldo types.
Why not run this blog post through Google Translate and check out the French or Spanish the system produces? Better yet, aim the system at a poor quality surveillance video or a VoIP call laden with insider talk between a cartel member and the Drug Llama?
Stephen E Arnold, September 23, 2018