Into R? A List for You

May 12, 2019

Computerworld, which runs some pretty unusual stories, published “Great R Packages for Data Import, Wrangling and Visualization.” “Great” is an interesting word. In the lingo of Computerworld, a real journalist did some searching, talked to some people, and created a list. As it turns out, the effort is useful. Looking at the Computerworld table is quite a bit easier than trying to dig information out of assorted online sources. Plus, people are not too keen on the phone and email thing now.

The listing includes a mixture of tools, software, and utilities. There are more than 80 entries. I wasn't sure what to make of XML's inclusion in the list, but the source is Computerworld, and I assume that the "real" journalist knows much more than I do.

Three observations:

  • Earthworm lists without classification or alphabetization are less useful to me than listings which are sorted by tags and alphabetized within categories. Excel can perform this helpful trick. (A sketch of the idea appears after this list.)
  • Some items in the earthworm list have links and others do not. Consistency, I suppose, is the hobgoblin of some types of intellectual work.
  • An indication of which items are free and which are fee based would be useful too.
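
For what it is worth, the sorting wished for in the first bullet is a trivial operation. Here is a minimal Python sketch, using invented package names and categories rather than the actual Computerworld entries:

# Group entries by a tag/category, then alphabetize within each group.
# The names and categories below are illustrative, not from the table.
entries = [
    {"name": "ggplot2", "category": "visualization"},
    {"name": "dplyr", "category": "wrangling"},
    {"name": "readr", "category": "import"},
    {"name": "plotly", "category": "visualization"},
    {"name": "tidyr", "category": "wrangling"},
]

# Sort by category first, then by name within each category.
entries.sort(key=lambda e: (e["category"], e["name"].lower()))

for entry in entries:
    print(f'{entry["category"]:<15} {entry["name"]}')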

Despite these shortcomings, you may want to download the list and tuck it into your “Things I love about R” folder.

Stephen E Arnold, May 12, 2019

China: Patent Translation System

May 10, 2019

Patents are usually easily findable documents. However, reading a patent once found is a challenge. Up the ante if the patent is in a language the person does not read. “AI Used to Translate Patent Documents” provides some information about a new system available from the Intellectual Property Publishing House. According to the article in China Daily:

The system can translate Chinese into English, Japanese and German and vice versa. Its accuracy in two-way translation between Chinese and Japanese has reached 95 percent, far more than the current industry average, and the rest has topped 90 percent…

The system uses a dictionary, natural language processing algorithms, and a computational model. In short, this is a collection of widely used methods tuned over a decade by the Chinese organization. In that span, Thomson Reuters dropped out of the patent game, and just finding patents, even in the US, can be a daunting task.
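
The article does not explain how the dictionary and the model interact. One common arrangement, offered here purely as an assumption, is to protect curated terminology from a general-purpose translation model with placeholders. A minimal Python sketch (the glossary entries and the pass-through "translator" are stand-ins, not the Intellectual Property Publishing House's system):

# Assumption: a domain glossary shields patent terms from a generic MT model.
PATENT_GLOSSARY = {
    # illustrative entries; a real system would hold curated terminology
    "半导体": "semiconductor",
    "权利要求": "claim",
}

def translate_with_glossary(text: str, mt_translate) -> str:
    """Swap known patent terms for placeholders, run a generic MT
    callable, then restore the curated target-language terms."""
    placeholders = {}
    for i, (src, tgt) in enumerate(PATENT_GLOSSARY.items()):
        token = f"__TERM{i}__"
        if src in text:
            text = text.replace(src, token)
            placeholders[token] = tgt
    translated = mt_translate(text)  # any general-purpose MT callable
    for token, tgt in placeholders.items():
        translated = translated.replace(token, tgt)
    return translated

# demo with a stand-in "translator" that just passes text through
print(translate_with_glossary("该 半导体 权利要求", lambda s: s))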

Translation has been an even more difficult task for some lawyers, researchers, analysts, and academics.

If the information in the China Daily article is accurate, China may have an intellectual property advantage. The write up offers some details, which sound interesting; for example:

  • Translation of a Japanese document: five seconds
  • Patent documents record 90 percent of a country’s technology and innovation
  • China has “a huge database of global patents”.

And the other 10 percent? Maybe other methods are employed.

Stephen E Arnold, May 10, 2019

Cognitive Engine: What Powers the USAF Platform?

May 1, 2019

Last week I met with a university professor who does cutting-edge data and text mining and also shepherds PhD candidates. In the course of our 90-minute conversation, I noticed some reference books with SPSS on the cover. The procedures implemented at this particular university worked well.

After the meeting, I was thinking about the newer approaches which are becoming publicly available. The USAF has started talking about its "cognitive engine." I thought I heard at a conference that some technology developed by Nutonian, now part of a data and text mining roll up, had influenced the project.

The Nutonian system is predictive with a twist. The person using the system can rely on the smart software to perform the numerous intermediary steps required when using more traditional systems.
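
In other words, the user states the problem and the software tries candidate models itself. A minimal sketch of that idea using scikit-learn's cross-validation (this illustrates automated model selection generally, not Nutonian's symbolic regression):

# Try several model families and let cross-validation pick the winner;
# the user skips the hand-tuned intermediary steps.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()  # mean R^2 across folds
    print(f"{name}: mean R^2 = {score:.3f}")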

The article "The US Air Force Will Showcase Its Many Technological Advances in the USAF Lab Day" describes the event. The original is in Chinese, but Freetranslate.com can help out if you don't read Chinese or lack a close-by contact who does.

The USAF wants to deploy a cognitive platform into which vendors can “plug in” their systems. The Chinese write up reported:

AFRL's Autonomy Capability Team 3 (ACT3) is developing artificial intelligence on a large scale through the development and application of the Air Force Cognitive Engine (ACE), an artificial intelligence software platform. The software platform architecture reduces the barriers to entry for artificial intelligence applications and provides end-user applications with the ability to cover a range of artificial intelligence problem types. In application, the software platform connects educated end users, developers, algorithms implemented in software, task data, and computing hardware in the process of creating an artificial intelligence solution.
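
Machine translation aside, the description sounds like a classic plug-in architecture. A minimal sketch of that pattern follows; the names are illustrative, and nothing below comes from ACE itself:

# A registry that connects end users to pluggable vendor algorithms.
from typing import Callable, Dict

class CognitivePlatform:
    """Route a problem type to whichever plug-in handles it."""

    def __init__(self) -> None:
        self._algorithms: Dict[str, Callable] = {}

    def register(self, problem_type: str, algorithm: Callable) -> None:
        self._algorithms[problem_type] = algorithm

    def solve(self, problem_type: str, task_data):
        # Hand the end user's task data to the registered plug-in.
        return self._algorithms[problem_type](task_data)

platform = CognitivePlatform()
platform.register("classify", lambda data: sorted(data))  # stand-in plug-in
print(platform.solve("classify", [3, 1, 2]))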

The article also provides some interesting details which were not included in some of the English language reports about this session; for example:

  • Smart rockets
  • An agile pod
  • Pathogen identification.

A couple of observations:

First, obviously the Chinese writer had access to information about the Lab Day demonstrations.

Second, the write up about the cognitive platform does not name foundation vendors, which I understand.

Third, it would be delightful to visit a university and see documentation and information about the next-generation predictive analytics systems available.

Stephen E Arnold, May 1, 2019


Latest GraphDB Edition Available

April 25, 2019

A new version of GraphDB is now available, we learn from the company's News post, "Ontotext's GraphDB 8.9 Boosts Semantic Similarity Search." The semantic graph database offers a couple of new features inspired by user feedback. We learn:

"The semantic similarity search is based on the Random Indexing algorithm. … The latest GraphDB release enables users to create hybrid similarity searches using pre-built text-based similarity vectors for the predication-based similarity index. The index combines the power of graph topology with the text similarity. The users can control the index accuracy by specifying the number of iterations required to refine the embeddings. Another improvement is that now GraphDB 8.9 allows users to boost the term weights when searching in text-based similarity indexes. It also simplifies the process of aborting running queries or updates from the SPARQL editor in the Workbench."
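
For readers unfamiliar with Random Indexing, here is a minimal sketch of the general algorithm: each term gets a fixed sparse random "index vector," and a term's embedding accumulates the index vectors of its neighbors. This illustrates the technique generally, not Ontotext's implementation:

import numpy as np

DIM, NONZERO = 512, 8
rng = np.random.default_rng(42)

def index_vector() -> np.ndarray:
    # sparse ternary vector: a few random +1/-1 entries, rest zeros
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

docs = [["graph", "database", "semantic"], ["semantic", "search", "graph"]]
index = {}    # term -> fixed sparse random index vector
context = {}  # term -> accumulated context embedding

for doc in docs:
    for term in doc:
        index.setdefault(term, index_vector())
for doc in docs:
    for term in doc:
        ctx = context.setdefault(term, np.zeros(DIM))
        for other in doc:
            if other != term:
                ctx += index[other]  # sum neighbors' index vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

print(cosine(context["graph"], context["semantic"]))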

The database continues to be updated to the current RDF4J 2.4.6 public release. GraphDB comes in Free, Standard, and Enterprise editions. Begun in 2000, Ontotext is based in Sofia, Bulgaria, and maintains its North American office in New York City.

Cynthia Murrell, April 25, 2019

Nosing Beyond the Machine Learning from Human Curated Data Sets: Autonomy 1996 to Smart Software 2019

April 24, 2019

How does one teach a smart indexing system like Autonomy's 1996 "neurodynamic" system? Subject matter experts (SMEs) assembled a training collection of textual information. The articles and other content would replicate the characteristics of the content which the Autonomy system would process; that is, index and make searchable or analyzable. The work was important. Get the training data wrong, and the indexing system would assign metadata or "index terms" and "category names" which could cause a query to generate results the user would perceive as incorrect.


How would a licensee adjust the Autonomy “black box”? (Think of my reference to Autonomy and search as a way of approaching “smart software” and “artificial intelligence.”)

The method was to perform re-training. The approach was practical, and for most content domains, the re-training worked. It was an iterative process. Because the words in the corpus fed into the "black box" included new words, concepts, bound phrases, entities, and key sequences, several functions were integrated into the basic Autonomy system as it matured. Examples included support for term lists (controlled vocabularies) and dictionaries.

The combination of re-training and external content available to the system allowed Autonomy to deliver useful outputs.
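
What did that loop look like in practice? Here is a minimal sketch of the train/re-train cycle, using a generic naive Bayes text classifier. This stands in for the Bayesian core of systems like Autonomy's; it is not IDOL code, and the documents are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Initial SME-curated training set: documents plus category labels.
train_docs = [
    "quarterly revenue earnings profit",
    "court ruling lawsuit judge",
    "router packet network latency",
]
train_labels = ["finance", "legal", "networking"]

vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_docs), train_labels)

print(model.predict(vectorizer.transform(["judge dismisses lawsuit"])))

# Re-training: when new vocabulary appears in the corpus, SMEs add fresh
# examples and the whole pipeline is re-fit -- the iterative loop (and
# the recurring cost) the post describes.
train_docs.append("blockchain token wallet ledger")
train_labels.append("finance")
model.fit(vectorizer.fit_transform(train_docs), train_labels)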

Where the optimal results departed from the real world results usually boiled down to several factors, often working in concert. First, licensees did not want to pay for re-training. Second, maintenance of the external dictionaries was necessary because new entities arrive with reasonable frequency. Third, testing and organizing the freshened training sets and the editorial work required to keep dictionaries shipshape were too expensive, time consuming, and tedious.

Not surprisingly, some licensees grew unhappy with their Autonomy IDOL (integrated data operating layer) system. That, in my opinion, was not Autonomy’s fault. Autonomy explained in the presentations I heard what was required to get a system up and running and outputting results that could easily hit 80 percent or higher on precision and recall tests.

The Autonomy approach is widely used. In fact, wherever there is a Bayesian system in use, there is the demand for training, re-training, and external knowledge bases. I just took a look at Haystax Constellation. It's Bayesian, and Haystax makes it clear that the "model" has to be trained. So what's changed between 1996 and 2019 with regard to Bayesian methods?

Nothing. Zip. Zero.


Who Is Assisting China in Its Technology Push?

March 20, 2019

I read “U.S. Firms Are Helping Build China’s Orwellian State.” The write up is interesting because it identifies companies which allegedly provide technology to the Middle Kingdom. The article also uses an interesting phrase; that is, “tech partnerships.” Please, read the original article for the names of the US companies allegedly cooperating with China.

I want to tell a story.

Several years ago, my team was asked to prepare a report for a major US university. Our task was to try to answer what I thought was a simple question when I accepted the engagement: "Why isn't this university's computer science program ranked in the top ten in the US?"

The answer, my team and I learned, had zero to do with faculty, courses, or the intelligence of students. The primary reason was that the university's graduates were returning to their "home countries." These included China, Russia, and India, among others. In one advanced course, there was no US-born, US-educated student.

We documented that over a seven-year period, when the undergraduates, graduate students, and postdoctoral students completed their work, they had little incentive to start up companies in proximity to the university, donate to the school's fund raising, or provide the rah rah that happy graduates often do. To see the rah rah in action, may I suggest you visit a "get together" of graduates near Stanford, an eatery in Boston, or Las Vegas on NCAA elimination weekend.

How could my client fix this problem? We were not able to offer a quick fix or even an easy fix. The university had institutionalized revenue from non-US students and was, when we did the research, dependent on non-US students. These students were very, very capable, and they came to the US to learn, form friendships, and sharpen their business and technical "soft" skills. These, I assume, were skills put to use to reach out to firms where a "soft" contact could be easily initiated and brought to fruition.


Follow the threads and the money.

China has been a country eager to learn in and from the US. The identification of some US firms which work with China should not be a surprise.

However, I would suggest that Foreign Policy or another investigative entity consider a slightly different approach to the topic of China’s technical capabilities. Let me offer one example. Consider this question:

What Israeli companies provide technology to China and other countries which may have some antipathy to the US?

This line of inquiry might lead to some interesting items of information; for example, a major US company which meets on a regular basis with a counterpart with what I would characterize as "close links" to the Chinese government. One colloquial way to describe the situation: a conduit. Digging in this field of inquiry, one can learn how the Israeli company "flows" US intelligence-related technology from the US and elsewhere through an intermediary so that certain surveillance systems in China can benefit directly from what looks like technology developed in Israel.

Net net: If one wants to understand how US technology moves from the US, the subject must be examined in terms of academic programs, admissions, policies, and connections as well as from the point of view of US company investments in technologies which received funding from Chinese sources routed through entities based in Israel. Looking at a couple of firms does not do the topic justice and indeed suggests a small-scale operation.

Uighur monitoring is one thread to follow. But just one.

Stephen E Arnold, March 20, 2019

Identification of Machine Generated Text: Not There Yet

March 18, 2019

"A.I. Generated Text Is Supercharging Fake News. This Is How We Fight Back" provides a rundown of projects focused on figuring out whether a sentence was written by a human or by smart software. IBM's "visual tool" is described this way by an IBM data scientist:

“[Our current] visual tool might not be the solution to that, but it might help to create algorithms that work like spam detection algorithms,” he said. “Imagine getting emails or reading news, and a browser plug-in tells you for the current text how likely it was produced by model X or model Y.”

Okay, not there yet.
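
Still, the spam-detection analogy is concrete enough to sketch. Here is a minimal example of the "how likely was this produced by model X or model Y" idea from the quote, with toy unigram models standing in for real language models; the corpora are invented placeholders:

import math
from collections import Counter

def unigram_model(corpus: str):
    counts = Counter(corpus.split())
    total = sum(counts.values())
    vocab = len(counts)
    def log_prob(text: str) -> float:
        # add-one smoothing so unseen words don't zero out the score
        return sum(
            math.log((counts[w] + 1) / (total + vocab + 1))
            for w in text.split()
        )
    return log_prob

model_x = unigram_model("the model writes fluent generic sentences " * 20)
model_y = unigram_model("humans write odd specific surprising things " * 20)

sample = "fluent generic sentences"
# positive log-likelihood ratio -> text looks more like model X's output
ratio = model_x(sample) - model_y(sample)
print("more likely model X" if ratio > 0 else "more likely model Y")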

The article references XceptionNet but does not provide a link. If you want to know a bit about this approach, click this link. Interesting but not designed for text.

Net net: There is no foolproof way to determine if a chunk of content has been created:

  • Entirely by a human writing to a template; for example, certain traditional news stories about a hearing or a sports score
  • Entirely by software processing digital content streaming from a third party
  • A combination of human and smart software.

As some individuals emerge from schools with little training in more traditional types of research and source verification, distinguishing information written by a careless or stupid human from information assembled by a semi-smart software system is likely to be difficult.

Identification of text features is tricky. Exciting opportunities for researchers exist; for example, should a search and retrieval system automatically NOT out (exclude) machine generated text?

Stephen E Arnold, March 18, 2019

Text Analysis Toolkits

March 16, 2019

One of the DarkCyber team spotted a useful list published by MonkeyLearn. Tucked into a narrative called "Text Analysis: The Only Guide You'll Ever Need" is a list of natural language processing open source tools, programming languages, and software. Each description is accompanied by links and, in several cases, comments. See the original article for more information.

  • Caret
  • CoreNLP
  • Java
  • Keras
  • mlr
  • NLTK
  • OpenNLP
  • Python
  • SpaCy
  • Scikit-learn
  • TensorFlow
  • PyTorch
  • R
  • Weka
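
For readers who want a quick taste of one listed toolkit, here is a minimal NLTK example: tokenize a sentence and tag parts of speech. It assumes the punkt and tagger data packages have been fetched with nltk.download(), as shown:

import nltk
# one-time downloads of the tokenizer and tagger data
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

from nltk import word_tokenize, pos_tag

tokens = word_tokenize("Text analysis toolkits make corpus work easier.")
print(pos_tag(tokens))  # list of (word, part-of-speech) pairs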

Stephen E Arnold, March 16, 2019

MIT Watson Widget That Allegedly Detects Machine Generated Text

March 11, 2019

The venerable IBM and the even more venerable MIT have developed a widget that allegedly detects machine generated texts. You can feed AP stories into the demo system available at this link. To keep things academic, a bogus text will have a preponderance of green highlights. Human generated texts, like academic research papers, have some green but more yellow, orange, and purple words. A clue for natural language generation system developers to exploit? Just a thought.
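
The mechanism appears to be token predictability. Here is a minimal sketch of the idea using GPT-2 via the Hugging Face transformers package: rank each token by how predictable it was under the model, then bucket the ranks (highly predictable tokens land in the "green" bucket). This mirrors the approach as I understand it, not the MIT-IBM code itself:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
ids = tokenizer.encode(text, return_tensors="pt")

with torch.no_grad():
    logits = model(ids).logits  # shape: [1, seq_len, vocab_size]

for pos in range(1, ids.shape[1]):
    # rank of the actual token among the model's predictions at this step
    actual = ids[0, pos]
    rank = int((logits[0, pos - 1] > logits[0, pos - 1, actual]).sum()) + 1
    bucket = ("green" if rank <= 10 else "yellow" if rank <= 100
              else "orange" if rank <= 1000 else "purple")
    print(f"{tokenizer.decode([int(actual)])!r:>12}  rank={rank:<6} {bucket}")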

Here's the report for the text preceding this sentence. It seems that I wrote the sentence, which is semi-reassuring. I am on autopilot when dealing with smart software purporting to know whether a Twitter, Facebook, Twitch, or Discord post is generated by a human or a bot.

[Image: the widget's report on the text above]

I deleted the digital heart because I don’t think a humanoid at either IBM or MIT generated the icon. The system does not comprehend emojis but presents one to a page visitor.

Watson, can you discern the true from the false? I have an IBM Watson ad somewhere. Perhaps I will feed its text into the system.

Stephen E Arnold, March 11, 2019

Trint Transcription

January 29, 2019

DarkCyber Annex noted "Taming a World Filled with Video and Audio, Using Transcription and AI." The story explains a service which converts non-text information into text; stated another way, voice to text.

The seminal insight, according to ZDNet, was:

The idea was that we would align the text — the machine-generated transcript and source audio — to the spoken word and do it accurately to the millisecond, so that you could follow it like karaoke, and then we had to figure out a way to correct it. That’s where it got really interesting. What we did was, we came up with the idea of merging a text editor, like Word, to an audio-video player and creating one tool that had two very distinct functions.
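
The alignment trick itself is straightforward to sketch. Assuming a transcription engine emits word-level start times (the data below are invented), a lookup from playback position to word might work like this:

from bisect import bisect_right

# (start_time_ms, word) pairs as a transcription engine might emit them
transcript = [
    (0, "Taming"), (420, "a"), (510, "world"),
    (900, "filled"), (1320, "with"), (1480, "video"),
]

starts = [t for t, _ in transcript]

def word_at(playback_ms: int) -> str:
    """Return the word being spoken at the given playback position,
    so an editor can highlight it karaoke-style."""
    i = bisect_right(starts, playback_ms) - 1
    return transcript[max(i, 0)][1]

print(word_at(1000))  # -> "filled"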

Users of the service include “some of the biggest media names, such as The New York Times, ABC News, Thomson Reuters, AP, ESPN, and BBC Worldwide.”

The write up helpfully omits the Trint URL, which is https://trint.com/.

There was no information provided about the number of languages supported, so here’s that information:

Trint currently offers language models for North American English, British English, Australian English, English – All Accents, European Spanish, European French, German, Italian, Portuguese, Russian, Polish, Finnish, Hungarian, Dutch, Danish, and Swedish.

Also, Trint is a for fee service.

One key function of transcription is that it has to time-align real-time streams of audio, link text chat messages, disambiguate emojis in accompanying messages, and make sense of text displayed on the screen in a picture-in-picture implementation.
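
A minimal sketch of one piece of that wish list, merging separately timestamped streams into a single ordered timeline (the stream contents are invented placeholders):

from heapq import merge

speech = [(1000, "speech", "welcome everyone"), (4000, "speech", "first topic")]
chat   = [(1500, "chat", "hi! :wave:"), (3500, "chat", "sound is low")]
screen = [(2000, "screen", "Agenda slide")]

# Each stream is already sorted by timestamp, so a k-way merge keeps
# the combined timeline ordered without re-sorting everything.
for ts, source, payload in merge(speech, chat, screen):
    print(f"{ts:>6} ms  [{source}] {payload}")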

DarkCyber Annex does not know of a vendor which can deliver this type of cross-linked service.

Stephen E Arnold, January 29, 2019
