Into R? A List for You

May 12, 2019

Computerworld, which runs some pretty unusual stories, published “Great R Packages for Data Import, Wrangling and Visualization.” “Great” is an interesting word. In the lingo of Computerworld, a real journalist did some searching, talked to some people, and created a list. As it turns out, the effort is useful. Looking at the Computerworld table is quite a bit easier than trying to dig information out of assorted online sources. Plus, people are not too keen on the phone and email thing now.

The listing includes a mixture of different tools, software, and utilities. There are more than 80 listings. I wasn’t sure what to make of XML’s inclusion in the list, but the source is Computerworld, and I assume that the “real” journalist knows much more than I.

Three observations:

  • Earthworm lists without classification or alphabetization are less useful to me than listings which are sorted by tags and alphabetized within categories. Excel does perform this helpful trick.
  • Some items in the earthworm list have links and others do not. Consistency, I suppose, is the hobgoblin of some types of intellectual work.
  • An indication of which item is free or for fee would be useful too.
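The first observation, sorting by tag and alphabetizing within categories, takes only a few lines of code. Here is a minimal sketch in Python. The sample rows and category labels are hypothetical stand-ins for illustration, not copied from the Computerworld table:

```python
from itertools import groupby

# Hypothetical sample rows: (package name, category tag).
# The category labels are assumptions for illustration only.
packages = [
    ("ggplot2", "visualization"),
    ("dplyr", "wrangling"),
    ("readr", "import"),
    ("tidyr", "wrangling"),
    ("XML", "import"),
    ("plotly", "visualization"),
]


def category_key(row):
    """Case-insensitive category used for both sorting and grouping."""
    return row[1].lower()


def sort_by_category(rows):
    """Return rows sorted by category, then case-insensitively by name."""
    return sorted(rows, key=lambda row: (category_key(row), row[0].lower()))


def group_by_category(rows):
    """Map each category to its alphabetized list of package names."""
    ordered = sort_by_category(rows)
    return {
        category: [name for name, _ in group]
        for category, group in groupby(ordered, key=category_key)
    }


if __name__ == "__main__":
    for category, names in group_by_category(packages).items():
        print(f"{category}: {', '.join(names)}")
```

The sort key is a tuple, so rows are ordered by category first and by name second. A "free or for fee" column could be added as a third field in each row without changing the approach.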

Despite these shortcomings, you may want to download the list and tuck it into your “Things I love about R” folder.

Stephen E Arnold, May 12, 2019

Cognitive Engine: What Powers the USAF Platform?

May 1, 2019

Last week I met with a university professor who does cutting edge data and text mining and also shepherds PhD candidates. In the course of our 90 minute conversation, I noticed some reference books which had SPSS on the cover. The procedures implemented at this particular university worked well.

After the meeting, I was thinking about the newer approaches which are becoming publicly available. The USAF has started talking about its “cognitive engine.” I thought I heard at a conference that some technology developed by Nutonian, now part of a data and text mining roll up, had influenced the project.

The Nutonian system is predictive with a twist. The person using the system can rely on the smart software to perform the numerous intermediary steps required when using more traditional systems.

The source is the article “The US Air Force Will Showcase Its Many Technological Advances in the USAF Lab Day.” The original is in Chinese, but Freetranslate.com can help out if you don’t read Chinese or lack a close by contact who does.

The USAF wants to deploy a cognitive platform into which vendors can “plug in” their systems. The Chinese write up reported:

AFRL’s Autonomy Capability Team 3 (ACT3) is developing artificial intelligence on a large scale through the development and application of the Air Force Cognitive Engine (ACE), an artificial intelligence software platform. Put into application. The software platform architecture reduces the barriers to entry for artificial intelligence applications and provides end-user applications with the ability to cover a range of artificial intelligence problem types. In the application, the software platform connects educated end users, developers, and algorithms implemented in software, task data, and computing hardware to the process of creating an artificial intelligence solution.

The article also provides some interesting details which were not included in some of the English language reports about this session; for example:

  • Smart rockets
  • An agile pod
  • Pathogen identification.

A couple of observations:

First, obviously the Chinese writer had access to information about the Lab Day demonstrations.

Second, the write up about the cognitive platform does not mention foundation vendors, which I understand.

Third, it would be delightful to visit a university and see documentation and information about the next-generation predictive analytics systems available.

Stephen E Arnold, May 1, 2019

Nosing Beyond the Machine Learning from Human Curated Data Sets: Autonomy 1996 to Smart Software 2019

April 24, 2019

How does one teach a smart indexing system like Autonomy’s 1996 “neurodynamic” system?* Subject matter experts (SMEs) assembled a training collection of textual information. The articles and other content would replicate the characteristics of the content which the Autonomy system would process; that is, index and make searchable or analyzable. The work was important. Get the training data wrong and the indexing system would assign metadata or “index terms” and “category names” which could cause a query to generate results the user could perceive as incorrect.

How would a licensee adjust the Autonomy “black box”? (Think of my reference to Autonomy and search as a way of approaching “smart software” and “artificial intelligence.”)

The method was to perform re-training. The approach was practical and, for most content domains, the re-training worked. It was an iterative process. Because the words in the corpus fed into the “black box” included new words, concepts, bound phrases, entities, and key sequences, there were several functions integrated into the basic Autonomy system as it matured. Examples ranged from support for term lists (controlled vocabularies) to dictionaries.

The combination of re-training and external content available to the system allowed Autonomy to deliver useful outputs.

The gap between the optimal results and the real world results usually boiled down to several factors, often working in concert. First, licensees did not want to pay for re-training. Second, maintenance of the external dictionaries was necessary because new entities arrive with reasonable frequency. Third, testing and organizing the freshened training sets and the editorial work required to keep dictionaries ship shape were too expensive, time consuming, and tedious.

Not surprisingly, some licensees grew unhappy with their Autonomy IDOL (Intelligent Data Operating Layer) system. That, in my opinion, was not Autonomy’s fault. Autonomy explained in the presentations I heard what was required to get a system up and running and outputting results that could easily hit 80 percent or higher on precision and recall tests.
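For readers unfamiliar with the two measures: precision is the share of returned documents that are relevant, and recall is the share of relevant documents that were returned. Here is a minimal sketch in Python using hypothetical document IDs. It illustrates the standard definitions only; it is not Autonomy’s test procedure, which the presentations I heard did not spell out at this level:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: iterable of document IDs the system returned.
    relevant:  iterable of document IDs a human judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the system returns doc1..doc10, while the
# human-judged relevant set is doc3..doc12. Eight documents overlap.
retrieved = [f"doc{i}" for i in range(1, 11)]
relevant = [f"doc{i}" for i in range(3, 13)]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.0%} recall={r:.0%}")
```

In this made-up case both measures come out at 80 percent. Note that the "relevant" set has to come from human judgments, which is one more place where subject matter expert time enters the bill.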

The Autonomy approach is widely used. In fact, wherever there is a Bayesian system in use, there is the training, re-training, external knowledge base demand. I just took a look at Haystax Constellation. It’s Bayesian and Haystax makes it clear that the “model” has to be trained. So what’s changed between 1996 and 2019 with regard to Bayesian methods?

Nothing. Zip. Zero.

Text Analysis Toolkits

March 16, 2019

One of the DarkCyber team spotted a useful list, published by MonkeyLearn. Tucked into a narrative called “Text Analysis: The Only Guide You’ll Ever Need” was a list of natural language processing open source tools, programming languages, and software. Each description is accompanied by links and, in several cases, comments. See the original article for more information.

  • Caret
  • CoreNLP
  • Java
  • Keras
  • mlr
  • NLTK
  • OpenNLP
  • Python
  • SpaCy
  • Scikit-learn
  • TensorFlow
  • PyTorch
  • R
  • Weka

Stephen E Arnold, March 16, 2019

Ontotext Rank

December 5, 2018

Ontotext, a text processing vendor, has posted a demonstration of its ranking technology. You can find the demos at this link. The graphic below was generated by the system on December 3, 2018, at 9:00 am US Eastern time. I specified the industry as information technology and the sub industry as search. Here’s what the system displayed:

image

A few observations:

  1. I specified 25 companies. The system displayed 10. I assume someone from the company will send me an email that the filters I applied did not have sufficient data to generate the desired result. Perhaps those data should be displayed?
  2. Neither Google Search nor Microsoft Bing appeared. Google, a search vendor, has been in the news in the countries I have visited recently.
  3. RightNow appeared. The company is (I thought) a unit of Oracle.
  4. Publishers Clearing House sells magazine subscriptions. PCH does not offer information retrieval in the sense that I understand the bound phrase.

Net net: I am not sure about the size of the data set or what the categories mean.

You need to decide for yourself whether to use this service or Google Trends or a similar “popularity” or “sentiment” analysis system.

Stephen E Arnold, December 5, 2018

Digital Reasoning: From Intelligence Centric Text Retrieval to Wealth Management

November 12, 2018

Vendors of text processing systems have had to find new ways to generate revenue. The early days of entity extraction and social graphs provided customers from the US government and specialized companies like Booz, Allen & Hamilton.

Today, different economic realities have forced change.

The capitalist tool published “Digital Reasoning Brings AI To Wealth Management.” The write up does little to put Digital Reasoning in context. The company was founded in 2000. The firm accepted outside financing which now amounts to about $100 million. The firm became cozy with IBM, labored in the vineyards of the star crossed Distributed Common Ground System, and then faced a fire storm of competition from companies big and small. The reason? Entity extraction and link analysis became commodities. The fancy math also migrated into a wide range of applications.

New buzzwords appeared and gained currency. These ranged from artificial intelligence (who knows what that phrase means?) to real time data analytics (Yeah, what is “real time”?).

Digital Reasoning’s response is interesting. The company, like Attivio and Coveo, has nosed into customer support. But the intriguing play is that Digital Reasoning, whose system was text centric, is now packaging that system to help wealth management firms.

Is this text based?

Sure is. I learned:

For advisors, Digital Reasoning helps them prioritize which customers to focus on, which can be useful when an adviser may have 200 or more clients. At the management level, Digital Reasoning can show if the firm has specific advisors getting a lot of complaints so it can respond with training and intervention. At a strategic level, it can sift through communications and identify if customers are looking for a specific offering or type of product.

Interesting approach.

The challenge, of course, will be to differentiate Digital Reasoning’s system from those available from dozens of vendors.

Digital Reasoning has investors who want a return on their $100 million. After 18 years, time may be compressing as solutions once perceived as sophisticated become more widely available and subject to price pressure.

Rumors of Amazon’s interest in this “wealth management” sector have reached us in Harrod’s Creek. That might be another reason why the low profile Digital Reasoning is stirring the PR waters using the capitalist’s tool, Forbes Magazine, once a source of “real” news.

Stephen E Arnold, November 12, 2018

Picking and Poking Palantir Technologies: A New Blood Sport?

April 25, 2018

My reaction to “Palantir Has Figured Out How to Make Money by Using Algorithms to Ascribe Guilt to People, Now They’re Looking for New Customers” is a sigh and a groan.

I don’t work for Palantir Technologies, although I have been a consultant to one of its major competitors. I do lecture about next generation information systems at law enforcement and intelligence centric conferences in the US and elsewhere. I also wrote a book called “CyberOSINT: Next Generation Information Access.” That study has spawned a number of “experts” who are recycling some of my views and research. A couple of government agencies have shortened my word “cyberosint” to “cyint.” In a manner of speaking, I have an information base which can be used to put the actions of companies which offer services similar to those available from Palantir in perspective.

The article in Boing Boing falls into the category of “yikes” analysis. Suddenly, it seems, the idea that cook book mathematical procedures can be used to make sense of a wide range of data is news. Let me assure you that this is not a new development, and Palantir is definitely not the first of the companies developing applications for law enforcement and intelligence professionals to land customers in financial and law firms.

A Palantir bubble gum card shows details about a person of interest and links to underlying data from which the key facts have been selected. Note that this is from an older version of Palantir Gotham. Source: Google Images, 2015

Decades ago, a friend of mine (Ev Brenner, now deceased) was one of the pioneers using technology and cook book math to make sense of oil and gas exploration data. How long ago? Think 50 years.

The focus of “Palantir Has Figured Out…” is that:

Palantir seems to be the kind of company that is always willing to sell magic beans to anyone who puts out an RFP for them. They have promised that with enough surveillance and enough secret, unaccountable parsing of surveillance data, they can find “bad guys” and stop them before they even commit a bad action.

Okay, that sounds good in the context of the article, but Palantir is just one vendor responding to the need for next generation information access tools from many commercial sectors.

CyberOSINT: Next Generation Information Access Explains the Tech Behind the Facebook, GSR, Cambridge Analytica Matter

April 5, 2018

In 2015, I published CyberOSINT: Next Generation Information Access. This is a quick reminder that the profiles of the vendors who have created software systems and tools for law enforcement and intelligence professionals remain timely.

The 200 page book provides examples, screenshots, and explanations of the tools which are available to analyze social media information. The book is the most comprehensive run down of the open source, commercial, and cloud based systems which can make sense of social media data, lawful intercept data, and general text and imagery content.

Companies described in this collection of “tools” include:

  • Cyveillance (now LookingGlass)
  • Decisive Analytics
  • IBM i2 (Analysts Notebook)
  • Geofeedia
  • Leidos
  • Palantir Gotham
  • and more than a dozen other developers of commercial and open source, high impact cyberOSINT tools.

The book is available for $49. Additional information is available on my Xenky.com Web site. You can buy the PDF book online at this link gum.co/cyberosint.

Get the CyberOSINT monograph. It’s the standard reference for practical and effective analysis, text analytics, and next generation solutions.

Stephen E Arnold, April 5, 2018

Insight into the Value of Big Data and Human Conversation

April 5, 2018

Big data and AI have been tackling tons of written material for years. But actual spoken human conversation has been largely overlooked in this world, mostly due to the difficulty of collecting this information. However, that is on the cusp of changing, as we discovered from a white paper from the Business and Local Government Resource Center, “The SENSEI Project: Making Sense of Human Conversations.”

According to the paper:

“In the SENSEI project we plan to go beyond keyword search and sentence-based analysis of conversations. We adapt lightweight and large coverage linguistic models of semantic and discourse resources to learn a layered model of conversations. SENSEI addresses the issue of multi-dimensional textual, spoken and metadata descriptors in terms of semantic, para-semantic and discourse structures.”

While some people are excited about the potential for advancement this kind of big data research presents, others are a little more nervous; for example, one or two of the 87 million individuals whose Facebook data found its way into the capable hands of GSR and Cambridge Analytica.

In fact, there is a growing movement, according to the Guardian, to scale back big data intrusion. What makes this interesting is that advocates are demanding that companies which harvest our information for big data purposes give some of that money back to the people from whom the info originates, not unlike how songwriters are given royalties every time their music is used for film or television. Putting a financial stipulation on big data collection could cause SENSEI to tap its brake pedal. Maybe?

Patrick Roland, April 5, 2018

Can Factmata Do What Other Text Analytics Firms Cannot?

April 2, 2018

Consider it a sign of the times—Information Management reveals, “Twitter, Craigslist Co-Founders Back Fact-Check Startup Factmata.” Writer Jeremy Kahn reports:

“Twitter Inc. co-founder Biz Stone and Craigslist Inc. co-founder Craig Newmark are investing in London-based fact-checking startup Factmata, the company said Thursday. … Factmata aims to use artificial intelligence to help social media companies, publishers and advertising networks weed out fake news, propaganda and clickbait. The company says its technology can also help detect online bullying and hate speech.”

Particularly amid concerns about the influence of Russian-backed propaganda in the U.S. and the U.K., several tech firms and other organizations have taken aim at false information online. What about Factmata has piqued the interest of leading investors? We’re informed:

“Dhruv Ghulati, Factmata’s chief executive officer, said the startup’s approach to fact-checking differs from other companies. While some companies are looking at a wide range of content, Factmata is initially focused exclusively on news. Many automated fact-checking approaches rely primarily on metadata – the information behind the scenes that describe online news items and other posts. But Factmata is using natural language processing to assess the actual words, including the logic being used, whether assertions are backed up by facts and whether those facts are attributed to reputable sources.”

Ghulati goes on to predict Facebook will be supplanted as users’ number one news source within the next decade. Apparently, we can look forward to the launch of Factmata’s own news service sometime “later this year.”

We will wait. We do want to point out that based on the information available to the Beyond Search and DarkCyber research teams, no vendor has been able to identify text which is weaponized at a high level of accuracy without the assistance of expensive, human, and vacation hungry subject matter experts.

Maybe Factmata will “mata”?

Cynthia Murrell, April 2, 2018
