Google Trends Used to Reveal Misspelled Wirds or Is It Words?

November 25, 2019

We spotted a listing of the most misspelled words in each of the USA’s 50 states. Too bad Puerto Rico. Kentucky’s most misspelled word is “ninety.” Navigate to Considerable and learn what residents cannot spell. How often? Silly kweston.

The listing includes some bafflers and may reveal what can go wrong with data from an online ad sales data collection system; for example:

  • Washington, DC (which is not a state in DarkCyber’s book) cannot spell “enough”; for example, “enuf already with these televised hearings and talking heads”
  • Idaho residents cannot spell embarrassed, which as listeners to Kara Swisher know has two r’s and two s’s. Helpful that.
  • Montana residents cannot spell “comma.” Do those in Montana use commas?
  • And not surprisingly, those in Tennessee cannot spell “intelligent.” Imagine that!

What happens if one trains smart software on these data?

Sumthink mite go awf the railz.

Stephen E Arnold, November 25, 2019

Gender Bias in Old Books. Rewrite Them?

October 9, 2019

Here is an interesting use of machine learning. Salon tells us “What Reading 3.5 Million Books Tells Us About Gender Stereotypes.” Researchers led by University of Copenhagen’s Dr. Isabelle Augenstein analyzed 11 billion English words in literature published between 1900 and 2008. Not surprisingly, the results show that adjectives about appearance were most often applied to women (“beautiful” and “sexy” top the list), while men were more likely to be described by character traits (“righteous,” “rational,” and “brave” were most frequent). Writer Nicole Karlis describes how the team approached the analysis:

“Using machine learning, the researchers extracted adjectives and verbs connected to gender-specific nouns, like ‘daughter.’ Then the researchers analyzed whether the words had a positive, negative or neutral point of view. The analysis determined that negative verbs associated with appearance are used five times more for women than men. Likewise, positive and neutral adjectives relating to one’s body appearance occur twice as often in descriptions of women. The adjectives used to describe men in literature are more frequently ones that describe behavior and personal qualities.

“Researchers noted that, despite the fact that many of the analyzed books were published decades ago, they still play an active role in fomenting gender discrimination, particularly when it comes to machine learning sorting in a professional setting. ‘The algorithms work to identify patterns, and whenever one is observed, it is perceived that something is “true.” If any of these patterns refer to biased language, the result will also be biased,’ Augenstein said. ‘The systems adopt, so to speak, the language that we people use, and thus, our gender stereotypes and prejudices.’” Augenstein explained this can be problematic if, for example, machine learning is used to sift through employee recommendations for a promotion.”

Karlis does list some caveats to the study—it does not factor in who wrote the passages, what genre they were pulled from, or how much gender bias permeated society at the time. The research does affirm previous results, like the 2011 study that found 57% of central characters in children’s books are male.

Dr. Augenstein hopes her team’s analysis will raise awareness about the impact of gendered language and stereotypes on machine learning. If they choose, developers can train their algorithms on less biased materials or program them to either ignore or correct for biased language.

Cynthia Murrell, October 9, 2019

Trovicor: A Slogan as an Equation

August 2, 2019

We spotted this slogan on the Trovicor Web site:

The Trovicor formula: Actionable Intelligence = f (data generation; fusion; analysis; visualization)

The function consists of four buzzwords used by vendors of policeware and intelware:

  • Data generation (which suggests metadata assigned to intercepted, scraped, or provided content objects)
  • Fusion (which means in DarkCyber’s world a single index to disparate data)
  • Analysis (numerical recipes to identify patterns or other interesting data
  • Virtualization (use of technology to replace old school methods like 1950s’ style physical wire taps, software defined components, and software centric widgets).

The buzzwords make it easy to identify other companies providing somewhat similar services.

Trovicor maintains a low profile. But obtaining open source information about the company may be a helpful activity.

Stephen E Arnold, August 2, 2019

Need a Machine Learning Algorithm?

July 17, 2019

r entry

The R-Bloggers.com Web site published “101 Machine Learning Algorithms for Data Science with Cheat Sheets.” The write up recycles information from DataScienceDojo, and some of the information looks familiar. But lists of algorithms are not original. They are useful. What sets this list apart is the inclusion of “cheat sheets.”

What’s a cheat sheet?

In this particular collection, a cheat sheet looks like this:

r entry example

You can see the entry for the algorithm: Bernoulli Naive Bayes with a definition. The “cheat sheet” is a link to a python example. In this case, the example is a link to an explanation on the Chris Albon blog.

What’s interesting is that the 101 algorithms are grouped under 18 categories. Of these 18, Bayes and derivative methods total five.

No big deal, but in my lectures about widely used algorithms I highlight 10, mostly because it is a nice round number. The point is that most of the analytics vendors use the same basic algorithms. Variations among products built on these algorithms are significant.

As analytics systems become more modular — that  is, like Lego blocks — it seems that the trajectory of development will be to select, preconfigure thresholds, and streamline processes in a black box.

Is this good or bad?

It depends on whether one’s black box is a dominant solution or platform?

Will users know that this almost inevitable narrowing has upsides and downsides?

Nope.

Stephen E Arnold, July 17, 2019

New Jargon: Consultants, Start Your Engines

July 13, 2019

I read “What Is “Cognitive Linguistics“? The article appeared in Psychology Today. Disclaimer: I did some work for this outfit a long time ago. Anybody remember Charles Tillinghast, “CRM” when it referred to people, not a baloney discipline for a Rolodex filled with sales lead, and the use of Psychology Today as a text in a couple of universities? Yeah, I thought not. The Ziff connection is probably lost in the smudges of thumb typing too.

Onward: The write up explains a new spin on psychology, linguistics, and digital interaction. The jargon for this discipline or practice, if you will is:

Cognitive Linguistics

I must assume that the editorial processes at today’s Psychology Today are genetically linked to the procedures in use in — what was it, 1972? — but who knows.

excited fixed

Here’s the definition:

The cognitive linguistics enterprise is characterized by two key commitments. These are:
i) the Generalization Commitment: a commitment to the characterization of general principles that are responsible for all aspects of human language, and
ii) the Cognitive Commitment: a commitment to providing a characterization of general principles for language that accords with what is known about the mind and brain from other disciplines. As these commitments are what imbue cognitive linguistics with its distinctive character, and differentiate it from formal linguistics.

If you are into psychology and figuring out how to manipulate people or a Google ranking, perhaps this is the intellectual gold worth more than stolen treasure from Montezuma.

Several observations:

  1. I eagerly await an estimate from IDC for the size of the cognitive linguistics market, and I am panting with anticipation for a Garnter magic quadrant which positions companies as leaders, followers, outfits which did not pay for coverage, and names found with a Google search at Starbuck’s south of the old PanAm Building. Cognitive linguistics will have to wait until the two giants of expertise figure out how to define “personal computer market”, however.
  2. A series of posts from Dave Amerland and assorted wizards at SEO blogs which explain how to use the magic of cognitive linguistics to make a blog page — regardless of content, value, and coherence — number one for a Google query.
  3. A how to book from Wiley publishing called “Cognitive Linguistics for Dummies” with online reference material which may or many not actually be available via the link in the printed book
  4. A series of conferences run by assorted “instant conference” organizers with titles like “The Cognitive Linguistics Summit” or “Cognitive Linguistics: Global Impact”.

So many opportunities. Be still, my heart.

Cognitive linguistics — it’s time has come. Not a minute too soon for a couple of floundering enterprise search vendors to snag the buzzword and pivot to implementing cognitive linguistics for solving “all your information needs.” Which search company will embrace this technology: Coveo, IBM Watson, Sinequa?

DarkCyber is excited.

Stephen E Arnold, July 13, 2019

Sentiment Analysis: Can a Monkey Can Do It?

June 27, 2019

Sentiment analysis is a machine learning tool companies are employing to understand how their customers feel about their services and products. It is mainly deployed on social media platforms, including Facebook, Instagram, and Twitter. The Monkey Learn blog details how sentiment analysis is specifically being used on Twitter in the post, “Sentiment Analysis Of Twitter.”

Using sentiment analysis is not a new phenomenon, but there are still individuals unaware of the possible power at their fingertips. Monkey Learn specializes in customer machine learning solutions that include intent, keywords, and, of course, sentiment analysis. The post is a guide on the basics of sentiment analysis: what it is, how it works, and real life examples. Monkey Learn defines sentiment analysis as:

Sentiment analysis (a.k.a opinion mining) is the automated process of identifying and extracting the subjective information that underlies a text. This can be either an opinion, a judgment, or a feeling about a particular topic or subject. The most common type of sentiment analysis is called ‘polarity detection’ and consists in classifying a statement as ‘positive’, ‘negative’ or ‘neutral’.”

It also relies on natural language processing (NLP) to understand the information’s context.

Monkey Learn explains that sentiment analysis is important because most of the world’s digital data is unstructured. Machine learning with NLP’s assistance can quickly sort large data sets and detect their polarity. Monkey Learn promises with their sentiment analysis to bring their customers scalability, consistent criteria, and real-time analysis. Many companies are using Twitter sentiment analysis for customer service, brand monitoring, market research, and political campaigns.

The article is basically a promotional piece for Monkey Learn, but it does work as a starting guide for sentiment analysis.

Whitney Grace, June 27, 2019

Into R? A List for You

May 12, 2019

Computerworld, which runs some pretty unusual stories, published “Great R Packages for Data Import, Wrangling and Visualization.” “Great” is an interesting word. In the lingo of Computerworld, a real journalist did some searching, talked to some people, and created a list. As it turns out, the effort is useful. Looking at the Computerworld table is quite a bit easier than trying to dig information out of assorted online sources. Plus, people are not too keen on the phone and email thing now.

The listing includes a mixture of different tools, software, and utilities. There are more than 80 listings. I wasn’t sure what to make of XML’s inclusion in the list, but, the source is Computerworld, and I assume that the “real” journalist knows much more than I.

Two observations:

  • Earthworm lists without classification or alphabetization are less useful to me than listings which are sorted by tags and alphabetized within categories. Excel does perform this helpful trick.
  • Some items in the earthworm list have links and others do not. Consistency, I suppose, is the hobgoblin of some types of intellectual work
  • An indication of which item is free or for fee would be useful too.

Despite these shortcomings, you may want to download the list and tuck it into your “Things I love about R” folder.

Stephen E Arnold, May 12, 2019

Cognitive Engine: What Powers the USAF Platform?

May 1, 2019

Last week I met with a university professor who does cutting edge data and text mining and also shepherds PhD candidates. In the course of our 90 minute conversation, I noticed some reference books which had SPSS on the cover. The procedures implemented at this particular university worked well.

After the meeting, I was thinking about the newer approaches which are becoming publicly available. The USAF has started talking about its “cognitive engine.” I thought I heard at a conference that some technology developed developed by Nutonian, now part of a data and text mining roll up, had influenced the project.

The Nutonian system is predictive with a twist. The person using the system can rely on the smart software to perform the numerous intermediary steps required when using more traditional systems.

The article “The US Air Force Will Showcase Its Many Technological Advances in the USAF Lab Day.” The original is in Chinese but Freetranslate.com can help out if don’t read Chinese or have a close by contact who does.

The USAF wants to deploy a cognitive platform into which vendors can “plug in” their systems. The Chinese write up reported:

AFRL’s Autonomy Capability Team 3 (ACT3) is developing artificial intelligence on a large scale through the development and application of the Air Force Cognitive Engine (ACE), an artificial intelligence software platform. Put into application. The software platform architecture reduces the barriers to entry for artificial intelligence applications and provides end-user applications with the ability to cover a range of artificial intelligence problem types. In the application, the software platform connects educated end users, developers, and algorithms implemented in software, task data, and computing hardware to the process of creating an artificial intelligence solution.

The article also provides some interesting details which were not included in some of the English language reports about this session; for example:

  • Smart rockets
  • An agile pod
  • Pathogen identification.

A couple of observations:

First, obviously the Chinese writer had access to information about the Lab Day demonstrations.

Second, the cognitive platform does not mention foundation vendors, which I understand.

Third, it would be delightful to visit a university and see documentation and information about the next-generation predictive analytics systems available.

Stephen E Arnold, May 1, 2019

Here’s what the Chinese writer reported about the

Nosing Beyond the Machine Learning from Human Curated Data Sets: Autonomy 1996 to Smart Software 2019

April 24, 2019

How does one teach a smart indexing system like Autonomy’s 1996 “neurodynamic” system?* Subject matter experts (SMEs) assembled training collection of textual information. The article and other content would replicate the characteristics of the content which the Autonomy system would process; that is, index and make searchable or analyzable. The work was important. Get the training data wrong and the indexing system would assign metadata or “index terms” and “category names” which could cause a query to generate results the user could perceive as incorrect.

image

How would a licensee adjust the Autonomy “black box”? (Think of my reference to Autonomy and search as a way of approaching “smart software” and “artificial intelligence.”)

The method was to perform re-training. The approach was practical and for most content domains, the re-training worked. It was an iterative process. Because the words in the corpus fed into the “black box” included new words, concepts, bound phrases, entities, and key sequences, there were several functions integrated into the basic Autonomy system as it matured. Examples ranged from support for term lists (controlled vocabularies) and dictionaries.

The combination of re-training and external content available to the system allowed Autonomy to deliver useful outputs.

Where the optimal results departed from the real world results usually boiled down to several factors, often working in concert. First, licensees did not want to pay for re-training. Second, maintenance of the external dictionaries was necessary because new entities arrive with reasonable frequency. Third, testing and organizing the freshening training sets and the editorial work required to keep dictionaries ship shape was too expensive, time consuming, and tedious.

Not surprisingly, some licensees grew unhappy with their Autonomy IDOL (integrated data operating layer) system. That, in my opinion, was not Autonomy’s fault. Autonomy explained in the presentations I heard what was required to get a system up and running and outputting results that could easily hit 80 percent or higher on precision and recall tests.

The Autonomy approach is widely used. In fact, wherever there is a Bayesian system in use, there is the training, re-training, external knowledge base demand. I just took a look at Haystax Constellation. It’s Bayesian and Haystax makes it clear that the “model” has to be training. So what’s changed between 1996 and 2019 with regards to Bayesian methods?

Nothing. Zip. Zero.

Read more

Text Analysis Toolkits

March 16, 2019

One of the DarkCyber team spotted a useful list, published by MonkeyLearn. Tucked into a narrative called “Text Analysis: The Only Guide You’ll Ever Need” was a list of natural language processing open source tools, programming languages, and software. Each description is accompanied with links and in several cases comments. See the original article for more information.

Caret

CoreNLP

Java

Keras

mlr

NLTK

OpenNLP

Python

SpaCy

Scikit-learn

TensorFlow

PyTorch

R

Weka

Stephen E Arnold, March 16, 2019

Next Page »

  • Archives

  • Recent Posts

  • Meta