The Roots of Common Machine Learning Errors

October 11, 2019

It is a big problem when faulty data analysis underpins big decisions or public opinion, and it is happening more often in the age of big data. Data Science Central outlines several “Common Errors in Machine Learning Due to Poor Statistics Knowledge.” Easy to make mistakes? Yep. Easy to manipulate outputs? Yep. We believe the obvious fix is to make math point and click—let developers decide for a clueless person.

Blogger Vincent Granville describes what he sees as the biggest problem:

“Probably the worst error is thinking there is a correlation when that correlation is purely artificial. Take a data set with 100,000 variables, say with 10 observations. Compute all the (99,999 * 100,000) / 2 cross-correlations. You are almost guaranteed to find one above 0.999. This is best illustrated in my article How to Lie with P-values (also discussing how to handle and fix it.) This is being done on such a large scale, I think it is probably the main cause of fake news, and the impact is disastrous on people who take for granted what they read in the news or what they hear from the government. Some people are sent to jail based on evidence tainted with major statistical flaws. Government money is spent, propaganda is generated, wars are started, and laws are created based on false evidence. Sometimes the data scientist has no choice but to knowingly cook the numbers to keep her job. Usually, these ‘bad stats’ end up being featured in beautiful but faulty visualizations: axes are truncated, charts are distorted, observations and variables are carefully chosen just to make a (wrong) point.”
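The effect Granville describes is easy to try on a smaller scale. The sketch below is a rough illustration, not his method: it assumes independent Gaussian noise, uses 400 variables instead of 100,000 to keep the run short, and the function names are made up for this example.

```python
import math
import random


def standardize(values):
    """Center a column and scale it to unit length, so that the dot
    product of two standardized columns is their Pearson correlation."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def max_abs_correlation(n_variables, n_observations, seed=0):
    """Generate pure-noise variables and return the largest absolute
    pairwise correlation found among them."""
    rng = random.Random(seed)
    columns = [
        standardize([rng.gauss(0.0, 1.0) for _ in range(n_observations)])
        for _ in range(n_variables)
    ]
    best = 0.0
    for i in range(n_variables):
        for j in range(i + 1, n_variables):
            r = sum(a * b for a, b in zip(columns[i], columns[j]))
            best = max(best, abs(r))
    return best


# 400 unrelated variables, 10 observations each: about 80,000 pairs.
strongest = max_abs_correlation(n_variables=400, n_observations=10)
print(f"strongest 'relationship' found in pure noise: {strongest:.3f}")
```

Every column here is random noise, yet with so few observations and so many pairs, the largest correlation typically comes out very high. The exact figure depends on the seed, and pushing the variable count toward Granville's 100,000 only makes the best spurious match look stronger.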

Granville goes on to specify several other sources of mistakes. Analysts sometimes take for granted the accuracy of their data sets, for example, instead of performing a walk-forward test. Relying too much on the old standbys R-squared measures and normal distributions can also lead to errors. Furthermore, he reminds us, scale-invariant modeling techniques must be used when data is expressed in different units (like yards and miles). Finally, one must be sure to handle missing data correctly—do not assume bridging the gap with an average will produce accurate results. See the post for more explanation on each of these points.
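The missing-data warning can be shown with a few lines of arithmetic. This is a minimal sketch with made-up numbers, assuming gaps are marked with None. It illustrates one known side effect of filling gaps with the average (the spread of the data is understated), not the specific remedy Granville recommends.

```python
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical measurements; None marks a missing reading.
readings = [2.0, 4.0, None, 9.0, None, 15.0]

observed_only = [v for v in readings if v is not None]
imputed = mean_impute(readings)

# Sample variance before and after filling the gaps with the average.
spread_observed = statistics.variance(observed_only)
spread_imputed = statistics.variance(imputed)

print("variance of observed values:", spread_observed)
print("variance after mean filling:", spread_imputed)
```

The filled-in values sit exactly at the mean, so the average is unchanged while the sample variance drops. Anything computed downstream will look more certain than the data justify.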

Cynthia Murrell, October 11, 2019

Information and the More Exposure Effect

October 1, 2019

The article “Why Do Older People Hate New Music?” caught my attention. Music is not a core interest at DarkCyber. We do mention in our Dark Web 2 lecture that beat sharing and selling sites which permit message exchange are an important source of social content.

This “oldsters hate new” angle is important. The write up contains this assertion:

One of the most researched laws of social psychology is something called the “mere exposure effect.” In a nutshell, it means that the more we’re exposed to something, the more we tend to like it. This happens with people we know, the advertisements we see and, yes, the songs we listen to.

Like many socio-psycho-econo assertions, this idea sounds plausible. Let’s assume that it is correct and apply the insight to online information.

Online news services purport to provide news for me, world news, and other categories. When I review outputs from several services like SmartNews, News360, and Google News, for example, it is clear that the services look alike and convey the same information.

If the exposure point is accurate, these services are conditioning me to accept and feel comfortable with specific information. SmartNews shows me soccer news, reports about cruise ship deaths, and write ups which underscore the antics of certain elected officials.

These services do not coordinate, but they do rely on widely used numerical recipes and feedback about what I click on or ignore. What’s interesting is that each of these services delivers a package of content which reflects each service’s view of what interests me.

The problem is that I look at less and less content on these services. Familiarity means that I don’t need to know more about certain topics.

Consequently, as the services become smarter, I move away from these services.

The psychological write up reports:

Psychology research has shown that the emotions that we experience as teens seem more intense than those that come later. We also know that intense emotions are associated with stronger memories and preferences. All of this might explain why the songs we listen to during this period become so memorable and beloved.

Is familiarity making me more content with online news? Sorry, no.

The familiarity makes it easier to recognize that significant content is not being presented. That’s an interesting issue if my reaction is not peculiar to me.

How does one find additional information about the unfamiliar? Search does not deliver effectively in my opinion.

Stephen E Arnold, October 2, 2019

Should Social Media Algorithms be Used to Predict Crime?

September 18, 2019

Do we want Thought Police? Because this is how you get Thought Police. Though tragedies like the recent mass shootings in El Paso and Dayton are horrifying, some “solutions” are bound to do more harm than good. President Trump’s recent call for social-media companies to predict who will become a mass shooter so authorities can preemptively move against them is right out of Orwell’s 1984. Digital Trends asks, “Can Social Media Predict Mass Shootings Before They Happen?” Technically, it probably can, but with limited accuracy. Journalist Mathew Katz writes:

“Companies like Google, Facebook, Twitter, and Amazon already use algorithms to predict your interests, your behaviors, and crucially, what you like to buy. Sometimes, an algorithm can get your personality right – like when Spotify somehow manages to put together a playlist full of new music you love. In theory, companies could use the same technology to flag potential shooters. ‘To an algorithm, the scoring of your propensity [to] purchase a particular pair of shoes is not very different from the scoring of your propensity to become a mass murderer—the main difference is the data set being scored,’ wrote technology and marketing consultant Shelly Palmer in a newsletter on Sunday. But preventing mass shootings before they happen raises some thorny legal questions: how do you determine if someone is just angry online rather than someone who could actually carry out a shooting? Can you arrest someone if a computer thinks they’ll eventually become a shooter?”

That is what we must decide as a society. We also need to ask whether algorithms are really up to the task. We learn:

“The Partnership on AI, an organization looking at the future of artificial intelligence, conducted an intensive study on algorithmic tools that try to ‘predict’ crime. Their conclusion? ‘These tools should not be used alone to make decisions to detain or to continue detention.’”

But we all know that once people get an easy-to-use tool, the ease-of-use can quickly trump accuracy. Think of how often you see ads online for products you would never buy, Katz prompts. Then consider how it would feel to be arrested for a crime you would never commit.

Cynthia Murrell, September 18, 2019

Handy Visual Reference of Data Model Evaluation Techniques

September 12, 2019

There are many ways to evaluate one’s data models, and Data Science Central presents an extensive yet succinct reference in visual form—“Model Evaluation Techniques in One Picture.” Together, the image and links make for a useful resource. Creator Stephanie Glen writes:

“The sheer number of model evaluation techniques available to assess how good your model is can be completely overwhelming. As well as the oft-used confidence intervals, confusion matrix and cross validation, there are dozens more that you could use for specific situations, including McNemar’s test, Cochran’s Q, Multiple Hypothesis testing and many more. This one picture whittles down that list to a dozen or so of the most popular. You’ll find links to articles explaining the specific tests and procedures below the image.”

Glen may be underselling her list of links after the graphic; it would be worth navigating to her post for that alone. The visual, though, elegantly simplifies a complex topic. It is divided into these subtopics: general tests and tools; regression; classification: visual aids; and classification: statistics and tools. Interested readers should check it out; you might just decide to bookmark it for future reference, too.

Cynthia Murrell, September 12, 2019

Disrupting Neural Nets: Adversarial Has a More Friendly Spin Than Weaponized

August 28, 2019

In my lecture about manipulation of algorithms, I review several methods for pumping false signals into a data set in order to skew outputs.

The basic idea is that if an entity generates content pulses which are semantically or otherwise related in a way the smart software “counts”, then the outputs are altered.

A good review of some of these flaws in neural network classifiers appears in “How Reliable Are Neural Networks Classifiers Against Unforeseen Adversarial Attacks.”

DarkCyber noted this statement in the write up:

attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would interpret as a ‘yield’ or other sign. A confused car on a busy day is a potential catastrophe packed in a 2000 pound metal box.

Dramatic, yes. Far-fetched? Not too much.

Providing weaponized data objects to smart software can screw up the works. Examples range from adversarial clothing, discussed in the DarkCyber video program for August 27, 2019, to the wonky predictions that Google makes when displaying personalized ads.

The article reviews an expensive and time consuming method for minimizing the probability of weaponized data mucking up the outputs.

The problem, of course, is that smart software is supposed to handle the tricky, expensive, and slow process of assembling and refining a training set of data. Talk about smart software is really cheap. Delivering systems which operate in the real world is another kettle of what appear to be fish as determined by a vector’s norm.

The Analytics India article is neither broad nor deep. It does raise awareness of the rather interesting challenges which lurk within smart software.

Understanding how smart software can get off base and drift into LaLa Land begins with identifying the problem.

Smart software cannot learn and discriminate with the type of accuracy many people assume is delivered. Humans assume a system output is 99 percent accurate; for example, the answer to “Is it raining?”

The reality is that adversarial inputs can reduce the accuracy rate significantly.

On good days, smart software can hit 85 to 90 percent accuracy. That’s good enough unless a self driving car hits you. But with adversarial or weaponized data, that accuracy rate can drop below the 65 percent level which most of the systems DarkCyber has tested can reliably achieve.

To sum up, smart software makes mistakes. Weaponized data fed into smart software can increase the likelihood of an error.

The methods can be used in commercial and military theaters.

Neither humans nor software can prevent this from happening on a consistent basis.

So what? Yes, that’s a good question.

Stephen E Arnold, August 29, 2019

Smart Software but No Mention of Cathy O’Neil

August 21, 2019

I read “Flawed Algorithms Are Grading Millions of Students’ Essays.” I also read Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil, which was published in 2016. My recollection is that Ms. O’Neil made appearances on some podcasts; for instance, Econ Talk, a truly exciting economics-centric program. It is indeed possible that the real news outfit Motherboard/Vice did not. Before I read the article, I did a search of the text for “O’Neil” and “Weapons of Math Destruction.” I found zero hits. Why? The author, editor, and publisher did not include a pointer to her book. Zippo. There’s a reference to the “real news” outfit ProPublica. There’s a reference to the Vice investigation. Would this approach work in freshman composition with essays graded by a harried graduate student?

Here’s the point. Ms. O’Neil did a very good job of explaining the flaws of automated systems. Recycling is the name of the game. After all, DarkCyber is recycling this “original” essay containing “original” research, isn’t it?

I noted this passage in the write up:

Research is scarce on the issue of machine scoring bias, partly due to the secrecy of the companies that create these systems. Test scoring vendors closely guard their algorithms, and states are wary of drawing attention to the fact that algorithms, not humans, are grading students’ work. Only a handful of published studies have examined whether the engines treat students from different language backgrounds equally, but they back up some critics’ fears.

Yeah, but there is a relatively recent book on the subject.

I noted this statement in the write up:

Here’s the first sentence from the essay addressing technology’s impact on humans’ ability to think for themselves…

I like the “ability to think for themselves.”

So do I. In fact, I would suggest that this write up is an example of the loss of this ability.

A mere 2,000 words and not a room or a thought or a tiny footnote about Ms. O’Neil. Flawed? I leave it to you to decide.

Stephen E Arnold, August 21, 2019

More on Biases in Smart Software

August 7, 2019

Bias in machine learning strikes again. Citing a study performed by Facebook AI Research, The Verge reports, “AI Is Worse at Identifying Household Items from Lower-Income Countries.” Researchers studied the accuracy of five top object-recognition algorithms, Microsoft Azure, Clarifai, Google Cloud Vision, Amazon Rekognition, and IBM Watson, using a dataset of objects from around the world. Writer James Vincent tells us:

“The researchers found that the object recognition algorithms made around 10 percent more errors when asked to identify items from a household with a $50 monthly income compared to those from a household making more than $3,500. The absolute difference in accuracy was even greater: the algorithms were 15 to 20 percent better at identifying items from the US compared to items from Somalia and Burkina Faso.”

Not surprisingly, researchers point to the usual suspect—the similar backgrounds and financial brackets of most engineers who create algorithms and datasets. Vincent continues:

“In the case of object recognition algorithms, the authors of this study say that there are a few likely causes for the errors: first, the training data used to create the systems is geographically constrained, and second, they fail to recognize cultural differences. Training data for vision algorithms, write the authors, is taken largely from Europe and North America and ‘severely under sample[s] visual scenes in a range of geographical regions with large populations, in particular, in Africa, India, China, and South-East Asia.’ Similarly, most image datasets use English nouns as their starting point and collect data accordingly. This might mean entire categories of items are missing or that the same items simply look different in different countries.”

Why does this matter? For one thing, it means object recognition performs better for certain audiences than others in systems as benign as photo storage services, as serious as security cameras, and as crucial as self-driving cars. Not only that, we’re told, the biases found here may be passed into other types of AI that will not receive similar scrutiny down the line. As AI products pick up speed throughout society, developers must pay more attention to the data on which they train their impressionable algorithms.

Cynthia Murrell, August 7, 2019

Trovicor: A Slogan as an Equation

August 2, 2019

We spotted this slogan on the Trovicor Web site:

The Trovicor formula: Actionable Intelligence = f (data generation; fusion; analysis; visualization)

The function consists of four buzzwords used by vendors of policeware and intelware:

  • Data generation (which suggests metadata assigned to intercepted, scraped, or provided content objects)
  • Fusion (which means in DarkCyber’s world a single index to disparate data)
  • Analysis (numerical recipes to identify patterns or other interesting data)
  • Virtualization (use of technology to replace old school methods like 1950s’ style physical wire taps, software defined components, and software centric widgets).

The buzzwords make it easy to identify other companies providing somewhat similar services.

Trovicor maintains a low profile. But obtaining open source information about the company may be a helpful activity.

Stephen E Arnold, August 2, 2019

Smart Software: About Those Methods?

July 23, 2019

An interesting paper germane to machine learning and smart software is available online. The title? “Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches”.

The punch line for this academic document is, in the view of DarkCyber:

No way.

Your view may be different, but you will have to read the document, check out the diagrams, and scan the supporting information available on GitHub.

The main idea is:

In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. Overall, our work sheds light on a number of potential problems in today’s machine learning scholarship and calls for improved scientific practices in this area.

So back to my summary, “No way.”

Here’s an “oh, how interesting” chart. Note the spikes:


Several observations:

  1. In an effort to get something to work, those who think in terms of algorithms take shortcuts; that is, operate in a clever way to produce something that’s good enough. “Good enough” is pretty much a C grade or “passing.”
  2. Math whiz hand waving and MBA / lawyer ignorance of what human judgments operate within an algorithmic operation guarantee that “good enough” becomes “Let’s see if this makes money.” You can substitute “reduce costs” if you wish. No big difference.
  3. Users accept whatever outputs a smart system delivers. Most people believe that “computers are right.” There’s nothing DarkCyber can do to make people more aware.
  4. Algorithms can be fiddled in the following ways: [a] the numerical recipes and the idiosyncrasies of calculation just do their thing; for example, drift off in a weird direction or produce the equivalent of white noise; [b] the outputs get skewed because of the data flowing into the system automagically (very risky) or via human subject matter experts (also very risky); [c] the programmers implementing the algorithm focus on the code, speed, and deadline, not how the outputs flow; for example, k-means can be really mean and Bayesian methods can bay at the moon.

Net net: Worth reading this analysis.

Stephen E Arnold, July 23, 2019

Need a Machine Learning Algorithm?

July 17, 2019

The Web site published “101 Machine Learning Algorithms for Data Science with Cheat Sheets.” The write up recycles information from DataScienceDojo, and some of the information looks familiar. But lists of algorithms are not original. They are useful. What sets this list apart is the inclusion of “cheat sheets.”

What’s a cheat sheet?

In this particular collection, a cheat sheet looks like this:

[Image: entry example]

You can see the entry for the algorithm: Bernoulli Naive Bayes with a definition. The “cheat sheet” is a link to a Python example. In this case, the example is a link to an explanation on the Chris Albon blog.
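For readers who have not met the algorithm, here is a minimal Bernoulli Naive Bayes sketch in plain Python. It is not the code from the Chris Albon blog or from the cheat sheet; the toy data, the function names, and the add-one smoothing are assumptions chosen for illustration.

```python
import math


def train_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate, for each class, the log prior and the smoothed
    probability that each binary feature equals 1."""
    n_features = len(rows[0])
    model = {}
    for label in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        log_prior = math.log(len(members) / len(rows))
        # Add-one (Laplace) smoothing keeps probabilities away from 0 and 1.
        probs = [
            (sum(row[j] for row in members) + alpha) / (len(members) + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (log_prior, probs)
    return model


def predict(model, row):
    """Return the class with the highest log posterior score."""
    def log_score(label):
        log_prior, probs = model[label]
        return log_prior + sum(
            math.log(p) if x else math.log(1.0 - p)
            for x, p in zip(row, probs)
        )
    return max(model, key=log_score)


# Hypothetical toy data: [contains "free", contains "meeting"].
rows = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = train_bernoulli_nb(rows, labels)
print(predict(model, [1, 0]))
print(predict(model, [0, 1]))
```

Each feature is treated as an independent yes/no signal, which is the "naive" part. A production system would more likely call a library implementation than hand-roll the arithmetic.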

What’s interesting is that the 101 algorithms are grouped under 18 categories. Of these 18, Bayes and derivative methods total five.

No big deal, but in my lectures about widely used algorithms I highlight 10, mostly because it is a nice round number. The point is that most of the analytics vendors use the same basic algorithms. Variations among products built on these algorithms are significant.

As analytics systems become more modular (that is, like Lego blocks), it seems that the trajectory of development will be to select, preconfigure thresholds, and streamline processes in a black box.

Is this good or bad?

It depends on whether one’s black box is a dominant solution or platform.

Will users know that this almost inevitable narrowing has upsides and downsides?


Stephen E Arnold, July 17, 2019
