The Roots of Common Machine Learning Errors

October 11, 2019

It is a big problem when faulty data analysis underpins big decisions or public opinion, and it is happening more often in the age of big data. Data Science Central outlines several “Common Errors in Machine Learning Due to Poor Statistics Knowledge.” Easy to make mistakes? Yep. Easy to manipulate outputs? Yep. We believe the obvious fix is to make math point and click: let developers decide for the clueless user.

Blogger Vincent Granville describes what he sees as the biggest problem:

“Probably the worst error is thinking there is a correlation when that correlation is purely artificial. Take a data set with 100,000 variables, say with 10 observations. Compute all the (99,999 * 100,000) / 2 cross-correlations. You are almost guaranteed to find one above 0.999. This is best illustrated in my article How to Lie with P-values (also discussing how to handle and fix it.) This is being done on such a large scale, I think it is probably the main cause of fake news, and the impact is disastrous on people who take for granted what they read in the news or what they hear from the government. Some people are sent to jail based on evidence tainted with major statistical flaws. Government money is spent, propaganda is generated, wars are started, and laws are created based on false evidence. Sometimes the data scientist has no choice but to knowingly cook the numbers to keep her job. Usually, these ‘bad stats’ end up being featured in beautiful but faulty visualizations: axes are truncated, charts are distorted, observations and variables are carefully chosen just to make a (wrong) point.”
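Granville's thought experiment is easy to try at a smaller scale. The sketch below is our own illustration, not code from his post: it fills a table with independent random noise, keeps the 10 observations he mentions, and reports the strongest correlation found between any two columns. The function names, the seed, and the reduced variable counts (300 at most, so it runs in plain Python) are assumptions made for the demonstration.

```python
import itertools
import math
import random


def standardize(values):
    """Center a list of numbers and scale it to unit length."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def max_spurious_correlation(n_variables, n_observations, seed=0):
    """Largest |Pearson r| among all pairs of independent random variables."""
    rng = random.Random(seed)
    # Every column is pure Gaussian noise, so no true relationship exists.
    columns = [
        standardize([rng.gauss(0.0, 1.0) for _ in range(n_observations)])
        for _ in range(n_variables)
    ]
    # For centered, unit-length vectors, Pearson's r is just the dot product.
    return max(
        abs(sum(a * b for a, b in zip(x, y)))
        for x, y in itertools.combinations(columns, 2)
    )


if __name__ == "__main__":
    for n_variables in (10, 100, 300):
        r = max_spurious_correlation(n_variables, n_observations=10)
        print(f"{n_variables:>4} variables -> max |r| = {r:.3f}")
```

Because the seed is fixed, each larger run contains the smaller run's columns, so the reported maximum can only stay the same or grow as variables are added. That is the mechanism behind the quote: with few observations and many variables, searching every pair will eventually turn up a strong "relationship" in data that has none.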

Granville goes on to specify several other sources of mistakes. For example, analysts sometimes take the accuracy of their data sets for granted instead of performing a walk-forward test. Relying too much on old standbys like R-squared measures and normal distributions can also lead to errors. Furthermore, he reminds us, scale-invariant modeling techniques must be used when data is expressed in different units (like yards and miles). Finally, one must be sure to handle missing data correctly: do not assume bridging the gap with an average will produce accurate results. See the post for more explanation on each of these points.
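Two of these points lend themselves to a quick sketch. The code below is our own minimal illustration under stated assumptions, not Granville's method. `walk_forward_splits` and `mean_impute` are hypothetical helper names, and the readings are made-up toy numbers. The first helper shows the idea of a walk-forward test, where a model is only ever evaluated on data that comes after its training data. The second shows one way mean imputation distorts results: it shrinks the variance of the data.

```python
import statistics


def walk_forward_splits(n_samples, initial_train, test_size):
    """Yield (train_indices, test_indices) pairs that never look ahead in time."""
    start = initial_train
    while start + test_size <= n_samples:
        # Train on everything before `start`, test on the next block only.
        yield list(range(start)), list(range(start, start + test_size))
        start += test_size


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    for train, test in walk_forward_splits(n_samples=10, initial_train=4, test_size=2):
        print("train on", train, "-> test on", test)

    # Toy data invented for this example; None marks a missing reading.
    readings = [2.0, None, 4.0, None, 9.0, 1.0, None, 8.0]
    observed = [v for v in readings if v is not None]
    print("variance of observed values:   ", statistics.pvariance(observed))
    print("variance after mean imputation:", statistics.pvariance(mean_impute(readings)))
```

Every imputed value sits exactly on the mean, so it adds nothing to the spread while still counting as an observation. Anything computed from the filled-in column, such as a correlation or a confidence interval, inherits that artificially low variance.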

Cynthia Murrell, October 11, 2019
