Machine Learning: Cheating Is a Feature?

August 9, 2022

I read “MIT Boffins Make AI Chips 1 Million Times Faster Than the Synapses in the Human Brain. Plus: Why ML Research Is Difficult to Produce – and Army Lab Extends AI Contract with Palantir.” I dismissed the first item as some of the quantum supremacy stuff output by high school science club types. I ignored the Palantir Technologies’ item because the US Army has to make a distributed common ground system work and leave resolution to the next team rotation. Good or bad, Palantir has the ball. But the middle item in the club sandwich article contains a statement I found particularly interesting.

If you have followed our comments about smart software, you know we take a pragmatic view: getting “AI/ML” systems to work in the 80 to 95 percent confidence range in a consistent way, even when new “content objects” are fed into the zeros and ones. To get off on the right foot, human subject matter experts assemble training data which reflects the content the system will be processing in the real world. The way smart software is expected to work is that it learns… on its own… sort of. It is very time consuming and very expensive to create hand crafted training sets and then “update” the system with the affected module. What if the previously processed content has to be run through the system again? Not too many outfits have the funds, time, resources, and patience for that.

Thus, today’s AI/ML forward leaning cost conscious wizards want to use synthetic data, minimize the human SMEs’ cost and time, and do everything auto-magically. Sounds good. Yes, and the ideas make great PowerPoint decks too.

The sentence in the article which caught my attention is this one:

Data leakage occurs when the data used to train an algorithm can leak into its testing; when its performance is assessed the model seems better than it actually is because it has already, in effect, seen the answers to the questions. Sometimes machine learning methods seem more effective than they are because they aren’t tested in more robust settings.
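To see why leakage inflates scores, here is a minimal sketch (hypothetical toy data, not from the article): a model that simply memorizes its training rows looks perfect when “tested” on those same rows, and only a genuinely held-out set reveals the honest number.

```python
# Minimal sketch of train/test leakage: a memorizing model scored on
# data it has already seen reports a flattering, misleading accuracy.
import random

random.seed(0)

# Toy dataset: feature x, label y = (x > 50) with 20% label noise.
data = [(x, (x > 50) ^ (random.random() < 0.2)) for x in range(100)]
random.shuffle(data)
train, test = data[:80], data[80:]

# "Model": memorize the training examples (1-nearest neighbor on x).
memory = {x: y for x, y in train}

def predict(x):
    nearest = min(memory, key=lambda m: abs(m - x))
    return memory[nearest]

def accuracy(rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

# Leaky evaluation: the model has "seen the answers," so it looks perfect.
print(f"accuracy on training rows: {accuracy(train):.2f}")
# Honest evaluation on rows the model never saw.
print(f"accuracy on held-out rows: {accuracy(test):.2f}")
```

Run it and the training-set score comes back a perfect 1.00 while the held-out score does not, which is the gap the researchers describe: leakage makes the model “seem better than it actually is.”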

Here’s the link to “Leakage and the Reproducibility Crisis in ML-Based Science,” in which more details appear. Wowza if these experts are correct. Who goes swimming without a functioning snorkel? Maybe the Google?

Stephen E Arnold, August 8, 2022
