Eliminate Bias from Human-Curated Training Sets with Training Sets Created by Humans
September 5, 2016
I love the quest for objectivity in training smart software. I recall a professor from my undergraduate days, Dr. Stephen Pence I believe, an interesting fellow who enjoyed pointing out logical fallacies. Pence introduced me to the work of Stephen Toulmin, an author who is a fun read.
I thought about argument by sign when I read “Language Necessarily Contains Human Biases, and So Will Machines Trained on Language Corpora.” The write up points out that smart software processing human utterances for “information” will end up with biases. The notion matches my experience.
I highlighted:
for 50 occupation words (doctor, engineer, …), we can accurately predict the percentage of U.S. workers in that occupation who are women using nothing but the semantic closeness of the occupation word to feminine words!… These results simultaneously show that the biases in question are embedded in human language, and that word embeddings are picking up the biases.
The write up also points out that “Algorithms don’t have a way to identify biases.”
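For readers curious about the mechanics behind the quoted claim, here is a minimal sketch, mine and not the researchers’ code, of how one might score an occupation word’s closeness to feminine versus masculine words using cosine similarity over word embeddings. The tiny vectors are invented for illustration only; a genuine test would load pretrained embeddings such as word2vec or GloVe.

```python
# Rough sketch of measuring gender association in word embeddings via
# cosine similarity. The 4-dimensional vectors below are made up for
# illustration; real embeddings have hundreds of dimensions.

import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

# Hypothetical embeddings -- values invented for this example only.
embeddings = {
    "doctor":   [0.8, 0.1, 0.3, 0.2],
    "nurse":    [0.2, 0.7, 0.4, 0.1],
    "engineer": [0.9, 0.0, 0.2, 0.3],
    "she":      [0.1, 0.8, 0.5, 0.0],
    "her":      [0.2, 0.9, 0.4, 0.1],
    "he":       [0.9, 0.1, 0.2, 0.2],
    "him":      [0.8, 0.2, 0.3, 0.3],
}

feminine = mean_vector([embeddings["she"], embeddings["her"]])
masculine = mean_vector([embeddings["he"], embeddings["him"]])

for occupation in ("doctor", "nurse", "engineer"):
    vec = embeddings[occupation]
    # Positive score = closer to the feminine words; negative = closer to masculine.
    score = cosine(vec, feminine) - cosine(vec, masculine)
    print(f"{occupation}: gender-association score {score:+.3f}")
```

The researchers’ actual method is more rigorous than this toy, but the basic idea is the same: the geometry of the embedding space encodes whatever associations were present in the text the software was trained on.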
When we read about smart software taking a query like “beautiful girls” and returning a skewed data set, we wonder how vendors can ignore the distortions in their artificially intelligent routines.
Objectivity, gentle reader, is not easy to come by. Vendors of smart software who ignore the biases created by training sets and by the engineers’ decisions about threshold settings in numerical recipes may benefit from some critical thinking. Reading the work of Toulmin may be helpful as well.
Stephen E Arnold, September 5, 2016