Image Search: Biased by Language. The Fix? Use Humans!
April 19, 2017
Houston, we (male, female, uncertain) have a problem. Bias is baked into some image analysis and just about every other type of smart software.
The first step in solving a problem is to acknowledge that a problem exists. The second step is more difficult.
I read “The Reason Why Most of the Images That Show Up When You Search for Doctor Are White Men.” The headline identifies the problem. However, what does one do about biases rooted in human utterance?
My initial thought was to eliminate human utterances. No fancy dancing required. Just let algorithms do what algorithms do. I realized that although this approach has a certain logical completeness, implementation may meet with a bit of resistance.
What does the write up have to say about the problem? (Remember. The fix is going to be tricky.)
Research from Princeton University suggests that these biases, like associating men with doctors and women with nurses, come from the language taught to the algorithm. As some data scientists say, “garbage in, garbage out”: Without good data, the algorithm isn’t going to make good decisions.
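The Princeton work measures this by checking which words sit close together in vector space. Here is a toy sketch of that idea, using made-up three-dimensional vectors (real studies use embeddings with hundreds of dimensions trained on web text; the numbers below are hypothetical, chosen only to illustrate the measurement):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors standing in for embeddings learned from web text.
vectors = {
    "doctor": [0.9, 0.3, 0.1],
    "nurse":  [0.2, 0.9, 0.1],
    "he":     [0.8, 0.2, 0.3],
    "she":    [0.3, 0.8, 0.3],
}

def gender_skew(word):
    """Positive: the word sits closer to 'he'; negative: closer to 'she'."""
    return cosine(vectors[word], vectors["he"]) - cosine(vectors[word], vectors["she"])

print(f"doctor skew: {gender_skew('doctor'):+.3f}")  # positive, leans "he"
print(f"nurse skew:  {gender_skew('nurse'):+.3f}")   # negative, leans "she"
```

The point: the algorithm is just reporting distances. The distances came from the text humans wrote.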
Okay, right coast thinking. I feel more comfortable.
What does the write up present as wizard Aylin Caliskan’s view of the problem? A postdoctoral researcher seems to be a solid choice for a source. I assume the wizard is a human, so perhaps he, she, it is biased? Hmmm.
I highlighted in true blue several passages from the write up / interview with he, she, it. Let’s look at three statements, shall we?
Regarding genderless languages like Turkish:
when you directly translate, and “nurse” is “she,” that’s not accurate. It should be “he or she or it” is a nurse. We see that it’s making a biased decision—it’s a very simple example of machine translation, but given that these models are incorporated on the web or any application that makes use of textual data, it’s the foundation of most of these applications. If you search for “doctor” and look at the images, you’ll see that most of them are male. You won’t see an equal male and female distribution.
If accurate, this observation means that the “fix” is going to be difficult. Translating from a language without grammatical gender into a language with it forces a choice of gender in the target language. Easy for software. Tougher for a human. If a language and its associations are anchored in the brain of a native speaker, change may be, how shall I say it, a trifle difficult. My fix looks pretty good at this point.
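How does a purely statistical system make that forced choice? By majority vote over its training data. A minimal sketch, with made-up co-occurrence counts standing in for what a real machine translation system would learn from web text (no actual MT pipeline works exactly this way; this is the bias mechanism boiled down):

```python
# Hypothetical counts of how often each noun co-occurs with each pronoun
# in a training corpus. The numbers are invented for illustration.
pronoun_counts = {
    "nurse":  {"he": 120, "she": 880},
    "doctor": {"he": 840, "she": 160},
}

def render_genderless_pronoun(noun):
    """Pick 'he' or 'she' for a genderless pronoun (Turkish 'o') by
    majority vote in the training data."""
    counts = pronoun_counts[noun]
    return max(counts, key=counts.get)

print(render_genderless_pronoun("nurse"))   # "she"
print(render_genderless_pronoun("doctor"))  # "he"
```

The system is not deciding that nurses are female. It is reporting that the corpus said so, more often than not.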
And what about images and videos? I learned:
Yes, anything that text touches. Images and videos are labeled so they can be used on the web. The labels are in text, and it has been shown that those labels have been biased.
And the fix is a human doing the content selection, indexing, and dictionary tweaking? Not so fast. Indexing with humans is very expensive. Don’t believe me? Download 10,000 Wikipedia articles and hire some folks to index them using a controlled term list humans set up. Let me know if you can hit $17 per indexed article. My hunch is that you will exceed this target by several orders of magnitude. (Want to know where the number comes from? Contact me and we can discuss a for fee deal for this high value information.)
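The back-of-the-envelope arithmetic is easy to run. A sketch using the $17-per-article target above, with each “order of magnitude” of overrun multiplying the bill by ten:

```python
articles = 10_000
target_cost_per_article = 17  # dollars; the target figure from the text above

baseline = articles * target_cost_per_article
print(f"At the target: ${baseline:,}")  # $170,000

# Each order of magnitude over target is another factor of ten.
for magnitudes in (1, 2, 3):
    overrun = baseline * 10 ** magnitudes
    print(f"{magnitudes} order(s) of magnitude over: ${overrun:,}")
```

Even hitting the target costs $170,000 for a rounding error’s worth of Wikipedia. Now scale that to the web.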
How does the write up solve the problem? Here’s the capper:
…you cannot directly remove the bias from the dataset or model because it’s giving a very accurate representation of the world, and that’s why we need a specialist to deal with this at the application level.
Notice that my solution is to eliminate humans entirely. Why? The pipe dream of humans doing indexing won’t fly due to [a] time, [b] cost, [c] the massive flows of data to index. Forget the mother of all bombs.
Think about the mother of all indexing backlogs. The gap would make the Modern Language Association’s “gaps” look like a weekend catch-up party. Is this a job for the operating system for machine intelligence?
Stephen E Arnold, April 17, 2017