Machine Learning Foibles: Are We Surprised? Nope
March 18, 2020
EurekAlert published “Study Shows Widely Used Machine Learning Methods Don’t Work As Claimed.” Imagine that. The article states:
Researchers demonstrated the mathematical impossibility of representing social networks and other complex networks using popular methods of low dimensional embeddings.
To put the allegations (and maybe mathematical proof) in context: there are many machine learning methods and even more magical thresholds the data whiz kids fiddle to generate acceptable outputs. The idea is that as long as the outputs are “good enough,” the training method is okay to use. Statistics is just math with some good old-fashioned “thumb on the scale” opportunities.
The article states:
The study evaluated techniques known as “low-dimensional embeddings,” which are commonly used as input to machine learning models. This is an active area of research, with new embedding methods being developed at a rapid pace. But Seshadhri and his coauthors say all these methods share the same shortcomings.
What are the shortcomings?
Seshadhri and his coauthors demonstrated mathematically that significant structural aspects of complex networks are lost in this embedding process. They also confirmed this result empirically by testing various embedding techniques on different kinds of complex networks.
The methods discard or ignore information, collapsing each individual into a fuzzy “geometric representation.” Individuals’ social connections are lost in the fuzzification procedure.
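To see the flavor of the claim, here is a toy sketch of our own devising, not the paper’s proof: embed a triangle-rich graph with a low-rank model, round the reconstruction back into a graph, and count how many triangles survive. The graph generator, threshold, and dimension are all arbitrary choices; the sketch assumes networkx and numpy are installed.

```python
# A toy illustration, not the paper's proof: embed a triangle-rich
# graph with a low-rank (dot-product) model and watch the triangle
# count collapse when the graph is rebuilt from the embedding.
import networkx as nx
import numpy as np

# A "triangle-rich" network: 50 four-person cliques linked in a ring.
G = nx.connected_caveman_graph(50, 4)
A = nx.to_numpy_array(G)

# Low-dimensional embedding via truncated SVD of the adjacency matrix.
d = 8  # embedding dimension, far below the 200 nodes
U, S, Vt = np.linalg.svd(A)
A_low = U[:, :d] @ np.diag(S[:d]) @ Vt[:d, :]

# Round the low-rank reconstruction back into a 0/1 graph.
A_hat = (A_low > 0.5).astype(int)
np.fill_diagonal(A_hat, 0)
G_hat = nx.from_numpy_array(A_hat)

count_triangles = lambda g: sum(nx.triangles(g).values()) // 3
print("triangles in the original:", count_triangles(G))
print("triangles after the d=8 embedding:", count_triangles(G_hat))
```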
Big deal. Sort of. The paper opens the door to many graduate students beavering away at the “accuracy” of machine learning procedures.
Stephen E Arnold, March 18, 2020
The Google: Geofence Misdirection a Consequence of Good Enough Analytics?
March 18, 2020
What a surprise—the use of Google tracking data by police nearly led to a false arrest, we’re told in the NBC News article, “Google Tracked his Bike Ride Past a Burglarized Home. That Made him a Suspect.” Last January, programmer and recreational cyclist Zachary McCoy received an email from Google informing him, as it does, that the cops had demanded information from his account. He had one week to try to block the release in court, yet McCoy had no idea what prompted the warrant. Writer Jon Schuppe reports:
“There was one clue. In the notice from Google was a case number. McCoy searched for it on the Gainesville Police Department’s website, and found a one-page investigation report on the burglary of an elderly woman’s home 10 months earlier. The crime had occurred less than a mile from the home that McCoy … shared with two others. Now McCoy was even more panicked and confused.”
After hearing of his plight, McCoy’s parents sprang for an attorney:
“The lawyer, Caleb Kenyon, dug around and learned that the notice had been prompted by a ‘geofence warrant,’ a police surveillance tool that casts a virtual dragnet over crime scenes, sweeping up Google location data — drawn from users’ GPS, Bluetooth, Wi-Fi and cellular connections — from everyone nearby. The warrants, which have increased dramatically in the past two years, can help police find potential suspects when they have no leads. They also scoop up data from people who have nothing to do with the crime, often without their knowing — which Google itself has described as ‘a significant incursion on privacy.’ Still confused — and very worried — McCoy examined his phone. An avid biker, he used an exercise-tracking app, RunKeeper, to record his rides.”
Aha! There was the source of the “suspicious” data—RunKeeper tapped into his Android phone’s location service and fed that information to Google. The records show that, on the day of the break-in, his exercise route had taken him past the victim’s house three times in an hour. Eventually, the lawyer was able to convince the police his client (still not unmasked by Google) was not the burglar. Perhaps ironically, it was RunKeeper data showing he had been biking past the victim’s house for months, not just on the day of the burglary, that removed suspicion.
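To make the mechanics concrete, here is a minimal sketch of the kind of query a geofence warrant implies. The data layout, field names, and radius are our guesses at the shape of the operation, not Google’s actual schema or code:

```python
# A sketch of the query a geofence warrant implies: keep every location
# ping inside a radius of the crime scene during a time window. The
# Ping layout is hypothetical, not Google's actual schema.
from dataclasses import dataclass
from datetime import datetime
from math import radians, sin, cos, asin, sqrt

@dataclass
class Ping:
    user_id: str   # anonymized until police ask Google to unmask it
    lat: float
    lon: float
    ts: datetime

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6_371_000 * 2 * asin(sqrt(a))

def geofence_hits(pings, scene_lat, scene_lon, radius_m, start, end):
    """Everyone whose ping falls inside the fence during the window,
    including a cyclist who merely rode past the house."""
    return [p for p in pings
            if start <= p.ts <= end
            and haversine_m(p.lat, p.lon, scene_lat, scene_lon) <= radius_m]

# Hypothetical pings: one near the scene, one across town.
pings = [Ping("user-a", 29.6516, -82.3248, datetime(2019, 3, 29, 15, 0)),
         Ping("user-b", 29.7000, -82.4000, datetime(2019, 3, 29, 15, 5))]
print(geofence_hits(pings, 29.6516, -82.3248, 150,
                    datetime(2019, 3, 29, 14), datetime(2019, 3, 29, 16)))
```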
Luck, and a good lawyer, were on McCoy’s side, but a larger civil rights issue looms. Though such tracking data is anonymized until law enforcement finds something “suspicious,” this case illustrates how easy it can be to attract that attention. Do geofence warrants violate our protections against unreasonable searches? See the article for more discussion.
Cynthia Murrell, March 18, 2020
Math Resources
January 27, 2020
One of the DarkCyber team spotted a list of available math resources. Some cost money; others are free. Math Vault lists courses, platforms, tools, and question-answering sites. Some are relatively mainstream, like Wolfram Alpha; others, like ProofWiki, are less well publicized. You can find the listing at this link.
Kenny Toth, January 26, 2020
Quadratic Equations: A New Method
December 15, 2019
If you deal with quadratic equations, you will want to read “A New Way to Make Quadratic Equations Easy.” The procedure is straightforward and apparently has been overlooked, lost in time, or dismissed as out of step with current teaching methods. Worth a look, but my high school mathematics teacher Ms. Blackburn would not approve. She liked old school methods, including whacking teenaged boys on the head with her wooden ruler.
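If the method in question is the averaging trick publicized by Po-Shen Loh around the time of the article (our assumption; the article has the details), it amounts to this: the roots of x^2 + bx + c = 0 sum to -b, so they average to -b/2; write them as -b/2 ± u and recover u from the product of the roots, which must equal c. A minimal sketch:

```python
# A minimal sketch of the averaging trick for x^2 + bx + c = 0.
# The roots sum to -b (so they average to -b/2) and multiply to c.
import cmath

def quadratic_roots(b, c):
    """Roots of x^2 + bx + c = 0 via the sum/product trick."""
    mean = -b / 2                   # the two roots average to -b/2
    u = cmath.sqrt(mean ** 2 - c)   # from (mean - u)(mean + u) = c
    return mean + u, mean - u

# x^2 - 2x - 3 = 0 factors as (x - 3)(x + 1); cmath handles complex roots too.
print(quadratic_roots(-2, -3))      # ((3+0j), (-1+0j))
```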
Stephen E Arnold, December 15, 2019
Calculus Made Almost Easy
December 2, 2019
Just a quick tip of the hat to 0a.io. You have to love that URL. Navigate to “Calculus Explained with Pics and Gifs.”
The site provides an overview of calculus. Pictures and animations make it easy to determine if one was sleeping in calculus class or paying attention.
The site went live with the information five years ago. One of the DarkCyber team spotted it and sent along the link. Worth a visit.
Stephen E Arnold, December 2, 2019
Can Machine Learning Pick Out The Bullies?
November 13, 2019
In Walt Disney’s 1942 classic Bambi, Thumper the rabbit was told, “If you can’t say something nice, don’t say nothing at all.”
Poor grammar aside, the thumping rabbit did deliver wise advice to the audience. Then came the Internet and anonymity, and the trolls were released upon the world. Internet bullying is one of the world’s top cyber crimes, along with identity and money theft. Passionate anti-bullying campaigners, particularly individuals who were cyber-bullying victims, want social media Web sites to police their users and prevent the abusive crime. Trying to police the Internet is like herding cats. It might be possible with the right type of fish, but cats are not herd animals and scatter once the tasty fish is gone.
Technology might have advanced enough to detect bullying, and AI could be the answer. Innovation Toronto wrote, “Machine Learning Algorithms Can Successfully Identify Bullies And Aggressors On Twitter With 90 Percent Accuracy.” AI’s biggest problem is that while algorithms can identify and harvest information, they lack the ability to understand emotion and context. Many bullying actions on the Internet are sarcastic or hidden within metaphors.
Computer scientist Jeremy Blackburn and his team from Binghamton University analyzed bullying behavior patterns on Twitter. They discovered useful information to understand the trolls:
“ ‘We built crawlers — programs that collect data from Twitter via variety of mechanisms,’ said Blackburn. ‘We gathered tweets of Twitter users, their profiles, as well as (social) network-related things, like who they follow and who follows them.’ ”
The researchers then performed natural language processing and sentiment analysis on the tweets themselves, as well as a variety of social network analyses on the connections between users. The researchers developed algorithms to automatically classify two specific types of offensive online behavior: cyber bullying and cyber aggression. The algorithms were able to identify abusive users on Twitter with 90 percent accuracy. These are users who engage in harassing behavior, e.g., those who send death threats or make racist remarks to other users.
“‘In a nutshell, the algorithms ‘learn’ how to tell the difference between bullies and typical users by weighing certain features as they are shown more examples,’ said Blackburn.”
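The article does not publish the team’s code, but the general technique Blackburn describes is standard text classification: turn each tweet into weighted features and let a model learn which features separate abusive accounts from typical ones. A minimal sketch with toy data and text features only (the real study also used network features such as follower relationships), assuming scikit-learn:

```python
# Not the Binghamton team's code (the article does not publish it),
# just the general technique: turn tweets into weighted text features
# and let a classifier learn which ones separate abusive users from
# typical ones. Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny hand-labeled stand-in for a real annotated corpus.
tweets = [
    "you are worthless and everyone hates you",   # abusive
    "nobody would miss you, just quit",           # abusive
    "great ride this morning, legs are toast",    # typical
    "congrats on the new job, well deserved!",    # typical
]
labels = [1, 1, 0, 0]  # 1 = abusive, 0 = typical

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)

print(model.predict(["everyone hates you, just quit now"]))  # expect [1]
```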
Blackburn and his team’s algorithm only detects aggressive behavior; it does not do anything to prevent cyber bullying. The victims still see and are harmed by the comments and bullying users, but the algorithm does give Twitter a heads-up on removing the trolls.
The anti-bullying algorithm catches bullying only after there are victims. It does little to assist those victims, but it may prevent future attacks. What steps need to be taken to prevent bullying altogether? Maybe schools need to teach classes on Internet etiquette alongside the Common Core. Then again, if it is not on the test, it will not be in a classroom.
Whitney Grace, November 13, 2019
Tech Backlash: Not Even Apple and Goldman Sachs Exempt
November 11, 2019
Times are indeed interesting. Two powerful outfits—Apple (the privacy outfit with a thing for Chinese food) and Goldman Sachs (the we-make-money-every-way-possible organization)—are the subjects of “Viral Tweet about Apple Card Leads to Goldman Sachs Probe.” The would-be president’s news machine stated, “Tech entrepreneur alleged inherent bias in algorithms for card.” The card, of course, is the Apple-Goldman revenue-generating credit card. Navigate to the Bloomberg story. Get the scoop.
On the other hand, just look at one of the dozens and dozens of bloggers commenting on this bias-meets-algorithms, big-name story. Even more intriguing is that the aggrieved tweeter’s wife had her credit score magically changed. Remarkable how smart algorithms work.
DarkCyber does not want to retread truck tires. We do have three observations:
- The algorithm part may be more important than the bias angle. The reason is that algorithms embody bias, and now non-technical and non-financial people are going to start asking questions: superficial at first, then increasingly on point. Not good for algorithms when humans obviously can fiddle the outputs. (A toy sketch of how this happens appears after this list.)
- Two usually untouchable companies are now in the spotlight for subjective, touchy-feely things with which neither company is particularly associated. This may lead to some interesting information about what’s up in the clubby world of the richest companies on earth. Discrimination maybe? Carelessness? Indifference? Greed? We have to wait and listen.
- Even those who may have worked at these firms and who now may be in positions of considerable influence may find themselves between a squash wall and sweaty guests who aren’t happy about an intentional obstruction. Those corporate halls, which are often tomb-quiet, may resound with stressed voices. “Apple” carts which allegedly sell to anyone may be upset. Cleaning up after the spill may drag the doubles partners from two exclusive companies into a task similar to cleaning sea birds after the Gulf oil spill.
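Here is the sketch promised above. The data is invented and bears no relation to the actual Apple Card model, which is not public. The point is narrow: drop the protected attribute, keep a correlated proxy, and the “blind” model still produces disparate outputs. Assumes numpy and scikit-learn.

```python
# Invented data, no relation to the actual Apple Card model (which is
# not public). The narrow point: drop the protected attribute, keep a
# correlated proxy, and the "blind" model still treats the two groups
# differently. Requires numpy and scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
group = rng.integers(0, 2, n)            # protected attribute, e.g. gender
proxy = group + rng.normal(0, 0.3, n)    # a feature that tracks the group
income = rng.normal(60, 15, n)           # a legitimate feature, in $k

# Historical decisions were biased against group 1.
approved = ((income / 30 - 2.0 - 1.5 * group
             + rng.normal(0, 1, n)) > 0).astype(int)

# Train WITHOUT the protected attribute: income and the proxy only.
X = np.column_stack([income, proxy])
model = LogisticRegression().fit(X, approved)

rate = lambda g: model.predict(X[group == g]).mean()
print(f"predicted approval rate, group 0: {rate(0):.0%}")
print(f"predicted approval rate, group 1: {rate(1):.0%}")  # much lower
```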
Will this issue get news traction? Will it become a lawyer-powered railroad handcar creeping down the line?
Fascinating stuff.
Stephen E Arnold, November 11, 2019
Visual Data Exploration via Natural Language
November 4, 2019
New York University announced a natural language interface for data visualization. You can read the rah rah from the university here. The main idea is that a person can use simple English to create complex, machine learning-based visualizations. Sounds like the answer to a Wall Street analyst’s prayers.
The university reported:
A team at the NYU Tandon School of Engineering’s Visualization and Data Analytics (VIDA) lab, led by Claudio Silva, professor in the department of computer science and engineering, developed a framework called VisFlow, by which those who may not be experts in machine learning can create highly flexible data visualizations from almost any data. Furthermore, the team made it easier and more intuitive to edit these models by developing an extension of VisFlow called FlowSense, which allows users to synthesize data exploration pipelines through a natural language interface.
You can download (as of November 3, 2019, but no promises the document will be online after this date) “FlowSense: A Natural Language Interface for Visual Data Exploration within a Dataflow System.”
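We have not reimplemented FlowSense; see the paper for how it actually parses queries. As a toy illustration of the general idea, here is a sketch, entirely ours, that maps one narrow English pattern onto a pandas-and-matplotlib dataflow:

```python
# A toy natural language interface, entirely ours, not FlowSense: map
# one narrow English pattern onto a small dataflow of pandas operations
# ending in a chart. Requires pandas and matplotlib.
import re
import pandas as pd
import matplotlib.pyplot as plt

def answer(query: str, df: pd.DataFrame):
    """Handle queries shaped like '... average <measure> by <dimension> ...'."""
    m = re.search(r"average (\w+) by (\w+)", query.lower())
    if not m:
        raise ValueError("toy parser only understands 'average X by Y'")
    measure, dim = m.groups()
    result = df.groupby(dim)[measure].mean()  # the generated dataflow
    result.plot.bar(title=query)              # the generated visualization
    plt.show()
    return result

sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "revenue": [120, 80, 200, 160],
})
answer("Show the average revenue by region", sales)
```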
DarkCyber wants to point out that talking to a computer to get information continues to be of interest to many researchers. Will this innovation put human analysts out of their jobs?
Maybe not tomorrow, but in the future? Absolutely. And what will those newly unemployed people do for money?
Interesting question and one some may find difficult to consider at this time.
Stephen E Arnold, November 4, 2019
Bias: Female Digital Assistant Voices
October 17, 2019
It was a seemingly benign choice based on consumer research, but there is an unforeseen complication. TechRadar considers, “The Problem with Alexa: What’s the Solution to Sexist Voice Assistants?” From smart speakers to cell phones, voice assistants like Amazon’s Alexa, Microsoft’s Cortana, Google’s Assistant, and Apple’s Siri generally default to female voices (and usually sport female-sounding names) because studies show humans tend to respond best to female voices. Seems like an obvious choice—until you consider the long-term consequences. Reporter Olivia Tambini cites a report UNESCO issued earlier this year that suggests the practice sets us up to perpetuate sexist attitudes toward women, particularly subconscious biases. She writes:
“This progress [society has made toward more respect and agency for women] could potentially be undone by the proliferation of female voice assistants, according to UNESCO. Its report claims that the default use of female-sounding voice assistants sends a signal to users that women are ‘obliging, docile and eager-to-please helpers, available at the touch of a button or with a blunt voice command like “hey” or “OK”.’ It’s also worrying that these voice assistants have ‘no power of agency beyond what the commander asks of it’ and respond to queries ‘regardless of [the user’s] tone or hostility’. These may be desirable traits in an AI voice assistant, but what if the way we talk to Alexa and Siri ends up influencing the way we talk to women in our everyday lives? One of UNESCO’s main criticisms of companies like Amazon, Google, Apple and Microsoft is that the docile nature of our voice assistants has the unintended effect of reinforcing ‘commonly held gender biases that women are subservient and tolerant of poor treatment’. This subservience is particularly worrying when these female-sounding voice assistants give ‘deflecting, lackluster or apologetic responses to verbal sexual harassment’.”
So what is a voice-assistant maker to do? Certainly, male voices could be used and are, in fact, selectable options for several models. Another idea is to give users a wide variety of voices to choose from—not just different genders, but different accents and ages, as well. Perhaps the most effective solution would be to use a gender-neutral voice; one dubbed “Q” has now been created, proving it is possible. (You can listen to Q through the article or on YouTube.)
Of course, this and other problems might have been avoided had there been more diversity on the teams behind the voices. Tambini notes that just seven percent of information- and communication-tech patents across G20 countries are generated by women. As more women move into STEM fields, will unintended gender bias shrink as a natural result?
Cynthia Murrell, October 17, 2019
The Roots of Common Machine Learning Errors
October 11, 2019
It is a big problem when faulty data analysis underpins big decisions or public opinion, and it is happening more often in the age of big data. Data Science Central outlines several “Common Errors in Machine Learning Due to Poor Statistics Knowledge.” Easy to make mistakes? Yep. Easy to manipulate outputs? Yep. We believe the obvious fix is to make math point and click—let the developers decide for the clueless user.
Blogger Vincent Granville describes what he sees as the biggest problem:
“Probably the worst error is thinking there is a correlation when that correlation is purely artificial. Take a data set with 100,000 variables, say with 10 observations. Compute all the (99,999 * 100,000) / 2 cross-correlations. You are almost guaranteed to find one above 0.999. This is best illustrated in my article How to Lie with P-values (also discussing how to handle and fix it). This is being done on such a large scale, I think it is probably the main cause of fake news, and the impact is disastrous on people who take for granted what they read in the news or what they hear from the government. Some people are sent to jail based on evidence tainted with major statistical flaws. Government money is spent, propaganda is generated, wars are started, and laws are created based on false evidence. Sometimes the data scientist has no choice but to knowingly cook the numbers to keep her job. Usually, these ‘bad stats’ end up being featured in beautiful but faulty visualizations: axes are truncated, charts are distorted, observations and variables are carefully chosen just to make a (wrong) point.”
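Granville’s thought experiment is easy to rerun at reduced scale. With far more random variables than observations, some pair will correlate almost perfectly by chance alone. A sketch (2,000 variables instead of 100,000, so the correlation matrix fits in memory), assuming numpy:

```python
# Granville's thought experiment at reduced scale: far more random
# variables than observations, so some pair correlates almost perfectly
# by chance. 100,000 variables would mean ~5 billion pairs; 2,000 is
# enough to see the effect in seconds.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 2_000))     # 10 observations, 2,000 noise variables

corr = np.corrcoef(X, rowvar=False)  # all pairwise correlations
np.fill_diagonal(corr, 0)            # drop the trivial self-correlations

i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"best 'discovery': variables {i} and {j}, r = {corr[i, j]:.3f}")
# Typically prints |r| above 0.95: pure noise masquerading as signal.
```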
Granville goes on to specify several other sources of mistakes. Analysts sometimes take the accuracy of their data sets for granted, for example, instead of performing a walk-forward test. Relying too much on the old standbys, R-squared measures and normal distributions, can also lead to errors. Furthermore, he reminds us, scale-invariant modeling techniques must be used when data is expressed in different units (like yards and miles). Finally, one must be sure to handle missing data correctly—do not assume bridging the gap with an average will produce accurate results. See the post for more explanation on each of these points.
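The units point deserves one concrete instance. K-means, to pick a common offender, is not scale-invariant: express the same distance column in yards instead of miles and the clusters change. A small sketch with invented data, assuming scikit-learn:

```python
# One of Granville's points in miniature: k-means is not scale-invariant,
# so restating a column in yards instead of miles changes the clusters.
# Invented data; requires numpy and scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(7)
trip_distance_miles = rng.uniform(0, 5, 100)
trip_cost_dollars = rng.uniform(0, 100, 100)

def cluster(distance):
    X = np.column_stack([distance, trip_cost_dollars])
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

labels_miles = cluster(trip_distance_miles)
labels_yards = cluster(trip_distance_miles * 1760)  # same data, new units

# 1.0 would mean identical clusterings; expect something near 0 here.
print("agreement:", adjusted_rand_score(labels_miles, labels_yards))
```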
Cynthia Murrell, October 11, 2019