Smart Software but No Mention of Cathy O’Neil

August 21, 2019

I read “Flawed Algorithms Are Grading Millions of Students’ Essays.” I also read Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil, which was published in 2016. My recollection is that Ms. O’Neil made appearances on some podcasts; for instance, Econ Talk, a truly exciting economics-centric program. It is indeed possible that the real news outfit Motherboard/Vice did not. Before I read the article, I did a search of the text for “O’Neil” and “Weapons of Math Destruction.” I found zero hits. Why? The author, editor, and publisher did not include a pointer to her book. Zippo. There’s a reference to the “real news” outfit ProPublica. There’s a reference to the Vice investigation. Would this approach work in freshman composition with essays graded by a harried graduate student?

Here’s the point. Ms. O’Neil did a very good job of explaining the flaws of automated systems. Recycling is the name of the game. After all, DarkCyber is recycling this “original” essay containing “original” research, isn’t it?

I noted this passage in the write up:

Research is scarce on the issue of machine scoring bias, partly due to the secrecy of the companies that create these systems. Test scoring vendors closely guard their algorithms, and states are wary of drawing attention to the fact that algorithms, not humans, are grading students’ work. Only a handful of published studies have examined whether the engines treat students from different language backgrounds equally, but they back up some critics’ fears.

Yeah, but there is a relatively recent book on the subject.

I noted this statement in the write up:

Here’s the first sentence from the essay addressing technology’s impact on humans’ ability to think for themselves…

I like the “ability to think for themselves.”

So do I. In fact, I would suggest that this write up is an example of the loss of this ability.

A mere 2,000 words and no room for a thought or a tiny footnote about Ms. O’Neil. Flawed? I leave it to you to decide.

Stephen E Arnold, August 21, 2019

More on Biases in Smart Software

August 7, 2019

Bias in machine learning strikes again. Citing a study performed by Facebook AI Research, The Verge reports, “AI Is Worse at Identifying Household Items from Lower-Income Countries.” Researchers studied the accuracy of five top object-recognition algorithms, Microsoft Azure, Clarifai, Google Cloud Vision, Amazon Rekognition, and IBM Watson, using this dataset of objects from around the world. Writer James Vincent tells us:

“The researchers found that the object recognition algorithms made around 10 percent more errors when asked to identify items from a household with a $50 monthly income compared to those from a household making more than $3,500. The absolute difference in accuracy was even greater: the algorithms were 15 to 20 percent better at identifying items from the US compared to items from Somalia and Burkina Faso.”

Not surprisingly, researchers point to the usual suspect—the similar backgrounds and financial brackets of most engineers who create algorithms and datasets. Vincent continues:

“In the case of object recognition algorithms, the authors of this study say that there are a few likely causes for the errors: first, the training data used to create the systems is geographically constrained, and second, they fail to recognize cultural differences. Training data for vision algorithms, write the authors, is taken largely from Europe and North America and ‘severely under sample[s] visual scenes in a range of geographical regions with large populations, in particular, in Africa, India, China, and South-East Asia.’ Similarly, most image datasets use English nouns as their starting point and collect data accordingly. This might mean entire categories of items are missing or that the same items simply look different in different countries.”

Why does this matter? For one thing, it means object recognition performs better for certain audiences than others in systems as benign as photo storage services, as serious as security cameras, and as crucial as self-driving cars. Not only that, we’re told, the biases found here may be passed into other types of AI that will not receive similar scrutiny down the line. As AI products pick up speed throughout society, developers must pay more attention to the data on which they train their impressionable algorithms.

Cynthia Murrell, August 7, 2019

Trovicor: A Slogan as an Equation

August 2, 2019

We spotted this slogan on the Trovicor Web site:

The Trovicor formula: Actionable Intelligence = f (data generation; fusion; analysis; visualization)

The function consists of four buzzwords used by vendors of policeware and intelware:

  • Data generation (which suggests metadata assigned to intercepted, scraped, or provided content objects)
  • Fusion (which means in DarkCyber’s world a single index to disparate data)
  • Analysis (numerical recipes to identify patterns or other interesting data)
  • Virtualization (use of technology to replace old school methods like 1950s’ style physical wire taps, software defined components, and software centric widgets).

The buzzwords make it easy to identify other companies providing somewhat similar services.

Trovicor maintains a low profile. But obtaining open source information about the company may be a helpful activity.

Stephen E Arnold, August 2, 2019

Smart Software: About Those Methods?

July 23, 2019

An interesting paper germane to machine learning and smart software is available from The title? “Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches”.

The punch line for this academic document is, in the view of DarkCyber:

No way.

Your view may be different, but you will have to read the document, check out the diagrams, and scan the supporting information available on Github at this link.

The main idea is:

In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. Overall, our work sheds light on a number of potential problems in today’s machine learning scholarship and calls for improved scientific practices in this area.
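To make "comparably simple heuristic methods" concrete, here is a minimal sketch of an item-based nearest-neighbor recommender for a top-n task. It is not the paper's code: the function name, the toy data, and the scoring rule (sum of a candidate's k highest cosine similarities to items the user already has) are assumptions chosen for illustration.

```python
from collections import defaultdict
from math import sqrt


def item_knn_top_n(interactions, user, n=2, k=2):
    """Recommend up to n unseen items for `user` with an item-based kNN heuristic.

    interactions: dict mapping user -> set of items (implicit feedback).
    n, k: hypothetical defaults, not values taken from the paper.
    """
    # Invert the data: item -> set of users who interacted with it.
    users_of = defaultdict(set)
    for u, items in interactions.items():
        for item in items:
            users_of[item].add(u)

    def cosine(a, b):
        # Cosine similarity between two items viewed as binary user vectors.
        overlap = len(users_of[a] & users_of[b])
        return overlap / sqrt(len(users_of[a]) * len(users_of[b]))

    seen = interactions[user]
    scores = {}
    for candidate in users_of:
        if candidate in seen:
            continue
        # Sum the candidate's k highest similarities to the user's own items.
        sims = sorted((cosine(candidate, s) for s in seen), reverse=True)[:k]
        scores[candidate] = sum(sims)

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:n]


# Toy data, invented for this sketch.
toy = {
    "ann": {"a", "b", "c"},
    "bob": {"a", "b", "d"},
    "cat": {"b", "c"},
    "dan": {"a", "b"},
}
print(item_knn_top_n(toy, "dan"))
```

There is no training loop and no tuning beyond n and k, which is the sort of baseline the authors report many neural methods struggle to beat.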

So back to my summary, “No way.”

Here’s an “oh, how interesting” chart. Note the spikes:


Several observations:

  1. In an effort to get something to work, those who think in terms of algorithms take shortcuts; that is, operate in a clever way to produce something that’s good enough. “Good enough” is pretty much a C grade or “passing.”
  2. Math whiz hand waving and MBA / lawyer ignorance of what human judgments operate within an algorithmic operation guarantee that “good enough” becomes “Let’s see if this makes money.” You can substitute “reduce costs” if you wish. No big difference.
  3. Users accept whatever outputs a smart system delivers. Most people believe that “computers are right.” There’s nothing DarkCyber can do to make people more aware.
  4. Algorithms can be fiddled in the following ways: [a] the numerical recipes and the idiosyncrasies of calculation just do their thing; for example, drift off in a weird direction or produce the equivalent of white noise; [b] the outputs get skewed because of the data flowing into the system automagically (very risky) or via human subject matter experts (also very risky); [c] the programmers implementing the algorithm focus on the code, speed, and deadline, not how the outputs flow; for example, k-means can be really mean and Bayesian methods can bay at the moon.

Net net: Worth reading this analysis.

Stephen E Arnold, July 23, 2019

Need a Machine Learning Algorithm?

July 17, 2019


The Web site published “101 Machine Learning Algorithms for Data Science with Cheat Sheets.” The write up recycles information from DataScienceDojo, and some of the information looks familiar. But lists of algorithms are not original. They are useful. What sets this list apart is the inclusion of “cheat sheets.”

What’s a cheat sheet?

In this particular collection, a cheat sheet looks like this:


You can see the entry for the algorithm: Bernoulli Naive Bayes with a definition. The “cheat sheet” is a link to a Python example. In this case, the example is a link to an explanation on the Chris Albon blog.
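For readers who want to see what a Bernoulli Naive Bayes classifier does, here is a minimal, standard-library-only sketch. It is not the example from the Chris Albon blog; the function names, the smoothing default, and the toy spam/ham data are assumptions made for illustration.

```python
from math import log


def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit class priors and per-feature Bernoulli probabilities.

    X: list of binary feature vectors (0/1); y: list of class labels.
    alpha: Laplace smoothing strength (assumed default of 1.0).
    """
    model = {}
    n_features = len(X[0])
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        prior = len(rows) / len(X)
        # P(feature j = 1 | class), smoothed so no probability is 0 or 1.
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (prior, probs)
    return model


def predict_bernoulli_nb(model, x):
    """Return the label with the highest log posterior for binary vector x."""

    def log_posterior(label):
        prior, probs = model[label]
        total = log(prior)
        for value, p in zip(x, probs):
            # Absent features count too: they contribute log(1 - p).
            total += log(p) if value else log(1 - p)
        return total

    return max(model, key=log_posterior)


# Toy data, invented here: features are [contains "free", contains "meeting"].
X = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
y = ["spam", "spam", "ham", "ham", "ham"]
model = train_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [1, 0]))
print(predict_bernoulli_nb(model, [0, 1]))
```

The detail that separates the Bernoulli variant from other Naive Bayes flavors is that a missing feature is evidence too, which is why the `log(1 - p)` branch exists.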

What’s interesting is that the 101 algorithms are grouped under 18 categories. Of these 18, Bayes and derivative methods total five.

No big deal, but in my lectures about widely used algorithms I highlight 10, mostly because it is a nice round number. The point is that most of the analytics vendors use the same basic algorithms. Variations among products built on these algorithms are significant.

As analytics systems become more modular, that is, like Lego blocks, it seems that the trajectory of development will be to select, preconfigure thresholds, and streamline processes in a black box.

Is this good or bad?

It depends on whether one’s black box is a dominant solution or platform.

Will users know that this almost inevitable narrowing has upsides and downsides?


Stephen E Arnold, July 17, 2019

Exclusive: DataWalk Explained by Chris Westphal

July 9, 2019

“An Interview with Chris Westphal” provides an in-depth review of a company now disrupting the analytic and investigative software landscape.

DataWalk is a company shaped by a patented method for making sense of different types of data. The technique is novel and makes it possible for analysts to extract high value insights from large flows of data in near real time with an unprecedented ease of use.

In late June 2019, DarkCyber interviewed Chris Westphal, the innovator who co-founded Visual Analytics. That company’s combination of analytics methods and visualizations was acquired by Raytheon in 2013. Now Westphal is applying his talents to a new venture, DataWalk.

Westphal, who monitors advanced analytics, learned about DataWalk and joined the firm in 2017 as the Chief Analytics Officer. The company has grown rapidly and now has client relationships with corporations, governments, and ministries throughout the world. Applications of the DataWalk technology include investigations of fraud, corruption, and serious crimes.

Unlike most investigative and analytics systems, users can obtain actionable outputs by pointing and clicking. The system captures these clicks on a ribbon. The actions on the ribbon can be modified, replayed, and shared.

In an exclusive interview with Mr. Westphal, DarkCyber learned:

The [DataWalk] system gets “smarter” by encoding the analytical workflows used to query the data; it stores the steps, values, and filters to produce results thereby delivering more consistency and reliability while minimizing the training time for new users. These workflows (aka “easy buttons”) represent domain or mission-specific knowledge acquired directly from the client’s operations and derived from their own data; a perfect trifecta!

One of the differentiating features of DataWalk’s platform is that it squarely addresses the shortage of trained analysts and investigators in many organizations. Westphal pointed out:

…The workflow idea is one of the ingredients in the DataWalk secret sauce. Not only do these workflows capture the domain expertise of the users and offer management insights and metrics into their operations such as utilization, performance, and throughput, they also form the basis for scoring any entity in the system. DataWalk allows users to create risk scores for any combination of workflows, each with a user-defined weight, to produce an overall, aggregated score for every entity. Want to find the most suspicious person? Easy, just select the person with the highest risk-score and review which workflows were activated. Simple. Adaptable. Efficient.
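The scoring idea Westphal describes can be sketched in a few lines. This is not DataWalk's implementation, which the interview does not detail. It assumes the aggregate is a plain weighted sum, and the workflow names, weights, and entities are hypothetical.

```python
def risk_scores(activations, weights):
    """Aggregate weighted workflow hits into one score per entity.

    activations: dict entity -> set of workflow names that flagged it.
    weights: dict workflow name -> user-defined weight.
    Assumes a simple weighted sum; the real aggregation may differ.
    """
    return {
        entity: sum(weights[w] for w in hits)
        for entity, hits in activations.items()
    }


# Hypothetical workflows and weights, invented for this sketch.
weights = {"shared_address": 2.0, "cash_structuring": 5.0, "watchlist_match": 8.0}
activations = {
    "person_a": {"shared_address"},
    "person_b": {"cash_structuring", "watchlist_match"},
    "person_c": {"shared_address", "cash_structuring"},
}

scores = risk_scores(activations, weights)
# "Select the person with the highest risk-score and review which workflows were activated."
top = max(scores, key=scores.get)
print(top, scores[top], sorted(activations[top]))
```

Because the weights are set by the user, the ranking reflects the analyst's judgment as much as the data, which is worth remembering when reviewing the "most suspicious" entity.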

Another problem some investigative and analytic system developers face is user criticism. According to Westphal, DataWalk takes a different approach:

We listen carefully to our end-user community. We actively solicit their feedback and we prioritize their inputs. We try to solve problems versus selling licenses… DataWalk is focused on interfacing to a wide range of data providers and other technology companies. We want to create a seamless user experience that maximizes the utility of the system in the context of our client’s operational environments.

For more information about DataWalk, navigate to For the full text of the interview, click this link. You can view a short video summary of DataWalk in the July 2, 2019, DarkCyber Video available on Vimeo.

Stephen E Arnold, July 9, 2019

Knowledge Graphs: Getting Hot

July 4, 2019

Artificial intelligence, semantics, and machine learning may lose their pride of place in the techno-jargon whiz bang marketing world. I read “A Common Sense View of Knowledge Graphs,” and noted this graph:


This is a good, old fashioned, Gene Garfield (remember him, gentle reader) citation analysis. The idea is that one can “see” how frequently an author or, in this case, a concept has been cited in the “literature.” Now publishers are dropping like flies and are publishing bunk. Nevertheless, one can see that using the phrase knowledge graph is getting popular within the sample of “literature” parsed for this graph. (No, I don’t recommend trying to perform citation analysis in Bing, Facebook, or Google. The reasons will just depress me and you, gentle reader.)

The section of the write up I found useful and worthy of my “research” file is the collection of references to documents defining “knowledge graph.” This is useful, helpful research.

The write up also includes a diagram which may be one of the first representations of a graph centric triple. I thought this was something cooked up by Drs. Bray, Guha, and others in the tsunami of semantic excitement.

One final point: The list of endnotes is also useful. In short, good write up. The downside is that if the article gets wider distribution, a feeding frenzy among money desperate consultants, advisers, and analysts will be ignited like a Fourth of July fountain of flame.

Stephen E Arnold, July 4, 2019

Google: A Question of Judgment

July 3, 2019

In the realm of unintended consequences, this one is a doozy. MIT Technology Review reports, “YouTube’s Algorithm Makes it Easy for Pedophiles to Find More Videos of Children.” The brief write-up provides just-the-facts coverage of the disturbing issue. Writer Charlotte Jee summarizes:

“YouTube’s automated recommendation system has gathered a collection of prepubescent, partially clothed children and is recommending it to people who have watched similar videos, the New York Times reports. While some of the recommendations have been switched off on certain videos, the company has refused to end the practice. …

We noted:

“YouTube disabled comments on many videos of children in February after an outcry over pedophiles using the comment section to guide each other. It doesn’t let kids under 13 open accounts. However, it won’t stop recommending videos of children because it is worried about negative impact on family vloggers, some of whom have many millions of followers. In a blog post responding to the New York Times story, YouTube said that it was ‘limiting’ recommendations on some videos that may put children at risk.”

Those limits are to be applied to videos with minors in “risky situations,” though the blog post does not specify who, or what, will make that judgment. Jee is suspicious of YouTube’s motivations, noting that the site’s goal is to capture and keep “eyeballs.” Despite what else is allowed to thrive across the platform, the company apparently decided to draw a (dotted) line at this issue.

Cynthia Murrell, July 3, 2019

Machine Learning: Whom Does One Believe?

June 28, 2019

Ah, another day begins with mixed messages. Just what the relaxed, unstressed modern decider needs.

First, navigate to “Reasons Why Machine Learning can Prove Beneficial for Your Organization.” The reasons include:

  • Segment customer coverage. No, I don’t know what this means either.
  • Accurate business forecasts. No, machine learning systems cannot predict horse races or how a business will do. How about the impact of tariffs or a Fed interest rate change?
  • Improved customer experience. No, experiences are not improving. How do I know? Ask a cashier to make change? Try to get an Amazon professional to explain how to connect a Mac laptop to an Audible account WITHOUT asking, “May I take control of your computer with our software?”
  • Make decisions confidently. Yep, that’s what a decider does in the stable, positive, uplifting work environment of an electronic exchange when a bug costs millions in two milliseconds.
  • Automate your routine tasks. Absolutely. Automation works well. Ask the families of those killed by “intelligence stoked” automobiles or smart systems on a 737 Max.

But there’s a flip side to these cheery “beneficial” outcomes. Navigate to “Machine Learning Systems Are Stuck in a Rut.” We noted these statements. First a quote from a technical paper.

In this paper we argue that systems for numerical computing are stuck in a local basin of performance and programmability. Systems researchers are doing an excellent job improving the performance of 5-year old benchmarks, but gradually making it harder to explore innovative machine learning research ideas.

Next this comment by the person who wrote the “Learning Systems” article:

The thrust of the argument is that there’s a chain of inter-linked assumptions / dependencies from the hardware all the way to the programming model, and any time you step outside of the mainstream it’s sufficiently hard to get acceptable performance that researchers are discouraged from doing so.

Which is better? Which is correct?

Be a decider, using either a black box or the stuff between your ears.

Stephen E Arnold, June 28, 2019

Handy List of Smart Software Leaders

June 27, 2019

As the field of AI grows, it can be difficult to keep track of the significant players. Datamation shares a useful list in, “Top 45 Artificial Intelligence Companies.” If you skim the lineup, just keep in mind—entries are not ranked in any way, simply listed in alphabetical order. Writer Andy Patrizio begins with some observations about the industry:

“AI is driving significant investment from venture capitalist firms, giant firms like Microsoft and Google, academic research, and job openings across a multitude of sectors. All of this is documented in the AI Index, produced by Stanford University’s Human-Centered AI Institute. …

We noted:

“Consulting giant Accenture believes AI has the potential to boost rates of profitability by an average of 38 percentage points and could lead to an economic boost of US$14 trillion in additional gross value added (GVA) by 2035. In truth, artificial intelligence holds a plethora of possibilities—and risks. ‘It will have a huge economic impact but also change society, and it’s hard to make strong predictions, but clearly job markets will be affected,’ said Yoshua Bengio, a professor at the University of Montreal, and head of the Montreal Institute for Learning Algorithms.”

For their selections, Datamation chose companies of particular note and those that have invested heavily in AI. Many names are ones you would expect to see, like Amazon, Google, IBM, and Microsoft. Others are more specialized—robotics platforms Anki and CloudMinds, for example, or iCarbonX, Tempus, and Zebra Medical Vision for healthcare. Several entries are open source. Check out the article for more.

Cynthia Murrell, June 24, 2019
