Has TikTok Set Off Another Alarm in Washington, DC?

September 9, 2021

Perhaps TikTok was hoping the recent change to its privacy policy would slip under the radar. The Daily Dot reports that “Senators are ‘Alarmed’ at What TikTok Might Be Doing with your Biometric Data.” The video-sharing platform’s new policy specifies it now “may collect biometric identifiers and biometric information,” like “faceprints and voiceprints.” Why are we not surprised? Two US senators expressed alarm at the new policy, which, they emphasize, affects nearly 130 million users while revealing few details. Writer Andrew Wyrich reports,

“That change has sparked Sen. Amy Klobuchar (D-Minn.) and Sen. John Thune (R-S.D.) to ask TikTok for more information on how the app plans to use that data they said they’d begin collecting. Klobuchar and Thune wrote a letter to TikTok earlier this month, which they made public this week. In it, they ask the company to define what constitutes a ‘faceprint’ and a ‘voiceprint’ and how exactly that collected data will be used. They also asked whether that data would be shared with third parties and how long the data will be held by TikTok. … Klobuchar and Thune also asked the company to tell them whether it was collecting biometric data on users under 18 years old; whether it will ‘make any inferences about its users based on faceprints and voiceprints;’ and whether the company would use machine learning to determine a user’s age, gender, race, or ethnicity based on the collected faceprints or voiceprints.”

Our guess is yes to all three, though we are unsure whether the company will admit as much. Nevertheless, the legislators make it clear they expect answers to these questions as well as a list of all entities that will have access to the data. We recommend you do not hold your breath, Senators.

Cynthia Murrell, September 9, 2021

Change Is Coming But What about Un-Change?

September 8, 2021

My research team is working on a short DarkCyber video about automating work processes related to smart software. The idea is that one smart software system can generate an output to update another smart software system. The trend was evident more than a decade ago in the work of Dr. Zbigniew Michalewicz, his son, and collaborators. He is a co-author of How to Solve It: Modern Heuristics. There were predecessors, and today many others follow smart approaches to operations for artificial intelligence, or what thumbtypers call AIOps. The DarkCyber video will become available on October 5, 2021. We’ll try to keep the video peppy because smart software methods are definitely exciting and mostly invisible. And like other embedded components, some of these “modules” will become commoditized and just used “as is.” That’s important because who worries about a component in a larger system? Do you wonder if the microwave is operating at peak efficiency with every component chugging along up to spec? Nope and nope.

I read a wonderful example of Silicon Valley MBA thinking called “It’s Time to Say ‘Ok, Boomer!’ to Old School Change Management.” At first glance, the ideas about efficiency and keeping pace with technical updates make sense. The write up states:

There are a variety of dated methods when it comes to change management. Tl;dr it’s lots of paper and lots of meetings. These practices are widely regarded as effective across the industry, but research shows this is a common delusion and change management itself needs to change.

Hasta la vista, Mr. Drucker and the McKinsey framework.

The write up points out that a solution is at hand:

DevOps teams push lots of changes and this is creating a bottleneck as manual change management processes struggle to keep up. But, the great thing about DevOps is that it solves the problem it creates. One of the key aspects where DevOps can be of great help in change management is in the implementation of compliance. If the old school ways of managing change are too slow why not automate them like everything else? We already do this for building, testing and qualifying, so why not change? We can use the same automation to record change events in real time and implement release controls in the pipelines instead of gluing them on at the end.

Does this seem like circular reasoning?
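Circular or not, the mechanics are simple enough to sketch. Below is a minimal, hypothetical illustration of the quote’s idea, a pipeline step that records its own change events; the field names and log format are my assumptions, not anything from the write up:

```python
import json
import time

# Hypothetical sketch: the deployment pipeline records each change
# event itself instead of a human filling in a form after the fact.
def record_change_event(service, version, approved_by, log_path="changes.jsonl"):
    event = {
        "service": service,
        "version": version,
        "approved_by": approved_by,
        "timestamp": time.time(),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(event) + "\n")  # append-only audit trail

# A pipeline stage would call this right after a deploy succeeds.
record_change_event("billing-api", "v2.3.1", approved_by="pipeline")
```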

I want to point out that if one of the automation components operates on probabilities and the thresholds are incorrect, if the data are poisoned (corrupted by intent or chance), or if the “averaging” built into some systems triggers a butterfly effect, excitement may ensue. The idea is that a small change can have a large impact downstream; for example, a wing flap in Biloxi could create a flood at the 28th Street Flatiron stop.
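A back-of-the-envelope way to see that downstream impact: treat the pipeline as five chained components, each correct with probability p. A minimal sketch with made-up accuracy figures:

```python
# Five chained automated components, each correct with probability p.
# A two-point slip per stage compounds into roughly a nine-point drop
# end to end -- small change upstream, large impact downstream.
for p in (0.99, 0.97):
    end_to_end = p ** 5
    print(f"per-stage accuracy {p:.0%} -> end-to-end accuracy {end_to_end:.1%}")
```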

Several observations:

  • AIOps are already in operation at outfits like the Google and will be componentized in an AWS-style package
  • Embedded stuff, like popular libraries, is just used and not thought about. The practice brings joy to bad actors who corrupt some library offerings
  • Once a component is up and running and assumed to be okay, those modules themselves resist change. When 20-somethings encounter mainframe code, their surprise is consistent. Are we gonna change this puppy or slap on a wrapper? What’s your answer, gentle reader?

Net net: AIOps sets the stage for more Timnit Gebru-style shoot-outs about bias and discrimination as well as the type of cautions produced by Cathy O’Neil in Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.

Okay, thumbtyper.

Stephen E Arnold, September 8, 2021

TikTok: No Big Deal? Data Collection: No Big Deal Either

September 7, 2021

Here’s an interesting and presumably dead accurate statement from “TikTok Overtakes YouTube for Average Watch Time in US and UK.”

YouTube’s mass audience means it’s getting more demographics that are comparatively light internet users… it’s just reaching everyone who’s online.

So this means Google is number one? The write up points out:

The Google-owned video giant has an estimated two billion monthly users, while TikTok’s most recent public figures suggested it had about 700 million in mid-2020.

Absolutely. To me, it looks as if two billion is bigger than 700 million.

But TikTok has “upended the streaming and social landscape.”

How? Two billion is bigger than 700 million. Googlers like metrics, and that’s a noticeable difference.

I learned that the average time per user spent on the apps is higher for TikTok than for YouTube. TikTok has high levels of “engagement.”

Google YouTube has more users, but TikTok users are apparently more hooked on the short form content from the quasi-China-influenced outfit.
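The metric behind these claims is plain arithmetic. A minimal sketch, with hypothetical session logs standing in for either platform’s data:

```python
from collections import defaultdict

# Hypothetical session logs: (user_id, minutes_watched).
sessions = [("u1", 12.0), ("u1", 30.5), ("u2", 5.0), ("u3", 45.0), ("u2", 8.5)]

totals = defaultdict(float)
for user, minutes in sessions:
    totals[user] += minutes  # total watch time per user

# Average watch time per user: total minutes divided by distinct users.
avg_watch_time = sum(totals.values()) / len(totals)
print(f"average watch time per user: {avg_watch_time:.1f} minutes")
```

A platform with fewer but heavier users can win on this metric while losing on raw audience size, which is exactly the TikTok-versus-YouTube story.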

Advertisers will care. Retailers who want to hose users with product pitches via TikTok care.

Data harvesters at TikTok will definitely care. More time spent on a monitored app yields a more helpful set of data about its users. These users can be tagged and analyzed using helpful open source tools like Bootleg.

Just a point to consider: How useful will time series data be about a TikTok user or user cluster? How useful will such data be when it comes time to identify a candidate for insider action? But some Silicon Valley wizards pooh-pooh TikTok data collection. Maybe a knowledge gap for this crowd?

Stephen E Arnold, September 7, 2021

FR Is Going Far

September 6, 2021

Law enforcement officials are using facial recognition software and the array of cameras that cover much of the world to identify bad actors. The New York Times reports on how the technology was used to track down a couple wanted for arson: “A Fire In Minnesota. An Arrest In Mexico. Cameras Everywhere.”

Mena Yousif is an Iranian refugee, and Jose Felan is a felon. The couple were frustrated with the state of the American law enforcement system and government, especially after George Floyd’s death. They set fire to buildings, including schools, stores, and gas stations, and caused damage to over 1,500 buildings. The ATF posted videos of the pair online, asking for any leads to their arrests. The ATF received tips as Felan and Yousif traveled across the US to the Mexican border. They were on the run for two weeks before they were identified outside a motel in Texas.

Mexican authorities used a comprehensive facial recognition system, deployed in 2019, to find Felan and Yousif. Dahua Technology designed Mexico’s facial recognition system. Dahua is a Chinese company, one of the largest video surveillance companies in the world, and is partially owned by the Chinese government. The US Defense and Commerce departments blacklisted Dahua over China’s treatment of Uighur Muslims and the trade war. Dahua denies the allegations and states that it cannot control how its technology is used. In the end, facial recognition did not catch Yousif and Felan; a tip did.

China is marketing surveillance technology to other countries, particularly in South America, Asia, and Africa, as a means to minimize crime and promote order. The technology is far from perfect, and the US uses it despite the issues:

“In the United States, facial recognition technology is widely used by law enforcement officials, though poorly regulated. During a congressional hearing in July, lawmakers expressed surprise that 20 federal agencies were using it without having fully assessed the risks of misuse or bias — some algorithms have been found to work less accurately on women and people of color, and it has led to mistaken arrests. The technology can be a powerful and effective crime-solving tool, though, placing it, for now, at a tipping point. At the start of the hearing, Representative Sheila Jackson Lee, Democrat of Texas, highlighted the challenge for Congress — or anyone — in determining the benefits and downsides to using facial recognition: It’s not clear how well it works or how widely it’s used. As Ms. Jackson Lee said, “Information on how law enforcement agencies have adopted facial recognition technology remains underreported or nonexistent.”

Many governments around the world, including the US, seem poised to increase their use of facial recognition and tracking technology in the name of law and order. What is interesting is that China has been a pacesetter.

Whitney Grace, September 6, 2021

Not an Onion Report: Handwaving about Swizzled Data

August 24, 2021

I read at the suggestion of a friend “These Data Are Not Just Excessively Similar. They Are Impossibly Similar.” At first glance, I thought the write up was a column in an Onion-type publication. Nope, someone copied the same data set and pasted it into itself.

Here’s what the write up says:

The paper’s Excel spreadsheet of the source data indicated mathematical malfeasance.

Malfeasance. Okay.

But what caught my interest was the inclusion of this name: Dan Ariely. If this is the Dan Ariely who wrote these books, that fact alone is suggestive. If it is a different person, then we are dealing with routine data dumbness or data dishonesty.


The write up contains what I call academic ducking and covering. You may enjoy this game, but I find it boring. Non-reproducible results, swizzled data, and massaged numerical recipes are the status quo.

Is there a fix? Nope, not as long as most people cannot make change or add up the cost of items in a grocery basket. Smart software depends on data. And if those data are like those referenced in this Metafilter article, well, excitement.
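Catching the crudest version of the problem takes only a few lines. A minimal sketch, with made-up numbers standing in for the paper’s spreadsheet:

```python
import pandas as pd

# Hypothetical data: the second half is a copy of the first, the
# sort of thing the article alleges happened to the source data.
half = pd.DataFrame({"miles_driven": [1200, 3450, 2210, 980],
                     "claimed": [1300, 3500, 2300, 1000]})
df = pd.concat([half, half], ignore_index=True)

# Exact duplicate rows are a telltale of a dataset pasted into itself.
dupes = df.duplicated(keep=False)
print(f"{dupes.sum()} of {len(df)} rows are exact duplicates")
```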

Stephen E Arnold, August 24, 2021

Big Data, Algorithmic Bias, and Lots of Numbers Will Fix Everything (and Your Check Is in the Mail)

August 20, 2021

We must remember, “The check is in the mail” and “I will always respect you” and “You can trust me.” Ah, great moments in the University of Life’s chapbook of factoids.

I read “Moving Beyond Algorithmic Bias Is a Data Problem”. I was heartened by the essay. First, the document has a document object identifier and a link to make checking updates easy. Very good. Second, the focus of the write up is the inherent problem with most of the Fancy Dan, baloney-charged big data marketing to which I have been subjected in the last six or seven years. Very, very good.

I noted this statement in the essay:

Why, despite clear evidence to the contrary, does the myth of the impartial model still hold allure for so many within our research community? Algorithms are not impartial, and some design choices are better than others.

Notice the word “myth.” Notice the word “choices.” Yep, so much for the rock-solid nature of big data, models, and predictive silliness based on drag-and-drop math functions.

I also starred this important statement:

Donald Knuth said that computers do exactly what they are told, no more and no less.

What’s the real world behavior of smart anti-phishing cyber security methods? What about the autonomous technology in some nifty military gear like the Avenger drone?

Google may not be thrilled with the information in this essay nor thrilled about the nailing of the frat bros’ tails to the wall; for example:

The belief that algorithmic bias is a dataset problem invites diffusion of responsibility. It absolves those of us that design and train algorithms from having to care about how our design choices can amplify or curb harm. However, this stance rests on the precarious assumption that bias can be fully addressed in the data pipeline. In a world where our datasets are far from perfect, overall harm is a product of both the data and our model design choices.
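The essay’s point about design choices is easy to demonstrate with a toy example: the same synthetic scores, two different decision thresholds, two different gaps between groups. Nothing below comes from the essay; it is a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "risk scores" for two groups; the model is identical,
# only the decision threshold (a human design choice) varies.
group_a = rng.normal(0.55, 0.15, 10_000)
group_b = rng.normal(0.45, 0.15, 10_000)

for threshold in (0.5, 0.6):
    flag_a = (group_a > threshold).mean()
    flag_b = (group_b > threshold).mean()
    print(f"threshold {threshold}: group A flagged {flag_a:.1%}, "
          f"group B flagged {flag_b:.1%}, gap {flag_a - flag_b:.1%}")
```

Same data, different choice, different harm profile. That is the diffusion-of-responsibility problem in a few lines.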

Perhaps this explains why certain researchers’ work is not zipping around Silicon Valley at the speed of routine algorithm tweaks? The statement could provide some useful insight into why Facebook does not want pesky researchers at NYU’s Ad Observatory digging into how Facebook manipulates perception and advertisers.

The methods for turning users and advertisers into puppets are not too difficult to figure out. That’s why certain companies obstruct researchers and manufacture baloney, crank up the fog machine, and offer free jargon stew to everyone including researchers. These are the same entities which insist they are not monopolies. Do you believe that these are mom-and-pop shops with a part time mathematician and data wrangler coming in on weekends? Gee, I do.

The “Moving beyond” article ends with a snappy quote:

As Lord Kelvin reflected, “If you cannot measure it, you cannot improve it.”

Several observations are warranted:

  1. More thinking about algorithmic bias is helpful. The task is to get people to understand what’s happening and has been happening for decades.
  2. The interaction of math most people don’t understand with very simple objectives, like making more money or advancing an agenda, is a destabilizing force in human behavior. Need an example? The Taliban and its use of WhatsApp is interesting, is it not?
  3. The fix to the problems associated with commercial companies using algorithms as monetary and social weapons requires control. The question is from whom and how.

Stephen E Arnold, August 20, 2021

Nifty Interactive Linear Algebra Text

August 17, 2021

Where was this text when I was an indifferent student in a one-cow town high school? I suggest you take a look at Dan Margalit and Joseph Rabinoff’s Interactive Linear Algebra. The text is available online and as a PDF version. The information is presented clearly, and there are helpful illustrations. Some of them wiggle and jump. This is a must-have in my opinion. Linear algebra in the age of whiz-bang smart methods? Yes. One comment: When we checked the online version, the hot links in the index did not resolve. Use the Next link.

Stephen E Arnold, August 17, 2021

Spreadsheet Fever: It Is Easy to Catch

August 9, 2021

Regression is useful. I regress to my mean with every tick of my bio clock. I read “A Simple Regression Problem.” I like these explainer-type articles.

This write up contains a paragraph amplifying model-fitting techniques, and I found this passage thought-provoking. Here it is:

If you use Excel, you can try various types of trend lines to approximate the blue curve, and even compute the regression coefficients and the R-squared for each tested model. You will find very quickly that the power trend line is the best model by far, that is, An is very well approximated (for large values of n) by An = b n^c. Here n^c stands for n at power c; also, b and c are the regression coefficients. In other words, log An = log b + c log n (approximately).

The bold face indicates the words and phrases I found suggestive. With this encouraged twiddling, one can get a sense of how fancy math can be converted into a nifty array of numbers which flow. Even better, a graphic can be generated with a click.
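To see how little twiddling is required, here is a minimal sketch of the log-log fit the quoted passage describes, using synthetic data in place of the author’s sequence:

```python
import numpy as np

# Synthetic sequence following A_n = b * n^c with mild noise,
# standing in for the blue curve in the article.
b, c = 2.0, 1.5
n = np.arange(1, 101)
noise = 1 + np.random.default_rng(1).normal(0, 0.01, n.size)
A = b * n**c * noise

# Fit log A = log b + c log n by ordinary least squares.
slope, intercept = np.polyfit(np.log(n), np.log(A), 1)
print(f"estimated c = {slope:.3f}, estimated b = {np.exp(intercept):.3f}")
```

Excel’s power trend line performs essentially this computation behind a click, which is the article’s point and the fever’s vector.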

What happens when data scientists and algorithm craftspeople assemble their confection of dozens, even hundreds, of similar procedures? Do you talk about Bayesian drift at the golf club? If yes, then watch for spreadsheet fever’s warning signs.

Stephen E Arnold, August 9, 2021

Why Some Outputs from Smart Software Are Wonky

July 26, 2021

Some models work like a champ. Utility rate models are reasonably reliable. When it is hot, use of electricity goes up. Rates are then “adjusted.” Perfect. Other models are less solid; for example, Bayesian systems which are not checked every hour or large neural nets which are “assumed” to be honking along like a well-ordered flight of geese. Why do I offer such Negative Ned observations? Experience, for one thing, and the nifty little concepts tossed out by Ben Kuhn, a Twitter persona. You can locate this string of observations at this link. Well, you could as of July 26, 2021, at 6:30 am US Eastern time. Here’s a selection of what are apparently the highlights of Mr. Kuhn’s conversation with “a former roommate.” That’s provenance enough for me.

Item One:

Most big number theory results are apparently 50-100 page papers where deeply understanding them is ~as hard as a semester-long course. Because of this, ~nobody has time to understand all the results they use—instead they “black-box” many of them without deeply understanding.

Could this be true? How could newly minted “become an expert with our $40 online course” professionals, who use models packaged in downloadable, easy-to-plug-in modules, be unfamiliar with the inner workings of said bundles of brilliance? Impossible? Really?

Item Two:

A lot of number theory is figuring out how to stitch together many different such black boxes to get some new big result. Roommate described this as “flailing around” but also highly effective and endorsed my analogy to copy-pasting code from many different Stack Overflow answers.

Oh, come on. Flailing around? Do developers flail, or do they “trust” the outfits who pretend to know how some multi-layered systems work? Fiddling with assumptions, thresholds, and (close your ears) the data themselves is never, ever a way to work around a glitch.

Item Three:

Roommate told a story of using a technique to calculate a number and having a high-powered prof go “wow, I didn’t know you could actually do that”

No kidding? That’s impossible in general, and that expression would never be uttered at Amazon-, Facebook-, and Google-type operations, would it?

Will Mr. Kuhn be banned for heresy? [Keep in mind how Wikipedia defines this term: “any belief or theory that is strongly at variance with established beliefs or customs, in particular the accepted beliefs of a church or religious organization.”] Once, merely repeating such an idea would warrant a close encounter with an iron maiden or a pile of firewood. Probably not today. Someone might emit a slightly critical tweet, however.

Stephen E Arnold, July 26, 2021

Elasticsearch Versus RocksDB: The Old Real Time Razzle Dazzle

July 22, 2021

Something happens. The “event” is captured and written to a file. Even if you are watching the “something” happening, there is latency between the event and the sensor or the human perceiving the event. The calculus of real time mostly avoids too much talk about latency. But real time is hot because who wants to look at old data? Not TikTok fans and not the money-fueled lovers of Robinhood.

“Rockset CEO on Mission to Bring Real-Time Analytics to the Stack” uses lots of buzzwords, sidesteps inherent latency, and avoids commentary on other allegedly real-time analytics systems. Rockset is built on RocksDB, open source software. Nevertheless, there is some interesting information about Elasticsearch; for example:

  • Unsupported factoids like: “Every enterprise is now generating more data than what Google had to index in [year] 2000.”
  • No definition or baseline for “simple”: “The combination of the converged index along with the distributed SQL engine is what allows Rockset to be fast, scalable, and quite simple to operate.”
  • Different from Elasticsearch and RocksDB: “So the biggest difference between Elastic and RocksDB comes from the fact that we support full-featured SQL including JOINs, GROUP BY, ORDER BY, window functions, and everything you might expect from a SQL database. Rockset can do this. Elasticsearch cannot.”
  • Similarities with Rockset: “So Lucene and Elasticsearch have a few things in common with Rockset, such as the idea to use indexes for efficient data retrieval.”
  • Jargon and unique selling proposition: “We use converged indexes, which deliver both what you might get from a database index and also what you might get from an inverted search index in the same data structure. Lucene gives you half of what a converged index would give you. A data warehouse or columnar database will give you the other half. Converged indexes are a very efficient way to build both.”
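For readers who want the “converged index” claim in concrete terms, here is a toy sketch of one structure serving both retrieval styles. It is my illustration of the concept, not Rockset’s actual implementation:

```python
from collections import defaultdict

class ConvergedIndex:
    """Toy structure combining an inverted index (search-style lookups)
    with a column store (warehouse-style scans)."""

    def __init__(self):
        self.rows = []                    # row store: doc id -> record
        self.inverted = defaultdict(set)  # value -> doc ids
        self.columns = defaultdict(list)  # field -> column of values

    def add(self, record):
        doc_id = len(self.rows)
        self.rows.append(record)
        for field, value in record.items():
            self.inverted[str(value)].add(doc_id)  # for point lookups
            self.columns[field].append(value)      # for aggregations

idx = ConvergedIndex()
idx.add({"user": "alice", "watch_ms": 120})
idx.add({"user": "bob", "watch_ms": 340})
print(idx.inverted["alice"])             # search-index retrieval
print(sum(idx.columns["watch_ms"]) / 2)  # columnar aggregation
```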

Amazon has rolled out its real time system, and there are a number of options available from vendors like Trendalyze.

Each of these vendors emphasizes real time. The problem, however, is that latency exists regardless of system. Each has use cases which make its system seem to be the solution to real time data analysis. That’s what makes horse races interesting: they unfold in real time if one is at the track. Fractional delays have big consequences for those betting their solution is the least latent.
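The latency gap is easy to make concrete. A toy sketch, with the delay simulated rather than measured from any real system:

```python
import time

# "Something happens," then the event works its way through capture,
# batching, and indexing before anyone can query it.
event = {"payload": "something happened", "occurred_at": time.time()}
time.sleep(0.05)  # stand-in for network, batching, and indexing delay
event["queryable_at"] = time.time()

latency_ms = (event["queryable_at"] - event["occurred_at"]) * 1000
print(f"event-to-queryable latency: {latency_ms:.1f} ms")
```

Every “real-time” vendor pays some version of this tax; the marketing question is only how small the number looks.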

Stephen E Arnold, July 22, 2021
