Confessions? It Is That Time of Year

December 23, 2022

Forget St. Augustine.

Big data, data science, or whatever you want to call it was the precursor to artificial intelligence. Tech people pursued careers in the field, but after the synergy and hype wore off the real work began. According to WD in his RYX,R blog post, “Goodbye, Data Science,” the work was tedious, low-value, unfulfilling, and left little room for career growth.

WD worked as a data scientist for a few years, then quit in pursuit of the higher calling as a data engineer. He will be working on the implementation of data science instead of its origins. He explained why he left in four points:

• “The work is downstream of engineering, product, and office politics, meaning the work was only often as good as the weakest link in that chain.

• Nobody knew or even cared what the difference was between good and bad data science work. Meaning you could suck at your job or be incredible at it and you’d get nearly the same regards in either case.

• The work was often very low value-add to the business (often compensating for incompetence up the management chain).

• When the work’s value-add exceeded the labor costs, it was often personally unfulfilling (e.g. tuning a parameter to make the business extra money).”

WD’s experiences sound like those of everyone who is disenchanted with their line of work. He worked with managers who would not listen when they were told stupid projects would fail. The managers were more concerned with keeping their bosses and shareholders happy. He also mentioned that engineers are inflamed with self-grandeur and scientists are bad at code. He worked with young and older data people who did not know what they were doing.

As a data engineer, WD has more free time, more autonomy, and better prospects for career advancement, and he will continue to learn.

Whitney Grace, December 23, 2022

Trackers? More Plentiful Than Baloney Press Releases

December 22, 2022

You are absolutely correct if you think more Web sites are asking you to approve their cookie settings. More Web sites are tracking your personal information to send you targeted ads. Tech Radar explains more about trackers in, “You’re Not Wrong-Websites Have Way More Trackers Now.”

NordVPN discovered that the average Web site has forty-eight trackers, which puts users at risk. NordVPN used three tracker blockers (Badger, Brave, and uBlock Origin) to count the number of trackers across the one hundred most popular Web sites in twenty-five countries. Social media platforms had the most trackers at 160, health Web sites were second with forty-six, and digital media Web sites had twenty-eight. Ironically, government and adult Web sites had the fewest trackers.

Third parties were tied to the trackers. Thirty percent belonged to Google, 11% to Facebook, and 7% to Adobe. All data is used for marketing reasons. North and Central Europe had the fewest trackers because of privacy laws. The US is a tracker’s playground because there are not any blanket laws that protect user privacy. It is an Orwellian system for capitalist purposes:

“For NordVPN, the problem with collecting this data is that it can be used to profile the users in great detail. The profile is then sold to advertising companies, whose ads “follow” the users around the internet to collect even more data.

Worse still, cybercriminals might get their hands on this data at any point, and could then use this data in phishing attacks that use a victim’s in-depth personal profile to appear authentic, making them more likely to fall for the ruse.”

The article doubles as a marketing tool for VPN services, particularly NordVPN. VPNs collect user information too, except they can hide it. Interesting how the article wants to inform people about the dangers of tracking and wants to sell a product too.

Whitney Grace, December 22, 2022

Common Sense: A Refreshing Change in Tech Write Ups

December 13, 2022

I want to give a happy quack to this article: “Forget about Algorithms and Models — Learn How to Solve Problems First.” The common sense write up suggests that big data cowboys and cowgirls make sure of their problem solving skills before doing the algorithm and model Lego drill. To make this point clear: Put foundations in place before erecting a structure which may fail in interesting ways.

The write up says:

For programmers and data scientists, this means spending time understanding the problem and finding high-level solutions before starting to code.

But in an era of do your own research and thumbtyping, will common sense prevail?

Not often.

The article provides a list of specific steps to follow as part of the foundation for the digital confection. Worth reading; however, the write up tries to be upbeat.

A positive attitude is a plus. Too bad common sense is not particularly abundant in certain fascinating individual and corporate actions; to wit:

  • Doing the FBX talkathons
  • Installing spyware without legal okays
  • Writing marketing copy that asserts a cyber security system will protect a licensee.

You may have your own examples. Common sense? Not abundant in my opinion. That’s why a book like How to Solve It: Modern Heuristics is unlikely to be on the nightstands of many algorithm and data analysts. Do I know this for a fact? Nope, just common sense. Thumbtypers, remember?

Stephen E Arnold, December 13, 2022

TikTok: Algorithmic Data Slurping

November 14, 2022

There are several reasons TikTok rocketed to social-media dominance in just a few years. For example, its user-friendly creation tools plus a library of licensed tunes make it easy to create engaging content. Then there was the billion-dollar marketing campaign that enticed users away from Facebook and Instagram. But, according to the Guardian, it was the recommendation engine behind its For You Page (FYP) that really did the trick. Writer Alex Hern describes “How TikTok’s Algorithm Made It a Success: ‘It Pushes the Boundaries.’” He tells us:

“The FYP is the default screen new users see when opening the app. Even if you don’t follow a single other account, you’ll find it immediately populated with a never-ending stream of short clips culled from what’s popular across the service. That decision already gave the company a leg up compared to the competition: a Facebook or Twitter account with no friends or followers is a lonely, barren place, but TikTok is engaging from day one. It’s what happens next that is the company’s secret sauce, though. As you scroll through the FYP, the makeup of videos you’re presented with slowly begins to change, until, the app’s regular users say, it becomes almost uncannily good at predicting what videos from around the site are going to pique your interest.”

And so a user is hooked. Beyond the basics, specifically how the algorithm works is a mystery even, we’re told, to those who program it. We do know the AI takes the initiative. Instead of only waiting for users to select a video or tap a reaction, it serves up test content and tweaks suggestions based on how its suggestions are received. This approach has another benefit. It ensures each video posted on the platform is seen by at least one user, and every positive interaction multiplies its reach. That is how popular content creators quickly amass followers.
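The "serve test content, then tweak" loop described above resembles a textbook explore-and-exploit scheme. The sketch below is a generic epsilon-greedy recommender, not TikTok's method, which the article says is a mystery. The topic names, like probabilities, round count, and the `epsilon` value are all made-up assumptions for illustration.

```python
import random


def recommend(estimates, rng, epsilon=0.1):
    """Usually serve the best-rated topic; occasionally serve a test topic."""
    if rng.random() < epsilon:
        # Exploration: try any topic, including ones that look unpromising.
        return rng.choice(sorted(estimates))
    # Exploitation: serve the topic with the highest estimated like rate.
    return max(estimates, key=estimates.get)


def update(estimates, counts, topic, liked):
    """Fold one reaction into the running average for that topic."""
    counts[topic] += 1
    estimates[topic] += (float(liked) - estimates[topic]) / counts[topic]


def simulate(like_probability, rounds=2000, seed=7):
    """Run the loop against a simulated user (hypothetical preferences)."""
    rng = random.Random(seed)
    estimates = {topic: 0.0 for topic in like_probability}
    counts = {topic: 0 for topic in like_probability}
    for _ in range(rounds):
        topic = recommend(estimates, rng)
        liked = rng.random() < like_probability[topic]
        update(estimates, counts, topic, liked)
    return estimates, counts


# Assumed preferences of one imaginary user: loves cat clips, shrugs at the rest.
estimates, counts = simulate({"news": 0.1, "cooking": 0.2, "cats": 0.9})
print(counts)
```

In this toy run the feed should drift toward whatever the simulated user reacts to most, while the small exploration rate keeps giving other topics an occasional showing.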

Success can be measured in different ways, of course. Though TikTok has captured a record number of users, it is not doing so well in the critical monetization category. Estimates put its 2021 revenue at less than 5% of Facebook’s, and efforts to export its e-commerce component have not gone as hoped. Still, it looks like the company is ready to try, try again. Will its persistence pay off?

Cynthia Murrell, November 14, 2022

About Fancy Math: Struggles Are Not Exciting, Therefore Ignored

November 10, 2022

Ask yourself, “How many of my colleagues understand statistical procedures?” I am waiting.

Okay, enough time.

Navigate to “Pollsters Struggle to Improve Forecasts.” If you have a dead tree version of the Wall Street Journal, the story appears on page A4. If you have the online version of the paper, pay up and click this link. If you cannot locate the story, well, that’s life in the Murdoch high tech universe.

The article reports:

Overall, national polls in 2020 were the most inaccurate in 40 years, a study by the main association of survey researchers found, and state-level polls in 2016 were significantly off the mark.

So what?

Check out “Nate Silver Admits He Got Played by the GOP But Blames the Democrats for Not Using Poor Polling Practices.” The write up explains how a wizard fumbled the data ball. How many other whiz kids fumble data balls but do not come up with lame excuses and finger pointing?

Smart software relies on procedures not too distant from those used by pollsters. How accurate are the outputs from these massively hyped systems? Close enough for horseshoes? Good enough? Are you ready to let smart software determine how to treat your cancer, drive your vehicle, and grade your bright young 10-year-old?

Stephen E Arnold, November 10, 2022

What Is Better Than Biometrics Emotion Analysis of Surveillance Videos?

October 27, 2022

Many years ago, my team worked on a project to parse messages, determine if a text message was positive or negative, and flag the negative ones. Then of those negative messages, our job was to rank the negative messages in a league table. The team involved professionals in my lab in rural Kentucky, some whiz kids in big universities, a handful of academic experts, and some memorable wizards located offshore. (I have some memories, but, alas, these are not suitable for this write up.)

We used the most recent mechanisms to fiddle information from humanoid outputs. Despite the age of some numerical recipes, we used the latest and greatest. What surprised everyone is that our approach worked, particularly for the league table of the most negative messages. After reviewing our data, we formulated a simple, speedy way to pinpoint the messages which required immediate inspection by a person.

What was our solution for the deployable system?

Did we rely on natural language processing? Nope.

Did we rely on good old Reverend Bayes? Nope.

Did we rely on statistical analysis? Nope.

How did we do this? (Now keep in mind this was more than 15 years ago.)

We used a look up table of keywords.

Why? It delivered the league table of the most negative messages more than 85 percent of the time. The lookups were orders of magnitude faster than the fancy numerical recipes. The system was explainable. The method was extensible to second order negative messages with synonym expansion and, in effect, a second pass on the not-really negative messages. Yep, we crept into the 90 percent range.
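For readers who want to see the shape of the idea, here is a minimal sketch of a keyword look up table with a synonym-expansion second pass. It is not the original system: the keywords, weights, synonym map, and function names are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical look up table: keyword -> negativity weight (illustrative values).
NEGATIVE_KEYWORDS = {"hate": 3, "awful": 2, "broken": 2, "refund": 1}

# Hypothetical synonym map used only on the second pass.
SYNONYMS = {"terrible": "awful", "busted": "broken", "despise": "hate"}


def tokenize(message):
    """Lowercase the message and strip simple punctuation from each word."""
    return [word.strip(".,!?;:\"'").lower() for word in message.split()]


def score(message, expand_synonyms=False):
    """Sum the weights of every table keyword found in the message."""
    total = 0
    for token in tokenize(message):
        if expand_synonyms:
            token = SYNONYMS.get(token, token)
        total += NEGATIVE_KEYWORDS.get(token, 0)
    return total


def league_table(messages):
    """Rank messages from most to least negative, dropping those with no hits."""
    scored = []
    for message in messages:
        points = score(message)
        if points == 0:
            # Second pass: retry the apparent non-negatives with synonyms.
            points = score(message, expand_synonyms=True)
        if points > 0:
            scored.append((points, message))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


messages = [
    "I hate this awful product",
    "The screen is busted",
    "Lovely day",
    "I want a refund",
]
for points, message in league_table(messages):
    print(points, message)
```

Every score traces back to specific table entries, which is what makes this kind of approach explainable, and each dictionary lookup is cheap compared with running a statistical model.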

I thought about this work for a company which went the way of most lavishly funded wild and crazy start ups from the go-go years when I read “U.K. Watchdog Issues First of Its Kind Warning Against ‘Immature’ Emotional Analysis Tech.” This article addresses fancy methods for parsing images and other content to determine if a person is happy or sad. In reality, the purpose of these systems for some professional groups is to identify a potential bad actor before that individual creates content for the “if it bleeds, it leads” news organizations.

The article states:

The Information Commissioner’s Office, Britain’s top privacy watchdog, issued a searing warning to companies against using so-called “emotional analysis” tech, arguing it’s still “immature” and that the risks associated with it far outweigh any potential benefits.

You should read the full article to get the juicy details. Remember the text approach required one level of technology. We used a look up table because the magical methods were too expensive and too time consuming when measured against what was needed: Reasonable accuracy.

Taking videos and images, processing them, and determining if the individual in the image is a good actor or a bad actor, a happy actor or a sad actor, a nut job actor or a relative of Mother Teresa’s is another kettle of code.

Let’s go back to the question which is the title of this blog post: What Is Better Than Biometrics Emotion Analysis?

The answer is objective data about the clicks, dwell time, and types of indexed content an individual consumes. Lots of clicks translates to a signal of interest. Dwell time indicates attention. Cross correlate these data with other available information from primary sources and one can pinpoint some factoids that are useful in “knowing” about an individual.

My interest in the article was not the source article’s reminder that expectations for a technology are usually over inflated. My reaction was, “Imagine how useful TikTok data would be in identifying individuals with specific predilections, mood changes plotted over time, and high value signals about an individual’s interests.”

Yep, just a reminder that TikTok is in a much better place when it comes to individual analysis than relying on some complicated methods which don’t work very well.

Practical is better.

Stephen E Arnold, October 27, 2022

A Data Taboo: Poisoned Information But We Do Not Discuss It Unless… Lawyers

October 25, 2022

In a conference call yesterday (October 24, 2022), I mentioned one of my laws of online information; specifically, digital information can be poisoned. The venom can be administered by a numerically adept MBA or a junior college math major taking short cuts because data validation is hard work. The person on the call was mildly surprised because the notion of open source and closed source “facts” intentionally weaponized is an uncomfortable subject. I think the person with whom I was speaking blinked twice when I pointed out what should be obvious to most individuals in the intelware business. Here’s the pointy end of reality:

Most experts and many of the content processing systems assume that data are good enough. Plus, with lots of data any irregularities are crunched down by steamrolling mathematical processes.

The problem is that articles like “Biotech Firm Enochian Says Co Founder Fabricated Data” make it clear that MBA math as well as experts hired to review data can be caught with their digital clothing in a pile. These folks are, in effect, sitting naked in a room with people who want to make money. Nakedness from being dead wrong can lead to some career turbulence; for example, prison.

The write up reports:

Enochian BioSciences Inc. has sued co-founder Serhat Gumrukcu for contractual fraud, alleging that it paid him and his husband $25 million based on scientific data that Mr. Gumrukcu altered and fabricated.

The article does not explain precisely how the data were “fabricated.” However, someone with Excel skills or access to an article like “Top 3 Python Packages to Generate Synthetic Data” and Fiverr.com or a similar gig work site can get some data generated at a low cost. Who will know? Most MBA math and statistics classes focus on meeting targets in order to get a bonus or amp up a “service” fee for clicking a mouse. Experts who can figure out fiddled data sets take the time if they are motivated by professional jealousy or cold cash. Who blew the whistle on Theranos? A data analyst? Nope. A “real” journalist who interviewed people who thought something was goofy in the data.

My point is that it is trivially easy to whip up data to support a run at tenure or at a group of MBAs desperate to fund the next big thing as the big tech house of cards wobbles in the winds of change.

Several observations:

  1. The threat of bad or fiddled data is rising. My team is checking a smart output by hand because we simply cannot trust what a slick, new intelware system outputs. Yep, trust is in short supply among my research team.
  2. Data from assorted open and closed sources are accepted as is, without individual inspection. The attitude is that the law of large numbers, the sheer volume of data, or the magic of cross correlation will minimize errors. Sure these processes will, but what if the data are weaponized and crafted to avoid detection? The answer is to check each item. How’s that for a cost center?
  3. Uninformed individuals (yep, I am including some data scientists, MBAs, and hawkers of data from app users) don’t know how to identify weaponized data nor know what to do when such data are identified.
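The second observation can be illustrated with a toy simulation: random noise shrinks when averaged over many records, but a deliberate, same-direction nudge does not. This is a sketch under invented assumptions (the true value, noise level, poisoned fraction, and shift are all made up), not a model of any real data set.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 100.0  # assumed ground truth for the toy example


def noisy_readings(n, noise=5.0):
    """Honest measurements: the true value plus random, unbiased noise."""
    return [random.gauss(TRUE_VALUE, noise) for _ in range(n)]


def poison(readings, fraction=0.2, shift=10.0):
    """Nudge a fraction of the readings in the same direction.

    The shift is only two noise standard deviations, so each altered
    record still looks plausible when inspected on its own.
    """
    count = int(len(readings) * fraction)
    return [value + shift for value in readings[:count]] + readings[count:]


clean = noisy_readings(10_000)
poisoned = poison(clean)

clean_error = abs(statistics.mean(clean) - TRUE_VALUE)
poisoned_error = abs(statistics.mean(poisoned) - TRUE_VALUE)

print(f"error of the mean, clean data:    {clean_error:.3f}")
print(f"error of the mean, poisoned data: {poisoned_error:.3f}")
```

With 20 percent of the records shifted by 10, the mean moves by about 2 no matter how many records are collected. Volume dilutes random error, not a crafted bias.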

Does this suggest that a problem exists? If yes, what’s the fix?

[a] Ignore the problem

[b] Trust Google-like outfits who seek to be the source for synthetic data

[c] Rely on MBAs

[d] Rely on jealous colleagues in the statistics department with limited tenure opportunities

[e] Blink.

Pick one.

Stephen E Arnold, October 25, 2022

TikTok: Tracking Humanoids? Nope, Never, Ever

October 21, 2022

I read “TikTok Denies It Could Be Used to Track US Citizens.” Allegedly linked to the cheerful nation state China, TikTok allegedly asserts that it cannot, does not, and never ever thought about analyzing log data. Nope, we promise.

The article asserts:

The social media giant said on Twitter that it has never been used to “target” the American government, activists, public figures or journalists. The firm also says it does not collect precise location data from US users.

Here’s a good question: Have the notions of persistent cookies, geospatial data, content consumption analytics, and psychological profiling based on thematics never jibed with TikTok data at the Surveillance Soirée?

The answer is, according to the Beeb:

The firm [TikTok] also says it does not collect precise location data from US users. It was responding to a report in Forbes that data would have been accessed without users’ knowledge or consent. The US business magazine, which cited documents it had seen, reported that ByteDance had started a monitoring project to investigate misconduct by current and former employees. It said the project, which was run by a Beijing-based team, had planned to collect location data from a US citizen on at least two occasions.

Saying is different from doing in my opinion.

Based on my limited experience with online, would it be possible for a smart system with access to log data to do some high-value data analysis? Would it be possible to link the analytics’ output with a cluster of users? Would it be possible to cross correlate data so that individuals with a predicted propensity for a desired behavior could be identified?

Of course not. Never. Nation states and big companies are fountains of truth.

TikTok. Why worry?

Stephen E Arnold, October 21, 2022

Webb Wobbles: Do Other Data Streams Stumble Around?

October 4, 2022

I read an essay identified as an essay from The_Byte In Futurism with the content from Nature. Confused? I am.

The title of the article is “Scientists May Have Really Screwed Up on Early James Webb Findings.” The “Webb” is not the digital construct, but the space telescope. The subtitle about the data generated from the system is:

I don’t think anybody really expected this to be as big of an issue as it’s becoming.

Space is not something I think about. Decades ago I met a fellow named Fred G., who was engaged in a study of space warfare. Then one of my colleagues, Howard F., joined my team after doing some satellite stuff with a US government agency. He didn’t volunteer any information to me, and I did not ask. Space may be the final frontier, but I liked working on online from my land based office, thank you very much.

The article raises an interesting point; to wit:

When the first batch of data dropped earlier this summer, many dived straight into analysis and putting out papers. But according to new reporting by Nature, the telescope hadn’t been fully calibrated when the data was first released, which is now sending some astronomers scrambling to see if their calculations are now obsolete. The process of going back and trying to find out what parts of the work needs to be redone has proved “thorny and annoying,” one astronomer told Nature.

The idea is that the “Webby” data may have been distorted, skewed, or output with knobs and dials set incorrectly. Not surprisingly those who used these data to do spacey stuff may have reached unjustifiable conclusions. What about those nifty images, the news conferences, and the breathless references to the oldest, biggest, coolest images from the universe?

My thought is that the analyses, images, and scientific explanations are wrong to some degree. I hope the data are as pure as online clickstream data. No, no, strike that. I hope the data are as rock solid as mobile GPS data. No, no, strike that too. I hope the data are accurate like looking out the window to determine if it is a clear or cloudy day. Yes, narrowed scope, first hand input, and a binary conclusion.

Unfortunately in today’s world, that’s not what data wranglers do on the digital ranch.

If the “Webby” data are off kilter, my question is:

What about the data used to train smart software from some of America’s most trusted and profitable companies? Could these data be causing incorrect decisions to flow from models so that humans and downstream systems keep producing less and less reliable results?

My thought is, “Who wants to think about data being wrong, poisoned, or distorted?” People want better, faster, cheaper. Some people want to leverage data into cash or a bunker in Alaska. Others, like Dr. Timnit Gebru, want their criticisms of the estimable Google to get some traction, even among those who snorkel and do deep dives.

If the scientists, engineers, and mathematicians fouled up with James Webb data, isn’t it possible that some of the big data outfits are making similar mistakes with calibration, data verification, analysis, and astounding observations?

I think the “Webby” moment is important. Marketers are not likely to worry too much.

Stephen E Arnold, October 4, 2022

Predicting the Future: For Money or Marketing?

August 22, 2022

A few days ago I was talking with some individuals who want to rely on predictive methods. These individuals had examples of 90 percent accuracy. Among the factoids offered were matching persons of interest with known bad actors, identifying CSAM in photographs, and predicting where an event would occur. Yep, 90 percent.

I did not mention counter examples.

A few moments ago, I emailed a link to the article titled “High-Frequency Trading Firms Can Easily Get to 64% Accuracy in Predicting Direction of the Next Trade, Princeton Study Finds.” The article states:

In its IPO filing in 2014, Virtu Financial said it had exactly one day of trading losses in 1,238 days. That kind of consistent profitability seems to be still the case: a new study from a team at Princeton University found that predictability in high frequency trading returns and durations is “large, systemic and pervasive”. They focused on the period from Jan. 2019 to Dec. 2020, which includes the turmoil when the coronavirus pandemic first hit the western world. With what they said was minimal algorithmic tuning, they can get to 64% accuracy for predicting the direction of the next trade over the next five seconds.

How accurate can the system referenced become? I noted this statement:

The Princeton researchers also simulated the effect that acquiring some signal on the direction of the order flow would have for the accuracy of the predictions. The idea is that knowledge could be gained by looking at order flow at different exchanges. That would boost the return predictability from 14% to 27%, and price direction accuracy from 68% to 79%.

Encouraging? Yes. A special case? Yes.

Flip the data to losses:

  1. The fail rate is 36 percent for the 2019 to 2020 data
  2. The fail rate achieved by processing data from multiple sources was 21 percent.
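The flip is simple subtraction, sketched here so the numbers can be checked. The 64 and 79 percent figures come from the quoted article, and the 90 percent figure is the claim from the conversation. The "misses per 1,000 calls" framing and the function names are my own additions for illustration.

```python
def fail_rate(accuracy_percent):
    """Return the percentage of predictions that miss."""
    return 100 - accuracy_percent


def misses_per_thousand(accuracy_percent):
    """Express the fail rate as wrong calls per 1,000 predictions."""
    return fail_rate(accuracy_percent) * 10


# Accuracy figures taken from the text above; labels are descriptive only.
accuracies = {
    "next-trade direction, baseline": 64,
    "with simulated order-flow signal": 79,
    "claimed in conversation": 90,
}

for label, accuracy in accuracies.items():
    print(f"{label}: {fail_rate(accuracy)} percent wrong, "
          f"{misses_per_thousand(accuracy)} misses per 1,000 calls")
```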

But 90 percent? Not yet.

What happens if one tries to use synthetic data to predict what an individual in a statistically defined cluster wants?

Yeah. Not there yet with amped up Bayesian methods and marketing collateral. Have these Princeton researchers linked with a high frequency trading outfit yet? Good PR generates opportunities in my experience.

Stephen E Arnold, August 22, 2022
