Webb Wobbles: Do Other Data Streams Stumble Around?

October 4, 2022

I read an essay identified as coming from The_Byte in Futurism, with content drawn from Nature. Confused? I am.

The title of the article is “Scientists May Have Really Screwed Up on Early James Webb Findings.” The “Webb” is not the digital construct, but the space telescope. The subtitle about the data generated from the system is:

I don’t think anybody really expected this to be as big of an issue as it’s becoming.

Space is not something I think about. Decades ago I met a fellow named Fred G., who was engaged in a study of space warfare. Then one of my colleagues, Howard F., joined my team after doing some satellite stuff with a US government agency. He didn’t volunteer any information to me, and I did not ask. Space may be the final frontier, but I liked working online from my land-based office, thank you very much.

The article raises an interesting point; to wit:

When the first batch of data dropped earlier this summer, many dived straight into analysis and putting out papers. But according to new reporting by Nature, the telescope hadn’t been fully calibrated when the data was first released, which is now sending some astronomers scrambling to see if their calculations are now obsolete. The process of going back and trying to find out what parts of the work needs to be redone has proved “thorny and annoying,” one astronomer told Nature.

The idea is that the “Webby” data may have been distorted, skewed, or output with knobs and dials set incorrectly. Not surprisingly, those who used these data to do spacey stuff may have reached unjustifiable conclusions. What about those nifty images, the news conferences, and the breathless references to the oldest, biggest, coolest images from the universe?

My thought is that the analyses, images, and scientific explanations are wrong to some degree. I hope the data are as pure as online clickstream data. No, no, strike that. I hope the data are as rock solid as mobile GPS data. No, no, strike that too. I hope the data are accurate like looking out the window to determine if it is a clear or cloudy day. Yes, narrowed scope, first hand input, and a binary conclusion.

Unfortunately in today’s world, that’s not what data wranglers do on the digital ranch.

If the “Webby” data are off kilter, my question is:

What about the data used to train smart software from some of America’s most trusted and profitable companies? Could these data cause incorrect decisions to flow from models, so that humans and downstream systems keep producing less and less reliable results?

My thought is, “Who wants to think about data being wrong, poisoned, or distorted?” People want better, faster, cheaper. Some people want to leverage data into cash or a bunker in Alaska. Others, like Dr. Timnit Gebru, want their criticisms of the estimable Google to get some traction, even among those who snorkel and do deep dives.

If the scientists, engineers, and mathematicians fouled up with James Webb data, isn’t it possible that some of the big data outfits are making similar mistakes with calibration, data verification, analysis, and astounding observations?

I think the “Webby” moment is important. Marketers are not likely to worry too much.

Stephen E Arnold, October 4, 2022

Predicting the Future: For Money or Marketing?

August 22, 2022

A few days ago I was talking with some individuals who want to rely on predictive methods. These individuals had examples of 90 percent accuracy. Among the factoids offered were matching persons of interest with known bad actors, identifying CSAM in photographs, and predicting where an event would occur. Yep, 90 percent.

I did not mention counter examples.

A few moments ago, I emailed a link to the article titled “High-Frequency Trading Firms Can Easily Get to 64% Accuracy in Predicting Direction of the Next Trade, Princeton Study Finds.” The article states:

In its IPO filing in 2014, Virtu Financial said it had exactly one day of trading losses in 1,238 days. That kind of consistent profitability seems to be still the case: a new study from a team at Princeton University found that predictability in high frequency trading returns and durations is “large, systemic and pervasive”. They focused on the period from Jan. 2019 to Dec. 2020, which includes the turmoil when the coronavirus pandemic first hit the western world. With what they said was minimal algorithmic tuning, they can get to 64% accuracy for predicting the direction of the next trade over the next five seconds.

How accurate can the system referenced become? I noted this statement:

The Princeton researchers also simulated the effect that acquiring some signal on the direction of the order flow would have for the accuracy of the predictions. The idea is that knowledge could be gained by looking at order flow at different exchanges. That would boost the return predictability from 14% to 27%, and price direction accuracy from 68% to 79%.

Encouraging? Yes. A special case? Yes.

Flip the data to losses:

  1. The fail rate for the baseline prediction (64 percent accuracy) is 36 percent.
  2. The fail rate after simulating an order flow signal across exchanges (79 percent accuracy) is 21 percent. (Quick arithmetic below.)
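For those who want the conversion spelled out, here is a trivial sketch in Python that simply turns the article’s accuracy figures into the fail rates above.

```python
# Convert the accuracy figures cited in the article into fail rates.
for label, accuracy in [("baseline", 0.64), ("with simulated order-flow signal", 0.79)]:
    print(f"{label}: fail rate {1 - accuracy:.0%}")   # prints 36% and 21%
```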

But 90 percent? Not yet.

What happens if one tries to use synthetic data to predict what an individual in a statistically defined cluster wants?

Yeah. Not there yet with amped up Bayesian methods and marketing collateral. Have these Princeton researchers linked with a high frequency trading outfit yet? Good PR generates opportunities in my experience.

Stephen E Arnold, August 22, 2022

Bayes and the Good Old Human Input

July 25, 2022

Curious about one of the fave mathematical methods for smart software? You may want to take a look at the online version of Bayes Rules! An Introduction to Applied Bayesian Modeling. A hard copy is available via this link. The book includes helpful explanations and examples. Topics often ignored by other authors get useful coverage; for example, the Metropolis-Hastings algorithm. I liked the chapter about non-normal hierarchical regression. The method promises some interesting surprises when applied to less-than-pristine data sets; for example, cryptic messages on a private Telegram channel or coded content from a private Facebook group. Good work and recommended by the Arnold IT team.
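For readers who have not met the method, here is a minimal random-walk Metropolis-Hastings sketch in Python. It is not taken from the book; the coin-flip target, the proposal width, and the sample counts are my own illustrative choices.

```python
import numpy as np

def log_target(theta):
    # Unnormalized log posterior for a coin's bias after 7 heads in 10 flips (flat prior).
    if not 0 < theta < 1:
        return -np.inf
    return 7 * np.log(theta) + 3 * np.log(1 - theta)

rng = np.random.default_rng(42)
theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.1)          # random-walk proposal
    # Accept with probability min(1, target(proposal) / target(theta)).
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta = proposal
    samples.append(theta)

print(f"Posterior mean of theta: {np.mean(samples[2_000:]):.2f}")   # close to 8/12, i.e., 0.67
```

Fancier samplers change the proposal, but the accept-or-stay logic is the core idea.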

More general knowledge of human input methods can be helpful. Some of those talking about the remarkable achievements of smart software can overlook the pivot points in numerical recipes tucked into training set formation or embedded processes.

Stephen E Arnold, July 25, 2022

IBM Smart Software and Technology: Will There Be a Double Fault?

July 9, 2022

It has been a few years since Wimbledon started using AI to engage fans and the media. The longstanding partnership between IBM and the venerable All England Lawn Tennis Club captured the Best Fan Engagement by a Brand trophy at the 2022 Sports Technology Awards. The “IBM Power Index with Watson,” “IBM Match Insights with Watson,” and “Personalized Recommendations and Highlights Reels” were the winning entries. Maybe Watson has finally found its niche. We learn what changes are in store this season in the company’s press release, “IBM Reveals New AI and Cloud Powered Fan Experiences for Wimbledon 2022.” The write-up specifies:

“New features for 2022 include:

* ‘Win Factors’ brings enhanced explainability to ‘Match Insights’: Building on the existing Match Insights feature of the Wimbledon app and Wimbledon.com, IBM is providing an additional level of explainability into what factors are being analyzed by the AI system to determine match insights and predictions. Win Factors will provide fans with an increased understanding of the elements affecting player performance, such as the IBM Power Index, court surface, ATP/WTA rankings, head-to-head, ratio of games won, net of sets won, recent performance, yearly success, and media punditry.

* ‘Have Your Say’ with a new interactive fan predictions feature: For the first time, users can register their own predictions for match outcomes on the Wimbledon app and Wimbledon.com, through the Have Your Say feature. They can then compare their prediction with the aggregated predictions of other fans and the AI-powered Likelihood to Win predictions generated by IBM.”

The “digital fan experiences” use a combination of on-premises and cloud systems. Developers have trained the machine-learning models on data from prior matches using Watson Studio and Watson Discovery. See the press release for more specifics on each feature.
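As a thought experiment only (IBM has not published the Watson model internals here), factors like those listed above could be folded into a “Likelihood to Win” style score with something as simple as a logistic function. Every feature name, value, and weight below is made up.

```python
import math

def likelihood_to_win(features, weights, bias=0.0):
    """Squash a weighted sum of win factors into a 0-1 probability."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))

# Hypothetical feature gaps between two players and equally hypothetical weights.
features = {"power_index_gap": 0.8, "head_to_head": 0.3, "grass_win_ratio_gap": 0.5}
weights = {"power_index_gap": 1.2, "head_to_head": 0.6, "grass_win_ratio_gap": 0.9}
print(f"Likelihood to win: {likelihood_to_win(features, weights):.0%}")   # about 83% here
```

Presumably the “Win Factors” explainability layer amounts to exposing which weighted inputs moved a number like this one.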

Cynthia Murrell, July 9, 2022

Differences between Amateur and Pro Analysts: A Sci-Fi Adventure

July 5, 2022

I read “One of the Most Prominent Crypto Hedge Funds Just Defaulted on a $670 Million Loan.” I also read some of the reports about the company. You can refresh your understanding of “real” analysts at work. Try this link even though the main Three Arrows’ site is throwing 404s.

I then read “10 Differences between Amateurs and Professional Analysts.” (You may have to spit up an email or pay to read this estimable essay about differences in data wrangling pony riders.) I considered each of the points of differentiation. Here are three, but you will have to consult the original article yourself to be further enlightened.

  1. Handling lots of data. Yeah, let’s ask Dr. Timnit Gebru about that. My experience is that those better at analytics can make those data perform like trained ponies at the Barnum & Bailey Circus.
  2. Immunity to data science bias. Yeah, let us check out how the AI demos respond to requests for certain topics. Try Crungus on DALL-E. Working good, right?
  3. Refusing to be a data charlatan. And Three Arrows? Just an anomaly, perhaps?

Net net: No difference unless measured in ångströms, plus a happy ignorance of poisoned data when sucking down alternative information. What could go wrong? Answer: Three Arrows.

Stephen E Arnold, July 5, 2022

Spicing Up Possibly Biased Algorithms with Wiener Math

June 27, 2022

Let’s assume that the model described in “The Mathematics of Human Behavior: How My New Model Can Spot Liars and Counter Disinformation” is excellent. Let’s further assume that it generates “reliable” outputs which correspond to what humanoids do in real life. A final building block is to use additional predictive analytics to process the outputs of the Wiener-esque model and pipe them into an online advertising system like Apple’s, Facebook’s, Google’s, or TikTok’s.

This sounds like a useful thought experiment.

Consider this statement from the cited article:

In this new “information-based” approach, the behavior of a person – or group of people – over time is deduced by modeling the flow of information. So, for example, it is possible to ask what will happen to an election result (the likelihood of a percentage swing) if there is “fake news” of a given magnitude and frequency in circulation. But perhaps most unexpected are the deep insights we can glean into the human decision-making process. We now understand, for instance, that one of the key traits of the Bayes updating is that every alternative, whether it is the right one or not, can strongly influence the way we behave.

These statements suggest that the outputs can be used for different use cases.
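The article’s information-based model is more sophisticated than this, but a toy Bayes update makes the quoted point concrete: a signal of even modest reliability, repeated, drags a belief a long way. The prior, likelihoods, and repetition count below are invented for illustration.

```python
def bayes_update(prior, p_signal_if_true, p_signal_if_false):
    """One Bayes update of the belief that a claim is true, given one observed signal."""
    numerator = p_signal_if_true * prior
    return numerator / (numerator + p_signal_if_false * (1 - prior))

belief = 0.50                     # prior belief that candidate A will win
for _ in range(5):                # five exposures to a "fake news" item leaning the other way
    belief = bayes_update(belief, p_signal_if_true=0.4, p_signal_if_false=0.6)
print(f"Belief after five exposures: {belief:.2f}")   # drifts from 0.50 down to about 0.12
```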

Now how will this new model affect online advertising, and, in a larger context, how will the model allow humanoid thoughts and actions to be shaped or weaponized? My initial ideas are:

  1. Feedback signals about content which does not advance an agenda. The idea is that the “flagged” content object is never available to an online user. Is this a more effective form of filtering? I think dynamic pre-filtering is a winner for some.
  2. Filtered content can be weaponized to advance a particular line of thought. The metaphor is that a protective mother does not allow the golden child to play outside at dusk without appropriate supervision. The golden child gleams in the gloaming and learns to avoid risky behaviors unless an appropriate guardian (maybe a Musk Optimus) is shadowing the golden child.
  3. Ads can be matched against what the Amazon, Apple, Facebook, Google, and TikTok systems have identified as appropriate. The resulting ads generated by combining the proprietary methods with those described in the write up increase the close rate by a positive amount.
  4. Use cases for law enforcement exist as well.

Exciting opportunities abound. Once again, I am glad I am old. Were he alive, Norbert Wiener might share my “glad I am old” notion when confronted with applied Wiener math.

Stephen E Arnold, June 27, 2022

DarkCyber, March 29, 2022: An Interview with Chris Westphal, DataWalk

March 29, 2022

Chris Westphal is the Chief Analytics Officer of DataWalk, a firm providing an investigative and analysis tool to commercial and government organizations. The 12-minute interview covers DataWalk’s unique capabilities, its data and information resources, and the firm’s workflow functionality. The video can be viewed on YouTube at this location.

Stephen E Arnold, March 29, 2022

American Airlines Scores Points on the Guy

January 24, 2022

I read “American Airlines Suing the Points Guy Over App That Synchs Frequent Flyer Data.” I have tried to avoid flying. Too many hassles for my assorted Real ID cards, my US government super ID, and passengers who won’t follow rules, as wonky as some may be.

The write up focuses on a travel tips site which “lets users track airline miles from multiple airlines in one place.” The article includes some interesting information; for example, consider this statement in the write up:

“Consumers are always in control of their own data on The Points Guy App — they decide which loyalty programs and credit cards are accessible for the purpose of making their points-and-miles journey easier,” The Points Guy founder Brian Kelly said in a statement emailed to The Verge. The site is “choosing to fight back against American Airlines on behalf of travelers to protect their rights to access their points and miles so they can travel smarter,” he added.

The write up includes a legal document in one of those helpful formats which make access on a mobile device pretty challenging for a 77-year-old.

As wonderful as the write up is, I noticed one point (not the Guy’s) I would have surfaced; namely, “Why is it okay for big companies to federate and data mine user information but not okay for an individual consumer/customer?”

The reason? We are running the show. Get with it or get off and lose your points. Got that, Points Guy?

Stephen E Arnold, January 24, 2022

New Search Platform Focuses on Protecting Intellectual Property

January 21, 2022

Here is a startup offering a new search engine, now in beta. Huski uses AI to help companies big and small reveal anyone infringing on their intellectual property, be it text or images. It also promises solutions for title optimization and even legal counsel. The platform was developed by a team of startup engineers and intellectual property litigation pros who say they want to support innovative businesses from the planning stage through protection and monitoring. The Technology page describes how the platform works:

“* Image Recognition: Our deep learning-based image recognition algorithm scans millions of product listings online to quickly and accurately find potentially infringing listings with images containing the protected product.

* Natural Language Processing: Our machine learning algorithm detects infringements based on listing information such as price, product description, and customer reviews, while simultaneously improving its accuracy based on patterns it finds among confirmed infringements.

* Largest Knowledge Graph in the Field: Our knowledge graph connects entities such as products, trademarks, and lawsuits in an expansive network. Our AI systems gather data across the web 24/7 so that you can easily base decisions on the most up-to-date information.

* AI-Powered Smart Insights: What does it mean to your brands and listings when a new trademark pops out? How about when a new infringement case pops out? We’ll help you discover the related insights that you may never know otherwise.

* Big Data: All of the above intelligence is being derived from the data universe of the eCommerce, intellectual property, and trademark litigation. Our data engine is the biggest ‘black hole’ in that universe.”
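None of this is Huski’s actual code, of course, but the text side of the idea can be sketched in a few lines: score marketplace listings against a protected product description and flag anything above a similarity threshold. The listings, the threshold, and the Jaccard measure are all stand-ins for whatever the real system uses.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

protected = "Huski brand insulated stainless steel travel mug 16 oz"
listings = {
    "listing_1": "Insulated stainless steel travel mug 16 oz by Huski brand",
    "listing_2": "Ceramic coffee cup with floral print",
}
flagged = {lid: round(jaccard(protected, text), 2)
           for lid, text in listings.items() if jaccard(protected, text) > 0.6}
print(flagged)   # {'listing_1': 0.9}; the ceramic cup does not clear the threshold
```

A production system would lean on image hashes, embeddings, and the knowledge graph described above, but the flag-above-a-threshold decision has the same shape.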

Founder Guan Wang and his team promise a lot here, but only time will tell if they can back it up. Launched in the challenging year of 2020, Huski.ai is based in Silicon Valley but it looks like it does much of its work online. The niche is not without competition, however. Perhaps a Huski will cause the competition to run away?

Cynthia Murrell, January 21, 2022

Palantir at the Intersection of Extremists and Prescription Fraud

January 5, 2022

Blogger Ron Chapman II, ESQ, seems to be quite the fan of Palantir Technologies. We get that impression from his post, “Palantir’s Anti-Terror Tech Used to Fight RX Fraud.” The former Marine fell in love with the company’s tech in Afghanistan, where its analysis of terrorist attack patterns proved effective. We especially enjoyed the rah rah write-up’s line about Palantir’s “success on the battlefield.” Chapman is not the only one enthused about the government-agency darling.

As for Palantir’s move into detecting prescription fraud, we learn the company begins with open-source data from the likes of census data, public and private studies, and Medicare’s Meaningful Use program. Chapman describes the firm’s methodology:

“Palantir then cross-references varying sets of Medicare data to determine which providers statistically deviate from the norm amongst large data sets. For instance, Palantir can analyze prescription data to determine which providers rank the highest in opiate prescribing for a local area. Palantir can then cross-reference those claims against patient location data to determine if the providers’ patients are traveling long distances for opiates. Palantir can further analyze the data to determine if the patient population of a provider has been previously treated by a physician on the Office of Inspector General exclusion database (due to prior misconduct) which would indicate that the patients are not ‘legitimate.’ By using ‘big data’ to determine which providers deviate from statistical trends, Palantir can provide a more accurate basis for a payment audit, generate probable cause for search warrants, or encourage a federal grand jury to further investigate a provider’s activities. After the government obtains additional provider-specific data, Palantir can analyze specific patient files, cell phone data, email correspondence, and electronic discovery. Investigators can review cell phone data and email correspondence to determine if networks exist between providers and patients and determine the existence of a healthcare fraud conspiracy or patient brokering.”
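A crude sketch of the “deviates from the norm” step might look like the following. It is not Palantir’s pipeline; the provider IDs, rates, and threshold are fabricated for illustration.

```python
from statistics import mean, stdev

def flag_outliers(prescribing_rates, threshold=1.5):
    """Return provider IDs whose opiate prescribing z-score exceeds the threshold."""
    values = list(prescribing_rates.values())
    mu, sigma = mean(values), stdev(values)
    return [pid for pid, rate in prescribing_rates.items()
            if sigma and (rate - mu) / sigma > threshold]

# Hypothetical opiate scripts per patient visit for five providers in one local area.
rates = {"provider_a": 0.12, "provider_b": 0.15, "provider_c": 0.11,
         "provider_d": 0.58, "provider_e": 0.14}
print(flag_outliers(rates))   # ['provider_d'] stands out on this toy data
```

In the quoted workflow, a hit like this would only be a starting point for cross-referencing travel distances, exclusion lists, and other records.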

Despite his fondness for Palantir, Chapman does include the obligatory passage on privacy and transparency concerns. He notes that healthcare providers, specifically, are concerned about undue scrutiny should their patient care decisions somehow diverge from a statistical norm. A valid consideration. As with law enforcement, the balance between the good of society and individual rights is a tricky one. Palantir was launched in 2003 by Peter Thiel, who was also a cofounder of PayPal and is a notorious figure to some. The company is based in Denver, Colorado.

Cynthia Murrell, January 5, 2022
