Common Sense from an AI-Centric Outfit: How Refreshing
July 11, 2024
This essay is the work of a dumb dinobaby. No smart software required.
In the wild and wonderful world of smart software, common sense is often tucked beneath a stack of PowerPoint decks and vaporized by jargon-spouting experts in artificial intelligence. I want to highlight “Interview: Nvidia on AI Workloads and Their Impacts on Data Storage.” An Nvidia poohbah named Charlie Boyle output some information that is often ignored by quite a few of those riding the AI pony to the pot of gold at the end of the AI rainbow.
The King Arthur of senior executives is confident that in his domain he is the master of his information. By the way, this person has an MBA, a law degree, and a CPA certification. His name is Sir Walter Mitty of Dorksford, near Swindon. Thanks, MSFT Copilot. Good enough.
Here’s the pivotal statement in the interview:
… a big part of AI for enterprise is understanding the data you have.
Yes, the dwellers in carpetland typically operate with some King Arthur type myths galloping around the castle walls; specifically:
Myth 1: We have excellent data
Myth 2: We have a great deal of data and more arriving every minute our systems are online
Myth 3: Our data are available in just a few formats. Processing the information is going to be pretty easy.
Myth 4: Our IT team can handle most of the data work. We may not need any outside assistance for our AI project.
Will companies map these myths to their reality? Nope.
The Nvidia expert points out:
…there’s a ton of ready-made AI applications that you just need to add your data to.
“Ready made”: Just like a Betty Crocker cake mix my grandmother thought tasted fake, not as good as homemade. Granny’s comment could be applied to some of the AI tests my team has tracked; for example, the Big Apple’s chatbot outputting comments which violated city laws or the exciting McDonald’s smart ordering system. Sure, I like bacon on my on-again, off-again soft serve frozen dessert. Doesn’t everyone?
The Nvidia expert offers this comment about storage:
If it’s a large model you’re training from scratch you need very fast storage because a lot of the way AI training works is they all hit the same file at the same time because everything’s done in parallel. That requires very fast storage, very fast retrieval.
Is that a problem? Nope. Just crank up the cloud options. No big deal, except it is. There are costs and time to consider. But otherwise this is no big deal.
The article contains one gem and then wanders into marketing “don’t worry” territory.
From my point of view, the data issue is the big deal. Bad, stale, incomplete information and information in oddball formats — these exist in organizations now. Forty percent or more of the mass of data may never have been accessed. Other data are backups which contain versions of files with errors, copyright protected data, and Boy Scout trip plans. (Yep, non work information on “work” systems.)
Net net: The data issue is an important one to consider before getting into the “let’s deploy a customer support smart chatbot” phase. Will carpetland dwellers focus on the first step? Not too often. That’s why some AI projects get lost or just succumb to rising, uncontrollable costs. Moving data? No problem. Bad data? No problem. Useful AI system? Hmmm. How much does storage cost anyway? Oh, not much.
Stephen E Arnold, July 11, 2024
Mastercard and Customer Information: A Lone Ranger?
October 26, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
In my lectures, I often include a pointer to sites selling personal data. Earlier this month, I explained that the clever founder of Frank Financial acquired email information about high school students from two off-the-radar data brokers. These data were mixed with “real” high school student email addresses to provide a frothy soup of more than a million email addresses. These looked okay. The synthetic information was “good enough” to cause JPMorgan Chase to output a bundle of money to the alleged entrepreneur.
A fisherman chasing a slippery eel named Trust. Thanks, MidJourney. You do have a knack for recycling Godzilla art, don’t you?
I thought about JPMorgan Chase when I read “Mastercard Should Stop Selling Our Data.” The article makes clear that Mastercard sells its customers’ (users’?) data. Mastercard is a financial institution. JPMC is a financial institution. One sells information; the other gets snookered by data. I assume that’s the yin and yang of doing business in the US.
The larger question is, “Are financial institutions operating in a manner harmful to themselves (JPMC) and harmful to others (the Mastercard customers, or users, whose personal data are sold)?” My hunch is that today I am living in an “anything goes” environment. Would the Great Gatsby be even greater today? Why not own Long Island and its railroad? That sounds like a plan similar to those of high fliers, doesn’t it?
The cited article has a bias. The Electronic Frontier Foundation is allegedly looking out for me. I suppose that’s a good thing. The article aims to convince me; for example:
the company’s position as a global payments technology company affords it “access to enormous amounts of information derived from the financial lives of millions, and its monetization strategies tell a broader story of the data economy that’s gone too far.” Knowing where you shop, just by itself, can reveal a lot about who you are. Mastercard takes this a step further, as U.S. PIRG reported, by analyzing the amount and frequency of transactions, plus the location, date, and time to create categories of cardholders and make inferences about what type of shopper you may be. In some cases, this means predicting who’s a “big spender” or which cardholders Mastercard thinks will be “high-value”—predictions used to target certain people and encourage them to spend more money.
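To make the quoted description concrete, here is a toy sketch (in Python with pandas) of the kind of inference the EFF describes: grouping transactions by cardholder and labeling the result. The column names, thresholds, and segmentation rule are invented for illustration; they are not Mastercard’s actual pipeline.

```python
# Toy illustration only: NOT Mastercard's method, just a sketch of how
# amount, frequency, and location can be turned into shopper categories.
import pandas as pd

transactions = pd.DataFrame({
    "cardholder_id": [1, 1, 1, 2, 2, 3],
    "amount":        [420.0, 310.0, 95.0, 12.5, 8.0, 60.0],
    "merchant_zip":  ["10001", "10001", "94105", "40205", "40205", "60601"],
})

profile = transactions.groupby("cardholder_id").agg(
    total_spend=("amount", "sum"),                   # how much
    txn_count=("amount", "size"),                    # how often
    distinct_locations=("merchant_zip", "nunique"),  # where
)

# Crude segmentation: an arbitrary spend cutoff marks the "big spender".
profile["segment"] = profile["total_spend"].apply(
    lambda s: "big spender" if s > 500 else "everyday"
)
print(profile)
```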
Are outfits like Chase Visa selling their customer (user) data? (Yep, that is the same JPMC whose eagle-eyed acquisitions team could not identify synthetic data and which enables some Amazon credit card activities.) Also, what about men-in-the-middle like Amazon? The data from its much-loved online shopping, book store, and content brokering service might be valuable to some parties, I surmise. How much would information about an Amazon customer who purchased item X (a 3D printer) and Kindle books about firearm-related topics be worth?
The EFF article uses a word which gives me the willies: Trust. For a time, when I was working in different government agencies, the phrase “trust but verify” was in wide use. Am I able to trust the EFF and its interpretation of findings from a unit of the Public Interest Network? Am I able to trust a report about data brokering? Am I able to trust an outfit like JPMC?
My thought is that if JPMC itself can be fooled by a 31-year-old and a specious online app, “trust” is not the word I can associate with any entity’s action in today’s business environment.
This dinobaby is definitely glad to be old.
Stephen E Arnold, October 26, 2023
Why Some Outputs from Smart Software Are Wonky
July 26, 2021
Some models work like a champ. Utility rate models are reasonably reliable. When it is hot, use of electricity goes up. Rates are then “adjusted.” Perfect. Other models are less solid; for example, Bayesian systems which are not checked every hour or large neural nets which are “assumed” to be honking along like a well-ordered flight of geese. Why do I offer such Negative Ned observations? Experience, for one thing, and the nifty little concepts tossed out by Ben Kuhn, a Twitter persona. You can locate this string of observations at this link. Well, you could as of July 26, 2021, at 6:30 am US Eastern time. Here’s a selection of what are apparently the highlights of Mr. Kuhn’s conversation with “a former roommate.” That’s provenance enough for me.
Item One:
Most big number theory results are apparently 50-100 page papers where deeply understanding them is ~as hard as a semester-long course. Because of this, ~nobody has time to understand all the results they use—instead they “black-box” many of them without deeply understanding.
Could this be true? How could newly minted “become an expert with our $40 online course” professionals, who use models packaged in downloadable and easy-to-plug-in modules, be unfamiliar with the inner workings of said bundles of brilliance? Impossible? Really?
Item Two:
A lot of number theory is figuring out how to stitch together many different such black boxes to get some new big result. Roommate described this as “flailing around” but also highly effective and endorsed my analogy to copy-pasting code from many different Stack Overflow answers.
Oh, come on. Flailing around. Do developers flail, or do they “trust” the outfits who pretend to know how some multi-layered systems work? Fiddling with assumptions, thresholds, and (close your ears) the data themselves is never, ever a way to work around a glitch.
Item Three:
Roommate told a story of using a technique to calculate a number and having a high-powered prof go “wow, I didn’t know you could actually do that”
No kidding? That’s impossible in general, and that expression would never be uttered at Amazon-, Facebook-, and Google-type operations, would it?
Will Mr. Kuhn be banned for heresy? [Keep in mind how Wikipedia defines the term: “any belief or theory that is strongly at variance with established beliefs or customs, in particular the accepted beliefs of a church or religious organization.”] In an earlier era, repeating such an idea even once would warrant a close encounter with an Iron Maiden or a pile of firewood. Probably not today. Someone might emit a slightly critical tweet, however.
Stephen E Arnold, July 26, 2021
The Ultimate Private Public Partnership?
October 7, 2020
It looks as though the line between the US government and Silicon Valley is being blurred into oblivion. That is the message we get as we delve into Unlimited Hangout’s report, “New Pentagon-Google Partnership Suggests AI Will Soon Be Used to Diagnose Covid-19.” Writer Whitney Webb begins by examining evidence that a joint project between the Pentagon’s young Defense Innovation Unit (DIU) and Google Cloud is poised to expand from predicting cancer cases to also forecasting the spread of COVID-19. See the involved write-up for that evidence, but we are more interested in Webb’s further conclusion—that the US military and intelligence agencies and big tech companies like Google, Amazon, Microsoft, and others are nigh inseparable. Many of their decision makers are the same, their projects do as much for companies’ bottom lines as for the public good, and they are swimming in the same pools of (citizens’) data. We learn:
“NSCAI [National Security Commission on Artificial Intelligence] unites the US intelligence community and the military, which is already collaborating on AI initiatives via the Joint Artificial Intelligence Center and Silicon Valley companies. Notably, many of those Silicon Valley companies—like Google, for instance—are not only contractors to US intelligence, the military, or both but were initially created with funding from the CIA’s In-Q-Tel, which also has a considerable presence on the NSCAI. Thus, while the line between Silicon Valley and the US national-security state has always been murky, now that line is essentially nonexistent as entities like the NSCAI, DIB [Defense Innovation Board], and DIU, among several others, clearly show. Whereas China, as Robert Work noted, has the ‘civil-military fusion’ model at its disposal, the NSCAI and the US government respond to that model by further fusing the US technology industry with the national-security state.”
Recent moves in this arena involve healthcare-related projects. They are billed as helping citizens stay healthy, and that is a welcome benefit, but there is much more to it. The key asset here, of course, is all that tasty data—real-world medical information that can be used to train and refine valuable AI algorithms. Webb writes:
“Thus, the implementation of the Predictive Health program is expected to amass troves upon troves of medical data that offer both the DIU and its partners in Silicon Valley the ‘rare opportunity’ for training new, improved AI models that can then be marketed commercially.”
Do we really want private companies generating profit from public data?
Cynthia Murrell, October 7, 2020
Google: Human Data Generators
July 29, 2020
DarkCyber spotted this interesting article, which may or may not be true. But it is fascinating. The story is “Google Working on Smart Tattoos That Turn Skin into Living Touchpad.” The write up states:
Google is working on smart tattoos that, when applied to skin, will transform the human body into a living touchpad via embedded sensors. Part of Google Research, the wearable project is called “SkinMarks” that uses rub-on tattoos. The project is an effort to create the next generation of wearable technology devices…
DarkCyber believes that the research project makes it clear that Google is indeed intent on collecting personal data. Where will the tattoo be applied? On the forehead, Central American street gang fashion?
Russian prisoner style with appropriate Google iconography?
A tasteful tramp stamp approach?
The possibilities are plentiful if the report is accurate.
Stephen E Arnold, July 29, 2020
Ontotext: GraphDB Update Arrives
January 31, 2020
Semantic knowledge firm Ontotext has put out an update to its graph database, The Register announces in, “It’s Just Semantics: Bulgarian Software Dev Ontotext Squeezes Out GraphDB 9.1.” Some believe graph databases are The Answer to a persistent issue. The article explains:
“The aim of applying graph database technology to enterprise data is to try to overcome the age-old problem of accessing latent organizational knowledge; something knowledge management software once tried to address. It’s a growing thing: Industry analyst Gartner said in November the application of graph databases will ‘grow at 100 per cent annually over the next few years’. GraphDB is ranked at eighth position on DB-Engines’ list of most popular graph DBMS, where it rubs shoulders with the likes of tech giants such as Microsoft, with its Azure Cosmos DB, and Amazon’s Neptune. ‘GraphDB is very good at text analytics because any natural language is very ambiguous: a project name could be a common English word, for example. But when you understand the context and how entities are connected, you can use these graph models to disambiguate the meaning,’ [GraphDB product manager Vassil] Momtchev said.”
The primary feature of this update is support for the Shapes Constraint Language, or SHACL, which the World Wide Web Consortium recommends for validating data graphs against a set of conditions. This support lets the application validate data against the schema whenever new data is loaded to the database instead of having to manually run queries to check. A second enhancement allows users to track changes in current or past database transactions. Finally, the database now supports network authentication protocol Kerberos, eliminating the need to store passwords on client computers.
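For readers who want to see what a SHACL check actually looks like, here is a minimal sketch. It uses the open source rdflib and pyshacl Python libraries rather than GraphDB itself, and the shape, namespaces, and sample data are invented for the example; per the article, GraphDB 9.1 runs this kind of validation automatically when data is loaded.

```python
# Minimal SHACL validation sketch using rdflib + pyshacl (not GraphDB).
from rdflib import Graph
from pyshacl import validate

# A shape: every ex:Project must have at least one string-valued ex:name.
shapes = Graph().parse(data="""
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

ex:ProjectShape a sh:NodeShape ;
    sh:targetClass ex:Project ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
        sh:datatype xsd:string ;
    ] .
""", format="turtle")

# Incoming data that violates the shape: the project has no ex:name.
data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:apollo a ex:Project .
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)   # False: this load would fail validation
print(report)     # human-readable explanation of the violation
```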
Cynthia Murrell, January 31, 2020
Data Are a Problem? And the Solution Is?
January 8, 2020
I attended a conference about managing data last year. I sat in six sessions and listened as enthusiastic people explained that in order to tap the value of data, one has to have a process. Okay? A process is good.
Then in each of the sessions, the speakers explained the problem and outlined that knowing about the data and then putting it in a system is the way to derive value.
Neither Pros Nor Cons: Just Consulting Talk
This morning I read an article called “The Pros and Cons of Data Integration Architectures.” The write up concludes with this statement:
Much of the data owned and stored by businesses and government departments alike is constrained by the silos it’s stuck in, many of which have been built over the years as organizations grow. When you consider the consolidation of both legacy and new IT systems, the number of these data silos only increases. What’s more, the impact of this is significant. It has been widely reported that up to 80 per cent of a data scientist’s time is spent on collecting, labeling, cleaning and organizing data in order to get it into a usable form for analysis.
Now this is mostly true. However, the 80 percent figure is not backed up. An IDG expert whipped up some percentages about data and time, and these, I suspect, have become part of the received wisdom of those struggling with silos for decades. Most of a data scientist’s time is frittered away in meetings, struggling with budgets and other resources, and figuring out what data are “good” and what to do with the data identified by person or machine as “bad.”
The source of this statement is MarkLogic, a privately held company founded in 2001 and a magnet for $173 million from funding sources. That works out to an 18-year-young “start up” if DarkCyber adopts a Silicon Valley T shirt.
A modern silo is made of metal and impervious to some pests and most types of weather.
One question the write up raises is, “After 18 years, why hasn’t the methodology of MarkLogic swept the checker board?” But the same question can be asked of other providers’ solutions, open source solutions, and the home grown solutions creaking in some government agencies in Europe and elsewhere.
Several reasons:
- The technical solution offered by MarkLogic-type companies can “work”; however, proprietary considerations linked with the issues inherent in “silos” have caused data management solutions to become consultantized; that is, process becomes the task, not delivering on the promise of data, either dark or sunlit.
- Customers realize that the cost of dealing with the secrecy, legal, and technical problems of disparate, digital plastic trash bags of bits cannot be justified. Like odd duck knickknacks one of my failed publishers shoved into his lumber room, ignoring data is often a good solution.
- Individuals tasked with organizing data begin with gusto and quickly morph into bureaucrats who treasure meetings with consultants and companies pitching magic software and expensive wizards able to make the code mostly work.
DarkCyber recognizes that with boundaries like budgets, timetables, and measurable objectives, federation can deliver some zip.
Silos: A Moment of Reflection
The article uses the word “silo” five times. That’s the same frequency of its use in the presentations to which I listened in mid December 2019.
So you want to break down this missile silo which is hardened and protected by autonomous weapons? That’s what happens when a data scientist pokes around a pharma company’s lab notebook for a high potential new drug.
Let’s pause a moment to consider what a silo is. A silo is a tower or a pit used to store corn, wheat, or some other grain. Dust in silos can be exciting. Tip: Don’t light a match in a silo on a dry, hot day in a state where farms still operate. A silo can also be a structure used to house a ballistic missile, but one has to be a child of the Cold War to appreciate this connotation.
As applied to data, it seems that a silo is a storage device containing data. Unlike a silo used to house maize or a nuclear capable missile, the data silo contains information of value. How much value? No one knows. Are the data in a digital silo explosive? Who knows? Maybe some people should not know? Who wants to flick a Bic and poke around?
Federating Data: Easy, Hard, or Poorly Understood Until One Tries It at Scale?
March 8, 2019
I read two articles this morning.
One article explained that there’s a new way to deal with data federation. Always optimistic, I took a look at “Data-Driven Decision-Making Made Possible using a Modern Data Stack.” The revolution is to load data and then aggregate. The old way is to transform, aggregate, and model. Here’s a diagram from DAS42. A larger version is available at this link.
Hard to read. Yep, New Millennial colors. Is this a breakthrough?
I don’t know.
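For what it is worth, here is a minimal sketch of the two orderings, using Python’s built-in SQLite module as a stand-in warehouse. The table, columns, and numbers are invented; the point is only the order of operations.

```python
# Sketch: "load then aggregate" (the new way) vs. "transform/aggregate
# before loading" (the old way). SQLite stands in for the warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (region TEXT, amount REAL)")

raw_rows = [("east", 100.0), ("east", 250.0), ("west", 75.0)]

# New way: land the raw rows first, exactly as they arrive...
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_rows)

# ...then aggregate and model inside the store, after the fact.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM raw_orders GROUP BY region"
):
    print(region, total)

# Old way: aggregate in application code (or a staging job) before
# anything reaches the warehouse at all.
pre_aggregated = {}
for region, amount in raw_rows:
    pre_aggregated[region] = pre_aggregated.get(region, 0.0) + amount
print(pre_aggregated)
```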
When I read “2 Reasons a Federated Database Isn’t Such a Slam-Dunk,” it seems that the solution outlined by DAS42 and the approach favored by the InfoWorld expert are not in sync.
There are two reasons. Count ‘em.
One: performance
Two: security.
Yeah, okay.
Some may suggest that there are a handful of other challenges. These range from deciding how to index audio, video, and images to figuring out what to do with different languages in the content to determining what data are “good” for the task at hand and what data are less “useful.” Date, time, and geocode metadata are needed, but that introduces the not so easy to solve indexing problem.
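A tiny sketch of why the metadata piece is “not so easy”: three hypothetical sources emit the same timestamp field in three different shapes, and a naive normalizer has to guess. The field names and formats below are invented.

```python
# Sketch: normalizing date/time metadata from federated sources.
from datetime import datetime
from typing import Optional

records = [
    {"source": "crm",     "created": "2019-03-08"},
    {"source": "web_log", "created": "08/03/2019 14:22"},  # day-first or month-first?
    {"source": "sensor",  "created": "1552053720"},        # epoch seconds
]

def normalize(created: str) -> Optional[datetime]:
    """Try a few known formats; give up (return None) rather than guess badly."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y %H:%M"):
        try:
            return datetime.strptime(created, fmt)
        except ValueError:
            pass
    if created.isdigit():                 # assume epoch seconds
        return datetime.fromtimestamp(int(created))
    return None

for r in records:
    print(r["source"], normalize(r["created"]))
```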
So where are we with the “federation thing”?
Exactly the same place we were years ago…start ups and experts notwithstanding. But then one has to wrangle a lot of data. That’s cost, gentle reader. Big money.
Stephen E Arnold, March 8, 2019
Fragmented Data: Still a Problem?
January 28, 2019
Digital transitions are a major shift for organizations. The shift includes new technology and better ways to serve clients, but it also includes massive amounts of data. All organizations with a successful digital implementation rely on data. Too much data, however, can hinder organizations’ performance. The IT Pro Portal explains how something called mass data fragmentation has become a major issue in the article, “What Is Mass Data Fragmentation, And Why Are IT Leaders So Worried About It?”
The biggest question is: what exactly is mass data fragmentation? I learned:
“We believe one of the major culprits is a phenomenon called mass data fragmentation. This is essentially just a technical way of saying, ’data that is siloed, scattered and copied all over the place’ leading to an incomplete view of the data and an inability to extract real value from it. Most of the data in question is what’s called secondary data: data sets used for backups, archives, object stores, file shares, test and development, and analytics. Secondary data makes up the vast majority of an organization’s data (approximately 80 per cent).”
The article compares the secondary data to an iceberg: most of it is hidden beneath the surface. The poor visibility leads to compliance and vulnerability risks; in other words, security issues that put the entire organization at risk. Most organizations, however, view their secondary data as a storage bill, a compliance risk (at least that much is good), and a giant headache.
When organizations were surveyed about the amount of secondary data they have, it was discovered that they had multiple copies of the same data spread over cloud and on-premise locations. IT teams are expected to manage the secondary data across all the locations, but without the right tools and technology the task is unending, unmanageable, and the root of more problems.
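One crude way to put a number on that duplication is to fingerprint files by content hash and count the copies; a minimal sketch follows. The paths are placeholders, and a real survey would also have to walk cloud buckets, backup catalogs, and object stores.

```python
# Sketch: find duplicate copies of the same content across storage locations.
import hashlib
from collections import defaultdict
from pathlib import Path

def fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

locations = [Path("/srv/backups"), Path("/mnt/file_share")]  # placeholder mounts
copies = defaultdict(list)

for root in locations:
    if not root.exists():
        continue
    for f in root.rglob("*"):
        if f.is_file():
            copies[fingerprint(f)].append(f)

for digest, paths in copies.items():
    if len(paths) > 1:
        print(f"{len(paths)} copies of {digest[:12]}...:")
        for p in paths:
            print("   ", p)
```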
If organizations managed their mass data fragmentation efficiently, it would increase their bottom line, reduce costs, and reduce security risks. When there are more access points to sensitive data and those points are not secured, the risk of hacking and stolen information increases.
Whitney Grace, January 28, 2019
Amazon Intelligence Gets a New Data Stream
June 28, 2018
I read “Amazon’s New Blue Crew.” The idea is that Amazon can disintermediate FedEx, UPS (the outfit with the double parking brown trucks), and the US Postal Service.
On the surface, the idea makes sense. Push down delivery to small outfits. Subsidize them indirectly and directly. Reduce costs and eliminate intermediaries not directly linked to Amazon.
FedEx, UPS, and the USPS are not the most nimble outfits around. I used to get FedEx envelopes every day or two. I haven’t seen one of those for months. Shipping via UPS is a hassle. I fill out forms and have to manage odd slips of paper with arcane codes on them. The US Postal Service works well for letters, but I have noticed some returns for “addresses not found.” One was an address in the city in which I live. I put the letter in the recipient’s mailbox. That worked.
The write up reports:
The new program lets anyone run their own package delivery fleet of up to 40 vehicles with up to 100 employees. Amazon works with the entrepreneurs — referred to as “Delivery Service Partners” — and pays them to deliver packages while providing discounts on vehicles, uniforms, fuel, insurance, and more. They operate their own businesses and hire their own employees, though Amazon requires them to offer health care, paid time off, and competitive wages. Amazon said entrepreneurs can get started with as low as $10,000 and earn up to $300,000 annually in profit.
Now what’s the connection to Amazon streaming data services and the company’s intelligence efforts? Several hypotheses come to mind:
- Amazon obtains fine grained detail about purchases and delivery locations. These are data which can no longer be captured by a non-Amazon delivery service system
- The data can be cross correlated; for example, purchasers of a Kindle title can be matched with the delivery of a particular product, such as hydrogen peroxide
- Amazon’s delivery data make it possible to capture metadata about delivery time, whether a person accepted the package or it was left at the door, and other location details such as a blocked entrance, for instance.
A few people dropping off packages is not particularly useful. Scale up the service across Amazon operations in the continental states or a broader swath of territory, and the delivery service becomes a useful source of high value information.
FedEx and UPS are ripe for disruption. But so is the streaming intelligence sector. This ostensibly common sense delivery play is worth monitoring.
Stephen E Arnold, June 28, 2018