Synthetic Data: Yes, They Are a Thing

March 13, 2023

“Real” data (that is, data generated by humans) are expensive to capture, normalize, and manipulate. But those “real” data are important. Unfortunately, some companies have sucked up real data and integrated those items into products and services. Now regulators are awakening from a decades-long slumber and taking a look into the actions of certain data companies. More importantly, a few big data outfits are aware of [a] the costs and [b] the risks of real data.

Enter synthetic data.

If you are unfamiliar with the idea, navigate to “What is Synthetic Data? The Good, the Bad, and the Ugly.” The article states:

The privacy engineering community can help practitioners and stakeholders identify the use cases where synthetic data can be used safely, perhaps even in a semi-automated way. At the very least, the research community can provide actionable guidelines to understand the distributions, types of data, tasks, etc. where we could achieve reasonable privacy-utility tradeoffs via synthetic data produced by generative models.

Helpful, correct?

The article does not point out two things which I find of interest.

First, the amount of money a company can earn by operating efficient synthetic data factories is likely to be substantial. Like other digital products, the upside can be profitable and give the “owner” of the synthetic data market an IBM-type of old-school lock in.

Second, synthetic data can be weaponized intentionally, via either data poisoning or algorithm shaping.
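
For readers who want to see what data poisoning looks like in miniature, here is a toy sketch. It is not taken from the article. The classifier (a one-dimensional nearest-centroid rule), the labels, and every number are invented for illustration. The point is only that a few mislabeled synthetic records can move a decision boundary.

```python
# Toy illustration of data poisoning. Everything here is hypothetical:
# the classifier, the labels, and the numbers are invented for the example.

def centroid(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def train(samples):
    """samples: list of (value, label) pairs with labels 'low' or 'high'.
    Returns the threshold halfway between the two class centroids."""
    low = [v for v, label in samples if label == "low"]
    high = [v for v, label in samples if label == "high"]
    return (centroid(low) + centroid(high)) / 2

def predict(threshold, value):
    """Classify a value against the learned threshold."""
    return "high" if value > threshold else "low"

# Clean training data.
clean = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
         (7.0, "high"), (8.0, "high"), (9.0, "high")]

# Poisoned synthetic records: large values deliberately labeled "low".
poison = [(9.0, "low"), (10.0, "low"), (11.0, "low")]

clean_threshold = train(clean)
poisoned_threshold = train(clean + poison)

print(clean_threshold, predict(clean_threshold, 6.5))        # 5.0 high
print(poisoned_threshold, predict(poisoned_threshold, 6.5))  # 7.0 low
```

Three bad records out of nine flip the answer for the value 6.5. Real systems are more complicated, but the mechanism is the same.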

I just wanted to point out that a useful essay does not explore what may be two important attributes of synthetic data. Will regulators rise to the occasion? Unlikely.

Stephen E Arnold, March 13, 2023

Amazon Data Sets

February 21, 2023

Do you want to obtain data sets for analysis or making smart software even more crafty? Navigate to the AWS Marketplace. This Web page makes it easy to search through the more than 350 data products on offer. There is a Pricing Model check box. Click it if you want to see the no-cost data sets. There are some interesting options in the left side Refine Results area. For example, there are 366 open data licenses available. I find this interesting because when I examined the page, there were 362 data products. What are the missing four? I noted that there are 2,340 “standard data subscription agreements.” Again the difference between the 366 on offer and the 2,340 is interesting. A more comprehensive listing of data sources appears in the PrivacyRights’ listing. With some sleuthing, you may be able to identify other, lower profile ways to obtain data too. I am not willing to add some color about these sources in this free blog post.

Stephen E Arnold, February 21, 2023

Datasette: Useful Tool for Crime Analysts

February 15, 2023

If you want to explore data sets, you may want to take a look at the “open source multi-tool for exploring and publishing data.” The Datasette Swiss Army knife “is a tool for exploring and publishing data.”

The company says,

It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of 42 tools and 110 plugins dedicated to making working with structured data as productive as possible.

A handful of demos are available. Worth a look.
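
To get a feel for the workflow, the sketch below builds a small SQLite file that Datasette could then serve. The file name, the table, and the rows are hypothetical sample data, and the script uses only Python's standard library. The Datasette step itself is shown as a shell command in the comments and assumes Datasette is installed.

```python
import os
import sqlite3
import tempfile

# Hypothetical database file and table; the rows are invented sample data.
DB_PATH = os.path.join(tempfile.gettempdir(), "incidents.db")

rows = [
    ("Louisville", "burglary", "2023-01-04"),
    ("Lexington", "fraud", "2023-01-09"),
    ("Louisville", "vandalism", "2023-02-01"),
]

conn = sqlite3.connect(DB_PATH)
conn.execute("DROP TABLE IF EXISTS incidents")
conn.execute(
    "CREATE TABLE incidents ("
    "id INTEGER PRIMARY KEY, city TEXT, category TEXT, reported TEXT)"
)
conn.executemany(
    "INSERT INTO incidents (city, category, reported) VALUES (?, ?, ?)", rows
)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM incidents").fetchone()[0]
conn.close()

print(DB_PATH, count)

# Next step, from a shell (assumes Datasette is installed):
#   datasette <path printed above>
# Datasette should then offer a browsable view of the table and a JSON
# version of it, by default on a local port.
```

The design point is that Datasette works from an ordinary SQLite file, so any tool that can write SQLite can feed it.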

Stephen E Arnold, February 15, 2023

Summarize for a Living: Should You Consider a New Career?

February 13, 2023

In the pre-historic age of commercial databases, many people earned money by reading and summarizing articles, documents, monographs, and consultant reports. In order to prepare and fact check a 130 word summary of an article in the Harvard Business Review in 1980, the cost to the database publisher worked out to something like $17 to $25 per summary for what I would call general business information. (If you want more information about this number, write, and maybe you will get more color on the number.) Flash forward to the present: The cost for a human to summarize an article in the Harvard Business Review has increased. That’s why it is necessary to pay to access and print an abstract from a commercial online service. Even with yesterday’s technology, the costs are a killer. Now you know why software that eliminates the human, the editorial review, the fact checking, and the editorial policies which define what is indexed, why, and what terms are assigned is a joke to many of those in the search and retrieval game.

I mention this because if you are in the A&I business (abstracting and indexing), you may want to take a look at HunKimForks ChatGPT Arxiv Extension. The idea is that ChatGPT can generate an abstract which is certainly less fraught with cost and management hassles than running one of the commercial database content generation systems dependent on humans, some with degrees in chemistry, law, or medicine.

Are the summaries any good? For the last 40 years abstracts and summaries have been, in my opinion, degrading. Fact checking is out the window along with editorial policies, style guidelines, and considered discussion of index terms, classification codes, time handling and signifying, among other useful knowledge attributes.

Three observations:

  1. Commercial database publishers may want to check out this early-days open source contribution
  2. Those engaged in abstracting, writing summaries of books, and generating distillations of turgid government documents (yep, blue chip consulting firms I am thinking of you) may want to think about their future
  3. Say “hello” to increasingly inaccurate outputs from smart software. Recursion and liquid algorithms are not into factual stuff.

Stephen E Arnold, February 13, 2023

SQL Made Easy: Better Than a Human? In Some Cases

January 9, 2023

Just a short item for anyone who has to formulate Structured Query Language queries. Years ago, SQL queries were a routine for my research team. Today, the need has decreased. I have noticed that my recollection and muscle memory for SQL queries have eroded. Now there is a solution which seems to work reasonably well. Is the smart software as skilled as our precious Howard? Nope. But Howard lives in DC, and I am in rural Kentucky. Since neither of us likes email or telephones, we communicate via links to data available for download and analysis. Hey, the approach works for us. But what about SQL queries? Just navigate to TEXT2SQL.AI. Once you sign in using one of the popular privacy invasion methods, you can enter a free text statement and get a well formed SQL query. Is the service useful? It may be. The downside is the overt data collection approach.
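
To make the free-text-to-SQL idea concrete, here is a small sketch. The table, the rows, the plain-English request, and the query are all hypothetical. The query is hand written to show the kind of output such a service aims for; it is not actual output from TEXT2SQL.AI. Python's built-in sqlite3 module runs it so the result can be checked.

```python
import sqlite3

# Hypothetical table and sample rows, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, total REAL, placed TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, total, placed) VALUES (?, ?, ?)",
    [
        ("Acme", 120.0, "2022-03-01"),
        ("Acme", 80.0, "2022-07-15"),
        ("Birch", 150.0, "2022-05-20"),
        ("Birch", 500.0, "2021-12-30"),  # outside 2022, should be excluded
        ("Cedar", 40.0, "2022-11-02"),
    ],
)

# Free text request (hypothetical):
#   "Show each customer's total spending in 2022, largest first."
# A hand-written query of the sort a text-to-SQL tool might return:
query = """
SELECT customer, SUM(total) AS spent
FROM orders
WHERE placed >= '2022-01-01' AND placed < '2023-01-01'
GROUP BY customer
ORDER BY spent DESC
"""

results = conn.execute(query).fetchall()
conn.close()

for customer, spent in results:
    print(customer, spent)
```

Whatever tool writes the query, a quick run against known sample rows like these is a sensible check before trusting the output on real data.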

Stephen E Arnold, January 9, 2023

Confessions? It Is That Time of Year

December 23, 2022

Forget St. Augustine.

Big data, data science, or whatever you want to call it was the precursor to artificial intelligence. Tech people pursued careers in the field, but after the synergy and hype wore off the real work began. According to WD in his RYX,R blog post: “Goodbye, Data Science,” the work was tedious, low-value, unfulfilling, and left little room for career growth.

WD worked as a data scientist for a few years, then quit in pursuit of the higher calling as a data engineer. He will be working on the implementation of data science instead of its origins. He explained why he left in four points:

• “The work is downstream of engineering, product, and office politics, meaning the work was only often as good as the weakest link in that chain.

• Nobody knew or even cared what the difference was between good and bad data science work. Meaning you could suck at your job or be incredible at it and you’d get nearly the same regards in either case.

• The work was often very low value-add to the business (often compensating for incompetence up the management chain).

• When the work’s value-add exceeded the labor costs, it was often personally unfulfilling (e.g. tuning a parameter to make the business extra money).”

WD’s experiences sound like those of anyone who is disenchanted with their line of work. He worked with managers who would not listen when they were told stupid projects would fail. The managers were more concerned with keeping their bosses and shareholders happy. He also mentioned that engineers are inflamed with self-grandeur and scientists are bad at code. He worked with young and older data people who did not know what they were doing.

As a data engineer, WD has more free time, more autonomy, better career advancements, and will continue to learn.

Whitney Grace, December 23, 2022

The Internet: Cue the Music. Hit It, Regrets, I Have Had a Few

December 21, 2022

I have been around online for a few years. I know some folks who were involved in creating what is called “the Internet.” I watched one of these luminaries unbutton his shirt and display a tee with the message, “TCP on everything.” Cute, cute, indeed. (I had the task of introducing this individual only to watch the disrobing and the P on everything joke. Tip: It was not a joke.)

Imagine my reaction when I read “Inventor of the World Wide Web Wants Us to Reclaim Our Data from Tech Giants.” The write up states:

…in an era of growing concern over privacy, he believes it’s time for us to reclaim our personal data.

Who wants this? Tim Berners-Lee and a startup. Is this content marketing or a sincere effort to derail the core functionality of ad trackers, beacons, cookies which expire in 99 years, etc., etc.?

The article reports:

Berners-Lee hopes his platform will give control back to internet users. “I think the public has been concerned about privacy — the fact that these platforms have a huge amount of data, and they abuse it,” he says. “But I think what they’re missing sometimes is the lack of empowerment. You need to get back to a situation where you have autonomy, you have control of all your data.”

The idea is that Web 3 will deliver a different reality.

Do you remember this lyric:

Yes, there were times I’m sure you knew
When I bit off more than I could chew
But through it all, when there was doubt
I ate it up and spit it out
I faced it all and I stood tall and did it my way.

The my becomes big tech, and it is the information highway. There’s no exit, no turnaround, and no real chance of change before I log off for the final time.

Yeah, digital regrets. How’s that working out at Amazon, Facebook, Google, Twitter, and Microsoft among others? Unintended consequences and now the visionaries are standing tall on piles of money and data.

Change? Sure, right away.

Stephen E Arnold, December 21, 2022

Google Did What? Misleading Users? Google!

November 15, 2022

In the midst of an economic downturn, most businesses try to avoid: [a] bad publicity regarding a sensitive issue and [b] paying lots of cash to US states. I suppose I could add [c] buying Twitter and [d] funding the metaverse, but let’s stick to the information in “Google Will Pay $392m to 40 States in Largest Ever US Privacy Settlement.”

For a big outfit like the Google, my thought is that the negative publicity is more painful than writing checks. But advertisers are affected by the economic downturn and may be looking for ways to make sales without cutting deals with companies found guilty of user/customer surveillance.

The write up, which I assume is mostly on the money, says:

The states’ investigation was sparked by a 2018 Associated Press story, which found that Google continued to track people’s location data even after they opted out of such tracking by disabling a feature the company called “location history”.

The article points out:

It [the penalty] comes at a time of mounting unease over privacy and surveillance by tech companies that has drawn growing outrage from politicians and scrutiny by regulators.

Free services are great as long as users/customers don’t know exactly what’s happening. In the early days of the Google, there was not a generation interested in dinobaby ideas. Well, this decision suggests that some dinobabies with law degrees expect commercial enterprises to act with some sense of propriety.

The article makes clear exactly what Google did:

The attorneys general said Google misled users about its location tracking practices since at least 2014, violating state consumer protection laws. As part of the settlement, Google also agreed to make those practices more transparent to users. That includes showing them more information when they turn location account settings on and off and keeping a webpage that gives users information about the data Google collects.

Hmmm. What about targeted ads which miss their targets? Perhaps that’s an issue which will capture the attention of US attorneys general? Perhaps, but I am not optimistic. Awareness and subsequent legal processes move slowly, and slow is the friend of some firms.

Stephen E Arnold, November 15, 2022

One Tiny Point about Oracle

September 5, 2022

When silicon valley-type real news outfits “correct” one another, we tend to wonder why. In this case, it appears Gizmodo writer Matt Novak feels readers should know one key bit of information omitted by a recent Vox article: the fact that “Larry Ellison’s Oracle Started As a CIA Project.” He writes:

“Vox simply says that Oracle was founded in ‘the late 1970s’ and ‘sells a line of software products that help large and medium-sized companies manage their operations.’ All of which is true! But as the article continues, it somehow ignores the fact that Oracle has always been a significant player in the national security industry. And that its founder would not have made his billions without helping to build the tools of our modern surveillance state.”

One of those tools, of course, being the sort of database Oracle specializes in. The write-up emphasizes Ellison’s longstanding belief in a large federal database, asserting the attacks of 9/11 gave the tech tycoon the chance to push his vision. Novak quotes:

“‘The single greatest step we Americans could take to make life tougher for terrorists would be to ensure that all the information in myriad government databases was copied into a single, comprehensive national security database,’ Larry Ellison wrote in the New York Times in January of 2002. ‘Creating such a database is technically simple. All we have to do is copy information from the hundreds of separate law enforcement databases into a single database. A national security database could be built in a few months,’ Ellison explained. ‘A national security database combined with biometrics, thumb prints, hand prints, iris scans or whatever is best can be used to detect people with false identities.'”

We are not sure whether Novak is suggesting Vox deliberately downplayed Oracle’s role in facilitating a surveillance state infrastructure. He certainly wants us to know the company’s fortunes rose after that fateful day in September 2001, with federal government contracts making up 23 percent of its licensing revenue in 2003 to the tune of $2.5 billion. We are reminded Oracle’s David Carney stated in 2002, while trying but failing to avoid sounding callous, that 9/11 had been good for business. Perhaps Vox did not believe this facet of Oracle’s history to be relevant, but Gizmodo can consider us, dear readers, duly informed.

Cynthia Murrell, September 5, 2022

Data: A Disappointing Ride Down Zero Lane to Cell One

August 26, 2022

Projects meant to glean business insights through the analysis of vast troves of data still tend to disappoint. On its blog, British data-project management firm Brijj lists “5 Reasons Why 80% of Data and Insight Projects Fail.” The write-up tells us:

“In the UK alone, we spend £24bn on data projects every year. According to recent studies, however, organizational leadership has been dissatisfied with the value they get from data. In fact, they consider 80% of all data projects a failure. That equates to £19bn of waste. And why? Because so many don’t do the basics well. They never stood a chance.”
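
The arithmetic behind the quoted waste figure is easy to check. The sketch below uses only the two numbers from the quote (£24bn of annual spend and an 80% failure rate); the variable names are ours.

```python
# Figures taken from the quoted passage; variable names are illustrative.
annual_spend_bn = 24       # UK spend on data projects, in billions of pounds
failure_rate_percent = 80  # share of projects leadership considers failures

wasted_bn = annual_spend_bn * failure_rate_percent / 100

print(wasted_bn)         # 19.2
print(round(wasted_bn))  # 19, matching the rounded figure in the quote
```

So the quoted £19bn is the 19.2 result rounded down to a whole number.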

Not surprisingly, writer and Brijj founder/CEO Adrian Mitchell suggests consulting outside data experts from the start to make sure one’s project delivers those sweet, sweet insights:

“The bottom line is that both data creators and their business customers need to be involved in the data & insight project from the initial question through to the outcome and work closely together for it to provide actionable insights and urge action. Currently, there are many gaps between the two groups, resulting in disconnect, frustrations, time and financial losses, and no real-world outcomes. Organizations need to close these to truly harness the power of data and maximize its value.”

The list Mitchell offers looks awfully familiar; we think we have heard some of these “reasons” before. We are told the biggest problem is asking the wrong questions in the first place. Then there is, as mentioned above, a lack of collaboration between data analysts and their clients. If one has managed to gather useful bits of knowledge, they must be both communicated to the right people and made easy to find. Finally, standardized systems (like Brijj’s, we presume) should be put in place to make the whole process easier for the technically disinclined.

Perhaps Mitchell is right and these measures can help some companies make the most of the data they were persuaded to accumulate? It is worth keeping in mind, though, that any concepts derived by software have limitations… just like a blind date.

Cynthia Murrell, August 26, 2022
