Data Federation? Loser. Go with a Data Lake House

February 8, 2022

I have been seeing the phrase “data lake house” or “datalake house.” I noted some bold claims about a new data lake house approach in “Managed Data Lakehouse Startup Onehouse Launches with $8M in Funding.” The write up states:

One of the flagship features of Onehouse’s lakehouse service is a technology called incremental processing. It allows companies to start analyzing their data soon after it’s generated, which is difficult when using traditional technologies.

The write up adds:

The company’s lakehouse service automatically optimizes customers’ data ingestion workflows to improve performance, the startup says. Because the service is delivered via the cloud on a fully managed basis, customers don’t have to manage the underlying infrastructure.

The idea, of course, is that traditional methods of handling data are [a] slow, [b] expensive, and [c] difficult to implement.

The premise is that the data lake house delivers more efficient use of data and a way to “future proof the data architected for machine learning / data science down the line.”

When I read this I thought of Vivisimo’s explanation of its federating method. IBM bought Vivisimo, and I assume that it is one of the ingredients in IBM’s secret big data sauce. MarkLogic also suggested in one presentation I sat through that its system would ingest data and the MarkLogic system (once eyed by the Google as a possible acquisition) would allow near real time access to the data. One person in the audience was affiliated with the US Library of Congress, and that individual seemed quite enthused about MarkLogic. And there are companies which facilitate data manipulation; for example, Kofax and its data connectors.

From my point of view, the challenge is that today large volumes of data are available. These data have to be moved from point A to point B. Ideally data do not require transformation. At some point in the flow, data in motion can be processed. There are firms which offer real time or near real time data analytics; for example, Trendalyze.com.
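To make the “data in motion” idea concrete, here is a minimal sketch of checkpoint-based incremental processing, the generic pattern behind claims like Onehouse’s. The record store and checkpoint below are toy stand-ins I invented for illustration, not anyone’s actual API:

```python
import time

# Toy stand-ins for a data store and a checkpoint (illustration only).
records = []       # each record: (commit_time, payload)
checkpoint = 0.0   # commit time of the last record already processed

def ingest(payload):
    """Simulate new data arriving from an upstream producer."""
    records.append((time.time(), payload))

def process_increment():
    """Analyze only records committed after the checkpoint,
    instead of re-reading the whole table on every run."""
    global checkpoint
    new = [(t, p) for (t, p) in records if t > checkpoint]
    for _, payload in new:
        print(f"analyzing: {payload}")  # analysis soon after arrival
    if new:
        checkpoint = max(t for t, _ in new)

ingest("sensor reading 1")
process_increment()  # picks up reading 1 only
ingest("sensor reading 2")
process_increment()  # picks up reading 2 only
```

The point is simply that each run touches only the new slice of data, which is why such systems can claim near real time analysis without re-processing everything.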

Conversion, moving, saving, and then doing something “more” with the data remain challenges. Maybe Onehouse has the answer?

Stephen E Arnold, February 8, 2022

Why Some Outputs from Smart Software Are Wonky

July 26, 2021

Some models work like a champ. Utility rate models are reasonably reliable. When it is hot, use of electricity goes up. Rates are then “adjusted.” Perfect. Other models are less solid; for example, Bayesian systems which are not checked every hour or large neural nets which are “assumed” to be honking along like a well-ordered flight of geese. Why do I offer such Negative Ned observations? Experience, for one thing, and the nifty little concepts tossed out by Ben Kuhn, a Twitter persona. You can locate this string of observations at this link. Well, you could as of July 26, 2021, at 6:30 am US Eastern time. Here’s a selection of what are apparently the highlights of Mr. Kuhn’s conversation with “a former roommate.” That’s provenance enough for me.

Item One:

Most big number theory results are apparently 50-100 page papers where deeply understanding them is ~as hard as a semester-long course. Because of this, ~nobody has time to understand all the results they use—instead they “black-box” many of them without deeply understanding.

Could this be true? How could newly minted, become-an-expert-with-our-$40-online-course professionals, who use models packaged in downloadable and easy-to-plug-in modules, be unfamiliar with the inner workings of said bundles of brilliance? Impossible? Really?

Item Two:

A lot of number theory is figuring out how to stitch together many different such black boxes to get some new big result. Roommate described this as “flailing around” but also highly effective and endorsed my analogy to copy-pasting code from many different Stack Overflow answers.

Oh, come on. Flailing around. Do developers flail, or do they “trust” the outfits who pretend to know how some multi-layered systems work? Fiddling with assumptions, thresholds, and (close your ears) the data themselves is never, ever a way to work around a glitch.
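For the skeptical, a toy illustration with made-up numbers of how threshold fiddling “fixes” things:

```python
scores = [0.42, 0.55, 0.61, 0.78, 0.93]  # made-up anomaly scores

def alerts(scores, threshold):
    """Flag any score above the threshold as anomalous."""
    return [s for s in scores if s > threshold]

print(alerts(scores, threshold=0.60))  # [0.61, 0.78, 0.93] -- three alarms
print(alerts(scores, threshold=0.95))  # [] -- glitch "solved," model unchanged
```

Raise the threshold and the alarms vanish. Nothing about the underlying model improved.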

Item Three:

Roommate told a story of using a technique to calculate a number and having a high-powered prof go “wow, I didn’t know you could actually do that”

No kidding? That’s impossible in general, and that expression would never be uttered at Amazon-, Facebook-, and Google-type operations, would it?

Will Mr. Kuhn be banned for heresy? [Keep in mind how Wikipedia defines the term: “any belief or theory that is strongly at variance with established beliefs or customs, in particular the accepted beliefs of a church or religious organization.”] Once, just repeating such an idea would warrant a close encounter with an Iron Maiden or a pile of firewood. Probably not today. Someone might emit a slightly critical tweet, however.

Stephen E Arnold, July 26, 2021

Data Federation: Sure, Works Perfectly

June 1, 2021

How easy is it to snag a dozen sets of data, normalize them, parse them, extract useful index terms, assign classifications, and add other useful hooks? “Automated Data Wrangling” provides an answer sharply different from what marketers assert.

A former space explorer, now marooned on a beautiful dying world, explains that the marketing assurances of dozens upon dozens of companies are baloney. Here’s a passage I noted:

Most public data is a mess. The knowledge required to clean it up exists. Cloud based computational infrastructure is pretty easily available and cost effective. But currently there seems to be a gap in the open source tooling. We can keep hacking away at it with custom rule-based processes informed by our modest domain expertise, and we’ll make progress, but as the leading researchers in the field point out, this doesn’t scale very well. If these kinds of powerful automated data wrangling tools are only really available for commercial purposes, I’m afraid that the current gap in data accessibility will not only persist, but grow over time. More commercial data producers and consumers will learn how to make use of them, and dedicate financial resources to doing so, knowing that they’ll reap financial rewards. While folks working in the public interest trying to create universal public goods with public data and open source software will be left behind struggling with messy data forever.
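For readers who have not fought this fight, here is a minimal sketch of the “custom rule-based processes” the author describes, applied to one made-up messy record. Real public datasets need hundreds of such hand-written rules, which is exactly why the approach does not scale:

```python
import re

# A made-up messy record of the kind public datasets are full of.
raw = {"Utility Name ": " Pacific Gas & Electric CO. ",
       "capacity_mw": "1,250.5 ",
       "year": "'19"}

def clean(record):
    """Apply hand-written normalization rules, one quirk at a time."""
    out = {}
    for key, value in record.items():
        key = key.strip().lower().replace(" ", "_")  # normalize column names
        out[key] = value.strip()                     # drop stray whitespace
    out["utility_name"] = re.sub(r"\bCO\.?$", "Company",
                                 out["utility_name"], flags=re.I)
    out["capacity_mw"] = float(out["capacity_mw"].replace(",", ""))
    out["year"] = 2000 + int(out["year"].lstrip("'"))
    return out

print(clean(raw))
# {'utility_name': 'Pacific Gas & Electric Company',
#  'capacity_mw': 1250.5, 'year': 2019}
```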

Marketing is just easier than telling the truth about what’s needed in order to generate information which can be processed by a downstream procedure.

Stephen E Arnold, June 1, 2021

Does Google Manifest Addiction to Personal Data?

March 31, 2021

I read an amusing “we don’t do that!” write up in “Google Collects 20 Times More Telemetry from Android Devices Than Apple from iOS.” The cyber security firm Recorded Future points to academic research asserting:

The study unearthed some uncomfortable results. For starters, Prof. Leith said that “both iOS and Google Android transmit telemetry, despite the user explicitly opting out of this [option].” Furthermore, “this data is sent even when a user is not logged in (indeed even if they have never logged in),” the researcher said. [Weird bold face in original text removed.]

Okay, this is the stuff of tenure. The horrors of monopolies and clueless users who happily gobble up free services.

What’s amazing is that the write up does not point out the value of these data for predictive analytics. That’s the business of Recorded Future, right? Quite an oversight. That’s what happens when “news” stumbles over the business model paying for marketing via content. Clever? Of course.

The reliability of the probabilities generated by the Recorded Future methods pivots on having historical and real time data. No wonder Google and Apple suggest that “we don’t do that.”

Recorded Future’s marketing is one thing, but Google’s addiction to data is presenting itself in quite fascinating ways. Navigate to “Google’s New App Automagically Organizes Your Scanned Documents.” The write up states:

The app lets you scan documents and then it uses AI to automatically name and sort them into different categories such as bills, IDs, and vehicles.

And what happens?

To make it easy to find documents, you can also search through the full text of the document.

What types of documents does a happy user scan? Maybe the Covid vaccination card? Maybe legal documents like mortgages, past due notices from a lawyer, divorce papers, and similar tough-to-obtain information of a quite private and personal nature?

My point is that mobile devices are data collection devices. The data are used to enable the Apple and Google business models. Ads, information about preferences, clues to future actions, and similar insights are now routinely available to those with access to the data and analytic systems.

The professor on the tenure track or gunning for an endowed chair can be surprised by practices which have been refined over many years. Not exactly groundbreaking research.

Google obtaining access to scanned personal documents? No big deal. Think how easy and how convenient the free app makes taming old-fashioned paper. I wonder if Google has an addiction to data and can no longer help itself.

Without meaningful regulation, stunned professors and mobile device users in love with convenience are cementing monopoly control over information flows.

Oh, Recorded Future was once a startup funded by Google and In-Q-Tel. Is that a useful fact?

Stephen E Arnold, March 31, 2021

Checking Out Registered Foreign Agents

December 14, 2020

Navigate to https://datasette.io. The Web page explains a service which permits manipulation of structured data. The service seems quite useful. One of the demonstrations makes it possible to explore Datasette functionality by searching for registered foreign agents. This is an interesting demonstration, and some of the information returned is quite useful. You can locate the FARA Department of Justice data at this link.
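Datasette instances expose their tables as JSON as well as HTML, so the FARA demonstration can be queried from a script. A minimal sketch; the instance URL, table name, and search term below are illustrative guesses, so check the live demo linked from datasette.io for the real paths:

```python
import json
import urllib.request

# Hypothetical paths for illustration; see datasette.io for the live demo.
url = ("https://fara.datasettes.com/fara/FARA_All_Registrants.json"
       "?_search=consulting&_shape=array")

with urllib.request.urlopen(url) as response:
    rows = json.load(response)  # _shape=array returns a plain JSON list

for row in rows[:5]:
    print(row)
```

The `_shape=array` parameter is a standard Datasette option for simplified JSON output, and `_search=` runs full-text search on instances that have it enabled.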

Stephen E Arnold, December 14, 2020

Hazy Promises of AI Data Magic

December 11, 2020

Forbes has posted an article that sounds full of promise, “How to Understand All of Your Data to Transform Your Business.” Unfortunately, the piece is full of logical flaws. We note that writer Daniel Fallmann’s company, Mindbreeze, is part of Fabasoft in Austria, and is Microsoft-centric. When he speaks of “all” your data, he seems to be talking about the inclusion of unstructured data. That is the holy grail data management vendors have been chasing for years, with less success than once hoped. Fallmann states what is now the obvious:

“Almost everybody hates filling out forms. That’s why you write a note instead. You send an email or text. You record an audio message. You create a video. You communicate in an unstructured, humanized way. Unlike metadata in forms, which are structured, these other methods of communication are unstructured. Unstructured data lacks metadata, and semi-structured information has limited metadata. The real value of unstructured data like an email, for example, is in the body of that email. You and I can often make sense of an email and other semi-structured and unstructured information. However, for a company, and for search, understanding the essence of a message is not that easy. This is problematic because when you can’t get to the essence of a message, you miss out on opportunities. You find it difficult — if not impossible — to connect the dots of your enterprise data. As a result, a wealth of knowledge that already exists in your enterprise goes to waste. That’s a lot of waste considering that unstructured data represents more than 80% of enterprise data.”

All true. But being able to define the problem does not mean one has the solution. The piece goes on to assert that machine learning can be used to connect the dots between structured and unstructured data, to criticize mindless silo migrations, and to stress the value of removing outdated or incorrect data from one’s database. So far so good. But Fallmann’s generic claims that new technology is “changing everything” lack substance. He fails to provide any factual backup for his assertions about AI or any definition of knowledge management or content management systems.
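To see the gap Fallmann describes, consider how little of an email is actually structured. A minimal sketch with a made-up message, using Python’s standard email module:

```python
import email

raw = """From: alice@example.com
To: bob@example.com
Subject: Q3 contract renewal

Bob, the Q3 renewal is stuck with legal. Can you call the vendor
before Friday? Otherwise we lose the discount.
"""

msg = email.message_from_string(raw)
# Structured metadata: trivially machine-readable.
print(msg["From"], "|", msg["Subject"])
# Unstructured body: the knowledge lives here, but software sees only text.
print(msg.get_payload())
```

Pulling out the headers is easy; deciding that the body means “a deal is at risk” is the hard part vendors keep promising to solve.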

Doesn’t this company license enterprise search?

Cynthia Murrell, December 11, 2020

How to Be a Data Scientist

December 9, 2020

Do you want to be a data scientist without [a] going to a university, [b] watching YouTube videos, and [c] relying on persistence? If you answer “yes” to any of these questions, “You Don’t Need a Ph.D. in Data Science, but…” offers a road map. One tip: Figure out how to do a regression in Excel. Okay.
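For what it is worth, the Excel regression tip translates to a few lines of Python, one of the languages the write up recommends. A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up data: hours studied versus exam score.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 61, 70, 74], dtype=float)

# Ordinary least-squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)
print(f"score ~ {slope:.1f} * hours + {intercept:.1f}")
```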

The write up includes a number of suggestions, including:

  • Kaggle notebooks
  • Free books
  • Free courses from universities
  • Why Python, R, and SQL should be on your radar
  • The value of math and statistics
  • How to get a job.

Interesting summary. But imagine math and statistics at the tail end of the article. Perhaps those disciplines should have been identified at the top of the list. Just a thought.

Stephen E Arnold, December 9, 2020

Exclusive: Interview with DataWalk’s Chief Analytics Officer Chris Westphal, Who Guides an Analytics Rocket Ship

October 21, 2020

I spoke with Chris Westphal, Chief Analytics Officer for DataWalk, about the company’s string of recent contract “wins.” These range from commercial engagements to heavy lifting for the US Department of Justice.

Chris Westphal, founder of Visual Analytics (acquired by Raytheon), brings his one-click approach to advanced analytics.

The firm provides what I have described as an intelware solution. DataWalk ingests data and outputs actionable reports. The company has leap-frogged a number of investigative solutions, including IBM’s Analyst’s Notebook and the much-hyped Palantir Technologies’ Gotham products. This interview took place in a Covid-compliant way. In my previous Chris Westphal interviews, we met at intelligence or law enforcement conferences. Now the experience is virtual, but it was as interesting and informative as our July 2019 conversation. In my most recent interview with Mr. Westphal, I sought to get more information on what’s causing DataWalk to make some competitors take notice of the company and its use of smart software to deliver what customers want: results, not PowerPoint presentations and promises. We spoke on October 8, 2020.

DataWalk is an advanced analytics tool with several important innovations. On one hand, the company’s information processing system performs IBM i2 Analyst’s Notebook and Palantir Gotham type functions — just with a more sophisticated and intuitive interface. On the other hand, Westphal’s vision for advanced analytics has moved past what he accomplished with his previous venture Visual Analytics. Raytheon bought that company in 2013. Mr. Westphal has turned his attention to DataWalk. The full text of our conversation appears below.


Another Data Marketplace: Amazon, Microsoft, Oracle, or Other Provider for This Construct?

August 31, 2020

The European Union is making a sharp U-turn on data privacy, we learn from MIT Technology Review’s article, “The EU Is Launching a Market for Personal Data. Here’s What That Means for Privacy.” The EU has historically protected its citizens’ online privacy with vigor, fighting tooth and nail against the commercial exploitation of private information. As of February, though, the European Commission has decided on a completely different data strategy (PDF). Reporter Anna Artyushina writes:

“The Trusts Project, the first initiative put forth by the new EU policies, will be implemented by 2022. With a €7 million [8.3 million USD] budget, it will set up a pan-European pool of personal and nonpersonal information that should become a one-stop shop for businesses and governments looking to access citizens’ information. Global technology companies will not be allowed to store or move Europeans’ data. Instead, they will be required to access it via the trusts. Citizens will collect ‘data dividends,’ which haven’t been clearly defined but could include monetary or nonmonetary payments from companies that use their personal data. With the EU’s roughly 500 million citizens poised to become data sources, the trusts will create the world’s largest data market. For citizens, this means the data created by them and about them will be held in public servers and managed by data trusts. The European Commission envisions the trusts as a way to help European businesses and governments reuse and extract value from the massive amounts of data produced across the region, and to help European citizens benefit from their information.”

It seems shifty that they have yet to determine just how citizens will benefit from this data exploitation, I mean, value-extraction. There is no guarantee people will have any control over their information, and there is currently no way to opt out. This change is likely to ripple around the world, as the way the EU approaches data regulation has long served as an example to other countries.

The concept of data trusts has been around since 2018, when Sir Tim Berners-Lee proposed it. Such a trust could be for-profit, for a charitable cause, or simply for data storage and protection. As Artyushina notes, whether this particular trust actually protects citizens depends on the wording of its charter and the composition of its board of directors. See the article for examples of other trusts gone wrong, as well as possible solutions. Let us hope this project is set up and managed in a way that puts citizens first.

Cynthia Murrell, August 31, 2020

Amazon and Toyota: Tacoma Connects to AWS

August 20, 2020

This is just a very minor story. For most people, the information reported in “Toyota, Amazon Web Services Partner On Cloud-Connected Vehicle Data” will be irrelevant. The value of the data collected by the respective firms and their partners is trivial and will not have much impact. Furthermore, any data processed within Amazon’s streaming data marketplace and made available to some of the firm’s customers will be of questionable value. That’s why I am not immediately updating my Amazon reports to include the Toyota and insurance connection.

Now to the minor announcement:

Toyota will use AWS’ services to process and analyze data “to help Toyota engineers develop, deploy, and manage the next generation of data-driven mobility services for driver and passenger safety, security, comfort, and convenience in Toyota’s cloud-connected vehicles.” The MSPF and its application programming interfaces (API) will enable Toyota to use connected vehicle data to improve vehicle design and development, as well as offer new services such as rideshare, full-service lease, proactive vehicle maintenance notifications and driving behavior-based insurance.
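What would consuming such connected-vehicle data look like mechanically? Toyota’s actual MSPF pipeline is not public, so here is only a minimal sketch of reading records from an AWS Kinesis stream with boto3, using a hypothetical stream name:

```python
import boto3

# "vehicle-telemetry" is a hypothetical stream name for illustration.
kinesis = boto3.client("kinesis", region_name="us-east-1")

shard_id = kinesis.describe_stream(StreamName="vehicle-telemetry")[
    "StreamDescription"]["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName="vehicle-telemetry",
    ShardId=shard_id,
    ShardIteratorType="LATEST")["ShardIterator"]

# Pull a small batch of telemetry records from the stream.
records = kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]
for record in records:
    print(record["Data"])  # raw payload bytes, e.g. JSON from the vehicle
```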

Are there possible implications from this link up? Sure, but few people care about Amazon’s commercial, financial, and governmental services, so why think about issues like:

  • Value of the data to the AWS streaming data marketplace
  • Link analytics related to high risk individuals or fleet owners
  • Significance of the real time data to predictive analytics, maybe to insurance carriers and others?

Nope, not much of a big deal at all. Who cares? Just mash that Buy Now button and move on. Curious about how Amazon ensures data integrity in such a system? If you are, you can purchase our 50-page report about Amazon’s advanced data security services. Just write darkcyber333 at yandex dot com.

But I know firsthand, after two years of commentary, that shopping is more fun than thinking about Amazon examined from a different viewshed.

Stephen E Arnold, August 20, 2020

