Data Federation? Loser. Go with a Data Lake House

February 8, 2022

I have been noticing the phrase “data lake house” or “datalake house.” I noted some bold claims about a new data lake house approach in “Managed Data Lakehouse Startup Onehouse Launches with $8M in Funding.” The write up states:

One of the flagship features of Onehouse’s lakehouse service is a technology called incremental processing. It allows companies to start analyzing their data soon after it’s generated, which is difficult when using traditional technologies.

The write up adds:

The company’s lakehouse service automatically optimizes customers’ data ingestion workflows to improve performance, the startup says. Because the service is delivered via the cloud on a fully managed basis, customers don’t have to manage the underlying infrastructure.

The idea of course is that traditional methods of handling data are [a] slow, [b] expensive, and [c] difficult to implement.

The premise is that the data lake house delivers more efficient use of data and a way to “future proof the data architected for machine learning / data science down the line.”

When I read this I thought of Vivisimo’s explanation of its federating method. IBM bought Vivisimo, and I assume that it is one of the ingredients in IBM’s secret big data sauce. MarkLogic also suggested in one presentation I sat through that its system would ingest data and the MarkLogic system (once eyed by the Google as a possible acquisition) would allow near real time access to the data. One person in the audience was affiliated with the US Library of Congress, and that individual seemed quite enthused about MarkLogic. And there are companies which facilitate data manipulation; for example, Kofax and its data connectors.

From my point of view, the challenge is that today large volumes of data are available. These data have to be moved from point A to point B. Ideally data do not require transformation. At some point in the flow, data in motion can be processed. There are firms which offer real time or near real time data analytics; for example, Trendalyze.com.

Conversion, moving, saving, and then doing something “more” with the data remain challenges. Maybe Onehouse has the answer?

Stephen E Arnold, February 8, 2022

Coalesce: Tackling the Bottleneck Few Talk About

February 1, 2022

Coalesce went stealth, the fancier and more modern techno slang for “going dark,” to work on projects in secret. The company has returned to the light, says Crowd Fund Insider, with a robust business plan and product, plus loads of funding: “Coalesce Debuts From Stealth, Attracts $5.92M For Analytics Platform.”

Coalesce is run by a former Oracle employee and it develops products and services similar to Oracle, but with a MarkLogic spin. That is one way to interpret how Coalesce announced its big return with its Coalesce Data Transformation platform that offers modeling, cleansing, governance, and documentation of data with analytical efficiency and flexibility. Do not forget that 11.2 Capital and GreatPoint Ventures supplied $5.92 million in seed funding for the new data platform. Coalesce plans to use the funding for engineering functions, developing marketing strategy, and expanding sales.

Coalesce noticed that there is a weak link between organizations’ cloud analytics and actively making use of data:

“ ‘The largest bottleneck in the data analytics supply chain today is transformations. As more companies move to the cloud, the weaknesses in their data transformation layer are becoming apparent,’ said Armon Petrossian, the co-founder and CEO of Coalesce. ‘Data teams are struggling to keep up with the demands from the business, and this problem has only continued to grow with the volumes and complexity of data combined with the shortage of skilled people. We are on a mission to radically improve the analytics landscape by making enterprise-scale data transformations as efficient and flexible as possible.’”

Coalesce might be duplicating Oracle and MarkLogic, but if they have discovered a niche market in cloud analytics then they are about to rocket from their stealth. Hopefully the company will solve the transformation problem instead of issuing marketing statements as many other firms do.

Whitney Grace, February 1, 2022

Fuzzifying Data: Yeah, Sure

January 19, 2022

Data are often alleged to be anonymous, but they may not be. Expert companies such as LexisNexis, Acxiom, and mobile phone providers argue that as long as personal identifiers, including names, addresses, etc., are removed from data, the data are rendered harmless. Unfortunately, data can be de-anonymized without too much trouble. Wired posted Justin Sherman’s article, “Big Data May Not Know Your Name. But It Knows Everything Else.”

Despite humans having similar habits, there is some truth in the phrase “everyone is unique.” With a few white hat or black hat tactics, user data can be traced back to the originator. Data are not only individualized by a user’s unique identity; there are also minute ways to gather personal information, ranging from Internet search history to GPS logs and IP addresses. Companies that want to sell you goods and services purchase the data, and so do governments and law enforcement agencies.
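
To make the tracing-back point concrete, here is a minimal sketch in Python. Every record, field name, and the `reidentify` helper are invented for illustration; no real broker data set is claimed to look like this. It shows how a record with no name attached can be matched to a named one when both share a few quasi-identifiers such as ZIP code, birth date, and sex.

```python
# Hypothetical toy data: every record and field name here is made up.
ANONYMIZED = [
    {"zip": "40202", "birth": "1961-03-14", "sex": "F", "searches": ["loan rates"]},
    {"zip": "40205", "birth": "1975-11-02", "sex": "M", "searches": ["knee surgery"]},
]

PUBLIC_ROLL = [
    {"name": "Pat Example", "zip": "40205", "birth": "1975-11-02", "sex": "M"},
    {"name": "Sam Sample", "zip": "40202", "birth": "1980-01-01", "sex": "F"},
]

QUASI_IDS = ("zip", "birth", "sex")


def reidentify(anonymized, named, keys=QUASI_IDS):
    """Return (name, record) pairs whose quasi-identifiers match exactly one named row."""
    index = {}
    for row in named:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])
    matches = []
    for rec in anonymized:
        names = index.get(tuple(rec[k] for k in keys), [])
        if len(names) == 1:  # a unique match means the "anonymous" record is exposed
            matches.append((names[0], rec))
    return matches


if __name__ == "__main__":
    for name, rec in reidentify(ANONYMIZED, PUBLIC_ROLL):
        print(name, "->", rec["searches"])
```

No name was ever stored with the search history in this toy; the link comes entirely from three mundane fields that appear in both lists.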

There are stringent privacy regulations in place, but in the face of the almighty dollar and governments bypassing their own laws, it is like spitting in the wind. The scariest fact is that nothing is secret anymore:

“The irony that data brokers claim that their “anonymized” data is risk-free is absurd: Their entire business model and marketing pitch rests on the premise that they can intimately and highly selectively track, understand, and micro target individual people.

This argument isn’t just flawed; it’s also a distraction. Not only do these companies usually know your name anyway, but data simply does not need to have a name or social security number attached to cause harm. Predatory loan companies and health insurance providers can buy access to advertising networks and exploit vulnerable populations without first needing those people’s names. Foreign governments can run disinformation and propaganda campaigns on social media platforms, leveraging those companies’ intimate data on their users, without needing to see who those individuals are.”

Companies and organizations need to regulate themselves, while governments need to pass laws that protect their citizens from bad actors. Self-regulation in the face of dollar signs is like asking a person with a sweet tooth to stop eating sugar. However, if governments concentrated on regulating specific types of data and specific types of data collection and sharing, rather than on a blanket solution, they could protect users.

Let’s think about the implications. No, let’s not.

Whitney Grace, January 19, 2022

What Is Better Than One Logic? Two Logics?

December 22, 2021

Search, database, intelligence, data management and analytics firm MarkLogic continues to evolve and grow. Business Wire reveals, “MarkLogic Acquires Leading Metadata Management Provider Smartlogic.” Good choice—we have found Smartlogic to be innovative, reliable, and responsive. We expect MarkLogic will be able to preserve these characteristics, considering Smartlogic’s top brass will be sticking around. The press release tells us:

“As part of the transaction, Smartlogic’s founder and Chief Executive Officer, Jeremy Bentley, as well as other members of the senior management team, will join the MarkLogic executive team. Financial terms of the transaction were not disclosed. Founded in 2006, Smartlogic has deciphered, filtered, and connected data for many of the world’s largest organizations to help solve their complex data problems. Global organizations in the energy, healthcare, life sciences, financial services, government and intelligence, media and publishing, and high-tech manufacturing industries rely on Smartlogic’s metadata and AI platform every day to enrich enterprise information with context and meaning, as well as extract critical facts, entities, and relationships to power their businesses. For the past four years, Smartlogic has been recognized as a leader by Gartner’s Magic Quadrant for Metadata Management Solutions and by Info-Tech as the preeminent leader of the Data Quadrant for Metadata Management (May 2021).”

Based in San Carlos, California, MarkLogic was founded in 2001 and gained steam in 2012 when it picked up former Oracle database division leader Gary Bloom. Smartlogic is headquartered in San Jose, less than 30 miles away. Perhaps MarkLogic’s XML with taxonomy management will triumph in more markets and bring the Oracle outfit to its knees? Perhaps index term management is the killer app?

Cynthia Murrell, December 22, 2021

What Google Knows about the Honest You

December 10, 2021

I read this quote in a Kleenex story about Google’s lists of popular searches:

“You’re never as honest as you are with your search engine. You get a sense of what people genuinely care about and genuinely want to know — and not just how they’re presenting themselves to the rest of the world.”

The alleged Googler crafting this statement is a data editor. You can read more about the highly selective and unverified Google search trends in “What Google’s Trending Searches Say about America in 2021.”

For me, the statement allows several observations:

  1. A person acting in an unguarded way reveals information not usually disseminated in “guarded” settings; for example, a job interview
  2. The word “honest” implies an unvarnished look at the psycho-social factors within a single person
  3. A collection of data points about the psycho-social aspects of a single person makes it possible to tag, classify, and relate that individual to others. Numerical procedures allow a person or system with access to those data to predict certain behaviors, predispositions, or actions.

Thus, the collection of searches, clicks, and items created by an individual using Google services such as Gmail and YouTube creates a palette of color from which a data maestro can paint a picture.

Predestination has never been easier, more automatable, or cheaper to convert into an actionable knowledgebase for smart software. Yep, just simple queries. Useful indeed.

Stephen E Arnold, December 10, 2021

Microsoft: Amazing Quote about Support

August 12, 2021

I read “El Reg talks to Azure Data veep as Microsoft flicks the switch on Azure Arc for SQL Managed Instances: Longevity, PostgreSQL, and the Default Relational Database of Choice.” I like the phrase “default relational database of choice.” Okay, confidence can be a positive.

Most of the interview is not-so-surprising stuff: End-of-life assurances, hints of a catholic approach to the Codd structure, and a general indifference to the Amazon database initiatives. That’s okay. The expert is Rohan Kumar, who is going to speak Redmond, a peculiar dialect of jargon which often reveals little relevant to the ordinary person trying to restore a trashed SQL Server table.

I did spot one tiny comment. Here is this remarkable assertion:

“We will never let any of our customers run into challenges because Microsoft decided, ‘hey, we’re not going to support you’.”

No kidding? For real? I mean none of the code blocking, security challenging stuff?

Stephen E Arnold, August 12, 2021

Elasticsearch Versus RocksDB: The Old Real Time Razzle Dazzle

July 22, 2021

Something happens. The “event” is captured and written to a file. Even if you are watching the “something” happening, there is latency between the event and the sensor or the human perceiving the event. The calculus of real time is mostly avoiding too much talk about latency. But real time is hot because who wants to look at old data? Not TikTok fans and not the money-fueled lovers of Robinhood.

“Rockset CEO on Mission to Bring Real-Time Analytics to the Stack” uses lots of buzzwords, sidesteps inherent latency, and avoids commentary on other allegedly real-time analytics systems. Rockset is built on RocksDB, which is open source software. Nevertheless, there is some interesting information about Elasticsearch; for example:

  • Unsupported factoids like: “Every enterprise is now generating more data than what Google had to index in [year] 2000.”
  • No definition or baseline for “simple”: “The combination of the converged index along with the distributed SQL engine is what allows Rockset to be fast, scalable, and quite simple to operate.”
  • Different from Elasticsearch and RocksDB: “So the biggest difference between Elastic and RocksDB comes from the fact that we support full-featured SQL including JOINs, GROUP BY, ORDER BY, window functions, and everything you might expect from a SQL database. Rockset can do this. Elasticsearch cannot.”
  • Similarities with Rockset: “So Lucene and Elasticsearch have a few things in common with Rockset, such as the idea to use indexes for efficient data retrieval.”
  • Jargon and unique selling proposition: “We use converged indexes, which deliver both what you might get from a database index and also what you might get from an inverted search index in the same data structure. Lucene gives you half of what a converged index would give you. A data warehouse or columnar database will give you the other half. Converged indexes are a very efficient way to build both.”
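
The “converged index” claim is easier to judge with a toy in hand. The Python sketch below is not Rockset’s implementation; the class name `ToyConvergedIndex` and its methods are hypothetical, and the sample documents are made up. It simply records each field value twice on ingest, once in an inverted (value to document) map and once in a columnar (field to values) map, so that a search-style lookup and an aggregate-style scan are both served from the same write.

```python
from collections import defaultdict


class ToyConvergedIndex:
    """Hypothetical illustration only: one write feeds two access paths."""

    def __init__(self):
        self.inverted = defaultdict(set)   # (field, value) -> doc ids, search-engine style
        self.columns = defaultdict(dict)   # field -> {doc id: value}, column-store style

    def add(self, doc_id, doc):
        for field, value in doc.items():
            self.inverted[(field, value)].add(doc_id)
            self.columns[field][doc_id] = value

    def lookup(self, field, value):
        """Selective query: which documents have field == value?"""
        return sorted(self.inverted.get((field, value), set()))

    def total(self, field, doc_ids=None):
        """Aggregate query: sum one column, optionally restricted to some documents."""
        column = self.columns.get(field, {})
        ids = column.keys() if doc_ids is None else doc_ids
        return sum(column[i] for i in ids if i in column)


if __name__ == "__main__":
    idx = ToyConvergedIndex()
    idx.add(1, {"track": "Churchill Downs", "bet": 20})
    idx.add(2, {"track": "Keeneland", "bet": 5})
    idx.add(3, {"track": "Churchill Downs", "bet": 50})
    hits = idx.lookup("track", "Churchill Downs")
    print(hits, idx.total("bet", hits))
```

In this toy, every value is stored twice, so the convenience is paid for in write work and storage. Whether a production system makes the same trade is not something the quoted interview spells out.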

Amazon has rolled out its real time system, and there are a number of options available from vendors like Trendalyze.

Each of these vendors emphasizes real time. The problem, however, is that latency exists regardless of system. Each has use cases which make its system seem to be the solution to real time data analysis. That’s what makes horse races interesting. These unfold in real time if one is at the track. Fractional delays have big consequences for those betting their solution is the least latent.

Stephen E Arnold, July 22, 2021

Governments Heavy Handed on Social Media Content

July 21, 2021

In the US, government entities “ask” for data. In other countries, there may be different approaches; for example, having data pushed directly to government data lakes.

Governments around the world are paying a lot more attention to content on Twitter and other social media, we learn from, “Twitter Sees Big Jump in Gov’t Demands to Remove Content of Journalists” at TechCentral. According to data released by the platform, demands increased by 26% in the second half of last year. We wonder how many of these orders involved false information and how many simply contained content governments did not like. That detail is not revealed, but we do learn the 199 journalist and news outlet accounts were verified. The report also does not divulge which countries made the demands or which ones Twitter obliged. We do learn:

“Twitter said in the report that India was now the single largest source of all information requests from governments during the second half of 2020, overtaking the US, which was second in the volume of requests. The company said globally it received over 14,500 requests for information between 1 July and 31 December, and it produced some or all of the information in response to 30% of the requests. Such information requests can include governments or other entities asking for the identities of people tweeting under pseudonyms. Twitter also received more than 38,500 legal demands to take down various content, which was down 9% from the first half of 2020, and said it complied with 29% of the demands. Twitter has been embroiled in several conflicts with countries around the world, most notably India over the government’s new rules aimed at regulating content on social media. Last week, the company said it had hired an interim chief compliance officer in India and would appoint other executives in order to comply with the rules.”

Other platforms are also receiving scrutiny from assorted governments. In response to protests, for example, Cuba has restricted access to Facebook and messaging apps. Also recently, Nigeria banned Twitter altogether and prohibited TV and radio stations from using it as a source of information. Meanwhile, social media companies continue to face scrutiny for the presence of hate speech, false information, and propaganda on their sites. We are reminded CEOs Jack Dorsey of Twitter, Mark Zuckerberg of Facebook, and Sundar Pichai of Google appeared in a hearing before the US Congress on misinformation just last March. And most recently, all three platforms had to respond to criticisms over racist attacks against black players on England’s soccer team. Is it just me, or are these problems getting worse instead of better?

Cynthia Murrell, July 21, 2021

Databases: Old Wine, New Bottles, and Now Updated Labels with More Jargon and Buzzwords

June 29, 2021

I read “It’s the Golden Age of Databases. It Can’t Last.” The subtitle is fetching too:

Startups are reaping huge funding rounds. But money alone won’t be enough to top the current market leaders.

I think that it is important to keep in mind that databases once resided within an organization. In 1980, I had my employer’s customer database in a small closet in my office. I kept my office locked, and anyone who needed access had to find me, set up an appointment, and do a look up. Was I paranoid? Yep, and I suppose that’s why I never went to work for flexi-think outfits intellectually allied with Microsoft or SolarWinds, among others.

Today the cloud is the rage. Why? It’s better, faster, and cheaper. Just pick any two and note that I did not include “more secure.” If you want some color about the “cost” of the cloud pursuit fueled by cost cutting, check out this high flying financial outfit’s essay “Andreessen Horowitz Partner Martin Casado Says the Cost of Cloud Computing Is a $100 Billion Drag on the Biggest Software Companies, Sparking a Huge Debate across the Industry.” Some of the ideas are okay; others strike me as similar to those suggesting the Egyptian pyramids are big batteries. The point is that many companies embraced the cloud in search of reducing the cost and hassle of on premises systems and people.

One of the upsides of the cloud is the crazy marketing assertions that a bunch of disparate data can be dumped into a “cloud system” and become instantly available for Fancy Dan analytics. Yeah, and I have a bridge to sell you in Brooklyn. I accept PayPal too.

The “Golden Age” write up works overtime to make the new databases exciting for investors who want a big payout. I did note this statement in the write up which is chock-a-block with vendor names:

Ultimately, Databricks and Snowflake’s main competitors probably aren’t each other, but rather Microsoft, AWS and Google.

Do you think it would be helpful to mention IBM and Oracle? I do.

Here’s another important statement from the write up:

One thing is certain: The big data revolution isn’t slowing down. And that means the war over managing it and putting the information to use will only get more fierce.

Why the “fierce”? Perhaps it will be the investors in the whizzy new “we can federate and be better, faster, and cheaper” outfits who put the pedal to the metal. The reality is that big outfits license big brands. Change is time consuming and expensive. And the seamless data lakes with data lake houses on them? Probably still for sale after owners realize that data magic is expensive, time consuming, and fiddly.

But rah rah is solid info today.

Stephen E Arnold, June 29, 2021

Need to Tame the Information Tsunamis in Databases? DbSurfer May Be Your Deviled Egg

June 2, 2021

An interesting article “DbSurfer: A Search and Navigation Tool for Relational Databases” describes a novel way to locate information in Codd databases. Nope, I won’t make a reference to codfish. The surfing metaphor is good enough today.

The write up states:

We present a new application for keyword search within relational databases, which uses a novel algorithm to solve the join discovery problem by finding Memex-like trails through the graph of foreign key dependencies. It differs from previous efforts in the algorithms used, in the presentation mechanism and in the use of primary-key only database queries at query-time to maintain a fast response for users.

The Memex reference is not to the mostly forgotten Australian search and retrieval system. The Memex in this paper is a nod to everyone’s information hero Vannevar Bush’s fanciful “memex device.” (No, Google is not a memex device.)

The method involves “joins” and “trails.” The result is a system that allows keyword search and navigation through relational databases.
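
The idea of walking foreign key dependencies can be sketched in a few lines of Python. This is not the DbSurfer algorithm, whose details are in the paper; it is a plain breadth-first search over a hypothetical schema (the table names and the `find_trail` helper are invented here) that shows what discovering a join path between two tables means.

```python
from collections import deque

# Hypothetical schema: each pair means "left table has a foreign key to right table".
FOREIGN_KEYS = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("products", "suppliers"),
]


def build_graph(foreign_keys):
    """Treat foreign keys as undirected edges, since a join can be followed either way."""
    graph = {}
    for child, parent in foreign_keys:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def find_trail(graph, start, goal):
    """Breadth-first search; returns the shortest list of tables linking start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of foreign keys connects the two tables


if __name__ == "__main__":
    graph = build_graph(FOREIGN_KEYS)
    print(find_trail(graph, "customers", "suppliers"))
```

Each adjacent pair of tables in the returned trail corresponds to one join on a key, which is the sense in which a keyword hit in one table can be connected to a hit in another.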

The paper includes a useful list of references. (Some recent computer science graduates who are billing themselves as search experts might find reading a few of the citations helpful. Just a friendly suggestion to the AI, NLP, and semantic whiz types.)

Is this a product? Nope, not yet. Interesting idea, however.

Stephen E Arnold, June 2, 2021
