Fuzzifying Data: Yeah, Sure

January 19, 2022

Data are often alleged to be anonymous, but they may not be. Companies such as LexisNexis, Acxiom, and mobile phone providers argue that as long as personal identifiers, such as names and addresses, are removed, the data are rendered harmless. Unfortunately, data can be de-anonymized without too much trouble. Wired posted Justin Sherman’s article, “Big Data May Not Know Your Name. But It Knows Everything Else.”

Despite humans having similar habits, there is some truth in the phrase “everyone is unique.” With a few white hat or black hat tactics, user data can be traced back to its originator. Data are not only individualized by a user’s unique identity; there are also minute ways to gather personal information, ranging from Internet search history to GPS logs to IP addresses. Companies that want to sell you goods and services purchase the data, but so do governments and law enforcement agencies.
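The tracing the paragraph above describes is often a simple “linkage attack”: join a supposedly anonymized record set to a public roster on quasi-identifiers such as ZIP code, birth date, and sex. The sketch below is a minimal illustration with invented data; real attacks use larger rosters and fuzzier matching.

```python
# Hypothetical linkage attack: names are stripped from the "anonymized"
# rows, but the quasi-identifiers that remain are enough to match them
# against a public roster. All records here are invented for the sketch.

anonymized = [
    {"zip": "40201", "birth": "1970-03-14", "sex": "F", "searches": ["..."]},
    {"zip": "40202", "birth": "1985-07-02", "sex": "M", "searches": ["..."]},
]

public_roster = [  # e.g., a voter roll bought from a data broker
    {"name": "Jane Roe", "zip": "40201", "birth": "1970-03-14", "sex": "F"},
    {"name": "John Doe", "zip": "40202", "birth": "1985-07-02", "sex": "M"},
]

def reidentify(anon_rows, roster):
    """Match rows whose (zip, birth, sex) tuple is unique in the roster."""
    key = lambda r: (r["zip"], r["birth"], r["sex"])
    index = {}
    for person in roster:
        index.setdefault(key(person), []).append(person["name"])
    matches = {}
    for row in anon_rows:
        candidates = index.get(key(row), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches[candidates[0]] = row
    return matches

print(sorted(reidentify(anonymized, public_roster)))
```

The point of the sketch: nothing in the “anonymized” rows is a name, yet every row comes back with one attached.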

There are stringent privacy regulations in place, but in the face of the almighty dollar and governments bypassing their own laws, it is like spitting into the wind. The scariest fact is that nothing is secret anymore:

“The irony that data brokers claim that their “anonymized” data is risk-free is absurd: Their entire business model and marketing pitch rests on the premise that they can intimately and highly selectively track, understand, and micro target individual people.

This argument isn’t just flawed; it’s also a distraction. Not only do these companies usually know your name anyway, but data simply does not need to have a name or social security number attached to cause harm. Predatory loan companies and health insurance providers can buy access to advertising networks and exploit vulnerable populations without first needing those people’s names. Foreign governments can run disinformation and propaganda campaigns on social media platforms, leveraging those companies’ intimate data on their users, without needing to see who those individuals are.”

Companies and organizations need to regulate themselves, and governments need to pass laws that protect their citizens from bad actors. Self-regulation in the face of dollar signs is like asking a person with a sweet tooth to stop eating sugar. However, if governments concentrated on regulating specific types of data and specific methods of data collection and sharing, rather than reaching for a blanket solution, they could better protect users.

Let’s think about the implications. No, let’s not.

Whitney Grace January 19, 2022

What Is Better Than One Logic? Two Logics?

December 22, 2021

Search, database, intelligence, data management and analytics firm MarkLogic continues to evolve and grow. Business Wire reveals, “MarkLogic Acquires Leading Metadata Management Provider Smartlogic.” Good choice—we have found Smartlogic to be innovative, reliable, and responsive. We expect MarkLogic will be able to preserve these characteristics, considering Smartlogic’s top brass will be sticking around. The press release tells us:

“As part of the transaction, Smartlogic’s founder and Chief Executive Officer, Jeremy Bentley, as well as other members of the senior management team, will join the MarkLogic executive team. Financial terms of the transaction were not disclosed. Founded in 2006, Smartlogic has deciphered, filtered, and connected data for many of the world’s largest organizations to help solve their complex data problems. Global organizations in the energy, healthcare, life sciences, financial services, government and intelligence, media and publishing, and high-tech manufacturing industries rely on Smartlogic’s metadata and AI platform every day to enrich enterprise information with context and meaning, as well as extract critical facts, entities, and relationships to power their businesses. For the past four years, Smartlogic has been recognized as a leader by Gartner’s Magic Quadrant for Metadata Management Solutions and by Info-Tech as the preeminent leader of the Data Quadrant for Metadata Management (May 2021).”

Based in San Carlos, California, MarkLogic was founded in 2001 and gained steam in 2012 when it picked up former Oracle database division leader Gary Bloom. Smartlogic is headquartered in San Jose, less than 30 miles away. Perhaps MarkLogic’s XML with taxonomy management will triumph in more markets and bring the Oracle outfit to its knees? Perhaps index term management is the killer app?

Cynthia Murrell, December 22, 2021

What Google Knows about the Honest You

December 10, 2021

I read this quote in a Kleenex story about Google’s lists of popular searches:

“You’re never as honest as you are with your search engine. You get a sense of what people genuinely care about and genuinely want to know — and not just how they’re presenting themselves to the rest of the world.”

The alleged Googler crafting this statement is a data editor. You can read more about the highly selective and unverified Google search trends in “What Google’s Trending Searches Say about America in 2021.”

For me, the statement allows several observations:

  1. A person acting in an unguarded way reveals information not usually disseminated in “guarded” settings; for example, a job interview
  2. The word “honest” implies an unvarnished look at the psycho-social factors within a single person
  3. A collection of data points about the psycho-social aspects of a single person makes it possible to tag, classify, and relate that individual to others. Numerical procedures allow a person or system with access to those data to predict certain behaviors, predispositions, or actions.

Thus, the collection of searches, clicks, and items created by an individual using Google services such as Gmail and YouTube creates a palette of color from which a data maestro can paint a picture.
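One hedged illustration of the “numerical procedures” mentioned in point 3: represent each person as a vector of behavioral counts and relate people by cosine similarity. The profiles below are invented; real systems use far more signals.

```python
# Toy example: users as vectors of topic-interest counts, compared by
# cosine similarity. Profiles are made up purely for illustration.
import math

profiles = {
    "user_a": {"sports": 9, "finance": 1, "gaming": 0},
    "user_b": {"sports": 8, "finance": 2, "gaming": 1},
    "user_c": {"sports": 0, "finance": 1, "gaming": 9},
}

def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0) * q.get(k, 0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# user_a resembles user_b far more than user_c, so a system could extend
# predictions about b (purchases, predispositions) to a.
print(cosine(profiles["user_a"], profiles["user_b"]))
print(cosine(profiles["user_a"], profiles["user_c"]))
```

Tagging, classifying, and relating individuals at scale is essentially this arithmetic repeated over millions of rows.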

Predestination has never been easier, more automatable, or cheaper to convert into an actionable knowledgebase for smart software. Yep, just simple queries. Useful indeed.

Stephen E Arnold, December 10, 2021

Microsoft: Amazing Quote about Support

August 12, 2021

I read “El Reg talks to Azure Data veep as Microsoft flicks the switch on Azure Arc for SQL Managed Instances: Longevity, PostgreSQL, and the Default Relational Database of Choice.” I like the phrase “default relational database of choice.” Okay, confidence can be a positive.

Most of the interview is not-so-surprising stuff: end-of-life assurances, hints of a catholic approach to the Codd structure, and a general indifference to Amazon’s database initiatives. That’s okay. The expert is Rohan Kumar, who speaks Redmond, a peculiar dialect of jargon which often reveals little relevant to the ordinary person trying to restore a trashed SQL Server table.

I did spot one tiny comment. Here is this remarkable assertion:

“We will never let any of our customers run into challenges because Microsoft decided, ‘hey, we’re not going to support you’.”

No kidding? For real? I mean none of the code blocking, security challenging stuff?

Stephen E Arnold, August 12, 2021

Elasticsearch Versus RocksDB: The Old Real Time Razzle Dazzle

July 22, 2021

Something happens. The “event” is captured and written to a file. Even if you are watching the “something” happen, there is latency between the event and the sensor or the human perceiving it. The calculus of real time mostly involves avoiding too much talk about latency. But real time is hot, because no one wants to look at old data, certainly not TikTok fans or the money-fueled lovers of Robinhood.

“Rockset CEO on Mission to Bring Real-Time Analytics to the Stack” uses lots of buzzwords, sidesteps inherent latency, and avoids commentary on other allegedly real-time analytics systems. Rockset is built on RocksDB, an open source software. Nevertheless, there is some interesting information about Elasticsearch; for example:

  • Unsupported factoids like: “Every enterprise is now generating more data than what Google had to index in [year] 2000.”
  • No definition or baseline for “simple”: “The combination of the converged index along with the distributed SQL engine is what allows Rockset to be fast, scalable, and quite simple to operate.”
  • Different from Elasticsearch and RocksDB: “So the biggest difference between Elastic and RocksDB comes from the fact that we support full-featured SQL including JOINs, GROUP BY, ORDER BY, window functions, and everything you might expect from a SQL database. Rockset can do this. Elasticsearch cannot.”
  • Similarities with Rockset: “So Lucene and Elasticsearch have a few things in common with Rockset, such as the idea to use indexes for efficient data retrieval.”
  • Jargon and unique selling proposition: “We use converged indexes, which deliver both what you might get from a database index and also what you might get from an inverted search index in the same data structure. Lucene gives you half of what a converged index would give you. A data warehouse or columnar database will give you the other half. Converged indexes are a very efficient way to build both.”
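The “converged index” pitch quoted above can be made concrete with a toy sketch: one structure serving both search-style point lookups (the inverted-index half) and analytics-style scans (the columnar half). This is an assumption-laden illustration of the idea, not Rockset’s actual implementation.

```python
# Minimal sketch of a "converged" structure: every document feeds a row
# store, an inverted index, and a column store at once. Invented design,
# for illustration only.
from collections import defaultdict

class ConvergedIndex:
    def __init__(self):
        self.rows = []                     # row store: retrieval by id
        self.inverted = defaultdict(set)   # (field, value) -> row ids
        self.columns = defaultdict(list)   # field -> values, for scans

    def add(self, doc):
        rid = len(self.rows)
        self.rows.append(doc)
        for field, value in doc.items():
            self.inverted[(field, value)].add(rid)   # search half
            self.columns[field].append(value)        # analytics half
        return rid

    def search(self, field, value):
        """Point lookup via the inverted index."""
        return [self.rows[i] for i in sorted(self.inverted[(field, value)])]

    def aggregate(self, field, fn=sum):
        """Columnar scan over one field."""
        return fn(self.columns[field])

idx = ConvergedIndex()
idx.add({"city": "Austin", "sales": 10})
idx.add({"city": "Boston", "sales": 7})
idx.add({"city": "Austin", "sales": 5})
print(idx.search("city", "Austin"))   # search-engine-style lookup
print(idx.aggregate("sales"))         # warehouse-style aggregate: 22
```

A Lucene-style index gives you roughly the `search` half; a columnar database gives you roughly the `aggregate` half; the quoted claim is that keeping both in one structure is the efficient path.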

Amazon has rolled out its real time system, and there are a number of options available from vendors like Trendalyze.

Each of these vendors emphasizes real time. The problem, however, is that latency exists regardless of the system. Each has use cases which make its system seem to be the solution to real-time data analysis. That’s what makes horse races interesting. These unfold in real time if one is at the track. Fractional delays have big consequences for those betting their solution is the least latent.
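The inherent latency argued for above is easy to demonstrate: even a trivial in-process pipeline shows a nonzero gap between an event’s occurrence and its availability for analysis. The sleep below stands in for network and queueing delay; actual timings depend on the machine.

```python
# Tiny demonstration that "real time" always trails the event itself.
import time

def ingest(event):
    event["ingested_at"] = time.monotonic()  # when the "system" sees it
    return event

event = {"occurred_at": time.monotonic()}    # when the event happened
time.sleep(0.01)                             # stand-in for network/queue delay
event = ingest(event)

latency = event["ingested_at"] - event["occurred_at"]
print(f"event-to-ingest latency: {latency * 1000:.1f} ms")
```

However small the number, it is never zero, which is the point the vendors’ marketing glides past.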

Stephen E Arnold, July 22, 2021

Governments Heavy Handed on Social Media Content

July 21, 2021

In the US, government entities “ask” for data. In other countries, there may be different approaches; for example, having data pushed directly to government data lakes.

Governments around the world are paying a lot more attention to content on Twitter and other social media, we learn from “Twitter Sees Big Jump in Gov’t Demands to Remove Content of Journalists” at TechCentral. According to data released by the platform, demands increased by 26% in the second half of last year. We wonder how many of these orders involved false information and how many simply contained content governments did not like. That detail is not revealed, but we do learn that the 199 journalist and news outlet accounts were verified. The report also does not divulge which countries made the demands or which ones Twitter obliged. We do learn:

“Twitter said in the report that India was now the single largest source of all information requests from governments during the second half of 2020, overtaking the US, which was second in the volume of requests. The company said globally it received over 14,500 requests for information between 1 July and 31 December, and it produced some or all of the information in response to 30% of the requests. Such information requests can include governments or other entities asking for the identities of people tweeting under pseudonyms. Twitter also received more than 38,500 legal demands to take down various content, which was down 9% from the first half of 2020, and said it complied with 29% of the demands. Twitter has been embroiled in several conflicts with countries around the world, most notably India over the government’s new rules aimed at regulating content on social media. Last week, the company said it had hired an interim chief compliance officer in India and would appoint other executives in order to comply with the rules.”

Other platforms are also receiving scrutiny from assorted governments. In response to protests, for example, Cuba has restricted access to Facebook and messaging apps. Also recently, Nigeria banned Twitter altogether and prohibited TV and radio stations from using it as a source of information. Meanwhile, social media companies continue to face scrutiny for the presence of hate speech, false information, and propaganda on their sites. We are reminded CEOs Jack Dorsey of Twitter, Mark Zuckerberg of Facebook, and Sundar Pichai of Google appeared in a hearing before the US congress on misinformation just last March. And most recently, all three platforms had to respond to criticisms over racist attacks against black players on England’s soccer team. Is it just me, or are these problems getting worse instead of better?

Cynthia Murrell, July 21, 2021

Databases: Old Wine, New Bottles, and Now Updated Labels with More Jargon and Buzzwords

June 29, 2021

I read “It’s the Golden Age of Databases. It Can’t Last.” The subtitle is fetching too:

Startups are reaping huge funding rounds. But money alone won’t be enough to top the current market leaders.

I think that it is important to keep in mind that databases once resided within an organization. In 1980, I had my employer’s customer database in a small closet in my office. I kept my office locked, and anyone who needed access had to find me, set up an appointment, and do a look-up. Was I paranoid? Yep, and I suppose that’s why I never went to work for flexi-think outfits intellectually allied with Microsoft or SolarWinds, among others.

Today the cloud is the rage. Why? It’s better, faster, and cheaper. Just pick any two, and note that I did not include “more secure.” If you want some color about the “cost” of the cloud pursuit fueled by cost cutting, check out this high-flying financial outfit’s essay “Andreessen Horowitz Partner Martin Casado Says the Cost of Cloud Computing Is a $100 Billion Drag on the Biggest Software Companies, Sparking a Huge Debate across the Industry.” Some of the ideas are okay; others strike me as similar to those suggesting the Egyptian pyramids are big batteries. The point is that many companies embraced the cloud hoping to reduce the cost and hassle of on-premises systems and people.

One of the upsides of the cloud is the crazy marketing assertions that a bunch of disparate data can be dumped into a “cloud system” and become instantly available for Fancy Dan analytics. Yeah, and I have a bridge to sell you in Brooklyn. I accept PayPal too.

The “Golden Age” write-up works overtime to make the new databases exciting for investors who want a big payout. I did note this statement in the write-up, which is chock-a-block with vendor names:

Ultimately, Databricks and Snowflake’s main competitors probably aren’t each other, but rather Microsoft, AWS and Google.

Do you think it would be helpful to mention IBM and Oracle? I do.

Here’s another important statement from the write up:

One thing is certain: The big data revolution isn’t slowing down. And that means the war over managing it and putting the information to use will only get more fierce.

Why the “fierce”? Perhaps it will be the investors in the whizzy new “we can federate and be better, faster, and cheaper” outfits who put the pedal to the metal. The reality is that big outfits license big brands. Change is time consuming and expensive. And the seamless data lakes with data lake houses on them? Probably still for sale after owners realize that data magic is expensive, time consuming, and fiddly.

But rah rah is solid info today.

Stephen E Arnold, June 29, 2021

Need to Tame the Information Tsunamis in Databases? DbSurfer May Be Your Deviled Egg

June 2, 2021

An interesting article, “DbSurfer: A Search and Navigation Tool for Relational Databases,” describes a novel way to locate information in Codd databases. Nope, I won’t make a reference to codfish. The surfing metaphor is good enough today.

The write up states:

We present a new application for keyword search within relational databases, which uses a novel algorithm to solve the join discovery problem by finding Memex-like trails through the graph of foreign key dependencies. It differs from previous efforts in the algorithms used, in the presentation mechanism and in the use of primary-key only database queries at query-time to maintain a fast response for users.

The Memex reference is not to the mostly forgotten Australian search and retrieval system. The Memex in this paper is a nod to everyone’s information hero Vannevar Bush’s fanciful “memex device.” (No, Google is not a memex device.)

The method involves “joins” and “trails.” The result is a system that allows keyword search and navigation through relational databases.
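The trail-finding idea in the abstract can be sketched simply: treat tables as nodes, foreign-key dependencies as edges, and search for a join path connecting the tables that hold the keyword hits. The schema below is invented, and the paper’s actual ranking and indexing are far richer than this breadth-first search.

```python
# Hypothetical foreign-key graph for a toy retail schema; edges run both
# directions because a trail may follow a FK either way.
from collections import deque

fk_graph = {
    "orders":      ["customers", "order_items"],
    "customers":   ["orders"],
    "order_items": ["orders", "products"],
    "products":    ["order_items"],
}

def join_trail(start, goal):
    """Breadth-first search for the shortest FK trail between two tables."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        trail = queue.popleft()
        if trail[-1] == goal:
            return trail
        for nxt in fk_graph.get(trail[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(trail + [nxt])
    return None  # tables not connected by any FK trail

print(join_trail("customers", "products"))
```

Each hop in the returned trail corresponds to one join in the SQL the system would ultimately issue, which is how keyword hits in separate tables get stitched into a single answer.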

The paper includes a useful list of references. (Some recent computer science graduates who are billing themselves as search experts might find reading a few of the citations helpful. Just a friendly suggestion to the AI, NLP, and semantic whiz types.)

Is this a product? Nope, not yet. Interesting idea, however.

Stephen E Arnold, June 2, 2021

Data Federation: Sure, Works Perfectly

June 1, 2021

How easy is it to snag a dozen sets of data, normalize them, parse them, extract useful index terms, and assign classifications and other useful hooks? “Automated Data Wrangling” provides an answer sharply different from what marketers assert.

A former space explorer, now marooned on a beautiful dying world, explains that the marketing assurances of dozens upon dozens of companies are baloney. Here’s a passage I noted:

Most public data is a mess. The knowledge required to clean it up exists. Cloud based computational infrastructure is pretty easily available and cost effective. But currently there seems to be a gap in the open source tooling. We can keep hacking away at it with custom rule-based processes informed by our modest domain expertise, and we’ll make progress, but as the leading researchers in the field point out, this doesn’t scale very well. If these kinds of powerful automated data wrangling tools are only really available for commercial purposes, I’m afraid that the current gap in data accessibility will not only persist, but grow over time. More commercial data producers and consumers will learn how to make use of them, and dedicate financial resources to doing so, knowing that they’ll be reap financial rewards. While folks working in the public interest trying to create universal public goods with public data and open source software will be left behind struggling with messy data forever.

Marketing is just easier than telling the truth about what’s needed in order to generate information which can be processed by a downstream procedure.

Stephen E Arnold, June xx, 2021

A Field of Data Silos: No Problem

May 5, 2021

The hype about silos has followed data to the cloud. IT Brief grumbles, “How Cloud Silos Are Holding Organisations Back.” Although the brief write-up acknowledges that silos can be desirable, it issues the familiar call to unify the data therein. PureStorage CTO Mark Jobbins writes:

“Overcoming the challenges presented by having cloud silos requires organisations to develop a robust data architecture. Having a common data platform should form the foundation of the data architecture, one that decouples applications and their data from their underlying infrastructure, preventing organizations from being locked into a single delivery model. Working with a multi-cloud architecture is valuable because it helps organizations utilize best-in-breed services from the various cloud service providers. It also reduces vendor lock-in, improves redundancy, and lets businesses choose the ideal features they need for their operations. It’s important to have a strong multi-cloud strategy to ensure the business gets the right mix of security, performance, and cost. The strategy should include the tools and technologies that consolidate cloud resources into a single, cohesive interface for managing cloud infrastructure. Hybrid clouds bring public and private clouds together.”

Such “hybrid clouds” allow an organization to retain the advantages of a multi-cloud architecture while enjoying the blessed unified platform. Of course, this is no simple task, so we are told one must recruit a gifted storage specialist to help. We presume this is where Jobbins’ company comes in.

Cynthia Murrell, May 5, 2021
