Elasticsearch Versus RocksDB: The Old Real Time Razzle Dazzle

July 22, 2021

Something happens. The “event” is captured and written to a file. Even if you are watching the “something” happening, there is latency between the event and the sensor or the human perceiving the event. The calculus of real time is mostly avoiding too much talk about latency. But real time is hot because who wants to look at old data? Not TikTok fans and not the money-fueled lovers of Robinhood.

“Rockset CEO on Mission to Bring Real-Time Analytics to the Stack” uses lots of buzzwords, sidesteps inherent latency, and avoids commentary on other allegedly real-time analytics systems. Rockset is built on RocksDB, an open source key-value store. Nevertheless, there is some interesting information about Elasticsearch; for example:

  • Unsupported factoids like: “Every enterprise is now generating more data than what Google had to index in [year] 2000.”
  • No definition or baseline for “simple”: “The combination of the converged index along with the distributed SQL engine is what allows Rockset to be fast, scalable, and quite simple to operate.”
  • Different from Elasticsearch and RocksDB: “So the biggest difference between Elastic and RocksDB comes from the fact that we support full-featured SQL including JOINs, GROUP BY, ORDER BY, window functions, and everything you might expect from a SQL database. Rockset can do this. Elasticsearch cannot.”
  • Similarities with Rockset: “So Lucene and Elasticsearch have a few things in common with Rockset, such as the idea to use indexes for efficient data retrieval.”
  • Jargon and unique selling proposition: “We use converged indexes, which deliver both what you might get from a database index and also what you might get from an inverted search index in the same data structure. Lucene gives you half of what a converged index would give you. A data warehouse or columnar database will give you the other half. Converged indexes are a very efficient way to build both.”
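
The “converged index” claim in the last bullet is easier to weigh with a toy model in hand. The sketch below is an illustration only, not Rockset’s implementation. It assumes that “converged” means each write feeds both an inverted index (for selective lookups) and a column store (for aggregates); the `ToyConvergedIndex` class, its methods, and the sample records are all hypothetical.

```python
from collections import defaultdict


class ToyConvergedIndex:
    """Hypothetical toy: one write feeds an inverted index and a column store."""

    def __init__(self):
        # (field, value) -> set of document ids, as a search index would keep
        self.inverted = defaultdict(set)
        # field -> {doc_id: value}, as a columnar store would keep
        self.columns = defaultdict(dict)

    def add(self, doc_id, doc):
        # Every field of every document is written to both structures.
        for field, value in doc.items():
            self.inverted[(field, value)].add(doc_id)
            self.columns[field][doc_id] = value

    def find(self, field, value):
        """Selective lookup served by the inverted half."""
        return sorted(self.inverted.get((field, value), set()))

    def total(self, field):
        """Aggregate scan served by the columnar half."""
        return sum(self.columns[field].values())


# Made-up sample data, for illustration only.
index = ToyConvergedIndex()
index.add(1, {"ticker": "HOOD", "shares": 10})
index.add(2, {"ticker": "HOOD", "shares": 5})
index.add(3, {"ticker": "GME", "shares": 7})

hood_docs = index.find("ticker", "HOOD")
all_shares = index.total("shares")
print(hood_docs, all_shares)
```

Note what the toy makes obvious: every write is duplicated across structures, so the convenience at query time is paid for at ingest time, which is one more place latency can hide.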

Amazon has rolled out its real time system, and there are a number of options available from vendors like Trendalyze.

Each of these vendors emphasizes real time. The problem, however, is that latency exists regardless of system. Each has use cases which make their system seem to be the solution to real time data analysis. That’s what makes horse races interesting. These unfold in real time if one is at the track. Fractional delays have big consequences for those betting their solution is the least latent.

Stephen E Arnold, July 22, 2021

Governments Heavy Handed on Social Media Content

July 21, 2021

In the US, government entities “ask” for data. In other countries, there may be different approaches; for example, having data pushed directly to government data lakes.

Governments around the world are paying a lot more attention to content on Twitter and other social media, we learn from “Twitter Sees Big Jump in Gov’t Demands to Remove Content of Journalists” at TechCentral. According to data released by the platform, demands increased by 26% in the second half of last year. We wonder how many of these orders involved false information and how many simply contained content governments did not like. That detail is not revealed, but we do learn that the 199 journalist and news outlet accounts involved were verified. The report also does not divulge which countries made the demands or which ones Twitter obliged. We do learn:

“Twitter said in the report that India was now the single largest source of all information requests from governments during the second half of 2020, overtaking the US, which was second in the volume of requests. The company said globally it received over 14,500 requests for information between 1 July and 31 December, and it produced some or all of the information in response to 30% of the requests. Such information requests can include governments or other entities asking for the identities of people tweeting under pseudonyms. Twitter also received more than 38,500 legal demands to take down various content, which was down 9% from the first half of 2020, and said it complied with 29% of the demands. Twitter has been embroiled in several conflicts with countries around the world, most notably India over the government’s new rules aimed at regulating content on social media. Last week, the company said it had hired an interim chief compliance officer in India and would appoint other executives in order to comply with the rules.”

Other platforms are also receiving scrutiny from assorted governments. In response to protests, for example, Cuba has restricted access to Facebook and messaging apps. Also recently, Nigeria banned Twitter altogether and prohibited TV and radio stations from using it as a source of information. Meanwhile, social media companies continue to face scrutiny for the presence of hate speech, false information, and propaganda on their sites. We are reminded CEOs Jack Dorsey of Twitter, Mark Zuckerberg of Facebook, and Sundar Pichai of Google appeared in a hearing before the US Congress on misinformation just last March. And most recently, all three platforms had to respond to criticisms over racist attacks against black players on England’s soccer team. Is it just me, or are these problems getting worse instead of better?

Cynthia Murrell, July 21, 2021

Databases: Old Wine, New Bottles, and Now Updated Labels with More Jargon and Buzzwords

June 29, 2021

I read “It’s the Golden Age of Databases. It Can’t Last.” The subtitle is fetching too:

Startups are reaping huge funding rounds. But money alone won’t be enough to top the current market leaders.

I think that it is important to keep in mind that databases once resided within an organization. In 1980, I had my employer’s customer database in a small closet in my office. I kept my office locked, and anyone who needed access had to find me, set up an appointment, and do a lookup. Was I paranoid? Yep, and I suppose that’s why I never went to work for flexi-think outfits intellectually allied with Microsoft or SolarWinds, among others.

Today the cloud is the rage. Why? It’s better, faster, and cheaper. Just pick any two and note that I did not include “more secure.” If you want some color about the “cost” of the cloud pursuit fueled by cost cutting, check out this high flying financial outfit’s essay “Andreessen Horowitz Partner Martin Casado Says the Cost of Cloud Computing Is a $100 Billion Drag on the Biggest Software Companies, Sparking a Huge Debate across the Industry.” Some of the ideas are okay; others strike me as similar to those suggesting the Egyptian pyramids are big batteries. The point is that many companies embraced the cloud in search of reducing the cost and hassle of on premises systems and people.

One of the upsides of the cloud is the crazy marketing assertions that a bunch of disparate data can be dumped into a “cloud system” and become instantly available for Fancy Dan analytics. Yeah, and I have a bridge to sell you in Brooklyn. I accept PayPal too.

The “Golden Age” write up works overtime to make the new databases exciting for investors who want a big payout. I did note this statement in the write up which is chock-a-block with vendor names:

Ultimately, Databricks and Snowflake’s main competitors probably aren’t each other, but rather Microsoft, AWS and Google.

Do you think it would be helpful to mention IBM and Oracle? I do.

Here’s another important statement from the write up:

One thing is certain: The big data revolution isn’t slowing down. And that means the war over managing it and putting the information to use will only get more fierce.

Why the “fierce”? Perhaps it will be the investors in the whizzy new “we can federate and be better, faster, and cheaper” outfits who put the pedal to the metal. The reality is that big outfits license big brands. Change is time consuming and expensive. And the seamless data lakes with data lake houses on them? Probably still for sale after owners realize that data magic is expensive, time consuming, and fiddly.

But rah rah is solid info today.

Stephen E Arnold, June 29, 2021

Need to Tame the Information Tsunamis in Databases? DbSurfer May Be Your Deviled Egg

June 2, 2021

An interesting article “DbSurfer: A Search and Navigation Tool for Relational Databases” describes a novel way to locate information in Codd databases. Nope, I won’t make a reference to codfish. The surfing metaphor is good enough today.

The write up states:

We present a new application for keyword search within relational databases, which uses a novel algorithm to solve the join discovery problem by finding Memex-like trails through the graph of foreign key dependencies. It differs from previous efforts in the algorithms used, in the presentation mechanism and in the use of primary-key only database queries at query-time to maintain a fast response for users.

The Memex reference is not to the mostly forgotten Australian search and retrieval system. The Memex in this paper is a nod to everyone’s information hero Vannevar Bush’s fanciful “memex device.” (No, Google is not a memex device.)

The method involves “joins” and “trails.” The result is a system that allows keyword search and navigation through relational databases.
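
The join discovery problem named in the quoted passage can be pictured as path finding in a graph whose nodes are tables and whose edges are foreign keys. The sketch below is a plain breadth-first search over a made-up schema. It is not DbSurfer’s algorithm, which the paper describes as novel; the table names and the `build_graph` and `join_path` helpers are hypothetical.

```python
from collections import deque

# Hypothetical schema: each pair is a foreign-key link (child table, parent table).
FOREIGN_KEYS = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
]


def build_graph(links):
    """Treat foreign keys as undirected edges, since a join works either way."""
    graph = {}
    for child, parent in links:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def join_path(graph, start, goal):
    """Return the shortest chain of tables linking start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorted so the result is deterministic when several paths tie.
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(FOREIGN_KEYS)
trail = join_path(graph, "customers", "products")
print(" -> ".join(trail))
```

A keyword hit in `customers` and another in `products` would, under these assumptions, be connected by joining through `orders` and `order_items`. A real schema with hundreds of tables has many competing paths, and choosing among them is where the hard work lies.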

The paper includes a useful list of references. (Some recent computer science graduates who are billing themselves as search experts might find reading a few of the citations helpful. Just a friendly suggestion to the AI, NLP, and semantic whiz types.)

Is this a product? Nope, not yet. Interesting idea, however.

Stephen E Arnold, June 2, 2021

Data Federation: Sure, Works Perfectly

June 1, 2021

How easy is it to snag a dozen sets of data, normalize them, parse them, and extract useful index terms, assign classifications, and other useful hooks? “Automated Data Wrangling” provides an answer sharply different from what marketers assert.

A former space explorer, now marooned on a beautiful dying world, explains that the marketing assurances of dozens upon dozens of companies are baloney. Here’s a passage I noted:

Most public data is a mess. The knowledge required to clean it up exists. Cloud based computational infrastructure is pretty easily available and cost effective. But currently there seems to be a gap in the open source tooling. We can keep hacking away at it with custom rule-based processes informed by our modest domain expertise, and we’ll make progress, but as the leading researchers in the field point out, this doesn’t scale very well. If these kinds of powerful automated data wrangling tools are only really available for commercial purposes, I’m afraid that the current gap in data accessibility will not only persist, but grow over time. More commercial data producers and consumers will learn how to make use of them, and dedicate financial resources to doing so, knowing that they’ll reap financial rewards. While folks working in the public interest trying to create universal public goods with public data and open source software will be left behind struggling with messy data forever.

Marketing is just easier than telling the truth about what’s needed in order to generate information which can be processed by a downstream procedure.

Stephen E Arnold, June 1, 2021

A Field of Data Silos: No Problem

May 5, 2021

The hype about silos has followed data to the cloud. IT Brief grumbles, “How Cloud Silos Are Holding Organisations Back.” Although the brief write-up acknowledges that silos can be desirable, it issues the familiar call to unify the data therein. PureStorage CTO Mark Jobbins writes:

“Overcoming the challenges presented by having cloud silos requires organisations to develop a robust data architecture. Having a common data platform should form the foundation of the data architecture, one that decouples applications and their data from their underlying infrastructure, preventing organizations from being locked into a single delivery model. Working with a multi-cloud architecture is valuable because it helps organizations utilize best-in-breed services from the various cloud service providers. It also reduces vendor lock-in, improves redundancy, and lets businesses choose the ideal features they need for their operations. It’s important to have a strong multi-cloud strategy to ensure the business gets the right mix of security, performance, and cost. The strategy should include the tools and technologies that consolidate cloud resources into a single, cohesive interface for managing cloud infrastructure. Hybrid clouds bring public and private clouds together.”

Such “hybrid clouds” allow an organization to retain the advantages of a multi-cloud architecture along with the blessed unified platform. Of course, this is no simple task, so we are told one must recruit a gifted storage specialist to help. We presume this is where Jobbins’ company comes in.

Cynthia Murrell, May 5, 2021

TikTok: A Good Point about Data Collection

April 21, 2021

I wish I could recall which addled Silicon Valley podcaster explained that TikTok was not a problem. I would urge this individual to read the British newspaper article “Case Launched Against TikTok over Collection of Children’s Data.” The essay explains:

Despite a minimum age requirement of 13, Ofcom found last year that 42% of UK eight to 12-year-olds used TikTok. As with other social media companies such as Facebook, there have long been concerns about data collection and the UK’s Information Commissioner’s Office is investigating TikTok’s handling of children’s personal information. Longfield said: “We’re not trying to say that it’s not fun. Families like it. It’s been something that’s been really important over lockdown, it’s helped people keep in touch, they’ve had lots of enjoyment. But my view is that the price to pay for that shouldn’t be there – for their personal information to be illegally collected en masse, and passed on to others, most probably for financial gain, without them even knowing about it. And the excessive nature of that collection is something which drove us to [challenge] TikTok rather than others.”

The cloud of unknowing swirling around individuals who insist that data collection from children is no big deal is large and possibly impenetrable.

TikTok says it is an outfit staying within the bright white lines. Nevertheless, according to the write up:

In February last year, ByteDance, the Chinese company legally domiciled in the Cayman Islands that owns TikTok, was fined a record £4.2m ($5.7m) in the US for illegally collecting personal information from children under 13.

Add to the actions which triggered the fine the fact that TikTok is an outfit associated with China. The data from TikTok might add some useful insights about user predilections if those data flow into a Chinese aggregation system.

To the cheerleaders for TikTok, I would suggest a rethink of your position. However, it is possible that funding for some cheerleading squads may be coming from interesting sources and carry along some other agendas. Bad actors can operate within a regulation lax environment. That’s a reality.

Stephen E Arnold, April 21, 2021

Google: Cookies Not Enough! More More More!

April 6, 2021

Cookies are a necessary Internet evil. They are annoying, but they power Internet commerce at the expense of user privacy. As users demand more privacy, tech giants are already designing technology and the Internet for a post-cookie world. Google, says OneZero via Medium, wants to control everything a user does on the Internet: “Google’s ‘Privacy-First Web’ Is Really A Google-First Web.”

Google promised that third-party cookies would disappear by 2022. The company also promises not to support ad technology that tracks user information across the Web. Google is not doing this to be kind. Instead, Google wants to become a better contender in private Internet browsing. Apple and Mozilla, companies that do not rely on targeted advertising revenue, already protect users from cookies with their Internet browsers.

Google’s business strategy is to use its status as the world’s most popular search engine and provider of many free Internet services to its advantage. That means Google has access to loads of first-party data, aka the stuff that advertisers want in order to create targeted ads.

Google is also working on alternate tracking frameworks, but some tech experts see it as a bad idea. These alternate tracking frameworks would delete the old cookie problems and replace them with a brand new set of problems.

It appears cookies will become obsolete by the middle of the 2020s, but how does that translate into money and user privacy?

“Merits aside, it’s clear that Google is positioning itself for a more privacy-conscious future in ways that seek to preserve its dominance — likely at the expense of a slew of smaller rivals. There is a whole value chain built around third-party cookies and individual user tracking, and a lot of that value is likely to go poof…. The big picture here is that a handful of giants — in this case, Apple and Google — are powerful enough to essentially dictate the terms of the modern internet to everyone else. That they’re now moving toward models that are (arguably) better for consumer privacy is welcome. The problem is that they’re also quite obviously remolding the playing field in their own interests.”

Users will effectively have better privacy protections, but their information will be in the hands of a few powerful companies. Is that good? Is that bad? History shows it is better for there to be competition to ensure stability in a mixed capitalist economy.

Whitney Grace, April 6, 2021

Who Spends $69 Million on a Digital String? Pals Do.

April 1, 2021

The buyer of Beeple’s digital art is Metakovan. One suggestion is a person allegedly named Vignesh Sundaresan. NBC, the real news outfit, was not convinced and reported: “Metakovan’s real identity is not known.”

Sure, but don’t tell The Straits Times, which reported in the story “I Don’t Have a Car or House” that the savvy buyer of a digital string is allegedly Vignesh Sundaresan, an entrepreneur, a technopreneur in fact. Plus, I love the quote attributed to the digital Warren Buffett type:

I don’t have a car or house.

Makes sense. Singapore has apartments, lots of apartments. A rental in Marina Bay makes it easy to get around. No encumbrances to haul around like some Roman statues from a covert dig near Naples (Italy, not lovely Florida). A Grab ride is good enough when physical movement is required.

Yep, a digital Warren Buffett.

Stephen E Arnold, April 1, 2021

Why Use an Open Source Database? Brilliant Inadvertent Explanation

February 15, 2021

I thought, “Why bother to read ‘Everything You Should Know about the Oracle Database.’” I am delighted that I did. I read the article in The Tech Block twice! The information attempts to explain some of Oracle’s licensing guidelines. The author does a workmanlike job of explaining number of users; for example:

If you create an account for five hundred individuals, and only fifty individuals use it, you still need about five hundred licenses. This means that you’ve got to pay utmost attention to who is accessing the software. In addition, you may require a separate license not only for people but also for devices that directly or indirectly access the database. It’s also essential that you constantly check who needs access and who doesn’t. This will help you not only reduce your risk of exposure but also save you money. Being found contravening Oracle licensing agreements can be very costly. In some extreme cases, organizations have been fined millions of dollars.

The point is Oracle charges for people who don’t use the database. On one hand, this makes sense. Oracle has to do “work” to configure a database to handle users. (Remember the good old days of having to allocate more memory to a table. Ho ho ho. Wait. The good old days are today’s days.)

The write up contains eight more missteps over which an Oracle customer can trip and break the bean counter’s financial ankles.

Net net: The explanation makes it quite clear why some organizations use open source databases. Perhaps the author did not intend to anti-market Oracle’s database? From my point of view, that is exactly what the information in “Everything You Should Know…” delivers.

Stephen E Arnold, February 16, 2021
