A Reminder about Deleting Data

July 15, 2019

If you believe data are deleted, you may want to take a deep breath and read “Good Luck Deleting Someone’s Private Info from a Trained Neural Network – It’s Likely to Bork the Whole Thing. Researchers Show Limited Success in Getting Rid of Data.”

With a title like this, there’s not much left to say. We did note this one cautious quote:

Zou [a whiz researcher] said it isn’t entirely impossible, however. “We don’t have tools just yet but we are hoping to develop these deletion tools in the next few months.”

Will there be such tools? I have been stumbling along with databases since the 1960s, and deletes which delete are still not available.

Just a reminder that what one believes is not what happens within data management systems.

Stephen E Arnold, July 15, 2019

When Is a Deletion a Real Deletion?

June 29, 2019

Years ago we created the Point (Top 5% of the Internet). You oldsters may remember our badge, which was, for a short period of Internet time, a thing.

[Point logo]

When we started work on the service in either 1992 or 1993, one of the people working with the team built the demo on the Paradox database. Hey, who knew that traffic would explode and that advertisers would contact us to put their messages on the site?

The Paradox database was not designed to deal with the demands we put upon it. One of its charming characteristics was that when we deleted something, the space was not reclaimed. Paradox — like many, many other databases — just removed the index pointer. The “space” and hence some charming idiosyncrasies remained.

Flash forward decades. A deletion may not be a deletion. Different “databases” handle deletions in different ways. Plus, anyone with experience working with long forgotten systems, from the Information Dimensions’ product to the total weirdness of a CICS setup, knows that paranoid people back up, and back up as often as possible. Why? Fool with an AS/400 database at the wrong time doing something trivial and poof: everything is gone. More modern databases? Consider this passage from the Last Pickle:

The process of deletion becomes more interesting when we consider that Cassandra stores its data in immutable files on disk. In such a system, to record the fact that a delete happened, a special value called a “tombstone” needs to be written as an indicator that previous values are to be considered deleted.
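To make the tombstone idea concrete, here is a minimal, hypothetical sketch of deletion in an append-only store. It is toy code, not Cassandra’s actual implementation; the point is that the “delete” is just one more write, and the original value stays in the log until some compaction-style cleanup rewrites it.

```python
import time

class AppendOnlyStore:
    """Toy append-only store: writes are never overwritten, only appended."""

    TOMBSTONE = "__TOMBSTONE__"

    def __init__(self):
        self._log = []  # every write, including deletes, is appended here

    def put(self, key, value):
        self._log.append((time.time(), key, value))

    def delete(self, key):
        # A delete is just another write: a special tombstone marker.
        self._log.append((time.time(), key, self.TOMBSTONE))

    def get(self, key):
        # The latest entry for a key wins; a tombstone means "treat as deleted."
        latest = None
        for _ts, k, v in self._log:
            if k == key:
                latest = v
        return None if latest == self.TOMBSTONE else latest

store = AppendOnlyStore()
store.put("user:42", {"name": "Alice"})
store.delete("user:42")
print(store.get("user:42"))  # None: readers see the record as deleted
print(store._log)            # ...but the original value is still sitting in the log
```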

When one digs around in database files, it is possible to come across these deleted data. People are amazed when a “deleted” Windows file can be recovered. Yep, the word “deletion” rarely explains exactly what has been removed or the conditions under which the data can be undeleted. Deletion lets one assume the data are gone when, in fact, they may have been safely archived, converted to tokens, or munged into a dossier.

Put these two things together and what do you get? A minimum of two places to look for deleted data. Look in the database files themselves, and look in backups.

In short, deleted data may not be deleted.


How does one know if data are “there”? Easy. Grunt work.

Why is this journey to the world of Paradox relevant?

Navigate to “Google Now Lets Users Auto-Delete Their Location and Web History.” Note this passage:

Specifically, Google account holders will be able to choose a time limit of either 3 or 18 months, after which, their location, web, and app history will automatically be deleted.

Some questions:

  • Who verifies that the content has been removed from indexes and data files?
  • Who verifies that the data have been expunged from metadata linked to the user?
  • What does deletion mean as the word is used by Google?
  • From what has something been deleted?

Like hotel temperature controls, fiddling with the knobs may change nothing.

Stephen E Arnold, June 29, 2019

Data Science Book: Free for Now

May 24, 2019

We spotted a post by Capri Granville which points to a free data science book. The post also provides a link to other free books. The Microsoft Research India book is “Foundations of Data Science” by Ravi Kannan and his co-authors Avrim Blum and John Hopcroft. You can, as of May 24, 2019, download the book without charge at this link: https://www.cs.cornell.edu/jeh/book.pdf. Cornell charges students about $55,188 for an academic year. DarkCyber believes that “free” may not be an operative word at an institution whose Theory Center used to love those big IBM computers. No, they were not painted Azure.

Stephen E Arnold, May 24, 2019

IBM Hyperledger: More Than a Blockchain or Less?

May 17, 2019

Though the IBM-backed open-source project Hyperledger has been prominent on the blockchain scene since 2016, The Next Web declares, “IBM’s Hyperledger Isn’t a Real Blockchain—Here’s Why.” Kadena president and article author Stuart Popejoy tells us:

“A blockchain is a decentralized and distributed database, an immutable ledger of events or transactions where truth is determined by a consensus mechanism — such as participants voting to agree on what gets written — so that no central authority arbitrates what is true. IBM’s definition of blockchain captures the distributed and immutable elements of blockchain but conveniently leaves out decentralized consensus — that’s because IBM Hyperledger Fabric doesn’t require a true consensus mechanism at all.”

We noted this statement as well:

“Instead, it suggests using an ‘ordering service’ called Kafka, but without enforced, democratized, cryptographically-secure voting between participants, you can’t really prove whether an agent tampers with the ledger. In effect, IBM’s ‘blockchain’ is nothing more than a glorified time-stamped list of entries. IBM’s architecture exposes numerous potential vulnerabilities that require a very small amount of malicious coordination. For instance, IBM introduces public-key cryptography ‘inside the network’ with validator signatures, which fundamentally invalidates the proven security model of Bitcoin and other real blockchains, where the network can never intermediate a user’s externally-provided public key signature.”
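For readers who wonder what a “glorified time-stamped list of entries” amounts to, here is a minimal, hypothetical sketch (not Fabric’s actual code). Hash-linking entries makes after-the-fact tampering detectable by anyone holding a copy, but without decentralized consensus a single ordering service still decides what gets appended in the first place.

```python
import hashlib
import json
import time

def append_entry(chain, payload):
    """Append a time-stamped entry whose hash covers the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    entry = {"timestamp": time.time(), "payload": payload, "prev_hash": prev_hash}
    entry["hash"] = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    chain.append(entry)
    return entry

ledger = []
append_entry(ledger, {"tx": "A pays B 10"})
append_entry(ledger, {"tx": "B pays C 5"})

# Anyone holding a copy can re-hash the chain to detect later edits,
# but whoever runs the "orderer" still controls which entries get written at all.
```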

Then there are IBM’s architecture, its security flaws, and its approach to smart contracts to consider, as well as misleading performance numbers. See the article for details on each of those criticisms. Popejoy concludes with the prediction that better blockchains are bound to be developed, alongside a more positive approach to technology in general across society.

Cynthia Murrell, May 17, 2019

Machine Learning and Data Quality

April 23, 2019

We’re updating our data quality files as part of the run up to my lecture at the TechnoSecurity & Digital Forensics Conference. A write up by Sanau.co, “Dear AI Startups: Your ML Models Are Dying Quietly,” is worth reading if you are thinking about how to improve the accuracy of the outputs of some machine learning systems. The slow deterioration of certain Bayesian methods is a subject I have addressed for years. The Sanau write up called to my attention another source of data deterioration or data rot: seemingly logical changes made to field names and the insidious downstream consequences of those changes. The article provides useful explanations and a concrete example drawn from ecommerce, but the lesson has much broader application. Worth reading.
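The failure mode is easy to illustrate. In this hypothetical sketch (the field names and pipeline are invented, not drawn from the Sanau article), an upstream team renames a field; the feature code defaults missing values to zero, so nothing crashes and the model simply gets quietly worse.

```python
def extract_features(event: dict) -> list[float]:
    # Missing fields silently default to 0.0, so a rename never raises an error.
    return [
        float(event.get("purchase_amount", 0.0)),
        float(event.get("num_items", 0.0)),
    ]

old_event = {"purchase_amount": 120.0, "num_items": 3}
new_event = {"order_total": 120.0, "num_items": 3}  # same data, renamed field

print(extract_features(old_event))  # [120.0, 3.0]
print(extract_features(new_event))  # [0.0, 3.0] -- the signal vanishes without a traceback
```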

Stephen E Arnold, April 23, 2019

The Surf Is Up for the Word Dark

April 4, 2019

Just a short note. I read this puffy wuffy write up about a new market research report. Its title?

The Research Report “Dark Analytics Market: Global Industry Analysis 2013–2017 and Opportunity Assessment 2018–2028” provides information on pricing, market analysis, shares, forecast, and company profiles for key industry participants

What caught my attention is not the authors’ attempt to generate some dough via open source data collection and a touch of Excel fever.

Here’s what caught my attention:

Dark analytics is the analysis of dark data present in the enterprises. Dark data is generally is referred as raw data or information buried in text, tables, figures that organizations acquire in various business operations and store it but, is unused to derive insights and for decision making in business. Organizations nowadays are realizing that there is a huge risk associated with losing competitive edge in business and regulatory issues that comes with not analyzing and processing this data. Hence, dark analytics is a practice followed in enterprises that advances in analyzing computer network operations and pattern recognition.

Yes, buried data treasure. Now the cost of locating, accessing, validating, and normalizing these time encrusted nuggets?

Answer: A lot. A whole lot. That’s part of the reason old data are not particularly popular in some organizations. The idea of using a consulting firm or software from SAP is not particularly thrilling to my DarkCyber team. (Our use of “dark” is different too.)

Stephen E Arnold, April 4, 2019

Content Management: Now a Playground for Smart Software?

March 28, 2019

CMS or content management systems are a hoot. Sometimes they work; sometimes they don’t. How does one keep these expensive, cranky databases chugging along in the zip zip world of inexpensive content utilities?

Smart software and predictive analytics?

Managing a website is not what it used to be, and one of the biggest changes to content management systems is the use of predictive analytics. The Smart Data Collective discusses “The Fascinating Role of Predictive Analytics in CMS Today.” Reporter Ryan Kh writes:

“Predictive analytics is changing digital marketing and website management. In previous posts, we have discussed the benefits of using predictive analytics to identify the types of customers that are most likely to convert and increase the value of your lead generation strategy. However, there are also a lot of reasons that you can use predictive analytics in other ways. Improving the quality of your website is one of them. One of the main benefits of predictive analytics in 2019 is in improving the performance of content management systems. There are a number of different types of content management systems on the market, including WordPress, Joomla, Drupal, and Shopify. There are actually hundreds of content management systems on the market, but these are some of the most noteworthy. One of the reasons that they are standing out so well against their competitors is that they use big data solutions to get the most value for their customers.”

The author notes two areas in which predictive analytics are helping companies’ bottom lines: fraud detection and, of course, marketing optimization, the latter through capabilities like more effective lead generation and content validation.
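What the lead-generation piece amounts to in practice is a scoring model. Here is a minimal, hypothetical sketch; the features and weights are invented for illustration and would normally come from a model trained on a site’s own conversion data.

```python
def lead_score(visitor: dict) -> float:
    # Invented weights standing in for a trained conversion model.
    weights = {
        "pages_viewed": 0.05,
        "returned_within_week": 0.40,
        "downloaded_whitepaper": 0.35,
    }
    return sum(w * float(visitor.get(feature, 0)) for feature, w in weights.items())

visitors = [
    {"id": "a", "pages_viewed": 12, "returned_within_week": 1, "downloaded_whitepaper": 0},
    {"id": "b", "pages_viewed": 3, "returned_within_week": 0, "downloaded_whitepaper": 1},
]

# Rank visitors so the marketing workflow chases the likeliest converters first.
for v in sorted(visitors, key=lead_score, reverse=True):
    print(v["id"], round(lead_score(v), 2))
```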

Yep, CMS with AI. The future with spin.

Cynthia Murrell, March 28, 2019

Federating Data: Easy, Hard, or Poorly Understood Until One Tries It at Scale?

March 8, 2019

I read two articles this morning.

One article explained that there’s a new way to deal with data federation. Always optimistic, I took a look at “Data-Driven Decision-Making Made Possible using a Modern Data Stack.” The revolution is to load data and then aggregate. The old way is to transform, aggregate, and model. Here’s a diagram from DAS42. A larger version is available at this link.

Hard to read. Yep, New Millennial colors. Is this a breakthrough?

I don’t know.
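For what it is worth, the contrast between the two flows can be sketched in a few lines. This is a hypothetical toy with invented field names, not the DAS42 stack:

```python
raw_rows = [
    {"user": "a", "amount": "10.50", "country": "us"},
    {"user": "b", "amount": "3.25", "country": "uk"},
]

# Old way (ETL): transform and model before loading; only the shaped result is stored.
def etl_load(rows):
    transformed = [{"user": r["user"], "amount": float(r["amount"])} for r in rows]
    return {r["user"]: r["amount"] for r in transformed}

# New way (ELT): load the raw rows untouched, then aggregate in the warehouse on demand.
def elt_load(rows):
    return list(rows)  # stand-in for a raw "landing" table

def elt_aggregate(landing_table):
    return {r["user"]: float(r["amount"]) for r in landing_table}

print(etl_load(raw_rows))
print(elt_aggregate(elt_load(raw_rows)))  # same numbers, but the raw data is still around
```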

When I read “2 Reasons a Federated Database Isn’t Such a Slam-Dunk,” it seemed that the solution outlined by DAS42 and the InfoWorld expert’s view are not in sync.

There are two reasons. Count ‘em.

One: performance

Two: security.

Yeah, okay.

Some may suggest that there are a handful of other challenges. These range from deciding how to index audio, video, and images, to figuring out what to do with different languages in the content, to determining what data are “good” for the task at hand and what data are less “useful.” Date, time, and geocode metadata are needed, but that introduces the not-so-easy-to-solve indexing problem.

So where are we with the “federation thing”?

Exactly the same place we were years ago… start-ups and experts notwithstanding. But then one has to wrangle a lot of data. That’s cost, gentle reader. Big money.

Stephen E Arnold, March 8, 2019

Fragmented Data: Still a Problem?

January 28, 2019

Digital transformation is a major shift for organizations. The shift includes new technology and better ways to serve clients, but it also brings massive amounts of data. Every organization with a successful digital implementation relies on data. Too much data, however, can hinder an organization’s performance. IT Pro Portal explains why something called mass data fragmentation is a major issue in the article, “What Is Mass Data Fragmentation, And What Are IT Leaders So Worried About It?”

The biggest question is: what exactly is mass data fragmentation? I learned:

“We believe one of the major culprits is a phenomenon called mass data fragmentation. This is essentially just a technical way of saying, ’data that is siloed, scattered and copied all over the place’ leading to an incomplete view of the data and an inability to extract real value from it. Most of the data in question is what’s called secondary data: data sets used for backups, archives, object stores, file shares, test and development, and analytics. Secondary data makes up the vast majority of an organization’s data (approximately 80 per cent).”

The article compares secondary data to an iceberg: most of it is hidden beneath the surface. The poor visibility leads to compliance and vulnerability risks; in other words, security issues that put the entire organization at risk. Most organizations, however, view their secondary data as a storage bill, a compliance risk (at least that is good), and a giant headache.

When organizations were surveyed about the amount of secondary data they hold, it turned out that they had multiple copies of the same data spread across cloud and on-premise locations. IT teams are expected to manage that secondary data in all of those locations, but without the right tools and technology the task is unending, unmanageable, and the root of more problems.
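Finding those scattered copies is, at bottom, a grunt-work inventory job. Here is a minimal, hypothetical sketch that flags duplicate files by hashing their contents; the paths are placeholders, and a real effort would also have to reach into cloud object stores, backups, and archives.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(roots):
    """Group files that have identical contents, regardless of where they live."""
    by_hash = defaultdict(list)
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                by_hash[digest].append(str(path))
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

# Placeholder locations; point these at real mount points to get an inventory.
# print(find_duplicates(["/mnt/backups", "/mnt/file_share", "/mnt/test_data"]))
```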

If organizations managed their mass data fragmentation efficiently, they would improve their bottom line, reduce costs, and reduce security risks. When there are more access points to sensitive data and those points are not secured, the risk of hacking and of information being stolen goes up.

Whitney Grace, January 28, 2019

Relatives Got You Down? Check Out BigQuery and Redshift

December 25, 2018

I read “Redshift Vs BigQuery: What Are The Factors To Consider Before Choosing A Data Warehouse.” With Oracle on the ropes and database technology chugging along, why pay attention to old school solutions?

The article sets out to compare and contrast BigQuery (one of the Google progeny known to have consorted with a certain Mr. Dremel) and Amazon Redshift. Amazon has more database products and services than I can keep track of, but Redshift is one of them, and it is important if an intelware company uses AWS and the Redshift technology.

Which system is more “flexible”? I learned:

In the case of Redshift, if anything goes kaput during a transaction, Amazon Redshift allows users to perform roll-back to ensure that data get backs to the consistent state. BigQuery works on the principle of append-only data and its storage engine strictly follows this technique. This becomes a major disadvantage to the user when something goes wrong during the transaction process, forcing them to restart from the beginning or specific point. Another key point is that duplicating data in BigQuery is hard to achieve and costly. Both the technologies have reservations regarding insertion of streaming data, with Redshift taking edge by guaranteeing storage of data with additional care from the user. On the other hand, BigQuery supports de-duplication of streaming data in the most effective way by using time window.

The write up points out:

As compared to BigQuery, Redshift is considerably more expensive costing $0.08 per GB, compared to BigQuery which costs $0.02 per GB. However, BigQuery offers only storage and not queries. The platform charges separately for queries based upon processed data at $5/TB. As BigQuery lacks indexes and various analytical queries, the scanning of data is a huge and costly process. In most cases, users opt for Amazon Redshift as it is predictable, simple and encourages data usage and analytics.
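Using the figures quoted above, a back-of-the-envelope comparison is easy to run. The workload in this sketch (1 TB stored, 5 TB scanned by queries per month) is hypothetical; the point is that BigQuery’s cheaper storage can be offset by its per-query charges.

```python
STORED_GB = 1_000            # hypothetical: 1 TB sitting in the warehouse
SCANNED_TB_PER_MONTH = 5     # hypothetical: 5 TB of data processed by queries each month

redshift_monthly = 0.08 * STORED_GB                 # $0.08 per GB, queries included
bigquery_storage = 0.02 * STORED_GB                 # $0.02 per GB, storage only
bigquery_queries = 5.00 * SCANNED_TB_PER_MONTH      # $5 per TB of data processed

print(f"Redshift:  ${redshift_monthly:.2f} per month")
print(f"BigQuery:  ${bigquery_storage + bigquery_queries:.2f} per month "
      f"(${bigquery_storage:.2f} storage + ${bigquery_queries:.2f} queries)")
```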

Which is “better”? Not surprisingly, both are really swell. Helpful. But the Beyond Search goose was curious about:

  • Performance
  • Latency for different types of queries
  • Programming requirements

But swell is fine.

Stephen E Arnold, December 25, 2018
