The Myth of Data Federation: Not a New Problem, Not One Easily Solved

July 8, 2020

I read “A Plan to Make Police Data Open Source Started on Reddit.” The main point of this particular article is:

The Police Data Accessibility Project aims to request, download, clean, and standardize public records that right now are overly difficult to find.

Interesting, but I interpreted the Silicon Valley-centric write-up differently. If you are a marketer of systems which purport to normalize disparate types of data, aggregate them, federate indexes, and make the data accessible, analyzable, retrievable, and bang-on dead simple — stop reading now. I don’t want to deal with squeals from vendors about their superior systems.

For the individual reading this sentence, a word of advice. Fasten your seat belt.

Some points to consider when reading the article cited above, listening to a Vimeo “insider” sales pitch, or just talking techno babble with your Spin-class pals:

  1. Dealing with disparate data requires time and money as well as NOT ONE but multiple software tools.
  2. Even with a well-resourced and technologically adept staff, exceptions require attention. A failure to deal with the stuff in the Exceptions folder can skew the outputs of some Fancy Dan analytic systems. Example: How about that Detroit facial recognition system? Nifty, eh?
  3. The flows of real-time data are a big problem — are you ready for this? — a challenge even to the Facebooks, Googles, and Microsofts of the world. The reason is that the volume of data and CHANGES TO ALREADY PROCESSED ITEMS OF INFORMATION pose a very, very tough problem. No, faster processors, bigger pipes, and zippy SSDs won’t do the job. The trouble lies within: the intra-device and intra-software-module flow. The fix is to sample, and sampling increases the risk of inaccuracies. Example: Remember Detroit’s facial recognition accuracy? The arrested individual may share some impressions with you.
  4. The baloney about “all” data or “any” type is crazy talk. When one deals with more than 18,000 police forces in the US, outputs from surveillance devices from different vendors, and the geodumps of individuals and their ad-tracking beacons, the notion that all of this can be mashed up and made usable is a noble idea. There are many noble ideas.
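To see why points one and two are not hand waving, consider a minimal sketch of what “normalize and standardize” means in practice. Everything here is hypothetical — the agency names, field names, and formats are illustrative, not taken from the article or from any real records system. Even in this toy version, every source needs its own mapping code, and anything the mapping cannot handle lands in an exceptions pile that a human must review, or the downstream analytics get skewed.

```python
# Hypothetical sketch: normalizing two agencies' export formats into one
# canonical record, routing failures to an exceptions pile for human review.
from datetime import datetime


def normalize(record: dict, source: str) -> dict:
    """Map one agency's export format to a canonical record.

    Each source needs its own hand-written mapping; there is no magic
    "any format" converter.
    """
    if source == "agency_a":  # e.g., a CSV export with MM/DD/YYYY dates
        return {
            "agency": record["dept"],
            "incident_date": datetime.strptime(record["date"], "%m/%d/%Y").date(),
            "record_type": record["type"].strip().lower(),
        }
    if source == "agency_b":  # e.g., a JSON export with ISO 8601 dates
        return {
            "agency": record["agency_name"],
            "incident_date": datetime.fromisoformat(record["occurred"]).date(),
            "record_type": record["category"].strip().lower(),
        }
    raise ValueError(f"no mapping for source {source!r}")


def process(batch: list, source: str):
    """Return (normalized, exceptions).

    Records that do not fit the mapping go to the exceptions pile.
    Ignoring that pile is how outputs get skewed.
    """
    normalized, exceptions = [], []
    for rec in batch:
        try:
            normalized.append(normalize(rec, source))
        except (KeyError, ValueError) as err:
            exceptions.append((rec, str(err)))
    return normalized, exceptions
```

Feed this a batch where one record uses the wrong date format and it quietly splits into one clean record and one exception. Now multiply by 18,000 police forces, each with its own formats, typos, and revisions to already-published records.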

Why am I taking the time to repeat what anyone with experience in large scale data normalization and analysis knows?

Baloney can be thinly sliced, smeared with gochujang, and served on Delft plates. Know what? Still baloney.

Gobble this:

Still, data is an important piece of understanding what law enforcement looks like in the US now, and what it could look like in the future. And making that information more accessible, and the stories people tell about policing more transparent, is a first step.

But the killer assumption is that the humans involved make no errors, that systems remain online, and that file formats are forever.

That’s baloney. It really is incredible. Just not in the way you think.

Stephen E Arnold, July 8, 2020
