Anonymized Location Data: an Oxymoron?
May 13, 2020
Location data. To many the term sounds innocuous, boring really. Perhaps that is why society has allowed apps to collect and sell it with no significant regulation. An engaging (and well-illustrated) piece from Norway’s NRK News, “Revealed by Mobile,” shares the minute details journalists were able to put together about one citizen from location data purchased on the open market. Graciously, this man allowed the findings to be published as a cautionary tale. We suggest you read the article for yourself to absorb the chilling reality. (The link we share above runs through Google Translate.)
Vendors of location data would have us believe the information is completely anonymized and cannot be tied to the individuals who generated it. It is only good for general uses like statistics and regional marketing, they assert. Intending to put that claim to the test, NRK purchased a batch of Norwegian location data from the British firm Tamoco. Their investigation shows anonymization is an empty promise. Though the data is stripped of directly identifying information, buyers are a few Internet searches away from correlating location patterns with individuals. Journalists Trude Furuly, Henrik Lied, and Martin Gundersen tell us:
“All modern mobile phones have a GPS receiver, which with the help of satellite can track the exact position of the phone with only a few meters distance. The position data NRK acquired consisted of a table with four hundred million map coordinates from mobiles in Norway. …
“All the coordinates were linked to a date, time, and specific mobile. Thus, the coordinates showed exactly where a mobile or tablet had been at a particular time. NRK coordinated the mobile positions with a map of Norway. Each position was marked on the map as an orange dot. If a mobile was in a location repeatedly and for a long time, the points formed larger clusters. Would it be possible for us to find the identity of a mobile owner by seeing where the phone had been, in combination with some simple web searches? We selected a random mobile from the dataset.
“NRK searched the address where the mobile had left many points about the nights. The search revealed that a man and a woman lived in the house. Then we searched their Facebook profiles. There were several pictures of the two smiling together. It seemed like they were boyfriend and girlfriend. The man’s Facebook profile stated that he worked in a logistics company. When we searched the company in question, we discovered that it was in the same place as the person used to drive in the morning. Thus, we had managed to trace the person who owned the cell phone, even though the data according to Tamoco should have been anonymized.”
The journalists went on to put together a detailed record of that man’s movements over several months. It turns out they knew more about his trip to the zoo, for example, than he recalled himself. When they revealed their findings to their subject, he was shocked and immediately began deleting non-essential apps from his phone. Read the article; you may find yourself doing the same.
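The first step the NRK journalists describe, finding where a phone sits night after night, takes very little code. Here is a minimal sketch, assuming a table of (device, timestamp, latitude, longitude) rows. The sample rows, the `likely_home_cell` name, the 22:00 to 06:00 window, and the three-decimal grid (roughly 100 meters north to south) are all illustrative assumptions, not details from the NRK piece or the Tamoco data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, synthetic rows: (device_id, ISO timestamp, latitude, longitude).
pings = [
    ("device-a", "2020-01-06T02:10:00", 59.91273, 10.74609),
    ("device-a", "2020-01-06T08:45:00", 59.94401, 10.71802),
    ("device-a", "2020-01-07T01:30:00", 59.91268, 10.74615),
    ("device-a", "2020-01-07T23:50:00", 59.91275, 10.74611),
    ("device-a", "2020-01-08T12:15:00", 59.94398, 10.71799),
]


def likely_home_cell(rows, device_id, night_start=22, night_end=6, decimals=3):
    """Return ((lat, lon), count) for the busiest night-time grid cell, or None."""
    cells = Counter()
    for dev, stamp, lat, lon in rows:
        if dev != device_id:
            continue
        hour = datetime.fromisoformat(stamp).hour
        # Keep only pings recorded late at night or early in the morning.
        if hour >= night_start or hour < night_end:
            # Rounding snaps nearby points into the same coarse grid cell.
            cells[(round(lat, decimals), round(lon, decimals))] += 1
    if not cells:
        return None
    return cells.most_common(1)[0]


print(likely_home_cell(pings, "device-a"))  # ((59.913, 10.746), 3)
```

One grid cell plus a public address lookup is the whole trick, which is why stripping names from the table does so little.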
Cynthia Murrell, May 12, 2020
Google Apple Contact Tracing Interface
May 9, 2020
Now Toronto published “Here’s What Apple and Google’s COVID-19 Contact Tracing App Looks Like.” The article includes sample screenshots and some explanation about the data displayed. Worth a look. Much more is possible in terms of tracking, contact mapping, and analytics, of course. Who or what will have access to these more useful views of the collected data?
Stephen E Arnold, May 9, 2020
Sigma Gets $30 Million In Key Funding
April 30, 2020
Once the economic ramifications of the COVID-19 pandemic are underway and you are adjusting your investment portfolio, data analytics company stocks should not lose any value. Why? Data analytics platforms are in high demand, and Sigma Computing recently nabbed fresh funding. “Sigma Computing Raises $30 Million More For Cloud Data Analytics Tools,” says Venture Beat.
Sigma Computing held a Series B round of funding and added another $30 million to its funding. Investors in the second funding round include Sutter Hill Ventures and Altimeter Capital. CEO for Sigma Computing Rob Woollen said the money would be used for product development and product support.
Woollen stated that data is useless unless it is made comprehensible and capable of delivering actionable BI insights. Sigma makes data usable while keeping in mind the importance of governance, security, and compliance. Sigma uses a spreadsheet-like UI that transforms data from any source into useful insights, plus the search tool is powerful:
“Searches can be performed by natural language and by filter, the results of which can be compiled in an embeddable report and delivered via email. Where collaboration is concerned, Sigma’s link feature enables users to map data relationships and add linked data to documents. The platform’s workspaces are conducive to sharing — they can be circulated among teams, departments, or entire organizations — and spotlight important data blocks, worksheets, and interfaces with visual badges and a range of visualizations.”
Sigma Computing counts Zumper, Navis, LendUp, Clover, Volta, and Olivela among its clients. It sells software for data visualization and big data/business analytics; combined, the two markets are worth over $11 million. It sounds like a good investment.
Whitney Grace, April 30, 2020
GeoSpark Analytics: Real Time Analytics
April 6, 2020
In late 2017, OGSystems chopped out some of the firm’s analytics capabilities. The new company was Geospark Analytics. The service provided enabled customers like the US Department of Defense and FEMA to obtain information about important new events. “Events” is jargon for an alert plus data about something that is important.
“FEMA Contractor Tracing Coronavirus Deaths Uses Web Scraping, Social Media Monitoring” explains one use of the system. The write up says:
Geospark Analytics combines machine learning and big data to analyze events in real-time and warn of potential disruptions to the businesses of high-dollar private and public clientele…
Like Bluedot in Canada, Geospark was one of the monitoring companies analyzing open source and some specialized data to find interesting events. The write up continues:
Geospark Analytics’ product, called Hyperion, the namesake of the Titan son of Uranus (meaning, “watcher from above”), fingered Wuhan as a “hotspot,” in the company’s parlance, within hours after news of the virus first broke. “Hotspots tracks normal patterns of activity across the globe and provides a visual cue to flag disruptive events that could impact your employees, operations, and investments and result in billions of dollars in economic losses,” the company’s website says.
Engadget points out that there are a couple of companies with the name “Geospark.” DarkCyber finds this interesting. This statement provides more color about the Geospark approach:
Geospark Analytics claims to have processed “6.8 million” sources of information; everything from tweets to economic reports. “We geo-position it, we use natural language processing, and we have deep learning models that categorize the data into event and health models,” Goolgasian [Geospark’s CEO] said. It’s through these many millions of data points that the company creates what it calls a “baseline level of activity” for specific regions, such as Wuhan. A spike of activity around any number of security-, military-, or health-related topics and the system flags it as a potential disruption.
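The “baseline level of activity” idea in that passage can be illustrated with a rolling average and a standard deviation test. This is a minimal sketch under stated assumptions, not Geospark’s method: the daily counts are invented, and the `flag_spikes` name, the seven-day window, and the three-sigma threshold are hypothetical choices.

```python
from statistics import mean, stdev


def flag_spikes(daily_counts, baseline_days=7, threshold=3.0):
    """Return indexes of days that sit far above the preceding baseline window."""
    flagged = []
    for i in range(baseline_days, len(daily_counts)):
        window = daily_counts[i - baseline_days:i]
        mu = mean(window)
        # Assumption: floor the spread at 1.0 so a flat baseline cannot divide by zero.
        sigma = stdev(window) or 1.0
        z_score = (daily_counts[i] - mu) / sigma
        if z_score > threshold:
            flagged.append(i)
    return flagged


# Hypothetical counts of health-related items mentioning one region per day.
counts = [12, 9, 11, 10, 13, 10, 12, 11, 48, 15]
print(flag_spikes(counts))  # [8]
```

Note that the day after the spike is not flagged: the spike itself inflates the next baseline window, one reason production systems tend to use more robust statistics than this.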
How does Geospark avoid the social media noise, bias, and disinformation that finds its way into open source content? The article states:
“We rely more on traditional data sources and we don’t do anything that isn’t publicly available,” Goolgasian said, echoing a common refrain among data firms that fuel surveillance products by mining the internet itself.
Providing specialized services to government agencies is not much of a surprise in DarkCyber’s opinion. Financial firms can also be avid consumers of real-time data. The idea is to get the jump on the competition which probably has its own source of digital insights.
Other observations:
- The apparent “surprise” threading through the Engadget article is a bit off-putting. DarkCyber is aware of a number of social media and specialized content monitoring services. In fact, there is a surplus of these operations and not all will survive in the present business climate.
- Detecting and alerting are helpful but the messengers failed to achieve impact. How does DarkCyber know? Well, there is the lockdown.
- Publicizing what companies like Geospark and others do to generate income can have interesting consequences.
Net net: Some types of specialized services are difficult to explain in a way that reduces blowback. Some of the blowback has significant impact on social media analytics companies. The Geofeedia case is a reminder. I know. I know. “What’s a Geofeedia?” some may ask.
Good question and DarkCyber thinks few know the answer. Plucking insights from information many people believe to be privileged can be fraught with business shock waves.
Stephen E Arnold, April 6, 2020
Forget Weak Priors, Certain Predictive Methods Just Fail
April 2, 2020
Nope. No equations. No stats speak. Tested predictive models were incorrect.
Navigate to “Researchers Find AI Is Bad at Predicting GPA, Grit, Eviction, Job Training, Layoffs, and Material Hardship.” Here’s the finding, which is delightfully clear:
A paper coauthored by over 112 researchers across 160 data and social science teams found that AI and statistical models, when used to predict six life outcomes for children, parents, and households, weren’t very accurate even when trained on 13,000 data points from over 4,000 families.
So what? The write up states in the form of a quote from the author of the paywalled paper:
“Here’s a setting where we have hundreds of participants and a rich data set, and even the best AI results are still not accurate,” said study co-lead author Matt Salganik, a professor of sociology at Princeton and interim director of the Center for Information Technology Policy at the Woodrow Wilson School of Public and International Affairs. “These results show us that machine learning isn’t magic; there are clearly other factors at play when it comes to predicting the life course.”
We noted this comment from a researcher at Princeton University:
In the end, even the best of the over 3,000 models submitted — which often used complex AI methods and had access to thousands of predictor variables — weren’t spot on. In fact, they were only marginally better than linear regression and logistic regression, which don’t rely on any form of machine learning.
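What does “only marginally better” look like in numbers? One common yardstick is a holdout R-squared, which compares a model’s squared error on unseen cases with the error from simply guessing the training average. The sketch below is illustrative only: the outcome values and both sets of predictions are invented, and the `holdout_r2` name is an assumption, not code from the paper.

```python
def holdout_r2(y_true, y_pred, train_mean):
    """Return 1 - (model squared error / squared error of guessing the training mean)."""
    ss_model = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_baseline = sum((y - train_mean) ** 2 for y in y_true)
    return 1 - ss_model / ss_baseline


# Hypothetical holdout outcomes and predictions from two made-up models.
y_holdout = [2.0, 3.0, 4.0, 3.0]
train_mean = 3.0
simple_model = [2.9, 3.0, 3.1, 3.0]     # stand-in for a plain regression
complex_model = [2.85, 3.0, 3.15, 3.0]  # stand-in for a heavier method

print(holdout_r2(y_holdout, simple_model, train_mean))
print(holdout_r2(y_holdout, complex_model, train_mean))
```

With these invented numbers the simple model scores about 0.19 and the complex one about 0.28. A score of 0 means no better than guessing the average, and 1 means perfect prediction.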
Several observations:
- Nice work AAAS. Keep advancing science with a paywall germane to criminal justice and policeware.
- Over inflation of the “value” of outputs from models is common in marketing. DarkCyber thinks that the weaknesses of these methods need more than a few interviews with people like Cathy O’Neil, author of Weapons of Math Destruction.
- Are those afflicted with innumeracy willing to delegate certain important actions to procedures which are worse than relying on luck, flipping a coin, or Monte Carlo methods?
Net net: No one made accurate predictions. Yep, no one. Thought-stimulating research with implications for predictive analytics adherents. This open source paper provides some of the information referenced in the AAAS paper: Measuring the Predictability of Life Outcomes with a scientific mass collaboration
Stephen E Arnold, April 2, 2020
Wolfram Mathematica
March 19, 2020
DarkCyber noted “In Less Than a Year, So Much New: Launching Version 12.1 of Wolfram Language & Mathematica” contains highly suggestive information. Yes, this is a mathy program. The innovations are significant for analysts and some government professionals. To cite one example:
I’ve been recording hundreds of hours of video in connection with a new project I’m working on. So I decided to try our new capabilities on it. It’s spectacular! I could take a 4-hour video, and immediately extract a bunch of sample frames from it, and then—yes, in a few hours of CPU time—“summarize the whole video”, using SpeechRecognize to do speech-to-text on everything that was said and then generating a word cloud…
DarkCyber reacts positively to other additions and enhancements to the Mathematica “system.” Version 12.1 will make it easier to develop specific functions for policeware and intelware use cases.
Remarkable because the “system” can geo-everything. That’s important in many situations.
Stephen E Arnold, March 19, 2020
Israel and Mobile Phone Data: Some Hypotheticals
March 19, 2020
DarkCyber spotted a story in the New York Times: “Israel Looks to Repurpose a Trove of Cell Phone Data.” The story appeared in the dead tree edition on March 17, 2020, and you can access the online version of the write up at this link.
The write up reports:
Prime Minister Benjamin Netanyahu of Israel authorized the country’s internal security agency to tap into a vast, previously undisclosed trove of cell phone data to retrace the movements of people who have contracted the corona virus and identify others who should be quarantined because their paths crossed.
Okay, cell phone data. Track people. Paths crossed. So what?
Apparently not much.
The Gray Lady does the handwaving about privacy and the fragility of democracy in Israel. There’s a quote about the need for oversight when certain specialized data are retained and then made available for analysis. Standard journalism stuff.
DarkCyber’s team talked about the write up and what the real journalists left out of the story. Remember. DarkCyber operates from a hollow in rural Kentucky and knows zero about Israel’s data collection realities. Nevertheless, my team was able to identify some interesting use cases.
Let’s look at a couple and conclude with a handful of observations.
First, the idea of retaining cell phone data is not exactly a new one. What if these data can be extracted using an identifier for a person of interest? What if a time-series query could extract the geolocation data for each movement of the person of interest captured by a cell tower? What if this path could be displayed on a map? Here’s a dummy example of what the plot for a single person of interest might look like. Please note these graphics are examples selected from open sources. Examples are not related to a single investigation or vendor. These are for illustrative purposes only.
Source: Standard mobile phone tracking within a geofence. Map with blue lines showing a person’s path. SPIE at https://bit.ly/2TXPBby
Useful indeed.
Second, what if the intersection of the paths of two or more individuals can be plotted? Here’s a simulation of such a path intersection:
Source: Map showing the location of a person’s mobile phone over a period of time. Tyler Bell at https://bit.ly/2IVqf7y
Would these data provide a way to identify an individual with a mobile phone who was in “contact” with a person of interest? Would the authorities be able to perform additional analyses to determine who is in either party’s social network?
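The “paths crossed” question reduces to two tests: were two phones close in space, and close in time? Here is a minimal sketch with invented tracks. The `crossings` and `haversine_m` names and the 50 meter and five minute thresholds are assumptions for illustration only. Cell tower data is far coarser than 50 meters, so a real system would need much looser limits.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def crossings(track_a, track_b, max_meters=50.0, max_seconds=300):
    """Return timestamp pairs where the two tracks were near each other in space and time."""
    hits = []
    for time_a, lat_a, lon_a in track_a:
        for time_b, lat_b, lon_b in track_b:
            delta = datetime.fromisoformat(time_a) - datetime.fromisoformat(time_b)
            close_in_time = abs(delta.total_seconds()) <= max_seconds
            close_in_space = haversine_m(lat_a, lon_a, lat_b, lon_b) <= max_meters
            if close_in_time and close_in_space:
                hits.append((time_a, time_b))
    return hits


# Hypothetical tracks: (ISO timestamp, latitude, longitude).
track_a = [
    ("2020-03-17T09:00:00", 32.0853, 34.7818),
    ("2020-03-17T09:10:00", 32.0900, 34.7900),
]
track_b = [
    ("2020-03-17T09:02:00", 32.0854, 34.7819),
    ("2020-03-17T12:00:00", 32.0853, 34.7818),
]

print(crossings(track_a, track_b))  # [('2020-03-17T09:00:00', '2020-03-17T09:02:00')]
```

The second point in `track_b` sits at the same spot three hours later and is not a hit. Comparing every pair of points is fine for a demonstration but would not scale to a national data set without spatial and temporal indexing.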
Third, could these relationship data be mined so that connections can be further explored?
Source: Diagram of people who have crossed paths visualized via Analyst Notebook functions. Globalconservation.org
Can these data be arrayed on a timeline? Can the routes be converted into an animation that shows a particular person of interest’s movements at a specific window of time?
Source: Vertical dots diagram from Recorded Future showing events on a timeline. https://bit.ly/39Xhbex
These hypothetical displays of data derived from cross correlations, geotagging, and timeline generation based on date stamps seem feasible. If earnest individuals in rural Kentucky can see the value of these “secret” data disclosed in the New York Times’ article, why didn’t the journalist and the others who presumably read the story?
What’s interesting is that systems, methods, and tools clearly disclosed in open source information are overlooked, ignored, or just not understood.
Now the big question: Do other countries have these “secret” troves of data?
DarkCyber does not know; however, it seems possible. Log files are a useful function of data processes. Data exhaust may have value.
Stephen E Arnold, March 19, 2020
First Counting Bees, Now Predicting Parrots
March 5, 2020
DarkCyber found the write up “Parrots Can Make Predictions Based on Probabilities” amusing. With the corona virus data widely available, will these poly-nomial avians lend their expertise to global health administrators?
The write up asserts:
They [scientists] discovered the kea, a species of large parrot found in New Zealand, can make inferences and predict events based on previous knowledge or experience. They [yep, this is a reference to the parrots] even performed better than chimps in some experiments.
The write up states:
The team said it is the first time this complex cognitive ability has been demonstrated in an animal outside of the great apes, which could help shed light on the “evolutionary history of statistical inference”.
Now is the time to apply parrot intelligence to tough computing problems like the Corona virus research. Polly, do you want a protein predictive output?
Stephen E Arnold, March 5, 2020
Amazon: Buying More Innovation
February 26, 2020
DarkCyber noted the article “Amazon Acquires Turkish Startup Datarow.” The word “startup” is rather loosely applied. Datarow was founded in 2016. Not a spring chicken in DarkCyber’s view is a four year old outfit.
What’s interesting about this acquisition is that it provides the sometimes unartful Amazon with an outfit that specializes in making easier-to-use data tools. The firm appears to have been built around AWS Redshift.
The company’s quite wonky Web site says:
We’re proud to have created an innovative tool that facilitates data exploration and visualization for data analysts in Amazon Redshift, providing users with an easy to use interface to create tables, load data, author queries, perform visual analysis, and collaborate with others to share SQL code, analysis, and results. Together with AWS, we look forward to taking our tool to the next level for customers.
The company provides what it calls “data governance,” a term which DarkCyber thinks means “get your act together” with regard to information. This is easier said than done, but it is a hot button among companies struggling to reduce costs, comply with assorted rules and regulations, and figure out what’s actually happening in their lines of business. Profit and loss statements are not up to the job of dealing with diverse content, audio, video, real time data, and tweets. Well, neither is Amazon, but that’s not germane.
Will Amazon AWS Redshift (love the naming, don’t you?) become easier to use? Perhaps Datarow will become responsible for the AWS Web site?
Stephen E Arnold, February 26, 2020
Facial Recognition: Those Error Rates? An Issue, Of Course
February 21, 2020
DarkCyber read “Machines Are Struggling to Recognize People in China.” The write up asserts:
The country’s ubiquitous facial recognition technology has been stymied by face masks.
One of the unexpected consequences of the Covid 19 virus is that citizens with face masks cannot be recognized.
“Unexpected” when adversarial fashion has been getting some traction among those who wish to move anonymously.
The write up adds:
Recently, Chinese authorities in some provinces have made medical face masks mandatory in public and the use and popularity of these is going up across the country. However, interestingly, as millions of masks are now worn by Chinese people, there has been an unintended consequence. Not only have the country’s near ubiquitous facial-recognition surveillance cameras been stymied, life is reported to have become difficult for ordinary citizens who use their faces for everyday things such as accessing their homes and bank accounts.
Now an “admission” by a US company:
Companies such as Apple have confirmed that the facial recognition software on their phones need a view of the person’s full face, including the nose, lips and jaw line, for them to work accurately. That said, a race for the next generation of facial-recognition technology is on, with algorithms that can go beyond masks. Time will tell whether they work. I bet they will.
To sum up: Masks defeat facial recognition. The future is a method of identification that can work with what is not covered plus any other data available to the system; for example, pattern of walking and geo-location.
For now, though, the remedy for the use of masks is lousy facial recognition and more effort to find innovations.
The author of the write up is a — wait for it — venture capital professional. And what country leads the world in facial recognition? China, according to the VC professional.
The future is better person recognition of which the face is one factor.
Stephen E Arnold, February 21, 2020