The Only Dataset Search Tool: What Does That Tell Us about Google?

April 11, 2024

This essay is the work of a dumb dinobaby. No smart software required.

If you like semi-jazzy, academic write ups, you will revel in “Discovering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search.” The write up appears in a publication associated with Jeffrey Epstein’s favorite university. It may be worth noting that MIT and Google have teamed up to offer a free course in artificial intelligence. AI is the next big thing; it hallucinates at times while creating considerable marketing angst among the techno-giants jousting to emerge as the go-to source of the technology.

Back to the write up. Google created a search tool to allow a user to locate datasets accessible via the Internet. There are more than 700 data brokers in the US. These outfits will sell data to most people who can pony up the cash. Examples range from six-figure fees for the Twitter stream to a few hundred bucks for boat license holders in states without much water.

The write up says:

Our team at Google developed Dataset Search, which differs from existing dataset search tools because of its scope and openness: potentially any dataset on the web is in scope.


A very large, money oriented creature enjoins a worker to gather data. If someone asks, “Why?”, the monster says, “Make up something.” Thanks MSFT Copilot. How is your security today? Oh, that’s too bad.

The write up does the academic thing of citing articles which talk about data on the Web. There is even a table which organizes the types of data discovery tools. The categorization into “general” and “specific” is brilliant. Who would have thought there were two categories of vertical search engine focused on Web-accessible data? I thought there was just one category; namely, gettable. The idea is that if the data are exposed, take them. Asking permission just costs time and money. The idea is that one can apologize and keep the data.

The article includes a Googley graphic. The French portal, the Italian “special” portal, and the Harvard “dataverse” are identified. Were there other Web accessible collections? My hunch is that Google’s spiders suck down, as one famous Googler said, “all” the world’s information. I will leave it to your imagination to fill in other sources for the dataset pages. (I want to point out that Google has some interesting technology related to converting data sets into normalized data structures. If you are curious about the patents, just write benkent2020 at yahoo dot com, and one of my researchers will send along a couple of US patent numbers. Impressive system and method.)

The section “Making Sense of Heterogeneous Datasets” is peculiar. First, the Googlers discovered the basic fact about data from different sources: the data structures vary. Think in terms of grapes and deer droppings. Second, the data cannot be “trusted.” The team writing the paper has no fix for this issue. Third, the authors appear to be unaware of the patents I mentioned, particularly the useful example about gathering and normalizing data about digital cameras. The method applies to other types of processed data as well.

I want to jump to the “beyond metadata” idea. This is the mental equivalent of “popping” up a perceptual level. Metadata are quite important and useful. (Isn’t it odd that Google strips high value metadata from its search results; for example, time and date?) The authors of the paper work hard to explain that the Google approach to dataset search adds value by grouping, sorting, and tagging with information not in any one dataset. This is common sense, but the Googley spin on this is to build “trust.” Remember: This is an alleged monopolist engaged in online advertising and co-opting certain Web services.

Several observations:

  1. This is another of Google’s high-class PR moves. Hooking up with MIT and delivering razz-ma-tazz about identifying spiderable content collections in the name of the greater good is part of the 2024 Code Red playbook, it seems. From humble brags about smart software to crazy assertions like quantum supremacy, today’s Google is a remarkable entity.
  2. The work on this “project” is divorced from time. I checked my file of Google-related information, and I found no information about the start date of a vertical search engine project focused on spidering and indexing data sets. My hunch is that it has been in the works for a while, although I can pinpoint 2006 as a year in which Google’s technology wizards began to talk about building master data sets. Why no time specifics?
  3. I found the absence of AI talk notable. Perhaps Google does not think a reader will ask, “What’s with the use of these data?” I can’t use this tool, so why spend the time, effort, and money to index information from a country like France, which is not one of Google’s biggest fans? (Paris was, however, the roll out choice for the answer to Microsoft and ChatGPT’s smart software announcement. Plus, that presentation featured incorrect information, as I recall.)

Net net: I think this write up, with its quasi-academic blessing, is a bit of advance information to use in the coming wave of litigation about Google’s use of content to train its AI systems. This is just a hunch, but there are too many weirdnesses in the academic write up to write off as intern work or careless research writing, which is more difficult in the wake of the stochastic monkey dust up.

Stephen E Arnold, April 11, 2024

French Building and Structure Geo-Info

February 23, 2024

This essay is the work of a dumb dinobaby. No smart software required.

OSINT professionals may want to take a look at a French building and structure database with geo-functions. The information is gathered and made available by the Observatoire National des Bâtiments. Registration is required. A user can search by city and address. The data, compiled up to 2022, cover France’s metropolitan areas and include geo services. The data include the address, the built and unbuilt property, the plot, the municipality, dimensions, and some technical data. The data represent a significant effort involving the government, commercial and non-governmental entities, and citizens. The dataset includes more than 20 million addresses. Some records include up to 250 fields.



To access the service, navigate to the Observatoire National des Bâtiments site, where one is invited to register or use the online version. My team recommends registering. Note that the site is in French. Copying some text and data and shoving it into a free online translation service like Google’s may not be particularly helpful. French is one of the languages Google usually handles with reasonable facility. For this site, Google Translate comes up with tortured and off-base translations.

Stephen E Arnold, February 23, 2024

Meta Never Met a Kid Data Set It Did Not Find Useful

January 5, 2024

This essay is the work of a dumb dinobaby. No smart software required.

Adults are ripe targets for data exploitation in modern capitalism. While some adults fight for their online privacy, most have rolled over and accepted the inevitable consumer Big Brother. When big tech companies go after monetizing kids’ data, however, that’s when adults fight back like rabid bears. Engadget writes about how Meta is fighting the federal government over kids’ data: “Meta Sues FTC To Block New Restrictions On Monetizing Kids’ Data.”

Meta is taking the FTC to court to prevent the agency from reopening a landmark 2020 privacy case, which carried a $5 billion penalty, and to preserve its ability to monetize kids’ data on its apps. Meta is suing because a federal judge ruled that the FTC can impose new, more stringent rules on how Meta is allowed to conduct business.

Meta claims the FTC is out for a power grab and is acting unconstitutionally, while the FTC reports that Meta consistently violates the 2020 settlement and the Children’s Online Privacy Protection Act. The FTC wants its new rules to limit Meta’s facial recognition usage and to initiate a moratorium on new products and services until a third party audits them for privacy compliance.

Meta is not a huge fan of the US Federal Trade Commission:

“The FTC has been a consistent thorn in Meta’s side, as the agency tried to stop the company’s acquisition of VR software developer Within on the grounds that the deal would deter ‘future innovation and competitive rivalry.’ The agency dropped this bid after a series of legal setbacks. It also opened up an investigation into the company’s VR arm, accusing Meta of anti-competitive behavior.”

The FTC is doing what government agencies are supposed to do: protect citizens from greedy and harmful practices like those of big business. The FTC can enforce laws and force big businesses to pay fines, put leaders in jail, or even shut them down. But regulators have spent decades ramping up to take meaningful action. The result? The thrashing over kiddie data.

Whitney Grace, January 5, 2024

Data Mesh: An Innovation or a Catchphrase?

October 18, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

Have you ever heard of data mesh? It’s a concept that has been around the tech industry for a while but is gaining more traction in media outlets. Most of the hubbub comes from funding coverage, such as TechCrunch’s “Nextdata Is Building Data Mesh For Enterprise.”

Data mesh can be construed as a data platform architecture that allows users to access information where it is. No transferring of the information to a data lake or data warehouse is required. A data lake is a centralized, scaled data storage repository, while a data warehouse is a traditional enterprise system that analyzes data from different sources which may be local or remote.

Nextdata is a data mesh startup founded by Zhamak Dehghani. Nextdata is a “data-mesh-native” platform to design, share, create, and apply data products for analytics. Nextdata is directly inspired by Dehghani’s work at Thoughtworks. Instead of storing and using data and metadata in a single container, Dehghani built a mesh system. How does the Nextdata system work?

“Every Nextdata data product container has data governance policies ‘embedded as code.’ These controls are applied from build to run time, Dehghani says, and at every point at which the data product is stored, accessed or read. ‘Nextdata does for data what containers and web APIs do for software,’ she added. ‘The platform provides APIs to give organizations an open standard to access data products across technologies and trust boundaries to run analytical and machine-learning workloads “distributedly.” (sic) Instead of requiring data consumers to copy data for reprocessing, Nextdata APIs bring processing to data, cutting down on busy work and reducing data bloat.’”
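The “policies embedded as code” notion can be sketched in a few lines of Python. To be clear, this is a toy illustration of the general idea, not Nextdata’s actual API; the `DataProduct` class and the policy callable are invented for this sketch.

```python
# Toy sketch of a "data product" whose governance policy travels with the
# data and is checked at every read. Hypothetical; not Nextdata's API.

class PolicyViolation(Exception):
    """Raised when a consumer fails the product's embedded policy."""


class DataProduct:
    def __init__(self, rows, policy):
        self._rows = rows        # the data stays in place; no copy to a lake
        self._policy = policy    # governance "embedded as code"

    def read(self, consumer):
        # Processing comes to the data: the policy gates each access.
        if not self._policy(consumer):
            raise PolicyViolation(f"{consumer} is not allowed to read this product")
        return list(self._rows)


# Invented policy: only analytics-team consumers may read this product.
product = DataProduct(
    rows=[{"sku": "A1", "units": 3}],
    policy=lambda consumer: consumer.endswith("@analytics"),
)

print(product.read("alice@analytics"))  # permitted by the embedded policy
```

The point of the sketch is the placement of the control: the check lives inside the data product itself, so every consumer hits the same policy, rather than each downstream copy enforcing (or forgetting) its own rules.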

Nextdata received $12 million in seed investment, which Dehghani will use to develop the system’s tooling and hire more people for the product, engineering, and marketing teams. Congratulations on the funding. It is not clear at this time whether the approach will add latency to operations or present security issues related to disparate users’ security levels.

Whitney Grace, October 18, 2023

Scinapse Is A Free Academic-Centric Database

July 11, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

Quality, academic-worthy databases are difficult to locate outside of libraries and schools. Google Scholar attempted to qualify as an alternative to paywalled databases, but it returns repetitive and inaccurate results. Thanks to AI algorithms, free databases have improved; Scinapse is one example.

Scinapse is designed by Pluto, and it is advertised as the “researcher’s favorite search engine.” Scinapse delivers accurate and updated research materials in each search. Many free databases pull their results from old citations and fail to include recent publications. Pluto promises Scinapse delivers high-performing results due to its original algorithm optimized for research.

The algorithm returns research materials based on when they were published, how many times they have been cited, and how impactful a paper was in notable journals. Scinapse consistently delivers results that are better than Google Scholar’s. Each search item includes a complete citation for quick reference. The customized filters offer the typical ways to narrow or broaden results, including journal, field of study, conference, author, publication year, and more.
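A ranking that blends recency, citation count, and venue impact can be sketched as a weighted score. The weights, field names, and formula below are invented for illustration; Scinapse’s actual algorithm is not public in this detail.

```python
# Illustrative scoring of papers by recency, citations, and venue impact.
# Weights, fields, and sample papers are invented; this is not Scinapse's
# actual ranking algorithm.
import math

def score(paper, current_year=2023):
    recency = 1.0 / (1 + current_year - paper["year"])  # newer scores higher
    citations = math.log1p(paper["citations"])          # diminishing returns
    impact = paper["venue_impact"]                      # e.g., a journal metric
    return 0.3 * recency + 0.5 * citations + 0.2 * impact

papers = [
    {"title": "Old classic", "year": 1998, "citations": 5000, "venue_impact": 9.0},
    {"title": "Fresh preprint", "year": 2023, "citations": 12, "venue_impact": 2.0},
]

ranked = sorted(papers, key=score, reverse=True)
print([p["title"] for p in ranked])
```

Note the design choice in the sketch: logging the citation count keeps a heavily cited classic from drowning out everything recent, which is one plausible way to balance the signals the paragraph above describes.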

People can also create an account to organize their research in reading lists, share with other scholars, or export as a citation list. Perhaps the most innovative feature is the paper recommendations where Scinapse sends paper citations that align with research. Scinapse aggregates over 48,000 journals. There are users in 196 countries and 1,130 reputable affiliations. Scinapse’s data sources include Microsoft Research, PubMed, Semantic Scholar, and Springer Nature.

Whitney Grace, July 11, 2023

Google and Its Use of the Word “Public”: A Clever and Revenue-Generating Policy Edit

July 6, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

If one has the cash, one can purchase user-generated data from more than 500 data publishers in the US. Some of these outfits are unknown. When a liberal Wall Street Journal reporter learns about Venntel or one of these outfits, outrage ensues. I am not going to explain how data from a user find their way into the hands of a commercial data aggregator or database publisher. Why not Google it? Let me know how helpful that research will be.

Why are these outfits important? The reasons include:

  1. Direct-from-app information obtained when a clueless mobile user accepts the Terms of Use. Do you hear the slurping sounds?
  2. Organizations with financial data and savvy data wranglers who cross-correlate data from multiple sources.
  3. Outfits which assemble real-time or near-real-time user location data. How useful are those data in identifying military locations populated by individuals who exercise wearing helpful heart and step monitoring devices?

Navigate to “Google’s Updated Privacy Policy States It Can Use Public Data to Train its AI Models.” The write up does not make clear what “public data” are. My hunch is that the Google is not exceptionally helpful with its definitions of important “obvious” concepts. The disconnect is the point of the policy change. Public data, or third-party data, can be purchased, licensed, used on a cloud service like Oracle’s BlueKai or a clone of it, or obtained as part of a commercial deal with everyone’s favorite online service LexisNexis or one of its units.


A big advertiser demonstrates joy after reading about Google’s detailed prospect targeting reports. Dossiers of big buck buyers are available to those relying on Google for online text and video sales and marketing. The image of this happy media buyer is from the elves at MidJourney.

The write up states with typical Silicon Valley “real” news flair:

By updating its policy, it’s letting people know and making it clear that anything they publicly post online could be used to train Bard, its future versions and any other generative AI product Google develops.

Okay. The “weekend” mentioned in the write up is the 4th of July weekend. Is this a hot news or a slow news time? If you picked “hot,” you are respectfully wrong.

Now back to “public.” Think in terms of Google’s licensing third-party data, cross correlating those data with its log data generated by users, and any proprietary data obtained by Google’s Android or Chrome software, Gmail, its office apps, and any other data which a user clicking one of those “Agree” boxes cheerfully mouses through.

The idea is embodied in Google patent US7774328 B2. What’s interesting is that this granted patent does not include a quite helpful figure from the patent application US2007/0198481. Here’s the 16-year-old figure. The subject is Michael Jackson. The text is difficult to read (write your Congressman or Senator to complain). The output is a machine generated dossier about the pop star. Note that it includes aliases. Other useful data are in the report. The granted patent presents more vanilla versions of the dossier generator, however.

[Figure: machine-generated profile from patent application US2007/0198481]

The use of “public” data may enhance this type of dossier or other meaty report about a person. How about a map showing the travels of a person, or a geo-fence around an individual’s location on a specific day and time? Useful for some applications? If these “inventions” are real, then the potential use cases are interesting. Advertisers will probably be interested. Can you think of other use cases? I can.

The cited article focuses on AI. I think that more substantive use cases fit nicely with the shift in “policy” for public data. Have you asked yourself, “What will Mandiant professionals find interesting in cross-correlated data?”

Stephen E Arnold, July 6, 2023

Synthetic Data: Yes, They Are a Thing

March 13, 2023

“Real” data — that is, data generated by humans — are expensive to capture, normalize, and manipulate. But those “real” data are important. Unfortunately, some companies have sucked up real data and integrated those items into products and services. Now regulators are awakening from a decades-long slumber and taking a look into the actions of certain data companies. More importantly, a few big data outfits are aware of [a] the costs and [b] the risks of real data.

Enter synthetic data.

If you are unfamiliar with the idea, navigate to “What is Synthetic Data? The Good, the Bad, and the Ugly.” The article states:

The privacy engineering community can help practitioners and stakeholders identify the use cases where synthetic data can be used safely, perhaps even in a semi-automated way. At the very least, the research community can provide actionable guidelines to understand the distributions, types of data, tasks, etc. where we could achieve reasonable privacy-utility tradeoffs via synthetic data produced by generative models.

Helpful, correct?

The article does not point out two things which I find of interest.

First, the amount of money a company can earn by operating efficient synthetic data factories is likely to be substantial. Like other digital products, the upside can be profitable and give the “owner” of the synthetic data market an IBM-type, old-school lock-in.

Second, synthetic data can be weaponized intentionally, via either data poisoning or algorithm shaping.
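The poisoning risk is easy to demonstrate in miniature. The sketch below, using only invented numbers and the Python standard library, fits a trivial generative model (a Gaussian) to “real” data and then nudges its fitted mean; anything trained on the poisoned synthetic sample silently inherits the bias.

```python
# Sketch: a synthetic-data generator fitted to real data, then "poisoned"
# by nudging the fitted mean. All numbers are invented for illustration.
import random
import statistics

random.seed(7)
real = [random.gauss(100.0, 15.0) for _ in range(10_000)]  # "real" measurements

mu = statistics.fmean(real)     # fitted mean of the generative model
sigma = statistics.stdev(real)  # fitted spread

def synthesize(n, mean_shift=0.0):
    # A generative model reduced to its simplest form: sample from the
    # fitted Gaussian, optionally with a poisoned mean.
    return [random.gauss(mu + mean_shift, sigma) for _ in range(n)]

clean = synthesize(10_000)
poisoned = synthesize(10_000, mean_shift=5.0)  # subtle; no single row looks wrong

print(round(statistics.fmean(clean) - mu, 1))     # close to 0
print(round(statistics.fmean(poisoned) - mu, 1))  # close to 5
```

No individual poisoned record is detectably fake, which is exactly why the essay’s silence on weaponization seems like a gap worth flagging.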

I just wanted to point out that a useful essay does not explore what may be two important attributes of synthetic data. Will regulators rise to the occasion? Unlikely.

Stephen E Arnold, March 13, 2023

Amazon Data Sets

February 21, 2023

Do you want to obtain data sets for analysis or for making smart software even more crafty? Navigate to the AWS Marketplace. This Web page makes it easy to search through the more than 350 data products on offer. There is a Pricing Model check box. Click it if you want to see the no-cost data sets. There are some interesting options in the left side Refine Results area. For example, there are 366 open data licenses available. I find this interesting because, when I examined the page, there were 362 data products. What are the missing four? I noted that there are 2,340 “standard data subscription agreements.” Again, the difference between the 366 on offer and the 2,340 is interesting. A more comprehensive listing of data sources appears in the PrivacyRights listing. With some sleuthing, you may be able to identify other, lower profile ways to obtain data too. I am not willing to add some color about these sources in this free blog post.

Stephen E Arnold, February 21, 2023

Datasette: Useful Tool for Crime Analysts

February 15, 2023

If you want to explore data sets, you may want to take a look at Datasette, an “open source multi-tool for exploring and publishing data.”

The company says,

It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of 42 tools and 110 plugins dedicated to making working with structured data as productive as possible.

A handful of demos are available. Worth a look.
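A quick way to try the tool: build a small SQLite file with Python’s standard library and point Datasette at it (assuming `pip install datasette`). The table and rows below are invented for illustration.

```python
# Build a tiny SQLite database with the standard library; Datasette can
# then serve it as a browsable site. Table and rows are invented.
import sqlite3

# Use an in-memory DB here; in practice you would write a file such as
# incidents.db so Datasette has something to serve.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (city TEXT, year INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?)",
    [("Louisville", 2022, 14), ("Lexington", 2022, 9)],
)
conn.commit()

rows = conn.execute(
    "SELECT city, total FROM incidents ORDER BY total DESC"
).fetchall()
print(rows)
conn.close()

# With the same data written to incidents.db, serve it from the shell:
#   datasette serve incidents.db
# then browse the table, run SQL, and export JSON or CSV from the web UI.
```

For an analyst, the attraction is that a single .db file becomes a shareable, queryable Web page without writing any server code.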

Stephen E Arnold, February 15, 2023

Summarize for a Living: Should You Consider a New Career?

February 13, 2023

In the pre-historic age of commercial databases, many people earned money by reading and summarizing articles, documents, monographs, and consultant reports. In 1980, preparing and fact checking a 130-word summary of an article in the Harvard Business Review cost the database publisher something like $17 to $25 per summary for what I would call general business information. (If you want more information about this number, write, and maybe you will get more color on the number.) Flash forward to the present: the cost for a human to summarize an article in the Harvard Business Review has increased. That’s why it is necessary to pay to access and print an abstract from a commercial online service. Even with yesterday’s technology, the costs are a killer. Now you know why software that eliminates the human, the editorial review, the fact checking, and the editorial policies which define what is indexed, why, and what terms are assigned is a joke to many of those in the search and retrieval game.

I mention this because if you are in the A&I business (abstracting and indexing), you may want to take a look at HunKimForks ChatGPT Arxiv Extension. The idea is that ChatGPT can generate an abstract which is certainly less fraught with cost and management hassles than running one of the commercial database content generation systems dependent on humans, some with degrees in chemistry, law, or medicine.

Are the summaries any good? For the last 40 years, abstracts and summaries have been, in my opinion, degrading. Fact checking is out the window, along with editorial policies, style guidelines, and considered discussion of index terms, classification codes, and time handling and signifying, among other useful knowledge attributes.

Three observations:

  1. Commercial database publishers may want to check out this early-days open source contribution.
  2. Those engaged in abstracting, writing summaries of books, and generating distillations of turgid government documents (yep, blue chip consulting firms, I am thinking of you) may want to think about their futures.
  3. Say “hello” to increasingly inaccurate outputs from smart software. Recursion and liquid algorithms are not into factual stuff.

Stephen E Arnold, February 13, 2023
