How about This Intelligence Blindspot: Poisoned Data for Smart Software

February 23, 2023

One of the authors is a Googler. I think this is important because the Google is into synthetic data; that is, machine generated information for training large language models or what I cynically refer to as “smart software.”

The article / maybe reproducible research is “Poisoning Web-Scale Training Datasets Is Practical.” Nine authors, of whom four are Googlers, have concluded that a bad actor, government, rich outfit, or crafty students in Computer Science 301 can inject information into content destined to be used for training. How can this be accomplished? The answer is either by humans, ChatGPT outputs from an engineered query, or a combination. Why would someone want to “poison” Web accessible or thinly veiled commercial datasets? Gee, I don’t know. Oh, wait, how about control of information and framing of issues? Nah, who would want to do that?

The paper’s authors conclude with more than one-third of that Google goodness. No, wait. There are no conclusions. Also, there are no end notes. What the paper does offer is a road map explaining the mechanism for poisoning.

One key point for me is the question, “How is poisoning related to the use of synthetic data?”

My hunch is that synthetic data are more easily manipulated than going through the hoops to poison publicly accessible data. That’s time and resource intensive. The synthetic data angle makes it more difficult to identify the type of manipulations in the generation of a synthetic data set which could be mingled with “live” or allegedly-real data.

Net net: Open source information and intelligence may have a blindspot because it is not easy to determine what’s right, accurate, appropriate, correct, or factual. Are there implications for smart machine analysis of digital information? Yep. In my opinion, already flawed systems will be less reliable, and the users may not know why.

Stephen E Arnold, February 23, 2023

What Happens When Misinformation Is Sucked Up by Smart Software? Maybe Nothing?

February 22, 2023

I noted an article called “New Research Finds Rampant Misinformation Spreading on WhatsApp within Diasporic Communities.” The source is the Daily Targum. I mention this because the news source is the Rutgers University Campus news service. The article provides some information about a study of misinformation on that lovable Facebook property WhatsApp.

Several points in the article caught my attention:

  1. Misinformation on WhatsApp caused people to be killed; Twitter did its part too
  2. There is an absence of fact checking
  3. There are no controls to stop the spread of misinformation

What is interesting about studies conducted by prestigious universities is that often the findings are neither novel nor surprising. In fact, nothing about social media companies’ reluctance to spend money or launch ethical methods is new.

What are the consequences? Nothing much: Abusive behavior, social disruption, and, oh, one more thing, deaths.

Stephen E Arnold, February 22, 2023

Secrets Patterns Database

February 15, 2023

One of my researchers called my attention to “Secrets Patterns Database.” For those interested in finding “secrets”, you may want to take a look. The data and scripts are available on GitHub… for now. Among its features are:

  • “Over 1600 regular expressions for detecting secrets, passwords, API keys, tokens, and more.
  • Format agnostic. A Single format that supports secret detection tools, including Trufflehog and Gitleaks.
  • Tested and reviewed Regular expressions.
  • Categorized by confidence levels of each pattern.
  • All regular expressions are tested against ReDos attacks.”

Links to the author’s Web site and LinkedIn profile appear in the GitHub notes.
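
For readers who have not used a pattern-based secret scanner, here is a minimal sketch of the idea in Python. The two regular expressions are illustrative assumptions written for this example; they are not copied from the Secrets Patterns Database, and the function name `scan` and the sample text are hypothetical. A real scan would load the project's own pattern files instead.

```python
import re

# Illustrative patterns only (assumptions for this sketch, NOT taken from
# the Secrets Patterns Database).
PATTERNS = {
    # AWS access key IDs are commonly matched as "AKIA" plus 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A loose guess at an "api_key = '...'" style assignment.
    "generic_api_key": re.compile(
        r"(?i)\bapi[_-]?key\s*[:=]\s*['\"]([A-Za-z0-9_\-]{16,})['\"]"
    ),
}


def scan(text):
    """Return (line_number, pattern_name) for each line matching a pattern."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, name))
    return hits


# Made-up sample input; none of these values are real credentials.
sample = (
    'user = "demo"\n'
    'aws_id = "AKIAABCDEFGHIJKLMNOP"\n'
    'api_key = "0123456789abcdef0123"\n'
)

for line_no, name in scan(sample):
    print(f"line {line_no}: possible {name}")
```

A match is only a lead, not proof of a leaked credential, which is presumably why the database sorts its patterns by confidence level.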

Stephen E Arnold, February 20, 2023

Datasette: Useful Tool for Crime Analysts

February 15, 2023

If you want to explore data sets, you may want to take a look at the “open source multi-tool for exploring and publishing data.” The Datasette Swiss Army knife “is a tool for exploring and publishing data.”

The company says,

It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of 42 tools and 110 plugins dedicated to making working with structured data as productive as possible.

A handful of demos are available. Worth a look.

Stephen E Arnold, February 15, 2023

Modern Research Integrity: Stunning Indeed

February 13, 2023

I read “The Rise and Fall of Peer Review.” The essay addresses what happens when a researcher submits a research paper to a research journal. Many “research” journals are owned by big professional publishing companies. If you are not familiar with that sector, think about a publishing club which markets to libraries and “research” institutions. No articles in “research” publications, no promotion. The method for determining accuracy is to ask experts to read submitted papers, make comments, and send a signal about the value of the “research.” I served on a peer review panel for a year and quit. I am no academic, but I know doo doo when it is on my shoe.

Now I want to focus on one passage. Consider this statement:

Why don’t reviewers catch basic errors and blatant fraud? One reason is that they almost never look at the data behind the papers they review, which is exactly where the errors and fraud are most likely to be. In fact, most journals don’t require you to make your data public at all. You’re supposed to provide them “on request,” but most people don’t. That’s how we’ve ended up in sitcom-esque situations like ~20% of genetics papers having totally useless data because Excel autocorrected the names of genes into months and years. (When one editor started asking authors to add their raw data after they submitted a paper to his journal, half of them declined and retracted their submissions. This suggests, in the editor’s words, “a possibility that the raw data did not exist from the beginning.”)

I have three observations:


  1. There is exactly one commercial database which added corrections to its entries. Why? Accuracy is expensive and most publishers are not into corrections. I think the feature of that database has been in the trash heap for many, many years. The outfit which bought the database is not into excellence in anything but revenue and profit.
  2. I found it impossible either [a] to reach the author to whom I wanted to address a question directly; that is, on the telephone, or [b] to get the data behind the crazy statistical hoops on display. Hey, math is not the key differentiator for many researchers; getting tenure and grants are the prime movers. A peer reviewer with pointed questions? Sorry, no way.
  3. The professional publishers want to follow a process which shifts responsibility for publishing error-filled articles to the “procedure”, the peer reviewers, the editors, and probably the stray dog outside their headquarters. Everyone is responsible for mistakes except them.

Net net: Perhaps the notion of open source accuracy needs to be expanded beyond tweets and Facebook posts?

Stephen E Arnold, February 14, 2023

Easy Monitoring for Web Site Deltas

February 9, 2023

We have been monitoring selected Clear Web pages for a research project. We looked at a number of solutions and turned to Visual Ping. The system is easy to use. Enter the URL of the Web page for which you want a notification of a delta (change). Enter an email address, and the system will send you a notification. The service is free if you want to monitor five Web pages per day. The company has a pricing FAQ which explains the cost of more notifications. The Visual Ping service assumes a user wants to monitor the same Web site or sites on a continuous basis. In order to kill monitoring for one site, a bit of effort is required. Our approach was to have a different team member sign up and enter the revised monitor list. There may be an easier way, but without an explicit method, a direct solution worked for us.
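
The basic idea behind delta monitoring can be sketched in a few lines of Python. This is not how the commercial service works internally (we do not know that); it is a minimal illustration which assumes page text has already been fetched and stored as snapshots. The function names `fingerprint` and `changed_pages` and the example.com URLs are hypothetical.

```python
import hashlib


def fingerprint(page_text):
    """Return a SHA-256 hex digest of the text with whitespace normalized."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_pages(previous, current):
    """Compare two {url: page_text} snapshots.

    Returns the URLs in `current` that are new or whose fingerprint
    differs from the one in `previous`.
    """
    changed = []
    for url, text in current.items():
        old_text = previous.get(url)
        if old_text is None or fingerprint(old_text) != fingerprint(text):
            changed.append(url)
    return changed


# Hypothetical snapshots taken on two different days.
yesterday = {
    "https://example.com/a": "Price list: 10 items",
    "https://example.com/b": "Contact us",
}
today = {
    "https://example.com/a": "Price list: 12 items",
    "https://example.com/b": "Contact   us",  # whitespace-only difference
}

for url in changed_pages(yesterday, today):
    print("delta detected:", url)
```

Normalizing whitespace before hashing is a design choice to avoid alerts on trivial layout changes. Pages with rotating ads or timestamps would need more filtering than this sketch provides.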

Stephen E Arnold, February 9, 2023

Interesting Search Tool: Tumbex

December 13, 2022

Interest in Open Source Intelligence has crossed what I call the Murdoch Wall Street Journal threshold. My MWSJ idea is that a topic, person, or idea bubbles along for a period of time, in this instance, decades. OSINT was a concept discussed by a number of people in the 1980s. In fact, one advocate, a former Marine Corps officer and government professional, organized open source intelligence conferences decades ago. That’s dinobaby history, and I know that few “real news” people remember Robert David Steele or his concepts about open source in general or OSINT in particular. (If you are curious about the history, email the Beyond Search team at benkent2020 @ yahoo dot com. Why? I participated in Mr. Steele’s conferences for many years, and we worked on a number of open source projects for a range of clients until shortly before his death in August 2021.) Yep, history. Sometimes knowing about events can be helpful.

Let’s talk about online information; specifically, an OSINT tool available since 2014 if my memory is working this morning. The tool is called Tumbex. With it, one can search Tumblr content.


Here’s what the Web site says:

Tumbex indexes only tumblr posts which have caption or tags. We analyse the content and define if tumblr or posts are nsfw/adult. If your tumblr was detected as nsfw by mistake, you can request a review and we will manually check your tumblr.

This is interesting. However, with a bit of query testing one can find some quite sporty content on the service.

The service, which allegedly became available in 2014, is hosted by the French outfit OVH. According to StatShow, Tumbex has experienced a jump in traffic. The site is not particularly low profile because it has a user base of an estimated one million humans or bots. (Please, keep in mind that click data are often highly suspect regardless of source.) FYI: StatShow can be a useful OSINT resource as well.

If you are interested in some of the OSINT resources my team relies upon, navigate to Click the image and a new window will open with an OSINT resource displayed. No ads, no trackers, no editorial. Just an old fashioned 1994 Web site which can be used to fill an idle moment.

Now that the MWSJ threshold has been crossed, OSINT is a thing, an almost-overnight success with some youthful experts emphasizing that the US government has been asleep at the switch. I am not sure that assessment is one I can fully support.

Stephen E Arnold, December 13, 2022

OSINT: HandleFinder

November 22, 2022

If you are looking for a way to identify a user “handle” on various social networks, you may want to take a look at HandleFinder. The service appears to be offered without a fee. The developer does provide a “Buy Me a Coffee” link, so you can support the service. The service accepts a user name. We used our favorite ageing teen screen name ibabyrainbow or babyrainbow on some lower profile services. HandleFinder returned 31 results on our first query. (We ran the query a second time, and the system returned 30 results. We found this interesting.)

The services scanned included Patreon, TikTok, and YouTube, among others. The service did not scan the StreamGun video on demand service or NewRecs.

In order to examine the results, one clicks on the service name, which is underlined. Note that once one clicks the link, the result set is lost. We found that the link should be opened in a separate tab or window to eliminate the need to rerun the query after each click. That’s how one of my team discovered the count variance.

When there is no result, the link in HandleFinder does not make this evident. Links to ibabyrainbow on Instagram returned “Page not found.” The result for returned the page of links, which means more clicking.

If you are interested in chasing down social media handles, you may want to check out this service. It is promising and hopefully will be refined.

Stephen E Arnold, November 22, 2022

Free and Useful OSINT Resource

November 8, 2022

For anyone interested in OSINT resources, here is a free eBook from low-profile intelware vendor Babel Street: Cybersecurity Insiders hosts “Open Source Intelligence (OSINT) Use Cases.” As the name implies, the volume describes practical applications of OSINT tools. The description reads:

“Businesses must be continually ready to mitigate all types of commercial and corporate risks. Some risks are known and easy to spot, but many are unknown and constantly evolving – your organization must be prepared to manage all of them or face serious consequences. A robust open-source intelligence (OSINT) platform is the answer. It combines publicly available information (PAI) data sources, with curated data streams, and filters to generate the actionable intelligence required to enhance protection or take action. This eBook includes twelve use cases exploring how OSINT tools help generate insights needed to drive improvements across:

  • Cyber risk management
  • Brand risk management
  • Operational risk management
  • Due diligence”

The book opens with a summary of how OSINT works and how it can be used. One notable use case is the very physical Event and Venue Protection under the otherwise BI-centered Brand Risk Management. One must provide some basic information before downloading the free resource, including name, company, email address, and phone number. Once registered with Cybersecurity Insiders, though, one has access to the site’s considerable roster of free cybersecurity resources. This includes another volume from Babel Street, “Best Practices for Using Publicly Available Information in Global Risk Management.” The firm’s clients include both government agencies and private enterprises. Not coincidentally, its AI analytics platform can assist organizations with the use cases described in the book. Based outside DC in Reston, Virginia, Babel Street was founded in 2009.

Cynthia Murrell, November 8, 2022

DYOR and OSINT Vigilantes

November 7, 2022

DYOR is an acronym used by some online investigators for “do your own research.” The idea is that open source intelligence tools provide information that can be used to identify bad actors. Obviously once an alleged bad actor has been identified, that individual can be tracked down. The body of information gathered can be remarkably comprehensive. For this reason, some law enforcement, criminal analysts, and intelligence professionals have embraced OSINT or open source intelligence as a replacement for the human-centric methods used for many years. Professionals understand the limitations of OSINT and of the intelware tools widely available on GitHub, in other open source software repositories, and from vendors. The most effective method for compiling information and doing data analysis requires subject matter experts, sophisticated software, and access to information from Web sites, third-party data providers, and proprietary information such as institutional knowledge.

If you are curious about representative OSINT resources used by some professionals, you can navigate to and click. The site will display one of my research team’s OSINT resources. The database the site pulls from contains more than 3,000 items which we update periodically. New, useful OSINT tools and services become available frequently. For example, in the work for one of our projects, we came across a useful open source tool related to Tor relays. It is called OrNetStats. I mention the significance of OSINT because I have been doing lectures about online research. Much of the content in those lectures focuses on open source and what I call OSINT blind spots, a subject few discuss.

The article “The Disturbing Rise of Amateur Predator-Hunting Stings: How the Search for Men Who Prey on Underage Victims Became a YouTube Craze” unintentionally showcases another facet of OSINT. Now anyone can use OSINT tools and resources to examine an alleged bad actor, gather data about an alleged crime, and pursue that individual. The cheerleading for OSINT has created a boom in online investigations. I want to point out that OSINT is not universally accurate. Errors can creep into data intentionally and unintentionally. Examples range from geo-spoofing to identifying the ultimate owner of an online business to content posted by an individual to discredit a person or business. Soft fraud (that is, criminal type actions which are on the edge of legality like selling bogus fashion handbags on eBay) is often supported by open source information which has been weaponized. One example is fake reviews of restaurants, merchants, products, and services.

I urge you to work through the cited article to get a sense of what “vigilantes” can do with open source information and mostly unfiltered videos and content on social media. I want to call attention to four facets of OSINT in the context of what the cited article calls “predator-hunting stings”:

First, errors and false conclusions are easy to reach. One example is identifying the place of business for an online service facilitating alleged online crime. Some services locate the place of business for some online actors in the middle of the Atlantic Ocean or on obscure islands with minimal technical infrastructure.

Second, information can be weaponized to make it appear that an individual is an alleged bad actor. Gig work sites allow anyone to spend a few dollars to have social media posts created and published. Verification checks are essentially non-existent. One doesn’t need a Russia- or China-system intelligence agency; one needs a way to hire part-time workers, usually at quite low rates. How does $5 sound?

Third, the buzz being generated about OSINT tools and techniques is equipping more people than ever before to become Sherlock Holmes in today’s datasphere. Some government entities are not open to vigilante inputs; others are. Nevertheless, hype makes it seem that anything found online is usable. Verification and understanding legal guidelines remain important. Even the most scrupulous vigilante may have difficulty getting the attention of some professionals, particularly government employees.

Fourth, YouTube itself has a wide range of educational and propagandistic videos about OSINT. Some of these are okay; others are less okay. Cyber investigators undergo regular, quite specific training in tools, sources, systems, and methods. The programs to which I have been exposed include references to legal requirements and policies which must be followed. Furthermore, OSINT – including vigilante-type inputs – has to be verified. In my lectures, I emphasize that OSINT information should be considered background until those data or the items of information have been corroborated.

What’s the OSINT blind spot in the cited article’s report? My answer is, “Verification and knowledge of legal guidelines are less thrilling than chasing down an alleged bad actor.” The thrill of the hunt is one thing; hunting the right thing is another. And hunting in the appropriate way is yet another.

DYOR is a hot concept. It is easy to be burned.

Stephen E Arnold, November 7, 2022
