Shall We Train Smart Software on Scientific Papers? That Is an Outstanding Idea!

May 29, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

I read “Fake Scientific Papers Are Alarmingly Common. But New Tools Show Promise in Tackling Growing Symptom of Academia’s Publish or Perish Culture.” New tools sound great. Navigate to the cited document to get the “real” information.

MidJourney’s representation of a smart software system ingesting garbage and outputting garbage.

My purpose in mentioning this article is to ask a question:

In the last five years, how many made-up, distorted, or baloney-filled journal articles have been produced?

The next question is,

How many of these sci-fi confections of scholarly research have been identified and discarded by the top smart software outfits like Facebook, Google, OpenAI, et al.?

Let’s assume that 25 percent of the journal content is fakery.

A question I have is:

How does faked information impact the outputs of the smart software systems?

I can anticipate some answers; for example, “Well, there are a lot of papers, so the flawed papers will represent a small portion of the intake data.” The law of large numbers or some statistical jibber jabber will try to explain away erroneous information. Remember: bad information is part of the human landscape. Does this mean smart software is a mirror of errors?

Do smart software outfits remove flawed information? If the peer review process cannot, what methods are the smart outfits using? Perhaps these companies should decide what’s correct and what’s incorrect? That sounds like a Googley-type idea, doesn’t it?

And finally, the third question about the impact of bad information on smart software “outputs” has an answer. No, it is not marketing jargon or a recycling of Google’s seven wonders of the AI world.

The answer, in my opinion, is garbage in and garbage out.

But you knew that, right?

Stephen E Arnold, May 29, 2023

OSINT Analysts Alert: Biases Distilled to a One Page Cheat Sheet

March 20, 2023

“Toward Parsimony in Bias Research: A Proposed Common Framework of Belief-Consistent Information Processing for a Set of Biases” is an academic write up. Usually I ignore these for two reasons: [a] the documents are content marketing designed to get a grant or further a career and [b] the results are non-reproducible.

The write up, despite my skepticism of real researchers, contains one page which I think is a useful checklist of the pitfalls some people may happily [a] tumble into, [b] live in, and [c] actively seek out.

I know this image is unreadable, but I wanted to provide it with a hyperlink so you can snag the image and the full document:

[Image: the paper’s one-page chart of belief-consistent biases]

Excellent work.

Stephen E Arnold, March 20, 2023

How about This Intelligence Blindspot: Poisoned Data for Smart Software

February 23, 2023

One of the authors is a Googler. I think this is important because the Google is into synthetic data; that is, machine-generated information for training large language models or what I cynically refer to as “smart software.”

The article / maybe reproducible research is “Poisoning Web-Scale Training Datasets Is Practical.” Nine authors, of whom four are Googlers, have concluded that a bad actor, government, rich outfit, or crafty students in Computer Science 301 can inject information into content destined to be used for training. How can this be accomplished? The answer is either by humans, ChatGPT outputs from an engineered query, or a combination. Why would someone want to “poison” Web accessible or thinly veiled commercial datasets? Gee, I don’t know. Oh, wait, how about controlling information and framing issues? Nah, who would want to do that?
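To make the point concrete, consider a deliberately crude toy, not the paper’s technique: a “model” that answers a question by majority vote over its training documents. Flood the corpus with fabricated assertions, and the answer flips. (A minimal Python sketch; the “fact” is invented for illustration.)

    from collections import Counter

    # Toy corpus: documents asserting a "fact". A model that answers by
    # majority vote over its training data is only as good as that data.
    corpus = ["gene X causes disease: no"] * 60 + ["gene X causes disease: yes"] * 10

    def answer(docs):
        # The "model": majority vote over the final token of each document.
        votes = Counter(doc.rsplit(" ", 1)[-1] for doc in docs)
        return votes.most_common(1)[0][0]

    print("before poisoning:", answer(corpus))          # -> no
    poison = ["gene X causes disease: yes"] * 80        # fabricated documents
    print("after poisoning:", answer(corpus + poison))  # -> yes

Real training pipelines are vastly more complicated, but the arithmetic of “whoever writes the most documents wins” does not go away.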

With more than one-third of the authors bringing that Google goodness, the paper concludes… no, wait. There are no conclusions. Also, there are no end notes. What there is, is a road map explaining the mechanism for poisoning.

One key point for me is the question, “How is poisoning related to the use of synthetic data?”

My hunch is that synthetic data are more easily manipulated than publicly accessible data; poisoning the latter is time and resource intensive. The synthetic data angle also makes it more difficult to identify manipulations introduced when a synthetic data set is generated and then mingled with “live” or allegedly real data.

Net net: Open source information and intelligence may have a blind spot because it is not easy to determine what’s right, accurate, appropriate, correct, or factual. Are there implications for smart machine analysis of digital information? Yep. In my opinion, already flawed systems will become less reliable, and users may not know why.

Stephen E Arnold, February 23, 2023

What Happens When Misinformation Is Sucked Up by Smart Software? Maybe Nothing?

February 22, 2023

I noted an article called “New Research Finds Rampant Misinformation Spreading on WhatsApp within Diasporic Communities.” The source is the Daily Targum. I mention this because the news source is the Rutgers University Campus news service. The article provides some information about a study of misinformation on that lovable Facebook property WhatsApp.

Several points in the article caught my attention:

  1. Misinformation on WhatsApp caused people to be killed; Twitter did its part too
  2. There is an absence of fact checking
  3. There are no controls to stop the spread of misinformation

What is interesting about studies conducted by prestigious universities is that often the findings are neither novel nor surprising. In fact, nothing about social media companies’ reluctance to spend money or implement ethical safeguards is new.

What are the consequences? Nothing much: Abusive behavior, social disruption, and, oh, one more thing, deaths.

Stephen E Arnold, February 22, 2023

Secrets Patterns Database

February 15, 2023

One of my researchers called my attention to “Secrets Patterns Database.” For those interested in finding “secrets”, you may want to take a look. The data and scripts are available on GitHub… for now. Among its features are:

  • “Over 1600 regular expressions for detecting secrets, passwords, API keys, tokens, and more.
  • Format agnostic. A single format that supports secret detection tools, including Trufflehog and Gitleaks.
  • Tested and reviewed regular expressions.
  • Categorized by confidence levels of each pattern.
  • All regular expressions are tested against ReDos attacks.”

Links to the author’s Web site and LinkedIn profile appear in the GitHub notes.
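For the curious, here is a minimal Python sketch of putting such a pattern file to work. The YAML layout assumed here (entries with name, regex, and confidence nested under a pattern key) and the file names are assumptions; check the repository for the authoritative schema.

    import re
    import yaml  # pip install pyyaml

    with open("rules-stable.yml") as fh:  # hypothetical local copy of the database
        db = yaml.safe_load(fh)

    rules = []
    for entry in db.get("patterns", []):
        p = entry["pattern"]
        try:
            rules.append((p["name"], re.compile(p["regex"]), p["confidence"]))
        except re.error:
            pass  # some patterns target other regex engines and may not compile

    text = open("config_dump.txt").read()  # hypothetical file to scan
    for name, regex, confidence in rules:
        for match in regex.finditer(text):
            print(f"[{confidence}] {name}: {match.group(0)[:40]}")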

Stephen E Arnold, February 15, 2023

Datasette: Useful Tool for Crime Analysts

February 15, 2023

If you want to explore data sets, you may want to take a look at Datasette, an “open source multi-tool for exploring and publishing data.”

The company says,

It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of 42 tools and 110 plugins dedicated to making working with structured data as productive as possible.
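A minimal way to kick the tires: build a small SQLite file and point Datasette at it. The sample table below is hypothetical; Datasette itself installs with pip.

    import sqlite3

    # Build a small SQLite database for Datasette to serve.
    conn = sqlite3.connect("incidents.db")
    conn.execute("CREATE TABLE IF NOT EXISTS incidents (id INTEGER PRIMARY KEY, city TEXT, kind TEXT)")
    conn.executemany("INSERT INTO incidents (city, kind) VALUES (?, ?)",
                     [("Louisville", "fraud"), ("Lexington", "phishing")])
    conn.commit()
    conn.close()

    # Then, from a terminal:
    #   pip install datasette
    #   datasette incidents.db
    # Datasette serves a browsable UI and JSON API (default: http://127.0.0.1:8001).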

A handful of demos are available. Worth a look.

Stephen E Arnold, February 15, 2023

Modern Research Integrity: Stunning Indeed

February 13, 2023

I read “The Rise and Fall of Peer Review.” The essay addresses what happens when a researcher submits a research paper to a research journal. Many “research” journals are owned by big professional publishing companies. If you are not familiar with that sector, think about a publishing club which markets to libraries and “research” institutions. No articles in “research” publications, no promotion. The method for determining accuracy is to ask experts to read submitted papers, make comments, and send a signal about the value of the “research.” I served on a peer review panel for a year and quit. I am no academic, but I know doo doo when it is on my shoe.

Now I want to focus on one passage. Consider this statement:

Why don’t reviewers catch basic errors and blatant fraud? One reason is that they almost never look at the data behind the papers they review, which is exactly where the errors and fraud are most likely to be. In fact, most journals don’t require you to make your data public at all. You’re supposed to provide them “on request,” but most people don’t. That’s how we’ve ended up in sitcom-esque situations like ~20% of genetics papers having totally useless data because Excel autocorrected the names of genes into months and years. (When one editor started asking authors to add their raw data after they submitted a paper to his journal, half of them declined and retracted their submissions. This suggests, in the editor’s words, “a possibility that the raw data did not exist from the beginning.”)
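As an aside, the Excel gene-name mangling is easy to spot mechanically. Here is a minimal Python sketch of the check a reviewer could run on a supplementary table; the file and column names are hypothetical.

    import csv
    import re

    # Flags gene-symbol cells that look like Excel date mangling, e.g. the
    # gene SEPT2 silently becoming "2-Sep" or MARCH1 becoming "1-Mar".
    DATE_LIKE = re.compile(
        r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$", re.IGNORECASE)

    def suspicious_cells(path, column):
        with open(path, newline="") as fh:
            for row_num, row in enumerate(csv.DictReader(fh), start=2):
                cell = (row.get(column) or "").strip()
                if DATE_LIKE.match(cell):
                    yield row_num, cell

    for row_num, cell in suspicious_cells("supplementary_table.csv", "gene_symbol"):
        print(f"row {row_num}: {cell!r} looks like a date, not a gene symbol")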

Observations:

  1. There is exactly one commercial database which added corrections to its entries. Why? Accuracy is expensive and most publishers are not into corrections. I think the feature of that database has been in the trash heap for many, many years. The outfit which bought the database is not into excellence in anything but revenue and profit.
  2. I found it impossible to get access to [a] the author to whom I wanted to address a question directly; that is, on the telephone, or [b] the data behind the crazy statistical hoops. Hey, math is not the key differentiator for many researchers; getting tenure and grants are the prime movers. A peer reviewer with pointed questions? Sorry, no way.
  3. The professional publishers want to follow a process which shifts responsibility for publishing error-filled articles to the “procedure”, the peer reviewers, the editors, and probably the stray dog outside their headquarters. Everyone is responsible for mistakes except them.

Net net: Perhaps the notion of open source accuracy needs to be expanded beyond tweets and Facebook posts?

Stephen E Arnold, February 13, 2023

Easy Monitoring for Web Site Deltas

February 9, 2023

We have been monitoring selected Clear Web pages for a research project. We looked at a number of solutions and turned to VisualPing.io. The system is easy to use. Enter the URL of the Web page for which you want a notification of a delta (change). Enter an email, and the system will send you a notification. The service is free if you want to monitor five Web pages per day. The company has a pricing FAQ which explains the cost of more notifications. The VisualPing service assumes a user wants to monitor the same Web site or sites on a continuous basis. In order to kill monitoring for one site, a bit of effort is required. Our approach was to have a different team member sign up and enter the revised monitor list. There may be an easier way, but without an explicit method, a direct solution worked for us.
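If a hosted service is not an option, a crude do-it-yourself monitor is easy to sketch in Python. The URL below is hypothetical, and note that pages which change on every load (ads, timestamps) will trigger false deltas.

    import hashlib
    import time
    import urllib.request

    URL = "https://example.com/page-to-watch"  # hypothetical target

    def fingerprint(url: str) -> str:
        # Hash the raw page bytes; any change to the HTML changes the hash.
        with urllib.request.urlopen(url) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    baseline = fingerprint(URL)
    while True:
        time.sleep(3600)  # poll hourly
        current = fingerprint(URL)
        if current != baseline:
            print("Delta detected:", URL)  # swap in an email or webhook alert here
            baseline = current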

Stephen E Arnold, February 9, 2023

Interesting Search Tool: Tumbex

December 13, 2022

Interest in Open Source Intelligence has crossed what I call the Murdoch Wall Street Journal threshold. My MWSJ threshold is that a topic, person, or idea bubbles along for a period of time, in this instance, decades. OSINT was a concept discussed by a number of people in the 1980s. In fact, one advocate, a former Marine Corps officer and government professional, organized open source intelligence conferences decades ago. That’s dinobaby history, and I know that few “real news” people remember Robert David Steele or his concepts about open source in general or OSINT in particular. (If you are curious about the history, email the Beyond Search team at benkent2020 @ yahoo dot com. Why? I participated in Mr. Steele’s conferences for many years, and we worked on a number of open source projects for a range of clients until shortly before his death in August 2021.) Yep, history. Sometimes knowing about events can be helpful.

Let’s talk about online information; specifically, an OSINT tool available since 2014 if my memory is working this morning. The tool is called Tumbex. With it, one can search Tumblr content.


Here’s what the Web site says:

Tumbex indexes only tumblr posts which have caption or tags. We analyse the content and define if tumblr or posts are nsfw/adult. If your tumblr was detected as nsfw by mistake, you can request a review and we will manually check your tumblr.

This is interesting. However, with a bit of query testing one can find some quite sporty content on the service.

The service, which allegedly became available in 2014, is hosted by the French outfit OVH. According to StatShow, Tumbex has experienced a jump in traffic. The site is not particularly low profile because it has a user base of an estimated one million humans or bots. (Please keep in mind that click data are often highly suspect regardless of source.) FYI: StatShow can be a useful OSINT resource as well.

If you are interested in some of the OSINT resources my team relies upon, navigate to www.osintfix.com. Click the image and a new window will open with an OSINT resource displayed. No ads, no trackers, no editorial. Just an old-fashioned 1994 Web site which can be used to fill an idle moment.

Now that the MWSJ threshold has been crossed, OSINT is a thing, an almost-overnight success with some youthful experts emphasizing that the US government has been asleep at the switch. I am not sure that assessment is one I can fully support.

Stephen E Arnold, December 13, 2022

OSINT: HandleFinder

November 22, 2022

If you are looking for a way to identify a user “handle” on various social networks, you may want to take a look at HandleFinder. The service appears to be offered without a fee. The developer does provide a “Buy Me a Coffee” link, so you can support the service. The service accepts a user name. We used our favorite ageing teen screen name ibabyrainbow or babyrainbow on some lower profile services. HandleFinder returned 31 results on our first query. (We ran the query a second time, and the system returned 30 results. We found this interesting.)

The services scanned included Patreon, TikTok, and YouTube, among others. The service did not scan the StreamGun video on demand service or NewRecs.

In order to examine the results, one clicks on the service name, which is underlined. Note that once one clicks the link, the result set is lost. We found that the link should be opened in a separate tab or window to eliminate the need to rerun the query after each click. That’s how one of my team discovered the count variance.

When there is no result, the link in HandleFinder does not make this evident. Links to ibabyrainbow on Instagram returned “Page not found.” The result for Linktr.ee returned the Linktr.ee page of links, which means more clicking.
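This behavior is why naive status-code checks mislead. For the curious, here is a minimal Python probe; the profile URL templates are assumptions, some services block scripted requests, and a site that serves a 200 “Page not found” page will show up as a false positive.

    import urllib.error
    import urllib.request

    HANDLE = "ibabyrainbow"
    SITES = {  # hypothetical URL templates; services change these without notice
        "Patreon": "https://www.patreon.com/{}",
        "TikTok": "https://www.tiktok.com/@{}",
        "YouTube": "https://www.youtube.com/@{}",
    }

    for site, template in SITES.items():
        url = template.format(HANDLE)
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                status = resp.status
        except urllib.error.HTTPError as exc:
            status = exc.code  # e.g., 404 when the handle does not exist
        except urllib.error.URLError:
            status = None      # network problem; result unknown
        print(f"{site}: {'found' if status == 200 else 'not found'} ({status}) {url}")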

If you are interested in chasing down social media handles, you may want to check out this service. It is promising and hopefully will be refined.

Stephen E Arnold, November 22, 2022
