AI Shocker? Automatic Indexing Does Not Work

May 8, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

I am tempted to dig into my more than 50 years of work in online and pull out a chestnut or two. I will not. Just navigate to “ChatGPT Is Powered by These Contractors Making $15 an Hour” and check out the allegedly accurate statements about the knowledge work a couple of people do.

The write up states:

… contractors have spent countless hours in the past few years teaching OpenAI’s systems to give better responses in ChatGPT.

The write up includes an interesting quote; to wit:

“We are grunt workers, but there would be no AI language systems without it,” said Savreux [an indexer tagging content for OpenAI].

I want to point out a few items germane to human indexers based on my experience with content about nuclear information, business information, health information, pharmaceutical information, and “information” information which thumbtypers call metadata:

  1. Human indexers, even when trained in the use of a carefully constructed controlled vocabulary, make errors, become fatigued and fall back on some favorite terms, and misunderstand the content and assign terms which will mislead when used in a query
  2. Source content — regardless of type — varies widely. New subjects or different spins on what seem to be known concepts mean that important nuances may be lost due to what is included in the available dataset
  3. New content often uses words and phrases which are difficult to understand. I try to note a few of the more colorful “new” words and bound phrases like softkill, resenteeism, charity porn, toilet track, and purity spirals, among others. In order to index a document in a way that allows one to locate it, knowing the term is helpful if there is a full text instance. If not, one needs a handle on the concept, which is an index term a system or a searcher knows to use. Relaxing the meaning (a trick of some clever outfits with snappy names) is not helpful
  4. Creating a training set, keeping it updated, and assembling the content artifacts is slow, expensive, and difficult. (That’s why some folks have been seeking short cuts for decades. So far, humans still become necessary.)
  5. Reindexing, refreshing, or updating the digital construct used to “make sense” of content objects is slow, expensive, and difficult. (Ask an Autonomy user from 1998 about retraining in order to deal with “drift.” Let me know what you find out. Hint: The same issues arise from popular mathematical procedures no matter how many buzzwords are used to explain away what happens when words, concepts, and information change.)
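One rough way to see the "new words" and "drift" problem from items 3 and 5 is to measure how many terms in incoming content are unknown to the existing index vocabulary. The sketch below is only an illustration: the function name, the tiny vocabulary, and the sample document are all made up for this example, and a production system would need real tokenization and phrase handling.

```python
def out_of_vocabulary_rate(known_terms, document_terms):
    """Return the fraction of distinct document terms missing from the vocabulary."""
    distinct = {term.lower() for term in document_terms}
    if not distinct:
        return 0.0
    known = {term.lower() for term in known_terms}
    unknown = distinct - known
    return len(unknown) / len(distinct)


# Hypothetical index vocabulary and a hypothetical new document.
vocabulary = {"metadata", "indexing", "pharmaceutical", "nuclear"}
new_document = ["softkill", "resenteeism", "metadata", "indexing"]

rate = out_of_vocabulary_rate(vocabulary, new_document)
print(f"{rate:.0%} of the distinct terms are new to the vocabulary")
```

A rising rate over time is one hint that the vocabulary or training set needs the slow, expensive refresh described above.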

Are there other interesting factoids about dealing with multi-type content? Sure there are. Wouldn’t it be helpful if those creating the content applied structure tags, abstracts, lists of entities and their definitions within the field or subject area of the content, and pointers to sources cited in the content object?

Let me know when blog creators, PR professionals, and TikTok artists embrace this extra work.

Pop quiz: When was the last time you used a controlled vocabulary classification code to disambiguate airplane terminal, computer terminal, and terminal disease? How does smart software do this, pray tell? If the write up and my experience are on the same wave length (not surfing wave but frequency wave), a subject matter expert, a trained index professional, or software smarter than today’s smart software is needed.
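To make the pop quiz concrete, here is a minimal sketch of what a classification code buys you. Everything in it is an assumption for illustration: the codes, the cue words, and the function name are invented, not taken from any real controlled vocabulary. The naive cue-word scoring stands in for the judgment a trained indexer supplies.

```python
import re

# Hypothetical classification codes and cue words, invented for this example.
CONTROLLED_VOCABULARY = {
    "TRN.AIR.010": {"label": "airplane terminal",
                    "cues": {"airport", "gate", "flight", "boarding"}},
    "CMP.HW.020": {"label": "computer terminal",
                   "cues": {"keyboard", "login", "mainframe", "shell"}},
    "MED.DX.030": {"label": "terminal disease",
                   "cues": {"patient", "diagnosis", "hospice", "prognosis"}},
}


def classify_terminal(text):
    """Return the code whose cue words overlap the text most, or None if none match."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {code: len(entry["cues"] & words)
              for code, entry in CONTROLLED_VOCABULARY.items()}
    best_code = max(scores, key=scores.get)
    # No cue words at all: the text is ambiguous and needs a human indexer.
    return best_code if scores[best_code] > 0 else None


print(classify_terminal("The patient received a terminal diagnosis."))
print(classify_terminal("Meet me at the terminal."))
```

The second call returns None, which is the point: without context or a code assigned by someone who understands the content, the word alone does not say which terminal is meant.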

Stephen E Arnold, May 8, 2023

Digital Tech Journalism Killed by a Digital Elephant

May 4, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

I read a labored explanation, analysis, and rhetorical howl from Slate.com. The article is “Digital Media’s Original Sin: The Big Tech Bubble Burst and the News Industry Got Splattered with Shrapnel.” The article states:

For years, the tech industry has propped up digital journalism with advertising revenue, venture capital injections, and far-reaching social platforms.

My view is that the reason for the problem in digital tech journalism is the elephant. When electronic information flows, it acts in a way similar to water eroding soil. In short, flows of electronic information have what I call a “deconstructive element.” The “information business” once consisted of discrete platforms, essentially isolated by choice and by accident. Who in your immediate locale pays attention to the information published in the American Journal of Mathematics? Who reads Craigslist for listings of low-ball vacation rentals near Alex Murdaugh’s “estate”?

Convert this content to digital form and dump the physical form of the data. Then live in a dream world in which those who want the information will flock to a specific digital destination and pay big money for the one story or the privilege of browsing information which may or may not be accurate. Slate points out that it did not work out.

But what’s the elephant? Digital information to people today is like water to the goldfish in a bowl. It is just there.

The elephant was spawned by a few outfits which figured out that companies would pay money to put content in front of eyeballs. The elephant grew and developed new capabilities; for example, the “pay to play” model of GoTo.com morphed into Overture.com and became something Yahoo.com thought would be super duper. However, the Google was inspired by “pay to play” and had the technical ability to create a system for creating a market from traffic, charging people to put content in front of the eyeballs, and charging anyone in the enabling chain money to use the Google system.

The combination of digital flows’ deconstructive operation plus the quasi-monopolization of online advertising dealt lethal blows to the crowd Slate addresses. Now the elephant has morphed again, and it is stomping around in the space defined by TikTok. A visual medium with advertising poses a threat to the remaining information producers as well as to Google itself.

The elephant is not immortal. But right now no group is armed with Mossberg Patriot Laminate Marinecotes and the skill to kill the elephant. Electronic information gulping advertising revenue may prove to be harder to kill than a cockroach. Maybe that’s why most people ask, “What elephant?”

Stephen E Arnold, May 4, 2023

Libraries: Who Needs Them? Perhaps Everyone

May 3, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

How dare libraries try to make the works they purchase more easily accessible to their patrons! The Nation ponders, “When You Buy a Book, You Can Loan It to Anyone. This Judge Says Libraries Can’t. Why Not?” The case was brought before the U.S. District Court in Manhattan by four publishers unhappy with the Internet Archive’s (IA) controlled digital lending (CDL) program. We learn the IA does plan to appeal the decision. Writer Michelle M. Wu explains:

“At issue was whether a library could legally digitize the books it already owned and lend the digital copies in place of the print. The IA maintained that it could, as long as it lent only the same number of copies it owned and locked down the digital copies so that a borrower could not copy or redistribute them. It would be doing what libraries had always done, lend books—just in a different format. The publishers, on the other hand, asserted that CDL infringed on authors’ copyrights, making unauthorized copies and sharing these with libraries and borrowers, thereby depriving the authors and publishers of rightful e-book sales. They viewed CDL as piracy. While Judge John G. Koeltl’s opinion addressed many issues, all his reasoning was based on one assumption: that copyright primarily is about authors’ and publishers’ right to profit. Despite the pervasiveness of this belief, the history of copyright tells us something different.”

Wu recounts copyright’s evolution from a means to promote the sharing of knowledge to a way for publishers to rake in every possible dime. The shift was driven by a series of developments in technology. In the 1980s, the new ability to record content to video tape upset Hollywood studios. Apparently, being able to (re)watch a show after its initial broadcast was so beyond the pale a lawsuit was required. Later, Internet-based innovations prompted more legal proceedings. On the other hand, tools evolved that enabled publishers to enforce their interpretation of copyright, no judicial review required. Wu asserts:

“Increasing the impact on the end user, publishers—not booksellers or authors—now control prices and access. They can charge libraries multiple times what they charge an individual and bill them repeatedly for the same content. They can limit the number of copies a library buys, or even refuse to sell e-books to libraries at all. Such actions ultimately reduce the amount of content that libraries can provide to their readers.”

So that is how the original intention of copyright law has been turned on its head. And how publishers are undermining the whole purpose of libraries, which are valiantly trying to keep pace with technology. Perhaps the IA will win its appeal and the valuable CDL program will be allowed to continue. Either way, their litigious history suggests publishers will keep fighting for control over content.

Cynthia Murrell, May 3, 2023

TikTok: An App for Mind Control?

March 29, 2023

I read “TikTok Is Part of China’s Cognitive Warfare Campaign.” The write up is an opinion. Before I suggest that the write up is missing the big picture, let me highlight what I think sums up the argument:

While a TikTok ban may take out the first and fattest mole, it fails to contend with the wider shift to cognitive warfare as the sixth domain of military operations under way, which includes China’s influence campaigns on TikTok, a mass collection of personal and biometric data from American citizens and their race to develop weapons that could one day directly assault or disable human minds.

The problem for me is that I think the “mind control” angle is just one weapon in a specific application environment. The Middle Kingdom is working like Type A citizen farmers in these strike zones:

  1. Financial. The objective is to get on the renminbi bus and off the donkey cart dollar.
  2. Physical. The efficacy of certain pathogens is familiar to anyone who had an opportunity to wear a mask and stay home for a year or two.
  3. Political. The “deal” between two outstanding nation states in the Middle East is a signal I noted.
  4. Technological. The Huawei superwatch, the steady progress in microprocessor engineering, and those phone-home electric vehicles are significant developments.
  5. Social. Western democracies may not be embracing China-style methods, but some countries like India are definitely feeling the vibe for total control of the Internet.

The Guardian — may the digital overlords smile on the “real” news organization’s JavaScript which reminds me how many Guardian articles I have read since the “bug” was placed on my computer — gets part of the story correct. Hopefully the editors will cover the other aspects of the Chinese initiative.

TikTok, not the main event. Plus, it does connect to WiFi, Congressperson.

Stephen E Arnold, March 29, 2023

Negative News Gets Attention: Who Knew? Err. Everyone in TV News

March 21, 2023

I love academic studies. I have a friend who worked in television news in New York before he was lured to the Courier Journal’s video operation. I asked him how news was prioritized. His answer: “If it bleeds, it leads.” I think he told me this in 1980. I called him and asked when TV news producers knew about the “lead, bleed” angle. His answer, “Since the first ratings study.”

Now I know the decades old truism is — well — true. No film at 11 for this insight.

If you want a more professional analysis of my friend who grew up in Brooklyn, navigate to “Negativity Drives Online News Consumption.” Feel free to substitute any media type for “online.”

Here’s a statement I found interesting:

Online media is important for society in informing and shaping opinions, hence raising the question of what drives online news consumption.

Ah, who knew?

My takeaway from the write up is basic: If smart software ingests that which is online or in other media, that smart software will “discover” or “recurse” to the “lead, bleed” idea. Do I hear a stochastic parrot squawking? OSINT issue? Yep.

Stephen E Arnold, March 21, 2023

Google and Its Puzzles: Insiders Only, Please

December 26, 2022

ProPublica made available an article of some importance in my opinion. “Porn, Piracy, Fraud: What Lurks Inside Google’s Black Box Ad Empire” walks through the intentional, quite specific engineering of its crucial advertising system to maximize revenue and befuddle (is “defraud” a synonym?) advertisers. I was asked more than a decade ago to do a presentation of my team’s research into Google’s advertising methodology. I declined. At that time, I was doing some consulting work for a company I am not permitted to name. That contract stipulated that I would not talk about a certain firm’s business technologies. I signed because… money.

The ProPublica essay does the revealing about what is presented as a duplicitous, underhanded, and probably illegal business process subsystem. I don’t have to present any of the information I have gathered over the years. I can cite this important article and point out several rocks which the capable writers at ProPublica either did not notice or flipped over and concluded, “Nah, nothing to see here.”

I urge you to do two things. First, read the ProPublica write up. Number Two: Print it out. My hunch is that it may be disappeared or become quite difficult to find at some point in the future. Why? Ah, grasshopper, that is a question easily answered by the managers who set up Foundem and who were stomped by Googzilla. Alternatively you could chase down a person at the French government tax authority and ask, “Why were French tax forms not findable via a Google search for several years?” These individuals might have the information you need. Shifting gears: Ask Magix, the software company responsible for Sony Vegas, why cracks for the software appear in YouTube videos. If you use your imagination, you will come up with ideas for gathering first person information about the lovable online advertising company’s systems and methods. Hint: Look up Dr. Timnit Gebru and inquire about her interactions with one of Google’s chief scientists. I guarantee that a useful anecdote will bubble up.

So what’s in the write up? Let me highlight a main point and then cite a handful of interesting statements in the article.

What is the main point? In my opinion, ProPublica’s write up says, “The GOOG maximizes its return at the expense of the advertisers and of the users.”

Who knew? Not me. I think the Alphabet Google YouTube DeepMind outfit is the most wonderfulest company in the world. Remember: You heard this here first. I have a priceless Google mouse pad too.

Consider these three statements from the essay. First, Google lingo is interesting:

Google spokesperson Michael Aciman said the company uses a combination of human oversight, automation and self-serve tools to protect ad buyers and said publisher confidentiality is not associated with abuse or low quality.

The idea is that Google is interested in using a hybrid method to protect ad buyers. Plus there is a difference between publishers and confidentiality. I find it interesting that instead of talking about [a] the ads themselves (porn, drugs, etc.), [b] the buyers of advertising which is a distinct industry dependent upon Google for revenue, [c] the companies who want to get their message in front of people allegedly interested in the product or service, or [d] the user of search or some other Google service, Google wants to “protect ad buyers.” And what about the others I have identified? Google doesn’t care. Logical, sure, but doesn’t Google have the other entities in mind? That’s a question regulators should have asked and had answered after Google settled the litigation with Yahoo over advertising technology, at the time of Google’s acquisition of Oingo (Applied Semantics), or at the time Google acquired DoubleClick. In my opinion, much of the ProPublica write up operates in a neverland of weird Google speak, not the reality of harvesting money from those largely in the dark about what’s happening in the business processes.

Second, consider this statement:

we matched 70% of the accounts in Google’s ad sellers list to one or more domains or apps, more than any dataset ProPublica is aware of. But we couldn’t find all of Google’s publisher partners. What we did find was a system so large, secretive and bafflingly complex that it proved impossible to uncover everyone Google works with and where it’s sending advertisers’ money.

The passage seems to suggest that Google’s engineers went beyond clever and ventured into the murky acreage of intentional obfuscation. It seems as if Google wanted to be able to consume advertising budgets without any entity having the ability to determine [a] if the ad were displayed in a suitable context; that is, did the advertiser’s message match the needs of the user to whom the ad was shown. And [b] was the ad appropriate even if it contained words and phrases on Google’s unofficial stop word lists. (If you have not seen these, send an email to benkent2020 at yahoo dot com and one of my team will email you some of the more interesting words that Google’s somewhat lax processes will definitely try to block. If a word is not on a Google stop list, then the messages will probably be displayed. Remember: As Google terminates six percent of its staff, some of those humans presumably will not be able to review ads per item one above.) And [c] note the word “bafflingly”. The focus of much Google engineering over the last 15 years has been to build competitive barriers, extend the monopoly function with “partners”, and double talk in order to keep regulators and curious Congressional people away. That’s my take on this passage.

Now for the third passage I will cite:

…we uncovered scores of previously unreported peddlers of pirated content, porn and fake audiences that take advantage of Google’s lax oversight to rake in revenue.

I don’t need to say much more about this statement than look at and think about pirated content (copyright), porn (illegal content in some jurisdictions) and fake audiences (cyber fraud). Does this statement suggest that Google is a criminal enterprise? That’s a good question.

I have some high level observations about this excellent article in ProPublica. I offer these in the hope that ProPublica will explore some of these topics or an enterprising graduate student will consider the statements and do some digging.

  1. Why is Google unable to manage its staff? This is an important question because the ad behaviors described in the ProPublica article are the result of executive compensation plans and incentives. Are employees rewarded for implementing operations that further “soft” fraud or worse?
  2. How will Google operate in a more fragmented, more regulated environment? Is one possible behavior a refusal to modify the guiding hand of compensation and incentive programs away from generating more and more money within external constraints? My hunch is that Google will do whatever is necessary to build its revenue.
  3. What mechanisms exist or will be implemented to keep Google’s automated systems operating in a legal, ethical way?

Net net: Finally, after decades of craziness about how wonderful Googzilla is, more critical research is appearing. Is it too little and too late? In my view, yes.

Stephen E Arnold, December 26, 2022

The Internet: Cue the Music. Hit It, Regrets, I Have Had a Few

December 21, 2022

I have been around online for a few years. I know some folks who were involved in creating what is called “the Internet.” I watched one of these luminaries unbutton his shirt and display a tee with the message, “TCP on everything.” Cute, cute, indeed. (I had the task of introducing this individual only to watch the disrobing and the P on everything joke. Tip: It was not a joke.)

Imagine my reaction when I read “Inventor of the World Wide Web Wants Us to Reclaim Our Data from Tech Giants.” The write up states:

…in an era of growing concern over privacy, he believes it’s time for us to reclaim our personal data.

Who wants this? Tim Berners-Lee and a startup. Content marketing or a sincere effort to derail the core functionality of ad trackers, beacons, cookies which expire in 99 years, etc., etc.

The article reports:

Berners-Lee hopes his platform will give control back to internet users. “I think the public has been concerned about privacy — the fact that these platforms have a huge amount of data, and they abuse it,” he says. “But I think what they’re missing sometimes is the lack of empowerment. You need to get back to a situation where you have autonomy, you have control of all your data.”

The idea is that Web 3 will deliver a different reality.

Do you remember this lyric:

Yes, there were times I’m sure you knew
When I bit off more than I could chew
But through it all, when there was doubt
I ate it up and spit it out
I faced it all and I stood tall and did it my way.

The my becomes big tech, and it is the information highway. There’s no exit, no turnaround, and no real chance of change before I log off for the final time.

Yeah, digital regrets. How’s that working out at Amazon, Facebook, Google, Twitter, and Microsoft among others? Unintended consequences and now the visionaries are standing tall on piles of money and data.

Change? Sure, right away.

Stephen E Arnold, December 21, 2022

TikTok Explained without Mentioning Regulation and US Education Failings

December 19, 2022

I am not into TikTok. I enjoy reading analyses of TikTok by individuals who are not engaged in law enforcement, crime analysis, and intelligence work for the US and its allies. Most of these deep dives are entertaining because they miss the obvious: Hoovering data from users for strategic and tactical information weaponization and information operations. I assume that makes me a party pooper, particularly among those who are into the mobile experience. I recall laughing out loud when I listened to a podcast featuring a Silicon Valley news type explaining that TikTok was no big deal. Ho ho ho.

I read this morning (December 17, 2022, 5:30 am US Eastern) “TikTok’s Secret Sauce.” The write up explains insights gleaned from “a project studying algorithmic amplification and distortion.” Quotes from the write up are in italic to differentiate them from my comments.

I learned:

… the average ratio of hearts to views on TikTok is roughly 5%. People are just not that predictable.

Okay, people are not predictable. May I suggest spending some time with the publicly available information on the Recorded Future Web site? Google and In-Q-Tel were early supporters of this company. The firm’s predictive analytics rely, in part, on the idea that people are creatures of habit. Useful information emerges from these types of analyses. In fact, most intelware does, and this includes specialists in other countries, including some not allied with the US.

I learned:

Exploration explains why there are an unending variety of incredibly weird niches on TikTok: the app manages to connect those creators to their niche audiences.

Let’s think in terms of unarticulated needs and desires. TikTok makes it possible for that which is not stated to emerge from user behavior. Feedback ensures that skinny girls and diets that deliver thinness get in front of certain individuals. Feedback is good and finding content that reveals more of the user’s psychographic footprint is useful. Why? Manipulation, identification of individuals with certain behavior fingerprints, and amplification of certain messaging. Yep, useful.

I learned:

More generally, in AI applications, the sophistication of the algorithm is rarely the limiting factor.

Interesting. Perhaps the function of TikTok is just obvious. It is, in my opinion, so obvious that it is overlooked. In high school more than a half century ago, I recall our class having to read “The Purloined Letter” by that sporty writer Edgar Allan Poe. The main idea is that the obvious is overlooked.

In some countries — might TikTok’s home base be an example — certain actions are obvious and then ignored or misunderstood. TikTok is that type of product. Now, after years of availability, experts are asking questions and digging into the service.

The limiting factor is a failure to understand how online information and services can be weaponized, deliver directed harm, and be viewed as a harmless time waster. Is it too late? Maybe not, but I get a kick out of the reactions of experts to what is as clear and straightforward as driving a vehicle over a mostly clueless pedestrian or ordering spicy regional cuisine without understanding the concept of hot.

Stephen E Arnold, December 19, 2022

A Paradox at the Center of the Internet: No Big Deal

December 2, 2022

The Internet is a mess, but compared to how it was in its early decades it is way more organized. The organization of the Internet is called centralization. Gordon Brander of Unconscious wants the Internet to be decentralized. He says that will happen only after it first becomes more centralized; read his explanation in “Centralization Is Inevitable.” Brander says that the best way to understand the benefits of decentralization is to understand how centralization first happens.

While there are many ways to map centralization, the Internet is concentrated into different hubs or a scale-free network. The best way to define a scale-free network is:

“The defining characteristic of scale-free networks is a power law distribution with a long tail. A small number of nodes with an extremely large number of links, and an extremely large number of nodes with a small number of links. Think Twitter. Most users have a few followers, while a few influencers have millions. This power law distribution grants the biggest hubs a lot of power over the network. It also makes hubs important to the functioning of the network in ways that are not immediately obvious, like keystone species in an ecology.”

These networks emerge because of preferential attachment, or “the rich-get-richer” scenario. Users prefer a hub/network, ergo it will receive more attention, trust, users, etc. Scale-free networks are also more efficient, because the paths between nodes are shorter.

Another advantage is that they are resilient to random failure, i.e. if one ordinary node fails, the rest of the system continues to run. The same structure also makes these networks more vulnerable to targeted attacks, because a well-placed virus that knocks out the biggest hubs could take down much of the network.
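The rich-get-richer mechanism is easy to try out. The following is a minimal sketch of preferential attachment, assuming the simplest possible setup: each new node adds exactly one link, and picks its target with probability proportional to the target's current number of links. The function name and parameters are illustrative only and are not taken from Brander's essay.

```python
import random
from collections import Counter


def grow_network(num_nodes, seed=0):
    """Grow a network by preferential attachment and return each node's link count."""
    rng = random.Random(seed)
    # Each node appears in this list once per link end it owns, so a uniform
    # pick from the list selects a node in proportion to its degree.
    endpoints = [0, 1]  # start with a single link between node 0 and node 1
    degree = Counter({0: 1, 1: 1})
    for new_node in range(2, num_nodes):
        target = rng.choice(endpoints)
        degree[new_node] += 1
        degree[target] += 1
        endpoints.extend([new_node, target])
    return degree


degree = grow_network(2000)
nodes_with_one_link = sum(1 for d in degree.values() if d == 1)
print("largest hub has", max(degree.values()), "links")
print("nodes with a single link:", nodes_with_one_link, "of", len(degree))
```

Running it shows the long tail from the quoted passage: most nodes keep a single link while a few early nodes grow into hubs. Removing a random node rarely matters; removing the biggest hub disconnects many nodes at once.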

Brander ends his spiel by stating the centralization and decentralization of the Internet is the circle of life: random start-ups, exponential growth, consolidation, collapse, then repeat. Someone cue The Lion King’s opening song!

Whitney Grace, December 2, 2022

WikiLeaks: Oh, Oh, Some Folks Are Not Happy

December 1, 2022

I read “WikiLeaks Website Is Struggling to Stay Online—As Millions of Documents Disappear.” If the write up is on the money, one lesson from this alleged cancel culture action is to hit the Print to PDF and save a document. Assuming that online is forever is one of those weird misperceptions many online users have. Nope.

The write up says:

WikiLeaks’ website appears to be coming apart at the seams, with more and more of the organization’s content unavailable without explanation. WikiLeaks technical issues, which have been ongoing for months, have gotten worse in recent weeks as increasingly larger portions of its website no longer function.

The write up points out:

Although WikiLeaks long boasted that it released more than 10 million documents in 10 years, at current, less than 3,000 documents remain accessible, according to an analysis by the Daily Dot of the website’s leaks archive.

What’s interesting is that no one has claimed responsibility for hitting the delete key. What I find interesting is that the site has been online for many years. Now here’s a question, “Who could have taken this action?” Microsoft would say that it was 1,000 engineers working for a nation state. Others might say, “Oh, just a technical glitch.” A few might say, “Teens fooling around?” Does this list exhaust the possibilities?

Stephen E Arnold, December 1, 2022
