Problematic Smart Algorithms

December 12, 2023

This essay is the work of a dumb dinobaby. No smart software required.

We already know that AI is fundamentally biased when it is trained with bad or polluted data. Most of these biases are unintentional, stemming from ignorance on the part of the developers, i.e., a lack of diversity or vetted information. To improve the quality of AI, developers are relying on educated humans to help shape the data models. Not all AI projects are looking to fix their polluted data, and ZDNet says it is going to be a huge problem: “Algorithms Soon Will Run Your Life-And Ruin It, If Trained Incorrectly.”

Our lives are saturated with technology that has incorporated AI. Everything from a smartphone application to a digital assistant like Alexa or Siri uses AI. The article tells us about another type of biased data, and it is due to an ironic problem. The science team of Aparna Balagopalan, David Madras, David H. Yang, Dylan Hadfield-Menell, Gillian Hadfield, and Marzyeh Ghassemi worked on an AI project that studied how AI algorithms justified their predictions. The data model contained information from human respondents who provided different responses when asked to give descriptive or normative labels for data.

Descriptive labels concentrate on factual features, while normative labels involve value judgments against a rule. The team noticed the pattern, so they conducted another experiment with four data sets to test different policies. The study asked the respondents to judge an apartment complex’s policy about aggressive dogs against images of canines with normative or descriptive tags. The results were astounding and scary:

"The descriptive labelers were asked to decide whether certain factual features were present or not – such as whether the dog was aggressive or unkempt. If the answer was "yes," then the rule was essentially violated — but the participants had no idea that this rule existed when weighing in and therefore weren’t aware that their answer would eject a hapless canine from the apartment.

Meanwhile, another group of normative labelers were told about the policy prohibiting aggressive dogs, and then asked to stand judgment on each image.

It turns out that humans are far less likely to label an object as a violation when aware of a rule and much more likely to register a dog as aggressive (albeit unknowingly) when asked to label things descriptively.

The difference wasn’t by a small margin either. Descriptive labelers (those who didn’t know the apartment rule but were asked to weigh in on aggressiveness) had unwittingly condemned 20% more dogs to doggy jail than those who were asked if the same image of the pooch broke the apartment rule or not.”
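To make the gap concrete, here is a minimal sketch of the comparison the researchers ran, using hypothetical labels rather than the study’s data; the 20-point spread in the toy numbers mirrors the article’s finding:

    # Hypothetical labels for the same 10 dog photos: 1 = treated as a violation
    # (descriptive labelers were really answering "does the dog look aggressive?").
    descriptive = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]  # labelers unaware of the apartment rule
    normative   = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]  # labelers judging against the rule

    def violation_rate(labels):
        """Share of images that end up counted as rule violations."""
        return sum(labels) / len(labels)

    d, n = violation_rate(descriptive), violation_rate(normative)
    print(f"descriptive {d:.0%}, normative {n:.0%}, gap {d - n:.0%}")
    # descriptive 60%, normative 40%, gap 20%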

The conclusion is that AI developers need to spread the word about this problem and find solutions. This could be another fear mongering tactic like the Y2K implosion. What happened with that? Nothing. Yes, this is a problem but it will probably be solved before society meets its end.

Whitney Grace, December 12, 2023

NewsGuard, Now Guarding Podcasts

May 23, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

Advertising alongside false or biased information can be bad for a brand’s image, a problem that has obviously escalated in recent years. News vetting service NewsGuard saw a niche and promptly filled it. The firm has provided would-be advertisers with reliability ratings for websites and TV shows since 2018, and now includes podcasts in its appraisals. Podcast industry site Podnews shares the company’s press release, “NewsGuard Launches World’s First Journalist-Vetted Podcast Credibility Ratings to Help Advertisers.”

We learn NewsGuard is working with three top podcast platforms to spread the word to advertisers. The platforms will also use ratings to inform their recommendation engines and moderate content. The write-up explains:

“The podcast ratings include a trust score from 0-10, overall risk level, metadata fields, and a detailed written explanation of the podcast’s content and record of credibility and transparency. The ratings are used by brands and agencies to direct their ad spend toward highly trustworthy, brand-safe news podcasts while being protected from brand-safety and brand-suitability risks inherent in advertising on news and politics content. … NewsGuard determines which news and information podcasts to rate based on factors including reported engagement, estimated ad revenue, and the volume of news and information content in the podcast’s episodes. The podcasts rated by NewsGuard include those that cover topics including politics, current affairs, health, business, and finance. The journalists at NewsGuard assess news and information podcasts based on five journalistic criteria:

  • Does not regularly convey false, unchallenged information: 4 points
  • Conveys news on important topics responsibly: 3 points
  • Is not dominated by one-sided opinion: 1 point
  • Discloses, or does not have, a political agenda: 1 point
  • Differentiates advertising and commercial partnerships from editorial content: 1 point”
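Read literally, those criteria form a simple additive rubric. Here is a minimal sketch of how a 0-10 trust score could be tallied; the weights come from the press release, but which criteria a given podcast passes is invented for illustration:

    # Weights from the press release; the pass/fail judgments below are hypothetical.
    CRITERIA = {
        "no false, unchallenged information": 4,
        "covers important topics responsibly": 3,
        "not dominated by one-sided opinion": 1,
        "discloses, or does not have, a political agenda": 1,
        "separates advertising from editorial content": 1,
    }

    def trust_score(passes):
        """Add up the points for each criterion the podcast satisfies."""
        return sum(pts for name, pts in CRITERIA.items() if passes.get(name))

    example = {name: True for name in CRITERIA}
    example["not dominated by one-sided opinion"] = False  # hypothetical failure
    print(trust_score(example))  # 9 of a possible 10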

The press release shares example scores, or what it calls “Nutrition Labels,” for five podcasts. The top scorer shown is a Murdoch-owned Wall Street Journal podcast, which received a 10 out of 10. Interesting. NewsGuard was launched in 2018 by a pair of journalist entrepreneurs and is based in New York City.

Cynthia Murrell, May 23, 2023

AI Shocker? Automatic Indexing Does Not Work

May 8, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

I am tempted to dig into my more than 50 years of work in online and pull out a chestnut or two. I will not. Just navigate to “ChatGPT Is Powered by These Contractors Making $15 an Hour” and check out the allegedly accurate statements about the knowledge work a couple of people do.

The write up states:

… contractors have spent countless hours in the past few years teaching OpenAI’s systems to give better responses in ChatGPT.

The write up includes an interesting quote; to wit:

“We are grunt workers, but there would be no AI language systems without it,” said Savreux [an indexer tagging content for OpenAI].

I want to point out a few items germane to human indexers based on my experience with content about nuclear information, business information, health information, pharmaceutical information, and “information” information, which thumbtypers call metadata:

  1. Human indexers, even when trained in the use of a carefully constructed controlled vocabulary, make errors, become fatigued and fall back on some favorite terms, and misunderstand the content and assign terms which will mislead when used in a query
  2. Source content — regardless of type — varies widely. New subjects or different spins on what seem to be known concepts mean that important nuances may be lost due to what is included in the available dataset
  3. New content often uses words and phrases which are difficult to understand. I try to note a few of the more colorful “new” words and bound phrases like softkill, resenteeism, charity porn, toilet track, and purity spirals, among others. In order to index a document in a way that allows one to locate it, knowing the term is helpful if there is a full text instance. If not, one needs a handle on the concept, which is an index term a system or a searcher knows to use. Relaxing the meaning (a trick of some clever outfits with snappy names) is not helpful
  4. Creating a training set, keeping it updated, and assembling the content artifacts is slow, expensive, and difficult. (That’s why some folks have been seeking shortcuts for decades. So far, humans remain necessary.)
  5. Reindexing, refreshing, or updating the digital construct used to “make sense” of content objects is slow, expensive, and difficult. (Ask an Autonomy user from 1998 about retraining in order to deal with “drift.” Let me know what you find out. Hint: The same issues arise from popular mathematical procedures no matter how many buzzwords are used to explain away what happens when words, concepts, and information change.)

Are there other interesting factoids about dealing with multi-type content? Sure there are. Wouldn’t it be helpful if those creating the content applied structure tags, abstracts, lists of entities and their definitions within the field or subject area of the content, and pointers to sources cited in the content object?

Let me know when blog creators, PR professionals, and TikTok artists embrace this extra work.

Pop quiz: When was the last time you used a controlled vocabulary classification code to disambiguate airplane terminal, computer terminal, and terminal disease? How does smart software do this, pray tell? If the write up and my experience are on the same wavelength (not a surfing wave but a frequency wave), a subject matter expert, a trained index professional, or software smarter than today’s smart software is needed.
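For readers who have never used one, a controlled vocabulary boils down to a lookup from a term plus its subject field to an agreed-upon classification code. The sketch below is illustrative only; the fields and codes are invented:

    # Toy controlled vocabulary: (term, subject field) -> classification code.
    # Fields and codes are invented for illustration.
    CONTROLLED_VOCABULARY = {
        ("terminal", "aviation"):  "TRAN-0410 airport terminal",
        ("terminal", "computing"): "COMP-2200 computer terminal",
        ("terminal", "medicine"):  "MED-0930 terminal disease",
    }

    def classify(term, field):
        """Return the code a trained indexer would assign, or route to an exception file."""
        return CONTROLLED_VOCABULARY.get((term.lower(), field), "UNASSIGNED -> exception file")

    print(classify("terminal", "aviation"))  # TRAN-0410 airport terminal
    print(classify("terminal", "cooking"))   # UNASSIGNED -> exception file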

Stephen E Arnold, May 8, 2023

Don Quixote Rides Again: Instead of Windmills, the Target Is Official and True Government Documents

December 8, 2022

I read “Archiving Official Documents as an Act of Radical Journalism.” The main idea is that a non-governmental entity will collect official and “true” government documents, save them, and make them searchable. Now this is an interesting idea, and it is one for which most of the countries I have advised on archiving information already have solutions. The solutions range from the wild and woolly methods used in the Japanese government to the logical approach implemented in Sweden. There’s a carnival atmosphere in Brazil, and there is a fairly interesting method in Croatia. France? Mais oui.

In each of these countries, one has to have quite specific know-how in order to obtain an official and true government document. I know from experience that a person who is not a resident of some of these countries has pretty much zero chance of getting a public transcript of a public hearing. In some cases, even with appropriate insider assistance, finding the documents is often impossible. Sure, the documents are “there.” But due to budget constraints, lousy technology, or staff procedures — not a chance. The Vatican Library has a number of little-discussed incidents where pages from old books get chopped out of a priceless volume. Where are those pages now? Hey, where’s that hymn book from the 14th century?

I want you to notice that I did not mention the US. In America we have what some might call “let many flowers bloom” methods. You might think the Library of Congress has government documents. Yeah, sort of, well, some. Keep in mind that the US Senate has documents, as does the House. Where are the working drafts of a bill? Try chasing that one down, assuming you have connections and appropriate documentation to poke around. Who has the photos of government nuclear facilities from the 1950s? I know where they used to be in the “old” building in Germantown, Maryland. I even know how to run the wonky vertical lift to look in the cardboard boxes. Now? You have to be kidding. What about the public documents from Health and Human Services related to MIC, RAC, and ZPIC? Oh, you haven’t heard about these? Good luck finding them. I could work through every US government agency in which I have worked and provide what I think are fun examples of official government documents that are often quite, quite, quite difficult to locate.

The write up explains its idea, which puts a windmill in the targeting device:

Democracy’s Library, a new project of the Internet Archive that launched last month, has begun collecting the world’s government publications into a single, permanent, searchable online repository, so that everyone—journalists, authors, academics, and interested citizens—will always be able to find, read, and use them. It’s a very fundamental form of journalism.

I am not sure the idea is a good one. In some countries, collecting government documents could become what I would characterize as a “problem.” What type of problem? How about a fine, jail time, or unpleasantness that can follow you around like Shakespeare’s spaniels at your heels.

Several observations:

  1. Public official government documents change, they disappear, and they become non-public without warning. An archive of public government documents will become quite a management challenge when classifications change, regimes change, and government bureaucracies change course. Chase down a US government repository librarian at a US government repository library near you and ask some questions. Let me know how that works out when you bring up some of the administrative issues for documents in a collection.
  2. A collection of official and true documents which tries to be comprehensive for a single country is going to be radioactive. Searchable information is problematic. That’s why enterprise search vendors who say, “All the information in your organization is searchable,” evoke statements like “Get this outfit out of my office.” Some data is harmless when isolated. Pile data and information together and the stuff can go critical.
  3. Electronic official and true government documents are often inaccessible. Examples range from public information stored in Lotus Notes, which is not the world’s best document system in my opinion, to PowerPoint reports prepared for a public conference about the US Army’s Distributed Common Ground Information System. Now try to get the public document and you may find that what was okay for a small fish conference in Tysons Corner is going to evoke some interesting responses as the requests buck up the line.
  4. Collecting and piling up official and true information sounds good … to some. Others may view the effort with some skepticism because public government information is essentially infinite. Once collected, those data may never go away. Never is a long time. How about those FOIA requests?

What’s the fix? Answer: Don Quixote became an icon for a reason, and it was not just elegant Spanish prose.

Stephen E Arnold, December 8, 2022

The Failure of Search: Let Many Flowers Bloom and… Die Alone and Sad

November 1, 2022

I read “Taxonomy is Hard.” No argument from me. Yesterday (October 31, 2022) I spoke with a long-time colleague and friend. Our conversations usually include some discussion about the loss of the expertise embodied in the early commercial database firms. The old frameworks, work processes, and shared beliefs among the top 15 or 20 for-fee online database companies seem to have scattered and been recycled in a quantum-crazy digital world. We did not mention Google once, but we could have. My colleague and I agreed on several points:

  • Those who want to make digital information products must have an informing editorial policy; that is, what’s the content space, what’s included, what’s excluded, and what problem does the commercial database solve
  • Finding information today is more difficult than it has been in our two professional lives. We don’t know if the data are current and accurate (for example, whether online corrections are applied when publications issue fixes), or whether they fit within an editorial policy, if there is one, rather than within an absence of policy shaped by the invisible hand of politics, advertising, and indifference to intellectual nuances. In some services, “old” data are disappeared, presumably due to the cost of maintaining them, updating them (if that is actually done), and working out how to make in-depth queries work within available time and budget constraints
  • The steady erosion of precision and recall as reliable yardsticks for determining what a search system can find within a specific body of content
  • Professional indexing and content curation are being compressed or ignored by many firms. The process is expensive, time consuming, and intellectually difficult.

The cited article reflects some of these issues. However, the mirror is shaped by the systems and methods in use today. The approaches pivot on metadata (index terms) and tagging (more indexing). The approach is understandable. The shift to technology which slashes the need for subject matter experts, manual methods, meetings about specific terms or categories, and the other impedimenta is the new normal.

A couple of observations:

  1. The problems of social media boil down to editorial policies. Without these guard rails and the specialists needed to maintain them, finding specific items of information on widely used platforms like Facebook, TikTok, or Twitter, among others, is difficult
  2. The challenges of processing video are enormous. The obvious fix is to gate the volume and implement specific editorial guidelines before content is made available to a user. Skipping this basic work task leads to the craziness evident in many services today
  3. Indexing can be supplemented by smart software. However, that smart software can drift off course, so specialists have to intervene and recalibrate the system (a minimal sketch of one way to spot drift follows this list).
  4. Semantic, statistical, or behavior-centric methods for identifying and suggesting possibly relevant content require the same expert-centric approach. There is no free lunch in automated indexing, even for narrow-vocabulary technical fields like nuclear physics or engineered materials. What smart software knows how to deal with new breakthroughs in physics which emerge from the study of inter-cell behavior among proteins in the human brain?
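Here is the drift sketch mentioned in point three: compare the term distribution of the content the system was tuned on with the distribution of newly processed content, and flag terms whose share has shifted so a specialist can review them. The corpora and the threshold are hypothetical:

    from collections import Counter

    def term_shares(tokens):
        """Relative frequency of each term in a token list."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {term: n / total for term, n in counts.items()}

    def drifted_terms(baseline_tokens, new_tokens, threshold=0.05):
        """Terms whose share of the corpus moved more than the (hypothetical) threshold."""
        base, new = term_shares(baseline_tokens), term_shares(new_tokens)
        return {t: round(new.get(t, 0) - base.get(t, 0), 2)
                for t in set(base) | set(new)
                if abs(new.get(t, 0) - base.get(t, 0)) > threshold}

    # Toy corpora: the vocabulary of incoming content no longer matches the training set.
    baseline = "search index taxonomy search index metadata".split()
    incoming = "embedding vector search embedding model metadata".split()
    print(drifted_terms(baseline, incoming))  # 'embedding', 'index', and others get flagged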

Net net: Is it time to re-evaluate some discarded systems and methods? Is it time to accept the fact that technology cannot solve in isolation certain problems? Is it time to recognize that close enough for horseshoes and good enough are not appropriate when it comes to knowledge centric activities? Search engines die when the information garden cannot support the buds and shoots of finding useful information the user seeks.

Stephen E Arnold, November 1, 2022

Controlled Term Lists Morph into Data Catalogs That Are Better, Faster, and Cheaper to Generate

May 24, 2022

Indexing and classifying content is boring. A human subject matter expert asked to extract index terms and assign classification codes works great. But the humanoid SME gets tired and begins assigning general terms from memory. Plus, humanoids want health care, retirement benefits, and time to go fishing in the Ozarks. (Yes, the beautiful sunny Ozarks!)

With off-the-shelf smart software available on GitHub or at a bargain price from the ever-secure Microsoft or the warehouse-subleasing Amazon, innovators can use machines to handle the indexing. To make the basic into a glam task, slap on a new bit of jargon, and you are ready to create a data catalog.
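What “machines handling the indexing” frequently boils down to is something like the minimal sketch below: strip a stopword list, count what is left, and call the most frequent tokens index terms. The stopword list and the document are invented, and the shallow result is exactly the sort of thing an SME would catch:

    from collections import Counter
    import re

    STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "each"}  # hypothetical, tiny list

    def auto_index(text, top_n=5):
        """Naive automatic indexing: most frequent non-stopword tokens become index terms."""
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(t for t in tokens if t not in STOPWORDS)
        return [term for term, _ in counts.most_common(top_n)]

    document = ("The data catalog indexes data assets and assigns metadata "
                "to each data asset for search and governance.")
    print(auto_index(document))  # e.g. ['data', 'catalog', 'indexes', ...] - frequency, not meaning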

“16 Top Data Catalog Software Tools to Consider Using in 2022” is a listing of automated indexing and classifying products and services. No humanoids, or not too many humanoids, needed. The software delivers lower costs and none of the humanoid deterioration after a few hours of indexing. Those software systems are really something: No vacations, no benefits, no health care, and no breaks during which unionization can be discussed.

What’s interesting about the list is that it includes the allegedly quasi-monopolistic outfits like Amazon, Google, IBM, Informatica, and Oracle. The write up does not answer the question, “Are the terms and other metadata the trade secret of the customer?” The reason I am curious is that rolling up terms from numerous organizations and indexing each term as originating at a particular company provides a useful data set to analyze for trends, entities, and the date and time of the document from which the terms were derived. But no alleged monopoly would look at a cloud customer’s data? Inconceivable.

The list of vendors also includes some names which are not yet among the titans of content processing; for example:

  • Alation
  • Alex
  • Ataccama
  • Atlan
  • Boomi
  • Collibra
  • Data.world
  • Erwin
  • Lumada

There are some other vendors in the indexing business. You can identify these players by joining NFAIS, now the National Federation of Advanced Information Services. The outfit discarded the now out-of-favor terminology of abstracting and indexing. My hunch is that some NFAIS members can point out some of the potential downsides of using smart software to process business and customer information. New terms and jazzy company names can cause digital consternation. But smart software just gets smarter even as it mis-labels, mis-indexes, and mis-understands. No problem: Cheaper, faster, and better. A trifecta. Who needs SMEs to look at an exception file, correct errors, and tune the system? No one!

Stephen E Arnold, May 24, 2022

Google: Admitting What It Does Now That People Believe Google Is the Holy Grail of Information

March 21, 2022

About 25 years. That’s how long it took Google to admit that it divides the world into bluebirds, canaries, sparrows, and dead ducks. Are we talking about our feathered friends? Nope. We are talking about dividing publicly accessible Web sites into four categories. Note: These are my research team’s classifications:

Bluebirds — Web sites indexed in sort of almost real time. Example: whitehouse.gov and sites which pull big ad sales

Canaries — Web sites that are popular but indexed in a more relaxed manner. Example:  Sites which pull ad money but not at the brand level

Sparrows — Web sites that people look at but pull less lucrative ads. Example: Your site, probably?

Dead ducks — Sites banned, down-checked for “quality,” or sites which use Google’s banned words. You will have to use non-Google search systems to locate these resources. Example: Drug ads which generate money and kick up unwanted scrutiny from some busybodies.

“Google Says ‘Discovered – Currently Not Indexed’ Status Can Last Forever” explains:

‘Discovered – Currently not indexed’ in the Google Search Console Index Coverage report can potentially last forever, as the search engine doesn’t index every page.

The article adds:

Google doesn’t make any guarantees to crawl and index every webpage. Even though Google is one of the biggest companies in the world, it has finite resources when it comes to computing power.

Monopoly power? Now that Google dominates search, it can decide what can be found for billions of people.

This is a great thing for the Google. For others, perhaps not quite the benefit the clueless user expects?

If something cannot be found in the Google Web search index, that something does not exist for lots of people. After 25 years of information control, the Google spills the beans about dead ducks.

Stephen E Arnold, March 21, 2022

2022 Adds More Banished Words To The Lexicon

January 27, 2022

Every year since 1976, Lake Superior State University in Michigan has compiled a list of banished words to protect and uphold standards in language. The New Zealand Herald examines the list in the article, “Banished Word List For 2022 Takes Aim At Some Kiwi Favorites.” New Zealanders should be upset, because their favorite phrase “No worries” made the list.

Many of the words that made the list were cited for overuse. In 2020, COVID-related terms were high on the list. For 2021, colloquial phrases were criticized. Banished word nominations came from the US, Australia, Canada, Scotland, England, Belgium, and Norway.

“ ‘Most people speak through informal discourse. Most people shouldn’t misspeak through informal discourse. That’s the distinction nominators far and wide made, and our judges agreed with them,’ the university’s executive director of marketing and communications Peter Szatmary said.

LSSU president Dr Rodney Hanley said every year submitters suggested what words and terms to banish by paying close attention to what humanity utters and writes. ‘Taking a deep dive at the end of the day and then circling back make perfect sense. Wait, what?’ he joked.”

Words that made the list were: supply chain, you’re on mute, new normal, deep dive, circle back, asking for a friend, that being said, at the end of the day, no worries, and wait, what?

Whitney Grace, January 27, 2022

Search Quality: 2022 Style

January 11, 2022

I read the interesting “Is Google Search Deteriorating? Measuring Google’s Search Quality in 2022?” The approach is different from the one used at the commercial database outfits for which I worked decades ago. We knew what our editorial policy was; that is, we could tell a person exactly what was indexed, how it was indexed, how classification codes were assigned, and what the field codes were for each item in our database. (A field code, for those who have never encountered the term, is an index term which disambiguates a computer terminal from an airport terminal.) When we tested a search engine — for example, a touch of the DataStar systems — we could determine the precision and recall of the result set. This was math, not an opinion. Yep, we had automatic indexing routines, but we relied primarily on human editors and subject matter experts, with a consultant or two tossed in for good measure. (A tip of the Silent 700 paper feed to you, Betty Eddison.)
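For anyone who has not run such a test, the math is straightforward. A minimal sketch with invented relevance judgments:

    def precision_recall(retrieved, relevant):
        """Precision: share of retrieved items that are relevant.
        Recall: share of relevant items that were retrieved."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical judged query: 8 documents known to be relevant, 10 returned by the system.
    relevant_docs = {f"doc{i}" for i in range(1, 9)}
    returned_docs = {"doc1", "doc2", "doc3", "doc5", "doc8",
                     "doc12", "doc14", "doc15", "doc16", "doc20"}
    print(precision_recall(returned_docs, relevant_docs))  # (0.5, 0.625)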

The cited article takes a different approach. It is mostly subjective. The result of the analysis is that Google is better than Bing. Here’s a key passage:

So Google does outperform Bing (the difference is statistically significant)…

Okay, statistics.

Several observations:

First, I am not sure either Bing’s search team or Google’s search team knows what is in the indexes at any point in time. I assume someone could look, but I know from first-hand experience that the young wizards are not interested in the scope of an index. The interest is in reducing the load or computational cost of indexing new content objects and updating certain content objects, discarding content domains which don’t pay for their computational costs, and similar MBA-inspired engineering efficiencies. Nobody gets a bonus for knowing what’s indexed, when, why, and whether that index set is comprehensive. How deep does Google go on unloved Web sites like the Railroad Retirement Board’s?

Second, without time benchmarks and hard data about precision and recall, the subjective approach to evaluating search results misses the point of Bing and Google. These are systems which must generate revenue. Bing has been late to the party, but the Redmond security champs are giving ad sales the old college-dropout try. (A tip of the hat to MSFT’s eternal freshman, Bill Gates, too.) The results which are relevant are the ones that, by some algorithmic cartwheels, burn through the ad inventory. Money is what matters; understanding user queries, supporting Boolean logic, and including date and time information about the content object and when it was last indexed are irrelevant. In one meeting, I can honestly say no one knew what I was talking about when I mentioned “time” index points.

Third, there are useful search engines which should be used as yardsticks against which to measure the Google and the smaller pretender, Bing. Why not include Swisscows.ch or Yandex.ru or Baidu.com or any of the other seven or eight Web-centric, no-charge systems? I suppose one could toss in the Google killer Neeva and a handful of metasearch systems. Yep, that’s work. Set up standard queries. Capture results. Analyze those results. Calculate result overlap. Get subject matter experts to evaluate the results. Do the queries at different points in time for a period of three months or more, etc., etc. This is probably not going to happen.
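One piece of that work is easy to show. A minimal sketch of the result-overlap calculation for a single standard query run against several engines; the URLs are invented:

    def jaccard_overlap(results_a, results_b):
        """Jaccard overlap of two result sets: shared URLs divided by all distinct URLs."""
        a, b = set(results_a), set(results_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    # Hypothetical top-5 results for one query on three engines.
    engines = {
        "google": ["u1", "u2", "u3", "u4", "u5"],
        "bing":   ["u1", "u3", "u6", "u7", "u8"],
        "yandex": ["u2", "u9", "u10", "u11", "u12"],
    }
    names = list(engines)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            print(x, y, round(jaccard_overlap(engines[x], engines[y]), 2))
    # google bing 0.25, google yandex 0.11, bing yandex 0.0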

Fourth, what has been filtered? Those stop word lists are fascinating, and they make it very difficult to find certain information. With traditional libraries struggling for survival, where is that verifiable research process going to lead? Yep, ad-centric, free search systems. It might be better to just guess at some answers.

Net net: Web search is not very good. It never has been. For-fee databases are usually an afterthought, if thought of at all. It is remarkable how many people pass themselves off as open source intelligence experts, expert online researchers, or digital natives able to find “anything” using their mobile phone.

Folks, most people are living in a cloud of unknowing. Search results shape understanding. A failure of search just means that users have zero chance to figure out if a result from a free Web query is much more than Madison Avenue, propaganda, crooked card dealing, or some other content injection goal.

That’s what one gets when the lowest cost methods to generate the highest ad revenue are conflated with information retrieval. But, hey, you can order a pizza easily.

Stephen E Arnold, January 11, 2022

How AI Might Fake Geographic Data

June 16, 2021

Here is yet another way AI could be used to trick us. The Eurasia Review reports, “Exploring Ways to Detect ‘Deep Fakes’ in Geography.” Researchers at the University of Washington and Oregon State University do not know of any cases where false GIS data has appeared in the wild, but they see it as a strong possibility. In a bid to get ahead of the potential issue, the data scientists created an example of how one might construct such an image and published their findings in Cartography and Geographic Information Science. The Eurasia Review write-up observes:

“Geographic information science (GIS) underlays a whole host of applications, from national defense to autonomous cars, a technology that’s currently under development. Artificial intelligence has made a positive impact on the discipline through the development of Geospatial Artificial Intelligence (GeoAI), which uses machine learning — or artificial intelligence (AI) — to extract and analyze geospatial data. But these same methods could potentially be used to fabricate GPS signals, fake locational information on social media posts, fabricate photographs of geographic environments and more. In short, the same technology that can change the face of an individual in a photo or video can also be used to make fake images of all types, including maps and satellite images. ‘We need to keep all of this in accordance with ethics. But at the same time, we researchers also need to pay attention and find a way to differentiate or identify those fake images,’ Deng said. ‘With a lot of data sets, these images can look real to the human eye.’ To figure out how to detect an artificially constructed image, first you need to construct one.”

We suppose. The researchers suspect they are the first to recognize the potential for GIS fakery, and their paper has received attention around the world. But at what point can one distinguish between warding off a potential scam and giving bad actors ideas? Hard to tell.

The team used the unsupervised deep learning algorithm CycleGAN to introduce parts of Seattle and Beijing into a satellite image of Tacoma, Washington. Curious readers can navigate to the post to view the result, which is convincing to the naked eye. When compared to the actual image using 26 image metrics, however, differences were registered on 20 of them. Details like differences in roof colors, for example, or blurry vs. sharp edges gave it away. We are told to expect more research in this vein so ways of detecting falsified geographic data can be established. The race is on.
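The per-metric comparison the team describes works, in spirit, like the minimal sketch below, which applies two toy measures (mean pixel difference and edge energy) to arrays standing in for the real and fabricated tiles; the data and the “fake” transformation are invented:

    import numpy as np

    def compare(real, fake):
        """Two toy image metrics: mean absolute pixel difference and edge energy."""
        real, fake = real.astype(float), fake.astype(float)
        mean_abs_diff = float(np.mean(np.abs(real - fake)))
        # Edge energy: mean magnitude of horizontal pixel-to-pixel differences.
        edge_real = float(np.mean(np.abs(np.diff(real, axis=1))))
        edge_fake = float(np.mean(np.abs(np.diff(fake, axis=1))))
        return {"mean_abs_diff": mean_abs_diff, "edge_real": edge_real, "edge_fake": edge_fake}

    rng = np.random.default_rng(0)
    real_tile = rng.integers(0, 256, size=(64, 64))
    fake_tile = real_tile * 0.5 + 64        # hypothetical "smoothed" fabrication
    print(compare(real_tile, fake_tile))    # the fake shows lower edge energy (blurrier edges)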

Cynthia Murrell, June 16, 2021
