Controlled Term Lists Morph into Data Catalogs That Are Better, Faster, and Cheaper to Generate

May 24, 2022

Indexing and classifying content is boring. A human subject matter expert asked to extract index terms and assign classification codes works great. But the humanoid SME gets tired and begins assigning general terms from memory. Plus humanoids want health care, retirement benefits, and time to go fishing in the Ozarks. (Yes, the beautiful sunny Ozarks!)

With off-the-shelf smart software available on GitHub or at a bargain price from the ever-secure Microsoft or the warehouse-subleasing Amazon, innovators can use machines to handle the indexing. To turn the basic task into a glam one, slap on a new bit of jargon, and you are ready to create a data catalog.

“16 Top Data Catalog Software Tools to Consider Using in 2022” is a listing of automated indexing and classifying products and services. No humanoids or not too many humanoids needed. The software delivers lower costs and none of the humanoid deterioration after a few hours of indexing. Those software systems are really something: No vacations, no benefits, no health care, and no breaks during which unionization can be discussed.

What’s interesting about the list is that it includes the allegedly quasi monopolistic outfits like Amazon, Google, IBM, Informatica, and Oracle. The write up does not answer the question, “Are the terms and other metadata the trade secret of the customer?” The reason I am curious is that rolling up terms from numerous organizations and indexing each term as originating at a particular company provides a useful data set to analyze for trends, entities, and date and time on the document from which the terms were derived. But no alleged monopoly would look at a cloud customer’s data? Inconceivable.

The list of vendors also includes some names which are not yet among the titans of content processing; for example:

There are some other vendors in the indexing business. You can identify these players by joining NFAIS, now the National Federation of Advanced Information Services. The outfit discarded the now out of favor terminology of abstracting and indexing. My hunch is that some NFAIS members can point out some of the potential downsides of using smart software to process business and customer information. New terms and jazzy company names can cause digital consternation. But smart software just gets smarter even as it mis-labels, mis-indexes, and mis-understands. No problem: Cheaper, faster, and better. A trifecta. Who needs SMEs to look at an exception file, correct errors, and tune the system? No one!

Stephen E Arnold, May 24, 2022

Google: Admitting What It Does Now That People Believe Google Is the Holy Grail of Information

March 21, 2022

About 25 years. That’s how long it took Google to admit that it divides the world into bluebirds, canaries, sparrows, and dead ducks. Are we talking about our feathered friends? Nope. We are dividing the publicly accessible Web sites into four categories. Note: These are my research team’s classifications:

Bluebirds — Web sites indexed in sort of almost real time. Example: and sites which pull big ad sales

Canaries — Web sites that are popular but indexed in a more relaxed manner. Example: Sites which pull ad money but not at the brand level

Sparrows — Web sites that people look at but pull less lucrative ads. Example: Your site, probably?

Dead ducks — Sites banned, down checked for “quality”, or sites which use Google’s banned words. Example: You will have to use non Google search systems to locate these resources. Example: Drug ads which generate money and kick up unwanted scrutiny from some busybodies.

“Google Says ‘Discovered – Currently Not Indexed’ Status Can Last Forever” explains:

‘Discovered – Currently not indexed’ in the Google Search Console Index Coverage report can potentially last forever, as the search engine doesn’t index every page.

The article adds:

Google doesn’t make any guarantees to crawl and index every webpage. Even though Google is one of the biggest companies in the world, it has finite resources when it comes to computing power.

Monopoly power? Now that Google dominates search it can decide what can be found for billions of people.

This is a great thing for the Google. For others, perhaps not quite the benefit the clueless user expects?

If something cannot be found in the Google Web search index, that something does not exist for lots of people. After 25 years of information control, the Google spills the beans about dead ducks.

Stephen E Arnold, March 21, 2022

2022 Adds More Banished Words To The Lexicon

January 27, 2022

Every year since 1976, Lake Superior State University in Michigan has compiled a list of banished words to protect and uphold standards in language. The New Zealand Herald examines the list in the article, “Banished Word List For 2022 Takes Aim At Some Kiwi Favorites.” New Zealanders should be upset, because their favorite phrase, “No worries,” made the list.

Many of the words that made the list were due to overuse. In 2020, COVID related terms were high on the list. For 2021, colloquial phrases were criticized. Banned word nominations came from the US, Australia, Canada, Scotland, England, Belgium, and Norway.

“ ‘Most people speak through informal discourse. Most people shouldn’t misspeak through informal discourse. That’s the distinction nominators far and wide made, and our judges agreed with them,’ the university’s executive director of marketing and communications Peter Szatmary said.

LSSU president Dr Rodney Hanley said every year submitters suggested what words and terms to banish by paying close attention to what humanity utters and writes. ‘Taking a deep dive at the end of the day and then circling back make perfect sense. Wait, what?’ he joked.”

Words that made the list were: supply chain, you’re on mute, new normal, deep dive, circle back, asking for a friend, that being said, at the end of the day, no worries, and wait, what?

Whitney Grace, January 27, 2022

Search Quality: 2022 Style

January 11, 2022

I read the interesting “Is Google Search Deteriorating? Measuring Google’s Search Quality in 2022?” The approach is different from what was the approach used at the commercial database outfits for which I worked decades ago. We knew what our editorial policy was; that is, we could tell a person exactly what was indexed, how it was indexed, how classification codes were assigned, and what the field codes were for each item in our database. (A field code for those who have never encountered the term means an index term which disambiguates a computer terminal from an airport terminal.) When we tested a search engine — for example, a touch of the DataStar systems — we could determine the precision and recall of the result set. This was math, not an opinion. Yep, we had automatic indexing routines, but we relied primarily on human editors and subject matter experts with a consultant or two tossed in for good measure. (A tip of the Silent 700 paper feed to you, Betty Eddison.)
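For readers who have not run the numbers, here is a minimal sketch of the precision and recall arithmetic mentioned above. The record IDs and the function name are hypothetical; the only assumption is that someone (an editor or SME) has already decided which records are relevant to the query.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: record IDs the search system returned.
    relevant:  record IDs a human judged relevant (the ground truth).
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    # Precision: share of returned records that are relevant.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant records that were returned.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments for one test query.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Recall is the hard part: it only works when one knows what is in the index, which is the point about editorial policy.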

The cited article takes a different approach. It is mostly subjective. The result of the analysis is that Google is better than Bing. Here’s a key passage:

So Google does outperform Bing (the difference is statistically significant)…

Okay, statistics.

Several observations:

First, I am not sure either Bing’s search team or Google’s search team knows what is in the indexes at any point in time. I assume someone could look, but I know from first hand experience that the young wizards are not interested in the scope of an index. The interest is reducing the load or computational cost of indexing new content objects and updating certain content objects, discarding content domains which don’t pay for their computational costs, and similar MBA inspired engineering efficiencies. Nobody gets a bonus for knowing what’s indexed, when, why, and whether that index set is comprehensive. How deep does Google go into unloved Web sites like the Railway Retirement Board?

Second, without time benchmarks and hard data about precision and recall, the subjective approach to evaluating search results misses the point of Bing and Google. These are systems which must generate revenue. Bing has been late to the party, but the Redmond security champs are giving ad sales the old college drop out try. (A tip of the hat to MSFT’s eternal freshman, Bill Gates, too.) The results which are relevant are the ones that by some algorithmic cartwheels burn through the ad inventory. Money matters. Understanding user queries, supporting Boolean logic, and including date and time information about the content object and when it was last indexed are irrelevant. In one meeting, I can honestly say no one knew what I was talking about when I mentioned “time” index points.

Third, there are useful search engines which should be used as yardsticks against which to measure the Google and the smaller pretender, Bing. Why not include or or or any of the other seven or eight Web centric and no charge systems? I suppose one could toss in the Google killer Neeva and a handful of metasearch systems. Yep, that’s work. Set up standard queries. Capture results. Analyze those results. Calculate result overlap. Get subject matter experts to evaluate the results. Do the queries at different points in time for a period of three months or more, etc., etc. This is probably not going to happen.
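The “calculate result overlap” step can be sketched in a few lines. This is an illustration, not a prescribed method: the engine names and URLs are placeholders, and Jaccard similarity is one assumed way among several to score overlap between two result sets for the same query.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap: shared results divided by all distinct results."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(results_by_engine):
    """Score every pair of engines on the same standard query."""
    return {
        (x, y): jaccard(results_by_engine[x], results_by_engine[y])
        for x, y in combinations(sorted(results_by_engine), 2)
    }


# Hypothetical captured results for one standard query.
results = {
    "engine_a": ["u1", "u2", "u3", "u4"],
    "engine_b": ["u2", "u3", "u5"],
    "engine_c": ["u6", "u7"],
}
overlap = pairwise_overlap(results)
for pair, score in overlap.items():
    print(pair, round(score, 2))
```

Run the same queries weekly for three months and store the scores with a date, and one has the time benchmark the subjective approach lacks.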

Fourth, what has been filtered? Those stop word lists are fascinating and they make it very difficult to find certain information. With traditional libraries struggling for survival, where is that verifiable research process going to lead? Yep, ad centric, free search systems. It might be better to just guess at some answers.
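A small sketch shows why a stop word list can make information unfindable. The list below is hypothetical and far shorter than a production list; no claim is made about what any particular search system actually filters.

```python
# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}


def filter_query(query, stop_words=STOP_WORDS):
    """Drop stop words from a query before it reaches the index."""
    return [term for term in query.lower().split() if term not in stop_words]


# Every term in this query is on the list, so nothing is left to search for.
print(filter_query("to be or not to be"))
# Here only part of the query survives.
print(filter_query("The Who discography"))
```

The user never sees which terms were discarded, so there is no way to know what the system actually searched for.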

Net net: Web search is not very good. It never has been. For fee databases are usually an afterthought if thought of at all. It is remarkable how many people pass themselves off as open source intelligence experts, expert online researchers, or digital natives able to find “anything” using their mobile phone.

Folks, most people are living in a cloud of unknowing. Search results shape understanding. A failure of search just means that users have zero chance to figure out if a result from a free Web query is much more than Madison Avenue, propaganda, crooked card dealing, or some other content injection goal.

That’s what one gets when the lowest cost methods to generate the highest ad revenue are conflated with information retrieval. But, hey, you can order a pizza easily.

Stephen E Arnold, January 11, 2022

How AI Might Fake Geographic Data

June 16, 2021

Here is yet another way AI could be used to trick us. The Eurasia Review reports, “Exploring Ways to Detect ‘Deep Fakes’ in Geography.” Researchers at the University of Washington and Oregon State University do not know of any cases where false GIS data has appeared in the wild, but they see it as a strong possibility. In a bid to get ahead of the potential issue, the data scientists created an example of how one might construct such an image and published their findings in Cartography and Geographic Information Science. The Eurasia Review write-up observes:

“Geographic information science (GIS) underlays a whole host of applications, from national defense to autonomous cars, a technology that’s currently under development. Artificial intelligence has made a positive impact on the discipline through the development of Geospatial Artificial Intelligence (GeoAI), which uses machine learning — or artificial intelligence (AI) — to extract and analyze geospatial data. But these same methods could potentially be used to fabricate GPS signals, fake locational information on social media posts, fabricate photographs of geographic environments and more. In short, the same technology that can change the face of an individual in a photo or video can also be used to make fake images of all types, including maps and satellite images. ‘We need to keep all of this in accordance with ethics. But at the same time, we researchers also need to pay attention and find a way to differentiate or identify those fake images,’ Deng said. ‘With a lot of data sets, these images can look real to the human eye.’ To figure out how to detect an artificially constructed image, first you need to construct one.”

We suppose. The researchers suspect they are the first to recognize the potential for GIS fakery, and their paper has received attention around the world. But at what point can one distinguish between warding off a potential scam and giving bad actors ideas? Hard to tell.

The team used the unsupervised deep learning algorithm CycleGAN to introduce parts of Seattle and Beijing into a satellite image of Tacoma, Washington. Curious readers can navigate to the post to view the result, which is convincing to the naked eye. When compared to the actual image using 26 image metrics, however, differences were registered on 20 of them. Details like differences in roof colors, for example, or blurry vs. sharp edges gave it away. We are told to expect more research in this vein so ways of detecting falsified geographic data can be established. The race is on.

Cynthia Murrell, June 16, 2021

Why Metadata? The Answer: Easy and Good Enough

April 30, 2021

I read “We Were Promised Strong AI, But Instead We Got Metadata Analysis.” The essay is thoughtful and provides a good summary of indexing’s virtues. The angle of attack is that artificial intelligence has not delivered the zip a couple of bottles of Red Bull provides. Instead, metadata is more like four ounces of Sunny D tangy original.

The write up states:

The phenomenon of metadata replacing AI isn’t just limited to web search. Manually attached metadata trumps machine learning in many fields once they mature – especially in fields where progress is faster than it is in internet search engines. When your elected government snoops on you, they famously prefer the metadata of who you emailed, phoned or chatted to the content of the messages themselves. It seems to be much more tractable to flag people of interest to the security services based on who their friends are and what websites they visit than to do clever AI on the messages they send. Once they’re flagged, a human can always read their email anyway.

This is an accurate statement.

The write up does not address a question I think is important in the AI versus metadata discussion. That question is, “Why?”

Here are some of the reasons I have documented in my books and writings over the years:

  1. Metadata is cheaper to process than spending to get smart software to work in a reliable way
  2. Metadata is good enough; that is, key insights can be derived with maths taught in most undergraduate mathematics programs. (I lectured about the 10 algorithms which everyone uses. Why? These are good enough.)
  3. Machines can do pretty good indexing; that is, key word and bound phrase extraction and mapping, clustering, graphs of wide paths among nodes, people, etc.
  4. Humans have been induced to add their own – often wonky – index terms or hash tags as the thumbtypers characterize their tags
  5. Index analysis (Gene Garfield’s citation analysis) provides reasonably useful indications of what’s important even if one knows zero about a topic, entity, etc.
  6. Packaging indexing – sorry, metadata – as smart software and its ilk converts VCs from skeptics into fantasists. Money flows even though Google’s DeepMind technology is not delivering dump trucks of money to the Alphabet front door. Maybe soon? Who knows?
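Point 5 is easy to illustrate. The sketch below ranks items by how often they are cited, using nothing but the who-cites-whom metadata. The paper IDs are made up, and this simple count is a stand-in for the idea, not a reconstruction of Garfield’s actual measures.

```python
from collections import Counter


def rank_by_citations(citations):
    """Rank cited items by inbound citation count.

    citations: iterable of (citing_id, cited_id) pairs.
    Duplicate pairs and self-citations are ignored.
    """
    unique_edges = {(src, dst) for src, dst in citations if src != dst}
    counts = Counter(dst for _, dst in unique_edges)
    return counts.most_common()


# Hypothetical citation pairs; no document text is needed.
edges = [
    ("p1", "p3"), ("p2", "p3"), ("p4", "p3"),
    ("p1", "p2"), ("p4", "p2"),
    ("p3", "p3"),  # self-citation, ignored
    ("p1", "p3"),  # duplicate, ignored
]
ranking = rank_by_citations(edges)
print(ranking)
```

No machine learning, no model training, and the math is counting. That is what “good enough” looks like.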

Net net: The strongest supporters of artificial intelligence have specific needs: Money, vindication of an idea gestated among classmates at a bar, or a desire to become famous.

Who agrees with me? Probably not too many people. As the professionals who founded commercial database products in the late 1970s and early 1980s die off, any chance of getting the straight scoop on the importance of indexing decreases. For AI professionals, that’s probably good news. For those individuals who understand indexing in today’s context, good luck with your mission.

Stephen E Arnold, April 30, 2021

Volv for Brief, Unbiased News

April 5, 2021

We learn about an app that pares the news down to as little information as possible from Insider’s write-up, “Volv Bills Itself as ‘TikTok for News.’ The Snap-Backed App Makes News Stories You Can Read in 9 Seconds.” Who needs in-depth analysis, anyway? Co-founders Shannon Almeida and Priyanka Vazirani wished to create a source of unbiased news; I suppose eliminating any attempts to provide context is one way to do that. Writer Grace Dean tells us:

“It creates news stories, averaging at around 70 words, which users can read in less than nine seconds. The stories are listed in-app in a swipe format that’s easy on the eye. This is crucial to make the app attractive to its millennial target market, Vazirani said. People in their teens and 20s often check their phones before they even get out of bed, logging into various apps to view the latest newsfeed updates. On Volv, users can scroll through and see all the major news stories at a glance. The app combines breaking news with pop culture stories, such as explaining memes that are going viral. A prime example would be Bernie Sanders’ mittens at Joe Biden’s presidential inauguration. In this way, the app can show people the top political and financial stories and convert non-news readers, while also offsetting heavy stories with lighter reads. This approach is paying off. Volv publishes around 50 stories a day and its articles have been read nearly 8 million times so far. Its founders said it has a high retention rate, too.”

Almeida and Vazirani, who had no tech experience before this project, are delighted at its success—they certainly seem to be on to something. We’re told the pair received some good advice from successful entrepreneur Mark Cuban, who shared his thoughts on appealing to millennials and marketing their product to stand out from other news sites. Though Volv currently employs fewer than 10 workers, it is looking to expand to provide more diverse content. Launched last year, the company is based in New York City.

Cynthia Murrell, April 5, 2021

Let Us Now Consider Wonky Data and Tagging

March 31, 2021

As you may know, I find MIT endlessly amusing. From the Jeffrey Epstein matter to smart people who moonlight for other interesting entities, the esteemed university does not disappoint. I noted an article about an MIT finding which is interesting. “MIT’s AI Dataset Study and Startling Findings” reports:

MIT Researchers analyzed 10 test sets from datasets, including ImageNet, and found over 2,900 errors in the ImageNet validation set alone. When used as a benchmark data set, the errors in the dataset were proved to have an incorrect position in correlation to direct observation or ground truth.

So what?

Garbage in, garbage out.

This is not a surprise and it certainly seems obvious. If anything, the researchers’ error rate seems low. There is no information about data pushed into the “exception” folder for indexing systems.
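What an “exception” folder does can be sketched simply. This is not the MIT researchers’ method; it is an assumed, simplified routing rule in which a record goes to human review when a classifier confidently disagrees with the label already assigned. The field names, labels, and the 0.9 threshold are all hypothetical.

```python
def split_exceptions(records, threshold=0.9):
    """Separate records into accepted items and an exception list.

    A record is an exception when the predicted label differs from the
    assigned label and the prediction confidence meets the threshold.
    """
    accepted, exceptions = [], []
    for rec in records:
        disagrees = rec["predicted"] != rec["label"]
        if disagrees and rec["confidence"] >= threshold:
            exceptions.append(rec)  # a human editor should look at this one
        else:
            accepted.append(rec)
    return accepted, exceptions


# Hypothetical tagged records with a classifier's second opinion.
records = [
    {"id": "img1", "label": "cat", "predicted": "cat", "confidence": 0.98},
    {"id": "img2", "label": "dog", "predicted": "wolf", "confidence": 0.95},
    {"id": "img3", "label": "fox", "predicted": "dog", "confidence": 0.55},
]
accepted, exceptions = split_exceptions(records)
print([rec["id"] for rec in exceptions])
```

The sketch flags candidates only. Someone still has to open the exception file, correct the labels, and tune the threshold.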

Stephen E Arnold, March 31, 2021

So You Wanna Be a Google?

March 31, 2021

Just a short item which may be of interest to Web indexing wannabes: Datashake has rolled out its Web Scraper API. You can read about how to:

Scrape the web with proxies, CAPTCHA solving, headless browsers and more to avoid being blocked.

You will have to sign up to get “early access” to the service. The service is not free … because scraping Web sites is neither easy nor inexpensive.

There’s not much info about this API as of March 23, 2021, but this type of service beats the pants off trying to cook up our content acquisition scripts in 1993 for The Point (Top 5% of the Internet). You remember that, don’t you?

Of course, thumbtypers will say, “Okay, boomer, what’s up with that ancient history?”


Stephen E Arnold, March 31, 2021

Historical Revisionism: Twitter and Wikipedia

March 24, 2021

I wish I could recall the name of the slow talking wild-eyed professor who lectured about Mr. Stalin’s desire to have the history of the Soviet Union modified. The tendency was evident early in his career. Ioseb Besarionis dze Jughashvili became Stalin, so fiddling with received wisdom verified by Ivory Tower types should come as no surprise.

Now we have Google and the right to be forgotten. As awkward as deleting pointers to content may be, digital information invites “reeducation”.

I learned in “Twitter to Appoint Representative to Turkey” that the extremely positive social media outfit will interact with the country’s government. The idea is to make sure content is just A-Okay. Changing tweets for money is a pretty good idea. Coordinating the filtering of information with a nation state is an even better one. But Apple and China seem to be finding a path forward. Maybe Apple in Russia will be a similar success.

A much more interesting approach to shaping reality is alleged in “Non-English Editions of Wikipedia Have a Misinformation Problem.” Wikipedia has a stellar track record of providing fact rich, neutral information, I believe. This “real news” story states:

The misinformation on Wikipedia reflects something larger going on in Japanese society. These WWII-era war crimes continue to affect Japan’s relationships with its neighbors. In recent years, as Japan has seen an increase in the rise of nationalism, then–Prime Minister Shinzo Abe argued that there was no evidence of Japanese government coercion in the comfort women system, while others tried to claim the Nanjing Massacre never happened.

I am interested in these examples because each provides some color to one of my information “laws”. I have dubbed these “Arnold’s Precepts of Online Information.” Here’s the specific law which provides a shade tree for these examples:

Online information invites revisionism.

Stated another way, when “facts” are online, these are malleable, shapeable, and subjective.

When one runs a query on and then the same query on, ask:

Are these services indexing the same content?

The answer for me is, “No.” Filters, decisions about what to index, and update calendars shape the reality depicted online. Primary sources are a fine idea, but when those sources are shaped as well, what does one do?

The answer is like one of those Borges stories. Deleting and shaping content is more environmentally friendly than burning written records. A Python script works with less smoke.

Stephen E Arnold, March 24, 2021
