Google and Search: A Fix or a Pipe Dream?
September 6, 2024
This essay is the work of a dumb dinobaby. No smart software required.
I read “Dawn of a New Era in Search: Balancing Innovation, Competition, and Public Good.”
Don’t get me wrong. I think multiple search systems are a good thing. The problem is that search (both enterprise and Web) is a difficult problem, and difficult problems are expensive to solve. After working more than 50 years in electronic information, I have seen search systems come and go. I have watched systems morph from search into weird products that hide the search plumbing beneath fancy words like business intelligence and OSINT tools, among others. In 2006 or 2007, one of my financial clients published some of our research. The bank received an email from an “expert” (formerly of Verity) claiming that his firm had better technology than Google. In that conversation, that “expert” said, “I can duplicate Google search for $300 million.” The person who said these incredibly uninformed words is now head of search at Google. Ed Zitron has characterized the individual as the person who killed Google search. Well, that fellow and Google search are still around. This suggests that baloney and high school reunions provide a career path for some people. But search is not understood particularly well at Google at this time. It is, therefore, no surprise that the problems of search remain unknown to judges, search engine marketing experts, developers of metasearch systems which recycle Bing results, and most of the poohbahs writing about search in blogs like Beyond Search.
The poor search kids see the rich guy with lots of money. The kids want it. The situation is not fair to those with little or nothing. Will the rich guy share the money? Thanks, Microsoft Copilot. Good enough. Aren’t you one of the poor Web search vendors?
After five decades of arm wrestling with finding on-point information for myself, my clients, and the search-related start-ups with which I have worked, I have an awareness of how much complexity the word “search” obfuscates. There is a general perception that Google indexes the Web. It doesn’t. No one indexes the Web. What’s indexed are publicly exposed Web pages which a crawler can access. If the response is slow (as on many government and underfunded personal or commercial sites), spiders time out, and the pages are not indexed. The crawlers also have to deal successfully with changes in how Web pages are presented. Upon encountering something for which the crawler is not configured, the Web page is skipped. Certain Web sites are dynamic; the crawler has to cope with these. Then there are Web pages which are not composed of text. The problems are compounded by the vagaries of intermediaries’ actions; for example, what’s being blocked or filtered today? The answer is the crawler skips them.
Without revealing information I am not permitted to share, I want to point out that crawlers have a list which contains bluebirds, canaries, and dead ducks. The bluebirds are indexed by crawlers on an aggressive schedule, maybe multiple times every hour. The canaries are the index-on-a-normal-cycle, maybe once every day or two. The dead ducks are crawled when time permits. Some US government Web sites may not be updated in six or nine months. The crawler visits the site once every six months or even less frequently. Then there are forbidden sites which the crawler won’t touch. These are on the open Web but urls are passed around via private messages. In terms of a Web search, these sites don’t exist.
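For the curious, the bluebird / canary / dead-duck idea can be sketched in a few lines of Python. The tier names, recrawl intervals, and urls below are my own illustration, not any real crawler’s configuration:

```python
import heapq
import time

# Recrawl intervals in seconds for each tier. The labels echo the
# bluebird / canary / dead-duck idea; the numbers are illustrative.
TIERS = {
    "bluebird": 30 * 60,              # high-change sites: every half hour
    "canary": 24 * 60 * 60,           # normal sites: roughly daily
    "dead_duck": 180 * 24 * 60 * 60,  # rarely updated: every six months or so
}

# Open-Web sites the crawler will not touch; in terms of Web search,
# these sites do not exist.
FORBIDDEN = {"https://example.org/private"}

def schedule(sites, now=None):
    """Build a priority queue of (next_visit_time, url) entries."""
    now = time.time() if now is None else now
    queue = []
    for url, tier in sites:
        if url in FORBIDDEN:
            continue  # skipped entirely, never queued
        heapq.heappush(queue, (now + TIERS[tier], url))
    return queue

sites = [
    ("https://news.example.com", "bluebird"),
    ("https://blog.example.com", "canary"),
    ("https://agency.example.gov", "dead_duck"),
    ("https://example.org/private", "bluebird"),
]
queue = schedule(sites, now=0)
# Pop in visit order: the bluebird surfaces first, the dead duck last,
# and the forbidden url never enters the queue.
order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
```

The point of the sketch is the economics: the scheduler spends its visits where change is frequent, and a site on the dead-duck list can go half a year between crawls.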
How much does this cost? The answer is, “At scale, a lot. Indexing a small number of sites is really cheap.” The problem is that in order to pull lots of clicks, one has to have the money to scale or a niche no one else is occupying. Such niches are hard to find, and when one finds one, it makes sense to slap a subscription fee on it; for example, POISINDEX.
Why am I running through what strikes me as basic information about searching the Web? “Dawn of a New Era in Search: Balancing Innovation, Competition, and Public Good” is interesting and does a good job of expressing a specific view of Web search and Google’s content and information assets. I want to highlight the section of the write up titled “The Essential Facilities Doctrine.” The idea is that Google’s search index should be made available to everyone. The idea is interesting, and it might work after legal processes in the US are exhausted. The gating factor will be money and the political climate.
From a competitor’s point of view, the index blended with new ideas about how to answer a user’s query would level the playing field. From Google’s point of view, it would mean the loss of intellectual property.
Several observations:
- The hunger to punish Big Tech demands to be satisfied. Something will come from the judicial decision that Google is a monopoly. It took a couple of decades to arrive at what was obvious to some after the Yahoo ad technology settlement prior to the IPO, but most people didn’t and still don’t get “it.” So something will happen. What that will be is not yet known.
- Wide access to the complete Google index could threaten the national security of the US. Please, think about this statement. I can’t provide any color, but it is a consideration among some professionals.
- An appeal could neutralize some of the “harms,” yet allow the indexing business to continue. Specific provisions might be applied to the decision of Judge Mehta. A modified landscape for search could be created, but online services tend to coalesce into efficient structures. Consider the break up of AT&T: the seven Baby Bells and Bell Labs have become AT&T and Verizon. The same could happen if “ads” were severed from Web search. After a period of time, the break up runs up against one of the Arnold Laws of Online: A single monopoly is more efficient and emergent.
To sum up, the time for action came and, like a train in Switzerland, left on time. Undoing Google is going to be more difficult than fiddling with Standard Oil or the railroad magnates.
Stephen E Arnold, September 6, 2024
Stop Indexing! And Pay Up!
July 17, 2024
This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.
I read “Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI.” The write up appears in two online publications, presumably to make an already contentious subject more clicky. The assertion in the title is the equivalent of someone in Salem, Massachusetts, pointing at a widower and saying, “She’s a witch.” Those willing to take the statement at face value would take action. The “trials” were held in colonial Massachusetts. My high school history teacher was a witchcraft trial buff. (I think his name was Elmer Skaggs.) I thought about his descriptions of the events. I recall his graphic depictions and analysis of what I recall as “dunking.” The idea was that if a person was a witch, then that person could be immersed one or more times. I think the idea had been popular in medieval Europe, but it was not a New World innovation. Me-too is a core way to create novelty. The witch could survive being immersed for a period of time; with that proof, hanging or burning was the next step. The accused who died was obviously not a witch. That’s Boolean logic in a pure form in my opinion.
The Library in Alexandria burns in front of people who wanted to look up information, learn, and create more information. Tough. Once the cultural institution is gone, just figure out the square root of two yourself. Thanks, MSFT Copilot. Good enough.
The accusations and evidence in the article depict companies building large language models as candidates for a test to prove that they have engaged in an improper act. The crime is processing content available on a public network, indexing it, and using the data to create outputs. Since the late 1960s, digitizing information and making it more easily accessible was perceived as an important and necessary activity. The US government supported indexing and searching of technical information. Other fields of endeavor recognized that as the volume of information expanded, the traditional methods of sitting at a table, reading a book or journal article, making notes, analyzing the information, and then conducting additional research or writing a technical report was simply not fast enough. What worked in a medieval library was not a method suited to put a satellite in orbit or perform other knowledge-value tasks.
Thus, online became a thing. Remember, we are talking punched cards, mainframes, and clunky line printers; then one day there was the Internet. The interest in broader access to online information grew, and by 1985, people recognized that online access was useful for many tasks, not just looking up information about nuclear power technologies, a project I worked on in the 1970s. Flash forward 50 years, and we are upon the moment when one can read about the “fact” that Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI.
The write up says:
AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission. Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce.
I understand the surprise some experience when they learn that a software script visits a Web site, processes its content, and generates an index (a buzzy term today is large language model, but I prefer the simpler word index).
I want to point out that for decades those engaged in making information findable and accessible online have processed content so that a user can enter a query and get a list of indexed items which match that user’s query. In the old days, one used Boolean logic, which we met a few moments ago. Today a user’s query (the jazzy term is prompt now) is expanded, interpreted, matched to the user’s “preferences,” and a result generated. I like lists of items like the entries I used to make on a notecard when I was a high school debate team member. Others want little essays suitable for a class assignment on the Salem witchcraft trials in Mr. Skaggs’s class. Today another system can pass a query, get outputs, and then take another action. This is described by the in-crowd as workflow orchestration. Others call it “taking a human’s job.”
My point is that for decades, the index and searching process has been without much innovation. Sure, software scripts can know when to enter a user name and password or capture information from Web pages that are transitory, disappearing in the blink of an eye. But it is still indexing over a network. The object remains to find information of utility to the user or another system.
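The mechanism really is that simple at its core. Here is a toy inverted index with the Boolean AND we met a few moments ago; the documents are invented for illustration:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of documents containing it: an inverted index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Classic Boolean AND: documents containing every query term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "crawler indexes public web pages",
    2: "dynamic pages defeat the crawler",
    3: "public government pages load slowly",
}
index = build_index(docs)
hits = boolean_and(index, "crawler", "pages")  # documents 1 and 2
```

Everything since, from query expansion to large language models, is elaboration on this plumbing: map terms to content objects, then match a query against the map.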
The write up reports:
Proof News contributor Alex Reisner obtained a copy of Books3, another Pile dataset and last year published a piece in The Atlantic reporting his finding that more than 180,000 books, including those written by Margaret Atwood, Michael Pollan, and Zadie Smith, had been lifted. Many authors have since sued AI companies for the unauthorized use of their work and alleged copyright violations. Similar cases have since snowballed, and the platform hosting Books3 has taken it down. In response to the suits, defendants such as Meta, OpenAI, and Bloomberg have argued their actions constitute fair use. A case against EleutherAI, which originally scraped the books and made them public, was voluntarily dismissed by the plaintiffs. Litigation in remaining cases remains in the early stages, leaving the questions surrounding permission and payment unresolved. The Pile has since been removed from its official download site, but it’s still available on file sharing services.
The passage does a good job of making clear that most people are not aware of what indexing does, how it works, and why the process has become a fundamental component of many, many modern knowledge-centric systems. The idea is to find information of value to a person with a question, present relevant content, and enable the user to think new thoughts or write another essay about dead witches being innocent.
The challenge today is that anyone who has written anything wants money. The way online works is that for any single user’s query, the useful information constitutes a tiny, minuscule fraction of the information in the index. The cost of indexing and responding to the query is high, and those costs are difficult to control.
But everyone has to be paid for the information that individual “created.” I understand the idea, but the reality is that the reason indexing, search, and retrieval was invented, refined, and given numerous life extensions was to perform a core function: Answer a question or enable learning.
The write up makes it clear that “AI companies” are witches. The US legal system is going to determine who is a witch just like the process in colonial Salem. Several observations are warranted:
- What is a fundamental mechanism for information retrieval may be difficult to replace or re-invent in a quick, cost-efficient, and satisfactory manner. Digital information is loosey-goosey; that is, it moves, slips, and slides, either by an individual’s actions or a mindless system’s.
- Slapping fines and big price tags on what remains an access service will take time to have an impact. As the implications of the impact become more well known to those who are aggrieved, they may find that their own information is altered in a fundamental way. How many research papers are “original”? How many journalists recycle as a basic work task? How many children’s lives are lost when the medical reference system does not have the data needed to treat the kid’s problem?
- Accusing companies of behaving improperly is definitely easy to do. Many companies do ignore rules, regulations, and cultural norms. Engineering Index’s publisher learned that bootleg copies of printed Compendex indexes were available in China. What was Engineering Index going to do when I learned this almost 50 years ago? The answer was give speeches, complain to those who knew what the heck a Compendex was, and talk to lawyers. What happened to the Chinese content pirates? Not much.
I do understand the anger the essay expresses toward large companies doing indexing. To some, these outfits are witches. However, if the indexing of content is derailed, I would suggest there are downstream consequences. Some of those consequences will make zero difference to anyone. A government worker at a national lab won’t be able to find details of an alloy used in a nuclear device. Who cares? Make some phone calls? Ask around. Yeah, that will work until the information is needed immediately.
A student accustomed to looking up information on a mobile phone won’t be able to find something. The document is a 404, or the information returned is an ad for a Temu product. So what? The kid will have to go to the library, which one hopes will be funded, have printed material or commercial online databases, and a librarian on duty. (Good luck, traditional researchers.) A marketing team eager to get information about the number of Telegram users in Ukraine won’t be able to find it. The fix is to hire a consultant and hope those bright men and women have a way to get a number, a single number, good, bad, or indifferent.
My concern is that as the intensity of the objections to a standard procedure for building an index escalates, the entire knowledge environment is put at risk. I have worked in online since 1962. That’s a long time. It is amazing to me that the plumbing of an information economy has been ignored for so long. What happens when the companies doing the indexing go away? What happens when those producing the government reports, the blog posts, or the “real” news cannot find the information needed to create information? And once some information is created, how is another person going to find it? Ask an eighth grader how to use an online catalog to find a fungible book. Let me know what you learn. Better yet, do you know how to use a Remac card retrieval system?
The present concern about information access troubles me. There are mechanisms to deal with online. But the reason content is digitized is to find it, to enable understanding, and to create new information. Digital information is like gerbils. Start with a couple of journal articles, and one ends up with more journal articles. Kill this access and you get what you wanted. You know exactly who is the Salem witch.
Stephen E Arnold, July 17, 2024
Which Came First? Cliffs Notes or Info Short Cuts
May 8, 2024
This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.
The first online index I learned about was the Stanford Research Institute’s Online System. I think I was a sophomore in college working on a project for Dr. William Gillis. He wanted me to figure out how to index poems for a grant he had. The SRI system opened my eyes to what online indexes could do.
Later I learned that SRI was taking ideas from people like Valerius Maximus (30 CE) and letting a big, expensive, mostly hot group of machines do what a scribe would do in a room filled with rolled-up papyri. My hunch is that other workers with similar “documents” figured out that some type of labeling and grouping system made sense. Sure, anyone could grab a roll, untie the string keeping it together, and check out its contents. “Hey,” someone said, “Put a label on it and make a list of the labels. Alphabetize the list while you are at it.”
An old-fashioned teacher struggles to get students to produce acceptable work. She cannot write TL;DR. The parents will find their scrolling adepts above such criticism. Thanks, MSFT Copilot. How’s the security work coming?
I thought about the common-sense approach to keeping track of and finding information when I read “The Defensive Arrogance of TL;DR.” The essay, or probably more accurately the polemic, calls attention to the précis, abstract, or summary often included with a long online essay. The inclusion of what is now dubbed TL;DR is presented as meaning, “I did not read this long document. I think it is about this subject.”
On one hand, I agree with this statement:
We’re at a rolling boil, and there’s a lot of pressure to turn our work and the work we consume to steam. The steam analogy is worthwhile: a thirsty person can’t subsist on steam. And while there’s a lot of it, you’re unlikely to collect enough as a creator to produce much value.
The idea is that content is often hot air. The essay includes a chart called “The Rise of Dopamine Culture, created by Ted Gioia. Notice that the world of Valerius Maximus is not in the chart. The graphic begins with “slow traditional culture” and zips forward to the razz-ma-tazz datasphere in which we try to survive.
I would suggest that the march from bits of grass, animal skins, clay tablets, and pieces of tree bark to such examples of “slow traditional culture” like film and TV, albums, and newspapers ignores the following:
- Indexing and summarizing remained unchanged for centuries until the SRI demonstration
- In the last 61 years, manual access to content has been pushed aside by machine-centric methods
- Human inputs are less useful
As a result, the TL;DR tells us a number of important things:
- The person using the tag and the “bullets” referenced in the essay reveal that the perceived quality of the document is low or poor. I think of this TL;DR as a reverse Good Housekeeping Seal of Approval. We have a user-assigned “Seal of Disapproval.” That’s useful.
- The tag makes it possible either to skip content bearing a TL;DR tag or to group documents by the author so tagged for review. It is possible an error has been made, or the document is an aberration which provides useful information about the author.
- The person using the tag TL;DR creates a set of content which can be either processed by smart software or a human to learn about the tagger. An index term is a useful data point when creating a profile.
I think the speed with which electronic content has ripped through culture has caused a number of jarring effects. I won’t go into them in this brief post. Part of the “information problem” is that the old-fashioned processes of finding, reading, and writing about something took a long time. Now Amazon presents machine-generated books whipped up in a day or two, maybe less.
TL;DR may have more utility in today’s digital environment.
Stephen E Arnold, May 8, 2024
Problematic Smart Algorithms
December 12, 2023
This essay is the work of a dumb dinobaby. No smart software required.
We already know that AI is fundamentally biased if it is trained with bad or polluted data models. Most of these biases are unintentional, due to ignorance on the part of the developers, i.e., a lack of diversity or vetted information. In order to improve the quality of AI, developers are relying on educated humans to help shape the data models. Not all of the AI projects are looking to fix their polluted data, and ZD Net says it’s going to be a huge problem: “Algorithms Soon Will Run Your Life-And Ruin It, If Trained Incorrectly.”
Our lives are saturated with technology that has incorporated AI. Everything from an application used on a smartphone to a digital assistant like Alexa or Siri uses AI. The article tells us about another type of biased data, and it’s due to an ironic problem. The science team of Aparna Balagopalan, David Madras, David H. Yang, Dylan Hadfield-Menell, Gillian Hadfield, and Marzyeh Ghassemi worked on an AI project that studied how AI algorithms justified their predictions. The data model contained information from human respondents who provided different responses when asked to give descriptive or normative labels for data.
Descriptive data concentrates on hard facts, while normative data focuses on value judgments. The team noticed the pattern, so they conducted another experiment with four data sets to test different policies. The study asked the respondents to judge an apartment complex’s policy about aggressive dogs against images of canines with normative or descriptive tags. The results were astounding and scary:
"The descriptive labelers were asked to decide whether certain factual features were present or not – such as whether the dog was aggressive or unkempt. If the answer was "yes," then the rule was essentially violated — but the participants had no idea that this rule existed when weighing in and therefore weren’t aware that their answer would eject a hapless canine from the apartment.
Meanwhile, another group of normative labelers were told about the policy prohibiting aggressive dogs, and then asked to stand judgment on each image.
It turns out that humans are far less likely to label an object as a violation when aware of a rule and much more likely to register a dog as aggressive (albeit unknowingly ) when asked to label things descriptively.
The difference wasn’t by a small margin either. Descriptive labelers (those who didn’t know the apartment rule but were asked to weigh in on aggressiveness) had unwittingly condemned 20% more dogs to doggy jail than those who were asked if the same image of the pooch broke the apartment rule or not.”
The conclusion is that AI developers need to spread the word about this problem and find solutions. This could be another fear-mongering tactic like the Y2K implosion. What happened with that? Nothing. Yes, this is a problem, but it will probably be solved before society meets its end.
Whitney Grace, December 12, 2023
NewsGuard, Now Guarding Podcasts
May 23, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
Advertising alongside false or biased information can be bad for a brand’s image, a problem that has obviously escalated in recent years. News vetting service NewsGuard saw a niche and promptly filled it. The firm has provided would-be advertisers with reliability ratings for websites and TV shows since 2018, and now includes podcasts in its appraisals. The company’s PodNews shares the press release, “NewsGuard Launches World’s First Journalist-Vetted Podcast Credibility Ratings to Help Advertisers.”
We learn NewsGuard is working with three top podcast platforms to spread the word to advertisers. The platforms will also use ratings to inform their recommendation engines and moderate content. The write-up explains:
“The podcast ratings include a trust score from 0-10, overall risk level, metadata fields, and a detailed written explanation of the podcast’s content and record of credibility and transparency. The ratings are used by brands and agencies to direct their ad spend toward highly trustworthy, brand-safe news podcasts while being protected from brand-safety and brand-suitability risks inherent in advertising on news and politics content. … NewsGuard determines which news and information podcasts to rate based on factors including reported engagement, estimated ad revenue, and the volume of news and information content in the podcast’s episodes. The podcasts rated by NewsGuard include those that cover topics including politics, current affairs, health, business, and finance. The journalists at NewsGuard assess news and information podcasts based on five journalistic criteria:
- Does not regularly convey false, unchallenged information: 4 points
- Conveys news on important topics responsibly: 3 points
- Is not dominated by one-sided opinion: 1 point
- Discloses, or does not have, a political agenda: 1 point
- Differentiates advertising and commercial partnerships from editorial content: 1 point”
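Since the criteria and their weights are public, the arithmetic of the 0-10 trust score is easy to reproduce. A quick sketch (the weights come from the press release; the sample podcast’s pass/fail results are invented for illustration):

```python
# Criteria and weights as listed in the press release; a podcast earns
# a criterion's points only if it satisfies that criterion.
CRITERIA = {
    "no_false_unchallenged_info": 4,
    "responsible_news_coverage": 3,
    "not_one_sided": 1,
    "discloses_political_agenda": 1,
    "separates_ads_from_editorial": 1,
}

def trust_score(passed):
    """Sum the weights of the criteria the podcast satisfies (0-10)."""
    return sum(CRITERIA[name] for name in passed)

# Hypothetical podcast that fails only the one-sidedness criterion.
score = trust_score([
    "no_false_unchallenged_info",
    "responsible_news_coverage",
    "discloses_political_agenda",
    "separates_ads_from_editorial",
])  # 4 + 3 + 1 + 1 = 9
```

Note how the weighting works: failing the two heavyweight criteria drops a podcast to 3 out of 10 even if every other box is checked.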
The press release shares example scores, or what it calls “Nutrition Labels,” for five podcasts. The top scorer shown is a Murdoch-owned Wall Street Journal podcast, which received a 10 out of 10. Interesting. NewsGuard was launched in 2018 by a pair of journalist entrepreneurs and is based in New York City.
Cynthia Murrell, May 23, 2023
AI Shocker? Automatic Indexing Does Not Work
May 8, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
I am tempted to dig into my more than 50 years of work in online and pull out a chestnut or two. I will not. Just navigate to “ChatGPT Is Powered by These Contractors Making $15 an Hour” and check out the allegedly accurate statements about the knowledge work a couple of people do.
The write up states:
… contractors have spent countless hours in the past few years teaching OpenAI’s systems to give better responses in ChatGPT.
The write up includes an interesting quote; to wit:
“We are grunt workers, but there would be no AI language systems without it,” said Savreux [an indexer tagging content for OpenAI].
I want to point out a few items germane to human indexers based on my experience with content about nuclear information, business information, health information, pharmaceutical information, and “information” information which thumbtypers call metadata:
- Human indexers, even when trained in the use of a carefully constructed controlled vocabulary, make errors, become fatigued and fall back on some favorite terms, and misunderstand the content and assign terms which will mislead when used in a query
- Source content — regardless of type — varies widely. New subjects or different spins on what seem to be known concepts mean that important nuances may be lost due to what is included in the available dataset
- New content often uses words and phrases which are difficult to understand. I try to note a few of the more colorful “new” words and bound phrases like softkill, resenteeism, charity porn, toilet track, and purity spirals, among others. In order to index a document in a way that allows one to locate it, knowing the term is helpful if there is a full-text instance. If not, one needs a handle on the concept, which is an index term a system or a searcher knows to use. Relaxing the meaning (a trick of some clever outfits with snappy names) is not helpful
- Creating a training set, keeping it updated, and assembling the content artifacts is slow, expensive, and difficult. (That’s why some folks have been seeking short cuts for decades. So far, humans still become necessary.)
- Reindexing, refreshing, or updating the digital construct used to “make sense” of content objects is slow, expensive, and difficult. (Ask an Autonomy user from 1998 about retraining in order to deal with “drift.” Let me know what you find out. Hint: The same issues arise from popular mathematical procedures no matter how many buzzwords are used to explain away what happens when words, concepts, and information change.)
Are there other interesting factoids about dealing with multi-type content? Sure there are. Wouldn’t it be helpful if those creating the content applied structure tags, abstracts, lists of entities and their definitions within the field or subject area of the content, and pointers to sources cited in the content object?
Let me know when blog creators, PR professionals, and TikTok artists embrace this extra work.
Pop quiz: When was the last time you used a controlled vocabulary classification code to disambiguate airplane terminal, computer terminal, and terminal disease? How does smart software do this, pray tell? If the write up and my experience are on the same wavelength (not a surfing wave but a frequency wave), a subject matter expert, a trained index professional, or software smarter than today’s smart software is needed.
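For readers who never met a controlled vocabulary, here is a minimal sketch of the disambiguation a trained indexer performs. The classification codes and context clues below are invented for illustration, not drawn from any real thesaurus:

```python
# A miniature controlled vocabulary: each sense of "terminal" maps to a
# distinct classification code, the way an indexer disambiguates by hand.
# Codes and field labels are hypothetical.
VOCABULARY = {
    ("terminal", "aviation"): "TRA-651",   # airplane terminal
    ("terminal", "computing"): "CMP-004",  # computer terminal
    ("terminal", "medicine"): "MED-210",   # terminal disease
}

def assign_code(term, context_words):
    """Pick a classification code by matching context clues around the term."""
    clues = {
        "aviation": {"airport", "airplane", "flight", "gate"},
        "computing": {"keyboard", "screen", "mainframe", "login"},
        "medicine": {"disease", "patient", "diagnosis", "illness"},
    }
    for field, words in clues.items():
        if words & set(context_words):
            return VOCABULARY[(term, field)]
    return None  # ambiguous: a human indexer has to decide

code = assign_code("terminal", ["the", "patient", "received", "a", "diagnosis"])
```

The sketch also shows where the approach breaks: no clue words, no code, and the document lands back on a human’s desk.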
Stephen E Arnold, May 8, 2023
Don Quixote Rides Again: Instead of Windmills, the Target Is Official and True Government Documents
December 8, 2022
I read “Archiving Official Documents as an Act of Radical Journalism.” The main idea is that a non-governmental entity will collect official and “true” government documents, save them, and make them searchable. Now this is an interesting idea, and it is one for which most of the countries I have advised about archiving information already have solutions. The solutions range from the wild and wooly methods used in the Japanese government to the logical approach implemented in Sweden. There’s a carnival atmosphere in Brazil, and there is a fairly interesting method in Croatia. France? Mais oui.
In each of these countries, one has to have quite specific know-how in order to obtain an official and true government document. I know from experience that a person who is not a resident of some of these countries has pretty much zero chance of getting a public transcript of a public hearing. In some cases, even with appropriate insider assistance, finding the documents is often impossible. Sure, the documents are “there.” But due to budget constraints, lousy technology, or staff procedures — not a chance. The Vatican Library has a number of little-discussed incidents where pages from old books get chopped out of a priceless volume. Where are those pages now? Hey, where’s that hymn book from the 14th century?
I want you to notice that I did not mention the US. In America we have what some might call “let many flowers bloom” methods. You might think the Library of Congress has government documents. Yeah, sort of, well, some. Keep in mind that the US Senate has documents, as does the House. Where are the working drafts of a bill? Try chasing that one down, assuming you have connections and appropriate documentation to poke around. Who has the photos of government nuclear facilities from the 1950s? I know where they used to be in the “old” building in Germantown, Maryland. I even know how to run the wonky vertical lift to look in the cardboard boxes. Now? You have to be kidding. What about the public documents from Health and Human Services related to MIC, RAC, and ZPIC? Oh, you haven’t heard of these? Good luck finding them. I could work through every US government agency in which I have worked and provide what I think are fun examples of official government documents that are often quite, quite, quite difficult to locate.
The write up explains its idea, which puts a windmill in the targeting device:
Democracy’s Library, a new project of the Internet Archive that launched last month, has begun collecting the world’s government publications into a single, permanent, searchable online repository, so that everyone—journalists, authors, academics, and interested citizens—will always be able to find, read, and use them. It’s a very fundamental form of journalism.
I am not sure the idea is a good one. In some countries, collecting government documents could become what I would characterize as a “problem.” What type of problem? How about fine, jail time, or unpleasantness that can follow you around like Shakespeare’s spaniels at your heels.
Several observations:
- Public official government documents change, they disappear, and they become non public without warning. An archive of public government documents will become quite a management challenge when classifications change, regimes change, and when government bureaucracies change course. Chase down a librarian at a US government repository library near you and ask some questions. Let me know how that works out when you bring up some of the administrative issues for documents in a collection.
- A collection of official and true documents which tries to be comprehensive for a single country is going to be radioactive. Searchable information is problematic. That’s why enterprise search vendors who say, “All the information in your organization is searchable” evoke statements like “Get this outfit out of my office.” Some data is harmless when isolated. Pile data and information together and the stuff can go critical.
- Electronic official and true government documents are often inaccessible. Examples range from public information stored in Lotus Notes (not the world’s best document system, in my opinion) to PowerPoint reports prepared for a public conference about the US Army’s Distributed Common Ground Information System. Now try to get the public document, and you may find that what was okay for a small fish conference in Tyson’s Corner is going to evoke some interesting responses as the requests buck up the line.
- Collecting and piling up official and true information sounds good … to some. Others may view the effort with some skepticism because public government information is essentially infinite. Once collected those data may never go away. Never is a long time. How about those FOIA requests?
What’s the fix? Answer: Don Quixote became an icon for a reason, and it was not just elegant Spanish prose.
Stephen E Arnold, December 2022
The Failure of Search: Let Many Flowers Bloom and… Die Alone and Sad
November 1, 2022
I read “Taxonomy is Hard.” No argument from me. Yesterday (October 31, 2022) I spoke with a long time colleague and friend. Our conversations usually include some discussion about the loss of the expertise embodied in the early commercial database firms. The old frameworks, work processes, and shared beliefs among the top 15 or 20 for fee online database companies seem to have been scattered and recycled in a quantum crazy digital world. We did not mention Google once, but we could have. My colleague and I agreed on several points:
- Those who want to make digital information products must have an informing editorial policy; that is, what’s the content space, what’s included, what’s excluded, and what problem does the commercial database solve
- Finding information today is more difficult than it has been in our two professional lives. We don’t know if the data are current and accurate (online corrections when publications issue fixes), whether they fit within the editorial policy (if there is one), or whether they reflect the lack of policy shaped by the invisible hand of politics, advertising, and indifference to intellectual nuances. In some services, “old” data are disappeared, presumably due to the cost of maintaining them, updating them (if that is actually done), and working out how to make in depth queries work within available time and budget constraints
- The steady erosion of precision and recall as reliable yardsticks for determining what a search system can find within a specific body of content
- Professional indexing and content curation are being compressed or ignored by many firms. The process is expensive, time consuming, and intellectually difficult.
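The precision and recall yardsticks mentioned above are simple to state but hard to keep honest. A minimal sketch of how they are computed for a single query (the document IDs are hypothetical, for illustration only):

```python
# Precision: what fraction of retrieved documents are relevant?
# Recall: what fraction of relevant documents were retrieved?
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: the engine returns 4 documents; 6 are actually relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7", "d8"],
)
print(p, r)  # 0.5 precision, roughly 0.33 recall
```

The catch, of course, is the denominator for recall: someone has to know the full set of relevant documents in the collection, which is exactly the kind of expensive editorial work being compressed or ignored.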
The cited article reflects some of these issues. However, the mirror is shaped by the systems and methods in use today. The approaches pivot on metadata (index terms) and tagging (more indexing). The approach is understandable. The shift to technology which slashes the need for subject matter experts, manual methods, meetings about specific terms or categories, and the other impedimenta is the new normal.
A couple of observations:
- The problems of social media boil down to editorial policies. Without these guard rails and the specialists needed to maintain them, finding specific items of information on widely used platforms like Facebook, TikTok, or Twitter, among others, is difficult
- The challenges of processing video are enormous. The obvious fix is to gate the volume and implement specific editorial guidelines before content is made available to a user. Skipping this basic work task leads to the craziness evident in many services today
- Indexing can be supplemented by smart software. However, that smart software can drift off course, so specialists have to intervene and recalibrate the system.
- Semantic, statistical, or behavior centric methods for identifying and suggesting possibly relevant content require the same expert centric approach. There is no free lunch in automated indexing, even for narrow vocabulary technical fields like nuclear physics or engineered materials. What smart software knows how to deal with new breakthroughs in physics which emerge from the study of inter cell behavior among proteins in the human brain?
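To see why automated indexing drifts without a specialist recalibrating it, consider a caricature of frequency-based term extraction. The stop list below stands in for the expert-maintained controlled vocabulary; everything here (the list, the sample text) is invented for illustration:

```python
from collections import Counter
import re

# Hypothetical stop list a human specialist would maintain and extend.
# Without that ongoing curation, junk terms creep into the index.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "for"}

def extract_index_terms(text, top_n=5):
    """Naive frequency-based indexing: no editorial policy, no SME review."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]

doc = ("The reactor core temperature rose. Reactor shielding held. "
       "Core inspection of the reactor is scheduled.")
print(extract_index_terms(doc, top_n=3))  # 'reactor' and 'core' lead
```

A new breakthrough introduces new vocabulary with low frequency, so a scheme like this simply does not see it until a subject matter expert intervenes, which is the recalibration point above.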
Net net: Is it time to re-evaluate some discarded systems and methods? Is it time to accept the fact that technology cannot solve in isolation certain problems? Is it time to recognize that close enough for horseshoes and good enough are not appropriate when it comes to knowledge centric activities? Search engines die when the information garden cannot support the buds and shoots of finding useful information the user seeks.
Stephen E Arnold, November 1, 2022
Controlled Term Lists Morph into Data Catalogs That Are Better, Faster, and Cheaper to Generate
May 24, 2022
Indexing and classifying content is boring. A human subject matter expert asked to extract index terms and assign classification codes works great. But the humanoid SME gets tired and begins assigning general terms from memory. Plus humanoids want health care, retirement benefits, and time to go fishing in the Ozarks. (Yes, the beautiful sunny Ozarks!)
With off-the-shelf smart software available on GitHub or at a bargain price from the ever-secure Microsoft or the warehouse-subleasing Amazon, innovators can use machines to handle the indexing. In order to make this basic work into a glam task, slap on a new bit of jargon, and you are ready to create a data catalog.
“16 Top Data Catalog Software Tools to Consider Using in 2022” is a listing of automated indexing and classifying products and services. No humanoids or not too many humanoids needed. The software delivers lower costs and none of the humanoid deterioration after a few hours of indexing. Those software systems are really something: No vacations, no benefits, no health care, and no breaks during which unionization can be discussed.
What’s interesting about the list is that it includes the allegedly quasi monopolistic outfits like Amazon, Google, IBM, Informatica, and Oracle. The write up does not answer the question, “Are the terms and other metadata the trade secret of the customer?” The reason I am curious is that rolling up terms from numerous organizations and indexing each term as originating at a particular company provides a useful data set to analyze for trends, entities, and the date and time of the document from which the terms were derived. But no alleged monopoly would look at a cloud customer’s data? Inconceivable.
The list of vendors also includes some names which are not yet among the titans of content processing; for example:
Alation
Alex
Ataccama
Atlan
Boomi
Collibra
Data.world
Erwin
Lumada
There are some other vendors in the indexing business. You can identify these players by joining NFAIS, now the National Federation of Advanced Information Services. The outfit discarded the now out of favor terminology of abstracting and indexing. My hunch is that some NFAIS members can point out some of the potential downsides of using smart software to process business and customer information. New terms and jazzy company names can cause digital consternation. But smart software just gets smarter even as it mis-labels, mis-indexes, and mis-understands. No problem: Cheaper, faster, and better. A trifecta. Who needs SMEs to look at an exception file, correct errors, and tune the system? No one!
Stephen E Arnold, May 24, 2022
Google: Admitting What It Does Now That People Believe Google Is the Holy Grail of Information
March 21, 2022
About 25 years. That’s how long it took Google to admit that it divides the world into bluebirds, canaries, sparrows, and dead ducks. Are we talking about our feathered friends? Nope. We are dividing the publicly accessible Web sites into four categories. Note: These are my research team’s classifications:
Bluebirds — Web sites indexed in sort of almost real time. Example: whitehouse.gov and sites which pull big ad sales
Canaries — Web sites that are popular but indexed in a more relaxed manner. Example: Sites which pull ad money but not at the brand level
Sparrows — Web sites that people look at but pull less lucrative ads. Example: Your site, probably?
Dead ducks — Sites banned, down checked for “quality”, or sites which use Google’s banned words. Example: Drug ads which generate money and kick up unwanted scrutiny from some busy bodies; you will have to use non-Google search systems to locate these resources.
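The four-tier scheme above can be caricatured in code. To be clear, the tiers are my research team’s classifications, and the ad revenue thresholds below are invented for illustration, not anything Google publishes:

```python
# Toy classifier for the bird taxonomy above. The threshold values are
# hypothetical; the point is that crawl priority tracks ad value.
def crawl_tier(site):
    """Assign a site (a dict of hypothetical attributes) to a tier."""
    if site.get("banned"):
        return "dead duck"   # banned or down checked for "quality"
    ad_revenue = site.get("ad_revenue", 0)
    if ad_revenue >= 1_000_000:
        return "bluebird"    # indexed in sort of almost real time
    if ad_revenue >= 10_000:
        return "canary"      # popular, indexed in a more relaxed manner
    return "sparrow"         # your site, probably

print(crawl_tier({"ad_revenue": 5_000_000}))  # bluebird
print(crawl_tier({"banned": True}))           # dead duck
print(crawl_tier({"ad_revenue": 500}))        # sparrow
```

The sparrow default is the punch line: a site with nothing special about it gets whatever crawl attention is left over.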
“Google Says ‘Discovered – Currently Not Indexed’ Status Can Last Forever” explains:
‘Discovered – Currently not indexed’ in the Google Search Console Index Coverage report can potentially last forever, as the search engine doesn’t index every page.
The article adds:
Google doesn’t make any guarantees to crawl and index every webpage. Even though Google is one of the biggest companies in the world, it has finite resources when it comes to computing power.
Monopoly power? Now that Google dominates search, it can decide what can be found for billions of people.
This is a great thing for the Google. For others, perhaps not quite the benefit the clueless user expects?
If something cannot be found in the Google Web search index, that something does not exist for lots of people. After 25 years of information control, the Google spills the beans about dead ducks.
Stephen E Arnold, March 21, 2022