Search and Retrieval: A Sub Sub Assembly

January 2, 2023

What’s happening with search and retrieval? Google’s results irritate some; others are happy with Google’s shaping of information. Web competitors exist; for example, Kagi.com and Neeva.com. Both are subscription services. Others provide search results “for free”; examples include Swisscows.com and Yandex.com. You can find metasearch systems (minimal original spidering, just recycling results from other services like Bing.com); for instance, StartPage.com (formerly Ixquick.com) and DuckDuckGo.com. Then there are open source search options. The flagship or flagships are Solr and Lucene. Proprietary systems exist too. These include the ageing X1.com and the even age-ier Coveo system. Remnants of long-gone systems are kicking around too; to wit, BRS and Fulcrum from OpenText, Fast Search now a Microsoft property, and Endeca, owned by Oracle. But let’s look at search as it appears to a younger person today.

A decayed foundation created via smart software on the Mage.space system. A flawed search and retrieval system can make the structure built on the foundation crumble like Southwest Airlines’ reservation system.

First, the primary means of access is via a mobile device. Surprisingly, the source of information for many is video content delivered by the China-linked TikTok or the advertising remora YouTube.com. In some parts of the world, the go-to information system is Telegram, developed by Russian brothers. This is a centralized service, not a New Wave Web 3 confection. One can use the service and obtain information via a query or a group. If one is “special,” an invitation to a private group allows access to individuals providing information about open source intelligence methods or the Russian special operation, including allegedly accurate video snips of real-life war or disinformation.

The challenge is that search is everywhere. Yet in the real world, finding certain types of information is extremely difficult. Obtaining that information may be impossible without informed contacts, programming expertise, or money to pay what would have been called “special librarian research professionals” in the 1980s. (Today, it seems, everyone is a search expert.)

Here are examples of the types of information which are difficult, if not impossible, to obtain:

The ownership of a domain
The ownership of a Tor-accessible domain
The date at which a content object was created, the date the content object was indexed, and the date or dates referenced in the content object
Certain government documents; for example, unsealed court documents, US government contracts for third-party enforcement services, authorship information for a specific Congressional bill draft, etc.
A copy of a presentation made by a corporate executive at a public conference.
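
The three dates in the list above are easy to conflate. Here is a minimal sketch, assuming a hypothetical `ContentObject` record and two invented items, of why a "recent results" filter gives different answers depending on which date a system chooses to use. None of the names or data come from a real search system.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ContentObject:
    """Hypothetical record keeping the three dates separate."""
    title: str
    created: date   # when the author produced the item
    indexed: date   # when the search system first processed it
    referenced: List[date] = field(default_factory=list)  # dates mentioned in the text


# Invented sample data, for illustration only.
ITEMS = [
    ContentObject("Old report, crawled late",
                  created=date(2015, 3, 1),
                  indexed=date(2022, 12, 20),
                  referenced=[date(2014, 6, 30)]),
    ContentObject("Fresh post about an old event",
                  created=date(2022, 12, 28),
                  indexed=date(2022, 12, 29),
                  referenced=[date(2015, 3, 1)]),
]


def since(items, cutoff, which):
    """Return titles whose chosen date field ('created' or 'indexed') is on or after cutoff."""
    return [item.title for item in items if getattr(item, which) >= cutoff]


cutoff = date(2022, 12, 1)
by_indexed = since(ITEMS, cutoff, "indexed")   # both items look "recent"
by_created = since(ITEMS, cutoff, "created")   # only the fresh post qualifies
```

A system that exposes only the index date makes the 2015 report look new. A user who cannot see which date was used has no way to tell the difference.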

I can provide other examples, but I wanted to highlight the flaws in today’s findability.

What do these examples say about the efficacy of search?

Years ago, for Searcher Magazine, I wrote an article called “Search Sucks.” I think the editor Barbara Quint changed it to a more politically correct and less accurate title like “Search Does Not Work.” The main point of the piece was to identify the types of unsolved retrieval issues confronting professionally-trained online and traditional researchers. The same problems exist today.

Now many pundits and AI advocates are pitching smart software as the optimal way forward in findability. One of the more interesting mutations of search is described in “AI Allows Dead Woman to Talk to People Who Showed Up at Her Funeral.” Natural language processing, linguistic patterns, and a corpus of text enable smart software to talk with a deceased person.

Another interesting evolutionary mutation strikes at the heart of search vendors who endlessly pitch their search and retrieval system as a way to deliver enhanced customer support. “Customer support” means lower cost interactions with customers. The most recent example of search disguised in a different software shell is “Companies Can Hire a Virtual Person for about $14k a Year in China.” The main idea is that customer service can be delivered via a natural language avatar.

Both of these examples make clear that search and retrieval is now a sub sub system. Keyword matching and semantic analysis make it possible to understand input, craft an answer, and deliver it in a way that requires minimal effort on the part of the person wanting information.

But has search reached a stage of refinement sufficient to make sense of what the person interacting with a findability system wants and to deliver high-value answers? I would suggest that today’s search has improved since the days of NASA RECON, SDC Orbit, STAIRS III, and other old-school systems. However, today’s smart software is often as effective as the original Smart system developed by Gerard Salton and his colleagues at Cornell University in the 1960s.

For me, search and retrieval, whether delivered with the Dialog Information Services command line or the weirdness of Amazon’s Alexa, leaves me with a sense of opportunities lost. Search and retrieval is more than mindless matching or statistical probabilities based on masses of Web content. Developers and vendors continue to dodge such fundamental issues as editorial policy, computational cost, investment in development of systems that solve high-value problems, methods to reduce bias in a result set, and communication about what caused a certain result to appear in response to a user or system input.

I know first-hand how foreign some of these dinobaby points are to today’s search and retrieval experts. Relevance has become an afterthought, particularly when advertising dollars are a lubricant. Precision is difficult due to synonym expansion, hidden stop words, and arbitrary decisions about what to expose to a user or a system.
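
To make the precision point concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the four documents, the relevance judgments, and the `SYNONYMS` table standing in for expansion a system might apply without telling the user. It is not how any particular engine works; it only shows the arithmetic.

```python
# Invented corpus and relevance judgments (the searcher wants the animal).
DOCS = {
    "d1": "jaguar habitat in the amazon rainforest",
    "d2": "jaguar car dealership opens downtown",
    "d3": "big cat conservation in south america",
    "d4": "used automobile prices rise again",
}
RELEVANT = {"d1", "d3"}

# Hypothetical synonym table applied behind the scenes.
SYNONYMS = {"jaguar": {"car", "automobile"}}


def search(query, expand=False):
    """Return ids of documents sharing at least one term with the query (OR matching)."""
    terms = set(query.lower().split())
    if expand:
        for term in list(terms):
            terms |= SYNONYMS.get(term, set())
    return {doc_id for doc_id, text in DOCS.items() if terms & set(text.split())}


def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


plain = search("jaguar")                  # matches d1 and d2
expanded = search("jaguar", expand=True)  # also pulls in d4

p_plain = precision(plain, RELEVANT)        # 1 relevant of 2 retrieved
p_expanded = precision(expanded, RELEVANT)  # 1 relevant of 3 retrieved
```

In this toy case the expansion adds a used-car story, finds no additional relevant item, and precision drops from one half to one third. The user typed one word and never sees why the extra result appeared.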

I am concerned about several trends which I think may become more evident in 2023. Feel free to disagree. I am at an age and station in life at which criticism of my ideas is familiar.

First, the emergence of what I call the DYOR (do your own research) expert. This is an individual, a group, a company, or an association which positions itself as an expert researcher. Some of these open source experts do not have training in special librarian tools, mindset, or techniques. Commercial services are ignored because they cost money. Why not use Twitter instead? What I think is emerging is a class of online researchers who will manifest some information blind spots as a result of haste, budget constraints, and knowledge gaps. I am not sure how to address this issue because OSINT is the trendy way to get information. If I raise a doubt, I hear, “Look at the Ukraine Russia war. Without OSINT we would know nothing.” I think it depends on the institution with which the OSINT advocate works. DYOR methods can mislead or just be wrong.

Second, the diminution of search and retrieval. I think many of today’s findability systems have taken their eye off the ball. The difficult parts of search are ignored or assumed to be solved. Why not download something from an open source software repository? That will be good enough. The results will be close enough for horseshoes. That attitude is, in my view, dangerous. As automation becomes increasingly pervasive, it will be impossible to identify that an issue may be buried deep in a sub sub system. The fix is to ignore the problem or create a software shell and move forward. This mindset is likely to have some interesting, dire, and unforeseen consequences.

Third, the idea that anyone and everyone is an expert in search and retrieval is a potentially dangerous illusion. Knowing how to assess a content object and having the know-how to verify a datum or an item of information are key parts of a knowledge toolkit. If people don’t know about such a toolkit and have zero desire to master the specific tools, decisions are likely to go off the rails. One current example is the failure of Southwest Airlines to be an airline. Search flaws almost guarantee that findability will suffer the same fate as those abandoned bags and idle airplanes.

Net net: Search is an issue. I think that in 2023 more people will realize that it is far more important than cheaper customer support or a weird voice from beyond the grave.

Stephen E Arnold, January 2, 2023

Written by Stephen E. Arnold · Filed Under AI, Feature, News

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.