Search Quality: 2022 Style

January 11, 2022

I read the interesting “Is Google Search Deteriorating? Measuring Google’s Search Quality in 2022?” The approach is different from the one used at the commercial database outfits for which I worked decades ago. We knew what our editorial policy was; that is, we could tell a person exactly what was indexed, how it was indexed, how classification codes were assigned, and what the field codes were for each item in our database. (A field code for those who have never encountered the term means an index term which disambiguates a computer terminal from an airport terminal.) When we tested a search engine (for example, a touch of the DataStar systems), we could determine the precision and recall of the result set. This was math, not an opinion. Yep, we had automatic indexing routines, but we relied primarily on human editors and subject matter experts with a consultant or two tossed in for good measure. (A tip of the Silent 700 paper feed to you, Betty Eddison.)
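For readers who have not run the numbers, here is a minimal sketch of the precision and recall arithmetic. The document identifiers and the relevance judgments are made up for illustration; a real test needs a known set of relevant records for each query.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids the search system returned
    relevant: document ids judged relevant (hypothetical judgments here)
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    # Precision: share of returned records that are relevant.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of all relevant records that were returned.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: the system returns 4 records, 3 of them relevant,
# while the database holds 6 relevant records in total.
retrieved = ["d1", "d2", "d3", "d9"]
relevant = ["d1", "d2", "d3", "d4", "d5", "d6"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The catch is the second argument: recall cannot be computed unless someone knows what is in the index.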

The cited article takes a different approach. It is mostly subjective. The result of the analysis is that Google is better than Bing. Here’s a key passage:

So Google does outperform Bing (the difference is statistically significant)…

Okay, statistics.

Several observations:

First, I am not sure either Bing’s search team or Google’s search team knows what is in the indexes at any point in time. I assume someone could look, but I know from first hand experience that the young wizards are not interested in the scope of an index. The interest is reducing the load or computational cost of indexing new content objects and updating certain content objects, discarding content domains which don’t pay for their computational costs, and similar MBA inspired engineering efficiencies. Nobody gets a bonus for knowing what’s indexed, when, why, and whether that index set is comprehensive. How deep does Google go into unloved Web sites like the Railway Retirement Board?

Second, without time benchmarks and hard data about precision and recall, the subjective approach to evaluating search results misses the point of Bing and Google. These are systems which must generate revenue. Bing has been late to the party, but the Redmond security champs are giving ad sales the old college drop out try. (A tip of the hat to MSFT’s eternal freshman, Bill Gates, too.) The results which are relevant are the ones that by some algorithmic cartwheels burn through the ad inventory. Money matters; understanding user queries, supporting Boolean logic, and including date and time information about the content object and when it was last indexed are irrelevant. In one meeting, I can honestly say no one knew what I was talking about when I mentioned “time” index points.

Third, there are useful search engines which should be used as yardsticks against which to measure the Google and the smaller pretender, Bing. Why not include Swisscows.ch or Yandex.ru or Baidu.com or any of the other seven or eight Web centric and no charge systems? I suppose one could toss in the Google killer Neeva and a handful of metasearch systems. Yep, that’s work. Set up standard queries. Capture results. Analyze those results. Calculate result overlap. Get subject matter experts to evaluate the results. Do the queries at different points in time for a period of three months or more, etc., etc. This is probably not going to happen.
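The “calculate result overlap” step is not exotic. A minimal sketch, assuming the results for one standard query have already been captured as lists of URLs; the engine names and URLs below are placeholders, and Jaccard overlap is one reasonable measure among several:

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two result sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(results_by_engine):
    """Map each pair of engine names to the overlap of their result URLs."""
    return {
        (e1, e2): jaccard(results_by_engine[e1], results_by_engine[e2])
        for e1, e2 in combinations(sorted(results_by_engine), 2)
    }


# Hypothetical captured results for one standard query (placeholder URLs).
captured = {
    "engine_a": ["u1", "u2", "u3", "u4"],
    "engine_b": ["u2", "u3", "u5", "u6"],
    "engine_c": ["u7", "u8"],
}
overlaps = pairwise_overlap(captured)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Run the same capture weekly for three months and the overlap scores become a time series, which is where the work comes in.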

Fourth, what has been filtered? Those stop word lists are fascinating and they make it very difficult to find certain information. With traditional libraries struggling for survival, where is that verifiable research process going to lead? Yep, ad centric, free search systems. It might be better to just guess at some answers.

Net net: Web search is not very good. It never has been. For fee databases are usually an afterthought if thought of at all. It is remarkable how many people pass themselves off as open source intelligence experts, expert online researchers, or digital natives able to find “anything” using their mobile phone.

Folks, most people are living in a cloud of unknowing. Search results shape understanding. A failure of search just means that users have zero chance to figure out if a result from a free Web query is much more than Madison Avenue, propaganda, crooked card dealing, or some other content injection goal.

That’s what one gets when the lowest cost methods to generate the highest ad revenue are conflated with information retrieval. But, hey, you can order a pizza easily.

Stephen E Arnold, January 11, 2022

Cherche: A Neural Search Pipeline

January 10, 2022

For fans of open source search, Cherche is available. The GitHub write up states:

Cherche is meant to be used with small to medium sized corpora. Cherche’s main strength is its ability to build diverse and end-to-end pipelines.

The “neural search” module includes ElasticSearch. The programming team for Cherche consists of Raphaël Sourty and François-Paul Servant. Beyond Search has not fired up the system and run it against our test corpus. We did have in our files a paper called “Knowledge Base Embedding by Cooperative Knowledge Distillation.” That paper states:

Given a set of KBs, our proposed approach KDMKB, learns KB embeddings by mutually and jointly distilling knowledge within a dynamic teacher-student setting. Experimental results on two standard datasets show that knowledge distillation between KBs through entity and relation inference is actually observed. We also show that cooperative learning significantly outperforms the two proposed baselines, namely traditional and sequential distillation.

The idea is that instead of retrieving strings, broader tags (concepts and classifications) appear to provide an advantage; pushing “beyond” old school search.

Stephen E Arnold, January 10, 2022

The Collision of Search Thinkers and the Wide World of Finding

January 4, 2022

To get some insight into the vibrations set off when search thinkers run into market behaviors, you will want to scan the Twitter thread about the need to create an alternative to Google. The focus is medical information. The idea is to return results for a health query without “clickbait sites riddled with crappy ads.” The criticism of the Google was not ignored. No less a luminary than Danny Sullivan replied with Google’s “we are always looking to keep improving our results.”

Digital Don Quixotes saddled up and asserted in this Tweet stream that Google can be beaten. The fix is to create a niche search engine tailored to provide results where Google is just thrilled to present “spam.” Assorted Tweeters added comments.

What do these two Tweeter threads suggest to me?

First, there are niche search engines (what I call vertical search services) that deliver on point results. These are probably not ones most people think about because users of free or ad-supported systems do not know much about finding high value information. Also, I know from my decades in the commercial database business that most “online experts” don’t want to pay for access to commercial online services. Academics get “free” access to content pools like Lexis Nexis, and the “old” Dialog type files because institutions pay the license fees. To the academic user, high value information is “free.” It is not.

Second, a number of Web centric search engines provide reasonably useful results. Examples range from iSeek.com to the Metager system. The mechanism for locating specific information is to frame a query, manually or automatically pass the query to numerous search engines, de-duplicate the result sets, and examine the links. Industrious searchers may enlist tools like Maltego or other open source software to identify potentially helpful items to examine initially. Who wants to do this? I suggest that fewer than three percent of online users pursue this approach. People want to have the mobile phone light up when a pizza joint is nearby or the Tesla’s electric gauge is creeping into the “hello, I need a flat bed truck, please” zone.
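The de-duplication step in that workflow can be sketched in a few lines. This is an illustration, not how iSeek or Metager do it; the normalization rule is an assumed, simplistic one, and the URLs are invented:

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a rough comparison key (an assumed, simplistic rule)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Ignore scheme, query string, and trailing slash when comparing.
    path = parts.path.rstrip("/")
    return host + path


def merge_results(result_lists):
    """Interleave ranked lists round-robin, keeping the first copy of each URL."""
    seen = set()
    merged = []
    longest = max((len(r) for r in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):
                key = normalize(results[rank])
                if key not in seen:
                    seen.add(key)
                    merged.append(results[rank])
    return merged


# Hypothetical result lists from two engines for the same query.
engine_a = ["https://www.example.org/calandria/", "https://example.com/reactors"]
engine_b = ["http://example.org/calandria", "https://example.net/tubes"]
merged = merge_results([engine_a, engine_b])
print(merged)
```

The code is the easy part. Examining the merged links is the part that fewer than three percent of users will do.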

Third, Google has operated without meaningful regulation, oversight, or competition for decades. The vaunted ad-revenue engine was not a Google invention. Google took advantage of a particular point in time when searching the Web was gaining traction and useful competitors like AltaVista, Exalead, and Fast Search’s AllTheWeb service were distracted. Google sucked up some AltaVista folks; Exalead was decidedly French; and Fast Search chased the enterprise. Other actions transpired, but the result was that the Google used free to get traffic and traffic made the Yahoo, Overture, GoTo revenue model work like a champ. Remember this was decades ago, not yesterday.

Here’s what I think is going on:

  1. Pundits don’t know or care much about Okeano, Swisscows or other “free” online search systems. How about searching for those Instagram snaps with Picuki?
  2. Niche search engines are thriving; for example, some of the Israeli specialized software and services firms provide quite helpful access to Facebook content. Who knows? Not too many pundits on the Tweeter and certainly not Google’s PR experts.
  3. Google is not a search engine. Google is a global content system, a fact I explored in my Google: The Digital Gutenberg, originally a long white paper for a government customer who found my view of the world interesting. BearStearns published a report in 2007 which featured my diagram of the Google “octopus” which identified the digital fabric that the company was weaving. Now Google owns the sheep, the dyes, the weaving machines, and the concept of digital fabrics. The overall quality of the Google outputs is “good enough,” and, believe me, it is tough to knock off a global outfit which satisfies the big hump in the standard distribution with something “better.” Whatever “better” means.

Net net: Search is a very, very fuzzy word. At one end of the spectrum are those who are searching well because they can locate an Uber-type service. At the other end of the spectrum are those who deal in extremely rarified content disciplines and have quite good services available; for example, Daylight chemical informatics.

In the middle? A long-standing, persistent and fundamental disconnect between search and what is actually going on in the datasphere.

Pizza? Google’s got that nailed. Need information to fabricate calandria (nuclear terminology)? Google can’t help too much because who searches for calandria, buys ads related to calandria, or knows anything about calandria?

Stephen E Arnold, January 4, 2022

Why Search Is Hard and Quick and Dirty Good Enough Methods Are Train Wrecks

December 15, 2021

I recommend to anyone interested in search and smart software the article “The Business of Extracting Knowledge from Academic Publications.” I am not going to summarize it, nor am I going to discuss why modern search systems are racing toward a collision with useful information retrieval. There was one omission from the essay, and I want to highlight it. I am not critical of this write up. I want to make clear that there is another dimension to scientific, technical, and medical publishing that is often overlooked. I learned this when we created the Pharmaceutical News Index decades ago.

Here’s the omission:

Wizards in technical fields work overtime to obfuscate some of their systems, methods, insights, and findings. The reason is that wizards want to remain wizards and have an ace up their sleeve if one is required to win a poker game for tenure, an over achieving graduate assistant, or some legal eagle involved in a patent dispute. Other reasons for withholding, distorting, and shaping information are related to insecurity. Yep, wizards are wizards in order to have a way to build a defense against those who don’t know what they don’t know and think that what they know defines knowledge.

When it comes to search and retrieval, key words are okay but not perfect. Index terms (what GenXers call tags) are helpful. But the substance of STM content does not yield insights, inventions, or any of the other “knowledge gems” that those pitching smart software believe will spill forth in a results list or a visualization.

What does the information in the article imply for smart software? My answer is, “Misleading or incorrect answers to certain types of inquiries.”

Don’t believe me? That’s okay. Just wait. STM content is “easier” to index than general business writing which is much easier to tag than the excrescences on TikTok, Twitch, or (heaven help me), Twitter.

Stephen E Arnold, December 15, 2021

The Coveo IPO: Making Some Headway

December 9, 2021

A number of Canadian tech companies have recently gone public on the Toronto Stock Exchange only to be met with muted responses. One was enterprise search firm Coveo, which went public in November in order to position itself globally, attract talent, and fund future acquisitions. CEO Louis Têtu appears unconcerned about the apparent indifference to his and other companies’ fledgling stock, The Globe and Mail reports in its piece, “Coveo CEO Dismisses Soft Trading Start on TSX as Quebec Software Company Closes $215-Million IPO.” Writer Sean Silcoff tells us:

“Coveo received more than $1-billion in orders for its IPO… . The stock hit $18 on its first day of trading last Thursday, but has since retreated, briefly trading below the issue price Tuesday. That makes it the fourth new tech issue this autumn – following D2L Corp., Q4 Inc. and E Automotive Inc. – to trade below its issue price. Coveo stock closed Wednesday at $15.30, up 1.7 per cent. Mr. Têtu dismissed Coveo’s ho-hum start as a public company, noting the share price of New York Stock Exchange-listed rival Elastic NV had dropped by 15 per cent over the previous four sessions. ‘There is a set of market dynamics we don’t control; the tide raises and lowers all boats,’ he said. ‘I think the jury is going to be out until the first earnings call [as a public company] and the subsequent earnings call. I think anybody who understands the stock market and IPOs … wouldn’t draw conclusions’ from the stock’s early performance. Coveo became the 20th Canadian tech IPO on the TSX to raise $50-million or more since July, 2020. By contrast, there were 12 such IPOs in the 11 years ended December, 2019.”

I suppose that is a good point—progress is progress, even if it is not at light speed. The write-up [paywalled] includes a few more details about Coveo’s growth and profits. Since its founding in 2005, the company has acquired two AI-powered e-commerce firms: Tooso in 2019 and Qubit in 2021. It sounds like Coveo may have some more companies already in its sights.

The good news is that the stock on December 8, 2021, was trending up. Search and retrieval is a tough business. Just ask the former CEOs of Autonomy and Fast Search & Transfer or take a look at the dust up between Amazon and Elastic. Worth monitoring. Maybe take a stake?

Cynthia Murrell December 10, 2021

What Company Is the Leader in Search Powered by Artificial Intelligence? One Answer May Surprise You. It Did Me.

November 30, 2021

Give up? The answer is Lucidworks, “the leader in AI-powered search.” You can get the full story from Unite.ai and the article “Will Hayes, CEO of Lucidworks – Interview Series.” What’s “AI”? I don’t know, and the answer is not provided in @IAmWillHayes’ comments. What’s “search”? I don’t know because no specific definition is provided. (Search is a blanket word, covering everything from the open source Lucene in policeware solutions to whiz-bang, patented real time methods for time series data from Trendalyze. And we must not forget the generous offerings of “search” for eDiscovery, product supplier data, chemical structures, streaming video files, code libraries, and mysterious content like the interesting information in encrypted Signal and Telegram interactions.) Search at Lucidworks is different, it seems.

I noted this statement:

Lucidworks takes mission-critical business problems and solves them with search.

I assume that Lucidworks is disconnected from Dassault Systèmes search based applications approach. There is a 2011 book titled “Search Based Applications: At the Confluence of Search and Database Technologies.” The author is Dr. Gregory Grefenstette with assistance from Laura Wilber. The Lucidworks’ assertion struck me as one more example of marketing hoo hah disconnected from what came before. At least, the Dassault technology was original, not a recycling of open source software.

Here’s another statement offered as an original insight:

Lucidworks offers products and applications for commerce, customer service, and the workplace that use AI and machine learning to solve search. Fusion, our flagship product, uses AI extensively through every stage of enriching data—during ingest and at query time, for understanding user intent, and personalizing results that match that intent.

I want to point out that the Paris-based firm Polyspot used almost the exact same language (both French and English) to describe the company’s approach to information access. Here’s what Bloomberg says about the now repositioned company:

PolySpot SAS develops and publishes enterprise software. The Company’s products offer search and information access solutions designed to improve business and ensure that companies can access the data they need, regardless of their structure, format or origin. PolySpot markets its products internationally.

Did Yogi Berra or Yogi Bear say: “It’s déjà vu all over again.” I go with the cartoon bear. The aphorism applies to Lucidworks in my opinion.

Lucidworks also does chatbots, fits into the connected experience cloud (CXC), and compounds “value.” Okay. The company, according to @IAmWillHayes, is “leader in next-generation search solutions and we have an exciting roadmap of cloud products coming in the near future.”

I wonder what outfits like Algolia, Coveo, Sphinx Search, and even the heroic X1 think about this assertion. What will Google’s revolving door search experts make of Lucidworks’ bold assertion? What about the crafty laborers in AWS search vineyards who watch the competitors gun for the Bezos bulldozer? What about the innovators working on the somewhat frightening IBM search solution? Maybe Microsoft will just pull a “Fast Search” and buy Lucidworks to beef up its incredible array of finding systems?

My hunch is that Lucidworks has to deal with its backers who want their money back plus some upside. Mix in the harsh market realities of many options, some free or low cost, and others bundled with purpose built solutions like Voyager Labs’ software and what do you get?

I am not sure about your answer. My answer is, “Recycling marketing lingo, ideas, and assertions which are decades old?” Will AI, machine learning, and CXC pull a rabbit from the search magician’s hat?

Maybe. But the investors who have injected more than $200 million into the company may want more than a magic show. And what is “search” and “AI” anyway? Solr with a new outfit from Amazon?

Stephen E Arnold, November 30, 2021

Ask Jeeves Has a Younger Cousin, Ask Jarvis

November 25, 2021

Ask Jeeves.com was a “smart” online search engine. The name lives on in Ask.com. Who remembers? No one. No matter. The younger cousin is now available. Ask Jarvis is “an AI code assistant developed by Assistiv.ai.” The idea is that a hard working developer handling a full time job via Zoom and working on numerous side gigs needs help. Just ask Jarvis when you need a programming tip or a chunk of a manpage. You can find the Web page at https://askjarvis.io. Is it the rule based wonder of the original smart Ask Jeeves.com? Nope, this is an artificial intelligence / machine learning 2021 search system with natural language “powered by OpenAI codex, a descendant of GPT-3.” Years ago this would have been labeled a vertical search engine. Today? I am not sure.

Stephen E Arnold, November 25, 2021

Battle of the Experts? Snowden Versus Sullivan, Wowza

November 19, 2021

This is a hoot: “Edward Snowden Dunks on Search Gurus in Hilarious Twitter Clapback.” Mr. Snowden is an individual who signed a secrecy agreement and elected to ignore it. Mr. Sullivan is a search engine optimization journalist, who is now laboring in the vineyards of Google.

The write up makes clear that Mr. Snowden finds the Google Web search experience problematic. (I wanted to write lousy, but I wish to maintain some level of polite discourse.)

Mr. Sullivan points out that Mr. Snowden was talking about “site search.” For those not privy to Google Dorks, a site search requires the name of a site like doe.gov preceded by the Google operator “site:”. At least, that’s the theory.
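For the curious, a minimal sketch of how such a query is assembled. The search term is a placeholder of my choosing, and the URL is simply the ordinary Google search address with the query percent-encoded:

```python
from urllib.parse import urlencode


def site_query(site, terms):
    """Build the query text for a site-restricted search."""
    return f"site:{site} {terms}".strip()


def google_search_url(site, terms):
    """Return a search URL with the query string percent-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": site_query(site, terms)})


# Hypothetical search term; doe.gov is the site named in the text.
query = site_query("doe.gov", "calandria")
url = google_search_url("doe.gov", "calandria")
print(query)
print(url)
```

Whether the results are limited to what the site actually holds is another matter.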

The write up concludes with a reference to search engine optimization or SEO. That’s Mr. Sullivan’s core competency. Mr. Snowden’s response is not in the article or it could be snagged in the services monitored by the Federal service for supervision of Communications, Information Technology and Mass Media (Roskomnadzor) in everyone’s favorite satellite destroying country.

Quite a battle. The Snowden Sullivan slugfest. No, I think this is emblematic of what has happened to those who ignore secrecy agreements and individuals who have worked hard to make relevance secondary to Google pay to play business processes.

Stephen E Arnold, November 19, 2021

You: Just Bake in Search

November 17, 2021

Google has a new rival, a search engine built with developers in mind: You.com. The platform, now in beta, uses AI to summarize information while supplying links. It also promises never to track queries, sell user data, or push targeted advertising. A couple of test searches reveal results neatly tailored to the subject. My first two searches produced Wikipedia articles at the top, followed by general Web results, then topic-specific selections (News, Music, Shopping, etc.), a customized “quick facts” section, and more. When I typed in “pecan pie,” it was smart enough to lead with recipes.

Though the page itself does not emphasize the creator’s focus on developers, he discusses it on the Y Combinator post, “You.com, Private Search Engine that Summarizes the Web—Built for Devs.” He announces:

“My name is Richard Socher, and I’m the founder of you.com, the world’s first open search engine platform that summarizes the web for you. We launched our public beta today, and are excited to share it with you. If you’re a developer, we have several ‘search-apps’ such as StackOverflow (with code snippets), W3Schools, MDN, Copilot-like Code Completion, json checkers, and more. All of them geared to help you code faster. Let us know if you have other app ideas for how to make your coding life better. … We wanted to create a search engine that delivers relevant content, not ads or SEO’d pages, and do it in a whole new interface that puts you in control through personalized preferences.”

We learn more from an article at Venture Beat, “AI-Driven Search Engine You.com Takes on Google with $20M.” Writer Kyle Wiggers reveals that substantial funding is led by Salesforce CEO Marc Benioff. His publication asked Socher about his inspiration for the platform:

“As the economy moves online, it’s You.com’s assertion that the internet is becoming more centralized and controlled by a few powerful, ill-meaning tech corporations. … ‘I had the original idea [for You.com] eight and a half years ago,’ Socher told VentureBeat via email. ‘Today, there’s too much information, and no one has time to read it, process it, or know what to trust. [A] single gatekeeper controls the vast majority of the search market, dictating what you see: too many advertisements and a flood of search-engine-optimized pages … On top of that, 65% of search queries end without a click on another site, which means traffic stays within the Google ecosystem.’”

That is a good point. See the Venture Beat article for details on how Socher uses AI to underpin You’s search, the site’s approaches to customization and privacy, and a comparison to its rivals.

Cynthia Murrell November 17, 2021

Elastic Adds Optimyze for Best Cloud Optimization

November 4, 2021

Elastic specializes in enterprise and cloud search solutions, but the company has also branched out by assisting systems in gaining big data insights. Help Net Security details Elastic’s newest move in this area: “Elastic Acquires Optimyze To Deliver Visibility Into Cloud Native Environments.” Optimyze provides a simpler way for users to gain insights from their entire IT ecosystem, eliminate blind spots with Prodfiler, generate continuous system profiling, and keep performance overhead low.

Elastic also recently acquired Cmd and build.security. Combined with these other acquisitions, Optimyze will enable Elastic users to monitor and protect data from the unified Elastic Search Platform:

“Optimyze provides frictionless continuous profiling, while the Elastic Search Platform delivers analytics and machine learning capabilities with the ability to correlate and contextualize profiling data with metrics, logs, and traces. The ability to unify the three pillars of observability—metrics, logs and traces—with emerging continuous profiling capabilities delivers actionable insights to customers, leading to improvements in service quality and performance while reducing MTTD (mean-time-to-detect) and MTTR (mean-time-to-resolution).”

Elastic takes the idea of search to a different level. Instead of only concentrating on finding user generated data, Elastic observes, secures, tracks, and locates all kinds of data related to a system’s performance. Does this change the definition of enterprise and cloud search altogether?

Whitney Grace, November 4, 2021
