Wonderful Statement about Baked In Search Bias

October 12, 2022

I was scanning the comments on the Hacker News post for this article: “Google’s Million’s of Search Results Are Not Being Served in the Later Pages Search Results.”

Sailfast made this comment:

Yeah – as someone that has run production search clusters before on technologies like Elastic / open search, deep pagination is rarely used and an extremely annoying edge case that takes your cluster memory to zero. I found it best to optimize for whatever is a reasonable but useful for users while also preventing any really seriously resource intensive but low value queries (mostly bots / folks trying to mess with your site) to some number that will work with your server main node memory limits.

The comment outlines a facet of search which is not often discussed.

First, the search plumbing imposes certain constraints. The idea of “all” information is one that many carry around like a trusted portmanteau. What are the constraints of the actual search system available or in use?
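The deep-pagination constraint Sailfast describes can be sketched with a toy cost model. To be clear, this is a hypothetical illustration with made-up function names, not actual Elasticsearch internals; the assumption is that offset-based paging forces every shard to materialize and sort the top `from + size` hits, which the coordinating node then merges:

```python
# Toy model of offset-based deep pagination in a sharded search cluster.
# Hypothetical sketch, not the actual Elasticsearch implementation.

def deep_page_cost(num_shards: int, page: int, page_size: int) -> int:
    """Candidates the coordinating node must merge for offset-based paging."""
    offset = page * page_size
    return num_shards * (offset + page_size)

def search_after_cost(num_shards: int, page_size: int) -> int:
    """With a cursor (search_after-style) approach, every page costs the same."""
    return num_shards * page_size

# Page 1 vs. page 1,000 of ten results on a five-shard index:
print(deep_page_cost(5, 0, 10))      # 50 candidates merged
print(deep_page_cost(5, 999, 10))    # 50,000 candidates merged
print(search_after_cost(5, 10))      # 50, no matter how deep the user goes
```

The memory blow-up Sailfast mentions follows directly: the offset cost grows linearly with page depth, which is why production clusters cap `from + size` and push deep scrolling onto cursors.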

Second, optimization is a fancy word that translates to one or more engineers deciding what to do; for example, change a Bayesian prior assumption, trim content based on server latency, filter results by domain, etc.

Third, manipulation of the search system itself by software scripts or “bots” forces engineers to figure out which signals are okay and which are not. It is possible to inject poisoned numerical strings or phrases into a content stream and manipulate the search system. (Hey, thank you, search engine optimization researchers and information warfare professionals. Great work.)

When I meet a younger person who says, “I am a search expert”, I just shake my head. Even open source intelligence experts display that they live in a cloud of unknowing about search. Most of these professionals are unaware that their “research” comes from Google search and maps.

Net net: Search and retrieval systems manifest bias, from the engineers, from the content itself, from the algorithms, and from user interfaces themselves. That’s why I say in my lectures, “Life is easier if one just believes everything one encounters online.” Thinking in a different way is difficult, requires specialist knowledge, and a willingness to verify… everything.

Stephen E Arnold, October 12, 2022

Elastic: Bouncing Along

October 12, 2022

It seems like open-source search is under pressure. We learn from SiliconAngle that “Elastic Delivers Strong Revenue Growth and Beats Expectations, but Its Stock is Down.” For anyone unfamiliar with Elastic, writer Mike Wheatley describes the company’s integral relationship with open-source software:

“The company sells a commercial version of the popular open-source Elasticsearch platform. Elasticsearch is used by enterprises to store, search and analyze massive volumes of structured and unstructured data. It allows them to do this very quickly, in close to real time. The platform serves as the underlying engine for millions of applications that have complex search features and requirements. In addition to Elasticsearch, Elastic also sells application observability tools that help companies to track network performance, as well as threat detection software.”

Could it be that recent concerns about open-source security issues are more important to investors than fiscal success? The write-up shares some details from the company’s press release:

“The company reported a loss before certain costs such as stock compensation of 15 cents per share, coming in ahead of Wall Street analysts’ consensus estimate of a 17-cent-per-share loss. Meanwhile, Elastic’s revenue grew by 30% year-over-year, to $250.1 million, beating the consensus estimate of $246.2 million. On a constant currency basis, Elastic’s revenue rose 34%. Altogether, Elastic posted a net loss of $69.6 million, more than double the $34.4 million loss it reported in the year-ago period.”

Elastic emphatically accentuates the positive—like the dramatic growth of its cloud-based business and its flourishing subscription base. See the source article or the press release for more details. We are curious to see whether the company’s new chief product officer Ken Exner can find a way to circumvent open-source’s inherent weaknesses. Exner used to work at Amazon overseeing AWS Developer Tools. Founded in 2012, Elastic is based in Mountain View, California.

Cynthia Murrell, October 12, 2022

Waking Up to a Basic Fact of Online: Search and Retrieval Is Terrible

October 10, 2022

I read “Why Search Sucks.” The metadata for the article is, and I quote:

search-web-email-google-streaming-online-shopping-broken-2022-4

I spotted the article in a newsfeed, and I noticed it was published in April 2022 maybe? Who knows. Running a query on Bing, Google, and Yandex for “Insider why search sucks” yielded links to the original paywalled story. The search worked. The reason has more to do with search engine optimization, Google prioritization of search-related information, and the Sillycon Valley source.

Why was there no “$” to indicate a paywall? Why was the date of publication not spelled out in the results? I have no idea. Why did one result identify Savanna Durr as the author when the article itself said Adam Rogers was the author?

So why, for this one query and for billions of users, do free, ad-supported Web search engines work so darned well? Free and good enough are the reasons I mention. (Would you believe that some Web search engines have a list of “popular” queries, bots that look at Google results, and workers who tweak the non-Google systems to behave sort of like Google? No? Hey, that’s okay with me.)

The cited article “Why Search Sucks” takes the position that search and retrieval is terrible. Believe me. The idea is not a new one. I have been writing about information access for decades. You can check out some of this work on the Information Today Web site or in the assorted monographs about search that I have written. A good example is the three editions of the “Enterprise Search Report.” I have been consistent in my criticism of search. Frankly not much has changed since the days of STAIRS III and the Smart System. Over the decades, bells and whistles have been added, but to find what one wants online requires consistent indexing, individuals familiar with sources and their provenance, systems which allow the user to formulate a precise query, and online systems which do not fiddle the results. None of these characteristics is common today unless you delve into chemical structure search and even that is under siege.

The author of the “Why Search Sucks” article focuses on some use cases. These are:

  • Email search
  • Social media search (Yep, the Zuckbook properties and the soon-to-be-a-Tesla fail whale)
  • Product search (Hello, Amazon, are you there?)
  • Streaming search

The write up provides the author’s or authors’ musings about Google and those who search. The comments are interesting, but none moves the needle.

Stepping back from the write up, I formulated several observations about the write up and the handling of search and its suckiness.

First, search is not a single thing. Specific information retrieval systems and methods are needed for certain topics and specific types of content. I referenced chemical structures intentionally because the retrieval systems must accept visual input, numerical input, words, and controlled term names. A quite specific search architecture and user training are required to make certain queries return useful results. Give Inconel a whirl if you have access to a structured search system. The idea that there is a “universal search” is marketing and just simple-minded. Believe it or not, one of today’s Googlers complained vociferously on a conference call with a major investment bank about my characterization of Google and the then almost useless Yahoo search.

Second, the pursuit of “good enough” is endemic among researchers and engineers in academic institutions and search-centric vendors. Good enough means that the limits of user capability, system capacity, budget, and time are balanced. Why not fudge how many relevant results exist for a user looking for a way to convert a link into a dot point on a slide in a super smart and busy executive’s PowerPoint for a luncheon talk tomorrow? Trying to deliver something that works and meets measurable standards of precision and recall is laughable to some in the information retrieval “space” today.
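Those “measurable standards of precision and recall” are not hand-waving; they are two one-line ratios. Here is a minimal sketch using hypothetical document IDs:

```python
# Precision and recall: the classic measurable standards a "good enough"
# system quietly abandons. Document IDs below are hypothetical.

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant documents the system actually retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

relevant = {"d1", "d2", "d3", "d4"}          # what the user needed
retrieved = {"d1", "d2", "d7", "d8", "d9"}   # what the engine returned

print(precision(retrieved, relevant))  # 0.4
print(recall(retrieved, relevant))     # 0.5
```

A system that fudges the result count can look complete while scoring poorly on both ratios, which is exactly why vendors prefer not to report them.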

Third, the hope that “search startups” will deliver non-sucking search is amusing. Smart people have been trying to develop software which delivers on-point results with near real-time information for more than 50 years. The cost and engineering required to implement this type of system are enormous, and the handful of organizations capable of putting up the money, assembling the technical team, and getting the plumbing working is shrinking. Start ups. Baloney.

Net net: I find it interesting that more articles express dismay and surprise that today’s search and retrieval systems suck. After more than half a century of effort, that’s where we are. Fascinating it is that so many self-proclaimed search experts are realizing that their self positioning might be off by a country mile.

Stephen E Arnold, October 10, 2022

Looria: Promising Content Processing Method Applied to a Reddit Corpus

September 14, 2022

I have seen a number of me-too product search systems. I burned out on product search after a demonstration of the Endeca wine selector and the SLI Systems’ product search. I thought Google’s Froogle had promise; the GOOG’s Catalog Search was interesting but — well — the PDF thing. There was a flirting with other systems, including the Amazon product search. (Someone told me that this service is A9. Yeah, that’s super, but just buy ads and find something vaguely related to what one wants. The margins on ads are slightly better than Kroger’s selling somewhat bland cookies for $4.99 when Walgreen’s (stocked by Kroger) sells the same cookie for $1.00. Nice, right?)

I want to point you to Looria, which provides what appears to be a free and maybe demonstration of its technology. The system ingests some Reddit content. The content is parsed, processed, and presented in an interface which combines some Endeca-like categories, text extraction, some analytics, and stuff like a statement about whether a Reddit comment is positive or negative.

There are about a dozen categories in this system (checked today, September 9, 2022). Categories include Pets, Travel, and other “popular” things about which to comment on Reddit without straying into perilous waters or portals of fascination for teenaged youth.

This is worth checking out. The Looria approach has a number of non Reddit use cases. This service looks quite interesting.

Stephen E Arnold, September 14, 2022

A Semantic Search Use Case: But What about General Business Content with Words and Charts?

September 9, 2022

I am okay with semantic search. The idea is that a relatively standard suite of mathematical procedures delivers “close enough for horse shoes” matches germane to a user’s query. Elastic is now combining key word with some semantic goodness. The idea is that mixing methods delivers more useful results. Is this an accurate statement?

The answer is, “It depends on the use cases.”

“How Semantic Search Improves Search Accuracy” explains a use case that is anchored in a technical corpus. Now I don’t want to get crossways with a group of search experts. I would submit that, in general, the vocabulary for scientific, medical, and technical information is more constrained. One does not expect to find “cheugy” or OG* in a write up about octonitrocubane.

In my limited experience, what happens is that a constrained corpus allows the developer of a finding system to use precise taxonomies, and some dinobabies may employ controlled vocabularies like those kicking around old-school commercial databases.

However, what happens when the finding system ingests a range of content objects from tweets, online news services, and TikTok-type content?

The write up says:

One particular advantage of semantic search is the resolution of ambiguous terminology and that all specific subtypes (“children”) of a technical term will be found without the need to mention them in the query explicitly.

Sounds good, particularly for scientific and technical content. What about those pesky charts and graphs? These are often useful, but many times are chock full of fudged data. What about the query, “Octonitrocubane invalid data”? I want to have the search system present links to content which may be in an article. Why? I want to make sure the alleged data set squares with my limited knowledge of statistical principles. Yeah, sorry.

The write up asserts:

A lexical search will deliver back all documents in which “pesticides” is mentioned as the text string “pesticides” plus variants thereof. A semantic search will, in addition to all documents containing the text string “pesticides”, also return documents that contain specific pesticides like bixafen, boscalid, or imazamox.
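The expansion the write-up describes can be sketched in a few lines: a semantic system widens the query term to its narrower (“children”) terms before matching. The taxonomy fragment and document snippets below are hypothetical, and string containment stands in for a real matcher:

```python
# Sketch of lexical vs. semantic matching via taxonomy expansion.
# Hypothetical taxonomy fragment and toy documents for illustration only.

TAXONOMY = {
    "pesticides": ["bixafen", "boscalid", "imazamox"],
}

DOCS = {
    "doc1": "new pesticides regulation announced",
    "doc2": "field trial results for boscalid",
    "doc3": "imazamox residue levels in soil",
    "doc4": "quarterly earnings call transcript",
}

def lexical_search(query: str) -> list:
    """Match only the literal query string."""
    return sorted(d for d, text in DOCS.items() if query in text)

def semantic_search(query: str) -> list:
    """Expand the query with its narrower terms, then match any of them."""
    terms = [query] + TAXONOMY.get(query, [])
    return sorted(d for d, text in DOCS.items()
                  if any(t in text for t in terms))

print(lexical_search("pesticides"))   # ['doc1']
print(semantic_search("pesticides"))  # ['doc1', 'doc2', 'doc3']
```

The catch, of course, is the `TAXONOMY` dictionary: someone has to build and maintain that hierarchy, which is exactly the controlled-vocabulary work the dinobabies did for the old-school commercial databases.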

What about a chemical structure search? I want a document with structure information. Few words, just nifty structures like the stuff inorganic and organic chemists inhale each day. Sorry about that.

Net net: Writing about search is tough when the specific corpus, the content objects, and the presence of controlled terms in addition to strings in a content object are not spelled out. Without this information, the assertions are a bit fluffy.

And the video thing? The DoD, NIST, and other outfits are making videos. Things that go boom are based on chemistry. Can semantic search find the videos and the results of tests?

Yeah, sure. The PowerPoint deck probably says so. Hands on search experience may not. Search-enabled applications may work better than plain old search jazzed up with close enough for horse shoes methods.

Stephen E Arnold, September 9, 2022

[* OG means original gangster]

Here We Go Again: Google Claims To Improve Search Results

August 31, 2022

Google has been blamed for biased search results for years. Users claim that Google pushes paid links to the top of search results without identifying them. Organic search results are consigned to the second and third pages. Despite having a monopoly on search and other parts of the tech sector, Google does deliver decent services and products. To maintain its market dominance, Google must continue offering good services. Engadget explores how “Google’s Search AI Now Looks For General Consensus To Highlight More Trustworthy Results.”

Google wants its “search snippets,” the “blocks of text that appear at the top of search results to answer questions,” to be more accurate. Google designed the Multitask Unified Model AI to search for consensus when selecting a snippet. The AI checks snippets against verified resources to determine a consensus of information. Some queries, such as those based on false premises, should not have snippets, so Google’s AI reduces those by 40%.
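The consensus idea itself is simple to sketch. This is a toy majority-vote illustration of the concept, not Google’s actual MUM model; the 60% threshold is an assumption chosen for the example:

```python
# Toy sketch of consensus-based snippet selection: only surface an answer
# when a clear majority of independent sources agree. Not Google's MUM;
# the 0.6 threshold is a made-up parameter for illustration.
from collections import Counter

def consensus_answer(source_answers, threshold=0.6):
    """Return the majority answer, or None if no answer clears the bar."""
    if not source_answers:
        return None
    answer, count = Counter(source_answers).most_common(1)[0]
    return answer if count / len(source_answers) >= threshold else None

print(consensus_answer(["1969", "1969", "1969", "1968"]))  # 1969
print(consensus_answer(["yes", "no", "maybe"]))            # None (no consensus)
```

The false-premise case falls out naturally: when sources disagree or no sources exist, the function returns nothing, which maps to Google suppressing the snippet.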

Also, Google is showing more citations:

“Google is now also making its “About this result” tool more accessible. That’s the panel that pops up when you click on the three dots next to a result, showing you details about the source website before you even visit. Starting later this year, it will be available in eight more languages, including Portuguese, French, Italian, German, Dutch, Spanish, Japanese, and Indonesian. It’s adding more information to the tool starting this week, as well, including how widely a publication is circulated, online reviews about a company, or whether a company is owned by another entity. They’re all pieces of information that could help you decide whether a particular source is trustworthy.”

Google search results that have limited returns or lack verified sources will contain content advisories encouraging users to conduct further research.

It is great that Google is turning itself into an academic database. Now if only it would do the same for Google Scholar.

Whitney Grace, August 31, 2022

OpenText: Goodwill Search

August 26, 2022

I spotted a short item in the weird orange newspaper called “Micro Focus Shares Jump After Takeover Bid from Canadian Rival.” (This short news item resides behind a paywall. Can’t locate it? Yeah, that’s a problem for some folks.)

What? Micro Focus and Open Text are rivals? Interesting.

The key sentence is, in my opinion: “OpenText agreed to buy its UK rival in an all-cash deal that values the software developer at £5.1bn.”

Does Open Text have other search and retrieval properties? Yep.

Will Open Text become the big dog in enterprise search? Maybe. The persistent issue is the presence of Elasticsearch, which many developers of search-based applications find to be better, faster, and cheaper than many commercial offerings. (“Is BRS search user friendly and cheaper?”, ask I. The answer from my viewshed is ho ho ho.)

I want to pay attention going forward to this acquisition. I am curious about the answers to these questions:

  • How will the math work out? It was a cash deal and there is the cost of sales and support to evaluate.
  • Will the Micro Focus customers become really happy campers? It is possible there are some issues with the Micro Focus software.
  • How will Open Text support what appear to be competing options; for example, many of Open Text’s software systems strike me as duplicative. Perhaps centralizing technical development and providing an upgraded customer service solution using the company’s own software will reduce costs.

Notice I did not once mention Autonomy, Recommind, Fulcrum, or Tuxedo. (Reuters mentioned that Micro Focus was haunted by Autonomy’s ghost. Not me. No, no, no.)

Stephen E Arnold, August 26, 2022

Google: Redefines Quality. And What about Ads?

August 23, 2022

When I was working on The Google Legacy (Infonortics, 2004), I gathered information about Google’s method for determining quality. Prior to 2006, Google defined “quality” in a way different from the approach taken at professional indexing and commercial database companies. Professional organizations relied on subject matter experts’ views. Some firms — for example, the Courier Journal & Louisville Times, Predicasts, Engineering Index, the American Petroleum Institute, among others — were old fashioned. Commercial database firms with positive cash flows would hire specialists to provide ideas and suggestions for improving content selection and indexing. At the Courier Journal, we relied on Betty Eddison and a number of other professionals. We also hired honest-to-goodness people with advanced degrees to work on the content we produced.

Google pops up with jibber jabber about voting, a concept floated by an IBM Almaden researcher, and the notion of links and their value. As Google evolved, I collected a list of what amounted to 140 or so factors which were used by Google to determine the quality of content. At one time, Dr. Liz Liddy used my compilation as illustrative material for her classes in information science.

By 2006, Google shifted quality from its mysterious and somewhat orthogonal factors to what I call “ad quality.” The concept gained steam when Google acquired Applied Semantics and worked hard to relax a user’s query, match the query to a stack of ads to which the query would relate, and display these as “personalized” and targeted messages. Quality, therefore, became an automated process for working through ad revenue.

Since 2006, Google has been focused on ad revenue. My personal view is that Google has one stream of revenue: Ad revenue. Its other ventures have not demonstrated to me that the company can match its first “me too” innovation. If you don’t remember what that was, think about the Yahoo settlement related to the “inspiration” Google obtained from the GoTo.com and Overture “pay to play” system. The idea was that those with Web pages would pay to get their message in front of a service’s users.

Where is Google quality now? Is it anchored in editorial policies, old fashioned ideas like precision and recall? Is the Google using controlled vocabulary lists designed to allow precise queries? Is Google adding classification codes to disambiguate terms like terminal as in “computer terminal” or “airport terminal”?

“Google’s Planned Search Changes Could Upend the Internet” reveals:

Google is trying to improve the quality of search results and reduce the number of misleading sites, misinformation, and clickbait users are subjected to.

I want to point out that the lack of precision and recall in Google’s approach is the firm’s notion that new Web sites are more important than older Web sites, traffic is more important than factual accuracy, and ad revenue goals are the strong force in the Google datasphere.

Thus, after a certain outfit headed by a search engine optimization zealot advanced the SEO “revolution,” the Google is, according to the article:

As part of the change, the company will roll out its “helpful content update” to identify content that is primarily written to rank well in search engines and lower its rank. Sullivan says the update seems to especially benefit searches related to tech, online education, shopping, arts, and entertainment. The company is also working to improve access to high-quality reviews, ones that provide helpful, in-depth information.

Does this suggest that Google will focus on high-value content, explicit editorial policies, and professional indexing by subject matter experts?

Nope.

It means quicker depletion of the ad inventory and an effort to cope with the fact that those in middle school and high school use TikTok for information.

Google is officially a dinobaby just one not very good at anything other than selling ads and steering its coal fired steam boat away from the rapids in today’s data flows. For serious information research Google is too consumer oriented. Search based applications are what some researchers prefer. The content in these systems comes from specialized crawls and collections.

The quality list? Old fashioned and antiquated. How much of Google fits in that category? SAIL on, steam boat. Chug chug chug. PR PR PR. Toot toot.

But what about traffic to sites affected by Google’s content rigor?

Just buy ads, of course.

Stephen E Arnold, August 23, 2022

Can Ducks Crawfish? DuckDuckGo Gives Reverse a Go

August 19, 2022

I read “DuckDuckGo removes Carve Out for Microsoft Tracking Scripts after Securing Policy Change.” I learned:

A few months on from a tracking controversy hitting privacy-centric search veteran, DuckDuckGo, the company has announced it’s been able to amend terms with Microsoft, its search syndication partner, that had previously meant its mobile browsers and browser extensions were prevented from blocking advertising requests made by Microsoft scripts on third party sites.

The write up contains Silicon Valley-type talk about how its bold action and deep thinking sparked the backwards duck walk.

I am not sure if ducks can walk backward. In fact, after a security company assured some folks that privacy was number one and then was outed as a warm snuggler of tracking, will I trust the Duck metasearch thing?

The answer is the same for any online service with log files: Nope.

Oh, for the record, some ducks can waddle backwards for a couple of steps and then they try to walk, hop, or swim forward. The backwards thing is an anomaly. Perhaps you have seen a duck do a bit of nifty backwards walking? I have but it was laughable. Some of my test queries on the Duck have been almost as amusing.

Stephen E Arnold, August 19, 2022

YouTube: Some Proof about Unfindable Content

August 17, 2022

I read “5 Sites to Discover the Best YouTube Channels and Creators Recommended for You.” The write up presents five services which make YouTube content “findable.” What I learned from the article is that YouTube videos are, for the most part, unfindable. A YouTuber can stumble upon a particular video and rely on Google’s unusual recommendation system. In my experience, that system is hobbled by its assorted filters and ad-magnetic methods. If I want to locate a video by eSysman (a fellow who reports about big money yachts loved by some money launderers and oligarchs), Google refers me to NautiStyles, YachtsForSale (quite a sales person is visible on that channel), or the flavor of the day like Bering Yachts. eSysman is the inspiration for one former CIA professional’s edging into the value of open source intelligence. Does Google’s algorithm “sense” this? Nah, not a clue. What if I want some downhome cookin’ with Cowboy Kent, the chuck wagon totin’, trail hand feedin’ Oklahoma chef? Sorry, promoted Italian chefs are not what I was looking for. Cowboy cookin’ is not Italian restaurateurs showing that their skills are sharper than fry cooks’ in French restaurants. But what about YouTube search? Yes, isn’t it fantastic? Enough said.

What about the services identified in the article? Each offers different ways to find a video or channel on a specific or semi-specific topic. You can navigate to the source document and work your way through the list of curated “finder” sites.

The write up points out:

YouTube has over 50 million channels, but as you might have guessed, most of them aren’t worth subscribing to.

That’s the type of “oh, well, don’t worry statement” that drives me bonkers. Just let someone tell you what’s good. Go with it. Hey, no problemo. Who wants to consider the implications of hours of video uploaded every minute or the fact that there are 50 million channels from the Googlers’ service.

Several observations:

  1. No one knows what is on YouTube. I have some doubts that filters designed to eliminate certain types of content work particularly well. The idea that the Google screens each and every uploaded video with tools constantly updated to keep track of possibly improper videos is interesting to contemplate. Since no one knows what videos contain, how can one know what’s filtered, allowed in mistakenly, blocked inadvertently, or processed using methods not revealed to the public? (Lists of user “handles” can be quite useful for some purposes.)
  2. Are the channels no one can find actually worthless? I am not too sure. There are channels which present information about how to game the Google algorithm posted by alleged Google “partners.” I engaged in a dialogue with this “professional” and found the exchange quite disturbing. I located the huckster by accident, and I can guarantee that keeping track of this individual is not an easy task. Is that a task a Googler will undertake? Yeah, sure.
  3. YouTube search is one of the many “flavors” of information location the company offers. In my experience, none of the Google search services works very well or delivers on point information without frustration. Does this comment apply to Google Patent search? Yep. What about Google News search? Yep yep. What about regular Google search for company using a common word for its name? Yep yep yep. (Google doesn’t have a clue about a company field code, but it sure pushes ads unrelated to anything I search. I love mindless ads for the non-US content surveillance products that help me express myself clearly. Hey, no I won’t buy.)

Net net: YouTube’s utility is designed for Google ads. The murky methods used to filter content and the poor search and recommender systems illustrate why professional libraries and specific indexing guidelines were developed. Google, of course, thinks that type of dinobaby thinking is not hip.

Yes, it is. Unless Google tames the YouTube, the edifice could fall down. TikTok (which has zero effective search) may just knock a wall or trellis in the YouTube garden over. Google wants to be an avant-garde non-text giant. Even giants have vulnerable points. The article makes clear that third parties cannot do much to make information findable in YouTube. But in a TikTok world, who cares? Advertisers? Google stakeholders? Those who believe Google’s smart software is alive? I go for the software-is-alive crowd.

Stephen E Arnold, August 17, 2022
