Google Relevance: A Light Bulb Flickers

November 20, 2017

The Wall Street Journal published “Google Has Chosen an Answer for You. It’s Often Wrong” on November 17, 2017. The story is online, but you have to pay money to read it. I gave up on the WSJ’s online service years ago because at each renewal cycle, the WSJ kills my account. Pretty annoying because the pivot of the WSJ write up about Google implies that Google does not do information the way “real” news organizations do. Google does not annoy me the way “real” news outfits’ online services do.

For me, the WSJ is a collection of folks who find themselves looking at the exhaust pipes of the Google Hellcat. A source for a story like “Google Has Chosen an Answer for You. It’s Often Wrong” is a search engine optimization expert. Now that’s a source of relevance expertise! Another useful source is the terse posts by Googlers authorized to write vapid, cheery comments in Google’s “official” blogs. The guts of Google’s technology are described in wonky technical papers, the background and claims sections of Google’s patent documents, and systematic queries run against Google’s multiple content indexes over time. A few random queries do not reveal the shape of the Googzilla in my experience. Toss in a lack of understanding about how Google’s algorithms work and their baked in biases, and you get a write up that slips on a banana peel of the imperative to generate advertising revenue.

I found the write up interesting for three reasons:

  1. Unusual topic. Real journalists rarely address the question of relevance in ad-supported online services from a solid knowledge base. But today everyone is an expert in search. Just ask any millennial, please. Jonathan Edwards had less conviction about his beliefs than a person skilled in locating a pizza joint on a Google Map.
  2. SEO is an authority. SEO (search engine optimization) experts have done more to undermine relevance in online search than any other group. The one exception is the teams who have to find ways to generate clicks from advertisers who want to shove money into the Google slot machine in the hopes of an online traffic pay day. Using SEO experts’ data as evidence grinds against my belief that old fashioned virtues like editorial policies, selectivity, comprehensive indexing, and a bear hug applied to precision and recall calculations are helpful when discussing relevance, accuracy, and provenance.
  3. You don’t know what you don’t know. The presentation of the problems of converting a query into a correct answer reminds me of the many discussions I have had over the years with search engine developers. Natural language processing is tricky. Don’t believe me? Grab your copy of Gramática didáctica del español and check out the “rules” for el complemento circunstancial. Online systems struggle with what seems obvious to a reasonably informed human, but toss in multiple languages for automated question answering, and “Houston, we have a problem” echoes.

I urge you to read the original WSJ article yourself. You decide how bad the situation is at ad-supported online search services, big time “real” news organizations, and among clueless users who believe that what’s online is, by golly, the truth dusted in accuracy and frosted with rightness.

Humans often take the path of least resistance; therefore, performing high school term paper research is a task left to an ad supported online search system. “Hey, the game is on, and I have to check my Facebook” takes precedence over analytic thought. But there is a free lunch, right?


In my opinion, this particular article fits in the category of dead tree media envy. I find it amusing that the WSJ is irritated that Google search results may not be relevant or accurate. There’s 20 years of search evolution under Googzilla’s scales, gentle reader. The good old days of the juiced up CLEVER methods and Backrub’s old fashioned ideas about relevance are long gone.

I spoke with one of the earlier Googlers in 1999 at a now defunct (thank goodness) search engine conference. As I recall, that confident and young Google wizard told me in a supercilious way that truncation was “something Google would never do.”

What? Huh?

Guess what? Google introduced truncation because it was a required method to deliver features like classification of content. Mr. Page’s comment to me in 1999 and the subsequent embrace of truncation makes clear that Google was willing to make changes to increase its ability to capture the clicks of users. Kicking truncation to the curb and then digging through the gutter trash told me two things: [a] Google could change its mind for the sake of expediency prior to its IPO and [b] Google could say one thing and happily do another.

I thought that Google would sail into accuracy and relevance storms almost 20 years ago. Today Googzilla may be facing its own Ice Age. Articles like the one in the WSJ are just belated harbingers of push back against a commercial company that now has to conform to “standards” for accuracy, comprehensiveness, and relevance.

Hey, Google sells ads. Algorithmic methods refined over the last two decades make that process slick and useful. Selling ads does not pivot on investing money in identifying valid sources and the provenance of “facts.” Not even the WSJ article probes too deeply into the SEO experts’ assertions and survey data.

I assume I should be pleased that the WSJ has finally realized that algorithms integrated with online advertising generate a number of problematic issues for those concerned with factual and verifiable responses.

If Google is in the distortion business, what’s the fix?

Based on my admittedly limited interactions with Googlers, I am not convinced that those now working on search have much awareness of the algorithmic trap in which they are working.

The problem is far broader than Google. The “problem” may reside in the convenience crazed users of online apps and search services. Content processing procedures have some built in flaws, and these numerical recipes may not be able to deliver the accuracy some expect. Many of today’s automated methods cannot match the performance of a reasonably intelligent human with knowledge of a subject matter “space.”

I would suggest that people want answers, not results lists. Plus, humans armed with mobile devices and a schedule stuffed with meetings from an automated calendaring system don’t want to do anything beyond grab and go. This is the intellectual equivalent of nuking a burrito in a microwave before one rushes to a one o’clock meeting. Online question answering systems earn D grades in my experience. That’s what keeps users clicking, gentle reader. Paying a human to figure out what’s a high value source, verifying the indexing, and implementing an editorial policy are relics from a fifth century scriptorium. (High value manuscripts were chained to a shelf. Blog posts are not handled in this manner.)

Let’s assume for the purpose of this no cost blog post that Google results and the concomitant user clicks make the firm’s advertising system operate. What’s the human user got to do with this virtuous circle of clicks, ads, and relevance? You can answer this question as homework: “Is money influencing this answer?”

My perception is that normal users and “real” journalists don’t like to do too much extra work. Think of the pain many experienced when asked to write a paper with footnotes, please. Users take what a system outputs and assume (ass of you and me?) the data are accurate.

Let me wrap up my views of this article and free online question answering systems with these comments:

First, yep, the free lunch thing again. If one wants free information, easy research, and convenience, prepare for manipulation. The fix is to become a more capable researcher.

Second, old fashioned concepts like precision and recall take time and cost money (lots of money) to implement. The reason is that editorial controls have to be applied to content before content is selected for inclusion in an index to a dataset. The index terms have to be tagged in such a manner as to differentiate a document or video about airplane terminals from patients with terminal cancer. In my experience, progress in indexing has stalled. Don’t believe me? Check out the TREC results for the last six or seven years. Stuck at 85 percent accuracy? Well, only 15 percent of the outputs are wrong. And that 85 percent accuracy figure has been around for many, many years. Be happy. You will use incorrect or off base information only 15 percent of the time. That’s the modern information retrieval game. Make error rates less important with user interface tricks and automatic answers for convenience and revenue. The fix is to do the research, fact checking, and analyzing to avoid being fed dog kibble.
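For readers who have not wrestled with the old fashioned metrics, the precision and recall calculations mentioned above boil down to a few lines. This is a minimal sketch with made-up document IDs, not any production system’s method:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents the system actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: the system returns 10 documents, 6 of which are
# relevant; 12 relevant documents exist in the collection.
retrieved = [f"doc{i}" for i in range(10)]        # doc0 .. doc9
relevant = [f"doc{i}" for i in range(4, 16)]      # doc4 .. doc15
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.6 0.5
```

High precision with low recall means the user sees mostly good results but misses much of what exists; computing recall honestly requires knowing the full set of relevant documents, which is exactly the expensive editorial work the post describes.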

Third, the rise of SEO essentially institutionalized irrelevance. Tricking algorithmic systems is not rocket science. In fact, not much content is required despite the hoo-hah about Russian content volume. The reason is that the 10 most widely used algorithms in content processing are very sensitive to certain minor perturbations in the content streams processed. Forget a hummingbird flapping its wings in Brazil and think about skewing election results with a few hundred blog posts and some automated retweets. The fix? Demand relevance. Don’t click that juicy link which seems to be the “answer.”
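The sensitivity to minor perturbations is easy to demonstrate with a toy example. The sketch below is a naive term-frequency ranker (deliberately cruder than anything Google runs; the posts and query are invented) showing how a couple of keyword-stuffed documents outrank an honest one:

```python
from collections import Counter

def tf_score(doc, query_terms):
    """Naive relevance score: raw count of query terms in the document."""
    counts = Counter(doc.lower().split())
    return sum(counts[t] for t in query_terms)

# Hypothetical mini-corpus: one honest post, two keyword-stuffed posts.
corpus = {
    "honest_post": "election results reported by local officials today",
    "seo_post_1": "election election election results results winner",
    "seo_post_2": "election results election results election winner",
}

query = ["election", "results"]
ranking = sorted(corpus, key=lambda d: tf_score(corpus[d], query), reverse=True)
print(ranking)  # the stuffed posts land above the honest one
```

Real ranking systems use far more signals, but the underlying point holds: when a score is a function of the content stream, a small amount of adversarial content shifts the output, which is why a few hundred posts and some retweets can matter.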

Net net: Ask a question and assume that the answer is right. That’s a risky way to become informed whether one is researching a “real” news story for the WSJ or finding a hotel.

Stephen E Arnold, November 20, 2017

