Useful Scholarly / Semi-Scholarly Research System with Deduplicated Results

March 24, 2023

I was delighted to receive a link to OpenAIRE Explore. The service is sponsored by a non-profit partnership established in 2018 as a legal outfit. The objective is to “ensure a permanent open scholarly communication infrastructure to support European research.” (I am not sure whether whoever wrote the description has read “Book Publishers Won’t Stop Until Libraries Are Dead.”)

The specific service I found interesting is Explore located at https://explore.openaire.eu. The service is described by OpenAIRE this way:

A comprehensive and open dataset of research information covering 161m publications, 58m research data, 317k research software items, from 124k data sources, linked to 3m grants and 196k organizations.

Maybe looking at that TechDirt article will be useful.

I ran a number of queries. The probably unreadable screenshot below illustrates the nice interface and the results to my query for Hopf fibrations (if this query doesn’t make sense to you, there’s not much I can do. Perhaps OpenAIRE Explore is ill-suited to queries about Taylor Swift and Ticketmaster?):

image

The query returned 127 “hits” and identified four organizations as having people interested in the subject. (Hopf fibrations are quite important, in my opinion.) No ads, no crazy SEO baloney, but probably some non-error checked equations. Plus, the result set was deduplicated. Imagine that. A useful Vivisimo-type function is available again.

Observation: Some professional publishers are likely to find the service objectionable. Four of the giants are watching their legal eagles circle the hapless Internet Archive. But soon… maybe OpenAIRE will attract some scrutiny.

For now, OpenAIRE Explore is indeed useful.

Stephen E Arnold, March 24, 2023

20 Years Ago: Primus Knowledge Solutions

March 20, 2023

Note: Written by a real-live dinobaby. No smart software involved.

I am not criticizing Primus Knowledge Solutions (acquired by ATG in 2004 and then Oracle purchased ATG in 2011). I would ask that you read this text and consider what was marketed in 2003. The source is a description of Primus’ Answer Engine which was once located at dub dub dub primus.com/products/answerEngine:

Primus Answer Engine helps companies take full advantage of the valuable content that already exists in corporate documents and databases. Using proprietary natural language processing, Answer Engine delivers quick, relevant answers to plain English questions by bringing widespread corporate knowledge to support agents, as well as to customers, partners, and employees via the web.

What “features” did the system provide two decades ago? The fact sheet I picked up at a search conference in 2003 told me:

  • Natural language processing
  • Scalability
  • Database integration
  • All major document types
  • Insightful reporting
  • Customizable interface
  • Centralized administration

The system can suggest questions and interprets these or other questions and returns a list of answers found in a company’s online documents. This allows users to view the answer in context if desired.

I mention Primus because it is one example from dozens in my files about NLP technology.

Several observations/questions:

  • Where is Oracle in the ChatGPT derby? May I suggest this link for starters.
  • Isn’t the principal difference between Primus and other NLP “smart software” that users are now chasing ChatGPT-type systems, not innovators outputting marketing words?
  • Are issues like updating training models and their content, biases in the models themselves, and the challenge of accurate, current data enjoying the 2003 naïveté?

Net net: ChatGPT is just one manifestation of innovators’ attempts to deal with the challenge of finding accurate, on-point, and timely information in the digital world. (This is a world I call the datasphere.)

Stephen E Arnold, March 20, 2023

Elasticsearch Guide: More of a Cheat Sheet

March 15, 2023

Elasticsearch has been a go-to solution for searching content either via the open source version or the Elastic technical support option. The system works, and it has many followers and enthusiasts. As a result, one can locate “help” easily online for many hitches in the git along.

I found the information in “Unlocking the Power of Elasticsearch: A Comprehensive Guide to Complex Search Use Cases” helpful. I would suggest that the write up is more like a cheat sheet. Encounter a specific task, check the “Guide,” and sally forth.
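As a taste of what such a cheat sheet covers, here is a sketch of a common Elasticsearch request body: a bool query combining a full-text match with a date filter. The index and field names (“reports,” “body,” “published”) are invented for illustration, not taken from the guide.

```python
# Sketch of an Elasticsearch bool query: full-text relevance scoring on one
# field, combined with a non-scoring date filter. Field and index names are
# hypothetical.
import json

query = {
    "query": {
        "bool": {
            # "must" clauses contribute to the relevance score.
            "must": [{"match": {"body": "price quotation"}}],
            # "filter" clauses restrict results without affecting the score.
            "filter": [{"range": {"published": {"gte": "2023-01-01"}}}],
        }
    },
    "size": 10,
}

# A real deployment would POST this body to the index's _search endpoint;
# here we just render what a client would send.
print(json.dumps(query, indent=2))
```

The point of the bool wrapper is that scoring clauses and cheap yes/no filters are kept separate, which is exactly the kind of detail a cheat sheet saves one from re-deriving.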

I would suggest that many real-life enterprise search needs are often difficult to solve. One example: capturing data on a sales professional’s laptop before the colleague deletes the slide deck with the revised price quotation data. No search engine on the planet can get this important information to the legal department if the project goes off the rails. “I can’t find it” is not a helpful answer.

Similar challenges arise when the Elasticsearch system must interact with a line item for a product specified in a purchase order which has a corresponding engineering drawing. Line up the chemical, civil, mechanical, and nuclear engineers and tell them, “Well, that’s an object embedded in the what-do-you-call-it software I never heard of.” Yeah.

Nevertheless, for some helpful tips give the free guide a look.

The mantra is, “Search is easy. Search is a solved problem. Search is no big deal.” Convince yourself. Keep in mind that the mantra does not ring true to me nor does it make me calm.

Stephen E Arnold, March 15, 2023

Hybrid Search: A Gentle Way of Saying “One Size Fits All” Search Like the Google Provides Is Not Going to Work for Some

March 9, 2023

“On Hybrid Search” is a content marketing-type report. That’s okay. I found the information useful. What causes me to highlight this post by Qdrant is that one implicit message is: Google’s approach to search is lousy because it is aiming at the lowest common denominator of retrieval while preserving its relevance eroding online ad matching business.

The guts of the write up walks through old school and sort of new school approaches to matching processed content with a query. Keep in mind that most of the technology mentioned in the write up is “old” in the sense that it’s been around for a half decade or more. The “new” technology is about ready to hop on a bike with training wheels and head to the swimming pool. (Yes, there is some risk there I suggest.)

But here’s the key statement in the report for me:

Each search scenario requires a specialized tool to achieve the best results possible. Still, combining multiple tools with minimal overhead is possible to improve the search precision even further. Introducing vector search into an existing search stack doesn’t need to be a revolution but just one small step at a time. You’ll never cover all the possible queries with a list of synonyms, so a full-text search may not find all the relevant documents. There are also some cases in which your users use different terminology than the one you have in your database.

Here’s the statement about which I am not feeling warm fuzzies:

Those problems are easily solvable with neural vector embeddings, and combining both approaches with an additional reranking step is possible. So you don’t need to resign from your well-known full-text search mechanism but extend it with vector search to support the queries you haven’t foreseen.
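The combination the quoted passage describes — full-text retrieval plus vector similarity, merged by a reranking step — can be sketched in miniature. Everything below is toy data: a real stack would use BM25 scores from a search engine and embeddings from a model, but the fusion logic (here, reciprocal rank fusion) is the same idea.

```python
# Toy hybrid search: fuse a keyword ranking and a vector-similarity ranking
# with reciprocal rank fusion (RRF). Documents and embeddings are invented.
import math

def keyword_rank(query, docs):
    """Rank doc indices by naive term-overlap count (stand-in for BM25)."""
    terms = set(query.lower().split())
    scored = [(sum(t in d.lower() for t in terms), i) for i, d in enumerate(docs)]
    return [i for score, i in sorted(scored, key=lambda s: -s[0])]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def vector_rank(query_vec, doc_vecs):
    """Rank doc indices by cosine similarity to the query embedding."""
    sims = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for sim, i in sorted(sims, key=lambda s: -s[0])]

def rrf(rankings, k=60):
    """Reciprocal rank fusion: a document high in either list floats up."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: -scores[d])

docs = ["vector search with embeddings", "classic keyword search", "cooking pasta"]
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.0, 0.1]]   # invented 2-d embeddings
query = "keyword search"
query_vec = [0.3, 0.7]                             # invented query embedding

fused = rrf([keyword_rank(query, docs), vector_rank(query_vec, doc_vecs)])
print(fused)  # doc 1 ranks first: it matches both signals
```

Note that neither retrieval method has to change; the “one small step” the report describes is bolting the second ranking and the fusion step onto the existing full-text system.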

Observations:

  • No problems in search when humans are seeking information are “easily solvable with shotgun marriages.”
  • Finding information is no longer enough: The information or data displayed have to be [a] correct, accurate, or at least reproducible; [b] free of injected poisoned information (yep, the burden falls on the indexing engine or engines, not the user who, by definition, does not know an answer or what is needed to answer a query); and [c] timely, even though access to “real time” data creates additional computational cost, which is often difficult to justify
  • Basic finding and retrieval is morphing into projected outcomes or implications from the indexed data. Available technology for search and retrieval is not tuned for this requirement.

Stephen E Arnold, March 9, 2023

Take That Googzilla Because You Have One Claw in Your Digital Grave. Honest

March 8, 2023

My, my. How the “we are search experts” set have changed their tune. I am not talking about those who were terminated by the Google. I am not talking about the fawning advertising intermediaries. I am not talking about old school librarians who know how to extract information from commercial databases.

I am talking about the super clever Silicon Valley infused pundits.

Here’s an example: “Google Search Is Dying” from 2022. The write up contains one of the all-time statements from a Google wizard I have encountered. Believe me. I have noted a few over the years.

The speaker is the former champion of search engine optimization and denier of Google’s destruction of precision, recall, and relevance in search results. Here’s the statement:

You said in the post that quotes don’t give exact matches. They really do. Honest.— Google’s public search liaison (that’s a title of which to be proud)

I love it when a Googler uses the word “honest.”

Net net: The Gen X, Y’s, and Z’s perceive themselves as search experts. Okay, living in a cloud of unknowing is ubiquitous today. But “honest”?

Stephen E Arnold, March 8, 2023

Google Points Out That ChatGPT Has a Core Neural Disorder: LSD or Spoiled Baloney?

February 16, 2023

I am an old-fashioned dinobaby. I have a reasonably good memory for great moments in search and retrieval. I recall when Danny Sullivan told me that search engine optimization improves relevance. In 2006, Prabhakar Raghavan on a conference call with a Managing Director of a so-so financial outfit explained that Yahoo had semantic technology that made Google’s pathetic effort look like outdated technology.

image

Hallucinating pizza courtesy of the super smart AI app Craiyon.com. The art, not the write up it accompanies, was created by smart software. The article is the work of the dinobaby, Stephen E Arnold. Looks like pizza to me. Close enough for horseshoes like so many zippy technologies.

Now that SEO and its spawn are scrambling to find a way to fiddle with increasingly weird methods for making software return results the search engine optimization crowd’s customers demand, Google’s head of search Prabhakar Raghavan is opining about the oh, so miserable work of OpenAI and its now TikTok-trend ChatGPT. May I remind you, gentle reader, that OpenAI availed itself of some Googley open source smart software and consulted with some Googlers as it ramped up to the tsunami of PR ripples? May I remind you that Microsoft said, “Yo, we’re putting some OpenAI goodies in PowerPoint”? The world rejoiced, and Reddit plus Twitter kicked into rave mode.

Google responded with a nifty roll out in Paris. February is not April, but maybe it should have been in April 2023, not in le temps d’hiver?

I read with considerable amusement “Google Vice President Warns That AI Chatbots Are Hallucinating.” The write up states as rock-solid, George Washington “I cannot tell a lie” truth the following:

Speaking to German newspaper Welt am Sonntag, Raghavan warned that users may be delivered complete nonsense by chatbots, despite answers seeming coherent. “This type of artificial intelligence we’re talking about can sometimes lead to something we call hallucination,” Raghavan told Welt Am Sonntag. “This is then expressed in such a way that a machine delivers a convincing but completely fictitious answer.”

LSD or just the Google code relied upon? Was it the Googlers of whom OpenAI asked questions? Was it reading the gems of wisdom in Google patent documents? Was it coincidence?

I recall that Dr. Timnit Gebru and her co-authors of the Stochastic Parrot paper suggested that life on the Google island was not palm trees and friendly natives. Nope. Disagree with the Google and your future elsewhere awaits.

Now we have the hallucination issue. The implication is that smart software like Google-infused OpenAI is addled. It imagines things. It hallucinates. It is living in a fantasy land with bean bag chairs, Foosball tables, and memories of Odwalla juice.

I wrote about the after-the-fact yip yap from Google’s Chair Person of the Board. I mentioned the Father of the Darned Internet’s post ChatGPT PR blasts. Now we have the head of search’s observation about screwed up neural networks.

Yep, someone from Verity should know about flawed software. Yep, someone from Yahoo should be familiar with using PR to mask spectacular failure in search. Yep, someone from Google is definitely in a position to suggest that smart software may be somewhat unreliable because of fundamental flaws in the systems and methods implemented at Google and probably other outfits loving the Tensor T shirts.

Stephen E Arnold, February 16, 2023

Amazing Statement about Google

January 17, 2023

I am not into Twitter. I think that intelware and policeware vendors find the Twitter content interesting. A few of them may be annoyed that the Twitter application programming interface seems to have gone on a walkabout. One of the analyses of Twitter I noted this morning (January 15, 2023, 10:35 am) is “Twitter’s Latest ‘Feature’ Is How You Know Elon Musk Is in Over His Head. It’s the Cautionary Tale Every Business Needs to Hear.”

I want to skip over the Twitter palpitations and focus on one sentence:

At least, with Google, the company is good enough at what it does that you can at least squint and sort of see that when it changes its algorithm, it does it to deliver a better experience to its users–people who search for answers on Google.

What about that “at least”? Also, what do you make of the “you can at least squint and sort of see that when it [Google] changes its algorithm”? Squint to see clearly. Into Google? Hmmm. I can squint all day at a result like this and not see anything except advertising and a plug for the Google Cloud for the query “online hosting”:

image

Helpful? Sure to Google, not to this user.

Now consider the favorite Google marketing chestnut, “a better experience.” Ads and a plug for Google does not deliver to me a better experience. Compare the results for the “online hosting” query to those from www.you.com:

image

Google is the first result, which suggests some voodoo in the search engine optimization area. The other results point to a free hosting service, a PC Magazine review article (reviews are often an interesting editorial device), and an outfit called Online Hosting Solution.

Which is better? Google’s ads and self promotion or the new You.com pointer to Google and some sort of relevant links?

Now let’s run the query “online hosting” on Yandex.com (not the Russian language version). Here’s what I get:

image

Note that the first link is to a particular vendor with no ad label slapped on the link. The other links are to listicle articles which present a group of hosting companies for the person running the query to consider.

Of the three services, which requires the “squint” test? I suppose one can squint at the Google result and conclude that it is just wonderful, just not for objective results. The You.com results are a random list of mostly relevant links. But that top hit pointing at Google Cloud makes me suspicious. Why Google? Why not Amazon AWS, Microsoft Azure, the fascinating Epik.com, or another vendor?

In this set of three, Yandex.com strikes me as delivering cleaner, more on point results. Your mileage may vary.

In my experience, systems which deliver answers are a quest. Most of the systems to which I have been exposed seem the digital equivalent of a ride with Don Quixote. The windmills of relevance remain at risk.

Stephen E Arnold, January 17, 2023

Semantic Search for arXiv Papers

January 12, 2023

An artificial intelligence research engineer named Tom Tumiel (InstaDeep) created a Web site called arXivxplorer.com.

image

According to his Twitter message (posted on January 7, 2023), the system is a “semantic search engine.” The service implements OpenAI’s embedding model. The idea is that this search method allows a user to “find the most relevant papers.” There is a stream of tweets at this link about the service. Mr. Tumiel states:

I’ve even discovered a few interesting papers I hadn’t seen before using traditional search tools like Google or arXiv’s own search function or even from the ML twitter hive mind… One can search for similar or “more like this” papers by “pasting the arXiv url directly” in the search box or “click the More Like This” button.
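Under the hood, this sort of “semantic” and “more like this” retrieval amounts to nearest-neighbor lookup in embedding space. Here is a minimal sketch of that idea; the paper titles and three-dimensional vectors are invented stand-ins for the high-dimensional embeddings a real model would produce.

```python
# Minimal sketch of embedding-based "semantic" and "more like this" search.
# Titles and 3-d vectors are invented; a real service would obtain embeddings
# from a model such as OpenAI's and index millions of papers.
import math

papers = {
    "Hopf fibrations and quantum states": [0.9, 0.2, 0.1],
    "PageRank as an eigenvector problem": [0.1, 0.9, 0.2],
    "Transformers for protein folding":   [0.2, 0.3, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, top_k=2):
    """Return the top_k paper titles nearest the query vector."""
    ranked = sorted(papers, key=lambda t: -cosine(query_vec, papers[t]))
    return ranked[:top_k]

# "Semantic" query: embed the query text (vector invented here).
hits = search([0.15, 0.85, 0.25])

# "More like this": reuse an existing paper's own embedding as the query.
similar = search(papers["Hopf fibrations and quantum states"])
```

The “paste the arXiv URL” trick in the tweet is the second call: instead of embedding new query text, the system looks up the paper’s stored vector and searches with that.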

I ran several test queries, including this one: “Google Eigenvector.” The system surfaced generally useful papers, including one from January 2022. However, when I included the date 2023 in the search string, arXiv Xplorer did not return a null set. The system displayed hits which did not include the date.

Several quick observations:

  1. The system seems to be “time blind,” which is a common feature of modern search systems
  2. The system provides the abstract when one clicks on a link. The “view” button in the pop up displays the PDF
  3. Adding search terms to the query reduces the result set size, a refreshing change from queries which display “infinite scrolling” of irrelevant documents.

For those interested in academic or research papers, will OpenAI become aware of the value of dates, limiting queries to endnotes, and displaying a relationship map among topics or authors in a manner similar to Maltego? By combining more search controls with the OpenAI content and query processing, the service might leapfrog the Lucene/Solr type methods. I think that would be a good thing.

Will the implementation of this system add to Google’s search anxiety? My hunch is that Google is not sure what causes the Google system to perturbate. It may well be that the twitching, the sudden changes in direction, and the coverage of OpenAI itself in blogs may be the equivalent of tremors, soft speaking, and managerial dizziness. Oh, my, that sounds serious.

Stephen E Arnold, January 12, 2023

Google Results Are Relevant… to Google and the Googley

January 3, 2023

We know that NoNeedforGPS will not be joining Prabhakar Raghavan (Google’s alleged head of search) and the many Googlers repurposed to deal with a threat, a real threat. That existential demon is ChatGPT. Dr. Raghavan (formerly of the estimable Verity which was absorbed into the even more estimable Autonomy which is a terra incognita unto itself) is getting quite a bit of Google guidance, help, support, and New Year cheer from those Googlers thrown into a Soviet style project to make that existential threat go away.

NoNeedforGPS questioned on Reddit.com the relevance of Google’s ad-supported sort of Web search engine. The plaintive cry in the post, an image which is essentially impossible to read, says:

Why does Google show results that have nothing to do with what is searched?

You silly goose, NoNeedforGPS. You fail to understand the purpose of Google search, and you obviously are not privy to discussions by search wizards who embrace a noble concept: It is better to return a result than a null result. A footnote to this brilliant insight is that a null result — that is, a search results page which says, “Sorry, no matches for your query” — makes it tough to match ads and to convince the lucky advertiser that a blank page conveys information.

What? A null result conveys information! Are you crazy there in rural Kentucky with snow piled to a height of four French bulldogs standing atop one another?

No, I don’t think I am crazy, which is a negative word, according to some experts at Stanford University.

When I run a query like “Flokinet climate activist”, I really want to see a null result set. My hunch is that some folks in Eastern Europe want me to see an empty set as well.

Let me put the display of irrelevant “hits” in response to a query in context:

  1. With a result set — relevant or irrelevant is irrelevant — Google’s super duper ad matcher can do its magic. Once an ad is displayed (even in a list of irrelevant results to the user), some users click on the ads. In fact, some users cannot tell the difference between a relevant hit and an ad. Whatever the reason for the click, Google gets money.
  2. Many users who run a query don’t know what they are looking for. Here’s an example: A person searches Google for a Greek restaurant. Google knows that there is no Greek restaurant anywhere near the location of  the Google user. Therefore, the system displays results for restaurants close to the user. Google may toss in ads for Greek groceries, sponges from Greece, or a Greek history museum near Dunedin, Florida. Google figures one of these “hits” might solve the user’s problem and result in a click that is related to an ad. Thus, there are no irrelevant results when viewed from Google’s UX (user experience) viewpoint via the crystal lenses of Ad Words, SEO partner teams, or a Googler who has his/her/its finger on the scale of Google objectivity.
  3. The quaint notions of precision and recall have been lost in the mists of time. My hunch is that those who remember that a user often enters a word or phrase in the hopes of getting relevant information related to that which was typed into the query processor are not interested in old fashioned lists of relevant content. The basic reason is that Google gave up on relevance around 2006, and the company has been pursuing money, high school science projects like solving death, and trying to manage the chaos resulting from a management approach best described as anti-suit and pro fun. The fact that Google sort of works is amazing to me.

The sad reality is that Google handles more than 90 percent of the online searches in North America. Years ago I learned that in Denmark, Google handles 100 percent of the online search traffic. Dr. Raghavan can lash hundreds of Googlers to the ChatGPT response meetings, but change may be difficult. Google believes that its approach to smart software is just better. Google has technology that is smarter, more adept at creating college admission essays, and blog posts like this one. Google can do biology, quantum computing, and write marketing copy while wearing a Snorkel and letting code do deep dives.

Net net: NoNeedforGPS does express a viewpoint which is causing people who think they are “expert searchers” to try out DuckDuckGo, You.com, and even the Russian service Yandex.com, among others. Thus, Google is scared. Those looking for information may find a system using ChatGPT returns results that are useful. Once users mired in irrelevant results realize that they have been operating in the dark, a new dawn may emerge. That’s Dr. Raghavan’s problem, and it may prove to be easier to impress those at a high school reunion than advertisers.

Stephen E Arnold, January 3, 2023

Southwest Crash: What Has Been Happening to Search for Years Revealed

January 2, 2023

What’s the connection between the failure of Southwest Airlines’ technology infrastructure and search? Most people, including assorted content processing experts, would answer the question this way:

None. Finding information and making reservations are totally unrelated.

Fair enough.

“The Shameful Open Secret Behind Southwest’s Failure” does not reference finding information as the issue. We learn:

This problem — relying on older or deficient software that needs updating — is known as incurring “technical debt,” meaning there is a gap between what the software needs to be and what it is. While aging code is a common cause of technical debt in older companies — such as with airlines which started automating early — it can also be found in newer systems, because software can be written in a rapid and shoddy way, rather than in a more resilient manner that makes it more dependable and easier to fix or expand.

I think this is a reasonable statement. I suppose a reader with my interest in search and retrieval can interpret the comments as applicable to looking up who owns some of the domains hosted on Megahost.com or some similar service provider. With a little thought, the comment can be stretched to cover the failure some organizations have experienced when trying to index content within their organizations so that employees can find a PowerPoint used by a young sales professional at a presentation at a trade show several weeks in the past.

My view point is that the Southwest failure provides a number of useful insights into the fragility of the software which operates out of sight and out of mind until that software fails.

Here’s my list of observations:

  1. Failure is often a real life version of the adage “the straw that broke the camel’s back”. The idea is that good enough software chugs along until it simply does not work.
  2. Modern software cannot be quickly, easily, or economically fixed. Many senior managers believe that software wrappers and patches can get the camel back up and working.
  3. Patched systems may have hidden technical or procedural issues. A system may be returned to service, but it may harbor hidden gotchas; for example, the sales professional’s PowerPoint. The file may not be in the “system” and, therefore, cannot be found. No problem until a lawyer comes knocking about a disconnect between an installed system and what the sales professional asserted. Findability is broken by procedures, lack of comprehensive data collection, or an error importing a file. Sharing blame is not popular in some circles.

What’s this mean?

My view is that many systems and software work as intended; that is, well enough. No user is aware of certain flaws or errors, particularly when these are shared. Everyone lives with the error, assuming the mistake is the way something is. In search, if one looks for data about Megahost.com and the data are not available, it is easy to say, “Nothing to learn. Move on.” A rounding error in Excel. Move on. An enterprise search system which cannot locate a document? Just move on or call the author and ask for a copy.

The Southwest meltdown is important. The failure of the system makes clear the state of mission critical software. The problem exists in other systems as well, including tax systems, command and control systems, health care systems, and word processors which cannot reliably number items in a list, among others.

An interesting and exciting 2023 may reveal other Southwest case examples.

Stephen E Arnold, January 2, 2023
