More Search Explaining: Will It Help an Employee Locate an Errant PowerPoint?

May 13, 2021

“Semantics, Ambiguity, and the role of Probability in NLU” is a search-and-retrieval explainer. After half a century of search explaining, one would think that the technology required to enter a keyword and get a list of documents in which the keyword appears would be nailed down. Wrong.

“Search” in 2021 embraces many subdisciplines. These range from explicit index terms like the date of a document to more elusive tags like “sentiment” and “aboutness.” Boolean has been kicked to the curb. Users want to talk to search, at least to Alexa and smartphones. Users want smart software to deliver results without the user having to enter a query. When I worked at Booz, Allen & Hamilton, one of my colleagues (I think his name was Harvey Poppel, the smart person who coined the phrase “paperless office”) suggested that someday a smart system would know when a manager walked into his or her office. The smart software would display what the person needed to know for that day. The idea, I think, was that whilst drinking herbal tea, the smart person would read the smart outputs and be more smart when meeting with a client. That was in the late 1970s, and where are we? On Zooms and looking at smartphones. Search is an exercise in frustration, and I think that is why venture firms continue to pour money into ideas, methods, concepts, and demos which have been recycled many times.

I once reproduced a chunk of Autonomy’s marketing collateral in a slide in one of my presentations. I asked those in the audience to guess at what company wrote the text snippet. There were many suggestions, but none was Autonomy. I doubt that today’s search experts are familiar with the lingo of search vendors like Endeca, Verity, InQuire, et al. That’s too bad because the prose used to describe those systems could be recycled with little or no editing for today’s search system prospects.

The write up in question is serious. The author penned the report late last year, but Medium emailed me a link to it a day ago along with a “begging for dollars” plea. Ah, modern online blogs. Works of art indeed.

The article covers these topics as part of the “search” explainer:

  • Ambiguity
  • Understanding
  • Probability

Ambiguity is interesting. One example is a search for the word “terminal.” Does the person submitting the query want information about a computer terminal, a bus terminal, or some other type of terminal; for instance the post terminal on the transformer to my model train set circa 1951? Smart software struggles with this type of ambiguity. I want to point out that a subject matter expert can assign a “field code” to the term and eliminate the ambiguity, but SMEs are expensive and they lose their index precision capability as the work day progresses.
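The field-code idea can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-built index in which a subject matter expert has tagged each document with a hypothetical `field` value. The documents, field names, and the `search` function are invented for the example and do not come from any system mentioned here.

```python
# Hypothetical mini index: each document carries an SME-assigned field code.
DOCS = [
    {"id": 1, "field": "computing", "text": "VT100 terminal emulation settings"},
    {"id": 2, "field": "transport", "text": "Bus terminal schedule for downtown"},
    {"id": 3, "field": "electrical", "text": "Post terminal on a model train transformer"},
]


def search(term, field=None):
    """Return ids of documents containing term, optionally limited to one field code."""
    term = term.lower()
    hits = []
    for doc in DOCS:
        if term not in doc["text"].lower():
            continue  # keyword does not appear in this document
        if field is not None and doc["field"] != field:
            continue  # keyword appears, but in the wrong sense
        hits.append(doc["id"])
    return hits
```

A bare query for “terminal” returns all three documents. Adding the field code narrows the list to the one sense the searcher meant, which is exactly the work the expensive SME did up front.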

To deal with the “terminal” example, the modern system has to understand [a] what the user wants and [b] what the content objects are about. Yep, aboutness. Today’s smart software does an okay job with technical text because jargon like octanitrocubane allows relatively on point identification of a document relevant to a chemist in Columbus, Ohio. Toss in a chemical structure diagram, and the precision of the aboutness ticks up a notch. However, if you search for a word replete with social justice meaning, smart software often has a difficult time figuring out the aboutness. One example is a reference to Skokie, Illinois. Is that a radical right wing code word or a town loved for Potawatomi linguistic heritage?

Probability is a bit more specific, usually. The idea in search is that numbers can illuminate some of the dark corners of text’s meaning. Examples are plentiful. Curious about Miley Cyrus on SNL and then at the after party? The search engine will display the most probable content based on whatever data is sluiced through the query matcher and stored in a cache. If others looked at specific articles, then, by golly, a query about Miley is likely or highly probable to be just what the searcher wanted. The difference between ambiguity, understanding, and probability is, in my opinion, part of the problem search vendors face. No one can explain why, after 50 years of SMART, Personal Library Software, STAIRS, et al., finding on point information remains frustrating, expensive, and ineffective.
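The “others looked at it, so you probably want it” logic can be illustrated with a toy click log. This is a rough sketch only: the log format, the document names, and the `rank_by_click_share` function are assumptions made for the example, and real systems blend many more signals.

```python
from collections import Counter

# Hypothetical click log of (query, clicked document) pairs.
CLICK_LOG = [
    ("miley cyrus snl", "snl-recap"),
    ("miley cyrus snl", "snl-recap"),
    ("miley cyrus snl", "snl-recap"),
    ("miley cyrus snl", "after-party-photos"),
    ("bus terminal", "city-transit-map"),
]


def rank_by_click_share(click_log, query):
    """Rank documents for a query by the share of past clicks each received."""
    counts = Counter(doc for logged_query, doc in click_log if logged_query == query)
    total = sum(counts.values())
    # An unseen query yields an empty ranking, so no division by zero occurs.
    return [(doc, clicks / total) for doc, clicks in counts.most_common()]
```

The numbers say what other people clicked, not what this searcher meant. A query nobody has run before gets an empty list.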

The write up states:

ambiguity was not invented to create uncertainty — it was invented as a genius compression technique for effective communication. And it works like magic, because on the receiving end of the message, there is a genius decoding and decompression technique/algorithm to uncover all that was not said to get at the intended thought behind the message. Now we know very well how we compress our thoughts into a message using a genius encoding scheme, let us now concentrate on finding that genius decoding scheme — a task that we all call now ‘natural language understanding’.

Sounds great. Now try this test. You have a recollection of viewing a PowerPoint a couple of weeks ago at an offsite. You know who the speaker was, and you want the slide with the number of instant messages sent per day on WhatsApp. How do you find that data?

[a] Run a query on your Fabasoft, SearchUnify, or Yext system?

[b] Run a query on Google in the hopes that the GOOG will point you to Statista, a company you believe will have the data?

[c] Send an email to the speaker?

[d] All of the above.

I would just send the speaker a text message and hope for an answer. If today’s search systems were smart, wouldn’t the single PowerPoint slide be in my email anyway? Sure, someday.

Stephen E Arnold, May 13, 2021

Be Cool with Boole

May 10, 2021

How often have you turned to a search engine to answer a question? You know the answer is on the tip of your tongue, but you cannot remember anything about it. Take that back, you do remember things about the answer, that is you know “what it is not.” For example, you are trying to remember the name of 1980s transforming robots but they are not Hasbro Transformers. Usually you could use the Boolean operator “not” in the search term, but that does not yield results.

Thankfully Tech Xplore explains that negative search options are on their way in the article “New Approach Enables Search Engines To Describe Objects With Negative Statements.” Search engines and other computer programs use knowledge bases to answer user questions. The information must be structured in order for it to be discovered. Most information in knowledge bases consists of positive statements, that is, statements that describe something true. Negative statements are not included, although they contain valuable information. They are left out because there is an infinite number of negative statements, so it is impossible to structure every one.

Simon Razniewski of the Max Planck Institute for Informatics in Saarbrücken and his research team created a method to generate negative statements for knowledge bases in different applications. The article explains:

“Using Steven Hawking as an example, the novel approach works as follows: First, several reference cases are identified that share a prominent property with the search object. In the example: physicists. The researchers call these comparison cases “peers.” Now, based on the “peers,” a selection of positive assumptions about the initial entity is generated. Since the physicists Albert Einstein and Richard Feynman won the Nobel Prize, the assumption Steven Hawking won the Nobel Prize could be made. Then, the new assumptions are matched with existing information in the knowledge base about the initial entity. If a statement applies to a “peer” but not to the search object, the researchers conclude that it is a negative statement for the search object—i.e., Steven Hawking never won the Nobel Prize. To evaluate the significance of the negative statements generated, they are sorted using various parameters, for example, how often they occurred in the peer group.”
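The peer-comparison steps in that quotation can be sketched as follows. This is a toy illustration under stated assumptions, not the researchers’ implementation: the knowledge base is three hand-entered entities, the `candidate_negatives` function is hypothetical, and the only ranking signal used is the share of peers holding a statement.

```python
from collections import Counter

# Toy knowledge base of positive statements, entered by hand for illustration.
KB = {
    "Stephen Hawking": {
        ("occupation", "physicist"),
        ("award", "Copley Medal"),
    },
    "Albert Einstein": {
        ("occupation", "physicist"),
        ("award", "Nobel Prize in Physics"),
        ("award", "Copley Medal"),
    },
    "Richard Feynman": {
        ("occupation", "physicist"),
        ("award", "Nobel Prize in Physics"),
    },
}


def candidate_negatives(entity, shared_property, kb):
    """Return (statement, peer_share) pairs true of peers but absent for entity."""
    # Step 1: peers are other entities that share the prominent property.
    peers = [name for name, facts in kb.items()
             if name != entity and shared_property in facts]
    if not peers:
        return []
    # Steps 2 and 3: collect peer statements that the entity lacks.
    counts = Counter()
    for peer in peers:
        for statement in kb[peer]:
            if statement not in kb[entity]:
                counts[statement] += 1
    # Step 4: rank by how often the statement occurs in the peer group.
    return [(statement, n / len(peers)) for statement, n in counts.most_common()]
```

Calling `candidate_negatives("Stephen Hawking", ("occupation", "physicist"), KB)` surfaces the Nobel Prize statement because both peers hold it and Hawking’s entry does not. The sketch treats any missing statement as false, so it is only as good as the completeness of the knowledge base.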

The research team uses recommender systems like those in search engines or on commerce Web sites. They hope to refine the system to identify nuanced negative statements and implicit negative statements. Using negative statements will make search engines more intuitive and the research crosses over into the realms of NLP and AI. Boolean operators could become more obsolete.

Boolean may be back!

Whitney Grace, May 10, 2021

Online: Finding Info Is Easy or Another Dark Pattern?

May 7, 2021

When I attended meetings about online search, I found considerable amusement in comments like “Online makes finding information easy” and “I am an expert at finding information on Google.” Hoots for sure.

I read “How to Find a Buyer or Seller’s Facebook Profile on Marketplace.” According to the write up, at some time in the recent past “finding” information about a person offering something for sale on Facebook Marketplace was easy. Since I have never used Facebook Marketplace, I can accept the facile use of the word “easy” as something a normal thumbtyping Facebooker could do. Some investigators probably had the knowledge required to figure out who was pitching a product allegedly stolen from a bitcoin billionaire.

The write up identifies about nine steps in the process to navigate from a listing’s “seller handle” to the vendor’s Facebook profile. I thought this online search was easy.

I can think of several reasons why Facebook makes finding information difficult with weird words and wonky icons. (One of these was described as a “carrot” in the write up. A carrot? What’s up, Mark?)

It is possible that Facebook wants to accrue clicks and stickiness. Since I don’t use Facebook, I am not a good judge of how sticky the site is. I do know that some individuals in government agencies think a lot about Facebook and the information the company’s databases contain.

Another possibility is that Facebook wants to make it more difficult for stalkers, miscreants, and investigators to move from a product listing to the seller information. The happy face side of me says, “Facebook cares about its users.” The frowny face says, “Facebook wants to make life difficult for anyone to get useful information because accountability is a bad thing.”

A third possibility is that Facebook’s engineers are just incompetent.

Net net: Finding information online is easy as long as one works at the organization with the data and the person doing the looking has root. Others get an opportunity to explore a Dark Pattern. Fun. Helpful even.

Stephen E Arnold, May 7, 2021

Searcher Beware or Turpiculo Puella Naso, Take What You Get

May 6, 2021

More Google ads, more questions like this one: How many would knowingly pay to have an algorithm dial a number for them? Apparently, searchers are being tricked into doing just that, we learn from this article posted at the Which? Press Office: “Misleading Customer Service Ads on Google are Costing Consumers, Which? Reveals.”

Researchers at Which?, a consumer advocacy organization in the UK, studied the results that popped up when they searched for car insurers’ phone numbers. They found both high-rate call connection services and claims-management companies often appeared at the top of the list, before the insurers’ own sites. The write-up tells us:

“Which? found one in five searches (21%) displayed adverts for ‘call connecting’ services at the top of the results. These adverts appear above the insurer’s number and when consumers tap on an advert, they’ll be taken to a website which displays a large phone number and a button that says ‘click to call’. Consumers will be put through to their insurer, but via a premium-rate phone number. The cost of making these calls can quickly escalate – with a 30-minute phone call costing £112.50 on Sky, £124.50 on Three and £127.50 on Vodafone.”
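To put those figures in per-minute and dollar terms, here is a quick calculation. The call costs come from the quotation above, and the exchange rate of $1.39 per pound is the one cited in this post. The function name and layout are hypothetical.

```python
GBP_TO_USD = 1.39   # exchange rate cited in this post
CALL_MINUTES = 30   # call length used in the Which? example
CALL_COST_GBP = {"Sky": 112.50, "Three": 124.50, "Vodafone": 127.50}


def call_costs(costs_gbp, minutes, gbp_to_usd):
    """Return per-minute cost in pounds and total cost in dollars for each network."""
    summary = {}
    for network, gbp in costs_gbp.items():
        summary[network] = {
            "gbp_per_minute": gbp / minutes,
            "usd_total": gbp * gbp_to_usd,
        }
    return summary


if __name__ == "__main__":
    for network, row in call_costs(CALL_COST_GBP, CALL_MINUTES, GBP_TO_USD).items():
        print(f"{network}: GBP {row['gbp_per_minute']:.2f} per minute, "
              f"about USD {row['usd_total']:.0f} for the call")
```

At that rate the three calls work out to roughly $156 to $177 each, or about four pounds for every minute on hold.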

As of this writing, £1 equals $1.39 US. That is a lot to skip the bother of dialing (or copy-and-pasting) for oneself. Such ads officially violate Google’s rules, and the company swears it removes them. And yet, there they were. Then there are the claims management companies. We learn:

“The investigation also found ‘click to dial’ ads for claims management companies (CMCs) were rife and appeared in two in five searches (43%) for customer service phone numbers. ‘Click to dial’ ads have a clickable number in the search result itself. Some of these ads can trick customers to believe they’re contacting their insurer, when they’re actually being put through to a third-party to handle their claim, who will take a cut from any insurance payout. These charges often aren’t stated upfront on the CMCs websites and can catch consumers unaware.”

Insurers have been complaining to Google about these ads for years, but can do little about them but warn their customers. Only if the CMC performs certain deceptions, like using an insurer’s logo, can the company petition to have the ads removed. Less infringing tricks, like using the word “official,” are just fine by Google. To get their own ads to appear at the top, insurers must pay more and more protection (aka advertising) money to Google. Again, Google swears it does not allow misleading advertising. Which? is trying to convince the search giant to do more to stop these ads, but they are battling uphill against the power of ad revenue. Meanwhile, users are reminded to check for the little word “Ad” in the top corner of search results and to check that results match the term they entered and state the name of the company they are trying to connect with. As long as Google refuses to protect its users, caution is required.

Cynthia Murrell, May 6, 2021

Reddit Search Engines: Some Tweaks Might Be Useful

May 6, 2021

Reddit is a popular and vast social media network. It is also a big disorganized mess. The likelihood of finding a thread you read on the main page three weeks ago is zero to null, unless you happened to make a comment on it. That, however, requires a Reddit account, but not everyone has one. Google and other search engines attempt to locate information on Reddit. Reddit attempts to do the same for itself. Both options have limited results.

Reddit search is a can of worms, much like the web site itself. Information can be found, but it requires a lot of digging. A specialized search algorithm specifically designed to handle the information dump that is Reddit would be the best option. GitHub hosts a Reddit Search application that does a fair job of locating information, although it has some drawbacks. The search filters are perfect for Reddit, focusing on the author, subreddit, score, dates, search terms, and searching through posts or comments. The more one knows about the post or comment one wishes to locate, the better the search application is. However, searching for basic information on a topic without filling in the subreddit, date span, or author delimiters spits back hundreds of search results. Reddit Search is similar to how most out of the box search tools function. They work, but need a lot of fine tuning before they are actually useful. Reddit Search does work as long as you have specific information to fill in the search boxes. Otherwise, it only returns semi useful results. The good news is that old Reddit is still available. Hunting remains the name of the game for some online information retrieval tasks.
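The effect of those filters is easy to see in a small sketch. This is not the Reddit Search application’s code. It is a hypothetical `filter_posts` function over three made-up posts, using the filter names listed above (author, subreddit, score, dates, search terms).

```python
from datetime import date

# Made-up posts for illustration only.
POSTS = [
    {"author": "alice", "subreddit": "python", "score": 120,
     "created": date(2021, 4, 12), "text": "Tips for faster pandas joins"},
    {"author": "bob", "subreddit": "python", "score": 3,
     "created": date(2021, 4, 20), "text": "Why is my pandas join slow?"},
    {"author": "carol", "subreddit": "datascience", "score": 45,
     "created": date(2021, 3, 2), "text": "pandas versus SQL for joins"},
]


def filter_posts(posts, term=None, author=None, subreddit=None,
                 min_score=None, after=None, before=None):
    """Return posts matching every filter that was supplied; None means no filter."""
    results = []
    for post in posts:
        if term is not None and term.lower() not in post["text"].lower():
            continue
        if author is not None and post["author"] != author:
            continue
        if subreddit is not None and post["subreddit"] != subreddit:
            continue
        if min_score is not None and post["score"] < min_score:
            continue
        if after is not None and post["created"] < after:
            continue
        if before is not None and post["created"] > before:
            continue
        results.append(post)
    return results
```

With only a search term, every post mentioning it comes back. Each extra delimiter the searcher can supply trims the pile, which is why the tool rewards people who already know most of what they are looking for.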

Whitney Grace, May 6, 2021

Enterprise Search: Please Just Give Users What They Are Searching For

April 30, 2021

Here’s a modest proposal. Be upfront about what “enterprise search” can and cannot do. Nope, will not happen. Enterprise search, like a file manager, is a utility. But those with money bet on enterprise search becoming the next big thing will not admit to the craziness of statements like “index all your information.” All? Yeah, violate privacy, health information regulations, secrets related to acquisitions, etc.

Where is enterprise search? What types of things do the builders of enterprise search consider? CIO Applications gives us some insight in its write-up, “Five Important Features of Enterprise Search Platform.” To hear them tell it, it is all about the UI. We’re informed:

“Enterprise search platforms should have a world-class user interface (UI) that makes it simple and stress-free for users and allows for an excellent user experience. Organizations today face unimaginable volumes of unstructured data, necessitating the creation of an efficient enterprise search platform. An enterprise search tool aids in the analysis and interpretation of organizational data. It assists a company in making better strategic decisions and gaining a competitive advantage.”

According to the post, the five key features include data security, user friendliness, scalability, flexibility & customization, and search analytics. We feel this assessment is off the mark. Lipstick on a pig does not capture the cosmeticizing of a basic function. Aside from security, these components are perks that should be considered after the core requirement is met: employees have to be able to find what they seek. Unfortunately, most enterprise search systems fall short on search itself. The rest are just bells and whistles to distract from that reality. Keep in mind that in order for a person to locate a PowerPoint with the changes a slightly out of control sales professional made to close a big deal with a new customer, more than enterprise search is needed. How does one make search in an enterprise work? How about wave hands, chant AI AI AI, and close the deal with a faked demo? This has worked for many search vendors for many years.

Cynthia Murrell, April 30, 2021

Works Great But Google Upgrades Android Device Search

April 29, 2021

It goes without question that Android mobile devices are superior when it comes to battery longevity and cost. Apple mobile devices are only better when it comes to communication between other Apple products and a universal device search. Slash Gear shares that Android is finally getting a long needed upgrade: “Android Third-Party Launchers Might Finally Get Universal Device Search.”

Universal device search is an out-of-the-box feature for all Microsoft and Apple products, but Android-based OS were left without the option to search everything. Sure, users could download the Google Search app to get this option, but it was limited to the Pixel launcher and Google Search home screen widget. In other words, it did not even compare to macOS Spotlight or Windows Search.

Third-party Android developers were left little to compete with, but Android 12 could finally resolve the debacle. The Android 12 OS has an AppSearchManager API that offers universal search, but it is currently only in preview mode:

“This is definitely good news for developers of the myriad Android launchers available as it at least takes them one step closer to the functionality previously exclusive to Google’s own. At the moment, however, it doesn’t seem to be available just yet and it might be too early to invest in it until the final version lands in Android 12 beta.”

It is ironic that the supreme search giant Google does not offer a universal search comparable to Spotlight or Windows Search. Google is supposed to be the best search engine in the world, so why does it lack a basic search function on its mobile devices? And the “universal” thing, please.

Whitney Grace, April 29, 2021

The Internet Archive Dons a Scholar Skin

April 23, 2021

Some of today’s biggest social faux pas are believing everything on the Internet, clicking the first link in search results, and buying items from questionable Internet ads. It is easy to forget that search engines like Google and Bing are for-profit search engines that put paid links at the top of search results. What is even worse is scientific and scholarly information is locked behind expensive paywalls.

Wikipedia is often believed to be a reliable source, but despite the dedication of wiki editors the encyclopedia is not 100% accurate. There are free scholarly databases and newspapers often have their archives online, but that information is not widely known.

Thankfully the Internet Archive is fairly famous. The Internet Archive is a non-profit digital library that provides users with access to millions of free books, music, Web sites, videos, and software. They also allow users to peruse old Web sites with the Wayback Machine.

The Internet Archive recently introduced a brand new service that is sheer genius: Internet Archive Scholar. It is described as:

“This full text search index includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth century journals through the latest Open Access conference proceedings and pre-prints crawled from the World Wide Web.”

Why did no one at the Internet Archive think of doing this before? It is a brilliant idea that localizes millions of scholarly articles and other information without paywalls, university matriculation, or a library card. Most of the information available through the Internet Archive Scholar would otherwise remain buried in Google search results or on the Web, like old books gathering dust on library shelves.

Internet Archive Scholar is still in the beta phase and enhancements are a positive step.

Whitney Grace, April 23, 2021

Search Tips: Ideal for the Thumbtyper in a Hurry

April 21, 2021

Finding information is “easy.” Some systems display information before you search for it. A mobile with the time and temperature displayed is an example. Maybe you want to locate a source for flowering Chinese cabbage? Plug the phrase into Bing, Google, Qwant, and Yandex? Bingo: super relevant, timely results. Works every time.

If you want to locate information germane to a topic like loss of coolant accident or octanitrocubane, you may need to use a different approach. To get some tips on locating high value, useful information, navigate to “Internet Search Tips.” The write up beats the drum for the Internet Archive. That’s okay.

Useful but probably not suitable for those who are into “good enough” results, a category which includes some YouTube stars, most MBAs, and sadly some of the more recent graduates of information science programs.

Stephen E Arnold, April 21, 2021

Google Stop Words: Close Enough for the Mom and Pop Online Ad Vendor

April 15, 2021

I remember from a statistics lecture given by a fellow named Dr. Peplow maybe that fuzzy is one of the main characteristics of statistics. The idea is that a percentage is not a real entity; for example, the average number of lions in a litter is three, give or take a couple of the magnets for hunters and poachers. Depending upon the data set, the “real” number may be 3.2 cubs in a litter. Who has ever seen a fractional lion? Certainly not me.

Why am I thinking fuzzy? Google is into data. The company collects, counts, and transforms “real” data into actions. Whip in some smart software, and the company has processes which match an advertiser’s need to reach eyeballs with some statistically validated interest in whatever the Mad Ave folks are trying to sell.

“Google Has a Secret Blocklist that Hides YouTube Hate Videos from Advertisers—But It’s Full of Holes” suggests that some of the Google procedures are fuzzy. The uncharitable might suggest that Google wants to get close enough to collect ad money. Horse shoe aficionados use the phrase “close enough for horse shoes” to indicate a toss which gets a point or blocks an opponent’s effort. That seems to be one possible message from The Markup article.

I noted this passage in the essay:

If you want to find YouTube videos related to “KKK” to advertise on, Google Ads will block you. But the company failed to block dozens of other hate and White nationalist terms and slogans, an investigation by The Markup has found. Using a list of 86 hate-related terms we compiled with the help of experts, we discovered that Google uses a blocklist to try to stop advertisers from building YouTube ad campaigns around hate terms. But less than a third of the terms on our list were blocked when we conducted our investigation.

What seems to be happening is that Google’s method for taking a term and then “broadening” it so that related terms are identified is not working. The idea is that related terms with a higher “score” are more directly linked to the original term. Words and phrases with lower “scores” are not closely related. The article uses the example of the term KKK.

I learned:

Google Ads suggested millions upon millions of YouTube videos to advertisers purchasing ads related to the terms “White power,” the fascist slogan “blood and soil,” and the far-right call to violence “racial holy war.” The company even suggested videos for campaigns with terms that it clearly finds problematic, such as “great replacement.” YouTube slaps Wikipedia boxes on videos about the “the great replacement,” noting that it’s “a white nationalist far-right conspiracy theory.” Some of the hundreds of millions of videos that the company suggested for ad placements related to these hate terms contained overt racism and bigotry, including multiple videos featuring re-posted content from the neo-Nazi podcast The Daily Shoah, whose official channel was suspended by YouTube in 2019 for hate speech.

It seems to me that Google is filtering specific words and phrases on a stop word list. Then the company is not identifying related terms, particularly words which are synonyms for the word on the stop list.
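The gap between exact string matching and term expansion can be shown with a short sketch. Everything here is an assumption made for illustration: the placeholder terms, the related-term table, the scores, and the 0.5 default threshold. Nothing in it reflects how Google Ads actually works.

```python
# Placeholder blocklist; neutral tokens stand in for real terms.
BLOCKLIST = {"alpha"}

# Hypothetical related-term table with similarity scores between 0 and 1.
RELATED = {
    "alpha": {"alfa": 0.9, "alpha group": 0.8, "beta": 0.2},
}


def is_blocked_exact(term):
    """Block only terms that literally appear on the list."""
    return term.lower() in BLOCKLIST


def expand_blocklist(blocklist, related, threshold):
    """Add related terms whose score meets the threshold."""
    expanded = set(blocklist)
    for term in blocklist:
        for candidate, score in related.get(term, {}).items():
            if score >= threshold:
                expanded.add(candidate)
    return expanded


def is_blocked_expanded(term, threshold=0.5):
    """Block listed terms plus sufficiently related ones."""
    return term.lower() in expand_blocklist(BLOCKLIST, RELATED, threshold)
```

With exact matching, the variant spelling sails through. With expansion, it is caught, and the threshold decides how far the net reaches: lower it and loosely related terms are blocked too.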

Is it possible that Google is controlling how it does fuzzification? In order to get clicks and advertising, does Google block specifics and omit the term expansion and synonym identification settings that would eliminate the words and phrases identified by The Markup’s investigative team?

These references to synonym expansion and query expansion are likely to be unfamiliar to some people. Nevertheless, fuzzy is in the hands of those who set statistical thresholds.

Fuzzy is not real, but the search results are. Ad money is a powerful force in some situations. The article seems to have uncovered a couple of enlightening examples. String matching coupled with synonym expansion seems to be out of step. Some fuzzification may be helpful in the hate speech methods.

Stephen E Arnold, April 12, 2021
