As Privacy Concerns Grow, So Do Search Alternatives
February 23, 2022
Google is sure to remain king of the online search hill for the foreseeable future, but Spark takes a look at a couple of burgeoning alternatives in the post, “Search Engines Try to Rival Google by Offering Fewer Ads, More Privacy.” Writer Jonathan Ore begins with Neeva, founded by ex-Googler Sridhar Ramaswamy.
“[Ramaswamy] bills Neeva as an ad-free, private search engine. Results won’t include advertisements, and the company says any information it does collect from users isn’t shared with third parties. That ad-free experience does come with a cost, however: a subscription fee of $5 US per month, after a three-month trial period. Ramaswamy argues that no search engine is truly free, as users end up paying with all the advertisements and affiliate links clogging up search results, making it harder to find the things they actually want.”
That is one way to look at it. We would add that Neeva does have a free version, but naturally hopes users will be enticed to upgrade. Ore notes that, though Neeva emphasizes privacy, it does collect certain user data—like one’s email address, IP address, location data, browser, and OS. The platform uses this information to improve function and performance, states its privacy policy, but promises not to share any of it with third parties.
Next, the write-up takes a look at You.com, a platform that seems tailored to younger audiences. We learn:
“Rather than a mostly-linear list of results sorted in order of relevance or accuracy, You.com displays search results in a grid-like format. It also lets users ‘upvote’ and ‘downvote’ individual results, directly affecting their rankings in future searches. That added flexibility comes at the cost of simplicity, though; The Verge’s Adi Robertson said its layout can appear ‘overwhelming and sort of cluttered’ to anyone used to Google’s linear approach. Co-founder Richard Socher said, however, that he found younger users used to other social media platforms like Instagram or TikTok, which display content in tiles both vertically and horizontally, were able to quickly acclimate themselves to You.com’s unique layout.”
Like Neeva, You.com also emphasizes privacy and refuses to sell user data to advertisers. Can such search platforms really take out Google? Don’t be silly—of course not. But the write-up cites DuckDuckGo as an example of success. That privacy-centric service, launched in 2008, now processes tens of millions of searches daily and employs over 140 workers. Is ad-addicted Google bothered? Probably not. It can well afford to lose such small slices of the search pie and remain decisively in the lead.
Cynthia Murrell, February 23, 2022
Algolia: A New Approach to Relevance?
February 21, 2022
Algolia is a company providing search and retrieval services to a number of companies. A call for résumés provides some interesting assertions about the company, its philosophy, and its goal.
The goal interests me. The posting on the Algolia Web site says:
Our mission is to return relevant search results at all times and give companies the ability to tailor those results to their own specific needs. We are tasked with reimagining a core piece of Algolia’s technology: how search results are ranked. To that end, we are still early in our building, and we are looking for someone who can help us perform experiments and manage the technical aspects of our pilot program, including building clients for users to test our work and tools to evaluate the impact of our changes.
I like the idea of tailoring “search,” which is certainly okay if someone knows what he or she is looking for. I like the idea of ranking because relevance is, to some people, helpful. I like the idea that the company is “early in building.” The right person with the right stuff will make an impact. I like the idea of measuring results, which works reasonably well when the people in the same know that which they need to find.
There are several challenges in delivering or finding better ways to rank search results.
First, today the idea of knowing the corpus and using old-fashioned techniques like precision and recall is not as sexy as capsule network or caps net methods.
Second, users who want to formulate complex search queries like those required to extract semi useful information from Google or a Dataminr feed of social media are rare birds. I heard at one big search outfit that fewer than three percent of queries are a result of complex search statements; for example, site: or filetype:. Serving experts, analysts, and intelligence professionals is different from serving the ingredients for a Sicilian pizza.
Third, the now threadbare truism of lots of data, changing rapidly, and incorporating different content types and a veritable fun house of metadata requires some innovation. So far the best efforts of some bright folks have led to outright failure (Autonomy, Fast Search & Transfer, et al) or recycling endlessly with minor variations the functionality of everyone’s favorite fighter of Amazon, Elastic.
I noted some interesting supportive information in the write up; for instance, the candidate with the right stuff must have grit (the sort of effort required to get an advanced degree from MINES ParisTech or Université Paris Saclay or the toughness required to deal with a wealthy family or a generational link to the Capetians). Other ingredients in the “right stuff” trois étoiles cannelés of Bordeaux:
- Trust
- Care
- Candor
- Humility
I am eager to explore the new approach to relevance. But I harbor an abiding affection for a clear explanation of the content indexed and good old Boolean logic. Snorkels, caps nets, and a 21st century approach to relevance? Meh.
Stephen E Arnold, February 21, 2022
A Google Dork for Everyone
February 21, 2022
In my lectures about open source intelligence for law enforcement and other government professionals, I mention Google Dorks. I won’t go into detail, but the “dork” is a fancy way of saying that a person who is an information professional with a knowledge of specialized commands can get semi-on point results from the online ad outfit. See for example this link. Do Googlers wear T shirts emblazoned with the phrase “Don’t be evil”? I saw such a shirt with the message “Don’t be Google,” but I may have misread.
What’s interesting is that Google Dorking is finding its way into the mainstream of the people who perceive themselves as “experts in online research.” Yep, the expertise is often similar to mastering an automatic teller machine, but that’s possibly a characteristic of our Covid era.
“Google Search Is Dying” has undergone a number of updates. The write up states:
Google still gives decent results for many other categories, especially when it comes to factual information. You might think that Google results are pretty good for you, and you have no idea what I’m talking about. What you don’t realize is that you’ve been self-censoring yourself from searching most of the things you would have wanted to search. You already know subconsciously that Google isn’t going to return a good result.
The punch line is “Google is dying.” Yeah, no kidding. When the wizard from Verity and Yahoo got involved, it was not dying. It was gifted a MOAB (that is the mother of all bombs or a disconnect from a query and stuff like precision and recall).
So what’s the fix?
A Google Dork.
Enter a query and stick “reddit” in the query. The idea is that some entity (bot or humanoid) will have posted more useful, authentic, relevant information on that service. One can be sporty and try wiki at the end of a query as well.
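For those who want the trick spelled out, here is a minimal Python sketch of it. The function names are hypothetical, and the site: variant is my addition rather than something the cited write up recommends; site: is a standard Google operator. The search URL is the ordinary public Google results address.

```python
from urllib.parse import urlencode


def add_source_hint(query: str, source: str = "reddit") -> str:
    """Append a plain keyword such as 'reddit' or 'wiki' to the query."""
    return f"{query.strip()} {source}"


def restrict_to_site(query: str, domain: str) -> str:
    """Stricter variant (an assumption, not from the write up): use site:."""
    return f"{query.strip()} site:{domain}"


def search_url(query: str) -> str:
    """Build a Google results URL for the finished query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    hinted = add_source_hint("best hiking boots")           # keyword at the end
    wiki = add_source_hint("capetian dynasty", "wiki")      # the sporty option
    strict = restrict_to_site("best hiking boots", "reddit.com")
    for q in (hinted, wiki, strict):
        print(search_url(q))
```

The keyword version only nudges the ranking toward pages that mention the word; the site: version limits the results to one domain.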
Google Dorking for everyone, even the self-proclaimed experts in online information search and retrieval! The challenge is that Google advertising is pumping cash, and that plus the bonuses for senior management is what makes Google search the outstanding service it is.
Stephen E Arnold, February 21, 2022
Google Observations: A Hoot and a Maybe Bit Frightening If Statements Are Accurate
February 4, 2022
I read an item on Hacker News which “tells” about an issue/observation. The comment points out that certain queries generate links on a search result page which point to questionable content. Interesting, but news? Not in Harrods Creek, the technology centroid of the world.
What is quite fascinating in the short article? The comments. Yep, the comments. There are quite a few gems scattered in the trollite outcrops.
Here are a few examples with the “names” of the entity generating the output. Remember. I am just sharing. These are not my observations, comments, or ideas. In fact, we think the current version of the Google is a heck of a lot better than Version 2.0 which I wrote a monograph about many years ago.
- “nobody gets promoted in Google for doing their job well. Only for inventing a new job to do.” – reaperducer
- “It was my mistake. I trusted Google.” – Silisili
- “I work for Google Search. We are looking into this.” – SullilvanDanny
- “My wife recently received, in her inbox, a spoofed email from her own email address on Gmail.” – andrewmcwatters
- “There is no end to The Greed.” – JayTaylor
Stephen E Arnold, February 4, 2022
Mike Lynch: Going to America?
January 29, 2022
I noted the Beeb’s article “Mike Lynch: Priti Patel Approves Extradition of Autonomy Founder.” The write up states:
Home Secretary Priti Patel has approved the extradition of a British tech tycoon to the US to face criminal fraud charges. The decision comes after Mike Lynch, the founder of Autonomy, lost a multibillion-dollar fraud action in London on Friday.
Welp.
A Home Office spokesperson said: “Under the Extradition Act 2003, the secretary of state must sign an extradition order if there are no grounds to prohibit the order being made. Extradition requests are only sent to the home secretary once a judge decides it can proceed after considering various aspects of the case. On 28 January, following consideration by the courts, the extradition of Dr Michael Lynch to the US was ordered.”
The Beeb’s write up includes some biographical information:
Cambridge graduate Mr Lynch, 56, built Autonomy up to be one of the top 100 UK public companies. In 2006, he was awarded an OBE for services to enterprise. A fellow of the Royal Society, Mr Lynch, who lives in Suffolk, previously advised the government and sat on the boards of the British Library and the BBC.
The brief summary omits some interesting information; for example, the Bayesian influence and the architecture of a system which would influence decades of content processing systems. More information is available on my Xenky.com site at this link: https://bit.ly/3IQTwgz
Stephen E Arnold, January 29, 2022
Hewlett Packard Autonomy: A Decision of Sorts
January 28, 2022
I read “HPE Has Substantially Succeeded in Its £3.3bn Fraud Trial against Autonomy’s Mike Lynch – Judge.” The write up reports that buyer beware is not a legal argument. It appears that more litigation awaits Mike Lynch in the US. I noted one interesting statement in the very good summary of the UK legal activities:
Autonomy, which told the market it was a “pure play” software company, accounted for its substantial hardware sales by burying them inside its sales and marketing revenue instead of breaking them out separately.
I am delighted I am not an attorney. I am a retired knowledge worker who has some familiarity with the general technology used by Autonomy and I did some work for the company years ago.
My uninformed view is that Hewlett Packard was looking for a home run when Léo Apotheker (formerly SAP and owner of the TREX search technology) ignored realities about the search and content processing revenue ceilings. Hewlett Packard, it seems to me, pushed forward, ignored inputs, and paid what might be called the Ford Bronco surcharge.
What happens when a used vehicle sales professional explains the sidewalk guarantee to the buyer? Nothing. Buyers often do not do their homework, are too excited about the deal, or just don’t care about the future until it arrives. Oh, oh. Are there lemon laws for content processing platforms? I suppose the question will be answered US style in the coming months.
Stephen E Arnold, January 28, 2022
How about That Subscription Web Search Model?
January 24, 2022
Former Googlers Sridhar Ramaswamy and Vivek Raghunathan are refining their paid, privacy-centric search platform Neeva. We have followed this development from the 2020 beta through the 2021 official launch. Now we learn Neeva has added a free tier from The Next Web’s piece, “How a Couple of Ex-Googlers Are Trying to Fix What’s Wrong with Search Engines.” It appears not enough users are (yet) willing to pay the low, low price of $4.95 per month for search and the team is looking to upsell about 5% of those who sign on for free. It might be a good bet—Ramaswamy reports that a third of folks who sampled the free trial have subscribed. Even he was surprised users cited the peaceful, ad-free screen as their favorite feature. Reporter Ivan Mehta writes:
“[Neeva] will offer ad-free search with customizations, and integration to accounts such as Gmail, Microsoft Office, and Dropbox. People who’re paying for Neeva’s services will get all of this, a leading third-party VPN and a password manager service, and advanced features, like a monthly Q&A. As far as search engine features go, Neeva offers customizations, such as being able to see particular sites in results more or less. You can also ‘skip’ an ecommerce site in results, or get the whole recipe for a dish without having to visit a site. What’s more, the new search engine lets your look through your email right from the search bar. And if you install Neeva’s extension, it also blocks ad trackers that are collecting your browsing data. Last October, Neeva also launched a 1-click Fasttap search geared towards mobile where users just need to type a phrase to get accurate search results. It’s like Google auto-complete on steroids.”
The write-up includes a few screenshots of Neeva features in action. Regarding the how-to behind it all, Mehta tells us:
“On the technological side, while Neeva is aggregating some search results from Bing, the company is building its own crawler and looking at billions of pages every day. But as Raghunathan pointed out in his FastCompany interview earlier this month, crawling the web to create a new index while maintaining privacy standards is hard.”
Perhaps if anyone is up to the task, it is these two Xooglers. As of yet, Neeva is only available in the US, but the company hopes to become global. The plan is to expand into India and Western Europe “soon.” One tactic it is using to compete against the likes of privacy-focused DuckDuckGo and Brave is its partnership with news rating agency NewsGuard, which is helping it assess the accuracy of information. We wonder whether such features plus the free-tier offering will help Neeva reach its stated goal: to become the primary search engine for millions of privacy-centered users in the next two years.
Are there monetization options? The Point team is available to offer some ideas. Just write benkent2020 at yahoo dot com. We’ve been there and know the subscription method was a loser decades ago.
Cynthia Murrell, January 24, 2022
New Search Platform Focuses on Protecting Intellectual Property
January 21, 2022
Here is a startup offering a new search engine, now in beta. Huski uses AI to help companies big and small reveal anyone infringing on their intellectual property, be it text or images. It also promises solutions for title optimization and even legal counsel. The platform was developed by a team of startup engineers and intellectual property litigation pros who say they want to support innovative businesses from the planning stage through protection and monitoring. The Technology page describes how the platform works:
“* Image Recognition: Our deep learning-based image recognition algorithm scans millions of product listings online to quickly and accurately find potentially infringing listings with images containing the protected product.
* Natural Language Processing: Our machine learning algorithm detects infringements based on listing information such as price, product description, and customer reviews, while simultaneously improving its accuracy based on patterns it finds among confirmed infringements.
* Largest Knowledge Graph in the Field: Our knowledge graph connects entities such as products, trademarks, and lawsuits in an expansive network. Our AI systems gather data across the web 24/7 so that you can easily base decisions on the most up-to-date information.
* AI-Powered Smart Insights: What does it mean to your brands and listings when a new trademark pops out? How about when a new infringement case pops out? We’ll help you discover the related insights that you may never know otherwise.
* Big Data: All of the above intelligence is being derived from the data universe of the eCommerce, intellectual property, and trademark litigation. Our data engine is the biggest ‘black hole’ in that universe.”
Founder Guan Wang and his team promise a lot here, but only time will tell if they can back it up. Launched in the challenging year of 2020, Huski.ai is based in Silicon Valley but it looks like it does much of its work online. The niche is not without competition, however. Perhaps a Huski will cause the competition to run away?
Cynthia Murrell, January 21, 2022
Search Quality: 2022 Style
January 11, 2022
I read the interesting “Is Google Search Deteriorating? Measuring Google’s Search Quality in 2022?” The approach is different from what was the approach used at the commercial database outfits for which I worked decades ago. We knew what our editorial policy was; that is, we could tell a person exactly what was indexed, how it was indexed, how classification codes were assigned, and what the field codes were for each item in our database. (A field code for those who have never encountered the term means an index term which disambiguates a computer terminal from an airport terminal.) When we tested a search engine — for example, a touch of the DataStar systems — we could determine the precision and recall of the result set. This was math, not an opinion. Yep, we had automatic indexing routines, but we relied primarily on human editors and subject matter experts with a consultant or two tossed in for good measure. (A tip of the Silent 700 paper feed to you, Betty Eddison.)
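For readers who have never run such a test, the math is short. Below is a minimal Python sketch, assuming a human editor has already judged which records in the database are relevant to the query. The record identifiers are made up for illustration, and the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items in the database
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical result set and hypothetical editorial judgments.
    retrieved = ["rec1", "rec2", "rec3", "rec4"]
    relevant = ["rec2", "rec4", "rec7"]
    p, r = precision_recall(retrieved, relevant)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

The catch is the second argument: recall can be computed only when someone knows what is in the database, which is exactly what a documented editorial policy provides.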
The cited article takes a different approach. It is mostly subjective. The result of the analysis is that Google is better than Bing. Here’s a key passage:
So Google does outperform Bing (the difference is statistically significant)…
Okay, statistics.
Several observations:
First, I am not sure either Bing’s search team or Google’s search team knows what is in the indexes at any point in time. I assume someone could look, but I know from first hand experience that the young wizards are not interested in the scope of an index. The interest is reducing the load or computational cost of indexing new content objects and updating certain content objects, discarding content domains which don’t pay for their computational costs, and similar MBA inspired engineering efficiencies. Nobody gets a bonus for knowing what’s indexed, when, why, and whether that index set is comprehensive. How deep does Google go into unloved Web sites like the Railway Retirement Board?
Second, without time benchmarks and hard data about precision and recall, the subjective approach to evaluating search results misses the point of Bing and Google. These are systems which must generate revenue. Bing has been late to the party, but the Redmond security champs are giving ad sales the old college drop out try. (A tip of the hat to MSFT’s eternal freshman, Bill Gates, too.) The results which are relevant are the ones that by some algorithmic cartwheels burn through the ad inventory. Money is what matters. Understanding user queries, supporting Boolean logic, and including date and time information about the content object and when it was last indexed are irrelevant. In one meeting, I can honestly say no one knew what I was talking about when I mentioned “time” index points.
Third, there are useful search engines which should be used as yardsticks against which to measure the Google and the smaller pretender, Bing. Why not include Swisscows.ch or Yandex.ru or Baidu.com or any of the other seven or eight Web centric and no charge systems? I suppose one could toss in the Google killer Neeva and a handful of metasearch systems. Yep, that’s work. Set up standard queries. Capture results. Analyze those results. Calculate result overlap. Get subject matter experts to evaluate the results. Do the queries at different points in time for a period of three months or more, etc., etc. This is probably not going to happen.
Fourth, what has been filtered? Those stop word lists are fascinating and they make it very difficult to find certain information. With traditional libraries struggling for survival, where is that verifiable research process going to lead? Yep, ad centric, free search systems. It might be better to just guess at some answers.
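The “calculate result overlap” step in the third observation is the easy part of the yardstick work. Here is a minimal Python sketch, assuming the top results for one standard query have already been captured from each engine. The URLs are placeholders rather than real captured results, the function names are hypothetical, and Jaccard overlap is one reasonable measure, not the only one.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared results divided by all distinct results across both lists."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pairwise_overlap(results_by_engine):
    """Overlap score for every pair of engines, keyed by the sorted name pair."""
    engines = sorted(results_by_engine)
    return {
        (x, y): jaccard(results_by_engine[x], results_by_engine[y])
        for x, y in combinations(engines, 2)
    }


if __name__ == "__main__":
    # Placeholder captures for a single query; real runs would store these
    # per query and per date so the test can be repeated over three months.
    captured = {
        "google": ["example.com/a", "example.com/b", "example.com/c"],
        "bing": ["example.com/b", "example.com/c", "example.com/d"],
        "yandex": ["example.com/c", "example.com/e", "example.com/f"],
    }
    for pair, score in pairwise_overlap(captured).items():
        print(pair, f"{score:.2f}")
```

Overlap says only how much the engines agree with one another. Whether the shared results are any good still requires the subject matter experts.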
Net net: Web search is not very good. It never has been. For-fee databases are usually an afterthought if thought of at all. It is remarkable how many people pass themselves off as open source intelligence experts, expert online researchers, or digital natives able to find “anything” using their mobile phone.
Folks, most people are living in a cloud of unknowing. Search results shape understanding. A failure of search just means that users have zero chance to figure out if a result from a free Web query is much more than Madison Avenue, propaganda, crooked card dealing, or some other content injection goal.
That’s what one gets when the lowest cost methods to generate the highest ad revenue are conflated with information retrieval. But, hey, you can order a pizza easily.
Stephen E Arnold, January 11, 2022
Cherche: A Neural Search Pipeline
January 10, 2022
For fans of open source search, Cherche is available. The GitHub write up states:
Cherche is meant to be used with small to medium sized corpora. Cherche’s main strength is its ability to build diverse and end-to-end pipelines.
The “neural search” module includes ElasticSearch. The programming team for Cherche consists of Raphaël Sourty and François-Paul Servant. Beyond Search has not fired up the system and run it against our test corpus. We did have in our files a paper called “Knowledge Base Embedding by Cooperative Knowledge Distillation.” That paper states:
Given a set of KBs, our proposed approach KDMKB, learns KB embeddings by mutually and jointly distilling knowledge within a dynamic teacher-student setting. Experimental results on two standard datasets show that knowledge distillation between KBs through entity and relation inference is actually observed. We also show that cooperative learning significantly outperforms the two proposed baselines, namely traditional and sequential distillation.
The idea is that instead of retrieving strings, broader tags (concepts and classifications) appear to provide an advantage, pushing “beyond” old school search.
Stephen E Arnold, January 10, 2022