Interesting Number: Apple Sells Access

September 3, 2021

I read “Google to Pay Apple $15 Billion to Remain Default Safari Search Engine in 2021.” The write up states:

It’s long been known that Google pays Apple a hefty sum every year to ensure that it remains the default search engine on iPhone, iPad, and Mac. Now, a new report from analysts at Bernstein suggests that the payment from Google to Apple may reach $15 billion in 2021, up from $10 billion in 2020. In the investor note, seen by Ped30, Bernstein analysts are estimating that Google’s payment to Apple will increase to $15 billion in 2021, and to between $18 billion and $20 billion in 2022.

Apple and Google care about their users and their “experience.” That’s a mellifluous thing to say, particularly in an anti-trust deposition.

Let’s put the allegedly accurate number in context:

The metasearch engine DuckDuckGo’s estimated revenues for 2020 may be in the $70 million range. Google’s reported $15 billion payment is in the neighborhood of 200 times that figure ($15,000 million ÷ $70 million ≈ 214).

Stephen E Arnold, September 3, 2021

Wiki People: One Cannot Find Online Information If It Is Censored

September 2, 2021

Women have borne the brunt of erasure from history, but thanks to web sites like Wikipedia, their stories are shared more than ever. There is a problem with Wikipedia, though, says CBC in the article “Canadian Nobel Scientist’s Deletion From Wikipedia Points To Wider Bias, Study Finds.” Wikipedia is the largest, most comprehensive, and most collaborative encyclopedia in human history. It is maintained by thousands of volunteer editors, who curate the content, verify information, and delete entries.

There are different types of Wikipedia editors. One type is an “inclusionist,” an editor who takes a broad view of what to include in Wikipedia. The second type is a “deletionist,” an editor who holds content to stricter standards. American sociologist Francesca Tripodi researched the pages editors deleted and discovered that women’s pages are deleted more often than men’s. Tripodi found that women’s pages drew 25% of all deletion recommendations even though they make up only 19% of the profiles.

Experts say it is either gender bias or a notability problem. Notability is the gauge Wiki editors use to determine whether a topic deserves a page, and they weigh it against reliable sources. The notability standard, Tripodi explained, leads to gender bias because there is less published information about women. It also does not help that most editors are men, though there are attempts to add more women:

“Over the years, women have tried to fix the gender imbalance on Wikipedia, running edit-a-thons to change that ratio. Tripodi said these efforts to add notable women to the website have moved the needle — but have also run into roadblocks. ‘They’re welcoming new people who’ve never edited Wikipedia, and they’re editing at these events,’ she said. ‘But then after all of that’s done, after these pages are finally added, they have to double back and do even more work to make sure that the article doesn’t get deleted after being added.’”

Unfortunately, women editors complain they must do extra work to make sure their profiles are verifiable and stay published. The Wikimedia Foundation acknowledges the lack of women’s pages, saying it reflects the world’s gender biases. The foundation, however, is committed to increasing the number of women’s pages and women editors. The number of women editors has increased over 30% in the past year.

That is the problem when there is a lack of verifiable data about women or anyone else erased from history by bias. If there is no information about them, they cannot be found, even by trained research librarians like me. Slick method, right?

Whitney Grace, September 2, 2021

Amazon Search: Just Outstanding

September 2, 2021

Authors at Paste Magazine are dedicated to assembling lists of the best streaming content from Netflix, Hulu, Amazon Prime, and other services. They know almost as much about these content libraries as their developers. The title of Paste Magazine’s article, “Amazon Prime Video’s Library Is Genuinely Impossible To Browse,” says it all.

It is notoriously difficult to browse Amazon Prime’s content library, a problem noted as far back as 2018. Amazon Prime’s library contains a great deal of content, much of it considered unwatchable. The only way to locate anything is to search by its proper name, so users who want to browse films the way they browsed the physical libraries and video stores of yore are abandoned.

Amazon Prime has also hidden its search function; instead, it wants users to work around this roadblock:

It quickly becomes apparent that there is no obvious way to view that full list of sci-fi movies, suggesting that Amazon doesn’t want consumers to be able to easily find that kind of information—its user experience is built around you choosing one of the small handful of suggested films, or knowing in advance what you want to see and then specifically searching it out. However, it is possible to see the full list—in order for it to display, you just have to click on any specific sci-fi film, look at the movie’s genre tags, and click on the words “science fiction” once again.

The search function is worse than what was available in a medieval scriptorium. When users return to certain genre pages and browse the supposedly complete list, the same twenty-one movies continuously reload.

Amazon Prime has thousands of titles and is designed by a high-tech company, yet it cannot fix its search function? Why does Amazon, an important company that is shaking up the film and television industry, not offer its users the best of the best when it comes to search? Amazon built A9, it sucked in Lucid Imagination “experts,” and it intruded on Elasticsearch territory. And now search does not work the way users expect. Has another high-tech outfit become customer hostile or just given up making search useful?

Whitney Grace, September 2, 2021

Semantic: Scholar and Search

September 1, 2021

The new three musketeers could be named Semantic, Scholar, and Search. What’s missing is a digital d’Artagnan. What are the three valiant mousquetaires up to? Fixing search for scholarly information.

To learn why smart software goes off the rails, navigate to “Building a Better Search Engine for Semantic Scholar.” The essay documents how a group of guardsmen fixed up a search system that is sort of intelligent and sort of sensitive to language ambiguities like “cell”: a biological cell or a “cell” in wireless call admission control. Yep, English and other languages require context to figure out what someone might be trying to say. Less tricky for bounded domains, but quite interesting for essay writing or tweets.
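
As a toy illustration of why context matters when “cell” shows up in a query (the sense lists below are invented for this sketch, not taken from the Semantic Scholar system):

```python
# Toy word-sense disambiguation for "cell": pick the sense whose
# hint words overlap most with the rest of the query. The hint
# lists are invented for illustration.

SENSES = {
    "biology":  {"biology", "membrane", "mitosis", "tissue", "organism"},
    "wireless": {"wireless", "admission", "handoff", "network", "tower"},
}

def disambiguate(query: str) -> str:
    context = set(query.lower().split()) - {"cell"}
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(disambiguate("cell membrane structure"))         # biology
print(disambiguate("cell admission control network"))  # wireless
```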

Please, read the article because it makes clear some of the manual interventions required to make search deliver objective, on point results. The essay is important because it talks about issues most search and retrieval “experts” prefer to keep under their kepis. Imagine what one can do with the knobs and dials in this system to generate non-objective and off point results. That would be exciting in certain scholarly fields I think.

Here are some quotes which suggest that Fancy Dan algorithmic shortcuts, like those enabled by Snorkel-type solutions, have their limits; for example:

Quote A

The best-trained model still makes some bizarre mistakes, and posthoc correction is needed to fix them.

Meaning: Expensive human and maybe machine processes are needed to get the model outputs back into the realm of mostly accurate.
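
A minimal sketch of what such a posthoc correction layer can look like; the rules and field names here are hypothetical, not the Semantic Scholar team’s actual fixes:

```python
# Hypothetical posthoc correction layer: hand-written rules that
# adjust model scores after the learned ranker has run.

def posthoc_correct(query: str, results: list) -> list:
    """Re-rank model output with manual fixes for known bad cases."""
    corrected = []
    for hit in results:
        score = hit["score"]
        # Rule 1: an exact title match should never rank below a
        # partial match, whatever the model says.
        if hit["title"].lower() == query.lower():
            score += 10.0
        # Rule 2: demote entries with empty abstracts, which a model
        # can rank highly on title features alone.
        if not hit.get("abstract"):
            score -= 1.0
        corrected.append({**hit, "score": score})
    return sorted(corrected, key=lambda h: h["score"], reverse=True)
```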

Quote B

Here’s another:

Machine learning wisdom 101 says that “the more data the better,” but this is an oversimplification. The data has to be relevant, and it’s helpful to remove irrelevant data. We ended up needing to remove about one-third of our data that didn’t satisfy a heuristic “does it make sense” filter.

Meaning: Rough sets may be cheaper to produce but may be more expensive in the long run. Why? The outputs are just wonky, at odds with what an expert in a field knows, or just plain wrong. Does this make you curious about black box smart software? If not, it should.
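
For illustration only, a crude “does it make sense” filter over click-derived training pairs might look like the sketch below; the heuristic and the sample data are invented, since the essay does not publish its exact rules:

```python
# Invented "does it make sense" filter: drop training pairs where the
# clicked title shares no tokens with the query, treating such clicks
# as noise rather than relevance signal.

def makes_sense(query: str, clicked_title: str) -> bool:
    return bool(set(query.lower().split()) & set(clicked_title.lower().split()))

training_pairs = [
    ("cell biology", "The Molecular Biology of the Cell"),
    ("call admission control", "Call Admission Control in Wireless Networks"),
    ("deep learning", "Recipes for Sourdough Bread"),  # noise: no overlap
]

clean = [pair for pair in training_pairs if makes_sense(*pair)]
print(clean)  # the sourdough pair is filtered out
```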

Quote C

And what about this statement:

The model learned that recent papers are better than older papers, even though there was no monotonicity constraint on this feature (the only feature without such a constraint). Academic search users like recent papers, as one might expect!

Meaning: The three musketeers like their information new, fresh, and crunchy. From my point of view, this is a great reason to delete the backfiles. Even though “old” papers may contain high-value information, the new breed wants recent papers. Give ‘em what they want and save money on storage and other computational processes.
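
Gradient-boosted rankers such as LightGBM let engineers declare these monotonicity constraints per feature; a minimal sketch, with invented features and random stand-in data:

```python
# Sketch of per-feature monotonicity constraints in a LightGBM ranker.
# Features, labels, and groups are invented stand-ins.
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 3)             # text match, citation count, recency
y = np.random.randint(0, 3, size=100)  # graded relevance labels
groups = [10] * 10                     # ten queries, ten results each

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    # 1 = the score may only rise as the feature rises; 0 = unconstrained,
    # the "recency" case the quote describes.
    monotone_constraints=[1, 1, 0],
)
ranker.fit(X, y, group=groups)
```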

Net Net

My hunch is that many people think search is solved. What’s the big deal? Everything is available on the Web. Free Web search is great. But commercial search systems like LexisNexis and Compendex, with for-fee content, are chugging along.

A free and open source approach is a good concept. The trajectory of the field points to a need for continued research and refinement. The three musketeers might find themselves replaced with a more efficient and unmanageable force: smart software trained by the Légion étrangère, drunk on digital pastis.

Stephen E Arnold, September 1, 2021

Enterprise Search: What Is Missing from This List?

August 31, 2021

I got a wild and wooly announcement from something called The Market Gossip. The message was that a new report about enterprise search has been published. I had never heard of the outfit (Orbis Research) in Dallas.

Take a look at this list of vendors covered in this global predictive report:

[Image: list of enterprise search vendors covered in the report]

Notice anything interesting? I do. First, Elastic (commercial and open source) is not in the list. Second, the Algolia system (a distant cousin of Dassault Exalead) is not mentioned. (Weird, because the company got another infusion of cash.) Third, the name of LucidWorks (an open source search recycler) is misspelled. Fourth, the inclusion of MarkLogic is odd because the company offers an XML data management system. Sure, one can create a search solution, but that’s like building a real Darth Vader out of Lego blocks. Interesting but of limited utility. Fifth, the inclusion of SAP. Does the German outfit still pitch the long-in-the-tooth TREX system? Sixth, Microsoft offers many search systems. Which, I wonder, is the one explored?

Net net: Quite a thorough research report. Too bad it is tangential to where search and retrieval in the enterprise is going. If the report were generated by artificial intelligence, the algorithm should be tweaked. If humans cooked up this confection, I am not sure what to suggest. Maybe starting over?

Stephen E Arnold, August 31, 2021

Algolia: Now the Need for Sustainable, Robust Revenue Comes

August 27, 2021

We long ago decided Algolia was an outfit worth keeping an eye on. We were right. Now Pulse 2.0 reports, “Algolia: $150 Million Funding and $2.25 Billion Valuation.” The company closed recently on the Series D funding, bringing its total funding to $315 million. Putting that sum to shame is the hefty valuation touted in the headline. Can the firm live up to expectations? Reporter Annie Baker writes:

“This latest funding round reflects Algolia’s hyper growth fueled by demand for ‘building block’ API software that increases developer productivity, the growth in e-commerce, and digital transformation. And this additional investment enables Algolia to scale and serve the increased demand for the company’s Search and Recommendations products as well as fuel the company’s continued product expansion into adjacent markets and use-cases. … This new funding round caps a landmark year that saw significant growth and product innovation. And Algolia launched with the goal of creating fast, instant, and relevant search and discovery experiences that surfaced the desired information quickly. Earlier this year, the company had announced its new vision for dynamic experiences, advancing beyond search to empower businesses to quickly predict a visitor’s intent on their digital property in real time, in the session, and in the moment. And the business, armed with this visitor intent, can surface dynamic content in the form of search results, recommendations, offers, in-app notifications, and more — all while respecting privacy laws and regulations.”

Baker notes Algolia’s approach is a departure from opaque SaaS solutions and monolithic platforms. Instead, the company works with developers to build dynamic, personalized applications using its API platform. Over the last year and a half, Algolia also added seven new executives to its roster. Headquartered in San Francisco, the company was founded in 2012.

Cynthia Murrell, August 27, 2021

Google Fiddled Its Magic Algorithm. What?

August 19, 2021

This story is a hoot. Google, as I recall, has a finely tuned algorithm. It is tweaked, tuned, and tailored to deliver on point results. The users benefit from this intense interest the company has in relevance, precision, recall, and high-value results. Now a former Google engineer, a Xoogler in my lingo, has shattered my beliefs. Night falls.

Navigate to “Top Google Engineer Abandons Company, Reveals Big Tech Rewrote Algos To Target Trump.” (I love the word “algos”. So colloquial. So in.) I spotted this statement:

Google rewrote its algorithms for news searches in order to target #Trump, according to @Perpetualmaniac, #Google whistleblower and author of the new book, “Google Leaks: An Expose of Big Tech Censorship.”

The write up states:

“As a senior engineer at Google for many years, Zach was aware of their bias, but watched in horror as the 2016 election of Donald Trump seemed to drive them into dangerous territory. The American ideal of an honest, hard-fought battle of ideas — when the contest is over, shaking hands and working together to solve problems — was replaced by a different, darker ethic alien to this country’s history,” the description adds. Vorhies said he left Google in 2019 with 950 pages of internal documents and gave them to the Justice Department.

Wowza. Is this an admission of unauthorized removal of a commercial enterprise’s internal information?

The sources for this allegation of algorithm fiddling are interesting and possibly a little swizzly.

I am shocked.

The Google, fiddling with precision, recall, objectivity, and who knows what else? Why? My goodness. What has happened to cause a former employee to offer such shocking assertions?

The algos are falling on my head and nothing seems to fit. Crying’s not for me. Nothing’s worrying me. Because Google.

Stephen E Arnold, August 19, 2021

Milvus and Mishards: Search Marches and Marches

August 13, 2021

I read “How We Used Semantic Search to Make Our Search 10x Smarter.” I am fully supportive of better search. Smarter? Maybe.

The write up comes from Zilliz, which describes itself this way: the developer of Milvus, “the world’s most advanced vector database, to accelerate the development of next generation data fabric.”

The system has a search component, which is Elasticsearch. The secret sauce behind the 10x claim is a group of value-adding features; for instance, similarity and clustering.

The idea is that a user enters a word or phrase and the system gets related information without entering a string of synonyms or a particularly precise term. I was immediately reminded of Endeca without the MBAs doing manual fiddling and the computational burden the Endeca system and method imposed on constrained data sets. (Anyone remember the demo about wine?)
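
A stripped-down sketch of the idea, using plain cosine similarity over invented five-dimensional vectors rather than Milvus itself (which does the same thing at scale with approximate nearest neighbor indexes):

```python
# Toy similarity search: embed documents and the query as vectors,
# then rank documents by cosine similarity to the query.
import numpy as np

doc_vectors = np.array([
    [0.9, 0.1, 0.0, 0.2, 0.1],   # document about biology
    [0.1, 0.8, 0.3, 0.0, 0.2],   # document about networking
    [0.8, 0.2, 0.1, 0.3, 0.0],   # document about medicine
])
query = np.array([0.85, 0.15, 0.05, 0.25, 0.05])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query, doc) for doc in doc_vectors]
print(np.argsort(scores)[::-1])  # document indexes, nearest first
```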

This particular write up includes some diagrams which reveal how the system operates. The diagrams, like the one shown below, are clear.

[Image: architecture diagram from the Zilliz write up]

The idea is “similarity search.” If you want to know more, navigate to https://zilliz.com. Ten times smarter. Maybe.

Stephen E Arnold, August 13, 2021

Algolia and Its View of the History of Search: Everyone Has an Opinion

August 11, 2021

Search is similar to love, patriotism, and ethical behavior. Everyone has a different view of the nuances of meaning in a specific utterance. Agree? Let’s assume you cannot define one of these words in a way that satisfies a professor from a mid-tier university teaching a class to 20 college sophomores who signed up for something to do with Western philosophy: Post Existentialism. Imagine your definition. I took such a class, and I truly did not care. I wrote down the craziness the brown-clad PhD provided, got my A, and never gave that stuff a thought. And you, gentle reader, are you prepared to figure out what an icon in an ibabyrainbow chat stream “means”? We captured a stream for one of my lectures to law enforcement in which the streamer says, “I love you.” Yeah, right.

Now we come to “Evolution of Search Engines Architecture – Algolia New Search Architecture Part 1.” The write up explains finding information and its methods through the lens of Algolia, a venture-backed firm. Search, which is never defined, characterizes the level of discourse about findability. The write up explains an early method which permitted a user to query by key words. This worked like a champ as long as the person doing the search knew what words to use, like “nuclear effects modeling”.
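
A minimal sketch of that early key word model, which is little more than an inverted index with AND semantics:

```python
# Minimal inverted index: the "query by key words" model early
# engines used. Document IDs are collected per token, then the
# sets are intersected (AND semantics).
from collections import defaultdict

docs = {
    1: "nuclear effects modeling for reactor design",
    2: "modeling consumer effects of advertising",
    3: "nuclear medicine imaging",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query: str) -> set:
    """A document must contain every query token to match."""
    token_sets = [index[token] for token in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

print(search("nuclear effects modeling"))  # {1}
```

The limitation is exactly the one just noted: the user has to know the right words in advance.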

The big leap was faster computers and clever post-Verity methods of getting distributed indexes to mostly work. I want to mention that Exalead (which may have had an informing role in Algolia’s technical trajectory) was a benchmark system. But, alas, key words are not enough. The Endeca facets were needed. Because humans had to do the facet identification, the race was on to get smart software to do a “good enough” job so old school commercial database methods could be consigned to a small room in the back of a real search engine outfit.
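
Faceting, for illustration, is mostly counting metadata values across the current result set so the interface can offer filters; a toy sketch with invented wine data:

```python
# Toy Endeca-style facet computation: count metadata values across
# the current result set so the UI can offer filter links.
from collections import Counter

results = [
    {"title": "Pinot Noir 2018", "region": "Burgundy", "type": "red"},
    {"title": "Chablis 2020", "region": "Burgundy", "type": "white"},
    {"title": "Barolo 2016", "region": "Piedmont", "type": "red"},
]

def facet(results: list, field: str) -> Counter:
    return Counter(hit[field] for hit in results)

print(facet(results, "region"))  # Counter({'Burgundy': 2, 'Piedmont': 1})
print(facet(results, "type"))    # Counter({'red': 2, 'white': 1})
```

The expensive part was the human facet identification mentioned above; the counting is the easy bit.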

Algolia includes a diagram of the post-Alta Vista, post-Google world. The next big leap was scaling the post-Google world. What’s interesting is that, in my experience, most search problems involve processing smaller collections of information containing disparate content types. What’s this mean? When were you able to use a free Web search system or an enterprise search system like Elastic or Yext to retrieve text, audio, video, engineering drawings and their associated parts data, metadata from surveilled employee E2EE messages, and TikTok video résumés or the wildly entertaining puff stuff on LinkedIn? The answer is, and probably will be for the foreseeable future, “No.” And what about real time data, or the content on a sales person’s laptop with the changed product features and customer-specific pricing? Oh, right. Some people forget about that. Remember: I am talking about a “small” content set, not the wild and crazy Internet indexes. Where are those changed files on the Department of Energy Web site? Hmmm.

The fourth part of the “evolution” leaps to keeping cloud-centric, third-party hosted systems chugging along. Have you noticed the latency when using the OpenText cloud system? What about the display of thumbnails on YouTube? What about retrieving a document from a content management system before lunch, only to find that the system reports, “Document not found”? Yeah, but. Okay, yeah but nothing.

The final section of the write up struck me as a knee slapper. Algolia addresses the “current challenges of search.” Okay, and what are these from the Algolia point of view? The main points have to do with using a cloud system to keep the service up and running without trashing response time. That’s okay, but without a definition of search, fixes like separating search and indexing may not be the architectural solution. One example is processing streams of heterogeneous data in real time. This is a big thing in some circles, and highly specialized systems are needed to “make sense” of what’s rushing into a system. Now means now, not a latency-centric method which has remained largely unchanged for — what? — maybe 50 years.

What is my view of “search”? (If you are a believer that today’s search systems work, stop reading.) Here you go:

  1. One must define search; for example, chemical structure search, code search, HTML content search, video search, and so on. Without a definition, explanations are without context and chock full of generalizations.
  2. Search works when the content domain is “small” and clearly defined. A one-size-fits-all approach to content is pretty much craziness, regardless of how much money an IPO’ed or SPAC’ed outfit generates.
  3. The characteristic of the search engines my team and I have tested over the last — what is it now, 40 or 45 years — is that whatever system one uses is “good enough.” The academic calculations mean zero when an employee cannot locate the specific item of information needed to deal with a business issue, or a student wants to locate a source for a statement about voter fraud. Good enough is state of the art.
  4. The technology of search is like a 1962 Corvette. It is nice to look at but terrible to drive.

Net net: Everyone is a search expert now. Yeah, right. Remember: The name of the game is sustainable revenue, not precision and recall, high value results, or the wild and crazy promise that Google made for “universal search”. Ho ho ho.

Stephen E Arnold, August 11, 2021

DuckDuckGo Produces Privacy Income

August 10, 2021

DuckDuckGo advertises that it protects user privacy and does not put targeted ads in search results. Despite its small size, protecting user privacy makes DuckDuckGo a viable alternative to Google. TechRepublic delves into DuckDuckGo’s profits and how privacy is a big money maker in the article “How DuckDuckGo Makes Money Selling Search, Not Privacy.” DuckDuckGo has had profitable margins since 2014 and made over $100 million in 2020.

Google, Bing, and other companies interested in selling personal data say that it is a necessary evil in order for search and other services to work. DuckDuckGo says that is not true. The company’s CEO Gabriel Weinberg said:

“It’s actually a big myth that search engines need to track your personal search history to make money or deliver quality search results. Almost all of the money search engines make (including Google) is based on the keywords you type in, without knowing anything about you, including your search history or the seemingly endless amounts of additional data points they have collected about registered and non-registered users alike. In fact, search advertisers buy search ads by bidding on keywords, not people….This keyword-based advertising is our primary business model.”

Weinberg continued that search engines do not need to track nearly as much personal information as they do in order to personalize customer experiences or make money. Search engines and other online services could limit the amount of user data they track and still generate a profit.
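
A toy sketch of the difference: keyword-based ad selection needs only the query, never a user profile. The advertisers and bids below are invented:

```python
# Toy keyword ad auction in the model Weinberg describes: the ad is
# chosen from the query's own terms, with no user profile involved.

ad_bids = {
    "mattress": [("SleepCo", 2.50), ("DreamBeds", 1.75)],
    "laptop":   [("ByteMart", 3.10)],
}

def pick_ad(query: str):
    """Return the highest-bidding ad whose keyword appears in the query."""
    candidates = [
        (advertiser, bid)
        for token in query.lower().split()
        for advertiser, bid in ad_bids.get(token, [])
    ]
    return max(candidates, key=lambda c: c[1], default=None)

print(pick_ad("best mattress for back pain"))  # ('SleepCo', 2.5)
```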

Google made over $147 billion in 2020, but DuckDuckGo’s $100 million is not a small number either. DuckDuckGo’s market share is greater than Bing’s and, if limited to the US market, its market share is second only to Google’s. DuckDuckGo is like the Little Engine That Could. It is a hard-working marketing operation, and it keeps chugging along while batting the privacy beach ball along the Madison Avenue sidewalk.

Whitney Grace, August 10, 2021
