The Internet Archive Dons a Scholar Skin

April 23, 2021

Some of today’s biggest social faux pas are believing everything on the Internet, clicking the first link in search results, and buying items from questionable Internet ads. It is easy to forget that search engines like Google and Bing are for-profit businesses that put paid links at the top of search results. What is even worse is that scientific and scholarly information is locked behind expensive paywalls.

Wikipedia is often believed to be a reliable source, but despite the dedication of wiki editors the encyclopedia is not 100% accurate. There are free scholarly databases, and newspapers often have their archives online, but that information is not widely known.

Thankfully the Internet Archive is fairly famous. The Internet Archive is a non-profit digital library that provides users with access to millions of free books, music, Web sites, videos, and software. They also allow users to peruse old Web sites with the Wayback Machine.

The Internet Archive recently introduced a brand new service that is sheer genius: Internet Archive Scholar. It is described as:

“This full text search index includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth century journals through the latest Open Access conference proceedings and pre-prints crawled from the World Wide Web.”

Why did no one at the Internet Archive think of doing this before? It is a brilliant idea that gathers millions of scholarly articles and other information in one place without paywalls, university matriculation, or a library card. Most of the information available through Internet Archive Scholar would otherwise remain buried in Google search results or on the Web, like old books gathering dust on library shelves.

Internet Archive Scholar is still in the beta phase, and further enhancements would be a positive step.

Whitney Grace, April 23, 2021

Search Tips: Ideal for the Thumbtyper in a Hurry

April 21, 2021

Finding information is “easy.” Some systems display information before you search for it. A mobile phone displaying the time and temperature is one example. Maybe you want to locate a source for flowering Chinese cabbage? Plug the phrase into Bing, Google, Qwant, and Yandex. Bingo: super relevant, timely results. Works every time.

If you want to locate information germane to a topic like loss of coolant accident or octonitrocubane, you may need to use a different approach. To get some tips on locating high value, useful information navigate to “Internet Search Tips.” The write up beats the drum for the Internet Archive. That’s okay.

Useful but probably not suitable for those who are into “good enough” results, a category which includes some YouTube stars, most MBAs, and sadly some of the more recent graduates of information science programs.

Stephen E Arnold, April 21, 2021

Google Stop Words: Close Enough for the Mom and Pop Online Ad Vendor

April 15, 2021

I remember from a statistics lecture, given by a fellow named Dr. Peplow maybe, that fuzzy is one of the main characteristics of statistics. The idea is that a percentage is not a real entity; for example, the average number of lions in a litter is three, give or take a couple of the magnets for hunters and poachers. Depending upon the data set, the “real” number may be 3.2 cubs in a litter. Who has ever seen a fractional lion? Certainly not me.

Why am I thinking fuzzy? Google is into data. The company collects, counts, and transforms “real” data into actions. Whip in some smart software, and the company has processes which address an advertiser’s need to reach eyeballs with some statistically validated interest in whatever the Mad Ave folks are trying to sell.

“Google Has a Secret Blocklist that Hides YouTube Hate Videos from Advertisers—But It’s Full of Holes” suggests that some of the Google procedures are fuzzy. The uncharitable might suggest that Google wants to get close enough to collect ad money. Horseshoe aficionados use the phrase “close enough for horseshoes” to indicate a toss which gets a point or blocks an opponent’s effort. That seems to be one possible message from The Markup article.

I noted this passage in the essay:

If you want to find YouTube videos related to “KKK” to advertise on, Google Ads will block you. But the company failed to block dozens of other hate and White nationalist terms and slogans, an investigation by The Markup has found. Using a list of 86 hate-related terms we compiled with the help of experts, we discovered that Google uses a blocklist to try to stop advertisers from building YouTube ad campaigns around hate terms. But less than a third of the terms on our list were blocked when we conducted our investigation.

What seems to be happening is that Google’s method for taking a term and then “broadening” it so that related terms are identified is not working. The idea is that related terms with a higher “score” are more directly linked to the original term. Words and phrases with lower “scores” are not closely related. The article uses the example of the term KKK.

I learned:

Google Ads suggested millions upon millions of YouTube videos to advertisers purchasing ads related to the terms “White power,” the fascist slogan “blood and soil,” and the far-right call to violence “racial holy war.” The company even suggested videos for campaigns with terms that it clearly finds problematic, such as “great replacement.” YouTube slaps Wikipedia boxes on videos about the “the great replacement,” noting that it’s “a white nationalist far-right conspiracy theory.” Some of the hundreds of millions of videos that the company suggested for ad placements related to these hate terms contained overt racism and bigotry, including multiple videos featuring re-posted content from the neo-Nazi podcast The Daily Shoah, whose official channel was suspended by YouTube in 2019 for hate speech.

It seems to me that Google is filtering specific words and phrases on a stop word list. Then the company is not identifying related terms, particularly words which are synonyms for the word on the stop list.
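
The gap between a literal stop list and a stop list with term expansion can be sketched in a few lines of Python. This is a minimal illustration, not Google's method: the function names, the harmless placeholder term, and the tiny related-term table are all assumptions made up for the example. It shows only that an exact-match filter lets a related phrase through, while the same filter with expansion catches it.

```python
# Hypothetical sketch: BLOCKLIST and RELATED are invented placeholders,
# not data from Google Ads or from The Markup's investigation.
BLOCKLIST = {"spam"}
RELATED = {"spam": {"junk mail", "unsolicited email"}}  # assumed synonym table


def normalize(term):
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def blocked_exact(term):
    """Literal stop-list check: only terms on the list itself are blocked."""
    return normalize(term) in BLOCKLIST


def expand(blocklist, related):
    """Return the blocklist plus every related term listed for its entries."""
    expanded = set(blocklist)
    for term in blocklist:
        expanded |= related.get(term, set())
    return expanded


def blocked_expanded(term):
    """Stop-list check after synonym expansion."""
    return normalize(term) in expand(BLOCKLIST, RELATED)


if __name__ == "__main__":
    for candidate in ["Spam", "junk mail", "newsletter"]:
        print(candidate, blocked_exact(candidate), blocked_expanded(candidate))
```

In a real system the related-term table would come from scored term relationships, and whoever sets the score threshold decides how far the expansion reaches.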

Is it possible that Google is controlling how it does fuzzification? In order to get clicks and advertising, does Google block specifics and omit the term expansion and synonym identification settings that would eliminate the words and phrases identified by The Markup’s investigative team?

These references to synonym expansion and query expansion are likely to be unfamiliar to some people. Nevertheless, fuzzy is in the hands of those who set statistical thresholds.

Fuzzy is not real, but the search results are. Ad money is a powerful force in some situations. The article seems to have uncovered a couple of enlightening examples. String matching and synonym expansion seem to be out of step. Some fuzzification may be helpful in the hate speech methods.

Stephen E Arnold, April 12, 2021

An Exploration of Search Code

April 9, 2021

Software engineer Bart de Goede posts an exercise in search coding on his blog: “Building a Full-Text Search Engine in 150 Lines of Python Code.” He has pared down the thousands and thousands of lines of code found in proprietary search systems to the essentials. Of course, those platforms have many more bells and whistles, but this gives one an idea of the basic components. Navigate to the write-up for the technical details and code snippets that I do not pretend to follow completely. The headings de Goede walks us through include Data, Data preparation, Indexing, Analysis, Indexing the corpus, Searching, Relevancy, Term frequency, and Inverse document frequency. He concludes:

“You can find all the code on Github, and I’ve provided a utility function that will download the Wikipedia abstracts and build an index. Install the requirements, run it in your Python console of choice and have fun messing with the data structures and searching. Now, obviously this is a project to illustrate the concepts of search and how it can be so fast (even with ranking, I can search and rank 6.27m documents on my laptop with a ‘slow’ language like Python) and not production grade software. It runs entirely in memory on my laptop, whereas libraries like Lucene utilize hyper-efficient data structures and even optimize disk seeks, and software like Elasticsearch and Solr scale Lucene to hundreds if not thousands of machines. That doesn’t mean that we can’t think about fun expansions on this basic functionality though; for example, we assume that every field in the document has the same contribution to relevancy, whereas a query term match in the title should probably be weighted more strongly than a match in the description. Another fun project could be to expand the query parsing; there’s no reason why either all or just one term need to match.”

For more information, de Goede recommends curious readers navigate to MonkeyLearn’s post “What is TF-IDF?” and to an explanation of “Term Frequency and Weighting” posted by Stanford’s NLP Group. Happy coding.
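
For readers who want a feel for the moving parts before visiting the write-up, here is a minimal sketch of the components the headings name: analysis, an inverted index, and term frequency times inverse document frequency for ranking. It is not the code from the blog post. The class name `TinyIndex`, the regex tokenizer, and the all-terms-must-match rule are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict


def analyze(text):
    """Analysis step (assumed): lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index with TF-IDF ranking."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.term_counts = {}             # document id -> Counter of its terms

    def add(self, doc_id, text):
        tokens = analyze(text)
        self.term_counts[doc_id] = Counter(tokens)
        for token in tokens:
            self.postings[token].add(doc_id)

    def idf(self, term):
        """Inverse document frequency: rarer terms get a larger weight."""
        doc_freq = len(self.postings.get(term, ()))
        if doc_freq == 0:
            return 0.0
        return math.log(len(self.term_counts) / doc_freq)

    def search(self, query):
        """Return ids of documents containing every query term, best first."""
        terms = analyze(query)
        if not terms:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in terms))
        scored = [
            (sum(self.term_counts[d][t] * self.idf(t) for t in terms), d)
            for d in matches
        ]
        # Highest score first; ties are broken by document id.
        return [d for _, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


if __name__ == "__main__":
    index = TinyIndex()
    index.add(1, "Python search engine")
    index.add(2, "Python Python tutorial")
    index.add(3, "Cooking with cabbage")
    print(index.search("python"))
    print(index.search("python search"))
```

Weighting a title match more heavily than a description match, or relaxing the all-terms rule, are the sort of expansions the quoted passage suggests.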

Cynthia Murrell, April 9, 2021

Microsoft Adds Semantic Search to Azure Cognitive Search: Is That Fast?

April 9, 2021

Microsoft is adding new capabilities to its cloud-based enterprise search platform Azure Cognitive Search, we learn from “Microsoft Debuts AI-Based Semantic Search on Azure” at Datanami. We’re told the service offers improved development tools. There is also a “semantic caption” function that identifies and displays a document’s most relevant section. Reporter George Leopold writes:

“The new semantic search framework builds on Microsoft’s AI at Scale effort that addresses machine learning models and the infrastructure required to develop new AI applications. Semantic search is among them. The cognitive search engine is based on the BM25 algorithm, (as in ‘best match’), an industry standard for information retrieval via full-text, keyword-based searches. This week, Microsoft released semantic search features in public preview, including semantic ranking. The approach replaces traditional keyword-based retrieval and ranking frameworks with a ranking algorithm using deep neural networks. The algorithm prioritizes search results based on how ‘meaningful’ they are based on query relevance. Semantics-based ranking ‘is applied on top of the results returned by the BM25-based ranker,’ Luis Cabrera-Cordon, group program manager for Azure Cognitive Search, explained in a blog post. The resulting ‘semantic answers’ are generated using an AI model that extracts key passages from the most relevant documents, then ranks them as the sought-after answer to a query. A passage deemed by the model to be the most likely to answer a question is promoted as a semantic answer, according to Cabrera-Cordon.”

By Microsoft’s reckoning, the semantic search feature represents hundreds of development years and millions of dollars in compute time by the Bing search team. We’re told recent developments in transformer-based language models have also played a role, and that this framework is among the first to apply the approach to semantic search. There is one caveat—right now the only language the platform supports is US English. We’re told that others will be added “soon.” Readers who are interested in the public preview of the semantic search engine can register here.
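
The arrangement described in the quote, a BM25 ranker with a second ranking pass applied on top of its results, can be sketched as follows. This is a minimal illustration under stated assumptions, not Microsoft's implementation: the function names are made up, the BM25 variant and the defaults `k1=1.2` and `b=0.75` are common textbook choices, and the `rerank_score` callable is a stand-in for whatever neural model does the semantic scoring.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in a non-empty list against the query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms weigh more; the +1 keeps the weight positive.
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def two_stage(query_terms, docs, rerank_score, top_k=2):
    """Keep the top_k documents by BM25, then reorder them with rerank_score."""
    first_pass = bm25_scores(query_terms, docs)
    candidates = sorted(range(len(docs)), key=lambda i: -first_pass[i])[:top_k]
    return sorted(candidates, key=lambda i: -rerank_score(query_terms, docs[i]))


if __name__ == "__main__":
    corpus = [
        ["azure", "search"],
        ["cooking", "cabbage", "soup"],
        ["search", "search", "engine", "tips"],
    ]
    # Placeholder re-ranker (assumption): prefers shorter documents.
    prefer_short = lambda query, doc: -len(doc)
    print(bm25_scores(["search"], corpus))
    print(two_stage(["search"], corpus, prefer_short))
```

The second pass only reorders what the keyword pass already retrieved, which is why a relevant document missed by BM25 cannot be rescued by the re-ranker in this design.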

Cynthia Murrell, April 9, 2021

Autonomy: Some Search History

April 6, 2021

I want to offer a happy quack to The Register, an online information service, for links to Autonomy documents. The slow moving legal carnival train is nearing its destination. “Everything You Need to Know about the HPE v Mike Lynch High Court Case” provides a useful summary of the trial. In addition, the article includes links to a number of fascinating documents. These provide some helpful insights into the challenges vendors of enterprise search and content processing systems face. Furthermore, the documents make clear that enterprise software can be a business challenge. The sales cycle is difficult. Installing and optimizing the software are challenges. Plus, keeping the customer’s expectations for a solution in line with the realities of the solution often requires the intellectual skills of big time wizards. Why are these documents relevant in 2021?

First, some vendors of search and content processing systems ignore the realities exposed in these documents.

Second, today’s customers are fooled by buzzwords and well crafted demonstrations. The actual system may be “different.”

Third, the users of today’s systems are likely to find themselves struggling to locate and make sense of information they know is available in the organization.

But marketing and complex interactions among software and service vendors and their partners are fascinating. Are similar practices in play today?

That’s an interesting question to consider.

Stephen E Arnold, April 6, 2021

Google Ad King Assembles Ad Free Search Engine

April 5, 2021

The heart of Google’s revenue is targeted ads. Despite the tech giant’s code of conduct, the company became a profit-driven corporate beast. Sridhar Ramaswamy was once Google’s advertising king, but he became disillusioned with the corporate beast. His biggest qualms were how Google’s obsessions with growth affected everything in the company, including user privacy and search quality.

Maybe Ramaswamy was inspired by DuckDuckGo when he decided to build a new search engine without ads and data tracking. Forbes details Ramaswamy’s career move in the article, “After Building Google’s Advertising Business, This Founder Is Creating An Ad-Free Alternative.”

His new search engine is called Neeva, and his cofounder and fellow ex-Googler Vivek Raghunathan invested in the new search startup. Instead of relying on ad revenue, Ramaswamy wants Neeva to be subscription based. His plan is for users to pay $5-10 a month to see non-sponsored search results.

Privacy is a major concern for users and the current Internet of things is hardly secure. Neeva comes at a time when users are demanding better regulations and better technology securing their information. There could also be a growing demand for unpolluted search results. Larry Page and Sergey Brin even wrote in their famous Stanford research paper that search engines driven by ad revenue will not ultimately meet consumers’ needs, because they will be biased by advertisements.

Neeva already has many investors, but tech experts doubt it will do much damage to Google:

“Search engine experts doubt Neeva will be able to do much damage to Google, at least in the short term. Some say Google’s gravitational pull is too strong for users to leave. Arun Kumar, CTO at Interpublic Group of Companies, Inc. a New York-based advertising holding company, says while Neeva might ‘find a few takers, but you’re not going to shake the kingdom.’”

Money is the driving force behind Google and users’ needs. Why pay for something when it is free in other places, biased or not?

Whitney Grace, April 5, 2021

Xooglers Have Google DNA When It Comes to Search

March 22, 2021

I spotted this story: “Ex-Google Employees Come Up with Their Own Privacy-Focused Search Engine.” The hook is that two Xooglers (former Google employees) are beavering away on a new search engine. The details appear in the write up. What I noticed was that users will have to pay to play. Plus, in order to become a subscriber, certain personal information will be required. Here’s a selection of the data the “privacy focused search engine” will possess:

  • Email address
  • Phone number
  • Location information
  • Name
  • User settings
  • IP address
  • Information you save in your ‘spaces.’
  • Payment information
  • The operating system or device
  • Mailing address
  • Cookie identifiers
  • Information regarding your contacts
  • The browser type and version you use
  • Pages that you visit

You can take the Xooglers out of Google, but it seems you cannot take the Google out of Xooglers. I particularly like the useful information which can be extracted from these data and nifty analyses like cross correlation. And that browser history! Yep, very interesting.

The privacy focused phrase is tasty too.

Stephen E Arnold, March 22, 2021

The Duck Confronts Googzilla

March 18, 2021

You have heard of David and Goliath? What about the duck and Googzilla? No. Navigate to “DuckDuckGo Calls Out Google over User Data Collection.” The metasearch engine wants everyone to know that Google does not define “privacy” the way the duck crowd does. The write up states:

DuckDuckGo says Google tried its best to hide its data collection practices, until it was no longer possible for them to keep it private. ‘After months of stalling, Google finally revealed how much personal data they collect in Chrome and the Google app. No wonder they wanted to hide it,’ DuckDuckGo said in a series of tweets. ‘Spying on users has nothing to do with building a great web browser or search engine. We would know (our app is both in one).’

Everyone is entitled to an opinion.

However, it is interesting to consider the question, “What happens next?”

  1. Google can ignore the duck. Eric Schmidt is no longer explaining that Qwant keeps him awake at night because that service is a heck of a threat. So, meh.
  2. Google takes steps to make life slightly more interesting for DuckDuckGo. There are some possibilities which are fun to ponder; for example, hasta la vista to links from the GOOG to the duck or Google works its magic within its walled garden. There’s a lot of content that lives within the Google ecosystem and when it is blocked or gifted with added latency, the scope may be a surprise to some.
  3. Google goes on the offensive just as it has with Microsoft. Imagine Google’s CEO suggesting that Microsoft’s CEO is dragging red herrings to the monopoly party. What could Google’s minions identify as information of value about DuckDuckGo, its traffic, and its index coverage? Interesting to ponder.

The tale of David and Goliath is an enduring one. The duck versus Googzilla might lack the legendary status of brave David, but the confrontation might be a surprising one. Ducks are fierce creatures, but may have to punch above their weight to cause Googzilla pain.

Stephen E Arnold, March 18, 2021

Google and Microsoft Are Fighting. But a Battle May Loom between Coveo and Service Now

March 18, 2021

The 2021 cage match line ups are interesting. The Google – Microsoft dust up is a big deal. Google says Microsoft is using its posture on news as a way to blast rock and roll fog around the egregious security breaches for SolarWinds and Exchange Server.

But that fog could obscure a bout between Coveo (a smart search company) and Service Now (a Swiss Army knife of middleware, including Attivio search). Both companies invoke the artificial intelligence moniker. Both covet enterprise customers. Both want to extend their software into large organizations.

Service Now makes its plans clear in “Service Now Adds New AI and Low-Code Development Features.” The write up states:

[A user conference in Quebec] … also introduces AI Search, underpinned by technology acquired in ServiceNow’s purchase of Attivio. AI Search delivers intelligent search results and actionable information, complementing Quebec’s Engagement Messenger that extends self-service to third-party portals to enable AI search, knowledge management, and case interactions. Also new in Quebec is the aforementioned virtual agent, which delivers AI-powered conversational experiences for IT incident resolution.

From my vantage point, the AI is hand waving. Search has quite a few moving parts, and human involvement is necessary whether smart software is involved or not.

What Service Now has, however, is a meta-play; that is, it offers numerous management services. If properly set up and resourced, these could reduce the pain of some utility functions. Search is the mother of all utility services.

Coveo is a traditional enterprise search vendor. The company has targeted numerous business functions as likely customers; for example, customer support and marketing.

But niche vendors of utilities have to be like the “little engine that could.”

This may not be the main event like Google versus Microsoft, but it will be an event to watch.

Stephen E Arnold, March 18, 2021
