The Google: Indexing and Discriminating Are Expensive. So Get Bigger Already

November 9, 2022

It’s Wednesday, November 9, 2022, only a few days until I hit 78. Guess what? Amidst the news of crypto currency vaporization, hand wringing over the adult decisions forced on high school science club members at Facebook and Twitter, and the weirdness about voting — there’s a quite important item of information. This particular datum is likely to be washed away in the flood of digital data about other developments.

What is this gem?

An individual has discovered that the Google is not indexing some Mastodon servers. You can read the story in a Mastodon post at this link. Don’t worry. The page will resolve without trying to figure out how to make Mastodon stomp around in the way you want it to. The link to you is Snake.club Stephen Brennan.

The item is that Google does not index every Mastodon server. The Google, according to Mr. Brennan:

has decided that since my Mastodon server is visually similar to other Mastodon servers (hint, it’s supposed to be) that it’s an unsafe forgery? Ugh. Now I get to wait for a what will likely be a long manual review cycle, while all the other people using the site see this deceptive, scary banner.

So what?

Mr. Brennan notes:

Seems like El Goog has no problem flagging me in an instant, but can’t cleanup their mistakes quickly.

A few hours later Mr. Brennan reports:

However, the Search Console still insists I have security problems, and the “transparency report” here agrees, though it classifies my threat level as Yellow (it was Red before).

Is the problem resolved? Sort of. Mr. Brennan has concluded:

… maybe I need to start backing up my Google data. I could see their stellar AI/moderation screwing me over, I’ve heard of it before.

Why do I think this single post and thread is important? Four reasons:

The incident underscores how an individual perceives Google as “the Internet.” Despite the use of a decentralized, distributed system. The mind set of some Mastodon users is that Google is the be-all and end-all. It’s not, of course. But if people forget that there are other quite useful ways of finding information, the desire to please, think, and depend on Google becomes the one true way. Outfits like Mojeek.com don’t have much of a chance of getting traction with those in the Google datasphere.
Google operates on a close-enough-for-horseshoes or good-enough approach. The objective is to sell ads. This means that big is good. The Good Principle doesn’t do a great job of indexing Twitter posts, but Twitter is bigger than Mastodon in terms of eye balls. Therefore, it is a consequence of good-enough methods to shove small and low-traffic content output into a area surrounded by Google’s police tape. Maybe Google wants Mastodon users behind its police tape? Maybe Google does not care today but will if and when Mastodon gets bigger? Plus some Google advertisers may want to reach those reading search results citing Mastodon? Maybe? If so, Mastodon servers will become important to the Google for revenue, not content.
Google does not index “the world’s information.” The system indexes some information, ideally information that will attract users. In my opinion, the once naive company allegedly wanted to achieve the world’s information. Mr. Page and I were on a panel about Web search as I recall. My team and I had sold to CMGI some technology which was incorporated into Lycos. That’s why I was on the panel. Mr. Page rolled out the notion of an “index to the world’s information.” I pointed out that indexing rapidly-expanding content and the capturing of content changes to previously indexed content would be increasingly expensive. The costs would be high and quite hard to control without reducing the scope, frequency, and depth of the crawls. But Mr. Page’s big idea excited people. My mundane financial and technical truths were of zero interest to Mr. Page and most in the audience. And today? Google’s management team has to work overtime to try to contain the costs of indexing near-real time flows of digital information. The expense of maintaining and reindexing backfiles is easier to control. Just reduce the scope of sites indexed, the depth of each crawl, the frequency certain sites are reindexed, and decrease how much content old content is displayed. If no one looks at these data, why spend money on it? Google is not Mother Theresa and certainly not the Andrew Carnegie library initiative. Mr. Brennan brushed against an automated method that appears to say, “The small is irrelevant controls because advertisers want to advertise where the eyeballs are.”
Google exists for two reasons: First, to generate advertising revenue. Why? None of its new ventures have been able to deliver advertising-equivalent revenue. But cash must flow and grow or the Google stumbles. Google is still what a Microsoftie called a “one-trick pony” years ago. The one-trick pony is the star of the Google circus. Performing Mastodons are not in the tent. Second, Google wants very much to dominate cloud computing, off-the-shelf machine learning, and cyber security. This means that the performing Mastodons have to do something that gets the GOOG’s attention.

Net net: I find it interesting to find examples of those younger than I discovering the precise nature of Google. Many of these individuals know only Google. I find that sad and somewhat frightening, perhaps more troubling than Mr. Putin’s nuclear bomb talk. Mr. Putin can be seen and heard. Google controls its datasphere. Like goldfish in a bowl, it is tough to understand the world containing that bowl and its inhabitants.

Stephen E Arnold, November 9, 2022

Written by Stephen E. Arnold · Filed Under Business strategy, Google, News, Social, Social Media, Web 3 | Comments Off on The Google: Indexing and Discriminating Are Expensive. So Get Bigger Already

Sepana: A Web 3 Search System

November 8, 2022

Decentralized search is the most recent trend my team and I have been watching. We noted “Decentralized Search Startup Sepana Raises $10 Million.” The write up reports:

Sepana seeks to make web3 content such as DAOs and NFTs more discoverable through its search tooling.

What’s the technical angle? The article points out:

One way it’s doing this is via a forthcoming web3 search API that aims to enable any decentralized application (dapp) to integrate with its search infrastructure. It claims that millions of search queries on blockchains and dapps like Lens and Mirror are powered by its tooling.

With search vendors working overtime to close deals and keep stakeholders from emulating Vlad the Impaler, some vendors are making deals with extremely interesting companies. Here’s a question for you? “What company is Elastic’s new best friend?” Elasticsearch has been a favorite of many companies. However, Amazon nosed into the Elastic space. Furthermore, Amazon appears to be interested in creating a walled garden protected by a moat around its search technologies.

One area for innovation is the notion of avoiding centralization. Unfortunately online means that centralization becomes an emergent property. That’s one of my pesky Arnold’s Laws of Online. But why rain on the decentralized systems parade?

Sepana’s approach is interesting. You can get more information at https://sepana.io. Also you can check out Sepana’s social play at https://lens.sepana.io/.

Stephen E Arnold, November 8, 2022

Written by Stephen E. Arnold · Filed Under News, Search, Web 3 | Comments Off on Sepana: A Web 3 Search System

Search the site
Subscribe to Beyond Search
Feature archive
News archive

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.