Why So Few Search Vendors Index the Web?

July 5, 2018

How many companies are indexing the Surface Web, the Dark Web, and the other bits and pieces which comprise the accessible Internet?

The answer is, “Not many that most people can name.”

Another question is, “Why don’t more companies just index the Internet?”

The answer is, “Money, resources, time, expertise, and generating revenue.”

The write up from 2012 “How to Crawl a Quarter Billion Webpages in 40 Hours” surfaced again after an absence of six years. The article remains valid even though the principal change in the last 72 months is the increased concentration of Google’s index. Microsoft, a company which insists that its Bing system provides an alternative to Google, has not significantly slowed Google’s market magnetism. Many of the systems which are marketed as Web indexes like Duckduckgo.com and Startpage.com are metasearch engines; that is, the users’ queries are passed to other services and may be supplemented with some original crawling. A bit of fiddling ensures that the results lists seem to be different. But there is a sameness to the result sets, particularly on popular queries. Yandex, the Russian Web search system, does a good job of handling certain sets of domains, but the overall coverage is not that different from what one can find in Google or its country centric indexes.
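To make the metasearch idea concrete, here is a minimal sketch of how a query might be passed to several services and the result lists blended. It is an illustration only, not how any named service works: the backend functions `big_index` and `own_crawl`, the example URLs, and the round-robin merge are all assumptions made for the example.

```python
from itertools import zip_longest
from typing import Callable, Dict, List

# A backend takes a query string and returns a ranked list of result URLs.
Backend = Callable[[str], List[str]]


def metasearch(query: str, backends: Dict[str, Backend], limit: int = 10) -> List[str]:
    """Send the query to every backend, interleave the ranked lists, drop duplicates."""
    ranked_lists = [search(query) for search in backends.values()]
    merged: List[str] = []
    seen = set()
    # Round-robin: take rank 1 from each backend, then rank 2, and so on.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is None or url in seen:
                continue
            seen.add(url)
            merged.append(url)
    return merged[:limit]


# Hypothetical stand-ins for real services; no network calls are made.
def big_index(query: str) -> List[str]:
    return ["https://a.example/1", "https://b.example/2", "https://c.example/3"]


def own_crawl(query: str) -> List[str]:
    return ["https://b.example/2", "https://d.example/4"]


results = metasearch("web crawling", {"big_index": big_index, "own_crawl": own_crawl})
print(results)
```

Because most of the merged list still comes from the larger upstream index, the output looks reshuffled rather than new, which is consistent with the sameness noted above.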

What’s interesting about “How to Crawl” from 2012 is the use of the Amazon system. This is important because the plumbing required to index the Internet can be large, complicated, and expensive.

Does Amazon still operate its A9 Web index? We have heard yes and no as an answer to this question. With a significant number of queries seeking product information, it makes sense to consider Amazon as a potential competitor to Bing, Google, and Yandex.

After rereading the “How to Crawl” paper, one thing jumps out. The notion that a quarter of a billion pages is a non trivial chunk of the Internet is interesting but a bit misleading. There may be upwards of 30 billion indexable Web pages. A large number of these content objects exist in mobile forms; thus, deduplication becomes an interesting issue. That’s why Google has multiple indexes.
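A quick back-of-the-envelope calculation puts the two numbers side by side. The figures are the ones quoted in this post; the 30 billion total is a rough estimate, and the constant crawl rate is an assumption made only to keep the arithmetic simple.

```python
# Figures quoted in the post; the 30 billion total is a rough estimate.
PAGES_CRAWLED = 250_000_000
CRAWL_HOURS = 40
INDEXABLE_PAGES = 30_000_000_000

pages_per_second = PAGES_CRAWLED / (CRAWL_HOURS * 3600)
share_of_web = PAGES_CRAWLED / INDEXABLE_PAGES

# Assumption: the crawl rate stays constant, which a real crawl would not guarantee.
hours_for_full_pass = INDEXABLE_PAGES / PAGES_CRAWLED * CRAWL_HOURS
days_for_full_pass = hours_for_full_pass / 24

print(f"{pages_per_second:,.0f} pages per second")
print(f"{share_of_web:.2%} of the estimated indexable Web")
print(f"{days_for_full_pass:,.0f} days for one full pass at the same rate")
```

On these assumptions the 2012 crawl ran at roughly 1,736 pages per second, covered under one percent of the estimated total, and would need about 200 days for a single full pass, before any recrawling for freshness.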

The big question becomes, “Is there another company able to compete with Google?”

After reading “How to Crawl” after a lapse of six years, the answer may be,

“Very, very few companies. And some of the outfits indexing the Surface and Hidden Internet may not make their activities public.”

Monocultures are okay but these can be vulnerable to something the monoculture cannot resist. Is Google like today’s banana? What happens if a blight attacks? One can shift to durian I suppose.

Stephen E Arnold, July 5, 2018

Calendars Are Now Search… If One Is Busy and Eschews Print Schedulers

July 3, 2018

You might not think it, but your doctor’s appointments and dinner parties are a big deal to search companies. With the rise of digital assistants like Siri and Alexa, your datebook is the next big horizon to conquer. The ways in which this will unfold might surprise you, according to a recent Japan Today story, “Google’s ‘Reserve’ Tool Winning Converts and Taking Search to the Next Level.”

According to the story:

“[S]even software firms that supply schedule data to Google described the volume as significant, with as much as 75 percent of bookings representing new customers. Consumers like the convenience. Business owners say the tool is putting their names in front of more potential clients.”

It is no coincidence that several experts are touting the ability of digital assistants to help with travel planning. In a weird way, voice search can now do a lot of the work of a travel agent, in terms of eyeing your schedule, finding deals, and even purchasing flight tickets. From getting reservations to booking flights to making sure someone is picking up junior from soccer practice, there is a revolution happening in search and how it relates to daily life. Search and scheduling: A wonderful way to fill one’s day with useful activities.

Patrick Roland, July 3, 2018

Search History? No Big Deal Maybe

June 29, 2018

What you search for leaves a digital footprint, or more accurately, a fingerprint. So much identifying data is left behind in your search history. However, there are some angles to this predicament many people are overlooking. We realized just how much bad information people are getting after reading a recent Pagal Parrot article, “Searching These Five Things Can Make Trouble For You.”

This odd little story seems to really give some elementary advice on what not to search for, like:

“#2 Your Name- It’s not a big secret that in this era of the internet our privacy questioned. If you try to Google most probably you will get stumble upon some unpleasant results, bad photos of you, outdated information, irrelevant content. we take such things way too seriously. If you find something like this, you want to delete it.”

This is a little obscure, considering there are far worse implications of your search history. For one, it tells the bots what to send through your social media feed. So, for example, a simple search about fake news might just land you with a glut of bogus stories. Thankfully, there is better advice out there than not searching your name, like how to wipe your Facebook and Google search history so that you aren’t fed to the algorithm monsters. Much more practical, in our book!

Patrick Roland, June 29, 2018

Search Now Maps Physical Products

June 21, 2018

Search has slowly been creeping into the real world, but rarely have we seen it making a positive impact on our lives when it does. Until now! A new search engine we discovered bridges the gap between the digital world and the physical world with impressively helpful results. We learned more from a recent LifeHacker story, “See What’s Actually In Your Skincare Products With this Search Engine.”

The site is called Incidecoder, and this is what the article had to say:

“You can search for individual ingredients and popular products by name on INCIDecoder, and it will list out all of the ingredients as well as descriptions of what they actually are and what they do. Because while I know what Aqua is, I’m less familiar with PPG-26-Buteth-26 and Ethylhexylglycerin.”

Another way search is sneaking into the real world is in the fashion industry, where AI and predictive analytics can tell designers what look is hot now, but also what trends will pop up in the future. Expect to see more of this trend beyond fashion and beauty aids. This seems like it will be a huge market for blending search and AI into our daily lives.

Patrick Roland, June 21, 2018

Google Search Evaluator Handbook

June 12, 2018

How does Google shape search results? The pay to play search giant allegedly has a guide for individuals who interact with the automated search system. The information appears at this link. The information dates to 2017. There may be a revision or additional instructional material online. If we come across that information, we will post the link in Beyond Search.

The information is described as “Search Quality Rating System.” A sample from the table of contents for the documentation appears below:

[Image: search evaluation 1]

An example of the information provided to the human making quality decisions appears below:


Here’s the guidance for queries about kittens:


In my first Google monograph (The Google Legacy, 2004), I gathered about 100 factors allegedly used to determine “quality” of Google search results. What I found interesting is that Google’s listing has many more entries than I identified 14 years ago.

Quality, it seems, is more difficult to pinpoint today. The rules for relevance, however, seem to have been marginalized.

I do know that in order to obtain useful results from Google, I have to craft my queries carefully. In fact, creating a query for an old school Boolean system is easier to do. Google has added on to what was essentially a key word system by wrapping layers of software around an ageing core.

Worth spending a few minutes with the document in my opinion.

Stephen E Arnold, June 12, 2018

Want Info about a Small Town? Hit the Library

June 6, 2018

For many, libraries are obsolete, deader than a Peruvian mummy. This is true for some, but if you live in a small town then libraries are far from dead. Big news outlets cover global issues, so they skip over small town stories. Small towns, however, still have news and the residents want to read it. Where do they go to get information when local newspapers have dried up? They go to the local library. The Atlantic shares the story, “The Libraries Bringing Small-Town News Back To Life” and how the US’s smaller cities still rely on libraries as information centers.

Libraries have seen their budgets slashed, their branches closed down, and their professional librarians replaced by paraprofessionals. Yet people still go to libraries and even trust librarians over journalists and other news sources. Why? Librarians understand the importance of accurate information and its sources.

Librarians have picked up the slack where local news sources have failed or disappeared. In some towns, being a news source has increased participation at libraries. The write up stated:

“Various types of community building are happening across the nation. In some cities, libraries are partnering with established news sources, teaming up in Dallas to train high schoolers in news gathering or hosting a satellite studio in Boston for the public radio station WGBH. In San Antonio, the main library offers space to an independent video news site that trains students and runs a C-SPAN-style operation in America’s seventh-biggest city. (That site was the only video outlet covering a mayoral debate last year in which the incumbent mayor’s comments on poverty became a national story—and may have contributed to her electoral defeat.)”

Where libraries once used to store information, they are turning into the information source. They are also reinforcing important information literacy skills, which are desperately needed as fake news and instant search weaken people’s judgment skills.

Whitney Grace, June 6, 2018

Are Auto Suggestions Inherently Problematic?

June 3, 2018

Politics is a dangerous subject to bring up in any social situation. My advice is to keep quiet and nod, then you can avoid loudmouths trying to press their agendas down your throat. Despite attempts to remain polite, the Internet always brings out the worst in people, and The Sun shows how a simple search engine function does it in “‘Trump Should Be Shot’ Google And Bing Searches For ‘Trump’ And ‘Conservatives’ Offer Disgusting Auto-Suggestions.”

Auto-complete is notorious for making hilarious mistakes and the same is true of auto-suggest on search engines, but these mistakes end up being more gruesome than a misspelling.

If you want to see some interesting suggestions, type “Trump should be…” into a blank search bar and the results are endless, including: shot, arrested, killed, in jail, and banned from Twitter (okay, the last one might be a little funny).

Typing in “conservatives need…” results in less derogatory terms, but the auto-suggestions include: to die, to go, a new party, and not apply.


What creates these auto-suggestions?

“These are based on a number of factors including real-time searches, trending results, your location, and previous activity. The intuitive predictions change in “response to new characters being entered into the search box” explains Google. And the company also has its own set of “autocomplete policies” in case something untoward should pop up. Along with prohibiting predictions that contain sexually explicit, violent, and harmful terms, Google says it also removes hateful suggestions against groups and individuals. ‘We remove predictions that include graphic descriptions of violence or advocate violence generally,’ states the firm.”

Google and Bing deserve some credit for removing the slander from auto-complete, but sometimes they only do it when they are pushed. Trolls and bigots create these terms and it would be nice to see them scrubbed from auto-suggest, but it is near impossible. Hey, Bing and Google, try scrubbing 4chan!
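The mechanics described in the quoted passage can be sketched in a few lines. This is a toy model under stated assumptions, not Google’s or Bing’s actual system: the `suggest` function, the query counts, and the one-word block list are all invented for illustration, and real systems also weigh location, trends, and personal history.

```python
from typing import Dict, List, Set


def suggest(prefix: str, query_counts: Dict[str, int],
            blocked_terms: Set[str], limit: int = 5) -> List[str]:
    """Return the most frequent past queries starting with the prefix, minus blocked ones."""
    prefix = prefix.lower().lstrip()
    candidates = [
        q for q in query_counts
        if q.startswith(prefix)
        and not any(term in q.split() for term in blocked_terms)
    ]
    # Most popular first; ties are broken alphabetically so the output is stable.
    candidates.sort(key=lambda q: (-query_counts[q], q))
    return candidates[:limit]


# Hypothetical query log: how often each full query was typed.
query_counts = {
    "library hours": 120,
    "library near me": 300,
    "library fines": 45,
    "library fight video": 80,
}
blocked_terms = {"fight"}  # stand-in for an autocomplete policy list

suggestions = suggest("libr", query_counts, blocked_terms)
print(suggestions)
```

The sketch shows why scrubbing is hard: the suggestions come straight from what people type, so a block list only catches the terms someone has already thought to add.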

Whitney Grace, June 3, 2018

AI: A Little Helper for Those Seeking Information

May 31, 2018

Search is a powerful tool and big data software has only improved search’s quality.  Search can now locate items in all data structures, ranging from structured to unstructured.  Do users, however, actually find the answers they want?  InfoWorld runs through the impact AI has had on search in the article, “The Wonders Of AI-Or The Shortcomings Of Search?”

In essence, Google and Amazon’s subsecond search results have spoiled and ruined users.  Users are so used to accurate and quick results that they expect all Web sites, software, and hardware to do the same.  These search tools are actually providing users with an information overload.

On the other hand, AI makes search and other tools more robust.  Organizations use AI not only to power search, but to feed and filter data to make business recommendations.  Google and Amazon are not the only ones using it.  Other companies that use AI to power their businesses are Uber, Tesla, Spotify, Pandora, Netflix, and Bristol-Myers Squibb.  AI takes the search out of search:

“Those last points are crucial. A structural shift is under way. AI cuts through the clutter to provide not endless pages of results to wade through, but with specific recommendations tailored to you as the seeker of knowledge—or simply as the seeker of where to find the best Chicago-style pizza while away from home on a business trip. (Which is not to admit, certainly not in print, that I have not supplemented my normal whole-foods, plant-based, no-meat-or-dairy nutrition by indulging in such a cheesy, guilty pleasure. I present it merely for illustration.) The key construct: AI-driven systems present either the single best solution or a tight shortlist of best-fit solutions.”

AI also augments search by providing recommendations that are related to the original query, but are simply suggestions.  This requires that AI be fed a lot of data, so that it can offer proactive assistance.

Big data and AI are empowering, but they do need a checks and balances system.  The solution is to combine AI search and regular search into one tool: the curated list and the raw data list.

Whitney Grace, May 31, 2018


Want Mobile Traffic? New Tactics May Be Needed

May 30, 2018

I read “Mobile Direct Traffic Eclipses Facebook.” As with any research, I like to know the size of the sample, the methodology, and the “shaping” which the researchers bring to the project. To answer these questions, one must see other sources cited in the write up, including Nieman Lab, which appears to be recycling Chartbeat data. In short, I don’t know much about the research design or other aspects of the research.

Nevertheless, I noted a handful of statements or “facts” which on the surface struck me as interesting. The study data appear to support the assertion that “mobile does not equal social”.

First, the study reports that “mobile direct to traffic has surpassed Facebook.” I think this means that if those in the sample use a mobile device, some of those users use an app or a browser to go directly to a site. At first glance, Facebook seems to be a major player but it is, according to the survey, trending down from being the gateway to information for some mobile device users.

Second, the write up points out sites offering “content” are not losing visitors. On one hand, the finding suggests that Facebook is not a gateway trending upwards. I have seen reports suggesting that Facebook has been negatively affected by the Cambridge Analytica matter, but I have also seen reports which assert that Facebook is adding users. Which is it? That’s the question, isn’t it?

Third, the Chartbeat data put Google as the leading source of traffic to sites. What this means is that the “gap” between Facebook and Google as referrers seems to be getting bigger. Bad news for Facebook and good news for Google if the data are accurate.

Several observations:

  • The data, if accurate, make it clear that Google and its Android operating system have a clear path to the barn
  • Facebook may have to begin the process of adapting to mobile users who do not use Facebook as the gateway to the Internet (whatever that ends up being)
  • Governments interested in censoring certain content streams have a crude road map for determining which online destinations should be cut off from the information superhighway. (The law enforcement addiction to Facebook and Twitter may require some special treatment at clinics run by Google and high traffic destinations accessed via an app.)

To sum up, if the data in the Chartbeat report are accurate, changes are underway. Some positive, some negative. There is, however, that “if.”

Stephen E Arnold, May 30, 2018

Bing Keeps on Trying

May 21, 2018

Ah, Bing.

Microsoft has struggled to garner the respect in the search engine world that its software has commanded.

Bing is often seen as the Avis to Google’s Hertz. Maybe a stepchild of the search game patriarchs, Sergey and Larry.

Microsoft is not blind to these views, which is resulting in some interesting innovations to close the gap between it and Google. We learned about these steps from a recent TechRadar story, “Microsoft Unveils New Features for Bing in Bid to Make You Switch from Google.”

The biggest upgrade? The fact that Bing now gives you an “Intelligent Answer” and not just the one that ranks first. It seems like a good move, which the article highlights:

“We’re pleased to see Microsoft attempt to win over users by adding more features (which you can read about more on the Bing blog), rather than trying to strong-arm people who use Windows 10 into using the search engine, but will this be enough to make people switch?”

We’re going to go out on a (not very long) limb and suggest, no. This isn’t enough to make people switch. That’s especially true when we see news like this, which claims that Google’s Assistant is the most accurate. Looks like the game board is shifting beneath Microsoft’s feet as it tries to catch up.

How does one find information available on the Internet?

One doesn’t without recourse to commercial systems from vendors with low or zero profile among consumers. Money is required to find relevant information. Free stuff returns what earns money to pay for the “free lunch.”

Patrick Roland, May 21, 2018
