Why Do So Few Search Vendors Index the Web?

July 5, 2018

How many companies are indexing the Surface Web, the Dark Web, and the other bits and pieces which comprise the accessible Internet?

The answer is, “Not many most people can name.”

Another question, “Why don’t more companies just index the Internet?”

The answer is, “Money, resources, time, expertise, and generating revenue.”

The write up from 2012, “How to Crawl a Quarter Billion Webpages in 40 Hours,” surfaced again after an absence of six years. The article remains valid even though the principal change in the intervening 72 months is the increased concentration of Google’s index. Microsoft, a company which insists that its Bing system provides an alternative to Google, has not significantly dented Google’s market magnetism. Many of the systems marketed as Web indexes, like Duckduckgo.com and Startpage.com, are metasearch engines; that is, the users’ queries are passed to other services and may be supplemented with some original crawling. A bit of fiddling ensures that the results lists seem to be different. But there is a sameness to the result sets, particularly on popular queries. Yandex, the Russian Web search system, does a good job of handling certain sets of domains, but its overall coverage is not that different from what one can find in Google or Google’s country-centric indexes.
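To make the metasearch point concrete, here is a toy sketch. The backends are stubs, not DuckDuckGo’s or Startpage’s actual upstream services; the point is only how fanning a query out and interleaving the ranked lists produces results that merely seem different:

```python
# Toy metasearch: fan one query out to several backends (stubbed here)
# and interleave the ranked lists, deduplicating across sources.
from itertools import zip_longest

def backend_a(q): return [f"a.com/{q}/1", f"a.com/{q}/2"]   # stub upstream engine
def backend_b(q): return [f"b.com/{q}/1", f"b.com/{q}/2"]   # stub upstream engine

def metasearch(query: str) -> list[str]:
    merged, seen = [], set()
    # Round-robin across backends so no single source dominates the top.
    for pair in zip_longest(backend_a(query), backend_b(query)):
        for url in pair:
            if url and url not in seen:   # drop duplicates across sources
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("python"))
```

The “fiddling” in practice is re-ranking, deduplication, and a dash of original crawl data layered on top of this merge.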

What’s interesting about “How to Crawl” from 2012 is its use of the Amazon system. This is important because the plumbing required to index the Internet can be large, complicated, and expensive.

Does Amazon still operate its A9 Web index? We have heard both yes and no. With a significant number of queries seeking product information, it makes sense to consider Amazon a potential competitor to Bing, Google, and Yandex.

After rereading the “How to Crawl” paper, one thing jumps out. The notion that a quarter of a billion pages is a non-trivial chunk of the Internet is interesting but a bit misleading. There may be upwards of 30 billion indexable Web pages. A large number of these content objects exist in mobile forms; thus, deduplication becomes an interesting issue. That’s why the Google maintains multiple indexes.
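For readers who wonder what deduplication looks like in practice, here is a minimal sketch. It uses an exact content fingerprint; production systems rely on shingling or SimHash for near-duplicates, and nothing here reflects Google’s actual pipeline:

```python
# Minimal dedup tactic for mobile vs. desktop copies of a page:
# strip markup and whitespace, then fingerprint the remaining text.
# This exact-hash version only catches content-identical pages.
import hashlib
import re

def fingerprint(html: str) -> str:
    text = re.sub(r"<[^>]+>", " ", html)               # drop tags
    text = re.sub(r"\s+", " ", text).strip().lower()   # normalize spacing/case
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

desktop = "<html><body><h1>Hello</h1><p>Same   story.</p></body></html>"
mobile = "<html><body>\n<h1>Hello</h1> <p>Same story.</p>\n</body></html>"

assert fingerprint(desktop) == fingerprint(mobile)  # collapsed as duplicates
```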

The big question becomes, “Is there another company able to compete with Google?”

After revisiting “How to Crawl” following a six year lapse, the answer may be,

“Very, very few companies. And some of the outfits indexing the Surface and Hidden Internet may not make their activities public.”

Monocultures are okay, but they can be vulnerable to threats the monoculture cannot resist. Is Google like today’s banana? What happens if a blight attacks? One can shift to durian, I suppose.

Stephen E Arnold, July 5, 2018

Mobile Search: The Google Focus

May 28, 2018

SEO is the ultimate moving target. Just as you get a hunch about what algorithms are looking for in order to boost rankings, the algorithms and parameters change, and you practically have to start from scratch. It’s a convoluted world to navigate, and we were pleased to find a really competent explanation of the latest landscape change, the Mobile First Index, in a recent Search Engine Watch story, “Google’s Mobile First Index: Six Actions to Minimize Risk and Maximize Ranking Opportunities.”

According to the story:

“In any period of uncertainty there are opportunities to take advantage of and risks to manage – and in competitive SEO niches, taking every chance to get ahead is important… “Whatever your starting point – the mobile-first index is the new normal in SEO, and now is the time to get to grips with the challenge – and potential.”

Your starting point, it seems, should involve voice search. Another compelling article makes the point that voice search is the ideal pairing for Mobile First Indexing. Watch for this massive shift to happen rapidly. Critics have been eyeing the future of voice search for a while, and now the pieces are finally in place.

Old-fashioned keyword search seems less and less relevant.

Patrick Roland, May 28, 2018

Free Keyword Research Tools

May 15, 2018

Short honk: Search Engine Watch published a write up intended for SEO experts. The article contained some useful links to free keyword research tools. Even if you are not buying online ads or fiddling with your indexing, the services are interesting to know about. The links appear in the Search Engine Watch write up.

Stephen E Arnold, May 15, 2018

Mondeca: Another Semantic Search Option

April 9, 2018

Mondeca, based in France, has long been focused on indexing and taxonomy. Now they offer a search platform named, simply enough, Semantic Search. Here’s their description:

“Semantic search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results. Augment your SolR or ElasticSearch capabilities; understand the intent, contextualize search results; search using business terms instead of keywords.”

A few details from the product page caught my eye. Let’s begin with the Search functionality; the page succinctly describes:

“Navigational search – quickly locate specific content or resource. Informational search – learn more about a specific subject. Compound term processing, concept search, fuzzy search, simple but smart search, controlled terms, full text or metadata, relevancy scoring. Takes care of language, spelling, accents, case. Boolean expressions, auto complete, suggestions. Disambiguated queries, suggests alternatives to the original query. Relevance feedback: modify the original query with additional terms. Contextualize by user profile, location, search activity and more.”

The software includes a GUI for visualizing the semantic data, and features word-processing tools like auto complete and a thesaurus. Results are annotated, with key terms highlighted, and filters provide significant refinement, complete with suggestions. Results can also be clustered by either statistics or semantic tags. A personalized dashboard and several options for sharing and publishing round out my list. See the product page for more details.
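Mondeca publishes no code on the product page, so the following is purely illustrative: a hypothetical fuzzy, boolean, relevance-scored query with spelling suggestions, expressed against a made-up “documents” index via the official Elasticsearch Python client (v8-style calls). The index name, field names, and local node URL are all assumptions:

```python
# A minimal sketch, not Mondeca's actual API: fuzzy + boolean querying
# with "did you mean" suggestions over a hypothetical "documents" index.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

response = es.search(
    index="documents",                       # hypothetical index name
    query={
        "bool": {
            "must": [
                # Fuzziness absorbs spelling, accent, and case slips.
                {"match": {"body": {"query": "semantic serch",
                                    "fuzziness": "AUTO"}}}
            ],
            "should": [
                # Boost documents tagged with a controlled business term.
                {"term": {"concepts": "information-governance"}}
            ],
        }
    },
    suggest={
        # Alternatives to the original query, as the product page describes.
        "spelling": {"text": "semantic serch", "term": {"field": "body"}}
    },
    size=10,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```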

Established in 1999 and headquartered in Paris, Mondeca delivers pragmatic semantic solutions to clients in Europe and North America and is proud to have developed its own successful semantic methodology. Perhaps the next time our beloved leader, Stephen E Arnold, visits Paris, the company will make time to speak with him. Previous attempts to set up a meeting came to naught. Ah, France.

Cynthia Murrell, April 9, 2018

Attivio and MC+A Combine Forces

April 7, 2018

Over the years, Attivio positioned itself as more than search. That type of shift has characterized many vendors anchored in search and retrieval. We noted that Attivio has “partnered” with MC+A, a search-centric company. MC+A has also forged a relationship with Coveo, another search and retrieval vendor with a history of repositioning.

We learned from “Attivio and MC+A Announce Partnership to Deliver Next-Generation Cognitive Search Solutions” at Markets Insider that:

“MC+A will resell Attivio’s platform, seamlessly integrate their enterprise-grade connectors into it, and provide SI services in the US market. ‘Partnering with MC+A extends our ability to address organizations’ needs for making all information available to employees and customers at the moment they need it,’ said Stephen Baker, CEO at Attivio. ‘This is particularly critical for companies looking to upgrade legacy search applications onto a modern, machine-learning based search and insight platform.’ …

The story added:

“By combining self-learning technologies, such as natural language processing, machine learning, and information indexing, the Attivio platform is helping Fortune 500 enterprises leverage customer insight, surface upsell opportunities, and improve compliance productivity. MC+A has over 15 years of experience innovating with search and delivering customized search-based applications solutions to enterprises. MC+A has also developed a connector bridge solution that allows customers to leverage existing infrastructure to simplify the transition to the Attivio platform.”

Attivio was founded in 2007, and is headquartered in Newton, Massachusetts. The company’s client roster includes prominent organizations like UBS, Cisco, Citi, and DARPA. Attivio in its early days was similar in some ways to the Fast Search & Transfer technology once cleverly dubbed ESP. No, not extra sensory perception. ESP was the enterprise search platform.

Based in Chicago and founded in 2004, MC+A specializes in implementations of cognitive search and insight engine technology. A couple of years ago, MC+A was involved with Yippy, the former Vivisimo metasearch system. When IBM bought Vivisimo, the metasearch technology morphed into a Big Data component of Watson.

If this walk down memory lane suggests that vendors of proprietary systems have been working to find purchase on revenue mountain, there may be a reason. The big money, based on information available to Beyond Search, comes from integrating open source solutions like Lucene into comprehensive analytic systems.

In a nutshell, the rise of Lucene and Elastic has created opportunities for companies which can deliver more comprehensive solutions than search and retrieval anchored in old-school technology.

More than repositioning, jargon, and partnerships may be needed in today’s marketplace, where “answers,” not laundry lists, are in demand. For mini profiles of vendors which are redefining information access and answering questions, follow the news stories in our new video news program DarkCyber. There’s a new program each week. Plus, you can get a sense of the new directions in information access by reading my 2015 book (still timely and very relevant) CyberOSINT: Next Generation Information Access.

Stephen E Arnold, April 7, 2018

Build an Alternative Google: How To Wanted

April 6, 2018

Hacker News presented an interesting question, “How would you build an internet scale web crawler?” We have been talking with companies which have developed Internet search systems that are not available for free Web search. Those conversations have produced some fascinating information. Some of the data will be included in my upcoming lecture for a government agency and then in my two presentations at the June 2018 Telestrategies ISS Conference in Prague.
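For context, the core of the Hacker News question reduces to a frontier loop. Here is a toy, single-threaded sketch; an internet scale system distributes this frontier across thousands of machines and adds robots.txt handling, politeness delays, and heavy deduplication:

```python
# Toy crawl loop: a frontier queue, a "seen" set for dedup, and crude
# link extraction. Nothing here is internet scale; it shows the skeleton.
import re
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests

LINK_RE = re.compile(r'href="([^"#]+)"')  # crude anchor extraction

def crawl(seed: str, max_pages: int = 50) -> set[str]:
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # never enqueue a URL twice
    fetched = set()
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue           # skip unreachable hosts
        fetched.add(url)
        for href in LINK_RE.findall(resp.text):
            absolute, _ = urldefrag(urljoin(url, href))  # resolve + defragment
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return fetched

if __name__ == "__main__":
    pages = crawl("https://example.com")
    print(f"Fetched {len(pages)} pages")
```

The hard part is everything this sketch omits: sharding the frontier, re-crawl scheduling, spam detection, and storage for billions of documents.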

What was interesting about this question was that few people responded. That is notable because my team’s research for my new presentations on deanonymizing encrypted chat and deanonymizing digital currency transactions pivots on comprehensive Internet indexing. In fact, more companies are indexing Internet content than at any time in the last 10 years.

The second issue the post triggered was a realization that only a handful of people jumped on the topic. With more indexing activity than ever, why aren’t more people helping out JustinGarrison, who posed the question? That’s a question worth thinking about.

Third, one of the responses to the Hacker News question was a pointer to the YaCy.net open source project. We once included this technology in our Internet Research for Law Enforcement training program. My recollection of the system is fuzzy, so I will get one of my team to take a look.

The final thought the Hacker News story triggered was, “Have people just accepted Bing, Google, Qwant, and a handful of metasearch systems as too dominant to challenge?” My view is that an opportunity exists to create a public facing Internet search and retrieval system. The reason? Outstanding alternatives to Bing, Google, and Qwant are available for those who qualify as customers and who are willing to pay the license fees.

My hunch is that just as enterprise search has coalesced around the open source Lucene/Solr technologies, free Web search has become “game over” because the ad supported model has won.

The problem, of course, is that a person looking for information usually does not realize that free Web search results are neither comprehensive, timely, nor objective.

I hope individuals like JustinGarrison get the information needed to seize an opportunity in Internet search.

Stephen E Arnold, April 6, 2018

Artificial Intelligence: Tiny Ears May Listen Well

March 29, 2018

The allegations that Facebook-type companies can “listen” to one’s telephone conversations or regular conversations may be “fake” news. But the idea is worth considering.

Artificial intelligence’s ability to process written data is unparalleled. However, the technology has lagged severely when it comes to spoken words. Soon, that may be a thing of the past, if this recent article is to be believed. We learned more from the Smart Data Collective piece, “Natural Language Processing: An Essential Element of Artificial Intelligence.”

According to the story:

“Natural Language Processing (NLP) is an important part of artificial intelligence which is being researched upon to aid enterprises and businesses in the quick, speedy and fast retrieval of both structured and unstructured organizational data when needed. In simple terms, natural language processing (NLP), is the skill of a machine to understand and process human language within the context in which it is spoken.”
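The quoted definition is abstract, so here is a minimal sketch of the entity and phrase extraction it describes. The article names no toolkit; spaCy and its small English model are my assumption for illustration:

```python
# A minimal NLP sketch: tag entities and noun phrases in a query,
# the raw material for "understanding context."
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("Which vitamins am I missing if I skip breakfast in London?")

print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. ('London', 'GPE')
print([chunk.text for chunk in doc.noun_chunks])     # candidate topics
```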

This technology is really taking off in the food industry. According to sources, shoppers in London are the first to use language processing apps to help them determine what vitamins their bodies may be lacking. It may sound like a stretch, but this is the sweet spot where AI really soars. The technology seems to take off in industries that previously seemed to need no help. Watch for language processing to begin bleeding into everyday life elsewhere, too. If one is carrying a mobile phone, is it listening, recording, converting speech to text, and indexing that content for psychographic analysis?

Patrick Roland, March 29, 2018

De-Archiving: Where Is the Money to Deliver Digital Beef?

February 25, 2018

I read “De-Archiving: What Is It and Who’s Doing It?” I don’t want to dig into the logical weeds of the essay. Let’s look at one passage I highlighted.

As the cost of hot storage continues to drop, economics work in favor of taking more and more of their stored material and putting it online. Millions of physical documents, films, recordings, photographs, and historical data are being converted to online digital assets every year. Soon, anything that was worth saving will also be worth putting online. Tomorrow’s warehouse will be a data center filled with spinning disks that safely store any valuable data – even if it has to be converted to a digital format first. “De-archiving” will be a new vocab word for enterprises and individuals everywhere – and everyone will be doing it in the near future.

My hunch is that the thought leader who wrote the phrase “anything that was worth saving will be worth putting online” has not checked out the holdings of the Library of Congress. The American Memory project, on which I worked, represents a minuscule percentage of the non-text information the LoC holds. Toss in text, boxes of manuscripts, and artifacts (3D imaging and indexing). The amount of money required to convert and index the content might stretch the US budget, which seems to wobble along on continuing resolutions.
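A crude, fully hypothetical calculation illustrates the scale problem. Every figure below is an assumption for arithmetic’s sake, not Library of Congress data:

```python
# Back-of-envelope arithmetic for the de-archiving scale problem.
# All figures are assumptions for illustration only.
items = 170_000_000          # assumed collection size
digitize_cost = 10.00        # assumed average $ per item to scan + index
storage_gb_per_item = 0.05   # assumed 50 MB average per digitized item
storage_cost_gb_year = 0.25  # assumed $ per GB-year of "hot" storage

one_time = items * digitize_cost
yearly = items * storage_gb_per_item * storage_cost_gb_year
print(f"Conversion: ${one_time / 1e9:.1f}B once; "
      f"storage: ${yearly / 1e6:.1f}M per year")
```

Even with these kind numbers, conversion alone lands in the billions; film restoration and 3D artifact imaging would push the real figure far higher.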

Big ideas are great. Reality may not be as great. Movies which can disintegrate during conversion? Yeah, right. Easy. Economical.

Stephen E Arnold, February 25, 2018

How SEO Has Shaped the Web

January 19, 2018

With the benefit of hindsight, big-name thinker Anil Dash has concluded that SEO has contributed to the ineffectiveness of Web search. He examines how we got here in his article, “Underscores, Optimization & Arms Races” at Medium. Starting with the year 2000, Dash traces the development of Internet content management systems (CMSs), of which he was a part. (It is a good brief summary for anyone who wasn’t following along at the time.) WordPress is an example of a CMS.

As Google’s influence grew, online publishers became aware of an opportunity: they could game the search algorithm to move their sites to the top of “relevant” results by playing around with keywords and other content details. The question of whether websites should bow to Google’s whims seemed to go unasked, as site after site fell into this pattern, later to be known as Search Engine Optimization. For Dash, the matter was symbolized by the question of whether hyphens or underscores should represent spaces in web addresses. Now, of course, one can use either without upsetting Google’s algorithm, but that was not the case at first. When Google’s Matt Cutts stated a preference for the hyphen in 2005, most publishers fell in line, including Dash, eventually and very reluctantly; for him, the choice represented nothing less than the very nature of the Internet.
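For anyone who missed the era, the underscore-versus-hyphen question was literally a one-line function in a CMS. A toy sketch (not Dash’s actual code) shows the two outputs publishers argued over:

```python
# The slug choice Dash describes: one title, two separators.
import re

def slugify(title: str, separator: str = "-") -> str:
    # Replace every run of non-alphanumeric characters with the separator.
    slug = re.sub(r"[^a-z0-9]+", separator, title.lower())
    return slug.strip(separator)

title = "Underscores, Optimization & Arms Races"
print(slugify(title, "_"))  # underscores_optimization_arms_races
print(slugify(title, "-"))  # underscores-optimization-arms-races
```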

He writes:

You see, the theory of how we felt Google should work, and what the company had often claimed, was that it looked at the web and used signals like the links or the formatting of webpages to indicate the quality and relevance of content. Put simply, your search ranking with Google was supposed to be based on Google indexing the web as it is. But what if, due to the market pressure of the increasing value of ranking in Google’s search results, websites were incentivized to change their content to appeal to Google’s algorithm? Or, more accurately, to appeal to the values of the people who coded Google’s algorithm?

What Dash did not notice at the time, he muses, was the unsettling development of an entire SEO community centered on appeasing these algorithms. He concludes:

By the time we realized that we’d gotten suckered into a never-ending two-front battle against both the algorithms of the major tech companies and the destructive movements that wanted to exploit them, it was too late. We’d already set the precedent that independent publishers and tech creators would just keep chasing whatever algorithm Google (and later Facebook and Twitter) fed to us. Now, the challenge is to reform these systems so that we can hold the big platforms accountable for the impacts of their algorithms. We’ve got to encourage today’s newer creative communities in media and tech and culture to not constrain what they’re doing to conform to the dictates of an opaque, unknowable algorithm.

Is that doable, or have we gone too far toward appeasing the Internet behemoths to turn back?

Cynthia Murrell, January 19, 2018

Is the End of Google Web Search Coming?

December 20, 2017

I read “Google to Use Mobile Version of a Site to Determine Mobile Rankings.” The info, if on the money, makes clear that the Google cares about mobile search, not desktop-anchored Web search. No surprise. The article reported:

[The write up quoted a Googler as stating:] “Mobile-first indexing means that we’ll use the mobile version of the content for indexing and ranking, to better help our – primarily mobile – users find what they’re looking for.” These changes probably won’t affect end users too much, but it does highlight how Google’s efforts are starting to focus more on mobile.

I think the word for this modest step is “deprecate.” Flash forward a year or so, and what have we got? Less “deep” Google indexing of non-mobile Web sites. Fewer PowerPoints indexed. Fewer PDFs indexed. In short, the lack of rigor in indexing the Railroad Retirement Board comes to boat anchor Web sites.

Web indexing is expensive and likely to be facing “friction” from the net neutrality change. This means mobile is money for the GOOG.

Just a thought from Harrod’s Creek.

Stephen E Arnold, December 20, 2017
