Deindexing SEO Delivers Revenue Results

June 7, 2018

SEO still matters to the Google algorithm and to other search engine crawlers. In my opinion, tweaking Web pages can boost content in some queries. I have a hunch that Google’s system then ignores subsequent tweaks. The Web master then has an opportunity to buy Google advertising, and the content becomes more findable. But that’s just an opinion.

The received wisdom is that the key to great SEO is to generate great content, which the crawlers then index. Robin Rozhon shares that technical SEO has a big impact on your Web site, especially if it is large. In his article, “Crawling & Indexing: Technical SEO Basics That Drive Revenue (Case Study),” Rozhon discusses how to get the most from technical SEO, including the benefits of deindexing.

Rozhon describes an experiment in which the team deindexed more than 400,000 of their 500,000 URLs, roughly 80 percent, because search engines had indexed them as duplicate category URLs. Organic traffic increased significantly. Before you deindex your own pages, check Google Analytics to determine how well they are performing.

Also, to determine which pages to deindex, collect data about the URLs, including their parameters. Use Google Analytics, Google Search Console, Screaming Frog, log files, and other sources to understand each URL’s performance.
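
Rozhon does not publish his scripts, so what follows is only a minimal sketch of that kind of URL triage, assuming hypothetical exports named crawl.csv (one row per crawled URL) and analytics.csv (URL plus organic sessions):

```python
# Hypothetical triage: flag parameterized URLs with little or no organic
# traffic as candidates for noindex. File and column names are assumptions.
from urllib.parse import urlparse, parse_qs

import pandas as pd

crawl = pd.read_csv("crawl.csv")          # one row per crawled URL
analytics = pd.read_csv("analytics.csv")  # url, sessions (organic)

merged = crawl.merge(analytics, on="url", how="left").fillna({"sessions": 0})

def has_query_params(url: str) -> bool:
    """True if the URL carries query parameters (e.g. ?color=red&brand=nike)."""
    return bool(parse_qs(urlparse(url).query))

merged["parameterized"] = merged["url"].apply(has_query_params)

# Candidates: parameterized URLs that earned fewer than 10 organic sessions.
candidates = merged[(merged["parameterized"]) & (merged["sessions"] < 10)]
candidates[["url", "sessions"]].to_csv("deindex_candidates.csv", index=False)
print(f"{len(candidates)} URLs flagged for review before deindexing")
```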

Facets and filters are another important source of extra URLs:

“Faceted navigation is another common troublemaker on ecommerce websites we have been dealing with. Every combination of facets and filters creates a unique URL. This is a good thing and a bad thing at the same time, because it creates tons of great landing pages but also tons of super specific landing pages no one cares about.”

They also have pros and cons:

I learned this about “facets”:

  • Facets are discoverable, crawlable, and indexable by search engines;
  • Wait! Facets are not discoverable if multiple items from the same facet are selected (e.g. Adidas and Nike t-shirts).
  • Facets contain self-referencing canonical tags;

And what about filters?

  • Filters are not discoverable;
  • Filters contain a “noindex” tag;
  • Filters use URL parameters that are configured in Google Search Console and Bing Webmaster Tools.
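
The lists above boil down to a routing rule. Rozhon’s exact implementation is not shown, so the sketch below is only a toy version with invented parameter names (brand and category standing in for facets; sort, price, and size standing in for filters):

```python
# A toy routing rule, not Rozhon's code. Parameter names are assumptions
# made for illustration only.
from urllib.parse import urlparse, parse_qs

FACET_PARAMS = {"brand", "category"}
FILTER_PARAMS = {"sort", "price", "size"}

def head_tags(url: str) -> str:
    """Return the robots/canonical markup this toy rule would emit for a URL."""
    params = parse_qs(urlparse(url).query)
    uses_filters = any(p in FILTER_PARAMS for p in params)
    # Multi-select within one facet (e.g. brand=adidas&brand=nike) is treated
    # like a filter here -- a simplification of "not discoverable" above.
    multi_select = any(len(v) > 1 for p, v in params.items() if p in FACET_PARAMS)

    if uses_filters or multi_select:
        return '<meta name="robots" content="noindex, follow">'
    # Plain facet pages stay indexable with a self-referencing canonical tag.
    return f'<link rel="canonical" href="{url}">'

print(head_tags("https://example.com/t-shirts?brand=adidas"))
print(head_tags("https://example.com/t-shirts?brand=adidas&sort=price_asc"))
```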

As a librarian, I believe that old school ideas have found their way into the zippy modern approach to indexing via humans and semi smart software.

In the end, consolidate pages and remove any dead weight to drive traffic to the juicy content and increase sales. Why did they not say that to begin with, instead of putting us through the technical jargon?

Whitney Grace, June 7, 2018

Index Is Important. Yes, Indexing.

March 8, 2017

I read “Ontologies: Practical Applications.” The main idea in the write up is that indexing is important. Now indexing is labeled in different ways today; for example, metadata, entity extraction, concepts, etc. I agree that indexing is important, but the challenge is that most people are happy with tags, keywords, or systems which return a result that has made a high percentage of users happy. Maybe semi-happy. Who really knows? Asking about search and content processing system satisfaction returns the same grim news year after year; that is, most users (roughly two thirds) are not thrilled with the tools available to locate information. Not much progress in 50 years it seems.

The write up informs me:

Ontologies are a critical component of the enterprise information architecture. Organizations must be capable of rapidly gathering and interpreting data that provides them with insights, which in turn will give their organization an operational advantage.  This is accomplished by developing ontologies that conceptualize the domain clearly, and allows transfer of knowledge between systems.

This seems to mean a classification system which makes sense to those who work in an organization. The challenge we have encountered over the last half century is that the content and data flowing into an organization change, often rapidly, over time. At any given moment, the information needed today is not yet available. The organization sucks in what’s needed and hopes the information access system indexes the new content right away and makes it findable and usable in other software.

That’s the hope anyway.

The reality is that a gap exists between what’s accessible to a person in an organization and what information is being acquired and used by others in the organization. Search fails for most system users because what’s needed now is not indexed or, if indexed, is not findable.

An ontology is a fancy way of saying that a consultant and software can cook up a classification system and use those terms to index content. Nifty idea, but what about that gap?

This is the killer for most indexing outfits. They make a sale because people are dissatisfied with the current methods of information access. An ontology or some other jazzed up indexing component is sold as the next big thing.

When an ontology, taxonomy, or other solution does not solve the problem, the company grouses about search and content processing again.

Is there a fix? Who knows. But after 50 years in the information access sector, I know that jargon is not an effective way to solve very real problems. Money, know-how, and old school methods are needed to make certain technologies deliver useful applications.

Ontologies. Great. Silver bullet. Nah. Practical applications? Nifty concept. Reality is different.

Stephen E Arnold, March 8, 2017

The Pros and Cons of Human Developed Rules for Indexing Metadata

February 15, 2017

The article on Smartlogic titled The Future Is Happening Now puts forth the Semaphore platform as the technology filling the gap between NLP and AI when it comes to conversation. The article posits that in spite of the great strides in AI in the past 20 years, human speech is one area where AI still falls short.

The reason for this, according to the article, is that “words often have meaning based on context and the appearance of the letters and words.” It’s not enough to be able to identify a concept represented by a bunch of letters strung together. There are many rules that need to be put in place that affect the meaning of a word, from its placement in a sentence, to grammar, to the words around it – all of these things are important.

Advocating human developed rules for indexing is certainly interesting, and the author compares this logic to the process of raising her children to be multi-lingual. Semaphore is a model-driven, rules-based platform that allows users to auto-generate usage rules in order to expand the guidelines for a machine as it learns. The issue here is cost. Indexing large amounts of data is extremely cost-prohibitive, and that is before the maintenance of the rules even becomes part of the equation. In sum, this is a very old school approach to AI that may make many people uncomfortable.
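
Semaphore itself is proprietary, so the following is only a rough sketch of what a hand-written tagging rule can look like, with invented concepts and rules. It also hints at why authoring and maintaining thousands of such rules gets expensive:

```python
# A toy rules-based tagger, not Semaphore. Each rule is hand-written, and
# every new meaning of a word means another rule to write and maintain.
import re
from dataclasses import dataclass

@dataclass
class Rule:
    concept: str
    pattern: str        # term to match
    requires: str = ""  # context word that must also appear in the text

    def applies(self, text: str) -> bool:
        text = text.lower()
        if not re.search(rf"\b{self.pattern}\b", text):
            return False
        return self.requires in text if self.requires else True

RULES = [
    Rule("Financial/Banking", "bank", requires="loan"),
    Rule("Geography/Rivers", "bank", requires="river"),
    Rule("Apparel", "t-shirt"),
]

def tag(text: str) -> list[str]:
    return [r.concept for r in RULES if r.applies(text)]

print(tag("The river bank flooded after the storm."))  # ['Geography/Rivers']
print(tag("The bank approved the loan application."))  # ['Financial/Banking']
```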

Chelsea Kerwin, February 15, 2017

Indexing: The Big Wheel Keeps on Turning

January 23, 2017

Yep, indexing is back. The cacaphone “ontology” is the next big thing yet again. Folks, an ontology is a form of metadata. There are key words, categories, and classifications. Whipping these puppies into shape has been the thankless task of specialists for hundreds if not thousands of years. “What Is an Ontology and Why Do I Want One?” tries to make indexing more alluring. When an enterprise search system delivers results which are off the user’s information need or just plain wrong, it is time for indexing. The problem is that machine based indexing requires some well informed humans to keep the system on point. Consider Palantir Gotham. Content finds its way into the system when a human performs certain tasks. Some of these tasks are riding herd on the indexing of the content object. IBM Analyst’s Notebook and many other next generation information access systems work hand in glove with expensive humans. Why? Smart software is still only sort of smart.

The write up dances around the need for spending money on indexing. The write up prefers to confuse a person who just wants to locate the answer to a business related question without pointing, clicking, and doing high school research paper dog work. I noted this passage:

Think of an ontology as another way to classify content (like a taxonomy) that allows you to identify what the content is about and how it relates to other types of content.

Okay, but enterprise search generally falls short of the mark for 55 to 70 percent of a search system’s users. This is a downer. What makes enterprise search better? An ontology. But without the cost and time metrics, the yap about better indexing ends up with “smart content” companies looking confused when their licenses are not renewed.
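
The write up never shows what the extra structure actually buys. As a toy contrast only, with invented product terms: a taxonomy is a tree of broader and narrower terms, while an ontology also records typed relationships you can query.

```python
# Toy illustration, not from the write up: the same concepts as a flat
# taxonomy (parent/child only) and as an ontology (typed relations).
taxonomy = {
    "Products": ["Printers", "Toner Cartridges"],
    "Documents": ["Purchase Orders"],
}

ontology = [
    ("Toner Cartridges", "is_consumable_for", "Printers"),
    ("Purchase Orders", "references", "Printers"),
    ("Purchase Orders", "references", "Toner Cartridges"),
]

def related(concept: str) -> list[tuple[str, str]]:
    """An ontology query a flat taxonomy cannot answer: how is a concept connected?"""
    out = [(rel, obj) for subj, rel, obj in ontology if subj == concept]
    out += [(rel, subj) for subj, rel, obj in ontology if obj == concept]
    return out

print(related("Printers"))
# [('is_consumable_for', 'Toner Cartridges'), ('references', 'Purchase Orders')]
```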

What I found amusing about the write up is its claim that use of an ontology improves search engine optimization. How about some hard data? Generalities are presented instead of numbers one can examine and attempt to verify.

SEO means getting found when a user runs a query. That does not work too well for general purpose Web search systems like Google. SEO is struggling to deal with declining traffic to many Web sites and the problem mobile search presents.

But in an organization, SEO is not what the user wants. The user needs the purchase order for a client and easy access to related data. Will an ontology deliver an actionable output? To be fair, different types of metadata are needed. An ontology is one such type, but there are others. Some of these can be extracted without too high an error rate when the content is processed; for example, telephone numbers. Other types of data require different processes which can require knitting together different systems.
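
The telephone number example is the easy end of the spectrum. A minimal sketch of that sort of low-error extraction, deliberately limited to U.S.-style numbers for illustration, is little more than a pattern match:

```python
# Extracting one easy metadata type (U.S.-style phone numbers) from processed
# content. The regex is illustrative and deliberately narrow.
import re

PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

def extract_phones(text: str) -> list[str]:
    return PHONE.findall(text)

doc = "Call the client at (502) 555-0137 or 502-555-0188 about the purchase order."
print(extract_phones(doc))  # ['(502) 555-0137', '502-555-0188']
```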

To build a bubble gum card, one needs to parse a range of data, including images and content from many sources. In most organizations, silos of data persist and will continue to persist. Money is tight. Few commercial enterprises can afford to do the computationally intensive content processing under the watchful eye and informed mind of an indexing professional.

Cacaphones like “ontology” exacerbate the confusion about indexing and delivering useful outputs to users who don’t know a Boolean operator from a SQL expression.

Indexing is a useful term. Why not use it?

Stephen E Arnold, January 23, 2017

Deindexing: A Thing?

October 12, 2016

There was the right to be forgotten. There were reputation management companies promising to scrub unwanted information from indexes using humans, lawyers (a different species, of course), and software agents.

Now I have learned that “dozens of suspicious court cases, with missing defendants, aim at getting web pages taken down or deindexed.” The write up asserts:

Google and various other Internet platforms have a policy: They won’t take down material (or, in Google’s case, remove it from Google indexes) just because someone says it’s defamatory. Understandable — why would these companies want to adjudicate such factual disputes? But if they see a court order that declares that some material is defamatory, they tend to take down or deindex the material, relying on the court’s decision.

Two thoughts:

  1. Have reputation management experts cooked up some new broth?
  2. How long will the lovely word “deindex” survive in the maelstrom of the information flow?

I love the idea of indexing content. Perhaps there is a new opportunity for innovation with the deindexing thing? Semantic deindexing? Structured deindexing? And my fave: unstructured deindexing in federated, cloud-based data lakes. I wish I were 21 years old again. A new career beckons with declassification codes, delanguage processing, and even desmart software.

Stephen E Arnold, October 12, 2016

Toshiba Amps up Vector Indexing and Overall Data Matching Technology

September 13, 2016

The article on MyNewsDesk titled Toshiba’s Ultra-Fast Data Matching Technology is 50 Times Faster than its Predecessors relates the bold claims swirling around Toshiba and their Vector Indexing Technology. By skipping the step involving computation of the distance between vectors, Toshiba has slashed the time it takes to identify vectors (they claim). The article states,

Toshiba initially intends to apply the technology in three areas: pattern mining, media recognition and big data analysis. For example, pattern mining would allow a particular person to be identified almost instantly among a large set of images taken by surveillance cameras, while media recognition could be used to protect soft targets, such as airports and railway stations, by automatically identifying persons wanted by the authorities.

In sum, Toshiba technology is able to quickly and accurately recognize faces in the crowd. But the specifics are much more interesting. Current technology takes around 20 seconds to identify an individual out of 10 million, and Toshiba can do it in under a second. The precision rates that Toshiba reports are also outstanding at 98%. The world of Minority Report, where ads recognize and direct themselves to random individuals, seems to be increasingly within reach. Perhaps more importantly, this technology should be of dire importance to the criminal and perceived criminal populations of the world.
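
Toshiba has not published the algorithm here, so the sketch below shows only the brute-force baseline the claim is measured against: compute a similarity score for every stored vector and keep the best one. Whatever shortcut Toshiba applies, this is the exhaustive step it says it avoids (the data are random and the gallery is scaled down from the article’s 10 million records).

```python
# Brute-force vector matching: the exhaustive similarity computation that
# Toshiba says its Vector Indexing Technology avoids. Toy data only.
import numpy as np

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100_000, 128)).astype(np.float32)  # stored face vectors
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)     # unit length

query = rng.normal(size=128).astype(np.float32)
query /= np.linalg.norm(query)

# Cosine similarity against every stored vector, then argmax: O(N * d) work.
scores = gallery @ query
best = int(np.argmax(scores))
print(f"closest match: vector {best}, similarity {scores[best]:.3f}")
```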

Chelsea Kerwin, September 13, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

There is a Louisville, Kentucky Hidden Web/Dark Web meet up on September 27, 2016. Information is at this link: https://www.meetup.com/Louisville-Hidden-Dark-Web-Meetup/events/233599645/

Search without Indexing

April 27, 2016

I read “Outsmarting Google Search: Making Fuzzy Search Fast and Easy Without Indexing.”

Here’s a passage I highlighted:

It’s clear the “Google way” of indexing data to enable fuzzy search isn’t always the best way. It’s also clear that limiting the fuzzy search to an edit distance of two won’t give you the answers you need or the most comprehensive view of your data. To get real-time fuzzy searches that return all relevant results you must use a data analytics platform that is not constrained by the underlying sequential processing architectures that make up software parallelism. The key is hardware parallelism, not software parallelism, made possible by the hybrid FPGA/x86 compute engine at the heart of the Ryft ONE.

I also circled:

By combining massively parallel FPGA processing with an x86-powered Linux front-end, 48 TB of storage, a library of algorithmic components and open APIs in a small 1U device, Ryft has created the first easy-to-use appliance to accelerate fuzzy search to match exact search speeds without indexing.
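
Ryft’s FPGA hardware obviously cannot be reproduced in a blog post. The toy below only illustrates the software idea the quote criticizes: an index-free scan of the raw text, with the edit-distance-of-two limit the write up mentions.

```python
# Index-free fuzzy search: scan every token and keep those within a chosen
# edit distance of the query. A pure-Python toy, nothing like Ryft's
# FPGA/x86 appliance; it only illustrates the idea.
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_search(query: str, text: str, max_distance: int = 2) -> list[str]:
    """Return tokens within max_distance edits of the query, no index needed."""
    return [tok for tok in set(text.lower().split())
            if edit_distance(query.lower(), tok) <= max_distance]

corpus = "The Ryft appliance promises realtime fuzy search without indexes"
print(fuzzy_search("fuzzy", corpus))  # ['fuzy']
```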

An outfit called InsideBigData published “Ryft Makes Real-time Fuzzy Search a Reality.” Alas, that link is now dead.

Perhaps a real time fuzzy search will reveal the quickly deleted content?

Sounds promising. How does one retrieve information within videos, audio streams, and images? How does one hook together or link a reference to an entity (discovered without controlled term lists) with a phone number?

My hunch is that the methods disclosed in the article have promise, and the future of search seems to be lurching toward applications that solve real world, real time problems. Ryft may be heading in that direction in a search climate which presents formidable headwinds.

Stephen E Arnold, April 27, 2016

Visual Content: An Indexing Challenge

December 4, 2015

The average bounce rate on blogs for new visitors is 60.2%, and the average reader stays only 1 to 2 minutes on your website. One way to get people to really engage with your content is to use a tool like Roojoom, which is a content curation and creation platform.

Here’s one example from the write up:

Roojoom lets you collect content from your online and offline sources (such as your web pages, videos, PDFs and marketing materials) to create a “content journey” for readers. You then guide readers step by step through the journey, all from within one centralized place.

I read “5 Visual Content Tools to Boost Engagement.” The write up points to a handful of services which generate surveys, infographics, and collages of user supplied photos. If I knew a millennial, I can imagine hearing the susurration of excitement emitted by the lad or lass.

Now I don’t want to rain on the innovation parade. Years ago, an outfit called i2 Group Ltd. developed a similar solution. After dogging and ponying the service, it became clear that in the early 2000s, there was not much appetite for this type of data exploration. i2 eventually sold out to IBM and the company returned to its roots in intelligence and law enforcement.

The thought I had after reading about Roojoom and the other services was this:

How will the information be indexed and made findable?

As content becomes emoji-ized, the indexing task does not become easier. Making sense of images is not yet a slam dunk. Heck, automated indexing only shoots accurately 80 to 90 percent of the time. In a time of heightened concern about risks, is a one in five bet a good one? I try to narrow the gap, but many are okay without worrying too much.

As visual content becomes more desirable, the indexing systems will have to find a way to make this content findable. Words baffle many of these content processing outfits. Pictures are another hill to climb. If it is not indexed, the content may not be findable. Is this a problem for researchers and analysts? And for you, gentle reader?

Stephen E Arnold, December 4, 2015

Indexing: A Cautionary Example

November 17, 2015

I read “Half of World’s Museum Specimens Are Wrongly Labeled, Oxford University Finds.” Anyone involved in indexing knows the perils of assigning labels, tags, or what the whiz kids call metadata to an object.

Humans make mistakes. According to the write up:

As many as half of all natural history specimens held in some of the world’s greatest institutions are probably wrongly labeled, according to experts at Oxford University and the Royal Botanic Garden in Edinburgh. The confusion has arisen because even accomplished naturalists struggle to tell the difference between similar plants and insects. And with hundreds or thousands of specimens arriving at once, it can be too time-consuming to meticulously research each and guesses have to be made.

Yikes. Only half. I know that human indexers get tired. Now there is just too much work to do. The reaction is typical of busy subject matter experts. Just guess. Close enough for horse shoes.

What about machine indexing? Anyone who has retrained an HP Autonomy system knows that humans get involved as well. If humans make mistakes with bugs and weeds, imagine what happens when a human has to figure out a blog post in a dialect of Korean.

The brutal reality is that indexing is a problem. When dealing with humans, the problems do not go away. When humans interact with automated systems, the automated systems make mistakes, often more rapidly than the sorry human indexing professionals do.

What’s the point?

I would sum up the implication as:

Do not believe a human (indexing species or marketer of automated indexing species).

Acceptable indexing with accuracy above 85 percent is very difficult to achieve. Unfortunately the graduates of a taxonomy boot camp or the entrepreneur flogging an automatic indexing system which is powered by artificial intelligence may not be reliable sources of information.
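
Rather than take a vendor’s or a boot camp graduate’s word for it, the accuracy figure can be checked against a hand-reviewed sample. A minimal sketch of that audit, with an invented gold standard, looks like this:

```python
# Toy audit of automated tagging against a small human-reviewed sample.
# The documents and tags are invented; only the arithmetic matters.
gold = {                       # tags a subject matter expert assigned
    "doc1": {"botany", "taxonomy"},
    "doc2": {"korean", "blog"},
    "doc3": {"museums", "labels"},
}
system = {                     # tags the automated indexer assigned
    "doc1": {"botany", "zoology"},
    "doc2": {"korean", "blog"},
    "doc3": {"museums"},
}

correct = sum(len(gold[d] & system[d]) for d in gold)
assigned = sum(len(system[d]) for d in gold)
expected = sum(len(gold[d]) for d in gold)

precision = correct / assigned   # how many assigned tags were right
recall = correct / expected      # how many right tags were assigned
print(f"precision {precision:.0%}, recall {recall:.0%}")
# precision 80%, recall 67% -- below the 85 percent bar discussed above
```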

I know that this notion of high error rates is disappointing to those who believe their whizzy new system works like a champ.

Reality is often painful, particularly when indexing is involved.

What are the consequences? Here are three:

  1. Results of queries are incomplete or just wrong
  2. Users are unaware of missing information
  3. Failure to maintain either human, human assisted, or automated systems results in indexing drift. Eventually the indexing is just misleading if not incorrect.

How accurate is your firm’s indexing? How accurate is your own indexing?

Stephen E Arnold, November 17, 2015

RAVN Pipeline Coupled with ElasticSearch to Improve Indexing Capabilities

October 28, 2015

The article on PR Newswire titled RAVN Systems Releases its Enterprise Search Indexing Platform, RAVN Pipeline, to Ingest Enterprise Content Into ElasticSearch unpacks the decision to improve the ElasticSearch platform by pairing it with RAVN’s indexing platform, the RAVN Pipeline. RAVN Systems is a UK company with expertise in processing unstructured data, founded by consultants and developers. Their stated goal is to discover new lands in the world of information technology. The article states,

“RAVN Pipeline delivers a platform approach to all your Extraction, Transformation and Load (ETL) needs. A wide variety of source repositories including, but not limited to, File systems, e-mail systems, DMS platforms, CRM systems and hosted platforms can be connected while maintaining document level security when indexing the content into Elasticsearch. Also, compressed archives and other complex data types are supported out of the box, with the ability to retain nested hierarchical structures.”

The added indexing ability is very important, especially for users trying to index from or into cloud-based repositories. Even a single instance of any type of data can be indexed with the Pipeline, which also enriches data during indexing with auto-tagging and classifications. The article also promises that non-specialists (by which I assume they mean people) will be able to use the new systems due to their being GUI-driven and intuitive.
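
The RAVN Pipeline is a commercial product, so the sketch below shows only the final step it automates: writing an enriched document, with tags and a document-level ACL, into Elasticsearch via the official Python client. The index name, fields, and values are invented for illustration.

```python
# A sketch of the final ingest step only: the RAVN Pipeline is commercial,
# so the document, tags, and ACL below are invented for illustration
# (elasticsearch-py 8.x style client calls).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

doc = {
    "title": "Purchase order 1001",
    "body": "Toner cartridges for the third floor printers.",
    "tags": ["procurement", "printing"],          # auto-tagging output
    "acl": ["group:finance", "user:jsmith"],      # document-level security
    "source": "dms",                              # originating repository
}

es.index(index="enterprise-content", id="po-1001", document=doc)
print(es.get(index="enterprise-content", id="po-1001")["_source"]["tags"])
```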

Chelsea Kerwin, October 28, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 
