Bing: No More Public URL Submissions

September 19, 2018

Ever wondered why some Web site content is not indexed? Heck, ever talk to a person who cannot find their Web site in a “free” Web index? I know that many people believe that “free” Web search services are comprehensive. Here’s a thought: The Web indexes are not comprehensive. The indexing is selective, disconnected from meaningful date and time stamps, and often limited to following links to a specified depth; for example, three levels down or fewer in many cases.

I thought about the perception of comprehensiveness when I read “Bing Is Removing Its Public URL Submission Tool.” The tool allowed a savvy SEO professional or an informed first-time Web page creator to let Bing know that a site was online and ready for indexing.

No more.

How do “free” Web indexes find new sites? Now that’s a good question, and the answers range from “I don’t know” to “Bing and Google are just able to find these sites.”

A couple of thoughts:

  • Editorial or spidering policies are not spelled out by most Web indexing outfits
  • Users assume that if information is available online, that information is accurate
  • “Free” Web indexing services are not set up to deliver results that are necessarily timely (indexed on a daily basis) or comprehensive.

Bing’s allegedly turning off public URL submissions is a small thing. My question: “Who looked at these submissions and made a decision about what to index or exclude from indexing?” Perhaps the submission form operated like a thermostat control in a hotel room?

Stephen E Arnold, September 18, 2018

Semantic Struggles and Metadata

August 31, 2018

I have noticed the flood of links and social media posts about semantics from David Amerland. I found many of the observations interesting; a few struck me as a wildly different view of indexing. A recent essay by David Amerland, “Snipers Use Metadata Much Like Semantic Search Does,” caught the Beyond Search team’s attention.


Learn about “The Sniper Mind” at this link.

According to the story:

“There are two key takeaways here [about metadata and trained killers]: First, such skills are directly transferable in the business domain and even in most life situations. Second, in order to use their brain in this way snipers need training. The mental training and the psychological aids that are developed as a result of it is what I detailed…”

We must admit that it is a fresh metaphor: comparing killers’ use of indexing with semantic search. In our experience with professional indexing systems and human indexers, the word “sniper” has not, to our recollection, been used.

Watch your back, your blind side, and your ontology. Oh, also your metaphors.

Patrick Roland, August 31, 2018

Deindexing SEO Delivers Revenue Results

June 7, 2018

SEO still matters to the Google algorithm and to other search engines’ crawlers. In my opinion, tweaking Web pages can result in a boost for content in some queries. I have a hunch that Google’s system then ignores subsequent tweaks. The Web master then has an opportunity to buy Google advertising, and the content becomes more findable. But that’s just an opinion.

The received wisdom is that the key to great SEO is to generate great content, which the crawlers then index. Robin Rozhon argues that technical SEO has a big impact on your Web site, especially if it is large. In his article, “Crawling & Indexing: Technical SEO Basics That Drive Revenue (Case Study),” Rozhon discusses how to maximize technical SEO, including the benefits of deindexing.

Rozhon ran an experiment in which the team deindexed more than 400,000 of roughly 500,000 indexed URLs, about 80 percent of them, because search engines had indexed them as duplicate category URLs. Organic traffic increased markedly. Before you deindex pages on your own site, check Google Analytics to determine how well those pages are doing.

Also, to determine which pages to deindex, collect data about each URL: find out what its parameters are and gather other performance data. Use Google Analytics, Google Search Console, Screaming Frog, log files, and other sources of data about the URL to understand its performance.
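The article is prose, not code, but a minimal sketch of this kind of data pull, assuming a crawl export and a bot-filtered log file with the hypothetical file and column names shown, might look like this:

```python
import csv
from collections import Counter

# Hypothetical inputs: a crawl export (for example, from Screaming Frog) and a
# server log already reduced to one search engine bot request URL per line.
CRAWL_EXPORT = "crawl_export.csv"   # assumed columns: url, canonical
LOG_FILE = "bot_requests.txt"

# Count how often bots actually request each URL.
with open(LOG_FILE) as handle:
    bot_hits = Counter(line.strip() for line in handle if line.strip())

deindex_candidates = []
with open(CRAWL_EXPORT, newline="") as handle:
    for row in csv.DictReader(handle):
        url = row["url"]
        # Flag URLs whose canonical points elsewhere (duplicates) or that bots
        # rarely request; the threshold of 2 is purely illustrative.
        is_duplicate = row.get("canonical", url) not in ("", url)
        rarely_crawled = bot_hits.get(url, 0) < 2
        if is_duplicate or rarely_crawled:
            deindex_candidates.append(url)

print(f"{len(deindex_candidates)} URLs to review for possible deindexing")
```

The thresholds are placeholders; the point is only that the deindexing decision comes from joined data, not guesswork.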

Facets and filters are another important consideration for URLs:

“Faceted navigation is another common troublemaker on ecommerce websites we have been dealing with. Every combination of facets and filters creates a unique URL. This is a good thing and a bad thing at the same time, because it creates tons of great landing pages but also tons of super specific landing pages no one cares about.”

Facets and filters each have pros and cons. I learned this about “facets”:

  • Facets are discoverable, crawlable, and indexable by search engines;
  • Wait! Facets are not discoverable if multiple items from the same facet are selected (e.g. Adidas and Nike t-shirts).
  • Facets contain self-referencing canonical tags;

And what about filters?

  • Filters are not discoverable;
  • Filters contain a “noindex” tag;
  • Filters use URL parameters that are configured in Google Search Console and Bing Webmaster Tools (a minimal indexability check is sketched below).
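The canonical and noindex signals mentioned above are ordinary elements in a page’s HTML head, so they are easy to spot-check. Here is a minimal, illustrative check, assuming the requests and BeautifulSoup libraries and a made-up facet URL; it is not code from the case study:

```python
import requests
from bs4 import BeautifulSoup

def indexability_signals(url: str) -> dict:
    """Fetch a page and report its robots meta tag and canonical target."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    robots = soup.find("meta", attrs={"name": "robots"})
    canonical = soup.find("link", rel="canonical")

    return {
        "url": url,
        # A filter page would typically carry "noindex" here.
        "robots": robots["content"] if robots else "(none)",
        # A facet page would typically point the canonical at itself.
        "canonical": canonical["href"] if canonical else "(none)",
    }

# Hypothetical faceted URL on an ecommerce site.
print(indexability_signals("https://example.com/t-shirts?brand=adidas"))
```

A filter page should report “noindex” in the robots field; a facet page should report a canonical pointing at itself.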

As a librarian, I believe that old-school ideas have found their way into the zippy modern approach to indexing via humans and semi-smart software.

In the end, consolidate pages and remove any dead weight to drive traffic to the juicy content and increase sales. Why did they not say that to begin with, instead of putting us through the technical jargon?

Whitney Grace, June 7, 2018

Fake News May Be a Forever Feature

June 4, 2018

While the world’s big names in social media go on tour to tout the ways in which they are snuffing out fake news, the fake news machine keeps rolling along. Mark Zuckerberg and company can do all the testifying in Washington they want, but that does not mean the criminal element will just curl up and go away. They certainly aren’t going anywhere when there is money to be made and there is plenty of that, according to a surprising BoingBoing story, “It’s Laughably Simple to Buy Thousands of Cheap, Plausible Facebook Identities.”

According to the story:

“[F]or $13, a Buzzfeed reporter was able to buy the longstanding Facebook profile of a fake 23 year old British woman living in London with 921 friends and a deep, plausible dossier of activities, likes and messages. The reporter’s contact said they could supply 5,000 more Facebook identities at any time.”

The danger is that there is essentially no way to really stop this as bot makers get more sophisticated and adjust to Facebook and other social media outlets’ algorithm changes. Some experts even fear that this unstoppable tide of bots will have deadly consequences. We’ll keep watching this story, but don’t have a lot of faith things will get better any time soon.

Patrick Roland, June 4, 2018

Are Auto Suggestions Inherently Problematic?

June 3, 2018

Politics is a dangerous subject to bring up in any social situation. My advice is to keep quiet and nod; then you can avoid loudmouths trying to shove their agendas down your throat. Despite attempts to remain polite, the Internet always brings out the worst in people, and The Sun shows how a simple search engine function does it in "'Trump Should Be Shot' Google And Bing Searches For 'Trump' And 'Conservatives' Offer Disgusting Auto-Suggestions."

Auto-complete is notorious for making hilarious mistakes, and the same is true of auto-suggest on search engines, but these results are more gruesome than a misspelling.

If you want to see some interesting suggestions, type “Trump should be…” into a blank search bar. The results are endless, including: shot, arrested, killed, in jail, and banned from Twitter (okay, the last one might be a little funny).

Typing in “conservatives need…” results in less derogatory terms, but the auto-suggestions include: to die, to go, a new party, and not apply.


What creates these auto-suggestions?

“These are based on a number of factors including real-time searches, trending results, your location, and previous activity. The intuitive predictions change in "response to new characters being entered into the search box," explains Google. And the company also has its own set of "autocomplete policies" in case something untoward should pop up. Along with prohibiting predictions that contain sexually explicit, violent, and harmful terms, Google says it also removes hateful suggestions against groups and individuals. 'We remove predictions that include graphic descriptions of violence or advocate violence generally,' states the firm.”
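As a rough illustration of the policy described in that passage, and emphatically not Google’s actual system, prefix matching plus a denylist filter might be sketched like this; the query log and blocked terms are invented:

```python
# Illustrative only: prefix suggestions with a policy filter, loosely mirroring
# the "remove violent predictions" rule quoted above. Data is made up.
QUERY_LOG = [
    "politicians should be term limited",
    "politicians should be shot",
    "politicians should be paid less",
]
BLOCKED_TERMS = {"shot", "killed", "die"}

def suggest(prefix: str, log=QUERY_LOG, blocked=BLOCKED_TERMS):
    """Return logged queries that extend the prefix, minus policy violations."""
    prefix = prefix.lower()
    candidates = [q for q in log if q.startswith(prefix)]
    return [q for q in candidates if not set(q.split()) & blocked]

print(suggest("politicians should be"))
# ['politicians should be term limited', 'politicians should be paid less']
```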

Google and Bing deserve some credit for removing the slander from auto-complete, but sometimes they only do it when they are pushed. Trolls and bigots create these terms, and it would be nice to see them scrubbed from auto-suggest, but that is nearly impossible. Hey, Bing and Google, try scrubbing 4chan!

Whitney Grace, June 3, 2018

Google: Excellence Evolves to Good Enough

May 25, 2018

I read “YouTube’s Infamous Algorithm Is Now Breaking the Subscription Feed.” I assume the write up is accurate. I believe everything I read on the Internet.

The main point of the write up seems to me to be that good enough is the high water mark.

I noted this passage, allegedly output by a real, thinking Googler:

Just to clarify. We are currently experimenting with how to show content in the subs feed. We find that some viewers are able to more easily find the videos they want to watch when we order the subs feed in a personalized order vs always showing most recent video first.

I also found this statement interesting:

With chronological view thrown out, it’s going to become even more difficult to find new videos you haven’t seen — especially if you follow someone who uploads at a regular time each day.

I would like to mention that Google, along with In-Q-Tel, invested in Recorded Future. That company has some pretty solid date and time stamping capabilities. Furthermore, my hunch is that the founders of the company know the importance of time metadata to some of Recorded Future’s customers.

What would happen if Google integrated some of Recorded Future’s time capabilities into YouTube and into good old Google search results?

From my point of view, good enough means “sells ads.” But I am usually incorrect, and I expect to learn just how off base I am when I explain how one eCommerce giant is about to modify the landscape for industrial strength content analysis. Oh, that company’s technology does the date and time metadata pretty well.

More on this mythical “revolution” on June 5th and June 6th. In the meantime, try and find live feeds of the Hawaii volcano event using YouTube search. Helpful, no?

Stephen E Arnold, May 25, 2018

LightTag Helps AI Developers Label Training Data

May 16, 2018

The creators of LightTag are betting on the AI boom, we learn from TechCrunch’s post, “LightTag Is a Text Annotation Platform for Data Scientists Creating AI Training Data.” Built by a former Natural Language researcher for Citigroup, the shiny new startup hopes to assist AI developers with one of their most labor-intensive and error-prone tasks—labeling the data used to train AI systems. Since it is a job carried out by teams of imperfect humans, errors often abound. LightTag’s team-based workflow, user interface, and quality controls are designed to mitigate these imperfections. Writer Steve O’Hear cites founder Tal Perry as he reports:

“Perry says LightTag’s annotation interface is designed to keep labelers ‘effective and engaged’. It also employs its own ‘AI’ to learn from previous labeling and make annotation suggestions. The platform also automates the work of managing a project, in terms of assigning tasks to labelers and making sure there is enough overlap and duplication to keep accuracy and consistency high. ‘We’ve made it dead-simple to annotate with a team (sounds obvious, but nothing else makes it easy),’ he says. ‘To make sure the data is good, LightTag automatically assigns work to team members so that there is overlap between them. This allows project managers to measure agreement and recognize problems in their project early on. For example, if a specific annotator is performing worse than others’.”

For organizations in certain industries, like healthcare, law, and banking, that simply cannot risk outsourcing the task, LightTag offers an on-premise version. The write-up includes a couple of GIFs of the software at work, so check it out if curious. Though it only recently launched publicly, the beta software has been tried out by select clients, including these noteworthy uses: an energy company is using it to predict drilling issues at certain depths with data from oil-rig logs, and a medical imaging company has used it to label MRI-scan reports. We are curious to see whether the young startup will be able to capitalize on the current AI boom, as Perry predicts.
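The overlap-and-agreement idea Perry describes is standard practice in annotation projects. A minimal sketch of pairwise agreement on overlapping items, with invented annotators and labels rather than anything from LightTag, could look like this:

```python
from itertools import combinations

# Invented labels from three annotators on the same five documents.
annotations = {
    "alice": ["ORG", "PER", "ORG", "LOC", "PER"],
    "bob":   ["ORG", "PER", "LOC", "LOC", "PER"],
    "carol": ["ORG", "PER", "ORG", "LOC", "LOC"],
}

def percent_agreement(a, b):
    """Share of overlapping items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Pairwise scores help spot an annotator who is performing worse than the others.
for (name_a, labels_a), (name_b, labels_b) in combinations(annotations.items(), 2):
    print(f"{name_a} vs {name_b}: {percent_agreement(labels_a, labels_b):.0%}")
```

A production tool would presumably report a chance-corrected measure such as Cohen’s kappa, but simple percent agreement is enough to flag an annotator who drifts from the rest of the team.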

Cynthia Murrell, May 16, 2018

Free Keyword Research Tools

May 15, 2018

Short honk: Search Engine Watch published a write up intended for SEO experts. The article contained some useful links to free keyword search tools. Even if you are not buying online ads or fiddling with your indexing, the services are interesting to know about. Here they are:

Stephen E Arnold, May 15, 2018

Google Search and Hot News: Sensitivity and Relevance

November 10, 2017

I read “Google Is Surfacing Texas Shooter Misinformation in Search Results — Thanks Also to Twitter.” What struck me about the article was the headline; specifically, the implication for me was that Google was not merely responding to user queries. Google was actively “surfacing” or fetching and displaying information about the event. Twitter is also involved. I don’t think of Twitter as much more than a party line. One can look up keywords or see a stream of content containing a keyword or, to use Twitter speak, a “hash tag.”

The write up explains:

Users of Google’s search engine who conduct internet searches for queries such as “who is Devin Patrick Kelley?” — or just do a simple search for his name — can be exposed to tweets claiming the shooter was a Muslim convert; or a member of Antifa; or a Democrat supporter…

I think I understand. A user inputs a term and Google’s system matches the user’s query to the content in the Google index. Google maintains many indexes, despite its assertion that it is a “universal search engine.” One has to search across different Google services and their indexes to build up a mosaic of what Google has indexed about a topic; for example, blogs, news, the general index, maps, finance, etc.

Developing a composite view of what Google has indexed takes time and patience. The results may vary depending on whether the user is logged in, searching from a particular geographic location, or has enabled or disabled certain behind the scenes functions for the Google system.

The write up contains this statement:

Safe to say, the algorithmic architecture that underpins so much of the content internet users are exposed to via tech giants’ mega platforms continues to enable lies to run far faster than truth online by favoring flaming nonsense (and/or flagrant calumny) over more robustly sourced information.

From my point of view, the ability to figure out what influences Google’s search results requires significant effort, numerous test queries, and recognition that Google search now balances on two pogo sticks. One “pogo stick” is blunt force keyword search. When content is indexed, terms are plucked from source documents. The system may or may not assign additional index terms to the document; for example, geographic or time stamps.

The other “pogo stick” is discovery and assignment of metadata. I have explained some of the optional tags which Google may or may not include when processing a content object; for example, see the work of Dr. Alon Halevy and Dr. Ramanathan Guha.
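Neither pogo stick is exotic. A toy sketch of the keyword side, with optional metadata bolted onto each document, is shown below; it is my illustration, not a description of Google’s internals:

```python
from collections import defaultdict

# Tiny corpus: each document carries text plus optional metadata that the
# indexing system may or may not have assigned.
documents = {
    "doc1": {"text": "texas shooter identified", "time": "2017-11-05", "geo": "TX"},
    "doc2": {"text": "texas church service schedule", "time": None, "geo": None},
}

# Blunt force keyword indexing: pluck terms from the source documents.
inverted_index = defaultdict(set)
for doc_id, doc in documents.items():
    for term in doc["text"].split():
        inverted_index[term].add(doc_id)

def search(term):
    """Match the query term against the index and return any assigned metadata."""
    return [(doc_id, documents[doc_id]["time"], documents[doc_id]["geo"])
            for doc_id in sorted(inverted_index.get(term, set()))]

print(search("texas"))
```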

But Google, like other smart content processing systems today, has a certain sensitivity. This means the system reacts to particular keywords in the streams of content it processes.

When “news” takes place, the flood of content allows smart indexing systems to identify a “hot topic.” The test queries we ran for my monographs “The Google Legacy” and “Google Version 2.0” suggest that Google is sensitive to certain “triggers” in content. Feedback can be useful; it can also cause smart software to wobble a bit.


T shirts are easy; search is hard.

I believe that the challenge Google faces is similar to the problem Bing and Yandex are exploring as well; that is, certain numerical recipes can overreact to certain inputs. These overreactions may increase the difficulty of determining what content object is “correct,” “factual,” or “verifiable.”

Expecting a free search system, regardless of its owner, to know what’s true and what’s false is understandable. In my opinion, making this type of determination with today’s technology, system limitations, and content analysis methods is impossible.

In short, the burden of figuring out what’s right and what’s not correct falls on the user, not exclusively on the search engine. Users, on the other hand, may not want the “objective” reality. Search vendors want traffic and want to generate revenue. Algorithms want nothing.

Mix these three elements and one takes a step closer to understanding that search and retrieval is not the slam dunk some folks would have me believe. In fact, the sensitivity of content processing systems to comparatively small inputs requires more discussion. Perhaps that type of information will come out of discussions about how best to deal with fake news and related topics in the context of today’s information retrieval environment.

Free search? Think about that too.

Stephen E Arnold, November 10, 2017

Smartlogic: A Buzzword Blizzard

August 2, 2017

I read “Semantic Enhancement Server.” Interesting stuff. The technology struck me as a cross between indexing, good old enterprise search, and assorted technologies. Individuals who are shopping for an automatic indexing system (either with expensive, time-consuming hand-coded rules or with a more Autonomy-like automatic approach) will want to kick the tires of the Smartlogic system. In addition to the echoes of the SchemaLogic approach, I noted a Thompson submachine gun firing buzzwords; for example:

best bets (I’m feeling lucky?)
dynamic summaries (like Island Software’s approach in the 1990s)
faceted search (hello, Endeca?)
navigator (like the Siderean “navigator”?)
real time
related topics (clustering like Vivisimo’s)
semantic (of course)
topic maps
topic pages (a Google report as described in US29970198481)
topic path browser (aka breadcrumbs?)

What struck me after I compiled this list about a system that “drives exceptional user search experiences” was that Smartlogic is repeating the marketing approach of traditional vendors of enterprise search. The marketing lingo and “one size fits all” triggered thoughts of Convera, Delphes, Entopia, Fast Search & Transfer, and Siderean Software, among others.

I asked myself:

Is it possible for one company’s software to perform such a remarkable array of functions in a way that is easy to implement, affordable, and scalable? There are industrial strength systems which perform many of these functions. Examples range from BAE’s intelligence system to the Palantir Gotham platform.

My hypothesis is that Smartlogic might struggle to process a real time flow of WhatsApp messages, YouTube content, and mobile phone intercept voice calls. Toss in the multi-language content which is becoming increasingly important to enterprises, and the notional balloon I am floating says, “Generating buzzwords and associated over inflated expectations is really easy. Delivering high accuracy, affordable, and scalable content processing is a bit more difficult.”

Perhaps Smartlogic has cracked the content processing equivalent of the Voynich manuscript.


Will buzzwords crack the Voynich manuscript’s inscrutable text? What if Voynich is a fake? How will modern content processing systems deal with this type of content? Running some content processing tests might provide some insight into systems which possess Watson-esque capabilities.

What happened to those vendors like Convera, Delphes, Entopia, Fast Search & Transfer, and Siderean Software, among others? (Free profiles of these companies are available.) Oh, that’s right. The reality of the marketplace did not match the companies’ assertions about technology. Investors and licensees of some of these systems were able to survive the buzzword blizzard. Some became the digital equivalent of Ötzi, the 5,300-year-old iceman.

Stephen E Arnold, August 2, 2017
