Google Book Search: Broken Unfixable under Current Incentives

February 19, 2019

I read “How Badly is Google Books Search Broken, and Why?” The main point is that searches do not return the expected results. The culprit, as I understand the write up, is that looking for rare strings of characters within a time slice behaves in an unusual manner. I noted this statement:

So possibly Google has one year it displays for books online as a best guess, and another it uses internally to represent the year they have legal certainty a book is released. So maybe those volumes of the congressional record have had their access rolled back as Google realized that 1900 might actually mean 1997; and maybe Google doesn’t feel confident in library metadata for most of its other books, and doesn’t want searchers using date filters to find improperly released books. Oddly, this pattern seems to work differently on other searches. Trying to find another rare-ish term in Google Ngrams, I settled on “rarely used word”; the Ngrams database lists 192 uses before 2002. Of those, 22 show up in the Google index. A 90% disappearance rate is bad, but still a far cry from 99.95%.
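
The arithmetic in the quoted passage can be checked with a minimal sketch. The counts (192 uses in Ngrams, 22 found in the index) come from the quote; the function name is an illustrative assumption, not something from the source.

```python
def disappearance_rate(expected: int, found: int) -> float:
    """Return the fraction of expected hits that are missing from the index."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 1 - found / expected


# Counts taken from the quoted passage: 192 uses in Ngrams, 22 in the index.
rate = disappearance_rate(192, 22)
print(f"{rate:.1%}")  # prints 88.5%
```

The result is a bit under the 90% in the quote, which appears to be a rounded figure.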

There are many reasons one can identify for the apparent misbehavior of the Google search system for books. The author identifies the main reason but does not focus on it.

From my point of view and based on the research we have done for my various Google monographs, Google’s search systems operate in silos. But each shares some common characteristics even though the engineers, often reluctantly assigned to what are dead end or career stalling projects, make changes.

One of the common flaws has to do with the indexing process itself. None of the Google silos does a very good job with time related information. Google itself has a fix, but implementing the fix for most of its services is a cost increasing step.

The result is that Google focuses on innovations which can drive revenue; that is, online advertising for the mobile user of Google services.

But Google’s time blindness is unlikely to be remediated any time soon. For a better implementation of sophisticated time operations, take a look at the technology for time based retrieval, time slicing, and time analytics from the Google and In-Q-Tel funded company Recorded Future.

In my lectures about Google’s time blindness DNA, I compare and contrast what Recorded Future can do versus what Google silos are doing.

Net net: Performing sophisticated analyses of the Google indexes requires the type of tools available from Recorded Future.

Stephen E Arnold, February 19, 2019

Amazon: Wheel Re-Invention

December 19, 2018

Some languages have bound phrases; that is, two words which go together. Examples include “White House”, a presidential dwelling, and “ticket counter”, a place to talk with an uninterested airline professional. How does a smart software system recognize a bound phrase and then connect it to the speaker’s or writer’s intended meaning? There is a difference between “I toured the White House” and “Turn left at the white house.”

Traditionally, vendors of text analysis, indexing, and NLP systems used jargon to explain a collection of methods pressed into action to make sense of language quirks. The guts of most systems are word lists, training material selected to make clear that in certain contexts some words go together and have a specific meaning; for example, “terminal” doesn’t make much sense until one gets whether the speaker or writer is referencing a place to board a train (railroad terminal), the likely fate of a sundowner (terminal as in dead), or a computer interface device (dumb terminal).
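
The word list idea can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-built phrase list and case-sensitive matching. The names `BOUND_PHRASES` and `tokenize_with_phrases` are hypothetical, and nothing here reflects how Amazon or any vendor actually implements the method.

```python
# Hypothetical phrase list. A real system would use far larger lists plus
# training material and context, as the paragraph above describes.
BOUND_PHRASES = {
    ("White", "House"),     # capitalized form only: the presidential dwelling
    ("ticket", "counter"),
    ("peanut", "butter"),
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in BOUND_PHRASES)


def tokenize_with_phrases(text: str) -> list[str]:
    """Split text into tokens, keeping known bound phrases as single tokens."""
    words = text.replace(".", "").replace(",", "").split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(MAX_PHRASE_LEN, 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in BOUND_PHRASES:
                tokens.append(" ".join(candidate))
                i += n
                break
        else:
            # No bound phrase starts here, so keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenize_with_phrases("I toured the White House."))
print(tokenize_with_phrases("Turn left at the white house."))
```

In this sketch capitalization is the only disambiguation signal, so “White House” becomes one token and “white house” stays two. A phrase at the start of a sentence (“Peanut butter”) would be missed, which is one reason real systems lean on context and training data rather than a bare list.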

How does Amazon accomplish this magic? Amazon embraces jargon, of course, and then explains its bound phrase magic in “How Alexa Knows “Peanut Butter” Is One Shopping-List Item, Not Two.”

Amazon’s spin is spoken language understanding. The write up explains how the system operates. But the methods are ones that others have used. Amazon, to be sure, has tweaked the procedures. That’s standard operating procedure in the index game.

What’s interesting is that no reference is made to the contextual information which Amazon has to assist its smart software with disambiguation.

But Amazon is now talking, presumably to further the message that the company is a bold, brave innovator.

No argument from Harrod’s Creek. That’s a bound phrase, by the way, with capital letters and sometimes an apostrophe or not.

Stephen E Arnold, December 19, 2018

Facial Recognition and Image Recognition: Nervous Yet?

November 18, 2018

I read “A New Arms Race: How the U.S. Military Is Spending Millions to Fight Fake Images.” The write up contained an interesting observation from an academic wizard:

“The nightmare situation is a video of Trump saying I’ve launched nuclear weapons against North Korea and before anybody figures out that it’s fake, we’re off to the races with a global nuclear meltdown.” — Hany Farid, a computer science professor at Dartmouth College

Nothing like a shocking statement to generate fear.

But there is a more interesting image recognition observation. “Facebook Patent Uses Your Family Photos For Targeted Advertising” reports that the social media sparkler has an invention that will

attempt to identify the people within your photo to try and guess how many people are in your family, and what your relationships are with them. So for example if it detects that you are a parent in a household with young children, then it might display ads that are more suited for such family units. [US20180332140]

While considering the implications of pinpointing family members and linking the deduced and explicit data, consider that one’s fingerprint can be duplicated. The dupe allows a touch ID to be spoofed. You can get the details in “AI Used To Create Synthetic Fingerprints, Fools Biometric Scanners.”

For a law enforcement and intelligence angle on image recognition, watch for DarkCyber on November 27, 2018. The video will be available on the Beyond Search blog splash page at this link.

Stephen E Arnold, November 18, 2018

Google Struggles with Indexing?

November 14, 2018

You probably know that Google traffic was routed to China. The culprit was something obvious. In this case, Nigeria. Yep, Nigeria. You can read about the mistake that provided some interesting bits and bytes to the Middle Kingdom. Yeah, I know. Nigeria. “A Nigerian Company Is in Trouble with Google for Re-Routing Traffic to Russia, China” provides some allegedly accurate information.

But the major news I noted here in Harrod’s Creek concerned Google News and its indexing. Your experience may be different from mine, but Google indexing can be interesting. I was looking for an outfit identified as Inovatio, which is a university anchored outfit in China. The reference to Inovatio in Google aimed me at a rock band and a design company in Slovenia. Google’s smart search system changed Inovatio to innovation even when I used quote marks. I did locate the Inovatio operation using a Chinese search engine. I was able to track Ampthon.com which listed Inovatio and provided the university affiliation to allow me to get some info about an outfit providing surveillance and intercept services to countries in need of this capability.

Google. Indexing. Yeah.

“Google News Publishers Complaining About Indexing Issues” highlights another issue with the beloved Google. I learned:

In the past few days there has been an uptick in complaints from Google News publishers around Google not indexing their new news content. Gary Illyes from Google did a rare appearance on Twitter to say he passed along the feedback to the Google News team to investigate. You can scan through the Google News Help forums and see a nice number of complaints. Also David Esteve, the SEO at the Spanish newspaper El Confidencial, posted his concerns on Twitter.

The good news is that the write up mentions that this indexing glitch is a known issue.

Net net: Many people with whom I speak believe that Google’s index is comprehensive, timely, and consistent.

Yeah, also smart because Inovatio is really innovation.

Stephen E Arnold, November 14, 2018

Indexing Matters: The Investment Sector Analysis

October 15, 2018

I read reports which explain why large monopolistic or oligopolistic companies alter the behavior of certain ecosystems. I don’t see that many because analysts are preoccupied with more practical matters; namely, their bonuses, appearances on Bloomberg TV or CNBC, and riding their hobby horses.

I read and then reread “Platform Giants and Venture Backed Startups.” The premise struck me as obvious. The whales of online are functioning like giant electromagnets. These companies pull traffic, attention, and money. At the same time, they emit beacons which are tuned to the inner ears of investors.

[Image: cubed gelatin dessert]

Looks tasty but only semi organized. And from what is this confection fabricated? Answer: Cow hooves. Intellectual Jello, lovingly crafted to delight the eye.

The squeaks of these ultra high frequency waves alert those looking for big paydays to put their money into startups which do not compete head on with the outfits operating like electromagnets.

The “Platform Giant” write up assembles observations from a report which asserts the opposite; that is, big electromagnets do not have an impact on start ups and most investors.

Put that aside.

The core of the write up makes clear that indexing and classification make a difference. The idea is that if one classifies and marshals data, the classification creates a way to look at the data, the world, and in this particular case the way investments flow or do not flow.

What goes in “Internet software” becomes the trigger for the conclusion. Invest to compete against the Google? Not a good idea.

The question becomes, “Who does the indexing, classification, ontology, and related bits of the taxonomy?”

Indexing is important. But more important is the creation of the knowledge structure and the categories which will be used to chop, slice, and organize data for analysis.

Get the knowledge structure wrong and the flawed categorization creates findings that are probably misleading at best and just off base at worst.

Who takes the time to work out the knowledge structure before training humans and smart software to assign metadata?

The write up suggests that humans (either with agenda or without, with expertise or not, or with a wonky knowledge superstructure or not) do.

Net net: Counting is verifiable. Pegging what to count may be more like organizing cubes of a gelatin dessert.

Stephen E Arnold, October 15, 2018

The Semantic Web: Technology Roadkill or a Roadside Snack?

September 24, 2018

I spotted a quote to note. Here it is:

The Semantic Web is as dead as last year’s roadkill.

The statement appears in “Whatever Happened to the Semantic Web?” The write up provides a run through of the starts and stops associated with making the Web into a more organized place.

I would point out that the state of the Semantic Web can be glimpsed in the TweetedTimes’ auto generated list of articles called “Semantic Search.” The collection of items focuses on a range of topics, but the thrust seems to be getting traffic for a Web site; for example, “How to Optimize Content for Semantic SEO.”

If you are an adherent of the Semantic Web, check out the included footnotes. I would point out that the Google has a number of Guha patents in its portfolio. I think the Semantic Web may be of interest to some at the online ad search giant.

Guha’s patents plus the work by Alon Halevy may suggest some interesting use cases for the mark up, triplet, smart agent system and methods.

Stephen E Arnold, September 24, 2018

Bing: No More Public URL Submissions

September 19, 2018

Ever wondered why some Web site content is not indexed? Heck, ever talk to a person who cannot find their Web site in a “free” Web index? I know that many people believe that “free” Web search services are comprehensive. Here’s a thought: The Web indexes are not comprehensive. The indexing is selective, disconnected from meaningful date and time stamps, and often limited to following links to a specified depth; for example, three levels down or fewer in many cases.
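
The depth limit is easy to picture with a minimal sketch. It assumes a toy in-memory link graph rather than real HTTP fetching. The names `crawl` and `SITE` are hypothetical, and the three level cutoff simply mirrors the example above, not any documented Bing or Google policy.

```python
from collections import deque


def crawl(links: dict[str, list[str]], seed: str, max_depth: int) -> set[str]:
    """Breadth-first crawl that stops following links below max_depth."""
    indexed = {seed}
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            # Pages at the depth limit are indexed but their links are ignored.
            continue
        for target in links.get(page, []):
            if target not in indexed:
                indexed.add(target)
                queue.append((target, depth + 1))
    return indexed


# Toy site: each page links one level deeper.
SITE = {
    "/": ["/a"],
    "/a": ["/a/b"],
    "/a/b": ["/a/b/c"],
    "/a/b/c": ["/a/b/c/d"],
}

print(sorted(crawl(SITE, "/", max_depth=3)))
# prints ['/', '/a', '/a/b', '/a/b/c']
```

The page four levels down never enters the index, even though it is online and linked.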

I thought about the perception of comprehensiveness when I read “Bing Is Removing Its Public URL Submission Tool.” The tool allowed a savvy SEO professional or an informed first time Web page creator to let Bing know that a site was online and ready for indexing.

No more.

How do “free” Web indexes find new sites? Now that’s a good question, and the answers range from “I don’t know” to “Bing and Google are just able to find these sites.”

A couple of thoughts:

  • Editorial or spidering policies are not spelled out by most Web indexing outfits
  • Users assume that if information is available online, that information is accurate
  • “Free” Web indexing services are not set up to deliver results that are necessarily timely (indexed on a daily basis) or comprehensive.

Bing’s allegedly turning off public URL submissions is a small thing. My question: “Who looked at these submissions and made a decision about what to index or exclude from indexing?” Perhaps the submission form operated like a thermostat control in a hotel room?

Stephen E Arnold, September 18, 2018

Semantic Struggles and Metadata

August 31, 2018

I have noticed the flood of links and social media posts about semantics from David Amerland. I found many of the observations interesting; a few struck me as a wildly different view of indexing. A recent essay by David Amerland, “Snipers Use Metadata Much Like Semantic Search Does,” caught the Beyond Search team’s attention.

Learn about “The Sniper Mind” at this link.

According to the story:

“There are two key takeaways here [about metadata and trained killers]: First, such skills are directly transferable in the business domain and even in most life situations. Second, in order to use their brain in this way snipers need training. The mental training and the psychological aids that are developed as a result of it is what I detailed…”

We must admit that it is a fresh metaphor: Comparing killers’ use of indexing with semantic search. In our experience with professional indexing systems and human indexers, the word “sniper” has not to our recollection been used.

Watch your back, your blindside, or ontology. Oh, also metaphors.

Patrick Roland, August 31, 2018

Deindexing SEO Delivers Revenue Results

June 7, 2018

SEO is still an important aspect of the Google algorithm and other search engine crawlers. In my opinion, tweaking Web pages can result in a boost for content in some queries. I have a hunch that Google’s system then ignores subsequent tweaks. The Web master then has an opportunity to buy Google advertising, and the content becomes more findable. But that’s just an opinion.

The received wisdom is that the key to great SEO is to generate great content, which the crawlers then index. Robin Rozhon shares that technical SEO has a big impact on your Web site, especially if it is large. In his article, “Crawling & Indexing: Technical SEO Basics That Drive Revenue (Case Study),” Rozhon discusses how to maximize technical SEO, including the benefits of deindexing.

Rozhon ran an experiment where they deindexed over 400,000 of their 500,000 Web sites and 80% of their URLs, because search engines indexed them as duplicate category URLs. Their organic traffic increased significantly. Before you deindex your Web sites, check into Google Analytics to determine how well the pages are doing.

Also, to determine which pages to deindex, collect data about the URLs and find out what the parameters are. Use Google Analytics, Google Search Console, Screaming Frog, log files, and other data about each URL to understand its performance.

Facets and filters are another important contribution to URLs:

“Faceted navigation is another common troublemaker on ecommerce websites we have been dealing with. Every combination of facets and filters creates a unique URL. This is a good thing and a bad thing at the same time, because it creates tons of great landing pages but also tons of super specific landing pages no one cares about.”

They also have pros and cons:

I learned this about “facets”:

  • Facets are discoverable, crawlable, and indexable by search engines;
  • Wait! Facets are not discoverable if multiple items from the same facet are selected (e.g. Adidas and Nike t-shirts);
  • Facets contain self-referencing canonical tags.

And what about filters?

  • Filters are not discoverable;
  • Filters contain a “noindex” tag;
  • Filters use URL parameters that are configured in Google Search Console and Bing Webmaster Tools.
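
The facet and filter rules in the two lists can be sketched as a small URL classifier. This is a minimal illustration under stated assumptions: the parameter names in `FACET_PARAMS` and `FILTER_PARAMS` and the function `classify` are hypothetical, and the article does not say which parameters Rozhon’s sites actually use.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameter names; a real site would define its own.
FACET_PARAMS = {"brand", "category"}
FILTER_PARAMS = {"sort", "price_max"}


def classify(url: str) -> str:
    """Label a URL according to the facet and filter rules listed above."""
    params = parse_qs(urlsplit(url).query)
    # Any filter parameter means the page carries a noindex tag.
    if any(name in FILTER_PARAMS for name in params):
        return "filter: noindex"
    # Several values for one facet (e.g. two brands) are not discoverable.
    if any(name in FACET_PARAMS and len(values) > 1
           for name, values in params.items()):
        return "multi-select facet: not discoverable"
    if any(name in FACET_PARAMS for name in params):
        return "facet: indexable, self-canonical"
    return "plain: indexable"


print(classify("https://example.com/t-shirts?brand=adidas"))
print(classify("https://example.com/t-shirts?brand=adidas&brand=nike"))
print(classify("https://example.com/t-shirts?brand=adidas&sort=price"))
```

Checking filters first is a design choice in this sketch: a URL that mixes a facet with a filter is treated as a filter page, which keeps the super specific combinations out of the index.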

As a librarian, I believe that old school ideas have found their way into the zippy modern approach to indexing via humans and semi smart software.

In the end, consolidate pages and remove any dead weight to drive traffic to the juicy content and increase sales. Why did they not say that to begin with, instead of putting us through the technical jargon?

Whitney Grace, June 7, 2018

Fake News May Be a Forever Feature

June 4, 2018

While the world’s big names in social media go on tour to tout the ways in which they are snuffing out fake news, the fake news machine keeps rolling along. Mark Zuckerberg and company can do all the testifying in Washington they want, but that does not mean the criminal element will just curl up and go away. They certainly aren’t going anywhere when there is money to be made and there is plenty of that, according to a surprising BoingBoing story, “It’s Laughably Simple to Buy Thousands of Cheap, Plausible Facebook Identities.”

According to the story:

“[F]or $13, a Buzzfeed reporter was able to buy the longstanding Facebook profile of a fake 23 year old British woman living in London with 921 friends and a deep, plausible dossier of activities, likes and messages. The reporter’s contact said they could supply 5,000 more Facebook identities at any time.”

The danger is that there is essentially no way to really stop this as bot makers get more sophisticated and adjust to Facebook and other social media outlets’ algorithm changes. Some experts even fear that this unstoppable tide of bots will have deadly consequences. We’ll keep watching this story, but don’t have a lot of faith things will get better any time soon.

Patrick Roland, June 4, 2018
