An Interesting Hypothesis about Google Indexing
January 15, 2020
We noted “Google’s Crawl-Less Index.” The main idea is that something has changed in how Google indexes. We circled in yellow this statement from the article:
[Google] can do this now because they have a popular web browser, so they can retire their old method of discovering links and let the users do their crawling.
The statement needs context.
The speculation is that Google indexes a Web page only when a user visits it. Google notes the behavior and indexes the page.
What’s happening, DarkCyber concludes, is that Google no longer brute-force crawls the public Web. Indexing takes place when a signal (a human navigating to a page) is received. Then the page is indexed.
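Purely to make the hypothesis concrete, here is a minimal sketch of what visit-signal-driven indexing could look like. Everything in it is an assumption for illustration: the event queue, the fetch_and_index helper, and the one-day freshness window are invented, not a description of Google’s actual pipeline.

```python
import time
from collections import deque

# Hypothetical queue of page-visit signals reported by a browser.
visit_signals = deque([
    {"url": "https://example.com/some-report", "ts": time.time()},
])

indexed_at = {}          # url -> timestamp of last indexing
REINDEX_AFTER = 86_400   # assumed freshness window: one day, in seconds


def fetch_and_index(url):
    """Placeholder for fetching the page and writing it to the index."""
    print(f"indexing {url}")


def process_visit_signals():
    # Index only pages users actually visit; skip anything indexed recently.
    while visit_signals:
        signal = visit_signals.popleft()
        url, ts = signal["url"], signal["ts"]
        if ts - indexed_at.get(url, 0) > REINDEX_AFTER:
            fetch_and_index(url)
            indexed_at[url] = ts
        # Pages nobody visits never enter the queue, so they are never indexed.


process_visit_signals()
```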
Is this user-behavior centric indexing a reality?
DarkCyber has noted these characteristics of Google’s indexing in the last year:
- Certain sites are in the Google indexes but are either not updated or updated selectively; for example, the Railroad Retirement Board, MARAD, and similar sites
- Large sites like the Auto Channel no longer have their backfiles indexed and findable unless the user resorts to Google’s advanced search syntax. Even then, the results display more slowly than current content, probably because the Google caches do not keep infrequently accessed content in a cache close to that user
- Current content for many specialist sites is not available when it is published. This is a characteristic of commercial sites with unusual domains like dot co and of some blogs.
What’s going on? DarkCyber believes that Google is trying to reduce the increasing and difficult-to-control costs associated with indexing new content, indexing updated content (the deltas), and indexing the complicated content which Web sites generate while chasing the dream of becoming number one for a Google query.
Search efficiency, as we have documented in our write ups, books, and columns about Google, boils down to:
- Maximizing advertising value. That’s one reason why query expansion is used: results match more ads, and thus the advertiser’s ads get broader exposure. (A toy sketch of query expansion appears after this list.)
- Getting away from the old school approach of indexing the billions of Web pages. 90 percent of these Web pages get zero traffic; therefore, index only what’s actually wanted by users. Today’s Google is not focused on library science, relevance, precision, and recall.
- Cutting costs. Cost control at the Google is very, very difficult. The crazy moonshots, the free form approach to management, the need for legions of lawyers and contract workers, the fines, the technical debt of a 20 year old company, the salaries, and the extras—each of these has to be controlled. The job is difficult.
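To make the first bullet concrete, here is a toy sketch of query expansion. The synonym table is invented for illustration; a production system would use learned or curated expansions.

```python
# Toy query expansion: a broader query matches more documents (and more ads).
SYNONYMS = {
    "cheap": ["inexpensive", "budget", "affordable"],
    "flights": ["airfare", "plane tickets"],
}


def expand_query(query):
    terms = query.lower().split()
    expanded = set(terms)
    for term in terms:
        expanded.update(SYNONYMS.get(term, []))
    return sorted(expanded)


print(expand_query("cheap flights"))
# ['affordable', 'airfare', 'budget', 'cheap', 'flights', 'inexpensive', 'plane tickets']
```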
Net net: Ever wonder why finding specific information is getting more difficult via Google? Money.
PS: Finding timely, accurate information and obtaining historical content are more difficult, in DarkCyber’s experience, than at any time since we sold our ThePoint service to Lycos in the mid-1990s.
Stephen E Arnold, January 15, 2020
Intellisophic: Protected Content
December 28, 2019
Curious about Intellisophic? If you navigate to www.intellisophic.com, you get this page. If you know where Intellisophic actually operates, you get a live Web site that looks like this:
No links, and there is no indication of who operates this page.
You persevere and locate a link to the “real” Intellisophic. You spot the About page and click it. What renders?
Yep, protected information.
Even companies providing specialized services to governments, with “interesting” investors and solutions, provide a tiny bit of information; for example, check out https://voyagerlabs.co/.
DarkCyber finds it interesting that a company in the information business does not provide any information about itself.
Stephen E Arnold, December 28, 2019
Instagram Learns about Uncontrolled Indexing
December 23, 2019
Everyone is an expert on search. Everyone can assign index terms, often called metatags or hashtags. The fun world of indexing at this time usually means anyone can make up a “tag” and assign it. This is uncontrolled indexing. The popularity of the method is a result of two things. First, a desire to save money: skilled indexers want to develop controlled vocabularies and guidelines for the use of those terms, and these activities are expensive; in MBA land, who cares? Second, without an editorial policy and editorial controls, MBAs and engineers can say, “Hey, Boomer, we just provide a platform. Not our problem.”
Not surprisingly, even some millennials are figuring out that old school indexing has some value, despite the burden of responsibility. Responsible behavior builds a few ethical muscles.
“How Anti-Vaxxers Get around Instagram’s New Hashtag Controls” reveals some of the flaws of uncontrolled indexing and the shallowness of the solutions crafted by some thumb typing content professionals. This passage explains the not too tough method in use by some individuals:
But anti-vaccine Instagram users have been getting around the controls by employing more than 40 cryptic hashtags such as #learntherisk and #justasking.
There you go. Make up a new indexing term and share it with your fellow travelers. Why not use wonky spelling or an oddball character?
The write up exposes the limitations of rules based term filtering and makes clear that artificial intelligence is not showing up for required office hours.
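A toy example shows why a static blocklist is easy to route around; the blocklist below is invented for illustration, and #justasking comes from the passage quoted above.

```python
# Naive rules-based filtering: block posts whose hashtags appear on a fixed list.
BLOCKED_TAGS = {"#vaccinesharm", "#vaccineinjury"}


def is_blocked(hashtags):
    return any(tag.lower() in BLOCKED_TAGS for tag in hashtags)


print(is_blocked(["#vaccineinjury"]))   # True: an exact match is caught
print(is_blocked(["#justasking"]))      # False: a freshly minted tag sails through
print(is_blocked(["#vaccine1njury"]))   # False: wonky spelling defeats the rule
```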
Should I review the benefits of controlled term indexing? Yes.
Will I? No.
Why? Today no one cares.
Who needs old fashioned methods? No one who wants his or her bonus.
Stephen E Arnold, December 23, 2019
From the Desk of Captain Obvious: How Image Recognition Mostly Works
July 8, 2019
Want to be reminded about how super duper image recognition systems work? If so, navigate to the capitalist tool’s “Facebook’s ALT Tags Remind Us That Deep Learning Still Sees Images as Keywords.” The DarkCyber team knows that this headline is designed to capture clicks and certainly does not apply to every image recognition system available. But if the image is linked via metadata to something other than a numeric code, then images are indeed mapped to words. Words, it turns out, remain useful in our video and picture first world.
Nevertheless, the write up offers some interesting comments, which is what the DarkCyber research team expects from the capitalist tool. (One member of our DarkCyber team saw Malcolm Forbes at a Manhattan eatery keeping a close eye on a spectacularly gaudy motorcycle. Alas, that Mr. Forbes is no longer with us, although the motorcycle probably survives somewhere, unlike the “old” Forbes’ editorial policies.)
Here’s the passage:
For all the hype and hyperbole about the AI revolution, today’s best deep learning content understanding algorithms are still remarkably primitive and brittle. In place of humans’ rich semantic understanding of imagery, production image recognition algorithms see images merely through predefined galleries of metadata tags they apply based on brittle and naïve correlative models that are trivially confused.
Yep, and ultimately the hundreds of millions of driver license pictures will be mapped to words; for example, name, address, city, state, zip, along with a helpful pointer to other data about the driver.
The capitalist tool reminds the patient reader:
Today’s deep learning algorithms “see” imagery by running it through a set of predefined models that look for simple surface-level correlative patterns in the arrangement of its pixels and output a list of subject tags much like those human catalogers half a century ago.
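As a rough illustration of “predefined galleries of metadata tags,” here is a toy tagger that maps an image’s feature vector to the closest predefined word labels. The vectors are made up; nothing here reflects Facebook’s actual ALT tag pipeline.

```python
import numpy as np

# A "gallery" of predefined tags, each represented by a made-up feature vector.
TAG_VECTORS = {
    "dog":     np.array([0.9, 0.1, 0.0]),
    "bicycle": np.array([0.1, 0.8, 0.1]),
    "beach":   np.array([0.0, 0.2, 0.9]),
}


def tag_image(image_features, top_k=2):
    """Return the predefined tags whose vectors best match the image features."""
    scores = {
        tag: float(np.dot(image_features, vec)
                   / (np.linalg.norm(image_features) * np.linalg.norm(vec)))
        for tag, vec in TAG_VECTORS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# A hypothetical feature vector extracted from a photo of a dog on a beach.
print(tag_image(np.array([0.7, 0.1, 0.6])))   # ['dog', 'beach']
```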
Once again, no push back from Harrod’s Creek. However, it is disappointing that new research is not referenced in the article; for example, the companies involved in DARPA’s UPSIDE program.
Stephen E Arnold, July 8, 2019
How Smart Software Goes Off the Rails
June 23, 2019
Navigate to “How Feature Extraction Can Be Improved With Denoising.” The write up seems like a straightforward analytics explanation. Lots of jargon, buzzwords, and hippy dippy references to length squared sampling in matrices. The concept is not defined in the article. And if you remember statistics 101, you know that there are five types of sampling: convenience, cluster, random, systematic, and stratified. Each has its strengths and weaknesses. How does one avoid the issues? Use length squared sampling, obviously: just sample rows with probability proportional to the square of their Euclidean norms. Got it?
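For readers who want the mechanics, here is a minimal NumPy sketch of length squared sampling: each row is drawn with probability proportional to the square of its Euclidean norm, and the sampled rows are rescaled so they approximate the matrix product. The matrix is random data, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))          # toy data matrix

# Probability of picking row i is ||A[i]||^2 / ||A||_F^2.
row_norms_sq = np.sum(A ** 2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()

sample_idx = rng.choice(A.shape[0], size=50, replace=True, p=probs)
# Rescale sampled rows so the sample is an unbiased sketch of A^T A.
S = A[sample_idx] / np.sqrt(50 * probs[sample_idx, None])

# Relative error of the sketch versus the full product.
print(np.linalg.norm(A.T @ A - S.T @ S) / np.linalg.norm(A.T @ A))
```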
However, the math is not the problem. Math is a method. The glitch is in defining “noise.” Like love, it can be defined in many ways. The write up points out:
Autoencoders with more hidden layers than inputs run the risk of learning the identity function – where the output simply equals the input – thereby becoming useless. In order to overcome this, Denoising Autoencoders(DAE) was developed. In this technique, the input is randomly induced by noise. This will force the autoencoder to reconstruct the input or denoise. Denoising is recommended as a training criterion for learning to extract useful features that will constitute a better higher level representation.
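For orientation, here is a bare-bones Keras sketch of the training setup the quote describes: corrupt the input with random noise and train the network to reproduce the clean input. The data and layer sizes are toy values, not the article’s configuration.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(1)
x_clean = rng.random((1000, 32)).astype("float32")                       # stand-in inputs
x_noisy = (x_clean + 0.1 * rng.normal(size=x_clean.shape)).astype("float32")

# Even with more hidden units than inputs, the corrupted input keeps the network
# from simply copying its input; it has to learn features that remove the noise.
autoencoder = keras.Sequential([
    keras.Input(shape=(32,)),
    keras.layers.Dense(64, activation="relu"),     # overcomplete hidden layer
    keras.layers.Dense(32, activation="sigmoid"),  # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")

# Noisy inputs, clean targets: the training criterion is denoising.
autoencoder.fit(x_noisy, x_clean, epochs=5, batch_size=64, verbose=0)
```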
Can you spot the flaw in the approach? Consider what happens if the training set is skewed for some reason. The system will learn based on the inputs smoothed by statistical sanding. When the system encounters real world data, the system will, by golly, convert the “real” inputs in terms of the flawed denoising method. As one wit observed, the sampled approximation “gives us a better estimation than the zero matrix.” Yep.
To sum up, the system just generates “drifting” outputs. The fix? Retraining. This is expensive and time consuming. Not good when the method is applied to real time flows of data.
In a more colloquial turn of phrase, the denoiser may not be denoising correctly.
As more complex numerical recipes are embedded in “smart” systems, there will be some interesting consequences. Does the phrase “chain of failure” ring a bell? What about “good enough”?
Stephen E Arnold, June 23, 2019
Facial Recognition: In China, Deployed. In the US, Detours
April 9, 2019
Amazon faces push back for its facial recognition system Rekognition. China? That is a different story.
Chinese authorities seem to be fond of re-education camps and assorted types of incarceration facilities. China is trying to become the recognized (no pun intended) technology capital of the world. Unlike Chile and Bolivia, which have somewhat old school prison systems, the Chinese government is investing money in its prison security systems. Technode explains how China upgraded one such security system in “Briefing: Chinese VIP Jail Uses AI Technology To Monitor Prisoners.”
One flagship for facial recognition is China’s Yancheng Prison, known for imprisoning government officials and foreigners. The facility has upgraded its security system with a range of surveillance technology. The new surveillance system consists of a smart AI network of cameras and hidden sensors equipped with facial recognition and movement analysis. The system detects prisoners’ unusual behavioral patterns, alerts the guards, and includes the events in daily reports.
Yancheng Prison wants to cut down on the number of prison breaks, thus the upgrade:
“Jointly developed by industry and academic organizations including Tianjin-based surveillance technology company Tiandy, the system is expected to provide blanket coverage extending into every cell, rendering prison breaks next to impossible. The company is also planning to sell the system to some South American countries for jails with histories of violence and security breaches. The use of technology to monitor prisoners prompted concern over negative effects on prisoners’ lives and mental state from one human behavior expert, who also suggested that some prisoners may look for ways to exploit the AI’s weaknesses.”
China continues to take steps to put technology into use. With feedback, the engineers who develop these systems can make adjustments. Over time, China may become better at facial recognition than almost any other country.
Whitney Grace, April 9, 2019
Federating Data: Easy, Hard, or Poorly Understood Until One Tries It at Scale?
March 8, 2019
I read two articles this morning.
One article explained that there’s a new way to deal with data federation. Always optimistic, I took a look at “Data-Driven Decision-Making Made Possible using a Modern Data Stack.” The revolution is to load data and then aggregate. The old way is to transform, aggregate, and model. Here’s a diagram from DAS42. A larger version is available at this link.
Hard to read. Yep, New Millennial colors. Is this a breakthrough?
I don’t know.
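For the curious, here is a compact pandas sketch of the two orderings. The table and the aggregation are invented; the point is only where the transformation step sits.

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["east", "west", "east"],
    "amount": ["10.0", "12.5", "7.5"],   # raw strings, as loaded from the source
})

# Old way (transform first): clean and model the data before it lands.
cleaned = orders.assign(amount=orders["amount"].astype(float))
old_style = cleaned.groupby("region", as_index=False)["amount"].sum()

# "Modern stack" way (load first): land the raw rows, then transform and
# aggregate inside the warehouse, here simulated with the same DataFrame.
raw_landed = orders.copy()
new_style = (raw_landed.assign(amount=raw_landed["amount"].astype(float))
                       .groupby("region", as_index=False)["amount"].sum())

print(old_style.equals(new_style))   # same numbers; only the ordering of steps differs
```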
When I read “2 Reasons a Federated Database Isn’t Such a Slam-Dunk,” it seems that the solution outlined by DAS42 and the InfoWorld expert’s view are not in sync.
There are two reasons. Count ‘em.
One: performance
Two: security.
Yeah, okay.
Some may suggest that there are a handful of other challenges. These range from deciding how to index audio, video, and images to figuring out what to do with different languages in the content to determining what data are “good” for the task at hand and what data are less “useful.” Date, time, and geocode metadata are needed, but that introduces a not so easy to solve indexing problem.
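As a small illustration of why date and time metadata are not a free lunch, here is a sketch of normalizing dates that arrive in different formats from different federated sources. The sources and formats are invented.

```python
from datetime import datetime

# Three federated sources, three date conventions (invented for illustration).
records = [
    {"source": "crm",  "date": "03/08/2019"},         # US month/day/year
    {"source": "logs", "date": "2019-03-08T14:02Z"},   # ISO 8601
    {"source": "mail", "date": "8 March 2019"},        # free text
]

FORMATS = ["%m/%d/%Y", "%Y-%m-%dT%H:%M%z", "%d %B %Y"]


def normalize_date(raw):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None   # unparseable: one more thing to reconcile before indexing


for rec in records:
    print(rec["source"], normalize_date(rec["date"]))
```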
So where are we with the “federation thing”?
Exactly the same place we were years ago… start-ups and experts notwithstanding. But then one has to wrangle a lot of data. That’s cost, gentle reader. Big money.
Stephen E Arnold, March 8, 2019
Natural Language Generation: Sort of Made Clear
February 28, 2019
I don’t want to spend too much time on NLG (natural language generation). This is a free Web log. Providing the acronym should be enough of a hint.
If you are interested in the subject and can deal with wonky acronyms, you may want to read “Beyond Local Pattern Matching: Recent Advances in Machine Reading.”
Search sucks, so bright young minds want to tell you what you need to know. What if the system is only 75 to 80 percent accurate? The path is a long one, but the direction in which information retrieval is heading seems clear.
Stephen E Arnold, February 28, 2019
ChemNet: Pre-Training and Rules Can Work but Time and Cost Can Be a Roadblock
February 27, 2019
I read “New AI Approach Bridges the Slim Data Gap That Can Stymie Deep Learning Approaches.” The phrase “slim data” caught my attention. Pairing the phrase with “deep learning” seemed to point the way to the future.
The method described in the document reminded me that creating rules for “smart software” works on narrow domains with constraints on terminology. No emojis allowed. The method of “pre-training” has been around since the early days of smart software. Autonomy in the mid-1990s relied upon training its “black box.”
Creating a training set which represents the content to be processed or indexed can be a time-consuming, expensive business. Plus, because content “drifts,” re-training is required. For some types of content, the training process must be repeated and verified.
So the cost of rule creation, tuning, and tweaking is one thing. The expense of training, training set tuning, and retraining is another. Add them up, and the objective of keeping costs down and accuracy up becomes a bit of a challenge.
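To ground the pre-training point, here is a generic transfer-learning sketch in Keras: a pre-trained encoder is frozen and a small head is fit on a slim labeled set. The encoder is a randomly initialized placeholder (in practice it would be loaded from a checkpoint), and the data and dimensions are stand-ins; this is not the ChemNet method itself.

```python
import numpy as np
from tensorflow import keras

# Stand-in for a network pre-trained elsewhere on a large, cheap-to-get corpus.
pretrained_encoder = keras.Sequential([
    keras.Input(shape=(128,)),
    keras.layers.Dense(64, activation="relu"),
])
pretrained_encoder.trainable = False   # freeze: reuse features, skip re-training

# Small task-specific head fit on the slim labeled set (toy "toxicity" labels).
model = keras.Sequential([
    pretrained_encoder,
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

rng = np.random.default_rng(3)
x_small = rng.random((200, 128)).astype("float32")              # only 200 labeled examples
y_small = rng.integers(0, 2, size=(200, 1)).astype("float32")

model.fit(x_small, y_small, epochs=3, batch_size=32, verbose=0)
```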
The article focuses on the benefits of the new system as it crunches and munches its way through chemical data. The idea is to let software identify molecules for their toxicity.
Why hasn’t this type of smart software been used to index outputs at scale?
My hunch is that the time, cost, and accuracy of the indexing itself is a challenge. Eighty percent accuracy may be okay for some applications like identifying patients with a risk of diabetes. Identifying substances that will not kill one outright is another matter.
In short, the slim data gap and deep learning remain largely unsolved even for a constrained content domain.
Stephen E Arnold, February 27, 2019
Google Book Search: Broken and Unfixable under Current Incentives
February 19, 2019
I read “How Badly is Google Books Search Broken, and Why?” The main point is that search results do not include the expected results. The culprit, as I understand the write up, is that looking for rare strings of characters within a time slice behaves in an unusual manner. I noted this statement:
So possibly Google has one year it displays for books online as a best guess, and another it uses internally to represent the year they have legal certainty a book is released. So maybe those volumes of the congressional record have had their access rolled back as Google realized that 1900 might actually mean 1997; and maybe Google doesn’t feel confident in library metadata for most of its other books, and doesn’t want searchers using date filters to find improperly released books. Oddly, this pattern seems to work differently on other searches. Trying to find another rare-ish term in Google Ngrams, I settled on “rarely used word”; the Ngrams database lists 192 uses before 2002. Of those, 22 show up in the Google index. A 90% disappearance rate is bad, but still a far cry from 99.95%.
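The guess in the quoted passage is easy to simulate: if the year shown to users and the year the system trusts for rights purposes differ, a date-restricted query hides volumes a user would expect to see. The records below are invented.

```python
# Hypothetical book records: the year shown to users vs. the year the system
# trusts for rights purposes (per the quoted guess about Google's behavior).
books = [
    {"title": "Congressional Record v1", "display_year": 1900, "certain_year": 1997},
    {"title": "Congressional Record v2", "display_year": 1901, "certain_year": 1901},
]


def date_filtered_search(records, before_year):
    # Filter on the conservative, internally trusted year, not the displayed one.
    return [r["title"] for r in records if r["certain_year"] < before_year]


# A user filtering for pre-1950 volumes sees only one hit, although both
# records display a pre-1950 year.
print(date_filtered_search(books, before_year=1950))
```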
There are many reasons one can identify for the apparent misbehavior of the Google search system for books. The author identifies the main reason but does not focus on it.
From my point of view and based on the research we have done for my various Google monographs, Google’s search systems operate in silos. But each shares some common characteristics even though the engineers, often reluctantly assigned to what are dead-end or career-stalling projects, make changes.
One of the common flaws has to do with the indexing process itself. None of the Google silos does a very good job with time-related information. Google itself has a fix, but implementing the fix for most of its services is a cost-increasing step.
The result is that Google focuses on innovations which can drive revenue; that is, online advertising for the mobile user of Google services.
But Google’s time blindness is unlikely to be remediated any time soon. For a better implementation of sophisticated time operations, take a look at the technology for time-based retrieval, time slicing, and time analytics from the Google- and In-Q-Tel-funded company Recorded Future.
In my lectures about Google’s time blindness DNA, I compare and contrast what Recorded Future can do versus what Google silos are doing.
Net net: Performing sophisticated analyses of the Google indexes requires the type of tools available from Recorded Future.
Stephen E Arnold, February 19, 2019