No Fooling: Copyright Enforcer Does Indexing Too

April 1, 2020

The Associated Press is one of the oldest, most respected, and most widely read news services in the world. With more than half the world reading Associated Press content, one wonders how the news service organizes and distributes it. Synaptica has more details in the article, “Synaptica Insights: Veronika Zielinska, The Associated Press.”

Veronika Zielinska has a background in computational linguistics and natural language processing. She was interested in how automated tagging, taxonomies, and statistical engines apply rules to content. She joined the Associated Press’s Information Management team in 2005, then moved up to the Metadata Technology team. Her current responsibilities include developing the Metadata Services platform; fine-tuning search quality and relevancy for content distribution platforms; scheme design, data transformations, and analytics and business intelligence programs; and developing content enrichment methods.

Zielinska offers information on how the Associated Press builds a taxonomy:

“We looked at all the content that AP produced and scoped our taxonomy to cover all possible topics, events, places, organizations, people, and companies that our news production covered. News can be about anything – it’s broad, but we also took into account there are certain areas where AP produces more content than others. We have verticals that have huge news coverage – this can be government, politics, sports, entertainment and emerging areas like health, environment, nature, and education. Looking at the content and knowing what the news is about helps us to develop the taxonomy framework. We took this content base and divided the entire news domain into smaller domains. Each person on the team was responsible for their three or four taxonomy domains. They became subject and theme matter experts.”

The value of the Associated Press’s taxonomies comes from the entire content package: everything from photos to articles to videos, centered on descriptive metadata that makes the package manageable and findable.
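
AP’s actual schema is not public, so consider the following a hypothetical sketch of the idea only: one package of photos, articles, and videos sharing a single set of descriptive taxonomy terms, with findability driven by those terms. Every name and field below is invented for illustration.

# Hypothetical sketch: a news "content package" whose photos, articles,
# and videos share one set of descriptive taxonomy terms. AP's actual
# schema is not public; all names here are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    item_id: str
    media_type: str          # "article", "photo", or "video"
    uri: str

@dataclass
class ContentPackage:
    package_id: str
    headline: str
    items: list[ContentItem] = field(default_factory=list)
    subjects: set[str] = field(default_factory=set)  # taxonomy terms
    entities: set[str] = field(default_factory=set)  # people, places, organizations

    def matches(self, term: str) -> bool:
        """A package is findable by any of its descriptive terms."""
        return term in self.subjects or term in self.entities

package = ContentPackage(
    package_id="pkg-001",
    headline="Parliament passes education reform bill",
    items=[ContentItem("a1", "article", "https://example.com/story"),
           ContentItem("p1", "photo", "https://example.com/photo.jpg")],
    subjects={"Education", "Government and politics"},
    entities={"Parliament"},
)
print(package.matches("Education"))  # True: one tag set covers every item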

While the Associated Press is a non-profit news service, it does offer a platform called AP Metadata Services that is used by other news services. The Associated Press frequently updates its taxonomy with new terms as they enter the media. The AP taxonomy team works with the AP Editorial team to identify new terms and topics. The biggest challenges Zielinska faces are maintenance and writing rules in a manner that natural language processing algorithms can understand.

As for the future, Zielinska fears news services losing their budgets, local news not getting as much coverage, and the spread of misinformation. The biggest problem is that automated technologies can take the misinformation and disseminate it. She advises, “Managers can help by creating standardized vocabularies for fact checking across media types, for example, so that deep fakes and other misleading media can be identified consistently across various outlets.”

Whitney Grace, April 1, 2020

TemaTres: Open Source Indexing Tool Updated

February 11, 2020

Open source software is the foundation for many proprietary software startups, including startups launched by the open source developers themselves. Most open source software tends to lag when it comes to updates and patches, but TemaTres recently updated, according to the blog post, “TemaTres 3.1 Release Is Out! Open Source Web Tool To Manage Controlled Vocabularies.”

TemaTres is an open source vocabulary server designed to manage controlled vocabularies, taxonomies, and thesauri. The recent update includes the following:

“  • Utility for importing vocabularies encoded in MARC-XML format
  • Utility for the mass export of vocabulary in MARC-XML format
  • New reports about global vocabulary structure (ex: https://r020.com.ar/tematres/demo/sobre.php?setLang=en#global_view)
  • Distribution of terms according to depth level
  • Distribution of sum of preferred terms and the sum of alternative terms
  • Distribution of sum of hierarchical relationships and sum of associative relationships
  • Report about terms with relevant degree of centrality in the vocabulary (according to prototypical conditions)
  • Presentation of terms with relevant degree of centrality in each facet
  • New options to config the presentation of notes: define specific types of note as prominent (the others note types will be presented in collapsed div).
  • Button for Copy to clipboard the terms with indexing value (Copy-one-click button)
  • New user login scheme (login)
  • Allows to config and add Google Analytics tracking code (parameter in config.tematres.php file)
  • Improvements in standard exposure of metadata tags
  • Inclusion of the term notation or code in the search box predictive text
  • Compatibility with PHP 7.2”
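
To make the “distribution of terms according to depth level” report in the list above concrete: TemaTres itself is a PHP application, so the Python below is an illustrative reconstruction of that kind of report, not TemaTres code. The idea is to walk the broader/narrower tree from the top terms and count how many terms sit at each depth.

# Illustrative sketch, not TemaTres code: count terms at each depth
# level of a controlled vocabulary's broader/narrower tree.
from collections import Counter, deque

# toy vocabulary: term -> list of narrower terms
narrower = {
    "science": ["biology", "physics"],
    "biology": ["genetics"],
    "physics": [],
    "genetics": [],
}
top_terms = ["science"]  # terms with no broader term

def depth_distribution(top_terms, narrower):
    counts = Counter()
    queue = deque((t, 1) for t in top_terms)
    while queue:
        term, depth = queue.popleft()
        counts[depth] += 1
        queue.extend((n, depth + 1) for n in narrower.get(term, []))
    return counts

print(dict(depth_distribution(top_terms, narrower)))
# {1: 1, 2: 2, 3: 1} -> one top term, two at depth two, one at depth three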

TemaTres updates frequently, and the project appears to be actively maintained. The main ethos of open source is to give back as much as you take, and TemaTres appears to follow this modus operandi. If TemaTres wants to promote its Web image, however, the organization should upgrade its Web site, fix the broken links, and provide more information on what the software actually does.

Whitney Grace, February 11, 2020

An Interesting Hypothesis about Google Indexing

January 15, 2020

We noted “Google’s Crawl-Less Index.” The main idea is that something has changed in how Google indexes. We circled in yellow this statement from the article:

Google can do this now because they have a popular web browser, so they can retire their old method of discovering links and let the users do their crawling.

The statement needs context.

The speculation is that Google indexes a Web page only when a user visits a page. Google notes the behavior and indexes the page.

What’s happening, DarkCyber concludes, is that Google no longer brute force crawls the public Web. Indexing takes place when a signal (a human navigating to a page) is received. Then the page is indexed.
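
If the hypothesis holds, the pipeline would look conceptually like the sketch below. This is DarkCyber’s guess rendered as toy code, not Google’s actual system; the function names and queue are invented. The point: a user visit is the only event that puts a URL into the indexing queue.

# Toy rendering of the hypothesis, not Google's actual system.
# Invented names throughout: a visit is the only crawl trigger, so
# pages nobody visits are never indexed.
import queue

index = {}                  # url -> parsed "document"
queued = set()              # urls waiting for the worker
index_queue = queue.Queue()

def on_user_visit(url: str):
    """Hypothetical browser-telemetry hook: the visit triggers indexing."""
    if url not in index and url not in queued:
        queued.add(url)
        index_queue.put(url)

def index_worker(fetch):
    """Drain the queue; fetch() stands in for retrieval and parsing."""
    while not index_queue.empty():
        url = index_queue.get()
        index[url] = fetch(url)
        queued.discard(url)

on_user_visit("https://example.com/obscure-page")
on_user_visit("https://example.com/obscure-page")  # repeat visit: no re-queue
index_worker(lambda url: f"parsed content of {url}")
print(index)  # only the visited page made it into the index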

Is this user-behavior centric indexing a reality?

DarkCyber has noted these characteristics of Google’s indexing in the last year:

  1. Certain sites are in the Google indexes but are either not updated or updated selectively; for example, the Railroad Retirement Board, MARAD, and similar sites
  2. Large sites like the Auto Channel no longer have backfiles indexed and findable unless the user resorts to Google’s advanced search syntax. Then the results display less speedily than more current content, probably because the Google caches do not keep infrequently accessed content in a cache close to that user
  3. Current content for many specialist sites is not available when it is published. This is characteristic of commercial sites with unusual domains like dot co and of some blogs.

What’s going on? DarkCyber believes that Google is trying to reduce the increasing and difficult-to-control costs associated with indexing new content, indexing updated content (the deltas), and indexing the complicated content which Web sites generate in chasing the dream of becoming number one for a Google query.

Search efficiency, as we have documented in our write ups, books, and columns about Google, boils down to:

  1. Maximizing advertising value. That’s one reason why query expansion is used. Results match more ads and, thus, the advertiser’s ads get broader exposure. (A sketch of this mechanism appears after this list.)
  2. Getting away from the old school approach of indexing the billions of Web pages. 90 percent of these Web pages get zero traffic; therefore, index only what’s actually wanted by users. Today’s Google is not focused on library science, relevance, precision, and recall.
  3. Cutting costs. Cost control at the Google is very, very difficult. The crazy moonshots, the free form approach to management, the need for legions of lawyers and contract workers, the fines, the technical debt of a 20 year old company, the salaries, and the extras—each of these has to be controlled. The job is difficult.
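
Here is the sketch promised in point one. The expansion table, ad inventory, and matching function are all made up; the point is only the arithmetic: an expanded query matches more ads than the raw query does.

# Invented expansion table and ad inventory; real ad matching is vastly
# more complex. The point: expansion broadens the matchable ad set.
expansions = {
    "laptop": ["laptop", "notebook", "computer"],
}
ad_inventory = {
    "laptop":   ["Ad: laptop outlet"],
    "notebook": ["Ad: notebook sale"],
    "computer": ["Ad: computer store", "Ad: PC deals"],
}

def matching_ads(query: str, expand: bool) -> list[str]:
    terms = expansions.get(query, [query]) if expand else [query]
    return [ad for t in terms for ad in ad_inventory.get(t, [])]

print(len(matching_ads("laptop", expand=False)))  # 1 ad without expansion
print(len(matching_ads("laptop", expand=True)))   # 4 ads with expansion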

Net net: Ever wonder why finding specific information is getting more difficult via Google? Money.

PS: Finding timely, accurate information and obtaining historical content are more difficult, in DarkCyber’s experience, than at any time since we sold our ThePoint service to Lycos in the mid 1990s.

Stephen E Arnold, January 15, 2020

Instagram Learns about Uncontrolled Indexing

December 23, 2019

Everyone is an expert on search. Everyone can assign index terms, often called metatags or hashtags. The fun world of indexing at this time usually means anyone can make up a “tag” and assign it. This is uncontrolled indexing. The popularity of the method is a result of two things. First, a desire to save money: skilled indexers want to develop controlled vocabularies and guidelines for the use of those terms, and these activities are expensive. In MBA land, who cares? Second, without an editorial policy and editorial controls, MBAs and engineers can say, “Hey, Boomer, we just provide a platform. Not our problem.”

Not surprisingly even some millennials are figuring out that old school indexing has some value, despite the burden of responsibility. Responsible behavior builds a few ethical muscles.

“How Anti-Vaxxers Get around Instagram’s New Hashtag Controls” reveals some of the flaws of uncontrolled indexing and the shallowness of the solutions crafted by some thumb typing content professionals. This passage explains the not too tough method in use by some individuals:

But anti-vaccine Instagram users have been getting around the controls by employing more than 40 cryptic hashtags such as #learntherisk and #justasking.

There you go. Make up a new indexing term and share it with your fellow travelers. Why not use wonky spelling or an odd ball character?

The write up exposes the limitations of rules based term filtering and makes clear that artificial intelligence is not showing up for required office hours.
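
A minimal sketch of why rules-based term filtering leaks: a blocklist stops only the exact strings it already knows. The blocked tag below is hypothetical; #learntherisk and #justasking come from the article. A controlled vocabulary flips the default and rejects anything not explicitly approved.

# Hypothetical blocklist entry; #learntherisk and #justasking are from
# the article. A blocklist catches only exact known strings.
BLOCKED = {"#vaccinesharm"}

def allowed(tag: str) -> bool:
    return tag.lower() not in BLOCKED

for tag in ["#vaccinesharm", "#vacc1nesharm", "#learntherisk", "#justasking"]:
    print(tag, "->", "blocked" if not allowed(tag) else "passes")
# Only the exact known string is blocked; wonky spellings and freshly
# coined tags sail through.

# A controlled vocabulary flips the default: unknown terms are rejected,
# not admitted, so a made-up tag cannot slip in.
def allowed_controlled(tag: str, vocabulary: set[str]) -> bool:
    return tag.lower() in vocabulary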

Should I review the benefits of controlled term indexing? Yes.

Will I? No.

Why? Today no one cares.

Who needs old fashioned methods? No one who wants his or her bonus.

Stephen E Arnold, December 23, 2019

Google and Right to Be Forgotten: Selective Indexing Gets a Green Light

September 25, 2019

DarkCyber noted this BBC article: “Google Wins Landmark Right to Be Forgotten Case.” The main point seems to be that removals under the “right to be forgotten” umbrella apply only in Europe. The BBC stated:

There has been a lot of interest in the case since, had the ruling gone the other way, it could have been viewed as an attempt by Europe to police a US tech giant beyond the EU’s borders.

Several observations may be warranted:

  • Google can indeed filter search results; thus, objective results are unlikely (a sketch of region-based delisting appears after this list)
  • The index pointers are blocked, which means that those in another country can view proscribed links and maybe, just maybe, a Google super user can view what’s in the Google indexes
  • The “algorithms” which are allegedly working automatically may not; therefore, human adjustments to modify search results are probably available to certain search engineers.
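
The first observation can be made concrete with the toy sketch promised above. The table and function are hypothetical; the point is that the entry stays in the index, and only the pointer is suppressed for users in certain regions.

# Hypothetical table and function; the entry stays in the index, and
# only the pointer is hidden from users in listed regions.
delisted = {
    "https://example.com/old-story": {"EU"},  # regions where the link is hidden
}

def visible_results(results: list[str], user_region: str) -> list[str]:
    return [url for url in results
            if user_region not in delisted.get(url, set())]

results = ["https://example.com/old-story", "https://example.com/other"]
print(visible_results(results, "EU"))  # the delisted pointer is filtered out
print(visible_results(results, "US"))  # the same index shows both links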

If these observations are more than hypotheticals, will the index tuning have an impact on other legal matters in which Google is involved? Query reshaping and search results filtering are a fact of Google life.

Stephen E Arnold, September 25, 2019

Google: A Tuneful Allegation about Indexing and Search Results

June 17, 2019

Google continues to attract criticism. DarkCyber noted an interesting twist on cleverness. Google has been a clever outfit. Now there may be evidence that a company with song lyrics may be slightly more clever. According to Boy Genius Report, a company with a database of song lyrics came to believe that Google was copying the lyrics and using them without permission. Remember: this is an allegation, and anyone not clever can make an allegation do a two step. The company with the lyrics is named Genius, and Genius allegedly inserted a coded message within its lyrics. Thus, when Google acquired these lyrics, Genius alleges, the coded messages appeared in Google’s lyrics. Smoking gun? How long has Genius been aware of the GOOG’s alleged improper use of lyrics? The answer, according to the article, is two years.

Several observations:

  1. This is an allegation, so it seems that legal eagles will take flight
  2. The use of “codes” is interesting because it suggests that the intake, indexing, and content processing system in use at Google may operate in an indiscriminate manner. The scraping may give a bad actor an idea for injecting certain types of data into a Google system. (I cover this type of exploit in my lectures about the flaws in the most widely used algorithms in content processing. Now we have allegations of a big time use case. A sketch of this style of watermark appears after this list.)
  3. The allegation may provide some additional information about how Google allegedly favors its own content over that of third parties. The idea which could inspire some legal analysis is that: [a] Google knows via its analytics which content is hot, [b] Google seeks to acquire that content in some manner; and [c] when a query is run for something in that corpus, Google displays its content, not that of a third party.
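
About the sketch promised in item two: press coverage at the time reported that Genius alternated straight and curly apostrophes so the pattern spelled a message in Morse code. The toy version below assumes a simple bit encoding rather than Genius’s exact scheme; the mechanics are the same. The mark is invisible to a casual reader but survives a copy-and-paste scrape.

# Toy watermark via apostrophe style: straight encodes 0, curly encodes 1.
# Assumes a simple bit encoding, not Genius's reported Morse scheme.
STRAIGHT, CURLY = "'", "\u2019"   # two visually similar apostrophes

def watermark(text: str, bits: str) -> str:
    """Replace successive apostrophes with straight (0) or curly (1) marks."""
    out, i = [], 0
    for ch in text:
        if ch in (STRAIGHT, CURLY) and i < len(bits):
            out.append(CURLY if bits[i] == "1" else STRAIGHT)
            i += 1
        else:
            out.append(ch)
    return "".join(out)

def extract(text: str) -> str:
    """Read the bit pattern back out of a (possibly scraped) copy."""
    return "".join("1" if ch == CURLY else "0"
                   for ch in text if ch in (STRAIGHT, CURLY))

marked = watermark("don't stop, it's what's real", "101")
print(extract(marked))  # "101" -- the pattern survives a copy-paste scrape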

Net net: Google is indeed clever, but this may be an example of a smaller company being clever-er. Worth watching what fancy dancing the Google uses to deal with this allegation of “genius.”

Stephen E Arnold, June 17, 2019

Aleph: Another Hidden Internet Indexing Service

January 23, 2019

Law enforcement and intelligence organizations have a new tool to navigate the Dark Web, the Mail & Guardian reports in, “French Start-Up Offers ‘Dark Web’ Compass, but Not for Everyone.” The start-up, called Aleph Networks, has developed a way to navigate the Dark Web, but it wishes the tool to be wielded only for good. In fact, writes reporter Frederic Garlan, the company performs ethics reviews of potential clients and turns down 30 to 40 percent of the licensing requests it receives. We also learn:

“Over the past five years Aleph has indexed 1.4 billion links and 450 million documents across some 140,000 dark web sites. As of December its software had also found 3.9 million stolen credit card numbers. ‘Without a search engine, you can’t have a comprehensive view’ of all the hidden sites, Hernandez said. He and a childhood friend began their adventure by putting their hacking skills to work for free-speech advocates or anti-child abuse campaigners, while holding down day jobs as IT engineers. [Co-founder Celine] Haeri, at the time a teacher, asked for their help in merging blogs by her colleagues opposed to a government reform of the education system. The result became the basis of their mass data collection and indexing software, and the three created Aleph in 2012. They initially raised €200,000 ($228,000) but had several close calls with bankruptcy before finding a keen client in the French military’s weapon and technology procurement agency. ‘They asked us for a demonstration two days after the Charlie Hebdo attack,’ Hernandez said, referring to the 2015 massacre of 12 people at the satirical magazine’s Paris offices, later claimed by a branch of Al-Qaeda. ‘They were particularly receptive to our pitch which basically said, if you don’t know the territory — which is the case with the dark web — you can’t gain mastery of it,’ Haeri added.”

That is a good point. Garlan notes DARPA’s Memex program, which is based on the same principle. As for Aleph, it is now working to incorporate AI into its platform. While the company’s clients so far have mostly been government agencies, it plans to bring in more private-sector clients as it continues to attract investors. Based in Pommiers, France, Aleph Networks was launched in 2012.
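
Aleph does not explain how its software spotted 3.9 million stolen card numbers, so the sketch below is generic rather than Aleph’s method: scan indexed text for 13 to 16 digit runs and keep only those that pass the Luhn checksum, which weeds out random digit strings.

# Generic illustration, not Aleph's method: regex scan plus Luhn check.
import re

def luhn_valid(number: str) -> bool:
    """Luhn checksum: doubles every second digit from the right."""
    digits = [int(d) for d in number][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def candidate_cards(text: str) -> list[str]:
    """Flag 13-16 digit runs that pass the Luhn check."""
    return [m for m in re.findall(r"\b\d{13,16}\b", text) if luhn_valid(m)]

sample = "dump: 4111111111111111 and junk 1234567890123456"
print(candidate_cards(sample))  # ['4111111111111111'] -- the junk fails Luhn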

Cynthia Murrell, January 23, 2019

Google Struggles with Indexing?

November 14, 2018

You probably know that Google traffic was routed to China. The culprit was something obvious. In this case, Nigeria. Yep, Nigeria. You can read about the mistake that provided some interesting bits and bytes to the Middle Kingdom. Yeah, I know. Nigeria. “A Nigerian Company Is in Trouble with Google for Re-Routing Traffic to Russia, China” provides some allegedly accurate information.

But the major news I noted here in Harrod’s Creek concerned Google News and its indexing. Your experience may be different from mine, but Google indexing can be interesting. I was looking for an outfit identified as Inovatio, which is a university anchored outfit in China. The reference to Inovatio in Google aimed me at a rock band and a design company in Slovenia. Google’s smart search system changed Inovatio to innovation even when I used quote marks. I did locate the Inovatio operation using a Chinese search engine. I was able to track down Ampthon.com, which listed Inovatio and provided the university affiliation, allowing me to get some information about an outfit providing surveillance and intercept services to countries in need of this capability.

Google. Indexing. Yeah.

“Google News Publishers Complaining About Indexing Issues” highlights another issue with the beloved Google. I learned:

In the past few days there has been an uptick in complaints from Google News publishers around Google not indexing their new news content. Gary Illyes from Google did a rare appearance on Twitter to say he passed along the feedback to the Google News team to investigate. You can scan through the Google News Help forums and see a nice number of complaints. Also David Esteve, the SEO at the Spanish newspaper El Confidencial, posted his concerns on Twitter.

The good news is that the write up mentions that this indexing glitch is a known issue.

Net net: Many people with whom I speak believe that Google’s index is comprehensive, timely, and consistent.

Yeah, also smart because Inovatio is really innovation.

Stephen E Arnold, November 14, 2018

Indexing Matters: The Investment Sector Analysis

October 15, 2018

I read reports which explain why large monopolistic or oligopolistic companies alter the behavior of certain ecosystems. I don’t see that many because analysts are preoccupied with more practical matters; namely, their bonuses, appearances on Bloomberg TV or CNBC, and riding their hobby horses.

I read and then reread “Platform Giants and Venture Backed Startups.” The premise struck me as obvious. The whales of online are functioning like giant electromagnets. These companies pull traffic, attention, and money. At the same time, they emit beacons which are tuned to the inner ears of investors.


Looks tasty but only semi-organized. And from what is this confection fabricated? Answer: cow hooves. Intellectual Jello, lovingly crafted to delight the eye.

The squeaks of these ultra high frequency waves alert those looking for big paydays to put their money into startups which do not compete head on with the outfits operating like electromagnets.

The “Platform Giant” write up assembles observations from a report which asserts the opposite; that is, big electromagnets do not have an impact on start ups and most investors.

Put that aside.

The core of the write up makes clear that indexing and classification make a difference. The idea is that if one classifies and marshals data, the classification creates a way to look at the data, the world, and in this particular case the way investments flow or do not flow.

What goes into the “Internet software” category becomes the trigger for the conclusion. Invest to compete against the Google? Not a good idea.

The question becomes, “Who does the indexing, classification, ontology, and related bits of the taxonomy?”

Indexing is important. But more important is the creation of the knowledge structure and the categories which will be used to chop, slice, and organize data for analysis.

Get the knowledge structure wrong, and the flawed categorization creates findings that are misleading at best and simply off base at worst.
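
A toy example makes the point. The companies and figures below are invented; the same deals, counted under two different knowledge structures, support two different conclusions about where the money flows.

# Invented companies and figures. Same data, two category schemes,
# two different "findings."
deals = [
    {"name": "Acme Search", "raised": 10},
    {"name": "Acme Ads",    "raised": 50},
    {"name": "Acme Cloud",  "raised": 40},
]

scheme_a = {  # lump everything into one bucket
    "Acme Search": "Internet software",
    "Acme Ads":    "Internet software",
    "Acme Cloud":  "Internet software",
}
scheme_b = {  # split by what the company actually competes with
    "Acme Search": "competes with platform giant",
    "Acme Ads":    "competes with platform giant",
    "Acme Cloud":  "infrastructure",
}

for scheme in (scheme_a, scheme_b):
    totals = {}
    for d in deals:
        bucket = scheme[d["name"]]
        totals[bucket] = totals.get(bucket, 0) + d["raised"]
    print(totals)
# Scheme A: {'Internet software': 100} -- "investment looks healthy"
# Scheme B: {'competes with platform giant': 60, 'infrastructure': 40}
#           -- suddenly the giants' shadow shows up in the numbers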

Who takes the time to work out the knowledge structure before training humans and smart software to assign metadata?

The write up suggests that humans (either with agenda or without, with expertise or not, or with a wonky knowledge superstructure or not) do.

Net net: Counting is verifiable. Pegging what to count may be more like organizing cubes of a gelatin dessert.

Stephen E Arnold, October 15, 2018

Are Some Google Docs Exposed to Web Indexing Systems?

July 21, 2018

Recently, Russian search giant Yandex reported seeing Google Docs turn up in search results. Previously, this was thought to be impossible. However, this raises a question many have taken for granted: namely, how secure are documents in the cloud? This was examined more closely in the Media Post story, “Private Google Docs Serve Up In Yandex Search Engine Results.”

According to the story:

“[O]ther search engines can only serve up Google documents that had either been deliberately made public by its authors or when a user publishes a link to a document and makes it available for public access and search… Saving and protecting users’ personal data is our main priority for search engines. A Yandex spokesperson said the search only yields files that don’t require logins or passwords.”

For its part, Google appears to deflect the Yandex observation. Regardless, the Yandex assertion arrives on the muddy heels of other security woes, like the idea that our Gmail messages and their content could be used by developers. With the Android matter behind it, the EU may look at access to certain Google content.

Patrick Roland, July 21, 2018
