Why Metadata? The Answer: Easy and Good Enough

April 30, 2021

I read “We Were Promised Strong AI, But Instead We Got Metadata Analysis.” The essay is thoughtful and provides a good summary of indexing’s virtues. The angle of attack is that artificial intelligence has not delivered the zip a couple of bottles of Red Bull provides. Instead, metadata is more like four ounces of Sunny D tangy original.

The write up states:

The phenomenon of metadata replacing AI isn’t just limited to web search. Manually attached metadata trumps machine learning in many fields once they mature – especially in fields where progress is faster than it is in internet search engines. When your elected government snoops on you, they famously prefer the metadata of who you emailed, phoned or chatted to the content of the messages themselves. It seems to be much more tractable to flag people of interest to the security services based on who their friends are and what websites they visit than to do clever AI on the messages they send. Once they’re flagged, a human can always read their email anyway.

This is an accurate statement.

The write up does not address a question I think is important in the AI versus metadata discussion. That question is, “Why?”

Here are some of the reasons I have documented in my books and writings over the years:

  1. Metadata is cheaper to process than paying to make smart software work reliably
  2. Metadata is good enough; that is, key insights can be derived with math taught in most undergraduate mathematics programs. (I lectured about the 10 algorithms which everyone uses. Why? These are good enough.)
  3. Machines can do pretty good indexing; that is, key word and bound phrase extraction and mapping, clustering, graphs of wide paths among nodes, people, etc. (See the sketch after this list.)
  4. Humans have been induced to add their own – often wonky – index terms, or hash tags as the thumbtypers characterize them
  5. Index analysis (Eugene Garfield’s citation analysis) provides reasonably useful indications of what’s important even if one knows zero about a topic, entity, etc.
  6. Packaging indexing – sorry, metadata – as smart software and its ilk converts VCs from skeptics into fantasists. Money flows even though Google’s DeepMind technology is not delivering dump trucks of money to the Alphabet front door. Maybe soon? Who knows?
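
To make the “good enough” point concrete, here is a minimal sketch, in Python, of the kind of undergraduate-level counting behind basic machine indexing and Garfield-style citation analysis. The documents and citation pairs are made-up illustrations, not data from any product.

```python
from collections import Counter

# Tiny corpus standing in for documents to be indexed (illustrative only).
docs = {
    "doc1": "metadata beats smart software when budgets are tight",
    "doc2": "smart software needs training data and money",
    "doc3": "metadata extraction is cheap and good enough",
}

STOPWORDS = {"and", "is", "are", "the", "when", "to"}

def keywords(text, top_n=3):
    """Rank candidate index terms by simple frequency -- no AI required."""
    terms = [t for t in text.lower().split() if t not in STOPWORDS]
    return [term for term, _ in Counter(terms).most_common(top_n)]

# Citation-analysis flavor: count how often each document is cited.
citations = [("doc2", "doc1"), ("doc3", "doc1"), ("doc3", "doc2")]  # (citing, cited)
cited_counts = Counter(cited for _, cited in citations)

for name, text in docs.items():
    print(name, keywords(text), "citations:", cited_counts.get(name, 0))
```

Real systems add weighting (TF-IDF, for example) and phrase detection, but the arithmetic stays firmly at the undergraduate level.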

Net net: The strongest supporters of artificial intelligence have specific needs: Money, vindication of an idea gestated among classmates at a bar, or a desire to become famous.

Who agrees with me? Probably not too many people. As the professionals who founded commercial database products in the late 1970s and early 1980s die off, any chance of getting the straight scoop on the importance of indexing decreases. For AI professionals, that’s probably good news. For those individuals who understand indexing in today’s context, good luck with your mission.

Stephen E Arnold, April 30, 2021

Volv for Brief, Unbiased News

April 5, 2021

We learn about an app that pares the news down to as little information as possible from Insider’s write-up, “Volv Bills Itself as ‘TikTok for News.’ The Snap-Backed App Makes News Stories You Can Read in 9 Seconds.” Who needs in-depth analysis, anyway? Co-founders Shannon Almeida and Priyanka Vazirani wished to create a source of unbiased news; I suppose eliminating any attempts to provide context is one way to do that. Writer Grace Dean tells us:

“It creates news stories, averaging at around 70 words, which users can read in less than nine seconds. The stories are listed in-app in a swipe format that’s easy on the eye. This is crucial to make the app attractive to its millennial target market, Vazirani said. People in their teens and 20s often check their phones before they even get out of bed, logging into various apps to view the latest newsfeed updates. On Volv, users can scroll through and see all the major news stories at a glance. The app combines breaking news with pop culture stories, such as explaining memes that are going viral. A prime example would be Bernie Sanders’ mittens at Joe Biden’s presidential inauguration. In this way, the app can show people the top political and financial stories and convert non-news readers, while also offsetting heavy stories with lighter reads. This approach is paying off. Volv publishes around 50 stories a day and its articles have been read nearly 8 million times so far. Its founders said it has a high retention rate, too.”

Almeida and Vazirani, who had no tech experience before this project, are delighted at its success—they certainly seem to be on to something. We’re told the pair received some good advice from successful entrepreneur Mark Cuban, who shared his thoughts on appealing to millennials and marketing their product to stand out from other news sites. Though Volv currently employs fewer than 10 workers, it is looking to expand to provide more diverse content. Launched last year, the company is based in New York City.

Cynthia Murrell, April 5, 2021

Let Us Now Consider Wonky Data and Tagging

March 31, 2021

As you may know, I find MIT endlessly amusing. From the Jeffrey Epstein matter to smart people who moonlight for other interesting entities, the esteemed university does not disappoint. I noted an article about an MIT finding which is interesting. “MIT’s AI Dataset Study and Startling Findings” reports:

MIT Researchers analyzed 10 test sets from datasets, including ImageNet, and found over 2,900 errors in the ImageNet validation set alone. When used as a benchmark data set, the errors in the dataset were proved to have an incorrect position in correlation to direct observation or ground truth.

So what?

Garbage in, garbage out.

This is not a surprise and it certainly seems obvious. If anything, the researchers’ error rate seems low. There is no information about data pushed into the “exception” folder for indexing systems.

Stephen E Arnold, March 31, 2021

So You Wanna Be a Google?

March 31, 2021

Just a short item which may be of interest to Web indexing wannabes: Datashake has rolled out its Web Scraper API. You can read about how to:

Scrape the web with proxies, CAPTCHA solving, headless browsers and more to avoid being blocked.

You will have to sign up to get “early access” to the service. The service is not free … because scraping Web sites is neither easy nor inexpensive.
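
For readers who have never used such a service, the integration is typically just an authenticated HTTP call. The sketch below is hypothetical: the endpoint, parameter names, and key are placeholders for illustration, not Datashake’s documented API.

```python
import requests

# Hypothetical scraper-API call; the endpoint and parameters are placeholders,
# not the documented Datashake interface.
API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://api.example-scraper.com/v1/scrape"

response = requests.get(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={
        "url": "https://example.com/page-to-index",
        "render_js": "true",   # ask the service to run a headless browser
        "use_proxy": "true",   # rotate proxies to reduce blocking
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # scraped HTML or a structured payload, per the service
```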

There’s not much info about this API as of March 23, 2021, but this type of service beats the pants off trying to cook up our content acquisition scripts in 1993 for The Point (Top 5% of the Internet). You remember that, don’t you?

Of course, thumbtypers will say, “Okay, boomer, what’s up with that ancient history?”

Sigh.

Stephen E Arnold, March 31, 2021

Historical Revisionism: Twitter and Wikipedia

March 24, 2021

I wish I could recall the name of the slow-talking, wild-eyed professor who lectured about Mr. Stalin’s desire to have the history of the Soviet Union modified. The tendency was evident early in his career. Ioseb Besarionis dze Jughashvili became Stalin, so fiddling with received wisdom verified by Ivory Tower types should come as no surprise.

Now we have Google and the right to be forgotten. As awkward as deleting pointers to content may be, digital information invites “reeducation”.

I learned in “Twitter to Appoint Representative to Turkey” that the extremely positive social media outfit will interact with the country’s government. The idea is to make sure content is just A-Okay. Changing tweets for money is a pretty good idea. Coordinating the filtering of information with a nation state is even better. But Apple and China seem to be finding a path forward. Maybe Apple in Russia will be a similar success.

A much more interesting approach to shaping reality is alleged in “Non-English Editions of Wikipedia Have a Misinformation Problem.” Wikipedia has a stellar track record of providing fact-rich, neutral information, I believe. This “real news” story states:

The misinformation on Wikipedia reflects something larger going on in Japanese society. These WWII-era war crimes continue to affect Japan’s relationships with its neighbors. In recent years, as Japan has seen an increase in the rise of nationalism, then–Prime Minister Shinzo Abe argued that there was no evidence of Japanese government coercion in the comfort women system, while others tried to claim the Nanjing Massacre never happened.

I am interested in these examples because each provides some color to one of my information “laws”. I have dubbed these “Arnold’s Precepts of Online Information.” Here’s the specific law which provides a shade tree for these examples:

Online information invites revisionism.

Stated another way, when “facts” are online, they are malleable, shapeable, and subjective.

When one runs a query on swisscows.com and then the same query on bing.com, ask:

Are these services indexing the same content?

The answer for me is, “No.” Filters, decisions about what to index, and update calendars shape the reality depicted online. Primary sources are a fine idea, but when those sources are shaped as well, what does one do?

The answer is like one of those Borges stories. Deleting and shaping content is more environmentally friendly than burning written records. A Python script works with less smoke.
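
For the curious, the swisscows.com versus bing.com divergence is easy to quantify once you collect the top results for the same query yourself. A minimal sketch, with placeholder URLs standing in for real result lists:

```python
# Top result URLs collected manually for the same query on two engines
# (placeholder values for illustration).
swisscows_results = {"https://a.example", "https://b.example", "https://c.example"}
bing_results      = {"https://b.example", "https://d.example", "https://e.example"}

overlap = swisscows_results & bing_results
jaccard = len(overlap) / len(swisscows_results | bing_results)

print(f"Shared results: {sorted(overlap)}")
print(f"Jaccard overlap: {jaccard:.2f}")  # low overlap = different indexes and filters
```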

Stephen E Arnold, March 24, 2021

GenX Indexing: Newgen Designs ID Info Extraction Software

March 9, 2021

As vaccines for COVID-19 roll out, countries are discussing vaccination documentation and how to include it with other identification paperwork. The World Health Organization does have some standards for vaccination documentation, but they are not universally applied. Adding yet another document for international travel makes things even more confusing. News Patrolling has a headline about new software that could make extracting ID information easier: “Newgen Launches AI And ML-Based Identity Document Extraction And Redaction Software.”

Newgen Software, which provides low-code digital automation platforms, developed new ID extraction software: Intelligent IDXtract. Intelligent IDXtract extracts required information from identity documents and allows organizations to use the information for multiple purposes across industries. These include KYC verification, customer onboarding, and employee information management.

Here is how Intelligent IDXtract works:

“Intelligent IDXtract uses a computer vision-based cognitive model to identify the presence, location, and type of one or more entities on a given identity document. The entities can include a person’s name, date of birth, unique ID number, and address, among others. The software leverages artificial intelligence and machine learning, powered by computer vision techniques and rule-based capabilities, to extract and redact entities per business requirements.”
The key features of the software will be seamless integration with business applications, entity recognition and localization, language-independent localization and redaction of entities, trainable machine learning for customized models, automatic recognition, interpretation, and location, and support for image capture variations.
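
As a concrete, if toy-sized, illustration of what “extract and redact” means, the sketch below pulls a date of birth and an ID number out of OCR text with regular expressions and masks them. It is not Newgen’s computer-vision model; the patterns, field names, and sample text are assumptions for the example.

```python
import re

# Toy OCR output from an identity document (illustrative only).
ocr_text = "Name: Jane Doe  DOB: 04/12/1985  ID No: AB1234567  Address: 10 Main St"

ENTITY_PATTERNS = {
    "date_of_birth": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "id_number": re.compile(r"\b[A-Z]{2}\d{7}\b"),
}

def extract_entities(text):
    """Return the first match for each configured entity type."""
    return {name: pat.search(text).group()
            for name, pat in ENTITY_PATTERNS.items() if pat.search(text)}

def redact(text):
    """Replace each detected entity with a masked placeholder."""
    for name, pat in ENTITY_PATTERNS.items():
        text = pat.sub(f"[{name.upper()} REDACTED]", text)
    return text

print(extract_entities(ocr_text))  # {'date_of_birth': '04/12/1985', 'id_number': 'AB1234567'}
print(redact(ocr_text))
```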

Hopefully Intelligent IDXtract will streamline processes that require identity documentation as well as vaccine documentation.

Whitney Grace, March 9, 2021

The Semantic Web Identity Crisis? More Like Intellectual Cotton Candy?

February 22, 2021

“The Semantic Web Identity Crisis: In Search of the Trivialities That Never Were” is a 5,700-word essay about confusion. The write up asserts that those engaged in Semantic Web research have an “ill-defined sense of identity.” What I liked about the essay is its acknowledgment that, while semantic progress has been made, getting from the 80 percent mark through the last 20 percent of the journey is going to be difficult. I would add that making the Semantic Web “work” may be impossible.

The write up explains:

In this article, we make the case for a return to our roots of “Web” and “semantics”, from which we as a Semantic Web community—what’s in a name—seem to have drifted in search for other pursuits that, however interesting, perhaps needlessly distract us from the quest we had tasked ourselves with. In covering this journey, we have no choice but to trace those meandering footsteps along the many detours of our community—yet this time around with a promise to come back home in the end.

Does the write up “come back home”?

In order to succeed, we will need to hold ourselves to a new, significantly higher standard. For too many years, we have expected engineers and software developers to take up the remaining 20%, as if they were the ones needing to catch up with us. Our fallacy has been our insistence that the remaining part of the road solely consisted of code to be written. We have been blind to the substantial research challenges we would surely face if we would only take our experiments out of our safe environments into the open Web. Turns out that the engineers and developers have moved on and are creating their own solutions, bypassing many of the lessons we already learned, because we stubbornly refused to acknowledge the amount of research needed to turn our theories into practice. As we were not ready for the Web, more pragmatic people started taking over.

From my point of view, it looks as if the Semantic Web thing is like a flashy yacht with its rudders and bow thrusters stuck in one position. The boat goes in circles. That would drive the passengers and crew bonkers.

Stephen E Arnold, February 22, 2021

Online Axiom: Distorted Information Is Part of the Datasphere

January 28, 2021

I read a 4,300-word post called “Nextdoor Is Quietly Replacing the Small-Town Paper” about an online social network aimed at “neighbors.” Yep, just like the one in which Mr. Rogers lived for 31 years.


A world that only exists in upscale communities, populated by down home folks with money, and alarm systems.

The write up explains:

Nextdoor is an evolution of the neighborhood listserv for the social media age, a place to trade composting tips, offer babysitting services, or complain about the guy down the street who doesn’t clean up his dog’s poop. Like many neighborhood listservs, it also has increasingly well-documented issues with racial profiling, stereotyping of the homeless, and political ranting of various stripes, including QAnon. But Nextdoor has gradually evolved into something bigger and more consequential than just a digital bulletin board: In many communities, the platform has begun to step into roles once filled by America’s local newspapers.

As I read this, I recalled that Google wants to set up its own news operation in Australia. But since the GOOG is signing deals with independent publishers, maybe the mom-and-pop online advertising company should target Nextdoor. Imagine the Google Local ads which could be hosed into this service. Plus, Nextdoor already disappears certain posts and features one of the wonkiest interfaces for displaying comments and locating items offered for free or for sale. Google-ize it?

The article gathers some examples of how the at homers use Nextdoor to communicate. Information, disinformation, and misinformation complement quasi-controversial discussions. But if one gets too frisky, then the “seed” post is deleted from public view.

I have pointed out in my lectures (which I was doing until the Covid thing) that local and personal information is a goldmine useful to a number of commercial and government entities.

If you know zero about Nextdoor, check out the long, long article hiding happily behind a “register to read” paywall. On the other hand, sign up and check out the service.

Google, if you were a good neighbor, you would be looking at taking Nextdoor to Australia to complement the new play of “Google as a news publisher.” A “real” news outfit. Maybe shaped information is an online “law” describing what’s built into interactions which are not intermediated?

Stephen E Arnold, January 28, 2021

Mobile and Social Media Users: Check Out the Utility of Metadata

January 15, 2021

Policeware vendors once commanded big, big bucks to match a person of interest to a location. Over the last decade prices have come down. Some useful products cost a fraction of the industrial strength, incredibly clumsy tools. If you are thinking about the hassle of manipulating data in IBM or Palantir products, you are in the murky field of prediction. I have not named the products which I think are the winners of this particular race.


Source: https://thepatr10t.github.io/yall-Qaeda/

The focus of this write up is the useful information derived from the deplatformed Parler social media outfit. An enterprising individual named Patri10tic performed the sort of trick which Geofeedia made semi-famous. You can check the map placing specific Parler users in particular locations based on their messages at this link. What’s the time frame? The unusual protest at the US Capitol.

The point of this short post is different. I want to highlight several points:

  1. Metadata can be more useful than the content of a particular message or voice call
  2. Metadata can be mapped through time, creating a nifty path of an individual’s movements (see the sketch after this list)
  3. Metadata can be cross correlated with other data. (If you attended one of my Amazon policeware lectures, the cross correlation figures prominently.)
  4. Metadata can be analyzed in more than two dimensions.
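
Here is a minimal sketch of point two: with nothing but message timestamps and coordinates (made-up values below), a few lines of Python yield a time-ordered movement path for a single account.

```python
from datetime import datetime

# Made-up message metadata for a single account: no message content needed.
messages = [
    {"ts": "2021-01-06T14:05:00", "lat": 38.8899, "lon": -77.0091},
    {"ts": "2021-01-06T12:30:00", "lat": 38.8977, "lon": -77.0365},
    {"ts": "2021-01-06T13:15:00", "lat": 38.8913, "lon": -77.0046},
]

# Sort by timestamp to turn scattered posts into a movement path.
path = sorted(messages, key=lambda m: datetime.fromisoformat(m["ts"]))

for point in path:
    print(point["ts"], point["lat"], point["lon"])
```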

To sum up, I want to remind journalists that this type of data detritus has enormous value. That is the reason third parties attempt to bundle data together and provide authorized users with access to them.

What’s this have to do with policeware? From my point of view, almost anyone can replicate, from a laptop at an outdoor table near a coffee shop, what systems costing seven figures a year or more deliver.

Policeware vendors want to charge a lot. The Parler analysis demonstrates that there are many uses for low or zero cost geo manipulations.

Stephen E Arnold, January 15, 2021

Semantic Scholar: Mostly Useful Abstracting

December 4, 2020

A new search engine specifically tailored to scientific literature uses a highly trained algorithm. MIT Technology Review reports, “An AI Helps You Summarize the Latest in AI” (and other computer science topics). Semantic Scholar generates tl;dr sentences for each paper on an author’s page. Literally—they call each summary, and the machine-learning model itself, “TLDR.” The work was performed by researchers at the Allen Institute for AI and the University of Washington’s Paul G. Allen School of Computer Science & Engineering.

AI-generated summaries are either extractive, picking a sentence out of the text to represent the whole, or abstractive, generating a new sentence. Obviously, an abstractive summary would be more likely to capture the essence of a whole paper—if it were done well. Unfortunately, due to limitations of natural language processing, most systems have relied on extractive algorithms. This model, however, may change all that. Writer Karen Hao tells us:

“How they did it: AI2’s abstractive model uses what’s known as a transformer—a type of neural network architecture first invented in 2017 that has since powered all of the major leaps in NLP, including OpenAI’s GPT-3. The researchers first trained the transformer on a generic corpus of text to establish its baseline familiarity with the English language. This process is known as ‘pre-training’ and is part of what makes transformers so powerful. They then fine-tuned the model—in other words, trained it further—on the specific task of summarization. The fine-tuning data: The researchers first created a dataset called SciTldr, which contains roughly 5,400 pairs of scientific papers and corresponding single-sentence summaries. To find these high-quality summaries, they first went hunting for them on OpenReview, a public conference paper submission platform where researchers will often post their own one-sentence synopsis of their paper. This provided a couple thousand pairs. The researchers then hired annotators to summarize more papers by reading and further condensing the synopses that had already been written by peer reviewers.”

The team went on to add a second dataset of 20,000 papers and their titles. They hypothesized that, as titles are themselves a kind of summary, this would refine the model further. They were not disappointed. The resulting summaries average 21 words to summarize papers that average 5,000 words, a compression of 238 times. Compare this to the next best abstractive option at 36.5 times and one can see TLDR is leaps ahead. But are these summaries as accurate and informative? According to human reviewers, they are even more so. We may just have here a rare machine learning model that has received enough training on good data to be effective.
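
For readers who want to try abstractive summarization themselves, here is a minimal sketch using the Hugging Face transformers summarization pipeline. It loads the library’s default summarization model, not AI2’s TLDR model, and the abstract text is a placeholder.

```python
from transformers import pipeline

# Generic abstractive summarizer from the Hugging Face hub; this is not
# AI2's TLDR model, just an illustration of the technique.
summarizer = pipeline("summarization")

abstract = (
    "We present a transformer-based model that is pre-trained on a large "
    "generic corpus and then fine-tuned on pairs of scientific papers and "
    "one-sentence summaries, producing extreme TLDR-style compressions."
)

summary = summarizer(abstract, max_length=30, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```

The pipeline downloads a model on first use; in practice one would point it at a model fine-tuned on scientific abstracts, as the AI2 team did with SciTldr.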

The Semantic Scholar team continues to refine the software, training it to summarize other types of papers and to reduce repetition. They also aim to have it summarize multiple documents at once—good for researchers in a new field, for example, or policymakers being briefed on a complex issue. Stay tuned.

Cynthia Murrell, December 4, 2020
