Volv for Brief, Unbiased News

April 5, 2021

We learn about an app that pares the news down to as little information as possible from Insider’s write-up, “Volv Bills Itself as ‘TikTok for News.’ The Snap-Backed App Makes News Stories You Can Read in 9 Seconds.” Who needs in-depth analysis, anyway? Co-founders Shannon Almeida and Priyanka Vazirani wished to create a source of unbiased news; I suppose eliminating any attempts to provide context is one way to do that. Writer Grace Dean tells us:

“It creates news stories, averaging at around 70 words, which users can read in less than nine seconds. The stories are listed in-app in a swipe format that’s easy on the eye. This is crucial to make the app attractive to its millennial target market, Vazirani said. People in their teens and 20s often check their phones before they even get out of bed, logging into various apps to view the latest newsfeed updates. On Volv, users can scroll through and see all the major news stories at a glance. The app combines breaking news with pop culture stories, such as explaining memes that are going viral. A prime example would be Bernie Sanders’ mittens at Joe Biden’s presidential inauguration. In this way, the app can show people the top political and financial stories and covert non-news readers, while also offsetting heavy stories with lighter reads. This approach is paying off. Volv publishes around 50 stories a day and its articles have been read nearly 8 million times so far. Its founders said it has a high retention rate, too.”

Almeida and Vazirani, who had no tech experience before this project, are delighted at its success—they certainly seem to be on to something. We’re told the pair received some good advice from successful entrepreneur Mark Cuban, who shared his thoughts on appealing to millennials and marketing their product to stand out from other news sites. Though Volv currently employs fewer than 10 workers, it is looking to expand to provide more diverse content. Launched last year, the company is based in New York City.

Cynthia Murrell, April 5, 2021

Let Us Now Consider Wonky Data and Tagging

March 31, 2021

As you may know, I find MIT endlessly amusing. From the Jeffrey Epstein matter to smart people who moonlight for other interesting entities, the esteemed university does not disappoint. I noted an article about and MIT finding which is interesting. “MIT”s AI Dataset Study and Startling Findings” reports:

MIT Researchers analyzed 10 test sets from datasets, including ImageNet, and found over 2,900 errors in the ImageNet validation set alone. When used as a benchmark data set, the errors in the dataset were proved to have an incorrect position in correlation to direct observation or ground truth.

So what?

Garbage in, garbage out.

This is not a surprise and it certainly seems obvious. If anything, the researchers’ error rate seems low. There is no information about data pushed into the “exception” folder for indexing systems.

Stephen E Arnold, March 31, 2021

So You Wanna Be a Google?

March 31, 2021

Just a short item which may be of interest to Web indexing wannabes: Datashake has rolled out its Web Scraper API. You can read about how to:

Scrape the web with proxies, CAPTCHA solving, headless browsers and more to avoid being blocked.

You will have to sign up to get “early access” to the service. The service is not free … because scraping Web sites is neither easy nor inexpensive.

There’s not much info about this API as of March 23, 2021, but this type of service beats the pants off trying to cook up our content acquisition scripts in 1993 for the The Point (Top 5% of the Internet). You remember that, don’t you?

Of course, thumbtypers will say, “Okay, boomer, what’s up with that ancient history?”


Stephen E Arnold, March 31, 2021

Historical Revisionism: Twitter and Wikipedia

March 24, 2021

I wish I could recall the name of the slow talking wild-eyed professor who lectured about Mr. Stalin’s desire to have the history of the Soviet Union modified. The tendency was evident early in his career. Ioseb Besarionis dz? Jughashvili became Stalin, so fiddling with received wisdom verified by Ivory Tower types should come as no surprise.

Now we have Google and the right to be forgotten. As awkward as deleting pointers to content may be, digital information invites “reeducation”.

I learned in “Twitter to Appoint Representative to Turkey” that the extremely positive social media outfit will interact with the country’s government. The idea is to make sure content is just A-Okay. Changing tweets for money is a pretty good idea. Even better is coordinating the filtering of information with a nation state is another. But Apple and China seem to be finding a path forward. Maybe Apple in Russia will be a  similar success.

A much more interesting approach to shaping reality is alleged in “Non-English Editions of Wikipedia Have a Misinformation Problem.” Wikipedia has a stellar track record of providing fact rich, neutral information I believe. This “real news” story states:

The misinformation on Wikipedia reflects something larger going on in Japanese society. These WWII-era war crimes continue to affect Japan’s relationships with its neighbors. In recent years, as Japan has seen an increase in the rise of nationalism, then­–Prime Minister Shinzo Abe argued that there was no evidence of Japanese government coercion in the comfort women system, while others tried to claim the Nanjing Massacre never happened.

I am interested in these examples because each provides some color to one of my information “laws”. I have dubbed these “Arnold’s Precepts of Online Information.” Here’s the specific law which provides a shade tree for these examples:

Online information invites revisionism.

Stated another way, when “facts” are online, these are malleable, shapeable, and subjective.

When one runs a query on swisscows.com and then the same query on bing.com, ask:

Are these services indexing the same content?

The answer for me is, “No.” Filters, decisions about what to index, and update calendars shape the reality depicted online. Primary sources are a fine idea, but when those sources are shaped as well, what does one do?

The answer is like one of those Borges stories. Deleting and shaping content is more environmentally friendly than burning written records. A python script works with less smoke.

Stephen E Arnold, March24, 2021

GenX Indexing: Newgen Designs ID Info Extraction Software

March 9, 2021

As vaccines for COVID-19 rollout, countries are discussing vaccination documentation and how to include that with other identification paperwork. The World Health Organization does have some standards for vaccination documentation, but they are not universally applied.  Adding yet another document for international travel makes things even more confusing.  News Patrolling has a headline about new software that could make extracting ID information easier: “Newgen Launches AI And ML-Based Identity Document Extraction And Redaction Software.”

Newgen Software provides low code digital automation platforms developed a new ID extraction software: Intelligent IDXtract.  Intelligent IDXtract extracts required information from identity documents and allows organizations to use the information for multiple reasons across industries.  These include KYC verification, customer onboarding, and employee information management.

Intelligent IDXtract works by:

“Intelligent IDXtract uses a computer vision-based cognitive model to identify the presence, location, and type of one or more entities on a given identity document. The entities can include a person’s name, date of birth, unique ID number, and address, among others. The software leverages artificial intelligence and machine learning, powered by computer vision techniques and rule-based capabilities, to extract and redact entities per business requirements.”
The key features in the software will be seamless integration with business applications, entity recognition and localization, language independent localization and redaction of entity, trainable machine learning for customized models, automatic recognition, interpretation, location, and support for image capture variations.

Hopefully Intelligent IDXtract will streamline processes that require identity documentation as well as vaccine documentation.

Whitney Grace, March 9, 2021

The Semantic Web Identity Crisis? More Like Intellectual Cotton Candy?

February 22, 2021

The Semantic Web identity Crisis: In Search of the Trivialities That Never Were” is a 5,700 word essay about confusion. The write up asserts that those engaged in Semantic Web research have an “ill defined sense of identity.” What I liked about the essay is that semantic progress has been made, but moving from 80 percent of the journey over the last 20 percent is going to be difficult. I would add that making the Semantic Web “work” may be impossible.

The write up explains:

In this article, we make the case for a return to our roots of “Web” and “semantics”, from which we as a Semantic Web community—what’s in a name—seem to have drifted in search for other pursuits that, however interesting, perhaps needlessly distract us from the quest we had tasked ourselves with. In covering this journey, we have no choice but to trace those meandering footsteps along the many detours of our community—yet this time around with a promise to come back home in the end.

Does the write up “come back home”?

In order to succeed, we will need to hold ourselves to a new, significantly higher standard. For too many years, we have expected engineers and software developers to take up the remaining 20%, as if they were the ones needing to catch up with us. Our fallacy has been our insistence that the remaining part of the road solely consisted of code to be written. We have been blind to the substantial research challenges we would surely face if we would only take our experiments out of our safe environments into the open Web. Turns out that the engineers and developers have moved on and are creating their own solutions, bypassing many of the lessons we already learned, because we stubbornly refused to acknowledge the amount of research needed to turn our theories into practice. As we were not ready for the Web, more pragmatic people started taking over.

From my point of view, it looks as if the Semantic Web thing is like a flashy yacht with its rudders and bow thrusters stuck in one position. The boat goes in circles. That would drive the passengers and crew bonkers.

Stephen E Arnold, February 22, 2021

Online Axiom: Distorted Information Is Part of the Datasphere

January 28, 2021

I read a 4,300 word post called “Nextdoor Is Quietly Replacing the Small-Town Paper” about an online social network aimed at “neighbors.” Yep, just like the one in which Mr. Rogers lived in for 31 years.


A world that only exists in upscale communities, populated by down home folks with money, and alarm systems.

The write up explains:

Nextdoor is an evolution of the neighborhood listserv forthe social media age, a place to trade composting tips, offerbabysitting services, or complain about the guy down the street whodoesn’t clean up his dog’s poop. Like many neighborhood listservs,it also has increasingly well-documented issues with racial profiling, stereotyping of the homeless, and political ranting of variousstripes, including QAnon. But Nextdoor has gradually evolved into something bigger and more consequential than just a digital bulletin board: In many communities,the platform has begun to step into roles once filled by America’slocal newspapers.

As I read this, I recalled that Google wants to set up its own news operation in Australia, but the GOOG is signing deals with independent publishers, maybe the mom-and-pop online advertising company should target Nextdoor. Imagine the Google Local ads which could be hosed into this service. Plus, Nextdoor already disappears certain posts and features one of the wonkiest interfaces for displaying comments and locating items offered for free or for sale. Google-ize it?

The article gathers some examples of how the at homers use Nextdoor to communicate. Information, disinformation, and misinformation complement quasi-controversial discussions. But if one gets too frisky, then the “seed” post is deleted from public view.

I have pointed out in my lectures (when I was doing them until the Covid thing) that the local and personal information is a goldmine of information useful to a number of commercial and government entities.

If you know zero about Nextdoor, check out the long, long article hiding happily behind a “register to read” paywall. On the other hand, sign up and check out the service.

Google, if you were a good neighbor, you would be looking at taking Nextdoor to Australia to complement the new play of “Google as a news publisher.” A “real” news outfit. Maybe shaped information is an online “law” describing what’s built in to interactions which are not intermediated?

Stephen E Arnold, January 28, 2021

Mobile and Social Media Users: Check Out the Utility of Metadata

January 15, 2021

Policeware vendors once commanded big, big bucks to match a person of interest to a location. Over the last decade prices have come down. Some useful products cost a fraction of the industrial strength, incredibly clumsy tools. If you are thinking about the hassle of manipulating data in IBM or Palantir products, you are in the murky field of prediction. I have not named the products which I think are the winners of this particular race.


Source: https://thepatr10t.github.io/yall-Qaeda/

The focus of this write up is the useful information derived from the deplatformed Parler social media outfit. An enterprising individual named Patri10tic performed the sort of trick which Geofeedia made semi famous. You can check the map placing specific Parler uses in particular locations based on their messages at this link. What’s the time frame? The unusual protest at the US Capitol.

The point of this short post is different. I want to highlight several points:

  1. Metadata can be more useful than the content of a particular message or voice call
  2. Metadata can be mapped through time creating a nifty path of an individual’s movements
  3. Metadata can be cross correlated with other data. (If you attended one of my Amazon policeware lectures, the cross correlation figures prominently.)
  4. Metadata can be analyzed in more than two dimensions.

To sum up, I want to remind journalists that this type of data detritus has enormous value. That is the reason third parties attempt to bundle data together and provide authorized users with access to them.

What’s this have to do with policeware? From my point of view, almost anyone can replicate what systems costing as much as seven figures a year or more from their laptop at an outdoor table near a coffee shop.

Policeware vendors want to charge a lot. The Parler analysis demonstrates that there are many uses for low or zero cost geo manipulations.

Stephen E Arnold, January 15, 2021

Semantic Scholar: Mostly Useful Abstracting

December 4, 2020

A new search engine specifically tailored to scientific literature uses a highly trained algorithm. MIT Technology Review reports, “An AI Helps You Summarize the latest in AI” (and other computer science topics). Semantic Scholar generates tl;dr sentences for each paper on an author’s page. Literally—they call each summary, and the machine-learning model itself, “TLDR.” The work was performed by researchers at the Allen Institute for AI and the University of Washington’s Paul G. Allen School of Computer Science & Engineering.

AI-generated summaries are either extractive, picking a sentence out of the text to represent the whole, or abstractive, generating a new sentence. Obviously, an abstractive summary would be more likely to capture the essence of a whole paper—if it were done well. Unfortunately, due to limitations of natural language processing, most systems have relied on extractive algorithms. This model, however, may change all that. Writer Karen Hao tells us:

“How they did it: AI2’s abstractive model uses what’s known as a transformer—a type of neural network architecture first invented in 2017 that has since powered all of the major leaps in NLP, including OpenAI’s GPT-3. The researchers first trained the transformer on a generic corpus of text to establish its baseline familiarity with the English language. This process is known as ‘pre-training’ and is part of what makes transformers so powerful. They then fine-tuned the model—in other words, trained it further—on the specific task of summarization. The fine-tuning data: The researchers first created a dataset called SciTldr, which contains roughly 5,400 pairs of scientific papers and corresponding single-sentence summaries. To find these high-quality summaries, they first went hunting for them on OpenReview, a public conference paper submission platform where researchers will often post their own one-sentence synopsis of their paper. This provided a couple thousand pairs. The researchers then hired annotators to summarize more papers by reading and further condensing the synopses that had already been written by peer reviewers.”

The team went on to add a second dataset of 20,000 papers and their titles. They hypothesized that, as titles are themselves a kind of summary, this would refine the model further. They were not disappointed. The resulting summaries average 21 words to summarize papers that average 5,000 words, a compression of 238 times. Compare this to the next best abstractive option at 36.5 times and one can see TLDR is leaps ahead. But are these summaries as accurate and informative? According to human reviewers, they are even more so. We may just have here a rare machine learning model that has received enough training on good data to be effective.

The Semantic Scholar team continues to refine the software, training it to summarize other types of papers and to reduce repetition. They also aim to have it summarize multiple documents at once—good for researchers in a new field, for example, or policymakers being briefed on a complex issue. Stay tuned.

Cynthia Murrell, December 4, 2020

Smarsh Acquires Digital Reasoning

November 26, 2020

On its own website, communications technology firm Smarsh crows, “Smarsh Acquires Digital Reasoning, Combining Global Leadership in Artificial Intelligence and Machine Learning with Market Leading Electronic Communications Archiving and Supervision.” It is worth noting that Digital Reasoning was founded by a college philosophy senior, Tim Estes, who saw the future in machine learning back in 2000. First it was a search system, then an intelligence system, and now part of an archiving system. The company has been recognized among Fast Company’s Most Innovative Companies for AI and recently received the Frost & Sullivan Product Leadership Award in the AI Risk Surveillance Market. Smarsh was smart to snap it up. The press release tells us:

“The transaction brings together the leadership of Smarsh in digital communications content capture, archiving, supervision and e-discovery, with Digital Reasoning’s leadership in advanced AI/ML powered analytics. The combined company will enable customers to spot risks before they happen, maximize the scalability of supervision teams, and uncover strategic insights from large volumes of data in real-time. Smarsh manages over 3 billion messages daily across email, social media, mobile/text messaging, instant messaging and collaboration, web, and voice channels. The company has unparalleled expertise in serving global financial institutions and US-based wealth management firms across both the broker-dealer and registered investment adviser (RIA) segments.”

Dubbing the combined capabilities “Communications Intelligence,” Smarsh’s CEO Brian Cramer promises Digital Reasoning’s AI and machine learning contributions will help clients better manage risk and analyze communications for more profitable business intelligence. Estes adds,

“In this new world of remote work, a company’s digital communications infrastructure is now the most essential one for it to function and thrive. Smarsh and Digital Reasoning provide the only validated and complete solution for companies to understand what is being said in any digital channel and in any language. This enables them to quickly identify things like fraud, racism, discrimination, sexual harassment, and other misconduct that can create substantial compliance risk.”

See the write-up for its list of the upgraded platform’s capabilities. Smarsh was founded in 2001 by financial services professional Stephen Marsh (or S. Marsh). The company has made Gartner’s list of Leaders in Enterprise Information Archiving for six years running, among other accolades. Smarsh is based in Portland, Oregon, and maintains offices in several major cities worldwide.

Our take? Search plus very dense visualization could not push some government applications across the warfighting finish line. Smarsh on!

Cynthia Murrell, November 26, 2020

Next Page »

  • Archives

  • Recent Posts

  • Meta