Semantic Scholar: Mostly Useful Abstracting

December 4, 2020

A new search engine specifically tailored to scientific literature uses a highly trained algorithm. MIT Technology Review reports, “An AI Helps You Summarize the latest in AI” (and other computer science topics). Semantic Scholar generates tl;dr sentences for each paper on an author’s page. Literally—they call each summary, and the machine-learning model itself, “TLDR.” The work was performed by researchers at the Allen Institute for AI and the University of Washington’s Paul G. Allen School of Computer Science & Engineering.

AI-generated summaries are either extractive, picking a sentence out of the text to represent the whole, or abstractive, generating a new sentence. Obviously, an abstractive summary would be more likely to capture the essence of a whole paper—if it were done well. Unfortunately, due to limitations of natural language processing, most systems have relied on extractive algorithms. This model, however, may change all that. Writer Karen Hao tells us:

“How they did it: AI2’s abstractive model uses what’s known as a transformer—a type of neural network architecture first invented in 2017 that has since powered all of the major leaps in NLP, including OpenAI’s GPT-3. The researchers first trained the transformer on a generic corpus of text to establish its baseline familiarity with the English language. This process is known as ‘pre-training’ and is part of what makes transformers so powerful. They then fine-tuned the model—in other words, trained it further—on the specific task of summarization. The fine-tuning data: The researchers first created a dataset called SciTldr, which contains roughly 5,400 pairs of scientific papers and corresponding single-sentence summaries. To find these high-quality summaries, they first went hunting for them on OpenReview, a public conference paper submission platform where researchers will often post their own one-sentence synopsis of their paper. This provided a couple thousand pairs. The researchers then hired annotators to summarize more papers by reading and further condensing the synopses that had already been written by peer reviewers.”

The team went on to add a second dataset of 20,000 papers and their titles. They hypothesized that, as titles are themselves a kind of summary, this would refine the model further. They were not disappointed. The resulting summaries average 21 words to summarize papers that average 5,000 words, a compression of 238 times. Compare this to the next best abstractive option at 36.5 times and one can see TLDR is leaps ahead. But are these summaries as accurate and informative? According to human reviewers, they are even more so. We may just have here a rare machine learning model that has received enough training on good data to be effective.

The Semantic Scholar team continues to refine the software, training it to summarize other types of papers and to reduce repetition. They also aim to have it summarize multiple documents at once—good for researchers in a new field, for example, or policymakers being briefed on a complex issue. Stay tuned.

Cynthia Murrell, December 4, 2020

Smarsh Acquires Digital Reasoning

November 26, 2020

On its own website, communications technology firm Smarsh crows, “Smarsh Acquires Digital Reasoning, Combining Global Leadership in Artificial Intelligence and Machine Learning with Market Leading Electronic Communications Archiving and Supervision.” It is worth noting that Digital Reasoning was founded by a college philosophy senior, Tim Estes, who saw the future in machine learning back in 2000. First it was a search system, then an intelligence system, and now part of an archiving system. The company has been recognized among Fast Company’s Most Innovative Companies for AI and recently received the Frost & Sullivan Product Leadership Award in the AI Risk Surveillance Market. Smarsh was smart to snap it up. The press release tells us:

“The transaction brings together the leadership of Smarsh in digital communications content capture, archiving, supervision and e-discovery, with Digital Reasoning’s leadership in advanced AI/ML powered analytics. The combined company will enable customers to spot risks before they happen, maximize the scalability of supervision teams, and uncover strategic insights from large volumes of data in real-time. Smarsh manages over 3 billion messages daily across email, social media, mobile/text messaging, instant messaging and collaboration, web, and voice channels. The company has unparalleled expertise in serving global financial institutions and US-based wealth management firms across both the broker-dealer and registered investment adviser (RIA) segments.”

Dubbing the combined capabilities “Communications Intelligence,” Smarsh’s CEO Brian Cramer promises Digital Reasoning’s AI and machine learning contributions will help clients better manage risk and analyze communications for more profitable business intelligence. Estes adds,

“In this new world of remote work, a company’s digital communications infrastructure is now the most essential one for it to function and thrive. Smarsh and Digital Reasoning provide the only validated and complete solution for companies to understand what is being said in any digital channel and in any language. This enables them to quickly identify things like fraud, racism, discrimination, sexual harassment, and other misconduct that can create substantial compliance risk.”

See the write-up for its list of the upgraded platform’s capabilities. Smarsh was founded in 2001 by financial services professional Stephen Marsh (or S. Marsh). The company has made Gartner’s list of Leaders in Enterprise Information Archiving for six years running, among other accolades. Smarsh is based in Portland, Oregon, and maintains offices in several major cities worldwide.

Our take? Search plus very dense visualization could not push some government applications across the warfighting finish line. Smarsh on!

Cynthia Murrell, November 26, 2020

AI Tech Used to Index and Search Joint Pathology Center Archive

November 23, 2020

The US’s Joint Pathology Center is the proud collector of the world’s largest group of preserved human tissue samples. Now, with help from University of Waterloo’s KIMIA Lab in Ontario, Canada, the facility will soon be using AI to index and search its digital archive of samples. ComputerUser announces the development in, “Artificial Intelligence Search Technology Will Be Used to Help Modernize US Federal Pathology Facility.”

As happy as we are to see the emergence of effective search solutions, we are also ticked by the names KIMIA used—the image search engine is commercialized under the name Lagotto, and the image retrieval tech is dubbed Yottixel. The write-up tells us:

“Yottixel will be used to enhance biomedical research for infectious diseases and cancer, enabling easier data sharing to facilitate collaboration and medical advances. The JPC is the leading pathology reference centre for the US federal government and part of the US Defense Health Agency. In the last century, it has collected more than 55 million glass slides and 35 million tissue block samples. Its data spans every major epidemic and pandemic, and was used to sequence the Spanish flu virus of 1918. It is expected that the modernization also helps to better understand and fight the COVID-19 pandemic. … Researchers at Waterloo have obtained promising diagnostic results using their AI search technology to match digital images of tissue samples in suspected cancer cases with known cases in a database. In a paper published earlier this year, a validation project led by Kimia Lab achieved accurate diagnoses for 32 kinds of cancer in 25 organs and body parts.”

Short for the Laboratory for Knowledge Inference in Medical Image Analysis, KIMIA Lab focuses on mass image data in medical archives using machine learning schemes. Established in 2013 and hosted by the University of Waterloo’s Faculty of Engineering, the program trains students and hosts international visiting scholars.

Cynthia Murrell, November 23, 2020

Defeating Facial Recognition: Chasing a Ghost

August 12, 2020

The article hedges. Check the title: “This Tool could Protect Your Photos from Facial Recognition.” Notice the “could”. The main idea is that people do not want their photos analyzed and indexed with the name, location, state of mind, and other index terms. I am not so sure, but the write up explains with that “could” coloring the information:

The software is not intended to be just a one-off tool for privacy-loving individuals. If deployed across millions of images, it would be a broadside against facial recognition systems, poisoning the accuracy of the data sets they gather from the Web. <

So facial recognition = bad. Screwing up facial recognition = good.

There’s more:

“Our goal is to make Clearview go away,” said Dr Ben Zhao, a professor of computer science at the University of Chicago.

Okay, a company is a target.

How’s this work:

Fawkes converts an image — or “cloaks” it, in the researchers’ parlance — by subtly altering some of the features that facial recognition systems depend on when they construct a person’s face print.

Several observations:

  • In the event of a problem like the explosion in Lebanon, maybe facial recognition can identify some of those killed.
  • Law enforcement may find narrowing a pool of suspects to a smaller group may enhance an investigative process.
  • Unidentified individuals who are successfully identified “could” add precision to Covid contact tracking.
  • Applying the technology to differentiate “false” positives from “true”positives in some medical imaging activities may be helpful in some medical diagnoses.

My concern is that technical write ups are often little more than social polemics. Examining the upside and downside of an innovation is important. Converting  a technical process into a quest to “kill” a company, a concept, or an application of technical processes is not helpful in DarkCyber’s view.

Stephen E Arnold, August 12, 2020

Twitter: Another Almost Adult Moment

August 7, 2020

Indexing is useful. Twitter seems to be recognizing this fact. “Twitter to Label State-Controlled News Accounts” reports:

The company will also label the accounts of government-linked media, as well as “key government officials” from China, France, Russia, the UK and US. Russia’s RT and China’s Xinhua News will both be affected by the change. Twitter said it was acting to provide people with more context about what they see on the social network.

Long overdue, the idea of an explicit index term may allow some tweeters to get some help when trying to figure out where certain stories originate.

Twitter, a particularly corrosive social media system, has avoided adult actions. The firm’s security was characterized in a recent DarkCyber video as a clown car operation. No words were needed. The video showed a clown car.

Several questions from the DarkCyber team:

  1. When will Twitter verify user identities, thus eliminating sock puppet accounts? Developers of freeware manage this type of registration and verification process, not perfectly but certainly better than some other organizations’.
  2. When will Twitter recognize that a tiny percentage of its tweeters account for the majority of the messages and implement a Twitch-like system to generate revenue from these individuals? Pay-per-use can be implemented in many ways, so can begging for dollars. Either way, Twitter gets an identification point which may have other functions.
  3. When will Twitter innovate? The service is valuable because a user or sock puppet can automate content regardless of its accuracy. Twitter has been the same for a number of Internet years. Dogs do age.

Is Twitter, for whatever reason, stuck in the management mentality of a high school science club which attracts good students, just not the whiz kids who are starting companies and working for Google type outfits from their parents’ living room?

Stephen E Arnold, August 7, 2020

NLP with an SEO Spin

July 8, 2020

If you want to know how search engine optimization has kicked librarians and professional indexers in the knee and stomped on their writing hand, you will enjoy “Classifying 200,000 Articles in 7 Hours Using NLP” makes clear that human indexers are going to become the lamp lighters of the 21st century. Imagine. No libraries, no subject matter experts curating and indexing content, no human judgment. Nifty. Perfect for a post Quibi world.

The write up explains the indexing methods of one type of smart software. The passages below highlights the main features of the method:

Weak supervision: the human annotator explains their chosen label to the AI model by highlighting the key phrases in the example that helped them make the decision. These highlights are then used to automatically generate nuanced rules, which are combined and used to augment the training dataset and boost the model’s quality.

Uncertainty sampling: it finds those examples for which the model is most uncertain, and suggests them for human review.
Diversity sampling: it helps make sure that the dataset covers as diverse a set of data as possible. This ensures the model learns to handle all of the real-world cases.

Guided learning: it allows you to search through your dataset for key examples. This is particularly useful when the original dataset is very imbalanced (it contains very few examples of the category you care about).

These phrases may not be clear. May I elucidate:

  • Weak supervision. Subject matter experts riding herd. No way. Inefficient and not optimizable.
  • Uncertainty sampling means a “fudge factor” or “fuzzifying.” A metaphor might be “close enough for horse shoes.”
  • Guided learning. Yep, manual assembly of training data, recalibration, and more training until the horse shoe thing scores a point.

The write up undermines its good qualities with a reference to Google. Has anyone noticed that Google’s first page of results for most of my queries are advertisements.

NLP and horse shoes. Perfect match. Why are the index and classification codes those which an educated person would find understandable and at hand? Forget answering this question. Just remember good enough and close enough for horse shoes. Clang and kha-ching as another ad sucks in a bidder.

Stephen E Arnold, July 8, 2020

Smartlogic: Making Indexing a Thing

May 29, 2020

Years ago, one of the wizards of Smartlogic visited the DarkCyber team. The group numbered about seven of my loyal researchers. These were people who had worked on US government projects, analyses for now disgraced banks in NYC, and assorted high technology firms. Was the world’s largest search system in this list? Gee, I don’t recall.

In that presentation, Smartlogic’s wizard explained that indexing, repositioned as tagging was important. Examples of the values of metatagging (presumably a more advanced form of the 40 year old classification codes used in the ABI/INFORM database since — what? — 1983. Smartlogic embarked on a mini acquisition spree, purchasing the interesting Schemalogic company about a decade ago.

What did Schemalogic do? In addition to being a wonderland for Windows Certified Professionals, the “server” managed index terms. The idea was that people in different departments assigned key words to different digital entities; for example, an engineer might assign the key word “E12.” This is totally clear to a person who thinks about resistors, but to a Home Economics graduate working in marketing the E12 was a bit of a puzzle. The idea that an organization in the pre Covid days could develop a standard set of tags is a fine idea. There are boot camps and specialist firms using words like taxonomy or controlled terms in their marketing collateral. However, humans are not too good at assigning terms. Humans get tired and fall back upon their faves. Other humans are stupid, bored, or indifferent and just assign terms and be done with it. Endeca’s interesting Guided Navigation worked because the company cleverly included consulting in a license. The consulting consisted of humans who worked up the needed vocabulary for a liquor store or preferably an eCommerce site with a modest number of products for sale. (There are some computational challenges inherent in the magical Endeca facets.)

Consequently massive taxonomy projects come and then fade. A few stick around, but these are often hooked to applications with non volatile words. The Engineering Index is a taxonomy, but its terminology is of scant use to an investment bank. How about a taxonomy for business? ABI/INFORM created, maintained, and licensed its vocabulary to outfits like the Royal Bank of Canada. However, ABI/INFORM moved into the brilliant managers at other firms. I assume a biz dev professional at whatever owner possesses rights to the vocabulary will cut a deal.

Back to Smartlogic.

Into this historical stew, Smartlogic offered a better way. I think that was the point of the marketing presentation we enjoyed years ago. Today the words have become more jargon centric, but the idea is the same: Index in a way that makes it possible to find that E12 when the vocabulary of the home ec major struggles with engineer-speak.

Our perception evolved. Smartlogic dabbled in the usual markets. Enterprise search vendors pushed into customer support. Smartlogic followed. Health and medical outfits struggled with indexing content and medical claims form. Indexing specialists followed the search vendors. Smartlogic has enthusiastically chased those markets as well. An exit for the company’s founders has not materialized. The dream of many — a juicy IPO — must work through the fog of the post Covid business world.

The most recent pivot is announced this way:


Smartlogic now offers indexing for these sectors expressed in what may be Smartlogic compound controlled terms featuring conjunctions. There you go, Bing, Google, Swisscows, Qwant, and Yandex. Parse these and then relax the users’ query. That’s what happens to well considered controlled terms today DarkCyber knows.

  • Energy and utilities
  • Financial services and insurance
  • Health care
  • High tech and manufacturing
  • Media and publishing
  • Life sciences
  • Retail and consumer products
  • and of course, intelligence (presumably business, military, competitive, and enforcement).

Is the company pivoting or running a Marketing 101 game plan?

DarkCyber noted that Smartlogic offers a remarkable array of services, technologies (including our favorites semantic and knowledge management), and — wait for it — artificial intelligence.

Interesting. Indexing is versatile and definitely requires a Swiss Army Knife of solutions, a Gartner encomium, and those pivots. Those spins remains anchored to indexing.

Want to know more about Smartlogic? Navigate to the company’s Web site. There’s even a blog! Very reliable outfit. Quick response. Objective. Who could ask for anything more?

Stephen E Arnold, May 29, 2020

YouTube and Objective Search Results

May 13, 2020

DarkCyber, working from a run down miner’s camp in rural Kentucky, does not understand the outside world. One of the DarkCyber research team who actually graduated from middle school spotted this article: “YouTube CEO Admits Users Don’t Like Boosting Of “Authoritative” Mainstream Channels, But They Do It Anyway.”

The article appears to present information implicating the most popular video service in Eastern Europe, including and the “stans” in some surprising activities.

The article asserts:

YouTube CEO Susan Wojcicki admits that the company knows its users don’t like the video giant rigging its own algorithm to boost “authoritative” mainstream sources, but that they do it anyway.

The article notes:

For several years now, the company has artificially gamed its own search engine to ensure that independent content creators are buried underneath a wall of mainstream media content. This rigging is so severe that the company basically broke its own search engine, with some videos posted by independent creators almost impossible to find even if the user searches for the exact title.

One fascinating connection between the providers of content from Van Wives is:

the company’s disdain for its own user base was also underscored by its Chief Product Officer Neil Mohan insulting non-mainstream YouTube creators as basement-dwelling idiots. This followed a new policy by the company to remove any content that challenged the World Health Organization’s official coronavirus guidelines, despite the fact that those guidelines have changed numerous times.

Here in Kentucky, the world is shaped by individuals walking along empty roads and mostly unused trails in the hills.

When big city information like this reaches the DarkCyber research team, our first instinct is to search Google and YouTube, maybe Google News or the comprehensive Google Scholar indexes. But this write up suggests that the information displayed may be subjective, the team is confused.

The team believes that what appears in the Google search results is accurate.

Sometimes we don’t believe the state’s environmental officer who has recently decided to wear shoes. The information in the hollow is that yellow green water is safe to drink.

Does this person obtain information as we do? A Google search? Are those Google algorithms the digital equivalent of the local grocer who puts his grimy thumb on the scale when weighing kiwano and feijoa? Our grocer tells us that durian smells great too.

Stephen E Arnold, May 13, 2020

No Fooling: Copyright Enforcer Does Indexing Too

April 1, 2020

The Associated Press is one of the oldest, most respected, and widely read news services in the world. As more than half the world reads Associated Press, it makes one wonder how the news services organizes and distributes its content. Synaptica has more details in the article, “Synaptica Insights: Veronika Zielinska, The Associated Press.”

Veronika Zielinska has a background in computational linguistics and natural language. She was interested in how automated tagging, taxonomies, and statistical engines apply rules to content. She joined Associated Press’s Information Management team in 2005, then moving up to the Metadata Technology team. Her current responsibilities are to develop the Metadata Services platform, fine tuning search quality and relevancy for content distribution platforms, scheme design, data transformations, analytics and business intelligence programs, and developing content enrichment methods.

Zielinska offers information on how the Associated Press builds a taxonomy:

“We looked at all the content that AP produced and scoped our taxonomy to cover all possible topics, events, places, organizations, people, and companies that our news production covered. News can be about anything – it’s broad, but we also took into account there are certain areas where AP produces more content than others. We have verticals that have huge news coverage – this can be government, politics, sports, entertainment and emerging areas like health, environment, nature, and education. Looking at the content and knowing what the news is about helps us to develop the taxonomy framework. We took this content base and divided the entire news domain into smaller domains. Each person on the team was responsible for their three or four taxonomy domains. They became subject and theme matter experts.”

The value of Associated Press’s taxonomies comes from the entire content package that includes everything from photos, articles, and videos centered around descriptive metadata that makes it agreeable and findable.

While the Associated Press is a non-profit news service, they do offer a platform called AP Metadata Services that is used by other news services. The Associated Press frequently updates its taxonomy with new terms when they enter the media. The AP taxonomy team works with the AP Editorial team to identify new terms and topics. The biggest challenges Zielinska faces are maintenance and writing in a manner that the natural language processing algorithms can understand it.

As for the future, Zielinska fears news services losing their budgets, local news not getting as much coverage, and the spread of misinformation. The biggest problem is that automated technologies can take the misinformation and disseminate it. She advises, “Managers can help by creating standardized vocabularies for fact checking across media types, for example, so that deep fakes and other misleading media can be identified consistently across various outlets.”

Whitney Grace, April 1, 2020

Swagiggle? Nope, Not an April Fooler

April 1, 2020

Big ecommerce sites like eBay and Amazon depend on a robust, accurate, and functional search engine. Without a powerful search application, searching for items on eBay and Amazon is like looking through every page of a printed catalog. The only difference is that there are millions of items compared to the thousands in one catalog. Amazon and eBay are not always accurate, especially when users edit and add content without being monitored. That means there is room for improvement and a startup to worm their way into the big leagues. Swagiggle is a:

“Swagiggle is a precision shopping search and product discovery website created by WAND, Inc. to demonstrate the capabilities of its taxonomy based product data organization and enrichment abilities featured in the WAND eCommerce Taxonomy Portal and PIM. WAND, Inc. is the world’s leading provider of pre-defined taxonomies, including the WAND Product and Service Taxonomy.

Have you ever had the experience of going to a category on an online retail site and seeing mis-categorized items? Or, a bunch of items dumped into a catch-all “Accessories” category. At Swagiggle, our goal is to provide accurate and specific categories so that our users can quickly find exactly the products they are looking for. From there, we assign product specifications so that users can filter through the items in a category and find exactly what they want.”

Wand’s Swagiggle sounds like an awesome product. Using products from its clients, Swagiggle offers an online catalog for users to search for products they wish to buy. These products range from clothing to cleaning products. The items are organized by large categories, then users man drill down to specific items or search with key words. It is a pretty standard search engine, but it has one major problem. The drilling down aspect does fill dated and half the time pictures and content would not load. The loading time is extraordinary long too. Plus, due to the variety of their clients, items offered on Swagiggle are very random. Swagiggle needs tofu the broken pictures and figure out how to make itself faster.

Whitney Grace, April 1, 2020

« Previous PageNext Page »

  • Archives

  • Recent Posts

  • Meta