Natural Language Processing: Useful Papers Selected by an Informed Human

July 28, 2020

Nope, no artificial intelligence involved in this curated list of papers from a recent natural language conference. Ten papers are available with a mouse click. Quick takeaway: Adversarial methods seem to be a hot ticket. Navigate to “The Ten Must Read NLP/NLU Papers from the ICLR 2020 Conference.” Useful editorial effort and a clear, adult presentation of the bibliographic information. Kudos to jakubczakon.

Stephen E Arnold, July 27, 2020

Jargon Alert: Direct from the Video Game Universe

July 22, 2020

I scanned a write up called “Who Will Win the Epic Battle for Online Meeting Hegemony?” The write up was a rah rah for Microsoft because, you know, it’s Microsoft.

Stepping away from the “epic battle,” the write up contained a word from the video game universe. (It’s a fine place: Courteous, diverse, and welcoming.)

The word is “upleveled” and it was used in this way:

Upleveled security and encryption. Remote work sites, especially home offices, have become a prime target for a surge in cybersecurity attacks due to their less hardened and secure nature.

A “level” in a game gave rise to the phrase “level up,” which communicates that one has moved from loser level 2 to almost-normal level 3. That jump is a “level up.”

Now the phrase has been flipped into a verb and an adjective, as in “upleveled.”

DarkCyber believes that the phrase will be applied in this way:

That AI program upleveled its accuracy.

Oh, and the article: Go Microsoft Teams. It’s an elephant, and one knows what elephants do. If you are near an elephant, uplevel your rubber boots. Will natural language processing get the drift?

Stephen E Arnold, July 22, 2020

NLP with an SEO Spin

July 8, 2020

If you want to know how search engine optimization has kicked librarians and professional indexers in the knee and stomped on their writing hand, you will enjoy “Classifying 200,000 Articles in 7 Hours Using NLP.” The write up makes clear that human indexers are going to become the lamp lighters of the 21st century. Imagine. No libraries, no subject matter experts curating and indexing content, no human judgment. Nifty. Perfect for a post-Quibi world.

The write up explains the indexing methods of one type of smart software. The passages below highlight the main features of the method:

Weak supervision: the human annotator explains their chosen label to the AI model by highlighting the key phrases in the example that helped them make the decision. These highlights are then used to automatically generate nuanced rules, which are combined and used to augment the training dataset and boost the model’s quality.

Uncertainty sampling: it finds those examples for which the model is most uncertain, and suggests them for human review.

Diversity sampling: it helps make sure that the dataset covers as diverse a set of data as possible. This ensures the model learns to handle all of the real-world cases.

Guided learning: it allows you to search through your dataset for key examples. This is particularly useful when the original dataset is very imbalanced (it contains very few examples of the category you care about).

These phrases may not be clear. Allow me to elucidate:

  • Weak supervision. Subject matter experts riding herd. No way. Inefficient and not optimizable.
  • Uncertainty sampling means a “fudge factor” or “fuzzifying.” A metaphor might be “close enough for horse shoes.” (A toy sketch follows this list.)
  • Guided learning. Yep, manual assembly of training data, recalibration, and more training until the horse shoe thing scores a point.
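
DarkCyber is not privy to the vendor’s code, but the uncertainty sampling idea fits in a few lines. Here is a minimal, hypothetical sketch, assuming scikit-learn and invented example texts; it is not the write up’s implementation:

# Hypothetical uncertainty sampling sketch; not the write up's implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data: a few labeled examples and an unlabeled pool.
labeled_texts = ["invoice overdue", "meeting at noon", "lawsuit filed"]
labels = ["finance", "scheduling", "legal"]
unlabeled_texts = ["payment reminder", "court hearing scheduled", "lunch plans"]

vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(labeled_texts)
X_unlabeled = vectorizer.transform(unlabeled_texts)

model = LogisticRegression(max_iter=1000).fit(X_labeled, labels)

# Uncertainty = 1 minus the probability of the most likely class.
# The most uncertain items are the ones suggested for human review.
probs = model.predict_proba(X_unlabeled)
uncertainty = 1.0 - probs.max(axis=1)
for idx in np.argsort(-uncertainty):
    print(f"{unlabeled_texts[idx]!r} -> uncertainty {uncertainty[idx]:.2f}")

The machine asks the human only about the items it cannot sort out on its own. Whether the human agrees to ride herd is another matter.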

The write up undermines its good qualities with a reference to Google. Has anyone noticed that Google’s first page of results for most of my queries consists of advertisements?

NLP and horse shoes. Perfect match. Why are the index and classification codes not those which an educated person would find understandable and at hand? Forget answering this question. Just remember good enough and close enough for horse shoes. Clang and ka-ching as another ad sucks in a bidder.

Stephen E Arnold, July 8, 2020

Another Low Profile, Specialized Services Firm Goes for Mad Ave Marketing

April 25, 2020

Investigative software firm ShadowDragon looks beyond traditional cyber-attacks in its latest podcast, “Cyber Cyber Bang Bang—Attacks Exploiting Risks Within the Physical and Cyber Universe.” The four-and-a-half-minute podcast is the fourth in a series that was launched on April second. The description tells us:

“Truly Advanced Persistent attacks where physical exploitation and even death are rarely discussed. We cover some of this along with security within the Healthcare and Government space. Security Within Healthcare and government is always hard. Tensions between information security and the business make this harder. Hospitals hit in fall of 2019 had a taste of exploitation. Similarly, state governments have had issues with cartel related attackers. CISO’s that enable assessment, and security design around systems that cannot be fully hardened can kill two birds with one stone. Weighing authority versus influence, FDA approved equipment, 0day discovery within applications. Designing security around systems is a must when unpatchable vulnerabilities exist.”

Hosts Daniel Clemens and Brian Dykstra begin by answering some questions from the previous podcast, then catch up on industry developments. They get into security challenges for hospitals and government agencies not quite halfway through.

A company of fewer than 50 workers, ShadowDragon keeps a low profile. Created “by investigators for investigators,” its cyber security tools include AliasDB, MalNet, OIMonitor, SocialNet, and Spotter. The firm also supports its clients with training, integration, conversion, and customization. ShadowDragon was launched in 2015 and is based in Cheyenne, Wyoming.

Cynthia Murrell, April 13, 2020

Linguistic Insight: Move Over, Parrots

February 7, 2020

DarkCyber noted an item sure to be of interest to the linguists laboring in the world of chat bots, NLP, and inference. “Penguins Follow Same Linguistic Patterns As Humans, Study Finds” states:

Words more frequently used by the animals are briefer, and longer words are composed of extra but briefer syllables, researchers say.

The write up also reveals:

Information compression is a general principle of human language.

Yep. Penguins better than parrots? Well, messier for sure.

Stephen E Arnold, February 7, 2020

Lexalytics: The RPA Market

December 12, 2019

RPA is an acronym which was new to the DarkCyber team. A bit of investigation pointed us to “Adding New NLP Capabilities for RPA: Build or Buy” plus other stories published by Lexalytics. This firm provides a sentiment analysis system. The idea is that smart software can figure out the emotional payload of a content object. Some sentiment analysis systems just use word lists: if an email arrives containing the word “sue,” the software flags the message and, in theory, a human looks at it. Other systems use a range of numerical recipes to figure out whether a message contains an emotional payload.
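
A word list approach is not much code. Here is a minimal, hypothetical sketch; the trigger words and example messages are invented, and this is not Lexalytics’ method:

# Hypothetical word-list flagging; trigger words are invented, not Lexalytics' method.
TRIGGER_WORDS = {"sue", "lawsuit", "refund", "angry", "terrible"}

def flag_for_review(message: str) -> bool:
    """Flag a message for human review if it contains any trigger word."""
    words = {token.strip(".,!?\"'").lower() for token in message.split()}
    return bool(words & TRIGGER_WORDS)

print(flag_for_review("I will sue your company."))    # True
print(flag_for_review("Thanks for the quick reply.")) # False

The fancier numerical recipes replace the fixed word list with learned weights, but the plumbing is the same: score the text, flag it, and hope a human checks.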

Now RPA.

The idea is that robotic process automation is becoming more important. The vendors of RPA have to be aware that natural language processing related to text analytics is also increasing in importance. You can read about RPA on the Lexalytics blog at this link.

The jargon caught our attention. After a bit of discussion over lunch on December 5, 2019, we decided that RPA is a new term for workflows that are scripted and hopefully intelligent.

Now you know. RPA, workflow, not IPA.

Stephen E Arnold, December 12, 2019

Parsing Documents: A Shift to Small Data

November 14, 2019

DarkCyber spotted “Eigen Nabs $37M to Help Banks and Others Parse Huge Documents Using Natural Language and Small Data.” The folks chasing the enterprise search pot of gold may need to pay attention to figuring out specific problems. Eigen uses search technology to identify the important items in long documents. The idea is “small data.”

The write up reports:

The basic idea behind Eigen is that it focuses what co-founder and CEO Lewis Liu describes as “small data”. The company has devised a way to “teach” an AI to read a specific kind of document — say, a loan contract — by looking at a couple of examples and training on these. The whole process is relatively easy to do for a non-technical person: you figure out what you want to look for and analyze, find the examples using basic search in two or three documents, and create the template which can then be used across hundreds or thousands of the same kind of documents (in this case, a loan contract).

Interesting, but the approach seems similar to identifying several passages in a text and submitting them to a search engine. This used to be called “more like this.” But today? Small data.
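
For the curious, the old “more like this” trick is a few lines of code. A bare-bones, hypothetical sketch using TF-IDF cosine similarity over invented passages; this is not Eigen’s technology:

# Hypothetical "more like this": rank documents by similarity to example passages.
# Invented example data; not Eigen's technology.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The borrower shall repay the loan in monthly installments.",
    "Quarterly earnings exceeded analyst expectations.",
    "Interest accrues at a fixed annual rate on the outstanding principal.",
]
example_passages = ["loan repayment terms", "interest rate on principal"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([" ".join(example_passages)])

# Rank every document against the combined example passages.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.2f}  {doc}")

Pick a couple of passages, turn them into a query, and rank the rest of the pile. Call it small data if that helps the pitch deck.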

With the cloud coming back on premises and big data becoming user identified small data, what’s next? Boolean queries?

DarkCyber hopes so.

Stephen E Arnold, November 14, 2019

Visual Data Exploration via Natural Language

November 4, 2019

New York University announced a natural language interface for data visualization. You can read the rah rah from the university here. The main idea is that a person can use simple English to create complex machine learning based visualizations. Sounds like the answer to a Wall Street analyst’s prayers.

The university reported:

A team at the NYU Tandon School of Engineering’s Visualization and Data Analytics (VIDA) lab, led by Claudio Silva, professor in the department of computer science and engineering, developed a framework called VisFlow, by which those who may not be experts in machine learning can create highly flexible data visualizations from almost any data. Furthermore, the team made it easier and more intuitive to edit these models by developing an extension of VisFlow called FlowSense, which allows users to synthesize data exploration pipelines through a natural language interface.

You can download (as of November 3, 2019, but no promises the document will be online after this date) “FlowSense: A Natural Language Interface for Visual Data Exploration within a Dataflow System.”

DarkCyber wants to point out that talking to a computer to get information continues to be of interest to many researchers. Will this innovation put human analysts out of their jobs?

Maybe not tomorrow but in the future. Absolutely. And what will those newly-unemployed people do for money?

Interesting question and one some may find difficult to consider at this time.

Stephen E Arnold, November 4, 2019


Sentiment Analysis: Still Ticking Despite Some Lickings

October 29, 2019

Sentiment analysis is a “function” that tries to identify the emotional payload of an object, typically text. Sentiment analysis of images, audio, and video is “just around the corner,” just like quantum computing and keeping Windows 10 updates from killing some computers.

“The Best Sentiment Analysis Tools of 2019” provides a list of go-to vendors. Like most lists, this one omits some options; for example, Algeion and Vader. The list was compiled by MonkeyLearn, which is number one on the list. There are some surprises; for example, IBM Watson.

Stephen E Arnold, October 29, 2019

Real Life Q and A for Information Access Allegedly Arrives

October 14, 2019

DarkCyber noted “Promethium Tool Taps Natural Language Processing for Analytics.” The write up, which may be marketing oriented, asserts:

software, called Data Navigation System, was designed to enable non-technical users to make complex SQL requests using plain human language and ease the delivery of data.

The company developing the system, Promethium, founded in 2018, may have delivered what users have long wanted: ask the computer a question and get a usable, actionable answer. If the write up is accurate, Promethium has achieved with $2.5 million in funding a function that many firms have pursued.

The article reports:

After users ask a question, Promethium locates the data, demonstrates how it should be assembled, automatically generates the SQL statement to get the correct data and executes the query. The queries run across all databases, data lakes and warehouses to draw actionable knowledge from multiple data sources. Simultaneously, Promethium ensures that data is complete while identifying duplications and providing lineage to confirm insights. Data Navigation System is offered as SaaS in the public cloud, in the customer’s virtual private cloud or as an on-premises option.
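
The write up does not say how the SQL gets generated. Purely for flavor, a toy template approach (the table, columns, and patterns below are invented, and this is not how Data Navigation System works) might look like this:

# Toy natural-language-to-SQL translation via templates.
# Invented patterns and schema; not how Promethium's Data Navigation System works.
import re

PATTERNS = [
    (re.compile(r"average (\w+) by (\w+)", re.IGNORECASE),
     "SELECT {1}, AVG({0}) FROM sales GROUP BY {1};"),
    (re.compile(r"total (\w+) in (\w+)", re.IGNORECASE),
     "SELECT SUM({0}) FROM sales WHERE region = '{1}';"),
]

def question_to_sql(question: str) -> str:
    """Match a question against known templates and fill in the SQL."""
    for pattern, template in PATTERNS:
        match = pattern.search(question)
        if match:
            return template.format(*match.groups())
    return "-- no template matched; a real system would need semantic parsing"

print(question_to_sql("What is the average revenue by quarter?"))
print(question_to_sql("Show the total units in Kentucky"))

A real product has to locate the tables, resolve the column names, and handle questions that fit no template, which is presumably where the $2.5 million goes.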

More information is available at the firm’s Web site.

Stephen E Arnold, October 14, 2019
