Linguistics: Becoming Useful to Regular People?

January 8, 2020

Now here is the linguistic reference app I have been waiting for: IDEA’s “In Other Words.” Finally, an online resource breaks the limiting patterns left over from book-based resources like traditional dictionaries and thesauri. The app puts definitions into context by supplying real-world examples from both fiction and nonfiction works of note from the 20th and 21st centuries. It also lets users explore several types of linguistic connections. Not surprisingly, this thoroughly modern approach leverages a combination of artificial and human intelligence. Here is how they did it:

“Building on the excellent definitions written by the crowd-sourced editors at Wiktionary, IDEA’s lexicographic team wrote more than 2,700 short, digestible definitions for all common words, including ‘who,’ ‘what,’ and ‘the.’ For over 100k other words that also have Wikipedia entries, we included a snippet of the article as well. To power the app, our team created the IDEA Linguabase, a database of word relationships built on an analysis of various published and open source dictionaries and thesauri, an artificial intelligence analysis of a large corpus of published content, and original lexicographic work. Our app offers relationships for over 300,000 terms and presents over 60 million interrelationships. These include close relationships, such as synonyms, as well as broader associations and thousands of interesting lists, such as types of balls, types of insects, words for nausea, and kinds of needlework. Additionally, the app has extensive information on word families (e.g., ‘jump,’ ‘jumping’) and common usage (‘beautiful woman’ vs. ‘handsome man’), revealing words that commonly appear before or after a word in real use. In Other Words goes beyond the traditional reference text by allowing users to explore interesting facts about words and wordplay, such as common letter patterns and phonetics/rhymes.”
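The "words that commonly appear before or after a word" idea can be illustrated with a few lines of Python. This is only a minimal sketch of neighbor counting, not IDEA's method; the function name `neighbors` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def neighbors(text, target):
    """Count the words found immediately before and after `target`."""
    # Lowercase and split into simple word tokens (letters and apostrophes).
    words = re.findall(r"[a-z']+", text.lower())
    before, after = Counter(), Counter()
    for i, word in enumerate(words):
        if word != target:
            continue
        if i > 0:
            before[words[i - 1]] += 1
        if i + 1 < len(words):
            after[words[i + 1]] += 1
    return before, after


# Hypothetical sample text, invented for this sketch.
sample = "A handsome man met a beautiful woman. The handsome man smiled."
before, after = neighbors(sample, "man")
print(before.most_common(3))
print(after.most_common(3))
```

A real usage database would run this kind of count over a very large corpus and then rank the neighbors by frequency or by a statistical association score.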

The team has endeavored to give us an uncluttered, intuitive UI that makes it quick to look up a word and easy to follow a chain of meanings and associations. Users can also save and share what they have found across devices. Be warned, though—In Other Words does not shy away from salty language; it even points out terms that were neutral in one time period and naughty in another. (They will offer a sanitized version for families and schools.) They say the beta version is coming soon and will be priced at $4.99, or $25 with a custom tutorial. We look forward to it.

Cynthia Murrell, January 8, 2020

Megaputer Spans Text Analysis Disciplines

January 6, 2020

What exactly do we mean by “text analysis”? That depends entirely on the context. Megaputer shares a useful list of the most popular types in its post, “What’s in a Text Analysis Tool?” The introduction explains:

“If you ask five different people, ‘What does a Text Analysis tool do?’, it is very likely you will get five different responses. The term Text Analysis is used to cover a broad range of tasks that include identifying important information in text: from a low, structural level to more complicated, high-level concepts. Included in this very broad category are also tools that convert audio to text and perform Optical Character Recognition (OCR); however, the focus of these tools is on the input, rather than the core tasks of text analysis. Text Analysis tools not only perform different tasks, but they are also targeted to different user bases. For example, the needs of a researcher studying the reactions of people on Twitter during election debates may require different Text Analysis tasks than those of a healthcare specialist creating a model for the prediction of sepsis in medical records. Additionally, some of these tools require the user to have knowledge of a programming language like Python or Java, whereas other platforms offer a Graphical User Interface.”

The list begins with two of the basics: Part-of-Speech (POS) Taggers and Syntactic Parsing. These tasks usually underpin more complex analysis. Concordance or Keyword tools create alphabetical lists of a text’s words and put them into context. Text Annotation Tools, either manual or automated, tag parts of a text according to a designated schema or categorization model, while Entity Recognition Tools often use knowledge graphs to identify people, organizations, and locations. Topic Identification and Modeling Tools derive emerging themes or high-level subjects using text-clustering methods. Sentiment Analysis Tools diagnose positive and negative sentiments, some with more refinement than others. Query Search Tools let users search text for a word or a phrase, while Summarization Tools pick out and present key points from lengthy texts (provided they are well organized). See the article for more on any of these categories.
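To make the concordance idea concrete, here is a rough keyword-in-context sketch in Python. It is an illustration only, not how Megaputer or any particular tool works; the names `concordance` and `word_index` and the sample sentence are assumptions for this example.

```python
import re


def word_index(text):
    """Return the text's distinct words as an alphabetical list."""
    return sorted({token.lower() for token in re.findall(r"\w+", text)})


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text, invented for this sketch.
sample = "Text analysis tools differ. Some tools tag text; other tools search text."
print(word_index(sample))
for left, word, right in concordance(sample, "tools", width=2):
    print(f"{left:>15} [{word}] {right}")
```

Production tools add tokenization that respects punctuation and language, but the core of a concordance is this pairing of each word with its surrounding context.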

The post concludes by noting that most text analysis platforms offer one or two of the above functions, but that users often require more than that. This is where the article shows its PR roots—Megaputer, as it happens, offers just such an all-in-one platform called PolyAnalyst. Still, the write-up is a handy rundown of some different text-analysis tasks.

Based in Bloomington, Indiana, Megaputer launched in 1997. The company grew out of AI research from Moscow State University and Bauman Technical University. Its many prominent clients include HP, Johnson & Johnson, American Express, and several US government offices.

Cynthia Murrell, January 02, 2020

Cambridge Analytica: Maybe a New Name and Some of the Old Methods?

December 29, 2019

DarkCyber spotted an interesting factoid in “HH Plans to Work with the Re-Branded Cambridge Analytica to Influence 2021 Elections.”

The new company, Auspex International, will keep former Cambridge Analytica director Mark Turnbull at the helm.

Who is HH? He is President Hakainde Hichilema, serving at this time in Zambia.

The business focus of Auspex is, according to the write up:

We’re not a data company, we’re not a political consultancy, we’re not a research company and we’re not necessarily just a communications company. We’re a combination of all four. (Ahmad Al-Khatib, a Cairo-born investor)

You can obtain some information about Auspex at this url: https://www.auspex.ai/.

DarkCyber noted the use of the “ai” domain. See the firm’s “What We Believe” information at this link. It is good to have a reason to get out of bed in the morning.

Stephen E Arnold, December 29, 2019

Insight from a Microsoft Professional: Susan Dumais

December 1, 2019

Dr. Susan Dumais is a Microsoft Technical Fellow and Deputy Lab Director of MSR AI. She knows that search has evolved from discovering information to getting tasks done. To accomplish tasks, search queries are fundamental, and they are rooted in people’s information needs. The Microsoft Research Podcast interviewed Dr. Dumais in the episode, “HCI, IR, And The Search For Better Search With Dr. Susan Dumais.”

Dr. Dumais shared that most of her work on search stems from frustrations she encountered in her own life. These included trouble learning the Unix OS and vast amounts of spam. At the beginning of the podcast, she runs down the history of search and how it has changed in the past twenty years. Search has become more intuitive, especially given the work Dr. Dumais did on providing context to search.

“Host: Context in anything makes a difference with language and this is integrally linked to the idea of personalization, which is a buzz word in almost every area of computer science research these days: how can we give people a “valet service” experience with their technical devices and systems? So, tell us about the technical approaches you’ve taken on context in search, and how they’ve enabled machines to better recognize or understand the rich contextual signals, as you call them, that can help humans improve their access to information?

Susan Dumais: If you take a step back and consider what a web search engine is, it’s incredibly difficult to understand what somebody is looking for given, typically, two to three words. These two to three words appear in a search box and what you try to do is match those words against billions of documents. That’s a really daunting challenge. That challenge becomes a little easier if you can understand things about where the query is coming from. It doesn’t fall from the sky, right? It’s issued by a real live human being. They have searched for things in the longer term, maybe more acutely in the current session. It’s situated in a particular location in time. All of those signals are what we call context that help understand why somebody might be searching and, more importantly, what you might do to help them, what they might mean by that. You know, again, it’s much easier to understand queries if you have a little bit of context about it.”

Dr. Dumais has a practical approach to making search work for the average user. It is the accumulation of everyday tasks that shapes search and its functionality. She represents an enlightened technical expert who understands the perspective of the end user.

Whitney Grace, November 30, 2019

Google Trends Used to Reveal Misspelled Wirds or Is It Words?

November 25, 2019

We spotted a listing of the most misspelled words in each of the USA’s 50 states. Too bad Puerto Rico. Kentucky’s most misspelled word is “ninety.” Navigate to Considerable and learn what residents cannot spell. How often? Silly kweston.

The listing includes some bafflers and may reveal what can go wrong with data from an online ad sales data collection system; for example:

  • Washington, DC (which is not a state in DarkCyber’s book) cannot spell “enough”; for example, “enuf already with these televised hearings and talking heads”
  • Idaho residents cannot spell embarrassed, which as listeners to Kara Swisher know has two r’s and two s’s. Helpful that.
  • Montana residents cannot spell “comma.” Do those in Montana use commas?
  • And not surprisingly, those in Tennessee cannot spell “intelligent.” Imagine that!

What happens if one trains smart software on these data?

Sumthink mite go awf the railz.

Stephen E Arnold, November 25, 2019

Info Extraction: Improving?

November 21, 2019

Information extraction (IE) is key to machine learning and artificial intelligence (AI), especially for natural language processing (NLP). The problem is that information pulled from datasets often lacks context, so IE systems fail to properly categorize and rationalize the data. Good Men Project shares some hopeful news for IE in the article, “Measuring Without Labels: A Different Approach To Information Extraction.”

Current IE relies on an AI programmed with a specific schema that states what information needs to be extracted. A retail Web site like Amazon probably uses an IE AI programmed to extract product names, UPCs, and prices, while a travel Web site like Kayak uses an IE AI to find prices, airlines, dates, and hotel names. For law enforcement officials, it is particularly difficult to design a schema for human trafficking, because datasets on that subject do not exist. Also, traditional approaches such as crowdsourcing do not work because of the sensitivity of the subject.

In order to create a reliable human trafficking dataset and prove its worth, the researchers rely on dependencies between IE extractions. A dependency works as follows:

“Consider the network illustrated in the figure above. In this kind of network, called attribute extraction network (AEN), we model each document as a node. An edge exists between two nodes if their underlying documents share an extraction (in this case, names). For example, documents D1 and D2 are connected by an edge because they share the extraction ‘Mayank.’ Note that constructing the AEN only requires the output of an IE, not a gold standard set of labels. Our primary hypothesis in the article was that, by measuring network-theoretic properties (like the degree distribution, connectivity etc.) of the AEN, correlations would emerge between these properties and IE performance metrics like precision and recall, which require a sufficiently large gold standard set of IE labels to compute. The intuition is that IE noise is not random noise, and that the non-random nature of IE noise will show up in the network metrics. Why is IE noise non-random? We believe that it is due to ambiguity in the real world over some terms, but not others.”
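The attribute extraction network described in the quote can be sketched in plain Python. This is an illustrative reading of the passage, not the researchers' code: the helper names are hypothetical, and apart from the shared name "Mayank" linking D1 and D2, the documents and extractions below are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_aen(extractions):
    """Build an AEN: documents are nodes; an edge joins two documents
    that share at least one extracted value."""
    docs_by_value = defaultdict(set)
    for doc, values in extractions.items():
        for value in values:
            docs_by_value[value].add(doc)
    graph = {doc: set() for doc in extractions}
    for docs in docs_by_value.values():
        for a, b in combinations(sorted(docs), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_distribution(graph):
    """Map each degree to the number of nodes that have it."""
    return Counter(len(nbrs) for nbrs in graph.values())


def count_components(graph):
    """Count connected components with an iterative depth-first search."""
    seen, components = set(), 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node] - seen)
    return components


# Hypothetical IE output: document id -> extracted names.
extractions = {
    "D1": {"Mayank"},
    "D2": {"Mayank", "Alice"},
    "D3": {"Alice"},
    "D4": {"Bob"},
}
aen = build_aen(extractions)
print(degree_distribution(aen))
print(count_components(aen))
```

Note that nothing here needs gold-standard labels. The graph is built from the extractor's output alone, which is the point of the approach; relating these network properties to precision and recall is the part the researchers had to study empirically.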

Using names, phone numbers, and locations as attributes, the researchers discovered correlations. Exploiting the dependencies in an IE system’s output creates a new methodology for evaluating it. Network science relies on non-abstract interactions to test IE, but the AEN is an abstract network of IE interactions. The mistakes, in fact, allow law enforcement to use IE AI to acquire the desired information without having a practice dataset.

Whitney Grace, November 21, 2019

Palantir and Sompo: Is a $150 Million Deal Big Enough, Too Small, or Just Right?

November 19, 2019

Palantir Technologies has ingested about $2 billion in a couple of dozen investment rounds. Now a $150 million deal is very important to a services firm with a few million in sales. To an outfit like Booz, Allen or Deloitte, $150 million means a partner will keep her job and a handful of MBAs will be making regular flights to wonderful Narita.

“Thiel Marks Palantir’s Asia Push with $150 Million Japan Venture” reports that Sompo Holdings is now Palantir’s partner, noting that the $150 million may be more of an investment. We noted this passage:

The billionaire entrepreneur [Peter Thiel] was in Japan Monday to unveil a $150 million, 50-50 joint venture with local financial services firm Sompo Holdings Inc., Palantir Technologies Japan Co. The new company will target government and public sector customers, emphasizing health and cybersecurity initially. Like IBM Corp. and other providers, Palantir’s software pulls together a range of data provided by its customers, mining it for patterns and displaying connections in easy-to-read spider web-like graphics that might otherwise get overlooked.

Bloomberg reported:

Palantir is very close to breaking even and will end 2019 either slightly in the black or slightly in the red, Thiel said at the briefing. The company will be “significantly in the black” next year, he added.

A few comments from the DarkCyber team:

  • The money in the headline is not explained in much detail. There is a difference between setting up a new company and landing a cash deal.
  • Bloomberg seems indifferent to the revenue challenge Palantir faces; namely, there are quite a few investors and stakeholders who want their money plus interest. The announcement may not put these individuals’ minds at ease.
  • The news story does not mention that new, more agile companies are introducing solutions which make both IBM Analyst’s Notebook and Gotham look a bit like Vinnie Testaverde or Bart Starr throwing passes at a barbecue.

Singapore is the location of choice for some of the more agile intelware and policeware vendors. Is Japan a bit 2003?

To sum up, Palantir is to some a start up. To others Palantir is an example of a company that may lose out to upstarts which offer a more intuitive user interface and slicker data analytics. It is possible that an outfit like Amazon and its whiz bang data market place could deliver a painful blow to a firm which opened for business in 2003. That’s more than 15 years ago. But next year? Palantir will be profitable.

Stephen E Arnold, November 19, 2019

Simple English Is The Basis For Complex Data Visualizations

November 7, 2019

Computers started to gain a foothold in modern society during the 1980s. By today’s standards, the physical size of old-school computers and the amount of data they could process are laughable. Tech Explore reports on how spoken English can actually create complex, data-rich visualizations, something that was only imaginable in the 1980s, in the article, “Study Leads To A System That Lets People Use Simple English To Create Complex Machine Learning-Driven Visualizations.”

Today’s technology collects terabytes of diverse information from traffic patterns, weather patterns, disease outbreaks, animal migrations, financial trends, and human behavior models. The problem is that the people who could benefit from this data do not know how to make visualization models.

Professor Claudio Silva led a team at the New York University Tandon School of Engineering’s Visualization and Data Analytics (VIDA) lab that developed VisFlow, a framework that allows non-data experts to create flexible and graphics-rich data visualization models. These models will also be easy to edit with an extension called FlowSense. FlowSense will allow users to edit and synthesize data exploration pipelines with an NLP interface.

VIDA is one of the leading research centers on data visualizations, and FlowSense is already being used in astronomy, medicine, and climate research:

  • “OpenSpace, a System for Astrographics is being used worldwide in planetariums, museums, and other contexts to explore the solar system and universe
  • Motion Browser: Visualizing and Understanding Complex Upper Limb Movement under Obstetrical Brachial Plexus Injuries is a collaboration between computer scientists, orthopedic surgeons, and rehabilitation physicians that could lead to new treatments for brachial nerve injuries and hypotheses for future research
  • The Effect of Color Scales on Climate Scientists’ Objective and Subjective Performance in Spatial Data Analysis Tasks is a web-based user study that takes a close look at the efficacy of the widely used practice of superimposing color scales on geographic maps.”

FlowSense and VisFlow are open source frameworks available on GitHub, and programmers are welcome to experiment with them. These applications allow non-data experts to manipulate data for their fields, take advantage of technology, and augment their current work.

Whitney Grace, November 7, 2019

False News: Are Smart Bots the Answer?

November 7, 2019

To us, this comes as no surprise—Axios reports, “Machine Learning Can’t Flag False News, New Studies Show.” Writer Joe Uchill concisely summarizes some recent studies out of MIT that should quell any hope that machine learning will save us from fake news, at least any time soon. Though we have seen that AI can be great at generating readable articles from a few bits of info, mimicking human writers, and even detecting AI-generated stories, that does not mean they can tell the true from the false. These studies were performed by MIT doctoral student Tal Schuster and his team of researchers. Uchill writes:

“Many automated fact-checking systems are trained using a database of true statements called Fact Extraction and Verification (FEVER). In one study, Schuster and team showed that machine learning-taught fact-checking systems struggled to handle negative statements (‘Greg never said his car wasn’t blue’) even when they would know the positive statement was true (‘Greg says his car is blue’). The problem, say the researchers, is that the database is filled with human bias. The people who created FEVER tended to write their false entries as negative statements and their true statements as positive statements — so the computers learned to rate sentences with negative statements as false. That means the systems were solving a much easier problem than detecting fake news. ‘If you create for yourself an easy target, you can win at that target,’ said MIT professor Regina Barzilay. ‘But it still doesn’t bring you any closer to separating fake news from real news.’”

Indeed. Another of Schuster’s studies demonstrates that algorithms can usually detect text written by their kin. We’re reminded, however, that just because an article is machine written does not in itself mean it is false. In fact, he notes, text bots are now being used to adapt legit stories to different audiences or to generate articles from statistics. It looks like we will just have to keep verifying articles with multiple trusted sources before we believe them. Imagine that.

Cynthia Murrell, November 7, 2019

Visual Data Exploration via Natural Language

November 4, 2019

New York University announced a natural language interface for data visualization. You can read the rah rah from the university here. The main idea is that a person can use simple English to create complex machine learning based visualizations. Sounds like the answer to a Wall Street analyst’s prayers.

The university reported:

A team at the NYU Tandon School of Engineering’s Visualization and Data Analytics (VIDA) lab, led by Claudio Silva, professor in the department of computer science and engineering, developed a framework called VisFlow, by which those who may not be experts in machine learning can create highly flexible data visualizations from almost any data. Furthermore, the team made it easier and more intuitive to edit these models by developing an extension of VisFlow called FlowSense, which allows users to synthesize data exploration pipelines through a natural language interface.

You can download (as of November 3, 2019, but no promises the document will be online after this date) “FlowSense: A Natural Language Interface for Visual Data Exploration within a Dataflow System.”

DarkCyber wants to point out that talking to a computer to get information continues to be of interest to many researchers. Will this innovation put human analysts out of their jobs?

Maybe not tomorrow but in the future. Absolutely. And what will those newly-unemployed people do for money?

Interesting question and one some may find difficult to consider at this time.

Stephen E Arnold, November 4, 2019

 
