Nosing Beyond the Machine Learning from Human Curated Data Sets: Autonomy 1996 to Smart Software 2019
April 24, 2019
How does one teach a smart indexing system like Autonomy’s 1996 “neurodynamic” system?* Subject matter experts (SMEs) assembled a training collection of textual information. The articles and other content would replicate the characteristics of the content the Autonomy system would process; that is, index and make searchable or analyzable. The work was important. Get the training data wrong, and the indexing system would assign metadata or “index terms” and “category names” which could cause a query to generate results the user would perceive as incorrect.
How would a licensee adjust the Autonomy “black box”? (Think of my reference to Autonomy and search as a way of approaching “smart software” and “artificial intelligence.”)
The method was to perform re-training. The approach was practical, and for most content domains, the re-training worked. It was an iterative process. Because the words in the corpus fed into the “black box” included new words, concepts, bound phrases, entities, and key sequences, several functions were integrated into the basic Autonomy system as it matured. Examples included support for term lists (controlled vocabularies) and dictionaries.
The combination of re-training and external content available to the system allowed Autonomy to deliver useful outputs.
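The train-and-re-train cycle can be sketched with a toy naive Bayes categorizer. The categories, sample documents, and update step below are invented for illustration; they are not Autonomy's actual implementation, only a minimal Bayesian stand-in.

```python
import math
from collections import defaultdict

class ToyBayesIndexer:
    """Toy naive Bayes categorizer illustrating the train / re-train cycle."""

    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.cat_counts = defaultdict(int)
        self.vocab = set()

    def train(self, documents):
        # Each document is (text, category). Calling train() again with
        # fresh SME-curated samples is the iterative "re-training" step.
        for text, category in documents:
            self.cat_counts[category] += 1
            for word in text.lower().split():
                self.word_counts[category][word] += 1
                self.vocab.add(word)

    def categorize(self, text):
        # Pick the category with the highest posterior log probability,
        # using add-one smoothing so unseen words do not zero out a score.
        total_docs = sum(self.cat_counts.values())
        best, best_score = None, float("-inf")
        for cat, doc_count in self.cat_counts.items():
            score = math.log(doc_count / total_docs)
            cat_total = sum(self.word_counts[cat].values())
            for word in text.lower().split():
                score += math.log((self.word_counts[cat][word] + 1)
                                  / (cat_total + len(self.vocab)))
            if score > best_score:
                best, best_score = cat, score
        return best

indexer = ToyBayesIndexer()
indexer.train([("mergers acquisitions stock", "finance"),
               ("court ruling appeal", "legal")])
print(indexer.categorize("stock price after mergers"))  # finance
# Vocabulary drifts; SMEs supply fresh labeled samples and re-train:
indexer.train([("ipo valuation stock", "finance")])
```

A controlled vocabulary or dictionary would slot in as an extra lookup layer before scoring; the maintenance burden described below is exactly the cost of keeping those samples and term lists current.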
Where the optimal results departed from real-world results usually boiled down to several factors, often working in concert. First, licensees did not want to pay for re-training. Second, maintenance of the external dictionaries was necessary because new entities arrive with reasonable frequency. Third, testing and organizing the refreshed training sets, plus the editorial work required to keep dictionaries shipshape, were too expensive, time consuming, and tedious.
Not surprisingly, some licensees grew unhappy with their Autonomy IDOL (integrated data operating layer) system. That, in my opinion, was not Autonomy’s fault. Autonomy explained in the presentations I heard what was required to get a system up and running and outputting results that could easily hit 80 percent or higher on precision and recall tests.
The Autonomy approach is widely used. In fact, wherever there is a Bayesian system in use, there is the demand for training, re-training, and external knowledge bases. I just took a look at Haystax Constellation. It’s Bayesian, and Haystax makes it clear that the “model” has to be trained. So what’s changed between 1996 and 2019 with regard to Bayesian methods?
Nothing. Zip. Zero.
Who Is Assisting China in Its Technology Push?
March 20, 2019
I read “U.S. Firms Are Helping Build China’s Orwellian State.” The write up is interesting because it identifies companies which allegedly provide technology to the Middle Kingdom. The article also uses an interesting phrase; that is, “tech partnerships.” Please, read the original article for the names of the US companies allegedly cooperating with China.
I want to tell a story.
Several years ago, my team was asked to prepare a report for a major US university. Our task was to try and answer what I thought was a simple question when I accepted the engagement, “Why isn’t this university’s computer science program ranked in the top ten in the US?”
The answer, my team and I learned, had zero to do with faculty, courses, or the intelligence of students. The primary reason was that the university’s graduates were returning to their “home countries.” These included China, Russia, and India, among others. In one advanced course, there was no US born, US educated student.
We documented that over a seven year period, when the undergraduates, graduate students, and post doctoral students completed their work, they had little incentive to start up companies in proximity to the university, donate to the school’s fund raising, or provide the rah rah that happy graduates often do. To see the rah rah in action, may I suggest you visit a “get together” of graduates near Stanford, an eatery in Boston, or Las Vegas during NCAA elimination weekend.
How could my client fix this problem? We were not able to offer a quick fix or even an easy fix. The university had institutionalized revenue from non US students and was, when we did the research, dependent on non US students. These students were very, very capable, and they came to the US to learn, form friendships, and sharpen their business and technical “soft” skills. These, I assume, were skills put to use to reach out to firms where a “soft” contact could be easily initiated and brought to fruition.
Follow the threads and the money.
China has been a country eager to learn in and from the US. The identification of some US firms which work with China should not be a surprise.
However, I would suggest that Foreign Policy or another investigative entity consider a slightly different approach to the topic of China’s technical capabilities. Let me offer one example. Consider this question:
What Israeli companies provide technology to China and other countries which may have some antipathy to the US?
This line of inquiry might lead to some interesting items of information; for example, a major US company which meets on a regular basis with a counterpart with what I would characterize as “close links” to the Chinese government. One colloquial way to describe the situation is like a conduit. Digging in this field of inquiry, one can learn how the Israeli company “flows” US intelligence-related technology from the US and elsewhere through an intermediary so that certain surveillance systems in China can benefit directly from what looks like technology developed in Israel.
Net net: If one wants to understand how US technology moves from the US, the subject must be examined in terms of academic programs, admissions, policies, and connections as well as from the point of view of US company investments in technologies which received funding from Chinese sources routed through entities based in Israel. Looking at a couple of firms does not do the topic justice and indeed suggests a small scale operation.
Uighur monitoring is one thread to follow. But just one.
Stephen E Arnold, March 20, 2019
Identification of Machine Generated Text: Not There Yet
March 18, 2019
“A.I. Generated Text Is Supercharging Fake News. This Is How We Fight Back” provides a rundown of projects focused on figuring out whether a sentence was written by a human or by smart software. IBM’s “visual tool” is described this way by an IBM data scientist:
“[Our current] visual tool might not be the solution to that, but it might help to create algorithms that work like spam detection algorithms,” he said. “Imagine getting emails or reading news, and a browser plug-in tells you for the current text how likely it was produced by model X or model Y.”
Okay, not there yet.
The article references XceptionNet but does not provide a link. If you want to know a bit about this approach, click this link. Interesting but not designed for text.
Net net: There is no foolproof way to determine if a chunk of content has been created:
- Entirely by a human writing to a template; for example, certain traditional news stories about a hearing or a sports score
- Entirely by software processing digital content streaming from a third party
- A combination of human and smart software.
As some individuals emerge from schools with little training in more traditional types of research and source verification, distinguishing information written by a careless or stupid human from information assembled by a semi-smart software system is likely to be difficult.
Identification of text features is tricky. Exciting opportunities exist for researchers; for example, should a search and retrieval system automatically NOT index machine generated text?
Stephen E Arnold, March 18, 2019
Text Analysis Toolkits
March 16, 2019
One of the DarkCyber team spotted a useful list, published by MonkeyLearn. Tucked into a narrative called “Text Analysis: The Only Guide You’ll Ever Need” was a list of natural language processing open source tools, programming languages, and software. Each description is accompanied by links and, in several cases, comments. See the original article for more information.
Caret
CoreNLP
Java
Keras
mlr
NLTK
OpenNLP
Python
SpaCy
Scikit-learn
TensorFlow
PyTorch
R
Weka
Stephen E Arnold, March 16, 2019
MIT Watson Widget That Allegedly Detects Machine Generated Text
March 11, 2019
The venerable IBM and the even more venerable MIT have developed a widget that allegedly detects machine generated texts. You can feed AP stories into the demo system available at this link. To keep things academic, a bogus text will have a preponderance of green highlights. Human generated texts like academic research papers have some green but more yellow, orange, and purple words. A clue for natural language generation system developers to exploit? Just a thought.
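The color coding reflects how highly a language model ranks each word in context: predictable words go green, surprising ones purple. Here is a toy sketch of that idea, using simple corpus word frequency as a stand-in for the neural language model the real demo uses; the corpus and the top_k threshold are invented.

```python
from collections import Counter

# Stand-in "language model": rank words by corpus frequency. The real
# MIT-IBM demo ranks each token by a neural language model's contextual
# probability; this tiny corpus is only for illustration.
CORPUS = ("the cat sat on the mat the dog ran to the cat "
          "a cat and a dog met near the mat").split()
RANK = {w: r for r, (w, _) in enumerate(Counter(CORPUS).most_common())}

def color_tokens(text, top_k=3):
    """Mark each token 'green' (highly ranked, predictable) or 'purple'."""
    out = []
    for tok in text.lower().split():
        rank = RANK.get(tok, len(RANK))  # unseen words rank last
        out.append((tok, "green" if rank < top_k else "purple"))
    return out

def green_fraction(text, top_k=3):
    # A high green fraction suggests text a model finds very predictable,
    # the tell the widget exploits for machine generated prose.
    colors = color_tokens(text, top_k)
    return sum(1 for _, c in colors if c == "green") / len(colors)

print(green_fraction("the cat sat on the mat"))
```

The exploitable clue mentioned above follows directly: a generator that occasionally samples low-rank words drives its green fraction down toward human-looking territory.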
Here’s the report for the text preceding this sentence. It seems that I wrote the sentence, which is semi-reassuring. I am on autopilot when dealing with smart software purporting to know whether a Twitter, Facebook, Twitch, or Discord post is generated by a human or a bot.
I deleted the digital heart because I don’t think a humanoid at either IBM or MIT generated the icon. The system does not comprehend emojis but presents one to a page visitor.
Watson, can you discern the true from the false? I have an IBM Watson ad somewhere. Perhaps I will feed its text into the system.
Stephen E Arnold, March 11, 2019
Trint Transcription
January 29, 2019
DarkCyber Annex noted “Taming a World Filled with Video and Audio, Using Transcription and AI.” The story explains a service which converts non-text information into text; stated another way, voice to text.
The seminal insight, according to ZDNet, was:
The idea was that we would align the text — the machine-generated transcript and source audio — to the spoken word and do it accurately to the millisecond, so that you could follow it like karaoke, and then we had to figure out a way to correct it. That’s where it got really interesting. What we did was, we came up with the idea of merging a text editor, like Word, to an audio-video player and creating one tool that had two very distinct functions.
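The “follow it like karaoke” effect boils down to storing a timestamp per word and looking up the word under the audio playhead. A minimal sketch, assuming a hypothetical word-level transcript format; Trint’s actual data format is not public.

```python
import bisect

# Hypothetical word-level transcript, (word, start_ms, end_ms), as an
# ASR engine might emit it; field names and timings are invented.
transcript = [
    ("taming", 0, 420),
    ("a", 420, 510),
    ("world", 510, 900),
    ("filled", 900, 1300),
    ("with", 1300, 1480),
    ("video", 1480, 2000),
]
starts = [start for _, start, _ in transcript]

def word_at(playhead_ms):
    """Return the word under the audio playhead, karaoke highlight style."""
    i = bisect.bisect_right(starts, playhead_ms) - 1
    if i < 0:
        return None
    word, start, end = transcript[i]
    return word if start <= playhead_ms < end else None

print(word_at(950))  # filled
```

Correcting a word in the text editor can leave its timestamps untouched, which is presumably how edits stay aligned “to the millisecond” without re-running the recognizer.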
Users of the service include “some of the biggest media names, such as The New York Times, ABC News, Thomson Reuters, AP, ESPN, and BBC Worldwide.”
The write up unhelpfully omits the Trint url, which is https://trint.com/.
There was no information provided about the number of languages supported, so here’s that information:
Trint currently offers language models for North American English, British English, Australian English, English – All Accents, European Spanish, European French, German, Italian, Portuguese, Russian, Polish, Finnish, Hungarian, Dutch, Danish, and Swedish.
Also, Trint is a for fee service.
One key challenge for transcription is time-linking real time streams of audio, linking text chat messages and disambiguating emojis in accompanying messages, and making sense of text displayed on the screen in a picture in picture implementation.
DarkCyber Annex does not know of a service which can deliver this type of cross linked service.
Stephen E Arnold, January 29, 2019
Automatic Text Categorization Goes Mainstream
January 10, 2019
Blogger and scaling consultant Abe Winter declares, “Automatic Categorization of Text Is a Core Tool Now.” Noting that, as of last year, companies are using automatic text categorization regularly, Winter clarifies what he is, and is not, referring to here:
“I’m talking about taking a database with short freeform text fields and automatically tagging them according to a tagged sample corpus. I’m not talking about text synthesis, anything to do with speech, automatic chat, question answering, or Alexa Skills.”
Though Winter observes the trend, he is not sure why 2018 was a tipping point. He writes:
“We’ve had some of the building blocks for this kind of text processing for decades, including the stats tools and the training corpuses. Does deep learning help? I don’t know but at minimum it helps by delivering sexy headlines that keep AI in the news, which in turn convinces business stakeholders this is something they can get behind. It wasn’t magic before and it’s not magic now; the output of these algorithms still requires some amount of quality control and manual inspection. But business leaders are now willing to admit that the old manual way of doing things also had drawbacks….”
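The task Winter describes, tagging short freeform fields against a tagged sample corpus, does not require deep learning at all. A minimal sketch using token-overlap nearest neighbor; the sample rows and tags are invented for illustration.

```python
def jaccard(a, b):
    """Token-overlap similarity between two short text fields."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Invented tagged sample corpus of short freeform database fields.
samples = [
    ("refund not received for order", "billing"),
    ("charged twice on my card", "billing"),
    ("app crashes on startup", "bug"),
    ("login page shows an error", "bug"),
]

def tag(field):
    # Nearest labeled sample wins. Real deployments add stemming,
    # smoothing, and a similarity threshold below which a record is
    # routed to the manual quality control Winter mentions.
    return max(samples, key=lambda s: jaccard(field, s[0]))[1]

print(tag("my card was charged twice"))  # billing
```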
The write-up goes on to observe that, while text categorization now works well enough for the mainstream, speech and conversation interfaces still fall short of flawless functionality. He directs our attention to this Google Duplex conversation agent demo as he alludes to some troubling trends in corporate AI deployment. He closes with a word to programmers wondering whether they should add natural language processing to their toolkits:
“The part of the question I can’t answer is how big is the job pool, how long will the bubble last and how much expertise do you need to get more money than you make now? … For myself, I’m learning the basic techniques because they feel core to my industry skill set. I’m staying open to chances to apply them and to work with experts. I’m not even at the midpoint of my career and want to stay ahead of the curve.”
It does seem that natural language processing is not about to go away any time soon.
Cynthia Murrell, January 10, 2019
Ontotext Rank
December 5, 2018
Ontotext, a text processing vendor, has posted a demonstration of its ranking technology. You can find the demos at this link. The graphic below was generated by the system on December 3, 2018, at 0900 US Eastern time. I specified the industry as information technology and the sub industry as search. Here’s what the system displayed:
A few observations:
- I specified 25 companies. The system displayed 10. I assume someone from the company will send me an email that the filters I applied did not have sufficient data to generate the desired result. Perhaps those data should be displayed?
- Neither Google Search nor Microsoft Bing appeared. Google, a search vendor, has been in the news in the countries I have visited recently.
- RightNow appeared. The company is (I thought) a unit of Oracle.
- Publishers Clearing House sells magazine subscriptions. PCH does not offer information retrieval in the sense that I understand the bound phrase.
Net net: I am not sure about the size of the data set or what the categories mean.
You need to decide for yourself whether to use this service or Google Trends or a similar “popularity” or “sentiment” analysis system.
Stephen E Arnold, December 5, 2018
Digital Reasoning: From Intelligence Centric Text Retrieval to Wealth Management
November 12, 2018
Vendors of text processing systems have had to find new ways to generate revenue. In the early days, entity extraction and social graphs drew customers from the US government and from specialized companies like Booz, Allen & Hamilton.
Today, different economic realities have forced change.
The capitalist tool published “Digital Reasoning Brings AI To Wealth Management.” The write up does little to put Digital Reasoning in context. The company was founded in 2000. The firm accepted outside financing which now amounts to about $100 million. The firm became cozy with IBM, labored in the vineyards of the star crossed Distributed Common Ground System, and then faced a fire storm of competition from companies big and small. The reason? Entity extraction and link analysis became commodities. The fancy math also migrated into a wide range of applications.
New buzzwords appeared and gained currency. These ranged from artificial intelligence (who knows what that phrase means?) to real time data analytics (yeah, what is “real time”?).
Digital Reasoning’s response is interesting. The company, like Attivio and Coveo, has nosed into customer support. But the intriguing play is that the Digital Reasoning system, which was text centric, is now packaging its system to help wealth management firms.
Is this text based?
Sure is. I learned:
For advisors, Digital Reasoning helps them prioritize which customers to focus on, which can be useful when an adviser may have 200 or more clients. At the management level, Digital Reasoning can show if the firm has specific advisors getting a lot of complaints so it can respond with training and intervention. At a strategic level, it can sift through communications and identify if customers are looking for a specific offering or type of product.
Interesting approach.
The challenge, of course, will be to differentiate Digital Reasoning’s system from those available from dozens of vendors.
Digital Reasoning has investors who want a return on their $100 million. After 18 years, time may be compressing as solutions once perceived as sophisticated become more widely available and subject to price pressure.
Rumors of Amazon’s interest in this “wealth management” sector have reached us in Harrod’s Creek. That might be another reason why the low profile Digital Reasoning is stirring the PR waters using the capitalist’s tool, Forbes Magazine, once a source of “real” news.
Stephen E Arnold, November 12, 2018
Smart Software and Clever Humans
September 23, 2018
Online translation works pretty well. If you want 70 to 85 percent accuracy, you are home free. Most online translation systems handle routine communications like short blog posts written in declarative sentences and articles written in technical jargon just fine. Stick to mainstream languages, and the services work okay.
But if you want an online system to translate my pet phrases like HSSCM or azure chip consultant, you have to attend more closely. HSSCM refers to the way in which some Silicon Valley outfits run their companies. You know. Like a high school science club which decides that proms are for goofs and football players are not smart. The azure chip thing refers to consulting firms which lack the big time reputation of outfits like Bain, BCG, Booz, etc. (Now don’t get me wrong. The current incarnations of these blue chip outfits are far from stellar. Think questionable practices. Maybe criminal behavior.) The azure chip crowd means second string, maybe third string, knowledge work. Just my opinion, but online translation systems don’t get my drift. My references to Harrod’s Creek are geocoding nightmares when I reference squirrel hunting and bourbon in cereal. Savvy?
I was, therefore, not surprised when I read “AI Company Accused of Using Humans to Fake Its AI.” The main point seems to be:
[An] interpreter accuses leading voice recognition company of ripping off his work and disguising it as the efforts of artificial intelligence.
There are rumors that some outfits use Amazon’s far from mechanical Turk or just use regular employees who can translate that which baffles the smart software.
The allegation from a former human disguised as smart software offered this information to Sixth Tone, a blog publishing the article:
In an open letter posted on Quora-like Q&A platform Zhihu, interpreter Bell Wang claimed he was one of a team of simultaneous interpreters who helped translate the 2018 International Forum on Innovation and Emerging Industries Development on Thursday. The forum claimed to use iFlytek’s automated interpretation service.
Trust me, you zippy millennials, smart software can be fast. It can be efficient. It can be less expensive than manual methods. But it can be wrong. Not just off base. Playing a different game with expensive Ronaldo types.
Why not run this blog post through Google Translate and check out the French or Spanish the system produces? Better yet, aim the system at a poor quality surveillance video or a VoIP call laden with insider talk between a cartel member and the Drug Llama?
Stephen E Arnold, September 23, 2018