November 28, 2016
I recall sitting in high school when I was 14 years old and listening to our English teacher explain the basic plots used by fiction writers. The teacher was Miss Dalton and she seemed quite happy to point out that fiction depended upon: Man versus man, man versus the environment, man versus himself, man versus belief, and maybe one or two others. I don’t recall the details of a chalkboard session in 1959.
Not to fear.
I read “Fiction Books Narratives Down to Six Emotional Story Lines.” Smart software and some PhDs have cracked the code. Ivory Tower types processed digital versions of 1,327 books of fiction. I learned:
They [the Ivory Tower types] then applied three different natural language processing filters used for sentiment analysis to extract the emotional content of 10,000-word stories. The first filter—dubbed singular value decomposition—reveals the underlying basis of the emotional storyline, the second—referred to as hierarchical clustering—helps differentiate between different groups of emotional storylines, and the third—which is a type of neural network—uses a self-learning approach to sort the actual storylines from the background noise. Used together, these three approaches provide robust findings, as documented on the hedonometer.org website.
Okay, and what does the smart software say today that Miss Dalton did not tell me more than 50 years ago?
[The Ivory Tower types] determined that there were six main emotional storylines. These include ‘rags to riches’ (sentiment rises), ‘riches to rags’ (fall), ‘man in a hole’ (fall-rise), ‘Icarus’ (rise-fall), ‘Cinderella’ (rise-fall-rise), ‘Oedipus’ (fall-rise-fall). This approach could, in turn, be used to create compelling stories by gaining a better understanding of what has previously made for great storylines. It could also teach common sense to artificial intelligence systems.
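For readers who want to see what the six shapes look like as data, here is a minimal Python sketch. It is not the researchers’ method (they report using singular value decomposition, hierarchical clustering, and a neural network). It is a crude stand-in that smooths a list of sentiment scores and reads off the rise and fall pattern. The function names, the window size, and the tolerance are all assumptions made for illustration.

```python
# Map each rise/fall pattern to the storyline names quoted above.
ARC_NAMES = {
    "rise": "rags to riches",
    "fall": "riches to rags",
    "fall-rise": "man in a hole",
    "rise-fall": "Icarus",
    "rise-fall-rise": "Cinderella",
    "fall-rise-fall": "Oedipus",
}


def smooth(scores, window=3):
    """Centered moving average to damp point-to-point noise."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def arc_shape(scores, window=3, tol=0.05):
    """Return the sequence of moves, e.g. 'fall-rise', in a sentiment series."""
    smoothed = smooth(scores, window)
    moves = []
    for a, b in zip(smoothed, smoothed[1:]):
        delta = b - a
        if abs(delta) < tol:  # ignore changes too small to count (assumed threshold)
            continue
        move = "rise" if delta > 0 else "fall"
        if not moves or moves[-1] != move:  # record only changes of direction
            moves.append(move)
    return "-".join(moves)


def name_arc(scores, window=3, tol=0.05):
    """Name the arc, or report 'unclassified' if it is not one of the six."""
    return ARC_NAMES.get(arc_shape(scores, window, tol), "unclassified")


# Hypothetical sentiment scores sampled along a story, from 0 (grim) to 1 (happy).
print(name_arc([0.9, 0.6, 0.3, 0.1, 0.3, 0.6, 0.9]))
```

A story that starts happy, sinks, and recovers comes out as “man in a hole.” Real text would need a sentiment lexicon and far more smoothing than this toy provides.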
Stephen E Arnold, November 28, 2016
November 15, 2016
I read “French AI Ecosystem.” Most of the companies have zero or a low profile in the United States. The history of French high technology outfits remains a project for an enterprising graduate student with one foot in La Belle France and one in the USA. This write up is a bit of a sales pitch for venture capital in my opinion. The reason that VC inputs are needed is that raising money in France is (how shall I put this?) not easy. There is no Silicon Valley. There is Paris and a handful of other acceptable places to be intelligent. In the Paris high tech setting, there are a handful of big outfits and lots and lots of institutions which keep the French one percent in truffles and the best the right side of the Rhone has to offer. The situation is dire unless the start up is connected by birth, by education at one of the acceptable institutions, or hooked up with a government entity. I want to mention that there is a bit of French ethnocentrism at work in the French high tech scene. I won’t go into detail, but you can check it out yourself if you attend a French high tech conference in one of the okay cities. Ars-en-Ré and Gémenos do not qualify. Worth a visit, however.
Now to the listings. You will have to work through the almost unreadable graphic or contact the outfit creating the listing, which is why the graphic is unreadable I surmise. From the version of the graphic I saw, I did discern a couple of interesting points. Here we go:
Three outfits were identified as having natural language capabilities. These are Proxem, syJLabs (no, I don’t know how to pronounce this “syjl” string. I can do “abs,” though.), and Yseop (maybe, Aesop from the fable?). Proxem offers its Advanced Natural Language Object-oriented Processing Environment (Antelope). The company was founded in 2007. syJLabs does not appear in my file of French outfits, and we drew a blank when looking for the company’s Web site. Sigh. Yseop has been identified as a “top IT innovator” by an objective, unimpeachable, high value, super credible, wonderful, and stellar outfit (Ventana Research). Yseop, also founded in 2007, offers a system which “turns data into narrative in English, French, German, and Spanish, all at the speed of thousands of pages per second.”
As I worked through a graphic containing lots of companies, I spotted two interesting inclusions. The first is Sinequa, a vendor of search founded in 2002, now positioned as an important outfit in Big Data and machine learning. Fascinating. The reinvention of Sinequa is a logical reaction to the implosion of the market for search and retrieval for the enterprise. The other company I noted was Antidot, which mounted a push to the US market several years ago. Antidot, like Sinequa, focused on information access. It too is “into” Big Data and machine learning.
I noted some omissions; for example, Hear&Know, among others. Too bad the listing is almost unreadable and does not include a category for law enforcement, surveillance, and intelligence innovators.
Stephen E Arnold, November 15, 2016
November 7, 2016
There are differences among these three use cases for entity extraction:
- Operatives reviewing content for information about watched entities prior to an operation
- Identifying people, places, and things for a marketing analysis by a PowerPoint ranger
- Indexing Web content to add concepts to keyword indexing.
Regardless of your experience with software which identifies “proper nouns,” events, meaningful digits like license plate numbers, organizations, people, and locations (accepted and colloquial), you will find the information in “Performance Comparison of 10 Linguistic APIs for Entity Recognition” thought provoking.
The write up identifies the systems which perform the best and the worst.
Here are the five systems and the number of errors each generated in a test corpus. The “scores” are based on a test which contained 150 targets. The “best” system got more correct than incorrect. I find the results interesting but not definitive.
The five best performing systems on the test corpus were:
- Intellexer API (best)
- Lexalytics (better)
- AlchemyLanguage IBM (good)
- Indico (less good)
- Google Natural Language.
The five worst performing systems on the test corpus were:
- Microsoft Cognitive Services (dead last)
- Hewlett Packard Enterprise Haven (penultimate last)
- Text Razor (antepenultimate)
- Meaning Cloud
- Aylien (apparently misspelled in the source article).
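How a ranking like this gets computed can be sketched in a few lines of Python. This is not the scoring procedure used in the source article, which I have not reproduced. It is a generic sketch that assumes a hand-built list of expected entities and treats a case-insensitive exact string match as a hit. The function name and the sample entities are made up for illustration.

```python
def score_entities(expected, extracted):
    """Compare a system's extracted entities against the expected targets."""
    expected_set = {e.lower() for e in expected}
    extracted_set = {e.lower() for e in extracted}

    hits = expected_set & extracted_set        # targets the system found
    missed = expected_set - extracted_set      # targets the system overlooked
    spurious = extracted_set - expected_set    # output that matches no target

    precision = len(hits) / len(extracted_set) if extracted_set else 0.0
    recall = len(hits) / len(expected_set) if expected_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    return {
        "hits": len(hits),
        "errors": len(missed) + len(spurious),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical targets and hypothetical system output.
targets = ["Paris", "Sinequa", "Stephen Arnold", "Harrods"]
system_output = ["paris", "Sinequa", "Rhone"]
print(score_entities(targets, system_output))
```

Whether a missed target and a spurious hit deserve equal weight depends on the use case. An operative and a PowerPoint ranger will answer that question differently.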
There are some caveats to consider:
- Entity identification works quite well when the training set includes the entities and their synonyms
- Multi-language entity extraction requires additional training set preparation. “Learn as you go” is often problematic when dealing with social messages, certain intercepted content, and colloquialisms
- Identification of content used as a code—for example, Harrod’s teddy bear for contraband—is difficult even for smart software operating with subject matter experts’ input. (Bad guys are often not stupid and understand the concept of using one word to refer to another thing based on context or previous interactions).
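The first caveat is easy to demonstrate with a dictionary lookup, the simplest form of entity identification. The sketch below is a toy and not any vendor’s method. The gazetteer contents and function names are assumptions for illustration. It finds an entity only when the entity or one of its listed synonyms appears in the text, which is exactly why an unlisted nickname or code word slips through.

```python
import re

# Hypothetical gazetteer: canonical entity name -> known surface forms.
GAZETTEER = {
    "New York City": ["new york city", "nyc", "the big apple"],
    "Federal Bureau of Investigation": ["federal bureau of investigation", "fbi"],
}


def build_matcher(gazetteer):
    """Compile one regex over all aliases and a lookup back to canonical names."""
    alias_to_canonical = {}
    for canonical, aliases in gazetteer.items():
        for alias in aliases:
            alias_to_canonical[alias.lower()] = canonical
    # Try longer aliases first so "new york city" wins over any shorter overlap.
    ordered = sorted(alias_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, alias_to_canonical


def find_entities(text, gazetteer=GAZETTEER):
    """Return canonical entity names in order of appearance in the text."""
    pattern, lookup = build_matcher(gazetteer)
    return [lookup[m.group(1).lower()] for m in pattern.finditer(text)]


print(find_entities("The FBI met in the Big Apple."))
print(find_entities("Agents flew to Gotham."))  # alias not in the gazetteer
```

The second call returns an empty list because “Gotham” is not in the list of synonyms. A code word agreed upon between two bad guys will behave the same way.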
Net net: Automated systems are essential. The error rates may be fine for some use cases and potentially dangerous for others.
Stephen E Arnold, November 7, 2016
August 31, 2016
Natural language processing is not a new term in the IT market, but NLP technology has only become commonplace in the last year or so. By commonplace, I mean that most computers and mobile devices have some form of NLP tool, including digital assistants and voice to text. Business 2 Community explains the basics about NLP technology in the article, “Natural Language Processing: Turning Words Into Data.”
The article acts as a primer for understanding how NLP works and is redundant until you get into the text about how it is applied in the real world; that is, tied to machine learning. I found this paragraph helpful:
“This has changed with the advent of machine learning. Machine learning refers to the use of a combination of real-world and human-supplied characteristics (called “features”) to train computers to identify patterns and make predictions. In the case of NLP, using a real-world data set lets the computer and machine learning expert create algorithms that better capture how language is actually used in the real world, rather than on how the rules of syntax and grammar say it should be used. This allows computers to devise more sophisticated—and more accurate—models than would be possible solely using a static set of instructions from human developers.”
It then goes into further detail about how NLP is applied to big data technology and explains the practical applications. It makes some reference to open source NLP technologies, but only in passing.
The article sums up the NLP and big data information in plain English. The technology gets even more complicated when you delve into further research on the subject.
Whitney Grace, August 31, 2016
August 25, 2016
The article titled Natural Language Processing: Turning Words Into Data on B2C takes an in-depth look at NLP and why it is such a difficult area to perfect. Anyone who has conversed with an automated customer service system knows that NLP technology is far from ideal. Why is this? The article suggests that while computers are great at learning the basic rules of language, things get far more complex when you throw in context-dependent or ambiguous language, not to mention human error. The article explains,
“This has changed with the advent of machine learning…In the case of NLP, using a real-world data set lets the computer and machine learning expert create algorithms that better capture how language is actually used in the real world, rather than on how the rules of syntax and grammar say it should be used. This allows computers to devise more sophisticated—and more accurate—models than would be possible solely using a static set of instructions from human developers.”
Throw in Big Data and we have a treasure trove of unstructured data to glean value from in the form of text messages, emails, and social media. The article lists several exciting applications such as automatic translation, automatic summarization, Natural Language Generation, and sentiment analysis.
Chelsea Kerwin, August 25, 2016
December 28, 2015
A new company seeks to make everyone a big data expert. You can get the full scoop in “Detecting Consumer Decisions within Messy Data: Software Analyzes Online Chatter to Predict Health Care Consumers’ Behavior.” The company with the natural language technology and proprietary smart software is dMetrics.
Here’s the premise:
DecisionEngine, Nemirovsky [dMetrics wizard] says, better derives meaning from text because the software — which now consists of around 2 million lines of code — is consistently trained to recognize various words and synonyms, and to interpret syntax and semantics. “Online text is incredibly tough to analyze properly,” he says. “There’s slang, misspellings, run-on sentences, and crazy punctuation. Discussion is messy.”
Now, how does the system work?
Visualize the software as a three-tiered funnel, Nemirovsky suggests, with more refined analysis happening as the funnel gets narrower. At the top of the funnel, the software mines all mentions of a particular word or phrase associated with a certain health care product, while filtering out “noise” such as fake websites and users, or spam. The next level down involves separating out commenters’ personal experiences over, say, marketing materials and news. The bottom level determines people’s decisions and responses, such as starting to use a product — or even considering doing so, experiencing fear or confusion, or switching to a different medication.
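The three-tiered funnel can be sketched in Python to make the idea concrete. DecisionEngine is proprietary, so none of this reflects dMetrics’ actual code. The keyword lists, the function names, and the product name “ExampleMed” are all hypothetical, and plain substring matching stands in for whatever trained models the real system uses.

```python
import re

# Hypothetical rule sets; a real system would learn these rather than hard-code them.
SPAM_MARKERS = ("click here", "buy now", "free trial")
FIRST_PERSON = re.compile(r"\b(i|my|me)\b")
DECISION_CUES = {
    "switching": ("switched to", "switching to"),
    "considering": ("thinking about", "considering"),
    "starting": ("started taking", "just started"),
    "fear or confusion": ("scared", "confused", "worried"),
}


def tier1_mentions(posts, product):
    """Top of the funnel: keep product mentions, drop obvious spam."""
    product = product.lower()
    kept = []
    for post in posts:
        text = post.lower()
        if product in text and not any(m in text for m in SPAM_MARKERS):
            kept.append(post)
    return kept


def tier2_personal(posts):
    """Middle: keep first-person accounts, drop marketing copy and news."""
    return [p for p in posts if FIRST_PERSON.search(p.lower())]


def tier3_decision(post):
    """Bottom: label the decision or reaction the commenter expresses."""
    text = post.lower()
    for label, cues in DECISION_CUES.items():
        if any(cue in text for cue in cues):
            return label
    return "no decision detected"


def funnel(posts, product):
    """Run all three tiers and return (post, label) pairs."""
    personal = tier2_personal(tier1_mentions(posts, product))
    return [(post, tier3_decision(post)) for post in personal]


sample_posts = [
    "Buy now! ExampleMed free trial, click here",
    "ExampleMed approved for a new use, the company says",
    "I just started ExampleMed last week and my headaches are gone",
    "My doctor suggested ExampleMed but I'm worried about side effects",
    "I switched to ExampleMed from my old pills",
    "I love my new running shoes",
]
for post, label in funnel(sample_posts, "ExampleMed"):
    print(label, "<-", post)
```

Six posts go in and three come out with labels. The slang, misspellings, and crazy punctuation Nemirovsky mentions are where keyword rules like these fall apart.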
The company wants to expand beyond health care. Worth monitoring.
Stephen E Arnold, December 28, 2015
December 12, 2015
I read “U.S. Data Company Palantir Raises $679.8 Million.” The key points in the write up from my point of view were that Palantir is valued at $20 billion, which may be a record for a company providing search and content analysis. The other point is that the company has raised more than $670 million. The company keeps a low profile and reminds me of the teenage Autonomy from the early 2000s. Value may become an issue at some point.
Stephen E Arnold, December 12, 2015
December 4, 2015
You talk to your mobile phone, right? I assume you don’t try to chat with Siri- and Cortana-type services in noisy places, in front of folks you don’t trust, or when you are in a wind storm.
I know that the idea of typing questions with subjects, verbs, adjectives, and other parts of speech is an exciting one to some people. In reality, crafting sentences is not the ideal way to interact with some search systems. If you are looking for snaps of Miley Cyrus, you don’t want to write a sentence. Just enter the word “Miley” and the Alphabet Google thing does the rest. Easy.
I read about another search related research study in “Natural Language Processing NLP Market Dynamics, Forecast, Analysis and Supply Demand 2015-2025.” I find the decade long view evidence that Excel trend functions may have helped the analysts formulate their “future insights.”
The write up about the study identifies some of the companies engaged in NLP. Here’s a sample:
Dolbey Systems Inc.
SAS Institute Inc.
Netbase Solutions Inc.
Verint Systems Inc.
What, no Alphabet Google? Perhaps the full study includes this outfit.
A report by MarketsAndMarkets pegged NLP as reaching $13.4 billion by 2020. I assume that the size of the market in 2025 will be larger. Since I don’t have the market size data from Future Market Insights, we will just have to wait and see what happens.
In today’s business world, projections for a decade in the future strike me as somewhat far reaching and maybe a little silly.
Who crafted this report? According to the write up:
Future Market Insights (FMI) is a premier provider of syndicated research reports, custom research reports, and consulting services. We deliver a complete packaged solution, which combines current market intelligence, statistical anecdotes, technology inputs, valuable growth insights, aerial view of the competitive framework, and future market trends.
I like the aerial view of the competitive framework thing. I wish I could do that type of work. I wonder how Verint perceives a 10 year projection when some of the firm’s intelligence work focuses on slightly shorter time horizons.
Stephen E Arnold, December 4, 2015
October 8, 2015
Want to see what natural language processing does in the Stanford CoreNLP implementation? Navigate to Stanford CoreNLP. The service is free. Enter some text. The system will scan the input and display an output. NLP revealed:
What can one do with this output? Build an application around the outputs. NLP is easy. The artificial intelligence implementation is a bit of a challenge, of course, but parts of speech, named entities, and dependency parsing can be darned useful. Now mixed language inputs may be an issue. Names in different character sets could be a hurdle. I am thrilled that NLP has been visualized using the brat visualization and annotation software. Get excited, gentle reader.
Stephen E Arnold, October 8, 2015
August 17, 2015
If you read academic papers, you may want to take a flight through Partridge. Additional details are at this link. According to the Web site: Partridge
is a web based tool used for information retrieval and filtering that makes use of Machine Learning techniques. Partridge indexes lots of academic papers and classifies them based on their content. It also indexes their content by scientific concept, providing the ability to search for phrases within specific key areas of the paper (for example, you could search for a specific outcome in the ‘conclusion’ or find out how the experiment was conducted in the ‘methodology’ section.)
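Searching within a specific part of a paper, as the description above puts it, can be sketched in Python. This is not Partridge’s code. The data layout, the function name, and the sample papers are assumptions for illustration, and a plain substring test stands in for the machine learning the site describes.

```python
# Hypothetical index: each paper's text is stored under a section label.
PAPERS = [
    {
        "title": "Paper A",
        "sections": {
            "methodology": "We ran a randomized trial with 40 participants.",
            "conclusion": "The treatment reduced error rates.",
        },
    },
    {
        "title": "Paper B",
        "sections": {
            "methodology": "We measured error rates on a held-out corpus.",
            "conclusion": "Results were inconclusive.",
        },
    },
]


def search(papers, phrase, section=None):
    """Return (title, section) pairs whose text contains the phrase.

    If section is given, only that part of each paper is searched.
    """
    phrase = phrase.lower()
    hits = []
    for paper in papers:
        for name, text in paper["sections"].items():
            if section is not None and name != section:
                continue
            if phrase in text.lower():
                hits.append((paper["title"], name))
    return hits


print(search(PAPERS, "error rates"))                        # anywhere in the paper
print(search(PAPERS, "error rates", section="conclusion"))  # conclusions only
```

Restricting the search to the conclusion returns only the paper that reports the phrase as an outcome, which is the benefit the Web site claims for indexing by section.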
The About section of the Web site explains:
Partridge is a collection of tools and web-based scripts that use artificial intelligence to run semantic analysis on large bodies of text in order to provide reader recommendations to those who query the tool. The project is named after Errol Partridge, a character from the cult Science Fiction film ‘Equilibrium’ who imparts knowledge of a cache of fictional books (banned contraband in the film) upon the protagonist, John Preston, eventually leading to his defiance of the state and the de-criminalization of literature. Partridge is my dissertation project at Aberystwyth University, United Kingdom.
Check out the system at http://paprol.org.uk.
Stephen E Arnold, August 17, 2015