Economical Semantics: Check Out GitHub

June 9, 2022

A person asked me at lunch this week, “How can we do a sentiment analysis search on the cheap?” My reaction was, “There are many options. Check out GitHub and let it rip.” After lunch, one of my trusty researchers reminded me that our files contained a copy of a 2021 article called “Semantic Search on the Cheap.” I re-read the article and noticed that I had circled this passage in October 2021:

Innovative models are being released at a blistering pace, with different architectures and better scores against the benchmarks. The models are almost always bigger networks, with billions of parameters, requiring more and more GPU power. These models are extremely expressive, dynamic and can be fine-tuned to solve a multitude of problems.

Despite the cratering of some tech juggernauts, the pace of marketing in the smart software sector continues to outpace innovation. The write up is interesting because it raised a number of questions on Thursday, June 2, 2022. In a post-lunch stupor, I asked myself these questions:

  1. How many organizations want to know the “sentiment” of a chunk of text? The early sentiment analysis systems operated on word lists. Some of the words and phrases in a customer email reveal the emotional payload of the message; for example, “sue you” or “terminate our agreement.” The semantic sentiment has launched a thousand PowerPoints, but what about the emotional payload of an employee complaining on TikTok?
  2. Is 85 percent accuracy the high water mark? If it is, the “accuracy” scores are in what I continue to call the “close enough for horseshoes” playing area. In 100 text passages, the best one can do is generate 15 misses. Lower “scores” mean more misses. This is okay for online advertising, but what about diagnosing a child’s medical condition? Hey, only 15 get worse and that is the best case. No sentiment score for the parents’ communications with a malpractice attorney is necessary.
  3. Is cheap the optimal way to get good “performance”? The answer is that it costs money to go fast. Plus, smart software has a nasty tendency to drift. As the content fed into the system reflects words and concepts not part of the system’s furniture, the camp chairs get mixed up with the love seats. For certain applications like customer service in companies that don’t want to hear from customers, this approach is perfect.
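
One rough way to watch for the drift described in point 3 is to track how much incoming text falls outside the vocabulary the system was built on. What follows is a minimal sketch, not a method from the article. The `known_vocab` set, the sample batches, and the simple regex tokenizer are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def out_of_vocab_rate(texts, known_vocab):
    """Return the fraction of tokens in `texts` that are not in `known_vocab`."""
    tokens = [tok for text in texts for tok in tokenize(text)]
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok not in known_vocab)
    return unseen / len(tokens)

# Hypothetical vocabulary the system was built on (an assumption).
known_vocab = {"the", "chair", "is", "broken", "please", "refund", "my", "order"}

# Hypothetical batches: older traffic versus newer traffic.
old_batch = ["Please refund my order", "The chair is broken"]
new_batch = ["The loveseat arrived ghosted", "Refund my order please"]

old_rate = out_of_vocab_rate(old_batch, known_vocab)
new_rate = out_of_vocab_rate(new_batch, known_vocab)
print(f"old batch: {old_rate:.3f}, new batch: {new_rate:.3f}")
```

A rising rate does not prove the scores are wrong, but it is a cheap signal that the camp chairs and the love seats deserve a look.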

Google wants everyone to Snorkel. Meta or Zuckbook wants everyone to embrace the outputs of FAIR (Facebook Artificial Intelligence Research). Clever, eh? Amazon and Microsoft are players too. We must not forget IBM. Who could ever forget Watson and DataFountain?

Net net: Download stuff from GitHub or another open source repository and get coding. Reserve time for a zippy PowerPoint too.

Stephen E Arnold, June 9, 2022

Sentiment Analysis: A Comparison with Jargon

January 3, 2022

For anyone faced with choosing a sentiment extraction method, KD Nuggets offers a useful comparison in “Sentiment Analysis API vs Custom Text Classification: Which One to Choose?” Data consultant and blogger Jérémy Lambert used a concrete dataset to demonstrate the pros and cons of each approach. For sentiment analysis, his team tested out Google Cloud Platform Natural Language API, Amazon Web Service Comprehend, and Microsoft Azure Text Analytics. Of those, Google looks like it performed the best. The custom text classification engines they used were Google Cloud Platform AutoML Natural Language and Amazon Web Service Comprehend Custom Classification. Lambert notes there are several other custom classification options they could have used, for example Monkey Learn, Twinwords, and Connexun. We observe no specialized solutions like Lexalytics were considered.

Before diving into the comparison, Lambert emphasizes it is important to distinguish between sentiment analysis and custom text classification. (See the two preceding links for more in-depth information on each.) He specifies:

“Trained APIs [sentiment analysis engines] are based on models already trained by providers with their databases. These models are usually used to manage common use cases of: sentiment analysis, named entity recognition, translation, etc. However, it is always relevant to try these APIs before custom models since they are more and more competitive and efficient. For specific use cases where a very high precision is needed, it may be better to use AutoML APIs [custom text classification engines]. … AutoML APIs allow users to build their own custom model, trained on the user’s database. These models are trained on multiple datasets beforehand by providers.”

See the write-up for details on use cases, test procedures, performance results, and taxi-meter pricing. For those who want to skip to the end, here is Lambert’s conclusion:

“Both alternatives are viable. The choice between Sentiment Analysis API and Custom text classification must be made depending on the expected performance and budget allocated. You can definitely reach better performance with custom text classification but sentiment analysis performance remains acceptable. As shown in the article, sentiment analysis is much cheaper than custom text classification. To conclude, we can advise you to try sentiment analysis first and use custom text classification if you want to get better accuracy.”

Cynthia Murrell, January 3, 2022

A Test of Two Sentiment Analysis Libraries

June 17, 2021

A post by developer Alan Jones at Towards Data Science takes a close look at “Two Sentiment Analysis Libraries and How they Perform.” Complete with snippets of code, Jones takes us through his comparison of TextBlob and VADER. He emphasizes that, since human language is so nuanced, sentiment analysis is imprecise by nature. We are sure of one thing—the word “lawyer” in a customer support email is probably a bad sign. Jones introduces his experiment, and describes how interested readers might perform their own:

“So, it’s not reasonable to expect a sentiment analyzer to be accurate on all occasions because the meaning of sentences can be ambiguous. But how just accurate are they? It obviously depends on the techniques used to perform the analysis and also on the context. To find out, we are going to do a simple experiment with two easy to use libraries to see if we can find out what sort of accuracy we might expect. You could decide to build you own analyzer and, in doing so, you might learn more about sentiment analysis and text analysis in general. If you feel inclined to do such a thing, I highly recommend that you read the article by Conor O’Sullivan, Introduction to Sentiment Analysis where he not only explains the aim of Sentiment Analysis but demonstrates how to build an analyzer in Python using a bag of words approach and a machine learning technique called a Support Vector Machine (SVN). On the other hand you might prefer to import a library such as TextBlob or VADER to do the job for you.”

Jones walks us through his dual analysis of the 500 tweets found in the Sentiment140 for Academics collection, narrowed down from the 1.6 million contained in the greater Sentiment140 project. The twist is this: he had to reconcile the different classification schemas used by TextBlob and VADER. See the post for how he applies the two analyzers to the dataset and compares the results.
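
The reconciliation step can be pictured with a small sketch. This is not Jones's code; it is one plausible way to put both analyzers on the same footing, under stated assumptions. Sentiment140 labels tweets 0 (negative), 2 (neutral), and 4 (positive). TextBlob reports a polarity between -1 and 1, and VADER reports a compound score between -1 and 1. The 0.05 cutoff follows the convention suggested in VADER's documentation; applying the same cutoff to TextBlob is an assumption, and the scores in the example are made up rather than produced by either library.

```python
# Sentiment140 label values.
NEGATIVE, NEUTRAL, POSITIVE = 0, 2, 4

def label_from_score(score, threshold=0.05):
    """Map a score in [-1, 1] to a Sentiment140-style label."""
    if score >= threshold:
        return POSITIVE
    if score <= -threshold:
        return NEGATIVE
    return NEUTRAL

def accuracy(predicted, actual):
    """Return the fraction of predictions that match the hand-labeled values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Hand-labeled values for four hypothetical tweets.
gold = [POSITIVE, NEGATIVE, NEUTRAL, POSITIVE]

# Made-up scores. In practice these would come from
# TextBlob(text).sentiment.polarity and
# SentimentIntensityAnalyzer().polarity_scores(text)["compound"].
textblob_scores = [0.5, -0.3, 0.0, 0.02]
vader_scores = [0.6, -0.4, 0.0, 0.3]

textblob_accuracy = accuracy([label_from_score(s) for s in textblob_scores], gold)
vader_accuracy = accuracy([label_from_score(s) for s in vader_scores], gold)
print(textblob_accuracy, vader_accuracy)
```

Once both outputs are mapped to the same three labels, the comparison is a matter of counting matches against the hand-labeled tweets.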

Cynthia Murrell, June 17, 2021

Bitext and MarkLogic Join in a Strategic Partnership

June 13, 2017

Strategic partnerships are one of the best ways for companies to grow, and diamond-in-the-rough company Bitext has formed a brilliant one. According to a recent press release, “Bitext Announces Technology Partnership With MarkLogic, Bringing Leading-Edge Text Analysis To The Database Industry.” Bitext has enjoyed a number of key license deals. The company’s ability to process multi-lingual content with its deep linguistics analysis platform reduces costs and increases the speed with which machine learning systems can deliver more accurate results.

Both Bitext and MarkLogic are helping enterprise companies drive better outcomes and create better customer experiences. By combining their respective technologies, the pair hopes to reduce data’s text ambiguity and produce high quality data assets for semantic search, chatbots, and machine learning systems. Bitext’s CEO and founder said:

“With Bitext’s breakthrough technology built-in, MarkLogic 9 can index and search massive volumes of multi-language data accurately and efficiently while maintaining the highest level of data availability and security. Our leading-edge text analysis technology helps MarkLogic 9 customers to reveal business-critical relationships between data,” said Dr. Antonio Valderrabanos.

Bitext is capable of conquering the most difficult language problems and creating solutions for consumer engagement, training, and sentiment analysis. Bitext’s flagship product is its Deep Linguistics Analysis Platform and Kantar, GFK, Intel, and Accenture favor it. MarkLogic used to be one of Bitext’s clients, but now they are partners and are bound to invent even more breakthrough technology. Bitext takes another step to cement its role as the operating system for machine intelligence.

Whitney Grace, June 13, 2017

Can Online Systems Discern Truth and Beauty or All That One Needs to Know?

October 14, 2015

Last week I fielded a question about online systems’ ability to discern loaded or untruthful statements in a plain text document. I responded that software is not yet very good at figuring out whether a specific statement is accurate, factual, right, or correct. Google pokes at the problem in a number of ways; for example, assigning a credibility score to a known person. The higher the score, the more likely the person is to be “correct.” I am simplifying, but you get the idea: Recycling a variant of PageRank and the CLEVER method associated with Jon Kleinberg.

There are other approaches as well, and some of them—dare I suggest, most of them—use word lists. The idea is pretty simple. Create a list of words which have positive or negative connotations. To get fancy, you can work a variation on the brute force Ask Jeeves’ method; that is, cook up answers or statement of facts “known” to be spot on. The idea is to match the input text with the information in these word lists. If you want to get fancy, call these lists and compilations “knowledgebases.” I prefer lists. Humans have to help create the lists. Humans have to maintain the lists. Get the lists wrong, and the scoring system will be off base.

There is quite a bit of academic chatter about ways to make software smart. A recent example is “Sentiment Diffusion of Public Opinions about Hot Events: Based on Complex Network.” In the conclusion to the paper, which includes lots of fancy math, I noticed that the researchers identified the foundation of their approach:

This paper studied the sentiment diffusion of online public opinions about hot events. We adopted the dictionary-based sentiment analysis approach to obtain the sentiment orientation of posts. Based on HowNet and semantic similarity, we calculated each post’s sentiment value and classified those posts into five types of sentiment orientations.

There you go. Word lists.

My point is that it is pretty easy to spot a hostile customer support letter. Just write a script that looks for words appearing on the “nasty list”; for example, consumer protection violation, fraud, sue, etc. There are other signals as well; for example, capital letters, exclamation points, underlined words, etc.
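
A script of the kind described above might look like the following minimal sketch. The term list comes from the examples in this post; the thresholds, the function names, and the sample letters are assumptions made for illustration. Underlining cannot be detected in plain text, so that signal is left out.

```python
import re

# The "nasty list", seeded with the examples named in the post.
NASTY_TERMS = ["consumer protection violation", "fraud", "sue"]

def hostility_signals(text):
    """Collect the simple signals: listed terms, shouted words, exclamation points."""
    lowered = text.lower()
    # Match whole words and phrases so "sue" does not fire on "issue".
    hits = [
        term for term in NASTY_TERMS
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]
    words = re.findall(r"[A-Za-z]+", text)
    # Count words of three or more letters written entirely in capitals.
    shouted = [w for w in words if len(w) >= 3 and w.isupper()]
    return {
        "nasty_terms": hits,
        "shouted_words": len(shouted),
        "exclamations": text.count("!"),
    }

def looks_hostile(text, min_signals=2):
    """Flag the text when enough signals line up (the threshold is an assumption)."""
    signals = hostility_signals(text)
    score = len(signals["nasty_terms"])
    score += 1 if signals["shouted_words"] else 0
    score += 1 if signals["exclamations"] >= 2 else 0
    return score >= min_signals

angry = "This is FRAUD!! I will sue."
calm = "I have an issue with my invoice."
print(looks_hostile(angry), looks_hostile(calm))
```

The catch is the one already noted: a human has to create the list and keep it current, and a wrong list means a wrong flag.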

The point is that distorted, shaped, weaponized, and just plain bonkers information can be generated. This information can be gussied up in a news release, posted on a Facebook page, or sent out via Twitter before the outfit reinvents itself.

The researcher, the “real” journalist, or the hapless seventh grader writing a report will be none the wiser unless big time research is embraced. For now, what can be indexed is presented as if the information were spot on.

How do you feel about that? That’s a sentiment question, gentle reader.

Stephen E Arnold, October 14, 2015

Pentaho Makes Big Plans for Big Data in 2015

January 1, 2015

The Pentaho blog takes the year in review and makes some pretty big speculations about 2015, and they’re big because they concern big data: “Big Data In 2015-Power To The People.” Pentaho predicted that big data business demands would be shaped by businesses’ demands for data blending and it turns out that was correct. Companies do not have standard data sets that fly across the board; rather, companies in different fields are turning to big data to handle their increasing number of data sets.

“Moving into 2015, and fired up by their initial big data bounties, businesses will seek even more power to explore data freely, structure their own data blends, and gain profitable insights faster. They know “there’s gold in them hills” and they want to mine for even more!”

The post’s 2015 big data predictions are even bigger than the imagination.

In 2015, companies will want to blend traditional data with more unstructured content. An example of how this will be used is to get a 360-degree customer profile. Combining social media with sentiment analysis about a company’s goods and services tells them more about their clients. Industry is predicted to see big changes in operational, strategic, and competitive advantages by feeding companies info on how to improve in these areas. Think smart house capabilities transferred to the new smart factories.

Big data will also have more flexibility in the cloud and people are demanding embedded analytics to see all the nitty gritty details about their business. The list ends with the prediction that more big data power will be given to the people, mostly in ease of use. You can’t really call that a prediction, more like common sense. Whatever happens in 2015, big data will see big growth.

Whitney Grace, January 01, 2015
Sponsored by, developer of Augmentext

Suicide Sentiment Analysis

November 21, 2014

Short honk: The notion of figuring out something about the emotional payload of a message is interesting. If you are following developments in sentiment analysis, you may find “Emotion Detection in Suicide Notes Using Maximum Entropy Classification” interesting. Now what might be done to pipe the output of this analysis into a predictive analytics engine with access to deep user data?

Stephen E Arnold, November 21, 2014

Attensity Ups Its Presence in Hackathons

October 28, 2014

I found the Attensity blog post “Attensity Takes Utah Tech Week” quite interesting. I cannot recall when mainstream content processing companies embraced hackathons so fiercely.

The blog post explains:

A hackathon, for the uninitiated, is exactly what it sounds like: a hybrid of computer hacking and a marathon in a grueling, caffeine-fueled, 12-hour time period. Groups comprised of mostly engineers and IT whizzes compete against the clock and other teams to create a project to present at the end of the day to a panel of judges.

What did Attensity’s engineers build to showcase the company’s sentiment analysis and analytics technologies? Here’s the Attensity description:

With the Twitter API up and running, Team Attensity used Raspberry Pi to process tweets using #obama and #utahtechweek. Simultaneously, the team used Arduino to code sentiments from the tweets using a red light for negative sentiments, blue for positive sentiments, and yellow for neutral sentiments.

Attensity was pleased with the outcome in Utah. More hackathons are in the firm’s future. I wonder if one can deploy IBM Watson using a Raspberry Pi or showcase HP Autonomy with an Arduino.

How will hackathons generate revenue? I am not sure. The effort seems like a cost hole to me.

Stephen E Arnold, October 28, 2014

On the Value of Customized Sentiment Analysis

August 26, 2014

One of the most-discussed business functions of natural language processing is sentiment analysis. Over at the SmartData Collective, Lexalytics’ Scott Van Boeyen tells us “Why Sentiment Analysis Engines Need Customization.” The short answer: slang. The write-up explains:

The problem with sentiment analysis is sometimes it’s wrong.[…]

“Oh man, that was nasty!” Is this sentence positive or negative? Surely, it must be negative. “Nasty” is a negative word, and everything else in this sentence is neutral. Final answer, negative! Drum roll…. Wrong! It’s positive.

The person who said this used the American slang definition of nasty, which has positive sentiment. There is absolutely no way to know by reading the sentence. So, if you (a human) were just tricked by reading this article, how is a machine supposed to figure it out? Answer: Tell the engine what’s positive and what’s negative.

High quality NLP engines will let you customize your sentiment analysis settings. “Nasty” is negative by default. If you’re processing slang where “nasty” is considered a positive term, you would access your engine’s sentiment customization function, and assign a positive score to the word.

The man has a point. Still, we are left with a few questions: How much more should one expect to pay for a customization feature? Also, how long does it take to teach an NLP platform comprehensive alternate vocabulary? How does one decide what slang to include—has anyone developed a list of suggestions? Perhaps one could start by consulting the Urban Dictionary.
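
The kind of customization the write-up describes can be pictured as a default word list plus per-customer overrides. The sketch below is a toy illustration, not the Lexalytics API; the lexicon, its scores, and the `score_text` function are all hypothetical.

```python
import re

# Hypothetical default lexicon: "nasty" is negative out of the box.
DEFAULT_LEXICON = {"nasty": -1.0, "great": 1.0, "terrible": -1.0}

def score_text(text, overrides=None):
    """Sum word scores, letting caller-supplied overrides replace the defaults."""
    lexicon = {**DEFAULT_LEXICON, **(overrides or {})}
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0.0) for word in words)

sentence = "Oh man, that was nasty!"

default_score = score_text(sentence)               # uses the default list
slang_score = score_text(sentence, {"nasty": 1.0})  # slang-aware override
print(default_score, slang_score)
```

The override itself is one line. Deciding which slang belongs in it, and keeping it current, is where the time and money go.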

Cynthia Murrell, August 26, 2014

Attensity Leverages Biz360 Invention

August 4, 2014

In 2010, Attensity purchased Biz360. The Beyond Search comment on this deal is at One of the goslings reminded me that I had not instructed a writer to tackle Attensity’s July 2014 announcement “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology.” PR-type “stories” can disappear, but for now you can find a description of “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology” at

My researcher showed me a hard copy of 8,645,395, and I scanned the abstract and claims. The abstract, like many search and content processing inventions, seemed somewhat similar to other text parsing systems and methods. The invention was filed in April 2008, two years before Attensity purchased Biz360, a social media monitoring company. Attensity, as you may know, is a text analysis company founded by Dr. David Bean. Dr. Bean employed various “deep” analytic processes to figure out the meaning of words, phrases, and documents. My limited understanding of Attensity’s methods suggested to me that Attensity’s Bean-centric technology could process text to achieve a similar result. I had a phone call from AT&T regarding the utility of certain Attensity outputs. I assume that the Bean methods required some reinforcement to keep pace with customers’ expectations about Attensity’s Bean-centric system. Neither the goslings nor I are patent attorneys. So after you download 395, seek out a patent attorney and get him/her to explain its mysteries to you.

The abstract states:

A system for evaluating a review having unstructured text comprises a segment splitter for separating at least a portion of the unstructured text into one or more segments, each segment comprising one or more words; a segment parser coupled to the segment splitter for assigning one or more lexical categories to one or more of the one or more words of each segment; an information extractor coupled to the segment parser for identifying a feature word and an opinion word contained in the one or more segments; and a sentiment rating engine coupled to the information extractor for calculating an opinion score based upon an opinion grouping, the opinion grouping including at least the feature word and the opinion word identified by the information extractor.

This invention tackles the Mean Joe Greene of content processing from the point of view of a quite specific type of content: A review. Amazon has quite a few reviews, but the notion of a “shaped” review is a thorny one. See, for example, The invention’s approach identifies words with different roles; some words are “opinion words” and others are “feature words.” By hooking a “sentiment engine” to this indexing operation, the Biz360 invention can generate an “opinion score.” The system uses item, language, training model, feature, opinion, and rating modifier databases. These, I assume, are either maintained by subject matter experts (expensive), smart software working automatically (often evidencing “drift” so results may not be on point), or a hybrid approach (humans cost money).
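
To make the moving parts in the abstract concrete, here is a toy sketch of the four stages it names: splitting a review into segments, tagging words, pairing feature words with opinion words, and scoring. This is emphatically not the patented method. The word lists, the scores, and the rule that a modifier applies only to the word right after it are assumptions, and simple lookup tables stand in for the parser and the databases the patent describes.

```python
import re

# Hypothetical stand-ins for the feature, opinion, and rating modifier databases.
FEATURE_WORDS = {"battery", "screen", "price"}
OPINION_WORDS = {"great": 1.0, "good": 0.5, "poor": -0.5, "terrible": -1.0}
MODIFIERS = {"very": 1.5, "not": -1.0}

def split_segments(review):
    """Segment splitter: break the review at sentence-like punctuation."""
    return [s.strip() for s in re.split(r"[.!?;]+", review) if s.strip()]

def extract_groupings(segment):
    """Pair each feature word with each opinion word found in the same segment."""
    words = re.findall(r"[a-z']+", segment.lower())
    features = [w for w in words if w in FEATURE_WORDS]
    groupings = []
    for i, word in enumerate(words):
        if word in OPINION_WORDS:
            # A modifier counts only when it sits directly before the opinion word.
            modifier = MODIFIERS.get(words[i - 1], 1.0) if i > 0 else 1.0
            for feature in features:
                groupings.append((feature, word, OPINION_WORDS[word] * modifier))
    return groupings

def rate_review(review):
    """Rating engine: average the opinion scores collected for each feature."""
    scores = {}
    for segment in split_segments(review):
        for feature, _opinion, score in extract_groupings(segment):
            scores.setdefault(feature, []).append(score)
    return {feature: sum(vals) / len(vals) for feature, vals in scores.items()}

review = "The battery is very good. The screen is not great! Shipping was slow."
ratings = rate_review(review)
print(ratings)
```

Note that the third sentence produces no score because neither “shipping” nor “slow” is on a list, which is the maintenance problem in miniature.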


The Attensity/Biz360 system relies on a number of knowledge bases. How are these updated? What is the latency between identifying new content and updating the knowledge bases to make the new content available to the user or a software process generating an alert or another type of report?

The 20 claims embrace the components working as a well oiled content analyzer. The claim I noted is that the system’s opinion score uses a positive and negative range. I worked on a sentiment system that made use of a stop light metaphor: red for negative sentiment and green for positive sentiment. When our system could not figure out whether the text was positive or negative we used a yellow light.


The approach, used for a US government project a decade ago, relied on a very simple metaphor to communicate a situation without scores, values, and scales.
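
The stop light metaphor is easy to sketch. The version below is an illustration only, not the code from that project. It assumes a sentiment score between -1 and 1 and a hypothetical `uncertainty_band` inside which the system declines to call the text positive or negative.

```python
def stoplight(score, uncertainty_band=0.2):
    """Map a sentiment score in [-1, 1] to red, yellow, or green."""
    if score > uncertainty_band:
        return "green"   # clearly positive
    if score < -uncertainty_band:
        return "red"     # clearly negative
    return "yellow"      # too close to call

for sample in (0.8, -0.6, 0.05):
    print(sample, stoplight(sample))
```

The design choice is to hide the number. The user sees a color, and the width of the yellow band becomes a tuning decision made by whoever maintains the system.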

Attensity said, according to the news story cited above:

By splitting the unstructured text into one or more segments, lexical categories can be created and a sentiment-rating engine coupled to the information can now evaluate the opinions for products, services and entities.

Okay, but I think that the splitting of text into segments was a function of iPhrase and search vendors converting unstructured text into XML and then indexing the outputs.

Jonathan Schwartz, General Counsel at Attensity, is quoted in the news story as asserting:

“The issuance of this patent further validates the years of research and affirms our innovative leadership. We expect additional patent issuances, which will further strengthen our broad IP portfolio.”

Okay, this sounds good but the invention took place prior to Attensity’s owning Biz360. Attensity, therefore, purchased the invention of folks who did not work at Attensity in the period prior to the filing in 2008. I understand that companies buy other companies to get technology and people. I find it interesting that Attensity’s work “validates” Attensity’s research and “affirms” Attensity’s “innovative leadership.”

I would word what the patent delivers and Attensity’s contributions differently. I am no legal eagle or sentiment expert. I do like less marketing razzle dazzle, but I am in the minority on this point.

Net net: Attensity is an interesting company. Will it be able to deliver products that make the licensees’ sentiment score move in a direction that leads to sustaining revenue and generous profits? With the $90 million in funding the company received in 2014, the 14-year-old company will have some work to do to deliver a healthy return to its stakeholders. Expert System, Lexalytics, and others are racing down the same quarter mile drag strip. Which firm will be the winner? Which will blow an engine?

Stephen E Arnold, August 4, 2014
