Quick and Dirty Sentiment Analysis

September 14, 2010

I thought “Most Common Words Unique to 1 Star and 5 Star App Store Reviews” provided some insight into how certain sentiment analysis systems work. The article said:

I wrote a script to crawl U.S. App Store customer reviews for the top 100 apps from every category (minus duplicates) and compute the most common words in 1-star and 5-star reviews, excluding words that were also common in 3-star reviews.
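A minimal sketch of that kind of counting, in Python. This is not the article's actual script, which is not reproduced here. The function names, the tokenizer, and the `field_size` cutoff are all assumptions made for illustration, and the reviews are assumed to be already downloaded as plain strings grouped by star rating.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters and apostrophes (straight or curly).
    return re.findall(r"[a-z'’]+", text.lower())


def word_counts(reviews):
    # Count every token across a list of review strings.
    counts = Counter()
    for review in reviews:
        counts.update(tokenize(review))
    return counts


def distinctive_words(target_reviews, field_reviews, n=20, field_size=100):
    # "Field" = the most common words in the comparison set (e.g. 3-star reviews).
    # field_size is a hypothetical cutoff, not a value taken from the article.
    field = {word for word, _ in word_counts(field_reviews).most_common(field_size)}
    ranked = word_counts(target_reviews).most_common()
    # Keep the most frequent target words that are not also common in the field.
    return [word for word, _ in ranked if word not in field][:n]


if __name__ == "__main__":
    five_star = ["Awesome app, worth the price", "Awesome and simple app"]
    three_star = ["The app is okay", "App does the job"]
    print(distinctive_words(five_star, three_star, n=3))
```

Words such as “the” and “app” drop out because they are also common in the comparison reviews, which is the whole trick.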

Frequency count against a “field”. Here are the results for positive apps:

awesome, worth, thanks, amazing, simple, perfect, price, everything, ever, must, iPod, before, found, store, never, recommend, done, take, always, touch

How do you know a loser?

waste, money, crashes, tried, useless, nothing, paid, open, deleted, downloaded, didn’t, says, stupid, anything, actually, account, bought, apple, already

“Sentiment” can be discerned by looking for certain words and keeping count. So much for the rocket science of “understanding unstructured text.”
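Here is a rough sketch of what “looking for certain words and keeping count” might look like in Python. The word sets are a hand-picked subset of the two lists above. The `score` and `label` functions and the zero threshold are assumptions for illustration, not anyone's production system.

```python
import re

# Hand-picked subsets of the 5-star and 1-star word lists quoted above.
POSITIVE = {"awesome", "worth", "thanks", "amazing", "simple", "perfect", "recommend"}
NEGATIVE = {"waste", "crashes", "useless", "deleted", "stupid"}


def score(text):
    # Count positive cue words minus negative cue words in the text.
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE for word in words)
    negative_hits = sum(word in NEGATIVE for word in words)
    return positive_hits - negative_hits


def label(text):
    # Hypothetical decision rule: the sign of the score picks the label.
    value = score(text)
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for review in ["Awesome app, worth every penny",
                   "Useless. Crashes on open, waste of money",
                   "It is an app"]:
        print(label(review), "|", review)
```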

Stephen E Arnold, September 14, 2010

Freebie

Written by Stephen E. Arnold · Filed Under News, Semantic, Text analytics, Text processing

Comments

5 Responses to “Quick and Dirty Sentiment Analysis”

Avi Rappoport on September 14th, 2010 2:33 pm

I wonder how it works with sarcasm and cultural slang, where “sick” may mean “good”.

And the word “useless” is interesting: as Marco says, when users are scathing in their desire for a new feature, they may be overall positive about the product.

It would be interesting to compare using these key words vs. more sophisticated sentiment analysis. I suspect the simple solutions may be fairly accurate 80% of the time, but that last 20% would be much harder.
Pascal Soucy on September 15th, 2010 7:21 am

Stephen,

In my opinion, sentiment analysis is never really “rocket science”, yet it is a bit more sophisticated than simply counting words.

Good sentiment analysis systems will analyze sentences much more deeply in order to “understand” them the right way. For instance, if the sentence contains “the service was not outstanding”, obviously, you cannot rely only on the word “outstanding” as a positive cue to detect the sentiment properly.

This is a very simple case, but if the sentence is: “I used to say that their service is outstanding, but that was until a couple months ago”, the problem is more challenging.

I can assure you that sophisticated systems are trying to analyze accurately even the most complex sentences like the last one. Only the customer reviews with an obvious and distinct positive or negative tone could be guessed properly with word counts.
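To make the two example sentences concrete, here is a small Python sketch of one crude way to handle the simple case: flip a cue word's polarity when a negation word appears just before it. The fixed two-word window, the word sets, and the function name are assumptions for illustration only. This is not a description of how any particular commercial system works.

```python
import re

POSITIVE = {"outstanding"}
NEGATIVE = {"useless"}
NEGATORS = {"not", "no", "never"}


def score_with_negation(text, window=2):
    # Sum cue-word polarities, flipping a cue when a negator
    # appears within `window` words before it (hypothetical heuristic).
    words = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, word in enumerate(words):
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        preceding = words[max(0, i - window):i]
        if polarity and any(p in NEGATORS for p in preceding):
            polarity = -polarity
        total += polarity
    return total


if __name__ == "__main__":
    print(score_with_negation("the service was not outstanding"))
    print(score_with_negation(
        "I used to say that their service is outstanding, "
        "but that was until a couple months ago"))
```

The first sentence comes out negative, as it should. The second still comes out positive, because nothing in a fixed window can tell that “used to” and “until a couple months ago” cancel the praise.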

To answer Avi, many people don’t get sarcasm, so I would not expect an automatic algorithm to be very accurate with irony and sarcasm. Training the system to capture the language subtleties of cultural groups or of the problem domain can do a lot to boost accuracy; it’s actually almost a must.
Bob Carpenter on September 15th, 2010 3:05 pm

The folks at Google are using exactly this kind of counting. Only they’re doing it on a much larger scale and for many more domains using slightly more sophisticated models than counting. But nothing so intensive as parsing whole sentences.

Check out Ryan McDonald’s paper on inferring a sentiment lexicon of phrases (not just words):

The Viability of Web-derived Polarity Lexicons
L. Velikovich, S. Blair-Goldensohn, K. Hannan and R. McDonald
North American Association for Computational Linguistics (NAACL), 2010.

http://www.ryanmcd.com/papers/web_polarity_lexiconsNAACL2010.pdf

And how they’re tackling problems like negation:

Learning to Classify the Scope of Negation for Improved Sentiment Analysis
I. Councill, R. McDonald and L. Velikovich
Negation and Speculation in Natural Language Processing (NeSp-NLP), 2010

http://www.ryanmcd.com/papers/NeSp-NLP10.pdf

They’re doing the same kind of things for other applications, like filling in categorical lists (like lists of doctors, lawyers and baseball players).

As far as I know, no one is trying to apply complex parsing to extract sentiment on a web scale. The problem is that the value added by these bleeding edge techniques is negligible at current levels of accuracy compared to the computational costs.
Stephen E. Arnold on September 15th, 2010 3:59 pm

Bob Carpenter,

Thanks for taking the time to comment. Interested in doing a Search Wizards Speak interview? It’s a freebie I offer some folks.

Stephen E Arnold, September 15, 2010
From the Sublime to the Ridiculous: Cracking the Code on Sentiment Analysis « She Blogs She Blogs on October 8th, 2010 8:56 am

[…] you read blogger ArnoldIT’s post last month on Marco Arment’s (former lead developer at Tumblr) blog about gauging the […]


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.