Economical Semantics: Check Out GitHub
June 9, 2022
A person asked me at lunch this week, “How can we do a sentiment analysis search on the cheap?” My reaction was, “There are many options. Check out GitHub and let it rip.” After lunch, one of my trusty researchers reminded me that our files contained a copy of a 2021 article called “Semantic Search on the Cheap.” I re-read the article and noticed that I had circled this passage in October 2021:
Innovative models are being released at a blistering pace, with different architectures and better scores against the benchmarks. The models are almost always bigger networks, with billions of parameters, requiring more and more GPU power. These models are extremely expressive, dynamic and can be fine-tuned to solve a multitude of problems.
Despite the cratering of some tech juggernauts, the pace of marketing in the smart software sector continues to outpace innovation. The write up is interesting because it raised a number of questions on Thursday, June 2, 2022. In a post-lunch stupor, I asked myself these questions:
- How many organizations want to know the “sentiment” of a chunk of text? The early sentiment analysis systems operated on word lists. Some of the words and phrases in a customer email reveal the emotional payload of the customer’s message; for example, “sue you” or “terminate our agreement.” The semantic sentiment has launched a thousand PowerPoints, but what about the emotional payload of an employee complaining on TikTok?
- Is 85 percent accuracy the high water mark? If it is, the “accuracy” scores are in what I continue to call the “close enough for horseshoes” playing area. In 100 text passages, the best one can do is generate 15 misses. Lower “scores” mean more misses. This is okay for online advertising, but what about diagnosing a child’s medical condition? Hey, only 15 get worse and that is the best case. No sentiment score for the parents’ communications with a malpractice attorney is necessary.
- Is cheap the optimal way to get good “performance”? The answer is that it costs money to go fast. Plus, smart software has a nasty tendency to drift. As the content fed into the system reflects words and concepts not part of the system’s furniture, the camp chairs get mixed up with the love seats. For certain applications like customer service in companies that don’t want to hear from customers, this approach is perfect.
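For those who want to see what "on the cheap" looks like, the word-list approach from the first question and the miss arithmetic from the second can be sketched in a few lines of Python. This is a rough illustration, not anyone's production method: the phrase lists, the weights, and the function names `score_sentiment` and `expected_misses` are all made up for this example.

```python
import re

# Hypothetical phrase lists. The weights are illustrative guesses,
# not drawn from any published sentiment lexicon.
NEGATIVE_PHRASES = {
    "sue you": -3,
    "terminate our agreement": -3,
}
POSITIVE_PHRASES = {
    "thank you": 1,
    "great service": 2,
}


def score_sentiment(text: str) -> int:
    """Sum the weights of every listed phrase found in the text."""
    # Lowercase and collapse whitespace so phrases match across line breaks.
    normalized = re.sub(r"\s+", " ", text.lower())
    score = 0
    for phrase, weight in {**NEGATIVE_PHRASES, **POSITIVE_PHRASES}.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        score += weight * len(re.findall(pattern, normalized))
    return score


def expected_misses(passages: int, accuracy: float) -> int:
    """Number of passages a system is expected to get wrong."""
    return round(passages * (1 - accuracy))


if __name__ == "__main__":
    email = "We will sue you and terminate our agreement."
    print(score_sentiment(email))      # negative score flags an angry customer
    print(expected_misses(100, 0.85))  # misses per 100 passages at 85 percent
```

A list like this knows nothing about sarcasm, negation, or a TikTok video, which is why the big models exist and why cheap has limits.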
Google wants everyone to Snorkel. Meta or Zuckbook wants everyone to embrace the outputs of FAIR (Facebook Artificial Intelligence Research). Clever, eh? Amazon and Microsoft are players too. We must not forget IBM. Who could ever forget Watson and DataFountain?
Net net: Download stuff from GitHub or another open source repository and get coding. Reserve time for a zippy PowerPoint too.
Stephen E Arnold, June 9, 2022