Palantir Profile: Search Plus Add Ons

November 25, 2015

Short honk: If you read French, you will learn quite a bit about Palantir, an interesting company with a $20 billion valuation. The write up is “Palantir et la France : naissance d’une nouvelle théorie abracadabrantesque ?” (roughly, “Palantir and France: The Birth of a New Preposterous Theory?”). A listicle in the heart of the article provides a good rundown of the system’s search and content processing capabilities. Yep, search. The difference between Palantir and outfits like Attivio, Coveo, Smartlogic, et al. is the positioning, the bundle of technology, and – oh, did I mention the $20 billion valuation? I do like the abracadabra reference. Magic?

Stephen E Arnold, November 25, 2015

Entity Extraction: Human Intermediation Still Necessary

November 23, 2015

I read “Facebook Should Be Able to Handle Names Like Isis and Phuc Dat Bich.” The article underscores the challenges smart software faces in a world that believes algorithms deliver the bacon.

Entity extraction methods requiring human subject matter experts and dictionary editors are expensive and slow. Algorithms are faster and, over time, more economical. Unfortunately, the automated systems miss some things and get other stuff wrong.

The article explains that Facebook decided a real person named Isis Anchalee was a bad actor. Another person’s transliterated Vietnamese name, Phuc Dat Bich, was treated as a prohibited phrase.

What’s the fix?

First, the folks assuming that automated systems are pretty much accurate need to connect with the notion of an “exception file”: a log containing names which are not in a dictionary. What if there is no dictionary? Well, that is a problem. What about names with different spellings and in different character sets? Well, that too is a problem.
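The mechanics are simple enough to sketch. Below is a minimal illustration of the exception file idea: names found in the dictionary pass through; everything else lands in a log for a human editor to review. The dictionary entries and names are, of course, hypothetical.

```python
# A minimal sketch of the "exception file" idea: names that match the
# dictionary pass through; everything else is logged for a human
# subject matter expert. The dictionary entries here are hypothetical.
KNOWN_NAMES = {"isis anchalee", "phuc dat bich", "stephen arnold"}

def extract_entities(candidates, exception_log):
    """Split candidate names into recognized entities and exceptions."""
    recognized = []
    for name in candidates:
        if name.lower() in KNOWN_NAMES:
            recognized.append(name)
        else:
            # Not in the dictionary: queue for human review rather
            # than silently dropping or blocking the name.
            exception_log.append(name)
    return recognized

log = []
print(extract_entities(["Isis Anchalee", "Zo Qyzx"], log))  # ['Isis Anchalee']
print(log)  # ['Zo Qyzx'] awaits a dictionary editor
```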

Will the vendors of automated systems point out the need for subject matter experts to create dictionaries, perform quality and accuracy audits, and update the dictionaries? Well, sort of.

The point is that, as with many numerical recipes, the expectation that a system is working with a high degree of accuracy is often incorrect. Strike that; substitute “sort of accurate.”

The write up states:

If that’s how the company want the platform to function, Facebook is going to have to get a lot better at making sure their algorithms don’t unfairly penalize people whose names don’t fit in with the Anglo-standard.

When it comes time to get the automated system back into sync with accurate entity extraction, there may be a big price tag.

What? Your vendor did not make that clear?

Explain your “surprise” to the chief financial officer who wants to understand how you overlooked costs which may be greater than the initial cost of the system.

Stephen E Arnold, November 23, 2015

Inferences: Check Before You Assume the Outputs Are Accurate

November 23, 2015

Predictive software works really well as long as the software does not have to deal with horse races, the stock market, and the actions of a single person and his closest pals.

“Inferences from Backtest Results Are False Until Proven True” offers a useful reminder to those who want to depend on algorithms someone else set up. The notion is helpful when the data processed are unchecked, unfamiliar, or just assumed to be spot on.

The write up says:

the primary task of quantitative traders should be to prove specific backtest results worthless, rather than proving them useful.

What throws backtests off the track? The write up provides a useful list of reminders; a small sketch of the first pitfall appears after the list:

  1. Data-mining and data snooping bias
  2. Use of non-tradable instruments
  3. Unrealistic accounting of frictional effects
  4. Use of the market close to enter positions instead of the more realistic open
  5. Use of dubious risk and money management methods
  6. Lack of effect on actual prices
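Pitfall 1 is easy to demonstrate. Here is a minimal sketch using a synthetic random return series, so no strategy has a genuine edge by construction: pick the best of a few hundred random strategies in-sample and watch the result evaporate out-of-sample. All parameters are arbitrary.

```python
# A minimal sketch of pitfall 1, data-mining bias: test enough random
# strategies on the same history and one will look brilliant in-sample,
# then evaporate out-of-sample. The returns are pure noise, so no
# strategy has a real edge by construction.
import numpy as np

rng = np.random.default_rng(7)
returns = rng.normal(0, 0.01, size=2_000)          # synthetic daily returns
in_sample, out_sample = returns[:1_000], returns[1_000:]

best_sharpe, best_signal = -np.inf, None
for _ in range(500):                               # 500 random "strategies"
    signal = rng.choice([-1, 1], size=1_000)       # random long/short calls
    strat = signal * in_sample
    sharpe = strat.mean() / strat.std() * np.sqrt(252)
    if sharpe > best_sharpe:
        best_sharpe, best_signal = sharpe, signal

# The in-sample winner is then "traded" on unseen data.
oos = best_signal * out_sample
print(f"in-sample Sharpe (best of 500): {best_sharpe:.2f}")   # looks impressive
print(f"out-of-sample Sharpe:           {oos.mean() / oos.std() * np.sqrt(252):.2f}")
```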

The author is concerned about financial applications, but the advice may be helpful to those who just want to click a link, output a visualization, and assume the big spikes are really important to the decision they will influence in one hour.

One point I highlighted was:

Widely used strategies lose any edge they might have had in the past.

Degradation occurs just like the statistical drift in Bayesian-based systems. Exciting if you make decisions on outputs known to be flawed. How are those automatic indexing, business intelligence, and predictive analytics systems working?

Stephen E Arnold, November 23, 2015

A Modest Dust Up between Big Data and Text Analytics

November 18, 2015

I wonder if you will become involved in this modest dust up between the Big Data folks and the text analytics adherents. I know that I will sit on the sidelines and watch the battle unfold. I may be mostly alone on that fence for three reasons:

  • Some text analytics outfits are Big Data oriented. I would point modestly to Terbium Labs and Recorded Future. Both do the analytics thing and both use “text” in their processing. (I know that learning about these companies is not as much fun as reading about Facebook friends, but it is useful to keep up with cutting edge outfits in my opinion.)
  • Text analytics can produce Big Data. I know that sounds like a fish turned inside out. Trust me. It happens. Think about some wan government worker in the UK grinding through Twitter and Facebook posts. The text analytics system outputs lots of data.
  • A faux dust up is mostly a marketing play. I enjoyed search and content processing vendor presentations which pitted features of one system versus another. This approach is not too popular because every system says it can do what every other system can do. The reality of the systems is, in most cases, not discernible to the casual failed webmaster now working as a “real” wizard.

Navigate to “Text Analytics Gurus Debunk 4 Big Data Myths.” You will learn that there are four myths in need of debunking. Here they are:

  1. Big Data survey scores reign supreme. Hey, surveys are okay because outfits like SurveyMonkey and the crazy pop up technology from that outfit in Michigan are easy to implement. Correct? Not important. Usable data for marketing? Important.
  2. Bigger social media data analysis is better. The outfits able to process the real time streams from Facebook and Twitter have lots of resources. Most companies do not have these resources. Ergo: Statistics 101 reigns no matter what the marketers say.
  3. New data sources are the most valuable. The idea is that data which are valid, normalized, and available for processing trump bigness. No argument from me.
  4. Keep your eye on the ball by focusing on how customers view you. Right. The customer is king in marketing land. In reality, the customer is a code word for generating revenue. Neither Big Data nor text analytics produce enough revenue in my world view. Sounds great though.

Will Big Data respond to this slap down? Will text analytics gurus mount their steeds and take another run down Marketing Lane to the windmill set up as a tourist attraction in an Amsterdam suburb?

Nope. The real battle involves organic, sustainable revenues. Talk is easy. Closing deals is hard. This dust up is not a mixed martial arts pay per view show.

Stephen E Arnold, November 18, 2015

Lexalytics: Checking into Hotel Data

November 14, 2015

I read “Boost Your Brand Reputation by Listening to Social Content.” I find the title interesting. The idea that a hotel brand such as Motel 6 or Hilton can improve its reputation by listening is interesting. I am not sure that listening translates to a better reputation. The guest in the hotel deals with the room, the staff, and the electrical outlet (presumably working). The guest forms an opinion about the hotel. I was in a hotel in Cape Town which featured a non-working door, no electricity, and pipes which leaked. This was a new room.

Listening to me did not solve the problems. What solved the problems was my speaking with two managers and proposing that I sleep in the lobby.

I am enthused by technology. I am not keen when technology is presented as a way to sidestep or subvert the reality that creates one’s views of a business, in this case a hotel. The idea is—well, let me be frank—not a good one.

I think the write up means that a hotel using Lexalytics technology has a way to obtain information that otherwise might not find its way to the 20-something in the marketing department. The hotel then has to take action to resolve the problem. This is pretty much common sense, but it does not boost a hotel’s social reputation. Listening is passive. The information must be converted to meaningful action.

That’s the problem with search and content processing companies and why many of them face credibility and revenue challenges. The hoped-for action has zero direct connection with the grinding of the algorithms and the motivation or capability of the licensee to make a change.

The write up asserts:

This social currency, online reputation, directly influences a hotelier’s sales volume: good reputation, higher sales — poor reputation, lower sales. The upshot is that in the hospitality industry, increasing your reputation (and revenue) means listening to social content and basing your business decisions on the feedback you receive from guests. And I know I’m preaching to the choir here, but remember that good reputation isn’t just for high-end establishments. There’s a lot to be said for value for money, and smaller, more modest establishments can often gain the most from careful management of their online reputation.

Sounds great. The management of the hotel has to make changes. Over time, the changes will have an impact on the Facebook or Yelp posts that the guests contribute.

Technology is simply a utility, not a way to get from lousy hotel to wonderful hotel with a mouse click or by listening. Horse feathers.

Stephen E Arnold, November 14, 2015

What Does Connotate Deliver to Licensees?

October 23, 2015

If you have asked yourself this question, you will find the answers in “The Most Asked Questions About Connotate.” Connotate was founded in 2000 and has ingested about $12.5 million from four investors, according to Crunchbase. That works out to 15 years in business.

Here are two questions which I highlighted:

One question is, “Is Connotate Web scraping?” Here’s the answer:

No. Web scrapers parse HTML code on web pages, searching for markers to identify which data elements to extract. While web scrapers may be alright for one-off extractions, they’re not sustainable when it comes to regularly extracting and monitoring large parcels of content. Also, web scrapers require programmers to create and maintain them. Connotate’s solutions rely on machine learning to reach and maintain optimal resiliency.

In my experience, Connotate offers a type of Web scraping.
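For reference, here is a minimal sketch of what the FAQ calls parsing HTML for markers. The markup and the “price” class marker are hypothetical; the point is that the extraction breaks the moment the site changes its HTML, which is the fragility Connotate says its machine learning avoids.

```python
# A minimal sketch of marker-based web scraping, the approach the FAQ
# contrasts with Connotate's method. The markup and the "price" class
# marker are hypothetical.
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect text inside tags carrying a known marker attribute."""
    def __init__(self, marker_class):
        super().__init__()
        self.marker_class = marker_class
        self.capturing = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # The "marker" is a class attribute found by inspecting the
        # page; it breaks if the site changes its HTML.
        if dict(attrs).get("class") == self.marker_class:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.values.append(data.strip())
            self.capturing = False

html = '<div><span class="price">$19.99</span><span class="sku">A-1</span></div>'
parser = PriceExtractor("price")
parser.feed(html)
print(parser.values)  # ['$19.99']
```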

The other question I noted was “Who uses Connotate’s solutions?” And the answer:

Connotate’s solutions can be harnessed by anyone who is looking to turn the vast scope of data on the Web into usable data. Chief Content Officers, Data Operations Managers and Product Managers are just a few examples of the kind of professionals who could readily benefit from Connotate’s solution.

I was hoping for some specific customer references. I know there are Connotate customers.

For further information about Connotate, the company offers a “complete online FAQ.”

Stephen E Arnold, October 23, 2015

Concept Searching SharePoint White Paper

October 22, 2015

I saw a reference to “2015 SharePoint and Office 365 State of the Market Survey White Paper.” If you are interested in things SharePoint and Office 365, you can (as of October 15, 2015) download the 40-page document at this Concept Searching link. A companion webinar is also available.

The most interesting portion of the white paper is its Appendix A. A number of buzzwords are presented as “Priorities by Application.” Note that the Appendix is graphical and presents the result of a “survey.” Goodness, SharePoint seems to have some holes in its digital fabric.

The data for enterprise search are interesting.

[Chart: enterprise search priorities by application, from Appendix A. Source: Concept Searching, 2015]

It appears that fewer than 20 percent of those included in the sample (the write up provides few details about the mechanics of this survey, the data for which were gathered via the Web) do not see enterprise search as a high priority issue. About 30 percent of the respondents perceive search as working as intended. An equal number, however, are beavering away to improve their enterprise search systems.

Unlike some enterprise search and content processing vendors, Concept Searching is squarely in the Microsoft camp. With third party vendors providing “solutions” for SharePoint and Office 365, I ask myself, “Why doesn’t Microsoft address the shortcomings third parties attack?”

Stephen E Arnold, October 22, 2015

Attensity: Discover Now

October 21, 2015

I read “Speedier Data Analysis Focus of Attensity’s DiscoverNow.” Attensity is one of the firms processing content for information signals. The company has undergone some management turnover. It has rolled out DiscoverNow, a product that runs from the cloud and features “built in integration with the Informatica cloud.” The write up reports:

According to the company, DiscoverNow connects to more than 150 internal and external text-based data sources, including popular enterprise apps and databases such as Salesforce.com, SAP, Oracle/Siebel, Box, Concur, Dropbox, Datasift, Eloqua, JIRA, MailChimp, Marketo, NetSuite, Hadoop, MySQL and Thomson Reuters. It combines insights from these internal data sources with external text sources such as Twitter, Facebook, Google+, YouTube, Reddit, forums and review sites, to offer a robust view of customer activities.

Attensity, the article asserts, is different and outperforms its competitors. Cary Fulbright, Attensity’s chief strategy officer, explains:

Attensity outperforms competing text analytics systems that rely more heavily on keywords. “We parse sentences by subject, noun and object, so we can identify the context used,” he said. “For example, DiscoverNow understands the difference between the Venetian Hotel, Venetian blinds and Venetian gondolas, or ‘uber cool’ and Uber ridesharing. Our team of linguists is constantly updating our generic and industry-specific libraries with new terms, including slang.”
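The disambiguation idea is straightforward to illustrate. The toy sketch below is not Attensity’s parser; it is a bag-of-context-words approach with hand-made, hypothetical sense dictionaries, just to show how surrounding words can resolve “Venetian.”

```python
# A toy sketch of context-based word sense disambiguation, the general
# idea behind the "Venetian Hotel vs. Venetian blinds" example above.
# Not Attensity's parser; the sense dictionaries are hypothetical.
SENSES = {
    "hotel": {"resort", "casino", "suite", "vegas", "stay", "room"},
    "window covering": {"blinds", "slats", "window", "cord", "dust"},
    "boat": {"gondola", "canal", "rower", "venice", "ride"},
}

def disambiguate(sentence):
    """Pick the sense whose cue words overlap the sentence most."""
    tokens = set(sentence.lower().replace(",", " ").split())
    scores = {sense: len(tokens & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(disambiguate("We booked a suite at the Venetian and hit the casino"))
# hotel
print(disambiguate("The Venetian blinds rattled in the open window"))
# window covering
```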

A number of companies offer text processing systems. Attensity is a mash up of several organizations. DiscoverNow may be the breakthrough product the company has been seeking. To date, according to Crunchbase, the company has ingested $90 million since 2000.

Stephen E Arnold, October 21, 2015

The Tweet Gross Domestic Product Tool

October 16, 2015

Twitter can be used to figure out your personal income. Twitter was not designed as a tool to tally a person’s financial wealth; it is a communication tool built on one hundred forty character messages meant for small, concise delivery. Twitter can be used to chat with friends, stars, business executives, and others; to follow news trends; and even to advertise products to a tailored audience. According to Red Orbit in the article “People Can Guess Your Income Based On Your Tweets,” Twitter has another application.

Other research on Twitter has revealed users’ ages, locations, political preferences, and disposition to insomnia; it turns out your tweet history also reveals your income. Apparently, if you tweet less, you make more money. The article discusses the controls and variables for the experiment: 5,191 Twitter accounts with more than ten million tweets were analyzed, and only accounts with a user’s identifiable profession were used.

Users with a high follower-to-following ratio had the most income, and they tended to post the least. Posting throughout the day and cursing indicated a user with a lower income. The content of tweets also displayed a plethora of “wealth” information:

“It isn’t just the topics of your tweets that’s giving you away either. Researchers found that “users with higher income post less emotional (positive and negative) but more neutral content, exhibiting more anger and fear, but less surprise, sadness and disgust.” It was also apparent that those who swore more frequently in their tweets had lower income.”

Twitter uses the information to tailor ads: users who share neutral posts get ads for expensive items, while the cursers get less expensive ad campaigns. The study also suggests that it is important to monitor your Twitter profile, so you present the best side of yourself rather than shooting yourself in the foot.
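For the curious, here is a minimal sketch of the kind of model such a study implies: extract the features named above (posting rate, cursing, follower ratio, neutral tone) and fit a regression to reported income. The accounts, features, and figures are entirely hypothetical.

```python
# A minimal sketch of predicting income from tweet-derived features.
# The feature set mirrors the signals named above; the training data
# and learned weights here are hypothetical.
import numpy as np

def features(account):
    """Turn one account's stats into a numeric feature vector."""
    return np.array([
        account["tweets_per_day"],          # heavy posting -> lower income
        account["curse_word_rate"],         # cursing -> lower income
        account["followers"] / max(account["following"], 1),
        account["neutral_post_fraction"],   # neutral tone -> higher income
        1.0,                                # bias term
    ])

# Hypothetical training set: (account stats, reported income).
accounts = [
    ({"tweets_per_day": 25, "curse_word_rate": 0.08, "followers": 300,
      "following": 900, "neutral_post_fraction": 0.2}, 24_000),
    ({"tweets_per_day": 2, "curse_word_rate": 0.00, "followers": 5_000,
      "following": 400, "neutral_post_fraction": 0.7}, 95_000),
    ({"tweets_per_day": 10, "curse_word_rate": 0.03, "followers": 800,
      "following": 700, "neutral_post_fraction": 0.4}, 48_000),
]

X = np.array([features(a) for a, _ in accounts])
y = np.array([income for _, income in accounts])
weights, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares

new_user = {"tweets_per_day": 4, "curse_word_rate": 0.01, "followers": 2_000,
            "following": 500, "neutral_post_fraction": 0.6}
print(f"Estimated income: ${features(new_user) @ weights:,.0f}")
```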

Whitney Grace, October 16, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph


Can Online Systems Discern Truth and Beauty or All That One Needs to Know?

October 14, 2015

Last week I fielded a question about online systems’ ability to discern loaded or untruthful statements in a plain text document. I responded that software is not yet very good at figuring out whether a specific statement is accurate, factual, right, or correct. Google pokes at the problem in a number of ways; for example, by assigning a credibility score to a known person. The higher the score, the more likely the person is to be “correct.” I am simplifying, but you get the idea: recycling a variant of PageRank and the CLEVER method associated with Jon Kleinberg.
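For readers who want the mechanics, here is a minimal power iteration sketch of the PageRank-style idea: a node’s score depends on the scores of whoever points at it. The four-node endorsement graph is hypothetical.

```python
# A minimal power-iteration sketch of PageRank-style scoring: score
# nodes (pages, or people) by the scores of the nodes that point at
# them. The four-node link graph here is hypothetical.
import numpy as np

links = {  # node -> nodes it links to (or endorses)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
nodes = sorted(links)
idx = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# Column-stochastic matrix: column j spreads j's score evenly across
# the nodes j links to.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[idx[dst], idx[src]] = 1.0 / len(outs)

damping = 0.85
scores = np.full(n, 1.0 / n)
for _ in range(50):  # power iteration converges quickly on tiny graphs
    scores = (1 - damping) / n + damping * M @ scores

for name in sorted(nodes, key=lambda v: scores[idx[v]], reverse=True):
    print(f"{name}: {scores[idx[name]]:.3f}")  # C, the most linked-to, ranks first
```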

There are other approaches as well, and some of them—dare I suggest, most of them—use word lists. The idea is pretty simple. Create a list of words which have positive or negative connotations. To get fancy, you can work a variation on the brute force Ask Jeeves method; that is, cook up answers or statements of fact “known” to be spot on. The idea is to match the input text with the information in these word lists. If you want to get fancy, call these lists and compilations “knowledgebases.” I prefer lists. Humans have to help create the lists. Humans have to maintain the lists. Get the lists wrong, and the scoring system will be off base.

There is quite a bit of academic chatter about ways to make software smart. A recent example is “Sentiment Diffusion of Public Opinions about Hot Events: Based on Complex Network.” In the conclusion to the paper, which includes lots of fancy math, I noticed that the researchers identified the foundation of their approach:

This paper studied the sentiment diffusion of online public opinions about hot events. We adopted the dictionary-based sentiment analysis approach to obtain the sentiment orientation of posts. Based on HowNet and semantic similarity, we calculated each post’s sentiment value and classified those posts into five types of sentiment orientations.

There you go. Word lists.

My point is that it is pretty easy to spot a hostile customer support letter. Just write a script that looks for words appearing on the “nasty list”; for example, consumer protection violation, fraud, sue, etc. There are other signals as well; for example, capital letters, exclamation points, underlined words, etc.
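Such a script might look like the minimal sketch below. The “nasty list,” the surface signals, and the scoring are made up; a production system would need the human-maintained dictionaries discussed above.

```python
# A minimal sketch of the script described above: flag a hostile
# customer-support message by matching against a "nasty list" plus a
# couple of surface signals. The word list and weights are made up.
import re

NASTY_LIST = {"fraud", "sue", "lawyer", "violation", "scam", "refund"}

def hostility_score(text):
    """Crude word-list scorer; one point per signal."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    score = len(words & NASTY_LIST)
    score += text.count("!")                      # exclamation points
    score += len(re.findall(r"\b[A-Z]{3,}\b", text))  # shouting in caps
    return score

letter = "This is FRAUD! Fix my refund or I will sue. UNACCEPTABLE!"
print(hostility_score(letter))  # 7: three nasty words, two !, two caps runs
```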

The point is that distorted, shaped, weaponized, and just plain bonkers information can be generated. This information can be gussied up in a news release, posted on a Facebook page, or sent out via Twitter before the outfit reinvents itself.

The researcher, the “real” journalist, or the hapless seventh grader writing a report will be none the wiser unless big time research is embraced. For now, what can be indexed is presented as if the information were spot on.

How do you feel about that? That’s a sentiment question, gentle reader.

Stephen E Arnold, October 14, 2015
