Big Data: A Shopsmith for Power Freaks?

February 4, 2016

I read an article that I dismissed. The title nagged at my ageing mind and dwindling intellect. “This is Why Dictators Love Big Data” did not ring my search, content processing, or Dark Web chimes.

Prodded by that inner voice, I returned to the story, still annoyed with the “This Is Why” phrase in the headline.


Predictive analytics are not new. The packaging is better.

I think this is the main point of the write up, but I am never sure with online articles. The articles can be ads or sponsored content. The authors could be looking for another job. The doubts about information today plague me.

The circled passage is:

Governments and government agencies can easily use the information every one of us makes public every day for social engineering — and even the cleverest among us is not totally immune.  Do you like cycling? Have children? A certain breed of dog? Volunteer for a particular cause? This information is public, and could be used to manipulate you into giving away more sensitive information.

The only hitch in the git along is that this is not just old news. The systems and methods for making decisions based on the munching of math in numerical recipes have been around for a while. Autonomy? A pioneer in the 1990s. Nope. Not even the super secret use of Bayesian, Markov, and related methods during World War II reaches back far enough. Nudge the ball hundreds of years farther back on the timeline. Not new in my opinion.

I also noted this comment:

In China, the government is rolling out a social credit score that aggregates not only a citizen’s financial worthiness, but also how patriotic he or she is, what they post on social media, and who they socialize with. If your “social credit” drops below a certain level because you post anti-government messages online or because you’re socially associated with other dissidents, you could be denied credit approval, financial opportunities, job promotions, and more.

Just China? I fear not, gentle reader. Once again the “real” journalists are taking an approach which does not do justice to the wide diffusion of certain mathy applications.

Net net: I should have skipped this write up. My initial judgment was correct. Not only is the headline annoying to me, the information is par for the Big Data course.

Stephen E Arnold, February 4, 2016

Palantir: Revenue Distribution

January 27, 2016

I came across a write up in a Chinese blog about Palantir. You can find the original text at this link. I have no idea if the information is accurate, but I had not seen this breakdown before:


The chart from “Touchweb” shows that in FY 2015 privately held Palantir derived 71 percent of its revenue from commercial clients.

The report then lists the lines of business which the company offers. Again this was information I had not previously seen:

  • Energy, disaster recovery, consumer goods, and card services

  • Retail, pharmaceuticals, media, and insurance
  • Audit, legal prosecution
  • Cyber security, banking
  • Healthcare research
  • Local law enforcement, finance
  • Counter terrorism, war fighting, special forces.

Because Palantir is privately held, there is no solid, audited data available to folks in Kentucky at this time.

Nevertheless, the important point is that the Palantir search and content processing platform has a hefty valuation, lots of venture financing, and what appears to be a diversified book of business.

Stephen E Arnold, January 27, 2016

Cheerleading for the SAS Text Exploration Framework

January 27, 2016

SAS is a stalwart in the number crunching world. I visualize the company’s executives chatting among themselves about the Big Data revolution, the text mining epoch, and the predictive analytics juggernaut.

Well, SAS is now tapping that staff interaction.

Navigate to “To Data Scientists and Beyond! One of Many Applications of Text Analytics.” There is an explanation of the ease of use of SAS. Okay, but my recollection was that I had to hire a PhD in statistics from Cornell University to chase down the code which was slowing our survivability analyses to a meander instead of a trot.

I learned:

One of the misconceptions I often see is the expectation that it takes a data scientist, or at least an advanced degree in analytics, to work with text analytics products. That is not the case. If you can type a search into a Google toolbar, you can get value from text analytics.

The write up contains a screenshot too. Where did the text analytics plumbing come from? Perchance an acquisition in 2008, like the canny purchase of Teragram’s late 1990s technology?

The write up focuses on law enforcement and intelligence applications of text analytics. I find that interesting because Palantir is allegedly deriving more than 60 percent of the firm’s revenue from commercial customers like JP Morgan and starting to get some traction in health care.

Check out the screenshot. That is worth 1,000 words. SAS has been working on the interface thing to some benefit.

Stephen E Arnold, January 27, 2016

Dark Web and Tor Investigative Tools Webinar

January 5, 2016

Telestrategies announced on January 4, 2016, a new webinar for active LEA and intel professionals. The one-hour program is focused on tactics, new products, and ongoing developments for Dark Web and Tor investigations. The program is designed to provide an overview of public, open source, and commercial systems and products. These systems may be used as standalone tools or integrated with IBM i2 ANB or Palantir Gotham. More information about the program is available from Telestrategies. There is no charge for the program. In 2016, Stephen E Arnold’s new Dark Web Notebook will be published. More information about the new monograph upon which the webinar is based may be obtained by writing benkent2020 at yahoo dot com.

Stephen E Arnold, January 5, 2016

Text Analytics Jargon: You Too Can Be an Expert

December 22, 2015

Want to earn extra money as a text analytics expert? Need to drop some cool terms like Latent Dirichlet Allocation at a holiday function? Navigate to “Text Analytics: 15 Terms You Should Know Surrounding ERP.” The article will make clear some essential terms. I am not sure the enterprise resource planning crowd will be up to speed on probabilistic latent semantic analysis, but the buzzword will definitely catch everyone’s attention. If you party in certain circles, you might end up with a consulting job at a mid tier services firm or, better yet, land several million in venture funding to dance with Dirichlet.

Stephen E Arnold, December 22, 2015

Palantir Profile: Search Plus Add Ons

November 25, 2015

Short honk: If you read French, you will learn quite a bit about Palantir, an interesting company with a $20 billion valuation. The write up is “Palantir et la France : naissance d’une nouvelle théorie abracadabrantesque ?” A listicle in the heart of the article provides a good run down of the system’s search and content processing capabilities. Yep, search. The difference between Palantir and outfits like Attivio, Coveo, Smartlogic, et al is the positioning, the bundle of technology, and – oh, did I mention the $20 billion valuation? I do like the abracadabra reference. Magic?

Stephen E Arnold, November 25, 2015

Inferences: Check Before You Assume the Outputs Are Accurate

November 23, 2015

Predictive software works really well as long as the software does not have to deal with horse races, the stock market, or the actions of a single person and his closest pals.

“Inferences from Backtest Results Are False Until Proven True” offers a useful reminder to those who want to depend on algorithms someone else set up. The notion is helpful when the data processed are unchecked, unfamiliar, or just assumed to be spot on.

The write up says:

the primary task of quantitative traders should be to prove specific backtest results worthless, rather than proving them useful.

What throws backtests off the track? The write up provides a useful list of reminders:

  1. Data-mining and data snooping bias
  2. Use of non tradable instruments
  3. Unrealistic accounting of frictional effects
  4. Use of the market close to enter positions instead of the more realistic open
  5. Use of dubious risk and money management methods
  6. Lack of effect on actual prices

The author is concerned about financial applications, but the advice may be helpful to those who just want to click a link, output a visualization, and assume the big spikes are really important to the decision you will influence in one hour.

One point I highlighted was:

Widely used strategies lose any edge they might have had in the past.

Degradation occurs just like the statistical drift in Bayesian based systems. Exciting if you make decisions on outputs known to be flawed. How are those automatic indexing, business intelligence, and predictive analytics systems working?

Stephen E Arnold, November 23, 2015

A Modest Dust Up between Big Data and Text Analytics

November 18, 2015

I wonder if you will become involved in this modest dust up between the Big Data folks and the text analytics adherents. I know that I will sit on the sidelines and watch the battle unfold. I may be mostly alone on that fence for three reasons:

  • Some text analytics outfits are Big Data oriented. I would point modestly to Terbium Labs and Recorded Future. Both do the analytics thing and both use “text” in their processing. (I know that learning about these companies is not as much fun as reading about Facebook friends, but it is useful to keep up with cutting edge outfits in my opinion.)
  • Text analytics can produce Big Data. I know that sounds like a fish turned inside out. Trust me. It happens. Think about some wan government worker in the UK grinding through Twitter and Facebook posts. The text analytics outputs lots of data.
  • A faux dust up is mostly a marketing play. I enjoyed search and content processing vendor presentations which pitted features of one system versus another. This approach is not too popular because every system says it can do what every other system can do. The reality of the systems is, in most cases, not discernible to the casual failed webmaster now working as a “real” wizard.

Navigate to “Text Analytics Gurus Debunk 4 Big Data Myths.” You will learn that four myths are debunked. Here they are:

  1. Big Data survey scores reign supreme. Hey, surveys are okay because outfits like SurveyMonkey and the crazy pop up technology from that outfit in Michigan are easy to implement. Correct? Not important. Usable data for marketing? Important.
  2. Bigger social media data analysis is better. The outfits able to process the real time streams from Facebook and Twitter have lots of resources. Most companies do not have these resources. Ergo: Statistics 101 reigns no matter what the marketers say.
  3. New data sources are the most valuable. The idea is that data which are valid, normalized, and available for processing trump bigness. No argument from me.
  4. Keep your eye on the ball by focusing on how customers view you. Right. The customer is king in marketing land. In reality, the customer is a code word for generating revenue. Neither Big Data nor text analytics produce enough revenue in my world view. Sounds great though.

Will Big Data respond to this slap down? Will text analytic gurus mount their steeds and take another run down Marketing Lane to the windmill set up as a tourist attraction in an Amsterdam suburb?

Nope. The real battle involves organic, sustainable revenues. Talk is easy. Closing deals is hard. This dust up is not a mixed martial arts pay per view show.

Stephen E Arnold, November 18, 2015

The Tweet Gross Domestic Product Tool

October 16, 2015

Twitter can be used to figure out your personal income.  Twitter was not designed to be a tool to tally a person’s financial wealth; it is a communication tool based on one hundred forty character messages meant for small, concise delivery.  Twitter can be used to chat with friends, stars, business executives, and the like; follow news trends; and even advertise products sent to a tailored audience.  According to Red Orbit in the article “People Can Guess Your Income Based On Your Tweets,” Twitter has another application.

Other research done on Twitter has revealed your age, location, political preferences, and disposition to insomnia; now it turns out your tweet history also reveals your income.  Apparently, if you tweet less, you make more money.  The controls and variables for the experiment were discussed, including that 5,191 Twitter accounts with over ten million tweets were analyzed and that only accounts with a user’s identifiable profession were used.

Users with a high follower-to-following ratio had the most income, and they tended to post the least.  Posting throughout the day and cursing indicated a user with a lower income.  The content of tweets also displayed a plethora of “wealth” information:

“It isn’t just the topics of your tweets that’s giving you away either. Researchers found that “users with higher income post less emotional (positive and negative) but more neutral content, exhibiting more anger and fear, but less surprise, sadness and disgust.” It was also apparent that those who swore more frequently in their tweets had lower income.”

Twitter uses the information to tailor ads for users: those who share neutral posts get targeted ads for expensive items, while the cursers get less expensive ad campaigns.  The study also shows that it is important to monitor your Twitter profile, so you are posting the best side of yourself rather than shooting yourself in the foot.

Whitney Grace, October 16, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph


Can Online Systems Discern Truth and Beauty or All That One Needs to Know?

October 14, 2015

Last week I fielded a question about online systems’ ability to discern loaded or untruthful statements in a plain text document. I responded that software is not yet very good at figuring out whether a specific statement is accurate, factual, right, or correct. Google pokes at the problem in a number of ways; for example, assigning a credibility score to a known person. The higher the score, the more likely the person is to be “correct.” I am simplifying, but you get the idea: Recycling a variant of Page Rank and the CLEVER method associated with Jon Kleinberg.

There are other approaches as well, and some of them—dare I suggest, most of them—use word lists. The idea is pretty simple. Create a list of words which have positive or negative connotations. To get fancy, you can work a variation on the brute force Ask Jeeves’ method; that is, cook up answers or statements of fact “known” to be spot on. The idea is to match the input text with the information in these word lists. If you want to get fancy, call these lists and compilations “knowledgebases.” I prefer lists. Humans have to help create the lists. Humans have to maintain the lists. Get the lists wrong, and the scoring system will be off base.

There is quite a bit of academic chatter about ways to make software smart. A recent example is “Sentiment Diffusion of Public Opinions about Hot Events: Based on Complex Network.” In the conclusion to the paper, which includes lots of fancy math, I noticed that the researchers identified the foundation of their approach:

This paper studied the sentiment diffusion of online public opinions about hot events. We adopted the dictionary-based sentiment analysis approach to obtain the sentiment orientation of posts. Based on HowNet and semantic similarity, we calculated each post’s sentiment value and classified those posts into five types of sentiment orientations.

There you go. Word lists.

My point is that it is pretty easy to spot a hostile customer support letter. Just write a script that looks for words appearing on the “nasty list”; for example, consumer protection violation, fraud, sue, etc. There are other signals as well; for example, capital letters, exclamation points, underlined words, etc.
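A minimal sketch of such a script follows. The word list, phrases, and thresholds are my own illustrative assumptions, not a production lexicon:

```python
import re

# Illustrative "nasty list" -- single words matched as whole tokens,
# multi-word phrases matched as substrings. Both lists are made up.
NASTY_TERMS = {"fraud", "sue", "lawsuit"}
NASTY_PHRASES = {"consumer protection violation"}

def looks_hostile(text):
    """Flag text that hits the nasty list or shows shouting signals."""
    lowered = text.lower()
    tokens = set(re.findall(r"[a-z']+", lowered))
    word_hit = bool(NASTY_TERMS & tokens)
    phrase_hit = any(p in lowered for p in NASTY_PHRASES)
    # Other signals mentioned above: all-caps words and exclamation points
    shouting = sum(1 for w in text.split() if len(w) > 2 and w.isupper())
    return word_hit or phrase_hit or shouting >= 2 or text.count("!") >= 3

print(looks_hostile("I will sue you over this FRAUD!!!"))  # True
print(looks_hostile("Thanks for the quick refund."))       # False
```

Note the fragility: the script is only as good as the list, which is exactly the maintenance problem flagged earlier. (Whole-token matching avoids false hits like “issue” containing “sue.”)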

The point is that distorted, shaped, weaponized, and just plain bonkers information can be generated. This information can be gussied up in a news release, posted on a Facebook page, or sent out via Twitter before the outfit reinvents itself.

The researcher, the “real” journalist, or the hapless seventh grader writing a report will be none the wiser unless big time research is embraced. For now, what can be indexed is presented as if the information were spot on.

How do you feel about that? That’s a sentiment question, gentle reader.

Stephen E Arnold, October 14, 2015
