CyberOSINT banner

Inferences: Check Before You Assume the Outputs Are Accurate

November 23, 2015

Predictive software works really well as long as the software does not have to deal with horse races, the stock market, and the actions of single person and his closest pals.

Inferences from Backtest Results Are False Until Proven True” offers a useful reminder to those who want to depend on algorithms someone else set up. The notion is helpful when the data processed are unchecked, unfamiliar, or just assumed to be spot on.

The write up says:

the primary task of quantitative traders should be to prove specific backtest results worthless, rather than proving them useful.

What throws backtests off the track? The write up provides a useful list of reminders:

  1. Data-mining and data snooping bias
  2. Use of non tradable instruments
  3. Unrealistic accounting of frictional effects
  4. Use of the market close to enter positions instead of the more realistic open
  5. Use of dubious risk and money management methods
  6. Lack of effect on actual prices

The author is concerned about financial applications, but the advice may be helpful to those who just want to click a link, output a visualization, and assume the big spikes are really important to the decision you will influence in one hour.

One point I highlighted was:

Widely used strategies lose any edge they might have had in the past.

Degradation occurs just like the statistical drift in Bayesian based systems. Exciting if you make decisions on outputs known to be flawed. How is that automatic indexing, business intelligence, and predictive analytics systems working?

Stephen E Arnold, November 23, 2015

A Modest Dust Up between Big Data and Text Analytics

November 18, 2015

I wonder if you will become involved in this modest dust up between the Big Data folks and the text analytics adherents. I know that I will sit on the sidelines and watch the battle unfold. I may mostly alone on that fence for three reasons:

  • Some text analytics outfits are Big Data oriented. I would point modestly to Terbium Labs and Recorded Future. Both do the analytics thing and both use “text” in their processing. (I know that learning about these companies is not as much fun as reading about Facebook friends, but it is useful to keep up with cutting edge outfits in my opinion.)
  • Text analytics can produce Big Data. I know that sounds like a fish turned inside out. Trust me. It happens. Think about some wan government worker in the UK grinding through Twitter and Facebook posts. The text analytics output lots of data.
  • A faux dust up is mostly a marketing play. I enjoyed search and content processing vendor presentations which pitted features of one system versus another. This approach is not too popular because every system says it can do what every other system can do. The reality of the systems is, in most cases, not discernible to the casual failed webmaster now working as a “real” wizard.

Navigate to “Text Analytics Gurus Debunk 4 Big Data Myths.” You will learn that there are four myths which are debunked. Here are the myths:

  1. Big Data survey scores reign supreme. Hey, surveys are okay because outfits like Survey  Monkey and the crazy pop up technology from that outfit in Michigan are easy to implement. Correct? Not important. Usable data for marketing? Important.
  2. Bigger social media data analysis is better. The outfits able to process the real time streams from Facebook and Twitter have lots of resources. Most companies do not have these resources. Ergo: Statistics 101 reigns no matter what the marketers say.
  3. New data sources are the most valuable. The idea is that data which are valid, normalized, and available for processing trump bigness. No argument from me.
  4. Keep your eye on the ball by focusing on how customers view you. Right. The customer is king in marketing land. In reality, the customer is a code word for generating revenue. Neither Big Data nor text analytics produce enough revenue in my world view. Sounds great though.

Will Big Data respond to this slap down? Will text analytic gurus mount their steeds and take another run down Marketing Lane to the windmill set up as a tourist attraction in an Amsterdam suburb?

Nope. The real battle involves organic, sustainable revenues. Talk is easy. Closing deals is hard. This dust up is not a mixed martial arts pay per view show.

Stephen E Arnold, November 18, 2015

The Tweet Gross Domestic Product Tool

October 16, 2015

Twitter can be used to figure out your personal income.  Twitter was not designed to be a tool to tally a person’s financial wealth, instead it is a communication tool based on a one hundred forty character messages to generate for small, concise delivery.  Twitter can be used to chat with friends, stars, business executives, etc, follow news trends, and even advertise products by sent to a tailored audience.  According to Red Orbit in the article “People Can Guess Your Income Based On Your Tweets,” Twitter has another application.

Other research done on Twitter has revealed that your age, location, political preferences, and disposition to insomnia, but your tweet history also reveals your income.  Apparently, if you tweet less, you make more money.  The controls and variables for the experiment were discussed, including that 5,191 Twitter accounts with over ten million tweets were analyzed and accounts with a user’s identifiable profession were used.

Users with a high follower and following ratio had the most income and they tended to post the least.  Posting throughout the day and cursing indicated a user with a lower income.  The content of tweets also displayed a plethora of “wealth” information:

“It isn’t just the topics of your tweets that’s giving you away either. Researchers found that “users with higher income post less emotional (positive and negative) but more neutral content, exhibiting more anger and fear, but less surprise, sadness and disgust.” It was also apparent that those who swore more frequently in their tweets had lower income.”

Twitter uses the information to tailor ads for users, if you share neutral posts get targeted ads advertising expensive items, while the cursers get less expensive ad campaigns.  The study also proves that it is important to monitor your Twitter profile, so you are posting the best side of yourself rather than shooting yourself in the foot.

Whitney Grace, October 16, 2015
Sponsored by, publisher of the CyberOSINT monograph


Can Online Systems Discern Truth and Beauty or All That One Needs to Know?

October 14, 2015

Last week I fielded a question about online systems’ ability to discern loaded or untruthful statements in a plain text document. I responded that software is not yet very good at figuring out whether a specific statement is accurate, factual, right, or correct. Google pokes at the problem in a number of ways; for example, assigning a credibility score to a known person. The higher the score, the person may be more likely to be “correct.” I am simplifying, but you get the idea: Recycling a variant of Page Rank and the CLEVER method associated with Jon Kleinberg.

There are other approaches as well, and some of them—dare I suggest, most of them—use word lists. The idea is pretty simple. Create a list of words which have positive or negative connotations. To get fancy, you can work a variation on the brute force Ask Jeeves’ method; that is, cook up answers or statement of facts “known” to be spot on. The idea is to match the input text with the information in these word lists. If you want to get fancy, call these lists and compilations “knowledgebases.” I prefer lists. Humans have to help create the lists. Humans have to maintain the lists. Get the lists wrong, and the scoring system will be off base.

There is quite a bit of academic chatter about ways to make software smart. A recent example is “Sentiment Diffusion of Public Opinions about Hot Events: Based on Complex Network.” In the conclusion to the paper, which includes lots of fancy math, I noticed that the researchers identified the foundation of their approach:

This paper studied the sentiment diffusion of online public opinions about hot events. We adopted the dictionary-based sentiment analysis approach to obtain the sentiment orientation of posts. Based on HowNet and semantic similarity, we calculated each post’s sentiment value and classified those posts into five types of sentiment orientations.

There you go. Word lists.

My point is that it is pretty easy to spot a hostile customer support letter. Just write a script that looks for words appearing on the “nasty list”; for example, consumer protection violation, fraud, sue, etc. There are other signals as well; for example, capital letters, exclamation points, underlined words, etc.

The point is that distorted, shaped, weaponized, and just plain bonkers information can be generated. This information can be gussied up in a news release, posted on a Facebook page, or sent out via Twitter before the outfit reinvents itself.

The researcher, the “real” journalist, or the hapless seventh grader writing a report will be none the wiser unless big time research is embraced. For now, what can be indexed is presented as if the information were spot on.

How do you feel about that? That’s a sentiment question, gentle reader.

Stephen E Arnold, October 14, 2015

Full Text Search Gets Explained

October 6, 2015

Full text search is a one of the primary functions of most search platform.  If a search platform cannot get full text search right, then it is useless and should be tossed in the recycle bin.    Full text search is such a basic function these days that most people do not know how to explain what it is.  So what is full text?

According to the Xojo article, “Full Text Search With SQLite” provides a thorough definition:

“What is full text searching? It is a fast way to look for specific words in text columns of a database table. Without full text searching, you would typically search a text column using the LIKE command. For example, you might use this command to find all books that have “cat” in the description…But this select actually finds row that has the letters “cat” in it, even if it is in another word, such as “cater”. Also, using LIKE does not make use of any indexing on the table. The table has to be scanned row by row to see if it contains the value, which can be slow for large tables.”

After the definition, the article turns into advertising piece for SQLite and how it improves the quality of full text search.  It offers some more basic explanation, which are not understood by someone unless they have a coding background.   It is a very brief with some detailed information, but could explain more about what SQLite is and how it improves full text search.

Whitney Grace, October 6, 2015
Sponsored by, publisher of the CyberOSINT monograph

The Cricket Cognitive Analysis

September 4, 2015

While Americans scratch their heads at the sport cricket, it has a huge fanbase and not only that, there are mounds of data that can now be fully analyzed says First Post in the article, “The Intersection Of Analytics, Social Media, And Cricket In The Cognitive Era Of Computing.”

According to the article, cricket fans absorb every little bit of information about their favorite players and teams.  Technology advances have allowed the cricket players to improve their game with better equipment and ways to analyze their playing, in turn the fans have a deeper personal connection with the game as this information is released.  For the upcoming Cricket World Cup, Wisden India will provide all the data points for the game and feed them into IBM’s Analytics Engine to improve the game for spectators and the players.

Social media is a huge part of the cricket experience and the article details examples about how it platforms like Twitter are processed through sentimental analysis and IBM Text Analytics.

“What is most interesting to businesses however is that observing these campaigns help in understanding the consumer sentiment to drive sales initiatives. With right business insights in the nick of time, in line with social trends, several brands have come up with lucrative offers one can’t refuse. In earlier days, this kind of marketing required pumping in of a lot of money and waiting for several weeks before one could analyze and approve the commercial success of a business idea. With tools like IBM Analytics at hand, one can not only grab the data needed, assess it so it makes a business sense, but also anticipate the market response.”

While Cricket might be what the article concentrates on, imagine how data analytics are being applied to other popular sports such as American football, soccer, baseball, golf, and the variety of racing popular around the world.

Whitney Grace, September 4, 2015
Sponsored by, publisher of the CyberOSINT monograph

Suggestions for Developers to Improve Functionality for Search

September 2, 2015

The article on SiteCrafting titled Maxxcat Pro Tips lays out some guidelines for improved functionality when it comes deep search. Limiting your Crawls is the first suggestion. Since all links are not created equally, it is wise to avoid runaway crawls on links where there will always be a “Next” button. The article suggests hand-selecting the links you want to use. The second tip is Specify Your Snippets. The article explains,

“When MaxxCAT returns search results, each result comes with four pieces of information: url, title, meta, and snippet (a preview of some of the text found at the link). By default, MaxxCAT formulates a snippet by parsing the document, extracting content, and assembling a snippet out of that content. This works well for binary documents… but for webpages you wanted to trim out the content that is repeated on every page (e.g. navigation…) so search results are as accurate as possible.”

The third suggestion is to Implement Meta-Tag Filtering. Each suggestion is followed up with step-by-step instructions. These handy tips come from a partnering between Sitecrafting is a web design company founded in 1995 by Brian Forth. Maxxcat is a company acknowledged for its achievements in high performance search since 2007.

Chelsea Kerwin, September 2, 2015

Sponsored by, publisher of the CyberOSINT monograph

SAS Text Miner Promises Unstructured Insight

July 10, 2015

Big data is tools help organizations analyze more than their old, legacy data.  While legacy data does help an organization study how their process have changed, the data is old and does not reflect the immediate, real time trends.  SAS offers a product that bridges old data with the new as well as unstructured and structured data.

The SAS Text Miner is built from Teragram technology.  It features document theme discovery, a function the finds relations between document collections; automatic Boolean rule generation; high performance text mining that quickly evaluates large document collection; term profiling and trending, evaluates term relevance in a collection and how they are used; multiple language support; visual interrogation of results; easily import text; flexible entity options; and a user friendly interface.

The SAS Text Miner is specifically programmed to discover data relationships data, automate activities, and determine keywords and phrases.  The software uses predictive models to analysis data and discover new insights:

“Predictive models use situational knowledge to describe future scenarios. Yet important circumstances and events described in comment fields, notes, reports, inquiries, web commentaries, etc., aren’t captured in structured fields that can be analyzed easily. Now you can add insights gleaned from text-based sources to your predictive models for more powerful predictions.”

Text mining software reveals insights between old and new data, making it one of the basic components of big data.

Whitney Grace, July 10, 2015

Sponsored by, publisher of the CyberOSINT monograph

CSC Attracts Buyer And Fraud Penalties

July 1, 2015

According to the Reuters article “Exclusive: CACI, Booz Allen, Leidos Eyes CSC’s Government Unit-Sources,” CACI International, Leidos Holdings, and Booz Allen Hamilton Holdings

have expressed interest in Computer Sciences Corp’s public sector division.  There are not a lot of details about the possible transaction as it is still in the early stages, so everything is still hush-hush.

The possible acquisition came after the news that CSC will split into two divisions: one that serves US public sector clients and the other dedicated to global commercial and non-government clients.  CSC has an estimated $4.1 billion in revenues and worth $9.6 billion, but CACI International, Leidos Holdings, and Booz Allen Hamilton might reconsider the sale or getting the price lowered after hearing this news: “Computer Sciences (CSC) To Pay $190M Penalty; SEC Charges Company And Former Executives With Accounting Fraud” from Street Insider.  The Securities and Exchange Commission are charging CSC and former executives with a $190 million penalty for hiding financial information and problems resulting from the contract they had with their biggest client.  CSC and the executives, of course, are contesting the charges.

“The SEC alleges that CSC’s accounting and disclosure fraud began after the company learned it would lose money on the NHS contract because it was unable to meet certain deadlines. To avoid the large hit to its earnings that CSC was required to record, Sutcliffe allegedly added items to CSC’s accounting models that artificially increased its profits but had no basis in reality. CSC, with Laphen’s approval, then continued to avoid the financial impact of its delays by basing its models on contract amendments it was proposing to the NHS rather than the actual contract. In reality, NHS officials repeatedly rejected CSC’s requests that the NHS pay the company higher prices for less work. By basing its models on the flailing proposals, CSC artificially avoided recording significant reductions in its earnings in 2010 and 2011.”

Oh boy!  Is it a wise decision to buy a company that has a history of stealing money and hiding information?  If the company’s root products and services are decent, the buyers might get it for a cheap price and recondition the company.  Or it could lead to another disaster like HP and Autonomy.

Whitney Grace, July 1, 2015

Sponsored by, publisher of the CyberOSINT monograph

Search Cheerleader Seeks Text Analytics Unicorns

June 12, 2015

The article on Venture Beat whimsically titled Where Are the Text Analytics Unicorns provides yet another cheerleader for search. The article uses Aileen Lee’s “unicorn” concept of a company begun since 2003 and valued at over a billion dollars. (“Super unicorns” are companies valued at over a hundred billion dollars like Facebook.) The article asks why no text analytics companies have joined this exclusive club? Candidates include Clarabridge, NetBase and Medallia.

“In the end, the answer is a very basic one. Contrast the text analytics sector with unicorns that include Uber — Travis Kalanick’s company — and Airbnb, Evernote, Flipkart, Square, Pinterest, and their ilk. They play to mass markets — they’re a magic mix of revenue, data, platform, and pizazz — in ways that text analytics doesn’t. The tech companies on the unicorn list — Cloudera, MongoDB, Pivotal — provide or support essential infrastructure that covers a broad set of needs.”

Before coming to this conclusion, the article posits other possible reasons as well, such as the sheer number of companies competing in the field, or even competition from massive companies like IBM and Google. But these are dismissed for the more optimistic end note that essentially suggests we give the text analytics unicorns a year. Caution advised.

Chelsea Kerwin, June 12, 2015

Sponsored by, publisher of the CyberOSINT monograph


Next Page »