Can Online Systems Discern Truth and Beauty or All That One Needs to Know?

October 14, 2015

Last week I fielded a question about online systems’ ability to discern loaded or untruthful statements in a plain text document. I responded that software is not yet very good at figuring out whether a specific statement is accurate, factual, right, or correct. Google pokes at the problem in a number of ways; for example, assigning a credibility score to a known person. The higher the score, the more likely the person is to be “correct.” I am simplifying, but you get the idea: recycling a variant of PageRank and the CLEVER method associated with Jon Kleinberg.

There are other approaches as well, and some of them (dare I suggest, most of them) use word lists. The idea is pretty simple. Create a list of words which have positive or negative connotations. To get fancy, you can work a variation on the brute force Ask Jeeves method; that is, cook up answers or statements of fact “known” to be spot on. The idea is to match the input text with the information in these word lists. If you want to get fancier still, call these lists and compilations “knowledgebases.” I prefer lists. Humans have to help create the lists. Humans have to maintain the lists. Get the lists wrong, and the scoring system will be off base.

There is quite a bit of academic chatter about ways to make software smart. A recent example is “Sentiment Diffusion of Public Opinions about Hot Events: Based on Complex Network.” In the conclusion to the paper, which includes lots of fancy math, I noticed that the researchers identified the foundation of their approach:

This paper studied the sentiment diffusion of online public opinions about hot events. We adopted the dictionary-based sentiment analysis approach to obtain the sentiment orientation of posts. Based on HowNet and semantic similarity, we calculated each post’s sentiment value and classified those posts into five types of sentiment orientations.

There you go. Word lists.
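To make the word list idea concrete, here is a minimal sketch of dictionary-based scoring that sorts posts into five sentiment orientations. It is not the paper's method: the researchers used HowNet and semantic similarity, while the tiny word list, the weights, and the cut-off values below are all made up for illustration.

```python
import re

# Hypothetical word list with hand-assigned weights. A real system would rely
# on a maintained lexicon; these entries are assumptions for illustration.
SENTIMENT_WORDS = {
    "excellent": 2, "good": 1, "happy": 1,
    "bad": -1, "angry": -1, "terrible": -2,
}


def sentiment_value(post: str) -> float:
    """Average the weights of listed words; words not on the list are ignored."""
    words = re.findall(r"[a-z']+", post.lower())
    hits = [SENTIMENT_WORDS[w] for w in words if w in SENTIMENT_WORDS]
    return sum(hits) / len(hits) if hits else 0.0


def orientation(value: float) -> str:
    """Map a score onto five labels. The cut-offs are arbitrary assumptions."""
    if value >= 1.5:
        return "strongly positive"
    if value > 0:
        return "positive"
    if value == 0:
        return "neutral"
    if value > -1.5:
        return "negative"
    return "strongly negative"


if __name__ == "__main__":
    for post in ["The service was excellent", "not good", "terrible"]:
        print(post, "->", orientation(sentiment_value(post)))
```

Note that “not good” comes out positive, because the list knows “good” and nothing about negation. Get the lists wrong, or leave them thin, and the scores drift accordingly.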

My point is that it is pretty easy to spot a hostile customer support letter. Just write a script that looks for words appearing on the “nasty list”; for example, consumer protection violation, fraud, sue, etc. There are other signals as well; for example, capital letters, exclamation points, underlined words, etc.
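A script of that sort might look like the sketch below. The three terms come from the “nasty list” above; the function names, the equal weighting of signals, and the threshold are my assumptions. Underlining cannot be seen in plain text, so that signal is left out.

```python
import re

# Terms taken from the post's "nasty list"; a real list would be much longer
# and would need a human to maintain it.
NASTY_TERMS = ["consumer protection violation", "fraud", "sue"]


def hostility_signals(letter: str) -> dict:
    """Collect the simple signals: listed terms, shouted words, exclamation points."""
    lowered = letter.lower()
    # Word boundaries keep "sue" from matching inside "issue" or "pursue".
    term_hits = [
        term for term in NASTY_TERMS
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]
    # Count all-capital words of two or more letters, so the pronoun "I" is skipped.
    words = re.findall(r"[A-Za-z]{2,}", letter)
    shouted = [w for w in words if w.isupper()]
    return {
        "nasty_terms": term_hits,
        "shouted_words": len(shouted),
        "exclamation_points": letter.count("!"),
    }


def looks_hostile(letter: str, threshold: int = 2) -> bool:
    """Flag a letter when enough signals fire. The threshold is an arbitrary assumption."""
    signals = hostility_signals(letter)
    score = len(signals["nasty_terms"])
    score += 1 if signals["shouted_words"] > 0 else 0
    score += 1 if signals["exclamation_points"] > 0 else 0
    return score >= threshold


if __name__ == "__main__":
    print(looks_hostile("This is FRAUD! I will sue."))
    print(looks_hostile("Thanks for resolving my issue."))
```

Spotting anger is the easy part. Nothing in a script like this says whether the customer's complaint is true.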

The point is that distorted, shaped, weaponized, and just plain bonkers information can be generated. This information can be gussied up in a news release, posted on a Facebook page, or sent out via Twitter before the outfit reinvents itself.

The researcher, the “real” journalist, or the hapless seventh grader writing a report will be none the wiser unless big time research is embraced. For now, what can be indexed is presented as if the information were spot on.

How do you feel about that? That’s a sentiment question, gentle reader.

Stephen E Arnold, October 14, 2015

Full Text Search Gets Explained

October 6, 2015

Full text search is one of the primary functions of most search platforms. If a search platform cannot get full text search right, then it is useless and should be tossed in the recycle bin. Full text search is such a basic function these days that most people do not know how to explain what it is. So what is full text search?

The Xojo article “Full Text Search With SQLite” provides a thorough definition:

“What is full text searching? It is a fast way to look for specific words in text columns of a database table. Without full text searching, you would typically search a text column using the LIKE command. For example, you might use this command to find all books that have “cat” in the description…But this select actually finds row that has the letters “cat” in it, even if it is in another word, such as “cater”. Also, using LIKE does not make use of any indexing on the table. The table has to be scanned row by row to see if it contains the value, which can be slow for large tables.”
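The difference between LIKE and full text matching is easy to see in a small sketch. The one below uses Python's built-in sqlite3 module rather than Xojo, and the table names and sample rows are invented for illustration. It assumes the local SQLite build includes the FTS5 or FTS4 module, which is common but not guaranteed.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database with a plain table and a full text index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, description TEXT)")
    rows = [
        (1, "A book about a cat and its owner"),
        (2, "How to cater a wedding"),
    ]
    conn.executemany("INSERT INTO books VALUES (?, ?)", rows)

    # Try FTS5 first, then fall back to FTS4 if this build lacks it.
    for module in ("fts5", "fts4"):
        try:
            conn.execute(
                f"CREATE VIRTUAL TABLE books_fts USING {module}(description)"
            )
            break
        except sqlite3.OperationalError:
            continue
    else:
        raise RuntimeError("this SQLite build has no full text search module")

    conn.executemany(
        "INSERT INTO books_fts (rowid, description) VALUES (?, ?)", rows
    )
    return conn


def like_search(conn: sqlite3.Connection, term: str) -> list:
    """Substring match: scans every row and also hits words that contain the term."""
    cur = conn.execute(
        "SELECT id FROM books WHERE description LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in cur]


def fts_search(conn: sqlite3.Connection, term: str) -> list:
    """Full text match: uses the index and matches whole words only."""
    cur = conn.execute(
        "SELECT rowid FROM books_fts WHERE books_fts MATCH ? ORDER BY rowid",
        (term,),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    conn = build_demo()
    print("LIKE:", like_search(conn, "cat"))
    print("FTS: ", fts_search(conn, "cat"))
```

With these two rows, the LIKE query returns both books because “cater” contains “cat,” while the full text query returns only the book about the cat.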

After the definition, the article turns into an advertising piece for SQLite and how it improves the quality of full text search. It offers some more basic explanations, which will not make sense to someone without a coding background. It is very brief, with some detailed information, but it could explain more about what SQLite is and how it improves full text search.

Whitney Grace, October 6, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

The Cricket Cognitive Analysis

September 4, 2015

While Americans scratch their heads at the sport of cricket, it has a huge fanbase. Not only that, there are mounds of data that can now be fully analyzed, says First Post in the article “The Intersection Of Analytics, Social Media, And Cricket In The Cognitive Era Of Computing.”

According to the article, cricket fans absorb every little bit of information about their favorite players and teams. Technology advances have allowed cricket players to improve their game with better equipment and ways to analyze their play. In turn, fans gain a deeper personal connection with the game as this information is released. For the upcoming Cricket World Cup, Wisden India will provide all the data points for the game and feed them into IBM’s Analytics Engine to improve the game for spectators and the players.

Social media is a huge part of the cricket experience, and the article details examples of how content from platforms like Twitter is processed through sentiment analysis and IBM Text Analytics.

“What is most interesting to businesses however is that observing these campaigns help in understanding the consumer sentiment to drive sales initiatives. With right business insights in the nick of time, in line with social trends, several brands have come up with lucrative offers one can’t refuse. In earlier days, this kind of marketing required pumping in of a lot of money and waiting for several weeks before one could analyze and approve the commercial success of a business idea. With tools like IBM Analytics at hand, one can not only grab the data needed, assess it so it makes a business sense, but also anticipate the market response.”

While cricket might be what the article concentrates on, imagine how data analytics are being applied to other popular sports such as American football, soccer, baseball, golf, and the variety of racing popular around the world.

Whitney Grace, September 4, 2015

Suggestions for Developers to Improve Functionality for Search

September 2, 2015

The article on SiteCrafting titled Maxxcat Pro Tips lays out some guidelines for improved functionality when it comes to deep search. Limiting your Crawls is the first suggestion. Since not all links are created equal, it is wise to avoid runaway crawls on links where there will always be a “Next” button. The article suggests hand-selecting the links you want to use. The second tip is Specify Your Snippets. The article explains,

“When MaxxCAT returns search results, each result comes with four pieces of information: url, title, meta, and snippet (a preview of some of the text found at the link). By default, MaxxCAT formulates a snippet by parsing the document, extracting content, and assembling a snippet out of that content. This works well for binary documents… but for webpages you wanted to trim out the content that is repeated on every page (e.g. navigation…) so search results are as accurate as possible.”
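The snippet tip can be illustrated with a short sketch that drops repeated page furniture before building a preview. This is not MaxxCAT's configuration or API, which the article covers in its own step-by-step instructions. The class name, the list of skipped tags, and the length limit are assumptions, and the sketch uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Elements assumed to hold content repeated on every page, plus non-visible
# text. The exact list is an assumption and would vary by site.
SKIPPED_TAGS = {"nav", "header", "footer", "script", "style"}


class SnippetExtractor(HTMLParser):
    """Collect visible text while ignoring anything inside the skipped elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def make_snippet(html: str, max_chars: int = 160) -> str:
    """Return the first max_chars characters of the page's non-repeated text."""
    parser = SnippetExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()[:max_chars]


if __name__ == "__main__":
    page = (
        "<html><body><nav><a href='/'>Home</a> <a href='/next'>Next</a></nav>"
        "<p>Search appliances index documents quickly.</p>"
        "<footer>Copyright notice</footer></body></html>"
    )
    print(make_snippet(page))
```

The navigation links and the footer never reach the snippet, so the preview shows text that is specific to the page.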

The third suggestion is to Implement Meta-Tag Filtering. Each suggestion is followed up with step-by-step instructions. These handy tips come from a partnership between SiteCrafting and MaxxCAT. SiteCrafting is a web design company founded in 1995 by Brian Forth. MaxxCAT is a company acknowledged for its achievements in high performance search since 2007.

Chelsea Kerwin, September 2, 2015


SAS Text Miner Promises Unstructured Insight

July 10, 2015

Big data tools help organizations analyze more than their old, legacy data. While legacy data does help an organization study how its processes have changed, the data is old and does not reflect the immediate, real time trends. SAS offers a product that bridges old data with the new as well as unstructured and structured data.

The SAS Text Miner is built from Teragram technology. It features document theme discovery, a function that finds relations between document collections; automatic Boolean rule generation; high performance text mining that quickly evaluates large document collections; term profiling and trending, which evaluates the relevance of terms in a collection and how they are used; multiple language support; visual interrogation of results; easy text import; flexible entity options; and a user friendly interface.

The SAS Text Miner is specifically programmed to discover data relationships, automate activities, and determine keywords and phrases. The software uses predictive models to analyze data and discover new insights:

“Predictive models use situational knowledge to describe future scenarios. Yet important circumstances and events described in comment fields, notes, reports, inquiries, web commentaries, etc., aren’t captured in structured fields that can be analyzed easily. Now you can add insights gleaned from text-based sources to your predictive models for more powerful predictions.”

Text mining software reveals insights between old and new data, making it one of the basic components of big data.

Whitney Grace, July 10, 2015


CSC Attracts Buyer And Fraud Penalties

July 1, 2015

According to the Reuters article “Exclusive: CACI, Booz Allen, Leidos Eyes CSC’s Government Unit-Sources,” CACI International, Leidos Holdings, and Booz Allen Hamilton Holdings have expressed interest in Computer Sciences Corp’s public sector division. There are not a lot of details about the possible transaction as it is still in the early stages, so everything is still hush-hush.

The possible acquisition came after the news that CSC will split into two divisions: one that serves US public sector clients and the other dedicated to global commercial and non-government clients. CSC has an estimated $4.1 billion in revenues and is worth $9.6 billion, but CACI International, Leidos Holdings, and Booz Allen Hamilton might reconsider the purchase or push for a lower price after hearing this news: “Computer Sciences (CSC) To Pay $190M Penalty; SEC Charges Company And Former Executives With Accounting Fraud” from Street Insider. The Securities and Exchange Commission is charging CSC and former executives with accounting fraud for hiding financial information and problems resulting from the contract the company had with its biggest client; the penalty is $190 million. CSC and the executives, of course, are contesting the charges.

“The SEC alleges that CSC’s accounting and disclosure fraud began after the company learned it would lose money on the NHS contract because it was unable to meet certain deadlines. To avoid the large hit to its earnings that CSC was required to record, Sutcliffe allegedly added items to CSC’s accounting models that artificially increased its profits but had no basis in reality. CSC, with Laphen’s approval, then continued to avoid the financial impact of its delays by basing its models on contract amendments it was proposing to the NHS rather than the actual contract. In reality, NHS officials repeatedly rejected CSC’s requests that the NHS pay the company higher prices for less work. By basing its models on the flailing proposals, CSC artificially avoided recording significant reductions in its earnings in 2010 and 2011.”

Oh boy!  Is it a wise decision to buy a company that has a history of stealing money and hiding information?  If the company’s root products and services are decent, the buyers might get it for a cheap price and recondition the company.  Or it could lead to another disaster like HP and Autonomy.

Whitney Grace, July 1, 2015


Search Cheerleader Seeks Text Analytics Unicorns

June 12, 2015

The article on Venture Beat whimsically titled “Where Are the Text Analytics Unicorns” provides yet another cheerleader for search. The article uses Aileen Lee’s “unicorn” concept of a company begun since 2003 and valued at over a billion dollars. (“Super unicorns” are companies valued at over a hundred billion dollars like Facebook.) The article asks why no text analytics companies have joined this exclusive club. Candidates include Clarabridge, NetBase and Medallia.

“In the end, the answer is a very basic one. Contrast the text analytics sector with unicorns that include Uber — Travis Kalanick’s company — and Airbnb, Evernote, Flipkart, Square, Pinterest, and their ilk. They play to mass markets — they’re a magic mix of revenue, data, platform, and pizazz — in ways that text analytics doesn’t. The tech companies on the unicorn list — Cloudera, MongoDB, Pivotal — provide or support essential infrastructure that covers a broad set of needs.”

Before coming to this conclusion, the article posits other possible reasons as well, such as the sheer number of companies competing in the field, or even competition from massive companies like IBM and Google. But these are dismissed for the more optimistic end note that essentially suggests we give the text analytics unicorns a year. Caution advised.

Chelsea Kerwin, June 12, 2015

Lexalytics: GUI and Wizard

June 12, 2015

What is one way to improve a user’s software navigational experience? One of the best ways is to add a graphical user interface (GUI). Software Development @ IT Business Net shares a press release about “Lexalytics Unveils Industry’s First Wizard For Text Mining And Sentiment Analysis.” Lexalytics is one of the leading companies that provide sentiment and analytics solutions and, as the article’s title explains, it has made an industry first by releasing a GUI and wizard for the Semantria SaaS platform and Excel plug-in. The wizard and GUI (SWIZ) are part of the Semantria Online Configurator, SWEB 1.3, which also included functionality updates and layout changes.

“ ‘In order to get the most value out of text and sentiment analysis technologies, customers need to be able to tune the service to match their content and business needs,’ said Jeff Catlin, CEO, Lexalytics. ‘Just like Apple changed the game for consumers with its first Macintosh in 1984, making personal computing easy and fun through an innovative GUI, we want to improve the job of data analysts by making it just as fun, easy and intuitive with SWIZ.’”

Lexalytics is dedicated to helping its clients enjoy an easier experience when it comes to data analytics. The company wants its clients to get the answers they need by providing the tools to get them without having to overthink the retrieval process. While Lexalytics already provides robust and flexible solutions, the SWIZ release continues to prove it has the most tunable and configurable text mining technology.

Whitney Grace, June 12, 2015


Sentiment Analysis: The Progeny of Big Data?

June 9, 2015

I read “Text Analytics: The Next Generation of Big Data.” The article provides a straightforward explanation of Big Data, embraces unstructured information like blog posts in various languages, email, and similar types of content, and then leaps to the notion of text analytics. The conclusion to the article is that we are experiencing “The Coming of Age of Text Analytics—The Next Generation of Big Data.”

The idea is good news for the vendors of text analytics aimed squarely at commercial enterprises, advertisers, and marketers. I am not sure the future will match up to the needs of the folks at the law enforcement and intelligence conference I had just left.

There are three reasons:

First, text analytics are not new, and the various systems and methods have been in use for decades. One notable example is BAE Systems’ use of its home brew tools and Autonomy’s technology in the 1990s, and i2 (pre IBM) and its efforts even earlier.

Second, the challenges of figuring out what structured and unstructured data mean require more than determining if a statement is positive or negative. Text analytics is, based on my experience, blind to such useful data as real time geospatial inputs and video streamed from mobile devices and surveillance devices. Text analytics, like key word search, makes a contribution, but it is in a supporting role, not the Beyoncé of content processing.

Third, the future points to the use of technologies like predictive analytics. Text analytics are components in these more robust systems whose outputs are designed to provide probability-based outputs from a range of input sources.

There was considerable consternation a year or so ago. I spoke with a team involved with text analytics at a major telecommunications company. The grousing was that the outputs of the system did not make sense and it was difficult for those reviewing the outputs to figure out what the data meant.

At the LE/intel conference, the focus was on systems which provide actionable information in real time. My point is that vendors have a tendency to see the solutions in terms of what is often a limited or supporting technology.

Sentiment analysis is a good example. Blog posts exhorting readers to join ISIS are positive to some and negative to others. The point is that the point of view of the reader determines whether a message is positive or negative.

The only way to move beyond this type of superficial and often misleading analysis is to deal with context, audio, video, intercept data, geolocation data, and other types of content. Text analytics is one component in a larger system, not the solution to the types of problems explored at the LE/intel conference in early June 2015. Marketing often clouds reality. In some businesses, no one knows that the outputs are not helpful. In other endeavors, the outputs have far higher import. Scoring the sentiment of a recruiting video with a moving nasheed underscoring the good guys dispatching the bad guys is off kilter. Is it important to know that the video is happy or sad? In fact, it is silly to approach the content in this manner.

Stephen E Arnold, June 9, 2015

Free Book from OpenText on Business in the Digital Age

May 27, 2015

This is interesting. OpenText advertises their free, downloadable book in a post titled, “Transform Your Business for a Digital-First World.” Our question is whether OpenText can transform their own business; it seems their financial results have been flat and generally drifting down of late. I suppose this is a do-as-we-say-not-as-we-do situation.

The book may be worth looking into, though, especially since it passes along words of wisdom from leaders within multiple organizations. The description states:

“Digital technology is changing the rules of business with the promise of increased opportunity and innovation. The very nature of business is more fluid, social, global, accelerated, risky, and competitive. By 2020, profitable organizations will use digital channels to discover new customers, enter new markets and tap new streams of revenue. Those that don’t make the shift could fall to the wayside. In Digital: Disrupt or Die, a multi-year blueprint for success in 2020, OpenText CEO Mark Barrenechea and Chairman of the Board Tom Jenkins explore the relationship between products, services and Enterprise Information Management (EIM).”

Launched in 1991, OpenText offers tools for enterprise information management, business process management, and customer experience management. Based in Waterloo, Ontario, the company maintains offices around the world.

Cynthia Murrell, May 27, 2015

