Cheerleading for the SAS Text Exploration Framework

January 27, 2016

SAS is a stalwart in the number crunching world. I visualize the company’s executives chatting among themselves about the Big Data revolution, the text mining epoch, and the predictive analytics juggernaut.

Well, SAS is now tapping that staff interaction.

Navigate to “To Data Scientists and Beyond! One of Many Applications of Text Analytics.” There is an explanation of the ease of use of SAS. Okay, but my recollection was that I had to hire a PhD in statistics from Cornell University to chase down the code that caused our survivability analyses to meander instead of trot.

I learned:

One of the misconceptions I often see is the expectation that it takes a data scientist, or at least an advanced degree in analytics, to work with text analytics products. That is not the case. If you can type a search into a Google toolbar, you can get value from text analytics.

The write up contains a screenshot too. Where did the text analytics plumbing come from? Perchance an acquisition in 2008, like the canny purchase of Teragram and its late 1990s technology?

The write up focuses on law enforcement and intelligence applications of text analytics. I find that interesting because Palantir is allegedly deriving more than 60 percent of the firm’s revenue from commercial customers like JP Morgan and starting to get some traction in health care.

Check out the screenshot. That is worth 1,000 words. SAS has been working on the interface thing to some benefit.

Stephen E Arnold, January 27, 2016

Big Data Blending Solution

January 20, 2016

I would have used Palantir or maybe our own tools. But an outfit named National Instruments found a different way to perform data blending. “How This Instrument Firm Tackled Big Data Blending” provides a case study and a rah rah for Alteryx. Here’s the paragraph I highlighted:

The software it [National Instruments] selected, from Alteryx, takes a somewhat unique approach in that it provides a visual representation of the data transformation process. Users can acquire, transform, and blend multiple data sources essentially by dragging and dropping icons on a screen. This GUI approach is beneficial to NI employees who aren’t proficient at manipulating data using something like SQL.

The graphical approach has been part of a number of tools. There are also some systems which just figure out where to put what.
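
If you want to see what “blending” amounts to without the icons, here is a minimal pandas sketch. The file names and columns are invented for illustration; Alteryx wraps this sort of work in drag-and-drop widgets rather than code.

```python
# A minimal data blending sketch in pandas. The files and column
# names are hypothetical; this is a generic recipe, not Alteryx's.
import pandas as pd

# Acquire: two structured sources sharing a key.
orders = pd.read_csv("orders.csv")        # columns: customer_id, amount
customers = pd.read_csv("customers.csv")  # columns: customer_id, region

# Transform: normalize the join key so the sources line up.
orders["customer_id"] = orders["customer_id"].astype(str).str.strip()
customers["customer_id"] = customers["customer_id"].astype(str).str.strip()

# Blend: join the sources, then aggregate.
blended = orders.merge(customers, on="customer_id", how="left")
print(blended.groupby("region")["amount"].sum())
```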

The issue for me is, “What happens to rich media like imagery and unstructured information like email?”

There are systems which handle these types of content.

Another challenge is the dependence on structured relational data tables. Certain types of operations are difficult in this environment.

The write up is interesting, but it reveals that a narrow view of available tools may produce a partial solution.

Stephen E Arnold, January 20, 2016

Dark Web and Tor Investigative Tools Webinar

January 5, 2016

Telestrategies announced on January 4, 2016, a new webinar for active LEA and intel professionals. The one-hour program is focused on tactics, new products, and ongoing developments for Dark Web and Tor investigations. The program is designed to provide an overview of public, open source, and commercial systems and products. These systems may be used as standalone tools or integrated with IBM i2 ANB or Palantir Gotham. More information about the program is available from Telestrategies. There is no charge for the program. In 2016, Stephen E Arnold’s new Dark Web Notebook will be published. More information about the new monograph upon which the webinar is based may be obtained by writing benkent2020 at yahoo dot com.

Stephen E Arnold, January 5, 2016

IBM Generates Text Mining Work Flow Diagram

January 4, 2016

I read “Deriving Insight Text Mining and Machine Learning.” This is an article with a specific IBM Web address. The diagram is interesting because it does not explain which steps are automated, which require humans, and which are one of those expensive man-machine processes. When I read about any text related function available from IBM, I think about Watson. You know, IBM’s smart software.

Here’s the diagram:

[Diagram: IBM’s text mining work flow]

If you find this hard to read, you are not in step with modern design elements. Millennials, I presume, love these faded colors.

Here’s the passage I noted about the important step of “attribute selection.” I interpret attribute selection to mean indexing, entity extraction, and related operations. Because neither human subject matter specialists nor smart software performs this function particularly well, I highlighted the passage in red ink in recognition of IBM’s 14 consecutive quarters of financial underperformance:

Machine learning is closely related to and often overlaps with computational statistics—a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. It is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible. Example applications include spam filtering, optical character recognition (OCR), search engines and computer vision. Text mining takes advantage of machine learning specifically in determining features, reducing dimensionality and removing irrelevant attributes. For example, text mining uses machine learning on sentiment analysis, which is widely applied to reviews and social media for a variety of applications ranging from marketing to customer service. It aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation, affective state or the intended emotional communication. Machine learning algorithms in text mining include decision tree learning, association rule learning, artificial neural learning, inductive logic programming, support vector machines, Bayesian networks, genetic algorithms and sparse dictionary learning.
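
The passage does not reveal which methods IBM actually uses. For reference only, here is a minimal scikit-learn sketch of one garden variety approach to attribute selection: TF-IDF features pruned with a chi-square test. The four-document corpus and its labels are invented.

```python
# A hedged sketch of "attribute selection" in text mining: extract
# TF-IDF features, then keep only the terms most associated with the
# labels. A generic scikit-learn recipe, not IBM's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "the printer jams and support is slow",
    "great service and a fast refund",
    "slow shipping and a damaged box",
    "fast delivery and helpful support",
]
labels = [0, 1, 0, 1]  # invented sentiment labels

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Attribute selection: keep the four terms with the top chi-square scores.
selector = SelectKBest(chi2, k=4)
selector.fit(X, labels)
terms = vectorizer.get_feature_names_out()
print([terms[i] for i in selector.get_support(indices=True)])
```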

Interesting, but how does this IBM stuff actually work? Who uses it? What’s the payoff from these use cases?

More questions than answers to explain the hard to read diagram, which looks quite a bit like a 1998 Autonomy graphic. I recall being able to read the Autonomy image, however.

Stephen E Arnold, January 4, 2016

Text Analytics Jargon: You Too Can Be an Expert

December 22, 2015

Want to earn extra money as a text analytics expert? Need to drop some cool terms like Latent Dirichlet Allocation at a holiday function? Navigate to “Text Analytics: 15 Terms You Should Know Surrounding ERP.” The article will make clear some essential terms. I am not sure the enterprise resource planning crowd will be up to speed on probabilistic latent semantic analysis, but the buzzword will definitely catch everyone’s attention. If you party in certain circles, you might end up with a consulting job at a mid tier services firm or, better yet, land several million in venture funding to dance with Dirichlet.
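
For those who want to do more than drop the name at the party, here is a toy scikit-learn sketch of Latent Dirichlet Allocation. The four-document corpus is invented and far too small for real topic modeling; it only shows what the buzzword refers to.

```python
# Toy Latent Dirichlet Allocation demo: fit two "topics" on a tiny,
# invented corpus. Real topic modeling needs far more documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "invoice payment ledger audit",
    "payment audit tax ledger",
    "server network outage latency",
    "network latency server firewall",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top three words per discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = topic.argsort()[-3:][::-1]
    print(f"topic {idx}:", [terms[i] for i in top])
```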

Stephen E Arnold, December 22, 2015

Search Vendors Under Pressure: Welcome to 2016

December 21, 2015

I read “Silicon Valley’s Cash Party Is Coming to an End.” What took so long? I suppose reality is less fun than fantasy. Why watch a science documentary when one can get lost in Netflix binging?

The write up reports:

Based on interviews with about two dozen venture capitalists and tech investors, 2016 is shaping up to be a year of reckoning for scores of technology start-ups that have yet to prove out their business models and equally challenging for those that raised money at unjustifiably high prices.

Forget the unicorns. There are some enterprise search outfits which have ingested millions of dollars and have convinced investors that big revenue or an HP-Autonomy scale buy out is just around the corner and that proprietary technology or consulting plus open source will produce gushers of organic revenue. Other vendors have tapped their moms, their nest eggs, and angels who believe in fairies.

I am not sure there is a General Leia Organa to fight Star Wars: The Revenue Battle for most vendors of search and content processing. Bummer. Despite the lack of media coverage for search and content processing vendors, the number of companies pitching information access is hefty. I track about 200 outfits, but many of these are unknown either because they do not want to be visible or because they lack any substantive “newsy” magnetism.

My hunch is that 2016 may be different from the free money era the article suggests is ending. In 2016, my view is that many vendors will find themselves in a modest tussle with their stakeholders. I worked through some of the search and content processing companies taking cash from folks with deep pockets often filled with other people’s money. (Note that investment totals come from Crunchbase.) Here’s a list of search and content processing vendors who may face stakeholder and investor pressure. The more money ingested, the greater the interest investors may have in getting a return:

  • Antidot, $3 million
  • Attensity, $90 million
  • Attivio, $71 million
  • BA Insight, $14 million
  • Connotate, $12 million
  • Coveo, $69 million
  • Digital Reasoning, $28 million
  • Elastic (formerly Elasticsearch), $104 million
  • Lucidworks, $53 million
  • MarkLogic, $175 million
  • Perfect Search, $4 million
  • Palantir, $1.7 billion
  • Recommind, $22 million
  • Sinequa, $5 million
  • Sophia Ambiance, $5 million
  • X1, $12 million.

Then there are the search systems which have been acquired. One assumes these deals will have to produce sustainable revenues in some form:

  • Hewlett Packard with Autonomy
  • IBM with Vivisimo
  • Dassault Systèmes with Exalead
  • Lexmark with Brainware and ISYS Search
  • Microsoft with Fast Search
  • OpenText with BASIS, BRS, Fulcrum, and Nstein
  • Oracle with Endeca, InQuira, and RightNow
  • Thomson Reuters with Solcara

Are there sufficient prospects to generate deals large enough to keep these outfits afloat?

There are also search and content processing vendors competing for sales against free and open source options and against vendors with proprietary software:

  • Ami Albert
  • Content Analyst
  • Concept Searching
  • dtSearch
  • EasyAsk
  • Exorbyte
  • Fabasoft Mindbreeze
  • Funnelback
  • IHS Goldfire
  • SLI Systems
  • Smartlogic
  • Sprylogics
  • SurfRay
  • Thunderstone
  • WCC Elise
  • Zaizi

These search vendors plus many smaller outfits like Intrafind and Srch2 have to find a way to close deals to avoid the fate of Arikus, Convera, Delphes, Dieselpoint, Entopia, Hakia, Kartoo, NuTech Search, and Siderean Software, among others.

Despite the lack of coverage from mid tier consultants and the “real” journalists, the information access sector is moving along. In fact, when one looks at the software options, search and content processing vendors are easily found.

The problem for 2016 will be making sales, generating sustainable revenues, and paying back stakeholders. For many of these companies, the new year will be one which sees a number of outfits going dark. A few will thrive.

Darned exciting times in findability.

Stephen E Arnold, December 21, 2015

Smart Software Sort of Snares Sarcasm

December 18, 2015

I read “Scientists Devise Algorithm That Detects Sarcasm Better Than Humans.” My first reaction was, “How well do humans detect sarcasm?” In my experience, literalism is expected. Sarcasm and its kissing cousins cynicism and humor are surprises.

I read:

In at least one study, by UC Berkeley’s David Bamman and the University of Washington’s Noah A. Smith, computers showed an accuracy rate of 75 percent—notably better than the humans in the 2005 study.

There you go. (Not sarcasm) Smart software can detect a statement designed to deliver a payload which has the surprise thing going for it. (Sarcasm)

The write up asserted:

Bamman (smart software champion) says sentiment analysis can be useful, for instance, when conducting an analysis of reviews on Amazon, to determine whether the reviewer actually liked a product. “One thing that can really interfere with that,” he says, “is whether or not the person is being sarcastic.” Accurate sentiment analysis can also be valuable to national security. In 2014, the Secret Service posted a work order requesting analytics software that can detect sarcasm on social media—the idea being that the ability to identify sarcasm would help them discern jokes from actual emergencies.
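
The article does not describe the Bamman-Smith model. For flavor only, here is a bare bones sketch of the generic supervised approach: a bag of words classifier trained on invented sarcasm labels. The published research used far richer features than this toy.

```python
# Bare-bones sarcasm classifier sketch: TF-IDF plus logistic regression.
# The training examples and labels are invented; this illustrates the
# generic technique, not the Bamman-Smith study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "oh great, another update that breaks everything",
    "wow, waiting on hold for an hour, what a treat",
    "the update installed quickly and works well",
    "support answered in five minutes and fixed it",
]
labels = [1, 1, 0, 0]  # 1 = sarcastic, 0 = literal (invented)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["what a treat, the printer is jammed again"]))
```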

Okay. (Sarcasm). More of the good enough approach to understanding text. Hey, maybe the system is better than a word list? (Sarcasm)

Human language is a slippery fish. The researchers are trying to create a net to snag the elusive creatures like “Hell is empty and all the devils are here.” (Sarcasm)

Stephen E Arnold, December 18, 2015

Textio for Text Analysis

December 17, 2015

I read “Textio, A Startup That Analyzes Text Performance, Raises $8M.” The write up reported:

Textio recognizes more than 60,000 phrases with its predictive technology, Snyder [Textio’s CEO] said, and that data set is changing constantly as it continues to operate. It looks at how words are put together — such as how verb dense a phrase is — and at other syntax-related properties the document may have. All that put together results in a score for the document, based on how likely it is to succeed in whatever the writer set out to do.
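
Textio’s model is proprietary, so here is only a toy illustration of one feature the write up mentions, verb density, computed with NLTK. The sample sentence is invented, and NLTK resource names vary by version.

```python
# Toy "verb density" feature: the fraction of tokens tagged as verbs.
# An illustration of one syntax feature mentioned in the write up,
# not Textio's scoring model.
import nltk

# Resource names may differ across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def verb_density(text: str) -> float:
    tokens = nltk.word_tokenize(text)
    tags = nltk.pos_tag(tokens)
    verbs = [tok for tok, tag in tags if tag.startswith("VB")]
    return len(verbs) / max(len(tokens), 1)

print(verb_density("Drive results, own the roadmap, and ship features fast."))
```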

The secret sauce bubbles in this passage:

it’s important that it [Textio] feels easy to use — hence the highlighting and dropdown boxes rather than readouts.

Textio’s Web site states:

From e-commerce to real estate to marketing content, Textio was founded on this simple vision: how you write changes who you reach, and using data, we can predict ahead of time how you’re going to do.

The company, according to Crunchbase, has received $9.5 million from Emergence Capital Partners and four other firms.

There are a number of companies offering text analysis, but Textio may be one of the few providing user friendly tools to help people write and make sense of the nuances in résumés and similar corpuses. Sophisticated analysis of text is available from a number of vendors.

It is encouraging to me that a sub function of information access is attracting attention as a stand alone service. One of the company’s customers is Microsoft, a firm with home grown text solutions and technologies from Fast Search & Transfer and Powerset, among other sources. Microsoft’s interest in Textio underscores that text processing that works as one hopes is an unmet need.

Stephen E Arnold, December 17, 2015

Microsoft, Cortana, and C-3PO Speech Recognition on the Way

December 5, 2015

Microsoft’s speech recognition pros are confident that Star Wars-like speech recognition is a “few years” away. Sci fi becomes reality.

The article “The Long Quest for Technology That Understands Speech as Well as a Human” is an encomium to Microsoft. I recall that when I tested a Windows phone, the design allowed me to activate Cortana, the company’s answer to Siri, even when I did not want to deal with speech recognition.

The write up ignores the fact that ambient noise, complex strings of sounds like “yankeelov,” or poor diction create nonsense outputs. That’s okay. This is rah rahism at its finest.

The write up states:

instinctively, without thinking, and with the expectation that they will work for us.

“When machine learning works at its best, you really don’t see the effort. It’s just so natural. You see the result,” says Harry Shum, the executive vice president in charge of Microsoft’s Technology and Research group.

There you go. A collision of Microsoft’s perception and the reality of a hugely annoying implementation of speech recognition in the real world.

The article points out:

“In research in particular we can take a long-term approach,” Shum said. “Research is a marathon.”

Interesting because the graphic in the write up depicts a journey that has spanned 30 plus years. But, remember, “parity” with human understanding of another human is coming really soon.

Have the wizards at Microsoft tried ordering a pecan turtle Blizzard with the senior citizens’ discount at the Dairy Queen in Prospect, Kentucky?

I can tell you that human to human communication does not work particularly well. “Parity” then means that human to machine communication won’t be very good either unless specific ambient conditions are met.

The hope is that data, machine learning, and deep neural networks will come to the rescue. These technical niches may deliver the pecan turtle Blizzard with the senior citizen discount, but I think more than a few years will be needed.

Microsoft points out that humans “want the whole thing.” Yeah, really? When a company touts parity between Microsoft technology and human speech, the perception is that Microsoft will deliver the pecan turtle Blizzard.

Reality leads to “Would you repeat that?” and “What discount is that?” and “How many Blizzards?” Those Kentucky accents are difficult for a person living in a hollow to figure out. Toss in an “order from your car” gizmo and you have many opportunities to drag out a simple order into a modern twist on Bottom’s verbal blundering in A Midsummer Night’s Dream.

One benefit of this write up is that IBM Watson can recycle the content for its knowledge base. Now that’s a thought.

Stephen E Arnold, December 5, 2015

Reed Elsevier Lexis Nexis Embraces Legal Analytics: No, Not an Oxymoron

November 27, 2015

Lawyers and legal search and content processing systems do words. The analytics part of life, based on my limited experience of watching attorneys do mathy stuff, is not these folks’ core competency. Words. Oh, and billing. I can’t overlook billing.

I read “Now It’s Official: Lexis Nexis Acquires Lex Machina.” This is good news for the stakeholders of Lex Machina. Reed Elsevier certainly expects Lex Machina’s business processes to deliver an avalanche of high margin revenue. One can only raise prices so far before the old chestnut from Economics 101 kicks in: Price elasticity. Once something is too expensive, the customers kick the habit, find an alternative, or innovate in remarkable ways.

According to the write up:

LexisNexis today announced the acquisition of Silicon Valley-based Lex Machina, creators of the award-winning Legal Analytics platform that helps law firms and companies excel in the business and practice of law.

So what does legal analytics do? Here’s the official explanation, which is in, gentle reader, words:

  • A look into the near future. The integration of Lex Machina Legal Analytics with the deep collection of LexisNexis content and technology will unleash the creation of new, innovative solutions to help predict the results of legal strategies for all areas of the law.
  • Industry narrative. The acquisition is a prominent and fresh example of how a major player in legal technology and publishing is investing in analytics capabilities.

I don’t exactly know what Lex Machina delivers. The company’s Web page states:

We mine litigation data, revealing insights never before available about judges, lawyers, parties, and patents, culled from millions of pages of IP litigation information. We call these insights Legal Analytics, because analytics involves the discovery and communication of meaningful patterns in data. Our customers use [Lex Machina] to win in the highly competitive business and practice of law. Corporate counsel use Lex Machina to select and manage outside counsel, increase IP value and income, protect company assets, and compare performance with competitors. Law firm attorneys and their staff use Lex Machina to pitch and land new clients, win IP lawsuits, close transactions, and prosecute new patents.

I think I understand. Lex Machina applies the systems and methods used for decades by companies like BAE Systems (Detica/ NetReveal) and similar firms to provide tools which identify important items. (BAE was one of Autonomy’s early customers back in the late 1990s.) Algorithms, not humans reading documents in banker boxes, find the good stuff. Costs go down because software is less expensive than real legal eagles. Partners can review outputs and even visualizations. Revolutionary.


