
Textio for Text Analysis

December 17, 2015

I read “Textio, A Startup That Analyzes Text Performance, Raises $8M.” The write up reported:

Textio recognizes more than 60,000 phrases with its predictive technology, Snyder [Textio’s CEO] said, and that data set is changing constantly as it continues to operate. It looks at how words are put together — such as how verb dense a phrase is — and at other syntax-related properties the document may have. All that put together results in a score for the document, based on how likely it is to succeed in whatever the writer set out to do.
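The verb-density idea in the quotation can be sketched in a few lines. This is a hypothetical toy, not Textio's actual model: the verb list, the scoring rule, and the function names are all invented for illustration.

```python
# Hypothetical sketch of a "verb density" feature like the one the quote
# describes. The verb set and scoring rule are invented for illustration.
VERBS = {"analyze", "raise", "recognize", "predict", "score", "write", "reach"}

def verb_density(phrase: str) -> float:
    """Fraction of tokens in a phrase that are (known) verbs."""
    tokens = phrase.lower().split()
    if not tokens:
        return 0.0
    return sum(t in VERBS for t in tokens) / len(tokens)

def document_score(phrases: list[str]) -> float:
    """Toy aggregate: average verb density across recognized phrases."""
    if not phrases:
        return 0.0
    return sum(verb_density(p) for p in phrases) / len(phrases)

print(document_score(["predict text performance", "write to reach readers"]))
```

A production system would use a real part-of-speech tagger and learned weights rather than a hand-made verb set, but the pipeline shape (phrase features rolled up into a document score) is the same.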

The secret sauce bubbles in this passage:

it’s important that it [Textio] feels easy to use — hence the highlighting and dropdown boxes rather than readouts.

Textio’s Web site states:

From e-commerce to real estate to marketing content, Textio was founded on this simple vision: how you write changes who you reach, and using data, we can predict ahead of time how you’re going to do.

The company, according to Crunchbase, has received $9.5 million from Emergence Capital Partners and four other firms.

There are a number of companies offering text analysis, but Textio may be one of the few providing user-friendly tools to help people write and make sense of the nuances in résumés and similar corpora. Sophisticated analysis of text is available from a number of vendors.

It is encouraging to me that a subfunction of information access is attracting attention as a stand-alone service. One of the company’s customers is Microsoft, a firm with homegrown text solutions and technologies from Fast Search & Transfer and Powerset, among other sources. Microsoft’s interest in Textio underscores that text processing that works as one hopes is an unmet need.

Stephen E Arnold, December 17, 2015

Microsoft, Cortana, and C2P0 Speech Recognition on the Way

December 5, 2015

Microsoft’s speech recognition pros are confident that Star Wars-like speech recognition is a “few years” away. Sci fi becomes reality.

The article “The Long Quest for Technology That Understands Speech as Well as a Human” is an encomium to Microsoft. I recall that when I tested a Windows phone, the design allowed me to activate Cortana, the company’s answer to Siri, when I did not want to deal with speech recognition.

The write up ignores the fact that ambient noise, complex strings of sounds like “yankeelov,” or poor diction create nonsense outputs. That’s okay. This is rah-rahism at its finest.

The write up states:

instinctively, without thinking, and with the expectation that they will work for us.

“When machine learning works at its best, you really don’t see the effort. It’s just so natural. You see the result,” says Harry Shum, the executive vice president in charge of Microsoft’s Technology and Research group.

There you go. A collision of Microsoft’s perception and the reality of a hugely annoying implementation of speech recognition in the real world.

The article points out:

“In research in particular we can take a long-term approach,” Shum said. “Research is a marathon.”

Interesting because the graphic in the write up depicts a journey that has spanned 30-plus years. But, remember, “parity” with human understanding of another human is coming really soon.

Have the wizards at Microsoft tried ordering a pecan turtle Blizzard with the senior citizens’ discount at the Dairy Queen in Prospect, Kentucky?

I can tell you that human to human communication does not work particularly well. “Parity” then means that human to machine communication won’t be very good either unless specific ambient conditions are met.

The hope is that data, machine learning, and deep neural networks will come to the rescue. These technical niches may deliver the pecan turtle Blizzard with the senior citizen discount, but I think more than a few years will be needed.

Microsoft points out that humans “want the whole thing.” Yeah, really? When a company touts parity between Microsoft technology and human speech, the perception is that Microsoft will deliver the pecan turtle Blizzard.

Reality leads to “Would you repeat that?” and “What discount is that?” and “How many Blizzards?” Those Kentucky accents are difficult for a person living in a hollow to figure out. Toss in an “order from your car” gizmo and you have many opportunities to drag out a simple order into a modern twist on Bottom’s verbal blundering in A Midsummer Night’s Dream.

One benefit of this write up is that IBM Watson can recycle the content for its knowledge base. Now that’s a thought.

Stephen E Arnold, December 5, 2015

Reed Elsevier Lexis Nexis Embraces Legal Analytics: No, Not an Oxymoron

November 27, 2015

Lawyers and legal search and content processing systems do words. The analytics part of life, based on my limited experience of watching attorneys do mathy stuff, is not these folks’ core competency. Words. Oh, and billing. I can’t overlook billing.

I read “Now It’s Official: Lexis Nexis Acquires Lex Machina.” This is good news for the stakeholders of Lex Machina. Reed Elsevier certainly expects Lex Machina’s business processes to deliver an avalanche of high margin revenue. One can only raise prices so far before the old chestnut from Economics 101 kicks in: Price elasticity. Once something is too expensive, the customers kick the habit, find an alternative, or innovate in remarkable ways.

According to the write up:

LexisNexis today announced the acquisition of Silicon Valley-based Lex Machina, creators of the award-winning Legal Analytics platform that helps law firms and companies excel in the business and practice of law.

So what does legal analytics do? Here’s the official explanation, which is in, gentle reader, words:

  • A look into the near future. The integration of Lex Machina Legal Analytics with the deep collection of LexisNexis content and technology will unleash the creation of new, innovative solutions to help predict the results of legal strategies for all areas of the law.
  • Industry narrative. The acquisition is a prominent and fresh example of how a major player in legal technology and publishing is investing in analytics capabilities.

I don’t exactly know what Lex Machina delivers. The company’s Web page states:

We mine litigation data, revealing insights never before available about judges, lawyers, parties, and patents, culled from millions of pages of IP litigation information. We call these insights Legal Analytics, because analytics involves the discovery and communication of meaningful patterns in data. Our customers use [these insights] to win in the highly competitive business and practice of law. Corporate counsel use Lex Machina to select and manage outside counsel, increase IP value and income, protect company assets, and compare performance with competitors. Law firm attorneys and their staff use Lex Machina to pitch and land new clients, win IP lawsuits, close transactions, and prosecute new patents.

I think I understand. Lex Machina applies the systems and methods used for decades by companies like BAE Systems (Detica/ NetReveal) and similar firms to provide tools which identify important items. (BAE was one of Autonomy’s early customers back in the late 1990s.) Algorithms, not humans reading documents in banker boxes, find the good stuff. Costs go down because software is less expensive than real legal eagles. Partners can review outputs and even visualizations. Revolutionary.

Read more

Palantir Profile: Search Plus Add Ons

November 25, 2015

Short honk: If you read French, you will learn quite a bit about Palantir, an interesting company with a $20 billion valuation. The write up is “Palantir et la France : naissance d’une nouvelle théorie abracadabrantesque ?” A listicle in the heart of the article provides a good rundown of the system’s search and content processing capabilities. Yep, search. The difference between Palantir and outfits like Attivio, Coveo, Smartlogic, et al. is the positioning, the bundle of technology, and – oh, did I mention the $20 billion valuation? I do like the abracadabra reference. Magic?

Stephen E Arnold, November 25, 2015

Entity Extraction: Human Intermediation Still Necessary

November 23, 2015

I read “Facebook Should Be Able to Handle Names Like Isis and Phuc Dat Bich.” The article underscores the challenges smart software faces in a world believing that algorithms deliver the bacon.

Entity extraction methods requiring human subject matter experts and dictionary editors are expensive and slow. Algorithms are faster and over time more economical. Unfortunately the automated systems miss some things and get other stuff wrong.

The article explains that Facebook thinks a real person named Isis Anchalee is a bad guy. Another person’s transliterated Vietnamese name, Phuc Dat Bich, is treated as a prohibited phrase.

What’s the fix?

First, the folks assuming that automated systems are pretty much accurate need to connect with the notion of an “exception file” or a log containing names which are not in a dictionary. What if there is no dictionary? Well, that is a problem. What about names with different spellings and in different character sets? Well, that too is a problem.

Will the vendors of automated systems point out the need for subject matter experts to create dictionaries, perform quality and accuracy audits, and update the dictionaries? Well, sort of.
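The exception-file idea above can be sketched roughly as follows. Everything here is an invented example (the vetted-name dictionary, the naive blocklist, the function name), not any vendor's actual pipeline:

```python
# Minimal sketch of dictionary-based name checking with an "exception
# file" for names the automated rules flag. All data here is invented.
KNOWN_NAMES = {"Isis Anchalee", "Phuc Dat Bich"}  # vetted by a human editor
BLOCKED_TERMS = {"isis"}  # naive blocklist that collides with real names

def check_name(name: str, exceptions: list[str]) -> bool:
    """Return True if the name passes; log flagged names for human review."""
    if name in KNOWN_NAMES:
        return True  # a subject matter expert already vetted it
    if any(tok in BLOCKED_TERMS for tok in name.lower().split()):
        exceptions.append(name)  # route to the exception file, don't auto-reject
        return False
    return True

log: list[str] = []
check_name("Isis Anchalee", log)  # vetted: passes
check_name("Isis Smith", log)     # unknown: flagged for human review
print(log)  # → ['Isis Smith']
```

The expensive part is exactly what the automation pitch glosses over: someone has to build the vetted dictionary, work the exception log, and keep both current.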

The point is that like many numerical recipes the expectation that a system is working with a high degree of accuracy is often incorrect. Strike that, substitute “sort of accurate.”

The write up states:

If that’s how the company want the platform to function, Facebook is going to have to get a lot better at making sure their algorithms don’t unfairly penalize people whose names don’t fit in with the Anglo-standard.

When it comes time to get the automated system back into sync with accurate entity extraction, there may be a big price tag.

What? Your vendor did not make that clear?

Explain your “surprise” to the chief financial officer who wants to understand how you overlooked costs which may be greater than the initial cost of the system.

Stephen E Arnold, November 23, 2015

Inferences: Check Before You Assume the Outputs Are Accurate

November 23, 2015

Predictive software works really well as long as the software does not have to deal with horse races, the stock market, and the actions of a single person and his closest pals.

“Inferences from Backtest Results Are False Until Proven True” offers a useful reminder to those who want to depend on algorithms someone else set up. The notion is helpful when the data processed are unchecked, unfamiliar, or just assumed to be spot on.

The write up says:

the primary task of quantitative traders should be to prove specific backtest results worthless, rather than proving them useful.

What throws backtests off the track? The write up provides a useful list of reminders:

  1. Data-mining and data snooping bias
  2. Use of non-tradable instruments
  3. Unrealistic accounting of frictional effects
  4. Use of the market close to enter positions instead of the more realistic open
  5. Use of dubious risk and money management methods
  6. Lack of effect on actual prices
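Item 1, data-snooping bias, is easy to demonstrate on purely random data. In this toy sketch (all numbers and names invented), the best of 1,000 coin-flip “strategies” looks profitable in-sample despite having no edge at all:

```python
# Toy illustration of data-snooping bias: testing many random
# "strategies" against the same price series makes the best one look
# predictive in-sample even though none has any real edge.
import random

random.seed(1)
returns = [random.gauss(0, 0.01) for _ in range(500)]  # fake daily returns
train, test = returns[:250], returns[250:]

def strategy_pnl(seed: int, data: list[float]) -> float:
    """A 'strategy' is just a random long/flat signal each day."""
    rng = random.Random(seed)
    return sum(r for r in data if rng.random() > 0.5)

# Snoop: pick the best of 1,000 random strategies on the training half.
best = max(range(1000), key=lambda s: strategy_pnl(s, train))
print(f"in-sample PnL:     {strategy_pnl(best, train):+.3f}")
print(f"out-of-sample PnL: {strategy_pnl(best, test):+.3f}")
```

The in-sample number is impressive by construction; the out-of-sample number is just noise, which is the author's point about proving backtests worthless before trusting them.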

The author is concerned about financial applications, but the advice may be helpful to those who just want to click a link, output a visualization, and assume the big spikes are really important to the decision you will influence in one hour.

One point I highlighted was:

Widely used strategies lose any edge they might have had in the past.

Degradation occurs just like the statistical drift in Bayesian-based systems. Exciting if you make decisions on outputs known to be flawed. How are those automatic indexing, business intelligence, and predictive analytics systems working?

Stephen E Arnold, November 23, 2015

A Modest Dust Up between Big Data and Text Analytics

November 18, 2015

I wonder if you will become involved in this modest dust up between the Big Data folks and the text analytics adherents. I know that I will sit on the sidelines and watch the battle unfold. I may be mostly alone on that fence for three reasons:

  • Some text analytics outfits are Big Data oriented. I would point modestly to Terbium Labs and Recorded Future. Both do the analytics thing and both use “text” in their processing. (I know that learning about these companies is not as much fun as reading about Facebook friends, but it is useful to keep up with cutting edge outfits in my opinion.)
  • Text analytics can produce Big Data. I know that sounds like a fish turned inside out. Trust me. It happens. Think about some wan government worker in the UK grinding through Twitter and Facebook posts. The text analytics outputs lots of data.
  • A faux dust up is mostly a marketing play. I enjoyed search and content processing vendor presentations which pitted features of one system versus another. This approach is not too popular because every system says it can do what every other system can do. The reality of the systems is, in most cases, not discernible to the casual failed webmaster now working as a “real” wizard.

Navigate to “Text Analytics Gurus Debunk 4 Big Data Myths.” You will learn that there are four myths which are debunked. Here are the myths:

  1. Big Data survey scores reign supreme. Hey, surveys are okay because outfits like SurveyMonkey and the crazy pop up technology from that outfit in Michigan are easy to implement. Correct? Not important. Usable data for marketing? Important.
  2. Bigger social media data analysis is better. The outfits able to process the real time streams from Facebook and Twitter have lots of resources. Most companies do not have these resources. Ergo: Statistics 101 reigns no matter what the marketers say.
  3. New data sources are the most valuable. The idea is that data which are valid, normalized, and available for processing trump bigness. No argument from me.
  4. Keep your eye on the ball by focusing on how customers view you. Right. The customer is king in marketing land. In reality, the customer is a code word for generating revenue. Neither Big Data nor text analytics produce enough revenue in my world view. Sounds great though.

Will Big Data respond to this slap down? Will text analytic gurus mount their steeds and take another run down Marketing Lane to the windmill set up as a tourist attraction in an Amsterdam suburb?

Nope. The real battle involves organic, sustainable revenues. Talk is easy. Closing deals is hard. This dust up is not a mixed martial arts pay per view show.

Stephen E Arnold, November 18, 2015

Lexalytics: Checking into Hotel Data

November 14, 2015

I read “Boost Your Brand Reputation by Listening to Social Content.” I find the title interesting. The idea that a brand such as a hotel like Motel 6 or Hilton can improve its reputation by listening is interesting. I am not sure that listening translates to a better reputation. The guest in the hotel deals with the room, the staff, and the electrical outlet (presumably working). The guest forms an opinion about the hotel. I was in a hotel in Cape Town which featured a nonworking door, no electricity, and pipes which leaked. This was a new room.

Listening to me did not solve the problems. What solved the problems was my speaking with two managers and proposing that I sleep in the lobby.

I am enthused by technology. I am not keen when technology is presented as a way to sidestep or subvert the reality that creates one’s views of a business, in this case a hotel. The idea is—well, let me be frank—not a good one.

I think the write up means that a hotel using Lexalytics technology has a way to obtain information that otherwise might not find its way to the 20 something in the marketing department. The hotel then has to take action to resolve the problem. This is pretty much common sense, but it does not boost a hotel’s social reputation. Listening is passive. The information must be converted to meaningful action.

That’s the problem with search and content processing companies and why many of them face credibility and revenue challenges. The hoped for action has zero direct connection with the grinding of the algorithms and the motivation or capability of the licensee to make a change.

The write up asserts:

This social currency, online reputation, directly influences a hotelier’s sales volume: good reputation, higher sales — poor reputation, lower sales. The upshot is that in the hospitality industry, increasing your reputation (and revenue) means listening to social content and basing your business decisions on the feedback you receive from guests. And I know I’m preaching to the choir here, but remember that good reputation isn’t just for high-end establishments. There’s a lot to be said for value for money, and smaller, more modest establishments can often gain the most from careful management of their online reputation.

Sounds great. The management of the hotel have to make changes. Over time, the changes will have an impact on the Facebook or Yelp posts that the guests contribute.

Technology is simply a utility, not a way to get from lousy hotel to wonderful hotel with a mouse click or by listening. Horse feathers.

Stephen E Arnold, November 14, 2015

What Does Connotate Deliver to Licensees?

October 23, 2015

If you have asked yourself this question, you will find the answers in “The Most Asked Questions About Connotate.” Connotate was founded in 2000 and has ingested about $12.5 million from four investors, according to Crunchbase. That works out to 15 years.

Here are two questions which I highlighted:

One question is, “Is Connotate Web scraping?” Here’s the answer:

No. Web scrapers parse HTML code on web pages, searching for markers to identify which data elements to extract. While web scrapers may be alright for one-off extractions, they’re not sustainable when it comes to regularly extracting and monitoring large parcels of content. Also, web scrapers require programmers to create and maintain them. Connotate’s solutions rely on machine learning to reach and maintain optimal resiliency.

In my experience, Connotate offers a type of Web scraping.
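For what it’s worth, the marker-based HTML parsing the FAQ answer describes looks roughly like this minimal sketch using Python’s standard library. The page markup, class name, and scraper class are made-up examples, not Connotate’s technology:

```python
# Bare-bones marker-based scraping: pull text out of elements matching
# a class attribute. Markup and class name are invented examples.
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Collect text inside <span class="price"> elements."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

page = '<div><span class="price">$19.99</span><span>n/a</span></div>'
scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)  # → ['$19.99']
```

The FAQ’s real objection is maintenance: hand-built parsers like this break whenever the page markup changes, which is the gap machine learning approaches claim to close.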

The other question I noted was “Who uses Connotate’s solutions?” And the answer:

Connotate’s solutions can be harnessed by anyone who is looking to turn the vast scope of data on the Web into usable data. Chief Content Officers, Data Operations Managers and Product Managers are just a few examples of the kind of professionals who could readily benefit from Connotate’s solution.

I was hoping for some specific customer references. I know there are Connotate customers.

For further information about Connotate, the company offers a “complete online FAQ.”

Stephen E Arnold, October 23, 2015

Concept Searching SharePoint White Paper

October 22, 2015

I saw a reference to “2015 SharePoint and Office 365 State of the Market Survey White Paper.” If you are interested in things SharePoint and Office 365, you can (as of October 15, 2015) download the 40 page document at this Concept Searching link. A companion webinar is also available.

The most interesting portion of the white paper is its Appendix A. A number of buzzwords are presented as “Priorities by Application.” Note that the Appendix is graphical and presents the result of a “survey.” Goodness, SharePoint seems to have some holes in its digital fabric.

The data for enterprise search are interesting.


Source: Concept Searching, 2015

It appears that fewer than 20 percent of those included in the sample (there are few details about the mechanics of this survey, the data for which were gathered via the Web) do not see enterprise search as a high-priority issue. About 30 percent of the respondents perceive search as working as intended. An equal number, however, are beavering away to improve their enterprise search system.

Unlike some enterprise search and content processing vendors, Concept Searching is squarely in the Microsoft camp. With third party vendors providing “solutions” for SharePoint and Office 365, I ask myself, “Why doesn’t Microsoft address the shortcomings third parties attack?”

Stephen E Arnold, October 22, 2015
