Predictive Coding: Who Is on First? What Is the Betting Game?

December 20, 2012

I am confused, but what’s new? The whole “predictive analytics” rah rah causes me to reach for my NRR 33 dB bell shaped foam ear plugs.

Look. If predictive methods worked, there would be headlines in the Daily Racing Form, in the Wall Street Journal, and in the Las Vegas sports books. The cheerleaders for predictive wizardry are pitching breakthrough technology in places where accountability is a little fuzzier than a horse race, stock picking, and betting on football games.

The godfather of cost cutting for legal document analysis. Reverend Thomas Bayes, 1701 to 1761. I heard he said, “Praise be, the math doth work when I flip the numbers and perform the old inverse probability trick. Perhaps I shall apply this to legal disputes when lawyers believe technology will transform their profession.” Yep, partial belief. Just the ticket for attorneys. See http://goo.gl/S5VSR.

I understand that there is PREDICTION which generates tons of money for the person who has an algorithm which divines which nag wins the Derby, which stock is going to soar, and which football team will win a particular game. Skip the fuzzifiers like 51 percent chance of rain. It either rains or it does not rain. In the harsh world of Harrod’s Creek, capital letter PREDICTION is not too reliable.

The lower case prediction is far safer. The assumptions, the unexamined data, the thresholds hardwired into the off-the-shelf algorithms, and the fiddling with Bayesian relaxation factors are aimed at those looking to cut corners, trim costs, or figure out which way to point the hit-and-miss medical research team.

Which is it? PREDICTION or prediction.

I submit that it is lower case prediction with an upper case MARKETING wordsmithing.

Here’s why:

I read “The Amazing Forensic Tech behind the Next Apple, Samsung Legal Dust Up (and How to Hack It).” Now that is a headline. Skip the “amazing”, “Apple”, “Samsung,” and “Hack.” I think the message is that Fast Company has discovered predictive text analysis. I could be wrong here, but I think Fast Company might have been helped along by some friendly public relations type.

Let’s look at the write up.

First, the high profile Apple Samsung trial became the hook for “amazing” technology. The idea is that smart software can grind through the text spit out from a discovery process. In the era of ballooning digital data, it is really expensive to pay humans (even those working at a discount in India or the Philippines) to read the emails, reports, and transcripts.

Let a smart machine do the work. It is cheaper, faster, and better. (Shouldn’t one have to pick two of these attributes?)

Fast Company asserts:

“A couple good things are happening now,” Looby says. “Courts are beginning to endorse predictive coding, and training a machine to do the information retrieval is a lot quicker than doing it manually.” The process of “Information retrieval” (or IR) is the first part of the “discovery” phase of a lawsuit, dubbed “e-discovery” when computers are involved. Normally, a small team of lawyers would have to comb through documents and manually search for pertinent patterns. With predictive coding, they can manually review a small portion, and use the sample to teach the computer to analyze the rest. (A variety of machine learning technologies were used in the Madoff investigation, says Looby, but he can’t specify which.)
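For those who want to see what "review a small portion and teach the computer" might look like, here is a minimal sketch. I am assuming a plain naive Bayes word-count model, which is my guess at the general idea and not a description of what any vendor named in the article actually ships. The function names and the four-document "seed set" are made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Crude whitespace tokenizer; real systems do far more cleanup.
    return text.lower().split()

def train(labeled_docs):
    """labeled_docs: (text, is_relevant) pairs a human has already reviewed.
    Both labels must appear at least once, or the prior below is log(0)."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def prob_relevant(model, text):
    """Bayes' rule: P(relevant | words) from P(words | relevant) and the prior."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in (True, False):
        log_p = math.log(doc_counts[label] / total_docs)  # prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in the seed set
                # Add-one smoothing so an unseen word does not zero the score.
                log_p += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        log_score[label] = log_p
    # Normalize the two log scores into a probability.
    top = max(log_score.values())
    weights = {k: math.exp(v - top) for k, v in log_score.items()}
    return weights[True] / (weights[True] + weights[False])

# Hypothetical seed set: the "small portion" reviewed by hand.
seed = [
    ("patent infringement claim on phone design", True),
    ("royalty terms for the design patent", True),
    ("lunch menu for friday", False),
    ("parking garage closed friday", False),
]
model = train(seed)

high = prob_relevant(model, "draft of the patent claim")
low = prob_relevant(model, "friday lunch order")
print(round(high, 2), round(low, 2))
```

The output is a partial belief for each unread document, not a verdict. Four training documents is a toy; the quality of the result depends entirely on what the humans put in the seed set.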

Fast Company can reveal the predictive coding players. One notable player is Hewlett Packard. Yep, that outfit which is embroiled in a legal matter which has to do with figuring out what is important in legal and financial information. Is this a positive use case for HP’s technology for predictive coding?

Fast Company also clamps on to FTI, a consulting firm with technology. Now the technology behind FTI’s predictive capabilities evolved from Yahoo and then Microsoft. Neither of these companies strikes me as the gold standard for fancy math. Maybe I am a skeptic, but Yahoo was a content outfit and managed to hire a person with allegedly incorrect credentials. Microsoft is wrestling with the mobile and Windows 8 demons. (If that predictive stuff worked, did Microsoft apply its own technology to its mobile and Windows 8 data? I don’t know.)

The FTI system (which is based on the Attenex technology from the Yahoo to Microsoft trajectory) requires humans to fiddle around. Once the settings have been inserted in the system, FTI’s technology does the predictive coding thing, which—as I understand it—winnows the wheat from the chaff. The core FTI value is, according to Fast Company:

the predictive coding software applies its refined weight values and judgment line to the entire collection of documents, reducing the amount of documents that need to be examined by a human from, say, 10 million to as low as a few thousand. To be fully confident in the quality of the results, attorneys can look at sets of documents (usually of a few thousand) from the relevant and irrelevant piles that the software generated, to evaluate how well the software met expectations….Predictive coding is known in the annals of artificial intelligence as “supervised machine learning.” FTI can do it so effectively and defensibly because it adds in human training, human checking, and statistical mapping.

As I understand prediction in this context, humans are not known for their predictive skills if one believes Daniel Kahneman’s argument in Thinking, Fast and Slow or the acerbic Nassim Nicholas Taleb’s assertions in Antifragile: Things That Gain from Disorder. So humans make decisions and the FTI system applies those thresholds to break the back of thorny legal problems. I think that’s the gist of the Fast Company analysis of FTI.
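The "judgment line" and the spot check of the piles can be sketched in a few lines. This is my reading of the Fast Company description, not FTI's method. The scores, the 0.5 cutoff, the function names, and the stand-in for the human reviewer are all hypothetical, and the confidence margin uses a simple normal approximation that gets shaky for small samples or very low miss rates.

```python
import math
import random

def split_by_threshold(scored_docs, threshold=0.5):
    """scored_docs: (doc_id, score) pairs. The threshold is the judgment line."""
    keep = [doc for doc, score in scored_docs if score >= threshold]
    discard = [doc for doc, score in scored_docs if score < threshold]
    return keep, discard

def estimate_miss_rate(discard_pile, human_says_relevant, sample_size, seed=0):
    """Have a human check a random sample of the discard pile and estimate
    what fraction of it is actually relevant (the documents the system missed).
    Returns (rate, margin), where margin is a rough 95 percent half-width."""
    if not discard_pile:
        raise ValueError("nothing in the discard pile to sample")
    rng = random.Random(seed)
    sample = rng.sample(discard_pile, min(sample_size, len(discard_pile)))
    misses = sum(1 for doc in sample if human_says_relevant(doc))
    n = len(sample)
    rate = misses / n
    margin = 1.96 * math.sqrt(rate * (1 - rate) / n)
    return rate, margin

# Hypothetical scores from some upstream model.
scored = [("a", 0.9), ("b", 0.2), ("c", 0.5), ("d", 0.1)]
keep, discard = split_by_threshold(scored)

# Stand-in for the attorney: pretend document "b" was relevant after all.
rate, margin = estimate_miss_rate(discard, lambda doc: doc == "b", sample_size=2)
print(keep, discard, rate, round(margin, 2))
```

Note where the humans sit: they pick the threshold and they grade the sample. The software just applies their choices to the other ten million documents.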

I did some poking around. There is another firm which also asserts predictive coding expertise. This outfit, Recommind, uses a variant of the type of math which underpins HP Autonomy’s IDOL (Intelligent Data Operating Layer). The approach reaches back to rural England in the 18th century. The math, it seems, is one of those chestnuts which, like New Year’s Day celebrations, are sufficiently useful to be nearly ubiquitous. Recommind has pushed beyond Bayes. The company received a patent for its predictive coding technology. You can find information about the approach and the patent at http://www.recommind.com/predictive-coding. On this page, there is a picture which shows a diagram which seems familiar to me. I thought what Recommind asserts it invented is similar to what Fast Company describes as the FTI process. Here’s the Recommind diagram. Compare that to the FTI description. Don’t these peas seem to come from the same Autonomy pod?

Source: Recommind at http://www.recommind.com/predictive-coding

How many other companies are pitching the predictive coding thing? I think there are several. Why so many? If one of these systems delivered the goods, the winning predictive method would win the horse races, pick the winning stocks, and identify the outcome of football games without looking in the rear view mirror and involving messy, irrational, numerically challenged humans.

I really don’t care too much about how lawyers reduce costs and maintain their partners’ lifestyles. The more interesting question is, “Why the rush to predictive systems?”

My hypotheses:

Companies have identified a way to couple fear with cost control. Those involved in a legal matter can slash certain costs and assuage to some degree the fear that an important factoid will be overlooked. Marketers can convert this notion into a signed contract.
Law firms have to cut costs. Even though there is a surfeit of legal eagles, the money is not sloshing around as it did in the days of the ATT or IBM break up matters. Those were the days. Now law firms have to figure out how to make sales, win cases, and make money. One of my attorneys bailed out of litigation and became a blogger and college professor. Another attorney I know was disenchanted and now does push ups for a living. Cost cutting is possible with technology. One assumes the technology works.
Clients want to sue and their competitors want to sue them. In order to deal with the thermonuclear war of modern litigation, technology sure seems to be the way to go. The rush to predictive analytics just seems to make so much sense. Once again, if these methods worked, why are the vendors pitching cash strapped and fearful clients? Answer: the prediction works within certain narrow domains and one has to find a client who buys into the limitations.

The predictive analytics bandwagon is rolling along. The information superhighway is growing crowded with bandwagons for Big Data, cloud computing, and surveillance systems. Is the technology flashy? No. The demos are. What’s the reality of these next generation systems? Well, they are expensive to set up, tune, and keep in step with competitive systems rushing to market.

Fast Company gives the impression that FTI has the market cornered. I don’t think FTI does have the market cornered. FTI bought Attenex in 2008. What happened between 2008 and the Fast Company article? For starters, there was the economic downturn. Next was the shift from innovation to litigation. Finally, there was the cost crunch in figuring out what was likely to be germane to a particular legal matter for a particular party to the matter.

Predictive coding has been around for a while. Attenex opened for business in 2001. That’s newer than the pious reverend who did the Bayesian thing, but in Internet years, that’s pretty darned old.

Are there more modern approaches? Yep. Will I mention them in this write up? Nope. Just keep one point in mind.

If the predictive stuff worked, those with the functioning method would be making money in horses, stocks, or sports betting.

Predictive systems and methods have utility. Is that utility what the licensees know and understand? Answering this question explains the market and the limitations of modern predictive systems in my opinion.

Stephen E Arnold, December 20, 2012

Written by Stephen E. Arnold · Filed Under Analytics, Feature, Predictive coding, Technology, Text analytics, Text processing


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.