The Alleged Received Wisdom about Predictive Coding

June 19, 2012

Let’s start off with a recommendation. Snag a copy of the Wall Street Journal and read the hard copy front page story in the Marketplace section, “Computers Carry Water of Pretrial Legal Work.” In theory, you can read the story online if you don’t have pages A-1 and A-10 of the June 18, 2012, newspaper. A variant of the story appears online as “Why Hire a Lawyer? Computers Are Cheaper.”

Now let me offer a possibly shocking observation: The costs of litigation are not going down for certain legal matters. Neither bargain basement human attorneys nor Fancy Dan content processing systems make the legal bills smaller. Your mileage may vary, but for those snared in some legal traffic jams, costs are tough to control. In fact, search and content processing can impact costs, just not in the way some of the licensees of next generation systems expect. That is one of the mysteries of online that few can penetrate.

The main idea of the Wall Street Journal story is that “predictive coding” can do work that human lawyers do at a higher cost and sometimes with much less precision. That’s the hint about costs in my opinion. But the article is traditional journalistic gold. Coming from the Murdoch organization, what did I expect? i2 Group has been chugging along with relationship maps for case analyses of important matters since 1990. Big alert: i2 Ltd. was a client of mine. Let’s see, that was more than a couple of weeks ago that basic discovery functions became available.

The write up quotes published analyses which indicate that when humans review documents, those humans get tired and do a lousy job. The article cites “experts” from Thomson Reuters, a firm steeped in legal and digital expertise, who point out that predictive coding is going to be an even bigger business. Here’s the passage I underlined: “Greg McPolin, an executive at the legal outsourcing firm Pangea3 which is owned by Thomson Reuters Corp., says about one third of the company’s clients are considering using predictive coding in their matters.” This factoid is likely to spawn a swarm of azure chip consultants who will explain how big the market for predictive coding will be. Good news for the firms engaged in this content processing activity.

Which grows faster: the costs of a legal matter or the costs of a legal matter that requires automation plus trained attorneys? Why do companies embrace automation plus human attorneys? Risk certainly is a turbo charger.

The article also explains how predictive coding works, offers some cost estimates for various actions related to a document, and adds some cautionary points about predictive coding proving itself in court. In short, we have a touchstone document about this niche in search and content processing.

My thoughts about predictive coding are related to the broader trends in the use of systems and methods to figure out what is in a corpus and what a document is about.

First, the driver for most content processing is related to two quite human needs. One, the costs of coping with large volumes of information are high and going up fast. Two, the need to reduce risk. Most professionals find quips about orange jump suits, sharing a cell with Mr. Madoff, and the iconic “perp walk” downright depressing. When a legal matter surfaces, the need to know what’s in a collection of content like corporate email is high. The need for speed is driven by executive urgency. The cost factor kicks in when the chief financial officer has to figure out the costs of determining what’s in those documents. Predictive coding to the rescue. One firm used the phrase “rocket docket” to communicate speed. Other firms promise optimized statistical routines. The big idea is that automation is faster and cheaper than having lots of attorneys sifting through documents in printed or digital form. The Wall Street Journal is right. Automated content processing is going to be a big business. I just hit the two key drivers. Why dance around what is fueling this sector?
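Under the hood, “predictive coding” is essentially supervised document classification: attorneys label a small seed set of documents as responsive or not, and the system scores the remaining corpus to prioritize review. Here is a minimal sketch of that idea, a toy naive Bayes scorer on hypothetical documents; it is not any vendor’s actual method, and real e-discovery systems layer sampling protocols and validation on top:

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs; label is 'responsive' or 'not'."""
    counts = {"responsive": Counter(), "not": Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        counts[label].update(text.lower().split())
    return counts, doc_counts

def score(text, counts, doc_counts):
    """Log-odds that a document is responsive, with add-one smoothing."""
    vocab = set(counts["responsive"]) | set(counts["not"])
    total = {c: sum(counts[c].values()) for c in counts}
    log_odds = math.log(doc_counts["responsive"] / doc_counts["not"])
    for word in text.lower().split():
        p_r = (counts["responsive"][word] + 1) / (total["responsive"] + len(vocab))
        p_n = (counts["not"][word] + 1) / (total["not"] + len(vocab))
        log_odds += math.log(p_r / p_n)
    return log_odds  # > 0 suggests responsive; route to attorney review

# Hypothetical seed set labeled by attorneys.
seed = [
    ("contract breach damages invoice", "responsive"),
    ("merger due diligence breach", "responsive"),
    ("lunch menu friday picnic", "not"),
    ("holiday party rsvp", "not"),
]
model = train(seed)
print(score("invoice for breach damages", *model) > 0)  # True
```

The point of the sketch is the economics: once the seed set is labeled, scoring a million documents costs compute, not billable hours.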

Read more

More Predictive Silliness: Coding, Decisioning, Baloneying

June 18, 2012

It must be the summer vacation warm and fuzzies. I received another wild analytics news release today. This one comes from 5WPR, “a top 25 PR agency.” Wow. I learned from the spam: PeekAnalytics “delivers enterprise class Twitter analytics and help marketers understand their social consumers.”

What?

Then I read:

By identifying where Twitter users exist elsewhere on the Web, PeekAnalytics offers unparalleled audience metrics from consumer data aggregated not just from Twitter, but from over sixty social sites and every major blog platform.

The notion of algorithms explaining anything is interesting. But the problem with numerical recipes is that those who use the outputs may not know what’s going on under the hood. Widespread knowledge of the specific algorithms, the thresholds built into the system, and the assumptions underlying the selection of a particular method is in short supply.

Analytics is the realm of the one percent of the population trained to understand the strengths and weaknesses of specific mathematical systems and methods. The 99 percent are destined to accept analytics system outputs without knowing how the data were selected, shaped, formed, and presented given the constraints of the inputs. Who cares? Well, obviously not some marketers of predictive analytics, automated indexing, and some trigger trading systems. Too bad for me. I do care.

When I read about analytics and understanding, I shudder. As an old goose, each body shake costs me some feathers, and I don’t have many more to lose at age 67. The reality of fancy math is that those selling its benefits do not understand its limitations.

Consider the notion of using a group of analytic methods to figure out the meaning of a document. Then consider the numerical recipes required to identify a particular document as important from thousands or millions of other documents.

When companies describe the benefits of a mathematical system, the details are lost in the dust. In fact, bringing up a detail results in a wrinkled brow. Consider the Kolmogorov-Smirnov test. Has this non-parametric test been applied to the analytics system which marketers have presented to you in the last “death by PowerPoint” session? The response from 99.5 percent of the people in the world is, “Kolmo who?” or “Isn’t Smirnov a vodka?” Bzzzz. Wrong.
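For the curious, the two-sample Kolmogorov-Smirnov statistic is simply the largest vertical gap between two empirical cumulative distribution functions. A bare-bones sketch on toy data follows; a real analysis would use a vetted implementation such as scipy.stats.ks_2samp, which also supplies the p-value:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical
    distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))

    def ecdf(sample, x):
        # Fraction of the sample at or below x.
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Identical samples give 0; completely separated samples give 1.
print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0
print(ks_statistic([1, 2, 3], [10, 20, 30]))  # 1.0
```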

Mathematical methods which generate probabilities are essential to many business sectors. When one moves fuel rods at a nuclear reactor, the decision about which rod to put where is informed by a range of mathematical methods. Specially trained experts, often with degrees in nuclear engineering plus post graduate work, handle the fuel rod manipulation. Take it from me. Direct observation is not the optimal way to figure out fuel pool rod distribution. Get the math “wrong” and some pretty exciting events transpire. Monte Carlo anyone? John Gray? Julian Steyn? If these names mean nothing to you, you would not want to sign up for work in a nuclear facility.
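For readers to whom “Monte Carlo” means only a casino: these are methods that estimate a quantity by repeated random sampling, and neutron transport codes used in reactor analysis rely on them. The classic toy illustration is estimating pi, nothing nuclear about it:

```python
import random

def estimate_pi(trials, seed=42):
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that land inside the quarter circle, times 4."""
    rng = random.Random(seed)  # seeded for reproducibility
    hits = sum(1 for _ in range(trials)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4 * hits / trials

print(estimate_pi(100_000))  # roughly 3.14
```

The catch, and the article’s point, is that interpreting such estimates takes training: the error shrinks only with the square root of the number of trials, and someone has to know that.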

Why then would a person with zero knowledge of numerical recipes, oddball outputs from particular types of algorithms, and little or no experience with probability methods use the outputs of a system as “truth”? The outputs of analytical systems require expertise to interpret. Looking at a nifty graphic generated by Spotfire or Palantir is NOT the same as understanding what decisions have been made, what limitations exist within the data display, and what blind spots are generated by the particular method or suite of methods. (Firms which do focus on explaining and delivering systems which make it clear to users about methods, constraints, and considerations include Digital Reasoning, Ikanow, and Content Analyst. Others? You are on your own, folks.)

Today I have yet another conference call with 30 somethings who are into analytics. Analytics is the “next big thing.” Just as people assume coding up a Web site is easy, people assume that mathematical methods are now the mental equivalent of clicking a mouse to get a document. Wrong.

The likelihood of misinterpreting the outputs of modern analytic systems is higher than it was when I entered the workforce after graduate school. These reasons include:

  1. A rise in the “something for nothing” approach to information. A few clicks, a phone call, and chit chat with colleagues makes many people expert in quite difficult systems and methods. In the mid 1960s, there was limited access to systems which could do clever stuff with tricks from my relative Vladimir Ivanovich Arnold. Today, the majority of the people with whom I interact assume their ability to generate a graph and interpret a scatter diagram equips them as analytic mavens. Math is and will remain hard. Nothing worthwhile comes easy. That truism is not too popular with the 30 somethings who explain the advantages of analytics products they sell.
  2. Sizzle over content. Most of the wild and crazy decisions I have learned about come from managers who accept analytic system outputs as a page from old Torah scrolls from Yitzchok Riesman’s collection. High ranking government officials want eye candy, so modern analytic systems generate snazzy graphics. Does the government official know what the methods were and the data’s limitations? Nope. Bring this up and the comment is, “Don’t get into the weeds with me, sir.” No problem. I am an old advisor in rural Kentucky.
  3. Entrepreneurs, failing search system vendors, and open source repackagers are painting the bandwagon and polishing the tubas and trombones. The analytics parade is on. From automated and predictive indexing to surfacing nuggets in social media—the music is loud and getting louder. With so many firms jumping on the bandwagon or joining the parade, the reality of analytics is essentially irrelevant.

The bottom line for me is that the social boom is at or near its crest. Marketers—particularly those in content processing and search—are desperate for a hook which will generate revenues. Analytics seems to be as good as any other idea which is converted by azure chip consultants and carpetbaggers into a “real business.”

The problem is that analytics is math. Math is as easy as 1-2-3; math is as complex as MIT’s advanced courses. With each advance in computing power, more fancy math becomes possible. As math advances, the number of folks who can figure out what a method yields decreases. The result is a growing “cloud of unknowing” with regard to analytics. Putting this into a visualization makes the challenge clear.

Stephen E Arnold, June 18, 2012

Inteltrax: Top Stories, June 11 to June 15

June 18, 2012

Inteltrax, the data fusion and business intelligence information service, captured three key stories germane to search this week, specifically, how governments and the voting public are utilizing big data.

In “Government Leads Way in Big Data Training” we discovered the private sector lagging behind the government in terms of user education.

Our story, “U.S. Agencies Analytics Underused” showed that even though we have all that training, some agencies still need more to fully utilize this digital power.

“Cultural Opinion Predicted by Analytics” used the Eurovision song contest to show us the power of people using analytics and gave us a nugget of thought as to how this could be used in government elections.

While sometimes the outcomes contradict one another, there’s no denying that big data analytics plays a huge part in governments around the world. Expect this trend to only grow as its popularity catches fire.

Follow the Inteltrax news stream by visiting www.inteltrax.com


Patrick Roland, Editor, Inteltrax.

June 18, 2012

Inteltrax: Top Stories, June 4 to June 8

June 11, 2012

Inteltrax, the data fusion and business intelligence information service, captured three key stories germane to search this week, specifically, how financial markets are being influenced and affected by big data analytics.

In “Venture Capitalists Invest in Cloud Based API Provider” we explore how tons of financial investments, namely in the cloud, are changing the game of big data.

In “UK Financial Industry Benefiting from Analytics” we discovered how England is attempting to avoid Eurozone financial catastrophe with analytics.

Finally, our feature, “Quantitative Financial Analytics is a Serious Weapon” dove headlong into this new buzzword and its impact on financial markets and the vendors supplying software.

With global markets plummeting or rising in equally shaky motions, analytics looks to be a potential stabilizing force. We’ll keep watching to see what kind of aid it can be.

Follow the Inteltrax news stream by visiting www.inteltrax.com

Patrick Roland, Editor, Inteltrax.

June 11, 2012

HP Autonomy: The Big Data Arabesque

June 5, 2012

Hewlett Packard has big plans for Autonomy. HP paid $10 billion for the search and content processing company last year. HP faces a number of challenges in its printer and ink business. The personal computer business is okay, but HP is without a strong revenue stream from mobile devices.

“HP Rolls Out Hadoop AppSystem Stack” provided some interesting information about Autonomy and big data. The write up focuses on the big data trend. In order to make sense out of large volumes of information, HP wants to build management software and integrate the “Vertica column oriented distributed database and the Autonomy Intelligent Data Operating Layer (IDOL) 10 stack.” The article reports:

On the Autonomy front, HP has announced the capability to put the IDOL 10 engine, which supports over 1,000 file types and connects to over 400 different kinds of data repositories, onto each node in a Hadoop cluster. So you can MapReduce the data and let Autonomy make use of it. For instance, you can use it to feed the Optimost Clickstream Analytics module for the Autonomy software, which also uses the Vertica data store for some parts of the data stream. HP is also rolling out its Vertica 6 data store, and the big new feature is the ability to run the open source R statistical analysis programming language in parallel on the nodes where Vertica is storing data in columnar format. More details on the new Vertica release were not available at press time, but Miller says that the idea is to provider connectors between Vertica, Hadoop, and Autonomy so all of the different platforms can share information.

HP’s idea blends a hot trend, HP’s range of hardware, HP’s system management software, a database, and Autonomy IDOL. In order to make this ensemble play in tune, HP will offer professional services.

InfoWorld’s “HP Extends Autonomy’s Big Data Chops to Hadoop Cloud” added some additional insight. I learned that former Autonomy boss Michael Lynch will leave HP “along with Autonomy’s entire original management team and 20 percent of its staff.”

The story then explained that Autonomy, which combines with Vertica:

can now be embedded in Hadoop nodes. From there, users can combine Idol’s 500-plus functions — including automatic categorization, clustering, and hyperlinking — to scour various sources of structured and unstructured data to glean deeper meanings and trends. Sources run the gamut, too, from structured data such as purchase history, services issues, and inventory records to unstructured Twitter streams, and even audio files. IDOL includes 400 connectors, which companies can use to get at external data.

Autonomy moved beyond search many years ago. This current transformation of Autonomy makes marketing sense. I am interested in monitoring this big data approach. IBM had a similar idea when it presented the Vivisimo clustering and deduplication system as a “big data” system. The challenge will be applying text-centric technology to ensembles which generate insights from “big data.”

Will the shift earn back the purchase price of $10 billion and have enough horsepower to pull HP into robust top line growth? Big data and analytics have promise but I don’t know of any single analytics company that has multi-billion dollar product lines. Big data is a hot button, but does it hard wire into the pocketbooks of chief financial officers?

Stephen E Arnold, June 5, 2012

Sponsored by IKANOW

The New Lexi-Portal Version 4 Offers More Options

June 5, 2012

Leximancer just introduced Lexi-Portal Version 4 to the market. This new service gives users access to the full range of Leximancer’s text analytic capabilities. Market researchers will find that the portal provides fast analysis of qualitative surveys, spreadsheets, and verbatim data.

Leximancer’s technology is proven with customers all around the globe. They’re providing new and innovative ways for businesses to benefit in a no strings attached way. Basically, you have options on how to utilize the Lexi-Portal.

There are several aspects of the portal that make it unique, such as the fact that it is an ‘on demand’ service. This means you don’t have to subscribe every month; instead you are charged for the actual amount of usage, on either a time used or a per service basis. A convenience of the pay as you go approach is that the Lexi-Portal will retain your company’s information for up to two months even if your usage drops for a month.

About Leximancer:

“Leximancer is an Australian company that has been providing leading-edge text analytics technology for almost 10 years.”

“The technology was created following 7 years research and development at the University of Queensland by Dr Andrew Smith. Andrew’s physics and cognitive science background, in conjunction with his working IT application experience, enabled him to envisage and develop an innovative solution to the growing need to readily determine meaning from unstructured, qualitative, textual data.”

You can view sample outputs at the Leximancer Chart Gallery, such as the interview dashboard below:

[Image: interview dashboard from the Leximancer Chart Gallery]

Jennifer Shockley, June 5, 2012

Sponsored by PolySpot

Inteltrax: Top Stories, May 28 to June 1

June 4, 2012

Inteltrax, the data fusion and business intelligence information service, captured three key stories germane to search this week, specifically, what is hot and trending in big data these days.

The first answer came from our story, “Dashboard Data Analytics Hot,” which showcased the many ways increased usability is boosting big data’s popularity.

Also, “The Next Great Data Gold Mine” looked a little deeper into what we already know: social media is going to be huge for analytics.

Finally, “Analytic Healthcare Contests Boom” showed that many of the health field’s biggest problems are being solved by analytic contests.

The rapidly evolving world of big data is always in flux. What’s hot today might be cold next week. But know that we’ll be taking the industry’s temperature every day to stay atop all the exciting changes.

Follow the Inteltrax news stream by visiting www.inteltrax.com

Patrick Roland, Editor, Inteltrax.

June 4, 2012

Lexalytics Uses Text Analytics to Find the Most Popular Superhero

May 31, 2012

The LexaBlog recently posted some interesting information about popular superheroes in the article “The Avengers: Most Popular Superhero?”

According to the article, writer Seth Redmore analyzed 330,000 tweets regarding the new Avengers superhero movie by sending out query topics on the main characters as well as the actors playing them.

Redmore breaks the information down for us with several charts showing the most to least popular characters, the most to least popular themes, and the most to least popular hash-tags.

When discussing his process, Redmore states:

“This actually does a good job of showing why I wanted to create query topics for the superheroes.  Many of their names come out looking more like themes than like proper “names”. Many of these themes aren’t particularly useful, so, I excluded a bunch of them when I was doing other sorts of analysis. Next, I decided to see what themes were most commonly associated with each of said superheroes. As I said before, I pulled out things like “watching avengers” when I was doing this analysis, as it adds nothing in terms of what people were associating with this character/actor.”

How will this aid your business? Send us your ideas via the comments section of this blog.

Jasmine Ashton, May 31, 2012

Sponsored by PolySpot

Inteltrax: Top Stories, May 21 to May 25

May 28, 2012

Inteltrax, the data fusion and business intelligence information service, captured three key stories germane to search this week, specifically, the latest happenings with some of big data’s biggest names.

Our story, “Data Analytics Expert Points to the Crux of Big Data Issues,” looked at the CEOs of Revolution Analytics and Digital Reasoning, catching up with their latest moves.

“EMC Provides a Lot of Analytic Good,” shows all the positive ways in which EMC is moving the analytic game ahead.

Meanwhile, “MicroTech Wins Military Intelligence Contract” shows this up-and-coming firm making a name for itself in defense.

There are a million different directions that analytics are moving in at any given moment, but we’ll be providing snapshots of the scene, just like this, every day. Be sure to tune in.

Follow the Inteltrax news stream by visiting www.inteltrax.com


Patrick Roland, Editor, Inteltrax.

May 28, 2012

Inteltrax: Top Stories, May 14 to May 18

May 21, 2012

Inteltrax, the data fusion and business intelligence information service, captured three key stories germane to search this week, specifically, how unstructured data is shaping the way vendors operate.

In “A Mountain of Unstructured Data,” the problem of collecting tweets, posts, pictures, videos, and more, and making analytic sense of them, is laid out.

“Unstructured Data Investment on the Horizon” shows how many companies are investing in solving their own unstructured data crises.

Finally, “Another Analytics Partnership is Born” showed companies joining forces to tackle this massive problem.

We’ve talked about unstructured data before, but we keep returning to the well because it’s such a massive concern for companies. Thankfully, those problems are being solved and we’re monitoring it every step of the way.

Follow the Inteltrax news stream by visiting www.inteltrax.com


Patrick Roland, Editor, Inteltrax.

May 21, 2012
