Huff Po and a Search Vendor Debunk Big Data Myths

September 1, 2014

I suppose I am narrow minded. I don’t associate the Huffington Post with high technology analyses. My ignorance is understandable because I don’t read the Web site’s content.

However, a reader sent me a link to “Top Three Big Data Myths: Debunked”, authored by a search vendor’s employee at Recommind. Now Recommind is hardly a household word. I spoke with a Recommind PR person about my perception that Recommind is a variant of the technology embodied in Autonomy IDOL. Yep, that company making headlines because of the minor dust up with Hewlett Packard. Recommind provides a probabilistic search system to customers that were originally involved in the legal market. The company has positioned its technology to other markets and added a touch of predictive magic as well. At its core, Recommind indexes content and makes the indexes available to users and other services. The company in 2010 formed a partnership with the Solcara search folks. Solcara is now the go to search engine for Thomson Reuters. I have lost track of the other deals in which Recommind has engaged.

The write up reveals quite a bit about the need for search vendors to reach a broader market in order to gain visibility to make the cost of sales bearable. This write up is a good example of content marketing and the malleability of outfits like Huffington Post. The idea, it strikes me, is that anything that looks interesting may get a shot at building the click traffic for Ms. Huffington’s properties.

So what does the article debunk? Fasten your seat belt and take your blood pressure medicine. The content of the write up may jolt you. Ready?

First, the article reveals that “all” data are not valuable. The way the write up expresses it takes this form, “Myth #1—All Data Is Valuable.” Set aside the subject verb agreement error. Data is the plural and datum is the singular. But in this remarkable content marketing essay, grammar is not my or the author’s concern. The notion of categorical propositions applied to data is interesting and raises many questions; for example, what data? So the first myth is that if one is able to gather “all data”, it therefore follows that some is not germane. My goodness, I had a heart palpitation with this revelation.

Second, the next myth is that “with Big Data the more information the better.” I must admit this puzzles me. I am troubled by the statistical methods used to filter smaller, yet statistically valid, subsets of data. Obviously the predictive Bayesian methods of Recommind can address this issue. The challenges Autonomy like systems face are well known to some Autonomy licensees and, I assume, to the experts at Hewlett Packard. The point is that if the training information is off base by a smidge and the flow of content does not conform to the training set, the outputs are often off point. Now with “more information” the sampling purists point to sampling theory and the value of carefully crafted training sets. No problem on my end, but aren’t we emphasizing that certain non Bayesian methods are just not as wonderful as Recommind’s methods? I think so.

The third myth that the write up “debunks” is “Big Data opportunities come with no costs.” I think this is a convoluted way of saying that one should get ready to spend a lot of money to embrace Big Data. When I flip this debunking on its head, I get this hypothesis: “The Recommind method is less expensive than the Big Data methods that other hype artists are pitching as the best thing since sliced bread.”

The fix is “information governance.” I must admit that, as with knowledge management, I have zero idea what the phrase means. Invoking a trade association anchored in document scanning does not give me confidence that an explanation will illuminate the shadows.

Net net: The myths debunked just set up myths for systems based on aging technology. Does anyone notice? Doubt it.

Stephen E Arnold, September 1, 2014

The Importance of Publishing Replication Studies in Academic Journals

September 1, 2014

The article titled Why Psychologists’ Food Fight Matters on Slate discusses the issue of the lack of replication studies published in academic journals. In most cases, journals are looking for new information, exciting information, which will draw in their readers. While that is only to be expected, it can also cause huge problems for the scientific method. Replication studies are important because science is built on laws. If a study cannot be replicated, then its finding should not be taken for granted. The article states,

“Since journal publications are valuable academic currency, researchers—especially those early in their careers—have strong incentives to conduct original work rather than to replicate the findings of others. Replication efforts that do happen but fail to find the expected effect are usually filed away rather than published. That makes the scientific record look more robust and complete than it is—a phenomenon known as the “file drawer problem.””

When scientists have an incentive to get positive results from a study, and little to no incentive to do replication studies, the results are obvious. Manipulation of data occurs, and few replication studies are completed. This also means that when the rare replication study is done, and refutes the positive finding, the scientist responsible for the false positive is a scapegoat for a much larger problem. The article suggests that academic journals encouraging more replication studies would assuage this problem.
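The file drawer effect described above can be seen in a toy simulation. What follows is only a rough sketch and not anything taken from the Slate article: the function name `simulate_file_drawer`, the sample sizes, and the 1.96 cutoff are all assumptions chosen for illustration. Every simulated study measures an effect that does not exist, and only the studies that happen to clear the cutoff are "published."

```python
import random
import statistics


def simulate_file_drawer(n_studies=1000, n_per_group=30, seed=0):
    """Run many studies of a nonexistent effect; 'publish' only the positive hits.

    All parameters are hypothetical values picked for illustration.
    """
    rng = random.Random(seed)
    threshold = 1.96  # assumed cutoff for calling a result "significant"
    all_effects, published = [], []
    for _ in range(n_studies):
        # Both groups come from the same distribution, so the true effect is zero.
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = statistics.mean(treated) - statistics.mean(control)
        std_err = (statistics.variance(treated) / n_per_group
                   + statistics.variance(control) / n_per_group) ** 0.5
        all_effects.append(diff)
        if diff / std_err > threshold:
            # Only the "exciting" result leaves the file drawer.
            published.append(diff)
    return all_effects, published


if __name__ == "__main__":
    everything, journal = simulate_file_drawer()
    print("studies run:", len(everything))
    print("studies published:", len(journal))
    print("mean effect, all studies:", statistics.mean(everything))
    print("mean effect, published only:", statistics.mean(journal))
```

In this sketch the published subset shows a positive average effect even though none exists, while the full set of studies averages out near zero. A replication attempt drawn from the same process would usually land in the unpublished pile.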

Chelsea Kerwin, September 01, 2014

Sponsored by, developer of Augmentext

The Abilities and Promise of Watson, IBM’s Reasoning Computer

September 1, 2014

A video titled The Computer That’s Smarter Than YOU & I offers an explanation of Watson, IBM’s supercomputer. It begins with the beginning of civilization and humankind’s constant innovation since. With the creation of the microchip, modern technology really began to ramp up, and the video asks (somewhat rhetorically) what the next great technological innovation will be. The answer is: the reasoning computer. The video shows a demo of the supercomputer trying to understand pros and cons on the sale of violent video games. Watson worked through the topic as follows,

“Scanned approximately 4 million Wikipedia articles. Returning ten most relevant articles. Scanned all three thousand sentences in top ten articles. Detected sentences which contain candidate claims. Identified borders of candidate claims. Assessed pro and con polarity of candidate claims. Constructed demo speech… the sale of violent video games should be banned.”

Watson went on to list his reasons for choosing this stance, such as “exposure to violent video games results in increased physiological arousal.” But he also offered a refutation, that the link between the games and actual violent action has not been proven. The ability of the computer to reason on its own, without human aid, is touted as the truly exciting innovation. Meanwhile, we are still waiting for a publicly accessible demo.

Chelsea Kerwin, September 01, 2014

Sponsored by, developer of Augmentext

Yahoo Flickr Images: Does Search Work?

August 31, 2014

I think you know the answer if you are a regular reader of Beyond Search.


Finding images is a tedious and time consuming business. I know what the marketing collateral and public relations noise suggests. One can search by photographer, color, yada, yada.

The reality is that finding an image requires looking at images. Some find this fun, particularly if the client is paying by the hour for graphic expertise. For me, image search underscores how primitive information retrieval tools are.

Feel free to disagree.

To test Yahoo Flickr search, navigate to “Welcome to the Internet Archive to the Commons.” Check out the sample entry to the millions of public domain images.


Darned meaty.

To search the “Commons”, one has to navigate to the Commons page and scroll down to the search box highlighted in yellow in this screenshot:


Enter a query like this one “18th century elocution.”

Here’s what the system displayed:


I then tried this query “london omnibus 1870”.

Here’s what the system displayed:


No omnibuses.

As with many image retrieval systems, the user has to fiddle with queries until images are spotted by manual inspection.

The archive is useful. Finding images in Yahoo Flickr remains a problem for me. I thought Xooglers knew quite a bit about search. You know: Finding information when the user enters a key word or two.

Stephen E Arnold, August 31, 2014

Quote to Note: Facebook Search

August 31, 2014

Facebook has done little public facing work on search. Behind the scenes, Facebookers and Xooglers have been beavering away. A bit of public information surfaced in “Zuckerberg On Search — Facebook Has More Content Than Google.” Does Facebook have a trillion pieces of content? Is that more content than Google has? Nah. But it is the thought that counts:

Here’s the quote I highlighted:

What would it ultimately mean if Facebook’s search efforts are effective–and if Facebook allowed universal use of a post search tool that really worked? It’s dizzying, really. As Zuckerberg said early this year on an earnings call: “There are more than a trillion status updates and unstructured text posts and photos and pieces of content that people have shared over the past 10 years.” Then the Facebook CEO put that figure into context: “a trillion pieces of content is more than the index in any web search engine.” You know what “any web search engine” spells? That’s a funny way of spelling Google.

With Amazon nosing into ads and Facebook contemplating more public search functionality, will Google be able to respond in a manner that keeps its revenues flowing and projects like Loon flying? I wonder what the Arnold name surfer thinks about Facebook? Maybe it is a place to post musings about failed youth coaching?

Stephen E Arnold, August 31, 2014

Google and Universal Search or Google Floundering with Search

August 30, 2014

There have been some experts who have noticed that Google has degraded blog search. In the good old days, it was possible to query Google’s index of Web logs. It was not comprehensive, and it was not updated with the zippiness of years past.

Search Engine Land and Web Pro News both pointed out that Google’s blog search now redirects to Google’s main search page. The idea of universal search, as I understood it, was to provide a single search box for Google’s content. Well, that is not too useful when it is not possible to limit a query to a content type or a specific collection.

“Universal” to Google is similar to the telcos’ use of the word “unlimited.”

According to the experts, it is possible to search blog content. Here’s the user friendly sequence that will be widely adopted by Google users:

  1. Navigate to the US version of Google News. Note that this can be tricky if one is accessing Google from another country.
  2. Enter a query; for example, “universal search”
  3. Click on “search tools” and then click on “All news”
  4. Then click on “Blogs”


Several observations:

First, finding information in Google is becoming more and more difficult.

Second, obvious functions such as an easy way to run queries against separate Google indexes are anything but obvious. Do you know how to zip to Google’s patent index or its book index? Not too many folks do.

Third, the “logic” of making search a puzzle is no longer of interest to me. Increasing latency in indexing, Web sites that are pushed deep in the index for a reason unrelated to the site’s content, and a penchant for hiding information point to some deep troubles in Google search.

Net net: Google has lost its way in search. Too bad. As the volume of information goes up, the findability goes down. Wild stuff like Loon and Glass go up. Let’s hope Google can keep its ad revenue flowing; otherwise, there would be little demand for individuals who can perform high value research.

Stephen E Arnold, August 30, 2014

Google: Authors Not Helping Traffic

August 30, 2014

First, Google removed operators for Boolean queries. Then, Google started suggesting what I wanted. Now, Google does away with authors. These steps improve user experience. In John Mueller’s Google Plus post I learned:

(If you’re curious — in our tests, removing authorship generally does not seem to reduce traffic to sites. Nor does it increase clicks on ads. We make these kinds of changes to improve our users’ experience.)

No, I am not curious. I know several things. Precision and recall are less and less useful to Google.

What is important is ad revenue. Google wants a way to sell ads to fund projects like Loon, Glass, and drones. Oh, pesky authors anyway.
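For readers who want the two measures mentioned above pinned down, here is a minimal sketch of how precision and recall are commonly computed for a single query. The document identifiers are made up, the function name `precision_recall` is my own, and returning 0.0 for an empty set is an assumed convention; nothing here reflects how Google evaluates its results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the system returned
    relevant:  documents a human judged to be on point
    Returning 0.0 when a set is empty is an assumed convention.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what share of the returned documents is on point.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the on-point documents was returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical result list and relevance judgments.
    returned = ["d1", "d2", "d3", "d4"]
    on_point = ["d2", "d3", "d5"]
    print(precision_recall(returned, on_point))
```

With these made-up lists, two of the four returned documents are on point, and two of the three on-point documents were returned. A system tuned for ad clicks has no particular reason to push either number up.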

Stephen E Arnold, August 30, 2014

Hewlett Packard May Sue Accounting Firm over Autonomy Deal

August 30, 2014

Hewlett Packard fatigue is nibbling at my consciousness. I read “Hewlett-Packard Plans to Sue Deloitte’s UK Arm over Autonomy Audit.” HP appears to find others to blame for its decision to purchase Autonomy. The write up says:

Hewlett-Packard plans to sue the UK arm of accountancy firm Deloitte over its role in auditing Autonomy, the software company HP acquired but later accused of inflating financial figures, a lawyer for the US company said in court on Monday.

The Autonomy matter does keep HP in the news. However, the steady background hum of allegations about impropriety at Autonomy is like white noise. After a short time, the sound fades away.

The Autonomy matter, like the Fast Search & Technology financial restatement, suggests that search is a tough business to make into a massive, sustainable revenue stream.

Buying search technology appears to deliver headaches to those involved. Do the Autonomy and Fast Search issues suggest that content processing is easy to talk about and tough to turn into solutions that make everyone involved happy? Ooops. One group is very happy: the lawyers.

Stephen E Arnold, August 30, 2014

IBM Watson and Research

August 29, 2014

The IBM Watson content marketing machine grinds on. This time, IBM’s Hail Mary is making Watson into a research assistant. Let’s see. Watson does cancer treatment, recipe invention, and insurance analyses. In “IBM Sees Broader Role for Watson in Aiding Research,” the operative word is “sees”, not shipping, sold, market dominance, and similar “got it done” phrases. Heck, there’s not even a public demo on Wikipedia data or a collection of patents.

The write up cheers me forward with:

With the aid of Watson, companies could better mine that private information and combine it with scientific data in the public domain.

One company studying such possibilities to evaluate medications and treatments is Johnson & Johnson, IBM said. But the company sees applications beyond the health realm, including making automated suggestions based on financial, legal, energy and intelligence-related information, IBM said.

Watson has to generate lots of dough and fast. IBM expects the Watson “system” to produce billions in revenue in five or six years. What Watson is producing is more credibility problems for search vendors with technology that “sort of” works.

I had a query yesterday from a consultant whose client wants to use IBM Watson technology. I suggested that if IBM will fund the quest for a brass ring, go for it. Have a Plan B.

In the meantime, I find the Watson arabesques pretty darned interesting. With HP planning billions from Autonomy, where is this money going to come from? No one seems to think much about the need to have a product that solves a problem for a specific company.

No “saids” or “sees” required. Just a business built on open source technology and home grown code. IBM is fascinating as are its content marketing methods. Quite an end of summer announcement. How about a live demo? I am weary of Jeopardy references.

Stephen E Arnold, August 29, 2014

Fixing US Government Information Technology

August 29, 2014

Short honk: I found this item amusing: “America’s Tech Guru Steps Down—But He’s Not Done Rebooting the Government.” Let’s see. There was and then the missing IRS emails. I heard about a few other minor glitches, but these are not germane. The notion is that a “tech guru” can fix government IT from outside the government. I think this means getting into the consulting and engineering services game.

Optimism is evident; for example:

Park wants to move government IT into the open source, cloud-based, rapid-iteration environment that is second nature to the crowd considering his pitch tonight. The president has given reformers like him leave, he told them, “to blow everything … up and make it radically better.”

Okay, I suppose some folks are waiting. Will Booz Allen, CSC, SAIC, SRA, and IBM Federal lose sleep tonight? Nope. Some will probably be chuckling as I did.

This is a get funding, bill, submit engineering change order, bill, get funding, etc. etc. world. Improvement is usually a lower priority task whether one is inside or outside the entity.

Stephen E Arnold, August 29, 2014
