Questions about Statistical Data

September 4, 2014

Autonomy, Recommind, and dozens of other search and content processing firms rely on statistical procedures. Anyone who has survived Statistics 101 believes in the power of numbers. Textbook examples are, well, pat. The numbers work out even for B and C students.

The real world, on the other hand, is different. What was formulaic in the textbook exercises is more difficult with most data sets. The data are incomplete, inconsistent, generated by systems whose integrity is unknown, and often wrong. Human carelessness, the lack of time, a lack of expertise, and plain vanilla cluelessness make those nifty data sets squishier than a memory foam pillow.

If you have some questions about statistical evidence in today’s go go world, check out “I Disagree with Alan Turing and Daniel Kahneman Regarding the Strength of Statistical Evidence.”

I noted this passage:

It’s good to have an open mind. When a striking result appears in the dataset, it’s possible that this result does not represent an enduring truth or even a pattern in the general population but rather is just an artifact of a particular small and noisy dataset. One frustration I’ve had in recent discussions regarding controversial research is the seeming unwillingness of researchers to entertain the possibility that their published findings are just noise.
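The point in the passage about small, noisy datasets can be illustrated with a quick simulation. This is a minimal sketch, not anything from the cited article: the function name `striking_fraction`, the threshold of 0.5, and the sample sizes are all assumptions chosen for illustration. Two groups are drawn from the same distribution, so any gap between their means is pure noise.

```python
import random
import statistics


def striking_fraction(sample_size, trials=2000, threshold=0.5, seed=42):
    """Fraction of trials in which two groups drawn from the SAME
    normal distribution differ in mean by more than `threshold`.

    There is no real effect here, so every "striking" gap is noise.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        gap = abs(statistics.mean(group_a) - statistics.mean(group_b))
        if gap > threshold:
            hits += 1
    return hits / trials


small = striking_fraction(sample_size=10)
large = striking_fraction(sample_size=1000)
print(f"n=10:   {small:.1%} of no-effect trials look striking")
print(f"n=1000: {large:.1%} of no-effect trials look striking")
```

Under these assumptions, the small samples throw up a "striking" gap far more often than the large ones, even though nothing real separates the groups.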

An open mind is important. Just looking at the outputs of zippy systems that do prediction for various entities can be instructive. In the last couple of months, I learned that predictive systems:

  • Failed to size the Ebola outbreak by orders of magnitude
  • Did not provide reliable outputs for analysts trying to figure out where a crashed airplane was
  • Came up short regarding resources available to ISIS.

The Big Data revolution is one of those hoped for events. The idea is that Big Data will allow content processing vendors to sell big buck solutions. Another is that massive flows of unstructured content can only be tapped in a meaningful way with expensive information retrieval solutions.

Dreams, hopes, wishes—yep, all valid for children waiting for the tooth fairy. The real world has slightly more bumps and sharp places.

Stephen E Arnold, September 4, 2014

Instagram Search from

September 3, 2014

If you are interested in searching Instagram images, navigate to

The site says: is a new Instagram search engine and web viewer. Featuring millions of pictures, users, likes and comments, is your go-to source when you want to browse Instagram on computer or desktop. works on both PC and Mac.

A user can query by hashtags or user name. Instagram users assign hashtags and their handles. As a result, a query for “visualization” returns images and terms; for example, on September 2, 2014:


A more popular hashtag like “chicagobears” returns images more in line with non specialist content; for example:


Interesting but filtering and limits on access to user content may trouble some.

Stephen E Arnold, September 3, 2014

Tribler: A File Finder That Legal Eagles Will Want to Check

September 3, 2014

Short honk: We learned about Tribler, a rich media file finder. There is an interesting body of content; for example, rich media. The site says:

Tribler can find files for you. No need for websites. Tribler can do 100 Mbps, sadly we cannot fix slow Internet or poor swarms. Lots of “pro” features: magnet links, streaming, sub-second search, channels and our upcoming anonymous mode.

Note the word “anonymous.” Tribler can play videos. The site says, “You can watch even before the download is finished.”


For more information, navigate to

Stephen E Arnold, September 3, 2014

Microsoft Azure Search Documentation

September 2, 2014

Microsoft has posted information about the Azure Search service. You can find the information at Azure Search Preview. The features remind me of Amazon’s cloud search approach.

The idea is that search is available. The “How It Works” section summarizes the procedures the customer follows. The approach is intended for engineers familiar with Microsoft conventions or a consultant capable of performing the required steps.

Of particular interest to potential licensees will be the description of the pricing options. The Preview Pricing Details uses an Amazon like approach as well; for example, combinable search units. For higher demand implementations, Microsoft provides a custom price quote. The prices in the table below represent a 50 percent preview discount:


Microsoft offers different “editions” of Azure Search. Microsoft says:

Free is a free version of Azure Search designed to provide developers a sandbox to test features and implementations of Search. It is not designed for production workloads. Standard is the go-to option for building applications that benefit from a self-managed search-as-a-service solution. Standard delivers storage and predictable throughput that scales with application needs. For very high-demand applications, please contact

Support and service level agreements are available. A pricing calculator is available. Note that the estimates are not for search alone. Additional pricing information points to a page with four categories of fees and more than two dozen separate services. The link to Azure Search Pricing is self-referential, which is interesting to me.

I was not able to locate an online demo of the service. I was invited to participate in a free trial.

If you are interested in the limits for the free trial, Microsoft provides some information in its “Maximum Limits for Shared (Free) Search Service.”

Based on the documentation, correctly formed content uploaded permits full text search, facets, and hit highlighting. Specific functionalities are outlined on this reference page.

Net net: The search system is developer centric.

Stephen E Arnold, September 2, 2014

Huff Po and a Search Vendor Debunk Big Data Myths

September 1, 2014

I suppose I am narrow minded. I don’t associate the Huffington Post with high technology analyses. My ignorance is understandable because I don’t read the Web site’s content.

However, a reader sent me a link to “Top Three Big Data Myths: Debunked”, authored by a search vendor’s employee at Recommind. Now Recommind is hardly a household word. I spoke with a Recommind PR person about my perception that Recommind is a variant of the technology embodied in Autonomy IDOL. Yep, that company making headlines because of the minor dust up with Hewlett Packard. Recommind provides a probabilistic search system to customers that were originally involved in the legal market. The company has positioned its technology to other markets and added a touch of predictive magic as well. At its core, Recommind indexes content and makes the indexes available to users and other services. The company in 2010 formed a partnership with the Solcara search folks. Solcara is now the go to search engine for Thomson Reuters. I have lost track of the other deals in which Recommind has engaged.

The write up reveals quite a bit about the need for search vendors to reach a broader market in order to gain visibility to make the cost of sales bearable. This write up is a good example of content marketing and the malleability of outfits like Huffington Post. The idea, as it strikes me, is that anything that looks interesting may get a shot at building the click traffic for Ms. Huffington’s properties.

So what does the article debunk? Fasten your seat belt and take your blood pressure medicine. The content of the write up may jolt you. Ready?

First, the article reveals that “all” data are not valuable. The way the write up expresses it takes this form, “Myth #1—All Data Is Valuable.” Set aside the subject verb agreement error. Data is the plural and datum is the singular. But in this remarkable content marketing essay, grammar is not my or the author’s concern. The notion of categorical propositions applied to data is interesting and raises many questions; for example, what data? So the first myth is that if one is able to gather “all data”, it therefore follows that some is not germane. My goodness, I had a heart palpitation with this revelation.

Second, the next myth is that “with Big Data the more information the better.” I must admit this puzzles me. I am troubled by the statistical methods used to filter smaller, yet statistically valid, subsets of data. Obviously the predictive Bayesian methods of Recommind can address this issue. The challenges Autonomy like systems face are well known to some Autonomy licensees and, I assume, to the experts at Hewlett Packard. The point is that if the training information is off base by a smidge and the flow of content does not conform to the training set, the outputs are often off point. Now with “more information” the sampling purists point to sampling theory and the value of carefully crafted training sets. No problem on my end, but aren’t we emphasizing that certain non Bayesian methods are just not as wonderful as Recommind’s methods? I think so.

The third myth that the write up “debunks” is “Big Data opportunities come with no costs.” I think this is a convoluted way of saying, get ready to spend a lot of money to embrace Big Data. When I flip this debunking on its head, I get this hypothesis: “The Recommind method is less expensive than the Big Data methods that other hype artists are pitching as the best thing since sliced bread.”

The fix is “information governance.” I must admit that, like knowledge management, I have zero idea what the phrase means. Invoking a trade association anchored in document scanning does not give me confidence that an explanation will illuminate the shadows.

Net net: The myths debunked just set up myths for systems based on aging technology. Does anyone notice? Doubt it.

Stephen E Arnold, September 1, 2014

Yahoo Flickr Images: Does Search Work?

August 31, 2014

I think you know the answer if you are a regular reader of Beyond Search.


Finding images is a tedious and time consuming business. I know what the marketing collateral and public relations noise suggest. One can search by photographer, color, yada, yada.

The reality is that finding an image requires looking at images. Some find this fun, particularly if the client is paying by the hour for graphic expertise. For me, image search underscores how primitive information retrieval tools are.

Feel free to disagree.

To test Yahoo Flickr search, navigate to “Welcome to the Internet Archive to the Commons.” Check out the sample entry to the millions of public domain images.


Darned meaty.

To search the “Commons”, one has to navigate to the Commons page and scroll down to the search box highlighted in yellow in this screenshot:


Enter a query like this one “18th century elocution.”

Here’s what the system displayed:


I then tried this query “london omnibus 1870”.

Here’s what the system displayed:


No omnibuses.

As with many image retrieval systems, the user has to fiddle with queries until images are spotted by manual inspection.

The archive is useful. Finding images in Yahoo Flickr remains a problem for me. I thought Xooglers knew quite a bit about search. You know: Finding information when the user enters a key word or two.

Stephen E Arnold, August 31, 2014

Quote to Note: Facebook Search

August 31, 2014

Facebook has done little public facing work on search. Behind the scenes, Facebookers and Xooglers have been beavering away. A bit of public information surfaced in “Zuckerberg On Search — Facebook Has More Content Than Google.” Does Facebook have a trillion pieces of content? Is that more content than Google has? Nah. But it is the thought that counts.

Here’s the quote I highlighted:

What would it ultimately mean if Facebook’s search efforts are effective–and if Facebook allowed universal use of a post search tool that really worked? It’s dizzying, really. As Zuckerberg said early this year on an earnings call: “There are more than a trillion status updates and unstructured text posts and photos and pieces of content that people have shared over the past 10 years.” Then the Facebook CEO put that figure into context: “a trillion pieces of content is more than the index in any web search engine.” You know what “any web search engine” spells? That’s a funny way of spelling Google.

With Amazon nosing into ads and Facebook contemplating more public search functionality, will Google be able to respond in a manner that keeps its revenues flowing and projects like Loon flying? I wonder what the Arnold name surfer thinks about Facebook? Maybe it is a place to post musings about failed youth coaching?

Stephen E Arnold, August 31, 2014

Google and Universal Search or Google Floundering with Search

August 30, 2014

There have been some experts who have noticed that Google has degraded blog search. In the good old days, it was possible to query Google’s index of Web logs. It was not comprehensive, and it was not updated with the zippiness of years past.

Search Engine Land and Web Pro News both pointed out that redirects to Google’s main search page. The idea of universal search, as I understood it, was to provide a single search box for Google’s content. Well, that is not too useful when it is not possible to limit a query to a content type or a specific collection.

“Universal” to Google is similar to the telco’s use of the word “unlimited.”

According to the experts, it is possible to search blog content. Here’s the user friendly sequence that will be widely adopted by Google users:

  1. Navigate to the US version of Google News. Note that this can be tricky if one is accessing Google from another country
  2. Enter a query; for example, “universal search”
  3. Click on “search tools” and then click on “All news”
  4. Then click on “Blogs”


Several observations:

First, finding information in Google is becoming more and more difficult.

Second, obvious functions such as providing an easy way to run queries against separate Google indexes are anything but obvious. Do you know how to zip to Google’s patent index or its book index? Not too many folks do.

Third, the “logic” of making search a puzzle is no longer of interest to me. Increasing latency in indexing, Web sites that are pushed deep in the index for a reason unrelated to the site’s content, and a penchant for hiding information point to some deep troubles in Google search.

Net net: Google has lost its way in search. Too bad. As the volume of information goes up, the findability goes down. Wild stuff like Loon and Glass go up. Let’s hope Google can keep its ad revenue flowing; otherwise, there would be little demand for individuals who can perform high value research.

Stephen E Arnold, August 30, 2014

Google: Authors Not Helping Traffic

August 30, 2014

First, Google removed operators for Boolean queries. Then, Google started suggesting what I wanted. Now, Google does away with authors. These steps improve user experience. In John Mueller’s Google Plus post I learned:

(If you’re curious — in our tests, removing authorship generally does not seem to reduce traffic to sites. Nor does it increase clicks on ads. We make these kinds of changes to improve our users’ experience.)

No, I am not curious. I know several things. Precision and recall are less and less useful to Google.
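For readers who have not bumped into the two measures, here is a minimal sketch of how precision and recall are computed for a single query. The document identifiers and the function name `precision_recall` are hypothetical, invented for illustration; nothing here reflects how Google evaluates its own results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant hits / everything the system returned
    recall    = relevant hits / everything that was relevant
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: the system returns four documents, two of which
# are among the three documents a researcher actually needed.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A system tuned for ad clicks can score well on neither measure and still do fine commercially, which is the gap the post is pointing at.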

What is important is ad revenue. Google wants a way to sell ads to fund projects like Loon, Glass, and drones. Oh, pesky authors anyway.

Stephen E Arnold, August 30, 2014

IBM Watson and Research

August 29, 2014

The IBM Watson content marketing machine grinds on. This time, IBM’s Hail Mary is making Watson into a research assistant. Let’s see. Watson does cancer treatment, recipe invention, and insurance analyses. In “IBM Sees Broader Role for Watson in Aiding Research,” the operative word is “sees”, not shipping, sold, market dominance, and similar “got it done” phrases. Heck, there’s not even a public demo on Wikipedia data or a collection of patents.

The write up cheers me forward with:

With the aid of Watson, companies could better mine that private information and combine it with scientific data in the public domain.

One company studying such possibilities to evaluate medications and treatments is Johnson & Johnson, IBM said. But the company sees applications beyond the health realm, including making automated suggestions based on financial, legal, energy and intelligence-related information, IBM said.

Watson has to generate lots of dough and fast. IBM expects the Watson “system” to produce billions in revenue in five or six years. What Watson is producing is more credibility problems for search vendors with technology that “sort of” works.

I had a query yesterday from a consultant whose client wants to use IBM Watson technology. I suggested that if IBM will fund the quest for a brass ring, go for it. Have a Plan B.

In the meantime, I find the Watson arabesques pretty darned interesting. With HP planning billions from Autonomy, where is this money going to come from? No one seems to think much about the need to have a product that solves a problem for a specific company.

No “saids” or “sees” required. Just a business built on open source technology and home grown code. IBM is fascinating, as are its content marketing methods. Quite an end of summer announcement. How about a live demo? I am weary of Jeopardy references.

Stephen E Arnold, August 29, 2014
