Huff Po and a Search Vendor Debunk Big Data Myths

September 1, 2014

I suppose I am narrow minded. I don’t associate the Huffington Post with high technology analyses. My ignorance is understandable because I don’t read the Web site’s content.

However, a reader sent me a link to “Top Three Big Data Myths: Debunked”, authored by a search vendor’s employee at Recommind. Now Recommind is hardly a household word. I spoke with a Recommind PR person about my perception that Recommind is a variant of the technology embodied in Autonomy IDOL. Yep, that company making headlines because of the minor dust up with Hewlett Packard. Recommind provides a probabilistic search system to customers that were originally involved in the legal market. The company has positioned its technology to other markets and added a touch of predictive magic as well. At its core, Recommind indexes content and makes the indexes available to users and other services. The company in 2010 formed a partnership with the Solcara search folks. Solcara is now the go to search engine for Thomson Reuters. I have lost track of the other deals in which Recommind has engaged.

The write up reveals quite a bit about the need for search vendors to reach a broader market in order to gain visibility to make the cost of sales bearable. This write up is a good example of content marketing and the malleability of outfits like Huffington Post. The idea, as it strikes me, is that anything that looks interesting may get a shot at building click traffic for Ms. Huffington’s properties.

So what does the article debunk? Fasten your seat belt and take your blood pressure medicine. The content of the write up may jolt you. Ready?

First, the article reveals that “all” data are not valuable. The way the write up expresses it takes this form, “Myth #1—All Data Is Valuable.” Set aside the subject verb agreement error. Data is the plural and datum is the singular. But in this remarkable content marketing essay, grammar is not my or the author’s concern. The notion of categorical propositions applied to data is interesting and raises many questions; for example, what data? So the first myth is that if one is able to gather “all data”, it therefore follows that some is not germane. My goodness, I had a heart palpitation with this revelation.

Second, the next myth is that “with Big Data the more information the better.” I must admit this puzzles me. I am troubled by the statistical methods used to filter smaller, yet statistically valid, subsets of data. Obviously the predictive Bayesian methods of Recommind can address this issue. The challenges Autonomy-like systems face are well known to some Autonomy licensees and, I assume, to the experts at Hewlett Packard. The point is that if the training information is off base by a smidge and the flow of content does not conform to the training set, the outputs are often off point. Now with “more information” the sampling purists point to sampling theory and the value of carefully crafted training sets. No problem on my end, but aren’t we emphasizing that certain non Bayesian methods are just not as wonderful as Recommind’s methods? I think so.
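For readers who want to see the training set problem in miniature, here is a rough sketch. It is a toy naive Bayes classifier with add-one smoothing, not Recommind's or Autonomy's actual method, which I have not seen. The labels, the four training snippets, and the function names are all made up for illustration. The point is simply that when incoming content shares no vocabulary with the training set, the classifier has nothing to go on.

```python
import math
from collections import Counter

def train(docs):
    """Count words per label and documents per label.

    docs is a list of (label, text) pairs. Everything here is a toy
    stand-in for a real training set.
    """
    word_counts = {}
    doc_counts = Counter()
    for label, text in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def posterior(text, word_counts, doc_counts):
    """Return P(label | text) under naive Bayes with add-one smoothing."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # prior
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize in log space to avoid underflow.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}

# Hypothetical training set: two labels, two short documents each.
training = [
    ("legal", "contract lawsuit court filing"),
    ("legal", "court ruling contract dispute"),
    ("sports", "match goal team score"),
    ("sports", "team season goal win"),
]
word_counts, doc_counts = train(training)

# Content that conforms to the training set: strongly favors "legal".
print(posterior("court contract dispute", word_counts, doc_counts))

# Content that has drifted away from the training vocabulary: a coin flip.
print(posterior("tweet selfie hashtag", word_counts, doc_counts))
```

The second query is the "off point" case. Every word is unseen, so the smoothed likelihoods are identical for both labels and the output falls back to the prior. More content of that kind does not help until someone revises the training set.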

The third myth that the write up “debunks” is “Big Data opportunities come with no costs.” I think this is a convoluted way of saying: get ready to spend a lot of money to embrace Big Data. When I flip this debunking on its head, I get this hypothesis: “The Recommind method is less expensive than the Big Data methods that other hype artists are pitching as the best thing since sliced bread.”

The fix is “information governance.” I must admit that, as with knowledge management, I have zero idea what the phrase means. Invoking a trade association anchored in document scanning does not give me confidence that an explanation will illuminate the shadows.

Net net: The myths debunked just set up myths for systems based on aging technology. Does anyone notice? Doubt it.

Stephen E Arnold, September 1, 2014

Yahoo Flickr Images: Does Search Work?

August 31, 2014

I think you know the answer if you are a regular reader of Beyond Search.

Nope.

Finding images is a tedious and time consuming business. I know what the marketing collateral and public relations noise suggest. One can search by photographer, color, yada, yada.

The reality is that finding an image requires looking at images. Some find this fun, particularly if the client is paying by the hour for graphic expertise. For me, image search underscores how primitive information retrieval tools are.

Feel free to disagree.

To test Yahoo Flickr search, navigate to “Welcome to the Internet Archive to the Commons.” Check out the sample entry to the millions of public domain images.

image

Darned meaty.

To search the “Commons”, one has to navigate to the Commons page and scroll down to the search box highlighted in yellow in this screenshot:

image

Enter a query like this one “18th century elocution.”

Here’s what the system displayed:

image

I then tried this query “london omnibus 1870”.

Here’s what the system displayed:

image

No omnibuses.

Like many image retrieval systems, the user has to fiddle with queries until images are spotted by manual inspection.

The archive is useful. Finding images in Yahoo Flickr remains a problem for me. I thought Xooglers knew quite a bit about search. You know: Finding information when the user enters a key word or two.

Stephen E Arnold, August 31, 2014

Quote to Note: Facebook Search

August 31, 2014

Facebook has done little public facing work on search. Behind the scenes, Facebookers and Xooglers have been beavering away. A bit of public information surfaced in “Zuckerberg On Search — Facebook Has More Content Than Google.” Does Facebook have a trillion pieces of content? Is that more content than Google has? Nah. But it is the thought that counts.

Here’s the quote I highlighted:

What would it ultimately mean if Facebook’s search efforts are effective–and if Facebook allowed universal use of a post search tool that really worked? It’s dizzying, really. As Zuckerberg said early this year on an earnings call: “There are more than a trillion status updates and unstructured text posts and photos and pieces of content that people have shared over the past 10 years.” Then the Facebook CEO put that figure into context: “a trillion pieces of content is more than the index in any web search engine.” You know what “any web search engine” spells? That’s a funny way of spelling Google.

With Amazon nosing into ads and Facebook contemplating more public search functionality, will Google be able to respond in a manner that keeps its revenues flowing and projects like Loon flying? I wonder what the Arnold name surfer thinks about Facebook? Maybe it is a place to post musings about failed youth coaching?

Stephen E Arnold, August 31, 2014

Google and Universal Search or Google Floundering with Search

August 30, 2014

There have been some experts who have noticed that Google has degraded blog search. In the good old days, it was possible to query Google’s index of Web logs. It was not comprehensive, and it was not updated with the zippiness of years past.

Search Engine Land and Web Pro News both pointed out that www.google.com/blogsearch redirects to Google’s main search page. The idea of universal search, as I understood it, was to provide a single search box for Google’s content. Well, that is not too useful when it is not possible to limit a query to a content type or a specific collection.

“Universal” to Google is similar to the telco’s use of the word “unlimited.”

According to the experts, it is possible to search blog content. Here’s the user friendly sequence that will be widely adopted by Google users:

  1. Navigate to the US version of Google News. Note that this can be tricky if one is accessing Google from another country.
  2. Enter a query; for example, “universal search”
  3. Click on “search tools” and then click on “All news”
  4. Then click on “Blogs”

image

Several observations:

First, finding information in Google is becoming more and more difficult.

Second, an obvious function such as providing an easy way to run queries against separate Google indexes is anything but obvious. Do you know how to zip to Google’s patent index or its book index? Not too many folks do.

Third, the “logic” of making search a puzzle is no longer of interest to me. Increasing latency in indexing, Web sites that are pushed deep in the index for a reason unrelated to the site’s content, and a penchant for hiding information point to some deep troubles in Google search.

Net net: Google has lost its way in search. Too bad. As the volume of information goes up, the findability goes down. Wild stuff like Loon and Glass goes up. Let’s hope Google can keep its ad revenue flowing; otherwise, there would be little demand for individuals who can perform high value research.

Stephen E Arnold, August 30, 2014

Google: Authors Not Helping Traffic

August 30, 2014

First, Google removed operators for Boolean queries. Then, Google started suggesting what I wanted. Now, Google does away with authors. These steps improve user experience. In John Mueller’s Google Plus post I learned:

(If you’re curious — in our tests, removing authorship generally does not seem to reduce traffic to sites. Nor does it increase clicks on ads. We make these kinds of changes to improve our users’ experience.)

No, I am not curious. I know several things. Precision and recall are less and less useful to Google.

What is important is ad revenue. Google wants a way to sell ads to fund projects like Loon, Glass, and drones. Oh, pesky authors anyway.

Stephen E Arnold, August 30, 2014

IBM Watson and Research

August 29, 2014

The IBM Watson content marketing machine grinds on. This time, IBM’s Hail Mary is making Watson into a research assistant. Let’s see. Watson does cancer treatment, recipe invention, and insurance analyses. In “IBM Sees Broader Role for Watson in Aiding Research,” the operative word is “sees”, not shipping, sold, market dominance, and similar “got it done” phrases. Heck, there’s not even a public demo on Wikipedia data or a collection of patents.

The write up cheers me forward with:

With the aid of Watson, companies could better mine that private information and combine it with scientific data in the public domain.

One company studying such possibilities to evaluate medications and treatments is Johnson & Johnson, IBM said. But the company sees applications beyond the health realm, including making automated suggestions based on financial, legal, energy and intelligence-related information, IBM said.

Watson has to generate lots of dough and fast. IBM expects the Watson “system” to produce billions in revenue in five or six years. What Watson is producing is more credibility problems for search vendors with technology that “sort of” works.

I had a query yesterday from a consultant whose client wants to use IBM Watson technology. I suggested that if IBM will fund the quest for a brass ring, go for it. Have a Plan B.

In the meantime, I find the Watson arabesques pretty darned interesting. With HP planning billions from Autonomy, where is this money going to come from? No one seems to think much about the need to have a product that solves a problem for a specific company.

No “saids” or “sees” required. Just a business built on open source technology and home grown code. IBM is fascinating, as are its content marketing methods. Quite an end of summer announcement. How about a live demo? I am weary of Jeopardy references.

Stephen E Arnold, August 29, 2014

How to End Google’s Search Monopoly if You Want To

August 29, 2014

The article on makeuseof titled Help End Google’s Search Monopoly: Use Something Else implores Internet users to consider alternatives for search on the basis of a very simple concept: monopolies are bad. Without a doubt, Google is a monopoly, with the Chinese Baidu in a lagging second place. The amount of power this gives Google is the main target of the article, not Google itself, interestingly. The article states,

“The ball is always in Google’s court – they control the search game. This breeds a culture of tailoring content to what Google wants, with the problem being that nobody really knows what this is. Most “SEO experts” will tell you they know how to get your site ranking highly, but really they have no greater insight into what goes on behind the scenes than you do.

We’re not bitter, that’s not the point of this article.”

They are referring to Panda, Google’s 2011 filter that removed lower quality content websites from searches. This benefitted some sites, but it also had far-reaching negative implications for any number of sites. This is why monopolies are bad, not because Google is inherently evil but because they are making decisions that can affect huge numbers of people and businesses. It may be too late to recommend alternatives like DuckDuckGo, since Google is so ingrained in its users as the only option for search.

Chelsea Kerwin, August 29, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Short Honk: Surveillance Database Report

August 26, 2014

I wanted to document a report that ICREACH exists. For information, see The Intercept’s report. No further comment from Beyond Search.

Stephen E Arnold, August 26, 2014

Endeca Wins Over Beauty Retailer

August 26, 2014

To overhaul the customer experience on their site, ULTA Beauty turned to Endeca. We learn of the move from Integrated Solutions for Retailers in, “Thanx Media’s Oracle Endeca and ULTA Beauty Take Customer Experience to the Next Level.” Thanx Media is ULTA’s integrated-search-solutions provider. The press release tells us:

“Oracle Endeca has replaced a third party search solution, now tightly integrating the browse and search navigation, resulting in a consistent guest experience with minimal maintenance. The previous lack of integration with the third party search solution caused discrepancies in product data (such as pricing and inventory levels between search and browse) resulting in product listing pages that didn’t always match and a process that lacked the flexibility required by the e-commerce business team.”

Those are indeed serious problems for a retail site. How did the switch pan out? The write-up makes it clear that the reseller is very, very happy. Less clear is how, exactly, the system paid off for ULTA. Aside from a tangential reference to “positive Q4 results,” we are given no details. Oh, well. At least the middleman is pleased.

Cynthia Murrell, August 26, 2014

Questioning How To Search New Sound Files

August 25, 2014

Sound is an underrated science, but it is quite an amazing topic to study. MIT News reports an amazing experiment: “Extracting Audio From Visual Information.” The article explains that Adobe, Microsoft, and MIT researchers developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. The team has been able to get audible files of the leaves of a potted plant, the surface of a glass of water, aluminum foil, and vibrations from a potato-chip bag.

The sound files can be used by law enforcement organizations, but MIT graduate student Abe Davis says the technique creates a “new kind of imaging.”

“ ‘We’re recovering sounds from objects,’ [Davis] says. ‘That gives us a lot of information about the sound that’s going on around the object, but it also gives us a lot of information about the object itself, because different objects are going to respond to sound in different ways.’”

The team speculates that the technology community will embrace the research and amazing applications will be developed from it. The new sound technology will also create a new slew of content. How will we search the new content? A specific and exact ontology will be needed to distinguish sound files. Will a search application smart enough to read the sound data be developed to identify the user’s information need? Oh wait, enterprise search systems index “all information” so it already exists.

Whitney Grace, August 25, 2014
