Couchdb-Lucene Search

July 31, 2010

The Beyond Search goslings noticed a post from R Newson about couchdb-lucene search. A bug fix was posted. Couchdb-lucene enables full text searching of couchdb documents. The project is hosted on Github. Couchdb-lucene uses Apache Tika to index attachments. File types supported include Microsoft formats, Java class files, jar archives, XML, and about a dozen others. Queries use Lucene’s default syntax. There are nine optional parameters which permit more sophisticated searches; for example, limit, which sets the maximum number of documents to return in response to a query.
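For readers who want to see what a query with the limit parameter might look like, here is a minimal sketch in Python. It only assembles the request URL. The host, database name, and index path are made up for illustration, and the helper build_search_url is hypothetical, not part of couchdb-lucene; check the project's README for the exact path layout of your version.

```python
from urllib.parse import urlencode


def build_search_url(base_url, index_path, query, limit=None):
    """Assemble a search URL from a Lucene query string.

    base_url and index_path are assumptions for illustration; the
    real path layout depends on how couchdb-lucene is installed.
    """
    params = {"q": query}  # q carries the Lucene-syntax query
    if limit is not None:
        # limit caps the number of documents returned for the query
        params["limit"] = int(limit)
    return f"{base_url.rstrip('/')}/{index_path.strip('/')}?{urlencode(params)}"


# Hypothetical host, database, and index names.
url = build_search_url(
    "http://localhost:5984",
    "mydb/_fti/_design/docs/by_title",
    "title:goose",
    limit=5,
)
print(url)
```

The query string is URL-encoded, so Lucene operators, colons, and quotation marks survive the trip to the server. Leaving limit out lets the server fall back on its own default.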

Stephen E Arnold, July 30, 2010

The New Ask Is the Same Old Ask

July 30, 2010

I have written so much about Ask.com, formerly AskJeeves.com, that I am not going to go over the long and quite interesting history. I want to talk about DirectHit, the enterprise play, the fling with the Rutgers wizards, and the death of the smirking butler.


Won’t do it. No.

I want to direct your attention to “Can Ask.com’s New Search Strategy Work?” The article summarizes the most recent type-a-question, get-an-answer approach for the Web search service. The article points out that Ask.com is in the Q&A based query business, and that’s close enough for horseshoes in a race with a leader way out front, number two loaded down with billions of dollars and a crazed gleam in its eye, a confused third place runner, and then dear old Ask.com.

The write up does a good job of explaining how the system answers questions. There is a reference to “proprietary matching technology”, a secret sauce. Don’t forget the human element. And there’s a comment that is okay with me for a company that has not done much since one of the Ziff fellows with whom I worked labored in the AskJeeves vineyard early in the company’s history. Here’s the passage I noted:

Ask.com’s new strategy could deliver an indirect and slightly ironic benefit: By creating an ever-growing number of user-generated answer pages, Ask will likely gain decent placement in Google searches for commonly discussed topics. And that’ll mean people searching Google for information will end up clicking on Ask’s answers. So even if it doesn’t attract hoards of new long-term users, Ask may find enough added incidental traffic to help it grow as a small but consistently present niche player.

From my point of view, search has lost its focus. The Google style stuff is too expensive for too many to sign on to index the Web. There are too many things to code around. The ad world is chugging along, but there has not been a change in Google’s revenue because no one has found a way to trip Googzilla.

The surge in mobile devices and smaller form factors makes the result list look dorky. The notion of answering a tough question is tough because lots of Web users want to type 2.3 words, hit the enter key, and be done with search. Others are happy taking whatever the mobile device spits out. Pizza? Hey, there’s one. Looks good to me. Other vendors are using saved queries and just pumping stuff to people who match a profile a person created or an algorithm ginned up.

The addled goose has some ideas for an outfit like Ask.com, but I live in Harrod’s Creek, and the Barry Diller mavens live far away from the pond filled with mine run off.

Oh, I did a query for Beyond Search, and it came up number one. Exciting.

But when I read “Competitors beware: Innovation in search benefits Google,” I learned that Barry Diller learned that advertising via Nascar sponsorships, tossing in new tricks, and dreaming of a big winner were wrong headed. Ah, what one can learn by doing.

Stephen E Arnold, July 30, 2010

When Domains Collide: The Apple-Time Bonk

July 30, 2010

My NFAIS lecture will appear in a forthcoming book of essays. I won’t drag you through the argument in that write up. I do want to call your attention to “Apple Blocks Time, Others from Running iPad Subscriptions.” The issue is that Apple is taking a somewhat arbitrary stand with regard to a publisher using the Apple ecosystem to sell an Apple device user a subscription without Apple being in the middle. For me the key passage in the write up was:

Why Apple would reject subscriptions isn’t known, but it’s speculated that the company may be worried about how publishers would use the consumer data collected with each subscription, even though such collection is standard in the print world. Apple might alternately be worried about missed revenue opportunities, since allowing direct payment for subscriptions would cut the company out of some or a lot of income. The latter approach would be incongruous though, since Amazon and the Wall Street Journal can already bill customers directly in some cases.

The view from the goose pond is easy to explain. Publishers want control but no longer own the distribution channel. Apple owns the distribution channel, including the bank, the customer, and the information pipeline. Apple wants control and will take it until someone or something prevents Apple from doing what it wants. Publishers on the other hand think they are in control.

When domains collide, the nature of who is on top guarantees friction between those who are crushing others and those who are being crushed. In this battle, the publishers, like the music folks, are learning that the person with power pretty much calls the shots.

Solution: find a way to regain control. This might be tough because a core competency in distributing information in a 19th century world does not mean much in the data-centric world of 2010. Apple, which was a possible dead on arrival company two or three times in the last 20 years, found a way to survive. Now publishers have to find their solution.

Complaining and trying to be clever in order to “work around” the Apple guidelines won’t work. Anyway, it may be too late.

Stephen E Arnold, July 30, 2010

Freebie, unlike a single copy of Time

Add Comintelli to the Bandwagon

July 30, 2010

Toss another name into the club of search and enterprise programs utilizing open source technology. Comintelli, the Stockholm-based developer of enterprise knowledge management software, recently improved its product by using Lucene. Red Orbit reported in its story, “New Enterprise Search Solution Based on Apache Solr Released by Comintelli,” that Comintelli would be basing its Knowledge XChanger program on Apache Solr. According to the article, “[Knowledge XChanger’s] major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling.” Comintelli’s CEO was optimistic about the partnership, saying the company has “always been strong in search, but with Solr in the back-end, it is now well-positioned to tackle any information access problems.” The more we hear about Lucene programs being integrated, the more we hear how business will undoubtedly improve, which is an exciting trend.

Pat Roland, July 30, 2010


Quote to Note: Google about Android and Content

July 29, 2010

Here’s a quote to note. Today is July 29, 2010, and I don’t want this puppy to slip away. The story “Eric Schmidt on Google’s Next Tricks” is about Google’s dependence on advertising revenue. There is nothing wrong with billions of dollars. The problem is that Apple’s revenues are more diversified and Apple is moving into advertising. Google is jumping into hardware but in a Platonic manner. Google gives away Android and allows other companies to build hardware. Apple on the other hand gets its hands dirty with deals, hardware, software, and online services. So in terms of tricks, the Apple orchard has more diversity than Google’s patch.

Now here’s the quote to note:

“If we have a billion people using Android, you think we can’t make money from that?”

That’s a good question. The proof will be like Apple’s bushel basket full of revenue, earnings, and valuation. Google’s good, but a monoculture may be more vulnerable than diverse revenue streams that are up and flowing, not glimpsed in the mirror reflecting the horizon. Plato was not into money. Investors are the kinds of folks who say, “Show me the money.” Very un-Platonic in this goose’s opinion.

Stephen E Arnold, July 29, 2010

Freebie from the goose

Google Books Israel Edition

July 29, 2010

Nobody ever said the next frontier of literature would be smooth, but it is a realm that will be conquered nonetheless. Google is learning all about the highs and lows of digital books these days. A recent Globes article, “Google Books Reaches Israel,” [NOTE: Link may be dead when you read this Beyond Search post] highlighted the search giant’s new foray into Hebrew texts, reaching a deal to allow full or partial downloads of many books published in Israel. This victory is offset by the legal hot water Google Books currently finds itself in in the United States. There, the online bookshop finally reached a settlement in a lawsuit which “claimed that Google’s scanning of texts was copyright infringement,” so it can now release many more titles. While Google is making forward progress, its boat seems to be taking on water. But the Google is persistent.

Ken Toth, July 28, 2010


Online Paywalls: British Users Click Elsewhere

July 29, 2010

Internet users in England are the biggest online penny pinchers. Net Imperative recently reported this finding in an article, “British Least Likely to Pay for Online Content According to a New Survey.” The survey, performed by global accounting firm KPMG, discovered nearly 81 percent of Brits polled would prefer not to pay for online content, no matter what it was.

According to the article: “UK users will make some concessions though: almost 75% of UK users are happy to have free or heavily subsidized content supported by advertising. In addition up to 48% are happy to allow their personal data to be tracked if it means cheaper content, though some remain concerned about online privacy and safety.”

Regardless of ad space, this is a sign that there is too much valuable free information, like the kind found here, to ever force readers to pay.

Commercial database publishers in the 1980s knew how to generate revenue. Pity those lessons have been ignored. But today’s managers are just so much more informed.

Stephen E Arnold, July 28, 2010

Freebie unlike the paywall crowd

SAP Picks Black Duck

July 29, 2010

We received an email with information about SAP’s open source activities. The good news is that Black Duck Software, a provider of products and services for accelerating software development through the managed use of open source software, issued this statement:

SAP has selected to implement the Black Duck™ Suite. The comprehensive suite provides a platform for managing the use of open source software in a multi-source development process. It will help development teams at SAP improve productivity by further automating the company’s open source approval processes. SAP is the world’s leading provider of business software(*).

SAP, which previously used complex, time-consuming and partly manual processes for handling open source approvals and the legal compliance aspects of open source use, sought a scalable, enterprise-strength platform to further automate the management, compliance, and integration of open source software into its development life cycle. After researching available tools and platforms, SAP chose the Black Duck Suite to support SAP developers worldwide with the suite’s automated, developer-oriented, multi-function platform, which supports scanning, early detection and management of open source used in software development.

“When we established the open source approval process at SAP in 2001, we assumed we’d receive only a few open source requests per month,” said Francis Ip, head of SAP Global Technology Legal Compliance. “However, with the continuously increasing importance of open source globally and SAP’s recent strategic change towards systematically utilizing benefits that come with open source, it was necessary for us to scale our open source process through further automation. We conducted an exhaustive search of applications on the market, and the Black Duck Suite was the best solution we tested. The Black Duck Suite will help us further automate and scale our open source process in order to support our open source strategy.”

“Using the Black Duck Suite will help SAP developers reduce the amount of code that needs to be developed while increasing the velocity of development,” said Peter Vescuso, executive vice president, Black Duck Software. “Automating the use and management of open source software also will yield the benefits of compliance with software license obligations, reducing risk and improving developer efficiency.”

Based on information in this announcement, SAP is making a move into open source. Details are not yet available. The announcement is a vote of confidence for Black Duck, a company giving one of the presentations at the Lucene Revolution Conference in October in Boston, Mass. For more information on the Black Duck Suite, visit the Black Duck Software Web site.

Stephen E Arnold, July 29, 2010

Recommendation Engines May Engineer the Soul

July 29, 2010

“Recommendation engines aren’t designed to give us what we want. They’re designed to give us what they think we want,” says Lev Grossman in his recent Time Magazine article, “If You Liked This…” And that’s the crux of the difference between recommendation engines and perfection.

In my perfect world, I would open a retail store called “YOU” and you would shop there all the time because every product in the store would suit your taste. I would use your buying habits to build my inventory. You would spend almost all of your money in my store. You would be happy and I would be rich. Fair trade.

In a sense, that’s what recommendation engines do. They use what you’ve already purchased to guess what you might like to buy next… and they offer it to you immediately. It’s you recommending something to yourself with the computer as an intermediary.
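To make the idea concrete, here is a toy sketch in Python of the “you bought X, so you might like Y” logic. It is an assumption-laden illustration, not how any of the companies named in this post actually build their systems. The function recommend, the shopper names, and the items are all hypothetical. The scoring rule is the simplest one available: an item earns points when it sits in the basket of someone who shares purchases with you.

```python
from collections import Counter


def recommend(purchases, shopper, top_n=3):
    """Suggest items bought by shoppers whose baskets overlap with ours.

    purchases maps a shopper name to the set of items bought.
    """
    mine = purchases[shopper]
    scores = Counter()
    for other, basket in purchases.items():
        if other == shopper:
            continue
        overlap = len(mine & basket)  # how many purchases we share
        if overlap == 0:
            continue  # nothing in common, so no signal
        for item in basket - mine:  # only items we do not own yet
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical shoppers and items.
purchases = {
    "goose": {"item_a", "item_b"},
    "ann": {"item_a", "item_b", "item_c"},
    "bob": {"item_a", "item_d"},
    "cat": {"item_e"},
}
print(recommend(purchases, "goose"))
```

Even this toy shows the limitation: it can only hand back what people like you already bought. It has no notion of taste, mood, or surprise.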

Word of mouth from a friend is, by far, the most relied upon source of confidence, says a recent survey.  Statistically, almost 90% of us trust a friend’s recommendation to some degree. So wouldn’t you assume that, by being your own best friend, you couldn’t go wrong? You know the answer to that loaded question.

“The trouble with recommendation engines,” says the author, “is that they’re really hard to build. They look simple on the outside—if you liked X, you’ll love Y—but they’re actually doing something fiendishly complex. They’re processing astounding quantities of data and doing so with seriously high-level math. That’s because they’re attempting to second-guess a mysterious, perverse and profoundly human form of behavior: the personal response to a work of art. They’re trying to reverse-engine the soul.”


Technology can engineer one’s soul directly or indirectly, the addled goose assumes.

A lot of companies are trying hard to link one preference to others but, unlike the alphabet, human beings just don’t go from A to B to infinity and beyond in any algorithmically defined order.

Pandora, Netflix, Amazon, Facebook, eHarmony, MySpace and the like are trying hard to get it right. Industry studies show that about a third of us buyers choose another selection from the recommendations, so the value is obvious to both merchant and buyer. But getting it right is proving much more problematic than anyone thought.


Facebook Runs Wide Open

July 28, 2010

Open source technology, once relegated to the furthest reaches of computer geekdom, is helping over 500 million people a day share status info, view photos and even poke a friend; they just don’t know it. Facebook, the King Kong of social media, has embraced open source tools, especially Lucene products, on several different levels so it works faster and smarter. A recent interview with Facebook’s senior open programs manager, David Recordon, for “Inside Facebook’s Open Source Infrastructure,” revealed a surprising pile of open source applications. According to the piece, “Facebook’s open source Web serving infrastructure has a lot more than just the traditional LAMP (Linux/Apache/MySQL/PHP) stack behind it.”

The company taps Apache and Lucene. Cassandra, an Apache database project, is utilized heavily by the site. It is one of three open source databases used for storing information and helping the process run smoother. “While we store the majority of our user data inside of MySQL,” Recordon said. “We have about 150 terabytes of data inside of Cassandra, which we use for inbox search on the site and over 36 petabytes of uncompressed data in Hadoop overall.”

With such a well-planned method for storing data by using open source programs, it only makes sense the data analysis is handled in a similar fashion. Here, Apache Hive technology is utilized in a major way.

“A large part of our infrastructure is open source and we really think that it’s important in terms of being able to allow developers that are building with the Facebook platform to scale using the same pieces of infrastructure that we use,” Recordon said.

Facebook is arguably one of the most important companies of our time. Few sites have changed the way we spend time at and away from the computer. So its warm embrace of open source technology feels like a sign of big things to come as projects like Lucene gain more recognition. What’s interesting is that Facebook has a number of Googlers pumping their DNA into Facebook. The technical decisions at Facebook are different from those made at Google. Facebook does social pretty well. Google does not, at least yet. Is there a message here? Beyond Search will do an Overflight at some point.

Stephen E Arnold, July 28, 2010

