Behind the Scenes at DuckDuckGo

February 14, 2013

High Scalability gives us an in-depth look at the burgeoning DuckDuckGo, derived from an interview with the site’s founder, Gabriel Weinberg, in “DuckDuckGo Architecture—1 Million Deep Searches a Day and Growing.” Writer Todd Hoff notes that the Duck is proudly famous for (or famously proud of) refusing to collect data on its users. Though it is understandable that Weinberg emphasizes that popular stance, Hoff is more interested in the mechanics behind the service. He writes:

“What I found most compelling is DDG’s strong vision of a crowdsourced network of plugins giving broader search coverage by tying an army of vertical data suppliers into their search framework. For example, there’s a specialized Lego plugin for searching against a complete Lego database. Use the name of a spice in your search query, for example, and DDG will recognize it and may trigger a deeper search against a highly tuned recipe database. Many different plugins can be triggered on each search and it’s all handled in real-time.

“Can’t searching the Open Web provide all this data? Not really. This is structured data with semantics. Not an HTML page. You need a search engine that’s capable of categorizing, mapping, merging, filtering, prioritizing, searching, formatting, and disambiguating richer data sets and you can’t do that with a keyword search.”
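
The plugin mechanism Hoff describes can be sketched as a set of trigger predicates paired with handlers, where any number of plugins may fire on one query. Everything below (the plugin names, the databases, the `dispatch` helper) is invented for illustration and is not DDG’s actual code:

```python
# Hypothetical sketch of trigger-based plugin dispatch, loosely modeled on
# the vertical-plugin idea described above. All names are invented.

SPICES = {"cumin", "paprika", "turmeric"}

def lego_plugin(query):
    # Would query a structured Lego parts database in a real system.
    return f"lego-db results for {query!r}"

def recipe_plugin(query):
    # Would query a highly tuned recipe database in a real system.
    return f"recipe-db results for {query!r}"

# Each plugin pairs a trigger predicate with a handler function.
PLUGINS = [
    (lambda q: "lego" in q.lower().split(), lego_plugin),
    (lambda q: any(w in SPICES for w in q.lower().split()), recipe_plugin),
]

def dispatch(query):
    """Run every plugin whose trigger matches the query; all can fire."""
    return [handler(query) for trigger, handler in PLUGINS if trigger(query)]

print(dispatch("cumin chicken"))   # the recipe plugin fires
print(dispatch("lego star wars"))  # the lego plugin fires
```

The point of the sketch is that recognition (the trigger) is cheap and runs for every plugin on every query, while the expensive vertical lookup happens only for the plugins that match.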

He’s right. I do turn to DuckDuckGo for such deep searches, but I often go back to Google if I need a broader one. It is good to have a variety of tools. All else being equal, I do prefer the Duck’s privacy policy.

That bragging point, however, comes at a cost. Like other Web search engines, DuckDuckGo is ad-supported, but its key policy makes it impossible to take advantage of the most lucrative source of revenue—the targeted ad. Our view is that 2013 is about revenue, not about bits and bytes or popularity. We hope our fellow waterfowl makes it through okay.

Do check out Hoff’s article if you are interested in the mechanics behind DuckDuckGo. It is chock-full of detailed information.

Cynthia Murrell, February 14, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

SOLR Relevancy Tuning from Search Technologies

February 11, 2013

Search Technologies introduced “Solr Lucene Relevancy Tuning.” Search Technologies will supply services to improve the relevancy of results within an existing Solr/Lucene implementation. If the service works as advertised, this could be a boon to many organizations awash with extraneous data. The announcement explains:

“This engagement will provide powerful relevancy ranking improvements in an existing Solr installation. This includes setting up a basic system for relevancy evaluation, based on a set of sample queries, so that improvements can be quantitatively measured. Additions to the default relevancy formula in Solr Lucene can dramatically improve search results, solving many of the most thorny relevancy problems including:

  • Reducing the impact of peripheral content (sidebars, ads, tangential discussions, etc.)
  • Automatically handling word phrases in a flexible manner, reducing the need to use complex query constructions to obtain good search results.”

The Search Technologies’ solution changes the default Solr/Lucene functionality, which can overemphasize document size and term frequency. Search Technologies’ new Parameterized Document Similarity Function provides more control over these formulas through configurable parameters. The company’s Gradient Proximity Boost operator eliminates the need to tweak Solr/Lucene’s default “hard window,” the term-proximity parameters which can trigger a document boost. The method does this by measuring the density and completeness of terms across each document, gradually boosting documents in which terms cluster.
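
The idea behind a gradient proximity boost can be illustrated with a toy scoring function. This is an invented formula for illustration only, not Search Technologies’ actual Parameterized Document Similarity Function: it rewards both completeness (how many query terms appear) and density (how tightly they cluster), with no hard all-terms-within-N-tokens cutoff:

```python
def proximity_score(doc_tokens, query_terms):
    """Toy gradient proximity score: rewards documents where more of the
    query terms appear, and where matches cluster tightly, instead of a
    hard-window boost. For brevity this uses the full span of all matches
    rather than the minimal covering window."""
    positions = [i for i, tok in enumerate(doc_tokens) if tok in query_terms]
    if not positions:
        return 0.0
    matched = {doc_tokens[i] for i in positions}
    completeness = len(matched) / len(query_terms)   # fraction of terms present
    span = max(positions) - min(positions) + 1       # tokens covered by matches
    density = len(positions) / span                  # how tightly matches cluster
    return completeness * density

clustered = ["quick", "brown", "fox", "jumps"]
spread = ["quick", "a", "b", "c", "d", "e", "fox"]
query = {"quick", "fox"}
# The clustered document scores higher because its matches sit closer together.
```

A hard window would give both documents the same boost (or none); the graded version degrades smoothly as the terms drift apart, which is the behavior the announcement describes.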

The post identifies the expected engagement tasks and deliverables associated with this software. The only prerequisite listed is the presence of a working Solr/Lucene system with already-indexed documents. The firm promises ongoing maintenance and support services, including an optional round-the-clock support package.

Founded in 2005, Search Technologies bills itself as the largest independent IT services company dedicated to search-engine implementation, consulting, and managed services. Staffed with veterans of the search field, the company prides itself on innovation. Search Technologies is headquartered in Herndon, Virginia, and maintains two other U.S. offices as well as locations in Berkshire, U.K., and San Jose, Costa Rica.

Ken Toth, February 11, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Yahoo Back On Search

February 11, 2013

Before Google came into the spotlight, Yahoo used to have a series of commercials in which its subjects were put in hilarious situations they wanted to get out of. By using Yahoo search, they were able to find a solution. At the end of every commercial a yodeler yodeled “Ya-ho-oo!” Everybody was “yahooing,” and everyone thought Yahoo was number one. They were wrong. Computer World reports that Yahoo wants to snatch back the crown in “Yahoo To Focus On Search—And Google.”

Marissa Mayer, Yahoo’s CEO, plans on taking on Google in Internet search. She became CEO after a successful career at Google; Yahoo pulled her in to save its floundering tail. Mayer, more than anyone else, knows what it means to take on the search giant. Yahoo needs to do something very new and very bold to have the smallest glimmer of hope of competing. Mayer will focus on building technology to improve search results and to extend its reach to desktop and mobile device users.

“’There’s a lot more potential here,’ Mayer said. ‘Overall, search is a key area of investment for us. All the innovations in search are going to happen at the user interface level going forward. We need to invest in those features, both for desktop and mobile [devices]. I think both ultimately will be key plays for us.’”

The new strategy does not call for the end of the Yahoo/Microsoft partnership; instead, Mayer hopes Bing will help Yahoo. In 2010, Yahoo ditched its own search engine for Bing. In order to even make a dent in the market, Yahoo needs to grasp onto something that Google misses. Yahoo stinks and needs help, and a former Googler is pulled in to help. Talk about knowing thy enemy.

Whitney Grace, February 11, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

From Jeopardy to Cancer Treatment: An IBM Story

February 10, 2013

I read “IBM Supercomputer Watson to Help in Cancer Treatment.” I am burned out on the assertions of search, content processing, and analytics vendors. The algorithms predict, deliver actionable information, and answer tough questions. Okay, I will just believe these statements. Most of the folks with whom I interact either believe these statements or do not really care.

Watson, as you may know, takes open source goodness, layers on a knowledge base, and wraps the confection in layers of smart software. I am simplifying, but the reality is irrelevant given the marketing need.

Here’s the passage I noted:

A year ago, a team at Memorial Sloan-Kettering started working with an IBM and a WellPoint team to train Watson to help doctors choose therapies for breast and lung cancer patients. They continue to share their knowledge and expertise in oncology and information technology, beginning with hundreds of lung cancers, the aim being to help Watson learn as much as possible about cancer care and how oncologists use medical data, as well as their experiences in personalized cancer therapies. During this period, doctors and technology experts have spent thousands of hours helping Watson learn how to process, analyze and interpret the meaning of sophisticated clinical data using natural language processing; the aim being to achieve better health care quality and efficiency.

There you go. For the dozens of companies working to create next-generation information retrieval systems which are affordable, actually work, and can be deployed without legions of engineers—game over. IBM Watson has won the search battle. Now for the optimists who continue to pump money into decade-old search companies with modest revenue growth: kiss those bucks goodbye. For the PhD students working on the revolutionary system which promises to transform findability, get a job at Kentucky Fried Chicken. And Google? Well, IBM knows your limits, so stick to selling ads.

IBM is doing it all:

Manoj Saxena, IBM General Manager, Watson Solutions, said:

“IBM’s work with WellPoint and Memorial Sloan-Kettering Cancer Center represents a landmark collaboration in how technology and evidence based medicine can transform the way in which health care is practiced. [These] breakthrough capabilities bring forward the first in a series of Watson-based technologies, which exemplifies the value of applying big data and analytics and cognitive computing to tackle the industry’s most pressing challenges.”

How different is Watson from HP Autonomy, Recommind, or even the DR LINK technology? Well, maybe the open source angle is the same. But IBM needs to do more than make assertions and buy analytics companies while recycling open source technology, in my opinion. I thought IBM was a consulting firm? Here I am wrong again. Watson probably “knew” that after hours of training, tuning, and talking. But in the back of my mind, I ask, “What if those training data are inapplicable to the problem at hand? What if the journal articles are fiddled by tenure seekers or even pharmaceutical outfits, or skewed by institutions trying to maximize insurance payouts, or muddied by careless record keeping by medical staff?” Nah, irrelevant questions. IBM has this smart system nailed. Search solved. What’s next, IBM?

Stephen E Arnold, February 10, 2013

Google: Objective Indexing and a Possible Weak Spot

February 6, 2013

A reader sent me a link to “Manipulating Google Scholar Citations and Google Scholar Metrics: Simple, Easy, and Tempting.” I am not sure how easy and tempting the process of getting a fake scholarly paper into the Google index is, but the information provided is food for thought. Worth a look, particularly if you are a fan of traditional methods for building a corpus and delivering on-point results which the researcher can trust. The notion of “ethics” is an interesting addition to a paper which focuses on fake or misleading research.

Stephen E Arnold, February 7, 2013

Independent News in Eastern Europe Bolstered by Solr

February 6, 2013

Independent news agencies have a hard time escaping the tight grasp of government in restricted countries. In the nation of Georgia, the non-profit Sourcefabric has developed Newscoop based on open source software. Open source is not only contributing to profitable business, but to political and ideological freedom. CMS Wire has a full story in, “Newscoop CMS 4.1 Integrates Solr for Search, GeoLocation Tools.”

But our interest here is in the technology, and how Newscoop has been boosted by the power of Solr. The article states:

“The enhanced search functionality in 4.1 is made possible by an integration with Solr, an open source search project out of the Apache Lucene effort, and it is designed to facilitate the ability of site visitors to find relevant content on the news site or in connected blogs. Solr features full-text search, hit highlighting, database integration, auto-suggestion and advanced ranking.”

Many software and enterprise solutions also find their strength in the solid base of Apache Lucene and Solr, two of the most trusted names in the Apache open source community. One such solution is LucidWorks. While LucidWorks’ ultimate aim is efficient enterprise search, what it shares with Newscoop is a sturdy and reliable infrastructure.
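
The features quoted above are exposed through Solr’s HTTP interface. As a minimal sketch, here is how a client might build a request to Solr’s standard /select handler with hit highlighting enabled; the host, core name, and field names are assumptions for illustration, while `q`, `hl`, `hl.fl`, and `wt` are standard Solr request parameters:

```python
from urllib.parse import urlencode

def solr_select_url(base, core, query, highlight_fields=("title", "body")):
    """Build a query URL for Solr's /select handler with hit highlighting.
    `base`, `core`, and the field names are placeholders for a real
    deployment; the request parameters are standard Solr options."""
    params = {
        "q": query,                           # the full-text query
        "hl": "true",                         # turn on hit highlighting
        "hl.fl": ",".join(highlight_fields),  # fields to highlight
        "wt": "json",                         # response writer / format
    }
    return f"{base}/solr/{core}/select?{urlencode(params)}"

url = solr_select_url("http://localhost:8983", "news", "press freedom")
print(url)
```

A news CMS like Newscoop would issue requests of roughly this shape on behalf of site visitors and render the highlighted snippets in the result list.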

Emily Rae Aldridge, February 6, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Repercussions of Facebook Graph Search

February 6, 2013

As with the arrival of most new things, no one is quite sure what the results of Facebook’s venture into search will be. Forbes investigates the possibilities in, “Facebook Graph Search is a Disruptive Minefield of Unintended Consequences.” It is good to see we are not the only ones who think this development could shake up the search terrain.

Journalist Anthony Wing Kosner begins by noting that Graph Search is not something users have requested, but rather a marketing initiative. For the feature to work, users will have to help by continuing to populate Facebook with data in the form of likes, check-ins, photos, and profile info. Somehow, I don’t think that’s a big hurdle, even if some users do get spooked by the very real search-related privacy concerns. More tricky, perhaps, is convincing users they want to narrow their searches from the World Wide Web to their own Facebook network.

Kosner writes:

“I think Graph Search is indeed important, but the results of Facebook’s search for increased relevance may be both more and less than it intends. Its users may find the utility of searching their own social graph to be hit-or-miss, but they also may find themselves feeling much more exposed in the searches of others than they ever intended to be. Rather than phrase this negatively, however, I want to try to identify the potentially explosive issues, land mines if you will, that Facebook will encounter in its path to build out its third pillar and suggest what it needs to do to avoid or diffuse them.”

Not surprisingly, the main suggestion is to make it easier for users to protect their privacy. The current process can be cumbersome, and not even a Zuckerberg can be certain the results will be as expected. With Graph Search in particular, the inability of algorithms to understand irony or a love of randomness, both hallmarks of today’s youth culture, can result in acute misrepresentation of someone’s views. Sometimes this could simply be amusing, but other times, it could cause real damage. And you might never know.

If you are concerned about these issues (and if you or someone you love uses Facebook, you should be), check out this detailed article. I suppose we will just have to wait and see where the chips fall, while helping spread the word—be careful out there.

Cynthia Murrell, February 06, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

The Ugly Underbelly of Search

February 5, 2013

By now everyone has heard about the major snafu at the GitHub repository hosting service at the end of January. Search is our favorite topic of discussion, and while we primarily focus on all the good it can do for individuals and organizations, there is another side to search. In the wrong hands, or in incapable hands, search can have serious negative repercussions. The H Open article, “GitHub Search Exposes Uploaded Credentials,” fills us in.

The article gets to the heart of the problem:

“Users of the GitHub project hosting system have been reminded not to upload sensitive information to the system’s Git repositories. The reminder comes after GitHub launched a new search service based on elasticsearch. The launch of the service sent people off searching the code and, as people tend to do, they searched for private information. Various searches for terms such as ‘BEGIN RSA PRIVATE KEY’ were revealing many people had, in fact, been uploading private keys.”

Perhaps as a blessing in disguise, the elasticsearch infrastructure collapsed under the weight of searches as curious readers searched for themselves after hearing the news on Twitter. So the moral of this story is to never upload private keys or similar data into repositories, under any circumstances. A little common sense goes a long way. And, just to be safe, explore a more trusted solution based on Lucene and Solr, which pull from the strength of a large open source community. These solutions, like LucidWorks, are less likely to crack under the pressure.
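
That moral lends itself to automation: a simple check can scan text for private-key markers before it is ever pushed to a repository. The sketch below is illustrative only, not a complete secret scanner, and the patterns are a small sample of what real tools look for:

```python
import re

# Markers commonly found in private key files and config files. A pre-push
# check can refuse to upload any content containing them. Illustrative only.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN (?:RSA |DSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)\bpassword\s*=\s*\S+"),
]

def find_secrets(text):
    """Return every secret marker matched in `text` (empty list if clean)."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits

safe = "print('hello world')"
risky = "key = '-----BEGIN RSA PRIVATE KEY-----\\nMIIE...'"
# find_secrets(safe) finds nothing; find_secrets(risky) flags the key header.
```

Hooking a check like this into a pre-commit or pre-push step costs almost nothing, and it is far cheaper than rotating keys after a public search index has already surfaced them.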

Emily Rae Aldridge, February 5, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

A Quote To Note About Search

February 5, 2013

Search is the act of trying to find the answer to a question. Internet users browse the Web searching for answers to their questions. The main tools people use to search the Internet are search engines, but while reading Explore, this quote came up:

“Forget search engines. The real revolution will come when we have research engines, intelligent Web helpers that can find out new things, not just what’s already been written. Facebook Graph Search isn’t anywhere near that good, but it’s a nice hint at greater things to come.”

Gary Marcus, a neuroscientist, made this observation about Facebook and how its new Graph Search could mean big changes for search in the coming years. Explore also mentioned that it echoes Vannevar Bush’s 1945 vision for the future of knowledge. Bush was an engineer well known for his work on analog computers and a little project called the Manhattan Project. Reflecting on this quote, one can only agree that yes, Graph Search and other searches are on the brink of something grand. From the science fiction and romantic writing angle, these will be the times people look back on nostalgically as the age of our infant-like knowledge. All the information in the world can be discovered on a little device someone carries around in a pocket, but people are still clueless about how to use it.

Google is already trying to remedy this with Knowledge Graph, which is the start of a Star Trek-like computer. People need to be taught how to use information and what it can do for them, rather than passively letting it seep through their heads. The time to start is now.

Whitney Grace, February 05, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

A Search Death Report a Decade Too Late

February 3, 2013

I wrote a feature for Searcher Magazine in 2003 called “In Search of…the Good Search.” The original title was “Search Is Dead.” I picked up the theme in a number of Beyond Search articles; for example, “The Search Is Dead Question.”

I was interested in the February 2013 write-up “The End of the Web, Computers, and Search as We Know It.” The main idea is that search is dead. I am okay with the premise. I did find the following statement interesting in light of the explosion of interest in making information in academic papers free, which is bubbling along with the Google agreement to pay France for links.

Here’s the passage I noted:

But it’s about time: “Bring me what I want” is almost always more useful than “Let me rummage around and see what I can find.” No matter how fast it seems, most search is a waste of time. In a way, we are using time (i.e., the time-based structure) to gain time. Instead of doing an endless series of separate searches, we tune the knobs on our stream-browser to continuously feed us just the information we need. This future doesn’t just kill the operating system, browser, and search as we know it — it changes the meaning of “computer” as we know it, too. Whether large or small (e.g., a smartphone), a computer’s main function in the near future will be tuning in to — as a car radio tunes in a broadcast station — the constantly flowing global cyberflow. We won’t care much about the computer devices themselves since we’ll be more focused on the world of information … and our lives as attached to it.

My thought is that the subtext for this remark rests upon the chronological approach in Scopeware. But when I ran a query for the system, Google had nothing substantive, while Bing.com produced a reference to LegalTech.com and a download link on Softpedia.

My view:

  1. The death of search took place with the rise of pay to play services. Online advertising is the main engine of growth. As pay to play grew, the likelihood that different types of retrieval systems would become the next big thing has dwindled. After Google went public, the old precision and recall model ended up in the morgue.
  2. Search has been devalued by the systems marketed aggressively by the Big Five in search: Autonomy, Convera, Endeca, Fast Search, and Verity. Each installation left licensees with some surprises. None of these outfits exists as a self-standing, multi-billion-dollar, absolutely essential solution. Vestiges of the legacy of these breakthrough systems may be seen in the HP Autonomy dust-up, in my opinion.
  3. The stampede to predictive analytics, business intelligence, and personalized systems is little more than a way to get rid of the hassle of making the user craft a query, using smart software instead to tell the hapless user what he or she needs to know. Do these systems work? In my book, the marketing is better than the technology at this time. Licensing pure search is not what most vendors do. The pitch is for customer support, Big Data, and sentiment. Search was a tough sell in 2003 and is an even tougher sell today.

I am okay with brave new worlds, nifty technology, and total immersion in pay to play. I just want to shift the moment of death back a decade. Reporting a death long after it occurred is similar to the disappearance of content in a Web-centric world. Maybe the Library of Congress will save the day with its archive of Twitter messages.

Stephen E Arnold, February 3, 2013
