Endeca Technologies

An Interview with Pete Bell

Pete Bell of Endeca Technologies

Endeca is synonymous with "guided navigation", a point-and-click presentation of information that makes it painless and intuitive for a user to access information. When Endeca's system débuted in 1999, it quickly established its presence with high-profile deals, including Home Depot, IBM, Harvard, and Tesco. In February 2008, Endeca inked deals with Intel and SAP for strategic investments. Endeca's competitors providing search technology to these two firms could see the writing on the wall. Endeca was poised for growth in the behind-the-firewall sector, not just eCommerce.

On a blustery March day in Boston, I sat with the effervescent Pete Bell. The buzz in the Starbucks added to the confidence and intensity that radiated from Mr. Bell. Comfortable in the Boston scene, which bears no resemblance to the run-down general store in Harrod's Creek, Kentucky, I probed the Endeca formula for success. The company in calendar 2007 was, according to my sources, hitting $107 million in revenues and growing rapidly.

Where did you and Steve Papa get the idea for Endeca?

We were lucky because we got the idea by trying to fix eBay back in 1999, when it looked like earth's biggest flea market. That turned out to be the general solution to a problem that applies almost everywhere people accumulate information -- from design engineers to librarians to intelligence analysts.

The eBay problem had two parts that needed to be solved -- one for the user, which in turn created one for the infrastructure. The user problem was about discovery. eBay's search box could find you anything you already knew existed, but it couldn't help you discover things you didn't already know about. That's because you weren't given an overview of what was available. Harder still, what was available kept changing, so how could eBay predict how to organize it all? The street market needed to be transformed into a store, with aisles, shelves, and a guide.

What did you recommend?

So we modeled a solution, which we called Guided Navigation. We began bottom-up with a data model that could hold anything in eBay. We complemented that with a user experience that would let you explore it by attributes and text. We literally modeled this in Steve's graduate school dormitory with clay balls and coffee stirrers hanging from string on the ceiling. Guided Navigation turned out to be a form of facet analysis, an idea that goes back to the father of information science, Shiyali Ramamrita Ranganathan, in the 1930s. But those ideas had gotten lost for this problem, probably because of the second part of the problem: backend architecture.

It turned out that existing architectures like the relational database or the inverted index (which powers conventional search engines) had about as much chance of helping eBay as our clay balls did. It was a classic scalability problem -- Guided Navigation needed to work at high scale for millions of users, millions of products with millions of attributes, and constant updates.
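The discovery model Mr. Bell describes can be sketched as a toy faceted-navigation loop: filter records by the attributes the user has already clicked, then summarize what remains so the user sees the "aisles and shelves" still available. The catalog, field names, and counts below are invented for illustration; Endeca's MDEX internals are not public, so this shows only the user-facing idea, not the engine.

```python
# A minimal sketch of faceted ("guided") navigation over semi-structured
# records. All data and field names here are hypothetical.
from collections import Counter

catalog = [
    {"title": "vintage camera", "category": "Photography", "condition": "used"},
    {"title": "film camera", "category": "Photography", "condition": "new"},
    {"title": "guitar amp", "category": "Music", "condition": "used"},
]

def guided_navigation(records, selections):
    """Filter by the chosen facet values, then summarize what remains."""
    remaining = [r for r in records
                 if all(r.get(k) == v for k, v in selections.items())]
    # For every attribute not yet selected, count the values still available.
    # These counts become the clickable refinements shown to the user.
    facets = {}
    for r in remaining:
        for k, v in r.items():
            if k != "title" and k not in selections:
                facets.setdefault(k, Counter())[v] += 1
    return remaining, facets

results, facets = guided_navigation(catalog, {"category": "Photography"})
print(len(results))          # 2 matching records
print(facets["condition"])   # Counter({'used': 1, 'new': 1})
```

Each click adds one more key to `selections`, and the facet counts shrink accordingly, which is the "guide" the interview contrasts with a bare results list.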

Not only that, eBay data was semi-structured. Search engines were designed for unstructured data, and databases for structured data, so again, they were the wrong tool for the job. So our team was forced to design a third way, which we call an MDEX engine. It's a new class of database with a flexible data model, analogous to XML, complemented by a way to query, slice-and-dice, and summarize that XML-like data. We were fortunate because that infrastructure turned out to solve many other problems, giving Endeca some surprising opportunities.

Why search? It's a very crowded space, right?

It is crowded, which is a sign that the status quo is still broken. Otherwise the market would have consolidated like the airline industry. In fact, Steve [Papa] came from Inktomi, where he was an early employee. This was right before Google came on the scene and Inktomi still ran more than half of all Web search; Inktomi powered Yahoo. Steve knew from the inside just how broken the search model was, and how much business opportunity there was for innovators.

But search is a confusing term, so the market is moving towards finer-grained terms. For starters, people confuse web search and enterprise search, but they're completely different technologies. And the word search is also used as the name of the search box. And, to make matters worse, the word search is the name of one architecture that powers the box -- the inverted index. But there are other architectures that can power it, like relational databases, or our MDEX engine. Most importantly, search has become a catchall for the many ways people discover information, which include browsing, visualizations, charts, and graphs. People often get their answer without ever using the search box, and when you ask them afterwards what they just did, they'll tell you they searched. The market is converging on the phrase "Information Access" as the term for that broader set of features.
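For readers unfamiliar with the inverted index mentioned above (the architecture behind conventional search engines), a minimal sketch with made-up documents shows the term-to-document mapping that powers the search box:

```python
# A tiny inverted index: map each term to the set of documents containing it.
# The documents are invented for illustration.
from collections import defaultdict

docs = {
    1: "guided navigation for ecommerce",
    2: "enterprise search and navigation",
    3: "relational databases for structured data",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)   # posting list: term -> doc ids

def search(*terms):
    """AND-query: intersect the posting sets of every query term."""
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(search("navigation")))            # [1, 2]
print(sorted(search("navigation", "search")))  # [2]
```

The structure is optimized for matching query words against unstructured text, which is why, as Mr. Bell notes, it is a poor fit for slicing and summarizing semi-structured records.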

Once you segment the field with a finer-grained vocabulary, there are some less-crowded parts of the market. We're thriving as an Information Access platform whose architecture is based on a new class of database. And that's not just positioning gymnastics. If a procurement team is considering Endeca, Google, and Lucene for the same search problem, we need to help the team express their need in a more informative manner.

Steve Papa and you fellows triggered a wave of "imitators". How have you evolved "faceted navigation" in the bright light of many, many knock-offs?

It's funny because early on, we needed to evangelize Guided Navigation in every account, so we prayed for the day when it would just appear on RFPs [requests for proposals]. Then by 2005, it was appearing on all the RFPs, and every vendor seemed to have something to demo, which created a different problem.

We've evolved Guided Navigation in two ways.

The first is on features related to faceted navigation itself. The benefit to being the pioneer is that when you take a product out of the lab and into the real world, you get to discover problems that need to be solved before anyone else even knows they exist. Just like early cars had no turn signals because who knew you'd need them? So we're doing great things on tooling and workflow. Ways to help users make sense of large numbers of facets -- one of our customers has more than 15,000, with control over when they appear. And we also introduced a feature called "Record Relationship Navigation" that looks to be as important as Guided Navigation itself was. It lets you navigate by facets of facets, or navigate one record set by the facets of a related record set -- essentially, you're joining dimensions on the fly. Sounds abstract, but it's the only good way to manage a lot of security and personalization problems, or to combine record sets.
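As a rough illustration of "joining dimensions on the fly" (not Endeca's actual implementation, which is not public), the sketch below filters one record set by a facet carried on a related record set. All names and data are hypothetical.

```python
# A sketch of navigating one record set by the facets of a related record
# set, joined at query time. Data and field names are invented.
articles = [
    {"id": 1, "title": "Q3 outlook", "author_id": 10},
    {"id": 2, "title": "Chip roadmap", "author_id": 11},
    {"id": 3, "title": "Retail trends", "author_id": 10},
]
authors = [
    {"id": 10, "department": "Finance"},
    {"id": 11, "department": "Engineering"},
]

def navigate_by_related(records, related, key, related_facet, value):
    """Keep records whose related record carries the chosen facet value."""
    related_by_id = {r["id"]: r for r in related}   # build the join on the fly
    return [r for r in records
            if related_by_id[r[key]][related_facet] == value]

finance_articles = navigate_by_related(articles, authors,
                                       "author_id", "department", "Finance")
print([a["title"] for a in finance_articles])  # ['Q3 outlook', 'Retail trends']
```

The same pattern covers the security and personalization cases mentioned: the "related record set" can be a table of users and entitlements rather than authors.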

The second way we've evolved is on the architecture. Since imitators were playing catch-up, nearly everyone else grafted facets onto their existing engine, so they do things like manage facets through application-side code. If you have a thousand products and three facets, that could work. But it gets ugly when you need to scale or want to make changes. Since we architected for facets from the very beginning, we built them into the engine. We've evolved industrial-strength infrastructure for this.

I was at a conference and several speakers emphasized that there's something called "search box fatigue". The idea is that users don't want to come up with keyword queries. What are you doing to give users alternatives to playing the keyword guessing game?

People need guidance. When you ask a concierge for a good restaurant, she doesn't hand you the phone book. She asks you questions: "What are you in the mood for?" "How much do you want to spend?"

But search engines dump out the phone book on you. What good is a long list of results? You need to summarize that list. Here are all the restaurants on a map. These are the cuisines available, and so on. The search box is friendlier when you have a guide to your results.

It goes back to foraging theory. We're just like bears looking for berries. We follow a scent, and if we're not getting reliable cues that the scent is growing stronger or weaker, we'll quit looking.

I know you do eCommerce, but what's new in the enterprise build of Endeca?

We've had huge growth in markets like manufacturing, media, financial services, and intelligence, so there's a lot that's new. We are using our MDEX engine beneath it all, but it gets configured to such a specific set of use cases that users at Time Magazine, Boeing, and the Defense Intelligence Agency experience specific information access features tailored to them.

There are new features, like new analytics and visualizations. Our customers are using elements familiar from business intelligence to help guide them through results. For documents and text, our framework for faceted navigation lets us approach text mining and semantic processing in a novel way. For example, not only do we extract entities, we turn them into useful navigation. Or you can do social navigation across user-generated content.

For data integration, we have pre-built adapters to most enterprise systems, like SAP and Documentum, and a kit to develop others quickly. On the front end, we have a rapid application development framework. For example, there's a .NET library that lets you build applications in a few minutes by selecting a search box, breadcrumbs, navigation, and everything else. The changes in the core engine are exciting -- scale is way up. The next release adds XQuery, which lets you form more expressive queries.

We've also worked closely with our customers to learn about their workflows. This gets reflected back through better tooling. It also spills out beyond the code into regional user group meetings and an active developers network. For the enterprise, that community and expertise becomes a big source of value.

In my work, I keep hearing comments about the performance penalties the increasing volume of digital information imposes on search systems. What are you doing to keep your behind-the-firewall systems delivering quick results?

We're architected to handle Wal*Mart's Christmas peak, and behind-the-firewall is quicker in comparison. Once you're faster than a blink, the question becomes one of what else you can do with the extra headroom. One thing you can get with it is operational simplicity, which comes from minimizing your hardware footprint. We're on a native 64 bit architecture running multi-cores, and memory is very affordable compared to its cost a few years ago. We are packing a great deal into a single box.

The other thing you can do with that headroom is enhance the results. Beyond summarizing each result set with Guided Navigation, you can also chart, graph, and map it all. You can do text mining and visualizations. If you did an apples-to-apples comparison of an Endeca query to a simple search query, it's many times more complex.

The hardware innovations have been a help to us and other vendors as well. With the current generation of CPUs, for example, more things are now possible such as additional text processing operations.

The news is full of Yahoo "discovering" Hadoop. What architectural and plumbing changes do you see coming in the next six to 12 months?

Hadoop is neat, but for now, that scale really only applies to Web search, not the enterprise. Maybe with the exception of a few of our customers in the Washington, DC area. In general, cloud computing has its benefits, but for the enterprise, that approach is not yet in the mainstream. The computer science is easy. It's the political science that's hard.

But its relative, virtualization, is definitely coming into its own. Once you take away the overhead of provisioning each application, it becomes feasible to launch one-off applications, possibly each for a single user. In other words, an analyst might invoke a full application for just the data set he happens to care about for that project. Disposable applications open up many possibilities.

Next generation data standards are finally hitting critical mass. That means XQuery [a query language for extracting information from XML documents] starts to become a useful complement to all that XML. The Resource Description Framework and Semantic Web are still on the horizon, though there's been good progress for RDF and semantics in life sciences and government.

The enterprise is behind on Web 2.0 front ends, like AJAX, Flash, and Flex, but is catching up.

As you think about the strong vote of confidence the Intel and SAP investments show in Endeca, what will you be doing to leverage these companies' interest in and support of Endeca?

Intel and SAP give us the opportunity to plan a product roadmap today that will be ready for how enterprises look three years from now. Intel has chips in their labs that boggle my mind. It's all about multi-core -- what would you do with an 80 core chip? Our Chief Technical Officer, Adam Ferrari, has a background in Web-scale distributed computing, so there's nothing he loves more than playing with that problem. Intel wants visionaries to create the demand for its next generations of chips.

As for SAP, their software manages a lot of the world's most valuable data. Today, the SAP data support business processes -- things that are repetitive and predictable, like managing your sales pipeline or human resource functions. But as soon as you veer off from a specific process, it can be difficult to make use of those data.

There may be a significant opportunity to take the data that businesses already own and use it for discovery, exploration, and ad hoc analysis. Many of the Fortune 500 could squeeze more value from their data. SAP is supportive of anything that makes those data stores more valuable.

As you look forward to 2009, what are the major trends in search that you see?

We'll see much more of the convergence of search and business intelligence, which makes sense because they're both about information visibility. But it doesn't look like the 2007 version, which uses a search box to retrieve canned reports. That's a Frankenstein of the two -- like those early airplane concepts that glued wings on a Ford Model T. The deep integration is much more interesting.

Write-backs will start to matter. We think of search as read-only today. But why not close the loop? Consider something as useful as tagging. Say you've discovered a helpful document and want to add some metadata. You shouldn't need to open up a content management system, make a change there, and then re-index everything. Database-like engines will open up new possibilities there.
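The tagging example can be sketched as a toy write-back against an in-memory store: the tag is written to one record in place, and it is immediately searchable, with no round trip through a content management system and no re-index. The store, document IDs, and tags below are invented.

```python
# A minimal sketch of a "write-back" over a search store: update one
# record's metadata incrementally. All names and data are hypothetical.
store = {
    "doc-42": {"text": "quarterly sales report", "tags": set()},
    "doc-43": {"text": "engineering roadmap", "tags": set()},
}

def add_tag(doc_id, tag):
    """Close the loop: write metadata back to a single record in place."""
    store[doc_id]["tags"].add(tag)

def find_by_tag(tag):
    """Tags become searchable facets as soon as they are written."""
    return [d for d, rec in store.items() if tag in rec["tags"]]

add_tag("doc-42", "helpful")
print(find_by_tag("helpful"))  # ['doc-42']
```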

The new kid on the block for academics is HCIR, or Human Computer Interaction meets Information Retrieval. By itself, Information Retrieval has been stuck on relevance ranking. If you hit Google's "I'm Feeling Lucky" button, how close is your answer to the top of the list? That only works for fact-finding, or popularity contests. But for most discovery problems, the popularity model doesn't work very well because a computer fundamentally can't determine relevance. If everyone asks a computer which is the best restaurant, it's going to tell us McDonald's. Each person needs to determine relevance, and the computer's job is to provide us each with evidence. HCIR tells us how.

Multi-touch displays will have an impact sooner than you'd expect. The first Minority Report experiences will be here by 2009, and they're genuinely useful. Endeca on the iPhone is already powerful, and there's no barrier to making the giant version.

Finally, scale will continue to increase, which means we can just keep pumping more information into our applications. Now, just because you can doesn't mean you should. In fact, with earlier systems, the more information you included, the harder it was to discover anything -- the eBay problem from 1999.

How can a user or a procurement team determine if a search system does what the marketing collateral says?

The sign that your system is good is that you want to keep adding information. For example, we have an internal application we call Endecapedia. It started with content from our Documentum/eRoom collaborative workspaces. Then we put all our wikis in. Then we added all the CRM data from our accounts. Now we've got an integration underway to our SAP systems. And each time we add more information, the entire thing becomes more useful because you uncover the relationships across those silos.

We've seen that same organic growth play out at customer after customer, so I think there's something fundamental here. When you get visibility into your information for the first time, you discover a lot of answers, but you discover even more questions.

ArnoldIT Comment

Endeca is one of the vendors whose system finds itself on procurement teams' short lists. The company has high-profile clients in eCommerce, behind-the-firewall search, and eDiscovery installations in government agencies. Competitors find themselves in the position of explaining how their systems compare to Endeca's. With the infusion of capital from Intel and SAP, Endeca will continue to exert pressure on competitors across the search market sectors. If you haven't explored Endeca's functionality, navigate to an Endeca-powered site and examine the system in action. More information is available at Endeca's website.

Stephen E. Arnold, March 17, 2008
