Digital Reasoning and Entity Based Analytics

October 5, 2011

As the entity-based analytics discipline becomes more prominent in the business sector, private company Digital Reasoning has already made great strides in setting the standard for achieving actionable intelligence.

Dr. Ric Upton will lead Digital Reasoning’s Washington, DC area office and team during this exciting time for the company. Its product, Synthesys, is exactly what analysts require in this era of ever-amassing data.

While many other firms offering intelligence software focus on a single aspect of entity extraction, Synthesys provides analysts with a comprehensive package for automating the interpretation of big data where the work of search and content processing systems leaves off.

In an exclusive ArnoldIT.com interview, Upton revealed how Digital Reasoning deals with such high volumes of real-time information. He said:

[O]ur processing and analytics often have to complement these high volume data flows. We do this in part through judicious use of cloud-based processing augmented by intelligent methods of processing and storing data as it becomes available so that we can avoid the need to perform batch processing or redundant processing of previously-captured data.
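
One way to picture the approach Upton describes is a pipeline that fingerprints each incoming item so previously-captured data is never reprocessed. The sketch below is our illustration only, not Digital Reasoning’s code; the function names and the SHA-256 fingerprinting are assumptions.

import hashlib

seen_fingerprints = set()  # digests of previously-captured documents

def ingest(document_text):
    # fingerprint the content so redundant copies are recognized cheaply
    digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
    if digest in seen_fingerprints:
        return None  # already processed; no batch-style rework needed
    seen_fingerprints.add(digest)
    return analyze(document_text)

def analyze(document_text):
    # stand-in for the real entity analytics pipeline
    return {"word_count": len(document_text.split())}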

The real value is the firm’s focus on content-centric analytics rather than statistical algorithms applied to structured data. Essentially, Synthesys deciphers the subtext and implicit meanings of content that does not have to be well structured. The real feat is that Digital Reasoning can automate this analysis without any data preparation.

Without Digital Reasoning’s systematic interpretation of data, analysts and clients would have to spend hours upon hours reading and comprehending content themselves.

Upton shared the reasons why clients have typically used their software:

Our ability to automate understanding is critical to customers with concerns about time, accuracy, completeness, or even the ability to leverage the massive amount of data they have generated.

Serving as an intermediary between raw data and analysts in the business process, the software has the capability to understand the subtleties of human language. Synthesys can understand the underlying messages in the context of the content’s medium, whether it is a blog, a tweet, or an SMS.

In the interview, Upton offers insight into how this rich entity extraction manifests itself:

We don’t just extract a name, we can develop and create a persona – the sum of what a person is called, where they have been and when, their relationships with other persona, their behaviors over time, etc.
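
Digital Reasoning has not published the schema behind these personas, so the toy data structure below is purely our own sketch of the idea; every field name is hypothetical.

from dataclasses import dataclass, field

@dataclass
class Persona:
    aliases: set = field(default_factory=set)          # every name the person is called
    sightings: list = field(default_factory=list)      # (place, time) observations
    relationships: dict = field(default_factory=dict)  # persona id -> relation type
    behaviors: list = field(default_factory=list)      # actions observed over time

    def merge(self, other):
        # fold a newly extracted mention into the accumulated persona
        self.aliases |= other.aliases
        self.sightings += other.sightings
        self.relationships.update(other.relationships)
        self.behaviors += other.behaviors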

Digital Reasoning is already looking toward a future in which other media, such as video and audio, carry weight as data sources. As the company develops methods to analyze these formats, competitors’ opportunities to dominate the field dwindle away.

Megan Feil, October 5, 2011

Sponsored by Pandia.com

Interview with John Steinhauer, Search Technologies

August 29, 2011

Search Technologies Corp., a privately held firm headquartered in Herndon, VA, continues to widen its lead as the premier enterprise search consulting and engineering services firm. Founded six years ago, the company has grown rapidly. The firm’s dozens of engineers offer clients deep experience in Microsoft (SharePoint and Fast), Lucene/Solr, Google Search Appliances, and Autonomy systems, among others. Another factor that sets Search Technologies apart is that the company is profitable and debt-free, and its business continues to grow at 20 percent or more each year.

John Steinhauer, vice president of technology, Search Technologies

On August 8, I spoke with John Steinhauer, vice president of technology at Search Technologies. Before joining Search Technologies, Mr. Steinhauer was the director of product management at Convera. He attended Boston University and the University of Chicago. At Search Technologies, Mr. Steinhauer is responsible for the day-to-day direction of all technical and customer delivery operations. He manages a growing team of more than 75 engineers and project managers. Mr. Steinhauer is one of the most experienced project directors in the enterprise search space, having been involved with hundreds of sophisticated search implementations for commercial and government clients. The full text of the interview appears below.

What’s your role at Search Technologies?

Search Technologies is an IT services provider focused on search engines. Working with search engines is essentially all we do. We’re technology independent and work with most of the leading vendors, and with open source. The things we do with search engines cover a broad spectrum – from helping companies in need of some expert resources to deliver a project on time, to fully inclusive development projects where we analyze, architect, develop, and implement a new search-based solution for a customer and then provide a fully managed service to administer and maintain the application. If required, we can also host it for the customer, at one of our hosting facilities or in the cloud.

My title is VP, Technology. I am one of the three original founders of the company and have been in the search engine business full-time since 1997. I am responsible for the technical organization, comprising 70+ people, including Professional Services, Engineering, and Technical Support.

From your point of view, what do customers value most about your services?

We bring hard-won experience to customer projects and a deep knowledge of what works and where the difficult issues lie. Our partners, the major search vendors, sometimes find it difficult to be pragmatic, even where they have their own implementation departments, because their primary focus is their software licensing business. That’s not a criticism. As with most enterprise software sectors, license fees pay for all of the valuable research & development that the vendors put in to keep the industry moving forward. But it does mean that in a typical services engagement, less emphasis is put on the need for implementation planning, and ongoing processes to maintain and fine-tune the search application. We focus only on those elements, and this benefits both customers, who get more from their investment, and search engine partners who end up with happier customers.

In your role as VP of Technology, what achievements are you most proud of?

I’m proud that we have built a company with happy customers, happy employees, and good profits. I’m also proud that we’ve delivered some massively complex projects on time and on budget, even after others have tried and failed. It is gratifying that we have ongoing, multi-year relationships with household names such as the US Government Printing Office, Library of Congress, Comcast, the BBC, and Yellowpages.com.

But our primary achievement is probably the level of expertise of our personnel, along with the methodologies and best practices they use that are now embedded into our company culture. When we engage with customers, we bring experience and proven methodologies with us. That mitigates risks and saves money for customers.

Do you recommend search engines to customers?

Occasionally, but only after conducting what we call an “Assessment.” We start from first principles to understand the customer’s circumstances: business needs, data sets, user requirements, infrastructure, existing licensing arrangements, etc. Based on a full knowledge of those issues, we offer independent advice and product recommendations including, where appropriate, open source alternatives.

So you also work with customers who have already chosen a search engine?

This is our primary business. Often, our initial engagement with a customer is to solve a problem; they’ve acquired a software license, spent significant time and money on implementation and are having technical problems and/or trouble meeting their deadlines and budgets. Problems include poor relevancy, performance and scaling issues, security issues, data complexity issues, etc. Probably 70% of our customers first engaged with us by asking us to look at a narrow problem and solve it. Once they discover what we can do and how cost effective we are, they typically expand the scope into implementation of the full solution. We help people to implement best practices to reduce complexity and ownership cost, while dramatically improving the quality of the search service.

So, what’s your secret sauce?

With search projects, usually the secret sauce is that there is no secret sauce. Success is down to hard work and execution at the detail level.

What makes Search Technologies unique?

Sure. If there is any secret to building great search applications, it is usually in showing greater respect for the data and how best to process and enhance it so that sophisticated search features work effectively through the front end. That, and experience from hundreds of search application development projects. When a customer hires a Search Technologies engineer to participate in their project, they are not just getting a well-trained, hard-working, and hugely experienced individual who writes good code; they are getting access to 80+ technical colleagues in the background with more than 40,000 person-days of experience on search projects. We’re great at sharing experiences and best practices – we’ve worked hard at that since the beginning. Also, our staff turnover is really low. People who like working with search engines like it here, and they tend to stick around. That huge body of experience is our differentiation.

So you’re pure services, no software of your own?

In customer engagements we’re pure services. That’s our business. But as a company of largely technical people, of course we’ve developed software along the way. But we do so for the purposes of making our implementation services more efficient, and our support and maintenance services more reliable and sustainable.

Where is the search engine industry heading?

There are now two 800-pound gorillas in the market, called Microsoft and Google. That’s a big difference from the somewhat fractious market that existed 10 years ago. That will certainly make it harder for smaller vendors to find oxygen. But at the same time, these very large companies have their own agendas for what features and platforms matter to them and their customers. They will not attempt to be all things to all prospective customers in the way that smaller, hungrier vendors have. In theory this should leave gaps for either products or services companies to fill where specific and relatively sophisticated capabilities are required. We see those requirements all over the place.

Open source (primarily Solr/Lucene) is making major inroads too. We are seeing a lot of large companies move in this direction.

So is innovation dead?

Not at all. Actually, we see lots of companies doing really cool and innovative things with search. Many people have been operating on the assumption that search software would reach a sort of commodity state. Analysts have predicted for years that once all the hard problems had been solved, all search engines would have equivalent capabilities and compete on price. What we’re seeing is very different. People are realizing that these problems can’t just be solved once and packaged into an off-the-shelf solution.

Instead, the software vendors are putting a ring fence around the core search functionality and letting integrators and smart customers go from there. With search, there are now some firmly established basics: platforms need good indexing pipelines, relevancy algorithms that can be tweaked to suit the audience, navigation options based on metadata, and readable, insightful results summaries. But that’s just the starting point for great search.

Here’s an example we’ve been involved with recently. Auto-completion functions have been around for years. You start the search clue, and the system suggests what you’re looking for, to help you complete it more quickly. We’ve recently implemented some innovative new ways of doing this, working with a customer who has a specific business need. This includes relevancy ranking and tweaking of auto-completion suggestions, and the inclusion of industry jargon. Influencing search behavior in this way not only helps the customer to provide a very efficient search service, it also supports business goals by promoting particular products and services in context. Think of it as a form of relevancy tuning, but applicable to the search clue and not just the results. These are small tweaks that can have a big impact on the customer’s bottom line.
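
As a rough illustration of ranked auto-completion, consider the sketch below. It is not Search Technologies’ implementation; the vocabulary, weights, and function name are invented to show how suggestions can be ordered by business value rather than alphabetically.

def suggest(prefix, weighted_phrases, limit=5):
    # rank matching completions by an assigned business weight
    prefix = prefix.lower()
    matches = [(weight, phrase) for phrase, weight in weighted_phrases.items()
               if phrase.startswith(prefix)]
    return [phrase for weight, phrase in sorted(matches, reverse=True)[:limit]]

# hypothetical vocabulary mixing common queries, jargon, and a promoted product
phrases = {"ip telephony": 2.0, "ipad case": 4.0, "ipad 2 promotion": 9.0}
print(suggest("ip", phrases))  # ['ipad 2 promotion', 'ipad case', 'ip telephony']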

Another big innovation is SaaS models for search applications. This has also been talked about for years, but is really just now coming into focus in practical ways that customers can leverage.

I understand that your business is growing. Where are you heading and what might Search Technologies look like in a couple of years?

Perhaps the most pleasing thing of all for me personally is that a lot of our growth, which is averaging 20%+ year on year, comes from perpetuating existing relationships with customers. This speaks well for customer satisfaction levels. We’ve just renewed our Microsoft GOLD partner status, and as part of that, we conduct a customer satisfaction survey and share the results with Microsoft. The returns this year have been really great. So one of the places we are heading is to build ever longer, deeper relationships with companies for whom search is a critical application. We initially engaged with all of our largest customers by providing a few consultant-days of search expertise and implementation services. Today, we provide these same customers with turnkey design and implementation, hosting services, and “hands-off” managed services where all the customer does is use the search application and focus on their core business. This model works really well. Through our experience and focus on search, we can run search systems very efficiently and provide a consistently excellent search experience to the customer’s user community. In the future we’ll do a lot more of this.

Finally, tell me something about yourself

I grew up in Michigan and have lived in Chicago, Boston, DC, London, and now San Diego. The best thing about that is I can ride my bike to work most mornings year round. I have two boys (4 years old and 6 months old), neither of whom has the slightest clue what a Michigan winter entails. I expect that will continue for the foreseeable future.

Don C Anderson, August 29, 2011

Sponsored by Search Technologies

Exclusive Interview with Ana Athayde, Spotter SA

August 16, 2011

I have been monitoring Spotter SA, a European software development firm specializing in business intelligence, for several years. A lengthy interview with the founder, Ana Athayde, appears in the Search Wizards Speak section of the ArnoldIT.com Web site.

The company has offices throughout Europe, the Middle East, and in the United States. The firm offers solutions in market sentiment, reputation management, risk assessment, crisis management, and competitive intelligence.

In the wide-ranging interview, Ms. Athayde mentioned that she had been recognized as an exceptional manager, but she was quick to give credit to her staff and her chief technical officer, who was involved in the forward-looking Datops SA content analytics service, now absorbed into the LexisNexis organization.

I asked her what pulled her into the vortex of content processing and analytics. She told me:

My background is business and marketing management in the sports field. In my first professional experience, I had to face major challenges in communication and marketing working for the International Olympic Committee. The amount of information published on those subjects was so huge that the first challenge was to solve the infoglut: not only to search for relevant information and build a list, but to understand opinions and assess reputation at an international level….I decided to found a company to deliver a solution that could make use of information in textual form, what most people call unstructured data. But I knew that the information had to be presented in a way that a decision maker could actually use. Data dumps and row after row of numbers usually mean no one can tell what’s important without spending minutes, maybe hours, deciphering the outputs.

I asked her about the firm’s technical plumbing. She replied:

The architecture of our own crawling system is based on proprietary methods to define and tune search scenarios. The “plumbing” is a fully scalable architecture which distributes tasks to schedulers. The content is processed, and we syndicate results. We use what we call “a source monitoring approach” which makes use of standard Web scraping methods. However, we have developed our own methods to adjust the scraping technology to each source in order to search all available documents. We extract metadata and relevant content from each page or content object. Only documents which have been assessed as fresh are processed and provided to users. This assessment is done by a proprietary algorithm based on rules involving such factors as the publication date. This means that each document collected by Spotter’s tracking and monitoring system is stamped with a publication date. The Web scraping technology extracts this date from several signals: the document content itself, the behavior of the source (that is, a known update cycle), analysis of the document’s text, and the date and time stamp on the document.
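
A simple way to picture the freshness logic Ms. Athayde describes is a rule that picks the most trustworthy publication date from the available signals and then applies a cutoff. The Python sketch below is our own reading of the quote; Spotter’s actual algorithm is proprietary, and the fallback rule and seven-day horizon are assumptions.

from datetime import timedelta

def publication_date(metadata_date, body_date, fetch_time, update_cycle_days):
    # prefer an explicit date from page metadata or the text itself,
    # as long as it is plausible (not later than the fetch)
    for candidate in (metadata_date, body_date):
        if candidate is not None and candidate <= fetch_time:
            return candidate
    # otherwise fall back on the source's known update cycle
    return fetch_time - timedelta(days=update_cycle_days)

def is_fresh(pub_date, now, horizon_days=7):
    # only documents assessed as fresh move on to processing
    return now - pub_date <= timedelta(days=horizon_days)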

Anyone who has tried to use the dates provided in some commercial systems realizes that without accurate time context, much information is essentially useless until additional research and analysis supply that context.

To read the complete interview with Ms. Athayde, point your browser to the full text of our discussion. More information about Spotter SA is available at the firm’s Web site www.spotter.com.

Stephen E Arnold, August 16, 2011

Freebie but you may support our efforts by buying a copy of The New Landscape of Enterprise Search

Thoughts from an Industry Leader: Margie Hlava, Access Innovations

August 4, 2011

Here are some astute observations on the direction of enterprise search from someone who knows what she’s talking about. Library Technology Guides points to an interview with Margie Hlava, president of Access Innovations, in “Access Innovations founder and industry pioneer talks about trends in taxonomy and search.”

Ms. Hlava’s 33 years in the search industry informed her observations on current trends, three of which she sees as significant: cloud and Software as a Service (SaaS) computing, term mining, and the demand for metadata.

The move to cloud and SaaS computing demands more of our hardware, not less, Hlava insists. In particular, broadband networks are struggling to keep up. One advantage of the shift is a declining need to navigate labyrinths of hardware, software, and even internal politics on the client side. Other pluses are the movement toward increased data sharing and service enhancement. Also, more ways to maintain security and intellectual property rights are on the horizon.

Term mining, Hlava explains, is “a process involving conceptual extraction using thesaurus terms and their synonyms with a rule-base, then looking for occurrences to create more detailed data maps.” Her company leverages this concept to make the most of clients’ large data sets. She is interested in new angles like mashups, data fusion, visualization, linked data, and personalization, but with a caveat: success in all of these depends on the quality of the data itself. “Rotten data gives rotten results.”

Ms. Hlava regards taxonomies and other metadata enrichment as the way to bring efficiency to our searches. In that realm, the benefits have only begun:

“In terms of taxonomies and search, I think we have just scratched the surface. With good data, our clients are in a good position to do an incredible array of new and interesting things. Good taxonomies take everything to the next level, forming the basis of not only mashups, but also author networks, project collaborations, deeper and better information retrieval,” she concluded.

Wise words from a wise woman. We look forward to watching these predictions take shape as the search industry moves forward. The interview with Margie Hlava can be read in full here.

Access Innovations offers a wide range of content management services. The company has been building its semantic-based solutions for over thirty years and prides itself on its unique tool set and experienced personnel.

Stephen E Arnold, August 4, 2011

Sponsored by Pandia.com, publishers of The New Landscape of Enterprise Search

The Feivi Arnstein Interview: Founder of SearchLion

August 2, 2011

On August 1, 2011, I had an opportunity to talk with Feivi Arnstein, founder of SearchLion. SearchLion provides a browser-based interface that looks like a Google-influenced Web search system. The system makes it easy to narrow a query to specific types of content; for example, Web content, images, news, blogs, and Twitter messages.

SearchLion reflects an approach quite different from the brute-force keyword methods used by the early Web search systems. In fact, the tagline for the service is “The New Way to Search.” To make certain a user understands the new direction the company is taking, the splash page offers the greeting, “Welcome to 21st century Web search.”

I ran queries on the system, which offers relevance-ranked search results from Google and Yahoo. I found the output useful. When I clicked on the Open button next to an entry in the results list, the system displayed a preview of the Web page in the browser. In addition, hits related to the result I “opened” are listed in the right-hand column of the display.

Source: www.searchlion.com

When I spoke with Mr. Arnstein, I was curious about the inspiration for the interface, which puts the focus on content, not ads. The idea for the content centric interface was, according to Mr. Arnstein, a result of his work in the financial services sector. Screens for traders, for example, are filled with information important to the task at hand. He said:

My first professional background was as a Technical Futures trader. I spent several years making a living day trading equity futures from my own private office. When you trade equities, you use software which makes use of every inch of screen space. So, for example, you can have a screen which is evenly split into four equity charts. The concept is simple: the more data you can access on the screen, the more productive you will be. I was accustomed to the efficiency of trading software. I realized that when searching and browsing the web, there were big parts of the screen going to waste. So I sought to find ways to use the available screen space to give the user more data.

He noted:

We think this fosters switching back and forth, which is time consuming and can be confusing to many users. If you can have results and the source both on the same screen, our research suggests that users can find what they’re looking for much more quickly. In addition to opening the live sites, you can also save your searches together with the live sites. When you then load a search from your saved list, the live sites open automatically. We’ve used the same concepts with our MultiView feature. Instead of the live Web site, MultiView uses the blank areas of the page to show you a different type of search result; for example, images, news, videos, etc.

The technical challenges were “interesting”, according to Mr. Arnstein. He added:

When showing more information, your browser will be using more resources. It took a lot of work and innovation to make sure the user gets his additional information, whether the live sites or the various types of results, and the system still remains extremely fast.

You can read the full interview with Mr. Arnstein on the ArnoldIT.com subsite, Search Wizards Speak. The SearchLion site is at http://www.searchlion.com.

Stephen E Arnold, August 2, 2011

Sponsored by Pandia.com, publishers of The New Landscape of Enterprise Search

Exclusive Interview with Margie Hlava, Access Innovations

July 19, 2011

Access Innovations has been a leader in the indexing, thesaurus, and value-added content processing space for more than 30 years. Margie Hlava’s company has worked for most of the major commercial database publishers, the US government, and a number of professional societies.

See www.accessinn.com for more information about MAI and the firm’s other products and services.

When I worked at the database unit of the Courier-Journal & Louisville Times, we relied on Access Innovations for a number of services, including thesaurus guidance. Her firm’s MAI system and its supporting products deliver what most of the newly-minted “discovery” systems need: indexing that is accurate, consistent, and makes it easy for a user to find the information needed to answer a research or consumer-level question. What few realize is the value of standards built into the systems and methods developed by the taxonomy experts at Access Innovations. Specifically, the Access Innovations approach generates an ANSI-standard term list. Without getting bogged down in details, the notion of an ANSI-compliant controlled term list embodies logical consistency and adherence to strict technical requirements. See the Z39.19 ANSI/NISO standard. Most of the 20-somethings hacking away at indexing fall far short of the quality of the Access Innovations implementations. Quality? Not in my book. Give me the Access Innovations (Data Harmony) approach.

Care to argue? I think you need to read the full interview with Margie Hlava in the ArnoldIT.com Search Wizards Speak series. Then we can interact enthusiastically.

On a rare visit to Louisville, Kentucky, on July 15, 2011, I was able to talk with Ms. Hlava about the explosion of interest in high-quality content tagging, the New Age word for indexing. Our conversation ranged from the roots of indexing to the future of systems which will be available from Access Innovations in the next few months.

Let me highlight three points from our conversation, interview, and enthusiastic discussion. (How often do I, in rural Kentucky, get to interact with one of the, if not the, leading figures in taxonomy development and smart, automated indexing? Answer: Not often enough.)

First, I asked how her firm fits into the landscape of search and retrieval.

She said:

I have always been fascinated with logic, and applying it to search algorithms was a perfect match for my intellectual interests. When people have an information need, I believe there are three levels to the resources which will satisfy them. First, the person may just need a fact checked. For this they can use an encyclopedia, a dictionary, etc. Second, the person needs what I call “discovery.” There is no simple factual answer, and one needs to be created or inferred. This often leads to a research project, and it is certainly the beginning point for research. Third, the person needs updating: what has happened since I last gathered all the information available? Ninety-five percent of search is either number one or number two. These three levels are critical to answering the user’s questions properly and determining what kind of search will support their needs. Our focus is to change search to found.

Second, I probed why indexing is such a hot topic.

She said:

Indexing, which I define as the tagging of records with controlled vocabularies, is not new. Indexing has been around since before Cutter and Dewey. My hunch is that librarians in Ephesus put tags on scrolls thousands of years ago. What is different is that it is now widely recognized that search is better with the addition of controlled vocabularies. The use of classification systems, subject headings, thesauri and authority files certainly has been around for a long time. When we were just searching the abstract or a summary, the need was not as great because those content objects are often tightly written. The hard sciences went online first and STM [scientific, technical, medical] content is more likely to use the same terms worldwide for the same things. The coming online of social sciences, business information, popular literature and especially full text has made search overwhelming, inaccurate, and frustrating. I know that you have reported that more than half the users of an enterprise search system are dissatisfied with that system. I hear complaints about people struggling with Bing and Google.

Third, I queried her about her firm’s approach, which I know to be anchored in personal service and obsessive attention to detail to ensure the client’s system delivers exactly what the client wants and needs.

She said:

The data processed by our systems are flexible and free to move. The data are portable. The format is flexible. The interfaces are tailored to the content via the DTD for the client’s data. We do not need to do special programming. Our clients can use our system and perform virtually all of the metadata tasks themselves through our systems’ administrative module. The user interface is intuitive. Of course, we would do the work for a client as well. We developed the software for our own needs, and that includes needing to be up, running, and in production on a new project very quickly. Access Innovations does not get paid for down time. So our staff are trained. The application can be set up, fine-tuned, and deployed in production mode in two weeks or less. Some installations can take a bit longer. But as soon as we have a DTD, we can have the XML application up in two hours. We can create a taxonomy really quickly as well. So the benefits are: fast, flexible, accurate, high quality, and fun!

You will want to read the complete interview with Ms. Hlava. Skip the pretend experts in indexing and taxonomy. The interview answers the question, “Where’s the beef in the taxonomy burger?”

Answer: http://www.arnoldit.com/search-wizards-speak/access-innovations.html

Stephen E Arnold, July 19, 2011

It pains me to say it, but this is a freebie.

Laurent Couillard, CEO, Dassault Exalead: Exclusive Interview

June 28, 2011

Exalead caught my attention many years ago. Exalead’s CloudView approach allowed licensees to tap into Exalead’s traditional Web and enterprise functions via on-premises installations, a cloud implementation, or a hybrid approach. Today, a number of companies are working to offer these options. Exalead’s approach is stable and provides a licensee with platform flexibility as well as mobile search, mash-ups, and inclusion of Exalead technology into existing enterprise applications. For organizations fed up with seven-figure licensing fees for content processing systems that “never seem to arrive,” Exalead has provided a fresh approach.

Exalead provides high-performance search and semantic processing to organizations worldwide. Exalead specializes in taking a company’s data “from virtually any source, in any format” and transforming it into a search-enabled application. The firm’s technology, Exalead CloudView, represents the implementation of next-generation computing technology available for on-premises installation and from hosted or cloud services. Petascale content volume and mobile support are two CloudView capabilities. Exalead’s architecture makes integration and customization almost friction-free. The reason for the firm’s surge in the last two years has been its push into the enterprise with its search-based applications.

The idea of an enterprise application built upon a framework that can seamlessly integrate structured and unstructured data is one of the most important innovations in enterprise search. Only Google, Microsoft, and Exalead can boast commercial books about their search and content processing technology.

In 2010, Exalead’s market success triggered action on the part of one of the world’s leading engineering firms, Dassault Systèmes. Instead of simply licensing Exalead’s technology, the firm acquired Exalead and aggressively expanded the firm’s research, development, and marketing activities. Exalead’s approach enables more than 300 organizations to break the chains of the “key word search box” and has provided Dassault with a competitive advantage in next-generation information processing. In addition to mobile and rich media processing, Exalead is working to present integrated displays of real-time information that add value to a wide range of business functions. These range from traditional engineering to finding a restaurant on an iPhone.

Laurent Couillard, chief executive officer, Dassault Exalead

With the purchase of Exalead, Dassault appointed Laurent Couillard as Exalead’s chief executive officer. Mr. Couillard joined Dassault Systèmes as an application engineer in 1996, most recently serving as Vice-President Sales and Distribution for Europe, the Middle East and Africa. In that post, he played a central role in the sales transformation of 3DS, establishing a powerful reseller channel for all PLM brands and contracting with more than 140 companies. As CEO of Exalead, his mission is to accelerate the market penetration of applications based on search technologies. Mr. Couillard holds an M.S. from Institut Supérieur de l’Aéronautique et de l’Espace, a preeminent institution in Toulouse, France.

I asked him what was capturing his attention. He told me:

We are devoting more energy to developing packaged business applications or SBAs built on this foundation. That’s a mission right up my alley. And I intend to apply all my experience in sales and partner network development to this mission as well. That’s my charge from Dassault: To use my dual technology/sales background to develop Exalead and to penetrate new markets with SBAs, while preserving all the qualities that make Exalead so unique in this market. I’m fortunate to be in a position to leverage the full knowledge, resources, geographical coverage and expertise of the Dassault group to make this happen.

I probed for the reasons behind Dassault’s purchase of Exalead in 2010, a move which caught many analysts by surprise. He said:

Dassault saw first-hand how search-based applications based on Exalead’s systems and methods solved some of its clients’ long-standing, mission-critical business challenges quickly, painlessly, and inexpensively. Dassault’s management understood, based on technical, financial, and performance facts, that Exalead’s search-based applications were a prime reason why search was, and is forecast to remain, an exceptional performer in the information technology software market. Because Dassault was seeking to diversify its content processing offerings, search in general and search-based applications technology in particular were obviously an appealing choice. Dassault is, therefore, developing SBAs as one of its three core activities.

We discussed the challenges facing most of the traditional key word search and content processing systems. He noted:

You have to remember Exalead’s always understood search is sometimes something you do, and other times something you consume. In other words, sometimes it’s a search text box, and sometimes it’s the silent enabler beneath a business application, or even an entire information ecosystem.

You can read the full text of my interview with Mr. Couillard in the ArnoldIT.com Search Wizards Speak collection. The interview is located at this link.

Stephen E Arnold, June 28, 2011

Freebie from the leading vertical file service for search and content processing.

Interview: Forensic Logic CTO, Ronald Mayer

May 20, 2011

Introduction

Ronald Mayer has spent his career with technology start-ups in a number of fields ranging from medical devices to digital video to law enforcement software. Ron has also been involved in open source for decades, with code that has been incorporated in the LAME MP3 library, the PostgreSQL database, and the PostGIS geospatial extension. In his most recent speaking engagement, he gave a presentation on a broader aspect of this system to the SD Forum’s Emerging Tech SIG titled “Fighting Crime: Information Choke Points & New Software Solutions.” His Lucene Revolution talk is at http://lucenerevolution.org/2011/sessions-day-2#highly-mayer.

Ronald Mayer, Forensic Logic

The Interview

When did you become interested in text and content processing?

I’ve been involved in crime analysis with Forensic Logic for the past eight years. It quickly became apparent that while a lot of law enforcement information is kept in structured database fields, often the richer information is in text narratives, Word documents on officers’ desktops, or internal email lists. Police officers are all-too-familiar with long structured search forms for looking stuff up in systems built on top of relational databases. There are adequate text-search utilities for searching the narratives in their various systems one at a time, and separate text-search utilities for searching their mailing lists. But what they really need is something as simple as Google that works well on all the information they’re interested in, both structured and unstructured content, both internal documents and ones from other sources; so we set out to build one.

What is it about Lucene/Solr that most interests you, particularly as it relates to some of the unique complexity law enforcement search poses?

The flexibility of Lucene and Solr is what really attracted me to Solr. There are many factors that contribute to how relevant a search is to a law enforcement user. Obviously, traditional text-search factors like keyword density and exact phrase matches matter. How long ago an incident occurred is important (a recent similar crime is more interesting than a long-ago similar crime). And location is important too. Most police officers are likely to be more interested in crimes that happen in their jurisdiction or neighboring ones. However, a state agent focused on alcoholic beverage licenses may want to search for incidents from anywhere in a state but may be most interested in ones that are at or near bars. The quality of the data makes things interesting too. Victims often have vague descriptions of offenders, and suspects lie. We try to program our system so that a search for “a tall thin teen male” will match an incident mentioning “a 6’3″ 150lb 17 year old boy.”
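
One plausible way to get that kind of match, sketched below, is to normalize structured attributes into the vague descriptors witnesses actually use and index those terms alongside the record. This is our illustration, not Forensic Logic’s code, and the cutoff values are invented.

def descriptor_terms(height_inches, weight_pounds, age_years):
    # map precise measurements onto witness-style vocabulary; cutoffs are invented
    terms = []
    if height_inches >= 73:
        terms.append("tall")
    elif height_inches <= 64:
        terms.append("short")
    if weight_pounds <= 155:
        terms.append("thin")
    elif weight_pounds >= 220:
        terms.append("heavy")
    if 13 <= age_years <= 19:
        terms.append("teen")
    return terms

# a 6'3", 150 lb, 17-year-old subject now carries the terms a vague query will use
print(descriptor_terms(75, 150, 17))  # ['tall', 'thin', 'teen']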

There’s been a steady emergence of information technology in law enforcement, such as New York City’s CompStat. What are the major issues in this realm, from an information retrieval processing perspective?

We’ve had meetings with the NYPD’s CompStat group, and they have inspired a number of features in our software, including powering the CompStat reports for some of our customers. One of the biggest issues in law enforcement data today is bringing together data from different sources and making sense of it. These sources could be different systems within a single agency, like records management and CAD (Computer Aided Dispatch) systems and internal agency email lists; or groups of cities sharing data with each other; or federal agencies sharing data with state and local agencies.

Is this a matter of finding new information of interest in law enforcement and security? Or is it about integrating the information that’s already there? Put differently, is it about connecting the dots you already have, or finding new dots in new places?

Both. Much of the work we’re doing is connecting dots between data from two different agencies, or two different software systems within a single agency. But we’re also indexing a number of non-obvious sources as well. One interesting example is a person who was recently found in our software; one of the better documents describing a gang he’s potentially associated with is a Wikipedia page about one of his relatives.

You’ve contributed to Lucene/Solr. How has the community aspect of open source helped you do your job better, and how do you think it has helped other people as well?

It’s a bit early to say I’ve contributed; while I posted my patch to the project’s issue-tracking Web site, last I checked it hadn’t been integrated yet. There are a couple of users who mentioned to me and on the mailing lists that they are using it and would like to see it merged. The community help has been incredible. One example: when we started a project to make a minimal, simple user interface to let novice users find agency documents, we noticed the University of Virginia/Stanford/etc.’s Project Blacklight, a beautiful library search product built on Solr/Lucene. Our needs for one of our products weren’t too different, just an internal collection of documents with a few additional facets. With that as a starting point we had a working prototype in a few man-days of work, and a product in a few months.

What are some new or different uses you would like to see evolve within search?

It would be interesting if search phrases could be aware of which adjectives go with which nouns. For example, a phrase like

‘a tall white male with brown hair and blue eyes and
a short asian female with black hair and brown eyes’

should be a very close match to a document that says

‘blue eyed brown haired tall white male; brown eyed
black haired short asian female’

Solr’s edismax “pf2” and “pf3” parameters can do quite a good job at this by considering the distance between words, but note that in the latter document the “brown eyes” clause is nearer to the male than the female, so there’s some room for improvement. I’d like to see some improved spatial features as well. Right now we use a single location in a document to help sort how relevant it might be to a user (incidents close to a user’s agency are often more interesting than ones halfway across the country). But some documents may be highly relevant in multiple locations, like a drug trafficking ring operating between Dallas and Oakland.
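
For readers unfamiliar with those knobs: pf2 and pf3 are edismax parameters that boost documents where two- and three-word runs of the query appear close together. A minimal query might look like the following sketch; the Solr URL, field name, and boost values are placeholders, not Forensic Logic’s configuration.

import requests

params = {
    "defType": "edismax",   # use the extended dismax query parser
    "q": "tall white male brown hair blue eyes",
    "qf": "narrative",      # hypothetical free-text field
    "pf2": "narrative^5",   # boost docs where adjacent word pairs survive intact
    "pf3": "narrative^10",  # boost three-word runs even more
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/select", params=params)
print(response.json()["response"]["numFound"])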

When someone asks you why you don’t use a commercial search solution, what do you tell them?

I tell them that, where appropriate, we also use commercial search solutions. For our analysis and reporting product that works mostly with structured data, we use a commercial text search solution because it integrates well with the relational tables that also filter results for such reporting. The place where Solr/Lucene’s flexibility really shines for us is in our product that brings structured, semi-structured, and totally unstructured data together.

What are the benefits to a commercial organization or a government agency when working with your firm? How does an engagement for Forensic Logic move through its life cycle?

Our software is used to power the Law Enforcement Analysis Portal (LEAP) project, which is a software-as-a-service platform for law enforcement tools, not unlike what Salesforce.com is for sales software. The project started in Texas and has recently expanded to include agencies from other states and the federal government. Rather than engaging us directly, a government agency would engage with the LEAP Advisory Board, which is a group of chiefs of police, sheriffs, and state and federal law enforcement officials. We provide some of the domain-specific software, while other partners such as SunGard manage some operations and other software and hardware vendors provide their support. The benefits to government agencies working with us are similar to the benefits of an enterprise working with Salesforce.com: leading-edge tools without having to buy expensive equipment and software and manage it internally.

One challenge to those involved with squeezing useful elements from large volumes of content is the volume of content and the rate of change in existing content objects. What does your firm provide to customers to help them deal with the volume (scaling) challenge? What is the latency for index updates? Can law enforcement and public security agencies use this technology to deal with updates from high-throughput sources like Twitter? Or is the signal-to-noise ratio too weak to make it worth the effort?

In most cases, when a record is updated in an agency’s records management system, the change is pushed to our system in a few minutes. For some agencies, mostly those with older mainframe-based systems, the integration is a nightly batch job. We don’t yet handle high-throughput sources like Twitter. License plate readers on freeways are probably the highest-throughput data source we’re integrating today. But we strongly believe it is worth the effort to handle high-throughput sources like Twitter, and that it’s our software’s job to deal with the signal-to-noise challenges you mentioned and present more signal than noise to the end user.

Visualization has been a great addition to briefings. On the other hand, visualization and other graphic eye candy can be a problem for those in stressful operational situations. What’s your firm’s approach to presenting “outputs” for end-user reuse or for mobile access? Is there native support in Lucid Imagination for results formats?

Visualization’s very important to law enforcement, with crime mapping and reporting being very common needs. We have a number of visualization tools like interactive crime maps, heat maps, charts, time lines, and link diagrams built into our software, and we also expose XML Web services to let our customers integrate their own visualization tools. Some of our products were designed with mobile access in mind. Others have such complex user interfaces you really want a keyboard.

There seems to be a popular perception that the world will be doing computing via iPad devices and mobile phones. My concern is that serious computing infrastructures are needed and that users are “cut off” from access to more robust systems. How do you see the computing world over the next 12 to 18 months?

I think the move to mobile devices is *especially* true in law enforcement. For decades most officers have “searched” their systems by using the radio they carry to verbally ask for information about people and property. It’s a natural transition for them to do this on a phone or iPad instead. Similarly, their data entry is often done first on paper in the field and then re-entered into computers. One agency we work with will be getting iPads for each of their officers to replace both of those. We agree that serious computing infrastructures are needed, but our customers don’t want to manage those themselves. Better if a SaaS vendor manages a robust system, and what better devices than iPads and phones to access it. That said, for some kinds of analysis a powerful workstation is useful, so good SaaS vendors will provide Web services so customers can pull whatever data they need into their other applications.

Put on your wizard hat. What are the three most significant technologies that you see affecting your search business? How will your company respond?

Entity extraction from text documents is improving all the time, so soon we’ll be able to distinguish whether a paragraph mentioning “Tom Green” is talking about a person or the county in Texas. For certain types of data we integrate, XML standards for information sharing such as the National Information Exchange Model are finally gaining momentum. As more software vendors support it, it’ll make it easier to inter-operate with other systems. Rich-media processing, like facial recognition, license plate reading, OCR, etc., is making new media types searchable and analyzable as well.

I note that you’re speaking at the Lucene Revolution conference. What effect is open source search having in your space? I note that the term ‘open source intelligence’ doesn’t really overlap with ‘open source software’. What do you think the public sector can learn from the world of open source search applications, and vice versa?

Many of the better tools are open source tools. In addition to Lucene/Solr, I’d note that the PostGIS extension to the PostgreSQL database is leading the commercial implementations of geospatial tools in some ways. That said, there are excellent commercial tools too. We’re not fanatic either way. Open source intelligence is important as well, and we’re working with universities to bring some of the collected research that they do on organized crime and gangs into our system. Regarding learning experiences? I think the big lesson is that easy collaboration is a very powerful tool, whether it’s sharing source code or sharing documents and data.

Lucene/Solr seems to have matured significantly in recent years, achieving a following large and sophisticated enough to merit a national conference dedicated to the open source projects, Lucene Revolution. What advice do you have for people who are interested in adopting open source search, but don’t know where to begin?

If they’re interested, one of the easiest ways to begin is to just try it. On Linux you can probably install it with your OS’s standard package manager, with a command like “apt-get install solr-jetty” or similar. If they have a particular need in mind, they might want to check whether someone has already built a Lucene/Solr-powered application similar to what they need. For example, we wanted a searchable index for a set of publications and documents, and Project Blacklight gave us a huge head start.
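
To show how small that first experiment can be, here is a minimal session against a stock single-core Solr instance, assuming the example schema that ships with it (with its id and title fields). It is a sketch for orientation, not a production pattern.

import requests

SOLR = "http://localhost:8983/solr"

# index one document using Solr's XML update message format
doc = ('<add><doc>'
       '<field name="id">doc1</field>'
       '<field name="title">Hello, Solr</field>'
       '</doc></add>')
requests.post(SOLR + "/update", params={"commit": "true"},
              data=doc, headers={"Content-Type": "text/xml"})

# query it back
result = requests.get(SOLR + "/select",
                      params={"q": "title:hello", "wt": "json"}).json()
print(result["response"]["numFound"])  # expect 1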

David Fishman, May 20, 2011

Post sponsored by Lucid Imagination. Posted by Stephen E Arnold

Exclusive Interview: Stephen O’Grady, RedMonk

May 18, 2011

Introduction

The open source movement is expanding, and it is increasingly difficult for commercial software vendors to ignore. Some large firms have embraced open source. If you license IBM OmniFind with Content Analytics, you get open source plus proprietary software. Oracle has opted for a different path, electing to acquire high-profile open source solutions such as MySQL and buying companies with a heritage of open source. Sun Microsystems is now part of Oracle, and Oracle became an organization of influence with regard to Java. Google is open source, or at least Google asserts that it is. Other firms have built engineering and consulting services around open source. A good example is Lucid Imagination, a firm that provides one-click downloads of Lucene/Solr and value-add software and consulting for open source search. The company also operates a successful conference series and has developed specialized systems and methods to handle scaling, big data, and other common search challenges.

I wanted to get a different view of the open source movement in general and probe the more narrow business applications of open source technology. Fortunately I was able to talk with Stephen O’Grady, the co-founder and Principal Analyst of RedMonk, a boutique industry analyst firm focused on developers. Founded in 2002, RedMonk provides strategic advisory services to some of the most successful technology firms in the world. Stephen’s focus is on infrastructure software such as programming languages, operating systems, and databases, with a special focus on open source and big data. Before setting up RedMonk, Stephen worked as an analyst at Illuminata. Prior to joining Illuminata, Stephen served in various senior capacities with large systems integration firms like Keane and consultancies like Blue Hammock. Regularly cited in publications such as the New York Times, NPR, the Boston Globe, and the Wall Street Journal, and a popular speaker and moderator on the conference circuit, Stephen’s advice and opinions are well respected throughout the industry.

The full text of my interview with him on May 16, 2011 appears below.

The Interview

Thanks for making time to speak with me.

No problem.

Let me ask a basic question. What’s a RedMonk?

That’s my favorite question. We are a different type of consultancy. We like to say we are “not your parents’ industry analyst firm.” We set up RedMonk in 2002.

Right. As I recall, you take a view of industry analysts and mid-tier consulting firms similar to mine.

Yes, pretty similar. We suggest that the industry analysis business has become a “protection racket… undoubtedly a profitable business arrangement, but ultimately neither sustainable nor ethical.” In fact, we make our content open and accessible in most cases. We work under yearly retained subscriptions with clients.

Over the last nine years we have been able to serve everyone from big household names to a large number of startups. We deliver consulting hours, press services, and a variety of other value-adds.

Quite a few firms say that. What’s your key difference?

We are practical.

First, RedMonk is focused on developers, whom we consider to be the new “kingmakers” in technology. If you think about it, most of the adoption we’ve seen in the last ten years has been bottom up.

Second, we’re “practitioner-focused” rather than “buyer-focused.” Our core thesis is that technology adoption is increasingly a bottom-up proposition, as demonstrated by Linux, Apache, MySQL, PHP, Firefox, or Eclipse. Each is successful because these solutions have been built from the ground floor, often in grassroots fashion.

Third, we are squarely in the big data space. The database market was considered saturated, but it exploded with new tools and projects. A majority of these are open source, and thus developer friendly. We are right in the epicenter of that shift.

Do you do commissioned research?

No, we don’t do commissioned research of any kind. We just don’t see it as high value, even if the research is valid.

How has the commercial landscape of search specifically, and data infrastructure generally, been impacted – for better or for worse – by open source?

As with every other market with credible open source alternatives, the commercial landscape of search has unquestionably been impacted. Contrary to some of the more aggressive or doom-crying assertions, open source does not preclude success for closed source products. It does, however, force vendors of proprietary solutions to compete more effectively. We talk about open source being like a personal trainer for commercial vendors in that respect; they can’t get lazy or complacent with open source alternatives readily available.

Isn’t there an impact on pricing?

Great point.

Besides pushing commercial vendors to improve their technology, open source generally makes pricing more competitive, and search is no exception here. Closed source alternatives remain successful, but even if an organization does not want to use open source, search customers would be foolish not to use the proverbial Amdahl mug as leverage in negotiations.

When the software is available for free, what are customers paying for?

Revenue models around open source businesses vary, but the most common is service and support. The software, in other words, is free, and what customers pay for is help with installation and integration, or the ability to pick up the phone when something breaks.

A customer may also be paying for updates, whereby vendors backport fixes or patches to older software versions. Broadly, then, the majority of commercial open source users are paying for peace of mind. Customers want the same assurances they get from traditional commercial software vendors. Customers want to know that there will be someone to help when bugs inevitably appear: open source vendors provide that level of support and assurance.

What’s the payoff to the open source user?

That’s my second favorite question.

The advantages of this model from the customer perspective are multiple, but perhaps the most important is what Simon Phipps once observed: users can pay at the point of value, rather than at acquisition. Just a few years ago, if you had a project to complete, you’d invite vendors in to do a bake-off. They would try to prove to you, in a demo lasting an hour or two, that their software could do the job well enough for you to pay to get it.

This is like an end run, right?

In general, yes. We believe open source software inverts the typical commercial software process. You download the software for free, employ it as you see fit, and determine whether it works or not. If it does, you can engage a commercial vendor for support. If it doesn’t, you’re not out the cost of a license. This shift has been transformative in how vendors interact with their customers, whether they’re selling open source software or not.

The general complexion of software infrastructure appears to be changing. Relational databases, once the only choice, are now merely one of many. Where does search fit in, and how do customers determine which pieces fit which needs?

The data infrastructure space is indeed exploding. In the space of eighteen months we’ve gone from “relational databases are the solution to every data problem” to, seemingly, a different persistence mechanism per workload.

As for how customers put the pieces together, the important thing is to work backwards from need. For example, customers that have search needs should, unsurprisingly, look at search tools like Solr. But the versatility of search makes it useful in a variety of other contexts; AT&T, for example, uses it for Web page composition.
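
For readers who have not worked with Solr directly, here is a minimal sketch of what “looking at a search tool like Solr” means in practice: a query against Solr’s standard HTTP select handler. The core name (“products”) and field names are hypothetical, chosen only for illustration.

# Minimal sketch: query a hypothetical Solr core ("products") over the
# standard HTTP select handler. Core and field names are illustrative.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "android handset",  # free-text query
    "fl": "id,name,price",   # fields to return
    "rows": 10,              # page size
    "wt": "json",            # ask for a JSON response
})
url = "http://localhost:8983/solr/products/select?" + params
with urllib.request.urlopen(url) as resp:
    results = json.load(resp)

for doc in results["response"]["docs"]:
    print(doc["id"], doc.get("name"))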

What’s driving the adoption of search? Is it simply a function of data growth, as the headlines seem to imply, or is there more going on?

Certainly data growth is a major factor. Every year there’s a new chart asserting things like “we’re going to produce more information in the next year than in all of recorded history,” but the important part is that it’s true. We are all, every one of us, generating massive amounts of information. How, then, do you extract the proverbial needle from the haystack? Search is one of the most effective mechanisms for this.

Just as important, however, has been the recognition amongst even conservative IT shops that the database does not need to be the solution to every problem. Search, like a variety of other non-relational tools, is far more of a first-class citizen today than it was just a few short years ago.

What is the most important impact effective search can have on an organization?

That’s a very tough question. I would say that one of the most important impacts search can have is that a good answer to one question will generate the next question. Whether it’s a customer searching your Web site for the latest Android handset or your internal analyst looking for last quarter’s sales figures, it’s crucial to get the right answer quickly if you ever want them to ask a second.

If your search fails and they don’t ask a second question, you’ll either have lost a potential customer or your analyst will be making decisions without last quarter’s sales figures. Neither is a good outcome.

Looking at the market ahead, what trends do you see impacting the market in the next year or two? What should customers be aware of with respect to their data infrastructure?

There are a great many trends that will affect search, but two of the most interesting from my view will be the increasing contextual intelligence of search and the accelerating integration of search into other applications. Far from being just a dumb search engine, Solr increasingly has an awareness of what specifically it is searching and, in some cases, how to leverage and manipulate that content, whether it’s JSON or numeric fields. This broadens the role that search can play, because it’s no longer strictly about retrieval.
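
As a rough illustration of that field awareness, the same HTTP API can apply a true numeric range filter and facet on a field, rather than treating everything as undifferentiated text. Core and field names are again hypothetical.

# Illustrative only: a numeric range filter (price treated as a real
# number, not a string) plus a facet over a category field.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "*:*",
    "fq": "price:[100 TO 300]",  # numeric range filter
    "facet": "true",
    "facet.field": "category",   # facet counts per category value
    "wt": "json",
})
url = "http://localhost:8983/solr/products/select?" + params
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# Solr returns facets as a flat [value, count, value, count, ...] list.
print(data["facet_counts"]["facet_fields"]["category"])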

And integration?

Okay, as for integration, data centers are increasingly heterogeneous, with databases deployed alongside MapReduce implementations, key-value stores and document databases.

Search fills an important role, which is why we’re increasingly seeing it not simply pointed at a repository to index, but leveraged in conjunction with tools like Hadoop.
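
A minimal sketch of that pattern, assuming an upstream batch job (Hadoop or otherwise) has already produced the documents: the results are posted to Solr’s JSON update handler so they become searchable. The endpoint, core name, and documents are illustrative, not taken from the interview.

import json
import urllib.request

# Hypothetical documents emitted by an upstream Hadoop job.
docs = [
    {"id": "log-001", "source_s": "hadoop", "body_t": "checkout error rate spike"},
    {"id": "log-002", "source_s": "hadoop", "body_t": "search latency within bounds"},
]

# Post to Solr's JSON update handler; commit=true makes the documents
# searchable immediately.
req = urllib.request.Request(
    "http://localhost:8983/solr/logs/update?commit=true",
    data=json.dumps(docs).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)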

What kind of threat does Oracle’s lawsuit against Google over Java pose to open source? How does it compare to the SCO controversy with Linux some years back?

In my view, Oracle’s ongoing litigation against Google over Java-related intellectual property has profound implications not only for the participants but also for the open source community as a whole.

The real concern is that the litigation, particularly if it is successful, could have chilling effects on Java usage and adoption. The comparison with SCO is imperfect in that Oracle’s suit targets a reimplementation of the platform in Android rather than the Java platform itself; SCO was threatening Linux itself rather than a less widely adopted derivative.

While users of both Java and MySQL should be aware of the litigation, realistically the implications for them, if any, are very long term. No one is going to abandon Java-based open source projects, for example, based on the outcome of Oracle’s suit.

It seems like everyone who is anyone in the software world has an open source strategy, right down to Microsoft’s embrace of PHP. Should information technology executives and decision makers, who were once suspicious of open source, now be suspicious of software vendors without a solid open source strategy?

With the possible exception of packaged applications, open source is a fact of life in most infrastructure software markets. Adoption is accelerating, the volume of options is growing, and – frequently – the commercial open source products are lower cost. So it is no surprise that vendors might feel threatened by open source.

But even if they choose not to sell open source software, as many do not, those without a solid open source interoperability and partnership story will be disadvantaged in a marketplace that sees open source playing crucial roles at every layer of the data center. Like it or not, that is the context in which commercial vendors are competing. Put more simply, if you’re building for a market of all closed source products, that’s not that large a market. In such cases, then, I would certainly have some hard questions for vendors who lack an open source strategy.

Where can a reader get more information about RedMonk?

Please visit our Web site at www.redmonk.com.

ArnoldIT Comment

RedMonk’s approach to professional services is refreshing and a harbinger of change in the consulting sector. But more importantly, the information in this interview makes clear that open source solutions and open source search technology are part of the disruption that is shaking the foundation of traditional computing. Vendors without an open source strategy are likely to face both customer and price pressure. Open source is no longer a marginalized option. Companies from Twitter to Cisco Systems to Skype, now a unit of Microsoft, rely on open source technology. RedMonk is the voice of this new wave of technical opportunity.

Stephen E Arnold, May 18, 2011

Exalead Embraces SWYM or “See What You Mean”

May 3, 2011

In late April 2011, I spoke with François Bourdoncle, one of the founders of Exalead. Exalead was acquired by Dassault Systèmes in 2010. The French firm is one of the world’s premier engineering and technology products and services companies. I wanted to get more information about the acquisition and probe the next wave of product releases from Exalead, a leader in search and content processing. Exalead introduced its search-based applications approach; since that shift, the firm has experienced a surge in sales. Organizations such as the World Bank and PricewaterhouseCoopers (IBM) have licensed the Exalead Cloudview platform.

I wanted to know more about Exalead’s semantic methods. In our conversation, Mr. Bourdoncle told me:

We have a number of customers that use Exalead for semantic processing. Cloudview has a number of text processing modules that we classify as providing semantic processing. These are: entity matching, ontology matching, fuzzy matching, related terms extraction, categorization/clustering and event detection among others. Used in combination, these processors can extract arbitrary sentiment, meaning not just positive or negative, but also along other dimensions as well. For example, if we were analyzing sentiment about restaurants, perhaps we’d want to know if the ambiance was casual or upscale or the cuisine was homey or refined.
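
To make the restaurant example concrete, here is a toy, lexicon-based sketch of scoring sentiment along dimensions other than positive/negative. This is emphatically not Exalead’s method; the labels and cue phrases are invented purely for illustration.

# Toy sketch of multi-dimensional sentiment; NOT Exalead's implementation.
AMBIANCE = {
    "casual": ["relaxed", "cozy", "laid-back"],
    "upscale": ["elegant", "refined decor", "white tablecloth"],
}
CUISINE = {
    "homey": ["comfort food", "hearty", "homestyle"],
    "refined": ["delicate", "inventive", "tasting menu"],
}

def score(text, lexicon):
    """Count cue phrases per label; the label with more hits wins."""
    text = text.lower()
    return {label: sum(cue in text for cue in cues)
            for label, cues in lexicon.items()}

review = "A relaxed, cozy room serving hearty comfort food."
print(score(review, AMBIANCE))  # {'casual': 2, 'upscale': 0}
print(score(review, CUISINE))   # {'homey': 2, 'refined': 0}

A production system would, of course, rely on the kinds of entity, ontology, and fuzzy matching Mr. Bourdoncle describes rather than raw substring cues.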

When I probed about future products and services, Mr. Bourdoncle stated:

I cannot pre-announce future product plans, but I will say that Dassault Systèmes has a deep technology portfolio. For example, it is creating a prototype simulation of the human body. This is a non-trivial computer science challenge. One way Dassault describes its technology vision is “See-What-You-Mean”, or SWYM.

For the full text of the April 2011 interview with Mr. Bourdoncle, navigate to the ArnoldIT.com Search Wizards Speak subsite. For more information about Exalead, visit www.exalead.com.

Stephen E Arnold, May 3, 2011

No money, but I was promised a KYFry the next time I was in Paris.
