Exalead: Making Headway in the US
October 25, 2008
Exalead, based in Paris, has been increasing its footprint in the US. The company has expanded its US operation and is now making headlines in information technology publications. The company has updated its enterprise search system CloudView. Peter Sayer's "Exalead Updates Enterprise Search to Explore Data Cloud" here provides a good summary of the system's new features. For me, the most important passage in the Network World article was this comment:
"Our approach is very different from Google's in that we're interested in conversational search," he [the president of Exalead] said. That "conversation" takes the form of a series of interactions in which Exalead invites searchers to refine their request by clicking on related terms or links that will restrict the search to certain kinds of site (such as blogs or forums), document format (PDF, Word) or language.
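To make the "conversational" refinement concrete, here is a minimal Python sketch of faceted narrowing over a small result set. The field names and sample records are invented for illustration; this is the general idea, not Exalead's API or code.

```python
# Hypothetical data and field names; not Exalead's API, just the idea of
# narrowing a result set one facet click at a time.
results = [
    {"title": "Search trends", "site": "blog", "format": "HTML", "lang": "en"},
    {"title": "CloudView notes", "site": "forum", "format": "PDF", "lang": "en"},
    {"title": "Guide CloudView", "site": "blog", "format": "PDF", "lang": "fr"},
]

def refine(hits, **facets):
    """Keep only the hits that match every selected facet value."""
    return [h for h in hits if all(h.get(k) == v for k, v in facets.items())]

step1 = refine(results, site="blog")    # user clicks the "blogs" facet
step2 = refine(step1, format="PDF")     # then clicks the "PDF" facet
print([h["title"] for h in step2])      # ['Guide CloudView']
```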
Exalead's engineering, however, is the company's "secret sauce." My research revealed that Exalead uses many of the techniques first pioneered by AltaVista.com, Google, and Amazon. As a result, Exalead delivers performance on content and query processing comparable to Google's. The difference is that the Exalead platform has been engineered to mesh with existing enterprise applications. Google's approach, on the other hand, requires a dedicated "appliance". Microsoft takes another approach, requiring customers to adopt dozens of Microsoft servers to build a search-enabled application.
On a recent trip to Europe, I learned that Exalead is working to make it easy for a licensee to process content from an organization’s servers as well as certain Internet content. Exalead is an interesting company, and I want to dig into its technical innovations. If I unearth some useful information, I will post the highlights. In the meantime, you can get a feel for the company’s engineering from its Web search and retrieval system. The company has indexed eight to nine billion Web pages. You can find the service here.
Stephen Arnold, October 25, 2008
Twine’s Semantic Spin on Bookmarks
October 25, 2008
Twine is a company committed to semantic technology. Semantics can be difficult to define. I keep it simple and suggest that semantic technology allows software to understand the meaning of a document. Semantic technology finds a home inside of many commercial search and content processing systems. Users, however, don’t tinker with the semantic plumbing. Users take advantage of assisted navigation, search suggestions, or a system’s ability to take a single word query and automatically hook the term to a concept or make a human-type connection without a human having to do the brain work.
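As a rough illustration of what "hooking a term to a concept" can mean, here is a toy sketch of dictionary-based query expansion. The concept map is invented for the example; real systems, Twine's included, rely on ontologies or statistical models rather than a hand-built dictionary.

```python
# Invented concept map for illustration only; real semantic systems use
# ontologies or statistics rather than a hard-coded dictionary.
CONCEPTS = {
    "jaguar": ["jaguar", "big cat", "panthera onca"],
    "twine": ["twine", "bookmarking", "semantic web"],
}

def expand_query(term):
    """Hook a single-word query to its related concept terms."""
    return CONCEPTS.get(term.lower(), [term])

print(expand_query("Twine"))   # ['twine', 'bookmarking', 'semantic web']
```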
Twine, according to the prestigious MIT publication Technology Review, is breaking new ground. Erica Naone's article "Untangling Web Information: The Semantic Web Organizer Twine Offers Bookmarking with Built In AI" stops just short of a brass-band endorsement but makes Twine's new service look quite good. You must read the two-part article here. For me, the most significant comment was:
But Jim Hendler, a professor of computer science at Rensselaer Polytechnic Institute and a member of Twine’s advisory board, says that Semantic Web technologies can set Twine apart from other social-networking sites. This could be true, so long as users learn to take advantage of those technologies by paying attention to recommendations and following the threads that Twine offers them. Users could easily miss this, however, by simply throwing bookmarks into Twine without getting involved in public twines or connecting to other users.
Radar Networks developed Twine. The metaphor of twine reminds me of the trouble I precipitated when I tangled my father's ball of hairy, fibrous string. My hunch is that others will think of twine as tying things together.
You will want to look at the Twine service here. Be sure to compare it to the new Microsoft service U Rank. The functions of Twine and U Rank are different, yet both struck me as strongly committed to sharing and saving Web information that is important to a user. Take a look at IBM's Dogear. This service has been around for almost a year, yet it is almost unknown. Dogear's purpose is to give social bookmarking more oomph for the enterprise. You can try this service here.
As I explored the Twine service and refreshed my memory of U Rank and Dogear, several thoughts occurred to me:
- Exposing semantic technology in new services is a positive development. More automatic functions can be significant time savers. A careless user, however, could lose sight of what's happening, shift into cruise-control mode, and stop thinking critically about who recommends what and where the information comes from.
- Semantic technology may be more useful in the plumbing. As search-enabled applications supplant keyword search, putting too much semantic functionality in front of a user could baffle some people. Google has stuck with its 1950s white-refrigerator interface because it works. The Google semantic technology hums along out of sight.
- The new semantic services, regardless of the vendor developing them, have not convinced me that they can generate enough cash to stay alive. The Radar Networks and Microsofts of the world will have to do more than provide services that are almost impossible to monetize. IBM's approach is to think about the enterprise, which may be a better revenue bet.
I am enthusiastic about semantic technology. User facing applications are in their early days. More innovation will be coming.
Stephen Arnold, October 25, 2008
SurfRay Round Up
October 24, 2008
SurfRay and its products have triggered a large number of comments on this Web log. On my recent six day trip to Europe, I was fortunate to be in a position to talk with people who knew about the company’s products. I also toted my Danish language financial statements along, and I was able to find some people to walk me through the financials. Finally, I sat down and read the dozens of postings that have accumulated about this company.
I visited the company on a trip to Copenhagen five or six years ago. I wrote some profiles about the market for SharePoint-centric search, sent bills, got paid, and then drifted away from the company. I liked the Mondosoft folks, but I live in rural Kentucky. One of my friends owned a company which ended up in the SurfRay portfolio. I lost track of that product. I recall learning that SurfRay gobbled up an outfit called Ontolica. My recollection was that, like Interse and other SharePoint-centric content processing companies' technology, Ontolica put SharePoint on life support. What this means is that some of SharePoint's functions work, but not too well. Third-party vendors pay Microsoft to certify one or more engineers in the SharePoint magic. Then those "certified" companies can sell products to SharePoint customers. If Microsoft likes the technology, a Microsoft engineer may facilitate a deal for a "certified" vendor. I am hazy on the ways in which the Microsoft certification program works, but I have ample data from interviews I have conducted that "certification" yields sales.
An Ontolica results list.
Why is this important? It's background for the points I want to set forth as "believed to be accurate" so the SurfRay folks can comment, correct, clarify, and inform me about what the heck is going on at SurfRay. Here are the points about which comments are in bounds.
Silobreaker: Two New Services Coming
October 24, 2008
I rarely come across real news. In London, England, last week I uncovered some information about Silobreaker‘s new services. I have written about Silobreaker before here and interviewed one of the company’s founders, Mats Bjore here. In the course of my chatting with some of the people I know in London, I garnered two useful pieces of intelligence. Keep in mind that the actual details of these forthcoming services may vary, but I am 99% certain that Silobreaker will introduce:
Contextualized Ad Retrieval in Silobreaker.com.
The idea is that Silobreaker's "smart software," called a "contextualization engine," will be applied to advertising. The method understands concepts and topics, not just keywords. I expect to see Silobreaker offering this system to licensees and partners. What's the implication of this technology? Obviously, for licensees, the system makes it possible to deliver context-based ads. Another use is for a governmental organization to blend a pool of content with a stream of news. In effect, when certain events occur in a news or content stream, an appropriate message or reminder can be displayed for the user. I can think of numerous police and intelligence applications for this blend of static and dynamic content in operational situations.
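Here is a minimal sketch of that static-plus-stream idea: when an item in a news stream matches a watched topic, a stored message is surfaced. The watchlist, the matching rule, and the sample stream are my own illustrative assumptions, not Silobreaker's contextualization engine.

```python
# Hypothetical watchlist pairing topics with stored ("static") messages;
# a generic illustration of event-triggered content, not Silobreaker's system.
WATCHLIST = {
    "port closure": "Remind duty officer to review the shipping contingency plan.",
    "pipeline": "Display the static briefing on energy infrastructure.",
}

def triggered_messages(news_item):
    """Return stored messages whose topic appears in the incoming item."""
    text = news_item.lower()
    return [msg for topic, msg in WATCHLIST.items() if topic in text]

stream = [
    "Severe weather forces port closure in Rotterdam",
    "Local election results announced",
]
for item in stream:
    for message in triggered_messages(item):
        print(item, "->", message)
```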
Enterprise Media Monitoring & Analysis Service
The other new service I learned about is a fully customizable online service that delivers a simple and effective way for enterprise customers to handle the entire workflow around their media monitoring and analysis needs. While today's media monitoring and news clipping efforts remain resource intensive, Silobreaker Enterprise will be a subscription-based service that automates much of the heavy lifting that either internal or external analysts must perform by hand. The Silobreaker approach is to blend disparate yet related information in a single, intuitive user interface; blending is a key concept in the Silobreaker technical approach. Enterprise customers will be able to define monitoring targets, trigger content aggregation, perform analyses, and display results in a customized Web service. A single mouse click allows a user to generate a report or receive an auto-generated PDF report in response to an event of interest. Silobreaker has also teamed up with a partner company to add sentiment analysis to its already comprehensive suite of analytics. The service is currently in its final testing phase with large multinational corporate test users and is due to be released at the end of 2008 or in early 2009.
Silobreaker is a leader in search enabled intelligence applications. Check out the company at www.silobreaker.com. A happy quack to the reader who tipped me on these Silobreaker developments.
Stephen Arnold, October 23, 2008
Able2Act: Serious Information, Seriously Good Intelligence
October 23, 2008
Remember Silobreaker? The free online aggregator provides current events news through a contextual search engine. One of its owners is Infosphere, an intelligence and knowledge strategy consulting business. Infosphere also offers a content repository called able2act.com. able2act delivers structured information in modules. For example, there are more than 55,000 detailed biographies, 200,000-plus contacts in business and politics, company snapshots, and analyst notebook files, among others. Modules cover topics like the Middle East, global terrorism, and information warfare. Most of the data, files, and reports are copyrighted by Infosphere; a small part of the information is in the public domain. Analysts update able2act to the tune of 2,000 records a week. You access able2act by direct XML/RSS feed, the Web site, or a feed into your in-house systems. The database search can be narrowed by making module searches, such as searching keywords only in the "tribes" module. We were able to look up the poorly reported movements of the Gandapur tribe in Afghanistan. A visual demonstration is available online here; we found it quite good. able2act is available by subscription. The price for a government agency to get full access to all modules starts at $70,000 a year. Only certain modules are available to individual subscribers. You can get more details by writing to opcenter at infosphere.se.
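For readers who want to picture the XML/RSS route and a module-restricted keyword search, here is a stdlib-only Python sketch. The feed structure, the "module" element, and the sample items are assumptions made for illustration; able2act's actual schema may differ.

```python
import xml.etree.ElementTree as ET

# Invented feed snippet; able2act's real schema and fields may differ.
FEED = """<rss><channel>
  <item><title>Gandapur tribe movement reported</title><module>tribes</module></item>
  <item><title>New refinery contract signed</title><module>business</module></item>
</channel></rss>"""

def module_search(xml_text, module, keyword):
    """Return item titles in one module whose title contains the keyword."""
    root = ET.fromstring(xml_text)
    return [
        item.findtext("title")
        for item in root.iter("item")
        if item.findtext("module") == module
        and keyword.lower() in (item.findtext("title") or "").lower()
    ]

print(module_search(FEED, module="tribes", keyword="gandapur"))
```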
Stephen Arnold, October 23, 2008
Google: A Powerful Mental Eraser
October 23, 2008
Earlier today I learned that a person who listened to my 20 minute talk at a small conference in London, England, heard one thing only–Google. I won’t mention the name of this person, who has an advanced degree and is sufficiently motivated to attend a technical conference.
What amazed me were these points:
- The attendee thought I was selling Google's eDiscovery services
- The attendee believed I did not explain that organizations require predictive services, not historical search services
- The attendee asserted that I failed to mention other products in my talk.
I looked at the PowerPoint deck I used to check my memory. At age 64, I have a tough time remembering where I parked my car. Here’s what I learned from my slide deck.
Mention Google and some people in the audience lose the ability to listen and “erase” any recollection of other companies mentioned or any suggestion that Google is not flawless. Source: http://i265.photobucket.com/albums/ii215/Katieluvr01/eraser-2.jpg.
First, I began with a chart created by an SAS Institute professional. I told the audience the source of the chart and pointed out the bright red portion of the chart. This segment of the chart identifies the emergence of the predictive analytics era. Yep, that’s the era we are now entering.
Second, I reviewed the excellent search enabled eDiscovery system from Clearwell Systems. I showed six screen shots of the service and its outputs. I pointed out that attorneys pay big sums for the Clearwell System because it creates an audit trail so queries can be rerun at any time. It generates an email thread so an attorney can see who wrote to whom, when, and what was said. It creates outputs that can be submitted to a court without requiring a human to rekey data. In short, I gave Clearwell a grade of "A" and urged the audience to look at this system for competitive intelligence, not just eDiscovery. Oh, I pointed out that email comprises a larger percentage of content in eDiscovery than it has in previous years.
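Thread reconstruction is easy to picture with a small example. The sketch below groups messages by normalized subject line; it is a generic illustration, not Clearwell's algorithm, and production systems also lean on Message-ID and In-Reply-To headers.

```python
from collections import defaultdict

def normalize_subject(subject):
    """Strip reply/forward prefixes so replies land in the original thread."""
    s = subject.strip()
    while s.lower().startswith(("re:", "fw:", "fwd:")):
        s = s.split(":", 1)[1].strip()
    return s.lower()

def build_threads(messages):
    """Group messages into threads by normalized subject, oldest first."""
    threads = defaultdict(list)
    for msg in sorted(messages, key=lambda m: m["date"]):
        threads[normalize_subject(msg["subject"])].append(msg)
    return threads

# Invented sample messages for illustration only.
mail = [
    {"date": "2008-10-01", "from": "a@corp.com", "subject": "Q3 numbers"},
    {"date": "2008-10-02", "from": "b@corp.com", "subject": "Re: Q3 numbers"},
]
for subject, msgs in build_threads(mail).items():
    print(subject, "->", [(m["date"], m["from"]) for m in msgs])
```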
The Future of Database
October 21, 2008
Every five years, some database gurus rendezvous to share ideas about the future of databases. This is a select group of wizards. This year the attendees included Eric Brewer (one of the founders of Inktomi), AnHai Doan (University of Wisconsin), and Michael Stonebraker (formerly CTO of Informix), among others. You can see the list of attendees here.
At this get together, the attendees give short talks, and then the group prepares a report. The report was available on October 19, 2008, at this link. The document is important, and it contains several points that I found suggestive. Let me highlight four and urge you to get the document and draw your own conclusions:
- Database and data management are at a turning point. Among the drivers are changes in architecture like cloud computing and the need to deal with large amounts of data.
- Database will be looking outside its traditional boundaries. One example is Google's MapReduce (a toy sketch of the idea follows this list).
- Data collections, not databases, are increasingly important.
- The cloud, mobile, and virtual applications are game changers.
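For readers who have not bumped into MapReduce, here is a toy, single-machine sketch of the map and reduce phases applied to word counting. It illustrates only the programming model, not Google's distributed implementation.

```python
from collections import defaultdict

def map_phase(text):
    """The 'map' step: emit (word, 1) pairs for each word in a document."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_phase(pairs):
    """The 'reduce' step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["data cloud data", "cloud computing"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))   # {'data': 2, 'cloud': 2, 'computing': 1}
```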
Summarizing an 11-page document in four dot points does not convey the substance of this report. The narrative and footnotes provide significant insight into the shift that is coming in database. Among those likely to be squeezed by this change are such vendors as IBM, Microsoft, and Oracle. And what company’s shadow fell across this conference? The GOOG. The report’s reference to MapReduce was neither a rhetorical flourish nor accidental.
Stephen Arnold, October 20, 2008
Recommind: Grabs Legal Hold
October 20, 2008
Recommind released its Insite Legal Hold solution today. This product bridges the gap between enterprise search and analytics.
Recommind's Craig Carpenter states that Insite maps well to the company's current customer base of financial and professional services firms: heavily regulated, knowledge-intensive organizations that are subject to mass litigation.
The release of this product during these financially strained times is viewed as a growth opportunity, backed by a recent infusion of $7.5 million in private-equity funding.
So what makes Insite Legal Hold worth an investment in your company? First, it is an integrated solution – early risk assessment (ERA), preservation, hold/collection, and processing. Second, you can reduce your litigation-related costs and risks to some degree. Third, you can collect only what is needed and leave the rest to current company retention policy. Finally, you can proactively address retention and spoliation risks; that is, the risk of an email being altered.
Perhaps the most intriguing part of this product is the automated updates to current holds, though Mr. Carpenter said that in response to customer feedback, Recommind also included less sexy but still important features including filtering, deduping, near-duping, and e-mail-thread processing.
A few other benefits of Insite Legal Hold include:
- Collective selection based upon keyword, Boolean, and concept matching. This collective selection is more defensible than previous legal hold releases because the applied intelligence normalizes for related concepts and produces documents that yield more relevant data, above and beyond the reasonableness required by the Federal Rules of Civil Procedure.
- Explore in Place Technology allows the indexing and return of lightweight results into HTML for a sampling review, which can then be used to apply concept searches and more to the fuller data sets.
- Multi-platform flexibility: allows enterprises with a legacy review platform to enhance data analytics yet still use their current system for production.
- Built-in processing: filter, dedupe, near-dupe, and thread documents, thereby saving 70-80% of processing and review costs (a generic sketch of dedupe and near-dupe matching follows this list).
- Manages multiple holds.
- Reduces IT costs by providing a forensically sound copy of data perceived as relevant and holding it in a separate data store.
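To make the dedupe and near-dupe point concrete, here is a generic sketch that uses exact hashing for duplicates and word-shingle overlap (Jaccard similarity) for near duplicates. It is an illustration of the technique, not Recommind's implementation.

```python
import hashlib

def fingerprint(text):
    """Exact-dedupe check: identical normalized text hashes to the same value."""
    return hashlib.sha1(" ".join(text.lower().split()).encode()).hexdigest()

def shingles(text, k=3):
    """Overlapping k-word shingles used for near-duplicate comparison."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def near_duplicate(a, b, threshold=0.8):
    """Near-dupe check via Jaccard similarity of word shingles."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return False
    return len(sa & sb) / len(sa | sb) >= threshold

doc1 = "The merger agreement was signed on Friday by both parties"
doc2 = "The merger agreement was signed on Friday by both parties yesterday"
print(fingerprint(doc1) == fingerprint(doc2))   # False: not exact duplicates
print(near_duplicate(doc1, doc2))               # True: near duplicates
```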
When asked about pricing, Mr. Carpenter provided an overview of Recommind's three-tiered licensing model.
- Annual license fee based upon the number of custodians
- System sold outright to customers with existing infrastructures
- À la carte pricing for customers who don't have a huge litigation load but need to manage one or two cases per year.
Insite Legal Hold has huge potential to reduce the costs and risks involved in e-discovery endeavors. The pain points of high costs at the collection and review stages make the automation of hold updates and concept and near-concept based selection an attractive solution.
Recommind's investment of private equity funds to get the word out about its solution at a time when more potential customers are struggling with the fallout from a global financial crisis bodes well for the profit stream of this company. What is apparent with this solution is that the developers are starting to pay attention to the less-sexy parts of e-discovery work and spending time and money to provide solutions that help reduce costs at the collection and production stages of the e-discovery cycle.
Constance Ard, Answer Maven for Beyond Search, October 20, 2008
Google: Building Its Knowledge Base a Book at a Time
October 16, 2008
Google does not seem to want to create a Kindle or Sony eBook. "Why does the firm want to scan and index books?" I ask myself. My research suggests that Google is adding to its knowledge base. Books have information, and Google finds that information useful for its various processes. Google's book search and its sale of books are important, but if my information is correct, Google is getting brain food for its smart software. The company has deals in place that increase the number of publishers participating in its book project. Reuters' "Google Doubles Book Scan Publisher Partners" provides a rundown on how many books Google processes and the number of publishers now participating. The numbers are somewhat fuzzy, but you can read the full text of the story here and judge for yourself. Google's been involved in legal hassles over its book project for several years. The fifth anniversary of these legal squabbles will be fast upon us. Nary a word in the Reuters story about Google's knowledge base. Once again the addled goose is the only bird circling this angle. What do you think Google's doing with a million or more books in 100 languages? Let me know.
Stephen Arnold, October 16, 2008
Searching Google Patent Documents with ISYS Version 9
October 13, 2008
After my two lectures at the Enterprise Search Summit in San Jose, California, in mid-September 2008, I had two people write me about my method for figuring out Google patent documents. Please appreciate that I can't reveal the tools my team has developed. These are my secret sauce, but I can describe the broad approach and provide some detail about what Constance, Don, Stuart, and Tony do when I have to cut through the "fog of transparency" and lava lamp light emanating from Google.
Background
Google generates a large volume of technical information and comparatively modest amounts of patent-related documents. The starting point, therefore, is a fact that catches my attention. One client sent two people to “watch” me investigate a technical topic. After five days of taking notes, snapping digital photos, and reviewing the information that I have flowing into my Harrod’s Creek, Kentucky, offices, the pair gave up. The procedure was easily flow charted, but the identification of an important and interesting item was a consequence of effort and years of grunting through technical material. Knowing what to research, it seems, is a matter of experience, judgment, and instinct.
The two “watchers” looked at the dozens of search, text mining, and content utilities I had on my machines. The two even fiddled with the systems’ ability to pattern match using n-gram technology, entity extraction using 12-year-old methods that some companies still find cutting edge, and various search systems from companies still in business as well as those long since bought out or simply shut down.
Here’s the big picture:
- Spider and collect information via various push methods. The data may be in XML, PDF, or other formats. The key point is that everything I process is open source. This means that I rely on search engines, university Web sites, government agencies with search systems that are prone to time outs, and postings on Web logs. I use exactly the same data that you can use when you run a query on any of the more than 500 systems listed here. This list is one of the keys to our work because none of the well known search systems index "everything". The popular search engines don't even come close. In fact, most don't go more than two or three links deep for certain Web sites. Do some exploring on the US Department of Energy Web site, and you will see what I mean. The key is to run the query across multiple systems and filter out duplicates. Software and humans do this work, just as humans process information at national security operations in many countries. (If you read my Web log, you will know that I have a close familiarity with systems developed by former intelligence professionals.)
- Take the filtered subset and process it with a search engine. The bulk of this Web log post describes the ISYS Search Software system. We have been using this system for several years, and we find that it is a quick indexer, so we can process new information quickly.
- Subset analysis. Once we have a cut from the content we are processing, we move the subset into our proprietary tools. One of these tools runs stored queries, or what some people call saved searches, against the subset looking for specific people and things. My team looks at these outputs. (A minimal, generic sketch of the dedupe and stored-query steps appears after this list.)
- I review the winnowed subset, and, as time allows, I involve myself in the preceding steps. Once the subset is on my machine, I have to do what anyone reviewing patents and technical documents must do. I read these materials. No, I don't like to do it, but I have found that consistently doing the dog work that most people prefer to dismiss as irrelevant is what makes it possible for me to "connect the dots".
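Here is a minimal, generic sketch of the "filter out duplicates, then run stored queries against the subset" steps described above. The hit lists, URLs, and query terms are invented for illustration; my actual tools are more involved and remain proprietary.

```python
# Invented hit lists keyed by URL; real hit lists come from many engines.
hits_engine_a = [{"url": "http://example.gov/patent1", "title": "Cooling rack design"}]
hits_engine_b = [
    {"url": "http://example.gov/patent1", "title": "Cooling rack design"},
    {"url": "http://example.edu/paper2", "title": "Data center airflow study"},
]

def merge_dedupe(*hit_lists):
    """Merge hit lists from several systems and drop duplicate URLs."""
    seen, merged = set(), []
    for hits in hit_lists:
        for hit in hits:
            if hit["url"] not in seen:
                seen.add(hit["url"])
                merged.append(hit)
    return merged

# Stored queries ("saved searches") run against the deduplicated subset.
STORED_QUERIES = {"cooling hardware": ["cooling", "rack"], "airflow": ["airflow"]}

subset = merge_dedupe(hits_engine_a, hits_engine_b)
for name, terms in STORED_QUERIES.items():
    matches = [h["title"] for h in subset if any(t in h["title"].lower() for t in terms)]
    print(name, "->", matches)
```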
Searching
There’s not much to say about running queries and collecting information that comes via RSS or other push technologies. We get “stuff” from open sources, and we filter out the spam, duplicates, and uninteresting material. Let’s assume that we have information regarding new Google patent documents. We get this information pushed to us, and these are easy to flag. You can navigate to the USPTO Web site and see what we get. You can pay commercial services to send you alerts when new Google documents are filed or published. You can poke around on the Web and find a number of free patent services. If you want to use Google to track Google, then you can use Google’s own patent service. I don’t find it particularly helpful, but Google may improve it at some point in the future. Right now, it’s on my list, but it’s like a dull but well meaning student. I let the student attend my lectures, but I don’t pay much attention to the outputs. If you want some basic information about patent documents, click here.
Narrowed result set for a Google hardware invention related to cooling. This is an image generated using ISYS Version 9, which is now available.
Before Running Queries
You can't search patent documents and technical materials shooting from the hip. When I look for information about Google or Microsoft, for instance, I have to get smart with regard to terminology. Let me illustrate. If you want to find out how Microsoft is building data centers to compete with Google, you will get zero useful information with this type of query on any system: Microsoft AND "data centers". My actual queries are more complex and use nesting, but this test query is one you can use on Microsoft's Live.com search. Now run the same query for "Microsoft Monsoon". You will see what you need to know here. If you don't know the code word "Monsoon", you will never find the information. It's that simple.
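To illustrate the terminology point, here is a small sketch that fans one information need out into several query variants, including a known code word, and merges the deduplicated hits. The placeholder engine and canned results are assumptions for the example, not my toolkit or any vendor's API.

```python
# Placeholder "engine" with canned results; in practice each variant goes to a
# real search system and the hit lists are merged and deduplicated.
def run_query(query):
    canned = {
        'Microsoft AND "data centers"': [],
        "Microsoft Monsoon": ["http://example.com/monsoon-container-data-center"],
    }
    return canned.get(query, [])

# Fan one information need out into variants; the code-word variant is often
# the only one that returns anything useful.
variants = ['Microsoft AND "data centers"', "Microsoft Monsoon"]
results = {url for q in variants for url in run_query(q)}
print(sorted(results))
```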