Silobreaker: Two New Services Coming

October 24, 2008

I rarely come across real news. In London, England, last week I uncovered some information about Silobreaker's new services. I have written about Silobreaker before here and interviewed one of the company's founders, Mats Bjore, here. In the course of chatting with some of the people I know in London, I garnered two useful pieces of intelligence. Keep in mind that the actual details of these forthcoming services may vary, but I am 99% certain that Silobreaker will introduce:

Contextualized Ad Retrieval in Silobreaker.com.

The idea is that Silobreaker’s “smart software” called a “contextualization engine” will be applied to advertising. The method understands concepts and topics, not just keywords. I expect to see Silobreaker offering this system to licensees and partners. What’s the implication of this technology? Obviously, for licensees, the system makes it possible to deliver context-based ads. Another use is for a governmental organization to blend a pool of content with a stream of news. In effect, when certain events occur in a news or content stream, an appropriate message or reminder can be displayed for the user. I can think of numerous police and intelligence applications for this blend of static and dynamic content in operational situations.
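
To make the blending idea concrete, here is a minimal sketch in Python of matching a static message pool against a stream of news items by concept overlap. This is strictly my illustration, not Silobreaker's contextualization engine; the keyword-to-concept lookup is a toy stand-in for real concept extraction, and every name in it is made up.

```python
# Toy sketch of blending a static message pool with a news stream.
# NOT Silobreaker's contextualization engine; the keyword lookup below is
# a placeholder for real concept extraction.

STATIC_MESSAGES = [
    {"text": "Reminder: report suspicious shipments to the duty officer.",
     "concepts": {"smuggling", "port security"}},
    {"text": "Advisory: heightened alert for financial fraud.",
     "concepts": {"fraud", "banking"}},
]

# Toy stand-in for a concept extractor: trigger words mapped to concepts.
KEYWORD_TO_CONCEPT = {
    "container": "port security",
    "customs": "smuggling",
    "phishing": "fraud",
    "bank": "banking",
}

def tag_concepts(text):
    """Assign concepts to a news item (toy keyword method)."""
    words = text.lower().split()
    return {KEYWORD_TO_CONCEPT[w] for w in words if w in KEYWORD_TO_CONCEPT}

def messages_for(news_item):
    """Return any static messages whose concepts overlap the news item's."""
    item_concepts = tag_concepts(news_item)
    return [m["text"] for m in STATIC_MESSAGES if m["concepts"] & item_concepts]

if __name__ == "__main__":
    stream = [
        "Customs officers inspect a container at the harbor",
        "Bank warns customers about a new phishing scheme",
    ]
    for item in stream:
        for msg in messages_for(item):
            print(f"{item!r} -> {msg}")
```

The point of the sketch is the design, not the code: the static pool sits still, the stream flows past it, and a concept match is what triggers the display of the canned message.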

Enterprise Media Monitoring & Analysis Service

The other new service I learned about is a fully customizable online service that delivers a simple and effective way for enterprise customers to handle the entire work flow around their media monitoring and analysis needs. Today's media monitoring and news clipping efforts remain resource intensive; Silobreaker Enterprise will be a subscription-based service that automates much of the heavy lifting that internal or external analysts must now perform by hand. The Silobreaker approach is to blend disparate yet related information, a key concept in the company's technical approach, in a single intuitive user interface. Enterprise customers will be able to define monitoring targets, trigger content aggregation, perform analyses, and display results in a customized Web service. A single mouse click allows a user to generate a report or receive an auto-generated PDF report in response to an event of interest. Silobreaker has also teamed up with a partner company to add sentiment analysis to its already comprehensive suite of analytics. The service is currently in final testing with large multinational corporate test users and is due to be released at the end of 2008 or in early 2009.

Silobreaker is a leader in search enabled intelligence applications. Check out the company at www.silobreaker.com. A happy quack to the reader who tipped me on these Silobreaker developments.

Stephen Arnold, October 23, 2008

Able2Act: Serious Information, Seriously Good Intelligence

October 23, 2008

Remember Silobreaker? The free online aggregator provides current events news through a contextual search engine. One of its owners is Infosphere, an intelligence and knowledge strategy consulting business. Infosphere also offers a content repository called able2act.com. able2act delivers structured information in modules. For example, there are more than 55,000 detailed biographies, 200,000-plus contacts in business and politics, company snapshots, and analyst notebook files, among others. Modules cover topics like the Middle East, global terrorism, and information warfare. Most of the data, files, and reports are copyrighted by Infosphere; a small part of the information is in the public domain. Analysts update able2act to the tune of 2,000 records a week. You can access able2act by direct XML/RSS feed, via the Web site, or as a feed into your in-house systems. Searches can be narrowed to individual modules; for example, a keyword search can be limited to the "tribes" module. We were able to look up the poorly reported movements of the Gandapur tribe in Afghanistan. A visual demonstration is available online here; we found it quite good. able2act is available by subscription. The price for a government agency to get full access to all modules starts at $70,000 a year. Only certain modules are available to individual subscribers. You can get more details by writing to opcenter at infosphere.se.
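
For readers who want to picture what a module-scoped search amounts to, here is a minimal sketch. The record layout and module names are my own assumptions for illustration; they are not able2act's actual schema or API.

```python
# Minimal sketch of narrowing a keyword search to a single module.
# The record structure and module names are hypothetical, not able2act's.

RECORDS = [
    {"module": "tribes", "title": "Gandapur tribe movements",
     "body": "Reported relocation across the border region."},
    {"module": "biographies", "title": "Sample biography",
     "body": "Career summary of a political figure."},
]

def module_search(records, module, keyword):
    """Return records in one module whose title or body contains the keyword."""
    keyword = keyword.lower()
    return [r for r in records
            if r["module"] == module
            and (keyword in r["title"].lower() or keyword in r["body"].lower())]

if __name__ == "__main__":
    for hit in module_search(RECORDS, "tribes", "gandapur"):
        print(hit["title"])
```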

Stephen Arnold, October 23, 2008

Google: A Powerful Mental Eraser

October 23, 2008

Earlier today I learned that a person who listened to my 20-minute talk at a small conference in London, England, heard one thing only: Google. I won't mention the name of this person, who has an advanced degree and is sufficiently motivated to attend a technical conference.

What amazed me were these points:

  1. The attendee thought I was selling Google's eDiscovery services.
  2. The attendee believed I did not explain that organizations require predictive services, not historical search services.
  3. The attendee asserted that I failed to mention other products in my talk.

I looked at the PowerPoint deck I used to check my memory. At age 64, I have a tough time remembering where I parked my car. Here’s what I learned from my slide deck.


Mention Google and some people in the audience lose the ability to listen and “erase” any recollection of other companies mentioned or any suggestion that Google is not flawless. Source: http://i265.photobucket.com/albums/ii215/Katieluvr01/eraser-2.jpg.

First, I began with a chart created by an SAS Institute professional. I told the audience the source of the chart and pointed out the bright red portion of the chart. This segment of the chart identifies the emergence of the predictive analytics era. Yep, that’s the era we are now entering.

Second, I reviewed the excellent search enabled eDiscovery system from Clearwell Systems. I showed six screen shots of the service and its outputs. I pointed out that attorneys pay big sums for the Clearwell system because it creates an audit trail so queries can be rerun at any time. It generates an email thread view so an attorney can see who wrote to whom, when, and what was said. It creates outputs that can be submitted to a court without requiring a human to rekey data. In short, I gave Clearwell a grade of "A" and urged the audience to look at this system for competitive intelligence, not just eDiscovery. Oh, and I pointed out that email comprises a larger percentage of content in eDiscovery than it did in previous years.
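
Email threading of the sort described above is usually reconstructed from standard message headers. Here is a minimal sketch, assuming the messages are available as raw RFC 2822 text; it shows the general technique, not Clearwell's implementation.

```python
# Minimal sketch of grouping emails into threads using Message-ID and
# In-Reply-To headers. Illustrates the general technique only; this is
# not Clearwell's implementation.
from email import message_from_string
from collections import defaultdict

def build_threads(raw_messages):
    """Group raw RFC 2822 messages into threads keyed by the root Message-ID."""
    parent = {}    # message-id -> id of the message it replies to
    headers = {}   # message-id -> (From, Date, Subject)
    for raw in raw_messages:
        msg = message_from_string(raw)
        mid = msg.get("Message-ID")
        parent[mid] = msg.get("In-Reply-To")
        headers[mid] = (msg.get("From"), msg.get("Date"), msg.get("Subject"))

    def root_of(mid):
        # Walk up the reply chain until no known parent remains.
        while parent.get(mid) in parent:
            mid = parent[mid]
        return mid

    threads = defaultdict(list)
    for mid in headers:
        threads[root_of(mid)].append(headers[mid])
    return threads
```

In practice a threading engine also falls back on the References header and on normalized subject lines when Message-IDs are missing, but the header-chain idea is the core of it.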


The Future of Database

October 21, 2008

Every five years, some database gurus rendezvous to share ideas about the future of databases. This is a select group of wizards. This year the attendees included Eric Brewer (one of the founders of Inktomi), AnHai Doan (University of Wisconsin), and Michael Stonebraker (formerly CTO of Informix), among others. You can see the list of attendees here.

At this get together, the attendees give short talks, and then the group prepares a report. The report was available on October 19, 2008, at this link. The document is important, and it contains several points that I found suggestive. Let me highlight four and urge you to get the document and draw your own conclusions:

  • Database and data management are at a turning point. Among the drivers are changes in architecture, like cloud computing, and the need to deal with large amounts of data.
  • Database will be looking outside its traditional boundaries. One example is Google’s MapReduce.
  • Data collections, not databases, are increasingly important.
  • The cloud, mobile, and virtual applications are game changers.

Summarizing an 11-page document in four dot points does not convey the substance of this report. The narrative and footnotes provide significant insight into the shift that is coming in database technology. Among those likely to be squeezed by this change are such vendors as IBM, Microsoft, and Oracle. And what company's shadow fell across this conference? The GOOG. The report's reference to MapReduce was neither a rhetorical flourish nor accidental.
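
For readers who have not looked at MapReduce, a toy word-count example shows why the model sits outside the traditional relational toolkit: the programmer writes only a map function and a reduce function, and the framework handles distribution. The single-process sketch below illustrates the programming model only; it is not Google's implementation.

```python
# Toy, single-process illustration of the MapReduce programming model.
# A real framework distributes the map calls, the shuffle, and the reduce
# calls across many machines.
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    """Emit (word, 1) pairs for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Sum the counts emitted for one word."""
    return (word, sum(counts))

def mapreduce(documents):
    # Map phase
    pairs = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle phase: group intermediate pairs by key
    pairs.sort(key=itemgetter(0))
    # Reduce phase
    return [reduce_fn(key, (count for _, count in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

if __name__ == "__main__":
    docs = ["the cloud changes the database game",
            "the database game changes"]
    print(mapreduce(docs))
```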

Stephen Arnold, October 20, 2008

Recommind: Grabs Legal Hold

October 20, 2008

Recommind released its Insite Legal Hold solution today. This product bridges the gap between enterprise search and analytics.

Recommind's Craig Carpenter states that Insite maps well to the company's current customer base of financial and professional services firms: heavily regulated, knowledge-intensive organizations that are subject to mass litigation.

The release of this product during these financially strained times is viewed as a growth opportunity, backed by a recent infusion of $7.5 million in private-equity funding.

So what makes Insite Legal Hold worth an investment for your company? First, it is an integrated solution: early risk assessment (ERA), preservation, hold/collection, and processing. Second, you can reduce your litigation-related costs and risks to some degree. Third, you can collect only what is needed and leave the rest to the current company retention policy. Finally, you can proactively address retention and spoliation risks; that is, the risk that an email is altered or destroyed.

Perhaps the most intriguing part of this product is the automated updates to current holds, though Mr. Carpenter said that in response to customer feedback, Recommind also included less sexy but still important features including filtering, deduping, near-duping, and e-mail-thread processing.

A few other benefits of Insite Legal Hold include:

  • Collection selection based upon keyword, Boolean, and concept matching. This selection is more defensible than previous legal hold releases because the applied intelligence normalizes for related concepts and produces documents that yield more relevant data, above and beyond the reasonableness required by the Federal Rules of Civil Procedure.
  • Explore in Place technology allows indexing and the return of lightweight results as HTML for a sampling review, which can then be used to apply concept searches and more to the fuller data sets.
  • Multi-platform flexibility: allows enterprises with a legacy review platform to enhance data analytics yet still use their current system for production.
  • Built-in processing: filter, dedupe, near-dupe, and thread documents, thereby saving 70-80% of processing and review costs (a generic sketch of the dedupe step appears after this list).
  • Manages multiple holds.
  • Reduces IT costs by providing a forensically sound copy of perceived relevant data and holding it in a separate data store.
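
As promised in the list above, here is a generic sketch of what dedupe and near-dupe detection involve: exact duplicates caught with a hash of normalized text, near duplicates with word-shingle overlap. This is the textbook technique, not Recommind's implementation.

```python
# Generic sketch of exact and near-duplicate detection; not Recommind's
# implementation.
import hashlib

def normalized_hash(text):
    """Hash of lower-cased, whitespace-collapsed text for exact-duplicate checks."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def shingles(text, k=5):
    """Set of k-word shingles used for near-duplicate comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def near_duplicate(a, b, threshold=0.8):
    """True when the Jaccard overlap of the two shingle sets meets the threshold."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return False
    return len(sa & sb) / len(sa | sb) >= threshold
```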

When asked about pricing, Mr. Carpenter provided an overview of Recommind's three-tiered licensing model.

  1. An annual license fee based upon the number of custodians
  2. The system sold outright to customers with existing infrastructures
  3. A la carte pricing for customers who don't have a huge litigation load but need to manage one or two cases per year

Insite Legal Hold has huge potential to reduce the costs and risks involved in e-discovery endeavors. The pain points of high costs at the collection and review stages make the automation of hold updates and concept and near-concept based selection an attractive solution.

Recommind's investment of private-equity funds to get the word out about its solution at a time when many potential customers are struggling with the fall-out from a global financial crisis bodes well for the company's profit stream. What is apparent with this solution is that the developers are starting to pay attention to the less-sexy parts of e-discovery work and are spending time and money to provide features that help reduce costs at the collection and production stages of the e-discovery cycle.

Constance Ard, Answer Maven for Beyond Search, October 20, 2008

Google: Building Its Knowledge Base a Book at a Time

October 16, 2008

Google does not seem to want to create a Kindle or a Sony eBook reader. "Why does the firm want to scan and index books?" I ask myself. My research suggests that Google is adding to its knowledge base. Books have information, and Google finds that information useful for its various processes. Google's book search and its sale of books are important, but if my information is correct, Google is getting brain food for its smart software. The company has deals in place that increase the number of publishers participating in its book project. Reuters' "Google Doubles Book Scan Publisher Partners" provides a rundown on how many books Google processes and the number of publishers now participating. The numbers are somewhat fuzzy, but you can read the full text of the story here and judge for yourself. Google's been involved in legal hassles over its book project for several years. The fifth anniversary of these legal squabbles will be fast upon us. Nary a word in the Reuters story about Google's knowledge base. Once again the addled goose is the only bird circling this angle. What do you think Google's doing with a million or more books in 100 languages? Let me know.

Stephen Arnold, October 16, 2008

Searching Google Patent Documents with ISYS Version 9

October 13, 2008

After my two lectures at the Enterprise Search Summit in San Jose, California, in mid-September 2008, two people wrote me about my method for figuring out Google patent documents. Please appreciate that I can't reveal the tools that I use, which my team has developed. These are my secret sauce, but I can describe the broad approach and provide some detail about what Constance, Don, Stuart, and Tony do when I have to cut through the "fog of transparency" and lava lamp light emanating from Google.

Background

Google generates a large volume of technical information and a comparatively modest number of patent-related documents. The starting point, therefore, is a fact that catches my attention. One client sent two people to "watch" me investigate a technical topic. After five days of taking notes, snapping digital photos, and reviewing the information flowing into my Harrod's Creek, Kentucky, offices, the pair gave up. The procedure was easily flow charted, but the identification of an important and interesting item was a consequence of effort and years of grunting through technical material. Knowing what to research, it seems, is a matter of experience, judgment, and instinct.

The two “watchers” looked at the dozens of search, text mining, and content utilities I had on my machines. The two even fiddled with the systems’ ability to pattern match using n-gram technology, entity extraction using 12-year-old methods that some companies still find cutting edge, and various search systems from companies still in business as well as those long since bought out or simply shut down.

Here’s the big picture:

  1. Spider and collect information via various push methods. The data may be in XML, PDF, or other formats. The key point is that everything I process is open source. This means that I rely on search engines, university Web sites, government agencies with search systems that are prone to time outs, and postings of Web logs. I use exactly the same data that you can use when you run a query on any of the more than 500 systems listed here. This list is one of the keys to our work because none of the well known search systems index "everything". The popular search engines don't even come close. In fact, most don't go more than two or three links deep for certain Web sites. Do some exploring on the US Department of Energy Web site, and you will see what I mean. The key is to run the query across multiple systems and filter out duplicates. Software and humans do this work, just as humans process information at national security operations in many countries. (If you read my Web log, you will know that I have a close familiarity with systems developed by former intelligence professionals.)
  2. Take the filtered subset and process it with a search engine. The bulk of this Web log post describes the ISYS Search Software system. We have been using this system for several years, and we find that it is a quick indexer, so we can process new information quickly.
  3. Subset analysis. Once we have a cut from the content we are processing, we move the subset into our proprietary tools. One of these tools runs stored queries, or what some people call saved searches, against the subset looking for specific people and things; a minimal sketch of this step appears after this list. My team looks at these outputs.
  4. I review the winnowed subset and, as time allows, I involve myself in the preceding steps. Once the subset is on my machine, I have to do what anyone reviewing patents and technical documents must do. I read these materials. No, I don't like to do it, but I have found that consistently doing the dog work that most people prefer to dismiss as irrelevant is what makes it possible for me to "connect the dots".
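
The stored-query tool mentioned in item 3 is proprietary, so the sketch below only illustrates the general idea: run a set of saved patterns over a document subset and flag the hits for human review. The patterns and document fields are placeholders I made up for illustration; our production tools differ.

```python
# Minimal sketch of running saved searches (stored queries) over a document
# subset and flagging hits for review. Patterns and fields are illustrative.
import re

# Saved searches: a label mapped to a compiled pattern.
SAVED_QUERIES = {
    "data center cooling": re.compile(r"cooling.*data\s+cent(er|re)", re.I | re.S),
    "distributed file system": re.compile(r"distributed\s+file\s+system", re.I),
}

def flag_documents(documents):
    """Yield (document id, matched query labels) for documents hitting any saved query."""
    for doc in documents:
        hits = [label for label, pattern in SAVED_QUERIES.items()
                if pattern.search(doc["text"])]
        if hits:
            yield doc["id"], hits

if __name__ == "__main__":
    subset = [
        {"id": "doc-001", "text": "A system for cooling racks in a modular data center."},
        {"id": "doc-002", "text": "A method for ranking advertisements."},
    ]
    for doc_id, labels in flag_documents(subset):
        print(doc_id, "->", labels)
```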

Searching

There's not much to say about running queries and collecting information that comes via RSS or other push technologies. We get "stuff" from open sources, and we filter out the spam, duplicates, and uninteresting material. Let's assume that we have information regarding new Google patent documents. We get this information pushed to us, and these items are easy to flag. You can navigate to the USPTO Web site and see what we get. You can pay commercial services to send you alerts when new Google documents are filed or published. You can poke around on the Web and find a number of free patent services. If you want to use Google to track Google, then you can use Google's own patent service. I don't find it particularly helpful, but Google may improve it at some point in the future. Right now, it's on my list, but it's like a dull but well-meaning student: I let the student attend my lectures, but I don't pay much attention to the outputs. If you want some basic information about patent documents, click here.
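
If you want to rig a similar alert flow yourself, a minimal sketch using the feedparser library looks roughly like this. The feed URL is a placeholder; substitute whichever free or commercial alert feed you actually use.

```python
# Minimal sketch of pulling a patent-alert RSS feed and keeping only the
# entries that mention Google. The feed URL is a placeholder.
import feedparser

FEED_URL = "https://example.com/patent-alerts.rss"  # placeholder, not a real feed

def google_patent_entries(feed_url=FEED_URL):
    """Return (title, link) pairs for feed entries that mention Google."""
    feed = feedparser.parse(feed_url)
    return [(entry.title, entry.link)
            for entry in feed.entries
            if "google" in (entry.title + " " + entry.get("summary", "")).lower()]

if __name__ == "__main__":
    for title, link in google_patent_entries():
        print(title, link)
```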


Narrowed result set for a Google hardware invention related to cooling. This is an image generated using ISYS Version 9, which is now available.

Before Running Queries

You can't search patent documents and technical materials shooting from the hip. When I look for information about Google or Microsoft, for instance, I have to get smart with regard to terminology. Let me illustrate. If you want to find out how Microsoft is building data centers to compete with Google, you will get zero useful information with this type of query on any system: Microsoft AND "data centers". My actual queries are more complex and use nesting, but this test query is one you can use on Microsoft's Live.com search. Now run the same query for "Microsoft Monsoon". You will see what you need to know here. If you don't know the code word "Monsoon", you will never find the information. It's that simple.


Mark Logic and Basis Technology

October 13, 2008

Browsing the Basis Technology Web site revealed an October 7, 2008, news release about a Basis Technology and Mark Logic tie up. You can read the news release here or here. Basis Technology licenses text and content processing components and systems. The Basis Technology announcement says "Rosette Entity Extractor provides advanced search and text analytics for MarkLogic Server 4.0." Mark Logic, as I have noted elsewhere in this Web log, is one of the leading providers of XML server technology. The new version can store, manage, search, and deliver content in a variety of forms to individual users, other enterprise systems, or devices. REX (shorthand for Rosette Entity Extractor) can identify people, organizations, locations, numeric strings such as credit card numbers, email addresses, geographic data, and other items such as dates from unstructured or semi-structured content. I don't have details on the deal. My take is that Mark Logic wants to put its XML engine into gear and drive into market spaces not now well served by the applications and functions in other vendors' XML systems. Enterprise search is dead. Long live more sophisticated information and data management systems. Search will be tucked into these solutions, but it's no longer the focal point of the system. I am pondering the impact of this announcement on other XML vendors and on such companies as Microsoft Fast Search.
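
To show why entity extraction and an XML server are a natural pairing, here is a minimal sketch of the general pattern: run an extractor over plain text and wrap what it finds in XML elements so the result can be stored and queried in an XML content server. The toy regex extractor and element names are mine; this is not REX output and not the MarkLogic load API.

```python
# Generic sketch of enriching text with entity markup before loading it into
# an XML content server. The regex "extractor" and element names are
# illustrative; this is not REX or the MarkLogic API.
import re
from xml.sax.saxutils import escape

# Toy extractor: pattern -> entity type. A real extractor uses far more than
# regular expressions.
ENTITY_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "date"),
    (re.compile(r"\b[A-Z][a-z]+ (?:Logic|Technology|Corp)\b"), "organization"),
]

def enrich(text):
    """Return the text with detected entities wrapped in <entity type=...> elements."""
    spans = []
    for pattern, etype in ENTITY_PATTERNS:
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), etype))
    spans.sort(reverse=True)  # replace from the end so earlier offsets stay valid
    out = text
    for start, end, etype in spans:
        out = (out[:start]
               + f'<entity type="{etype}">{escape(out[start:end])}</entity>'
               + out[end:])
    return f"<doc>{out}</doc>"

if __name__ == "__main__":
    print(enrich("Mark Logic announced a partnership on 2008-10-07."))
```

Once the entities are explicit XML elements, the XML server can index and query them like any other structured field, which is the point of the tie up.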

Stephen Arnold, October 13, 2008

The Financial Times Rediscovers Text Mining

October 11, 2008

On October 8, 2008, the Financial Times, whose parent company owned the Madame Tussaud's wax museum until 1998, published Alan Cane's "New Techniques Find Meanings in Words." Click "fast" because locating Financial Times news stories can be an interesting exercise. You can read this "news" in the Financial Times, a traditional publishing company with the same type of online track record as the Wall Street Journal and the New York Times. The premise of Mr. Cane's article is that individuals need information about people, places, and things. Apparently Mr. Cane is unfamiliar with the work of i2 in Cambridge, England, Linguamatics, and dozens of other companies in the British Commonwealth alone actively engaged in providing systems that parse content to discern and make evident information of this type. Nevertheless, Mr. Cane reviews the ideas of Sinequa, Google, and Autonomy. You can read about these companies and their "new" technology in this Web log. For me, the most interesting comment in this write up was this passage attributed in part to Charles Armstrong, CEO of Trampoline Systems, a company with which I am not familiar:

“The rise of Web 2.0 in the consumer world alerted business to the role that social contacts and networks play. When you are dealing with a project that requires a particular knowledge, you look for the person with the knowledge, not a document.” Mr Armstrong says Trampoline [Systems]'s search engine is the first to analyse not just the content of documents but the professional networks of those connected to the documents.

There are three points in this snippet that I noted on my trusty yellow pad:

  1. Who is Charles Armstrong?
  2. What is the connection between the specious buzzword "Web 2.0" and entity extraction? I recall Dr. Ramana Rao talking about entity extraction in the mid-1980s. Before that, various government agencies had systems that would identify "persons of interest". Vendors included ConQuest Technologies (acquired by Excalibur), and even earlier there were saved queries running against content in the Dialog and LexisNexis files. Anyone remember the command UD=9999 from 1979?
  3. What’s with the “Web 2.0” and the “first”? You can see this type of function on public demonstration sites at www.cluuz.com and www.silobreaker.com. You can also ring your local Kroll OnTrack office, and if you have the right credentials, you can see this type of operation in its industrial strength form.

Here’s what I found:

  • CRM Magazine named Trampoline Systems a rising star in 2008
  • Charles Armstrong, a Cambridge grad, is an "ethnographer turned technology entrepreneur." Trampoline Systems was founded in 2003 to "build on his research into how small communities distribute information to relevant recipients." Ah, the angle is the blend of entity extraction and alerts. Not really new, but it clarifies what Mr. Armstrong wants to deliver to licensees. Source: here. You can read the Wikipedia profile here. His LinkedIn profile carries this tag: "Ethnographer gone wrong" here. His Web log is here.
  • Craig McMillan is the technology honcho. According to the Trampoline Web site here, he is a veteran of Sun Microsystems, where he led the technical team that built the Identrus Global Trust Network identity assertion platform and led the technical team for a new enterprise integration and meta-directory platform. Source: here. I found it interesting that the Fast Forward Web log, the official organ of the pre-Microsoft Fast Search & Transfer, wrote about Mr. McMillan's work in early 2007 here in a story called "Trampoline Systems: Rediscovering the Lost Art of Communications." The Fast Forward article identifies Raytheon, the US defense outfit, as a "pilot". Maybe Fast Search should have purchased this company before the financial issues thrust Fast Search into the maw of Microsoft?
  • I located an Enron Explorer here. This seems to be a demo of some of the Trampoline functionality. But the visualizer was not working on October 10, 2008.
  • The core products are packaged as the Sonar Suite. You can view a demo of a Tacit Software-like system here. You can download a demo of the system here. The graphics look quite nice, but entity precision, relevance, throughput, and query response time are where the rubber meets the road. A nice touch is that the demos are available for Macs and PCs. With a bit of clicking from the Trampoline Systems home page, you can explore the different products the company offers.
  • Web Pro News has a useful write up about the company which appeared in 2006 here.

Charles Armstrong's relationships, as identified by the Canadian company Cluuz.com, appear in the diagram below. You can recreate this map by running the query "Charles Armstrong" + Trampoline on Cluuz.com. The URL for the map below is http://www.cluuz.com/ClusterChart.aspx?req=633592276174800000&key=9


This is Cluuz.com’s relationship map of Charles Armstrong, CEO of Trampoline Systems. “New” is not the word I would use to describe either the Cluuz.com or the Trampoline relationship visualization function. Both have interesting approaches, but the guts of this type of map have been around for a couple of decades.
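
To back up that claim, here is a minimal sketch of the classic approach: count how often entities appear in the same document and treat the counts as edge weights in a graph. It illustrates the decades-old technique, not the Cluuz.com or Trampoline implementations.

```python
# Minimal sketch of building a relationship map from entity co-occurrence.
# Entities appearing in the same document get an edge weighted by the number
# of shared documents. This is the classic technique, not Cluuz or Trampoline.
from collections import Counter
from itertools import combinations

def cooccurrence_graph(documents):
    """documents: list of sets of entity names. Returns Counter of entity-pair weights."""
    edges = Counter()
    for entities in documents:
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1
    return edges

if __name__ == "__main__":
    docs = [
        {"Charles Armstrong", "Trampoline Systems"},
        {"Charles Armstrong", "Trampoline Systems", "Craig McMillan"},
        {"Craig McMillan", "Sun Microsystems"},
    ]
    for (a, b), weight in cooccurrence_graph(docs).most_common():
        print(f"{a} -- {b}: {weight}")
```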

Let me be clear: I am intrigued by the Trampoline Systems’ approach. There’s something there. The FT article doesn’t pull the cart, however. I am, therefore, not too thrilled with the FT’s write up, but that’s my opinion to which I am entitled.

Make up your own mind. Please, read the Financial Times article. You will get some insight into why traditional media struggles to explain technology. Neither the editors nor the journalist takes the time or has the expertise to figure out what's "new" and what's not. My hunch is that Trampoline does offer some interesting features. Ripping through some contacts with well-known companies and jumping to the "new" assertion calls into question the understanding of the subjects about which the UK's top journalists write. Agree? Disagree? Run a query on FT.com for "Trampoline Systems" before you chime in, please.

Stephen Arnold, October 10, 2008

Data Mining: A Bad Report Card

October 9, 2008

Two readers sent me a link to reports about the National Research Council's study findings about data mining. Declan McCullagh's "Government Report: Data Mining Doesn't Work Well" for CNet is here. BoingBoing's more colorful write up of the report is here. The title is certainly catchy: "Data Mining Sucks: Official Report." The only problem with the study's findings is that I don't believe the results. I had a stake in a firm responsible for a crazy "red, yellow, green" flagging system for a Federal agency. The data mining system worked like a champ. What did not work was the government agency responsible for the program and the data stuffed into the system. Algorithms are numerical recipes. Some work better than others, but in most cases, the math in data mining is pretty standard. Sure there are some fancy tricks, but these are not deep, dark secrets locked in Descartes' secret notebooks. The math is taught in classes that dance majors and social science students never, ever consider taking. Cut through the math nerd fog, and the principles can be explained.

I am also suspicious that nothing reassures a gullible reader more than a statement that something is broken. I don’t think I am going to bite that worm nestled on a barbed hook. Clean data, off-the-shelf algorithms, reasonably competent management, and appropriate resources–data mining works. Period. Fumble the data, the management, and the resources–data mining outputs garbage. To get a glimpse of data mining that works, click here. Crazy stuff won’t work. Pragmatic stuff works really well. Keep that in mind after reading the NRC report.

Stephen Arnold, October 9, 2008
