The Future of Database

October 21, 2008

Every five years, some database gurus rendezvous to share ideas about the future of databases. This is a select group of wizards. This year the attendees included Eric Brewer (one of the founders of Inktomi), AnHai Doan (University of Wisconsin), and Michael Stonebraker (formerly CTO of Informix), among others. You can see the list of attendees here.

At this get together, the attendees give short talks, and then the group prepares a report. The report was available on October 19, 2008, at this link. The document is important, and it contains several points that I found suggestive. Let me highlight four and urge you to get the document and draw your own conclusions:

  • Database and data management are at a turning point. Among the drivers are changes in architecture, such as cloud computing, and the need to deal with large amounts of data.
  • The database field will be looking outside its traditional boundaries. One example is Google’s MapReduce. (A toy sketch of the MapReduce idea appears after this list.)
  • Data collections, not databases, are increasingly important.
  • The cloud, mobile, and virtual applications are game changers.
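
For readers who have not poked at MapReduce, here is a toy, single-machine sketch of the programming model. This is my own illustration in Python, not Google’s implementation: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step folds each group into a result.

    from collections import defaultdict

    def map_phase(doc_id, text):
        # Emit (word, 1) for every token in the document.
        for word in text.lower().split():
            yield word, 1

    def reduce_phase(word, counts):
        # Fold all partial counts for one key into a total.
        return word, sum(counts)

    def mapreduce(documents):
        # Shuffle: group the intermediate pairs by key, then reduce each group.
        groups = defaultdict(list)
        for doc_id, text in documents.items():
            for key, value in map_phase(doc_id, text):
                groups[key].append(value)
        return dict(reduce_phase(k, v) for k, v in groups.items())

    docs = {1: "database systems", 2: "database research and cloud systems"}
    print(mapreduce(docs))
    # {'database': 2, 'systems': 2, 'research': 1, 'and': 1, 'cloud': 1}

The map and reduce functions know nothing about distribution. Google’s contribution was running this pattern across thousands of commodity machines, which is exactly the kind of scale the traditional database vendors now have to answer.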

Summarizing an 11-page document in four dot points does not convey the substance of this report. The narrative and footnotes provide significant insight into the shift that is coming in database. Among those likely to be squeezed by this change are such vendors as IBM, Microsoft, and Oracle. And what company’s shadow fell across this conference? The GOOG. The report’s reference to MapReduce was neither a rhetorical flourish nor accidental.

Stephen Arnold, October 20, 2008

Searching Google Patent Documents with ISYS Version 9

October 13, 2008

After my two lectures at the Enterprise Search Summit in San Jose, California, in mid-September 2008, I had two people write me about my method for figuring out Google patent documents. Please appreciate that I can’t reveal the tools my team has developed for me. These are my secret sauce, but I can describe the broad approach and provide some detail about what Constance, Don, Stuart, and Tony do when I have to cut through the “fog of transparency” and the lava lamp light emanating from Google.

Background

Google generates a large volume of technical information and a comparatively modest number of patent-related documents. The starting point, therefore, is a fact that catches my attention. One client sent two people to “watch” me investigate a technical topic. After five days of taking notes, snapping digital photos, and reviewing the information that flows into my Harrod’s Creek, Kentucky, offices, the pair gave up. The procedure was easily flow charted, but the identification of an important and interesting item was a consequence of effort and years of grunting through technical material. Knowing what to research, it seems, is a matter of experience, judgment, and instinct.

The two “watchers” looked at the dozens of search, text mining, and content utilities I had on my machines. The two even fiddled with the systems’ ability to pattern match using n-gram technology, entity extraction using 12-year-old methods that some companies still find cutting edge, and various search systems from companies still in business as well as those long since bought out or simply shut down.

Here’s the big picture:

  1. Spider and collect information via various push methods. The data may be in XML, PDF, or other formats. The key point is that everything I process is open source. This means that I rely on search engines, university Web sites, government agencies with search systems that are prone to time outs, and postings of Web logs. I use exactly the same data that you can use when you run a query on any of the more than 500 systems listed here. This list is one of the keys to our work because none of the well-known search systems index “everything”. The popular search engines don’t even come close. In fact, most don’t go more than two or three links deep for certain Web sites. Do some exploring on the US Department of Energy Web site, and you will see what I mean. The key is to run the query across multiple systems and filter out duplicates. Software and humans do this work, just as humans process information at national security operations in many countries. (If you read my Web log, you will know that I have a close familiarity with systems developed by former intelligence professionals.) A minimal sketch of the deduplication step appears after this list.
  2. Take the filtered subset and process it with a search engine. The bulk of this Web log post describes the ISYS Search Software system. We have been using this system for several years, and we find that it is a quick indexer, so we can process new information quickly.
  3. Subset analysis. Once we have a cut from the content we are processing, then we move the subset into our proprietary tools. One of these tools runs stored queries or what some people call saved searches against the subset looking for specific people and things. My team looks at these outputs.
  4. I review the winnowed subset, and, as time allows, I involve myself in the preceding steps. Once the subset is on my machine, I have to do what anyone reviewing patents and technical documents must do. I read these materials. No, I don’t like to do it, but I have found that doing consistently the dog work that most people prefer to dismiss as irrelevant is what makes it possible for me to “connect the dots”.
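
The deduplication mentioned in step one is mundane but essential. Here is a minimal sketch of the idea in Python; it is an illustration only, not the proprietary tools my team uses. Normalize each hit, hash it, and keep the first occurrence across all of the engines queried.

    import hashlib

    def normalize(record):
        # Collapse case and whitespace so trivial variations hash identically.
        title = " ".join(record.get("title", "").lower().split())
        url = record.get("url", "").rstrip("/").lower()
        return title + "|" + url

    def dedupe(result_sets):
        # result_sets: one list of hits per search system queried.
        seen = set()
        unique = []
        for hits in result_sets:
            for record in hits:
                digest = hashlib.sha1(normalize(record).encode("utf-8")).hexdigest()
                if digest not in seen:
                    seen.add(digest)
                    unique.append(record)
        return unique

    engine_a = [{"title": "Data Center Cooling", "url": "http://example.gov/cooling"}]
    engine_b = [{"title": "data center cooling", "url": "http://example.gov/cooling/"}]
    print(len(dedupe([engine_a, engine_b])))  # 1

Real systems add fuzzier matching such as near-duplicate detection, and a human still reviews the borderline cases.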

Searching

There’s not much to say about running queries and collecting information that comes via RSS or other push technologies. We get “stuff” from open sources, and we filter out the spam, duplicates, and uninteresting material. Let’s assume that we have information regarding new Google patent documents. We get this information pushed to us, and these are easy to flag. You can navigate to the USPTO Web site and see what we get. You can pay commercial services to send you alerts when new Google documents are filed or published. You can poke around on the Web and find a number of free patent services. If you want to use Google to track Google, then you can use Google’s own patent service. I don’t find it particularly helpful, but Google may improve it at some point in the future. Right now, it’s on my list, but it’s like a dull but well meaning student. I let the student attend my lectures, but I don’t pay much attention to the outputs. If you want some basic information about patent documents, click here.

[Image: datacenterresults]

Narrowed result set for a Google hardware invention related to cooling. This is an image generated using ISYS Version 9, which is now available.

Before Running Queries

You can’t search patent documents and technical materials shooting from the hip. When I look for information about Google or Microsoft, for instance, I have to get smart with regard to terminology. Let me illustrate. If you want to find out how Microsoft is building data centers to compete with Google, you will get zero useful information with this type of query on any system: “Microsoft” AND “data centers”. My actual queries are more complex and use nesting, but this test query is one you can use on Microsoft’s Live.com search. Now run the same query for “Microsoft Monsoon”. You will see what you need to know here. If you don’t know the code word “Monsoon”, you will never find the information. It’s that simple.
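
To make the nesting point concrete, here is a rough illustration of how a query built from a small vocabulary list differs from a flat keyword query. The code words and concept terms below are invented for the example (only “Monsoon” comes from the discussion above); my actual vocabulary lists are assembled by reading the source documents.

    # Hypothetical vocabulary; real lists come from grinding through the documents.
    vendors = ["Microsoft"]
    code_words = ["Monsoon", "CloudDB"]          # "CloudDB" is made up for illustration
    concepts = ['"data center"', "container", "pod"]

    def nested_query(vendors, code_words, concepts):
        # (vendor) AND (code word OR ...) AND (concept OR ...)
        block = lambda terms: "(" + " OR ".join(terms) + ")"
        return " AND ".join([block(vendors), block(code_words), block(concepts)])

    print(nested_query(vendors, code_words, concepts))
    # (Microsoft) AND (Monsoon OR CloudDB) AND ("data center" OR container OR pod)

Without the code word block, the query returns marketing fluff. With it, the result set shrinks to documents worth reading.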


Mark Logic and Basis Technology

October 13, 2008

Browsing the Basis Technology Web site revealed an October 7, 2008, news release about a Basis Technology and Mark Logic tie up. You can read the news release here or here. Basis Technology licenses text and content processing components and systems. The Basis Technology announcement says “Rosette Entity Extractor provides advanced search and text analytics for MarkLogic Server 4.0.” Mark Logic, as I have noted elsewhere in this Web log, is one of the leading providers of XML server technology. The new version can store, manage, search, and deliver content in a variety of forms to individual users, other enterprise systems, or to devices. REX (shorthand for Rosette Entity Extractor) can identify people, organizations, locations, numeric strings such as credit card numbers, email addresses, geographic data, and other items such as dates from unstructured or semi-structured content. I don’t have details on the deal. My take on this is that Mark Logic wants to put its XML engine into gear and drive into market spaces not now well served by applications and functions in other vendors’ XML systems. Enterprise search is dead. Long live more sophisticated information and data management systems. Search will be tucked into these solutions, but it’s no longer the focal point of the system. I am pondering the impact of this announcement on other XML vendors and upon such companies as Microsoft Fast Search.
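
For readers who have not used an entity extraction engine, here is a toy sketch of the kind of output REX produces. This is a handful of regular expressions in Python, emphatically not Basis Technology’s method or API; a real extractor relies on linguistic models and handles many languages.

    import re

    TEXT = "Jane Smith wrote about the 2008-10-07 Mark Logic deal; reach her at jane.smith@example.com"

    # Toy patterns only; note the crude "person" rule also catches the company Mark Logic,
    # which is exactly why commercial extractors do not rely on regular expressions.
    PATTERNS = {
        "email":  r"[\w.+-]+@[\w-]+\.\w+",
        "date":   r"\b\d{4}-\d{2}-\d{2}\b",
        "person": r"\b[A-Z][a-z]+ [A-Z][a-z]+\b",
    }

    def extract(text):
        return {label: re.findall(pattern, text) for label, pattern in PATTERNS.items()}

    print(extract(TEXT))
    # {'email': ['jane.smith@example.com'], 'date': ['2008-10-07'], 'person': ['Jane Smith', 'Mark Logic']}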

Stephen Arnold, October 13, 2008

The Financial Times Rediscovers Text Mining

October 11, 2008

On October 8, 2008, the company that owned Madame Tussaud’s wax museum until 1998 published Alan Cane’s “New Techniques Find Meanings in Words.” Click “fast” because locating Financial Times’s news stories can be an interesting exercise. You can read this “news” in the Financial Times, a traditional publishing company with the same type of online track record as the Wall Street Journal and the New York Times. The premise of Mr. Cane’s article is that individuals need information about people, places, and things. Apparently Mr. Cane is unfamiliar with the work of i2 in Cambridge, England, Linguamatics, and dozens of other companies in the British Commonwealth alone actively engaged in providing systems that parse content to discern and make evident information of this type. Nevertheless, Mr. Cane reviews the ideas of Sinequa, Google, and Autonomy. You can read about these companies and their “new” technology in this Web log. For me, the most interesting comment in this write up was this passage attributed in part to Charles Armstrong, CEO of Trampoline Systems, a company with which I am not familiar:

“The rise of Web 2.0 in the consumer world alerted business to the role that social contacts and networks play. When you are dealing with a project that requires a particular knowledge, you look for the person with the knowledge, not a document.” Mr Armstrong says Trampoline [Systems]’s search engine is the first to analyse not just the content of documents but the professional networks of those connected to the documents.

There are three points in this snippet that I noted on my trusty yellow pad:

  1. Who is Charles Armstrong?
  2. What is the connection between the specious buzzword “Web 2.0” and entity extraction? I recall Dr. Ramana Rao talking about entity extraction in the mid-1980s. Before that, various government agencies had systems that would identify “persons of interest”. Vendors included ConQuest Technologies, later acquired by Excalibur, and even earlier there were saved queries running against content in the Dialog and LexisNexis files. Anyone remember the command UD=9999 from 1979?
  3. What’s with the “Web 2.0” and the “first”? You can see this type of function on public demonstration sites at www.cluuz.com and www.silobreaker.com. You can also ring your local Kroll OnTrack office, and if you have the right credentials, you can see this type of operation in its industrial strength form.

Here’s what I found:

  • CRM Magazine named Trampoline Systems a rising star in 2008
  • Charles Armstrong, Cambridge grad, is an “ethnographer turned technology entrepreneur.” The company Trampoline Systems was founded in 2003 to “build on his research into how small communities distribute information to relevant recipients.” Ah, the angle is the blend of entity extraction and alerts. Not really new, but more of an angle on what Mr. Armstrong wants to deliver to licensees. Source: here. You can read the Wikipedia profile here. His Linked In profile carries this tag: “Ethnographer gone wrong” here. His Web log is here.
  • Craig McMillan is the technology honcho. According to the Trampoline Web site here, he is a veteran of Sun Microsystems where he “led the technical team building the Identrus Global Trust Network Identity assertion platform [and] led technical team for new enterprise integration and meta-directory platform.” Source: here. I found it interesting that the Fast Forward Web log, the official organ of the pre-Microsoft Fast Search & Transfer, wrote about Mr. McMillan’s work in early 2007 here in a story called “Trampoline Systems: Rediscovering the Lost Art of Communications.” The Fast Forward article identifies Raytheon, the US defense outfit, as a “pilot”. Maybe Fast Search should have purchased this company before the financial issues thrust Fast Search into the maw of Microsoft?
  • I located an Enron Explorer here. This seems to be a demo of some of the Trampoline functionality. But the visualizer was not working on October 10, 2008.
  • The core products are packaged as the Sonar Suite. You can view a demo of a Tacit Software-like system here. You can download a demo of the system here. The graphics look quite nice, but the entity precision, relevance, throughput, and query response time are where the rubber meets the road. A nice touch is that the demos are available for Macs and PCs. With a bit of clicking from the Trampoline Systems’ home page, you can explore the different products the company offers.
  • Web Pro News has a useful write up about the company which appeared in 2006 here.

Charles Armstrong’s relationships as identified by the Canadian company Cluuz.com appear in the diagram below. You can recreate this map by running the query “Charles Armstrong” + Trampoline on Cluuz.com. The URL for the map below is http://www.cluuz.com/ClusterChart.aspx?req=633592276174800000&key=9

[Image: Armstrong relationship map]

This is Cluuz.com’s relationship map of Charles Armstrong, CEO of Trampoline Systems. “New” is not the word I would use to describe either the Cluuz.com or the Trampoline relationship visualization function. Both have interesting approaches, but the guts of this type of map have been around for a couple of decades.

Let me be clear: I am intrigued by the Trampoline Systems’ approach. There’s something there. The FT article doesn’t pull the cart, however. I am, therefore, not too thrilled with the FT’s write up, but that’s my opinion to which I am entitled.

Make up your own mind. Please read the Financial Times article. You will get some insight into why traditional media struggles to explain technology. Neither the editors nor the journalist takes the time or has the expertise to figure out what’s “new” and what’s not. My hunch is that Trampoline does offer some interesting features. Ripping through some contacts with well-known companies and jumping to the “new” assertion calls into question the understanding of the subjects about which the UK’s top journalists write. Agree? Disagree? Run a query on FT.com for “Trampoline Systems” before you chime in, please.

Stephen Arnold, October 10, 2008

Data Mining: A Bad Report Card

October 9, 2008

Two readers sent me links to reports about the National Research Council’s study findings about data mining. Declan McCullagh’s “Government Report: Data Mining Doesn’t Work Well” for CNet is here. BoingBoing’s most colorful write up of the report is here. The title is certainly catchy, “Data Mining Sucks: Official Report.” The only problem with the study’s findings is that I don’t believe the results. I had a stake in a firm responsible for a crazy “red, yellow, green” flagging system for a Federal agency. The data mining system worked like a champ. What did not work was the government agency responsible for the program and the data stuffed into the system. Algorithms are numerical recipes. Some work better than others, but in most cases, the math in data mining is pretty standard. Sure, there are some fancy tricks, but these are not the deep, dark secrets locked in Descartes’ secret notebooks. The math is taught in classes that dance majors and social science students never, ever consider taking. Cut through the math nerd fog, and the principles can be explained.
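
To show how ordinary the math can be, here is a bare-bones version of one workhorse data mining algorithm, k-means clustering, in a few lines of Python. It is a teaching sketch, not anything a Federal contractor would ship, but the arithmetic is representative.

    import random

    def kmeans(points, k, iterations=20):
        # Pick k starting centers, then alternate assignment and averaging.
        centers = random.sample(points, k)
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for x in points:
                nearest = min(range(k), key=lambda i: abs(x - centers[i]))
                clusters[nearest].append(x)
            centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        return sorted(centers)

    # Two obvious groups of one-dimensional "risk scores": clean data, boring math, correct answer.
    data = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]
    print(kmeans(data, 2))  # roughly [1.03, 10.03]

Feed the same routine noisy, mislabeled, or incomplete data, and the clusters mean nothing. The algorithm is not the weak link.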

I am also suspicious that nothing reassures a gullible reader more than a statement that something is broken. I don’t think I am going to bite that worm nestled on a barbed hook. Clean data, off-the-shelf algorithms, reasonably competent management, and appropriate resources–data mining works. Period. Fumble the data, the management, and the resources–data mining outputs garbage. To get a glimpse of data mining that works, click here. Crazy stuff won’t work. Pragmatic stuff works really well. Keep that in mind after reading the NRC report.

Stephen Arnold, October 9, 2008

MarkLogic 4.0: A Next-Generation Content System

October 7, 2008

Navigate to KDNuggets here, and you can see a lineup of some of the content processing systems available on October 2, 2008. The list is useful, but it is not complete. There are more than 350 vendors in my files, and each asserts that it has a must-have content system. Most of these systems suffer from one or more drawbacks; for example, scaling is problematic or just plain expensive, repurposing the information is difficult, or modifying the system requires lots of fiddling.

MarkLogic 4.0 addresses a number of these common shortcomings in text processing and XML content manipulation. The company “accelerates the creation of content applications.” With MarkLogic’s most recent release of its flagship server product, MarkLogic offers a content platform, not a content utility. Think of most content processing companies as tug boats. MarkLogic 4.0 is an ocean going vessel with speed, power, and range. When I spoke with MarkLogic’s engineers, I learned that the ideas for enhancements to MarkLogic 3.2, the previous release, originated with MarkLogic users. One engineer said, “Our licensees have discovered new uses for the system. We have integrated into the code base functions and operations that our customers told us they need to get the most value from their information. Version 4.0 is a customer driven upgrade. We just did the coding for them.”


Most text processing systems, including XML databases, are useful but limited in power and scope. The MarkLogic 4.0 system is an ocean going vessel among harbor bound craft.

You can learn quite a bit about the functionality of MarkLogic in this Dr. Dobb’s interview with Dave Kellogg, CEO of this Sequoia-backed firm. The Dr. Dobb’s interview is here.

MarkLogic is an ocean going vessel amidst smaller boats. The product is an XML server, and it offers search, analytics, and jazzy features such as geospatial querying. For example, I can ask a MarkLogic system for information about a specific topic within a 100-mile radius of a particular city. But the core of MarkLogic 4.0 is an XML database. When textual information or data are stored in MarkLogic 4.0, slicing, dicing, reporting, and reshaping information provide solutions, not results lists.
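
The geospatial idea is easy to picture. Here is a back-of-the-envelope sketch in Python of the “everything within 100 miles of a city” filter, using the standard great-circle distance formula. This is an illustration of the concept, not MarkLogic’s query syntax, and the sample documents are invented.

    from math import radians, sin, cos, asin, sqrt

    def miles_between(lat1, lon1, lat2, lon2):
        # Haversine great-circle distance; 3959 is the Earth's radius in miles.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 3959 * asin(sqrt(a))

    def within_radius(documents, center, miles):
        # documents: (title, latitude, longitude) tuples whose geotags were extracted at index time.
        return [d for d in documents if miles_between(center[0], center[1], d[1], d[2]) <= miles]

    docs = [("Louisville story", 38.25, -85.76), ("Chicago story", 41.88, -87.63)]
    print(within_radius(docs, center=(38.03, -84.50), miles=100))  # center is roughly Lexington, KY
    # [('Louisville story', 38.25, -85.76)]

The interesting engineering is not the formula; it is indexing the coordinates so the filter runs over millions of documents without scanning them all.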

According to Andy Feit, vice president, MarkLogic is “a mix of native XML handling, full-text search engines, and state-of-the-art DBMS features like time-based queries, large-scale alerting, and large-scale clustering.” The new release adds important functionality. New features include:

  • Geospatial support for common geospatial markup standards plus an ability to display data on polygons such as state boundaries or a sales person’s region. The outputs or geospatial mash ups are hot linked to make drill down a one-click operation

    [Image: geospatial mash up]

  • Push operations such as alerts sent to a user’s mobile phone or triggers which operate when a content change occurs which, in turn, launches a separate application. The idea is to automate content and information operations in near real time, not leave it up to the system user to run a query and find the important new information.
  • Embedded entity enrichment functionality including support for Chinese, Russian and other languages
  • Improved support for third party enterprise entity extraction engines or specialized systems. For example, the new version ships with direct support for TEMIS’s health and medical processing, financial services, and pharmaceutical content processing system. MarkLogic calls its approach “an open framework”
  • Mobile device support. A licensee can extract data from MarkLogic and the built in operations will format those data for the user’s device. Location services become more fluid and require less developer time to implement.

The new release of MarkLogic manipulates XML quickly. In addition to performance enhancements to the underlying XML data management system, MarkLogic supports the XQuery 1.0 standard. Users of earlier versions of MarkLogic Server can continue to use these systems alongside Version 4.0. According to Mr. Feit, “Some vendors require a lock step upgrade when a new release becomes available. At MarkLogic, we make it possible for a licensee to upgrade to a new version incrementally. No recoding is required. Version 4, for example, supports earlier versions’ query language and scripts.”


Powerset’s Approach to Search

October 6, 2008

Powerset was acquired by Microsoft for about $100 million in June 2008. I haven’t paid too much attention to what Microsoft has done or is doing with the Powerset semantic, natural language, latent semantic indexing, et al. system it acquired. A reader sent me a link to Jon Udell’s Web log interview that focuses on Powerset. If you want to know more about how Microsoft will leverage the aging Xerox Parc technology, you will want to click here to get an introduction to the Perspectives interview conducted on September 30, 2008, with Scott Prevost. You will need to install Silverlight, or you can read the interview transcript here.

I can’t summarize the lengthy interview. For me, three points were of particular interest:

  1. The $100 million bought Powerset, but Microsoft then had to license the Xerox Parc technology. You can get some “inxight” into the functions of the technology by exploring the SAP/Business Objects’ information here.
  2. The Powerset technology can be used with both structured and unstructured information.
  3. Microsoft will be doing more work to deliver “instant answers”.

A happy quack to the reader who sent me this link, and two quacks for Mr. Udell for getting some useful information from Scott Prevost. I am curious about the roles of Barney Pell (Powerset founder) and Ron Kaplan (Powerset CTO and former Xerox Parc wizard) in the new organization. If anyone can shed light on this, you too will warrant a happy quack.

Stephen Arnold, October 6, 2008

Intel’s Interest in Medical Terminology Translation

October 4, 2008

Intel continues to be a slippery fish when it comes to search and content processing. The ill-fated Convera deal burned through millions in the early 2000s. Earlier this year, Intel pumped cash into Endeca, one of the two high-profile enterprise search vendors known for their ecommerce and information access systems. (The other vendor is Autonomy. Fast Search & Transfer seems to be shifting from a vendor to an R&D role, but its trajectory remains unclear to me.)

Intel has one engineer thinking about language. The posting on an Intel Software Network Web log, “Designing for Gray Scale: Under the Hood of Medical Terminology Translation,” is suggestive. The author is Joshua Painter, who identifies himself with Intel. You can read this post here. Translation of scientific, technical, and medical terminology is somewhat easier than translating general business writing. The task is still difficult, particularly when a large pharmaceutical company wants to monitor references to a drug’s formal and casual names in English and non-English document sets.
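
A toy example makes the terminology problem easier to see. The synonym table below is my own illustration, not anything from Mr. Painter’s post; production systems lean on curated vocabularies such as SNOMED CT or RxNorm plus per-language variants and misspellings.

    # Minimal, invented synonym table mapping surface forms to a canonical term.
    TERM_MAP = {
        "acetaminophen": "acetaminophen",   # US formal name
        "paracetamol":   "acetaminophen",   # international formal name
        "tylenol":       "acetaminophen",   # brand name, casual usage
        "paracetamolo":  "acetaminophen",   # Italian spelling
    }

    def normalize_mentions(text):
        # Map every known surface form in a document to its canonical term.
        hits = []
        for raw in text.lower().split():
            token = raw.strip(".,;:!?")
            if token in TERM_MAP:
                hits.append((token, TERM_MAP[token]))
        return hits

    doc = "Patient took Tylenol; il paracetamolo was noted in the Italian report."
    print(normalize_mentions(doc))
    # [('tylenol', 'acetaminophen'), ('paracetamolo', 'acetaminophen')]

Multiply the table by tens of thousands of drugs, dozens of languages, and constantly changing slang, and the appeal of doing some of this work in fast hardware becomes clearer.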

Mr. Painter’s write up concerns standards; specifically, “data standards in enabling interoperability in healthcare.” For me the interesting passage in this write up was:

An architecture for Health Information Exchange must accommodate choice and dealing with change – it must be designed for grayscale. This includes choice of medical vocabularies, messaging standards, and other terminology interchange considerations. In my last post I introduced the notion of a Common Terminology Services to deliver a set of capabilities in this space. In this post, I will discuss a technical architecture for enabling this.

The word grayscale, I think, means fuzziness. Intel makes these tantalizing pieces of information available, and I continue to watch for them. My hunch is that Intel wants to put some content centric operations in silicon. Imagine. Endeca on a multi core chip. So far this is speculation, but it is clear that juiced hardware can deliver some impressive content processing performance boosts. Exegy’s appliance demonstrates the value of this hardware angle.

Stephen Arnold, October 4, 2008

Attensity and Tremendous Momentum

October 3, 2008

With the economy in the US stumbling along, I found Attensity’s September 30, 2008, “Momentum” news release intriguing. The information issued by the analytics company is here. I had to struggle to decipher some of the jargon. For example, First Person Intelligence. This is a product name with a trademark. The idea is that email or phone calls from a customer are analyzed by Attensity. The resulting insights yield information about a particular customer; hence, First Person Intelligence. You can see FPI in action by clicking here. The company won an award called the Stevie. If you are curious or you want to enter to compete to snag the 2009 award, click here. I think I know what text analytics is, so I jumped to VoC. The acronym means “voice of the customer.” I think the notion is that a company pays attention to emails, call center notes, and survey data. I’m not certain if VoC is a subset of FPI or if VoC is the broader concept and FPI is a subset of VoC.

The core of the news release is that Attensity has landed some major accounts. Customer names are tough to come by, so you may want to note these organizations who have licensed the Attensity technology but hopefully not the jargon:

  • JetBlue
  • Royal Bank of Canada
  • Travelocity

For me, the most useful part of the company-written article was this passage:

The text analytics market is rapidly moving out of the early adopter stage. Industry analyst firm Hurwitz & Associates estimates an annual growth rate for this market at 30 to 50 percent. According to a survey conducted last year by the firm, the largest growth area is in customer care-related applications. In fact, over 70 percent of the companies surveyed that had deployed, or were considering deploying the technology, cited customer care as a key application area.

The growth rate does not match my calculation, which pegs growth at a more leisurely 10 to 18 percent on an annual basis. The Hurwitz organization is much larger than this single goose operation. An endangered species like this addled goose is more conservative, and its estimates in a grim financial market are less optimistic than other consultants’ and analysts’.

In my Beyond Search study for the Gilbane Group, published in April 2008, I gave Attensity high marks. Its deep extraction technology yields useful metadata. Since my early 2008 analysis, Attensity has worked hard to productize its system. Call centers are a market segment in need of help. Most companies want to contain support costs.

In my opinion, Attensity’s technology is better than its explanation of its products and those products’ names. I wonder if the addition of marketers to a technology-centric company is a benefit or a drawback. Thoughts?

Stephen Arnold, October 3, 2008

Silobreaker: Mary Ellen Bates’ Opinion Is on Target

September 30, 2008

Mary Ellen Bates is one sharp information professional. She moved from Washington, DC, to the sunny clime in Colorado. The shift from the nation’s capital (the US crime capital) to the land of the Prairie Lark Finch has boosted her acumen. Like me, she finds much goodness in the Silobreaker.com service. (You can read an interview with one of the founders of Silobreaker.com here.) Writing in the September number of Red Orbit here she said:

What Silobreaker does particularly well is provide you with visual displays of information, which enable you to spot trends or relationships that might not be initially obvious. Say, for example, you want to find out about transgenic research. Start with what Silobreaker calls the “360° search,” which looks across its indexes, including fields for entities (people, companies, locations, organizations, industries, and keywords), news stories, YouTube videos, blog postings, and articles.

If you want to try Silobreaker yourself, click here. With Ms. Bates in the wilds of Colorado and me in a hollow in rural Kentucky, I am gratified that news about next-generation information services reaches us equally. A happy quack to Silobreaker and Ms. Bates.

Stephen Arnold, September 30, 2008
