Google: Security of Its Cloud Applications

January 3, 2009

CSO Security and Risk published a mini interview with two Googlers here. “Four Questions on Google App Security” contains little of the lava lamp and Odwalla disingenuousness and some useful information about security for Google Apps users. The author Bill Brenner is to be commended for ignoring the usual fluff that distracts most of the journalists writing about the GOOG. The Googlers make a passing reference to the problem of multi-tenant computing, a topic that warranted some deeper probing in my opinion. The Googlers lay out the Google view of delivering applications from the cloud. Google is not viewing cloud services as virtualization. Nope. For the GOOG, cloud computing is built around “message application, security, and compliance.” For me the most important comment in the article was:

We have taken a big chunk of Postini’s technology and incorporated it into the Gmail client.

Google has a presence in secure hosted email. If you dig around on the Google Web site, you will find a very reasonably priced email archiving service. The present service is a tiny step away from more robust eDiscovery services. The “hook” between Gmail and Postini is an important signal that Google is beginning the process of rationalization; that is, why have two services? Blend the technology and go with one branded service. I am inclined to reassess Gmail as a more important enterprise service than it now is.

Stephen Arnold, January 2, 2009

Duplicates and Deduplication

December 29, 2008

In 1962, I was in Dr. Daphne Swartz’s Biology 103 class. I still don’t recall how I ended up amidst the future doctors and pharmacists, but there I was sitting next to my nemesis Camille Berg. She and I competed to get the top grades in every class we shared. I recall that Miss Berg knew that there were five variations of twinning: three dizygotic and two monozygotic. I had just turned 17 and knew about the Doublemint Twins. I had some catching up to do.

Duplicates continue to appear in data just as the five types of twins did in Bio 103. I find it amusing to hear and read about software that performs deduplication; that is, the machine process of determining which item is identical to another. The simplest type of deduplication is to take a list of numbers and eliminate any that are identical. You probably encountered this type of task in your first programming class. Life gets trickier when the values are expressed in different ways; for example, a mixed list with binary, hexadecimal, and real numbers, plus a few more interesting variants tossed in for good measure. Deduplication becomes a bit more complicated.
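
To illustrate that normalize-then-compare step, here is a minimal Python sketch. The notation handling and the sample values are my own assumptions for the example, not taken from any particular product:

    # Minimal sketch: deduplicating values written in different notations.
    # The trick is to normalize every value to one canonical form before comparing.

    def to_number(token: str) -> float:
        """Convert a binary, hexadecimal, or decimal string to a plain float
        so that '0x1A', '0b11010', and '26.0' all collide on the same key."""
        token = token.strip().lower()
        if token.startswith("0x"):
            return float(int(token, 16))
        if token.startswith("0b"):
            return float(int(token, 2))
        return float(token)

    def dedupe(values):
        """Return the values with duplicates removed, keeping first occurrences."""
        seen = set()
        unique = []
        for raw in values:
            key = to_number(raw)
            if key not in seen:
                seen.add(key)
                unique.append(raw)
        return unique

    if __name__ == "__main__":
        mixed = ["26", "0x1A", "0b11010", "26.0", "3.14"]
        print(dedupe(mixed))   # ['26', '3.14'] -- the other spellings are duplicates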

At the other end of the scale, consider the challenge of examining two collections of electronic mail seized from a person of interest’s computers. There is the email from her laptop. And there is the email that resides on her desktop computer. Your job is to determine which emails are identical, prepare a single deduplicated list of those emails, generate a file of emails and attachments, and place the merged and deduplicated list on a system that will be used for eDiscovery.

Here are some of the challenges that you will face once you answer this question, “What’s a duplicate?” You have two allegedly identical emails and their attachments. One email is dated January 2, 2008; the other is dated January 3, 2008. You examine each email and find that the only difference between the two is the inclusion of a single slide in one of the two PowerPoint decks. Which do you conclude:

  1. The two emails are not identical, so keep both messages and both attachments
  2. The earlier email is the accurate one, so exclude the later email
  3. The later email is the accurate one, so exclude the earlier email.

Now consider that you have 10 million emails to process. We have to go back to our definition of a duplicate and apply the rules for that duplicate to the collection of emails. If we get this wrong, there could be legal consequences. A system that generates a file of emails based solely on a mathematical determination that one record differs from another may be too crude to deal with the problem in the context of eDiscovery. Math helps, but it is not likely to handle the onerous task of identifying near matches and the reasoning required to determine which email is “the” email.
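
To make the idea concrete, here is a hedged Python sketch of how a rule-based comparator might classify such a pair. The field names, the SHA-256 choice, and the three-way outcome are illustrative assumptions on my part, not a description of any vendor’s eDiscovery product:

    # Sketch: classify a pair of emails as exact duplicates, near duplicates
    # (for human review), or different records.
    import hashlib

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def compare_emails(a: dict, b: dict) -> str:
        """Each email is a dict with 'subject', 'body', and 'attachments' (list of bytes)."""
        same_text = (a["subject"].strip().lower() == b["subject"].strip().lower()
                     and digest(a["body"].encode()) == digest(b["body"].encode()))
        att_a = {digest(x) for x in a["attachments"]}
        att_b = {digest(x) for x in b["attachments"]}
        if same_text and att_a == att_b:
            return "exact"       # safe to keep a single copy
        if same_text and (att_a & att_b):
            return "near"        # e.g., one deck has an extra slide: route to a reviewer
        return "different"

The point of the “near” bucket is exactly the judgment call in the numbered list above: the software can flag the pair, but a person (or a documented rule) still has to decide which email is “the” email.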


Which is Jill? Which is Jane? Parents keep both. Does data work like this? Source: http://celebritybabies.typepad.com/photos/uncategorized/2008/04/02/natalie_grant_twins.jpg

Here’s another situation. You are merging two files of credit card transactions. You have data from an IBM DB2 system and you have data from an Oracle system. The company wants to transform these data, deduplicate them, normalize them, and merge them to produce one master “clean” data table. No, you can’t Google for an offshore service bureau; you have to perform this task yourself. In my experience, the job is going to be tricky. Let me give you one example. You identify two records that agree in field names and data for a single row in Table A and Table B, but you notice that the telephone number varies by a single digit. Which is the correct telephone number? You do a quick spot check and find that half of the entries from Table B have this variant, or you can flip the analysis around and say that half of the entries in Table A vary from Table B. How do you determine which records are duplicates?
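
As a rough illustration of the field-level comparison involved, here is a small Python sketch. The column names and sample rows are invented for the example; real transaction matching would need far more normalization than this:

    # Sketch: align rows that agree on key fields and surface the disagreements
    # (such as the one-digit telephone variance) instead of silently picking a winner.

    def diff_rows(row_a: dict, row_b: dict, key_fields=("account_id", "txn_date", "amount")):
        """Return the non-key fields on which two candidate-duplicate rows disagree,
        or None if the rows do not describe the same underlying transaction."""
        if any(row_a[k] != row_b[k] for k in key_fields):
            return None
        return {f: (row_a[f], row_b[f])
                for f in row_a
                if f not in key_fields and row_a.get(f) != row_b.get(f)}

    a = {"account_id": "991", "txn_date": "2008-11-02", "amount": "41.17", "phone": "502-555-0147"}
    b = {"account_id": "991", "txn_date": "2008-11-02", "amount": "41.17", "phone": "502-555-0148"}
    print(diff_rows(a, b))   # {'phone': ('502-555-0147', '502-555-0148')} -- a human or a rule decides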


Microsoft SharePoint and the Law Firm

December 22, 2008

Lawyers are, in general, similar to Scrooge McDuck. If you are too young to remember the Donald Duck funny papers, Scrooge McDuck was tight with a penny. Lawyers eschew capital expenditures if possible. When a client foots the bill, the legal eagles become slightly less abstemious, but in my experience, not too profligate with money.

Microsoft SharePoint offers an unbeatable combination for some law firms. Because the operating system is Microsoft’s, lawyers know that programmers, technical assistance, and even the junior college introductory computer class can be sources of expertise. And Microsoft includes a search system with SharePoint. Go with Microsoft, and visions of lower initial costs, bundles, and a competitive market from which to select the technical expertise you need come into focus. What could be better? Well, maybe a big pharma outfit struggling with a government agency. Most attorneys would drool with anticipation at working for either the company or the US government. A new client is more exciting than software.

Several people sent me links to Mark Gerow’s article “Elements of a Successful SharePoint Search.” You can read the full text of his article at Law.com here. The article does a good job of walking through a SharePoint installation for a law firm. You will also find passing references to other vendors’ systems. The focus is SharePoint.


Could this be a metaphor for a SharePoint installation?

I found several points interesting. First, Mr. Gerow explains why search in a law firm is not like running a query on Microsoft’s Web search or any other Web indexing system. There is a reference to Google’s assertion that it has indexed one trillion Web pages and an accurate comment about the inadequacy of Federal government information in public search systems. I am not certain that attorneys will understand why Google has been able to land some law firms and a number of Federal agencies as customers for its search appliance. I know from experience that many professionals have a difficult time differentiating among the content that’s available via the Web, content on the organization’s Web site, content on an Intranet, and content that may be available behind a firewall yet pulled from various sources. Also, I don’t think one can ignore the need for specialized systems to handle information obtained during the discovery process. Those systems do search, but law firms often pay hundreds of thousands of dollars because “traditional” search systems don’t do what attorneys need when preparing their documentation for litigation. These topics are referenced, but not in a way that makes much sense for SharePoint, a singularly tricky bundle of collaboration, content management, search, and Swiss Army Knife software packages sold as “one big thing.”


Entropy Soft and Kazeon Deal

December 10, 2008

Entropy Soft is a company that codes connectors. A “connector” allows one enterprise system to tap into the data and information in another enterprise system. Kazeon is one of the companies providing information systems for eDiscovery and enterprise content applications.

Kazeon’s system can discover, search and index, classify and act on electronically stored information. Kazeon provides a full spectrum of Information Management solutions, including proactive and reactive eDiscovery, information security and privacy, records management, governance, risk & compliance, and data management. The Kazeon Information Server software automates key eDiscovery functions – identification, collection, preservation, processing, analysis and review – for corporations, service providers, and law firms. The company is an end-to-end vendor, which means that hard copy documents can be processed and then moved through the eDiscovery pipeline.

Entropy Soft provides connectors to a number of companies; for example, Coveo and Endeca. Kazeon will add support for content in systems developed by Alfresco, FileNet P8, Hummingbird DM, Interwoven TeamSite, Microsoft SharePoint, and IBM Lotus Quickplace.

With Entropy Soft’s connectors, Kazeon is making clear its intention to move more aggressively into enterprise search, at a time when some search and content analysis vendors are moving from enterprise search into eDiscovery. I think search and information access companies are looking for new markets. Customers may be confused as vendors flip-flop from market to market in pursuit of revenues.

Stephen Arnold, December 10, 2008

Stratify Adds Cloud Storage Services

December 9, 2008

On December 3, 2008, Stratify, a unit of Iron Mountain, announced new services for its thriving eDiscovery business. You can read the Stratify news release here. The core of the service is disaster recovery. Attorneys apparently need to make sure that the legions of lawyers who pore over electronic documents obtained as part of the discovery process can’t nuke the data. Stratify said:

To safeguard client eDiscovery data Stratify has invested in and deployed a fully replicated production datacenter with more than 250 terabytes of storage, 200 servers and redundant 100MB Internet access, coupled with highly trained personnel and security procedures.

Stratify (which once did business as Purple Yogi) now wears a blue suit and polished shoes; the sneakers are gone. IDC’s Sue Feldman weighs in with the observation that the new service “raises the bar” for the companies competing for eDiscovery accounts.

Stratify’s news release added:

Stratify can restore access to client matters within four hours after a potential disaster, recover 100 percent of processed and loaded documents and system metadata, and lose no more than 59 minutes worth of review work product.

In my opinion, the eDiscovery sector is undergoing rapid change. The need for end-to-end solutions and bulletproof systems means that specialist vendors may be forced to add sophisticated new features in order to compete. The problem is that eDiscovery systems are sold to corporations, and with the technology and the market changing, well-funded organizations with strong client lists may have an advantage. Stratify said that it had more than 250 matters underway at this time.

eDiscovery, like business intelligence, is becoming a magnet for search and content processing companies who want to find a way to pump up revenues.

Stephen Arnold, December 9, 2008

Autonomy and Big Data

December 6, 2008

Earthtimes.org reported that Autonomy’s technology manages more than seven petabytes of data. “Autonomy Reaches New Benchmark for Managing World’s Largest Data Archive” here said that the Autonomy Digital Safe archives three million files an hour, which works out to more than 830 files a second. The Earthtimes report said:

Autonomy Digital Safe is a massively scalable, hosted archive service that enables customers to outsource the storage and management of their email messages, rich-media files, instant messages (IMs), unified communications content and content from over 400 repositories to a trusted, proven third-party.

More information about the system is available on the Autonomy Web site here. If you have an axe to grind with Autonomy, please contact them. I am linking to a story, not writing the story.

Stephen Arnold, December 6, 2008

ISYS Search Software CEO Interview

December 1, 2008

Scott Coles has joined ISYS Search Software as the firm’s chief executive officer. Ian Davies, founder, remains the chairman of the company. Among Mr. Coles’s tasks will be to lead the firm’s new strategic direction characterized by an expanded presence in Europe and Asia, specialized vertical-market offerings, a broader channel sales strategy, and a deeper set of embedded search solutions for original equipment manufacturers and independent software vendors.

Coles joins ISYS with a significant background in the commercialization of innovation for multinational corporations, holding senior executive roles with companies such as EDS, Lucent Technologies and Avaya. In the mid-1990s, Scott was the driving force behind the establishment and success of AT&T Bell Labs in Australia.

In his interview with ArnoldIT.com’s Search Wizards Speak, Coles provided information about the company’s focus in 2009.

On this topic, he said:

We are seeing significant increase in other software vendors coming to us to license our engine for incorporation into their products. This marks a general industry trend that I believe will increase significantly in the coming year. A number of applications today that previously had either none or only rudimentary search are finding that their products can be significantly enhanced with a sophisticated search engine. The amount of data that these applications have to deal with is now becoming so large that some form of pre-processing to narrow down to that which is relevant is becoming essential.

Mr. Coles also noted that Microsoft SharePoint continues to capture market share in content management and collaboration. However, the SharePoint user needs access to a range of content and:

ISYS can search all data, both inside and outside of SharePoint. In addition, ISYS provides high quality relevant results through features such as Boolean search operators, multi-dimensional clustering, and many others for which SharePoint users have expressed a desire that are currently not available in the native SharePoint product…we’ve taken great care to ensure our new “intelligent content analysis” methods are reliable, predictable and easily understood by the end user. These include parametric search and navigation, visual timeline refinement bars, intelligence clouds, de-duplication and intelligent query expansion. We’ve even added additional post-query processing to help streamline the e-discovery process. The end result is a core set of new capabilities that help our customers better cull and refine efficiently, without cutting corners on accuracy or relevance.

You can read the full text of the interview with Scott Coles at http://www.arnoldit.com/search-wizards-speak or click here.

SharePoint and Document Management

November 20, 2008

If you are in the midst of a discovery process, you will find some surprising information in the article “MOSS 2007 Document Management Services — Document Centralization” here. This Web log post appeared on Mastering SharePoint Community on November 19, 2008. The author was Bob Mixon. The write up covers a number of SharePoint document management topics, but for me the most important point was in this comment:

I don’t believe you will find anyone (or I at least hope not) at Microsoft recommending the use of a single Document Library to store all of your organizations documents.

What this means is that Microsoft is opening the door to third party vendors who can build a single collection of documents, put them in one place, and provide access control tools so the documents in the repository cannot be altered. The fancy word for what those controls guard against is spoliation. SharePoint, the Swiss Army knife for content, ships with a broken knife blade and some rust on the moving parts. You may find the many-collections approach useful. I don’t think senior managers who are facing litigation will be too thrilled to learn that special purpose systems will be needed because Microsoft doesn’t recommend a single repository in SharePoint. If you have licked this problem, let me know.

Stephen Arnold, November 20, 2008

Autonomy Upgrades Investigative System

November 15, 2008

Autonomy, based in Cambridge, England, continues to be one of the most agile of the information access and services companies. The firm has updated its Intelligent Investigator & Early Case Assessment software. You can read the story here or visit the Autonomy Web site for more details. Autonomy asserts that its software can understand the meaning of large volumes of data collected in an investigation or similar procedure. Once the structured and unstructured data are processed, an investigator can use the Autonomy system:

to reconstruct what occurred, develop informed case strategies and sweep aside non-responsive data. A seamless link with Autonomy Legal Hold software automatically provides a legally defensible preservation and collection process.

Features of the investigative system include:

  • A case-centric view of the data. The idea is that an investigator can get a bird’s eye view of information, events, persons of interest, and time in a matter
  • A new feature to analyze data where it resides and provide answers to queries without building a collection and performing some of the manual tasks other systems require
  • A risk component
  • Enhanced entity extraction and alias identification

Other companies offer case management and investigative tools. Autonomy’s broad sweep of software and systems allows the company to provide a solution that can mesh with almost any organizational or legal requirement. Will Autonomy sweep the field in this market? I know the company will try. The challenge will be to convince investigative units and lawyers to try new methods. Investigators and lawyers can be like my grandmother: set in her ways. A number of search and content processing companies are looking closely at these specialized markets. When the economy goes south, legal activity goes north. Autonomy has demonstrated it knows which way the compass is pointing.

Stephen Arnold, November 15, 2008

Data Management: A New Search Driver

November 4, 2008

Earlier today I reread “The Claremont Report on Database Research.” I had a few minutes, I recalled reading the document earlier this year, and I wanted to see if I had missed some of its key points. This report is a committee-written document prepared as part of an invitation-only conference focusing on databases. I follow the work of several of the people listed as authors of the report; for example, Michael Stonebraker and Hector Garcia-Molina, among others.

One passage struck me as important on this reading of the document. On page 6, the report said:

The second challenge is to develop methods for effectively querying and deriving insight from the resulting sea of heterogeneous data…. keyword queries are just one entry point into data exploration, and there is a need for techniques that lead users into the most appropriate querying mechanism. Unlike previous work on information integration, the challenges here are that we do not assume we have semantic mappings for the data sources and we cannot assume that the domain of the query or the data sources is known. We need to develop algorithms for providing best-effort services on loosely integrated data. The system should provide some meaningful answers to queries with no need for any manual integration, and improve over time in a “pay-as-you-go” fashion as semantic relationships are discovered and refined. Developing index structures to support querying hybrid data is also a significant challenge. More generally, we need to develop new notions of correctness and consistency in order to provide metrics and to enable users or system designers to make cost/quality tradeoffs. We also need to develop the appropriate systems concepts around which to tie these functionalities.

Several thoughts crossed my mind as I thought about this passage; namely:

  1. The efforts by some vendors to make search a front end or interface for database queries are bringing this function to enterprise customers. The demonstrations by different vendors of business intelligence systems such as Microsoft Fast’s Active Warehouse or Attivio’s Active Intelligence Engine make it clear that search has morphed from key words to answers.
  2. The notion of “pay as you go” translates to smart software; that is, no humans needed. If a human is needed, that involvement is as a system developer. Once the software begins to run, it educates itself. So “pay as you go” becomes a colloquial way to describe what some might have labeled “artificial intelligence” in the past. With data volume increasing, the notion of humans getting paid to touch the content recedes.
  3. Database quality in the commercial database sector could be measured by consistency and completeness. The idea that zip codes were consistent was more important than a zip code being accurate. With statistical procedures, the value in a cell may be filled in along with a score that shows the probability that the zip code is correct. Similarly, if one looks for the salary or mobile number of an individual, these probability scores become important guides for the user (a small sketch of this idea follows the list).
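
Here is a small Python sketch of the probability-score idea in point 3. The record layout, field names, and scores are invented for illustration only and do not describe any specific system:

    # Sketch: each filled-in cell carries a probability that it is correct,
    # and a consumer of the data can set a threshold for what it trusts.

    record = {
        "name":   {"value": "J. Smith",     "p_correct": 1.00},   # observed directly
        "zip":    {"value": "40223",        "p_correct": 0.92},   # imputed statistically
        "mobile": {"value": "502-555-0100", "p_correct": 0.61},   # merged from two sources
    }

    def usable_fields(rec, threshold=0.90):
        """Return only the fields whose confidence clears the threshold."""
        return {k: v["value"] for k, v in rec.items() if v["p_correct"] >= threshold}

    print(usable_fields(record))   # {'name': 'J. Smith', 'zip': '40223'}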


“Pay as you go” computing means that the most expensive functions in a data management method cost less because humans are no longer needed to do the “knowledge work” required to winnow and select documents, facts, and information. The company able to implement “pay as you go” computing on a large scale will destabilize the existing database business sector. My research has identified Google as an organization employing research scientists who use the phrase “pay as you go” computing. Is this a coincidence or an indication that Google wants to leapfrog traditional database vendors in the enterprise?

In the last month, a number of companies have been kind enough to show me demonstrations of next generation systems that take a query and generate a report. One system allows me to look at a sample screen, click a few options, and then begin my investigation by scanning a “trial report”. I located a sample Google report in a patent application that generates a dossier when the query is for an individual. That output goes an extra step and includes aliases used by the individual who is the subject of the query and a hot link to a map showing geolocations associated with that individual.

The number of companies offering products or advanced demonstrations of these functions means that the word “search” is going to be stretched even further than assisted navigation or alerts have already stretched it. The vendors who describe search as an interface for business intelligence are moving well beyond key word queries and the seemingly sophisticated interfaces widely available today.

Despite the economic pressures on organizations today, vendors pushing into data management for the purpose of delivering business intelligence will find customers. The problem will be finding a language in which to discuss these new functions and features. The word “search” may not be up to the task. The phrase “business intelligence” is similarly devalued for many applications. An interesting problem now confronts buyers, analysts, and vendors: “How can we describe our systems so people will understand that a revolution is taking place?”

The turgid writing in the Claremont Report is designed to keep the secret for the in crowd. My hunch is that certain large organizations–possibly Google–are quite far along in this data management deployment. One risk is that some companies will be better at marketing than at deploying industrial strength next generation data management systems. The nest might be fouled by great marketing not supported by equally robust technology. If this happens, the company that says little about its next generation data management system might deploy the system, allow users to discover it, and thus carry the field without any significant sales and marketing effort.

Does anyone have an opinion on whether the “winner” in data management will be a start up like Aster Data, a market leader like Oracle, or a Web search outfit like Google? Let me know.

Stephen Arnold, November 4, 2008
