IBM Lotus Notes and Domino to Open Source
December 30, 2008
The Register ran a story here about outside pressure on IBM to make Lotus Notes and Domino into open source products. In the late 1980s, Ziff Communications embraced Lotus Notes. Ah, a fine product. Not more than two years ago I had an opportunity to become reacquainted with the Lotus Notes mail store. Ah, the same fine product in 2006 as it was in 1989. That’s progress. My take on this kerfuffle is that Microsoft SharePoint, the Swiss Army Knife of enterprise software, is almost free now. In my opinion, it is free when you buy a bundle of other Microsoft servers that are needed to make SharePoint go bump in the night. The Register’s source is an information technology consultant, but I think the consultant may be pinging back gossip from IBM insiders. What happens if Lotus Notes goes open source? Nothing. My last NetFinity server came with Lotus Notes. The consultants are still required to make the creaky devil behave.
Stephen Arnold, December 30, 2008
Dataspace Boomlet
December 30, 2008
First, the UK moved toward pervasive monitoring. You can read about it here. Australia made some moves in a similar direction which you can read about here. Now The Hindu reports “Crime Scene Investigation Now to Function with Broadband.” You can read this story about POLNET here. For me the most interesting comment in the article was:
Police stations across the country will feed and upload video and still footage of the crime spot on POLNET which will then be transmitted to the Central server in Delhi and can be accessed by authorized experts. Such analysis would help in understanding the modus operandi of criminals and terrorists and prepare a strategy to tackle the same.
These systems are baby steps toward nation state dataspaces. Unlike a database, a dataspace gives investigators richer metadata. I have no position about the policies these nation states are implementing. What’s important to me are these issues:
- The idea of a dataspace, not a database, is clearly gaining traction. Obviously traditional databases cannot deliver the value that their licensees desire.
- The dataspace analyses will place considerable strain on the nation states’ data processing ability. The jam regarding the Bush White House digital data is an example of the data management burden that will become a major issue in 2009.
- The company best positioned to provide cloud based processing of these data is, in my opinion, Google. If there is a dip in advertising, the GOOG can contact certain countries with an offer to use Google’s proprietary data management systems for a fee. The pricing model can be a variant of the Clearwell Systems’ approach; that is, by the gigabyte.
I know that most people are blissfully unaware of dataspace technology. I can point you to the for-fee report Sue Feldman and I wrote in September 2008. I cannot reproduce that document (IDC Number 213562) here. Some dataspace information appears in my Gilbane report, Beyond Search, which is available here. In my view from my hollow in rural Kentucky, there will be some activity in the dataspace sector in 2009.
Stephen Arnold, December 30, 2008
Dead Tree Update: Chicago and Suburban Shoppers
December 29, 2008
Newsweek Magazine, a dead tree publication in some danger of marginalization, published “Chicago’s Newspapers Facing a Troubled Future” here. When I read this article, I had the impression that the author, F.N. D’Alessio, was writing about Newsweek and the Associated Press. Mr. D’Alessio refers to newspaper “addicts”. I don’t know too many. I receive four dead tree newspapers: the Courier Journal, USA Today (affectionately known as McPaper), the New York Times, and the Wall Street Journal. I used to get the Financial Times, but the delivery was so erratic I dropped the paper in January 2008. I received an offer of a year’s subscription for $99, and I threw it in the trash. Too much hassle trying to work through clumps of papers arriving twice a week. For me, the most significant comment in the Newsweek story was a comment about the Tribune’s rival, the Chicago Sun Times:
Hollinger’s biggest move was to create the Sun-Times Media Group by buying up 70 suburban and neighborhood newspapers, more than a dozen of which are dailies. Some of those are profitable, and some newspaper analysts envision the Sun-Times company shutting down the namesake paper and keeping the suburban ones.
I read this as a clear statement that big city papers are gone geese. Check out the Tribune’s online version of the newspaper. It is a disaster. My discussion of this wounded duck is here.
The future for dead tree outfits–if there is to be one–is to become ad supported, micro publications serving narrow markets. For years, I thought the Gaithersburg Gazette had potential. Now that type of publication, along with penny shoppers, may be the margin of the information world available to the dead tree crowd.
You can make money in niches, but the revenue will buy used Malibus, not the flashy Mercedes the princes of journalism see as suitable transportation.
Stephen Arnold, December 29, 2008
SharePoint Open Source
December 29, 2008
SDTimes.com published a short interview with Sam Ramji, Microsoft’s senior director of platform strategy here. In “Microsoft Mulling Open Source Strategy in SharePoint” Mr. Ramji summarizes Microsoft’s open source initiatives. There is one veiled reference to SharePoint as an open source enabler. In my opinion, this comment may presage a shift in Microsoft’s positioning of SharePoint:
We have heard the conversation internally about SharePoint as an example of a platform of great opportunity for open-source application strategy.
Microsoft offers a free search product. What happens if Microsoft finds that it must release some of the Fast Enterprise Search Platform technology as open source? My hunch is that Microsoft will be taking some big steps to deal with the pesky Google enterprise initiative. Making some of Fast ESP available as open source may be necessary in 2009.
Stephen Arnold, December 29, 2008
SharePoint Complexity
December 29, 2008
CIO Magazine has a white paper or report called “SharePoint.” You can start the tedious process of getting a copy here. I jumped through the hoops and was rewarded with a 19 page PDF. If you are one of the dilettantes who thinks SharePoint search is simple, you will be hard pressed to accept this report. “Better but More Complex” by Christine Casatelli makes one point again and again; namely, complexity. I must admit I wasn’t sure who wrote what in this white paper, but I came away with a sense that the authors did not fall for that “simple and easy” pitch for SharePoint that some wacky consultants are pitching. The five tips (page 5 and following) are pragmatic, but the authors don’t point out the time and effort required to verify metadata and permissions.
Stephen Arnold, December 29, 2008
Duplicates and Deduplication
December 29, 2008
In 1962, I was in Dr. Daphne Swartz’s Biology 103 class. I still don’t recall how I ended up amidst the future doctors and pharmacists, but there I was sitting next to my nemesis Camille Berg. She and I competed to get the top grades in every class we shared. I recall that Miss Berg knew that there were five variations of twinning: three dizygotic and two monozygotic. I had just turned 17 and knew about the Doublemint Twins. I had some catching up to do.
Duplicates continue to appear in data just as the five types of twins did in Bio 103. I find it amusing to hear and read about software that performs deduplication; that is, the machine process of determining which item is identical to another. The simplest type of deduplication is to take a list of numbers and eliminate any that are identical. You probably encountered this type of task in your first programming class. Life gets a bit more tricky when the values are expressed in different ways; for example, a mixed list with binary, hexadecimal, and real numbers plus a few more interesting versions tossed in for good measure. Deduplication becomes a bit more complicated.
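The mixed-list case can be sketched in a few lines of Python. This is a minimal sketch, not a production routine: the function names `canonical_value` and `deduplicate` are made up for illustration, and it assumes the only notations in play are Python-style `0b`, `0o`, and `0x` prefixes plus plain integers, decimals, and fractions. The idea is to reduce every entry to one exact canonical value before comparing.

```python
from fractions import Fraction


def canonical_value(text):
    """Reduce one textual number to an exact Fraction (hypothetical helper).

    Assumes unsigned 0b/0o/0x prefixes for binary, octal, and hex;
    everything else is handed to Fraction, which accepts integers,
    decimals such as "255.0", and ratios such as "5/2".
    """
    token = text.strip().lower()
    if token.startswith(("0x", "0b", "0o")):
        return Fraction(int(token, 0))
    return Fraction(token)


def deduplicate(values):
    """Keep the first spelling of each distinct numeric value, in order."""
    seen = set()
    unique = []
    for raw in values:
        key = canonical_value(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique


mixed = ["255", "0xFF", "0b11111111", "255.0", "2.5", "5/2", "7"]
print(deduplicate(mixed))
```

Using an exact `Fraction` rather than a float avoids treating two values as different (or the same) because of rounding. Each new notation in the list means another parsing rule, which is where the "bit more complicated" comes from.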
At the other end of the scale, consider the challenge of examining two collections of electronic mail seized from a person of interest’s computers. There is the email from her laptop. And there is the email that resides on her desktop computer. Your job is to determine which emails are identical, prepare a single deduplicated list of those emails, generate a file of emails and attachments, and place the merged and deduplicated list on a system that will be used for eDiscovery.
Here are some of the challenges that you will face once you answer this question, “What’s a duplicate?” You have two allegedly identical emails and their attachments. One email is dated January 2, 2008; the other is dated January 3, 2008. You examine each email and find that the only difference between the two is a single slide included in one of the two PowerPoint decks. What do you conclude?
- The two emails are not identical, so include both emails and both attachments.
- The earlier email is the accurate one, so exclude the later email.
- The later email is the accurate one, so exclude the earlier email.
Now consider that you have 10 million emails to process. We have to go back to our definition of a duplicate and apply the rules for that duplicate to the collection of emails. If we get this wrong, there could be legal consequences. A system developer who generates a file of emails by letting a mathematical process decide that a record is different may be using a method too crude for the problem in the context of eDiscovery. Math helps, but it is not likely to handle the onerous task of determining near matches or the reasoning required to determine which email is “the” email.
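To make the gap between "identical" and "nearly identical" concrete, here is a minimal Python sketch. It is an illustration only, not how any eDiscovery product works: the function names and the 0.9 threshold are assumptions, and real systems would also have to compare headers and attachments. A hash settles exact matches; a similarity ratio only flags near matches, which still need a person to decide which email is "the" email.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(body):
    """Hash the body after collapsing whitespace and case (assumed rules)."""
    normalized = " ".join(body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_pair(body_a, body_b, near_threshold=0.9):
    """Return 'exact', 'near', or 'distinct' for two email bodies.

    near_threshold is an arbitrary illustrative cutoff, not a legal standard.
    """
    if fingerprint(body_a) == fingerprint(body_b):
        return "exact"
    ratio = SequenceMatcher(None, body_a, body_b).ratio()
    return "near" if ratio >= near_threshold else "distinct"


jan_2 = "Please review the attached deck before Friday's meeting."
jan_3 = "Please review the attached deck before Monday's meeting."
print(classify_pair(jan_2, jan_3))  # flagged for human review, not auto-merged
```

The "near" bucket is the expensive one. The math narrows 10 million emails down to the pairs worth looking at, but it does not say which of the pair to keep.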
Which is Jill? Which is Jane? Parents keep both. Does data work like this? Source: http://celebritybabies.typepad.com/photos/uncategorized/2008/04/02/natalie_grant_twins.jpg
Here’s another situation. You are merging two files of credit card transactions. You have data from an IBM DB2 system and you have data from an Oracle system. The company wants to transform these data, deduplicate them, normalize them, and merge them to produce one master “clean” data table. No, you can’t Google for an offshore service bureau; you have to perform this task yourself. In my experience, the job is going to be tricky. Let me give you one example. You identify two records which agree in field name and data for a single row in Table A and Table B. But you notice that the telephone number varies by a single digit. Which is the correct telephone number? You do a quick spot check and find that half of the entries from Table B have this variant, or you can flip the analysis around and say that half of the entries in Table A vary from Table B. How do you determine which records are duplicates?
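The telephone number example can be sketched as a three-way comparison. This is a minimal Python sketch under stated assumptions: the field names (`card`, `amount`, `phone`) and the function names are hypothetical, and rows are assumed to have already been exported from both systems into dictionaries with the same keys. The code can find the one-digit variants; it cannot say which table holds the correct number.

```python
def digits_only(phone):
    """Strip punctuation so '(502) 555-0147' and '502-555-0147' compare equal."""
    return "".join(ch for ch in phone if ch.isdigit())


def differs_by_one_digit(phone_a, phone_b):
    """True when two same-length numbers disagree in exactly one position."""
    a, b = digits_only(phone_a), digits_only(phone_b)
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def compare_records(row_a, row_b, phone_field="phone"):
    """Return 'duplicate', 'review', or 'different' for two rows.

    'review' marks rows that agree everywhere except for a one-digit
    phone variance, which is the case that needs a human or a business rule.
    """
    if row_a.keys() != row_b.keys():
        return "different"
    if any(row_a[k] != row_b[k] for k in row_a if k != phone_field):
        return "different"
    if digits_only(row_a[phone_field]) == digits_only(row_b[phone_field]):
        return "duplicate"
    if differs_by_one_digit(row_a[phone_field], row_b[phone_field]):
        return "review"
    return "different"


table_a_row = {"card": "4111", "amount": "19.99", "phone": "502-555-0147"}
table_b_row = {"card": "4111", "amount": "19.99", "phone": "502-555-0148"}
print(compare_records(table_a_row, table_b_row))
```

If half the rows land in the "review" bucket, as in the spot check above, the pattern points to a systematic difference between the two source systems rather than random typing errors, and that is a question for the people who own the data.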
Moore’s Law: Not Enough for Google
December 29, 2008
I made good progress on my Google and Publishing report for Infonortics over the last three days. I sat down this morning and riffed through my Google technical document collection to find a number. The number is interesting because it appears in a Google patent document and provides a rough estimate of the links that Google would have to process when it runs its loopy text generation system. Here’s the number as it is expressed in the Google patent document:
50 million million billion links
Google’s engineers included an exclamation point in US7231393. The number is pretty big even by Googley standards. And who cares? Few pay much attention to Google’s PhD-like technical documents. Google is a search company that sells advertising, and until the forthcoming book about Google’s other business interests comes out, I don’t think many people realize that Moore’s law is not going to help Google when it processes lots of links–50 million million billion give or take a few million million.
When I scanned “Sustaining Moore’s Law – 10 Years of the CPU” by Vincent Chang here, I realized that Google has little choice but to use fast CPUs and math together. In fact, the faster and more capable the CPU, the more math Google can use. Name another company worrying about Kolmogorov’s procedures?
Take a look at Mr. Chang’s article. The graph shows that the number of transistors appears to keep doubling. The problem is that information keeps growing and the amount of analysis Google wants to do with various probabilistic methods is rising even faster.
The idea that building more data centers allows Google to do more is only half the story. The other half is math. Competitors who focus on building data centers, therefore, may be addressing only part of the job when trying to catch up with Google. Leapfrogging Google seems difficult if my understanding of the issue is correct.
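A back-of-the-envelope calculation shows the scale. The only figure taken from the patent document is the link count; the processing rate of one billion links per second is a made-up assumption chosen for round numbers, as is the one-year target. Read as short-scale numbers, 50 million million billion is 50 × 10^6 × 10^6 × 10^9, or 5 × 10^22.

```python
import math

# From the patent document: 50 million million billion links.
LINKS = 50 * 10**6 * 10**6 * 10**9  # 5 x 10**22

# Hypothetical throughput, chosen only to make the arithmetic easy to follow.
ASSUMED_LINKS_PER_SECOND = 10**9
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_process(links, links_per_second):
    """Years needed to touch every link once at a fixed rate."""
    return links / links_per_second / SECONDS_PER_YEAR


def doublings_needed(links, links_per_second, target_years=1.0):
    """How many times throughput must double to finish within target_years."""
    years = years_to_process(links, links_per_second)
    return max(0, math.ceil(math.log2(years / target_years)))


print(f"{years_to_process(LINKS, ASSUMED_LINKS_PER_SECOND):,.0f} years")
print(doublings_needed(LINKS, ASSUMED_LINKS_PER_SECOND), "doublings")
```

Under these assumptions, waiting for hardware to double its way to the answer takes far too long. That is the case for cleverer math on top of faster CPUs.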
Enterprise Search Is Not Web Search — A Revelation
December 28, 2008
I saw a link to a survey about enterprise search. I clicked around and found a snippet about the survey in a UK Web site called ITPro.co.uk. Here’s the passage that caught my attention:
In small and medium enterprises in the UK, a YouGov survey this summer found 80 per cent of managers and directors waste up to an hour a day looking for documents. And according the same HP survey, 60 per cent of IT managers spend “a high proportion of their time” dealing with employee requests for help finding basic information on the network.
You can read the full write up here, but the survey, which I think is more significant than the discussion of the differences between enterprise and Intranet search, helps pin down the costs of manual information retrieval and the alleged savings from a successful search deployment. But I did read the full write up.
The article “Why Enterprise Search Is Not Intranet Search” by Mary Branscombe tilled ground that I thought had long ago been tamed. Guess not. Ms. Branscombe runs through some of the familiar vendors, but she spends most of her time on the Recommind system. Recommind is a system that strikes me as similar to Autonomy’s. Recommind cut its teeth in eDiscovery. The company has branched out into enterprise search, and the company has been gaining some traction in the US.
The point in the article that surprised me was the negative spin given to the Google Search Appliance. I am not confident that either the sources Ms. Branscombe consulted or Ms. Branscombe herself has a strong sense of what one can do with the OneBox API. Traditional vendors have not been able to match Google’s 25,000 plus Google Search Appliance licenses. Google is shifting from an appliance only approach and broadening its offering to include some variants.
Check out the article. Let me know if you agree with my assessment.
Stephen Arnold, December 28, 2008
Federate Net Weaver and SharePoint
December 28, 2008
The new year approaches, and you have SAP Net Weaver and Microsoft SharePoint. You want to spend a few minutes making it possible to run one query and retrieve results from each system. Trivial? You bet. In case some of the steps are a tad uncertain, you will want to peruse the SAP white paper here. The title of this useful document is “Federated Search between SAP Net Weaver Enterprise Search and Microsoft Search Server 2008 Using Open Search and SSO.” The authors are SAP wizards Andre Fischer, Pedro Arrontes, and Holger Brucheit. The 15 page document is SAP centric, and the key is to use SAP’s Open Search interface. The paper assumes you know how this middleware and its method work. If you are fuzzy on Open Search particulars, the white paper provides links to other documents in the SAP technical library. If you want to jump right in, fire up Net Weaver and use the built in templates to specify where the data are and their format. The white paper assumes that you will be using SAP’s security and access control system, which might be incorrect if SAP plays a secondary role in your organization. The information for configuring SharePoint walks through the specific graphical interface settings to use and, thankfully, includes the scripts needed to make SharePoint play nice with Net Weaver. If you work through the white paper and your federating doesn’t federate, SAP has included some troubleshooting tips. Enjoy.
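For readers fuzzy on the particulars, the heart of Open Search is a URL template with placeholders such as `{searchTerms}`, where a trailing `?` marks a parameter as optional. The Python sketch below shows how one query could be expanded into a request URL per system. It is an illustration under loud assumptions: the two endpoint URLs are invented placeholders on example.com, not the addresses or scripts from the SAP white paper, and the sketch leaves out SSO and the fetching and merging of results.

```python
import re
from urllib.parse import quote_plus

# Matches OpenSearch-style placeholders such as {searchTerms} or {startPage?}.
_PARAM = re.compile(r"\{([A-Za-z:]+)(\??)\}")

# Hypothetical endpoints for illustration only.
ENDPOINTS = {
    "netweaver": "https://netweaver.example.com/search?q={searchTerms}&page={startPage?}",
    "sharepoint": "https://sharepoint.example.com/results.aspx?k={searchTerms}",
}


def fill_template(template, **params):
    """Substitute URL-encoded values; drop optional placeholders left unset."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            return quote_plus(str(params[name]))
        if optional:
            return ""
        raise KeyError(f"missing required parameter: {name}")
    return _PARAM.sub(substitute, template)


def federated_urls(query):
    """Build one request URL per system for the same query."""
    return {
        name: fill_template(template, searchTerms=query)
        for name, template in ENDPOINTS.items()
    }


for system, url in federated_urls("open source").items():
    print(system, url)
```

Building the URLs is the trivial part. Security trimming and reconciling two result formats are where the minutes turn into days.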
Stephen Arnold, December 28, 2008
Getting Doored by Search
December 28, 2008
Have you been in Manhattan and watched a bike messenger surprised by a car door opening? The bike messenger loses these battles, which typically destroy the front wheel of the bike. When this occurs, the messenger has been doored. You can experience a similar surprise with enterprise search.
What happens when you get doored. Source: http://citynoise.org/author/ken_rosatio
The first situation is one that will be increasingly common in 2009. As the economy tanks, litigation is likely to increase. This means that you will need to provide information as part of the legal discovery process. You will get doored if you try to use your existing search system for this function. No go. You will need specialized systems and you will have to be able to provide assurance that spoliation will not occur. “Spoliation” refers to altering or destroying evidence; for example, changing an email. Autonomy offers a solution, to cite one example.
The second situation occurs when you implement one of the social systems; for example, a Web log or a wiki. You will find that most enterprise search systems may lack filters to handle the content in blogs. Some vendors–for example, Blossom Search–can index Web log content. Exalead has a connector to index information within Blogger.com and other systems. However, your search system may lack the connector. You will be doored because you will have to code or buy a connector. Ouch.
The third situation arises when you need to make email searchable from a mobile device. To pull this off, you need to find a way to preserve security, prevent a user from deleting mail from her desktop or the mail server, and deliver results without latency. When you try this trick with most enterprise search systems, you will be doored. The fix is to tap a vendor like Coveo and use that company’s email search system.
There’s a small consulting outfit prancing around like a holiday elf saying, “Search is simple. Search is easy. Search is transparent.” Like elves, this assertion is a weird mix of silliness, fairy dust, and ignorance. If this outfit helps you deal with a “simple” search, prepare to get doored. It may not be the search system; it may be your colleagues.
Stephen Arnold, December 28, 2008