Scaling SharePoint Could Be Easy

September 24, 2009

Back in the wonderful city of Washington, DC, I participated in a news briefing at the National Press Club today (September 23, 2009). The video summary of the presentations will be online next week. During the post-briefing discussion, the topic of scaling SharePoint came up. The person with whom I was speaking sent me a link when she returned to her office. I read “Plan for Software Boundaries (Office SharePoint Server)” and realized that this Microsoft Certified Professional was jumping through hoops created by careless system design. I don’t think the Google enterprise applications are perfect, but Google has eliminated the egregious engineering calisthenics that Microsoft SharePoint delivers as part of the standard software.

I can deal with procedures. What made me uncomfortable right off the bat was this segment in the TechNet document:

    • In most circumstances, to enhance the performance of Office SharePoint Server 2007, we discourage the use of content databases larger than 100 GB. If your design requires a database larger than 100 GB, follow the guidance below:
      • Use a single site collection for the data.
      • Use a differential backup solution, such as SQL Server 2005 or Microsoft System Center Data Protection Manager, rather than the built-in backup and recovery tools.
      • Test the server running SQL Server 2005 and the I/O subsystem before moving to a solution that depends on a 100 GB content database.
    • Whenever possible, we strongly advise that you split content from a site collection that is approaching 100 GB into a new site collection in a separate content database to avoid performance or manageability issues.

Why did I react strongly to these dot points? Easy. Most of the datasets with which we wrestle are big, orders of magnitude larger than 100 GB. Heck, this cheap netbook I am using to write this essay has a 120 GB solid state drive. My test corpus on my desktop computer weighs in at 500 GB. Creating 100 GB subsets is not hard, but in today’s petascale data environment, these chunks seem to reflect what I would call architectural limitations.

As I worked my way through the write up, I found numerous references to hard limits. One example was this statement from a table:

Office SharePoint Server 2007 supports 50 million documents per index server. This could be divided up into multiple content indexes based on the number of SSPs associated with an index server.

I like the “could be.” That type of guidance is useful, but my question is, “Why not address the problem instead of giving me the old ‘could be’?” We have found limits in the Google Search Appliance, but the fix is pretty easy and does not require any “could be” engineering. Just license another GSA and the system has been scaled. No caveats.
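To see why these ceilings feel architectural rather than advisory, a quick back-of-the-envelope calculation is enough. Here is a minimal sketch in Python. The 100 GB content database guidance and the 50 million documents per index server figure come from the TechNet document quoted above; the corpus sizes and document counts are hypothetical examples of my own, not measurements.

    import math

    # Ceilings quoted in the TechNet guidance.
    CONTENT_DB_CEILING_GB = 100          # recommended content database size
    DOCS_PER_INDEX_SERVER = 50_000_000   # stated documents per index server

    def partitions_needed(corpus_gb, doc_count):
        """Return (content databases, index servers) implied by the stated limits."""
        content_dbs = math.ceil(corpus_gb / CONTENT_DB_CEILING_GB)
        index_servers = math.ceil(doc_count / DOCS_PER_INDEX_SERVER)
        return content_dbs, index_servers

    # A 500 GB test corpus of, say, 10 million documents (the document count is a guess).
    print(partitions_needed(500, 10_000_000))            # (5, 1)

    # A petabyte-scale collection of 2 billion documents (hypothetical figures).
    print(partitions_needed(1_000_000, 2_000_000_000))   # (10000, 40)

Ten thousand content databases to carve out and manage is the kind of number that turns “guidance” into a full-time job.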

I hope that the Fast ESP enterprise search system tackles engineering issues, not interface (what Microsoft calls user experience). In order to provide information access, the system has to be able to process the data the organization needs to index. Asking my team to work around what seem to be low ceilings is extra work for us. The search system needs to make it easy to deliver what the users require. This document makes clear that the burden of making SharePoint search work falls on me and my team. Wrong. I want the system to lighten my load, not increase it with “could be” solutions.

Stephen Arnold, September 24, 2009

Data Transformation and the Problem of Fixes

September 24, 2009

I read “Fix Data before Warehousing It” by Marty Moseley and came away with the sense that some important information was omitted from the article. The essay was well written. My view is that the write up should have anchored its analysis in the bedrock of cost.

Data shoved into a data warehouse are supposed to reduce costs. Stuffing inconsistent data into a warehouse does the opposite. My research as well as information I have heard suggests that data transformation (which includes normalization and the other “fixing tasks”) can consume up to one third of an information technology budget. Compliance is important. Access is important. But the cost of fixing data can be too high for many organizations. As a result, the data in the data warehouse are not clean. I prefer the word “broken” because that word makes explicit one point—the outputs from a data warehouse with broken data may be misleading or incorrect.
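To make the “fixing tasks” concrete, here is a minimal sketch of the sort of transformation step I have in mind: normalizing inconsistent date and name fields before rows reach the warehouse. The field names, formats, and sample rows are hypothetical, and a production pipeline handles far more cases than this.

    from datetime import datetime

    # Hypothetical inconsistent source rows; real feeds are messier.
    rows = [
        {"customer": " ACME Corp. ", "signed": "09/24/2009"},
        {"customer": "Acme Corporation", "signed": "2009-09-24"},
    ]

    DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")

    def normalize_date(value):
        """Coerce the supported date formats to ISO 8601 or raise."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        raise ValueError("Unrecognized date: %r" % value)

    def normalize_row(row):
        return {
            "customer": " ".join(row["customer"].split()).upper(),
            "signed": normalize_date(row["signed"]),
        }

    clean = [normalize_row(r) for r in rows]
    # Both rows now carry signed = "2009-09-24", but the customer names still
    # differ ("ACME CORP." vs "ACME CORPORATION"). That residue is exactly what
    # manual remediation, or smarter automated matching, has to resolve.

The point of the sketch is the cost argument: every format variant, synonym, and duplicate is another rule someone has to write, test, and maintain.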

The ComputerWorld article is prescriptive, but it does not come right out and nail the cost issue or the lousy outputs issue. I think that these two handmaidens of broken data deserve center stage. Until the specific consequences of broken data are identified and made clear to management, prescriptions won’t resolve what is a large and growing problem. In my world, the failure of traditional warehousing systems to enforce or provide transformation and normalization tools makes it easier for a disruptive data management system to overthrow the current data warehousing world order. Traditional databases and data warehousing systems allow broken data and, even worse, permit outputs from these broken data. Poor data management practices cannot be corrected by manual methods because of the brutal costs such remediation actions incur. My opinion is that data warehousing is reaching a critical point in its history.

Automated methods combined with smart software are part of the solution. The next generation data management systems can provide cost cutting features so that today’s market leaders quickly become tomorrow’s market followers. Just my opinion.

Stephen Arnold, September 24, 2009

Guha’s Most Recent Patent: Enhanced Third Party Control

September 24, 2009

I am a big fan of Ramanathan Guha’s engineering. From his work on the Programmable Search Engine in 2007 to this most recent invention, he adds some zip to Google’s impressive arsenal of smart methods. You may want to take a look at US 7,593,939, filed in March 2007, a few weeks after his five PSE inventions went to the ever efficient USPTO. This invention, “Generating Specialized Search Results in Response to Patterned Queries,” is described this way:

Third party content providers can specify parameters for generating specialized search results in response to queries matching specific patterns. In this way, a generic search website can be enhanced to provide specialized search results to subscribed users. In one embodiment, these specialized results appear on a given user’s result pages only when the user has subscribed to the enhancements from that particular content provider, so that users can tailor their search experience and see results that are more likely to be of interest to them. In other embodiments the specialized results are available to all users.

What I find interesting is that this particular method nudges the ball forward for third party content providers so certain users can obtain information enhancements. The system makes use of Google’s “trust server,” answers questions, and generates a new type of top result for a query. The invention provides additional color for Dr. Guha’s semantic systems and methods, which nest comfortably within the broader dataspace inventions discussed at length in Google: The Digital Gutenberg. For a more detailed explanation of the invention, you can download the open source document from the USPTO or another US patent provider. When will Google make a “Go Guha” T shirt available? Oh, for those of you new to my less-than-clear explanation of Google’s technology, you can find the context for this third party aspect of Google’s PSE and publishing / repurposing semantic system in my Google Version 2.0; just click on Arnold’s Google studies. This invention makes explicit the type of outputs a user may receive from the exemplary system referenced in this open source document. This invention is more substantive than the “eye candy” user experience as defined by Microsoft and light years ahead of the Yahoo “interface” refresh I saw this morning. The Google pushes ahead in search technology as others chase.
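To picture the mechanism the abstract describes, consider the toy sketch below. To be clear, the provider names, query patterns, and subscription model are my hypothetical stand-ins for illustration; this is not Google’s implementation, just the general shape of pattern-matched, subscription-gated specialized results.

    import re

    # Hypothetical third party registrations: a query pattern plus a result
    # generator. In the patent's terms, providers specify parameters for
    # producing specialized results when a query matches their pattern.
    PROVIDERS = {
        "flight-info": {
            "pattern": re.compile(r"^(?P<carrier>[A-Za-z]+)\s+flight\s+(?P<number>\d+)$"),
            "render": lambda m: "Status card for %s flight %s" % (m["carrier"].upper(), m["number"]),
        },
    }

    # Hypothetical per-user subscriptions controlling whose enhancements appear.
    SUBSCRIPTIONS = {"user-42": {"flight-info"}}

    def specialized_results(user_id, query):
        """Return specialized results only for providers the user subscribes to."""
        results = []
        for name, provider in PROVIDERS.items():
            if name not in SUBSCRIPTIONS.get(user_id, set()):
                continue  # unsubscribed users see only the generic results
            match = provider["pattern"].match(query.strip())
            if match:
                results.append(provider["render"](match.groupdict()))
        return results

    print(specialized_results("user-42", "united flight 123"))    # specialized result
    print(specialized_results("anonymous", "united flight 123"))  # empty list

The interesting part is not the pattern matching; it is the wiring that lets a third party, not the search vendor, decide what a matching query returns and for whom.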

Stephen Arnold, September 23, 2009

Coveo and Email Search

September 24, 2009

My two or three readers felt compelled to send me links to a number of Web write ups about Coveo’s email search system. I have tested the system and found it quite good, in fact, excellent. For forensic search of a single machine, I have been testing a “pocket search” product from Gaviri, and I find that quite effective as well. If you are not familiar with the challenges email search presents, you may want to take a look at one of the Coveo-centric news stories, which does quite a good job of explaining the challenge and the Coveo solution. The article is “Coveo Brings Enterprise Search Expertise to Email” by Chelsi Nakano. For me the key passage was:

There’s at least one happy customer to speak of: “Other solution providers require you to spend tens if not hundreds of thousands in service fees to customize the enterprise search solution and make enterprise search work for your employees,” said Trent Parkhill, VP, Director IT of Haley and Aldrich. “With Coveo […] enterprise search now meshes seamlessly with classification and email archiving to give us a full email management solution.”

Happy customers are more important to me than megabytes of marketing PDFs and reports from azure chip consultants who try too, too hard to explain a useful, functional system. More info is available directly from Coveo.

Google Waves Build

September 24, 2009

I am a supporter of Wave. I wrote a column about Google WAC-ing the enterprise. W means Wave; A is Android, and C represents Chrome. I know that Google’s consumer focus is the pointy end of the Google WAC thrust, but more information about Wave is now splashing around my webbed feet here in rural Kentucky. You can take a look at some interesting screenshots plus commentary in “Google Wave Developer Preview: Screenshots.” Perhaps you will assert, “Hey, addled goose, this is not search.” I reply, “Oh, yes, it is.” The notion of eye candy is like lipstick on a pig. Wave is a new animal that will carry you part of the way into dataspace.

Stephen Arnold, September 24, 2009

Mobile News Aggregation

September 23, 2009

I wrote an essay about the impending implosion of CNN. The problem with traditional media boils down to cost control. Technology alone won’t keep these waterlogged outfits afloat. With demographics working against those 45 years of age and above, the shift from desktop computers to portable devices creates opportunities for some and the specter of greater marginalization for others. I saw a glimpse of the future when I looked at Broadersheet’s iPhone application. You can read about the service in “Broadersheet Launching ‘Intelligent News Aggregator’ iPhone App”. The app combines real time content with more “traditional” RSS content. The operative words for me are “intelligent” and “iPhone”. More information is available on the Broadersheet Web site. Software that learns and delivers information germane to my interests on a mobile device is not completely new, of course. The Broadersheet approach adds “time” options and a function that lets me add comments to stories. This is not convergence; the application makes clear the more genetic approach of blending DNA from related software functions.

Stephen Arnold, September 23, 2009

Google News Yaggs

September 23, 2009

Short honk: Just passing along the allegation that Google News went down for one hour on Tuesday, September 22, 2009. The story “Google News Back Up after Outage” asserted that Google News went offline. The interest in cloud and blended cloud and on-premises computing continues to creep upwards. If the allegation is true, the problems at Google News are yet another Google glitch, and that old time failover failed.

Stephen Arnold, September 23, 2009

Microsoft Live: $560 Million Loss in 12 Months or $64,000 an Hour

September 23, 2009

TechFlash published an interesting article called “Windows Live Lost $560 Million in FY2009”. With revenues of $520 million, the loss chewed through roughly $64,000 an hour or about $1,065 a minute, 24×7 for 365 days. With Microsoft’s revenue in the $58 billion range, a $560 million loss is not such a big deal. In my opinion, profligate spending might work in the short term, but I wonder if the tactic will work over a longer haul on the information highway.
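The arithmetic is simple enough to check on a napkin; a few lines make the burn rate explicit:

    # Spread a $560 million annual loss over a 365 day, 24x7 year.
    annual_loss = 560_000_000
    hours = 365 * 24          # 8,760 hours
    minutes = hours * 60      # 525,600 minutes

    print(round(annual_loss / hours))    # about 63,927 dollars an hour
    print(round(annual_loss / minutes))  # about 1,065 dollars a minute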

Stephen Arnold, September 23, 2009

Twitter Trends: A Glimpse of the Future of Content Monitoring

September 23, 2009

A happy quack to the reader who sent me information about “Trendsmap Maps Twitter Trends in Real-Time.” The Cnet write up points out that this Web site maps “trending Twitter topics by geographical location by combining data from Twitter’s API and What The Trend.” Very interesting publicly accessible service. Similar types of monitoring systems are in use in certain government organizations. The importance of this implementation is that the blend of disparate systems provides new ways to look at people, topics, and relationships. With this system another point becomes clear. If you want to drop off the grid, move to a small town where data flows are modest. Little data won’t show up, so more traditional monitoring methods have to be used. On the other hand, for those in big metro areas, interesting observations may be made. Fascinating. The site has some interface issues, but a few minutes of fiddling will make most of the strengths and weaknesses clear. The free service is at http://www.trendsmap.com/.
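The underlying idea, blending a message stream with location data to surface local trends, is straightforward. Here is a minimal, hypothetical sketch; it does not touch the actual Twitter or What The Trend APIs, just a hard coded list standing in for geotagged messages.

    from collections import Counter, defaultdict

    # Hypothetical geotagged messages; a real system would pull these from a
    # streaming API and a geocoder, not a hard coded list.
    messages = [
        {"city": "Washington, DC", "text": "press club briefing on search"},
        {"city": "Washington, DC", "text": "briefing wrapped up, sharp questions"},
        {"city": "Louisville, KY", "text": "quiet day, not much data flowing"},
    ]

    STOPWORDS = {"on", "up", "a", "not", "much", "the"}

    def trends_by_city(msgs, top_n=3):
        """Count term frequency per city and return the top terms for each."""
        counts = defaultdict(Counter)
        for msg in msgs:
            terms = (t.strip(",.").lower() for t in msg["text"].split())
            counts[msg["city"]].update(t for t in terms if t and t not in STOPWORDS)
        return {city: counter.most_common(top_n) for city, counter in counts.items()}

    print(trends_by_city(messages))
    # Dense metro streams yield interesting counts; the small town yields little,
    # which is the drop-off-the-grid point made above.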

Stephen Arnold, September 22, 2009

Who Are the Five Percent Who Will Pay for News?

September 23, 2009

In the good old days of Dialog Information Services, LexisNexis before the New York Times broke its exclusive deal, and SDC, user behavior was known. Let me give you an example. The successful commercial databases on the Dialog System derived the bulk of their revenue from the Federal government, well-heeled consulting and knowledge-centric service firms, and the Fortune 1000. How much money came from these big spenders? I don’t have the Dialog reports for ABI / INFORM, Business Dateline, and Ziff Communications’ commercial databases any longer. I do recall my fiddling with the green bar reports and pokey 1-2-3 and Excel worksheets, trying to figure out who paid what and what the revenue splits were. In the course of these fun filled excursions into cracking the code on usage reports, I remember my surprise when I realized that online revenue was like an elephant standing on its trunk. I used this metaphor in a couple of journal articles because it captured how a thin column of companies supported the bulk of the revenue we received from the dozen vendors distributing our products.

The way the revenue share worked in the good old days was that a database producer received a percentage of the revenue generated by a particular database (called a file). If you were a good little database producer, a vendor like Dialog or LexisNexis would give you a percentage above 10 percent. If you were a top revenue producer, the vendors would reluctantly up the percentage paid. One of our databases made it possible to squeeze 50 percent of the total revenue generated from the file from Dialog and LexisNexis. We were in high cotton, but my worrying over the green bar reports revealed a big surprise: About 90 percent of our revenue came from about 10 percent of the total number of file users.

This means that our top producing database in 1981 would attract about 10,000 users in a 31 day period. The money we were paid came from about 1,000 of these users. The other 9,000 users spent pennies, maybe a dollar, to access our hand crafted, high quality data. The particular file I am referencing was, in the 1980 to 1986 period, viewed as the premier business information database in the world. No joke. Our controlled vocabulary was used to index documents at the Royal Bank of Canada and used as an instructional guide in library schools in the US and in Europe. (Keep that in mind, taxonomy newcomers.)

I spoke about this “elephant standing on its trunk” insight with my colleagues, Loene Trubkin (one of the founders of the old Information Industry Association) and Dennis Auld (one of the original whiz kids behind the ABI / INFORM database). I recall our talking about this distribution. We knew that the “normal” distribution should have been an 80 – 20 split, following everyone’s favorite math guy Pareto. The online sector distorted “normal”. Over many discussions, at the Courier Journal and later at Ziff Communications, the 90 – 10 distribution kept popping up.

I don’t want to dig into the analyses my teams and I ran to verify this number, but I have noticed, since I became a consultant, a slow, steady shift in the 90 – 10 distribution. Recent data sets we have analyzed reveal that revenue now comes from a 95 – 5 distribution. What this means is that if one looks at who spends online, the top five percent of the user base accounts for 95 percent of the revenue.
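To make the 95 – 5 claim concrete, here is a tiny sketch that computes the share of total revenue contributed by the top slice of accounts. The spend figures are hypothetical examples, not client data; only the method matters.

    # Hypothetical per-account annual spend: a few heavy hitters, a middle tier,
    # and a long tail of occasional searchers.
    spend = [250_000, 180_000, 90_000] + [400] * 47 + [25] * 950

    def top_share(amounts, top_fraction=0.05):
        """Fraction of total revenue contributed by the top X percent of accounts."""
        ranked = sorted(amounts, reverse=True)
        cutoff = max(1, round(len(ranked) * top_fraction))
        return sum(ranked[:cutoff]) / sum(ranked)

    print("%.0f%% of revenue from the top 5%% of accounts" % (100 * top_share(spend)))
    # With these made-up numbers, the top 5 percent of 1,000 accounts delivers
    # roughly 96 percent of the revenue: the elephant standing on its trunk.

Run the same calculation against the green bar reports of 1981 and you get the 90 – 10 shape; run it against the data sets we see today and the curve has shifted toward 95 – 5.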

The shift has some interesting implications for those who want to make money online.

First, the loss of a single customer in the top five percent category will have a significant impact on the top line revenues of the online company. The reason is obvious. When a government agency shifts from for fee information to Internet accessible “free” information, the revenue to the online company takes a hit. To get that revenue back, the online vendor has to acquire sufficient new accounts to make up the shortfall. With budgets tight, online vendors have to raise their prices. The consequence of this nifty trick is to discourage spenders in the lower 95 percent of the customer base from consuming more for fee online information. The big companies like the Fortune 1000 firms and the top 25 law firms can keep on spending. The result is the “softness” in the top line figures reported by the publicly traded online vendors. These outfits have either crashed and burned financially, as Dialog did under the Thomson Corporation’s management, or fuzzed and merged their online segments into other revenue line items. Pumping up or hiding the revenue is a time honored method of disguising the vulnerability of the elephant to falling on its rear end.

Second, customers who find the commercial online services too expensive become hunters of free or low cost information, and their appetite is growing. The commercial online vendors have been mostly unable to cope with the surge of information available from addled geese like me who write a “free” Web log that sometimes contains useful information such as the list of European search system vendors, Web sites pumping out content via RSS, and individuals who generate Twitter messages that can be surprisingly useful to marketers, law enforcement, and analysts like my team at ArnoldIT.com. The efforts of the commercial online vendors are oddly out of sync with what former customers used to do. One example: small law firms go to law libraries and get a law student to look up info, use Dot Gov Web sites for information, or click to Google’s Uncle Sam service. For certain types of legal research, these services are quite useful, and what paralyzes the commercial online services is the fact that these services are improving. The commercial online vendors have created a market for lower cost or free online information and boxed themselves into a business model that ensures a long, stately decline into a sea of red ink.

Third, entrepreneurs looking for a way to put food on the table look at the current state of the commercial database industry and see opportunity. One example is the emergence of real time information services. I document the services I find interesting in my monthly column for Bizmedia’s Information World Review, published in the UK. The commercial database vendors know about these services, and some of the tech savvy people at these companies have the expertise to offer somewhat similar services. The new services from commercial database vendors don’t get the traction that the churning ecosystem of the Internet generates. The pace of innovation is too fast for the commercial vendors. After a couple of tries at creating a more hip Web service, the commercial database vendors stand like a deer in my headlights on Wolf Pen Branch Road. Most of the deer get hit by my neighbors. (I brake for animals. My neighbors just roll forward en route to the new Cuban restaurant or the car wash which still uses humans, not machines to bathe Mercedes and Porsches.)

In this context, the fact that Rupert Murdoch’s pay-for-news plan appeals to five percent of those in a survey sample is no surprise to me. I would wager that Mr. Murdoch is not just surprised. He is probably setting loose survey companies to conduct “more accurate” studies, a practice much in favor at IBM, Microsoft, and Oracle. The irony of this five percent is heightened for me because I read about the five-percent study in the UK newspaper The Guardian. You should read the story “Murdoch’s Digital News Cartel Will Not Persuade People to Pay for Content.”

Let me wrap up with three observations:

  1. Some people will pay for Murdoch content. The problem is that these folks may not be willing to pay enough to keep the enterprise solvent. Other revenue streams will be needed which will lead to the fuzzing and merging of financial information to make it tough to figure out how big a loser this effort really is. A precursor is the handling of the Factiva unit by Dow Jones.
  2. The 95 percent who won’t pay become a ready made customer base for an entrepreneur who can implement a different business model. I nominate Google to be a player in this ready made market. Others will find a foothold as well. I know one thing. Traditional newspaper companies will have a tough time when these upstarts get their systems in gear.
  3. The problem is not limited to Mr. Murdoch’s organization. The disruption of the skewed curve is, based on my team’s research, operating across a number of business sectors where analog services are being replaced by digital services.

In short, the 95 – 5 curve is the new Pareto curve. Get used to it.

Stephen Arnold, September 22, 2009
