Hadoop Caught in Loops

July 4, 2009

Dana Blankenhorn’s “Who Will Control Hadoop?” here raised an important question. The focus was close, but I considered his question in a broader context. Mr. Blankenhorn asked:

Do too many Hadoops spoil the code?

In a narrow sense, my view is: let many flowers bloom. When the world was less fluid, flaky, and financially challenged, many efforts seemed like a good idea. Now, I am not so sure. Mr. Blankenhorn said:

But some reporters are beginning to ask who is really in charge of Hadoop. Is it Apache or Yahoo? Was Yahoo’s distribution a diss of Facebook, which previously developed its own Hadoop SQL, called Hive? Most projects have a community and a commercial arm. Hadoop’s importance has drawn a number of corporate sponsors to separately deliver their implementations. Microsoft, Yahoo, Google, and Facebook all have their own takes on Hadoop, alongside Apache and Cloudera. All these various Hadoops can be seen as a positive or a negative. As a positive, there is growth and momentum for the framework. As a negative, there are many organizations pulling Hadoop in different directions.

In a broad context, the value of open source software is that many hands work to create a foundation stone: something that is not proprietary, not unstable, and not subject to the whims of a corporate titan. On the other hand, fragmentation of an important technology makes some folks wary of open source.

The way online works is to reward one company with a virtual monopoly. This is a natural consequence of costs and user behavior. The problem is that when one outfit is in control, that organization follows the well worn path of profit and benefit maximization. That can’t be helped either.

In short, I think the same type of financial meltdown that has trashed some individuals’ plans for the future is likely to take place again. Tricky stuff, indeed.

Stephen Arnold, July 4, 2009

Google and Data Object Visualization

June 30, 2009

The USPTO published US7555471 B2 on June 30, 2009. The Beyond Search goslings think this is a reasonably important Google disclosure. The inventors include one super Googler and a clutch of other Google rock star engineers. Andrew Hogue is a Googler to watch. If you find his official Google page opaque, try this link. He and his band of engineers have received a patent for “Data Object Visualization.” Don’t get too excited about the graphics. The system and method applies to a core Google system for cleaning up discrepancies in fact tables. If you are a fan of Dilbert, this is the invention that gives one of Google’s smartest agents the official descriptor “janitor”. How smart is the janitor? Smart enough to make dataspaces closer to reality. The USPTO system is sluggish today, so you can get info from FreePatentsOnline.com or one of the other services that provide access to these public documents. I love that janitor lingo too. Googley humor for big time inventions makes clear that the 11 year old Google still possesses math club whimsy. Those examples for atomic mass and volcano are equally illuminating.

Stephen Arnold, June 30, 2009

SharePoint Virtualization

June 23, 2009

“SharePoint Virtualization Survey Results” offered some insight into how one sample of Microsoft licensees uses virtual servers. The person running the survey and preparing the summary of results is Wictor Wilén. Among the findings that I found interesting were these:

  • About 96 percent of the respondents virtualized their development environment and half virtualized their production environments. (I was surprised at the disparity between the two percentages.)
  • The Web front end was virtualized by most respondents; query service was the second most virtualized operation. Mr. Wilén wrote: “A quite high number of respondents answered that they were virtualizing the database role (73,9%) but only half of them could really recommend it (37,2%). The Excel Services role was something that about half of the participants virtualized (47,8%) and recommended for virtualization (44,2%).”

You can get more survey details from Mr. Wilén’s Web site.

Parsing Oracle Text Input

June 22, 2009

Short honk: A happy quack to the reader who sent me a link to this tip for chopping up a list of telephone numbers separated by asterisks. Michel reveals a couple of tips in this post on the Oracle FAQ site. You can get the info by clicking here. The method will vary depending on your specific source file.
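The linked tip works inside Oracle, and I have not reproduced it here. For readers who just want to see the chopping idea, here is a minimal Python sketch, not Michel's SQL. The function name split_phone_list and the sample numbers are made up for illustration, and I am assuming stray blanks and doubled asterisks should simply be dropped.

```python
from typing import List


def split_phone_list(raw: str, sep: str = "*") -> List[str]:
    """Split a separator-delimited string of phone numbers.

    Hypothetical helper for illustration only. Empty pieces (from doubled
    or trailing separators) and surrounding whitespace are discarded.
    """
    return [part.strip() for part in raw.split(sep) if part.strip()]


if __name__ == "__main__":
    # Sample input is invented; a real source file may need different rules.
    print(split_phone_list("502-555-0101*502-555-0102**502-555-0103 "))
```

As the post says, the method will vary with the source file. If your numbers contain asterisks for some other reason, this naive split will not do.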

Stephen Arnold, June 22, 2009

Data.gov Squeezes Two Search Govs

June 9, 2009

What a battle of governmental initiatives. In one corner is the federating champ, USA.gov with its second Science.gov. In the other corner is the Data.gov contender. A citizen or other interested party can sit back and watch the political slug fest. Unlike a traditional kick boxing match, this one is going to rage for years with each round roughly the length of the Federal fiscal year.

Here’s a run down of the combatants:

  • USA.gov (weighing in at about $22 million per year) with software implants from Microsoft and Vivisimo. The service (originally FirstGov.gov) has gained some Tyson like body mass without generating the type of online traffic one expects of a long reigning champ. USA.gov provides a “portal” to Federal information, and it is a bit like a blend of traditional search, a portal, and link farm. I use it but find the limit on documents accessible annoying. I rely on Google’s Uncle Sam service, but I am an addled goose looking for depth, not a partial results list or undisplayable images.
  • Science.gov, supported by Deep Web Technologies, works in apparent harmony with USA.gov. Science.gov focuses on tech content, which I assumed would also be in the USA.gov index. With funds from different sources, Science.gov is a variant of USA.gov.
  • Data.gov, supported by the White House, is a collection of data, not text. Next week (sometime after June 8), Data.gov gets an infusion of tens of thousands of government data sets. You can read more about the expansion of the service in this PC World story, “U.S. Government Records Go Online in Volume” here.

I am not going to try and sort out individual agency Web sites, the Library of Congress, the Government Printing Office, and other assorted information repositories. I could not figure them out in Year 2000 when I dabbled in the FirstGov.gov planning activities. I sure as heck can’t make sense of them from Harrod’s Creek today.

The question in my mind is, “What citizen user knows where to look for US government information?” My solution, as noted, is Google, and I am curious about Wolfram Alpha’s appetite for these data. I wonder if the USDA Economic Research Service will be available? Lots of mystery and excitement surround this epic battle. I think every agency will win because the “silo method” is alive and well in the Federal government.

Description of Data.gov

May 26, 2009

A happy quack to the reader who sent me a link to ProPublica’s “Gov’s Got Data” here. Data.gov is a data portal created by the US government. ProPublica reported that there were 47 raw data sets on the site plus another 27 software utilities. Click here for a sample data set.

For me, the most important comment was:

There’s not a lot there yet, but the new federal Web site, which the Obama administration had promised to create, is up and running. The site is designed to be a clearinghouse of data from federal agencies.

Data sets are the type of content that Wolfram Alpha and some of Google’s more sophisticated systems ingest to generate value-added outputs.

The challenge is that US government agencies are silos and sharing is often a lengthy administrative process. After all, why share when trimming an agency’s tasks could lead to a reduced headcount? In my experience, government entities want to preserve data, tasks, and services to keep the bean counters from chopping a manager’s staff.

Long slog ahead for Data.gov I think.

Stephen Arnold, May 26, 2009

Data Tables Contain Deleted Data. Yikes. Revelation.

May 21, 2009

First it was spies on Facebook. Then it was the LA Times spoofed via a year old Prop 8 story. Now – news flash – the issue is privacy on social networking sites. Yikes. What a scoop. Sky News in the UK published “Fears over Privacy on Social Networking Sites” here. The intrepid news hounds at Sky News reported:

Researchers from the University of Cambridge say that many social networking sites maintain copies of user photos even after users delete them.

I wonder if the wizards in the groves of academe figured out that quite a bit of other information and data lurk on these sites. In fact, unless the indexes have been rebuilt, my hunch is that my team could find some interesting stuff not searchable but available to those poking around with forensic savvy.

I am waiting for one of these intrepid reporters to define “delete” and “remove”.
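To make the distinction concrete, here is a minimal sketch of one common pattern, the "soft delete." I am not claiming any site named above works this way. The table, column, and function names are all hypothetical, and SQLite stands in for whatever the real systems use.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory table of photos (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE photos ("
        "id INTEGER PRIMARY KEY, owner TEXT, filename TEXT, "
        "deleted INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO photos (owner, filename) VALUES (?, ?)",
        [("goose", "pond.jpg"), ("goose", "creek.jpg")],
    )
    return conn


def delete_photo(conn: sqlite3.Connection, photo_id: int) -> None:
    """'Delete' as the user sees it: flag the row, keep the data."""
    conn.execute("UPDATE photos SET deleted = 1 WHERE id = ?", (photo_id,))


def remove_photo(conn: sqlite3.Connection, photo_id: int) -> None:
    """'Remove': the row itself is taken out of the table."""
    conn.execute("DELETE FROM photos WHERE id = ?", (photo_id,))


def visible_photos(conn: sqlite3.Connection, owner: str) -> list:
    """What the site shows: only rows not flagged as deleted."""
    rows = conn.execute(
        "SELECT filename FROM photos WHERE owner = ? AND deleted = 0 ORDER BY id",
        (owner,),
    )
    return [r[0] for r in rows]


def stored_photos(conn: sqlite3.Connection, owner: str) -> list:
    """What is still sitting in the table, flagged or not."""
    rows = conn.execute(
        "SELECT filename FROM photos WHERE owner = ? ORDER BY id", (owner,)
    )
    return [r[0] for r in rows]
```

After delete_photo, the photo drops out of visible_photos but still turns up in stored_photos. Only remove_photo takes the row out of the table, and even then backups, caches, and copies on other servers are a separate matter.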

Stephen Arnold, May 22, 2009

Early Days for Information Management

May 21, 2009

In the last two weeks, I have been crisscrossing the United States. On last night’s fab flight from Philadelphia to Louisville, I watched the lights and thought about the comments I heard about data management. I have to mask the clients with whom I spoke and fuzzify the language, but I think I can communicate several key points.

Search Is a Symptom, Not the Cause

One idea that hooked me was an observation about search and the turmoil and confusion it creates and leaves behind once a new system is up and running. Search is not the problem. Search is a manifestation of the organization’s broader information management situation. If information management is lousy, then search will be lousy as well. The problem is that fixing information management in an organization under financial pressure is a big job. Furthermore, it involves change, which is often resisted when job losses and shifts in work responsibilities are likely. It’s much easier to slap in a new search system and move on. Unfortunately, search gets another black eye and a vendor can be criticized, sometimes in a scathing manner, because the information management approach was flawed, broken, or nonexistent.

[Image: Fatigue or diabetes?]

No Clue about Volume

Most of the people with whom I spoke sang one verse from one hymnal, “We have no clue about our data. We don’t know how much info we have. We are lost in bits. We are lost in bits. We are clueless.”

Most of the savvy information technology professionals know that the volume of digital information is increasing. The problem is that no one knows exactly how fast, what to do with the emails and documents, or how to keep track of what’s where. The Abbott and Costello routine “Who’s on First?” anticipates the statements about the hassle information volume poses.

One doesn’t need a degree in information science to recognize that if you can’t collect digital information, you don’t have much of a chance answering this question: “Are you sure we don’t have that document?” Finding is now becoming a must-have function. The Catch-22 is that when an organization doesn’t have a grasp on how much data it holds or where an item is, search becomes a bit tougher.

[Image: How big is the information task?]

CMS Warts and Wobbles

May 15, 2009

I don’t think too much about lightweight CMS systems. To be up front, I try to duck heavyweight CMS tools as well. You may not have that luxury. If you are one of the lucky CMSters, you must read “Dump Your Self-Banning CMS” here.

Now the language is salty and some of the illustrations might offend. Nevertheless I found the information first rate. One example:

Since every database access is expensive, the login procedure creates a persistent cookie (today + 365 * 30) for each user property. Dynamic and user specific external CSS files as well as style-sheets served in the HEAD section could fail to apply, so all CMS scripts use a routine that converts the user settings into inline style directives like style="color:red; text-align:bolder; text-decoration:none; ...". The developer consults the W3C CSS guidelines to make sure that not a single CSS property is left out.

If this passage speaks to you, you will like this CMS write up.

Stephen Arnold, May 13, 2009

SAP Data Warehouse Search

May 14, 2009

Application Development Trends reported here that SAP has rolled out a new data warehouse search tool. The product is called Business Objects Explorer, and it seems to allow simplified queries so that end users can get reports without having to involve a Business Objects programming wizard. The story “SAP Launches Data Warehouse Search Tool” said:

Explorer evolved from a Business Objects-developed tool called Polestar, released in late 2007, which lets individuals conduct searches against data in the SAP Business Objects XI 3.1 BI platform. Using a Web-based interface, SAP officials said Explorer will now let any user, regardless of their knowledge of BI, query SAP’s NetWeaver Business Warehouse Accelerator (BWA), the company’s tooling for creating data warehouses.

Nary a word about Inxight, the content processing company Business Objects acquired, nor about any other SAP search initiatives. Poor TREX. Is he orphaned? Endeca and its business intelligence capability? Silence. Will Flash plug ins deliver what Business Objects’ users want? User experience seems to be the way to deliver industrial strength search.

Stephen Arnold, May 14, 2009
