Open Source Opens Up the Enterprise

October 17, 2010

I am seeing open source everywhere. At the gym in Louisville, Kentucky, a second-tier wrestler with tattoos sported a T-shirt that said “Hadoop” and featured an elephant on the back. Go figure.

“Open Source Investment to Increase – Survey” provides more fuel for the community’s bonfire under proprietary software. The survey was cranked out by a blue chip consulting outfit (Accenture), so you know, and I know, that those folks never make mistakes. Just like Arthur Andersen, maybe?

Here is a passage I noted:

Exactly half of the respondents are fully committed to OSS in their business while just more than a quarter (28pc) said they are experimenting with it and keeping an open mind to using it. Some 65pc of those polled said they have a fully documented strategic approach for using OSS in their business, while another third (32pc) are developing a strategic plan. Of those organizations using open source, a massive 88pc said they will increase their investment in the software this year compared to 2009. The overall volume of open source software development is forecast to rise over the next three years to 27pc, up from 20pc in 2009.

Is it time for commercial software vendors to fill out applications at Wal-Mart and Costco? Not right away. But even if the data are overstating the uptake by a third, the implications of the study are tough to miss. Open source is free, and it is open. Quite a few outfits are taking a close look at open source.

I know that if I had a client trying to decide between Microsoft SharePoint and Alfresco, for example, I would probably point the outfit toward Alfresco. After all, some of the lessons of Documentum have influenced thinking at Alfresco, I have heard.

Blue chip consultants can make a lot of money analyzing the pros and cons of options. This study may be the first shot in a broader push by a blue chip firm to surf the open source wave.

Stephen E Arnold, October 17, 2010


Online Paranoia and Context

December 3, 2009

Years ago, I met the president of a company in Houston, Texas. I recall hearing that person recounting some of his management insights in the construction business. One catchphrase he used to make a point had to do with paranoia and knowing that everyone was out to get you. Years later I read Andy Grove’s Only the Paranoid Survive: How to Exploit the Crisis Points That Challenge Every Company. Similar idea: some awareness of what the competition is doing is essential to focus an organization’s energies. Over the years, I have worked on a couple of jobs in which paranoia was a useful ingredient like basil on a Food Channel’s winning pizza recipe. In certain work situations, a dash of paranoia is what separates those who survive from those who become the concrete in a skyscraper or the dough in a calzone.

I read “8 Million Reasons for Real Surveillance Oversight”, and you may want to scan the article as well. The main point in my opinion is:

My point is this: The vast majority of the government’s access to individuals’ private data is not reported, either due to a failure on DOJ’s part to supply the legally required statistics, or due to the fact that information regarding law enforcement requests for third party stored records (such as email, photos and other data located in the cloud) is not currently required to be collected or reported. As for the millions of government requests for geo-location data, it is simply disgraceful that these are not currently being reported…but they should be.

If you want a catalog of examples of surveillance activities, the article provides a useful starting point.

Let me conclude with several observations:

  1. Depending on one’s job, these activities may have a different context. For example, if one is working on a project when there are other factors in play, then the need to use available resources to address a matter is a responsible and necessary activity. I think of information as an instrument, and the use of that instrument depends on context. Without context, I find it difficult to make an informed judgment about “shoulds,” “woulds” and “coulds”.
  2. Some engaged in law enforcement have experienced significant increases in the amount and type of work that must be done on the “job”. As a result, like any process oriented professional, when software can perform certain work more efficiently, it makes perfect sense to me to use new methods to manage a task. I find it typical of public companies, start ups, and government organizations to try different techniques and determine which work and which don’t. Adaptation takes place. In my experience, those experiences are an essential part of professional behavior.
  3. The budget data for law enforcement and intelligence professionals, when compared to the volume of work that must be performed, are not included in the article. One quick example: a major city’s law enforcement group needs twice the number of uniforms presently available to handle existing criminal activity. There is neither budget nor political support to expand the number of officers. Use of new methods is one way to extend the thin membrane of law enforcement over the present work load.
  4. The volume of data available is impossible to capture, manage, and process with traditional methods. Not even the most sophisticated computer systems are able to deliver the type of information that may be needed to address a certain situation. In my experience, more investment and effort are needed to tame and channel the raging floods of data.

In short, paranoia is a useful motivational and creative force. However, paranoia without context can create an impression that certain situations look like a duck but may be a very different animal. Forget trade shows. Forget public announcements about data sets being made available. Remember that context is needed to understand the who, what, why, and how of an action. These nuances are tough to get even when one is working on a project that requires certain types of data. Outside of those projects, context may be impossible to obtain. Without context, I find it difficult to speak with confidence about a specific action or a group of unrelated actions.

I do know what can happen if certain data are ignored. You do too if you do some historical thinking.

Stephen Arnold, December 3, 2009

Oyez, oyez, I wish to report to the Department of Justice that I was not paid by anyone to point out that context is a useful concept when writing about specific actions taken in order to complete a mission.

Overflight Adds Coveo and Thunderstone

July 14, 2009

If you want to keep up with what’s new at Coveo and Thunderstone, navigate to the Overflight service. In addition to real time updates about the Google, you can now enjoy the same multi-source information “overflight” about Coveo (privately held Canadian company with enterprise and mobile search) and Thunderstone (privately held company in Cleveland, long an innovator in search and retrieval). The goslings and I use the Coveo tools. We had a Thunderstone appliance, but we had to be good geese and return it. Sigh.

Watch for more companies on the Overflight service, which is free to anyone who chooses to visit the service. A commercial version is available which permits integration and merging of internal content along with the Web information shown on this demonstration site.

Stephen Arnold, July 14, 2009

Arnold at NFAIS: Google Books, Scholar, and Good Enough

June 26, 2009

Speaker’s introduction: The text that appears below is a summary of my remarks at the NFAIS Conference on June 26, 2009, in Philadelphia. I talk from notes, not a written manuscript, but it is my practice to create a narrative that summarizes my main points. I have reproduced this working text for readers of this Web log. I find that it is easier to put some of my work in a Web log than it is to create a PDF and post that version of a presentation on my main Web site. I have skipped the “who I am” part of the talk and jumped into the core of the presentation.

Stephen Arnold, June 26, 2009

In the past, epics were a popular form of entertainment. Most of you have read the Iliad, possibly Beowulf, and some Gilgamesh. One convention is that these complex literary constructs begin in the middle or what my grade school teacher called “in medias res.”

That’s how I want to begin my comments about Google’s scanning project, an epic usually referred to as Google Books. Then I want to go back to the beginning of the story and then jump ahead to what is happening now. I will close with several observations about the future. I don’t work for Google, and my efforts to get Google to comment on topics are ignored. I am not an attorney, so my remarks have zero legal foundation. And I am not a publisher. I write studies about information retrieval. To make matters even more suspect, I do my work from rural Kentucky. From that remote location, I note that Amazon is concerned about Google Books, probably because Google seeks to enter the eBook sector. This story is good enough; that is, in a project so large, so sweeping, perfection is not possible. Pages are skewed. Insects scanned. Coverage is hit and miss. But what other outfit is prepared to spend to scan books?

Let’s begin in the heat of the battle. Google is fighting a number of things. Google finds itself under scrutiny from publishers and authors. These are the entities with whom Google signed a “truce” of sorts regarding the scanning of books. Increasingly, libraries have begun to express concern that Google may not be doing the type of preservation job needed to keep the source materials in a suitable form for scholars. Regulators have taken an interest in the matter because of the publicity swirling around a number of complicated business and legal issues.

These issues threaten Google with several new challenges.

First, since its founding in 1998, Google has enjoyed what I would call positive relationships with users, stakeholders, and most of its constituents. The Google Books matter is now creating what I would describe as “rising tension”. If the tension escalates, a series of battles can erupt in the legal arena. As you know, battle is risky when two heroes face off in a sword fight. Fighting in a legal arena is in some ways more risky and more dangerous.

Second, the friction of these battles can distract Google from other business activities. Google, as some commentators, including myself in Google: The Digital Gutenberg, have noted, may be vulnerable to new types of information challenges. One example is Google’s absence from the real time indexing sector where Facebook, Twitter, and even Microsoft seem to be outpacing Google. Distractions like the Google Books matter could exclude Google from an important new opportunity.

Finally, Google’s approach to its projects is notable because the scope of the project makes it hard for most people to comprehend. Scanning books takes exabytes of storage. Converting images to ASCII, transforming the text (that is, adding structure tags), and then indexing the content takes a staggering amount of computing resources.


Inputs to outputs, an idea that was shaped between 1999 and 2001. © Stephen E. Arnold, 2009

Google has been measured and slow in its approach. The company works with large libraries, provides copies of the scanned material to its partners, and has tried to keep moving forward. Microsoft and Yahoo, database publishers, the Library of Congress, and most libraries have ceded the scanning of books work to Google.

Now Google finds itself having to juggle a large number of balls.

Now let’s go back in time.

I have noticed that most analysts peg the Google Books project as starting right before the initial public offering in 2004. That’s not what my research has revealed. Google’s interest in scanning the contents of books reaches back to 2000.

In fact, an analysis of Google’s patent documents and technical papers for the period from 1998 to 2003 reveals that the company had explored knowledge bases, content transformation, and mashing up information from a variety of sources. In addition, the company had examined various security methods, including methods to prevent certain material from being easily copied or repurposed.

The idea, which I described in The Google Legacy (which I wrote in 2003 and 2004 with publication in early 2005), was to gather a range of information, process that information using mathematical methods in order to produce useful outputs like search results for users, and generate information about the information. The word given to describe value added indexing is metadata. I prefer the less common but more accurate term meta indexing.
