Very Large Databases – Googzilla Being Coy

August 31, 2009

I read Technofeel’s “VLDB09 Part Two” and noted another Google head fake. Technofeel points out that Google’s paw prints were all over the conference from his point of view. MapReduce and Hadoop (an open source semi MapReduce) presentations caught his attention. In my opinion, the most interesting comment in the write up was:

Finally, I ended my visit at VLDB09 with two presentation of Google Interns about data mining to get structured result sets out of semi unstructured pages with lists and tables.

These two Google papers are important. You can get links to them from Technofeel’s article. Let me make two or three observations:

  • The use of “interns” is a way for the Google to reward bright folks while keeping the big guns off the podium. The experience of the Google Books product manager makes this use of interns prudent.
  • The content of the papers is not intern grade. When you work through the two documents, you will learn that Google has made significant advances in methods for working out issues in manipulating Google-scale structured data and discerning context.
  • The traditional world of relational databases is on a collision course with Googzilla. Big data are part of the Google core competency.

Those are some interns because their co-authors are among Google’s most sophisticated researchers and academic colleagues. Technofeel’s instincts are good. He may want to check the bios of the secondary and tertiary authors of these Google papers. The interns are not the hubs on these wheels.

Stephen Arnold, August 31, 2009

Data Warehouse Leader to Reinvent Data Warehousing

August 26, 2009

“IBM Announces ‘Smart Analytics System’ Aimed at Reinventing Data Warehousing” reminded me of Einstein’s discomfort with some of the implications of his theory of relativity. Invent one thing, then scramble to find a way to deal with problems that won’t go away. IBM, one might assert, invented data warehousing. It was an IBM researcher who developed our old friend the relational database. The Codd approach has been the big dog in data management for a long time. Options are now becoming more widely available, but when one says, “Data warehousing”, I think IBM. That’s why I am an addled goose I suppose.

image

Mr. Data Warehouse. Image source: http://en.wikipedia.org/wiki/Edgar_F._Codd

This article-interview makes clear that something is not right in IBM land. For me, the most suggestive comment in the Intelligent Enterprise write up was this passage:

Though IBM is promising better performance, a big part of the appeal seems to be targeted at executives who would favor contract simplicity and a single “throat to choke” over enterprising, but potentially riskier, in-house development, integration and innovation.

The “reinvention” seems to me to be little more than fixing responsibility for a mission critical system on a company big enough to take to court if the data warehouse has a leaking roof. In my experience these traditional data warehouses have more problems than a fast-build Shanghai apartment building.

My thought is to take a hard look at the assumptions about data warehousing, then poke into some options. Dare I suggest Aster Data? What about a Perfect Search enabled system?

Stephen Arnold, August 26, 2009

Metadata: Not Delivering and Dying

August 26, 2009

I watched a year ago as dozens of people filed into a program called “the drill instructor’s approach to metadata” or something that suggested a Marine Corps physical training session. Yep, I thought, metadata in a day. I flapped my tail feathers and waddled on by the room stuffed with people who paid hundreds of dollars to get a knowledge injection.

Metadata is not exactly a botox injection that worked particularly well.

botox lips

Lousy metadata produces a result that can be unexpected.

The notion of adding specific index terms to a content object is simple on the surface, but the indexing and tagging are intellectual walnuts. Get the terms wrong and no one can find documents because no one uses those words. Get the categories wrong and the helpful folders are like lumber rooms filled with odds and ends. Try to fix these problems, and the average MBA or art history major falls to the floor with their ankles bound by torn garments.

I quite enjoyed “Resuscitating Your Dying Metadata Strategy.” The title evoked an image of a gasping automated indexing system with three or four consultants poking at an intellectual body lying face down on the content processing vendor’s license agreement. And the word “dying” was a good one. There is a certain urgency to the word. “Sickly” denotes that a recovery may be likely. “Dying” suggests that I flip to Google Local to identify a funeral home.

The key segment of the article in my opinion was this passage:

a large number of IT professionals know intuitively that metadata management is the right thing to do, but have a hard time articulating why they need it.  Also they admit a lack of engagement and collaboration with business stakeholders they are  aiming to help. They also often have failed attempts to get metadata efforts off the ground in the past and are trying to fast track something…anything! So how can IT reverse this trend? They need to better scope and prioritize their metadata efforts by building a more realistic business case that can demonstrate real value-add.

The touchstones for me are two. First, there is the notion of a disconnect between users and information technology professionals. Then there is the notion that a lack of intellectual rigor and perhaps expertise has created problems. The organization wants a silver bullet.

Yes, this sounds familiar.

Metadata are important. The addled goose has no quick fixes to offer. The type of controlled terms that once were the strength of commercial databases such as ABI / INFORM are no longer valued. Creating consistent, useful controlled term lists and developing meaningful classification systems takes time and effort. Once these lists are in hand, the terms can be applied via human or “smart” systems. The moment the lists and classification systems are completed, the work begins to keep these lists in step with language. Sci tech terminology drifts less quickly than general business terminology.
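For readers who want to see what applying a controlled term list via a “smart” system might look like at its very simplest, here is a minimal sketch in Python. The term list, the variant phrases, and the function name `assign_terms` are all hypothetical, invented for illustration; this is not how ABI / INFORM or any vendor does it, and the substring matching is deliberately crude.

```python
# Hypothetical controlled vocabulary: preferred term -> variant phrases.
# A real list would be built and maintained by indexing professionals.
CONTROLLED_TERMS = {
    "Mergers and acquisitions": ["merger", "acquisition", "takeover", "buyout"],
    "Data warehousing": ["data warehouse", "data warehousing"],
}

def assign_terms(text):
    """Return the preferred terms whose variants appear in the text."""
    lowered = text.lower()
    return sorted(
        term
        for term, variants in CONTROLLED_TERMS.items()
        if any(variant in lowered for variant in variants)
    )

print(assign_terms("IBM announced a takeover and a new data warehouse appliance."))
```

The code is the easy part. The variant lists are where the time and effort go, and they are the part that must be revised as language drifts.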

The message is that an organization must continue to invest in complex, knowledge centric work. In my experience few organizations have the appetite for this activity. Quite a few folks who buy commercial databases in order to create a knowledge monopoly invest too little to keep their information products’ indexing up to snuff. The newcomers spend some money and time but fall into the trap of finding a Hollywood doctor to administer a quick botox injection to hide a wrinkle before an audition.

The folks who work at metadata often find themselves ignored. A good example is the 500,000+ categories generated by the Google. You can see a bit of this system in action if you run this query, verified at 8 am on August 25, 2009: “skin cancer”. Here is the result list I saw:

skin cancer

Based on my research, Google has been plugging away at metadata and making progress. Organizations faced with revivifying their dying metadata systems may want to learn from their errors and their consultants’ silly promises about certain automated systems. Maybe Google will make its metadata systems available someday? Maybe one of the graduates of the drill instructor programs that teach taxonomy will discover a silver bullet that is easy, cheap, and fast?

The addled goose’s team does controlled vocabularies the old-fashioned way, working with partners like Access Innovations, a company with automated systems and the deep experience required to tackle metadata in an informed way. No wonder he is paddling alone and thinking of the good old days when the ABI / INFORM and the Business Dateline teams worked each week to refine their term lists and tweak their classification systems. That was hard work not suitable to the social networking, Tweet sending “experts” selling metadata systems like carnival mountebanks.

Stephen Arnold, August 26, 2009

Silobreaker Update

August 25, 2009

I was exploring usage patterns via Alexa. I wanted to see how Silobreaker, a service developed by some savvy Scandinavians, was performing against the brand name business intelligence companies. Silobreaker is one of the next generation information services that processes a range of content, automatically indexing and filtering the stream, and making the information available in “dossiers”. A number of companies have attempted to deliver usable “at a glance” services. Silobreaker has been one of the systems I have relied upon for a number of client engagements.

I compared the daily reach of LexisNexis (a unit of the Anglo Dutch outfit Reed Elsevier), Factiva (originally a Reuters Dow Jones “joint” effort in content and value added indexing now rolled back into the Dow Jones mothership), Ebsco (the online arm of the E.B. Stephens Co. subscription agency), and Dialog (a unit of the privately held database roll up company Cambridge Scientific Abstracts / ProQuest and some investors). Keep in mind that Silobreaker is a next generation system and I was comparing it to the online equivalent of the Smithsonian’s computer exhibit with the Univac and IBM key punch machine sitting side by side:

silo usage

Silobreaker is the blue line which is chugging right along despite the challenging financial climate. I ran the same query on Compete.com, and those data showed LexisNexis with a growth uptick and more traffic in June 2009. Your mileage may vary. These types of traffic estimates are indicative, not definitive. But Silobreaker is performing and growing. One could ask, “Why aren’t the big names showing stronger buzz?”

silo splash

A better question may be, “Why haven’t the museum pieces performed?” I think there are three reasons. First, the commercial online services have not been able to bridge the gap between their older technical roots and the new technologies. When I poked under the hood in Silobreaker’s UK facility, I was impressed with the company’s use of next generation Web services technology. I challenged the R&D team regarding performance, and I was shown a clever architecture that delivers better performance than the museum piece services against which Silobreaker competes. I am quick to admit that performance and scaling remain problems for most online content processing companies, but I came away convinced that Silobreaker’s engineering was among the best I had examined in the real time content sector.

Second, I think the museum pieces – I could mention any of the services against which I compared Silobreaker – have yet to figure out how to deal with the gap between the old business model for online and the newer business models that exist. My hunch is that the museum pieces are reluctant to move quickly to embrace some new approaches because of the fear of [a] cannibalization of their for fee revenues from a handful of deep pocket customers like law firms and government agencies and [b] looking silly when their next generation efforts are compared to newer, slicker services from Yfrog.com, Collecta.com, Surchur.com, and, of course, Silobreaker.com.

Third, I think the established content processing companies are not in step with what users want. For example, when I visit the Dialog Web site here, I don’t have a way to get a relationship map. I like nifty methods of providing me with an overview of information. Who has the time or patience to handcraft a Boolean query and then pay money whether the dataset contains useful information or not? I just won’t play that “pay us to learn there is a null set” game any more. Here’s the Dialog splash page. Not too useful to me because it is brochureware, almost a 1998 approach to an online service. The search function only returns hits from the site itself. There is no compelling reason for me to dig deeper into this service. I don’t want a dialog; I want answers. What’s a ProQuest? Even the name leaves me puzzled.

the dialog page

I wanted to make sure that I was not too harsh on the established “players” in the commercial content processing sector. I tracked down Mats Bjore, one of the founders of Silobreaker. I interviewed him as part of my Search Wizards Speak series in 2008, and you may find that information helpful in understanding the new concepts in the Silobreaker service.

What are some of the changes that have taken place since we spoke in June 2008?

Mats Bjore: There are several new things and plenty more in the pipeline. The layout and design of Silobreaker.com have been redesigned to improve usability; we have added an Energy section to provide a more vertically focused service around both fossil fuels and alternative energy; we have released Widgets and an API that enable anyone to embed Silobreaker functionality in their own web sites; and we have improved our enterprise software to offer corporate and government customers “local” customizable Silobreaker installations, as well as a technical platform for publishers who’d like to “silobreak” their existing or new offerings with our technology. Industry-wise, the recent statements by media moguls like Rupert Murdoch make it clear that the big guys want to monetize their information. The problem is that charging for information does not solve the problem of a professional already drowning in information. This is like trying to charge a man who has fallen overboard for water instead of offering a life jacket. Wrong solution. The marginal loss of losing a few news sources is really minimal for the reader, as there are thousands to choose from anyway, so unless you are a “must-have” publication, I think you’ll find out very quickly that reader loyalty can be fickle or short-lived or both. Add to that the fact that news reporting itself has changed dramatically. Blogs and other types of social media are already favoured over many newspapers, and we saw Twitter’s role during the election demonstrations in Iran. Citizen journalism of that kind, immediate, straight from the action, and free, is extremely powerful. But whether old or new media, Silobreaker remains focused on providing sense-making tools.

What is it going to be, free information or for fee information?

Mats Bjore: I think there will be free, for fee, and blended information just like Starbucks coffee. The differentiators will be “smart software” like Silobreaker and some of the Google technology I have heard you describe. However, the future is not just lots of results. The services that generate value for the user will have multiple ways to make money. License fees, customization, and special processing services, to name just three, will differentiate what I can find on your Web log and what I can get from a Silobreaker “report”.

What can the museum pieces like Dialog and Ebsco do to get out of their present financial swamp?

Mats Bjore: That is a tough question. I also run a management consultancy, so let me put on my consultant hat for a moment. If I were Reed Elsevier, Dow Jones/Factiva, Dialog, Ebsco or owned a large publishing house, I must realize that I have to think out of the box. It is clear that these organizations define technology in a way that is different from many of the hot new information companies. Big information companies still define technology in terms of printing, publishing or other traditional processes. The newer companies define technology in terms of solving a user’s problem. The quick fix, therefore, ought to be to start working with new technology firms and see how they can add value for these big dragons today, not tomorrow.

What does Silobreaker offer a museum piece company?

Mats Bjore: The Silobreaker platform delivers access and answers without traditional searching. Users can spot what is hot and relevant. I would seriously look at solutions such as Silobreaker as a front end to create a better reach to new customers, capture revenues from ad-sponsored free content, and reach a wider audience that can click through to premium content. (Most of us are unaware of the premium content that is out there, since the legacy contractual types only reach big companies and organizations.) I am surprised that Google, Microsoft, and Yahoo have not moved more aggressively to deliver more than a laundry list of results with some pictures.

Is the US intelligence community moving more purposefully with access and analysis?

Mats Bjore: The interest in open source is rising. However, there is quite a bit of inertia when it comes to having one set of smart software pull information from multiple sources. I think there is a significant opportunity to improve the use of information with smart software like Silobreaker’s.

Stephen Arnold, August 25, 2009

Tweets Are Mostly Pointless Babble

August 15, 2009

I enjoy Mashable. The articles come at topics in a way that is youthful, enthusiastic even. I noted Jennifer Van Grove’s “40% of Tweets Are Pointless Babble.” I was surprised that * only * 40 percent of the message traffic was pointless. However, I think Ms. Van Grove reveals that she has not spent much time in monitoring traffic for intelligence and law enforcement entities. With that experience in her bag of tricks, she might reach a different conclusion about the “noise” in the Twitterstream. “Pointless” to one person might be evidence to another. Youth has its advantages but understanding the value of filtering traffic may not be apparent to an avid sender of Tweets.

Stephen Arnold, August 14, 2009

Visualization and Confusion

August 15, 2009

Visualization of search results or other data is a must-have for presentations in the Department of Defense. What’s a good presentation? One that has killer visualizations of complex data. The problem is that sizzle in one colonel’s graphics triggers a graphics escalation. This is a briefing room version of Mixed Martial Arts. The problem, based on my limited experience in this type of content, is that most of the graphics don’t make much sense. In fact, when I see a graphic I usually have zero idea about where the data originated, the mathematical methods used to generate the visual, or what Photoshop wizardry may have been employed to make that data point explode in my perceptual field. Your mileage may differ, but I find that visualization is useful in small doses.

To prove that what I prefer is out of date and that my views are road kill on the information superhighway, you will want to explore “15 Stunning Examples of Data Visualization”. Stunning is an appropriate word. After looking at these examples, I am not sure what is being communicated in some of these graphics. Example: Big fluctuations.

image

If you want to add zing to your briefings, you will definitely get some ideas from this article. If I am in the audience, expect questions from the addled goose. Know your data thoroughly because I am not sure some of these examples communicate on the addled goose wave length.

Stephen Arnold, August 14, 2009

Morphing Search Vendor Adventures: Customer Feedback

August 13, 2009

Quite a few search and content processing companies are chasing the supposed honey pot of customer support, customer feedback, customer self help, and just about any way to cut these costs. Forbes ran a cheerleading article that I was going to ignore. “No,” one of the goslings said, “This write up makes some good points.” Okay, the story is “The Upside of Bad Online Customer Reviews” by Mirela Iverac. The core idea is that customers who complain can provide useful information to the company that caused the dust up in the first place. The underlying technical hook is that the outfit mentioned in the story, based on what I have heard, uses the Attensity system to deliver the bag of goodies. If you revel in feedback loops that work, snag the Forbes write up.

Stephen Arnold, August 13, 2009

Google Gets Sentimental

August 10, 2009

I got a briefing from a company called Lexalytics. The firm, as I recall, was explaining its sentiment based content processing technology. I thought it was interesting. I subsequently learned that Lexalytics’ system would be part of the Financial Times’s online service, but my recollection is fuzzy. I thought of this company when I learned about the Google patent application US20090193328, “Aspect Based Sentiment Summarization”. You can find this document at the ever so powerful USPTO via its patent search engine. The abstract for the patent application, a type of document that some wizards believe is little more than the equivalent of my mother’s making Christmas tree ornaments for her friends, stated:

Reviews express sentiment about one or more entities. Phrases in the reviews that express sentiment about a particular aspect are identified. Reviewable aspects of the entity are also identified. The reviewable aspects include static aspects that are specific to particular types of entities and dynamic aspects that are extracted from the reviews of a specific entity instance. The sentiment phrases are associated with the reviewable aspects to which the phrases pertain. The sentiment expressed by the phrases associated with each aspect is summarized, thereby producing a summary of sentiment associated with each reviewable aspect of the entity. The summarized sentiment and associated phrases can be stored and displayed to a user as a summary description of the entity.
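To make the abstract a bit more concrete, here is a toy sketch in Python of the general idea: find sentiment-bearing sentences, tie them to an aspect, and total a score per aspect. This is not Google’s method. The word lists and the function name `summarize` are hypothetical, and the sketch handles only a fixed (“static”) aspect list; it makes no attempt at the dynamic aspects the abstract says are extracted from the reviews themselves.

```python
from collections import defaultdict

# Hypothetical hand-made lexicons; a real system would learn or curate these.
ASPECT_TERMS = {
    "food": {"food", "pizza", "menu"},
    "service": {"service", "waiter", "staff"},
}
SENTIMENT_WORDS = {"great": 1, "good": 1, "friendly": 1,
                   "slow": -1, "bad": -1, "cold": -1}

def summarize(reviews):
    """Group sentiment-bearing sentences by aspect and total a crude score."""
    summary = defaultdict(lambda: {"score": 0, "phrases": []})
    for review in reviews:
        for sentence in review.split("."):
            words = sentence.lower().split()
            score = sum(SENTIMENT_WORDS.get(word, 0) for word in words)
            if score == 0:
                # No sentiment words, or positives and negatives cancel out.
                continue
            for aspect, terms in ASPECT_TERMS.items():
                if terms.intersection(words):
                    summary[aspect]["score"] += score
                    summary[aspect]["phrases"].append(sentence.strip())
    return dict(summary)

reviews = ["The pizza was great. The waiter was slow.", "Good food. Bad service."]
for aspect, info in summarize(reviews).items():
    print(aspect, info["score"], info["phrases"])
```

Even this crude version shows the shape of the output the abstract describes: a per-aspect summary with the supporting phrases attached.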

Now Lexalytics and other companies with sentiment sniffers are only part of what this document sparked in my mind. The other low voltage arc was in the Endeca “Guided Navigation” department of my addled goose brain. As I read the exciting patent document and its droll legalese, I realized that the Google is claiming that it performs the same magic that Orange Julius does when it mixes fruits in a fruit shake.

Will Lexalytics and Endeca shiver their timbers? Nope. My hunch is that both companies will see their technology as light years ahead of the Google’s. I also assert that both companies will not see Google’s claims as having much impact on their enterprise and ecommerce content processing applications.

In my opinion, this type of “Google does not have what we have” thinking is going to lead to unfortunate circumstances and quickly.

Stephen Arnold, August 11, 2009

Google Relationship Map

August 3, 2009

A happy quack to the reader who sent me a link to Muckety.com and its relationship map of Google. Some Googlers and former Googlers whom I track appear on the map; for example, Anna Patterson (University of Illinois Ph.D., developer of Xift, Google inventor, one of the founders of Cuil.com) and the Digg-hyped Marissa Mayer (keeper of the user interface and authority on Internet anonymity).

muckety map snippet

But there are some omissions. You can click around as I did, and you may be able to nail down Steve Lawrence or Sanjay Ghemawat. Perfect? Nope. Useful. I think it is suggestive in light of IBM’s alleged “invention” of relationship maps discovered by processing data.

For the purposes of comparison, here’s the Cluuz.com map of Ms. Mayer:

cluuz mayer

I assume IBM’s relationship maps put these two free systems to shame.

Stephen Arnold, August 3, 2009

IBM Snags SPSS, May Be Bad Timing

July 29, 2009

IBM bought SPSS. Most third and fourth year statistics majors learn to love either SPSS or arch-rival SAS. MicrostAT just does not paddle fast enough for the serious stats whiz. You can read about the deal on the IBM Web site or on TechCrunch.

I liked the “Monster Merger” story. The guts of the deal are presented. For me the most interesting comment was:

IBM says it will continue to support and enhance SPSS technologies while allowing customers to take advantage of its own product portfolio. SPSS will become part of the Information Management division within the Software Group business unit, led by Ambuj Goyal, General Manager, IBM Information Management.

Right.

What I have not seen is a discussion of the SPSS text processing functions. IBM has its OmniFind and a legion of partners to deliver text processing functions. Then there is the Web Fountain system. You do remember Web Fountain, don’t you? The brainiacs at Almaden continue to labor away in text processing.

Now IBM gets PASW which counts, categorizes, and performs other content processing operations. SPSS bought Lexiquest and has added functionality since that deal in 2002.

The plumbing for SPSS text processing has these components:

image

© SPSS, 2007

SPSS, like IBM, requires a commitment from a licensee. IBM may be joining the party a bit late. The shift to lighter weight analytic tools is underway. Newcomers like Clarabridge have been holding their own. SAS’s purchase of Teragram and its open sourcing some of Teragram’s software makes it clear that the good old days may be receding in the rear view mirror. SPSS can be a real resource hog. That should make IBM happy. IBM loves to sell consulting but a close second is selling hardware and engineering support. SPSS has not made the leap to Web services.

In short, I think the text processing components of SPSS may get lost and quickly within the massive IBM organization. Furthermore, this deal may have been made at the right time for SPSS and maybe the wrong time for IBM. Just my opinion.

Stephen Arnold, July 29, 2009
