IBM Snags SPSS, May Be Bad Timing
July 29, 2009
IBM bought SPSS. Most third and fourth year statistics majors learn to love either SPSS or arch-rival SAS. Microstat just does not paddle fast enough for the serious stats whiz. You can read about the deal on the IBM Web site or on TechCrunch.
I liked the “Monster Merger” story. The guts of the deal are presented. For me the most interesting comment was:
IBM says it will continue to support and enhance SPSS technologies while allowing customers to take advantage of its own product portfolio. SPSS will become part of the Information Management division within the Software Group business unit, led by Ambuj Goyal, General Manager, IBM Information Management.
Right.
What I have not seen is a discussion of the SPSS text processing functions. IBM has its OmniFind and a legion of partners to deliver text processing functions. Then there is the Web Fountain system. You do remember Web Fountain, don’t you? The brainiacs at Almaden continue to labor away in text processing.
Now IBM gets PASW which counts, categorizes, and performs other content processing operations. SPSS bought Lexiquest and has added functionality since that deal in 2002.
The plumbing for SPSS text processing has these components:
[Diagram of the SPSS text processing components. © SPSS, 2007]
SPSS, like IBM, requires a commitment from a licensee. IBM may be joining the party a bit late. The shift to lighter weight analytic tools is underway. Newcomers like Clarabridge have been holding their own. SAS’s purchase of Teragram and its open sourcing of some of Teragram’s software make it clear that the good old days may be receding in the rear view mirror. SPSS can be a real resource hog. That should make IBM happy. IBM loves to sell consulting but a close second is selling hardware and engineering support. SPSS has not made the leap to Web services.
In short, I think the text processing components of SPSS may get lost, and quickly, within the massive IBM organization. Furthermore, this deal may have been made at the right time for SPSS and maybe the wrong time for IBM. Just my opinion.
Stephen Arnold, July 29, 2009
Kapow Technologies
July 17, 2009
With the rise of free real time search systems such as Scoopler, Connecta, and ITPints, established players may find themselves in the shadows. Most of the industrial strength real time content processing companies like Connotate and Relegence prefer to be out of the spotlight. The reason is that their customers are often publicity shy. When you are monitoring information to make a billion on Wall Street or to snag some bad guys before those folks can create a disruption, you want to be far from the Twitters.
A news release came to me about an outfit called Kapow Technologies. The company described itself this way:
Kapow Technologies provides Fortune 1000 companies with industry-leading technology for accessing, enriching, and serving real-time enterprise and public Web data. The company’s flagship Kapow Web Data Server powers solutions in Web and business intelligence, portal generation, SOA/WOA enablement, and CMS content migration. The visual programming and integrated development environment (IDE) technology enables business and technical decision-makers to create innovative business applications with no coding required. Kapow currently has more than 300 customers, including AT&T, Wells Fargo, Intel, DHL, Vodafone and Audi. The company is headquartered in Palo Alto, Calif. with additional offices in Denmark, Germany and the U.K
I navigated to the company’s Web site out of curiosity and learned several interesting factoids:
First, the company is a “market leader” in open source intelligence. It has technology to create Web crawling “robots”. The technology can, according to the company, “deliver new Web data sources from inside and outside the agency that can’t be reached with traditional BI and ETL tools.” More information is here. Kapow’s system can perform screen scraping; that is, extracting information from a Web page via software robots.
Second, the company offers what it calls a “portal generation” product. The idea is to build new portals or portlets without coding. The company said:
With Kapow’s technology, IT developers [can]: Avoid the burden of managing different security domains; eliminate the need to code new transaction; and bypass the need to create or access SOA interfaces, event-based bus architectures or proprietary application APIs.
Third, the company provides a system that handles content migration and transformation. With transformation an expensive line item in the information technology budget, managing these costs becomes more important each month in today’s economic environment. Kapow says here:
The module [shown below] acts much as an ETL tool, but performs the entire data extraction and transformation at the web GUI level. Kapow can load content directly into a destination application or into standard XML files for import by standard content importing tools. Therefore, any content can be migrated and synchronized to and between any web based CMS, CRM, Project Management or ERP system.
Kapow offers connections for a number of widely used content management systems, including Interwoven, Documentum, Vignette, and Oracle Stellent, among others.
Kapow includes a search function along with application programming interfaces, and a range of tools and utilities, including RoboSuite (a block diagram appears below):
Source: http://abss2.fiit.stuba.sk/TeamProject/2006/team05/doc/KapowTech.ppt
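For readers who have not seen screen scraping up close, here is a minimal sketch of the idea Kapow describes: a software robot reads a Web page, pulls out the pieces it wants, and writes them to standard XML for import elsewhere. This is not Kapow's technology or API. The `HeadlineScraper` class, the `scrape_to_xml` function, the choice of `<h2>` elements, and the sample page are all assumptions made up for illustration, and the code uses only the Python standard library.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HeadlineScraper(HTMLParser):
    """Hypothetical robot: collects the text of every <h2> element on a page."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headlines.append("")  # start a new headline

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.headlines[-1] += data  # accumulate text inside the <h2>


def scrape_to_xml(html):
    """Extract headlines from an HTML string and return them as an XML string."""
    scraper = HeadlineScraper()
    scraper.feed(html)
    root = ET.Element("items")
    for text in scraper.headlines:
        ET.SubElement(root, "item").text = text.strip()
    return ET.tostring(root, encoding="unicode")


# Made-up page standing in for a real Web site; no network access is needed.
SAMPLE_PAGE = (
    "<html><body><h2>IBM buys SPSS</h2><p>Details</p>"
    "<h2> Kapow ships robots </h2></body></html>"
)
print(scrape_to_xml(SAMPLE_PAGE))
```

A production robot also has to cope with logins, JavaScript, and pages that change layout without warning, which is presumably where a commercial product earns its license fee.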
Big Data, Big Implications for Microsoft
July 17, 2009
In March 2009, my Overflight service picked up a brief post in the Google Research Web log called “The Unreasonable Effectiveness of Data.” The item mentioned that three Google wizards wrote an article in the IEEE Intelligent Systems journal called “The Unreasonable Effectiveness of Data.” You may be able to download a copy from this link.
On the surface this is a rehash of Google’s big data argument. The idea is that when you process large amounts of data with a zippy system using statistical and other mathematical methods, you get pretty good information. In a very simple way, you know what the odds are that something is in bounds or out of bounds, right or wrong, even good or bad. Murky human methods like judgment are useful, but with big data, you can get close to human judgment and be “right” most of the time.
When you read the IEEE write up, you will want to pay attention to the names of the three authors. These are not just smart guys, these are individuals who are having an impact on Google’s leapfrog technologies. There’s lots of talk about Bing.com and its semantic technology. These three Googlers are into semantics and quite a bit more. The names:
- Alon Halevy, former Bell Labs researcher and the thinker answering to some degree the question, “What’s after relational databases?”
- Peter Norvig, the fellow who wrote the standard textbook on computational intelligence and smart software
- Fernando Pereira, the former chair of Penn’s computer science department and the Andrew and Debra Rachleff Professor.
So what do these three Googlers offer in their five page “expert opinion” essay?
First, large data makes smart software smart. This is a reference to the Google approach to computational intelligence.
Second, big data can learn from rare events. Small data and human rules are not going to deliver the precision that one gets from algorithms and big data flows. In short, costs for getting software and systems smarter will not spiral out of control.
Third, the Semantic Web is a non starter so another method – semantic interpretation – may hold promise. By implication, if semantic interpretation works, Google gets better search results plus other benefits for users.
Conclusion: dataspaces.
See, Google is up front and clear when explaining what its researchers are doing to improve search and other knowledge centric operations. What are the implications for Microsoft? Simple. In my opinion, the big data approach is not used in the Powerset method applied to Bing. Therefore, Microsoft has a cost control issue to resolve with its present approach to Web search. Just my opinion. Your mileage may vary.
Stephen Arnold, July 17, 2009
Software Robots Determine Content Quality
July 15, 2009
ZDNet ran an interesting article by Tom Steinert-Threlkeld about software taking over human editorial judgment. “Quality Scores for Web Content: How Numbers Will Create a Beautiful Cycle of Greatness for Us All” is worth tucking into one’s folder for future reference.
Some background. Mr. Steinert-Threlkeld notes that the hook for his story is a fellow named Patrick Keane, who worked at the Google for several years. What’s not included in Mr. Steinert-Threlkeld’s write up is that Google has been working on “quality scores” for many years. You can get references to specific patent and technical documents in my Google monographs. I just wanted to point out that the notion of letting software methods do the work that arbiters of taste have been doing is not a new idea.
The core of the ZDNet story was:
Keane is at work on figuring out what will constitute a Quality Score, for every article, podcast, Webcast or other piece of output generated by an Associated Content contributor. If his 21st Century content production and distribution network can figure out how to put a useful rank on what it puts out on the Web then it can raise it up, notch by notch. This scoring comes right back to the Page Rank process that is at the heart of Google’s success as a search engine. “The great thing about Page Rank in Google’s algorithm is … seeing the Web as a big popularity contest,” said Keane, in Associated Content’s offices on Ninth Avenue in Manhattan.
Mr. Steinert-Threlkeld does a good job of explaining how the method at Mr. Keane’s company (Associated Content) will approach the scoring issue.
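The “popularity contest” in the Keane quote can be made concrete with a toy version of the textbook PageRank calculation. This is a sketch of the published, simplified algorithm only. It is not Google's production scoring and not Associated Content's Quality Score, and the four-page link graph, the `pagerank` function name, and the parameter defaults are assumptions for illustration. The sketch also assumes every link target appears as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page gets a small baseline score regardless of inbound links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped score evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages point at "c", and "c" points at "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The page with the most inbound votes floats to the top, and a vote from a high scoring page counts for more than a vote from a low scoring one. No human editor is consulted, which is the cost point I make below.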
My thoughts, before I forget them, are:
- Digging into what Google has disclosed about its scoring systems and methods is probably a useful exercise for those covering Google and the businesses in which former Googlers find themselves. The key point is that the Google is leaning more heavily on smart software and less on humans. The implication of this decision is that as content flows go up, Google’s costs will rise less quickly than those of outfits such as Associated Content. Costs are the name of the game in my opinion.
- Former Googlers are going to find themselves playing in interesting jungle gyms. The insights about information will create what I call “Cuil situations”; that is, how far from the Googzilla nest will a Xoogler stray? My hunch is that Associated Content may find itself surfing on Google because Associated Content will not have the plumbing that the Google possesses.
- Dependent services, by definition, will be subordinate to the core provider. Xooglers may be capping the uplift of their new employers who will find themselves looking at short term benefits, not the long term implications of certain methods.
I think Associated Content will be an interesting company to watch.
Stephen Arnold, July 15, 2009
The Gilbane Lecture: Google Wave as One Environmental Factor
July 14, 2009
Author’s note: In early June 2009, I gave a talk to about 50 attendees of the Gilbane content management systems conference in San Francisco. When I tried to locate the room in which I was to speak, the sign in team could not find me on the program. After a bit of 30 something “we’re sure we’re right” outputs, the organizer of the session located me and got me to the room about five minutes late. No worries because the Microsoft speaker was revved and ready.
When my turn came, I fired through my briefing in 20 minutes and plopped down, expecting no response from the audience. Whenever I talk about the Google, I am greeted with either blank stares or gentle snores. I was surprised because I did get several questions. I may have to start arriving late and recycling more old content. Seems to be a winning formula.
This post is a summary of my comments. I will hit the highlights. If you want more information about this topic, you can get it by searching this Web log for the word “Wave”, buying the IDC report No. 213562 Sue Feldman and I did last September, or buying a copy of Google: The Digital Gutenberg. If you want to grouse about my lack of detail, spare me. This is a free Web log that serves a specific purpose for me. If you are not familiar with my editorial policy, take a moment to get up to speed. Keep in mind I am not a journalist, don’t pretend to be one, and don’t want to be included in the occupational category.
Here we go with my original manuscript written in UltraEdit from which I gave my talk on June 5, 2009, in San Francisco:
For the last two years, I have been concluding my Google briefings with a picture of a big wave. I showed the wave smashing a skin cancer victim, throwing surfer dude and surf board high into the air. I showed the surfer dude riding inside the “tube”. I showed pictures of waves smashing stuff. I quite like the pictures of tsunami waves crushing fancy resorts, sending people in sherbet colored shirts and beach wear running for their lives.
Yep, wave.
Now Google has made public why I use the wave images to explain one of the important capabilities Google is developing. Today, I want to review some features of what makes the wave possible. Keep in mind that the wave is a consequence of deeper geophysical forces. Google operates at this deeper level, and most people find themselves dealing with the visible manifestations of the company’s technical physics.
Source: http://www.toocharger.com/fiches/graphique/surf/38525.htm
This is important for enterprise search for three reasons. First, search is a commodity and no one, not even I, finds key word queries useful. More sophisticated information retrieval methods are needed on the “surface” and in the deeper physics of the information factory. Second, Google is good at glacial movement. People see incremental actions that are separated in time and conceptual space. Then these coalesce and the competitors say, “Wow, where did that come from?” Google Wave, the present media darling, is a superficial development that combines a number of Google technologies. It is not the deep geophysical force, however. Third, Google has a Stalin-era type of planning horizon. Think in terms of five years, then you have the timeline on which to plot Google developments. Wave, in fact, is more than three years old if you start when Google bought a company called Transformics, older if you dig into the background of the Transformics technology and some other components Google snagged in the last five years. Keep that time thing in mind.
First, key word search is at a dead end. I have been one of the most vocal critics of key word search and variants of that approach. When someone says, “Key word search is what we need,” I reply, “Search is dead.” In my mind, I add, “So is your future in this organization.” I keep my parenthetical comment to myself.
Users need information access, not a puzzle to solve in order to open the information lock box. In fact, we have now entered the era of “data anticipation”, a phrase I borrowed from SAS, the statistics outfit. We have to view search in terms of social analytics because human interactions provide important metadata not otherwise obtainable by search, semantic, or linguistic technology. I will give you an example of this to make this type of metadata crystal clear.
You work at Enron. You get an email about creating a false transaction. You don’t take action, but you forward the email to your boss and then ignore the issue. When Enron collapses, the “fact” that you knew and did nothing, both when you first learned of the transaction and afterward, is used to make a case that you abetted fraud. You say, “I sent the email to my boss.” From your prison cell, you keep telling your attorney the same thing. Doesn’t matter. The metadata about what you did to that piece of information through time put your tail feathers in a cell with a biker convicted of third degree murder and a prior for aggravated assault.
Got it?
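For the programmers in the room, the Enron example boils down to a log of who did what to a message and when. The sketch below is a minimal illustration of that kind of interaction metadata. The `Action` record, the `timeline` function, the verbs, and the sample log are all hypothetical and do not come from any Google or vendor system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Action:
    """One thing one person did to one message (hypothetical record layout)."""
    message_id: str
    actor: str
    verb: str        # e.g. "received", "forwarded", "deleted"
    when: datetime


def timeline(log, message_id, actor):
    """Return the actor's actions on one message, oldest first."""
    hits = [a for a in log if a.message_id == message_id and a.actor == actor]
    return sorted(hits, key=lambda a: a.when)


# Made-up log entries for illustration only.
LOG = [
    Action("msg-17", "you", "forwarded", datetime(2001, 3, 2, 9, 45)),
    Action("msg-17", "boss", "received", datetime(2001, 3, 2, 9, 46)),
    Action("msg-17", "you", "received", datetime(2001, 3, 2, 9, 30)),
]

for action in timeline(LOG, "msg-17", "you"):
    print(action.when.isoformat(), action.verb)
```

The content of the email never appears in the log. The sequence of actions, and the long silence after the forward, is the evidence.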
Overflight for Attensity
July 8, 2009
Short honk: ArnoldIT.com has added Attensity to its Overflight profile service. You can see the auto generated page here. We will be adding additional search and content processing companies to the service. No charge, and this is a version of the service I use when clients hire the addled goose to prepare competitive profiles. I have a list of about 350 search and content processing vendors. I will peck away at this list until my enthusiasm wanes. If you want a for fee analysis of one of these companies, read the About section of this Web log before contacting me. Yep, I charge money for “real” analysis. Some folks expect me to survive on my good looks and charming personality. LOL.
Stephen Arnold, July 8, 2009
Google Gestation Period
July 7, 2009
I went through my notes about the Guha patent documents. These were published in February 2007. BearStearns published my analysis of these documents in May 2007. I am not sure these are available to the public, but I did describe the Programmable Search Engine invention in my Google Version 2.0 study which came out in September 2007. The Google Squared service and its query “digital camera” replicates the exemplary item in the Guha patent document. Several observations:
- My 2005 assertion was that the Google gestation period is about four years. There is a two year ramp period inside the firm during which time the technology is shaped and then, if deemed patentable, submitted to the USPTO and other patent bodies.
- After the patent document is published, like the Guha February 2007 PSE patents, a two year maturing and deployment process begins.
The appearance of the Google Squared service as a beta marks the Darwinian field testing. The age of semantics is now officially underway. You can read about Google’s methods in my trilogy The Google Legacy (2005), Google Version 2.0 (2007), and Google: The Digital Gutenberg (2009). The 2007 and 2009 studies provide some research data germane to those who want to surf on Google. Yep, that’s the source of my “wave” analogies and the injunction at the end of my Google talks to “surf on Google”.
What’s next? Wait for my newest monograph on time in search and content. I find it easier to let research and content analysis illuminate the would and could of the GOOG.
Stephen Arnold, July 7, 2009
Google and Scientific Tagging
June 28, 2009
In my talk on June 26, 2007 for NFAIS, a question came from one of the participants in the Webcast of my presentation. A person wanted to know if Google Scholar tagged documents with scientific and other types of more formal language. The example was “heart attack” or “myocardial infarction”. I pointed the questioner to Big Google and this query: backpain. Now scroll to the bottom of the page, and you will see these added features:
This is a component of “universal search” so you see videos, categorized results, and the more precise medical term “fibromyalgia”. My point was that the Google has the capability of providing these types of added value tags to the content in Google Scholar and to Google Books, for that matter. So far for public access, more sophisticated content processing outputs are not part of these two services; that is, Google Scholar or Google Books. If you know that Google is adding more sophisticated features to these services, please use the comments section of this Web log to alert me. As Google grows larger and changes, I have a tough time keeping track of Mother Google’s knitting. People do seem to be resonating with the notion of surfing on Google. I have accepted an invitation to give a talk at the Magazine Publishers Association shindig in New York this fall. The topic? Surfing on Google. It’s not nice to fool Mother Google.
Stephen Arnold, June 28, 2009
Facebook Streams
June 25, 2009
You will want to work through this somewhat disjointed discussion of Facebook in ReadWriteWeb’s “The Day Facebook Changed Forever: Messages to Become Public By Default.” For me the most important point was:
In time, though, people may very well decide they are comfortable with their social networking being public by default. That will be a different world, and today will have been one of the most important days in that new world’s unfolding.
The reason? More content flows to monitor and mine. Goodie. Love those social postings.
Stephen Arnold, June 26, 2009
Text Mining and Predicting Doom
June 23, 2009
The New Scientist does not cover the information retrieval sector. Occasionally the publication runs an article like “Email Patterns Can Predict Impending Doom” which gets into a content processing issue. I quite liked the confluence of three buzz words in today’s ever thrilling milieu: “predict”, “email”, and “doom”. What’s the New Scientist’s angle? The answer is that as tension within an organization increases, communication patterns in email can be discerned via text mining. The article hints that analysis of email is tough with privacy a concern. The article offers a suggestive reference to an email project at Yahoo, but provides few details. With monitoring of real time data flows available to anyone with an Internet connection, message patterns seem to be quite useful to those lucky enough to have the tools needed to ferret out the nuggets. Nothing about fuzzification of data, however. Nothing about which vendors are leaders in the space except for the Yahoo and Enron comments. I think there is more to be said on this topic.
Stephen Arnold, June 23, 2009