InQuira IBM Knowledge Assessment
July 18, 2009
A happy quack to the ArnoldIT.com goose who forwarded the InQuira Knowledge Assessment Tool link from one of my test email addresses. InQuira, a company formed from two other firms in the content processing space, has morphed into a knowledge company. The firm's natural language processing technology is under the hood, but the packaging has shifted to customer support and other sectors where search is an enabler, not the electromagnet.
The survey is designed to obtain information about my knowledge quotient. The URL for the survey is http://myknowledgeiq.com. The only hitch in the git-along is that the service seems to be timing out. You can try the survey assessment here; in my case, the system came back to life after a two-minute delay. My impressions as I worked through this Knowledge IQ test appear below:
Impressions as I Take the Test
InQuira uses some interesting nomenclature. For example, the company asks about customer service and a “centralized knowledge repository”. The choices include this filtering response:
Yes, individuals have personal knowledge repositories (e.g., email threads, folders, network shared drives), but there isn’t a shared repository.
I clicked this choice because distributed content seems to be the norm in my experience. Another interesting question concerns industry best practices. The implicit assumption is that a best practice exists. The survey probes for an indication of who creates content and who maintains the content once created. My hunch at this point in the Knowledge IQ test is that most respondents won't have much of a system in place. I think I see that I will have a low Knowledge IQ because I am selecting what appear to me to be reasonable responses, no extremes or categoricals like "none" or "all".

I note that some questions have default selections already checked. Ideal for the curious survey taker who wants to get to the "final" report. About midway through I get a question about the effectiveness of the test taker's Web site. In my experience, most organizations offer so-so Web sites. I will go with a middle-of-the-road assessment.

I am now getting tired of the Knowledge IQ test. I just answered questions about customer feedback opportunities. My experience suggests that most companies "say" feedback is desirable. Acting on the feedback is often a tertiary concern, maybe of even lower priority.
My Report
The system is now generating my report. Here's what I learned: my answers appear to put me in the middle of the radar chart. I have a blue diagram which gives me a personal Knowledge IQ.
Kapow Technologies
July 17, 2009
With the rise of free real time search systems such as Scoopler, Connecta, and ITPints, established players may find themselves in the shadows. Most of the industrial strength real time content processing companies like Connotate and Relegence prefer to be out of the spotlight. The reason is that their customers are often publicity shy. When you are monitoring information to make a billion on Wall Street or to snag some bad guys before those folks can create a disruption, you want to be far from the Twitters.
A news release came to me about an outfit called Kapow Technologies. The company described itself this way:
Kapow Technologies provides Fortune 1000 companies with industry-leading technology for accessing, enriching, and serving real-time enterprise and public Web data. The company’s flagship Kapow Web Data Server powers solutions in Web and business intelligence, portal generation, SOA/WOA enablement, and CMS content migration. The visual programming and integrated development environment (IDE) technology enables business and technical decision-makers to create innovative business applications with no coding required. Kapow currently has more than 300 customers, including AT&T, Wells Fargo, Intel, DHL, Vodafone and Audi. The company is headquartered in Palo Alto, Calif. with additional offices in Denmark, Germany and the U.K
I navigated to the company’s Web site out of curiosity and learned several interesting factoids:
First, the company is a “market leader” in open source intelligence. It has technology to create Web crawling “robots”. The technology can, according to the company, “deliver new Web data sources from inside and outside the agency that can’t be reached with traditional BI and ETL tools.” More information is here. Kapow’s system can perform screen scraping; that is, extracting information from a Web page via software robots.
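Kapow wraps this capability in a visual IDE, but the basic idea of a scraping "robot" is easy to sketch. Here is a minimal Python illustration using the requests and BeautifulSoup libraries. To be clear, this is my generic sketch, not Kapow's RoboSuite; the URL and the HTML selectors are hypothetical placeholders.

```python
# Minimal illustration of a screen-scraping "robot": fetch a page and pull
# structured records out of its HTML. Generic sketch only, not Kapow's
# product; the URL and selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def scrape_headlines(url: str) -> list[dict]:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    records = []
    # Assume each item of interest is an <article> with a headline and a link.
    for article in soup.select("article"):
        headline = article.find("h2")
        link = article.find("a")
        if headline and link:
            records.append({
                "title": headline.get_text(strip=True),
                "url": link.get("href"),
            })
    return records

if __name__ == "__main__":
    for item in scrape_headlines("https://example.com/news"):
        print(item["title"], "->", item["url"])
```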
Second, the company offers what it calls a “portal generation” product. The idea is to build new portals or portlets without coding. The company said:
With Kapow’s technology, IT developers [can]: Avoid the burden of managing different security domains; eliminate the need to code new transaction; and bypass the need to create or access SOA interfaces, event-based bus architectures or proprietary application APIs.
Third, the company provides a system that handles content migration and transformation. With transformation an expensive line item in the information technology budget, managing these costs becomes more important each month in today's economic environment. Kapow says here:
The module [shown below] acts much as an ETL tool, but performs the entire data extraction and transformation at the web GUI level. Kapow can load content directly into a destination application or into standard XML files for import by standard content importing tools. Therefore, any content can be migrated and synchronized to and between any web based CMS, CRM, Project Management or ERP system.
Kapow offers connections for a number of widely used content management systems, including Interwoven, Documentum, Vignette, and Oracle Stellent, among others.
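To make the extract-and-transform idea concrete, here is a small sketch that takes scraped records and writes them out as standard XML for import by a downstream content system. Again, this is my own illustration built on the Python standard library; the element names are placeholders, not a Kapow or vendor schema.

```python
# Sketch of the "transform to standard XML" step: take extracted records and
# write them as XML a content importer could pick up. Element names are
# illustrative, not a vendor schema.
import xml.etree.ElementTree as ET

def records_to_xml(records: list[dict], outfile: str) -> None:
    root = ET.Element("contentItems")
    for rec in records:
        item = ET.SubElement(root, "item")
        ET.SubElement(item, "title").text = rec.get("title", "")
        ET.SubElement(item, "sourceUrl").text = rec.get("url", "")
    ET.ElementTree(root).write(outfile, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    sample = [{"title": "Quarterly results", "url": "https://example.com/q2"}]
    records_to_xml(sample, "migration_batch.xml")
```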
Kapow includes a search function along with application programming interfaces, and a range of tools and utilities, including RoboSuite (a block diagram appears below):
Source: http://abss2.fiit.stuba.sk/TeamProject/2006/team05/doc/KapowTech.ppt
The Gilbane Lecture: Google Wave as One Environmental Factor
July 14, 2009
Author's note: In early June 2009, I gave a talk to about 50 attendees of the Gilbane content management systems conference in San Francisco. When I tried to locate the room in which I was to speak, the sign-in team could not find me on the program. After a bit of 30-something "we're sure we're right" output, the organizer of the session located me and got me to the room about five minutes late. No worries because the Microsoft speaker was revved and ready.
When my turn came, I fired through my briefing in 20 minutes and plopped down, expecting no response from the audience. Whenever I talk about the Google, I am greeted with either blank stares or gentle snores. I was surprised because I did get several questions. I may have to start arriving late and recycling more old content. Seems to be a winning formula.
This post is a summary of my comments. I will hit the highlights. If you want more information about this topic, you can get it by searching this Web log for the word “Wave”, buying the IDC report No. 213562 Sue Feldman and I did last September, or buying a copy of Google: The Digital Gutenberg. If you want to grouse about my lack of detail, spare me. This is a free Web log that serves a specific purpose for me. If you are not familiar with my editorial policy, take a moment to get up to speed. Keep in mind I am not a journalist, don’t pretend to be one, and don’t want to be included in the occupational category.
Here we go with my original manuscript written in UltraEdit from which I gave my talk on June 5, 2009, in San Francisco:
For the last two years, I have been concluding my Google briefings with a picture of a big wave. I showed the wave smashing a skin cancer victim, throwing surfer dude and surf board high into the air. I showed the surfer dude riding inside the "tube". I showed pictures of waves smashing stuff. I quite like the pictures of tsunami waves crushing fancy resorts, sending people in sherbet colored shirts and beach wear running for their lives.
Yep, wave.
Now Google has made public why I use the wave images to explain one of the important capabilities Google is developing. Today, I want to review some features of what makes the wave possible. Keep in mind that the wave is a consequence of deeper geophysical forces. Google operates at this deeper level, and most people find themselves dealing with the visible manifestations of the company’s technical physics.
Source: http://www.toocharger.com/fiches/graphique/surf/38525.htm
This is important for enterprise search for three reasons. First, search is a commodity and no one, not even I, finds key word queries useful. More sophisticated information retrieval methods are needed on the "surface" and in the deeper physics of the information factory. Second, Google is good at glacial movement. People see incremental actions that are separated in time and conceptual space. Then these coalesce and the competitors say, "Wow, where did that come from?" Google Wave, the present media darling, is a superficial development that combines a number of Google technologies. It is not the deep geophysical force, however. Third, Google has a Stalin-era type of planning horizon. Think in terms of five years, then you have the timeline on which to plot Google developments. Wave, in fact, is more than three years old if you start when Google bought a company called Transformics, older if you dig into the background of the Transformics technology and some other components Google snagged in the last five years. Keep that time thing in mind.
First, key word search is at a dead end. I have been one of the most vocal critics of key word search and variants of that approach. When someone says, “Key word search is what we need,” I reply, “Search is dead.” In my mind, I add, “So is your future in this organization.” I keep my parenthetical comment to myself.
Users need information access, not a puzzle to solve in order to open the information lock box. In fact, we have now entered the era of “data anticipation”, a phrase I borrowed from SAS, the statistics outfit. We have to view search in terms of social analytics because human interactions provide important metadata not otherwise obtainable by search, semantic, or linguistic technology. I will give you an example of this to make this type of metadata crystal clear.
You work at Enron. You get an email about creating a false transaction. You don't take action, but you forward the email to your boss and then ignore the issue. When Enron collapsed, the "fact" that you knew and did nothing, both when you first knew and subsequently, is used to make a case that you abetted fraud. You say, "I sent the email to my boss." From your prison cell, you keep telling your attorney the same thing. Doesn't matter. The metadata about what you did to that piece of information through time put your tail feather in a cell with a biker convicted of third degree murder and a prior for aggravated assault.
Got it?
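For the technically inclined, the kind of interaction metadata I have in mind can be modeled with a toy structure: a timeline of actions taken on a single piece of information. This is my illustrative sketch, not any vendor's system; the actors, timestamps, and actions are invented.

```python
# Toy illustration of interaction metadata: a timeline of actions taken on
# one piece of information. The actors, timestamps, and actions are invented;
# the point is that this trail, not the message text, shows who knew what and when.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class MessageEvent:
    timestamp: datetime
    actor: str
    action: str               # e.g., "received", "forwarded", "ignored"
    target: Optional[str]     # who the item went to, if anyone

history = [
    MessageEvent(datetime(2001, 3, 5, 9, 14), "analyst", "received", None),
    MessageEvent(datetime(2001, 3, 5, 9, 40), "analyst", "forwarded", "boss"),
    MessageEvent(datetime(2001, 3, 5, 9, 41), "analyst", "ignored", None),
]

for event in history:
    print(event.timestamp.isoformat(), event.actor, event.action, event.target or "")
```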
Hewlett Packard Extreme Scale Out for Extreme Cost
July 7, 2009
I read CBR Online’s June 11, 2009, article “HP Unveils new Portfolio for Extreme Scale-Out Computing” here. I decided to think about the implications of this announcement before offering some observations relative to the cost of infrastructure for search, content processing, and text mining licensees. The CBR Online staff did a good job of explaining the technical innovations of the system. One point nagged at me, however. I jotted down the passage that I have been thinking about:
HP said that the new ProLiant SL server family with consolidated power and cooling infrastructure and air flow design uses 28% less power per server than traditional rack-based servers. Its volume packaging also reduces the acquisition costs for customers who require thousands of server nodes.
Now this type of savings makes perfect sense. For example, if an online service were to experience growth to support 3.5 billion records and deliver 200 queries per second, under traditional search software and server architecture, the vendor may require up to 1,000 servers (dual processor, dual core, 32 GB of RAM). Lots of servers translates to lots of money. Let's assume that energy costs consume about 15 percent of an information technology group's budget.
A 28 percent reduction would trim about $0.28 from each dollar spent on power, which works out to roughly four cents of every overall IT budget dollar if energy runs at 15 percent of the budget. But what's the cost of these new servers? HP hasn't made prices available on its Web site. I was invited to call an 800 number to talk with an HP sales engineer, however. The problem is that information is growing more rapidly than the 28 percent savings, so I think this type of hardware solution is a placeholder. The math can be fun to work out, but the cost savings won't be material because exotic servers are incrementally better than what's available as a white box.

Google figured this out a long time ago and has looked for power savings in many different places. These include on board batteries instead of uninterruptable power supplies and software efficiencies. Using commodity gizmos gives Google, as I reported in my BearStearns' reports, a significant cost advantage that is not a fractional percentage improvement but a four X reduction; that is, $1.00 becomes $0.25. Google has an efficiency weapon that cannot be countered with name brand hardware delivering modest improvements. Just my opinion, gentle reader, except that BearStearns paid me for my research and then published it in two reports in 2006 and 2007. Old news, I know, but the newer stuff is available in Google: The Digital Gutenberg here.
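For readers who want to check my arithmetic, here is a back-of-the-envelope sketch comparing the 28 percent power saving with a four X infrastructure reduction. The dollar figures and budget splits are hypothetical plug numbers I chose for illustration, not real HP or Google pricing.

```python
# Back-of-the-envelope comparison: a 28 percent power saving versus a four X
# infrastructure cost reduction. Every figure here is a hypothetical plug
# number for illustration.
it_budget = 1_000_000.0        # assumed annual IT budget
energy_share = 0.15            # energy assumed at 15 percent of the budget
power_saving = 0.28            # HP's claimed per-server power reduction

energy_cost = it_budget * energy_share
hp_saving = energy_cost * power_saving
print(f"Energy spend: ${energy_cost:,.0f}")
print(f"28% power saving: ${hp_saving:,.0f} "
      f"({hp_saving / it_budget:.1%} of the total budget)")

# A four X reduction applies to the whole infrastructure line: every $1.00
# of hardware and power becomes $0.25.
infrastructure_share = 0.60    # assumed share of budget spent on infrastructure
infrastructure_cost = it_budget * infrastructure_share
print(f"4X reduction saving: ${infrastructure_cost * 0.75:,.0f}")
```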
PC World shed some light on the configuration options of these servers. James Niccolai’s “HP Designs New Server for Extreme Scale Out Computing” here said:
HP introduced three ProLiant SL models, all 2u deep, but it emphasized that the configurations are flexible, and HP will even work with large customers to design specific server boards, said Steve Cumings, an HP marketing director. The SL160z is for memory- and I/O-intensive applications and comes with up to 128GB of DDR3 memory. The SL170z is a storage-centric system that can hold six 3.5-inch drives. And the SL2x170z, for maximum compute density, can hold up to four quad-core processors, for a total of 672 processor cores in a 42u rack.
I found the write up “HP High Efficiency Cloud Infrastructure Servers, Moving Away from Blades, Learning from Data Center Operation” in Green Data Center Blog here quite interesting. The point the article made was that blades create some data center management problems. The article stated:
The problem with blades is high density computing created hot spots with problems airflows. But, this behavior to use blades was driven by chargeback models that used rack space occupied. Which artificially can bring down IT costs when in reality it increases costs. Just read the above quotes again, on how these latest servers are the most efficient.
The infrared snap included with the Green Data Center Blog makes the hot spots easy to “spot”, no pun intended:
To be fair, the same problem was evident when I visited an organization’s on premises data center with dozens of Dell blades stuffed into racks. The heat was uncomfortable and the air conditioning was unable to cope with the output.
Concept Searching Update
July 3, 2009
Founded in 2002, Concept Searching provides licensees with search, auto-classification, taxonomy management and metadata tagging solutions. You can download a fact sheet about the privately held firm here. The software can be used on an individual user's computer or mounted on servers to deliver enterprise solutions. The company's secret sauce is its statistical metadata generation and classification method. The technology uses concept extraction and compound term processing to facilitate access to unstructured information. The company operates from Stevenage in Hertfordshire. A list of the Concept Searching offices is here.
The company emphasizes the value of lateral thinking, and its approach to content analysis implements numerical recipes to find insights and linkages within unstructured text.
When I updated my profile for this company earlier this year, I noted that the firm had signed Portal Solutions, a company that focuses on all things Microsoft. The idea is to make it possible for a user to search for "insider dealing" and retrieve documents where that bound phrase does not appear but a related phrase such as "insider trading" does appear. This type of system appeals to intelligence officers and financial analysts. Concept Searching's methods also generate lists of related topics. You can see an example of the system in action by navigating to this page. I ran several test queries, and the interface provided useful information and suggestions about other related content in the processed corpus. A screen shot of the output appears below:
Concept Searching is a Microsoft and Fast Search partner. The idea is that Concept Searching's technology complements and in some cases extends the search and content processing services in Microsoft products. In May 2009, the company sponsored a best practices site for Microsoft SharePoint. The deal involves a number of companies, including SchemaLogic, KnowledgeLake, and K2 Technologies, among others. The site is supposed to go live in the next couple of weeks, but I don't have a url or a date at this time.
The company had a busy May, signing deals with Allianz Global Investors, Directory, and AT&T Government Solutions.
For me, the most interesting system that Concept Searching offers is its ability to generate and classify terms found in SharePoint documents into a taxonomy. The company has prepared a brief video that demonstrates this functionality. You can find the video here. The company’s approach does not require a separate index. Microsoft Enterprise Search can use the outputs of the Concept Searching system. I noted two “uniques” in the narrative to the video, and I remain skeptical about categorical affirmatives. I think the bound phrase extraction and the close integration with SharePoint are benefits. I just bristle when I hear “unique”, which means the one and only anywhere in the world. Broad assertion in my experience.
Concept Searching’s president, Martin Garland, said here:
Our intellectual property is still unique as we are the only statistical search technology able to indentify multi-word patterns within text and insert these patterns directly into the index at ingestion or creation time. We call this “Compound Term Processing”.
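Concept Searching does not disclose its algorithm, so take the following as a generic illustration only: a simple collocation score that finds word pairs occurring together more often than chance and treats them as compound index terms. This is not the company's Compound Term Processing; it just conveys the flavor of statistical multi-word extraction.

```python
# Generic sketch of statistical multi-word term extraction: find word pairs
# that co-occur more often than chance and treat them as compound index terms.
# NOT Concept Searching's proprietary method, just a simple collocation score.
import math
import re
from collections import Counter

def compound_terms(text: str, min_count: int = 2) -> list[tuple[str, float]]:
    words = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total = len(words)

    scored = []
    for (w1, w2), n12 in bigrams.items():
        if n12 < min_count:
            continue
        # Pointwise mutual information: observed versus expected co-occurrence.
        pmi = math.log((n12 * total) / (unigrams[w1] * unigrams[w2]))
        scored.append((f"{w1} {w2}", pmi))
    return sorted(scored, key=lambda item: item[1], reverse=True)

sample = ("insider trading cases often involve insider trading tips; "
          "regulators review insider trading patterns in trading data")
for term, score in compound_terms(sample):
    print(f"{term}: {score:.2f}")
```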
Last week I sat in a briefing given by one of Microsoft's enterprise search team. I thought I heard descriptions of functions that struck me as quite similar to those performed by Concept Searching and such companies as Interse in Copenhagen, Denmark.
I think it will be fruitful to watch what features and functions are baked into the upcoming Microsoft Fast ESP version of the old Fast Search & Transfer system. Remember: the roots of Fast Search stretch back to 1997, a year before Google poked its nose from the Stanford baby crib.
Partners like Concept Searching have invested significant resources in Microsoft technologies. Will Microsoft respect these investments, or will Microsoft, in an effort to recoup its $1.23 billion investment, take a hard line toward such companies as Concept Searching?
I am on the fence regarding this issue.
Stephen Arnold, July 3, 2009
Search Sucks: A Mini Case
June 30, 2009
I listened occasionally to the Gillmor Gang when it was available on iTunes. I noticed that the program disappeared, and I lost track of it. My RSS reader snagged a story about a verbal shoot out between the one-man TV network Leo Laporte and one of the participants in the Gillmor Gang. To make a long and somewhat confused story short, the show disappeared. I figured this would be a good topic to use to test Bing.com and Google.com. My premise was that neither service would be indexing the type of information about flaps in the wobbly world of real time content on the rich media Web.
I ran the query Gilmore Gang on Google and finally found a link to a story published on June 13, 2009, called "Hanging on for Dear Life." The problem with the Google results was that the top rated links were just plain wrong in terms of answering my query. Granted, I used a two word query, and I was purposely testing the Google system to see if it was sufficiently "smart" to figure out that I wanted current and accurate information. Well, in my opinion, it was like a promising student who stayed up late and did not do his homework. Here is the result list Google generated for me on June 28, 2009:
The result I wanted I found using other tools.
Google and Image Recognition
June 29, 2009
Not content with sophisticated image compression, Google continues to press forward in image recognition. Face recognition surfaced about a year ago. You can get some background about that home-grown technology in "Identifying Images Using Face Recognition", US2008/0130960, filed in December 2006. The company has a long history of interest in non text objects. If you are not familiar with Larry Page's invention, "Method for Searching Media", US2004/0122811, it was filed in 2003.
Source: Neven Technologies, 2006
The catalyst for the missing link between automatically identifying and processing images and assigning meaningful tags to images such as "animal" or "automobile" arrived via Google's purchase of Neven Vision. (Originally, I think, the company used the "Eyematic" name; the switch seems to have taken place in 2003 or 2004.)
At that time, All Business described the company in this way:
Neven Vision purchased Eyematic’s assets in July 2003. Dr. Hartmut Neven, one of the world’s leading machine vision experts, led the technical team that created the original Eyematic system. Dr. Neven is also developing groundbreaking “next generation” face and object recognition technologies at USC’s Information Sciences Institute (ISI).
With the acquisition, Google snagged the Eyematic patent documents. These make interesting reading, and I direct your attention to "Face Recognition from Video Images", US6301370, which seems to be part of the Neven technology suite. The US patent document is, ah, somewhat disjointed.
Mixing Picasa, home grown technology, and the image recognition technology from Neven, Google had the ingredients for tackling a tough problem in content processing; namely, answering the question, “What’s that a picture of?”
Google provided some information in June 2009. A summary of Google’s image initiative appeared in Silicon.com, which published “Google Gets a New Vision When It Comes to Pictures”. (Silicon.com points to CNet.com which originally ran the story.) Tom Krazit reported:
Google thinks it has made a breakthrough in “computer vision”. Imagine stumbling upon a picture of a beautiful landscape filled with ancient ruins, one you didn’t recognize at first glance while searching for holiday destinations online. Google has developed a way to let a person provide Google with the URL for that image and search a database of more than 40 million geotagged photos to match that image to verified landmarks, giving you a destination for that next trip. The project is still very much in the research stage, said Jay Yagnik, Google’s head of computer vision research.
For me the key point in the Silicon.com story was that Google used its "big data" approach to making headway in image recognition. When matched to technology evolving from the FERET program, Google can disrupt a potentially lucrative sector for some big government integration firms. The idea is that with lots of data, Google's "smart software" can figure out what an image is about. Tapping Google's clustering technology, Google has processed its Picasa image collection so that engineers can assign meaningful semantic tags to digital objects that don't contain text.
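Google has not published the internals of the landmark matcher, but the general pattern of big data image matching can be sketched: reduce each photo to a feature vector and find its nearest neighbors among millions of tagged vectors. The toy cosine-similarity version below uses invented vectors and labels; a production system would use learned visual descriptors and an approximate nearest-neighbor index.

```python
# Toy sketch of matching a query image against a tagged collection by comparing
# feature vectors. The vectors are invented stand-ins for real visual
# descriptors; Google's actual pipeline is not public.
import numpy as np

# Hypothetical "database" of feature vectors with landmark labels.
collection = {
    "Parthenon, Athens": np.array([0.91, 0.10, 0.35, 0.05]),
    "Machu Picchu, Peru": np.array([0.20, 0.88, 0.15, 0.40]),
    "Golden Gate Bridge": np.array([0.05, 0.30, 0.95, 0.12]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_vector: np.ndarray) -> tuple[str, float]:
    scores = {label: cosine(query_vector, vec) for label, vec in collection.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]

query = np.array([0.88, 0.12, 0.30, 0.07])   # descriptor of the unknown photo
label, score = best_match(query)
print(f"Closest landmark: {label} (similarity {score:.3f})")
```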
Arnold at NFAIS: Google Books, Scholar, and Good Enough
June 26, 2009
Speaker’s introduction: The text that appears below is a summary of my remarks at the NFAIS Conference on June 26, 2009, in Philadelphia. I talk from notes, not a written manuscript, but it is my practice to create a narrative that summarizes my main points. I have reproduced this working text for readers of this Web log. I find that it is easier to put some of my work in a Web log than it is to create a PDF and post that version of a presentation on my main Web site, www.arnoldit.com. I have skipped the “who I am” part of the talk and jump into the core of the presentation.
Stephen Arnold, June 26, 2009
In the past, epics were a popular form of entertainment. Most of you have read the Iliad, possibly Beowulf, and some Gilgamesh. One convention is that these complex literary constructs begin in the middle, or what my grade school teacher called "in medias res."
That's how I want to begin my comments about Google's scanning project, an epic, usually referred to as Google Books. Then I want to go back to the beginning of the story and then jump ahead to what is happening now. I will close with several observations about the future. I don't work for Google, and my efforts to get Google to comment on topics are ignored. I am not an attorney, so my remarks have zero legal foundation. And I am not a publisher. I write studies about information retrieval. To make matters even more suspect, I do my work from rural Kentucky. From that remote location, I note the Amazon is concerned about Google Books, probably because Google seeks to enter the eBook sector. This story is good enough; that is, in a project so large, so sweeping, perfection is not possible. Pages are skewed. Insects scanned. Coverage is hit and miss. But what other outfit is prepared to spend to scan books?
Let's begin in the heat of the battle. Google is fighting a number of things. Google finds itself under scrutiny from publishers and authors. These are the entities with whom Google signed a "truce" of sorts regarding the scanning of books. Increasingly, libraries have begun to express concern that Google may not be doing the type of preservation job to keep the source materials in a suitable form for scholars. Regulators have taken an interest in the matter because of the publicity swirling around a number of complicated business and legal issues.
These issues threaten Google with several new challenges.
Since its founding in 1998, Google has enjoyed what I would call positive relationships with users, stakeholders, and most of its constituents. The Google Books matter is now creating what I would describe as "rising tension". If the tension escalates, a series of battles can erupt in the legal arena. As you know, battle is risky when two heroes face off in a sword fight. Fighting in a legal arena is in some ways more risky and more dangerous.
Second, the friction of these battles can distract Google from other business activities. Google, as some commentators (including myself in Google: The Digital Gutenberg) have noted, may be vulnerable to new types of information challenges. One example is Google's absence from the real time indexing sector where Facebook, Twitter, Scoopler.com, and even Microsoft seem to be outpacing Google. Distractions like the Google Books matter could exclude Google from an important new opportunity.
Finally, Google’s approach to its projects is notable because the scope of the project makes it hard for most people to comprehend. Scanning books takes exabytes of storage. Converting images to ASCII, transforming the text (that is, adding structure tags), and then indexing the content takes a staggering amount of computing resources.
Inputs to outputs, an idea that was shaped between 1999 to 2001. © Stephen E. Arnold, 2009
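To suggest why the computing resources are staggering, here is a skeletal sketch of the scan-to-index steps: OCR a page image to text, wrap the text in structure tags, and add the terms to an index. It leans on the pytesseract library for the OCR step and is a schematic of the general approach, not Google's production pipeline; the file name and element names are placeholders.

```python
# Skeletal scan-to-index pipeline: OCR a page image, add minimal structure
# tags, and index the terms. A schematic sketch only; the real Google Books
# pipeline is vastly larger and not publicly documented.
from collections import defaultdict
import xml.etree.ElementTree as ET

from PIL import Image
import pytesseract

def ocr_page(image_path: str) -> str:
    """Convert a scanned page image to plain text."""
    return pytesseract.image_to_string(Image.open(image_path))

def tag_page(text: str, book_id: str, page_no: int) -> ET.Element:
    """Wrap the OCR output in simple structure tags (illustrative schema)."""
    page = ET.Element("page", attrib={"book": book_id, "number": str(page_no)})
    for para in filter(None, (p.strip() for p in text.split("\n\n"))):
        ET.SubElement(page, "p").text = para
    return page

def index_page(page: ET.Element, index: dict) -> None:
    """Add each term to a toy inverted index keyed by (book, page)."""
    location = (page.get("book"), page.get("number"))
    for para in page.findall("p"):
        for term in para.text.lower().split():
            index[term].add(location)

if __name__ == "__main__":
    inverted_index = defaultdict(set)
    page_xml = tag_page(ocr_page("scan_0001.png"), book_id="beowulf", page_no=1)
    index_page(page_xml, inverted_index)
    print(len(inverted_index), "distinct terms indexed")
```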
Google has been measured and slow in its approach. The company works with large libraries, provides copies of the scanned material to its partners, and has tried to keep moving forward. Microsoft and Yahoo, database publishers, the Library of Congress, and most libraries have ceded the scanning of books work to Google.
Now Google finds itself having to juggle a large number of balls.
Now let’s go back in time.
I have noticed that most analysts peg the Google Books project as starting right before the initial public offering in 2004. That's not what my research has revealed. Google's interest in scanning the contents of books reaches back to 2000.
In fact, an analysis of Google’s patent documents and technical papers for the period from 1998 to 2003 reveals that the company had explored knowledge bases, content transformation, and mashing up information from a variety of sources. In addition, the company had examined various security methods, including methods to prevent certain material from being easily copied or repurposed.
The idea, which I described in The Google Legacy (written in 2003 and 2004 and published in early 2005), was to gather a range of information, process that information using mathematical methods in order to produce useful outputs like search results for users, and generate information about the information. The word given to describe value added indexing is metadata. I prefer the less common but more accurate term meta indexing.
Lucid Meet Up: Open Source Search Draws Crowd
June 23, 2009
I was in San Francisco the day of the open source Lucene meet up sponsored by Lucid Imagination. The New Idea Engineering Web log wrote a useful summary of what transpired. You can find “Impressions of First Lucene / Solr Meet Up” on the Enterprise Search Blog. Keep in mind that the founders of the Enterprise Search Blog liked the study “Successful Enterprise Search Management” Martin White and I wrote. People who like what I do may have unusual tolerance for addled geese. You have been warned.
I noted the upside and downside of a technical meet up, but I wanted to know more. I chased down David Fishman, one of the spark plugs for Lucid Imagination. You can read an interview with one of the founders of Lucid Imagination, Marc Krellenstein, in the ArnoldIT.com “Search Wizards Speak” series.
I came away from my discussion with Mr. Fishman more than a little impressed. Some of the items that remained pinned to my brain’s search bulletin board warrant sharing.
First, open source is hot. Few information technology professionals want to go to a meeting about search without first hand information about Apache Lucene (http://lucene.apache.org/) and Solr.
Second, Lucid Imagination (www.lucidimagination.com) is gaining traction with its industrial strength approach to the open source search technology that promises relief from the seven figure licensing fees imposed by some of the high profile search and retrieval vendors.
The meet up brought together almost 50 engineers and programmers on June 3. Featured speakers included Grant Ingersoll, of Lucid Imagination and the Apache Lucene project development team, as well as Erik Hatcher, author of Lucene in Action, also of the Apache Lucene project development team and, with Ingersoll, a co-founder of Lucid Imagination. Jason Rutherglen and Jake Mannix of LinkedIn talked about how they've implemented search at the core of their cutting edge social network. Other speakers talked about a range of deep search questions, including numeric search, aka Trie Range queries. Avi Rappoport, a search consultant, talked about the approach to "stop words", encouraging search application developers not to ignore words like "the", "in", and the like, given the power of today's compute resources to deal with such nuances.
Back to Lucid: Grant Ingersoll’s talk focused on innovations in Solr 1.4, the forthcoming release of the search platform built around the Lucene Search engine. While there are a good number of important new features, including Trie-range queries for better searching of numeric data, and advanced replication and better logging for improved scalability and deployment, that’s just the latest in a string of enterprise grade innovations that the open source community has rolled together, closing the gap with many, if not most, of the meaningful technology features of commercial enterprise search software. Erik Hatcher spoke about a new search engine for search developers (http://search.lucidimagination.com) that Lucid sponsors for the community, using Lucene and Solr technology to plow through the abundant discussions and technical info created over the years — providing faster troubleshooting and education than programmers could get before.
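For readers who have not touched Solr, the numeric range idea is simple to show: a query against Solr's HTTP interface with a range filter over a numeric field of the kind the new Trie types are designed to speed up. The host, core, and field names below are placeholders for whatever a given deployment uses.

```python
# Minimal Solr query via the HTTP API: a keyword query plus a numeric range
# filter of the sort Trie fields accelerate. Host, core, and field names are
# placeholders, not a real deployment.
import requests

SOLR_SELECT = "http://localhost:8983/solr/products/select"

params = {
    "q": "laptop",                 # keyword query
    "fq": "price:[500 TO 1500]",   # numeric range filter
    "rows": 10,
    "wt": "json",
}

response = requests.get(SOLR_SELECT, params=params, timeout=10)
response.raise_for_status()
for doc in response.json()["response"]["docs"]:
    print(doc.get("name"), doc.get("price"))
```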
There were three takeaways from the meeting, according to David Fishman, who does marketing for Lucid Imagination. The breadth and depth of the search problem set means that it’s not going to be solved by one company or one set of people; the active, engaged open source community is constantly adding and innovating new features, putting them through their paces, and pushing the frontier faster than any single company could.
The technology upon which open source search rests is as good or maybe better than some of the commercial products’ code base. Many hands and many eyes mean that the gotchas hiding in some of the high profile brands’ products are not going to jump out and bite an administrator.
That demand is real: innovative companies, as different as IBM, Zappos, Netflix, LinkedIn, Digg, AOL, MySpace, Apple, Comcast Interactive and more — all these have built mission critical search services at the core of their business using this technology. The people who came to this meet up, and one just like it two weeks earlier in Reston, Virginia (http://www.meetup.com/NOVA-Lucene-Solr-Meetup/) are part of that rapidly accelerating adoption curve, since there's no need to call a salesperson or schedule a demo to get started — the community lowers the barriers to experimentation and participation.
Not least important is what wasn't covered, said Fishman. Innovation is half the battle; the other, reliability. As Mark Bennett observed on his blog, this meet up was not the crowd that keeps datacenter and IT managers sleeping soundly through the night. Commercial grade reliability comes from a commercial-grade company with the expertise to help get it working and keep it working. And having talked to the Lucid Imagination team, they not only "get" search. They "get" service level agreements. That may be one reason why they're in the business of offering commercial grade support for these technologies.
To sum up, what strikes me as new is that Lucid’s pool of engineers is available to help — many of them, the same engineers who help write the code and manage the innovations with the Apache Lucene community. What the IT guys get by working with Lucid is the combination of innovation with peace of mind and better control of customization and maintenance.
My hunch is that a company with a search system is going to invest in professional services for support no matter what search solution it deploys. Even if open source makes it easy to get search, it takes expertise to get search right.
If I know Marc Krellenstein, the Lucid Imagination team will be able to deliver that expertise at competitive rates. Certainly, the range of companies represented suggest that open source search is moving toward center stage.
Can open source search gain traction in the enterprise? In some organizations, the answer is, "Yes."
Open source search is here and Lucene/Solr promises to push beyond simple search and retrieval.
Stephen Arnold, June 23, 2009
A Glimpse of the Google Collaborative Plumbing
June 19, 2009
On June 18, 2009, the ever efficient US Patent & Trademark Office published US2009/0157608, “Online Content Collaboration Model”, a patent document filed by the Google in December 2007. With Wave in demo mode, I found this document quite interesting. Your mileage may vary because you may see patent documents as flukes, hot air, or legal eagle acrobatics. I am not in concert with that type of thinking, but if you are, navigate to one of the Twitter search engines. That content will be more comfortable.
The inventors were two Googlers, William Strathearn and Michael McNally, neither identified as part of the Australian team responsible for Wave. I like to build little family trees of Googlers who work on certain projects. Mr. Strathearn seems to have worked on the Knol team, which works on collaboration and knowledge sharing. Mr. McNally, another member of the Knol team, has written a Knol about himself, which is at this time (June 19, 2009) online as a unit of knowledge.
The two Googlers wrote:
A collaborative editing model for online content is described. A set of suggested edits to a version of the online content is received from multiple users. Each suggested edit in the set relates to the same version. The set of suggested edits is provided to an authorized editor, who is visually notified of differences between the version of the content and the suggested edits and conflicts existing between two or more suggested edits. Input is received from the editor resolving conflicts and accepting or rejecting suggested edits in the set. The first version of the content is modified accordingly to generate a second version of the content. Suggested edits from the set that were not accepted nor rejected and are not in conflict with the second version are carried over and can remain pending with respect to the second version.
What’s happening is that the basic editorial system for Knol and other Google products gets visual cues, enhanced work flow, and some versioning moxie.
Figure 2 from US2009/0157608
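The claim language is abstract, so here is a toy model of the workflow the patent describes: collect suggested edits against one version, let an authorized editor accept or reject them, and carry unresolved, non-conflicting suggestions forward to the next version. The class and field names are mine, not Google's.

```python
# Toy model of the collaborative-edit workflow in US2009/0157608: pending
# suggestions against one version, editor accept/reject, and carry-over of
# unresolved, non-conflicting suggestions. Illustrative only; not Google's code.
from dataclasses import dataclass, field

@dataclass
class SuggestedEdit:
    author: str
    region: str        # identifier for the span of content being changed
    new_text: str

@dataclass
class Document:
    version: int
    content: dict                                 # region id -> text
    pending: list = field(default_factory=list)

    def suggest(self, edit: SuggestedEdit) -> None:
        self.pending.append(edit)

    def review(self, accepted: list) -> "Document":
        """Editor accepts a subset; apply those edits and carry over pending
        suggestions that do not conflict with the regions just changed."""
        new_content = dict(self.content)
        changed = set()
        for edit in accepted:
            new_content[edit.region] = edit.new_text
            changed.add(edit.region)
        carried = [e for e in self.pending
                   if e not in accepted and e.region not in changed]
        return Document(self.version + 1, new_content, carried)

doc = Document(1, {"intro": "Knol is a unit of knowledge."})
doc.suggest(SuggestedEdit("user_a", "intro", "A Knol is a unit of knowledge."))
doc.suggest(SuggestedEdit("user_b", "body", "Add a history section."))
doc2 = doc.review(accepted=[doc.pending[0]])
print(doc2.version, doc2.content["intro"], len(doc2.pending), "edit(s) carried over")
```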
Is this a big deal? Well, I think that some of the big content management players will be interested in Google’s methodical enhancement of its primitive CMS tools. I also think that those thinking of Wave as a method for organizing communications related to a project might find these systems and methods suggestive as well.