MarkLogic: The Army’s New Information Access Platform
August 13, 2008
You probably know that the US Army has nicknames for its elite units: Screaming Eagle, Big Red One, and my favorite, “Hell on Wheels.” Now some HUMINT, COMINT, and SIGINT brass may create a MarkLogic unit with its own flash. Based on the early reports I have, the MarkLogic system works.
Based in San Carlos (next to Google’s Postini unit, by the way), MarkLogic announced that the US Army Combined Arms Center (CAC) at Ft. Leavenworth, Kansas, has embraced MarkLogic Server. BCKS, shorthand for the Army’s Battle Command Knowledge System, will use this next-generation content processing and intelligence system for the Warrior Knowledge Base. Believe me, when someone wants to do you and your team harm, access to the most timely, on-point information is important. If Napoleon were based at Ft. Leavenworth today, he would have this unit report directly to him. Information, the famous general is reported to have said, is nine tenths of any battle.
Ft. Leavenworth plays a pivotal role in the US Army’s commitment to capture, analyze, share, and make available information from a range of sources. MarkLogic’s technology, which has the Department of Defense Good Housekeeping Seal of Approval, delivers search, content management, and collaborative functions.
An unclassified sample display from the US Army’s BCKS system. Thanks to MarkLogic and the US Army for permission to use this image.
The system applies metadata based on the DOD Metadata Specification (DDMS), the schema standard used across the DOD community. Content is managed automatically by applying metadata properties such as the ‘Valid Until’ date, and MarkLogic Server manages the workflow until a file is transferred to archives or deleted by the content manager. MarkLogic points to savings in time and money. My sources tell me that the system can reduce the risk to service personnel. So, I’m going to editorialize and say, “The system saves lives.” More details about the BCKS are available here. Dot Mil content does move, so click today. I verified this link at 0719, August 13, 2008.
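To make the ‘Valid Until’ idea concrete, here is a minimal sketch of metadata-driven lifecycle handling. The document records, dates, and grace-period policy are assumptions for illustration only; MarkLogic Server does this work through its DDMS-aware schema and workflow features, not through code like this.

```python
# Illustration only: a "Valid Until" metadata property drives what happens to
# each document. Records, dates, and the grace period are hypothetical.
from datetime import date, timedelta

documents = [
    {"id": "field-report-101", "valid_until": date(2008, 9, 1)},
    {"id": "lessons-learned-77", "valid_until": date(2008, 8, 1)},
    {"id": "sop-convoy-ops", "valid_until": date(2009, 1, 15)},
]

def lifecycle_action(doc, today=date(2008, 8, 13), grace=timedelta(days=30)):
    """Decide a document's fate from its Valid Until metadata."""
    if doc["valid_until"] >= today:
        return "keep active"
    if doc["valid_until"] >= today - grace:
        return "flag for content manager review"
    return "transfer to archive"

for doc in documents:
    print(doc["id"], "->", lifecycle_action(doc))
```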
Search Fundamentals: Cost
August 10, 2008
Set aside the fancy buzzwords like taxonomies, natural language processing, and automatic classification. I want to relate one anecdote from a real-life conversation last week and then review five search fundamentals.
Anecdote
I’m sitting in a fancy conference room near Tyson’s Corner. The subject is large-scale information systems, not search. But search was assumed to be a function that would be available to the larger online system. And that’s where the problem with search fundamentals became a time bomb. The people in the room assumed that search was not a problem. One could send an email to one of the 300 vendors in the search and content processing market, negotiate a licensing deal, install the software, and move on to more important activities. After all, search was a mud flap on a very exotic sports car. Who gets excited about mud flaps?
The situation is becoming more and more common. I think it is a consequence of Googling. Most of the people with whom I meet in North America use Google for general Web search. The company’s name has become a verb, and the use of Google is becoming more ubiquitous each day. If I open Firefox, I have a Google search box available at all times.
If Google works, how hard can search be?
Five Fundamentals
I have created a table that lists five search fundamentals. Feel free to scan it, even recycle it in your search procurement background write ups. I want to make a few comments about each fundamental and then wrap up this essay with what seems to me to be an obvious caution. Table after jump.
Microsoft BrowseRank Round Up
August 8, 2008
BrowseRank is a Microsoft-developed method of computing page importance for Web search, positioned to compete with Google’s PageRank.
The computations are based upon user behavior data and algorithms to “leverage hundreds of millions of users’ implicit voting on page importance.” (So says a Microsoft explanatory paper [http://research.microsoft.com/users/tyliu/files/fp032-Liu.pdf]). The whole point is to add “the human factor” to search to bring up more results people actually want to see.
On July 27, SEO Book posted a review/opinion [http://www.seobook.com/microsoft-search-browserank-research-reviewed] after Steve posted about BrowseRank here [http://arnoldit.com/wordpress/2008/07/26/microsofts-browser-rank/]. Summary: While it’s a good idea, there are drawbacks like false returns because of heavy social media traffic, link sites, etc. Sites like Facebook, MySpace, and YouTube are popping up high on the list – not because they have good, solid, popular information, but just because they’re high traffic. Microsoft will have to combine its BrowseRank user feedback information with other data for the rankings to be really useful. On the other hand, if Microsoft can collect this user data over a longer term, the info would more likely pan out. For example, BrowseRank will measure time spent on a site to help determine importance and relevance.
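To make the dwell-time idea concrete, here is a simplified sketch: estimate how often users land on each page from click trails, then weight that landing probability by average time spent. The session data is invented, and the damped power iteration is a crude stand-in for the continuous-time Markov model in the Microsoft paper, not Microsoft’s implementation.

```python
# A toy version of the dwell-time idea, with invented session data.
from collections import defaultdict

# Hypothetical browsing sessions: ordered (page, seconds_spent) pairs.
sessions = [
    [("news.example.com", 120), ("social.example.com", 15), ("blog.example.com", 90)],
    [("social.example.com", 10), ("news.example.com", 200)],
    [("blog.example.com", 60), ("news.example.com", 150), ("social.example.com", 5)],
]

transitions = defaultdict(lambda: defaultdict(float))  # click-through counts
dwell_total = defaultdict(float)                       # total seconds per page
visits = defaultdict(int)                              # visit counts per page

for session in sessions:
    for i, (page, seconds) in enumerate(session):
        dwell_total[page] += seconds
        visits[page] += 1
        if i + 1 < len(session):
            transitions[page][session[i + 1][0]] += 1.0

pages = sorted(visits)

# Damped power iteration over the click graph: roughly, how often users
# arrive at each page.
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85
for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for src in pages:
        out = sum(transitions[src].values())
        if out == 0:  # page with no recorded outbound clicks: spread mass evenly
            for p in pages:
                new_rank[p] += damping * rank[src] / len(pages)
        else:
            for dst, count in transitions[src].items():
                new_rank[dst] += damping * rank[src] * count / out
    rank = new_rank

# Weight arrival probability by average dwell time, then normalize: pages that
# users both reach often and stay on score highest.
raw = {p: rank[p] * (dwell_total[p] / visits[p]) for p in pages}
total = sum(raw.values())
for page, score in sorted(raw.items(), key=lambda kv: -kv[1]):
    print(f"{page:22s} {score / total:.3f}")
```

In this toy data, the high-traffic but low-dwell page drops to the bottom even though users hit it about as often as the top page, which is the behavior the approach is after.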
A blog post on WebProNews [http://www.webpronews.com/topnews/2008/07/28/browserank-the-next-pagerank-says-microsoft] on July 28 said flat out: “It shouldn’t be the links that come in, but the time spent browsing a relevant page, that should help determine where a page ranks for a given query.” So that idea lends some credence to BrowseRank’s plan. The next step is how Microsoft will acquire all that information – obviously through things like their Toolbar, but what else? (Let’s ignore, for now, screams about Internet browsing privacy.) If MSN’s counting on active participation from users, it won’t work. This blog post points out that “Google’s PageRank succeeds partially due to its invisibility.” And that’s what users expect.
Graphic from Microsoft Research Asia
For now, and granted there’s only this small bit of info out there, SEO Book says, in its opinion, that PageRank (Google’s product) has a leg up on Microsoft because it sorts informational links higher, connects them to Google’s advertising, and gives Google the ability to manipulate the information.
You can read this for more info on Microsoft vs. Google: CNET put out a pretty substantial article [http://news.cnet.com/8301-1023_3-9999038-93.html] on July 25 talking about PageRank vs. BrowseRank and what Microsoft hopes to accomplish.
Google Search Appliance: Showing Some Fangs
August 6, 2008
Assorted wizards have hit the replay button for Google’s official description of the Google Search Appliance (GSA).
If you missed the official highlights film, here’s a recap:
- $30,000 starting price, good for two years, “support,” and a 500,000-document capacity. The bigger gizmos can each handle 10 million documents. These work like Christmas tree lights: when you need more capacity, just buy more GSAs and plug them in. (A rough sketch of this scale-out pattern appears after this list.) This is the same type of connectivity “big Google” enjoys when it scales.
- Group personalization; for example, marketing wizards see brochure-type information and engineers see documents with equations.
- Metadata extraction so you can search by author, department, and other discovered index points.
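As promised in the first bullet, here is a rough sketch of the Christmas-tree-light scaling pattern: assign documents to appliances by hashing, fan a query out to every box, and merge the answers. The node names and hashing scheme are assumptions for illustration; Google does not document the GSA’s internal partitioning, so treat this as the general pattern rather than the product’s mechanics.

```python
# Illustration of the general scale-out pattern only; not how the GSA
# actually partitions its index or merges results internally.
import hashlib

APPLIANCES = ["gsa-1", "gsa-2", "gsa-3"]  # hypothetical appliance nodes

def home_appliance(doc_url: str) -> str:
    """Assign each document to one appliance by hashing its URL."""
    digest = hashlib.md5(doc_url.encode()).hexdigest()
    return APPLIANCES[int(digest, 16) % len(APPLIANCES)]

# Each appliance indexes only its slice of the collection.
index = {name: [] for name in APPLIANCES}
for url in ["http://intranet/a", "http://intranet/b", "http://intranet/c",
            "http://intranet/d", "http://intranet/e"]:
    index[home_appliance(url)].append(url)

def search(query: str):
    """Fan the query out to every appliance and merge the partial answers."""
    merged = []
    for name in APPLIANCES:
        # Stand-in for a real per-appliance query call.
        merged.extend((name, url) for url in index[name] if query in url)
    return merged

print({name: len(docs) for name, docs in index.items()})
print(search("intranet"))
```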
If you want to jump right into Google’s official description, just click here. You can even watch a video about Universal Search, which is Google’s way of dancing away from the far more significant semantic functionality that will be described in a forthcoming white paper from a big consulting firm. This forthcoming report–alas–costs money, and it even contains my name in very small type as a contributor. Universal Search was the PR flash created for Google’s rush Searchology conference not long after an investment bank published a detailed report of a far larger technical search initiative (Programmable Search Engine) within the Googleplex. For true Google watchers, you will enjoy Google’s analysis of complexity. The title of the video is a bit of Googley humor because, when it comes to enterprise or behind-the-firewall search, complexity is really not that helpful. Somewhere between 50 and 75 percent of the users of a search system are dissatisfied with it. Complexity is one of the “problems” Google wants to resolve with its GSA.
When you buy the upscale versions of the GSA, you can implement fail over to another GSA. GSAs can be distributed geographically as well. The GSA comes with support for various repositories such as EMC Documentum. This means that the GSA can index the Documentum content without custom coding. The GSAs support the OneBox API, which is an important component in Google’s enterprise strategy. A clever programmer can use the GSA to create Vivisimo-style federated search results, display live data from a Microsoft Exchange server so a “hit” on a person shows that person’s calendar, integrate Web and third-party commercial content with the behind-the-firewall information, and perform other important content processing tasks.
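Here is a minimal sketch of the kind of federated assembly just described: blend appliance hits with a live lookup from another system so a query on a person can surface a calendar entry next to document matches. The data sources, field names, and the crude name check are hypothetical stand-ins; a real integration would go through the OneBox API rather than code like this.

```python
# Hypothetical federated-results sketch; not the OneBox API itself.
from dataclasses import dataclass
from typing import List

@dataclass
class Hit:
    title: str
    url: str
    source: str
    score: float

def search_appliance(query: str) -> List[Hit]:
    # Stand-in for a call to the appliance's XML/HTTP front end.
    return [
        Hit("Q3 sales contract", "http://intranet/docs/contract-q3", "appliance", 0.92),
        Hit("Q3 sales deck", "http://intranet/docs/q3-deck", "appliance", 0.81),
    ]

def lookup_calendar(person: str) -> List[Hit]:
    # Stand-in for a live query against a calendar or directory system.
    return [Hit(f"{person}: free 2-3pm today", f"http://calendar/{person}", "calendar", 1.0)]

def federated_results(query: str) -> List[Hit]:
    """Interleave appliance hits with live data so a query on a person can
    surface that person's calendar next to document matches."""
    hits = search_appliance(query)
    if query.istitle():  # crude "looks like a name" check, illustration only
        hits = lookup_calendar(query) + hits
    return sorted(hits, key=lambda h: -h.score)

if __name__ == "__main__":
    for h in federated_results("Smith"):
        print(f"[{h.source:9s}] {h.title} -> {h.url}")
```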
Google happily names some of its larger customers, including Adobe Systems, Kimberly-Clark, and Sunnybrook Health. The company does not, however, mention the deep penetration of the GSA into government agencies, police organizations, and universities.
Good “run the game plan” write ups are available from CNet here, my favorite TechCrunch with Eric Schonfeld’s readable touch here, and the “still hanging in there” eWeek write up here.
After registering for the enterprise videos, you will see this splash page. You can get more information about the upgrade to Version 5 of the GSA.
My Take
Now, here’s my take on this upgrade:
First, Google is responding to demands for better connectivity, more administrative control, and better security. With each upgrade to the GSA, Google has added features that have been available for a quarter century from outfits like Verity (now part of the Autonomy holdings). The changes are important because Google is often bad-mouthed for offering a poor enterprise search solution. With this release, I am not so sure that the negatives competitors heap on these cheerful yellow boxes are warranted. This version of the GSA is better than most of the enterprise search appliances with which I am familiar and a worthy competitor where administrative and engineering resources are scarce.
SharePoint: Anyone Not Baffled, Please, Stand Up
August 5, 2008
For years–even before I wrote the first three editions of CMSWatch’s Enterprise Search Report–I have been pointing out that enterprise search in general is not so useful and Microsoft enterprise search in particular is in the bottom quartile of the 300 or so “enterprise search” offerings available.
In a sense, it’s gratifying that youngsters are starting to look at the reality of information in an organizational setting and asking, “What’s wrong with these vendors and their systems?” You can get a dose of the youth movement in what I call search realism here. Shawn Shell, who clearly knows enterprise search, identifies some of the wackiness that Microsoft employees routinely offer about enterprise search or what I call “behind the firewall” search. I am pleased with the well-crafted article and its pointing out that Microsoft has a bit of work to do. I find it amazing that, four years after the first edition of Enterprise Search Report, old information is rediscovered and made “new” again.
Even more astounding is the Microsoft news release about the Fast Search & Transfer acquisition, which became official on August 4, 2008. You can read the full text of this news release, as reported in AMEinfo here. AMEinfo quoted Patrick Beeharry, Server and Product Marketing Manager for SharePoint in the Middle East and Africa, as saying:
‘With our companies combined, we are uniquely positioned to offer customers what they have been telling us they want most – a strategy for meeting everything from their basic to most complex enterprise search needs. We are pleased to have the talented team from FAST joining us here in the Middle East. Together we aim to deliver better technologies that will make enterprise search a ubiquitous tool that is central to how people find and use information.’
Okay, Microsoft is offering a strategy. I don’t know if a strategy will address the problems of information access in an organization. Vivisimo’s white paper takes this angle, and I think that the cost issues I raised are fundamental to a strategy, but I may be wrong. Maybe a strategy is going to tame the search monster and the 50 to 75 percent of the users who are annoyed with their existing search and retrieval system.
I suppose I was not surprised to read in To the SharePoint: The SharePoint IT Pro Documentation Team Blog the essay, “Which Microsoft Search Product Is for You?” You must read this stellar essay here. For me, the key point was this table:
You can see the original here if this representation is too small. The point is not to read the table. My point is to look at the cells. The table has 35 cells with the symbol Ö and seven cells with no data. Of the table’s 54 cells, only seven have data. For me, the table is useless, but you may have a mind meld with the SharePoint team and intuitively understand that “High availability and load balancing” is NULL for Search Server Express and Ö for Search Server 2008 and Office SharePoint Server 2007. How about a key to the NULL cells and the Ö thingy? (For more careless Microsoft Web log antics, click here. The basics of presenting information in tables seem to be a skill that some Microsoft professionals lack.)
Er, what about Fast Search & Transfer? The day this Web log posting appeared, Microsoft officially owned Fast Search, but it seems to me that the author was not aware of this $1.2 billion deal, had not read the news story referenced above, or conveniently overlooked how Fast Search fits into the Microsoft search solution constellation. I can think of other reasons for the omission, but you don’t need me to tell you that communication seems to be a challenge for some large organizations.
The net net is that Microsoft has many search technologies; for example:
- Powerset
- Fast Search & Transfer (Web indexing and behind the firewall indexing)
- Vista search
- Live.com search
- The SharePoint “flavors”
- SQLServer “search”
- Microsoft Dynamics “search”
- Legacy search in Windows XP, Outlook Express (my heavens), and good old Outlook 2000 to 2007.
The word confusion does not capture the state of Microsoft’s search products. Microsoft has moved search into a manifestation of chaos. If I’m correct, licensees need to consider the boundary conditions of these many search systems. Hooking these together and making them stable may be fractal, not a good thing for a licensee wanting to make information accessible to employees. The cost of moving some of these search systems’ functions to the cloud may be resource intensive. I wanted to write impossible, but maybe Microsoft and its earnest Web log writers can achieve this goal? I hope so. Failure only amps the Google electromagnet to pull more customers from Microsoft and into the maw of Googzilla.
I am delighted to be over the hill. When senility finally hits me, I won’t have to struggle through today’s ankle biters making the old new again or describing symptoms, not diagnosing the disease. Don’t agree? Set me straight. Agree? You are too old to be reading Web logs, my friend.
Stephen Arnold, August 5, 2008
Vivisimo: Organizations Need a Search Strategy
August 3, 2008
Vivisimo, a company benefiting from the missteps of better known search vendors, has a new theme for its Fall sales push. Jerome Pesenti, chief scientist for Vivisimo, delivered a lecture called “Thinking Outside the (Search) Box”. The company issued a news release about the need for an organization to have an enterprise search strategy in order to prove the return on investment for a search system. What is remarkable is that–like Eric Schmidt’s opinions about how other companies should innovate here–scientists are providing consulting guidance. MBAs, accountants, and lawyers have long been the business gurus to whom challenged organizations turned for illumination. Now, a Ph.D. in math or a hard science provides the foundation for giving advice and counsel. Personally I think that scientists have a great deal to offer many of today’s befuddled executives. You will want to download the presentation here. You will have to register. I think that the company will use the names to follow up for marketing purposes, but no one has contacted me since I registered as Ben Kent, a name based on the names of beloved pets.
Is Vivisimo’s ROI Number Right?
For me, the key point in the Vivisimo guidance (and I am paraphrasing, so your take may be different from mine) is that an organization needs to consider user needs when embarking on an enterprise search procurement. Mr. Pesenti reveals that a search strategy plus the Vivisimo Velocity system saved Modine Manufacturing an estimated $3.5 million. You can learn more about Modine here. The company has about $1.8 billion in revenue in 2008, and it may punch through the $2.0 billion barrier in 2009. I know that savings are important, but when I calculated the savings as a percent of revenue, I got a small number: roughly two tenths of one percent. The payoff from search seems modest, but the $3.5 million is “large” in terms of the actual license fee and the estimated ROI. My thought is that if a mission critical system yields savings equal to less than one percent of revenue, I would ask these questions:
- How much did the search system cost fully loaded; that is, staff time, consultants, license fees, and engineering?
- What’s the ongoing cost of maintaining and enhancing a search system; that is, when I project costs outwards for two years, a reasonable life for enterprise software in a fast moving application space, what is that cost?
- How can I get my money back? What I want as a non-scientific consultant and corporate executive is a “hard” number directly tied to revenue or significant savings. If I am running a $2.0 billion per year company, I need a number that does more than twiddle the least significant digits. I need hundreds of millions to keep my shareholders happy and my country club membership.
Enterprise search vendors continue to wrestle with the ROI (MBA speak for proving that spending X returns Y cash) for content processing. Philosophically search makes good business sense. In most organizations, an employee can’t do “work” unless he or she can find electronic mail, locate an invoice, or unearth the contract for a customer who balks at paying his bill. One measure of the ROI of search is Sue Feldman’s and her colleagues’ approach. Ms. Feldman, a pretty sharp thinker, focuses on time; that is, an employee who requires 10 minutes to locate a document rooting through paper folders costs the company 10 minutes worth of salary. Replace the paper with a search system from one of the hundreds of vendors selling information retrieval, and you can chop that 10 minutes down to one minute, maybe less.
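A back-of-the-envelope version of that time-based calculation, using made-up inputs, looks like the sketch below. Every number is an assumption chosen for illustration, which is exactly the rub: the savings line is easy to inflate, and, as the next section argues, the cost line is the hard one to pin down.

```python
# Hypothetical inputs for a time-based ROI estimate; illustration only.
employees = 2_000                 # staff who search for documents
searches_per_day = 4              # per employee
minutes_saved_per_search = 9      # e.g., 10 minutes of rooting around cut to 1
loaded_cost_per_hour = 60.0       # salary plus overhead, in dollars
working_days = 230

minutes_saved = employees * searches_per_day * minutes_saved_per_search * working_days
annual_savings = (minutes_saved / 60) * loaded_cost_per_hour

annual_system_cost = 1_500_000.0  # license, hardware, staff, consultants (fully loaded)

print(f"Estimated annual savings: ${annual_savings:,.0f}")
print(f"Estimated annual cost:    ${annual_system_cost:,.0f}")
print(f"Simple ROI:               {(annual_savings - annual_system_cost) / annual_system_cost:.0%}")
```

Run with these inputs, the formula spits out roughly a ten-fold return, which is precisely why I distrust it: change one assumption and the answer swings wildly.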
This is the land of search costs. What’s your return on investment when you wade into this muck?
Problems with ROI for Utility Functions
The problem with any method of calculating ROI for a non-fungible service that incurs ongoing costs is that accounting systems don’t capture the costs. In the US government, costs are scattered hither and yon, and not too many government executives work very hard to pull “total costs” together. In my experience, corporate cost analysis is somewhat similar. When I look at the costs reported by Amazon, I have a tough time figuring out how Mr. Bezos spends so little to build such a big online and search system. The costs are opaque to me, but I suppose MBA mavens can figure out what he spends.
The problem that search, content processing, and text analytics vendors can’t solve is demonstrating the value of investments in these complex information retrieval technologies. Even in tightly controlled, narrowly defined deployments of search systems, costs are tough to capture. Consider the investment special operations groups make in search systems. The cost is usually reported in a budget as the license fee, plus maintenance, and some hardware. The actual cost is unknown. Here’s why: how do you capture the staff cost for fixing a glitch in a system that must absolutely stay online? That extraordinary cost disappears into a consulting or engineering budget. In some organizations, an engineer works overtime and bills the 16 hours to a project or maybe a broad category called “overtime”. Magnify this across a year of operations for a troubled search system, and those costs exist but are often disassociated from the search system. Here’s another example. The search system kills a network device due to a usage spike. The search system’s network infrastructure may be outsourced, and the engineer records the time as “network troubleshooting.” The link to the search system is lost; therefore, the cost is not accrued to the search system.
In one search deployment, the first year operation cost was about $300,000. By the seventh year, the costs rose to $23.0 million. What’s the ROI on this installation? No one wants to gather the numbers and explain these costs. The standard operating procedure among vendors and licensees is to chop up the costs and push them under the rug.
IBM: Mammatus Clouds Are Us
August 2, 2008
G.K. Chesterton’s statement “There are no rules of architecture for a castle in the clouds” came to mind when I read Richard Martin’s article “IBM Brings Cloud Computing To Earth With Massive New Data Centers.” The write up is full of interesting information, and you will want to read it here.
Two points leapt from my flat panel display to my addled goose mind; to wit:
- IBM’s data centers cost about $400 million each. That’s a bargain compared to Microsoft’s San Antonio data center which cost about $650 million.
- IBM has opened centers in Dublin, Ireland; Beijing and Wuxi, China; and Johannesburg, South Africa. Mr. Martin does not tell us how many data centers IBM has.
For a company with $100 billion in revenue, IBM can build lots of data centers.
Mr. Martin reveals this juicy factoid:
IBM first opened a high-performance on-demand computing facility in New York in 2005. One advantage it enjoys over other cloud rivals like Google and Amazon, which essentially offer a do-it-yourself approach, is its army of system engineers and consultants who can assist companies in harnessing and deploying resources in the cloud.
I think of these as the mammatus clouds of computing; that is, ominous looking but harmless. You can read more about mammatus clouds here. Better yet, take a gander at a mammatus and remember these clouds are toothless:
My recollection is that IBM has dipped in and out of the mammatus business a number of times. In 1996, IBM had a cloud-based Internet business. IBM sold this business to AT&T. IBM retooled and built a “grid” with a node in West Virginia. I don’t recall the details because the “grid” push drifted to the background at IBM. Now, like Hewlett Packard, IBM’s in the mammatus business–Big Blue mammatuses.
SearchCloud: Term Weighting Arrives
August 1, 2008
Yahoo’s BOSS (Build Your Own Search Service) has caught the attention of a number of companies in the information retrieval sector. A happy quack to the reader who alerted me to SearchCloud.net, a BOSS user.
SearchCloud.net, according to KillerStartUps, allows the user to weight certain terms:
The hook used by SearchCloud is providing users with the ability to weight the importance of keywords by changing the size of the fonts. Theoretically, this should allow for more accurate search results and the ability to search within given Web sites by simply placing the site name in a big font and the topic in a smaller one. While it is a great idea with a lot of potential, testing of the engine brought back very mixed results and the interface is not very well-designed. Searching for “Killerstartups” in a large font and “Cuil” in a smaller one did bring back a number of Killerstartups related pages but none with “Cuil” referenced.
You can read the KillerStartUps review here. In talking with the developers of SearchCloud.net, the team pointed out that KillerStartUps’ search would have returned better results had KillerStartUps reversed its weightings. The most specific search terms should be weighted higher by using larger letters. Here’s an example:
You can see the weights I assigned to each of my query terms. A larger font means the term has more weight in the query.
You can see that I put the terms I wanted to emphasize in larger letters using the selector button above the cloud. You may also be interested in a contrarian take on SearchCloud.net at TechCrunch here. I am tipping toward the positive with regard to this new service.
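For readers who want to see mechanically what term weighting does, here is a minimal sketch: scale each query term’s contribution to a document’s score by the weight the user assigns. The weights, documents, and frequency-based scoring are my own simplification, not SearchCloud.net’s ranking code.

```python
# Illustration of weighted query terms; not SearchCloud.net's implementation.
def score(document_text: str, weighted_terms: dict) -> float:
    """Sum term frequencies scaled by the user-assigned weights."""
    words = document_text.lower().split()
    return sum(weight * words.count(term.lower())
               for term, weight in weighted_terms.items())

# Hypothetical font sizes chosen in the cloud, mapped to numeric weights.
query = {"mobile": 3.0, "search": 2.0, "google": 1.0}

docs = {
    "doc-a": "google google google news and more google chatter",
    "doc-b": "mobile search results from google mobile search",
}

for name, text in docs.items():
    print(name, score(text, query))
```

With these weights, the document that matches the heavily weighted, more specific terms outranks the one that merely repeats the common term.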
I found SearchCloud.net intuitive, and the system allows me to control the importance of certain terms in my query. For example, let’s take a query I ran this morning for a client about Google’s mobile search results.
I saw a report from South Africa that suggested Google was delivering a “mash up” of results from different Google indexes. I needed to locate information about this alleged Google function. You can read about what I learned here. I found SearchCloud.net–despite some start up rough edges–quite useful.
The tag cloud appears to the left of the results list. I have selected the grid display of results. I can scroll through a large number of relevance ranked hits very quickly. This is a useful interface option.
SearchCloud.net, like Kartoo.com, exploits Adobe technology to good effect.
There are some functions that I would like to see the SearchCloud.net team add; for example, in the results view, I want to be able to fiddle with the term weights and see the results rerank themselves. My hunch is that this function will be implemented, but like most start ups, SearchCloud.net must husband its resources.
When I spoke with the young-at-heart owners of SearchCloud.net, I was impressed with their candor and willingness to listen to my questions and suggestions. Right now, the company is self-funded and based in Milwaukee, Wisconsin. Ads are one of the revenue sources the team is discussing at this time.
Steven Eisenhauer, president, told me:
We would like to see the major players in the industry realize that the user is smart enough to control the parameters of their searches. It would be nice to see Google or Yahoo integrate our technology as an option for their users.
Milwaukee is known for beer, not investment banks. If you want to own a piece of a search company, maybe you could contact SearchCloud.net at info at searchcloud dot net?
SearchCloud.net shows considerable promise. I have long been skeptical of Adobe’s Web technology, but I may have to soften my stance based on what the SearchCloud.net wizards have been able to accomplish with Flex. I have added this company to my watch list.
Stephen Arnold, August 1, 2008
Cluuz.com: Useful Interface Enhancements
July 31, 2008
Cluuz.com is one of the search companies tapping Yahoo’s search index. Cluuz.com has introduced some useful interface changes. I will be digging into this system in future write ups, but I want to call your attention to one of the innovations I found useful. (My first Cluuz.com write up is here.)
Navigate to Cluuz.com here. Enter your query. You will see a result screen that looks like my query for “fractal frameworks”.
The three major changes shown in this screenshot are:
- Entities appear in the tinted area above the graphic. My test queries suggested to me that Cluuz.com was identifying the most important entities in the result set. (A rough sketch of one way such ranking could work appears after this list.)
- A top ranked link with selected images. Each image is a hot link. I could tell quickly that the top ranked document included the type of technical diagram that I typically want to review.
- A selected list of other entities and concepts.
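As noted in the first bullet, here is a rough sketch of one way “most important entities” could be ranked: pull capitalized phrases from result snippets and count them across the result set. The snippets and the regular-expression heuristic are assumptions for illustration; this is not a description of Cluuz.com’s actual extraction method.

```python
# Illustrative entity counting over made-up result snippets.
import re
from collections import Counter

snippets = [
    "IBM Research and Hewlett Packard describe fractal frameworks for data centers.",
    "A fractal framework paper from IBM Research cites work at Stanford University.",
    "Hewlett Packard engineers compare fractal frameworks with grid designs.",
]

def candidate_entities(text: str):
    # Runs of two or more capitalized words are treated as candidate entities.
    return re.findall(r"(?:[A-Z][A-Za-z]+\s)+[A-Z][A-Za-z]+", text)

counts = Counter()
for snippet in snippets:
    counts.update(e.strip() for e in candidate_entities(snippet))

for entity, n in counts.most_common(5):
    print(f"{entity}: {n}")
```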
Intel Chases the Cloud a Second Time
July 30, 2008
I wrote about Convera’s present business in vertical search here because I heard that Intel was going to chase clouds again. But before we look at the new deal with Hewlett Packard (the ink company), Yahoo (goodness knows what its business is now), and Intel, let’s go back in time.
Remember in late 2000 when Intel signed a deal with Excalibur? Probably not. Convera was the result of a fusion of Intel’s multimedia unit and Excalibur Technologies. When this deal took form, Intel had 10 data centers.
An Intel executive at the time was quoted in Tabor Communications’ DSstar as saying:
We are creating a global network of Internet data centers with the goal of becoming a leader in world-class Internet application hosting and e-Commerce services, said Mike Aymar, president, Intel Online Services. The opening of a major Internet data center in Virginia is a key step toward this goal. We’ll bring our reliable and innovative approach to hosting customers running mission-critical Internet applications, both in the U.S. and around the world.
Part of the deal included the National Basketball Association. Intel and Convera would stream NBA games. These deals were complex and anticipated the online video boom that is now taking place. The problem was that Intel jumped into this game with Convera technology that was, shall we say, immature. In less than a year, the deal blew up. The NBA terminated its relationship with Convera. By the time the dust and lawsuits settled, the total price tag of this initiative was in the hundreds of millions of dollars.
Outside of a handful of Wall Street analysts and data center experts, few people know that Intel anticipated the cloud, made a play, muffed the bunny, and faded quietly into the background until today.
Intel is back again and demonstrating that it still doesn’t have a knack for picking the right partners. The big news is that Intel, HP, and Yahoo are going to tackle cloud computing. The approach is to allow academic researchers to collaborate with industry on projects. The companies will create an experimental network. In short, risk is reduced and the costs spread across the partners. You can read Thomson Reuters’ summary here.
Will the chip giant’s Cloud Two initiative work?
Sure, anything free will garner attention among academics and corporate researchers. Will the test spin money for the ink vendor and the confused online portal? Probably not.
Rounding up more cloud computing suspects.
But there’s another angle I want to discuss briefly.
Intel pumped money into Endeca, a well-regarded search and content processing company. You can refresh your memory about that $10 million investment here.
Is there a connection between this investment in Endeca and today’s cloud computing announcement from Intel? I believe there is. Intel is making chips with CPU cycles to spare. Few applications saturate the processors. With even more cores on a single die coming, software and applications are lagging far behind the chips’ capabilities.