Autonomy: Not Idle

September 5, 2008

On September 4, 2008, news about another Autonomy search and content processing win circulated in and around the Google Chrome chum. HBO, a unit of Time Warner, is a premium programming company. In Newton Minow’s world, HBO would be a provider of waste for the “vast wasteland”. Autonomy nailed this account under the noses of Endeca, Google, Microsoft Fast ESP, and dozens of other companies salivating for HBO money and a chance to meet those involved with “Rome,” “The Sopranos,” and the funero-lark “Six Feet Under.” Too bad for the US vendors. HBO visited the River Cam and found search goodness. Brief stories are appearing at ProactiveInvestors.com here and MoneyAM.com here. When I checked Autonomy’s Web site, the company’s news release had not been posted, but it should appear shortly. Chatter about Autonomy has picked up in the last few weeks. Sources throwing bread crumbs to the addled goose suggest that Autonomy has another mega deal to announce in the next week or two. On top of that, Autonomy itself is making moves to bolster its technology. When the addled goose gets some kernels of information, he will pass them on.

In response to the Autonomy “summer of sales”, its competitors are cranking up their marketing machines. Vivisimo is hosting a Webinar, which you can read about here. Other vendors are polishing new white papers. One vendor is ramping up a telemarketing campaign. Google, as everyone knows, is cranking the volume on its marketing and PR machine. The fact of the matter is that Autonomy has demonstrated an almost uncanny ability to find opportunities and close deals while other vendors talk about making sales. Will an outfit step forward and buy Autonomy? SAP hints that it has an appetite for larger acquisitions. Will Oracle take steps to address its search needs? Will a group of investors conclude that Autonomy might be worth more split into a search company, a fraud detection company, and an eDiscovery company? Autonomy is giving me quite a bit to consider. What’s your take on the rumors? Send the addled goose a handful of corn via the Comments function on this Web log.

Stephen Arnold, September 5, 2008

Blossom Search for Web Logs

September 5, 2008

Over the summer, several people have inquired about the search system I use for my WordPress Web log. Well, it’s not the default WordPress engine. Since I wrote the first edition of Enterprise Search Report (CMSWatch.com), I have had developers providing me with search and content processing technology. We’ve tested more than 50 search systems in the last year alone. After quite a bit of testing, I decided upon the Blossom Software search engine. This system received high marks in my reports about search and content processing. You can learn more about the Blossom system by navigating to www.blossom.com. Founded by a former Bell Laboratories scientist, Dr. Alan Feuer, Blossom search works quickly and unobtrusively to index the content of Web sites, behind-the-firewall collections, and hybrid collections.

You can try the system by navigating to the home page for this Web log here and entering the phrase in quotes, “search imperative”. You will get this result:

search imperative blossom

When you run this query, you will see the search terms highlighted in red. The bound phrase is easily spotted. The key-words-in-context snippet makes it easy to decide whether I want to read the full article or just the extract.
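The key-words-in-context display can be approximated in a few lines. The sketch below is illustrative only; the function name and the bracket markers are my own, not Blossom’s API. It finds the phrase, clips a window of surrounding text, and marks the match (a Web interface would swap the brackets for red-styled HTML spans):

```python
import re

def kwic_snippet(text, phrase, context=30):
    """Return a key-words-in-context snippet: the matched phrase plus
    a window of surrounding text, with the match wrapped in markers."""
    match = re.search(re.escape(phrase), text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    snippet = text[start:end]
    # Wrap the matched phrase in markers; a real UI would use styled spans.
    return re.sub(re.escape(phrase), lambda m: "[" + m.group(0) + "]",
                  snippet, count=1, flags=re.IGNORECASE)
```

A reader scanning a result list sees the bound phrase in its sentence and can decide at a glance whether the full post is worth opening.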

Most Web log content baffles some search engines. For example, recent posts may not appear. The reason is that the index updating cycle is sluggish. Blossom indexes my Web site on a daily basis, but you can specify the update cycle appropriate to your users’ needs and your content. I update the site at midnight of each day, so a daily update allows me to find the most recent posts when I arrive at my desk in the morning.

The data management system for WordPress is a bit tricky. Our tests of various search engines identified three issues that came up when third-party systems were launched at my WordPress Web log:

  1. Some older posts were not indexed. The issue appeared to be the way in which WordPress handles the older material within its data management system.
  2. Certain posts could not be located. The posts were indexed, but the default OR handling of phrase searches displayed too many results. With more than 700 posts on this site, the low precision of the query processing made results hard to use.
  3. Current posts were not searchable. Our tests revealed several issues. The content was crawled, but the indexes did not refresh. The cause appeared to be traffic to the site; another likely factor was WordPress’ native data management set up.

As we worked on figuring out search for Web logs, two other issues became evident: redundant hits (since there are multiple paths to the same content) and incorrect time stamps (since all of the content is generated dynamically). Blossom has figured out a way to make sense of the dates in Web log posts, a good thing from my point of view.
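The redundant-hit problem can be handled by fingerprinting the body of each result rather than its URL, since several URLs may lead to the same post. Here is a minimal sketch of the idea; the function and the (url, body) data shape are my own assumptions, not Blossom’s implementation:

```python
import hashlib

def dedupe_hits(hits):
    """Collapse result-list entries that point at the same underlying
    post, keeping the first URL seen for each distinct body text.
    `hits` is a list of (url, body) pairs."""
    seen = set()
    unique = []
    for url, body in hits:
        # Normalize whitespace and case so trivial variations still match.
        fingerprint = hashlib.sha1(body.strip().lower().encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append((url, body))
    return unique
```

A WordPress permalink and its `?p=` query-string twin then collapse to a single entry in the result list.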

The Blossom engine operates for my Web log as a cloud service; that is, there is no on-premises installation of the Blossom system. An on-premises version is available, but my preference is to have the search and query processing handled by Blossom in its data centers. These deliver low-latency responses and feature failover, redundancy, and distributed processing.

The glitches we identified to Blossom proved to be no big deal for Dr. Feuer. He adjusted the Blossom crawler to finesse the issues with WordPress’ data management system. The indexing process is lightweight and has not made a significant impact on my available bandwidth. In fact, traffic to the Web log continues to rise, and the Blossom demand for bandwidth has remained constant.

We have implemented this system on a site run by a former intelligence officer, which is not publicly accessible. I mention this because some cloud-based search systems cannot conform to the security, log-in, and authentication requirements of Web sites with classified content.

The ArnoldIT.com site, which is the place for my presentations and occasional writings, is also indexed and searched with the Blossom engine. You can try some queries at http://www.arnoldit.com/sitemap.html. Keep in mind that the material on this Web site may be lengthy. ArnoldIT.com is an archive and digital brochure for my consulting services. Several of my books, now out of print, are available on this Web site as well.

Pricing for the Blossom service starts at about $10 per month. If you want to use the Blossom system for enterprise search, a custom price quote will be provided by Dr. Feuer.

If you want to use the Blossom hosted search system on your Web site, for your Web log, or your organization, you can contact either me or Dr. Alan Feuer by emailing or phoning:

  • Stephen Arnold seaky2000 at yahoo dot com or 502 228 1966.
  • Dr. Alan Feuer arf at blossom dot com

Dr. Feuer has posted a landing page for readers of “Beyond Search”. If you sign up for the Blossom.com Web log search service, “Beyond Search” gets a modest commission. We use this money to buy bunny rabbit ears and pâté. I like my logo, but I love my pâté.

Click here for the Web log search order form landing page.

If you mention Beyond Search, a discount applies to bloggers who sign up for the Blossom service. A happy quack to the folks at Blossom.com for an excellent, reasonably priced, efficient search and retrieval system.

Stephen Arnold, September 5, 2008

Intel and Search

September 5, 2008

True, this is a Web log posting, but I am interested in search thoughts from Intel or its employees. I found the post “Why I Will Never Own an Electronic Book” interesting. I can’t decide whether the post is suggestive or naive. You can read the post by Clay Breshears here. On the surface, Mr. Breshears is pointing out that ebook readers’ search systems can locate only key words. He wants these generally lousy devices to sport NLP or natural language processing. The portion of the post that caught my attention was:

We need better natural language processing and recognition in our search technology.  Better algorithms along with parallel processing is going to be the key.  Larger memory space will also be needed in these devices to hold thesaurus entries that can find the link between “unemployed” and “jobless” when the search is asked to find the former but only sees the latter.  Maybe, just maybe, when we get to something like that level of sophistication in e-book devices, then I might be interested in getting one.
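The thesaurus linkage Mr. Breshears wants, connecting “unemployed” with “jobless”, amounts to simple query expansion. Here is a toy sketch; the thesaurus entries and function names are mine, not Intel’s or any vendor’s, and a real device would ship a far larger lexicon:

```python
# A toy thesaurus; a shipping device would carry a much larger one.
THESAURUS = {
    "unemployed": {"jobless", "out of work"},
    "jobless": {"unemployed", "out of work"},
}

def expand_query(term):
    """Return the search term plus its thesaurus equivalents."""
    return {term} | THESAURUS.get(term, set())

def search(pages, term):
    """Return the indexes of pages containing the term or any synonym."""
    variants = expand_query(term)
    return [i for i, page in enumerate(pages)
            if any(v in page.lower() for v in variants)]
```

With expansion in place, a query for “unemployed” hits a page that only ever says “jobless”, which is exactly the behavior the quoted passage asks for.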

Intel invested some money in Endeca. Endeca gets cash, and it seems likely that Intel may provide Endeca with guidance regarding Intel’s next generation multicore processors. In 2000, Intel showed interest in getting into the search business with its exciting deal with Convera. I have heard references to Intel’s interest in content processing. The references touch upon the new CPUs’ computational capability. Most of this horsepower goes unused, and the grapevine suggests that putting some content pre-processing functions in an appliance, in firmware, or on the CPU die itself might make sense.

This Web log post may be a one-off comment. On the other hand, it might hint at other, more substantive conversations about search and content processing within Intel. There’s probably nothing to these rumors, but $10 million signals a modicum of interest from my vantage point in rural Kentucky.

Stephen Arnold, September 5, 2008

Google on Chrome: What We Meant Really… No, Really

September 4, 2008

You must read Matt Cutts’s “Google Does Not Want Rights to Things You Do Using Chrome”. First, click here to read the original clause about content and rights. Now read the September 3, 2008, post about what Google *really* meant to say here. I may be an addled goose in rural Kentucky, but I think the original statements in clause 11.1 expressed Google’s mindset quite clearly.

It sure seems to me that the two Google statements–the original clause 11.1 and Mr. Cutts’s statements–contradict one another. In large companies this type of “slip betwixt cup and lip” occurs frequently. What strikes me as interesting is that Google is acting in what I call, for want of a better term, “nerd imperialism”.

What troubles me is the mounting evidence in my files that Google can do pretty much what it wants. Mr. Cutts’s writing is a little like those textbooks that explain history to suit the needs of the school district or the publisher.

Google may house its lawyers one mile from the Shoreline headquarters, but I surmise that Google’s legal eagles wrote exactly what Google management wanted. Further, I surmise that Google needs Chrome to obtain more “context” information from Chrome users. I am speculating, but I think the language of the original clause was reviewed, vetted, and massaged to produce the quite clear statements in the original version of clause 11.1.

When the firestorm flared, Google felt the heat and rushed backwards to safety. The fix? Easy. Googzilla rewrote history, in my opinion. The problem is that the original clause 11.1 showed Google’s intent. That clause did not appear by magic from the Google singularity. Lawyers drafted it; Google management okayed it. I can almost hear a snorting chuckle from Googzilla, but that’s my post heart attack imagination and seems amusing to me. (I was a math club member, and I understand mathy humor but not as well as a “real” Googler, of course.)

If you have attended my lecture on Google’s container invention or read my KMWorld feature about Google’s data model for user data, are you able to see a theme? For me, the core idea of the original clause 11.1 was to capture more data about “information.” Juicy meta information like who wrote what, who sent what to whom, and who published which fact where and when. These data are available in a dataspace managed by a dataspace support platform or DSSP which Google may be building.

Google wants this meta information to clean up the messiness of ambiguity in information. More and better data mean that predictive algorithms work with more informed thresholds. To reduce entropy in the information it possesses, Google needs more, better, and different information–lots of information. For more on usage tracking and Google’s technology, you can find some color in my 2005 The Google Legacy and my 2007 Google Version 2.0. If you are an IDC search research customer, you can read more about dataspaces in IDC report 213562. These reports cost money, and you will have to contact my publishers to buy copies. (No, I don’t give these away to be a kind and friendly former math club member. Giggle. Snort. Snort.)

Plus, I have a new Google monograph underway, and I will be digging into containers, janitors, and dataspaces as these apply to new types of queries and ad functions. For me the net net is that I think Google’s lawyers got it right the first time. Agree? Disagree? Help me learn.

Stephen Arnold, September 4, 2008

Google and Key Stroke Logging

September 4, 2008

Auto suggest is a function that looks at what you are typing in a search box and displays words and phrases as suggestions. Sometimes called auto complete: you arrow down to the phrase you want and hit Enter, and the agent runs the query with the word or phrase you selected. This function turned up a couple of years ago on Yahoo’s AllTheWeb.com search system. Now it has migrated to Google. You will want to read Ina Fried’s “Chrome Lets Google Log User Keystrokes”, published on September 4, 2008, to get some additional information about this feature. Her point is that when you or I select a suggested search phrase, that selection is noted and sent to Google. For me, the most interesting point in her article was:

Provided that users leave Chrome’s auto-suggest feature on and have Google as their default search provider, Google will have access to any keystrokes that are typed into the browser’s Omnibox, even before a user hits enter. Google intends to retain some of that data even after it provides the promised suggestions. A Google representative told sister site CNET News.com that the company plans to store about two percent of that data, along with the IP address of the computer that typed it.

When I read statements assuring me that an organization will store “about two percent of that data”, I think of phrases such as “Your check is in the mail”. Based on my research, the substantive value of lots of clicks lies in that “two percent”. Here’s why. Most queries follow well-worn ruts. If you’ve been to Pompeii, you can see grooves cut in the roadway. Once a cart or chariot is in those grooves, changing direction is tough. What’s important, therefore, is not the carts in the grooves; it is the carts that get out of the grooves. As Google’s base of user data grows, the key indicators are variances, deltas, and other simple calculations that provide useful insights. After a decade of capturing queries about pop stars, horoscopes, and PageRank values, that “two percent” is important. I ask, “How do I know what happens to the other 98 percent of the usage data?” The check is in the mail.
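The groove metaphor can be made concrete: compare query frequencies across two log samples and flag the queries that jump out of their ruts. This is a toy illustration of the variance-and-delta idea, not Google’s method; the threshold and the +1 smoothing are arbitrary choices of mine:

```python
from collections import Counter

def query_deltas(last_period, this_period, threshold=3.0):
    """Flag queries whose frequency jumped between two log samples:
    the carts leaving the well-worn grooves."""
    before = Counter(last_period)
    after = Counter(this_period)
    flagged = {}
    for query, count in after.items():
        # +1 smoothing lets brand-new queries surface without dividing by zero.
        ratio = count / (before.get(query, 0) + 1)
        if ratio >= threshold:
            flagged[query] = ratio
    return flagged
```

The steady horoscope traffic stays invisible; a query whose volume suddenly multiplies gets flagged, which is why even a small retained sample can carry real signal.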

Stephen Arnold, September 4, 2008

Microsoft: Certified Gold Not Good Enough, Become a Certified Master

September 4, 2008

Brian McCann’s post about a super grade of Microsoft Certified Professional took my breath away. His article, “Microsoft Certified Master”, appeared on September 2, 2008, here. That’s right: master. You can become one by passing tests and paying–are you ready?–$18,500. If you flunk your tests, you can keep trying by paying an additional fee: $250 for the written test and $1,500 for the hands-on part. For now, you can become a master in Exchange, SQL Server, and Active Directory. SharePoint is coming soon. If you go for the SQL Server and SharePoint combination, you can become a master^2 for a mere $37,000. Mr. McCann’s post includes links to Web logs with more information.

In my opinion, Microsoft is making certain that it has some indentured slaves working on its behalf. Oops, I really meant masters. How silly of me to assume that anyone who becomes a master would think non-Microsoft thoughts. Oracle DBAs are quite open minded, and the certification costs less.

Google is probably licking its chops with this program. Google’s enterprise team has pitched simplicity since the Google Search Appliance appeared years ago. The argument then and now is that enterprise software is too complex. The notion of “let Google do it” has resonated with more than 20,000 GSA licenses, a deal for 1.5 Gmail boxes in New South Wales, and a near lock on geospatial services in the US government (a technically challenged operation in some agencies). When enterprise software requires a master, Google can ask, “Why do you need to pump resources into a potential black hole of cost?”

Software is complex, but now Google does not have to do much more than describe this new certification level and ask a couple of cost, risk, and time questions. Agree? Disagree? Educate me.

Stephen Arnold, September 4, 2008

Security Dents Chrome

September 4, 2008

InfoWeek, now an online-only publication, published “Early Security Issues Tarnish Google’s Chrome” on September 3, 2008. Nancy Gohring has gathered a number of Chrome security issues. You can read the full text of her article here. She catalogs hacker threats, malicious code, Java vulnerabilities, and more. For me, the most interesting statement in the story was:

Google did not directly address questions about this [file download] vulnerability or whether it plans to make any changes to Chrome to prevent any potential problems.

This “no comment” indirection clashes with Google’s transparency push. When I read this sentence in Ms. Gohring’s article, I wondered why journalists don’t confront Google about its slither-away-and-ignore approach to important questions. Transparency? I see a magician’s finesse at work.

What do you perceive?

Stephen Arnold, September 4, 2008

Googzilla Plays Crawfish: Back Tracking on Chrome Terms

September 4, 2008

Ina Fried wrote “Google Backtracks on Chrome License Terms”. You can read her CNet story here. The point of the story is that Google has withdrawn some of the language of its Chrome license terms. Ms. Fried wrote:

Section 11 now reads simply: “11.1 You retain copyright and any other rights you already hold in Content which you submit, post or display on or through, the Services.”

For me, this sudden reversal is good news and bad news. The good news is that the GOOG recognized that it was close to becoming a Microsoft doppelgänger and reversed direction–fast. The bad news is that the original terms made it clear that Google’s browser containers would monitor the clicks, context, content, and processes of a user. Dataspaces are much easier to populate if you have the users in a digital fishbowl. The change in terms does little to alter my perception of the utility of dataspaces to Google.

To catch up on the original language, click here. To find out a bit about dataspaces, click here.

Stephen Arnold, September 4, 2008

A Vertical Search Engine Narrows to a Niche

September 4, 2008

Focus. Right before I was cut from one of the sports teams I tried to join, I would hear, “Focus.” I think taking a book to football, wrestling, and basketball practice was not something coaches expected or encouraged. Now SearchMedica, a search engine for medical professionals, is taking my coach’s screams of “Focus” to heart. On September 3, 2008, the company announced a practice management category. The news release on Yahoo said:

The new category connects medical professionals with the best practice management resources available on the Web, including the financial, legal and administrative resources needed to effectively manage a medical practice.

To me, the Practice Management focus is a collection of content about the business of running a health practice. In 1981, ABI/INFORM had a category tag for this segment of business information. Now the past has been rediscovered. The principal difference is that access to this vertical search engine is free to the user. ABI/INFORM and other commercial databases charge money, often big money, to access their content.

If you want to know more about SearchMedica, navigate to www.searchmedica.com. The company could encourage a host of copycats. Some would tackle the health field, but others would focus on categories of information for specific user communities. If SearchMedica continues to grow, it and other companies with fresh business models will sign the death warrant for certain commercial database companies.

The fate of traditional newspapers is becoming clearer each day. Superstar journalists are starting Web logs and organizing conferences. Editors are slashing their staffs. Senior management teams are reorganizing to find economies such as smaller trim sizes, fewer editions, and less money for local and original reporting. My thought is that companies like SearchMedica, if they get traction, will push commercial database companies down the same ignominious slope. Maybe one of the financial sharpies at Dialog Information Services, Derwent, or Lexis Nexis will offer convincing data that success is in their hands, not the claws of Google or upstarts like SearchMedica. Chime in, please. I’m tired of Chrome.

Stephen Arnold, September 4, 2008

Brainware: Oracle Exec Joins Its Board

September 3, 2008

Poor Oracle. A Google partner, a roll up of unprecedented proportions, master of the aging Codd RDBMS, and proud owner of SES10g–what a résumé for a technology firm! Rumors about a search-related acquisition have been swirling for a couple of years. Autonomy, one candidate that I assume Oracle has vetted, is too expensive. The smaller companies don’t have enough of a footprint to dent the squishy sands of enterprise search. Is a move afoot?

The reason I ask is that yesterday Brainware, Inc., a search and content processing vendor, announced that David Bonnette of Oracle has joined the Brainware Board of Directors. Mr. Bonnette is a vice president with responsibility for customer support and call center systems, among other Oracle goodies. You can read one version of the news release here.

Brainware has been growing rapidly on the strength of its content acquisition and litigation support systems and services. Other eDiscovery vendors have been chasing other markets. My hunch is that this appointment will pay a couple of dividends. First, Mr. Bonnette can provide guidance for Brainware in the CRM market. In addition, his presence on the Board raises Brainware’s profile at One Oracle Way, formerly Sea World Drive or some similar aquatic place name.

Is Brainware looking for investors, marketing support, a buyer? Let me know if you have any insight. Brainware’s pattern matching is interesting, and it is a technology quite distinct from the databasey, semantic SES10g Oracle is now selling along with bits and pieces from Triple Hop. Thoughts?

Stephen Arnold, September 3, 2008
