Blossom Search for Web Logs

September 5, 2008

Over the summer, several people have inquired about the search system I use for my WordPress Web log. Well, it’s not the default WordPress engine. Since I wrote the first edition of Enterprise Search Report (CMSWatch.com), I have had developers providing me with search and content processing technology. We’ve tested more than 50 search systems in the last year alone. After quite a bit of testing, I decided upon the Blossom Software search engine. This system received high marks in my reports about search and content processing. You can learn more about the Blossom system by navigating to www.blossom.com. Blossom was founded by a former Bell Laboratories scientist, Dr. Alan Feuer. Its search system works quickly and unobtrusively to index the content of Web sites, behind-the-firewall collections, and hybrid collections.

You can try the system by navigating to the home page for this Web log here and entering the search phrase in quotes “search imperative” and you will get this result:

[Screenshot: Blossom search results for “search imperative”]

When you run this query, you will see that the search terms are highlighted in red. The bound phrase is easily spotted. The key words in context snippet makes it easy to determine if I want to read the full article or just the extract.

Web log content baffles some search engines. For example, recent posts may not appear. The reason is that the index updating cycle is sluggish. Blossom indexes my Web site on a daily basis, but you can specify the update cycle appropriate to your users’ needs and your content. I update the site at midnight each day, so a daily update allows me to find the most recent posts when I arrive at my desk in the morning.

The data management system for WordPress is a bit tricky. Our tests of various search engines identified three issues that came up when third-party systems were launched at my WordPress Web log:

  1. Some older posts were not indexed. The issue appeared to be the way in which WordPress handles the older material within its data management system.
  2. Certain posts could not be located. The posts were indexed, but the default OR for phrase searching displayed too many results. With more than 700 posts on this site, the precision of the query processing system was not too helpful to me.
  3. Current posts were not indexed. Our tests revealed several issues. The content was indexed, but the indexes did not refresh. The cause appeared to be a result of the traffic to the site. Another likely issue was WordPress’ native data management set up.
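The precision problem in the second item can be illustrated with a small sketch. This is not how Blossom or any tested engine works internally; the function names and the sample post titles are invented for illustration. It simply contrasts a default OR match with a bound phrase match:

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_or(query, doc):
    """Default OR: the document matches if it contains ANY query term."""
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))

def match_phrase(query, doc):
    """Bound phrase: the query terms must appear adjacent and in order."""
    q, d = tokenize(query), tokenize(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

# Hypothetical post titles, invented for illustration only.
posts = [
    "The search imperative for publishers",
    "Enterprise search vendors in 2008",
    "A moral imperative, not a technical one",
]

or_hits = [p for p in posts if match_or("search imperative", p)]
phrase_hits = [p for p in posts if match_phrase("search imperative", p)]
```

With OR matching, all three sample titles come back; with the bound phrase, only the first does. On a site with more than 700 posts, that difference is what separates a usable result list from a flood.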

As we worked on figuring out search for Web logs, two other issues became evident. The first was redundant hits, since there are multiple paths to the same content. The second was incorrect time stamps, since all of the content is generated dynamically. Blossom has figured out a way to make sense of the dates in Web log posts, a good thing from my point of view.
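One common way to suppress redundant hits is to fingerprint the content rather than trust the URL. The sketch below is an assumption about how such a filter could work, not a description of Blossom’s method; the URLs and post bodies are hypothetical:

```python
import hashlib

def fingerprint(body):
    """Hash the post body after collapsing whitespace and case."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def dedupe(hits):
    """Keep the first hit for each distinct body; drop later duplicates."""
    seen = set()
    unique = []
    for url, body in hits:
        key = fingerprint(body)
        if key not in seen:
            seen.add(key)
            unique.append((url, body))
    return unique

# Hypothetical URLs and bodies, invented for illustration only.
hits = [
    ("http://example.com/2008/09/05/blossom-search/", "Blossom indexes my Web log daily."),
    ("http://example.com/?p=123", "Blossom  indexes my Web log daily."),
    ("http://example.com/2008/09/04/chrome/", "Chrome terms changed."),
]
unique_hits = dedupe(hits)
```

The first two hits reach the same post by different paths, so only one survives. An exact hash catches only identical bodies; pages that wrap the post in different navigation would need the body extracted first.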

The Blossom engine operates for my Web log as a cloud service; that is, there is no on premises installation of the Blossom system. An on premises system is available. My preference is to have the search and query processing handled by Blossom in its data centers. These deliver low latency response and feature fail over, redundancy, and distributed processing.

The glitches we identified to Blossom proved to be no big deal for Dr. Feuer. He made adjustments to the Blossom crawler to finesse the issues with WordPress’ data management system. The indexing cycle does not choke my available bandwidth. The indexing process is lightweight and has not made a significant impact on my bandwidth usage. In fact, traffic to the Web log continues to rise, and the Blossom demand for bandwidth has remained constant.

We have implemented this system on a site run by a former intelligence officer, which is not publicly accessible. The reason I mention this is that some cloud based search systems cannot conform to the security requirements of Web sites with classified content and their log in and authentication procedures.

The ArnoldIT.com site, which is the place for my presentations and occasional writings, is also indexed and searched with the Blossom engine. You can try some queries at http://www.arnoldit.com/sitemap.html. Keep in mind that the material on this Web site may be lengthy. ArnoldIT.com is an archive and digital brochure for my consulting services. Several of my books, which are now out of print, are available on this Web site as well.

Pricing for the Blossom service starts at about $10 per month. If you want to use the Blossom system for enterprise search, a custom price quote will be provided by Dr. Feuer.

If you want to use the Blossom hosted search system on your Web site, for your Web log, or your organization, you can contact either me or Dr. Alan Feuer by emailing or phoning:

  • Stephen Arnold seaky2000 at yahoo dot com or 502 228 1966.
  • Dr. Alan Feuer arf at blossom dot com

Dr. Feuer has posted a landing page for readers of “Beyond Search”. If you sign up for the Blossom.com Web log search service, “Beyond Search” gets a modest commission. We use this money to buy bunny rabbit ears and pâté. I like my logo, but I love my pâté.

Click here for the Web log search order form landing page.

If you mention Beyond Search, a discount applies to bloggers who sign up for the Blossom service. A happy quack to the folks at Blossom.com for an excellent, reasonably priced, efficient search and retrieval system.

Stephen Arnold, September 5, 2008

Google on Chrome: What We Meant Really… No, Really

September 4, 2008

You must read Matt Cutts’s “Google Does Not Want Rights to Things You Do Using Chrome”. First, click here to read the original clause about content and rights. Now read the September 3, 2008, post about what Google * really * meant to say here. I may be an addled goose in rural Kentucky but I think the original statements in clause 11.1 expressed quite clearly Google’s mind set.

It sure seems to me that the two Google statements–the original clause 11.1 and Mr. Cutts’s statements–are opposite to one another. In large companies this type of “slip betwixt cup and lip” occurs frequently. What struck me as interesting about Google is that it is acting in what I call, due to my lack of verbal skill, “nerd imperialism”.

What troubles me is the mounting evidence in my files that Google can do pretty much what it wants. Mr. Cutts’ writing is a little like those text books that explain history to suit the needs of the school district or the publisher.

Google may house its lawyers one mile from Shoreline headquarters, but I surmise that Google’s legal eagles wrote exactly what Google management wanted. Further, I surmise that Google needs Chrome to obtain more “context” information from Chrome users. I am speculating, but I think the language of the original clause was reviewed, vetted, and massaged to produce the quite clear statements in the original version of clause 11.1.

When the firestorm flared, Google felt the heat and rushed backwards to safety. The fix? Easy. Googzilla rewrote history in my opinion. The problem is that the original clause 11.1 showed the intent of Google. That clause 11.1 did not appear by magic from the Google singularity. Lawyers drafted it; Google management okayed the original clause 11.1. I can almost hear a snorting chuckle from Googzilla, but that’s my post heart attack imagination and seems amusing to me. (I was a math club member, and I understand mathy humor but not as well as a “real” Googler, of course.)

If you have attended my lecture on Google’s container invention or read my KMWorld feature about Google’s data model for user data, are you able to see a theme? For me, the core idea of the original clause 11.1 was to capture more data about “information.” Juicy meta information like who wrote what, who sent what to whom, and who published which fact where and when. These data are available in a dataspace managed by a dataspace support platform or DSSP which Google may be building.

Google wants these meta metadata to clean up the messiness of ambiguity in information. Better and more data means that predictive algorithms work with more informed thresholds. To reduce entropy in the information it possesses, you need more, better, and different information–lots of information. For more on usage tracking and Google’s technology, you can find some color in my 2005 The Google Legacy and my 2007 Google Version 2.0. If you are an IDC search research customer, you can read more about dataspaces in IDC report 213562. These reports cost money, and you will have to contact my publishers to buy copies. (No, I don’t give these away to be a kind and friendly former math club member. Giggle. Snort. Snort.)

Plus, I have a new Google monograph underway, and I will be digging into containers, janitors, and dataspaces as these apply to new types of queries and ad functions. For me the net net is that I think Google’s lawyers got it right the first time. Agree? Disagree? Help me learn.

Stephen Arnold, September 4, 2008

Google and Key Stroke Logging

September 4, 2008

Auto suggest is a function that looks at what you are typing in a search box. The agent displays words and phrases that offer suggestions. The function is sometimes called auto complete. You arrow down to the phrase you want and hit enter. The agent runs the query with the word or phrase you selected. This function turned up a couple of years ago on the Yahoo AllTheWeb.com search system. Now, it’s migrated to Google. You will want to read Ina Fried’s “Chrome Let’s Google Log User Keystrokes”, published on September 4, 2008, to get some additional information about this feature. Her point is that when you or I select a suggested search phrase, that selection is noted and sent to Google. For me, the most interesting point in her article was:

Provided that users leave Chrome’s auto-suggest feature on and have Google as their default search provider, Google will have access to any keystrokes that are typed into the browser’s Omnibox, even before a user hits enter. Google intends to retain some of that data even after it provides the promised suggestions. A Google representative told sister site CNET News.com that the company plans to store about two percent of that data, along with the IP address of the computer that typed it.

When I read statements assuring me that an organization will store “about two percent of that data”, I think about phrases such as “Your check is in the mail”. Based on my research, the substantive value of lots of clicks is that “two percent”. Here’s why. Most queries follow well worn ruts. If you’ve been to Pompeii, you can see grooves cut in the roadway. Once a cart or chariot is in those grooves, changing direction is tough. What’s important, therefore, is not the carts in the grooves. What’s important are those carts that get out of the grooves. As Google’s base of user data grows, the key indicators are variances, deltas, and other simple calculations that provide useful insights. After a decade of capturing queries about pop stars, horoscopes, and PageRank values, that “two percent” is important. I ask, “How do I know what happens to that other 98 percent of the usage data?” The check is in the mail.
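To make the “carts out of the grooves” idea concrete, here is a minimal sketch of one such simple calculation: counting queries and keeping the ones that rarely occur. It makes no claim about what Google actually computes or stores. The function name, the threshold, and the sample log are all invented for illustration:

```python
from collections import Counter

def out_of_groove(queries, max_count=1):
    """Return queries seen at most max_count times, in first-seen order."""
    counts = Counter(q.strip().lower() for q in queries)
    return [q for q, n in counts.items() if n <= max_count]

# Hypothetical query log, invented for illustration only.
log = [
    "horoscope", "pop stars", "horoscope", "pagerank",
    "pop stars", "horoscope", "dataspace support platform", "pagerank",
]
rare = out_of_groove(log)
```

In this toy log the three well worn queries drop out and the single unusual one remains. A small slice of the data, chosen this way, can carry most of the insight.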

Stephen Arnold, September 4, 2008

Security Dents Chrome

September 4, 2008

InfoWeek, now an online only publication, published “Early Security Issues Tarnish Google’s Chrome” on September 3, 2008. Nancy Gohring has gathered a number of Chrome security issues. You can read the full text of her article here. She catalogs hacker threats, malicious code, Java vulnerabilities, and more. For me, the most interesting statement in the story was:

Google did not directly address questions about this [file download] vulnerability or whether it plans to make any changes to Chrome to prevent any potential problems.

This “no comment” and “indirection” clashes with Google’s transparency push. When I read this sentence in Ms. Gohring’s article, I wondered why journalists don’t confront Google about its slither-away-and-ignore approach to important questions. Transparency? I see a magician’s finesse at work.

What do you perceive?

Stephen Arnold, September 4, 2008

Googzilla Plays Crawfish: Back Tracking on Chrome Terms

September 4, 2008

Ina Fried wrote “Google Backtracks on Chrome License Terms”. You can read her CNet story here. The point of the story is that Google has withdrawn some of the language of its Chrome license terms. Ms. Fried wrote:

Section 11 now reads simply: “11.1 You retain copyright and any other rights you already hold in Content which you submit, post or display on or through, the Services.”

For me, this sudden reversal is good news and bad news. The good news is that the GOOG recognized that it was close to becoming a Microsoft doppelgänger and reversed direction–fast. The bad news is that the original terms make it clear that Google’s browser containers would monitor the clicks, context, content, and processes of a user. Dataspaces are much easier to populate if you have the users in a digital fishbowl. The change in terms does little to assuage my perception of the utility of dataspaces to Google.

To catch up on the original language, click here. To find out a bit about dataspaces, click here.

Stephen Arnold, September 4, 2008

A Vertical Search Engine Narrows to a Niche

September 4, 2008

Focus. Right before I was cut from one of the sports teams I tried to join, I would hear, “Focus.” I think taking a book to football, basketball, and wrestling practice was not something coaches expected or encouraged. Now SearchMedica, a search engine for medical professionals, is taking my coach’s screams of “Focus” to heart. The company announced on September 3, 2008, a practice management category. The news release on Yahoo said:

The new category connects medical professionals with the best practice management resources available on the Web, including the financial, legal and administrative resources needed to effectively manage a medical practice.

To me the Practice Management focus is a collection of content about the business of running a health practice. In 1981, ABI/INFORM had a category tag for this segment of business information. Now, the past has been rediscovered. The principal difference is that access to this vertical search engine is free to the user. ABI/INFORM and other commercial databases charge money, often big money, to access their content.

If you want to know more about SearchMedica, navigate to www.searchmedica.com. The company could encourage a host of copy cats. Some would tackle the health field, but others would focus on categories of information for specific user communities. If SearchMedica continues to grow, it and other companies with fresh business models will sign the death sentence for certain commercial database companies.

The fate of traditional newspapers is becoming increasingly clear each day. Super star journalists are starting Web logs and organizing conferences. Editors are slashing their staff. Senior management teams are reorganizing to find economies such as smaller trim sizes, fewer editions, and less money for local and original reporting. My thought is that companies like SearchMedica, if they get traction, will push commercial database companies down the same ignominious slope. Maybe one of the financial sharpies at Dialog Information Services, Derwent, or Lexis Nexis will offer convincing data that success is in their hands, not the claws of Google or upstarts like SearchMedica. Chime in, please. I’m tired of Chrome.

Stephen Arnold, September 4, 2008

Google Chrome License

September 3, 2008

Update: September 4, 2008, 9:30 pm Eastern

Useful summary of the modified Chrome license terms. Navigate to TapTheHive at http://tapthehive.com/discuss/This_Post_Not_Made_In_Chrome_Google_s_EULA_Sucks

Update: September 4, 2008, 11:30 am Eastern

Related links about the Chrome license:

  • Change in Chrome license terms here
  • Key Stroke Logging here
  • Security issues here
  • Back Pedaling on terms here

Update: September 3, 2008, 9:18 am Eastern

WebWare’s take on the Chrome license agreement. Worth reading. It is here.

Original Post

If true, this post by Poss is a keeper. You can read his original article on Shuzak beta here. The juicy part is an extract from the Chrome terms of service. I quote Mr. Shuzak beta:

11.1 You retain copyright and any other rights you already hold in Content which you submit, post or display on or through, the Services. By submitting, posting or displaying the content you give Google a perpetual, irrevocable, worldwide, royalty-free, and non-exclusive license to reproduce, adapt, modify, translate, publish, publicly perform, publicly display and distribute any Content which you submit, post or display on or through, the Services.

As I understand this passage, Googzilla has rights to what I do, what I post, what I see via its browser. Seems pretty reasonable for a Googzilla bent on conquering the universe. What do you think? Before you answer, check out the data model I included in my KMWorld column in July 2008.

Stephen Arnold, September 3, 2008

Google: More Chrome Browser Goodness

September 3, 2008

In my Google Version 2.0, published by Infonortics, I present a table of patent documents that act as beacons for Google’s engineers. On September 2, 2008, the USPTO published US 7421432 B1. Among the inventors of the “Hypertext Browser Assistant” is Larry Page. He is assisted by two super wizards, Urs Hölzle and Monika Henzinger. My research into Google’s investments in technology suggested that when either Mr. Brin’s or Mr. Page’s name appears on a patent document, that innovation is important. You and the legions of super smart MBAs who disdain grunting through technical documents will probably disagree. Nevertheless, I want to call the abstract for this invention to the attention of my two or three readers.

A system facilitates a search by a user. The system detects selection of one or more words in a document currently accessed by the user,  generates a search query using the selected word(s), and retrieves a document based on the search query. When the document includes one or more links corresponding to a linked document, the system analyzes each of the links, pre fetches the linked documents corresponding to a number of the links, and presents the document to the user. The system receives selection of one of the links and retrieves the linked document corresponding to the selected link. The system identifies one or more pieces of information in the retrieved document, determines a link to a related document for each of the identified pieces of information, and provides the determined links with the related document to the user.

My “pal” Cyrus, a Google demi-wizard, thinks that I create Google images in Photoshop. No, Cyrus, these images appear in Google’s patent documents, which I suggest you and your fellow demi-wizards read before opining on my Photoshop skills. You will see that the browser represented is not Mozilla’s, Microsoft’s or Opera’s.

[Figure: “smart browsing” illustration from the patent document]

What this invention purports to do is provide intelligent “training wheels” to help users find information they are seeking. The system uses a range of Google infrastructure functions to perform its “helper” functions; for example, predictive math, parsed content, and related objects. A more detailed analysis will appear in the Google monograph I am preparing for Infonortics, the publisher who has an appetite for my analyses of Googley innovations. Look for the monograph before the New Year.

If you want to revel in the Page-meister’s golden prose, you can download a copy for free from the outstanding USPTO Web site here. Hint: read the syntax examples carefully. Although the patent narrative suggests that this “training wheels” function will work in a standard browser, my hunch is that some of the more sophisticated functions known to “those skilled in the art” will require Chrome. After you have read the patent document, feel free to post your views of the technology Google has “invented”.

Oh, Cyrus, if you have difficulty locating Google’s patent documents, give me a call. I’m in the system.

Stephen Arnold, September 3, 2008

Microsoft Squeezes Google’s Privacy Policies

September 3, 2008

ZDNet (Australia) reported on August 29, 2008, about Microsoft’s perception of Google and its approach to privacy. I saw the post in the ZDNet UK Web log. (I have to tell you that the failure to have a common index to the ZDNet content is less than helpful. If Bill Ziff were still running the outfit, I believe this oversight would have been addressed and quickly. Ah, youth and the lack of corporate memory. The folks don’t know why I am risking a heart attack over this sort of carelessness.) Liam Tung wrote “Microsoft Exec: Google Years behind Us on Privacy”. You can read the full UK article here. I haven’t been able to locate the Australian original thanks to ZDNet’s fine search system.

For me, the key point in the article was:

Google had not invested enough to build privacy into its products, citing Street View as a prime example.

What I find interesting is that Google does not break out its investments. The company prefers, like Amazon, to offer a big fuzzy ball of numbers. As a result, I don’t think I or anyone outside of Google’s finance unit knows what Google spends on privacy. The notion that a company trying to make headway in online advertising, personalization, and social functions is going to pay much attention to privacy tickles my funny bone. Yahoo’s disappointing ad performance might be attributable to the company’s alleged inability to deliver rolled up demographics so advertisers can pinpoint where to advertise to reach which specific demographic sector. If Microsoft wants to make real money from its $400 million purchase of Ciao.com, the company may have to revisit its own privacy policies.

Google’s picture taking is a privacy flash point. However, based on my research, there are other functions at Google that may warrant further research. Microsoft may be forced to follow in Google’s very big paw prints in its quest for money and catching up to Googzilla.

Stephen Arnold, September 3, 2008

Google Browser: ABCs of Information Access

September 1, 2008

A is for Apple. The company uses WebKit in Safari. B is for browser, the user’s interface to cloud applications and search. C is for containers, Google’s nifty innovation for making each window a baby window on functions. The world is abuzz today (September 1, 2008) with Google’s browser project. The information, according to Google Blogoscoped, appeared in a manga or comic book. You can read that story here. There are literally dozens of posts appearing every hour on this topic, and I want to highlight a few of the more memorable posts and offer several comments.

First, the most amusing post to me is Kara Swisher’s post here. She is a pal of the GOOG and, of course, hooked up with the media giant, currently challenged for revenues and management expertise, The Wall Street Journal. The best thing about her story is that Google’s not creating an extension of the Google environment. Nope, Google is “igniting a new browser war”. I thought Google and Microsoft were at odds already. After a decade, a browser war seems so 1990s to me. But she’s a heck of a writer.

Second, Carnage4Life earned a chuckle with its concluding statement about the GOOG:

Am I the only one that thinks that Google is beginning to fight too many wars on too many fronts. Android (Apple), OpenSocial (Facebook), Knol (Wikipedia), Lively (IMVU/SecondLife), Chrome (IE/Firefox) and that’s just in the past year.

Big companies don’t have the luxury of doing one thing. Google is more in the “controlled chaos” school of product innovation. Of course, Google goes in a great many directions. The GOOG is not a search engine; it is an application platform. It makes sense to me to see the many tests, betas, and probes. Google has been doing this innovation by diffusion since its initial public offering and has never been shy about its approach or its success and failure rate.

Finally, I enjoyed this comment by Mark Evans in “Google Browser or Slow News Day” here. He writes:

The bigger question is whether a Google browser will resonate with computers users. Many people are using an increasing number of Google services (search, GMail, Blogger, etc.) but are they ready to surrender to Google completely by dumping Firefox and IE?

My take is a bit different. Two points without much detail. I have more but this is, after all, a free Web log written by an addled goose.

  1. Why do we assume that Google is suddenly working on a browser? Looking at the screen shots of Google patent documents over the last couple of years, the images do not look like Firefox, Opera or Safari. Indeed when I give talks and show these screen shots, some Googlers like the natty Cyrus are quick to point out that these are photoshopped. Not even some canny Googlers pay attention to what the brainiacs in the Labs are doing to get some Google functions to work reliably.
  2. Google’s patent documents make reference to janitors, containers, and metadata functions that cannot be delivered in the browsers I use. In order to make use of Google’s “inventions”, the company needs a controlled environment. Check out my dataspaces post and the IDC write up on this topic for a glimpse of the broader functionality that demands a controlled computing environment.

I’m not sure I want to call this alleged innovation a browser. I think it is an extension of the Googleplex. It is not an operating system. Google needs a data vacuum cleaner and a controlled computing environment. The application may have browser functions, but it is an extension, not a solution, a gun fight, or an end run around Firefox.

Stephen Arnold, September 1, 2008
