LexisNexis and Interwoven: An Odd Couple

September 6, 2008

The for-fee legal information sector looks like a consistent winner to those who don’t know the cost structures and marketing hassles of selling to attorneys, intelligence agencies, and law schools. Let’s review at a high level the sorry state of the legal information business in the United States. Europe and the Asia Pacific region are a different kitchen of torts.

Background

First, creating legal information is still a labor-intensive operation. Automated processes can reduce some costs, but other types of legal metatagging still require the effort of attorneys or those with sufficient legal education to identify and correct egregious errors. As you may know, making a mistake when preparing for a major legal matter is not too popular with the law firms’ clients.

Second, attorneys and law firms make up one of those interesting markets. At one end there are lots and lots of attorneys who work in very small shops. Someone told me that 90 percent of the attorneys are involved with small firms or work in legal flea markets. Several attorneys get together, lease a space, and then offer desks to other attorneys. Everyone pays the overhead, and the group can pursue individual work or form a loose confederation if necessary. Other attorneys abandon ship. I don’t have data on the quitters in the US, but I know that one of my acquaintances in Louisville, Kentucky, gave up the law to become a public relations advisor. One of my resources is an attorney who works only on advising companies trying to launch an IPO. He hires attorneys, preferring to use his managerial skills without the mind-numbingly dull work that many legal eagles perform.

Third, there are lots of attorneys who have to mind their pennies. Clients in tough economic times are less willing to pay wild and crazy legal bills. These often carry such useful line items as “Research, $2,300” or “Telephone call, $550”. I have worked as an expert witness and gained a tiny bit of insight into the billing and the push back some clients exert. Other clients don’t pay the bills, which makes life tough for partners who can’t buy a new BMW and for the low paid “associates” who can’t buy happiness or pay law school loans.

Fourth, most people know that prices for legal information are high, but there’s a growing realization that the companies with these expensive resources are starting to look a lot like monopolies. Running the only poker game in town makes some of the excluded players want options. In the last few years, I’ve run across services that a single person will start up to provide specific legal type information to colleagues because the blue chip companies were charging too much or delivering stale information at fresh baked bread prices.

Folks like Google.com, small publishers, trade associations, and the Federal government put legal information on Web servers and let people browse and download. Granted, some of the bells and whistles like the nifty footnotes that tell a legal eagle to look at a specific case for a precedent are missing. But some folks are quite happy to use the free services first. Only as a last resort will the abstemious legal eagle pay upwards of $250 per query to look up information in a WestLaw, LexisNexis, or other blue chip vendor’s specialist online file.

Google’s government index service sports what may presage the “new look” for other Google vertical search services. Check it out in the screen shot below. Notice that the search box is unchanged, but the page features categories of information.

[Screen shot: Google government search home page]

Now run the query “district court decisions”. Sorry about the screen shots, but you can navigate to this site and run your own queries. I ran the bound phrase “district court decisions”. Here’s what Google showed me:

[Screen shot: results for the query “district court decisions”]

Let me make three observations:

Read more

Google and Robots aka Computational Intelligence

September 5, 2008

Ed Cone, CIO Insight, posted a short article that had big implications. You can read his “The Cloud, the Haptic Web and Robotic Telepresence” here. Mr. Cone wrangled an interview with Vint Cerf, a Googler, in fact a super Googler. For me, the most important comment in this interview was:

I expect to see much more interesting interactions, including the possibility of haptic interactions – touch. Not just touch screens, but the ability to remotely interact with things. Little robots, for example, that are instantiations of you, and are remotely operated, giving you what is called telepresence. It’s a step well beyond the kind of video telepresence we are accustomed to seeing today.

I find the idea quite suggestive. In my analyses of Google patent documents, I noticed a number of references to agents, intelligent processes, and predictive methods. Is Mr. Cerf offering us his personal view, or is he hinting at Google’s increasing activity in computational intelligence and smart systems? Let me know your thoughts, humans.

Stephen Arnold, September 5, 2008

TinEye: Image Search

September 5, 2008

A happy quack to the reader who tipped me about TinEye, a search system that purports to do for images what the GOOG did for text. The story about TinEye that I saw appeared in the UK computer news service PCPro.co.uk. The story “Visual Search Engine Is Photographer’s Best Friend” is here. The visual search engine was developed by Idée, based in Toronto. The company says:

TinEye is the first image search engine on the web to use image identification technology. Given an image to search for, TinEye tells you where and how that image appears all over the web—even if it has been modified.

The image index contains about one billion images. Search options include uploading an image for the system to pattern match, pointing to an image URL, or using a plug-in for Firefox or Internet Explorer.
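Idée has not published the details of its matching technology, but the general family of techniques is perceptual hashing: reduce an image to a compact fingerprint that survives resizing and light editing, then compare fingerprints by bit distance. The Python sketch below uses a simple “average hash” over an 8×8 grayscale grid; the grid size, the brightness threshold, and the sample data are all illustrative assumptions, not TinEye’s actual method.

```python
def average_hash(pixels):
    """Fingerprint an 8x8 grid of grayscale values as a 64-bit integer.

    Each bit is 1 if the pixel is brighter than the grid's mean.
    Lightly modified copies of an image tend to produce fingerprints
    that differ in only a few bits.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits on which two fingerprints disagree."""
    return bin(a ^ b).count("1")


# A made-up "image" and a uniformly brightened copy: same fingerprint.
grid = [[row * 8 + col for col in range(8)] for row in range(8)]
brighter = [[value + 10 for value in row] for row in grid]
print(hamming_distance(average_hash(grid), average_hash(brighter)))  # 0
```

A real service would first downscale and grayscale each image, and it would need to index a billion fingerprints for fast nearest-neighbor lookup, but the bit-distance comparison is the heart of “finds it even if it has been modified.”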

Search results are displayed graphically. You can explore the images with a mouse click. One interface appears below:

[Screen shot: TinEye search result interface]

The technology powering the service is Espion. I couldn’t locate a public demonstration of the service. You can request a demonstration of the system here. Toronto is becoming a hotbed of search activity. Arikus and Sprylogics both operate there. OpenText has an office. Coveo is present. I will add this outfit to my list of Canadian vendors.

Stephen Arnold, September 5, 2008

Autonomy: Not Idle

September 5, 2008

On September 4, 2008, news about another Autonomy search and content processing win circulated in between and around the Google Chrome chum. HBO, a unit of Time Warner, is a premium programming company. In Newton Minow’s world, HBO would be a provider of waste for the “vast wasteland”. Autonomy nailed this account under the noses of the likes of Endeca, Google, Microsoft Fast ESP, and dozens of other companies salivating for HBO money and a chance to meet those involved with “Rome,” “The Sopranos,” and the funero-lark “Six Feet Under.” Too bad for the US vendors. HBO visited the River Cam and found search goodness. Brief stories are appearing at ProactiveInvestors.com here and MoneyAM.com here. When I checked Autonomy’s Web site, the company’s news release had not been posted, but it will appear shortly. Chatter about Autonomy has picked up in the last few weeks. Sources throwing bread crumbs to the addled goose suggest that Autonomy has another mega deal to announce in the next week or two. On top of that, Autonomy itself is making some moves to bolster its technology. When the addled goose gets some kernels of information, he will indeed pass them on.

In response to the Autonomy “summer of sales”, its competitors are cranking up their marketing machines. Vivisimo is engaging in a Webinar, which you can read about here. Other vendors are polishing new white papers. One vendor is ramping up a telemarketing campaign. Google, as everyone knows, is cranking the volume on its marketing and PR machine. The fact of the matter is that Autonomy has demonstrated an almost uncanny ability to find opportunities and close deals while other vendors talk about making sales. Will an outfit step forward and buy Autonomy? SAP hints that it has an appetite for larger acquisitions. Will Oracle take steps to address its search needs? Will a group of investors conclude that Autonomy might be worth more split into a search company, a fraud detection company, and an eDiscovery company? Autonomy is giving me quite a bit to consider. What’s your take on the rumors? Send the addled goose a handful of corn via the Comments function on this Web log.

Stephen Arnold, September 5, 2008

Blossom Search for Web Logs

September 5, 2008

Over the summer, several people have inquired about the search system I use for my WordPress Web log. Well, it’s not the default WordPress engine. Since I wrote the first edition of Enterprise Search Report (CMSWatch.com), I have had developers providing me with search and content processing technology. We’ve tested more than 50 search systems in the last year alone. After quite a bit of testing, I decided upon the Blossom Software search engine. This system received high marks in my reports about search and content processing. You can learn more about the Blossom system by navigating to www.blossom.com. Founded by Dr. Alan Feuer, a former Bell Laboratories scientist, Blossom search works quickly and unobtrusively to index the content of Web sites, behind-the-firewall systems, and hybrid collections.

You can try the system by navigating to the home page for this Web log here and entering the search phrase “search imperative” in quotes. You will get this result:

[Screen shot: Blossom search results for “search imperative”]

When you run this query, you will see that the search terms are highlighted in red. The bound phrase is easily spotted. The key words in context snippet makes it easy to determine if I want to read the full article or just the extract.

Most Web log content baffles some search engines. For example, recent posts may not appear because the index updating cycle is sluggish. Blossom indexes my Web site on a daily basis, but you can specify the update cycle appropriate to your users’ needs and your content. I update the site at midnight each day, so a daily index refresh lets me find the most recent posts when I arrive at my desk in the morning.

The data management system for WordPress is a bit tricky. Our tests of various search engines identified three issues that came up when third-party systems were launched at my WordPress Web log:

  1. Some older posts were not indexed. The issue appeared to be the way in which WordPress handles the older material within its data management system.
  2. Certain posts could not be located. The posts were indexed, but the default OR handling of multi-word queries displayed too many results. With more than 700 posts on this site, the precision of the query processing was not too helpful to me.
  3. Current posts could not be found. Our tests revealed several issues. The content was indexed, but the indexes did not refresh. The cause appeared to be a result of the traffic to the site. Another likely issue was WordPress’ native data management set up.
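The precision problem in item 2 is easy to demonstrate. A default OR match returns any post containing any query term, while a bound phrase requires the exact sequence. Here is a toy Python matcher run over made-up post titles:

```python
def matches(title, query, mode):
    """Check a title against a query under three matching modes.

    "OR":     any query term appears (the loose default described above)
    "AND":    every query term appears
    "PHRASE": the terms appear as an exact, adjacent sequence
    """
    words = title.lower().split()
    terms = query.lower().split()
    if mode == "OR":
        return any(term in words for term in terms)
    if mode == "AND":
        return all(term in words for term in terms)
    if mode == "PHRASE":
        n = len(terms)
        return any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    raise ValueError(mode)


posts = [
    "google chrome search rumors",
    "search imperative for the enterprise",
    "enterprise search report update",
    "autonomy wins hbo search deal",
]
print(sum(matches(p, "enterprise search", "OR") for p in posts))      # 4
print(sum(matches(p, "enterprise search", "PHRASE") for p in posts))  # 1
```

Scale the OR result up to a 700-post archive and the precision complaint is obvious: nearly every post mentions “search” somewhere.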

As we worked on figuring out search for Web logs, two other issues became evident. First, redundant hits appear because there are multiple paths to the same content. Second, time stamps can be incorrect because all of the content is generated dynamically. Blossom has figured out a way to make sense of the dates in Web log posts, a good thing from my point of view.
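The redundant-hit issue can be handled generically by reducing each result URL to a canonical form before display. The Python sketch below shows one common approach (lowercased host, stripped query string and trailing slash); the example URLs are hypothetical, and Blossom’s actual fix may differ.

```python
from urllib.parse import urlparse


def canonical(url):
    """Reduce a URL to a rough canonical form: lowercase host, no
    query string, no trailing slash. A production crawler would also
    honor rel=canonical hints and compare page content."""
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    return parts.netloc.lower() + path


def dedupe(hits):
    """Keep only the first hit for each canonical URL, preserving order."""
    seen, unique = set(), []
    for url in hits:
        key = canonical(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


# Three paths to the same hypothetical post collapse to one hit.
hits = [
    "http://example.com/wordpress/2008/09/05/blossom-search/",
    "http://Example.com/wordpress/2008/09/05/blossom-search",
    "http://example.com/wordpress/2008/09/05/blossom-search/?replytocom=5",
]
print(len(dedupe(hits)))  # 1
```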

The Blossom engine operates for my Web log as a cloud service; that is, there is no on premises installation of the Blossom system. An on premises system is available. My preference is to have the search and query processing handled by Blossom in its data centers. These deliver low latency response and feature fail over, redundancy, and distributed processing.

The glitches we reported to Blossom proved to be no big deal for Dr. Feuer. He made adjustments to the Blossom crawler to finesse the issues with WordPress’ data management system. The indexing process is lightweight and does not choke my available bandwidth. In fact, traffic to the Web log continues to rise, and the Blossom demand for bandwidth has remained constant.

We have implemented this system on a site run by a former intelligence officer, which is not publicly accessible. The reason I mention this is that some cloud based search systems cannot conform to the security requirements of Web sites with classified content and their log in and authentication procedures.

The ArnoldIT.com site, which is the place for my presentations and occasional writings, is also indexed and searched with the Blossom engine. You can try some queries at http://www.arnoldit.com/sitemap.html. Keep in mind that the material on this Web site may be lengthy. ArnoldIT.com is an archive and digital brochure for my consulting services. Several of my books, which are now out of print, are available on this Web site as well.

Pricing for the Blossom service starts at about $10 per month. If you want to use the Blossom system for enterprise search, a custom price quote will be provided by Dr. Feuer.

If you want to use the Blossom hosted search system on your Web site, for your Web log, or your organization, you can contact either me or Dr. Alan Feuer by emailing or phoning:

  • Stephen Arnold seaky2000 at yahoo dot com or 502 228 1966.
  • Dr. Alan Feuer arf at blossom dot com

Dr. Feuer has posted a landing page for readers of “Beyond Search”. If you sign up for the Blossom.com Web log search service, “Beyond Search” gets a modest commission. We use this money to buy bunny rabbit ears and paté. I like my logo, but I love my paté.

Click here for the Web log search order form landing page.

If you mention Beyond Search, a discount applies to bloggers who sign up for the Blossom service. A happy quack to the folks at Blossom.com for an excellent, reasonably priced, efficient search and retrieval system.

Stephen Arnold, September 5, 2008

Intel and Search

September 5, 2008

True, this is a Web log posting, but I am interested in search thoughts from Intel or its employees. I found the post “Why I Will Never Own an Electronic Book” interesting. I can’t decide whether the post is suggestive or naive. You can read the post by Clay Breshears here. On the surface, Mr. Breshears is pointing out that ebook readers’ search systems are able to locate key words. He wants these generally lousy devices to sport NLP or natural language processing. The portion of the post that caught my attention was:

We need better natural language processing and recognition in our search technology.  Better algorithms along with parallel processing is going to be the key.  Larger memory space will also be needed in these devices to hold thesaurus entries that can find the link between “unemployed” and “jobless” when the search is asked to find the former but only sees the latter.  Maybe, just maybe, when we get to something like that level of sophistication in e-book devices, then I might be interested in getting one.
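Mr. Breshears’s thesaurus example can be sketched in a few lines of Python: expand the query term with its synonyms before matching. The tiny synonym table and sample pages below are, of course, illustrative:

```python
# Illustrative thesaurus; a real system would load a much larger one.
SYNONYMS = {
    "unemployed": {"jobless", "out-of-work"},
    "jobless": {"unemployed", "out-of-work"},
}


def expand(term):
    """Return the query term plus its thesaurus equivalents."""
    return {term} | SYNONYMS.get(term, set())


def search(pages, term):
    """Return the indexes of pages containing the term or a synonym."""
    wanted = expand(term.lower())
    return [i for i, page in enumerate(pages)
            if wanted & set(page.lower().split())]


pages = [
    "the jobless rate rose again last quarter",
    "chip production figures improved in march",
]
print(search(pages, "unemployed"))  # [0]: "jobless" matches via the thesaurus
```

The memory cost Mr. Breshears mentions comes from holding tables like `SYNONYMS` (and much richer linguistic data) on the device itself.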

Intel invested some money in Endeca. Endeca gets cash, and it seems likely that Intel may provide Endeca with some guidance with regard to Intel’s next generation multi core processors. In 2000, Intel showed interest in getting into the search business with its exciting deal with Convera. I have heard references to Intel’s interest in content processing. The references touch upon the new CPUs’ computational capability. Most of this horsepower goes unused, and the grapevine suggests that putting some content pre-processing functions in an appliance, firmware, or on the CPU die itself might make sense.

This Web log post may be a one-off comment. On the other hand, this ebook post might hint at other, more substantive conversations about search and content processing within Intel. There’s probably nothing to these rumors, but $10 million signals a modicum of interest from my vantage point in rural Kentucky.

Stephen Arnold, September 5, 2008

Google on Chrome: What We Meant Really… No, Really

September 4, 2008

You must read Matt Cutts’s “Google Does Not Want Rights to Things You Do Using Chrome”. First, click here to read the original clause about content and rights. Now read the September 3, 2008, post about what Google *really* meant to say here. I may be an addled goose in rural Kentucky, but I think the original statements in clause 11.1 expressed quite clearly Google’s mindset.

It sure seems to me that the two Google statements–the original clause 11.1 and Mr. Cutts’s statements–contradict one another. In large companies this type of “slip betwixt cup and lip” occurs frequently. What struck me as interesting about Google is that it is acting in what I call, due to my lack of verbal skill, “nerd imperialism”.

What troubles me is the mounting evidence in my files that Google can do pretty much what it wants. Mr. Cutts’ writing is a little like those text books that explain history to suit the needs of the school district or the publisher.

Google may house its lawyers one mile from Shoreline headquarters, but I surmise that Google’s legal eagles wrote exactly what Google management wanted. Further, I surmise that Google needs Chrome to obtain more “context” information from Chrome users. I am speculating, but I think the language of the original clause was reviewed, vetted, and massaged to produce the quite clear statements in the original version of clause 11.1.

When the firestorm flared, Google felt the heat and rushed backwards to safety. The fix? Easy. Googzilla rewrote history, in my opinion. The problem is that the original clause 11.1 showed the intent of Google. That clause 11.1 did not appear by magic from the Google singularity. Lawyers drafted it; Google management okayed the original clause 11.1. I can almost hear a snorting chuckle from Googzilla, but that’s my post heart attack imagination and seems amusing to me. (I was a math club member, and I understand mathy humor but not as well as a “real” Googler, of course.)

If you have attended my lecture on Google’s container invention or read my KMWorld feature about Google’s data model for user data, are you able to see a theme? For me, the core idea of the original clause 11.1 was to capture more data about “information.” Juicy meta information like who wrote what, who sent what to whom, and who published which fact where and when. These data are available in a dataspace managed by a dataspace support platform or DSSP which Google may be building.

Google wants these meta-metadata to clean up the messiness of ambiguity in information. Better and more data mean that predictive algorithms work with more informed thresholds. To reduce entropy in the information it possesses, Google needs more, better, and different information–lots of information. For more on usage tracking and Google’s technology, you can find some color in my 2005 The Google Legacy and my 2007 Google Version 2.0. If you are an IDC search research customer, you can read more about dataspaces in IDC report 213562. These reports cost money, and you will have to contact my publishers to buy copies. (No, I don’t give these away to be a kind and friendly former math club member. Giggle. Snort. Snort.)

Plus, I have a new Google monograph underway, and I will be digging into containers, janitors, and dataspaces as these apply to new types of queries and ad functions. For me the net net is that I think Google’s lawyers got it right the first time. Agree? Disagree? Help me learn.

Stephen Arnold, September 4, 2008

Google and Key Stroke Logging

September 4, 2008

Auto suggest is a function that looks at what you are typing in a search box. The agent displays words and phrases that offer suggestions. Sometimes called auto complete, the function lets you arrow down to the phrase you want and hit enter. The agent then runs the query with the word or phrase you selected. This function turned up a couple of years ago on the Yahoo AllTheWeb.com search system. Now it has migrated to Google. You will want to read Ina Fried’s “Chrome Lets Google Log User Keystrokes”, published on September 4, 2008, to get some additional information about this feature. Her point is that when you or I select a suggested search phrase, that selection is noted and sent to Google. For me, the most interesting point in her article was:
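Google has not disclosed how its suggestion service is built, but the core idea can be sketched as a frequency-ranked prefix lookup over previously observed queries. Everything in this Python sketch, including the sample queries, is illustrative:

```python
from collections import Counter


class Suggester:
    """Minimal auto-suggest: rank previously seen queries by frequency
    and offer the ones that start with the typed prefix."""

    def __init__(self):
        self.counts = Counter()

    def record(self, query):
        """Note a completed query (the data auto-suggest learns from)."""
        self.counts[query.lower()] += 1

    def suggest(self, prefix, limit=3):
        """Return up to `limit` suggestions, most frequent first."""
        prefix = prefix.lower()
        hits = [(query, count) for query, count in self.counts.items()
                if query.startswith(prefix)]
        hits.sort(key=lambda pair: (-pair[1], pair[0]))
        return [query for query, _ in hits[:limit]]


s = Suggester()
for q in ["google chrome", "google chrome eula", "google earth", "google chrome"]:
    s.record(q)
print(s.suggest("google c"))  # ['google chrome', 'google chrome eula']
```

The privacy point in Ms. Fried’s article is that, in the real system, every keystroke of the prefix travels to the server to fetch suggestions, not just the final selection.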

Provided that users leave Chrome’s auto-suggest feature on and have Google as their default search provider, Google will have access to any keystrokes that are typed into the browser’s Omnibox, even before a user hits enter. Google intends to retain some of that data even after it provides the promised suggestions. A Google representative told sister site CNET News.com that the company plans to store about two percent of that data, along with the IP address of the computer that typed it.

When I read statements assuring me that an organization will store “about two percent of that data”, I think about phrases such as “Your check is in the mail”. Based on my research, the substantive value of lots of clicks lies in that “two percent”. Here’s why. Most queries follow well worn ruts. If you’ve been to Pompeii, you can see grooves cut in the roadway. Once a cart or chariot is in those grooves, changing direction is tough. What’s important, therefore, is not the carts in the grooves. What’s important are the carts that get out of the grooves. As Google’s base of user data grows, the key indicators are variances, deltas, and other simple calculations that provide useful insights. After a decade of capturing queries about pop stars, horoscopes, and PageRank values, that “two percent” is important. I ask, “How do I know what happens to the other 98 percent of the usage data?” The check is in the mail.
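The “groove” argument can be made concrete. In a query stream dominated by repeats, the fraction of never-seen-before queries (the carts jumping the grooves) is small, which is why even a two percent sample can carry the interesting signal. A hypothetical illustration in Python, with made-up queries:

```python
def novel_fraction(queries):
    """Fraction of queries never seen earlier in the stream: the carts
    that jump out of the grooves."""
    seen = set()
    novel = 0
    for query in queries:
        if query not in seen:
            novel += 1
            seen.add(query)
    return novel / len(queries)


# A made-up stream: 98 repeats of two evergreen queries, plus 2 newcomers.
stream = ["britney spears"] * 49 + ["daily horoscope"] * 49 + [
    "dataspace support platform",
    "blossom web log search",
]
print(novel_fraction(stream))  # 0.04
```

The actual sampling rule Google uses is, of course, not public; the sketch only shows why a small slice of a repetitive stream can hold most of the novelty.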

Stephen Arnold, September 4, 2008

Microsoft: Certified Gold Not Good Enough, Become a Certified Master

September 4, 2008

Brian McCann’s post about a super grade of Microsoft Certified Professional took my breath away. His article, posted on September 2, 2008, is here. The title is “Microsoft Certified Master”. That’s right: master. You can become one by passing tests and paying–are you ready?–$18,500. If you flunk your tests, you can keep trying by paying an additional fee: $250 for the written test and $1,500 for the hands on part. For now, you can become a master in Exchange, SQL Server, and Active Directory. SharePoint is coming along soon. If you go for the SQL and SharePoint combination, you can become a master^2 for a mere $37,000. Mr. McCann’s post includes links to Web logs with more information.

In my opinion, Microsoft is making certain that it has some indentured slaves working on its behalf. Oops, I really meant masters. How silly of me to assume that anyone who becomes a master would think non-Microsoft thoughts. Oracle DBAs are quite open minded, and the certification costs less.

Google is probably licking its chops over this program. Google’s enterprise team has pitched simplicity since the Google Search Appliance appeared years ago. The argument then and now is that enterprise software is too complex. The notion of “let Google do it” has resonated with more than 20,000 GSA licenses, a deal for 1.5 million Gmail boxes in New South Wales, and a near lock on geospatial services in the US government (a technically challenged operation in some agencies). When enterprise software requires a master, Google can ask, “Why do you need to pump resources into a potential black hole of cost?”

Software is complex, but now Google does not have to do much more than describe this new certification level and ask a couple of cost, risk, and time questions. Agree? Disagree? Educate me.

Stephen Arnold, September 4, 2008

Security Dents Chrome

September 4, 2008

InfoWeek, now an online only publication, published “Early Security Issues Tarnish Google’s Chrome” on September 3, 2008. Nancy Gohring has gathered a number of Chrome security issues. You can read the full text of her article here. She catalogs hacker threats, malicious code, Java vulnerabilities, and more. For me, the most interesting statement in the story was:

Google did not directly address questions about this [file download] vulnerability or whether it plans to make any changes to Chrome to prevent any potential problems.

This “no comment” and “indirection” clashes with Google’s transparency push. When I read this sentence in Ms. Gohring’s article, I wondered why journalists don’t confront Google about its slither away and ignore approach to important questions. Transparency? I see a magician’s finesse at work.

What do you perceive?

Stephen Arnold, September 4, 2008

« Previous Page | Next Page »
