Lawyers and Metadata
January 8, 2009
Now the indexing world gets something to gnaw on. Automated indexing systems beat out humans when measured by cost per item indexed, speed, and consistency. Automated indexing systems can be as good as a human for some types of content. But humans are variably bad at indexing. Software hits a sweet spot and doesn’t get significantly better or worse unless the content throws in a wrench. Now the issue of not providing metadata arises. We can automate the creation of metadata, but it is early days in the world of automatic metadata scrubbing. I quacked happily when I thought, “I wonder who knows where their metadata are?”
Jim Calloway’s “Metadata–What Is It and What Are My Ethical Duties” here breathes new life into human indexing. What I find interesting is that lawyers charge by the hour. Human indexers are paid by piece work schedules or given a flat yearly fee and maybe some benefit crumbs. The economics of human indexing is based on keeping the per record cost as low as possible whilst one maintains the “quality” of the indexing. “Quality” in the commercial database world is often defined as a metric such as “four to six index terms per bibliographic record” or “16 records per hour with required fields completed”. You may have a more academic definition, but my examples come from the soon-to-be-marginalized world of human commercial database production.
The article defines metadata in terms of a legal eagle, of course. But the story gets interesting when Mr. Calloway cites a situation in which metadata became a legal issue. Where there is a legal issue, there is the risk of a fine, jail, or losing pride of place among the brood of legal eagles. Forget the compensation. Ego may be a bigger force in the legal eagle world. Mr. Calloway nicely hooks metadata with risk.
For me, the most important comment in this useful write up was:
In this writer’s view, the key is to avoid sending out documents with metadata that could disclose confidential information. Comparing metadata to a wrongly sent fax or e-mail is questionable and the idea that lawyers will be prohibited from examining metadata while parties, law enforcement officers and private detectives will be free to do so seems artificial at best. The Colorado rule that one must disclose receiving confidential information via metadata before acting on it seems to strike a rational balance. The best rule is for law firms to develop best practices internally to keep metadata from “escaping” in the first place.
I quite like “keep metadata from escaping in the first place”. To close, let me ask several questions:
- Do you know why metadata are in the documents available for indexing on your Web site?
- Do you know how value added indexing in a dataspace can expand the access to a document in an often unrelated context?
- Do you know where metadata are in a document, in a Web page or other container housing the document, or in the dataspace created for the information objects?
If not, you will want to dig up this information yourself. Asking your attorney will result in a very large legal bill. One final question: Do you think Mr. Madoff knows about his metadata?
Stephen Arnold, January 8, 2009
Non-Techies and Metadata
January 8, 2009
The metadata quandary for legal eagles will stick like Kentucky mine runoff. If you want to make sure your Word documents are metadata free, you will want to read “How to Remove the Hidden Metadata from Word Document” here. A slightly more interesting exercise is to aim a search engine’s content acquisition system at shared folders and browse what the spider catches in its digital web. If you think metadata are a liability, check out the goodies you harvest. Download any desktop indexing system that can access your network shares. Now you know why eDiscovery is so important and often quite interesting for those paid to pore through metadata.
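For readers who want to try the shared folder exercise without installing an indexing system, here is a minimal sketch in Python. It assumes the documents are in the newer .docx format, which is a zip package that keeps author and editor names in a part called docProps/core.xml. The function names `read_core_properties` and `scan_folder` are my own inventions for illustration, and older binary .doc files are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# XML namespaces used inside docProps/core.xml of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Report label -> element name. The labels are illustrative choices.
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "title": "dc:title",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(path):
    """Return the non-empty core properties stored in one .docx file."""
    with zipfile.ZipFile(path) as package:
        try:
            raw = package.read("docProps/core.xml")
        except KeyError:
            # The package carries no core properties part.
            return {}
    root = ET.fromstring(raw)
    found = {}
    for label, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            found[label] = node.text
    return found


def scan_folder(folder):
    """Map each readable .docx file under folder to its core properties."""
    report = {}
    for path in sorted(Path(folder).rglob("*.docx")):
        try:
            report[str(path)] = read_core_properties(path)
        except zipfile.BadZipFile:
            # Skip files that only pretend to be .docx packages.
            continue
    return report
```

Point `scan_folder` at a folder you are entitled to read and print the result. This only reads metadata. Scrubbing it is a separate job, and the article cited above covers the Word side of that.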
Stephen Arnold, January 8, 2009
Google Video Creeps Forward
January 7, 2009
Telecompaper.com reported on January 7, 2009, here that “T-Mobile Launches YouTube Channel for G1.” Google has a Google Channel on YouTube.com. How many more channels will be available for special niches? The GOOG, unlike the traditional TV crowd, generates metatags for its videos. Creating a channel is a software process, not one requiring humans sitting in dark control rooms twirling dials. Michael Hirschorn’s “End Times” here notwithstanding, the GOOG’s potential energy in another bastion of traditional energy will increase in force. Like an earthquake, a jump from a 2.0 to a 3.0 is not a linear force. Clever writing won’t do much to change the face of traditional media when Googzilla does its waltz to the Strauss tune Schatz-Walzer. There’s gold in those honking hot videos pumped to any device that can tap into the Google umbilical.
Stephen Arnold, January 7, 2009
Google White Paper: Universal Search
January 7, 2009
Most companies write long white papers. Not the GOOG. The company embraces brevity in its universal search marketing document. In fact, the best way to handle an uninformed journalist is to be silent. Writing an equation is also good. I call this universal brevity.
A Google white paper PR blitz is choking my newsreader. You can download your own copy of a tiny print, four-page white paper about universal search. Here’s the link I used to get the white paper.
The single technical illustration in the Google white paper. Enterprise search cannot become much easier, right?
Now what’s in the four-page white paper? Here are the highlights in my opinion:
- Lots of data without much supporting documentation to indicate that employees want to find information from multiple sources. The fix is “universal search”. A categorical affirmative finds its way into the first page, a sure sign that no mathematician read this argument.
- Universal search is not tailored to each user. The idea is personalization via groups.
- The benefits are pretty much the same benefits one can find in the marketing collateral of the more than 300 vendors who provide search and content processing to organizations. But there is one big difference. The system is from Google, and with about 75 percent of the Web search market, Google means search. This is pure Buyology in action.
- Happy customer, in this case, Stratus Technologies. The GOOG is allowing happy customers to talk. No surprise here. What’s interesting is that you can gin up an interesting customer list of the Google Search Appliance using my Overflight service and the Exalead entity extraction system that is available.
Should you download this paper? Absolutely. Believe me. Procurement teams hear from their constituents, “We want search to work like Google.” Whether you think the Google enterprise search system is fair or foul (no, not fowl), that’s irrelevant. Market share translates to mind share. People want their Google. Someone wrote me today about my assertion that Google could not be stopped in 2009.
The 360 degree approach in universal search. Is this role based information access? If so, is Google breaking new ground or following a well-worn path? I vote for the well-worn path.
As an official addled goose, I have to state that I don’t see IBM, Microsoft, Oracle, or SAP as putting up much of a fight. Autonomy is fighting hard. Exalead is making wins. Coveo is growing. Is any company actively thwarting Googzilla? No, in my opinion. By the way, I have held this view since 2003 and first stated it publicly in The Google Legacy which appeared in 2005. Download the white paper because in the enterprise, your colleagues will be reading this stuff as coming from the mouth of the GOOG (not God, the GOOG. That’s Google’s ticker symbol).
As Google begins its run up to its annual gathering of the enterprise sales faithful, watch for more of this “official” output from the company.
Stephen Arnold, January 7, 2009
Google and Disallow
January 7, 2009
You will want to check out “On Google Disallowing Crawling of Their Life Hosting” here. Google Blogoscoped has a good write up about this, to some, surprising development. Other search engines cannot index the Time Warner Life Magazine images. Google inserted a blocking line in its robots.txt file. I noticed that I was limited in the number of images I could browse when the service first went live. I was surprised that these images were available to me without a fee. For years, the Time crowd has noodled about its picture archive. First, Time wanted to handle the scanning itself. Then Time wanted to subcontract the work but that was too expensive. Then it was a good idea to talk with experts about what to do. Then the cycle repeated. Along came the GOOG and the rest, as someone will write after this goose is cooked, is history. Here’s what is going on in my opinion:
- Restrictive content access is going to become more visible. If you read the Guha patent applications from February 2007, you will have noted that Google’s system can operate in a discriminatory way. That translates, in my view of the world, to restrictions on what others can and cannot do with Google information. This is an important phrase: “Google information.” Please, note it, copyright lovers.
- The Life images are a big deal, and I am confident that the restrictions are probably positioned as part of the method to balance public access with protection for the assets of Time Warner. Everyone has needs, so this restriction is a nifty way of finding a middle way with Googzilla’s hands on the controls.
- The cost of getting the Life images was not trivial. I have not heard anything substantive about the financial burden of this project, but based on my prior knowledge of the magnitude of the scanning and logistics of the images, this puppy was expensive. In my view, unlike a pure academic library play, this deal has a price tag and someone has to pay at some point.
What’s ahead? Well, in my view, once Google creates metadata and populates one of its knowledgebases, those data will be protected and probably with considerable enthusiasm. Google’s programmable search engine generates data and if some data items are missing, the system beavers away until the empty cell is filled. Once those dataspaces are populated, the information is not for just anyone’s use.
I mentioned the word dataspaces in a telephone conversation today. I know I am not communicating. The person on the other end of the call asked, “What’s a dataspace?” Well, you are now disallowed from one.
Stephen Arnold, January 7, 2009
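If you have never looked at how a blocking line in robots.txt behaves, the sketch below may help. It uses Python’s standard `urllib.robotparser` module. The rules, the path `/hosted/life`, the bot name, and the example.com URLs are hypothetical stand-ins that I made up. They are not the actual contents of Google’s robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not Google's real file.
RULES = """\
User-agent: *
Disallow: /hosted/life
"""


def can_fetch(agent, url, rules=RULES):
    """Report whether a well-behaved crawler named agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A polite third-party spider checks before it crawls.
    print(can_fetch("ExampleBot", "http://example.com/hosted/life/photo1"))
    print(can_fetch("ExampleBot", "http://example.com/about"))
```

Keep in mind that robots.txt is a request, not a lock. It stops only the crawlers that choose to honor it.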
Newspapers: Another Analysis of Failure
January 7, 2009
Slate’s Jack Shafer took a Tanaka ECS-3301 chain saw to traditional newspapers here. His “How Newspapers Tried to Invent the Web” was an enjoyable read for me. I don’t think the wizards at some of the formerly high flying newspaper companies were similarly affected. The hook for the article was Pablo J. Boczkowski’s 2004 book, Digitizing the News: Innovation in Online Newspapers. Armed with a fact platform, Mr. Shafer frolics through the misadventures of media mavens and the Web. The phrase I liked was “extreme suckage”. I wish this goose had thought of that. Wordsmithing aside, the comment that resonated with me was:
From the beginning, newspapers sought to invent the Web in their own image by repurposing the copy, values, and temperament found in their ink-and-paper editions. Despite being early arrivals, despite having spent millions on manpower and hardware, despite all the animations, links, videos, databases, and other software tricks found on their sites, every newspaper Web site is instantly identifiable as a newspaper Web site. By succeeding, they failed to invent the Web.
A congratulatory quack to Mr. Shafer for this write up. Read at once. Now think about a similar fate for motion picture outfits confident of their brilliance after a strong 2008. The party’s not over for that crowd. More about this in my forthcoming Google and Publishing monograph.
Stephen Arnold, January 7, 2009
Nine Trends: One Failed Economy
January 7, 2009
Here in rural Kentucky even my neighbors are economizing. Instead of shooting squirrels with semi-automatic rifles, the Hatfield-like brood has reverted to snares. Most of the lower quartile crowd with whom I spend my time did not see the economic meltdown coming. In reality, the whiz kids in London, Manhattan, and Tokyo didn’t seem too aware of the problem either. What’s wrong? There are real time intelligence systems available from big name outfits (Thomson Reuters), smaller firms (Connotate), and newcomers (FirstRain, Silobreaker), among others. These systems deliver business intelligence, often accompanied by graphs and heads up displays. The Cluuz.com system takes the woeful Yahoo search system and makes it yield useful information.
I read with interest the nine business intelligence trends that a reader forwarded to me early this morning. A capable analyst (David Stodder) tackles this subject in the economic climate of the moment. I wonder if the present economic climate is a consequence of business intelligence or a lack thereof. Never mind. That’s a topic for another forum.
Mr. Stodder’s article is a fairly long one. I don’t feel comfortable summarizing his nine trends. Please, read his full presentation here. I propose to highlight three of the trends he’s identified and offer a couple of comments, fresh from my small, stagnant pond in rural Kentucky.
One of his nine trends is that “users demand a richer experience.” His point is that interfaces have to be more like dashboards. Users want to get answers and “integrate at the glass”. I agree. When I took a look at the typical interface for Clarabridge or Cognos (to pick two vendors whose systems I examined at a recent conference), I needed a statistics Sherpa to help me get from A to B. I am not the brightest goose in the flock, but it was clear that these two vendors make some big assumptions about their users. This user thinks the vendors’ assumptions are wrong. No wonder there’s an economic meltdown. The outputs can be polymorphic.
Mr. Stodder creates a “6, 7, & 8” megatrend that seems okay to me. The bundle struck me as a bit confusing. I think the idea of “breaking” the “mold” is spot on. The point is that getting real time information is important. Getting data more quickly means more capital expense. He is right there. What I think is missing is an explanation that the capital investments can be non linear, which to me is an important issue.
The third item I found interesting was the reference to the Google. Mr. Stodder indicates that the traditional database architecture may not be up to today’s business intelligence tasks. Again, he is on the money. I think that the cost of making the old style databases handle today’s petascale data management needs contributes to decision meltdowns. It is tough for me to comment on likely outcomes when I can’t process the complete data stored in traditional database tables in the time available to me to produce an answer.
You must read his other points. You may, as I did, find them thought provoking.
In closing, let me make a couple of goose-grade points before I lose them in the pond scum in which I am floating:
- Phrases such as “business intelligence”, “enterprise search”, and “content management system” are essentially meaningless. What these areas comprise are shopping carts into which one can toss various technologies with labels that assert “money saving” or “increased efficiency in operations” or other types of marketing shibboleth. These are rallying points, not solutions.
- The notion of making an entitlement grad or trophy MBA smarter with a dashboard that presents what he or she needs to know is crazy. The heads up displays in fighter aircraft become very simple when the action gets hot. The notion of mashing, displaying, and combining is crazy because it is usually difficult even for an expert to point out what is important. If these systems worked, do you think the Madoff issues would create work for hundreds of attorneys? Nope, interfaces are secondary to delivering accurate, meaningful data. In my opinion, business intelligence systems have a long way to go. If these worked, would the surprise attack on Gaza have caught the authorities by surprise?
- The delicate suggestion that Codd style databases don’t do the job soft pedals a very serious problem in data management. The notion that Google has a good data management system in MapReduce is like describing New York City in terms of the deli at Lex and 33rd. Google has blown past today’s “business intelligence” vendors. I don’t think the slumbering business intelligence industry, its pundits, or its most vociferous supporters realize that Google is moving and may put the key players in a position of check in Google’s digital chess game.
I might be wrong, but I think some shakeups, business failures, and technology marginalization will make the 2009 business intelligence landscape quite interesting.
Stephen Arnold, January 7, 2009
Data for the 21st Century
January 6, 2009
A happy quack to Max Indelicato for his “Scalability Strategies Primer: Database Sharding” here. Mr. Indelicato has gathered very useful information about data management tactics. Unlike the IBM-Microsoft-Oracle database information, this write up delivers useful, interesting information. Download and save the article. For me, the most important comment in the write up was:
You may be wondering if there is a high amount of overhead involved in always connecting to the Index Shard and querying it to determine where the second data retrieving query should be executed. You would be correct to assume that there is some overhead, but that overhead is often insignificant in comparison to the increase in overall system performance, as a result of this strategy’s granted parallelization. It is likely, independent of most dataset scenarios encountered, that the Index Shard contains a relatively small amount of data. Having this small amount of lookup data means that the database tables holding that data are likely to be stored entirely in memory. That, coupled with the low latencies one can achieve on a typical gigabit LAN, and also the connection pooling in use within most applications, and we can safely assume that the Index Shard will not become a major bottleneck within the system (have fun cutting down this statement in the comments, I already know it’s coming 🙂
Ah, the Google legacy coming to light.
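The two-step lookup in the passage above can be sketched in a few lines of Python. This is a toy, in-memory illustration of the idea and not Mr. Indelicato’s code. The class name `ShardedStore`, the least-loaded placement rule, and the use of plain dictionaries in place of database servers are all assumptions on my part.

```python
class ShardedStore:
    """Toy model: one small index shard plus several data shards."""

    def __init__(self, shard_count):
        # Index shard: key -> number of the data shard holding the record.
        # It stays small, which is why it can live entirely in memory.
        self.index_shard = {}
        # Each dictionary stands in for a separate database server.
        self.data_shards = [{} for _ in range(shard_count)]

    def put(self, key, record):
        """Store a record and return the data shard it landed on."""
        shard_id = self.index_shard.get(key)
        if shard_id is None:
            # Assumed placement rule: a new key goes to the emptiest shard.
            shard_id = min(
                range(len(self.data_shards)),
                key=lambda i: len(self.data_shards[i]),
            )
            self.index_shard[key] = shard_id
        self.data_shards[shard_id][key] = record
        return shard_id

    def get(self, key):
        """Fetch a record with the two queries the quotation describes."""
        shard_id = self.index_shard[key]         # query 1: ask the index shard
        return self.data_shards[shard_id][key]   # query 2: ask the data shard


if __name__ == "__main__":
    store = ShardedStore(shard_count=3)
    for n in range(6):
        store.put(f"user{n}", {"n": n})
    print(store.get("user4"))
```

The extra hop through the index is the overhead the quotation talks about. The payoff is that the second query touches one shard instead of all of them.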
Stephen Arnold, January 6, 2009
Search Pioneer Upshifts: Interview with Mike Weiner
January 6, 2009
In the 1980s I relied on a very fast search system for my personal computer. The program was Gopher from Microlytics. In the late 1990s, I met the founder of Gopher and tracked his interest in linguistic-centric search systems. I lost track of Mike Weiner, former president of Microlytics, but we spoke on the telephone a day or two ago. You can get information about Technology Innovations here. I captured his comments in an interview which is now available on the ArnoldIT.com Search Wizards Speak sub site here.
Two comments in my conversation with Mr. Weiner struck a chord with me. Let me highlight these in this brief news item about the interview.
First, search has grown beyond the desktop. Mr. Weiner said in response to a question about desktop search:
…the desktop of today and tomorrow are connected to the “world.” So there can be very clever background processing done on your behalf that can leverage off the information you access and the information you create. The question will be, what’s useful and important to you, and can the system fetch, or generate, this, for you, and in an efficient form you can cognitively benefit from. One of the next potentials for incredible retrieval will be intelligent “information extraction.”
Second, Mr. Weiner’s new interests pivot on innovation. Technology Innovations holds patents on different facets of electronic paper or “epaper”. About the future of epaper, Mr. Weiner said:
I see epaper heavily used in educational publications, where children and learners have questions, need definitions, etc. You may see a speller and thesaurus, and translation technology coming bundled on books with electronic chips in them.
If you are interested in search and publishing in the 21st century, you will find the Mike Weiner interview interesting.
Stephen Arnold, January 6, 2009
Can You Find Crackle Videos with Crackle Search?
January 6, 2009
At lunch the subject of video search came up among the Beyond Search goslings. One of the newly-hatched goslings mentioned that Sony’s Crackle was indexed thoroughly on Google Video. Furthermore, Sony uses YouTube.com to promote new, original Crackle content. For an example, click here. We fired up our baby Asus netbook and gave the flakey Verizon high speed wireless a go. Success. We were able to connect to the Crackle.com Web site and run queries on Google Video. What’s this have to do with search? Well, the search system on the Crackle.com site is not too good. The system uses a weird and hard to read blue type on black motif, returns matches on “star” and truncates the “ving” without warning, and generally seems sluggish.
I learned from the gosling that Sony bought the Grouper.com site for $65 million in 2006. Some background information is here. Renamed Crackle.com, Sony’s video site is positioned–well–out of site for me. I did explore the site via the search system. The programs like Rocketboom resonated. Sony paid a hefty sum to get the rights to distribute the quirky Net-centric video show. More information about this deal is here.
Sony is spending to be a player in video. But with the PlayStation sucking air and a global financial crisis bubbling away, one wonders if Sony can do much to boost the visibility of the Crackle.com service and have the money to fix the Crackle.com search system. One plus. Crackle.com works a lot better than the piggy Web site for the Sony electronic book.
Stephen Arnold, January 6, 2009