Libraries: A Tipping Point in Commercial Online

March 5, 2009

Libraries find themselves in a tough spot. The economic downturn has created a surge in walk-in traffic. In Louisville, Kentucky, I watched as patrons waited to use the various online systems available. I spoke with several people. Most were looking for employment information or government benefit resources. I pop into the downtown library a couple of times a month, and at 4 pm on a Thursday, the place was busy.

In Massachusetts, four libraries found themselves in the spotlight. According to the Wicked Local Brockton here, “Wareham, Norton Libraries Lose Certification; Brockton, Rockland Given Reprieve”. The libraries, according to Maria Papadopoulos’ article, had cut their budgets too much. As a result, the libraries lost their state certification, which further increases budget pressure. Across the country, the Seattle Post Intelligencer reported “Big Challenges Await City’s New Librarian.” Kathy Mulady wrote:

Actual library visits are up 20 percent, and virtual visits online are up even more. About 13 million people visited city library branches last year.

That’s the good news. The bad news is that Seattle, home of Amazon (king of ebooks) and Microsoft (the go-to company for software and online information) has a budget crunch. The new library director will have to deal with inevitable financial pressure at a time when demand for services is going up. Tough job.

What’s this mean for commercial online services?


View of a collision between light rail and a freight locomotive. Will this happen when library budgets collide with the commercial online vendors in 2010? Image source: http://www.calbar.ca.gov/calbar/images/CBJ/2005/Metrolink-Train-Wreck.jpg

My view is that the companies dependent on libraries for their revenue will be facing a very lean 2009. The well managed companies will survive, but those companies that are highly leveraged may find themselves facing significant revenue pressure. Most of the vendors dependent on libraries for revenue are low profile operations. These companies aggregate information and make that information available to individual libraries or to groups of libraries that join together to act as a buying club. Most library acquisitions occur on a cycle that is governed by the budget authority funding a library. In effect, library vendors will receive orders and payments in 2009.

The big crunch may occur in 2010. When that happens, the library vendors will be put under increasing pressure. I have identified three potential developments to watch.

First, I think some high profile library dependent information companies will be forced to merge, cut back on staff and product development, or shut their doors. The size of a library centric company may not protect these firms. The costs of creating and delivering electronic information of higher value than this goose-based Web log are often high and difficult to compress. The commercial database companies are dependent on publishers for content. Publishers are in a difficult spot themselves. As a result, the interlocks between commercial publishing, traditional database companies, and libraries are complex. Destabilize one link and the chain disintegrates. No warning. Pop. Disintegration.


Image source: http://harvardinjurylaw.com/broken-chain.jpg

Second, the libraries themselves are going to have to rethink what they do with their budgets. This type of information decision has been commonplace for many years. For example, libraries have to decide what books to buy. Libraries have to decide what percent of their budget gets spent on periodicals in print or online. Libraries have to decide whether to cut hours or cut acquisitions. Libraries, in short, make life and death information decisions each day. The forced choices mean that libraries have to decide between serving patrons with online access to Internet resources or online access to high value information sources like those purchased from Cambridge Scientific Abstracts (privately held), Ebsco (privately held), Reed Elsevier (tie up between two non US commercial entities, one Dutch, one British), Thomson Reuters (public company), Wolters Kluwer (public, non US company), and some other companies that are not household names. Free services from Google, Microsoft, and Yahoo plus Web logs, Twitter, and metasearch systems like IxQuick.com would look pretty good to me if I had to decide between a $200,000 payment to a commercial database company and providing services to my patrons, students, and consortium partners.

Third, Google’s steady indexing of content in Google Books and in its government service and the general Google Web index offers an alternative to the high value, six figure deals that library centric information companies pursue. If I were working in a library, I would not hesitate to focus on Google-type resources. I would shift money from the commercial database line item to those expenses associated with keeping the library open and the public access terminals connected to the Internet available.

In short, the economic problems for companies in the search and content processing sector are here-and-now problems. The managers of these firms need to make sales in order to stay in business. The library centric information companies are sitting on railroad tracks used by the TGV, just waiting for the real budget collision to arrive. The traditional library information companies cannot get off the tracks even though they know 2010 is going to arrive right on schedule.

I want to steer clear of these railroad tracks. Debris can do some collateral damage.

Stephen Arnold, March 5, 2009

SEO: Good, Bad, Ugly

March 3, 2009

A happy quack to the reader who sent me a link to the February 20, 2009, article by George for Insiders View: Insurance Blow here. “More and More SEO Scams” made the statement:

It seems that there are few whitehat agencies these days. I always advocate some gray hat to stay on top and some blackhat to determine what others are doing. But this is getting ridiculous. The economic climate has pushed people out of the city so instead of brokering toxic investments, they’re now brokering SEO services.

Strong words. I had seen the About.com posting “How to Avoid Being Taken by SEO Scams and Bad SEO Companies” here, but I was not sure how widespread the problem was. Dave Taylor here made this comment in his “SEO Company Promises Top Three Positions: A Scam?”:

Of all the aspects of the Internet, none seems to be so full of con artists and purveyors of dubious businesses than so-called search engine optimization companies. The reason for this is that the basics of SEO (which I’ll call it for simplicity) are simple and can be explained in five minutes. Heck, Google even has a free guide to SEO best practices.


Image source: http://3.bp.blogspot.com/_jhSlOGUoB5k/R-1-flxJm0I/AAAAAAAAE40/y1pVNDBfyXE/s400/scam.jpg

Several thoughts:

  1. As the economy slides toward a financial black hole, some companies hope their Web sites can be a source of sales leads and revenue. Managers turn to their marketing advisors and Web professionals to deliver a return on the Web investment. Pressure increases.
  2. The dominance of Google in Web search means that a company not in the Google index does not exist in some cases. A company whose product or service does not come up on the first page of Google results may not get much traffic.
  3. The quality of Web sites (content, coding) becomes increasingly important. But quality takes thought, time, and effort.

When one mixes these three ingredients together, search engine optimization becomes a must. If a company can afford to buy Google AdWords, then the Web site must have compelling landing pages and the technical plumbing to make it easy for the person landing on a link to take the desired action.


Mysteries of Online 9: Time

March 3, 2009

Electronic information has an interesting property: time distortion. The distortion has a significant effect on how users of electronic information participate in various knowledge processes. Information carries humans along much as a stream whisks a twig in the direction of the flow. Information, unlike water, moves in multiple directions, often colliding, sometimes reinforcing, and at other times moving in paradoxical ways that leave a knowledge worker dazed, confused, and conflicted. The analogy of information as a tidal wave connotes only a partial truth. Waves come and go. Information flow for many people and systems is constant. Calm is tough to locate.


Vector fields. Source: http://www.theverymany.net/uploaded_images/070110_VectorField_test012_a-789439.jpg

In the good old days of cuneiform tablets, writing down the amount of wheat Eknar owed the king required specific steps. First, you had to have access to suitable clay, water, and a clay kneading specialist. Second, you needed to have a stylus of wood, bone, or maybe the fibula of an enemy removed in a timely manner. Third, you had to have your data ducks in a row. Dallying meant that the clay tablet would harden and make life more miserable than it already was. Once the document was created, the sun or kiln had to cooperate. Once the clay tablet was firm enough to handle without deleting a mark for a specified amount of wheat, the tablet was stacked in a pile inside a hut. Fourth, to access the information, the knowledge worker had to locate the correct hut, find the right pile, and then inspect the tablets without breaking one, a potentially bad move if the king had a short temper or needed money for a war or a new wife.

In the scriptorium in the 9th century, information flow wasn’t much better. The clay tablets had been replaced with organic materials like plant matter or, for really important documents, the scraped skin of sheep. Keep in mind that other animals were used. Yep, human skin worked too. Again, time intensive processes were required to create the material on which a person would copy or scribe information. The cost of the materials made it possible to get patrons to spit out additional money to illustrate or illuminate the pages. Literacy was not widespread in the 9th century, and there were a number of incentives to get sufficient person power to convert foul papers to fair copies and then to compendia. Not just anyone could afford a book. Buying a book or similar document did not mean the owner could read. The time required to produce hand copies was somewhat better than the clay tablet method or the chiseled inscriptions or brass castings used by various monarchs.


Yep, I will have it done in 11 months, our special rush service.

With the invention of printing in Europe, the world rediscovered what the Chinese had known for 800, maybe a thousand years. No matter. The time required to create information remained the same. What changed was that once a master set of printing plates had been created, a printer with enough capital to buy paper (cheaper than the skin, more long lasting than untreated plant fiber, and less ink hungry than linen based materials) could manufacture multiple copies of a manuscript. The out of work scribes had to find a new future, but the impact of printing was significant. Everyone knows about the benefits of literacy, books, and knowledge. What’s overlooked is that the existence of books altered the time required to move information from point A to point B. Once time barriers fell, distance compressed as well. The world became smaller if one were educated. Ideas migrated. Information moved around and had impact, which I discussed in another Mysteries of Online essay. Revolutions followed after a couple hundred years, but the mindless history classes usually ignore the impact of information on time.

If we flash forward to the telegraph, time accelerated. Information no longer required a horseback ride, walk, or train ride from New York to Baltimore to close a real estate transaction. Once the newfangled electricity fell in love with information, the speed of information increased with each new innovation. In fact, more change in information speed has occurred since the telegraph than in previous human history. The telephone gave birth to the modem. The modem morphed into a wireless USB 727 device along with other gizmos that make possible real time information creation and distribution.

Time Earns Money

I dug out notes I made to myself sometime in the 1982 – 1983 time period. The implications of time and electronic information caught my attention for one reason. I noted that the revenue derived from a database with weekly updates was roughly 30 percent greater than the revenue derived from the same database on a monthly update cycle. So, weekly updates yielded $1.30 for every $1.00 the monthly cycle produced. I wrote down, “Daily updates will generate an equal or greater increase.” I did not believe that the increase was infinite. The rough math I did 25 years ago suggested that with daily updates the database would yield about 1.6 times the revenue of the same database with a monthly update cycle. In 1982 it was difficult to update a commercial database more than once a day. The cost of data transmission and service charges would gobble up the extra money, leaving none for my bonus.
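A back-of-the-envelope sketch of that arithmetic appears below. The 1.3 multiplier for weekly updates comes from my old notes; the daily multiplier and every cost figure are invented for illustration only, not measurements.

    # Toy model of revenue lift from faster update cycles. The weekly
    # multiplier (1.3x a monthly baseline) is from my 1982-1983 notes; the
    # daily multiplier and all cost figures are illustrative guesses only.

    monthly_baseline = 100_000          # hypothetical annual revenue, monthly updates

    multipliers = {
        "monthly": 1.0,
        "weekly": 1.3,                  # roughly the 30 percent lift noted above
        "daily": 1.6,                   # assumed, not measured
    }

    transmission_costs = {              # hypothetical extra data transmission and
        "monthly": 0,                   # service charges at each update frequency
        "weekly": 5_000,
        "daily": 40_000,
    }

    for cycle, multiplier in multipliers.items():
        revenue = monthly_baseline * multiplier
        net = revenue - transmission_costs[cycle]
        print(f"{cycle:8s} revenue = {revenue:10,.0f}   net = {net:10,.0f}")

The point of the toy model is the shape of the curve, not the numbers: the lift from speed is real, but so is the cost of delivering it.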


In the financial information world, speed and churn are mutually reinforcing. New information makes it possible to generate commissions.

Time, therefore, not only accelerated the flow of information. Time could accelerate earnings from online information. Simply by updating a database more often, the database would generate more money. Update the database less frequently, and the database would generate less money. Time had value to the users.

I found this an interesting learning, and I jotted it down in my notebook. Each of the commercial databases in which I played a role was designed for daily updates and, later, multiple updates throughout the day. To this day, the Web log in which this old information appears is updated on a daily basis, and several times a week it is updated multiple times during the day. Each update carries an explicit time stamp. This is not for you, gentle and patient reader. The time stamp is for me. I want to know when I had an idea. Time marks are important as the speed of information increases.

Implications

The implications of my probably third-hand insight included:

  1. The speed up in dissemination means that information impact is broader, wider, and deeper with each acceleration.
  2. Going faster translates to value for some users who are willing and eager to pay for speed. The idea is that knowing something (anything) first is an advantage.
  3. Fast is not enough. Customers addicted to information speed want to know what’s coming. The inclusion of predictive data adds another layer of value to online services.
  4. Individuals who understand the value of information speed have a difficult time understanding why more online systems and services cannot deliver what is needed; that is, data about what will happen with a probability attached to the prediction. Knowing that something has a 70 percent chance of taking place is useful in information sensitive contexts.

Let me close with one example of the problem speed presents. The Federal government has a number of specialized information systems for law enforcement and criminal justice professionals. These systems have some powerful, albeit complex, functions. The problem is that when a violation or crime occurs, the law enforcement professionals have to act quickly. The longer the reaction time, the greater the chance that the bad egg will be tougher to apprehend. Delay is harmful. The systems, however, require that an individual enter a query, retrieve information, process it, and then use another two or three systems in order to get a reasonably complete picture of the available information related to the matter under investigation.

The systems have a bottleneck. The human. Law enforcement personnel, on the other hand, have to move quickly. As a result, the fancy online systems operate in one time environment and the law enforcement professionals operate in another. The opportunity to create systems that bring both time universes together is significant. Giving a law enforcement team mobile comms for real time talk is good, but without the same speedy and fluid access to the data in the larger information systems, the time problem becomes a barrier.
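One way to picture the fix is a single query fanned out to every system at once, with the results merged for the investigator. The sketch below is a toy, not a description of any government system; the source names, timings, and record formats are invented.

    # Hypothetical sketch: send one query to several stubbed record systems in
    # parallel and merge the hits, instead of making a human run the same
    # search in each silo one at a time. Every source here is a fake stand-in.
    import time
    from concurrent.futures import ThreadPoolExecutor

    SOURCES = ["warrants", "vehicle_registrations", "incident_reports"]

    def search_source(source_name, query):
        time.sleep(0.5)                 # stand-in for each system's response time
        return [f"{source_name}: record matching '{query}'"]

    def federated_search(query):
        with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
            result_lists = list(pool.map(lambda name: search_source(name, query), SOURCES))
        merged = []
        for hits in result_lists:
            merged.extend(hits)
        return merged

    print(federated_search("blue sedan, partial plate 7XK"))

The waiting happens in parallel instead of stacking up on the investigator's clock. That is the merging of time frameworks, not a smarter ranking algorithm.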

Opportunity in online and search, therefore, is significant. Vendors who pitch another fancy search algorithm are missing the train in law enforcement, financial services, competitive intelligence, and medical research. Going fast is no longer a way to add value. Merging different time frameworks is a more interesting area to me.

Stephen Arnold, February 26, 2009

Harry Collier, Infonortics, Exclusive Interview

March 2, 2009

Editor’s Note: I spoke with Harry Collier on February 27, 2009, about the Boston Search Engine Meeting. The conference, more than a decade into in-depth explorations of search and content processing, is one of the most substantive search and content processing programs. The speakers have come from a range of information retrieval disciplines. The conference organizing committee has attracted speakers from the commercial and research sectors. Sales pitches and recycled product reviews are discouraged. Substantive presentations remain the backbone of the program. Conferences about search, search engine optimization, and Intranet search have proliferated in the last decade. Some of these shows focus on the “soft” topics in search and wrap the talks with golf outings and buzzwords. The attendee learns about “platinum sponsors” and can choose from sales pitches disguised as substantive presentations. The Infonortics search conference has remained sharply focused and content centric. One attendee told me last year, “I have to think about what I have learned. A number of speakers were quite happy to include equations in their talks.” Yep, equations. Facts. Thought provoking presentations. I still recall the tough questions posed to Larry Page (Google) after his talk at the 1999 conference. He argued that truncation was not necessary, and several in attendance did not agree with him. Google has since implemented truncation. Financial pressures have forced some organizers to cancel some of their 2009 information centric shows; for example, Gartner, the Magazine Publishers Association, and the Newspaper Publishers Association, to name three. Infonortics continues to thrive with its reputation for delivering content plus an opportunity to meet some of the most influential individuals in the information retrieval business. You can learn more about Infonortics here. The full text of the interview with Mr. Collier, who resides in the Cotswolds with an office in Tetbury, Gloucestershire, appears below:

Why did you start the Search Engine Meeting? How does it differ from other search and SEO conferences?

The Search Engine Meeting grew out of a successful ASIDIC meeting held in Albuquerque in March 1994. The program was organized by Everett Brenner and, to everyone’s surprise, that meeting attracted record numbers of attendees. Ev was enthusiastic about continuing the meeting idea, and when Ev was enthusiastic he soon had you on board. So Infonortics agreed to take up the Search Engine Meeting concept and we did two meetings in Bath in England in 1997 and 1998, then moved thereafter to Boston (with an excursion to San Francisco in 2002 and to The Netherlands in 2004). Ev set the tone of the meetings: we wanted serious talks on serious search domain challenges. The first meeting in Bath already featured top speakers from organizations such as WebCrawler, Lycos, InfoSeek, IBM, PLS, Autonomy, Semio, Excalibur, NIST/TREC and Claritech. And ever since we have tried to avoid areas such as SEO and product puffs and to keep to the path of meaty, research talks for either search engine developers, or those in an enterprise environment charged with implementing search technology. The meetings tread a line between academic research meetings (lots of equations) and popular search engine optimization meetings (lots of commercial exhibits).


Pictured from the left: Anne Girard, Harry Collier, and Joan Brenner, wife of Ev Brenner. Each year the best presentation at the conference is recognized with the Evvie, an award named in honor of her husband, who chaired the first conference in 1997.

There’s a great deal of confusion about the meaning of the word “search.” What’s the scope of the definition for this year’s program?

Yes, “Search” is a meaty term. When you step back, searching, looking for things, seeking, hoping to find, hunting, etc are basic activities for human beings — be it seeking peace, searching for true love, trying to find an appropriate carburetor for an old vehicle, or whatever. We tend now to have a fairly catholic definition of what we include in a Search Engine Meeting. Search — and the problems of search — remains central, but we are also interested in areas such as data or text mining (extracting sense from masses of data) as well as visualization and analysis (making search results understandable and useful). We feel the center of attention is moving away from “can I retrieve all the data?” to that of “how can I find help in making sense out of all the data I am retrieving?”

Over the years, your conference has featured big companies like Autonomy, start ups like Google in 1999, and experts from very specialized fields such as Dr. David Evans and Dr. Liz Liddy. What pulls speakers to this conference?

We tend to get some of the good speakers, and most past and current luminaries have mounted the speakers’ podium of the Search Engine Meeting at one time or another. These people see us as a serious meeting where they will meet high quality professional search people. It’s a meeting without too much razzmatazz; we only have a small, informal exhibition, no real sponsorship, and we try to downplay the commercialized side of the search world. So we attract a certain class of person, and these people like finding each other at a smaller, more boutique-type meeting. We select good-quality venues (which is one reason we have stayed with the Fairmont Copley Plaza in Boston for many years), we finance and offer good lunches and a mixer cocktail, and we select meeting rooms that are ideal for an event of 150 or so people. It all helps networking and making contacts.

What people should attend this conference? Is it for scientists, entrepreneurs, marketing people?

Our attendees usually break down into around 50 percent people working in the search engine field and 50 percent those charged with implementing enterprise search. Because of Infonortics’ international background, we have a pretty high international attendance compared with most meetings in the United States: many Europeans, Koreans and Asians. I’ve already used the word “serious”, but this is how I would characterize our typical attendee. They take lots of notes; they listen; they ask interesting questions. We don’t get many academics; Ev Brenner was always scandalized that not one person from MIT had ever attended the meeting in Boston. (That has not changed up until now).

You have the reputation for delivering a content rich program. Who assisted you with the program this year? What are the credentials of these advisor colleagues?

I like to work with people I know, with people who have a good track record. So ever since the first Infonortics Search Engine Meeting in 1997 we have relied upon the advice of people such as you, David Evans (who spoke at the very first Bath meeting), Liz Liddy (Syracuse University) and Susan Feldman (IDC). And over the past nine years or so my close associate, Anne Girard, has provided non-stop research and intelligence as to what is topical, who is up-and-coming, who can talk on what. These five people are steeped in the past, present and future of the whole world of search and information retrieval and bring a welcome sense of perspective to what we do. And, until his much lamented death in January 2006, Ev Brenner was a pillar of strength, tough-minded and with a 45-year track record in the information retrieval area.

Where can readers get more information about the conference?

The Infonortics Web site (www.infonortics.eu) provides one-click access to the Search Engine Meeting section, with details of the current program, access to pdf versions of presentations from previous years, conference booking form and details, the hotel booking form, etc.

Stephen Arnold, March 2, 2009

Microsoft: The RCA Analogy

February 26, 2009

When I worked at Booz, Allen in the late 1970s, I was quite impressed with Harvard MBAs who could explain certain business events with aplomb. Most of these examples had a punch line, almost as if Bob Hope or Groucho Marx had gone into business, not vaudeville. The punch line was often delivered by these glossy, sleek, confident masters of the universe with a rising tone. Like a question except you were supposed to be floored by their brilliance.

I recall one afternoon in a class at the Harvard Business School where my boss (William P. Sommers) was giving a lecture about technology to the 100 people jammed into a lecture hall. After his prepared remarks, several students and the faculty member (I think it was the fellow who headed laser research at RCA at one time) asked if Dr. Sommers and I would look at a new case. The case, as it turned out, concerned laser technology applied to data storage. The reason I remember this was a fluke of memory. The article here by Joe Wilcox “Ballmer: RCA Is Our Role Model” took me back in time.


The dog Nipper was a great image in RCA’s salad days. By 1979, Nipper was a dog for sure.

By 1979, RCA was a loser in my opinion with Discovision and helium neon laser tubes which gave way to the wonderful infrared semiconductor laser diodes. Think a foot in diameter and bulky. The other charming feature was laser rot. I had been working on two projects. One was the study of world economic change. Former secretary of the treasury William Simon was one of the fine, friendly folks on that project. My lowly job was to analyze the relationship between R&D expenditures and forecast orders across several high tech sectors. I don’t remember the exact numbers, but my analysis suggested that RCA was not just falling behind. RCA was a goner. The other study was about innovation in 10 high tech firms scattered around the world. RCA was not one of the companies we were studying, but I do recall interviewing a number of executives from RCA who had moved to Thomson (the French outfit, not the newspaper Thomson), Hitachi, and eight other big guns. My recollection after three decades is of the departure of smart people who seemed to know which way the wind was blowing.


Which did you want in your auto for music?

Now, sitting with the Harvard prof and a handful of well fed, glossy MBAs to be, I was reading about RCA and its bumbling of its laser opportunity. To make a long story short, RCA did not pursue the data path, leaving the field wide open to other companies who leapfrogged hapless RCA. RCA had the technology and the market share. RCA was swallowed by GE and then sold to Thomson-Brandt SA, which became Thomson SA. For most people, RCA is a second tier, maybe a third tier brand. RCA and laser technology don’t line up in most folks’ minds, I would assert.


Mysteries of Online 8: Duplicates

February 24, 2009

In print, duplicates are the province of scholars and obsessives. In the good old days, I would sit in a library with two books. I would then look at the data in one book and then hunt through the other book until I located the same or similar information. Then I would examine each entry to see if I could find differences. Once I located a major difference such as a number, a quotation, or an argument of some type, I would write down that information on a 5×8 note card. I had a forensics scholarship along with some other cash for guessing accurately on objective tests. To get the forensics grant, I had to participate in cross examination debate, extemporaneous speaking, and just about any other crazy Saturday time waster my “coaches” demanded.

Not surprisingly, mistakes or variances in books, journals, and scholarly publications were not of much concern to some of the students who attended the party school that accepted an addled goose with thick glasses. There were rewards for spending hours looking for information and then chasing down variances. I recall that our debate team, which was reasonably good if you liked goose arguments, was putting up with a team from Dartmouth College. I was listening when I heard a statement that did not match what I had located in a government reference document and in another source. The opponent from Dartmouth had erroneously presented the information. I gave a short rebuttal. I still remember the look of nausea that crossed our opponent’s face when she realized that I presented what I found in my hours of manual checking and reminded the judges that distorting information suggests an issue with the argument. We won.


For most people, the notion of having two individuals with the same source is an example of duplicate information. Upon closer inspection, duplication does not mean identical in gross features. Duplication drills down to the details of the information and to the need to determine which item of information is at variance, then to figure out why and which version is most likely correct.

That’s when the fun begins in traditional research. An addled goose can do this type of analysis. Brains are less important than persistence and a tolerance for some dull, tedious work. As a result, finding duplicative information and then figuring out variances was not something that the typical college sophomore spent much time doing.

Enter computer systems.


A Publisher Who Violated Copyright: Foul Play

February 22, 2009

I did not want to write about this situation until I cooled down. I write monographs (dull and addled ones to be sure) but I expect publishers to follow the copyright laws for the country in which the publisher resides. I used to work at a big, hungry, successful publishing company in New York. Even in the go-go 1980s, the owner set an example for the officers and professionals to follow. The guideline was simple. Treat information and copyright with respect. Before returning to the nutso New York scene, I worked at the Courier Journal & Louisville Times Co., then one of the top 25 newspapers in the world. The rules were clear there too. Respect copyright. I have three active publishers at this time: Frank Gilbane (The Gilbane Group), whom I have described as the least tricky information wizard I know; Harry Collier (Infonortics Ltd.), my Google publisher and long time colleague; and Steve Newton, at Galatea in the UK, who makes my lawyer look like a stand up comedian. Mr. Newton is serious and respectful of authors like the savvy Martin White and me, the addled goose.


I would go straight to my attorney if I found out that one of these professionals was sending copies of my monographs, without my permission, to individuals who were not reviewers or representatives of a procurement team. Gilbane, Collier, and Newton would either send an email or pick up the mobile and let me know who wanted a copy.

I was thunderstruck when a dead tree publisher in New Jersey, which I will not name, sent me, via electronic mail and without any prior communication, a copy of a hot off the press book about Google. I took three actions:

  1. I alerted my attorney that a publisher was possibly violating copyright and that I wanted to know what to do to protect myself. “Delete the file” and “Tell ’em not to do this type of distribution again” were the two points I recall.
  2. I asked one of my top researchers and one of the people who does research for my legal and investigative reports to telephone the publisher and state what the attorney told me. Then repeat the message again and inform the publisher to pass further communications to my assistant, not to me.
  3. I deleted the file.


Mysteries of Online 7: Errors, Quality, and Provenance

February 19, 2009

This installment of “Mysteries of Online” tackles a boring subject that means little or nothing to the entitlement generation. I have recycled information from one of my talks in 1998, but some of the ideas may be relevant today. First, let’s define the terms:

  • Errors–Something does not work. Information may be wildly inaccurate but the user may not perceive this problem. An error is a browser that crashes, a page that doesn’t render, a Flash that fails. This notion of an error is very important in decision making. A Web site that delivers erroneous information may be perceived as “right” or “good enough”. Pretty exciting consequences result from this notion of an “error” in my experience.
  • Quality–Content displayed on a Web page is consistent. The regularity of the presentation of information, the handling of company names in a standard way, and the tidy rows and columns with appropriate values becomes “quality” output in an online experience. The notion of errors and quality combine to create a belief among some that if the data come from the computer, then those data are right, accurate, reliable.
  • Provenance–This is the notion of knowing from where an item came. In the electronic world, I find it difficult to figure out where information originates. The Washington Post reprints a TechCrunch article from a writer who has some nerve ganglia embedded in the companies about which she writes. Is this provenance enough, or do we need the equivalent of a PhD from Oxford University and a peer reviewed document? In my experience, few users of online information know or know how to think about the provenance of the information on a Web page or in a search results list. Pay for placement adds spice to provenance in my opinion.


So What?

A gap exists between individuals who want to know whether information is accurate and can be substantiated from multiple sources and those who take what’s on offer. Consider this Web log post. If someone reads it, will that individual poke around to find out about my background, my published work, and what my history is? In my experience, I see a number of comments that say, “Who do you think you are? You are not qualified to comment on X or Y.” I may be an addled goose, but some of the information recycled for this Web log is more accurate than what appears in some high profile publications. A recent example was a journalist’s reporting that Google’s government sales were about $4,000, down from a couple of hundred thousand dollars. The facts were wrong, and when I checked back on that story I found that no one pointed out the mistake. A single GB 7007 can hit $250,000 without much effort. It doesn’t take many Google Search Appliance sales to beat $4,000 a year in revenue from Uncle Sam.

The point is that most users:

  1. Lack the motivation or expertise to find out if an assertion or a fact is correct or incorrect. In my opinion, few people care much about the dull stuff–chasing facts. Even when I chase facts, I can make an error. I try to correct those I can. What makes me nervous are those individuals who don’t care whether information is on target.
  2. Do not see research as a core competency. Research is difficult and a thankless task. Many people tell me that they have no time to do research. I received an email from a person asking me how I could post to this Web log every day. Answer: I have help. Most of those assisting me are very good researchers. Individuals with solid research skills do not depend solely upon the Web indexes. When was the last time your colleague did research among sources other than those identified in a Web index?
  3. Get confused with too many results. Most users look at the first page of search results. Fewer than five percent of online users make use of advanced search functions. Google, based on my research, takes a “good enough” approach to their search results. When Google needs “real” research, the company hires professionals. Why? Good enough is not always good enough. Simplification of search and the finding of information is a habit. Lazy people use Web search because it is easy. Remember: research is difficult.


Mysteries of Online 6: Revenue Sharing

February 16, 2009

This is a short article. I was finishing the revisions to my monetization chapter in Google: The Digital Gutenberg and ran across notes I made in 1996, the year in which I wrote several articles about online for Online Magazine. One of the articles won the best paper award, so if you are familiar with commercial databases, you can track down this loosely coupled series in the LITA reference file or other Dialog databases.

Terms Used in this Write Up

  • database: A file of electronic information in a format specified by the online vendor; for example, Dialog Format A or EBCDIC
  • database producer: An organization that creates a machine-readable file designed to run on a commercial online service
  • online revenue: Cash paid to a database producer, generated when a user connected to an online database and displayed online or output the results of a search to a file or a hard copy
  • online vendor: A commercial enterprise that operated a time sharing service, search system, and customer support service on a fee basis; that is, annual subscription, online connect charge, online type or print charge
  • publisher: An organization engaged in creating content by collecting submissions or paying authors to create original articles, reports, tables, and news
  • revenue: Money paid by an organization or a user to access an online vendor’s system and then connect and access the content in a specific database; for example, Dialog File 15 ABI/INFORM

My “mysteries” series has evoked some comments, mostly uninformed. The number of people who started working in search when IBM STAIRS was the core tool is dwindling. The people who cut their teeth in the granite choked world of commercial online comprise an even smaller group. Commercial online began with US government funding in the early 1960s, so Ruby loving script kiddies are blissfully ignorant of how online files were built and then indexed. No matter. The lessons form foundation stones in today’s online world.

Indexing and Abstracting: A Backwater

Aggregators collect content from many different sources. In the early days of online, this meant peer reviewed articles. Then the net widened to include magazines and non-peer reviewed publications like trade association magazines. Indexing and abstracting in the mid 1960s was a backwater because few publishers knew much about online. Permission to index and abstract was often not required, and when a publisher wanted to know why an outfit was indexing and abstracting a publication, the answer was easy. “We are creating a library reference book.” Most publishers cooperated, often providing some of the indexing and abstracting outfits with multiple copies of their publications.

Some of the indexing and abstracting was very difficult; for example, legal, engineering, and medical information posed special problems. The vocabulary used in the documents was specialized, and word lists with Use For and See Also references were essential to indexing and abstracting. The abstract might define a term or an acronym when it referenced certain concepts. When abstracts were included with a journal article, the outfit doing the indexing and abstracting would often ask the publisher if it was okay to include that abstract in the bibliographic record. For decades publishers cooperated.

The reason was that publishers and indexing and abstracting outfits were mutually reinforcing operations. The publisher collected money from subscribers, members, and in some cases advertisers. The abstracting and indexing shops earned money by creating print and electronic reference materials. In order to “read the full text”, the researcher had to have access to a hard copy of the source document or, in some cases, a microfilm instance of the document.

No money was exchanged in most cases. I think there was trust among publishers and indexing and abstracting outfits. Some of the people engaged in indexing and abstracting created products so important to certain disciplines that courses were taught in universities worldwide to teach budding scientists and researchers how to “find” and “use” indexes, abstracts, and source documents. Examples include the Chemical Abstracts database, Beilstein, and ABI/INFORM, the database with which I was associated for many years.

Pay to Process Content

By 1982, some publishers were aware that abstracting and indexing outfits were becoming important revenue generators in their own right. Libraries were interested in online, first in catalogs for their patrons, and then in licensing certain content directly from the abstracting and indexing shops. The reason for this interest from libraries (medical, technical, university, public, etc.) was that the technology to ingest a digital file (originally on tape) was becoming available. Second, the cost of using commercial online services which would make hundreds of individual abstract and index databases available was variable. The library (academic or corporate) would obtain a password and a license. Each database incurred a charge, usually billed either by the minute or per query. Then there were online connect charges imposed by outfits like Tymnet or other services. And there were even charges for line returns on the original Lexis system. Libraries had limited budgets, so it made sense for some libraries to cut the variable costs by loading databases on a local system.
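The budget logic is simple enough to sketch. The figures below are hypothetical; actual per-query, connect, and license charges varied wildly from contract to contract.

    # Toy comparison: metered commercial online searching versus loading a
    # licensed database on a local system. Every figure here is hypothetical.

    searches_per_year = 4_000

    per_query_charge = 3.50             # assumed vendor charge per query
    connect_charge = 1.25               # assumed network (Tymnet-style) charge per search

    metered_cost = searches_per_year * (per_query_charge + connect_charge)

    local_license_fee = 12_000          # assumed flat fee to load the file locally
    local_overhead = 4_000              # assumed staff and hardware allocation

    local_cost = local_license_fee + local_overhead

    print(f"metered online: ${metered_cost:,.2f} per year")
    print(f"local load:     ${local_cost:,.2f} per year")
    # Above some search volume the flat local cost wins; below it, metered access wins.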

By 1985, full text became more attractive to users. The reason was that A&I (abstracting and indexing) services provided pointers. The user then had to go find and read the source document. The convenience of having the bibliographic information and the full text online was obvious to anyone who performed research in anything other than a casual, indifferent manner. The notion of disintermediation expanded first in the A&I field because with full text, why pay to create a formal bibliographic record and manually assign index terms? The future was full text because systems could provide pointers to documents. Then the document of interest to the researcher could be saved to a file, displayed on screen, or printed for later reference.

The shift from the once innovative A&I business to the full text approach threw a wrench into the traditional reference business. Publishers were suspicious and then fearful that if the full text of their articles were in online systems, subscription revenues would fall. The publishers did not know how much risk these systems posed, but some publishers like Crain’s Chicago Business wanted an upfront payment to permit my organization to create full text versions of certain articles in the Crain publications. The fees were often in the five figure range and had additional contractual obligations attached. Some of these original constraints may still be in operation.


Negotiating an online deal is similar to haggling to buy a sheep in an open market. The authors were often included among the sheep in the traditional marketplace for information. Source: http://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Haggling_for_sheep.jpg/800px-Haggling_for_sheep.jpg

Revenue Sharing

Online vendors like Dialog Information Services knew that change was in the air. Some vendors like Dialog and LexisNexis moved to disintermediate the A&I companies. Publishers jockeyed to secure premium deals for their full text material. One deal which still resonates at LexisNexis today was the New York Times’s arrangement with LexisNexis for the New York Times’s content. At its height, the rumor was that LexisNexis paid more than $1 million for the exclusive that put the New York Times’s content in the LexisNexis services. The New York Times decided that it could do better by starting its own online system. Because publishers saw only part of the online puzzle, the New York Times’s decision was a fateful one which has hobbled the company to the present day. The New York Times did not understand the cost of the infrastructure and the importance of habituated users who respond to the magnetism of an aggregate service. Pull out a chunk of content, even the New York Times’s content, and what you get is a very expensive service with insufficient traffic to pay the overall cost of the online operation. Publishers making this same mistake include Dow Jones, the Financial Times, and others. The publishers will bristle at my assertion that their online businesses are doomed to be second string players, but look at where the money is today. I rest my case.

To stay in business, online players cooked up the notion of revenue sharing. There were a number of variations of this business model. The deal was rarely 50 – 50 for the simple reason that as contention and distrust grew among the vendors, the database companies, and the publishers, knowledge of costs was very difficult to get. Without an understanding of costs in online, most organizations are doomed to paddling upstream in a creek that runs red ink. The LexisNexis service may never be able to work off the debt that hangs over the company from its money sucking operations that date from the day the New York Times broke off to go on its own. Dow Jones may never be able to pay off the costs of the original Dow Jones online service which ran on the mainframe BRS search system and then the expensive joint venture with Reuters that is now a unit in Dow Jones called Factiva. Ziff Communications made online pay with its private label CompuServe service and its savvy investments in high margin database operations that did business as Information Access. Characteristic of Ziff’s acumen, the Ziff organization exited the online database business in the early 1990s and sold off the magazine properties, leaving the Ziff group with another fortune in the midst of the tragedy of Mr. Ziff’s health problems. Other publishers weren’t so prescient.

With knowledge in short supply, here were the principal models used for revenue sharing:

Tactic A: Pool and Payout Based on Percentage of Content from Individual Publishers

This was a simple way to compensate publishers. The aggregator would collect revenues. The aggregator would scrape off an amount to cover various costs. The remainder would then be divided among the content providers based on the amount of content each provider contributed. To keep the model simple (it wasn’t), think of a gross online revenue of $110. Take off $10 for overhead (the actual figure was variable and much larger). The remainder is $100. One publisher provided 60 percent of the content in the pay period. Another publisher provided 40 percent of the content in the pay period. One publisher got a check for $60 and the other a check for $40. The pool approach guarantees that most publishers get some money. It also makes it difficult to explain to a publisher how a particular dollar amount was calculated. Publishers who turned an MBA loose on these deals would usually feel that their “valuable” content was getting shortchanged. It wasn’t. The fact is that disconnected articles are worth less in a large online file than a collection of articles in a branded traditional magazine. But most publishers and authors today don’t understand this simple fact of the value of an individual item within a very large collection.
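The arithmetic in the example above is easy to put into a script. The $110 gross, $10 overhead, and 60/40 split mirror the example; the publisher names are invented.

    # Tactic A in miniature: pool the gross online revenue, take off overhead,
    # and split what remains by each publisher's share of the content supplied.
    # Figures mirror the example above; the publisher names are made up.

    gross_revenue = 110.00
    overhead = 10.00                    # in practice larger and far less predictable

    content_share = {                   # fraction of content supplied in the pay period
        "Publisher A": 0.60,
        "Publisher B": 0.40,
    }

    pool = gross_revenue - overhead
    payouts = {name: round(pool * share, 2) for name, share in content_share.items()}

    print(payouts)                      # {'Publisher A': 60.0, 'Publisher B': 40.0}

Explaining a check calculated this way to a publisher's MBA was the hard part, not the arithmetic.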

I was fascinated when smart publishers would pull out of online services and then try to create their own stand alone online services without understanding the economic forces of online. These forces operate today and few understand them after more than 40 years of use cases.


Mysteries of Online 5: Information Flows that Deconstruct

February 14, 2009

I have some edits to stuff into the Outlook section of my new study Google: The Digital Gutenberg, but I saw another write up about the buzz over a Wall Street Journal editor’s comment that “Google devalues everything.” (Man, those categorical affirmatives are really troubling to this old, addled goose. Everything. Right.) The story in TechDirt has a nifty sub head, “From The No Wonder No One Uses It Department”. You can read the story here. I agree with the whining about the demise of traditional media’s hegemony. For me the most interesting comment in this article was:

The value of the web and Google is that it lets people look at many sources and compare and contrast them qualitatively. Putting up a paywall is what devalues the content. It makes it harder to access and makes it a lot less useful. People today want to share the news and spread the news and discuss the news with others. As a publisher, your biggest distributors should be your community. And what does the WSJ want to do? Stop the community from promoting them. I can’t think of anything that devalues their content more.

TechDirt and the addled goose are standing feather to feather.

I do, however, want to pull out my musty notes from a monograph I have not yet started to write. As you may have noticed, the title of my essay is “mysteries of online,” and this is the fifth installment. I am recycling ideas from my 30 plus years in the digital information game. If you are not sure about the nature of my observations, you will want to read the disclaimer on my About page. Offended readers can spare me the jibes about the addled nature of my views in this free publication.


The future of traditional media. The ground opened and the car crashed. The foundation of the road was gone, eroded by unseen forces. Source: http://gamedame.files.wordpress.com/2008/05/car_sinkhole.jpg

Under the surface of the dead tree executive’s comments, and driving in part the TechDirt observations, are some characteristics of electronic information. More people have immersed themselves in easy and painless online access. As a result, the flow of “real time” has become the digital amphetamine that whips up excitement and in some cases makes or breaks business models. I want to summarize several of the factors that are now mostly overlooked.

Information Has Force

The idea is that in the post Gutenberg era, digital information can carry quite a wallop. Some people and institutions can channel that force. Others get flattened. My father, for example, cannot figure out what a newstream is on my ArnoldIT.com Overflight service. He simply tunes out the flow because his mental framework is not set up to understand that the flow is the value. One can surf on the flow; one can drown in the flow.


Will these kids read dead tree newspapers, magazines, and textbooks as I did in the 1950s?

