Database Content: Take or Use

March 12, 2009

You may want to read Out-Law.com’s “Database Infringements Depend on Taking, Not Usage of Data” here. The article tackles an issue that has triggered a European Court of Justice ruling. For me the key statement in the Out-Law.com synopsis of the ruling was:

The Directive protects against “extraction and/or re-utilisation of the whole or of a substantial part…of the contents of that database”. The ECJ said that infringement was independent of the use to which someone wants to put the information.

Does this ruling matter in the US or elsewhere?

In my opinion, the ruling underscores the difference between two views of the same data: the view of the person who compiles a database and provides access to that specific compilation, and the view of the person who wants to repurpose some of the data in that database. I am no lawyer, but I do work with clients who can click to a Web site and find useful information; for example, the data available from a government Web site or the patent information I have compiled for my Google patent search service.

Software can now slice and dice data. A programmer can make many information “meals” with these amazing software tools.

There are different ways to view the structured data such as airline flight information or condos for sale in Baltimore, Maryland or loosely structured data such as an RSS feed or well formed XML documents.

An innovator / entrepreneur can see these data as raw material for something new. The idea is that individual data items may gain utility when assembled or organized in a way different from the way the information appears on a specific Web site. Because the information is viewable in a browser, it seems to the innovator / entrepreneur that the data or their constituent elements like a phone number are like molecules in a mixture. These can be combined without losing their original chemical structure. The data are publicly available, so the data are meant to be used.

The aggregator / builder of the data set sees the fields, the contents, and the idea of the data set as his / her personal creation. When I create a database, I see it as my own “thing”. I don’t know the legal issues, but I have a notion of how much work it has taken to find the data, clean them up, organize them, and make them available. In the case of the Google patent documents, the USPTO does a lousy job of handling patent information. My frustration in using this system and its vendor supplied search tool (I heard Open Text but this may be an incorrect attribution) gave me the idea of collecting the PDF files, finding or getting from the USPTO the HTML versions, cobbling together a relatively clean version of each HTML patent document, and linking the PDF file with the cleaned up HTML file. You can see the collection in action here. I have worked with several vendors to index this data set so people can search Google’s patent documents and then with one click view the PDF in their browser. There’s no charge for this access.

Do I want another to rip off what I have assembled, probably not very well, over the last three years? If someone asks, I would probably have no objection. If someone just takes the whole construct, I would be annoyed. Would the outfit taking the data understand why I would be honking at them in my addled goose Web log? Probably not.

The original assembler of the data. Quite a bit of data comes from individual research outfits. In the pre Depression days, a research firm (big, little, competent, incompetent) could come up with an idea. With a low cost effort, data could be gathered. The information might be sold as a one off report or a multi client study. With the Web, the research firms put up some of the research findings and try to sell the rest or get consulting work based on the findings (accurate, wrong, crazy, made up, whatever). If I link to some of these data, will the assembler accuse me of data theft? Some may. Some won’t. The PR play wants me and people like me to point to the data. The backlink has value in the Google centric world. The naive will assert that the link or unit of data constitutes a violation of one or more information conventions, laws, or regulations.

So which point of view does one adopt?

I think the Out-Law.com posting is important because it underscores these issues. Please, keep in mind that I am offering my opinion as an addled goose:

Composite applications running in a browser are going to suck in, slice, dice, and spit out structured and semi structured data with increasing efficiency. The result is that most people won’t know whose data are where or where some of the data originated. A phone number is a phone number unless there are data points that permit the tracking of the point of origin of that single phone number. Looking at a phone number and saying, “That’s from my database” strikes me as an assertion that needs provenance metadata to back it up.
Users don’t want to go different places, find a unit of information, copy it, and then reassemble the information into a table, report, or dossier. Software is going to do this work. As long as there are users who see this type of intellectual work as “pain”, then programmers will try to create a product or service that delivers what the users want. In my opinion, most people with Web sites don’t understand the A to B thinking of a programmer: Problem, php, product. Grill the programmer and he or she will say, “Dude, what’s your problem?”
Cloud services “suck” in or “eat” data to survive. I have written about the notion of time and freshness in my “Mysteries of Online” series. (Search this Web log to locate the installments.) Software looks for a CRC mismatch or a date stamp and the software sucks in the new item or what is called the delta. The cloud service at that point becomes more important than the individual source of the data. The cloud service must become the definitive data source because it can “look” at different field entries and resolve conflicts. At scale, little data aggregators begin to lose importance.
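
The CRC and date stamp check mentioned above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how any particular cloud service works: the names StoredItem and find_delta, the record layout, and the sample keys are all hypothetical. Each stored item also keeps a source label, which is the sort of provenance metadata a “That’s from my database” claim would need.

```python
import zlib
from dataclasses import dataclass


@dataclass
class StoredItem:
    checksum: int    # CRC-32 of the item's content
    date_stamp: str  # ISO 8601 date (YYYY-MM-DD) reported by the source
    source: str      # hypothetical provenance label: where the item came from
    content: str


def find_delta(store, incoming):
    """Update store in place and return the keys that were new or changed.

    incoming is an iterable of (key, content, date_stamp, source) tuples.
    An item counts as part of the delta when it is unseen, when its CRC-32
    differs from the stored one, or when its date stamp is newer.
    """
    changed = []
    for key, content, date_stamp, source in incoming:
        checksum = zlib.crc32(content.encode("utf-8"))
        old = store.get(key)
        # ISO 8601 dates in the same format compare correctly as strings.
        if old is None or old.checksum != checksum or old.date_stamp < date_stamp:
            store[key] = StoredItem(checksum, date_stamp, source, content)
            changed.append(key)
    return changed


store = {}
first_crawl = [
    ("flight-101", "BWI 08:15", "2009-03-11", "airline-site"),
    ("condo-7", "2 BR, Baltimore", "2009-03-10", "listing-site"),
]
second_crawl = [
    ("flight-101", "BWI 09:40", "2009-03-12", "airline-site"),
    ("condo-7", "2 BR, Baltimore", "2009-03-10", "listing-site"),
]
first_delta = find_delta(store, first_crawl)    # both items are new
second_delta = find_delta(store, second_crawl)  # only the flight changed
```

On the second pass only the changed flight record is pulled in; the unchanged condo listing is skipped. A real service would also have to resolve conflicts between sources, which this sketch does not attempt.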

I think that more conflicts between and among the parties to composite applications are likely. Court rulings are necessary, but I think that the pace of composite applications, repurposing of content by smart software, and developers who are solving problems will move more quickly than courts, publishers, or aggregators like me can understand. Once the data are in digital form and crawlable, the data are probably gone. Getting data back that are in the wild is in my experience a tough job. The confusion, the copyright nets, and the emergence of smart software are going to interact in ways that will have some interesting data implications.

Stephen Arnold, March 12, 2009



Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.