August, 2007

The Determinator: Behind the Scenes at the Stanford…

Mary Minow: What was the impetus to put together the Stanford Copyright Renewal Database?

Mimi Calter: The project grew out of a conversation between Michael Keller, Stanford’s University Librarian, and Lawrence Lessig. A student of Professor Lessig’s had composed an interrogatory Q&A for determining the copyright status of any work. This led to a discussion of the possibility of automating the determination of the copyright status of a work, and of the necessary inputs to such a system. The 1923-1963 renewal data is one obvious input required for that system, and members of the Stanford University Libraries staff very quickly learned about the work of Project Gutenberg to scan the relevant Catalog of Copyright Entries, and about the early version of a copyright renewal database compiled by Michael Lesk. (I did not get involved in the project until somewhat later.) It was decided to pursue the project that Professor Lesk had started, with an eye to the framework suggested by Lawrence Lessig. We still hope to see the copyright renewal data integrated into a larger tool for copyright status analysis, and are having conversations with possible partners.

Minow: Who actually put “The Determinator” database together? Is that its official name?

Calter: We call it “The Determinator” in-house, but the official name is the Stanford Copyright Renewal Database. I was the project coordinator, and our Chief Information Architect, Jerry Persons, as well as several members of our wonderful Academic Computing team, worked quite hard on this.

Minow: Can you give me an overview of your process for compiling the database?

Calter: Renewals for books originally registered between 1923 and 1963 should have taken place between 1950 and 1992. The Copyright Office moved to electronic records in 1978, which meant that we had to deal with two broad groups of records: the paper records from before 1978, and the electronic records that came after. Our mission was to have all of the data fielded, and searchable in a single database.
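The date range Calter gives follows from the renewal rules then in force. As a rough sketch, assuming a 28-year first term with the renewal filed during the final year of that term (an assumption about the law, not something stated in the interview), the expected renewal years for a given registration year can be estimated as below. The function names `renewal_window` and `in_database_scope` are hypothetical.

```python
FIRST_TERM_YEARS = 28  # assumed length of the initial copyright term


def renewal_window(registration_year):
    """Return (earliest, latest) calendar year in which a renewal would be
    expected, assuming it was filed in the final year of the first term."""
    expiry = registration_year + FIRST_TERM_YEARS
    return (expiry - 1, expiry)


def in_database_scope(registration_year):
    """True if the year falls in the 1923-1963 range discussed in the interview."""
    return 1923 <= registration_year <= 1963


for year in (1923, 1950, 1963):
    print(year, renewal_window(year), in_database_scope(year))
```

This simple year arithmetic yields 1950-1951 for a 1923 registration and 1990-1991 for a 1963 registration. The slightly later upper bound of 1992 quoted in the interview is not modeled here.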

For the 1950-1977 records, we started with the Project Gutenberg transcriptions of Class A renewals, which include books, pamphlets, and articles in serials. These records were the most challenging, as the data was completely unfielded, and we were essentially starting from scratch. Even worse, the record format used by the Copyright Office in the print books changed several times during those years. For the 1978-1992 records, we used records extracted from the Copyright Office’s online database, which had been collected by a member of the Project Gutenberg team. This data was largely fielded, and we only had to work on formatting and on breaking author data out of the title field.

For each type of data, we developed schemas for extracting the appropriate fields, and then worked with an outside firm, Innodata Isogen, to tag all of the records. Some of the parsing was outsourced to the same company.

We’ve also done testing of the database. In our first round, we pulled 500 titles from the library catalog, so we could have the actual books in hand. We searched them manually in the Catalog of Copyright Entries (CCE), published by the Copyright Office, and also sent a subset of those to the Copyright Office to be searched (at $100 plus $150 per hour). We then repeated the searches in the Determinator. Overall, we’re very happy with the accuracy of the database, but we did find some unusual problems. For example, there is a book in our catalog titled Memoirs of a Spy that is listed as Memories of a Spy in both the Determinator and the CCE. That’s not a problem we can fix, as the problem is with the Copyright Office record.

The testing did reveal a few small problems with the database that are being cleaned up now. We’ll be doing a second round of testing once that is complete, and will make those results public.
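The Memoirs/Memories discrepancy described above is the kind of near-miss that an exact-match title search would not find. The following is a minimal sketch of a more tolerant lookup using Python's standard `difflib` module. The helper `find_similar_titles`, the 0.85 cutoff, and the sample list (apart from the one title mentioned in the interview) are illustrative assumptions, not part of the Stanford database.

```python
from difflib import get_close_matches


def normalize(title):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def find_similar_titles(query, candidates, cutoff=0.85):
    """Return candidate titles whose similarity to the query is at least cutoff."""
    lookup = {normalize(c): c for c in candidates}
    hits = get_close_matches(normalize(query), list(lookup), n=5, cutoff=cutoff)
    return [lookup[h] for h in hits]


# Hypothetical sample of renewal-record titles.
renewal_titles = ["Memories of a Spy", "A Sewing Machine Manual", "Reports of Cases"]
print(find_similar_titles("Memoirs of a Spy", renewal_titles))
```

A search of this kind trades precision for recall: a lower cutoff surfaces more candidate records, each of which still needs a human check against the book in hand.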

Minow: How was the project funded?

Calter: We had a grant from the William and Flora Hewlett Foundation. In addition, the Stanford Library contributed staff time and in-house resources. We’re working on the final report for the Foundation right now, and it will be available online when it is complete.

Minow: Who are the target users of the database?

Calter: Frankly, we had a bit of a selfish motivation here. We are very interested in digitizing as much of the material in our library as we legally can. Looking at the copyright status of 1923-1963 works is an important part of this. We expect that the primary users of the database will be libraries and groups like ourselves that are involved in digitization projects, although I’m certain there will be other uses found!

Minow: Will this database be helpful to libraries, archives and museums who are digitizing “orphan works”? That is, do you think it will help them show “due diligence” when searching for copyright ownership?

Calter: Studies show that fewer than 15% of items eligible for renewal were in fact renewed.¹ Our work on this database has uncovered only about 280,000 renewal records, and a surprising number of these are for things like court reporters and Singer sewing machine manuals, so it’s clear to us that a very large portion of the books published in this period are now in the public domain. Nevertheless, we know that due diligence is important when dealing with orphan works, and we think our database can be very helpful in that regard.

Minow: What were the biggest challenges in the project?

Calter: By far the biggest challenge to this point in working with the Copyright Office records has been parsing out author names. Even in the electronic records that it produced after 1978, the Copyright Office included the author name as part of the title field. Extracting that information in order to allow searching has required significant effort.
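To illustrate the kind of extraction Calter describes, the sketch below splits a combined title field into a title and an author. The record layout (a title followed by a "By ..." statement), the placeholder names, and the function `split_title_author` are assumptions made for this example. The actual Copyright Office formats varied and are not documented in this interview.

```python
import re

# Assumed layouts: "<title>. By <author>." or "<title> / by <author>"
_BY_PATTERN = re.compile(
    r"^(?P<title>.+?)\s*(?:\.|/)\s*by\s+(?P<author>.+?)\.?$",
    re.IGNORECASE,
)


def split_title_author(field):
    """Return (title, author); author is None when no 'by' statement is found."""
    text = field.strip()
    match = _BY_PATTERN.match(text)
    if match is None:
        return (text, None)
    return (match.group("title").strip(), match.group("author").strip())


# Placeholder records, not real renewal data.
for record in ("Memories of a spy. By Jane Doe.", "Memories of a spy / by Jane Doe"):
    print(split_title_author(record))
```

A single pattern like this would not be enough in practice: it drops a trailing period from an author's final initial and ignores editors, translators, and multiple authors, which hints at why the real work took significant effort.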

That said, bigger challenges remain for those interested in using the data. We have only extracted the names; we have not yet attempted to apply authority control, and that makes matching a challenge. And since records from this time period have no ISBNs, there’s no easy way to tie the copyright records to particular books. We would like to see the renewal records in our database matched against catalog records, so that users can even more easily determine the status of a particular work.
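Without authority control, the same person can appear under several spellings and word orders. One lightweight workaround is to reduce each name to a normalized key before comparing. The sketch below shows that idea only; it is not the method used by the Stanford project, and `name_key` is a hypothetical helper.

```python
import re
import unicodedata


def name_key(name):
    """Reduce a personal name to an order-insensitive, punctuation-free key."""
    # Strip accents so accented and unaccented spellings compare equal.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep letter runs only, lowercase them, and sort the tokens so that
    # "Last, First" and "First Last" produce the same key.
    tokens = re.findall(r"[a-z]+", plain.lower())
    return " ".join(sorted(tokens))


# Placeholder names, not real renewal data.
print(name_key("Doe, Jane Q.") == name_key("Jane Q. Doe"))
```

A key like this cannot tell that "J. Doe" and "Jane Doe" are the same person, or that two different people share a name, which is the gap that true authority control is meant to close.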

Minow: What feedback have you gotten from users?

Calter: Very positive. Lots of folks want to know when we’ll be creating similar tools for other classes of works, but that’s not something we’re pursuing right now. But I have been contacted by a few organizations that are incorporating the database into their standard search process. I recently received a nice thank you note from Jack Herrick, a Stanford alum and founder of wikiHow, which has done just that: www.wikihow.com/Import-Old-Public-Domain-Books-to-wikiHow

Minow: What do you wish you could add to the database?

Calter: I certainly think it would be beneficial to expand the database to other classes of works, but that’s not something we have funding or staff to manage. However, I’m actually more interested in seeing our database become part of a tool that addresses a wide variety of copyright concerns and questions.

*Mimi Calter is the Executive Assistant to the University Librarian at Stanford University.

¹ For example, a 1961 Copyright Office study found that fewer than 15% of all registered copyrights were renewed. For books, the figure was even lower: 7%. See Barbara Ringer, “Study No. 31: Renewal of Copyright” (1960), reprinted in Library of Congress Copyright Office. Copyright law revision: Studies prepared for the Subcommittee on Patents, Trademarks, and Copyrights of the Committee on the Judiciary, United States Senate, Eighty-sixth Congress, first [-second] session. (Washington: U. S. Govt. Print. Off., 1961), p. 220. Peter Hirtle, Copyright Term and the Public Domain in the United States, 1 January 2007.

Published By Stanford Copyright and Fair Use Center

The Determinator: Behind the Scenes at the Stanford Copyright Renewal Database An Interview with Mimi Calter, Stanford University Libraries