Rising Into the Public Domain: The Copyright Review Management System (CRMS) at the University of Michigan
Interview with John Wilkin, Associate University Librarian for Library Information Technology and Executive Director, HathiTrust and Principal Investigator for CRMS
Mary Minow: Where does CRMS fit into the scheme of other copyright tools, such as the Determinator?
John Wilkin: The Determinator is a good point of comparison for us. It serves as a resource for helping someone make a determination, and what we wanted to do is actually make determinations. The focus is on materials in our Collections, across the HathiTrust partnership. We are not so concerned about where a book comes from, because we think of [the corpus] as a “collective collection” … materials from across the board.
I think we did have, early on, perhaps a naive sense that we might be able to make those determinations without the materials being in front of us, digitally or in print. We quickly concluded, though, that the only way to do the work was to have those works in hand. And we chose to have them in hand, digitally. And the digital flow of materials drives the prioritization process.
Minow: When you say digitally in hand, it sounds like researchers are allowed to look at the text, the preface, etc.
Wilkin: That’s right. We have a strong authentication and authorization system. It’s tied into the Michigan CoSign system, but it also uses Shibboleth, so that gives us a lot of tools there. In this case, we use two-factor authentication for all reviewers. They have to authenticate [with a password], and they have to be, essentially, at their desk. They can’t take their identities home and start looking at materials that are still in copyright. So it’s very much justified by the work they’re doing.
Minow: Doesn’t Google make its own determinations of what’s in the Public Domain? Do they come up with different determinations? Is there duplicative work going on?
Wilkin: We’re doing the 1923-1963 work.
Minow: That is, a focus on books published between 1923 and 1963. Books published in the U.S. prior to 1923 are in the Public Domain. The Copyright Renewal Act of 1992 made renewal automatic for works published in 1964 and later, so their copyright terms were extended without a renewal filing.
Wilkin: Right. So far as we know, Google is not doing the 23-63 work. Both Google and HathiTrust do a layer of very automatic determinations. Ours is entirely automatic, based on elements in the MARC record. They have reviewers look at materials to do some [consultation] because occasionally the bibliographic information is not reliable. That’s the point at which we’ll look most similar, with some exceptions.
There are important areas where we deviate. We are opening up U.S. Federal Docs, post 1922. Google is considering that now, but they have been slow to do that. They’re considering what classes of materials they’ll open up. HathiTrust will say that U.S. government docs are, by and large, in the Public Domain.
Then we diverge. For example, we’re going to look at U.S. pre-1923 materials as in the Public Domain, and we’re going to look at users outside the U.S. differently for materials that were published outside the U.S. Does that make sense?
Minow: Help me out here.
Wilkin: For the user in the U.S., or really for anybody in the world, we deem U.S. works pre-1923 as being in the Public Domain. And for the user in the U.S., we also deem non-U.S. works pre-1923 as in the Public Domain. For users outside the U.S., we are fairly conservative with non-U.S. works. I think the date we’re using now is about 1870. It’s a rolling wall, and essentially a best guess: that date accounts for a young author who published something and then lived a long time. We use statistical probability, and we roll that wall forward every year.
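[Editor’s note: the access rules Wilkin describes can be approximated in a short sketch. This is illustrative only; the cutoff years, the rolling-wall value, and the rule structure are assumptions drawn from this interview, not HathiTrust’s actual rights logic, which relies on per-volume determinations.]

```python
def viewable_as_public_domain(pub_year, published_in_us, user_in_us,
                              rolling_wall=1870):
    """Approximate the access rules described in the interview.

    Illustrative only: the real HathiTrust rights determination is
    far more nuanced (per-volume review, federal documents, etc.).
    """
    if published_in_us:
        # U.S. works published before 1923 are treated as public
        # domain for users everywhere.
        return pub_year < 1923
    if user_in_us:
        # For U.S. users, non-U.S. works before 1923 are also
        # treated as public domain.
        return pub_year < 1923
    # For users outside the U.S. viewing non-U.S. works, apply the
    # conservative "rolling wall" (about 1870 at the time of this
    # interview), advanced each year.
    return pub_year < rolling_wall
```

For example, a U.S. book from 1900 would display for a reader in France, while a French book from 1900 would not, under the conservative rolling wall.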
Minow: How do you figure out if the work was published first outside the country?
Wilkin: We primarily use the bib record of the publication. If the place of publication is outside the U.S., we assume that it was [first published there]. Effectively we are conservative unless we get a good look at something and make an individual determination.
We ingested 700,000 volumes in one month, so that gives you a sense of the scale we’re working at. We’re never going to have the resources needed to sort items individually, deciding this one goes here and that one goes there.
Minow: You mentioned that you’re using the Determinator, but that’s only available for Class A books. Are most of your materials Class A books?
Wilkin: They’re all Class A books. The reviewers use the Determinator and other tools; they look at the book and they make an assessment. They look to see that there are no embedded rights problems in making those determinations.
Minow: Inserts – photos, stories, poems – you’d almost have to read every page.
Wilkin: Well, we look at acknowledgements, not the entire book. There are going to be some cases where the acknowledgements are not that adequate. We have an advertised takedown policy, and we’ve never been contacted about anything that is an insert.
Minow: It takes my breath away to look at that level.
Wilkin: The insert issue is of particular concern in Congressional materials, such as materials that are inserted into the record for hearings. We work with the assumption that these inserts are part of the public record and that they are provided or reproduced in that context.
Minow: In Section 108(h), the copyright law gives 20 years back to libraries and archives, even on the web, if the work is not subject to normal commercial exploitation. Here’s a chart I made, showing, for example, that libraries and archives may make and distribute copies of works published through 1934 this year, instead of only through 1922. The catch is that the works cannot be subject to an undefined “normal commercial exploitation.”
Wilkin: We’re not taking advantage of that at this point.
Minow: Another thought I had, after reading Melissa Levine’s article, is that many authors of older works retain their digital rights, because when they signed publisher agreements, digital rights were not yet contemplated. Are you taking advantage of that? [Opening Up Content in HathiTrust: Using HathiTrust Permissions Agreements to Make Authors’ Work Available, Research Library Issues, no. 269 (April 2010): Special Issue on Strategies for Opening Up Content]
Wilkin: We’re not. We’re just testing the waters, taking baby steps. We’re only dealing with works where the rights have reverted to the author and when the author or publisher knows they own the rights. As it turns out, we’ve had some fairly large lump permissions. For example, in at least one case where a journal died, the journal publisher gave us permission to open up the full run of the journal. As it turns out, a few organizations have opened up a large number of publications.
Melissa’s article is an early step for us. We haven’t gone out to seek permissions from authors, yet. But it’s most definitely something we want to do.
Minow: The University of Michigan is a player in the OCLC pilot project, the WorldCat Copyright Evidence Registry. Does that mean your determinations of copyright for the works you examine then feed into that Registry?
Wilkin: I think that effort is in limbo right now. We did set up a mechanism that we could share our determinations with them. The Registry was set up to allow institutions to identify records that need to be enhanced or annotated with information about URLs and rights, etc. In our distribution mechanism, there’s one record for every volume in the repository at this point.
We think of OCLC as a central switching point for bibliographic info, so it seemed like a natural fit for them to have a registry of copyright evidence. We were making data available to them, but in fact we now have 6 million volumes, each identified with our automatic or manual copyright determination, so that’s more than what OCLC would have, I guess, aspired to do.
In the CRMS process, that’s only been tens of thousands of volumes, but someone could start with our 6 million volumes and look for changes.
Minow: But it wouldn’t be open in the sense that someone could put their own data in, right?
Wilkin: Exactly, and the Copyright Evidence Registry was intended to be that.
Minow: Is there anything you’d like to add?
Wilkin: Well, for us, the question is “what next?” The easiest “what next” is expanding to other partners. Anne’s been busy. As we laid out in the grant, she is training staff at Indiana, Minnesota, and Wisconsin (she just finished Wisconsin), the three pilots, along with the Michigan staff. [Anne Karle-Zenith, Copyright Review Project Librarian] This winter she’ll probably incorporate staff at a California partner.
And as we bring more hands in, it puts more pressure on the training and reliability piece as more people are making determinations.
Minow: Do you see members of the public as becoming able to add notes or comments in the future?
Wilkin: We have a tagging application for bib records. Probably not a day passes when someone doesn’t say, “I think this is in the Public Domain” or ask, “is this in the public domain?” That’s what stimulates someone to look at it. So it is user driven now. We won’t take someone’s assertion as fact, but it provides a good starting point to do investigation.
Minow: Do you have plans to add other materials, besides “Class A” books?
Wilkin: In HathiTrust, we have much more than “Class A,” but the only ones we’re pushing into the workflow right now are “Class A.” So the question becomes: how would we go beyond “Class A”? How could we build a sustainable, cost-effective system? It’s probably going to happen piece by piece, right?
Minow: I’ve heard that the Copyright Office is working on a retrospective conversion of the copyright registration and renewal records for the rest of the material types, beyond “Class A” books. If they make the records available in bulk, as they did with “Class A,” then others can set up or build on databases like Stanford’s “Determinator.”
Wilkin: Did you know that between 55% and 60% of our materials have been found to be in the public domain?
The numbers you see out there say, like, only 15% are in copyright. Some assertions are pretty wild. There was some early work done by the Copyright Office, but the law was in flux at the time. It’s best to have something statistically sound. I’m guessing that between pre-CRMS and CRMS, we’ve gone through 100,000 titles, and those numbers have held. I think we have another 400,000 titles to deal with in that period. One question we have: how many titles ARE there in the 23-63 period? There’s just so much indeterminacy because of variation in cataloguing practice and ways of reporting things, and so on.
Minow: Are the other 40% ones that you’ve determined are in copyright or you just can’t figure them out?
Wilkin: I think early on it was about 30% in copyright and 10% in UND (undetermined or undeterminable). Anne found that as staff got more experienced, they were getting stuck on complicated problems, and we often found a lower yield of public domain determinations. So Anne encouraged staff to push things to UND rather than struggle for finality. So the number of UND has gone up, but the numbers in the Public Domain have stayed constant. That’s really a workflow strategy kind of thing.
It’s exciting to get those works opened up. The surprise has come in the titles. Because of the required renewal process, it’s stunning to see what was not renewed. The first time I encountered this was with my 13-year-old daughter, who was doing a book report on code breakers. We found really modern materials by living mathematicians. I thought, “oh, we’re in trouble.” Then, looking further, these were ones where renewal did not take place. Interesting to learn the behavioral piece …
But the numbers, the numbers are really very interesting, the 60/40 sort of thing.
Minow: And yet, going forward, this is not going to be the case, because now there’s no renewal required. It’s an anomaly, really, unless the law changes again in the other direction, which doesn’t seem likely.
Wilkin: That’s something for us to ponder as a society, as a culture, that these works are overwhelmingly not on the market. What’s happening is, without this effort, no one is able to take advantage of the information that’s there, or only in a limited way.
Another surprise involves the Committee on Institutional Cooperation, the CIC: the non-Michigan, non-Wisconsin CIC institutions don’t get back their in-copyright materials … by contract with Google. I think what we ought to say is they don’t get back those things that are putatively in copyright. With those numbers in mind, think about what we are not able to put online because it’s assumed to be in copyright, when we know that 60% or some large percent are in the public domain.
Minow: You mean, those institutions are not getting access to the full text of their own books?
Wilkin: They stay at Google, they’re embargoed. That may change with an amended agreement, but for now, Google doesn’t provide them back.
Minow: I thought those were called “library copies.”
Wilkin: It is important to call them “embargoed copies.” Jack Bernard, our Assistant General Counsel, has asked us to use the term “rising into the public domain” instead of “falling into the public domain.”
Minow: That’s a good title for this interview. Thanks so much for talking with us today.