Deals with Google to accelerate library digitization projects for Stanford, others


Michael Keller

In December, Stanford announced that it is one of five libraries cooperating with Google Inc. in a project to make millions of books from their collections available electronically to readers worldwide without charge. Along with Harvard University, the University of Michigan, the University of Oxford and the New York Public Library, Stanford will loan books to Google to be added to an electronic repository that could become the world's largest digital library.

"This is a great leap forward," said Michael A. Keller, university librarian and publisher of the Stanford University Press and the HighWire Press, Stanford's online co-publishing service for scholarly journals. For years, Stanford has been digitizing texts to make them more accessible and, as of January 2005, HighWire Press has helped publish more than 800,000 free full-text journal articles. But in the case of books, the university's efforts have been limited for technical and financial reasons, Keller said. "The Google arrangement catapults our effective digital output from the boutique scale to the truly industrial."

Google was founded in 1998 by Stanford doctoral students Larry Page and Sergey Brin. Stanford and Google have been discussing the project since very early in the idea's development, Keller said.

Both Stanford and Google are committed to respecting the rights of publishers and copyright holders of the books scanned, he said. Users will be able to browse the full text of works in the public domain. According to a Google press release, library books that are still in copyright will show up in Google search results, but users will see only bibliographic information and a few small text snippets unless publishers grant permission to show more.

The project's unveiling last month made national and world headlines and prompted speculation about the effect it might have on the future of libraries and publishing. Keller talked recently about what the project will mean for Stanford.

When will the actual scanning of books begin?

Lots of logistical issues remain to be worked out—things like transport, selection, physical control, sorting, etc. Google staff are working with Stanford University Libraries to develop detailed initial plans for the project.

How many of Stanford's more than 7.5 million books will be digitized?

Stanford has great hope of digitizing all its books eventually, so that each one can be made as accessible and addressable as possible. That said, the process will take quite a few years and we really do not know how Google's grand ambitions will play out over time. For that reason, we left the question open as to how many Stanford books Google would handle. The agreement with Google neither calls out specific collections nor specifies a minimum or maximum number of books to be digitized. At this point, we're not really worried about digitizing the last book.

How will they be selected?

That is still being determined. We most likely will begin with a few hundred thousand books that were not converted from the Dewey Decimal cataloging system when the libraries began using the Library of Congress classification system. That is intrinsically an older collection, so more of the books are likely to be in the public domain. We also will factor in such things as current location and condition, and will attempt to create as little disruption for our readers as possible.

How will the books be digitized?

Stanford will loan books from its library collections to Google, which will scan them at its Mountain View headquarters. Once digitized, the books will be returned to Stanford and re-shelved. We'll require that books be turned around fairly quickly. Google has promised not to damage the books, and we are taking the company at its word. We are strongly committed to getting as much work done as possible without disrupting the services we provide to our readers.

How do you expect this project to affect the library and its mission?

I have been committed to finding ways of digitizing information for years. The Google plan allows us to accelerate our digitization schemes by orders of magnitude. I intend that our eventual use on campus of the digitized book files will be a tremendous asset to the Stanford community.

Some people seem to believe the effect will be to make the physical books redundant—that we can simply discard the books and convert our book stacks to offices and labs. I disagree strongly. In fact, I believe having books in digital form will actually increase the use of the physical books. The digital files will be great for searching and targeting material for study, but many of us prefer the hard copy original in hand for careful reading. So, in my opinion, it is not an either-or proposition; the book provides a valuable reading experience different from the valuable searching/scanning/excerpting work with the digital version.

Now the downside of the Google plan from the library's operational point of view is the work at our end: selecting books, protecting fragile or damaged books before they go off for digitizing, re-sorting, etc. Physically handling hundreds of thousands or millions of books is labor intensive by its nature.

When will Stanford materials appear in Google?

As of this writing, no timeline has been set.

Can faculty ask that certain books be digitized?

We already have a process for targeting books for digitization, and such needs should be communicated through the librarians of specific collections.

The Google library project reportedly was nicknamed Project Ocean. How do you respond to the criticism that the Internet is already a sea of information that's difficult to intelligently navigate?

There is obviously a huge amount of information on the web. However, information is not quite a generic commodity: Having millions of pages available online is of no immediate value if the information you need is represented only in a book on a shelf to which you do not have access. Further, not all information is of equal validity, integrity, accuracy, legitimacy, etc. The Google book digitization project will unlock a very large amount of relatively high-quality information of known, traceable origin, with proper bibliographic references. And, of course, that information will be searchable through Google. So I would say its net effect is to improve the chances that web users can obtain legitimate representations of the information they seek, thus improving the value and maybe even decreasing the chaotic quality of the web.

I also expect that the existing tools for extracting information will improve with the large-scale availability of full-text material. The tools that are emerging now will give us the ability to extract ideas from online content, rather than simply perform keyword searches.