Tuesday, January 24, 2006

Brewster Kahle's Open Library Project pushes imaging envelope -- an


The Chronicle of Higher Education -- Information Technology
From the issue dated January 27, 2006
Section: Information Technology / Volume 52, Issue 21, Page A34

Scribes of the Digital Era

A library-scanning project brings public-domain materials online and offers
an alternative to Google's model

San Francisco

Brewster Kahle is mobilizing an army of Internet-era scribes who are fastidiously copying books page by page. Unlike the monks who slowly copied ancient tomes by hand, though, these scribes make digital reproductions, and they zip through hundreds of pages each hour.

Mr. Kahle, director of the nonprofit Internet Archive, is guiding a mass-digitization project called the Open Content Alliance, which was announced in October and is rapidly gaining partners. The alliance plans to take carefully selected collections of out-of-copyright books from libraries around the world and turn them into e-books that will be available free to scholars and anyone else who wants to view them, print them, or even download them to their own computers.

The project has the backing of Yahoo and Microsoft, and many see it primarily as a response to the controversial book-scanning project led by Google (http://print.google.com/googleprint/library.html). Google is digitizing millions of books from five major libraries, and it says it hopes to scan nearly every book held by one of those partners, the University of Michigan at Ann Arbor. Because many of the library's holdings are still protected by copyright, publishers have challenged the legality of Google's project.

Although the Open Content Alliance has pledged not to scan copyrighted works without permission, thereby avoiding that thorny legal issue, the project could do as much to shake up the library world as Google's effort has. The alliance's undertaking is more than just a mass-scanning project -- it is a new model for cooperation among libraries hoping to build their own digital archives of public-domain materials. Individual libraries have long worked on digitization projects on their own, but the new alliance promises to pool the digital content created by academic libraries. "It's a book-scanning initiative and a vision for an open library," says Mr. Kahle.

Indeed, the alliance involves far more players than Google's project: So far 34 libraries, most of them at universities, have agreed to join and contribute material. And the Open Content Alliance will make its digital books more freely available, putting them online in a way that anyone, even companies other than Yahoo and Microsoft, can index and search the files, or even download the books for their own use.

One key to achieving the project's goal of scanning hundreds of thousands of library books is to keep the price of scanning remarkably cheap -- with a charge to participating libraries of about 10 cents per page -- by scanning the volumes quickly and accurately. To do that, the project makes use of a specialized document scanner developed by the Internet Archive and called, appropriately, the Scribe.

The copying has already begun. In a building in the warehouse district here, employees of the Internet Archive who operate the book-scanning machines are working through an initial batch of books selected from the University of California system. Two more scanning machines are in place at the University of Toronto, where they run 15 hours a day. The project's leaders hope to have scanners in more libraries by the end of the year. Each machine costs tens of thousands of dollars, says Mr. Kahle.

One challenge for libraries, of course, is finding the money to scan large quantities of books, even at 10 cents per page. Daniel Greenstein, executive director of the California Digital Library, says he hopes that libraries can contribute to the project by shifting some of the money they now spend on digital-book subscriptions to scanning books and adding them to the shared online collection. Several companies sell access to e-book collections, such as the Chadwyck-Healey Literature Collections, from the ProQuest Information and Learning Company.

"We're going to spend the money anyway," Mr. Greenstein says. "Let's spend it more wisely." The alliance is also trying to entice companies and others to donate money to the effort, touting the benefits of offering the world's public-domain literature free to all online. "It will be remembered as one of the great things that humans have ever done -- up there with the library of Alexandria, Gutenberg press, and the man on the moon," Mr. Kahle said at a kickoff event for the project in the fall.

Difficult Work

At the Internet Archive offices one afternoon, Mr. Kahle demonstrates his book-scanning machine.

The device, about the size of a photo booth, is draped in heavy black cloth, with a V-shaped stand in the middle to hold a book open. Two high-resolution cameras are positioned at the top of the machine, one aimed at each page of the book's spread. The book is pressed open by a V-shaped piece of glass, which the machine's operator can raise or lower with a foot pedal. After each pair of pages is scanned, the operator raises the glass, turns the page by hand, and then lowers the glass back in place. A computer monitor at the back of the machine shows the cameras' views of the book pages, and the operator can make sure the text is lined up in the cameras' sights.

Working the machine is not easy. Putting the right amount of pressure on the foot pedal, so the glass lifts just high enough to turn pages, can be difficult at first. Mark Johnson, lead engineer for the Internet Archive, says the employees who spend their days at the machines get into a rhythm that lets them scan about 500 pages per hour. "They're amazing. If you watch the people scanning, it's like an athletic sport."

Once the book pages are scanned, a computer attached to the device automatically creates digital files that can be displayed and searched. The high-resolution images include any illustrations and even margin notes that are contained in the original volume. The machine then sends those digital files to a server, where they are available on a Web site run by the Internet Archive (http://www.openlibrary.org). Copies of the files will also be sent to the library that lent the book for scanning.

Mr. Kahle says that the books will be given new life in digital form, and that they can be displayed in a number of ways. The archive has developed an on-screen interface that makes it easy to read and search each book. But online users can also request a printed and bound reproduction of a book by paying a small fee to a company that does the printing and binding. Soon it may be possible to print the books in Braille or in large print. They could even be downloaded to PDAs, cellphones, or other portable devices for reading on the go.

Rick Prelinger, president of the Internet Archive's Board of Directors, says that even though the materials scanned by the Open Content Alliance will be free to view or download online, some companies will find ways to make money with the digital files. "People will pay for enhanced services" such as printing, he says. "I think the print-on-demand business is going to do very well."

Let the Scanning Begin

The University of Toronto's libraries have been working with Mr. Kahle since before the Open Content Alliance formed, and have scanned more books for the project than any other participants. On the second floor of one of the university's libraries, in a room that once housed a computer cluster, two of the scanning machines are in use seven days a week, staffed by employees hired by the Internet Archive.

Carole Moore, chief librarian at the university, says each machine scans about 7,500 pages per day. Several thousand books by Canadian authors have been scanned so far. The volumes were selected in coordination with six other Canadian university libraries, and the national Library and Archives Canada.
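The throughput figures quoted in the article agree with one another: about 500 pages per hour over a 15-hour day gives the 7,500 pages per day that Ms. Moore cites. The short sketch below works through that arithmetic and, assuming the charge is exactly 10 cents per page and the hourly rate is sustained for the whole day (both simplifications, not figures confirmed by the project), estimates what one machine's daily output would cost a library. The function names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted in the article.
PAGES_PER_HOUR = 500       # scanning rate cited by the Internet Archive's lead engineer
HOURS_PER_DAY = 15         # daily running time of the Toronto machines
COST_PER_PAGE_CENTS = 10   # approximate charge per page (assumed exact here)


def pages_per_day(pages_per_hour: int, hours_per_day: int) -> int:
    """Pages one machine scans in a day at a sustained hourly rate."""
    return pages_per_hour * hours_per_day


def daily_cost_dollars(pages: int, cents_per_page: int) -> float:
    """Charge in dollars for scanning the given number of pages."""
    return pages * cents_per_page / 100


daily_pages = pages_per_day(PAGES_PER_HOUR, HOURS_PER_DAY)
print(daily_pages)                                           # 7500
print(daily_cost_dollars(daily_pages, COST_PER_PAGE_CENTS))  # 750.0
```

Under those assumptions, each machine would generate roughly $750 in scanning charges per day.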

Mr. Greenstein, of the California Digital Library, a project of the University of California system, says he hopes to eventually place scanners at the University of California system's two regional storage libraries -- warehouselike facilities that are closed to the public but whose books can be requested through interlibrary loan. Ideally, those storage libraries could routinely scan each book as it is first deposited, so that patrons could view the books online instantly rather than have to wait for a printed copy to be delivered. "We're looking at how much it would cost," says Mr. Greenstein.

Many of the libraries involved in the project have only recently joined and are still deciding what materials they will contribute. "Every library has some of those things that no one else has," says Shirley K. Baker, vice chancellor for information technology and dean of university libraries at Washington University in St. Louis, which recently joined the alliance. "We have probably a couple thousand books that are in the public domain that we could digitize and make publicly available."

Ms. Baker is also interested in digitizing films from the university's collection to add to the shared online library, including raw footage from Eyes on the Prize, a well-known documentary on the history of the civil-rights movement in the U.S. The book-scanning machines won't be necessary for that, of course, but the Internet Archive has experience digitizing and storing video and audio files as well, and the archive plans to collect a range of materials through the Open Content Alliance. "Within this calendar year, we hope to be contributing at a relatively modest rate, but ramping up over the long run," says Ms. Baker.

Hard-to-Capture Materials

José-Marie Griffiths, dean of the School of Information and Library Science at the University of North Carolina at Chapel Hill, says that her school has joined the project to experiment with how to better scan manuscripts and documents that are not in book form. "You can have whole documents, letters, notes written on fragments of paper," says Ms. Griffiths. "Much of it is handwritten" and therefore difficult for computers to translate into text form for searching, she says. "The actual scanning and creating the ability to search the content is much more challenging for nonprinted, nontypeset materials."

Librarians from Chapel Hill plan to take a few boxes of such materials to the Internet Archive soon, she says, to start trying to run them through the scanners.

Google's book-scanning project, meanwhile, is more restricted, and its leaders are far more secretive. Google officials have apparently developed a high-speed book scanner of their own, though they refuse to divulge details of how it works or say how fast it can scan books. Google also will not say how many books it has scanned so far from its partner libraries or even describe the types of books it has added. Such secrecy frustrates many librarians, who are accustomed to using collections that are carefully delineated. "It is, I think, important for people to know what they might be able to find," says Ms. Baker, of Washington University.

Mr. Greenstein says that he has met with Google officials, and that they seem more interested in grabbing a large quantity of materials than in carefully selecting certain collections of works. "None of them are interested in curation," he says, adding that their attitude is "the more of it, the better." Google is also less open in the way it presents its books. For those in its collection that are in the public domain, Google allows users to see the full text, but there is no way to download the data or easily print the whole book, features that are allowed by the Open Content Alliance.

When asked to respond to those criticisms, Google issued a statement comparing its scanning project to that of the Open Content Alliance: "We welcome efforts to make information accessible to the world. The OCA is focused on collecting out-of-copyright works which constitute a minority of the world's books -- a valuable minority, but certainly not complete." Google's plan to scan copyrighted works without permission from their publishers, while the most distinctive aspect of its project, is also the most controversial.

Google officials emphasize that only short snippets of copyrighted works will be shown to users. Still, members of the Association of American Publishers have filed a copyright-infringement lawsuit against Google in U.S. District Court, asking the court to prohibit Google from reproducing their works and to require Google to delete or destroy records already scanned.

Leaders of the Open Content Alliance say they will scan copyrighted books only if publishers grant permission first. But participants in the Open Content Alliance are also quick to credit Google with bringing more attention to book scanning. "We're just providing another model," says Robin Chandler, director of built content for the California Digital Library.

"Every generation of scholars looks at past events in a new way," she says, adding that bringing old books into an easily searchable digital format will help scholars revisit older works and better make comparisons with more recent texts. "The idea that you can analyze texts over the centuries is very exciting."
Copyright © 2006 by The Chronicle of Higher Education

The article above is copyrighted material, the use of which may not have been specifically authorized by the copyright owner. The material is made available in an effort to advance understanding of political, economic, democracy, First Amendment, technology, journalism, community and justice issues, etc. We believe this constitutes a 'fair use' as provided by Section 107 of U.S. Copyright Law. In accordance with Title 17 U.S.C. Chapter 1, Section 107, the material above is distributed without profit to those who have expressed a prior interest in receiving the included information for research and educational purposes. If you wish to use copyrighted material from this blog for purposes beyond fair use, you must obtain permission from the copyright owner.
