This project was kicked off by an informal Discord chat earlier today (February 3, 2024).

Several large-scale free document repositories and projects have grown up over the past 10-20 years. Examples include Wikimedia Commons, The Wikipedia Library, Internet Archive, HathiTrust, Biodiversity Heritage Library and Library of Congress. Several commercial efforts such as Google Books and Newspapers.com also provide some level of no-cost (but possibly license-encumbered) access to a variety of scanned materials. However, these efforts have only scratched the surface of the global stock of printed materials sequestered away: some professionally indexed and stored in climate-controlled libraries, others randomly stashed away in basements and garages. In addition, as copyrights expire and the public domain grows every year, an ever-expanding collection of knowledge becomes available for legal uploading without restrictive licensing or paywalls.

Getting these materials digitized and made available online is one of the big tasks humanity faces over the next several decades. At some point, even the most dedicated archivists will no longer be able to justify the cost of keeping all this paper around, and it will be discarded. If it has not been digitized, the information it contains will be permanently lost to humanity.

Wikimedia New York City ("the chapter") rents space in Prime Produce. It is proposed that the chapter purchase a high-quality book scanner, install it at Prime Produce, and set it up as a public amenity for people to use to scan books. A requirement of using the scanner will be that the scanned materials be uploaded to a free archive such as Wikimedia Commons or Internet Archive; it must therefore be ensured that scanned material is suitable for such an upload, i.e., in the public domain or under an otherwise free license. The chapter will hold scan-a-thons in which the public will be invited to bring materials to be scanned and receive help with scanning and uploading.

The initial feedback from the chapter is positive, and suggests that funding would probably be available for this. Initial feedback from Prime Produce is also positive, this being an idea that other resident organizations have previously thought about doing. At this point, nothing has been decided other than a general but informal agreement that this seems like a good idea. The purpose of this page is to flesh out the technical details of what makes sense to buy and how much it costs, and to firm up commitments.

Hardware

The two basic hardware routes seem to be a dedicated book scanner or a camera on a copy stand. Some things to consider:

Computer support. It needs to be usable with (at a minimum) both Windows and Macintosh. Linux support would be nice to have. On the other hand, if the system comes with an integrated computer that functions in a stand-alone fashion, that's a non-issue. But it still needs to be able to export the scanned files in a non-proprietary format.

Simplicity of use. Although we anticipate providing supervision and training, this will ultimately be used by many people with a range of technical skills. Something that requires extensive training won't be as useful as something which is simple and intuitive to use.

Ruggedness. As with any piece of shared infrastructure, it will get used by multiple people. Something which is fragile or finicky won't last long.

Cost. Unknown. A few minutes of clicking around found a ScanSnap SV600 for $575, but there's much more research to be done.

Book-friendliness. Some bound material is easy to lay flat; a spiral-bound book, for example. Other material may be more difficult to lay flat due to the way it's bound. For rare/old books, forcing the book into a flat position can damage the book. High-end book digitizers only open the book part way and use different cameras (or camera positions) to photograph the left- and right-hand pages. Some devices have integrated post-processing software which can take an image of a curved page and algorithmically remove the distortion. The ideal situation would be to get a device which can handle rare/old books, but this may be economically infeasible for an initial implementation.

Internet Archive: How the IA Scans material

Scanning Services

Digitizing Print Collections with the Internet Archive: Open and free online discovery and access, long-term storage and engineering file management and unlimited downloads.

Tweet (or post) re: scanning

At the Internet Archive, this is how we digitize a book.

Blog post about the above: Meet Eliza Zhang, Book Scanner and Viral Video Star

A (re)Introduction to Book Digitization at the Internet Archive

49-minute video: A (re)Introduction to Book Digitization at the Internet Archive

From this 2021 webinar: A (re)Introduction to Book Digitization at the Internet Archive

Equipment at the Internet Archive: Table Top Scribe System

Description, specifications, and documentation for the IA's Scribe Station--what the IA uses to scan materials: Table Top Scribe System

Center for Jewish History

I visited the Center for Jewish History on 16th Street and saw first hand their archives and their scanning equipment, but can't find detailed info. I will check in with them and find out details --CmdrDan (talk) 22:21, 21 February 2024 (UTC)

Book2net

Book2net.net: A manufacturer of some serious-looking hardware.

"book2net is your reliable partner for all aspects of cultural heritage digitization."

Book2net Case Studies:

Integrating university knowledge

The proposal at Special:Permalink/1211465891#Proposal for Integrating University Knowledge into Wikipedia, while unlikely to go anywhere, may be of interest. RoySmith (talk) 20:09, 2 March 2024 (UTC)