To upload an item you need an Internet Archive “library card” (basically an account). Getting one is easy, but be aware that whatever email address you use will be listed as the originator of the document and will be publicly visible. If privacy is a concern, you may want to use a throwaway email address.
What can be uploaded:
All uploads made through this interface to the Archive are public and may be downloaded by anyone. However, both the BSW form and the Archive upload form have a checkbox to mark an upload as a test item. Test items are processed, OCR’ed, and made available, but they are not indexed (or searchable), and they are deleted after 30 days. Marking an upload as a test item is useful for trying out the process, or when you are uploading something only to have it OCR’ed and it shouldn’t become part of the permanent archive.
Uploading books using the Archive.org website:
One option is to upload a book using the Archive’s own interface. Click the “upload” button at the top of the Archive site. Its URL is http://www.archive.org/create/
The Archive recommends uploading a PDF file. However, a zip file containing the page images, with a name ending in _images.zip, can also be used. The Archive will accept a zip with JPEG, TIFF, or JPEG 2000 images. As part of the upload process you fill in metadata (author, title, date, and so on).
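As a rough illustration of the zip option, the sketch below packs a folder of page images into a file whose name ends in _images.zip. The function name, the folder layout, and the choice to add the pages in sorted filename order are assumptions made for this example, not something the Archive prescribes.

```python
import zipfile
from pathlib import Path

# Image extensions for the formats the Archive is described as accepting.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff", ".jp2"}


def build_images_zip(page_dir, identifier, out_dir="."):
    """Pack the page images in page_dir into <identifier>_images.zip.

    Hypothetical helper: the page naming and the sorted order are
    assumptions made for this example.
    """
    page_dir = Path(page_dir)
    zip_path = Path(out_dir) / f"{identifier}_images.zip"
    pages = sorted(
        p for p in page_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # ZIP_STORED: the images are already compressed, so skip recompression.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return zip_path
```

Naming the pages with zero-padded numbers (0001.jpg, 0002.jpg, and so on) keeps the sorted order the same as the page order.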
Uploading books using Book Scan Wizard:
Book Scan Wizard has a new feature that allows you to easily upload books to the Internet Archive. It can be run either interactively or as part of a batch process. The easiest way to start it is by using the Web Start version which can be accessed from this link: http://bookscanwizard.sourceforge.net/run
For an example of what a book created with the upload feature looks like, see this book. It was created using a “New Standard” book scanner, Book Scan Wizard, and a pair of Canon A480 cameras.
Here’s the process: in the Tools menu, choose “Prepare for Uploading…” and it will bring up the following screen:
Fill in the information for the book, and BSW will add to the script the metadata and the commands needed to create a zip file for uploading to the Archive.
The access key and secret key are a special ID and password used only for transfers. You get them from here. (Or press the “Lookup Keys” button, which will also bring you to the right page.)
The identifier becomes part of the URL for the book. For books on the Archive it is usually a combination of the title and the author, but it can be whatever you want. Letters, numbers, periods (.), hyphens (-), and underscores (_) are permitted in the identifier. All other fields can accept any characters. If needed, multiple lines can be used. For example, if there are multiple authors, you can add them by adding additional “creator” lines to the other metadata section.
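If you generate identifiers in a batch process, a quick check like the following can catch bad characters before uploading. This is only a sketch: it tests the character set described above, and any further rules the Archive may apply (length limits, for instance) are not covered. The function name is made up for the example.

```python
import re

# Letters, digits, periods, hyphens, and underscores only.
_IDENTIFIER_RE = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_identifier(identifier):
    """Return True if identifier uses only the permitted characters."""
    return _IDENTIFIER_RE.fullmatch(identifier) is not None
```

For example, is_valid_identifier("BigBookOfFairyTalesA") is true, while an identifier containing a space is rejected.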
Once you press Ok, the following configuration will be added automatically:
Metadata = identifier: BigBookOfFairyTalesA
Metadata = title: Big Book of Fairy Tales
Metadata = creator: Gustave Doré
Metadata = date: 1896
Metadata = subject: Childrens fairy tales
Metadata = description: Hardcover title is Favorite Fairy Tales
Metadata = keywords: childrens, fairy tales
CreateArchiveZip = archive.zip 10:1
# Uncomment the following line to send to the archive as part of this job.
#SaveToArchive = archive.zip xxxxxxxxxxxxxxxxx xxxxxxxxxxx
You can also create a zip file some other way, then use the command line option to send it to the Archive. To do that, zip up your images and include an XML file with the metadata. The images can be called whatever you like and will be saved in alphabetical order.
If you want to see an estimate of the size the zip file will be, you can right-click the CreateArchiveZip line. It will return this:
Then adjust the compression setting (the 10:1 in the example above) until you have a result you like.
How to Scan Books for the Archive:
While the Archive will accept any sort of scans, it is nice to provide the scans in a way that matches their own works. For that, it is best if the books meet the following criteria:
- The scans should have a resolution of 300-600 DPI.
- They should be full color images that closely resemble the actual book. The Internet Archive prefers color images because they have found people like reading the book with the original look intact.
- The pages should be deskewed and cropped.
- You should provide good metadata such as title, author, date, subject, keywords, etc.
Full color images often take a bit of tweaking to look really good. Ideally you want the left and right pages to be consistent with each other, and the colors to match the original. BSW can help with that.
Once you have corrected for perspective distortion and cropped the image, it is good to increase the contrast of the image a bit. Try right-clicking the image and choosing “autolevels.” This will give you a good starting point, but feel free to adjust the black and white levels until they appear accurate. The books done with the Internet Archive’s Scribe scanners use the equivalent of the following, which may be helpful as a starting point if you are starting with well-exposed images:
Levels = 12 94
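The post does not spell out what the two numbers mean. Assuming they are a black point and a white point given as percentages of the full brightness range (that reading is a guess, not something confirmed here), a levels adjustment of this kind can be sketched as a linear stretch of each channel value. This is an illustration of the general technique, not BSW's actual code.

```python
def apply_levels(value, black_pct, white_pct, max_value=255):
    """Stretch one channel value so black_pct maps to 0 and white_pct to max_value.

    Assumes the two settings are percentages of the full range; this is
    an illustrative guess at their meaning.
    """
    black = max_value * black_pct / 100.0
    white = max_value * white_pct / 100.0
    if white <= black:
        raise ValueError("white point must be above black point")
    scaled = (value - black) / (white - black) * max_value
    # Clamp: anything darker than the black point becomes 0,
    # anything brighter than the white point becomes max_value.
    return int(round(min(max(scaled, 0), max_value)))
```

Under that assumption, raising the black point darkens the text and lowering the white point brightens the paper, which is why the pair acts as a contrast control.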
Also, if the saturation doesn’t look right (for example, there is more color in the image than there was in the original), the Saturation command can be used. Or if the brightness is off, try adjusting it with the Brightness command. If your lighting isn’t quite consistent, it is sometimes necessary to adjust only the left or right images to make them match better. It’s pretty much trial and error until you get the results looking the way you like. The good thing is that once you figure out the settings that work for you, you will not need to adjust them much for other books.
It’s recommended that you use lossy compression with a ratio between 10:1 and 20:1 for the transfer. For example, at 10:1, a 10 MB uncompressed TIFF would become a .jp2 file of about 1 MB. BSW defaults to 10:1 compression, which works well for 300 DPI images. If you are providing scans closer to 600 DPI you will probably want to use a higher compression ratio to keep the transfer sizes manageable.
The Archive will accept a zip file containing JPEG, TIFF, and JP2 files. BSW uses JP2 because it gives the most control over the file size and somewhat better compression than JPEG.
While it is preferable to transfer color images, there may be times when you need to do the transfer as grayscale or black and white. Color images are quite large, and if you have a slow connection it might not be feasible to transfer them. Grayscale images are about a third the size of full color, and black and white are even smaller. Or if you can’t get a good color image, it may be best to save it as grayscale or black and white.
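The arithmetic behind the ratio is simple division, and it can be handy to run it both ways when picking a setting. The helper names below are made up for this sketch, and real results vary somewhat from page to page.

```python
def compressed_size_mb(uncompressed_mb, ratio):
    """Approximate size after lossy compression at ratio:1."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return uncompressed_mb / ratio


def ratio_for_target(uncompressed_mb, target_mb):
    """Compression ratio needed to bring a page down to target_mb."""
    if target_mb <= 0:
        raise ValueError("target_mb must be positive")
    return uncompressed_mb / target_mb


print(compressed_size_mb(10, 10))  # the 10 MB TIFF at 10:1 -> 1.0
print(compressed_size_mb(10, 20))  # the same page at 20:1 -> 0.5
```

A 600 DPI scan has four times the pixels of a 300 DPI scan of the same page, so ratio_for_target shows how much higher the ratio needs to be to land on the same page size.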
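The “about a third” figure follows from how many bits each pixel needs before compression. The sketch below works that out for raw images only. The mode names are invented for the example, and compressed sizes depend on the page content, so treat the numbers as a rough guide.

```python
# Bits stored per pixel before compression, by color mode.
BITS_PER_PIXEL = {"color": 24, "grayscale": 8, "bw": 1}


def raw_page_bytes(width_in, height_in, dpi, mode):
    """Uncompressed size in bytes of a page scanned at the given DPI."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


# A 6 x 9 inch page at 300 DPI in each mode.
for mode in ("color", "grayscale", "bw"):
    print(mode, raw_page_bytes(6, 9, 300, mode))
```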
How long will it take to process?
Depending on the compression you are using and the length of the book, the zip files will be around 200-800 MB, so it can take quite a while to transfer, depending on your connection.
After the file is uploaded, it sets in motion a series of steps that end with the book OCR’ed and converted to PDF, DjVu, Kindle, and other formats. The process takes anywhere from an hour or so to a few days, depending on how backed up the Archive is. You can check on the progress by logging into the Archive, choosing patron info, then choosing tasks that are not yet completed.
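To get a feel for “quite a while,” you can estimate the transfer time from the zip size and your upload speed. The 1 Mbit/s rate below is just a placeholder, and the calculation ignores protocol overhead and any slowdowns, so real transfers will take somewhat longer.

```python
def upload_hours(size_mb, upload_mbit_per_s):
    """Hours needed to send size_mb megabytes at a steady upload rate."""
    if upload_mbit_per_s <= 0:
        raise ValueError("upload rate must be positive")
    seconds = size_mb * 8 / upload_mbit_per_s  # 8 bits per byte
    return seconds / 3600


# The low and high ends of the size range at an assumed 1 Mbit/s upload.
for size in (200, 800):
    print(f"{size} MB at 1 Mbit/s: {upload_hours(size, 1):.1f} h")
```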
For further information:
For more information about uploading books to the archive you can check these links out:
General overview on uploading content:
Information on the _images.zip format:
http://raj.blog.archive.org/2011/02/24/ ... e-uploads/
Detailed information for Internet Archive partners. This has some good information on the Internet Archive process for scanning documents:
Information on the protocol Book Scan Wizard uses to communicate with the Archive: