Distributed Digital Library - Ideas

Would you contribute to the development of a digital library?

Yes - I could contribute ideas: 6 votes (25%)
Yes - I could contribute images: 6 votes (25%)
Yes - I could code: 3 votes (13%)
Yes - I could do OCR: 3 votes (13%)
Yes - I could do quality control: 3 votes (13%)
Yes - I could help administer and manage: 3 votes (13%)
No - I do not have the time but I think it is a good idea: 0 votes
No way - how can ordinary people do what Google does?: 0 votes

Total votes: 24

sanjayayogi
Posts: 19
Joined: 04 Mar 2014, 00:52

Distributed Digital Library - Ideas

Post by sanjayayogi » 21 Mar 2010, 17:50

Just as my books sitting in the corner have no utility if they are not being read, book scan images have no value if they are not shared and archived.

Needed:

1. Space to host images, websites, and databases (for example, CouchDB is easily replicated to other databases).

2. A uniform image naming system that retains important information about the scan.

Suggested sample:
(BT)BookTitle_(ALN)AuthorLastName_(AFN)AuthorFirstName_(ISBN10)ISBN10_(ISBN13)ISBN13_(PN)PageNumber_(PT)PageTotal_(PB)Publisher_(PBD)PublicationDate_(SID)IDofPersonScanning_(L)Language_(SYMD)DateScanned_F.Format

BT-Hackers:-Heroes-of-the-Computer-Revolution_ALN-Levy_AFN-Steven_ISBN10-0141000511_ISBN13-978-0141000510_PN-1_PT-464_PB-Penguin_PBD-2001-01-02_SID-123456789_L-en_SYMD-2010-03-21_F.tiff

BT Hackers Heroes of the Computer Revolution
ALN Levy
AFN Steven
ISBN10 0141000511
ISBN13 978-0141000510
PN 1
PT 464
PB Penguin
PBD 2001-01-02
SID 123456789
L en
SYMD 2010-03-21
F tiff

Title: Hackers: Heroes of the Computer Revolution
Paperback: 464 pages
Publisher: Penguin (Non-Classics); Updated edition (January 2, 2001)
Language: English
ISBN-10: 0141000511
ISBN-13: 978-0141000510

This naming convention can be easily split along the underscores and the output loaded into a database with a script.
It is also human readable.
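The splitting step could be sketched roughly as follows. This is only an illustration, assuming every field is written as TAG-value (so the publication date appears as PBD-2001-01-02) and that the name ends in _F.format. The function names and the `scans` table are hypothetical, not an agreed design.

```python
import sqlite3

# Example name in the suggested convention (assumes TAG-value for every field).
EXAMPLE = (
    "BT-Hackers:-Heroes-of-the-Computer-Revolution_ALN-Levy_AFN-Steven_"
    "ISBN10-0141000511_ISBN13-978-0141000510_PN-1_PT-464_PB-Penguin_"
    "PBD-2001-01-02_SID-123456789_L-en_SYMD-2010-03-21_F.tiff"
)


def parse_scan_name(filename):
    """Split a scan file name into a dict of {tag: value}."""
    # The file format is the last field, written as "_F.<format>".
    stem, sep, file_format = filename.rpartition("_F.")
    if not sep:
        raise ValueError("missing _F.<format> suffix: " + filename)
    fields = {}
    for part in stem.split("_"):
        # Split on the first hyphen only, so values may contain hyphens.
        tag, dash, value = part.partition("-")
        if not dash:
            raise ValueError("field without a TAG-value hyphen: " + part)
        fields[tag] = value
    fields["F"] = file_format
    return fields


def load_into_db(conn, filename):
    """Store one row per field in a simple (hypothetical) scans table."""
    fields = parse_scan_name(filename)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scans (filename TEXT, tag TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO scans VALUES (?, ?, ?)",
        [(filename, tag, value) for tag, value in fields.items()],
    )
    return fields


conn = sqlite3.connect(":memory:")
record = load_into_db(conn, EXAMPLE)
print(record["BT"], record["PN"], record["F"])
```

Note that a title containing an underscore would break the split, so titles would need to use hyphens for spaces, as in the example.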

Reasons for a standardized file naming convention:

If a group is to share images, for example for OCR, the work can be divided up:

One part of the group could scan, others could handle naming, another part could OCR, another group could do quality control, another could handle archiving, another replication and backup, and others access and sharing of either images or final output.

I have some space in an Ubuntu 9.04 VPS that could be used for testing purposes.

I have root access and can install any software needed for image processing or OCR (tesseract, ocrad).

We will need people to generate and contribute high-quality images with a naming format that is robust, uniform, human-readable, and machine-processable.

Is anybody interested in collaborating?

Please post comments here if interested in contributing to the building of a distributed digital library.

daniel_reetz
Posts: 2785
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Distributed Digital Library - Ideas

Post by daniel_reetz » 21 Mar 2010, 18:00

I think a lot of people would be interested, myself included, but I have two questions from the get-go.

1. Might you be re-inventing the wheel, with library-oriented systems like ContentDM and others out there?

2. How will you control for copyrighted work? Any library containing copyrighted works will be shut down immediately and painfully, IMO.

Re: Distributed Digital Library - Ideas

Post by sanjayayogi » 21 Mar 2010, 18:33

In answer to the first response to my post:
Question:
1. Might you be re-inventing the wheel, with library-oriented systems like ContentDM and others out there?
Re-inventing the wheel is good. As far as I can tell, ContentDM is a paid product geared specifically towards organizations.

However, after taking a look at their website, I can see how a group looking at what they have done could immediately build upon some good ideas, as well as brainstorm improvements.

I am thinking of a globally distributed model with a distributed, replicated, and searchable index that could, on the fly, link to an image, OCR it, and display it, but not allow retrieval of a work in its entirety. For example, I can see that Amazon.com has images taken by people selling used books. By searching and retrieving links to images, the processing could be done on images dynamically; neither the images nor the OCR'ed material would be archived or stored centrally. However, the architecture for processing, scanning, OCR, quality control, organization, and administration would be developed and distributed with some top-down control so that every system could "talk" to another. URLs could be encrypted and changed dynamically, so that search engines and lawyers would have a difficult time downloading directly, and no one could download a work in its entirety.
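One way to picture links that change dynamically is a page link that carries an expiry time and a keyed signature, so that each link stops working after a few minutes. This is a minimal sketch that uses signing rather than encryption; the secret key, the URL layout, and the function names are all assumptions for illustration and not part of any existing system.

```python
import hashlib
import hmac
import time

# Hypothetical shared secret; a real deployment would keep this private.
SECRET_KEY = b"replace-with-a-real-secret"


def _signature(page_id, expires):
    """HMAC-SHA256 over the page id and its expiry timestamp."""
    message = "{}:{}".format(page_id, expires).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_page_url(base_url, page_id, ttl_seconds=300, now=None):
    """Build a link to one page image that is valid for ttl_seconds."""
    issued = int(time.time() if now is None else now)
    expires = issued + ttl_seconds
    token = _signature(page_id, expires)
    return "{}/page/{}?expires={}&token={}".format(
        base_url, page_id, expires, token
    )


def verify_page_token(page_id, expires, token, now=None):
    """Accept the link only if it has not expired and the signature matches."""
    current = int(time.time() if now is None else now)
    if current > int(expires):
        return False
    return hmac.compare_digest(_signature(page_id, int(expires)), token)


url = sign_page_url("https://example.org", "hackers-p1", now=1000)
print(url)
```

Because each link names a single page and expires quickly, a collected list of links goes stale, although this alone does not stop someone from requesting pages one at a time.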
2. How will you control for copyrighted work? Any library containing copyrighted works will be shut down immediately and painfully, IMO.
Self-policing works, as on YouTube and Wikipedia: both agree to remove works that content providers object to. Start by working only with out-of-copyright material, or self-published material whose authors agree to allow access during the testing period.

Build an advertising-based business model, use the money to pay for legal costs, and set up a fund to pay copyright holders directly. Start negotiating with copyright holders directly.

The initial group needs to be ethical and focused on the issues, with a governing group to police the people and the processing. Not anarchy as a governing body, but ethical and law-respecting, understanding that the very nature of this kind of work is forging new territory and is fundamentally important to both the preservation and advancement of culture.

Read:

http://www.nybooks.com/articles/23683
Publishing: The Revolutionary Future

The transition within the book publishing industry from physical inventory stored in a warehouse and trucked to retailers to digital files stored in cyberspace and delivered almost anywhere on earth as quickly and cheaply as e-mail is now underway and irreversible. This historic shift will radically transform worldwide book publishing, the cultures it affects and on which it depends.
Questions:

1. Why not a world wide grassroots effort, parallel to the efforts of Google, citizen based work contribution, and direct payback to copyright holders?

2. Why should big companies hold all the cards and money?

3. Individual effort always, but coordinated with organization, rules, and self-policing, Wikipedia-style.


Find a group of lawyers now to protect intellectual property as it is created and to advise on copyright protection issues.
Last edited by Anonymous on 22 Mar 2010, 17:07, edited 2 times in total.

Re: Distributed Digital Library - Ideas

Post by daniel_reetz » 21 Mar 2010, 18:44

I'll write back substantially in about a week, but in the meantime I highly recommend you check out the essays of Eric Hellman. He has written a lot on weird new distributed library models and thought some interesting thoughts. I'm very intrigued by these ideas and, in a way, see some of these kinds of outcomes as inevitable -- but at this point, it's definitely worth checking out some thinkers like Mr. Hellman if you haven't already.

Re: Distributed Digital Library - Ideas

Post by sanjayayogi » 21 Mar 2010, 19:03

External Links to Distributed Digital Library ideas:

Eric Hellman
http://go-to-hellman.blogspot.com/

Do It Yourself Book Scanning
http://www.bitsbook.com/2009/10/do-it-y ... -scanning/

http://www.themillions.com/2009/10/brin ... -home.html

Re: Distributed Digital Library - Ideas

Post by sanjayayogi » 21 Mar 2010, 20:33

As Thomas Jefferson wrote in 1791:
The lost cannot be recovered; but let us save what remains: not by vaults and locks which fence them from the public eye and use, in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.

(Julian P. Boyd, ‘These Precious Monuments of…Our History,’ pp.175-6)

http://www.archives.gov/publications/pr ... nhprc.html

I will argue that Thomas Jefferson had the public's best interest at heart with this statement. His words exist today because they were preserved. By limiting access to digital works, is it possible that modern copyright law is creating barriers to culture ("fence them from the public eye and use")? Just as the right to bear arms allows a citizen to protect body and home from enemies, domestic and foreign, possibly it is the citizen's duty to create "such a multiplication of copies" to fight against this type of control and restriction of access to culture. In the past, before books, stories were maintained in the collective memory by oral repetition, under the guidance of priests whose duty it was to make sure that the stories and histories were not lost.

I live in a third world country that has no Borders bookstores, in a city where the library has virtually no books, and of those, few in English. It is virtually impossible for the average person to order a book from Amazon when the average salary in the country is $2000 per year. The digitization of paper, and the subsequent replication and distribution of such works, would certainly seem to be in the spirit of what Thomas Jefferson was speaking of.
Last edited by Anonymous on 22 Mar 2010, 17:08, edited 1 time in total.

Re: Distributed Digital Library - Ideas

Post by sanjayayogi » 21 Mar 2010, 22:05

http://www.teleread.org/2009/09/22/opin ... ain-books/

Quote from the article:
How to figure out what’s in copyright? Not too hard for the most part. Just use a database of acceptable public domain books. Simple for books before the 1920s, but can get a bit difficult for books post-1920s where full lists of books that have fallen out of copyright are not available without research. This would mostly apply to books that weren’t renewed by the author/publisher/estate or didn’t have a copyright notice printed during the time media was required to have the notice to stay in copyright. Creative Commons, I’m sure, also causes a bit of strain in finding out what’s acceptable.

As for checking of content, have a simple solution that many teachers use today to combat plagiarism — file submitted, file compared with all available resources to find out how much of a paper is verbatim from another source. A legitimate copy should be as close to 100% as possible.
My comment:
Have a special group to enforce and control copyright. They would not need to be lawyers, but would need to be educated as to what falls under copyright and what is legitimately available under fair use. These issues will vary country to country.
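The comparison step described in the quoted article (how much of a submitted file is verbatim from a known source) could be sketched like this. It is a rough illustration using Python's standard `difflib`, not a real plagiarism checker; the function names and the idea of normalizing whitespace and case first are assumptions.

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace and lowercase so OCR layout differences matter less."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(submitted, reference):
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are identical."""
    matcher = difflib.SequenceMatcher(
        None, normalize(submitted), normalize(reference), autojunk=False
    )
    return matcher.ratio()


reference_text = "The lost cannot be recovered; but let us save what remains"
submitted_text = "the lost cannot  be recovered;\nbut let us save what remains"
print(similarity(submitted_text, reference_text))
```

A submission claimed to be a given public domain text would be expected to score close to 1.0 against a trusted copy of it. `SequenceMatcher` is slow on book-length inputs, so a real system would probably compare page by page or use hashing of text chunks instead.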
Last edited by Anonymous on 22 Mar 2010, 17:09, edited 1 time in total.

Re: Distributed Digital Library - Ideas

Post by sanjayayogi » 21 Mar 2010, 23:00

http://www.opencontentalliance.org/2009 ... itization/

Some information on the economics of book scanning by different organizations.

Brewster Kahle of the Internet Archive and the Wayback Machine

http://www.archive.org/details/texts

Re: Distributed Digital Library - Ideas

Post by sanjayayogi » 21 Mar 2010, 23:34

Start with a not-for-profit public service organization to serve people with disabilities:

serving organizations such as:
http://www.accessiblebookcollection.org/

and others in the 2nd and 3rd world.

Some ideas after reading this post by Eric Hellman:

http://go-to-hellman.blogspot.com/2009/ ... l-non.html


After the organization has matured technologically and ethically, expand in different directions: the third world, public education, and digitization of public paper records.

Meanwhile, continue developing a viable for-profit business model, starting with self-published and out-of-copyright material.
Attract lawyers and business people, initially as business development advisors; eventually establish a Board of Directors and build a solid business organization with good funding.

Re: Distributed Digital Library - Ideas

Post by sanjayayogi » 21 Mar 2010, 23:46

http://www.loc.gov/homepage/legal.html

Part of the Legal Disclaimer from the Library of Congress:

About Copyright and the Collections
Whenever possible, the Library of Congress provides factual information about copyright owners and related matters in the catalog records, finding aids and other texts that accompany collections. As a publicly supported institution, the Library generally does not own rights in its collections. Therefore, it does not charge permission fees for use of such material and generally does not grant or deny permission to publish or otherwise distribute material in its collections. Permission and possible fees may be required from the copyright owner independently of the Library. It is the researcher's obligation to determine and satisfy copyright or other use restrictions when publishing or otherwise distributing materials found in the Library's collections. Transmission or reproduction of protected items beyond that allowed by fair use requires the written permission of the copyright owners. Researchers must make their own assessments of rights in light of their intended use.

If you have any more information about an item you've seen on our website or if you are the copyright owner and believe our website has not properly attributed your work to you or has used it without permission, we want to hear from you. Please contact OGC@loc.gov with your contact information and a link to the relevant content.

This should be a good beginning.

1. Start only as a publicly supported institution.
2. Serve as a digitization research library. Let the permissions and fees fall to the researchers.
3. Be a clearinghouse and portal, much as Google serves as a portal to websites, webpages, code, books, and video.
4. Initially do not host material; only spider and search, link out to sources, and do not concentrate sources.
