Plustek 3800

Moderator: peterZ

genarcher
Posts: 2
Joined: 21 Jul 2017, 00:29
E-book readers owned: kindle, sony
Number of books owned: 5000
Country: Australia

Plustek 3800

Post by genarcher »

I've got a project of scanning my library, about 4,000 books, and I'm using the PDF option to create files. I've set the parameters to Color 150 dpi, Gray 250 dpi and Black & White 330 dpi. I want as good an image as I can get while keeping the PDF file small, but with a high enough dpi for OCR to work as well as it can. I was using OmniPage 19, which worked well and gave me prompts when the spelling wasn't correct. Now I've got FineReader 12 and FineReader 14, which allow me to save as docx, epub and text but don't have the spellchecker on.
I can scan one page about every 12 seconds, which gives me 5 pages a minute and a book in about 60 to 90 minutes depending on how many pages it has. Is there a faster way of scanning for the best results, or are these formats okay for my purpose? Seeing that I'll get through about 4 to 5 books a day, I'm looking at a couple of years!
Alternatively, will I get better and faster results with the Plustek 4800?
Thanks for your assistance and best wishes from downunder.
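
For anyone wanting to redo the arithmetic above with their own numbers, here is a minimal Python sketch. The 12 seconds per page, 4,000 books and 4 to 5 books a day come from the post; the 300 and 450 page book lengths are an assumption, back-calculated from the 60 to 90 minute figure, and the function name is made up for illustration.

```python
import math

def scan_estimate(seconds_per_page, pages_per_book, books, books_per_day):
    """Return (pages per minute, minutes per book, days for the whole library)."""
    pages_per_minute = 60 / seconds_per_page
    minutes_per_book = pages_per_book / pages_per_minute
    days = math.ceil(books / books_per_day)
    return pages_per_minute, minutes_per_book, days

# Figures from the post; page counts are assumed, not stated by the poster.
for pages in (300, 450):
    ppm, minutes, days = scan_estimate(12, pages, 4000, 4.5)
    print(f"{pages} pages: {ppm:.0f} pages/min, {minutes:.0f} min per book, "
          f"{days} days (about {days / 365:.1f} years) for the library")
```

This only counts scanning time at the stated rate; it ignores OCR, proofreading and days off.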
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Plustek 3800

Post by BruceG »

Using a two-camera scanner such as David Landin's, made with PVC piping (details can be found on the site), will increase throughput to 1000 pages an hour. You scan 2 pages at a time and only have to turn the page, besides moving the book to and from the platen. He has made some videos on YouTube; just search for Easy Book Scanner. Post-processing will take longer, however.
I am currently making epub files for my ereader from someone else's pdf files, scanned on a flatbed book scanner. I go from OmniPage to docx, then use Jutoh to create the chapter index and the epub. You can go straight from OmniPage to epub but I prefer using Jutoh to prepare epub files.
I am also using a Czur E16 scanner, which does not use a platen. The book just sits on a table, so the distance from camera to page changes from close at the start of the book to further away by the end. I did a series of minute books that had lots of pasted-in letters, handwritten notes, etc. of different sizes, running either up and down or across the pages, which would have been difficult with a flatbed scanner. The Czur scanner uses the FineReader engine for OCR. When scanning in grayscale it produces a grey hue, so I scan in B&W. OmniPage does not do this.
One of the reasons people use cameras with a platen is the difficulty of scanning the middle of a book with a flatbed scanner. I have found magazines are a bigger problem; printing that runs across the gutter between pages is the worst. Book scanners such as the Plustek, which can scan up to 2mm from the edge of the scanner, help. They also do less damage to the books, as do the camera platen scanners.
300dpi is the usual minimum for OCR.
genarcher
Posts: 2
Joined: 21 Jul 2017, 00:29
E-book readers owned: kindle, sony
Number of books owned: 5000
Country: Australia

Re: Plustek 3800

Post by genarcher »

Hi Bruce and thanks for your suggestions. I've scanned about 400 books in six months with the Plustek and most of the results have been okay. Every so often I get a book which has been compiled with very tight margins, and those have been a problem. Otherwise new books seem to give better results than older books (which might be due to improved printing machinery, cleaner fonts and better quality paper). Older books, particularly those hand typeset in the 1930s and 1940s on coarse paper, tend to result in more difficult files. I agree that the epubs from both OmniPage and FineReader don't give the best results. I need to re-edit them in Word to get section and chapter headings, section breaks and the occasional spelling clean-up before I can import them into an electronic book reader. Who would have thought fifteen years ago that we would now be able to create such content?
BillGill
Posts: 139
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: Plustek 3800

Post by BillGill »

I don't worry too much about speed of scanning. Lately with my new 1 camera scanner I am getting about 250 pages per hour. That is a very small part of the process of creating an epub. I spend far more time proof reading the text after it has been converted. Even the best OCR will leave a lot of errors. And that is if the scan is really good.

The quality of the scan varies a lot depending on the book. I have a lot of older books that have yellowed pages that don't provide high quality scans. There are a lot of garbled words, incorrect punctuation, and misspelled words. Some common errors are confusion between 'h' and 'b' and substitution of '1' for 'l' or 'I'. Getting the text corrected is definitely the long end of the pole in book scanning.
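
As an illustration of how the '1' for 'l' or 'I' kind of error could be hunted down semi-automatically, here is a minimal Python sketch. It is an assumption about one possible approach, not a tool anyone in this thread uses: it simply flags tokens that mix letters and digits. It will not catch 'h' and 'b' confusion, which needs a dictionary or spellchecker, and it will also flag legitimate tokens such as "2nd".

```python
import re

# A token is suspicious if it contains at least one letter AND at least one digit,
# e.g. "wi11" or "1ike". Pure numbers such as "1942" are left alone.
MIXED = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$")

def suspicious_tokens(text):
    """Return tokens that mix letters and digits, in order of appearance."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in tokens if MIXED.match(t)]

sample = "He wi11 be there in 1942, 1ike the others."
print(suspicious_tokens(sample))
```

The flagged words still have to be checked by eye; the point is only to cut down how much text needs a close look.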

I generally work on one chapter at a time and run through each chapter 3 times looking for errors. Then I put all the chapters together and go through it one more time. Then I convert it to epub and go through it one more time. And I am still finding errors on the last pass. All told this takes me about 1 1/2 to 2 weeks per book. That is working a few hours a day.

I use Calibre for converting it to epub and then editing it.

Bill
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Plustek 3800

Post by BruceG »

The errors I find the most are, as you say, 1 for I in dates, and the pound sign.

Other issues I have are:
- do I move footnotes to the actual location and put them in brackets
- do I remove footnotes that refer to other pages, as they are not relevant to ereader page numbers
I have only started doing an index with chapters etc. A recent book had sketches before the chapter heading. So do I move the chapter heading before the sketches so they are included with the chapter, or leave it so the sketches go with the previous chapter?

Printing today is different from the past; in the past spaces were used before and after ' ! " etc. Do I leave them or change them? I tend to leave them, but if missing I insert as per today.

I have not found any paper that discusses these issues.

As well as creating epub files I create pdf files to index for searching purposes, ie many files can be searched at once.
BillGill
Posts: 139
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: Plustek 3800

Post by BillGill »

I haven't run into some of the issues you have, since I am scanning mostly old fiction, and mostly old paperbacks. I have only run into one book that had images, and they were in the middle of a chapter. I don't know how you are creating your epubs, but I am creating word processor files, then importing them into Calibre and letting Calibre convert them to epub. I separate the chapters with page breaks, and Calibre uses the page breaks to create the chapter files. So if I have an image in the chapter it should wind up ahead of the chapter heading when it is read. I don't know that for a fact, but it seems reasonable to me.

For the other formatting matters I usually try to keep the final version as much like the original as I can. So I put in spaces where they have spaces. Sometimes I look at something and realize that they have it wrong in the original book. If it particularly bothers me I may correct it, but mostly I go along with their mistakes.

Bill
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Plustek 3800

Post by BruceG »

Most books I am doing have illustrations and a number have footnotes. They are mostly missionary history or biographies. I was going from OmniPage directly to epub, then Calibre for final editing. I thought I would like to put chapters on new pages and couldn't work out how Calibre did it, so I bought Jutoh, which uses 'Heading 1' from Word, and other ways which I have not tried yet. So now I go OmniPage > Word > Jutoh > epub.
It is because I have selected Jutoh 'Heading 1' for chapters that any illustrations on the chapter page need to be after the chapter number or name. I will need to check out the other methods for selecting chapters.
As I am so far the only one reading the epub files I am not so particular with editing. I edit once in OmniPage, then once in Word, mostly for spaces between pages, and once again in Jutoh.
Someone else is scanning the books at a rate of about 3 a week and I am already slipping behind.
BillGill
Posts: 139
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: Plustek 3800

Post by BillGill »

Sorry I was so long getting back to you. I was out of town for a couple of days and am still catching up.

I don't have a problem with the pictures that are before the chapter heading. Here is a picture of one page in the Calibre reader. I got to this point using the table of contents.
CalibreView.jpg
As you can see the picture appears properly at the top of the page. I created the epub by starting in a word processor (Open Office Writer). In the WP I marked the chapter headings as headings, just the way you did. I separated chapters with line breaks. Then I imported the Word Processor file into Calibre and let Calibre do the conversion to epub. When Calibre completed the conversion I wound up with a separate XHTML file for each chapter. Then I used the Calibre Edit Table of Contents function to create a TOC based on the headings. And I wound up with the result that you see above. I'm not sure how you are separating the chapters in your input to Calibre, but when each chapter is a separate file it works for me.

If you are winding up with one big file for the whole document then you may need to split it into multiple files. Calibre allows that to be accomplished relatively easily.

I can see you getting behind if you are getting 3 books a week to convert, even editing quickly for your own use. I am trying to do a pretty good job and it takes me quite a while to finish one.

Bill
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: Plustek 3800

Post by zbgns »

If you do not mind me joining the discussion on making epubs from scans, I would be happy to share some of my observations on this.
I like to have more control over each stage of the process, so now I prefer a more “manual” approach. It takes more time and effort to end up with the epub file, but it also gives more satisfaction, as the quality is usually way better than what is produced automatically by e.g. FineReader.
1. Scanning
Usually I use my phone camera or a flatbed scanner to get images. Afterwards I always use Scan Tailor (Experimental version) to postprocess them. With photos taken by a camera the process is quite complex and time consuming, as it is necessary to correct geometric distortions (“dewarp” or at least “dekeystone” the images). Scan Tailor does an impressive job on this but manual corrections are unavoidable. Postprocessing is much easier with flatbed scanner output. A scanner is slower, but all the time saved by “scanning” with a camera is usually lost during postprocessing. On the other hand the camera is always with me, so I’m able to “scan” not only at home but also e.g. in a library. As a result I have a number of b&w .tiff images (for illustrations I save color pictures in addition).
2. OCR
The next step is optical character recognition. I use tesseract for this, the new “alpha” 4.00 version. It works quite slowly but the quality is usually very good. My impression is that it may be even better than what FineReader does. However, the main disadvantage is that tesseract’s output is plain text, so text formatting (italic, bold, original font size etc.) is lost. It must be recreated during proofreading. Although tesseract is a CLI tool, there is the gImageReader frontend, which I use. It is especially useful for getting rid of unwanted elements like headers, footers, page numbers etc. It is possible to manually indicate the area of recognition (it may be applied to multiple pages), so the recognition is performed only on the contents of a book. gImageReader also provides a handy tool for simple postprocessing of the recognized text (in particular it can join separate lines into paragraphs, including hyphenated words).
3. Proofreading
After this I proofread the text in a word processor. I use LibreOffice for this because it can be equipped with some helpful extensions. Typical OCR mistakes can be identified and corrected with Pepito Cleaner, which semi-automates proofreading work. For me this is the main advantage of LibreOffice Writer over MS Word, since I’m not aware of any Word tool like this. The next big advantage of Writer is the LanguageTool extension. Besides misspellings, typos etc. it is also able to identify grammatical and stylistic issues, and it seems to be clearly more accurate than the MS Word spellchecker. Of course it is still necessary to do a lot of manual corrections, especially of misspellings and formatting. With typical novels it is quite easy, as usually there is no fancy formatting in them and it is sufficient to mark headings and insert pictures, if any, and that’s all. However, there may be scientific books with e.g. a lot of footnotes on each page, a two-column layout etc. I haven’t found any way to improve the process, and resolving this by hand usually costs a lot of effort. When proofreading is finished I’m ready to create the epub file.
4. Epub
I tried various ways to produce epub files. Currently, the LibreOffice Writer2ePub extension is the tool I like most. The html code it produces is very clean. The stylesheets it uses also give good-looking output. Epubs produced by calibre need more work, as it also inserts some garbage inside, even if the book looks good in an ebook reader. I used to clean up the mess using the regex tool provided by the calibre ebook editor (the epub creator is so-so, but the editor is simply great). Now, with Writer2ePub, I still like to look inside the epub file, but usually there is not much to tweak. When the epub is ready I load it onto my Tolino or turn it into mobi and send it to my Kindle.
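
To illustrate the "join separate lines into paragraphs" step mentioned under OCR, here is a minimal Python sketch. It is not gImageReader's actual algorithm, only an assumed simple version: paragraphs are taken to be separated by blank lines, and a line ending in a hyphen is glued to the next line with the hyphen dropped.

```python
import re

def join_lines(text):
    """Merge hard-wrapped OCR lines into paragraphs, rejoining hyphenated words."""
    paragraphs = []
    # Blank lines are assumed to mark paragraph boundaries.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-"):
                joined = joined[:-1] + line   # "scan-" + "ner" -> "scanner"
            elif joined:
                joined += " " + line
            else:
                joined = line
        paragraphs.append(joined)
    return "\n\n".join(paragraphs)

ocr_text = "The scan-\nner is slow but\nreliable.\n\nNext para-\ngraph."
print(join_lines(ocr_text))
```

A genuine compound that happens to break at a line end (e.g. "well-" / "known") would lose its hyphen with this rule, so the result still needs proofreading.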


My conclusion is that, before starting to reprocess a paper book into an e-book, it is first of all necessary to assess whether it is worth the effort involved. I’m thinking of satisfactory results, as I do not accept the quality of epubs created by automatic tools like FineReader. The answer is evident for books already available in electronic form. Time spent on scanning, reprocessing etc. obviously costs more than the e-book itself, so the easiest and cheapest solution is simply to buy it, even if you have a paper edition. For books with complex formatting it may be better to bundle the images into a pdf and add an OCR layer on top of it to make it searchable. For a 200-300 page novel with just text and simple formatting (even if there are some pictures in it), the whole process from scanning to final epub takes approx. 3 hours, and that looks reasonable to me.
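
For the "pdf with an OCR layer" option, a minimal Python sketch of calling the tesseract command line is given below. It assumes tesseract is installed and on the PATH and that the language data for the chosen language is available; the file name and both function names are hypothetical. Passing `pdf` as the last argument asks tesseract for a searchable PDF instead of plain text.

```python
import subprocess
from pathlib import Path

def tesseract_pdf_command(image, lang="eng"):
    """Build the command line for turning one page image into a searchable PDF."""
    image = Path(image)
    output_base = image.with_suffix("")  # tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]

def make_searchable_pdf(image, lang="eng"):
    """Run tesseract on one image; raises CalledProcessError if it fails."""
    subprocess.run(tesseract_pdf_command(image, lang), check=True)

# Hypothetical file name; only prints the command, nothing is executed here.
print(" ".join(tesseract_pdf_command("page_001.tif", "pol")))
```

This gives one PDF per page image; combining the pages into a single book file is a separate step not shown here.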
BillGill
Posts: 139
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: Plustek 3800

Post by BillGill »

Something went wrong there. I thought I had replied to your post, but it hasn't shown up on the forum.

Thanks for that info. It looks like some good stuff. I am particularly impressed with your 3 hour time for creation of the ebook. I copied your post off so I can print it and look at it more easily.

I don't bother scanning any book I can get some other way. It just isn't worth the effort. Most of the books I want I can get one way or another, but some just aren't available for one reason or another. So those I scan.

Bill