
Extracting individual page text from PDFs?

Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Extracting individual page text from PDFs?

Post by Misty » 22 Mar 2010, 12:49

Does anyone know of any software that can automatically output the text from a PDF into individual files for each page? I'm preparing a book to be uploaded to my library's webpage, which is going up in a couple of formats. One of them is a per-page JPEG version, created from the original PDF that went to the publisher. To make the per-page view searchable I need text files for each individual page. Since it's the original PDF it contains all selectable text, but I can't seem to figure out how to extract the text the way I need it. Acrobat will output text, but only for the entire book, not for each page separately. Copy/pasting manually would be a total pain that I'm hoping to avoid.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.

StevePoling
Posts: 290
Joined: 20 Jun 2009, 12:19
E-book readers owned: SONY PRS-505, Kindle DX
Number of books owned: 9999
Location: Grand Rapids, MI

Re: Extracting individual page text from PDFs?

Post by StevePoling » 22 Mar 2010, 13:04

Misty wrote:Does anyone know any software that can automatically output the text from a PDF into individual files for each page?
Depends on whether you can program or not. There's an open source project called PDFBox that's part of the Apache project. It has functionality to snarf the text from a PDF on a page-by-page basis. Trouble is that it's a little balky about handing you text. You get one or more characters and their x,y position, but you don't get a nice string of a sentence like you'd like. I think that if you're a strong Java programmer you could knock out a solution using PDFBox in a week or so.

Depending on what you've got to start with, you might want to rethink your workflow and make PDF an output instead of an input to your process.

daniel_reetz
Posts: 2779
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Extracting individual page text from PDFs?

Post by daniel_reetz » 22 Mar 2010, 13:52

pdftk might be able to do this. I'd provide a link but I am typing from a phone. Definitely seems doable.

toka
Posts: 3
Joined: 04 Mar 2014, 00:52
E-book readers owned: 1
Number of books owned: 1000
Country: Germany

Re: Extracting individual page text from PDFs?

Post by toka » 22 Mar 2010, 15:08

In Ubuntu / Debian there is a tool called pdfimages, which is part of the Poppler PDF tools: http://poppler.freedesktop.org/

It can be installed by

Code:

sudo apt-get install poppler-utils
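
For the text itself, the relevant tool in the same poppler-utils package is pdftotext (pdfimages pulls out embedded images). A minimal sketch of per-page extraction, assuming poppler-utils is installed; the function name `extract_page_text` and the `page_NNNN.txt` naming are made up for illustration and are not from this thread:

```shell
# Sketch: one text file per PDF page with pdftotext (poppler-utils).
# The function name and the page_NNNN.txt naming are illustrative assumptions.
# Usage: extract_page_text book.pdf outdir
extract_page_text() (
    pdf="$1"
    outdir="$2"
    # pdfinfo (also in poppler-utils) prints a "Pages:" line with the page count.
    pages="$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')"
    case "$pages" in
        ''|*[!0-9]*) echo "could not read page count of $pdf" >&2; exit 1 ;;
    esac
    mkdir -p "$outdir" || exit 1
    page=1
    while [ "$page" -le "$pages" ]; do
        # -f and -l set the first and last page, so each call covers one page.
        pdftotext -f "$page" -l "$page" "$pdf" \
            "$(printf '%s/page_%04d.txt' "$outdir" "$page")" || exit 1
        page=$((page + 1))
    done
)
```

Called as `extract_page_text book.pdf pages` (with `book.pdf` standing in for the real file), this should leave `pages/page_0001.txt`, `pages/page_0002.txt` and so on, one file per page.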

daniel_reetz
Posts: 2779
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Extracting individual page text from PDFs?

Post by daniel_reetz » 22 Mar 2010, 15:34

Maybe Acrobat or pdftk could split the book into individual page PDFs and then you could batch convert to text using Acrobat...
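
A rough sketch of that split-then-convert idea, assuming pdftk and poppler-utils are both installed. It swaps pdftotext in for the Acrobat batch step, and the function name `burst_and_convert` and the file naming are illustrative assumptions:

```shell
# Sketch: split a PDF into single-page PDFs with pdftk, then convert each to text.
# pdftotext stands in for the Acrobat batch conversion; names are assumptions.
# Usage: burst_and_convert book.pdf outdir
burst_and_convert() (
    pdf="$1"
    outdir="$2"
    mkdir -p "$outdir" || exit 1
    # "burst" writes one PDF per page; %04d in the pattern is the page number.
    pdftk "$pdf" burst output "$outdir/page_%04d.pdf" || exit 1
    for page_pdf in "$outdir"/page_*.pdf; do
        [ -e "$page_pdf" ] || continue
        # Each single-page PDF gets a page_NNNN.txt written next to it.
        pdftotext "$page_pdf" "${page_pdf%.pdf}.txt" || exit 1
    done
)
```

This keeps the single-page PDFs alongside the text files, which may or may not be wanted; they can be deleted afterwards if only the text is needed.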

Afish
Posts: 34
Joined: 04 Mar 2014, 00:52

Re: Extracting individual page text from PDFs?

Post by Afish » 23 Mar 2010, 01:11

Acrobat Professional can do this.

Document > Extract Pages > enter the page range (from page, to page) and select "Extract Pages As Separate Files".

Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: Extracting individual page text from PDFs?

Post by Misty » 23 Mar 2010, 14:54

Thanks, Afish - that's exactly what I needed! After splitting the pages, I was able to do a batch text extraction.

Steve: We already have a number of books up that aren't as suited to PDF downloading, and I want to have this book be consistent in presentation even if PDF downloading is an option. I prefer to have at least one view for each item that doesn't require third-party software, too - JPEG pages can be viewed by people who don't have or don't want a PDF reader, and the collections software we're using lets us make those JPEG pages searchable per-page if I provide plaintext files to associate with each of those pages.

ovencakerugby
Posts: 5
Joined: 05 Mar 2010, 12:33

Re: Extracting individual page text from PDFs?

Post by ovencakerugby » 23 Mar 2010, 19:10

Why not give Deskunpdf a go?
I have converted PDFs made from Excel and Word files back to their original formats.
regards
ovencakerugby
