Text Dumping PDF files

The other day I got a request to convert a PDF file into a text file or something that could be imported to Excel. The file was essentially some big accounting mumbo-jumbo full of numbers arranged in columns with fancy headings. There were over 200 pages of it.

Now the easiest thing to do was to use the Windows version of Adobe Acrobat and simply save the file as .txt. But of course, that knocked out all the white space. All the columns ran into each other and the file looked like crap. There was no way you could do anything useful with it.

Of course my linux PDF reader (acroread) did not have the “Save as Text” option, so the first place I turned to was the nifty linux app pdftotext.

pdftotext bigstupidfile.pdf

This gives you a quick text dump which is roughly equivalent to the built-in Acrobat save behavior. But fortunately pdftotext has all kinds of nifty features. If you want to preserve the whitespace and layout details you should do:

pdftotext -layout -eol dos bigstupidfile.pdf

The -eol dos bit is there to specify the end of line style. Remember, I’m on a unix box converting this file for a Windows dude who will want to import this stuff to Excel.
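If you want to double check that the DOS line endings actually made it into the output, here is a minimal sketch. The helper name count_crlf_lines is made up for illustration and is not part of pdftotext; it just counts the lines that end with a carriage return.

```shell
# Hypothetical helper: count the lines in a text file that end with CR LF.
count_crlf_lines() {
    # Command substitution keeps the carriage return character itself.
    cr=$(printf '\r')
    # A line ends in CR LF when its last character before the newline is CR.
    grep -c "${cr}\$" "$1"
}

# Usage (assumes bigstupidfile.txt was produced by the command above):
#   count_crlf_lines bigstupidfile.txt
#   wc -l < bigstupidfile.txt
```

If every line has a DOS ending, the two numbers in the usage lines should match.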

Needless to say, the trick worked perfectly. The columns were preserved and the file looked great. So whenever you need to convert some PDF data into text I highly recommend using the -layout option.
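If the fixed-width columns turn out to be awkward to import, one possible follow-up is to turn the layout text into tab-separated values. This is only a sketch under a big assumption: that columns are separated by two or more spaces and that no cell contains a double space. The function name layout_to_tsv is made up for illustration. It uses tabs rather than commas because accounting numbers like 1,200.00 contain commas themselves.

```shell
# Hypothetical helper: read -layout text on stdin, write tab-separated text.
# Assumes columns are separated by runs of two or more spaces.
layout_to_tsv() {
    tab=$(printf '\t')
    cr=$(printf '\r')
    # Drop any existing carriage returns, trim leading and trailing spaces,
    # turn each run of 2+ spaces into one tab, then put the DOS ending back.
    tr -d '\r' |
        sed -E "s/^ +//; s/ +\$//; s/ {2,}/${tab}/g; s/\$/${cr}/"
}

# Usage (file names are examples only):
#   layout_to_tsv < bigstupidfile.txt > bigstupidfile.tsv
```

Headings with single spaces in them stay in one cell, but a wide table with empty cells will shift columns, so eyeball the result before trusting it.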




3 Responses to Text Dumping PDF files

  1. Dustin says:

    Wow! Great tip.
    I needed to do the same thing, but didn’t have a Linux box to use. I’m a Mac user and found this link. (http://mac.softpedia.com/progDownload/pdftotext-Installer-Package-pl- Download-11127.html)
    After downloading it, I used your same commands to quickly/easily create a Text file from PDF.

  2. Luke says:

    Awesome. A lot of linux and unix apps are being ported to OSX nowadays.

    Check out the MacPorts and Fink projects. They both have tons of ported linux goodness you can download and run on your Mac. :)

  3. Pingback: Munich Unix » Text Dumping PDF files
