Text Dumping PDF files

The other day I got a request to convert a PDF file into a text file or something that could be imported into Excel. The file was essentially some big accounting mumbo-jumbo full of numbers arranged in columns with fancy headings. There were over 200 pages of it.

Now the easiest thing to do was to use the Windows version of Adobe Acrobat and simply save the file as .txt. But of course, that knocked out all the white space. All the columns ran into each other and the file looked like crap. There was no way you could do anything useful with it.

Of course my Linux PDF reader (acroread) did not have the “Save as Text” option, so the first place I turned to was the nifty Linux app pdftotext.

pdftotext bigstupidfile.pdf

This gives you a quick text dump which is roughly equivalent to the built-in Acrobat save behavior. But fortunately pdftotext has all kinds of nifty features. If you want to preserve the whitespace and layout details you should do:

pdftotext -layout -eol dos bigstupidfile.pdf

The -eol dos bit is there to specify the end-of-line style. Remember, I’m on a Unix box converting this file for a Windows dude who will want to import this stuff into Excel.
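With a 200-page file it can be handy to try the options on a few pages first. Here is a minimal sketch that wraps the same command and adds pdftotext's -f and -l options (first and last page to convert). The function name pdf_pages_to_text is made up for this example, not something pdftotext ships with.

```shell
# Hypothetical helper (the name is an assumption made for this sketch).
# Usage: pdf_pages_to_text FILE.pdf FIRST_PAGE LAST_PAGE
pdf_pages_to_text() {
  # -f / -l pick the first and last page to convert;
  # -layout keeps the column layout, -eol dos writes DOS line endings.
  # With no output name given, pdftotext writes FILE.txt next to the PDF.
  pdftotext -f "$2" -l "$3" -layout -eol dos "$1"
}
```

For example, pdf_pages_to_text bigstupidfile.pdf 1 5 should dump only the first five pages into bigstupidfile.txt.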

Needless to say, the trick worked perfectly. The columns were preserved and the file looked great. So whenever you need to convert some PDF data into text I highly recommend using the -layout option.
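If the goal is an Excel import, one possible follow-up step is to turn the space-aligned columns into tab-separated fields. This is only a rough sketch and not part of what I actually did: the function name layout_to_tsv is made up, and it assumes columns are always separated by at least two spaces and that no cell is left blank (a blank cell would shift the rest of the row to the left).

```shell
# Hypothetical helper (name and approach are assumptions for this sketch).
# Reads -layout text from the named files or stdin and writes
# tab-separated text to stdout.
layout_to_tsv() {
  awk '{
    sub(/\r$/, "")      # drop the carriage return left by -eol dos
    sub(/^ +/, "")      # drop leading indentation
    gsub(/  +/, "\t")   # turn each run of two or more spaces into one tab
    print
  }' "$@"
}
```

Something like layout_to_tsv bigstupidfile.txt > bigstupidfile.tsv would then give a file that a spreadsheet can open as tab-delimited text. Tabs are used rather than commas because accounting figures such as 1,200.00 already contain commas.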


Responses to “Text Dumping PDF files”

    1. Dustin Says:

      Wow! Great tip.
      I needed to do the same thing, but didn’t have a Linux box to use. I’m a Mac user and found this link. (http://mac.softpedia.com/progDownload/pdftotext-Installer-Package-pl- Download-11127.html)
      After downloading it, I used your same commands to quickly/easily create a Text file from PDF.

    2. Luke Says:

      Awesome. A lot of linux and unix apps are being ported to OSX nowadays.

      Check out the MacPorts and Fink projects. They both have tons of ported Linux goodness you can download and run on your Mac.
