
The Many Roads Of PDF Processing

Oct 11th, 2012

The Easy Path

So you have a PDF, or a bunch of PDFs, and want to extract the text out of them? A few years ago, this would have been a horrible task, but life has gotten easier since then.

If your PDF is just filled with text, this becomes really easy:

 pdftotext pdfname.pdf

You can find pdftotext for most operating systems.

How do you know that it’s just text? If you open it up in Acrobat/Preview/XPDF/etc. and can highlight the text, then pdftotext should work fine.
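
If you would rather not open each file by hand, the highlight test can be approximated from the command line. This is only a minimal sketch, not a definitive check: it assumes pdftotext is on your PATH, and the function names has_visible_text and pdf_has_text_layer are made up for illustration. It treats a PDF as "just text" if pdftotext produces at least one non-whitespace character.

```shell
#!/bin/sh
# Succeeds if stdin contains at least one non-whitespace character.
has_visible_text() {
    grep -q '[^[:space:]]'
}

# Hypothetical helper: succeeds if pdftotext finds any text in the PDF.
# The "-" tells pdftotext to write to stdout instead of a .txt file.
pdf_has_text_layer() {
    pdftotext "$1" - 2>/dev/null | has_visible_text
}

# Example (pdfname.pdf is a placeholder):
#   pdf_has_text_layer pdfname.pdf && echo "pdftotext should work"
```

A scanned page usually comes back as nothing but whitespace and page breaks, so this fails on the same files where you can't highlight the text by hand.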

But if you can’t do that, then what the author probably did was make an image and embed it in a PDF file. You then have to use OCR, whose output isn’t always right. A Google-sponsored tool called tesseract does a good job with this OCR stuff. I remember that it used to stink, but it doesn’t anymore. Simply:

tesseract pdfname.pdf textpat

That will try to do an OCR scan of pdfname.pdf and save the text from all of its pages into a file called textpat.txt.

But, of course, the path isn’t always easy.

The Long and Winding Road

I have a bunch of typed documents (read: hard copies) coming in, all of which have to be typed in. Lucky me. We have a scanner on-site and I asked if it does OCR, and I was told that it doesn’t. I’m getting even luckier.

But I’ve parsed PDFs before. I should be able to handle it.

I scanned in a few and had the PDFs sent to me. I installed tesseract via Homebrew. The results were. . . disappointing:

$ tesseract pdfname.pdf out
Tesseract Open Source OCR Engine v3.01 with Leptonica
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Unsupported image type.

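So a quick google shows that either tesseract doesn’t have the right libraries installed, or the PDF wasn’t well-formed. Since tesseract told me it found Leptonica, I have to assume the proper libraries are there. So our scanner is making improper PDFs. This is great. (In hindsight, there is a third possibility: as far as I can tell, tesseract reads image formats through Leptonica and doesn’t take PDFs as input at all, which would produce the same “Unsupported image type” complaint.)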

After some googling and head scratching, I discovered that tesseract works very well on TIFF files. I used Preview to export the PDF to a TIFF and it worked. Success!

 $ tesseract pdfname.tiff out
 Tesseract Open Source OCR Engine v3.01 with Leptonica
 Page 0
 Page 1
 Page 2
 $ ls  out*
 out.txt

Ok, I didn’t want to open all of these files in Preview. How could I convert them from the command line? Well, the first tool that came to mind was convert from ImageMagick. That has always been a tricky road for me and, sure enough, the resulting TIFF file had horrid resolution, which made tesseract spit out garbage. I searched some more, even for OSX-specific solutions. I found sips, which comes with OSX, though most people haven’t heard of it. The usage is a bit arcane, but it uses the OSX libraries (i.e. the same thing my Preview export used). And, yes, it worked great out of the box, except that it doesn’t handle multi-page PDFs. Ugh.

How does one break up a PDF into pages? More googling, and I found pdftk, which is a little Swiss Army knife of PDF processing. And, hey, it can break a PDF into pages with the burst option! Or, maybe not:

 $  pdftk pdfname.pdf burst
 Unhandled Java Exception:
 java.lang.NullPointerException
 at com.lowagie.text.pdf.PdfCopy.copyIndirect(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyObject(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyDictionary(pdftk)

That’s not good. A few searches showed someone else with that same problem. The cause? A bad PDF, of course! The very thing that started me down this path! I could still extract the PDF a page at a time . . . but that seemed like a bad way to go.

Ok, time to refocus. I thought, “What am I trying to accomplish?” And that was converting the broken PDFs to TIFFs so I can run tesseract. So let’s focus back on the PDF->TIFF part. I did more searching and found a StackOverflow entry that talked about the problem I had with ImageMagick and tesseract, and someone had posted a nice recipe for using Ghostscript:

 /usr/local/bin/gs -o out.tif -sDEVICE=tiffgray -r720x720 \
 -g6120x7920 -sCompression=lzw in.pdf

And I got a TIFF file out that tesseract could process wonderfully! Woot! The bad part was that tesseract took a long time to process this TIFF, much longer than the one from Preview. Most of that processing time was spent on the first page of my PDF, which is essentially a cover page. How do I get rid of that cover page? Well, back to pdftk:

 pdftk pdfname.pdf cat 2-end output nocover.pdf

So that makes another PDF from the second page on (these PDFs have a variable number of pages).
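
Since the page count varies, a scan that is only a cover page would leave nothing for the 2-end range to select. Here is a hedged sketch of a guard for that case. It relies on pdftk's dump_data operation, which prints a NumberOfPages line; the function names page_count_from_dump and strip_cover are made up, and I am assuming (not verifying) that pdftk rejects 2-end on a one-page file.

```shell
#!/bin/sh
# Reads `pdftk FILE dump_data` output on stdin and prints the page count.
page_count_from_dump() {
    awk '/^NumberOfPages:/ { print $2; exit }'
}

# Hypothetical wrapper: drop the cover page only if something would be left.
# Usage: strip_cover input.pdf output.pdf
strip_cover() {
    pages=$(pdftk "$1" dump_data | page_count_from_dump)
    if [ "${pages:-0}" -ge 2 ]; then
        pdftk "$1" cat 2-end output "$2"
    else
        echo "skipping $1: only ${pages:-0} page(s)" >&2
        return 1
    fi
}
```

Given how pdftk choked on these PDFs with burst, dump_data may fail on them too, in which case the page count comes back empty and the file is skipped.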

Running the PDF->TIFF conversion on the nocover.pdf file gave some errors. But then I ran tesseract on the resulting TIFF file and had no problems.

Just for fun, I ran tesseract on the nocover.pdf that pdftk created: same error as the first time. I figured as much, but it was worth a shot.

So, in the end, I wrote a shell script that takes a PDF as a parameter and does this:

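# Strip the directory and the .pdf extension from the input path.
oldname=$(basename "$1")
name=${oldname%.pdf}

# Output locations; the nocover, tiffs and extracted directories must already exist.
pdf="nocover/$name.pdf"
tiff="tiffs/$name.tiff"
text="extracted/$name"

# Drop the cover page, render the rest to a grayscale TIFF, then OCR it.
pdftk "$1" cat 2-end output "$pdf"
/usr/local/bin/gs -o "$tiff" -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw "$pdf"
tesseract "$tiff" "$text"   # tesseract adds the .txt extension itself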

And that, my dear readers, is how to put a PDF through an OCR process.
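
With a whole stack of scans, the script can be run once per file. The sketch below is one way to do that, under assumptions: the helper name for_each_pdf, the script name ./ocr.sh and the scans directory are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command once per PDF in a directory.
# Usage: for_each_pdf DIR COMMAND [ARGS...]
for_each_pdf() {
    dir=$1
    shift
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue    # no matches: the glob stays literal
        "$@" "$f" || echo "failed: $f" >&2
    done
}

# Example, assuming the script above is saved as ./ocr.sh:
#   mkdir -p nocover tiffs extracted
#   for_each_pdf scans ./ocr.sh
```

A failure on one file is reported and the loop moves on, which seemed like the right trade-off when the scanner's PDFs are this unpredictable.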

Cleaning up with Hazel

Oct 21st, 2009

Anyone who knows me well knows that I’m not a neat-freak.  If you know me well enough, you know that I can be an outright slob.  My wife has tried to train me in other ways, and it’s sorta worked. But I still leave things lying around that need to be dealt with, including things that should go straight to the trash.

I’m that way with files on my computers as well.  I let files sit around long after I need them and then, surprise! I have a problem with hard drive space. I end up having to scramble to find files to delete, and end up finding ISO images and tarballs of forgotten installs that I could have deleted months, sometimes years ago.

After my clean install of Snow Leopard, I vowed I would be better at cleaning up after myself.  I would delete files that I no longer needed, remove those MP3 files after I import them into iTunes, and empty my Trash periodically.  But, really, who am I fooling? I’m not going to do daily or weekly sweeps of my hard drive looking for these things.  That’s where Hazel stepped into my life and made things much easier.

Hazel cleans up after you.  Essentially, you tell it where to look, what to look for, and what to do.  Want to import MP3 files automatically into iTunes? It will do that. Want to delete files that were downloaded more than a week ago?  It will do that.  Empty the Trash every month?  Yep.  Oh, and if the Trash bin gets large, it will empty it automatically, but only if you tell it to.  What if you want to do something weird with the file?  Well, you can write an AppleScript or a shell script to handle that.  And you tell it all this in a nice, mostly intuitive GUI.  (Click on the Screenshots link on the main Hazel page for an idea.)
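
To make the where/what/do idea concrete, here is roughly what one such rule amounts to if you did it by hand in a shell. This is not how Hazel works internally and it is not a Hazel script; it is only a sketch, and the helper name stale_files, the folder and the age are assumptions. It also matches on modification time, which is not quite the same thing as the download date.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under DIR last modified more
# than DAYS days ago. It only prints; it never deletes anything.
stale_files() {
    find "$1" -type f -mtime +"$2" -print
}

# Example (folder and age are assumptions):
#   stale_files "$HOME/Downloads" 7
```

Remembering to run something like this every week is exactly the chore I keep skipping, which is the whole case for letting Hazel do it.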

An added bonus is that it can delete application files when you delete the application.  What’s that?  You thought OSX did that for you when you moved an app from the Applications folder to the Trash?  Well, look in your user’s Library->Preferences or Library->Application Support folder. Yeah, you see a lot of folders there for applications you no longer have installed.  If you had Hazel installed, it would see that you had moved an application to the Trash and would ask you if you want to delete the user-level files as well.

I think Hazel is an application that every Mac owner should have.  So, really, at least try it out. Now. Go.  It’s worth far more than its $21.95 price tag.

The Harrowing Journey From Tiger to Snow Leopard

Sep 22nd, 2009

Unlike most Apple users, I didn’t make the quick jump from Tiger (10.4) to Leopard (10.5), mostly because I’ve learned the hard way not to be the first in line for upgrades.  When I read about the changes they were making, I thought “I’ll wait until they work out the kinks.” And then they announced Snow Leopard (10.6) and touted the $24 upgrade but, if you looked closely at it, that was only from Leopard to Snow Leopard, where little was changed on the surface (but much under the hood was redone).  A little looking and I found that I had to get the Mac Box Set with 10.6, iWork, and the new iLife, which I wanted anyway.  And lots of the tools that I wanted to use only worked with Leopard on up anyway.  So Snow Leopard it is.

I ordered my Box Set and waited for Amazon to ship it (it was $20 cheaper there and no taxes. Yes, I’m that cheap!).  When I got it in my hands, and got ready for the upgrade . . . I did a backup first.  SuperDuper is my friend.  Before this process was over, it became my lifesaver.

So I stuck the Snow Leopard disc in, and told it that I wanted to upgrade.  The machine rebooted, the Snow Leopard install came up, and said it was starting and then . . . it quit, telling me that there was not enough room left on the drive.  Which was very possible — there was a lot of junk on that drive.  So I took the disk out and rebooted, thinking I would remove some more junk and then do the upgrade.  And then it happened . . .

The machine wouldn’t boot.

My Mac would start up just fine, give me the Apple logo, and then shut down.  I put the Snow Leopard disc back in and that didn’t boot either!  A little research showed that my assumption about it booting from the DVD drive if it couldn’t boot anything else was wrong; instead it just stops.  You have to hold down “C” during the boot sequence to get it to boot from the DVD drive.  I went to find my Tiger install discs and booted with those.  I ran Disk Utility and tried “Repair” but it said it couldn’t.  Arrgggghhhh.

I tried Disk Utility from the Snow Leopard disc and it wouldn’t even try to Repair it: it was not a Snow Leopard hard disk!!  Arrggghhhh again!

Now I had a choice.  SuperDuper makes my USB hard drive bootable.  I could boot off of it, but that wouldn’t solve my problem: the boot info on the hard drive was messed up. I could boot from the USB drive and remove enough stuff to make Snow Leopard install, but still, nothing could repair my drive.  My data was safely backed up, I knew that I didn’t need it all anyway and, with SuperDuper, I could go and copy the files that I wanted off of the backup.  So I took a leap and erased the drive.

You read that right. I wiped the drive clean with the Snow Leopard installer and installed from scratch. Of course, the installer was more than willing to do that.

After that, things went mostly well.  I copied our Music and Preferences folders over and Safari, iTunes, and Mail all saw the changes and updated their databases.  The copying part took a while, but after that it was all smooth.

But I had lots of problems with MacPorts.  Emacs.app needs some manual guidance, Python 2.4 has some weirdness, and PostgreSQL/PostGIS are always a pain to install.  But I got them going.

A clean install was a good thing: it got rid of the junk and I was able to move just the files that I needed.  And, thankfully, SuperDuper demonstrated that it’s worth 10x its $28 price tag.

So do your backups, kids.