
The Many Roads Of PDF Processing

Oct 11th, 2012

The Easy Path

So you have a PDF, or a bunch of PDFs, and want to extract the text out of them? A few years ago, this would have been a horrible task, but life has gotten easier since then.

If your PDF is just filled with text, this becomes really easy:

 pdftotext pdfname.pdf

You can find pdftotext for most operating systems.

How do you know that it’s just text? If you open it up in Acrobat/Preview/XPDF/etc. and can highlight the text, then pdftotext should work fine.
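If you have more PDFs than you care to open by hand, the same check can be roughed out in the shell. This is only a sketch: has_text_layer is a name I made up, and it assumes that a PDF with real text makes pdftotext print at least one non-whitespace character when asked to write to standard output (the "-" argument).

```shell
# has_text_layer is a hypothetical helper name, not part of any tool.
# It succeeds when pdftotext extracts at least one non-whitespace character.
has_text_layer() {
    # "-" as the output file makes pdftotext write to standard output.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Something like `has_text_layer pdfname.pdf && pdftotext pdfname.pdf` would then only extract from files that pass the check.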

But if you can’t do that, then what the author probably did was make an image and embed it in a PDF file. You then have to use OCR, whose output isn’t always right. A Google-sponsored tool called tesseract does a good job with this OCR stuff. I remember that it used to stink, but it doesn’t anymore. Simply:

tesseract pdfname.pdf textpat

That will try to OCR pdfname.pdf and save the recognized text from every page into a file called textpat.txt.

But, of course, the path isn’t always easy.

The Long and Winding Road

I have a bunch of typed documents (read: hard copies) coming in, all of which have to be typed in. Lucky me. We have a scanner on-site; I asked if it does OCR and was told that it doesn’t. I’m getting even luckier.

But I’ve parsed PDFs before. I should be able to handle it.

I scanned in a few and had the PDFs sent to me. I installed tesseract via Homebrew. The results were. . . disappointing:

$ tesseract pdfname.pdf out
Tesseract Open Source OCR Engine v3.01 with Leptonica
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Unsupported image type.

So a quick Google search suggests that either tesseract doesn’t have the right libraries installed, or the PDF wasn’t well-formed. Since tesseract told me it found Leptonica, I have to assume the proper libraries are there. So our scanner is apparently making improper PDFs (or tesseract just can’t take a PDF as input, no matter how well-formed it is). This is great.

After some googling and head scratching, I discovered that tesseract works very well on TIFF files. I used Preview to export the PDF to a TIFF and, success!

 $ tesseract pdfname.tiff out
 Tesseract Open Source OCR Engine v3.01 with Leptonica
 Page 0
 Page 1
 Page 2
 $ ls  out*
 out.txt

Ok, I didn’t want to open all of these files in Preview. How to convert them from the command line? Well, the first tool to think of is convert from ImageMagick. That has always been a tricky road for me and, sure enough, the resulting TIFF file had horrid resolution. That made tesseract spit out garbage. I searched some more, even for OSX-specific solutions. I found sips, which comes with OSX but which most people haven’t heard of. The usage is a bit arcane, but it uses the OSX libraries (i.e. the same thing my Preview export used). And, yes, it worked great out of the box, except that it doesn’t handle multi-page PDFs. Ugh.

How does one break up a PDF into pages? More googling, and I found pdftk, which is a little Swiss Army knife of PDF processing. And, hey, it can break a PDF into pages with the burst option! Or, maybe not:

 $  pdftk pdfname.pdf burst
 Unhandled Java Exception:
 java.lang.NullPointerException
 at com.lowagie.text.pdf.PdfCopy.copyIndirect(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyObject(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyDictionary(pdftk)

That’s not good. A few searches showed someone else with that same problem. The cause? A bad PDF, of course! The very thing that started me down this path! I could still extract the PDF one page at a time, but that felt like a bad solution to me.

Ok, time to refocus. I thought, “What am I trying to accomplish?” And that was converting the broken PDFs to TIFFs so I can run tesseract. So let’s focus back on the PDF->TIFF part. I did more searching and found a StackOverflow entry that talked about the problem I had with ImageMagick and tesseract, and someone had posted a nice recipe for using Ghostscript:

 /usr/local/bin/gs -o out.tif -sDEVICE=tiffgray -r720x720 \
 -g6120x7920 -sCompression=lzw in.pdf
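The -r and -g numbers in that recipe go together: -r is the resolution in dots per inch and -g is the page size in pixels, so 8.5 x 11 inches at 720 dpi works out to 6120x7920. If you want to try a different resolution, a small helper can do the arithmetic. Treat this as a sketch: gs_geometry is a name I made up, and the US Letter default is my assumption about the scans.

```shell
# gs_geometry is a hypothetical helper, not a Ghostscript feature.
# It prints WIDTHxHEIGHT in pixels for a given resolution in dpi.
# Page size is given in tenths of an inch so the shell can stay in
# integer math; the default of 85 x 110 assumes US Letter (8.5 x 11 in).
gs_geometry() {
    dpi=$1
    w_tenths=${2:-85}
    h_tenths=${3:-110}
    echo "$(( w_tenths * dpi / 10 ))x$(( h_tenths * dpi / 10 ))"
}

gs_geometry 720   # prints 6120x7920
```

The result can be dropped into the command as -g"$(gs_geometry 300)" alongside a matching -r300x300.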

And I got a TIFF file out that tesseract could process wonderfully! Woot! The bad part was that tesseract took a long time to process this TIFF, much longer than the one from Preview. Most of that processing time was spent on the first page of my PDF, which is essentially a cover page. How do I get rid of that cover page? Well, back to pdftk:

 pdftk pdfname.pdf cat 2-end output nocover.pdf

So that makes another PDF from the second page on (these PDFs have a variable number of pages).

Running the PDF->TIFF conversion on the nocover.pdf file gave some errors. But then I ran tesseract on the resulting TIFF file and I had no problems.

Just for fun, I ran tesseract on the nocover.pdf that pdftk created: same error as the first time. I figured as much, but it was worth a shot.

So, in the end, I wrote a shell script that takes a PDF as a parameter and does this:

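#!/bin/sh
# Usage: pass one scanned PDF as the only argument.
oldname=$(basename "$1")
name=${oldname%.pdf}

pdf=nocover/$name.pdf
tiff=tiffs/$name.tiff
text=extracted/$name

# Make sure the output directories exist.
mkdir -p nocover tiffs extracted

# Drop the cover page, render the rest to a grayscale TIFF, then OCR it.
pdftk "$1" cat 2-end output "$pdf"
/usr/local/bin/gs -o "$tiff" -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw "$pdf"
tesseract "$tiff" "$text"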
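To push a whole folder of scans through that script, a loop along these lines should do. It is a sketch, not what I actually ran: ocr_all and OCR_CMD are made-up names, and ./ocr-pdf.sh is a placeholder for wherever you saved the script above.

```shell
# OCR_CMD and ocr_all are hypothetical names for this sketch.
# OCR_CMD should point at wherever the script above was saved.
OCR_CMD=${OCR_CMD:-./ocr-pdf.sh}

ocr_all() {
    dir=$1
    for f in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$f" ] || continue
        # Keep going if one file fails, but say which one it was.
        "$OCR_CMD" "$f" || echo "failed: $f" >&2
    done
}
```

Calling `ocr_all scans` would then run the script once per PDF in a directory named scans, quoting each filename so that spaces survive.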

And that, my dear readers, is how to put a PDF through an OCR process.

Review: Pangolin Laptop from System76

Jan 5th, 2011

I had known that my early-model MacBook Pro was getting to the end of its usefulness for me. I mean, it still runs, but as software has grown more and more complicated, my MBP wasn’t cutting it anymore.

Most people would just buy another MacBook Pro! And, while that sounded tempting, I had a few things that held me back from that. For one thing, $2500 was a steep price for me and my rising side-business. And the tools I use 85% of the time were not Mac specific. They are things like zsh, Emacs, Python, PHP, and some of JetBrains’ products. All of them would work on Linux, which has a much lower cost of entry than another Mac.

I started this journey around six months ago when I started scouring the Internet for the best Linux-based laptop. I was quickly led to System76, a maker of Ubuntu-powered laptops, desktops, and servers. I was impressed by what I found on the web about this company. There were a lot of reviews and comments from their users and no one ever had anything bad to say about them. I mean, there were things they wished were different, but everyone was happy with the hardware they were getting, how well it worked with Ubuntu, and, more importantly, the post-sales support they were getting. The price was higher, especially compared to the laptops you get at big-box stores, but you got a machine that you knew would work with Ubuntu, without having to fiddle around with it. But, regardless, it was certainly cheaper than a MacBook Pro!

Fast-forward a few months: I was investigating putting more RAM into my aging MacBook Pro. But I couldn’t! I already had it maxed out at 2GB! So this was when I decided to take the plunge.

System76 has a wide range of laptops available, but the choice was easy for me: the Pangolin Performance. It seemed like a good development machine, and my display needs are not heavy enough to warrant the next step up. I spec’ed out what I wanted, and then compared it with a MacBook Pro. Yep, about half the price even though I was getting 6GB of RAM instead of Apple’s 4GB, and I was getting a slightly larger hard drive. I thought I was getting a very good deal.

I ordered it about 10 days before Christmas, and System76 responded that it would ship within 8 business days. I was surprised when I found out that it had shipped early and was expected to arrive on the Tuesday before Christmas! And I was even more surprised to have it arrive a day earlier than that. Huzzah!

The packaging of the laptop was nothing to write home about, but it was extremely well cushioned and supported inside. It would be hard to damage its contents. I took it out of the box and immediately started using it while the battery was charging.

The first thing I noticed is how quiet it is. I didn’t think the fan was even running! But it turned out that it was; it’s just that quiet. I had my Pangolin on my lap, doing lots of installation, configuring, etc., when my wife asked me if my lap was hot yet. I hadn’t even thought about that, so of course it wasn’t hot at all! I discovered why when I was packing it up after using it for a while on a table. The table just left of where it had been sitting was a little hot, but underneath it was fairly cool. It seems that the fan blows the heat straight out the left side instead of blowing it underneath. This lets the heat escape and keeps your lap cooler, as well as the underside of the laptop itself. +1 for great design!

As for Ubuntu? Almost flawless. I thought I had to call support to get Bluetooth working, but then I found the button that turns it on: the F12 key. That could have been an embarrassing phone call.

Note the word “almost”: the one thing that I can’t seem to get working right is Flash using HDMI Audio. The HDMI Video works fine, and I got HDMI Audio to work out of normal Gnome apps, but Flash seems to cheerfully ignore the HDMI output and always goes to my speakers. Since I primarily use this as a developer machine and not a multimedia server, this is not a big deal.

The overall performance, however, is fantastic. The laptop boots in seconds, and every app I run starts in milliseconds. And I run Apache, PHP, MySQL and PostgreSQL most of the time. It finds my Android phone, Kindle, and iPod when I plug them in and offers to start the right app.

So, after a few weeks of fairly heavy use, would I recommend this laptop? Resoundingly yes! Especially if you are a developer in the open source space and want everything to “just work”. Everything just works for me, without paying the Apple premium.

Setting Proxy Environment in UNIX

Jun 8th, 2009

The easiest and best way to set proxy information on your Linux/Unix machine is with the http_proxy environment variable in your ~/.bashrc, ~/.zshrc, or whatever your favorite shell’s configuration file is. Set it like this:

http://user:password@proxy-server:portnum
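In a shell startup file, that comes out roughly as the line below. This is only a sketch: user, password, and proxy-server are placeholders, 8080 is a made-up port number, and a password containing special characters would need to be URL-encoded first.

```shell
# Placeholder credentials and host; 8080 is a made-up port number.
export http_proxy="http://user:password@proxy-server:8080"
```

After opening a new shell, `echo "$http_proxy"` should print the URL back.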

In my brief bit of experimentation, the following important (to me, at least) command-line tools use http_proxy:

  1. wget
  2. Python easy_install
  3. curl

I’ve been a Unix user for 14 years. Why did it take me so long to figure this out?

How SSH knows how to get your keys

Feb 6th, 2009

Most people who work with SSH know that your user’s keys and config are in $HOME/.ssh. Or are they? A problem I had revealed that it isn’t so cut-and-dried.

I use Cygwin at work, because, well, I need to get work done. I was smart enough a long while back to make my user’s HOME directory my shared H-drive, as opposed to somewhere on the C-drive. That way, if I got a new machine, etc., my configs would be safe.

Well, the moment arrived to see if that worked. I got a new hard drive at work a month or so ago and one of the first things I did was install Cygwin. Everything worked great, except my SSH keys. For some reason, Cygwin’s SSH refused to find my private keys, even though I had never moved them. This week I finally dug in to figure out what happened.

I ran ssh -v host and saw this:

debug1: Trying private key: /cygdrive/c/.ssh/identity

Huh? My HOME directory is set to H (/cygdrive/h/, in Cygwin-speak). I mucked around again and it was still looking in the wrong spot. A few Google searches later revealed that OpenSSH doesn’t look at the $HOME environment variable at all! Instead it goes by the home directory for the user in /etc/passwd. I opened that up and, sure enough, it was set to “/cygdrive/c/”. I changed it, saved the file, and then it worked.
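A quick way to spot this kind of mismatch is to compare $HOME with the home directory recorded in /etc/passwd. The sketch below assumes a classic colon-separated passwd file, where the home directory is the sixth field; passwd_home and check_home are names I made up for the example.

```shell
# passwd_home and check_home are hypothetical helper names.
# Print the home directory recorded for a user in a passwd-format file.
passwd_home() {
    user=$1
    file=${2:-/etc/passwd}
    # Fields are colon-separated; the home directory is field 6.
    awk -F: -v u="$user" '$1 == u { print $6; exit }' "$file"
}

# Report whether the recorded home directory matches $HOME.
check_home() {
    recorded=$(passwd_home "$1" "${2:-/etc/passwd}")
    if [ "$recorded" != "$HOME" ]; then
        echo "mismatch: passwd says '$recorded' but HOME is '$HOME'"
        return 1
    fi
    echo "ok: $recorded"
}
```

Running `check_home "$(id -un)"` on my setup would have printed the mismatch between /cygdrive/c/ and /cygdrive/h/ right away.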

On a normal Unix system, this works because, chances are, the two will be the same. But I guess when I installed Cygwin I didn’t have a HOME variable set, so it defaulted to the C-drive. Then when I set HOME to the H-drive, the shell just happily used that and let SSH break.