Where Are The Wise Men?

Mike's Ramblings

The Many Roads Of PDF Processing

| Comments

The Easy Path

So you have a PDF, or a bunch of PDFs, and want to extract the text out of them? A few years ago, this would have been a horrible task, but life has gotten easier since then.

If your PDF is just filled with text, this becomes really easy:

 pdftotext pdfname.pdf

You can find pdftotext for most operating systems.

How do you know that it's just text? If you open it up in Acrobat/Preview/XPDF/etc. and can highlight the text, then pdftotext should work fine.

But if you can't do that, then what the author probably did was make an image and embed it in a PDF file. You then have to use OCR, which gives you output that isn't always right. A Google-sponsored tool called [tesseract][] does a good job with this OCR work. I remember that it used to stink, but it doesn't anymore. Simply:

 tesseract pdfname.pdf textpat

That will try to do an OCR scan of pdfname.pdf and save each page into a file called textpat.txt.

But, of course, the path isn't always easy.

The Long and Winding Road

At work I was recently handed a pile of scanned documents whose contents have to be typed in. Lucky me. We have a scanner on-site, and when I asked if it does OCR, I was told that it doesn't. My luck just keeps getting better.

But I've parsed PDFs before. I should be able to handle it.

I scanned in a few and had the PDFs sent to me. I installed tesseract via Homebrew. The results were... disappointing:

 $ tesseract pdfname.pdf out
 Tesseract Open Source OCR Engine v3.01 with Leptonica
 Unsupported image type.

So a quick google showed that either tesseract doesn't have the right libraries installed, or the PDF wasn't well-formed. Since tesseract told me it found [Leptonica][], I have to assume the proper libraries are there. So our scanner is making improper PDFs. This is just great.

After some googling and head scratching, I discovered that tesseract works very well on Tiff files. I used Preview to export the PDF to a Tiff and -- success!

 $ tesseract pdfname.tiff out
 Tesseract Open Source OCR Engine v3.01 with Leptonica
 Page 0
 Page 1
 Page 2
 $ ls  out*
 out.txt

Ok, I didn't want to open all of these files in Preview. How do I convert them from the command line? Well, the first tool to think of is convert from ImageMagick. That has always been a tricky road for me and, sure enough, the resulting Tiff file had horrid resolution, which made tesseract spit out garbage. I searched some more, even for OSX-specific solutions. I found sips, which comes with OSX, though most people haven't heard of it. [The usage is a bit arcane][] but it uses the OSX imaging libraries (i.e. the same thing my Preview export used). And, yes, it worked great out of the box -- except that it doesn't handle multi-page PDFs. Ugh.
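For reference, a sips invocation along these lines does the conversion (the file names are examples; I've guarded it since sips only ships with OSX):

```shell
# Sketch of a PDF -> Tiff conversion with sips (OSX only).
# File names are illustrative; skipped where sips doesn't exist.
if command -v sips >/dev/null 2>&1; then
    # sips uses the same OSX imaging libraries as Preview's export,
    # but it only converts the first page of a multi-page PDF
    sips -s format tiff pdfname.pdf --out pdfname.tiff
else
    echo "sips is OSX-only; use Ghostscript instead"
fi
```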

How does one break up a PDF into pages? More googling, and I found [pdftk][] which is a little swiss army knife of PDF processing. And, hey, it can break a PDF into pages with the burst option! Or, maybe not:

 $  pdftk pdfname.pdf burst
 Unhandled Java Exception:
 java.lang.NullPointerException
 at com.lowagie.text.pdf.PdfCopy.copyIndirect(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyObject(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyDictionary(pdftk)

That's not good. A few searches showed someone else with the same problem. The cause? A bad PDF, of course -- the very thing that started me down this path! I could still extract the PDF a page at a time... but that felt like a hack to me.
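For completeness, that page-at-a-time fallback would look something like this sketch (the file name and the three-page count are made up for illustration):

```shell
# Hypothetical page-at-a-time extraction with pdftk, avoiding the
# burst operation that crashes on a malformed PDF. The file name and
# page count are examples; skipped where pdftk isn't installed.
if command -v pdftk >/dev/null 2>&1; then
    for page in 1 2 3; do
        pdftk pdfname.pdf cat "$page" output "page_$page.pdf"
    done
else
    echo "pdftk not installed; skipping"
fi
```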

Ok, time to refocus. I thought, "What am I trying to accomplish?" And that was converting the broken PDFs to Tiffs so I can run tesseract. So let's focus back on the PDF->Tiff part. I did more searching and found [a StackOverflow entry that talked about the problem I had with ImageMagick and tesseract][], where someone posted a nice recipe for using Ghostscript:

 /usr/local/bin/gs -o out.tif -sDEVICE=tiffgray -r720x720 \
     -g6120x7920 -sCompression=lzw in.pdf

And I got a Tiff file out that tesseract could process wonderfully! Woot! The bad part was that tesseract took a long time to process this Tiff -- much longer than the one from Preview. Most of that processing time was spent on the first page of my PDF, which is essentially a cover page. How do I get rid of that cover page? Well, back to pdftk:

 pdftk pdfname.pdf cat 2-end output nocover.pdf

So that makes another PDF from the second page on (these PDFs have a variable number of pages).

Running the PDF->Tiff conversion on nocover.pdf gave some errors, but when I ran tesseract on the resulting Tiff file I had no problems.

Just for fun, I ran tesseract directly on the nocover.pdf that pdftk created -- same "Unsupported image type" error as the first time. I figured as much, but it was worth a shot.

So, in the end, I wrote a shell script that takes a PDF as a parameter and does this:

# strip the extension so we don't end up with names like foo.pdf.pdf
name=$(basename "$1" .pdf)

pdf=nocover/$name.pdf
tiff=tiffs/$name.tiff
text=extracted/$name

mkdir -p nocover tiffs extracted

# drop the cover page, convert the rest to a Tiff, then OCR it
pdftk "$1" cat 2-end output "$pdf"
/usr/local/bin/gs -o "$tiff" -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw "$pdf"
tesseract "$tiff" "$text"
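One detail worth calling out: `basename` takes an optional suffix argument that strips the extension, which keeps the intermediate files from picking up doubled names like pdfname.pdf.pdf:

```shell
# basename's second argument strips a trailing suffix, so the
# intermediate file names don't double up the .pdf extension
name=$(basename "scans/pdfname.pdf" .pdf)
echo "$name"              # pdfname
echo "nocover/$name.pdf"  # nocover/pdfname.pdf
```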

And that, my dear readers, is how to put a PDF through an OCR process.


The Road to Scala

| Comments

To be honest, [Scala][] has been on my periphery for some time now. I had heard of it before, but the first real mention I actually remember was a talk [Ted Neward][] gave at No Fluff one year. I couldn't go to that talk, but I remember him talking about it a few times in the other talks he did that weekend.

Fast-forward to 2010. When I went to [Strange Loop][], there was some buzz about Scala. Of course, Scala was kinda mainstream for Strange Loop by then, so there wasn't that much talk on it, but there was buzz. And of course I ignored it.

So, with all that, this is what I knew about Scala:

  • It's statically-typed. Since Python has been my first love, I really can't get into static typing. I see the benefits, but writing code in statically-typed languages feels pedantic to me.
  • It runs on the JVM. I already have Jython as my JVM-alternative of choice.
  • It's kinda functional and kinda OOP. OK, Python is also like that, but that idea weirded me out.

Then we fast-forward to just a couple of months ago. I read [this excellent blog post][] and thought he was spot on when talking about the perils of the modern-day software developer. I honestly know nothing else about Michael Church, but he was spot on in the second part, so how right was he about the first part -- the list of languages?

I already know Python and C. And, OK, I don't know ML and Clojure, but I know what their general ideas are. And then there was Scala again. It was this thought that got my attention:

I think Scala is the language that will salvage the 5 percent of object-oriented programming that is actually useful and interesting, while providing such powerful functional features that the remaining 95% can be sloughed away. The salvage project in which a generation of elite programmers selects what works from a variety of programming styles -- functional, object-oriented, actor-driven, imperative -- and discards what doesn't work, is going to happen in Scala. So this is a great opportunity to see first-hand what works in language design and what doesn't.

And I'm all for that -- there are some good parts of OOP, but a lot of it has become painful. All the styles Church listed have some merits as well as downsides. If you can actually do all of them, then the cream of each style should rise to the top.

Another one of his thoughts that grabbed me was:

[Scala] has an incredible amount of depth in its type system, which attempts to unify the philosophies of ML and Java and (in my opinion) does a damn impressive job.

An incredible type system? In a static language? I have yet to see such a beast. OK, the only statically-typed languages I have used are Pascal, C, and Java, and not one of them is good.

So, not to lengthen this any more, I decided to dip more than my toe into the Scala waters and see what all this hype was about. After mucking with it off and on for about a week, I have to say that I'm impressed. I haven't had this much fun discovering a language since I started banging on Python over 10 years ago.

I'm far from a journeyman in Scala, but I'm getting up to speed on it rather quickly. When I learn something, I need to be a do'er, not a reader. I've been using [Scala Koans][] to play with. It uses [SBT][] to continuously run the tests, which is very cool. When I get to the point of mucking around a little deeper, I use [Scala Test][] with SBT to give me the same continuous feedback.

I recently did [Osherove's String Calculator kata][] to Step 6 in 30 minutes, without any Googling or even too much fumbling. That says something about how easy it can be to get started creating code that actually does something.

Here are some things I have learned to love in Scala:

  • [Pattern Matchers][]. This is probably my favorite. Now that I have grokked them, I may never want to write a parser in anything but Scala ever again. I should also state that I avoid switch-case statements of any kind in any other language, but the structure works really well in Scala's pattern matching. When you use them with regular expression groups, magic happens.
  • [Case Classes][]. They handle a lot of the boilerplate of making objects for you, and you get a sane equals to boot. And, as the link says, they go nicely with pattern matchers.
  • The static type system does make sense, and does not annoy me. Look at an expression like `val negatives = numbers.filter(_ < 0)`. What is numbers? Well, since we are filtering it, it must be a collection of some sort. Is it a List or is it an Array? And what is negatives? Well, since we are using filter, it must be the same kind of collection that numbers is. But my favorite part is this: it doesn't matter. I know how negatives should behave, because it should behave just like numbers does. This makes sense to me, so much so that a type declaration for negatives becomes superfluous (hello, Java ...)

Now there are things that have annoyed me in Scala. But I'm a beginner, so I think some of those things will iron themselves out. I've already been coming up with web app ideas that I can start writing in [Lift][], which probably says something about how I feel about learning it.

Slicing Some Python With Emacs

| Comments

I have a new job, and it's quite probable that I will be doing Python for a lot of it. Which suits me just fine.

. . . except that I've been out of the loop for a while. Sure, I have written some Python in the past five years, and [some of it has been substantial][], but I still feel out of the loop. Most of my simple scripts have been done in good ol' Emacs, and bigger projects have been done in [Intellij IDEA][] with [their amazing Python plugin.][]

As I started at the new digs, I installed Intellij on my shiny MacBook Pro, turned on Emacs mode . . . and was underwhelmed. I forgot that Emacs mode in Intellij on the Mac leaves a lot to be desired -- C-Del for Cut, Alt-P for Paste? Ugh. A quick search showed that [I'm not the only one complaining, but it's not fixable.][]

I thought about Emacs and what I would miss about running things in Intellij IDEA. The biggies were:

  • Syntax checking
  • Running unit tests
  • Auto-refactoring (Extract Variable, Method, etc)

These are the things that are supposed to separate an IDE from a text editor. However, Emacs is an elegant weapon from a more civilized age. So the hunt was on to see what others had done while I was in my hibernation from Python.

I've tried to use the [Rope library][] in the past and found it hard to set up. But I did note that it's still actively developed, so I tried to find some example configs to steal borrow from. That's when I found Gabriele Lanaro's excellent [emacs-for-python][] collection. It included Rope, [YA Snippet][], and other goodies, all configured to work together in harmony.

I forked it, cloned it, and had a few problems, so I fixed them and Gabriele merged them back in. It still didn't have unit test support, but I found [nosemacs][], which runs [Nose][] on the Python unit tests.

In searching for something else, I stumbled into [virtualenvwrapper][], a set of helpers around the most excellent [virtualenv][] utility, which creates a clean environment for Python development. These are used in emacs-for-python, so I put it in as well. [I then stumbled into this post,][] which explains how to use the hooks in virtualenvwrapper to control Emacs. Woot!
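For anyone following along, the basic clean-room flow looks like this sketch (the project name is made up, and it assumes virtualenvwrapper.sh has been sourced into the shell):

```shell
# Sketch of the virtualenv/virtualenvwrapper workflow. Guarded
# because virtualenvwrapper provides shell functions that only exist
# once its script has been sourced; "myproject" is an example name.
if command -v mkvirtualenv >/dev/null 2>&1; then
    mkvirtualenv myproject   # create an isolated Python environment
    workon myproject         # activate it (the hooks can point Emacs here)
    deactivate               # step back out of the clean room
else
    echo "virtualenvwrapper is not set up in this shell"
fi
```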

So now my workflow is like this:

  • Type `workon something`, which puts my prompt in my "clean room" Python environment for the project. My Emacs also switches to that environment, including using that version of the Python interpreter.
  • In Emacs, type `C-c m`, which runs and reports on all the unit tests in my current module.
  • In Emacs, type `C-c r` to extract a new variable. Other commands exist for extracting a class, method, etc.
  • Type `deactivate` and my prompt moves out of my clean room, and my Emacs leaves it too.
  • When I go back to work on something, Emacs will remember the last buffers it worked on.

I put all these changes into my branch of emacs-for-python, and Gabriele has already pulled them in. They are available in HEAD on [emacs-for-python][].

Master Foo and Corporate IT

| Comments

An acolyte found Master Foo meditating in his garden, sitting under his favorite tree. The acolyte waited until he was acknowledged, which the Wise One did by asking what was troubling him.

"Why is it that Great Developers generally don't work for Corporate IT? And, when they do, they leave after a short amount of time and become contractors or work for a software-only shop?

Master Foo leaned against the tree, thought for a minute, and began to speak:

"Corporate IT is like a large city that had a great amount of cars. Everyone was upset with how bad the traffic was -- cars drove too fast, people were injured, it was horrible.

"A man worked for the city leaders and he was really good at solving problems. So the mayor put this man in charge of fixing the traffic problem. The mayor told everyone do to what the new Head of Traffic said, without question or hesitation.

"The man did not even own a car; he had only ridden in one a few times. No one asked him if he knew anything about cars. But the money was good, so he took the job. He would tackle it like he did all the other problems -- by looking at data.

"As he looked at the data and reports that people gave him, the Head of Traffic noticed that a lot of accidents happened at stoplights. Not only did people get hurt in those accidents, but they delayed traffic. So the problem was simple -- cars needed to stop when the lights were green as well as when the lights were red.

"And the people that worked in the Traffic Department were perplexed by this but they were told to obey without question. So they posted the rule and told the police to enforce it. The police were also perplexed but they were also told to do what Head of Traffic said, so they started issuing tickets when drivers went through green lights.

"Obviously the people in the city were angry about this -- it took even longer to get from place to place! But, as more and more people stopped at the green lights as well as the red ones, they did notice that it was safer. So the citizens stopped complaining and just left much, much earlier to get to their destination.

"Of course, there are always people that have to get around the rules. Some started walking, and that was faster most of the time, so more people started doing that. And, soon, accidents were happening there. So the Head of Traffic was called on again, to address the problem. He first said no one was allowed to walk along the street, so more people rode bicycles. And the there were more bike accidents, so soon the law was to not ride bicycles anymore. Then people started on unicycles and laws were made there. It just went on and on and on.

Master Foo then asked, "Who has the most guilt in this city?"

The acolyte quickly answered, "It is the Head of Traffic, of course."

Master Foo said, "He actually has the least amount of guilt. It's not his fault he was given a job he knows nothing about."

Upon hearing this, the acolyte was enlightened.

My Holy Grail of Content Delivery

| Comments

I've been on a quest for a long time to figure out the best way to write and publish documents. It has been a quest that has taken me years but I finally have a system that I am extremely happy with.

What I wanted was to be able to write in Emacs (my editor of choice) and output to anything else. Want my document in HTML? You can have it that way. Want a PDF? Yes, I can do that. What about Word? Though I can't stand the application, and I don't actually own it, yes, I would like to be able to output Word docs without actually opening another application. I just write it in Emacs, run a process on it, and then I have a Word document!

This has been nothing but a pipedream for a long time. At the beginning I tried to realize this idea with [LaTeX][], and I could get the PDF output to look outstanding. But the HTML output took a lot of work to get right, and I never got [latex2rtf][] working well enough that I could send the document anywhere else. So I had to figure out what else to use.

Then I started playing with [Markdown][]. I really liked the easy format, but it really only outputs HTML. With a bit of work, I could get a PDF (html2pdf or something like that), but something to load up in Word? Forget it! And even the PDF looked kinda bad. The same thing with Textile and reStructuredText -- HTML and that was about it. I did prefer the Markdown format over LaTeX because of its simplicity, but I still hadn't found my Holy Grail.

Just a few months ago I somehow stumbled onto [Pandoc][]. I think it was on Google+, in someone's random post. I was floored when I read about it: take a file in format X, run Pandoc on it, and get format Y, with many choices for both X and Y. This seemed like just what I was looking for!

But did it work as advertised? Yes, it did! It understands everything about Markdown that I currently throw at it. To get the PDF conversion working, it uses LaTeX as an intermediary, so you have to have LaTeX installed. But Word? Not directly -- but it does have RTF support, which is even better (since it's more portable). It also does the [ODT][] format, which means I can open a document up in LibreOffice and tweak the presentation if need be. The ODT output is better than the RTF output, in my humble opinion.

The biggest surprise I got was that it also converts to [S5][] -- so I can write presentations in Emacs/Markdown and present with just a browser. I have done this and it works amazingly well.
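To make this concrete, the invocations are along these lines (the file names are examples; pandoc infers the output format from the extension, and the PDF route, `pandoc doc.md -o doc.pdf`, additionally needs LaTeX installed, so it's left out of this runnable sketch):

```shell
# Example pandoc conversions from one Markdown source.
# File names are illustrative; skipped where pandoc isn't installed.
printf '# Hello\n\nSome *Markdown* text.\n' > doc.md
if command -v pandoc >/dev/null 2>&1; then
    pandoc doc.md -o doc.html          # HTML
    pandoc doc.md -o doc.rtf           # RTF, loadable by Word
    pandoc doc.md -o doc.odt           # ODT for LibreOffice
    pandoc -t s5 -s doc.md -o s5.html  # standalone S5 slideshow
else
    echo "pandoc not installed; skipping"
fi
```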

One thing I haven't tried yet is that it also outputs to ePub. If only it did the closed Mobi format for my Kindle, too.

So, yes, if you are looking for a "write-once, publish to anything" scheme, you can't do any better than [Pandoc][].