Where Are The Wise Men?

Mike's Ramblings

My Personal Cloud


I've been having an ongoing conversation with a good friend of mine for a long time about email management. We want to keep our emails so we can find them later, and we either don't trust or don't like the interfaces that Gmail and other free providers give us. In that vein, early this spring, he sent me a link to the Sovereign project on GitHub, which is, essentially, your own cloud server with storage, email, et al.

What he didn't know at the time was that I was lamenting all the cloud services I was using (and spending bits of money on) while still not feeling in control of what was where. Add to that the fact that I've never really trusted a cloud storage service like Dropbox, Box.net, or GDrive with my files. Yes, I had accounts with all of them and had files in all of them, but I never really dedicated anything to any of them, i.e. paid them money.

So I tried it out -- I got a cheap VPS and ran the Ansible scripts, taking out the things I didn't need (like, I haven't been on IRC for years!). As time went on I tweaked things and simplified my digital services by replacing them with things on the server.

This is what I use:

ownCloud

This is really the most important item on this list, and probably the most surprising one to me. I have learned to like cloud storage, now that I feel like I have some control (and, honestly, some space to actually hold things). I'm using it to hold the ebooks I store in Calibre. I still don't have room to put my music on there, but I'm toying with the idea of upgrading the VPS so I have the room. I have all sorts of documents there for work. The Android app is great for most things -- it doesn't keep an entire folder in sync but, luckily, ownCloud implements WebDAV so I can use existing tools like FolderSync. ownCloud also stores contacts via CardDAV so I can keep my own copy of my contacts.

This personal cloud storage is great, but I also use the syncing abilities to keep track of other things. So Todo.txt has replaced Remember The Milk for tasks and KeePass2 has replaced LastPass. The last one was hard for me to let go of, but I'm actually happier now with KeePass2.

On the Android side, the official Todo.txt Android app uses Dropbox for syncing, but with a little research I found Simpletask Cloudless and Keepass2Android, and I use the aforementioned FolderSync to keep the files in sync. At most they are an hour behind, which is fine for me.

gitolite

This wasn't in the original version of Sovereign that I installed, but I'm glad that it was included. I do like having my own private git repo that I don't have to pay GitHub for. I disabled cgit mostly because I didn't see the need for it, for just me. Though I have included others in a couple of my projects.

selfoss

When the Google Reader shutdown happened and I looked for self-hosted alternatives, selfoss didn't pop up in my searches. But I'm glad that it was included with this! It's very clean and lets me scan the articles from my feeds quickly. The mobile version (through my mobile browser) is actually better than the full-screen one, IMHO.

wallabag

Wallabag is another newcomer to Sovereign but I like it a lot -- it has replaced Pocket for me. It works just the same as Pocket, but with the added feature of saving articles to epub. I haven't actually done that yet but I fancy putting together a cookbook with it. Wouldn't that be cool?

There are other things I like -- using OpenVPN when I'm on an open wifi connection, automatic backup to Tarsnap, and just having an SSH prompt at my disposal when I need it.

To Begin Anew


I've been wanting to get off my provider for a long time. They were one of the first webhosts that were ultra-cheap. And I could put up a WordPress blog with no problem. It was 2006 . . . what could go wrong? I mean, they didn't have sftp but only clear-text ftp, but they wouldn't have that forever, right?

Well . . . it's 2014. The WordPress site is usually very slow. They still do not have sftp ("it would take too much server power to support sftp"). And the email system only holds 250MB per account. Not per address -- 10 addresses would share the 250MB. Crazy. Insane.

The WordPress site was so bad that I didn't want to write in it again . . . which is bad for someone who likes to write. And I finally have some technical stuff I want to write about again. Like code snippets, which I have never been able to get right in my WordPress blog.

And there are some neat things happening in the static generator world. I've looked at (and have used) a few. But since I've been using Groovy lately I decided to eat my own dogfood. So this site is being generated by a Groovy-powered generator called Grain. I honestly grabbed a template that I liked and started writing. It has SASS/Compass support built in and also uses Python (for Pygments) inside the page generation. Yeah, seems like a kitchen sink. But I can update the Sass files and I don't have to run a script or anything to process them -- it happens automagically. That's pretty cool.

I plan on writing a bit more about stuff I've discovered in Groovy, my dive back into the Java world, and how I (imperfectly) converted all my old blog content into this Grain blog.

Hoss Dreams of Software


The other night we watched Jiro Dreams Of Sushi, which I highly recommend everyone watch, even if you don’t like sushi. Even if you don’t know anything about sushi. Because it’s not about sushi -- it’s about Jiro, an artist who is obsessed with quality and with his craft. And his craft is making sushi.

Jiro Ono is 85 years old and owns a nondescript sushi restaurant in Tokyo. His restaurant only has 10 seats, but it costs $300 per seat and you have to make your reservations at least a month in advance. Oh, and it is a 3-star Michelin rated restaurant. Jiro is, in fact, the oldest chef to be awarded a 3-star Michelin award. The restaurant reviewer interviewed in the film said, many times, that Jiro’s sushi is consistently the best he’s ever had. It’s always the best -- there was never a time when it was a bit worse than another. And that is an astounding review. This all has to do with Jiro, who has committed his entire life to making sushi. Meaning, he’s been at this since he was 14 years old. He’s at his restaurant every day, overseeing the preparation of the fish, rice, eggs, etc. He will quickly give a criticism when he sees or tastes something below his exacting standards -- and that includes the work of his own 50-year-old son, who works there. Jiro keeps a close eye on his customers, noticing if they are left-handed (he puts the sushi in a different place on the plate if they are) as well as making slightly smaller pieces for women. He also admits that when his restaurant is closed on state holidays, he doesn’t know what to do with himself.

I’ve been saying for years that cooking is a lot like programming software, and I thought many times about that through this film. Jiro said that, if you want to be the best chef, you can never be satisfied, you must always strive to be better, and you have to love it. These traits, to me, are the same as what makes a great developer. You always have to be learning and striving to make things better, and you have to love the work. I think the last item is the most important -- writing software is hard and takes a certain kind of dedication, nerves, and brain work that, frankly, not everyone is cut out for.

But if you decide that you like this kind of work, then you dedicate your life to it. And, if you want to dedicate your life to it, then you should be constantly looking for ways to get better. Back to Jiro . . . he has been making sushi for 70 years. 70 years! And he is always looking for ways to get better. Not necessarily One Big Thing that will change sushi forever, but little increments, like the kind of rice to use, the temperature of the rice when the sushi is made and served, massaging the octopus for a longer time to bring that much more flavor out of it, finding the best fishmongers to buy from . . . the list goes on and on.

I think most software developers (including myself) want to find the silver bullet, the one thing that will make us all better. But, alas, it doesn’t exist. There is no one methodology to follow, no one language to use, no One True Editor or IDE that solves all the problems. We have to get better a bit at a time.

Really, what I am talking about comes back to craftsmanship. We want to write great software and, after we do that, we want to do it again, but better this time. Never going back, but always improving. Uncle Bob already wrote a great summary of what this looks like so I will just close with telling you to read that. And get started on your personal improvement.

Add a Read-Only Role to Django Admin


I was in a meeting where I was asked to give someone read-only access to the Admin part of our application. That was fine -- it was written in Django and Django has really fantastic Admin functionality. So I assumed that it could handle it, no problem. So I said yes.

Of course, after a little googling, I found that it doesn't support this at all -- you can only give people Add, Change, or Delete permissions. You can make individual fields read-only but, in an ideal world, I needed a whole object to be read-only or not, preferably determined by Group membership.

My searches didn't give me a lot of hope, but I did find something close in this post. So I expanded it to look for a Group.

So you inherit from ReadOnlyAdmin instead of ModelAdmin for every Admin object you want to make read-only. Then you also have to add these two properties:

  • user_readonly - list of the fields to be read-only. If you don't put a field in there, the user will be able to change it!
  • user_readonly_inlines - If you have a related Model that you want to display Inline, then you can't add it to user_readonly because it's not part of the Model. You have to create a read-only InlineAdmin object and list that here.

Creating a read-only Admin object is simple:

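from django.contrib import admin

from .models import MyModel  # assumed location of the model


class MyModelInline(admin.StackedInline):
    model = MyModel


class MyModelReadOnlyInline(MyModelInline):
    # Every field listed here is rendered as plain text.
    readonly_fields = ["label"]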

Then you just list MyModelReadOnlyInline in the user_readonly_inlines and MyModelInline in inlines.

To use the ReadOnlyAdmin:

  • Create an Admin Group called readonly.
  • Add the User to readonly and give them full access to the Models you want them to read -- yes, give them Add, Change, etc. Otherwise they can't view them at all.

When the user logs in, they will see the Model and can open individual objects, but none of the fields will be in form fields -- just straight text.
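The ReadOnlyAdmin class itself isn't reproduced in this post, so what follows is a minimal sketch of the idea rather than the original code. It assumes the logic is written as a mixin that goes in front of admin.ModelAdmin, that Django 3.0 or later is in use (for get_inlines), and that the group is named readonly. The name ReadOnlyGroupMixin is made up for illustration.

```python
class ReadOnlyGroupMixin:
    """Hypothetical mixin; list it before admin.ModelAdmin in the bases."""

    readonly_group = "readonly"   # name of the Group created in the Admin
    user_readonly = ()            # fields shown as plain text to that group
    user_readonly_inlines = ()    # read-only inline classes for that group

    def _in_readonly_group(self, request):
        # True when the logged-in user belongs to the read-only group.
        return request.user.groups.filter(name=self.readonly_group).exists()

    def get_readonly_fields(self, request, obj=None):
        base = tuple(super().get_readonly_fields(request, obj))
        if self._in_readonly_group(request):
            # Add the group's read-only fields, skipping duplicates.
            extra = tuple(f for f in self.user_readonly if f not in base)
            return base + extra
        return base

    def get_inlines(self, request, obj):
        if self._in_readonly_group(request):
            # Swap the editable inlines for their read-only versions.
            return list(self.user_readonly_inlines)
        return super().get_inlines(request, obj)
```

With something like this in place, an admin would be declared as class MyModelAdmin(ReadOnlyGroupMixin, admin.ModelAdmin), with user_readonly and user_readonly_inlines filled in as described above.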

The Many Roads Of PDF Processing


The Easy Path

So you have a PDF, or a bunch of PDFs, and want to extract the text out of them? A few years ago, this would have been a horrible task, but life has gotten easier since then.

If your PDF is just filled with text, this becomes really easy:

 pdftotext pdfname.pdf

You can find pdftotext for most operating systems.

How do you know that it's just text? If you open it up in Acrobat/Preview/XPDF/etc and can highlight the text, then pdftotext should work fine.

But if you can't do that, then what the author probably did was make an image and embed it in a PDF file. You then have to use OCR, which can give you output that isn't always right. A Google-sponsored tool called tesseract does a good job with this OCR stuff. I remember that it used to stink, but it doesn't anymore. Simply:
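If there is a whole pile of PDFs, the highlight test can be roughly automated: run pdftotext and see whether anything other than whitespace comes back. This is only a sketch. The function names and the min_chars threshold are assumptions made here, and it relies on pdftotext writing to stdout when the output file is given as "-".

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return its output; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_text_layer(text, min_chars=20):
    """Guess whether pdftotext found real text.

    min_chars is an arbitrary threshold chosen for this sketch; an
    image-only PDF typically yields little more than page breaks.
    """
    # Drop all whitespace (including form feeds) before counting.
    return len("".join(text.split())) >= min_chars
```

Calling has_text_layer(extract_text("pdfname.pdf")) then gives a first guess at whether the easy path applies or OCR is needed.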

tesseract pdfname.pdf textpat

That will try to do an OCR scan of pdfname.pdf and save the text from every page into a file called textpat.txt.

But, of course, the path isn't always easy.

The Long and Winding Road

which have to be typed in. Lucky me. We have a scanner on-site and I asked if it does OCR, and I was told that it doesn't. I'm even getting luckier.

But I've parsed PDFs before. I should be able to handle it.

I scanned in a few and had the PDFs sent to me. I installed tesseract via Homebrew. The results were. . . disappointing:

$ tesseract pdfname.pdf out
Tesseract Open Source OCR Engine v3.01 with Leptonica
Unsupported image type.

So a quick google shows that either tesseract doesn't have the right libraries installed, or the PDF wasn't well-formed. Since tesseract told me it found Leptonica, I have to assume the proper libraries are there. So our scanner is making improper PDFs. This is great.

After some googling and head scratching, I discovered that tesseract works very well on Tiff files. I used Preview to export the PDF to a Tiff and -- success!

 $ tesseract pdfname.tiff out
 Tesseract Open Source OCR Engine v3.01 with Leptonica
 Page 0
 Page 1
 Page 2
 $ ls  out*
 out.txt

Ok, I didn't want to open all of these files in Preview. How to convert them from the command line? Well, the first tool to think of is convert from ImageMagick. That has always been a tricky road for me and, sure enough, the resulting Tiff file had horrid resolution. That made tesseract spit out garbage. I searched some more, even for OSX-specific solutions. I found sips, which comes with OSX but which most people haven't heard of. The usage is a bit arcane but it uses the OSX libraries (i.e. the same thing my Preview export used). And, yes, it worked great out of the box -- except that it doesn't handle multi-page PDFs. Ugh.

How does one break up a PDF into pages? More googling, and I found pdftk, which is a little swiss army knife of PDF processing. And, hey, it can break a PDF into pages with the burst option! Or, maybe not:

 $  pdftk pdfname.pdf burst
 Unhandled Java Exception:
 java.lang.NullPointerException
 at com.lowagie.text.pdf.PdfCopy.copyIndirect(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyObject(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyDictionary(pdftk)

That's not good. A few searches showed someone else with that same problem. The cause? A bad PDF, of course! The very thing that started me down this path! I could extract the PDF a page at a time . . . but that seemed like a bad idea to me.

Ok, time to refocus. I thought, "What am I trying to accomplish?" And that was converting the broken PDFs to Tiffs so I can run tesseract. So let's focus back on the PDF->Tiff part. I did more searching and found a StackOverflow entry that talked about the problem I had with ImageMagick and tesseract, where someone posted a nice recipe for using Ghostscript:

 /usr/local/bin/gs -o out.tif -sDEVICE=tiffgray -r720x720 \
 -g6120x7920 -sCompression=lzw in.pdf
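The two numbers in that recipe go together: -r is the resolution in dots per inch and -g is the output size in pixels, so 6120x7920 at 720 dpi is an 8.5 x 11 inch (US Letter) page. Here is a small sketch of that arithmetic, in case a different page size or resolution is needed. The function name is made up for this example.

```python
def gs_size_flags(width_in, height_in, dpi):
    """Return the Ghostscript -r and -g flags for a page size in inches."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return [f"-r{dpi}x{dpi}", f"-g{width_px}x{height_px}"]


# US Letter at 720 dpi reproduces the flags from the recipe.
print(gs_size_flags(8.5, 11, 720))  # ['-r720x720', '-g6120x7920']
```

Dropping to a lower resolution shrinks the Tiff, and with it the time tesseract spends on each page, though how low it can go before the OCR quality suffers would have to be found by trial.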

And I got a Tiff file out that tesseract could process wonderfully! Woot! The bad part was that tesseract took a long time to process this Tiff -- much longer than the one from Preview. Most of that processing time was spent on the first page of my PDF, which is essentially a cover page. How do I get rid of that cover page? Well, back to pdftk:

 pdftk pdfname.pdf cat 2-end output nocover.pdf

So that makes another PDF from the second page on (these PDFs have a variable number of pages).

Running the PDF->Tiff conversion on nocover.pdf gave some errors. But then I ran tesseract on the resulting Tiff file and I had no problems.

Just for fun, I ran tesseract on the nocover.pdf that pdftk created -- same error as the first time. I figured as much but it was worth a shot.

So, in the end, I wrote a shell script that takes a PDF as a parameter and does this:

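# Strip the directory and the .pdf extension: scans/report.pdf -> report
name=`basename "$1" .pdf`

# The nocover/, tiffs/ and extracted/ directories must already exist.
pdf=nocover/$name.pdf
tiff=tiffs/$name.tiff
text=extracted/$name

# Drop the cover page, render to Tiff, then OCR (tesseract adds .txt itself).
pdftk "$1" cat 2-end output "$pdf"
/usr/local/bin/gs -o "$tiff" -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw "$pdf"
tesseract "$tiff" "$text"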
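For anyone who would rather drive the same three steps from Python, here is a rough equivalent of the script. The directory names and command-line flags are the ones used above. The function names and the default of a plain gs on the PATH (instead of /usr/local/bin/gs) are choices made for this sketch.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, gs="gs"):
    """Return the pdftk, Ghostscript and tesseract commands for one PDF."""
    name = Path(pdf_path).stem          # "scans/a.pdf" -> "a"
    nocover = f"nocover/{name}.pdf"
    tiff = f"tiffs/{name}.tiff"
    text = f"extracted/{name}"          # tesseract appends ".txt" itself
    return [
        ["pdftk", str(pdf_path), "cat", "2-end", "output", nocover],
        [gs, "-o", tiff, "-sDEVICE=tiffgray", "-r720x720",
         "-g6120x7920", "-sCompression=lzw", nocover],
        ["tesseract", tiff, text],
    ]


def ocr_pdf(pdf_path):
    """Run the three steps in order, stopping at the first failure."""
    for directory in ("nocover", "tiffs", "extracted"):
        Path(directory).mkdir(exist_ok=True)
    for command in build_commands(pdf_path):
        subprocess.run(command, check=True)
```

One caveat: check=True stops at the first non-zero exit code. Ghostscript complained about these PDFs and still produced a usable Tiff, so that check may need to be relaxed for the gs step.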

And that, my dear readers, is how to put a PDF through an OCR process.

[a StackOverflow entry that talked about the problem I had with