More Joys Of Python

I haven’t had a chance to do anything in Python lately.  But yesterday I needed to do some screen scraping and, after looking at the HTML I needed to scrape, I knew that I needed some good tools.

The HTML is a table in a form with several rows in it.  A cell in each row has hidden items representing that row’s data.  And, inexplicably, each row also has a couple of JavaScript functions embedded in it.  I did a File-Save on a page with ~150 rows in that table, which is large for what I am doing.  You wouldn’t believe the file size if I told you, so I’ll just copy-and-paste it in:

$ ls -lh file.htm
-rw-r--r-- 1 me mkgroup-l-d 2.9M Aug 31 10:39 file.htm

Yes, one HTML file with one table is a little shy of 3MB.  There are no images embedded in this file, just pure text.  This is why I needed to get out the Big Guns for this exercise.

The Big Gun for this is Beautiful Soup.  It can parse anything that might resemble HTML and give you what you need.  It is not speedy, but it works well.  And for the hunk of HTML I needed to scrape, correctness was more important than speed.

One thing I discovered about Beautiful Soup is that you can query with a regex.  Remember how I said that a cell in each row had hidden form fields in it?  The ID for each row was in a hidden input named “id0” for the first ID, “id1” for the second, and so on.  The customer column was similar: “CustName0”, “CustName1”, etc.  So, really, this was easy to find:

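import re
from BeautifulSoup import BeautifulSoup  # assumes Beautiful Soup 3, where findAll lives
# Find the hidden inputs whose names end in id<N> or CustName<N>
soup = BeautifulSoup(open(fname))
ids = soup.findAll("input", attrs={"name": re.compile(r"id\d+$")})
custs = soup.findAll("input", attrs={"name": re.compile(r"CustName\d+$")})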

What could have been a horrid parsing problem quickly became a three-liner.  Wow.
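
To see the name-matching idea in isolation, here is a minimal sketch that uses only the standard library’s html.parser instead of Beautiful Soup, so it will only cope with reasonably well-formed markup. The sample HTML, the HiddenFieldCollector class, the collect_rows helper, and the idea of pairing fields by their numeric suffix are all assumptions made for illustration; this is not the original script.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: two rows, each with hidden id/CustName inputs.
SAMPLE_HTML = """
<form><table>
<tr><td><input type="hidden" name="id0" value="1001">
        <input type="hidden" name="CustName0" value="Acme"></td></tr>
<tr><td><input type="hidden" name="id1" value="1002">
        <input type="hidden" name="CustName1" value="Globex"></td></tr>
</table></form>
"""

# Field name followed by the row number, e.g. "id0" or "CustName12".
FIELD_RE = re.compile(r"^(id|CustName)(\d+)$")


class HiddenFieldCollector(HTMLParser):
    """Collect <input> values whose name looks like id<N> or CustName<N>."""

    def __init__(self):
        super().__init__()
        self.rows = {}  # row index -> {"id": ..., "CustName": ...}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        match = FIELD_RE.match(attrs.get("name") or "")
        if match:
            field, index = match.group(1), int(match.group(2))
            self.rows.setdefault(index, {})[field] = attrs.get("value")


def collect_rows(html):
    """Return one dict per table row, ordered by the numeric suffix."""
    parser = HiddenFieldCollector()
    parser.feed(html)
    parser.close()
    return [parser.rows[i] for i in sorted(parser.rows)]


rows = collect_rows(SAMPLE_HTML)
for row in rows:
    print(row["id"], row["CustName"])
```

Grouping by the numeric suffix keeps each ID next to its customer name, which is handy when the next step is writing the data out to files.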

I needed to spit data out from this HTML into different files. Since I know what kind of a pain file management can be, I decided it would be great to put them in a Zip file.  I could have done a system call to zip but instead I used Python’s standard zipfile module.

I was storing the data files temporarily in a data directory, but I didn’t want to put that into the Zip file. Luckily that, too, was easy using the zipfile module and the non-standard yet wonderful path.py module:

  
zfile = ZipFile(zip_file,”w”)   
for f in fnames:       
    zfile.write(f,f.splitpath()[1])   
zfile.close()
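
The same trick also works without path.py. The sketch below is a rough standard-library equivalent that uses pathlib to strip the directory part; the zip_flat helper, the temporary directory, and the file names are hypothetical and exist only to make the example self-contained.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_flat(zip_path, file_paths):
    """Write each file into the archive under its bare file name."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for p in map(Path, file_paths):
            # arcname controls the name stored inside the archive,
            # so the "data" directory never shows up in the Zip file.
            zf.write(p, arcname=p.name)


# Hypothetical demo: two files in a temporary "data" directory.
workdir = Path(tempfile.mkdtemp())
data_dir = workdir / "data"
data_dir.mkdir()
for name in ("ids.txt", "custs.txt"):
    (data_dir / name).write_text("example\n")

archive = workdir / "out.zip"
zip_flat(archive, sorted(data_dir.iterdir()))

with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
print(names)
```

One caveat with flattening like this: two files with the same name in different directories would collide inside the archive, so it only suits cases where the file names are already unique.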

Coding in Python has been a refreshing change of pace, because I can use these wonderful modules to bend the data to my will and all I have to do is tell it how.  In Java, I have to worry about making sure the right kind of object is being passed and that things are casting right, etc., etc.  Although this was a fairly complex problem, it was easy in Python.

