FactFinder makes screen scraping easy

Dec 9th, 2005

I need to do some research for church on average household incomes in different neighborhoods. Luckily, the U.S. Census Bureau has all that information and distributes it in the public domain at American FactFinder. Of course, I have a list of ZIP codes to research and I don’t want to do this manually.

The FactFinder website makes automation pretty easy. For one thing, the ZIP code to look up is in the URL three times. So I pasted the URL into my text editor and replaced each ZIP code with “%s”. Now, to make the URL, I use these two lines:

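from urllib2 import urlopen  # assumed import (Python 2); the post does not show it

insertTup = (zipcode,) * 3  # the ZIP code appears three times in the URL
factinfo = urlopen(factfindURL % insertTup)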
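To make the "%s" trick concrete, here is a minimal sketch in present-day Python. The template below is a made-up placeholder, not the real FactFinder URL (which I haven't reproduced here), and build_url is just a name I picked for illustration. It counts the placeholders instead of hard-coding the 3.

```python
# Hypothetical template: NOT the real FactFinder URL, just three %s slots.
FACTFIND_URL = "https://example.invalid/factsheet?zip=%s&geo=%s&label=%s"


def build_url(zipcode, template=FACTFIND_URL):
    """Fill every %s placeholder in the template with the same ZIP code."""
    # A one-element tuple repeated once per placeholder, e.g. ("90210",) * 3.
    insert_tup = (zipcode,) * template.count("%s")
    return template % insert_tup


print(build_url("90210"))
```

One thing to watch for: if the copied URL contains literal percent signs (such as %20), they have to be doubled to %% before the % operator will accept the template.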

So now I have a URL for any ZIP code, but how do I parse it? Well, naturally, mix it into some BeautifulSoup:

soup = BeautifulSoup(factinfo.read())

Okay, so now how do I get to the data? If you view the source of a Fact Sheet page, you’ll see that the data table is well-labelled with ID names. For example, we want the median household income, which is in row 46, in the second column. Luckily for us, those names are right in the tag: we need to look for a td tag with a headers value of R46 C2. The content isn’t in the td tag directly, though, but in the p tag that is the td tag’s only child. BeautifulSoup makes this very easy:

income = soup("td",{'headers':'R46 C2'})[0].first().string.strip()
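If BeautifulSoup isn't installed, the same lookup can be roughed out with the standard library. This is only a sketch, written in present-day Python and using html.parser instead of BeautifulSoup. The CellText class, the cell_text helper, and the sample table (including the dollar figure) are all made up for illustration. Only the headers value R46 C2 comes from the real page.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text inside the first <td> whose headers attribute matches."""

    def __init__(self, headers):
        super().__init__()
        self.headers = headers
        self.inside = False   # True while we are inside the matching cell
        self.found = False    # True once a matching cell has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and not self.found and dict(attrs).get("headers") == self.headers:
            self.inside = True
            self.found = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        # Text of child tags (such as the <p>) arrives here as well.
        if self.inside:
            self.parts.append(data)


def cell_text(html, headers):
    """Return the stripped text of the matching cell, or None if there is none."""
    parser = CellText(headers)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() if parser.found else None


# Made-up sample markup and value, shaped like the table described above.
SAMPLE = """
<table>
  <tr>
    <td headers="R46 C1"><p>Median household income</p></td>
    <td headers="R46 C2"><p> 12,345 </p></td>
  </tr>
</table>
"""

print(cell_text(SAMPLE, "R46 C2"))
```

It is a lot more typing than the BeautifulSoup one-liner, which is rather the point of using a good parser.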

So there it is! When I approached this problem, I thought it would be quite difficult, but a good use of web standards and a good HTML parser made this almost trivial.
