Financial Python

Studies in Finance and Python


Parsing DTCC Part 1: PITA


In a previous post, I complained about the DTCC’s CDS data website and the one week lifespan of the data published there. For those of you who don’t know, the DTCC clears and settles a massive number of transactions every day for multiple asset classes. It’s one of those financial institutions that doesn’t get much press but underpins the entire capital market.

Anyway, the recent crisis motivated the DTCC to publish weekly CDS (single name, index, and tranche) exposure data. A good idea, until one realizes the data goes up in smoke when the next week’s data arrives. Although DTCC recently added links to data for “a week ago”, “a month ago”, and “a year ago,” it’s still pretty inconvenient. So, if you want the data, you have to parse it yourself. I originally wanted to write a smart parser that would dynamically react to whatever format it encountered…I came to my senses and adopted a simpler approach.

The approach thus far:

  • Download the raw html pages/files via curl. urllib2 is the usual way to pull web pages in Python, but I didn’t have the patience to figure out how to handle the site’s redirects with it. curl, a command-line utility included with OS X, follows the redirects without complaint (its -L flag handles them). As such, I created a short python script to download the html for all the tables of interest weekly.
  • Use BeautifulSoup to parse the html. Other libraries, such as html5lib and lxml, seem to be gaining ground on BeautifulSoup, particularly as its author wants to get out of the parsing game altogether. Nevertheless, I couldn’t be bothered to figure out the unicode issues I experienced with html5lib or lxml’s logic. BeautifulSoup is straightforward and “gives you unicode, dammit!” (quoting the author).
  • Use numpy for easier data manipulation. Since my html, css, DOM, etc. knowledge is basic, I thought it might be better to use numpy to manipulate the table data rather than rely solely on the parser. This meant vectorizing the html data into a 1D array, cleaning it up, and generally preparing it for future reshaping. Numpy, how did I ever live without you?
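The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the URL and filenames are hypothetical, and it uses the modern bs4 package rather than the original BeautifulSoup module.

```python
import subprocess
import numpy as np
from bs4 import BeautifulSoup  # modern successor to the original BeautifulSoup module

def fetch_page(url, outfile):
    """Download a page via curl; -L follows redirects, -s silences progress."""
    subprocess.run(["curl", "-sL", "-o", outfile, url], check=True)

def table_to_vector(html):
    """Flatten every <td> in the first table into a 1D numpy array of strings,
    ready for cleanup and a later reshape."""
    soup = BeautifulSoup(html, "html.parser")
    cells = [td.get_text(strip=True) for td in soup.find("table").find_all("td")]
    return np.array(cells, dtype=object)

# A stand-in table with made-up numbers, just to show the flattening:
sample = """
<table>
  <tr><td>Reference Entity</td><td>Gross</td><td>Net</td></tr>
  <tr><td>ACME Corp</td><td>1,234</td><td>567</td></tr>
</table>
"""
vec = table_to_vector(sample)
print(vec.size)  # 6 cells in one flat vector
```

Once everything is a flat array, an `array.reshape(rows, cols)` call turns it back into a table.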

This would’ve been much easier if all the tables shared exactly the same format. Unfortunately, that’s never the case. An extra cell here or there, or weird characters, can throw things off. This isn’t an issue if you are parsing individual pieces of data or a single table. But what if you need to parse 10, 20, or 100 tables? It can get ugly fast. The DTCC data is broken into 23 pages, some of which have multiple tables. Luckily, most of my pain was self-inflicted (hey, I’m a parsing virgin). I only had to account for a few different table formats in the end.
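One hypothetical way to guard against the stray extra or missing cell is to force every parsed row to a fixed width before reshaping. This is not the code from the post, just a sketch of the idea:

```python
import numpy as np

def normalize_rows(rows, width):
    """Pad short rows with empty strings and trim long ones so the cells
    form a rectangle that numpy can hold as a 2D array."""
    fixed = []
    for row in rows:
        row = list(row)[:width]            # drop any extra trailing cells
        row += [""] * (width - len(row))   # pad rows that came up short
        fixed.append(row)
    return np.array(fixed, dtype=object)

# Made-up rows illustrating the two failure modes:
rows = [["ACME", "1,234", "567"],
        ["Widgets Inc", "99"],           # short row: a cell went missing
        ["Foo Corp", "5", "6", "7"]]     # long row: one extra cell
table = normalize_rows(rows, 3)
print(table.shape)  # (3, 3)
```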

One downside to my approach is that I do not dynamically produce headers for the data I’m pulling. I plan to manually set the headers for each table (the ultimate destination for the data right now is a set of csv files). If there’s a better way, please let me know.
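Setting the headers by hand and writing out a csv is simple enough with the standard library. The header names and filename below are hypothetical placeholders, not the actual DTCC column names:

```python
import csv

# Hypothetical header list, set by hand per table since the scraper
# does not produce headers dynamically.
HEADERS = ["Reference Entity", "Gross Notional", "Net Notional"]

rows = [["ACME Corp", "1,234", "567"]]  # made-up row for illustration

with open("single_name_cds.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(HEADERS)
    writer.writerows(rows)
```

Note that `csv.writer` quotes the "1,234" cell automatically, so embedded commas in the notional figures don’t break the file.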

You can find the code here via pastebin (feedback is welcome).
You can find the DTCC tables here (if you want to view the html source).

Part 2 will cover the process of reformatting the data with numpy and perhaps feature some charts. I’m very curious to see what the numbers show!

Here are a few screenshots of a terminal session using the code so far:


Written by DK

September 15, 2009 at 7:15 am

Posted in Finance, Python
