PDF Data: Set Them Free at PDF Liberation

Above: Old School Federal Budget PDF, 1976

This weekend, transparency advocates are gathering around the country (and even as far away as Argentina) to tackle one of open government’s most vexing issues: valuable data locked in PDF format.

Data published as a PDF can’t be graphed, sorted, summarized, pivoted, or combined with other data. Only when information is in a universally understood electronic format can we fully reap its benefits.

NPP’s tax breaks dataset is a great example of what can happen when PDFs are liberated. Last summer, we collected PDFs that show the estimated cost of tax breaks, going back to 1974. What began as decades of individual files, many not even machine-readable, became a single .csv file that anyone can open as a spreadsheet.
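As a rough illustration of that consolidation step, here is a minimal Python sketch. It assumes the numbers have already been extracted from each year's PDF into simple (name, cost) pairs; the function names, column names, and input layout are hypothetical and are not taken from NPP's actual code.

```python
import csv
import io


def combine_years(tables_by_year):
    """Flatten {year: [(name, cost), ...]} into one list of row dicts.

    The input layout is an assumption made for this sketch: one small
    table per year, already pulled out of that year's PDF.
    """
    rows = []
    for year in sorted(tables_by_year):
        for name, cost in tables_by_year[year]:
            rows.append({"year": year, "name": name, "cost": cost})
    return rows


def to_csv_text(rows):
    """Serialize the combined rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["year", "name", "cost"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Storing one row per tax break per year (a "long" layout) keeps the file easy to sort, filter, and pivot in any spreadsheet program.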

Once we had an electronic version of the tax break costs, we wrote a program to make the data even more useful, adding totals by category and the percent change from the previous year. We also combined the tax break numbers with economic data, adding Gross Domestic Product (GDP) and adjusting the costs for inflation.
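The three calculations described above might look something like the following sketch. This is not the program we published; the row layout (`year`, `category`, `cost`), the function names, and the use of a generic price index for the inflation adjustment are all assumptions made for illustration.

```python
from collections import defaultdict


def totals_by_category(rows):
    """Sum cost for each (year, category) pair."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["year"], row["category"])] += row["cost"]
    return dict(totals)


def percent_change(totals):
    """Percent change from the previous year, per category.

    Years with no prior-year total (or a prior total of zero) are skipped.
    """
    changes = {}
    for (year, category), total in totals.items():
        prev = totals.get((year - 1, category))
        if prev:
            changes[(year, category)] = (total - prev) / prev * 100.0
    return changes


def adjust_for_inflation(cost, price_index, base_index):
    """Convert a nominal cost into base-year dollars.

    price_index is the index for the year the cost was reported in;
    base_index is the index for the year you want to express it in.
    """
    return cost * base_index / price_index
```

For example, with placeholder numbers, a category that totals 20.0 one year and 25.0 the next shows a 25 percent increase, and a cost of 100.0 reported when the price index was 50.0 becomes 200.0 in dollars of a year whose index is 100.0.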

Finally, we used the improved data to create an interactive visualization so people can better understand the overall cost of tax breaks and who benefits from them.

All that was possible because we liberated PDFs, and we’re not done yet. In 2014, our research team will take a deeper dive into these tax breaks and provide additional categories for more nuanced analysis.

You can download the improved version of our tax break data here. We’ve also open-sourced the code used to improve the numbers from the original PDFs. Finally, you can check out our interactive tool, Big Money in Tax Breaks (static version below).

Big Money in Tax Breaks

It’s not too late to get involved in PDF Liberation. Visit the website for locations, challenges, and a list of PDF extraction resources. And thanks for making it easier to set the data free.