This is a companion reference for IRSx, a Python library that turns the IRS's versioned XML Form 990 tax filings into Python objects, .json, .csv or human-readable text with the original line numbers and descriptions. There's more documentation available in the README and additional examples in the cookbook. It works in two modes:
Most of the library is devoted to organizing tax data into consistently named and formatted data structures. That's done using three .csv files that ship with IRSx in the /metadata/ directory: variables.csv, groups.csv and schedule_parts.csv. This site provides a clickable reference to the overall layout, forms, repeating groups, xpaths, and variables (a variable is a grouping of xpaths that refer to the same quantity).
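To illustrate what "a grouping of xpaths that refer to the same quantity" means in practice, here is a minimal sketch that collects xpaths under a shared variable name. The column names (db_name, xpath), the sample rows, and the helper xpaths_by_variable are assumptions made for this example; check the real variables.csv in the /metadata/ directory for its actual layout.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only: the column names and xpaths below are made up
# for this sketch and are not copied from the real variables.csv.
SAMPLE_VARIABLES_CSV = """db_name,xpath
example_total_revenue,/IRS990/ExampleOldRevenueElement
example_total_revenue,/IRS990/ExampleNewRevenueElement
example_org_name,/ReturnHeader/ExampleNameElement
"""


def xpaths_by_variable(csv_text):
    """Map each variable name to the list of xpaths that feed it."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["db_name"]].append(row["xpath"])
    return dict(grouped)


lookup = xpaths_by_variable(SAMPLE_VARIABLES_CSV)
# Two schema versions naming the same quantity resolve to one variable.
print(lookup["example_total_revenue"])
```

The point of the grouping is that code downstream can ask for one variable name regardless of which schema version a given filing used.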
Investigators familiar with nonprofits are generally acquainted with the paper / electronic filings released as documents by the IRS; working out how a particular line item on a filed tax return maps to a variable in a spreadsheet is the most common use case. If you're not familiar with the returns themselves, you may want to spend some time looking at completed returns or samples.
Once you've found an actual line item in a return (ideally one from 2015 or 2016), go to the forms page and locate the corresponding part in the data structure. Note that "repeating" items (for example, the directors of the nonprofit) are listed separately (and would be imported to a different table in a relational database). Clicking on the name of the form part or the repeating group should bring you to a page listing all of the associated variables. Note that some "repeating" variables are only one code long (e.g. the list of states a return is filed in). Make sure you figure out whether the variable is in a repeating group or in a form part, and make sure you know the variable name or "db name" assigned to it.
Accessing the data
The IRSx documentation addresses in some detail how to access the data by variable name once you've identified the form, the form part, the repeating group name (if it's a repeating group) and the variable name.
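As a rough sketch of that lookup, the example below walks a hand-built stand-in for a parsed filing. It assumes a layout of one entry per schedule, with single-valued variables stored under form parts and repeating groups stored as lists of rows. That layout, the key names, the helper functions get_part_value and get_group_values, and every variable name and value are hypothetical; consult the IRSx documentation for the structure the library actually returns.

```python
# Hypothetical stand-in for a parsed filing. The layout, names and values
# are illustrative assumptions, not real IRSx output.
parsed_filing = [
    {
        "schedule_name": "IRS990",
        "schedule_parts": {
            "part_0": {"example_org_name": "Example Org"},
            "part_i": {"example_total_revenue": "1000"},
        },
        "groups": {
            "example_officer_group": [
                {"example_person_name": "A. Director"},
                {"example_person_name": "B. Director"},
            ],
        },
    },
]


def get_part_value(filing, form, part, variable):
    """Return a single-valued variable from a form part, or None if absent."""
    for schedule in filing:
        if schedule["schedule_name"] == form:
            return schedule["schedule_parts"].get(part, {}).get(variable)
    return None


def get_group_values(filing, form, group, variable):
    """Return one value per row of a repeating group (empty list if absent)."""
    values = []
    for schedule in filing:
        if schedule["schedule_name"] == form:
            for row in schedule["groups"].get(group, []):
                if variable in row:
                    values.append(row[variable])
    return values


revenue = get_part_value(
    parsed_filing, "IRS990", "part_i", "example_total_revenue"
)
directors = get_group_values(
    parsed_filing, "IRS990", "example_officer_group", "example_person_name"
)
print(revenue)
print(directors)
```

Keeping form parts and repeating groups in separate lookups mirrors the distinction drawn above: a form part yields at most one value per filing, while a repeating group yields one value per row.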
What is Part 0?
The tax forms / schedules that the IRS releases are divided into parts. However, some material appears at the top of a form, before the start of "Part I". For convenience, any material found there is listed as belonging to a made-up "Part 0".
About the data available
Nonprofits in the United States must make their tax returns (aka form 990, 990EZ or 990PF) available electronically. Although these files have been submitted *to* the IRS in .xml, an electronic format, for years the IRS refused to release the raw files and released only images of the submissions. Following a lawsuit by Carl Malamud, the IRS has begun releasing the original xml filings, primarily as an AWS public dataset. Not all nonprofits must file electronically (though rules require those of a certain size to do so; various estimates put the current rate of e-filing at 60-80%).
This project was originally built for ProPublica, an independent, nonprofit newsroom that runs Nonprofit Explorer, a database of nonprofit organizations, their tax returns, and federal audits of them when available. This repository was originally written and maintained by Jacob Fenton, who is responsible for any errors.
Thanks also to Tyler Davis for testing an open source release candidate and suggesting improvements.
The data in this release was current as of IRSx 0.0.10 on March 7, 2018.