Tax Model Data at the NBER

In accordance with a data agreement, all of the data described below is available for use solely on computers resident at the NBER. See Dan Feenberg for an account, and access to the data directories.

The NBER has a complete collection of all the public use Tax Model files created by the Statistics of Income Division of the IRS. There are cross section files from 1960 to 2002 (except 61, 63, and 65) and a panel file from 1979 to 1990. Each file includes about 200 variables, with some censoring of sensitive information. The cross section files are stratified random samples with oversampling of high income households. The panel files are random samples based on the last 4 digits of the SSN.

A brief review of the data, the TAXSIM program for calculating tax liabilities, and some of the applications made of it at the NBER is available at http://www.nber.org/feenberg-coutts.pdf

All files are available as flat files in SAS and Stata format. The fully processed files are also available as ASCII text. In each data directory, file names follow a fixed structure: an initial letter ("x" for the full cross section, "s" for a two percent subset of the full cross section, or "p" for the panel file), followed by a four digit year and an extension showing the file type. For example, x1960.dta is the full cross section for 1960 in Stata format.
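The naming rule can be sketched in a few lines of Python. This is only an illustration: the function name soi_file_name and the labels "full", "subset" and "panel" are made up here, and the extension is left as a parameter because only ".dta" is confirmed by the example above.

```python
# Prefix letters as described in the text; the dictionary keys are
# hypothetical labels chosen for this sketch.
PREFIXES = {"full": "x", "subset": "s", "panel": "p"}


def soi_file_name(kind: str, year: int, ext: str) -> str:
    """Build a file name such as 'x1960.dta' from kind, year, and extension."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown file kind: {kind!r}")
    if not 1000 <= year <= 9999:
        raise ValueError("year must have four digits")
    return f"{PREFIXES[kind]}{year}.{ext}"


if __name__ == "__main__":
    print(soi_file_name("full", 1960, "dta"))  # x1960.dta
```

The function does not check that a file exists for the requested year, so the gaps in coverage noted above still have to be checked by hand.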

There are no missing values in these files. If someone doesn't work, their wages are logically zero, and so forth for all the other variables. There is no ambiguity.

/homes/data/soi/raw/

These are the original Tax Model files from the Statistics of Income Division of the IRS. Many are in packed or zoned decimal and variables wander across the record layout from year to year so you will probably find it easier to use one of the more processed formats described below. Full documentation for these files is at http://www.nber.org/taxsim/iit-docs and is of interest even if you are using processed files, as it includes sample tax forms showing the source of every data item, or describing the calculation reported, as well as documenting what censoring has taken place.

/homes/data/soi/sas/
/homes/data/soi/dta/

These are the raw files converted to SAS or Stata format, and with semi-consistent variable names. A concordance of names, descriptions and source locations for all variables is at http://www.nber.org/taxsim/60-79.pdf (60-79) and http://www.nber.org/taxsim/79+.pdf . With these documents, you can quickly discover what is available for any given year.

The names are only semi-consistent because while some variables such as adjusted gross income keep their name in all years, variables such as long term gains change their basic meaning as the share of gains included in AGI changes. In most years SOI reports only the included amount and the variable name changes as the inclusion fraction changes.

/homes/data/soi/taxsim/sas/
/homes/data/soi/taxsim/dta/
/homes/data/soi/taxsim/txt/

These files provide a highly consistent naming (actually numbering) convention through time for a subset of the original variables. Wages are always "data11". Full long term gains are calculated by dividing the SOI supplied amount by 1.0, 0.5, or 0.4, depending on the inclusion fraction in effect, and stored as "data68". Similar calculations are done for various deductions subject to a floor, etc. We have an index showing all the names, and the years for which each item is available.
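As a rough illustration of the long term gains adjustment, the Python sketch below divides the included amount by the inclusion fraction. The function name is hypothetical, and the fraction is passed in by the caller because which fraction applies in which year is not stated here; this is not the code used to build the files.

```python
# The three inclusion fractions mentioned in the text.
ALLOWED_FRACTIONS = (1.0, 0.5, 0.4)


def full_long_term_gain(included_amount: float, inclusion_fraction: float) -> float:
    """Recover the full gain from the amount SOI reports as included in AGI."""
    if inclusion_fraction not in ALLOWED_FRACTIONS:
        raise ValueError("inclusion fraction must be 1.0, 0.5, or 0.4")
    return included_amount / inclusion_fraction


if __name__ == "__main__":
    # A reported 5,000 at a 50 percent inclusion rate implies a full gain of 10,000.
    print(full_long_term_gain(5000.0, 0.5))  # 10000.0
```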

Adding 10% of AGI to deductible medical expenses (for taxpayers with deductible medical expenses) and calling the result "gross medical expenses" simplifies using data from one year with a tax calculator for another year and allows one to answer questions such as "What is the effect of raising the floor on medical deductions". But no new data is created, so that it doesn't provide any answer (even a simulated one) to the question "What is the effect of reducing the floor", since taxpayers with less than the floor deduction will be left at zero.
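The asymmetry between raising and lowering the floor can be seen in a small Python sketch. The function names and the dollar figures are invented for illustration; only the 10% floor comes from the text above.

```python
def gross_medical(deductible: float, agi: float, floor_rate: float) -> float:
    """Add the floor back for filers reporting a deduction; others stay at zero."""
    if deductible > 0:
        return deductible + floor_rate * agi
    return 0.0


def deductible_medical(gross: float, agi: float, floor_rate: float) -> float:
    """Apply a (possibly different) floor to reconstructed gross expenses."""
    return max(0.0, gross - floor_rate * agi)


if __name__ == "__main__":
    agi = 50000.0

    # Filer A reported 2,000 of deductible expenses under the 10% floor.
    gross_a = gross_medical(2000.0, agi, 0.10)
    print("A, floor raised to 12%:", deductible_medical(gross_a, agi, 0.12))

    # Filer B had expenses below the floor, so the return shows zero and the
    # reconstructed gross amount is zero too. Lowering the floor cannot
    # recover expenses that were never reported.
    gross_b = gross_medical(0.0, agi, 0.10)
    print("B, floor lowered to 5%:", deductible_medical(gross_b, agi, 0.05))
```

Filer A's deduction shrinks as expected when the floor rises, while Filer B stays at zero under any floor, which is why the lower-floor question cannot be answered from these files.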

When using this file keep in mind that very little is imputed without a firm source and items are zero if not available on the tax return that year. A complete list of variables is available here

At present the only statistically imputed variable is the state of residence for taxpayers with AGI over $200,000.

TAXSIM

These highly consistent files may be used in Stata or FORTRAN to calculate tax liabilities, marginal tax rates, and many intermediate tax calculations with the NBER TAXSIM model. More information about the Stata interface is available with the following Stata commands:

    net cd http://www.nber.org/stata
    net describe taxpuf

Because we are not allowed to distribute the PUF outside of 1050, taxpuf is not available elsewhere. Taxpuf is easy to use. For example:

    . use /homes/data/soi/taxsim/dta/s1999
    . taxpuf
    . table state,c(mean ftax mean stax)

reads in the subset version of the 1999 file and calculates federal and state tax liabilities by state. Note that the variable names for the input are documented here (as noted above), or you could use the "describe" command for a listing.

Sometimes it is easier to just use FORTRAN. In that case you should see me and I can get you started. FORTRAN is very easy, and you won't have a lot of trouble going forward once I get you on the right track with it. Calculations are much faster without the Stata interface.

TAXSIM on the Web

You can submit a 22 variable characterization of an individual taxpayer, or a file of such taxpayers to our website at http://www.nber.org/taxsim/taxsim5 and get back a detailed calculation of tax liabilities. This facility is intended to encourage users of survey datasets such as the CPS or SIPP to use after-tax prices in their work.

A Stata interface to the 22 variable version is also available:

    net cd http://www.nber.org/stata
    net describe taxsim6

Modeling Tax Proposals

Changes to the tax system naturally divide into two categories: some can be modeled as changes to the data, and others cannot. For example, a secondary earner's deduction can be modeled as a change in wages ("data11" in the model). A change in the top bracket rate requires access to the model internals and would have to be discussed with me and implemented here. In many cases we would be glad to make these modifications, or help you make them. The model itself is about 20,000 lines of FORTRAN 77.
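A proposal of the first kind might be sketched in Python as follows. Everything except the name "data11" is an assumption made for illustration: the function name, the deduction rate, the cap, and the idea that the secondary earner's wages are known separately. None of these describes an actual proposal or a field in the files.

```python
def apply_secondary_earner_deduction(
    record: dict, secondary_wages: float, rate: float, cap: float
) -> dict:
    """Return a copy of the record with wages (data11) reduced by the deduction.

    The deduction is rate times the secondary earner's wages, with those
    wages capped at `cap`. All three parameters are hypothetical.
    """
    deduction = rate * min(secondary_wages, cap)
    adjusted = dict(record)  # leave the caller's record untouched
    adjusted["data11"] = max(0.0, record["data11"] - deduction)
    return adjusted


if __name__ == "__main__":
    original = {"data11": 80000.0}
    changed = apply_secondary_earner_deduction(
        original, secondary_wages=20000.0, rate=0.10, cap=30000.0
    )
    print(original["data11"], changed["data11"])
```

The adjusted record would then go through the unmodified tax calculator, and the difference in liability between the original and adjusted runs would be the simulated effect of the proposal.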

If you are using these files, I would like to meet with you.

Daniel Feenberg
feenberg@nber.org
617-588-0343