Income Tax Calculator for SOI-IRS Individual Data

TAXCALC is a program, available in SAS or Stata, for calculating US Individual Income Tax liabilities from the confidential micro-data files of the IRS Statistics of Income Division, or from the anonymized public use files made available by that agency. It is intended to facilitate the econometric analysis of tax data and is not suitable for preparing individual tax returns for submission.

TAXCALC is unrelated (other than authorship) to the widely used TAXSIM program, also from the NBER. TAXSIM is written in Fortran, calculates federal taxes (1960-2023) and state taxes (1977-2016), and operates on a transformation of the SOI Public Use Files (PUF) from 1960 to 2011. TAXCALC, by contrast, is written in native SAS and Stata code and operates directly on the proprietary binary files used by those packages. It covers federal liability only (1993-2013) and does no state tax calculation. Both packages receive annual updates, but TAXCALC is limited to the last year for which data are available for testing.

TAXCALC was written with the following goals in mind:

  1. The same program should work on both public use and SOI internal files and should produce the same results in SAS or Stata.
  2. The tax calculation should fully utilize the data available in either file.
  3. There should be no editing of analytical program code required as such programs move between the external and internal (government) environments.
  4. It should be possible to calculate tax liability for any year according to the law for any other year.
  5. It should be possible to calculate marginal tax rates by finite difference accounting for all clawbacks, minimum tax, stacking rules, etc.
  6. All intermediate values should be calculated, rather than taken from the tax return. (This is necessary to get marginal rates correct.)
  7. Multi-year data analysis should be convenient.
  8. The user interface should be flexible, with simple calculations simple to set up, and complex calculations possible.
  9. Typical econometric calculations, such as for a tax-price regression, should be straightforward.
  10. It should be batch oriented, so that SOI staff can do runs for external researchers without having to modify submitted code, or interact with a running program.
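
Goals 5 and 6 can be illustrated with a minimal sketch. This is not TAXCALC code (which is SAS and Stata): the function names `toy_tax` and `marginal_rate`, the bracket schedule, and the credit phase-out are all hypothetical, chosen only to show why a finite difference over a fully recomputed liability picks up clawbacks that a statutory bracket rate misses.

```python
def toy_tax(wages, other_income=0.0):
    """Hypothetical liability: two brackets plus a credit that phases out.

    Every number here is invented for illustration; none is actual tax law.
    """
    agi = wages + other_income
    taxable = max(0.0, agi - 10_000.0)  # flat standard deduction
    # 10% on the first 40,000 of taxable income, 25% above that
    tax = 0.10 * min(taxable, 40_000.0) + 0.25 * max(0.0, taxable - 40_000.0)
    # 1,000 credit, clawed back at 5 cents per dollar of AGI over 30,000
    credit = max(0.0, 1_000.0 - 0.05 * max(0.0, agi - 30_000.0))
    return max(0.0, tax - credit)


def marginal_rate(tax_fn, wages, delta=100.0, **other):
    """Finite-difference marginal rate: recompute the whole liability
    with wages raised by delta and divide the change in tax by delta."""
    base = tax_fn(wages, **other)
    bumped = tax_fn(wages + delta, **other)
    return (bumped - base) / delta


if __name__ == "__main__":
    for w in (20_000.0, 40_000.0, 60_000.0):
        print(w, round(marginal_rate(toy_tax, w), 4))
```

In this toy schedule a taxpayer with wages of 40,000 sits in the 10 percent bracket, but the finite difference reports 15 percent because each extra dollar also reduces the credit by 5 cents. Recomputing every intermediate value, rather than reading it from the return, is what lets the bump propagate through the credit.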


For most common tax forms the calculation starts from the most basic data and proceeds through all worksheets, clawbacks and restrictions. These forms include:

  1. 1040
  2. 1040 Schedules A, D (including worksheet), SE and EIC.
  3. 6251 (including worksheet)
  4. 4972
  5. 2441
  6. Child Tax Credit and 8812
  7. 5405

We are careful to maintain consistency among the various forms. That is, if a form other than the 1040 asks for AGI, or capital gains, or earned income, we use the value supplied on the 1040 or Schedule D, even if the taxpayer supplied, and the SOI accepted, a different value. This is essential to the correct calculation of marginal tax rates.


A number of features of the tax code are ignored in the interest of simplification. Only "bottom line" values are used from the following forms:

  1. Schedule C
  2. Schedule E
  3. 8814
  4. 2555 (but the FEIE worksheet is used to calculate tax).

A few special features of the tax law are simply ignored:

  1. Non-calendar year returns are processed as if for calendar year.
  2. Special treatment of capital gains on kiddie tax forms is ignored.
  3. Schedule J elected farm income is treated as regular income.
  4. Form 4972 gains are treated as long term gains.
  5. Compulsory itemization is sometimes treated as standard deduction.
  6. Optimization of itemization status under AMT is sometimes incorrect.
  7. Form 8812 is simplified.

Nevertheless, after excluding returns with items 1-3 above, the R**2 on the calculated tax after credits is 0.9999 or better when used with the complete (confidential) file.
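
As a rough illustration of this kind of accuracy check, the sketch below computes an R**2 between a reported and a calculated tax series. The text does not say exactly how the R**2 is defined, so this assumes the squared Pearson correlation (the R**2 of a simple regression of one series on the other). The function name `r_squared` and the sample figures are hypothetical.

```python
def r_squared(reported, calculated):
    """Squared Pearson correlation between two equal-length series.

    Assumption: this is the R**2 of a one-variable OLS regression with
    an intercept; the source does not specify its exact definition.
    """
    if len(reported) != len(calculated) or len(reported) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(reported)
    mean_r = sum(reported) / n
    mean_c = sum(calculated) / n
    sxy = sum((r - mean_r) * (c - mean_c) for r, c in zip(reported, calculated))
    sxx = sum((r - mean_r) ** 2 for r in reported)
    syy = sum((c - mean_c) ** 2 for c in calculated)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("a constant series has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Invented figures: the last calculated value disagrees with the return.
    reported = [1.0, 2.0, 3.0, 4.0]
    calculated = [1.0, 2.0, 3.0, 5.0]
    print(r_squared(reported, calculated))
```

A single badly mismatched return pulls the statistic visibly below 1 in a small sample, which is why a value of 0.9999 on the full file indicates close agreement.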

When calculating for a law year other than the file year, some data may be unavailable, and no attempt is made to impute such data. For example, in years when a deduction or adjustment is not available, the amount of the expenditure for that item will not be recorded in the data and will be treated as zero in the calculation of tax liability for some other year. While this might bias aggregate revenue unacceptably, the calculator's chief intended use is in econometric studies where cross-year estimates are likely to be used as instrumental variables, and the bias may not be of consequence in such uses.
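
The effect can be seen in a small sketch. The two law years, the field name `item_x`, and the flat rate are hypothetical and stand in for any deduction that one year's law allows and another's does not; the point is only that an unrecorded item enters the calculation as zero.

```python
# Hypothetical law table: year_A allows a deduction for item_x,
# year_B does not (so year_B data files never record item_x).
LAWS = {
    "year_A": {"rate": 0.20, "deductible_items": ("item_x",)},
    "year_B": {"rate": 0.20, "deductible_items": ()},
}


def liability(record, law_year):
    """Flat-rate liability under the given law year.

    Items missing from the record are treated as zero, with no
    attempt to impute them.
    """
    law = LAWS[law_year]
    deductions = sum(record.get(item, 0.0) for item in law["deductible_items"])
    return law["rate"] * max(0.0, record["income"] - deductions)


if __name__ == "__main__":
    year_a_return = {"income": 50_000.0, "item_x": 5_000.0}
    year_b_return = {"income": 50_000.0}  # item_x not recorded in this year
    print(liability(year_a_return, "year_A"))
    print(liability(year_b_return, "year_A"))
```

The second return may well have had the same expenditure, but because the year_B file does not record it, the year_A law applied to year_B data yields a higher liability than the taxpayer would actually have owed.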

Usage instructions

Documentation for TAXCALC is provided chiefly by a graduated series of example programs. Please study them and read the notes - they are brief, but vital.

  1. Total Liability (SAS, Stata)
  2. Change a tax law parameter (SAS, Stata)
  3. Switch a tax code year (SAS, Stata)
  4. Marginal rates (SAS, Stata)
  5. Fixups/Imputations (SAS, Stata)
  6. Use of the SOI PUF file (SAS, Stata)
  7. Non-SOI dataset (SAS, Stata)


Dollar amount variables in the SOI file all begin with an "E" followed by a 5-digit number. Our calculated values use a "c" and the same 5 digits. All of these variables have global scope. Non-SOI variables used only in the calculator all begin with an underscore, to avoid confusion with user-written code in the same do-file or DATA step. A list of the calculated variables and a list of the variables used in the calculation are available separately.
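
The naming convention can be expressed as a small sketch. The helper names `calculated_name` and `classify` are hypothetical and are not part of TAXCALC, and the sample variable codes are used only as examples of the pattern. Matching is case-sensitive here, as it would be in Stata; SAS variable names are not case-sensitive.

```python
import re

# "E" plus five digits: a dollar amount as recorded in the SOI file.
_SOI_PATTERN = re.compile(r"^E(\d{5})$")
# "c" plus five digits: the calculator's own value for the same item.
_CALC_PATTERN = re.compile(r"^c(\d{5})$")


def calculated_name(soi_name):
    """Map an SOI dollar-amount name to its calculated counterpart."""
    match = _SOI_PATTERN.match(soi_name)
    if match is None:
        raise ValueError(f"not an SOI dollar-amount name: {soi_name!r}")
    return "c" + match.group(1)


def classify(name):
    """Label a variable name according to the convention described above."""
    if _SOI_PATTERN.match(name):
        return "soi"
    if _CALC_PATTERN.match(name):
        return "calculated"
    if name.startswith("_"):
        return "internal"  # used only inside the calculator
    return "user"


if __name__ == "__main__":
    print(calculated_name("E00100"))
    for name in ("E00100", "c00100", "_tmp", "myvar"):
        print(name, classify(name))
```

A check of this kind could be used to confirm that user-written variables do not collide with the three reserved name patterns.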


  • TAXCALC Source Current Versions (write for access).

    Acceptance Tests

    1. Compare aggregate differences.
    2. Regression Test of Accuracy
    3. Tests with late returns only
    4. Print 6 worst results, with details


    For copies of the program, and to report bugs or make suggestions, please write or call me. If you write, please include a phone number and a suggested time to call.


    All of the SAS calculator code has been written by Inna Shapiro of the NBER. Victoria Bryant of SOI has been a dedicated and patient tester and advisor throughout the process of writing this code. Writing code without access to test data presents real difficulties, and without her help through 98+ turnarounds this effort would not have succeeded. Mike Strudler (SOI), David Joulfaian (OTA) and James Pearce (OTA) have also been very helpful. The Stata version is partly a mechanical translation and partly handwork by Inna and me. We are very interested in reports of use and in bug reports.

    Daniel Feenberg
    617-588-0343 (office)
    617-863-0343 (google voice)