Ancestry.com and IPUMS Complete Count Restricted File.

Ancestry.com has sponsored the digitization of the available complete count census files and allowed IPUMS to offer all but the respondent names on its website to the general user population.

Confidentiality Considerations

The name (namefrst and namelast) fields are available to affiliated users at the NBER by special arrangement through IPUMS. NBER affiliates wishing to use the IPUMS-RESTRICTED (Ancestry.com) census files for a new project should sign the application and NDA forms here and send them to Alterra Milone and our IRB for our submission to IPUMS. Once approved and assigned a project number by IPUMS, the project will be forwarded to the NBER IRB for review. The IRB may follow up with additional questions.

Once all approvals are in place you can be added to the Linux groups with permission to read the data. These files (or extracts) may be processed on our servers but should not be downloaded from them.

To add an investigator or RA to an existing project send a signed agreement form marked with the project title and IPUMS number. The approval process is the same and you will be notified when the new researcher can access the data.

Linking between the restricted and public versions is fine, but the data must be maintained/analyzed on the NBER server.

Please ensure that your extracts are not world readable. It is important to respect the agreement to ensure continued access to this important resource, for you and your colleagues. I can create a shared directory for you and others working on the same project.
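
As a minimal sketch of what "not world readable" means in practice, the Python functions below check a file's permission bits and clear all access for users outside the owner and group. The function names are hypothetical, and the sketch assumes ordinary Unix permission bits rather than ACLs.

```python
import os
import stat


def is_world_readable(path):
    """Return True if users outside the owner and group can read the file."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def remove_world_access(path):
    """Clear the read, write and execute bits for 'other' users on one file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~stat.S_IRWXO)
```

Calling remove_world_access on each extract after it is written leaves the owner and group permissions unchanged, so a shared project directory keeps working.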

Citation

Publications and research reports based on the IPUMS USA database must cite it appropriately. The citation should include the following:

Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0

File locations

This page documents the NBER 1880-1940 collection. The 1790-1840 files are available in /home/data/cens1930. An earlier version of 1940 is in /home/data/cens1940.

Earlier versions of the data are kept available. There is no need to keep a personal copy.

Starting with the June 2019 distributions we keep our copies of the files in

/home/data/census-ipums/

and its subdirectories. We have 1850-1940 except 1890 (1890 was lost in a fire). These files contain all the named fields, not just the restricted fields. Unedited (numbered) fields are not included. If you need unedited fields, contact Dan Feenberg; we can add what you need.

Each IPUMS revision set is kept in a separate directory, starting with /home/data/census-ipums/v2019 for the version obtained by NBER from June through October 2019. Within that directory are directories "do", "sps", etc. with programs that can read the data files in various packages. These programs are provided for reference, as we have already modified the Stata code to convert the raw ASCII files (in ./dat) to dta format in ./dta. Comma-delimited files are in the ./csv directory and Parquet files in ./parquet. Other formats can be added if useful and requested.

The locations of earlier versions of the files will not change, and those files will not be deleted or updated. /home/data/census-ipums/current will always point to the latest available revision, but older versions will be retained for consistency. For this reason there is no need for you to make private copies of the full datasets; please do not do so.

Documentation

The IPUMS website covers all the publicly available variables. The additional variables in the restricted use file include:
  • namefrst: 16-character first name (and possibly middle initial)
  • namelast: 16-character last name
  • histid: 36-character person id for matching across IPUMS versions (but not census decades)
  • street: street address
Here is a compact concordance of variables and descriptions.

File Structure

The original files are hierarchical, but we have created the dta etc files as rectangular person datasets. That is, the household record is appended to each person record. We also apply the scaling factors in the IPUMS supplied code, which should conform the data to the documentation. Value labels that are merely the ASCII expression of the numeric value are dropped. There are no other changes.
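
The rectangularization described above can be sketched in Python as follows. This is an illustration only, not the conversion code actually used: the "rectype" key and the dictionary layout are hypothetical, and the sketch assumes each household ("H") record precedes its person ("P") records and shares a serial number with them.

```python
def rectangularize(records):
    """Yield one merged dict per person record.

    `records` is an iterable of dicts in file order. Each dict has a
    hypothetical "rectype" key ("H" or "P") and a "serial" key.
    """
    household = None
    for rec in records:
        if rec["rectype"] == "H":
            household = rec
        elif rec["rectype"] == "P":
            # Skip person records that have no matching household record.
            if household is None or household["serial"] != rec["serial"]:
                continue
            row = {k: v for k, v in household.items() if k != "rectype"}
            row.update((k, v) for k, v in rec.items() if k != "rectype")
            yield row


sample = [
    {"rectype": "H", "serial": 1, "statefip": 5},
    {"rectype": "P", "serial": 1, "pernum": 1, "age": 40},
    {"rectype": "P", "serial": 1, "pernum": 2, "age": 38},
    {"rectype": "P", "serial": 2, "pernum": 1, "age": 9},  # no "H" for serial 2
]
rows = list(rectangularize(sample))  # two rows; the orphan "P" record is dropped
```

Each output row carries the household variables alongside the person variables, which is why the rectangular files are larger than the hierarchical originals but need no merge step before analysis.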

Resource considerations

These are very large files, as evidenced by record counts and file sizes (up to 100GB), but with some thought it is practical to work with only traditional econometric software. The .dta files are somewhat more reasonable. There is advice for Stata users on dealing with very large files here, but see especially this, which is highly relevant.

A new, fast and compact format is Parquet. This is column oriented, so if you load just a few variables only a fraction of the file need be read. Please see here for details.

Matching

The Census Linking Project at Princeton created a set of linked datasets between every historical Census pair using a variety of automated methods. There are considerable savings in time and resources in using a pre-made match. The code and documentation are also available on our system, while access to the data on our system is restricted to members of the "cens1930" group. For internal use the files are at:
  • /home/data/census-ipums/linking_project
Publications using data from the matches should cite the Census Linking Project as:

Ran Abramitzky, Leah Boustan and Myera Rashid. Census Linking Project: Version 1.0 [dataset]. 2020. https://censuslinkingproject.org

In order to facilitate users making their own matches, we have prepared two resources. First, for each census, for each year and sex, there is a file containing the variables useful for matching: bpl, sex, namefrst, namelast, age, datanum, serial, pernum and histid. These files are quite reasonable in size and will facilitate making matches with a reasonable memory footprint. They are located in

/home/data/census-ipums/current/mx/csv/mxMMMMMM.csv
/home/data/census-ipums/current/mx/dta/mxMMMMMM.dta

A second resource (under construction) is a set of potential matches based on Jaro-Winkler scores better than .75 for first and last names and ages within the expected age difference plus or minus 5 years. These are:

/home/data/census-ipums/mx/csv/mpBBBBS.csv
/home/data/census-ipums/mx/dta/mpBBBBS.dta
/home/data/census-ipums/mx/csv/mpNNNNMMMMBBBBS.csv
/home/data/census-ipums/mx/dta/mpNNNNMMMMBBBBS.dta

and may include multiple records from the later census (MMMM) that are potential matches for the record from the earlier census (NNNN). There are separate (BBBBS) files for each birthplace and sex. It is up to the user to select the best match after merging with wider records.
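
The candidate-match rule described above (same birthplace and sex, name scores better than .75, age within the expected difference plus or minus 5 years) can be sketched in Python as follows. This is not the code used to build the files. The function names are hypothetical, and the default name_similarity is a standard-library stand-in (difflib's ratio), not Jaro-Winkler; pass a Jaro-Winkler function as `similarity` to follow the rule as stated.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Stand-in similarity in [0, 1]. This is NOT Jaro-Winkler."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def candidate_matches(earlier, later, years_apart,
                      similarity=name_similarity,
                      min_score=0.75, age_window=5):
    """Return (earlier_histid, later_histid) pairs that pass all filters.

    Both inputs are lists of dicts with the keys histid, namefrst,
    namelast, age, bpl and sex.
    """
    # Block on birthplace and sex so only plausible pairs are compared.
    blocks = {}
    for rec in later:
        blocks.setdefault((rec["bpl"], rec["sex"]), []).append(rec)

    pairs = []
    for rec in earlier:
        expected_age = rec["age"] + years_apart
        for cand in blocks.get((rec["bpl"], rec["sex"]), []):
            if abs(cand["age"] - expected_age) > age_window:
                continue
            if (similarity(rec["namefrst"], cand["namefrst"]) > min_score
                    and similarity(rec["namelast"], cand["namelast"]) > min_score):
                pairs.append((rec["histid"], cand["histid"]))
    return pairs
```

Blocking on birthplace and sex before scoring names keeps the number of string comparisons, and therefore memory and CPU use, far below that of an all-pairs comparison. One earlier record can appear in several pairs, so the user still has to pick the best match.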

Please respect other users by being reasonably efficient with computational resources, especially memory. In particular, when reading even one of these files into Stata you will want to subset on variables or rows, or both. There is a directory

/home/data/census-ipums/tiny

with Arkansas records only. This is a good way to get a small sample for testing that can be followed through time. In Stata, using a qualifier such as

use /home/data/census-ipums/current/dta/1920 in 1/10000

will give you a small file, but with no ability to test linking through time, and some Stata versions will always read the entire file, discarding the records after 10000. That is time-consuming. Using

use /home/data/census-ipums/tiny/dta/1920

is much more satisfactory.
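
For users working from the comma-delimited files outside Stata, the same subsetting idea can be sketched in Python with the standard library alone. The function name is hypothetical, and the sketch assumes the CSV files begin with a header row of variable names. Rows are streamed one at a time, so memory use does not grow with file size, and reading stops as soon as the row limit is reached.

```python
import csv
from itertools import islice


def read_subset(path, columns, max_rows=None):
    """Stream a CSV file, keeping only `columns` and at most `max_rows` rows."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # islice stops reading after max_rows rows (None means no limit).
        for row in islice(reader, max_rows):
            yield {name: row[name] for name in columns}
```

Note that all values arrive as strings and need converting before numeric use. As with the Stata qualifier, the first rows of a full file are not a sample that can be followed through time, so the tiny directory remains the better choice for testing links.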

Showload will show you the available memory on all the machines, and "top -o RES" will show how your job is doing on the current machine. Computer time is cheap, but waiting for the computer is not. You may run multiple jobs, but resist the urge to use more than half the available CPU or memory on any one machine. If you ask for more memory than the computer has, your job will run so slowly that it may never finish. It is always a good idea to keep track of the progress of large, long jobs.

Note that all our disk storage is compressed in the filesystem. Zip or Z compression will not reduce the actual resources used, and will add complexity and time to your analysis.

Notes and questions for discussion

  1. Are any more of the uscenNNNN_NNNN (numbered) variables useful? Including all of them would multiply the load times. At this time only rawhnum has been added (House number on street).
  2. Online documentation from IPUMS suggests using datanum, serial and pernum for identifying individual records, but common practice among NBER users is to use histid. Also, datanum is absent in 1860-70. An advantage of histid is that it is maintained across IPUMS versions, but do you ever merge across different IPUMS versions? Why? The histid is a rather long identifier, and I feel like I need to carry along the others anyway.
  3. The name fields sometimes contain what appear to me to be stray nonsense characters such as quote marks, dollar signs, brackets, etc. The quote marks especially can discomfort Stata. Would it be better to drop these?
  4. The programs from IPUMS would create hierarchical files. I didn't think most users would like that. Is there a preference for hierarchical files?
  5. I did not continue the practice of dividing the files into 100 pieces. The dta files can load in a couple of minutes. I do make extracts with the variables required for matching divided by birthplace and sex. Those are never very large and are available in ./mx. Is that ok?
  6. It should be possible to greatly reduce the resource load for matching across decades and I would like to talk to users doing that. Some preliminary work is outlined here.
  7. Jaro-Winkler distance programs don't seem to have uniform outcomes. The Feigenbaum -jarowinkler.ado- program applies the Winkler correction to all scores, while Winkler himself applies it only when the Jaro score is greater than .7. The Winkler adjustment parameter is .1; other authors use other values. The distance for null and one-character strings varies across implementations. I can follow Feigenbaum, but seek input from all users.
  8. Winkler also has an adjustment for often confused characters (such as "X" and "K") which is not often used. Other authors have nickname lists. I would like to collect such lists and offer them for more general use.
  9. Some users have supplemented the data with additional variables. If you would like to allow others to use these new variables, I can add them to the common use files.
  10. If you have a crosswalk across decades, I would be happy to post it here for other users. Records can be identified by histid or the combination of datanum, serial and pernum. I will standardize the variable names by adding a 4 digit year to each.
  11. I have dropped 50,000 value labels which are simply ASCII presentations of numeric values, such as: label define value x 7 `7'
  12. There are 196 "P" records in 1940 with no corresponding "H" record. This does not happen in any other year, and those records are omitted from the .dta file.
  13. There are a number of records with blank histid.
  14. What is the appropriate sort order? Matching by histid requires records sorted by histid, which is not the native order.
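
To make the difference raised in note 7 concrete, here is a minimal Python sketch of the textbook Jaro and Jaro-Winkler scores with the prefix adjustment parameter (p = .1) and the threshold exposed as arguments. It is not a port of any particular program: the function and parameter names are hypothetical, and returning 0 when either string is empty is an assumption, which is exactly one of the conventions that varies across implementations.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; returns 0.0 if either string is empty."""
    if not s1 or not s2:
        return 0.0
    if s1 == s2:
        return 1.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, boost_threshold=0.0, max_prefix=4):
    """Jaro score plus the prefix bonus, applied only above boost_threshold."""
    score = jaro(s1, s2)
    if score <= boost_threshold:
        return score
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


always = jaro_winkler("ABCDEF", "ABXYZW")                      # about 0.644
gated = jaro_winkler("ABCDEF", "ABXYZW", boost_threshold=0.7)  # about 0.556
```

With boost_threshold=0.0 the prefix bonus is applied to every nonzero score; with boost_threshold=0.7 it is applied only when the Jaro score exceeds .7. The two settings agree for close pairs such as MARTHA and MARHTA (about 0.961 either way) and differ only for weak pairs that share a prefix, so the choice matters mainly near a cutoff such as .75.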

Don't hesitate to contact me for computer issues, but I don't have that much domain-specific knowledge.

Daniel Feenberg
29 July 2019
617-863-0343