Ancestry.com and IPUMS Complete Count Restricted File.
Ancestry.com has sponsored the digitization of the available complete count census files and allowed IPUMS to offer all but the respondent names on its website to the general user population.
The name (namefrst and namelast) fields are available to
affiliated users at the NBER by special arrangement through
IPUMS. NBER affiliates wishing to use the IPUMS-RESTRICTED
(Ancestry.com) census files for a new project should sign
Once all approvals are in place you can be added to the Linux groups with permission to read the data. These files (or extracts) may be processed on our servers but should not be downloaded from them.
To add an investigator or RA to an existing project, send a signed agreement form marked with the project title and IPUMS number. The approval process is the same, and you will be notified when the new researcher can access the data.
Linking between the restricted and public versions is fine, but the data must be maintained/analyzed on the NBER server.
Please ensure that your extracts are not world readable. It is important to respect the agreement to ensure continued access to this important resource, for you and your colleagues. I can create a shared directory for you and others working on the same project.
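As one way to check that extracts are not world readable, here is a minimal Python sketch that walks a directory and clears the read, write and execute bits for "other" users. The function names (`world_readable`, `remove_world_access`, `secure_tree`) are hypothetical helpers written for this illustration, not tools installed on our servers.

```python
import os
import stat


def world_readable(path):
    """Return True if users outside the owner and group can read path."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def remove_world_access(path):
    """Clear the read/write/execute bits for 'other' and return the new mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode & ~stat.S_IRWXO
    if new_mode != mode:
        os.chmod(path, new_mode)
    return new_mode


def secure_tree(root):
    """Strip world access from root and everything beneath it.

    Returns the list of paths whose permissions were changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            if os.stat(path).st_mode & stat.S_IRWXO:
                remove_world_access(path)
                changed.append(path)
    return changed
```

From the shell, `chmod -R o-rwx` on the extract directory has the same effect, and group permissions are left untouched so project members in the same Linux group keep their access.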
Citation
Publications and research reports based on the IPUMS USA database must cite it appropriately. The citation should include the following:
Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0
This page documents the NBER 1880-1940 collection. The 1790-1840 files are available in /home/data/cens1930. An earlier version of 1940 is in /home/data/cens1940.
Earlier versions of the data are kept available. There is no need to keep a personal copy.
Starting with the June 2019 distributions we keep our copies of the files in
Documentation
The IPUMS website covers all the publicly available variables. The additional variables in the restricted use file include:
- 16 character first name (and possibly middle initial)
- 16 character last name
- 36 character person id for matching across IPUMS versions (but not census decades)
- street address
File Structure
The original files are hierarchical, but we have created the .dta and other format files as rectangular person datasets. That is, the household record is appended to each person record. We also apply the scaling factors in the IPUMS-supplied code, which should make the data conform to the documentation. Value labels that are merely the ASCII expression of the numeric value are dropped. There are no other changes.
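To illustrate what "rectangular" means here, the following Python sketch flattens a stream of household ("H") and person ("P") records into one row per person with the household fields appended. The in-memory record layout (a record type plus a dict of fields) is an assumption made for the example, not the IPUMS file format, and the rule that a person record without a matching household is skipped is likewise an illustrative choice.

```python
def rectangularize(records):
    """Yield one flat dict per person, with the household fields appended.

    records is an iterable of (rectype, fields) pairs in file order, where
    rectype is "H" or "P" and fields is a dict that includes "serial".
    This layout is hypothetical and chosen only for illustration.
    """
    household = None
    for rectype, fields in records:
        if rectype == "H":
            household = fields
        elif rectype == "P":
            # Skip a person record that has no matching household record.
            if household is None or household["serial"] != fields["serial"]:
                continue
            # Person fields first, then the household fields appended.
            yield {**fields, **household}


if __name__ == "__main__":
    sample = [
        ("H", {"serial": 1, "statefip": 25}),
        ("P", {"serial": 1, "pernum": 1, "age": 40}),
        ("P", {"serial": 1, "pernum": 2, "age": 38}),
    ]
    for row in rectangularize(sample):
        print(row)
```

The cost of this layout is that household values are repeated on every person row; the benefit is that a single rectangular file can be loaded directly by standard statistical software.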
Resource considerations
These are very large files, as the record counts and file sizes (up to 100GB) show, but with some thought it is practical to work with only traditional econometric software. The .dta files are somewhat more reasonable. There is advice for Stata users on dealing with very large files here, but see especially this, which is highly relevant.
A new, fast and compact format is Parquet. This is column oriented, so if you load just a few variables only a fraction of the file need be read. Please see here for details.
Matching
The Census Linking Project at Princeton created a set of linked datasets between every historical Census pair using a variety of automated methods. There are considerable savings in time and resources in using a pre-made match. The code and documentation are also available on our system at
Please respect other users by being reasonably efficient with computational resources, especially memory. In particular when reading even one of these files into Stata you will want to subset on variables or rows, or both. There is a directory
Showload will show you the available memory on all the machines, and "top -o RES" will show how your job is doing on the current machine. Computer time is cheap, but waiting for the computer is not. You may run multiple jobs, but resist the urge to use more than half the available CPU or memory on any one machine. If you ask for more memory than the computer has, your job will run so slowly that it may never finish. It is always a good idea to keep track of the progress of large, long jobs.
Note that all our disk storage is compressed in the filesystem. Zip or Z compression will not reduce the actual resources used, and will add complexity and time to your analysis.
Notes and questions for discussion
- Are any more of the uscenNNNN_NNNN (numbered) variables useful? Including all of them would multiply the load times. At this time only rawhnum has been added (House number on street).
- Online documentation from IPUMS suggests using datanum, serial and pernum for identifying individual records, but common practice among NBER users is to use histid. Also, datanum is absent in 1860-70. An advantage of histid is that it is maintained across IPUMS versions, but do you ever merge across different IPUMS versions? Why? The histid is a rather long identifier, and I feel like I need to carry along the others anyway.
- The name fields sometimes contain what appear to me to be stray nonsense characters such as quote marks, dollar signs, brackets, etc. The quote marks especially can discomfort Stata. Would it be better to drop these?
- The programs from IPUMS would create hierarchical files. I didn't think most users would like that. Is there a preference for hierarchical files?
- I did not continue the practice of dividing the files into 100 pieces. The dta files can load in a couple of minutes. I do make extracts with the variables required for matching divided by birthplace and sex. Those are never very large and are available in ./mx. Is that ok?
- It should be possible to greatly reduce the resource load for matching across decades and I would like to talk to users doing that. Some preliminary work is outlined here.
- Jaro-Winkler distance programs don't seem to have uniform outcomes. The Feigenbaum -jarowinkler.ado- program applies the Winkler correction to all scores, while Winkler himself applies it only when the Jaro score is greater than .7. Winkler's adjustment parameter is .1; other authors use other values. The distance for null and one-character strings varies across implementations. I can follow Feigenbaum, but seek input from all users.
- Winkler also has an adjustment for often confused characters (such as "X" and "K") which is not often used. Other authors have nickname lists. I would like to collect such lists and offer them for more general use.
- Some users have supplemented the data with additional variables. If you would like to allow others to use these new variables, I can add them to the common use files.
- If you have a crosswalk across decades, I would be happy to post it here for other users. Records can be identified by histid or the combination of datanum, serial and pernum. I will standardize the variable names by adding a 4 digit year to each.
- I have dropped 50,000 value labels which are simply ASCII
presentations of numeric values, such as:
label define value x 7 `7'
- There are 196 "P" records in 1940 with no corresponding "H" record. This does not happen in any other year, and those records are omitted from the .dta file.
- There are a number of records with blank histid.
- What is the appropriate sort order? Matching by histid requires records sorted by histid, which is not the native order.
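To make the Jaro-Winkler note above concrete, here is a minimal Python sketch in which the points of disagreement are explicit parameters. It is not a port of the Feigenbaum program or of Winkler's own code. The parameter names (`p`, `threshold`, `max_prefix`), the four-character prefix cap, and the choice to return 0 when either string is empty are assumptions made for this illustration. Setting `threshold=None` applies the prefix correction to every score; the default applies it only when the Jaro score exceeds .7.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]. Returns 0.0 if either string is empty
    (one of several conventions in use; an assumption of this sketch)."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, threshold=0.7, max_prefix=4):
    """Jaro score with the Winkler common-prefix adjustment.

    threshold=None applies the adjustment to all scores; otherwise it is
    applied only when the Jaro score is greater than threshold.
    """
    score = jaro(s1, s2)
    if threshold is not None and score <= threshold:
        return score
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1 - score)
```

For example, "JOHN" and "JOSE" have a Jaro score of 2/3, so `jaro_winkler("JOHN", "JOSE")` leaves it at about .667 while `jaro_winkler("JOHN", "JOSE", threshold=None)` raises it to about .733. A match rule with a cutoff near .7 would therefore classify this pair differently under the two conventions.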
Don't hesitate to contact me for computer issues, but I don't have that much domain-specific knowledge.
29 July 2019