Although the NBER NPI page is no longer maintained, those files will continue to be available. That offering does not include information about providers that were not current in April of 2019.
CMS offers a complete file of currently eligible providers each month, but does not offer a file that includes the full content of older deactivated records. For those records only the deactivation date is provided, which is not what a serious researcher wants. Here we offer a new dataset created by concatenating 15 monthly files from April 2007 to the present day, with roughly 12-month spacing between files. Our collection of monthly files was not perfectly regular. However, any provider that was active for at least a year at some point since April 2007 will be included, along with some others. We are certainly interested in obtaining files for 2005 and 2006. Although the omission of short-lived providers may be a source of bias in certain applications, such as studies of fraudulent providers, the presence of all reasonably persistent providers with their historical data is an improvement over using only survivors. The most recent file can be downloaded from the CMS website at https://download.cms.gov/nppes/NPI_Files.html
We have elected not to treat this as a rectangular panel, but to include a record only when there is a change in one of the data items. The files are large and many records are essentially duplicates. This does mean that selecting all the facility records for year X is not as simple as "keep if year==X". More on that below.
You might think that we could deduplicate by NPI and date of last update without loss of information, since a record should not change without a change to that date. Nevertheless, there are many changes to provider records without a change to lastupdate. In at least some cases, the only difference from the discarded records was the order of values in the multiplicative variables. The file offered here is therefore deduplicated by all variables except source file name and year. In principle NPI and lastupdate alone should be sufficient, but that does not appear to be the case.
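The deduplication rule described above can be sketched as follows. This is only an illustration in Python, not the code used to build the file. The field names npi, city, source and year and the file names are assumptions made for the example, and the rows are invented.

```python
def dedupe(records, ignore=("source", "year")):
    """Drop rows that match an earlier row on every field not in `ignore`."""
    seen = set()
    kept = []
    for rec in records:
        # Build a hashable key from every field except the ignored ones.
        key = tuple(sorted((k, v) for k, v in rec.items() if k not in ignore))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


# Hypothetical rows, ordered oldest source file first.
rows = [
    {"npi": "1000000001", "lastupdate": "2007-03-01", "city": "BOSTON",
     "source": "file_2007.csv", "year": 2007},
    {"npi": "1000000001", "lastupdate": "2007-03-01", "city": "BOSTON",
     "source": "file_2008.csv", "year": 2008},
    # The city changes here even though lastupdate does not.
    {"npi": "1000000001", "lastupdate": "2007-03-01", "city": "CAMBRIDGE",
     "source": "file_2009.csv", "year": 2009},
]
unique_rows = dedupe(rows)
```

In this example the 2008 row is dropped as a duplicate of the 2007 row, while the 2009 row is kept. Deduplicating on NPI and lastupdate alone would have discarded the 2009 row as well, and with it the change of city.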
We did include a variable, source, which gives the filename of the source file for the included record. If our deduplication represents a loss of information, then please contact us with an explanation and we will try to do better.
We have mentioned that there might not be a record for year t, if there was no change that year. To extract all records valid for year 2018, try the following code:
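Because a record appears only when something changed, the record valid for a given year is the most recent one dated on or before that year. Here is a minimal sketch of that selection in Python rather than Stata. It assumes each record carries npi and year fields and that there is at most one record per NPI per year. The sample rows are invented.

```python
def records_for_year(records, year):
    """Return, for each NPI, its latest record dated on or before `year`."""
    latest = {}
    for rec in records:
        if rec["year"] > year:
            continue  # this change happened after the year of interest
        current = latest.get(rec["npi"])
        if current is None or rec["year"] >= current["year"]:
            latest[rec["npi"]] = rec
    return list(latest.values())


# Hypothetical rows: provider A changed in 2016 and 2019, provider B in 2018,
# and provider C first appears in 2020.
rows = [
    {"npi": "A", "year": 2010, "city": "BOSTON"},
    {"npi": "A", "year": 2016, "city": "CAMBRIDGE"},
    {"npi": "A", "year": 2019, "city": "NEWTON"},
    {"npi": "B", "year": 2018, "city": "SALEM"},
    {"npi": "C", "year": 2020, "city": "LOWELL"},
]
valid_2018 = records_for_year(rows, 2018)
```

For 2018 this returns the 2016 record for provider A and the 2018 record for provider B, and nothing for provider C. The sketch does not check deactivation dates, so providers deactivated before the year of interest would have to be filtered out separately.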
All files are zipped. They are very fluffy. The full file takes 86GB in Stata but only 1.8GB compressed. The core variables (all but the multiplicative variables) take 18GB in Stata, but 1.8GB compressed. Possibly merging individual multiplicative datasets with the core dataset would be the most practical way to proceed.
In April of 2020 we were able to obtain weekly files back to March 9, 2015, suggesting that CMS retains these files for 5 years. They are not linked on the CMS website, but can be obtained by guessing the URL (which differs only in the date fields). We did not use the weekly files in this round.
The original .csv files have a header with variable descriptions that are not suitable as variable names in a database or statistical package. Therefore we have created variable names and turned the supplied header into variable labels.
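For users working from the raw .csv files themselves, one way to derive usable names from the header descriptions is sketched below in Python. This is not the naming scheme used for the files offered here. The function names are hypothetical, and the 32-character limit reflects the maximum length of a Stata variable name.

```python
import re

MAX_LEN = 32  # maximum length of a Stata variable name


def make_varname(label, max_len=MAX_LEN):
    """Turn a header description into a lowercase identifier."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", label).strip("_").lower()
    if not name or name[0].isdigit():
        name = "v_" + name  # names may not be empty or start with a digit
    return name[:max_len].rstrip("_")


def name_columns(labels):
    """Map each unique generated name to its original label."""
    names = {}
    for label in labels:
        base = make_varname(label)
        name, n = base, 1
        while name in names:
            # Truncation can cause collisions; add a numeric suffix.
            n += 1
            suffix = f"_{n}"
            name = base[:MAX_LEN - len(suffix)] + suffix
        names[name] = label
    return names


header = [
    "NPI",
    "Provider Organization Name (Legal Business Name)",
    "Healthcare Provider Taxonomy Code_1",
    "Healthcare Provider Taxonomy Code_2",
]
columns = name_columns(header)
```

The two taxonomy columns collide once truncated to 32 characters, which is why the sketch adds a suffix. The original description is kept as the value so it can serve as the variable label.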
We have not updated the crosswalks; presumably they are not affected, since UPINs have not been issued for many years.
We are very interested in speaking with users of this data, especially users of the older offering. Please write or call Daniel Feenberg (firstname.lastname@example.org, 617-863-0343). We expect to provide a more comprehensive file once we have discussed with users their needs.