Design and Methodology

This document provides design and methodology information about the data contained in SESTAT.

Overview

The Scientists and Engineers Statistical Data System (SESTAT) is a unified database recording employment, education, and other characteristics of the nation's scientists and engineers. These data are collected from three component surveys sponsored by the National Science Foundation (NSF) and conducted periodically throughout each decade. In the early 1990s, the data system was extensively redesigned, at which time the present names for the data system and the three component surveys were adopted.

Target Population

The target population for a data system is the specific population for which survey information is desired. SESTAT's target population consists of residents of the United States (U.S.) with a baccalaureate degree or higher who, as of the study's reference period, were noninstitutionalized, age 75 or younger, and either trained as or working as a scientist or engineer (S&E). A baccalaureate-or-higher degree is a bachelor's, master's, doctorate, or professional degree. To meet the S&E requirement, the U.S. resident had to (1) have at least one baccalaureate-or-higher degree in an S&E field or (2) have a baccalaureate-or-higher degree in a non-S&E field and work in an S&E occupation as of the reference week. For the 1993 SESTAT, the reference period was the week of April 15, 1993.
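This definition amounts to a simple screening rule applied to each person. The sketch below expresses it in Python; the record field names (is_us_resident, institutionalized, and so on) are hypothetical stand-ins, not actual SESTAT variable names.

# Illustrative screening rule for the 1993 SESTAT target population.
# All field names here are hypothetical, not actual SESTAT variables.

BACC_OR_HIGHER = {"bachelor's", "master's", "doctorate", "professional"}

def in_target_population(person: dict) -> bool:
    """Return True if a record meets the 1993 target-population definition."""
    # Residency, institutionalization, and age screens.
    if not person["is_us_resident"] or person["institutionalized"]:
        return False
    if person["age"] > 75:
        return False
    degrees = person["degrees"]  # list of (level, field) tuples
    # Must hold at least one baccalaureate-or-higher degree.
    if not any(level in BACC_OR_HIGHER for level, _ in degrees):
        return False
    # S&E requirement: an S&E degree, or a non-S&E degree plus an
    # S&E occupation as of the reference week.
    has_se_degree = any(level in BACC_OR_HIGHER and field == "S&E"
                        for level, field in degrees)
    return has_se_degree or person["works_in_se_occupation"]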

Component Surveys

For the reference week of April 15, 1993, SESTAT compiled data on the nation's scientists and engineers (S&Es) from three component surveys sponsored by the National Science Foundation (NSF).

National Survey of College Graduates.

The largest segment of SESTAT's target population was derived from the National Survey of College Graduates (NSCG), which was conducted by the U.S. Bureau of the Census. The Bureau selected the 1993 NSCG sample from the 1990 Decennial Census Long Form sample; hence the NSCG includes only that portion of the 1993 SESTAT target population that resided in the United States on April 1, 1990 or resided abroad as U.S. military personnel.

National Survey of Recent College Graduates.

The 1993 National Survey of Recent College Graduates (NSRCG) covers the portion of SESTAT's target population that received bachelor's or master's degrees in an S&E field from a U.S. educational institution between April 1, 1990 and June 30, 1992. The Institute for Social Research (ISR) of Temple University selected the samples of educational institutions and recent graduates for the 1993 NSRCG, and Westat, Inc. conducted the survey.

Survey of Doctorate Recipients.

The 1993 Survey of Doctorate Recipients (SDR) covers the portion of SESTAT's target population that received doctorate degrees in an S&E field from a U.S. educational institution between January 1, 1942 and June 30, 1992. The SDR is conducted by the National Research Council (NRC), which also maintains the Doctorate Records File, a historical database of U.S. doctorate recipients used in constructing the SDR sampling frame.

See "Population Coverage" for a description of the difference between SESTAT's desired target population and the population actually covered by the three component surveys.


Population Coverage

Some elements of SESTAT's desired target population were not included within the target populations of any of the three SESTAT component surveys. Bachelor's- and master's-level S&E-trained personnel missing from the survey frames are predominantly:

U.S. residents whose S&E bachelor's and/or master's degrees were received prior to April 1990 or from a foreign institution, and who resided outside the U.S. on April 1, 1990 (other than as U.S. armed forces stationed abroad).

U.S. residents with no baccalaureate or higher degree in any field as of April 1, 1990 who were awarded an S&E degree after June 1992 by a U.S. institution or after April 1990 by a foreign institution.

Doctorate-level S&E-trained personnel missing from the survey frames are predominantly:

U.S. residents whose S&E doctorates were received from foreign institutions and who resided outside the U.S. on April 1, 1990 (other than as U.S. armed forces stationed abroad).

U.S. residents with no baccalaureate or higher degree in any field as of April 1, 1990 who were awarded an S&E doctorate after June 1992 by a U.S. institution or after April 1990 by a foreign institution.

Some scientists and engineers had multiple chances of selection because they were linked to the sampling frames of more than one SESTAT component survey. This frame characteristic is referred to as multiplicity. As an example, a U.S. resident who held a bachelor's degree prior to 1990, went on to complete a master's degree in statistics in June 1990, and then earned a doctorate in June 1992 would have had an opportunity for selection in all three SESTAT component surveys. See "Weighting Strategy" for a discussion of how the SESTAT weights compensate for these multiple chances of selection.

Sample Designs

Probability sampling was used for the SESTAT component surveys to create a defensible basis for inference from the combined samples to the SESTAT target population. Selecting a probability sample requires locating a frame that identifies members of the target population, either directly or via linkage to other units (e.g., individuals to housing units). As scientists and engineers (S&Es) constitute only a small percentage of the U.S. population, it would have been cost prohibitive to survey the Nation as a whole to identify target population members for subsequent interview. Instead, SESTAT used a multiple-frame sampling approach to survey U.S. scientists and engineers. (See "Component Surveys.") Not all of the sampled cases were members of SESTAT's target population, however. The survey questionnaire incorporated screening questions to determine if sampled persons met SESTAT's target population definition.

Sample Design: National Survey of College Graduates

The 1993 National Survey of College Graduates (NSCG) used the 1990 Decennial Census Long Form sample to construct its sampling frame. Sampling was restricted to Long Form sampled individuals with baccalaureate-or-higher college degrees who, as of April 1, 1990, were age 72 or younger. A total of 4,728,000 Decennial Census Long Form sample individuals met these criteria, from which 214,643 were selected for the NSCG sample. The sample design can be described as a two-phase stratified random sample of individuals with baccalaureate-or-higher degrees.

Phase 1 was the Long Form sample design (a stratified systematic sample). Phase 2 was the subsampling of Long Form cases, which used a stratified design with probability-proportional-to-size systematic selection within strata. The Long Form sampling weight was used as the size measure in selection to achieve as close as possible to a self-weighting sample within Phase 2 strata. Phase 2 strata were defined based upon demographic characteristics, highest degree achieved, occupation, and sex. The minimum sampling rate was 3.00 percent, but most strata were sampled at rates between 2.03 and 2.82 percent. Successively lower rates were used for each of the following groups: whites with bachelor's or master's degrees and an S&E occupation; nonwhites with bachelor's or master's degrees and a non-S&E occupation; non-foreign-born doctorate recipients; and whites with bachelor's or master's degrees and a non-S&E occupation.

The 1993 NSCG achieved an unweighted response rate of 78 percent, yielding 148,932 interviews with individuals having baccalaureate-or-higher degrees and identifying an additional 19,224 cases ineligible for interview (e.g., deceased, over 75, not an S&E, no longer living in the U.S.). Interview data were then used to classify the respondents as to their membership in SESTAT's target population of scientists and engineers. A total of 74,693 of the survey respondents belonged to SESTAT's target population.
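To illustrate the Phase 2 selection mechanism, the following sketch implements generic probability-proportional-to-size (PPS) systematic sampling in Python, with the Phase 1 weight as the size measure. It is a minimal illustration of the technique, not the Census Bureau's production procedure, and the usage lines at the end are hypothetical.

import random

def pps_systematic_sample(units, sizes, n):
    """Probability-proportional-to-size systematic sampling: lay the
    units end to end on a cumulative size scale, pick a random start,
    then take every (total/n)-th point along the scale."""
    total = float(sum(sizes))
    skip = total / n
    start = random.uniform(0, skip)
    points = [start + i * skip for i in range(n)]
    sample, cumulative, j = [], 0.0, 0
    for unit, size in zip(units, sizes):
        cumulative += size
        while j < len(points) and points[j] <= cumulative:
            sample.append(unit)
            j += 1
    return sample

# Hypothetical usage within one Phase 2 stratum: the Phase 1 (Long Form)
# sampling weight serves as the size measure.
# subsample = pps_systematic_sample(stratum_cases, long_form_weights, n=500)

Using the Phase 1 weight as the size measure gives previously undersampled cases a proportionally higher chance of retention, so the product of the two phases' selection probabilities, and hence the final weight, is roughly constant within a stratum.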

Sample Design: National Survey of Recent College Graduates

The 1993 National Survey of Recent College Graduates (NSRCG) used a two-stage sample design, with educational institutions sampled in the first stage and bachelor's and master's graduates within the sampled institutions in the second stage. The Integrated Postsecondary Education Data System (IPEDS) was used to construct the sampling frame for educational institutions. IPEDS is a system of surveys sponsored by the National Center for Education Statistics to collect data from all U.S. educational institutions whose primary purpose is postsecondary education. For NSRCG sampling, the frame was restricted to those IPEDS data records associated with four-year U.S. colleges and universities offering bachelor's or master's degrees in one or more S&E fields. Of these institutions, 196 produced so many of the Nation's S&E graduates that they were selected with certainty. From the remaining institutions, 79 were selected using systematic, probability-proportional-to-size sampling, after sorting the file by ethnic status, region, public/private status, and presence of agriculture programs. The measures of size were devised to account for the rareness of certain fields of study and for the incidence of Hispanic, African-American, and foreign students.

Each sampled institution was asked to provide a roster of students receiving a bachelor's or master's degree in an S&E field between April 1, 1990 and June 30, 1992. From the 273 participating institutions, 25,785 students were selected using stratified sampling. Sampling rates ranged from 1 in 144 (for example, those receiving bachelor's degrees in psychology or degrees in nonspecified fields) to 1 in 2 (for example, bachelor's and master's degrees in materials engineering). A total of 19,426 eligible scientists and engineers responded.
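A minimal sketch of this two-stage logic follows, reusing pps_systematic_sample() from the previous sketch. The field names, certainty cutoff, and rates are hypothetical placeholders, not the 1993 NSRCG specifications.

import random

def select_institutions(frame, certainty_cutoff, n_noncertainty):
    """Stage 1: institutions whose size measure exceeds the cutoff enter
    with certainty; the rest are sorted (implicit stratification) and
    sampled with pps_systematic_sample() from the sketch above."""
    certainty = [i for i in frame if i["size_measure"] >= certainty_cutoff]
    remainder = sorted((i for i in frame if i["size_measure"] < certainty_cutoff),
                       key=lambda i: (i["ethnic_status"], i["region"],
                                      i["control"], i["has_agriculture"]))
    sizes = [i["size_measure"] for i in remainder]
    return certainty + pps_systematic_sample(remainder, sizes, n_noncertainty)

def select_graduates(roster, rate_by_stratum):
    """Stage 2: stratified selection from an institution's roster of
    1990-92 S&E graduates at field-specific rates, e.g. 1/144 for
    common fields such as psychology and 1/2 for rare ones."""
    return [g for g in roster if random.random() < rate_by_stratum[g["stratum"]]]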

Sample Design: Survey of Doctorate Recipients

The Survey of Doctorate Recipients (SDR) is a longitudinal survey of doctorate recipients, with samples of new cohorts added to the base sample every two years. To construct its sampling frame, the SDR uses the Doctorate Records File (DRF), a historical database derived from the Survey of Earned Doctorates, an ongoing census of all U.S. doctorate recipients since 1942. The SDR restricts the frame to S&E doctorate recipients under 76 years of age who are U.S. citizens or non-U.S. citizens with plans to remain in the U.S. after the degree is awarded. For the 1993 SDR, there were 568,726 age-75-or-younger S&Es on the sampling frame, from which 49,228 were sampled.

A two-phase sample design has been used for the SDR since 1991. Prior to 1991, the SDR design could be described as a deeply stratified, simple random sample of doctorate S&Es. Strata were defined based upon frame information and a "cohort" variable associated with the year the doctorate was received. Beginning in 1991, the number of strata was reduced, primarily by collapsing over the pre-1991 cohorts, and new stratification variables were introduced to facilitate oversampling of the disabled and specific minority groups. At that time, a new 1991 cohort sample was selected using the Phase 1 stratum definitions and sampling rates. This new cohort was added to the older cohort samples to create the Phase 1 sample for the 1991 SDR and subsequent years. Next, this Phase 1 sample was restratified using the newer stratum definitions. As minority group and disability status were unknown for older cohorts, a combination of frame and survey responses was used to assign members of the older cohorts to Phase 2 strata. These Phase 2 sample cases were then subsampled in 1991 (and to a lesser extent in 1993) to yield the desired sample allocations for each stratum. For the 1993 SDR, the sample for the new cohort (1992-93 graduates) was selected as an independent supplement to the older cohort sample. The new cohort sample was selected using stratified simple random sampling, with sampling rates and stratum definitions comparable to those of the Phase 2 older cohort sample.

The overall 1993 sampling rate was 8.8 percent, but rates for individual sampling strata ranged from 4.5 percent to 66.7 percent. Strata sampled at 66.7 percent included Native American female doctorate recipients in earth/ocean/atmospheric sciences and disabled female doctorate recipients in electrical/electronics/communications engineering. Strata with the lowest sampling rates were white males receiving doctorates in economics or other social sciences. A total of 39,495 eligible scientists and engineers responded to the 1993 SDR.

Data Collection

The Survey Questionnaires. The questionnaire administered in each of the three surveys is largely the same--roughly 90 percent of the questions are identical. The remaining questions are survey-specific, that is, they collect information that is relevant to only that survey's population. Because two of the three surveys used a mixed-mode approach beginning with a self-administered mail questionnaire, the questionnaires were carefully designed to be as "mode-neutral" as possible, since the mode used for administering the questionnaire (e.g., self-administered by mail, by telephone, or in person) can influence a person's responses. The draft 1990s SESTAT mail questionnaires were pretested in focus groups. Questionnaires were distributed at the start of each focus group, and the participants were asked to complete the questionnaire as if it had just arrived in the mail. Once the participants had completed their questionnaires, the focus group moderator, using a "think aloud" approach, probed for any problems participants might have experienced while completing the questionnaire.

Mode of Administration. Mode of administration refers to how a survey is conducted, that is, by mail, by telephone, or in person. The NSCG and SDR are both mixed-mode surveys, while the NSRCG is primarily conducted as a telephone survey. More specifically:

The NSCG is a mail survey with telephone and in-person follow-up of sample members who fail to respond by mail. The telephone follow-up is conducted as a computer-assisted telephone interview (CATI). Sample members who did not respond by mail and could not be reached by telephone were targeted for an in-person interview. These efforts achieved an overall 1993 weighted response rate of 80 percent: 58 percent by mail, 12 percent by telephone (CATI), and 10 percent in person.

The NSRCG was primarily conducted as a computer-assisted telephone interview (CATI). A handful of sample members, inaccessible by telephone, were sent a mail questionnaire. In 1993, the NSRCG achieved a weighted response rate of 84 percent.

The SDR is a mail survey with telephone follow-up of sample members who did not respond by mail. The telephone follow-up was conducted as a CATI interview by Mathematica Policy Research for the National Research Council. The 1993 SDR weighted response rate was 87 percent: 66 percent by mail and 21 percent by telephone (CATI).


Editing Guidelines and Procedures

The three SESTAT surveys were conducted by three separate survey data collection contractors. As a consequence, NSF developed rules to standardize the editing procedures across the three surveys, and all contractors used the same procedures for editing their respective surveys. All editing was completed after critical item conflicts were resolved and the "best coding" and "other, specify" coding procedures were completed. The editing rules include: (1) valid code range edits; (2) skip error edits; (3) mark-one edits for questions with more than one response marked; and (4) consistency edits, as illustrated in the sketch below. Procedures were also developed for general editing issues, such as distinguishing between questions that are "refused," "don't know," or "blank"; rounding rules for decimals or fractions; missing data on questions with a series of "yes/no" responses; number of employees; coding of primary and secondary work activities; coding of the most and second most important reasons for working outside the field of highest degree; and coding of the most important reason for attending training.
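The four edit classes map naturally onto small rule checks. The following sketch shows one hypothetical example of each class; the variable names, valid ranges, skip logic, and resolution rules are invented for illustration and are not the actual SESTAT edit specifications.

def edit_record(rec):
    """Run the four classes of SESTAT-style edits on one record and
    return a list of problems found; the rules are invented examples."""
    problems = []
    # (1) Valid code range edit: the code must be in the allowed set.
    if rec["marital_status"] not in {1, 2, 3, 4, 5}:
        problems.append("range edit: marital_status out of range")
    # (2) Skip error edit: employment items should be blank for
    # respondents who reported not working.
    if rec["working"] == "no" and rec["employer_name"] is not None:
        problems.append("skip edit: employer_name answered by nonworker")
    # (3) Mark-one edit: several boxes marked on a mark-one question;
    # resolve with a fixed rule (here, keep the highest degree code).
    if len(rec["highest_degree_marks"]) > 1:
        rec["highest_degree"] = max(rec["highest_degree_marks"])
        problems.append("mark-one edit: highest_degree resolved")
    # (4) Consistency edit: two answers that cannot both be true.
    if rec["degree_year"] < rec["birth_year"] + 15:
        problems.append("consistency edit: degree_year precedes birth_year + 15")
    return problems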

Occupation and Education Coding. Special coding procedures were developed to improve data quality and comparability for occupation and education codes. In most of the SESTAT surveys, respondents self-select occupation and education codes from the job and education code lists at the end of the questionnaire; CATI respondents instead choose codes through a series of questions that begin with broad categories and narrow the selection to a specific category. The focus of the special coding procedures is the correction of respondent reporting errors. During coding, a variety of respondent-related information, standardized references, and procedures are combined to arrive at the "best code" for the response.

The "best code" for occupation is determined by using between 14 and 16 factors such as: (1) the respondent's open-ended response; (2) employer name and address; (3) whether primary employer was an educational institution; (4) the type of primary employer; (5) the number of people respondent supervised directly and indirectly; (6) the relationship between their work and education; (7) field of highest degree awarded; (8) any other degree fields; (9) primary work activity; (10) secondary work activity; (11) other work activities; (12)salary; (13) for CATI respondents only--which broad category was chosen first; (14) for SDR respondents only--tenure status; (15) marginal notes; and (16) respondent's self selected code. There are four situations when a best code differed from a respondent's self-code: (1) the respondent provides an open-ended answer but does not provide and NSF Job Code or provides a clearly invalid code; (2) the respondent chooses the "general" 500 code; (3) the respondent chooses a specific residual category such as 027 other biological/life science; (4) the respondent chooses a specific code determined after reviewing pertinent information to be in error. "Best codes" were only assigned when there was sufficient evidence for a better code.

The "best code" for education is determined by using one of two "flow charts." One "flow chart" outlines the procedures for verbatims that list one major and the other "flow chart" is the procedures for verbatims that list two majors. The "flow charts" operationalized the education coding rules and standardized their use. These rules include: (1) rules for exact matches; (2) rules for single, broad, nonspecific fields; and (3) rules for assigning the most specific NSF education code. "Best codes" for education are assigned after determining if the respondent selected a code that is too general, the respondent transposed the code numbers, or the numbers had been written incorrectly. Education codes were not "best coded" if the respondent selected code was more specific than the respondent verbatim and both verbatim and code are in the same field; the verbatim is more specific than the self-selected code and both are in the same field; or the verbatim and the selected code could fall under the same broad educational category. Only those cases where it is clearly evident that the self-code is incorrect is a "best code" assigned.

Other, Specify Coding. The purpose of editing "other, specify" responses is to identify entries that belong in existing response categories. This procedure is called "back-coding." "Other, specify" responses often fall into one of the following categories: (1) a response that should have been coded in an existing response category; (2) a response that is a "legitimate" other response; or (3) a response that is not a legitimate response (e.g., one that does not answer the question). Only the first category of "other, specify" responses was "back-coded."
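Back-coding can be approximated with a simple keyword lookup, as in the hypothetical sketch below. The keyword-to-category map and category names are invented for illustration and do not reflect actual SESTAT response categories or coding practice.

# Hypothetical keyword-to-category map; real categories differ.
BACKCODE_MAP = {
    "postdoc": "postdoctoral appointment",
    "stay at home": "family responsibilities",
}

def backcode(verbatim, existing_categories):
    """Return the existing category an "other, specify" verbatim belongs
    to (case 1), or None to leave it as a legitimate or invalid other
    response (cases 2 and 3)."""
    text = verbatim.lower()
    for keyword, category in BACKCODE_MAP.items():
        if keyword in text and category in existing_categories:
            return category
    return None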


Missing Data Imputation

For their interview to be considered "complete," 1993 SESTAT respondents had to answer designated "critical" questions such as degrees received and occupation. When possible, follow-up telephone calls were used to complete critical items for otherwise complete questionnaires. (See "Editing Guidelines and Procedures" for further details.)

With the exception of items with verbatim responses, noncritical data items had missing data replaced, or "imputed." When imputation occurred for an item, a new variable name was assigned to record the revised data, and an imputation indicator flag was created to record when the data value had been imputed. The imputation of missing questionnaire data occurred after all logical editing had been completed.

Sequential hot deck imputation was used to replace the missing values for data items in the survey database. Hot deck imputation replaces a missing value for a particular data item with a nonmissing response from another data record, the "donor," associated with an individual considered "similar" to the individual whose record has the missing value, the "recipient." Sequential hot deck procedures use as the donor record the last previously encountered record with a nonmissing response for the data item.

To ensure that adjacent data records were similar, each component survey grouped its respondent data records into imputation classes, using variables thought to be strongly or even uniquely associated with the data item subject to imputation. A donor record was selected only from those records that belonged to the same imputation class as the recipient record with the missing item data.

Prior to imputation, the component surveys also sorted the data records within each imputation class by variables thought to be associated with the answer for the data item as well as the propensity for nonresponse to the data item. Serpentine sorting was used as it ensured that adjacent data records were as similar as possible. In serpentine sorting, the sort order is reversed as boundaries are crossed for higher level sort variables.
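Taken together, the three steps (imputation classes, serpentine sorting, and donor carry-forward) can be sketched as follows in Python with pandas. The column-name arguments are hypothetical, pandas has no built-in serpentine sort, and this recursion is a minimal illustration rather than the contractors' production routines.

import pandas as pd

def serpentine_sort(df, keys):
    """Sort by the first key, recursively sort within each group by the
    remaining keys, and reverse every other group so the sort direction
    flips whenever a higher-level boundary is crossed."""
    if not keys:
        return df
    first, rest = keys[0], keys[1:]
    pieces, ascending = [], True
    for _, group in df.sort_values(first).groupby(first, sort=True):
        piece = serpentine_sort(group, rest)
        pieces.append(piece if ascending else piece.iloc[::-1])
        ascending = not ascending
    return pd.concat(pieces)

def sequential_hot_deck(df, item, class_vars, sort_vars):
    """Fill missing values of `item` with the last nonmissing value
    encountered within the same imputation class (sequential hot deck),
    recording an imputation flag for each filled value."""
    df = df.copy()
    df[item + "_imputed"] = df[item].isna()
    parts = []
    for _, cls in df.groupby(class_vars):      # one imputation class at a time
        cls = serpentine_sort(cls, sort_vars)  # make adjacent records similar
        cls = cls.copy()
        cls[item] = cls[item].ffill()          # donor = last nonmissing record
        parts.append(cls)
    return pd.concat(parts)

Note that a recipient at the top of a class has no preceding donor and remains missing in this sketch; handling such cases would require an additional rule.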


Weighting Strategy

To derive unbiased survey estimates, estimation procedures must incorporate the selection probabilities for each sampling unit. SESTAT selection probabilities vary greatly from unit to unit due to the extensive oversampling used to facilitate analyses of smaller populations such as Native Americans and the disabled and of less commonly chosen fields of study. Nonresponse and undercoverage also lead to distortions of the sample with respect to the population of interest. SESTAT has removed some of the complexities associated with survey data analysis by constructing sampling weights that reflect the differential selection probabilities and then adjusting these weights to compensate for nonresponse and undercoverage. These adjusted sampling weights become the analysis weights, which have been added to each individual's record in the survey database.

Each component survey first developed its own independent analysis weights. Each survey defined the sampling weight as the reciprocal of the probability of selection for each sampled unit. The sampling rates varied substantially across and within the component surveys; for SESTAT as a whole, the resulting weights ranged from 1 to 436. Next, each component survey adjusted for nonresponse using weighting-class or poststratification adjustment procedures, as sketched below. The NSCG used poststratification to adjust the sampling weights for survey respondents to the 1990 Decennial Census Long Form sample estimates. The NSRCG sampling weight underwent both a weighting-class nonresponse adjustment and a ratio adjustment to reflect known proportions in the population. The SDR sampling weight underwent a weighting-class adjustment for nonresponse. The resulting analysis weights are included on the SESTAT database (as "Z_WEIGHTING_FACTOR_SURVEY") and can be used in making estimates for the individual surveys.
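The sketch below illustrates the base-weight calculation and a generic weighting-class nonresponse adjustment in Python with pandas. The column names are hypothetical, and the actual class definitions and control totals were survey-specific.

import pandas as pd

def weighting_class_adjustment(df, class_vars):
    """Within each weighting class, inflate respondents' base weights
    (1 / selection probability) so that respondents carry the full
    base-weight total of the class, compensating for nonresponse."""
    def adjust(cls):
        factor = cls["weight"].sum() / cls.loc[cls["responded"], "weight"].sum()
        out = cls[cls["responded"]].copy()
        out["weight"] *= factor
        return out
    return df.groupby(class_vars, group_keys=False).apply(adjust)

# Hypothetical usage:
# df["weight"] = 1.0 / df["p_select"]   # base weight = reciprocal of P(selection)
# respondents = weighting_class_adjustment(df, ["stratum", "degree_level"])

Poststratification and ratio adjustments work the same way, except that the numerator of the adjustment factor is an external control total (for the NSCG, the Long Form sample estimate) rather than the class's base-weight total.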

The component survey databases were designed to be combined in analysis to capture the advantages of increased sample sizes and the greater coverage of the target population. In combining the three survey databases, SESTAT had to address issues of cross-survey multiplicity. Depending upon the degrees they held and when the degrees were received, scientists and engineers could belong to the surveyed population of more than one component survey. For instance, a person holding a bachelor's degree at the time of the 1990 Census who goes on to complete a master's degree in 1991 would have opportunities for selection in both the NSCG and the NSRCG. A unique-linkage rule was devised to remove these multiple selection opportunities by uniquely linking each member of SESTAT's target population to one and only one component survey and then including the individual in SESTAT only when selected for the linked survey. Under the unique-linkage rule, each person had only one chance of being selected into the combined SESTAT database. The rule linked cases with multiple selection opportunities to the SDR first, then to the NSRCG if the case was not linked to the SDR. Sampled individuals in each component survey were examined to determine for which other component surveys (if any) they had an opportunity of selection. NSCG sampled individuals who had an opportunity for selection in the NSRCG or SDR were assigned zero as their SESTAT analysis weight. Similarly, NSRCG sampled individuals who had an opportunity for selection in the SDR were assigned zero as their SESTAT analysis weight. All other cases had their component survey's analysis weight carried over as their SESTAT analysis weight. The SESTAT weight on the database (called "Z_WEIGHTING_FACTOR") should be used when analyzing SESTAT data derived from the three component surveys.
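Expressed as code, the unique-linkage rule reduces to a short priority check. The sketch below is illustrative, with hypothetical field names rather than actual SESTAT variables.

def sestat_weight(case):
    """Apply the unique-linkage rule: a case keeps its component-survey
    analysis weight only if the person is linked to that survey; cases
    with a selection opportunity in a higher-priority survey (SDR first,
    then NSRCG, then NSCG) receive a SESTAT weight of zero."""
    if case["survey"] == "NSCG" and (case["in_sdr_frame"] or case["in_nsrcg_frame"]):
        return 0.0
    if case["survey"] == "NSRCG" and case["in_sdr_frame"]:
        return 0.0
    return case["analysis_weight"]  # becomes Z_WEIGHTING_FACTOR for the case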


Updated: February 25, 1998