Design and Methodology
This document provides design and methodology information about the data contained in SESTAT.
The Scientists and Engineers Statistical Data System (SESTAT) is a unified database recording
employment, education, and other characteristics of the nation's scientists and engineers. These data
are collected from three component surveys sponsored by the National Science Foundation (NSF)
and conducted periodically throughout each decade:
- National Survey of College Graduates (NSCG)
- National Survey of Recent College Graduates (NSRCG)
- Survey of Doctorate Recipients (SDR).
In the early 1990s, the data system was extensively redesigned, at which time the present names for the
data system and the three component surveys were adopted.
The target population for a data system is the specific population for which survey information is
desired. SESTAT's target population is residents of the United States (U.S.) with a baccalaureate
degree or higher who, as of the study's reference period, were noninstitutionalized, age 75 or younger,
and either trained as or working as a scientist or engineer (S&E).
A baccalaureate-or-higher degree is a bachelor's, master's, doctorate, or professional degree.
To meet the S&E requirement, the U.S. resident had to (1) have at least one baccalaureate-or-higher
degree in an S&E field or (2) have a baccalaureate-or-higher degree in a non-S&E field and work in an
S&E occupation as of the reference week.
For the 1993 SESTAT, the reference period was the week of April 15, 1993.
For that reference week, SESTAT compiled data on the nation's scientists and engineers (S&Es) from
the three NSF-sponsored component surveys described below.
National Survey of College Graduates.
The largest segment of SESTAT's target population was
derived from the National Survey of College Graduates (NSCG), which was conducted by the U.S.
Bureau of the Census. The Bureau selected the 1993 NSCG sample from the 1990 Decennial Census
Long Form sample; hence the NSCG includes only that portion of the 1993 SESTAT target population
that resided in the United States on April 1, 1990 or resided abroad as U.S. military personnel.
National Survey of Recent College Graduates.
The 1993 National Survey of Recent College
Graduates (NSRCG) covers the portion of SESTAT's target population that received bachelor's and
master's degrees in an S&E field from a U.S. educational institution between April 1, 1990 and June 30,
1992. The Institute for Survey Research (ISR) at Temple University selected the samples of educational
institutions and recent graduates for the 1993 NSRCG, and Westat, Inc. conducted the survey.
Survey of Doctorate Recipients.
The 1993 Survey of Doctorate Recipients (SDR) covers the portion
of SESTAT's target population that received doctorate degrees in an S&E field from a U.S. educational
institution between January 1, 1942 and June 30, 1992. The SDR is conducted by the National Research
Council (NRC), which also maintains the Doctorate Records File, a historic database of U.S. doctorate
recipients used in constructing the SDR's sampling frame.
Some elements of SESTAT's desired target population were not included within the target populations
of any of the three SESTAT component surveys. See "Population Coverage" for a description of the
difference between the desired target population and the surveyed population.
As noted above, some elements of SESTAT's desired target population fell outside the frames of the
three component surveys. Bachelor's- and master's-level S&E-trained personnel missing from the survey
frames are predominantly:
- U.S. residents whose S&E bachelor's and/or master's degrees were received prior to April 1990
or from a foreign institution, who resided outside the U.S. on April 1, 1990 but not as U.S. armed
forces stationed abroad.
- U.S. residents with no baccalaureate-or-higher degree in any field as of April 1, 1990 who were
awarded an S&E degree after June 1992 by a U.S. institution or after April 1990 by a foreign
institution.
Doctorate-level S&E-trained personnel missing from the survey frames are predominantly:
- U.S. residents with S&E doctorates received after June 1992 or from a foreign institution, with no baccalaureate-or-higher degree in any field as of April 1, 1990 and no bachelor's or master's S&E degree received from a U.S. institution between April 1, 1990 and June 30, 1992.
- U.S. residents with S&E doctorates received after June 1992 or from a foreign institution but with no bachelor's or master's S&E degree received from a U.S. institution between April 1, 1990 and June 30, 1992, who resided outside the U.S. on April 1, 1990 but not as U.S. armed forces stationed abroad.
Some scientists and engineers had multiple chances of selection because they were linked to the sampling frames for more than one SESTAT component survey. This frame characteristic is referred to as multiplicity. As an example, a
U.S. resident with a bachelor's degree prior to 1990, who went on to complete a master's degree in statistics in June 1990 and then a doctorate degree in June 1992, would
have an opportunity for selection for all three SESTAT component surveys. See "Weighting Strategy"
for a discussion of how the SESTAT weights compensate for these multiple chances of selection.
Probability sampling was used for the SESTAT component surveys to create a defensible basis for
inference from the combined samples to the SESTAT target population. Selecting a probability sample
requires locating a frame that identifies members of the target population, either directly or via linkage
to other units (e.g., individuals to housing units). As scientists and engineers (S&Es) constitute only a
small percentage of the U.S. population, it would have been cost prohibitive to survey the Nation as a
whole to identify target population members for subsequent interview. Instead, SESTAT used a
multiple-frame sampling approach to survey U.S. scientists and engineers. (See "Component Surveys.")
Not all of the sampled cases were members of SESTAT's target population, however. The survey
questionnaire incorporated screening questions to determine if sampled persons met SESTAT's target
population definition.
The 1993 National Survey of College Graduates (NSCG) used the 1990 Decennial Census Long Form sample to construct its sampling frame. Sampling was
restricted to Long Form sampled individuals with baccalaureate-or-higher college degrees who as of April 1, 1990 were age 72 or younger. A total of 4,728,000 Decennial Census Long Form sample individuals met
these criteria, from which 214,643 were selected for the NSCG sample. The sample design can be described as a two-phase stratified random sample of individuals with baccalaureate-or-higher degrees.
Phase 1 was the Long Form sample design (a stratified systematic sample). Phase 2 was the
subsampling of Long Form cases, which used a stratified design with probability-proportional-to-size,
systematic selection within strata. The Long Form sampling weight was used as the size measure in
selection to achieve as close to a self-weighting sample as possible within Phase 2 strata. Phase 2 strata
were defined based upon demographic characteristics, highest degree achieved, occupation, and sex.
The maximum sampling rate was 3.00 percent, but most strata were sampled at rates between 2.03 and
2.82 percent. Successively lower rates were used for each of the following groups: whites with
bachelor's or master's degrees and a science and engineering (S&E) occupation; nonwhites with
bachelor's or master's degrees and a non-S&E occupation; non-foreign-born doctorate recipients; and
whites with bachelor's or master's degrees and a non-S&E occupation. The 1993 NSCG achieved an
unweighted response rate of 78 percent, yielding 148,932 interviews with individuals having
baccalaureate-or-higher degrees and identifying an additional 19,224 cases ineligible for interview (e.g.,
deceased, over 75, not an S&E, no longer living in the U.S.). Interview data were then used to classify
the respondents as to their membership in SESTAT's target population of scientists and engineers. A
total of 74,693 of the survey respondents belonged to SESTAT's target population of scientists and
engineers.
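The Phase 2 mechanics can be sketched in a few lines of code. The following Python is a minimal, hypothetical illustration of probability-proportional-to-size (PPS) systematic selection within a single stratum, with each case's Long Form sampling weight serving as its measure of size; the data, field names, and function name are invented for illustration and are not NSCG values.

```python
import random

def pps_systematic_sample(cases, n):
    """Select n cases with probability proportional to case['size'] via
    systematic selection through the cumulated size measures."""
    total = sum(c["size"] for c in cases)
    step = total / n                        # skip interval
    point = random.uniform(0, step)         # random start
    sample, cum = [], 0.0
    for c in cases:
        cum += c["size"]
        while point <= cum and len(sample) < n:
            sample.append(c)
            point += step
    return sample

# Hypothetical Phase 2 stratum: cases carrying their Long Form weights as sizes.
stratum = [{"id": i, "size": w} for i, w in enumerate([50, 120, 80, 200, 95, 60])]
print([c["id"] for c in pps_systematic_sample(stratum, 2)])
```

Because a case's selection probability here is proportional to its Long Form weight, the product of the two phase weights is roughly constant within the stratum, which is how the design approaches a self-weighting Phase 2 sample.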
The 1993 National Survey of Recent College
Graduates (NSRCG) used a two-stage sample design, with educational institutions sampled in the first
stage and then bachelor's and master's graduates within the sample institutions for the second stage.
The Integrated Postsecondary Education Data System (IPEDS) was used to construct the sampling frame
for educational institutions. IPEDS is a system of surveys sponsored by the National Center for
Education Statistics to collect data from all U.S. educational institutions whose primary purpose is
postsecondary education. For NSRCG sampling, the frame was restricted to those IPEDS data records
associated with four-year U.S. colleges and universities offering bachelor's or master's degrees in one
or more S&E fields. Of these institutions, 196 produced so many of the Nation's S&E graduates that
they were selected with certainty. From the remaining institutions, 79 were selected using
systematic, probability-proportional-to-size sampling, after sorting the file by ethnic status, region,
public/private status, and presence of agriculture. The measures of size were devised to account for the
rareness of certain fields of study and for the incidence of Hispanic, African-American, and foreign
students. Each sampled institution was asked to provide a roster of students receiving a bachelor's or
master's degree in an S&E field between April 1, 1990 and June 30, 1992. From the 273 participating
institutions, 25,785 students were selected using stratified sampling. Sampling rates ranged from 1 in
144 (for example, those receiving bachelor's degrees in psychology, or degrees in nonspecified fields)
to 1 in 2 (for example, bachelor's and master's degrees in materials engineering). A total of 19,426
eligible scientists and engineers responded.
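The interplay of certainty selection and PPS sampling in the first stage can be illustrated as follows. This hedged sketch, which reuses pps_systematic_sample from the NSCG sketch above, repeatedly pulls out institutions whose size measure meets the current skip interval (they are taken with probability one) and samples the remainder PPS-systematically; all institution names and size measures are hypothetical.

```python
def select_institutions(institutions, n):
    """First-stage selection: certainty institutions plus a PPS-systematic
    sample of the rest (reuses pps_systematic_sample from the sketch above)."""
    selected, rest = [], list(institutions)
    while rest and len(selected) < n:
        step = sum(i["size"] for i in rest) / (n - len(selected))
        certain = [i for i in rest if i["size"] >= step]
        if not certain:
            break
        selected += certain                     # taken with probability 1
        rest = [i for i in rest if i["size"] < step]
    if len(selected) < n:
        selected += pps_systematic_sample(rest, n - len(selected))
    return selected

# E.g., sizes built from S&E degree counts, inflated for rare fields and for
# Hispanic, African-American, and foreign enrollment (as the text describes).
schools = [{"name": "U%d" % i, "size": s} for i, s in
           enumerate([5000, 120, 90, 300, 60, 45, 800, 70])]
print([s["name"] for s in select_institutions(schools, 4)])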
The Survey of Doctorate Recipients (SDR) is a longitudinal survey
of doctorate recipients, with samples of new cohorts added to the base sample every two years. To
construct its sampling frame, SDR uses the Doctorate Records File (DRF), a historic database derived
from the Survey of Earned Doctorates, an ongoing census of all U.S. doctorate recipients since 1942.
SDR restricts the frame to S&E doctorate recipients under 76 years of age who are U.S. citizens or non-U.S. citizens with plans to remain in the U.S. after degree award. For the 1993 SDR, there were
568,726 age-75-or-younger S&Es on the sampling frame, from which 49,228 were sampled. A two-phase sample design has been used for the SDR since 1991. Prior to 1991, the SDR design could be
described as a deeply stratified, simple random sample of doctorate S&Es. Strata were defined based
upon frame information and a "cohort" variable associated with the year the doctorate was received.
Beginning in 1991, the number of strata was reduced, primarily by collapsing over the pre-1991
cohorts, and new stratification variables were introduced to facilitate oversampling of the disabled
and specific minority groups. At that time, a new 1991 cohort sample was selected using the Phase 1
stratum definitions and sampling rates. This new cohort was added to the older cohort samples to create
the Phase 1 sample for the 1991 SDR and subsequent years. Next, this Phase 1 sample was restratified
using the newer stratum definitions. As minority-group and disability status were unknown for older
cohorts, a combination of frame and survey responses was used to assign members of the older cohorts
to Phase 2 strata. These Phase 2 sample cases were then subsampled in 1991 (and to a lesser extent in
1993) to yield the desired sample allocations for each stratum. For the 1993 SDR, the sample for the
new cohort (1992-93 graduates) was selected as an independent supplement to the older cohort sample.
The new cohort sample was selected using stratified simple random sampling, with sampling rates and
stratum definitions comparable to those of the Phase 2 older cohort sample. The overall 1993 sampling
rate was 8.8 percent, but rates for individual sampling strata ranged from 4.5 percent to 66.7 percent.
Those strata sampled at 66.7 percent included Native American female doctorate recipients in
earth/ocean/atmospheric sciences and disabled female doctorate recipients in
electrical/electronics/communications engineering. Those strata with the lowest sampling rates were
white males receiving doctorates in economics or other social sciences. A total of 39,495 eligible scientists
and engineers responded to the 1993 SDR.
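The new-cohort selection amounts to stratified simple random sampling with stratum-specific rates. The sketch below illustrates the idea with invented strata and rates standing in for the 4.5-to-66.7 percent range reported above; it is not the actual SDR stratification.

```python
import random

def stratified_srs(frame, rates):
    """Simple random sample within each stratum at a stratum-specific rate."""
    sample = []
    for stratum, rate in rates.items():
        members = [u for u in frame if u["stratum"] == stratum]
        sample += random.sample(members, round(rate * len(members)))
    return sample

# Hypothetical frame: a heavily oversampled rare stratum and a common one.
frame = ([{"id": i, "stratum": "rare"} for i in range(30)] +
         [{"id": i, "stratum": "common"} for i in range(30, 1030)])
picked = stratified_srs(frame, {"rare": 0.667, "common": 0.045})
print(len(picked))   # 20 + 45 = 65 cases
```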
The Survey Questionnaires. The questionnaire administered in each of these surveys is largely the same--roughly 90 percent of the questions are identical. The remaining questions are survey-specific, that is, they
collect information that is relevant to only that survey's population. Given that two of the three surveys used
a mixed mode approach, beginning with a self-administered mail questionnaire, these questionnaires were
carefully designed to be as "mode-neutral" as possible since the mode used for administering the questionnaire
(e.g., self-administered, by telephone or in-person) can influence a person's responses. The drafted 1990s
SESTAT mail questionnaires were pretested in focus groups. Questionnaires were distributed at the start of
the focus group and the focus group participants were asked to complete the questionnaire as if the
questionnaire had just arrived in the mail. Once the participants had completed their questionnaires, the
focus group moderator, using a "Think Aloud" approach, probed for any problems participants might have
experienced while completing the questionnaire.
Mode of Administration. Mode of administration refers to how a survey is conducted, that is, by mail, by
telephone, or in person. The NSCG and SDR are both mixed-mode surveys, while the NSRCG is primarily
conducted as a telephone survey. More specifically:
The NSCG is a mail survey with telephone and in-person follow-up of sample members who
fail to respond by mail. The telephone follow-up is conducted as a computer-assisted
telephone interview (CATI). Sample members who did not respond by mail and were not
available by telephone were targeted for an in-person interview. These efforts achieved an
overall 1993 weighted response rate of 80 percent: 58 percent by mail, 12 percent by
telephone (CATI), and 10 percent in person.
The NSRCG was primarily conducted as a computer assisted telephone interview (CATI).
A handful of sample members, inaccessible by telephone, were sent a mail questionnaire. In
1993, the NSRCG achieved a weighted response rate of 84 percent.
The SDR is a mail survey with telephone follow-up of sample members who did not respond
by mail. The telephone follow-up was conducted as a computer-assisted telephone (CATI)
interview by Mathematica Policy Research for the National Research Council. The 1993
SDR weighted response rate was 87 percent: 66 percent by mail and 21 percent by
telephone (CATI).
The three SESTAT surveys were conducted by three separate survey data collection contractors.
As a consequence, NSF developed rules in order to standardize the editing procedures across the
three SESTAT surveys. All contractors used the same editing procedures for editing their respective
surveys. All editing procedures were completed after critical item conflicts were resolved, and the
"best coding" and "other, specify" coding procedures were completed. The editing rules include:
(1) valid code range edits; (2) skip error edits; (3) mark-one edits for questions with more than one
response marked; and (4) consistency edits. Procedures were developed for general editing rules such
as distinguishing between questions that are "refused," "don't know," or "blank;" rounding rules for
decimals or fractions; missing data on questions with a series of "yes/no" responses; number of
employees; coding primary and secondary work activities and most and second most important reason
for working outside field of highest degree; and most important reason for attending training.
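As an illustration, the four edit types named above might be implemented as checks like the following. The field names, valid ranges, and skip logic in this sketch are invented for illustration and are not the actual SESTAT specifications.

```python
# Hypothetical valid ranges for two numeric items.
VALID_RANGES = {"salary": (0, 999999), "hours_worked": (0, 168)}

def edit_record(rec):
    """Apply the four SESTAT-style edit checks to one questionnaire record."""
    problems = []
    # (1) valid code range edits
    for field, (lo, hi) in VALID_RANGES.items():
        if field in rec and not lo <= rec[field] <= hi:
            problems.append(field + " out of range")
    # (2) skip error edits: answers present that the skip pattern rules out
    if rec.get("working") == "no" and "salary" in rec:
        problems.append("skip error: salary reported by a nonworker")
    # (3) mark-one edits: more than one response to a mark-one question
    if isinstance(rec.get("highest_degree"), list) and len(rec["highest_degree"]) > 1:
        problems.append("mark-one violation: multiple highest degrees")
    # (4) consistency edits: internally contradictory answers
    if rec.get("degree_year", 0) > rec.get("reference_year", 1993):
        problems.append("consistency: degree year after reference year")
    return problems

print(edit_record({"working": "no", "salary": 42000, "degree_year": 1995,
                   "reference_year": 1993, "highest_degree": ["BA", "MS"]}))
```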
Occupation and Education Coding. Special coding procedures were developed to increase the data quality and comparability for occupation and education codes. On a majority of the SESTAT surveys, respondents self-select occupation and education codes from job and education code lists at the end of the questionnaire. The remainder are chosen by CATI respondents through a series of questions that begin with the broad categories and narrow the selection to the specific category. The focus of the special coding
procedures is the correction of respondent reporting errors. During coding, the "best code" is determined by combining a variety of respondent related information and standardized references and procedures to arrive at the best code for the response.
The "best code" for occupation is determined by using between 14 and 16 factors such as: (1)
the respondent's open-ended response; (2) employer name and address; (3) whether primary
employer was an educational institution; (4) the type of primary employer; (5) the number of people
respondent supervised directly and indirectly; (6) the relationship between their work and education;
(7) field of highest degree awarded; (8) any other degree fields; (9) primary work activity; (10)
secondary work activity; (11) other work activities; (12) salary; (13) for CATI respondents only--which broad category was chosen first; (14) for SDR respondents only--tenure status; (15) marginal
notes; and (16) the respondent's self-selected code. There were four situations in which a best code
differed from a respondent's self-code: (1) the respondent provides an open-ended answer but does not
provide an NSF Job Code or provides a clearly invalid code; (2) the respondent chooses the
"general" 500 code; (3) the respondent chooses a specific residual category such as 027 other
biological/life science; or (4) the respondent chooses a specific code that, after review of pertinent
information, is determined to be in error. "Best codes" were only assigned when there was sufficient evidence for
a better code.
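The four override situations can be expressed as a simple decision rule, sketched below with hypothetical code values and helper names; the actual coding operation relied on trained coders weighing the 14 to 16 factors listed above.

```python
GENERAL_CODE = "500"              # the "general" catch-all occupation code
RESIDUAL_CODES = {"027"}          # e.g., other biological/life science

def needs_best_code(self_code, valid_codes, judged_in_error):
    """True when one of the four override situations applies."""
    return (self_code is None or self_code not in valid_codes  # (1) missing/invalid
            or self_code == GENERAL_CODE                       # (2) "general" 500
            or self_code in RESIDUAL_CODES                     # (3) residual category
            or judged_in_error)                                # (4) judged in error

print(needs_best_code("027", {"027", "221", "500"}, judged_in_error=False))  # True
```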
The "best code" for education is determined by using one of two "flow charts." One "flow
chart" outlines the procedures for verbatims that list one major and the other "flow chart" is the
procedures for verbatims that list two majors. The "flow charts" operationalized the education
coding rules and standardized their use. These rules include: (1) rules for exact matches; (2) rules
for single, broad, nonspecific fields; and (3) rules for assigning the most specific NSF education code.
"Best codes" for education are assigned after determining if the respondent selected a code that is
too general, the respondent transposed the code numbers, or the numbers had been written
incorrectly. Education codes were not "best coded" if the respondent-selected code was more
specific than the respondent's verbatim and both are in the same field; the verbatim is more specific
than the self-selected code and both are in the same field; or the verbatim and the selected code
could fall under the same broad educational category. Only in those cases where it is clearly evident
that the self-code is incorrect is a "best code" assigned.
Other, Specify Coding. The purpose of editing "other, specify" responses is to identify entries that belong in existing
response categories. This procedure is called "back-coding." "Other, specify" responses often fall
into one of the following categories: (1) a response that should have been coded in an existing
response category; (2) a response that is a "legitimate" other response; or (3) a response that is not
a legitimate response (e.g., does not answer the question). The first category of "other, specify"
responses was "back-coded."
For their interview to be considered "complete," 1993 SESTAT respondents had to answer designated
"critical" questions such as degrees received and occupation. When possible, follow-up telephone calls
were used to complete critical items for otherwise complete questionnaires. (See "Editing Guidelines and Procedures" for
further details.)
With the exception of items with verbatim responses, noncritical data items had missing data replaced
or "imputed." When imputation occurred for an item, a new variable name was assigned to record the
imputed-revised data, and an imputation indicator flag was created that recorded when the data value
was imputed. The imputation of missing questionnaire data occurred after all logical editing had been
completed.
Sequential hot deck imputation was used to replace the missing values for data items in the survey
database. Hot deck imputation replaces missing values for a particular data item with a nonmissing
response from another data record associated with an individual (the "donor") considered to be
"similar" to the individual whose data record has the missing value (the "recipient"). Sequential hot
deck procedures use as the donor record the last encountered record with a nonmissing response for the
data item.
To ensure that adjacent data records were similar, each component survey grouped their respondent data
records into imputation classes, using variables thought to be strongly or even uniquely associated with
the data item subject to imputation. A donor record was selected only from those records that belonged
to the same imputation class as the recipient record with the missing item data.
Prior to imputation, the component surveys also sorted the data records within each imputation class by
variables thought to be associated with the answer for the data item as well as the propensity for
nonresponse to the data item. Serpentine sorting was used as it ensured that adjacent data records were
as similar as possible. In serpentine sorting, the sort order is reversed as boundaries are crossed for
higher level sort variables.
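Putting the last three paragraphs together, a stripped-down version of the procedure might look like the sketch below. The sort keys, field names, and the "_I"/"_FLAG" naming are assumptions made for illustration, not the actual SESTAT variable conventions.

```python
from itertools import groupby

def serpentine_sort(records, key1, key2):
    """Sort by key1; within successive key1 groups, alternate the key2
    direction so records adjacent across a boundary stay similar."""
    out, reverse = [], False
    ordered = sorted(records, key=lambda r: r[key1])
    for _, grp in groupby(ordered, key=lambda r: r[key1]):
        out += sorted(grp, key=lambda r: r[key2], reverse=reverse)
        reverse = not reverse
    return out

def sequential_hot_deck(records, item):
    """Fill missing `item` values from the last preceding nonmissing record
    (the donor). `records` should already be one imputation class, sorted."""
    donor = None
    for r in records:
        if r.get(item) is None:
            if donor is not None:
                r[item + "_I"] = donor     # imputed value under a new name
                r[item + "_FLAG"] = 1      # imputation indicator flag
        else:
            donor = r[item]
    return records

# One hypothetical imputation class, serpentine-sorted before imputation.
cls = [{"age": a, "deg": d, "salary": s} for a, d, s in
       [(30, "BS", 40), (30, "MS", None), (40, "MS", 55), (40, "BS", None)]]
for r in sequential_hot_deck(serpentine_sort(cls, "age", "deg"), "salary"):
    print(r)
```

In this toy example, the serpentine order places the two age-40 records next to each other, so the missing age-40 salary draws its donor from the other age-40 record; a plain hierarchical sort would have supplied a less similar donor.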
To derive unbiased survey estimates, estimation procedures must incorporate the selection
probabilities for each sampling unit. SESTAT selection probabilities vary greatly from unit to unit due
to the extensive oversampling used to facilitate analyses of smaller populations such as Native
Americans and the disabled and of less commonly chosen fields of study. Nonresponse and
undercoverage also lead to distortions of the sample with respect to the population of interest. SESTAT
has removed some of the complexities associated with survey data analysis by constructing sampling
weights that reflect the differential selection probabilities and then adjusting these weights to compensate
for nonresponse and undercoverage. These adjusted sampling weights become the analysis weights,
which have been added to each individual's record in the survey database.
Each component survey first developed its own independent analysis weights. Each component
survey defined the sampling weight as the reciprocal of the probability of selection for each sampled
unit. The selection probabilities varied substantially across and within the component surveys; the
resulting sampling weights ranged from 1 to 436 for SESTAT as a whole. Next, each component survey adjusted for nonresponse using weighting class
or poststratification adjustment procedures. The NSCG used poststratification adjustment to force the
sampling weights for survey respondents to the 1990 Decennial Census Long Form sample estimates.
The NSRCG sampling weight underwent both a weighting-class nonresponse adjustment and a ratio
adjustment to reflect known proportions in the population. The SDR sampling weight underwent a
weighting-class adjustment for nonresponse. The resulting analysis weights are included on the
SESTAT database (as "Z_WEIGHTING_FACTOR_SURVEY") and can be used in making estimates for the individual surveys.
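A weighting-class nonresponse adjustment of the kind used for the NSRCG and SDR can be sketched as follows; the field names and figures are illustrative only, and the NSCG's poststratification step is not shown.

```python
def nonresponse_adjust(cases, class_key):
    """Weighting-class adjustment: inflate respondents' base weights
    (1 / selection probability) so each class's respondents also
    represent its nonrespondents."""
    total, resp = {}, {}
    for c in cases:
        g = c[class_key]
        total[g] = total.get(g, 0.0) + c["weight"]
        if c["respondent"]:
            resp[g] = resp.get(g, 0.0) + c["weight"]
    for c in cases:
        g = c[class_key]
        c["adj_weight"] = c["weight"] * total[g] / resp[g] if c["respondent"] else 0.0
    return cases

# Two respondents and one nonrespondent in one class: each respondent's
# weight of 25 is inflated by 75/50 = 1.5 to cover the nonrespondent.
cases = [{"cls": "A", "weight": 25.0, "respondent": True},
         {"cls": "A", "weight": 25.0, "respondent": True},
         {"cls": "A", "weight": 25.0, "respondent": False}]
print([c["adj_weight"] for c in nonresponse_adjust(cases, "cls")])  # [37.5, 37.5, 0.0]
```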
The component survey databases were designed to be combined in analysis to capture the
advantages of increased sample sizes and the greater coverage of the target population. In combining
the three survey databases, SESTAT had to address issues of cross-survey multiplicity. Depending upon
the degrees they had and when they were received, scientists and engineers could belong to the surveyed
population of more than one component survey. For instance, a person holding a bachelor's degree at the
time of the 1990 Census who goes on to complete a master's degree in an S&E field in 1991 would have
opportunities for selection in the NSCG and the NSRCG. A unique-linkage rule was devised to remove these multiple selection
opportunities by uniquely linking each member of SESTAT's target population to one and only one
component survey and then including the individual in SESTAT only when selected for the linked survey.
Using the unique linkage rule, each person had only one chance of being selected into the combined
SESTAT database. The rule linked cases with multiple selection opportunities to SDR first, then to
NSRCG if the case was not also linked to SDR. Sampled individuals for each component survey were
examined to determine for which other component surveys (if any) they had an opportunity of selection.
NSCG sampled individuals that had an opportunity for selection by NSRCG or SDR were assigned zero
as their SESTAT analysis weight. Similarly, NSRCG sampled individuals that had an opportunity for
selection in SDR were assigned zero as their SESTAT analysis weight. All other cases had their
component survey's analysis weight brought over for use as their SESTAT analysis weight. The
SESTAT weight on the database (called "Z_WEIGHTING_FACTOR") should be used when analyzing SESTAT data derived from the three component surveys.
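A minimal sketch of the rule, with assumed field names: linkage follows the SDR-then-NSRCG-then-NSCG priority described above, and a case contributes its component-survey analysis weight only when it was sampled by its linked survey.

```python
def sestat_weight(case):
    """Assumed record layout: 'sampled_by' names the survey that sampled the
    case, 'eligible_for' is the set of surveys whose frames cover it, and
    'analysis_weight' is the component survey's analysis weight."""
    linked = next(s for s in ("SDR", "NSRCG", "NSCG")      # linkage priority
                  if s in case["eligible_for"])
    return case["analysis_weight"] if case["sampled_by"] == linked else 0.0

# An NSCG-sampled case that SDR's frame also covers is linked to SDR,
# so its SESTAT weight is zero:
print(sestat_weight({"sampled_by": "NSCG",
                     "eligible_for": {"NSCG", "SDR"},
                     "analysis_weight": 312.5}))           # -> 0.0
```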
Updated: February 25, 1998