Tools for Intuition about Sample Selection Bias and Its Correction
by
Ross M. Stolzenberg, Daniel A. Relles
Citation
Title:
Tools for Intuition about Sample Selection Bias and Its Correction
Author:
Ross M. Stolzenberg, Daniel A. Relles
Year:
1997
Publication:
American Sociological Review
Volume:
62
Issue:
3
Start Page:
494
End Page:
507
Language:
English
Updated: February 19th, 2013
Abstract:
Ross M. Stolzenberg, University of Chicago
Daniel A. Relles, RAND Corporation
We provide mathematical tools to assist intuition about selection bias in concrete empirical analyses. These new tools do not offer a general solution to the selection bias problem; no method now does that. Rather, the techniques we present offer a new decomposition of selection bias. This decomposition permits an analyst to develop intuition and make reasoned judgments about the sources, severity, and direction of sample selection bias in a particular analysis. When combined with simulation results, also presented in this paper, our decomposition of bias also permits a reasoned, empirically informed judgment of when the well-known two-step estimator of Heckman (1976, 1979) is likely to increase or decrease the accuracy of regression coefficient estimates. We also use simulations to confirm mathematical derivations.
Sample selection is said to occur when data for a variable are missing for some cases and present for others. Heckman (1976) made sociologists acutely aware that sample selection bias on a dependent variable in a regression can cause severe bias in estimates of regression coefficients. A variety of sample selection bias corrections is now available, as is extensive information about the deficiencies, difficulties, and limitations of all of these correction techniques (Winship and Mare 1992).¹ At present, no technique or combination of techniques appears to offer universal or even predictable rescue from the sometimes severe problems of selection bias. Further, simulation studies and analyses of survey data provide both anecdotal and systematic evidence that these corrections can and do go awry under ordinary circumstances, sometimes grossly worsening estimates rather than improving them, without providing any indication that a problem has occurred. Thus, in spite of the elegant models and inventive methods brought to bear on sample selection bias, researchers often have little more than their intuition to guide them.
* Direct correspondence to Ross M. Stolzenberg (rstolzenberg@uchicago.edu). This research was supported by the National Institute for Child Health and Human Development (Grant No. P50 HD12639), The RAND Corporation, and the Graduate Management Admission Council. The comments and criticisms of reviewer George Farkas and of Deputy Editor Trond Petersen materially improved this work.
¹ See Duan et al. (1984), Goldberger (1980), Lillard, Smith, and Welch (1986), Little (1985), Little and Rubin (1987), Nelson (1984), Paarsch (1984), and Stolzenberg and Relles (1990). Wainer (1986) records sharp debate between Heckman and Robb, Glynn, Laird and Rubin, and Tukey.
For the simplified situations usually found in introductory regression texts, Berk (1983) showed that it is fairly easy to examine selection bias pictorially and to develop intuition about the cause, size, and direction of selectivity effects. However, Berk also showed that this intuition is difficult to obtain from pictures when examples are only slightly more complicated. As we show in this paper, the usual mathematics of Heckman's (1976) selection model does not lend itself to intuitive thinking either, in part because it relies on recalculating regression coefficients after adding a new independent variable, the inverse Mills Ratio. This ratio is a nonlinear and otherwise generally unfamiliar function based on the results of a probit analysis of selection in the sample over which the regression is estimated. As a consequence, it is often difficult to develop intuition about selection bias in any particular empirical analysis.
American Sociological Review, 1997, Vol. 62 (June:494-507)
Our main purpose in this paper is to provide mathematical tools to assist intuition about sample bias in concrete empirical analyses. These new tools are not a general solution to the selection bias problem; no method provides that at this time. But they do offer a new approximation and decomposition of the bias. Most important, this decomposition gives a simplified but still rigorous explanation of how selection bias occurs and how the usual two-step selection correction can worsen estimates. Mathematically, the new approximation we offer is the product of a few familiar or easily computed quantities. Most of these quantities can be calculated from the data under analysis, while ranges of reasonable values must be selected for others. We use simulation studies to test the approximation methods we present here. These simulations also provide additional useful information about the circumstances under which selection bias corrections are likely to improve estimation.
We begin with an anatomy of the two-step selection correction. Then we present derivations that simplify its mathematics, simulation studies that confirm the accuracy of the mathematics, and conclusions about appropriate strategies for dealing with selection bias.
A MODEL OF CENSORING BIAS
Heckman's Model
For convenience, our selection model follows Heckman (1976:476, 1979:154). We consider the case with only one independent variable (described in Appendix A). Appendix B considers the multiple regression case, for which the derivations are tedious, although results are nearly as simple.
Let equation 1 be a regression equation of substantive interest:
$$Y_1 = \beta_0 + \beta_1 X + \sigma\varepsilon. \tag{1}$$
X is the regression independent variable, $Y_1$ is the regression dependent variable, and $\sigma\varepsilon$ is the regression error term, where $\sigma$ is a scalar and $\varepsilon$ is normally distributed with a mean of 0 and a variance of 1 ($N(0,1)$). $\beta_0$ and $\beta_1$ are regression coefficients.
For the same data for which equation 1 is defined, we also define equation 2, which is called the selection equation:
$$Y_2 = \alpha Z + \delta. \tag{2}$$
$\delta$ is also normally distributed $N(0,1)$, Z is the selection equation independent variable, and $\alpha$ is the coefficient of Z. Z may be identical to X. T is a scalar called the selection threshold. The value of $Y_1$ is observed for some cases (selected cases) and is missing for other cases (censored cases). A data case is selected if $Y_2 > T$ for that case; the case is censored if $Y_2 \le T$. We do not need a constant term in the selection equation because that term is absorbed into the selection threshold, T.
In a variation on the classic example of sample selection, we might wish to estimate a regression of earnings ($Y_1$) of married women on their years of schooling (X). However, for purposes of this example only, assume that many married women can choose between paid work in the labor market and unpaid work at home. Also for purposes of this example, assume that women make this choice on the basis of the occupational socioeconomic status ($Y_2$) provided by the job they would obtain if they chose market work. If $Y_2$ exceeds some threshold (T), then women choose market work and their earnings are observed; otherwise, the women lack observable earnings data, and they fall from the sample when the regression of $Y_1$ is computed.
For a given value of Z, the probability of selection is determined by T, $\alpha$, and the random error term $\delta$. The larger the value of $\alpha$, the more that sample selection depends on the value of Z. The larger the value of T, the lower the probability that any observation will be selected, regardless of the value of Z. If $\alpha$ equals 0, then selection is random and merely reduces sample size. If T equals $-\infty$, then all cases are selected no matter how large the value of $\alpha$. If T equals $+\infty$, then no cases are selected, no matter how small the value of $\alpha$.
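The role of T and the coefficient of Z in the selection equation can be checked numerically. The sketch below assumes the selection error is standard normal, as stated above, so that the probability of selection at a given Z is one minus the normal cumulative distribution function evaluated at T minus the coefficient times Z. The function name `selection_probability` and the thresholds of -10 and 10 (used as stand-ins for minus and plus infinity) are illustrative choices, not part of the original analysis.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # delta is taken to be N(0, 1), as in the selection equation


def selection_probability(z, alpha, threshold):
    """P(Y2 > T | Z = z) for Y2 = alpha * Z + delta with delta ~ N(0, 1)."""
    return 1.0 - STD_NORMAL.cdf(threshold - alpha * z)


# alpha = 0: selection ignores Z, so every case has the same chance of selection.
print([round(selection_probability(z, 0.0, 0.5), 4) for z in (-2.0, 0.0, 2.0)])

# Larger alpha: the chance of selection depends more strongly on Z.
print([round(selection_probability(z, 1.0, 0.5), 4) for z in (-2.0, 0.0, 2.0)])

# Very low and very high thresholds (stand-ins for minus and plus infinity).
print(selection_probability(0.0, 1.0, -10.0), selection_probability(0.0, 1.0, 10.0))
```

With the coefficient set to 0 the three probabilities coincide, which is the "selection is random" case described in the text.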
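Heckman (1976) observed that there is a potential bias in using only selected cases to estimate equation 1. He computed the conditional expectation of $Y_1$ given that $Y_1$ is observed, as:
$$E(Y_1 \mid Y_2 > T) = \beta_0 + \beta_1 X + \sigma\rho_{\varepsilon\delta}\,\lambda(T - \alpha Z), \tag{3}$$
where $\rho_{\varepsilon\delta}$ is the correlation between $\varepsilon$ and $\delta$, and $\lambda$ is the reciprocal of the Mills Ratio function as defined in equation 4:
$$\lambda(T - \alpha Z) = \frac{\phi(T - \alpha Z)}{1 - \Phi(T - \alpha Z)}, \tag{4}$$
where $\phi(T - \alpha Z)$ is the normal density function (the height of the normal curve) evaluated at $T - \alpha Z$, and $\Phi(T - \alpha Z)$ is the normal cumulative distribution function (the area under the normal curve) evaluated at $T - \alpha Z$. Thus, in the presence of selection, the original regression equation is not appropriate because it omits an independent variable that belongs in the equation. This omitted variable is $\lambda(T - \alpha Z)$, and its coefficient is an estimate of $\sigma\rho_{\varepsilon\delta}$. If $\lambda(T - \alpha Z)$ is substantially correlated with X and $Y_1$, then, instead of estimating
$$Y_1 = \beta_0 + \beta_1 X + \sigma\varepsilon, \tag{5}$$
we should estimate
$$Y_1 = \beta_0 + \beta_1 X + \sigma\rho_{\varepsilon\delta}\,\lambda(T - \alpha Z) + \sigma'\varepsilon', \tag{6}$$
where $\varepsilon'$ has a mean of 0 but is not normally distributed, and $\sigma'$ is a constant not necessarily equal to $\sigma$.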
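To make the omitted variable concrete, the following sketch computes the reciprocal of the Mills Ratio as the normal density divided by one minus the normal cumulative distribution function, and then the selection term that the uncorrected regression leaves out. The function names and the parameter values in the demonstration are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_mills(x):
    """lambda(x) = phi(x) / (1 - Phi(x)), the reciprocal of the Mills Ratio."""
    return STD_NORMAL.pdf(x) / (1.0 - STD_NORMAL.cdf(x))


def selection_term(z, alpha, threshold, sigma, rho):
    """sigma * rho * lambda(T - alpha * z): the term omitted by the uncorrected regression."""
    return sigma * rho * inverse_mills(threshold - alpha * z)


# Illustrative (hypothetical) parameter values.
for z in (-1.0, 0.0, 1.0):
    print(z, round(selection_term(z, alpha=0.3, threshold=-1.0, sigma=0.98, rho=0.71), 4))
```

When the two error terms are uncorrelated the term is zero for every case, so omitting it cannot bias the regression; when the correlation and the selection coefficient are both positive the term falls as Z rises.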
Heckman (1976) noted that $T - \alpha Z$ could be estimated as the predicted values in a probit analysis in which the independent variable is Z and the dependent variable is a dummy that equals 0 if $Y_1$ has a missing value and 1 if the value of $Y_1$ is not missing. Heckman's "two-step" correction consists of using probit analysis to estimate the value of $T - \alpha Z$ for each data case, calculating the inverse Mills Ratio $\lambda(T - \alpha Z)$ from those estimates for each data case, and then using $\lambda(T - \alpha Z)$ as an additional regressor in equation 1. Thus, in more familiar notation, we estimate
$$Y_1 = \beta_0 + \beta_1 X + \beta_2 M + \sigma'\varepsilon', \tag{7}$$
where M is $\lambda(T - \alpha Z)$ and $\sigma'\varepsilon'$ is the error term for the regression.
Equation 7 is a multiple regression equation in which one independent variable is a ratio of two nonlinear transformations of the predicted values of a probit analysis computed from an overlapping but different data set than that used to calculate the regression equation of interest. We think it is very difficult to have much intuition about the coefficients of an equation such as this. In the absence of intuition, one can only use the two-step correction blindly in the hope that it works properly. When all goes well this is a good way to proceed. But when the two-step estimator produces substantive results that seem to contradict reason, then intuition is essential to learn if the problem is caused by substantive or methodological errors.
Next we show some reasons why the two-step estimator is particularly prone to methodological problems. Then we develop an approximation to the two-step selection correction that is mathematically sound and straightforward enough to support intuitive thinking about selectivity bias in regression.
How Things Go Wrong
Under seemingly ordinary circumstances, even when its assumptions and formal requirements are satisfied, the two-step selection bias correction is known to sometimes produce estimates that are farther from true parameter values than estimates obtained by uncorrected ordinary least squares (Hartman 1991).² It is not possible to know how often such problems occur, but they do not seem to be rare (Lillard et al. 1986), and sometimes they are catastrophic (Stolzenberg and Relles 1985).
We now consider specific mechanisms that cause these difficulties. According to Equation 7, ignoring sample selection amounts to failing to include M as an independent variable. Elementary properties of regression indicate that if X and M are not correlated, then this omission, and therefore sample selection, does not bias the estimate of $\beta_1$. Those same properties of regression indicate that two things happen as the correlation between X and M increases: First, the estimate of $\beta_1$ is increasingly affected by including M in the regression (i.e., bias occurs if M is omitted). And, second, random error in the estimate of $\beta_1$ increases when M is included in the regression. As the random error component of the estimate of $\beta_1$ grows large, the bias-corrected estimate can become unstable enough to have a substantial chance of being farther from the true population value of $\beta_1$ than the OLS estimate it is intended to correct. At extremely high correlations between X and M, multicollinearity or near multicollinearity among regressors occurs, and coefficient estimates become volatile, often taking on substantively ridiculous values characteristic of regressions afflicted by multicollinearity. These problems are not peculiar to the two-step estimator; they tend to occur in any regression in which independent variables are highly correlated. For example, these problems occur frequently in polynomial regressions in which $X$, $X^2$, and $X^3$ are regressors. Confounding the coefficients for $X$, $X^2$, and $X^3$ is acceptable when it is sufficient to know the combined effects of all powers of X. However, in correcting for selection bias, the essential purpose is to distinguish the coefficient of X from the coefficient of M.
² Technically this is not a failing of the estimator, since its aim is to reduce bias rather than to increase efficiency. But in most situations in which substantive arguments are evaluated by analyzing a single sample of data, bias cannot be distinguished from other sources of error, and the estimator with the smallest total error is preferred. So, as a practical matter, we say that a selection bias correction method "goes wrong" in a particular analysis if it produces an estimate of $\beta_1$ that is farther from the population value of $\beta_1$ than the biased OLS estimate it is intended to correct. In a distribution of the analyses of a number of different samples, bias can be distinguished from random error, and bias and random error can be attacked separately. Hartman (1991) offers a simulation-based comparison of various estimators, including the least-squares and maximum likelihood versions of Heckman's (1976) model.
Figure 1. Mills Ratio of X versus X
Why is X so likely to be highly correlated with M? Recall that M = $\lambda(T - \alpha Z)$. X is highly correlated with $\lambda(T - \alpha Z)$ because X and Z are highly correlated and because $\lambda$ is reasonably linear over the fixed range of values taken by $T - \alpha Z$ in virtually any data set. X and Z are usually highly correlated, or even identical, because variables that cause $Y_1$ are often the same variables that cause selection. For example, in the hypothetical analysis described above in which $Y_1$ is earnings and $Y_2$ is occupational SES, the causes of $Y_1$ and the causes of $Y_2$ might be similar, or even the same. And although $\lambda$ is a nonlinear function and its nonlinearity identifies the selection parameter, its variation in any particular sample of data is approximately linear. Figure 1 plots the function $\lambda(X)$, the inverse of the Mills Ratio function, for values of X, showing its approximate linearity.
Table 1 reports some simple simulation-based estimates of the correlation between $\alpha Z - T$ and $\lambda(T - \alpha Z)$ over different ranges of $\alpha Z - T$. In each simulation, we define a range for $\alpha Z - T$ and then create a data set consisting of (1) 1,000 data points that are equally spaced over this range, and (2) the corresponding value of $\lambda(T - \alpha Z)$ for each data point. We compute the correlation between the data points and the corresponding values of $\lambda(T - \alpha Z)$. Notice that the absolute values of all these correlations exceed .96, and all but three are larger than .99.
Because of the high correlations between Z and $\lambda(T - \alpha Z)$, the correlation between X and Z that is necessary for substantial selection bias to occur also tends to increase estimation error substantially, or even introduce collinearity problems when the two-step selection correction is applied. The two-step estimator is a delicate balance of selection bias against errors introduced by adding a regressor that is highly correlated with the regressor of substantive interest.
Table 1. Illustrative Simulated Correlations between $\alpha Z - T$ and $\lambda(T - \alpha Z)$ over Various Ranges of $\alpha Z - T$

Range of $\alpha Z - T$     Correlation between $\alpha Z - T$ and $\lambda(T - \alpha Z)$
(0, 1)                      .9996
(-1, 1)                     .9960
(0, 2)                      .9992
(-1, 2)                     .9957
(-2, 2)                     .9836
(0, 3)                      .9991
(-1, 3)                     .9961
(-2, 3)                     .9865
(-3, 3)                     .9665
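The Table 1 exercise can be sketched in a few lines: build 1,000 equally spaced points over a range, evaluate the reciprocal of the Mills Ratio at each point, and correlate the two series. This is a minimal sketch rather than the authors' program. In particular, it evaluates the function directly on the grid; whether the grid is read as the argument of the function or as its negative is an assumption here, and that choice affects the sign of the correlation and, on asymmetric ranges, its size. The helper names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_mills(x):
    """Reciprocal of the Mills Ratio: phi(x) / (1 - Phi(x))."""
    return STD_NORMAL.pdf(x) / (1.0 - STD_NORMAL.cdf(x))


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def grid_correlation(low, high, n_points=1000):
    """Correlate an equally spaced grid on [low, high] with inverse_mills of the grid."""
    step = (high - low) / (n_points - 1)
    grid = [low + i * step for i in range(n_points)]
    return pearson(grid, [inverse_mills(x) for x in grid])


RANGES = [(0, 1), (-1, 1), (0, 2), (-1, 2), (-2, 2), (0, 3), (-1, 3), (-2, 3), (-3, 3)]
for low, high in RANGES:
    print((low, high), round(grid_correlation(low, high), 4))
```

The point of the exercise is qualitative: over ranges of this width the function is close enough to a straight line that a regressor built from it will be nearly collinear with its own argument.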
Insights from Approximating $\lambda$ in Simple Regressions
We clarify and simplify the effect of selection bias by using a Taylor series to approximate the function $\lambda$. Once the Taylor series approximation and some algebraic manipulation are complete, the individual contributions of each parameter in equations 1 and 2 to the selection bias in $\beta_1$ can be seen more clearly, as shown in equation 8. In equation 8, we use $\aleph_{\beta_1}$ (aleph-beta-one) to denote the bias in estimation of $\beta_1$ that would occur if no correction for sample selection bias were applied. Thus,
$$\aleph_{\beta_1} \approx -\lambda'(T - \alpha\bar{Z})\,\sigma\,\rho_{\varepsilon\delta}\,\rho_{XZ}\,\alpha\,(s_Z/s_X), \tag{8}$$
where $\rho_{XZ}$ is the correlation between the independent variables in the regression equation (X) and selection equation (Z), $s_X$ and $s_Z$ are the standard deviations of X and Z in the selected part of the sample, $\bar{Z}$ is the mean of Z, and other variables are as defined before. $\lambda'$ is the first derivative of the reciprocal of the Mills Ratio function. Appendix B derives the multiple regression form of equation 8 (see equation B6).
Notice that equations 8 and B6 do not require calculation of the inverse Mills Ratio or its first derivative for every case. In the simple regression case, the argument of $\lambda'$ can be calculated by substituting the mean of Z into the estimated probit equation to obtain $\alpha\bar{Z} - T$ and multiplying the result by $-1$. In the multiple regression case, the argument of $\lambda'(T - \alpha_1\bar{Z}_1 - \cdots - \alpha_k\bar{Z}_k)$ can be calculated by substituting the means of $Z_1, Z_2, \ldots, Z_k$ into the estimated probit equation and multiplying the result by $-1$. An easier method, depending on the computer program used to calculate the probit analysis, is to simply retain the predicted values from the probit equation, take their mean, and then multiply by $-1$ (because $\alpha$ and T are constants, the mean of $\alpha Z - T$ equals $\alpha\bar{Z} - T$).³ Once the argument of $\lambda'$ is obtained, $\lambda'(T - \alpha\bar{Z})$ or $\lambda'(T - \alpha_1\bar{Z}_1 - \cdots - \alpha_k\bar{Z}_k)$ is easily calculated with a spreadsheet program, statistical analysis program, or a table of normal density and normal cumulative distribution functions.⁴
³ Statistical programs frequently offer the option of retaining predicted probits and/or predicted probabilities. For present purposes, the predicted probit is the appropriate quantity to retain.
Equation 8 shows the conditions under which sample selection bias is small enough to disregard or too large to ignore. If variables are standardized to a variance of 1, then all components except $\alpha$ are constrained to absolute values between 0 and 1 so that small values for one (or, especially, two) of them are likely to drive selection bias to small values. We now consider each of these components.
(1) $\lambda'(T - \alpha\bar{Z})$, the first derivative of the function $\lambda$. To obtain the value at which $\lambda'$ is evaluated, a probit analysis of sample selection is performed, and the mean of predicted values from the probit equation is calculated and multiplied by $-1$.
(2) $\sigma$, the square root of the regression error variance. This determines the magnitude of the regression $R^2$. Bias varies inversely with the regression $R^2$. If the regression $R^2$ is large, then selectivity bias is small, other things equal. A regression with a high $R^2$ can tolerate a lot of sample selection without showing much bias. However, in sociological analyses of individual-level data, the regression R is often between .2 and .5, corresponding to values of $\sigma$ between .98 and .87 in standardized regressions.
(3) $\rho_{\varepsilon\delta}$, the correlation between the regression error and selection error terms. $\rho_{\varepsilon\delta}$ is not directly observable, but like all correlations it is between $-1$ and $+1$. Values in this range can be used in sensitivity analyses to determine the likely range of selection bias. A correlation of 1 would occur if selection and regression were accomplished by identical processes. A correlation closer to 0 would occur when there is a lot of randomly missing data or when sample selection is created by a process unrelated to the process described by the regression equation.
(4) $\rho_{XZ}$, the correlation between independent variables from the regression equation and the selection equation. Selection equations and regression equations often have identical independent variables, in which case the correlation between regression and selection independent variables is 1. Absolute values of $\rho_{XZ}$ between .9 and 1 seem likely in many sociological analyses.
(5) $\alpha$, the coefficient of Z in the probit equation. If $\alpha$ is small, then Z does not explain selection very well, the selection mechanism will have little correlation with X, and the bias will be small. This means that if the probit equation used to estimate $\lambda(T - \alpha Z)$ fits the data poorly, then selection bias is small or the process of selection is misunderstood. In the case of misunderstanding, no statistical method can help. Otherwise, poor probit fit of the selection equation is evidence that selection bias is likely to be small, other things equal. In practice, probit fit is the easiest indicator of selection bias to compute and interpret. In most sociological analyses of individual-level data, probit fit is weak, and we therefore expect probit fit in selection models of individual data to be weak as well. Absolute values of $\alpha$ between .1 and .3 are likely in individual-level analyses.
(6) $s_Z/s_X$, the standard deviation of the independent variable in the selection equation compared to the standard deviation of X in the selected data. If the standard deviation of Z is small compared to the standard deviation of X, then bias is reduced. If X and Z are the same variable, as is often the case, then this ratio equals 1 and it neither increases nor decreases the selection bias.
⁴ The first derivative of the reciprocal of the Mills Ratio function is defined as follows. Where $\phi(x)$ is the normal density function and $\Phi(x)$ is the normal cumulative distribution function, then
$$\lambda'(x) = \frac{-[1 - \Phi(x)]\,x\,\phi(x) + [\phi(x)]^2}{[1 - \Phi(x)]^2}.$$
Cell formulas for a two-line Excel 7.0 spreadsheet, which makes these and other calculations, follow at the bottom of this page. Formulas assume that the upper left corner of the sheet is cell A1.
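The derivative of the reciprocal of the Mills Ratio is easy to compute without a spreadsheet. The sketch below gives two algebraically equivalent forms: a compact one, the function times the difference between the function and its argument, and the quotient-rule form. It then evaluates the derivative at minus one times a mean predicted probit of .8160, the value reported in the empirical example of this paper. Function and variable names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_mills(x):
    """Reciprocal of the Mills Ratio: phi(x) / (1 - Phi(x))."""
    return STD_NORMAL.pdf(x) / (1.0 - STD_NORMAL.cdf(x))


def inverse_mills_derivative(x):
    """lambda'(x) written compactly as lambda(x) * (lambda(x) - x)."""
    lam = inverse_mills(x)
    return lam * (lam - x)


def derivative_long_form(x):
    """The same derivative in quotient-rule form."""
    pdf = STD_NORMAL.pdf(x)
    tail = 1.0 - STD_NORMAL.cdf(x)
    return (-tail * x * pdf + pdf ** 2) / tail ** 2


mean_predicted_probit = 0.8160            # value reported in the empirical example
argument = -1.0 * mean_predicted_probit   # multiply the mean predicted probit by -1
print(round(inverse_mills_derivative(argument), 4))
print(round(derivative_long_form(argument), 4))
```

The two forms agree because the derivative of the normal density is minus its argument times the density; the compact form is convenient because it needs only one evaluation of the function.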
Equation 8 can be used to develop intuition about the amount of selection bias occurring under hypothetical conditions or in a particular data set. With actual data, one would first construct a probit model of selection in the data, calculate and retain predicted values from the probit equation, take the mean of the predicted values, multiply the mean by $-1$, and evaluate $\lambda'$ at that value. Most other quantities appearing on the right side of equation 8 can be estimated from sample data. However, for intuitive appeal, in Table 2 we have generated normally distributed random data with censoring of 5, 10, and 20 percent; the values of $\lambda'$ shown below are based on those data. For other terms on the right side of equation 8, Table 2 displays values we think are commonplace for sociological analysis. The regression R in these examples takes values of .2 and .5, which correspond to $\sigma$ values of .9798 and .8660; $\rho_{XZ}$ and $s_Z/s_X$ are both set to equal 1, reflecting the common situation in which the same independent variables are used as regressors in both selection and regression equations; the probit coefficient $\alpha$ is equal to .3; the correlation between regression and probit equation residuals is .71 ($.71^2 \approx .5$; $\varepsilon$ explains half the variance in $\delta$). Estimated bias equals the product of the column entries that are not shaded. Notice that bias is relatively small in all examples in Table 2.
The simulated results shown in Table 2 suggest that selection bias in standardized regression coefficients is likely to be small under many easily imagined circumstances, even if 10 or 20 percent of data cases are censored. As the fit of the selection model improves, however, selection bias increases. For example, if $\alpha$ increased from .3 to .6 in these examples, then the selection bias would double. Poor model fit is seldom viewed as a virtue, but poor fit of the selection model can indicate that substantial selectivity coexists with small selection bias.
Notice that this exercise gives an idea of the direction of bias as well as its size under various conditions. If one hypothesizes and finds a positive coefficient for X in the regression equation, and if simulations like those that follow suggest that bias is likely to be downward, then one might reasonably conclude that one has found a lower-bound estimate of $\beta_1$. A downward-biased estimate could be quite useful for testing a substantive hypothesis of a positive coefficient for X, as long as that test found the hypothesized positive coefficient.
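The hypothetical cases described for Table 2 can be recomputed as a product of the components of the bias approximation. The sketch below is illustrative only. It assumes that the first three cases use a regression R of .2 and the last three use .5, that a mean predicted probit of 1.82421 applies to both 10-percent cases, and that the approximation carries a negative sign when the error correlation, the selection coefficient, and the correlation between X and Z are all positive (the sign that follows from the conditional expectation of the selected cases). Compare magnitudes with the tabled values.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_mills_derivative(x):
    """lambda'(x) = lambda(x) * (lambda(x) - x), with lambda = phi / (1 - Phi)."""
    lam = STD_NORMAL.pdf(x) / (1.0 - STD_NORMAL.cdf(x))
    return lam * (lam - x)


def approximate_bias(mean_probit, sigma, rho_err, rho_xz, alpha, sd_ratio):
    """Product form of the bias approximation; lambda' is evaluated at -mean_probit."""
    lam_prime = inverse_mills_derivative(-mean_probit)
    return -lam_prime * sigma * rho_err * rho_xz * alpha * sd_ratio


# (mean predicted probit, regression R); the pairing with cases is an assumption.
CASES = [(2.33260, 0.2), (1.82421, 0.2), (1.19641, 0.2),
         (2.33260, 0.5), (1.82421, 0.5), (1.19641, 0.5)]

for mean_probit, r in CASES:
    sigma = sqrt(1.0 - r ** 2)  # standardized regression: sigma = sqrt(1 - R^2)
    bias = approximate_bias(mean_probit, sigma, rho_err=0.71, rho_xz=1.0,
                            alpha=0.3, sd_ratio=1.0)
    print(mean_probit, r, round(bias, 4))
```

Holding the derivative term fixed, the product is linear in the selection coefficient, which is why doubling that coefficient doubles the approximate bias.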
Table 2. Hypothetical Values

Component of Bias                                  Case 1    Case 2    Case 3    Case 4    Case 5    Case 6
Censoring rate                                     5%        10%       20%       5%        10%       20%
Mean predicted probit                              2.33260   1.82421   1.19641   2.33260   1.82421   1.19641
Regression R                                       .2        .2        .2        .5        .5        .5
Estimated bias in standardized regression
  coefficient ($\aleph_{\beta_1}$)                 .0131     .0311     .0652     .0115     .0275     .0576

Note: The estimated bias equals the product of other column entries in the table that are not shaded.

An Empirical Example
For a brief empirical example, we consider the regression of the years of schooling completed by a respondent (EDUC) on the respondent's father's years of schooling (PAEDUC). First we describe how to generate the data used in this analysis. We use data from the public use files of the 1985 NORC General Social Survey (GSS). To simplify calculations, we delete cases with missing values on either variable (leaving a sample of 1,153 cases), and we standardize both variables to a mean of 0 and a standard deviation of 1. Since this is an illustrative example, it is useful to know the impact of sample censoring. So we censor the data ourselves by creating a variable $Y_2$ equal to the sum of EDUC and a random variable with a mean of 0 and a standard deviation of 1. For the censored sample estimation, we "observe" cases for which $Y_2$ exceeds $-1$, and we censor all other cases. Our choice of $-1$ is arbitrary. After censoring, 893 cases remain (77.45 percent of the sample).
We use STATA (release 4) for statistical calculations. Using the uncensored sample, ordinary least-squares regression yields an estimate of .4898 for the coefficient of PAEDUC ($R^2$ = .2392; t = 19.060). Using the censored sample, the coefficient for PAEDUC is .3597 ($R^2$ = .1644; t = 13.242). So the difference between the censored and uncensored estimates is about .13, which is more than five times the standard error of the regression coefficient in the censored sample.
Next, we follow the procedures we recommend for approximating sample selection bias in the coefficient of PAEDUC. We create a dummy variable equal to 1 for observed cases, and equal to 0 for censored cases. We perform a probit regression analysis of this dummy variable, with PAEDUC as the independent variable. The probit analysis yields a coefficient of .4131 for PAEDUC and a constant term of .8160. We retain the predicted values of the probit analysis and find that their mean is .8160. We calculate $\lambda'(-.8160)$ to be .4245. Following our hypothetical examples, we guess a range of values for $\rho_{\varepsilon\delta}$. Assuming that these variables explain about half of the variance in each other, we try .71 (= $\sqrt{.5}$) for the value of $\rho_{\varepsilon\delta}$. (In practice, one should try a range of possible values for $\rho_{\varepsilon\delta}$.) Filling all this information into equation 8 and multiplying yields an approximate bias of .12, which is about equal to the difference between censored and uncensored estimates. On a purely subjective basis, this seems to be sufficient bias to warrant serious efforts to correct for selection effects. If those efforts failed to produce credible results, we might use our approximation results to suggest interpretation of the censored sample regression estimate as a low estimate of the coefficient of PAEDUC.
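Because the error correlation cannot be observed, the text recommends trying a range of values. The sketch below runs such a sensitivity sweep with the probit results reported for the GSS example (mean predicted probit .8160, probit coefficient .4131, X and Z identical). The regression standard error used here, the square root of one minus the censored-sample $R^2$, is an assumption made for illustration, so the figure at a correlation of .71 need not match the published value exactly. The negative sign reflects a downward bias.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_mills_derivative(x):
    """lambda'(x) = lambda(x) * (lambda(x) - x), with lambda = phi / (1 - Phi)."""
    lam = STD_NORMAL.pdf(x) / (1.0 - STD_NORMAL.cdf(x))
    return lam * (lam - x)


def approximate_bias(mean_probit, sigma, rho_err, rho_xz=1.0, alpha=1.0, sd_ratio=1.0):
    """Approximate bias in the slope; negative values mean a downward bias."""
    lam_prime = inverse_mills_derivative(-mean_probit)
    return -lam_prime * sigma * rho_err * rho_xz * alpha * sd_ratio


MEAN_PROBIT = 0.8160                  # reported mean of the predicted probits
ALPHA = 0.4131                        # reported probit coefficient of PAEDUC
SIGMA_ASSUMED = sqrt(1.0 - 0.1644)    # assumption: taken from the censored-sample R^2

for rho in (0.0, 0.25, 0.5, 0.71, 0.9):
    bias = approximate_bias(MEAN_PROBIT, SIGMA_ASSUMED, rho, alpha=ALPHA)
    print(rho, round(bias, 3))
```

A sweep like this shows at a glance how strongly a substantive conclusion depends on the one quantity that the data cannot supply.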
How Small Is Small?
A value of $\aleph_{\beta_1}$ that is "big" in one situation may be unimportant in another situation. Subjective judgments of this sort often cause haggling, but most of this subjectivity can be avoided by asking if bias is large compared to sampling error in the estimate of $\beta_1$. If bias is small compared to sampling error, then bias can be said to lack practical importance, much like a statistically insignificant regression coefficient.
A standard result in regression is
$$\sigma_{\hat{\beta}_1} \approx \frac{\sigma}{s_X\sqrt{n}}. \tag{9}$$
Dividing equation 8 by the right hand side of equation 9 and taking absolute values gives equation 10, the formula for comparing bias in the estimate of $\beta_1$ to the sampling error in that estimate:
$$\left|\frac{\aleph_{\beta_1}}{\sigma_{\hat{\beta}_1}}\right| \approx \left|\lambda'(T - \alpha\bar{Z})\,\rho_{\varepsilon\delta}\,\rho_{XZ}\,\alpha\,s_Z\right|\sqrt{n}, \tag{10}$$
where n is the sample size. As n grows large, the ratio of bias to sampling error also grows large. Equation B7 in Appendix B extends equation 10 to multiple regression models. But we think that substantial insight can be gained by applying equations 8 and 10 to simple regression models, even when one's ultimate interest is in multiple regression analysis.
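The comparison of bias with sampling error reduces to a short calculation once the regression standard error cancels. The sketch below assumes the ratio takes the form of the absolute product of the derivative term, the two correlations, the selection coefficient, and the standard deviation of Z, multiplied by the square root of n. The inputs reuse figures reported in the empirical example; setting the standard deviation of Z to 1 is an assumption, since that quantity is not reported for the selected sample.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_mills_derivative(x):
    """lambda'(x) = lambda(x) * (lambda(x) - x), with lambda = phi / (1 - Phi)."""
    lam = STD_NORMAL.pdf(x) / (1.0 - STD_NORMAL.cdf(x))
    return lam * (lam - x)


def bias_to_sampling_error(mean_probit, rho_err, rho_xz, alpha, s_z, n):
    """Approximate |bias| divided by the standard error of the slope."""
    lam_prime = inverse_mills_derivative(-mean_probit)
    return abs(lam_prime * rho_err * rho_xz * alpha * s_z) * sqrt(n)


# Reported inputs from the empirical example; s_z = 1.0 is an assumption.
for n in (200, 893, 5000):
    print(n, round(bias_to_sampling_error(0.8160, 0.71, 1.0, 0.4131, 1.0, n), 2))
```

Because the ratio grows with the square root of n, a bias that is lost in sampling noise in a small sample can dominate the error in a large one.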
SIMULATIONS
We use simulations to evaluate the Taylor series approximations of $\lambda$. These simulations differ from earlier studies, which tested the extent to which two-step estimation corrects selection bias (Nelson 1984; Paarsch 1984; Stolzenberg and Relles 1990). Our simulations are used to check the accuracy of a mathematically derived method for approximating bias.
To evaluate selection bias, we simulate several data sets, censor them, and fit regressions to the censored data. In all simulated data sets, the true value of the regression coefficient $\beta_0$ is 0 and the coefficient of $X_1$ equals 1. Following Equation 8, our experimental design is based upon six factors: $\sigma$, $\alpha$, T, $\rho_{\varepsilon\delta}$, $\rho_{XZ}$, and N (the sample size before selection). The values of these factors are as follows: $\sigma$, the regression standard error, takes values 2, 1, and .5 (corresponding to regression $R^2$ of .2, .5, and .8); $\alpha$, the selection equation coefficient, takes values 1 and .333333 (corresponding to selection equation $R^2$ of .5 and .1); T, the selection threshold, equals $\zeta(1 + \alpha^2)^{.5}$, where $\zeta$ takes values .674, 0, and $-.674$ (corresponding to selection rates of 25 percent, 50 percent, and 75 percent, respectively; this parameterization is required because the variance of $Y_2$ varies with $\alpha$); $(\rho_{\varepsilon\delta})^2$, the squared correlation between $\varepsilon$ and $\delta$, takes values of 0, .25, .50, and .75; $(\rho_{X_1Z})^2$, the squared correlation between $X_1$ and Z, takes values of 0, .25, .50, and .75; N, the sample size before selection, takes values of 200, 500, 1,000, 2,000, and 5,000. Thus, our experimental design has 1,440 cells ($3 \times 2 \times 3 \times 4 \times 4 \times 5$). For each cell, we perform 50 simulations, yielding 72,000 total simulations. For each simulation, we perform an ordinary least-squares regression, a probit analysis of selection bias, and a two-step regression for correcting selection bias, for a total of 216,000 analyses.
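The factorial design can be enumerated directly, which also confirms the cell and analysis counts. In the sketch below the dictionary keys and helper names are hypothetical, and the symbol for the threshold multiplier is written as `zeta`. The $R^2$ helpers assume unit-variance X and Z and a true slope of 1.

```python
from itertools import product
from math import sqrt

SIGMAS = (2.0, 1.0, 0.5)                 # regression standard error
ALPHAS = (1.0, 1.0 / 3.0)                # selection equation coefficient
ZETAS = (0.674, 0.0, -0.674)             # threshold multipliers: 25, 50, 75 percent selected
RHO_ERR_SQ = (0.0, 0.25, 0.50, 0.75)     # squared correlation of the two error terms
RHO_XZ_SQ = (0.0, 0.25, 0.50, 0.75)      # squared correlation of X and Z
SAMPLE_SIZES = (200, 500, 1000, 2000, 5000)


def threshold(zeta, alpha):
    """T = zeta * sqrt(1 + alpha^2), since Var(Y2) = alpha^2 + 1 for unit-variance Z."""
    return zeta * sqrt(1.0 + alpha ** 2)


def regression_r2(sigma):
    """R^2 of Y1 = X + sigma * eps with unit-variance X and eps."""
    return 1.0 / (1.0 + sigma ** 2)


def selection_r2(alpha):
    """R^2 of Y2 = alpha * Z + delta with unit-variance Z and delta."""
    return alpha ** 2 / (1.0 + alpha ** 2)


cells = [
    {"sigma": s, "alpha": a, "T": threshold(z, a),
     "rho_err": sqrt(re), "rho_xz": sqrt(rx), "n": n}
    for s, a, z, re, rx, n in product(SIGMAS, ALPHAS, ZETAS, RHO_ERR_SQ, RHO_XZ_SQ, SAMPLE_SIZES)
]

simulations = len(cells) * 50   # 50 replications per cell
analyses = simulations * 3      # OLS, probit, and two-step per replication
print(len(cells), simulations, analyses)
```

Scaling the threshold by the standard deviation of $Y_2$ keeps the selection rate fixed at 25, 50, or 75 percent regardless of the selection coefficient.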
The Pearsonian correlation between $\aleph_{\beta_1}$ and the actual selection bias is .99. The Taylor series estimate explains 97.68 percent of variance in observed bias.⁵ Thus, the Taylor approximation of $\lambda$ is sufficiently accurate to give confidence in equations 8 and 10.
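A single cell of such a design can be simulated to see how closely the approximation tracks the bias actually observed in an uncorrected regression. The following is a minimal sketch under stated assumptions, not the authors' simulation program: X and Z are unit-variance normal variables, the true intercept is 0 and the true slope is 1, the correlation between X and Z and the ratio of their standard deviations are measured in the selected sample (so their product is the slope of Z on X there), and the approximation carries a negative sign for positive correlations. The function name and seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_mills_derivative(x):
    """lambda'(x) = lambda(x) * (lambda(x) - x), with lambda = phi / (1 - Phi)."""
    lam = STD_NORMAL.pdf(x) / (1.0 - STD_NORMAL.cdf(x))
    return lam * (lam - x)


def simulate_cell(sigma, alpha, zeta, rho_err, rho_xz, n, seed=0):
    """Return (actual OLS bias, approximate bias) for one simulated, censored data set."""
    rng = random.Random(seed)
    threshold = zeta * sqrt(1.0 + alpha ** 2)
    xs, zs, ys = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rho_xz * x + sqrt(max(0.0, 1.0 - rho_xz ** 2)) * rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        delta = rho_err * eps + sqrt(max(0.0, 1.0 - rho_err ** 2)) * rng.gauss(0.0, 1.0)
        if alpha * z + delta > threshold:      # the case is selected
            xs.append(x)
            zs.append(z)
            ys.append(x + sigma * eps)         # true intercept 0, true slope 1
    m = len(xs)
    mx, mz, my = sum(xs) / m, sum(zs) / m, sum(ys) / m
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    actual_bias = sxy / sxx - 1.0              # uncorrected OLS slope minus true slope
    z_on_x = sxz / sxx                         # equals rho_xz * s_z / s_x in the selected sample
    lam_prime = inverse_mills_derivative(threshold - alpha * mz)
    approx_bias = -lam_prime * sigma * rho_err * alpha * z_on_x
    return actual_bias, approx_bias


actual, approx = simulate_cell(sigma=1.0, alpha=1.0, zeta=0.0,
                               rho_err=sqrt(0.5), rho_xz=sqrt(0.5), n=20000, seed=1)
print(round(actual, 3), round(approx, 3))
```

Repeating the call over many seeds and design cells, and then correlating the two returned quantities, is the kind of check that the reported correlation summarizes.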
OLS versus the Two-Step Estimator
Our simulations also permit comparison of the accuracy of Heckman's (1976) two-step estimates to uncorrected OLS estimates of selected data. Knowing the correct value of $\beta_1$, we estimate it using both techniques and compare the results. First, we simply tabulate the proportion of simulations in which OLS is more accurate than the two-step estimator. In the 1,440 cells in the experimental design, that proportion ranged from 0 percent to 98 percent, with a mean of 42.6 percent and a standard deviation of 21.2 percent. Large sample size alone is not sufficient to make the two-step estimator better than OLS: In simulations based on samples of 5,000 cases, OLS outperforms the two-step correction in 34.5 percent of the simulations.
⁵ The equation is: Actual Bias = .0017 + .8505 $\aleph_{\beta_1}$. Similar results were obtained in regressions without the constant term.

Table 3. Two Nonlinear Probit Models of the Probability that OLS Gives a More Accurate Estimate of $\beta_1$ than the Two-Step Selection Correction Estimator, by Sample Size before Selection

Independent Variable, by Sample Size before Selection: 200, 500, 1,000, 2,000, 5,000

A. Using Bias in Standardized Regression Coefficient as Predictor
Constant: .006 (.333), .284 (12.313), .231 (8.768)
Number of cases: 288, 288, 288

B. Using Ratio of Bias to Standard Error of $\beta_1$ as Predictor
Constant: .010 (.619), .3030 (15.542), .251 (13.331)
Number of cases: 288, 288, 288

Note: Numbers in parentheses are t-statistics. Regressions are based on 1,440 data cases, each case corresponding to one cell of the experimental design described in the text. Each cell of the design contains 50 simulations. In each panel of this table, each of the five columns of the table reports an analysis of 288 cases. The dependent variable in each regression is the probit of the proportion of simulations in which the OLS estimate of $\beta_1$ is more accurate than the estimate obtained with the two-step estimator.
We use these simulations to model the probability that OLS is more accurate than the two-step correction. To fit these models, we define a data set in which each cell of the experimental design represents one case. The dependent variable in these analyses is the probit of the proportion of simulations in which the OLS estimate of $\beta_1$ is more accurate than the estimate obtained with the two-step estimator.⁶

We calculate two sets of regression analyses in which the dependent variable described above is regressed on a measure of bias severity. In the first set, the measure of bias severity is the expected bias measured in units of the standardized regression coefficient in the regression of $Y_1$ on $X$ (this is $|\tilde{B}_{\beta_1}(\sigma_X/\sigma_{Y_1})|$). In the second set, the measure of bias severity is the expected bias divided by the standard error of the regression coefficient for $X$ ($|\tilde{B}_{\beta_1}|/s_{\beta_1}$). We fit cubic polynomial regressions to allow the effects to level off as the severity of selection bias gets very large or very small. Analyses are stratified by simulation sample size and are reported in Table 3. $R^2$ statistics indicate that these models fit rather well for sample sizes of 1,000 or more. Regressions reported in Table 3 (or graphs drawn from them) can be used to estimate how severe bias must be before the two-step estimator becomes more accurate than OLS some percentage of the time. For simulations of 1,000 cases before sample selection, we cannot be 95 percent sure that the two-step correction outperforms OLS unless selection bias in the regression coefficient estimate is at least 4 times the standard error of $\beta_1$. We cannot be 80 percent sure the correction is better unless bias is 2.14 times the standard error of $\beta_1$. When bias severity is measured in standardized units, the results are much the same: Unless the bias changes the standardized regression coefficient by at least .146, we cannot be 95 percent sure that the two-step estimator performs better than OLS. Unless bias changes the standardized coefficient by at least .064, we cannot be even 80 percent sure that the two-step estimator performs better than OLS. In short, the two-stage estimator worsens estimation unless selection bias is severe.

⁶ The variable we wish to explain with these regressions is a proportion, and therefore it is limited to values between 0 and 1. Probit regression constrains predicted values from this regression to the interval (0,1). Probit regression of grouped data is accomplished by taking the inverse normal cumulative distribution function of that proportion, then regressing the transformed proportion on independent variables of interest. See Hanushek and Jackson (1977).
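The grouped-data probit transform and the cubic specification described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the cubic coefficients are left as arguments because the fitted values from Table 3 are not reproduced here, and reading "95 percent sure" as a predicted OLS-win proportion of .05 is our interpretation of the text.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()   # standard normal distribution


def probit(p):
    """Inverse normal CDF of a proportion strictly between 0 and 1."""
    return _STD_NORMAL.inv_cdf(p)


def proportion_from_probit(v):
    """Map a fitted probit value back to a proportion."""
    return _STD_NORMAL.cdf(v)


def cubic_probit(severity, b0, b1, b2, b3):
    """Fitted probit from a cubic polynomial in a bias-severity measure."""
    return b0 + b1 * severity + b2 * severity ** 2 + b3 * severity ** 3


# Probit values corresponding to the confidence levels discussed in the text,
# assuming "95 percent sure" means OLS is more accurate in 5 percent of runs.
probit_95 = probit(0.05)
probit_80 = probit(0.20)
```

Because `inv_cdf` is undefined at 0 and 1, a cell in which OLS never (or always) wins has no finite probit; the text does not say how such cells were handled, so any adjustment would be an additional assumption.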
Our third approach to comparing the accuracy of the two-step estimator to OLS is based on the difference in root-mean-squared errors (RMSE) of the two estimates. Our simulations show that RMSE tends to be lower for OLS than for the two-step estimator when $|\tilde{B}_{\beta_1}|/s_{\beta_1}$ is less than .4. For values of $|\tilde{B}_{\beta_1}|/s_{\beta_1}$ between .4 and 1.0, the correction performs only marginally better than OLS.
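The two accuracy summaries used in this section, the share of simulations in which OLS is closer to the truth and the RMSE of each estimator, are simple to compute once the estimates are in hand. The sketch below uses hypothetical function names and made-up estimates purely for illustration; none of the numbers come from the paper's simulations.

```python
from math import sqrt


def rmse(estimates, truth):
    """Root-mean-squared error of a list of estimates around the true value."""
    return sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def share_ols_closer(ols, two_step, truth):
    """Proportion of paired simulations in which OLS is closer to the truth."""
    wins = sum(1 for a, b in zip(ols, two_step)
               if abs(a - truth) < abs(b - truth))
    return wins / len(ols)


# Made-up estimates of beta1 (true value 1.0) from four hypothetical runs.
ols_est = [0.90, 0.95, 0.85, 0.92]
two_step_est = [1.20, 0.70, 1.02, 1.30]

ols_share = share_ols_closer(ols_est, two_step_est, 1.0)
ols_rmse = rmse(ols_est, 1.0)
two_step_rmse = rmse(two_step_est, 1.0)
```

In this toy example OLS is biased downward but tightly clustered, while the two-step estimates are centered nearer the truth but scattered, so OLS wins on both criteria; that is the pattern the text describes when bias is small relative to the standard error.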
In short, our simulation-based comparisons of OLS and the two-step estimator suggest that if bias is very severe and the samples are large, one can be reasonably confident that the two-step correction improves estimates. If bias is only moderate, however, or if samples have only a few hundred cases, there is considerable risk that the two-step estimator makes estimates worse, not better, even when sample selection is known to be present and the assumptions of the two-step correction method are satisfied. Compared to the conclusions of some earlier simulation studies, the rules of thumb based on our simulations offer more precision and inspire more confidence in the two-step estimator (Hartman 1991; Stolzenberg and Relles 1990). In particular, our results are inconsistent with Hartman's (1991) recommendation for wholesale abandonment of the two-step procedure.
CONCLUSIONS
What is to be done about sample selection bias? Several bias correction procedures are now available, and more are coming. Winship and Mare (1992) review a number of these techniques and conclude that none of them works well all the time. Some techniques make strong assumptions that require more courage than a particular empirical problem might justify. Other techniques are so imprecise that they often rob empirical analyses of their power to answer meaningful questions (Manski 1995). Yet others require data that are rarely available in sociological research (Little and Rubin 1987:230–34; Rubin 1977).
Instead of seeking a method that always corrects sample selection bias, one might use as many different selection correction methods as possible, giving each estimator a vote at the statistical polls. In this approach, if several of these "experts" offer the same correction, that correction is believed to be right. But the sheer inefficiency of many selectivity correction methods introduces randomness which easily can produce agreement by chance. Getting two of these methods to suggest that OLS estimates are too high, for example, may be no more informative than getting heads on two flips of a coin.
Finally, one can perform a significance test for the presence of selection bias (Heckman 1980). Significance tests are important and useful, but they are conceptually removed from estimation. If confidence intervals are large, a badly biased estimate can fail to differ significantly from an unbiased estimate. Thus, the availability of a significance test for selection bias does not lessen the need for a selection bias correction.
We think the safest approach to sample selection bias problems is first to understand how nonrandom selection occurs in one's data. This understanding focuses attention on selection bias as a missing data problem and sometimes can lead to an artful construction of missing values (Braun and Szatrowski 1982). Thinking about the process of sample selection may help reveal whether selection occurs through a process consistent with Heckman's (1976) model. If no such process exists in a particular data set, then it may be possible to rule out selection bias on purely logical grounds. (That is not to say that the data are necessarily unbiased, but, rather, that they are not biased in the way described in Heckman's [1976] paper and in the literature that subsequently grew from it.) If data do appear to be selected as described by Heckman's model, then it is appropriate to consider Heckman's two-step estimator. When Heckman's model is substantively appropriate, one can use equation 8 to assess the probable direction and likely severity of sample selection bias, even if precise values for all terms on the right side of the equation are not available. In practice, one would begin with univariate analyses and equations 8 and 10, and then move to the multivariate analyses described in Appendix B (equations B6 and B7). These equations sharply reduce the number of factors one must speculate about to forecast the selection bias in a regression analysis. Equations 8 and A6 (from Appendix A) support considerable intuition about whether selection bias tends to make analyses stringent or lenient tests of the substantive hypotheses under consideration. For example, if the sign of an estimated coefficient is opposite the sign of selection bias affecting that coefficient, then an uncorrected, biased analysis may be sufficient to support important substantive hypotheses, even if precise measurement of effects is impossible.
In addition, equation 10 and the regression equations reported in Table 3 can help indicate whether the two-step estimator is likely to improve on OLS estimation.
The bad news here is no surprise: There appears to be no automatic way to diagnose and correct sample selection bias. Analytical methods cannot make imperfect data perfect. To return to an earlier example, there is no way to know for sure how much nonemployed married women would earn if they became employed. But analytical methods can help bridge some of the gap between the data one can get and the data one would like. Intuition, informed judgment, simulation, experimentation, and statistical methods are necessary to understand and manage inevitable problems in data. Selection bias is just one of these problems. The methods we present here, when combined with other tools for selection bias correction, can help provide the information necessary to cope with selection bias problems. We do not expect that these methods can make data perfect. But we do think that these techniques, in combination with other procedures, can help make the analysis of imperfect data informative and infinitely more useful than speculation about important substantive concerns in the absence of any data at all.
Ross M. Stolzenberg is Professor of Sociology at the University of Chicago. His current research interests and some recent publications concern the relationship between attitudes and behavior (Social Forces, 1995; American Sociological Review, 1995; American Journal of Sociology, 1994), and the effects of schooling on employment of Asians and Hispanics (Social Science Research, 1997). This work continues in his current projects: a study of the structure of young adults' attitudes toward work, family and schooling; and a study of the long-term consequences of schooling.
Daniel A. Relles is Senior Statistician at the RAND Corporation. His main research interest is in developing statistical methods to efficiently manage and analyze large data sets. At RAND, he works on a variety of projects that have complex data requirements; all projects involve interdisciplinary teams. He is currently working on a health-related project to model the supply and demand for orthopaedic surgeons through the year 2010, a military logistics project to reduce delays in the repair of Army vehicle and weapons systems, and a resource management project on the efficient dispatching of electric generating capacity, given uncertain power demands.
Appendix A. The Taylor Series Approximation of the Mills Ratio Inverse
We expand $\lambda(T - \alpha Z)$ in its Taylor series about the mean of $Z$ in the data for which $Y_1$ is observed (call it $\bar{Z}$), then decompose $Z$ into its projection on $X$ plus a residual. The Taylor series expansion of $\lambda(T - \alpha Z)$ through the linear term is

$$\lambda(T - \alpha Z) \approx \lambda(T - \alpha\bar{Z}) - \alpha\,\lambda'(T - \alpha\bar{Z})\,(Z - \bar{Z}).$$

The linear projection of $Z$ onto $X$ is $[X\rho_{XZ}(s_Z/s_X)]$, where $\rho_{XZ}$ is the correlation between $X$ and $Z$, and $s_X$ and $s_Z$ denote the standard deviations of $X$ and $Z$ in the regression sample. Hence, the omitted term $[\sigma\rho_{\varepsilon\delta}\lambda(T - \alpha Z)]$ can be decomposed into a constant, a part that is a multiple of $X$, and another part that is orthogonal to $X$. The $X$ term in this decomposition has the coefficient $\tilde{B}_{\beta_1}$, shown in equation 8, which will be absorbed into the estimate of $\beta_1$ if selection bias is not corrected. $\tilde{B}_{\beta_1}$ is therefore the bias induced by failing to correct for selection bias. Note that $\lambda'(T - \alpha\bar{Z})$ is a constant, since all of its arguments are constants. To test the Taylor approximation, we perform 72,000 simulations (described in the body of this paper) in which we observe actual bias and calculate the Taylor series estimate of it. The correlation between actual bias and the Taylor approximation is .9883, which seems sufficient justification for the usefulness of the Taylor approximation.
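The linearization at the heart of this appendix can be examined numerically. The sketch below is an illustration with hypothetical function names: it evaluates the reciprocal of the Mills Ratio, $\lambda(x) = \phi(x)/(1 - \Phi[x])$, uses the standard identity $\lambda'(x) = \lambda(x)[\lambda(x) - x]$ for its derivative, and compares $\lambda(T - \alpha Z)$ with its first-order expansion about $\bar{Z}$.

```python
from statistics import NormalDist

_N = NormalDist()   # standard normal distribution


def mills_inverse(x):
    """Reciprocal of the Mills Ratio: phi(x) / (1 - Phi(x))."""
    return _N.pdf(x) / (1.0 - _N.cdf(x))


def mills_inverse_prime(x):
    """First derivative, using lambda'(x) = lambda(x) * (lambda(x) - x)."""
    lam = mills_inverse(x)
    return lam * (lam - x)


def taylor_mills(t, alpha, z, z_bar):
    """First-order Taylor approximation of lambda(t - alpha*z) about z_bar."""
    c = t - alpha * z_bar
    return mills_inverse(c) - alpha * mills_inverse_prime(c) * (z - z_bar)


# Compare the exact function with its linearization near the expansion point.
exact = mills_inverse(0.5 - 1.0 * 0.25)
approx = taylor_mills(0.5, 1.0, 0.25, 0.20)
```

The approximation is exact at $Z = \bar{Z}$ and degrades as $Z$ moves away from it, which is why the authors check it against simulated bias rather than relying on the expansion alone.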
Appendix B. The Taylor Series Approximation in Multiple Regression
Equations B6 and B7 below are the multiple regression versions of equations 8 and 10 respectively. We obtain B6 and B7 by restating in multiple regression form the expression for selection bias, applying the Taylor series approximation of $\lambda$ to this restated form, and then simplifying the result by orthogonalizing the variables. Finally, we replace unfamiliar quantities introduced by the orthogonalization with more familiar equivalents. Multiple and partial correlations appearing in equations B6 and B7 can be computed using many statistical analysis programs, including SAS and SPSS. Refer to Appendix Table B1 for definitions of symbols used in Appendix B.
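The regression equation is restated with variables $x_1, x_2, \ldots, x_p$ in place of $X$ and variables $z_1, z_2, \ldots, z_q$ in place of $Z$:

$$Y_1 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \sigma\rho_{\varepsilon\delta}\,\lambda(T - \alpha_1 z_1 - \cdots - \alpha_q z_q) + \varepsilon',$$

where $\varepsilon'$ has a mean of 0 but is not normally distributed. Ignoring the selection phenomenon is equivalent to leaving the term $\lambda(T - \alpha_1 z_1 - \cdots - \alpha_q z_q)$ out of the regression equation. Since the order of the independent variables is arbitrary, it is simplest to consider the bias in the estimate of $\beta_1$ if $\lambda(T - \alpha_1 z_1 - \cdots - \alpha_q z_q)$ is omitted from the regression.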
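The two-step estimator requires calculating $\lambda$ for each data case and reestimating the regression equation with $\lambda$ added to the list of regressors. We wish to find an expression for the bias that does not require adding a new independent variable, the Mills Ratio inverse, and a subsequent reestimation of the regression equation. We use a Taylor expansion to estimate the Mills Ratio inverse. In the derivations below, $\hat\alpha_1, \ldots, \hat\alpha_q$ are the probit estimates of the selection equation independent variable coefficients, $\alpha_1, \ldots, \alpha_q$; $\bar{z}_1, \ldots, \bar{z}_q$ are the mean values of the selection equation independent variables $z_1, \ldots, z_q$; $\hat{Z}$ is the predicted value from the probit equation ($\hat{Z} = \hat\alpha_1 z_1 + \cdots + \hat\alpha_q z_q - T$), and $K$ is a constant. The Taylor series expansion of the Mills Ratio inverse $\lambda(T - \hat\alpha_1 z_1 - \cdots - \hat\alpha_q z_q)$ around $\bar{z}_1, \ldots, \bar{z}_q$ is:

$$\begin{aligned}
\lambda(T - \hat\alpha_1 z_1 - \cdots - \hat\alpha_q z_q)
&\approx \lambda(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q) - \lambda'(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q)\,[\hat\alpha_1(z_1 - \bar{z}_1) + \cdots + \hat\alpha_q(z_q - \bar{z}_q)] \\
&= K - \lambda'(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q)\,\hat{Z}, \\
K &= \lambda(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q) + \lambda'(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q)\,(\hat\alpha_1\bar{z}_1 + \cdots + \hat\alpha_q\bar{z}_q - T). \qquad \text{(B3)}
\end{aligned}$$

Substituting the Taylor series approximation for $\lambda(T - \hat\alpha_1 z_1 - \cdots - \hat\alpha_q z_q)$ into equation A1 gives

$$Y_1 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \sigma\rho_{\varepsilon\delta}\,[K - \lambda'(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q)\,\hat{Z}] + \varepsilon' \qquad \text{(B4)}$$

as the approximate equation to fit that will correct for the sample selection bias. Except for $\hat{Z}$, all terms of $\sigma\rho_{\varepsilon\delta}[K - \lambda'(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q)\hat{Z}]$ are constants. So, $\hat{Z}$, the predicted value from the probit equation, is added as a regressor to the regression equation.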
To estimate the bias created by omitting the correction term from the regression, we draw on a standard result from regression theory that is now only rarely cited or taught: If $Y_1$, $x_1$, and $\hat{Z}$ are orthogonalized with respect to $x_2$ through $x_p$, then the coefficients of $x_1$ and $\hat{Z}$ in equation B4 can be estimated by regressing the orthogonalized value of $Y_1$ on the orthogonalized values of $x_1$ and $\hat{Z}$ (see Morris and Rolph 1981:78–82), thereby eliminating $x_2$ through $x_p$ from consideration and making the multiple regression case described in this appendix similar to the simpler case described in the main part of the paper. To orthogonalize $Y_1$ with respect to $x_2$ through $x_p$, regress $Y_1$ on $x_2$ through $x_p$; the residuals of this regression are the orthogonalized values of $Y_1$, denoted $\tilde{Y}_1$. Orthogonalize $x_1$ and $\hat{Z}$ similarly, as residuals from their regressions on $x_2, \ldots, x_p$. Denote the orthogonalized values of $x_1$ and $\hat{Z}$ as $\tilde{x}_1$ and $\tilde{Z}$, respectively. $\rho_{\tilde{x}_1\tilde{Z}}$ is the correlation between $\tilde{x}_1$ and $\tilde{Z}$; $s_{\tilde{x}_1}$ and $s_{\tilde{Z}}$ are the standard deviations of $\tilde{x}_1$ and $\tilde{Z}$ (which are the conditional standard deviations of $x_1$ and $\hat{Z}$). Thus, the multivariate counterpart of equation 8 is equation B5, where $\tilde{B}_{\beta_1}$ denotes the bias in the estimate of $\beta_1$:

$$\tilde{B}_{\beta_1} = -\sigma\,\rho_{\varepsilon\delta}\,\rho_{\tilde{x}_1\tilde{Z}}\,\frac{s_{\tilde{Z}}}{s_{\tilde{x}_1}}\,\lambda'(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q). \qquad \text{(B5)}$$
Table B1. Symbols Used in Appendix B

Symbol    Definition
$\bar{z}_1, \bar{z}_2, \ldots, \bar{z}_q$    Means of selection equation independent variables $z_1, z_2, \ldots, z_q$.
$\beta_0, \beta_1, \beta_2, \ldots, \beta_p$    Regression equation coefficients.
$\alpha_1, \alpha_2, \ldots, \alpha_q$    Selection equation coefficients.
$\hat\alpha_1, \hat\alpha_2, \ldots, \hat\alpha_q$    Probit estimates of selection equation coefficients.
$\hat{Z}$    Predicted value from probit selection equation, $\hat{Z} = -T + \hat\alpha_1 z_1 + \cdots + \hat\alpha_q z_q$.
$\sigma^2$    Regression equation error variance.
$T$    Selection threshold.
$\phi$    $\phi(x)$ is the normal density function (height of normal curve) evaluated at $x$.
$\Phi$    $\Phi(x)$ is the normal curve cumulative distribution function (area under the normal curve) evaluated at $x$.
$\lambda$    Reciprocal of Mills Ratio function, $\lambda(x) = \phi(x)/(1 - \Phi[x])$.
$\lambda'$    First derivative of reciprocal of Mills Ratio function, $\lambda'(x) = \dfrac{-(1 - \Phi[x])\,x\,\phi(x) + \phi(x)^2}{(1 - \Phi[x])^2}$.
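Equation B5 can be written without reference to orthogonalized variables: $\rho_{\tilde{x}_1\tilde{Z}}$ is just $\rho_{x_1\hat{Z}.x_2 \ldots x_p}$, the partial correlation of $x_1$ with $\hat{Z}$ net of $x_2, \ldots, x_p$. In addition, $s_{\tilde{x}_1}$ can be obtained easily from the standard deviation of $x_1$ and the multiple correlation of $x_1$ with $x_2$ through $x_p$, $R_{x_1.x_2 \ldots x_p}$:

$$s_{\tilde{x}_1} = s_{x_1}\sqrt{1 - R^2_{x_1.x_2 \ldots x_p}}.$$

And $s_{\tilde{Z}}$ can be calculated from the standard deviation of $\hat{Z}$ and the multiple correlation of $\hat{Z}$ with $x_2$ through $x_p$. So we can write

$$\tilde{B}_{\beta_1} = -\sigma\,\rho_{\varepsilon\delta}\,\rho_{x_1\hat{Z}.x_2 \ldots x_p}\,\frac{s_{\hat{Z}}\sqrt{1 - R^2_{\hat{Z}.x_2 \ldots x_p}}}{s_{x_1}\sqrt{1 - R^2_{x_1.x_2 \ldots x_p}}}\,\lambda'(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q). \qquad \text{(B6)}$$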
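The two regression facts this derivation leans on can be verified numerically: partialling out $x_2$ leaves the coefficients of $x_1$ and $\hat{Z}$ unchanged, and the standard deviation of an orthogonalized variable equals the original standard deviation times $\sqrt{1 - R^2}$. The sketch below uses simulated data with a single control variable $x_2$ (so the multiple correlation reduces to a simple correlation); all names and the data-generating coefficients are hypothetical, and `z` stands in for the probit prediction.

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def center(v):
    m = fmean(v)
    return [vi - m for vi in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def residualize(v, w):
    """Residuals from the simple regression of v on w (with intercept)."""
    vc, wc = center(v), center(w)
    b = dot(wc, vc) / dot(wc, wc)
    return [vi - b * wi for vi, wi in zip(vc, wc)]


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(n):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [u - f * v for u, v in zip(m[r], m[i])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_slopes(y, cols):
    """Slope coefficients of y on cols; centering absorbs the intercept."""
    yc = center(y)
    xc = [center(c) for c in cols]
    gram = [[dot(ci, cj) for cj in xc] for ci in xc]
    rhs = [dot(ci, yc) for ci in xc]
    return solve(gram, rhs)


rng = random.Random(0)
n = 500
x2 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [0.6 * a + rng.gauss(0, 1) for a in x2]
z = [0.5 * a + 0.3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
y = [1 + 2 * a - b + 1.5 * c + rng.gauss(0, 1) for a, b, c in zip(x1, x2, z)]

# Full regression versus regression on variables orthogonalized on x2.
full = ols_slopes(y, [x1, x2, z])
y_t, x1_t, z_t = (residualize(v, x2) for v in (y, x1, z))
partial = ols_slopes(y_t, [x1_t, z_t])

# Conditional standard deviation of x1: direct versus formula.
c1, c2 = center(x1), center(x2)
r = dot(c1, c2) / sqrt(dot(c1, c1) * dot(c2, c2))
cond_sd_direct = pstdev(x1_t)
cond_sd_formula = pstdev(x1) * sqrt(1 - r * r)
```

Both identities hold exactly in any sample, up to floating-point error, so they introduce no further approximation beyond the Taylor expansion itself.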
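Symbol    Definition
$R_{\hat{Z}.x_2 \ldots x_p}$    Multiple correlation between $\hat{Z}$ and regression independent variables $x_2, \ldots, x_p$.
$R_{x_1.x_2 \ldots x_p}$    Multiple correlation between $x_1$ and regression independent variables $x_2, \ldots, x_p$.
$\tilde{B}_{\beta_1}$    Bias estimate of $\beta_1$.
$\rho_{x_1\hat{Z}.x_2 \ldots x_p}$    Partial correlation of $x_1$ with $\hat{Z}$ net of $x_2, \ldots, x_p$.
$\rho_{\varepsilon\delta}$    Correlation between regression and selection equation errors.
$s_{\hat{Z}}$    Standard deviation of $\hat{Z}$.
$s_{\tilde{x}_1}$    Standard deviation of $\tilde{x}_1$.
$s_{\tilde{Z}}$    Standard deviation of $\tilde{Z}$.
$\rho_{\tilde{x}_1\tilde{Z}}$    Correlation between $\tilde{x}_1$ and $\tilde{Z}$ (equal to $\rho_{x_1\hat{Z}.x_2 \ldots x_p}$).
$\tilde{Z}$    Value of $\hat{Z}$ orthogonalized on $x_2$ through $x_p$.
$\tilde{Y}_1$    Value of $Y_1$ orthogonalized on $x_2$ through $x_p$.
$\tilde{x}_1$    Value of $x_1$ orthogonalized on $x_2$ through $x_p$.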
Notice that equation B6 differs from 8 in very few ways, which account for the additional independent variables in the regression and selection equations: $\lambda'(T - \hat\alpha_1\bar{z}_1 - \cdots - \hat\alpha_q\bar{z}_q)$ is approximated from the proportion of cases selected using precisely the same procedure in the multiple regression case as is used in the simple regression case; the partial correlation $\rho_{x_1\hat{Z}.x_2 \ldots x_p}$ replaces the zero-order correlation $\rho_{XZ}$ and is calculated from the data; and the ratio $\sqrt{1 - R^2_{\hat{Z}.x_2 \ldots x_p}}\,/\sqrt{1 - R^2_{x_1.x_2 \ldots x_p}}$ is also calculated from the data.
As in the simple regression case described in the text, it is useful to judge the seriousness of selection bias by the standard error for $\beta_1$. Dividing equation B6 by the formula for the standard error of $\beta_1$, canceling terms, and taking the absolute value yields the result:
kpl~pll kt(Ti?lTl ,..
=PEJ pxli,x2,,,xp (B7)
IR~,,~2.,.~p
In short, the multiple regression case is a straightforward generalization of the one-variable case.
REFERENCES
Berk, Richard. 1983. "An Introduction to Sample Selection Bias in Sociological Data." American Sociological Review 48:386–98.
Braun, Henry and Theodore Szatrowski. 1982. The Reconstruction of Ideal Validity Experiments through Criterion-Equating: A New Approach. Princeton, NJ: Educational Testing Service.
Duan, Naihua, Will Manning, Carl Morris, and Joseph Newhouse. 1984. "Choosing between the Sample Selection Model and the Multi-Part Model." Journal of Business and Economic Statistics 2:283–89.
Goldberger, Arthur. 1980. "Abnormal Selection Bias." Department of Economics, University of Wisconsin, Madison, WI. Unpublished manuscript.
Hanushek, Eric and John Jackson. 1977. Statistical Methods for Social Scientists. New York: Academic Press.
Hartman, Raymond. 1991. "A Monte Carlo Analysis of Alternative Estimators in Models Involving Selectivity." Journal of Business and Economic Statistics 9:41–49.
Heckman, James. 1976. "The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models." Annals of Economic and Social Measurement 5:475–92.
Heckman, James. 1979. "Sample Selection Bias as a Specification Error." Econometrica 47:153–61.
Heckman, James. 1980. "Addendum to 'Sample Selection Bias as a Specification Error.'" Pp. 69–74 in Evaluation Studies Review Annual, edited by E. Stromsdorfer and G. Farkas. Beverly Hills, CA: Sage.
Lillard, Lee, James P. Smith, and Finis Welch. 1986. "What Do We Really Know about Wages? The Importance of Nonreporting and Census Imputation." Journal of Political Economy 94:489–506.
Little, Roderick. 1985. "A Note about Models for Selectivity Bias." Econometrica 53:1469–74.
Little, Roderick and Donald Rubin. 1987. Statistical Analysis with Missing Data. New York: Wiley.
Manski, Charles. 1995. Identification Problems in the Social Sciences. Cambridge, MA: Harvard University Press.
Morris, Carl and John Rolph. 1981. Introduction to Data Analysis and Statistical Inference. Englewood Cliffs, NJ: Prentice-Hall.
Nelson, Forrest. 1984. "Efficiency of the Two-Step Estimator for Models with Endogenous Sample Selection." Journal of Econometrics 24:181–96.
Paarsch, Harry. 1984. "A Monte Carlo Comparison of Estimators for Censored Regression Models." Journal of Econometrics 24:197–213.
Rubin, Donald. 1977. "Formalizing Subjective Notions about the Effect of Nonrespondents in Sample Surveys." Journal of the American Statistical Association 72:538–43.
Stolzenberg, Ross and Daniel Relles. 1985. Calculation and Practical Application of GMAT Predictive Validity Measures. Santa Monica, CA: Graduate Management Admission Council.
Stolzenberg, Ross and Daniel Relles. 1990. "Theory Testing in a World of Constrained Research Design: The Significance of Heckman's Censored Sampling Bias Correction for Nonexperimental Research." Sociological Methods and Research 18:395–415.
Wainer, Howard, ed. 1986. Drawing Inferences from Self-Selected Samples. New York: Springer-Verlag.
Winship, Christopher and Robert Mare. 1992. "Models for Selection Bias." Annual Review of Sociology 18:327–50.