Estimation of a Panel Data Sample Selection Model

by Ekaterini Kyriazidou

Econometrica, Vol. 65, No. 6 (November, 1997), 1335-1364


We consider the problem of estimation in a panel data sample selection model, where both the selection and the regression equation of interest contain unobservable individual-specific effects. We propose a two-step estimation procedure, which "differences out" both the sample selection effect and the unobservable individual effect from the equation of interest. In the first step, the unknown coefficients of the "selection" equation are consistently estimated. The estimates are then used to estimate the regression equation of interest. The estimator proposed in this paper is consistent and asymptotically normal, with a rate of convergence that can be made arbitrarily close to $n^{-1/2}$, depending on the strength of certain smoothness assumptions. The finite sample properties of the estimator are investigated in a small Monte Carlo simulation.

KEYWORDS: Sample selection, panel data, individual-specific effects.


SAMPLE SELECTION IS A PROBLEM frequently encountered in applied research. It arises as a result of either self-selection by the individuals under investigation, or sample selection decisions made by data analysts. A classic example, studied in the seminal work of Gronau (1974) and Heckman (1976), is female labor supply, where hours worked are observed only for those women who decide to participate in the labor force. Failure to account for sample selection is well known to lead to inconsistent estimation of the behavioral parameters of interest, as these are confounded with parameters that determine the probability of entry into the sample. In recent years a vast amount of econometric literature has been devoted to the problem of controlling for sample selectivity. The research, however, has almost exclusively focused on the cross-sectional data case. See Powell (1994) for a review of this literature and for references. In contrast, this paper focuses on the case where the researcher has panel or longitudinal data available.

Sample selectivity is as acute a problem in panel as in cross section data. In addition, panel data sets are commonly characterized by nonrandomly missing observations due to sample attrition.

This paper is based on Chapter 1 of my thesis completed at Northwestern University, Evanston, Illinois. I wish to thank my thesis advisor Bo Honoré for invaluable help and support during this project. Many individuals, among them a co-editor and two anonymous referees, have offered useful comments and suggestions for which I am very grateful. Joel Horowitz kindly provided a computer program used in this study. An earlier version of the paper was presented at the North American Summer Meetings of the Econometric Society, June, 1994. Financial support from NSF through Grant No. SES-9210037 to Bo Honoré is gratefully acknowledged. All remaining errors are my responsibility. An Appendix which contains a proof of a theorem not included in the paper may be obtained at the world wide web site:

Obviously, the analysis is similar for any kind of data that have a group structure.


The most typical concern in empirical work using panel data has been the presence of unobserved heterogeneity. Heterogeneity across economic agents may arise for example as a result of different preferences, endowments, or attributes. These permanent individual characteristics are commonly unobservable, or may simply not be measurable due to their qualitative nature. Failure to account for such individual-specific effects may result in biased and inconsistent estimates of the parameters of interest. In linear panel data models, these unobserved effects may be "differenced" out, using the familiar "within" ("fixed-effects") approach. This method is generally not applicable in limited dependent variable models. Exceptions include the discrete choice model studied by Rasch (1960, 1961), Andersen (1970), and Manski (1987), and the censored and truncated regression models (Honoré (1992, 1993)). See also Chamberlain (1984), and Hsiao (1986) for a discussion of panel data methods.

The simultaneous presence of sample selectivity and unobserved heterogeneity has been noted in empirical work (as for example in Hausman and Wise (1979), Nijman and Verbeek (1992), and Rosholm and Smith (1994)). Given the pervasiveness of either problem in panel data studies, it appears highly desirable to be able to control for both of them simultaneously. The present paper is a step in this direction.

In particular, we consider the problem of estimating a panel data model, where both the sample selection rule, assumed to follow a binary response model, and the (linear) regression equation of interest contain additive permanent unobservable individual-specific effects that may depend on the observable explanatory variables in an arbitrary way. In this type 2 Tobit model (in the terminology of Amemiya (1985)), sample selectivity induces a fundamental nonlinearity in the equation of interest with respect to the unobserved characteristics, which, in contrast to linear panel data models, cannot be "differenced away." This is because the sample selection effect, which enters additively in the main equation, is a (generally unknown) nonlinear function of both the observed time-varying regressors and the unobservable individual effects of the selection equation, and is therefore not constant over time.

Furthermore, even if one were willing to specify the distribution of the underlying time-varying errors (for example normal) in order to estimate the model by maximum likelihood, the presence of unobservable effects in the selection rule would require that the researcher also specify a functional form for their statistical dependence on the observed variables. Apart from being nonrobust to distributional misspecification, this fully parametric "random effects" approach is also computationally cumbersome, as it requires multiple numerical integration over both the unobservable effects and the entire length of the panel. Heckman's (1976, 1979) two-step correction, although computationally much more tractable, also requires full specification of the underlying distributions of the unobservables, and is therefore susceptible to inconsistencies due to misspecification. Thus, the results of this paper will be important even if the distribution of the individual effects is the only nuisance parameter in the model.

Panel data selection models with latent individual effects have been most recently considered by Verbeek and Nijman (1992), and Wooldridge (1995), who proposed methods for testing and correcting for selectivity bias. A crucial assumption underlying these methods is the parameterization of the sample selection mechanism. Specifically, these authors assume that both the unobservable effect and the idiosyncratic errors in the selection process are normally distributed. The present paper is an important departure from this work, in the sense that the distributions of all unobservables are left unspecified.

We focus on the case where the data consist of a large number of individuals observed through a small number of time periods, and analyze asymptotics as the number of individuals (n) approaches infinity. Short-length panels are not only the most relevant for practical purposes, they also pose problems in estimation. In such cases, even if the individual effects are treated as parameters to be estimated, a parametric maximum likelihood approach yields inconsistent estimates, the well known "incidental parameters problem."

Our method for estimating the main regression equation of interest follows the familiar two-step approach proposed by Heckman (1974, 1976) for parametric selection models, which has been used in the construction of most semiparametric estimators for such models. In the first step, the unknown coefficients of the "selection" equation are consistently estimated. In the second step, these estimates are used to estimate the equation of interest by a weighted least squares regression: the fixed effect from the main equation is eliminated by taking time differences on the observed selected variables, while the first-step estimates are used to construct weights, whose magnitude depends on the magnitude of the sample selection bias. For a fixed sample size, observations with less selectivity bias are given more weight, while asymptotically, only those observations with zero bias are used. This idea has been used by Powell (1987), and Ahn and Powell (1993) for the estimation of cross sectional selection models. The intuition is that, for an individual that is selected into the sample in two time periods, it is reasonable to assume that the magnitude of the selection effect in the main equation will be the same if the observed variables determining selection remain constant over time. Therefore, time differencing the outcome equation will eliminate not only its unobservable individual effect but also the sample selection effect. In fact, by imposing a linear regression structure on the latent model underlying the selection mechanism, the above argument will also hold if only the linear combination of the observed selection covariates, known up to a finite number of estimable parameters, remains constant over time.

Under appropriate assumptions on the rate of convergence of the first step estimator, the proposed estimator of the main equation of interest is shown to be consistent and asymptotically normal, with a rate of convergence that can be made arbitrarily close to $n^{-1/2}$. In particular, by assuming that the selection equation is estimated at a "faster" rate than the main equation, we obtain a limiting distribution which does not depend on the distribution of the first step estimator.


The first step of the proposed estimation method requires that the discrete choice selection equation be estimated consistently and at a sufficiently fast rate. To this end, we propose using a "smoothed" version of Manski's (1987) conditional maximum score estimator, which follows the approach taken by Horowitz (1992) for estimating cross section discrete choice models. Under appropriate assumptions, stronger than those in Manski (1987), the smoothed estimator improves on the rate of convergence of the original estimator, and also allows standard statistical inference. Furthermore, it dispenses with parametric assumptions on the distribution of the errors, required for example by the conditional maximum likelihood estimator proposed by Rasch (1960, 1961) and Andersen (1970).

Although our analysis is based on the assumption of a censored panel, with only two observations per individual, it easily generalizes to the case of a longer and possibly unbalanced panel, and may be also modified to accommodate truncated samples, in which case estimation of the selection equation is infeasible. Extensions of our estimation method to cover these situations are discussed at the end of the next section.

The paper is organized as follows. Section 2 describes the model and motivates the proposed estimation procedure. Section 3 states the assumptions and derives the asymptotic properties of the estimator. Section 4 presents the results of a Monte Carlo study investigating the small sample performance of the proposed estimator. Section 5 offers conclusions and suggests topics for future research. The proofs of theorems and lemmata are given in the Appendix.


We consider the following model:

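(2.1) $y^*_{it} = x^*_{it}\beta + \alpha^*_i + \varepsilon^*_{it}$,

(2.2) $d_{it} = 1\{w_{it}\gamma + \eta_i - u_{it} \geq 0\}$, $\quad i = 1, \ldots, n; \; t = 1, 2$.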

Here, $\beta \in \Re^k$ and $\gamma \in \Re^q$ are unknown parameter vectors which we wish to estimate, $x^*_{it}$ and $w_{it}$ are vectors of explanatory variables (with possibly common elements), $\alpha^*_i$ and $\eta_i$ are unobservable time-invariant individual-specific effects (possibly correlated with the regressors and the errors), $\varepsilon^*_{it}$ and $u_{it}$ are unobserved disturbances (not necessarily independent of each other), while $y^*_{it} \in \Re$ is a latent variable whose observability depends on the outcome of the indicator variable $d_{it} \in \{0, 1\}$. In particular, it is assumed that, while $(d_{it}, w_{it})$ is always observed, $(y^*_{it}, x^*_{it})$ is observed only if $d_{it} = 1$. In other words, the "selection" variable $d_{it}$ determines whether the $it$-th observation in equation (2.1) is censored or not. Thus, our problem is to estimate $\beta$ and $\gamma$ from a sample consisting of quadruples $(d_{it}, w_{it}, y_{it}, x_{it})$. We will denote the vector of (observed and unobserved) explanatory variables by $\zeta_i \equiv (w_{i1}, w_{i2}, x^*_{i1}, x^*_{i2}, \alpha^*_i, \eta_i)$. Notice that, without the "fixed effects" $\alpha^*_i$ and $\eta_i$, our model becomes a panel data version of the well known sample selection model considered in the literature, and could be estimated by any of the existing methods. Without sample selectivity, that is with $d_{it} = 1$ for all $i$ and $t$, equation (2.1) is the standard panel data linear regression model.

The smoothed conditional maximum score estimator for binary response panel data models, along with its asymptotic properties and necessary assumptions, is presented in an earlier version of this paper (Kyriazidou (1994)). See also Charlier, Melenberg, and van Soest (1995).

Obviously constants cannot be identified in either equation, since they would be absorbed in the individual effects.

These will be treated as nuisance parameters and will not be estimated. Our analysis also applies to the case where $\alpha^*_i = \eta_i$.
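As a purely illustrative sketch (not part of the original paper), the following Python snippet draws a two-period panel from a model of this form. The design is an assumption made only for illustration: scalar regressors, normal errors, the correlation parameter `rho`, and the particular way the individual effects depend on the regressors. The function name `simulate_panel` is hypothetical.

```python
import numpy as np

def simulate_panel(n, beta=1.0, gamma=1.0, rho=0.8, seed=0):
    """Draw (d, w, y, x) for t = 1, 2 from an illustrative selection model.

    All distributional choices below are assumptions for this sketch only.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(n, 2))                    # selection regressor w_it
    x_star = w + rng.normal(size=(n, 2))           # latent regressor x*_it
    # Individual effects, allowed to depend on the regressors.
    eta = 0.5 * w.mean(axis=1) + rng.normal(size=n)
    alpha_star = 0.5 * x_star.mean(axis=1) + rng.normal(size=n)
    # Errors: i.i.d. over time, with eps* correlated with u (selection bias).
    u = rng.normal(size=(n, 2))
    eps_star = rho * u + np.sqrt(1.0 - rho**2) * rng.normal(size=(n, 2))
    # Binary selection rule and latent outcome equation.
    d = (w * gamma + eta[:, None] - u >= 0).astype(int)
    y_star = x_star * beta + alpha_star[:, None] + eps_star
    # (y*, x*) are observed only when d = 1; censored entries are set to 0.
    return {"d": d, "w": w, "y": d * y_star, "x": d * x_star}

panel = simulate_panel(1000)
both = panel["d"].prod(axis=1)   # indicator of selection in both periods
print("share selected in both periods:", both.mean())
```

Only the rows with `both == 1` carry information about the outcome equation once time differences are taken.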

In our setup, it is possible to estimate $\gamma$ in the discrete choice "selection" equation (2.2) using either the conditional maximum likelihood approach proposed by Rasch (1960, 1961) and Andersen (1970), or the conditional maximum score method proposed by Manski (1987). On the other hand, estimation of $\beta$ based on the main equation of interest (2.1) is confronted with two problems: first, the presence of the unobservable effect $\alpha_{it} \equiv d_{it}\alpha^*_i$; and second and more fundamental, the potential "endogeneity" of the regressors $x_{it} \equiv d_{it}x^*_{it}$, which arises from their dependence on the selection variable $d_{it}$, and which may result in "selection bias."

The first problem is easily solved by noting that for those observations that have $d_{i1} = d_{i2} = 1$, time differencing will eliminate the effect $\alpha_{it}$ from equation (2.1). This is analogous to the "fixed-effects" approach taken in linear panel data models. In general though, application of standard methods, e.g., OLS, on this first-differenced subsample will yield inconsistent estimates of $\beta$, due to sample selectivity. This may be seen from the population regression function for the first-differenced subsample:

$E(y_{i1} - y_{i2} \mid d_{i1} = 1, d_{i2} = 1, \zeta_i) = (x^*_{i1} - x^*_{i2})\beta + E(\varepsilon^*_{i1} - \varepsilon^*_{i2} \mid d_{i1} = 1, d_{i2} = 1, \zeta_i)$.

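In general, there is no reason to expect that $E(\varepsilon^*_{it} \mid d_{i1} = 1, d_{i2} = 1, \zeta_i) = 0$, or that $E(\varepsilon^*_{i1} \mid d_{i1} = 1, d_{i2} = 1, \zeta_i) = E(\varepsilon^*_{i2} \mid d_{i1} = 1, d_{i2} = 1, \zeta_i)$. In particular, for each time period the "sample selection effect" $\lambda_{it} \equiv E(\varepsilon^*_{it} \mid d_{i1} = 1, d_{i2} = 1, \zeta_i)$ depends not only on the (partially unobservable) conditioning vector $\zeta_i$, but also on the (generally unknown) joint conditional distribution of $(\varepsilon^*_{it}, u_{i1}, u_{i2})$, which may differ across individuals, as well as over time for the same individual:

$\lambda_{it} \equiv E(\varepsilon^*_{it} \mid d_{i1} = 1, d_{i2} = 1, \zeta_i)$

$\quad = E(\varepsilon^*_{it} \mid u_{i1} \leq w_{i1}\gamma + \eta_i, \; u_{i2} \leq w_{i2}\gamma + \eta_i, \; \zeta_i)$

$\quad = \Lambda(w_{i1}\gamma + \eta_i, \; w_{i2}\gamma + \eta_i; \; F_{it}(\varepsilon^*_{it}, u_{i1}, u_{i2} \mid \zeta_i))$

$\quad = \Lambda_{it}(w_{i1}\gamma + \eta_i, \; w_{i2}\gamma + \eta_i, \; \zeta_i)$.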

Obviously, the analysis carries through to the case where $x^*_{it}$ is always observed, which is the case most commonly treated in the literature.


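It is convenient to rewrite the main equation (2.1) as a "partially linear regression:"

(2.1') $y_{it} = x_{it}\beta + \alpha_{it} + \lambda_{it} + \nu_{it}$,

where $\nu_{it} \equiv \varepsilon_{it} - \lambda_{it}$, with $\varepsilon_{it} \equiv d_{it}\varepsilon^*_{it}$, is a new error term, which by construction satisfies $E(\nu_{it} \mid d_{i1} = 1, d_{i2} = 1, \zeta_i) = 0$. The idea of our scheme for estimating $\beta$ is to "difference out" the nuisance terms $\alpha_{it}$ and $\lambda_{it}$ from the equation above.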

As a motivation of our estimation procedure, consider the case where $(\varepsilon^*_{it}, u_{it})$ is independent and identically distributed over time and across individuals, and is independent of $\zeta_i$. Under these assumptions, it is easy to see that

$\lambda_{it} = E(\varepsilon^*_{it} \mid u_{it} \leq w_{it}\gamma + \eta_i) = \Lambda(w_{it}\gamma + \eta_i)$,

where $\Lambda(\cdot)$ is an unknown function, the same over time and across individuals, of the single index $w_{it}\gamma + \eta_i$. Obviously in general, $\lambda_{i1} \neq \lambda_{i2}$, unless $w_{i1}\gamma = w_{i2}\gamma$. In other words, for an individual $i$ that has $w_{i1}\gamma = w_{i2}\gamma$ and $d_{i1} = d_{i2} = 1$, the sample selection effect $\lambda_{it}$ will be the same in the two periods. Thus, for this particular individual, applying first-differences in equation (2.1') will eliminate both the unobservable effect $\alpha_{it}$ and the selection effect $\lambda_{it}$. At this point it is important to notice that, even if the functional form of $\Lambda$ were known (as for example in the case of a bivariate normal distribution; see Heckman (1976)), it would still involve the unobservable effect $\eta_i$. This suggests that it would be generally infeasible to consistently estimate $\beta$ from (2.1') even in the absence of the effect $\alpha_{it}$, and with knowledge of $\gamma$, unless a parametric form for the distribution of $\eta_i$ conditional on the observed exogenous variables were also specified.

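The preceding argument for "differencing out" both nuisance terms from equation (2.1') will hold under much weaker distributional assumptions. In particular, since first-differences are taken on an individual basis, it is not required that $(\varepsilon^*_{it}, u_{it})$ be i.i.d. across individuals nor that it be independent of the individual-specific vector $\zeta_i$. In other words, we may allow the functional form of $\Lambda$ to vary across individuals. It is also possible to allow for serial correlation in the errors. Consider for example the case where $(\varepsilon^*_{i1}, \varepsilon^*_{i2}, u_{i1}, u_{i2})$ and $(\varepsilon^*_{i2}, \varepsilon^*_{i1}, u_{i2}, u_{i1})$ are identically distributed conditional on $\zeta_i$, i.e. $F(\varepsilon^*_{i1}, \varepsilon^*_{i2}, u_{i1}, u_{i2} \mid \zeta_i) = F(\varepsilon^*_{i2}, \varepsilon^*_{i1}, u_{i2}, u_{i1} \mid \zeta_i)$. Under this conditional exchangeability assumption, it is easy to see that for an individual $i$ that has $w_{i1}\gamma = w_{i2}\gamma$,

$\lambda_{i1} \equiv E(\varepsilon^*_{i1} \mid u_{i1} \leq w_{i1}\gamma + \eta_i, \; u_{i2} \leq w_{i2}\gamma + \eta_i, \; \zeta_i) = E(\varepsilon^*_{i2} \mid u_{i2} \leq w_{i1}\gamma + \eta_i, \; u_{i1} \leq w_{i2}\gamma + \eta_i, \; \zeta_i) = \lambda_{i2}$.

Notice that in general, it is not sufficient to assume joint conditional stationarity of the errors. An extreme example is the case where $\varepsilon^*_{i1}$, $\varepsilon^*_{i2}$, and $u_{i1}$ are i.i.d. $N(0, 1)$ and independent of $\zeta_i$, while $u_{i2} = \varepsilon^*_{i1}$. Then, $\lambda_{i1} = E(\varepsilon^*_{i1} \mid \varepsilon^*_{i1} \leq w_{i2}\gamma + \eta_i) \neq \lambda_{i2} = E(\varepsilon^*_{i2})$, regardless of whether $w_{i1}\gamma = w_{i2}\gamma$.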
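The role of exchangeability can be checked numerically. The sketch below is an illustration, not the paper's code: it fixes the common index $w_{i1}\gamma + \eta_i = w_{i2}\gamma + \eta_i$ at the assumed value `c = 0` and compares sample analogues of the two selection effects under the stationary but non-exchangeable design (second-period selection error equal to the first-period outcome error) and under an assumed exchangeable design. The helper name `selection_effects` is hypothetical.

```python
import numpy as np

def selection_effects(eps1, eps2, u1, u2, index1, index2):
    """Sample analogues of lambda_1 and lambda_2 among units selected in both periods."""
    selected = (u1 <= index1) & (u2 <= index2)
    return eps1[selected].mean(), eps2[selected].mean()

rng = np.random.default_rng(0)
n = 400_000
c = 0.0  # assumed common value of the selection index in both periods

# Stationary but not exchangeable: u_2 is set equal to eps*_1.
e1, e2, u1 = rng.normal(size=(3, n))
lam1_bad, lam2_bad = selection_effects(e1, e2, u1, e1, c, c)

# Exchangeable design (assumed): (eps*_t, u_t) i.i.d. over time, correlated within t.
u = rng.normal(size=(2, n))
e = 0.5 * u + rng.normal(size=(2, n))
lam1_ok, lam2_ok = selection_effects(e[0], e[1], u[0], u[1], c, c)

print("non-exchangeable:", lam1_bad, lam2_bad)
print("exchangeable:    ", lam1_ok, lam2_ok)
```

With `c = 0`, the first design has $\lambda_{i1} = E(\varepsilon \mid \varepsilon \leq 0) = -\sqrt{2/\pi}$ and $\lambda_{i2} = 0$, so differencing would not remove the selection effect, whereas in the exchangeable design the two effects coincide.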

The above discussion, which presumes knowledge of the true $\gamma$, suggests estimating $\beta$ by OLS from a subsample that consists of those observations that have $w_{i1}\gamma = w_{i2}\gamma$ and $d_{i1} = d_{i2} = 1$. Defining $\Psi_i \equiv 1\{w_{i1}\gamma = w_{i2}\gamma\}$, $\Phi_i \equiv 1\{d_{i1} = d_{i2} = 1\} = d_{i1}d_{i2}$, and with $\Delta$ denoting first differences, the OLS estimator is of the form $\hat{\beta}_n = \left[\sum_{i=1}^{n} \Delta x_i' \Delta x_i \Psi_i \Phi_i\right]^{-1}\left[\sum_{i=1}^{n} \Delta x_i' \Delta y_i \Psi_i \Phi_i\right]$. Under appropriate regularity conditions, this estimator will be consistent and root-$n$ asymptotically normal. An obvious requirement is that $\Pr(\Delta w_i \gamma = 0) > 0$, which may be satisfied for example when all the random variables in $w_{it}$ are discrete, or in experimental cases where the distribution of $w_{it}$ is in the control of the researcher, situations that are rare in economic applications.

Of course, this estimation scheme cannot be directly implemented since $\gamma$ is unknown. Furthermore, as argued above, it may be the case that $\Psi_i = 0$ (i.e. $\Delta w_i \gamma \neq 0$) for all individuals in our sample. Notice though that, if $\Lambda$ is a sufficiently "smooth" function, and $\hat{\gamma}_n$ is a consistent estimate of $\gamma$, observations for which the difference $\Delta w_i \hat{\gamma}_n$ is close to zero should also have $\Delta\lambda_i \cong 0$, and the preceding arguments would hold approximately.

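We therefore propose the following two-step estimation procedure, which is in the spirit of Powell (1987), and Ahn and Powell (1993): In the first step, $\gamma$ is consistently estimated based on equation (2.2) alone. In the second step, the estimate $\hat{\gamma}_n$ is used to estimate $\beta$, based on those pairs of observations for which $w_{i1}\hat{\gamma}_n$ and $w_{i2}\hat{\gamma}_n$ are "close." Specifically, we propose

(2.3) $\hat{\beta}_n = \left[\sum_{i=1}^{n} \hat{\psi}_{in} \Delta x_i' \Delta x_i \Phi_i\right]^{-1}\left[\sum_{i=1}^{n} \hat{\psi}_{in} \Delta x_i' \Delta y_i \Phi_i\right]$,

where $\hat{\psi}_{in}$ is a weight that declines to zero as the magnitude of the difference $|w_{i1}\hat{\gamma}_n - w_{i2}\hat{\gamma}_n|$ increases. We choose "kernel" weights of the form:

(2.4) $\hat{\psi}_{in} \equiv \frac{1}{h_n} K\left(\frac{\Delta w_i \hat{\gamma}_n}{h_n}\right)$,

where $K$ is a "kernel density" function, and $h_n$ is a sequence of "bandwidths" which tends to zero as $n \to \infty$. Thus, for a fixed (nonzero) magnitude of the difference $|\Delta w_i \hat{\gamma}_n|$, the weight $\hat{\psi}_{in}$ shrinks as the sample size increases, while for a fixed $n$, a larger $|\Delta w_i \hat{\gamma}_n|$ corresponds to a smaller weight.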
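A minimal sketch of the second step is given below, assuming the first-step estimate is already available. It implements a kernel-weighted least squares regression of first-differenced outcomes on first-differenced regressors, with weights $(1/h)K(\Delta w_i \hat{\gamma}/h)$ applied only to individuals selected in both periods. The function names, the array layout, and the default Gaussian kernel are assumptions of this illustration rather than choices made in the paper.

```python
import numpy as np

def gaussian_kernel(v):
    """Standard normal density, a second-order kernel."""
    return np.exp(-0.5 * v**2) / np.sqrt(2.0 * np.pi)

def second_step_estimator(y, x, w, d, gamma_hat, h, kernel=gaussian_kernel):
    """Kernel-weighted first-difference estimator of beta (two periods).

    y: (n, 2) observed outcomes; x: (n, 2, k) observed regressors;
    w: (n, 2, q) selection regressors; d: (n, 2) selection indicators;
    gamma_hat: (q,) first-step estimate; h: bandwidth.
    """
    y, x, w, d = (np.asarray(a, dtype=float) for a in (y, x, w, d))
    dy = y[:, 0] - y[:, 1]                       # first-differenced outcome
    dx = x[:, 0, :] - x[:, 1, :]                 # first-differenced regressors
    index_diff = (w[:, 0, :] - w[:, 1, :]) @ np.asarray(gamma_hat, dtype=float)
    phi = d[:, 0] * d[:, 1]                      # selected in both periods
    weights = kernel(index_diff / h) / h * phi   # kernel weight times indicator
    s_xx = (dx * weights[:, None]).T @ dx
    s_xy = (dx * weights[:, None]).T @ dy
    return np.linalg.solve(s_xx, s_xy)
```

The individual effect of the outcome equation drops out through the time difference, while the kernel weight downweights pairs whose estimated selection indices differ, which is where the selection effect would not cancel.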

It is interesting to note that the arguments used in estimating the main regression equation may be modified to accommodate the case of a truncated sample, that is when we only observe those individuals that have $d_{it} = 1$ for all time periods. Recall that our method for eliminating the sample selection effect from equation (2.1') is based on the fact that, under certain distributional assumptions, $\Delta w_i \gamma = 0$ implies $\Delta\lambda_i = 0$. However, $\Delta w_i = 0$ also implies $\Delta\lambda_i = 0$. In other words, we might dispense altogether with the first step of estimating $\gamma$, and estimate $\beta$ from those observations for which $w_{i1}$ and $w_{i2}$ are "close," which would suggest using the weights: $\psi_{in} = (1/h_n^q) K(\Delta w_i / h_n)$. Although this approach would imply a slower rate of convergence for the resulting estimator, this estimation scheme may be used for estimating $\beta$ from a truncated sample, in which case estimation of the selection equation is infeasible. An obvious drawback in this method is that, in order to consistently estimate the entire parameter vector $\beta$, we would have to impose the restriction that $w_{it}$ and $x^*_{it}$ do not contain any elements in common.

The above analysis extends naturally to the case of a longer (and possibly unbalanced) panel, that is when $T_i \geq 2$. Then $\beta$ could be estimated from those observations that have $d_{it} = d_{is} = 1$, and for which $w_{it}\hat{\gamma}_n$ and $w_{is}\hat{\gamma}_n$ are "close," for all $s, t = 1, \ldots, T_i$. The estimator is of the form


In the following section we derive the asymptotic properties of our proposed estimator for the main equation of interest, under the assumption that $\gamma$ has been consistently estimated. At the end of the section, we examine the applicability of existing estimators for obtaining first-step estimates of the selection equation.


3.1. Asymptotic Properties of the Estimator

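The derivation of the large sample properties of $\hat{\beta}_n$ of equations (2.3) and (2.4) proceeds in two steps. First, the asymptotic behavior of the infeasible estimator which uses the true $\gamma$ in the construction of the kernel weights, denoted by $\tilde{\beta}_n$, is analyzed. Then the large sample behavior of the difference $(\hat{\beta}_n - \tilde{\beta}_n)$ is investigated.

It will be useful to define the scalar index $W_i \equiv \Delta w_i \gamma$ and its estimated counterpart $\hat{W}_i \equiv \Delta w_i \hat{\gamma}_n$, along with the following quantities:

$S_{xx} \equiv \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n} K\left(\frac{W_i}{h_n}\right) \Delta x_i' \Delta x_i \Phi_i$, $\qquad \hat{S}_{xx} \equiv \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n} K\left(\frac{\hat{W}_i}{h_n}\right) \Delta x_i' \Delta x_i \Phi_i$,

$S_{x\nu} \equiv \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n} K\left(\frac{W_i}{h_n}\right) \Delta x_i' \Delta \nu_i \Phi_i$, $\qquad \hat{S}_{x\nu} \equiv \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n} K\left(\frac{\hat{W}_i}{h_n}\right) \Delta x_i' \Delta \nu_i \Phi_i$,

$S_{x\lambda} \equiv \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n} K\left(\frac{W_i}{h_n}\right) \Delta x_i' \Delta \lambda_i \Phi_i$, $\qquad \hat{S}_{x\lambda} \equiv \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n} K\left(\frac{\hat{W}_i}{h_n}\right) \Delta x_i' \Delta \lambda_i \Phi_i$.

With these definitions we can write: $\tilde{\beta}_n - \beta = S_{xx}^{-1}(S_{x\nu} + S_{x\lambda})$ and $\hat{\beta}_n - \beta = \hat{S}_{xx}^{-1}(\hat{S}_{x\nu} + \hat{S}_{x\lambda})$.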

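Our asymptotic results for the infeasible estimator are based on the following assumptions. From Section 2, $\Phi_i \equiv d_{i1}d_{i2}$, $\zeta_i \equiv (w_{i1}, w_{i2}, x^*_{i1}, x^*_{i2}, \alpha^*_i, \eta_i)$, and $\nu_{it} \equiv d_{it}\varepsilon^*_{it} - E(\varepsilon^*_{it} \mid d_{i1} = 1, d_{i2} = 1, \zeta_i)$.

ASSUMPTION R1: $(\varepsilon^*_{i1}, \varepsilon^*_{i2}, u_{i1}, u_{i2})$ and $(\varepsilon^*_{i2}, \varepsilon^*_{i1}, u_{i2}, u_{i1})$ are identically distributed conditional on $\zeta_i$. That is, $F(\varepsilon^*_{i1}, \varepsilon^*_{i2}, u_{i1}, u_{i2} \mid \zeta_i) = F(\varepsilon^*_{i2}, \varepsilon^*_{i1}, u_{i2}, u_{i1} \mid \zeta_i)$.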

As discussed in Section 2, this conditional exchangeability assumption is crucial to our method for eliminating the sample selection effect. Although in principle we could allow F to vary across individuals, it will be convenient for our analysis to assume that cross-section sampling is random:

ASSUMPTION R2: An i.i.d. sample, $\{(x^*_{it}, \varepsilon^*_{it}, \alpha^*_i, w_{it}, u_{it}, \eta_i); \; t = 1, 2\}_{i=1}^{n}$ is drawn from the population. For each $i = 1, \ldots, n$, and each $t = 1, 2$, we observe $(d_{it}, w_{it}, y_{it}, x_{it})$.

With this assumption, we may from now on drop the subscripts i that denote the identity of each panel member.

ASSUMPTION R3: $E(\Delta x' \Delta x \, \Phi \mid W = 0)$ is finite and nonsingular.

Note that this assumption implicitly imposes an exclusion restriction on the set of regressors, namely that at least one of the variables in the selection equation, $w_{it}$, is not contained in $x^*_{it}$.

ASSUMPTION R4: The marginal distribution of the index function $W \equiv \Delta w \gamma$ is absolutely continuous, with density function $f_W$, which is bounded from above on its support and strictly positive at zero, i.e. $f_W(0) > 0$. In addition, $f_W$ is almost everywhere $r$ times ($r \geq 1$) continuously differentiable and has bounded derivatives.

Observe that by definition, $\Phi_i \Delta x_i = \Phi_i \Delta x^*_i$. Thus, although certain assumptions are stated in terms of the observed regressors $x_{it}$, they also hold for the latent (possibly unobserved) $x^*_{it}$.

It is possible to relax certain smoothness assumptions so that they hold only in a neighborhood of $W$ near zero, at the cost though of more technical detail.


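ASSUMPTION R5: The unknown function $\Lambda(w_1\gamma + \eta, w_2\gamma + \eta, \zeta) \equiv E(\varepsilon^*_1 \mid d_1 = 1, d_2 = 1, \zeta) = E(\varepsilon^*_1 \mid u_1 \leq w_1\gamma + \eta, \; u_2 \leq w_2\gamma + \eta, \; \zeta)$ satisfies: $\Lambda(s_t, s_\tau, \zeta) - \Lambda(s_\tau, s_t, \zeta) = \tilde{\Lambda} \cdot (s_t - s_\tau)$ for $t, \tau = 1, 2$, where $\tilde{\Lambda}$ is a function of $(s_t, s_\tau, \zeta)$, i.e., $\tilde{\Lambda} = \tilde{\Lambda}(s_t, s_\tau, \zeta)$, which is bounded on its support.

This assumption is crucial to our analysis. It will be satisfied, for example, if $\Lambda$ is continuously differentiable with respect to its first two arguments, with bounded first-order partial derivatives (as, for example, when the errors are jointly normally distributed), in which case we may apply the multivariate mean-value theorem:

$\Lambda(s_1, s_2, \zeta) - \Lambda(s_2, s_1, \zeta) = \Lambda^{(1)}(c^*)(s_1 - s_2) + \Lambda^{(2)}(c^*)(s_2 - s_1) = \left[\Lambda^{(1)}(c^*) - \Lambda^{(2)}(c^*)\right](s_1 - s_2)$.

Here $\Lambda^{(j)}$ ($j = 1, 2$) denotes the first-order partial derivative of $\Lambda$ with respect to its first and second argument respectively, and $c^*$ lies on the line segment connecting $(w_1\gamma + \eta, w_2\gamma + \eta, \zeta)$ and $(w_2\gamma + \eta, w_1\gamma + \eta, \zeta)$. Thus, in this case, $\tilde{\Lambda} = \Lambda^{(1)}(c^*) - \Lambda^{(2)}(c^*)$, and by assumption will be bounded.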

ASSUMPTION R6: (a) $x^*_t$ and $\varepsilon^*_t$ have bounded $4 + 2\delta$ moments conditional on $W$, for any $\delta \in (0, 1)$. (b) $E(\Delta x' \Delta x \, \Phi \mid W)$ and $E(\Delta x' \Delta x \, \Delta\nu^2 \, \Phi \mid W)$ are continuous at $W = 0$ and do not vanish. (c) $E(\Delta x' \tilde{\Lambda} \Phi \mid W)$ is almost everywhere $r$ times continuously differentiable as a function of $W$, and has bounded derivatives.

ASSUMPTION R7: The function $K : \Re \to \Re$ satisfies: (a) $\int K(v)\,dv = 1$, (b) $\int |K(v)|\,dv < \infty$, (c) $\sup_v |K(v)| < \infty$, (d) $\int |v^{r+1}| |K(v)|\,dv < \infty$, and (e) $\int v^j K(v)\,dv = 0$ for all $j = 1, \ldots, r$.

ASSUMPTION R8: $h_n \to 0$ and $nh_n \to \infty$ as $n \to \infty$.

From our analysis in Section 2, it is easy to see that Assumptions R1-R3 would suffice to identify $\beta$ for known $\gamma$. An identification scheme in the spirit of our discussion in Section 2 would obviously require support of $W$ at zero, as well as nonsingularity of the matrix $\Sigma_{xx}$, imposed by Assumption R3, analogous to the familiar full rank assumption.

The continuity of the distribution of the index $W$, imposed in Assumption R4, is a regularity condition, common in kernel estimation of density and regression functions. It is precisely this continuity that renders the OLS estimator of Section 2 infeasible, even if $\gamma$ were known.

Notice that by Assumption R1, the functional form of $\Lambda$ is the same over time for the same individual, while by Assumption R2, it is also the same across individuals.

10 In principle, we could dispense with the assumption that $\tilde{\Lambda}$ is bounded, by assuming that it has finite fourth moment conditional on $W$.

Since our estimation scheme is based on pairs of observations for which $W_i \equiv \Delta w_i \gamma \cong 0$, it is obvious that additional smoothness conditions are required. These are imposed by Assumptions R4-R8. Notice, in particular, Assumption R5, which imposes a Lipschitz continuity property on the selection correction function $\Lambda(\cdot)$. It is easy to see that simple continuity will not be sufficient to guarantee that $\Delta\lambda_i \to 0$ as $W_i \to 0$, since $\Delta\lambda_i$ is not a function of $W_i$. Furthermore, similarly to kernel density and regression estimation, a high order of differentiability $r$ for certain functions of the index $W$, along with the appropriate choice of the kernel function and the bandwidth sequence, imply a faster rate of convergence in distribution for $\tilde{\beta}_n$. Specifically, we choose a "$(r + 1)$th order bias-reducing" kernel, which by Assumption R7(e) is required to be negative in part of its domain.
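To make the moment conditions on the kernel concrete, the sketch below numerically checks one familiar Gaussian-based fourth-order kernel, $K(v) = \tfrac{1}{2}(3 - v^2)\phi(v)$, which corresponds to the case $r = 3$. This particular kernel is an illustrative assumption and is not taken from the paper; the helper names and the integration grid are likewise hypothetical.

```python
import numpy as np

def gaussian_pdf(v):
    return np.exp(-0.5 * v**2) / np.sqrt(2.0 * np.pi)

def fourth_order_kernel(v):
    """Gaussian-based kernel whose first three moments vanish (illustrative choice)."""
    return 0.5 * (3.0 - v**2) * gaussian_pdf(v)

def kernel_moment(kernel, j, lo=-12.0, hi=12.0, num=200_001):
    """Trapezoid-rule approximation of the integral of v**j * K(v)."""
    v = np.linspace(lo, hi, num)
    vals = v**j * kernel(v)
    return float(np.sum(0.5 * (vals[:-1] + vals[1:])) * (v[1] - v[0]))

for j in range(5):
    print(j, round(kernel_moment(fourth_order_kernel, j), 6))
print("K(2) =", fourth_order_kernel(2.0))
```

The kernel integrates to one, its moments of order one to three are zero, its fourth moment is nonzero, and it takes negative values for $|v| > \sqrt{3}$, in line with the remark that a bias-reducing kernel must be negative on part of its domain.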

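The next lemma establishes the asymptotic properties of the infeasible estimator $\tilde{\beta}_n$.

LEMMA 1: Let Assumptions R1-R8 hold. Define

$\Sigma_{xx} \equiv f_W(0)\, E(\Delta x' \Delta x \, \Phi \mid W = 0)$,

$\Sigma_{x\nu} \equiv f_W(0)\, E(\Delta x' \Delta x \, \Delta\nu^2 \, \Phi \mid W = 0) \int K(v)^2\,dv$,

$\Sigma_{x\lambda} \equiv \frac{1}{r!}\, g^{(r)}(0) \int v^{r+1} K(v)\,dv$,

where $g^{(r)}(0)$ is the $(k \times 1)$ vector of $r$th-order derivatives of $g(W) \equiv E(\Delta x' \tilde{\Lambda} \Phi \mid W) f_W(W)$ evaluated at $W = 0$. Then,

(a) $S_{xx} \xrightarrow{p} \Sigma_{xx}$.

(b) If $\sqrt{nh_n}\,h_n^{r+1} \to \tilde{h}$ with $0 \leq \tilde{h} < \infty$, then (i) $\sqrt{nh_n}\,S_{x\nu} \xrightarrow{d} N(0, \Sigma_{x\nu})$, and (ii) $\sqrt{nh_n}\,S_{x\lambda} \xrightarrow{p} \tilde{h}\,\Sigma_{x\lambda}$.

(c) If $\sqrt{nh_n}\,h_n^{r+1} \to \infty$, then (i) $h_n^{-(r+1)} S_{x\nu} \xrightarrow{p} 0$ and (ii) $h_n^{-(r+1)} S_{x\lambda} \xrightarrow{p} \Sigma_{x\lambda}$.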

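The asymptotic properties of $\tilde{\beta}_n$ easily follow from the previous Lemma: If $\sqrt{nh_n}\,h_n^{r+1} \to \tilde{h}$, then $\sqrt{nh_n}(\tilde{\beta}_n - \beta) \xrightarrow{d} N(\tilde{h}\,\Sigma_{xx}^{-1}\Sigma_{x\lambda}, \; \Sigma_{xx}^{-1}\Sigma_{x\nu}\Sigma_{xx}^{-1})$, while if $\sqrt{nh_n}\,h_n^{r+1} \to \infty$, then $h_n^{-(r+1)}(\tilde{\beta}_n - \beta) \xrightarrow{p} \Sigma_{xx}^{-1}\Sigma_{x\lambda}$.

In order to derive the asymptotic properties of the feasible estimator $\hat{\beta}_n$, we will make the following additional assumptions:

ASSUMPTION R9: In addition to the conditions of Assumption R7, the kernel function satisfies: (a) $K(v)$ is three times continuously differentiable with bounded derivatives, and (b) $\int |K'(v)|\,dv$, $\int |K''(v)|\,dv$, $\int v^2 |K'(v)|\,dv$, and $\int v^2 |K''(v)|\,dv$ are finite.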


The conditions of Assumption R9 are satisfied, for example, for $K(v)$ being the standard normal density function, which is a second order kernel.


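ASSUMPTION R10: $x^*_t$, $\varepsilon^*_t$, and $w_t$ have bounded $8 + 4\delta$ moments conditional on $W$, for some $\delta \in (0, 1)$. In addition, $E(\Delta x_l \, \Delta\nu \, \Delta w_j \, \Phi \mid W)$ and $E(\Delta x_l \, \Delta\nu \, \Delta w_j \, \Delta w_m \, \Phi \mid W)$ are continuous at $W = 0$ for all $l = 1, \ldots, k$ and $j, m = 1, \ldots, q$.

ASSUMPTION R11: The parameter vector $\gamma$ in the selection equation lies in a compact set, and $\hat{\gamma}_n$ is a consistent estimator that satisfies: $\hat{\gamma}_n - \gamma = O_p(n^{-p})$, where $2/5 < p \leq 1/2$.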

For example, p = 1/2, if y is estimated by maximizing the conditional likelihood function.

ASSUMPTION R12: $h_n = h \cdot n^{-\mu}$, where $0 < h < \infty$, and $1 - 2p < \mu < p/2$.

Assumption R12 is crucial for establishing the result that follows. This result states that $\hat{S}_{xx}$, $\hat{S}_{x\nu}$, and $\hat{S}_{x\lambda}$ have the same probability limits as their infeasible counterparts $S_{xx}$, $S_{x\nu}$, and $S_{x\lambda}$, provided that the bandwidth sequence $h_n$ is chosen appropriately for any given rate of convergence of the first-step estimator, that is for any given $p$, and for any degree of smoothness $r$.

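LEMMA 2: Let Assumptions R1-R12 hold. Then:

(a) $\hat{S}_{xx}^{-1} - S_{xx}^{-1} = o_p(1)$.

(b) If $\sqrt{nh_n}\,h_n^{r+1} \to \tilde{h}$ with $0 \leq \tilde{h} < \infty$, then (i) $\sqrt{nh_n}(\hat{S}_{x\nu} - S_{x\nu}) = o_p(1)$ and (ii) $\sqrt{nh_n}(\hat{S}_{x\lambda} - S_{x\lambda}) = o_p(1)$.

(c) If $\sqrt{nh_n}\,h_n^{r+1} \to \infty$, then (i) $h_n^{-(r+1)}(\hat{S}_{x\nu} - S_{x\nu}) = o_p(1)$ and (ii) $h_n^{-(r+1)}(\hat{S}_{x\lambda} - S_{x\lambda}) = o_p(1)$.

Lemma 2 readily implies that, if $\sqrt{nh_n}\,h_n^{r+1} \to \tilde{h}$ then $\sqrt{nh_n}(\hat{\beta}_n - \tilde{\beta}_n) = o_p(1)$, while if $\sqrt{nh_n}\,h_n^{r+1} \to \infty$, then $h_n^{-(r+1)}(\hat{\beta}_n - \tilde{\beta}_n) = o_p(1)$. Since $(\hat{\beta}_n - \beta) = (\hat{\beta}_n - \tilde{\beta}_n) + (\tilde{\beta}_n - \beta)$, we have the following theorem:

THEOREM 1: Let Assumptions R1-R12 hold.

(a) If $\sqrt{nh_n}\,h_n^{r+1} \to \tilde{h}$, with $0 \leq \tilde{h} < \infty$, then $\sqrt{nh_n}(\hat{\beta}_n - \beta) \xrightarrow{d} N(\tilde{h}\,\Sigma_{xx}^{-1}\Sigma_{x\lambda}, \; \Sigma_{xx}^{-1}\Sigma_{x\nu}\Sigma_{xx}^{-1})$.

(b) If $\sqrt{nh_n}\,h_n^{r+1} \to \infty$, then $h_n^{-(r+1)}(\hat{\beta}_n - \beta) \xrightarrow{p} \Sigma_{xx}^{-1}\Sigma_{x\lambda}$.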

11 Compactness of the parameter space is required for consistency of both Manski's estimator and the smoothed conditional maximum score estimator, while it is not required for the conditional maximum likelihood estimator. Notice though, that since $\gamma$ can only be estimated up to scale, we can always normalize it so that it lies on the unit circle. Thus the compactness assumption is not restrictive.

Thus, in the limit, the fact we are using $\hat{\gamma}_n$ to estimate $\beta$ does not affect the asymptotic distribution of $\hat{\beta}_n$. The lower bound on $\mu$, imposed by Assumption R12, is the key for this result to hold. In words, this bound implies that $\beta$ is estimated at a rate slower than $\gamma$. Indeed, from Theorem 1, the rate of convergence of $\hat{\beta}_n$ is $(nh_n)^{-1/2} = n^{-(1-\mu)/2}$, which is obviously slower than $n^{-p}$, since $\mu > 1 - 2p$. Thus in effect, Assumption R12 requires that $\sqrt{nh_n}(\hat{\gamma}_n - \gamma) = o_p(1)$.
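The rate comparison can be checked with exact arithmetic. The sketch below assumes the reading of the assumptions in which the first-step estimator converges at rate $n^{-p}$ with $2/5 < p \leq 1/2$ and the bandwidth is $h_n = h \cdot n^{-\mu}$ with $1 - 2p < \mu < p/2$; it returns the exponent $(1 - \mu)/2$ of the second-step rate $(nh_n)^{-1/2}$. The function name `second_step_rate` is hypothetical.

```python
from fractions import Fraction

def second_step_rate(mu, p):
    """Return the exponent a such that the second step converges at n**(-a).

    mu: bandwidth exponent in h_n = h * n**(-mu); p: first-step rate n**(-p).
    Raises ValueError when (mu, p) violates 2/5 < p <= 1/2 or 1 - 2p < mu < p/2.
    """
    mu, p = Fraction(mu), Fraction(p)
    if not (Fraction(2, 5) < p <= Fraction(1, 2)):
        raise ValueError("need 2/5 < p <= 1/2")
    if not (1 - 2 * p < mu < p / 2):
        raise ValueError("need 1 - 2p < mu < p/2")
    return (1 - mu) / 2

a = second_step_rate(Fraction(1, 5), Fraction(1, 2))
print("second-step exponent:", a, "| slower than first step:", a < Fraction(1, 2))
```

For a root-$n$ first step ($p = 1/2$) the admissible bandwidth exponents are $0 < \mu < 1/4$, and any such choice gives a second-step exponent strictly below $1/2$.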

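In principle, we could allow $\beta$ to be estimated at the same rate as $\gamma$. Thus, if $\sqrt{nh_n}(\hat{\gamma}_n - \gamma) = O_p(1)$ for $\sqrt{nh_n}\,h_n^{r+1} \to \tilde{h}$, we obtain the following asymptotic representation, which may be easily derived from the analysis of Lemma 2(b) in the Appendix:

$\sqrt{nh_n}(\hat{\beta}_n - \beta) = \sqrt{nh_n}(\tilde{\beta}_n - \beta) + \Sigma_{xx}^{-1}\Omega \sqrt{nh_n}(\hat{\gamma}_n - \gamma) + o_p(1)$,

where

$\Omega \equiv \mathrm{plim}_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n^2} K'\left(\frac{W_i}{h_n}\right) \Delta x_i' \Delta w_i \, \Delta\lambda_i \, \Phi_i$,

provided that $E(\Delta x_l \, \Delta w_j \, \tilde{\Lambda} \, \Phi \mid W)$ is continuous at $W = 0$ and $vK(v) \to 0$ as $|v| \to \infty$. Asymptotic normality of $\hat{\beta}_n$ may still be established if $\sqrt{nh_n}(\hat{\gamma}_n - \gamma)$ has an asymptotic representation of the form: $\sqrt{nh_n}(\hat{\gamma}_n - \gamma) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \psi_n(\Delta w_i, \Delta d_i, \gamma) + o_p(1)$.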

At first glance it looks attractive to eliminate the asymptotic bias of $\hat{\beta}_n$ by choosing $h_n$ so that $\sqrt{nh_n}\,h_n^{r+1} \to \tilde{h} = 0$, or equivalently by setting $\mu > 1/(2(r + 1) + 1)$. In that case, however, the rate of convergence of $\hat{\beta}_n$ is lower than when $\tilde{h} > 0$. Indeed, the rate of convergence in distribution of $\hat{\beta}_n$ is maximized by making $\mu$ as small as possible, that is by setting $\mu = 1/(2(r + 1) + 1)$, in which case it becomes $n^{-(r+1)/(2(r+1)+1)}$. Thus, for $r$ large enough, the estimator converges at a rate that can be arbitrarily close to $n^{-1/2}$, provided also that $\gamma$ is estimated fast enough, that is provided $p > (r + 1)/(2(r + 1) + 1)$.

Although the proposed estimator is asymptotically biased, it is possible to eliminate the asymptotic bias while maintaining the maximal rate of convergence, in the manner suggested by Bierens (1987).

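COROLLARY: Let $\hat\beta_n$ be the estimator with window width $h_n = h\cdot n^{-1/(2(r+1)+1)}$, and $\hat\beta_{n,\delta}$ the estimator with window width $h_{n,\delta} = h\cdot n^{-\delta/(2(r+1)+1)}$, where $\delta \in (0,1)$. Define

$$\hat{\hat\beta}_n \equiv \frac{\hat\beta_n - n^{-(1-\delta)(r+1)/(2(r+1)+1)}\,\hat\beta_{n,\delta}}{1 - n^{-(1-\delta)(r+1)/(2(r+1)+1)}}.$$

Then, $n^{(r+1)/(2(r+1)+1)}(\hat{\hat\beta}_n - \beta) \xrightarrow{d} N(0,\ h^{-1}\Sigma_{xx}^{-1}\Sigma_{x\varepsilon}\Sigma_{xx}^{-1})$.

12 We can also derive an asymptotic representation for $\hat\beta_n$ in the case where $\gamma$ is estimated at rate $n^{-p}$ that is slower than $1/\sqrt{nh_n}$. In this case we obtain $n^{p}(\hat\beta_n - \beta) = \Sigma_{xx}^{-1}\Omega\cdot n^{p}(\hat\gamma_n - \gamma) + o_p(1)$, which implies that $\hat\beta_n$ converges at the same rate as $\hat\gamma_n$, which is slower than the "optimal" rate obtained for the infeasible estimator $\hat\beta_n$, that is when $\gamma$ is known.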
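As a hedged illustration of the Corollary, the sketch below forms the linear combination of the two estimates and checks, on a stylized example, that a leading bias term proportional to $h^{r+1}$ cancels exactly. The function name and the numerical values are assumptions chosen for illustration only.

```python
def bias_corrected(beta_n, beta_n_delta, n, r, delta):
    """Combine estimates computed with bandwidths h*n**(-1/D) and
    h*n**(-delta/D), D = 2(r+1)+1, as in the Corollary."""
    exponent = (1.0 - delta) * (r + 1) / (2 * (r + 1) + 1)
    c = n ** (-exponent)
    return (beta_n - c * beta_n_delta) / (1.0 - c)


if __name__ == "__main__":
    # Stylized check: each estimate equals beta plus a bias B * bandwidth**(r+1).
    beta, B, h, n, r, delta = 1.0, 0.7, 1.0, 1000, 1, 0.1
    D = 2 * (r + 1) + 1
    est = beta + B * (h * n ** (-1.0 / D)) ** (r + 1)
    est_delta = beta + B * (h * n ** (-delta / D)) ** (r + 1)
    print(bias_corrected(est, est_delta, n, r, delta))  # close to beta
```

The weight on $\hat\beta_{n,\delta}$ shrinks to zero as $n$ grows, which is why the correction leaves the limiting variance unchanged while it inflates variability in finite samples.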



In order to compute $\hat\beta_n$ or $\hat{\hat\beta}_n$ in an application, one needs to choose the kernel function $K$, and to assign a numerical value to the bandwidth parameter $h_n$. Results on kernel density and regression function estimation suggest that the asymptotic performance of the estimator will likely be more sensitive to the choice of the window width than to the choice of the kernel. Furthermore, the asymptotic normality result of the Corollary above shows that the variance of the limiting distribution depends crucially on the choice of the constant $h$. We will thus focus here on the problem of bandwidth selection. Bierens (1987) discusses the construction of high order bias-reducing kernels.

For a given order of differentiability $r$, and a given sample size $n$, the results of Theorem 1 suggest that $h_n = h\cdot n^{-\mu}$ be chosen so that $\mu = 1/(2(r+1)+1)$. So the problem of bandwidth selection reduces to the problem of choosing the constant $h$. A natural way to proceed (see Horowitz (1992) and Härdle (1990)) is to choose $h$ so as to minimize some kind of measure of the "distance" of the estimator from the true value, based on the asymptotic result of Theorem 1. Consider for example minimizing the asymptotic mean squared error of the estimator, defined as:

$$MSE = n^{-2(r+1)/(2(r+1)+1)}\ \operatorname{trace}\!\left[A\,\Sigma_{xx}^{-1}\left(h^{-1}\Sigma_{x\varepsilon} + h^{2(r+1)}\Sigma_{x\lambda}\Sigma_{x\lambda}'\right)\Sigma_{xx}^{-1}\right]$$

for any nonstochastic positive semidefinite matrix $A$ that satisfies $\Sigma_{x\lambda}'\Sigma_{xx}^{-1}A\,\Sigma_{xx}^{-1}\Sigma_{x\lambda} \neq 0$. It is straightforward to show that MSE is minimized by setting

(3.2.1) $$h = h^{*} = \left[\frac{\operatorname{trace}\left[\Sigma_{xx}^{-1}A\,\Sigma_{xx}^{-1}\Sigma_{x\varepsilon}\right]}{2(r+1)\,\Sigma_{x\lambda}'\Sigma_{xx}^{-1}A\,\Sigma_{xx}^{-1}\Sigma_{x\lambda}}\right]^{1/(2(r+1)+1)}$$
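A minimal sketch of the plug-in constant in (3.2.1), assuming the formula reads $h^* = [\operatorname{trace}(\Sigma_{xx}^{-1}A\Sigma_{xx}^{-1}\Sigma_{x\varepsilon}) / (2(r+1)\Sigma_{x\lambda}'\Sigma_{xx}^{-1}A\Sigma_{xx}^{-1}\Sigma_{x\lambda})]^{1/(2(r+1)+1)}$. The function name, the argument names, and the default $A = I$ are illustrative assumptions; in practice the three matrices would be replaced by consistent estimates.

```python
import numpy as np


def optimal_bandwidth_constant(sigma_xx, sigma_xe, sigma_xl, r, A=None):
    """Evaluate h* from (3.2.1).

    sigma_xx, sigma_xe : (k, k) arrays; sigma_xl : length-k array.
    A : (k, k) positive semidefinite weight matrix (identity if omitted,
        which is an assumption of this sketch, not a recommendation).
    """
    sigma_xx = np.atleast_2d(np.asarray(sigma_xx, dtype=float))
    sigma_xe = np.atleast_2d(np.asarray(sigma_xe, dtype=float))
    sigma_xl = np.asarray(sigma_xl, dtype=float).reshape(-1)
    k = sigma_xx.shape[0]
    A = np.eye(k) if A is None else np.atleast_2d(np.asarray(A, dtype=float))

    inv_xx = np.linalg.inv(sigma_xx)
    middle = inv_xx @ A @ inv_xx
    variance_part = np.trace(middle @ sigma_xe)
    bias_part = 2 * (r + 1) * (sigma_xl @ middle @ sigma_xl)
    return (variance_part / bias_part) ** (1.0 / (2 * (r + 1) + 1))


if __name__ == "__main__":
    # Scalar example: the criterion is 2/h + h**4, minimized where h**5 = 1/2.
    print(optimal_bandwidth_constant([[1.0]], [[2.0]], [1.0], r=1))
```

The scalar example makes the trade-off visible: the variance term falls like $h^{-1}$ while the squared bias term grows like $h^{2(r+1)}$.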

This last expression suggests that we may construct a consistent estimate of $h^*$ if consistent estimates of $\Sigma_{xx}$, $\Sigma_{x\varepsilon}$, and $\Sigma_{x\lambda}$ are available. By part (a) of Lemmata 1 and 2, $\hat S_{xx}$ consistently estimates $\Sigma_{xx}$ for any $h_n$ that satisfies $h_n \to 0$ and $nh_n \to \infty$. In the next theorem, we provide consistent estimators of $\Sigma_{x\varepsilon}$ and $\Sigma_{x\lambda}$.


THEOREM 2:$^{13}$ Assume that Assumptions R1-R12 hold. (a) Let $\hat\beta_n$ be a consistent estimator of $\beta$ based on $h_n = h\cdot n^{-1/(2(r+1)+1)}$, and define $\hat e_i = \Delta y_i - \Delta x_i\hat\beta_n$.


(b) Let $h_{n,\delta} = h\cdot n^{-\delta/(2(r+1)+1)}$, where $0 < \delta < 1$. Then, for $\hat e_i$ defined as in part (a),

13 The proof of Theorem 2 is omitted here to conserve space. It is available at the author's world wide web page.

Returning to our discussion about the construction of the estimator of $\beta$ in practice, we propose the following method (see also Horowitz (1992)). In the first stage, for a given $r$ and $n$, choose any $h_n = h\cdot n^{-1/(2(r+1)+1)}$ and any $h_{n,\delta} = h\cdot n^{-\delta/(2(r+1)+1)}$, with $h$ an arbitrary positive constant, and $0 < \delta < 1$. Compute $\hat\beta_n$ based on $h_n$, and construct $\hat e_i$ as defined in Theorem 2. Use $\hat\beta_n$ to compute the estimates of $\Sigma_{xx}$, $\Sigma_{x\varepsilon}$, and $\Sigma_{x\lambda}$ as discussed above. Then estimate $h^*$ by $\hat h^*$, using equation (3.2.1) with $\Sigma_{xx}$, $\Sigma_{x\varepsilon}$, and $\Sigma_{x\lambda}$ replaced by their consistent estimates. In the second stage, compute the asymptotic bias-corrected estimates as in the Corollary using $\hat h^*$ as the constant in the definition of $h_n$ and $h_{n,\delta}$. This two-stage procedure is similar to the "plug-in" method used in kernel density and regression function estimation, and it shares the same disadvantages: First, it involves the choice of a smoothing parameter in the first stage, namely choosing the initial constant $h$. Second, by specifying the order of differentiability $r$, the researcher is restricted to a certain smoothness class.

It is interesting to note that standard statistical software may be used for computing estimates for the main equation and their standard errors: Given a consistent estimate $\hat\gamma_n$ for the selection equation, and a bandwidth $h_n = h\cdot n^{-1/(2(r+1)+1)}$, run the OLS regression of $\tilde y_i = \sqrt{K(\Delta w_i\hat\gamma_n/h_n)}\,\Delta y_i\,\Phi_i$ on $\tilde x_i = \sqrt{K(\Delta w_i\hat\gamma_n/h_n)}\,\Delta x_i\,\Phi_i$, and compute the (asymptotically biased) estimate $\hat\beta_n$.

Standard errors are obtained from the Eicker-White covariance matrix:

using the residuals from the regression, $\hat{\tilde e}_i = \tilde y_i - \tilde x_i\hat\beta_n$. The bias-corrected estimate $\hat{\hat\beta}_n$ is obtained as a linear combination of $\hat\beta_n$ and $\hat\beta_{n,\delta}$, as described in the Corollary of Theorem 1, where $\hat\beta_{n,\delta}$ comes from the auxiliary OLS regression of $\tilde y_i$ on $\tilde x_i$ with bandwidth $h_{n,\delta} = h\cdot n^{-\delta/(2(r+1)+1)}$.
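The OLS recipe just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the kernel is taken to be the standard normal density (so that its square root is well defined), the covariance matrix is the textbook Eicker-White sandwich for the transformed regression, and all function and argument names are hypothetical.

```python
import numpy as np


def normal_kernel(v):
    """Standard normal density, a nonnegative second order kernel."""
    return np.exp(-0.5 * v ** 2) / np.sqrt(2.0 * np.pi)


def kernel_weighted_ols(dy, dx, dw, phi, gamma_hat, h_n, kernel=normal_kernel):
    """Second-step estimate of beta with heteroskedasticity-robust s.e.

    dy : (n,) first differences of the outcome
    dx : (n, k) first differences of the main-equation regressors
    dw : (n, q) first differences of the selection-equation regressors
    phi : (n,) indicator of being selected in both periods (0/1)
    gamma_hat : (q,) first-step estimate; h_n : bandwidth
    """
    root_k = np.sqrt(kernel(dw @ gamma_hat / h_n)) * phi
    y_t = root_k * dy                   # transformed dependent variable
    x_t = root_k[:, None] * dx          # transformed regressors

    xtx = x_t.T @ x_t
    beta_hat = np.linalg.solve(xtx, x_t.T @ y_t)

    resid = y_t - x_t @ beta_hat
    bread = np.linalg.inv(xtx)
    meat = (x_t * (resid ** 2)[:, None]).T @ x_t
    cov = bread @ meat @ bread          # Eicker-White sandwich
    return beta_hat, np.sqrt(np.diag(cov))
```

Running the same function with bandwidth $h_{n,\delta}$ gives the auxiliary estimate needed for the bias correction. A kernel that takes negative values, such as a higher order bias-reducing kernel, would require weighted least squares instead of the square-root transformation.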

We next turn to the problem of estimating the unknown parameter vector $\gamma$ in the selection equation. As we established, the asymptotic results obtained for the proposed estimator of $\beta$ depend crucially on the rate of convergence of the first-step estimator of $\gamma$. In particular, it is straightforward to establish consistency$^{14}$ of $\hat\beta_n$ if $h_n^{-2}(\hat\gamma_n - \gamma) = o_p(1)$, for any $h_n$ that satisfies Assumption R8, i.e., for $h_n \to 0$ and $nh_n \to \infty$. On the other hand, the asymptotic normality result of Theorem 1 requires that $\sqrt{nh_n}(\hat\gamma_n - \gamma) = o_p(1)$, for any $h_n$ that satisfies $\sqrt{nh_n}\,h_n^{r+1} \to \tilde h$, with $0 \le \tilde h < \infty$.

The conditions for obtaining consistency and asymptotic normality of $\hat\beta_n$ are satisfied by the conditional maximum likelihood estimator proposed by Rasch (1960, 1961) and Andersen (1970), which is consistent and root-$n$ asymptotically normal, under the assumption that the errors in the selection equation are white noise with a logistic distribution and independent of the regressors and the individual effects. In fact, as Chamberlain (1992) has shown, if the support of the predictor variables in the selection equation is bounded, then identification of $\gamma$ is possible only in the logistic case. Furthermore, even if the support is unbounded, in which case $\gamma$ may be identified and thus consistently estimated, consistent estimation at rate $n^{-1/2}$ is possible only in the logistic case. As is well known though, if the distribution of the errors is misspecified, the conditional maximum likelihood approach will in general produce inconsistent estimators.
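For a two-period panel, the conditional maximum likelihood (conditional logit) estimator reduces to an ordinary logit of the direction of the switch on $\Delta w_i$ among individuals with $d_{i1} \neq d_{i2}$, because the individual effect drops out of the conditional probability. The sketch below is a minimal Newton-Raphson implementation of that reduction; the function name, iteration limits, and starting value are illustrative assumptions.

```python
import numpy as np


def conditional_logit(d1, d2, w1, w2, max_iter=50, tol=1e-10):
    """Two-period conditional logit for the selection equation.

    d1, d2 : (n,) binary selection indicators in periods 1 and 2
    w1, w2 : (n, q) selection-equation regressors in periods 1 and 2
    Returns the estimate of gamma based on switchers only.
    """
    d1 = np.asarray(d1)
    d2 = np.asarray(d2)
    switchers = d1 != d2
    y = d2[switchers].astype(float)     # 1 if the switch is 0 -> 1
    dw = (np.asarray(w2, float) - np.asarray(w1, float))[switchers]

    gamma = np.zeros(dw.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(dw @ gamma)))
        score = dw.T @ (y - p)
        hessian = (dw * (p * (1.0 - p))[:, None]).T @ dw
        step = np.linalg.solve(hessian, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma
```

Only the switchers contribute to the likelihood, which is why only a fraction of each simulated sample is used in the first step of the Monte Carlo experiments reported later.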

Another possible choice for estimating $\gamma$ is the conditional maximum score estimator, proposed by Manski (1987). Under fairly weak distributional assumptions, this estimator consistently estimates $\gamma$ up to scale. However, the results of Cavanagh (1987), and Kim and Pollard (1990) for the maximum score estimator proposed by Manski (1975, 1985) for the cross section binary response model, namely that it converges at the slow rate of $n^{-1/3}$ to a non-normal random variable, suggest that these properties carry through to its panel data analog, the conditional maximum score estimator. Thus, if $(\hat\gamma_n - \gamma) = O_p(n^{-1/3})$, it is possible to consistently estimate $\beta$ by choosing $h_n$ to satisfy $n^{1/3}h_n^{2} \to \infty$. In this case though, the analysis for obtaining the asymptotic distribution for $\hat\beta_n$ is not applicable.

It is possible, however, to modify Manski's conditional maximum score estimator and obtain control over both its rate of convergence and its limiting distribution, by imposing sufficient smoothness on the distribution of the errors and the explanatory variables in the selection equation. Specifically, following the approach taken by Horowitz (1992) for estimating the cross section binary response model, we can construct a "smoothed conditional maximum score" estimator, which under weak (but stronger than Manski's) assumptions, is consistent and asymptotically normally distributed, with a rate of convergence that can be arbitrarily close to $n^{-1/2}$, depending on the amount of smoothness we are willing to assume for the underlying distributions. This estimator is considered in an earlier version of the paper (Kyriazidou (1994)) and also in Charlier et al. (1995).

14 Consistency of $\hat\beta_n$ may be established under the weaker restriction that $h_n^{-1}\|\hat\gamma_n - \gamma\| = o_p(1)$. The proof of Lemma 2(a) would then have to be modified, by taking a third instead of a first order Taylor series expansion. This modification does not alter the basic restriction for obtaining an asymptotic distribution for $\hat\beta_n$ which does not depend on the estimation of $\gamma$ in the first step, namely that $\gamma$ has to be estimated at a faster rate than $\beta$. Notice that in this case, the upper bound on $\mu$ in Assumption R12 would have to be replaced by $(6p-1)/7$. However, this modification would affect the proof of Theorem 2, which would become unnecessarily complicated and long.


In this section we illustrate certain finite sample properties of the proposed estimator. The Monte Carlo results presented here are in no sense representative of the estimator's sampling behavior since only one experimental design is considered. Further, there is little justification for the choice of the particular design, except that it is simple to set up and that, in the absence of sample selectivity, ordinary least squares on the first differences would perform quite well. The simulation study of this section is intended more as an investigation of the sensitivity of the estimator to the choice of bandwidth, the order of the kernel, the proposed asymptotic bias correction, the first step estimation method, the performance in practice of the proposed plug-in method for estimating the bandwidth constant, and finally the practical usefulness of the proposed covariance matrix estimator in testing hypotheses about the main regression equation coefficients.

Data for the Monte Carlo experiments are generated according to the model:

where $\beta_0 = 1$, $\gamma_1 = \gamma_2 = 1$, $w_{1it}$ and $w_{2it}$ are independent $N(-1,1)$ variables, $\eta_i = (w_{1i1} + w_{1i2})/2 + 2\zeta_{1i}$, with $\zeta_{1i}$ an independent variable distributed uniformly over the interval (0,1), $u_{it}$ is logistically distributed normalized to have variance equal to 1, $x_{it} = w_{2it}$, $\alpha_i = (w_{2i1} + w_{2i2})/2 + \zeta_{2i}$, with $\zeta_{2i}$ an independent $N(0,2)$ variable, and $\varepsilon_{it} = 0.8\zeta_{3it} + 0.6u_{it}$, with $\zeta_{3it}$ an independent standard normal variable. All data are generated i.i.d. across individuals and over time. This design implies that $\Pr(d_1 + d_2 = 1) = 0.37$, and $\Pr(d_1 = d_2 = 1) = 0.31$, so that approximately 37 percent of each sample is used in the first step estimation of the selection equation and approximately 31 percent in the second step. Each Monte Carlo experiment is performed 1000 times, while the same pseudorandom number sequences are used for each one of three different sample sizes $n$: 250, 1000, and 4000.

Table I presents the finite sample properties of the "naive" estimator, denoted by $\hat\beta_{NAIVE}$, that ignores sample selectivity and is therefore inconsistent. This estimator is obtained by applying OLS on the first differences using only those individuals that are selected into the sample in both time periods, i.e. those that have $d_{i1} = d_{i2} = 1$. This estimator may be viewed as a limiting case of our proposed estimator with bandwidth equal to infinity. Panel A reports the estimated mean bias and root mean squared error (RMSE) for this estimator over 1000 replications for different sample sizes $n$. As the estimator may not have a finite mean or variance in any finite sample, we also report its median bias and the median absolute deviation (MAD). Panel B reports the number of rejections of the null hypothesis that $\beta$ is equal to its true value $\beta_0 = 1$ at the 1, 5, 10, and 20 percent significance levels. Both panels confirm that the estimator is inconsistent.

TABLE I. Panel A: Finite Sample Properties of $\hat\beta_{NAIVE}$ (columns: Mean Bias, Median Bias, RMSE, MAD). Panel B: Sizes of $t$ tests (nominal levels 0.01, 0.05, 0.10, 0.20).

Table II presents the finite sample properties of the proposed two-step estimator. The left-hand-side panels are for $\hat\beta_n$ obtained by specifying $r = 1$ and using $K(v) = \phi(v)$, where $\phi$ is the density of the standard normal distribution, which is a second order bias-reducing kernel. The bandwidth sequence is $h_n = h\cdot n^{-1/(2(r+1)+1)} = h\cdot n^{-1/5}$ with $h = 1$. The panels on the right-hand side present the results for $\hat{\hat\beta}_n$, the estimator of the Corollary of Theorem 1 which corrects for asymptotic bias, where we use $\delta = 0.1$. Going from top to bottom of Table II, Panel A reports the results for the proposed estimator using the true $\gamma$ in the construction of the kernel weights.$^{15}$ In Panel B, $\gamma$ is estimated by conditional logit, denoted by $\hat\gamma_L$, which in this case will be consistent since all of the assumptions underlying the approach hold in our Monte Carlo design. In Panel C, $\gamma$ is estimated using the conditional maximum score estimator,$^{16}$ denoted by $\hat\gamma_{CMS}$, and in Panels D and E we use the smoothed conditional maximum score estimator, denoted by $\hat\gamma_{SCMS}$. In Panel D, $\gamma$ is estimated at a rate faster than $\beta$, while in Panel E both $\beta$ and $\gamma$ are estimated at the same rate.$^{17}$

TABLE II
FINITE SAMPLE PROPERTIES OF $\hat\beta_n$ AND $\hat{\hat\beta}_n$: $h_n = n^{-1/5}$, $K(v) = \phi(v)$

Column groups: $\hat\beta_n$ (Without Asymptotic Bias Correction) and $\hat{\hat\beta}_n$ (With Asymptotic Bias Correction); within each group: Mean Bias, Median Bias, RMSE, MAD.

Panel A: True $\gamma$
0.2427 0.1625 0.0018 0.1368 0.0924 0.0078 0.0792 0.0511 0.0024

Panel B: $\hat\gamma_L$
0.2076 0.1438 0.0145 0.1169 0.0778 0.0117 0.0672 0.0455 0.0059

Panel C: $\hat\gamma_{CMS}$
0.2592 0.1725 -0.0021 0.1435 0.0950 -0.0026 0.0826 0.0544 -0.0005

Panel D: $\hat\gamma_{SCMS,4}$
0.1780 0.1255 0.0327 0.1063 0.0703 0.0106 0.0629 0.0410 -0.0139

Panel E: $\hat\gamma_{SCMS,2}$
0.1765 0.1242 0.0361 0.1071 0.0721 0.0146 0.0659 0.0416 -0.0098

From Table II we see that the proposed estimator is less biased than the "naive" OLS estimator both with and without the asymptotic bias correction. Furthermore, this bias decreases with sample size since the estimator is consistent, at a rate slower than $n^{-1/2}$, as predicted by the asymptotic theory. This may be seen by the fact that the RMSE decreases by less than half when we quadruple the sample size. Notice that the results do not change substantially whether we use the true $\gamma$ or we estimate it for the construction of the kernel weights, except when the smoothed maximum score approach is used. In the latter case (Panels D and E), the estimator is significantly more biased, although its RMSE is lower than in the other panels. This may be due to the relatively large finite sample bias of the smoothed maximum score estimates (see also Horowitz (1992)), which may be thought of as increasing the effective window width used in the estimation of $\beta$. Furthermore, we notice that the results are very similar when $\gamma$ is estimated at the same rate as $\beta$ (Panel E) relative to the case where it is estimated faster than $\beta$ (Panel D). Comparing the right and left sides of Table II, we see that the asymptotic bias correction does decrease the estimated (mean and median) bias of the estimator; it invariably, however, increases its variability.

15 In the construction of the kernel weights of both the infeasible estimator $\hat\beta_n$ of Panel A and the feasible estimators of Panels B-E, the norm of $\gamma$ is set equal to one so that the results across panels are comparable.

16 The CMS estimates are computed by maximizing the objective function $(1/n)\sum_{i=1}^{n}\Delta d_i\,1\{\Delta w_{i1}g_1 + \Delta w_{i2}g_2 \ge 0\}$ (see also equation (7) in Manski (1987)) over $g_1 = \sin(g)$ and $g_2 = \cos(g)$ with $g$ ranging in a 2,000-point equispaced grid from 0 to $2\pi$.

17 The SCMS estimates are computed by maximizing

over all $g \in \mathbb{R}^2$ that have $|g_1| = 1$ and $g_2$ in a compact subset of $\mathbb{R}$ by the method of fast simulated annealing. Joel Horowitz kindly provided the optimization routine. In Panel D, we set $L(v) = K_4(v)$ of Horowitz (1992, page 516), which implies that the estimator, denoted by $\hat\gamma_{SCMS,4}$, converges in distribution at rate $n^{-4/9}$ (faster than the rate of $\hat\beta_n$, which in the case of a second order kernel is $n^{-2/5}$), so that the asymptotic theory of Section 3.1 is valid. In Panel E, we use $L(v) = \Phi(v)$ where $\Phi$ is the standard normal cumulative distribution function. In this case the estimator, denoted by $\hat\gamma_{SCMS,2}$, converges in distribution at the same rate as $\hat\beta_n$, $n^{-2/5}$. The SCMS estimates used in the construction of the kernel weights are corrected for asymptotic bias using $\delta = 0.1$ and are obtained by the two stage "plug-in" procedure, where in the first stage the bandwidth sequence is $\sigma_n = 0.5\cdot n^{-1/(2m+1)}$ ($m$ = 2 or 4), while the second stage uses the estimated optimal constant in the construction of the bandwidth. For details, see Horowitz (1992) and Kyriazidou (1994).
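The grid search used for the conditional maximum score (CMS) estimates, as described in the footnote on their computation, can be sketched as follows for the two-regressor case. The function name is hypothetical, and whether the original grid included both endpoints of $[0, 2\pi]$ is not stated, so the use of an inclusive equispaced grid here is an assumption.

```python
import numpy as np


def conditional_max_score(dd, dw, n_grid=2000):
    """Grid-search CMS estimate of gamma (up to scale) with two regressors.

    dd : (n,) differences d_i2 - d_i1, taking values in {-1, 0, 1}
    dw : (n, 2) differences of the selection-equation regressors
    Returns the unit-norm direction (sin g, cos g) that maximizes
    (1/n) * sum_i dd_i * 1{dw_i1 * g1 + dw_i2 * g2 >= 0}, and its score.
    """
    dd = np.asarray(dd, dtype=float)
    dw = np.asarray(dw, dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, n_grid)
    directions = np.column_stack([np.sin(angles), np.cos(angles)])
    index = dw @ directions.T                       # shape (n, n_grid)
    scores = (dd[:, None] * (index >= 0)).mean(axis=0)
    best = int(np.argmax(scores))                   # first maximizer on the grid
    return directions[best], scores[best]
```

Because the objective is a step function, ties among grid points are common; this sketch simply keeps the first maximizer, which is one of several defensible conventions.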

In Table III we investigate the sensitivity of the (infeasible) estimator with respect to the choice of the bandwidth constant and the choice of the kernel function. Panels A and B present the results for $\hat\beta_n$ and $\hat{\hat\beta}_n$ using a bandwidth constant $h$ equal to 0.5 and 3, respectively, and a second order bias-reducing kernel. As expected the estimator's bias increases as we increase the bandwidth while the RMSE decreases. The increase in both mean and median bias appears quite large, which indicates that point estimates may be quite sensitive to the choice of bandwidth. In order to give a sense of the precision with which these biases are estimated, we provide at the bottom of Table III their estimated standard errors for the two sets of experiments that use 0.5 and 3 as bandwidth constant (Panels A and B).$^{18}$
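The standard error of a median bias estimate depends on the density of the estimator at its median, which the footnote on these standard errors says is estimated with a normal kernel and a rule-of-thumb bandwidth. A hedged sketch follows; it assumes the rule of thumb is the Gaussian reference bandwidth $1.06\,\hat\sigma\,m^{-1/5}$ and uses the usual large-sample formula $1/(2\sqrt{m}\,\hat f(\text{median}))$, where $m$ is the number of replications. The function name is illustrative.

```python
import numpy as np


def median_standard_error(draws):
    """Approximate standard error of the sample median of `draws`.

    The density at the median is estimated with a normal kernel and the
    bandwidth 1.06 * std * m**(-1/5) (assumed form of the rule of thumb).
    """
    draws = np.asarray(draws, dtype=float)
    m = draws.size
    bandwidth = 1.06 * draws.std(ddof=1) * m ** (-0.2)
    median = np.median(draws)
    u = (median - draws) / bandwidth
    density = np.mean(np.exp(-0.5 * u ** 2)) / (bandwidth * np.sqrt(2.0 * np.pi))
    return 1.0 / (2.0 * np.sqrt(m) * density)
```

Applied to the replicated estimates from one experiment, the function returns a single number comparable to the standard errors reported beneath Table III.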

In Panels C and D we use a fourth and a sixth order bias-reducing kernel$^{19}$ and set $h_n = n^{-1/(2(r+1)+1)}$ with $r = 3$ and $r = 5$, respectively. A comparison of Panels II-A, III-C, and III-D suggests that the use of higher order kernels speeds up the rate of convergence of the estimator, although there does not appear to be much gain from increasing the order of the kernel from four to six.
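The bias-reducing property of the higher order kernels can be checked directly. The sketch below reads the fourth and sixth order kernels of the accompanying footnote as mixtures of normal-shaped terms, $K(v) = \sum_j \theta_j \exp(-v^2/(2\sigma_j^2))/\sigma_j$, with $(\theta_j, \sigma_j^2)$ equal to $(1.1, 1), (-0.1, 11)$ and $(1.5, 1), (0.1, 9), (-0.6, 4)$; that reading of the garbled footnote, and the names used, are assumptions. Up to a common positive factor, the $2l$-th moment of such a kernel is $\sum_j \theta_j \sigma_j^{2l}$.

```python
import math

# (weight, variance) pairs; an assumed reading of the footnote's kernels.
K4_TERMS = [(1.1, 1.0), (-0.1, 11.0)]
K6_TERMS = [(1.5, 1.0), (0.1, 9.0), (-0.6, 4.0)]


def kernel_value(v, terms):
    """Evaluate sum_j w_j * exp(-v^2 / (2 s2_j)) / sqrt(s2_j)."""
    return sum(w * math.exp(-v * v / (2.0 * s2)) / math.sqrt(s2) for w, s2 in terms)


def scaled_moment(terms, power):
    """sum_j w_j * s2_j**(power/2): proportional to the integral of
    v**power * K(v) for even `power`."""
    return sum(w * s2 ** (power / 2) for w, s2 in terms)


if __name__ == "__main__":
    for name, terms in (("K4", K4_TERMS), ("K6", K6_TERMS)):
        moments = [scaled_moment(terms, p) for p in (0, 2, 4, 6)]
        print(name, [round(m, 10) for m in moments])
```

Under this reading the weights sum to one, the second moment of $K_4$ vanishes, and both the second and fourth moments of $K_6$ vanish, which is what fourth and sixth order require.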

Table IV explores the properties of the proposed estimator when the "plug-in" method described in Section 3.2 is used. The specification is the same as in Table II. Comparing Panels A-D in Tables II and IV, we see that the bias of the estimates increases when the optimal bandwidth constant $\hat h^*$ is used while their RMSE decreases (except in Panel IV-D). This is because, in general, $\hat h^*$ is larger than the initial constant (here the initial bandwidth constant is set equal to one$^{20}$). Table V displays the mean of $\hat h^*$ across 1000 replications for different specifications of the initial constant for the case of the infeasible estimator. We find that the means of the estimates are increasing in the initial bandwidth constant (although this is not necessarily true for all 1000 samples). Our finding may be interpreted by the asymptotic bias term being in general poorly estimated in the particular Monte Carlo design used in this study. Indeed, we find that, for the sample sizes considered here, the estimated asymptotic bias of the estimator decreases with the bandwidth constant $h$ contrary to the asymptotic result of Theorem 1. It thus appears that, for the particular design, small sample bias is more important than asymptotic bias. The sensitivity of the optimal constant estimate $\hat h^*$ to the choice of the initial constant suggests that further research on alternative methods for choosing the bandwidth may be warranted.

18 To estimate the standard errors for the median bias we need to calculate the estimator's density. This is estimated using a normal kernel and the rule-of-thumb bandwidth suggested by Silverman (1986, equation 3.28).

19 The fourth-order kernel is $K_4(v) = 1.1\exp(-v^2/2) - 0.1\exp(-v^2/(2\cdot 11))(1/\sqrt{11})$, and the sixth-order kernel is $K_6(v) = 1.5\exp(-v^2/2) + 0.1\exp(-v^2/(2\cdot 9))(1/\sqrt{9}) - 0.6\exp(-v^2/(2\cdot 4))(1/\sqrt{4})$. See Bierens (1987).

20 We chose the initial $h$ equal to one as the mean squared error of the distribution of the (infeasible) estimator in the 1000 replications was found to be minimized in that neighborhood when a rough search over a 10-point grid from 0.5 to 10 was performed for a sample size $n$ = 100,000.

TABLE III

Column groups: $\hat\beta_n$ (Without Asymptotic Bias Correction) and $\hat{\hat\beta}_n$ (With Asymptotic Bias Correction); within each group: Mean Bias, Median Bias, RMSE, MAD.

Panel A: $K(v) = \phi(v)$, $h_n = 0.5\cdot n^{-1/5}$
0.0040 0.3463 0.2140 -0.0017 0.0065 0.0064 0.1930 0.1308 0.0053 0.0023 0.0002 0.1119 0.0752 -0.0005 -0.0014

Panel B: $K(v) = \phi(v)$, $h_n = 3\cdot n^{-1/5}$
0.0631 0.1550 0.1097 0.0542 0.0566 0.0459 0.0933 0.0626 0.0435 0.0426 0.0351 0.0565 0.0418 0.0316 0.0321

Panel C: $K(v) = K_4(v)$, $h_n = n^{-1/9}$
0.0246 0.1966 0.1390 0.0080 0.0121 0.0159 0.1067 0.0723 0.0099 0.0003 0.0159 0.0582 0.0397 0.0051 0.0054

Panel D: $K(v) = K_6(v)$, $h_n = n^{-1/13}$
0.0269 0.1973 0.1362 0.0002 0.0030 0.0144 0.1041 0.0719 0.0032 -0.0031 0.0170 0.0560 0.0391 -0.0006 -0.0002

a The estimated standard errors of the mean bias estimates for $n$ = 250, 1000, and 4000 are 0.0110, 0.0061, 0.0035 for Panel A, and 0.0045, 0.0026, and 0.0014 for Panel B, respectively. The estimated standard errors of the median bias estimates for $n$ = 250, 1000, and 4000 are 0.0136, 0.0077, and 0.0044 for Panel A, and 0.0059, 0.0033, and 0.0018 for Panel B, respectively.

TABLE IV
FINITE SAMPLE PROPERTIES OF $\hat\beta_n$ AND $\hat{\hat\beta}_n$: $h_n = \hat h^*\cdot n^{-1/5}$, INITIAL $h = 1$, $K(v) = \phi(v)$

Column groups: $\hat\beta_n$ (Without Asymptotic Bias Correction) and $\hat{\hat\beta}_n$ (With Asymptotic Bias Correction); within each group: Mean Bias, Median Bias, RMSE, MAD.

Panel A: True $\gamma$
0.1919 0.1287 0.0261 0.1053 0.0700 0.0330 0.0653 0.0507 0.0273

Panel B: $\hat\gamma_L$
0.1703 0.1191 0.0454 0.1000 0.0693 0.0465 0.0654 0.0504 0.0385

Panel C: $\hat\gamma_{CMS}$
0.2117 0.1329 0.0221 0.1114 0.0718 0.0246 0.0671 0.0507 0.0246

Panel D: $\hat\gamma_{SCMS,4}$
0.1543 0.1086 0.0705 0.1004 0.0740 0.0604 0.0658 0.0488 0.0401

TABLE V. Columns: Initial $h = 0.5$, Initial $h = 1$, Initial $h = 2$, Initial $h = 3$.

We next investigate whether normality might be a good approximation to the finite sample distribution of the proposed estimator. In Figure 1 we plot the quantiles of $\hat\beta_n$ against those of a normal random variable with the same mean and variance as the sample mean and sample variance of $\hat\beta_n$. Such quantile-quantile plots are provided for different sample sizes, and for the true and the estimated values of $\gamma$, using the specification of Table II (that is, using a second order kernel and $h_n = n^{-1/5}$). We find that, for the experimental design used in this study, the small sample distribution of the proposed estimator is well approximated by a normal distribution. The plots for the asymptotic bias-corrected estimator are very similar, albeit displaying a larger dispersion, and are not given here.

FIGURE 1. Quantile-quantile plots of $\hat\beta_n$ against a Normal: $h_n = n^{-1/5}$, $K(v) = \phi(v)$. Panels 1a-1c: True $\gamma$. Note: Figures 1a, 1d, 1g: $n$ = 250. Figures 1b, 1e, 1h: $n$ = 1000. Figures 1c, 1f, 1i: $n$ = 4000.

Finally, we examine the size of "t tests" where the test statistics use the asymptotic covariance matrix estimator proposed in Theorem 2. Specifically, in Table VI we test the null hypothesis that $\beta$ is equal to its true value $\beta_0 = 1$. To this end, we construct t statistics for $\hat\beta_n$ and $\hat{\hat\beta}_n$ for the specification of Table II (that is, using a second order kernel and $h_n = n^{-1/5}$). Standard errors are constructed using the estimator given by equation (3.2.2). The table presents the fraction of samples for which the null hypothesis is rejected at the 1, 5, 10, and 20 percent statistical significance level. We find that the actual levels of the tests are not far from the nominal levels, especially for larger sample sizes, and that they are closer for the estimates without the asymptotic bias correction. Note that, although we report the results of the t tests for $\hat\beta_n$ using Manski's CMS estimator in the first step (Panel VI-C), the standard errors calculated for the two-step estimator of the main equation are only heuristic, since as discussed in Section 3.2 the asymptotic normality of $\hat\beta_n$ (and $\hat{\hat\beta}_n$) does not obtain in this case due to the slow rate of convergence of $\hat\gamma_{CMS}$. However, the levels of the tests even in this case are reasonable. Alternatively, we could have used bootstrap standard errors.


TABLE VI
SIZE OF t TESTS USING $\hat\beta_n$ AND $\hat{\hat\beta}_n$: $h_n = n^{-1/5}$, $K(v) = \phi(v)$

Column groups: $\hat\beta_n$ (Without Asymptotic Bias Correction) and $\hat{\hat\beta}_n$ (With Asymptotic Bias Correction); within each group, nominal levels: 0.01, 0.05, 0.10, 0.20.

Panel A: True $\gamma$
0.1610 0.2530 0.0590
0.1240 0.2180 0.0260
0.1120 0.2260 0.0210

Panel B: $\hat\gamma_L$
0.1580 0.2680 0.0450
0.1160 0.2140 0.0230
0.1140 0.2250 0.0180

Panel C: $\hat\gamma_{CMS}$
0.1600 0.2720 0.0610
0.1170 0.2160 0.0350
0.1180 0.2390 0.0240

Panel D: $\hat\gamma_{SCMS,4}$
0.1430 0.2570 0.0280
0.1220 0.2250 0.0190
0.1230 0.2430 0.0250



This paper proposed estimators for a sample selection model from panel data with individual-specific effects. We developed a two-step estimation procedure for the parameters of the regression equation of interest, which exploits a conditional exchangeability assumption on the errors to "difference out" both the unobservable individual effect and the sample selection effect, in a manner similar to the "fixed-effects" approach taken in linear panel data models. The Monte Carlo results indicate that the estimator may work well in practice with sufficiently large data sets. However, it is quite sensitive to the choice of the bandwidth parameter, which suggests that further research on this issue may be warranted. Two more issues will also be left for future investigation:

First, notice that the exchangeability assumption (Assumption R1) underlying the proposed estimator implies a conditional symmetry restriction for the first-differenced errors of the main equation, which could be used to develop a Least Absolute Deviations-type estimator. This estimator might then be combined optimally with the Least-Squares-type estimator proposed in this paper for efficiency considerations. Furthermore, LAD estimators might be preferable in the case of heavy-tailed distributions, but they do not have closed-form solutions and their asymptotic properties are more difficult to derive.

Second, although the analysis rested on the strict exogeneity of the explanatory variables in both equations, it is possible to allow for lagged endogenous variables in the set of regressors. Honoré and Kyriazidou (1997) propose estimators for discrete choice panel data models with exogenous regressors, individual effects, and lags of the dependent discrete variable. Kyriazidou (1997) proposes estimators for dynamic sample selection models where the latent equations contain strictly exogenous regressors, individual effects, and lags of the dependent endogenous variables.

Department of Economics, University of Chicago, 1126 E. 59th St., Chicago, Illinois 60637, U.S.A.

Manuscript received May, 1994; final revision received January, 1997.


The proofs of the results in the main text make use of the following two lemmas, which maintain Assumptions R4 and R8 of Section 3.

LEMMA A1: Let $S = (1/n)\sum_{i=1}^{n}(1/h_n)L(W_i/h_n)Z_iW_i^{s}$, $s \ge 0$, where $\{(Z_i, W_i)\}_{i=1}^{n}$ is a random sample from a distribution that has $E(|Z|^2 \mid W) < M < \infty$ for almost all $W$, and the function $L$ satisfies: $\int |v^{s}L(v)|^{2}\,dv < M$. Then, $E(S) = O(h_n^{s})$ and $\operatorname{var}(S) = O(h_n^{2s}/nh_n)$. Thus, for $s \ge 1$, $S \xrightarrow{p} 0$, while for $s = 0$, $S \xrightarrow{p} f_W(0)E(Z \mid W = 0)\int L(v)\,dv$, provided that $E(Z \mid W)$ is continuous at $W = 0$.

PROOF: Random sampling implies that

Under our assumptions and by bounded convergence we obtain:

The stated probability limits then obtain by Chebyshev's theorem.

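LEMMA A2 (Liapounov CLT for double arrays): Let $\sqrt{nh_n}\,S = (1/\sqrt{n})\sum_{i=1}^{n}\xi_{in}$, where $\{\xi_{in}\}$ is an independent sequence of scalar random variables that satisfies: $E(\xi_{in}) = 0$, $\operatorname{var}(\xi_{in}) < \infty$, $\operatorname{var}(\sqrt{nh_n}\,S) \to V < \infty$, and $\sum_{i=1}^{n}E|\xi_{in}/\sqrt{n}|^{2+\delta} \to 0$ for some $\delta \in (0,1)$ as $n \to \infty$. Then $\sqrt{nh_n}\,S \xrightarrow{d} N(0, V)$.

PROOF: See Theorem 7.1.2 and comment on page 209 in Chung (1974).

COROLLARY A1: Let $\xi_{in} = (1/\sqrt{h_n})L(W_i/h_n)Z_i$, where $\{(Z_i, W_i)\}_{i=1}^{n}$ is a random sample from a distribution such that $E(Z \mid W) = 0$ and $E(|Z|^{2+\delta} \mid W) < M < \infty$ for almost all $W$, $E(Z^2 \mid W)$ is continuous at $W = 0$, and the function $L$ satisfies: $\int |L(v)|^{2+\delta}\,dv < \infty$. Then, $\sqrt{nh_n}\,S = (1/\sqrt{n})\sum_{i=1}^{n}\xi_{in} \xrightarrow{d} N(0,\ f_W(0)E(Z^2 \mid W = 0)\int L(v)^2\,dv)$.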


PROOF OF LEMMA 1: (a) Apply Lemma A1 with $Z_i = \Delta x_i^{l}\,\Delta x_i^{j}\,\Phi_i$ $(l, j = 1,\ldots,k)$, $s = 0$, and $L(v) = K(v)$. (b-i) Apply Lemma A2 with $\xi_{in} = c'(1/\sqrt{h_n})K(W_i/h_n)\,\Delta x_i'\,\Delta\varepsilon_i\,\Phi_i$, where $c$ is a $k \times 1$ vector of constants such that $c'c = 1$. (b-ii) Note that, by Assumption R5, $\Delta\lambda_i = \Lambda_iW_i$. Thus, we may write

$$S_{x\lambda} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_n}K\!\left(\frac{W_i}{h_n}\right)\Delta x_i'\,\Lambda_i\,W_i\,\Phi_i.$$

Therefore, $E(S_{x\lambda}) = \int (1/h_n)K(W/h_n)\,W\,g(W)\,dW$, where $g(W) \equiv E(\Delta x'\Lambda\Phi \mid W)f_W(W)$ is by assumption $r$ times continuously differentiable, with derivatives that are bounded on the support of $W$, and has $g(0) < \infty$. A Taylor series expansion of $g(\cdot)$ around 0, and a change of variables $W = vh_n$ lead to



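for some $c^*$ lying between 0 and $W$, since $\int v^{j}K(v)\,dv = 0$ for $j = 1,\ldots,r$. Therefore, by bounded convergence,

since under our assumptions $\int |v^{r+1}K(v)|\,dv < \infty$, and by assumption, $\sqrt{nh_n}\,h_n^{r+1} \to \tilde h$. Furthermore, by Lemma A1, $\operatorname{var}(S_{x\lambda}) = O(h_n^{2}/nh_n)$, which implies that $\operatorname{var}(\sqrt{nh_n}\,S_{x\lambda}) = O(nh_n)\,O(h_n/n) = O(h_n^{2}) = o(1)$. Hence, $\sqrt{nh_n}\,S_{x\lambda} \xrightarrow{p} \tilde h\,\Sigma_{x\lambda}$.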

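(c-i) Note that,

while by Lemma A1, $\operatorname{var}(S_{x\varepsilon}) = O((nh_n)^{-1})$. Therefore, $E(h_n^{-(r+1)}S_{x\varepsilon}) = 0$ and $\operatorname{var}(h_n^{-(r+1)}S_{x\varepsilon}) = O(h_n^{-2(r+1)}(nh_n)^{-1}) = O((nh_n^{2(r+1)+1})^{-1}) = o(1)$, since by assumption $\sqrt{nh_n}\,h_n^{r+1} \to \infty$ as $n \to \infty$. Thus, $h_n^{-(r+1)}S_{x\varepsilon} \xrightarrow{p} 0$.

(c-ii) From part (b-ii) above,

since $nh_n^{2(r+1)+1} \to \infty$ implies that $nh_n^{2r+1} \to \infty$. Thus, $h_n^{-(r+1)}S_{x\lambda} \xrightarrow{p} \Sigma_{x\lambda}$.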

REMARKS: (i) In what follows, $M$ stands for a generic constant which is the upper bound of certain quantities.

(ii) We define the matrix norm $\|A\| = \sqrt{\operatorname{trace}(A'A)}$.

(iii) In the Taylor series expansions, $c_i^*$ stands for a generic value between $W_i$ and $\hat W_i$.

PROOF OF LEMMA 2: (a) By a Taylor series expansion, we can write


since by assumption $\mu < p/2$, $|K'(v)| < \infty$, and $E(\|\Delta w\|\,\|\Delta x\|^{2}) < \infty$.

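(b-i) Let $\hat S^{l}_{x\varepsilon}$ and $S^{l}_{x\varepsilon}$ denote the $l$th $(l = 1,\ldots,k)$ elements of $\hat S_{x\varepsilon}$ and $S_{x\varepsilon}$, respectively. A third order Taylor series expansion yields:

$$\sqrt{nh_n}\,(\hat S^{l}_{x\varepsilon} - S^{l}_{x\varepsilon}) = A_1\cdot\frac{1}{h_n}(\hat\gamma_n - \gamma) + \frac{1}{2h_n}(\hat\gamma_n - \gamma)'\,A_2\,\frac{1}{h_n}(\hat\gamma_n - \gamma) + A_3,$$

$$A_3 = \frac{1}{6\sqrt{n}}\sum_{i=1}^{n}K'''\!\left(\frac{c_i^*}{h_n}\right)\Delta x_i^{l}\,\Delta\varepsilon_i\,\Phi_i\,\bigl(\Delta w_i(\hat\gamma_n - \gamma)\bigr)^{3}\,h_n^{-7/2}.$$

We will show that $A_1$ and $A_2$ are $O_p(1)$, while $A_3 = o_p(1)$. The desired result will then follow from the fact that $\mu < p/2$ implies that $h_n^{-1}(\hat\gamma_n - \gamma) = O_p(n^{\mu - p}) = o_p(1)$. Let $A_1^{j}$ be the $j$th element $(j = 1,\ldots,q)$ of the $(1 \times q)$ vector $A_1$. Write $A_1^{j} = (1/\sqrt{n})\sum_{i=1}^{n}\xi_{in}$, where $\xi_{in} = (1/\sqrt{h_n})K'(W_i/h_n)\,\Delta x_i^{l}\,\Delta\varepsilon_i\,\Phi_i\,\Delta w_i^{j}$. Note that $\{\xi_{in}\}_{i=1}^{n}$ is a sequence of scalar random variables that satisfies the requirements of Lemma A2, since under our assumptions, $E(|\Delta x^{l}\,\Delta\varepsilon\,\Phi\,\Delta w^{j}|^{2+\delta} \mid W) < \infty$ for almost all $W$, while $|K'(v)| < \infty$ and $\int |K'(v)|\,dv < \infty$ imply that $\int |K'(v)|^{2+\delta}\,dv < \infty$. Therefore, $A_1$ is bounded in probability.

Similarly, we can show that the $jm$th element $(j, m = 1,\ldots,q)$ of the $(q \times q)$ matrix $A_2$ is also bounded in probability, by defining $\xi_{in} = (1/\sqrt{h_n})K''(W_i/h_n)\,\Delta x_i^{l}\,\Delta\varepsilon_i\,\Phi_i\,\Delta w_i^{j}\,\Delta w_i^{m}$, since $E(|\Delta x^{l}\,\Delta\varepsilon\,\Phi\,\Delta w^{j}\,\Delta w^{m}|^{2+\delta} \mid W) < \infty$ for almost all $W$, and the boundedness and absolute integrability of $K''(v)$ implies that $\int |K''(v)|^{2+\delta}\,dv < \infty$.

Next, observe that, since $p > 2/5$ and $\mu < p/2$ imply that $(1/2) + (7\mu/2) - 3p < 0$,

$$|A_3| \le \frac{M}{6}\,\sqrt{n}\,h_n^{-7/2}\,\|\hat\gamma_n - \gamma\|^{3}\,\frac{1}{n}\sum_{i=1}^{n}\|\Delta w_i\|^{3}\,|\Delta x_i^{l}\,\Delta\varepsilon_i| = O_p\bigl(n^{1/2 + 7\mu/2 - 3p}\bigr) = o_p(1).$$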

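(b-ii) Let $\hat S^{l}_{x\lambda}$ and $S^{l}_{x\lambda}$ denote the $l$th $(l = 1,\ldots,k)$ elements of $\hat S_{x\lambda}$ and $S_{x\lambda}$, respectively. A third order Taylor series expansion yields:

$$\sqrt{nh_n}\,(\hat S^{l}_{x\lambda} - S^{l}_{x\lambda}) = B_1\cdot\sqrt{nh_n}\,(\hat\gamma_n - \gamma) + \frac{1}{2h_n}(\hat\gamma_n - \gamma)'\,B_2\,\sqrt{nh_n}\,(\hat\gamma_n - \gamma) + B_3,$$

$$B_3 = \frac{1}{6\sqrt{n}}\sum_{i=1}^{n}K'''\!\left(\frac{c_i^*}{h_n}\right)\Delta x_i^{l}\,\Delta\lambda_i\,\Phi_i\,\bigl(\Delta w_i(\hat\gamma_n - \gamma)\bigr)^{3}\,h_n^{-7/2}.$$

We will show that $B_1$ and $B_2$ are $O_p(1)$, while $B_3 = o_p(1)$. The desired result will then follow from the fact that $1 - 2p < \mu < p/2$ implies that $h_n^{-1}(\hat\gamma_n - \gamma) = O_p(n^{\mu - p}) = o_p(1)$, and $\sqrt{nh_n}\,(\hat\gamma_n - \gamma) = O_p(n^{1/2 - \mu/2 - p}) = o_p(1)$.

Note that $B_1$ is a $(1 \times q)$ row-vector. For its $j$th element, application of Lemma A1 with $s = 1$, $Z_i = \Delta x_i^{l}\,\Lambda_i\,\Phi_i\,\Delta w_i^{j}$, and $L(v) = K'(v)$, yields $E(B_1^{j}) = (1/h_n)\cdot O(h_n) = O(1)$ and $\operatorname{var}(B_1^{j}) = (1/h_n^{2})\cdot O(h_n^{2}/nh_n) = o(1)$, since $E(|\Delta x^{l}|^{2}\Lambda^{2}\Phi\,|\Delta w^{j}|^{2} \mid W) < \infty$ for almost all $W$, and $\int |vK'(v)|^{2}\,dv < \infty$.

Similarly, we can show that the $jm$th element $(j, m = 1,\ldots,q)$ of the $(q \times q)$ matrix $B_2$ is also bounded in probability, since $E(|\Delta x^{l}|^{2}\Lambda^{2}\Phi\,|\Delta w^{j}|^{2}|\Delta w^{m}|^{2} \mid W) < \infty$ for almost all $W$, and $\int |vK''(v)|^{2}\,dv < \infty$.

Next, observe that $B_3 = o_p(1)$, since under our assumptions, $(1/2) + (7\mu/2) - 3p < 0$, $\gamma$ lies in a compact set, and $E(\|\Delta x\|\,\|\Delta w\|^{4}) < \infty$.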

(c-i) Note that, with $h_n = h\cdot n^{-\mu}$, the condition $nh_n^{2(r+1)+1} \to \infty$ implies that $\mu < 1/(2(r+1)+1)$. In what follows, we will use the fact that, for $r \ge 1$,

Define $\hat S^{l}_{x\varepsilon}$ and $S^{l}_{x\varepsilon}$ as before. A third order Taylor series expansion yields:

1 In W 11

+-ci-yi(r En.rf(i;i) nxjna,q aw: nw,

24, -(%,-Y) nhn ,=I id-nh, hi;+' h,

1 1 1 1 1

.-(Tn-y) +-(+,,-yl'.A2 . .-(+?, -Y) +A4 = ' h h., 2h, ' 4,


where $A_1$ and $A_2$ are defined as in the proof of part (b-i). As we showed there, both these quantities are bounded in probability for any $h_n$ that satisfies $h_n \to 0$ and $n h_n \to \infty$ as $n$ increases. Furthermore, from (1) above, $h_n^{-1}(\hat\gamma_n - \gamma) = O_p(n^{\mu - p}) = o_p(1)$. Thus, the first two terms of the sum above are $o_p(1)$. Now, by (2),

(c-ii) Let $\hat S^{l}_{x\nu}$ and $S^{l}_{x\nu}$ be defined as before. A third order Taylor series expansion yields:

where $B_1$ and $B_2$ are defined as in the proof of part (b-ii), and as we showed there, they are bounded in probability for any $h_n$ that satisfies $n h_n \to \infty$ as $n$ increases. Thus, the first two terms of the sum above are $o_p(1)$. Furthermore,


REFERENCES

AHN, H., AND J. L. POWELL (1993): "Semiparametric Estimation of Censored Selection Models with a Nonparametric Selection Mechanism," Journal of Econometrics, 58, 3-29.

AMEMIYA, T. (1985): Advanced Econometrics. Cambridge: Harvard University Press.

ANDERSEN, E. (1970): "Asymptotic Properties of Conditional Maximum Likelihood Estimators," Journal of the Royal Statistical Society, Series B, 32, 283-301.

BIERENS, H. J. (1987): "Kernel Estimators of Regression Functions," in Advances in Econometrics: Fifth World Congress, Vol. 1, ed. by T. F. Bewley. Cambridge: Cambridge University Press.

CAVANAGH, C. L. (1987): "Limiting Behavior of Estimators Defined by Optimization," unpublished manuscript.

CHAMBERLAIN, G. (1984): "Panel Data," in Handbook of Econometrics, Volume II, edited by Z. Griliches and M. Intriligator. Amsterdam: North-Holland, Ch. 22.

-(1992): "Binary Response Models for Panel Data: Identification and Information," unpublished manuscript, Department of Economics, Harvard University.

CHARLIER, E., B. MELENBERG, AND A. H. O. VAN SOEST (1995): "A Smoothed Maximum Score Estimator for the Binary Choice Panel Data Model with an Application to Labour Force Participation," Statistica Neerlandica, 49, 324-342.

CHUNG, K. L. (1974): A Course in Probability Theory. New York: Academic Press.

GRONAU, R. (1974): "Wage Comparisons-A Selectivity Bias," Journal of Political Economy, 82, 1110-1144.

HÄRDLE, W. (1990): Applied Nonparametric Regression. Cambridge: Cambridge University Press.

HAUSMAN, J. A., AND D. WISE (1979): "Attrition Bias in Experimental and Panel Data: The Gary Income Maintenance Experiment," Econometrica, 47, 455-473.

HECKMAN, J. J. (1974): "Shadow Prices, Market Wages, and Labor Supply," Econometrica, 42, 679-694.

-(1976): "The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables, and a Simple Estimator for Such Models," Annals of Economic and Social Measurement, 5, 475-492.

-(1979): "Sample Selection Bias as a Specification Error," Econometrica, 47, 153-161.

HONORÉ, B. E. (1992): "Trimmed LAD and Least Squares Estimation of Truncated and Censored Regression Models with Fixed Effects," Econometrica, 60, 533-565.

-(1993): "Orthogonality Conditions for Tobit Models with Fixed Effects and Lagged Dependent Variables," Journal of Econometrics, 59, 35-61.

HONORÉ, B. E., AND E. KYRIAZIDOU (1997): "Panel Data Discrete Choice Models with Lagged Dependent Variables," unpublished manuscript.

HOROWITZ, J. (1992): "A Smoothed Maximum Score Estimator for the Binary Response Model," Econometrica, 60, 505-531.

HSIAO, C. (1986): Analysis of Panel Data. Cambridge: Cambridge University Press.

KIM, J., AND D. POLLARD (1990): "Cube Root Asymptotics," Annals of Statistics, 18, 191-219.

KYRIAZIDOU, E. (1994): "Estimation of a Panel Data Sample Selection Model," unpublished manuscript, Northwestern University.

-(1997): "Estimation of Dynamic Panel Data Sample Selection Models," unpublished manuscript, University of Chicago.

MANSKI, C. (1975): "Maximum Score Estimation of the Stochastic Utility Model of Choice," Journal of Econometrics, 3, 205-228.

-(1985): "Semiparametric Analysis of Discrete Response: Asymptotic Properties of Maximum Score Estimation," Journal of Econometrics, 27, 313-334.

-(1987): "Semiparametric Analysis of Random Effects Linear Models from Binary Panel Data," Econometrica, 55, 357-362.

NIJMAN, T., AND M. VERBEEK (1992): "Nonresponse in Panel Data: The Impact on Estimates of a Life Cycle Consumption Function," Journal of Applied Econometrics, 7, 243-257.

POWELL, J. L. (1987): "Semiparametric Estimation of Bivariate Latent Variable Models," Working Paper No. 8704, Social Systems Research Institute, University of Wisconsin-Madison.

-(1994): "Estimation of Semiparametric Models," Handbook of Econometrics, Vol. 4, 2444-2521.

RASCH, G. (1960): Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Denmarks Paedagogiske Institut.

-(1961): "On General Laws and the Meaning of Measurement in Psychology," Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 4. Berkeley and Los Angeles: University of California Press.

ROSHOLM, M., AND N. SMITH (1994): "The Danish Gender Wage Gap in the 1980s: A Panel Data Study," Working Paper 94-2, Center for Labour Market and Social Research, University of Aarhus and Aarhus School of Business.

SILVERMAN, B. W. (1986): Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall.

VERBEEK, M., AND T. NIJMAN (1992): "Testing for Selectivity Bias in Panel Data Models," International Economic Review, 33, 681-703.

WOOLDRIDGE, J. M. (1995): "Selection Corrections for Panel Data Models under Conditional Mean Independence Assumptions," Journal of Econometrics, 68, 115-132.
