In statistics, **errors-in-variables models** or **measurement error models** are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the *attenuation bias*. In non-linear models the direction of the bias is likely to be more complicated.^{[1]}^{[2]}^{[3]}

Consider a simple linear regression model of the form

$$y_t = \alpha + \beta x_t^* + \varepsilon_t, \quad t = 1, \ldots, T,$$

where $x_t^*$ denotes the *true* but unobserved regressor. Instead we observe this value with an error:

$$x_t = x_t^* + \eta_t,$$

where the measurement error $\eta_t$ is assumed to be independent of the true value $x_t^*$.

If the $y_t$′s are simply regressed on the $x_t$′s (see simple linear regression), then the estimator for the slope coefficient is

$$\hat\beta_x = \frac{\frac{1}{T}\sum_{t=1}^{T}(x_t - \bar x)(y_t - \bar y)}{\frac{1}{T}\sum_{t=1}^{T}(x_t - \bar x)^2},$$

which converges as the sample size $T$ increases without bound:

$$\hat\beta_x \xrightarrow{p} \frac{\operatorname{Cov}[x_t, y_t]}{\operatorname{Var}[x_t]} = \frac{\beta\,\sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_\eta} = \frac{\beta}{1 + \sigma^2_\eta / \sigma^2_{x^*}}.$$

Variances are non-negative, so that in the limit the estimate is smaller in magnitude than the true value of $\beta$, an effect which statisticians call *attenuation* or regression dilution.^{[4]} Thus the ‘naïve’ least squares estimator is inconsistent in this setting. However, the estimator $\hat\beta_x$ is a consistent estimator of the parameter required for a best linear predictor of $y$ given the observed $x$: in some applications this may be what is required, rather than an estimate of the ‘true’ regression coefficient, although that would assume that the variance of the errors in observing $x^*$ remains fixed. This follows directly from the result quoted immediately above, and the fact that the regression coefficient relating the $y_t$′s to the actually observed $x_t$′s, in a simple linear regression, is given by

$$\beta_x = \frac{\operatorname{Cov}[x_t, y_t]}{\operatorname{Var}[x_t]}.$$

It is this coefficient, rather than $\beta$, that would be required for constructing a predictor of $y$ based on an observed $x$ which is subject to noise.
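The attenuation effect can be checked numerically. The following is a minimal simulation sketch, not part of the original derivation: the parameter values, the normal distributions, and the function names `ols_slope` and `simulate_attenuation` are all assumptions chosen for illustration. It compares the naïve least squares slope with the limit $\beta\sigma^2_{x^*}/(\sigma^2_{x^*}+\sigma^2_\eta)$.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (xc @ xc))

def simulate_attenuation(n=200_000, alpha=1.0, beta=2.0,
                         sd_latent=1.0, sd_eta=0.5, sd_eps=1.0, seed=0):
    """Simulate y = alpha + beta*x_star + eps with x = x_star + eta observed.

    All distributions are taken to be normal purely for illustration.
    Returns the naive OLS slope and the theoretical attenuated limit.
    """
    rng = np.random.default_rng(seed)
    x_star = rng.normal(0.0, sd_latent, n)           # latent regressor
    y = alpha + beta * x_star + rng.normal(0.0, sd_eps, n)
    x = x_star + rng.normal(0.0, sd_eta, n)          # noisy observation
    naive = ols_slope(x, y)
    limit = beta * sd_latent**2 / (sd_latent**2 + sd_eta**2)
    return naive, limit

naive, limit = simulate_attenuation()
print(f"naive OLS slope: {naive:.3f}, attenuated limit: {limit:.3f}")
```

With these assumed values the limit is below the true slope of 2, and the naïve estimate should land close to the limit rather than to 2.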

It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous^{[5]}). Jerry Hausman sees this as an *iron law of econometrics*: "The magnitude of the estimate is usually smaller than expected."^{[6]}

Usually measurement error models are described using the latent variables approach. If $y$ is the response variable and $x$ are observed values of the regressors, then it is assumed there exist some latent variables $y^*$ and $x^*$ which follow the model's “true” functional relationship $g(\cdot)$, and such that the observed quantities are their noisy observations:

$$\begin{cases} x = x^* + \eta, \\ y = y^* + \varepsilon, \\ y^* = g(x^*, w \mid \theta), \end{cases}$$

where $\theta$ is the model's parameter and $w$ are those regressors which are assumed to be error-free (for example when linear regression contains an intercept, the regressor which corresponds to the constant certainly has no "measurement errors"). Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that corresponding entries in the variance matrix of $\eta$'s are zero.

The variables $y$, $x$, $w$ are all *observed*, meaning that the statistician possesses a data set of $n$ statistical units $\{y_i, x_i, w_i\}_{i=1,\ldots,n}$ which follow the data generating process described above; the latent variables $x^*$, $y^*$, $\varepsilon$, and $\eta$ are not observed, however.

This specification does not encompass all the existing errors-in-variables models. For example in some of them function $g(\cdot)$ may be non-parametric or semi-parametric. Other approaches model the relationship between $y^*$ and $x^*$ as distributional instead of functional, that is they assume that $y^*$ conditionally on $x^*$ follows a certain (usually parametric) distribution.

- The observed variable $x$ may be called the *manifest*, *indicator*, or *proxy* variable.
- The unobserved variable $x^*$ may be called the *latent* or *true* variable. It may be regarded either as an unknown constant (in which case the model is called a *functional model*), or as a random variable (correspondingly a *structural model*).^{[7]}
- The relationship between the measurement error $\eta$ and the latent variable $x^*$ can be modeled in different ways:
  - *Classical errors*: $\eta \perp x^*$, the errors are independent of the latent variable. This is the most common assumption; it implies that the errors are introduced by the measuring device and their magnitude does not depend on the value being measured.
  - *Mean-independence*: $\operatorname{E}[\eta \mid x^*] = 0$, the errors are mean-zero for every value of the latent regressor. This is a less restrictive assumption than the classical one,^{[8]} as it allows for the presence of heteroscedasticity or other effects in the measurement errors.
  - *Berkson's errors*: $\eta \perp x$, the errors are independent of the *observed* regressor *x*.^{[9]} This assumption has very limited applicability. One example is round-off errors: if a person's *age\** is a continuous random variable, whereas the observed *age* is truncated to the next smallest integer, then the truncation error is approximately independent of the observed *age*. Another possibility is the fixed design experiment: if a scientist decides to make a measurement at a certain predetermined moment of time $x$, then the real measurement may occur at some other value $x^*$ (for example due to her finite reaction time), and such measurement error will generally be independent of the "observed" value of the regressor.
  - *Misclassification errors*: a special case used for dummy regressors. If $x^*$ is an indicator of a certain event or condition (such as person is male/female, some medical treatment given/not, etc.), then the measurement error in such a regressor corresponds to incorrect classification, similar to type I and type II errors in statistical testing. In this case the error $\eta$ may take only 3 possible values, and its distribution conditional on $x^*$ is modeled with two parameters: $\alpha = \Pr[\eta = -1 \mid x^* = 1]$ and $\beta = \Pr[\eta = 1 \mid x^* = 0]$. The necessary condition for identification is that $\alpha + \beta < 1$, that is, misclassification should not happen "too often". (This idea can be generalized to discrete variables with more than two possible values.)

Linear errors-in-variables models were studied first, probably because linear models were so widely used and they are easier than non-linear ones. Unlike with standard least squares regression (OLS), extending errors-in-variables regression (EiV) from the simple to the multivariable case is not straightforward.

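The simple linear errors-in-variables model was already presented in the "motivation" section:

$$\begin{cases} y_t = \alpha + \beta x_t^* + \varepsilon_t, \\ x_t = x_t^* + \eta_t, \end{cases}$$

where all variables are scalar. Here $\alpha$ and $\beta$ are the parameters of interest, whereas $\sigma_\varepsilon$ and $\sigma_\eta$, the standard deviations of the error terms, are nuisance parameters.

This model is identifiable in two cases: (1) either the latent regressor $x^*$ is *not* normally distributed, (2) or $x^*$ has normal distribution, but neither $\varepsilon_t$ nor $\eta_t$ is divisible by a normal distribution.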

Before this identifiability result was established, statisticians attempted to apply the maximum likelihood technique by assuming that all variables are normal, and then concluded that the model is not identified. The suggested remedy was to *assume* that some of the parameters of the model are known or can be estimated from an outside source. Such estimation methods include^{[11]}

- Deming regression: assumes that the ratio $\delta = \sigma^2_\varepsilon / \sigma^2_\eta$ is known. This could be appropriate, for example, when errors in $y$ and $x$ are both caused by measurements, and the accuracy of the measuring devices or procedures is known. The case when $\delta = 1$ is also known as orthogonal regression.
- Regression with known reliability ratio $\lambda = \sigma^2_* / (\sigma^2_\eta + \sigma^2_*)$, where $\sigma^2_*$ is the variance of the latent regressor. Such an approach may be applicable, for example, when repeated measurements of the same unit are available, or when the reliability ratio is known from an independent study. In this case the consistent estimate of the slope is equal to the least-squares estimate divided by $\lambda$.
- Regression with known $\sigma^2_\eta$ may occur when the source of the errors in the $x$'s is known and their variance can be calculated. This could include rounding errors, or errors introduced by the measuring device. When $\sigma^2_\eta$ is known we can compute the reliability ratio as $\lambda = (\sigma^2_x - \sigma^2_\eta) / \sigma^2_x$ and reduce the problem to the previous case.
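Two of the corrections listed above can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions: the closed-form Deming slope used here is the usual textbook expression in terms of sample second moments (it is not quoted from this article), the simulated data and parameter values are arbitrary, and the function names are hypothetical.

```python
import numpy as np

def second_moments(x, y):
    """Return the sample moments (s_xx, s_yy, s_xy) of the centred data."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.mean(xc * xc), np.mean(yc * yc), np.mean(xc * yc)

def deming_slope(x, y, delta):
    """Deming slope for a known ratio delta = var(eps) / var(eta)."""
    s_xx, s_yy, s_xy = second_moments(x, y)
    d = s_yy - delta * s_xx
    return float((d + np.sqrt(d * d + 4.0 * delta * s_xy**2)) / (2.0 * s_xy))

def slope_with_known_error_variance(x, y, var_eta):
    """Divide the least-squares slope by lambda = (s_xx - var_eta) / s_xx."""
    s_xx, _, s_xy = second_moments(x, y)
    reliability = (s_xx - var_eta) / s_xx
    return float((s_xy / s_xx) / reliability)

# Simulated data; every numeric value here is an assumption for illustration.
rng = np.random.default_rng(0)
n, alpha, beta = 200_000, 1.0, 2.0
sd_eta, sd_eps = 0.5, 1.0
x_star = rng.normal(0.0, 1.0, n)
x = x_star + rng.normal(0.0, sd_eta, n)
y = alpha + beta * x_star + rng.normal(0.0, sd_eps, n)

b_deming = deming_slope(x, y, delta=sd_eps**2 / sd_eta**2)
b_known = slope_with_known_error_variance(x, y, var_eta=sd_eta**2)
print(f"Deming: {b_deming:.3f}, known error variance: {b_known:.3f}")
```

Both estimators rely on side information (the ratio $\delta$ or the variance $\sigma^2_\eta$) being correct; if the supplied value is wrong, the correction is wrong as well.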

Newer estimation methods that do not assume knowledge of some of the parameters of the model include

- Method of moments: the GMM estimator based on the third- (or higher-) order joint cumulants of observable variables. The slope coefficient can be estimated from^{[12]}

  $$\hat\beta = \frac{\hat K(n_1, n_2 + 1)}{\hat K(n_1 + 1, n_2)}, \quad n_1, n_2 > 0,$$

  where $(n_1, n_2)$ are such that $K(n_1 + 1, n_2)$, the joint cumulant of $(x, y)$, is not zero. In the case when the third central moment of the latent regressor $x^*$ is non-zero, the formula reduces to

  $$\hat\beta = \frac{\frac{1}{T}\sum_{t=1}^{T}(x_t - \bar x)(y_t - \bar y)^2}{\frac{1}{T}\sum_{t=1}^{T}(x_t - \bar x)^2 (y_t - \bar y)}.$$

- Instrumental variables: a regression which requires that certain additional data variables $z$, called *instruments*, are available. These variables should be uncorrelated with the errors in the equation for the dependent (outcome) variable (*valid*), and they should also be correlated (*relevant*) with the true regressors $x^*$. If such variables can be found then the estimator takes the form

  $$\hat\beta = \frac{\frac{1}{T}\sum_{t=1}^{T}(z_t - \bar z)(y_t - \bar y)}{\frac{1}{T}\sum_{t=1}^{T}(z_t - \bar z)(x_t - \bar x)}.$$
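Both estimators in this list are simple ratios of sample moments, which the following sketch illustrates. It rests on several assumptions made only for the example: the latent regressor is drawn from an exponential distribution so that its third central moment is non-zero, the instrument is simulated as a second noisy reading of the latent regressor, and the function names are hypothetical.

```python
import numpy as np

def third_moment_slope(x, y):
    """Ratio of third-order sample moments; needs a skewed latent regressor."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.mean(xc * yc**2) / np.mean(xc**2 * yc))

def iv_slope(x, y, z):
    """Simple instrumental-variables slope with a single instrument z."""
    zc = z - z.mean()
    return float(zc @ (y - y.mean()) / (zc @ (x - x.mean())))

def ols_slope(x, y):
    """Naive least-squares slope, shown for comparison."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

# Simulated data; distributions and parameter values are illustrative only.
rng = np.random.default_rng(1)
n, alpha, beta = 200_000, 1.0, 2.0
x_star = rng.exponential(1.0, n)                  # skewed latent regressor
x = x_star + rng.normal(0.0, 0.5, n)              # mismeasured regressor
y = alpha + beta * x_star + rng.normal(0.0, 1.0, n)
z = x_star + rng.normal(0.0, 1.0, n)              # assumed valid, relevant instrument

b_naive = ols_slope(x, y)
b_mom = third_moment_slope(x, y)
b_iv = iv_slope(x, y, z)
print(f"naive: {b_naive:.3f}, third moments: {b_mom:.3f}, IV: {b_iv:.3f}")
```

The third-moment estimator needs no extra data but becomes unstable when the latent regressor is close to symmetric, whereas the IV estimator depends entirely on the instrument being valid and relevant.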

The multivariable model looks exactly like the simple linear model, only this time *β*, *η*_{t}, *x*_{t} and *x**_{t} are *k×*1 vectors.

In the case when (*ε*_{t},*η*_{t}) is jointly normal, the parameter *β* is not identified if and only if there is a non-singular *k×k* block matrix [*a A*], where *a* is a *k×*1 vector such that *a′x** is distributed normally and independently of *A′x**. In the case when *ε*_{t}, *η*_{t1},..., *η*_{tk} are mutually independent, the parameter *β* is not identified if and only if in addition to the conditions above some of the errors can be written as the sum of two independent variables one of which is normal.^{[13]}

Some of the estimation methods for multivariable linear models are

- Total least squares is an extension of Deming regression to the multivariable setting. When all the $k+1$ components of the vector $(\varepsilon, \eta)$ have equal variances and are independent, this is equivalent to running the orthogonal regression of $y$ on the vector $x$, that is, the regression which minimizes the sum of squared distances between points $(y_t, x_t)$ and the $k$-dimensional hyperplane of "best fit".
- The method of moments estimator^{[14]} can be constructed based on the moment conditions $\operatorname{E}[z_t \cdot (y_t - \alpha - \beta' x_t)] = 0$, where the $(5k+3)$-dimensional vector of instruments $z_t$ is built from products of the components of $x_t$ and $y_t$ (with $\circ$ designating the Hadamard product of matrices), and the variables $x_t$, $y_t$ have been preliminarily de-meaned. The authors of the method suggest using Fuller's modified IV estimator.^{[15]} This method can be extended to use moments higher than the third order, if necessary, and to accommodate variables measured without error.^{[16]}
- The instrumental variables approach requires us to find additional data variables $z_t$ that serve as *instruments* for the mismeasured regressors $x_t$. This method is the simplest from the implementation point of view; however, its disadvantage is that it requires collecting additional data, which may be costly or even impossible. When the instruments can be found, the estimator takes the standard form

  $$\hat\beta = \big(X'Z(Z'Z)^{-1}Z'X\big)^{-1} X'Z(Z'Z)^{-1}Z'y.$$
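Two of the multivariable estimators above can be sketched compactly. This is an illustrative sketch under assumptions: total least squares is computed through the singular value decomposition of the centred data matrix (a common way to obtain the orthogonal-regression hyperplane, not a procedure quoted from this article), the instruments are simulated as extra noisy readings of the latent regressors, and all names and numeric values are hypothetical.

```python
import numpy as np

def total_least_squares(X, y):
    """Orthogonal regression of y on the columns of X via SVD."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    M = np.column_stack([X - x_mean, y - y_mean])
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    v = vt[-1]                       # direction of least variance
    beta = -v[:-1] / v[-1]           # normal vector is proportional to (beta, -1)
    alpha = y_mean - x_mean @ beta
    return alpha, beta

def instrumental_variables(X, y, Z):
    """beta = (X'Z (Z'Z)^-1 Z'X)^-1 X'Z (Z'Z)^-1 Z'y on centred data."""
    Xc, yc, Zc = X - X.mean(axis=0), y - y.mean(), Z - Z.mean(axis=0)
    X_proj = Zc @ np.linalg.solve(Zc.T @ Zc, Zc.T @ Xc)   # projection of X on Z
    return np.linalg.solve(X_proj.T @ Xc, X_proj.T @ yc)

# Simulated data; equal error variances are assumed so that TLS applies.
rng = np.random.default_rng(2)
n, k, sd = 100_000, 2, 0.5
beta_true = np.array([1.5, -0.7])
X_star = rng.normal(0.0, 1.0, (n, k))
X = X_star + rng.normal(0.0, sd, (n, k))
y = 0.3 + X_star @ beta_true + rng.normal(0.0, sd, n)
Z = X_star + rng.normal(0.0, 1.0, (n, k))          # assumed instruments

alpha_tls, beta_tls = total_least_squares(X, y)
beta_iv = instrumental_variables(X, y, Z)
print("TLS:", np.round(beta_tls, 3), "IV:", np.round(beta_iv, 3))
```

The TLS result is only meaningful under the equal-variance assumption stated in the list; the IV result does not need it but requires the additional data.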

A generic non-linear measurement error model takes the form

$$\begin{cases} y_t = g(x_t^*) + \varepsilon_t, \\ x_t = x_t^* + \eta_t. \end{cases}$$

Here function $g$ can be either parametric or non-parametric. When function $g$ is parametric it will be written as $g(x^*, \beta)$.

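For a general vector-valued regressor $x^*$ the conditions for model identifiability are not known. However, in the case of scalar $x^*$ the model is identified unless the function $g$ is of the "log-exponential" form^{[17]}

$$g(x^*) = a + b \ln\big(e^{c x^*} + d\big)$$

and the latent regressor $x^*$ has density

$$f_{x^*}(x) = \begin{cases} A e^{-B e^{Cx} + CDx}\,(e^{Cx} + E)^{-F}, & \text{if } d > 0, \\ A e^{-Bx^2 + Cx}, & \text{if } d = 0, \end{cases}$$

where constants $A, B, C, D, E, F$ may depend on $a, b, c, d$.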

Despite this optimistic result, as of now no methods exist for estimating non-linear errors-in-variables models without any extraneous information. However there are several techniques which make use of some additional data: either the instrumental variables, or repeated observations.

**Newey's simulated moments method**^{[18]} for parametric models requires that there is an additional set of observed *predictor variables* $z_t$, such that the true regressor can be expressed as

$$x_t^* = \pi_0' z_t + \sigma_0 \zeta_t,$$

where $\pi_0$ and $\sigma_0$ are (unknown) constant matrices, and $\zeta_t \perp z_t$. The coefficient $\pi_0$ can be estimated using standard least squares regression of $x$ on $z$. The distribution of $\zeta_t$ is unknown; however, we can model it as belonging to a flexible parametric family, the Edgeworth series:

$$f_\zeta(v; \gamma) = \phi(v) \sum_{j=1}^{J} \gamma_j v^j,$$

where $\phi$ is the standard normal distribution.

Simulated moments can be computed using the importance sampling algorithm: first we generate several random variables $\{v_{ts} \sim \phi,\ s = 1, \ldots, S,\ t = 1, \ldots, T\}$ from the standard normal distribution, then we compute the moments at the $t$-th observation as

$$m_t(\theta) = A(z_t) \frac{1}{S} \sum_{s=1}^{S} H(x_t, y_t, z_t, v_{ts}; \theta) \sum_{j=1}^{J} \gamma_j v_{ts}^j,$$

where $\theta = (\beta, \sigma, \gamma)$, $A$ is just some function of the instrumental variables $z$, and $H$ is a two-component vector of moments. With the moment functions $m_t$ one can apply the standard GMM technique to estimate the unknown parameter $\theta$.

In this approach two (or maybe more) repeated observations of the regressor $x^*$ are available. Both observations contain their own measurement errors; however, those errors are required to be independent:

$$\begin{cases} x_{1t} = x_t^* + \eta_{1t}, \\ x_{2t} = x_t^* + \eta_{2t}, \end{cases}$$

where *x** ⊥ *η*_{1} ⊥ *η*_{2}. Variables *η*_{1}, *η*_{2} need not be identically distributed (although if they are efficiency of the estimator can be slightly improved). With only these two observations it is possible to consistently estimate the density function of *x** using Kotlarski's deconvolution technique.^{[19]}
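The deconvolution idea can be illustrated numerically for a scalar regressor. The sketch below is an assumption-laden illustration rather than the cited procedure: it uses the identity $\varphi_{x^*}(u) = \exp \int_0^u i\,\operatorname{E}[x_2 e^{ivx_1}] / \operatorname{E}[e^{ivx_1}]\,dv$ (one common statement of Kotlarski's result, valid when the second error has zero mean given $x^*$ and the first error), replaces expectations by sample means, and integrates with the trapezoidal rule. The function name `kotlarski_cf`, the grid, and the simulated normal data are all hypothetical choices.

```python
import numpy as np

def kotlarski_cf(x1, x2, u_grid):
    """Estimate the characteristic function of x* on u_grid.

    u_grid must be increasing and start at 0. Expectations are replaced
    by sample means and the integral by the trapezoidal rule.
    """
    phase = np.exp(1j * np.outer(u_grid, x1))     # shape (len(u_grid), n)
    numerator = (phase * x2).mean(axis=1)         # estimate of E[x2 exp(i v x1)]
    denominator = phase.mean(axis=1)              # estimate of E[exp(i v x1)]
    integrand = 1j * numerator / denominator
    steps = np.diff(u_grid)
    pieces = 0.5 * steps * (integrand[1:] + integrand[:-1])
    integral = np.concatenate([[0.0], np.cumsum(pieces)])
    return np.exp(integral)

# Simulated repeated measurements; x* is standard normal so that the true
# characteristic function exp(-u^2 / 2) is available for comparison.
rng = np.random.default_rng(3)
n = 50_000
x_star = rng.normal(0.0, 1.0, n)
x1 = x_star + rng.normal(0.0, 0.5, n)
x2 = x_star + rng.normal(0.0, 0.5, n)

u = np.linspace(0.0, 1.5, 31)
cf_hat = kotlarski_cf(x1, x2, u)
cf_true = np.exp(-u**2 / 2)
max_err = float(np.max(np.abs(cf_hat - cf_true)))
print(f"largest deviation from the true characteristic function: {max_err:.4f}")
```

The grid is kept short on purpose: the denominator shrinks as $u$ grows, so sampling noise is amplified at high frequencies, which is why practical implementations trim the range before inverting to a density.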

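**Li's conditional density method** for parametric models.^{[20]} The regression equation can be written in terms of the observable variables as

$$\operatorname{E}[y_t \mid x_t] = \int g(x_t^*, \beta)\, f_{x^*|x}(x_t^* \mid x_t)\, dx_t^*,$$

where it would be possible to compute the integral if we knew the conditional density function $f_{x^*|x}$. If this function could be known or estimated, then the problem turns into standard non-linear regression, which can be estimated for example using the NLLS method.

Assuming for simplicity that $\eta_1$, $\eta_2$ are identically distributed, this conditional density can be computed as

$$\hat f_{x^*|x}(x^* \mid x) = \frac{\hat f_{x^*}(x^*)}{\hat f_x(x)} \prod_{j=1}^{k} \hat f_{\eta_j}\big(x_j - x_j^*\big),$$

where with slight abuse of notation $x_j$ denotes the $j$-th component of a vector.

All densities in this formula can be estimated using inversion of the empirical characteristic functions. In particular,

$$\hat\varphi_{\eta_j}(v) = \frac{\hat\varphi_{x_j}(v, 0)}{\hat\varphi_{x_j^*}(v)}, \quad \text{where } \hat\varphi_{x_j}(v_1, v_2) = \frac{1}{T}\sum_{t=1}^{T} e^{i v_1 x_{1tj} + i v_2 x_{2tj}}, \quad \hat\varphi_{x_j^*}(v) = \exp \int_0^v \frac{\partial \hat\varphi_{x_j}(0, v_2) / \partial v_1}{\hat\varphi_{x_j}(0, v_2)}\, dv_2,$$

$$\hat\varphi_x(u) = \frac{1}{2T}\sum_{t=1}^{T} \Big(e^{i u' x_{1t}} + e^{i u' x_{2t}}\Big), \quad \hat\varphi_{x^*}(u) = \frac{\hat\varphi_x(u)}{\prod_{j=1}^{k} \hat\varphi_{\eta_j}(u_j)}.$$

In order to invert these characteristic functions one has to apply the inverse Fourier transform, with a trimming parameter $C$ needed to ensure the numerical stability. For example:

$$\hat f_x(x) = \frac{1}{(2\pi)^k} \int_{-C}^{C} \cdots \int_{-C}^{C} e^{-i u' x}\, \hat\varphi_x(u)\, du.$$

**Schennach's estimator** for a parametric linear-in-parameters nonlinear-in-variables model.^{[21]} This is a model of the form

$$\begin{cases} y_t = \sum_{j=1}^{k} \beta_j g_j(x_t^*) + \sum_{j=1}^{\ell} \beta_{k+j} w_{jt} + \varepsilon_t, \\ x_{1t} = x_t^* + \eta_{1t}, \\ x_{2t} = x_t^* + \eta_{2t}, \end{cases}$$

where $w_t$ represents variables measured without errors. The regressor $x^*$ here is scalar (the method can be extended to the case of vector $x^*$ as well).

If not for the measurement errors, this would have been a standard linear model with the estimator

$$\hat\beta = \big(\hat{\operatorname{E}}[\xi_t \xi_t']\big)^{-1} \hat{\operatorname{E}}[\xi_t y_t],$$

where

$$\xi_t' = \big(g_1(x_t^*), \ldots, g_k(x_t^*), w_{1t}, \ldots, w_{\ell t}\big).$$

It turns out that all the expected values in this formula are estimable using the same deconvolution trick. In particular, for a generic observable $w_t$ (which could be 1, $w_{1t}$, …, $w_{\ell t}$, or $y_t$) and some function $h$ (which could represent any $g_j$ or $g_i g_j$) we have

$$\operatorname{E}[w_t h(x_t^*)] = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_h(-u)\, \psi_w(u)\, du,$$

where $\varphi_h$ is the Fourier transform of $h(x^*)$, but using the same convention as for the characteristic functions,

$$\varphi_h(u) = \int e^{iux} h(x)\, dx,$$

and

$$\psi_w(u) = \operatorname{E}[w_t e^{iux^*}] = \frac{\operatorname{E}[w_t e^{iux_{1t}}]}{\operatorname{E}[e^{iux_{1t}}]} \exp \int_0^u i\, \frac{\operatorname{E}[x_{2t} e^{ivx_{1t}}]}{\operatorname{E}[e^{ivx_{1t}}]}\, dv.$$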

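**Schennach's estimator** for a nonparametric model.^{[22]} The standard Nadaraya–Watson estimator for a nonparametric model takes the form

$$\hat g(x) = \frac{\hat{\operatorname{E}}[y_t K_h(x_t^* - x)]}{\hat{\operatorname{E}}[K_h(x_t^* - x)]},$$

for a suitable choice of the kernel $K$ and the bandwidth $h$. Both expectations here can be estimated using the same technique as in the previous method.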