
In statistics, binomial regression is a regression analysis technique in which the response (often referred to as Y) has a binomial distribution: it is the number of successes in a series of $n$ independent Bernoulli trials, where each trial has probability of success $p$.^[1] In binomial regression, the probability of a success is related to explanatory variables; the corresponding concept in ordinary regression is to relate the mean value of the response to explanatory variables.

Binomial regression is closely related to binary regression: a binary regression can be considered a binomial regression with $n=1$, or a regression on ungrouped binary data, while a binomial regression can be considered a regression on grouped binary data (see Comparison with binary regression).^[2] Binomial regression models are essentially the same as binary choice models, one type of discrete choice model: the primary difference is in the theoretical motivation (see Comparison with binary choice models). In machine learning, binomial regression is considered a special case of probabilistic classification, and thus a generalization of binary classification.

Example application

In one published example of an application of binomial regression,^[3] the details were as follows. The observed outcome variable was whether or not a fault occurred in an industrial process. There were two explanatory variables: the first was a simple two-case factor representing whether or not a modified version of the process was used and the second was an ordinary quantitative variable measuring the purity of the material being supplied for the process.

Specification of model

The response variable Y is assumed to be binomially distributed conditional on the explanatory variables X. The number of trials n is known, and the probability of success for each trial p is specified as a function θ(X). This implies that the conditional expectation and conditional variance of the observed fraction of successes, Y/n, are

E(Y/n\mid X)=\theta (X)

\operatorname {Var} (Y/n\mid X)=\theta (X)(1-\theta (X))/n
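
As a quick numerical check of these two moments, the following sketch sums over the exact binomial probability mass function. The function name `fraction_moments` and the values n = 10 and θ = 0.3 are illustrative assumptions, not part of the model.

```python
from math import comb

def fraction_moments(n, theta):
    """Return (mean, variance) of Y/n for Y ~ Binomial(n, theta), by exact summation."""
    # Probability of observing k successes, for k = 0..n.
    pmf = [comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]
    mean = sum(p * k / n for k, p in enumerate(pmf))
    var = sum(p * (k / n - mean) ** 2 for k, p in enumerate(pmf))
    return mean, var

# Illustrative values only.
mean, var = fraction_moments(10, 0.3)
# mean should agree with theta, and var with theta * (1 - theta) / n, up to rounding.
```

The division by n in the variance shows why the observed fraction becomes a more precise estimate of θ(X) as the number of trials grows.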

The goal of binomial regression is to estimate the function θ(X). Typically the statistician assumes $\theta (X)=m(\beta ^{\mathrm {T} }X)$, for a known function m, and estimates β. Common choices for m include the logistic function.^[1]

The data are often fitted as a generalised linear model where the predicted values μ are the probabilities that any individual event will result in a success. The likelihood of the predictions is then given by

L({\boldsymbol {\mu }}\mid Y)=\prod _{i=1}^{n}\left(1_{y_{i}=1}(\mu _{i})+1_{y_{i}=0}(1-\mu _{i})\right),

where 1_A is the indicator function which takes on the value one when the event A occurs, and zero otherwise: in this formulation, for any given observation y_i, only one of the two terms inside the product contributes, according to whether y_i=0 or 1. The likelihood function is more fully specified by defining the formal parameters μ_i as parameterised functions of the explanatory variables: this defines the likelihood in terms of a much reduced number of parameters. Fitting of the model is usually achieved by employing the method of maximum likelihood to determine these parameters. In practice, the use of a formulation as a generalised linear model allows advantage to be taken of certain algorithmic ideas which are applicable across the whole class of more general models but which do not apply to all maximum likelihood problems.
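
A minimal sketch of such a maximum likelihood fit is given below, assuming a logistic form for μ_i with an intercept and a single explanatory variable. The data, the function names and the choice of plain Newton-Raphson iterations are all assumptions made for illustration; statistical packages typically use iteratively reweighted least squares, which coincides with Newton-Raphson for this particular model.

```python
from math import exp, log

def logistic(eta):
    return 1.0 / (1.0 + exp(-eta))

def log_likelihood(b0, b1, xs, ys):
    """Log of the product above: each observation contributes log(mu) or log(1 - mu)."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = logistic(b0 + b1 * x)
        total += log(mu) if y == 1 else log(1.0 - mu)
    return total

def fit_newton(xs, ys, steps=25):
    """Maximise the log-likelihood over (b0, b1) by Newton-Raphson, starting from zero."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = logistic(b0 + b1 * x)
            w = mu * (1.0 - mu)          # Bernoulli variance at the current fit
            g0 += y - mu                 # gradient with respect to b0
            g1 += (y - mu) * x           # gradient with respect to b1
            h00 += w                     # entries of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (negative Hessian) * step = gradient explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Invented binary data, chosen so that the two outcomes overlap in x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_newton(xs, ys)
```

The overlap in the invented data matters: if the outcomes were perfectly separated by x, the likelihood would have no finite maximiser and the iterations would not settle.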

Models used in binomial regression can often be extended to multinomial data.

There are many methods of generating the values of μ in systematic ways that allow for interpretation of the model; they are discussed below.

Link functions

The model linking the probabilities μ to the explanatory variables must be of a form that produces only values in the range 0 to 1. Many models can be fitted into the form

{\boldsymbol {\mu }}=g({\boldsymbol {\eta }})\,.

Here η is an intermediate variable representing a linear combination, containing the regression parameters, of the explanatory variables. The function g is the cumulative distribution function (cdf) of some probability distribution. Usually this probability distribution has a support from minus infinity to plus infinity so that any finite value of η is transformed by the function g to a value inside the range 0 to 1.

In the case of logistic regression, g is the logistic function; its inverse, the logit (the log of the odds), is the link function in generalized linear model terminology. In the case of probit, g is the cdf of the standard normal distribution. The linear probability model is not a proper binomial regression specification because its predictions need not lie in the range zero to one; it is sometimes used for this type of data when interpretation is carried out on the probability scale, or when the analyst lacks the means to fit or calculate approximate linearizations of probabilities for interpretation.
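
The mapping from η to μ for the logistic and probit choices of g can be sketched as follows. The function names and the sample values of η are assumptions chosen for illustration; the normal cdf is written in terms of the error function from the standard library.

```python
from math import erf, exp, sqrt

def logistic_cdf(eta):
    """cdf of the standard logistic distribution (used in logistic regression)."""
    return 1.0 / (1.0 + exp(-eta))

def normal_cdf(eta):
    """cdf of the standard normal distribution (used in probit regression)."""
    return 0.5 * (1.0 + erf(eta / sqrt(2.0)))

# Any finite value of the linear predictor is mapped strictly inside (0, 1).
etas = [-3.0, -1.0, 0.0, 1.0, 3.0]
logit_probs = [logistic_cdf(e) for e in etas]
probit_probs = [normal_cdf(e) for e in etas]
```

Both functions are increasing and pass through 0.5 at η = 0; they differ mainly in how quickly the tails approach 0 and 1.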

Comparison with binary regression

Binomial regression is closely connected with binary regression. If the response is a binary variable (two possible outcomes), then these alternatives can be coded as 0 or 1 by considering one of the outcomes as "success" and the other as "failure" and considering these as count data: "success" is 1 success out of 1 trial, while "failure" is 0 successes out of 1 trial. This can now be considered a binomial distribution with $n=1$ trial, so a binary regression is a special case of a binomial regression. If these data are grouped (by adding counts), they are no longer binary data, but are count data for each group, and can still be modeled by a binomial regression; the individual binary outcomes are then referred to as "ungrouped data". An advantage of working with grouped data is that one can test the goodness of fit of the model;^[2] for example, grouped data may exhibit overdispersion relative to the variance estimated from the ungrouped data.
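
The relationship between grouped and ungrouped data can be illustrated with a small sketch. For a group of trials sharing the same success probability p, the binomial log-likelihood of the count and the sum of the individual Bernoulli log-likelihoods differ only by the log of a binomial coefficient, which does not depend on p, so both forms lead to the same estimates. The function names and the eight outcomes below are invented for illustration.

```python
from math import comb, log

def ungrouped_loglik(outcomes, p):
    """Sum of Bernoulli log-likelihoods, one term per binary outcome."""
    return sum(log(p) if y == 1 else log(1.0 - p) for y in outcomes)

def grouped_loglik(successes, trials, p):
    """Binomial log-likelihood of the success count for the whole group."""
    return (log(comb(trials, successes))
            + successes * log(p)
            + (trials - successes) * log(1.0 - p))

# Invented binary outcomes for one group with a common p.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
k, n = sum(outcomes), len(outcomes)

# The gap between the two log-likelihoods is the same for every p.
gaps = [grouped_loglik(k, n, p) - ungrouped_loglik(outcomes, p)
        for p in (0.2, 0.5, 0.8)]
```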

Comparison with binary choice models

A binary choice model assumes a latent variable U_n, the utility (or net benefit) that person n obtains from taking an action (as opposed to not taking the action). The utility the person obtains from taking the action depends on the characteristics of the person, some of which are observed by the researcher and some are not:

U_{n}={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} +\varepsilon _{n}

where ${\boldsymbol {\beta }}$ is a set of regression coefficients and $\mathbf {s_{n}}$ is a set of independent variables (also known as "features") describing person n, which may be either discrete "dummy variables" or regular continuous variables. $\varepsilon _{n}$ is a random variable specifying "noise" or "error" in the prediction, assumed to be distributed according to some distribution. Normally, if there is a mean or variance parameter in the distribution, it cannot be identified, so the parameters are set to convenient values, by convention usually mean 0 and variance 1.

The person takes the action, y_n = 1, if U_n > 0. The unobserved term, ε_n, is assumed to have a logistic distribution.

The specification is written succinctly as:

- U_n = βs_n + ε_n
- $Y_{n}={\begin{cases}1,&{\text{if }}U_{n}>0,\\0,&{\text{if }}U_{n}\leq 0\end{cases}}$
- ε ∼ logistic, standard normal, etc.

Let us write it slightly differently:

- U_n = βs_n − e_n
- $Y_{n}={\begin{cases}1,&{\text{if }}U_{n}>0,\\0,&{\text{if }}U_{n}\leq 0\end{cases}}$
- e ∼ logistic, standard normal, etc.

Here we have made the substitution e_n = −ε_n. This changes a random variable into a slightly different one, defined over a negated domain. As it happens, the error distributions we usually consider (e.g. logistic distribution, standard normal distribution, standard Student's t-distribution, etc.) are symmetric about 0, and hence the distribution over e_n is identical to the distribution over ε_n.

Denote the cumulative distribution function (CDF) of $e$ as $F_{e},$ and the quantile function (inverse CDF) of $e$ as $F_{e}^{-1}.$

Note that

{\begin{aligned}\Pr(Y_{n}=1)&=\Pr(U_{n}>0)\\[6pt]&=\Pr({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} -e_{n}>0)\\[6pt]&=\Pr(-e_{n}>-{\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\\[6pt]&=\Pr(e_{n}\leq {\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\\[6pt]&=F_{e}({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\end{aligned}}

Since $Y_{n}$ is a Bernoulli trial, where $\mathbb {E} [Y_{n}]=\Pr(Y_{n}=1),$ we have

\mathbb {E} [Y_{n}]=F_{e}({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )

or equivalently

F_{e}^{-1}(\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} .

Note that this is exactly equivalent to the binomial regression model expressed in the formalism of the generalized linear model.

If $e_{n}\sim {\mathcal {N}}(0,1),$ i.e. distributed as a standard normal distribution, then

\Phi ^{-1}(\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}}

which is exactly a probit model.

If $e_{n}\sim \operatorname {Logistic} (0,1),$ i.e. distributed as a standard logistic distribution with mean 0 and scale parameter 1, then the corresponding quantile function is the logit function, and

\operatorname {logit} (\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}}

which is exactly a logit model.
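
The identity Pr(Y_n = 1) = F_e(β · s_n) can be checked by simulation. The sketch below assumes a standard logistic error and an arbitrary illustrative value of 0.7 for β · s_n; the function names, the number of draws and the seed are likewise assumptions.

```python
import random
from math import exp, log

def logistic_cdf(x):
    return 1.0 / (1.0 + exp(-x))

def simulate_choice_rate(beta_dot_s, draws=200_000, seed=0):
    """Fraction of simulated persons with U = beta_dot_s - e > 0, e ~ standard logistic."""
    rng = random.Random(seed)
    taken = 0
    for _ in range(draws):
        u = rng.random()
        while u == 0.0:              # avoid log(0) in the inverse-CDF step
            u = rng.random()
        e = log(u / (1.0 - u))       # standard logistic draw via the quantile function
        if beta_dot_s - e > 0:
            taken += 1
    return taken / draws

# Illustrative value of the linear predictor.
rate = simulate_choice_rate(0.7)
# rate should be close to logistic_cdf(0.7), up to simulation noise.
```

Replacing the logistic draw with a standard normal draw (for example `rng.gauss(0.0, 1.0)`) and comparing against the normal cdf would give the corresponding check for the probit case.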

Note that the two formalisms, generalized linear models (GLMs) and discrete choice models, are equivalent in the case of simple binary choice models, but can be extended in differing ways:

- GLMs can easily handle arbitrarily distributed response variables (dependent variables), not just categorical variables or ordinal variables, which discrete choice models are limited to by their nature. GLMs are also not limited to link functions that are quantile functions of some distribution, unlike the use of an error variable, which must by assumption have a probability distribution.
- On the other hand, because discrete choice models are described as types of generative models, it is conceptually easier to extend them to complicated situations with multiple, possibly correlated, choices for each person, or other variations.

Latent variable interpretation / derivation

A latent variable model involving a binomial observed variable Y can be constructed such that Y is related to the latent variable Y* via

Y={\begin{cases}1,&{\mbox{if }}Y^{*}>0\\0,&{\mbox{if }}Y^{*}\leq 0.\end{cases}}

The latent variable Y* is then related to a set of regression variables X by the model

Y^{*}=X\beta +\epsilon \ .

This results in a binomial regression model.

The variance of ϵ cannot be identified and, when it is not of interest, is often assumed to be equal to one. If ϵ is normally distributed, then a probit is the appropriate model, and if ϵ follows a logistic distribution (equivalently, if it is the difference of two independent log-Weibull, i.e. Gumbel, variables), then a logit is appropriate. If ϵ is uniformly distributed, then a linear probability model is appropriate.
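
To see why the uniform case gives a linear probability model, a short derivation may help. It assumes, purely for illustration, that ϵ is uniform on the symmetric interval (−a, a); the half-width a is not a symbol used elsewhere in this article.

```latex
\begin{aligned}
\Pr(Y^{*}>0\mid X) &= \Pr(\epsilon >-X\beta ) \\
&= \frac{a-(-X\beta )}{2a} \\
&= \frac{1}{2}+\frac{X\beta }{2a}, \qquad -a\leq X\beta \leq a .
\end{aligned}
```

The probability that the latent variable is positive is therefore linear in Xβ on that interval, and so is its complement; outside the interval it is fixed at 0 or 1. Since the scale of ϵ is not identified, a can be absorbed into β.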

