The green curve, which asymptotically approaches heights of 0 and 1 without reaching them, is the true cumulative distribution function of the standard normal distribution. The grey hash marks represent the observations in a particular sample drawn from that distribution, and the horizontal steps of the blue step function (including the leftmost point in each step but not including the rightmost point) form the empirical distribution function of that sample.

In statistics, an empirical distribution function (also commonly called an empirical cumulative distribution function, eCDF) is the distribution function associated with the empirical measure of a sample.^[1] This cumulative distribution function is a step function that jumps up by $1/n$ at each of the $n$ data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.

The empirical distribution function is an estimate of the cumulative distribution function that generated the points in the sample. It converges with probability 1 to that underlying distribution, according to the Glivenko–Cantelli theorem. A number of results exist to quantify the rate of convergence of the empirical distribution function to the underlying cumulative distribution function.

Definition

Let $(X_1, \dots, X_n)$ be independent, identically distributed real random variables with the common cumulative distribution function $F(t)$. Then the empirical distribution function is defined as^[2]

$$\widehat{F}_n(t) = \frac{\text{number of elements in the sample} \leq t}{n} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{X_i \leq t},$$

where $\mathbf{1}_A$ is the indicator of event $A$. For a fixed $t$, the indicator $\mathbf{1}_{X_i \leq t}$ is a Bernoulli random variable with parameter $p = F(t)$; hence $n\widehat{F}_n(t)$ is a binomial random variable with mean $nF(t)$ and variance $nF(t)(1 - F(t))$. This implies that $\widehat{F}_n(t)$ is an unbiased estimator for $F(t)$.
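The definition translates directly into code. The following is a minimal sketch using only the Python standard library; the helper name `make_ecdf` and the sample values are illustrative assumptions, not taken from any cited source.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return a function t -> fraction of observations that are <= t."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def ecdf(t):
        # bisect_right returns the number of elements of xs that are <= t
        return bisect_right(xs, t) / n

    return ecdf


sample = [3.1, 0.4, 2.2, 2.2, 5.0]  # illustrative data
F_hat = make_ecdf(sample)
print(F_hat(0.0), F_hat(2.2), F_hat(5.0))  # prints: 0.0 0.6 1.0
```

The repeated value 2.2 produces a jump of $2/n$ at that point, since each observation contributes $1/n$.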

However, some textbooks give the definition as^[3]^[4]

$$\widehat{F}_n(t) = \frac{1}{n+1}\sum_{i=1}^{n}\mathbf{1}_{X_i \leq t}.$$

Asymptotic properties

Since the ratio $(n+1)/n$ approaches 1 as $n$ goes to infinity, the asymptotic properties of the two definitions given above are the same.
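This claim can be made concrete with a short calculation, sketched here with $\widetilde{F}_n$ as an ad hoc label (not from the source) for the $1/(n+1)$ version and $k$ for the number of observations that are $\leq t$:

```latex
\widehat{F}_n(t) - \widetilde{F}_n(t)
  = \frac{k}{n} - \frac{k}{n+1}
  = \frac{k}{n(n+1)}
  \leq \frac{1}{n+1},
  \qquad 0 \leq k \leq n .
```

The bound does not depend on $t$, so the two definitions differ by at most $1/(n+1)$ uniformly in $t$, which vanishes as $n \to \infty$.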

By the strong law of large numbers, the estimator $\widehat{F}_n(t)$ converges to $F(t)$ as $n \to \infty$ almost surely, for every value of $t$:^[2]

$$\widehat{F}_n(t)\ \xrightarrow{\text{a.s.}}\ F(t);$$

thus the estimator $\widehat{F}_n(t)$ is consistent. This expression asserts the pointwise convergence of the empirical distribution function to the true cumulative distribution function. There is a stronger result, called the Glivenko–Cantelli theorem, which states that the convergence in fact happens uniformly over $t$:^[5]

$$\|\widehat{F}_n - F\|_\infty \equiv \sup_{t\in\mathbb{R}} \big|\widehat{F}_n(t) - F(t)\big|\ \xrightarrow{\text{a.s.}}\ 0.$$

The sup-norm in this expression is called the Kolmogorov–Smirnov statistic for testing the goodness-of-fit between the empirical distribution $\widehat{F}_n(t)$ and the assumed true cumulative distribution function $F$. Other norm functions may reasonably be used here instead of the sup-norm. For example, the $L^2$-norm gives rise to the Cramér–von Mises statistic.
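The sup-norm can be computed exactly for a continuous $F$, because the supremum is attained at, or just before, one of the sample points. The sketch below assumes a standard normal $F$ and uses only the standard library; the names `normal_cdf` and `sup_distance` are illustrative, and the loop at the end simply prints the distance for growing $n$ to illustrate the uniform convergence.

```python
import math
import random


def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_distance(sample, cdf):
    """sup_t |F_hat_n(t) - cdf(t)| for a continuous cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # At x the ECDF has risen to at least i/n; just before x it is
        # at most (i - 1)/n. Taking the maximum over all i also handles ties.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


rng = random.Random(0)  # fixed seed, chosen arbitrarily
for n in (10, 100, 1000, 10000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, round(sup_distance(sample, normal_cdf), 4))
```

The printed distances should shrink roughly like $1/\sqrt{n}$, although any single run fluctuates.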

The asymptotic distribution can be further characterized in several different ways. First, the central limit theorem states that pointwise, $\widehat{F}_n(t)$ is asymptotically normal with the standard $\sqrt{n}$ rate of convergence:^[2]

$$\sqrt{n}\,\big(\widehat{F}_n(t) - F(t)\big)\ \xrightarrow{d}\ \mathcal{N}\Big(0,\ F(t)\big(1 - F(t)\big)\Big).$$
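A small Monte Carlo experiment can illustrate the limiting variance. The sketch below assumes standard normal data and arbitrary choices of $t$, $n$, the number of replications, and the seed; the function name `scaled_errors` is hypothetical.

```python
import math
import random
import statistics


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def scaled_errors(t, n, reps, seed=0):
    """Simulate sqrt(n) * (F_hat_n(t) - F(t)) for `reps` independent samples."""
    rng = random.Random(seed)
    f_t = normal_cdf(t)
    errors = []
    for _ in range(reps):
        # Number of observations <= t in a fresh sample of size n
        count = sum(rng.gauss(0.0, 1.0) <= t for _ in range(n))
        errors.append(math.sqrt(n) * (count / n - f_t))
    return f_t, errors


f_t, errors = scaled_errors(t=0.5, n=100, reps=2000)
print("sample variance :", statistics.variance(errors))
print("F(t)(1 - F(t))  :", f_t * (1.0 - f_t))
```

The two printed numbers should be close but not identical, since the first is a simulation estimate.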

This result is extended by Donsker’s theorem, which asserts that the empirical process $\sqrt{n}(\widehat{F}_n - F)$, viewed as a function indexed by $t \in \mathbb{R}$, converges in distribution in the Skorokhod space $D[-\infty, +\infty]$ to the mean-zero Gaussian process $G_F = B \circ F$, where $B$ is the standard Brownian bridge.^[5] The covariance structure of this Gaussian process is

$$\operatorname{E}[\,G_F(t_1)G_F(t_2)\,] = F(t_1 \wedge t_2) - F(t_1)F(t_2).$$
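As a consistency check (a worked step, not part of the cited result), setting $t_1 = t_2 = t$ in the covariance recovers the pointwise variance from the central limit theorem, and for $t_1 \leq t_2$ the monotonicity of $F$ resolves the minimum:

```latex
\operatorname{E}[\,G_F(t)^2\,] = F(t) - F(t)^2 = F(t)\big(1 - F(t)\big),
\qquad
\operatorname{E}[\,G_F(t_1) G_F(t_2)\,] = F(t_1)\big(1 - F(t_2)\big)
\quad \text{for } t_1 \leq t_2 .
```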

The uniform rate of convergence in Donsker’s theorem can be quantified by the result known as the Hungarian embedding:^[6]

$$\limsup_{n\to\infty} \frac{\sqrt{n}}{\ln^2 n}\, \big\| \sqrt{n}(\widehat{F}_n - F) - G_{F,n} \big\|_\infty < \infty, \quad \text{a.s.}$$

Alternatively, the rate of convergence of $\sqrt{n}(\widehat{F}_n - F)$ can also be quantified in terms of the asymptotic behavior of the sup-norm of this expression. A number of results exist in this vein; for example, the Dvoretzky–Kiefer–Wolfowitz inequality provides a bound on the tail probabilities of $\sqrt{n}\|\widehat{F}_n - F\|_\infty$:^[6]

$$\Pr\!\Big( \sqrt{n}\,\|\widehat{F}_n - F\|_\infty > z \Big) \leq 2e^{-2z^2}.$$

In fact, Kolmogorov has shown that if the cumulative distribution function $F$ is continuous, then the expression $\sqrt{n}\|\widehat{F}_n - F\|_\infty$ converges in distribution to $\|B\|_\infty$, which has the Kolmogorov distribution that does not depend on the form of $F$.

Another result, which follows from the law of the iterated logarithm, is that^[6]

$$\limsup_{n\to\infty} \frac{\sqrt{n}\,\|\widehat{F}_n - F\|_\infty}{\sqrt{2\ln\ln n}} \leq \frac{1}{2}, \quad \text{a.s.}$$

and

$$\liminf_{n\to\infty} \sqrt{2n\ln\ln n}\,\|\widehat{F}_n - F\|_\infty = \frac{\pi}{2}, \quad \text{a.s.}$$

Confidence intervals

By the Dvoretzky–Kiefer–Wolfowitz inequality, the interval that contains the true CDF, $F(x)$, with probability at least $1-\alpha$ is given by

$$\widehat{F}_n(x) - \varepsilon \leq F(x) \leq \widehat{F}_n(x) + \varepsilon \quad \text{where } \varepsilon = \sqrt{\frac{\ln\frac{2}{\alpha}}{2n}}.$$
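The value of $\varepsilon$ follows from the Dvoretzky–Kiefer–Wolfowitz tail bound stated earlier by substituting $z = \sqrt{n}\,\varepsilon$ and setting the right-hand side equal to $\alpha$. The steps, written out as a sketch:

```latex
\Pr\big( \|\widehat{F}_n - F\|_\infty > \varepsilon \big) \leq 2e^{-2n\varepsilon^2} = \alpha
\;\Longrightarrow\;
e^{2n\varepsilon^2} = \frac{2}{\alpha}
\;\Longrightarrow\;
\varepsilon = \sqrt{\frac{\ln(2/\alpha)}{2n}} .
```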

Using these bounds, the empirical CDF, the true CDF, and the confidence intervals can be plotted for different distributions with any of the statistical implementations listed below.
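Before plotting, the band itself has to be computed. The following is a minimal standard-library sketch; the function name `dkw_band` and the sample are illustrative assumptions, and clipping the limits to $[0, 1]$ is a convenience added here rather than part of the stated formula.

```python
import math
from bisect import bisect_right


def dkw_band(sample, alpha=0.05):
    """Return (eps, band), where band(t) gives (lower, estimate, upper) at t."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")
    # Half-width from the Dvoretzky-Kiefer-Wolfowitz inequality
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

    def band(t):
        f_hat = bisect_right(xs, t) / n  # empirical CDF at t
        # Clipping to [0, 1] is an added convenience: a CDF never leaves that range
        return max(f_hat - eps, 0.0), f_hat, min(f_hat + eps, 1.0)

    return eps, band


eps, band = dkw_band(list(range(1, 101)), alpha=0.05)  # illustrative sample 1..100
print("half-width:", round(eps, 4))
print("band at t = 50:", band(50))
```

The half-width depends only on $n$ and $\alpha$, not on the data, so the band has the same width at every $t$ until it is clipped.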

Statistical implementation

A non-exhaustive list of software implementations of the empirical distribution function includes:

R: the ecdf function computes an empirical cumulative distribution function, with several methods for plotting, printing, and computing with such an “ecdf” object.
MATLAB: the empirical cumulative distribution function (cdf) plot.
JMP from SAS: the CDF plot creates a plot of the empirical cumulative distribution function.
Minitab: create an Empirical CDF.
Mathwave: fit probability distributions to data.
Dataplot: the empirical CDF plot.
SciPy: the scipy.stats.ecdf function.
Statsmodels: the statsmodels.distributions.empirical_distribution.ECDF class.
Matplotlib: the matplotlib.pyplot.ecdf function (new in version 3.8.0).^[7]
Seaborn: the seaborn.ecdfplot function.
Plotly: the plotly.express.ecdf function.
Excel: an empirical CDF plot can be constructed.
ArviZ: the az.plot_ecdf function.
