Probability distribution modeling a coin toss which need not be fair
In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli,[1] is the discrete probability distribution of a random variable which takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$. Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question. Such questions lead to outcomes that are Boolean-valued: a single bit whose value is success/yes/true/one with probability $p$ and failure/no/false/zero with probability $q$. It can be used to represent a (possibly biased) coin toss where 1 and 0 would represent "heads" and "tails", respectively, and $p$ would be the probability of the coin landing on heads (or vice versa, where 1 would represent tails and $p$ would be the probability of tails). In particular, unfair coins would have $p \neq 1/2$.
The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (so $n$ would be 1 for such a binomial distribution). It is also a special case of the two-point distribution, for which the possible outcomes need not be 0 and 1.[2]
If $X$ is a random variable with a Bernoulli distribution, then:
$$\Pr(X=1)=p=1-\Pr(X=0)=1-q.$$
The probability mass function $f$ of this distribution, over possible outcomes $k$, is
$$f(k;p)=\begin{cases}p & \text{if }k=1,\\ q=1-p & \text{if }k=0.\end{cases}$$
[3] This can also be expressed as
$$f(k;p)=p^{k}(1-p)^{1-k}\quad\text{for }k\in\{0,1\}$$
or as
$$f(k;p)=pk+(1-p)(1-k)\quad\text{for }k\in\{0,1\}.$$
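To make the equivalence of these three expressions concrete, here is a minimal Python sketch (an illustration, not drawn from the article's sources; the value of $p$ is arbitrary) that evaluates all three forms and checks that they agree on $k\in\{0,1\}$:

```python
def pmf_piecewise(k, p):
    """Piecewise form: p if k == 1, q = 1 - p if k == 0."""
    if k == 1:
        return p
    if k == 0:
        return 1 - p
    raise ValueError("k must be 0 or 1")

def pmf_power(k, p):
    """Closed form p^k (1 - p)^(1 - k) for k in {0, 1}."""
    return p ** k * (1 - p) ** (1 - k)

def pmf_linear(k, p):
    """Linear form p*k + (1 - p)*(1 - k) for k in {0, 1}."""
    return p * k + (1 - p) * (1 - k)

p = 0.3  # arbitrary illustrative value
for k in (0, 1):
    assert pmf_piecewise(k, p) == pmf_power(k, p) == pmf_linear(k, p)
    print(k, pmf_piecewise(k, p))  # prints 0 0.7 and 1 0.3
```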
The Bernoulli distribution is a special case of the binomial distribution with $n=1$.[4]
The kurtosis goes to infinity for high and low values of $p$, but for $p=1/2$ the two-point distributions, including the Bernoulli distribution, have a lower excess kurtosis, namely −2, than any other probability distribution.
The Bernoulli distributions for $0\leq p\leq 1$ form an exponential family.
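Concretely (a standard rewriting for $0 < p < 1$, supplied here for illustration rather than taken from the text above), the pmf can be written in exponential-family form with the log-odds as natural parameter:
$$f(k;p)=\exp\!\left(k\ln\frac{p}{1-p}+\ln(1-p)\right),\quad k\in\{0,1\},$$
so the natural parameter is the logit $\theta=\ln\frac{p}{1-p}$, and expanding the exponential recovers $p^{k}(1-p)^{1-k}$.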
The maximum likelihood estimator of $p$ based on a random sample is the sample mean.
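As a sketch of this estimator in Python (illustrative only; the "true" value of $p$ and the sample size are arbitrary assumptions):

```python
import random

random.seed(0)
p_true = 0.3        # assumed "unknown" parameter for the demonstration
n = 100_000

# Draw a Bernoulli sample: each trial is 1 with probability p_true, else 0.
sample = [1 if random.random() < p_true else 0 for _ in range(n)]

# The maximum likelihood estimate of p is the sample mean.
p_hat = sum(sample) / n
print(p_hat)        # close to p_true for large n
```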
[Figure: The probability mass function of a Bernoulli experiment along with its corresponding cumulative distribution function.]

The expected value of a Bernoulli random variable $X$ is
$$\operatorname{E}[X]=p$$
This is due to the fact that for a Bernoulli distributed random variable $X$ with $\Pr(X=1)=p$ and $\Pr(X=0)=q$ we find
$$\operatorname{E}[X]=\Pr(X=1)\cdot 1+\Pr(X=0)\cdot 0=p\cdot 1+q\cdot 0=p.$$
[3] The variance of a Bernoulli distributed $X$ is
$$\operatorname{Var}[X]=pq=p(1-p)$$
We first find
$$\operatorname{E}[X^{2}]=\Pr(X=1)\cdot 1^{2}+\Pr(X=0)\cdot 0^{2}=p\cdot 1^{2}+q\cdot 0^{2}=p=\operatorname{E}[X]$$
From this follows
$$\operatorname{Var}[X]=\operatorname{E}[X^{2}]-\operatorname{E}[X]^{2}=\operatorname{E}[X]-\operatorname{E}[X]^{2}=p-p^{2}=p(1-p)=pq$$
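The derivations of the mean and variance above can be checked by simulation; a minimal sketch (the parameter value is arbitrary, and the printed numbers will vary slightly with the seed):

```python
import random

random.seed(1)
p = 0.3
q = 1 - p
n = 200_000

sample = [1 if random.random() < p else 0 for _ in range(n)]
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / n

print(mean, p)      # sample mean is close to E[X] = p
print(var, p * q)   # sample variance is close to Var[X] = pq
```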
[3] With this result it is easy to prove that, for any Bernoulli distribution, its variance lies in $[0,1/4]$, since $p(1-p)$ attains its maximum value of $1/4$ at $p=1/2$.
The skewness is
$$\frac{q-p}{\sqrt{pq}}=\frac{1-2p}{\sqrt{pq}}.$$
When we take the standardized Bernoulli distributed random variable
$$\frac{X-\operatorname{E}[X]}{\sqrt{\operatorname{Var}[X]}}$$
we find that this random variable attains $\frac{q}{\sqrt{pq}}$ with probability $p$ and attains $-\frac{p}{\sqrt{pq}}$ with probability $q$. Thus we get
$$\begin{aligned}\gamma_{1}&=\operatorname{E}\left[\left(\frac{X-\operatorname{E}[X]}{\sqrt{\operatorname{Var}[X]}}\right)^{3}\right]\\&=p\cdot\left(\frac{q}{\sqrt{pq}}\right)^{3}+q\cdot\left(-\frac{p}{\sqrt{pq}}\right)^{3}\\&=\frac{1}{{\sqrt{pq}}^{3}}\left(pq^{3}-qp^{3}\right)\\&=\frac{pq}{{\sqrt{pq}}^{3}}(q-p)\\&=\frac{q-p}{\sqrt{pq}}.\end{aligned}$$
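This final expression can be sanity-checked by enumerating the two outcomes directly; a small Python sketch (the choice $p = 0.3$ is arbitrary):

```python
import math

p = 0.3
q = 1 - p
sd = math.sqrt(p * q)                      # standard deviation sqrt(pq)

# E[((X - E[X]) / sd)^3], enumerating the outcomes k = 1 and k = 0.
direct = p * ((1 - p) / sd) ** 3 + q * ((0 - p) / sd) ** 3
formula = (q - p) / sd                     # closed form derived above

print(direct, formula)                     # both ≈ 0.8729 for p = 0.3
assert math.isclose(direct, formula)
```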
Higher moments and cumulants

The raw moments are all equal due to the fact that $1^{k}=1$ and $0^{k}=0$:
$$\operatorname{E}[X^{k}]=\Pr(X=1)\cdot 1^{k}+\Pr(X=0)\cdot 0^{k}=p\cdot 1+q\cdot 0=p=\operatorname{E}[X].$$
The central moment of order $k$ is given by
$$\mu_{k}=(1-p)(-p)^{k}+p(1-p)^{k}.$$
The first six central moments are
$$\begin{aligned}\mu_{1}&=0,\\\mu_{2}&=p(1-p),\\\mu_{3}&=p(1-p)(1-2p),\\\mu_{4}&=p(1-p)(1-3p(1-p)),\\\mu_{5}&=p(1-p)(1-2p)(1-2p(1-p)),\\\mu_{6}&=p(1-p)(1-5p(1-p)(1-p(1-p))).\end{aligned}$$
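These six expressions agree with the general closed form for $\mu_{k}$; a quick numerical check in Python (with an arbitrary $p$):

```python
import math

p = 0.3  # arbitrary illustrative value

def mu(k, p):
    """Central moment of order k from the closed form above."""
    return (1 - p) * (-p) ** k + p * (1 - p) ** k

listed = {
    1: 0.0,
    2: p * (1 - p),
    3: p * (1 - p) * (1 - 2 * p),
    4: p * (1 - p) * (1 - 3 * p * (1 - p)),
    5: p * (1 - p) * (1 - 2 * p) * (1 - 2 * p * (1 - p)),
    6: p * (1 - p) * (1 - 5 * p * (1 - p) * (1 - p * (1 - p))),
}
for k, value in listed.items():
    assert math.isclose(mu(k, p), value, abs_tol=1e-12)
```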
The higher central moments can be expressed more compactly in terms of $\mu_{2}$ and $\mu_{3}$:
$$\begin{aligned}\mu_{4}&=\mu_{2}(1-3\mu_{2}),\\\mu_{5}&=\mu_{3}(1-2\mu_{2}),\\\mu_{6}&=\mu_{2}(1-5\mu_{2}(1-\mu_{2})).\end{aligned}$$
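These identities can also be verified symbolically; a sketch using sympy (assuming that library is available, which the article does not require):

```python
import sympy as sp

p = sp.symbols("p")

def mu(k):
    # Central moment of order k from the closed form above.
    return (1 - p) * (-p) ** k + p * (1 - p) ** k

mu2, mu3 = mu(2), mu(3)
assert sp.expand(mu(4) - mu2 * (1 - 3 * mu2)) == 0
assert sp.expand(mu(5) - mu3 * (1 - 2 * mu2)) == 0
assert sp.expand(mu(6) - mu2 * (1 - 5 * mu2 * (1 - mu2))) == 0
```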
The first six cumulants are
$$\begin{aligned}\kappa_{1}&=p,\\\kappa_{2}&=\mu_{2},\\\kappa_{3}&=\mu_{3},\\\kappa_{4}&=\mu_{2}(1-6\mu_{2}),\\\kappa_{5}&=\mu_{3}(1-12\mu_{2}),\\\kappa_{6}&=\mu_{2}(1-30\mu_{2}(1-4\mu_{2})).\end{aligned}$$
The Bernoulli distribution is simply $\operatorname{B}(1,p)$, also written as $\mathrm{Bernoulli}(p)$.
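The equivalence with the single-trial binomial can be illustrated numerically with scipy (assumed available; the value of $p$ is arbitrary):

```python
from scipy import stats

p = 0.3
bern = stats.bernoulli(p)
binom1 = stats.binom(n=1, p=p)    # binomial distribution with a single trial

for k in (0, 1):
    print(k, bern.pmf(k), binom1.pmf(k))   # identical pmf values

print(bern.mean(), bern.var())    # p and p(1 - p): 0.3 and 0.21
```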
^ Uspensky, James Victor (1937). Introduction to Mathematical Probability. New York: McGraw-Hill. p. 45. OCLC 996937.
^ Dekking, Frederik; Kraaikamp, Cornelis; Lopuhaä, Hendrik; Meester, Ludolf (9 October 2010). A Modern Introduction to Probability and Statistics (1st ed.). Springer London. pp. 43–48. ISBN 9781849969529.
^ Bertsekas, Dimitri P.; Tsitsiklis, John N. (2002). Introduction to Probability. Belmont, Mass.: Athena Scientific. ISBN 188652940X. OCLC 51441829.
^ McCullagh, Peter; Nelder, John (1989). Generalized Linear Models (2nd ed.). Boca Raton: Chapman and Hall/CRC. Section 4.2.2. ISBN 0-412-31760-5.
^ Orloff, Jeremy; Bloom, Jonathan. "Conjugate priors: Beta and normal" (PDF). math.mit.edu. Retrieved October 20, 2023.
Johnson, N. L.; Kotz, S.; Kemp, A. (1993). Univariate Discrete Distributions (2nd ed.). Wiley. ISBN 0-471-54897-9.
Peatman, John G. (1963). Introduction to Applied Statistics. New York: Harper & Row. pp. 162–171.