In statistics, the family-wise error rate (FWER) is the probability of making one or more false discoveries, or type I errors, when performing multiple hypothesis tests.

Familywise and Experimentwise Error Rates

In 1953, John Tukey developed the concept of a familywise error rate as the probability of making a Type I error among a specified group, or "family," of tests.^[1] Ryan (1959) proposed the related concept of an experimentwise error rate, which is the probability of making a Type I error in a given experiment.^[2] Hence, an experimentwise error rate is a familywise error rate for all of the tests that are conducted within an experiment.

As Ryan (1959, Footnote 3) explained, an experiment may contain two or more families of multiple comparisons, each of which relates to a particular statistical inference and each of which has its own separate familywise error rate.^[2] Hence, familywise error rates are usually based on theoretically informative collections of multiple comparisons. In contrast, an experimentwise error rate may be based on a coincidental collection of comparisons that refer to a diverse range of separate inferences. Consequently, some have argued that it may not be useful to control the experimentwise error rate.^[3] Indeed, Tukey was against the idea of experimentwise error rates (Tukey, 1956, personal communication, in Ryan, 1962, p. 302).^[4] More recently, Rubin (2021) criticised the automatic consideration of experimentwise error rates, arguing that “in many cases, the joint studywise [experimentwise] hypothesis has no relevance to researchers’ specific research questions, because its constituent hypotheses refer to comparisons and variables that have no theoretical or practical basis for joint consideration.”^[5]

Background

Within the statistical framework, there are several definitions for the term "family":

Hochberg & Tamhane (1987) defined "family" as "any collection of inferences for which it is meaningful to take into account some combined measure of error".^[3]
According to Cox (1982), a set of inferences should be regarded as a family:^{[citation needed]}

To take into account the selection effect due to data dredging
To ensure simultaneous correctness of a set of inferences, so as to guarantee a correct overall decision

To summarize, a family could best be defined by the potential selective inference that is being faced: A family is the smallest set of items of inference in an analysis, interchangeable about their meaning for the goal of research, from which selection of results for action, presentation or highlighting could be made (Yoav Benjamini).^{[citation needed]}

Classification of multiple hypothesis tests

Main article: Classification of multiple hypothesis tests

The following table defines the possible outcomes when testing multiple null hypotheses. Suppose we have a number $m$ of null hypotheses, denoted by $H_{1}, H_{2}, \ldots, H_{m}$. Using a statistical test, we reject the null hypothesis if the test is declared significant. We do not reject the null hypothesis if the test is non-significant. Summing each type of outcome over all $H_{i}$ yields the following random variables:

	Null hypothesis is true (H₀)	Alternative hypothesis is true (H_A)	Total
Test is declared significant	$V$	$S$	$R$
Test is declared non-significant	$U$	$T$	$m-R$
Total	$m_{0}$	$m-m_{0}$	$m$

$m$ is the total number of hypotheses tested
$m_{0}$ is the number of true null hypotheses, an unknown parameter
$m-m_{0}$ is the number of true alternative hypotheses
$V$ is the number of false positives (Type I error) (also called "false discoveries")
$S$ is the number of true positives (also called "true discoveries")
$T$ is the number of false negatives (Type II error)
$U$ is the number of true negatives
$R=V+S$ is the number of rejected null hypotheses (also called "discoveries", either true or false)

In $m$ hypothesis tests of which $m_{0}$ are true null hypotheses, $R$ is an observable random variable, and $S$, $T$, $U$, and $V$ are unobservable random variables.
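As an illustration of the table, the counts can be tabulated in a simulation where the truth of each null hypothesis is known. The following is a minimal Python sketch; the names `Outcomes` and `tabulate` are hypothetical and introduced here only for illustration. In real data the truth of each null is unknown, so only $R$ could actually be computed.

```python
from typing import NamedTuple, Sequence


class Outcomes(NamedTuple):
    V: int  # false positives: true nulls that were rejected
    S: int  # true positives: false nulls that were rejected
    U: int  # true negatives: true nulls that were not rejected
    T: int  # false negatives: false nulls that were not rejected
    R: int  # total rejections, R = V + S


def tabulate(null_is_true: Sequence[bool], rejected: Sequence[bool]) -> Outcomes:
    """Count the four outcome types over all m hypotheses."""
    if len(null_is_true) != len(rejected):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(null_is_true, rejected))
    V = sum(1 for n, r in pairs if n and r)
    S = sum(1 for n, r in pairs if not n and r)
    U = sum(1 for n, r in pairs if n and not r)
    T = sum(1 for n, r in pairs if not n and not r)
    return Outcomes(V=V, S=S, U=U, T=T, R=V + S)


# Five hypotheses, the first three of which are true nulls (m0 = 3).
truth = [True, True, True, False, False]
decisions = [True, False, False, True, False]
counts = tabulate(truth, decisions)  # V=1, S=1, U=2, T=1, R=2
```

Here $V+U=m_{0}=3$ and $S+T=m-m_{0}=2$, matching the column totals of the table.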

Definition

The FWER is the probability of making at least one type I error in the family,

$\mathrm{FWER} = \Pr(V \geq 1),$

or equivalently,

$\mathrm{FWER} = 1 - \Pr(V = 0).$

Thus, by assuring $\mathrm{FWER} \leq \alpha$, the probability of making one or more type I errors in the family is controlled at level $\alpha$.

A procedure controls the FWER in the weak sense if the FWER control at level $\alpha$ is guaranteed only when all null hypotheses are true (i.e. when $m_{0}=m$, meaning the "global null hypothesis" is true).^[6]

A procedure controls the FWER in the strong sense if the FWER control at level $\alpha$ is guaranteed for any configuration of true and non-true null hypotheses (whether the global null hypothesis is true or not).^[7]
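To see why control is needed, consider the special case (an assumption made only for this illustration) in which all $m$ null hypotheses are true, the tests are independent, and each is tested at an unadjusted level $\alpha$ with $\Pr(p_{i}\leq \alpha)=\alpha$. Then $\Pr(V=0)=(1-\alpha)^{m}$, so the FWER is $1-(1-\alpha)^{m}$. A small Python sketch with a hypothetical helper name:

```python
def fwer_unadjusted(alpha: float, m: int) -> float:
    """FWER when m independent true nulls are each tested at level alpha.

    Assumes independence and exact level-alpha tests; other dependence
    structures give different values.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 5, 10, 50):
    print(m, round(fwer_unadjusted(0.05, m), 4))
```

With $\alpha = 0.05$ and $m = 10$, the value is about 0.40, far above the nominal 0.05.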

Controlling procedures

For broader coverage of this topic, see Multiple testing correction.

Further information: List of post hoc tests

Some classical solutions ensure strong FWER control at level $\alpha$, and some newer solutions also exist.

The Bonferroni procedure

Main article: Bonferroni correction

Denote by $p_{i}$ the p-value for testing $H_{i}$.
Reject $H_{i}$ if $p_{i} \leq \frac{\alpha}{m}$.
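The rule above can be sketched in a few lines of Python. The function name `bonferroni_reject` is hypothetical, and the example p-values are made up for illustration.

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each hypothesis, whether p_i <= alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # the same per-test level for every hypothesis
    return [p <= threshold for p in p_values]


# Four tests at alpha = 0.05 give a per-test threshold of 0.0125.
decisions = bonferroni_reject([0.001, 0.012, 0.03, 0.04], alpha=0.05)
# decisions == [True, True, False, False]
```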

The Šidák procedure

Main article: Šidák correction

Testing each hypothesis at level $\alpha_{SID} = 1-(1-\alpha)^{\frac{1}{m}}$ is Šidák's multiple testing procedure.
This procedure is more powerful than Bonferroni but the gain is small.
This procedure can fail to control the FWER when the tests are negatively dependent.
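The size of the gain over Bonferroni can be checked numerically. The sketch below (hypothetical function name) computes the Šidák per-test level and compares it with $\alpha/m$. Under the illustrative assumption of $m$ independent true nulls, testing each at $\alpha_{SID}$ gives an FWER of exactly $1-(1-\alpha_{SID})^{m}=\alpha$.

```python
def sidak_level(alpha: float, m: int) -> float:
    """Per-test significance level of the Sidak procedure."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


alpha, m = 0.05, 10
per_test_sidak = sidak_level(alpha, m)       # about 0.00512
per_test_bonferroni = alpha / m              # 0.005
# FWER if the m tests are independent and all nulls are true
implied_fwer = 1.0 - (1.0 - per_test_sidak) ** m
```

For $\alpha = 0.05$ and $m = 10$ the Šidák level is only slightly larger than the Bonferroni level, which is the small gain in power noted above.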

Tukey's procedure

Main article: Tukey's range test

Tukey's procedure is only applicable for pairwise comparisons.
It assumes independence of the observations being tested, as well as equal variation across observations (homoscedasticity).
The procedure calculates for each pair the studentized range statistic: $\frac{Y_{A}-Y_{B}}{SE}$ where $Y_{A}$ is the larger of the two means being compared, $Y_{B}$ is the smaller, and $SE$ is the standard error of the data in question.^{[citation needed]}
Tukey's test is essentially a Student's t-test, except that it corrects for the family-wise error rate.^{[citation needed]}

Holm's step-down procedure (1979)

Main article: Holm–Bonferroni method

Start by ordering the p-values (from lowest to highest) $P_{(1)} \ldots P_{(m)}$ and let the associated hypotheses be $H_{(1)} \ldots H_{(m)}$
Let $k$ be the minimal index such that $P_{(k)} > \frac{\alpha}{m+1-k}$
Reject the null hypotheses $H_{(1)} \ldots H_{(k-1)}$. If $k=1$ then none of the hypotheses are rejected. If no such $k$ exists, all of the hypotheses are rejected.^{[citation needed]}

This procedure is uniformly more powerful than the Bonferroni procedure.^[8] It controls the family-wise error rate for all $m$ hypotheses at level $\alpha$ in the strong sense because it is a closed testing procedure in which each intersection is tested using the simple Bonferroni test.^{[citation needed]}
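The step-down rule can be sketched as follows. This is a minimal Python illustration with a hypothetical function name, not a reference implementation; the example p-values are made up.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down decisions, returned in the original order of p_values."""
    m = len(p_values)
    # Indices of the hypotheses sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order, start=1):
        # Stop at the first (smallest) k with P_(k) > alpha / (m + 1 - k).
        if p_values[i] > alpha / (m + 1 - k):
            break
        reject[i] = True
    return reject


decisions = holm_reject([0.01, 0.06, 0.015, 0.005], alpha=0.05)
# decisions == [True, False, True, True]
```

In this example the plain Bonferroni threshold $0.05/4 = 0.0125$ would reject only the hypotheses with p-values 0.01 and 0.005, whereas the step-down rule also rejects the one with p-value 0.015, because it is compared against $0.05/2 = 0.025$.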

Hochberg's step-up procedure

Hochberg's step-up procedure (1988) is performed using the following steps:^[9]

Start by ordering the p-values (from lowest to highest) $P_{(1)} \ldots P_{(m)}$ and let the associated hypotheses be $H_{(1)} \ldots H_{(m)}$
For a given $\alpha$, let $R$ be the largest $k$ such that $P_{(k)} \leq \frac{\alpha}{m-k+1}$
Reject the null hypotheses $H_{(1)} \ldots H_{(R)}$

Hochberg's procedure is more powerful than Holm's. Nevertheless, while Holm’s is a closed testing procedure (and thus, like Bonferroni, has no restriction on the joint distribution of the test statistics), Hochberg’s is based on the Simes test, so its FWER control holds only under non-negative dependence.^{[citation needed]}
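The step-up rule can be sketched in Python as below. The function name is hypothetical and the example p-values are made up. The scan starts from the largest p-value and stops at the first index that satisfies the inequality.

```python
from typing import List, Sequence


def hochberg_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Hochberg step-up decisions, returned in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    # Search from k = m down to 1 for the largest k with
    # P_(k) <= alpha / (m - k + 1).
    for k in range(m, 0, -1):
        if p_values[order[k - 1]] <= alpha / (m - k + 1):
            n_reject = k
            break
    reject = [False] * m
    for i in order[:n_reject]:
        reject[i] = True
    return reject


decisions = hochberg_reject([0.03, 0.04], alpha=0.05)
# decisions == [True, True]
```

With p-values 0.03 and 0.04, the largest p-value satisfies $0.04 \leq 0.05/1$, so both hypotheses are rejected. A step-down comparison would stop immediately, since the smallest p-value 0.03 exceeds $0.05/2 = 0.025$, and would reject nothing.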

Dunnett's correction

Main article: Dunnett's test

Charles Dunnett (1955, 1966) described an alternative alpha error adjustment when k groups are compared to the same control group. Now known as Dunnett's test, this method is less conservative than the Bonferroni adjustment.^{[citation needed]}

Scheffé's method

Main article: Scheffé's method


Resampling procedures

The procedures of Bonferroni and Holm control the FWER under any dependence structure of the p-values (or equivalently the individual test statistics). Essentially, this is achieved by accommodating a "worst-case" dependence structure (which is close to independence for most practical purposes). But such an approach is conservative if dependence is actually positive. To give an extreme example, under perfect positive dependence, there is effectively only one test and thus, the FWER is uninflated.

Accounting for the dependence structure of the p-values (or of the individual test statistics) produces more powerful procedures. This can be achieved by applying resampling methods, such as bootstrapping and permutation methods. The procedure of Westfall and Young (1993) requires a certain condition that does not always hold in practice (namely, subset pivotality).^[10] The procedures of Romano and Wolf (2005a,b) dispense with this condition and are thus more generally valid.^[11]^[12]
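To give a flavour of how resampling uses the dependence structure, the following Python sketch computes single-step max-statistic adjusted p-values by permuting group labels in a two-group comparison of $m$ variables. It is a simplified illustration in the spirit of permutation-based adjustments, not the stepwise procedures of Romano and Wolf. All names are hypothetical, the test statistic is an unstandardized absolute mean difference chosen for brevity (a studentized statistic would usually be preferable), and the validity of the adjustment rests on conditions such as subset pivotality.

```python
import random
from typing import List, Sequence


def abs_mean_diffs(rows: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """|mean(group 1) - mean(group 0)| for each of the m variables (columns)."""
    m = len(rows[0])
    stats = []
    for j in range(m):
        g0 = [row[j] for row, lab in zip(rows, labels) if lab == 0]
        g1 = [row[j] for row, lab in zip(rows, labels) if lab == 1]
        stats.append(abs(sum(g1) / len(g1) - sum(g0) / len(g0)))
    return stats


def max_stat_adjusted_p(rows: Sequence[Sequence[float]], labels: Sequence[int],
                        n_perm: int = 2000, seed: int = 0) -> List[float]:
    """Single-step adjusted p-values from the permutation distribution of the maximum."""
    rng = random.Random(seed)
    observed = abs_mean_diffs(rows, labels)
    exceed = [0] * len(observed)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # random relabelling keeps whole rows intact
        max_stat = max(abs_mean_diffs(rows, perm))
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # Add-one correction: the observed labelling counts as one permutation.
    return [(c + 1) / (n_perm + 1) for c in exceed]


# Made-up data: variable 0 differs strongly between groups, variable 1 does not.
rows = [(0.1, 5.0), (0.2, 5.3), (0.0, 4.8), (0.3, 5.1), (-0.1, 4.9), (0.15, 5.2),
        (3.1, 5.1), (2.9, 4.9), (3.2, 5.2), (3.0, 5.0), (2.8, 4.7), (3.3, 5.3)]
labels = [0] * 6 + [1] * 6
adjusted = max_stat_adjusted_p(rows, labels)
```

Because each permutation relabels whole rows, the correlation between the variables is preserved in every resample, which is how the dependence structure enters the adjustment.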

Harmonic mean p-value procedure

Main article: Harmonic mean p-value

The harmonic mean p-value (HMP) procedure^[13]^[14] provides a multilevel test that improves on the power of Bonferroni correction by assessing the significance of groups of hypotheses while controlling the strong-sense family-wise error rate. The significance of any subset $\mathcal{R}$ of the $m$ tests is assessed by calculating the HMP for the subset,

$\overset{\circ}{p}_{\mathcal{R}} = \frac{\sum_{i\in\mathcal{R}} w_{i}}{\sum_{i\in\mathcal{R}} w_{i}/p_{i}},$

where $w_{1},\dots,w_{m}$ are weights that sum to one (i.e. $\sum_{i=1}^{m} w_{i}=1$). An approximate procedure that controls the strong-sense family-wise error rate at level approximately $\alpha$ rejects the null hypothesis that none of the p-values in subset $\mathcal{R}$ are significant when $\overset{\circ}{p}_{\mathcal{R}} \leq \alpha\, w_{\mathcal{R}}$^[15] (where $w_{\mathcal{R}}=\sum_{i\in\mathcal{R}} w_{i}$). This approximation is reasonable for small $\alpha$ (e.g. $\alpha<0.05$) and becomes arbitrarily good as $\alpha$ approaches zero. An asymptotically exact test is also available (see main article).
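The approximate rule can be sketched in Python as follows. The function names are hypothetical, the p-values are made up, and the code assumes strictly positive p-values and weights that sum to one over all $m$ tests. It implements only the approximate threshold described above, not the asymptotically exact test.

```python
from typing import Iterable, Sequence, Tuple


def harmonic_mean_p(p_values: Sequence[float], weights: Sequence[float],
                    subset: Iterable[int]) -> Tuple[float, float]:
    """Return (HMP of the subset, total weight w_R of the subset)."""
    idx = list(subset)
    w_r = sum(weights[i] for i in idx)
    # Weighted harmonic mean; requires every p-value in the subset to be > 0.
    hmp = w_r / sum(weights[i] / p_values[i] for i in idx)
    return hmp, w_r


def hmp_reject(p_values: Sequence[float], weights: Sequence[float],
               subset: Iterable[int], alpha: float = 0.05) -> bool:
    """Approximate test: reject when HMP <= alpha * w_R."""
    hmp, w_r = harmonic_mean_p(p_values, weights, subset)
    return hmp <= alpha * w_r


p = [0.01, 0.02, 0.5, 0.8]
w = [0.25, 0.25, 0.25, 0.25]           # equal weights summing to one
first_pair = hmp_reject(p, w, [0, 1])  # HMP = 0.5 / 37.5, threshold 0.025
last_pair = hmp_reject(p, w, [2, 3])   # HMP = 0.5 / 0.8125, threshold 0.025
```

In this example the subset containing the two small p-values is rejected and the subset containing the two large ones is not.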

Alternative approaches

Further information: False discovery rate

FWER control exerts a more stringent control over false discovery compared to false discovery rate (FDR) procedures. FWER control limits the probability of at least one false discovery, whereas FDR control limits (in a loose sense) the expected proportion of false discoveries. Thus, FDR procedures have greater power at the cost of increased rates of type I errors, i.e., rejecting null hypotheses that are actually true.^[16]

On the other hand, FWER control is less stringent than per-family error rate control, which limits the expected number of errors per family. Because FWER control is concerned with at least one false discovery, unlike per-family error rate control it does not treat multiple simultaneous false discoveries as any worse than one false discovery. The Bonferroni correction is often considered as merely controlling the FWER, but in fact also controls the per-family error rate.^[17]
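The last claim follows from a short argument, sketched here under the assumption (not stated above) that each p-value is valid, i.e. $\Pr(p_{i}\leq t)\leq t$ for every true null hypothesis $H_{i}$ and every $t\in[0,1]$. Writing $I_{0}$ (a label introduced here) for the index set of the $m_{0}$ true null hypotheses, the Bonferroni procedure gives

$\mathrm{E}[V] = \sum_{i\in I_{0}} \Pr\left(p_{i}\leq \frac{\alpha}{m}\right) \leq m_{0}\,\frac{\alpha}{m} \leq \alpha,$

and, by Markov's inequality applied to the non-negative integer-valued variable $V$,

$\mathrm{FWER} = \Pr(V\geq 1) \leq \mathrm{E}[V] \leq \alpha.$

So a bound on the per-family error rate $\mathrm{E}[V]$ implies the same bound on the FWER, but not conversely.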