In science, randomized experiments are the experiments that allow the greatest reliability and validity of statistical estimates of treatment effects. Randomization-based inference is especially important in experimental design and in survey sampling.

Overview

In the statistical theory of design of experiments, randomization involves randomly allocating the experimental units across the treatment groups. For example, if an experiment compares a new drug against a standard drug, then the patients should be allocated to either the new drug or to the standard drug control using randomization.
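As an illustration of random allocation, the following is a minimal sketch of a completely randomized two-arm assignment. The function name `completely_randomize`, the patient labels, and the arm names are hypothetical and chosen only for this example; a real trial would use a validated allocation procedure with concealment.

```python
import random


def completely_randomize(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatment groups in turn.

    Group sizes differ by at most one unit. The seed is only there to make
    the example reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups


# Hypothetical example: 20 patients, two arms.
patients = [f"patient_{i:02d}" for i in range(1, 21)]
allocation = completely_randomize(patients, ["new_drug", "standard_drug"], seed=42)
for arm, members in allocation.items():
    print(arm, len(members))
```

Every patient has the same chance of ending up in either arm, and neither the investigator nor the patient influences the assignment.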

Randomized experimentation is not haphazard. Randomization reduces bias by tending to balance, across the treatment groups, other factors that have not been explicitly accounted for in the experimental design; by the law of large numbers, this balance improves as the number of experimental units grows. Randomization also produces ignorable designs, which are valuable in model-based statistical inference, especially Bayesian or likelihood-based inference. In the design of experiments, the simplest design for comparing treatments is the "completely randomized design". Some "restriction on randomization" can occur with blocking and with experiments that have hard-to-change factors; additional restrictions on randomization can occur when a full randomization is infeasible or when it is desirable to reduce the variance of estimators of selected effects.
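One common restriction on randomization is blocking, in which units are randomized separately within each block. The sketch below assumes a hypothetical setting with two clinics as blocks; the function `block_randomize`, the clinic labels, and the arm names are illustrative and not taken from any particular study.

```python
import random
from collections import defaultdict


def block_randomize(units, block_of, treatments, seed=None):
    """Randomize units to treatments separately within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)

    assignment = {}
    for label in sorted(blocks):        # fixed block order for reproducibility
        members = blocks[label]
        rng.shuffle(members)            # random order within the block
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment


# Hypothetical example: 16 units, 8 in each of two clinics.
units = [(i, "clinic_A" if i < 8 else "clinic_B") for i in range(16)]
assignment = block_randomize(units, lambda u: u[1], ["treatment", "control"], seed=7)
for unit in units:
    print(unit, assignment[unit])
```

Because each clinic contributes equally to both arms, differences between clinics cannot be mistaken for a treatment effect.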

Randomization of treatment in clinical trials poses ethical problems. In some cases, randomization reduces the therapeutic options for both physician and patient, and so randomization requires clinical equipoise regarding the treatments.

Online randomized controlled experiments

Web sites can run randomized controlled experiments^[2] to create a feedback loop.^[3] Key differences between offline experimentation and online experiments include:^[3]^[4]

Logging: user interactions can be logged reliably.
Number of users: large sites, such as Amazon, Bing/Microsoft, and Google, run experiments, each with over a million users.
Number of concurrent experiments: large sites run tens of overlapping, or concurrent, experiments.^[5]
Robots: traffic includes web crawlers from valid sources as well as malicious internet bots.
Ramp-up: experiments can be ramped up from low percentages to higher percentages.
Speed / performance: has a significant impact on key metrics.^[3]^[6]
Pre-experiment period: can be used as an A/A test to reduce variance.^[7]
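The variance-reduction point can be sketched with a control-variate adjustment that uses each user's pre-experiment metric as a covariate, in the spirit of the approach described in the cited reference on pre-experiment data.^[7] The code below is a minimal illustration on simulated data, not that paper's implementation; the function name `cuped_adjust` and all numeric values are assumptions made for the example.

```python
import random
import statistics


def cuped_adjust(y, x):
    """Return y adjusted by the pre-experiment covariate x.

    y_adj = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x).
    The adjustment keeps the mean of y but removes the part of its
    variance that x explains.
    """
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


# Simulated users: the in-experiment metric is correlated with the
# pre-experiment metric (all numbers are made up for illustration).
rng = random.Random(0)
pre = [rng.gauss(10.0, 3.0) for _ in range(2000)]
post = [0.8 * p + rng.gauss(2.0, 1.0) for p in pre]

adjusted = cuped_adjust(post, pre)
print("variance before:", statistics.variance(post))
print("variance after: ", statistics.variance(adjusted))
```

In an actual experiment the coefficient would typically be estimated once on the pooled treatment and control data and the adjusted metric then compared between arms; since the pre-experiment covariate cannot be affected by the treatment, the adjustment does not bias the comparison.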

History

A controlled experiment appears to have been suggested in the Old Testament's Book of Daniel. King Nebuchadnezzar proposed that some Israelites eat "a daily amount of food and wine from the king's table." Daniel preferred a vegetarian diet, but the official was concerned that the king would "see you looking worse than the other young men your age? The king would then have my head because of you." Daniel then proposed the following controlled experiment: "Test your servants for ten days. Give us nothing but vegetables to eat and water to drink. Then compare our appearance with that of the young men who eat the royal food, and treat your servants in accordance with what you see" (Daniel 1, 12–13).^[8]^[9]

Randomized experiments were institutionalized in psychology and education in the late 1800s, following the invention of randomized experiments by C. S. Peirce.^[10]^[11]^[12]^[13] Outside of psychology and education, randomized experiments were popularized by R. A. Fisher in his book Statistical Methods for Research Workers, which also introduced additional principles of experimental design.

Statistical interpretation

The Rubin Causal Model provides a common way to describe a randomized experiment. The model assumes that there are two potential outcomes for each unit in the study: the outcome if the unit receives the treatment and the outcome if the unit does not receive the treatment. The difference between these two potential outcomes is known as the treatment effect, which is the causal effect of the treatment on the outcome. The model also accounts for potential confounding factors, which are factors that could affect both the treatment and the outcome. By controlling for these confounding factors, the model helps to ensure that any observed treatment effect is truly causal and not simply the result of other factors that are correlated with both the treatment and the outcome. While the Rubin Causal Model provides a framework for defining the causal parameters (i.e., the effects of a randomized treatment on an outcome), the analysis of experiments can take a number of forms. Most commonly, randomized experiments are analyzed using ANOVA, Student's t-test, regression analysis, or a similar statistical test.

The Rubin Causal Model is a useful framework for understanding how to estimate the causal effect of the treatment, even when there are confounding variables that may affect the outcome. The model specifies that the causal effect of the treatment is the difference between the outcomes that would have been observed for each individual with the treatment and without it. In practice, it is not possible to observe both potential outcomes for the same individual, so statistical methods are used to estimate the causal effect using data from the experiment.
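The potential-outcomes idea can be made concrete with a small simulation. In the sketch below both potential outcomes are generated for every unit, which is possible only in a simulation, and then only one of them is "observed" according to a random assignment. All names and numeric values are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)
n = 10_000

# Potential outcomes for each unit: y0 without treatment, y1 with treatment.
# The unit-level effect y1 - y0 averages about 2.0 (an arbitrary choice).
y0 = [rng.gauss(50.0, 10.0) for _ in range(n)]
y1 = [v + 2.0 + rng.gauss(0.0, 1.0) for v in y0]
true_ate = statistics.fmean([a - b for a, b in zip(y1, y0)])

# Random assignment: half of the units receive the treatment.
treated = set(rng.sample(range(n), n // 2))

# Only one potential outcome per unit is observed.
observed_treated = [y1[i] for i in range(n) if i in treated]
observed_control = [y0[i] for i in range(n) if i not in treated]

# Difference in observed group means as an estimate of the average effect.
estimate = statistics.fmean(observed_treated) - statistics.fmean(observed_control)

print("true average treatment effect:", round(true_ate, 3))
print("difference-in-means estimate: ", round(estimate, 3))
```

Because assignment is random, the treated and control groups are comparable on average, so the difference in observed means estimates the average of the unobservable unit-level effects.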

Empirical evidence that randomization makes a difference

Empirically, differences between randomized and non-randomized studies,^[14] and between adequately and inadequately randomized trials,^[15]^[16] have been difficult to detect.

References

^ Schulz KF, Altman DG, Moher D; for the CONSORT Group (2010). "CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials". BMJ. 340: c332. doi:10.1136/bmj.c332. PMC 2844940. PMID 20332509.
^ Kohavi, Ron; Longbotham, Roger (2015). "Online Controlled Experiments and A/B Tests" (PDF). In Sammut, Claude; Webb, Geoff (eds.). Encyclopedia of Machine Learning and Data Mining. Springer. pp. to appear.
^ Kohavi, Ron; Longbotham, Roger; Sommerfield, Dan; Henne, Randal M. (2009). "Controlled experiments on the web: survey and practical guide". Data Mining and Knowledge Discovery. 18 (1): 140–181. doi:10.1007/s10618-008-0114-1. ISSN 1384-5810.
^ Kohavi, Ron; Deng, Alex; Frasca, Brian; Longbotham, Roger; Walker, Toby; Xu, Ya (2012). "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained". Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
^ Kohavi, Ron; Deng, Alex; Frasca, Brian; Walker, Toby; Xu, Ya; Pohlmann, Nils (2013). "Online controlled experiments at large scale". Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. Vol. 19. Chicago, Illinois, USA: ACM. pp. 1168–1176. doi:10.1145/2487575.2488217. ISBN 9781450321747. S2CID 13224883.
^ Kohavi, Ron; Deng, Alex; Longbotham, Roger; Xu, Ya (2014). "Seven rules of thumb for web site experimenters". Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. Vol. 20. New York, New York, USA: ACM. pp. 1857–1866. doi:10.1145/2623330.2623341. ISBN 9781450329569. S2CID 207214362.
^ Deng, Alex; Xu, Ya; Kohavi, Ron; Walker, Toby (2013). "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data". WSDM 2013: Sixth ACM International Conference on Web Search and Data Mining.
^ Neuhauser, D; Diaz, M (2004). "Daniel: using the Bible to teach quality improvement methods". Quality and Safety in Health Care. 13 (2): 153–155. doi:10.1136/qshc.2003.009480. PMC 1743807. PMID 15069225.
^ Angrist, Joshua; Pischke, Jörn-Steffen (2014). Mastering 'Metrics: The Path from Cause to Effect. Princeton University Press. p. 31.
^ Charles Sanders Peirce and Joseph Jastrow (1885). "On Small Differences in Sensation". Memoirs of the National Academy of Sciences. 3: 73–83. http://psychclassics.yorku.ca/Peirce/small-diffs.htm
^ Hacking, Ian (September 1988). "Telepathy: Origins of Randomization in Experimental Design". Isis. 79 (3): 427–451. doi:10.1086/354775. JSTOR 234674. MR 1013489. S2CID 52201011.
^ Stephen M. Stigler (November 1992). "A Historical View of Statistical Concepts in Psychology and Educational Research". American Journal of Education. 101 (1): 60–70. doi:10.1086/444032. S2CID 143685203.
^ Trudy Dehue (December 1997). "Deception, Efficiency, and Random Groups: Psychology and the Gradual Origination of the Random Group Design" (PDF). Isis. 88 (4): 653–673. doi:10.1086/383850. PMID 9519574. S2CID 23526321.
^ Anglemyer A, Horvath HT, Bero L (April 2014). "Healthcare outcomes assessed with observational study designs compared with those assessed in randomized trials". Cochrane Database Syst Rev. 2014 (4): MR000034. doi:10.1002/14651858.MR000034.pub2. PMC 8191367. PMID 24782322.
^ Odgaard-Jensen J, Vist G, et al. (April 2011). "Randomisation to protect against selection bias in healthcare trials". Cochrane Database Syst Rev. 2015 (4): MR000012. doi:10.1002/14651858.MR000012.pub3. PMC 7150228. PMID 21491415.
^ Howick J, Mebius A (2014). "In search of justification for the unpredictability paradox". Trials. 15: 480. doi:10.1186/1745-6215-15-480. PMC 4295227. PMID 25490908.

Caliński, Tadeusz & Kageyama, Sanpei (2000). Block designs: A Randomization approach, Volume I: Analysis. Lecture Notes in Statistics. Vol. 150. New York: Springer-Verlag. ISBN 978-0-387-98578-7.
Caliński, Tadeusz & Kageyama, Sanpei (2003). Block designs: A Randomization approach, Volume II: Design. Lecture Notes in Statistics. Vol. 170. New York: Springer-Verlag. ISBN 978-0-387-95470-7.
Hinkelmann, Klaus; Kempthorne, Oscar (2008). Design and Analysis of Experiments, Volume I: Introduction to Experimental Design (Second ed.). Wiley. ISBN 978-0-471-72756-9. MR 2363107.
Kempthorne, Oscar (1992). "Intervention experiments, randomization and inference". In Malay Ghosh and Pramod K. Pathak (ed.). Current Issues in Statistical Inference—Essays in Honor of D. Basu. Institute of Mathematical Statistics Lecture Notes - Monograph Series. Hayward, CA: Institute for Mathematical Statistics. pp. 13–31. doi:10.1214/lnms/1215458836. ISBN 978-0-940600-24-9. MR 1194407.
