Chemometrics is the science of extracting information from chemical systems by data-driven means. Chemometrics is inherently interdisciplinary, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to address problems in chemistry, biochemistry, medicine, biology and chemical engineering. In this way, it mirrors other interdisciplinary fields, such as psychometrics and econometrics.

Background

Chemometrics is applied to solve both descriptive and predictive problems in experimental natural sciences, especially in chemistry. In descriptive applications, properties of chemical systems are modeled with the intent of learning the underlying relationships and structure of the system (i.e., model understanding and identification). In predictive applications, properties of chemical systems are modeled with the intent of predicting new properties or behavior of interest. In both cases, the datasets can be small but are often large and complex, involving hundreds to thousands of variables, and hundreds to thousands of cases or observations.

Chemometric techniques are particularly heavily used in analytical chemistry and metabolomics, and the development of improved chemometric methods of analysis also continues to advance the state of the art in analytical instrumentation and methodology. Chemometrics is an application-driven discipline: while the standard chemometric methodologies are very widely used industrially, academic groups remain dedicated to the continued development of chemometric theory, methods and applications.

Origins

Although one could argue that even the earliest analytical experiments in chemistry involved a form of chemometrics, the field is generally recognized to have emerged in the 1970s as computers became increasingly exploited for scientific investigation. The term 'chemometrics' was coined by Svante Wold in a 1971 grant application,^[1] and the International Chemometrics Society was formed shortly thereafter by Svante Wold and Bruce Kowalski, two pioneers in the field. Wold was a professor of organic chemistry at Umeå University, Sweden, and Kowalski was a professor of analytical chemistry at the University of Washington, Seattle.^[2]

Many early applications involved multivariate classification; numerous quantitative predictive applications followed, and by the late 1970s and early 1980s a wide variety of data- and computer-driven chemical analyses were being carried out.

Multivariate analysis was a critical facet even in the earliest applications of chemometrics. Data from infrared and UV/visible spectroscopy often comprise thousands of measurements per sample. Mass spectrometry, nuclear magnetic resonance, atomic emission/absorption and chromatography experiments are also all by nature highly multivariate. The structure of these data was found to be conducive to using techniques such as principal components analysis (PCA), partial least-squares (PLS), orthogonal partial least-squares (OPLS), and two-way orthogonal partial least squares (O2PLS).^[3] This is primarily because, while the datasets may be highly multivariate, there is strong and often linear low-rank structure present. PCA and PLS have proven over time to be very effective at empirically modeling the more chemically interesting low-rank structure, exploiting the interrelationships or 'latent variables' in the data, and providing alternative compact coordinate systems for further numerical analysis such as regression, clustering, and pattern recognition. Partial least squares in particular was heavily used in chemometric applications for many years before it began to find regular use in other fields.
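The low-rank structure described above can be illustrated with a minimal sketch. It assumes NumPy and an entirely simulated data set (two hypothetical Gaussian-shaped pure-component spectra mixed at random concentrations); PCA is computed through a singular value decomposition of the mean-centered data, which is one common way of doing it rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples x 300 wavelengths built from 2 pure-component spectra.
wavelengths = np.linspace(0.0, 1.0, 300)
pure_spectra = np.vstack([
    np.exp(-((wavelengths - 0.3) ** 2) / 0.005),
    np.exp(-((wavelengths - 0.6) ** 2) / 0.010),
])                                                   # shape (2, 300)
concentrations = rng.uniform(0.0, 1.0, size=(50, 2))
X = concentrations @ pure_spectra + 0.001 * rng.standard_normal((50, 300))

# PCA via SVD of the mean-centered data matrix.
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)                  # variance fraction per component

n_components = 2
scores = U[:, :n_components] * s[:n_components]      # sample coordinates, shape (50, 2)
loadings = Vt[:n_components]                         # latent directions, shape (2, 300)

print(np.round(explained[:4], 4))
```

Although each simulated spectrum has 300 variables, almost all of the variance falls on the first two components, so the two-column score matrix can stand in for the full spectra in later regression or clustering steps.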

Through the 1980s three dedicated journals appeared in the field: Journal of Chemometrics, Chemometrics and Intelligent Laboratory Systems, and Journal of Chemical Information and Modeling. These journals continue to cover both fundamental and methodological research in chemometrics. At present, routine applications of existing chemometric methods are mostly published in application-oriented journals (e.g., Applied Spectroscopy, Analytical Chemistry, Analytica Chimica Acta, Talanta). Several important books/monographs on chemometrics were also first published in the 1980s, including the first edition of Malinowski's Factor Analysis in Chemistry,^[4] Sharaf, Illman and Kowalski's Chemometrics,^[5] Massart et al.'s Chemometrics: a textbook,^[6] and Multivariate Calibration by Martens and Naes.^[7]

Some large chemometric application areas have gone on to represent new domains, such as molecular modeling and QSAR, cheminformatics, the '-omics' fields of genomics, proteomics, metabonomics and metabolomics, process modeling and process analytical technology.

An account of the early history of chemometrics was published as a series of interviews by Geladi and Esbensen.^[8]^[9]

Techniques

Multivariate calibration

Many chemical problems and applications of chemometrics involve calibration. The objective is to develop models which can be used to predict properties of interest based on measured properties of the chemical system, such as pressure, flow, temperature, or infrared, Raman,^[10] NMR and mass spectra. Examples include the development of multivariate models relating 1) multi-wavelength spectral response to analyte concentration, 2) molecular descriptors to biological activity, and 3) multivariate process conditions/states to final product attributes. The process requires a calibration or training data set, which includes reference values for the properties of interest for prediction, and the measured attributes believed to correspond to these properties. For case 1), for example, one can assemble data from a number of samples, including concentrations for an analyte of interest for each sample (the reference) and the corresponding infrared spectrum of that sample. Multivariate calibration techniques such as partial least-squares regression or principal component regression (among many other methods) are then used to construct a mathematical model that relates the multivariate response (spectrum) to the concentration of the analyte of interest, and such a model can be used to efficiently predict the concentrations of new samples.
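As a rough illustration of case 1), the sketch below builds a principal component regression model on simulated spectra and applies it to new samples. Everything in it is an assumption made for the example: the band shapes, the noise level, the helper names `simulate`, `fit_pcr` and `predict_pcr`, and the choice of two components.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical system: an analyte band overlapped by an interferent band.
channels = np.linspace(0.0, 1.0, 200)
analyte = np.exp(-((channels - 0.4) ** 2) / 0.01)
interferent = np.exp(-((channels - 0.5) ** 2) / 0.02)

def simulate(n):
    """Return n noisy mixture spectra and the analyte reference concentrations."""
    c = rng.uniform(0.0, 1.0, size=(n, 2))           # columns: analyte, interferent
    spectra = c @ np.vstack([analyte, interferent])
    spectra += 0.002 * rng.standard_normal(spectra.shape)
    return spectra, c[:, 0]

X_cal, y_cal = simulate(40)      # calibration (training) set with reference values
X_new, y_new = simulate(20)      # new samples; y_new is used only to check predictions

def fit_pcr(X, y, n_components):
    """Principal component regression: regress y on the leading PCA scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    _, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    V = Vt[:n_components].T                          # loadings, channels x k
    scores = (X - x_mean) @ V
    q, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    return x_mean, y_mean, V @ q                     # regression vector in channel space

def predict_pcr(model, X):
    x_mean, y_mean, b = model
    return (X - x_mean) @ b + y_mean

model = fit_pcr(X_cal, y_cal, n_components=2)
rmsep = np.sqrt(np.mean((predict_pcr(model, X_new) - y_new) ** 2))
print(f"RMSEP: {rmsep:.4f}")
```

Only the analyte concentration is supplied as a reference value; the interferent is never quantified, yet the model can still predict the analyte because the interferent's variation is captured in the scores.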

Techniques in multivariate calibration are often broadly categorized as classical or inverse methods.^[7]^[11] The principal difference between these approaches is that in classical calibration the models are solved such that they are optimal in describing the measured analytical responses (e.g., spectra) and can therefore be considered optimal descriptors, whereas in inverse methods the models are solved to be optimal in predicting the properties of interest (e.g., concentrations) and can therefore be considered optimal predictors.^[12] Inverse methods usually require less physical knowledge of the chemical system, and at least in theory provide superior predictions in the mean-squared error sense,^[13]^[14]^[15] and hence inverse approaches tend to be more frequently applied in contemporary multivariate calibration.
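The distinction is easiest to see in the single-channel case. The following sketch uses only the Python standard library and a made-up linear response; the helper `fit_line` and the numerical values are assumptions for illustration, not taken from the cited works.

```python
import random

random.seed(0)

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical calibration data: response = 0.1 + 2.0 * concentration + noise.
conc = [i / 10 for i in range(11)]
resp = [0.1 + 2.0 * c + random.gauss(0.0, 0.05) for c in conc]

# Classical: describe the response as a function of concentration, then invert.
a_cls, b_cls = fit_line(conc, resp)

def predict_classical(r):
    return (r - a_cls) / b_cls

# Inverse: regress concentration directly on the response.
a_inv, b_inv = fit_line(resp, conc)

def predict_inverse(r):
    return a_inv + b_inv * r

r_new = 0.1 + 2.0 * 0.37     # noise-free response of a sample at concentration 0.37
print(predict_classical(r_new), predict_inverse(r_new))
```

The two predictions are close but not identical: the product of the two fitted slopes equals the squared correlation coefficient, so the inverse model is slightly shrunk toward the mean of the calibration concentrations.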

The main advantage of multivariate calibration techniques is that fast, cheap, or non-destructive analytical measurements (such as optical spectroscopy) can be used to estimate sample properties which would otherwise require time-consuming, expensive or destructive testing (such as LC-MS). Equally important is that multivariate calibration allows for accurate quantitative analysis in the presence of heavy interference by other analytes. The selectivity of the analytical method is provided as much by the mathematical calibration as by the analytical measurement modalities. For example, near-infrared spectra, which are extremely broad and non-selective compared to other analytical techniques (such as infrared or Raman spectra), can often be used successfully in conjunction with carefully developed multivariate calibration methods to predict concentrations of analytes in very complex matrices.

Classification, pattern recognition, clustering

Supervised multivariate classification techniques are closely related to multivariate calibration techniques in that a calibration or training set is used to develop a mathematical model capable of classifying future samples. The techniques employed in chemometrics are similar to those used in other fields – multivariate discriminant analysis, logistic regression, neural networks, regression/classification trees. The use of rank reduction techniques in conjunction with these conventional classification methods is routine in chemometrics, for example discriminant analysis on principal components or partial least squares scores.
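A minimal sketch of rank reduction followed by classification is given below. It assumes NumPy and simulated two-class signals, and for brevity it substitutes a nearest-class-centroid rule on the principal component scores for a full discriminant analysis; the band shapes and the `simulate` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical signals: a common band plus a marker band whose level depends on class.
channels = np.linspace(0.0, 1.0, 100)
base = np.exp(-((channels - 0.5) ** 2) / 0.02)
marker = np.exp(-((channels - 0.7) ** 2) / 0.002)

def simulate(n, label):
    level = rng.normal(0.2 + 0.6 * label, 0.05, size=(n, 1))
    return base + level * marker + 0.02 * rng.standard_normal((n, 100))

X_train = np.vstack([simulate(30, 0), simulate(30, 1)])
y_train = np.repeat([0, 1], 30)
X_test = np.vstack([simulate(10, 0), simulate(10, 1)])
y_test = np.repeat([0, 1], 10)

# Rank reduction: project onto the first k principal components of the training set.
k = 2
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
P = Vt[:k].T
T_train = (X_train - mean) @ P
T_test = (X_test - mean) @ P

# Classification on the scores: assign each test sample to the nearest class centroid.
centroids = np.vstack([T_train[y_train == c].mean(axis=0) for c in (0, 1)])
dist = np.linalg.norm(T_test[:, None, :] - centroids[None, :, :], axis=2)
y_pred = dist.argmin(axis=1)

accuracy = np.mean(y_pred == y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The projection matrix and the mean are estimated from the training set only and then reused for the test samples, which mirrors how a calibration set is used to classify future samples.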

A family of techniques, referred to as class-modelling or one-class classifiers, is able to build models for an individual class of interest.^[16] Such methods are particularly useful in the case of quality control and authenticity verification of products.

Unsupervised classification (also termed cluster analysis) is also commonly used to discover patterns in complex data sets, and again many of the core techniques used in chemometrics are common to other fields such as machine learning and statistical learning.

Multivariate curve resolution

In chemometric parlance, multivariate curve resolution seeks to deconstruct data sets with limited or absent reference information and system knowledge. Some of the earliest work on these techniques was done by Lawton and Sylvestre in the early 1970s.^[17]^[18] These approaches are also called self-modeling mixture analysis, blind source/signal separation, and spectral unmixing. For example, from a data set comprising fluorescence spectra from a series of samples each containing multiple fluorophores, multivariate curve resolution methods can be used to extract the fluorescence spectra of the individual fluorophores, along with their relative concentrations in each of the samples, essentially unmixing the total fluorescence spectrum into the contributions from the individual components. The problem is usually ill-determined due to rotational ambiguity (many possible solutions can equivalently represent the measured data), so the application of additional constraints is common, such as non-negativity, unimodality, or known interrelationships between the individual components (e.g., kinetic or mass-balance constraints).^[19]^[20]
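The unmixing idea can be sketched with a bare-bones alternating least squares loop under non-negativity constraints. This is an assumption-laden toy, not a reference implementation: the data are simulated, the function name `mcr_als` is hypothetical, non-negativity is imposed by simply clipping each least-squares update (a stand-in for a proper non-negative least-squares solver), and the random starting point means that results can depend on the seed.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical mixture data: 30 samples x 150 channels from 2 non-negative components.
channels = np.linspace(0.0, 1.0, 150)
S_true = np.vstack([
    np.exp(-((channels - 0.35) ** 2) / 0.004),
    np.exp(-((channels - 0.65) ** 2) / 0.004),
])
C_true = rng.uniform(0.0, 1.0, size=(30, 2))
D = C_true @ S_true + 0.001 * rng.standard_normal((30, 150))

def mcr_als(D, n_components, rng, n_iter=200):
    """Alternating least squares for D ~ C @ S with C >= 0 and S >= 0."""
    S = np.abs(rng.standard_normal((n_components, D.shape[1])))   # random start
    for _ in range(n_iter):
        C = np.clip(D @ np.linalg.pinv(S), 0.0, None)             # update concentrations
        S = np.clip(np.linalg.pinv(C) @ D, 0.0, None)             # update spectra
        # Normalize each spectrum to unit length to remove the scale ambiguity.
        S /= np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1e-12)
    C = np.clip(D @ np.linalg.pinv(S), 0.0, None)
    return C, S

C_est, S_est = mcr_als(D, n_components=2, rng=rng)
lack_of_fit = np.linalg.norm(D - C_est @ S_est) / np.linalg.norm(D)
print(f"relative lack of fit: {lack_of_fit:.4f}")
```

No reference concentrations or pure spectra are passed to the routine; only the number of components and the non-negativity constraint are supplied. The recovered components are defined only up to order and scale, which is a mild form of the ambiguity discussed above.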

Other techniques

Experimental design remains a core area of study in chemometrics and several monographs are specifically devoted to experimental design in chemical applications.^[21]^[22] Sound principles of experimental design have been widely adopted within the chemometrics community, although many complex experiments are purely observational, and there can be little control over the properties and interrelationships of the samples and sample properties.

Signal processing is also a critical component of almost all chemometric applications, particularly the use of signal pretreatments to condition data prior to calibration or classification. The techniques employed commonly in chemometrics are often closely related to those used in related fields.^[23] Signal pre-processing may affect the way in which outcomes of the final data processing can be interpreted.^[24]

Performance characterization and figures of merit

Like most arenas in the physical sciences, chemometrics is quantitatively oriented, so considerable emphasis is placed on performance characterization, model selection, verification and validation, and figures of merit. The performance of quantitative models is usually specified by the root mean squared error in predicting the attribute of interest, and the performance of classifiers as true-positive rate/false-positive rate pairs (or a full ROC curve). A 2006 report by Olivieri et al. provides a comprehensive overview of figures of merit and uncertainty estimation in multivariate calibration, including multivariate definitions of selectivity, sensitivity, SNR and prediction interval estimation.^[25] Chemometric model selection usually involves the use of resampling tools (including bootstrap, permutation and cross-validation).
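The two headline figures of merit can be computed in a few lines. The sketch below uses only the Python standard library; the function names and the small data sets are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def tpr_fpr(labels, scores, threshold):
    """True-positive and false-positive rates at a given decision threshold."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Hypothetical reference vs. predicted concentrations.
y_ref = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(rmse(y_ref, y_hat))

# Hypothetical classifier scores; sweeping the threshold traces out an ROC curve.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
roc = [tpr_fpr(labels, scores, t) for t in (0.0, 0.3, 0.5, 1.0)]
print(roc)
```

Each threshold yields one (true-positive rate, false-positive rate) pair; in practice these figures would be computed on samples held out by cross-validation or a separate test set rather than on the calibration data.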

Multivariate statistical process control (MSPC), modeling and optimization account for a substantial amount of historical chemometric development.^[26]^[27]^[28] Spectroscopy has been used successfully for online monitoring of manufacturing processes for 30–40 years, and these process data are highly amenable to chemometric modeling. Specifically in terms of MSPC, multiway modeling of batch and continuous processes is increasingly common in industry and remains an active area of research in chemometrics and chemical engineering. Process analytical chemistry, as it was originally termed,^[29] or process analytical technology, to use the newer term, continues to draw heavily on chemometric methods and MSPC.

Multiway methods are heavily used in chemometric applications.^[30]^[31] These are higher-order extensions of more widely used methods. For example, while the analysis of a table (matrix, or second-order array) of data is routine in several fields, multiway methods are applied to data sets of third, fourth, or higher order. Data of this type are very common in chemistry; for example, a liquid-chromatography/mass spectrometry (LC-MS) system generates a large matrix of data (elution time versus m/z) for each sample analyzed. The data across multiple samples thus comprise a data cube. Batch process modeling involves data sets that have time vs. process variables vs. batch number. The multiway mathematical methods applied to these sorts of problems include PARAFAC, trilinear decomposition, and multiway PLS and PCA.
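To make the data-cube picture concrete, the sketch below builds a small third-order array with trilinear (PARAFAC-type) structure and unfolds it into a matrix. It assumes NumPy; the dimensions, the random factor matrices `A`, `B` and `C`, and the choice of two components are arbitrary illustrative values, and no decomposition is actually fitted.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical LC-MS-like cube: 5 samples x 40 elution times x 60 m/z channels,
# built so that X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r] with 2 components.
A = rng.uniform(0.1, 1.0, size=(5, 2))     # sample amounts
B = rng.uniform(0.0, 1.0, size=(40, 2))    # elution profiles
C = rng.uniform(0.0, 1.0, size=(60, 2))    # mass spectra
X = np.einsum("ir,jr,kr->ijk", A, B, C)

print(X.shape)                             # (5, 40, 60)

# Unfold along the sample mode so that two-way tools such as PCA or PLS can be applied.
X_unfolded = X.reshape(X.shape[0], -1)
print(X_unfolded.shape)                    # (5, 2400)

# For noise-free trilinear data, every unfolding has rank equal to the number of components.
ranks = [
    int(np.linalg.matrix_rank(np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)))
    for mode in range(3)
]
print(ranks)
```

Unfolding is the simplest route to applying two-way methods to multiway data, whereas methods such as PARAFAC estimate the three factor matrices directly and so retain the separate elution and spectral profiles.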

References

^ As recounted in Wold, S. (1995). "Chemometrics; what do we mean with it, and what do we want from it?". Chemometrics and Intelligent Laboratory Systems. 30 (1): 109–115. doi:10.1016/0169-7439(95)00042-9.
^ Kowalski, Bruce R. (1975). "Chemometrics: Views and Propositions". J. Chem. Inf. Comput. Sci. 15 (4): 201–203. doi:10.1021/ci60004a002.
^ Trygg, J.; Wold, S. (2003). "O2-PLS, a two-block (X–Y) latent variable regression (LVR) method with an integral OSC filter". Journal of Chemometrics. 17: 53–64. doi:10.1002/cem.775. S2CID 123071521.
^ Malinowski, E. R.; Howery, D. G. (1980). Factor Analysis in Chemistry. New York: Wiley. ISBN 978-0471058816. (other editions followed in 1989, 1991 and 2002).
^ Sharaf, M. A.; Illman, D. L.; Kowalski, B. R., eds. (1986). Chemometrics. New York: Wiley. ISBN 978-0471831068.
^ Massart, D. L.; Vandeginste, B. G. M.; Deming, S. M.; Michotte, Y.; Kaufman, L. (1988). Chemometrics: a textbook. Amsterdam: Elsevier. ISBN 978-0444426604.
^ Martens, H.; Naes, T. (1989). Multivariate Calibration. New York: Wiley. ISBN 978-0471909798.
^ Geladi, P.; Esbensen, K. (1990). "The Start and Early History of Chemometrics: Selected Interviews. Part 1". J. Chemometrics. 4 (5): 337–354. doi:10.1002/cem.1180040503. S2CID 120490459.
^ Esbensen, K.; Geladi, P. (1990). "The Start and Early History of Chemometrics: Selected Interviews. Part 2". J. Chemometrics. 4 (6): 389–412. doi:10.1002/cem.1180040604. S2CID 221546473.
^ Barton, Bastian; Thomson, James; Lozano Diz, Enrique; Portela, Raquel (September 2022). "Chemometrics for Raman Spectroscopy Harmonization". Applied Spectroscopy. 76 (9): 1021–1041. Bibcode:2022ApSpe..76.1021B. doi:10.1177/00037028221094070. ISSN 0003-7028. PMID 35622984. S2CID 249129065.
^ Franke, J. (2002). "Inverse Least Squares and Classical Least Squares Methods for Quantitative Vibrational Spectroscopy". In Chalmers, John M (ed.). Handbook of Vibrational Spectroscopy. New York: Wiley. doi:10.1002/0470027320.s4603. ISBN 978-0471988472.
^ Brown, C. D. (2004). "Discordance between Net Analyte Signal Theory and Practical Multivariate Calibration". Analytical Chemistry. 76 (15): 4364–4373. doi:10.1021/ac049953w. PMID 15283574.
^ Krutchkoff, R. G. (1969). "Classical and inverse regression methods of calibration in extrapolation". Technometrics. 11 (3): 11–15. doi:10.1080/00401706.1969.10490714.
^ Hunter, W. G. (1984). "Statistics and chemistry, and the linear calibration problem". In Kowalski, B. R. (ed.). Chemometrics: mathematics and statistics in chemistry. Boston: Reidel. ISBN 978-9027718464.
^ Tellinghuisen, J. (2000). "Inverse vs. classical calibration for small data sets". Fresenius' J. Anal. Chem. 368 (6): 585–588. doi:10.1007/s002160000556. PMID 11228707. S2CID 21166415.
^ Oliveri, Paolo (2017). "Class-modelling in food analytical chemistry: Development, sampling, optimisation and validation issues – A tutorial". Analytica Chimica Acta. 982: 9–19. Bibcode:2017AcAC..982....9O. doi:10.1016/j.aca.2017.05.013. hdl:11567/881059. PMID 28734370. S2CID 10119515.
^ Lawton, W. H.; Sylvestre, E. A. (1971). "Self Modeling Curve Resolution". Technometrics. 13 (3): 617–633. doi:10.1080/00401706.1971.10488823.
^ Sylvestre, E. A.; Lawton, W. H.; Maggio, M. S. (1974). "Curve Resolution Using a Postulated Chemical Reaction". Technometrics. 16 (3): 353–368. doi:10.1080/00401706.1974.10489204.
^ de Juan, A.; Tauler, R. (2003). "Chemometrics Applied to Unravel Multicomponent Processes and Mixtures. Revisiting Latest Trends in Multivariate Resolution". Analytica Chimica Acta. 500 (1–2): 195–210. Bibcode:2003AcAC..500..195D. doi:10.1016/S0003-2670(03)00724-4.
^ de Juan, A.; Tauler, R. (2006). "Multivariate Curve Resolution (MCR) from 2000: Progress in Concepts and Applications". Critical Reviews in Analytical Chemistry. 36 (3–4): 163–176. doi:10.1080/10408340600970005. S2CID 95309963.
^ Deming, S. N.; Morgan, S. L. (1987). Experimental design: a chemometric approach. Elsevier. ISBN 978-0444427342.
^ Bruns, R. E.; Scarminio, I. S.; de Barros Neto, B. (2006). Statistical design – chemometrics. Amsterdam: Elsevier. ISBN 978-0444521811.
^ Wentzell, P. D.; Brown, C. D. (2000). "Signal Processing in Analytical Chemistry". In Meyers, R. A. (ed.). Encyclopedia of Analytical Chemistry. Wiley. pp. 9764–9800.
^ Oliveri, Paolo; Malegori, Cristina; Simonetti, Remo; Casale, Monica (2019). "The impact of signal pre-processing on the final interpretation of analytical outcomes – A tutorial". Analytica Chimica Acta. 1058: 9–17. Bibcode:2019AcAC.1058....9O. doi:10.1016/j.aca.2018.10.055. PMID 30851858. S2CID 73727614.
^ Olivieri, A. C.; Faber, N. M.; Ferre, J.; Boque, R.; Kalivas, J. H.; Mark, H. (2006). "Guidelines for calibration in analytical chemistry Part 3. Uncertainty estimation and figures of merit for multivariate calibration". Pure and Applied Chemistry. 78 (3): 633–650. doi:10.1351/pac200678030633. S2CID 50546210.
^ Illman, D. L.; Callis, J. B.; Kowalski, B. R. (1986). "Process Analytical Chemistry: a new paradigm for analytical chemists". American Laboratory. 18: 8–10.
^ MacGregor, J. F.; Kourti, T. (1995). "Statistical control of multivariate processes". Control Engineering Practice. 3 (3): 403–414. doi:10.1016/0967-0661(95)00014-L.
^ Martin, E. B.; Morris, A. J. (1996). "An overview of multivariate statistical process control in continuous and batch process performance monitoring". Transactions of the Institute of Measurement & Control. 18 (1): 51–60. Bibcode:1996TIMC...18...51M. doi:10.1177/014233129601800107. S2CID 120516715.
^ Hirschfeld, T.; Callis, J. B.; Kowalski, B. R. (1984). "Chemical sensing in process analysis". Science. 226 (4672): 312–318. Bibcode:1984Sci...226..312H. doi:10.1126/science.226.4672.312. PMID 17749872. S2CID 38093353.
^ Smilde, A. K.; Bro, R.; Geladi, P. (2004). Multi-way analysis with applications in the chemical sciences. Wiley.
^ Bro, R.; Workman, J. J.; Mobley, P. R.; Kowalski, B. R. (1997). "Overview of chemometrics applied to spectroscopy: 1985–95, Part 3—Multiway analysis". Applied Spectroscopy Reviews. 32 (3): 237–261. Bibcode:1997ApSRv..32..237B. doi:10.1080/05704929708003315.
