
Statistical view of online learning

In statistical learning models, the training samples $(x_{i},y_{i})$ are assumed to have been drawn from the true distribution $p(x,y)$, and the objective is to minimize the expected "risk"

I[f]=\mathbb {E} [V(f(x),y)]=\int V(f(x),y)\,dp(x,y)\ .

A common paradigm in this situation is to estimate a function $\hat{f}$ through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. A purely online model in this category would learn based on just the new input $(x_{t+1},y_{t+1})$, the current best predictor $f_{t}$ and some extra stored information (which is usually expected to have storage requirements independent of training data size). For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used where $f_{t+1}$ is permitted to depend on $f_{t}$ and all previous data points $(x_{1},y_{1}),\ldots ,(x_{t},y_{t})$. In this case, the space requirements are no longer guaranteed to be constant since it requires storing all previous data points, but the solution may take less time to compute with the addition of a new data point, as compared to batch learning techniques.

A common strategy to overcome the above issues is to learn using mini-batches, which process a small batch of $b\geq 1$ data points at a time; this can be considered pseudo-online learning when $b$ is much smaller than the total number of training points. Mini-batch techniques are used with repeated passes over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example, stochastic gradient descent. When combined with backpropagation, this is currently the de facto method for training artificial neural networks.
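The mini-batch idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `minibatches` and `minibatch_sgd`, the batch size, the constant step size and the noise-free toy data are all hypothetical choices, not part of any particular library.

```python
import random

def minibatches(n, b, rng):
    """Yield index lists of size at most b that cover 0..n-1 in shuffled order."""
    order = list(range(n))
    rng.shuffle(order)
    for start in range(0, n, b):
        yield order[start:start + b]

def minibatch_sgd(xs, ys, b=4, epochs=200, step=0.05, seed=0):
    """Fit y ~ <w, x> by mini-batch gradient steps on the square loss."""
    rng = random.Random(seed)
    d = len(xs[0])
    w = [0.0] * d
    for _ in range(epochs):                      # repeated passes over the data
        for batch in minibatches(len(xs), b, rng):
            grad = [0.0] * d
            for j in batch:
                err = sum(wk * xk for wk, xk in zip(w, xs[j])) - ys[j]
                for k in range(d):
                    # Average of the per-point gradients 2 * x * (<w, x> - y).
                    grad[k] += 2.0 * err * xs[j][k] / len(batch)
            w = [wk - step * gk for wk, gk in zip(w, grad)]
    return w

# Noise-free toy data generated from the hypothetical weights (2, -1).
rng = random.Random(1)
xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(32)]
ys = [2.0 * x[0] - 1.0 * x[1] for x in xs]
w_hat = minibatch_sgd(xs, ys)
```

Only one batch of `b` points is touched per update, so the working memory of the update itself does not grow with the number of training points.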

Example: linear least squares

Main article: Linear least squares (mathematics)

The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example, with other convex loss functions.

Batch learning

Consider the setting of supervised learning with $f$ being a linear function to be learned:

f(x_{j})=\langle w,x_{j}\rangle =w\cdot x_{j}

where $x_{j}\in \mathbb {R} ^{d}$ is a vector of inputs (data points) and $w\in \mathbb {R} ^{d}$ is a linear filter vector. The goal is to compute the filter vector $w$. To this end, a square loss function

V(f(x_{j}),y_{j})=(f(x_{j})-y_{j})^{2}=(\langle w,x_{j}\rangle -y_{j})^{2}

is used to compute the vector $w$ that minimizes the empirical loss

I_{n}[w]=\sum _{j=1}^{n}V(\langle w,x_{j}\rangle ,y_{j})=\sum _{j=1}^{n}(x_{j}^{\mathsf {T}}w-y_{j})^{2}

where $y_{j}\in \mathbb {R}$.

Let $X$ be the $i\times d$ data matrix and $y\in \mathbb {R} ^{i}$ the column vector of target values after the arrival of the first $i$ data points. Assuming that the covariance matrix $\Sigma _{i}=X^{\mathsf {T}}X$ is invertible (otherwise it is preferable to proceed in a similar fashion with Tikhonov regularization), the best solution $f^{*}(x)=\langle w^{*},x\rangle$ to the linear least squares problem is given by

w^{*}=(X^{\mathsf {T}}X)^{-1}X^{\mathsf {T}}y=\Sigma _{i}^{-1}\sum _{j=1}^{i}x_{j}y_{j}.

Now, calculating the covariance matrix $\Sigma _{i}=\sum _{j=1}^{i}x_{j}x_{j}^{\mathsf {T}}$ takes time $O(id^{2})$, inverting the $d\times d$ matrix takes time $O(d^{3})$, while the rest of the multiplication takes time $O(d^{2})$, giving a total time of $O(id^{2}+d^{3})$. When there are $n$ total points in the dataset, to recompute the solution after the arrival of every datapoint $i=1,\ldots ,n$, the naive approach will have a total complexity $O(n^{2}d^{2}+nd^{3})$. Note that if the matrix $\Sigma _{i}$ is stored, then updating it at each step requires only adding $x_{i+1}x_{i+1}^{\mathsf {T}}$, which takes $O(d^{2})$ time, reducing the total time to $O(nd^{2}+nd^{3})=O(nd^{3})$, but with an additional storage space of $O(d^{2})$ to store $\Sigma _{i}$.^[1]
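As a worked step, the naive total follows from summing the per-arrival cost $O(id^{2}+d^{3})$ over all $n$ arrivals (constant factors suppressed):

\sum _{i=1}^{n}\left(id^{2}+d^{3}\right)=d^{2}\,{\frac {n(n+1)}{2}}+nd^{3}=O(n^{2}d^{2}+nd^{3})

With $\Sigma _{i}$ stored, the $id^{2}$ term drops to $d^{2}$ per arrival, so the sum becomes $\sum _{i=1}^{n}\left(d^{2}+d^{3}\right)=O(nd^{3})$.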

Online learning: recursive least squares

The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising $w_{0}=0\in \mathbb {R} ^{d}$ and $\Gamma _{0}=I\in \mathbb {R} ^{d\times d}$, the solution of the linear least squares problem given in the previous section can be computed by the following iteration:

\Gamma _{i}=\Gamma _{i-1}-{\frac {\Gamma _{i-1}x_{i}x_{i}^{\mathsf {T}}\Gamma _{i-1}}{1+x_{i}^{\mathsf {T}}\Gamma _{i-1}x_{i}}}

w_{i}=w_{i-1}-\Gamma _{i}x_{i}\left(x_{i}^{\mathsf {T}}w_{i-1}-y_{i}\right)

The above iteration algorithm can be proved using induction on $i$.^[2] The proof also shows that

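\Gamma _{i}=\left(\Gamma _{0}^{-1}+\Sigma _{i}\right)^{-1}

which coincides with $\Sigma _{i}^{-1}$ up to the contribution of the initialisation $\Gamma _{0}$. One can look at RLS also in the context of adaptive filters (see RLS).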

The complexity for $n$ steps of this algorithm is $O(nd^{2})$, which is a factor of $d$ faster than the corresponding $O(nd^{3})$ batch learning complexity. The storage requirements at every step $i$ here are to store the matrix $\Gamma _{i}$, which is constant at $O(d^{2})$. For the case when $\Sigma _{i}$ is not invertible, consider the regularised version of the problem, with loss function $\sum _{j=1}^{n}\left(x_{j}^{\mathsf {T}}w-y_{j}\right)^{2}+\lambda \left\|w\right\|_{2}^{2}$. Then, one can show that the same algorithm works with $\Gamma _{0}=(\lambda I)^{-1}$, and the iterations proceed to give $\Gamma _{i}=(\Sigma _{i}+\lambda I)^{-1}$.^[1]
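The recursion can be checked numerically with a small pure-Python sketch. It assumes the initialisation $\Gamma _{0}=(\lambda I)^{-1}$ with a hypothetical $\lambda =0.5$, a choice for which the Sherman–Morrison identity gives $\Gamma _{i}=(\Sigma _{i}+\lambda I)^{-1}$; the helper names `matvec` and `rls_step` and the four data points are illustrative only.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rls_step(gamma, w, x, y):
    """One recursive least squares update for the pair (x, y)."""
    d = len(x)
    gx = matvec(gamma, x)                          # Gamma_{i-1} x_i
    denom = 1.0 + sum(a * b for a, b in zip(x, gx))
    # Gamma is symmetric, so Gamma x x^T Gamma equals (Gamma x)(Gamma x)^T.
    gamma = [[gamma[r][c] - gx[r] * gx[c] / denom for c in range(d)]
             for r in range(d)]
    err = sum(a * b for a, b in zip(x, w)) - y     # x_i^T w_{i-1} - y_i
    gx_new = matvec(gamma, x)                      # Gamma_i x_i
    w = [wk - g * err for wk, g in zip(w, gx_new)]
    return gamma, w

lam = 0.5                                          # hypothetical ridge parameter
data = [([1.0, 2.0], 1.0), ([0.5, -1.0], 0.0),
        ([2.0, 0.0], 3.0), ([-1.0, 1.5], -2.0)]
d = 2
gamma = [[1.0 / lam if r == c else 0.0 for c in range(d)] for r in range(d)]
w = [0.0] * d
for x, y in data:
    gamma, w = rls_step(gamma, w, x, y)

# Batch quantities for comparison: A = Sigma_n + lam * I and b = sum_j x_j y_j.
A = [[sum(p[r] * p[c] for p, _ in data) + (lam if r == c else 0.0)
      for c in range(d)] for r in range(d)]
b = [sum(p[r] * t for p, t in data) for r in range(d)]
# The regularised normal equations A w = b should hold up to rounding.
residual = [ai - bi for ai, bi in zip(matvec(A, w), b)]
```

Each call to `rls_step` uses only matrix-vector products and an outer product, which is where the $O(d^{2})$ cost per step comes from; no matrix is ever inverted.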

Stochastic gradient descent

Main article: Stochastic gradient descent

When the update

w_{i}=w_{i-1}-\Gamma _{i}x_{i}\left(x_{i}^{\mathsf {T}}w_{i-1}-y_{i}\right)

is replaced by

w_{i}=w_{i-1}-\gamma _{i}x_{i}\left(x_{i}^{\mathsf {T}}w_{i-1}-y_{i}\right)=w_{i-1}-\gamma _{i}\nabla V(\langle w_{i-1},x_{i}\rangle ,y_{i})

that is, when the matrix $\Gamma _{i}\in \mathbb {R} ^{d\times d}$ is replaced by the scalar $\gamma _{i}\in \mathbb {R}$ (with the constant factor 2 of the gradient absorbed into $\gamma _{i}$), this becomes the stochastic gradient descent algorithm. In this case, the complexity for $n$ steps of this algorithm reduces to $O(nd)$. The storage requirements at every step $i$ are constant at $O(d)$.

However, the step size $\gamma _{i}$ needs to be chosen carefully to solve the expected risk minimization problem, as detailed above. By choosing a decaying step size $\gamma _{i}\approx {\frac {1}{\sqrt {i}}}$, one can prove the convergence of the average iterate ${\overline {w}}_{n}={\frac {1}{n}}\sum _{i=1}^{n}w_{i}$. This setting is a special case of stochastic optimization, a well-known problem in optimization.^[1]
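A minimal sketch of this update with a step size proportional to $1/{\sqrt {i}}$ and a running average of the iterates is given below. The constant `c`, the function name `averaged_sgd`, the generating weights and the noise level are assumptions made for the illustration.

```python
import math
import random

def averaged_sgd(stream, d, c=0.5):
    """Single pass of SGD on the square loss with step c / sqrt(i).

    Returns the last iterate and the running average of all iterates.
    """
    w = [0.0] * d
    w_bar = [0.0] * d
    for i, (x, y) in enumerate(stream, start=1):
        gamma_i = c / math.sqrt(i)
        err = sum(wk * xk for wk, xk in zip(w, x)) - y     # x_i^T w_{i-1} - y_i
        w = [wk - gamma_i * err * xk for wk, xk in zip(w, x)]
        # Incremental mean: w_bar_i = w_bar_{i-1} + (w_i - w_bar_{i-1}) / i.
        w_bar = [wb + (wk - wb) / i for wb, wk in zip(w_bar, w)]
    return w, w_bar

# Hypothetical data stream: y = <w_true, x> + small Gaussian noise.
rng = random.Random(0)
w_true = [1.0, -2.0]
stream = []
for _ in range(20000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = sum(a * b for a, b in zip(w_true, x)) + rng.gauss(0, 0.1)
    stream.append((x, y))

w_last, w_avg = averaged_sgd(stream, d=2)
```

Only the vectors `w` and `w_bar` are kept between steps, which matches the $O(d)$ storage stated above.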

Incremental stochastic gradient descent

In practice, one can perform multiple stochastic gradient passes (also called cycles or epochs) over the data. The algorithm thus obtained is called the incremental gradient method and corresponds to the iteration

w_{i}=w_{i-1}-\gamma _{i}\nabla V(\langle w_{i-1},x_{t_{i}}\rangle ,y_{t_{i}})

The main difference with the stochastic gradient method is that here a sequence $t_{i}$ is chosen to decide which training point is visited in the $i$-th step. Such a sequence can be stochastic or deterministic. The number of iterations is then decoupled from the number of points (each point can be considered more than once). The incremental gradient method can be shown to provide a minimizer to the empirical risk.^[3] Incremental techniques can be advantageous when considering objective functions made up of a sum of many terms, e.g. an empirical error corresponding to a very large dataset.^[1]
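The role of the index sequence $t_{i}$ can be made explicit in code. The sketch below assumes a deterministic cyclic order and a step size that is halved, thirded and so on from one epoch to the next; the names `incremental_gradient`, `cyclic` and `decaying`, and the four one-dimensional data points, are hypothetical.

```python
def incremental_gradient(xs, ys, epochs, step, pick):
    """Run epochs * n gradient steps, visiting point pick(i, n) at step i."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for i in range(1, epochs * n + 1):
        t = pick(i, n)                                   # index t_i
        err = sum(wk * xk for wk, xk in zip(w, xs[t])) - ys[t]
        g = step(i, n)                                   # step size gamma_i
        # Gradient of the square loss at point t is 2 * x_t * (<w, x_t> - y_t).
        w = [wk - g * 2.0 * err * xk for wk, xk in zip(w, xs[t])]
    return w

def cyclic(i, n):
    """Deterministic sequence: sweep the points in order, epoch after epoch."""
    return (i - 1) % n

def decaying(i, n):
    """Step size 0.1 / (1 + epoch), held constant within each epoch."""
    return 0.1 / (1 + (i - 1) // n)

xs = [[0.5], [1.0], [-1.5], [2.0]]
ys = [1.1, 1.9, -3.2, 4.1]
w_inc = incremental_gradient(xs, ys, epochs=2000, step=decaying, pick=cyclic)

# Empirical risk minimizer in one dimension, for comparison.
w_erm = sum(x[0] * y for x, y in zip(xs, ys)) / sum(x[0] ** 2 for x in xs)
```

Passing a different `pick` function, for example one that draws a random index, turns the same loop into a stochastic variant without touching the update itself.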

Kernel methods

Online convex optimization

Online convex optimization (OCO)^[4] is a general framework for decision making which leverages convex optimization to allow for efficient algorithms. The framework is that of repeated game playing as follows:

For $t=1,2,\ldots ,T$

Learner receives input $x_{t}$
Learner outputs $w_{t}$ from a fixed convex set $S$
Nature sends back a convex loss function $v_{t}:S\rightarrow \mathbb {R}$.
Learner suffers loss $v_{t}(w_{t})$ and updates its model

The goal is to minimize regret, or the difference between cumulative loss and the loss of the best fixed point $u\in S$ in hindsight. As an example, consider the case of online least squares linear regression. Here, the weight vectors come from the convex set $S=\mathbb {R} ^{d}$, and nature sends back the convex loss function $v_{t}(w)=(\langle w,x_{t}\rangle -y_{t})^{2}$. Note here that $y_{t}$ is implicitly sent with $v_{t}$.
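The protocol and the notion of regret can be illustrated for online least squares in one dimension ($S=\mathbb {R}$), where the best fixed point in hindsight has a closed form. The learner used here, an online gradient descent rule with step `eta / sqrt(t)`, is just one possible choice; it and the synthetic rounds are assumptions of the sketch.

```python
import math
import random

def play_ogd(rounds, eta=0.5):
    """Run the OCO protocol with a gradient-descent learner; return its cumulative loss."""
    w = 0.0
    cumulative = 0.0
    for t, (x, y) in enumerate(rounds, start=1):
        # The learner has committed to w; nature reveals v_t(w) = (w * x - y)^2.
        cumulative += (w * x - y) ** 2
        grad = 2.0 * x * (w * x - y)
        w -= eta / math.sqrt(t) * grad           # learner updates its model
    return cumulative

def best_fixed_loss(rounds):
    """Cumulative loss of the best fixed u in hindsight (1-D least squares)."""
    sxx = sum(x * x for x, _ in rounds)
    sxy = sum(x * y for x, y in rounds)
    u = sxy / sxx
    return sum((u * x - y) ** 2 for x, y in rounds)

# Hypothetical rounds: y_t = 3 * x_t plus small Gaussian noise.
rng = random.Random(0)
rounds = []
for _ in range(2000):
    x = rng.uniform(-1, 1)
    rounds.append((x, 3.0 * x + rng.gauss(0, 0.1)))

regret = play_ogd(rounds) - best_fixed_loss(rounds)
average_regret = regret / len(rounds)
```

The comparator `u` is computed from all rounds at once, which is exactly what "in hindsight" means: the learner never has access to it while playing.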

Some online prediction problems, however, cannot fit in the framework of OCO. For example, in online classification, the prediction domain and the loss functions are not convex. In such scenarios, two simple techniques for convexification are used: randomisation and surrogate loss functions.

Some simple online convex optimisation algorithms are:

Follow the leader (FTL)

The simplest learning rule to try is to select (at the current step) the hypothesis that has the least loss over all past rounds. This algorithm is called Follow the leader, and the choice in round $t$ is simply given by:

w_{t}=\mathop {\operatorname {arg\,min} } _{w\in S}\sum _{i=1}^{t-1}v_{i}(w)

This method can thus be viewed as a greedy algorithm. For the case of online quadratic optimization (where the loss function is $v_{t}(w)=\left\|w-x_{t}\right\|_{2}^{2}$), one can show a regret bound that grows as $\log(T)$. However, similar bounds cannot be obtained for the FTL algorithm for other important families of models like online linear optimization. To do so, one modifies FTL by adding regularisation.
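For the quadratic loss $v_{t}(w)=\left\|w-x_{t}\right\|_{2}^{2}$ the leader has a closed form: the minimiser of the summed past losses is the mean of the past points. The sketch below uses that fact to play FTL and measure its regret; the function name `ftl_regret`, the starting point $w_{1}=0$ and the random points are assumptions of the example.

```python
import math
import random

def ftl_regret(points):
    """Play Follow the leader on v_t(w) = ||w - x_t||^2 and return the regret."""
    d = len(points[0])
    running_sum = [0.0] * d
    ftl_loss = 0.0
    for t, x in enumerate(points, start=1):
        if t == 1:
            w = [0.0] * d                        # no past rounds yet: arbitrary start
        else:
            w = [s / (t - 1) for s in running_sum]   # leader = mean of past points
        ftl_loss += sum((wk - xk) ** 2 for wk, xk in zip(w, x))
        running_sum = [s + xk for s, xk in zip(running_sum, x)]
    # Best fixed point in hindsight is the mean of all points.
    mean = [s / len(points) for s in running_sum]
    best_loss = sum(sum((mk - xk) ** 2 for mk, xk in zip(mean, x)) for x in points)
    return ftl_loss - best_loss

rng = random.Random(0)
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(1000)]
regret = ftl_regret(points)
ratio_to_log = regret / math.log(len(points))    # stays bounded if regret is O(log T)
```

Re-running with longer sequences and watching `ratio_to_log` is a simple way to see the logarithmic growth empirically.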

Follow the regularised leader (FTRL)

This is a natural modification of FTL that is used to stabilise the FTL solutions and obtain better regret bounds. A regularisation function $R:S\to \mathbb {R}$ is chosen and learning is performed in round $t$ as follows:

w_{t}=\mathop {\operatorname {arg\,min} } _{w\in S}\sum _{i=1}^{t-1}v_{i}(w)+R(w)

As a special example, consider the case of online linear optimisation, i.e. where nature sends back loss functions of the form $v_{t}(w)=\langle w,z_{t}\rangle$. Also, let $S=\mathbb {R} ^{d}$. Suppose the regularisation function $R(w)={\frac {1}{2\eta }}\left\|w\right\|_{2}^{2}$ is chosen for some positive number $\eta$. Then, one can show that the regret minimising iteration becomes

w_{t+1}=-\eta \sum _{i=1}^{t}z_{i}=w_{t}-\eta z_{t}

Note that this can be rewritten as $w_{t+1}=w_{t}-\eta \nabla v_{t}(w_{t})$, which looks exactly like online gradient descent.
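The closed form can be checked directly under the stated assumptions ($S=\mathbb {R} ^{d}$, linear losses, quadratic regulariser). The FTRL objective for round $t+1$ is

\sum _{i=1}^{t}\langle w,z_{i}\rangle +{\frac {1}{2\eta }}\left\|w\right\|_{2}^{2}

which is strictly convex in $w$, so its unique minimiser is found by setting the gradient to zero:

\sum _{i=1}^{t}z_{i}+{\frac {1}{\eta }}w=0\quad \Longrightarrow \quad w_{t+1}=-\eta \sum _{i=1}^{t}z_{i}

Subtracting the same expression for $w_{t}$ leaves $w_{t+1}-w_{t}=-\eta z_{t}$, and since $\nabla v_{t}(w)=z_{t}$ for a linear loss, this is a gradient step.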

If $S$ is instead some convex subset of $\mathbb {R} ^{d}$, the iterate needs to be projected onto $S$, leading to the modified update rule

w_{t+1}=\Pi _{S}\left(-\eta \sum _{i=1}^{t}z_{i}\right)=\Pi _{S}(\eta \theta _{t+1})

where $\theta _{t+1}=-\sum _{i=1}^{t}z_{i}$. This algorithm is known as lazy projection, as the vector $\theta _{t+1}$ accumulates the gradients. It is also known as Nesterov's dual averaging algorithm. In this scenario of linear loss functions and quadratic regularisation, the regret is bounded by $O({\sqrt {T}})$, and thus the average regret goes to $0$ as desired.
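A minimal sketch of lazy projection follows, assuming that $S$ is a Euclidean ball of radius `radius` (so that the projection is a simple rescaling) and that $\eta$ is tuned as `radius / sqrt(2 * T)`. Both choices, like the names `project_ball` and `lazy_projection` and the random loss vectors, are illustrative assumptions.

```python
import math
import random

def project_ball(v, radius):
    """Euclidean projection of v onto the ball {w : ||w|| <= radius}."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm <= radius:
        return list(v)
    return [c * radius / norm for c in v]

def lazy_projection(zs, eta, radius):
    """Play w_t = Pi_S(eta * theta_t) against linear losses <w, z_t>."""
    d = len(zs[0])
    theta = [0.0] * d                              # theta_1 = 0
    iterates, loss = [], 0.0
    for z in zs:
        w = project_ball([eta * c for c in theta], radius)
        iterates.append(w)
        loss += sum(wk * zk for wk, zk in zip(w, z))      # v_t(w_t) = <w_t, z_t>
        # The unprojected accumulator is updated, not the projected iterate.
        theta = [c - zk for c, zk in zip(theta, z)]       # theta_{t+1} = theta_t - z_t
    return iterates, loss

rng = random.Random(0)
T, radius = 500, 1.0
zs = [[rng.uniform(-1, 1) + 0.2, rng.uniform(-1, 1)] for _ in range(T)]
eta = radius / math.sqrt(2 * T)                    # hypothetical tuning
iterates, loss = lazy_projection(zs, eta, radius)

# Best fixed point in the ball: u = -radius * Z / ||Z|| with Z the sum of all z_t.
z_sum = [sum(z[k] for z in zs) for k in range(2)]
best_loss = -radius * math.sqrt(sum(c * c for c in z_sum))
regret = loss - best_loss
```

The projection is applied only when an iterate is produced, while `theta` keeps the raw running sum; this is the sense in which the projection is "lazy".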

Online subgradient descent (OSD)

Other algorithms

Quadratically regularised FTRL algorithms lead to lazily projected gradient algorithms as described above. To use the above for arbitrary convex functions and regularisers, one uses online mirror descent. The optimal regularization in hindsight can be derived for linear loss functions; this leads to the AdaGrad algorithm. For the Euclidean regularisation, one can show a regret bound of $O({\sqrt {T}})$, which can be improved further to $O(\log T)$ for strongly convex and exp-concave loss functions.