
A restricted Boltzmann machine (RBM) (also called a restricted Sherrington–Kirkpatrick model with external field or restricted stochastic Ising–Lenz–Little model) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.^[1]

RBMs were initially proposed under the name Harmonium by Paul Smolensky in 1986,^[2] and rose to prominence after Geoffrey Hinton and collaborators used fast learning algorithms for them in the mid-2000s. RBMs have found applications in dimensionality reduction,^[3] classification,^[4] collaborative filtering,^[5] feature learning,^[6] topic modelling,^[7] immunology,^[8] and even many-body quantum mechanics.^[9]^[10] They can be trained in either supervised or unsupervised ways, depending on the task.

As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph:

a pair of nodes from each of the two groups of units (commonly referred to as the "visible" and "hidden" units respectively) may have a symmetric connection between them; and
there are no connections between nodes within a group.

By contrast, "unrestricted" Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm.^[11]

Restricted Boltzmann machines can also be used in deep learning networks. In particular, deep belief networks can be formed by "stacking" RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.^[12]
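The stacking idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names `train_rbm` and `stack_rbms`, the layer sizes, the learning rate, and the toy data are all hypothetical, and each layer is trained with a simple batch variant of single-step contrastive divergence that uses hidden probabilities in the gradient statistics. The hidden probabilities of one trained RBM serve as the input data for the next; the optional fine-tuning stage is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=50, lr=0.1):
    """Train one RBM with batch CD-1 (hypothetical helper); returns (W, a, b)."""
    m = data.shape[1]
    W = rng.normal(scale=0.01, size=(m, n_hidden))
    a, b = np.zeros(m), np.zeros(n_hidden)
    for _ in range(epochs):
        ph = sigmoid(b + data @ W)                       # P(h = 1 | data)
        h = (rng.random(ph.shape) < ph).astype(float)    # sampled hidden states
        pv = sigmoid(a + h @ W.T)                        # P(v = 1 | h)
        v_rec = (rng.random(pv.shape) < pv).astype(float)
        ph_rec = sigmoid(b + v_rec @ W)
        # Assumption: probabilities (not samples) are used in the statistics.
        W += lr * (data.T @ ph - v_rec.T @ ph_rec) / len(data)
        a += lr * (data - v_rec).mean(axis=0)
        b += lr * (ph - ph_rec).mean(axis=0)
    return W, a, b

def stack_rbms(data, layer_sizes, rng):
    """Greedy layer-wise training: each RBM is trained on the layer below."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden, rng)
        layers.append((W, a, b))
        x = sigmoid(b + x @ W)   # hidden probabilities feed the next RBM
    return layers

rng = np.random.default_rng(0)
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
layers = stack_rbms(data, [4, 2], rng)
```

Feeding probabilities rather than sampled binary states to the next layer is one common choice; sampling the states instead would also fit the description above.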

Structure

The standard type of RBM has binary-valued (Boolean) hidden and visible units, and consists of a matrix of weights $W$ of size $m\times n$, where $m$ is the number of visible units and $n$ the number of hidden units. Each weight element $w_{i,j}$ of the matrix is associated with the connection between the visible (input) unit $v_{i}$ and the hidden unit $h_{j}$. In addition, there are bias weights (offsets) $a_{i}$ for $v_{i}$ and $b_{j}$ for $h_{j}$. Given the weights and biases, the energy of a configuration (pair of Boolean vectors) $(v,h)$ is defined as

E(v,h)=-\sum _{i}a_{i}v_{i}-\sum _{j}b_{j}h_{j}-\sum _{i}\sum _{j}v_{i}w_{i,j}h_{j}

or, in matrix notation,

E(v,h)=-a^{\mathrm {T} }v-b^{\mathrm {T} }h-v^{\mathrm {T} }Wh.
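The two forms of the energy can be compared numerically. The following is a small sketch using NumPy; the function name `energy` and the toy parameter values are chosen here purely for illustration.

```python
import numpy as np

def energy(v, h, W, a, b):
    """Energy E(v, h) = -a^T v - b^T h - v^T W h of one configuration."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Toy RBM with m = 3 visible and n = 2 hidden units (arbitrary values).
W = np.array([[0.5, -1.0],
              [1.5,  0.0],
              [-0.5, 2.0]])
a = np.array([0.1, -0.2, 0.3])   # visible biases
b = np.array([-0.4, 0.6])        # hidden biases

v = np.array([1, 0, 1])
h = np.array([0, 1])

# The same quantity written as explicit sums over i and j.
double_sum = (-sum(a[i] * v[i] for i in range(3))
              - sum(b[j] * h[j] for j in range(2))
              - sum(v[i] * W[i, j] * h[j] for i in range(3) for j in range(2)))

print(energy(v, h, W, a, b), double_sum)   # the two values agree up to rounding
```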

This energy function is analogous to that of a Hopfield network. As with general Boltzmann machines, the joint probability distribution for the visible and hidden vectors is defined in terms of the energy function as follows,^[13]

P(v,h)={\frac {1}{Z}}e^{-E(v,h)}

where $Z$ is a partition function defined as the sum of $e^{-E(v,h)}$ over all possible configurations, which can be interpreted as a normalizing constant to ensure that the probabilities sum to 1. The marginal probability of a visible vector is the sum of $P(v,h)$ over all possible hidden layer configurations,^[13]

P(v)={\frac {1}{Z}}\sum _{\{h\}}e^{-E(v,h)},

and vice versa. Since the underlying graph structure of the RBM is bipartite (meaning there are no intra-layer connections), the hidden unit activations are mutually independent given the visible unit activations. Conversely, the visible unit activations are mutually independent given the hidden unit activations.^[11] That is, for $m$ visible units and $n$ hidden units, the conditional probability of a configuration of the visible units $v$, given a configuration of the hidden units $h$, is

P(v|h)=\prod _{i=1}^{m}P(v_{i}|h).

Conversely, the conditional probability of $h$ given $v$ is

P(h|v)=\prod _{j=1}^{n}P(h_{j}|v).
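For a very small RBM, the partition function, the marginal $P(v)$, and the factorisation of $P(h|v)$ can all be checked by brute-force enumeration. The sketch below assumes NumPy and randomly drawn toy parameters; the helper names (`all_binary`, `p_joint`, `p_visible`) are illustrative only. Enumeration needs $2^{m+n}$ terms, so it is feasible only for toy sizes.

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    return -(a @ v) - (b @ h) - (v @ W @ h)

def all_binary(k):
    """All 2**k binary vectors of length k."""
    return [np.array(bits, dtype=float)
            for bits in itertools.product([0, 1], repeat=k)]

rng = np.random.default_rng(0)
m, n = 3, 2
W = rng.normal(size=(m, n))
a = rng.normal(size=m)
b = rng.normal(size=n)
vs, hs = all_binary(m), all_binary(n)

# Partition function: sum of exp(-E) over all 2**(m+n) configurations.
Z = sum(np.exp(-energy(v, h, W, a, b)) for v in vs for h in hs)

def p_joint(v, h):
    return np.exp(-energy(v, h, W, a, b)) / Z

def p_visible(v):
    # Marginalise the hidden layer by summing over its 2**n configurations.
    return sum(p_joint(v, h) for h in hs)

total = sum(p_visible(v) for v in vs)   # equals 1 up to rounding

# Factorisation check for one visible vector: P(h|v) versus prod_j P(h_j|v).
v = vs[5]
p_h_given_v = np.array([p_joint(v, h) / p_visible(v) for h in hs])
# P(h_j = 1 | v), obtained by summing the conditional over the other units.
p_on = sum(p * h for p, h in zip(p_h_given_v, hs))
product = np.array([np.prod(np.where(h == 1, p_on, 1 - p_on)) for h in hs])
max_gap = np.max(np.abs(p_h_given_v - product))   # zero up to rounding
```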

The individual activation probabilities are given by

P(h_{j}=1|v)=\sigma \left(b_{j}+\sum _{i=1}^{m}w_{i,j}v_{i}\right)

and

P(v_{i}=1|h)=\sigma \left(a_{i}+\sum _{j=1}^{n}w_{i,j}h_{j}\right)

where $\sigma$ denotes the logistic sigmoid.
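In vectorised form, both activation probabilities reduce to a matrix-vector product followed by a sigmoid. The following sketch assumes NumPy, with $W$ stored as an $m\times n$ array; the function names and toy values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """Vector of P(h_j = 1 | v) for j = 1..n."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """Vector of P(v_i = 1 | h) for i = 1..m."""
    return sigmoid(a + W @ h)

def sample_bernoulli(p, rng):
    """Draw independent 0/1 states with success probabilities p."""
    return (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.normal(scale=0.1, size=(m, n))
a = np.zeros(m)
b = np.zeros(n)

v = np.array([1.0, 0.0, 1.0, 1.0])
ph = p_h_given_v(v, W, b)        # hidden activation probabilities
h = sample_bernoulli(ph, rng)    # one sampled hidden configuration
pv = p_v_given_h(h, W, a)        # visible activation probabilities given h
```

Because the units within a layer are conditionally independent, a whole layer can be sampled in one step, which is what makes alternating (block) Gibbs sampling between the two layers cheap.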

The visible units of a restricted Boltzmann machine can be multinomial while the hidden units remain Bernoulli. In this case, the logistic function for visible units is replaced by the softmax function

P(v_{i}^{k}=1|h)={\frac {\exp(a_{i}^{k}+\sum _{j}W_{ij}^{k}h_{j})}{\sum _{k'=1}^{K}\exp(a_{i}^{k'}+\sum _{j}W_{ij}^{k'}h_{j})}}

where $K$ is the number of discrete values that each visible unit can take. Such RBMs are applied in topic modeling^[7] and recommender systems.^[5]
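The softmax conditional can be sketched as follows. The array layout is an assumption made for this example: the weights $W_{ij}^{k}$ are stored as an array of shape `(K, m, n)` and the biases $a_{i}^{k}$ as an array of shape `(m, K)`. The function name `softmax_visible` is likewise illustrative.

```python
import numpy as np

def softmax_visible(h, W, a):
    """P(v_i^k = 1 | h) for every visible unit i and category k.

    Assumed layout: W has shape (K, m, n), a has shape (m, K).
    """
    # logits[i, k] = a[i, k] + sum_j W[k, i, j] * h[j]
    logits = a + np.einsum('kij,j->ik', W, h)
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, m, n = 5, 4, 3
W = rng.normal(scale=0.1, size=(K, m, n))
a = np.zeros((m, K))
h = np.array([1.0, 0.0, 1.0])

probs = softmax_visible(h, W, a)   # shape (m, K); each row sums to 1
# Draw one category per visible unit and one-hot encode it.
cats = np.array([rng.choice(K, p=probs[i]) for i in range(m)])
v = np.eye(K)[cats]
```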

Relation to other models

Restricted Boltzmann machines are a special case of Boltzmann machines and Markov random fields.^[14]^[15]

The graphical model of RBMs corresponds to that of factor analysis.^[16]

Training algorithm

Restricted Boltzmann machines are trained to maximize the product of probabilities assigned to some training set $V$ (a matrix, each row of which is treated as a visible vector $v$ ),

\arg \max _{W}\prod _{v\in V}P(v)

or equivalently, to maximize the expected log probability of a training sample $v$ selected randomly from $V$ :^[14]^[15]

\arg \max _{W}\mathbb {E} \left[\log P(v)\right]

The algorithm most often used to train RBMs, that is, to optimize the weight matrix $W$, is the contrastive divergence (CD) algorithm due to Hinton, originally developed to train PoE (product of experts) models.^[17]^[18] The algorithm performs Gibbs sampling and is used inside a gradient descent procedure (similar to the way backpropagation is used inside such a procedure when training feedforward neural nets) to compute the weight update.
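For context, the exact gradient of the log-probability of a training vector with respect to a single weight is a difference of two expectations. This is a standard result for binary RBMs, stated here without derivation; the symbol $\tilde v$ for the summation variable is introduced only for this equation.

{\frac {\partial \log P(v)}{\partial w_{i,j}}}=v_{i}\,P(h_{j}=1|v)-\sum _{\tilde {v}}P({\tilde {v}})\,{\tilde {v}}_{i}\,P(h_{j}=1|{\tilde {v}})

The first term depends only on the training vector and is cheap to evaluate. The second term sums over all $2^{m}$ visible configurations and is intractable for realistic sizes, so contrastive divergence replaces it with an estimate taken from a short Gibbs chain started at the training vector.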

The basic, single-step contrastive divergence (CD-1) procedure for a single sample can be summarized as follows:

Take a training sample $v$ , compute the probabilities of the hidden units and sample a hidden activation vector $h$ from this probability distribution.
Compute the outer product of $v$ and $h$ and call this the positive gradient.
From $h$ , sample a reconstruction $v'$ of the visible units, then resample the hidden activations $h'$ from this. (Gibbs sampling step)
Compute the outer product of $v'$ and $h'$ and call this the negative gradient.
Let the update to the weight matrix $W$ be the positive gradient minus the negative gradient, times some learning rate: $\Delta W=\epsilon (vh^{\mathsf {T}}-v'h'^{\mathsf {T}})$.
Update the biases $a$ and $b$ analogously: $\Delta a=\epsilon (v-v')$ , $\Delta b=\epsilon (h-h')$ .
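The steps above can be sketched in NumPy as follows. This is a literal, minimal rendering of the listed procedure for one sample, not a tuned implementation: the function name `cd1_update`, the learning rate, the initialisation, and the toy data are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v, W, a, b, rng, lr=0.1):
    """One CD-1 update for a single binary training vector v (length m)."""
    # Step 1: hidden probabilities and a sampled hidden vector h.
    ph = sigmoid(b + v @ W)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Step 2: positive gradient.
    pos = np.outer(v, h)
    # Step 3: one Gibbs step: reconstruct v', then resample h'.
    pv = sigmoid(a + W @ h)
    v_rec = (rng.random(pv.shape) < pv).astype(float)
    ph_rec = sigmoid(b + v_rec @ W)
    h_rec = (rng.random(ph_rec.shape) < ph_rec).astype(float)
    # Step 4: negative gradient.
    neg = np.outer(v_rec, h_rec)
    # Steps 5 and 6: updates for the weights and both bias vectors.
    W_new = W + lr * (pos - neg)
    a_new = a + lr * (v - v_rec)
    b_new = b + lr * (h - h_rec)
    return W_new, a_new, b_new

rng = np.random.default_rng(0)
m, n = 6, 3
W = rng.normal(scale=0.01, size=(m, n))
a = np.zeros(m)
b = np.zeros(n)
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]], dtype=float)
for epoch in range(100):
    for v in data:
        W, a, b = cd1_update(v, W, a, b, rng)
```

Practical implementations often work on mini-batches and may use activation probabilities in place of some of the sampled states to reduce sampling noise; the sketch keeps the sampled states throughout so that it matches the listed steps one to one.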

A Practical Guide to Training RBMs written by Hinton can be found on his homepage.^[13]

Stacked Restricted Boltzmann Machine


Literature

Fischer, Asja; Igel, Christian (2012), "An Introduction to Restricted Boltzmann Machines", Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Lecture Notes in Computer Science, vol. 7441, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 14–36, doi:10.1007/978-3-642-33275-3_2, ISBN 978-3-642-33274-6
