Multimodal learning, in the context of machine learning, is a type of deep learning that combines several modalities of data, as often arise in real-world applications. An example of multimodal data is data that combines text (typically represented as a feature vector) with imaging data consisting of pixel intensities and annotation tags. Because these modalities have fundamentally different statistical properties, combining them is non-trivial, and specialized modelling strategies and algorithms are required. The model is then trained so that it can understand and work with multiple forms of data.

Motivation

Many models and algorithms have been implemented to retrieve and classify a single type of data, e.g. images or text. However, data usually come in several modalities (distinct forms of data, such as images, text or audio), each of which carries different information. For example, it is very common to caption an image to convey information not presented in the image itself. Similarly, it is sometimes more straightforward to use an image to describe information that may not be obvious from text. As a result, if different words appear with similar images, these words likely describe the same thing. Conversely, if a word is used to describe seemingly dissimilar images, these images may represent the same object. Thus, when dealing with multimodal data, it is important to use a model that can jointly represent the information, so that it captures the correlation structure between the modalities. Moreover, the model should be able to recover missing modalities given the observed ones (e.g. predicting a possible image object from a text description). The multimodal deep Boltzmann machine model satisfies these requirements.

Background: Boltzmann machine

A Boltzmann machine is a type of stochastic neural network invented by Geoffrey Hinton and Terry Sejnowski in 1985. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They are named after the Boltzmann distribution in statistical mechanics. The units in a Boltzmann machine are divided into two groups: visible units and hidden units. General Boltzmann machines allow connections between any two units. However, learning is impractical with general Boltzmann machines because the computation time is exponential in the size of the machine^{[citation needed]}. A more efficient architecture is the restricted Boltzmann machine, in which connections are allowed only between a hidden unit and a visible unit; it is described in the next section.

Restricted Boltzmann machine

A restricted Boltzmann machine^[1] is an undirected graphical model with stochastic visible variables and stochastic hidden variables. Each visible variable is connected to each hidden variable. The energy function of the model is defined as

$$E(\mathbf{v},\mathbf{h};\theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} W_{ij} v_{i} h_{j} - \sum_{i=1}^{D} b_{i} v_{i} - \sum_{j=1}^{F} a_{j} h_{j}$$

where $\theta = \{\mathbf{W}, \mathbf{a}, \mathbf{b}\}$ are the model parameters: $W_{ij}$ represents the symmetric interaction term between visible unit $i$ and hidden unit $j$; $b_{i}$ and $a_{j}$ are bias terms. The joint distribution over the visible and hidden units is proportional to $\exp(-E(\mathbf{v},\mathbf{h};\theta))$, and summing it over the hidden units gives the probability of a visible vector:

$$P(\mathbf{v};\theta) = \frac{1}{\mathcal{Z}(\theta)} \sum_{\mathbf{h}} \exp(-E(\mathbf{v},\mathbf{h};\theta))$$

where $\mathcal{Z}(\theta)$ is a normalizing constant. The conditional distributions over the hidden units $\mathbf{h}$ and the visible units $\mathbf{v}$ factorize and can be written as logistic functions of the model parameters.

$$P(\mathbf{h}|\mathbf{v};\theta) = \prod_{j=1}^{F} p(h_{j}|\mathbf{v})$$

with

$$p(h_{j}=1|\mathbf{v}) = g\left(\sum_{i=1}^{D} W_{ij} v_{i} + a_{j}\right)$$

$$P(\mathbf{v}|\mathbf{h};\theta) = \prod_{i=1}^{D} p(v_{i}|\mathbf{h})$$

with

$$p(v_{i}=1|\mathbf{h}) = g\left(\sum_{j=1}^{F} W_{ij} h_{j} + b_{i}\right)$$

where $g(x) = \frac{1}{1+\exp(-x)}$ is the logistic function.

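The derivative of the log-likelihood with respect to the model parameters can be decomposed as the difference between a data-dependent expectation and the model's expectation. For the weights, for example,

$$\frac{\partial \log P(\mathbf{v};\theta)}{\partial W_{ij}} = \mathbb{E}_{\text{data}}[v_{i} h_{j}] - \mathbb{E}_{\text{model}}[v_{i} h_{j}]$$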
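The factorized conditionals of the restricted Boltzmann machine can be checked numerically on a tiny model. The following is a minimal sketch, not a reference implementation: the function names and the toy parameter values are assumptions made for illustration. It evaluates the energy function, computes $p(h_{j}=1|\mathbf{v})$ with the logistic formula, and compares the result with a brute-force enumeration over all hidden states.

```python
import itertools
import math


def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, W, b, a):
    """RBM energy E(v, h) for binary sequences v (length D) and h (length F)."""
    interaction = sum(W[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    visible_bias = sum(b[i] * v[i] for i in range(len(v)))
    hidden_bias = sum(a[j] * h[j] for j in range(len(h)))
    return -interaction - visible_bias - hidden_bias


def p_hidden_given_visible(v, W, a):
    """p(h_j = 1 | v) for every hidden unit j (closed form)."""
    return [sigmoid(sum(W[i][j] * v[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]


def p_visible_given_hidden(h, W, b):
    """p(v_i = 1 | h) for every visible unit i (closed form)."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]


def brute_force_p_hidden(v, W, b, a, j):
    """p(h_j = 1 | v) obtained by enumerating all 2^F hidden states."""
    numerator = denominator = 0.0
    for h in itertools.product([0, 1], repeat=len(a)):
        weight = math.exp(-energy(v, h, W, b, a))
        denominator += weight
        if h[j] == 1:
            numerator += weight
    return numerator / denominator


# Toy parameters (arbitrary values, chosen only for illustration).
W = [[0.5, -1.0], [0.3, 0.8], [-0.7, 0.2]]  # D = 3 visible, F = 2 hidden
b = [0.1, -0.2, 0.0]                         # visible biases
a = [0.4, -0.3]                              # hidden biases
v = [1, 0, 1]

closed_form = p_hidden_given_visible(v, W, a)
enumerated = [brute_force_p_hidden(v, W, b, a, j) for j in range(len(a))]
```

Enumeration is only feasible for very small $F$; it is used here solely to confirm that the closed-form conditional agrees with the energy-based definition.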

Gaussian-Bernoulli RBM

Gaussian-Bernoulli RBMs^[2] are a variant of the restricted Boltzmann machine used for modeling real-valued vectors such as pixel intensities; they are typically used to model image data. The energy of the system of the Gaussian-Bernoulli RBM is defined as

$$E(\mathbf{v},\mathbf{h};\theta) = \sum_{i=1}^{D} \frac{(v_{i}-b_{i})^{2}}{2\sigma_{i}^{2}} - \sum_{i=1}^{D}\sum_{j=1}^{F} \frac{v_{i}}{\sigma_{i}} W_{ij} h_{j} - \sum_{j=1}^{F} a_{j} h_{j}$$

where $\theta = \{\mathbf{a}, \mathbf{b}, \mathbf{W}, \boldsymbol{\sigma}\}$ are the model parameters. The joint distribution is defined in the same way as for the restricted Boltzmann machine. The conditional distributions now become

$$P(\mathbf{h}|\mathbf{v};\theta) = \prod_{j=1}^{F} p(h_{j}|\mathbf{v})$$

with

$$p(h_{j}=1|\mathbf{v}) = g\left(\sum_{i=1}^{D} W_{ij} \frac{v_{i}}{\sigma_{i}} + a_{j}\right)$$

$$P(\mathbf{v}|\mathbf{h};\theta) = \prod_{i=1}^{D} p(v_{i}|\mathbf{h})$$

with

$$p(v_{i}|\mathbf{h}) \sim \mathcal{N}\left(\sigma_{i} \sum_{j=1}^{F} W_{ij} h_{j} + b_{i},\ \sigma_{i}^{2}\right)$$

In Gaussian-Bernoulli RBM, the visible unit conditioned on hidden units is modeled as a Gaussian distribution.
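As an illustration of these two conditionals, the sketch below computes the hidden activation probabilities from a real-valued visible vector and draws a visible vector from the Gaussian conditional. The function names and parameter values are assumptions chosen for the example; `sigma` holds standard deviations, so the variance of each visible unit is `sigma[i] ** 2`.

```python
import math
import random


def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, a, sigma):
    """p(h_j = 1 | v) = g(sum_i W_ij * v_i / sigma_i + a_j)."""
    return [sigmoid(sum(W[i][j] * v[i] / sigma[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]


def visible_mean(h, W, b, sigma):
    """Mean of v_i given h: sigma_i * sum_j W_ij * h_j + b_i."""
    return [sigma[i] * sum(W[i][j] * h[j] for j in range(len(h))) + b[i]
            for i in range(len(b))]


def sample_visible(h, W, b, sigma, rng):
    """Draw each v_i from a Gaussian with the conditional mean and std sigma_i."""
    means = visible_mean(h, W, b, sigma)
    return [rng.gauss(means[i], sigma[i]) for i in range(len(means))]


# Toy parameters (assumed for illustration only).
W = [[0.5, -1.0], [0.3, 0.8]]  # D = 2 real-valued visible units, F = 2 hidden
b = [0.0, 1.0]                 # visible biases (Gaussian means when h = 0)
a = [0.1, -0.1]                # hidden biases
sigma = [1.0, 2.0]             # per-unit standard deviations

h = [1, 0]
means = visible_mean(h, W, b, sigma)
sample = sample_visible(h, W, b, sigma, random.Random(0))
probs = hidden_probs(sample, W, a, sigma)
```

Alternating `hidden_probs` (followed by Bernoulli sampling) and `sample_visible` would give one step of block Gibbs sampling, which is one common way such conditionals are used.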

Replicated Softmax Model

The Replicated Softmax Model^[3] is also a variant of the restricted Boltzmann machine and is commonly used to model word count vectors in a document. In a typical text mining problem, let $K$ be the dictionary size and $M$ be the number of words in the document. Let $\mathbf{V}$ be an $M \times K$ binary matrix with $v_{ik}=1$ only when the $i^{th}$ word in the document is the $k^{th}$ word in the dictionary, and let $\hat{v}_{k}$ denote the count for the $k^{th}$ word in the dictionary. The energy of the state $\{\mathbf{V},\mathbf{h}\}$ for a document containing $M$ words is defined as

$$E(\mathbf{V},\mathbf{h}) = -\sum_{j=1}^{F}\sum_{k=1}^{K} W_{jk} \hat{v}_{k} h_{j} - \sum_{k=1}^{K} b_{k} \hat{v}_{k} - M \sum_{j=1}^{F} a_{j} h_{j}$$

The conditional distributions are given by

$$p(h_{j}=1|\mathbf{V}) = g\left(M a_{j} + \sum_{k=1}^{K} \hat{v}_{k} W_{jk}\right)$$

$$p(v_{ik}=1|\mathbf{h}) = \frac{\exp\left(b_{k} + \sum_{j=1}^{F} h_{j} W_{jk}\right)}{\sum_{q=1}^{K} \exp\left(b_{q} + \sum_{j=1}^{F} h_{j} W_{jq}\right)}$$
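A small sketch may make the notation concrete. It builds the count vector $\hat{v}$ from a tokenized document and evaluates both conditionals. The helper names, the three-word dictionary and the parameter values are hypothetical and chosen only for illustration; `W[j][k]` follows the index order $W_{jk}$ used in the energy function.

```python
import math


def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def word_counts(document, dictionary):
    """Count vector v_hat: v_hat[k] is how often dictionary[k] occurs."""
    index = {word: k for k, word in enumerate(dictionary)}
    counts = [0] * len(dictionary)
    for word in document:
        counts[index[word]] += 1
    return counts


def hidden_probs(v_hat, W, a):
    """p(h_j = 1 | V) = g(M * a_j + sum_k v_hat_k * W_jk), with M = sum(v_hat)."""
    M = sum(v_hat)
    return [sigmoid(M * a[j] + sum(v_hat[k] * W[j][k] for k in range(len(v_hat))))
            for j in range(len(a))]


def word_probs(h, W, b):
    """Softmax over the dictionary: p(v_ik = 1 | h) for every word k."""
    scores = [b[k] + sum(h[j] * W[j][k] for j in range(len(h)))
              for k in range(len(b))]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy data: K = 3 dictionary words, F = 2 hidden units.
dictionary = ["cat", "dog", "tree"]
document = ["dog", "cat", "dog", "tree"]       # M = 4 words
W = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]       # W[j][k]
a = [0.05, -0.05]                              # hidden biases
b = [0.0, 0.1, -0.1]                           # word biases

v_hat = word_counts(document, dictionary)
h_probs = hidden_probs(v_hat, W, a)
w_probs = word_probs([1, 0], W, b)
```

Because the softmax is shared by all $M$ word positions, `word_probs` returns a single distribution over the dictionary rather than one per position.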

Deep Boltzmann machines

A deep Boltzmann machine^[4] has a sequence of layers of hidden units. There are connections only between adjacent hidden layers, and between the visible units and the hidden units in the first hidden layer. The energy function of the system adds layer interaction terms to the energy function of the restricted Boltzmann machine and, for three hidden layers, is defined by

$$\begin{aligned}E(\mathbf{v},\mathbf{h};\theta) = &-\sum_{i=1}^{D}\sum_{j=1}^{F_{1}} W_{ij}^{(1)} v_{i} h_{j}^{(1)} - \sum_{j=1}^{F_{1}}\sum_{l=1}^{F_{2}} W_{jl}^{(2)} h_{j}^{(1)} h_{l}^{(2)} \\ &-\sum_{l=1}^{F_{2}}\sum_{p=1}^{F_{3}} W_{lp}^{(3)} h_{l}^{(2)} h_{p}^{(3)} - \sum_{i=1}^{D} b_{i} v_{i} - \sum_{j=1}^{F_{1}} b_{j}^{(1)} h_{j}^{(1)} - \sum_{l=1}^{F_{2}} b_{l}^{(2)} h_{l}^{(2)} - \sum_{p=1}^{F_{3}} b_{p}^{(3)} h_{p}^{(3)}\end{aligned}$$

The probability that the model assigns to a visible vector is

$$P(\mathbf{v};\theta) = \frac{1}{\mathcal{Z}(\theta)} \sum_{\mathbf{h}} \exp\left(-E(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)},\mathbf{h}^{(3)};\theta)\right)$$

Multimodal deep Boltzmann machines

The multimodal deep Boltzmann machine^[5]^[6] uses an image-text bi-modal DBM in which the image pathway is modeled as a Gaussian-Bernoulli DBM and the text pathway as a Replicated Softmax DBM; each DBM has two hidden layers and one visible layer. The two DBMs are joined at an additional top hidden layer. Writing $v_{k}^{t}$ for the count of the $k^{th}$ dictionary word in the text input, the joint distribution over the multimodal inputs is defined as

$$\begin{aligned}P(\mathbf{v}^{m},\mathbf{v}^{t};\theta) &= \sum_{\mathbf{h}^{(2m)},\mathbf{h}^{(2t)},\mathbf{h}^{(3)}} P(\mathbf{h}^{(2m)},\mathbf{h}^{(2t)},\mathbf{h}^{(3)}) \left(\sum_{\mathbf{h}^{(1m)}} P(\mathbf{v}^{m},\mathbf{h}^{(1m)}|\mathbf{h}^{(2m)})\right) \left(\sum_{\mathbf{h}^{(1t)}} P(\mathbf{v}^{t},\mathbf{h}^{(1t)}|\mathbf{h}^{(2t)})\right) \\ &= \frac{1}{\mathcal{Z}_{M}(\theta)} \sum_{\mathbf{h}} \exp\Bigg(\sum_{kj} W_{kj}^{(1t)} v_{k}^{t} h_{j}^{(1t)} \\ &\quad + \sum_{jl} W_{jl}^{(2t)} h_{j}^{(1t)} h_{l}^{(2t)} + \sum_{k} b_{k}^{t} v_{k}^{t} + M \sum_{j} b_{j}^{(1t)} h_{j}^{(1t)} + \sum_{l} b_{l}^{(2t)} h_{l}^{(2t)} \\ &\quad - \sum_{i} \frac{(v_{i}^{m}-b_{i}^{m})^{2}}{2\sigma_{i}^{2}} + \sum_{ij} \frac{v_{i}^{m}}{\sigma_{i}} W_{ij}^{(1m)} h_{j}^{(1m)} \\ &\quad + \sum_{jl} W_{jl}^{(2m)} h_{j}^{(1m)} h_{l}^{(2m)} + \sum_{j} b_{j}^{(1m)} h_{j}^{(1m)} + \sum_{l} b_{l}^{(2m)} h_{l}^{(2m)} \\ &\quad + \sum_{lp} W_{lp}^{(3t)} h_{l}^{(2t)} h_{p}^{(3)} + \sum_{lp} W_{lp}^{(3m)} h_{l}^{(2m)} h_{p}^{(3)} + \sum_{p} b_{p}^{(3)} h_{p}^{(3)}\Bigg)\end{aligned}$$

The conditional distributions over the visible and hidden units are

$$p(h_{j}^{(1m)}=1|\mathbf{v}^{m},\mathbf{h}^{(2m)}) = g\left(\sum_{i=1}^{D} W_{ij}^{(1m)} \frac{v_{i}^{m}}{\sigma_{i}} + \sum_{l=1}^{F_{2}^{m}} W_{jl}^{(2m)} h_{l}^{(2m)} + b_{j}^{(1m)}\right)$$

$$p(h_{l}^{(2m)}=1|\mathbf{h}^{(1m)},\mathbf{h}^{(3)}) = g\left(\sum_{j=1}^{F_{1}^{m}} W_{jl}^{(2m)} h_{j}^{(1m)} + \sum_{p=1}^{F_{3}} W_{lp}^{(3m)} h_{p}^{(3)} + b_{l}^{(2m)}\right)$$

$$p(h_{j}^{(1t)}=1|\mathbf{v}^{t},\mathbf{h}^{(2t)}) = g\left(\sum_{k=1}^{K} W_{kj}^{(1t)} v_{k}^{t} + \sum_{l=1}^{F_{2}^{t}} W_{jl}^{(2t)} h_{l}^{(2t)} + M b_{j}^{(1t)}\right)$$

$$p(h_{l}^{(2t)}=1|\mathbf{h}^{(1t)},\mathbf{h}^{(3)}) = g\left(\sum_{j=1}^{F_{1}^{t}} W_{jl}^{(2t)} h_{j}^{(1t)} + \sum_{p=1}^{F_{3}} W_{lp}^{(3t)} h_{p}^{(3)} + b_{l}^{(2t)}\right)$$

$$p(h_{p}^{(3)}=1|\mathbf{h}^{(2m)},\mathbf{h}^{(2t)}) = g\left(\sum_{l=1}^{F_{2}^{m}} W_{lp}^{(3m)} h_{l}^{(2m)} + \sum_{l=1}^{F_{2}^{t}} W_{lp}^{(3t)} h_{l}^{(2t)} + b_{p}^{(3)}\right)$$

$$p(v_{ik}^{t}=1|\mathbf{h}^{(1t)}) = \frac{\exp\left(\sum_{j=1}^{F_{1}^{t}} h_{j}^{(1t)} W_{kj}^{(1t)} + b_{k}^{t}\right)}{\sum_{q=1}^{K} \exp\left(\sum_{j=1}^{F_{1}^{t}} h_{j}^{(1t)} W_{qj}^{(1t)} + b_{q}^{t}\right)}$$

$$p(v_{i}^{m}|\mathbf{h}^{(1m)}) \sim \mathcal{N}\left(\sigma_{i} \sum_{j=1}^{F_{1}^{m}} W_{ij}^{(1m)} h_{j}^{(1m)} + b_{i}^{m},\ \sigma_{i}^{2}\right)$$
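The role of the shared top layer can be illustrated with a short sketch of two of these conditionals: the joint layer $\mathbf{h}^{(3)}$, which receives input from both pathways, and a second hidden layer, which receives input from below and from the joint layer. All function names and numeric values here are assumptions for illustration; they are not taken from the cited papers.

```python
import math


def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def joint_layer_probs(h2m, h2t, W3m, W3t, b3):
    """p(h_p^(3) = 1 | h^(2m), h^(2t)) for every joint unit p."""
    return [sigmoid(sum(W3m[l][p] * h2m[l] for l in range(len(h2m)))
                    + sum(W3t[l][p] * h2t[l] for l in range(len(h2t)))
                    + b3[p])
            for p in range(len(b3))]


def second_layer_probs(h1, h3, W2, W3, b2):
    """p(h_l^(2) = 1 | h^(1), h^(3)) for one pathway (image or text)."""
    return [sigmoid(sum(W2[j][l] * h1[j] for j in range(len(h1)))
                    + sum(W3[l][p] * h3[p] for p in range(len(h3)))
                    + b2[l])
            for l in range(len(b2))]


# Toy sizes (assumed): 2 image units and 3 text units in the second hidden
# layers, 2 units in the joint layer.
W3m = [[0.2, -0.4], [0.6, 0.1]]               # W3m[l][p], image pathway
W3t = [[-0.5, 0.3], [0.7, 0.2], [0.1, -0.1]]  # W3t[l][p], text pathway
b3 = [0.0, 0.1]

h2m = [1, 0]      # state of the image pathway's second hidden layer
h2t = [0, 1, 1]   # state of the text pathway's second hidden layer
h3_probs = joint_layer_probs(h2m, h2t, W3m, W3t, b3)

# Top-down influence: the text pathway's second layer sees the joint layer.
W2t = [[0.3, -0.2, 0.5], [0.1, 0.4, -0.6]]    # W2t[j][l], F_1^t = 2
b2t = [0.0, 0.0, 0.0]
h2t_probs = second_layer_probs([1, 1], [1, 0], W2t, W3t, b2t)
```

Because `second_layer_probs` depends on the joint layer, information from the image pathway can reach the text pathway (and vice versa), which is what allows a missing modality to be inferred from the observed one.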

Inference and learning

Exact maximum likelihood learning in this model is intractable, but approximate learning of DBMs can be carried out by using a variational approach, where mean-field inference is used to estimate data-dependent expectations and an MCMC based stochastic approximation procedure is used to approximate the model’s expected sufficient statistics.^[7]
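The mean-field step can be sketched on a simplified case. The code below assumes a single-pathway DBM with binary visible units and two hidden layers, not the full multimodal model, and it runs a fixed number of update sweeps; the function name, the initial value of 0.5 and the iteration count are illustrative assumptions. Each hidden unit's posterior is approximated by an independent Bernoulli parameter, and the parameters are updated in turn until they stop changing.

```python
import math


def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_field(v, W1, W2, b1, b2, n_iters=50):
    """Approximate q(h1_j = 1) = mu1[j] and q(h2_l = 1) = mu2[l] given v.

    W1[i][j] connects visible unit i to first-layer unit j;
    W2[j][l] connects first-layer unit j to second-layer unit l.
    """
    mu1 = [0.5] * len(b1)
    mu2 = [0.5] * len(b2)
    for _ in range(n_iters):
        # First layer receives bottom-up input from v and top-down input from mu2.
        mu1 = [sigmoid(sum(W1[i][j] * v[i] for i in range(len(v)))
                       + sum(W2[j][l] * mu2[l] for l in range(len(mu2)))
                       + b1[j])
               for j in range(len(b1))]
        # Second layer receives input only from the first hidden layer.
        mu2 = [sigmoid(sum(W2[j][l] * mu1[j] for j in range(len(mu1))) + b2[l])
               for l in range(len(b2))]
    return mu1, mu2


# Toy parameters (assumed for illustration only).
W1 = [[0.5, -1.0], [0.3, 0.8], [-0.7, 0.2]]  # 3 visible, 2 first-layer units
W2 = [[0.5, -0.3], [0.2, 0.4]]               # 2 first-layer, 2 second-layer units
b1 = [0.1, -0.1]
b2 = [0.0, 0.2]
v = [1, 0, 1]

mu1, mu2 = mean_field(v, W1, W2, b1, b2)
```

In a learning procedure of the kind described above, the products of these mean-field parameters would stand in for the data-dependent expectations, while the model's expectations would be estimated separately by MCMC sampling.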

Application

Multimodal deep Boltzmann machines have been used successfully for classification and for retrieving missing data. Their classification accuracy is reported to exceed that of support vector machines, latent Dirichlet allocation and deep belief networks when the models are tested on data with both image and text modalities or with a single modality.^{[citation needed]} A multimodal deep Boltzmann machine can also predict missing modalities from the observed ones with reasonably good precision.^{[citation needed]} More recent multimodal models rely on self-supervised learning; examples include the CLIP and DALL-E models developed by OpenAI.

Multimodal deep learning has also been applied to cancer screening: at least one system under development integrates several different types of data.^[8]^[9]

Multimodal transformers

Transformers can also be used or adapted for modalities (input or output) beyond text, usually by finding a way to "tokenize" the modality.

Vision transformers^[10] adapt the transformer to computer vision by breaking down input images as a series of patches, turning them into vectors, and treating them like tokens in a standard transformer.
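The patch step can be sketched as follows. This is a simplified illustration on a single-channel image stored as nested lists; the function name and the 4 x 4 toy image are assumptions, and the code stops at the flattened patches rather than reproducing any particular model.

```python
def image_to_patch_tokens(image, patch_size):
    """Split an H x W grid of pixel values into flattened, non-overlapping patches.

    Patches are returned in row-major order, one flat list per patch.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            tokens.append(patch)
    return tokens


# Toy 4 x 4 image whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(image, 2)
# tokens[0] == [0, 1, 4, 5]  (the top-left 2 x 2 patch)
```

In an actual vision transformer, each flattened patch would then typically be mapped to an embedding vector by a learned linear projection and combined with a position embedding before entering the transformer layers.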

Conformer^[11] and later Whisper^[12] follow the same pattern for speech recognition, first turning the speech signal into a spectrogram, which is then treated like an image, i.e. broken down into a series of patches, turned into vectors and treated like tokens in a standard transformer.

Perceivers by Andrew Jaegle et al. (2021)^[13]^[14] can learn from large amounts of heterogeneous data.

Regarding image outputs, Peebles et al. introduced a diffusion transformer (DiT), which facilitates use of the transformer architecture for diffusion-based image production.^[15] Google also released a transformer-centric image generator called "Muse", based on parallel decoding and masked generative transformer technology.^[16] (Transformers played a less central role in prior image-producing technologies,^[17] albeit still a significant one.^[18])

