Mamba is a deep learning architecture for sequence modeling. It was developed by researchers from Carnegie Mellon University and Princeton University to address some limitations of transformer models, especially in processing long sequences. It is based on the Structured State Space sequence (S4) model.^[1]^[2]^[3]

Architecture

To handle long data sequences, Mamba incorporates the Structured State Space sequence model (S4).^[1] S4 can model long-range dependencies effectively and efficiently by combining the strengths of continuous-time, recurrent, and convolutional models. This lets it handle irregularly sampled data, maintain unbounded context, and remain computationally efficient during both training and testing.^[4]
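The link between the recurrent and convolutional views can be illustrated with a toy example. The following is a minimal sketch, assuming a diagonal state matrix and zero-order-hold discretization; it does not reproduce S4's structured parameterization or its fast kernel computation, and the function names are hypothetical.

```python
import math

def discretize(a, b, delta):
    """Zero-order-hold discretization of a diagonal SSM x' = a*x + b*u.

    Assumes every entry of `a` is nonzero (negative entries give a stable system).
    """
    a_bar = [math.exp(delta * ai) for ai in a]
    b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b)]
    return a_bar, b_bar

def run_recurrent(a_bar, b_bar, c, u):
    """Recurrent view: carry a hidden state through the sequence step by step."""
    x = [0.0] * len(a_bar)
    ys = []
    for u_t in u:
        x = [ab * xi + bb * u_t for ab, bb, xi in zip(a_bar, b_bar, x)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)))
    return ys

def run_convolutional(a_bar, b_bar, c, u):
    """Convolutional view: y = K * u with K[k] = sum_i c_i * a_bar_i**k * b_bar_i."""
    n = len(u)
    kernel = [
        sum(ci * ab ** k * bb for ci, ab, bb in zip(c, a_bar, b_bar))
        for k in range(n)
    ]
    # Causal convolution, written naively in O(n^2) for clarity.
    return [sum(kernel[k] * u[t - k] for k in range(t + 1)) for t in range(n)]

# Illustrative values only.
a_bar, b_bar = discretize(a=[-1.0, -0.5], b=[1.0, 1.0], delta=0.1)
inputs = [1.0, 0.0, 2.0, -1.0]
print(run_recurrent(a_bar, b_bar, [1.0, 2.0], inputs))
print(run_convolutional(a_bar, b_bar, [1.0, 2.0], inputs))
```

Both functions compute the same outputs up to floating-point rounding. The equivalence depends on the parameters being the same at every time step, which is what allows the kernel to be precomputed.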

Building on S4, Mamba introduces significant enhancements, particularly in its treatment of time-variant operations. Central to its design is a selection mechanism that makes the structured state space model (SSM) parameters functions of the input.^[5]^[1] This lets Mamba selectively focus on relevant information within a sequence and filter out less pertinent data. As a result, the model moves from a time-invariant to a time-varying framework, which changes how it can be computed: the fixed convolution kernel that S4 relies on for efficient training no longer applies.^[1]^[6]
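A selection mechanism of this kind can be sketched as a recurrence whose step size and projections are recomputed from each input. The example below is an illustration under stated assumptions, not the published implementation: it uses a scalar input channel, a diagonal state matrix, and hypothetical scalar weights (`w_delta`, `w_b`, `w_c`) in place of learned linear projections.

```python
import math

def softplus(z):
    """Smooth positive function, used here to keep the step size above zero."""
    return math.log1p(math.exp(z))

def selective_scan(u, a, w_delta, w_b, w_c):
    """Toy selective SSM: the step size, B and C all depend on the current input.

    u       -- input sequence (list of floats)
    a       -- diagonal of the state matrix (nonzero, typically negative)
    w_delta -- hypothetical weight producing the step size from the input
    w_b     -- hypothetical weights producing B_t from the input
    w_c     -- hypothetical weights producing C_t from the input
    """
    h = [0.0] * len(a)
    ys = []
    for u_t in u:
        delta = softplus(w_delta * u_t)        # input-dependent step size
        b_t = [wb * u_t for wb in w_b]         # input-dependent B
        c_t = [wc * u_t for wc in w_c]         # input-dependent C
        a_bar = [math.exp(delta * ai) for ai in a]
        b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b_t)]
        # A small delta keeps most of the old state; a large delta overwrites it.
        h = [ab * hi + bb * u_t for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c_t, h)))
    return ys

# Illustrative values only.
print(selective_scan([1.0, 0.0, 2.0], [-1.0, -2.0], 0.5, [1.0, 0.5], [1.0, -1.0]))
```

Because `a_bar` and `b_bar` change at every step, there is no single kernel to precompute, so the sequence has to be processed as a recurrence.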

To address the computational challenges introduced by this time-variance, Mamba employs a hardware-aware algorithm. The algorithm computes the model efficiently on modern hardware such as GPUs by using kernel fusion, parallel scan, and recomputation.^[1] The implementation avoids materializing the expanded states in main GPU memory, which reduces both memory traffic and memory usage. The result is an architecture whose computation scales linearly with sequence length, making it significantly more efficient on long sequences than previous methods.^[1]^[6]
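The parallel scan idea can be shown on a scalar recurrence h_t = a_t * h_(t-1) + b_t. Each step is an affine map, and composing affine maps is associative, so prefixes can be combined in a logarithmic number of rounds. The sketch below is a plain-Python illustration of that property using a Hillis-Steele style scan; it is an assumption-laden teaching example and says nothing about how the fused GPU kernel is actually written.

```python
def combine(left, right):
    """Compose two steps of the form h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Reference implementation: one step after another, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def parallel_style_scan(a, b):
    """Inclusive scan in about log2(n) rounds.

    Within a round every position is updated independently of the others,
    which is what makes the computation suitable for parallel hardware.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # With h = 0 initially, the offset of each composed map is the state h_t.
    return [offset for _, offset in elems]

# Illustrative values only.
decay = [0.9, 0.5, 0.8, 0.7, 0.6]
drive = [1.0, 2.0, 0.0, -1.0, 3.0]
print(sequential_scan(decay, drive))
print(parallel_style_scan(decay, drive))
```

The two functions agree up to floating-point rounding. This variant does more total work than the sequential loop; work-efficient scans reduce that overhead at the cost of a more involved implementation.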

Additionally, Mamba simplifies its architecture by integrating the SSM design with MLP blocks into a single block, resulting in a homogeneous and streamlined structure. This furthers the model's capability for general sequence modeling across data types such as language, audio, and genomics, while maintaining efficiency in both training and inference.^[1]

Variants

MoE-Mamba integrates the Mamba architecture with a mixture of experts (MoE) layer. The combination improves training efficiency: the model reaches performance comparable to Mamba in 2.2x fewer training steps while preserving Mamba's inference performance gains over transformers.^[7] The design alternates Mamba and MoE layers, so that the Mamba layers integrate the entire sequence context and the MoE layers apply the most relevant expert to each token.
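The alternating layout and per-token expert selection can be sketched as follows. This is a minimal illustration that assumes top-1 routing with the expert output scaled by its router probability; the router weights, the expert functions, and the `build_stack` helper are all hypothetical and are not taken from the MoE-Mamba code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(token, router_weights):
    """Return (expert index, router probability) for the best-scoring expert."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def moe_layer(tokens, router_weights, experts):
    """Send each token to a single expert and scale the result by its gate value."""
    out = []
    for token in tokens:
        idx, gate = route_top1(token, router_weights)
        out.append([gate * v for v in experts[idx](token)])
    return out

def build_stack(num_pairs):
    """Layer order for an alternating design: Mamba layer, then MoE layer."""
    return [name for _ in range(num_pairs) for name in ("mamba", "moe")]

# Illustrative experts and weights only.
experts = [lambda t: [2.0 * x for x in t], lambda t: [-x for x in t]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]
print(build_stack(2))
print(moe_layer([[3.0, 0.0], [0.0, 3.0]], router_weights, experts))
```

Only one expert runs per token, so the parameter count of the layer can grow with the number of experts while the per-token compute stays roughly constant.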

References

1. Gu, Albert; Dao, Tri. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces". arXiv:2312.00752.
2. Chowdhury, Hasan. "The tech powering ChatGPT won't make AI as smart as humans. Others might". Business Insider. Retrieved 13 January 2024.
3. Pandey, Mohit (6 December 2023). "Mamba is Here to Mark the End of Transformers". Analytics India Magazine. Retrieved 13 January 2024.
4. Gu, Albert; Goel, Karan; Ré, Christopher (6 October 2021). "Efficiently Modeling Long Sequences with Structured State Spaces". ICLR. Retrieved 13 January 2024.
5. Gu, Albert; Johnson, Isys; Goel, Karan; Saab, Khaled Kamal; Dao, Tri; Rudra, A.; Ré, Christopher (26 October 2021). "Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers". NeurIPS. Retrieved 13 January 2024.
6. Tickoo, Aneesh (10 December 2023). "Researchers from CMU and Princeton Unveil Mamba: A Breakthrough SSM Architecture Exceeding Transformer Efficiency for Multimodal Deep Learning Applications". MarkTechPost. Retrieved 13 January 2024.
7. Pióro, Maciej; Ciebiera, Kamil; Król, Krystian; Ludziejewski, Jan; Jaszczur, Sebastian. "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts". arXiv:2401.04081.
