A language model is a probabilistic model of a natural language.^[1] In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.^[2]

Language models are useful for a variety of tasks, including speech recognition^[3] (helping prevent predictions of low-probability (e.g. nonsense) sequences), machine translation,^[4] natural language generation (generating more human-like text), optical character recognition, handwriting recognition,^[5] grammar induction,^[6] and information retrieval.^[7]^[8]

Large language models, currently their most advanced form, are a combination of larger datasets (frequently using words scraped from the public internet), feedforward neural networks, and transformers. They have superseded recurrent neural network-based models, which had previously superseded pure statistical models such as the word n-gram language model.

Small language models are scaled-down LLMs trained on smaller, private, or proprietary data sets. They are also referred to as "personal language models", since they are typically trained on personal rather than public data.^[9]

Pure statistical models

Models based on word n-grams

A word n-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network-based models, which have in turn been superseded by large language models.^[10] It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.^[11] Special tokens, $\langle s\rangle$ and $\langle /s\rangle$, are introduced to denote the start and end of a sentence.
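The bigram case can be illustrated with a minimal sketch that estimates each conditional probability as a raw relative frequency, with no smoothing. The three-sentence corpus, the function name `train_bigram`, and the plain-text spellings `<s>` and `</s>` for the boundary tokens are assumptions made for the example.

```python
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # sentence start and end tokens

def train_bigram(sentences):
    """Estimate P(next | prev) from raw bigram counts (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model

# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
bigram = train_bigram(corpus)

print(bigram["the"])  # "cat" follows "the" in 2 of 3 cases, "dog" in 1 of 3
print(bigram["cat"])  # "sat" and "ran" each follow "cat" half the time
```

Any word pair that never occurs in the corpus is simply absent from the table, which amounts to a probability of zero.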

To prevent a zero probability being assigned to unseen words, each word's probability is set slightly lower than its relative frequency in the corpus, which leaves some probability mass for unseen words. Various methods are used to calculate it, from simple "add-one" smoothing (assign a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
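As a rough illustration of add-one smoothing for bigrams, the sketch below uses the common formulation in which 1 is added to every bigram count, seen or unseen, and the denominator grows by the vocabulary size. The two-sentence corpus and the name `add_one_prob` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
sentences = ["the cat sat", "the dog sat"]

pair_counts = Counter()  # counts of (prev, next) pairs
prev_counts = Counter()  # how often each word appears as a context
vocab = set()            # words that can be predicted, including </s>

for sentence in sentences:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab.update(tokens[1:])
    for prev, nxt in zip(tokens, tokens[1:]):
        pair_counts[(prev, nxt)] += 1
        prev_counts[prev] += 1

def add_one_prob(prev, word):
    """Add-one (Laplace) estimate of P(word | prev)."""
    return (pair_counts[(prev, word)] + 1) / (prev_counts[prev] + len(vocab))

print(add_one_prob("the", "cat"))  # seen pair: 2/7, down from the unsmoothed 1/2
print(add_one_prob("the", "sat"))  # unseen pair: 1/7 instead of 0
```

The smoothed probabilities for a given context still sum to 1 over the vocabulary; the mass given to unseen pairs is taken from the seen ones.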

Exponential

Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

$P(w_{m}\mid w_{1},\ldots ,w_{m-1})={\frac {1}{Z(w_{1},\ldots ,w_{m-1})}}\exp(a^{T}f(w_{1},\ldots ,w_{m}))$

where $Z(w_{1},\ldots ,w_{m-1})$ is the partition function, $a$ is the parameter vector, and $f(w_{1},\ldots ,w_{m})$ is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on $a$ or some form of regularization.
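The formula can be evaluated directly for a small vocabulary. The following sketch assumes the simplest case described above, a bigram indicator feature, with hand-picked weights standing in for a trained parameter vector $a$; the names `maxent_prob` and `bigram_indicator` and the sparse dictionary representation of features are illustrative choices, not part of any particular library.

```python
import math

def maxent_prob(history, vocab, feature_fn, weights):
    """Return P(w | history) = exp(a . f(history, w)) / Z(history) for each w in vocab."""
    scores = {}
    for word in vocab:
        features = feature_fn(history, word)
        dot = sum(weights.get(name, 0.0) * value for name, value in features.items())
        scores[word] = math.exp(dot)
    z = sum(scores.values())  # partition function Z(history)
    return {word: score / z for word, score in scores.items()}

def bigram_indicator(history, word):
    """Sparse feature vector with a single active feature: the (previous word, word) pair."""
    prev = history[-1] if history else "<s>"
    return {("bigram", prev, word): 1.0}

# Hypothetical vocabulary and weights, chosen by hand for illustration.
vocab = ["cat", "dog", "sat"]
weights = {("bigram", "the", "cat"): 1.0, ("bigram", "the", "dog"): 0.5}

probs = maxent_prob(["the"], vocab, bigram_indicator, weights)
print(probs)  # "cat" is most likely, then "dog", then "sat"
```

Dividing by the partition function is what makes the exponentiated scores sum to 1 for each history.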

The log-bilinear model is another example of an exponential language model.

Skip-gram model

The skip-gram language model is an attempt to overcome the data sparsity problem faced by its predecessor, the word n-gram language model. Words represented in an embedding vector need not be consecutive, but may leave gaps that are skipped over.^[12]

Formally, a $k$ -skip- $n$ -gram is a length- $n$ subsequence where the components occur at distance at most $k$ from each other.

For example, in the input text:

the rain in Spain falls mainly on the plain

the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences

the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain.
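The example can be reproduced with a short sketch. It assumes that "distance at most $k$" means at most $k$ words are skipped between adjacent components, which is the reading consistent with the example above; the function name `skip_grams` is illustrative.

```python
def skip_grams(tokens, n, k):
    """Enumerate k-skip-n-grams: length-n subsequences in which at most
    k tokens are skipped between any two adjacent components."""
    results = []

    def extend(indices):
        if len(indices) == n:
            results.append(tuple(tokens[i] for i in indices))
            return
        last = indices[-1]
        # The next component may sit 1 to k + 1 positions after the last one.
        for nxt in range(last + 1, min(last + k + 2, len(tokens))):
            extend(indices + [nxt])

    for start in range(len(tokens)):
        extend([start])
    return results

tokens = "the rain in Spain falls mainly on the plain".split()
grams = skip_grams(tokens, n=2, k=1)

bigrams = list(zip(tokens, tokens[1:]))
skipped = [g for g in grams if g not in bigrams]
print(len(grams))  # 15: the 8 ordinary bigrams plus the 7 skipped pairs
print(skipped)
```

With `k=0` the function returns only the ordinary n-grams.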

In a skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality. For example, in some such models, if $v$ is the function that maps a word $w$ to its $n$-dimensional vector representation, then

$v(\mathrm {king} )-v(\mathrm {male} )+v(\mathrm {female} )\approx v(\mathrm {queen} )$

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.^[13]^[14]
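The nearest-neighbor reading of ≈ can be sketched with a handful of made-up vectors. The two-dimensional embeddings below are hand-picked so that the analogy holds; they are not taken from any trained model. Cosine similarity as the notion of "nearest" and the exclusion of the three query words from the candidates are common conventions assumed here, not something the text above specifies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the word nearest to v(a) - v(b) + v(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical embeddings: the axes loosely stand for (royalty, maleness).
toy_vectors = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "male":   [0.1, 0.9],
    "female": [0.1, 0.1],
    "apple":  [-0.5, 0.5],
}

print(analogy(toy_vectors, "king", "male", "female"))  # queen
```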

Neural models

Recurrent neural network

Continuous representations or embeddings of words are produced in recurrent neural network-based language models (also known as continuous space language models).^[15] Such continuous space embeddings help to alleviate the curse of dimensionality: the number of possible word sequences increases exponentially with the size of the vocabulary, which in turn causes a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.^[16]

Large language models

A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a computationally intensive self-supervised and semi-supervised training process.^[17] LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.^[18]
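The generation loop described above, repeatedly predicting the next token and appending it to the input, can be sketched as follows. The lookup table `TOY_TABLE` is a hypothetical stand-in for a trained network and looks only at the last token, whereas an actual LLM conditions on the whole context. Greedy selection of the most probable token is also a simplification, since deployed systems typically sample from the predicted distribution.

```python
def generate(next_token_probs, prompt, max_new_tokens=10, end_token="</s>"):
    """Greedy autoregressive decoding: append the most probable next token
    until the end token is predicted or the length limit is reached."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)
        if best == end_token:
            break
        tokens.append(best)
    return tokens

# Hypothetical next-token distributions, keyed by the last token only.
TOY_TABLE = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"</s>": 1.0},
}

def toy_model(tokens):
    """Stand-in for a trained model: returns a distribution over next tokens."""
    return TOY_TABLE.get(tokens[-1], {"</s>": 1.0})

print(generate(toy_model, ["the"]))  # ['the', 'cat', 'sat']
```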

LLMs are artificial neural networks that utilize the transformer architecture, invented in 2017. As of June 2024, the largest and most capable LLMs are built with a decoder-only transformer-based architecture, which enables efficient processing and generation of large-scale text data.

Historically, up to 2020, fine-tuning was the primary method used to adapt a model for specific tasks. However, larger models such as GPT-3 have demonstrated the ability to achieve similar results through prompt engineering, which involves crafting specific input prompts to guide the model's responses.^[19] These models acquire knowledge about syntax, semantics, and ontologies^[20] inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on.^[21]

Some notable LLMs are OpenAI's GPT series of models (e.g., GPT-3.5 and GPT-4, used in ChatGPT and Microsoft Copilot), Google's Gemini (currently used in the chatbot of the same name), Meta's LLaMA family of models, Anthropic's Claude models, and Mistral AI's models.

Although LLMs sometimes match human performance, it is not clear whether they are plausible cognitive models. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do.^[22]

Evaluation and benchmarks

The quality of language models is mostly evaluated by comparison with human-created sample benchmarks derived from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from the data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves.^[23]

Various data sets have been developed for use in evaluating language processing systems.^[24] These include:

Corpus of Linguistic Acceptability^[25]
GLUE benchmark^[26]
Microsoft Research Paraphrase Corpus^[27]
Multi-Genre Natural Language Inference
Question Natural Language Inference
Quora Question Pairs^[28]
Recognizing Textual Entailment^[29]
Semantic Textual Similarity Benchmark
SQuAD question answering Test^[30]
Stanford Sentiment Treebank^[31]
Winograd NLI
BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU (Massive Multitask Language Understanding), BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs (LLaMa benchmark)^[32]