Seq2seq is a family of machine learning approaches used for natural language processing.[1] Applications include language translation, image captioning, conversational models, and text summarization.[2] Seq2seq uses sequence transformation: it turns one sequence into another sequence.


History

The algorithm was developed by Google for use in machine translation.[2][unreliable source]

Similar earlier work includes Tomáš Mikolov's 2012 PhD thesis.[3][non-primary source needed]

In 2023, after receiving the Test of Time Award from NeurIPS for the word2vec paper, Mikolov made a public announcement.[4] In it he asserted that the idea of neural sequence-to-sequence translation was his, conceived before he joined Google, and that he had mentioned it to Ilya Sutskever and Quoc Le and discussed it with them many times. He accused them of publishing their seq2seq paper without acknowledging him.

In 2019, Facebook announced its use in symbolic integration and resolution of differential equations. The company claimed that it could solve complex equations more rapidly and with greater accuracy than commercial solutions such as Mathematica, MATLAB and Maple. First, the equation is parsed into a tree structure to avoid notational idiosyncrasies. An LSTM neural network then applies its standard pattern recognition facilities to process the tree.[5]

In 2020, Google released Meena, a 2.6-billion-parameter seq2seq-based chatbot trained on a 341 GB data set. Google claimed that the chatbot had 1.7 times the model capacity of OpenAI's GPT-2.[6] GPT-2's May 2020 successor, the 175-billion-parameter GPT-3, was trained on a "45TB dataset of plaintext words (45,000 GB) that was ... filtered down to 570 GB."[7]

In 2022, Amazon introduced AlexaTM 20B, a moderate-sized (20 billion parameter) seq2seq language model. It uses an encoder-decoder to accomplish few-shot learning. The encoder outputs a representation of the input that the decoder uses as input to perform a specific task, such as translating the input into another language. The model outperforms the much larger GPT-3 in language translation and summarization. Training mixes denoising (appropriately inserting missing text in strings) and causal-language-modeling (meaningfully extending an input text). It allows adding features across different languages without massive training workflows. AlexaTM 20B achieved state-of-the-art performance in few-shot-learning tasks across all Flores-101 language pairs, outperforming GPT-3 on several tasks.[8]


Architecture

A seq2seq model is composed of an encoder and a decoder, typically implemented as recurrent neural networks (RNNs). The encoder captures the context of the input sequence and passes it to the decoder, which then produces the final output sequence.[9]
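The encoder-decoder structure can be sketched as follows. This is a minimal illustration in NumPy, not an implementation of any particular paper: the layer sizes are arbitrary, the weights are randomly initialized stand-ins for trained parameters, and the output projection is simplified.

```python
import numpy as np

rng = np.random.default_rng(0)
H, E = 8, 4  # hidden and embedding sizes (illustrative choices)

# Hypothetical parameters for single-layer vanilla-RNN encoder and decoder.
W_enc = rng.normal(scale=0.1, size=(H, H + E))
W_dec = rng.normal(scale=0.1, size=(H, H + E))

def rnn_step(W, h, x):
    # One vanilla-RNN step: h' = tanh(W [h; x]).
    return np.tanh(W @ np.concatenate([h, x]))

def encode(xs):
    # Run the encoder over the input sequence; the final hidden state
    # summarizes ("captures the context of") the whole input.
    h = np.zeros(H)
    for x in xs:
        h = rnn_step(W_enc, h, x)
    return h

def decode(context, steps):
    # The decoder starts from the encoder's context and emits one
    # output vector per step (no attention in this sketch).
    h, y = context, np.zeros(E)
    outputs = []
    for _ in range(steps):
        h = rnn_step(W_dec, h, y)
        y = h[:E]  # stand-in for an output projection + embedding lookup
        outputs.append(y)
    return outputs

src = [rng.normal(size=E) for _ in range(5)]  # a 5-token input sequence
out = decode(encode(src), steps=3)            # a 3-element output sequence
```

Note that the input and output sequence lengths are independent, which is the point of the architecture: the fixed-size context decouples them.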


Encoder

The encoder is responsible for processing the input sequence and capturing its essential information, which is stored as the hidden state of the network and, in a model with an attention mechanism, as a context vector. The context vector is the weighted sum of the input hidden states and is computed for every time step of the output sequence.
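The weighted sum described above is a simple operation; with illustrative hidden states and weights for one decoder time step it looks like this:

```python
import numpy as np

# Encoder hidden states for a 4-token input (illustrative values).
hidden_states = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [1.0, 1.0],
                          [0.5, 0.5]])

# Attention weights for one decoder time step (they sum to 1).
weights = np.array([0.1, 0.2, 0.3, 0.4])

# Context vector = weighted sum of the encoder hidden states.
context = weights @ hidden_states  # → array([0.6, 0.7])
```

A new set of weights, and hence a new context vector, is computed at every output time step.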


Decoder

The decoder takes the context vector and hidden states from the encoder and generates the final output sequence. It operates autoregressively, producing one element of the output sequence at a time. At each step, it considers the previously generated elements, the context vector, and the input sequence information to predict the next element of the output sequence. Specifically, in a model with an attention mechanism, the context vector and the decoder hidden state are concatenated to form an attention hidden vector, which is used as input for the decoder.[10]
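The concatenation step can be sketched as below. The projection matrix `W_c` is a hypothetical learned parameter introduced for illustration; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 2  # hidden size (illustrative)

context = np.array([0.6, 0.7])  # context vector from the attention mechanism
h_dec = np.array([0.2, -0.1])   # current decoder hidden state

# Hypothetical learned projection applied to the concatenation.
W_c = rng.normal(scale=0.5, size=(H, 2 * H))

# Attention hidden vector: project the concatenated [context; state]
# through a tanh nonlinearity.
attn_hidden = np.tanh(W_c @ np.concatenate([context, h_dec]))
```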

Attention mechanism

The attention mechanism is an enhancement introduced by Bahdanau et al. in 2014[11] to address a limitation of the basic seq2seq architecture: for longer input sequences, the encoder's hidden state output becomes an increasingly inadequate summary for the decoder. It enables the model to selectively focus on different parts of the input sequence during decoding. At each decoder step, an alignment model computes an attention score from the current decoder state and each of the encoder hidden states. The alignment model is a separate neural network trained jointly with the seq2seq model; it measures how well an input, represented by an encoder hidden state, matches the output produced so far, represented by the decoder state. A softmax function is then applied to the attention scores to obtain the attention weights.
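A Bahdanau-style alignment model is a small feed-forward network, score(s, h) = v · tanh(W_s s + W_h h), whose outputs are normalized with a softmax. The sketch below uses random stand-ins for the jointly trained parameters `W_s`, `W_h`, and `v`:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 4  # hidden size (illustrative)

# Stand-ins for the alignment model's jointly trained parameters.
W_s = rng.normal(scale=0.5, size=(H, H))
W_h = rng.normal(scale=0.5, size=(H, H))
v = rng.normal(scale=0.5, size=H)

def attention_weights(s, encoder_states):
    # Score each encoder hidden state h against the current decoder
    # state s, then softmax the scores into weights that sum to 1.
    scores = np.array([v @ np.tanh(W_s @ s + W_h @ h) for h in encoder_states])
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

enc = [rng.normal(size=H) for _ in range(6)]  # 6 encoder hidden states
w = attention_weights(rng.normal(size=H), enc)
```

The resulting weights are the coefficients of the weighted sum that forms the context vector for that decoder step.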

In some models, the encoder states are fed directly into an activation function, removing the need for an alignment model. The activation function receives one decoder state and one encoder state and returns a scalar value of their relevance.[12]
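The simplest such function is the dot product, which requires no learned parameters at all:

```python
import numpy as np

def dot_score(decoder_state, encoder_state):
    # Parameter-free relevance function: the dot product of the two
    # states, returned as a scalar.
    return float(decoder_state @ encoder_state)

s = np.array([1.0, 2.0])
h = np.array([3.0, 4.0])
score = dot_score(s, h)  # → 11.0
```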

Related software

Software adopting similar approaches includes OpenNMT (Torch), Neural Monkey (TensorFlow) and NEMATUS (Theano).[13]

References

  1. ^ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (2014). "Sequence to sequence learning with neural networks". arXiv:1409.3215 [cs.CL].
  2. ^ a b Wadhwa, Mani (2018-12-05). "seq2seq model in Machine Learning". GeeksforGeeks. Retrieved 2019-12-17.
  3. ^ p. 94 of,
  4. ^
  5. ^ "Facebook has a neural network that can do advanced math". MIT Technology Review. December 17, 2019. Retrieved 2019-12-17.
  6. ^ Mehta, Ivan (2020-01-29). "Google claims its new chatbot Meena is the best in the world". The Next Web. Retrieved 2020-02-03.
  7. ^ Gage, Justin. "What's GPT-3?". Retrieved August 1, 2020.
  8. ^ Rodriguez, Jesus. "🤘Edge#224: AlexaTM 20B is Amazon's New Language Super Model Also Capable of Few-Shot Learning". Retrieved 2022-09-08.
  10. ^ Dugar, Pranay. "Attention — Seq2Seq Models". Retrieved 2023-12-20.
  11. ^ p. 1 of,
  12. ^ Voita, Lena. "Sequence to Sequence (seq2seq) and Attention". Retrieved 2023-12-20.
  13. ^ "Overview - seq2seq". Retrieved 2019-12-17.