Part of a series on |

Machine learning and data mining |
---|

**Gated recurrent unit**s (**GRU**s) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.^{[1]} The GRU is like a long short-term memory (LSTM) with a forget gate,^{[2]} but has fewer parameters than LSTM, as it lacks an output gate.^{[3]}
GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of LSTM.^{[4]}^{[5]} GRUs have been shown to exhibit better performance on certain smaller and less frequent datasets.^{[6]}^{[7]}

There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called minimal gated unit.^{[8]}

The operator denotes the Hadamard product in the following.

Initially, for , the output vector is .

Variables

- : input vector
- : output vector
- : candidate activation vector
- : update gate vector
- : reset gate vector
- , and : parameter matrices and vector

- : The original is a sigmoid function.
- : The original is a hyperbolic tangent.

Alternative activation functions are possible, provided that .

Alternate forms can be created by changing and ^{[9]}

- Type 1, each gate depends only on the previous hidden state and the bias.
- Type 2, each gate depends only on the previous hidden state.
- Type 3, each gate is computed using only the bias.

The minimal gated unit is similar to the fully gated unit, except the update and reset gate vector is merged into a forget gate. This also implies that the equation for the output vector must be changed:^{[10]}

Variables

- : input vector
- : output vector
- : candidate activation vector
- : forget vector
- , and : parameter matrices and vector

Content Adaptive Recurrent Unit (CARU) is a variant of GRU, introduced in 2020 by Ka-Hou Chan et al.^{[11]} The CARU contain the update gate like GRU, but introduce a content-adaptive gate instead of the reset gate. CARU is design to alleviate the long-term dependence problem of RNN models. It was found to have a slight performance improvement on NLP task and also has fewer parameters than the GRU.^{[12]}

In the following equations, the lowercase variables represent vectors and denote the training parameters, which are linear layers consisting of weights and biases, respectively. Initially, for , CARU directly returns ; Next, for , a complete architecture is given by:

Note that it has at the end of each recurrent loop. The operator denotes the Hadamard product, and denotes the activation function of sigmoid and hyperbolic tangent, respectively.

- : It first projects the current word into as the input feature. This result would be used in the next hidden state and passed to the proposed content-adaptive gate.
- : Compare to GRU, the reset gate has been taken out. It just combines the parameters related to and to produce a new hidden state .
- : It is the same as the update gate in GRU and is used to the transition of the hidden state.
- : There is a Hadamard operator to combine the update gate with the weight of current feature. This gate had been named as content-adaptive gate, which will influence the amount of gradual transition, rather than diluting the current hidden state.
- : The next hidden state is combined with and .

Another feature of CARU is the weighting of hidden states according to the current word and the introduction of content-adaptive gates, instead of using reset gates to alleviate the dependence on long-term content. There are three trunks of data flow to be processed by CARU:

**content-state**: It produces a new hidden state achieved by a linear layer, this part is equivalent to simple RNN networks.

**word-weight**: It produces the weight of the current word, it has the capability like a GRU reset gate but is only based on the current word instead of the entire content. More specifically, it can be considered as the tagging task that connects the relation between the weight and parts-of-speech.**content-weight**: It produces the weight of the current content, the form is the same as a GRU update gate but with the purpose to overcome the long-term dependence.

In contrast to GRU, CARU does not intend to process those data flow, instead dispatch the word-weight to the content-adaptive gate and multiplies it with the content-weight. In this way, the content-adaptive gate considers of both the word and the content.