A review of Dropout as applied to RNNs part 2.
Results and analysis.
In part 1 of this post I gave a brief history of dropout as applied to recurrent neural networks (RNNs). In part 2 I present the results of an analysis of how much each dropout type matters for RNN performance.
The dropout parameters tested here are the five dropout types applied in AWD-LSTM for language modelling tasks, and the four dropout types applied to different parts of a sequence-to-sequence attention model used in translation tasks.
Language modelling was conducted on the Penn Treebank and WikiText-2 datasets, while the translation task was French-to-English translation. The source code for the background to the translation modelling can be found here.
The two figures below show the sensitivity of validation loss to variation of each dropout parameter, with all other dropout parameters set to their default values. The loss plotted is the final value after 100 epochs.
It is evident that in the original awd-lstm-lm implementation of AWD-LSTM, a weight drop > 0.6 causes a dramatic increase in validation loss.
In the results below, each dropout parameter in turn was varied between 0 and 0.9, with all other dropout parameters set to zero.
Again we see a marked loss of performance for the original awd-lstm-lm implementation of AWD-LSTM where weight drop is > 0.6.
I repeated the analyses, but running for only 12 epochs, as this is around the number of epochs often used in preliminary analysis. Note that the results below start at a dropout of 0.1 (not 0 as above) for all parameters.
Differences between the awd-lstm-lm and fastai implementations of Weight Drop manifest as a clear difference in results at wd ≥ 0.7; all other parameter variations exhibit similar loss sensitivity.
Each dropout parameter is described below.
Embedding dropout (dropoute, abbreviated here as de) applies dropout to remove words from the embedding layer. “Following Gal & Ghahramani (2016), we employ embedding dropout. This is equivalent to performing dropout on the embedding matrix at a word level, where the dropout is broadcast across all the word vector’s embedding. The remaining non-dropped-out word embeddings are scaled by 1−pe where pe is the probability of embedding dropout. As the dropout occurs on the embedding matrix that is used for a full forward and backward pass, this means that all occurrences of a specific word will disappear within that pass, equivalent to performing variational dropout on the connection between the one-hot embedding and the embedding lookup.” Merity et al. (2017)
The code for embedded dropout from the awd-lstm-lm codebase can be seen below.
def embedded_dropout(embed, words, dropout=0.1, scale=None):
    if dropout:
        # One Bernoulli draw per vocabulary row, broadcast across the
        # embedding dimension; surviving rows are scaled by 1/(1 - dropout)
        mask = embed.weight.data.new().resize_((embed.weight.size(0), 1)).bernoulli_(1 - dropout).expand_as(embed.weight) / (1 - dropout)
        mask = Variable(mask)
        masked_embed_weight = mask * embed.weight
    else:
        masked_embed_weight = embed.weight
    if scale:
        masked_embed_weight = scale.expand_as(masked_embed_weight) * masked_embed_weight

    padding_idx = embed.padding_idx
    if padding_idx is None:
        padding_idx = -1

    X = embed._backend.Embedding.apply(words, masked_embed_weight,
        padding_idx, embed.max_norm, embed.norm_type,
        embed.scale_grad_by_freq, embed.sparse)
    return X
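To make the word-level semantics concrete, here is a minimal torch-free sketch of the same idea (the function name and list-of-lists representation are my own, not the awd-lstm-lm code): one Bernoulli draw per vocabulary row decides whether every occurrence of that word disappears for the whole pass.

```python
import random

def embedding_dropout(emb_matrix, p, seed=None):
    """Word-level embedding dropout sketch: one Bernoulli draw per
    vocabulary row, broadcast across the whole embedding vector.
    Surviving rows are scaled by 1/(1 - p) so the expected embedding
    is unchanged; dropped rows zero out every occurrence of that word."""
    rng = random.Random(seed)
    keep = 1.0 - p
    out = []
    for row in emb_matrix:
        scale = (1.0 / keep) if rng.random() < keep else 0.0
        out.append([scale * x for x in row])
    return out
```

With p = 0.5, each row is either zeroed entirely or doubled, never partially dropped, which is exactly the word-level (rather than element-level) behaviour described in the quote above.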
This dropout has the most significant impact on network performance (excepting the tail-end Weight Drop behaviour on AWD-LSTM).
Input dropout (dropouti, abbreviated di) is applied in AWD-LSTM immediately after embedding dropout in RNNModel(). Hidden dropout (dropouth, abbreviated dh) is applied to the output of each layer of each RNN, and output dropout (dropout, abbreviated d) is applied to the final RNN output after the loop over the RNNs and their layers, as can be seen in the forward method of RNNModel in the awd-lstm-lm codebase.
def forward(self, input, hidden, return_h=False):
    emb = embedded_dropout(self.encoder, input, dropout=self.dropoute if self.training else 0)
    emb = self.lockdrop(emb, self.dropouti)

    raw_output = emb
    new_hidden = []
    raw_outputs = []
    outputs = []
    for l, rnn in enumerate(self.rnns):
        current_input = raw_output
        raw_output, new_h = rnn(raw_output, hidden[l])
        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1:
            raw_output = self.lockdrop(raw_output, self.dropouth)
            outputs.append(raw_output)
    hidden = new_hidden

    output = self.lockdrop(raw_output, self.dropout)
    outputs.append(output)

    result = output.view(output.size(0)*output.size(1), output.size(2))
    if return_h:
        return result, hidden, raw_outputs, outputs
    return result, hidden
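The self.lockdrop calls above apply variational (locked) dropout in the sense of Gal & Ghahramani (2016): one dropout mask is sampled per sequence and reused at every time step. A minimal torch-free sketch of that behaviour (the function name and list-of-time-steps representation are illustrative, not the LockedDropout class itself):

```python
import random

def locked_dropout(seq, p, seed=None):
    """Variational ('locked') dropout sketch: sample ONE mask over the
    feature dimension and reuse it at every time step, rather than a
    fresh mask per step as in standard dropout.
    seq is a list of time steps, each a list of feature values."""
    if p == 0.0:
        return [step[:] for step in seq]
    rng = random.Random(seed)
    keep = 1.0 - p
    mask = [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(len(seq[0]))]
    return [[m * x for m, x in zip(mask, step)] for step in seq]
```

Because the mask is fixed for the sequence, a dropped feature stays dropped at every time step, which is what distinguishes this from ordinary per-step dropout.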
These three dropout parameters (di, dh, d) all generate a similar loss response as they are varied in turn, with input dropout showing the least loss variation across di values, as can be seen below.
Weight Drop (wdrop, abbreviated wd) sets the amount of dropout to apply to the RNN hidden-to-hidden weight matrix.
In the awd-lstm-lm codebase, each RNN is wrapped in a WeightDrop class during the initialization of the RNNModel.
self.rnns = [WeightDrop(rnn, ['weight_hh_l0'], dropout=wdrop) for rnn in self.rnns]
Note how the hidden-to-hidden weights (weight_hh_l0) are the only weights this is applied to. The forward method in the WeightDrop class calls a _setweights() method, which applies dropout to a 'raw' copy of the weights.
The Weight Drop parameter (wdrop) is a DropConnect (Wan et al. 2013) type of dropout (as briefly summarized in part 1) in which, rather than dropping activations, randomly selected weights themselves are dropped from the network during training.
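A minimal torch-free sketch of the DropConnect idea (the helper name is hypothetical, not the WeightDrop class): the untouched 'raw' weights are kept aside, and a freshly dropped copy of the weight matrix is produced for each forward pass, so it is weights rather than activations that are zeroed.

```python
import random

def drop_connect(raw_weights, p, training=True, seed=None):
    """DropConnect sketch: zero each weight independently with
    probability p, scaling the survivors by 1/(1 - p).
    raw_weights (a list of rows) is the retained 'raw' copy; a fresh
    dropped version would be generated on every forward pass."""
    if not training or p == 0.0:
        return [row[:] for row in raw_weights]
    rng = random.Random(seed)
    keep = 1.0 - p
    return [[(w / keep) if rng.random() < keep else 0.0 for w in row]
            for row in raw_weights]
```

At evaluation time (training=False) the raw weights are used unchanged, matching how WeightDrop only perturbs the weights during training.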
With high weight drop for the original implementation of AWD-LSTM, loss values are off the charts in the figure above.
In the figures below I show the results running up to 12 epochs — a ‘zoomed in’ version of the plots above showing generally the fastest changing part of the loss vs epoch plot.
The most striking feature of the loss results as the weight drop parameter is varied is the presence of two curve types, with a sudden shift in network performance at around 70% dropout. This response is not seen when running the same tasks on the fastai implementation of the AWD-LSTM code. A more detailed analysis of the reasons for this difference may be the subject of a follow-up post.
Other parameters set to zero
Re-running the experiments for 100 epochs with the other dropouts set to zero shows some interesting differences compared with using defaults for the other dropouts.
It is apparent that using just a single dropout type (so far only output dropout (d), embedding dropout (de), hidden dropout (dh) and weight drop (wd) have been tested) results in deteriorating loss values after 10–20 epochs in most cases. If using AWD-LSTM, running 12 epochs and applying dropout only to embeddings, with all other dropouts set to zero, one might conclude that embedding dropout (de) should be set relatively low, as per Fig. 6 above. Whereas best performance in this particular case would be with an embedding dropout of 0.5, with validation loss being quite sensitive to changes in embedding dropout.
In the tests below we see that, as with the awd-lstm-lm codebase, when a single dropout parameter is used in the fastai implementation of AWD-LSTM with the others set to zero, most cases lead to increasing loss after 20 or so epochs.
Preliminary findings of this work are that, for language modelling with AWD-LSTM, embedding dropout produces the greatest range in resultant loss, while input dropout has the least impact. Large Weight Drop values should not be used with the awd-lstm-lm codebase.
Sequence to Sequence Attention model for translation tasks
In this section I move from language modelling to dropout analysis for a translation task.
Here the task is to translate from one language to another, using FastText pre-trained word vectors for both languages and the Shared Task: Machine Translation French-English training dataset (http://www.statmt.org/wmt15/translation-task.html).
Example question pairs in English and French are:
‘What is light ?’, ‘Qu’est-ce que la lumière?’
‘Who are we?’, ‘Où sommes-nous?’
In these tests the following dropout parameters were applied:
RNN Encoder Dropout (red): single dropout parameter applied to the RNN_Encoder. The dropout is applied as a default parameter for CudnnRNN()
self.rnn_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=rnn_enc_drop)
RNN Decoder Dropout (rdd): single dropout parameter applied to the RNN_Decoder.
self.rnn_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=rnn_dec_drop)
Embedding Encoder Dropout (eed): single dropout parameter applied to input embeddings in the forward method of a sequence to sequence RNN (Seq2SeqAttnRNN in fastai code)
self.emb_enc_drop = nn.Dropout(emb_enc_drop)
... emb = self.emb_enc_drop(self.emb_enc(inp))
Output Dropout (od): single dropout parameter applied to the output in the forward method, within a loop iterating over the target-language sequence length.
self.out_drop = nn.Dropout(out_drop)
for i in range(self.out_sl):
... outp = self.out(self.out_drop(outp))
It is interesting to view the rate of change of the validation loss with epoch, as per below, which shows we need to run a minimum of 7 or 8 epochs for the change in validation loss to settle.
The aim of this work was to get a feel for how different dropout types and how the amount of dropout applied for each type affects validation loss with recurrent neural networks. By understanding the effects of each parameter I can now focus on optimizing those with the greatest effect.
Default dropout values (where defaults were used) for AWD-LSTM (awd-lstm-lm implementation) language modelling:
dropout = 0.4
dropouth = 0.2
dropouti = 0.65
dropoute = 0.1
wdrop = 0.5
Default dropout values for fastai code (where defaults were used), language modelling:
drops_scalar = 0.7
dropouti = 0.25*drops_scalar
dropout = 0.1*drops_scalar
dropoute = 0.02*drops_scalar
dropouth = 0.15*drops_scalar
wdrop = 0.2*drops_scalar
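For reference, the fastai-style scaled defaults above work out to the following absolute values (the dict name is my own; the values come straight from the list above):

```python
drops_scalar = 0.7

# fastai language-modelling defaults, expressed as absolute dropout values
fastai_defaults = {
    'dropouti': 0.25 * drops_scalar,  # 0.175
    'dropout':  0.10 * drops_scalar,  # 0.07
    'dropoute': 0.02 * drops_scalar,  # 0.014
    'dropouth': 0.15 * drops_scalar,  # 0.105
    'wdrop':    0.20 * drops_scalar,  # 0.14
}
```

Note these are substantially lower than the awd-lstm-lm defaults listed further above, which is worth keeping in mind when comparing the two sets of results.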