# SELU vs RELU activation in simple NLP models

24 Jul 2017

### Background on SELU

Normalized outputs seem to be really helpful in stabilizing the training process. That's the main reason behind the popularity of Batch Normalization. SELU (scaled exponential linear unit) is a way to output normalized activations to the next layer without an explicit normalization step.

The overall function is really simple:

$$\mathrm{selu}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \leq 0 \end{cases}$$

For inputs with `mean 0` and `stdev 1`, the values of α and λ come out to be `1.6732632423543772848170429916717` and `1.0507009873554804934193349852946` respectively.

```
# PyTorch implementation
import torch.nn.functional as F

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * F.elu(x, alpha)
```

```
# Numpy implementation
import numpy as np

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * ((x > 0) * x + (x <= 0) * (alpha * np.exp(x) - alpha))
```
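
As a quick sanity check of the self-normalizing property, here is a small sketch that feeds standard-normal inputs through the numpy `selu` defined above: since α and λ are chosen precisely so that `mean 0, stdev 1` inputs map to `mean 0, stdev 1` outputs, both statistics should come out close to their input values.

```
import numpy as np

# Using the numpy selu() defined above
x = np.random.randn(10**6)  # standard-normal inputs: mean 0, stdev 1
y = selu(x)
print(y.mean(), y.std())    # both should be close to 0 and 1 respectively
```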

### SNLI dataset

The SNLI dataset is a collection of 570k English sentence pairs. The task is to classify each pair as one of:

- **entailment**: “A soccer game with multiple males playing.” and “Some men are playing a sport.”
- **contradiction**: “A man inspects the uniform of a figure in some East Asian country.” and “The man is sleeping.”
- **neutral**: “A smiling costumed woman is holding an umbrella.” and “A happy woman in a fairy costume holds an umbrella.”
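
The dataset ships as JSONL files, one example per line. A minimal loading sketch (the field names `sentence1`, `sentence2`, and `gold_label` are from the official SNLI 1.0 distribution; the file name is the one it ships with):

```
import json

LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}

def load_snli(path):
    pairs = []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            # Examples without annotator consensus have gold_label '-'; skip them
            if ex['gold_label'] not in LABELS:
                continue
            pairs.append((ex['sentence1'], ex['sentence2'], LABELS[ex['gold_label']]))
    return pairs

train = load_snli('snli_1.0_train.jsonl')
```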

### Model Architecture

I am using a simple bag-of-words model written in Keras. The following Python snippet describes the major components. This code is taken from Stephen Merity's repo.

```
# Imports (Keras 2 API)
from keras.layers import Input, Dense, Dropout, Embedding, TimeDistributed, Lambda, BatchNormalization, concatenate
from keras.regularizers import l2
from keras import backend as K

# Embedding layer initialized with pre-trained vectors and frozen
embed = Embedding(VOCAB, EMBED_HIDDEN_SIZE, weights=[embedding_matrix], input_length=MAX_LEN, trainable=False)
# A dense layer applied over each sequence point
translate = TimeDistributed(Dense(SENT_HIDDEN_SIZE, activation=ACTIVATION))
# A layer to sum up the sequence of word vectors
rnn = Lambda(lambda x: K.sum(x, axis=1), output_shape=(SENT_HIDDEN_SIZE, ))
# The 2 input sentences of each pair
premise = Input(shape=(MAX_LEN, ), dtype='int32')
hypothesis = Input(shape=(MAX_LEN, ), dtype='int32')
# Get the word embeddings for each of the 2 sentences
prem = embed(premise)
hypo = embed(hypothesis)
# Apply the dense layer
prem = translate(prem)
hypo = translate(hypo)
# Sum up each sequence
prem = rnn(prem)
hypo = rnn(hypo)
prem = BatchNormalization()(prem)
hypo = BatchNormalization()(hypo)
# Combine the 2 sentence representations
joint = concatenate([prem, hypo])
joint = Dropout(DP)(joint)
# Add a few dense layers at the end
for i in range(3):
    joint = Dense(2 * SENT_HIDDEN_SIZE, activation=ACTIVATION, kernel_regularizer=l2(L2))(joint)
    joint = Dropout(DP)(joint)
    joint = BatchNormalization()(joint)
```
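
For completeness, here is a minimal sketch of how the snippet above might be wrapped into a trainable model. The 3-way softmax head, optimizer choice, and the input/label variable names are assumptions on my part, not taken verbatim from the repo:

```
from keras.models import Model

# 3-way softmax over entailment / contradiction / neutral
pred = Dense(3, activation='softmax')(joint)

model = Model(inputs=[premise, hypothesis], outputs=pred)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# premise_ids / hypothesis_ids: padded word-id matrices of shape (N, MAX_LEN)
# labels: one-hot matrix of shape (N, 3)
model.fit([premise_ids, hypothesis_ids], labels, batch_size=512, epochs=10)
```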

### SELU vs RELU results

Code for this exercise is available in this repo.

`RELU` is clearly converging much faster than `SELU`. My first instinct was to remove the `BatchNormalization` layers and do the same comparison. The following graph shows the comparison after **removing the BatchNorm components**.

Still, `RELU` seems to be doing a much better job than `SELU` for the default configuration.

This behavior remains more or less the same across the hyperparameter settings I tried. The following graph is for one of those configurations.

(Edit: I incorporated the suggestion from Dan Ofer below and included the graph with `AlphaDropout`.)
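
`AlphaDropout` is the dropout variant proposed alongside SELU in the self-normalizing networks paper; unlike regular dropout, it is designed to keep the mean and variance of the activations intact. Swapping it in is a one-line change in the snippet above (a sketch, assuming Keras 2.0.6+ where `AlphaDropout` is available):

```
from keras.layers import AlphaDropout

# Replace Dropout(DP) with AlphaDropout(DP) wherever SELU is the activation
joint = AlphaDropout(DP)(joint)
```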

To be fair, it is still possible that `SELU` is better in some configurations. Some of the possible reasons are listed below. However, it is clear to me that simply replacing `RELU` with `SELU` isn't going to improve your existing models.

The `SELU` authors recommend a specific initialization scheme for it to be effective.
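
In Keras this corresponds to the `lecun_normal` initializer (weights drawn with zero mean and variance 1/fan_in), which the self-normalizing analysis assumes. A sketch of a dense layer set up this way:

```
# SELU works best with lecun_normal initialization (zero mean, 1/fan_in variance)
joint = Dense(2 * SENT_HIDDEN_SIZE, activation='selu',
              kernel_initializer='lecun_normal')(joint)
```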

Additionally, `SELU` is a bit more computationally expensive than `RELU`. On a g2.2xlarge EC2 instance, the `RELU` model took about 49 seconds to complete an epoch, while the `SELU` model took 65 seconds to do the same (33% more).
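
The gap is easy to reproduce in isolation. A rough micro-benchmark sketch, reusing the PyTorch `selu` defined earlier (the tensor size and repetition count are arbitrary, and numbers will vary with hardware):

```
import timeit
import torch
import torch.nn.functional as F

x = torch.randn(1000, 1000)

# Compare the forward pass of relu vs the selu() from the PyTorch snippet above
relu_time = timeit.timeit(lambda: F.relu(x), number=1000)
selu_time = timeit.timeit(lambda: selu(x), number=1000)
print(relu_time, selu_time)
```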