• Recurrent Neural Networks (RNNs) - These are a special type of neural network. They are generally used for text, i.e. NLP tasks.

  • Long Short-Term Memory (LSTM) is a variation of a vanilla RNN.

  • We will cover RNNs in Part 1.

  • LSTMs will be covered in Part 2.

  • This blog post should be read alongside Chapter 12 of the FastAI book.

Understanding the toy data of English Text for numbers

This is a toy dataset from FastAI which contains the English text for the numbers from "one" to "nine thousand nine hundred ninety nine".

from fastbook import *
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)

There are two files, train.txt and valid.txt, as shown below.

path.ls()
(#2) [Path('/root/.fastai/data/human_numbers/train.txt'),Path('/root/.fastai/data/human_numbers/valid.txt')]

Let's load the train and valid files into fastai's enhanced list class, L(). First, the train file:

train_lines = L()
with open(path/'train.txt') as f: train_lines += L(*f.readlines())
train_lines
(#7999) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]
type(train_lines)
fastcore.foundation.L
train_lines[-1:]
(#1) ['seven thousand nine hundred ninety nine \n']

We can see that the last element of this train list is seven thousand nine hundred ninety nine

Now, we load the valid file similarly.

valid_lines = L()
with open(path/'valid.txt') as f: valid_lines += L(*f.readlines())
valid_lines
(#1999) ['eight thousand one \n','eight thousand two \n','eight thousand three \n','eight thousand four \n','eight thousand five \n','eight thousand six \n','eight thousand seven \n','eight thousand eight \n','eight thousand nine \n','eight thousand ten \n'...]
valid_lines[-1:]
(#1) ['nine thousand nine hundred ninety nine \n']

In the above cells, we can see that the valid file has the numbers from "eight thousand one" to "nine thousand nine hundred ninety nine".

Let's load both files into a single list as shown below.

lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines
(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

lines now has data from both train.txt and valid.txt.

We will strip the extra spaces and newline characters from each number's text and then concatenate the numbers with " . " (space, period, space) in between, as done below.

text = ' . '.join([l.strip() for l in lines])
text[:100]
'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

Tokenization and Numericalization

Let's split the space-separated text into tokens. Note that we keep the "." as a token as well.

tokens = text.split(' ')
tokens[:10]
['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']
type(tokens)
list

There are around 63K tokens.

len(tokens)
63095

Below are the first 15 tokens

tokens[0:15]
['one',
 '.',
 'two',
 '.',
 'three',
 '.',
 'four',
 '.',
 'five',
 '.',
 'six',
 '.',
 'seven',
 '.',
 'eight']

Below are the last 15 tokens

tokens[-15:]
['seven',
 '.',
 'nine',
 'thousand',
 'nine',
 'hundred',
 'ninety',
 'eight',
 '.',
 'nine',
 'thousand',
 'nine',
 'hundred',
 'ninety',
 'nine']

However, there are only 30 unique values. Come to think of it, this is the text for the numbers 1 to 9999, which consists of:

  • "one" to "twenty" (20 unique),

  • "thirty", "forty" and so on up to "ninety" (7 unique),

  • "hundred" (1 unique),

  • "thousand" (1 unique), and

  • "." (1 unique),

which totals to 30.

len(np.unique(np.array(tokens)))
30
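
As a quick sanity check of this breakdown (the grouping below is my own addition, not part of the original notebook), we can split the unique tokens into the three groups described above:

unique_tokens = set(tokens)
higher_tens = {'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety'}
other = {'hundred', 'thousand', '.'}
one_to_twenty = unique_tokens - higher_tens - other   # "one" .. "twenty"
len(one_to_twenty), len(higher_tens), len(other)      # expected: (20, 7, 3) -> totals 30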

Below is a fastai L (enhanced list class) implementation of the same thing we did above.

We create a vocab of the unique tokens.

vocab = L(*tokens).unique()
vocab
(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]
len(vocab)
30

Printing the enumeration of the vocab below

for i,w in enumerate(vocab): print ('i = ' +str(i)+ ' and w = '+str(w))
i = 0 and w = one
i = 1 and w = .
i = 2 and w = two
i = 3 and w = three
i = 4 and w = four
i = 5 and w = five
i = 6 and w = six
i = 7 and w = seven
i = 8 and w = eight
i = 9 and w = nine
i = 10 and w = ten
i = 11 and w = eleven
i = 12 and w = twelve
i = 13 and w = thirteen
i = 14 and w = fourteen
i = 15 and w = fifteen
i = 16 and w = sixteen
i = 17 and w = seventeen
i = 18 and w = eighteen
i = 19 and w = nineteen
i = 20 and w = twenty
i = 21 and w = thirty
i = 22 and w = forty
i = 23 and w = fifty
i = 24 and w = sixty
i = 25 and w = seventy
i = 26 and w = eighty
i = 27 and w = ninety
i = 28 and w = hundred
i = 29 and w = thousand

Using the above, we can create a dictionary mapping tokens to their indices.

word2idx = {w:i for i,w in enumerate(vocab)}
word2idx
{'.': 1,
 'eight': 8,
 'eighteen': 18,
 'eighty': 26,
 'eleven': 11,
 'fifteen': 15,
 'fifty': 23,
 'five': 5,
 'forty': 22,
 'four': 4,
 'fourteen': 14,
 'hundred': 28,
 'nine': 9,
 'nineteen': 19,
 'ninety': 27,
 'one': 0,
 'seven': 7,
 'seventeen': 17,
 'seventy': 25,
 'six': 6,
 'sixteen': 16,
 'sixty': 24,
 'ten': 10,
 'thirteen': 13,
 'thirty': 21,
 'thousand': 29,
 'three': 3,
 'twelve': 12,
 'twenty': 20,
 'two': 2}

Let's map each token to its index using this dictionary and collect the results into a fastai L.

nums = L(word2idx[i] for i in tokens)
nums
(#63095) [0,1,2,1,3,1,4,1,5,1...]
type(nums)
fastcore.foundation.L

We have now numericalized our tokens. This can be passed into a neural net.
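
As a quick check (my own addition, not from the original notebook), we can decode the first few indices back through the vocab and confirm we recover the start of the text:

' '.join(vocab[i] for i in nums[:10])   # expected: 'one . two . three . four . five .'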

Our First Language Model from Scratch

Let's say this model will take a sequence of 3 words as input and predict the fourth word.

For example:

  • Do you know ----- the

  • you know the ----- answer

  • know the answer ----- to

  • the answer to ----- this

  • answer to this ----- question

and so on..

Let's create a list for our data in this fashion.

The range(0, len(tokens)-4, 3), i.e. range(start, stop, step), is written this way so that we don't go past the end of the list.
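
A quick sanity check (my own addition) that the last starting index produced by this range still leaves room for the target word:

last_start = list(range(0, len(tokens)-4, 3))[-1]
last_start + 3 < len(tokens)   # expected: True, so tokens[last_start+3] is a valid target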

L((tokens[i:i+3], tokens[i+3]) for i in range(0,len(tokens)-4,3))
(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

Above is the list we created for predicting the fourth word when the input is a sequence of three words.

We will create a Tensor version of the same below

seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
seqs
(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]

Let's create DataLoaders and split the data into train and valid datasets in an 80:20 ratio.

len(seqs)
21031
int(len(seqs) * 0.8)
16824
cut = int(len(seqs) * 0.8)

The batch size is 64 and we set shuffle=False because this is text data and we want to keep the sequences in proper order. The text of any language makes sense only when the words are in a proper sequence.

bs = 64
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)
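
As a quick sanity check (my own addition, not in the original notebook), we can grab one batch and look at its shapes; x should be bs x 3 and y should be bs:

xb, yb = first(dls.train)   # first() comes from fastcore, already imported via fastai
xb.shape, yb.shape          # expected: (torch.Size([64, 3]), torch.Size([64]))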

A simple Language Model in PyTorch

This Model has 3 layers.

  • Input to Hidden - Embedding of size (30 (vocab size in this example), n_hidden)
  • Hidden to Hidden - Linear of size (n_hidden, n_hidden)
  • Hidden to Output - Linear of size (n_hidden, 30 (vocab size in this example))

See the comments in the code below.

  • The first word of each x is passed to the input-to-hidden layer

  • The output of that is passed to the hidden-to-hidden layer and then through a ReLU

  • This becomes the hidden state after the first word (the activations that are updated at each step of a recurrent neural network are called the hidden state)

  • This hidden state is then added to the embedding of the second word to generate a new hidden state.

  • The same process repeats for the third word

  • Finally, the hidden-to-output layer produces the prediction for the fourth word.

class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.input_to_hidden = nn.Embedding(vocab_sz, n_hidden)  
        self.hidden_to_hidden = nn.Linear(n_hidden, n_hidden)     
        self.hidden_to_output = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = F.relu(self.hidden_to_hidden(self.input_to_hidden(x[:,0]))) #First word
        h = h + self.input_to_hidden(x[:,1]) # Second word + Activations of previous layer
        h = F.relu(self.hidden_to_hidden(h)) #  Non Linearity
        h = h + self.input_to_hidden(x[:,2]) # Third word + Activations of previous layer
        h = F.relu(self.hidden_to_hidden(h)) # Non Linearity
        return self.hidden_to_output(h) # Predicted fourth word
len(vocab)
30

Let's create a Learner, pass in the DataLoaders and train for a few epochs.

learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.824297 1.970941 0.467554 00:02
1 1.386973 1.823242 0.467554 00:01
2 1.417556 1.654497 0.494414 00:01
3 1.376440 1.650849 0.494414 00:01

The most common token is 'thousand'; if we always predicted it as the 4th word, we would get an accuracy of only about 15%, as shown below.

Our model above fares much better than this baseline.

n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], str((counts[idx].item()/n)*100)+' %'
(tensor(29), 'thousand', '15.165200855716662 %')

The above model rewritten as a Recurrent Neural Network

  • An RNN can be thought of as a looping neural network.
  • We have added a for loop and refactored the code to reduce the number of lines.
  • This is an RNN in its simplest form.
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        
    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.816274 1.964143 0.460185 00:01
1 1.423805 1.739964 0.473259 00:01
2 1.430327 1.685172 0.485382 00:02
3 1.388390 1.657033 0.470406 00:01

Improving the RNN

There are two problems with the simple RNN built above:

  • h is set to 0 in the forward method, so no information is stored about previous sequences. We predict the 4th word using only the current 3 words as input, and whatever was learned from earlier sequences is thrown away every time a new three-word sequence is processed.

  • Predicting the 4th word was only our toy requirement. We should be able to predict the next word after every word we feed in.

To fix the first problem, we can move h = 0 into __init__, as done below. But there is one issue with setting h = 0 only once and removing it from forward: the computation graph of h would keep growing across batches and can cause memory issues. The detach call we see below stops the gradient calculation after some duration, in this case after three layers (see the small sketch below). This is backpropagation through time (BPTT).
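
Here is a minimal standalone sketch (my own addition, not fastai code; torch is already imported via fastai) of what detach does: it cuts a tensor out of the computation graph so backpropagation stops at that point.

h = torch.zeros(1, requires_grad=True)
h = h + 1               # h is now part of a computation graph
print(h.grad_fn)        # something like <AddBackward0 object at ...>
h = h.detach()          # cut the graph here; gradients will not flow past this point
print(h.grad_fn)        # None
print(h.requires_grad)  # False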

Changes to maintain the State of an RNN

The model below implements BPTT, but it still takes 3-word inputs and predicts the 4th word.

  • Note that h is now set to 0 only in __init__.
  • Also note the detach done after coming out of the loop of 3.
  • The reset method of our model resets the hidden state at the beginning of each epoch and before each validation phase. This works with the cbs option of the Learner.
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach() # Detach the gradient and store it. 
                                 # Backpropagation will not happen for entire stream
        return out
    
    def reset(self): self.h = 0 # This is for resetting the hidden state before each epoch

There is another issue.

We cannot use the data in the same format that we had before. Why?

Well, there are a couple of things:

  • The order of the data matters for text analysis and therefore for RNNs.

  • In the previous examples h (the hidden state) was always reset to 0, so we did not care about the order of the samples; we only cared about the words within each 3-word sequence.

  • Now we are carrying the hidden state forward from one 3-word sequence to the next.

  • Since we process one batch at a time and carry the hidden state across batches, consecutive sequences cannot sit next to one another within a single batch.

  • Rather, they need to occupy the same position in consecutive batches. Let's understand this a bit further.

Look at the existing training data

seqs[:cut]
(#16824) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]

Just showing the first three below

seqs[0:3]
(#3) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1)]

These tensors correspond to

  • (['one', '.', 'two'], '.') i.e. (tensor([0, 1, 2]), 1)

  • (['.', 'three', '.'], 'four') i.e. (tensor([1, 3, 1]), 4)

  • (['four', '.', 'five'], '.') i.e. (tensor([4, 1, 5]), 1)

These have to be rearranged now across consecutive batches.

Below is a function which will help with that.

def group_chunks(ds, bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
m = len(seqs[:cut])//bs
m, cut, bs
(262, 16824, 64)

Below is a handwritten loop output of the above function to show how the items get re-arranged.

[Image: hand-worked trace of how group_chunks re-arranges the items]
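
If the picture is hard to follow, here is the same idea on a tiny made-up example (a list of 10 items and a batch size of 3, values chosen purely for illustration):

toy = L(range(10))
group_chunks(toy, 3)   # m = 10 // 3 = 3, so the last item (9) is dropped
# expected: (#9) [0,3,6,1,4,7,2,5,8]
# batch 1 -> [0, 3, 6], batch 2 -> [1, 4, 7], batch 3 -> [2, 5, 8]
# i.e. position 0 across the batches reads 0, 1, 2 -- a continuous sequence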

new_ds = group_chunks(seqs[:cut], bs)

See how the same first three values are now spread across consecutive batches, each sitting at the same position (index 0) within its batch.

new_ds[0] # Part of first batch
(tensor([0, 1, 2]), 1)
new_ds[64] # Part of second batch
(tensor([1, 3, 1]), 4)
new_ds[128] # Part of third batch
(tensor([4, 1, 5]), 1)

The length of this new_ds is less than 16824 (cut)

Why is that ?

len(new_ds.items)
16768
cut
16824

That is because 16824 is not divisible by the batch size of 64: group_chunks keeps only (16824 // 64) * 64 = 16768 items, discarding the last 56. We also pass drop_last=True to the DataLoaders so that any incomplete final batch is dropped, as seen in the cells below.

cut - len(new_ds.items) # 16824-16768
56
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs), 
    group_chunks(seqs[cut:], bs), 
    bs=bs, drop_last=True, shuffle=False)

Let's train this for 10 epochs.

learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter) # cbs calls the reset method
learn.fit_one_cycle(10, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.708551 1.870440 0.400721 00:02
1 1.256061 1.770536 0.431010 00:01
2 1.058304 1.490938 0.520673 00:01
3 0.986795 1.731392 0.525000 00:02
4 0.944511 1.744132 0.536779 00:01
5 0.921901 1.518385 0.556250 00:02
6 0.872340 1.830094 0.549760 00:01
7 0.834834 1.810466 0.554567 00:02
8 0.797066 1.921501 0.549279 00:01
9 0.777922 1.919559 0.542067 00:02

Okay, I hope there were some 'aha' moments so far.

Let's move forward.

Creating More Signal

If you see the "Improving the RNN" section above, we had this below line

Also, predicting 4th word was our toy requirement. We should be able to predict any next word given a word as input.

We still have not done that.

We will see how that can be done below


  1. First, we need to change the data so that the dependent variable contains each of the next words after each of our n input words, instead of 3 inputs and 1 output as before. We can control n through a parameter.

  2. Then, change the model so that it predicts a word after every input word instead of one word after 3 inputs.


Data Changes

We will do the data changes in the cells below.

sl (sequence length) is the parameter for n; here its value is 16.

Similar to earlier code, we use group_chunks for re-arranging the data across consecutive batches

sl = 16
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0,len(nums)-sl-1,sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)

Outputs of seqs[0] and seqs[1]

seqs[0]
(tensor([0, 1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1]),
 tensor([1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1, 9]))
seqs[1]
(tensor([ 9,  1, 10,  1, 11,  1, 12,  1, 13,  1, 14,  1, 15,  1, 16,  1]),
 tensor([ 1, 10,  1, 11,  1, 12,  1, 13,  1, 14,  1, 15,  1, 16,  1, 17]))

The above two tensors decoded back through the vocab

[L(vocab[o] for o in s) for s in seqs[0]]
[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]
[L(vocab[o] for o in s) for s in seqs[1]]
[(#16) ['nine','.','ten','.','eleven','.','twelve','.','thirteen','.'...],
 (#16) ['.','ten','.','eleven','.','twelve','.','thirteen','.','fourteen'...]]

The re-arranged tensors sit at the same position in consecutive batches, i.e. the 1st position in batch 1, then the 1st position in batch 2, and so on.

group_chunks(seqs[:cut], bs)[0]
(tensor([0, 1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1]),
 tensor([1, 2, 1, 3, 1, 4, 1, 5, 1, 6, 1, 7, 1, 8, 1, 9]))
group_chunks(seqs[:cut], bs)[64]
(tensor([ 9,  1, 10,  1, 11,  1, 12,  1, 13,  1, 14,  1, 15,  1, 16,  1]),
 tensor([ 1, 10,  1, 11,  1, 12,  1, 13,  1, 14,  1, 15,  1, 16,  1, 17]))

Model Changes

len(vocab) # Just for refreshing the memory
30

We are making a couple of changes to the model:

  • Since we want to predict a word after every input word, and the inputs and targets are sequences of 16 tokens each, we now append a prediction to outs inside the for loop.

    NOTE - We use 16 (or any number higher than 1) because we want the RNN to have a "memory".

  • outs is a list of sl tensors, each of shape bs x vocab_size. torch.stack(outs, dim=1) stacks them along dimension 1, giving an output of shape bs x sl x vocab_size for each batch (see the small shape sketch below).

    NOTE - The output dims are (0, 1, 2) = (bs, sl, vocab_size), and the dim used for stacking is 1.
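
Here is a small shape sketch (made-up tensors, my own addition) of what torch.stack(outs, dim=1) does:

bs_, sl_, vocab_sz_ = 64, 16, 30                           # same sizes as in this post
outs_ = [torch.zeros(bs_, vocab_sz_) for _ in range(sl_)]  # one prediction per time step
torch.stack(outs_, dim=1).shape                            # expected: torch.Size([64, 16, 30])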

class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        outs = []
        for i in range(sl):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))  # This is now inside the loop and appends the data
        self.h = self.h.detach()
        return torch.stack(outs, dim=1) # This is stacking the output on dim = 1 
                                        # (collapses the column)
    
    def reset(self): self.h = 0

We cannot use F.cross_entropy directly as before, because the shapes of the model output (bs x sl x vocab_size) and the target (bs x sl) are not what it expects by default.

Therefore, we define a loss function as below

  • inp.view(-1, len(vocab)) converts the predictions (which are input to loss function) from bs x sl x vocab_size (64 x 16 x 30) into 1024 x 30

  • targ.view(-1) converts the y (targets) from bs x sl i.e. 64 x 16 to 1024

  • F.cross_entropy then treats each of the 1024 positions as a separate classification over the 30 vocab entries and averages the loss

def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
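
A quick shape check (random made-up tensors, my own addition) that this loss function accepts bs x sl x vocab_size predictions and bs x sl targets:

fake_preds   = torch.randn(64, 16, len(vocab))          # bs x sl x vocab_size
fake_targets = torch.randint(0, len(vocab), (64, 16))   # bs x sl
loss_func(fake_preds, fake_targets)                     # returns a single scalar loss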

Let's train the model. This is a simple RNN with very few layers, but it explains the various steps in detail.

learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.215018 3.007599 0.167074 00:00
1 2.272449 1.879874 0.458252 00:00
2 1.716600 1.808229 0.467122 00:00
3 1.418651 1.769082 0.509928 00:00
4 1.238434 1.682457 0.532308 00:00
5 1.082703 1.701110 0.555827 00:00
6 0.952198 1.742052 0.576742 00:00
7 0.847260 1.769703 0.590413 00:00
8 0.757634 1.895885 0.604329 00:00
9 0.708090 1.972085 0.588135 00:00
10 0.664931 1.914927 0.601400 00:00
11 0.622304 1.959683 0.613444 00:00
12 0.591103 1.988362 0.618652 00:00
13 0.569842 2.015175 0.614827 00:00
14 0.557822 2.007410 0.615479 00:00

Multilayer RNNs

Instead of a linear layer between the input and output, we can have another RNN, i.e. the output of one RNN can be passed as the input to another RNN.

The Model

  • We use PyTorch's nn.RNN class to create this model, to keep things simple. (We have already seen how to create an RNN from scratch.)

  • batch_first=True tells nn.RNN that the input we pass has the batch size as its 0th dimension, i.e. bs x sl x n_hidden (64 x 16 x 64 after the embedding).

  • By default, PyTorch's RNN expects the sequence length first and the batch size as the 1st dimension (16 x 64 x 64).

  • We also notice that there is no loop over the sequence length (16) in forward any more.

  • Previously we used a Linear layer and therefore did the looping ourselves.

  • Now nn.RNN takes care of the looping (see the small shape sketch below).
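
Here is a small shape sketch (made-up tensors, my own addition) of nn.RNN with batch_first=True, using the same sizes as this post:

bs_, sl_, n_hidden_, n_layers_ = 64, 16, 64, 2
rnn_ = nn.RNN(n_hidden_, n_hidden_, n_layers_, batch_first=True)
x_   = torch.randn(bs_, sl_, n_hidden_)        # bs x sl x n_hidden, like the output of i_h(x)
h0_  = torch.zeros(n_layers_, bs_, n_hidden_)  # hidden state is n_layers x bs x n_hidden
res_, h_ = rnn_(x_, h0_)
res_.shape, h_.shape   # expected: (torch.Size([64, 16, 64]), torch.Size([2, 64, 64]))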

class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(res)
    
    def reset(self): self.h.zero_()
learn = Learner(dls, LMModel5(len(vocab), 64, 2), 
                loss_func=CrossEntropyLossFlat(), 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.055853 2.591640 0.437907 00:01
1 2.162359 1.787310 0.471598 00:01
2 1.710663 1.941807 0.321777 00:01
3 1.520783 1.999726 0.312012 00:01
4 1.330846 2.012902 0.413249 00:01
5 1.163297 1.896192 0.450684 00:01
6 1.033813 2.005209 0.434814 00:01
7 0.919090 2.047083 0.456706 00:01
8 0.822939 2.068031 0.468831 00:01
9 0.750180 2.136064 0.475098 00:01
10 0.695120 2.139140 0.485433 00:01
11 0.655752 2.155081 0.493652 00:01
12 0.629650 2.162583 0.498535 00:01
13 0.613583 2.171649 0.491048 00:01
14 0.604309 2.180355 0.487874 00:01

Exploding or Disappearing Activations

Notice that the two-layer RNN above did not improve on the single-layer model. With deeper models and longer sequences, the hidden state is multiplied by the same weights over and over, so activations (and gradients) can blow up or shrink towards zero, which makes training unstable. The LSTM was designed to mitigate exactly this.
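
A tiny numeric illustration (made-up numbers, my own addition) of why repeatedly multiplying by something even slightly away from 1 causes trouble:

1.5 ** 50   # ~ 6.4e8   -> activations explode
0.5 ** 50   # ~ 8.9e-16 -> activations all but disappear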

LSTM

We will cover the LSTM in the next part.

Below is a sneak peek into its architecture!

[Image: LSTM architecture diagram]