RNNs and LSTMs - Part 1
Recurrent Neural Networks (RNNs) are a special type of neural network, generally used for sequential data such as text, i.e. NLP tasks.
Long Short-Term Memory (LSTM) is a variation of the vanilla RNN.
We will cover RNNs in Part 1.
LSTMs will be covered in Part 2.
This blog post should be read alongside Chapter 12 of the FastAI book.
This is a toy dataset from FastAI which has the English text for the numbers from "one" to "nine thousand nine hundred ninety nine".
from fastbook import *
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)
There are two files, train.txt and valid.txt, as shown below.
path.ls()
Let's load the train and valid files into L(), FastAI's enhanced list class.
train_lines = L()
with open(path/'train.txt') as f: train_lines += L(*f.readlines())
train_lines
type(train_lines)
train_lines[-1:]
We can see that the last element of this train list is seven thousand nine hundred ninety nine
Now, we load the valid file similarly.
valid_lines = L()
with open(path/'valid.txt') as f: valid_lines += L(*f.readlines())
valid_lines
valid_lines[-1:]
In the above cells, we can see that the valid file has the numbers from eight thousand one to nine thousand nine hundred ninety nine.
Let's load both files into a single list as shown below.
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'valid.txt') as f: lines += L(*f.readlines())
lines
lines now has data from both train.txt and valid.txt.
We will strip the extra spaces and newline characters from each number's text and then concatenate them with " . " (space, period, space) in between, as done below.
text = ' . '.join([l.strip() for l in lines])
text[:100]
Let's split the space-separated text into tokens. Note that we keep the "." as a token too.
tokens = text.split(' ')
tokens[:10]
type(tokens)
There are around 63 K tokens
len(tokens)
Below are the first fifteen tokens
tokens[0:15]
Below are the last 15 tokens
tokens[-15:]
However, there are only 30 unique tokens. Come to think of it, it is the text for 1 to 9999, which only needs
- "one" to "twenty" (20 unique),
- "thirty", "forty" and so on up to "ninety" (7 unique),
- "hundred" (1 unique),
- "thousand" (1 unique), and
- "." (1 unique),
which totals 30.
len(np.unique(np.array(tokens)))
Below is a fastai L (enhanced list class) version of the same thing we did above with numpy.
We create a vocab of the unique tokens.
vocab = L(*tokens).unique()
vocab
len(vocab)
Printing the enumeration of the vocab below
for i,w in enumerate(vocab): print(f'i = {i} and w = {w}')
Using the above, we can create a dictionary (word2idx) mapping tokens to their indices.
word2idx = {w:i for i,w in enumerate(vocab)}
word2idx
Let's numericalize the tokens: map each token to its index via word2idx and collect the results into a fastai L.
nums = L(word2idx[i] for i in tokens)
nums
type(nums)
We have now numericalized our tokens; this can be passed into a neural net.
Let's say this model will take a sequence of 3 words as input and predict the fourth word.
For example:
Do you know ----- the
you know the ----- answer
know the answer ----- to
the answer to ----- this
answer to this ----- question
and so on..
Let's create a list of our data in this fashion.
The range(0, len(tokens)-4, 3), i.e. range(start, stop, step), is set up this way so that the windows don't overlap (step of 3) and we don't run past the end of the list. A tiny illustration on a dummy list follows, then the real code.
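To see how the indexing works, here is a small illustration on a dummy list of 10 tokens (this snippet is just for illustration and is not part of the original notebook):
toy = list('abcdefghij')                          # 10 dummy tokens
[(toy[i:i+3], toy[i+3]) for i in range(0, len(toy)-4, 3)]
# [(['a', 'b', 'c'], 'd'), (['d', 'e', 'f'], 'g')]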
L((tokens[i:i+3], tokens[i+3]) for i in range(0,len(tokens)-4,3))
Above is the list we created for predicting the fourth word when the input is a sequence of three words.
We will create a tensor version of the same below.
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
seqs
Let's create DataLoaders and split the data into train and valid datasets in an 80:20 ratio.
len(seqs)
int(len(seqs) * 0.8)
cut = int(len(seqs) * 0.8)
The batch size is 64, and we set shuffle=False because this is text data and we want to keep the sequences in order: the text of any language makes sense only when the words are in their proper sequence.
bs = 64
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)
This model has 3 layers:
- Input to hidden - an Embedding of size (vocab_sz (30 in this example), n_hidden)
- Hidden to hidden - a Linear layer of size (n_hidden, n_hidden)
- Hidden to output - a Linear layer of size (n_hidden, vocab_sz (30 in this example))
See the comments in the code below.
The first word of each x is passed through the input-to-hidden layer (the embedding).
That output is passed through the hidden-to-hidden layer and then through a ReLU.
This becomes the hidden state after the first word (the activations that are updated at each step of a recurrent neural network are called the hidden state).
The hidden state is then added to the embedding of the second word and passed through the hidden-to-hidden layer again to generate a new hidden state.
This process repeats for the third word.
Finally, the hidden-to-output layer produces the prediction for the fourth word.
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.input_to_hidden = nn.Embedding(vocab_sz, n_hidden)
        self.hidden_to_hidden = nn.Linear(n_hidden, n_hidden)
        self.hidden_to_output = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = F.relu(self.hidden_to_hidden(self.input_to_hidden(x[:,0])))  # First word
        h = h + self.input_to_hidden(x[:,1])  # Second word + activations of previous step
        h = F.relu(self.hidden_to_hidden(h))  # Non-linearity
        h = h + self.input_to_hidden(x[:,2])  # Third word + activations of previous step
        h = F.relu(self.hidden_to_hidden(h))  # Non-linearity
        return self.hidden_to_output(h)       # Predicted fourth word
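As a quick, optional sanity check (not in the original notebook), we can pass one batch through an untrained model to confirm the shapes:
x, y = dls.one_batch()
model = LMModel1(len(vocab), 64).to(x.device)  # keep the model on the same device as the batch
out = model(x)
x.shape, y.shape, out.shape  # expected: torch.Size([64, 3]), torch.Size([64]), torch.Size([64, 30])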
len(vocab)
Let's create a Learner, pass in the DataLoaders, and train for a few epochs.
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy,
metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
The most common token is 'thousand'; if we always predicted it as the fourth word, we would get an accuracy of about 15%, as shown below.
Our model above fares much better than this baseline.
n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], str((counts[idx].item()/n)*100)+' %'
- An RNN can be thought of as a looping neural network.
- We add a for loop and refactor the code to reduce the number of lines.
- This is an RNN in its simplest form.
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output

    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])  # add the embedding of the i-th word
            h = F.relu(self.h_h(h))   # update the hidden state
        return self.h_o(h)            # predict the fourth word
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy,
metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
There are two problems with the simple RNN built above:
- h is set to 0 in the forward method, so no information about earlier words is carried over: we predict the 4th word using only the current 3 words, and the knowledge of previous sequences is lost whenever a new three-word sequence is processed.
- Also, predicting the 4th word was only our toy requirement. We should be able to predict the next word after every word we feed in.
- We can fix the first problem by moving h = 0 to __init__, as done below.
- But there is one issue with setting h = 0 only once and removing it from forward: gradients would then be backpropagated through the entire history of the hidden state, which can cause memory issues. The detach we see below stops the gradient calculation after some duration, in this case after three steps, while keeping the value of the hidden state. This truncation is called backpropagation through time (BPTT); a small sketch of what detach does follows this list.
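Here is a minimal, standalone sketch (not from the FastAI book) of what detach does: gradients flow back only as far as the point where the tensor was detached.
import torch  # already available via the fastai import above

w = torch.tensor(2.0, requires_grad=True)
h = torch.tensor(1.0)

h = h * w        # step 1
h = h.detach()   # cut the graph here: step 1 will not receive gradients
h = h * w        # step 2 (only this step contributes to w.grad)

h.backward()
print(w.grad)    # tensor(2.); without the detach it would be tensor(4.)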
The model below implements this truncated BPTT but still takes 3-word inputs and predicts the 4th word.
- Note that h is set to 0 only in __init__.
- Also look at the detach done after coming out of the loop of 3.
- The reset method of our model resets the hidden state at the beginning of each epoch and before each validation phase. It is called via the ModelResetter callback passed through the cbs option of the Learner.
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0  # hidden state, now kept across calls to forward

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()  # keep the value of the hidden state but drop its history,
                                  # so backpropagation does not run through the entire stream
        return out

    def reset(self): self.h = 0  # reset the hidden state before each epoch / validation phase
There is another issue.
We cannot use the data in the same format as before. Why?
Well, there are a couple of things:
The order of the data matters for text, and therefore for RNNs.
In the previous models the hidden state h was reset to 0 for every sample, so we did not care about the order of the samples; we only cared about the 3 words inside each sample.
Now we are carrying the hidden state over from one sample to the next.
Since the samples within a batch are processed in parallel, consecutive sequences cannot sit next to one another inside a single batch.
Rather, they need to be at the same position in consecutive batches. Let's understand this a bit further.
See the first 3 values of the existing data
seqs[:cut]
Just showing the first three below
seqs[0:3]
These tensors correspond to
(['one', '.', 'two'], '.'),
(['.', 'three', '.'], 'four'),
(['four', '.', 'five'], '.')
(tensor([0, 1, 2]), 1)
(tensor([1, 3, 1]), 4)
(tensor([4, 1, 5]), 1)
These have to be rearranged now across consecutive batches.
Below is a function which will help with that.
def group_chunks(ds, bs):
    m = len(ds) // bs  # number of complete batches we can form
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds
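To see what group_chunks does, here is a tiny illustrative example (not from the original notebook) on six dummy items with bs=2:
small = L('a', 'b', 'c', 'd', 'e', 'f')
group_chunks(small, 2)
# -> (#6) ['a', 'd', 'b', 'e', 'c', 'f']
# read in batches of 2: ['a','d'], ['b','e'], ['c','f']
# 'a' (position 0 of batch 1) is continued by 'b' (position 0 of batch 2), then 'c', and so on.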
m = len(seqs[:cut])//bs
m, cut, bs
Below we call the function on the training portion and inspect a few items to see how they get re-arranged.
new_ds = group_chunks(seqs[:cut], bs)
See how the first three items of the original data now sit at the same position (index 0) of three consecutive batches.
new_ds[0] # Part of first batch
new_ds[64] # Part of second batch
new_ds[128] # Part of third batch
The length of this new_ds is less than 16824 (cut).
Why is that?
len(new_ds.items)
cut
That is because the last 56 items cannot form a complete batch of size 64: 16824 // 64 = 262 full groups of 64, i.e. 16768 items, and the remaining 56 are discarded (group_chunks keeps only len(ds)//bs * bs items; drop_last=True in the DataLoaders below similarly drops any incomplete final batch), as seen in the cell below.
cut - len(new_ds.items) # 16824-16768
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
group_chunks(seqs[:cut], bs),
group_chunks(seqs[cut:], bs),
bs=bs, drop_last=True, shuffle=False)
Let's train this for 10 epochs.
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
metrics=accuracy, cbs=ModelResetter) # cbs calls the reset method
learn.fit_one_cycle(10, 3e-3)
Okay, I hope there were some 'aha' moments so far.
Let's move forward.
If you see the "Improving the RNN" section above, we had this line:
Also, predicting the 4th word was only our toy requirement. We should be able to predict the next word after every word we feed in.
We still have not done that.
We will see how that can be done below.
First, we need to change the data so that the dependent variable contains each of the n next words after each of our n input words, instead of 3 inputs and 1 output as before. We can control n through a parameter.
Second, we change the model so that it predicts a word after every word instead of one word after 3 inputs.
We will make the data changes in the cells below.
sl (sequence length) is the parameter for n; its value is 16.
As in the earlier code, we use group_chunks to re-arrange the data across consecutive batches.
sl = 16
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
for i in range(0,len(nums)-sl-1,sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
group_chunks(seqs[cut:], bs),
bs=bs, drop_last=True, shuffle=False)
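As a quick optional check (not in the original post), each batch now has inputs and targets of shape bs x sl:
xb, yb = dls.one_batch()
xb.shape, yb.shape  # expected: (torch.Size([64, 16]), torch.Size([64, 16]))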
Here are items 0 and 1 of seqs.
seqs[0]
seqs[1]
Decoding the above two tensors back into vocab tokens:
[L(vocab[o] for o in s) for s in seqs[0]]
[L(vocab[o] for o in s) for s in seqs[1]]
After group_chunks, the tensors are re-arranged across batches so that continuations sit at the same position in each batch, i.e.
the 1st position of batch 1 continues at the 1st position of batch 2, and so on.
group_chunks(seqs[:cut], bs)[0]
group_chunks(seqs[:cut], bs)[64]
len(vocab) # Just for refreshing the memory
We make a couple of changes to the model:
- Since we want to predict a word after every word, and the inputs and outputs now contain sl (16) values each, we append an output inside the for loop rather than returning a single prediction.
- NOTE - We use 16 (or any number higher than 1) because we want the RNN to build up a "memory" across steps.
- torch.stack(outs, dim=1) stacks the sl outputs, each of shape bs x vocab_size, into a single tensor of shape bs x sl x vocab_size. (The dims of the result are (0, 1, 2) = (bs, sl, vocab_size), and the dim used for stacking is 1.) A quick shape demo follows this list.
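To make the stacking concrete, here is a small illustrative snippet with dummy tensors (not from the original notebook):
outs = [torch.randn(64, 30) for _ in range(16)]  # sl tensors, each of shape bs x vocab_size
torch.stack(outs, dim=1).shape                   # expected: torch.Size([64, 16, 30])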
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0

    def forward(self, x):
        outs = []
        for i in range(sl):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))  # now inside the loop: one prediction per input word
        self.h = self.h.detach()           # truncate BPTT as before
        return torch.stack(outs, dim=1)    # stack the sl predictions on dim=1 -> bs x sl x vocab_sz

    def reset(self): self.h = 0
We cannot use F.cross_entropy directly as we have been doing, because it expects predictions of shape (N, classes) and targets of shape (N,), while our model now outputs bs x sl x vocab_size and the targets are bs x sl.
Therefore, we define a loss function as below.
inp.view(-1, len(vocab)) reshapes the predictions (the input to the loss function) from bs x sl x vocab_size (64 x 16 x 30) into 1024 x 30.
targ.view(-1) reshapes the targets from bs x sl, i.e. 64 x 16, into 1024.
F.cross_entropy then compares each of the 1024 rows of 30 scores with the corresponding target index and averages the loss.
def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
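As a quick illustrative shape check with dummy tensors (not part of the original notebook):
preds = torch.randn(64, 16, 30)         # bs x sl x vocab_size
targs = torch.randint(0, 30, (64, 16))  # bs x sl
preds.view(-1, 30).shape, targs.view(-1).shape, loss_func(preds, targs)
# expected: torch.Size([1024, 30]), torch.Size([1024]), and a scalar loss tensor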
Let's train the model. This is a simple RNN with very few layers, but it explains the various steps in detail.
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func=loss_func,
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
Instead of a single Linear layer between input and output, we can stack RNNs, i.e. the output of one RNN can be passed to another RNN.
To keep things simple we use PyTorch's nn.RNN class to create this model (we have already seen how to create an RNN from scratch above).
batch_first=True tells nn.RNN that the x we pass has the batch size as the 0th dimension, i.e. bs x sl x n_hidden (64 x 16 x 64 here, since the embeddings have n_hidden features).
By default, PyTorch's RNN expects the sequence length as the 0th dimension (sl x bs x n_hidden, i.e. 16 x 64 x 64).
We also notice there is no loop over the sequence length (16) in forward any more.
Previously we used a Linear layer and therefore did the looping ourselves.
Now PyTorch's RNN takes care of the looping. A quick shape illustration follows.
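Here is a quick illustrative check of nn.RNN's input and output shapes with batch_first=True (dummy tensors, not from the original notebook):
rnn = nn.RNN(input_size=64, hidden_size=64, num_layers=2, batch_first=True)
x_emb = torch.randn(64, 16, 64)  # bs x sl x n_hidden (what self.i_h(x) would produce)
h0 = torch.zeros(2, 64, 64)      # n_layers x bs x n_hidden (the hidden state layout is not affected by batch_first)
res, h = rnn(x_emb, h0)
res.shape, h.shape               # expected: torch.Size([64, 16, 64]), torch.Size([2, 64, 64])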
class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)  # nn.RNN loops over the sequence for us
        self.h = h.detach()                     # truncate BPTT as before
        return self.h_o(res)

    def reset(self): self.h.zero_()
learn = Learner(dls, LMModel5(len(vocab), 64, 2),
loss_func=CrossEntropyLossFlat(),
metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
We will cover LSTMs in the next part.
Below is a sneak peek into their architecture!