Named Entity Recognition using LSTM in Keras

Named Entity Recognition (NER) is an NLP technique for information extraction: it identifies named entities such as people, places, and organizations within raw text and classifies them under predefined categories.

Tek Raj Awasthi
12 min read · Aug 3, 2020

Note: Article Originally published at https://valueml.com on August 3, 2020

Introduction

Named Entity Recognition (NER) models can be used to identify mentions of people, locations, organizations, times, company names, and so on. An NER model is not only a standard tool for information extraction.

fig. NER model finding & Classifying Text Categories

It also serves as a foundational preprocessing tool for many downstream applications such as Machine Translation, Question Answering, Customer Feedback Handling, and even Text Summarization.

Motivation Behind The Project

Hand-engineered features were domain-specific, rule-based, and tedious to build. In recent years, Deep Learning, empowered by continuous real-valued vector representations and semantic composition through non-linear processing, has achieved state-of-the-art performance on these hard tasks. It also allows the machine to be fed raw data directly.

The Named Entity Recognition dataset we use in this project is a useful playground: once you can pick intents and custom named entities out of your own sentences, you can apply the same approach to real business problems (such as extracting entities from Electronic Medical Records).

Implementation

The prerequisites for this project are some prior experience with Python projects and an understanding of neural networks, mainly Recurrent Neural Networks (RNNs). We'll implement this project in a Jupyter Notebook.

Task 1: Import Modules

First, we will import the necessary Python libraries, modules, and helper functions. We will mainly use the Keras API with TensorFlow 2 as the backend.

%matplotlib inline switches matplotlib to inline mode, so the output of plotting commands is displayed inline within frontends like the Jupyter Notebook, directly below the code cell.

%matplotlib inline 
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(0)
plt.style.use("ggplot")
import tensorflow as tf
print('Tensorflow version:', tf.__version__)
print('GPU detected:', tf.config.list_physical_devices('GPU'))

Task 2: Load and Explore the NER Dataset

Here, we'll read the NER dataset, which is in CSV format, using pandas. I found this dataset on Kaggle and you can download it by Clicking Here.

The dataset contains English sentences along with a corresponding annotation for each word. The sentences in the dataset are encoded in Latin-1.

Essential info about entities:

  • geo = Geographical Entity
  • org = Organization
  • per = Person
  • gpe = Geopolitical Entity
  • tim = Time indicator
  • art = Artifact
  • eve = Event
  • nat = Natural Phenomenon

Total Words Count = 1354149
Target Data Column: “tag”

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.fillna(method="ffill")
data.head(20)

Here, we have filled the missing values in the dataset using the "ffill" (forward fill) method, which propagates the last valid value down each column; in this dataset only the first word of each sentence carries a "Sentence #" label, so forward filling assigns every word to its sentence. After reading the dataset, we look at the first 20 rows, which look like this:

fig. Visualizing top 20 rows in dataset
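
As a side note, here is a minimal toy illustration (made-up rows, not the actual dataset) of what the "ffill" fill does to a sparsely labelled "Sentence #" column:

# Toy example only: forward fill propagates the last sentence label downwards
toy = pd.DataFrame({"Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
                    "Word": ["Thousands", "of", "demonstrators", "They", "marched"]})
print(toy.fillna(method="ffill"))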

Visualizing the Sentences

The first sentence is "Thousands of demonstrators have marched through London to protest……..". In the rightmost column, we can see the tag for each word, such as B-geo for London, Iraq, and Britain, which signifies that these words are geographical entities. POS means part of speech; ignore it for now.
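
If you want to inspect one sentence and its tags directly, a quick optional filter on the "Sentence #" column works; the sentence identifiers in this dataset look like "Sentence: 1" (the same format used by the SentenceGetter class later on):

# Optional: show the words, POS tags and NER tags of the first sentence only
first_sentence = data[data["Sentence #"] == "Sentence: 1"]
print(first_sentence[["Word", "POS", "Tag"]])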

Now, let's see the number of unique words and the number of unique tags in our dataset using the "nunique" method from the pandas library.

print("Unique words in corpus:", data['Word'].nunique())
print("Unique tags in corpus:", data['Tag'].nunique())

And it shows result like

Unique words in corpus: 35178
Unique tags in corpus: 17

The number of tags in the corpus equals the number of output classes, which is 17 in our case, and our input vocabulary size is 35178.

Now we are going to create a list, using the set function to de-duplicate the values in the "Word" column, and append a padding token named "ENDPAD".

words = list(set(data["Word"].values))
words.append("ENDPAD")
num_words = len(words)

Now, let's do the same for our target variable, the tags. Printing the counts shows 35179 words (since the padding token was appended) and, as before, 17 tags.

tags = list(set(data["Tag"].values))
num_tags = len(tags)
print(num_words, num_tags)

Now, we are going to reshape our dataset so that we can easily split it into a feature matrix and a target vector. For each sentence we want a list of tuples, where the 1st value in each tuple is the word, the 2nd is the POS (Part of Speech) tag, and the 3rd is the Tag, i.e. the class name.

Task 3: Retrieve Sentences and Corresponding Tags

In this task, we are going to create a class that will allow us to retrieve the sentences and their corresponding tags so that we can clearly define the input and output of our Neural Network model.

We start at sentence 1 and group the rows by sentence using a lambda function: for each group we take the values in the "Word" column, convert them to a list, and do the same for the POS and NER tag columns.

Then we apply this aggregation function to our grouped data and store the result as a list of sentences.

class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            # Sentence identifiers in this dataset look like "Sentence: 1"
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None

Here, we instantiate that class and use it to get the list of sentences.

getter = SentenceGetter(data) 
sentences = getter.sentences
sentences[0]

Here, we can see the extracted first sentence as a list of (word, POS, tag) tuples.

fig. Retrieved Sentence

So, I hope you get the idea now. Join me in the next task, where we map our vocabulary of words and tags to numerical indices.

Task 4: Define Mappings between Sentences and Tags

Here, we are going to build two dictionaries: one maps each word to a unique numerical index, and the other maps each tag to a unique index.

word2idx = {w: i + 1 for i, w in enumerate(words)} 
tag2idx = {t: i for i, t in enumerate(tags)}
word2idx

Now, we can see that each word is assigned a unique index. We can also go back from an index to the corresponding word by looking it up in a reversed dictionary.

fig. Mapping
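
As a quick sanity check (an addition to the original walkthrough), you can build the reverse mappings and round-trip a word through them; here I assume "London" is in the vocabulary, which it is for this dataset:

# Reverse mappings: index -> word and index -> tag, handy for decoding predictions
idx2word = {i: w for w, i in word2idx.items()}
idx2tag = {i: t for t, i in tag2idx.items()}
print(word2idx["London"], idx2word[word2idx["London"]])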

Task 5: Padding Input Sentences and Creating Train/Test Splits

To feed sentences into a neural network, at least with Keras and TensorFlow, they need to be of equal length. So we are going to pad our input sentences to a prespecified length, but first we need to figure out what that length should be. One of the easiest heuristics is to look at the distribution of sentence lengths in your corpus.

Let's do this visually by plotting a histogram with matplotlib's hist function: for each sentence, we take its length and add it to the histogram.

plt.hist([len(s) for s in sentences], bins=50) 
plt.show()

And the output looks like

fig. Sentence length distribution

You can see in the plot that the mean sentence length in our dataset is around the 20 to 22 word mark. On the x-axis, 50 looks like a safe cutoff, so most sentences will fit without being truncated.
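
Rather than only eyeballing the histogram, you can also compute a cutoff directly; a minimal sketch that reports the longest sentence and the length covering roughly 99% of sentences:

# Optional: numeric check of the sentence-length distribution
lengths = [len(s) for s in sentences]
print("Longest sentence:", max(lengths))
print("99th percentile length:", int(np.percentile(lengths, 99)))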

Padding

Now, in the next step, we'll use the pad_sequences helper function for padding, with max_len set to 50. X is the numerical representation of our words: for each sentence we iterate over its words and look up each word's index in the word2idx dictionary we created earlier, using a Python list comprehension.

Now we can apply pad_sequences; padding="post" means the padding is added at the end of each sentence. y is our target vector: for each sentence we iterate over its words and retrieve the index of each word's tag from tag2idx. After that, all we have to do is pad y as well (we keep the tags as integer indices, since we will compile the model with a sparse categorical cross-entropy loss). At this point, we have successfully created our feature matrix and target vector.

from tensorflow.keras.preprocessing.sequence import pad_sequences
max_len = 50
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
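
As noted above, y is kept as integer tag indices because we will compile the model with sparse_categorical_crossentropy. If you instead wanted one-hot targets with plain categorical_crossentropy, a hedged alternative (not used in the rest of this walkthrough) would be:

# Alternative only: one-hot encode the padded tags for categorical_crossentropy
from tensorflow.keras.utils import to_categorical
y_onehot = np.array([to_categorical(seq, num_classes=num_tags) for seq in y])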

The next task is to split the dataset into training and test sets using the scikit-learn library. test_size=0.2 means 80% of the dataset is used for training and the remaining 20% for testing.

from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

Now, join me in the next task, which is the most exciting part of this project.

Task 6: Build and Compile a Bidirectional LSTM Model

Here, we'll use a bidirectional RNN/LSTM instead of a simple RNN because a bidirectional RNN makes use of both past and future context, along with the current input, at each time step. Bidirectional LSTMs have therefore become a de facto standard for composing deep context-dependent representations of text, and that is the model we will use to build our Named Entity Recognition model.

Here we are not using the Sequential model from Keras; rather, we'll use the Model class from the Keras functional API, which allows a bit more flexibility.

from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.layers import TimeDistributed, SpatialDropout1D, Bidirectional

First, we define the input layer of our model and specify its shape to be max_len, which is 50. Then we add a word embedding layer (we are not using the Part of Speech tags in this project); its input dimension is, of course, the number of words in our vocabulary. We apply this layer to the input layer, creating embeddings from input_word.

Let's apply a SpatialDropout1D layer here. SpatialDropout1D drops entire 1D feature maps (embedding channels) rather than individual elements. We apply a dropout rate of 0.1 to the output of the previous layer.

Bidirectional LSTM

Now we can go ahead and create our Bidirectional LSTM. We are using an LSTM rather than a plain RNN because plain RNNs suffer from the vanishing gradient problem. units (the dimensionality of the LSTM's hidden state) is set reasonably high, 100 for now.

You can experiment with these hyperparameters, for example changing units to 250 or max_len to 100, which may improve the model's accuracy at the cost of longer training.

recurrent_dropout is also set to a small value. We apply the LSTM to the output of the previous layer, which is why the previous model appears as the argument. Then we use TimeDistributed, which takes the Dense layer and applies it to every temporal slice of the input, i.e. the same Dense layer is applied at each of the 50 time steps.

The Dense layer uses a softmax activation, so its outputs can be interpreted as probabilities; taking the argmax of those outputs later gives the predicted output class.

Now, let's combine everything by passing input_word and the output layer to Model, and look at its summary.

input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_words, output_dim=50, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(model)
model = Model(input_word, out)
model.summary()
fig. Summary Of Bidirectional LSTM Model

The summary shows that we have about 1.88 million trainable parameters.

Now let's compile our model, specifying the optimizer, the loss function, and the metrics we want to track. We'll use the adam optimizer, sparse_categorical_crossentropy as the loss function, and accuracy as the metric.

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

Now join me in the next task, where we'll set up some callbacks and train the model so that we can evaluate it on the test set and see how well it tags entities.

Task 7: Train the Model

Specifically, we will use EarlyStopping and PlotLossesCallback, along with a ModelCheckpoint that saves the best weights during training.

from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping 
from livelossplot.tf_keras import PlotLossesCallback

Let's go ahead and instantiate our callbacks. We use EarlyStopping so that we don't need to hard-code the number of epochs: if the validation accuracy doesn't improve for 2 consecutive epochs, training stops. That is the meaning of patience.

Accuracy

We monitor the validation accuracy (patience is set as mentioned above) and set verbose to 0 so that we don't get any extra output. The mode is set to 'max' because we want to maximize the validation accuracy.

We will also make use of PlotLossesCallback, one of my favorite callbacks, as it eliminates the need to leave the Jupyter Notebook to plot the model's training progress and metrics. Everything is updated live within the notebook. Let's put it on our callback list.

Train the Model

Then all that is left to start training is to call model.fit(), passing our training data x_train and y_train, and using the test set as validation data. You can increase the batch_size if your GPU has more memory. Here we will use just 3 epochs, as training takes more than 10 to 15 minutes with more epochs.

%%time
chkpt = ModelCheckpoint("model_weights.h5", monitor='val_loss',verbose=1, save_best_only=True, save_weights_only=True, mode='min')
early_stopping = EarlyStopping(monitor='val_accuracy', min_delta=0, patience=2, verbose=0, mode='max', baseline=None, restore_best_weights=False)
callbacks = [PlotLossesCallback(), chkpt, early_stopping]
history = model.fit(
    x=x_train, y=y_train,
    validation_data=(x_test, y_test),
    batch_size=32,
    epochs=3,
    callbacks=callbacks,
    verbose=1)

Then press Shift+Enter to run that cell.

fig. Accuracy of Model

We can see at the bottom right that the accuracy of our model is more than 98%.

Task 8: Evaluate Named Entity Recognition Model

Here, we will evaluate our model on held-out test data that the model hasn't seen before, and then make some predictions.

model.evaluate(x_test, y_test)

And we got pretty good accuracy at more than 98%.
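
Keep in mind that this token-level accuracy is inflated by the padded positions and the dominant "O" tag. As an extra check beyond the original article (assuming a reasonably recent scikit-learn), a per-tag breakdown on the test set can be computed like this:

# Optional: per-tag precision/recall/F1 on the test set (token level, padding included)
from sklearn.metrics import classification_report
y_pred = np.argmax(model.predict(x_test), axis=-1)
print(classification_report(y_test.ravel(), y_pred.ravel(),
                            labels=list(range(num_tags)), target_names=tags,
                            zero_division=0))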

Now let's take a look at some predictions. We'll print a table where the leftmost column holds words from our test set, the second column holds the true tags, and the third holds our model's predicted tags.

We pick a random index i between 0 and the number of examples in the test set and use it to select one example.

Prediction

Let's get our model's prediction and store it in p: we call the predict method on the model and feed it the i-th example from the test set.

This returns a matrix of class probabilities for each position, so we take the argmax along the last axis to get the predicted tag indices. The first three lines of code therefore pick a random example and generate the model's prediction for it. Then we look at the true values the model is trying to predict: y_test already holds the integer tag indices, so we can use the i-th row directly.

Then finally, let’s define the pattern of our result.

i = np.random.randint(0, x_test.shape[0]) #659
p = model.predict(np.array([x_test[i]]))
p = np.argmax(p, axis=-1)
y_true = y_test[i]
print("{:15}{:5}\t {}\n".format("Word", "True", "Pred"))
print("-" *30)
for w, true, pred in zip(x_test[i], y_true, p[0]):
    print("{:15}{}\t{}".format(words[w-1], tags[true], tags[pred]))

The final result of our project Named Entity Recognition looks like below:

fig. Final Result

Now let's look at the result. As shown above, the United Nations is an organization and Ituri is a geographical entity, while the other words don't belong to any specific category, so their tag is O, i.e. they are outside any named entity.

Applications of Named Entity Recognition Model:

  • Classifying content for news providers
  • Automating recommendation systems
  • Segregating research papers on the basis of relevant entities
  • Customer feedback handling in big companies and services
  • Efficient search algorithms that search all words in millions of articles

That’s it.

Thank you for reading and you can download the Source Code from Github: Tekraj Github

You can reach me on LinkedIn, GitHub, Twitter, or Gmail.

Also read my previous article on

Anomaly Detection in Time Series Data using Keras: Click Here

Originally published at https://valueml.com on August 3, 2020.
