
Oct 03 2019 - 06:00 PM
BERT - Google's New NLP model: A Layman's View and Intro to NLP

Note: This post is part of an ongoing project on Autotagging [1] TCR content that is being done by the research team.

Previous posts in the series

1- A case for Autotagging - Does Machine learning know better about your work than you? Link [2]


NLP - Natural Language Processing

Natural Language Processing [3] is a field of Computer Science which deals with the interactions between computers and human (natural) languages, particularly how to process large amounts of text data.

Let us look at a small cross-section of models for NLP.

Bag of Words Models

Bag of words [4] is a simplifying representation in NLP. Each text can be represented as a bag of its words, disregarding context, grammar and word order.

Example 1: "EdLab is a research, design, and development unit at Teachers College, Columbia University"

Bag of words: {"EdLab" : 1, "is" : 1, "a" : 1, "research" : 1, "design" : 1, "and" : 1, "development" : 1, "unit" : 1, "at" : 1, "Teachers" : 1, "College" : 1, "Columbia" : 1, "University" : 1}

Example 2: "John went to the river bank today, and he visited the bank to withdraw his salary on his way back"

Bag of words: {"John" : 1, "went" : 1, "to" : 2, "the" : 2, "river" : 1, "bank" : 2, "today" : 1, "and" : 1, "he" : 1, "visited" : 1, "withdraw" : 1, "his" : 2, "salary" : 1, "on" : 1, "way" : 1, "back" : 1 }

Things to notice:

·      The bag of words model keeps the multiplicity of the words intact

·      The model does not care about the context of the word. The 'bank' in "river bank" and the 'bank' that is a financial institution are both treated as the same word
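
The two bags of words above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `bag_of_words` and the tokenising rule (keep runs of letters, drop punctuation, keep the original case) are assumptions made for illustration and do not come from any particular library.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, ignoring order and punctuation."""
    # Assumed tokeniser: a "word" is any run of letters; case is kept as-is.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(words)


example_1 = ("EdLab is a research, design, and development unit "
             "at Teachers College, Columbia University")
example_2 = ("John went to the river bank today, and he visited the bank "
             "to withdraw his salary on his way back")

bag_1 = bag_of_words(example_1)
bag_2 = bag_of_words(example_2)

# Both meanings of "bank" are counted under the same key.
print(bag_2["bank"])
```

Because the counter only sees the spelling of each word, the river bank and the financial bank cannot be told apart, which is the limitation noted above.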


Word2vec Models

In simple terms, Word2vec [5] is a set of related models which take large text data as input and represent the words in the form of vectors [6] (sequences of numbers with a direction) based on the context of each word. It can identify the context in which a word was used in the text. The location of a word relative to another word indicates the relationship between them.

For example, it can understand the context of the words "King", "Queen", "Man" and "Woman" such that

King - Man + Woman = Queen

The algorithm represents the words as vectors and identifies the context of the words. [7]

·      One thing to consider is that the model still gives the 'bank' in "river bank" and the 'bank' in "financial bank" the same vector, so it incorrectly blends both contexts into one representation
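
The "King - Man + Woman = Queen" idea can be illustrated with toy numbers. The vectors below are hypothetical and hand-picked (two made-up dimensions, roughly "royalty" and "maleness"); real Word2vec vectors are learned from text and have many more dimensions. The helper names `cosine` and `nearest_word` are also assumptions for this sketch.

```python
import math

# Hypothetical 2-D vectors: [royalty, maleness]. These are NOT learned values.
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "prince": [0.5, 0.9],
    "princess": [0.5, 0.1],
}


def cosine(u, v):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_word(target, exclude):
    """Return the word whose vector is most similar to the target vector."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# King - Man + Woman, computed one dimension at a time.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
answer = nearest_word(target, exclude={"king", "man", "woman"})
print(answer)
```

Note that each word has exactly one entry in `vectors`, so a word like "bank" would also get a single vector no matter which sense was meant.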

BERT Algorithm

BERT [8] (Bidirectional Encoder Representations from Transformers) is a model introduced in a recent paper by researchers at Google. The key innovation of BERT is that it takes the context of a word into consideration from both directions (the words present before it and the words present after it) while training the model.

The algorithm runs in 2 steps

Let us consider an example to understand the method better. Take the sentences:

"The man goes to the store. He buys a Gallon of Milk"

STEP 1 : Masking Words

Before running the algorithm, 15% of the words in each sequence are masked (hidden). The model then attempts to predict the original value of each masked word based on the context provided by the rest of the words in the sequence (in both directions).

Let's look at the example

Sentence: "The man goes to the [Mask1]. He buys a [Mask2] of Milk"

The algorithm then predicts [Mask1] as "Store" and [Mask2] as "Gallon".

This enables the algorithm to learn the relationships between the words of a sentence.
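
The masking step can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, the whitespace tokenisation, and the fixed random seed are all assumptions made here, and the published procedure has extra details that are left out.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Hide a fraction of the tokens and remember the original words."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Mask at least one token so that very short inputs still show the idea.
    n_to_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the word the model must predict
        masked[pos] = MASK_TOKEN
    return masked, targets


# Assumed tokenisation: split on spaces, with the full stop as its own token.
tokens = "The man goes to the store . He buys a Gallon of Milk".split()
masked, targets = mask_tokens(tokens)

print(" ".join(masked))
print(targets)
```

With 13 tokens, 15% rounds to 2 masked positions, matching the two masks in the example sentence. Which positions get hidden depends on the random seed.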

STEP 2 : Predicting sentence

In the second step, the model learns the relations between sentences. It learns to predict whether sentence B comes after sentence A or not.

Let's look at 2 examples

Sentence A: "The man goes to the Store"

Sentence B: "He buys a Gallon of Milk"

Label : IsNextSentence

Sentence A: "The man goes to the Store"

Sentence B: "Penguins are Flightless"

Label : NotNextSentence

As we can see, it is highly likely that sentence B follows sentence A in the first example, but highly unlikely in the second. This allows BERT to learn which sentences can follow another sentence.

This training is done on a large corpus of text. BERT was pretrained on the whole of English Wikipedia, along with a large corpus of books. Google has released BERT in a pretrained form, so it is as if the user is getting ready-made frozen food, which they can then cook for themselves.
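
Labelled pairs like the two above can be generated automatically from ordinary text. The sketch below is an assumption-laden illustration (the function name `make_sentence_pairs` and the idea of drawing the "wrong" sentence from a separate pool are choices made here, not details taken from the BERT code).

```python
import random


def make_sentence_pairs(document, unrelated_pool, seed=0):
    """Build (sentence A, sentence B, label) examples from ordered sentences."""
    rng = random.Random(seed)
    pairs = []
    for first, second in zip(document, document[1:]):
        # The true neighbour gives a positive example.
        pairs.append((first, second, "IsNextSentence"))
        # A sentence drawn from elsewhere gives a negative example.
        pairs.append((first, rng.choice(unrelated_pool), "NotNextSentence"))
    return pairs


document = ["The man goes to the Store", "He buys a Gallon of Milk"]
unrelated_pool = ["Penguins are Flightless"]

pairs = make_sentence_pairs(document, unrelated_pool)
for sentence_a, sentence_b, label in pairs:
    print(sentence_a, "|", sentence_b, "|", label)
```

No human labelling is needed here: the order of the sentences in the original text supplies the labels.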

What next?

Different users, depending on their problem statements and datasets, have to tune BERT to their own problem. The ready-made frozen food that BERT provides has to be cooked as per the needs of the user.

Testing BERT

I decided to try running BERT on my system and on the cloud, and here is my story.

The dataset I decided to try was the Yelp review dataset, which has text reviews by users and a label saying whether each review is positive or negative. So the task was to feed the reviews as input to the algorithm and classify each one as a positive or a negative review. This task is called binary classification. A sample of the dataset is given below.

The first column indicates if the review was positive (1) or negative (0), and the second column shows the review text.

First I created the features out of the text, and then tuned the BERT model for the binary classification. The entire execution of the code took about 9 hours on the cloud (I did not execute the code on the laptop as the estimated time of completion was ~150 hours 😂).

The Evaluation results were

The results say that, out of 19000 positive reviews, the machine was able to recognise 18185 correctly, and out of the 19000 negative reviews, it was able to recognise 18172 correctly as negative: about 96% in both cases (check [9] for the meaning of TP, TN, FP and FN).
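
The "about 96%" figure follows directly from the four counts. The short calculation below uses only the numbers quoted above; treating "positive review" as the positive class, and the variable names, are assumptions of this sketch.

```python
# Counts quoted in the post.
total_positive = 19000
total_negative = 19000
true_positive = 18185   # positive reviews recognised as positive (TP)
true_negative = 18172   # negative reviews recognised as negative (TN)

# The remaining reviews in each group were misclassified.
false_negative = total_positive - true_positive   # positives missed (FN)
false_positive = total_negative - true_negative   # negatives missed (FP)

# Share of each group that was classified correctly.
positive_rate = true_positive / (true_positive + false_negative)
negative_rate = true_negative / (true_negative + false_positive)

# Share of all reviews that were classified correctly.
accuracy = (true_positive + true_negative) / (total_positive + total_negative)

print(f"positive reviews correct: {positive_rate:.2%}")
print(f"negative reviews correct: {negative_rate:.2%}")
print(f"overall accuracy:         {accuracy:.2%}")
```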

Why people in industry are using BERT

  • Unlike with previous models, we do not have to train the machine from scratch, as BERT does half of the job (remember the frozen food)
  • The user can take BERT and tune it to work for their own use cases (binary classification, multi-class classification, etc.)
  • As BERT is already trained on a large corpus that includes Wikipedia, we may not require a large amount of data to tune it

Where does BERT fit into our TCR research at EdLab?

  • We plan to use BERT in the same way, tuning it to our requirements so that it helps us tag TCR content.
  • For the tuning part, we plan to feed it content from other sources, along with its tags, as training data, and then give it the TCR content, for which we get tags as output.
  • We are still looking for content with tags to train and tune BERT

I have tried to make the explanation simpler for a wider audience. If you are interested in knowing more about the technical details of the algorithm, I strongly recommend that you check out this video [10].


1 - Wikipedia -

2 - Edlab -

3 - Wikipedia -

4 - Wikipedia -

5 - Wikipedia -

6 - Wikipedia -

7 - Medium -

8 - BERT -

9 - Wikipedia -

10 - Youtube -

Other resources referred to

Main from :

P. S. I have also skipped a lot of intermediate detail while explaining Bag of words and Word2vec, as I was trying to keep the explanation general.

Posted in: Technology, Work Progress, Project Idea, Research | By: Ameya Karnad | 2112 Reads