Google BERT NLP Technology

Google’s BERT takes NLP to a much higher level of accuracy

As a follow-up to my earlier LinkedIn post on Google’s BERT model for NLP, I am writing this to explain BERT in more detail and to share the results of our experiment.

In a recent blog post, Google announced that it has open-sourced BERT, its state-of-the-art training technique for natural language processing (NLP) applications. The accompanying paper (https://arxiv.org/abs/1810.04805) is receiving accolades from across the machine learning community, because BERT broke several records for how well models handle language-based tasks.

Here are a few highlights that make BERT unique and powerful:

  • BERT stands for Bidirectional Encoder Representations from Transformers. As the name suggests, it uses a bidirectional encoder, which lets it draw on context from both the left and the right of a word, and it is pre-trained without supervision, meaning it can ingest data that is neither classified nor labeled. This is unique because previous models read a text sequence either left to right only, or as a shallow combination of separate left-to-right and right-to-left training. It also stands apart from conventional NLP models such as word2vec and GloVe, which generate a single, context-free word embedding (a mathematical representation of a word) for each word in their vocabulary (the first sketch below illustrates the difference).
  • BERT uses the Transformer, an open-source neural network architecture based on a self-attention mechanism that is well suited to NLP. The Transformer has been gaining popularity for its training efficiency and its superior ability to capture long-distance dependencies compared with recurrent neural network (RNN) architectures; it uses attention (https://bit.ly/2AzmocB) to boost the speed at which these models can be trained. Unlike directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once, allowing the model to learn the context of a word from all of its surroundings, left and right (see the toy self-attention implementation below).
  • In the pre-training process, the researchers used a masking approach to prevent the word being predicted from indirectly “seeing itself” in a multi-layer model: 15% of the input tokens are masked, and the deep bidirectional representation is trained to recover them from context. This method is referred to as a Masked Language Model (MLM), demonstrated in the third sketch below.
  • BERT builds on recent work in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT. BERT is pre-trained for 40 epochs over a 3.3-billion-word corpus combining BooksCorpus (800 million words) and English Wikipedia (2.5 billion words). The large model has 24 Transformer blocks, a hidden size of 1,024, and 340M parameters. Training runs on Cloud TPUs (https://cloud.google.com/tpu/docs/tpus), which enables quick experimentation, debugging, and model tweaking.
  • BERT enables developers to train a “state-of-the-art” NLP model in about 30 minutes on a single Cloud TPU (tensor processing unit, Google’s cloud-hosted accelerator hardware), or in a few hours on a single graphics processing unit (GPU).

These are just a few highlights of what makes BERT the best-performing NLP model so far; the sketches below illustrate three of them.
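To make the bidirectional-context point concrete, here is a minimal sketch, assuming the Hugging Face transformers library, PyTorch, and the public bert-base-uncased checkpoint (all three are our choices for illustration, not tooling named in the post). The same word receives a different vector in each sentence, unlike the single context-free vector word2vec or GloVe would assign:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding BERT assigns to `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

river = embedding_of("he sat on the river bank", "bank")
money = embedding_of("he deposited cash at the bank", "bank")

# Cosine similarity noticeably below 1.0: the two "bank" vectors differ
# because each reflects its own left and right context.
print(torch.cosine_similarity(river, money, dim=0).item())
```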
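Second, a toy NumPy implementation of the scaled dot-product self-attention at the heart of the Transformer (shapes and values are illustrative only, not BERT’s actual dimensions). Every token attends to every other token in a single pass, which is how the encoder sees left and right context simultaneously:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all tokens
    return weights @ V                               # context-mixed vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8)
```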
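Third, the Masked Language Model objective can be demonstrated with the transformers fill-mask pipeline (again our assumption for illustration, not the authors’ tooling): mask a token and let BERT predict it from the context on both sides:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT must use both the left context ("doctor wrote a") and the right
# context ("for the patient") to rank plausible fillers for the masked token.
for candidate in fill_mask("The doctor wrote a [MASK] for the patient."):
    print(candidate["token_str"], round(candidate["score"], 3))
```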

Our Experiment:

To evaluate the performance of BERT, we compared it with an IBM Watson-based named-entity recognition (NER) model. Both models were built from, and then applied to, the same set of large, annotated, unstructured documents. The table below shows the results we achieved:

Figure: Google's BERT vs. IBM Watson comparison chart
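For context on how such a comparison can be scored, here is a hedged sketch of entity-level precision, recall, and F1 against gold annotations (our actual evaluation harness and data are not shown; the spans below are made up for illustration):

```python
def entity_scores(gold, predicted):
    """gold/predicted: sets of (start, end, label) entity spans for a document."""
    tp = len(gold & predicted)                        # exact span+label matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

gold = {(0, 6, "PERSON"), (18, 24, "ORG")}
predicted = {(0, 6, "PERSON"), (30, 35, "ORG")}
print(entity_scores(gold, predicted))                 # (0.5, 0.5, 0.5)
```

Applying one scoring function to both models’ output keeps the comparison like for like.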


Based on our comparison and what we have seen so far, it is fairly clear that BERT is a breakthrough, and a milestone, in the use of machine learning for natural language processing.

Deep Learning

What is Deep Learning?

Deep learning is a subset of machine learning that allows machines to perform tasks that typically require human-like intelligence. The inspiration for deep learning comes from neuroscience: the architecture of a deep neural network is connected in a way that loosely mirrors the brain. Deep learning networks are distinguished from the more commonplace neural networks by their depth, that is, the number of node layers through which data passes in a multistep process.

Earlier neural networks were shallow, composed of one input layer and one output layer, with at most one hidden layer in between. More than three layers in total (including input and output) qualifies as “deep” learning; strictly defined, then, “deep” means more than one hidden layer.

Figure: a simple neural network compared with a deep learning neural network
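The distinction in the figure is easy to express in code. Here is a minimal sketch using Keras (our choice of framework, purely for illustration): the first model is shallow, with a single hidden layer, while the second qualifies as deep because it stacks several:

```python
from tensorflow import keras

# Shallow: input -> one hidden layer -> output.
shallow = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(64, activation="relu"),        # the lone hidden layer
    keras.layers.Dense(10, activation="softmax"),     # output layer
])

# Deep: input -> several hidden layers -> output.
deep = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),       # hidden layer 1
    keras.layers.Dense(64, activation="relu"),        # hidden layer 2
    keras.layers.Dense(32, activation="relu"),        # hidden layer 3
    keras.layers.Dense(10, activation="softmax"),     # output layer
])
```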

In deep learning networks, each layer of nodes trains on a distinct set of features based on the previous layer’s output. The further you advance into the network, the more complex the features its nodes can recognize, since each layer aggregates and recombines features from the layer before.

Let’s take a simple example: recognizing handwritten numbers from 1 to 10. If ten people write the numbers, the results will look quite different from person to person. For a human brain it is fairly easy to identify them; for a traditionally programmed machine it is extremely hard, which is why neural networks, which mimic the way neurons in the brain interact, are used instead. The multiple hidden layers allow a computer to determine the nature of a handwritten digit by building a rough hierarchy of the different features that make up that digit.

For instance, if the input is an array of values representing the individual pixels in the image of the handwritten figure, the next layer might combine those pixels into lines and shapes, the layer after that combines the shapes into distinct features such as the loops in an 8 or the upper triangle in a 4, and so on. By building up this picture of features, a neural network can determine, with a very high level of accuracy, which number a handwritten digit corresponds to. During training, the model also learns which links between neurons are critical to making successful predictions. Over the course of several training cycles, and with the help of occasional manual tuning, the network continues to learn and generate better predictions until it reaches the desired accuracy.
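As a hedged end-to-end sketch of this example, the network below uses Keras and the public MNIST handwritten-digit dataset (our choices; the post itself names no dataset or framework). The stacked Dense layers play the role of the feature hierarchy described above:

```python
from tensorflow import keras

# Load 28x28 grayscale images of handwritten digits and scale pixels to [0, 1].
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),                           # raw pixel array in
    keras.layers.Dense(128, activation="relu"),       # low-level strokes/shapes
    keras.layers.Dense(64, activation="relu"),        # higher-level features
    keras.layers.Dense(10, activation="softmax"),     # one score per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5)                 # several training cycles
print(model.evaluate(x_test, y_test))                 # accuracy on unseen digits
```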

Thus, deep learning allows machines to solve complex problems even when the data set is diverse, unstructured, and interconnected. Deep learning networks excel at dealing with vast amounts of disparate data; in fact, the larger the amount of data, the more effective deep learning becomes, and the more its algorithms learn, the better they perform.

A few additional links on this topic:
MIT Technology Review: https://www.technologyreview.com/s/513696/deep-learning/
Cambridge University paper: https://bit.ly/2Fbbrlr