Named Entity Recognition

Bring clarity to unstructured data using Natural Language Processing (NLP) – Part 1

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language; in particular, it is concerned with programming computers to process and analyze large amounts of natural language data.

In my previous articles, I have addressed some specific topics in NLP such as Text Classification and Natural Language Search. Here I want to give a quick introduction to a few key technical capabilities of Natural Language Processing. With recent advances in artificial intelligence technologies, computers have become very adept at reading, understanding and interpreting human language. Let's look at a few key capabilities of NLP. This is by no means a comprehensive list of all NLP capabilities.

 

Named Entity Recognition (NER):
NER is one of the first steps towards information extraction from large volumes of unstructured data. NER seeks to locate named entities in a text and classify them into predefined categories such as persons, countries, and organizations. This helps with answering many questions such as:
– How many mentions of an organization are there in this article?
– Were there any specific products mentioned in a customer review?

This technology enables organizations to extract individual entities from documents, social media, knowledge bases, and more. The better defined and trained the underlying ontologies are, the better the results will be.
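As a quick illustration, here is a minimal NER sketch using the open-source spaCy library and its small pretrained English pipeline (one tool among many; the sentence and entity labels shown are spaCy's defaults, used purely for illustration).

```python
# A minimal NER sketch using spaCy (an assumption; any NER toolkit or a
# custom-trained model could be substituted).
import spacy

nlp = spacy.load("en_core_web_sm")  # small pretrained English pipeline

text = "Acme Corp. opened a new office in Berlin, and CEO Jane Doe attended."
doc = nlp(text)

# Each detected entity carries its text span and a predefined category label.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Acme Corp." ORG, "Berlin" GPE, "Jane Doe" PERSON
```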

 

Topic Modeling:
Topic Modeling is a type of statistical modeling for discovering abstract topics in a large document set. It is frequently used to discover hidden semantic structures in a textual body. It differs from traditional classification in that it is an unsupervised method of extracting the main topics. This technique is used in the initial exploration phase to find what the common topics in the data are. Once you discover the topics, you can use the language in those topics to create categories. One of the popular methods used for Topic Modeling is Latent Dirichlet Allocation (LDA). LDA models each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both. You can read more about LDA here: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
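To make this concrete, here is a minimal LDA sketch using scikit-learn; the toy corpus and the choice of two topics are illustrative assumptions, not a recommended configuration.

```python
# A minimal LDA sketch with scikit-learn on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the engine failed after the oil leak",
    "the crew repaired the oil pump on the engine",
    "the review praised the camera and battery life",
    "battery drains quickly according to the customer review",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)            # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words that define each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-5:]]
    print(f"Topic {idx}: {top_words}")
```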

 

Text Classification:
Text classification (a.k.a. text categorization or text tagging) is the task of assigning a set of predefined categories to free text. It is a supervised approach, in contrast to the unsupervised Topic Modeling described above. I have written in detail about text classification here:
https://abeyon.com/textclassification/
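For illustration, a minimal supervised text-classification sketch with scikit-learn follows; the tiny spam/ham training set and the TF-IDF plus logistic regression pipeline are assumptions chosen for brevity, not a production recipe.

```python
# A small supervised text-classification sketch (TF-IDF + logistic regression).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["win a free prize now", "meeting moved to 3pm",
         "cheap loans click here", "quarterly report attached"]
labels = ["spam", "ham", "spam", "ham"]   # predefined categories

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["claim your free loan prize"]))  # likely ['spam']
```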

 

Information Extraction:
Information Extraction (IE) is used to automatically find meaningful information in unstructured text. IE distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between those entities. IE systems can be used to extract abstract knowledge directly from a text corpus, or to extract concrete data from a set of documents that can then be further analyzed with traditional data-mining techniques to discover more general patterns.
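As a rough sketch of the idea, the snippet below pulls (subject, verb, object) triples out of a sentence using spaCy's dependency parse; the sentence is made up for illustration, and production IE systems are considerably more sophisticated.

```python
# A rough relation-extraction sketch: (subject, verb, object) triples from a
# dependency parse. Only the head tokens of each phrase are printed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp. acquired Widget Inc. in 2019.")

for token in doc:
    if token.pos_ == "VERB":
        subjects = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c.text for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
        if subjects and objects:
            print(subjects, token.lemma_, objects)  # e.g. ['Corp.'] acquire ['Inc.']
```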

 

Sentiment Analysis:
Sentiment analysis is the automated process of understanding an opinion about a given subject from written or spoken language. Sentiment analysis decodes the meaning behind human language, allowing organizations to analyze and interpret comments on social media platforms, documents, news articles, websites, and other venues for public comment.
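A quick sketch using NLTK's VADER analyzer is shown below; it is only one of many possible tools (transformer-based sentiment classifiers are another common choice), and the example sentence is illustrative.

```python
# A quick sentiment-scoring sketch using NLTK's VADER analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

scores = sia.polarity_scores("The support team was quick and very helpful.")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}; compound > 0 means positive
```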

 

Within government agencies and organizations, there is a deluge of unstructured data in both analog and digital form. NLP provides the tools needed to move the needle forward in gaining visibility and knowledge from that unstructured data. NLP can be utilized in many ways, to name a few: analyzing public data like social media posts, reviews, and comments; gaining visibility into the organizational knowledge base; providing predictive capabilities; and enhancing citizen services. There is much to be learned from the potential of AI and, in particular, its ability to analyze masses of unstructured data. It is time for agencies and organizations to take action and harness the power of NLP to stay ahead.

Text Classification: Binary to Multi-label Multi-class classification

Unstructured data in the form of text is everywhere: emails, web pages, social media, survey responses, domain data and more. While textual data is rich in information, gaining insights from it is complex, and classifying text manually can be hard and time-consuming. For businesses to make intelligent, data-driven decisions, understanding the insights in text in a fast and reliable way is essential. Artificial intelligence makes that possible with Natural Language Processing (NLP) and text classification. The capability to automatically classify text into one or more categories has seen tremendous improvements in recent years. Gone are the days of manually tagging textual data, which is laborious, time-consuming, inconsistent and expensive.

So let’s look at a few types of text classification in AI.

Binary classification: As the name suggests, this is the process of assigning a single boolean label to textual data. Example: reviewing an email and classifying it as legitimate or spam.

AI Binary Classification

Multi-class classification: Multi-class classification involves reviewing textual data and assigning it one (single-label) or more (multi-label) labels from a larger set of classes. The complexity of the problem increases as the number of classes increases. Let's take the example of assigning genres to movies: each movie is assigned one or more genres from a list of movie genres (Drama, Action, Comedy, Horror, etc.). This is a multi-class classification problem with a manageable set of labels.

AI Multi-class classification
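A hedged sketch of the movie-genre example follows, using scikit-learn's one-vs-rest strategy over TF-IDF features; the synopses and genre labels are made up, and a real system would need far more training data.

```python
# Multi-label genre tagging: each synopsis can receive several genre labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

synopses = [
    "a detective chases a serial killer through the city",
    "two friends fall in love during a summer road trip",
    "a haunted house terrorizes a young family",
]
genres = [["Action", "Drama"], ["Comedy", "Drama"], ["Horror"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)  # one indicator column per genre

clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(synopses, Y)

pred = clf.predict(["a family road trip turns into a comedy of errors"])
print(mlb.inverse_transform(pred))  # e.g. [('Comedy', 'Drama')]
```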

 

Now imagine a classification problem where a specific item needs to be classified across a very large category set (10,000+ categories). The problem becomes exponentially more difficult. Here is where eXtreme Multi-Label Text Classification with BERT (X-BERT) comes into play. If you want to learn more about Google’s NLP framework BERT, click here.

X-BERT aims to tag each input text with the most relevant labels from an extremely large label set.

Here are a few examples of extremely large multi-class classification. The first is classifying a retail product into product categories. There are hundreds of thousands of product categories (https://www.researchandmarkets.com/categories), and classifying a single product into a category based on its product description constitutes a multi-label (specific product category), multi-class (broader product category) example.

The second is displaying sponsored content based on user search queries. There are thousands of ways users can phrase a search query, and classifying those inputs in order to display a specific ad under sponsored results is another extremely large classification problem.

AI-based multi-label classification

In the work we do for the US Navy, we tackle a similar problem: identifying a single equipment name and ID from a list of equipment names across ships. The need is to find the right equipment from a list of 50,000+ items with more than 90% accuracy. We fine-tuned an X-BERT model connected to an additional dense layer and a softmax layer to identify the equipment. Combined with subject matter expert validation and verification, this helped the model get better over time at identifying the equipment.

Extremely Large Multi-class classification X-BERT
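The snippet below is a highly simplified sketch of that kind of setup, written with PyTorch and the Hugging Face transformers library: a pretrained BERT encoder feeding an added dense layer and a softmax over a very large label set. The label count, layer sizes, model name, and example input are illustrative assumptions, not the production model used for the Navy work.

```python
# Simplified "BERT + dense + softmax" classifier over a very large label set.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

NUM_LABELS = 50_000  # assumed size of a very large equipment catalogue

class EquipmentClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dense = nn.Linear(self.bert.config.hidden_size, 512)
        self.out = nn.Linear(512, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = hidden.last_hidden_state[:, 0]          # [CLS] token representation
        logits = self.out(torch.relu(self.dense(pooled)))
        return torch.softmax(logits, dim=-1)             # probability per label

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["main engine lube oil pump"], return_tensors="pt",
                  padding=True, truncation=True)
model = EquipmentClassifier()
probs = model(batch["input_ids"], batch["attention_mask"])
print(probs.shape)  # torch.Size([1, 50000])
```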

As shown in the examples above, with the right methodology and training data, unstructured text can be categorized automatically using AI-based NLP technology. Employing AI-based auto-classification makes classification more effective and efficient.

Transfer Learning

Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task.

In transfer learning, we leverage prior knowledge from one domain in a different domain. This is typically done by removing the last output layer of a trained network and creating a new set of neural network layers for the new problem; those new layers are then trained using the new data set.

For example, let's say you have an AI model that recognizes cats, and we now want to use that knowledge to recognize elephants. The cat model is created by training on pictures of cats (plenty are available on the internet). Once the model recognizes cats with high accuracy, the last layer of the neural network is replaced with new layers, and those layers are trained on pictures of elephants. This works because many low-level features, such as edges and curves, have already been learned from the large dataset (in this case, cats), so the newer model only needs to learn the elephant-specific features and can do so with far less data, as shown in the figure below.
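A compact Keras sketch of this recipe follows: reuse a network pretrained on ImageNet, freeze its feature-extraction layers, and train only a new output head on the smaller "elephant" dataset. The base model choice is an assumption, and loading of the elephant dataset is omitted.

```python
# Transfer learning: frozen pretrained base + a small new head for the new task.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the learned low-level features (edges, curves, textures)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new task: elephant vs. not elephant
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(elephant_dataset, epochs=5)  # hypothetical small labeled dataset; only the new layers train
```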

 

Most of the success today in achieving high accuracy in AI models has been driven by extensive supervised learning, which relies on large amounts of labeled data. For simple use cases, large labeled public datasets are available from various online sources (e.g., ImageNet, WordNet), but if you are building a model for a specific domain, large amounts of labeled data are hard to obtain, or the data will need to be cleaned and labeled manually. Transfer learning enables you to develop fairly accurate models using comparatively little data, which is very useful for enterprises that do not have a lot of clean, labeled data.

Therefore, on problems where you do not have much data, transfer learning enables you to develop capable models that you simply could not develop otherwise.

Knowledge Integration

Knowledge Integration in AI

So let's think about how humans learn: we are very good at continuously enriching and refining our knowledge and skills by seamlessly combining existing knowledge with new experiences. We exhibit a wide spectrum of learning abilities across many fields; we can be lawyers during the day, play tennis or go for a run in the evening, and make dinner at night. We are fairly adept at doing multiple tasks. AI systems, in contrast, usually are not: through machine learning they become very good at one specific task, which is often called narrow intelligence.

Despite recent breakthroughs and advances, machine learning has a number of shortcomings when it comes to acquiring knowledge across fields and identifying how new and prior knowledge interact to produce more insight. Knowledge integration is the process of synthesizing multiple knowledge representations into a common model. It describes how new information and existing information interact, what effects the new information will have on existing knowledge, and whether existing knowledge needs to be modified to accommodate the new information.

Why is this concept important? It is important for building better machine learning models for enterprise knowledge insights. Not all knowledge will be readily available or can be fed into the machine learning model at once. Substantial knowledge bases are developed incrementally, and a growing body of knowledge will need to be added over time. By identifying subtle conflicts and gaps in knowledge, knowledge integration facilitates better learning models. Large firms like Google are using a combination of symbolic AI, deep learning, and supervised learning to achieve better knowledge understanding and knowledge reasoning.

If you are an organization looking to extract valuable information and identify patterns within your data to create efficiency, these concepts are critical, and I highly recommend researching them further to achieve success.

What is NLS?

NLS, or Natural Language Search, is search using everyday spoken language, such as English. Using this type of search, you can ask a database a question or type a sentence that describes what you are looking for.

Though asking questions in a more natural way (ex: What is the population of England as of 2018? Or who was the 44th president of America?) has only recently come into its own in the field, natural language search engines have been around almost as long as web search.

Remember Ask Jeeves? The 1996 search engine encouraged its users to phrase their search queries in the form of a question, to be “answered” by a virtual suited butler. Ask Jeeves was actually ahead of its time in this regard when other search engines like Google and Yahoo were having greater success with keyword-based search. In 2010, Ask Jeeves finally bowed to the pressure from its competition and outsourced its search technology. Ironically, had Ask Jeeves been founded about fifteen years later, it most likely would have been at the cutting edge of natural language search, ahead of the very search engines that squeezed it out.

Ask Jeeves

 

Today the advent of smart speakers and mobile phones has brought voice-based search and conversational search to the forefront. Advances in NLP (Natural Language Processing) technology have made this possible not just for search giants like Google and Microsoft but also for enterprises that want to search their internal knowledge base and domain data using artificial intelligence (AI).

Anyone who has used an enterprise application will be familiar with multiple criteria search boxes (like below).

NLS

These searches are cumbersome and only cover structured data stored in the database. As more and more data is stored in NoSQL databases and as unstructured text across documents and folders, the need to search across these data sources becomes essential. A simple Boolean search (a simple search for keywords) does not provide the extensive search capabilities necessary to review complex relationships between topics, issues, new terms, and languages. It's 2019 – searches need to, and can, go beyond simple keyword matching.

An effective search needs to include indexed data extracted from the knowledge base using AI technologies such as NER (Named Entity Recognition), OpenIE (open information extraction), key-phrase extraction, text classification, STS (Semantic Text Similarity), and text clustering. The data extracted from these processes then needs to be classified, indexed, and stored using a multi-label classification approach. This classification methodology populates the database with knowledge, links, and relations between all data sources. That data, along with structured data stored through transactional operations, is then used to train the AI models to understand entities, relations, and common phrases. And that's where natural language search comes in: NLS provides an efficient way to search data stored in structured or unstructured formats (like scanned PDF documents), thereby providing a comprehensive search across all data.
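As one hedged example of the semantic-similarity building block mentioned above, the sketch below embeds a few text records and a query with a sentence-embedding model and ranks them by cosine similarity; the model name and the toy corpus are illustrative assumptions, not part of any production system described here.

```python
# Semantic similarity search: rank records by meaning rather than keywords.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

corpus = [
    "Lube oil leak reported on the main engine after sea trials.",
    "Quarterly budget summary for fleet maintenance.",
    "Fuel pump replaced following repeated oil leaks on the main engine.",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query = "oil leaks on the main engine"
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)[0]
best = scores.argmax().item()
print(corpus[best])  # the semantically closest record, not just a keyword match
```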

You can think about it like this: take a query from an engineer like “Show me Oil Leaks on Main Engine for TAO class in the last 2 years.” The ships that belong to a specific class are stored in a structured database, while the specific oil-leak failures live in repair text as unstructured information. Using AI models to extract this information and index it provides the ability to answer this query with much more accuracy and greater speed – digging through documents is not going to be the best use of your time. In this case, the query is analyzed by the model to identify the entities it contains, and a query is built dynamically to retrieve information from the appropriate data store.

Artificial intelligence has enabled the possibility of implementing NLS within enterprises. This will increase the efficiency and effectiveness of search while reducing the time to perform searches.

AI Microservices

Deploy AI Models as Microservices

Microservices are a software development technique in which an application is built as a suite of small, independently deployable services, each organized around a specific business capability. In other words, microservices are the idea of breaking a big, monolithic application down into a collection of smaller, independent applications.

Why should machine learning models be deployed as microservices?

This is an empirical era for machine learning: as successful as deep learning has been, our understanding of why it works so well is still lacking. Machine learning engineers need to explore and experiment with different models before they settle on one that works for their specific use case. Once a model is developed, there are inherent advantages to packaging it in a container and serving it as a microservice.

Here are a few reasons why it makes sense to deploy AI models as microservices:

  • Microservices are smaller and easier to understand than a large monolithic application. Because each microservice is focused on a business function, it is simpler to deploy a single specific function without worrying about all the other business functions.
  • Each service can be deployed independently of the others. This also allows for independent scaling of each service, as opposed to scaling the entire application, which is a much more efficient use of computing capacity and achieves a better balance of resource allocation. Microservices deployed in a container architecture allow for further efficiency in scaling.
  • Because each service is focused on a specific business function, it is easier for development resources to understand a small set of functions rather than the entire application.
  • Exposing the model as a service makes it available to both internal and external applications without having to move the code, and provides the ability to access data using well-defined interfaces. Containers have mechanisms built in for external and distributed data access, so you can leverage common data-oriented interfaces that support many data models (see the sketch after this list).
  • Each team also has the luxury of choosing whatever languages and tools it wants without affecting anyone else, which reduces vendor and technology lock-in. By deploying machine learning models as microservices with API endpoints, data scientists and AI programmers can write models in whatever framework they prefer (TensorFlow, PyTorch, or Keras) without worrying about technology-stack compatibility.
  • Microservices allow new versions to be deployed in parallel and independently of other services. Developers can work in parallel and get changes to production independently and faster, which enables the continuous delivery and deployment of large, complex machine learning applications. With production-ready frameworks like TensorFlow Serving, managing multiple versions of a model becomes very easy.
  • Deploy to any environment: local, private, or public cloud. If there are data-privacy concerns about deploying AI models in the cloud, packaging individual models as containers allows them to be deployed in the local environment.
  • In most AI projects, several AI models will be developed to do specific functions (for example, a model for Named Entity Recognition and a model for Information Extraction). Microservices allow these models to be independently developed, updated, and deployed.
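As referenced in the list above, here is a minimal sketch of wrapping a trained model in its own microservice using FastAPI; the framework choice, the endpoint name, and the text_classifier.joblib model file are all assumptions made for illustration (Flask or TensorFlow Serving would work equally well).

```python
# A minimal model-serving microservice; containerize it and scale independently.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("text_classifier.joblib")  # hypothetical pre-trained model file

class Request(BaseModel):
    text: str

@app.post("/classify")
def classify(req: Request):
    label = model.predict([req.text])[0]
    return {"label": str(label)}

# Run with: uvicorn service:app --port 8080
# Other applications call POST /classify without needing the model code itself.
```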

Now let’s talk about some technologies that help with deploying models as microservices. Here we want to focus on two prominent technologies that allow for this to happen.

Docker:
Docker helps you create and deploy microservices within containers. It is an open-source collection of tools that help you build and run any app, anywhere, and there are plenty of resources on the internet, starting with the official Docker documentation, for learning the basics and getting started.

Kubernetes:
When it comes to deploying microservices as containers, another aspect to keep in mind is the management of the individual containers. If you want to run multiple containers across multiple machines – which you will need to do if you are using microservices – you will need to manage them efficiently: start the right containers at the right time, make them talk to each other, handle storage and memory considerations, and deal with failed containers or hardware. Doing all of this manually would be a nightmare, and hence a tool like Kubernetes is critical. Kubernetes is an open-source container orchestration platform that allows large numbers of containers to work together in harmony, reducing operational burden.

When used together, both Docker and Kubernetes are great tools for developing a modern AI cloud architecture.

Google BERT NLP Technology

Google’s BERT takes NLP to much higher accuracy

As a follow-up to my earlier LinkedIn post on Google’s BERT model for NLP, I am writing this to explain BERT further and to share the results of our experiment.

In a recent blog post, Google announced they have open-sourced BERT, their state-of-the-art training technique for natural language processing (NLP) applications. The paper released along with the blog (https://arxiv.org/abs/1810.04805) is receiving accolades from across the machine learning community. This is because BERT broke several records for how well models can handle language-based tasks, and NLP tasks in particular.

Here are a few highlights that make BERT unique and powerful:

  • BERT stands for Bidirectional Encoder Representations from Transformers. As the name suggests, it uses a bidirectional encoder that lets it access context from both the left and the right of a word, and it is trained in an unsupervised fashion, meaning it can ingest data that is neither classified nor labeled. This is unique because previous models looked at a text sequence either from left to right or as a shallow combination of left-to-right and right-to-left training. It also stands in contrast to conventional NLP models such as word2vec and GloVe, which generate a single, context-free word embedding (a mathematical representation of a word) for each word in their vocabularies.
  • BERT uses Google's Transformer, an open-source neural network architecture based on a self-attention mechanism that is optimized for NLP. The Transformer has been gaining popularity due to its training efficiency and superior performance in capturing long-distance dependencies compared to recurrent neural network (RNN) architectures. It uses attention (https://bit.ly/2AzmocB) to boost the speed with which these models can be trained. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
  • In the pre-training process, the researchers used a masking approach to prevent the word being predicted from indirectly “seeing itself” in a multi-layer model. A portion (15%) of the input tokens are masked, and the model is trained to predict them, which yields a deep bidirectional representation. This method is referred to as a Masked Language Model (MLM); see the short example after this list.
  • BERT builds upon recent work in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT. BERT is pre-trained for 40 epochs over a 3.3-billion-word corpus, including BooksCorpus (800 million words) and English Wikipedia (2.5 billion words). BERT-Large has 24 Transformer blocks, a hidden size of 1,024, and 340M parameters. The model is trained on Cloud TPUs (https://cloud.google.com/tpu/docs/tpus), which enables quick experimentation, debugging, and model tweaking.
  • It enables developers to train a “state-of-the-art” NLP model in about 30 minutes on a single Cloud TPU (tensor processing unit, Google’s cloud-hosted accelerator hardware) or in a few hours on a single graphics processing unit.
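As promised above, here is a quick illustration of the masked-language-model idea, loading the open-sourced BERT weights through the Hugging Face transformers pipeline (an assumption; the original release shipped with Google's own TensorFlow code), with a made-up example sentence.

```python
# Masked language modelling: BERT uses context on BOTH sides of [MASK]
# to rank likely words for the blank.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The ship reported an oil [MASK] in the main engine."):
    print(candidate["token_str"], round(candidate["score"], 3))  # e.g. "leak" near the top
```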

These are just a few highlights on what makes BERT the best NLP model so far.

Our Experiment:

To evaluate the performance of BERT, we compared a BERT-based NER model to an IBM Watson-based NER model. The test was performed against the same set of annotated, large unstructured documents, with both models applied to that document set. The table below shows the results we achieved:

Google's BERT Comparison chart

 

Based on our comparison and what we have seen so far, it is fairly clear that BERT is a breakthrough and a milestone in the use of Machine Learning for Natural Language Processing.

Deep Learning

What is Deep Learning?

Deep Learning is a subset of machine learning that allows machines to perform tasks that typically require human-like intelligence. The inspiration for deep learning comes from neuroscience: the architecture of deep-learning neural networks, with layers of interconnected nodes, loosely mirrors the way the brain is wired. Deep-learning networks are distinguished from the more commonplace neural networks by their depth; that is, the number of node layers through which data passes in a multistep process.

Earlier versions of neural networks were shallow, composed of one input layer and one output layer with at most one hidden layer in between. More than three layers (including input and output) qualifies as “deep” learning; so “deep”, strictly defined, means a network with more than one hidden layer.

Neural Network

Deep learning Neural network

In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer’s output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.

Let's take a simple example: recognizing handwritten numbers from 1 to 10. If ten people write the same numbers, the digits will look very different from person to person. A human brain identifies them fairly easily; for a traditional, explicitly programmed machine this is very hard to do, and hence neural networks are used to mimic the way neurons in the brain interact. The multiple hidden layers allow a computer to determine the nature of a handwritten digit by building a rough hierarchy of the different features that make up the digit.

For instance, if the input is an array of values representing the individual pixels in the image of the handwritten figure, the next layer might combine these pixels into lines and shapes, the next layer combines those shapes into distinct features like the loops in an 8 or upper triangle in a 4, and so on. By building a picture of these features, neural networks can determine with a very high level of accuracy the number that corresponds to a handwritten digit. Additionally, the model will learn which links between neurons are critical in making successful predictions during training. Over the course of several training cycles, and with the help of occasional manual tuning, the network will continue to learn and generate better predictions until it reaches desired accuracy.
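To make the digit example concrete, here is a small Keras sketch of such a network trained on the MNIST handwritten-digit dataset; the layer sizes and the use of Keras are assumptions made purely for illustration.

```python
# A small deep network for handwritten digits: raw pixels in, digit class out.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixel values to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # array of pixel values in
    tf.keras.layers.Dense(128, activation="relu"),    # lower-level strokes and shapes
    tf.keras.layers.Dense(64, activation="relu"),     # higher-level digit features
    tf.keras.layers.Dense(10, activation="softmax"),  # one output per digit class
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))
```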

Thus, deep learning allows machines to solve complex problems even when the data set is very diverse, unstructured, and interconnected. Deep-learning networks excel at dealing with vast amounts of disparate data; in fact, the more data deep-learning algorithms have to learn from, the better they tend to perform.

A few additional links on this topic:
MIT Technology Review: https://www.technologyreview.com/s/513696/deep-learning/
Cambridge University paper: https://bit.ly/2Fbbrlr

How do Machines Learn?

A good definition by TechEmergence states that “machine learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.”

From the definition it is fairly apparent that all forms of machine learning (ML) rely on the availability of data, and not just some data but large volumes of it. Therefore, in order to take advantage of ML, access to large sets of well-organized data is critical. As far as machine learning goes, there are several approaches, from a simple decision tree to multilayered neural networks, depending on the task and on the amount and type of available data.

There is no one-size-fits-all solution when it comes to a machine learning algorithm. Most times, the best solution is derived when working on real applications with real data because every organization’s data is unique. Solutions are derived by working with domain experts and creating custom neural networks.

There are a few methods to teach the machine with data: supervised learning, unsupervised learning and semi-supervised learning.

Supervised Learning: In supervised ML, the artificial intelligence (AI) model is given data that is labeled in an organized fashion. For example, one might provide pictures of cats along with the label “cat”. Once enough structured and labeled data is provided, the model can recognize and respond to patterns in data without explicit instructions. The output and the accuracy of supervised learning algorithms are easy to measure, making supervised learning the most common method of machine learning today.

Unsupervised learning: You guessed it, the opposite of supervised learning. Here the AI model is given data that is not labeled or organized. For example, one might provide pictures of animals (cats, dogs, etc.) without any labels. This method is used to identify underlying patterns or hidden structures in unlabeled data. The expectation is not to produce a single “right” output but to explore the dataset and draw inferences. It is used less often on its own than supervised learning because its results are harder to evaluate and act on directly.

Semi-supervised learning: This method falls somewhere between supervised and unsupervised learning. In this scenario, the model is given a small amount of labeled data and a much larger pool of unlabeled data. Semi-supervised learning combines the best of both worlds: it retains much of the accuracy associated with supervised ML while still making use of unlabeled data. Because labeling massive amounts of data for supervised learning is time-consuming and expensive, this approach can keep the final model accurate while reducing time and cost.
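A hedged scikit-learn sketch of the idea follows: only a small fraction of the points keep their labels (the rest are marked -1 as unlabeled), and a label-spreading model propagates labels to them. The dataset and the 90% hiding rate are arbitrary choices for illustration.

```python
# Semi-supervised learning: a few labels plus many unlabeled points.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)

rng = np.random.default_rng(0)
y_partial = y.copy()
unlabeled = rng.random(len(y)) < 0.9   # hide roughly 90% of the labels
y_partial[unlabeled] = -1              # -1 marks "unlabeled" for the algorithm

model = LabelSpreading()
model.fit(X, y_partial)

accuracy = (model.transduction_[unlabeled] == y[unlabeled]).mean()
print(f"accuracy on originally unlabeled points: {accuracy:.2f}")
```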

So what method should be used? Well, it depends. The structure and volume of data should inform the method and the approach that needs to be taken. Hence, there is not a one-size-fits-all solution when it comes to machine learning.

Next we will talk about deep learning, a powerful machine learning technique.

Miami Herald Article: Chatting with chirrp.ai – Miami company uses AI to engage with customers

Chirrp has been featured in the Miami Herald! Writer Nancy Dalberg speaks to chirrp’s leadership about chirrp’s unique technology as part of the Herald’s Biz Monday feature.

Check out the full story here or read the transcript below.

Chatting with Chirrp: Miami company uses AI to engage with customers