Create Specific Task LLMs – Part 1 – Prompt Engineering to Fine Tuning


In the world of artificial intelligence (AI) and machine learning, Large Language Models (LLMs) like GPT-3 have fundamentally transformed our approach to handling vast quantities of data and knowledge. However, when working on domain-specific AI tasks such as classification or Named Entity Recognition (NER), the need for specialized LLMs becomes apparent. Here we delve into why specific-task LLMs are essential, how to commence with prompt engineering, and the crucial transition to fine-tuning needed to get optimal performance from a specific-task LLM. Abeyon’s AI team has invested significant resources into researching these topics, yielding valuable insights and conclusions.

Why Specific-Task LLMs Are Necessary

General-purpose LLMs, while impressive in their breadth, often lack the precision required for specific tasks. In scenarios like NER or classification, where high accuracy is of paramount importance, specific-task LLMs become crucial. This consideration is especially vital in Abeyon’s projects, where clients expect precise and insightful statistics as output. The design of these specific-task LLMs focuses on a few key factors:

  • High Accuracy Over Creativity: The primary goal of these models is accuracy. In specific-task applications, the precision of information is more critical than the model’s creative output, whereas creativity is a virtue for general-purpose LLMs.
  • Consistency: Consistent answers to the same and/or similar queries are vital, ensuring reliability and trust in the model’s outputs.
  • Response time: Optimizing response time is another critical goal for specific-task LLMs. In many real-world applications, such as customer service or real-time data analysis, the speed of the response can be as important as its accuracy. By streamlining the model to focus on specific tasks, the computational load is reduced, leading to faster response times. This is particularly beneficial in high-volume, time-sensitive environments like financial markets analysis or on-the-fly translations where quick and accurate responses are essential.
  • Focus on Specific Domain Knowledge: By concentrating on domain specific text, these LLMs can provide responses that adhere to the logic and terminology of the target domain. This is especially important in domains requiring technical or specialized language, like marine engineering, or contexts sensitive to content, such as materials intended for children. Organizations that were required to manage large quantities of complex internal documents have found incredible utility in this feature, as these LLMs were able to adapt to the unique nuances of how their documents were structured and the vernacular with which they were written.

Starting with Prompt Engineering

The journey towards customizing Large Language Models (LLMs) for specific tasks begins with prompt engineering. This critical step requires a delicate balance: prompts must be clear and detailed to guide the LLM effectively, yet not so verbose or complex that they obfuscate the intended task. The essence of prompt engineering lies in its ability to succinctly communicate the task requirements and desired logic flow to the model. Effective prompt engineering is not just about instructing the LLM; it’s about setting the stage for successful fine-tuning. By crafting prompts that are comprehensive yet straightforward, we can identify the model’s current limitations and areas where fine-tuning is necessary. These prompts act as preliminary tests, revealing how well the LLM grasps specific concepts and logic, and where it may deviate or struggle.

The diversity in prompt construction is vital. From providing step-by-step Chain of Thought (CoT) guides to incorporating explicit definitions, each type of prompt serves a purpose. They are not just commands but tools to map out the model’s learning path. By using these prompts, we can pinpoint where the model’s understanding is lacking, thus highlighting the potential training points for fine-tuning. Through this process, Abeyon has been able to discover qualities in our clients’ data that they were previously unaware of and develop a deep understanding of the unique dynamics present in those projects. This is especially helpful in instances where the logic behind certain processes becomes necessarily complex, thus requiring the solution to be incredibly robust and capable of managing considerably high degrees of variation within the client’s data.

In essence, prompt engineering is the art of balancing clarity with brevity. It’s about designing prompts that are sufficiently informative to guide the LLM toward the desired reasoning path, yet simple enough to prevent unnecessary complications. This careful crafting of prompts sets the foundation for the fine-tuning process, ensuring that the LLM not only understands the task at hand but also follows the correct logic to arrive at accurate conclusions.
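To make this concrete, below is a minimal sketch of what such a prompt might look like for a hypothetical NER task. The label set, definitions, and step wording are illustrative assumptions, not a prescribed format:

```python
# A minimal sketch of a structured prompt builder for a hypothetical NER task,
# pairing explicit label definitions with step-by-step (Chain of Thought)
# instructions. The labels and steps below are illustrative assumptions.

def build_ner_prompt(text: str) -> str:
    """Compose a prompt combining definitions, CoT steps, and the passage."""
    definitions = (
        "Label definitions:\n"
        "- ORG: a company, agency, or institution\n"
        "- PER: a named person\n"
        "- LOC: a named place\n"
    )
    steps = (
        "Follow these steps:\n"
        "1. Read the passage once for overall meaning.\n"
        "2. List each candidate entity span.\n"
        "3. Assign exactly one label per span using the definitions above.\n"
        "4. Return the result as a JSON list of {span, label} objects.\n"
    )
    return f"{definitions}\n{steps}\nPassage:\n{text}"

print(build_ner_prompt("Abeyon partnered with NAVSEA in Virginia."))
```

A prompt built this way doubles as a diagnostic: if the model ignores a definition or skips a step, that step becomes a candidate training point for fine-tuning.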

Recognizing the Limitations of Prompts and Transitioning to Fine-Tuning

It’s important to clarify a common misconception regarding fine-tuning. While some references might suggest that fine-tuning is primarily about adjusting the output format, its role is far more extensive. Fine-tuning can be employed to train the LLM to follow a specific logic path, particularly when the original model doesn’t adhere to sound reasoning. Additionally, fine-tuning helps the model interpret material retrieved through Retrieval-Augmented Generation (RAG) or other instructions more accurately. This feature of fine-tuning helped Abeyon meet client needs in a project where our LLM was required to incorporate special, organization-specific definitions of a particular set of terms that often differed from their typical colloquial meanings. As conflicts between the respective definitions of these terms were identified, our developers faced a unique challenge: the LLM essentially had to unlearn the conventional English usage upon which the technology was initially built in order to adapt to the specific needs and context of our client. This example illustrates a scenario in which the LLM would make incorrect connections or conclusions when referring to its retrieved material, demonstrating a clear need for further training.

Another significant benefit of fine-tuning is the improvement it brings in both accuracy and response time. By refining the model’s understanding and processing capabilities, fine-tuning ensures that the LLM not only provides more accurate responses but also does so more swiftly. This enhancement is particularly vital in applications where timely and precise information delivery is critical.

After recognizing the limitations of prompt engineering and transitioning to fine-tuning, the next critical step involves a targeted approach to each identified training point. This process is meticulous and requires a strategic approach to ensure the Large Language Model (LLM) is accurately fine-tuned for the specific tasks at hand.

  1. Tuning on Each Identified Training Point: The fine-tuning process should systematically address each training point identified during the prompt testing phase. This means that every instance where the LLM’s response was not accurate, consistent, or logical needs to be revisited. Fine-tuning at this stage is not a blanket approach; it’s about pinpointing specific areas of improvement and addressing them individually.
  2. Creating Training Data: For each of these identified points, there may be a need to modify the LLM’s existing response text or create new responses. This tailored data is crucial, as it will directly address the gaps or inaccuracies revealed during prompt testing. The data must be representative of the scenarios the LLM struggled with, ensuring that the model learns from relevant and contextual examples.
  3. Developing Test Cases for Each Training Point: Alongside modifying training data, it’s equally important to develop specific test cases for each training point. These test cases will serve as benchmarks to evaluate the LLM’s learning and adaptation post-fine-tuning. They should be designed to rigorously test the LLM’s ability to handle previously challenging tasks and scenarios, providing a clear measure of the effectiveness of the fine-tuning process.

Only a few training and testing examples are needed for each training point. This stage of the development process is crucial for refining the LLM’s abilities. By focusing on each training point with tailored data and test cases, the fine-tuning process becomes more precise and effective, leading to a model that not only performs better in terms of accuracy and response time but also demonstrates enhanced understanding and logical reasoning in line with the specific requirements of the task.
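As a sketch, one identified training point might be packaged as a handful of prompt/completion pairs plus matching test cases. The JSONL field names here are assumptions, since each fine-tuning API defines its own schema:

```python
# Illustrative sketch: packaging one identified training point as a few
# prompt/completion pairs plus matching test cases. The field names are
# assumptions; real fine-tuning APIs each define their own schema.
import json

training_point = "Interpret 'ME' as 'Main Engine' in marine-engineering text."

examples = [
    {"prompt": "What does 'ME overhaul due' refer to?",
     "completion": "A scheduled overhaul of the Main Engine."},
    {"prompt": "Classify: 'ME vibration exceeds limits.'",
     "completion": "Main Engine fault report."},
]

test_cases = [
    {"input": "ME temperature alarm", "must_contain": "Main Engine"},
]

# Serialize the examples as JSONL, one training record per line.
jsonl = "\n".join(json.dumps(e) for e in examples)
print(jsonl)
```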

After initial prompt engineering and fine-tuning, it’s vital to re-engage in prompt engineering and continuous testing. This iterative process ensures ongoing improvement and adaptation of the LLM. Abeyon employs this philosophy of constant verification and re-evaluation to ensure the LLM’s performance is reliable, accurate, and consistent.

  • Iterative Prompt Refinement: Post-fine-tuning, revisit your prompts. Refine them based on the insights gained from the fine-tuning process. This could involve simplifying prompts, introducing new formats, or making them more specific.
  • Continuous Testing: Regularly test the LLM with new prompts and scenarios. This ongoing testing helps in identifying any lingering issues or new areas where the LLM might struggle. It’s an essential part of ensuring the model remains effective and accurate over time.
  • Feedback Loop: Establish a feedback loop where the results from continuous testing inform further prompt engineering and fine-tuning. This loop helps in constantly adapting and refining the LLM to changing requirements and new challenges.
  • User Interaction Analysis: If possible, analyze how users interact with the LLM. User interactions can provide valuable insights into how well the LLM is performing and highlight areas for further improvement.
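The continuous-testing loop above can be sketched as a small harness that replays stored test cases and reports a pass rate. Here `toy_model` is a stand-in callable, not a real LLM client:

```python
# A minimal sketch of a continuous-testing harness: run each stored test case
# against the model and report a pass rate. The model is a stand-in callable.

def run_test_suite(model, test_cases):
    """Return (pass_rate, failures) for a list of {'input', 'must_contain'} cases."""
    failures = []
    for case in test_cases:
        output = model(case["input"])
        if case["must_contain"].lower() not in output.lower():
            failures.append(case)
    passed = len(test_cases) - len(failures)
    return passed / len(test_cases), failures

# Toy stand-in model that returns a canned answer.
def toy_model(prompt: str) -> str:
    return "The Main Engine requires inspection." if "ME" in prompt else "No finding."

rate, failed = run_test_suite(toy_model, [
    {"input": "ME temperature alarm", "must_contain": "Main Engine"},
    {"input": "Hull coating status", "must_contain": "coating"},
])
print(rate)  # failures feed back into the next round of prompt refinement
```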


The journey from prompt engineering to fine-tuning in specific-task LLMs is a testament to the evolving nature of machine learning. By focusing on accuracy, consistency, and the quality of data and prompts, these models are fine-tuned to deliver precise results in specialized domains. The future of specific-task LLMs in tasks like NER and classification is bright, promising more tailored and effective solutions catered to domain-specific needs.

Creating an Enterprise Knowledge Search using Large Language Models (LLMs)

In today’s environment, enterprises, federal agencies, and departments face challenges in managing vast amounts of internal data and information. Traditional keyword searches or navigation through folder systems are no longer efficient in meeting the demands of modern-day information retrieval. A superior, more robust search system provides advantages such as visibility into the most relevant, up-to-date, and accurate information; improved contextual information; easier knowledge-sharing and collaboration among employees; and help in identifying knowledge gaps and areas for improvement. Implementing a system that provides better visibility into data will improve efficiency and performance when managing information and knowledge.

With the introduction of GPT and the recent popularity of ChatGPT, many are wondering whether this could be the answer to all search and knowledge extraction from a large corpus of enterprise data. Although it is too early to tell, we at Abeyon have explored the possibilities of this technology and how it can be leveraged for internal data search. This is our informed and researched first take on GPT.

With enterprise data, implementing a hybrid of the following approaches is optimal for building a robust search using large language models (like GPT, created by OpenAI):

  • vectorization with large language models (LLMs),
  • fine-tuning of large language models, and
  • semantic search.

Vectorization of Enterprise Data: As a first step in building an enterprise data repository, enterprise data should be vectorized to create a vector repository. Documents must be preprocessed before vectorization in order to comply with the size limits of the LLMs. The vectorized data is then stored in a vector database.
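The preprocessing step can be sketched as a simple chunker that packs sentences into pieces under a size limit. This is a minimal sketch assuming a character-based limit; real LLM limits are token-based:

```python
# Sketch of the preprocessing step before vectorization: split a document into
# chunks that fit a size limit. The character-based limit is an illustrative
# stand-in for a real model's token limit.

def chunk_document(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack sentences into chunks no longer than max_chars."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_document("Alpha. Bravo. Charlie.", max_chars=10))
```

Each chunk would then be embedded and written to the vector database alongside a pointer back to its source document.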

Fine Tuning Large Language Model: LLMs can be fine-tuned to understand domain-specific data. During fine-tuning, the model is trained on the dataset by providing domain-specific questions and corresponding answers, which allows it to learn how to generate appropriate answers for new questions. Once the model is fine-tuned, it can be used to generate answers for new questions by feeding in the question as input and allowing the model to trigger a corresponding answer. This process can be repeated for multiple questions, allowing the model to build a knowledge base of question-and-answer pairs. However, fine-tuning has several pitfalls which we will discuss later in this post.

Semantic Search: Semantic search, also known as neural search or vector search, uses semantic embeddings (numeric vector representations) to capture the context or meaning of a text, unlike traditional keyword searches. Semantic searches attempt to generate the most accurate results possible by understanding the search based on the searcher’s intent, query context, and the relationships between words. This allows new databases to scale and search based on the actual content and context of the records.
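At its core, a semantic search ranks stored embedding vectors by similarity to a query vector. The sketch below uses cosine similarity over toy 3-dimensional vectors; production systems use learned embeddings and approximate-nearest-neighbor indexes:

```python
# Core of a vector search: rank stored embeddings by cosine similarity to a
# query embedding. The 3-dimensional vectors and doc IDs are toy examples.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec, index):
    """index: list of (doc_id, vector). Returns doc_ids, most similar first."""
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked]

index = [("maintenance-log", [0.9, 0.1, 0.0]),
         ("hr-policy",       [0.0, 0.2, 0.9])]
print(search([1.0, 0.0, 0.1], index))  # → ['maintenance-log', 'hr-policy']
```

Note that adding a new document only requires appending its (doc_id, vector) pair to the index, which is the incremental-update advantage over re-running fine-tuning discussed later in this post.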

Search Approach: Combining all of these (vectorization, fine-tuning, and a semantic search) into a search approach will create a more robust search solution. The high-level process involves vectorizing and indexing an enterprise corpus of data with semantic embeddings, using a large language model (LLM) to generate relevant search terms or queries, and using a semantic search engine to find the most relevant documents based on those queries. Once the relevant documents are identified, the LLM can be used to quickly read and summarize the relevant parts of those documents. Finally, relevant information can be compiled together to answer the question.

Why this Approach: Generally, the advantage of using a semantic search over fine-tuning is that it can be more efficient and effective in identifying relevant documents, especially when dealing with complex and nuanced data. However, fine-tuning in some instances may be more accurate in generating answers to specific questions, especially when dealing with highly specialized domains or topics. Some domains may use specific abbreviations or terms that have a different meaning than within a general context. Fine-tuning a large language model (LLM) on a specific domain can help it understand these specialized terms and improve its accuracy in generating answers related to that domain. For example, in a set of documents related to the marine engineering industry, the term “ME” may stand for “Main Engine” rather than the usual personal pronoun “me.” If we do not fine-tune the LLM on this specialized domain, it may generate inaccurate responses by misinterpreting the meaning of ME as this personal pronoun because of its typical usage. By fine-tuning the model on this specific domain, and training it to understand the specialized meaning of “ME,” we can improve its accuracy in generating responses related to that industry.

Potential Barriers and Advantages: It is worth noting that fine-tuning a language model for specialized domains or topics can result in a loss of generalization ability, meaning that the model may not perform as well on general language tasks outside of its specific domain or topic. Nonetheless, fine-tuning remains an effective approach for improving the accuracy of language models in specialized domains or topics where specific language patterns and meanings are used.

One major issue with fine-tuning is that it does not rule out confabulation or hallucination. Fine-tuned models also lack a theory of knowledge or epistemology. They cannot explain what they know or why they know it, and are therefore unreliable as a sole source of information. Most artificial intelligence (AI) research focuses on developing larger and more powerful models rather than investing in cognitive architecture or neuroscience. Creating a single model that understands what it does and does not know is fairly complex as AI technology stands today.

Fine-tuning a large language model (LLM) like GPT-3 can be a complex, resource-intensive and expensive process, especially when dealing with specialized domains or tasks. This is due to the numerous parameters in the model, which can make fine-tuning expensive in terms of computational resources.

In addition to the cost, fine-tuning large language models can be time-consuming. Validating the accuracy of the fine-tuned model can require a significant amount of time and effort, and may involve a trial-and-error process of adjusting hyperparameters and other configurations.

When fine-tuning language models, adding new documents to a knowledge base that has already been fine-tuned requires re-training the entire model. This can be a lengthy and resource-intensive process, especially when dealing with large datasets.

In contrast, with semantic search, the addition of new documents to a knowledge base is typically a more efficient process that does not require re-training the entire model. Instead, the semantic embeddings of the new documents can be added directly to the existing database, which can then be searched using semantic similarity metrics.

Our Recommendation: Fine-tuning large language models (LLMs) and semantic search have advantages and pitfalls. Creating a hybrid solution that leverages the benefits of these technologies and customizing the solutions with contextual data and use cases will yield results that are worth considering.


Contact us at [email protected] if you are interested in learning more. Want to learn more about AI concepts? Click here to see our Insights series


Intelligent Process Automation – Why is it important?

Intelligent Process Automation is the combination of different technologies to automate more complete, end-to-end business processes

Government agencies and enterprises have been performing business process improvements for the past several years to improve efficiency within their internal processes, reduce waste in internal workflows, and streamline business functions, enabling human workers to perform tasks more efficiently. However, this has also left humans performing high volumes of repetitive, mundane tasks, leading to burnout and increasing the risk of human error.

Today, advances in technology have enabled the automation of human tasks, especially those that are repetitive and mundane, as well as those that involve a certain level of cognitive functionality (i.e., “decision-making”). This has enabled government agencies and enterprises to enter the next level of efficiency by automating functions using methods ranging from simple workflow automation to complex intelligent process automation (IPA). IPA holds the promise of expanding automation capabilities to additional, more complex workflows and functions that utilize both structured and unstructured data. IPA combines Cognitive/Artificial Intelligence (AI) and Robotic Process Automation (RPA): the AI provides the intelligence (the “brains”) of the process, while RPA provides the processing (the “hands”) of functional workflows. Although RPA is fundamentally rule-based with limited capabilities, it enables IPA to yield high returns and better business outcomes for processes that involve well-defined rules, are repetitive, require access to multiple systems, have manual steps following Standard Operating Procedures, and have a high possibility of human error.

[Image: Intelligent Process Automation. Image reference dated October 30, 2019.]

However, most government agencies and enterprises are currently implementing only RPA-based automation in an attempt to improve workflows and functions that require fewer cognitive decision-making steps, and are witnessing only moderate success. The low-hanging fruit for intelligent automation is data-intensive and repetitive tasks that machines can do better and faster than humans. Many applications will keep a “human in the loop,” at least until the system has proven its reliability. Even then, AI systems by their nature learn (in both supervised and unsupervised ways) and therefore change, so they need to be reviewed on an ongoing basis to ensure they are still performing as intended.

These limitations have led to greater adoption of IPA-based solutions across agencies and enterprises, which has dramatically improved organizational efficiency, reduced costs, and increased customer satisfaction levels. AI-enhanced automation can significantly expand the scope of process automation to new and exciting areas that were previously considered too complex for consideration. For instance, a primary objective of IPA is to provide the human workforce with additional knowledge, support, and insights by automating repetitive, manually intensive, and otherwise mundane tasks. With IPA, organizations can amplify human potential and move employees from low-value work to high-value work. As an example, Abeyon is employing IPA to “read” and analyze voluminous quantities of unstructured documents to extract datasets and utilize them for further analysis.

For agencies and enterprises to adopt IPA into their businesses, a well-thought-out long-term plan is needed, since the ROI on IPA work takes longer to materialize than that of simple automation. Because AI models need to be trained on large datasets, a significant initial investment is required to realize long-term value. This involves creating a very clear picture of the costs and benefits of IPA efforts. Abeyon has worked with several government agencies and enterprises to realize the power of AI and how IPA can greatly increase the value of automation.



Bring clarity to unstructured data using Natural Language Processing (NLP) – Part 2

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language, in particular how to program computers to process and analyze large amounts of natural language data.

This is Part 2 of our introduction to key capabilities of NLP technologies. Here is the link to Part 1 of this article. With recent advances in artificial intelligence technologies, computers have become very adept at reading, understanding, and interpreting human language. Here are a few additional NLP capabilities that have made that possible.

Text Clustering: 
Clustering in general refers to grouping similar data together. Text clustering is a technique used to group texts or documents based on similarities in content. It can be used to group similar documents (such as news articles, tweets, and social media posts), analyze them, and discover important but hidden subjects. Text classification, as we discussed before, also puts objects into groups, but the major difference between clustering and classification is that classification is a supervised method whereas clustering is an unsupervised method: the objects/data are new, and the resultant groups are not known in advance. This method is heavily used to identify key topics and patterns in large data sets as a first step toward classification.
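A toy flavor of the idea: group documents whose word overlap (Jaccard similarity) crosses a threshold. No labels are supplied in advance; the groups emerge from the data, which is the unsupervised property described above. The threshold and documents are illustrative:

```python
# Toy text clustering: group documents whose word-set overlap (Jaccard
# similarity) crosses a threshold. Real systems use embeddings and algorithms
# like k-means; this only illustrates the unsupervised grouping idea.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.3):
    clusters = []  # each cluster: list of (doc, word_set) pairs
    for doc in docs:
        words = set(doc.lower().split())
        for group in clusters:
            # Compare against the first member of each existing cluster.
            if jaccard(words, group[0][1]) >= threshold:
                group.append((doc, words))
                break
        else:
            clusters.append([(doc, words)])  # no match: start a new cluster
    return [[doc for doc, _ in group] for group in clusters]

docs = ["engine oil pressure low",
        "oil pressure alarm engine",
        "annual leave policy update"]
print(cluster(docs))
```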

[Image: Text Clustering in NLP]

Text Summarization: 
Text Summarization refers to the technique of producing a concise summary of long pieces of text while preserving key information, content, and overall meaning. There are many reasons and uses for a summary of a larger document. One example that might come readily to mind is a concise summary of a long news article, but there are many more cases of text summaries that we come across every day. Two different approaches are used for text summarization: extractive summarization and abstractive summarization. Extractive summarization identifies the important sentences or phrases in the original text and extracts only those. Abstractive summarization generates new sentences from the original text.
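The extractive idea can be sketched with a frequency-based scorer that keeps the top sentences; real summarizers use far richer features than this:

```python
# Toy extractive summarizer: score each sentence by the document-wide
# frequency of its words and keep the top-scoring sentences. This only
# illustrates the extractive approach, not a production method.
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 1) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Document-wide word frequencies (lowercased).
    words = Counter(w.lower() for s in sentences for w in s.split())
    scored = sorted(sentences,
                    key=lambda s: sum(words[w.lower()] for w in s.split()),
                    reverse=True)
    chosen = scored[:num_sentences]
    # Emit the chosen sentences in their original order.
    return ". ".join(s for s in sentences if s in chosen) + "."

print(extractive_summary(
    "The engine failed. The engine was repaired. Lunch was served."
))  # → The engine was repaired.
```

An abstractive summarizer, by contrast, would generate new wording rather than select existing sentences, which is why it generally requires a generative model.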

[Image: Extractive Text Summarization]

[Image: Abstractive Text Summarization]



Relation Extraction:
Relationship extraction refers to the technique of extracting semantic relationships from a text. Relationship extraction produces attributes and relations for entities in a sentence. For example, given the sentence “John was born in Fairfax, Virginia,” a relation classifier aims at predicting the relation “bornInCity.” Relation extraction is the key component for building relational knowledge graphs, and it is of crucial significance to natural language processing applications such as structured search, sentiment analysis, question answering, and summarization.
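The “bornInCity” example can be sketched with a single pattern; real relation extractors are learned classifiers, not regexes, so this is only an illustration of the input/output shape:

```python
# Minimal pattern-based sketch of the "bornInCity" example: extract
# (person, city) pairs from sentences of the form "X was born in Y".
import re

BORN_IN = re.compile(r"([A-Z][a-z]+) was born in ([A-Z][a-z]+(?:, [A-Z][a-z]+)?)")

def extract_born_in(sentence: str):
    """Return a list of bornInCity relation records found in the sentence."""
    return [{"person": m.group(1), "city": m.group(2), "relation": "bornInCity"}
            for m in BORN_IN.finditer(sentence)]

print(extract_born_in("John was born in Fairfax, Virginia"))
# → [{'person': 'John', 'city': 'Fairfax, Virginia', 'relation': 'bornInCity'}]
```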

These techniques, combined with the techniques discussed in Part 1, can provide tools to create value from the deluge of unstructured data found within government agencies and other organizations. There is much to be learned from the potential of AI and, in particular, its ability to analyze masses of unstructured data.




How to measure an AI model’s performance – F1 score explained

Organizations often ask us, “How well is the AI model doing?” or “How do I measure its performance?” We often respond that the performance of the AI model is based on its F1 score, at which point we get puzzled looks or the follow-up question, “What is an F1 score?” So here I am going to attempt to explain the F1 score in an easily understandable way.

Definition of F1 score:

F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision (p) and the recall (r) of the test to compute the score (per Wikipedia).

Accuracy is how most people tend to think about measuring performance (e.g., how accurately is the model predicting?). But accuracy is not a complete measure of an AI model’s performance. Accuracy only measures the number of correctly predicted values among the total predicted values, and it breaks down when the cost of false negatives is high. For example, assume we are using an AI model to detect cancer cells. After training, the model is fed 100 samples that have cancer, and it identifies 90 of them as having cancer. That is 90% accuracy, which sounds pretty high, but failing to identify those 10 samples is very costly. Therefore, accuracy is not always the best measure.

So to explain it further, let’s consider this table (a confusion matrix):

                        Predicted: Positive     Predicted: Negative
  Actual: Positive      True Positive (TP)      False Negative (FN)
  Actual: Negative      False Positive (FP)     True Negative (TN)


True Positive is an outcome where the model correctly predicts the positive class. Ex: when cancer is present and the model predicts cancer.

False Positive is an outcome where the model incorrectly predicts the positive class. Ex: when cancer is not present and the model predicts cancer.

False Negative is an outcome where the model incorrectly predicts the negative class. Ex: when cancer is present and the model predicts no cancer.

True Negative is an outcome where the model correctly predicts the negative class. Ex: when cancer is not present and the model predicts no cancer.

As explained by the definition, the F1 score is a combination of Precision and Recall.

Precision is the number of True Positives divided by the number of True Positives and False Positives. Precision can be thought of as a measure of exactness. Therefore, low precision will indicate a large number of False Positives.

Recall is the number of True Positives divided by the number of True Positives and the number of False Negatives. Recall can be thought of as a measure of completeness. Therefore, low recall indicates a large number of False Negatives.

Now, F1 score is the harmonic mean of Precision and Recall and gives a much better measure of the model.

F1 Score = 2*((precision*recall)/(precision+recall)).

A good F1 score means that you have low false positives and low false negatives. Accuracy is used when the True Positives and True Negatives are more important, while the F1 score is used when the False Negatives and False Positives are crucial.
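The definitions above translate directly into code. This sketch computes precision, recall, and F1 from raw counts; the false-positive count used for the cancer example is an assumed figure added for illustration:

```python
# Precision, recall, and F1 computed from raw confusion-matrix counts,
# following the formulas above.

def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)   # exactness: penalized by false positives
    recall = tp / (tp + fn)      # completeness: penalized by false negatives
    return 2 * (precision * recall) / (precision + recall)

# The cancer example: 100 true cases, 90 caught (tp=90, fn=10).
# The 5 false alarms (fp=5) are an assumed figure for illustration.
print(round(f1_score(tp=90, fp=5, fn=10), 3))  # → 0.923
```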



Bring clarity to unstructured data using Natural Language Processing (NLP) – Part 1

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language, in particular how to program computers to process and analyze large amounts of natural language data.

In my previous articles, I have addressed some specific topics in NLP like Text Classification, Natural Language Search, etc. Here I want to give a quick introduction to a few key technical capabilities of Natural Language Processing. With recent advances in artificial intelligence technologies, computers have become very adept at reading, understanding, and interpreting human language. Let’s look at a few key capabilities of NLP. This is by no means a comprehensive list of all NLP capabilities.


Named Entity Recognition (NER):
NER is one of the first steps toward information extraction from large unstructured data. NER seeks to locate named entities in a text and sort them into pre-defined categories like persons, countries, organizations, etc. This helps with answering many questions such as:
– How many mentions of an organization are in this article?
– Were there any specific products mentioned in a customer review?

This technology enables organizations to extract individual entities from documents, social media, knowledge bases, etc. The better defined and trained the ontologies are, the better the outcome will be.
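As a toy illustration of the “how many mentions” question above, a dictionary lookup can stand in for a trained NER model; the organization list here is an assumption:

```python
# Toy dictionary-based NER sketch for the "how many mentions" question.
# The entity list is an illustrative stand-in for a trained model's ontology.
ORGANIZATIONS = {"Abeyon", "NAVSEA", "OpenAI"}

def count_org_mentions(text: str) -> dict:
    """Count mentions of known organizations in a text."""
    counts = {}
    # Crude tokenization: strip commas/periods, split on whitespace.
    for token in text.replace(",", " ").replace(".", " ").split():
        if token in ORGANIZATIONS:
            counts[token] = counts.get(token, 0) + 1
    return counts

print(count_org_mentions("Abeyon worked with NAVSEA. Abeyon delivered."))
# → {'Abeyon': 2, 'NAVSEA': 1}
```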


Topic Modeling:
Topic Modeling is a type of statistical modeling for discovering abstract topics in a large document set. It is frequently used to discover hidden semantic structures in a textual body. It differs from traditional classification in that it is an unsupervised method of extracting main topics. This technique is used in the initial exploration phase to find the common topics in the data. Once you discover the topics, you can use the language in those topics to create categories. One of the popular methods used for topic modeling is Latent Dirichlet Allocation (LDA). LDA builds a topic-per-document model and a words-per-topic model, modeled as Dirichlet distributions.


Text Classification:
Text classification (a.k.a. text categorization or text tagging) is the task of assigning a set of predefined categories to free text. This is a supervised training methodology, as opposed to the unsupervised Topic Modeling above. I have written about text classification in detail in an earlier article.


Information Extraction:
Information Extraction is used to automatically find meaningful information in unstructured text. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extract abstract knowledge from a text corpus or to extract concrete data from a set of documents, which can then be further analyzed with traditional data-mining techniques to discover more general patterns.


Sentiment Analysis:
Sentiment analysis is the automated process of understanding an opinion about a given subject from written or spoken language. Sentiment analysis decodes the meaning behind human language, allowing organizations to analyze and interpret comments on social media platforms, documents, news articles, websites, and other venues for public comment.
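As a minimal sketch, sentiment can be approximated with a word lexicon; the tiny word lists here are hypothetical stand-ins for the rich lexicons and trained models used in practice:

```python
# Toy lexicon-based sentiment scorer -- illustration only.
POSITIVE = {"great", "excellent", "love", "good", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "angry"}

def sentiment(text):
    """Return 'positive', 'negative', or 'neutral' by lexicon word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The service was excellent and I love the new design"))  # positive
```

Lexicon counting misses negation and sarcasm ("not bad at all"), which is why production sentiment analysis relies on trained models.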


Within government agencies and organizations, there is a deluge of unstructured data in both analog and digital form. NLP provides the tools needed to move the needle forward in gaining better visibility and knowledge from unstructured data. NLP can be utilized in many ways, to name a few: analyzing public data such as social media posts, reviews, and comments; gaining visibility into the organizational knowledge base; providing predictive capabilities; and enhancing citizen services. There is much to be learned from the potential of AI and, in particular, its ability to analyze masses of unstructured data. It is time for agencies and organizations to take action and harness the power of NLP to stay ahead.

Text Classification: Binary to Multi-label Multi-class classification

Unstructured data in the form of text is everywhere: emails, web pages, social media, survey responses, domain data, and more. While textual data is very enriching, gaining insights from it is complex, and classifying text manually can be hard and time-consuming. For businesses to make intelligent data-driven decisions, understanding the insights in text in a fast and reliable way is essential. Artificial Intelligence makes that possible with Natural Language Processing (NLP) and text classification. The capability to automatically classify text into one or more categories has seen tremendous improvements in recent years. Gone are the days of manually tagging textual data, which can be laborious, time-consuming, inconsistent, and expensive.

So let’s look at a few types of text classification in AI.

Binary classification: As the name suggests, this is the process of assigning a single Boolean label to textual data. Example: reviewing an email and classifying it as legitimate or spam.

AI Binary Classification

Multi-class classification: Multi-class classification involves reviewing textual data and assigning it a label from a set of more than two classes; when each item can receive one (single-label) or more (multi-label) labels, it becomes a multi-label problem. The complexity of the problem increases as the number of classes increases. Let’s take the example of assigning genres to movies. Each movie is assigned one or more genres from a list of movie genres (Drama, Action, Comedy, Horror, etc.). This is a multi-label, multi-class classification problem with a manageable set of labels.

AI Multi-class classification
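The multi-label setup above can be sketched with scikit-learn’s one-vs-rest strategy over a binary indicator matrix; the synopses and genres below are hypothetical toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical movie synopses with one or more genres each.
synopses = [
    "a detective hunts a killer in the rain",
    "explosions and car chases across the city",
    "a killer stalks teens and explosions follow the chase",
    "two friends joke their way through an awkward wedding",
]
genres = [["Thriller"], ["Action"], ["Thriller", "Action"], ["Comedy"]]

# Encode the label sets as a binary indicator matrix, one column per genre.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)

# One binary classifier per genre (one-vs-rest) over TF-IDF features.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(synopses, y)

pred = model.predict(["a detective and a killer"])
print(mlb.inverse_transform(pred))
```

The one-classifier-per-label design is exactly what stops scaling gracefully once the label set grows into the thousands, motivating the X-BERT approach below.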


Now imagine a classification problem where a specific item needs to be classified across a very large category set (10,000+ categories). The problem becomes exponentially more difficult. Here is where eXtreme Multi-Label Text Classification with BERT (X-BERT) comes into play. If you want to learn more about Google’s NLP framework BERT, click here.

X-BERT aims to tag each input text with the most relevant labels from an extremely large label set.

Here are a few examples of extreme multi-label, multi-class classification: Classifying a retail product into product categories. There are hundreds of thousands of product categories, and classifying a single product based on its description constitutes a multi-label (specific product categories), multi-class (broader product category) example.

Displaying sponsored content based on user search queries. Users can phrase a search query in thousands of ways, and classifying those inputs in order to display a specific ad under sponsored results is another extremely large classification example.

AI-based Multi-label classification

In the work we do for the US Navy, we tackle a similar problem of identifying a single equipment name and ID from a list of equipment names across ships. The need is to find the right equipment from a list of 50,000+ items with more than 90% accuracy. We utilized an X-BERT model connected to an additional dense layer and a softmax layer, and conducted fine-tuning training to identify the equipment. This, combined with subject matter expert validation and verification, helped the machine get better over time at identifying the equipment.

Extremely Large Multi-class classification X-BERT
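The classification head described above, a dense layer followed by a softmax over the label set, can be sketched in plain Python; the dimensions and random weights are purely illustrative, and in the real pipeline the input embedding comes from BERT and the weights are learned during fine-tuning:

```python
import math
import random

random.seed(0)

def dense(x, weights, bias):
    """One fully connected layer: logits[j] = sum_i x[i]*W[i][j] + b[j]."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weights), bias)]

def softmax(logits):
    """Convert logits into a probability distribution over the labels."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative sizes: a 4-dim "embedding" scored against 5 labels.
embedding = [0.2, -0.1, 0.4, 0.05]
W = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(4)]  # 4x5 weights
b = [0.0] * 5

probs = softmax(dense(embedding, W, b))
best = max(range(5), key=lambda j: probs[j])
print(best, round(sum(probs), 6))  # predicted label index; probabilities sum to 1
```

With a 50,000+ label set, the same head simply has 50,000+ output columns, and the softmax still yields one probability per candidate equipment item.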

As shown in the examples above, with the right methodology and training data, unstructured text can be categorized automatically using AI NLP technology. Employing AI-based auto-classification makes classification more effective and efficient.

Transfer Learning

Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task.

In transfer learning, we leverage prior knowledge from one domain in a different domain. This is done by removing the last output layer and creating a new set of neural network layers for the new problem. These new layers are then trained using the new dataset.

For example, say you have an AI model that recognizes cats, and you want to use that knowledge to recognize elephants. The cat model is created by training on pictures of cats (plentiful on the internet). Once the model recognizes cats with high accuracy, the last layer of the neural network is replaced with new layers, and those layers are trained on pictures of elephants. This works because many low-level features, like detecting edges and curves, have already been learned from the large dataset (in this case, cats), so the newer model can be trained to recognize elephant-specific features with far less data, as shown in the figure below.
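As a toy illustration of this idea, the sketch below "freezes" a hand-written feature extractor (standing in for the pretrained layers) and trains only a new final layer on a tiny hypothetical dataset:

```python
# "Pretrained" feature extractor: in a real model these low-level features
# (edges, curves, etc.) come from training on the large source dataset.
# Here it is a fixed, hand-written function that stays frozen throughout.
def features(x):
    return [x[0] + x[1], x[0] - x[1], x[0] * x[1]]

# New task head: a single linear layer trained from scratch on the new data.
w = [0.0, 0.0, 0.0]
bias = 0.0

def predict(x):
    f = features(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + bias > 0 else 0

# Tiny hypothetical "new domain" dataset: (input, label) pairs.
data = [([1.0, 1.0], 1), ([2.0, 1.5], 1), ([-1.0, -1.0], 0), ([-2.0, -0.5], 0)]

# Perceptron updates touch ONLY the new head; the feature extractor is frozen.
for _ in range(20):
    for x, y in data:
        err = y - predict(x)
        if err:
            f = features(x)
            w = [wi + 0.1 * err * fi for wi, fi in zip(w, f)]
            bias += 0.1 * err

print([predict(x) for x, _ in data])  # [1, 1, 0, 0]
```

Because only the small head is trained, far fewer examples are needed than training the whole model from scratch, which is the practical payoff of transfer learning.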


Most of the success today in achieving high accuracy in AI models has been driven by extensive supervised learning, which relies on large amounts of labeled data. For simple use cases, large labeled public datasets are available from various online sources (e.g., ImageNet, WordNet), but if you are building a model for a specific domain solution, large amounts of labeled data are hard to obtain, and data may need to be cleaned and labeled manually. Transfer learning enables you to develop fairly accurate models using comparatively little data. This is very useful for enterprises that might not have much clean, labeled data.

Therefore, on problems where you do not have much data, transfer learning enables you to develop capable models that you simply could not build otherwise.


Knowledge Integration in AI

Let’s think about how humans learn. We are very good at continuously enriching and refining our knowledge and skills by seamlessly combining existing knowledge with new experiences, and we exhibit a wide spectrum of learning abilities across fields. We can be lawyers during the day, play tennis or go for a run in the evening, and make dinner at night. We are fairly adept at doing multiple tasks. AI systems are usually not like that: they are very good at doing one specific task through machine learning, an approach also called Narrow Intelligence.

Despite recent breakthroughs and advances, machine learning has a number of shortcomings when it comes to obtaining knowledge across fields and to identifying how new and prior knowledge interact to yield more insight. Knowledge integration is the process of synthesizing multiple knowledge representations into a common model. It describes how new information and existing information interact, what effects the new information will have on existing knowledge, and whether existing information needs to be modified to accommodate the new.

Why is this concept important? Because it enables better machine learning models for enterprise knowledge insights. Not all knowledge will be readily available or can be fed into the machine learning model at once: substantial knowledge bases are developed incrementally, and a growing body of knowledge will need to be added separately. By identifying subtle conflicts and gaps in knowledge, knowledge integration (KI) facilitates better learning models. Large firms like Google are using a combination of Symbolic AI, deep learning, and supervised learning to create better knowledge understanding and knowledge reasoning.

If you are an organization looking to extract valuable information and identify patterns within your data to create efficiency, these concepts are critical, and I highly recommend researching them further to achieve success.