Creating an Enterprise Knowledge Search using Large Language Models (LLMs)

In today’s environment, enterprises, federal agencies, and departments face challenges in managing vast amounts of internal data and information. Traditional keyword searches and navigation through folder systems no longer meet the demands of modern information retrieval. A more robust search system offers several advantages: visibility into the most relevant, up-to-date, and accurate information; richer contextual information; easier knowledge-sharing and collaboration among employees; and help in identifying knowledge gaps and areas for improvement. Better visibility into data in turn improves efficiency and performance in managing information and knowledge.

With the introduction of GPT and the recent popularity of ChatGPT, many are wondering if this could be the end to all search and knowledge extraction from a large corpus of enterprise data. Although it is too early to tell, we at Abeyon have explored the possibilities of this technology and how it can be leveraged for internal data search. This is our informed and researched first take on GPT.

With enterprise data, we find that a hybrid of the following approaches works best for building a robust search system on top of large language models (such as the GPT models created by OpenAI):

  • vectorization with large language models (LLMs),
  • fine-tuning of large language models, and
  • semantic search.

Vectorization of Enterprise Data: As a first step in building an enterprise data repository, enterprise data should be vectorized, that is, converted into numerical embeddings, to create a vector repository. Documents must be preprocessed before vectorization, typically by splitting them into smaller chunks, so that each piece fits within the input size limit of the LLM. The resulting vectors are then stored in a vector database (e.g., Pinecone or Milvus).
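
To make the preprocessing step concrete, here is a minimal sketch of chunking documents and building a small vector repository. It is an illustration under stated assumptions, not a production pipeline: toy_embed is a hypothetical stand-in for a real embedding model (it only hashes words into a fixed-size vector), the word-based size limit stands in for a token limit, and a plain Python list stands in for a vector database.

```python
import hashlib
import math


def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word windows so each piece fits a size limit."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dim=64):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_repository(documents, max_words=50, overlap=10):
    """Chunk every document and store one record (id, text, vector) per chunk."""
    repository = []
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_text(text, max_words, overlap)):
            repository.append(
                {"id": f"{doc_id}#{i}", "text": chunk, "vector": toy_embed(chunk)}
            )
    return repository
```

The overlap between neighboring chunks is a design choice: it reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary. In a real system, the list would be replaced by writes to the chosen vector database and toy_embed by calls to an actual embedding model.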

Fine-Tuning Large Language Models: LLMs can be fine-tuned to understand domain-specific data. During fine-tuning, the model is trained on domain-specific questions paired with their answers, which allows it to learn how to generate appropriate answers for new questions. Once fine-tuned, the model answers a new question by taking the question as input and generating a corresponding answer. Repeating this process over many questions builds up a knowledge base of question-and-answer pairs. However, fine-tuning has several pitfalls, which we discuss later in this post.
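
As a rough illustration of what the training data for this step can look like, the sketch below turns question-and-answer pairs into JSON Lines, one example per line. The field names "prompt" and "completion" are an assumption made for this example; the exact schema depends on the model provider, so check its documentation before uploading anything. The sample pair is illustrative data only.

```python
import json


def to_jsonl(qa_pairs):
    """Serialize (question, answer) pairs as one JSON object per line."""
    lines = []
    for question, answer in qa_pairs:
        q, a = question.strip(), answer.strip()
        if not q or not a:
            continue  # skip incomplete pairs rather than train on empty text
        # "prompt" / "completion" are assumed field names, not a fixed standard.
        lines.append(json.dumps({"prompt": q, "completion": a}, ensure_ascii=False))
    return "\n".join(lines)


# Illustrative domain-specific pairs (made up for this sketch).
pairs = [
    ("In our maintenance manuals, what does ME stand for?", "Main Engine"),
    ("", "This pair has no question and is dropped."),
]
training_data = to_jsonl(pairs)
```

Filtering out incomplete pairs before training is a small step, but the quality of the question-and-answer set largely determines what the fine-tuned model learns.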

Semantic Search: Semantic search, also known as neural search or vector search, represents the context or meaning of a text as a numerical embedding (a vector), unlike traditional keyword search, which matches literal terms. Semantic search attempts to return the most accurate results possible by taking into account the searcher’s intent, the query context, and the relationships between words. This allows vector databases to scale and to search based on the actual content and context of the records.
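
The core operation behind this kind of search can be sketched in a few lines: compare the query vector with every stored vector and rank by cosine similarity. The three-dimensional vectors and document names below are made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and a vector database would use an approximate index rather than this exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hand-made vectors standing in for real embeddings (assumption for this sketch).
index = {
    "engine-maintenance": [0.9, 0.1, 0.0],
    "hr-leave-policy": [0.0, 0.2, 0.9],
    "fuel-system": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]
results = top_k(query, index, k=2)
```

Because the ranking depends only on how close vectors are, a query can match a document that shares no literal keywords with it, as long as the embedding model places them near each other.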

Search Approach: Combining all of these (vectorization, fine-tuning, and semantic search) into a single search approach creates a more robust search solution. The high-level process involves vectorizing and indexing an enterprise corpus of data with semantic embeddings, using a large language model (LLM) to generate relevant search terms or queries, and using a semantic search engine to find the most relevant documents for those queries. Once the relevant documents are identified, the LLM can be used to quickly read and summarize the relevant parts of those documents. Finally, the relevant information is compiled into an answer to the question.
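
The flow described above can be sketched as a small orchestration function. Everything that would call an LLM or a search engine is passed in as a function, and the three stub_ functions below are hypothetical placeholders so that the sketch runs on its own: the query generator simply echoes the question, the search ranks by shared words instead of embeddings, and the summarizer keeps the first sentence. The corpus is made up for illustration.

```python
def answer_question(question, generate_queries, search, summarize, top_k=2):
    """Generate queries, retrieve documents, summarize each, and compile the result."""
    found = {}  # doc_id -> text, insertion-ordered and de-duplicated
    for query in generate_queries(question):
        for doc_id, text in search(query, top_k):
            found.setdefault(doc_id, text)
    summaries = [f"[{doc_id}] {summarize(question, text)}" for doc_id, text in found.items()]
    return "\n".join(summaries)


CORPUS = {
    "eng-01": "The main engine requires an oil change every 500 hours. Log each change.",
    "hr-07": "Annual leave requests go through the HR portal. Approval takes two days.",
}


def stub_generate_queries(question):
    # Placeholder for an LLM call that rewrites the question into search queries.
    return [question]


def stub_search(query, top_k):
    # Placeholder for the semantic search engine: ranks by shared lowercase words.
    q = set(query.lower().split())
    ranked = sorted(
        CORPUS.items(),
        key=lambda item: len(q & set(item[1].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def stub_summarize(question, text):
    # Placeholder for an LLM summary: keeps only the first sentence.
    return text.split(". ")[0].rstrip(".") + "."


answer = answer_question(
    "How often does the main engine need an oil change",
    stub_generate_queries, stub_search, stub_summarize, top_k=1,
)
```

Keeping the document identifier next to each summary is deliberate: it lets the reader trace every part of the compiled answer back to its source document.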

Why this Approach: Generally, the advantage of semantic search over fine-tuning is that it can identify relevant documents more efficiently and effectively, especially when dealing with complex and nuanced data. However, fine-tuning may in some instances generate more accurate answers to specific questions, especially in highly specialized domains or topics. Some domains use abbreviations or terms whose meaning differs from their meaning in a general context, and fine-tuning a large language model (LLM) on such a domain can help it understand these terms. For example, in a set of documents from the marine engineering industry, “ME” may stand for “Main Engine” rather than the personal pronoun “me.” Without fine-tuning on this domain, the LLM may misread “ME” as the pronoun, because that is its typical usage, and generate inaccurate responses. Training the model on the specialized meaning of “ME” improves the accuracy of its responses for that industry.

Potential Barriers and Advantages: It is worth noting that fine-tuning a language model for specialized domains or topics can result in a loss of generalization ability, meaning that the model may not perform as well on general language tasks outside of its specific domain or topic. Nonetheless, fine-tuning remains an effective approach for improving the accuracy of language models in specialized domains or topics where specific language patterns and meanings are used.

One major issue with fine-tuning is that it does not rule out confabulation or hallucination. Fine-tuned models also lack a theory of knowledge, or epistemology: they cannot explain what they know or why they know it, and are therefore unreliable as a source of information. Most artificial intelligence (AI) research focuses on developing larger and more powerful models rather than investing in cognitive architecture or neuroscience. Creating a single model that understands what it does and does not know remains difficult with AI technology as it stands today.

Fine-tuning a large language model (LLM) like GPT-3 can be a complex and resource-intensive process, especially when dealing with specialized domains or tasks. The model has a very large number of parameters, which makes fine-tuning expensive in terms of computational resources.

In addition to the cost, fine-tuning large language models can be time-consuming. Validating the accuracy of the fine-tuned model can require a significant amount of time and effort, and may involve a trial-and-error process of adjusting hyperparameters and other configuration settings.

With fine-tuned language models, adding new documents to the knowledge base after the model has been fine-tuned requires re-training the model. This can be a lengthy and resource-intensive process, especially when dealing with large datasets.

In contrast, with semantic search, the addition of new documents to a knowledge base is typically a more efficient process that does not require re-training the entire model. Instead, the semantic embeddings of the new documents can be added directly to the existing database, which can then be searched using semantic similarity metrics.
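
A minimal in-memory sketch shows why this is cheap. VectorIndex is a hypothetical class written for this example, not the API of any particular vector database, and the two-dimensional vectors are made up. Adding a document is a single write, and the new document is searchable immediately, with no change to any model.

```python
import math


class VectorIndex:
    """Tiny in-memory index: stores unit-length vectors and ranks by cosine similarity."""

    def __init__(self):
        self._vectors = {}

    def add(self, doc_id, vector):
        """Insert or overwrite one document; no other entry is touched."""
        norm = math.sqrt(sum(v * v for v in vector))
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        self._vectors[doc_id] = [v / norm for v in vector]

    def search(self, query, k=1):
        """Return the ids of the k stored vectors closest to the query."""
        norm = math.sqrt(sum(v * v for v in query))
        if norm == 0:
            return []
        q = [v / norm for v in query]
        scored = [
            (sum(a * b for a, b in zip(q, vec)), doc_id)
            for doc_id, vec in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


index = VectorIndex()
index.add("old-policy", [1.0, 0.0])
before = index.search([0.0, 1.0])   # only the old document can be returned

index.add("new-policy", [0.1, 0.9])  # a new document arrives later
after = index.search([0.0, 1.0])    # the new document is found right away
```

The embedding model itself stays fixed in this sketch; only the index grows, which is the key difference from re-training a fine-tuned model.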

Our Recommendation: Fine-tuning large language models (LLMs) and semantic search each have advantages and pitfalls. We recommend a hybrid solution that leverages the benefits of both technologies and is customized with contextual data for the use case at hand.

