How To Build an AI Knowledge Base With RAG
LLMs can generate incorrect or imprecise responses due to the limitations of training data. Learn how to build an AI knowledge base to improve the accuracy of LLM output.
Large language models (LLMs) are immensely powerful and can quickly generate intelligent, natural-sounding responses. Their major problem, however, is the limitation of their training data – GPT-4, for instance, has a knowledge cutoff of September 2021, which means the model is unaware of any events or developments after that date. LLMs also have persistent trouble with factual accuracy and “hallucinations” – coherent and logical, but factually incorrect responses. Finally, LLMs are unaware of specific, niche information and can only generate responses at a certain level of generality.
To solve these issues and make LLMs usable for specific, information-heavy tasks, they can be connected to AI knowledge bases – repositories of organized data such as product documentation, articles, messages, and other materials. In this article, I explain how to create a knowledge base that can later be connected to an LLM, enabling it to generate factually correct and specific responses.
What Is RAG?
RAG, or retrieval-augmented generation, is a technique that gives an LLM access to relevant documents in a knowledge base, allowing it to generate accurate responses grounded in the documentation it retrieves. The RAG technique works like this:
- First, a knowledge base is searched to find information that responds to the user’s query.
- Then, the most fitting search results are added to the prompt as context, and an instruction is added, such as: “Answer the following question by using information exclusively from the following passages”.
- If the LLM you’re using is not instruction-tuned, you’ll need to add examples that demonstrate what the expected input and expected output look like.
- The prompt text containing the instruction, the search results, and the output format is sent to the LLM (a minimal prompt-assembly sketch follows this list).
- The LLM uses information from the context to generate an accurate response.
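To make the prompt-assembly step concrete, here is a minimal sketch in plain Python. The passages are hardcoded for illustration – in a real pipeline they would come from the vector search described below:

# Hypothetical retrieved passages – in practice these come from the knowledge base search
passages = [
    "Mistral 7B has a context window of 8,192 tokens.",
    "The model was released by Mistral AI in 2023.",
]
question = "How large is Mistral 7B's context window?"

# Augment the prompt: instruction + retrieved context + the user's question
prompt = (
    "Answer the following question by using information exclusively "
    "from the following passages.\n\n"
    + "\n\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(passages))
    + f"\n\nQuestion: {question}\nAnswer:"
)

# The assembled prompt is then sent to the LLM of your choice
print(prompt)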
Components of RAG
RAG consists of two components – an information retrieval component, and a text generator LLM.
- Retriever: The retriever pairs a query encoder with a vector-based document search index. Today, vector databases are often used as effective retrievers. A vector database stores vector embeddings of the data – numerical representations that reflect its semantic meaning, for instance, HELLO -> [0.23, 0.001, 0.707]. The knowledge base can be a vector data store; the retriever converts the query into a vector and uses similarity search to find relevant information (a small similarity-search sketch follows this list). Popular vector databases include Chroma, FAISS, and Pinecone.
- Text generator LLM: The LLM you use will depend on your purposes. For example, if the end solution demands data privacy, it is not recommended to use OpenAI’s GPT models. For my purposes, I used Mistral 7B, the first model released by Mistral AI.
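To make the embedding idea concrete, here is a minimal sketch of encoding a query and a few passages and ranking the passages by cosine similarity. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model; any embedding model would work the same way:

from sentence_transformers import SentenceTransformer, util

# Any embedding model works here; all-MiniLM-L6-v2 is small and fast
model = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "Our premium plan includes 24/7 support.",
    "The office is closed on public holidays.",
    "Refunds are processed within 14 business days.",
]
query = "How long does a refund take?"

# Encode the query and the passages into dense vectors
passage_embeddings = model.encode(passages)
query_embedding = model.encode(query)

# Cosine similarity: the highest-scoring passage is the most relevant one
scores = util.cos_sim(query_embedding, passage_embeddings)[0]
best = int(scores.argmax())
print(passages[best], float(scores[best]))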
In the examples, I will also use LangChain and LlamaIndex – frameworks that are frequently applied to create RAG-based applications.
Steps To Create an AI Knowledge Database
You need to create an AI knowledge database to use the RAG technique later. There are four steps you need to take.
- Gather and prepare the knowledge base: You need to collect the documents – PDF, TXT, or other formats. This step is performed manually and does not require coding.
- Chunk the documents: Some documents are large and contain a lot of text, while most LLMs have a limited context size – for example, Mistral 7B can consider up to 8,192 tokens of text when predicting the next token in a sequence. Therefore, you will need to split the documents into chunks with a fixed number of characters or tokens. A frequently used chunk size is 1,024.
- Create vector embeddings: Vector embeddings are created by converting the text chunks into numerical vectors using specialized embedding models, for example, bge-large-en-v1.5 or all-MiniLM-L6-v2.
- Store the embeddings: You will need to store the embeddings in a vector data store. Vector databases enable fast retrieval and similarity search.
These steps create the AI knowledge database, which you will later connect to the LLM to create an AI solution.
Let’s discuss the steps in more detail.
Chunking
In the RAG pipeline, text chunking is an essential preprocessing step. It entails dividing lengthy text documents into more digestible segments, or "chunks," so that large language models (LLMs) can handle them effectively. Text chunking aims to increase retrieval accuracy and optimize the model's use of the context window.
Typical Methods for Chunking Text:
- Fixed-length chunking: Dividing the text into sections consisting of a predetermined quantity of tokens or words.
- Semantic chunking: Dividing the text along semantic lines, like those between paragraphs or sentences.
- Sliding window: Using overlapping chunks to guarantee that the context surrounding chunk borders is maintained (a minimal sketch of this approach follows this list).
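Before the library examples, here is a minimal pure-Python sketch of fixed-length chunking with a sliding-window overlap; the chunk size and overlap values are illustrative:

def chunk_text(text, chunk_size=512, overlap=24):
    # Split text into fixed-size character chunks whose borders overlap
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Each chunk shares its last 24 characters with the beginning of the next one
document = open("path_to_your_document.txt").read()
print(len(chunk_text(document)))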
Code Examples Using Popular Tools Such as LangChain and LlamaIndex:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load your document
loader = TextLoader("path_to_your_document.txt")
documents = loader.load()

# Define a text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,   # Number of characters per chunk
    chunk_overlap=24  # Overlap between chunks
)

# Split the documents
chunks = splitter.split_documents(documents)
# The equivalent chunking configuration with LlamaIndex
from llama_index.core import SimpleDirectoryReader
from llama_index.core import Settings

# Load documents from a directory
documents = SimpleDirectoryReader("./documents_directory").load_data()

# Configure global chunking settings
Settings.chunk_size = 512
Settings.chunk_overlap = 24
Creating and Storing Embeddings
Now that we have chunks, it is time to create and store embeddings, using an embedding model (OpenAI in our case) together with the Chroma vector store (in the case of LangChain) and an in-memory vector store (in the case of LlamaIndex).
LangChain and LlamaIndex code examples:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load your document
loader = TextLoader("path_to_your_document.txt")
documents = loader.load()

# Define a text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,   # Number of characters per chunk
    chunk_overlap=24  # Overlap between chunks
)

# Split the documents
chunks = splitter.split_documents(documents)

# Use embeddings and a vector store for retrieval
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever()
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core import Settings

# Load documents and configure chunking
documents = SimpleDirectoryReader("./data").load_data()
Settings.chunk_size = 512
Settings.chunk_overlap = 24

# Build an in-memory vector store index (embeddings are created at this step)
index = VectorStoreIndex.from_documents(
    documents,
)
query_engine = index.as_query_engine()
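To close the loop, here is a minimal sketch of plugging the LangChain retriever from above into an LLM so that retrieved chunks are injected into the prompt as context. It assumes an OpenAI API key is configured; swap in any chat model (for example, a locally hosted Mistral 7B) depending on your privacy requirements:

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Any chat model can be used here; ChatOpenAI assumes OPENAI_API_KEY is set
llm = ChatOpenAI(temperature=0)

# The chain retrieves relevant chunks and adds them to the prompt as context
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
)
print(qa_chain.run("What does the document say about refunds?"))

With LlamaIndex, the equivalent step is simply query_engine.query("your question"), since the query engine already wires retrieval and generation together.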
These steps allow you to create an AI database that can be used in a variety of solutions to improve the accuracy of LLM responses.
How To Choose an Embedding Model and a Vector Database?
New embedding models are released every week. To choose the right one, start with the MTEB (Massive Text Embedding Benchmark) Leaderboard on Hugging Face, where you can find up-to-date lists and performance statistics for each model. The most important parameters are:
- Retrieval average: The common metric to measure the performance of retrieval systems is called NDCG – Normalized Discounted Cumulative Gain. A higher retrieval average shows that the model is more successful at ranking correct items higher in the list of results.
- Size of the model: The size of the model (in GB) shows the amount of computational resources needed to run the model. It is important to choose a model that will offer the right balance between the use of resources and performance. This will depend on your project.
- Embedding latency: Embedding latency is the time required to create embeddings for a full dataset. To measure it, you can compare how quickly different models generate embeddings for a local vector store. However, it is important to note that shorter embedding times often come with higher costs and greater computational requirements.
- Retrieval quality: To evaluate retrieval quality, use questions that correspond to themes in your dataset. For a real application, you can also use questions you expect the users of the app to ask (a quick check is sketched after this list).
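As a rough illustration, here is a sketch of spot-checking retrieval quality against the Chroma vector store built earlier; the test questions are placeholders that should match the themes of your own dataset:

# Hypothetical test questions – replace them with questions your users are likely to ask
test_questions = [
    "What is the refund policy?",
    "How do I reset my password?",
]

for question in test_questions:
    # Inspect the top-ranked chunks and check whether they actually answer the question
    for doc in vectorstore.similarity_search(question, k=3):
        print(question, "->", doc.page_content[:80])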
When it comes to a vector database, other parameters are key:
- Open-source or private: Open-source databases often have communities that drive their development. They can be suitable for organizations with limited budgets. Proprietary vector databases, on the other hand, offer additional features and effective customer support and may be more suitable if your project has particular technical or compliance requirements.
- Performance: The most important parameters are the number of queries per second and average query latency. A high queries-per-second figure shows that the database can process many queries simultaneously, which is essential if you expect your app to serve multiple users concurrently. Query latency is the time it takes for the database to process a single query. Fast processing is essential if your app requires real-time responses, such as in conversational AI chatbots (a simple latency check is sketched after this list).
- Cost-efficiency: Each database has specific pricing models. Usually, databases charge either for the number of vectors or the storage capacity. The pricing can also differ depending on the type of queries and the complexity of operations. Also, some databases charge for data transfer, which should be taken into account if your app requires frequent retrieval or uploading of data.
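To get a feel for these performance numbers, here is a minimal sketch of measuring average query latency and a rough queries-per-second figure against the Chroma store built earlier. A production benchmark would use concurrent clients and a realistic query mix; note that each call below also includes the time to embed the query:

import time

# Placeholder workload – repeat a few representative queries
queries = ["refund policy", "password reset", "pricing tiers"] * 10

start = time.perf_counter()
for q in queries:
    vectorstore.similarity_search(q, k=4)
total = time.perf_counter() - start

print(f"Average query latency: {total / len(queries) * 1000:.1f} ms")
print(f"Approximate throughput: {len(queries) / total:.1f} queries/second")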
Conclusion
To sum up, an AI knowledge base is an effective way to amplify the capabilities of LLMs in a fast and cost-efficient manner. A knowledge base serves as a repository of reliable data that is used to augment prompts and enable an LLM to generate accurate responses. The RAG technique is indispensable for integrating LLMs with knowledge bases, and it provides a clear and straightforward pathway to achieving development goals.