Architectural Patterns for Enterprise Generative AI Apps: DSFT, RAG, RAFT, and GraphRAG
Explore the architectural patterns available to build GenAI solutions plus enterprise-level strategies to choose the right framework for the right use case.
A well-designed Enterprise Architecture is the backbone of any organization's IT systems, providing the foundational building blocks to achieve the organization's business objectives. The architecture consists of best practices, clearly outlined strategies, common frameworks, and guidelines that help the engineering team and other stakeholders pick the right tool to accomplish their tasks. Enterprise Architecture is mostly governed by the architecture team that supports the line of business. In most organizations, the architecture team is responsible for outlining the architecture patterns and common frameworks that help the engineering and product teams avoid spending hours on proofs of concept and instead adopt proven strategies to design the core building blocks based on those patterns.
Since Generative AI has been transforming the entire landscape, most organizations are either building Generative AI-based applications or integrating Generative AI capabilities and features into their existing applications and products. In this article, we will dive deep into the common architectural patterns available for building Generative AI solutions. We will also discuss various enterprise-level strategies for picking the right framework for the right use case.
Pattern 1: Domain-Specific Fine Tuning (DSFT)
Large Language Models (LLMs) are an important building block in the architecture of Enterprise Generative AI. The LLM is responsible for generating unique content based on the training it has undergone and the knowledge it has acquired. However, LLMs available from vendors like OpenAI, Microsoft, or the open-source community lack knowledge of enterprise data. Organizations also often have their own standards and principles that should be followed while generating content.
To solve use cases in this space, fine-tuning is one of the strategies which we can utilize. Fine-tuning involves further training a pre-trained LLM on a smaller, specialized dataset that is curated to have the enterprise's unique data, standards, and principles. This process helps tailor the model’s outputs to align more closely with the organization's requirements, thereby enhancing its applicability and effectiveness in the enterprise context.
What Is Domain-Specific Fine Tuning?
As just stated, fine-tuning Large Language Models (LLMs) involves adapting pre-trained language models to perform better on specific tasks or in specific domains. This is done by training the models further using smaller, specialized datasets made up of <input, output> pairs. These pairs are examples that show the desired behavior or output.
During fine-tuning, the model's parameters are updated, which helps bridge the gap between the general abilities of the pre-trained model and the specific needs of the task. This process improves the model's performance, making it more accurate and aligned with human expectations for the given task.
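To make the <input, output> pairs concrete, here is a minimal sketch of how such a dataset is often assembled. The example records and the JSONL file name are hypothetical, and the exact schema depends on the fine-tuning method and vendor you choose.

```python
import json

# Hypothetical <input, output> pairs capturing the enterprise's tone and guidelines.
# Real datasets are curated from approved agent responses, policy documents, etc.
examples = [
    {
        "input": "Customer asks whether the premium plan includes 24/7 phone support.",
        "output": "Thank you for reaching out. Yes, the premium plan includes 24/7 phone support...",
    },
    {
        "input": "Prospect asks about data-residency options for the EU region.",
        "output": "We offer EU-hosted deployments. Per our compliance guidelines...",
    },
]

# Most fine-tuning pipelines accept one JSON object per line (JSONL).
with open("dsft_training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```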
Use Cases Where the DSFT Pattern Can Be the Best Candidate
Fine-tuning excels where the organization is looking for more specialized, domain-specific content generation. If the use case requires specific standards and a particular style to be followed while generating content, then "fine-tuning" is a great tool in the toolbox.
For example, let's imagine the customer service department wants to develop an automated workflow solution to replace a manual process in which live customer service agents respond to customer or prospect inquiries about the organization's products or services. The agent needs to understand the intent and meaning of the customer's email, do some research, and then follow company guidelines when responding. This process typically takes an agent 2 to 3 hours, and the organization receives a large volume of customer emails inquiring about its products.
By using fine-tuning, the organization can train an AI model to understand and respond to these inquiries automatically, following the company’s standards and guidelines. This can save a significant amount of time and ensure that responses are consistent and accurate.
Fine-tuning is generally classified into the following methods (a minimal LoRA sketch follows the list):
- Supervised Fine Tuning (SFT)
- Reinforcement Learning from Human Feedback (RLHF)
- Parameter Efficient Fine Tuning (PEFT)
- Low Rank Adaptation (LoRA)
- Quantized Low-Rank Adaptation (QLoRA)
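As an illustration of the parameter-efficient methods above, the sketch below attaches a LoRA adapter to a Hugging Face causal language model using the peft library. The base checkpoint, target modules, and hyperparameters are assumptions you would adjust for your own model and license.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed base model; swap in the checkpoint your organization is licensed to fine-tune.
base_model = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA trains small low-rank adapter matrices instead of all model weights,
# which is what makes PEFT/LoRA far cheaper than full fine-tuning.
lora_config = LoraConfig(
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```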
Enterprise Strategy for Fine Tuning Pattern
If we plan to use this pattern for building the next Generative AI application, the main pitfall is that it is time-consuming and expensive, even though it produces high-quality, near-perfect output. It is time-consuming because the LLM needs to be re-trained using any of the above-mentioned methods, which requires preparing the dataset, the training corpus, and human labelers. If the organization's data is dynamic and gets updated frequently, this pattern is not advisable, because every data change requires re-training the LLM, which becomes costly. If the data is not very dynamic in nature and we want the LLM to produce high-quality, domain-specific output, then fine-tuning is the best approach.
Pattern 2: RAG (Retrieval Augmented Generation)
RAG or Retrieval Augmented Generation is one of the popular patterns being used in almost all Enterprise Generative AI development, as this is one of the most cost-effective patterns that saves significant development effort for building Gen AI applications. The basic structure of RAG can be outlined as follows:
- R - (R)etrieve the context based on the similarity search algorithm.
- A - (A)ugment the retrieved context along with the instruction (Prompt Engineering) for the LLM on what to generate based on the context we are supplying.
- G - LLM will (G)enerate the content based on the context and the instruction (Prompt Engineering) and send this generated response to the user.
In the RAG pattern, we integrate a vector database that can store and index embeddings (numerical representations of digital content). We use various search algorithms like HNSW or IVF to retrieve the top k results, which are then used as the input context. The search is performed by converting the user's query into embeddings. The top k results are added to a well-constructed prompt, which guides the LLM on what to generate and the steps it should follow, as well as what context or data it should consider.
Once the LLM generates the content based on the prompt and input context, it goes through a profanity check (optional) or moderation layer. The validated response is then presented to the user in a human-understandable format.
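To tie these pieces together, here is a minimal end-to-end sketch of the retrieve–augment–generate flow. It assumes the OpenAI Python client for embeddings and generation and a small in-memory corpus standing in for a real vector database; the model names, the sample documents, and the "say you don't know" instruction are illustrative assumptions, not recommendations.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# In a real system these chunks live in a vector database (with HNSW/IVF indexes);
# a small in-memory list keeps the sketch self-contained.
corpus = [
    "Our premium plan includes 24/7 phone and chat support.",
    "Refunds are processed within 5 business days of approval.",
    "Enterprise customers can request EU data residency.",
]

def embed(text: str) -> np.ndarray:
    # The embedding model name is an assumption; use whatever your platform provides.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

corpus_embeddings = [embed(doc) for doc in corpus]

def answer_with_rag(user_query: str, k: int = 2) -> str:
    # (R)etrieve: cosine similarity against the stored embeddings, keep top-k.
    q = embed(user_query)
    scores = [float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e))) for e in corpus_embeddings]
    top_k = [corpus[i] for i in np.argsort(scores)[::-1][:k]]

    # (A)ugment: put the retrieved chunks and the instructions into a single prompt.
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        "Context:\n" + "\n".join(top_k) + f"\n\nQuestion: {user_query}"
    )

    # (G)enerate: the chat model name is an assumption.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

print(answer_with_rag("Does the premium plan include phone support?"))
```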
Use Cases Where the “RAG” Pattern Can Be the Best Candidate
RAG is an easy-to-build and cost-effective solution when we need the LLM to produce content based on organization-specific data. Since the LLM is not trained on the organization's private data, and training requires significant time, we use the RAG pattern to build a Gen AI app.
AI-based intelligent enterprise search, virtual assistants, or chatbots that help customers understand complex documentation, HR chatbots, recommendation engines, and customer care agents who need to quickly understand procedures to better assist customers are perfect use cases for RAG.
Some of the popular Enterprise-based use cases are:
- HR support for employee training and onboarding: The RAG pattern can be used to build an HR support application that delivers customized training materials and answers specific questions to facilitate smooth onboarding processes, freeing up time for HR staff to focus on other areas.
- Healthcare industry: RAG-based Generative AI applications can support medical professionals with information on various treatment protocols and medical research for better patient care.
- Enterprise knowledge mining and management systems: RAG can be used to build products to help employees find and retrieve relevant organization-specific information from vast internal content repositories.
- Applications for sales and marketing: Using RAG, it is easy to build personalized product recommendations and generate content for marketing campaigns or product-related data.
- Technical support applications: Gen-AI-based applications can summarize troubleshooting steps and relevant technical documentation for customer care agents to resolve issues faster.
Enterprise Strategy for RAG
When the data source is dynamic in nature (meaning we expect the data to be updated frequently), RAG (Retrieval-Augmented Generation) is an ideal solution. RAG performs better in environments where data changes regularly because it allows for real-time updates and ensures that the information retrieved is always synced with the changes. With RAG, each time the data source is updated, the embeddings in the vector database must also be updated during data ingestion to reflect these changes accurately.
Most Enterprise RAG applications have the following two primary workflows in their architecture:
1. Data Processing and Ingestion
This workflow involves the extraction, transformation, and loading (ETL) of source data into the vector database in the form of embeddings. When new data is added or existing data is modified, the system processes these changes, generates new embeddings, and updates the vector database. This ensures that our vector database remains in sync with the latest information. This workflow is triggered whenever there are changes in the data source. This allows the AI system to adapt quickly to new information or changes to existing information.
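A minimal sketch of this ingestion workflow appears below. Chroma is used purely as an example vector store, and the collection name, chunk size, and sample document are assumptions; any vector database with an upsert-style API fits the same pattern.

```python
import chromadb

# Example vector store; the collection name and chunking parameters are assumptions.
client = chromadb.Client()
collection = client.get_or_create_collection(name="enterprise_docs")

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Naive fixed-size chunking; production pipelines usually chunk on semantic or
    # structural boundaries (headings, paragraphs) instead.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def ingest(doc_id: str, text: str) -> None:
    pieces = chunk(text)
    collection.upsert(
        ids=[f"{doc_id}-{i}" for i in range(len(pieces))],
        documents=pieces,                       # Chroma embeds these with its default model
        metadatas=[{"source": doc_id}] * len(pieces),
    )

# Triggered whenever the data source adds or modifies a document.
ingest("policy-handbook-v2", "Refunds are processed within 5 business days of approval...")
```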
2. Retrieval via Similarity Search
In this workflow, when a user query is received, the system converts the query into embeddings and performs a similarity search based on ANN, KNN, or another algorithm against the updated vector database. The top-k results are retrieved and used as context for generating responses with the help of the LLM. This ensures that the information provided is relevant and based on the most recent data.
When there is any change in the data source, only the data processing and ingestion workflow gets triggered, which syncs the changes and updates the vector database. By implementing change detection mechanisms within the RAG architecture, the system can seamlessly synchronize with updates. This ensures that the retrieval process always uses the most recent data without requiring a complete overhaul of the entire system.
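One lightweight way to implement this change detection is to keep a content hash per document and re-ingest only the documents whose hash has changed. The sketch below assumes the hypothetical `ingest` helper from the previous snippet; a production system would persist the hash map rather than keep it in memory.

```python
import hashlib

# Maps document ID -> hash of the content currently reflected in the vector database.
indexed_hashes: dict[str, str] = {}

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync(documents: dict[str, str]) -> None:
    """Re-ingest only the documents that are new or changed since the last sync."""
    for doc_id, text in documents.items():
        h = content_hash(text)
        if indexed_hashes.get(doc_id) != h:
            ingest(doc_id, text)        # hypothetical helper: chunk, embed, upsert
            indexed_hashes[doc_id] = h  # retrieval keeps serving existing data until the upsert lands
```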
The RAG pattern provides great benefits to the enterprise because it separates data syncing from data retrieval. This decoupling means that updates to the data source can be handled efficiently without disrupting the retrieval process with zero downtime for users. This kind of modularized architectural pattern allows for scalability and flexibility. This makes it easier to adapt to growing data volumes and changing requirements.
This approach is not only cost-effective but also reduces build time, making it an efficient choice for enterprises that require up-to-date and accurate information from dynamic data sources. This architectural pattern helps the engineering and product teams quickly integrate and synchronize new data into the AI system. So, for frequently changing data sources, it is always recommended to use the RAG-based approach rather than the fine-tuning approach to provide the timely and relevant information needed for decision-making and operational efficiency.
Pattern 3: RA-FT (Retrieval Augmented - Fine Tuning)
RA-FT has been popularized by researchers at Meta, Microsoft, and UC Berkeley. A recent paper published by the team proposes a new framework to tackle the limitations of both the generic RAG framework and the Domain-Specific Fine Tuning (DSFT) approach.
To explain the framework, the researchers have compared the RAG approach with an "Open Book Exam" and Fine Tuning with a "Closed Book Exam."
Limitation of RAG
In RAG, the context is formed by doing a vector-based similarity search of an index. This search may bring up documents (or chunks) that are semantically close to the query but not necessarily meaningful, causing the LLM to struggle with generating a coherent and meaningful answer. The LLM doesn’t know which documents are truly relevant and which are misleading. These “distractor” documents may be included in the LLM's context even when they are not good sources for a well-reasoned answer.
Limitation of DSFT
The researchers also argued that with the DSFT approach, the LLM is limited to only what it was trained on. It can make guesses and even give incorrect answers because it doesn't have access to external sources for accurate information.
How Does RA-FT Address the Limitations of DSFT and RAG Patterns?
To solve both the limitations of DSFT and basic RAG, the RA-FT framework combines the RAG and fine-tuning approaches in a new way. In the RA-FT approach, the LLM is trained in such a way that it becomes smart enough to pick out the most useful and relevant documents from the context generated using the similarity search as part of the retrieval process.
Using RA-FT, when the model is given a question and a batch of retrieved documents, it is taught to ignore documents that do not help answer the question. Because of the training it underwent during the fine-tuning process, the LLM learns how to identify the "distractor" documents and only uses the useful and non-distractor documents (or chunks) to generate a coherent answer for the user's query.
In RA-FT, the training data is prepared so that each data point includes a question, a set of contextually relevant documents, and a corresponding chain-of-thought style answer. RA-FT combines fine-tuning with this training set consisting of question-answer pairs, using documents in a simulated imperfect retrieval scenario. This approach effectively prepares the LLM for open-book exams. RA-FT is a method for adapting LLMs to read and derive solutions from a mix of relevant and irrelevant documents.
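For illustration, a RA-FT training record typically bundles a question, one or more "oracle" documents that actually answer it, a few distractor documents, and a chain-of-thought style answer grounded only in the oracle content. The record below is hypothetical, and the field names are not mandated by the paper.

```python
import json

# One hypothetical RA-FT training record. During fine-tuning, the model sees the
# question plus the mixed document set and is trained to reproduce the chain-of-thought
# answer that relies only on the oracle document(s), ignoring the distractors.
raft_example = {
    "question": "What is the refund processing time for enterprise customers?",
    "oracle_documents": [
        "Refunds for enterprise accounts are processed within 5 business days of approval.",
    ],
    "distractor_documents": [
        "Enterprise customers can request EU data residency.",
        "The premium plan includes 24/7 phone and chat support.",
    ],
    "cot_answer": (
        "The context states that refunds for enterprise accounts are processed "
        "within 5 business days of approval. The other documents discuss data "
        "residency and support channels, which are not relevant. "
        "Answer: 5 business days."
    ),
}

with open("raft_training_data.jsonl", "a") as f:
    f.write(json.dumps(raft_example) + "\n")
```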
Enterprise Strategy for RA-FT Pattern
Since RAFT consists of both RAG and fine-tuning approaches, the cost is even higher than the DSFT approach. However, the results are impressive, which means this technique is suitable in use cases where providing high-quality output along with grounded data and sources is an essential requirement. This approach will yield the best results when you might expect to get mixed results (having both relevant as well as distractor documents/chunks) from the vector similarity search and you don't want to generate or formulate a response by LLM based on distractor or not useful documents/chunks. For a highly regulated industry, this solution would be beneficial to integrate into the existing Gen AI ecosystem.
Pattern 4: Knowledge Graph/RAG Graph
As you know, both the basic RAG and RAFT-based approaches depend heavily on the underlying vector database and the similarity algorithms (ANN or KNN) it uses to retrieve the chunked dataset that serves as context for the LLM to formulate the response. However, the biggest issue with this approach is that when a contextually meaningful large paragraph is broken into small chunks, it loses its inner meaning and relationships. As a result, the similarity search only picks a result set in which the documents (or chunks) have words close to each other based on relevance. Generic RAG approaches, which rely primarily on vector-based retrieval, face several limitations, such as a lack of deep contextual understanding and complex reasoning capabilities when generating responses for users.
To address this shortfall, the Knowledge Graph database has emerged as another integral component that can be plugged into the existing RAG system so that your Generative AI application becomes smarter while assisting users with answers. This technique is called GraphRAG: a different kind of database, a Knowledge Graph database, is added to the system to support content generation based on external domain-specific data when RAG's similarity search has not yielded correct responses.
How Does GraphRAG Work?
GraphRAG is an advanced RAG approach that uses a graph database to retrieve information for specific tasks. Unlike traditional relational databases that store structured data in tables with rows and columns, graph databases use nodes, edges, and properties to represent and store data. This provides a more intuitive and efficient way to model, view, and query complex systems. GraphRAG connects concepts and entities within the content using a knowledge graph it builds with the help of an LLM.
Ingestion Flow
GraphRAG leverages a large language model (LLM) to automatically generate a detailed knowledge graph from a collection of text documents. This knowledge graph captures the meaning and structure of the data by identifying and connecting related concepts. During the indexing flow, the system uses an LLM to extract all entities, relationships, and key claims from granular text units.
It also detects "communities" or "clusters" of closely related nodes, organizing them at different levels of detail. This helps in understanding the data's overall semantic structure.
These community-based summaries provide a comprehensive overview of the dataset and a holistic picture of each article. This allows the system to address broad or complex queries that simpler Retrieval-Augmented Generation (RAG) methods struggle with.
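A highly simplified sketch of this indexing flow follows. In practice an LLM pass over each text unit produces the (entity, relation, entity) triples; here two extracted triples are hard-coded so the graph-building and community-detection steps stay runnable, and networkx stands in for a real graph database.

```python
import networkx as nx
from networkx.algorithms import community

# In a real GraphRAG pipeline, an LLM extracts triples like these from each text unit.
# They are hard-coded here (hypothetical data) to keep the sketch self-contained.
extracted_triples = [
    ("Acme Corp", "acquired", "Widget Inc"),
    ("Widget Inc", "manufactures", "industrial sensors"),
    ("Acme Corp", "headquartered_in", "Berlin"),
]

graph = nx.Graph()
for subject, relation, obj in extracted_triples:
    graph.add_edge(subject, obj, relation=relation)

# Detect "communities" of closely related entities; each community can then be
# summarized by the LLM to support broad, dataset-level questions.
communities = community.greedy_modularity_communities(graph)
for i, nodes in enumerate(communities):
    print(f"community {i}: {sorted(nodes)}")
```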
Retrieval Flow
When a user asks a question, GraphRAG efficiently retrieves the most relevant information from the knowledge graph. It then uses this information to guide and refine the LLM's response, improving the accuracy of the answer and reducing the chances of generating incorrect or misleading information.
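A minimal sketch of the retrieval side, assuming the networkx graph from the previous snippet: entities mentioned in the query are matched against graph nodes, their immediate neighborhood is collected, and the resulting facts are handed to the LLM as context. The string-matching entity linking and the `llm_complete` wrapper are illustrative assumptions.

```python
def graph_context(graph, user_query: str, hops: int = 1) -> str:
    # Naive entity linking: a node "matches" if its name appears in the query.
    # Real systems use an LLM or an entity-linking model for this step.
    seeds = [n for n in graph.nodes if n.lower() in user_query.lower()]

    facts = []
    frontier = set(seeds)
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for _, neighbor, data in graph.edges(node, data=True):
                facts.append(f"{node} --{data.get('relation', 'related_to')}--> {neighbor}")
                next_frontier.add(neighbor)
        frontier = next_frontier
    return "\n".join(facts)

def answer_with_graphrag(graph, user_query: str) -> str:
    context = graph_context(graph, user_query)
    prompt = f"Use these graph facts to answer the question.\n\n{context}\n\nQuestion: {user_query}"
    return llm_complete(prompt)  # hypothetical wrapper around whichever LLM endpoint you use
```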
Enterprise Strategy for GraphRAG Pattern
Like the basic RAG system, GraphRAG also uses a specialized database to store the knowledge data it generates with the help of an LLM. However, generating the knowledge graph is more costly compared to generating embeddings and storing them in a vector database. Therefore, GraphRAG should be used in scenarios where the basic RAG might struggle to produce accurate answers.
When the source data is highly dynamic (meaning it changes frequently), you need to rebuild the knowledge graph of the corpus and update the graph database accordingly. Rebuilding the graph database for every change in the source data can be expensive, but it is necessary to maintain the same comprehensive understanding of the data.
In an enterprise setting, it is recommended to integrate GraphRAG with the basic RAG to create a more effective Generative AI system. This way, if the basic RAG fails to retrieve the desired result, the system can search for context in the GraphRAG database and generate responses for users, instead of hallucinating or failing to answer when the right context exists but is spread across different chunks or documents that are not clustered together. Combining GraphRAG with the basic RAG system makes the AI application more robust.
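A sketch of this combination, using the hypothetical `answer_with_rag` and `answer_with_graphrag` helpers from the earlier snippets and a crude string check to decide when the basic RAG answer was insufficient; real systems typically use a confidence score or an LLM-based grader instead.

```python
def answer(user_query: str, graph) -> str:
    # First try the cheaper vector-based RAG path.
    rag_answer = answer_with_rag(user_query)            # hypothetical helper

    # Fall back to GraphRAG when basic RAG could not ground an answer.
    # The "don't know" check is a placeholder for a proper answer grader.
    if "don't know" in rag_answer.lower():
        return answer_with_graphrag(graph, user_query)   # hypothetical helper
    return rag_answer
```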