Build End-To-End RAG Applications Using Airbyte and Snowflake Cortex
In this article, you will create a dedicated vector store directly within Snowflake, with no coding required, as the basis for AI applications.
As a practical example, this article shows how to build an AI Product Assistant: an AI chatbot capable of answering product-related questions using data from multiple sources. With the AI Product Assistant, you can ask questions across all your company’s data in one place.
You can use the Snowflake Cortex destination to load vector data into Snowflake and then leverage Snowflake’s Cortex functions to perform Retrieval-Augmented Generation (RAG). Cortex offers general-purpose ML functions and LLM-specific functions, along with vector functions for semantic search, each backed by multiple models, all without needing to manage or host the underlying models or infrastructure.
With Airbyte’s newly launched Snowflake Cortex destination, users can create their own dedicated vector store directly within Snowflake — with no coding required! Creating a new pipeline with Snowflake Cortex takes only a few minutes, allowing you to load any of the 300-plus data sources into your new Snowflake-backed vector store.
We will use the following three data sources specific to Airbyte:
- Airbyte’s official documentation for the Snowflake Cortex destination
- GitHub issues, documenting planned features, known bugs, and work in progress.
- Zendesk tickets comprising customer details and inquiries/requests
RAG Overview
LLMs are general-purpose tools trained on an enormous corpus of data, allowing them to answer questions on almost any topic. However, if you are looking to extract information from your proprietary data, LLMs fall short, leading to potential inaccuracies and hallucinations. This is where RAG becomes super relevant. RAG addresses this limitation by enabling LLMs to access relevant and up-to-date data from any data source. The LLM can then use this information to answer relevant questions about your proprietary data.
Prerequisites
To follow the whole process, you will need the following accounts. You can also work with your own custom sources or with a single source:
- Airbyte cloud instance
- Snowflake account with Cortex functions enabled
- OpenAI account (API key for embeddings)
- Source-specific accounts: Google Drive, GitHub, and Zendesk
Step 1: Set up Snowflake Cortex Destination
To select a destination in Airbyte Cloud, click on the Destinations tab on the left, select “New Destination” and filter for the destination you are looking for.
Configure the Snowflake Cortex destination in Airbyte:
- Destination name: Provide a friendly name.
- Processing: Set the chunk size to 1000. The size refers to the number of tokens, not characters, so this is roughly 4 KB of text; the best chunk size depends on the data you are dealing with. You can leave the other options empty. You would usually set the metadata and text fields to limit the records read to specific columns in a given stream.
- Embedding: Select OpenAI from the dropdown and provide your OpenAI API key to power the embedding service. You can also use any of the other available models; just remember to embed the questions you ask the product assistant with the same model.
- Indexing: Provide credentials for connecting to your Snowflake instance.
Step 2: Set up Google Drive, GitHub, and Zendesk Sources
To select a source, click on the sources tab on the left, select “New Source” and filter for the source you are looking for.
Google Drive
Select Google Drive on the sources page and configure the Google Drive source. You will need to authenticate and provide the following:
- Folder URL: Provide a link to your Google Drive folder. For this tutorial, you can move the snowflake-cortex.md file from here over to your Google Drive. Alternatively, you can use your own file. Refer to the accompanying documentation to learn about valid file formats.
- The list of streams to sync
- Format: Set the format dropdown to “Document File Type Format”
- Name: Provide a stream name.
GitHub
Select GitHub on the sources page and configure the GitHub source. You will need to authenticate with your GitHub account and provide the following:
- GitHub repositories: You can use the repo airbytehq/PyAirbyte or provide your own.
Zendesk
Optional: Select Zendesk on the sources page and configure the Zendesk source. You can authenticate using OAuth or an API token, and provide the following:
- Sub-domain name: If your Zendesk account URL is https://MY_SUBDOMAIN.zendesk.com/, then MY_SUBDOMAIN is your subdomain.
After you have set up all sources, this is what your source screen will look like:
Step 3: Set up All Connections
Next, we will set up the following connections:
- Google Drive → Snowflake Cortex
- Github → Snowflake Cortex
- Zendesk → Snowflake Cortex
Google Drive → Snowflake Cortex
Select the Connections tab on the left navigation in Airbyte Cloud; then click on “New connection” on the top right.
- In the configuration flow, pick the existing Google Drive source and Snowflake Cortex as the destination.
- On the select streams screen, select Full Refresh | Overwrite as the sync mode.
- On the configure connection screen, set the Schedule type to every 24 hours. Optionally provide a stream prefix for your table. In our case, we prefix airbyte_ to all streams.
- Click on Finish and Sync.
- If everything went well, there should now be a connection syncing data from Google Drive to the Snowflake Cortex destination. Give the sync a few minutes to run. Once the first run has completed, you can check Snowflake to confirm the data sync. The data should be synced to a table named “[optional prefix][stream name]”.
Github → Snowflake Cortex
Select the Connections tab on the left navigation in Airbyte Cloud; then click on “New connection” on the top right.
- In the configuration flow, pick the existing Github source and Snowflake Cortex as the destination.
- On the select streams screen, make sure all streams except issues are unchecked. Also, set the sync mode to Full Refresh | Overwrite.
- On the configure connection screen, set the Schedule type to every 24 hours. Optionally provide a stream prefix for your table. In our case, we prefix airbyte_ to all streams.
- Click on Finish and Sync.
- If everything goes well, there should now be a connection syncing data from GitHub to the Snowflake Cortex destination. Give the sync a few minutes to complete. Once the first run has completed, you can check Snowflake to confirm the data sync. The data should be synced to a table named “[optional prefix]issues”.
Zendesk → Snowflake Cortex
Select the Connections tab on the left navigation in Airbyte Cloud; then click on “New connection” on the top right.
- In the configuration flow, pick the existing Zendesk source and Snowflake Cortex as the destination.
- On the select streams screen, make sure all streams except tickets and users are unchecked. Also, set the sync mode to Full Refresh | Overwrite.
- On the configure connection screen, set the Schedule type to every 24 hours. Optionally provide a stream prefix for your table. In our case, we prefix airbyte_ to all streams.
- Click on Finish and Sync.
- If everything goes well, there should now be a connection syncing data from Zendesk to the Snowflake Cortex destination. Give the sync a few minutes to complete. Once the first run has completed, you can check Snowflake to confirm the data sync. The data should be synced to tables named “[optional prefix]tickets” and “[optional prefix]users”.
Once all three connections are set, the Connections screen should look like this.
Step 4: Explore Data in Snowflake
You should see the following tables in Snowflake. Depending on your stream prefix, your table names might be different.
- Google Drive related: airbyte_docs
- GitHub related: airbyte_github_issues
- Zendesk related: airbyte_zendesk_tickets, airbyte_zendesk_users
All tables have the following columns:
- DOCUMENT_ID: Unique, based on the primary key
- CHUNK_ID: Randomly generated UUID
- DOCUMENT_CONTENT: Text content from the source
- METADATA: Metadata from the source
- EMBEDDING: Vector representation of the document content
Results from one of these tables should look like the following:
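If you want to spot-check the synced rows yourself, here is a minimal sketch using the Snowflake Python Connector; the credential placeholders and the airbyte_docs table name are assumptions based on the setup above.

```python
import snowflake.connector

# Placeholder credentials: use the same values you entered in the
# Snowflake Cortex destination's indexing settings.
conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT",
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    warehouse="YOUR_WAREHOUSE",
    database="YOUR_DATABASE",
    schema="YOUR_SCHEMA",
)

# Preview a few chunks from the Google Drive docs table.
cur = conn.cursor()
cur.execute(
    "SELECT DOCUMENT_ID, CHUNK_ID, LEFT(DOCUMENT_CONTENT, 80) AS PREVIEW, METADATA "
    "FROM airbyte_docs LIMIT 5"
)
for row in cur.fetchall():
    print(row)
```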
Step 5: RAG Building Blocks
In this section, we'll go over the essential building blocks of a Retrieval-Augmented Generation (RAG) system. This involves embedding the user's question, retrieving matching document chunks, and using a language model to generate the final answer. The first step is done using OpenAI, and the second and third steps are achieved using Snowflake’s Cortex functions. Let's break down these steps:
5.1 Embedding User’s Question
When a user asks a question, it's crucial to transform the question into a vector representation using the same model employed for generating vectors in the vector database. In this example, we use LangChain to access OpenAI's embedding model. This step ensures that the question is represented in a format that can be effectively compared with document vectors.
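A minimal sketch of this step, assuming the langchain-openai package and the text-embedding-ada-002 model (use whichever model you configured in the destination's embedding settings):

```python
from langchain_openai import OpenAIEmbeddings

# Must match the embedding model configured in the Snowflake Cortex destination,
# otherwise the similarity search will not be meaningful.
embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key="YOUR_OPENAI_API_KEY",
)

question = "How do I configure the Snowflake Cortex destination?"
question_vector = embeddings.embed_query(question)  # a list of 1536 floats for ada-002
```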
5.2 Retrieving Matching Chunks
The embedded question is then sent to Snowflake, where document chunks with similar vector representations are retrieved. All relevant chunks are returned.
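A sketch of the retrieval query, assuming the EMBEDDING column is stored as a Snowflake VECTOR(FLOAT, 1536) and using Snowflake's VECTOR_COSINE_SIMILARITY function; the connection placeholders are the same assumptions as in Step 4:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT", user="YOUR_USER", password="YOUR_PASSWORD",
    warehouse="YOUR_WAREHOUSE", database="YOUR_DATABASE", schema="YOUR_SCHEMA",
)

def retrieve_chunks(question_vector, table="airbyte_docs", top_k=10):
    """Return the document chunks most similar to the embedded question."""
    # The vector is inlined as an array literal and cast to VECTOR to keep the
    # sketch simple; 1536 is the dimensionality of the ada-002 embeddings.
    vector_literal = str(question_vector)  # e.g. "[0.01, -0.02, ...]"
    query = f"""
        SELECT DOCUMENT_CONTENT,
               VECTOR_COSINE_SIMILARITY(
                   EMBEDDING,
                   CAST({vector_literal} AS VECTOR(FLOAT, 1536))
               ) AS SIMILARITY
        FROM {table}
        ORDER BY SIMILARITY DESC
        LIMIT {top_k}
    """
    cur = conn.cursor()
    cur.execute(query)
    return [row[0] for row in cur.fetchall()]
```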
5.3 Passing Chunks to LLM for a Final Answer
All the chunks are concatenated into a single string and sent to Snowflake (using the Snowflake Python Connector) along with a prompt (a set of instructions) and the user’s question.
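A sketch of that final call, building on the conn and retrieve_chunks helpers from the previous sketches and using Snowflake's SNOWFLAKE.CORTEX.COMPLETE function; the mistral-large model name and the prompt wording are assumptions (available models vary by Snowflake region):

```python
def complete_with_context(question, chunks):
    """Send the retrieved chunks plus the user's question to a Cortex-hosted LLM."""
    context = "\n\n".join(chunks)
    prompt = (
        "You are a product assistant for Airbyte. Answer the question using "
        f"only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    cur = conn.cursor()
    # SNOWFLAKE.CORTEX.COMPLETE(model, prompt) runs the LLM inside Snowflake,
    # so there is no model to host or manage yourself.
    cur.execute("SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large', %s)", (prompt,))
    return cur.fetchone()[0]
```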
Step 6: Putting It All Together
Google Colab
For convenience, we have integrated everything into a Google Colab notebook. Explore the fully functional RAG code in Google Colab.
Every time a question is asked, we pass the list of relevant tables and models. For the first question below, the answer is retrieved from Airbyte docs only, whereas, for the second question, the answer is retrieved from GitHub issues.
You can also provide multiple tables as input, and the resulting response will be a combination of data from both sources.
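As a rough illustration of that multi-table behavior (not the notebook's exact code), the retrieval step can simply be run against each selected table and the results merged before calling the LLM, reusing the hypothetical helpers sketched in Step 5:

```python
def ask(question, tables):
    """Embed the question once, retrieve matching chunks from each table, then answer."""
    question_vector = embeddings.embed_query(question)
    chunks = []
    for table in tables:
        chunks.extend(retrieve_chunks(question_vector, table=table, top_k=5))
    return complete_with_context(question, chunks)

# Answer drawn from the docs table only...
print(ask("How do I configure the Snowflake Cortex destination?", ["airbyte_docs"]))
# ...or combined across several sources.
print(ask("What syncing problems have been reported recently?",
          ["airbyte_github_issues", "airbyte_zendesk_tickets"]))
```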
Streamlit App
We have also integrated the above RAG functionality into a Streamlit chat app. Explore the fully functional code on GitHub.
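The repository has the complete app; purely to show the wiring, here is a bare-bones sketch of a Streamlit chat page that calls the hypothetical ask() helper from the previous sketch:

```python
import streamlit as st

st.title("AI Product Assistant")

# Let the user choose which synced tables to search.
tables = st.multiselect(
    "Data sources",
    ["airbyte_docs", "airbyte_github_issues", "airbyte_zendesk_tickets", "airbyte_zendesk_users"],
    default=["airbyte_docs"],
)

if question := st.chat_input("Ask a question about the product"):
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        # ask() is the hypothetical helper from Step 6; swap in your own RAG call.
        st.write(ask(question, tables))
```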
Conclusion
Using the LLM capabilities of Snowflake Cortex in combination with Airbyte’s catalog of 300-plus connectors, you can build an insightful AI assistant to answer questions across a number of data sources. We have kept things simple here, but as your number of sources increases, consider integrating an AI agent to interpret user intent and retrieve answers from the appropriate source.
Published at DZone with permission of John Lafleur. See the original article here.