How To Use ChatGPT API in Python for Your Real-Time Data
Looking to make ChatGPT answer questions about topics it wasn't trained on? Here's a step-by-step tutorial on how to build an AI-powered app with a few lines of code.
OpenAI’s GPT has emerged as the foremost AI tool globally and is proficient at addressing queries based on its training data. However, it cannot answer questions about unknown topics:
- Recent events after September 2021
- Your non-public documents
- Information from past conversations
This task gets even more complicated when you deal with real-time data that changes frequently. Moreover, you cannot feed extensive content to GPT, nor can it retain your data over extended periods. In this case, you need to build a custom LLM (Large Language Model) app to efficiently give context to the answering process. This piece will walk you through the steps to develop such an application using the open-source LLM App library in Python. The source code is on GitHub (linked below in the section "Build a ChatGPT Python API for Sales").
Learning Objectives
You will learn the following throughout the article:
- The reason why you need to add custom data to ChatGPT
- How to use embeddings, prompt engineering, and ChatGPT for better question-answering
- Build your own ChatGPT with custom data using the LLM App
- Create a ChatGPT Python API for finding real-time discounts or sales prices
Why Provide ChatGPT With a Custom Knowledge Base?
Before jumping into the ways to enhance ChatGPT, let’s first explore the manual methods and identify their challenges. Typically, ChatGPT’s knowledge is extended through prompt engineering. Assume you want to find real-time discounts/deals/coupons from various online markets.
For example, when you ask ChatGPT “Can you find me discounts this week for Adidas men’s shoes?”, a standard response you get from the ChatGPT UI without custom knowledge is:
As evident, GPT offers general advice on locating discounts but lacks specificity regarding where or what type of discounts, among other details. Now, to help the model, we supplement it with discount information from a trustworthy data source. You must engage with ChatGPT by adding the initial document content prior to posting the actual questions. We will collect this sample data from the Amazon products deal dataset and insert only a single JSON item into the prompt:
As you can see, you get the expected output, and this is quite simple to achieve since ChatGPT is now context-aware. However, the issue with this method is that the model’s context is restricted (GPT-4’s maximum text length is 8,192 tokens). This strategy quickly becomes problematic when input data is huge: you may expect thousands of items discovered in sales, and you cannot provide this large amount of data as an input message. Also, once you have collected your data, you may want to clean, format, and preprocess it to ensure data quality and relevancy. Beyond the context limit, manual prompt engineering brings further challenges:
- Cost — By providing more detailed information and examples, the model’s performance might improve, though at a higher cost (for GPT-4 with an input of 10k tokens and an output of 200 tokens, the cost is $0.624 per prediction). Repeatedly sending identical requests can escalate costs unless a local cache system is utilized.
- Latency — A challenge with utilizing ChatGPT APIs for production, like those from OpenAI, is their unpredictability. There is no guarantee regarding the provision of consistent service.
- Security — When integrating custom plugins, every API endpoint must be specified in the OpenAPI spec for functionality. This means you’re revealing your internal API setup to ChatGPT, a risk many enterprises are skeptical of.
- Offline Evaluation — Conducting offline tests on code and data output or replicating the data flow locally is challenging for developers. This is because each request to the system may yield varying responses.
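As a sanity check on the cost figure quoted above, here is the arithmetic, assuming OpenAI's GPT-4-32K list prices at the time of writing ($0.06 per 1K input tokens, $0.12 per 1K output tokens); the prices are an assumption, not part of the original article:

```python
# Back-of-envelope cost per prediction for a large stuffed prompt
input_tokens, output_tokens = 10_000, 200
cost = input_tokens / 1000 * 0.06 + output_tokens / 1000 * 0.12
print(f"${cost:.3f} per prediction")  # $0.624
```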
Using Embeddings, Prompt Engineering, and ChatGPT for Question-Answering
A promising approach you will find on the internet is utilizing LLMs to create embeddings and then constructing your applications on top of these embeddings, such as search-and-ask systems. In other words, instead of querying ChatGPT directly through the Chat Completion endpoint, you would run the following query:
```
Given the following discounts data: {input_data}, answer this query: {user_query}.
```
The concept is straightforward. Rather than posting a question directly, the method first creates vector embeddings through the OpenAI API for each input document (text, image, CSV, PDF, or other types of data), then indexes the generated embeddings for fast retrieval and stores them in a vector database. It then leverages the user’s question to search for and obtain relevant documents from the vector database. These documents are presented to ChatGPT along with the question as a prompt. With this added context, ChatGPT can respond as if it has been trained on the internal dataset.
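To make the pattern concrete, here is a minimal, self-contained sketch using the pre-1.0 openai Python client and toy documents; none of this is the LLM App's code, and the model names are assumptions:

```python
import numpy as np
import openai  # pre-1.0 client; reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> np.ndarray:
    # Create one embedding vector per input text via the OpenAI Embeddings API
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in response["data"]])


documents = [
    "Deal on Crocs, 29% off, now $35.48",
    "Deal on Adidas men's shoes, 20% off this week",
]
doc_vectors = embed(documents)

question = "Can you find me discounts this week for Adidas men's shoes?"
query_vector = embed([question])[0]

# Retrieve the most relevant document by cosine similarity
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
best_doc = documents[int(np.argmax(scores))]

# Ask ChatGPT with the retrieved document as added context
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Given the following discounts data:\n{best_doc}\nanswer this query: {question}",
    }],
)
print(completion["choices"][0]["message"]["content"])
```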
On the other hand, if you use Pathway’s LLM App, you don’t even need a vector database. It implements real-time in-memory data indexing, reading data directly from any compatible storage, without having to query a vector document database, which comes with costs like increased prep work, infrastructure, and complexity. Keeping the source data and vectors in sync is painful, and it is even harder if the underlying input data changes over time and requires re-indexing.
ChatGPT With Custom Data Using LLM App
The simple steps below explain a data pipelining approach to building a ChatGPT app for your data with the LLM App.
- Collect: Your app reads the data from various data sources (CSV, JSON Lines, SQL databases, Kafka, Redpanda, Debezium, and so on) in real time when streaming mode is enabled with Pathway (or you can test data ingestion in static mode, too). It also maps each data row into a structured document schema to better manage large data sets.
- Preprocess: Optionally, you do easy data cleaning by removing duplicates, irrelevant information, and noisy data that could affect your responses’ quality, and extract the data fields you need for further processing. Also, at this stage, you can mask or hide private data to avoid it being sent to ChatGPT.
- Embed: Each document is embedded with the OpenAI API, and the embedding result is retrieved.
- Indexing: An index is constructed on the generated embeddings in real time.
- Search: Given a user question, say, from an API-friendly interface, an embedding is generated for the query with the OpenAI API. Using the embeddings, the vector index is searched by relevance to the query on the fly.
- Ask: The question and the most relevant document sections are inserted into a message to GPT, and GPT’s answer is returned (Chat Completion endpoint).
Build a ChatGPT Python API for Sales
Now that we have a clear picture of how the LLM App works from the previous section, you can follow the steps below to build a discount finder app. The project source code can be found on GitHub. If you want to start using the app quickly, you can skip this part, clone the repository, and run the code sample by following the instructions in the README.md file there.
Sample Project Objective
Inspired by this article on enterprise search, our sample app exposes an HTTP REST API endpoint in Python that answers user queries about current sales by retrieving the latest deals from various sources (CSV, JSON Lines, API, message brokers, or databases) and leverages the OpenAI API’s Embeddings and Chat Completion endpoints to generate AI assistant responses.
Step 1: Data Collection (Custom Data Ingestion)
For simplicity, we can use any JSON Lines file as a data source. The app takes JSON Lines files like discounts.jsonl and uses this data when processing user queries. The data source is expected to have a doc object on each line. Make sure that you convert your input data to JSON Lines first. Here is an example of a JSON Lines file with a single row:
{"doc": "{'position': 1, 'link': 'https://www.amazon.com/deal/6123cc9f', 'asin': 'B00QVKOT0U', 'is_lightning_deal': False, 'deal_type': 'DEAL_OF_THE_DAY', 'is_prime_exclusive': False, 'starts_at': '2023-08-15T00:00:01.665Z', 'ends_at': '2023-08-17T14:55:01.665Z', 'type': 'multi_item', 'title': 'Deal on Crocs, DUNLOP REFINED(\u30c0\u30f3\u30ed\u30c3\u30d7\u30ea\u30d5\u30a1\u30a4\u30f3\u30c9)', 'image': 'https://m.media-amazon.com/images/I/41yFkNSlMcL.jpg', 'deal_price_lower': {'value': 35.48, 'currency': 'USD', 'symbol': '$', 'raw': '35.48'}, 'deal_price_upper': {'value': 52.14, 'currency': 'USD', 'symbol': '$', 'raw': '52.14'}, 'deal_price': 35.48, 'list_price_lower': {'value': 49.99, 'currency': 'USD', 'symbol': '$', 'raw': '49.99'}, 'list_price_upper': {'value': 59.99, 'currency': 'USD', 'symbol': '$', 'raw': '59.99'}, 'list_price': {'value': 49.99, 'currency': 'USD', 'symbol': '$', 'raw': '49.99 - 59.99', 'name': 'List Price'}, 'current_price_lower': {'value': 35.48, 'currency': 'USD', 'symbol': '$', 'raw': '35.48'}, 'current_price_upper': {'value': 52.14, 'currency': 'USD', 'symbol': '$', 'raw': '52.14'}, 'current_price': {'value': 35.48, 'currency': 'USD', 'symbol': '$', 'raw': '35.48 - 52.14', 'name': 'Current Price'}, 'merchant_name': 'Amazon Japan', 'free_shipping': False, 'is_prime': False, 'is_map': False, 'deal_id': '6123cc9f', 'seller_id': 'A3GZEOQINOCL0Y', 'description': 'Deal on Crocs, DUNLOP REFINED(\u30c0\u30f3\u30ed\u30c3\u30d7\u30ea\u30d5\u30a1\u30a4\u30f3\u30c9)', 'rating': 4.72, 'ratings_total': 6766, 'page': 1, 'old_price': 49.99, 'currency': 'USD'}"}
The cool part is that the app is always aware of changes in the data folder. If you add another JSON Lines file, the LLM app does magic and automatically updates the AI model’s response.
Step 2: Data Loading and Mapping
With Pathway’s JSON Lines input connector, we read the local JSON Lines file, map data entries into a schema, and create a Pathway Table. See the full source code in app.py:
```python
...
sales_data = pw.io.jsonlines.read(
    "./examples/data",
    schema=DataInputSchema,
    mode="streaming",
)
```
Map each data row into a structured document schema. See the full source code in app.py:
```python
class DataInputSchema(pw.Schema):
    doc: str
```
Step 3: Data Embedding
Each document is embedded with the OpenAI API, and the embedding result is retrieved. See the full source code in embedder.py:
```python
...
embedded_data = embeddings(context=sales_data, data_to_embed=sales_data.doc)
```
Step 4: Data Indexing
Then we construct an instant index on the generated embeddings:
```python
index = index_embeddings(embedded_data)
```
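To build intuition for what such an index does, here is a toy in-memory version with cosine-similarity search. It is purely illustrative; the LLM App's real index is a different, more efficient implementation:

```python
import numpy as np


class SimpleIndex:
    """Toy in-memory vector index: stores normalized vectors, returns top-k docs."""

    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.docs: list[str] = []

    def add(self, vector: list[float], doc: str) -> None:
        v = np.asarray(vector, dtype=np.float32)
        self.vectors.append(v / np.linalg.norm(v))
        self.docs.append(doc)

    def search(self, query_vector: list[float], k: int = 3) -> list[str]:
        q = np.asarray(query_vector, dtype=np.float32)
        q /= np.linalg.norm(q)
        # Cosine similarity reduces to a dot product on normalized vectors
        scores = np.stack(self.vectors) @ q
        return [self.docs[i] for i in np.argsort(scores)[::-1][:k]]
```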
Step 5: User Query Processing and Embedding
We create a REST endpoint, take a user query from the API request payload, and embed the user query with the OpenAI API.
```python
...
query, response_writer = pw.io.http.rest_connector(
    host=host,
    port=port,
    schema=QueryInputSchema,
    autocommit_duration_ms=50,
)
embedded_query = embeddings(context=query, data_to_embed=pw.this.query)
```
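Once the pipeline is running, the endpoint can be exercised with an ordinary HTTP client. A hypothetical call, where the host, port, and route are assumptions based on the connector configuration above and the payload follows QueryInputSchema:

```python
import requests

# Ask the running API about current discounts
response = requests.post(
    "http://localhost:8080/",
    json={"query": "Find discounts this week for Adidas men's shoes"},
)
print(response.text)
```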
Step 6: Similarity Search and Prompt Engineering
We perform a similarity search by using the index to identify the most relevant matches for the query embedding. Then we build a prompt that merges the user’s query with the fetched relevant data results and send the message to the Chat Completion endpoint to produce a proper and detailed response.
```python
responses = prompt(index, embedded_query, pw.this.query)
```
We followed the same in-context learning approach when crafting the prompt and adding internal knowledge to ChatGPT in prompt.py:
prompt = f"Given the following discounts data: \\n {docs_str} \\nanswer this query: {query}"
Step 7: Return the Response
The final step is just to return the API response to the user.
```python
# Return the API response to the user
response_writer(responses)
```
Step 8: Put Everything Together
Now, if we put all the above steps together, you have an LLM-enabled Python API for custom discount data, ready to use, as you can see in the implementation in the app.py Python script:
```python
import pathway as pw

from common.embedder import embeddings, index_embeddings
from common.prompt import prompt


class DataInputSchema(pw.Schema):
    doc: str


class QueryInputSchema(pw.Schema):
    query: str


def run(host, port):
    # Given a user question as a query from your API
    query, response_writer = pw.io.http.rest_connector(
        host=host,
        port=port,
        schema=QueryInputSchema,
        autocommit_duration_ms=50,
    )

    # Real-time data coming from external data sources such as a JSON Lines file
    sales_data = pw.io.jsonlines.read(
        "./examples/data",
        schema=DataInputSchema,
        mode="streaming",
    )

    # Compute embeddings for each document using the OpenAI Embeddings API
    embedded_data = embeddings(context=sales_data, data_to_embed=sales_data.doc)

    # Construct an index on the generated embeddings in real time
    index = index_embeddings(embedded_data)

    # Generate embeddings for the query from the OpenAI Embeddings API
    embedded_query = embeddings(context=query, data_to_embed=pw.this.query)

    # Build a prompt using the indexed data
    responses = prompt(index, embedded_query, pw.this.query)

    # Feed the prompt to ChatGPT and obtain the generated answer
    response_writer(responses)

    # Run the pipeline
    pw.run()
```
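To start the pipeline, the script needs an entry point along these lines; the host and port literals are placeholders, as the repository wires them through its own configuration:

```python
if __name__ == "__main__":
    run(host="0.0.0.0", port=8080)
```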
(Optional) Step 9: Add an Interactive UI
To make your app more interactive and user-friendly, you can use Streamlit to build a front-end app. See the implementation in this app.py file.
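For illustration, a minimal front end along these lines would do; this is not the repository's actual Streamlit code, and the endpoint URL assumes the API runs locally on port 8080:

```python
import requests
import streamlit as st

# Hypothetical minimal UI: forward the user's question to the REST API
st.title("Discount finder")
question = st.text_input("Ask about current sales")
if question:
    response = requests.post("http://localhost:8080/", json={"query": question})
    st.write(response.text)
```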
Running the App
Follow the instructions in the "How to run the project" section of the README.md file (linked earlier), and you can start asking questions about discounts; the API will respond according to the discounts data source you have added.
After we give this knowledge to GPT through the UI (by applying a data source), look at how it replies:
The app takes both the Rainforest API and the discounts.csv file documents into account (merging data from these sources instantly), indexes them in real time, and uses this data when processing queries.
Further Improvements
We’ve only touched on a few capabilities of the LLM App by adding domain-specific knowledge like discounts to ChatGPT. There is more you can achieve:
- Incorporate additional data from external APIs, along with various files (such as JSON Lines, PDF, Doc, HTML, or Text format), databases like PostgreSQL or MySQL, and stream data from platforms like Kafka, Redpanda, or Debezium.
- Maintain a data snapshot to observe variations in sales prices over time, as Pathway provides a built-in feature to compute differences between two versions of the data.
- Beyond making data accessible via API, the LLM App allows you to relay processed data to other downstream connectors, such as BI and analytics tools. For instance, set it up to receive alerts upon detecting price shifts.
Published at DZone with permission of Bobur Umurzokov. See the original article here.