How To Use ChatGPT API in Python for Your Real-Time Data
Looking to make ChatGPT answer questions about topics it wasn't trained on? Here's a step-by-step tutorial on how to build an AI-powered app with a few lines of code.
OpenAI’s GPT has emerged as the foremost AI tool globally and is proficient at addressing queries based on its training data. However, it cannot answer questions about unknown topics:
- Recent events after September 2021
- Your non-public documents
- Information from past conversations
This task gets even more complicated when you deal with real-time data that changes frequently. Moreover, you cannot feed extensive content to GPT, nor can it retain your data over extended periods. In this case, you need to build a custom LLM (Large Language Model) app to efficiently give context to the answering process. This piece will walk you through the steps to develop such an application using the open-source LLM App library in Python. The source code is on GitHub (linked below in the section "Build a ChatGPT Python API for Sales").
Learning Objectives
You will learn the following throughout the article:
- The reason why you need to add custom data to ChatGPT
- How to use embeddings, prompt engineering, and ChatGPT for better question-answering
- Build your own ChatGPT with custom data using the LLM App
- Create a ChatGPT Python API for finding real-time discounts or sales prices
Why Provide ChatGPT With a Custom Knowledge Base?
Before jumping into the ways to enhance ChatGPT, let’s first explore the manual methods and identify their challenges. Typically, ChatGPT’s knowledge is extended through prompt engineering. Assume you want to find real-time discounts/deals/coupons from various online markets.
For example, when you ask ChatGPT “Can you find me discounts this week for Adidas men’s shoes?”, a standard response you get from the ChatGPT UI without custom knowledge is:
As evident, GPT offers general advice on locating discounts but lacks specificity regarding where or what type of discounts, among other details. Now, to help the model, we supplement it with discount information from a trustworthy data source. You must engage with ChatGPT by adding the initial document content prior to posting the actual questions. We will collect this sample data from the Amazon products deal dataset and insert only a single JSON item into the prompt:
As you can see, you get the expected output, and this is quite simple to achieve since ChatGPT is now context-aware. However, the issue with this method is that the model’s context is restricted (GPT-4’s maximum text length is 8,192 tokens). This strategy quickly becomes problematic when input data is huge: you may expect thousands of items discovered in sales, and you cannot provide this large amount of data as an input message. Also, once you have collected your data, you may want to clean, format, and preprocess it to ensure data quality and relevancy. Beyond the context limit, manual prompt engineering brings further challenges:
- Cost — By providing more detailed information and examples, the model’s performance might improve, though at a higher cost (for GPT-4 with an input of 10k tokens and an output of 200 tokens, the cost is $0.624 per prediction). Repeatedly sending identical requests can escalate costs unless a local cache system is utilized.
- Latency — A challenge with utilizing ChatGPT APIs for production, like those from OpenAI, is their unpredictability. There is no guarantee regarding the provision of consistent service.
- Security — When integrating custom plugins, every API endpoint must be specified in the OpenAPI spec for functionality. This means you’re revealing your internal API setup to ChatGPT, a risk many enterprises are skeptical of.
- Offline Evaluation — Conducting offline tests on code and data output or replicating the data flow locally is challenging for developers. This is because each request to the system may yield varying responses.
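As a sanity check on the cost figure quoted above, here is the arithmetic, assuming OpenAI's GPT-4-32K list prices at the time of writing ($0.06 per 1K input tokens, $0.12 per 1K output tokens); the prices are an assumption, not part of the original article:

```python
# Back-of-envelope cost per prediction for a large stuffed prompt
input_tokens, output_tokens = 10_000, 200
cost = input_tokens / 1000 * 0.06 + output_tokens / 1000 * 0.12
print(f"${cost:.3f} per prediction")  # $0.624
```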
Using Embeddings, Prompt Engineering, and ChatGPT for Question-Answering
A promising approach you will find on the internet is utilizing LLMs to create embeddings and then constructing your applications on top of these embeddings, such as search-and-ask systems. In other words, instead of querying ChatGPT directly through the Chat Completion endpoint, you would run the following query:
```
Given the following discounts data: {input_data}, answer this query: {user_query}.
```
The concept is straightforward. Rather than posting a question directly, the method first creates vector embeddings through the OpenAI API for each input document (text, image, CSV, PDF, or other types of data), then indexes the generated embeddings for fast retrieval and stores them in a vector database. It then leverages the user’s question to search for and obtain relevant documents from the vector database. These documents are presented to ChatGPT along with the question as a prompt. With this added context, ChatGPT can respond as if it has been trained on the internal dataset.
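To make the pattern concrete, here is a minimal, self-contained sketch using the pre-1.0 openai Python client and toy documents; none of this is the LLM App's code, and the model names are assumptions:

```python
import numpy as np
import openai  # pre-1.0 client; reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> np.ndarray:
    # Create one embedding vector per input text via the OpenAI Embeddings API
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in response["data"]])


documents = [
    "Deal on Crocs, 29% off, now $35.48",
    "Deal on Adidas men's shoes, 20% off this week",
]
doc_vectors = embed(documents)

question = "Can you find me discounts this week for Adidas men's shoes?"
query_vector = embed([question])[0]

# Retrieve the most relevant document by cosine similarity
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
best_doc = documents[int(np.argmax(scores))]

# Ask ChatGPT with the retrieved document as added context
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Given the following discounts data:\n{best_doc}\nanswer this query: {question}",
    }],
)
print(completion["choices"][0]["message"]["content"])
```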
On the other hand, if you use Pathway’s LLM App, you don’t even need a vector database. It implements real-time in-memory data indexing, reading data directly from any compatible storage, without having to query a vector document database, which comes with costs like increased prep work, infrastructure, and complexity. Keeping the source data and vectors in sync is painful, and it is even harder if the underlying input data changes over time and requires re-indexing.
ChatGPT With Custom Data Using LLM App
The simple steps below explain a data pipelining approach to building a ChatGPT app for your data with the LLM App.
- Collect: Your app reads the data from various data sources (CSV, JSON Lines, SQL databases, Kafka, Redpanda, Debezium, and so on) in real time when streaming mode is enabled with Pathway (or you can test data ingestion in static mode, too). It also maps each data row into a structured document schema to better manage large data sets.
- Preprocess: Optionally, you do easy data cleaning by removing duplicates, irrelevant information, and noisy data that could affect your responses’ quality, and extract the data fields you need for further processing. Also, at this stage, you can mask or hide private data to avoid it being sent to ChatGPT.
- Embed: Each document is embedded with the OpenAI API, and the embedding result is retrieved.
- Indexing: An index is constructed on the generated embeddings in real time.
- Search: Given a user question, say, from an API-friendly interface, an embedding is generated for the query with the OpenAI API. Using the embeddings, the vector index is searched by relevance to the query on the fly.
- Ask: The question and the most relevant document sections are inserted into a message to GPT, and GPT’s answer is returned (Chat Completion endpoint).
Build a ChatGPT Python API for Sales
Now that we have a clear picture of how the LLM App works from the previous section, you can follow the steps below to build a discount finder app. The project source code can be found on GitHub. If you want to start using the app quickly, you can skip this part, clone the repository, and run the code sample by following the instructions in the README.md file there.
Sample Project Objective
Inspired by this article on enterprise search, our sample app exposes an HTTP REST API endpoint in Python that answers user queries about current sales by retrieving the latest deals from various sources (CSV, JSON Lines, API, message brokers, or databases) and leverages the OpenAI API’s Embeddings and Chat Completion endpoints to generate AI assistant responses.
Step 1: Data Collection (Custom Data Ingestion)
For simplicity, we can use any JSON Lines file as a data source. The app takes JSON Lines files like discounts.jsonl and uses this data when processing user queries. The data source is expected to have a doc object on each line. Make sure that you convert your input data to JSON Lines first. Here is an example of a JSON Lines file with a single row:
{"doc": "{'position': 1, 'link': 'https://www.amazon.com/deal/6123cc9f', 'asin': 'B00QVKOT0U', 'is_lightning_deal': False, 'deal_type': 'DEAL_OF_THE_DAY', 'is_prime_exclusive': False, 'starts_at': '2023-08-15T00:00:01.665Z', 'ends_at': '2023-08-17T14:55:01.665Z', 'type': 'multi_item', 'title': 'Deal on Crocs, DUNLOP REFINED(\u30c0\u30f3\u30ed\u30c3\u30d7\u30ea\u30d5\u30a1\u30a4\u30f3\u30c9)', 'image': 'https://m.media-amazon.com/images/I/41yFkNSlMcL.jpg', 'deal_price_lower': {'value': 35.48, 'currency': 'USD', 'symbol': '$', 'raw': '35.48'}, 'deal_price_upper': {'value': 52.14, 'currency': 'USD', 'symbol': '$', 'raw': '52.14'}, 'deal_price': 35.48, 'list_price_lower': {'value': 49.99, 'currency': 'USD', 'symbol': '$', 'raw': '49.99'}, 'list_price_upper': {'value': 59.99, 'currency': 'USD', 'symbol': '$', 'raw': '59.99'}, 'list_price': {'value': 49.99, 'currency': 'USD', 'symbol': '$', 'raw': '49.99 - 59.99', 'name': 'List Price'}, 'current_price_lower': {'value': 35.48, 'currency': 'USD', 'symbol': '$', 'raw': '35.48'}, 'current_price_upper': {'value': 52.14, 'currency': 'USD', 'symbol': '$', 'raw': '52.14'}, 'current_price': {'value': 35.48, 'currency': 'USD', 'symbol': '$', 'raw': '35.48 - 52.14', 'name': 'Current Price'}, 'merchant_name': 'Amazon Japan', 'free_shipping': False, 'is_prime': False, 'is_map': False, 'deal_id': '6123cc9f', 'seller_id': 'A3GZEOQINOCL0Y', 'description': 'Deal on Crocs, DUNLOP REFINED(\u30c0\u30f3\u30ed\u30c3\u30d7\u30ea\u30d5\u30a1\u30a4\u30f3\u30c9)', 'rating': 4.72, 'ratings_total': 6766, 'page': 1, 'old_price': 49.99, 'currency': 'USD'}"}
The cool part is that the app is always aware of changes in the data folder. If you add another JSON Lines file, the LLM app does magic and automatically updates the AI model’s response.
Step 2: Data Loading and Mapping
With Pathway’s JSON Lines input connector, we read the local JSON Lines file, map data entries into a schema, and create a Pathway Table. See the full source code in app.py:
```python
...
sales_data = pw.io.jsonlines.read(
    "./examples/data",
    schema=DataInputSchema,
    mode="streaming",
)
```
Map each data row into a structured document schema. See the full source code in app.py:
```python
class DataInputSchema(pw.Schema):
    doc: str
```
Step 3: Data Embedding
Each document is embedded with the OpenAI API, and the embedding result is retrieved. See the full source code in embedder.py:
```python
...
embedded_data = embeddings(context=sales_data, data_to_embed=sales_data.doc)
```
Step 4: Data Indexing
Then we construct an instant index on the generated embeddings:
```python
index = index_embeddings(embedded_data)
```
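To build intuition for what such an index does, here is a toy in-memory version with cosine-similarity search. It is purely illustrative; the LLM App's real index is a different, more efficient implementation:

```python
import numpy as np


class SimpleIndex:
    """Toy in-memory vector index: stores normalized vectors, returns top-k docs."""

    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.docs: list[str] = []

    def add(self, vector: list[float], doc: str) -> None:
        v = np.asarray(vector, dtype=np.float32)
        self.vectors.append(v / np.linalg.norm(v))
        self.docs.append(doc)

    def search(self, query_vector: list[float], k: int = 3) -> list[str]:
        q = np.asarray(query_vector, dtype=np.float32)
        q /= np.linalg.norm(q)
        # Cosine similarity reduces to a dot product on normalized vectors
        scores = np.stack(self.vectors) @ q
        return [self.docs[i] for i in np.argsort(scores)[::-1][:k]]
```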
Step 5: User Query Processing and Embedding
We create a REST endpoint, take a user query from the API request payload, and embed the user query with the OpenAI API.
```python
...
query, response_writer = pw.io.http.rest_connector(
    host=host,
    port=port,
    schema=QueryInputSchema,
    autocommit_duration_ms=50,
)
embedded_query = embeddings(context=query, data_to_embed=pw.this.query)
```
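Once the pipeline is running, the endpoint can be exercised with an ordinary HTTP client. A hypothetical call, where the host, port, and route are assumptions based on the connector configuration above and the payload follows QueryInputSchema:

```python
import requests

# Ask the running API about current discounts
response = requests.post(
    "http://localhost:8080/",
    json={"query": "Find discounts this week for Adidas men's shoes"},
)
print(response.text)
```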
Step 6: Similarity Search and Prompt Engineering
We perform a similarity search by using the index to identify the most relevant matches for the query embedding. Then we build a prompt that merges the user’s query with the fetched relevant data results and send the message to the Chat Completion endpoint to produce a proper and detailed response.
```python
responses = prompt(index, embedded_query, pw.this.query)
```
We followed the same in-context learning approach when crafting the prompt and adding internal knowledge to ChatGPT in prompt.py:
prompt = f"Given the following discounts data: \\n {docs_str} \\nanswer this query: {query}"
Step 7: Return the Response
The final step is just to return the API response to the user.
```python
# Return the API response to the user
response_writer(responses)
```
Step 8: Put Everything Together
Now, if we put all the above steps together, you have an LLM-enabled Python API for custom discount data, ready to use, as you can see in the implementation in the app.py Python script:
```python
import pathway as pw

from common.embedder import embeddings, index_embeddings
from common.prompt import prompt


class DataInputSchema(pw.Schema):
    doc: str


class QueryInputSchema(pw.Schema):
    query: str


def run(host, port):
    # Given a user question as a query from your API
    query, response_writer = pw.io.http.rest_connector(
        host=host,
        port=port,
        schema=QueryInputSchema,
        autocommit_duration_ms=50,
    )

    # Real-time data coming from external data sources such as a JSON Lines file
    sales_data = pw.io.jsonlines.read(
        "./examples/data",
        schema=DataInputSchema,
        mode="streaming",
    )

    # Compute embeddings for each document using the OpenAI Embeddings API
    embedded_data = embeddings(context=sales_data, data_to_embed=sales_data.doc)

    # Construct an index on the generated embeddings in real time
    index = index_embeddings(embedded_data)

    # Generate embeddings for the query from the OpenAI Embeddings API
    embedded_query = embeddings(context=query, data_to_embed=pw.this.query)

    # Build a prompt using the indexed data
    responses = prompt(index, embedded_query, pw.this.query)

    # Feed the prompt to ChatGPT and obtain the generated answer
    response_writer(responses)

    # Run the pipeline
    pw.run()
```
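To start the pipeline, the script needs an entry point along these lines; the host and port literals are placeholders, as the repository wires them through its own configuration:

```python
if __name__ == "__main__":
    run(host="0.0.0.0", port=8080)
```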
(Optional) Step 9: Add an Interactive UI
To make your app more interactive and user-friendly, you can use Streamlit to build a front-end app. See the implementation in this app.py file.
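For illustration, a minimal front end along these lines would do; this is not the repository's actual Streamlit code, and the endpoint URL assumes the API runs locally on port 8080:

```python
import requests
import streamlit as st

# Hypothetical minimal UI: forward the user's question to the REST API
st.title("Discount finder")
question = st.text_input("Ask about current sales")
if question:
    response = requests.post("http://localhost:8080/", json={"query": question})
    st.write(response.text)
```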
Running the App
Follow the instructions in the "How to run the project" section of the README.md file (linked earlier), and you can start asking questions about discounts; the API will respond according to the discounts data source you have added.
After we give this knowledge to GPT through the UI (by applying a data source), look at how it replies:
The app takes both the Rainforest API and the discounts.csv file documents into account (merging data from these sources instantly), indexes them in real time, and uses this data when processing queries.
Further Improvements
We’ve only touched on a few capabilities of the LLM App by adding domain-specific knowledge like discounts to ChatGPT. There is more you can achieve:
- Incorporate additional data from external APIs, along with various files (such as JSON Lines, PDF, Doc, HTML, or Text format), databases like PostgreSQL or MySQL, and stream data from platforms like Kafka, Redpanda, or Debezium.
- Maintain a data snapshot to observe variations in sales prices over time, as Pathway provides a built-in feature to compute differences between two versions of the data.
- Beyond making data accessible via API, the LLM App allows you to relay processed data to other downstream connectors, such as BI and analytics tools. For instance, set it up to receive alerts upon detecting price shifts.
Published at DZone with permission of Bobur Umurzokov. See the original article here.