Improving the Capabilities of LLM-Based Analytics Copilots With Semantic Search and Fine-Tuning

Learn how LLMs can be deployed to make critical decisions like domain-specific question-answering, SQL generation needed for data retrieval, and more.

By Meghana Puvvadi · Jul. 22, 24 · Tutorial

Picture this: You're an analyst drowning in a sea of data, trying to make sense of complex attribution models and customer journeys. Wouldn't it be great if you had a super-smart AI assistant that could instantly answer your questions, generate SQL queries on the fly, and break down complex tabular data? Well, that's exactly what we're working on with large language model (LLM)-based analytics copilots. But as with any cutting-edge tech, it's not all smooth sailing. Let's dive into the challenges we faced and the cool solutions we came up with to make these AI assistants truly shine.

The LLM Conundrum: Brilliant, but Flawed

First things first: let's talk about why we're so excited about using LLMs in analytics. These language models are like the Swiss Army knives of the AI world – they can tackle a wide range of tasks, from answering questions to generating code. For us analysts, that means:

  • Less time spent digging through dashboards and reports
  • More flexible insights that go beyond static visualizations
  • Quicker problem-solving and decision-making

Sounds great, right? But here's the catch: LLMs aren't perfect. They've got some quirks that can make them a bit tricky to work with:

  • They've got memory limits: a fixed context window (imagine trying to read "War and Peace," but forgetting the beginning by the time you reach the end).
  • Sometimes they confidently spout nonsense (we call this "hallucination" – it's less fun than it sounds).
  • They're not great with numbers (which is kind of important in analytics).
  • It can be hard to understand why they give certain outputs.
  • They can be biased (just like us humans, unfortunately).

So, we set out on a mission to overcome these challenges and create analytics copilots that are actually useful in the real world. Our secret weapons? Semantic search and fine-tuning. Let's break it down.

Semantic Search: Teaching Our AI To Find the Right Context

Imagine you're at a huge library, trying to find the answer to a specific question. You could read every book, or you could ask a librarian who knows exactly where to look. Semantic search is like giving our LLM its own super-librarian.

Here's how we did it:

  1. We built a knowledge base by scraping relevant websites and documents.
  2. We chopped up this info into bite-sized chunks.
  3. We used fancy math (okay, it's called "embedding") to turn these chunks into vectors of numbers that represent their meaning.
  4. We stored all this in a vector index, a special kind of database that can quickly find similar chunks.
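
To make step 2 concrete, here's a minimal chunking sketch. It's an illustration only: the `chunk_text` helper, the word-based splitting, and the chunk and overlap sizes are all assumptions, not details taken from the pipeline described here.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks of up to `chunk_size` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.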

When someone asks a question, we use the same embedding magic on their query, find the most relevant chunks in our database, and feed that context to the LLM. It's like giving the AI a cheat sheet before it answers the question.

Here's a simplified Python code snippet to give you an idea of how this works:


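```python
from sentence_transformers import SentenceTransformer
from faiss import IndexFlatL2
import numpy as np

# Placeholder inputs so the snippet runs on its own; swap in your real chunks and query
document_chunks = [
    "Attribution models assign conversion credit to marketing touchpoints.",
    "Display ads are one of several channels in a customer journey.",
]
user_query = "How is attribution credit assigned?"

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create a FAISS index
index = IndexFlatL2(384)  # 384 is the embedding dimension for this model

# Embed all chunks in one batch and add them to the index (FAISS expects float32)
chunk_embeddings = model.encode(document_chunks)
index.add(np.asarray(chunk_embeddings, dtype=np.float32))

# When we get a query, embed it and find similar chunks
query_embedding = model.encode([user_query])
k = min(5, len(document_chunks))  # never ask for more neighbors than we indexed
distances, indices = index.search(np.asarray(query_embedding, dtype=np.float32), k)

# Use the most relevant chunks (up to 5) as context for the LLM
relevant_context = [document_chunks[i] for i in indices[0]]
```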
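
The retrieved chunks still have to reach the LLM as part of the prompt. Here's one way that could look, as a rough sketch: the `build_prompt` helper, the instruction wording, and the `max_chars` budget are assumptions for illustration, not the exact prompt behind the results below.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 4000) -> str:
    """Assemble a grounded prompt from retrieved chunks, most relevant first."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stay inside the (assumed) context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the chunks makes it easy to ask the model to cite which chunk an answer came from, and the character budget is a crude stand-in for a proper token count.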


We tested this setup with different LLMs (GPT-4, Falcon-40B, and Llama-2-70b) and different embedding models. The results were pretty exciting:

  • GPT-4 with semantic search was the top performer.
  • Llama-2-70b was nipping at its heels (and it's open-source, which is cool).
  • Some open-source embedding models held their own against the fancy proprietary ones.

Fine-Tuning: Teaching Old LLMs New Tricks

While semantic search helped with question-answering, we still had two big problems to solve: generating SQL queries and analyzing tabular data. This is where fine-tuning came to the rescue.

Fine-tuning is like sending your LLM to a specialized training camp. We take a pre-trained model and give it additional training on specific tasks. It's like teaching a chess champion how to play poker – they already understand game strategy, but now they're learning the specific rules and tactics of a new game.

SQL Query Generation: From Natural Language to Database Speak

For SQL generation, we used a dataset called b-mc2/sql-create-context from Hugging Face. Each example pairs a natural language question and a table schema (a CREATE TABLE statement) with the SQL query that answers it. Here's what a typical example looks like:

Question: How many heads of the departments are older than 56?

Context: CREATE TABLE head (age INTEGER)


Answer: SELECT COUNT(*) FROM head WHERE age > 56
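
A quick way to sanity-check a pair like this is to run the answer against the schema from the context. The sketch below does that with Python's built-in `sqlite3`; the `run_example` helper and the ages inserted into the table are made up for illustration and are not part of the dataset.

```python
import sqlite3

def run_example(context_sql, answer_sql, ages):
    """Create the table from the context, load sample ages, and run the answer query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(context_sql)
        conn.executemany("INSERT INTO head (age) VALUES (?)", [(a,) for a in ages])
        return conn.execute(answer_sql).fetchall()
    finally:
        conn.close()

result = run_example(
    "CREATE TABLE head (age INTEGER)",
    "SELECT COUNT(*) FROM head WHERE age > 56",
    ages=[45, 57, 61, 56],  # hypothetical rows
)
print(result)  # [(2,)]
```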


We fine-tuned our models on thousands of examples like this. The results were mind-blowing:

  • GPT-4 (which we couldn't fine-tune) got about 64.5% accuracy with few-shot learning.
  • Our fine-tuned open-source models jumped from less than 30% accuracy to over 80%!
  • The big Llama-2-70b model showed tons of potential even without fine-tuning.
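
As a rough illustration of how an accuracy figure like these can be computed, the sketch below scores a predicted query by execution match: it counts as correct if it returns the same rows as the reference query on a scratch database. This is one common choice and an assumption on our part here, as are the `execution_match` and `accuracy` helper names; it is not a description of the exact metric behind the numbers above.

```python
import sqlite3

def execution_match(schema_sql, gold_sql, predicted_sql, rows_sql=()):
    """Return True if both queries produce the same rows on a scratch database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(schema_sql)
        for stmt in rows_sql:
            conn.execute(stmt)  # load sample rows
        gold = conn.execute(gold_sql).fetchall()
        try:
            predicted = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # SQL that fails to run counts as a miss
        # Compare as unordered collections of rows
        return sorted(gold, key=repr) == sorted(predicted, key=repr)
    finally:
        conn.close()

def accuracy(examples):
    """Fraction of examples whose predicted query matches the gold query."""
    if not examples:
        return 0.0
    return sum(execution_match(**ex) for ex in examples) / len(examples)
```

Execution match is more forgiving than comparing SQL strings, since two differently written queries can be equally correct, but it depends on the sample rows being varied enough to tell right from wrong.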

Here's a simplified look at how we did the fine-tuning:

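```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

# The dataset ships a single "train" split, so hold out part of it for evaluation
# (10% here, an illustrative choice)
dataset = load_dataset("b-mc2/sql-create-context")["train"].train_test_split(test_size=0.1)

def tokenize_function(examples):
    # Join question, schema, and answer into one training string per example
    texts = [
        f"Question: {q}\nContext: {c}\nAnswer: {a}"
        for q, c, a in zip(examples["question"], examples["context"], examples["answer"])
    ]
    # max_length=256 is an illustrative cap, not a tuned value
    return tokenizer(texts, padding="max_length", truncation=True, max_length=256)

tokenized_datasets = dataset.map(
    tokenize_function, batched=True, remove_columns=["question", "context", "answer"]
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    # Copies input_ids into labels so the causal LM loss can be computed
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()
```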


Tabular Data Analysis: Making Sense of the Numbers

For tabular analysis, we created a dataset specifically for attribution. We wanted our models to explain changes in attribution credit for different touchpoints. Here's an example of what our data looked like:

```
model_name: lead
channel: display
absolute_change: -82
targeting_quality: 63
contact_frequency: -4
ad_cannibalization: -33
```
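
Before a record like this goes into a prompt or a training example, it helps to have it in structured form. The sketch below parses the key-value lines and ranks the contributing factors by magnitude. The `parse_record` and `rank_drivers` helpers are hypothetical names introduced for illustration, and treating every field other than `model_name`, `channel`, and `absolute_change` as a driver is an assumption.

```python
RECORD = """\
model_name: lead
channel: display
absolute_change: -82
targeting_quality: 63
contact_frequency: -4
ad_cannibalization: -33
"""

def parse_record(text):
    """Parse 'key: value' lines into a dict, converting integer values."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        try:
            record[key.strip()] = int(value)
        except ValueError:
            record[key.strip()] = value  # keep non-numeric fields as strings
    return record

def rank_drivers(record, skip=("model_name", "channel", "absolute_change")):
    """Order the contributing factors from largest to smallest absolute effect."""
    drivers = {k: v for k, v in record.items() if k not in skip}
    return sorted(drivers.items(), key=lambda kv: abs(kv[1]), reverse=True)

record = parse_record(RECORD)
print(rank_drivers(record))
# [('targeting_quality', 63), ('ad_cannibalization', -33), ('contact_frequency', -4)]
```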


We then fine-tuned our models to generate explanations like this:

"The display channel for the lead model has seen a significant decrease in attribution credit (-82%). This is primarily due to improved targeting quality (63%), which suggests that while fewer impressions are needed, they are more effective. However, this positive effect is partially offset by increased ad cannibalization (-33%), indicating some overlap in the audience reached by different channels. The contact frequency had a minimal impact (-4%)."

The results were fascinating:

  • GPT-4 with basic prompt engineering hit about 70% accuracy.
  • A special setup, LangChain's pandas DataFrame agent (the "pandas agent"), with GPT-4 nailed 100% accuracy.
  • Our fine-tuned open-source models went from less than 10% accuracy to roughly 80-90%!
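
Since LLMs are shaky with numbers, one cheap guardrail is to check that a generated explanation actually quotes the figures from the input record. The sketch below is a crude check of that kind. The `numbers_in` and `is_faithful` helpers are hypothetical, and this is not the accuracy measure behind the results above.

```python
import re

def numbers_in(text):
    """Collect every signed integer that appears in the text."""
    return {int(m) for m in re.findall(r"-?\d+", text)}

def is_faithful(explanation, record):
    """True if every numeric value in the record is quoted in the explanation."""
    expected = {v for v in record.values() if isinstance(v, int)}
    return expected <= numbers_in(explanation)
```

A check like this catches dropped or altered figures, but not a wrong interpretation of a correct figure, so it complements human review rather than replacing it.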

Here's a peek at how we used the pandas agent:

```python
# Import paths vary by LangChain version; these match recent releases
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain_openai import ChatOpenAI
import pandas as pd

# Load your data into a pandas DataFrame
df = pd.read_csv("attribution_data.csv")

# Create the pandas DataFrame agent. It runs model-generated Python,
# so recent releases require the explicit allow_dangerous_code opt-in
agent = create_pandas_dataframe_agent(
    ChatOpenAI(model="gpt-4", temperature=0), df, verbose=True, allow_dangerous_code=True
)

# Ask the agent to analyze the data
response = agent.invoke("Explain the changes in attribution credit for the display channel in the lead model, considering targeting quality, contact frequency, and ad cannibalization.")

print(response["output"])
```


Conclusion

By combining semantic search and fine-tuning, we've managed to supercharge our analytics copilots, making them more accurate, reliable, and useful. The journey wasn't easy, but the results speak for themselves. With these advanced techniques, we’re paving the way for smarter, more efficient analytics tools.

More details can be found in the original paper by my team.
