Supervised Fine-Tuning (SFT) on VLMs: From Pre-trained Checkpoints To Tuned Models

Generate QA pairs with a VLM, filter them using another VLM as a judge, then fine-tune your model on the cleaned data. No human labels needed.

By Akhilesh Kumar · May 26, 2025 · Tutorial

Vision-Language Models (VLMs) such as Llama 3.2 Vision are becoming increasingly powerful at understanding and generating text grounded in visual content. They excel at tasks like image captioning, visual question answering (VQA), and multimodal reasoning, which makes them useful across a wide range of real-world applications.

But while these models perform impressively out of the box, domain-specific or task-specific use cases often demand additional tuning. This is where Supervised Fine-Tuning (SFT) comes in. By fine-tuning a pre-trained VLM on curated image–question–answer (QA) pairs, we can significantly improve its performance for specific applications.
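
For concreteness, each training example in such an SFT set pairs an image with a question grounded in that image and its answer. A minimal record might look like this (the field names below are just an illustration, not a fixed schema):

Python
 
# One synthetic SFT record: an image reference plus a grounded QA pair.
example = {
    "image_path": "images/kitchen_01.jpg",  # hypothetical example path
    "question": "What appliance is on the counter?",
    "answer": "A silver espresso machine sits next to the sink."
}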

The catch? Collecting these QA pairs manually is time-consuming and expensive.

To get around this bottleneck, we turn to synthetic data generation. By using a powerful VLM to generate QA pairs and a second model to filter (or “judge”) the quality of the outputs, we can scale dataset creation without needing human annotators. This technique unlocks scalable fine-tuning for teams without access to large, labeled datasets or annotation pipelines.

Creating Synthetic Data for Fine-Tuning

Our pipeline consists of two core components:

  1. A large vision-language model to generate QA pairs
  2. A separate VLM to evaluate and filter those pairs

[Figure: the synthetic data generation pipeline for fine-tuning]

In this setup:

  • Teacher model: [meta-llama/Llama-3.2-90B-Vision-Instruct] – We use a powerful 90B parameter instruction-tuned vision-language model (VLM) to generate high-quality question-answer (QA) data. Leveraging a large-scale model for QA generation is critical, as it typically produces more coherent, contextually grounded, and instruction-following outputs. However, for the evaluation phase, it is equally important to use a separate, comparatively smaller model to serve as the Judge. This separation of roles ensures an unbiased assessment of the generated content. Using the same model family or even the same model instance for both generation and evaluation can lead to overfitting-like behavior, inflated scores, or biased judgments. Therefore, maintaining model diversity between the generator and the evaluator is a recommended best practice in multimodal QA pipelines.
  • Judge model: [Qwen/Qwen2.5-VL-7B-Instruct] – a 7B parameter vision-language model used to assess both the relevance and correctness of each QA pair. Importantly, this evaluator belongs to a different model family than the generator, which helps reduce bias and is widely regarded as a best practice for more objective and robust quality evaluation.
  • Optional: Human-in-the-Loop (HITL) Verification – While automated judgment using a smaller, distinct VLM is efficient and scalable, incorporating a human-in-the-loop adds an additional layer of quality assurance. In practice, this means randomly sampling a subset of the model-judged QA pairs and having human annotators review their relevance, correctness, and overall clarity. This not only validates the accuracy of the Judge model’s assessments but also helps calibrate or fine-tune its scoring thresholds over time. In high-stakes or production-grade systems, occasional human review ensures that edge cases or nuanced content—where even strong models may falter—are caught and corrected. Though optional, HITL is a valuable tool for maintaining long-term trust and reliability in the QA evaluation pipeline; a minimal sampling sketch follows this list.
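
As a rough sketch of that spot check (assuming the accepted pairs are collected in a list of dicts, as they are later in this article), a random subset can be exported for human annotators to verify:

Python
 
import json
import random

def sample_for_review(accepted_data, k=50, out_path="hitl_review_sample.jsonl"):
    # Dump a random subset of judge-approved QA pairs for human verification.
    subset = random.sample(accepted_data, min(k, len(accepted_data)))
    with open(out_path, "w") as f:
        for row in subset:
            f.write(json.dumps(row) + "\n")
    return subset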

Step 1: Load the QA Generator Model

Python
 
from transformers import AutoProcessor, MllamaForConditionalGeneration
import torch

llama_gen_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

# The processor bundles the tokenizer and the image preprocessor.
llama_processor = AutoProcessor.from_pretrained(llama_gen_id)

# Llama 3.2 Vision is a multimodal model, so it is loaded with the
# vision-capable model class rather than AutoModelForCausalLM.
llama_model = MllamaForConditionalGeneration.from_pretrained(
    llama_gen_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

We’ll use this model to generate synthetic QA data. Here’s the core function:

Python
 
from PIL import Image

def generate_qa(image: Image.Image):
    prompt = (
        "You are an AI that sees the image and explains it to a human.\n"
        "Generate a meaningful question someone might ask about this image, and provide an accurate answer.\n"
        "Respond in the format:\nQuestion: ...\nAnswer: ..."
    )
    # The chat template inserts the image token that Llama 3.2 Vision expects;
    # add_special_tokens=False avoids duplicating the BOS token it already adds.
    messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
    input_text = llama_processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = llama_processor(images=image, text=input_text, add_special_tokens=False, return_tensors="pt").to(llama_model.device)
    outputs = llama_model.generate(**inputs, max_new_tokens=150)
    # Decode only the newly generated tokens; the prompt itself contains
    # "Question: ... Answer: ...", which would confuse the parser downstream.
    generated = outputs[:, inputs["input_ids"].shape[1]:]
    output_text = llama_processor.batch_decode(generated, skip_special_tokens=True)[0]
    return output_text.strip()
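
As a quick sanity check before wiring up the full pipeline, you can run the generator on a single local image (sample.jpg here is just a hypothetical test file) and confirm the output follows the requested format:

Python
 
sample = Image.open("sample.jpg").convert("RGB")  # hypothetical test image
print(generate_qa(sample))
# Expected shape of the output (the content will vary):
# Question: What appliance is on the counter?
# Answer: A silver espresso machine sits next to the sink.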


Step 2: Load the VLM Judge Model

We now load a smaller model to act as a “judge” that filters out low-quality QA pairs.

Python
 
from transformers import AutoModelForVision2Seq

qwen_id = "Qwen/Qwen2.5-VL-7B-Instruct"
qwen_processor = AutoProcessor.from_pretrained(qwen_id)
qwen_model = AutoModelForVision2Seq.from_pretrained(
    qwen_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)


The judge evaluates whether the generated answer correctly addresses the question in context of the image:

Python
 
def judge_qwen(image: Image.Image, question: str, answer: str):
    judge_prompt = (
        "You are an expert evaluator. Assess if the answer is correct and relevant to the question.\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        "Reply with only 'yes' or 'no'."
    )
    # The chat template inserts Qwen's image placeholder tokens for us.
    messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": judge_prompt}]}]
    text = qwen_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = qwen_processor(text=[text], images=[image], return_tensors="pt").to(qwen_model.device)
    output = qwen_model.generate(**inputs, max_new_tokens=10)
    # Keep only the generated tokens; the prompt itself contains the word "yes",
    # so decoding the full sequence would make this check always pass.
    generated = output[:, inputs["input_ids"].shape[1]:]
    response = qwen_processor.batch_decode(generated, skip_special_tokens=True)[0].strip().lower()
    return response.startswith("yes")

Prompt Tuning

In both generate_qa() and judge_qwen(), the prompt is the quiet driver behind output quality. In generate_qa(), the instruction to “generate a meaningful question… and provide an accurate answer” guides the model toward thoughtful, grounded outputs. Changing that phrasing—even slightly—can lead to less relevant or overly generic QA pairs. In judge_qwen(), the prompt explicitly asks for a simple “yes” or “no” on correctness and relevance. Without this clear instruction, the model might generate vague or verbose responses that derail automated evaluation. These examples show how critical prompt tuning is. The model may be powerful, but without clear, task-aligned prompts, the pipeline won’t behave as expected. Small wording changes can make or break consistency—so tuning prompts is as essential as picking the right models.
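
One lightweight way to manage this (purely illustrative; these variants are not from the original experiments) is to keep candidate prompts in a small registry so that wording changes can be compared side by side without editing the pipeline functions:

Python
 
# Hypothetical prompt registry for comparing wording variants.
QA_PROMPT_VARIANTS = {
    "default": (
        "You are an AI that sees the image and explains it to a human.\n"
        "Generate a meaningful question someone might ask about this image, "
        "and provide an accurate answer.\n"
        "Respond in the format:\nQuestion: ...\nAnswer: ..."
    ),
    "specific": (
        "Ask one specific, non-trivial question about this image and answer it "
        "precisely, using only what is visible.\n"
        "Respond in the format:\nQuestion: ...\nAnswer: ..."
    ),
}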

Step 3: Generate and Filter QA Pairs

With both models loaded, we walk through a folder of images, generate a QA pair for each one, and keep only the pairs the judge accepts:

Python
 
import os
from PIL import Image

image_folder = "/path/to/images"
accepted_data = []

def parse_qa(text):
    if "Question:" in text and "Answer:" in text:
        q = text.split("Question:")[1].split("Answer:")[0].strip()
        a = text.split("Answer:")[1].strip()
        return q, a
    return None, None

for file in os.listdir(image_folder):
    if not file.lower().endswith((".png", ".jpg", ".jpeg")):
        continue

    image_path = os.path.join(image_folder, file)
    image = Image.open(image_path).convert("RGB")

    try:
        qa_text = generate_qa(image)
        question, answer = parse_qa(qa_text)

        if not question or not answer:
            continue

        if judge_qwen(image, question, answer):
            accepted_data.append({
                "image_path": image_path,
                "question": question,
                "answer": answer
            })

    except Exception as e:
        print(f"Error processing {file}: {e}")


Conclusion

Fine-tuning Vision-Language Models no longer requires massive investments in human-annotated data. By combining a high-quality generator (LLaMA-3.2-90B-Vision-Instruct) with a competent evaluator (Qwen2.5-VL-7B-Instruct), we can produce large-scale, high-quality QA datasets fully automatically.

This cleaned and validated synthetic dataset can then be used to fine-tune a smaller model—like LLaMA-3.2-11B-Vision—to inherit task-specific capabilities with lower training cost and faster inference.
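
As a sketch of that hand-off (assuming the Hugging Face datasets library; the exact collator and trainer depend on the student model you choose), the accepted pairs can be reshaped into the chat format most VLM fine-tuning recipes expect:

Python
 
from datasets import Dataset

def to_chat_example(row):
    # One user turn (image + question) and one assistant turn (answer).
    return {
        "image_path": row["image_path"],
        "messages": [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": row["question"]},
            ]},
            {"role": "assistant", "content": [{"type": "text", "text": row["answer"]}]},
        ],
    }

train_dataset = Dataset.from_list([to_chat_example(r) for r in accepted_data])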

This teacher–judge–student paradigm opens the door to scalable model customization, making it possible to adapt powerful VLMs for specialized domains like medical imaging, robotics, or enterprise knowledge systems—without needing armies of annotators or proprietary data.

Have a really great day!
