Fine-Tune SLMs for Free: From Google Colab to Ollama in 7 Steps

Want your own custom SLM but don't have a GPU box humming under your desk? This article shows why you don't actually need one.

Sai Teja Erukude

Jan. 05, 26 · Tutorial


In this article, I'll walk through a practical pipeline that:

Fine-tunes a popular open-source base small language model on your own data using Unsloth on Google Colab (free T4 GPU)
Exports the result to GGUF via llama.cpp
Deploys it to Ollama and, optionally, pushes it to the Ollama registry so that you can run ollama pull my-model from anywhere

We'll put this into practice with a real-world example: a "multi-agent orchestrator," built in seven concrete steps.

The Stack at a Glance

Here's the toolchain we’re wiring together:

Data: JSON (or JSONL) instruction data, hand-crafted or LLM-generated
Fine-tuning: Unsloth on Google Colab T4 (free tier)
Checkpoints: LoRA adapters (light-weight fine-tune weights)
Conversion: llama.cpp to produce GGUF, the format used by llama.cpp/Ollama
Serving: Ollama with a Modelfile

No in-house GPU. Just Colab + your local machine (CPU is enough).

Step 1: Curate Your Training Data in JSON

Your model is only as good as its data. For instruction-tuned chat models, a simple and effective schema is:

    JSON
   
 

   [
  {
    "agent_registry": [
      {
        "name": "researcher",
        "role": "Collects and summarizes information from the web and internal docs.",
        "cost": "medium",
        "latency": "medium"
      },
      {
        "name": "coder",
        "role": "Writes and fixes code, runs tests, and explains changes.",
        "cost": "high",
        "latency": "high"
      },
      {
        "name": "critic",
        "role": "Reviews outputs from other agents for quality, correctness, and clarity.",
        "cost": "low",
        "latency": "low"
      }
    ],
    "state": {
      "task": "Write a short technical blog post explaining what multi-agent LLM systems are.",
      "status": "in_progress",
      "step": 1,
      "max_steps": 8,
      "context": {
        "user_preferences": {
          "tone": "friendly",
          "audience": "intermediate developers"
        }
      },
      "history": []
    },
    "target_action": {
      "action": "call_agent",
      "target": "researcher",
      "arguments": {
        "query": "Explain what multi-agent LLM systems are, list 3–4 key benefits, and mention common frameworks."
      },
      "final_answer": null,
      "reason": "We need background information before planning the blog post."
    }
  },
  ...
]
  

A few practical tips:

Keep the format consistent. Decide on fields (agent_registry, state, and target_action) and stick to them.
Generate synthetic data when useful. You can bootstrap your dataset by prompting ChatGPT or another LLM to generate domain-specific Q&A or instructions.
Filter aggressively. Remove duplicates, contradictory answers, and very low-quality or off-topic data.
Start small, iterate fast. Even a few hundred high-quality examples can make a noticeable difference when fine-tuning with LoRA/QLoRA.

Save the dataset as train_data.json (the filename the loading code in Step 2 expects) and upload it to Colab.
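The consistency and filtering tips above can be partly automated. The following is a minimal sketch, assuming the three-field schema shown earlier. The helper name clean_dataset and the sample records are hypothetical, and it only catches missing fields and exact duplicates; spotting contradictory or low-quality examples still needs a manual or LLM-assisted pass.

```python
import json

REQUIRED_FIELDS = ("agent_registry", "state", "target_action")


def clean_dataset(records):
    """Drop records with missing fields and exact duplicates (hypothetical helper)."""
    seen = set()
    cleaned = []
    for record in records:
        # Skip records that do not follow the agreed schema.
        if not all(field in record for field in REQUIRED_FIELDS):
            continue
        # A canonical JSON string makes duplicates comparable regardless of key order.
        key = json.dumps(record, sort_keys=True, ensure_ascii=False)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Tiny illustrative sample: one valid record, one duplicate, one malformed record.
sample = [
    {"agent_registry": [], "state": {"task": "demo"}, "target_action": {"action": "finish"}},
    {"target_action": {"action": "finish"}, "state": {"task": "demo"}, "agent_registry": []},
    {"state": {"task": "missing fields"}},
]
cleaned_sample = clean_dataset(sample)
print(f"kept {len(cleaned_sample)} of {len(sample)} records")
```

On a real dataset, you would load the JSON array with json.load, pass it through clean_dataset, and write the result back out before uploading.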

Step 2: Fine-Tune With Unsloth on Google Colab

What Is Unsloth?

Unsloth is an open-source library focused on efficient fine-tuning and reinforcement learning for LLMs. It supports popular models such as Llama, Gemma, Qwen, Mistral, DeepSeek, and more, and its maintainers report roughly 2x faster training with up to 70% less VRAM than typical approaches.

The official docs emphasize that you can fine-tune for free on Colab or Kaggle, or locally, with as little as ~3 GB of VRAM using ready-made notebooks.

1. Set Up a Colab Notebook

Open a new Colab notebook.
Go to Runtime → Change runtime type → GPU → T4.
Install Unsloth and other dependencies.
!pip install unsloth trl peft accelerate bitsandbytes

2. Choose a Base Model

Pick a base model that Unsloth supports, and that also has llama.cpp/Ollama support. Common choices:

meta-llama/Meta-Llama-3-8B-Instruct
Qwen/Qwen2.5-7B-Instruct
google/gemma-2-9b-it
A smaller 3B-4B model if your dataset is small or your hardware is very tight.

3. Load the Base Model With Unsloth

A typical pattern in Unsloth looks like:

    Python
   
 

from unsloth import FastLanguageModel
import torch

model_name = "unsloth/Llama-3.2-1B-Instruct"  # must match the base model pulled in Ollama later

max_seq_length = 2048  # Choose sequence length
dtype = None           # Auto detection

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=True,
)
  

Then you wrap it with LoRA/PEFT:

    Python
   
 

   # Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=64,  # LoRA rank - higher = more capacity, more memory
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=128,  # LoRA scaling factor (usually 2x rank)
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",     # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized version
    random_state=3407,
    use_rslora=False,  # Rank stabilized LoRA
    loftq_config=None, # LoftQ
)
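
To get a feel for what the rank setting costs, the number of trainable LoRA parameters can be estimated by hand: each adapted weight matrix of shape (in, out) gains two small matrices with rank * (in + out) parameters in total. The sketch below assumes layer shapes in the style of Llama-3.2-1B. Those numbers are an assumption for illustration, not something taken from this article, so read the real values from model.config for your own base model.

```python
def lora_param_count(rank, layer_shapes, num_layers):
    """Count trainable LoRA parameters: each adapted (in, out) matrix adds rank * (in + out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


# Assumed (in_features, out_features) per target module for a Llama-3.2-1B-style layer.
# Illustrative values only; check model.config for the model you actually load.
ASSUMED_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}
ASSUMED_NUM_LAYERS = 16

for rank in (16, 64):
    total = lora_param_count(rank, ASSUMED_SHAPES, ASSUMED_NUM_LAYERS)
    print(f"r={rank}: {total:,} trainable LoRA parameters")

# Without rank-stabilized LoRA, the update is scaled by lora_alpha / r.
scaling = 128 / 64
print(f"scaling factor: {scaling}")
```

Under these assumptions, raising the rank from 16 to 64 quadruples the adapter size, which is worth keeping in mind if a small dataset starts to overfit.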
  

4. Load Your JSON Dataset

    Python
   
 

import json
from datasets import Dataset

# 1) Load your JSON array file
with open("train_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)  # list of dicts

SYSTEM_PROMPT = """You are an ORCHESTRATOR, a supervisor coordinating a team of AI agents and tools.
You NEVER solve the task directly. You ONLY decide the next action.
Always respond with a single JSON object describing the next action.
"""

def format_example(example):
    agent_registry = example["agent_registry"]
    state = example["state"]
    target_action = example["target_action"]

    system_message = SYSTEM_PROMPT.strip()

    user_message_content = f"""
      Available agents (registry):
      {json.dumps(agent_registry, ensure_ascii=False, indent=2)}

      Current task state:
      {json.dumps(state, ensure_ascii=False, indent=2)}

      Decide the next action as JSON only, using this schema:
      {{
        "action": "call_agent" | "call_tool" | "ask_user" | "finish",
        "target": string or null,
        "arguments": object,
        "final_answer": string or null,
        "reason": string
      }}
    """
    
    assistant_message_content = json.dumps(target_action, ensure_ascii=False, separators=(",", ":"))

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message_content.strip()},
        {"role": "assistant", "content": assistant_message_content}
    ]

    # Apply the tokenizer's chat template to build the training text.
    # add_generation_prompt=False because this is a complete conversation turn.
    # The Llama 3 chat template already closes the assistant turn with <|eot_id|>,
    # so no extra end-of-text marker is appended here.
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

# 2) Build the formatted text list
# Make sure tokenizer is available in global scope before this call
formatted_data = [format_example(item) for item in data]

# 3) Create the HF Dataset
dataset = Dataset.from_dict({"text": formatted_data})
  

5. SFTTrainer Setup

Feed the model and dataset you just created into a Supervised Fine-Tuning (SFT) trainer, SFTTrainer, with arguments that work well with Unsloth. Start with a small experiment:

Batch size: small (e.g., 1-4 per GPU, with gradient accumulation)
Learning rate: 1e-4 to 2e-4 for LoRA
Epochs: 1–3 to start

    Python
   
 

from trl import SFTTrainer
from transformers import TrainingArguments

# Training arguments optimized for Unsloth
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 8
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=2,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        save_strategy="epoch",
        save_total_limit=2,
        dataloader_pin_memory=False,
        report_to="none", # Disable Weights & Biases logging
    ),
)
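
Before starting a run, it helps to know roughly how many optimizer steps these settings produce, since that determines how much of the run warmup_steps covers and how often logging_steps fires. The sketch below is plain arithmetic, not a trainer API. The dataset size of 500 is a hypothetical number, and the exact count can differ by a step or two depending on how the trainer handles the last partial batch.

```python
import math


def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective_batch_size, approximate_optimizer_steps) for a simple run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# 500 examples is a hypothetical dataset size; the other values mirror the config above.
effective_batch, total_steps = training_steps(
    num_examples=500, per_device_batch=2, grad_accum=4, epochs=3
)
print(f"effective batch size: {effective_batch}, optimizer steps: {total_steps}")
```

With only a few hundred examples, the total step count is small, so a warmup of 10 steps already takes up a visible share of the schedule.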
  

6. Train the Model

Let training run on the free Colab T4. Once it finishes, you'll have a LoRA-fine-tuned model in memory.

    Python
   
trainer_stats = trainer.train()

7. Test the Model

    Python
   
 

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Test prompt
messages = [
  {"role": "system", "content": SYSTEM_PROMPT.strip()},
  {
    "role": "user",
    "content": """
      Available agents (registry):
      {
        "agents": [
          {
            "name": "researcher",
            "role": "Collects and summarizes information from the web and docs.",
            "cost": "medium",
            "latency": "medium"
          },
          {
            "name": "coder",
            "role": "Writes and fixes code, runs tests, and explains changes.",
            "cost": "high",
            "latency": "high"
          }
        ]
      }

      Current task state:
      {
        "task": "Write a short explanation of what multi-agent systems are.",
        "status": "in_progress",
        "step": 1,
        "max_steps": 5,
        "context": null,
        "history": []
      }

      Decide the next action as JSON only, using this schema:
      {
        "action": "call_agent" | "call_tool" | "ask_user" | "finish",
        "target": string or null,
        "arguments": object,
        "final_answer": string or null,
        "reason": string
      }
    """
  }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True, # This adds the <|start_header_id|>assistant<|end_header_id|>
    return_tensors="pt",
).to("cuda")

# Generate response
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    use_cache=True,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

# Decode and print only the generated part
# Get the length of the input tokens
input_length = inputs.shape[1]

# Decode only the generated tokens (after the input_length)
generated_tokens = outputs[0, input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(response)
  

Inference result:

    JSON
   
 

   {
    "action": "call_agent",
    "target": "researcher",
    "arguments": {
        "query": "Explain what multi-agent systems are. Choose 3-5 key points."
    },
    "final_answer": null,
    "reason": "The researcher agent should gather and summarize foundational information before the other agents are called."
}
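
Because the orchestrator's output is meant to be consumed by other code, it is worth checking each response against the action schema before acting on it. The following is a minimal sketch using only the standard library. The helper name parse_action is hypothetical, and the checks cover just the fields and action names listed in the prompt.

```python
import json

ALLOWED_ACTIONS = {"call_agent", "call_tool", "ask_user", "finish"}
REQUIRED_KEYS = {"action", "target", "arguments", "final_answer", "reason"}


def parse_action(raw_text):
    """Parse model output and check it against the action schema (hypothetical helper)."""
    action = json.loads(raw_text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(action, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - action.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if action["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action['action']!r}")
    if not isinstance(action["arguments"], dict):
        raise ValueError("'arguments' must be an object")
    return action


example_output = """
{"action": "call_agent", "target": "researcher",
 "arguments": {"query": "Explain what multi-agent systems are."},
 "final_answer": null, "reason": "Gather background first."}
"""
parsed = parse_action(example_output)
print(parsed["action"], "->", parsed["target"])
```

In the notebook, you would call parse_action(response) on the decoded text; a ValueError is a useful signal that the model needs more or cleaner training examples.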
  

Practical Limitations

This pipeline works well for models of roughly 1B-10B parameters on a free T4 GPU. Larger LLMs (13B, 70B, and beyond) may not fit into Colab's VRAM, even with 4-bit loading and LoRA. To fine-tune those, expect to need a paid GPU tier (A100, H100, or RTX 4090/6000) or a cloud provider with sufficient VRAM.
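
A quick back-of-the-envelope calculation shows where the limit comes from. The sketch below estimates only the memory needed to hold the quantized weights; activations, LoRA gradients, optimizer state, and CUDA overhead come on top, so treat the result as a lower bound. The 16 GB figure is the nominal VRAM of a T4, and the usable amount on Colab is somewhat lower.

```python
def weight_memory_gb(num_params_billion, bits_per_weight):
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes)."""
    bytes_total = num_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


T4_VRAM_GB = 16  # nominal VRAM of a T4; the usable amount on Colab is somewhat lower

for size in (1, 3, 8, 13, 70):
    needed = weight_memory_gb(size, bits_per_weight=4)
    print(f"{size:>2}B params at 4-bit: ~{needed:.1f} GB of weights")
```

By this estimate, the weights of a 70B model exceed the card on their own, while a 13B model leaves little headroom once training overhead and longer sequences are added.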

Step 3: Save LoRA Weights in Colab (Don't Merge Yet)

Colab's free tier typically gives you around 12.7 GB of CPU RAM, which often isn’t enough to merge LoRA adapters into the full base model comfortably. Merges tend to spike RAM usage and cause the runtime to crash. Here's what I do to get around this problem:

Save the LoRA adapter and tokenizer as separate artifacts.
Download them locally and finish the conversion on your own machine.

    Python
   
 

   # Save the LoRA adapter and tokenizer
save_dir = "lora-out"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Download the LoRA files to your local machine
import os
from google.colab import files

if os.path.exists(save_dir) and os.path.isdir(save_dir):
    print(f"Downloading files from '{save_dir}':")
    downloaded_any = False
    for filename in os.listdir(save_dir):
        filepath = os.path.join(save_dir, filename)
        if os.path.isfile(filepath):
            print(f"  - {filename}")
            files.download(filepath)
            downloaded_any = True
    if not downloaded_any:
        print(f"No files found in '{save_dir}' to download.")
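
Downloading files one by one can trigger a browser prompt for each file. An alternative is to zip the directory first and download a single archive. The sketch below uses shutil.make_archive from the standard library. The helper name zip_directory is hypothetical, and the demo builds a temporary stand-in for lora-out so that it runs anywhere.

```python
import os
import shutil
import tempfile


def zip_directory(source_dir, archive_name):
    """Create <archive_name>.zip from the contents of source_dir and return its path."""
    return shutil.make_archive(archive_name, "zip", root_dir=source_dir)


# Self-contained demo with a temporary stand-in for the "lora-out" directory.
demo_root = tempfile.mkdtemp()
demo_dir = os.path.join(demo_root, "lora-out")
os.makedirs(demo_dir)
with open(os.path.join(demo_dir, "adapter_config.json"), "w", encoding="utf-8") as f:
    f.write("{}")

archive_path = zip_directory(demo_dir, os.path.join(demo_root, "lora-out"))
print(os.path.basename(archive_path))

# In Colab you would then call:
#   from google.colab import files
#   files.download(zip_directory("lora-out", "lora-out"))
```

Note that the archive stores the files at its top level, so unzip it into a folder named lora-out on your local machine.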
  

Now you have:

Base model-id: unsloth/Llama-3.2-1B-Instruct
LoRA adapter weights (e.g., lora-out/)
Tokenizer files

You’ll use these on your local machine in the next steps.

Step 4: Convert LoRA to GGUF With llama.cpp

On your local machine, you don't need a GPU; CPU + enough RAM and disk is fine.

1. Clone llama.cpp

Clone llama.cpp and install conversion requirements:

    PowerShell
   
   git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt

This gives you access to the convert_lora_to_gguf.py script.

2. Pull the Base Model From Ollama

Use Ollama to pull the same base model:

    PowerShell
   
   ollama pull llama3.2:1b

3. Convert LoRA to GGUF

    PowerShell
   
 

   python llama.cpp/convert_lora_to_gguf.py \
  ./multiagent-orchestrator/lora-out \
  --outfile ./multiagent-orchestrator/multiagent-orchestrator.gguf \
  --outtype f16 \
  --base-model-id unsloth/Llama-3.2-1B-Instruct
  

The first argument points to your LoRA adapter directory
--outfile is the GGUF filename you want
--outtype is the numerical precision, here 16-bit floating point
--base-model-id is the Hugging Face ID of the base model the adapter was trained on

This produces a GGUF LoRA adapter, "multiagent-orchestrator.gguf". It is not a merged model: Ollama applies it on top of the base model in the next step.

Step 5: Create an Ollama Modelfile

Ollama uses a Modelfile (like a Dockerfile for models) to describe how a model should be built and run. For a base model plus a GGUF LoRA adapter, a basic Modelfile is:

    Plain Text
   
   FROM llama3.2:1b

# GGUF LoRA adapter we just created
ADAPTER ./multiagent-orchestrator.gguf

PARAMETER temperature 0.2
PARAMETER num_ctx 2048

SYSTEM """
You are a specialized multi-agent ORCHESTRATOR.
You never solve the task directly; you only decide the next action and output a single JSON object describing the next step.
"""

Key points:

FROM points to the base model pulled from Ollama; it must match the base model the adapter was trained on
ADAPTER is the GGUF LoRA adapter we created in Step 4
PARAMETER lines configure defaults like:
- temperature (creativity vs. determinism)
- num_ctx (context window size)
SYSTEM sets the global system prompt, defining your model’s persona and behavior.

Step 6: Build the Model in Ollama

Assuming Ollama is installed and running locally (ollama serve), building the model is just:

    PowerShell
   
   ollama create username/multiagent-orchestrator -f ./Modelfile

This essentially reads your Modelfile, registers the model under the name specified, and copies the GGUF into Ollama's internal storage. You can verify and run it:

    PowerShell
   
   ollama list
ollama run username/multiagent-orchestrator
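
Besides the interactive CLI, the model can be called from code through Ollama's local HTTP API. The sketch below builds a request for the /api/chat endpoint with the standard library only. The helper names are hypothetical, the model name is the placeholder used above, and the user message is a stub that you would replace with the registry and state text. The system prompt is not sent here because the Modelfile already sets it.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_request(model, user_content, temperature=0.2):
    """Build the JSON payload for a single non-streaming chat call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
        "stream": False,   # return one JSON response instead of a stream of chunks
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
        "options": {"temperature": temperature},
    }


def ask_orchestrator(payload, url=OLLAMA_CHAT_URL):
    """Send the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["message"]["content"]


payload = build_chat_request(
    "username/multiagent-orchestrator",
    "Current task state: ...",  # placeholder; pass the registry and state text here
)
print(json.dumps(payload, indent=2))
# With the server running: print(ask_orchestrator(payload))
```

Keeping the payload builder separate from the network call makes it easy to inspect or log requests before sending them.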

Step 7: Push Your Fine-Tuned Model to Ollama

If you want to share the model so it can be pulled from anywhere (like you would with Docker images):

Create an Ollama account and add your machine's Ollama public key in the Ollama web UI.

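Make sure the model is namespaced with your username. If you already created it as username/multiagent-orchestrator in Step 6, skip this; otherwise copy it: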

      PowerShell
     
     ollama cp multiagent-orchestrator username/multiagent-orchestrator

Push it to the registry:

      PowerShell
     
     ollama signin
ollama push username/multiagent-orchestrator

Other machines can now do:

      PowerShell
     
     ollama pull username/multiagent-orchestrator
ollama run username/multiagent-orchestrator

And that's it! Your fine-tuned, Colab-trained, GGUF-converted model is now a first-class Ollama model you can pull and run anywhere Ollama is installed.

You've effectively built a zero-in-house-GPU fine-tuning and deployment pipeline: Colab for training, llama.cpp for conversion, Ollama for serving.


Opinions expressed by DZone contributors are their own.
