Fine-Tune SLMs for Free: From Google Colab to Ollama in 7 Steps
Want your own custom SLM, but don't have a GPU box humming under your desk? Learn from this article why you actually don’t need one.
In this article, I'll walk through a practical pipeline that:
- Fine-tunes a popular open-source base small language model on your own data using Unsloth on Google Colab (free T4 GPU)
- Exports the result to GGUF via llama.cpp
- Deploys it to Ollama so that you can run ollama pull my-model from anywhere and even push it to the Ollama registry.
We'll put this into practice with a real-world example: a "multi-agent orchestrator," built in seven concrete steps.
The Stack at a Glance
Here's the toolchain we’re wiring together:
- Data: JSON (or JSONL) instruction data, hand-crafted or LLM-generated
- Fine-tuning: Unsloth on Google Colab T4 (free tier)
- Checkpoints: LoRA adapters (lightweight fine-tune weights)
- Conversion: llama.cpp to produce GGUF, the format used by llama.cpp/Ollama
- Serving: Ollama with a Modelfile
No in-house GPU. Just Colab + your local machine (CPU is enough).
Step 1: Curate Your Training Data in JSON
Your model is only as good as its data. For this orchestrator example, each training record pairs an agent registry and the current task state with the target action the model should emit:
[
{
"agent_registry": [
{
"name": "researcher",
"role": "Collects and summarizes information from the web and internal docs.",
"cost": "medium",
"latency": "medium"
},
{
"name": "coder",
"role": "Writes and fixes code, runs tests, and explains changes.",
"cost": "high",
"latency": "high"
},
{
"name": "critic",
"role": "Reviews outputs from other agents for quality, correctness, and clarity.",
"cost": "low",
"latency": "low"
}
],
"state": {
"task": "Write a short technical blog post explaining what multi-agent LLM systems are.",
"status": "in_progress",
"step": 1,
"max_steps": 8,
"context": {
"user_preferences": {
"tone": "friendly",
"audience": "intermediate developers"
}
},
"history": []
},
"target_action": {
"action": "call_agent",
"target": "researcher",
"arguments": {
"query": "Explain what multi-agent LLM systems are, list 3–4 key benefits, and mention common frameworks."
},
"final_answer": null,
"reason": "We need background information before planning the blog post."
}
},
...
]
A few practical tips:
- Keep the format consistent. Decide on fields (agent_registry, state, target_action) and stick to them.
- Generate synthetic data when useful. You can bootstrap your dataset by prompting ChatGPT or another LLM to generate domain-specific Q&A or instructions.
- Filter aggressively. Remove duplicates, contradictory answers, and very low-quality or off-topic data.
- Start small, iterate fast. Even a few hundred high-quality examples can make a noticeable difference when fine-tuning with LoRA/QLoRA.
Save the dataset as train_data.json (the filename the loading code in Step 2 expects) and upload it to Colab.
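The consistency and filtering tips above can be partly automated. The following is a minimal sketch, not a complete quality filter: it assumes each record uses the three top-level fields shown earlier and that the allowed actions are the four listed in the prompt schema used later in this article. The function names and the output filename are hypothetical.

```python
import json

REQUIRED_KEYS = {"agent_registry", "state", "target_action"}
# Assumed from the action schema used in the training prompt.
ALLOWED_ACTIONS = ("call_agent", "call_tool", "ask_user", "finish")


def clean_dataset(records):
    """Drop malformed records and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for record in records:
        # Skip anything that is not a dict with all three top-level fields.
        if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
            continue
        # Skip targets whose action is outside the allowed set.
        target = record["target_action"]
        if not isinstance(target, dict) or target.get("action") not in ALLOWED_ACTIONS:
            continue
        # A canonical serialization makes key order irrelevant for deduplication.
        fingerprint = json.dumps(record, sort_keys=True, ensure_ascii=False)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


def clean_file(src_path, dst_path):
    """Read a JSON array, clean it, write the result, and return (kept, total)."""
    with open(src_path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    cleaned = clean_dataset(raw)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(cleaned, f, ensure_ascii=False, indent=2)
    return len(cleaned), len(raw)
```

Calling clean_file("train_data.json", "train_data.clean.json") would write a filtered copy next to the original. Contradictory or low-quality answers still need a manual or LLM-assisted review; this only catches structural problems and exact duplicates.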
Step 2: Fine-Tune With Unsloth on Google Colab
What Is Unsloth?
Unsloth is an open-source library focused on efficient fine-tuning and reinforcement learning for LLMs. It supports popular models such as Llama, Gemma, Qwen, Mistral, DeepSeek, and more, and its maintainers report training about 2x faster with up to 70% less VRAM than typical approaches.
The official docs emphasize that you can fine-tune for free on Colab or Kaggle, or locally, with as little as ~3 GB of VRAM using ready-made notebooks.
1. Set Up a Colab Notebook
- Open a new Colab notebook.
- Go to Runtime → Change runtime type → GPU → T4.
- Install Unsloth and other dependencies.
!pip install unsloth trl peft accelerate bitsandbytes
2. Choose a Base Model
Pick a base model that Unsloth supports, and that also has llama.cpp/Ollama support. Common choices:
- meta-llama/Meta-Llama-3-8B-Instruct
- Qwen/Qwen2.5-7B-Instruct
- google/gemma-2-9b-it
- A smaller 3B-4B model if your dataset is small or your hardware is very tight.
3. Load the Base Model With Unsloth
A typical pattern in Unsloth looks like:
from unsloth import FastLanguageModel
import torch
model_name = "unsloth/Llama-3.2-3B-Instruct"
max_seq_length = 2048 # Choose sequence length
dtype = None # Auto detection
# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=model_name,
max_seq_length=max_seq_length,
dtype=dtype,
load_in_4bit=True,
)
Then you wrap it with LoRA/PEFT:
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=64, # LoRA rank - higher = more capacity, more memory
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=128, # LoRA scaling factor (usually 2x rank)
lora_dropout=0, # Supports any, but = 0 is optimized
bias="none", # Supports any, but = "none" is optimized
use_gradient_checkpointing="unsloth", # Unsloth's optimized version
random_state=3407,
use_rslora=False, # Rank stabilized LoRA
loftq_config=None, # LoftQ
)
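To get a feel for what r=64 means in practice, here is a rough sketch of the LoRA parameter count. Each adapted weight matrix of shape (d_in, d_out) gains two low-rank factors, adding r * (d_in + d_out) trainable parameters. The layer shapes and layer count below are my assumptions for Llama-3.2-3B, not values taken from this article, so check them against the model's config before relying on the numbers.

```python
def lora_trainable_params(rank, layer_shapes, num_layers):
    """Count LoRA parameters: each adapted matrix adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


# Assumed (d_in, d_out) shapes for one Llama-3.2-3B decoder layer.
LLAMA_3_2_3B_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}

rank, alpha = 64, 128
total = lora_trainable_params(rank, LLAMA_3_2_3B_SHAPES, num_layers=28)
print(f"trainable LoRA parameters: {total:,}")
# With use_rslora=False, the adapter update is scaled by alpha / rank.
print(f"scaling factor alpha / r: {alpha / rank}")
```

Under these assumed shapes the count comes to roughly 97 million parameters, a small fraction of the 3B base weights, which is why the adapter files stay small enough to download from Colab.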
4. Load Your JSON Dataset
import json
from datasets import Dataset
# 1) Load your JSON array file
with open("train_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)  # list of dicts
SYSTEM_PROMPT = """You are an ORCHESTRATOR, a supervisor coordinating a team of AI agents and tools.
You NEVER solve the task directly. You ONLY decide the next action.
Always respond with a single JSON object describing the next action.
"""
def format_example(example):
    agent_registry = example["agent_registry"]
    state = example["state"]
    target_action = example["target_action"]
    system_message = SYSTEM_PROMPT.strip()
    user_message_content = f"""
Available agents (registry):
{json.dumps(agent_registry, ensure_ascii=False, indent=2)}
Current task state:
{json.dumps(state, ensure_ascii=False, indent=2)}
Decide the next action as JSON only, using this schema:
{{
"action": "call_agent" | "call_tool" | "ask_user" | "finish",
"target": string or null,
"arguments": object,
"final_answer": string or null,
"reason": string
}}
"""
    assistant_message_content = json.dumps(target_action, ensure_ascii=False, separators=(",", ":"))
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message_content.strip()},
        {"role": "assistant", "content": assistant_message_content},
    ]
    # Apply the tokenizer chat template directly for training data.
    # add_generation_prompt=False because it's a complete conversation turn.
    # The template already closes the assistant turn with the model's own
    # end-of-turn token, so no literal "<|endoftext|>" string is appended
    # (that string is not a special token in the Llama 3 vocabulary).
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# 2) Build the formatted text list
# Make sure tokenizer is available in global scope before this call
formatted_data = [format_example(item) for item in data]
# 3) Create the HF Dataset
dataset = Dataset.from_dict({"text": formatted_data})
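Because the registry and state are serialized as indented JSON, formatted examples can get long, and sequences beyond max_seq_length are typically truncated during training, which would cut off the target action at the end. A small check like the sketch below can flag such examples. The helper name and the whitespace tokenizer are stand-ins of mine for illustration; in the notebook you would pass a counter built on the real tokenizer.

```python
def find_overlong(texts, count_tokens, max_seq_length=2048):
    """Return (index, token_count) for every text longer than max_seq_length."""
    overlong = []
    for index, text in enumerate(texts):
        n_tokens = count_tokens(text)
        if n_tokens > max_seq_length:
            overlong.append((index, n_tokens))
    return overlong


# Stand-in for illustration only: whitespace splitting undercounts real
# subword tokens. In the notebook, pass something like
#   lambda t: len(tokenizer(t)["input_ids"])
def whitespace_token_count(text):
    return len(text.split())


sample_texts = ["short example", "word " * 3000]
print(find_overlong(sample_texts, whitespace_token_count))
```

If anything is flagged, either raise max_seq_length (at the cost of more VRAM) or shorten the serialized state, for example by trimming old history entries.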
5. SFTTrainer Setup
Feed the model and dataset you just created into TRL's SFTTrainer (Supervised Fine-Tuning Trainer), using training arguments that work well with Unsloth. Start with a small experiment:
- Batch size: small (e.g., 1-4 per GPU, with gradient accumulation)
- Learning rate: 1e-4 to 2e-4 for LoRA
- Epochs: 1–3 to start
from trl import SFTTrainer
from transformers import TrainingArguments
# Training arguments optimized for Unsloth
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=max_seq_length,
dataset_num_proc=2,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch size = 8
warmup_steps=10,
num_train_epochs=3,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=2,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
save_strategy="epoch",
save_total_limit=2,
dataloader_pin_memory=False,
report_to="none", # Disable Weights & Biases logging
),
)
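To sanity-check settings such as warmup_steps and logging_steps, it helps to estimate how many optimizer steps a run will take. The sketch below assumes a single GPU and a hypothetical dataset of 400 examples; the trainer's exact count can differ slightly depending on how the last partial batch is handled.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps, num_epochs):
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
    }


# 400 examples is a made-up dataset size for illustration.
schedule = training_schedule(
    num_examples=400, per_device_batch_size=2, grad_accum_steps=4, num_epochs=3
)
print(schedule)
```

With these numbers the run has about 150 optimizer steps, so warmup_steps=10 covers only the first few percent of training. On a dataset of a few dozen examples, the same warmup would take up a much larger share and may be worth reducing.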
6. Train the Model
Let training run on the free Colab T4. Once it finishes, you'll have a LoRA-fine-tuned model in memory.
trainer_stats = trainer.train()
7. Test the Model
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# Test prompt
messages = [
{"role": "system", "content": SYSTEM_PROMPT.strip()},
{
"role": "user",
"content": """
Available agents (registry):
{
"agents": [
{
"name": "researcher",
"role": "Collects and summarizes information from the web and docs.",
"cost": "medium",
"latency": "medium"
},
{
"name": "coder",
"role": "Writes and fixes code, runs tests, and explains changes.",
"cost": "high",
"latency": "high"
}
]
}
Current task state:
{
"task": "Write a short explanation of what multi-agent systems are.",
"status": "in_progress",
"step": 1,
"max_steps": 5,
"context": null,
"history": []
}
Decide the next action as JSON only, using this schema:
{
"action": "call_agent" | "call_tool" | "ask_user" | "finish",
"target": string or null,
"arguments": object,
"final_answer": string or null,
"reason": string
}
"""
}
]
inputs = tokenizer.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True, # This adds the <|start_header_id|>assistant<|end_header_id|>
return_tensors="pt",
).to("cuda")
# Generate response
outputs = model.generate(
input_ids=inputs,
max_new_tokens=256,
use_cache=True,
temperature=0.7,
do_sample=True,
top_p=0.9,
)
# Decode and print only the generated part
# Get the length of the input tokens
input_length = inputs.shape[1]
# Decode only the generated tokens (after the input_length)
generated_tokens = outputs[0, input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(response)
Inference result:
{
"action": "call_agent",
"target": "researcher",
"arguments": {
"query": "Explain what multi-agent systems are. Choose 3-5 key points."
},
"final_answer": null,
"reason": "The researcher agent should gather and summarize foundational information before the other agents are called."
}
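Since the orchestrator's output is meant to be consumed by code, it is worth validating each response before acting on it. The following is a minimal sketch that checks a response against the action schema from the prompt; the function name and error messages are my own, and a production system might prefer a schema library instead.

```python
import json

ALLOWED_ACTIONS = ("call_agent", "call_tool", "ask_user", "finish")
REQUIRED_KEYS = {"action", "target", "arguments", "final_answer", "reason"}


def parse_action(raw_text):
    """Parse model output into an action dict; raise ValueError if it breaks the schema."""
    try:
        action = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS.difference(action)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if action["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action['action']!r}")
    if not isinstance(action["arguments"], dict):
        raise ValueError("'arguments' must be an object")
    return action


sample = (
    '{"action":"call_agent","target":"researcher",'
    '"arguments":{"query":"Explain multi-agent systems."},'
    '"final_answer":null,"reason":"Need background first."}'
)
print(parse_action(sample)["target"])
```

Running a check like this over a held-out set of prompts gives a quick pass rate for how reliably the fine-tuned model sticks to the schema.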
Practical Limitations
While this pipeline works well for roughly 1B-10B models on a free T4 GPU, larger models (13B and up) may not fit into Colab's VRAM, even with 4-bit loading and LoRA. To fine-tune at that scale, expect to use a paid GPU tier (A100, H100, or RTX 4090/6000) or a cloud provider with sufficient VRAM.
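A back-of-the-envelope estimate shows where the limit comes from. The sketch below computes only the memory needed to hold 4-bit weights, which is a lower bound: activations, LoRA gradients, optimizer state, and quantization overhead all come on top, so real usage is noticeably higher. The parameter counts are rounded, and the 16 GB figure is the T4's nominal VRAM.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


T4_VRAM_GB = 16  # nominal VRAM of an NVIDIA T4; the usable amount is lower

for label, params in [("3B", 3e9), ("8B", 8e9), ("13B", 13e9), ("70B", 70e9)]:
    gb = weight_memory_gb(params, bits_per_weight=4)
    print(f"{label}: ~{gb:.1f} GB of 4-bit weights ({gb / T4_VRAM_GB:.0%} of a T4)")
```

A 70B model exceeds the card on weights alone, while a 13B model leaves little headroom once training overhead and longer sequences are added.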
Step 3: Save LoRA Weights in Colab (Don't Merge Yet)
Colab's free tier typically gives you around 12.7 GB of CPU RAM, which often isn’t enough to merge LoRA adapters into the full base model comfortably. Merges tend to spike RAM usage and cause the runtime to crash. Here's what I do to get around this problem:
- Save the LoRA adapter and tokenizer as separate artifacts
- Download them locally for merging on a machine with more RAM.
# Save the LoRA adapter and tokenizer
save_dir = "lora-out"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
# Download the LoRA files to your local machine
import os
from google.colab import files

if os.path.exists(save_dir) and os.path.isdir(save_dir):
    print(f"Downloading files from '{save_dir}':")
    downloaded_any = False
    for filename in os.listdir(save_dir):
        filepath = os.path.join(save_dir, filename)
        if os.path.isfile(filepath):
            print(f" - {filename}")
            files.download(filepath)
            downloaded_any = True
    if not downloaded_any:
        print(f"No files found in '{save_dir}' to download.")
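Downloading files one by one can trigger several browser prompts. As an alternative sketch, you could bundle the adapter directory into a single archive first. The helper name below is hypothetical; shutil.make_archive is part of the Python standard library.

```python
import shutil


def zip_directory(src_dir, archive_base):
    """Pack the contents of src_dir into '<archive_base>.zip' and return its path."""
    return shutil.make_archive(archive_base, "zip", root_dir=src_dir)


# In Colab, after saving the adapter, this would look like:
#   archive_path = zip_directory("lora-out", "lora-out")
#   from google.colab import files
#   files.download(archive_path)
```

On your local machine, unzip the archive into a lora-out/ directory so the later conversion step finds the adapter and tokenizer files together.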
Now you have:
- Base model ID: unsloth/Llama-3.2-3B-Instruct
- LoRA adapter weights (e.g., lora-out/)
- Tokenizer files
You’ll use these on your local machine in the next steps.
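Before converting, it can save time to confirm that the download is complete and that the adapter really targets the base model you expect. This sketch reads the adapter_config.json that PEFT writes alongside the weights; the helper name is hypothetical, and the directory path is whatever you downloaded to.

```python
import json
import os


def describe_adapter(adapter_dir):
    """Summarize a saved PEFT LoRA adapter, raising if key files are missing."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    if not os.path.isfile(config_path):
        raise FileNotFoundError(f"missing {config_path}")
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # PEFT stores weights as adapter_model.safetensors (or .bin in older versions).
    weight_files = [
        name for name in sorted(os.listdir(adapter_dir))
        if name.startswith("adapter_model.")
    ]
    if not weight_files:
        raise FileNotFoundError(f"no adapter_model.* file in {adapter_dir}")
    return {
        "base_model": config.get("base_model_name_or_path"),
        "rank": config.get("r"),
        "alpha": config.get("lora_alpha"),
        "weight_files": weight_files,
    }
```

Calling describe_adapter("lora-out") should report the same base model ID you trained on; if it does not, the conversion and the Ollama base model in the following steps need to match what the config says.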
Step 4: Convert LoRA to GGUF With llama.cpp
On your local machine, you don't need a GPU; CPU + enough RAM and disk is fine.
1. Clone llama.cpp
Clone llama.cpp and install conversion requirements:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
This gives you access to the convert_lora_to_gguf.py script.
2. Pull the Base Model From Ollama
Use Ollama to pull the same base model you fine-tuned (the training code above uses the 3B variant):
ollama pull llama3.2:3b
3. Convert LoRA to GGUF
# Run from the directory that contains llama.cpp/ (cd .. if you are still inside it)
python llama.cpp/convert_lora_to_gguf.py \
./multiagent-orchestrator/lora-out \
--outfile ./multiagent-orchestrator/multiagent-orchestrator.gguf \
--outtype f16 \
--base-model-id unsloth/Llama-3.2-3B-Instruct
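- The first argument points to your LoRA adapter directory
- --outfile is the GGUF filename you want
- --outtype is the numerical precision, here 16-bit floating point
- --base-model-id is the Hugging Face ID of the base model the adapter was trained on
This produces a GGUF LoRA adapter, multiagent-orchestrator.gguf. It is not a merged model: Ollama applies it on top of the base model through the ADAPTER instruction in the next step.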
Step 5: Create an Ollama Modelfile
Ollama uses a Modelfile (like a Dockerfile for models) to describe how a model should be built and run. For a base model plus a GGUF LoRA adapter, a basic Modelfile is:
FROM llama3.2:3b
# GGUF LoRA adapter we just created
ADAPTER ./multiagent-orchestrator.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 2048
SYSTEM """
You are a specialized multi-agent ORCHESTRATOR.
You never solve the task directly; you only decide the next action and output a single JSON object describing the next step.
"""
Key points:
- FROM points to the base model pulled from Ollama
- ADAPTER is the GGUF LoRA adapter we created in Step 4
- PARAMETER lines configure defaults such as temperature (creativity vs. determinism) and num_ctx (context window size)
- SYSTEM sets the global system prompt, defining your model's persona and behavior.
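If you plan to build several variants (different temperatures, context sizes, or adapters), generating the Modelfile from a script keeps them consistent. This is a small sketch using only the four instructions described above; the function name is hypothetical, and the base model tag is an assumption that must match the model your adapter was trained on.

```python
def render_modelfile(base_model, adapter_path, system_prompt, parameters):
    """Render a Modelfile with FROM, ADAPTER, PARAMETER, and SYSTEM instructions."""
    lines = [f"FROM {base_model}", f"ADAPTER {adapter_path}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    lines.append(f'SYSTEM """\n{system_prompt.strip()}\n"""')
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    base_model="llama3.2:3b",  # assumption: same base model as the adapter
    adapter_path="./multiagent-orchestrator.gguf",
    system_prompt="You are a specialized multi-agent ORCHESTRATOR.",
    parameters={"temperature": 0.2, "num_ctx": 2048},
)
print(modelfile_text)
```

Writing modelfile_text to a file named Modelfile gives you the input for the build step that follows.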
Step 6: Build the Model in Ollama
Assuming Ollama is installed and running locally (ollama serve), building the model is just:
ollama create username/multiagent-orchestrator -f ./Modelfile
This essentially reads your Modelfile, registers the model under the name specified, and copies the GGUF into Ollama's internal storage. You can verify and run it:
ollama list
ollama run username/multiagent-orchestrator
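Beyond the interactive CLI, an orchestrator is usually called from code. The sketch below posts to Ollama's local chat endpoint using only the standard library. It assumes Ollama is listening on its default port 11434 and that the model was created under the name used above; the function names are my own.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_payload(model, user_content):
    """Build a non-streaming chat request that asks for JSON output."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
        "format": "json",
        "stream": False,
    }


def ask_orchestrator(model, user_content, url=OLLAMA_CHAT_URL):
    """Send one user message to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(model, user_content)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read().decode("utf-8"))
    return reply["message"]["content"]
```

With the server running, ask_orchestrator("username/multiagent-orchestrator", prompt) returns the model's JSON action as a string. The system prompt from the Modelfile is applied by Ollama, so the request only needs the user message with the registry and task state.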
Step 7: Push Your Fine-Tuned Model to Ollama
If you want to share the model so it can be pulled from anywhere (like you would with Docker images):
- Create an Ollama account and add your local public key in the Ollama web UI.
- Make sure the model is namespaced with your username. If you created it without the prefix, copy it:
ollama cp multiagent-orchestrator username/multiagent-orchestrator
- Push it to the registry:
ollama signin
ollama push username/multiagent-orchestrator
- Other machines can now do:
ollama pull username/multiagent-orchestrator
ollama run username/multiagent-orchestrator
And that's it! Your fine-tuned, Colab-trained, GGUF-converted adapter is now a first-class Ollama model you can pull and run anywhere Ollama is installed.
You've effectively built a zero-in-house-GPU fine-tuning and deployment pipeline: Colab for training, llama.cpp for conversion, Ollama for serving.
Opinions expressed by DZone contributors are their own.