My Dive into Local LLMs, Part 1: From Alexa Curiosity to Homegrown AI

I set up Ollama on Ubuntu to run LLMs locally. CPU performance was a drag, but when GPU acceleration kicked in, it was a total "whoa" moment.

By Praveen Chinnusamy · Jun. 27, 25 · Tutorial

So, I'm a software dev manager on the Alexa team, and being part of the Alexa journey, you kinda get a front-row seat to some seriously cool AI. It got me thinking, "Hey, I wanna poke at these LLMs on my own machine, see what makes 'em tick without needing a massive server farm." This is basically my log of figuring out how to get a personal LLM setup going.

In this article:

  • Setting up Ollama on Ubuntu with GPU acceleration
  • Understanding CPU vs GPU performance impact
  • Choosing the right model for your hardware
  • First steps toward building local AI applications

The Why

Honestly, it was pure curiosity at first. Working with Alexa, you see the potential, but I wanted that hands-on feel. The idea of having a model running locally, something I could tinker with, feed my own stuff, and just play – that was the goal. No big corporate project, just me, my laptop, and some AI.

Getting Started: The Overwhelming Options

First off, wow, there's a LOT out there. You got your Ollama, LM Studio, GPT4All, Docker setups... it's a bit of a jungle. My main thing was, I wanted something fairly straightforward for a first go. I wasn't trying to build Rome in a day, y'know?

I ended up leaning towards Ollama for my initial experiments because it seemed pretty popular for command-line folks and had a simple ollama pull llama3 kind of vibe to get models. LM Studio also looked cool with its GUI, especially for just downloading and chatting.

My Ubuntu Setup: The Nitty-Gritty

My daily driver is an Ubuntu machine (22.04 LTS with an AMD Ryzen 7 5800X), so that's where I decided to build my little AI playground. This is where the rubber really met the road, moving from theory to a running process in my own terminal.

First things first: resources. You can't run a V8 engine on a lawnmower's gas tank. I popped open htop to get a feel for my RAM and CPU situation. With 32GB of RAM and 8 cores/16 threads, I felt pretty good about starting, but I knew the real bottleneck could be the GPU.
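If you'd rather grab those numbers from a script than eyeball htop, here's a rough sketch. It assumes a Linux box with /proc/meminfo and nproc available; mem_total_gb is just a helper name made up for this example.

```shell
#!/bin/sh
# mem_total_gb [FILE]: print total RAM in whole GiB, parsed from a
# meminfo-style file (defaults to /proc/meminfo).
# Hypothetical helper for illustration, not part of any tool.
mem_total_gb() {
    meminfo_file="${1:-/proc/meminfo}"
    # MemTotal is reported in kB; 1048576 kB make one GiB.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 }' "$meminfo_file"
}

# Only print the summary on systems that expose /proc/meminfo.
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
    echo "CPU threads: $(nproc)"
    echo "RAM (GiB):   $(mem_total_gb)"
fi
```

The number is rounded down, so a 32GB machine may report 31 because the kernel reserves a bit of memory for itself.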

Installing Ollama on Ubuntu was a breeze, which I really appreciated. It's a single command, which you love to see:

Shell
 
# Installs Ollama on Linux
curl -fsSL https://ollama.ai/install.sh | sh


After running that, a quick ollama --version confirmed it was ready to go. My initial test was to pull a smaller model to see how it ran on the CPU alone.

Shell
 
ollama pull phi3


Then, ollama run phi3. And... it worked! But honestly, the performance was, let's say, deliberate. You could practically feel the CPU cores churning away, and the tokens trickled out onto the screen. It was cool to see it running, but not exactly a "conversational" speed. This confirmed my suspicion: to do this right, I needed to get my NVIDIA GPU in the game.

Performance Reality Check

Here's what I was seeing:

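| Model   | Hardware | Tokens/Second | User Experience                   |
|---------|----------|---------------|-----------------------------------|
| Phi-3   | CPU only | 2-3           | "Deliberate" - watching paint dry |
| Phi-3   | RTX 3080 | 45-50         | Snappy responses                  |
| Llama 3 | CPU only | 0.5-1         | Time for coffee... or three       |
| Llama 3 | RTX 3080 | 25-30         | Natural conversation flow         |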
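One way to get numbers like these yourself (a sketch, not necessarily how the figures above were measured): ollama run accepts a --verbose flag that prints timing stats after each reply, and the local HTTP API reports eval_count (tokens generated) and eval_duration (in nanoseconds) in its response. The helper below just does the division; tokens_per_second is a made-up name for this example.

```shell
#!/bin/sh
# tokens_per_second EVAL_COUNT EVAL_DURATION_NS
# Converts a token count and a duration in nanoseconds into tokens/second.
# Hypothetical helper for illustration.
tokens_per_second() {
    awk -v n="$1" -v ns="$2" 'BEGIN {
        if (ns <= 0) exit 1          # avoid dividing by zero
        printf "%.1f\n", n / (ns / 1e9)
    }'
}

# Example: 100 tokens generated in 2 seconds (2,000,000,000 ns)
tokens_per_second 100 2000000000
```

For a quick look without any scripting, ollama run phi3 --verbose prints an eval rate after each answer.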


This is where the real fun began. For Ollama to leverage the GPU on Ubuntu, it's not always automatic. You have to make sure the whole stack is in place.

GPU Setup: The Missing Pieces

NVIDIA Drivers: I had them installed, but if you don't, nvidia-smi is your moment of truth. If that command fails, you've got your first mission. The sudo ubuntu-drivers autoinstall command is usually a good, low-fuss way to handle it.

Shell
 
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+===============================+======================+======================+
| 0  NVIDIA RTX 3080     Off  | 00000000:2B:00.0 Off |                  N/A |
| 30%   45C    P8    20W / 320W |    364MiB / 10240MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
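The full table is nice for humans, but for scripts nvidia-smi also has a query mode. Here's a small sketch that prints free VRAM per GPU. It assumes the --query-gpu and --format=csv,noheader,nounits options of a reasonably recent driver; vram_free_mib is a helper name invented for this example.

```shell
#!/bin/sh
# vram_free_mib: reads CSV lines of the form "name, total_mib, used_mib"
# on stdin and prints the free VRAM per GPU.
# Hypothetical helper for illustration.
vram_free_mib() {
    awk -F', ' '{ printf "%s: %d MiB free\n", $1, $2 - $3 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total,memory.used \
               --format=csv,noheader,nounits | vram_free_mib
else
    echo "nvidia-smi not found: install the NVIDIA driver first" >&2
fi
```

Fed the numbers from the output above (10240 MiB total, 364 MiB used), it would report 9876 MiB free.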


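NVIDIA Container Toolkit: This was the key on the Docker side. To be precise, the native Ollama install from the script above talks to the NVIDIA driver directly, so this step matters when you run Ollama (or any other GPU workload) inside a container. In that case you need to install the NVIDIA Container Toolkit so that the container can actually see and use the GPU. This involves adding their repository and installing a few packages.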

Shell
 
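# Add the NVIDIA Container Toolkit repository (keyring-based; apt-key is deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker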


After getting that set up and restarting the Docker daemon, I ran nvidia-smi again. Everything looked good. Now for the real test. I decided to pull a more powerful model this time.

Shell
 
ollama pull llama3


Then, the magic command:

Shell
 
ollama run llama3


The difference was night and day. Suddenly, responses were snappy. The text flowed. The model felt alive. This wasn't just a technical demo anymore; it was a genuinely usable tool running entirely on the hardware under my desk. That was the "whoa" moment I was looking for.

My Basic "Get it Running" Flow

Here's kinda how it went down in a nutshell, and how you could probably do it too. Total setup time: about 30-45 minutes if everything goes smoothly.

[Figure: AI Curiosity Flow]

Model Recommendations by Use Case

Based on my experiments, here's what I'd recommend:

Limited RAM (8-16GB)

  • Phi-3: Fast and capable for basic tasks
  • Mistral 7B: Good balance of performance and capability
  • Llama 2 7B: Solid all-rounder

Good GPU (8GB+ VRAM)

  • Llama 3 8B: My current favorite - great performance
  • Mistral 7B: Runs beautifully with room to spare
  • CodeLlama 7B: If you're focusing on coding tasks

Powerful Setup (24GB+ VRAM)

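  • Llama 3 70B: The full experience (the default 4-bit download is around 40GB, so on a single 24GB card part of it runs from system RAM)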
  • Mixtral 8x7B: Impressive mixture-of-experts model
  • Yi 34B: Great for longer contexts

Challenges and What I Bumped Into

Model Sizes: These things can be chunky! Some models need a good bit of RAM. My setup is decent, but you've got to pay attention to those model requirements. A good habit is to run ollama list to see what you've got downloaded and their sizes.

Prompting is an Art: Just like with the big guys, getting the right answer means asking the right way. It's a skill in itself, and I'm still figuring that bit out.

Which Model When?: So many models, each good at different things (coding, chat, summarization). It's a bit of trial and error to find your favorites for different tasks. Phi-3 is great for quick tasks on low-spec hardware, while Llama 3 is a fantastic all-rounder if you have the GPU power.
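On the model-size point, a back-of-the-envelope estimate helps before you download anything: the weights take roughly (parameters × bits per weight ÷ 8) bytes. The sketch below does that arithmetic. It's an approximation only: it ignores the context cache and runtime overhead, and real quantized files come out somewhat larger. approx_weights_gb is a name made up for this example.

```shell
#!/bin/sh
# approx_weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Rough size of the model weights in GB: billions of params * bits / 8.
# Hypothetical helper; ignores KV cache and runtime overhead.
approx_weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

approx_weights_gb 8 4     # 8B model at 4-bit   -> 4.0
approx_weights_gb 8 16    # 8B model at 16-bit  -> 16.0
approx_weights_gb 70 4    # 70B model at 4-bit  -> 35.0
```

That's why an 8B model at 4-bit sits comfortably in 10GB of VRAM, while a 70B model doesn't fit on a single consumer card even when quantized.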

Quick Troubleshooting Tips

  • nvidia-smi not found? Run sudo ubuntu-drivers autoinstall and reboot
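  • Ollama using CPU instead of GPU? Run ollama ps while a model is loaded and check the PROCESSOR column. If you run Ollama in Docker, also check GPU passthrough with: docker run --rm --gpus all ubuntu nvidia-smi
  • Out of memory errors? Try smaller models or quantized versions (Llama 3 ships as 8B and 70B), for example: ollama pull llama3:8b-instruct-q4_0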
  • Model won't download? Check disk space - some models need 20GB+ free space
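For the disk space tip, here's a small sketch of a pre-download check. It assumes a POSIX df; the 20GB threshold comes from the tip above, and target_dir is a placeholder you'd point at wherever your models actually live. free_gb is a helper name invented for this example.

```shell
#!/bin/sh
# free_gb PATH: free space, in whole GiB, on the filesystem holding PATH.
# Hypothetical helper for illustration.
free_gb() {
    # -P forces one line per filesystem, -k reports 1024-byte blocks;
    # column 4 is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

target_dir="${HOME:-/}"   # assumption: adjust to your model directory
need_gb=20
have_gb=$(free_gb "$target_dir")

if [ "$have_gb" -lt "$need_gb" ]; then
    echo "Only ${have_gb}GB free in $target_dir; larger models may not fit"
else
    echo "${have_gb}GB free in $target_dir; good to go"
fi
```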

What I Used (Mostly)

  • Ollama: For running the models locally on my Ubuntu box
  • Llama 3 / Phi-3: Seemed like good starting points, not too massive but very capable
  • Python & LangChain (a little bit): Started looking into how LangChain can talk to these local models because that opens up building actual little apps
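Before reaching for LangChain, it's worth knowing that Ollama also serves a local HTTP API (by default on port 11434), which is what client libraries talk to. Here's a minimal sketch using curl against the /api/generate endpoint. build_payload and ask_local are names invented for this example, and the payload builder does no JSON escaping, so it assumes the prompt has no double quotes, backslashes, or newlines.

```shell
#!/bin/sh
# build_payload MODEL PROMPT: JSON body for Ollama's /api/generate.
# Assumption: PROMPT contains no double quotes, backslashes or newlines,
# because this sketch does not escape anything.
build_payload() {
    printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# ask_local MODEL PROMPT: send the prompt to an Ollama server on localhost.
# Returns the raw JSON response; the answer text is in its "response" field.
ask_local() {
    curl -s http://localhost:11434/api/generate \
         -d "$(build_payload "$1" "$2")"
}
```

With a model already pulled and Ollama running, ask_local llama3 "Why is the sky blue?" should come back with a single JSON object, since stream is set to false.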

Was it Worth It? Heck Yeah

Even though it's just a personal sandbox, understanding what it takes to run these LLMs locally has been super insightful. It's one thing to use an API, but it's quite another to configure the drivers, manage the models, and see the performance difference firsthand. It definitely gives me a different perspective on the tech we work with every day.

Next Steps for Me

Probably dive deeper into using LangChain with a local model for a simple RAG (Retrieval Augmented Generation) setup with my own documents. That seems like the next fun challenge. Now that the environment is stable, I might try out a more visual tool like LM Studio or GPT4All to compare the experience.

Actually, I'm already working on something practical - using this setup for personal finance analysis. Imagine having an AI that can categorize your expenses and give you insights, all while keeping your financial data completely private on your own machine. Stay tuned for Part 2!

Getting Started Yourself

If you're in dev and curious about LLMs, I'd say give it a shot. It's easier than you might think to get something basic running, and it's a cool learning curve. Just pick a tool, grab a model, and see what happens!

  1. Minimum specs: 16GB RAM, 6GB+ VRAM GPU (or patience for CPU-only)
  2. Install Ollama: That one-liner command above
  3. Start small: Try Phi-3 first, then work up to larger models
  4. Join the community: The Ollama Discord is super helpful

Helpful Resources

  • Ollama Model Library - Browse available models and their requirements
  • NVIDIA Container Toolkit Docs - Official installation guide
  • LangChain Local LLM Guide - For when you're ready to build apps