My Dive into Local LLMs, Part 1: From Alexa Curiosity to Homegrown AI
I set up Ollama on Ubuntu to run LLMs locally. CPU performance was a drag, but when GPU acceleration kicked in, it was a total "whoa" moment.
So, I'm a software dev manager over at the Alexa team, and being part of that journey, you kinda get a front-row seat to some seriously cool AI. It got me thinking, "Hey, I wanna poke at these LLMs on my own machine, see what makes 'em tick without needing a massive server farm." This is basically my log of figuring out how to get a personal LLM setup going.
In this article:
- Setting up Ollama on Ubuntu with GPU acceleration
- Understanding CPU vs GPU performance impact
- Choosing the right model for your hardware
- First steps toward building local AI applications
The Why
Honestly, it was pure curiosity at first. Working with Alexa, you see the potential, but I wanted that hands-on feel. The idea of having a model running locally, something I could tinker with, feed my own stuff, and just play – that was the goal. No big corporate project, just me, my laptop, and some AI.
Getting Started: The Overwhelming Options
First off, wow, there's a LOT out there. You got your Ollama, LM Studio, GPT4All, Docker setups... it's a bit of a jungle. My main thing was, I wanted something fairly straightforward for a first go. I wasn't trying to build Rome in a day, y'know?
I ended up leaning towards Ollama for my initial experiments because it seemed pretty popular for command-line folks and had a simple ollama pull llama3 kind of vibe to get models. LM Studio also looked cool with its GUI, especially for just downloading and chatting.
My Ubuntu Setup: The Nitty-Gritty
My daily driver is an Ubuntu machine (22.04 LTS with an AMD Ryzen 7 5800X), so that's where I decided to build my little AI playground. This is where the rubber really met the road, moving from theory to a running process in my own terminal.
First things first: resources. You can't run a V8 engine on a lawnmower's gas tank. I popped open htop to get a feel for my RAM and CPU situation. With 32GB of RAM and 8 cores/16 threads, I felt pretty good about starting, but I knew the real bottleneck could be the GPU.
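If you'd rather get exact numbers than eyeball htop, a couple of bog-standard Linux commands (nothing Ollama-specific here) will tell you what you're working with:
# Total and available RAM
free -h
# CPU core/thread count and model
nproc
lscpu | grep "Model name"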
Installing Ollama on Ubuntu was a breeze, which I really appreciated. It's a single command, which you love to see:
# Installs Ollama on Linux
curl -fsSL https://ollama.ai/install.sh | sh
After running that, a quick ollama --version confirmed it was ready to go. My initial test was to pull a smaller model to see how it ran on the CPU alone.
ollama pull phi3
Then, ollama run phi3. And... it worked! But let's be honest, the performance was, let's say, deliberate. You could practically feel the CPU cores churning away, and the tokens trickled out onto the screen. It was cool to see it running, but not exactly a "conversational" speed. This confirmed my suspicion: to do this right, I needed to get my NVIDIA GPU in the game.
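Quick aside: if your Ollama build is new enough to have the ps subcommand (an assumption on my part; older releases don't ship it), it will tell you outright whether a loaded model is sitting on the CPU or the GPU:
# Lists loaded models and shows whether each is running on CPU or GPU
ollama ps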
Performance Reality Check
Here's what I was seeing:
| Model | Hardware | Tokens/Second | User Experience |
|---------|-----------|---------------|--------------------------------------|
| Phi-3 | CPU only | 2-3 | "Deliberate" - watching paint dry |
| Phi-3 | RTX 3080 | 45-50 | Snappy responses |
| Llama 3 | CPU only | 0.5-1 | Time for coffee... or three |
| Llama 3 | RTX 3080 | 25-30 | Natural conversation flow |
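You don't have to take my numbers on faith, either. Assuming your Ollama build supports it, the --verbose flag on ollama run prints timing stats after every reply, including the eval rate in tokens per second:
# Prints load time, prompt eval rate, and eval rate (tokens/s) after each response
ollama run phi3 --verbose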
This is where the real fun began. Getting Ollama to leverage the GPU on Ubuntu isn't always automatic; you have to make sure the whole stack is in place.
GPU Setup: The Missing Pieces
NVIDIA Drivers: I had them installed already, but if you don't, nvidia-smi is your moment of truth. If that command fails, you've got your first mission. The sudo ubuntu-drivers autoinstall command is usually a good, low-fuss way to handle it.
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
+===============================+======================+======================+
| 0 NVIDIA RTX 3080 Off | 00000000:2B:00.0 Off | N/A |
| 30% 45C P8 20W / 320W | 364MiB / 10240MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
NVIDIA Container Toolkit: This was the key piece. The native Ollama install talks to the GPU drivers directly, but if you want to run Ollama (or anything else GPU-hungry) inside Docker containers, you need the NVIDIA Container Toolkit so the container can actually see and use the GPU. Setting it up involves adding NVIDIA's repository and installing a few packages.
# Add NVIDIA Container Toolkit repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
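Before moving on, it's worth the same sanity check that shows up in my troubleshooting list further down: run nvidia-smi from inside a throwaway CUDA container. If it prints the same GPU table as the host, the toolkit is wired up correctly.
# Should show the RTX 3080 from inside the container
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi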
After getting that set up and restarting the Docker daemon, I ran nvidia-smi again. Everything looked good. Now for the real test. I decided to pull a more powerful model this time.
ollama pull llama3
Then, the magic command:
ollama run llama3
The difference was night and day. Suddenly, responses were snappy. The text flowed. The model felt alive. This wasn't just a technical demo anymore; it was a genuinely usable tool running entirely on the hardware under my desk. That was the "whoa" moment I was looking for.
My Basic "Get it Running" Flow
Here's kinda how it went down in a nutshell, and how you could probably do it too. Total setup time: about 30-45 minutes if everything goes smoothly.
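Condensed to the essentials, it's just the commands from above, in order:
# 1. Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# 2. Confirm the NVIDIA driver is alive (sudo ubuntu-drivers autoinstall if not)
nvidia-smi
# 3. (Docker route only) install the NVIDIA Container Toolkit as shown earlier
# 4. Pull a model sized for your hardware
ollama pull phi3
# 5. Talk to it
ollama run phi3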
Model Recommendations by Use Case
Based on my experiments, here's what I'd recommend:
Limited RAM (8-16GB)
- Phi-3: Fast and capable for basic tasks
- Mistral 7B: Good balance of performance and capability
- Llama 2 7B: Solid all-rounder
Good GPU (8GB+ VRAM)
- Llama 3 8B: My current favorite - great performance
- Mistral 7B: Runs beautifully with room to spare
- CodeLlama 7B: If you're focusing on coding tasks
Powerful Setup (24GB+ VRAM)
- Llama 3 70B: The full experience
- Mixtral 8x7B: Impressive mixture-of-experts model
- Yi 34B: Great for longer contexts
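For reference, here's how I'd pull a few of these. The tag names are the ones I believe they're published under in the Ollama model library, so double-check there if one doesn't resolve:
# Lighter models
ollama pull phi3
ollama pull mistral
# Good GPU territory
ollama pull llama3
ollama pull codellama
# Big-rig models
ollama pull llama3:70b
ollama pull mixtral:8x7b
ollama pull yi:34b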
Challenges and What I Bumped Into
Model Sizes: These things can be chunky! Some models need a good bit of RAM. My setup is decent, but you've got to pay attention to those model requirements. A good habit is to run ollama list to see what you've got downloaded and their sizes.
Prompting is an Art: Just like with the big guys, getting the right answer means asking the right way. It's a skill in itself, and I'm still figuring that bit out.
Which Model When?: So many models, each good at different things (coding, chat, summarization). It's a bit of trial and error to find your favorites for different tasks. Phi-3 is great for quick tasks on low-spec hardware, while Llama 3 is a fantastic all-rounder if you have the GPU power.
Quick Troubleshooting Tips
- nvidia-smi not found? Run sudo ubuntu-drivers autoinstall and reboot
- Ollama using CPU instead of GPU? Check docker run --gpus all support with: docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
- Out of memory errors? Try smaller models or quantized versions: ollama pull llama3:8b-instruct-q4_0
- Model won't download? Check disk space - some models need 20GB+ free space
What I Used (Mostly)
- Ollama: For running the models locally on my Ubuntu box
- Llama 3 / Phi-3: Seemed like good starting points, not too massive but very capable
- Python & LangChain (a little bit): Started looking into how LangChain can talk to these local models because that opens up building actual little apps
Was it Worth It? Heck Yeah
Even though it's just a personal sandbox, understanding what it takes to run these LLMs locally has been super insightful. It's one thing to use an API, but it's quite another to configure the drivers, manage the models, and see the performance difference firsthand. It definitely gives me a different perspective on the tech we work with every day.
Next Steps for Me
Probably dive deeper into using LangChain with a local model for a simple RAG (Retrieval Augmented Generation) setup with my own documents. That seems like the next fun challenge. Now that the environment is stable, I might try out a more visual tool like LM Studio or GPT4All to compare the experience.
Actually, I'm already working on something practical - using this setup for personal finance analysis. Imagine having an AI that can categorize your expenses and give you insights, all while keeping your financial data completely private on your own machine. Stay tuned for Part 2!
Getting Started Yourself
If you're in dev and curious about LLMs, I'd say give it a shot. It's easier than you might think to get something basic running, and it's a cool learning curve. Just pick a tool, grab a model, and see what happens!
- Minimum specs: 16GB RAM, 6GB+ VRAM GPU (or patience for CPU-only)
- Install Ollama: That one-liner command above
- Start small: Try Phi-3 first, then work up to larger models
- Join the community: The Ollama Discord is super helpful
Helpful Resources
- Ollama Model Library - Browse available models and their requirements
- NVIDIA Container Toolkit Docs - Official installation guide
- LangChain Local LLM Guide - For when you're ready to build apps