The Developer’s Guide to Local LLMs: Building, Running, and Scaling With Ollama
This article explores local LLMs through practical examples and examines in depth their limitations and the specifics of real-world use.
LLMs are already widely used for working with unstructured natural-language data. They also excel at extracting information and working with semi-structured data, such as JSON files and other lengthy configuration files. This makes it possible to use them to interact with relational data as well, for example. Cloud-based LLMs are effective and powerful, but they have some limitations. That's where local LLMs come into play.
Local LLMs: Pros and Cons
I first realized the need for local LLMs while developing software for a critical industry (healthcare), where personal health information is strictly regulated and the use of cloud-based LLMs is therefore very limited. So, privacy is the first benefit of local LLMs.
The second reason cloud-based LLMs may not fit is their limited level of customization. When a system needs custom fine-tuning or other modifications to the model, these may be easier to implement on a local LLM.
The third reason may not be so rational, but it also makes some sense: local LLMs are fun. You can use one the same way you use a cloud-based LLM, but conveniently and without depending on an Internet connection. You can download the model of interest to your laptop and handle much of your work routine as you would with ChatGPT or Gemini. Of course, a local LLM will be more limited in terms of knowledge cutoff, especially compared to cloud models running in "Thinking" mode. But if your goal is not deep research or analysis, it may be a great fit.
The dark side of local LLMs is an older knowledge cutoff, less intelligence, and lower speed. These are not always bottlenecks. For example, for a large share of tasks (about 70% by my rough estimate), such as information extraction, summarization, and transformation, they will perform similarly to cloud-based systems. Scalability, however, may be a challenge.
One more limitation, not often mentioned but still critical, especially for production usage, is licensing.
Architecture of the Local LLM Runtimes
You may find many great LLM runtimes that help you get started with deploying and running an LLM locally. Some of them are LM Studio, Ollama, and Jan AI.
Their purpose is to provide an environment and a UI/API interface for the LLMs themselves, making working with them easier and more manageable.
The typical architecture of these runtimes consists of an inference engine, a local web server, and the client tools that talk to it.
For example, Ollama uses llama.cpp as its engine, whose function is to load a model into memory and run it. The web server listens on port 11434 by default and lets local applications and CLI/GUI tools communicate with the model. The user interacts with the model via the shell or a GUI application, and software applications connect through the same web server. After installing the LLM runtime, select the desired model(s) and download them to the local PC or laptop. The runtime then loads the model into memory, and it becomes available for prompting.
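Because the runtime exposes a plain HTTP server, any language with an HTTP client can talk to it. Below is a minimal sketch that uses only the Python standard library. It assumes Ollama's `/api/generate` endpoint on the default port and the `llama3.2:3b` model used later in this article; the helper names `build_generate_request` and `generate` are illustrative and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port mentioned above


def build_generate_request(prompt, model="llama3.2:3b", base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running Ollama server and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Inspect the request without needing a running server.
request = build_generate_request("Why is the sky blue?")
print(request.get_method(), request.full_url)
```

With the Ollama server running and the model pulled, calling `generate("Why is the sky blue?")` should return the model's answer as a string.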
Licensing
This topic is especially important for production or commercial use of LLMs. The good news is that most LLM runtimes have permissive licenses that allow commercial use (but double-check the specific tool for the exact details). The second layer is the model itself. So, for example, if you use Ollama with a Meta Llama model, you need to read two licenses carefully:
- The Ollama license
- The Meta Llama license
It is therefore important to confirm that both licenses allow commercial use of the model before building a commercial application.
Installation
This article showcases Ollama's capabilities. Ollama is suitable for local experiments as well as for building applications. Once you understand how this runtime works, it will be much easier to apply similar patterns to other runtimes.
Step 1. Install the Ollama Application
Download the application for Windows, Linux, or macOS from the official download page.
Step 2. Pull the Model and Run It
For example, let's install the first local LLM. Run this in the terminal:
ollama pull llama3.2:3b
ollama run llama3.2:3b
Ollama is managed from the terminal. Below are some useful commands for working with Ollama models:
ollama list # list installed models
ollama pull llama3.2 # download a model
ollama run llama3.2 # run chat in terminal
ollama rm llama3.2 # remove a model
ollama show llama3.2 # show model info (template, params, etc.)
ollama ps # show loaded models
# On macOS, Ollama can also be installed and managed with Homebrew:
brew install ollama # install ollama
brew services start ollama # start server
brew services stop ollama # stop server
brew services restart ollama # restart server
brew services list | grep -i ollama # check if ollama is running
GUI for Interaction
In July 2025, Ollama also released a GUI application that provides a visual experience for prompting local LLMs. It simplifies interaction and supports loading files as well. You can download it from the official site.
The application lets you prompt local LLMs much like ChatGPT and similar tools, including attaching PDFs and other text-based files. Some models are also multimodal, meaning they can accept images as input alongside text.
Building Applications on Top of Those Local LLMs
The prerequisites for running the code below are:
1. Install Ollama locally.
2. Install the Ollama Python library (pip install ollama).
3. Pull and start the local model (llama3.2:3b in this example):
ollama pull llama3.2:3b
ollama run llama3.2:3b
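Here is the application code itself:
from ollama import chat

# A single user message; the chat API expects a list of role/content dicts.
messages = [
    {
        'role': 'user',
        'content': 'Generate a 3-4 sentence description of the random product from Amazon?',
    },
]

# Send the conversation to the local model and print the generated text.
response = chat('llama3.2:3b', messages=messages)
print(response['message']['content'])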
The example answer was:
I've generated a fictional product description. Here it is:
"The Intergalactic Dreamweaver" is a unique, patented sleep
mask designed to enhance and control your dreams while you sleep
...
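The call above waits for the complete answer before printing anything. The Ollama Python library can also return the answer in fragments when `stream=True` is passed to `chat`, which makes slower local models feel more responsive. The following is a minimal sketch under that assumption; the helper names `collect_stream` and `stream_chat` are illustrative, and the stand-in chunks only mimic the shape of real streamed responses so the helper can be tried without a server.

```python
def collect_stream(chunks, echo=True):
    """Print each streamed fragment as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        if echo:
            print(piece, end="", flush=True)
        parts.append(piece)
    return "".join(parts)


def stream_chat(prompt, model="llama3.2:3b"):
    """Ask a running Ollama server for a streamed answer (needs `pip install ollama`)."""
    from ollama import chat  # imported here so collect_stream stays dependency-free

    messages = [{"role": "user", "content": prompt}]
    return collect_stream(chat(model=model, messages=messages, stream=True))


# Dry run with stand-in chunks shaped like streamed chat responses.
fake_chunks = [
    {"message": {"content": "Hello"}},
    {"message": {"content": ", world"}},
]
text = collect_stream(fake_chunks, echo=False)
print(text)
```

With the server running, `stream_chat("Describe a random product in 3-4 sentences.")` prints the answer piece by piece and returns the assembled text.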
Remote Application Example Using Ollama
If you want to separate the Ollama server from the application server, this is easy to do, since Ollama includes a built-in web server.
To do that, I modified the previous code to connect to the Ollama server explicitly (here it is still localhost, but it may be a separate machine):
from ollama import Client

# Point the client at the Ollama web server; replace the host for a remote machine.
client = Client(host="http://localhost:11434")

messages = [
    {
        "role": "user",
        "content": "Generate a 3-4 sentence description of a random Amazon product?",
    }
]

response = client.chat(model="llama3.2:3b", messages=messages)
print(response["message"]["content"])
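When the server address differs between environments, it may be convenient to keep it out of the code. The sketch below reads the address from an environment variable and falls back to the local default. The variable name `OLLAMA_HOST` mirrors the one Ollama itself uses; the helper names `resolve_host` and `make_client` and the example IP address are hypothetical.

```python
import os

DEFAULT_HOST = "http://localhost:11434"


def resolve_host(env=None):
    """Return the server address from OLLAMA_HOST, or the local default."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_HOST") or DEFAULT_HOST


def make_client(env=None):
    """Create a client for whichever server resolve_host() selects."""
    from ollama import Client  # requires `pip install ollama`

    return Client(host=resolve_host(env))


# Dry run: resolve the address without contacting any server.
print(resolve_host({}))
print(resolve_host({"OLLAMA_HOST": "http://10.0.0.5:11434"}))  # hypothetical address
```

Note that the Ollama server listens only on the local interface by default, so a truly separate server has to be configured to accept remote connections before a client on another machine can reach it.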
Scalability Side of the Local LLMs
Let us look at how Ollama handles concurrent requests. If the application uses async mechanisms to send many prompts to the LLM at once, Ollama, in the setup tested here, processes them as a FIFO queue. This means the application will not encounter an error, but latency may increase.
For example, I successfully ran the following code on a MacBook with an M4 chip.
import asyncio
import time

from ollama import AsyncClient

QTY = 20  # number of concurrent requests to send
MODEL = "llama3.2:3b"
PROMPT = "Please generate a random description for a product on Amazon, 3-4 sentences."


async def ask(i):
    # Each task opens its own client and sends one chat request.
    client = AsyncClient()
    messages = [
        {
            "role": "user",
            "content": PROMPT,
        }
    ]
    response = await client.chat(MODEL, messages=messages)
    return i, response['message']['content']


async def main():
    start = time.time()
    # Launch all requests at once and wait for every answer.
    tasks = [asyncio.create_task(ask(i)) for i in range(QTY)]
    results = await asyncio.gather(*tasks)
    end = time.time()
    total_time = end - start

    results.sort(key=lambda x: x[0])
    for idx, answer in results:
        print(f"\n=== Answer #{idx + 1} ===")
        print(answer)
    print(f"\n--- Total time: {total_time:.2f} seconds ---")


if __name__ == "__main__":
    asyncio.run(main())
I changed only the QTY parameter, which determines the number of concurrent requests sent to the Ollama server.
The results were as follows:
- QTY = 1: 2.4 sec (2.4 sec per request)
- QTY = 2: 5.2 sec (2.6 sec per request)
- QTY = 10: 25 sec (2.5 sec per request)
- QTY = 20: 49 sec (2.5 sec per request)
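This experiment shows that, with the settings used here, Ollama processed the requests one after another rather than in parallel: the total time grows linearly with the number of requests. Ollama does document server-side concurrency settings (such as the OLLAMA_NUM_PARALLEL environment variable), which this experiment did not explore. Either way, the automatic queue means the client side should ultimately receive an answer.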
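The arithmetic behind these numbers can be checked with a few lines of Python. The sketch below takes the measured totals from the list above, derives the average time per request, and estimates when the request at a given queue position would finish, assuming strictly sequential (FIFO) processing. The function name `estimated_wait` is illustrative.

```python
# Total wall-clock time (seconds) measured for each QTY in the experiment above.
measurements = {1: 2.4, 2: 5.2, 10: 25.0, 20: 49.0}

# Average time per request stays roughly constant, so total time grows linearly.
per_request = {qty: total / qty for qty, total in measurements.items()}
for qty, seconds in per_request.items():
    print(f"QTY={qty:>2}: {seconds:.2f} sec per request")

# Mean service time across the four runs.
avg = sum(per_request.values()) / len(per_request)


def estimated_wait(position, service_time=avg):
    """Estimated completion time for the request at a 1-based queue position,
    assuming requests are served strictly one after another."""
    return position * service_time


print(f"Average service time: {avg:.2f} sec")
print(f"Estimated completion of request #20: {estimated_wait(20):.1f} sec")
```

Under this model, the last request in a burst waits for every request ahead of it, so worst-case latency scales with the burst size even though throughput stays the same.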
Conclusion
To conclude, let us return to the use cases and limitations of local LLMs. First of all, local LLMs are powerful enough to be taken seriously. They are not toys anymore. They are production-ready tools with rich framework support and enough intelligence to solve fairly complex tasks. They can also be fine-tuned, and while this article did not cover that topic, fine-tuning remains one of the important features local LLMs offer.
The limitations of local LLMs include scalability and speed. Licensing should not be a problem for ethical use of LLMs, but caution is important here, because some models may not allow commercial use.
Overall, local LLMs may be the only option for some critical industries, where privacy matters most. For other industries, they may be a good pick, with some trade-offs.
Opinions expressed by DZone contributors are their own.