How to Save Money Using Custom LLMs for Specific Tasks
MCP transforms AI from "chatbot" to "capable agent" by managing the messy details of tool integration and execution. With local models.
Join the DZone community and get the full member experience.
Join For FreeAI has already moved beyond text generation. Modern agents can browse the internet, read documents, call APIs, query databases, and coordinate numerous actions between tools and services. They are expected to do more than simply provide a single nebulous answer.
In real-world systems, agents evaluate the quality of their own results, independently identify errors, and learn. This capacity for reflection and adaptation distinguishes deep agent systems from the simple, one-off interactions of language models based on the 'one question, one answer' principle. A single answer implies incomplete reasoning, a lack of context, unclear instructions, and contradictory constraints. Rather than treating the generated results as final, the agent verifies them by asking questions:
- Does the result match the user’s intentions?
- Are there any logical inconsistencies?
- Is the answer comprehensive and well-structured?
Consequently, generating a response takes a long time as it involves numerous verification steps. Generation and evaluation are not the same task and for the same agent. The generator creates an initial response, while the evaluator analyses it for correctness, clarity, and alignment with the user’s intentions. As with humans, the evaluator should not be constrained by the same assumptions that led to the generator’s initial output. If an error is found, it is sent back, and the model is retrained, and so on, in a cycle.
It is important to manage feedback loops and response revisions effectively. Endless cycles of revision are counterproductive and super-super costly sometimes. Clear evaluation criteria, follow-up questions for the user, a list of corrective strategies, and explicit decision points are required.
A good prompt should describe how the system is supposed to operate, which tools must be used, and what steps should be taken. However, the more complex the task, the greater the chance of making a mistake. Like in every other aspect of IT processes. This is where the Model Context Protocol (MCP) comes in. The MCP enables us to identify and execute the necessary actions across different programs, access external resources, and retrieve results. For instance, to parse a website and create a mock-up of it in Figma, you would use the Selenium URL loader. Think of the MCP as a bridge facilitating pre-defined interactions between models, tools, and external systems. MCP reduces the effort required of the user to describe actions. Tools and resources are pre-loaded onto the MCP server rather than being described in text instructions.
If a user requests a summary of recent news, for example, Newspaper3K is configured to retrieve the relevant data, and the Oolama + OpenAI API is set up for local and server-side text generation. It is the model itself that decides which feature to use, rather than attempting to recreate behavior using prompts from the user. MCP transforms the model into something suitable for real-world tasks.
The MCP can be viewed as a coordination system that links intelligence and execution. The model focuses on understanding user intentions and answering the question, 'What does the user want from me?' The MCP manages the discovery, verification, and orchestration of tools and available resources. The LLM can't call APIs independently; this is done by the MCP. The MCP also helps to prevent context fragmentation. The context window represents the maximum number of tokens that the model can process in a single request.
However, there is no magic solution; the 'do it right' button has yet to appear, so we still have a job to do. It’s best to interact with an LLM using structured, detailed prompts to ensure predictable, consistent behavior. Providing clear instructions reduces the likelihood of misuse, wasted tokens, and confusion.
Tokens are the basic units of text. There are various tokenisation methods; popular examples include WordPiece, SentencePiece and BPE. You can import the nltk library and extract tokens from a sentence yourself: 'What goes around comes around' would be split into 'what', 'goes', 'around', 'comes', 'around', and these would then be converted into 0 and 1 for ML. As we can see, in this sense, LLMs are very similar to linear regression in fact.
Key components of MCP:
- "Clients" that manage user interactions, conversation state, and orchestration.
- Servers that provide discoverable tools and resources. Typically, these are HTTP-based servers that act as lightweight backends, remaining active and accepting requests via URLs.
- Messages convey intent, context, and execution results.
- Structures for incoming and outgoing data.
This separation helps the MCP avoid entanglement between models and execution logic. While each component remains independent, they continue to work together via a common protocol (which may be the MCP or another protocol). Models do not speculate or invent actions; they operate strictly within the capabilities defined by the MCP. This simplifies system debugging, makes deployment safer, and ensures more predictable behavior.
Broadly speaking, resources are documents, files, or any other type of structured content. All of these are accessible via a URI. This ensures that the model operates within defined rules and constraints, which makes it easy to debug errors. Therefore, it is important that each tool can be tested in isolation and reused. This is the only way to scale the system.
However, there are a few rules to follow when working with resources. Typically, businesses want instant access via an LLM to all the documentation accumulated over the last 30 years. You know, legacy, a set of PDFs, and so on. Even if we are technically able to provide the entire text at once upon request, we should still avoid large documents. This helps to maintain readability. Here, we will use an actor-critic architecture with two models: one selects the tool, and the other validates the quality of the selection via a reward. One model is responsible for the rules and the other for the value to the user.
What If There Are Any Errors?
Architecture inevitably becomes more complex over time. Or maybe even at the first iteration. The more complex and interconnected AI becomes, the greater the likelihood of errors or even failure. The key question, given that we are no longer dealing with predictable CRUD services, is: ‘How can we properly restore operations after errors occur?’
For AI systems, recovery from failures means ensuring system operation continues, and results remain acceptable, even if individual components fail. Rather than allowing a failure to bring the entire system to a halt, well-designed systems continue to operate. In other words, the system must be resilient, continuing to function even if some components fail. Is GPT-5.4 unavailable?
In that case, we switch to Gemini 2.5. The system may degrade, but it will continue to operate. This is better than a complete system failure. Ideally, you should have alternative tools and models, as well as simplified logical paths. And, of course, backups. If we cannot identify and fix the problem, we will only provide conservative responses if the model starts producing answers that are unsafe or violate policy.
The debugging process involves checking the input data and then testing the functionality of the tools and APIs, including checking their availability, latency, and response integrity.
Multi-Step Reasoning
Single-step reasoning is effective for simple queries, but becomes less so when tasks involve dependencies or intermediate solutions. In such situations, rather than immediately producing a final answer, the agent must track the progress of execution at every stage. Multi-stage reasoning addresses this by breaking down complex goals into smaller subtasks, preserving context separately at intermediate stages, and altering the execution sequence in the event of incorrect assumptions. Validation acts as a control mechanism in multi-stage workflows in the event of failures. This prevents errors from different stages from accumulating, and prevents tokens from being wasted on calculations based on incorrect data.
The likelihood of failure is very high if an agent has to tackle a highly complex, long-term task. One of the main reasons for this is an inability to prioritize sub-tasks. Hierarchical planning is required to distinguish between strategy and implementation. To focus on the long-term goal, we need temporal abstraction and constant feedback from the user.
Monitoring
LangSmith is a useful tool for monitoring agents. It is compatible with both LangChain and LangGraph and is run on Runs. An alternative is Langfuse, which is better suited to enterprise environments where there is a dedicated role for analyzing the request processing pipeline (from my PoV). It has a great dashboard, too. Langfuse enables you to troubleshoot issues using tracing. If a problem arises due to unexpected interactions between search processes, request formation, or model execution, Langfuse can help.
However, LangSmith also shows the sequence of events from start to finish, taking context into account. Classic Prometheus and Datadog are still suitable for tracking agents' activities. Overall, however, combining the Streamlit interface, LangChain pipelines, vector storage, and LangSmith tracing into a single app.py is a good solution. Centralization simplifies tracking, debugging, and analyzing workflows. So, the problem has been identified — what next?
When implementing AI in a large company, API failures are most often caused by incorrect input data or unexpected response structures rather than errors in the model itself. LangServe's automatic schema inference reduces the number of failures before the request even reaches the model, so this is nothing new.
I would suggest using containerization to reproduce errors. This provides service isolation to prevent dependency conflicts and enables reproducible deployments using container images with specific versions. There are also other benefits of container orchestration. Containerized components include:
- Agent APIs: access to tool execution via LangServe or similar frameworks.
- MCP servers: provide standardized access to tools and resources using the MCP client-server model. Containerization of MCP servers ensures consistent tool availability across all environments. The key is to avoid hard-coded file paths.
- Monitoring: Log execution traces, performance metrics, and assessments using LangSmith or similar tools.
- Supporting infrastructure: Databases, vector stores, or simply files accessed by agents.
Data
We’ve received a PDF file, and our task is to make it accessible via an LLM. First, the PDF needs to be split into chunks, each with a unique UUID. After embedding, these chunks should be stored in a vector database. The text must be transferred either sentence by sentence or with chunk overlap to preserve context between chunks. RAG will then enable us to interact with the document.
RAG is essentially an LLM that has access to a knowledge base. It can also reduce hallucinations to some extent. As always, the key to success here is data: its quality, stability, backups, and access speed. The high-level process is as follows:
query > retrieve > generate
To implement RAG on AWS, you can consider using Bedrock for the LLM, OpenSearch for access to the vector database (S3), and Lambda. Bedrock is Amazon’s service for deploying AI agents, and I love their prompt management. The most critical aspect of RAG is uploading files; it is crucial to provide high-quality content that the system will process and respond to.
Here, we have to keep in mind Amdahl's law in the context of parallel computing. The idea is simple: performance gains plateau as the number of processing threads increases because the sequential parts of the task cannot be parallelized. When compiling the llama.cpp file on a 24-core, 64-thread AMD Threadripper processor, I have noticed that increasing the number of threads from 12 to 64 significantly reduced the time taken for compilation. However, exceeding 64 threads only yielded a marginal improvement, due to I/O bottlenecks and sequential dependencies.
As part of the Amazon ecosystem, Bedrock is bundled with SageMaker for model training, AWS App Studio, and Amazon Q, which is a ready-to-use AI assistant. Also, if the free version of Google Colab proves insufficient, AWS SageMaker is a more or less excellent alternative. If you have chosen Bedrock, you will most likely use the async/await architecture in Rust and the Tokio runtime for parallel Bedrock API calls.
Amazon OpenSearch Serverless can be used as a vector database. And it's a pretty popular option. Rather than performing searches based on keyword matches, it indexes documents and performs searches based on semantic similarity. In the RAG pipeline on AWS, documents from S3 are split into fragments, embedded using Amazon Titan or a similar model, and stored in a vector index. This allows the most relevant content to be retrieved in response to user queries and synthesized using an LLM.
Well, grain of salt. After Amazon had been mentioned so many times, the experts began to consider the associated costs. It’s important to keep costs under control. Data is the new gold, for sure. But having too much data isn’t good for the wallet. It's important to be able to cache frequently executed queries. If you need a step-by-step guide:
- Use Bedrock alongside S3 as your data source and OpenSearch Serverless as your vector search engine.
- Implement smart chunking to optimize documents for search.
- If real-time data freshness is not required, use batch loading intervals instead of continuous updates.
- Add a caching layer for frequently asked queries.
The development of the agent can be broken down into three stages.
- Data preparation involves data loading, pre-processing, and structuring. Chunking and embedding.
- Indexes: preparing for successful data retrieval. Vector stores and SQL are all available in ChromaDB, Pinecone, and FAISS. The type of database is important because FAISS can store the index and perform searches on the GPU, speeding up searches by orders of magnitude. Meanwhile, GraphRAG enables you to link information to context and build connections.
- Retrievers are used to find the right document based on a query. Hybrid search retrieves the required document. It can also delete documents.
One challenge you’ll face repeatedly is reducing your monthly LLM costs while maintaining response quality and ensuring compliance with data privacy regulations. To achieve this, you should examine your current pay-per-call costs on Bedrock and compare them with fixed-price alternatives. You will most likely need to migrate workloads involving large volumes of data and heightened privacy requirements to the locally deployed llama.cpp platform with GGUF quantized models. This will eliminate API usage fees and improve data security. However, we won’t be able to completely abandon Bedrock if we require massive models. We can prototype on Canvas while MLOps keeps an eye on costs.
Fine-Tuning
Although pre-trained models are useful, we usually need our own. We can adapt models that have been pre-trained on large datasets to our smaller task. The simplest approach is standard fine-tuning, which involves updating the weights to adapt the model to our dataset. We take a pre-trained model and do not overwrite it. If your tasks are typical and you have a large dataset, then standard fine-tuning is the way to go.
The second fine-tuning option is low-rank adaptation (LoRa), which involves adding small matrices to specific layers. This approach requires only around 0.1% of the original set of parameters. In effect, it enables targeted adjustments to be made to the model when computational resources are limited. It even works for large models. The original weights remain unchanged, but are combined with the matrices. This enables us to adapt the model for a wide variety of tasks. We use it when resources are limited, for multitasking, and to avoid catastrophic forgetting. LoRa is well-suited to open-source projects, and PEFT is widely used. It also enables models to adapt easily to new tasks.
The third option is Supervised Fine-Tuning (SFT), which is a model that minimizes the loss function. It is particularly well-suited to tasks requiring high accuracy when a labeled dataset is available.\
The overall process will look like this:
- We need a dataset.
- It is prepared.
- A new layer is created.
- The model is trained.
- The model is tested and deployed. Lesson from my painful experience: pay particular attention to the file ID, as one small mistake could result in costly mistakes. If you have someone specially trained in a specific area (SME), you could opt for RLHF (training via human feedback).
In practice, the training data is stored in JSONL format and uploaded to OpenAI’s servers. Then, a task is created on FineTuning. You can view the demo here. I prefer to use jqlang when working with JSONL.
Before training the model, make sure you have defined and configured the training parameters. Key parameters:
- Learning rate: If this is set too high, the results will be unsatisfactory. If it is too low, the model will take a very long time to train.
- Batch size: The smaller the batch size, the less stable the model will be.
- The number of epochs: The lower this is, the weaker the training will be. Setting the epochs parameter to 5 means that the dataset will be iterated through five times.
LLAMA
Would you like to install the model locally? GGUF is the ideal solution for local models on LLAMA. It acts as a sort of bridge. It feeds into the GGUF Conversion Pipeline, a multi-stage process that converts a model from the original Hugging Face format into a single artifact file ready for deployment. After quantization, we reduce the file size from 62 gigabytes to approximately 19 gigabytes using llama-quantize. If the system can handle it, we can use the model to our heart's content.
My code is not the best, and an LLM could generate a better one. However, this code has worked fine on five different machines with different parameters and operating systems, so it's pretty robust. Download Llama and its extensions. The Llama C++ toolkit converts models into locally deployable helpers.
git clone https://github.com/ggerganov/llama.cpp.git
curl -LsSf https://astral.sh/uv/install.sh | sh
Check all the configured repositories that have been deleted in the current Git repository.
git remote -v
Installing huggingface_hub.
make GGML_METAL=1 GGML_ACCELERATE=1 -j8
pip3 install --user huggingface_hub\[cli\]
pip3 install --upgrade --user 'huggingface_hub[cli]'
And we use a script to download a 23-gigabyte model.
python3 -c "
from huggingface_hub import hf_hub_download
print('Downloading Qwen 2.5 Coder 32B Q5_K_M...')
hf_hub_download(
repo_id='Qwen/Qwen2.5-Coder-32B-Instruct-GGUF',
filename='qwen2.5-coder-32b-instruct-q5_k_m.gguf',
local_dir='.',
local_dir_use_symlinks=False
)
print('Download complete!')
"
Or a smaller version, because the larger version runs very slowly on my computer:
cd ~/git/llama.cpp
python3 -c "
from huggingface_hub import hf_hub_download
print('Downloading Qwen 2.5 Coder 7B Q5_K_M (~5GB)...')
hf_hub_download(
repo_id='Qwen/Qwen2.5-Coder-7B-Instruct-GGUF',
filename='qwen2.5-coder-7b-instruct-q5_k_m.gguf',
local_dir='.',
local_dir_use_symlinks=False
)
print('Download complete!')
"
ls -lh ~/git/llama.cpp/*.gguf
Run the following command: curl -LsSf https://astral.sh/uv/install.sh | sh, then check the version using uv --version.
Download the dependencies. UV is required to run the script that converts from PyTorch to GGUF.
uv run --with transformers --with torch --with sentencepiece \
python convert_hf_to_gguf.py /actual/path/to/model
pip3 install --user transformers torch sentencepiece protobuf numpy
After running UV, the next steps are uv venv to create the environment and uv sync to install the dependencies. It's for troubleshooting.
Quantization to reduce the model size, as discussed in the article. Optional.
curl -LsSf https://astral.sh/uv/install.sh | sh
cd ~/git/llama.cpp
# Create build directory
mkdir build
cd build
# Configure with Metal support (for Mac GPU)
cmake .. -DGGML_METAL=ON
# Build (use -j8 for parallel compilation)
cmake --build . --config Release -j8
ls -la bin/
./bin/llama-quantize \
../qwen2.5-coder-32b-instruct-q5_k_m.gguf \
../qwen2.5-coder-32b-instruct-q4_k_m.gguf \
Q4_K_M
llama-cli runs the model locally. Now, to start a conversation, go to http://127.0.0.1:8082/.
cd ~/git/llama.cpp/build
./bin/llama-server \
-m ../qwen2.5-coder-7b-instruct-q5_k_m.gguf \
-c 8192 \
-ngl 99 \
--port 8082
I hope this article helps you save money on LLMs, tokens, and MCPs.
Opinions expressed by DZone contributors are their own.
Comments