Install Llama-Cpp-Python With GPU Support

This article is a walk-through of installing the llama-cpp-python package with GPU capability (cuBLAS) so that models can be loaded easily onto the GPU.

By Manish Kovelamudi · May 01, 2024 · Tutorial

If you are looking for a step-by-step approach to installing the llama-cpp-python package, you are in the right place. This guide summarizes the steps required for installation.

Before we install, you might wonder: why does this package need to be installed separately to get GPU capability?

This package gives us a class (Llama, exposed through LangChain as LlamaCpp) to create a model instance or object, primarily for pre-trained LLM models.
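
For context, here is a minimal sketch of that interface in use once installation is complete. The model path below is a hypothetical local GGUF file, not something this guide provides:

Python
 
# Minimal sketch of the package's own interface (the LangChain wrapper appears later)
from llama_cpp import Llama

# Hypothetical path to a local GGUF model file
llm = Llama(model_path="./models/mistral-7b.gguf", n_gpu_layers=-1)
output = llm("Q: What is a GPU? A:", max_tokens=32)
print(output["choices"][0]["text"])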

By default, even if your system has an Nvidia GPU with all the CUDA compilers and packages installed, this package installs with CPU-only capability.

Installing with GPU capability enabled speeds up computation for LLMs (Large Language Models) by automatically transferring the model onto the GPU.

In this guide, detailed steps are provided to install this package using cuBLAS, the GPU-accelerated linear algebra library provided by Nvidia.

Tested System Configuration

  • System — Azure VM
  • OS — Ubuntu 20.04
  • LLM model used — Mistral-7B

Prerequisites

  1. Ensure the Nvidia CUDA toolkit is installed; the minimum required version is 12.2.
  • Download the required package from Nvidia's official website and install it.
  • Verify the installation by running the command nvidia-smi. This command should detect your GPU.
  • Also verify the install location: the /usr/local/ directory should contain a cuda-12.2 folder with all the required files inside (a quick Python check follows below).
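
If you prefer to check programmatically, here is a small sanity-check sketch. It assumes the default install prefix /usr/local/cuda-12.2 used throughout this guide:

Python
 
# Confirm the CUDA toolkit landed where the build flags below expect it
from pathlib import Path

cuda_home = Path("/usr/local/cuda-12.2")
print("nvcc present:", (cuda_home / "bin" / "nvcc").exists())
print("cuBLAS header present:", (cuda_home / "include" / "cublas_v2.h").exists())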

2. Install the GCC and G++ compilers needed to compile and install the package

  • Add the GCC toolchain repository using the below command.
  • sudo add-apt-repository ppa:ubuntu-toolchain-r/test
  • Install the gcc and g++ compilers using the command below (the minimum required version is 11 for both).
  • sudo apt install gcc-11 g++-11
  • Update alternatives using the below command to make version 11 the default.
  • sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 60 --slave /usr/bin/g++ g++ /usr/bin/g++-11
  • Check the installed versions of GCC and G++ to confirm correct installation.
  • gcc --version # This should print the gcc version as 11.4.0
  • g++ --version # This should print the g++ version as 11.4.0

3. Install the LangChain and CMake packages using the below command

Shell
 
pip install langchain cmake


Llama-CPP Installation

  • By default, the build picks up whichever CUDA version it finds on the VM. If multiple CUDA versions are installed, the specific version needs to be pointed to explicitly, as in the command below.
  • Use the below command for the installation of the package.
Shell
 
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.2 -DCUDAToolkit_ROOT=/usr/local/cuda-12.2 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12.2/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.2/lib64" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
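
As a quick smoke test that the freshly built wheel installed cleanly, import the package and print its version (the exact version string will vary):

Python
 
# Post-install smoke test; a successful import means the wheel built and installed
import llama_cpp

print(llama_cpp.__version__)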


Verifying Installation

Verify the installation by creating an instance of the LLM model with the verbose = True parameter enabled.

Python
 
from langchain.llms import LlamaCpp

# model_path points to a local GGUF model file (e.g., a Mistral-7B build)
model = LlamaCpp(model_path=model_path, n_gpu_layers=-1, verbose=True)


n_gpu_layers = -1 is the main parameter: it transfers all available computation layers onto the GPU. Alternatively, you can set the exact number of layers you want to transfer, but -1 will automatically calculate and transfer them all.
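
For example, if GPU memory is limited, a partial offload might look like the sketch below (20 is an arbitrary illustration, not a recommendation):

Python
 
# Offload only the first 20 layers to the GPU; the remaining layers run on the CPU
model = LlamaCpp(model_path=model_path, n_gpu_layers=20, verbose=True)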

verbose = True prints the model's details and parameters.

When the model is loaded, check the terminal console for the following lines.

Device: <your-gpu-name> (Ex: Device 0: Tesla T4)

BLAS = 1 (indicates that the model is loaded onto the GPU)
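
Once those lines appear, a short test query confirms end-to-end generation (the prompt is an arbitrary example):

Python
 
# Any prompt works; we only care that generation completes on the GPU build
answer = model.invoke("Q: What does BLAS stand for? A:")
print(answer)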

Comparison

LlamaCPP With CPU

Time taken to load Mistral-7B model: 1 min (approx)

Time taken to generate a response to a query: 20 min (approx)

LlamaCPP With GPU

Time taken to load Mistral-7B model: 30 sec (approx)

Time taken to generate a response to a query: 30 sec (approx)
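
For reference, here is a sketch of how these rough timings could be reproduced (the model path is hypothetical):

Python
 
import time
from langchain.llms import LlamaCpp

# Hypothetical local path to a GGUF build of Mistral-7B
model_path = "./models/mistral-7b.gguf"

start = time.perf_counter()
model = LlamaCpp(model_path=model_path, n_gpu_layers=-1)
print(f"Load time: {time.perf_counter() - start:.1f} s")

start = time.perf_counter()
model.invoke("Q: Summarize CUDA in one sentence. A:")
print(f"Response time: {time.perf_counter() - start:.1f} s")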

Conclusion

Based on the load and response-generation times, there is a significant performance difference when we use the llama-cpp-python package with GPU support. If you have one or more GPUs attached to your system, consider installing the package this way for better performance.


Published at DZone with permission of Manish Kovelamudi. See the original article here.

Opinions expressed by DZone contributors are their own.
