Run LLMs with Ollama on H100 GPUs for Maximum Efficiency

Published on September 23, 2024

Introduction

This article is a guide to running large language models (LLMs) using Ollama on H100 GPUs offered by DigitalOcean. DigitalOcean GPU Droplets provide a powerful, scalable solution for AI/ML training, inference, and other compute-intensive tasks such as deep learning, high-performance computing (HPC), data analytics, and graphics rendering. Designed to handle demanding workloads, GPU Droplets enable businesses to scale AI/ML operations on demand without incurring unnecessary costs. Offering simplicity, flexibility, and affordability, DigitalOcean’s GPU Droplets ensure quick deployment and ease of use, making them ideal for developers and data scientists.

Now, with support for NVIDIA H100 GPUs, users can accelerate AI/ML development and test, deploy, and optimize their applications without the extensive setup or maintenance typically associated with traditional platforms. Ollama is an open-source tool that provides access to a diverse library of pre-trained models, offers straightforward installation and setup across different operating systems, and exposes a local API for integration into applications and workflows. Users can customize and fine-tune LLMs, optimize performance with hardware acceleration, and benefit from interactive user interfaces for intuitive interactions.

Prerequisites

Access to H100 GPUs: Ensure you have access to NVIDIA H100 GPUs, either through on-premises hardware or using GPU Droplets by DigitalOcean.
Supported Frameworks: Familiarity with Python and Linux Commands.
CUDA and cuDNN Installed: Ensure NVIDIA CUDA and cuDNN libraries are installed for optimal GPU performance.
Sufficient Storage and Memory: Have ample storage and memory available to handle large model datasets and weights.
Basic Understanding of LLMs: A foundational understanding of large language models and their structure to effectively manage and optimize them.

These prerequisites help ensure a smooth and efficient experience when running LLMs with Ollama on H100 GPUs.

What is Ollama?

Ollama offers a way to download large language models from its extensive model library, which includes Llama 3.1, Mistral, Code Llama, Gemma, and many more. Ollama combines model weights, configuration, and data into one package, specified by a Modelfile. It provides a flexible platform for creating, importing, and using custom or pre-existing language models, ideal for building chatbots, text summarization tools, and more. It emphasizes privacy, runs on Windows, macOS, and Linux, and is free to use. Ollama also allows users to deploy models locally with ease and supports real-time interactions via a REST API, which makes it well suited to LLM-powered web apps and tools. Its workflow is similar to Docker's: with Docker, we pull images from a central hub and run them in containers; with Ollama, we pull models from its library and run them locally. Ollama also lets us customize models by writing a Modelfile. Below is an example Modelfile:

FROM llama2

# Set the temperature
PARAMETER temperature 1

# Set the system prompt
SYSTEM """
You are a helpful teaching assistant created by DO.
Answer questions asked based on Artificial Intelligence, Deep Learning.
"""

Next, create and run the custom model:

ollama create MLexp -f ./Modelfile
ollama run MLexp
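
When several custom models are needed, the Modelfile text can be generated from a script instead of being written by hand. The following is a minimal sketch, not part of Ollama itself: the helper name `build_modelfile` is hypothetical, and it only covers the three instructions used in the example above (FROM, PARAMETER temperature, and SYSTEM).

```python
def build_modelfile(base_model: str, system_prompt: str, temperature: float = 1.0) -> str:
    """Return Modelfile text with a FROM, a PARAMETER temperature, and a SYSTEM line.

    Hypothetical helper for illustration; it does not escape triple quotes
    inside the system prompt.
    """
    lines = [
        f"FROM {base_model}",
        f"PARAMETER temperature {temperature}",
        f'SYSTEM """{system_prompt.strip()}"""',
    ]
    return "\n".join(lines) + "\n"


# Rebuild the Modelfile from the example above.
modelfile_text = build_modelfile(
    base_model="llama2",
    system_prompt=(
        "You are a helpful teaching assistant created by DO. "
        "Answer questions asked based on Artificial Intelligence, Deep Learning."
    ),
    temperature=1,
)
print(modelfile_text)
```

Saving the printed text as ./Modelfile and running the `ollama create` command shown above should then register the model in the same way as the hand-written file.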

The Power of NVIDIA H100 GPUs

The H100 is one of NVIDIA’s most powerful GPUs, designed specifically for artificial intelligence applications. With 80 billion transistors, about 1.5 times the A100’s 54 billion, it can process large datasets much faster than earlier GPUs.
AI applications are data hungry and computationally expensive, and the H100 is widely considered one of the best choices for handling such workloads.
The H100 features fourth-generation Tensor Cores and a Transformer Engine with FP8 precision. It triples the double-precision Tensor Core throughput of the previous generation, delivering 60 teraflops of FP64 computing, which is crucial for precise calculations in HPC tasks. It can also perform single-precision matrix-multiply operations at one petaflop of throughput using TF32 precision without requiring any changes to existing code, making it user-friendly for developers.
The H100 introduces DPX instructions that significantly boost performance for dynamic programming tasks, running up to 7X faster than the A100 and 40X faster than CPUs for specific algorithms such as DNA sequence alignment.
H100 GPUs offer 3 terabytes per second (TB/s) of memory bandwidth per GPU, which allows for efficient handling of large datasets.
The H100 supports scalability through technologies like NVLink and NVSwitch™, which allow multiple GPUs to work together effectively.
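
To get a feel for what the 3 TB/s memory bandwidth figure means for LLM inference, a rough back-of-the-envelope estimate can help. The sketch below rests on a simplifying assumption that is not from this article: when generating one token at a time, the GPU must read all model weights from memory once per token, so bandwidth divided by model size gives an upper bound on tokens per second for a single request. Real throughput will be lower, and the function name is hypothetical. The model sizes are the download sizes listed in the model table later in this article.

```python
def max_decode_tokens_per_second(bandwidth_tb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-request tokens/s, assuming each generated token
    requires reading every model weight from GPU memory exactly once."""
    bandwidth_gb_s = bandwidth_tb_s * 1000  # 1 TB/s = 1000 GB/s
    return bandwidth_gb_s / model_size_gb


# Model sizes in GB, taken from the table of Ollama models in this article.
for name, size_gb in [("Llama 3.1 8B", 4.7), ("Llama 3.1 70B", 40.0)]:
    bound = max_decode_tokens_per_second(3.0, size_gb)
    print(f"{name}: at most about {bound:.0f} tokens/s")
```

The estimate ignores compute time, the KV cache, and batching, so it should be read as a ceiling rather than a prediction.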

GPU Droplets

DigitalOcean GPU Droplets offer a simple, flexible, and cost-effective solution for your AI/ML workloads. These scalable machines are ideal for reliably running training and inference tasks on AI/ML models. Additionally, DigitalOcean GPU Droplets are well-suited for high-performance computing (HPC) tasks, making them a versatile choice for a range of use cases including simulation, data analysis, and scientific computing. Try GPU Droplets now by signing up for a DigitalOcean account. Watch the video to learn the steps to create a GPU Droplet.

How to create a GPU Droplet?

Why Run LLMs with Ollama on H100 GPUs?

To run Ollama efficiently and without hassle, an NVIDIA GPU is recommended; with a CPU alone, users can expect slow responses.

Thanks to its advanced architecture, the H100 offers exceptional computing power, which significantly speeds up LLM workloads.
Ollama lets users customize and fine-tune LLMs to meet their specific needs, enabling prompt engineering, few-shot learning, and tailored fine-tuning to align models with desired outcomes. Pairing Ollama with H100 GPUs reduces model inference times for developers and researchers.
H100 GPUs have the capacity to handle large models such as Falcon 180B, which makes them ideal for creating and deploying generative AI tools like chatbots or RAG applications.
H100 GPUs come with hardware optimizations like Tensor Cores, which significantly accelerate tasks involving LLMs, especially matrix-heavy operations.

Setting Up Ollama with H100 GPUs

Ollama is compatible with Windows, macOS, and Linux. Here we use Linux commands, as our GPU Droplets run a Linux OS.

Run the command below in your terminal to check the GPU specification.

nvidia-smi
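
The same check can be scripted, for example to confirm that a Droplet exposes the expected GPU before models are downloaded. The sketch below is one possible approach: it calls nvidia-smi with its `--query-gpu` and `--format` options and parses the CSV output. The function names are hypothetical, and the sample string in the usage note is made up for illustration rather than captured from a real machine.

```python
import subprocess


def parse_gpu_csv(csv_text: str) -> list[tuple[str, int]]:
    """Parse lines of 'name, memory_in_MiB' into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last comma so that a comma inside the name would not break parsing.
        name, memory_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(memory_mib)))
    return gpus


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

On a GPU Droplet, `query_gpus()` returns one tuple per GPU. The parser can also be tried without a GPU, for instance with `parse_gpu_csv("NVIDIA H100 80GB HBM3, 81559")`, where the input is an illustrative string.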

Next, install Ollama using the same terminal.

curl -fsSL https://ollama.com/install.sh | sh

This starts the Ollama installation immediately.

Once the installation is done, we can pull any LLM, such as Llama 3.1, Phi 3, Mistral, or Gemma 2, and start working with it.

To run and chat with a model, use the command below, replacing example_model with the model of your choice. Running a model with Ollama is straightforward, and on the powerful H100, generating a response is fast and efficient.

ollama run example_model

For example:

ollama run qwen2:7b

If you see the error "could not connect to ollama app, is it running?", use the commands below to enable and start the Ollama service:

sudo systemctl enable ollama

sudo systemctl start ollama
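
To confirm from a script that the service is up, one option is to send a request to the local Ollama server. The sketch below assumes the default address http://localhost:11434, which answers a plain GET request with a short status message when the server is running. The function name `is_ollama_running` is hypothetical.

```python
import http.client
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed/HTTP error response.
        return False
```

Calling `is_ollama_running()` after the systemctl commands above should return True. If it returns False, the service logs are a reasonable next place to look.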

Ollama supports a wide range of models. Here are some example models that can be downloaded and used:

Model	Parameters	Size	Download
Llama 3.1	8B	4.7GB	ollama run llama3.1
Llama 3.1	70B	40GB	ollama run llama3.1:70b
Llama 3.1	405B	231GB	ollama run llama3.1:405b
Phi 3 Mini	3.8B	2.3GB	ollama run phi3
Phi 3 Medium	14B	7.9GB	ollama run phi3:medium
Gemma 2	27B	16GB	ollama run gemma2:27b
Mistral	7B	4.1GB	ollama run mistral
Code Llama	7B	3.8GB	ollama run codellama
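
The sizes in the table give a quick way to judge how many GPUs a model is likely to need. The sketch below is a rough estimate under two assumptions that are not from this article: each H100 has 80 GB of memory, and the model needs about 20% more memory than its download size for the KV cache and runtime buffers. Actual requirements vary with context length and settings, and the function name is hypothetical.

```python
import math

H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant
OVERHEAD = 1.2       # assumption: ~20% extra for KV cache and runtime buffers

# Download sizes in GB, copied from the table above.
MODEL_SIZES_GB = {
    "llama3.1": 4.7,
    "llama3.1:70b": 40,
    "llama3.1:405b": 231,
    "phi3": 2.3,
    "phi3:medium": 7.9,
    "gemma2:27b": 16,
    "mistral": 4.1,
    "codellama": 3.8,
}


def gpus_needed(size_gb: float, gpu_memory_gb: float = H100_MEMORY_GB,
                overhead: float = OVERHEAD) -> int:
    """Estimate how many GPUs are needed to hold a model of the given size."""
    return math.ceil(size_gb * overhead / gpu_memory_gb)


for tag, size_gb in MODEL_SIZES_GB.items():
    print(f"{tag}: about {gpus_needed(size_gb)} x H100")
```

Under these assumptions, every model in the table except llama3.1:405b fits on a single H100, while the 405B model would need a multi-GPU configuration.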

With Ollama, users can run LLMs conveniently, even without an internet connection once a model has been downloaded, because the model and its dependencies are stored locally. For example, we can prompt the model as follows:

>>> Write a python code for a fibonacci series.


def fibonacci(n):  
    """  
    This function prints the first n numbers of the Fibonacci sequence.

    Parameters:  
    @param n (int): The number of elements in the Fibonacci sequence to print.

    Returns:  
    None

    """

    # Initialize the first two numbers of the Fibonacci sequence.  
    a, b = 0, 1

    # Iterate over the range and generate Fibonacci sequence.  
    for i in range(n):  
        print(a)  
        # Update the next number in the sequence  
        a, b = b, a + b

# Test function with first 10 numbers of the Fibonacci sequence.  
if __name__ == "__main__":  
    fibonacci(10)

This python code defines a simple `fibonacci` function that takes an integer argument and prints the first n numbers in the Fibonacci sequence. The Fibonacci sequence starts with 0 and 1, and each subsequent number is the sum of the previous two.

The if __name__ == "__main__": block at the end tests this function by calling it with a parameter value of 10, which prints out the first 10 numbers in the Fibonacci sequence.
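
The same prompt can also be sent from a program through Ollama's REST API rather than the interactive prompt. The sketch below uses only the Python standard library and assumes the server is listening on its default address, http://localhost:11434, and that the qwen2:7b model has already been pulled. It posts to the /api/generate endpoint with streaming disabled, so the reply arrives as a single JSON object. The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a POST request for /api/generate with streaming turned off."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `print(generate("qwen2:7b", "Write a python code for a fibonacci series."))` should print a response similar to the one shown above, although the exact text will differ between runs.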

Conclusion

Ollama is a new generative AI tool for working with large language models locally, offering enhanced privacy, customization, and offline accessibility. Ollama has made working with LLMs simpler, and by enabling users to explore and experiment with open-source LLMs directly on their own machines, it promotes innovation and a deeper understanding of AI. To access a powerful GPU like the H100, consider using DigitalOcean’s GPU Droplets, which are currently in Early Availability.

To get started with Python, we recommend checking out this beginner’s guide to set up your system and prepare for running introductory tutorials.