Run Llama3.1-8B (LLM) on DigitalOcean CPU droplet

Llama is a series of open-source pre-trained LLMs released by Meta that can be used to run ChatGPT-like prompts locally on CPU or GPU. Llama comes in different sizes such as 1B, 8B, 70B and 405B parameters, and the capability of the LLM increases with the number of parameters. Running Llama locally is often preferred over the OpenAI API because you keep full control over the model, your data stays on your own server and there are no per-token API costs.

The 8B variant performs reasonably well: it can answer moderate-level questions and write simple code, which is something the 1B variant cannot do reliably.

LLMs are usually too large to run on a CPU because of limited memory and compute power, while GPUs are expensive.

If you wish to test Llama on a CPU server due to budget limitations, you can run the Llama 3.1 8B model on a DigitalOcean CPU droplet at a nominal cost of about $84 per month. By contrast, running it (or the larger Llama 3.1 or 3.3 variants) on an NVIDIA GPU on DigitalOcean may cost over $90 per day, roughly 30X more.

In this guide at OpenGenus.org, we present the steps to set up the minimum required DigitalOcean CPU droplet, download the Llama3.1-8B-Instruct model, prepare a script to run it on a sample input and generate the output. The computation time can vary from 2 to 10 minutes (or more) depending on the length of the input prompt and the generated response.

Table of contents:

  • Minimum requirements to run Llama3.1-8B
  • Minimum DigitalOcean CPU droplet needed
  • Steps to prepare your droplet to run Llama on CPU
  • Script to run LLM with an input prompt
  • Run the LLM script on CPU

Minimum requirements to run Llama3.1-8B

Minimum requirements to run Llama3.1-8B or any LLM:

  • 16GB RAM, so that prompts of 4096+ tokens can be run comfortably.
  • Llama3.1-8B needs a minimum of roughly 10GB of RAM just to load the weights into memory, plus additional room for the KV cache; with less memory the model fails to perform inference.
  • At least 20GB of free SSD storage (the Llama3.1-8B model files are about 16.4GB).
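
As a rough sanity check on these numbers, you can estimate the memory taken by the model weights alone at different precisions (a back-of-envelope sketch; the exact parameter count, KV cache and runtime overhead will add to this):

# Rough estimate of memory needed just for the weights of an 8B-parameter model.
# KV cache, activations and the Python process itself need additional headroom.
params = 8.0e9  # approximate parameter count of Llama3.1-8B

for dtype, bytes_per_param in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{dtype}: ~{gib:.1f} GiB for weights alone")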

Minimum DigitalOcean CPU droplet needed

Specification of the most economical DigitalOcean CPU droplet required to run the Llama 3.1 8B model:

  • Machine type: Basic
  • CPU type: Premium AMD
  • vCPUs: 4
  • RAM: 16GB
  • SSD: 80GB
  • Transfer: 8TB
  • Price: $84 per month (As of May 2025)

Once the droplet is created, an IP address is assigned to it and you have to set the root password.

Steps to prepare your droplet to run Llama on CPU

Install MobaXterm

  • Install the free version of MobaXterm to connect to the DigitalOcean droplet from a terminal.

  • Click on new Session

  • Select the SSH Client type

    Figure: MobaXterm: SSH client

  • Enter the IP address in the Remote Host field and the username provided by DigitalOcean (typically root).

    Figure: MobaXterm: IP address and username

  • Now the Session is created

  • Enter the password in the terminal to log in to the droplet.

Get access to Llama3.1-8B from Meta

Llama is a series of LLMs from Meta (formerly known as Facebook).

  • Request Access to Llama3.1 from Meta
  • Once the request is approved, a unique download URL will be shared with you through email.
  • Install Llama CLI:
pip install llama-stack
  • To check the available models, run:
llama model list

Output:

Figure: List of available Llama models

  • The model that we have to download is Llama3.1-8B-Instruct.
  • Use the following command to download the model:
llama download --source meta --model-id Llama3.1-8B-Instruct


Figure: Download the required Llama model

  • When prompted, provide the unique URL sent by Meta through email.

Get access from HuggingFace

Since Llama3.1-8B is a gated repository on HuggingFace, you also have to request access there.


Figure: Access request at HuggingFace

Fill in the details shown in the form above, then:

  • To check for approval updates, click on your HuggingFace profile, then go to Settings -> Gated Repositories

    Figure: HuggingFace: request status accepted

  • Once accepted, go to access tokens

    Figure: HuggingFace: access token

  • Create a new token

    Figure: HuggingFace: create new token

  • Fill in the details and select all the user permissions

    Figure: HuggingFace: user permissions

  • Save the access token for later use (Note: this is a sample which we have deleted. You need to create your own token).

    Figure: HuggingFace: save token
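
As an alternative to the huggingface-cli login step shown later, you can also authenticate directly from Python with the huggingface_hub library (a minimal sketch; the token below is a placeholder, paste the access token you created above):

from huggingface_hub import login

# Placeholder token: replace with the access token created in the steps above
login(token="hf_xxxxxxxxxxxxxxxxxxxx")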

Install necessary libraries

Install pip:

sudo apt install python3-pip -y

Install huggingface_hub, transformers, torch and accelerate:

pip install huggingface_hub
pip install transformers
pip install torch
pip install accelerate
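
Optionally, a quick Python check confirms that the libraries are installed and shows how many CPU threads PyTorch will use for inference (a small sanity-check sketch; it is not required for the rest of the guide):

import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CPU threads used by torch:", torch.get_num_threads())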

Script to run LLM with an input prompt

  • Create a file named "prompt.py":


    Figure: LLM script

Following is the code to start inference with your LLM (we will go through an explanation of the code after it):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython<|eot_id|><|start_header_id|>user<|end_header_id|>

Write code to check if number is armstrong, use that to see if the number 153 is armstrong<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

# Tokenize and move to model device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate the model's response
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,  # pass the mask explicitly to avoid a warning
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

# Decode and print only the model's answer
response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(response)

Explanation of the script

In short, the code prompts Llama 3.1 8B to generate a program to check if the number 153 is an Armstrong number. The process involves tokenizing the input, feeding it to the model and decoding the generated response to print it.
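
For reference, a correct answer to this prompt looks roughly like the following (our own sketch of an Armstrong number check, not verbatim model output):

def is_armstrong(n: int) -> bool:
    digits = str(n)
    power = len(digits)
    # An Armstrong number equals the sum of its digits, each raised to the number of digits
    return n == sum(int(d) ** power for d in digits)

print(is_armstrong(153))  # True, since 1**3 + 5**3 + 3**3 == 153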

This code uses the HuggingFace transformers library to interact with the LLM and generate a response for a programming task. Following is a step-by-step explanation:

  • Imports
    AutoTokenizer and AutoModelForCausalLM are classes from the HuggingFace transformers library. The AutoTokenizer is used to handle tokenization (converting text into input suitable for the model), and AutoModelForCausalLM is used to load the language model (in this case, for causal language modeling, which generates text based on a prompt).

torch is the PyTorch library, used for managing tensors and running models on devices like CPU or GPU.

  • Model ID
    model_id = "meta-llama/Llama-3.1-8B-Instruct" specifies which model to load. It refers to a specific pre-trained model called Llama 3.1 with 8 billion parameters.

  • Load Tokenizer and Model
    The tokenizer is loaded from the specified model, which prepares the text for the model. The model is loaded with torch_dtype=torch.bfloat16 (which reduces memory usage and improves speed on supported hardware), and device_map="auto" automatically places the model on the available device (CPU or GPU).

  • Prompt
    A prompt is defined which contains a system instruction (Environment: ipython) and a user request to write a program that checks if the number 153 is an Armstrong number. The format uses special tokens like <|begin_of_text|> to mark the different sections; see the apply_chat_template sketch after this list for a way to build the same prompt without writing these tokens by hand.

  • Tokenize Input
    The prompt is tokenized (converted into numerical representations) and sent to the model's device.

  • Generate Model Response
    The generate method is used to generate the model's response based on the provided input. In this case, it generates up to 128 new tokens with sampling controls (temperature=0.7 and top_p=0.9) to control the randomness and creativity of the response.

  • Decode and Output the Response
    The output tokens from the model are decoded back into text, skipping any special tokens used for processing.
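
Instead of writing the special tokens by hand as in the prompt above, recent versions of transformers can build the same chat format for you with apply_chat_template. The following is a sketch under that assumption; it reuses the tokenizer and model already loaded in prompt.py:

messages = [
    {"role": "system", "content": "Environment: ipython"},
    {"role": "user", "content": "Write code to check if number is armstrong, use that to see if the number 153 is armstrong"},
]

# apply_chat_template inserts the <|begin_of_text|>, header and <|eot_id|> markers for us
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # end with the assistant header so the model replies next
    return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))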

Run the LLM script on CPU

Log in to HuggingFace
Run the following command:

huggingface-cli login


Figure: HuggingFace CLI login

Now enter the access token you saved earlier.

  • When asked whether to add the token as a git credential, select "n".
  • Once login is successful, we will run the Python script.

Run the command:

python3 prompt.py

Output:

Testing is possible on CPU, but once the script is ready and you have tested it on CPU (with some patience), you can switch to a GPU to compute much faster over a larger number of prompts.
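
When you do move to a GPU droplet, the only change needed in prompt.py is the dtype/device selection; the rest of the script stays the same. A sketch of that switch, assuming a CUDA-capable GPU and the same model_id:

from transformers import AutoModelForCausalLM
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Pick a dtype based on the hardware; everything else in prompt.py is unchanged.
dtype = torch.float16 if torch.cuda.is_available() else torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map="auto"   # places the model on the GPU when one is available
)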
