

Jul 19, 2023

Instruction fine-tuning Llama 2 with PEFT's QLoRA method

Llama 2 is a family of open-source large language models released by Meta. They can be used for a variety of tasks, such as writing different kinds of creative content, translating languages, and sentiment analysis. Llama 2 models come in two flavors: pre-trained and fine-tuned. While the latter is typically used for general-purpose chat use cases, the former can serve as a foundation to be fine-tuned further for a specific use case.

In this blog post, we will discuss how to fine-tune the pre-trained Llama 2 7B model using the PEFT library and the QLoRA method. We’ll use a custom instructional dataset to build a sentiment analysis model.

As an aside for the curious, we are experimenting with ways to integrate and use these fine-tuned models within UKey’s dynamic pricing (DP) engine. DP is being used by a partner to build a comment moderation tool.


You can skip this section if you are familiar with Google Colab, Weights & Biases (W&B), and the Hugging Face (HF) libraries and their vast ecosystem of models and datasets.

Google Colab, a hosted Jupyter notebook environment, isn’t a strict prerequisite, but we recommend using it to get access to a GPU and run quick experiments with your training scripts. Note that premium GPU access requires a paid plan. If you have one, we recommend updating the runtime settings to ensure that you have enough RAM, enough disk space, and a GPU such as an A100.

Next, get a W&B account so you can authorize your training script to log progress and other training metrics.

Then create an HF account and go to its settings to create an access token with at least read privileges. The training script will use this token to download the pre-trained Llama 2 model and your hosted dataset.

Finally, follow the instructions here to accept the terms and request access to Llama 2 models. Wait for emails from Meta AI and HF. You should be granted access in a day or two.

Prepare Your Dataset

Instruction fine-tuning is a common technique for adapting a base LLM to a specific downstream use case. The training examples look like this:

Below is an instruction that describes a sentiment analysis task...

### Instruction:
Analyze the following comment and classify the tone as...

### Input:
I love reading your articles...

### Response:
friendly & constructive

To create a training dataset that works easily with HF libraries, we recommend the JSON Lines (jsonl) format. The easiest approach is to write each example as a single-line JSON object with just a text field, one object per line. Something like this:

{ "text": "Below is an instruction ... ### Instruction: Analyze the... ### Input: I love... ### Response: friendly" }
{ "text": "Below is an instruction ... ### Instruction: ..." }
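One way to produce such a record, sketched here with only the Python standard library: render the template from its parts, then let `json.dumps` turn the multi-line prompt into a single line. The helper name `build_prompt` and its parameter names are our own placeholders, and the elided strings are copied as-is from the example above.

```python
import json


def build_prompt(preamble: str, instruction: str, comment: str, response: str) -> str:
    """Render one training example in the instruction/input/response layout."""
    return (
        f"{preamble}\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{comment}\n\n"
        f"### Response:\n{response}"
    )


prompt = build_prompt(
    preamble="Below is an instruction that describes a sentiment analysis task...",
    instruction="Analyze the following comment and classify the tone as...",
    comment="I love reading your articles...",
    response="friendly & constructive",
)

# json.dumps escapes the embedded newlines, so the whole record stays on one line.
jsonl_line = json.dumps({"text": prompt})
print(jsonl_line)
```

Reading the line back with `json.loads` restores the original newlines, so the model still sees the template exactly as it was written.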

There are many, many ways to pull raw data, process it, and create training datasets as jsonl files. Here’s a simple starter script:

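import json

with open('train.jsonl', 'a') as outfile:
    for example in raw_data:
        text = '<process_example>'
        # json.dumps escapes quotes and newlines inside the text.
        # Append the entry to the jsonl file, one JSON object per line.
        outfile.write(json.dumps({"text": text}) + '\n')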

A better way is to use data processing libraries like HF’s Datasets library.

Before we get to training, make sure you push the file to HF as a dataset repository. You can create a new dataset and upload the file through the UI, but the recommended way is to upload it with huggingface-cli.


For fine-tuning, we are going to rely on these HF libraries:

Parameter-Efficient Fine-Tuning (PEFT) is a library for efficiently fine-tuning LLMs without touching all of the LLM’s parameters. PEFT supports the QLoRA method, which trains a small set of low-rank adapter weights on top of a base model quantized to 4 bits. Read this excellent blog for more information.

Transformer Reinforcement Learning (TRL) is a library used to train language models with reinforcement learning. The Supervised Fine-tuning (SFT) Trainer API provided by TRL makes it a breeze to create our own models and train them on custom datasets.
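To make the "small fraction of parameters" in the PEFT description concrete, here is some back-of-the-envelope arithmetic. It is a sketch under stated assumptions: one square 4096 x 4096 projection matrix (4096 is the hidden size of Llama 2 7B), a LoRA rank of 64 picked purely for illustration, and a nominal 7 billion parameters for the memory estimate. The function names are ours.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


hidden = 4096  # hidden size of Llama 2 7B
rank = 64      # assumed LoRA rank, for illustration only

full = full_params(hidden, hidden)
adapter = lora_params(hidden, hidden, rank)
fraction = adapter / full
print(f"{adapter:,} trainable vs {full:,} frozen ({fraction:.2%})")

# Rough weight memory for a nominal 7e9 parameters. This ignores
# quantization constants, activations, and optimizer state.
n_params = 7_000_000_000
fp16_gb = n_params * 2 / 1e9    # 2 bytes per weight
int4_gb = n_params * 0.5 / 1e9  # half a byte per weight
print(f"fp16: {fp16_gb} GB, 4-bit: {int4_gb} GB")
```

With these numbers the adapter adds about 3% of the parameters of each matrix it wraps, and the 4-bit base weights need roughly a quarter of the memory of their fp16 version, which is what lets a 7B model fit on a single Colab GPU.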

You may want to install some of these libraries directly from the HF main branch. The main branch tends to have the latest features and support for fine-tuning models like Llama 2.

Time for the big reveal! The sample training script below was used as a starting point to build our sentiment analyzer. You can use this to fine-tune your own models.

!pip install -q huggingface_hub
!pip install -q -U trl transformers accelerate peft
!pip install -q -U datasets bitsandbytes einops wandb

# Uncomment to install new features that support latest models like Llama 2
# !pip install git+
# !pip install git+

# When prompted, paste the HF access token you created earlier.
from huggingface_hub import notebook_login
notebook_login()

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

dataset_name = "<your_hf_dataset>"
dataset = load_dataset(dataset_name, split="train")

base_model_name = "meta-llama/Llama-2-7b-hf"

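# 4-bit quantization settings for QLoRA. The values below are illustrative assumptions.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)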

device_map = {"": 0}

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)
base_model.config.use_cache = False

# More info:
base_model.config.pretraining_tp = 1 

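# LoRA adapter settings. The values below are illustrative assumptions; tune them for your task.
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)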

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

output_dir = "./results"

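# The hyperparameters below are illustrative assumptions; adjust them for your dataset and GPU.
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    max_steps=500,
    report_to="wandb",
)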

max_seq_length = 512

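# Argument names follow the TRL releases available in mid-2023; newer releases may expect some of them elsewhere.
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()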


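import os
output_dir = os.path.join(output_dir, "final_checkpoint")
# Save the trained LoRA adapter so it can be reloaded for inference.
trainer.model.save_pretrained(output_dir)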

If you are running the full Python script, it is better to use a command-line argument parser such as HfArgumentParser so you don’t have to hard-code the values.
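A minimal sketch of what that could look like: declare the values as fields of a dataclass and let `HfArgumentParser` turn each field into a command-line flag. The `ScriptArguments` class and the `parse_script_args` helper are hypothetical names of ours, and the defaults simply mirror the hard-coded values in the script above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

from transformers import HfArgumentParser


@dataclass
class ScriptArguments:
    dataset_name: str = field(
        default="<your_hf_dataset>",
        metadata={"help": "HF dataset repository to train on"},
    )
    base_model_name: str = field(
        default="meta-llama/Llama-2-7b-hf",
        metadata={"help": "Base model to fine-tune"},
    )
    output_dir: str = field(
        default="./results",
        metadata={"help": "Where checkpoints are written"},
    )
    max_seq_length: int = field(
        default=512,
        metadata={"help": "Maximum sequence length"},
    )


def parse_script_args(argv: Optional[List[str]] = None) -> ScriptArguments:
    """Parse argv (or sys.argv when argv is None) into a ScriptArguments instance."""
    parser = HfArgumentParser(ScriptArguments)
    (script_args,) = parser.parse_args_into_dataclasses(
        args=argv, look_for_args_file=False
    )
    return script_args


# Equivalent to running: python train.py --max_seq_length 256
script_args = parse_script_args(["--max_seq_length", "256"])
print(script_args.max_seq_length, script_args.output_dir)
```

In a real script you would call `parse_script_args()` with no arguments and read `script_args.dataset_name`, `script_args.max_seq_length`, and so on instead of the module-level constants.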

In a future blog post we’ll talk about evaluation, saving, and deployment of a fine-tuned model to production for inference. In the meantime, here’s a quick and dirty approach to load the model and do a sanity test.

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map=device_map, torch_dtype=torch.bfloat16)
text = "..."
# Move the tokenized inputs to the GPU once, so input_ids and attention_mask sit on the same device as the model.
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
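The decoded output contains the prompt followed by whatever the model generated, so for a sanity test it helps to pull out just the predicted label. Here is a small sketch, assuming the prompt ends with the `### Response:` marker from the template above; `extract_response` is a hypothetical helper of ours, and the sample string is hand-written rather than real model output.

```python
def extract_response(generated: str, marker: str = "### Response:") -> str:
    """Return the first non-empty line after the first response marker."""
    _, found, tail = generated.partition(marker)
    if not found:
        return ""  # marker missing: nothing to extract
    for line in tail.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Hand-written stand-in for tokenizer.decode(outputs[0], skip_special_tokens=True)
sample = (
    "### Instruction:\nAnalyze the following comment...\n\n"
    "### Input:\nI love reading your articles...\n\n"
    "### Response:\nfriendly & constructive\n\n### Input:\n"
)
print(extract_response(sample))
```

Taking only the first line after the marker discards any extra text the model keeps generating until it hits max_new_tokens.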

May your data be clean, your gradients smooth, and your models accurate!



© UKey Inc. 2023

