Zero to Neuron Series 6: Coding QLoRA - Fine-Tuning an LLM on a Single GPU


We've done the analogies. We've done the math by hand. Now, it's time to make the GPU sweat.

Welcome to the most practical post in the series! We're going to take all that theory from Series 4 and 5 and turn it into working Python code.

Our goal: To fine-tune a 7-Billion parameter model on a custom dataset, all for free, in a Google Colab notebook.

This is the magic of QLoRA. Let's get coding.

Step 1: Setup in Google Colab

First, open a new Colab notebook. Go to Runtime > Change runtime type and select a GPU accelerator (like the T4). This is essential.

Now, in the first cell, we'll install all the necessary libraries. trl (Transformer Reinforcement Learning) is a special library from Hugging Face that makes supervised fine-tuning (SFT) incredibly simple.

!pip install -q transformers datasets accelerate peft trl bitsandbytes
  • transformers: The core Hugging Face library.

  • datasets: For loading our data.

  • accelerate: Makes PyTorch training simple, even on complex hardware.

  • peft: The Parameter-Efficient Fine-Tuning library (this is LoRA).

  • trl: The Supervised Fine-tuning (SFT) Trainer.

  • bitsandbytes: The magic library for 4-bit quantization.

Step 2: Load Model (The "Q" in QLoRA)

We'll load the powerful mistralai/Mistral-7B-Instruct-v0.1 model. But we'll load it in 4-bit precision.

This is all handled by the BitsAndBytesConfig.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# The "Q" in QLoRA: 4-bit Quantization
# We define the quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Use "nf4" (Normal Float 4) as recommended by QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16, # Use bfloat16 for computation; on GPUs without native bfloat16 support (such as the T4), torch.float16 is the usual alternative
    bnb_4bit_use_double_quant=True, # Use nested quantization
)

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

# Load the 4-bit model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto" # Automatically map the model to the GPU
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Set a padding token if one isn't already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Just like that, we have a 7-billion parameter model loaded on our Colab GPU. You can check the memory usage: it should be roughly 5 GB, not the 28 GB+ that the same weights would need in 32-bit precision!

[Screenshot: GPU memory usage without quantization]

[Screenshot: GPU memory usage with quantization to 4 bits]

Wow! The memory saved by quantization alone is easy to see.
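To see where those numbers come from, here is a rough back-of-the-envelope sketch. It only counts the weights themselves, using the parameter count that Step 4 prints (all params minus the LoRA params). The function name is made up for illustration. Real usage will land somewhat higher than the 4-bit figure, since things like quantization constants, any layers kept in higher precision, and the CUDA context also take memory.

```python
# Back-of-the-envelope weight memory for the base model at different precisions.
# BASE_PARAMS comes from the counts printed in Step 4
# (all params minus trainable LoRA params); the rest is plain arithmetic.
BASE_PARAMS = 7_283_675_136 - 41_943_040  # 7,241,732,096 base weights


def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Storage needed for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp32_gb = weight_memory_gb(BASE_PARAMS, 32)
fp16_gb = weight_memory_gb(BASE_PARAMS, 16)
int4_gb = weight_memory_gb(BASE_PARAMS, 4)

print(f"fp32 : {fp32_gb:.1f} GB")
print(f"fp16 : {fp16_gb:.1f} GB")
print(f"4-bit: {int4_gb:.1f} GB")
```

On the live model you can compare this estimate with model.get_memory_footprint(), which reports the bytes actually held by the loaded weights.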

Step 3: Load the Dataset

We need data to fine-tune on. We'll use a simple, popular instruction dataset called mlabonne/guanaco-llama2-1k, a small 1,000-sample subset of the Guanaco dataset in which each row stores the instruction and the response together in a single text column.

from datasets import load_dataset

# Load a small, clean dataset
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

# You can uncomment the line below to see an example
# print(dataset[0]['text'])

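This post treats the data as following the pattern ### Human: [Instruction] ### Assistant: [Response]. The "llama2" in the dataset name suggests the text may instead use the Llama 2 [INST] ... [/INST] template, so print an example to confirm the exact markers before you write an inference prompt.

Either way, SFTTrainer can train directly on a single text column like this one.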

Step 4: Configure the Adapters (The "LoRA")

Now we add the "cheat sheet"—the LoRA adapters. We use peft to define our LoraConfig.

The LoRA adapter is the pair of small matrices, A and B.

LoraConfig (The "Cheat Sheet" Blueprint)

This object defines the properties of your LoRA adapters (the A and B matrices).

  • r=16: This is the Rank. It controls the "size" and "complexity" of your LoRA adapter. It sets the inner dimension of the A and B matrices. A higher rank means more trainable parameters, which can capture more complex patterns but also uses more memory. 16 is a very common, effective setting.

  • lora_alpha=32: This is the scaling factor, or "volume knob," for the adapters. The LoRA adapter's output is scaled by the ratio lora_alpha / r. In our case, 32 / 16 = 2, so the adapter's contribution is doubled before it is added to the frozen layer's output. Setting lora_alpha to twice the rank is a common rule of thumb.

  • lora_dropout=0.05: This is a regularization technique. During training, it randomly zeroes 5% of the activations entering the adapter (not the adapter's weights). This helps prevent the model from "memorizing" your training data (overfitting) and helps it generalize better.

  • bias="none": This tells the trainer to not train the original bias terms in the linear layers. This saves a small amount of memory and is standard practice for QLoRA.

  • task_type="CAUSAL_LM": This tells the PEFT library that your goal is Causal Language Modeling (i.e., predicting the next word). This is essential for models like Mistral and Llama.

  • target_modules="all-linear": This is one of the most important lines. It tells PEFT to find every Linear layer in the model's transformer blocks (the attention projections q_proj, k_proj, v_proj, o_proj and the MLP layers gate_proj, up_proj, down_proj) and attach a LoRA adapter to each of them; the output head (lm_head) is left out. Adapting all linear layers is the strategy the QLoRA paper recommends.
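Before the real config, here is a toy sketch of what one adapted layer computes: the frozen output plus the low-rank update scaled by lora_alpha / r. It is plain Python with made-up 2x3 matrices and hypothetical helper names, and it leaves dropout out for brevity, so treat it as an illustration rather than what peft runs internally.

```python
# Toy LoRA forward pass: h = W x + (lora_alpha / r) * B (A x).
# Sizes and values are made up for illustration; real layers are far larger.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # project down to rank r, then back up
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


r, lora_alpha = 2, 4                     # same 2x ratio as r=16, lora_alpha=32
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen weight, shape (2, 3)
A = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]   # trainable, shape (r, 3)
B_init = [[0.0, 0.0], [0.0, 0.0]]        # trainable, shape (2, r), starts at zero
B_trained = [[0.1, 0.0], [0.0, 0.1]]     # pretend values after some training
x = [1.0, 2.0, 3.0]

out_init = lora_forward(W, A, B_init, x, r, lora_alpha)
out_trained = lora_forward(W, A, B_trained, x, r, lora_alpha)
print(out_init)     # identical to the frozen layer's output
print(out_trained)  # frozen output nudged by the adapter
```

Because B starts at zero, the adapted model behaves exactly like the base model at step 0; training only ever moves A and B.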

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1. Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

# 2. Configure LoRA
# This is where we define our "cheat sheet"
peft_config = LoraConfig(
    r=16,  # The "rank" of the adapter matrices. Higher = more parameters, but potentially more expressive.
    lora_alpha=32,  # The scaling factor (alpha)
    lora_dropout=0.05,  # Dropout for regularization
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear" # Apply LoRA to all linear layers
)

# 3. Apply LoRA to the model
model = get_peft_model(model, peft_config)

# Let's see how many parameters we're *actually* training
model.print_trainable_parameters()

Output: trainable params: 41,943,040 || all params: 7,283,675,136 || trainable%: 0.5758
Look at that! We've frozen about 99.42% of the model and are training only about 0.58% of the total parameters. Far fewer trainable parameters means far fewer gradients and optimizer states to store, which is a big part of why this fits on a single GPU.
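If you're curious where 41,943,040 comes from, you can rebuild it by hand. The sketch below assumes the published Mistral-7B layer shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 decoder layers) instead of reading them from the loaded model, and the helper name is made up for illustration.

```python
# Reproduce the trainable-parameter count from the layer shapes.
# Shapes are assumed from the published Mistral-7B configuration,
# not read from the loaded model.
R = 16
NUM_LAYERS = 32
LINEAR_SHAPES = {  # name: (in_features, out_features)
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A has shape (r, in_features) and B has shape (out_features, r)."""
    return r * in_features + out_features * r


per_layer = sum(lora_params(i, o, R) for i, o in LINEAR_SHAPES.values())
trainable = per_layer * NUM_LAYERS
total = 7_283_675_136  # "all params" from the printed output above
trainable_pct = 100 * trainable / total

print(f"per layer : {per_layer:,}")
print(f"trainable : {trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
```

Doubling r doubles every term, so the adapter grows linearly with the rank.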

Step 5: Train the Model

Now, we just set up the trainer and hit go. We'll use the SFTTrainer from the trl library, which is built for this exact purpose.

TrainingArguments (The Training Rulebook)

This object is the "instruction manual" for the training process itself.

  • output_dir="./results": This tells the trainer to create a folder named "results". Inside this folder, it will save:

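    • Checkpoints: Sub-folders like checkpoint-500, checkpoint-1000, etc., which are snapshots of your adapter during training. By default a checkpoint is written every 500 steps (save_steps), so a 100-step run like ours may not produce any.

    • The Final Adapter: The trained adapter weights (adapter_model.safetensors in recent peft versions, adapter_model.bin in older ones), written when you call trainer.save_model().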

    • Training Logs: Data that can be used to plot your model's learning curve.

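  • num_train_epochs=1: How many times to loop over the entire training dataset. One epoch is common for fine-tuning. Note that max_steps (below) takes priority whenever it is set to a positive value, so this run stops before the epoch completes.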

  • per_device_train_batch_size=4: How many training samples to process at once on the GPU. A small batch size like 4 is used to save memory.

  • gradient_accumulation_steps=1: A memory-saving trick. If you set this to 4, it would run 4 "mini-batches" of size 4 before updating the weights, effectively simulating a larger batch size of 16. Since it's 1, you are not using this trick.

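  • optim="paged_adamw_8bit": The optimizer. This is a memory-efficient version of AdamW from bitsandbytes that keeps its optimizer states in 8-bit and can page them out to CPU memory when the GPU runs short. QLoRA does not strictly require it, but it helps avoid out-of-memory errors on small GPUs.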

  • logging_steps=25: Prints the training loss to your console every 25 steps so you can watch it learn.

  • learning_rate=2e-4: The learning rate. This is a crucial hyperparameter that controls how big of a change is made to the adapter weights at each step. 2e-4 (or 0.0002) is a very common, effective learning rate for LoRA.

  • fp16=True: This enables mixed-precision training. It performs calculations in 16-bit floating point (half-precision) instead of 32-bit, which dramatically speeds up training and saves memory.

  • max_grad_norm=0.3: A technique called gradient clipping. If the combined norm of all gradients exceeds 0.3, they are scaled down so that the norm equals 0.3. This keeps a single "loud" or "explosive" batch from destabilizing training.

  • max_steps=100: This is a hard stop. It tells the trainer: "Stop training after exactly 100 steps (100 batches), even if the epoch isn't finished." This is perfect for a quick test run.

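  • warmup_ratio=0.03: For the first 3% of training (i.e., the first 3 of our 100 steps), the learning rate ramps up from 0 to 2e-4. This helps the model stabilize at the very beginning. Warmup only takes effect with a scheduler that supports it.

  • lr_scheduler_type="constant": The learning rate stays fixed at 2e-4 for the whole run. In transformers, the plain "constant" schedule skips warmup entirely; choose "constant_with_warmup" if you want the ramp described above followed by a constant rate.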

  • report_to="none": This is an important line. By default, the Trainer tries to log results to all available platforms, especially "wandb" (Weights & Biases). Setting this to "none" disables all external logging and prevents the "wandb" login pop-up.
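A few of these settings interact, so it helps to run the numbers. The sketch below works out the effective batch size, the steps in a full epoch, and the warm-up length, then models the "ramp up, then stay flat" learning rate described in the bullets (the behaviour of the "constant_with_warmup" scheduler type in transformers). The dataset size of 1,000 is an assumption based on the "1k" in the dataset name, and the function name is made up for illustration.

```python
import math

# Values from the TrainingArguments described above.
# DATASET_SIZE is an assumption based on the "1k" in the dataset name.
DATASET_SIZE = 1000
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 1
MAX_STEPS = 100
WARMUP_RATIO = 0.03
BASE_LR = 2e-4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS   # samples per weight update
steps_per_epoch = math.ceil(DATASET_SIZE / effective_batch)
samples_seen = MAX_STEPS * effective_batch               # samples used before the hard stop
warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)


def lr_constant_with_warmup(step: int) -> float:
    """Linear ramp from 0 to BASE_LR over warmup_steps, then flat."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    return BASE_LR


print(f"effective batch : {effective_batch}")
print(f"steps per epoch : {steps_per_epoch}")
print(f"samples seen    : {samples_seen}")
print(f"warmup steps    : {warmup_steps}")
for step in (0, 1, 2, 3, 50):
    print(step, lr_constant_with_warmup(step))
```

With these settings, the 100-step run sees 400 of the 1,000 samples, which is why it works as a quick test and not a full epoch.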

from transformers import TrainingArguments
from trl import SFTTrainer

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_8bit", # Use the 8-bit paged optimizer for memory efficiency
    logging_steps=25,
    learning_rate=2e-4,
    fp16=True, # Use 16-bit precision for training
    max_grad_norm=0.3,
    max_steps=100, # Set to -1 to train for the full epoch
    warmup_ratio=0.03,
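    lr_scheduler_type="constant", # Plain "constant" skips warmup; use "constant_with_warmup" to apply warmup_ratio
    report_to="none", # Disable external loggers such as wandb
)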
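To make max_grad_norm less abstract, here is a small sketch of clipping by global norm on a flat list of gradient values. It follows the same idea as torch.nn.utils.clip_grad_norm_ (rescale everything by max_norm / total_norm when the norm is too large), but the function itself is a made-up, pure-Python illustration.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + 1e-6)  # small epsilon avoids division by zero
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm


# A "loud" gradient with norm 5.0 is shrunk to norm ~0.3, direction unchanged.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.3)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, clipped, norm_after)

# A quiet gradient (norm ~0.14) is left untouched.
untouched, _ = clip_by_global_norm([0.1, 0.1], max_norm=0.3)
print(untouched)
```

Note that the whole gradient is scaled by one factor, so its direction is preserved; only its length is capped.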

SFTTrainer (The Final "Go" Button)

This object brings everything together to create the final trainer. Note: The syntax for this object has changed in recent trl versions. If this code gives a TypeError, you may need to move some of these arguments into a separate SFTConfig object.

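  • model=model: Your 4-bit quantized base model, already wrapped with LoRA adapters in Step 4.

  • train_dataset=dataset: Your instruction data.

  • peft_config=peft_config: Your LoRA blueprint (the LoraConfig object you created earlier). When SFTTrainer receives a plain base model, this is what tells it to add LoRA adapters instead of doing full fine-tuning. Since we already called get_peft_model ourselves, it is largely redundant here.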

  • dataset_text_field="text": Tells the trainer that the column in your dataset to use for training is named "text".

  • max_seq_length=512: This is a critical memory-saving setting. It caps every training example at 512 tokens; anything beyond that is truncated. This ensures you don't run out of memory on a single, very long text sample.

  • tokenizer=tokenizer: The tokenizer needed to convert the "text" data into token IDs that the model can understand.

  • args=training_args: Your training rulebook (the TrainingArguments object you just defined).

# Create the SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)

# Train the model!
trainer.train()

This will take a few minutes. You'll see the training loss go down, which means our tiny LoRA "cheat sheet" is successfully learning from the new data!

[Screenshot: training loss logged every 25 steps]

Step 6: Test the Fine-Tuned Model

The training is done. But did it work? Let's ask it a question from our dataset.

1. Prepare Model for Inference

This switches the model from "learning mode" to "generating mode" for high speed and correct outputs.

  • model.config.use_cache = True: Enables caching to generate text much faster.

  • model.eval(): Disables training-only layers like Dropout.

  • with torch.no_grad(): Disables gradient calculations, saving a large amount of VRAM and speeding up the process.


2. Prepare the Prompt

This formats your question so the model understands it.

  • prompt = "...": Your input text. This must match the training format (e.g., ### Human: ... ### Assistant: ) to properly "trigger" the fine-tuned response.

  • inputs = tokenizer(...): Converts your text string into a PyTorch tensor on the GPU for the model to read.
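Since the prompt has to match the training format exactly, it can help to build it in one place. This is a minimal sketch using the ### Human / ### Assistant template from this post; build_prompt is a hypothetical helper, not part of any library, and you should swap in the markers your dataset actually uses.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a question in the '### Human / ### Assistant' template used in this post."""
    return f"### Human: {instruction.strip()} ### Assistant: "


prompt = build_prompt("  What is QLoRA?  ")
print(repr(prompt))
```

Keeping the template in a single function means every test prompt gets the same markers and spacing.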


3. Generate the Response

This is the "control panel" for running the model.

  • model.generate(...): The main function that generates new text.

  • max_new_tokens=250: A hard stop to set the maximum reply length.

  • temperature=0.7 & do_sample=True: These work together to control creativity. do_sample turns on sampling, and temperature=0.7 makes the model's answers creative but not too random.

  • eos_token_id: Tells the model which special token means "stop generating."
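To build some intuition for temperature, here is a toy sketch of what sampling does with the model's raw scores (logits) for the next token. The three logits are invented, and the function is a plain-Python illustration of temperature-scaled softmax, not the actual code inside model.generate.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
probs_default = softmax_with_temperature(logits, 1.0)
probs_cool = softmax_with_temperature(logits, 0.7)

# do_sample=True corresponds to drawing from these probabilities
# instead of always taking the most likely token.
rng = random.Random(0)
next_token = rng.choices(range(len(logits)), weights=probs_cool, k=1)[0]

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_cool])
print(next_token)
```

At temperature 0.7 the top token gets a larger share than at 1.0, so answers stay more focused while still varying from run to run.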


4. Decode the Output

This translates the model's answer back into human-readable text.

  • tokenizer.decode(output[0], ...): Converts the model's output (a tensor of token IDs) back into a plain string.

  • skip_special_tokens=True: Cleans up the final text by removing any special tokens like <s> or </s>.

# --- Let's test our fine-tuned model ---

# IMPORTANT: Re-enable the KV cache and disable gradient checkpointing for inference
model.config.use_cache = True
model.gradient_checkpointing_disable()

# Set model to evaluation mode
model.eval()

prompt = "### Human: What are the main differences between programming language and Scripting language and query language and examples? ### Assistant: "

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate a response
with torch.no_grad():  # Disable gradient computation for inference
    output = model.generate(
        **inputs,
        max_new_tokens=250,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode and print the response
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)

Example Output:

### Human: What are the main differences between programming language and Scripting language and query language and examples? ### Assistant: 

1. Programming Language: Programming languages are used to create complex software applications, games, and other programs that can run on different platforms. Some examples of programming languages are Java, C++, Python, Ruby, and JavaScript.
2. Scripting Language: Scripting languages are used to create shorter programs that automate repetitive tasks or customize applications. Some examples of scripting languages are PHP, Perl, Ruby, and JavaScript.
3. Query Language: Query languages are used to extract data from databases and other data sources. Some examples of query languages are SQL, NoSQL, and GraphQL.
The main difference between these languages is their purpose and complexity. Programming languages are more complex and require more time and effort to learn, while scripting languages are simpler and easier to learn, but have limited functionality. Query languages are used for specific tasks, such as accessing data from databases, and are not suitable for creating complex software applications. 
...

The model now gives a detailed, structured answer in the style of the data it was fine-tuned on. It works!
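You'll notice the printed response starts by repeating the prompt. That's because, for a causal language model, generate returns the prompt tokens followed by the new tokens. If you only want the answer, you can slice off the prompt before decoding. The sketch below shows the idea on plain lists with made-up token IDs; new_tokens_only is a hypothetical helper.

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the first prompt_length IDs so only the generated continuation remains."""
    return output_ids[prompt_length:]


prompt_ids = [1, 774, 10649]               # made-up token IDs for the prompt
output_ids = prompt_ids + [415, 2191, 2]   # prompt followed by generated tokens
answer_ids = new_tokens_only(output_ids, len(prompt_ids))
print(answer_ids)

# With the real tensors from this post, the same slice would look like:
#   answer_ids = output[0][inputs["input_ids"].shape[1]:]
#   answer = tokenizer.decode(answer_ids, skip_special_tokens=True)
```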

Conclusion

You've done it. You've taken a massive 7-billion parameter model, loaded it in 4-bit precision, attached a tiny LoRA adapter, and fine-tuned it on a custom dataset, all from a free Colab notebook.

You can save your new, tiny "cheat sheet" (the adapter) and load it on top of the base model anytime.

# You can save your new adapter
trainer.save_model("my-mistral-adapter")

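The adapter itself holds only about 42 million parameters, so it takes up roughly 100 to 200 MB depending on the precision it is saved in, not 28 GB!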
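As a sanity check on the adapter size, here is the arithmetic using the trainable-parameter count printed in Step 4. Which figure you actually see on disk depends on the precision the adapter weights are saved in, so treat both numbers as estimates; size_mb is a made-up helper.

```python
TRAINABLE_PARAMS = 41_943_040  # from print_trainable_parameters() in Step 4


def size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate file size in megabytes (1e6 bytes), ignoring file metadata."""
    return num_params * bytes_per_param / 1e6


fp32_mb = size_mb(TRAINABLE_PARAMS, 4)  # adapter stored in 32-bit floats
fp16_mb = size_mb(TRAINABLE_PARAMS, 2)  # adapter stored in 16-bit floats

print(f"fp32 adapter: {fp32_mb:.0f} MB")
print(f"fp16 adapter: {fp16_mb:.0f} MB")
```

Either way, the adapter is a few hundred times smaller than the full-precision base model.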

You now have the complete pipeline: the theory (Series 4), the math (Series 5), and the code (Series 6).

Have we learned QLoRA fully? Pretty much, but we can take this learning to the next level by mapping the code back to the math! That's what the next blog in this series is about: Zero to Neuron Series 7: QLoRA - Finding the Math in the Code

Happy coding meow! (Don't forget to subscribe, or my cat will... well, you know. 😼)