Zero to Neuron Series 5: QLoRA By Hand - A Step-by-Step Numerical Walkthrough


Welcome back, data adventurers, to the final boss of our fine-tuning series!

In our last post, we learned the "what" and "why" of QLoRA. We saw how it combines Quantization (shrinking the model) with LoRA (adding a tiny "cheat sheet") to let us fine-tune massive models on a single GPU.

We used analogies like "expert dog trainers" and "cheat sheets." Today, we're throwing the analogies away.

This post is a deep dive. We're opening the hood, grabbing a calculator, and walking through every single calculation of a QLoRA forward and backward pass by hand.

This is the most technical post in the series, but by the end, you won't just know QLoRA... you will understand it.

Let's do the math.

📜 The Quantization Rules

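We will use a simple block-wise symmetric 4-bit quantization rule (absmax scaling). It is a simplified stand-in for the NF4 data type used in the QLoRA paper, which keeps the arithmetic easy to follow by hand. For a given weight block W: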

  1. Find Scale (s):

$$s = \dfrac{\max(|W|)}{7}$$

  2. Quantize (q): (These are stored as 4-bit integers)

$$q_{ij} = \operatorname{clip}\big(\mathrm{round}(w_{ij}/s), -7, 7\big)$$

  3. Dequantize (W-hat): (This is done on-the-fly for computation)

$$\hat w_{ij} = q_{ij}\cdot s$$
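The three rules above can be sketched in a few lines of plain Python. This is only an illustration, not how an optimized library does it: the helper names (`round_half_away`, `quantize_block`, `dequantize_block`) are made up for this post, the tie-breaking rule (ties round away from zero) is an assumption, and the guard for an all-zero block is an addition the rules do not mention. The toy block is hypothetical and is not the W used later.

```python
import math

def round_half_away(v):
    """Round to the nearest integer; ties go away from zero (an assumed convention)."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))

def quantize_block(block, qmax=7):
    """Rules 1 and 2: return (q, s) for one block given as a flat list of floats."""
    s = max(abs(w) for w in block) / qmax
    if s == 0.0:  # all-zero block: nothing to scale (guard added for this sketch)
        return [0] * len(block), 0.0
    q = [max(-qmax, min(qmax, round_half_away(w / s))) for w in block]
    return q, s

def dequantize_block(q, s):
    """Rule 3: map the stored integers back to floats."""
    return [qi * s for qi in q]

# Toy block (hypothetical values, not the W from the walkthrough).
block = [1.4, -0.45, 0.85, 0.0]
q, s = quantize_block(block)
w_hat = dequantize_block(q, s)
max_err = max(abs(w - wh) for w, wh in zip(block, w_hat))
print(q, round(s, 6), round(max_err, 6))
```

Because every value is rounded to the nearest multiple of s, the reconstruction error of any single weight can never exceed s/2, as long as nothing gets clipped.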

⚙️ Setup: Matrices and Initial Values

Let's define all our initial matrices and vectors.

Base Weight (W): We'll treat this entire 2x4 matrix as a single quantization block.

$$W \in \mathbb{R}^{2\times 4}$$

$$W = \begin{bmatrix} 1.6 & -0.9 & 0.3 & 0.0\\ -1.4 & 0.8 & -0.2 & 0.5 \end{bmatrix}$$

Input Vector (x):

$$x \in \mathbb{R}^{4}$$

$$x = \begin{bmatrix} 1.0 \\ 0.5 \\ 0.0 \\ 0.2 \end{bmatrix}$$

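LoRA Adapters (Rank r=1): This means we have B (shape 2x1) and A (shape 1x4). We start A at zero so the adapter contributes nothing before training. (The original LoRA paper zero-initializes B and randomizes A instead; either way the product BA starts at zero.)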

$$B \in \mathbb{R}^{2\times 1}, \quad A \in \mathbb{R}^{1\times 4}$$

$$B = \begin{bmatrix} 0.05 \\ -0.05 \end{bmatrix},\qquad A = \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix}$$
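To make the shapes concrete, here is a small sketch that forms the low-rank update BA from these two adapters and counts parameters. The `matmul` helper and variable names are just for illustration.

```python
# Rank-1 adapters from the setup above, as plain Python lists.
B = [[0.05], [-0.05]]          # shape 2x1
A = [[0.0, 0.0, 0.0, 0.0]]     # shape 1x4

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

delta_W = matmul(B, A)         # the low-rank update B @ A, same shape as W
shape = (len(delta_W), len(delta_W[0]))

adapter_params = len(B) * len(B[0]) + len(A) * len(A[0])   # r * (d_out + d_in)
full_params = shape[0] * shape[1]                           # d_out * d_in

print(shape, delta_W)
print(adapter_params, "adapter parameters vs", full_params, "in W")
```

At this toy size the saving is tiny, but the adapter count grows like r times (rows + columns) while W grows like rows times columns, so the gap widens quickly for real layers.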

Target Vector (t): This is the "correct" output we're training towards

$$t = \begin{bmatrix} 1.0 \\ 0.0 \end{bmatrix}$$

Loss Function (L): We'll use a simple Mean Squared Error (scaled by 1/2)

$$L = \tfrac{1}{2}\|y - t\|^2$$


🚀 The Walkthrough: One Full Training Step

Let's go through the full computation, from quantization to a single gradient descent update.

1. Compute Block Scale (s)

First, we find the maximum absolute value in our weight matrix W.

$$\max(|W|) = 1.6$$

Now, we compute the scale s using our rule:

$$s = \frac{\max(|W|)}{7} = \frac{1.6}{7} = 0.228571428571$$

2. Quantize W to Q

Next, we quantize each element w_ij using the quantization rule and clip to the 4-bit range [-7, 7].

Row 1:

  • 1.6 / s ➔ 1.6 / 0.228571428571 ➔ 7.00000000000 ➔ q_{11} = 7

  • -0.9 / s ➔ -0.9 / 0.228571428571 ➔ -3.9375 ➔ q_{12} = -4

  • 0.3 / s ➔ 0.3 / 0.228571428571 ➔ 1.3125 ➔ q_{13} = 1

  • 0.0 / s ➔ 0.0 ➔ q_{14} = 0

Row 2:

  • -1.4 / s ➔ -1.4 / 0.228571428571 ➔ -6.125 ➔ q_{21} = -6

  • 0.8 / s ➔ 0.8 / 0.228571428571 ➔ 3.50000000000 ➔ q_{22} = 4 (Note: 3.5 is an exact tie, and we round it up to 4)

  • -0.2 / s ➔ -0.2 / 0.228571428571 ➔ -0.875 ➔ q_{23} = -1

  • 0.5 / s ➔ 0.5 / 0.228571428571 ➔ 2.1875 ➔ q_{24} = 2

This gives us our 4-bit integer matrix Q:

$$Q = \begin{bmatrix} 7 & -4 & 1 & 0\\ -6 & 4 & -1 & 2 \end{bmatrix}$$
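If you want to check steps 1 and 2 without a calculator, the sketch below reproduces them in plain Python. The `round_half_away` helper is an assumption about how ties are broken (away from zero); the only tie in this example is 3.5.

```python
import math

W = [[1.6, -0.9, 0.3, 0.0],
     [-1.4, 0.8, -0.2, 0.5]]

def round_half_away(v):
    """Ties round away from zero, so 3.5 -> 4 as in the walkthrough."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))

s = max(abs(w) for row in W for w in row) / 7           # step 1: block scale
ratios = [[w / s for w in row] for row in W]            # values before rounding
Q = [[max(-7, min(7, round_half_away(r))) for r in row] for row in ratios]

for ratio_row, q_row in zip(ratios, Q):
    print([round(r, 4) for r in ratio_row], "->", q_row)
```

One thing to watch for: Python's built-in `round()` sends ties to the nearest even integer. That happens to give 4 for 3.5 as well, but it would turn 2.5 into 2, so it is worth being explicit about the rule you want.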

3. Dequantize Q to W-hat

To perform the matrix multiplication, we dequantize Q back to floating-point by multiplying by our scale s. This gives W-hat.

Row 1:

  • w-hat_11 ➔ 7 * 0.228571428571 = 1.60000000000

  • w-hat_12 ➔ -4 * 0.228571428571 = -0.914285714284

  • w-hat_13 ➔ 1 * 0.228571428571 = 0.228571428571

  • w-hat_14 ➔ 0 * 0.228571428571 = 0.000000000000

Row 2:

  • w-hat_21 ➔ -6 * 0.228571428571 = -1.37142857143

  • w-hat_22 ➔ 4 * 0.228571428571 = 0.914285714284

  • w-hat_23 ➔ -1 * 0.228571428571 = -0.228571428571

  • w-hat_24 ➔ 2 * 0.228571428571 = 0.457142857142

So, our dequantized weight matrix W-hat is:

$$\hat W \approx \begin{bmatrix} 1.60000000000 & -0.914285714284 & 0.228571428571 & 0.000000000000 \\ -1.37142857143 & 0.914285714284 & -0.228571428571 & 0.457142857142 \end{bmatrix}$$

Note: In a real, optimized library, you would not materialize this full W-hat matrix. Instead, you would keep Q (4-bit) and s (e.g., 32-bit float) and perform a "fused kernel" that multiplies by s on the fly during the main computation.
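It is worth pausing to see what the rounding cost us. The sketch below dequantizes Q and compares the result against the original W, element by element. Nothing here is library code; it just redoes step 3 and subtracts.

```python
Q = [[7, -4, 1, 0],
     [-6, 4, -1, 2]]
W = [[1.6, -0.9, 0.3, 0.0],
     [-1.4, 0.8, -0.2, 0.5]]
s = 1.6 / 7

W_hat = [[q * s for q in row] for row in Q]                       # dequantize
err = [[w - wh for w, wh in zip(rw, rh)] for rw, rh in zip(W, W_hat)]
max_abs_err = max(abs(e) for row in err for e in row)

for row in err:
    print([round(e, 6) for e in row])
print("max |W - W_hat| =", round(max_abs_err, 6), " s/2 =", round(s / 2, 6))
```

The largest error belongs to w_22 = 0.8, the value that sat exactly on the 3.5 tie. A tie is the worst case for rounding, so its error lands right on the s/2 bound.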

4. Frozen-Path Forward Pass (W-hat * x)

Now we compute the output from the frozen, quantized path.

$$y_{\text{frozen}} = \hat W x$$

First row output:

$$\begin{aligned} (\hat W x)_1 &= 1.60000000000\cdot 1.0 + (-0.914285714284)\cdot 0.5 + 0.228571428571\cdot 0.0 + 0.0\cdot 0.2\\ &= 1.60000000000 + (-0.457142857142) + 0 + 0\\ &= 1.142857142858 \approx 1.14285714286 \end{aligned}$$

Second row output:

$$\begin{aligned} (\hat W x)_2 &= -1.37142857143\cdot 1.0 + 0.914285714284\cdot 0.5 + (-0.228571428571)\cdot 0.0 + 0.457142857142\cdot 0.2\\ &= -1.37142857143 + 0.457142857142 + 0 + 0.0914285714284\\ &= -0.8228571428596 \approx -0.82285714286 \end{aligned}$$

Frozen path result:

$$\hat W x \approx \begin{bmatrix} 1.14285714286 \\ -0.82285714286 \end{bmatrix}$$
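Here is the same frozen-path product computed two ways. The first route materializes W-hat, exactly as we did by hand. The second is a rough sketch of the idea behind the fused kernels mentioned in the note in step 3: accumulate on the integer codes and apply the scale once per row, so W-hat never exists in memory. Real kernels are far more involved; this only shows that the two routes agree.

```python
Q = [[7, -4, 1, 0],
     [-6, 4, -1, 2]]
s = 1.6 / 7
x = [1.0, 0.5, 0.0, 0.2]

# Route 1: materialize W_hat, then multiply (what the walkthrough does by hand).
W_hat = [[q * s for q in row] for row in Q]
y_materialized = [sum(w * xi for w, xi in zip(row, x)) for row in W_hat]

# Route 2: accumulate on the integer codes and scale once per row.
# With a single block, the scale s factors out of every row sum.
y_fused = [s * sum(q * xi for q, xi in zip(row, x)) for row in Q]

print([round(v, 11) for v in y_materialized])
print([round(v, 11) for v in y_fused])
```

The second route also makes the arithmetic easier to check by hand: the row sums of Q times x are 5 and -3.6, and multiplying each by s gives the two outputs.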

5. Add LoRA Adapter Path (B(Ax))

The full QLoRA output is y = W-hat*x + B(Ax). Let's compute the LoRA path. We can first compute the scalar u = A*x:

$$u = \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix} \cdot \begin{bmatrix} 1.0 \\ 0.5 \\ 0.0 \\ 0.2 \end{bmatrix} = 0$$

Since A was initialized to all zeros, its output is 0. The LoRA contribution is B·u = B·0, which is the zero vector.

Therefore, the initial total output y is just the frozen path result:

$$y = \hat W x + B(Ax) = \hat W x + 0 = \begin{bmatrix} 1.14285714286 \\ -0.82285714286 \end{bmatrix}$$
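The full forward pass fits in one small function. This is a sketch for the rank-1 case only, with A and B stored as flat lists; the name `qlora_forward` is made up for this post.

```python
def qlora_forward(W_hat, B, A, x):
    """Return (y, u) with y = W_hat @ x + B @ (A @ x) for a rank-1 adapter.

    A is a length-4 row vector and B a length-2 column vector, both as flat lists.
    """
    frozen = [sum(w * xi for w, xi in zip(row, x)) for row in W_hat]
    u = sum(a * xi for a, xi in zip(A, x))          # scalar because r = 1
    y = [f + b * u for f, b in zip(frozen, B)]      # adapter path is B * u
    return y, u

s = 1.6 / 7
W_hat = [[q * s for q in row] for row in [[7, -4, 1, 0], [-6, 4, -1, 2]]]
x = [1.0, 0.5, 0.0, 0.2]
B = [0.05, -0.05]
A = [0.0, 0.0, 0.0, 0.0]

y, u = qlora_forward(W_hat, B, A, x)
print(u, [round(v, 11) for v in y])
```

Note the order of operations: we compute A·x first and then multiply by B, rather than forming the 2x4 matrix BA. For real layers that ordering is what keeps the adapter path cheap.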

6. Compute Loss and Output Gradient (delta)

Now we compare our output y to the target t to find the loss and the initial gradient delta = dL/dy.

Our loss function (defined in the math block above) has the simple gradient delta = y - t.

$$\delta = y - t = \begin{bmatrix} 1.14285714286 - 1.0 \\ -0.82285714286 - 0.0 \end{bmatrix} = \begin{bmatrix} 0.14285714286 \\ -0.82285714286 \end{bmatrix}$$

(The actual loss value would be L = 0.5 * (0.142857...^2 + (-0.822857...)^2) ≈ 0.34875)
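The loss and its gradient with respect to y are a two-liner. The sketch below plugs in the rounded output from step 5, so expect agreement to about ten decimal places rather than exact equality.

```python
y = [1.14285714286, -0.82285714286]   # total output from step 5 (rounded)
t = [1.0, 0.0]

delta = [yi - ti for yi, ti in zip(y, t)]        # dL/dy for L = 0.5 * ||y - t||^2
loss = 0.5 * sum(d * d for d in delta)

print([round(d, 11) for d in delta], round(loss, 8))
```

The factor of 1/2 in the loss is there precisely so that the gradient comes out as y - t with no stray 2 in front.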

7. Backpropagation: Gradients for LoRA Parameters

This is the key to QLoRA: we never compute gradients for W-hat or W, because the base model is frozen. Only the trainable adapters A and B receive gradients.

Let's find the gradients dL/dB and dL/dA.

Recall y = W-hat*x + B*u where u = A*x.

Gradient for B:

$$\frac{\partial L}{\partial B} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial B} = \delta \cdot u^\top$$

Since u=0, this gradient is zero.

$$\frac{\partial L}{\partial B} = \delta \cdot 0 = \begin{bmatrix} 0 \\ 0\end{bmatrix}$$

Gradient for A:

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial u} \cdot \frac{\partial u}{\partial A} = \left(\left(\frac{\partial y}{\partial u}\right)^{\top} \frac{\partial L}{\partial y}\right) \cdot x^\top = (B^\top \delta) \cdot x^\top$$

Let's compute the intermediate scalar value B_transpose * delta:

$$\begin{aligned} B^\top \delta &= \begin{bmatrix} 0.05 & -0.05 \end{bmatrix} \cdot \begin{bmatrix} 0.14285714286 \\ -0.82285714286 \end{bmatrix} \\ &= (0.05 \cdot 0.14285714286) + (-0.05 \cdot -0.82285714286) \\ &= 0.007142857143 + 0.041142857143 \\ &= 0.048285714286 \end{aligned}$$

Now, we can find the gradient for A:

$$\begin{aligned} \frac{\partial L}{\partial A} &= (B^\top \delta) \cdot x^\top \\ &= 0.048285714286 \cdot \begin{bmatrix} 1.0 & 0.5 & 0.0 & 0.2 \end{bmatrix} \\ &= \begin{bmatrix} 0.048285714286 & 0.024142857143 & 0.000000000000 & 0.009657142857 \end{bmatrix} \end{aligned}$$
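A good habit when deriving gradients by hand is to check them numerically. The sketch below computes both gradients from the formulas above and then compares the gradient for A against central finite differences of the loss. The `loss_fn` helper and the step size `eps` are choices made for this sketch.

```python
s = 1.6 / 7
W_hat = [[q * s for q in row] for row in [[7, -4, 1, 0], [-6, 4, -1, 2]]]
x, t = [1.0, 0.5, 0.0, 0.2], [1.0, 0.0]
B, A = [0.05, -0.05], [0.0, 0.0, 0.0, 0.0]

def loss_fn(A, B):
    """L = 0.5 * ||W_hat x + B (A x) - t||^2 for the rank-1 adapter."""
    u = sum(a * xi for a, xi in zip(A, x))
    y = [sum(w * xi for w, xi in zip(row, x)) + b * u for row, b in zip(W_hat, B)]
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))

# Analytic gradients from the formulas above.
u = sum(a * xi for a, xi in zip(A, x))
y = [sum(w * xi for w, xi in zip(row, x)) + b * u for row, b in zip(W_hat, B)]
delta = [yi - ti for yi, ti in zip(y, t)]
grad_B = [d * u for d in delta]                          # delta * u^T
bt_delta = sum(b * d for b, d in zip(B, delta))          # scalar B^T delta
grad_A = [bt_delta * xi for xi in x]                     # (B^T delta) * x^T

# Central finite differences on A as an independent check.
eps = 1e-6
fd_A = []
for j in range(len(A)):
    up, dn = A[:], A[:]
    up[j] += eps
    dn[j] -= eps
    fd_A.append((loss_fn(up, B) - loss_fn(dn, B)) / (2 * eps))

print([round(g, 9) for g in grad_A])
print([round(g, 9) for g in fd_A])
```

The two printed rows should match to many decimal places. Notice also that the third entry is zero in both: x_3 = 0, so that column of A has no effect on the output for this input.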

8. Gradient Descent Step

Let's perform one update step with a learning rate eta = 0.1. The update rule is A_new = A - eta * (dL/dA).

Update B: B_new = B - eta * (dL/dB) = B - 0. B remains unchanged:

$$B_{\text{new}} = \begin{bmatrix}0.05\\ -0.05\end{bmatrix}$$

Update A: A_new = A - 0.1 * (dL/dA):

$$\begin{aligned} A_{\text{new}} &\approx \begin{bmatrix} 0 - 0.1\cdot 0.048285714286\\ 0 - 0.1\cdot 0.024142857143\\ 0 - 0.1\cdot 0.000000000000\\ 0 - 0.1\cdot 0.009657142857 \end{bmatrix}^\top \\ &\approx \begin{bmatrix} -0.0048285714286 & -0.0024142857143 & 0.0 & -0.0009657142857 \end{bmatrix} \end{aligned}$$
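The update itself is plain gradient descent applied to the two adapters. A minimal sketch, using the rounded gradients from step 7:

```python
eta = 0.1
A = [0.0, 0.0, 0.0, 0.0]
B = [0.05, -0.05]
grad_A = [0.048285714286, 0.024142857143, 0.0, 0.009657142857]   # from step 7
grad_B = [0.0, 0.0]                                              # from step 7

A_new = [a - eta * g for a, g in zip(A, grad_A)]
B_new = [b - eta * g for b, g in zip(B, grad_B)]

print(A_new)
print(B_new)
```

Q and s appear nowhere in this snippet. They are frozen, so the update loop only ever touches the six adapter values.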

9. Forward Pass After One Update

Let's re-compute the forward pass with our new adapters to see if the output y_new is closer to the target t.

The frozen path W-hat * x is unchanged:

$$\hat W x \approx \begin{bmatrix} 1.14285714286 \\ -0.82285714286 \end{bmatrix}$$

Now we compute the new LoRA path contribution with A_new.

First, the new scalar u = A_new * x:

$$\begin{aligned} u &= (-0.0048285714286)\cdot 1.0 + (-0.0024142857143)\cdot 0.5 + 0.0\cdot 0.0 + (-0.0009657142857)\cdot 0.2\\ &= -0.0048285714286 + (-0.00120714285715) + 0 + (-0.00019314285714)\\ &= -0.00622885714289 \approx -0.006228857143 \end{aligned}$$

Now, the LoRA contribution B * u:

$$B_{\text{new}} \cdot u = \begin{bmatrix} 0.05 \\ -0.05 \end{bmatrix} \cdot(-0.006228857143) = \begin{bmatrix} -0.00031144285715 \\ 0.00031144285715 \end{bmatrix}$$

Finally, the new total output y_new = W-hat*x + B*u:

$$y_{\text{new}} \approx \begin{bmatrix} 1.14285714286 - 0.00031144285715 \\ -0.82285714286 + 0.00031144285715 \end{bmatrix} = \begin{bmatrix} 1.14254570000 \\ -0.822545700003 \end{bmatrix}$$

Our previous output y was approx

$$\begin{bmatrix} 1.1429 \\ -0.8229 \end{bmatrix}$$

and our target t is

$$\begin{bmatrix} 1.0 \\ 0.0 \end{bmatrix}$$

Our new output y_new has moved slightly in the correct direction (the first component decreased, the second increased), and the loss has dropped from about 0.34875 to about 0.34845. The model is learning!
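One step barely moves the needle, so it is natural to ask what happens if we keep going. The sketch below repeats the exact same forward, backward, and update steps 20 times on this single example. The step count is arbitrary, and training on one data point is of course not how a real fine-tuning run looks.

```python
s = 1.6 / 7
W_hat = [[q * s for q in row] for row in [[7, -4, 1, 0], [-6, 4, -1, 2]]]
x, t = [1.0, 0.5, 0.0, 0.2], [1.0, 0.0]
B, A = [0.05, -0.05], [0.0, 0.0, 0.0, 0.0]
eta = 0.1

frozen = [sum(w * xi for w, xi in zip(row, x)) for row in W_hat]  # never changes

losses = []
b_moved_at = None   # first step index at which B receives a nonzero gradient
for step in range(20):
    # Forward pass and loss.
    u = sum(a * xi for a, xi in zip(A, x))
    y = [f + b * u for f, b in zip(frozen, B)]
    delta = [yi - ti for yi, ti in zip(y, t)]
    losses.append(0.5 * sum(d * d for d in delta))

    # Backward pass: gradients for the adapters only.
    grad_B = [d * u for d in delta]
    bt_delta = sum(b * d for b, d in zip(B, delta))
    grad_A = [bt_delta * xi for xi in x]
    if b_moved_at is None and any(g != 0.0 for g in grad_B):
        b_moved_at = step

    # Gradient descent update.
    A = [a - eta * g for a, g in zip(A, grad_A)]
    B = [b - eta * g for b, g in zip(B, grad_B)]

print([round(l, 6) for l in losses[:3]])
print("B first gets a nonzero gradient at step index", b_moved_at)
```

The interesting detail is B. Its gradient was zero on the first step only because u was zero. As soon as A has moved, u is nonzero, and B starts training as well.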

🔑 Key Takeaways

This step-by-step example demonstrates the core mechanics of QLoRA:

  • Quantization: The massive base weight matrix W is compressed into 4-bit integers (Q) plus one or more scaling factors (s). This is the source of the memory savings.

  • Frozen Base: The forward pass computes y = (W-hat)*x + B(Ax), where W-hat = Q * s is the dequantized base weight.

  • Trainable Adapters: Crucially, gradients only flow to the small LoRA adapter matrices A and B. The original weights (W, Q, and s) are never updated.

  • Efficiency: All the memory for optimizer states (like Adam's moments) is only needed for A and B, not for the massive W. This, combined with the 4-bit base model, is what makes QLoRA so memory-efficient.
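To put rough numbers on these takeaways, here is a back-of-the-envelope calculation for a single layer. Every setting in it is an assumption picked for illustration (a 4096x4096 weight, rank 16, blocks of 64 weights with one 32-bit scale each, an fp16 baseline, and Adam keeping two fp32 moments per trainable value). It ignores activations, gradients, the adapters' own storage, and any library overhead.

```python
# Hypothetical layer and settings, chosen only to make the arithmetic concrete.
d_out, d_in = 4096, 4096     # assumed weight shape
rank = 16                    # assumed LoRA rank
block_size = 64              # assumed quantization block size
MIB = 1024 ** 2

n_base = d_out * d_in
n_adapter = rank * (d_out + d_in)

fp16_base = n_base * 2                                  # 2 bytes per weight
int4_base = n_base // 2 + (n_base // block_size) * 4    # 4-bit codes + one fp32 scale per block
adam_full = n_base * 2 * 4                              # two fp32 moments per trainable weight
adam_lora = n_adapter * 2 * 4                           # same, but only for A and B

print(f"base weights: fp16 {fp16_base / MIB:.1f} MiB vs 4-bit {int4_base / MIB:.1f} MiB")
print(f"Adam moments: full fine-tune {adam_full / MIB:.1f} MiB vs LoRA {adam_lora / MIB:.1f} MiB")
```

Under these assumptions the two savings stack: the frozen weights shrink because they are stored in 4 bits, and the optimizer state shrinks because only A and B are trainable.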

You made it through the math! You now have a concrete, fundamental understanding of how QLoRA works under the hood.

But knowing the theory is one thing; applying it is another. In our next post, we'll ditch the calculator, fire up our GPUs, and implement QLoRA in Python. We'll use powerful libraries like bitsandbytes, transformers, and peft to fine-tune a real-world language model on a custom dataset.

Next up: Zero to Neuron Series 6: Coding QLoRA - Fine-Tuning an LLM on a Single GPU.

Happy coding meow! (Subscribe, or... you know the drill 😼)