how LoRA actually scales

One base model serving many different LoRA adapters in parallel

I've been building agents and AI products for the better part of 2 years now, and I've been through most cycles - RAG, fine-tuning, agent frameworks, thinking models, MoE, and now the latest buzz of all the different harnesses we have.

A lot of stuff has become more or less obsolete. AI is a very fast-moving field, and even RAG might not survive the test of time.

But one thing's for sure - the bare metal level in AI and LLMs will stay no matter what. I'm referring to fine-tuning, specifically LoRA (Low Rank Adaptation).

LoRA has been around for a few years now and is widely regarded as one of the most efficient and cheap ways to fine-tune models. In simple terms, it freezes the model's original weights and trains a much smaller set of extra weights instead, which speeds up fine-tuning and makes it less compute-intensive.

But few people know the math behind LoRA and why it works so well, fewer people know that LoRA scales, and almost no one knows how it scales. I found it interesting enough to write about.

how does LoRA work under the hood?

If you already know this, I'd recommend skipping to the next section.

LoRA works on a simple concept known as low-rank matrix decomposition. It's exactly what it sounds like - splitting a matrix into two smaller matrices which, when multiplied together, give back a matrix of the original shape.

M → A and B

Simply put, assume that M is an X × Y matrix. Then,

matrix A will be an X × n matrix, and
matrix B will be an n × Y matrix.

Multiplying these two gives back a matrix that is X × Y!

Now, how do you actually get these two matrices? The classic tool for this kind of low-rank approximation is SVD (Singular Value Decomposition). But here's the twist - LoRA doesn't run SVD on the weights at all. It freezes the original matrix and learns A and B from scratch during fine-tuning. They start out tiny (A random, B at zero) and training gradually shapes them into the update the model needs.
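Here's a minimal NumPy sketch of that setup, just to make the shapes concrete. The sizes (64, 48, rank 4), the 0.01 scale on A, and the random stand-in for the base weights are all my own illustrative assumptions, not values from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

X, Y, n = 64, 48, 4  # hypothetical layer shape and rank

W = rng.normal(size=(X, Y))              # frozen base weights (random stand-in)
A = rng.normal(scale=0.01, size=(X, n))  # A starts as small random values
B = np.zeros((n, Y))                     # B starts at zero

delta = A @ B  # the low-rank update, same shape as W

print(delta.shape)                # (64, 48)
print(np.allclose(W + delta, W))  # True: before training, the adapter changes nothing
```

Because B starts at zero, the product A · B is zero too, so fine-tuning begins from exactly the base model's behavior.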

Now the real question - why are we doing this? How does this optimize the fine-tuning process?

why are we doing this?

The foundation of LoRA is simple - we represent the update to a matrix of LLM weights as two smaller matrices which, when multiplied, produce a matrix with the dimensions of the original one.

But one key point I didn't mention earlier - the value of n has to be significantly smaller than that of X or Y. This n is the rank of the decomposition - and yes, it's literally the "Low Rank" in LoRA. Keeping it small is exactly what provides the optimization.

Why? Because what we're trying to reduce is the number of weights that get adjusted at each training step during fine-tuning.

n ≪ X, Y
⟹ n · (X + Y) ≪ X · Y

Instead of touching all X · Y weights of the original matrix, we only ever adjust the n · (X + Y) weights inside A and B. When n is tiny, that's a fraction of the work - and that's the whole reason LoRA is cheap to train.
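To put numbers on that inequality, here's a quick bit of arithmetic. The function name and the 4096 × 4096 matrix with rank 16 are just an example I picked; plug in your own shapes.

```python
def lora_param_counts(X: int, Y: int, n: int) -> tuple[int, int]:
    """Return (full, lora) trainable-weight counts for one X-by-Y matrix."""
    full = X * Y        # weights touched by full fine-tuning
    lora = n * (X + Y)  # weights inside A (X by n) and B (n by Y)
    return full, lora


full, lora = lora_param_counts(X=4096, Y=4096, n=16)
print(full)          # 16777216
print(lora)          # 131072
print(full // lora)  # 128
```

For this one matrix, LoRA trains 128 times fewer weights than full fine-tuning would.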

how does LoRA scale?

First of all, what is scalability in LoRA at all? What does this even mean and why is it relevant? Let me give you an example.

Let's say you're working on a use case where you need to give personalized (fine-tuned) responses for every single inference request.

You will need (potentially) a different LoRA adapter for every single request, or at best, a different one for every single batch of requests.

The typical LoRA workflow is to merge the adapter into the model first, and only then run inference on the input. But our GPU holds only one copy of the model's weights, so only one adapter can be merged in at a time.

At scale, that obviously doesn't hold up: merging or unmerging rewrites every adapted weight matrix, and we cannot possibly do that on every single inference request.

But there's a way to do this without merging or unmerging any adapters! It uses a very fundamental property of matrix operations - the distributive property.

the distributive property trick

Let me set up the terms first:

W - the big base model weight matrix
A and B - the tiny LoRA matrices for one adapter
x - the input (a token, or a batch of them)

When you merge the adapter into the model before inference, the math is just this:

y = x · (W + AB)

But matrix multiplication distributes over addition. So we can expand that into:

y = (x · W) + (x · AB)

That tiny rewrite is the entire trick.

It says we never had to merge the weights in the first place. We can push the input through the untouched base model (x · W), push the same input through the tiny LoRA matrices (x · AB), and just add the two results at the end.
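If you'd rather see it than take my word for it, here's a small NumPy check that the merged path and the split path agree. All the sizes and random values are placeholders I made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y, n = 64, 48, 4  # hypothetical layer shape and rank

x = rng.normal(size=(1, X))  # one input row
W = rng.normal(size=(X, Y))  # base weights
A = rng.normal(size=(X, n))  # adapter matrices
B = rng.normal(size=(n, Y))

y_merged = x @ (W + A @ B)     # merge first, then multiply
y_split = x @ W + (x @ A) @ B  # base path plus adapter path, added at the end

print(np.allclose(y_merged, y_split))  # True
```

One detail worth noticing: the split path computes (x · A) · B rather than x · (A · B). The result is the same, but this order never builds the full X × Y product, so the adapter work stays proportional to n.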

The base model stays pure. The adapter stays separate. One copy of the (say) 70GB model in VRAM, and as many tiny adapters as we want sitting right next to it.
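To get a feel for how small an adapter is next to the base model, here's a back-of-envelope estimate. Everything in it is an assumption on my part: a 7B-class model with hidden size 4096 and 32 layers, a rank-16 adapter on two square projection matrices per layer, and fp16 storage (2 bytes per weight).

```python
def adapter_bytes(hidden, layers, matrices_per_layer, rank, bytes_per_weight=2):
    """Rough adapter size, assuming every adapted matrix is hidden x hidden."""
    per_matrix = rank * (hidden + hidden)  # weights in A plus weights in B
    return per_matrix * matrices_per_layer * layers * bytes_per_weight


size = adapter_bytes(hidden=4096, layers=32, matrices_per_layer=2, rank=16)
base = 7_000_000_000 * 2  # 7B weights at 2 bytes each, about 14GB

print(size)          # 16777216 bytes
print(size / 2**20)  # 16.0 (MiB)
print(base // size)  # 834
```

Under those assumptions, one adapter is roughly 800 times smaller than the base weights. Adapting more matrices per layer grows the adapter proportionally, but it stays in the tens of MB.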

Merged path x(W+AB) vs split path xW + xAB producing the same output

but what about batches?

A real inference server doesn't process one request at a time. It batches them - that's how you keep a GPU busy and throughput high.

So picture three requests landing in the same batch, in the same millisecond:

Request 1 wants a SQL-writing adapter
Request 2 wants a French-translation adapter
Request 3 wants the plain base model - no adapter at all

If you stack these into one batch X, how do you serve all three at once? Cloning the base model three times is not an option - it barely fits in memory once.

This is exactly where y = xW + xAB earns its keep.

splitting the work

The engine splits the computation into two passes.

The heavy lift. Take the entire batch X and multiply it by the shared base weights W in one big, fully-parallel sweep.

Y_base = X · W

This is the expensive part, and every request in the batch gets it done simultaneously. The base model is already sitting in VRAM, so the GPU runs at full tilt.

The light lift. Now route each row of the batch to its requested adapter and compute just the offset:

Δy₁ = x₁ · A₁ · B₁
Δy₂ = x₂ · A₂ · B₂
Δy₃ = 0

Request 3 didn't want an adapter, so its offset is just zero. These adapters are tiny. To put a number on it - a rank-16 adapter on a 7B model is typically tens of MB (the exact size depends on which layers it targets), while the full fp16 weights are ~14GB. So the engine can keep a whole pool of them loaded in GPU memory alongside everything else.

Recombine. Add the offsets back onto the base result:

Y_final = Y_base + ΔY

Every user gets their own personalized output, and we never touched the base weights once.
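The three steps above can be sketched in NumPy like this. It's a toy version under my own assumptions: the names (adapters, requested, X_batch), the sizes, and the random weights are all hypothetical, and the per-row Python loop is there for readability, not speed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 32, 24, 4  # hypothetical layer shape and rank

W = rng.normal(size=(d_in, d_out))  # shared base weights
adapters = {                        # hypothetical adapter pool: name -> (A, B)
    "sql": (rng.normal(size=(d_in, n)), rng.normal(size=(n, d_out))),
    "french": (rng.normal(size=(d_in, n)), rng.normal(size=(n, d_out))),
}

X_batch = rng.normal(size=(3, d_in))  # three requests, one row each
requested = ["sql", "french", None]   # None means plain base model

# 1. The heavy lift: one multiply for the whole batch.
Y_base = X_batch @ W

# 2. The light lift: a per-row offset from the requested adapter.
delta = np.zeros_like(Y_base)
for row, name in enumerate(requested):
    if name is None:
        continue  # no adapter, offset stays zero
    A, B = adapters[name]
    delta[row] = (X_batch[row] @ A) @ B

# 3. Recombine.
Y_final = Y_base + delta
```

Row 3 comes out identical to the base model's output, and rows 1 and 2 match what you'd get from merging their adapter into W, without W ever being modified.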

A batch of requests routed to different LoRA adapters on a single base model

the catch: GPUs hate branching

There's a problem hiding inside "route each row to its adapter".

GPUs are fast because they do the same operation across thousands of cores at once. "Multiply everything by W" is perfect - every core does the identical thing. But "row 1 uses adapter A, row 2 uses adapter B, row 3 uses nothing" is the opposite of that. Different rows want different math.

Write that naively in PyTorch and you end up with a loop: handle adapter A, then adapter B, then the next one... one at a time. And the moment you turn a parallel batch into a sequential loop, your expensive GPU is suddenly running at a fraction of its capacity, and your latency falls apart.

Think of a paint factory. Filling every bucket with white primer is easy - one command, all the robotic arms fire at once. But now bucket 1 needs red tint, bucket 2 needs blue, bucket 3 needs none. If the manager stops the line and walks each bucket to its own tint jar one by one, the other arms just stand there doing nothing. That's the for-loop, and that's the bottleneck.

gather/scatter: doing it all at once

The fix is a custom CUDA kernel built around a gather/scatter pattern.

Instead of stopping the line, you hand every arm a ticket: "read the order taped to your bucket, reach into the exact tint jar it names, and pour." Arm 1 reads "red" and grabs from the red shelf, arm 2 reads "blue" and grabs from the blue shelf, arm 3 reads "none" and does nothing - and crucially, they all do it in the same instant.

That's gather/scatter. Each row of the batch carries an index that points at the adapter it wants. The kernel reads those indices, gathers the right A and B matrices from wherever they happen to live in memory, and applies them - all in parallel, without ever breaking the batch apart.
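Here's the same idea expressed in NumPy, as a rough illustration of the indexing pattern rather than of any real kernel. The stacked pools (A_pool, B_pool), the adapter_idx array and all the sizes are my own assumptions; -1 stands for "no adapter".

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n, num_adapters = 32, 24, 4, 2

W = rng.normal(size=(d_in, d_out))
A_pool = rng.normal(size=(num_adapters, d_in, n))   # every A, stacked
B_pool = rng.normal(size=(num_adapters, n, d_out))  # every B, stacked

X_batch = rng.normal(size=(3, d_in))
adapter_idx = np.array([0, 1, -1])  # per-row adapter index, -1 = none

has_adapter = adapter_idx >= 0
safe_idx = np.where(has_adapter, adapter_idx, 0)  # placeholder slot for -1 rows

# Gather: each row picks up its own A and B by index.
A_rows = A_pool[safe_idx]  # (batch, d_in, n)
B_rows = B_pool[safe_idx]  # (batch, n, d_out)

# Apply every row's adapter in one batched operation, no Python loop.
hidden = np.einsum("bi,bin->bn", X_batch, A_rows)
delta = np.einsum("bn,bno->bo", hidden, B_rows)
delta *= has_adapter[:, None]  # zero out rows that asked for no adapter

Y_final = X_batch @ W + delta
```

The caveat is that NumPy's fancy indexing copies the gathered weights into new arrays, whereas a purpose-built kernel can read them where they already sit in memory. The indexing logic is the part that carries over.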

vLLM does exactly this. It builds an index tensor that maps each request in the batch to its adapter (and a -1 for "no adapter, skip the extra math"), then fires a kernel that gathers the right weights per row. The technique comes from a project called Punica; in the vLLM codebase you'll find it under the name SGMV (Segmented Gather Matrix-Vector multiplication). It even peeks at the batch first - if requests 2, 3 and 4 all want the same adapter, it groups them so it only fetches those weights once.
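The grouping step at the end of that paragraph can be illustrated with a few lines of NumPy. To be clear, this is my own sketch of the idea, not code from vLLM or Punica: segments_by_adapter is a hypothetical helper, and it assumes a non-empty batch.

```python
import numpy as np


def segments_by_adapter(adapter_idx):
    """Sort rows by adapter id and return (order, [(adapter_id, start, end), ...])."""
    order = np.argsort(adapter_idx, kind="stable")  # rows regrouped by adapter
    sorted_ids = adapter_idx[order]
    # A segment boundary sits wherever the adapter id changes.
    boundaries = np.flatnonzero(np.diff(sorted_ids)) + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [len(sorted_ids)]))
    segs = [(int(sorted_ids[s]), int(s), int(e)) for s, e in zip(starts, ends)]
    return order, segs


# Requests 2, 3 and 4 (rows 1 to 3) share adapter 0; row 4 wants no adapter.
adapter_idx = np.array([2, 0, 0, 0, -1, 2])
order, segs = segments_by_adapter(adapter_idx)

print(order.tolist())  # [4, 1, 2, 3, 0, 5]
print(segs)            # [(-1, 0, 1), (0, 1, 4), (2, 4, 6)]
```

Each tuple is one segment of the reordered batch: the adapter id plus the slice of rows that use it. With that in hand, an adapter's weights only need to be fetched once per segment instead of once per row.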

The result: the big shared base multiply runs as one fully-parallel operation, and the tiny per-request LoRA math gets routed dynamically on top of it. You can serve 50 different users on 50 different adapters from a single base model, on a single GPU, and the engine barely notices.

GPU cores each gathering their assigned LoRA adapter from memory in parallel

why this matters

Merged LoRA gives you exactly one flavor of the model. That's fine if you only ever need one. But the second you need per-user, per-request personalization - which is exactly where a lot of real products are heading - merging completely falls apart.

The unlock is almost embarrassingly simple: y = xW + xAB. One line of high-school algebra, plus a kernel clever enough to respect how a GPU actually wants to work. Keep the base weights untouched, treat every adapter as a cheap little offset, and let gather/scatter handle the routing.

That's how a single GPU serves a different fine-tuned model to every request without melting. Not a bigger model. Not more VRAM. Just better math.

Thanks for reading!