How VertiPaq Sorts a Table — A Visual Walkthrough

Microsoft’s column-store engine reorders the rows of your table before it stores them. Not for query speed, not for joins — but to make a humble compression scheme called run-length encoding work miracles. The reordering is greedy, recursive, and surprisingly intuitive once you watch it happen.

This piece is the visual companion to How VertiPaq Sorts Rows to Maximize RLE Compression. That article explains why the algorithm exists; this one steps through it on sixteen rows so you can see every move.

A table, turned on its side

Row-oriented storage was the default for forty years because applications read records one at a time. Columnar storage flipped that: analytics reads one column across millions of rows. Same data, transposed. The columns are stored as independent arrays:

Once each column is its own array, you can compress it independently. The cheapest, fastest compression scheme in the book is run-length encoding: instead of writing Seattle, Seattle, Seattle, Seattle four times, you write (Seattle, ×4) once. RLE turns repetition into pure savings.

But RLE only helps when identical values sit next to each other. In a random row order, the values are scrambled. The whole game of VertiPaq’s reordering algorithm is this:

Find a row order that maximises consecutive repetition across all columns at once, even though every column wants a different order, and you can only choose one.

The same data, two ways

Below is one column — the city column — in two row orders. Same sixteen values, just rearranged. On the left, the order is essentially random. On the right, identical values have been clustered. The compression difference is visceral.

As written

Grouped

Left: 12 RLE runs (one new run every time the value changes). Right: 3 runs. Same information, four times fewer tokens to store.

With a single column this is easy — just sort it. With three columns sharing a row order, you can’t sort all of them perfectly at once. Sorting city scrambles quarter. So the question becomes: which column to favour, and by how much?

Bit savings: a number per choice

VertiPaq picks the next reordering move by computing a single number for every (column, value) pair it could group, and choosing the maximum. The number is called bit savings, and it’s the product of two things:

The reasoning is straightforward: every time you group n copies of a value together, you turn n separately-encoded values (each of which would have cost bits-per-value bits to bit-pack) into a single RLE token. So you save n × bits-per-value bits. Frequency is how much the group is worth; bits-per-value is how expensive each ungrouped value would have been.

The consequence: high-cardinality columns are more valuable to group (each row costs more bits, so each grouped row saves more), and frequent values within a column are easier wins. Below are the candidate moves on our 16-row example. Bars show the bit savings; the algorithm picks the longest.

Three columns × three distinct values per column = nine candidate moves. The winner is status = A, with eight rows and two bits per value: 8 × 2 = 16 bits saved by grouping. Notice that the city and quarter columns have other strong candidates — but bit savings is greedy and only one move wins each round.

Step by step on sixteen rows

The full algorithm is a recursive version of the move you just saw. Start with one bucket containing all the rows. Find the highest-saving (column, value) pair. Partition the bucket so matching rows sit at the top, non-matching at the bottom. The top half becomes a pure bucket — one column is permanently solved inside it — and the bottom half becomes a smaller impure bucket. Then do it again, on whichever bucket has the most rows left.

Twelve such operations finish the job on our 16-row table. Step through them with the controls below:

Step 0 of 12

Press Play to watch the algorithm sort the table, or step through manually. Initially all 16 rows live in a single bucket.

Table state · the row_order array

Candidates for this bucket

The bucket tree

A few things worth noticing as the walker runs:

The first split is the largest. The initial bucket has every row in it, so the strongest candidate (status = A, eight matches) lands a 16-bit saving. Subsequent splits operate on smaller buckets and produce smaller wins. Most of the value is captured early — a key reason the greedy strategy works in practice.
Columns get “marked done” inside a bucket. Once a bucket is purely A for status, the algorithm never reconsiders status inside that bucket. There’s nothing left to group there. This is tracked with a bitset per bucket.
Bucket processing order is largest-first. A max-heap keyed on bucket size determines what to split next, so the algorithm always attacks the biggest remaining problem. This prevents wasting effort on small buckets while large ones still have headroom.
Termination is local. Each bucket stops splitting when its best candidate falls below a savings threshold (in the real implementation, 64 bits) or when there’s only one distinct value left in every column. Different buckets stop at different times.

Runs that survive bucket boundaries

The split as described always puts matching rows first. That’s fine in isolation. But buckets sit next to each other in the final order, and a run of identical values can span a bucket boundary if you align it right. The algorithm peeks at its neighbours before deciding which end to put the matches at.

The check is local and cheap. Before partitioning, look at the value in the row just before bucket B and the row just after. Pick the layout that maximises continuity with whatever’s already there. It’s the same reason a Tetris player rotates a piece before dropping it.

Before and after

Twelve splits later, our table has settled into its final order. The same rows, the same values — but now organised so that every column compresses well. Here’s the entire table in both orderings, side by side, with the RLE runs drawn alongside:

Original

After VertiPaq

Original clusters

across the three columns

After optimisation

12 splits, 12 partitions

Compression gain

2.05×

fewer RLE tokens to store

A 2× improvement on sixteen rows is modest — there’s a floor below which you can’t go, since every distinct value pair has to break a run somewhere. But scale this to a million rows and the dynamics change: the early splits sort tens of thousands of rows at a time, and each one is worth real disk space. Real-world Power BI fact tables often see 10× to 50× compression improvements from row ordering alone, separate from the bit-packing and dictionary encoding applied on top.

When sorting helps, and when it doesn’t

The greedy strategy isn’t optimal — finding the row order that minimises total RLE runs across all columns simultaneously is NP-hard. But greedy is fast, deterministic, and most of the savings come from the first few splits. The patent’s design choices are practical, not theoretical.

Things that make the algorithm work harder for less reward

Very high cardinality columns. If a column has nearly as many distinct values as it has rows (think: timestamps with microsecond precision, or user IDs), there’s no repetition to find. The implementation skips these columns entirely when segment_size / cardinality < 64 — they’d waste histogram cycles for negligible gain.
Uniformly distributed data with no natural clustering. Histograms are flat; no value sticks out as a winner; bit savings are small for every candidate.
Small tables. The fixed overhead of bucket management dominates. Below a few thousand rows, sorting hardly matters.

Things that make it sing

Foreign keys in fact tables. A customer_id column with thousands of distinct values, each appearing dozens of times, is exactly the shape this algorithm loves: high frequency × high bits-per-value × strong correlation with related columns.
Correlated columns. If city and region are correlated, sorting by either one helps the other for free. The greedy strategy naturally exploits this without being told to.
Event data with natural categories. Logs, transactions, sensor readings — anything where a small set of “types” repeats many times.

What the real implementation adds

Segmentation. Large tables are processed in 1,048,576-row chunks (2²⁰, matching Power BI’s storage page size). Each segment is sorted independently and can be parallelised across cores.
Histogram sampling. Once a bucket exceeds ten thousand rows, the histogram is built from roughly 1 in 10,000 rows. The most-frequent value is preserved at small statistical risk and large speed gain.
Doubly-linked bucket list. Each bucket holds pointers to its neighbours in row order, so the merge-optimisation check is O(1).
Dictionary encoding upstream. Before any of this runs, string columns are replaced with small integer codes. The reordering algorithm only ever touches integer arrays.

The result is that an OLAP fact table compressed by Power BI is dramatically smaller than the same data stored row-wise — often small enough to fit entirely in RAM, which is the whole point. Sorting the rows is what makes the bit-packing that follows pay off. RLE turns out to be the unsung hero of column-store performance, and this little greedy algorithm is what feeds it.

Based on Microsoft Patent US 8,452,737 B2 (Netz, Petculescu, Crivat, 2013). The same algorithm powers Power BI, Analysis Services, and Excel data models in production. For the prose deep-dive, see the sibling article How VertiPaq Sorts Rows to Maximize RLE Compression.