The deep-dive articles on this site explain how VertiPaq physically stores columns: dictionary-encoded integer IDs, hybrid RLE plus bit-packed sub-segments, paged Huffman-compressed string stores. All of that work compresses values given a row order. What nobody tends to mention up front is that the row order itself is a first-class compression lever — often a larger one than any codec choice.
This article is about that lever. It describes the algorithmic shape I arrived at after dissecting VertiPaq output, comparing segment layouts, and reading Microsoft patent US 8,452,737 B2. I am intentionally leaving source-code-level implementation details out. The useful part for understanding the engine is not a particular codebase; it is the decision process VertiPaq appears to follow when it chooses a row order before writing compressed column segments.
Some of the intermediate reverse-engineering notes behind this article started in this X post, where I was tracing table-processing signals such as iteration counts, total RLE runs, and XMHybridRLECompressionInfo.
Where This Fits
This is a sibling article to VertiPaq Dictionaries and Hash Indexes and Reconstructing Column Data from .idf and .idfmeta. Both of those treat row order as given. This article asks where that order comes from, and why it makes such a large difference in the resulting .idf payload.
Why Row Order Drives RLE
RLE compresses runs of identical values. In a columnar store every column is compressed independently, but every column sees the same row order. That means row order is a single global knob that every column’s RLE has to live with.
The practical consequence is easy to see. Take a column that repeats heavily:
random order: [45, 12, 45, 89, 12, 45, 12, ...] → ~1000 RLE runs
grouped order: [45, 45, 45, ..., 12, 12, 12, ...] → ~100 runs
Same values, same dictionary, same encoding family — but a large change in compressed size just by picking a better order.
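The run counts above are easy to check with a few lines of Python; `rle_runs` here is just an illustrative run counter, not a VertiPaq API:

```python
import random

def rle_runs(values):
    """Count RLE runs: a run is a maximal stretch of identical adjacent values."""
    runs, prev = 0, object()  # sentinel that equals nothing in the data
    for v in values:
        if v != prev:
            runs, prev = runs + 1, v
    return runs

random.seed(0)
data = [random.choice([45, 12, 89]) for _ in range(1000)]
print(rle_runs(data))          # random order: hundreds of runs
print(rle_runs(sorted(data)))  # grouped order: one run per distinct value
```

With three distinct values, grouping collapses the random layout down to at most three runs, whatever the starting arrangement was.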
The hard part is that VertiPaq is not sorting one column in isolation. It is choosing one row permutation for the whole table segment. An order that is perfect for ProductID might scatter CustomerID; an order that groups Date might fragment Store. The goal is not “make one column beautiful”; it is “make the sum of all column encodings smaller.”
That is why this problem gets interesting. Finding the globally optimal ordering across multiple columns is a combinatorial optimisation problem. The search space grows factorially with the number of rows, and the multi-column version is closely related to NP-hard ordering and clustering problems. VertiPaq therefore needs a bounded heuristic: something that usually gets close enough to the best answer without trying every possible row order.
The Algorithm in One Paragraph
The algorithm is a greedy bucket split. Start with all rows in one bucket. Repeatedly pick the (column, value) pair that, if all rows matching it were pushed to the front of the bucket, would save the most bits overall. Split into two sub-buckets — matching rows and the rest — and recurse on each. Stop when no column offers enough savings. The row order that falls out at the end is the optimised ordering; VertiPaq then encodes each column against that order.
The original idea is described in Microsoft patent US 8,452,737 B2. The behaviour I have observed in VertiPaq output lines up with that style of greedy, segment-local bucket splitting.
Segment Alignment
VertiPaq processes data in segments. In Power BI Desktop, the commonly observed segment size is 2²⁰ rows, or 1,048,576 rows. That is the number most people run into when inspecting Desktop-created import models, but it is not the only segment size VertiPaq can use.
Microsoft documents a different default for semantic models that use the large semantic model storage format: Power BI automatically sets the default segment size to 8 million rows, matching Azure Analysis Services, to balance memory requirements and query performance for large tables.
For this row-ordering discussion, the exact segment size matters less than the boundary itself. VertiPaq appears to optimise row order within each segment, not by choosing one global permutation for the entire table. That has three practical consequences:
- The greedy search uses segment-local statistics. A column’s cardinality and value distribution inside one segment can matter more than its global distribution across the table.
- Different segments of the same table can end up with different local row orders because each segment has its own value distribution.
- Segmenting gives the storage engine a natural unit of work. Segments can be analysed, ordered, encoded, loaded, and queried in smaller chunks rather than treating the table as one monolithic block.
This also explains why pre-sorting a source table can help even though VertiPaq still performs its own optimisation. Pre-sorting can improve value distribution across segment boundaries, so similar rows are more likely to land in the same segment. It can also give the greedy process a better starting point inside each segment, which means a bounded search or timeout can stop earlier with a better answer.
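Assuming segment-local encoding, a small experiment shows why pre-sorting pays off. The segment size and data here are arbitrary, and `runs_per_segment` is my own helper, not an engine function:

```python
import random

def rle_runs(values):
    # A run is a maximal stretch of identical adjacent values.
    runs, prev = 0, object()
    for v in values:
        if v != prev:
            runs, prev = runs + 1, v
    return runs

def runs_per_segment(values, segment_size):
    # Count runs per segment, mirroring segment-local encoding:
    # a run cannot continue across a segment boundary.
    return sum(rle_runs(values[i:i + segment_size])
               for i in range(0, len(values), segment_size))

random.seed(1)
shuffled = [random.randrange(10) for _ in range(10_000)]
presorted = sorted(shuffled)

print(runs_per_segment(shuffled, 1_000))   # many runs in every segment
print(runs_per_segment(presorted, 1_000))  # only a handful of runs per segment
```

The pre-sorted layout lands similar rows in the same segments, so even before any in-segment optimisation the run count is orders of magnitude lower.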
Scoring a Split
For any candidate (column, value) pair, the score is a bit-savings estimate:
bit_savings ≈ frequency × bits_per_value
where bits_per_value = ceil(log2(cardinality)) for the column in this segment. The intuition: pulling a frequent value into a pure run saves roughly one bit-packed slot per occurrence.
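In Python terms, the estimate looks like this. The function names are my own; this is a sketch of the scoring formula, not engine code:

```python
from math import ceil, log2

def bits_per_value(cardinality):
    # Bit-packed width for a dictionary-encoded column: enough bits to
    # address every distinct ID in this segment (at least 1 bit).
    return max(1, ceil(log2(cardinality))) if cardinality > 1 else 1

def bit_savings(frequency, cardinality):
    # Estimated payoff of pulling `frequency` occurrences of one value
    # into a pure run: roughly one bit-packed slot saved per occurrence.
    return frequency * bits_per_value(cardinality)

# A value occurring 400,000 times in a column with 1,000 distinct IDs:
print(bit_savings(400_000, 1_000))  # 400,000 × 10 bits = 4,000,000
```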
This is a greedy heuristic, not a proof of optimality. At every step, VertiPaq appears to ask: “Which value, in which column, gives the best immediate compression payoff if I group it now?”
best_split = none
for column in unfinished_columns:
    histogram = count_values(bucket, column)
    value, frequency = histogram.most_frequent()
    saving = frequency * bits_per_value(column)
    if best_split is none or saving > best_split.saving:
        best_split = (column, value, saving)
if best_split is not none and best_split.saving >= minimum_saving:
    split bucket on best_split.column = best_split.value
else:
    mark bucket done
Several practical guardrails keep the search sensible:
- Columns with too little repetition are poor RLE candidates and can be skipped for a bucket.
- Splits with tiny estimated savings are not worth pursuing.
- Very large buckets can be sampled when building histograms, because the goal is to find a strong candidate quickly rather than exactly score every possible split.
- A timeout can stop the greedy loop at any point. The row order produced so far is still valid; it is simply less optimised than it might have been with more time.
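Putting the loop and the guardrails together, here is a minimal runnable sketch of the greedy bucket split. The recursion, the cutoff value, and the bucket-local cardinality estimate are all illustrative choices of mine, not VertiPaq's actual implementation:

```python
from collections import Counter
from math import ceil, log2

def bits_per_value(rows, column):
    # Bucket-local bit-packed width for this column (illustrative).
    card = len({row[column] for row in rows})
    return max(1, ceil(log2(card))) if card > 1 else 1

def greedy_order(rows, columns, minimum_saving=4):
    """Recursively pull the (column, value) pair with the best estimated
    bit savings to the front of each bucket."""
    if len(rows) < 2:
        return rows
    best = None  # (saving, column, value)
    for column in columns:
        histogram = Counter(row[column] for row in rows)
        value, frequency = histogram.most_common(1)[0]
        if frequency == len(rows):
            continue  # bucket is already a pure run in this column
        saving = frequency * bits_per_value(rows, column)
        if best is None or saving > best[0]:
            best = (saving, column, value)
    if best is None or best[0] < minimum_saving:
        return rows  # bucket done: no split worth its cost
    _, column, value = best
    matching = [r for r in rows if r[column] == value]
    rest = [r for r in rows if r[column] != value]
    return (greedy_order(matching, columns, minimum_saving)
            + greedy_order(rest, columns, minimum_saving))

rows = [{"A": 2, "B": "x"}, {"A": 5, "B": "y"}, {"A": 2, "B": "x"},
        {"A": 9, "B": "y"}, {"A": 5, "B": "y"}, {"A": 2, "B": "x"}]
ordered = greedy_order(rows, ["A", "B"])
print([r["A"] for r in ordered])  # [2, 2, 2, 5, 9, 5]
```

On this toy table the first split groups A=2, and the leftover bucket's best candidate falls below the minimum saving, so the recursion stops there.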
The timeout point matters. More columns generally means more possible choices, and therefore more work. A bounded optimiser can stop after a few seconds and still leave a reasonable ordering behind. A longer run can continue splitting buckets and squeeze out more RLE-friendly regions.
Bucket Split Logic
The split itself is simple. Once a winning (column, value) pair has been chosen, rows matching that value are moved into one contiguous region. The rest remain available for later splits.
before: [A=2, A=5, A=2, A=9, A=5, A=2]
split on A=2
after: [A=2, A=2, A=2] [A=5, A=9, A=5]
That first bucket is now excellent for column A, but it still contains full rows. Those same rows also carry values for every other column. If the chosen rows also happen to share values in B, C, or Date, the split improves multiple columns at once. If not, the algorithm has bought RLE for one column at the cost of whatever fragmentation it introduced elsewhere.
This is where the “all columns at once” difficulty shows up. The greedy score is trying to approximate the total benefit of a local move inside a global row-ordering problem.
There is also a boundary issue. If two adjacent buckets end and begin with the same value for a column, the engine can avoid creating an unnecessary extra RLE run. That means a split is not only about separating matching and non-matching rows; the placement of those new buckets relative to their neighbours can affect run continuity.
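A toy illustration of the boundary effect, using the same run-counting idea as earlier:

```python
def rle_runs(values):
    # A run is a maximal stretch of identical adjacent values.
    runs, prev = 0, object()
    for v in values:
        if v != prev:
            runs, prev = runs + 1, v
    return runs

# Two sub-buckets produced by earlier splits, viewed through one column:
left  = [7, 7, 3, 3]   # ends with 3
right = [3, 3, 8, 8]   # begins with 3

# Placing `right` directly after `left` lets the trailing and leading
# 3-runs merge into one; the reverse placement creates an extra run.
print(rle_runs(left + right))  # 3 runs: 7s, 3s, 8s
print(rle_runs(right + left))  # 4 runs: 3s, 8s, 7s, 3s
```

Same rows, same buckets, one run saved purely by placement.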
Why Source Order Still Matters
A common question is whether the optimiser decides a global sort order upfront and then applies it to every segment. The behaviour I have seen points the other way: each segment is processed locally, and the greedy choice sequence is local to that segment.
That does not make source order irrelevant. It matters in two ways:
- It affects which rows land in the same segment. If related rows are already near each other, segment-local optimisation starts with a more favourable distribution.
- It affects the early histogram and split sequence inside a segment. A partially ordered segment can converge to a strong result faster, which matters when optimisation is bounded by time.
This is why sorting a source table before import can still improve model size. The source sort is not replacing VertiPaq’s internal ordering step. It is shaping the problem VertiPaq receives.
What the Heuristic Is Buying
The interesting part is the combination of four ideas:
- Score local moves by estimated bit savings.
- Split buckets greedily rather than attempting an exhaustive search.
- Keep the work segment-local so memory, CPU, and parallelism stay bounded.
- Preserve or improve RLE continuity at bucket boundaries when possible.
None of those ideas alone is magic. Together they give VertiPaq a practical answer to a hard problem: choose a single row order that improves compression across all columns, with limited time, fixed segment boundaries, and no opportunity to search the full permutation space.
That is also why correlated columns are where the biggest wins tend to appear. If Product, Category, Supplier, and Region move together, one bucket split can lengthen runs in several columns at once. If every column is independent noise, no row order can make all of them compress well.
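A quick demonstration with a functionally dependent pair of columns. The data is synthetic, with Category fully determined by Product, as in a typical product dimension:

```python
import random

def rle_runs(values):
    # A run is a maximal stretch of identical adjacent values.
    runs, prev = 0, object()
    for v in values:
        if v != prev:
            runs, prev = runs + 1, v
    return runs

random.seed(2)
category_of = {p: p // 10 for p in range(100)}  # 100 products, 10 categories
rows = [(p, category_of[p]) for p in (random.randrange(100) for _ in range(5_000))]

ordered = sorted(rows)  # group rows by Product
product_runs  = rle_runs([p for p, _ in ordered])
category_runs = rle_runs([c for _, c in ordered])
print(product_runs, category_runs)  # grouping Product also grouped Category
```

One ordering decision made for Product gives Category contiguous runs for free, which is exactly the effect a single bucket split has on correlated columns.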
Where to Go Next
- VertiPaq Dictionaries and Hash Indexes — the dictionary layer that maps these sorted IDs back to strings and numerics
- Reconstructing Column Data from .idf and .idfmeta — the Hybrid RLE payload the sort order feeds into
- Microsoft patent US 8,452,737 B2 — the patent describing efficient column-based data encoding for large-scale storage