
How VertiPaq Sorts Rows to Maximize RLE Compression

The deep-dive articles on this site explain how VertiPaq physically stores columns: dictionary-encoded integer IDs, hybrid RLE plus bit-packed sub-segments, paged Huffman-compressed string stores. All of that work compresses values given a row order. What nobody tends to mention up front is that the row order itself is a first-class compression lever — often a larger one than any codec choice.

This article is about that lever. It describes the algorithmic shape I arrived at after dissecting VertiPaq output, comparing segment layouts, and reading Microsoft patent US 8,452,737 B2. I am intentionally leaving source-code-level implementation details out. The useful part for understanding the engine is not a particular codebase; it is the decision process VertiPaq appears to follow when it chooses a row order before writing compressed column segments.

Some of the intermediate reverse-engineering notes behind this article started in this X post, where I was tracing table-processing signals such as iteration counts, total RLE runs, and XMHybridRLECompressionInfo.

Where This Fits

This is a sibling article to VertiPaq Dictionaries and Hash Indexes and Reconstructing Column Data from .idf and .idfmeta. Both of those treat row order as given. This article asks where that order comes from, and why it makes such a large difference in the resulting .idf payload.

Why Row Order Drives RLE

RLE compresses runs of identical values. In a columnar store every column is compressed independently, but every column sees the same row order. That means row order is a single global knob that every column’s RLE has to live with.

The practical consequence is easy to see. Take a column that repeats heavily:

random order:    [45, 12, 45, 89, 12, 45, 12, ...]   → ~1000 RLE runs
grouped order:   [45, 45, 45, ..., 12, 12, 12, ...]  → ~100 runs

Same values, same dictionary, same encoding family — but a large change in compressed size just by picking a better order.
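
A few lines of Python make the run count concrete (an illustration of the arithmetic, not VertiPaq code):

import random

def rle_runs(values):
    # Count RLE runs: every change of value starts a new run.
    runs, prev = 0, object()
    for v in values:
        if v != prev:
            runs, prev = runs + 1, v
    return runs

values = [45] * 400 + [12] * 350 + [89] * 250
print(rle_runs(random.sample(values, len(values))))  # ~650 runs in a random order
print(rle_runs(sorted(values)))                      # 3 runs once grouped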

The hard part is that VertiPaq is not sorting one column in isolation. It is choosing one row permutation for the whole table segment. An order that is perfect for ProductID might scatter CustomerID; an order that groups Date might fragment Store. The goal is not “make one column beautiful”; it is “make the sum of all column encodings smaller.”

That is why this problem gets interesting. Finding the globally optimal ordering across multiple columns is a combinatorial optimisation problem. The search space grows factorially with the number of rows, and the multi-column version is closely related to NP-hard ordering and clustering problems. VertiPaq therefore needs a bounded heuristic: something that usually gets close enough to the best answer without trying every possible row order.
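
For scale: even a toy bucket of 20 rows already has 20! ≈ 2.4 × 10¹⁸ possible orderings, and a full segment holds over a million rows.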

The Algorithm in One Paragraph

The algorithm is a greedy bucket split. Start with all rows in one bucket. Repeatedly pick the (column, value) pair that, if all rows matching it were pushed to the front of the bucket, would save the most bits overall. Split into two sub-buckets — matching rows and the rest — and recurse on each. Stop when no column offers enough savings. The row order that falls out at the end is the optimized ordering; VertiPaq then encodes each column against that order.

The original idea is described in Microsoft patent US 8,452,737 B2. The behaviour I have observed in VertiPaq output lines up with that style of greedy, segment-local bucket splitting.

[Figure: segment greedy ordering flow]

Segment Alignment

VertiPaq processes data in segments. In Power BI Desktop, the commonly observed segment size is 2²⁰ rows, or 1,048,576 rows. That is the number most people run into when inspecting Desktop-created import models, but it is not the only segment size VertiPaq can use.

Microsoft documents a different default for semantic models that use the large semantic model storage format: Power BI automatically sets the default segment size to 8 million rows, matching Azure Analysis Services, to balance memory requirements and query performance for large tables.
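
In code terms, the boundary is nothing more than a fixed-size slice of the incoming rows (a sketch; SEGMENT_SIZE and the helper name are mine, not an engine API):

SEGMENT_SIZE = 1 << 20  # Desktop import default; 8_000_000 under the large format

def segments(rows, size=SEGMENT_SIZE):
    # Each slice is ordered and compressed independently; the greedy
    # optimiser never sees rows outside its own slice.
    for start in range(0, len(rows), size):
        yield rows[start:start + size]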

For this row-ordering discussion, the exact segment size matters less than the boundary itself. VertiPaq appears to optimise row order within each segment, not by choosing one global permutation for the entire table. That has three practical consequences:

1. Segment-local parallel processing: each segment's ordering decisions are independent, so segments can be optimised and compressed in parallel.
2. RLE runs never span a segment boundary, so a value scattered across segments pays for at least one run in every segment it touches.
3. Which rows end up sharing a segment is fixed before the optimiser runs, so the order data arrives in still shapes the outcome.

This also explains why pre-sorting a source table can help even though VertiPaq still performs its own optimisation. Pre-sorting can improve value distribution across segment boundaries, so similar rows are more likely to land in the same segment. It can also give the greedy process a better starting point inside each segment, which means a bounded search or timeout can stop earlier with a better answer.

Scoring a Split

For any candidate (column, value) pair, the score is a bit-savings estimate:

bit_savings ≈ frequency × bits_per_value

where bits_per_value = ceil(log2(cardinality)) for the column in this segment. The intuition: pulling a frequent value into a pure run saves roughly one bit-packed slot per occurrence.
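
A quick worked example: a column with 1,000 distinct values in the segment bit-packs at ceil(log2(1000)) = 10 bits per slot, so a value occurring 50,000 times scores roughly 50,000 × 10 = 500,000 bits, about 61 KB of estimated savings from a single split.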

This is a greedy heuristic, not a proof of optimality. At every step, VertiPaq appears to ask: “Which value, in which column, gives the best immediate compression payoff if I group it now?”

Rendered as runnable Python, the step looks like this (a sketch of the decision shape, assuming a bucket is a list of row dicts; not VertiPaq's actual code):

import math
from collections import Counter

def bits_per_value(cardinality):
    # Width of one bit-packed slot for a column with this many
    # distinct values in the segment.
    return max(1, math.ceil(math.log2(cardinality)))

def choose_split(bucket, columns, minimum_saving):
    # Greedy step: find the (column, value) pair whose grouping
    # saves the most estimated bits right now.
    best = None  # (saving, column, value)
    for column in columns:
        histogram = Counter(row[column] for row in bucket)
        if not histogram:
            continue
        value, frequency = histogram.most_common(1)[0]
        saving = frequency * bits_per_value(len(histogram))
        if best is None or saving > best[0]:
            best = (saving, column, value)
    if best and best[0] >= minimum_saving:
        return best   # split the bucket on this column/value
    return None       # no split worth making: mark the bucket done

Several practical guardrails keep the search sensible:

1. A minimum-savings threshold: once no candidate split clears it, the bucket is marked done.
2. Per step, only the most frequent value in each column is scored, not every value in the histogram.
3. An overall time budget, so the optimiser can stop early and still leave a usable ordering behind.

The timeout point matters. More columns generally means more possible choices, and therefore more work. A bounded optimiser can stop after a few seconds and still leave a reasonable ordering behind. A longer run can continue splitting buckets and squeeze out more RLE-friendly regions.

Bucket Split Logic

The split itself is simple. Once a winning (column, value) pair has been chosen, rows matching that value are moved into one contiguous region. The rest remain available for later splits.

before: [A=2, A=5, A=2, A=9, A=5, A=2]
split on A=2
after:  [A=2, A=2, A=2] [A=5, A=9, A=5]
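
In list terms the move is a stable partition (same row-dict assumption as the scoring sketch):

def split_bucket(bucket, column, value):
    # Stable partition: matching rows become one contiguous region;
    # the rest keep their relative order for later splits.
    matching = [row for row in bucket if row[column] == value]
    rest = [row for row in bucket if row[column] != value]
    return matching, rest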

That first bucket is now excellent for column A, but it still contains full rows. Those same rows also carry values for every other column. If the chosen rows also happen to share values in B, C, or Date, the split improves multiple columns at once. If not, the algorithm has bought RLE for one column at the cost of whatever fragmentation it introduced elsewhere.

This is where the “all columns at once” difficulty shows up. The greedy score is trying to approximate the total benefit of a local move inside a global row-ordering problem.
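
Stitching the scoring and splitting sketches together gives the whole recursion, time budget included (again an illustrative shape, not VertiPaq's code):

import time

def order_segment(bucket, columns, minimum_saving, deadline):
    # Greedy bucket split: recurse until no split clears the savings
    # threshold or the time budget runs out.
    best = None
    if time.monotonic() < deadline:
        best = choose_split(bucket, columns, minimum_saving)
    if best is None:
        return bucket  # bucket done
    _, column, value = best
    matching, rest = split_bucket(bucket, column, value)
    if not rest:
        return bucket  # bucket is already pure for this column
    remaining = [c for c in columns if c != column]  # constant in `matching`
    return (order_segment(matching, remaining, minimum_saving, deadline)
            + order_segment(rest, columns, minimum_saving, deadline))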

[Figure: bucket split flow]

There is also a boundary issue. If two adjacent buckets end and begin with the same value for a column, the engine can avoid creating an unnecessary extra RLE run. That means a split is not only about separating matching and non-matching rows; the placement of those new buckets relative to their neighbours can affect run continuity.
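
Reusing the rle_runs helper from earlier makes the boundary effect concrete:

# Placing the pure A=2 bucket after a neighbour that ends in 2
# merges the edge values into a single run.
print(rle_runs([5, 9, 2] + [2, 2, 2]))  # 3 runs
print(rle_runs([2, 5, 9] + [2, 2, 2]))  # 4 runs: same rows, worse placement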

[Figure: boundary run merge flow]

Why Source Order Still Matters

A common question is whether the optimiser decides a global sort order upfront and then applies it to every segment. The behaviour I have seen points the other way: each segment is processed locally, and the greedy choice sequence is local to that segment.

That does not make source order irrelevant. It matters in two ways:

1. Source order decides which rows share a segment, so similar rows either meet inside one segment or never see each other at all.
2. Within a segment, the incoming order is the greedy optimiser's starting point, so a friendlier start lets a bounded search stop earlier with a better result.

This is why sorting a source table before import can still improve model size. The source sort is not replacing VertiPaq’s internal ordering step. It is shaping the problem VertiPaq receives.
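
A toy measurement shows the shape of that effect, reusing rle_runs and segments from earlier (illustrative numbers, not a benchmark):

import random
source = [random.randrange(100) for _ in range(4000)]
random_runs = sum(rle_runs(seg) for seg in segments(source, size=1000))
sorted_runs = sum(rle_runs(seg) for seg in segments(sorted(source), size=1000))
print(random_runs, sorted_runs)  # on the order of 3900+ vs ~100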

What the Heuristic Is Buying

The interesting part is the combination of four ideas:

1. A greedy frequency-times-bit-width score that is cheap to compute at every step.
2. Recursive bucket splitting, so each split leaves two smaller, easier sub-problems.
3. Segment-local scope, which caps the size of any single ordering problem.
4. A bounded time budget, so the search always terminates with a usable order.

None of those ideas alone is magic. Together they give VertiPaq a practical answer to a hard problem: choose a single row order that improves compression across all columns, with limited time, fixed segment boundaries, and no opportunity to search the full permutation space.

That is also why correlated columns are where the biggest wins tend to appear. If Product, Category, Supplier, and Region move together, one bucket split can lengthen runs in several columns at once. If every column is independent noise, no row order can make all of them compress well.
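
A two-column toy (hypothetical Product/Category pairs, reusing rle_runs) shows the correlated win:

rows = [(7, 0), (31, 3), (7, 0), (12, 1), (31, 3), (7, 0), (12, 1), (31, 3)]
rows.sort(key=lambda r: r[0])          # group by Product only
print(rle_runs([p for p, _ in rows]))  # 3 runs in Product
print(rle_runs([c for _, c in rows]))  # 3 runs in Category, for free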

Where to Go Next

The sibling deep-dives pick up where this one stops: VertiPaq Dictionaries and Hash Indexes and Reconstructing Column Data from .idf and .idfmeta both treat the row order chosen here as a given and explain what the engine builds on top of it.