
VertiPaq Dictionaries and Hash Indexes

Most explanations of PBIX internals jump straight from “the data is columnar” to “here is the final table.” The missing middle is usually the dictionary layer. VertiPaq does not store every imported value verbatim in the segment payload. It stores encoded IDs and relies on companion structures such as .dictionary and .hidx files to make those IDs meaningful.

That middle layer is where a parser stops being a metadata browser and starts becoming a real decoder.

Where This Fits

This article assumes you already know how to reach DataModel and what the VertiPaq workspace looks like. If not, read The DataModel: Power BI’s Embedded Analysis Services Engine first. If you want the next step after dictionaries, continue with Reconstructing Column Data from .idf and .idfmeta.

Two Main Reconstruction Paths

In practice, imported columns tend to follow one of two broad paths:

- Hash-encoded (dictionary-backed) columns: the segment payload stores integer data IDs, and a .dictionary file maps those IDs back to the original values.
- Value-encoded columns: the segment payload stores the numbers themselves in scaled-integer form, and a .hidx hash index supports lookups over the encoded values.

The split is visible right in the on-disk filenames. Hash-encoded columns drop a dictionary file into the table folder, typically named #.{Table} ({TableID}).{Column} ({ColumnID}).dictionary. Value-encoded columns drop a hash-index file instead, named #.H${Table} ({TableID})${Column} ({ColumnID}).hidx. That single convention tells you which reconstruction path to take before you open a single byte of the payload.
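The filename convention above is easy to act on in code. Here is a minimal sketch of that classification step — the function name and the "unknown" fallback are my own, not part of pbixray:

```python
import re

# Hypothetical helper: infer a column's reconstruction path from the
# on-disk filenames in its table folder, using the suffix convention
# described above.
DICT_RE = re.compile(r"\.dictionary$")
HIDX_RE = re.compile(r"\.hidx$")

def reconstruction_path(filenames):
    """Return 'dictionary' (hash-encoded) or 'value' (value-encoded)."""
    if any(DICT_RE.search(name) for name in filenames):
        return "dictionary"  # decode IDs, then map through the .dictionary
    if any(HIDX_RE.search(name) for name in filenames):
        return "value"       # decode integers, then apply value-encoding metadata
    return "unknown"

print(reconstruction_path(["#.Sales (123).Amount (456).dictionary"]))  # dictionary
print(reconstruction_path(["#.H$Sales (123)$Qty (789).hidx"]))         # value
```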

What a .dictionary File Stores

For dictionary-backed columns, the .idf payload stores integer IDs rather than the original strings or numerics. The .dictionary file provides the mapping from internal data IDs back to real values.

The high-level shape is straightforward:

seq:
  - id: dictionary_type
    type: s4
  - id: hash_table_class
    type: u4
  - id: hash_bin_header
    type: hash_bin_header

dictionary_type takes one of four values: -1 (invalid), 0 (int64 values), 1 (float64 values), and 2 (strings). The hash-bin header that follows describes a 64-bin hash table the engine uses in memory — on disk it’s usually empty (with an i64 = -1 sentinel after the three header fields), because the engine rebuilds it from the value store on load. A deserializer that only needs the values themselves can treat that whole block as opaque and skip past it.
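Reading just the leading fields is enough to decide how to interpret the rest of the file. A small sketch, matching the Kaitai seq above and treating everything after the first two fields as opaque:

```python
import struct

# Sketch: unpack the two leading .dictionary fields (s4 dictionary_type,
# u4 hash_table_class). The hash-bin block that follows is skipped as
# opaque here, as the article suggests a value-only parser can do.
TYPE_NAMES = {-1: "invalid", 0: "int64", 1: "float64", 2: "string"}

def read_dictionary_header(buf):
    dictionary_type, hash_table_class = struct.unpack_from("<iI", buf, 0)
    return TYPE_NAMES.get(dictionary_type, "unknown"), hash_table_class

# Synthetic example: a string-dictionary header
header = struct.pack("<iI", 2, 0)
print(read_dictionary_header(header))  # ('string', 0)
```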

Numeric dictionaries are then just a flat vector: an 8-byte count, a 4-byte element size, and then the raw bytes (s4le for 4-byte elements, s8le or f8le for 8-byte elements depending on whether the type is long or real).
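That flat layout can be decoded in a few lines. A sketch, where the `is_real` flag stands in for the long-vs-real distinction carried by dictionary_type (my parameter name, not the format's):

```python
import struct

# Minimal sketch of the numeric value vector: an 8-byte count, a 4-byte
# element size, then raw little-endian values.
def read_numeric_vector(buf, offset=0, is_real=False):
    count, = struct.unpack_from("<q", buf, offset)
    elem_size, = struct.unpack_from("<I", buf, offset + 8)
    fmt = {4: "<i", 8: "<d" if is_real else "<q"}[elem_size]
    start = offset + 12
    return [struct.unpack_from(fmt, buf, start + i * elem_size)[0]
            for i in range(count)]

# Synthetic example: three int64 values
payload = struct.pack("<qI", 3, 8) + struct.pack("<3q", 10, 20, 30)
print(read_numeric_vector(payload))  # [10, 20, 30]
```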

String dictionaries are paged. After a small page-layout header (total string count, longest string length, page count) comes a sequence of dictionary pages, each of which can be independently compressed or not based on the low bit of a per-page page_mask. Every page is wrapped by two fixed sentinels — 0xDD 0xCC 0xBB 0xAA at the start and 0xCD 0xAB 0xCD 0xAB at the end — which double as a useful format fingerprint when you’re exploring a file by hand. A final vector of record handles (page_id, bit-or-byte offset) is how the engine locates a specific string within the pages.
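The sentinels make a handy sanity check when exploring a file by hand. A small sketch:

```python
# Sketch: validate the fixed page sentinels described above. A buffer that
# starts and ends with them is a good candidate for a dictionary page.
PAGE_START = bytes([0xDD, 0xCC, 0xBB, 0xAA])
PAGE_END   = bytes([0xCD, 0xAB, 0xCD, 0xAB])

def looks_like_dictionary_page(page: bytes) -> bool:
    return page.startswith(PAGE_START) and page.endswith(PAGE_END)

sample = PAGE_START + b"...page body..." + PAGE_END
print(looks_like_dictionary_page(sample))  # True
```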

[Figure: Dictionary structure]

Huffman-Compressed String Pages

One of the more interesting findings in the reverse-engineered format is that string pages are not always stored as plain null-terminated text. Compressed pages carry a character_set_type_identifier that selects between two Huffman variants, one per supported character set.

Each compressed page carries a compact 128-byte encode_array rather than a full 256-byte table. The decoder expands that to 256 codeword lengths (two 4-bit nibbles per byte), reconstructs canonical Huffman codes from the lengths, and walks a tree bit-by-bit to decode. Record handles (page_id, bit-offset pairs) mark where each string starts inside the compressed buffer, so the decoder knows exactly when to stop for each value.
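The two preparatory steps — nibble expansion and canonical code assignment — can be sketched as follows. The low-nibble-first ordering is an assumption on my part; the canonical-code construction itself is the standard algorithm:

```python
# Sketch: expand the 128-byte encode_array into 256 codeword lengths
# (two 4-bit nibbles per byte; low nibble first is assumed here), then
# assign canonical Huffman codes from those lengths.
def expand_lengths(encode_array: bytes):
    lengths = []
    for b in encode_array:
        lengths.append(b & 0x0F)  # low nibble: even symbol
        lengths.append(b >> 4)    # high nibble: odd symbol
    return lengths  # one length per byte value, 0 = symbol unused

def canonical_codes(lengths):
    """Standard canonical assignment: shorter codes first, then by symbol."""
    symbols = sorted((l, s) for s, l in enumerate(lengths) if l > 0)
    codes, code, prev_len = {}, 0, 0
    for length, sym in symbols:
        code <<= (length - prev_len)  # widen the code to the new length
        codes[sym] = (code, length)   # (codeword value, bit length)
        code += 1
        prev_len = length
    return codes
```

With lengths 1, 2, 2 for symbols 0, 1, 2, this yields the codewords 0, 10, and 11, which is what a bit-by-bit tree walk would then consume.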

That is why a parser cannot treat the dictionary as a trivial array lookup. It often has to decompress string pages, honor record-handle offsets, and rebuild the mapping page by page. In pbixray, that logic lives in the dictionary-reading path of the VertiPaq decoder.

What a .hidx File Stores

A .hidx file is a hash index rather than a value store. Its layout is built around hash bins, local entries, and overflow entries:

seq:
  - id: hash_algorithm
    type: s4
  - id: hash_entry_size
    type: u4
  - id: hash_bin_size
    type: u4
  - id: local_entry_count
    type: u4

The key point is that .hidx supports lookup-oriented behavior around encoded values. It is not the same thing as the dictionary value store.
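For completeness, the four leading fields from the Kaitai seq above unpack just as easily as the dictionary header. A sketch:

```python
import struct

# Sketch: unpack the four leading .hidx fields (s4 hash_algorithm,
# then three u4 fields), matching the seq block above.
def read_hidx_header(buf):
    algo, entry_size, bin_size, local_count = struct.unpack_from("<iIII", buf, 0)
    return {
        "hash_algorithm": algo,
        "hash_entry_size": entry_size,
        "hash_bin_size": bin_size,
        "local_entry_count": local_count,
    }

# Synthetic example
buf = struct.pack("<iIII", 1, 16, 64, 10)
print(read_hidx_header(buf))
```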

[Figure: Hierarchy and hash index structure]

Internal IDs, min_data_id, and Nullability

The .idf payload is usually decoded into internal IDs first. Those IDs only become user-facing values after one more mapping step.

In the current pbixray implementation, the reconstruction path depends on column metadata:

- If the column has a dictionary, the decoded data IDs are mapped through that dictionary to recover the original values.
- If the column instead has a hash index (.hidx), the decoded integers are shifted by the column's BaseId and divided by its Magnitude to produce the final numbers.

That second case is worth underlining. A .hidx file signals a different encoding path, but the actual user-visible number can often be reconstructed from metadata plus the decoded integer vector without consulting the hash index during plain table extraction.
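As a concrete illustration of that metadata-only path (the function name is mine; the BaseId/Magnitude arithmetic follows the pbixray excerpt later in this article):

```python
# Sketch: reconstruct value-encoded numbers from the decoded integer
# vector plus the column's BaseId and Magnitude metadata, with no .hidx
# lookup involved.
def decode_value_encoded(raw_ids, base_id, magnitude):
    return [(v + base_id) / magnitude for v in raw_ids]

# e.g. prices stored as scaled integers with two implied decimal places
print(decode_value_encoded([1995, 2450], base_id=0, magnitude=100))
# [19.95, 24.5]
```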

Nullability adds another wrinkle. For dictionary-backed columns, the minimum data ID has to be adjusted correctly so the null slot lands in the expected position.
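One plausible shape for that adjustment is sketched below. The exact convention for which slot encodes the null is an assumption here, not confirmed pbixray behavior:

```python
# Illustrative sketch only: IDs below min_data_id are treated as the
# reserved null slot for nullable columns; everything else indexes the
# dictionary after subtracting min_data_id.
def map_ids(data_ids, dictionary_values, min_data_id, nullable):
    values = []
    for data_id in data_ids:
        if nullable and data_id < min_data_id:
            values.append(None)  # reserved null slot
        else:
            values.append(dictionary_values[data_id - min_data_id])
    return values

print(map_ids([2, 1, 3], ["red", "green"], min_data_id=2, nullable=True))
# ['red', None, 'green']
```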

Why These Structures Matter to a Parser

Without the dictionary and hash-index layer, the output of .idf decoding is only half-finished. You would have:

- for dictionary-backed columns, bare integer data IDs with no way back to the original strings or numerics;
- for value-encoded columns, scaled integers still carrying a base offset and a magnitude factor.

This is why I think of dictionaries as the hinge between physical storage and semantic output.

How pbixray Uses Them

The current decoder follows a simple pattern:

if pd.notnull(column_metadata["Dictionary"]):
    # Hash-encoded: map decoded data IDs back through the dictionary
    dictionary = self._read_dictionary(dictionary_buffer,
                                       min_data_id=column_metadata["min_data_id"])
    values = self._read_rle_bit_packed_hybrid(...)
    return pd.Series(values).map(dictionary)
elif pd.notnull(column_metadata["HIDX"]):
    # Value-encoded: undo the base offset and magnitude scaling
    values = self._read_rle_bit_packed_hybrid(...)
    return pd.Series(values).add(column_metadata["BaseId"]) / column_metadata["Magnitude"]

That is a useful summary of the whole article. Segment decoding gives you encoded values. The dictionary or value-encoding path gets you back to the values a user would recognize.


