Reconstructing Column Data from .idf and .idfmeta

If metadata.sqlitedb tells you what the model means and .dictionary tells you what the IDs mean, .idf and .idfmeta tell you how the payload itself is packed on disk.

This is the layer where a PBIX parser either becomes a real data extractor or stays a metadata browser.

Where This Fits

This is the lowest-level storage article in the launch batch. It builds directly on the DataModel article and on "VertiPaq Dictionaries and Hash Indexes". If you only want the high-level API, skip to "Parsing PBIX Files with Python (pbixray)".

What the Two Files Do

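For each stored column segment, VertiPaq separates metadata from payload: the .idf file holds the packed payload, and the .idfmeta file holds the metadata needed to decode it.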

That split is crucial. You cannot decode the .idf bytes correctly by looking at them in isolation. You need the companion metadata to know where RLE applies, where bit-packed values begin, and how to interpret the resulting integers.

Figure: Column storage layout

The Shape of .idf

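Structurally, an .idf file is a sequence of segments, each built from a variable-length primary segment followed by a variable-length sub-segment. In the Kaitai Struct-style description below, u4 and u8 are 4-byte and 8-byte unsigned integers, and each size field is an element count rather than a byte count: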

segment:
  seq:
    - id: primary_segment_size
      type: u8
    - id: primary_segment
      type: segment_entry
      repeat: expr
      repeat-expr: primary_segment_size
    - id: sub_segment_size
      type: u8
    - id: sub_segment
      type: u8
      repeat: expr
      repeat-expr: sub_segment_size

segment_entry:
  seq:
    - id: data_value
      type: u4
    - id: repeat_value
      type: u4

Each primary-segment entry is just two u4 fields: data_value and repeat_value. That already hints at the hybrid strategy. Most entries behave as straight RLE runs. A special marker (data_value == 0xFFFFFFFF in the simplest case) indicates that the next repeat_value values must be read from the bit-packed sub-segment instead.
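As a rough illustration of that layout, the following sketch reads one segment with Python's struct module. It assumes little-endian byte order (suggested by the u8le name used later in this article, but not stated by the layout itself), and parse_idf_segment is a hypothetical helper, not part of pbixray.

```python
import struct

def parse_idf_segment(buf, offset=0):
    """Parse one segment; return (entries, sub_segment_words, next_offset).

    Assumption: all integers are little-endian.
    """
    # primary_segment_size counts entries, not bytes
    (primary_count,) = struct.unpack_from("<Q", buf, offset)
    offset += 8

    entries = []
    for _ in range(primary_count):
        # each entry is two u4 fields: data_value, repeat_value
        data_value, repeat_value = struct.unpack_from("<II", buf, offset)
        entries.append((data_value, repeat_value))
        offset += 8

    # sub_segment_size counts 64-bit words
    (sub_count,) = struct.unpack_from("<Q", buf, offset)
    offset += 8
    words = list(struct.unpack_from(f"<{sub_count}Q", buf, offset))
    offset += 8 * sub_count
    return entries, words, offset

# Synthetic segment: one RLE run, one bit-pack marker, one packed word.
raw = (
    struct.pack("<Q", 2)
    + struct.pack("<II", 7, 3)
    + struct.pack("<II", 0xFFFFFFFF, 4)
    + struct.pack("<Q", 1)
    + struct.pack("<Q", 0x1B)
)
entries, words, end = parse_idf_segment(raw)
```

Returning next_offset lets a caller loop over consecutive segments in the same buffer.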

The Shape of .idfmeta

The .idfmeta companion is much richer. It uses a tagged binary format where every block is wrapped in textual markers — <1:CP\0 opens a column partition and CP:1>\0 closes it, with <1:CS\0/CS:1>\0 for each column segment and <1:SS\0/SS:1>\0 for the subsegment statistics nested inside. You can often pick an .idfmeta file out of a hex dump just by spotting those tags.
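To make the tag-spotting idea concrete, here is a small sketch that scans a byte buffer for those markers. find_tags and the sample bytes are hypothetical, and a plain byte search like this is only a heuristic: payload bytes could contain the same sequences by coincidence.

```python
# Open/close markers as described above (each ends with a NUL byte).
TAGS = {
    "CP": (b"<1:CP\x00", b"CP:1>\x00"),  # column partition
    "CS": (b"<1:CS\x00", b"CS:1>\x00"),  # column segment
    "SS": (b"<1:SS\x00", b"SS:1>\x00"),  # subsegment statistics
}

def find_tags(buf):
    """Return a sorted list of (offset, tag_name, 'open' or 'close')."""
    found = []
    for name, (open_tag, close_tag) in TAGS.items():
        for marker, kind in ((open_tag, "open"), (close_tag, "close")):
            pos = buf.find(marker)
            while pos != -1:
                found.append((pos, name, kind))
                pos = buf.find(marker, pos + 1)
    return sorted(found)

# Synthetic buffer that only mimics the nesting; real files carry
# field data between the markers.
sample = (
    b"<1:CP\x00" + b"\x01\x02"
    + b"<1:CS\x00"
    + b"<1:SS\x00" + b"\xaa" + b"SS:1>\x00"
    + b"CS:1>\x00"
    + b"CP:1>\x00"
)
order = [(name, kind) for _, name, kind in find_tags(sample)]
```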

Inside that envelope lives a compression_class identifier (a PF_OBJECT_CLASS value) that tells the decoder how the segment is encoded.

For table reconstruction, a parser typically only needs a small subset of the available metadata: min_data_id, count_bit_packed, and bit_width.

The current pbixray decoder extracts exactly those essentials:

row_data = {
    "min_data_id": metadata.blocks.cp.cs.ss.min_data_id,
    "count_bit_packed": metadata.blocks.cp.cs.cs.count_bit_packed,
    "bit_width": metadata.bit_width,
}

Figure: IDF metadata structure

Hybrid RLE plus Bit Packing

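The storage strategy described by the two files is hybrid: run-length encoding for repeated values in the primary segment, and fixed-width bit packing for the remaining values in the sub-segment.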

In the pbixray implementation, the primary segment is walked entry by entry. Ordinary entries expand to repeated values. A special marker value means “switch to the next batch of bit-packed values from the sub-segment.”

Conceptually it looks like this:

primary segment:
  [value, count]
  [value, count]
  [0xFFFFFFFF, count]   -- marker: next `count` entries come from sub-segment

sub-segment:
  packed integers with width = bit_width

The pbixray decoder tracks a rolling bit_packed_offset so that successive bit-pack markers pull the next batch of values out of the sub-segment correctly — the marker is detected by checking entry.data_value + bit_packed_offset == 0xFFFFFFFF, then the offset advances by entry.repeat_value.

That design lets VertiPaq take advantage of both repetition and compact integer widths within the same column segment.
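The walk over the primary segment described above can be sketched as follows. This is a minimal illustration based only on the marker check quoted in this section; expand_hybrid is a hypothetical name, entries are modeled as plain (data_value, repeat_value) tuples, and bit_packed_values is assumed to be the already-unpacked list of sub-segment values.

```python
MARKER = 0xFFFFFFFF

def expand_hybrid(entries, bit_packed_values):
    """Rebuild the encoded vector from RLE runs and bit-packed batches."""
    vector = []
    bit_packed_offset = 0  # how many sub-segment values are consumed so far
    for data_value, repeat_value in entries:
        if data_value + bit_packed_offset == MARKER:
            # Marker entry: take the next batch from the sub-segment.
            start = bit_packed_offset
            vector.extend(bit_packed_values[start:start + repeat_value])
            bit_packed_offset += repeat_value
        else:
            # Ordinary entry: a run of repeat_value copies of data_value.
            vector.extend([data_value] * repeat_value)
    return vector

entries = [
    (5, 3),               # run: 5, 5, 5
    (0xFFFFFFFF, 2),      # first marker, offset is still 0
    (9, 1),               # run: 9
    (0xFFFFFFFF - 2, 2),  # second marker, offset is now 2
]
vector = expand_hybrid(entries, [10, 11, 12, 13])
```

Note how the second marker stores 0xFFFFFFFF minus the number of values already consumed, which is what the offset-adjusted comparison implies.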

min_data_id, bit_width, and the Reconstructed Vector

Once the parser knows the bit width and the minimum data ID, it can unpack the sub-segment into actual integer values. In pbixray, each 64-bit word is shifted and masked repeatedly:

mask = (1 << bit_width) - 1
res.append(min_data_id + (u8le & mask))
u8le >>= bit_width
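For context, that fragment can be wrapped into a self-contained sketch. Several points here are assumptions rather than facts from the format: that each 64-bit word holds 64 // bit_width values with none straddling a word boundary, that the lowest bits come first, and that a caller-supplied count is used to trim padding in the last word. unpack_words is a hypothetical helper.

```python
def unpack_words(words, bit_width, min_data_id, count):
    """Unpack fixed-width integers from 64-bit words.

    Assumptions: values never straddle a word boundary, low-order bits
    come first, and `count` is the number of real values to keep.
    """
    if bit_width <= 0:
        return []
    mask = (1 << bit_width) - 1
    per_word = 64 // bit_width
    res = []
    for word in words:
        for _ in range(per_word):
            res.append(min_data_id + (word & mask))
            word >>= bit_width
    return res[:count]

# Pack the raw values 0, 1, 2, 7 into one word at 3 bits each, low bits first.
word = 0 | (1 << 3) | (2 << 6) | (7 << 9)
values = unpack_words([word], bit_width=3, min_data_id=3, count=4)
```

Adding min_data_id after masking means the packed integers only need enough bits to cover the range above the minimum, not the full ID range.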

Those integers are still not final business values. They are the reconstructed encoded vector. That vector then flows into either the dictionary path or the value-encoding path.

The End-to-End Column Reconstruction Flow

Putting the pieces together, a direct parser usually follows this sequence for each stored column:

  1. use metadata.sqlitedb to locate the relevant files and column properties
  2. parse .idfmeta to get min_data_id, count_bit_packed, and bit_width
  3. parse .idf to rebuild the encoded vector from RLE and bit-packed segments
  4. map the encoded vector through the dictionary path or the value-encoding path
  5. cast the result to the final runtime type

That five-step flow is the heart of imported-table reconstruction in pbixray.
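Step 4 can be illustrated with a small sketch. Everything here is hypothetical: materialize is not a pbixray function, the dictionary is a plain Python dict standing in for a parsed .dictionary file, and decode_value is a placeholder callable because this article does not specify the value-encoding arithmetic.

```python
def materialize(encoded, dictionary=None, decode_value=None):
    """Map an encoded vector to values via one of two paths."""
    if dictionary is not None:
        # Dictionary path: each integer is an ID looked up in the dictionary.
        return [dictionary[i] for i in encoded]
    if decode_value is not None:
        # Value-encoding path: each integer is converted arithmetically.
        return [decode_value(i) for i in encoded]
    # No mapping supplied: return the encoded integers unchanged.
    return list(encoded)

# Dictionary path with a made-up dictionary.
colors = materialize([3, 4, 3], dictionary={3: "red", 4: "blue"})

# Value-encoding path with a made-up conversion (illustrative only).
scaled = materialize([10, 20], decode_value=lambda i: i / 10)
```

Step 5, the cast to the final runtime type, would then be applied to the returned list.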

Why This Is the Hardest Part of the Format

This layer is where multiple incomplete truths have to line up:

Only when all three agree do you get a faithful column back out. That is why so much of the reverse-engineering effort ends up concentrated here.

