The DataModel: Power BI's Embedded Analysis Services Engine

When Power BI Desktop saves a report, it serializes an entire in-memory columnar database into a single stream called DataModel. That stream is not a simplified export format. It is an Analysis Services Tabular database packaged inside the PBIX file.

Once you cross into DataModel, you are no longer dealing with report JSON or package metadata. You are dealing with the same family of storage concepts that power VertiPaq in Power BI, Power Pivot, and SSAS Tabular.

Where This Fits

This article begins where Inside the PBIX ZIP Archive ends. It follows the DataModel member through its compression layer and into the VertiPaq filesystem layout. For the semantic layer inside that structure, continue with Inside metadata.sqlitedb: Tables, Columns, Measures & Relationships.

The Layer Cake Inside DataModel

The easiest way to think about DataModel is as a stack of nested representations:

  1. the PBIX file is a ZIP archive
  2. the DataModel member is a raw ZIP entry
  3. that entry contains an XPress9-compressed Analysis Services backup stream
  4. the backup expands into a VertiPaq-oriented filesystem layout
  5. tables are reconstructed from metadata, dictionaries, hash indexes, and compressed column segments
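
The first two layers need nothing beyond a ZIP reader. The following is a minimal sketch, assuming the member is literally named DataModel as described above; the function name `read_datamodel_bytes` is hypothetical and not part of pbixray.

```python
import zipfile


def read_datamodel_bytes(pbix_source):
    """Return the raw DataModel stream from a PBIX file.

    pbix_source may be a path or a seekable binary file object.
    The returned bytes are still compressed at the XPress9 layer;
    only the ZIP layer has been peeled off here.
    """
    with zipfile.ZipFile(pbix_source) as archive:
        # Assumption: the member name is exactly "DataModel".
        return archive.read("DataModel")
```

Everything from layer 3 onward operates on the bytes this returns, which is why the ZIP step is the shortest part of any parser.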

That is why generic archive tooling is only the first step. The real implementation work starts after the ZIP boundary.

The same layering was central to my earlier post Lessons Learned from Unpacking VertiPaq: A Developer’s Journey. The difference here is that the goal is not to tell the discovery story. It is to document the structures in a way that is useful for parser authors.

XPress9 and the Analysis Services Backup Stream

In practice there are three forms a DataModel stream can take, each identified by the signature carried in its first 102 bytes.

All three land at the same destination, an Analysis Services backup that unfolds into a VertiPaq filesystem; they just take different routes to get there. pbixray’s unpacker dispatches on those signatures directly.

Both XPress9 variants share the same chunk layout after their respective signatures: each chunk begins with an uncompressed_size and a compressed_size as 32-bit little-endian integers, followed by a compressed node. The multithreaded form adds one extra header up front (block counts and chunk sizes) so a decoder can hand work out to a thread pool. Each node itself starts with a 32-byte XPress9 header that includes a 0x4e86d72a magic, the original and encoded sizes, a Huffman-table flags bitfield, a session signature, a block index, and a CRC32 — the actual compressed payload begins after those 32 bytes.

seq:
  - id: uncompressed
    type: u4
  - id: compressed
    type: u4
  - id: node
    type: node

That chunking step is the bridge between “ZIP member” and “recoverable backup.”
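
The chunk layout can be walked with a few lines of Python. This is a sketch under stated assumptions rather than pbixray's actual code: it assumes `offset` already points past the signature (and past the extra header in the multithreaded form), that `compressed_size` counts the whole node including its 32-byte XPress9 header, and that chunks run to the end of the buffer. The name `iter_chunks` is hypothetical.

```python
import struct

# Two 32-bit little-endian integers: uncompressed_size, compressed_size.
CHUNK_HEADER = struct.Struct("<II")


def iter_chunks(buf, offset=0):
    """Yield (uncompressed_size, node_bytes) for each chunk in buf.

    Assumption: compressed_size is the length of the node that follows,
    and the node is returned undecoded (XPress9 header included).
    """
    while offset < len(buf):
        if offset + CHUNK_HEADER.size > len(buf):
            raise ValueError("truncated chunk header")
        uncompressed_size, compressed_size = CHUNK_HEADER.unpack_from(buf, offset)
        offset += CHUNK_HEADER.size
        node = buf[offset:offset + compressed_size]
        if len(node) != compressed_size:
            raise ValueError("truncated node")
        yield uncompressed_size, node
        offset += compressed_size
```

Each yielded node would then be handed to an XPress9 decoder, and `uncompressed_size` tells the caller how large an output buffer that decoder needs.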

The ABF Layer Underneath

Once XPress9 is out of the way, what’s left is an Analysis Services ABF (Analysis Backup File). Inside every ABF there are three anchored structures a parser needs:

  1. a BackupLogHeader at offset 72, always 4 KB long, which gives you the offset and size of the virtual directory
  2. a VirtualDirectory — an XML-ish list of file entries with paths, sizes, and offsets into the backup body
  3. a BackupLog — an XML manifest of file groups and backup files, which is matched back against the virtual directory by StoragePath

When those three agree, you get a file log of (Path, FileName, StoragePath, Size, OffsetHeader) tuples. That file log is what turns the backup stream into the VertiPaq workspace described below — each tuple becomes one on-disk file.
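
The matching step amounts to a join on StoragePath followed by byte slicing. The sketch below assumes both structures have already been parsed into lists of dictionaries. Which side carries which field is an assumption made for illustration, as is treating OffsetHeader as a byte offset into the decompressed stream; `build_file_log` and `extract_file` are hypothetical names.

```python
def build_file_log(virtual_directory, backup_log):
    """Join backup-log entries to virtual-directory entries by StoragePath.

    Assumption: directory entries carry Size and OffsetHeader, while
    backup-log entries carry Path and FileName.
    """
    by_storage_path = {entry["StoragePath"]: entry for entry in virtual_directory}
    file_log = []
    for item in backup_log:
        entry = by_storage_path.get(item["StoragePath"])
        if entry is None:
            raise KeyError(f"no directory entry for {item['StoragePath']!r}")
        file_log.append((
            item["Path"],
            item["FileName"],
            item["StoragePath"],
            entry["Size"],
            entry["OffsetHeader"],
        ))
    return file_log


def extract_file(abf, record):
    """Slice one file's bytes out of the decompressed backup stream."""
    *_, size, offset = record
    return abf[offset:offset + size]
```

Raising on a missing StoragePath is a deliberate choice here: a mismatch between the two structures usually means an earlier parsing step went wrong, and failing early is easier to debug than a silently incomplete workspace.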

What the VertiPaq Filesystem Looks Like

After decompression, the structure follows a predictable on-disk layout that Power BI Desktop also uses when it materializes models to its local workspace folder. A minimal example looks like this:

0.CryptKey.bin
metadata.sqlitedb

Fruit RLE (427).tbl
  0.Fruit RLE (427).Type (430).dictionary
  1.H$Fruit RLE (427)$Qty (431).hidx
  432.prt/
    0.Fruit RLE (427).Qty (431).0.idf
    0.Fruit RLE (427).Qty (431).0.idfmeta
    0.Fruit RLE (427).Type (430).0.idf
    0.Fruit RLE (427).Type (430).0.idfmeta

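That single directory tree already reveals most of the moving parts behind imported table reconstruction: the metadata database, a per-column dictionary file, a hash index (.hidx), and a partition folder (.prt) holding the compressed column segments (.idf) alongside their .idfmeta companions.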
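
One thing the tree shows directly is that file names embed the table and column they belong to. The regular expression below is inferred only from the example listing above, not from a format specification, so treat it as an assumption. It makes no claim about what the leading number or the optional trailing number mean, which is why they get the neutral names `prefix` and `index`. `parse_column_file` is a hypothetical helper.

```python
import re

# Pattern inferred from names such as
#   0.Fruit RLE (427).Qty (431).0.idf
#   0.Fruit RLE (427).Type (430).dictionary
COLUMN_FILE = re.compile(
    r"(?P<prefix>\d+)\."
    r"(?P<table>.+?) \((?P<table_id>\d+)\)\."
    r"(?P<column>.+?) \((?P<column_id>\d+)\)\."
    r"(?:(?P<index>\d+)\.)?"
    r"(?P<ext>idfmeta|idf|dictionary)"
)


def parse_column_file(name):
    """Split a column storage file name into its parts.

    Returns a dict of strings, or None when the name does not follow
    the column-file pattern (for example .hidx files).
    """
    match = COLUMN_FILE.fullmatch(name)
    return match.groupdict() if match else None
```

In a real parser the metadata database should remain the authority for locating files; parsing names like this is mainly useful for sanity checks and for exploring an unpacked workspace by hand.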

Root-Level Folder Families

The VertiPaq workspace contains more than just table folders. Four folder families show up next to each other, each with a distinct naming convention.

If your immediate goal is “read table data,” the table folders matter most. The others are still important because they show that the on-disk engine is maintaining more than simple column payloads. It is also materializing structures needed for hierarchy navigation, relationship operations, and query-time behavior.

Why metadata.sqlitedb Sits at the Center

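Although DataModel contains many binary structures, the metadata database is the organizing spine. It records the model objects (tables, columns, measures, and relationships) and gives the parser what it needs to locate the storage files that belong to each column.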

That is why a direct parser usually reads metadata.sqlitedb very early in the process. The rest of the binary files become much easier to interpret once they are anchored to real model objects.
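
Because metadata.sqlitedb is an ordinary SQLite file once extracted, a first look needs only the standard library. The sketch below lists whatever tables the file contains through SQLite's own `sqlite_master` catalog, so it makes no assumption about the schema Power BI uses; `list_metadata_tables` is a hypothetical helper name.

```python
import sqlite3
from contextlib import closing


def list_metadata_tables(db_path):
    """Return the names of all tables in an extracted metadata.sqlitedb."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Starting from the catalog instead of hard-coded table names keeps an exploratory script working even if the schema differs between Power BI versions.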

What pbixray Actually Does with DataModel

At a high level, the Python implementation follows a three-stage pattern:

  1. unpack the embedded DataModel
  2. read metadata.sqlitedb into DataFrames
  3. use that metadata to locate and decode per-column storage files

That split is deliberate. The hardest part of the format is not one magical binary parser. It is the fact that meaning is distributed across multiple layers that only become useful when combined.

