A .pbix file is the native file format used by Microsoft Power BI Desktop. At first glance it looks like a binary blob, but it is really a package that combines the report definition, queries, semantic metadata, and imported table data into a single deliverable.
Understanding that package boundary is useful. Understanding what sits behind it is much more useful. The hard part of PBIX is not that it is zipped. The hard part is that one of the ZIP members, DataModel, contains an embedded Analysis Services Tabular database with VertiPaq storage structures underneath it.
This article is the hub for the launch batch on pbixray.com. It sets up the terminology and the map. The rest of the articles drill into the parts of the format that matter when you are building tooling rather than authoring reports.
The open source Python library pbixray was built specifically to make PBIX parsing accessible without requiring Power BI Desktop or a live Analysis Services connection.
Where This Fits
If you want the short version, stay here. If you want the package boundary next, continue with Inside the PBIX ZIP Archive. If you care most about the storage engine, jump to The DataModel: Power BI’s Embedded Analysis Services Engine.
A PBIX File Is a Package, Not a Monolith
At the outer layer, a PBIX file behaves like a ZIP archive with a recognizable set of top-level members:
- DataModel
- Mashup
- Report/Layout
- Report/StaticResources
- SecurityBindings
- Connections
- [Content_Types].xml
- Version
Those entries already tell you a lot about the product model:
- Report/Layout belongs to the report canvas and visuals.
- Mashup belongs to Power Query.
- Connections belongs to external data sources and model bindings.
- DataModel belongs to the embedded Analysis Services engine.
For developers, DataModel is where most of the real complexity lives.
Why DataModel Is the Center of Gravity
In the PBIX files I work with, the storage path looks like this:
- the outer PBIX file is a ZIP archive
- the DataModel member is stored as a raw ZIP entry
- that entry contains an XPress9-compressed Analysis Services backup stream
- inside the backup is a VertiPaq filesystem layout
- individual columns are reconstructed from metadata, dictionaries, hash indexes, and compressed segment payloads
That is why “just unzip the PBIX” is only the beginning of the story. It gets you to the doorstep, not into the data.
The articles in this batch build directly on the research trail I described in Lessons Learned from Unpacking VertiPaq: A Developer’s Journey, but with the emphasis shifted from the story of discovery to the structures that are now documented in code and notes.
The Seven Articles in This Launch
This batch is intentionally storage-focused:
- What Is a PBIX File? explains the package and the terminology.
- Inside the PBIX ZIP Archive maps the container layer and the route to DataModel.
- The DataModel: Power BI’s Embedded Analysis Services Engine follows the XPress9 and ABF layers into the VertiPaq workspace.
- Inside metadata.sqlitedb: Tables, Columns, Measures & Relationships shows how the semantic model is exposed through the embedded SQLite database.
- VertiPaq Dictionaries and Hash Indexes explains how encoded IDs become readable values.
- Reconstructing Column Data from .idf and .idfmeta covers the compressed column segments themselves.
- Parsing PBIX Files with Python (pbixray) ties the pieces together into a practical extraction workflow.
Why Reverse Engineer PBIX at All?
There are at least four good reasons:
- you want metadata extraction without launching Desktop
- you want table reconstruction in automated workflows
- you want to inspect model size, cardinality, and storage shape directly from files
- you want a precise mental model of how VertiPaq persists imported data
That last reason has been a recurring theme in my work for a while. Years ago I wrote about Power BI limits from the outside in, asking how much data could fit in Power BI Desktop. These articles approach the same world from the inside out.
From Research Notes to Working Code
The articles here are grounded in three complementary sources:
- published Microsoft specifications where they exist
- reverse-engineered Kaitai Struct schemas and storage notes
- the implementation work in pbixray
That combination matters because no single source is sufficient on its own. Specs rarely tell the whole implementation story. Reverse-engineering notes need validation. Library code needs a conceptual model to remain maintainable. When all three line up, the format becomes much easier to reason about.
Related Articles
- Read Inside the PBIX ZIP Archive for the package layer.
- Read The DataModel: Power BI’s Embedded Analysis Services Engine for the inner storage container.
- Read Parsing PBIX Files with Python (pbixray) if your goal is extraction rather than format study.