ποΈ Architecture Star Schema
Dimensions
dim_customer
customer_sk (PK) β customer_id β name β email β created_at
ββββββββββββββββββΌββββββββββββββΌβββββββΌββββββββΌβββββββββββ
[hash] β 1 β Aliceβ ... β 2024-01-01
dim_product
product_sk β product_id β product_name β brand_sk β category_sk β price
ββββββββββββΌβββββββββββββΌβββββββββββββββΌβββββββββββΌββββββββββββββΌββββββ
[hash] β 1 β iPhone 15 β [hash] β [hash] β 999.99
dim_date
date_sk β date β year β month β day β dow β quarter β week_of_year
βββββββββΌβββββββββββββΌβββββββΌββββββββΌββββββΌββββββΌββββββββββΌββββββββββββββ
[hash] β 2024-06-01 β 2024 β 6 β 1 β Sat β 2 β 23
Table de Faits
fact_sales
order_id β date_sk β customer_sk β product_sk β quantity β unit_price β subtotal β year β month
ββββββββ βΌββββββββββΌββββββββββββββΌβββββββββββββΌβββββββββββΌβββββββββββββΌβββββββββββΌβββββββΌββββββ
1 β [hash] β [hash] β [hash] β 2 β 999.99 β 1999.98 β 2024 β 6
Fluxes de DonnΓ©es
βββββββββββββββββββββββββββββββββββββββββββ
β SOURCES (Bronze Layer) β
βββββββββββββββββββββββββββββββββββββββββββ€
β customers.csv β dim_customer β
β brands.csv β dim_brand β
β categories.csv β dim_category β
β products.csv β dim_product (+ FKs) β
β orders.csv β fact_sales β
β order_items.csv β fact_sales (mesures) β
βββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββββββββββββββ
β TRANSFORMATIONS (Silver/Gold Layers) β
βββββββββββββββββββββββββββββββββββββββββββ€
β 1. Load & validate schemas β
β 2. Generate surrogate keys (xxhash64) β
β 3. Create dimensions β
β 4. Build fact table with joins β
β 5. Validate quality (PK, FK, nulls) β
β 6. Partition & compress (Parquet) β
βββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββββββββββββββββ
β OUTPUTS (Gold Layer - Parquet) β
βββββββββββββββββββββββββββββββββββββββββββ€
β dim_customer.parquet β
β dim_brand.parquet β
β dim_category.parquet β
β dim_product.parquet β
β dim_date.parquet β
β fact_sales.parquet (year/month) β
βββββββββββββββββββββββββββββββββββββββββββ
Joins
# Fact table construction
fact_sales = (
order_items # Base: 100 rows
.join(products, # 100 Γ 20 lookups
on="product_id")
.join(orders, # Preserve order context
on="order_id")
.join(customers, # Add customer dimension
on="customer_id")
.join(dim_date, # Add date dimension
on="date_sk")
)Partitionnement
fact_sales/
βββ year=2024/
β βββ month=6/
β β βββ part-00000.parquet
β β βββ part-00001.parquet
β β βββ _SUCCESS
β βββ month=7/
β β βββ ...
β βββ month=8/
β β βββ ...
β βββ month=11/
β βββ ...
BΓ©nΓ©fices:
- β RequΓͺtes par date ultra-rapides
- β Nettoyage incrΓ©mental (drop partitions)
- β ParrallΓ©lisation par partition
Voir aussi: DATA_SCHEMA.md pour les dΓ©tails