— Capability demonstration · Reference build

Walmart sales pipeline:
10M+ rows, four sources, one dashboard.

A reference build on the same stack the Shopify Data Hub service uses. Built end-to-end with production-grade modeling, peer-reviewed improvements, and full documentation.

Dataset: Walmart retail sales (public dataset)
Rows processed: 10M+
Data sources: 4
Build time: 28 days

The brief was simple: ingest a fragmented retail sales dataset across four sources, model it for analytical use, and surface the views a business operator would actually look at. Build it the way you'd build it for a client.

The fragmentation problem

Real retail data — Walmart's or Shopify's — never arrives clean. Sales tables don't speak the same language as inventory feeds. Promotional data lives in a separate system. External factors (weather, economic indicators, holidays) sit in entirely different sources. Stitching them together is most of the work, and it's where most internal teams stall.

This build deliberately replicated that mess: four sources, inconsistent grain, conflicting timestamps, missing join keys. It is the same state every Shopify operator running Klaviyo + Shopify Analytics + Meta Ads + Google Ads is sitting in right now.

The architecture

  • Ingest layer. Raw CSV / Parquet sources landed in AWS S3, partitioned by source and date. Python loaders standardized encoding and timestamp formats before warehouse load.
  • Raw layer (Snowflake). Sources loaded as-is into separate schemas. Zero transformation. Full audit trail preserved.
  • Staging layer (dbt). One staging model per source. Column renaming, type casting, light cleanup. Every model tested for nulls, uniqueness, and referential integrity.
  • Intermediate layer (dbt). Business logic isolated here. Join conditions, fiscal calendar mapping, deduplication rules. The layer that hurts to write but saves you for years.
  • Mart layer (dbt). Fact and dimension tables ready for BI. fct_sales, dim_product, dim_store, dim_date. Documented and exposed.
  • Visualization. Plotly dashboard surfacing the views an operator would actually use: revenue trends, top-performing products, store-level performance, promotional lift, external factor correlations.
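
As a rough illustration of the ingest layer described above, the sketch below shows one way a Python loader might standardize encoding and timestamps and derive an S3 key partitioned by source and date. The function names, the list of timestamp formats, the treatment of naive timestamps as UTC, and the key layout are all assumptions made for illustration, not details of the actual build.

```python
from datetime import datetime, timezone

# Hypothetical set of timestamp formats seen across the sources (assumption).
KNOWN_FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%d %H:%M:%S",
    "%m/%d/%Y %H:%M",
    "%d-%m-%Y",
)


def decode_bytes(data: bytes) -> str:
    """Decode raw file bytes as UTF-8 (dropping a BOM), falling back to cp1252."""
    try:
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: legacy exports are Windows-1252. Bytes that are
        # undefined in cp1252 will still raise here.
        return data.decode("cp1252")


def normalize_timestamp(raw: str) -> datetime:
    """Parse a raw timestamp string into a timezone-aware UTC datetime."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Assumption: timestamps without an offset are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unrecognized timestamp format: {raw!r}")


def partition_key(source: str, ts: datetime, filename: str) -> str:
    """Build an object key partitioned by source and date (hypothetical layout)."""
    return f"raw/source={source}/date={ts:%Y-%m-%d}/{filename}"


ts = normalize_timestamp("2012-02-05T23:30:00-05:00")
print(partition_key("sales", ts, "part-0.parquet"))
```

Converting to UTC before deriving the partition date keeps a late-evening local sale from landing in a different date partition depending on which source reported it.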

The peer-review pass

The first version worked. The second version was production-grade. The difference came from a structured peer review that surfaced three categories of improvement:

  • Testing discipline. Added dbt test coverage on every model — uniqueness on primary keys, not-null on critical fields, referential integrity on every join. CI now blocks broken models from reaching the mart layer.
  • Documentation. Every model has a description, every column a definition. dbt docs generated and published. Downstream users can answer "what is this column" without asking.
  • Incremental loading. Full-refresh loads replaced with incremental strategies on the high-volume fact tables. Compute cost dropped meaningfully. Daily runs went from "slow and risky" to "boring and reliable."
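
The testing and incremental-loading items above can be sketched in plain Python to show the logic involved. This is not the build's dbt code: in dbt the same checks correspond to the built-in `unique`, `not_null`, and `relationships` tests, and incremental models filter on `is_incremental()`. The function names, the row shapes, and the choice of `sale_date` as the cursor column are assumptions for illustration.

```python
from datetime import date


def duplicate_keys(rows, key):
    """Uniqueness check: return key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def null_count(rows, column):
    """Not-null check: count rows where the column is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def orphan_keys(child_rows, child_key, parent_rows, parent_key):
    """Referential integrity check: child keys with no matching parent row."""
    parents = {row[parent_key] for row in parent_rows}
    return {row[child_key] for row in child_rows} - parents


def incremental_merge(target, batch, key, cursor):
    """Upsert only batch rows at or after the target's high-water mark.

    Using >= re-processes the boundary value so late corrections are picked
    up; the upsert by key keeps that re-processing idempotent.
    """
    high_water = max((row[cursor] for row in target), default=None)
    merged = {row[key]: row for row in target}
    for row in batch:
        if high_water is None or row[cursor] >= high_water:
            merged[row[key]] = row
    return list(merged.values())


# Made-up rows, for illustration only.
dim_store = [{"store_id": 1}, {"store_id": 2}]
fct_sales = [
    {"sale_id": "a", "store_id": 1, "sale_date": date(2012, 1, 6), "amount": 100.0},
    {"sale_id": "b", "store_id": 2, "sale_date": date(2012, 1, 13), "amount": 80.0},
]
batch = [
    {"sale_id": "b", "store_id": 2, "sale_date": date(2012, 1, 13), "amount": 85.0},
    {"sale_id": "c", "store_id": 3, "sale_date": date(2012, 1, 20), "amount": 50.0},
    {"sale_id": "z", "store_id": 1, "sale_date": date(2011, 12, 30), "amount": 10.0},
]

updated = incremental_merge(fct_sales, batch, key="sale_id", cursor="sale_date")
print(len(updated), duplicate_keys(updated, "sale_id"))
print(orphan_keys(updated, "store_id", dim_store, "store_id"))
```

In this toy run the corrected row `b` replaces the old one, row `z` is skipped because it predates the high-water mark, and the referential check flags store 3, which is the kind of failure a CI gate would stop before the mart layer.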

What the dashboard actually answers

The output isn't pretty charts. It's specific business questions answered at a glance:

  • Which products drive margin, and which bleed it after returns and markdowns?
  • Are promotional lifts real, or are they pulling forward demand that would have come anyway?
  • How does external context (seasonality, economic conditions) explain or fail to explain weekly variance?
  • Where are the stores or categories that need attention this week, not last quarter?
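
To make the promotional-lift question concrete, here is a minimal sketch of one simple way to net out pulled-forward demand: compare promo weeks against a baseline of ordinary weeks, then add back any dip in the weeks immediately after. The function name, the two-week post-promo window, the flat-average baseline, and the weekly figures are all assumptions for illustration; the build's actual method is not described in this write-up.

```python
from statistics import mean


def promo_lift(weekly_sales, promo_weeks, post_window=2):
    """Estimate gross and net promotional lift from a list of weekly sales.

    Baseline is the mean of weeks that are neither promo weeks nor inside
    the post-promo window. Raises StatisticsError if no such weeks exist.
    """
    n = len(weekly_sales)
    promo = set(promo_weeks)
    # Weeks right after a promo, where pulled-forward demand would show up.
    post = {
        w + k
        for w in promo
        for k in range(1, post_window + 1)
        if w + k < n
    } - promo
    baseline = mean(
        weekly_sales[i] for i in range(n) if i not in promo and i not in post
    )
    gross = sum(weekly_sales[i] - baseline for i in promo)
    post_change = sum(weekly_sales[i] - baseline for i in post)
    return {
        "baseline": baseline,
        "gross_lift": gross,
        "post_promo_change": post_change,
        "net_lift": gross + post_change,
    }


# Made-up weekly sales with a promotion in week index 2.
result = promo_lift([100, 100, 150, 80, 90, 100, 100], promo_weeks=[2])
print(result)
```

In this toy series the promo week runs 50 above baseline, but the two following weeks run 30 below it, so only 20 of the apparent lift is incremental. A flat average ignores trend and seasonality, so a real analysis would need a stronger baseline.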

Translate those into Shopify language: real LTV by channel, real product profitability after refunds, real attribution after platform noise, real cohort retention. Same questions. Same architecture. Same answers.

Why this maps to Shopify Data Hub

The retail dataset is a stand-in. The stack, the modeling approach, the testing discipline, and the layered architecture are identical to what's delivered in every Shopify Data Hub engagement. Source connectors swap (Shopify, Klaviyo, Meta Ads, Google Ads replace the retail sources), business logic shifts to e-commerce KPIs, but the engineering rigor is the same.

That rigor — testing, documentation, incremental loads, layered models — is what separates "a dashboard a junior built once" from "infrastructure your business can actually trust." It's the difference you're paying for.

— Disclosure

This is a reference build on a public dataset, not a paid client engagement. The Shopify Data Hub service applies the same architecture, stack, and modeling rigor to your live store data. First paid engagements onboarding now.