Case Study

Member Insights Lakehouse

A production-minded analytics application that transforms privacy-safe wearable, app, lifecycle, and experimentation events into trusted member insights, platform-health visibility, governed metric definitions, and natural-language analysis over curated tables.

DuckDBStreamlitPythonSQLPandasAltair
Live Demo ↗

Overview

Member Insights Lakehouse is built around the kind of internal data product a member-based health, fitness, or performance company would need: reliable event modeling, cohort analytics, experiment readouts, platform-health visibility, governed metric definitions, and natural-language analysis over trusted tables.

The project uses generated privacy-safe data. It does not use or imply access to private product, company, or member data. The goal is to show the architecture and product thinking: raw wearable and app signals become durable analytical tables that can support dashboards, experimentation, ML features, and governed AI workflows.

Live Product

The live app leads with operational product analytics, then lets stakeholders move into experimentation, platform health, metric definitions, and governed natural-language analysis without leaving the same product surface.

Member Insights performance signals dashboard with recovery, sleep, strain, low recovery risk, and trend chart
Performance Signals view: member-day facts roll up into recovery, sleep, strain, engagement, and low-recovery risk trends.
Member Insights assistant response with retention chart and analysis trace metadata
Insights Assistant view: a retention question routes to an approved function, returns a visual answer, and exposes rows considered, latency, token estimate, and cost metadata.

Architecture

This is the README architecture adapted directly into the portfolio case study. It shows the local implementation and the production migration path in one diagram: privacy-safe event sources, modeled analytical tables, product workflows, and the cloud-scale equivalents.

flowchart LR
    subgraph Sources["Privacy-Safe Event Sources"]
        A["Wearable telemetry\nheart rate, sleep, strain, recovery"]
        B["App engagement\nsessions, workouts, check-ins"]
        C["Member lifecycle\nplan, cohort, goal, status"]
        D["Algorithm release assignments\nbaseline vs release candidate"]
    end

    subgraph Warehouse["Modeled Analytical Layer"]
        E["stg_member_events\nimmutable event table"]
        F["dim_members\nmember attributes"]
        G["dim_experiment_assignments\nrelease groups"]
        H["fct_member_day\none row per member per day"]
        I["agg_cohort_daily\nperformance signal mart"]
        J["agg_member_lifecycle\ngrowth and retention mart"]
        K["agg_experiment_summary\nrelease outcome mart"]
        L["metric_dictionary\ngoverned definitions"]
        M["pipeline_run_log + model_inventory\nplatform observability"]
    end

    subgraph Product["Analytics Product Surface"]
        N["Growth & Retention"]
        O["Performance Signals"]
        P["Experimentation"]
        Q["Data Platform Health"]
        R["Metric Dictionary"]
        S["Insights Assistant\ngoverned visual analysis"]
    end

    subgraph Production["Production Direction"]
        T["Kafka/Kinesis"]
        U["Spark/PySpark"]
        V["Snowflake + dbt"]
        W["AWS observability"]
        X["Approved AI tooling"]
    end

    A --> E
    B --> E
    C --> F
    D --> G
    E --> H
    F --> H
    G --> H
    H --> I
    F --> J
    H --> K
    I --> O
    J --> N
    K --> P
    M --> Q
    L --> R
    I --> S
    J --> S
    K --> S
    M --> S
    L --> S
    A -.production intake.-> T
    H -.batch and streaming scale.-> U
    H -.warehouse serving.-> V
    Q -.monitoring and alerts.-> W
    S -.approved AI layer.-> X

Data Model

The central design choice is the member-day grain. It bridges immutable wearable/app events, member lifecycle attributes, experiment assignments, dashboard aggregates, and ML-ready feature tables. Cohort, lifecycle, and experiment marts then serve specific stakeholder questions without re-aggregating raw event streams inside the UI.

erDiagram
    DIM_MEMBERS {
        string member_id PK
        string cohort_month
        string plan_type
        string goal
        string status
        string acquisition_channel
    }
    STG_MEMBER_EVENTS {
        string event_id PK
        string member_id FK
        timestamp event_ts
        string event_type
        number recovery_score
        number sleep_hours
        number strain_score
        number heart_rate
    }
    DIM_EXPERIMENT_ASSIGNMENTS {
        string member_id FK
        string experiment_name
        string variant
        date assigned_date
    }
    FCT_MEMBER_DAY {
        string member_id FK
        date metric_date
        number recovery_score
        number sleep_hours
        number strain_score
        number engagement_score
        boolean low_recovery_flag
    }
    AGG_COHORT_DAILY {
        string cohort_month
        date metric_date
        number active_members
        number avg_recovery
        number avg_sleep
        number avg_strain
    }
    AGG_MEMBER_LIFECYCLE {
        string segment
        number new_members
        number active_members
        number retention_30d
        number subscription_continuity
    }
    AGG_EXPERIMENT_SUMMARY {
        string experiment_name
        string variant
        number recovery_lift
        number sleep_lift
        number engagement_lift
        number low_recovery_guardrail
    }
    DIM_MEMBERS ||--o{ STG_MEMBER_EVENTS : "generates"
    DIM_MEMBERS ||--o{ FCT_MEMBER_DAY : "summarized_as"
    STG_MEMBER_EVENTS ||--o{ FCT_MEMBER_DAY : "modeled_into"
    DIM_EXPERIMENT_ASSIGNMENTS ||--o{ FCT_MEMBER_DAY : "annotates"
    FCT_MEMBER_DAY ||--o{ AGG_COHORT_DAILY : "aggregates"
    DIM_MEMBERS ||--o{ AGG_MEMBER_LIFECYCLE : "segments"
    FCT_MEMBER_DAY ||--o{ AGG_EXPERIMENT_SUMMARY : "rolls_up"

Delivery Progression

The project was built in progressive layers: first the analytical contract and quality gates, then the stakeholder workflows, then the governed assistant and public-ready presentation. That framing is more useful than a chronological task log because it shows how the prototype matured toward a production-grade data product.

timeline
    title Member Insights Lakehouse Development Lifecycle
    2026-05-24 : Defined the member-insights data product
               : Identified growth, retention, recovery, sleep, strain, engagement, and experiment questions
               : Generated privacy-safe member, wearable, app, lifecycle, and release-assignment data
               : Built the DuckDB warehouse, SQL model layer, quality checks, and first Streamlit surface
    2026-05-25 : Matured the app into a governed analytics surface
               : Added Growth, Performance, Experimentation, Platform Health, Dictionary, and Assistant workflows
               : Replaced open-ended LLM dependency with a governed function router and visual responses
               : Refactored reusable data access, metric logic, and UI patterns into dedicated modules
    2026-05-27 : Prepared the project for public portfolio review
               : Audited public wording, repository hygiene, theme configuration, and live-app presentation
               : Refined architecture, production mapping, screenshots, and case-study narrative

Key Engineering Decisions

Decision #1 — Table Grain

Why model member-day facts before building aggregates?

Member-day is the useful bridge between raw events and downstream analytics. It can support recovery, sleep, strain, engagement, risk flags, cohort trends, experimentation, and ML feature work without forcing each consumer to understand raw event semantics. That extra modeling step creates a stable contract for the rest of the product.

Decision #2 — Platform Health

Why make reliability visible inside the dashboard?

Metric trust depends on freshness, model inventory, quality gates, and table grain. Hiding those checks in terminal output would make the app look more polished but less production-minded. The Data Platform Health workflow exposes run status, modeled table counts, freshness, and quality pass rate alongside the business metrics that depend on them.

Decision #3 — Experiment Language

Why describe release groups as baseline and release candidate?

Control and treatment are statistically precise, but baseline and release candidate are easier for product and engineering stakeholders to scan during an algorithm-release review. The model still captures experiment assignments and lift metrics, while the UI uses language that matches how teams discuss rollout decisions.

Decision #4 — Governed Assistant

Why use function routing instead of open-ended SQL generation?

The assistant is designed as a governed visual analyst. It routes natural-language prompts to curated analytical functions, answers from modeled aggregate tables and metric definitions, attaches charts when useful, and shows trace metadata including selected tool, estimated tokens, rows considered, latency, API calls, and API cost. This avoids hallucinated definitions and keeps AI behavior inside approved analytical boundaries.

Outcome & Reflection

This project demonstrates how a compact app can still carry production-grade data habits: explicit grains, quality checks, observability, metric governance, experiment support, and natural-language access over trusted data products. The strongest walkthrough path is to show growth and retention first, then experimentation, then platform health, then the assistant trace.

In production, generated CSVs would become Kafka or Kinesis event streams, DuckDB would become Snowflake, the SQL build would become a dbt DAG with CI, Python checks would become dbt tests or Great Expectations, and the governed function router would become an approved AI workflow with access controls, evaluation traces, and audit logs.