Data · Infrastructure · Analytics

Ravi Rajpurohit

About

Building data systems
that actually get used.

I've spent the last several years building pipelines that process billions of events per month, cloud data platforms that cut executive reporting from days to under an hour, and data visualizations that make complex datasets worth looking at.

Before my MS in Computer Science at UT Arlington, I built the data backbone for a wearable health platform at KaHa Technologies — Kafka ingestion, real-time telemetry, 10M+ users. I'm most interested in the full picture: from how data gets ingested to whether the person reading the dashboard actually trusts what it shows.

AWS · dbt · Spark · Kafka · Airflow · Python · SQL · Snowflake · D3.js · NoSQL · CI/CD

⚡ Outside of work: 🏸 racket sports, 🎸 music, 🧑‍🍳 cooking — and yes, I once worked as a chef for my university.

2B+ events/month processed

10M+ users served

5M+ records/day through current pipelines

Capabilities

What I Build

End-to-End Data Pipelines

Kafka ingestion at 2B+ events/month, Python and PySpark transformations, dbt modeling, dimensional schemas — the whole chain from raw source to analytics-ready table.

Cloud Data Platforms

AWS (Glue, Athena, Spark, S3), Snowflake, Databricks, DuckDB. I care about what each tool actually costs and whether engineers will be able to maintain it six months later.

Analytics & BI

Executive dashboards, D3.js data stories, self-serve reporting. I think about who is going to open this dashboard at 8am and what they actually need to see — not just what the data model can technically produce.

ML-Enabling Infrastructure

Feature pipelines and data layers for real-time ML inference — health wearable telemetry at 10M+ user scale, high-frequency biosensor research at 200Hz. The ML model is only as good as the data it gets.

Experience

Where I've Worked

Data Infrastructure Engineer

State of Michigan — Ottawa Area ISD

  • Built an AWS data lake consolidating 15+ siloed data sources — cut executive report generation from 3 days to under 1 hour.
  • Designed ELT pipelines in Python and PySpark processing 5M+ daily records; provisioned the full stack as a serverless, event-driven architecture using AWS CloudFormation.
  • Built AWS Step Functions state machines with explicit failure branches, per-run DynamoDB audit logging, and SNS alerting — ensuring every pipeline failure surfaces immediately rather than silently propagating to downstream dashboards.
  • Deployed semantic-layer dashboards across Tableau, Power BI, and QuickSight — abstracting dimensional model complexity so business teams could self-serve answers without analyst intervention.
  • Accelerated team productivity by 20% by integrating AI coding assistants into documentation and development workflows.

Software Data Engineer — Wearables & MLOps

KaHa Technologies

  • Built a real-time telemetry pipeline on Apache Kafka processing 2B+ monthly events from 10M+ wearable users with end-to-end lag under 5 seconds.
  • Designed a multi-store data layer (DynamoDB, S3, Redshift) matched to each access pattern — from low-latency device reads to batch ML training and warehouse analytics.
  • Built observability and data quality monitoring pipelines with Prometheus, catching bad data upstream of data science workflows and accelerating model iteration speed by 25%.
  • Built A/B testing infrastructure and self-serve analytics that enabled product teams to independently design, run, and evaluate experiments — removing the data team as a bottleneck on product iteration.
  • Instrumented mobile apps with Firebase analytics and built a BigQuery pipeline that auto-segmented performance by device make, model, and OS version — replacing ad-hoc QA queries with an automated daily report.

Data Engineer Intern — Cloud & APIs

Nutanix

  • Integrated the internal analytics platform with Nutanix's IAM via JWT, eliminating duplicate credentials and increasing daily adoption; delivered the Go API on Kubernetes through Jenkins CI/CD.
  • Served as the bridge between engineering and analytics to define source-to-target data contracts; delivered role-based dashboards for Support and Sales from 500GB+ of daily product telemetry.
  • Awarded 2nd place in the company-wide hackathon for prototyping a product recommendation module powered by customer usage signals.

Data Engineer — Research Applications

University of Texas at Arlington

  • Built a high-frequency wearable sensor data pipeline — ingesting raw device readings, applying noise filtering, and extracting structured features — delivering ML-ready datasets for multiple concurrent research experiments.
  • Orchestrated ETL workflows with Apache Airflow, reducing processing latency by 35%.
  • Implemented automated data quality tests with dbt and daily job execution monitoring, catching upstream data issues before they could corrupt ML feature extraction or degrade model results.

Education

MS Computer Science

University of Texas at Arlington

Also while there

Chef · Python Tutor · Staff Manager

Projects

Featured Work

Projects with a Case Study include architecture diagrams, data models, and the key engineering decisions behind them.

Sentinel Fleet Operations

Sentry Domain Mission Analytics Platform

A mission analytics platform for an autonomous surveillance fleet, built on DuckDB, dbt, and Streamlit. Four tabs — Operations, Detection Analytics, Reliability, and Pipeline Health — serve commanders, analysts, engineers, and data teams from a single star-schema fact layer. 66 dbt tests, 9 sources, 8 mart models, and a pre-built DuckDB artifact shipped with the repo.

dbt · DuckDB · Streamlit · Python · Plotly · SQL

PlacesOps

BI Dashboard & Modern Data Stack Prototype

A BI dashboard prototype built on DuckDB, dbt-core, and Streamlit. Two tabs for two audiences: a capital expenditure view for business stakeholders, and a pipeline health monitor for data engineers — both powered by the same fact table.

dbt · DuckDB · Streamlit · Python · SQL

Uber NYC Dashboard

Multi-Page Data & AI Analytics App

Rebuilt from a Streamlit tutorial into a multi-page data and AI engineering app. Three pages — Home (KPIs and anomaly detection), Map Explorer (Pydeck geospatial layers), and AI Analyst (LLM chat with function calling) — all powered by a structured ingest → enrich → aggregate pipeline on ~1M NYC Uber records. Provider-agnostic: runs on local Ollama or Groq cloud with no code changes.

Streamlit · Python · Pandas · Plotly · Pydeck · Groq · LLM

Learner Activity Pipeline

Medallion ELT Architecture Redesign

Diagnosed four critical anti-patterns in a Matillion + Snowflake ETL pipeline and rebuilt it as a robust, idempotent Medallion ELT architecture with S3 staging, high-water mark incremental loads, and MERGE-based upserts.

Matillion · Snowflake · AWS S3 · Python · Medallion Architecture · FERPA
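
As a rough illustration of the high-water mark and upsert pattern named above (not the project's actual code), the sketch below uses Python's built-in `sqlite3` module, with SQLite's `INSERT ... ON CONFLICT` standing in for Snowflake's `MERGE`. The table name `learner_activity`, its columns, and both helper functions are hypothetical.

```python
import sqlite3


def make_target() -> sqlite3.Connection:
    """Create an in-memory stand-in for the target table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE learner_activity ("
        "activity_id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    return conn


def incremental_load(conn, source_rows):
    """Upsert only the rows newer than the target's high-water mark.

    source_rows: iterable of (activity_id, status, updated_at) tuples, where
    updated_at is an ISO-8601 string so string comparison matches time order.
    Returns the number of rows applied.
    """
    # High-water mark: the latest timestamp already loaded ('' if empty).
    (high_water_mark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM learner_activity"
    ).fetchone()

    # Keep only rows that changed after the mark.
    new_rows = [row for row in source_rows if row[2] > high_water_mark]

    # Upsert by key, so re-running with the same input changes nothing.
    conn.executemany(
        "INSERT INTO learner_activity (activity_id, status, updated_at) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT(activity_id) DO UPDATE SET "
        "status = excluded.status, updated_at = excluded.updated_at",
        new_rows,
    )
    conn.commit()
    return len(new_rows)
```

Running `incremental_load` twice with the same batch applies the rows once and then applies nothing, which is the idempotency the redesign aims for. One caveat of this simplified version: the strict `>` comparison can skip late-arriving rows that share the mark's timestamp, so a real pipeline might reload a small overlap window and let the upsert absorb the duplicates.
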

App Store Ecosystem Analytics

Interactive Data Storytelling with D3.js

500MB of raw Kaggle data, pre-aggregated down to 50KB via Python, rendered as a smooth animated bar chart race in D3.js. 13 years of App Store genre competition in one visualization. Published on Medium.

D3.js · Python · Pandas · HTML/CSS
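
To illustrate the kind of pre-aggregation described above, here is a minimal standard-library sketch (the project itself lists Pandas, so this is not its actual code). It collapses per-app rows into one count per year and genre, a compact shape that a bar chart race can consume directly. The field names `release_date` and `genre` are assumptions, not the dataset's real columns.

```python
from collections import Counter


def preaggregate(rows):
    """Collapse per-app rows into one record per (year, genre).

    rows: iterable of dicts with hypothetical keys 'release_date'
    (a 'YYYY-MM-DD' string) and 'genre'.
    """
    counts = Counter()
    for row in rows:
        year = int(row["release_date"][:4])  # keep only the year
        counts[(year, row["genre"])] += 1

    # Sorted output keeps the exported file stable between runs.
    return [
        {"year": year, "genre": genre, "apps": n}
        for (year, genre), n in sorted(counts.items())
    ]
```

The browser then only has to load the small aggregated file rather than the raw rows, which is what keeps the animation smooth.
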

ChatPilot

Multi-Provider AI Chat Application

AI chat application with support for multiple LLM providers (Gemini, OpenAI) via a Streamlit interface. Actively developed — model switching and prompt memory are on the roadmap.

Streamlit · Gemini API · OpenAI API · Python

Toolkit

My Toolkit

Tools I reach for — and know well enough to have opinions about.

Data Engineering & Orchestration

Python
Apache Spark
Apache Kafka
Apache Airflow
dbt
Go
SQL

Cloud & Data Platforms

AWS
Snowflake
Databricks
BigQuery
Amazon Redshift
Firebase
DuckDB

Databases & Storage

PostgreSQL
MongoDB
DynamoDB
Cassandra
Amazon S3
Redis

Visualization & Analytics

D3.js
Streamlit
Power BI
Tableau
Amazon QuickSight
Plotly

Infrastructure & DevOps

Docker
Kubernetes
Terraform
AWS CloudFormation
Prometheus
GitHub Actions
Jenkins

Contact

Let's build something together.

I'm open to opportunities. Reach out if you're working on something interesting.