Data platform · Environmental IoT
India Air Quality Intelligence
Cloud ETL pipeline consolidating 500+ IoT sensor streams into policy-ready air quality indicators across India.
Summary
The Challenge
Air quality readings arrived fragmented across hundreds of sensors — unusable for public-health or policy decisions without normalization and scale.
The Solution
An end-to-end pipeline on Azure Databricks + PySpark with a PostgreSQL serving layer and Power BI operational views plus a public Next.js dashboard.
The Impact
2M+ daily records normalized into consistent AQI and health indicators — from raw IoT noise to indicators stakeholders can monitor.
“Policy teams don't need more raw readings — they need trusted indicators updated on a rhythm they can act on.”
Why ETL mattered
Constraints
What shaped the platform.
Scale, sensor heterogeneity, and cloud cost defined architecture.
Sensor heterogeneity
500+ devices reported inconsistent schemas and units — normalization rules had to run before any aggregate was trustworthy.
Volume at ingest
Millions of daily rows required distributed processing (PySpark) instead of single-node pandas jobs.
Serving latency
Dashboards needed pre-computed aggregates in PostgreSQL — analytical queries on raw bronze tables were too slow.
Azure cost envelope
Pipeline designed for batch windows and cluster autoscaling to stay within student/project budget tiers.
Public presentation
A separate Next.js layer translated engineering outputs into a portfolio-grade public view.
Platform
From sensor streams to indicators.
Ingestion from IoT feeds into Databricks, PySpark transforms for cleaning and AQI computation, load into PostgreSQL, and consumption through Power BI and a public dashboard at aqi-india.dvillagrans.dev.

Key decisions
Pipeline trade-offs.
Databricks as compute hub
Chose Azure Databricks for PySpark jobs — unified notebooks, scheduling, and scale without managing Spark clusters manually.
Trade-off
Vendor lock-in and cluster startup latency on cold runs.
Bronze → Silver → Gold
Layered tables so raw, cleaned, and aggregated data stayed separated — debugging bad readings didn't corrupt downstream KPIs.
Trade-off
More storage and orchestration steps; clearer lineage.
PostgreSQL for dashboards
Materialized aggregates in PostgreSQL fed both Power BI and the public site — fast reads without hitting Spark for every chart.
Trade-off
Sync jobs between lake and RDBMS added operational steps.
Dual consumption
Power BI for analyst-style exploration and Next.js for public storytelling — same indicators, different audiences.
Trade-off
Two surfaces to keep in sync when metrics definitions change.
Lessons learned
What broke at scale.
IoT pipelines fail quietly until aggregation exposes bad assumptions.
Timezone and unit drift
Sensors crossed IST boundaries with mixed units. Added explicit normalization tables and validation notebooks.
Data qualityCold cluster costs
Ad-hoc reruns on large clusters burned credits. Scheduled jobs with right-sized clusters fixed spend.
CostMetric definition drift
AQI formulas differed between Power BI and the public site once. Documented single metric spec shared by both.
Governance