Data platform · Environmental IoT

India Air Quality Intelligence

Cloud ETL pipeline consolidating 500+ IoT sensor streams into policy-ready air quality indicators across India.

0M+Daily records

0+IoT sensors

AzureCloud platform

LivePublic dashboard

The Challenge

Air quality readings arrived fragmented across hundreds of sensors — unusable for public-health or policy decisions without normalization and scale.

The Solution

An end-to-end pipeline on Azure Databricks + PySpark with a PostgreSQL serving layer and Power BI operational views plus a public Next.js dashboard.

The Impact

2M+ daily records normalized into consistent AQI and health indicators — from raw IoT noise to indicators stakeholders can monitor.

RoleData engineering · Analytics

Timeline2024

StackAzure, PySpark, PostgreSQL

DeliverablesETL, live dashboard, repo

“Policy teams don't need more raw readings — they need trusted indicators updated on a rhythm they can act on.”

Why ETL mattered

Constraints

What shaped the platform.

Scale, sensor heterogeneity, and cloud cost defined architecture.

Sensor heterogeneity

500+ devices reported inconsistent schemas and units — normalization rules had to run before any aggregate was trustworthy.

Volume at ingest

Millions of daily rows required distributed processing (PySpark) instead of single-node pandas jobs.

Serving latency

Dashboards needed pre-computed aggregates in PostgreSQL — analytical queries on raw bronze tables were too slow.

Azure cost envelope

Pipeline designed for batch windows and cluster autoscaling to stay within student/project budget tiers.

Public presentation

A separate Next.js layer translated engineering outputs into a portfolio-grade public view.

Platform

From sensor streams to indicators.

Ingestion from IoT feeds into Databricks, PySpark transforms for cleaning and AQI computation, load into PostgreSQL, and consumption through Power BI and a public dashboard at aqi-india.dvillagrans.dev.

India air quality intelligence dashboard walkthrough

Key decisions

Pipeline trade-offs.

Ingest

Databricks as compute hub

Chose Azure Databricks for PySpark jobs — unified notebooks, scheduling, and scale without managing Spark clusters manually.

Trade-off

Vendor lock-in and cluster startup latency on cold runs.

Model

Bronze → Silver → Gold

Layered tables so raw, cleaned, and aggregated data stayed separated — debugging bad readings didn't corrupt downstream KPIs.

Trade-off

More storage and orchestration steps; clearer lineage.

Serving

PostgreSQL for dashboards

Materialized aggregates in PostgreSQL fed both Power BI and the public site — fast reads without hitting Spark for every chart.

Trade-off

Sync jobs between lake and RDBMS added operational steps.

Delivery

Dual consumption

Power BI for analyst-style exploration and Next.js for public storytelling — same indicators, different audiences.

Trade-off

Two surfaces to keep in sync when metrics definitions change.

Lessons learned

What broke at scale.

IoT pipelines fail quietly until aggregation exposes bad assumptions.

Timezone and unit drift

Sensors crossed IST boundaries with mixed units. Added explicit normalization tables and validation notebooks.

Data quality

Cold cluster costs

Ad-hoc reruns on large clusters burned credits. Scheduled jobs with right-sized clusters fixed spend.

Cost

Metric definition drift

AQI formulas differed between Power BI and the public site once. Documented single metric spec shared by both.

Governance

Back to projects

Data platform · Environmental IoT

India Air Quality Intelligence

Cloud ETL pipeline consolidating 500+ IoT sensor streams into policy-ready air quality indicators across India.

Live dashboard Repository

0M+Daily records

0+IoT sensors

AzureCloud platform

LivePublic dashboard

The Challenge

Air quality readings arrived fragmented across hundreds of sensors — unusable for public-health or policy decisions without normalization and scale.

The Solution

An end-to-end pipeline on Azure Databricks + PySpark with a PostgreSQL serving layer and Power BI operational views plus a public Next.js dashboard.

The Impact

2M+ daily records normalized into consistent AQI and health indicators — from raw IoT noise to indicators stakeholders can monitor.

RoleData engineering · Analytics

Timeline2024

StackAzure, PySpark, PostgreSQL

DeliverablesETL, live dashboard, repo

“Policy teams don't need more raw readings — they need trusted indicators updated on a rhythm they can act on.”

Why ETL mattered

Constraints

What shaped the platform.

Scale, sensor heterogeneity, and cloud cost defined architecture.

Sensor heterogeneity

500+ devices reported inconsistent schemas and units — normalization rules had to run before any aggregate was trustworthy.

Volume at ingest

Millions of daily rows required distributed processing (PySpark) instead of single-node pandas jobs.

Serving latency

Dashboards needed pre-computed aggregates in PostgreSQL — analytical queries on raw bronze tables were too slow.

Azure cost envelope

Pipeline designed for batch windows and cluster autoscaling to stay within student/project budget tiers.

Public presentation

A separate Next.js layer translated engineering outputs into a portfolio-grade public view.

Platform

From sensor streams to indicators.

Key decisions

Pipeline trade-offs.

Ingest

Databricks as compute hub

Chose Azure Databricks for PySpark jobs — unified notebooks, scheduling, and scale without managing Spark clusters manually.

Trade-off

Vendor lock-in and cluster startup latency on cold runs.

Model

Bronze → Silver → Gold

Layered tables so raw, cleaned, and aggregated data stayed separated — debugging bad readings didn't corrupt downstream KPIs.

Trade-off

More storage and orchestration steps; clearer lineage.

Serving

PostgreSQL for dashboards

Materialized aggregates in PostgreSQL fed both Power BI and the public site — fast reads without hitting Spark for every chart.

Trade-off

Sync jobs between lake and RDBMS added operational steps.

Delivery

Dual consumption

Power BI for analyst-style exploration and Next.js for public storytelling — same indicators, different audiences.

Trade-off

Two surfaces to keep in sync when metrics definitions change.

Lessons learned

What broke at scale.

IoT pipelines fail quietly until aggregation exposes bad assumptions.

Timezone and unit drift

Sensors crossed IST boundaries with mixed units. Added explicit normalization tables and validation notebooks.

Data quality

Cold cluster costs

Ad-hoc reruns on large clusters burned credits. Scheduled jobs with right-sized clusters fixed spend.

Cost

Metric definition drift

AQI formulas differed between Power BI and the public site once. Documented single metric spec shared by both.

Governance

Summary

The Challenge

The Solution

The Impact

What shaped the platform.

Sensor heterogeneity

Volume at ingest

Serving latency

Azure cost envelope

Public presentation

From sensor streams to indicators.

Pipeline trade-offs.

Databricks as compute hub

Trade-off

Bronze → Silver → Gold

Trade-off

PostgreSQL for dashboards

Trade-off

Dual consumption

Trade-off

What broke at scale.

Timezone and unit drift

Cold cluster costs

Metric definition drift

Summary

The Challenge

The Solution

The Impact

What shaped the platform.

Sensor heterogeneity

Volume at ingest

Serving latency

Azure cost envelope

Public presentation

From sensor streams to indicators.

Pipeline trade-offs.

Databricks as compute hub

Trade-off

Bronze → Silver → Gold

Trade-off

PostgreSQL for dashboards

Trade-off

Dual consumption

Trade-off

What broke at scale.

Timezone and unit drift

Cold cluster costs

Metric definition drift