COVID-19 Risk Profiles
Risk profiling from open COVID-19 data using K-Means and Fuzzy C-Means. Advanced Data Analytics, ESCOM-IPN.
Summary
The Challenge
The problem was not just “analyzing COVID data”; it was reducing uncertainty in triage and resource prioritization when ICU beds, ventilators, and capacity are scarce. Stakeholders: public health authorities, academic evaluators, and, prospectively, clinical teams needing interpretable risk profiles.
The Solution
We combined K-Means and Fuzzy C-Means on stratified samples from 30M+ SSA records: crisp assignments for triage plus partial memberships for borderline cases, with a pipeline running from raw data to the LaTeX report and the Next.js portfolio.
The Impact
9 profiles (K-Means) and 2 risk groups (FCM), characterized by age, comorbidities, hospitalization, and severity. We identified clusters such as young patients with high hospitalization rates and low comorbidity (Clusters 4 and 6), useful for health policy, alongside a reproducible pipeline and documented limitations.
The Real Pain Point
"What really hurt was not having multivariate risk profiles: isolated factors were known (age, diabetes, etc.), but not how they combine in practice."
01 // Hard Constraints
What shaped the pipeline.
Academic deadline, infra limits, and data quality defined scope.
Time & Scope
Fixed delivery date (9 Jan 2026). Scope limited to clustering, the report, and the portfolio; no operational deployment. A three-person team with a fixed task split (notebooks, document, portfolio) further constrained scope.
Infrastructure
GitHub's 100 MB file limit means raw data and .pkl artifacts are not versioned; anyone cloning the repo must run descargar_datos.py and the notebooks. The README recommends 16 GB+ RAM for the sample.
No Real End Users
No hospital deployment. “Users” are evaluators and portfolio readers.
Costs
No cloud budget. Everything runs locally: Conda, Jupyter, downloaded data.
Data Legacy
Official data with reporting bias, missing variables (vaccination, variants, treatments), temporal heterogeneity 2020–2024, and limited geographic granularity.
02 // System & Pipeline
From sources to deliverables.
Notebooks 00–02 take the raw data through stratified sampling, cleaning, derived variables, and scaling. Notebooks 03–05 run clustering (K-Means and FCM), evaluation (Silhouette, Davies-Bouldin, ARI, NMI), and comparison. Artifacts (.pkl files and figures) feed the LaTeX report and the Next.js portfolio. The portfolio is a public presentation layer with static, exported data; it has no live connection to the .pkl artifacts.
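The evaluation step can be sketched with scikit-learn, whose metric names match those used in the notebooks (the toy blobs, seeds, and two-solution comparison below are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             adjusted_rand_score, normalized_mutual_info_score)

rng = np.random.default_rng(0)
# Toy stand-in for the scaled feature matrix (the real pipeline uses
# 17 standardized variables from the stratified SSA sample).
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])

km_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_b = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Internal validity: cohesion/separation of a single solution.
sil = silhouette_score(X, km_a.labels_)
dbi = davies_bouldin_score(X, km_a.labels_)
# External agreement: how well two labelings match, up to permutation.
ari = adjusted_rand_score(km_a.labels_, km_b.labels_)
nmi = normalized_mutual_info_score(km_a.labels_, km_b.labels_)
```

In the project, ARI/NMI compare K-Means against FCM labels rather than two K-Means seeds; the call pattern is the same.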
03 // Key Decisions
The heart of the design.
K-Means and FCM (not just one)
We needed both clear-cut classification for triage and identification of borderline cases: K-Means for definitive assignment and efficiency on millions of records, FCM for mixed profiles and prioritization via partial memberships.
Trade-off
More clusters give more detail but are harder to communicate; we settled on k=9 for K-Means and c=2 for FCM.
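The "partial memberships for borderline cases" idea can be sketched with a minimal NumPy Fuzzy C-Means (this is not the project's implementation; `fuzzy_cmeans`, the fuzzifier m=2.0, and the toy 2-D blobs are assumptions for illustration):

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal FCM sketch: returns centers and an (n, c) membership matrix."""
    rng = np.random.default_rng(seed)
    u = rng.dirichlet(np.ones(c), size=len(X))        # random soft start
    for _ in range(iters):
        w = u ** m                                    # fuzzified weights
        centers = (w.T @ X) / w.sum(axis=0)[:, None]  # weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)      # rows sum to 1
    return centers, u

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
               rng.normal((4, 4), 0.5, (50, 2)),
               [[2.0, 2.0]]])                         # one borderline point
centers, u = fuzzy_cmeans(X, c=2)
borderline = u[-1]   # near 50/50 split: exactly the case FCM surfaces
```

A crisp K-Means would force the midpoint into one cluster; the membership row makes its ambiguity explicit, which is what motivated running both algorithms.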
Stratified sampling
Given the size (30M+ records) and memory/time limits, we used stratified sampling to preserve representativeness without collapsing the pipeline.
Trade-off
Clustering the full dataset was not feasible; sample size and strata were documented.
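A pandas sketch of the stratified draw; EDAD and FALLECIDO follow the SSA column naming, but the strata, bins, and 10% fraction here are illustrative assumptions, not the documented design:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the raw SSA table (the real one has 30M+ rows).
n = 10_000
df = pd.DataFrame({
    "EDAD": rng.integers(0, 100, n),
    "FALLECIDO": rng.choice([0, 1], n, p=[0.9, 0.1]),
})
df["grupo_edad"] = pd.cut(df["EDAD"], bins=[0, 18, 40, 60, 100],
                          include_lowest=True)

# Sampling the same fraction inside every (age group x outcome) stratum
# keeps rare strata represented instead of letting them vanish by chance.
sample = (df.groupby(["grupo_edad", "FALLECIDO"], observed=True,
                     group_keys=False)
            .sample(frac=0.10, random_state=0))
```

Because every stratum contributes proportionally, marginal rates such as the FALLECIDO share survive the downsampling almost exactly.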
PCA only for visualization
We trained clustering on 17 standardized variables and used PCA only to reduce dimension in plots, not as clustering space.
Trade-off
Keeps interpretability in original feature space; visuals in 2D.
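The decision is easy to get wrong in code, so here is a sketch of the intended order of operations (scikit-learn assumed; the random matrix stands in for the real 17 variables):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 17))          # stand-in for the 17 variables

# Cluster in the full standardized space, never on a PCA projection.
X = StandardScaler().fit_transform(X_raw)
labels = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(X)

# PCA is fit afterwards, only to place the SAME points in 2D for plotting;
# the labels come from the 17-dimensional space, not from the projection.
XY = PCA(n_components=2).fit_transform(X)
```

Swapping the two steps (fit PCA, then cluster the 2-D points) would silently change which space defines the profiles, losing the per-variable interpretability the design aims for.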
Portfolio in Next.js
Results had to reach non-technical audiences and faculty, so we added a web layer (Next.js, React, Recharts) alongside the LaTeX report: narrative, a gallery, and interactive charts.
Trade-off
Static exports; no live connection to models.
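The "static exports" hand-off can be sketched as a one-shot JSON dump on the Python side (the `resumen` table, its column names, and the clusters.json filename are hypothetical; in the project the numbers would come from the notebook .pkl artifacts):

```python
import json
import pandas as pd

# Hypothetical cluster summary; values are made up for illustration.
resumen = pd.DataFrame({
    "cluster": [0, 1],
    "edad_media": [34.2, 67.8],
    "pct_hospitalizado": [0.12, 0.55],
})

# The portfolio never reads the .pkl directly: tables are exported once as
# static JSON that the Next.js/Recharts layer imports at build time.
payload = resumen.to_dict(orient="records")
with open("clusters.json", "w") as f:
    json.dump(payload, f, indent=2)
```

This keeps the web layer dependency-free at runtime, at the cost that charts go stale until the export is rerun.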
04 // What Went Wrong
Documented limitations.
Issues we had to acknowledge before closing the project.
100% lethality in all clusters (Data Quality)
All clusters showed 100% lethality, which forced us to document the need to validate the dataset and review the coding of the FALLECIDO variable. It was not fixed before the deadline and is recorded as a known limitation.
Outcome variable quality (Validation)
We trusted FALLECIDO to characterize clusters without cross-checking it early against the data dictionary or accounting for possible reporting bias, which weakened the mortality interpretation.
Dual source of figures (Process)
Visualizations were generated in notebooks for the LaTeX report and then reused in the portfolio without a single source of truth, so the portfolio gallery could drift from the report figures.
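The lethality anomaly could have been caught with a one-line sanity check before interpretation; a minimal sketch with pandas (FALLECIDO is the real column name, but the frame and cluster labels below are a toy stand-in):

```python
import pandas as pd

# Toy stand-in for the clustered sample, reproducing the symptom we saw:
# a constant outcome in every cluster.
df = pd.DataFrame({
    "cluster":   [0, 0, 0, 1, 1, 1],
    "FALLECIDO": [1, 1, 1, 1, 1, 1],
})

letalidad = df.groupby("cluster")["FALLECIDO"].mean()
# An outcome that never varies across clusters is a red flag: check the
# upstream filter and the data dictionary before reading it as mortality.
sospechoso = bool((letalidad == 1.0).all())
```

Run right after sampling, a check like this would have flagged the FALLECIDO issue before any cluster was characterized.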