COVID-19 Risk Profiles
Risk profiling from open COVID-19 data using K-Means and Fuzzy C-Means. Advanced Data Analytics, ESCOM-IPN.
Summary
The Challenge
The problem was not just “analyzing COVID data”; it was reducing uncertainty in triage and resource prioritization when ICU beds, ventilators, and capacity are scarce. Stakeholders: public health authorities, academic evaluators, and, prospectively, clinical teams needing interpretable risk profiles.
The Solution
We combined K-Means and Fuzzy C-Means on stratified samples from 30M+ SSA records: crisp assignments for triage plus partial memberships for borderline cases, with a pipeline running from raw data to the LaTeX report and the Next.js portfolio.
The Impact
9 profiles (K-Means) and 2 risk groups (FCM), characterized by age, comorbidities, hospitalization, and severity. We identified clusters such as young patients with high hospitalization rates and low comorbidity (Clusters 4 and 6), useful for health policy, alongside a reproducible pipeline and documented limitations.
The Real Pain Point
"What really hurt was not having multivariate risk profiles: isolated factors were known (age, diabetes, etc.), but not how they combine in practice."
01 // Hard Constraints
What shaped the pipeline.
Academic deadline, infra limits, and data quality defined scope.
Time & Scope
Fixed delivery date (9 Jan 2026). Scope limited to clustering, the report, and the portfolio; no operational deployment. A three-person team with a fixed task split (notebooks, document, portfolio) further constrained scope.
Infrastructure
GitHub's 100 MB file limit means raw data and .pkl artifacts are not versioned; anyone cloning the repo must run descargar_datos.py and the notebooks. The README recommends 16 GB+ RAM for the sample.
No Real End Users
No hospital deployment. “Users” are evaluators and portfolio readers.
Costs
No cloud budget. Everything runs locally: Conda, Jupyter, downloaded data.
Data Legacy
Official data with reporting bias, missing variables (vaccination, variants, treatments), temporal heterogeneity 2020–2024, and limited geographic granularity.
02 // System & Pipeline
From sources to deliverables.
Notebooks 00–02 take the raw data through stratified sampling, cleaning, derived variables, and scaling. Notebooks 03–05 run clustering (K-Means and FCM), evaluation (Silhouette, Davies-Bouldin, ARI, NMI), and comparison. Artifacts (.pkl files and figures) feed the LaTeX report and the Next.js portfolio. The portfolio is a public presentation layer with static, exported data; it has no live connection to the .pkl artifacts.
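The evaluation step can be sketched with scikit-learn, whose metric names match those used in the notebooks (the toy blobs, seeds, and two-solution comparison below are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             adjusted_rand_score, normalized_mutual_info_score)

rng = np.random.default_rng(0)
# Toy stand-in for the scaled feature matrix (the real pipeline uses
# 17 standardized variables from the stratified SSA sample).
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])

km_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_b = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Internal validity: cohesion/separation of a single solution.
sil = silhouette_score(X, km_a.labels_)
dbi = davies_bouldin_score(X, km_a.labels_)
# External agreement: how well two labelings match, up to permutation.
ari = adjusted_rand_score(km_a.labels_, km_b.labels_)
nmi = normalized_mutual_info_score(km_a.labels_, km_b.labels_)
```

In the project, ARI/NMI compare K-Means against FCM labels rather than two K-Means seeds; the call pattern is the same.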
03 // Key Decisions
The heart of the design.
K-Means and FCM (not just one)
We needed both clear-cut classification for triage and identification of borderline cases: K-Means for definitive assignment and efficiency on millions of records, FCM for mixed profiles and prioritization via partial memberships.
Trade-off
More clusters give more detail but are harder to communicate; we settled on k=9 for K-Means and c=2 for FCM.
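The "partial memberships for borderline cases" idea can be sketched with a minimal NumPy Fuzzy C-Means (this is not the project's implementation; `fuzzy_cmeans`, the fuzzifier m=2.0, and the toy 2-D blobs are assumptions for illustration):

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal FCM sketch: returns centers and an (n, c) membership matrix."""
    rng = np.random.default_rng(seed)
    u = rng.dirichlet(np.ones(c), size=len(X))        # random soft start
    for _ in range(iters):
        w = u ** m                                    # fuzzified weights
        centers = (w.T @ X) / w.sum(axis=0)[:, None]  # weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)      # rows sum to 1
    return centers, u

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
               rng.normal((4, 4), 0.5, (50, 2)),
               [[2.0, 2.0]]])                         # one borderline point
centers, u = fuzzy_cmeans(X, c=2)
borderline = u[-1]   # near 50/50 split: exactly the case FCM surfaces
```

A crisp K-Means would force the midpoint into one cluster; the membership row makes its ambiguity explicit, which is what motivated running both algorithms.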
Stratified sampling
Given the size (30M+ records) and memory/time limits, we used stratified sampling to preserve representativeness without collapsing the pipeline.
Trade-off
Clustering the full dataset was not feasible; sample size and strata were documented.
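A pandas sketch of the stratified draw; EDAD and FALLECIDO follow the SSA column naming, but the strata, bins, and 10% fraction here are illustrative assumptions, not the documented design:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the raw SSA table (the real one has 30M+ rows).
n = 10_000
df = pd.DataFrame({
    "EDAD": rng.integers(0, 100, n),
    "FALLECIDO": rng.choice([0, 1], n, p=[0.9, 0.1]),
})
df["grupo_edad"] = pd.cut(df["EDAD"], bins=[0, 18, 40, 60, 100],
                          include_lowest=True)

# Sampling the same fraction inside every (age group x outcome) stratum
# keeps rare strata represented instead of letting them vanish by chance.
sample = (df.groupby(["grupo_edad", "FALLECIDO"], observed=True,
                     group_keys=False)
            .sample(frac=0.10, random_state=0))
```

Because every stratum contributes proportionally, marginal rates such as the FALLECIDO share survive the downsampling almost exactly.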
PCA only for visualization
We trained clustering on 17 standardized variables and used PCA only to reduce dimension in plots, not as clustering space.
Trade-off
Keeps interpretability in original feature space; visuals in 2D.
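The decision is easy to get wrong in code, so here is a sketch of the intended order of operations (scikit-learn assumed; the random matrix stands in for the real 17 variables):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 17))          # stand-in for the 17 variables

# Cluster in the full standardized space, never on a PCA projection.
X = StandardScaler().fit_transform(X_raw)
labels = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(X)

# PCA is fit afterwards, only to place the SAME points in 2D for plotting;
# the labels come from the 17-dimensional space, not from the projection.
XY = PCA(n_components=2).fit_transform(X)
```

Swapping the two steps (fit PCA, then cluster the 2-D points) would silently change which space defines the profiles, losing the per-variable interpretability the design aims for.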
Portfolio in Next.js
Results had to reach non-technical audiences and faculty, so we added a web layer (Next.js, React, Recharts) alongside the LaTeX report: narrative, a gallery, and interactive charts.
Trade-off
Static exports; no live connection to models.
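The "static exports" hand-off can be sketched as a one-shot JSON dump on the Python side (the `resumen` table, its column names, and the clusters.json filename are hypothetical; in the project the numbers would come from the notebook .pkl artifacts):

```python
import json
import pandas as pd

# Hypothetical cluster summary; values are made up for illustration.
resumen = pd.DataFrame({
    "cluster": [0, 1],
    "edad_media": [34.2, 67.8],
    "pct_hospitalizado": [0.12, 0.55],
})

# The portfolio never reads the .pkl directly: tables are exported once as
# static JSON that the Next.js/Recharts layer imports at build time.
payload = resumen.to_dict(orient="records")
with open("clusters.json", "w") as f:
    json.dump(payload, f, indent=2)
```

This keeps the web layer dependency-free at runtime, at the cost that charts go stale until the export is rerun.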
04 // What Went Wrong
Documented limitations.
Issues we had to acknowledge before closing the project.
100% lethality in all clusters (Data Quality)
All clusters showed 100% lethality, which forced us to document the need to validate the dataset and review the coding of the FALLECIDO variable. It was not fixed before the deadline and is recorded as a known limitation.
Outcome variable quality (Validation)
We trusted FALLECIDO to characterize clusters without cross-checking it early against the data dictionary or accounting for possible reporting bias, which weakened the mortality interpretation.
Dual source of figures (Process)
Visualizations were generated in notebooks for the LaTeX report and then reused in the portfolio without a single source of truth, so the portfolio gallery could drift from the report figures.
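The lethality anomaly could have been caught with a one-line sanity check before interpretation; a minimal sketch with pandas (FALLECIDO is the real column name, but the frame and cluster labels below are a toy stand-in):

```python
import pandas as pd

# Toy stand-in for the clustered sample, reproducing the symptom we saw:
# a constant outcome in every cluster.
df = pd.DataFrame({
    "cluster":   [0, 0, 0, 1, 1, 1],
    "FALLECIDO": [1, 1, 1, 1, 1, 1],
})

letalidad = df.groupby("cluster")["FALLECIDO"].mean()
# An outcome that never varies across clusters is a red flag: check the
# upstream filter and the data dictionary before reading it as mortality.
sospechoso = bool((letalidad == 1.0).all())
```

Run right after sampling, a check like this would have flagged the FALLECIDO issue before any cluster was characterized.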