CreditForge — Bank-Grade Credit Risk Platform
The 80% that makes a PD model bank-credible — leakage-safe targets, out-of-time validation, calibration, reason codes, fairness, and drift monitoring.
The problem
Expected Loss is one line — EL = PD × LGD × EAD. You can fit a PD model on a Kaggle dataset in an afternoon and feel like you have done credit risk. You have not. The model is roughly 20% of the work.
The other 80% is the scaffolding that turns it into a model a validation committee would actually sign off on. The silent killer is leakage: a random train/test split lets the model peek at the future and reports a discrimination score that evaporates the moment it sees next quarter's originations.
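As a minimal sketch of the Expected Loss identity above, the snippet below multiplies the three components for a single loan. The helper name `expected_loss` and the input figures are illustrative assumptions, not values taken from the platform.

```python
def expected_loss(pd_12m: float, lgd: float, ead: float) -> float:
    """Return EL = PD x LGD x EAD for one loan (hypothetical helper)."""
    if not 0.0 <= pd_12m <= 1.0:
        raise ValueError("PD must be a probability in [0, 1]")
    if not 0.0 <= lgd <= 1.0:
        raise ValueError("LGD is assumed to be a fraction in [0, 1] here")
    if ead < 0.0:
        raise ValueError("EAD must be non-negative")
    return pd_12m * lgd * ead


# Made-up inputs: 2% PD, 35% LGD, 200,000 exposure at default.
el = expected_loss(0.02, 0.35, 200_000.0)
print(round(el, 2))
```

With these made-up inputs the expected loss comes to roughly 1,400 in the exposure's currency. The arithmetic is trivial; the hard part is whether each input can be trusted.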
Architecture
1. Bronze: Raw Freddie Mac Single-Family loan + monthly-performance panels, vintage-partitioned.
2. Silver: Cleaned, point-in-time 12-month default flag, performance joined.
3. Gold: Leakage-safe feature matrix + forward target, vintage-tagged.
4. Out-of-time split: Train on old vintages, test on newer ones; never a random shuffle.
5. PD models: WoE scorecard (logistic, interpretable) + LightGBM challenger.
6. Calibration: Isotonic, which turns a good ranking into a calibrated probability.
7. LGD + EAD → EL: Two-stage LGD and EAD/CCF combined into Expected Loss.
8. Validation + governance: Gini/KS/PSI, SHAP reason codes, fairness tests, model card; FastAPI serving + Next.js Risk Cockpit + drift monitoring.
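The Silver and Gold steps above depend on a point-in-time target: features may only use months up to the observation date, and the label looks strictly forward. The sketch below shows one way this could work in plain Python. The default definition (delinquency status of 3 or more, i.e. 90+ days past due), the function names, and the toy histories are all assumptions for illustration, not the platform's actual code.

```python
from typing import Dict, Optional

DEFAULT_STATUS = 3    # assumption: default = 3+ months delinquent (90+ days past due)
HORIZON_MONTHS = 12   # forward outcome window for the 12-month default flag


def forward_default_flag(history: Dict[int, int], obs_month: int,
                         data_cutoff_month: int) -> Optional[int]:
    """Label one loan at one observation month.

    history maps month index -> delinquency status (months delinquent).
    Returns 1 if a default occurs in (obs_month, obs_month + 12],
    0 if the full window is observable and shows no default,
    None if the window runs past the data cutoff (drop it, do not label it 0).
    """
    window = range(obs_month + 1, obs_month + HORIZON_MONTHS + 1)
    # Months with no record are treated as not delinquent (assumption,
    # e.g. the loan prepaid and stopped reporting).
    if any(history.get(m, 0) >= DEFAULT_STATUS for m in window):
        return 1
    if data_cutoff_month < obs_month + HORIZON_MONTHS:
        return None
    return 0


def features_as_of(history: Dict[int, int], obs_month: int) -> Dict[str, int]:
    """Build features from months <= obs_month only, so nothing leaks backward."""
    past = {m: s for m, s in history.items() if m <= obs_month}
    recent = [s for m, s in past.items() if m > obs_month - 6]
    return {"months_on_book": len(past), "max_delq_last_6m": max(recent, default=0)}


# Toy histories: loan_a goes 90+ days delinquent in month 17, loan_b never does.
loan_a = {m: 0 for m in range(1, 21)}
loan_a.update({15: 1, 16: 2, 17: 3})
loan_b = {m: 0 for m in range(1, 25)}

print(forward_default_flag(loan_a, obs_month=6, data_cutoff_month=24))
print(forward_default_flag(loan_b, obs_month=18, data_cutoff_month=24))
print(features_as_of(loan_a, obs_month=6))
```

Returning `None` for an incomplete window matters: labelling those rows as non-defaults would bias the default rate downward for the newest observations.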
Key tradeoffs
Out-of-time vintage split instead of a random split.
Why · The only honest test of "will this work on loans I have not seen yet." A random split inflates a Gini that does not survive deployment.
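A vintage-based split can be sketched in a few lines. The record layout, the `"2016Q4"` vintage format, and the buffer between train and test vintages are assumptions made for this example.

```python
def out_of_time_split(loans, train_end, test_start):
    """Split loan records by origination vintage, never by random shuffle.

    Vintages are strings such as "2016Q4", which sort chronologically.
    Vintages strictly between train_end and test_start are left out as a
    buffer (an assumption here) so that forward outcome windows of the
    training loans do not run into the test period.
    """
    if not train_end < test_start:
        raise ValueError("test vintages must be strictly newer than train vintages")
    train = [r for r in loans if r["vintage"] <= train_end]
    test = [r for r in loans if r["vintage"] >= test_start]
    return train, test


# Toy records; ids, vintages and labels are made up.
loans = [
    {"loan_id": "A", "vintage": "2015Q1", "default_12m": 0},
    {"loan_id": "B", "vintage": "2016Q4", "default_12m": 1},
    {"loan_id": "C", "vintage": "2017Q2", "default_12m": 0},
    {"loan_id": "D", "vintage": "2018Q1", "default_12m": 0},
    {"loan_id": "E", "vintage": "2019Q3", "default_12m": 1},
]
train, test = out_of_time_split(loans, train_end="2016Q4", test_start="2018Q1")
print([r["loan_id"] for r in train], [r["loan_id"] for r in test])
```

Every test loan was originated after every training loan, which is the property a random shuffle cannot guarantee.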
Train a WoE scorecard and a LightGBM challenger side by side.
Why · The scorecard is the interpretable model you defend in the room; the challenger measures how much signal the simple model leaves on the table.
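For readers unfamiliar with the scorecard side, the sketch below computes Weight of Evidence (WoE) and Information Value for one binned feature. It uses the convention WoE = ln(share of goods / share of bads); some texts use the inverse. The bin labels, counts, and the `eps` smoothing term are assumptions for illustration.

```python
import math


def woe_table(bins, eps=0.5):
    """bins maps bin label -> (n_good, n_bad). Returns ({label: woe}, iv)."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe, iv = {}, 0.0
    for label, (good, bad) in bins.items():
        # eps keeps the log finite when a bin has no goods or no bads
        # (a smoothing choice made for this sketch, not a standard value).
        dist_good = (good + eps) / (total_good + eps * len(bins))
        dist_bad = (bad + eps) / (total_bad + eps * len(bins))
        woe[label] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[label]
    return woe, iv


# Made-up counts of (non-defaulters, defaulters) per credit-score band.
score_bins = {"<660": (800, 200), "660-719": (2700, 300), "720+": (6500, 100)}
woe, iv = woe_table(score_bins)
for label, value in woe.items():
    print(label, round(value, 3))
print("IV", round(iv, 3))
```

In a scorecard, each raw feature value is replaced by the WoE of its bin before the logistic regression is fitted, which is what keeps the final model readable bin by bin.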
Isotonic calibration as a first-class step.
Why · A credit decision needs a calibrated probability, not just a rank. A well-ordered but miscalibrated model prices every loan wrong.
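To make the calibration step concrete, here is a small pool-adjacent-violators sketch of isotonic regression in plain Python. It produces a step function (library implementations often interpolate between points instead), and the function names and toy data are assumptions for illustration. In practice the calibrator would be fitted on held-out data rather than on the rows used to train the PD model.

```python
import bisect


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to default rates."""
    # Aggregate tied scores first so each distinct score forms one starting block.
    agg = {}
    for s, y in zip(scores, labels):
        tot, cnt = agg.get(s, (0.0, 0))
        agg[s] = (tot + y, cnt + 1)
    blocks = []  # each block: [sum_of_labels, count, highest_score_in_block]
    for s in sorted(agg):
        tot, cnt = agg[s]
        blocks.append([tot, cnt, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            tot, cnt, hi = blocks.pop()
            blocks[-1][0] += tot
            blocks[-1][1] += cnt
            blocks[-1][2] = hi
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def predict_isotonic(model, score):
    """Look up the calibrated probability; scores outside the range are clipped."""
    uppers, values = model
    idx = bisect.bisect_left(uppers, score)
    return values[min(idx, len(values) - 1)]


# Toy raw scores and observed defaults (1 = default).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
defaults = [0, 0, 1, 0, 0, 1, 1, 1]
model = fit_isotonic(raw_scores, defaults)
print(model[1])
print(predict_isotonic(model, 0.4))
```

The ordering of loans is unchanged apart from ties created by pooling, so discrimination is largely preserved while the output moves onto the observed default-rate scale.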
Risk Copilot agents orchestrate; the classical ML core computes every number.
Why · SHAP explains the drivers and the LLM narrates them — it never invents a Gini.
Eval results
Out-of-time test on Freddie Mac vintages; gains/lift curves.
Reliability curve and Hosmer-Lemeshow — predicted vs realized default rate.
Population and characteristic drift vs the training distribution.
Per-applicant adverse-action reason codes, protected-group fairness testing, and a documented model card.
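The discrimination and drift metrics named in this document (Gini, KS, PSI) can be sketched in plain Python as follows. The sketch assumes the score is a PD, so higher means riskier, and that PSI bins are already counted; function names and toy data are illustrative assumptions.

```python
import math


def gini(scores, labels):
    """Gini = 2 * AUC - 1, with AUC from pairwise comparisons (ties count half)."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return 2.0 * wins / (len(bad) * len(good)) - 1.0


def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of bads and goods."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    ordered = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    best = 0.0
    for i, (s, y) in enumerate(ordered):
        cum_bad += y
        cum_good += 1 - y
        # Evaluate the gap only after the last record of a tied score.
        if i + 1 == len(ordered) or ordered[i + 1][0] != s:
            best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
    return best


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts (same bins in both)."""
    e_tot, a_tot = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_tot, eps)  # eps guards against empty bins
        a_pct = max(a / a_tot, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


# Toy data: predicted PDs with outcomes, and binned score counts.
pds = [0.05, 0.10, 0.20, 0.30, 0.60, 0.80]
outcomes = [0, 0, 1, 0, 1, 1]
print(round(gini(pds, outcomes), 4), round(ks_statistic(pds, outcomes), 4))
print(round(psi([250, 250, 250, 250], [100, 200, 300, 400]), 4))
```

The pairwise Gini is quadratic in sample size and is only meant to show the definition; a rank-based AUC would be the usual choice on a full loan panel.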
Production proof
The artifact that keeps the numbers honest — the eval harness / monitoring gates that run in CI, not a one-off notebook result.
CI performance gates
CI · passing
creditforge/eval gates fail the build on regression, and PSI drift is monitored on a schedule in production, so the model cannot silently degrade.
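One way such a gate could look is sketched below: compare the current run's metrics to a stored baseline and return a non-zero exit code on regression. The metric names, the baseline values, the tolerated drop, and the PSI limit (0.25 is a commonly quoted rule of thumb) are all hypothetical; the platform's real gate values are not stated here.

```python
def check_gates(current, baseline, max_drop=0.02, max_psi=0.25):
    """Return a list of failure messages; an empty list means the gates pass."""
    failures = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None:
            failures.append(f"{name}: metric missing from current run")
        elif value < base_value - max_drop:
            failures.append(
                f"{name}: {value:.3f} regressed from baseline {base_value:.3f}")
    psi_value = current.get("psi_score")
    if psi_value is not None and psi_value > max_psi:
        failures.append(f"psi_score: {psi_value:.3f} exceeds {max_psi:.3f}")
    return failures


def run_gates(current, baseline):
    """Print any failures and return a process exit code (0 = pass, 1 = fail)."""
    failures = check_gates(current, baseline)
    for line in failures:
        print("GATE FAILED:", line)
    return 1 if failures else 0


# Hypothetical numbers for illustration only.
baseline = {"gini_oot": 0.50, "ks_oot": 0.36}
good_run = {"gini_oot": 0.51, "ks_oot": 0.35, "psi_score": 0.04}
bad_run = {"gini_oot": 0.44, "ks_oot": 0.36, "psi_score": 0.31}

print(run_gates(good_run, baseline))
print(run_gates(bad_run, baseline))
```

A CI job would pass the return value of `run_gates` to `sys.exit`, so any regression beyond the tolerance fails the build.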
Let's talk
I'm focused on finance AI — credit risk, RegTech, AML, and agentic investment research. Open to roles, mentorship, and collaborators in fintech, quant, and bank AI.