Credit Risk · ML

CreditForge — Bank-Grade Credit Risk Platform

The 80% that makes a PD model bank-credible — leakage-safe targets, out-of-time validation, calibration, reason codes, fairness, and drift monitoring.


The problem

Expected Loss is one line — EL = PD × LGD × EAD. You can fit a PD model on a Kaggle dataset in an afternoon and feel like you have done credit risk. You have not. The model is roughly 20% of the work.

The other 80% is the scaffolding that turns a model into one a validation committee would actually sign off on. The silent killer is leakage: a random train/test split lets the model peek at the future and reports a discrimination score that evaporates the moment it sees next quarter's originations.
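As a quick illustration of the one-line Expected Loss formula above, here is a minimal Python sketch. The function name and the loan figures are invented for the example and are not taken from CreditForge.

```python
def expected_loss(pd_12m: float, lgd: float, ead: float) -> float:
    """Expected Loss in currency units: EL = PD x LGD x EAD."""
    if not (0.0 <= pd_12m <= 1.0 and 0.0 <= lgd <= 1.0):
        raise ValueError("PD and LGD must lie in [0, 1]")
    if ead < 0:
        raise ValueError("EAD must be non-negative")
    return pd_12m * lgd * ead


# Illustrative loan: 2% probability of default, 35% loss given default,
# 200,000 exposure at default.
el = expected_loss(0.02, 0.35, 200_000)
print(round(el, 2))
```

The arithmetic is trivial; the hard part is that each of the three inputs has to be estimated without leakage and then calibrated.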

Architecture

1. Bronze: Raw Freddie Mac Single-Family loan + monthly-performance panels, vintage-partitioned.
2. Silver: Cleaned, performance joined, with a point-in-time 12-month default flag.
3. Gold: Leakage-safe feature matrix + forward target, vintage-tagged.
4. Out-of-time split: Train on old vintages, test on newer ones, never on a random shuffle.
5. PD models: WoE scorecard (logistic, interpretable) + LightGBM challenger.
6. Calibration: Isotonic regression turns a good ranking into a calibrated probability.
7. LGD + EAD → EL: Two-stage LGD and EAD/CCF combined into Expected Loss.
8. Validation + governance: Gini/KS/PSI, SHAP reason codes, fairness tests, model card; FastAPI serving + Next.js Risk Cockpit + drift monitoring.
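One way to read the Silver and Gold steps: for a loan observed at month t, the target looks only at performance in months t+1 through t+12, while features may use only data up to and including t. The sketch below assumes a default definition of 90+ days delinquent (delinquency bucket 3 or higher); the source does not state the definition it uses, and the names forward_default_flag and features_as_of are hypothetical.

```python
from typing import Optional

Period = tuple[int, int]  # (year, month)


def months_between(start: Period, end: Period) -> int:
    """Whole months from start to end (negative if end is earlier)."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


def forward_default_flag(
    performance: list[tuple[Period, int]],
    observation: Period,
    data_end: Period,
    horizon: int = 12,
    default_bucket: int = 3,  # assumption: bucket 3 means 90+ days past due
) -> Optional[int]:
    """1 if the loan defaults in the `horizon` months after `observation`."""
    # If the forward window is not fully observed, the outcome is unknown
    # and the row should be dropped rather than labelled 0.
    if months_between(observation, data_end) < horizon:
        return None
    for period, bucket in performance:
        ahead = months_between(observation, period)
        if 0 < ahead <= horizon and bucket >= default_bucket:
            return 1
    return 0


def features_as_of(performance, observation):
    """Keep only records dated on or before the observation month."""
    return [(p, b) for p, b in performance if months_between(p, observation) >= 0]


history = [((2019, 1), 0), ((2019, 6), 1), ((2019, 9), 3)]
flag = forward_default_flag(history, (2019, 1), data_end=(2020, 12))
immature = forward_default_flag(history, (2020, 6), data_end=(2020, 12))
visible = features_as_of(history, (2019, 6))
```

Returning None for immature windows matters: labelling a loan "good" simply because the data ends early biases the default rate downward.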

Key tradeoffs

Out-of-time vintage split instead of a random split.

Why · The only honest test of "will this work on loans I have not seen yet." A random split inflates a Gini that does not survive deployment.
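A minimal sketch of such a split, assuming each loan record carries a vintage label in "YYYYQn" form (which sorts chronologically as a string). The record layout and the cutoff value are illustrative assumptions.

```python
def out_of_time_split(loans, cutoff_vintage):
    """Train on vintages strictly before the cutoff, test on the rest."""
    train = [loan for loan in loans if loan["vintage"] < cutoff_vintage]
    test = [loan for loan in loans if loan["vintage"] >= cutoff_vintage]
    if not train or not test:
        raise ValueError("cutoff leaves one side of the split empty")
    return train, test


loans = [
    {"loan_id": "A", "vintage": "2016Q1"},
    {"loan_id": "B", "vintage": "2017Q3"},
    {"loan_id": "C", "vintage": "2018Q2"},
    {"loan_id": "D", "vintage": "2019Q4"},
]
train, test = out_of_time_split(loans, "2018Q1")
```

With a 12-month forward target, it may also be worth checking that the outcome windows of the newest training vintages close before the test period begins.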

Train a WoE scorecard and a LightGBM challenger side by side.

Why · The scorecard is the interpretable model you defend in the room; the challenger measures how much signal the simple model leaves on the table.
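For readers unfamiliar with the scorecard side, the following sketch computes Weight of Evidence for a pre-binned feature. It uses one common sign convention, WoE = ln(share of goods / share of bads), plus additive smoothing so empty cells do not produce infinities; the function name, the smoothing constant, and the toy data are assumptions.

```python
import math
from collections import defaultdict


def woe_table(bins, defaults, smoothing=0.5):
    """Map each bin label to ln(share of goods / share of bads)."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [goods, bads]
    for label, y in zip(bins, defaults):
        counts[label][y] += 1  # y is 0 (good) or 1 (default)
    total_good = sum(c[0] for c in counts.values())
    total_bad = sum(c[1] for c in counts.values())
    k = len(counts)
    table = {}
    for label, (good, bad) in counts.items():
        share_good = (good + smoothing) / (total_good + smoothing * k)
        share_bad = (bad + smoothing) / (total_bad + smoothing * k)
        table[label] = math.log(share_good / share_bad)
    return table


# Toy data: the low-score bin defaults far more often than the high-score bin.
bins = ["low_fico"] * 10 + ["high_fico"] * 10
defaults = [1] * 5 + [0] * 5 + [1] * 1 + [0] * 9
woe = woe_table(bins, defaults)
```

Under this convention a negative WoE marks a bin that is riskier than the portfolio average. The WoE values should be fitted on the training vintages only and then applied unchanged to the test vintages.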

Isotonic calibration as a first-class step.

Why · A credit decision needs a calibrated probability, not just a rank. A well-ordered but miscalibrated model prices every loan wrong.
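To make the calibration step concrete, here is a small pool-adjacent-violators sketch that fits a non-decreasing step function from raw score to observed default rate. It is a plain-Python illustration rather than the project's implementation, and unlike some library versions it does not interpolate between steps.

```python
from bisect import bisect_left
from itertools import groupby


def fit_isotonic(scores, outcomes):
    """Return (upper score of each step, default rate of each step)."""
    pairs = sorted(zip(scores, outcomes))
    blocks = []  # each block: [sum of outcomes, count, highest score]
    for score, group in groupby(pairs, key=lambda pair: pair[0]):
        ys = [y for _, y in group]
        blocks.append([float(sum(ys)), len(ys), score])
        # Pool adjacent blocks while the default rate fails to increase.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [block[2] for block in blocks]
    rates = [block[0] / block[1] for block in blocks]
    return uppers, rates


def predict_isotonic(model, score):
    """Look up the calibrated rate; scores beyond the range are clamped."""
    uppers, rates = model
    index = bisect_left(uppers, score)
    return rates[min(index, len(rates) - 1)]


model = fit_isotonic([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 0, 1, 0, 1, 1])
calibrated = predict_isotonic(model, 0.35)
```

Because the mapping is monotone, calibration changes the probabilities without reordering applicants, so rank metrics such as Gini are essentially preserved (ties aside).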

Risk Copilot agents orchestrate; the classical ML core computes every number.

Why · SHAP explains the drivers and the LLM narrates them — it never invents a Gini.

Eval results

Discrimination (Gini / KS)

Out-of-time test on Freddie Mac vintages, with gains/lift curves.
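Both discrimination metrics can be computed in one pass over the sorted scores. The sketch below assumes a higher score means higher risk, derives Gini as 2 × AUC − 1 with tied scores counted as half, and takes KS as the largest gap between the cumulative score distributions of defaulters and non-defaulters. The function name is hypothetical.

```python
def gini_and_ks(scores, defaults):
    """Return (Gini, KS) for scores where higher means riskier."""
    n_bad = sum(defaults)
    n_good = len(defaults) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one default and one non-default")
    pairs = sorted(zip(scores, defaults))
    cum_bad = cum_good = 0
    concordant = 0.0  # (bad, good) pairs where the bad loan scores higher
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        bad_here = good_here = 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                bad_here += 1
            else:
                good_here += 1
            j += 1
        # Bads at this score beat all lower-scored goods; ties count half.
        concordant += bad_here * (cum_good + 0.5 * good_here)
        cum_bad += bad_here
        cum_good += good_here
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    auc = concordant / (n_bad * n_good)
    return 2 * auc - 1, ks


gini, ks = gini_and_ks([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```

Running the same function on a random split and on the out-of-time split is a direct way to see how much of the headline Gini comes from leakage.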

Calibration (Isotonic + HL)

Reliability curve and Hosmer-Lemeshow test: predicted vs. realized default rate.
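A sketch of the Hosmer-Lemeshow statistic together with the reliability table behind it, assuming equal-count groups ordered by predicted PD. The p-value is left out because it needs a chi-square survival function from a statistics library; the statistic is conventionally compared against a chi-square distribution with (groups − 2) degrees of freedom. The function name and toy data are illustrative.

```python
def hosmer_lemeshow(pds, defaults, groups=10):
    """Return (HL statistic, [(mean predicted PD, realized rate), ...])."""
    pairs = sorted(zip(pds, defaults))
    n = len(pairs)
    if n < groups:
        raise ValueError("need at least one observation per group")
    stat = 0.0
    table = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)  # expected number of defaults
        observed = sum(y for _, y in chunk)  # realized number of defaults
        mean_pd = expected / size
        variance = size * mean_pd * (1 - mean_pd)
        if variance > 0:
            stat += (observed - expected) ** 2 / variance
        table.append((mean_pd, observed / size))
    return stat, table


# Toy portfolio where realized rates match the predictions exactly.
pds = [0.1] * 10 + [0.5] * 10
defaults = [1] + [0] * 9 + [1] * 5 + [0] * 5
stat, table = hosmer_lemeshow(pds, defaults, groups=2)
```

Plotting the table's two columns against each other gives the reliability curve; points on the diagonal mean predicted and realized default rates agree.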

Stability (PSI / CSI)

Population and characteristic drift vs. the training distribution.
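The Population Stability Index compares the share of observations per bin between the training sample and a newer sample: PSI = Σ (actual − expected) × ln(actual / expected). The sketch below assumes bins cut at training-sample quantiles and a small floor on empty bins; both choices, and the function name, are assumptions. Applying the same function to a single input feature instead of the score gives a CSI-style check.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, floor=1e-4):
    """Population Stability Index of `actual` against `expected`."""
    ordered = sorted(expected)
    # Interior cut points at the training-sample quantiles.
    edges = [ordered[len(ordered) * k // bins] for k in range(1, bins)]

    def shares(values):
        counts = [0] * bins
        for value in values:
            counts[bisect_right(edges, value)] += 1
        # Floor empty bins so the logarithm stays finite.
        return [max(count / len(values), floor) for count in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_scores = list(range(1000))
same = psi(training_scores, training_scores)
shifted = psi(training_scores, [s + 300 for s in training_scores])
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though the thresholds a given bank uses are a policy decision.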

Governance (SHAP · reason codes · fairness)

Per-applicant adverse-action reason codes, protected-group fairness testing, and a documented model card.
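For a linear scorecard, reason codes can be illustrated without a SHAP library: when features are treated as independent, each feature's SHAP value reduces to coefficient × (applicant value − baseline mean). The sketch below ranks the features that push an applicant's log-odds of default upward. The coefficient values, feature names, and baseline are invented for illustration, and a tree model such as the LightGBM challenger would need a tree-specific explainer instead.

```python
def reason_codes(coefficients, applicant, baseline, top_n=3):
    """Top features raising this applicant's log-odds of default."""
    contributions = {
        name: coefficients[name] * (applicant[name] - baseline[name])
        for name in coefficients
    }
    # Only contributions that increase risk qualify as adverse reasons.
    adverse = [(name, c) for name, c in contributions.items() if c > 0]
    adverse.sort(key=lambda item: item[1], reverse=True)
    return adverse[:top_n]


# Hypothetical scorecard on WoE-encoded inputs (negative WoE = riskier bin).
coefficients = {"fico_woe": -0.8, "dti_woe": -0.5, "ltv_woe": -0.6}
baseline = {"fico_woe": 0.0, "dti_woe": 0.0, "ltv_woe": 0.0}
applicant = {"fico_woe": -1.0, "dti_woe": 0.4, "ltv_woe": -0.5}
codes = reason_codes(coefficients, applicant, baseline)
```

Here the credit-score and loan-to-value bins would be reported as adverse reasons, while debt-to-income would not, because it lowers this applicant's predicted risk.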

Production proof

The artifact that keeps the numbers honest is the eval harness and its monitoring gates, which run in CI rather than living in a one-off notebook result.

CI performance gates

CI · passing

Discrimination (Gini): ≥ threshold, gated

Calibration error: ≤ threshold, gated

PSI stability: ≤ threshold, gated + monitored

creditforge/eval gates fail the build on regression, and PSI drift is monitored on a schedule in production — the model cannot silently degrade.
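A gate of this kind can be as small as a function that compares metrics to thresholds and yields a non-zero exit code on failure. The sketch below is not the contents of creditforge/eval; the threshold values, metric keys, and function names are placeholders assumed for illustration.

```python
# Placeholder thresholds; the real values are not published in the source.
THRESHOLDS = {"min_gini": 0.40, "max_calibration_error": 0.02, "max_psi": 0.10}


def check_gates(metrics: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return a message for every gate the metrics fail."""
    failures = []
    if metrics["gini"] < thresholds["min_gini"]:
        failures.append(f"gini {metrics['gini']:.3f} below minimum")
    if metrics["calibration_error"] > thresholds["max_calibration_error"]:
        failures.append(
            f"calibration error {metrics['calibration_error']:.3f} above maximum"
        )
    if metrics["psi"] > thresholds["max_psi"]:
        failures.append(f"psi {metrics['psi']:.3f} above maximum")
    return failures


def main(metrics: dict) -> int:
    """Print failures and return a process exit code (0 means pass)."""
    failures = check_gates(metrics)
    for message in failures:
        print(f"GATE FAILED: {message}")
    return 1 if failures else 0


# A CI entry point would pass this value to sys.exit().
exit_code = main({"gini": 0.55, "calibration_error": 0.01, "psi": 0.04})
```

Keeping the thresholds in version control alongside the code means any loosening of a gate shows up in review rather than happening quietly.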

A credit-risk stack that answers the only question that matters inside a bank — "would your model-validation committee sign this off?" — end to end, on real GSE mortgage data.

