Credit Risk · ML

CreditForge — Bank-Grade Credit Risk Platform

The 80% that makes a PD model bank-credible — leakage-safe targets, out-of-time validation, calibration, reason codes, fairness, and drift monitoring.

The problem

Expected Loss is one line — EL = PD × LGD × EAD. You can fit a PD model on a Kaggle dataset in an afternoon and feel like you have done credit risk. You have not. The model is roughly 20% of the work.
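To make the one-liner concrete, here is a minimal sketch of the Expected Loss arithmetic for a single exposure. The function name and the input numbers are illustrative assumptions, not CreditForge outputs.

```python
def expected_loss(pd_12m: float, lgd: float, ead: float) -> float:
    """Expected Loss = PD x LGD x EAD for one exposure.

    pd_12m: probability of default over the horizon, in [0, 1]
    lgd:    loss given default as a fraction of exposure, in [0, 1]
    ead:    exposure at default, in currency units
    """
    if not (0.0 <= pd_12m <= 1.0 and 0.0 <= lgd <= 1.0):
        raise ValueError("PD and LGD must lie in [0, 1]")
    if ead < 0:
        raise ValueError("EAD must be non-negative")
    return pd_12m * lgd * ead


# Illustrative inputs only: 2% PD, 35% LGD, 200,000 exposure.
el = expected_loss(0.02, 0.35, 200_000)
print(round(el, 2))
```

The formula is trivial; the hard part is earning the right to trust each of the three inputs.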

The other 80% is the scaffolding that turns that model into one a validation committee would actually sign off on. The silent killer is leakage: a random train/test split lets the model peek at the future and reports a discrimination score that evaporates the moment it sees next quarter's originations.

Architecture

  1. Bronze

     Raw Freddie Mac Single-Family loan + monthly-performance panels, vintage-partitioned.

  2. Silver

     Cleaned, point-in-time 12-month default flag, performance joined.

  3. Gold

     Leakage-safe feature matrix + forward target, vintage-tagged.

  4. Out-of-time split

     Train on old vintages, test on newer ones; never a random shuffle.

  5. PD models

     WoE scorecard (logistic, interpretable) + LightGBM challenger.

  6. Calibration

     Isotonic: turns a good ranking into a calibrated probability.

  7. LGD + EAD → EL

     Two-stage LGD and EAD/CCF combined into Expected Loss.

  8. Validation + governance

     Gini/KS/PSI, SHAP reason codes, fairness tests, model card; FastAPI serving + Next.js Risk Cockpit + drift monitoring.
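The point-in-time 12-month default flag from the Silver step can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: each loan's history is assumed to be a list of integer months-delinquent values, one per reporting month, and "default" is assumed to mean 3 or more months delinquent. The function name and parameters are hypothetical.

```python
from typing import List, Optional


def forward_default_flag(
    months_delinquent: List[int],
    obs_idx: int,
    horizon: int = 12,
    default_threshold: int = 3,  # assumption: 90+ days past due counts as default
) -> Optional[int]:
    """Label the observation month using ONLY the months that follow it.

    Returns 1 if the loan hits the default threshold within the horizon,
    0 if the full horizon is observed without default, and None if the
    window is cut short (censored), so the row can be dropped from training.
    """
    window = months_delinquent[obs_idx + 1 : obs_idx + 1 + horizon]
    if any(status >= default_threshold for status in window):
        return 1
    if len(window) < horizon:
        return None  # not enough future performance to call it a non-default
    return 0


history = [0, 0, 0, 1, 2, 3, 3]
print(forward_default_flag(history, obs_idx=0))   # default inside the window
print(forward_default_flag([0] * 13, obs_idx=0))  # full clean window
print(forward_default_flag([0] * 5, obs_idx=0))   # censored
```

Features for the same row must be built only from months up to and including `obs_idx`; the label looks strictly forward. A fuller version would also treat prepaid loans explicitly rather than as censored.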

Key tradeoffs

Out-of-time vintage split instead of a random split.

Why · The only honest test of "will this work on loans I have not seen yet." A random split inflates a Gini that does not survive deployment.
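A minimal sketch of an out-of-time split, assuming each loan record carries a `vintage` string such as "2016Q3" (a hypothetical encoding that sorts chronologically as text). The cutoff value is illustrative.

```python
from typing import Dict, List, Tuple

Loan = Dict[str, str]


def out_of_time_split(loans: List[Loan], cutoff: str) -> Tuple[List[Loan], List[Loan]]:
    """Train on vintages strictly before the cutoff, test on the cutoff and later."""
    train = [loan for loan in loans if loan["vintage"] < cutoff]
    test = [loan for loan in loans if loan["vintage"] >= cutoff]
    if not train or not test:
        raise ValueError("cutoff leaves one side of the split empty")
    return train, test


loans = [
    {"loan_id": "A", "vintage": "2015Q1"},
    {"loan_id": "B", "vintage": "2016Q3"},
    {"loan_id": "C", "vintage": "2017Q1"},
    {"loan_id": "D", "vintage": "2018Q2"},
]
train, test = out_of_time_split(loans, cutoff="2017Q1")
print([l["loan_id"] for l in train], [l["loan_id"] for l in test])
```

The invariant worth asserting in a test suite is that the newest training vintage is older than the oldest test vintage.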

Train a WoE scorecard and a LightGBM challenger side by side.

Why · The scorecard is the interpretable model you defend in the room; the challenger measures how much signal the simple model leaves on the table.
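For readers new to scorecards, here is a small sketch of Weight of Evidence per bin. Sign conventions differ between texts; this one uses ln(share of goods / share of bads), so a positive WoE means a safer-than-average bin. The bin labels and the additive smoothing constant are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence


def woe_table(
    bins: Sequence[Hashable],
    defaults: Sequence[int],
    smoothing: float = 0.5,  # assumption: additive smoothing so empty cells do not produce log(0)
) -> Dict[Hashable, float]:
    """Weight of Evidence per bin: ln(share of goods in bin / share of bads in bin)."""
    good: Counter = Counter()
    bad: Counter = Counter()
    for b, d in zip(bins, defaults):
        if d:
            bad[b] += 1
        else:
            good[b] += 1
    labels = sorted(set(bins))
    total_good = sum(good.values()) + smoothing * len(labels)
    total_bad = sum(bad.values()) + smoothing * len(labels)
    return {
        b: math.log(((good[b] + smoothing) / total_good) / ((bad[b] + smoothing) / total_bad))
        for b in labels
    }


# Hypothetical credit-score bins: the "high" bin has 1 default in 10, the "low" bin 5 in 10.
bins = ["fico_high"] * 10 + ["fico_low"] * 10
defaults = [1] + [0] * 9 + [1] * 5 + [0] * 5
woe = woe_table(bins, defaults)
print(woe)
```

The logistic scorecard is then fitted on WoE-encoded features, which is what keeps each coefficient readable in the committee room.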

Isotonic calibration as a first-class step.

Why · A credit decision needs a calibrated probability, not just a rank. A well-ordered but miscalibrated model prices every loan wrong.
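To show what isotonic calibration does, here is a dependency-free sketch of the pool-adjacent-violators idea: sort by score, then merge neighbouring blocks until the observed default rate never decreases as the score rises. It is not the project's code. In practice a library such as scikit-learn's `IsotonicRegression` would be the usual choice; this version is a step function and does not pool tied scores.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

IsotonicModel = Tuple[List[float], List[float]]  # (upper score of each block, calibrated PD)


def fit_isotonic(scores: Sequence[float], outcomes: Sequence[int]) -> IsotonicModel:
    """Pool adjacent violators: returns a non-decreasing step function of the score."""
    blocks: List[List[float]] = []  # each block is [sum of outcomes, count, max score]
    for s, y in sorted(zip(scores, outcomes)):
        blocks.append([float(y), 1.0, s])
        # Merge while the previous block's mean is not below the current block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def predict_isotonic(model: IsotonicModel, score: float) -> float:
    """Look up the calibrated PD of the first block whose upper score covers the input."""
    uppers, values = model
    return values[min(bisect_left(uppers, score), len(values) - 1)]


raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
observed = [0, 0, 1, 0, 1, 1]
model = fit_isotonic(raw_scores, observed)
print(model)
print(predict_isotonic(model, 0.35))
```

The ranking of the raw scores is preserved; only the mapping from score to probability changes. The calibrator should be fitted on data the PD model did not train on.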

Risk Copilot agents orchestrate; the classical ML core computes every number.

Why · SHAP explains the drivers and the LLM narrates them — it never invents a Gini.

Eval results

Gini / KS
Discrimination

Out-of-time test on Freddie Mac vintages; gains/lift curves.
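As a reference for the two discrimination metrics, here is a compact sketch: Gini is 2 × AUC − 1, with AUC computed by counting concordant good/bad pairs, and KS is the largest gap between the cumulative score distributions of goods and bads. It assumes higher scores mean higher risk; the function name is hypothetical.

```python
from itertools import groupby
from typing import Sequence, Tuple


def gini_and_ks(scores: Sequence[float], defaults: Sequence[int]) -> Tuple[float, float]:
    """Return (Gini, KS) for risk scores where higher means riskier."""
    n_bad = sum(defaults)
    n_good = len(defaults) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one default and one non-default")
    pairs = sorted(zip(scores, defaults), key=lambda p: p[0])
    goods_seen = bads_seen = 0
    concordant = 0.0
    ks = 0.0
    for _, group in groupby(pairs, key=lambda p: p[0]):
        labels = [d for _, d in group]
        b = sum(labels)
        g = len(labels) - b
        # Each bad outranks every good seen at a lower score; ties count as half.
        concordant += b * goods_seen + 0.5 * b * g
        goods_seen += g
        bads_seen += b
        ks = max(ks, abs(goods_seen / n_good - bads_seen / n_bad))
    auc = concordant / (n_bad * n_good)
    return 2.0 * auc - 1.0, ks


print(gini_and_ks([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # perfect separation
print(gini_and_ks([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))  # no information
```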

Isotonic + HL
Calibration

Reliability curve and Hosmer-Lemeshow — predicted vs realized default rate.
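The predicted-vs-realized comparison can be sketched in two steps: group loans into equal-sized bins by predicted PD, then sum the squared gap between observed and expected defaults in each bin, scaled by the binomial variance. This is a sketch of the Hosmer-Lemeshow statistic only. The p-value, conventionally taken from a chi-square distribution with (number of bins − 2) degrees of freedom, is left out to keep it dependency-free.

```python
from typing import List, Sequence, Tuple

Bin = Tuple[int, float, int]  # (loans in bin, mean predicted PD, observed defaults)


def reliability_bins(pred: Sequence[float], outcomes: Sequence[int], n_bins: int = 10) -> List[Bin]:
    """Equal-count bins ordered by predicted PD; these are the points of a reliability curve."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    bins: List[Bin] = []
    for k in range(n_bins):
        idx = order[k * len(order) // n_bins : (k + 1) * len(order) // n_bins]
        if not idx:
            continue
        mean_pred = sum(pred[i] for i in idx) / len(idx)
        bins.append((len(idx), mean_pred, sum(outcomes[i] for i in idx)))
    return bins


def hosmer_lemeshow(bins: Sequence[Bin]) -> float:
    """Sum over bins of (observed - expected)^2 / (n * p * (1 - p))."""
    stat = 0.0
    for n, p, observed in bins:
        variance = n * p * (1.0 - p)
        if variance > 0:
            stat += (observed - n * p) ** 2 / variance
    return stat


well_calibrated = reliability_bins([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0], n_bins=2)
under_predicting = reliability_bins([0.1, 0.1, 0.1, 0.1], [1, 1, 1, 1], n_bins=1)
print(hosmer_lemeshow(well_calibrated), hosmer_lemeshow(under_predicting))
```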

PSI / CSI
Stability

Population and characteristic drift vs the training distribution.
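A minimal sketch of the Population Stability Index: bin both samples on the same edges, then sum (actual share − expected share) × ln(actual share / expected share) over the bins. The bin edges, the floor for empty bins, and the sample values are illustrative assumptions. CSI applies the same formula to a single input characteristic instead of the model score.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    edges: Sequence[float],
    floor: float = 1e-6,  # assumption: floor empty-bin shares to avoid log(0)
) -> float:
    """Population Stability Index between a baseline sample and a new sample."""

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), floor) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


edges = [625, 675, 725]                 # hypothetical score-band boundaries
training = [600, 650, 700, 750]         # baseline distribution
production = [600, 600, 650, 650]       # shifted toward the low bands
print(psi(training, training, edges), psi(training, production, edges))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a material shift, though the thresholds a given validation team applies may differ.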

SHAP · reason codes · fairness
Governance

Per-applicant adverse-action reason codes, protected-group fairness testing, and a documented model card.
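One simple way to turn per-applicant attributions into reason codes is to rank the features that push the predicted PD up and report the top few. The sketch below assumes the contributions have already been computed (for example, SHAP values on the log-odds scale). The feature names and values are hypothetical, and mapping feature names to regulatory adverse-action wording is out of scope.

```python
from typing import Dict, List


def reason_codes(contributions: Dict[str, float], top_k: int = 3) -> List[str]:
    """Return up to top_k features that raise this applicant's PD, largest first."""
    adverse = [(name, value) for name, value in contributions.items() if value > 0]
    adverse.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in adverse[:top_k]]


# Hypothetical attribution vector for one applicant (positive = pushes PD up).
applicant = {"dti": 0.42, "fico": -0.30, "ltv": 0.15, "loan_age": 0.0, "occupancy": 0.03}
codes = reason_codes(applicant, top_k=2)
print(codes)
```

Features that lower the PD, or contribute nothing, never appear as reasons for an adverse decision.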

Production proof

The artifact that keeps the numbers honest — the eval harness / monitoring gates that run in CI, not a one-off notebook result.

CI performance gates

CI · passing
Discrimination (Gini) gated
Calibration error gated
PSI stability gated + monitored

creditforge/eval gates fail the build on regression, and PSI drift is monitored on a schedule in production — the model cannot silently degrade.
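A gate of this kind can be as small as a table of thresholds and a function that lists the breaches; CI fails when the list is non-empty. The sketch below is illustrative only: the metric names, thresholds, and `Gate` structure are assumptions, not the contents of creditforge/eval.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Gate:
    metric: str
    threshold: float
    higher_is_better: bool


# Hypothetical thresholds; real values would come from the model's validation policy.
GATES = (
    Gate("gini", 0.40, higher_is_better=True),
    Gate("calibration_error", 0.02, higher_is_better=False),
    Gate("psi", 0.25, higher_is_better=False),
)


def failed_gates(metrics: Dict[str, float], gates: Sequence[Gate] = GATES) -> List[str]:
    """Return a human-readable message for every gate the metrics breach."""
    failures = []
    for gate in gates:
        value = metrics[gate.metric]
        ok = value >= gate.threshold if gate.higher_is_better else value <= gate.threshold
        if not ok:
            failures.append(f"{gate.metric}={value:.3f} breaches threshold {gate.threshold:.3f}")
    return failures


healthy = {"gini": 0.52, "calibration_error": 0.01, "psi": 0.05}
regressed = {"gini": 0.31, "calibration_error": 0.01, "psi": 0.30}
print(failed_gates(healthy))
print(failed_gates(regressed))
```

In a test runner, a single `assert not failed_gates(metrics)` is enough to fail the build on regression.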

A credit-risk stack that answers the only question that matters inside a bank — "would your model-validation committee sign this off?" — end to end, on real GSE mortgage data.
