CreditForge: What It Actually Takes to Build a Credit-Risk Model a Bank Would Sign Off On
EL = PD × LGD × EAD is one line. The 80% that makes a credit model bank-credible is everything around it — leakage-safe point-in-time targets, out-of-time validation, calibration, reason codes, fairness, and drift monitoring. Here is the full stack.
Expected Loss is one line of arithmetic:
EL = PD × LGD × EAD
Probability of default, times loss given default, times exposure at default. You can fit a PD model on a Kaggle dataset in an afternoon and feel like you have done credit risk. You have not. The model is maybe 20% of the work. The other 80% — the part that makes a credit model something a bank's model-validation committee would actually sign off on — is the scaffolding around it. That scaffolding is what I built CreditForge to demonstrate.
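To make the arithmetic concrete, here is a minimal sketch of the formula for a single exposure. The function name and the figures are illustrative assumptions, not CreditForge code or real portfolio numbers.

```python
def expected_loss(pd_: float, lgd: float, ead: float) -> float:
    """Expected loss for one exposure: EL = PD x LGD x EAD."""
    if not (0.0 <= pd_ <= 1.0 and 0.0 <= lgd <= 1.0):
        raise ValueError("PD and LGD must lie in [0, 1]")
    if ead < 0:
        raise ValueError("EAD must be non-negative")
    return pd_ * lgd * ead


# Illustrative inputs only: 2% PD, 35% LGD, 200,000 exposure at default.
el = expected_loss(0.02, 0.35, 200_000)
print(f"Expected loss: {el:,.2f}")
```

The multiplication is trivial. Every input to it is a model output, and that is where the rest of the work goes.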
The keystone: leakage discipline
The single thing that separates a credible credit model from a leaky toy is point-in-time correctness. Features must use only information available at the observation date. The target flags whether the loan defaults in the 12 months after that date. And the train/test split is out-of-time: older loan vintages train the model, newer vintages test it, and there is never a random shuffle.
Why this matters: a random split lets the model peek at the future. Loans from the same vintage and the same economic period land on both sides of the split, so the test set looks just like the training set, and the model reports a Gini that evaporates the moment you deploy it on next quarter's originations. Out-of-time validation is the only honest test of "will this work on loans I have not seen yet." It is the keystone, and almost every notebook-grade credit model gets it wrong.
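A minimal sketch of the two rules above, a 12-month forward-looking target and a split by vintage, might look like this. The `Loan` fields, the helper names, and the sample loans are assumptions made for illustration, and dates are assumed to be first-of-month values as in monthly performance data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Loan:
    loan_id: str
    vintage: date                 # origination month (first of month)
    observation: date             # date at which features are frozen
    default_date: Optional[date]  # first default month, if any


def add_months(d: date, months: int) -> date:
    """Shift a first-of-month date forward by whole months."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def target_12m(loan: Loan, horizon_months: int = 12) -> int:
    """1 if default falls after the observation date and within the horizon."""
    if loan.default_date is None:
        return 0
    window_end = add_months(loan.observation, horizon_months)
    return int(loan.observation < loan.default_date <= window_end)


def out_of_time_split(loans: List[Loan], cutoff: date) -> Tuple[List[Loan], List[Loan]]:
    """Older vintages train, vintages on or after the cutoff test. No shuffling."""
    train = [l for l in loans if l.vintage < cutoff]
    test = [l for l in loans if l.vintage >= cutoff]
    return train, test


# Made-up loans for illustration.
loans = [
    Loan("A", date(2018, 3, 1), date(2019, 3, 1), date(2019, 9, 1)),
    Loan("B", date(2018, 6, 1), date(2019, 6, 1), None),
    Loan("C", date(2020, 1, 1), date(2021, 1, 1), date(2022, 6, 1)),  # outside window
    Loan("D", date(2020, 5, 1), date(2021, 5, 1), date(2022, 5, 1)),  # on window edge
]
train, test = out_of_time_split(loans, cutoff=date(2020, 1, 1))
targets = {l.loan_id: target_12m(l) for l in loans}
```

A stricter variant would also drop training rows whose 12-month outcome window runs past the cutoff, so that no training label depends on events from the test period.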
PD: a scorecard and a challenger, on purpose
CreditForge trains two PD models side by side:
- A WoE scorecard — weight-of-evidence binning plus logistic regression. It is monotonic, interpretable, and the format regulators have trusted for decades. This is the model you can defend in a room.
- A LightGBM challenger — the gradient-boosted model that tells you how much signal a more flexible learner can find, so you know what the interpretable scorecard is leaving on the table.
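For readers who have not met weight of evidence before, here is a rough sketch of the per-bin calculation behind the scorecard. It assumes the ln(share of goods / share of bads) sign convention (some texts flip it), and the bin labels and counts are invented for illustration.

```python
import math
from typing import Dict, Tuple


def woe_table(bins: Dict[str, Tuple[int, int]]) -> Tuple[Dict[str, float], float]:
    """bins maps a bin label to (n_good, n_bad). Returns (WoE per bin, IV)."""
    total_good = sum(g for g, _ in bins.values())
    total_bad = sum(b for _, b in bins.values())
    woe: Dict[str, float] = {}
    information_value = 0.0
    for label, (good, bad) in bins.items():
        if good == 0 or bad == 0:
            # Real binning code would merge or smooth sparse bins instead.
            raise ValueError(f"bin {label!r} has an empty class")
        dist_good = good / total_good
        dist_bad = bad / total_bad
        woe[label] = math.log(dist_good / dist_bad)
        information_value += (dist_good - dist_bad) * woe[label]
    return woe, information_value


# Hypothetical credit-score bins: (non-defaulters, defaulters).
score_bins = {
    "<660": (400, 60),
    "660-719": (1200, 30),
    "720+": (2400, 10),
}
woe, iv = woe_table(score_bins)
for label, value in woe.items():
    print(f"{label:>8}: WoE = {value:+.3f}")
print(f"IV = {iv:.3f}")
```

The logistic regression is then fitted on the WoE values rather than the raw feature. WoE that rises or falls steadily across ordered bins, as it does here, is what makes the scorecard monotonic.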
Both are then passed through isotonic calibration, because a credit decision needs a calibrated probability, not just a good ranking. A model that ranks borrowers perfectly but predicts 5% when the true rate is 15% will price every loan wrong. Discrimination tells you the order; calibration tells you the level. You need both, and most people only measure the first.
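To show what isotonic calibration does mechanically, here is a small pure-Python sketch of the pool-adjacent-violators idea. It is a simplified step-function version that assumes distinct raw scores. A library implementation such as scikit-learn's `IsotonicRegression` handles ties and interpolation, so treat this as an illustration rather than the CreditForge code.

```python
import bisect
from typing import List, Sequence, Tuple


def fit_isotonic(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return [(upper_score_bound, calibrated_rate), ...] with non-decreasing rates."""
    blocks: List[List[float]] = []  # each block: [sum_of_labels, count, max_score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1.0, score])
        # Pool adjacent blocks while the earlier one has the higher default rate.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def predict_isotonic(model: List[Tuple[float, float]], score: float) -> float:
    """Look up the calibrated rate of the first block whose bound covers the score."""
    uppers = [upper for upper, _ in model]
    index = min(bisect.bisect_left(uppers, score), len(model) - 1)
    return model[index][1]


# Toy raw scores and observed outcomes (1 = default), for illustration only.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
calibrator = fit_isotonic(raw_scores, outcomes)
calibrated = [predict_isotonic(calibrator, s) for s in raw_scores]
```

The mapping never reorders borrowers, so discrimination is preserved up to ties while the predicted level is pulled toward the observed default rate. In practice the calibrator is fitted on held-out data, not on the rows used to train the PD model.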
LGD, EAD, and the full lifecycle
PD is one of three terms. CreditForge models LGD with a two-stage approach and EAD through a credit conversion factor (CCF), then combines all three into Expected Loss. The whole thing runs as a data lifecycle, not an agent graph:
Bronze (raw, vintage-partitioned) → Silver (cleaned, point-in-time flag, performance joined) → Gold (leakage-safe feature matrix + target) → out-of-time split → PD scorecard + challenger → calibration → LGD + EAD → Expected Loss → validation → governance → serving + drift monitoring
It reads the Freddie Mac Single-Family loan and monthly-performance schema. Until you drop real GSE files in, a seeded synthetic generator produces the same schema, so the entire stack runs offline and reproducibly.
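As a sketch of how the LGD and EAD pieces can feed Expected Loss, the snippet below uses one common reading of "two-stage" LGD: the probability that a default produces any loss, times the expected severity when it does. That decomposition, the standard drawn-plus-CCF-times-undrawn form for EAD, and every number here are assumptions for illustration. The post does not spell out CreditForge's exact formulation.

```python
def two_stage_lgd(p_loss_given_default: float, severity_given_loss: float) -> float:
    """Stage 1: chance a default leads to a loss. Stage 2: loss severity if so."""
    return p_loss_given_default * severity_given_loss


def ead_from_ccf(drawn: float, undrawn: float, ccf: float) -> float:
    """Exposure at default = drawn balance + CCF x undrawn commitment."""
    return drawn + ccf * undrawn


# Illustrative inputs only.
lgd = two_stage_lgd(p_loss_given_default=0.6, severity_given_loss=0.5)
ead = ead_from_ccf(drawn=80_000, undrawn=20_000, ccf=0.5)
pd_calibrated = 0.04
el_account = pd_calibrated * lgd * ead
print(f"LGD={lgd:.2f}  EAD={ead:,.0f}  EL={el_account:,.2f}")
```

For a fully drawn, amortizing mortgage the undrawn amount is zero, so EAD reduces to the outstanding balance. The CCF term matters for revolving products.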
Validation and governance — the part that proves maturity
This is where bank-credible diverges from Kaggle-credible. CreditForge ships a full validation and governance suite:
- Discrimination — Gini, KS, gains/lift curves.
- Calibration — reliability curves and Hosmer-Lemeshow.
- Stability — PSI and CSI to catch population and characteristic drift.
- Benchmarking — the challenger as a yardstick for the scorecard.
- Adverse-action reason codes — because if you decline someone for credit, regulation says you must tell them why, per-applicant.
- Fairness testing across protected groups, plus a model card documenting it all.
- CI performance gates — Gini, calibration, and PSI thresholds that fail the build on regression, so the model cannot silently degrade.
- Scheduled drift monitoring in production.
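Three of the metrics in this list, and the idea of gating a build on them, can be sketched in a few lines. The function names, the toy data, and the thresholds (a 0.40 Gini floor and a 0.25 PSI ceiling, a common rule of thumb) are assumptions for illustration. They are not CreditForge's actual gates.

```python
import math
from typing import List, Sequence


def _split(scores: Sequence[float], labels: Sequence[int]):
    bads = [s for s, y in zip(scores, labels) if y == 1]
    goods = [s for s, y in zip(scores, labels) if y == 0]
    return bads, goods


def gini(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Gini = 2 * AUC - 1, with AUC counted pairwise (fine for small samples)."""
    bads, goods = _split(scores, labels)
    wins = sum((b > g) + 0.5 * (b == g) for b in bads for g in goods)
    return 2.0 * wins / (len(bads) * len(goods)) - 1.0


def ks_statistic(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Largest gap between the cumulative score distributions of bads and goods."""
    bads, goods = _split(scores, labels)
    return max(
        abs(sum(b <= t for b in bads) / len(bads) - sum(g <= t for g in goods) / len(goods))
        for t in sorted(set(scores))
    )


def psi(expected_counts: Sequence[int], actual_counts: Sequence[int], eps: float = 1e-6) -> float:
    """Population Stability Index over matching bins; eps guards empty bins."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        value += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return value


def performance_gate(gini_value: float, psi_value: float,
                     min_gini: float = 0.40, max_psi: float = 0.25) -> List[str]:
    """Return a list of failures; a CI job would exit non-zero if it is non-empty."""
    failures = []
    if gini_value < min_gini:
        failures.append(f"Gini {gini_value:.3f} below floor {min_gini:.2f}")
    if psi_value > max_psi:
        failures.append(f"PSI {psi_value:.3f} above ceiling {max_psi:.2f}")
    return failures


# Toy out-of-time scores (higher = riskier) and outcomes (1 = default).
test_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
test_labels = [1, 1, 0, 1, 0, 0, 1, 0]
g = gini(test_scores, test_labels)
ks = ks_statistic(test_scores, test_labels)
psi_stable = psi([100, 200, 300, 400], [100, 200, 300, 400])
psi_shifted = psi([100, 200, 300, 400], [400, 300, 200, 100])
print(performance_gate(g, psi_stable))   # no failures
print(performance_gate(g, psi_shifted))  # PSI failure
```

The same `psi` function applied to a single feature's bins instead of the score bins gives a characteristic-level check, which is the usual reading of CSI.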
The Risk Copilot
On top of the platform sits a small agent team — a Portfolio Analyst, a Model Validator, and a Fairness Officer — each with a focused toolbelt over the real platform tools: scoring, portfolio slices, validation metrics, SHAP drivers, fairness checks. An orchestrator routes a question to the right specialist, and answers come back with interactive charts the agents emit and a transparent tool trace.
The discipline here is the same one that runs through everything I build: the numbers come from the models, never from the LLM. The agents orchestrate and explain; the classical ML core computes. SHAP explains the drivers; the LLM narrates them. It never invents a Gini.
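The "numbers come from the models" rule can be pictured as a thin routing layer where only tool functions produce values and every call is recorded. The sketch below is hypothetical throughout: keyword matching stands in for the LLM orchestrator, the tool functions return placeholder constants instead of calling real models, and none of the names are taken from the CreditForge codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Placeholder tools. A real platform tool would compute these from the models.
def compute_gini() -> float:
    return 0.52  # placeholder value, not a CreditForge result


def compute_psi() -> float:
    return 0.08  # placeholder value, not a CreditForge result


def approval_rate_gap() -> float:
    return 0.03  # placeholder value, not a CreditForge result


@dataclass
class Specialist:
    name: str
    keywords: List[str]
    tools: Dict[str, Callable[[], float]]


@dataclass
class Answer:
    specialist: str
    numbers: Dict[str, float] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


def answer(question: str, team: List[Specialist]) -> Answer:
    q = question.lower()
    # Keyword routing is a stand-in for the orchestrator's routing decision.
    chosen = next((s for s in team if any(k in q for k in s.keywords)), team[0])
    result = Answer(specialist=chosen.name)
    for tool_name, tool in chosen.tools.items():
        value = tool()  # every number originates from a tool call
        result.numbers[tool_name] = value
        result.trace.append(f"{chosen.name} -> {tool_name}() = {value}")
    return result


team = [
    Specialist("Model Validator", ["gini", "psi", "drift"],
               {"gini": compute_gini, "psi": compute_psi}),
    Specialist("Fairness Officer", ["fair", "protected"],
               {"approval_rate_gap": approval_rate_gap}),
]
reply = answer("What is the current Gini?", team)
print(reply.specialist, reply.numbers)
print("\n".join(reply.trace))
```

The narration layer would receive only `reply.numbers` and `reply.trace`, so any figure in the final text can be traced back to a tool call.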
The whole app ships as one Docker image — FastAPI serving the API under one path, a pre-built Next.js "Risk Cockpit" served at the root, one origin, no CORS — deployed free on Hugging Face Spaces.
What this taught me
- The model is the easy 20%. Leakage discipline, out-of-time validation, calibration, reason codes, fairness, and drift monitoring are the 80% that make it real.
- Calibration matters as much as discrimination for decisions. A well-ordered but miscalibrated model prices everything wrong.
- Explainability is not a nice-to-have in lending: adverse-action rules require lenders to give declined applicants the specific reasons. Reason codes and a model card are product requirements, not afterthoughts.
Credit risk is exactly where machine learning meets regulation, and that intersection is where the interesting, durable engineering lives. CreditForge is my full-stack answer to the question "could you actually build this inside a bank?"
Let's talk
I'm pivoting from manufacturing AI to finance — open to roles, mentorship, and collaborators in fintech, quant, and bank AI.