📐
Probability Calibration
Platt scaling reduces expected calibration error (ECE) by 67–68% on ClinTox.
Tox21 RF is already well calibrated (ECE 0.018), so no post-hoc correction is needed.
Scaffold split worsens calibration, raising ECE by 27–35%.
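
For readers unfamiliar with the two quantities above, here is a minimal sketch of ECE and Platt scaling in plain NumPy. It is illustrative only: the data is synthetic (not ClinTox or Tox21), the 10 equal-width bins and the gradient-descent fit are assumptions, and the names `expected_calibration_error` and `fit_platt` are hypothetical helpers rather than the project's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins: weighted mean |accuracy - confidence|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    idx = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

def fit_platt(scores, y, lr=0.1, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = sigmoid(a * scores + b)
        a -= lr * np.mean((p - y) * scores)
        b -= lr * np.mean(p - y)
    return a, b

# Synthetic, deliberately overconfident scores (assumption for the demo).
rng = np.random.default_rng(0)
z = rng.normal(size=4000)
y = (rng.random(4000) < sigmoid(z)).astype(float)
scores = 3.0 * z

# Fit the calibrator on one half, evaluate on the other half.
cal, test = slice(0, 2000), slice(2000, 4000)
a, b = fit_platt(scores[cal], y[cal])
ece_before = expected_calibration_error(y[test], sigmoid(scores[test]))
ece_after = expected_calibration_error(y[test], sigmoid(a * scores[test] + b))
print(f"ECE before: {ece_before:.3f}, after Platt scaling: {ece_after:.3f}")
```

Fitting the calibrator on held-out data, as in the split above, keeps the reported ECE from being optimistic.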
🎯
Applicability Domain
Nearest-neighbor Tanimoto similarity < 0.4 → AUROC drops to 0.63–0.71.
Mean nearest-neighbor similarity: 0.41 (scaffold split) vs 0.58 (random split).
Practical abstention threshold identified.
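
A minimal sketch of a similarity-based abstention rule is shown below. Several things here are assumptions for illustration: fingerprints are represented as plain Python sets of on-bit indices, the helper names are hypothetical, and the 0.4 cutoff is borrowed from the similarity bucket quoted above, not from the identified abstention threshold, which this summary does not state.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union

def nearest_neighbor_similarity(query_fp, train_fps):
    """Highest Tanimoto similarity between a query and any training molecule."""
    return max(tanimoto(query_fp, fp) for fp in train_fps)

def should_abstain(query_fp, train_fps, threshold=0.4):
    """Abstain when the query lies outside the assumed applicability domain."""
    return nearest_neighbor_similarity(query_fp, train_fps) < threshold

# Toy fingerprints (hypothetical bit indices, not real molecules).
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
near_query = {1, 2, 3, 5}   # shares 3 of 5 distinct bits with the first entry
far_query = {20, 21, 22}    # shares nothing with the training set

print(nearest_neighbor_similarity(near_query, train_fps))
print(should_abstain(near_query, train_fps), should_abstain(far_query, train_fps))
```

In practice the sets would be replaced by a cheminformatics toolkit's fingerprint objects and its bulk similarity routine; the decision rule itself stays the same.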
🔬
Uncertainty & Scaffold Errors
Ensemble uncertainty (5 RF seeds) improves AUROC by ~0.030 when keeping only the 50% most confident predictions.
ClinTox: 73% of scaffold groups fail (AUROC < 0.6), indicating broad structural generalization failure.
+0.030 AUROC gain at 50% coverage
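
The selective-prediction idea behind this number can be sketched as follows. This is an assumption-laden illustration, not the project's pipeline: uncertainty is taken to be the standard deviation of predicted probabilities across ensemble members, the helper names are hypothetical, and the toy array stands in for the outputs of five differently seeded random forests.

```python
import numpy as np

def select_confident(member_probs, coverage=0.5):
    """Return ensemble mean and a mask keeping the least-uncertain fraction.

    member_probs has shape (n_members, n_samples).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean_prob = member_probs.mean(axis=0)
    uncertainty = member_probs.std(axis=0)  # disagreement across members
    n_keep = int(np.ceil(coverage * member_probs.shape[1]))
    keep_idx = np.argsort(uncertainty)[:n_keep]
    keep = np.zeros(member_probs.shape[1], dtype=bool)
    keep[keep_idx] = True
    return mean_prob, keep

def auroc(y_true, scores):
    """AUROC as the probability a positive outranks a negative (ties count half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true == 1][:, None] - scores[y_true == 0][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy predictions: rows are 5 ensemble members, columns are 4 molecules.
member_probs = np.array([
    [0.9, 0.1, 0.1, 0.4],
    [0.9, 0.9, 0.1, 0.6],
    [0.9, 0.5, 0.1, 0.5],
    [0.9, 0.3, 0.1, 0.5],
    [0.9, 0.7, 0.1, 0.5],
])
y_true = np.array([1, 0, 0, 1])

mean_prob, keep = select_confident(member_probs, coverage=0.5)
print("kept:", keep)
print("AUROC on all:", auroc(y_true, mean_prob))
print("AUROC on kept half:", auroc(y_true[keep], mean_prob[keep]))
```

Comparing AUROC on the full set with AUROC on the retained subset, at a fixed coverage, is what a gain such as the one reported above measures.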