Benchmark Paper · Computational Toxicology
ToxBench
A leakage-safe benchmark for predictive toxicology with calibration, uncertainty & domain-shift analysis
10,267 Compounds · 41 Tasks · 120 Conditions · 4 Models
STEP 01
Dataset Acquisition
Tox21 · ClinTox · SIDER via DeepChem MoleculeNet 2.8.0
3 Datasets
STEP 02
Standardization
RDKit salt removal · charge neutralization · canonical SMILES
0 Failed
STEP 03
Deduplication
Exact + conflicting-label removal before any split is created
350 Removed
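The deduplication step can be sketched as follows. This is a minimal illustration, not the ToxBench code: it assumes each record is a (canonical SMILES, label) pair produced by the standardization step, that exact duplicates keep their first occurrence, and that every copy of a compound with conflicting labels is dropped. The function name `deduplicate` and the toy records are hypothetical.

```python
from collections import defaultdict


def deduplicate(records):
    """Drop exact duplicates and conflicting-label compounds.

    records: list of (canonical_smiles, label) pairs; label must be hashable
    (for multi-task data it could be a tuple of task labels).
    Returns (kept_records, n_removed).
    """
    # Collect every distinct label seen for each structure.
    labels_by_smiles = defaultdict(set)
    for smiles, label in records:
        labels_by_smiles[smiles].add(label)

    kept, seen = [], set()
    for smiles, label in records:
        if len(labels_by_smiles[smiles]) > 1:
            continue  # conflicting labels: remove all copies (assumed policy)
        if smiles in seen:
            continue  # exact duplicate: keep only the first occurrence
        seen.add(smiles)
        kept.append((smiles, label))
    return kept, len(records) - len(kept)


# Toy records, not ToxBench data.
toy = [("CCO", 1), ("CCO", 1), ("c1ccccc1", 0), ("c1ccccc1", 1), ("CC(=O)O", 0)]
kept, n_removed = deduplicate(toy)
print(kept, n_removed)
```

Running this before any split is created matters because a duplicate that survives could land in both train and test.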
STEP 04
Featurization
ECFP4 2048-bit fingerprints · Molecular graph (GNN)
ECFP4 · Graph
STEP 05
Splitting
Random + Scaffold (Bemis-Murcko) · 5 seeds · leakage verified
30 Files · 0 Leaks
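A scaffold split with a leakage check might look like the sketch below. It assumes the Bemis-Murcko scaffold of each compound has already been computed elsewhere (for example with RDKit) and is passed in as a string key. The seeded shuffle of scaffold groups, the 20% test fraction, and the names `scaffold_split` and `count_leaks` are illustrative assumptions, not details confirmed by the source.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, test_frac=0.2, seed=0):
    """Assign whole scaffold groups to train or test.

    scaffolds[i] is the (precomputed) scaffold key of compound i.
    Returns (train_idx, test_idx) as sorted index lists.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Sort keys first so the shuffle depends only on the seed.
    group_list = [groups[key] for key in sorted(groups)]
    random.Random(seed).shuffle(group_list)

    n_test_target = int(round(test_frac * len(scaffolds)))
    train_idx, test_idx = [], []
    for members in group_list:
        # A group goes to test only if it fits entirely within the target size.
        if len(test_idx) + len(members) <= n_test_target:
            test_idx.extend(members)
        else:
            train_idx.extend(members)
    return sorted(train_idx), sorted(test_idx)


def count_leaks(scaffolds, train_idx, test_idx):
    """Number of scaffolds that appear in both splits (should be 0)."""
    train_keys = {scaffolds[i] for i in train_idx}
    test_keys = {scaffolds[i] for i in test_idx}
    return len(train_keys & test_keys)


# Placeholder scaffold keys, not real Bemis-Murcko scaffolds.
toy_scaffolds = ["A", "A", "B", "B", "B", "C", "D", "D", "E", "F"]
for seed in range(5):
    tr, te = scaffold_split(toy_scaffolds, seed=seed)
    print(seed, len(tr), len(te), count_leaks(toy_scaffolds, tr, te))
```

Because groups are never divided, the leak count is zero by construction; the explicit check is still useful as a guard when split files are written and reloaded.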
STEP 06
Model Training
RF · XGBoost · MLP · GNN · 5-seed cross-validation
120 Conditions
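One reading of the counts above is that the 30 split files and 120 conditions come from a full cross of datasets, split types, seeds, and models. The sketch below enumerates that grid under this assumption; the seed values are placeholders.

```python
from itertools import product

datasets = ["Tox21", "ClinTox", "SIDER"]
splits = ["random", "scaffold"]
seeds = [0, 1, 2, 3, 4]  # placeholder seed values
models = ["RF", "XGBoost", "MLP", "GNN"]

# One split file per (dataset, split type, seed).
split_files = list(product(datasets, splits, seeds))
# One training condition per (dataset, split type, seed, model).
conditions = list(product(datasets, splits, seeds, models))

print(len(split_files), len(conditions))  # 30 120
```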
Random Split (Literature Standard)
Structurally similar compounds are mixed across the train and test splits.
Tox21 AUROC: 0.804 · SIDER AUROC: 0.670 · Mean NN Sim: 0.58
Scaffold Split (ToxBench Primary)
All compounds that share a scaffold are assigned to one split only.
Tox21 AUROC: 0.747 · SIDER AUROC: 0.635 · Mean NN Sim: 0.41
Avg Drop vs. Random Split: −0.057 to −0.079
Random Forest: scaffold AUROC 0.747 ± 0.024 (0.057 below random split)
XGBoost: scaffold AUROC 0.708 ± 0.018 (0.079 below random split)
MLP: scaffold AUROC 0.723 ± 0.029 (0.068 below random split)
GNN (GIN): scaffold AUROC 0.744 ± 0.025 (0.075 below random split)
Probability Calibration
Platt scaling reduces ECE by 67–68% on ClinTox. The Tox21 RF is already well calibrated (ECE 0.018) and needs no post-hoc correction. The scaffold split increases ECE by 27–35% relative to the random split.
[Figure: ECE on Tox21 and ClinTox for raw, Platt-scaled, and isotonic-calibrated probabilities]
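For readers unfamiliar with ECE, the sketch below computes one common variant: predictions are grouped into equal-width probability bins, and the gap between mean predicted probability and observed positive rate is averaged with bin-size weights. The bin count of 10 and the function name are assumptions; the source does not state which ECE variant or binning ToxBench uses.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions.

    probs: predicted probabilities of the positive class, each in [0, 1].
    labels: matching 0/1 ground-truth labels.
    """
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        # Weight each bin's calibration gap by its share of the data.
        ece += (len(members) / n) * abs(mean_conf - frac_pos)
    return ece


# Toy values: four predictions of 0.85 with three positives.
print(expected_calibration_error([0.85] * 4, [1, 1, 1, 0]))
```

A post-hoc calibrator such as Platt scaling or isotonic regression would be fitted on held-out data, and the same function applied to the calibrated probabilities to measure the change.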
Applicability Domain
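When a test compound's nearest-neighbor Tanimoto similarity to the training set is below 0.4, AUROC drops to 0.63–0.71. Mean nearest-neighbor similarity is 0.41 under the scaffold split versus 0.58 under the random split. A practical abstention threshold is identified.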
[Figure: AUROC by similarity bin (0.0–0.4, 0.4–0.6, 0.6–1.0)]
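The applicability-domain analysis can be illustrated with a small sketch. It assumes each fingerprint is represented as a set of on-bit indices, and it uses 0.4 as the abstention cutoff only because that is the lowest bin boundary reported above; all function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_neighbor_similarity(test_fps, train_fps):
    """For each test fingerprint, the highest similarity to any training fingerprint."""
    return [max(tanimoto(fp, t) for t in train_fps) for fp in test_fps]


def similarity_bin(sim):
    """Map a similarity to the three bins used in the figure."""
    if sim < 0.4:
        return "0.0-0.4"
    if sim < 0.6:
        return "0.4-0.6"
    return "0.6-1.0"


def should_abstain(sim, threshold=0.4):
    """Abstain when the compound is too far from the training set (assumed cutoff)."""
    return sim < threshold


# Toy fingerprints, not real ECFP4 bit vectors.
train = [{1, 2, 3}, {5, 6}]
test = [{1, 2}, {7, 8}]
for sim in nearest_neighbor_similarity(test, train):
    print(round(sim, 3), similarity_bin(sim), should_abstain(sim))
```

Grouping test predictions by `similarity_bin` and computing AUROC per group gives the per-bin breakdown shown in the figure.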
Uncertainty & Scaffold Errors
Ensemble uncertainty (5 RF seeds) improves AUROC by ~0.030 when only the 50% most confident predictions are kept. On ClinTox, 73% of scaffold groups fail (AUROC < 0.6), indicating a broad failure of structural generalization.
73% of scaffolds with AUROC < 0.6 · +0.030 AUROC gain at 50% coverage
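Selective prediction with ensemble uncertainty can be sketched as below. The sketch assumes that uncertainty is the standard deviation of predicted probabilities across ensemble members and that "most confident" means lowest standard deviation; the source does not specify the exact uncertainty measure. AUROC is computed by direct pairwise comparison, which is fine for small examples but slow for large test sets.

```python
from statistics import mean, pstdev


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_summary(member_probs):
    """member_probs[m][i] is member m's probability for compound i."""
    per_compound = list(zip(*member_probs))
    means = [mean(p) for p in per_compound]
    stds = [pstdev(p) for p in per_compound]
    return means, stds


def selective_auroc(member_probs, labels, coverage=0.5):
    """AUROC on the `coverage` fraction of compounds with the lowest ensemble std."""
    means, stds = ensemble_summary(member_probs)
    n_keep = max(1, int(round(coverage * len(labels))))
    keep = sorted(range(len(labels)), key=lambda i: stds[i])[:n_keep]
    return auroc([means[i] for i in keep], [labels[i] for i in keep])


# Toy ensemble of two members on four compounds, not ToxBench results.
members = [[0.9, 0.1, 0.6, 0.4],
           [0.9, 0.1, 0.2, 0.8]]
labels = [1, 0, 1, 0]
print(selective_auroc(members, labels, coverage=1.0))
print(selective_auroc(members, labels, coverage=0.5))
```

In the toy data the two compounds on which the members disagree are also the ones ranked incorrectly, so dropping them raises AUROC; real gains depend on how well disagreement tracks error.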