ToxBench — Graphical Abstract

01 — Data Pipeline

STEP 01

⬇

Dataset Acquisition

Tox21 · ClinTox · SIDER via DeepChem MoleculeNet 2.8.0

3 Datasets

STEP 02

⚗

Standardization

RDKit salt removal · charge neutralization · canonical SMILES

0 Failed

STEP 03

🔍

Deduplication

Exact duplicates and conflicting-label entries removed before any split is created

350 Removed
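The deduplication step can be illustrated with a minimal sketch. It assumes records are already reduced to canonical SMILES, and that a structure whose copies disagree on labels is dropped entirely rather than resolved; the function name `deduplicate` and the toy records are hypothetical.

```python
from collections import defaultdict


def deduplicate(records):
    """Drop exact duplicates and any structure whose copies disagree on labels.

    `records` is a list of (canonical_smiles, labels) pairs, where `labels`
    is a tuple. Returns (kept_records, n_removed).
    """
    groups = defaultdict(list)
    for smiles, labels in records:
        groups[smiles].append(tuple(labels))

    kept = []
    for smiles, label_list in groups.items():
        if len(set(label_list)) == 1:
            # All copies agree: keep a single representative.
            kept.append((smiles, label_list[0]))
        # Otherwise the copies conflict and the structure is dropped
        # entirely (an assumption of this sketch).
    return kept, len(records) - len(kept)


# Toy data, not taken from the benchmark.
records = [
    ("CCO", (0,)),
    ("CCO", (0,)),         # exact duplicate
    ("c1ccccc1", (1,)),
    ("c1ccccc1", (0,)),    # conflicting label
    ("CC(=O)O", (1,)),
]
kept, n_removed = deduplicate(records)
```

Running this before splitting means a duplicated structure can never end up in both train and test.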

STEP 04

🧬

Featurization

ECFP4 2048-bit fingerprints · Molecular graph (GNN)

ECFP4 · Graph

STEP 05

✂

Splitting

Random + Scaffold (Bemis-Murcko) · 5 seeds · leakage verified

30 Files · 0 Leaks
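A scaffold split with a leakage check might look like the sketch below. It assumes one Bemis-Murcko scaffold string per molecule has already been computed (for example with RDKit) and uses a simplified two-way train/test assignment; `scaffold_split`, `count_leaks`, and the toy scaffold labels are hypothetical names, not the benchmark's code.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_frac=0.2):
    """Assign whole scaffold groups to train or test.

    `scaffolds` holds one precomputed scaffold string per molecule.
    Returns (train_idx, test_idx) as sorted lists of indices.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first: big scaffold families fill the training set,
    # smaller ones tend to land in the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_train_target = len(scaffolds) - round(len(scaffolds) * test_frac)
    train_idx, test_idx = [], []
    for group in ordered:
        if len(train_idx) + len(group) <= n_train_target:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return sorted(train_idx), sorted(test_idx)


def count_leaks(scaffolds, train_idx, test_idx):
    """Number of scaffolds that appear on both sides of the split."""
    train_scaffolds = {scaffolds[i] for i in train_idx}
    test_scaffolds = {scaffolds[i] for i in test_idx}
    return len(train_scaffolds & test_scaffolds)


# Toy scaffold labels standing in for real scaffold SMILES.
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, test_idx = scaffold_split(scaffolds, test_frac=0.2)
leaks = count_leaks(scaffolds, train_idx, test_idx)
```

Because groups are assigned whole, `count_leaks` returns zero by construction; running it on saved split files is still a cheap check that nothing was shuffled afterwards.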

STEP 06

🤖

Model Training

RF · XGBoost · MLP · GNN · 5-seed cross-validation

120 Conditions
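One reading of the counts above is that the 30 split files and 120 training conditions are the full cross product of datasets, split types, models, and seeds. The sketch below checks that arithmetic; the seed values 0 to 4 and the lowercase split names are assumptions.

```python
from itertools import product

datasets = ["Tox21", "ClinTox", "SIDER"]
splits = ["random", "scaffold"]
models = ["RF", "XGBoost", "MLP", "GNN"]
seeds = range(5)  # assumed seed values; only the count of 5 is stated

# One split file per dataset, split type, and seed.
split_files = list(product(datasets, splits, seeds))

# One training run per dataset, split type, model, and seed.
conditions = list(product(datasets, splits, models, seeds))

print(len(split_files))  # 3 * 2 * 5 = 30
print(len(conditions))   # 3 * 2 * 4 * 5 = 120
```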

02 — Core Finding: Scaffold vs Random Split

Random Split

Literature Standard

● Train    ● Test: structurally similar compounds mixed across splits
          

Tox21 AUROC

0.804

SIDER AUROC

0.670

Mean NN Sim

0.58

Scaffold Split

ToxBench Primary

● Train    ● Test: all same-scaffold compounds assigned to one split only
          

Tox21 AUROC

0.747

SIDER AUROC

0.635

Mean NN Sim

0.41

Avg Drop

−0.057 to −0.079

03 — Model Performance (Tox21, Scaffold Split)

Random Forest

SCAFFOLD AUROC

0.747 ± 0.024

↓ 0.057 vs random

XGBoost

SCAFFOLD AUROC

0.708 ± 0.018

↓ 0.079 vs random

MLP

SCAFFOLD AUROC

0.723 ± 0.029

↓ 0.068 vs random

GNN (GIN)

SCAFFOLD AUROC

0.744 ± 0.025

↓ 0.075 vs random

04 — Complementary Analyses

📐

Probability Calibration

Platt scaling reduces ECE by 67–68% on ClinTox. The Tox21 RF is already well calibrated (ECE 0.018), so no post-hoc correction is needed. Scaffold splitting increases ECE by 27–35%.

[Chart: Tox21 and ClinTox, Raw / Platt / Isotonic]
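For reference, expected calibration error (ECE) can be computed as in the sketch below. It assumes the common equal-width binning of the predicted positive-class probability with 10 bins; the binning actually used for the numbers above is not stated, and the toy inputs are illustrative only.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions using equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the share of samples it holds.
        ece += (len(members) / n) * abs(mean_conf - frac_pos)
    return ece


# Toy example: overconfident scores vs. scores matching the base rate.
labels = [1, 1, 0, 0]
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], labels)
ece_matched = expected_calibration_error([0.5, 0.5, 0.5, 0.5], labels)
```

A relative reduction such as the 67–68% quoted for Platt scaling would then be `(ece_raw - ece_calibrated) / ece_raw`, evaluated on held-out data.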
          

🎯

Applicability Domain

When the nearest-neighbor Tanimoto similarity to the training set falls below 0.4, AUROC drops to 0.63–0.71. Mean NN similarity is 0.41 under the scaffold split vs 0.58 under the random split. A practical abstention threshold is identified.

[Chart: AUROC by similarity bin (0.0–0.4, 0.4–0.6, 0.6–1.0)]
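A similarity-based abstention rule could be sketched as follows. Fingerprints are represented as sets of on-bit indices, and the 0.4 cutoff is taken from the lowest similarity bin; treating that bin edge as the abstention threshold is an assumption, as are the helper names and toy fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_neighbor_similarity(query, training_set):
    """Highest Tanimoto similarity between the query and any training item."""
    return max(tanimoto(query, ref) for ref in training_set)


def should_abstain(query, training_set, threshold=0.4):
    """Abstain when the query is too far from everything seen in training."""
    return nearest_neighbor_similarity(query, training_set) < threshold


# Toy fingerprints, not real ECFP4 bit sets.
train_fps = [{1, 2, 3, 4}, {3, 4, 5, 6}]
in_domain = {1, 2, 3, 5}       # shares 3 of 5 bits with the first training item
out_of_domain = {1, 7, 8, 9}   # shares at most 1 of 7 bits

sim_in = nearest_neighbor_similarity(in_domain, train_fps)
sim_out = nearest_neighbor_similarity(out_of_domain, train_fps)
```

Averaging `nearest_neighbor_similarity` over a test set gives the "Mean NN Sim" style of statistic reported for each split.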
          

🔬

Uncertainty & Scaffold Errors

Ensemble uncertainty (5 RF seeds) improves AUROC by ~0.030 when only the 50% most confident predictions are kept. On ClinTox, 73% of scaffold groups fail (AUROC < 0.6), indicating broad structural generalization failure.

73% of scaffolds with AUROC < 0.6

+0.030 AUROC gain at 50% coverage
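The coverage-based AUROC gain can be illustrated with a small selective-prediction sketch. It assumes uncertainty is measured as the standard deviation of predicted probabilities across ensemble members, which is one common choice and not necessarily the one used here; the toy ensemble has 3 members instead of 5 for brevity, and all names are hypothetical.

```python
from statistics import mean, pstdev


def auroc(scores, labels):
    """AUROC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def selective_auroc(member_probs, labels, coverage=0.5):
    """AUROC on the `coverage` fraction of samples with lowest ensemble spread.

    `member_probs[m][i]` is the probability from member m for sample i.
    """
    per_sample = list(zip(*member_probs))
    means = [mean(p) for p in per_sample]
    spreads = [pstdev(p) for p in per_sample]
    n_keep = max(1, round(coverage * len(labels)))
    # Lowest spread first, i.e. the most confident predictions.
    keep = sorted(range(len(labels)), key=lambda i: spreads[i])[:n_keep]
    return auroc([means[i] for i in keep], [labels[i] for i in keep])


# Toy ensemble: members agree on samples 0-3 and disagree on samples 4-7.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
member_probs = [
    [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.6, 0.4],
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.2, 0.8],
    [0.9, 0.1, 0.8, 0.2, 0.2, 0.9, 0.4, 0.5],
]
full = selective_auroc(member_probs, labels, coverage=1.0)
half = selective_auroc(member_probs, labels, coverage=0.5)
```

On this toy data the samples where members disagree are also the ones ranked wrongly, so dropping them raises AUROC; the gain on real data depends on how well ensemble spread tracks actual errors.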