CAPA replaces coarse HLA match / mismatch scores with continuous ESM-2 embeddings, then predicts graft-versus-host disease (GvHD), relapse, and transplant-related mortality (TRM) as competing risks via cross-attention and DeepHit.
How it works
Three stages transform raw HLA typing into calibrated, interpretable competing-risk predictions.
Donor and recipient alleles at five loci (A, B, C, DRB1, DQB1) are looked up in the IPD-IMGT/HLA database to retrieve their full protein sequences.
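The lookup step depends on normalising allele names first. The following is a minimal sketch of that normalisation, assuming standard HLA nomenclature (`locus*field:field...` with an optional expression suffix). The names `parse_allele`, `two_field_key`, `lookup_sequence`, and `TOY_SEQUENCES` are hypothetical and not taken from the CAPA code base, and the toy table stands in for a local copy of the IPD-IMGT/HLA sequences.

```python
import re
from typing import Dict, NamedTuple, Tuple

# The five loci named in the text above.
LOCI = ("A", "B", "C", "DRB1", "DQB1")

# Accepts e.g. "A*02:01", "HLA-DRB1*15:01:01:01", or "B*07:02N".
_ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<locus>[A-Z]+\d*)\*"
    r"(?P<fields>\d{2,3}(?::\d{2,3}){0,3})"
    r"(?P<suffix>[NLSCAQ])?$"
)


class Allele(NamedTuple):
    locus: str
    fields: Tuple[str, ...]
    suffix: str


def parse_allele(name: str) -> Allele:
    """Split an HLA allele name into locus, colon-separated fields, and suffix."""
    match = _ALLELE_RE.match(name.strip().upper())
    if match is None:
        raise ValueError(f"not a recognisable HLA allele name: {name!r}")
    locus = match.group("locus")
    if locus not in LOCI:
        raise ValueError(f"locus {locus!r} is not one of {LOCI}")
    fields = tuple(match.group("fields").split(":"))
    return Allele(locus, fields, match.group("suffix") or "")


def two_field_key(allele: Allele) -> str:
    """Reduce an allele to two-field resolution, e.g. 'A*02:01'."""
    return f"{allele.locus}*{':'.join(allele.fields[:2])}"


def lookup_sequence(name: str, table: Dict[str, str]) -> str:
    """Fetch a protein sequence by two-field key; raises KeyError if absent."""
    return table[two_field_key(parse_allele(name))]


# Assumption: a toy table with a placeholder string, not a real database record.
TOY_SEQUENCES = {"A*02:01": "ACDEFGHIK"}
```

Looking up at two-field resolution is an assumption made here for illustration; the resolution CAPA actually uses is not stated above.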
Each amino-acid sequence is encoded by frozen ESM-2 (650M parameters) into a 1,280-dimensional vector. Immunologically similar alleles cluster together.
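To illustrate how a variable-length sequence can become one fixed-size vector, the sketch below mean-pools per-residue embeddings and compares two pooled vectors by cosine similarity. Mean pooling is an assumption, since the pooling strategy is not specified above. The functions `mean_pool` and `cosine_similarity` are hypothetical helpers, and in a real pipeline the per-residue rows would come from ESM-2 rather than being supplied by hand.

```python
import math
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors (one row per amino acid) into one vector."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(row[j] for row in residue_embeddings) / n for j in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

With real embeddings each row would have 1,280 entries; the functions are dimension-agnostic, so small toy vectors are enough to try them out.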
A cross-attention network models donor–recipient allele interactions. DeepHit jointly outputs cumulative incidence curves for GvHD, relapse, and TRM.
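The two ideas in this stage can be sketched in a few lines. The first function is plain single-head scaled dot-product cross-attention, with recipient allele vectors as queries and donor allele vectors as keys and values; it omits the learned projections a trained network would have, and the choice of which side supplies the queries is an assumption. The second turns a DeepHit-style joint probability table (one row per cause, one column per time bin) into cumulative incidence curves by summing over time. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: Sequence[Vector], keys: Sequence[Vector], values: Sequence[Vector]
) -> Tuple[List[List[float]], List[List[float]]]:
    """Each query attends over all keys; returns (outputs, attention weights)."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append(
            [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


def cumulative_incidence(pmf: Sequence[Sequence[float]]) -> List[List[float]]:
    """CIF_k(t) = sum of P(cause k, time s) over all bins s <= t."""
    cifs = []
    for row in pmf:
        running, curve = 0.0, []
        for p in row:
            running += p
            curve.append(running)
        cifs.append(curve)
    return cifs
```

The attention weights returned alongside the outputs are the quantity a donor × recipient heatmap would display: one row per recipient allele, one column per donor allele, each row summing to 1.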
Key results
Evaluated on the UCI Bone Marrow Transplant dataset (n = 187) using time-dependent C-index and Brier score.
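For readers unfamiliar with concordance under competing risks, here is a deliberately simplified sketch. It computes a cause-specific Harrell-style C-index from a single risk score per patient, which is not the time-dependent variant used in the evaluation, and it applies no censoring weights. The function name and the event coding (0 for censored, positive integers for causes) are assumptions made for illustration.

```python
from typing import Sequence


def cause_specific_c_index(
    times: Sequence[float],
    events: Sequence[int],
    risks: Sequence[float],
    cause: int,
) -> float:
    """Fraction of comparable pairs ranked correctly for one cause.

    A pair (i, j) is comparable when patient i had the event of interest
    and patient j was still under observation after that time. The pair is
    concordant when patient i was given the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if events[i] != cause:
            continue
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties count as half
    if comparable == 0:
        raise ValueError("no comparable pairs for this cause")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so values below 0.5 indicate rankings that are worse than chance.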
Illustrative competing-risk CIFs for a representative donor–recipient pair. Run the prediction tool for case-specific curves.
| Model | GvHD | Relapse | TRM |
|---|---|---|---|
| Cox-PH (cause-specific) | — | 0.75 | 0.65 |
| Fine–Gray best | — | 0.84 | 0.66 |
| DeepHit (tabular HLA) | — | 0.67 | 0.41 |
About the project
A new lens on HLA compatibility.
Outcomes of haematopoietic stem cell transplantation depend critically on HLA compatibility. The standard approach encodes this as a count of binary match / mismatch calls, discarding most of the immunological information.
CAPA was built to change that. By encoding every allele with ESM-2, a protein language model trained on 250 million sequences, and learning donor–recipient interaction through cross-attention, we obtain embeddings that reflect structural and functional similarity rather than mere categorical identity.
This is an open-source proof-of-concept, validated on 187 paediatric HSCT patients. We acknowledge the small-cohort limitation and encourage replication on larger datasets.
ESM-2 embeddings: 1,280-dim per allele, frozen 650M model.
Cross-attention: interpretable donor × recipient interaction.
DeepHit: joint competing-risks CIF output.
Open source: MIT licensed, fully reproducible.
Enter donor and recipient HLA strings and receive competing-risk curves, attention heatmaps, and SHAP feature attribution in seconds.