Benchmarking the translational potential of AI-based drug-resistance prediction from Mycobacterium tuberculosis whole-genome sequencing data
Liu, C.; Zhu, H.; Zhou, P.; Thanh, N. T.; Dat, N. Q.; Atmosukarto, I.; Cheong, I. H.; Kozlakidis, Z.; Adisasmito, W.; Zheng, X.; Wang, H.; Yang, Y.
Abstract

Background: Tuberculosis, especially drug-resistant tuberculosis (DR-TB) including multidrug-resistant (MDR) and extensively drug-resistant (XDR) strains, remains a leading cause of infectious death worldwide. The rapid accumulation of whole-genome sequencing (WGS) data has spurred numerous computational methods for predicting antimicrobial resistance in Mycobacterium tuberculosis (MTB). However, heterogeneous datasets, preprocessing pipelines, and evaluation protocols have made fair comparisons impossible and have hindered clinical translation. A critical yet missing resource is a large-scale, unified benchmark for systematically assessing and comparing existing methods.

Methods: We curated an integrated dataset pairing MTB WGS with phenotypic drug susceptibility testing (pDST) from three sources: the CRyPTIC dataset (Comprehensive Resistance Prediction for Tuberculosis: an International Consortium), a published multi-study compilation, and newly curated literature-derived datasets. The final benchmark contains 54,364 paired WGS-pDST records with broad geographic, lineage, and drug coverage. After harmonizing phenotypes and generating standardized variant features, we evaluated seven models, spanning classical machine learning and deep learning architectures, across 18 drug-level and six clinical resistance category prediction tasks.

Results: XGBoost achieved the highest mean drug-level AUPRC (0.674) and F1-score (0.620) and ranked first in AUPRC for 11 of 18 drugs, whereas WDNN achieved the highest mean AUROC. Random forest yielded the highest mean specificity (0.956) and accuracy (0.933), whereas logistic regression achieved the highest mean recall (0.774), highlighting distinct clinical trade-offs. Drug-level difficulty was highly heterogeneous: rifampicin and isoniazid were predicted robustly, whereas bedaquiline, delamanid, linezolid, and clofazimine remained persistently difficult. In clinical resistance category evaluation, RR-TB, MDR-TB, and pan-susceptibility were well predicted, but XDR-TB and other resistance categories constituted major bottlenecks.

Conclusions: Under the largest unified benchmark to date, classical machine-learning methods, particularly XGBoost, provided the strongest precision-recall and F1 performance overall, while neural models remained competitive by AUROC. Emerging drugs (bedaquiline, delamanid, linezolid, clofazimine) and XDR cases remain persistently difficult to predict, identifying key bottlenecks for future method development. This benchmark can serve as a community standard for evaluating MTB resistance prediction, and the provided evaluation pipeline offers an actionable baseline for regulatory qualification and validation of clinical decision support systems, accelerating the translation of WGS-based resistance prediction into practice.
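The trade-offs among recall, specificity, accuracy, and F1 reported above can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline: the function names (`confusion_counts`, `binary_metrics`, `macro_average`) and the toy labels are hypothetical, and the only assumption carried over from the abstract is that per-drug metrics are averaged with equal weight per drug to obtain a "mean drug-level" score.

```python
from statistics import mean


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = resistant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Threshold-based metrics for one drug."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def macro_average(per_drug):
    """Unweighted mean of each metric across drugs (assumed averaging scheme)."""
    metric_names = next(iter(per_drug.values())).keys()
    return {m: mean(scores[m] for scores in per_drug.values()) for m in metric_names}


# Toy (true, predicted) labels, invented for illustration only.
toy = {
    "rifampicin": ([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 1, 0]),
    "bedaquiline": ([1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1]),
}
per_drug = {drug: binary_metrics(t, p) for drug, (t, p) in toy.items()}
summary = macro_average(per_drug)
```

In this toy case the rarely resistant drug keeps a specificity of 0.8 and an accuracy near 0.67 even though its recall and F1 are both zero, which is one reason accuracy and specificity alone can hide failures on drugs with few resistant isolates.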