Knowledge Inclusive Machine Learning for Disease Gene Prioritisation
Gamage, C. J.; Xia, Y.; Rupasinghe, R.; Senevirathne, S.; Senanayake, D.; Malepathirana, T.; Hevapathige, A.; Corbett, M.; O'Brien, T. J.; Petrou, S.; Berkovic, S. F.; Scheffer, I. E.; Gecz, J.; Bahlo, M.; Bennett, M. F.; Halgamuge, S. K.
Abstract
The predictive performance of machine learning models depends on the context available to them. In disease gene prioritisation, this context takes two forms: specific context from sample-level experimental data, such as gene expression and protein-protein interaction networks, and general context from accumulated, curated biological knowledge that captures established relationships among genes, diseases, and pathways. Neither is sufficient alone: experimental data are sensitive to dataset-specific noise and lack broader biological grounding, while curated knowledge lacks the resolution required for gene-level discrimination. Consequently, most machine learning approaches that rely solely on experimental data risk learning spurious correlations rather than underlying biology. Here we introduce Knowledge Inclusive Machine Learning (KIML), a paradigm that integrates both context types within a unified analytical pipeline. KIML combines experimental data with two types of general context: literature-derived representations from PubMed and structured biomedical knowledge graphs. We evaluate the approach on Developmental and Epileptic Encephalopathy and benchmark it against recent methods using publicly available datasets. Performance is assessed using temporal-split evaluation and biological evaluations, including ontology enrichment analysis. KIML consistently outperforms existing approaches, providing improved predictive accuracy and biologically meaningful insights. Furthermore, the framework generates interpretable explanations of gene prioritisation and generalises well across six additional diseases.
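
The temporal-split evaluation mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the cutoff, the metric, or the data layout, so the helper names (`temporal_split`, `recall_at_k`), the cutoff year, the recall-at-k metric, and the toy gene labels below are all assumptions made for illustration. The idea is that a model sees only disease genes reported up to a cutoff year and is then scored on how highly it ranks genes reported afterwards.

```python
from typing import Dict, List, Set, Tuple


def temporal_split(discovery_year: Dict[str, int], cutoff: int) -> Tuple[Set[str], Set[str]]:
    """Split known disease genes by the year their association was reported.

    Genes reported up to and including the cutoff form the training set;
    genes reported later are held out for evaluation.
    """
    train = {gene for gene, year in discovery_year.items() if year <= cutoff}
    test = {gene for gene, year in discovery_year.items() if year > cutoff}
    return train, test


def recall_at_k(ranked: List[str], train: Set[str], test: Set[str], k: int) -> float:
    """Fraction of held-out genes found among the top-k candidates.

    Training genes are removed from the ranking first, since the model
    has already seen them as positives.
    """
    if not test:
        return 0.0
    candidates = [gene for gene in ranked if gene not in train]
    hits = sum(1 for gene in candidates[:k] if gene in test)
    return hits / len(test)


# Toy data (hypothetical gene labels and years, for illustration only).
discovery_year = {"GENE_A": 2010, "GENE_B": 2015, "GENE_C": 2021, "GENE_D": 2023}

# Hypothetical model output: genes ordered from most to least likely.
ranked = ["GENE_A", "GENE_C", "GENE_X", "GENE_B", "GENE_D", "GENE_Y"]

train, test = temporal_split(discovery_year, cutoff=2018)
score_top2 = recall_at_k(ranked, train, test, k=2)
score_top3 = recall_at_k(ranked, train, test, k=3)
```

Splitting by report date rather than at random prevents a model from being credited for associations that were already present in the literature or knowledge graph it was given, which matters particularly when literature-derived representations are part of the input.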