A Systematic Approach Toward Implementing Machine Learning Techniques to Analyze Gut Microbiome Data
Jahanikia, S.; Taada, A.; George, A.; Biruduraju, D.; Lu, E.; Singh, I.; Chhajer, K.; Wang, M.; Pentela, T.
Abstract
This study investigates the relationship between the gut microbiota and specific diseases. Data were collected from the Human Gut Microbiome Atlas, which examines regional variations across 20 countries on five continents and categorizes microbial species by taxonomic rank, from genus to species. The Atlas provides color-coded phylum classifications, numerical species counts within the same genus, an analysis of dysbiosis-related associations with 23 diseases, and region-enriched species. Samples were stratified into distinct categories: westernized, non-westernized, cancerous, and non-cancerous. The findings demonstrate that tree-based ensemble methods, such as bagging and boosting, achieved the highest accuracies across all categories, owing to their robustness in handling complex, high-dimensional data. The XGBoost model yielded the strongest predictive performance, achieving 91% accuracy for westernized cancer-associated samples, 84% for non-westernized cancer-associated samples, 92% for westernized samples, and 78% for non-westernized samples. Additionally, advanced topological data analysis was used to assess the global structure and underlying patterns within the dataset.
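Because the accuracies above are reported separately for each sample category, the following minimal sketch shows one way per-category accuracy could be computed from a classifier's predictions. It is an illustration only, not the authors' pipeline: the function name `accuracy_by_stratum`, the stratum labels, and the toy label vectors are all hypothetical and are not taken from the study's data.

```python
from collections import defaultdict


def accuracy_by_stratum(strata, y_true, y_pred):
    """Return {stratum: fraction of correct predictions} for parallel lists."""
    if not (len(strata) == len(y_true) == len(y_pred)):
        raise ValueError("strata, y_true and y_pred must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for stratum, truth, pred in zip(strata, y_true, y_pred):
        total[stratum] += 1
        correct[stratum] += int(truth == pred)
    return {stratum: correct[stratum] / total[stratum] for stratum in total}


# Hypothetical toy data (NOT the study's data): 1 = disease-associated, 0 = control.
strata = ["westernized"] * 4 + ["non-westernized"] * 4
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]

scores = accuracy_by_stratum(strata, y_true, y_pred)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

Reporting accuracy per stratum rather than pooled keeps a large, well-predicted category from masking weaker performance on a smaller one, which is the kind of gap the westernized and non-westernized figures above make visible.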