Understanding the bias of compositional microbiome differential abundance estimation
Calle, M. L.; Pujolassos, M.; Susin, A.
Abstract: One of the most relevant objectives in microbiome studies is the identification of microbial species that are differentially abundant across conditions. However, the compositional nature of microbiome data complicates this task. Interdependence among components leads to spurious associations when the abundances of each component are analyzed separately. Due to the growing awareness of the challenges of compositional data analysis (CoDA), log-ratio transformations, such as the additive log-ratio (alr) or the centered log-ratio (clr) transformations, have become increasingly popular in microbiome studies. Several studies have compared the performance of compositional and non-compositional methods through simulations. However, the debate between these two frameworks remains unresolved, creating confusion among researchers. Rather than relying on simulation-based results, this work provides theoretical results that enable a more rigorous and conclusive analysis of the problem, contributing to a better understanding of differential abundance estimation. We provide theoretical expressions of the bias of differential abundance estimation related to the use of proportions (total sum scaling) and log-ratio transformations (alr and clr) when estimates are interpreted as absolute rather than relative to a reference. The factors that most strongly influence the bias are the magnitude and direction of the effects, the dimension of the composition, the proportion of differentially abundant variables, and the distribution of relative abundances. The findings of this work strongly support the use of CoDA transformations; however, they also highlight that even when log-ratio transformations are applied, interpreting the results outside of a CoDA framework can still lead to biased conclusions.
Among CoDA transformations, alr has several advantages over clr: its reference is more explicit, which reduces the risk of interpreting estimates as absolute rather than relative, and it facilitates the replication of results in independent studies, as it only requires assessing changes relative to the same reference rather than reconstructing the full composition. In this work, we propose a heuristic method for selecting a suitable alr reference component, which will enable a more widespread use of this transformation.
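
To make the distinction between absolute and relative interpretation concrete, the following is a minimal, noiseless sketch in Python. It is an illustration under stated assumptions, not the paper's derivation and not its heuristic for choosing a reference. The abundance values, the helper names `tss`, `clr`, and `alr`, and the choice of taxon 3 as the alr reference are all hypothetical. Only taxon 0 changes in absolute terms, so the true log fold changes are log(4) for taxon 0 and 0 for the rest.

```python
import numpy as np

def tss(abundances):
    """Total sum scaling: convert abundances to proportions."""
    abundances = np.asarray(abundances, dtype=float)
    return abundances / abundances.sum(axis=-1, keepdims=True)

def clr(props):
    """Centered log-ratio: log proportions minus their mean (log geometric mean)."""
    logp = np.log(props)
    return logp - logp.mean(axis=-1, keepdims=True)

def alr(props, ref):
    """Additive log-ratio: log proportions relative to component `ref`, which is dropped."""
    logp = np.log(props)
    return np.delete(logp - logp[..., [ref]], ref, axis=-1)

# Hypothetical absolute abundances for four taxa (assumed values, no noise).
control = np.array([100.0, 200.0, 300.0, 400.0])
treated = control.copy()
treated[0] *= 4.0  # only taxon 0 is differentially abundant

true_lfc = np.log(treated) - np.log(control)

# Differences computed on each scale.
lfc_tss = np.log(tss(treated)) - np.log(tss(control))
lfc_clr = clr(tss(treated)) - clr(tss(control))
lfc_alr = alr(tss(treated), ref=3) - alr(tss(control), ref=3)

print("true:", np.round(true_lfc, 3))
print("tss :", np.round(lfc_tss, 3))   # every taxon shifted by the change in total
print("clr :", np.round(lfc_clr, 3))   # every taxon shifted by the mean true effect
print("alr :", np.round(lfc_alr, 3))   # relative to taxon 3; taxa 0..2 only
```

In this toy setting, the differences on the proportion scale are all offset by the log of the change in total abundance, and the clr differences are all offset by the mean of the true effects, so unchanged taxa appear to decrease if either is read as an absolute change. The alr differences coincide with the true effects here only because the assumed reference (taxon 3) does not change; with a reference that does change, they would remain correct as statements relative to that reference but not as absolute changes.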