PanXpress: Gene expression quantification with a pan-transcriptomic gapped k-mer index
PanXpress: Gene expression quantification with a pan-transcriptomic gapped k-mer index
Alves Ferreira, I.; Zentgraf, J.; Schmitz, J. E.; Rahmann, S.
AbstractMotivation: Most existing workflows for quantifying bacterial gene expression from RNA-seq data rely on mapping reads to a (single) reference transcriptome, typically ignoring strain-level variation. When samples contain unknown or mixed strains, these workflows may introduce reference bias and fail to accurately capture strain-specific gene expression. Pan-transcriptomic approaches address this issue by using pan-transcriptomes as references, but existing solutions require multiple steps for pan-transcriptome construction, indexing, and expression quantification. Results: We introduce PanXpress, a unified framework for bacterial pan-transcriptomics that performs pan-transcriptome construction and indexing directly from genomic FASTA and GFF annotation files, alignment-free mapping of reads to genes from FASTQ samples, and gene expression quantification. The index, a multi-way Cuckoo hash table storing gapped k-mers with associated genes, preserves diversity on the k-mer level. Using simulated RNA-seq data from a mixture of Pseudomonas aeruginosa strains, PanXpress achieves mapping recall comparable to alignment-based methods such as Bowtie2 with higher precision and obtains accurate gene expression and log fold change estimates. On real P. aeruginosa RNA-seq data, using PanXpress' pan-transcriptomic reference increases the proportion of mapped reads and discovered expressed genes. The index of PanXpress is smaller than that of other tools and it provides faster analysis with consistent results, compared to other tools (Salmon, Kallisto, Bowtie2). PanXpress is thus an accurate and efficient method for bacterial gene expression analysis in complex samples.