Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites
Kravchenko, P.; Vorontsov, I. E.; Makeev, V. J.; Kulakovskiy, I. V.; Penzar, D. D.
Abstract

Motivation: DNA motifs recognised by transcription factors are typically represented as position weight matrices (PWMs), assuming independent contributions of individual nucleotides to protein binding specificity. Many alternative models accounting for correlations of positional contributions have been introduced in the past decades. However, performance gains have generally not outweighed the advantages of simplicity, interpretability, and practical applicability of PWMs with the well-established codebase. Existing software tools and motif databases provide multiple non-identical PWMs for the same transcription factor or even for the same dataset. It remains a practical question whether these PWMs can be effectively combined into a single improved model.

Results: Here we describe ArChIPelago (https://github.com/autosome-ru/ArChIPelago), a computational framework that combines multiple PWMs into a joint model using classic machine learning techniques, from linear regression to ensembles of decision trees. We show that such a combination improves prediction of transcription factor binding sites in genomic sequences. With a diverse collection of 704 ChIP-Seq datasets spanning 36 orthologous human and mouse transcription factors of diverse structural families, we show that ArChIPelago consistently outperforms the best available individual mono- and dinucleotide PWMs as well as sparse local inhomogeneous mixture models. Furthermore, using both human and mouse data, we demonstrate that PWM ensembles are capable of making reliable cross-species predictions.
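
The general idea of combining several PWMs into a joint model can be illustrated with a minimal sketch. This is not the ArChIPelago implementation: it assumes each PWM contributes one feature (its best forward-strand hit score in the sequence) and uses a small hand-written logistic regression as a stand-in for the classic machine learning models named above. The PWMs, sequences, and all function names (`consensus_pwm`, `best_pwm_score`, `featurize`, `fit_logistic`, `predict_proba`) are hypothetical toy examples.

```python
import math

ALPHABET = "ACGT"


def consensus_pwm(consensus, match=1.0, mismatch=-1.0):
    """Build a toy PWM (one score dict per position) from a consensus string."""
    return [{base: (match if base == c else mismatch) for base in ALPHABET}
            for c in consensus]


def best_pwm_score(pwm, seq):
    """Best PWM score over all forward-strand windows (seq must be >= PWM width)."""
    width = len(pwm)
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


def featurize(pwms, seq):
    """One feature per PWM: its best hit score in the sequence (an assumption)."""
    return [best_pwm_score(pwm, seq) for pwm in pwms]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by plain batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def predict_proba(w, b, x):
    """Predicted probability that a feature vector comes from a bound sequence."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Two hypothetical, slightly different PWMs for the same (toy) factor.
pwms = [consensus_pwm("TGACTCA"), consensus_pwm("TGAGTCA")]

# Toy labelled sequences: "bound" ones carry one of the two motif variants.
bound = ["AAATGACTCAGGG", "CCTGAGTCATTTA", "GGGTGACTCACCC", "TTTTGAGTCAAAA"]
unbound = ["AAAAAAAAAAAAA", "CCCCGGGGCCCCG", "ACACACACACACA", "GTGTGTGTGTGTT"]

X = [featurize(pwms, s) for s in bound + unbound]
y = [1] * len(bound) + [0] * len(unbound)
w, b = fit_logistic(X, y)

for seq in ("GGGTGACTCACCC", "ACACACACACACA"):
    print(seq, round(predict_proba(w, b, featurize(pwms, seq)), 3))
```

In this toy setting neither PWM alone scores every bound sequence at its maximum, while the joint model sees both scores at once. A real application would use log-odds PWMs, scan both strands, and evaluate on held-out data.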