Science Cast

Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites

Dmitry Penzar, May 15, 2026 12:56am

This paper is a preprint and has not been certified by peer review.

bioRxiv, May 14, 2026 12:00am

Authors

Kravchenko, P.; Vorontsov, I. E.; Makeev, V. J.; Kulakovskiy, I. V.; Penzar, D. D.

Abstract

Motivation: DNA motifs recognised by transcription factors are typically represented as position weight matrices (PWMs), assuming independent contributions of individual nucleotides to protein binding specificity. Many alternative models accounting for correlations of positional contributions have been introduced in the past decades. However, performance gains have generally not outweighed the advantages of simplicity, interpretability, and practical applicability of PWMs with the well-established codebase. Existing software tools and motif databases provide multiple non-identical PWMs for the same transcription factor or even for the same dataset. It remains a practical question whether these PWMs can be effectively combined into a single improved model.

Results: Here we describe ArChIPelago (https://github.com/autosome-ru/ArChIPelago), a computational framework that combines multiple PWMs into a joint model using classic machine learning techniques, from linear regression to ensembles of decision trees. We show that such a combination improves prediction of transcription factor binding sites in genomic sequences. With a diverse collection of 704 ChIP-Seq datasets spanning 36 orthologous human and mouse transcription factors of diverse structural families, we show that ArChIPelago consistently outperforms the best available individual mono- and dinucleotide PWMs as well as sparse local inhomogeneous mixture models. Furthermore, using both human and mouse data, we demonstrate that PWM ensembles are capable of making reliable cross-species predictions.
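
The general idea of placing a classic learner on top of several PWMs can be illustrated with a minimal sketch. It is not ArChIPelago's code or interface: the two toy matrices, the example sequences, the best-hit feature, and every function name below are hypothetical, and a plain logistic regression trained by stochastic gradient descent stands in for the range of models the abstract mentions. Each sequence is reduced to one best-hit score per PWM, and the learner weighs those scores jointly.

```python
import math

# Toy log-odds PWMs with made-up values (assumption, not from the paper).
# Each PWM is a list of positions; each position maps base -> score.
PWM_A = [
    {"A": 1.2, "C": -1.0, "G": -1.0, "T": -0.5},
    {"A": -1.0, "C": 1.1, "G": -0.8, "T": -1.0},
    {"A": -0.9, "C": -1.0, "G": 1.3, "T": -1.0},
]
PWM_B = [
    {"A": 0.9, "C": -0.7, "G": -1.2, "T": -0.2},
    {"A": -0.6, "C": 1.0, "G": -1.0, "T": -0.9},
    {"A": -1.1, "C": -0.8, "G": 1.0, "T": -0.4},
    {"A": -0.3, "C": -0.9, "G": -0.7, "T": 0.8},
]
PWMS = [PWM_A, PWM_B]


def best_hit(pwm, seq):
    """Highest PWM score over all windows of seq.

    Simplification: forward strand only; real scanners usually check both.
    """
    width = len(pwm)
    return max(
        sum(col[base] for col, base in zip(pwm, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    )


def featurize(seq, pwms):
    """One feature per PWM: its best-hit score on the sequence."""
    return [best_hit(pwm, seq) for pwm in pwms]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Logistic regression fitted with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Hypothetical training set: "bound" sequences carry an ACGT site.
bound = ["TTACGTTT", "GGACGTAA", "CACGTCCC", "TTTTACGT"]
unbound = ["TTTTTTTT", "GGGGCCCC", "CCTTGGAA", "TGTGTGTG"]

X = [featurize(s, PWMS) for s in bound + unbound]
y = [1] * len(bound) + [0] * len(unbound)
w, b = fit_logistic(X, y)


def predict(seq):
    """Estimated probability that seq contains a binding site."""
    x = featurize(seq, PWMS)
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
```

Calling `predict("AAACGTAA")` returns the joint model's estimate for a new sequence. Swapping `fit_logistic` for a tree ensemble would change only the learner; the per-PWM features would stay the same.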
