Contrastive pre-training for sequence-based genomics models

This paper is a preprint and has not been certified by peer review.



Sokolova, K.; Chen, K. M.; Troyanskaya, O.


Motivation: In recent years, deep learning has become a central approach in many applications, including numerous tasks in genomics. However, as models grow in depth and complexity, they require either more data or a strategic initialization technique to improve performance.

Results: In this work, we introduce cGen, a novel unsupervised, model-agnostic contrastive pre-training method for sequence-based models. cGen can be applied before training to initialize model weights, reducing the amount of labeled data needed. It learns the intrinsic features of the reference genome and makes no assumptions about the underlying structure. We show that the embeddings produced by the unsupervised model are already informative for gene expression prediction and that the learned sequence features yield a meaningful clustering. We demonstrate that cGen improves model performance in various sequence-based deep learning applications, such as chromatin profiling and gene expression prediction. Our findings suggest that cGen, particularly in settings constrained by data availability, can improve the performance of deep learning genomic models without modifying the model architecture.
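The abstract describes contrastive pre-training over the reference genome, but the specific objective and augmentations are not given in this excerpt. As an illustrative sketch only, the following applies a SimCLR-style NT-Xent loss to toy embeddings of genome windows and their reverse complements; the linear encoder, the reverse-complement augmentation, and all sizes are assumptions for demonstration, not cGen's actual design.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string into a (len, 4) array."""
    idx = np.array([BASES.index(b) for b in seq])
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), idx] = 1.0
    return out

def reverse_complement(seq):
    """Assumed augmentation: the reverse complement of a window."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def encode(seq, W):
    """Toy stand-in for any sequence encoder: flatten one-hot, project, L2-normalize."""
    z = one_hot(seq).ravel() @ W
    return z / np.linalg.norm(z)

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss over a batch of positive pairs (z1[i], z2[i]), all unit-norm."""
    z = np.vstack([z1, z2])                      # (2N, d) stacked views
    sim = z @ z.T / temperature                  # cosine similarities
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # each sample's positive
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()

rng = np.random.default_rng(0)
W = rng.normal(size=(8 * 4, 16))                 # 8-bp windows -> 16-dim embeddings
windows = ["ACGTACGT", "TTGCAACG", "GGGTACCA"]   # windows drawn from a reference genome
z1 = np.array([encode(w, W) for w in windows])
z2 = np.array([encode(reverse_complement(w), W) for w in windows])  # augmented views
loss = nt_xent(z1, z2)
```

Minimizing such a loss pulls each window's two views together in embedding space while pushing apart embeddings of different windows, which is the general mechanism a contrastive pre-training step uses to learn sequence features without labels.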
