Modeling gene regulatory perturbations via deep learning from high-throughput reporter assays

This paper is a preprint and has not been certified by peer review.


Authors

Venukuttan, R.; Doty, R.; Thomson, A.; Chen, Y.; Li, B.; Duan, Y.; Barrera, A.; Dura, K.; Ko, K.-Y.; Lapp, H.; Reddy, T. E.; Allen, A. S.; Majoros, W. H.

Abstract

Assessing likely variant effects on phenotypes is of critical importance in diagnostic settings, and while much progress has been made in interpreting genic mutations based on our understanding of coding sequence, noncoding variants can be much more challenging to interpret reliably from DNA sequence alone. High-throughput reporter assays such as STARR-seq and MPRA have shown utility in experimentally measuring regulatory effects of noncoding variants present in samples, but provide no readout for variants absent from the assay inputs. However, whole-genome reporter assays provide copious data that can be used to train predictive models for prioritizing variants not directly observed in the experiment. We describe a retrainable predictive modeling framework, BlueSTARR, for this task, and present results of training several models with this framework on whole-genome STARR-seq data from two cell lines and one drug treatment. Using these models, we uncover a global signature across the human genome consistent with purifying selection against both loss-of-function and gain-of-function regulatory variants, with the latter showing a significant bias consistent with selection against gains of cis-regulatory function in closed chromatin proximal to genes. By testing the model on synthetic enhancers with binding motifs for the transcription factors GR and AP-1, we find that when trained on drug perturbation data, the model is able to learn distance-dependent and treatment-dependent binding patterns and their resulting reporter gene activation. These results demonstrate that lightweight, easily retrainable models such as ours have utility in probing latent signals present in novel experimental data.
Finally, we find only modest differences in performance between different deep-learning architectures when trained on this single data modality, and while somewhat greater predictive accuracy can be achieved with much larger models trained at great expense on many terabytes of data, there is still copious room for improvement even for industrial-strength, state-of-the-art models.
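The sequence-to-activity modeling described above can be illustrated schematically. The sketch below is a deliberately simplified, hypothetical stand-in for the kind of model BlueSTARR trains (whose actual architecture and API are not given in this abstract): a DNA sequence is one-hot encoded, scanned with a motif-like convolutional filter, and max-pooled into a single scalar standing in for predicted regulatory activity. Only numpy is used, and the `GGAA` "motif" filter is an invented toy example.

```python
import numpy as np

def one_hot(seq):
    """One-hot encode a DNA sequence into a (length, 4) array over A, C, G, T."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        out[i, idx[base]] = 1.0
    return out

def conv_scan(x, kernel):
    """Slide a (k, 4) motif filter along the sequence; return per-position match scores."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(x.shape[0] - k + 1)])

# Toy motif filter: a perfect-match detector for "GGAA" (purely illustrative).
kernel = one_hot("GGAA")
scores = conv_scan(one_hot("TTGGAATT"), kernel)

# Max-pooling over positions yields a crude scalar "activity" prediction;
# a real model would learn many such filters plus downstream layers.
pred_activity = float(scores.max())  # 4.0: the motif matches exactly at offset 2
```

Comparing `pred_activity` between a reference and an alternate allele of the same sequence is, in miniature, how such models can score variants never observed in the assay itself.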
