LOCALE: Local-Alignment Embeddings for Noise-Robust DNA Search at SRA Scale

This paper is a preprint and has not been certified by peer review.

Authors

Synk, R.; Pandey, P.; Sahinalp, C.; Duraiswami, R.

Abstract

Searching petabase-scale repositories of raw sequencing data such as the NIH Sequence Read Archive (SRA) could transform biological discovery, but existing methods either do not scale well or rely on exact k-mer matching that is brittle to sequencing errors and biological divergence. We recast sequence search as dense retrieval: we learn vector embeddings whose inner-product similarity ranks locally aligned sequences above unaligned ones. Our key observation is that effective retrieval does not require accurate regression of global edit distance; it only requires that sequences with better local alignments score higher than sequences with worse ones. We train a DNABERT-2 encoder with an InfoNCE objective on biologically informed augmentations: overlapping crops of parent sequences corrupted with substitutions, insertions, and deletions. On a 50-accession SRA benchmark, LOCALE maintains 62.4% average Recall@Rq at a 10% mutation rate, while every baseline we evaluated falls below 60% Recall@Rq in the noisy-query setting. The advantage holds at scale: on a 500-accession, 15-Gbp benchmark, LOCALE achieves AUPRC 0.508 at 10% mutation versus 0.129 for MetaGraph.
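The training recipe named in the abstract (overlapping crops, corrupted with substitutions, insertions, and deletions, scored with an InfoNCE objective over inner products) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the equal split of the mutation rate across the three edit types, the crop and overlap lengths, and the temperature of 0.07 are all assumptions chosen for illustration, and the DNABERT-2 encoder that would turn each crop into an embedding is left out.

```python
import math
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Corrupt seq; each base is substituted, followed by an insertion,
    or deleted with probability rate/3 each (assumed split)."""
    out = []
    for base in seq:
        r = rng.random()
        if r < rate / 3:            # substitution: swap for a different base
            out.append(rng.choice([b for b in BASES if b != base]))
        elif r < 2 * rate / 3:      # insertion: keep the base, add a random one
            out.append(base)
            out.append(rng.choice(BASES))
        elif r < rate:              # deletion: drop the base
            continue
        else:                       # unchanged
            out.append(base)
    return "".join(out)

def overlapping_crops(parent, crop_len, min_overlap, rng):
    """Return two crops of parent that share at least min_overlap positions."""
    last_start = len(parent) - crop_len
    start_a = rng.randrange(0, last_start + 1)
    max_shift = crop_len - min_overlap   # larger shifts would break the overlap
    lo = max(0, start_a - max_shift)
    hi = min(last_start, start_a + max_shift)
    start_b = rng.randint(lo, hi)
    return (parent[start_a:start_a + crop_len],
            parent[start_b:start_b + crop_len])

def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss over a batch of embedding vectors.
    queries[i] and keys[i] form the positive pair; every other key
    in the batch acts as a negative. Similarity is the inner product."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)
```

Under these assumptions, a positive pair would be built as `a, b = overlapping_crops(parent, 100, 50, rng)` followed by `mutate(a, 0.1, rng)` and `mutate(b, 0.1, rng)`; the two corrupted crops are then encoded and passed to `info_nce` alongside the other pairs in the batch. The loss depends only on the positive pair scoring higher than the in-batch negatives, which matches the abstract's point that retrieval needs a correct ranking rather than an accurate edit-distance estimate.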
