Science Cast

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Jinhan LiJune 18, 2026 4:12am

Views (1)
Comments (0)

Export Citation

Voice is AI-generated

Connected to paperThis paper is a preprint and has not been certified by peer review

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

arXivPDFJune 17, 2026 12:00am

Authors

Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

Abstract

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

TwitterandLinkedIn

0 comments

Add comment

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

AI-powered Paper ChatBeta

AI-powered Paper ChatBeta

0 comments