Science Cast

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

librarian | June 23, 2026 11:26pm

Connected to paper. This paper is a preprint and has not been certified by peer review.

arXiv | PDF | June 22, 2026 12:00am

Authors

Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang

Abstract

AdamW is the de facto optimizer for training large language models (LLMs), yet the theory behind it still lives mostly in finite-variance regimes. This is increasingly unsatisfying, as empirical evidence indicates that stochastic gradient noise in LLM pretraining is typically heavy-tailed. Recent work shows that sign-based optimizers such as Lion and Muon achieve sharp heavy-tailed rates, and that AdaGrad can also converge under heavy-tailed noise. However, no rigorous convergence theory for AdamW has yet been established in this regime. Can AdamW converge under the same heavy-tailed assumptions, or does its second-moment accumulator create a genuine obstruction? We formulate this as an open problem, prove a positive weighted-metric benchmark, and give a corridor lower-bound mechanism showing how denominator memory can hide large gradients.
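
To make the phrase "denominator memory" concrete, here is a minimal scalar sketch of the standard AdamW update. It is not the paper's corridor construction. The helper names (`adamw_step`, `last_step_size`), the hyperparameters (`beta1=0.9`, `beta2=0.999`, `lr=1e-3`, no weight decay), and the toy gradient sequences are all assumptions chosen for illustration. The sketch shows how a single large gradient inflates the second-moment accumulator `v` and keeps later steps small long after the spike has passed.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One scalar AdamW update with bias correction and decoupled weight decay.

    All default hyperparameters are illustrative assumptions.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) accumulator
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment accumulator
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    update = m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta
    return theta - lr * update, m, v


def last_step_size(grads, **kwargs):
    """Run AdamW over a gradient sequence; return the size of the final step."""
    theta, m, v = 0.0, 0.0, 0.0
    step = 0.0
    for t, g in enumerate(grads, start=1):
        new_theta, m, v = adamw_step(theta, g, m, v, t, **kwargs)
        step = abs(new_theta - theta)
        theta = new_theta
    return step


# Toy data (hypothetical): 100 unit gradients, with and without one early spike.
calm = [1.0] * 100
spiked = list(calm)
spiked[9] = 1000.0  # a single heavy-tailed-style outlier at step 10

calm_step = last_step_size(calm)
spiked_step = last_step_size(spiked)
print(f"final step without spike: {calm_step:.3e}")
print(f"final step with spike:    {spiked_step:.3e}")
```

Under these assumed settings the momentum term forgets the spike within a few dozen steps, because `beta1` is small, while `v` decays much more slowly, because `beta2` is close to 1. The final step in the spiked run is therefore roughly 100 times smaller than in the calm run, even though the last 90 gradients are identical.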
