Daily Paper Cast

Diffusion Language Models are Super Data Learners

07 November 2025 22:27 🎙️ Jingwen Liang, Gengyu Wang


About this episode

🤗 Upvotes: 67 | cs.LG

Authors:
Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh

Title:
Diffusion Language Models are Super Data Learners

Arxiv:
http://arxiv.org/abs/2511.03276v1

Abstract:
Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves >56% accuracy on HellaSwag and >33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.
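To make the abstract's contrast more concrete, below is a minimal sketch comparing a standard AR next-token loss with a generic masked-diffusion training objective, where freshly sampled random masks give the "built-in Monte Carlo augmentation" of the same data across epochs. This is an illustration under assumed names (`model` as any callable mapping token ids to logits, `mask_token_id` as a hypothetical special-token id), not the paper's implementation.

```python
# Minimal sketch (assumptions, not the paper's exact recipe): AR next-token
# loss vs. a generic masked-diffusion objective with per-batch random masking.
import torch
import torch.nn.functional as F


def ar_loss(model, tokens):
    """Standard autoregressive loss: predict token t+1 from tokens <= t."""
    logits = model(tokens[:, :-1])                         # [B, L-1, V]
    targets = tokens[:, 1:]                                # [B, L-1]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))


def masked_diffusion_loss(model, tokens, mask_token_id):
    """One Monte Carlo draw of a masked-diffusion objective.

    Each call samples a fresh mask ratio t ~ U(0, 1] and a fresh random mask,
    so repeated epochs over the same unique data see different corrupted
    views, and the bidirectional model must predict tokens in any order.
    """
    b, l = tokens.shape
    t = torch.rand(b, 1, device=tokens.device).clamp_min(1e-3)   # mask ratio per sequence
    is_masked = torch.rand(b, l, device=tokens.device) < t       # Bernoulli(t) per position
    corrupted = tokens.masked_fill(is_masked, mask_token_id)

    logits = model(corrupted)                                    # [B, L, V]
    token_nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                tokens.reshape(-1),
                                reduction="none").reshape(b, l)

    # Score only masked positions; 1/t weighting gives the usual ELBO-style
    # estimator for discrete masked diffusion.
    loss = (token_nll * is_masked / t).sum() / is_masked.sum().clamp_min(1)
    return loss
```

In this framing, the AR loss always conditions left-to-right on the same factorization of each sequence, while the diffusion loss resamples which positions are hidden every step, which is one way to read why repeating limited data helps DLMs more.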
