This paper proposes LongWriter-Zero, an incentivization-based reinforcement learning method that trains models from scratch, enabling high-quality, ultra-long text generation without relying on pre-existing synthetic data.

Methods 🔧:
- LongWriter-Zero uses Group Relative Policy Optimization (GRPO) for training.
- It employs three specialized reward models: a Length Reward Model, a Writing Reward Model, and a Format Reward Model.
- The Length Reward Model steers the output toward an appropriate word count based on predicted target ranges.
- The Writing Reward Model, trained on human preference data, assesses holistic writing quality.
- The Format Reward Model enforces structural conformity and penalizes redundancy.
- The final reward combines the advantages from these models in a balanced way so that each has equal influence (see the sketch after this list).
- The model uses a "Think Prompt" during training to elicit explicit reasoning, planning, and refinement before writing.
- Continual pretraining on diverse writing data and long Chain-of-Thought samples strengthens the base model's long-form generation capabilities.

Results 📊:
- Trained on Qwen2.5-32B, LongWriter-Zero achieves an 8.69 overall critic score on WritingBench.
- It obtains a 1447 Elo rating on Arena-Write, outperforming DeepSeek-R1 and Qwen3-235B.

📌 Reinforcement learning without supervised fine-tuning significantly boosts ultra-long text quality.
📌 Explicit 'think' steps in LLMs enable superior content organization and coherence.
📌 Continual pretraining elevates baseline model performance, maximizing reinforcement learning gains.
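The balanced reward combination can be pictured with a minimal sketch (not the paper's exact implementation): it assumes each reward model returns one scalar score per rollout, computes a group-relative (z-normalized) advantage per reward model in the GRPO style, and averages the three so that no single model's scale dominates. The names `group_relative_advantage`, `combine_advantages`, `length_scores`, `writing_scores`, and `format_scores` are hypothetical.

```python
import numpy as np

def group_relative_advantage(scores: np.ndarray) -> np.ndarray:
    """Z-normalize scores within a group of rollouts (GRPO-style advantage)."""
    std = scores.std()
    if std < 1e-8:
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std

def combine_advantages(length_scores, writing_scores, format_scores):
    """Average per-reward-model advantages so each model has equal influence.

    All inputs are arrays of shape (group_size,), one score per rollout for
    the same prompt. Normalizing each reward separately before averaging
    keeps any single reward model's scale from dominating the policy update.
    """
    advantages = [
        group_relative_advantage(np.asarray(s, dtype=float))
        for s in (length_scores, writing_scores, format_scores)
    ]
    return np.mean(advantages, axis=0)

# Hypothetical example: 4 rollouts for one prompt, scored by each reward model.
length_scores  = [0.9, 0.4, 0.7, 1.0]   # closeness to the target length range
writing_scores = [0.6, 0.8, 0.5, 0.7]   # preference-model writing quality
format_scores  = [1.0, 1.0, 0.2, 0.8]   # structural conformity / low redundancy
print(combine_advantages(length_scores, writing_scores, format_scores))
```

Normalizing each reward before averaging is one simple way to realize the "equal influence" property described above; the paper may weight or combine the signals differently.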