STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
Researchers analyzed token-level entropy dynamics in GRPO (Group Relative Policy Optimization), a post-training paradigm for LLMs, and found a credit assignment mismatch that causes policy entropy to collapse. They propose STARE, a method that reweights advantages at the token level, guided by token surprisal, to keep policy entropy stable. This addresses a key limitation of GRPO and enables more stable training of LLMs on complex reasoning. If you train LLMs with GRPO, STARE may be worth evaluating in your own workflows.
- GRPO suffers from policy entropy collapse caused by a token-level credit assignment mismatch.
- STARE uses token surprisal to reweight advantages at the token level, stabilizing policy entropy in GRPO.
- STARE improves the stability of training LLMs for complex reasoning.
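
The summary does not state STARE's actual weighting rule, so the following is only a rough sketch of the general idea of surprisal-guided token-level advantage reweighting, not the authors' method. It assumes a GRPO-style advantage obtained by standardizing rewards within a group of samples for one prompt, and it scales that advantage per token by the token's surprisal relative to the sequence mean. The function names (`group_normalized_advantages`, `token_surprisal`, `surprisal_weighted_advantages`), the mean-normalized weight, and the clip range are all hypothetical choices made for illustration.

```python
import numpy as np


def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one prompt's sample group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def token_surprisal(logprobs):
    """Surprisal of each sampled token: -log pi(token | context)."""
    return -np.asarray(logprobs, dtype=np.float64)


def surprisal_weighted_advantages(advantage, logprobs, mask, clip=(0.5, 2.0)):
    """Spread one sequence-level advantage over tokens, scaled by relative surprisal.

    ASSUMPTION: this weighting (surprisal divided by the sequence's mean
    surprisal, then clipped) is a placeholder, not the rule from the paper.
    """
    s = token_surprisal(logprobs)
    mask = np.asarray(mask, dtype=np.float64)  # 1 for real tokens, 0 for padding
    mean_s = (s * mask).sum() / max(mask.sum(), 1.0)
    weights = np.clip(s / (mean_s + 1e-8), clip[0], clip[1])
    return advantage * weights * mask


if __name__ == "__main__":
    # Four sampled responses to one prompt, with binary correctness rewards.
    advantages = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])

    # Log-probabilities of the tokens in the first response; last slot is padding.
    logprobs = np.log([0.5, 0.25, 0.5])
    mask = [1, 1, 0]

    print(surprisal_weighted_advantages(advantages[0], logprobs, mask))
```

In this sketch the less likely token (probability 0.25) receives a larger share of the advantage than the more likely one (probability 0.5), and the padded position receives none. Whether weights should grow or shrink with surprisal, and how they should be bounded, is exactly the kind of design decision the paper itself would settle.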