Tag

#post-training

Every item tagged post-training, newest first.

2 items

i post-trained a model to reliably roll a die

A Reddit user post-trained a model to roll a die reliably, with roughly equal probability for each face. The result addresses a common problem in reinforcement learning, where a model tends to exploit strategies it already knows rather than explore new actions. The experiment suggests that post-training can teach a model to produce approximately uniform random outputs, which is relevant to builders whose applications need unpredictable behavior.
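One way to check a claim like "roughly equal probability for each number" is a chi-square goodness-of-fit test on a batch of sampled rolls. The sketch below is illustrative only and is not taken from the Reddit post: the function names are hypothetical, the sample data stands in for real model outputs, and the threshold is the standard table value for 5 degrees of freedom at the 5% level.

```python
import random
from collections import Counter

# Critical value of the chi-square distribution with 5 degrees of freedom
# at the 5% significance level (standard table value).
CHI2_CRIT_DF5_ALPHA05 = 11.070


def chi_square_uniform(rolls, faces=6):
    """Return the chi-square statistic of `rolls` against a fair die."""
    counts = Counter(rolls)
    expected = len(rolls) / faces  # expected count per face if uniform
    return sum(
        (counts[face] - expected) ** 2 / expected
        for face in range(1, faces + 1)
    )


def looks_uniform(rolls):
    """True if a six-sided uniform die is not rejected at the 5% level."""
    return chi_square_uniform(rolls) < CHI2_CRIT_DF5_ALPHA05


if __name__ == "__main__":
    # Hypothetical stand-ins for model outputs: one fair sampler and one
    # that returns 6 three times as often as any other face.
    rng = random.Random(0)
    fair = [rng.randint(1, 6) for _ in range(6000)]
    biased = [rng.choice([1, 2, 3, 4, 5, 6, 6, 6]) for _ in range(6000)]
    print("fair statistic:  ", round(chi_square_uniform(fair), 2))
    print("biased statistic:", round(chi_square_uniform(biased), 2))
```

A statistic below the threshold only means the sample is consistent with a fair die. It does not show that the outputs are independent from one roll to the next, which would need a separate check.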

Key takeaways

Model post-trained to roll a die with roughly equal probability for each number.
Addresses a common RL issue: models relying on known strategies instead of exploring.
Experiment suggests post-training can yield approximately uniform random outputs.

r/LocalLLaMA #reinforcement-learning #randomness #post-training

research 14h

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Researchers propose rubric-conditioned self-distillation, a post-training method for reasoning language models that reduces reliance on expensive and potentially noisy chain-of-thought annotations. It uses evaluative feedback to improve model performance without requiring detailed rationales, and aims to improve accuracy and efficiency by leveraging verified rewards.

Key takeaways

Rubric-conditioned self-distillation reduces need for chain-of-thought annotations.
Method uses evaluative feedback to improve model performance.
Approach aims to enhance model accuracy and efficiency.

arXiv #reasoning-language-models #post-training #self-distillation