1sec.ai

Tag

#post-training

Every item tagged post-training, newest first.

2 items

i post-trained a model to reliably roll a die

A Reddit user post-trained a model to reliably roll a die, so that each number comes up with roughly equal probability. The result addresses a common problem in reinforcement learning: models tend to exploit strategies they already know rather than explore new actions. The experiment suggests that post-training can teach a model to produce outputs that are close to uniformly random, which matters for builders whose applications need unpredictable behavior.

Key takeaways
  • Model post-trained to roll a die with roughly equal probability for each number.
  • Addresses a common RL issue: models exploiting known strategies instead of exploring new actions.
  • Experiment suggests post-training can yield close to uniformly random outputs.
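
One way to check a claim like "roughly equal probability for each number" is a chi-square goodness-of-fit test on a batch of sampled rolls. The sketch below is illustrative only and is not the Reddit author's evaluation: the function names are hypothetical, the rolls are assumed to be already parsed into integers 1 to 6, and Python's random module stands in for the model.

```python
import random
from collections import Counter

# Critical value of the chi-square distribution with 5 degrees of freedom
# at the 0.05 significance level (6 faces - 1 = 5 degrees of freedom).
CHI2_CRIT_DF5 = 11.070


def chi_square_uniform(rolls, faces=6):
    """Chi-square statistic of `rolls` against a fair die with `faces` sides.

    Assumes every roll is an integer in 1..faces (hypothetical helper).
    """
    counts = Counter(rolls)
    expected = len(rolls) / faces
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, faces + 1)
    )


def looks_uniform(rolls):
    """True if a six-sided sample is consistent with a fair die at the 5% level."""
    return chi_square_uniform(rolls) <= CHI2_CRIT_DF5


if __name__ == "__main__":
    # Stand-in for sampled model outputs; a real check would parse
    # the model's completions into integers instead.
    rng = random.Random(0)
    sample = [rng.randint(1, 6) for _ in range(6000)]
    print("chi-square statistic:", round(chi_square_uniform(sample), 3))
    print("consistent with a fair die:", looks_uniform(sample))
```

A statistic below the critical value only means the sample does not contradict uniformity. It says nothing about independence between successive rolls, which would need a separate check.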

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Researchers propose rubric-conditioned self-distillation, a new method for post-training reasoning language models that reduces reliance on expensive and potentially noisy chain-of-thought annotations. The approach improves model performance from evaluative feedback alone, without requiring detailed rationales. It aims to improve model accuracy and efficiency by leveraging verified rewards.

Key takeaways
  • Rubric-conditioned self-distillation reduces need for chain-of-thought annotations.
  • Method uses evaluative feedback to improve model performance.
  • Approach aims to enhance model accuracy and efficiency.