#human-preferences — 1sec.ai

Fine-tuning GPT-2 from human preferences

OpenAI researchers fine-tuned the 774M-parameter GPT-2 model using human feedback on several tasks. The model learned to match the labelers' preferences, but those preferences sometimes conflicted with the researchers' goals. In summarization, for example, labelers preferred sentences copied verbatim from the input, so the model learned to copy. The experiment shows that a model optimized on human feedback learns what labelers actually reward, which can differ from what the researchers intended.

Key takeaways

Fine-tuning GPT-2 with human feedback matched labeler preferences, which did not always match the researchers' goals.
Human labelers preferred verbatim copying in summarization tasks.
60k human labels were required for summarization tasks.

OpenAI #fine-tuning #human-preferences #gpt-2

research · Jun 13

Learning from human preferences

OpenAI and DeepMind collaborated on an algorithm that infers what humans want from their judgments about which of two proposed behaviors is better. The aim is to improve AI safety by removing the need to hand-write goal functions: instead of specifying an exact goal, a human simply picks the better behavior. This could help prevent the undesirable behavior that results when a hand-written goal is slightly wrong.
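To make the idea of learning from comparisons concrete, here is a minimal sketch in Python. It is not the OpenAI/DeepMind implementation: it assumes each behavior is summarized by a short feature vector, that the reward is a linear function of those features, and that a Bradley-Terry style logistic model links reward differences to choices. The names (fit_reward, make_synthetic_comparisons) and the simulated "human" are hypothetical and exist only for illustration.

```python
import math
import random


def reward(weights, features):
    """Linear reward estimate for one behavior (an assumption of this sketch)."""
    return sum(w * x for w, x in zip(weights, features))


def preference_probability(reward_a, reward_b):
    """Bradley-Terry style probability that behavior A is preferred over B."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def fit_reward(comparisons, n_features, lr=0.1, epochs=100):
    """Fit reward weights from pairwise comparisons.

    comparisons: list of (features_a, features_b, a_preferred),
    where a_preferred is 1 if the human chose A and 0 if they chose B.
    Uses plain stochastic gradient descent on the cross-entropy loss.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for feats_a, feats_b, a_preferred in comparisons:
            p = preference_probability(
                reward(weights, feats_a), reward(weights, feats_b)
            )
            # Gradient of the cross-entropy loss: (p - label) * (feats_a - feats_b)
            for i in range(n_features):
                weights[i] -= lr * (p - a_preferred) * (feats_a[i] - feats_b[i])
    return weights


def make_synthetic_comparisons(n_pairs, seed=0):
    """Simulate a human who likes feature 0 and dislikes feature 1 (hypothetical)."""
    rng = random.Random(seed)
    hidden_weights = [1.0, -1.0]  # the "true" preference, never shown to the learner
    comparisons = []
    for _ in range(n_pairs):
        feats_a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        feats_b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        a_preferred = int(reward(hidden_weights, feats_a) > reward(hidden_weights, feats_b))
        comparisons.append((feats_a, feats_b, a_preferred))
    return comparisons


if __name__ == "__main__":
    data = make_synthetic_comparisons(200)
    learned = fit_reward(data, n_features=2)
    print("learned reward weights:", learned)
```

The learner never sees the hidden weights, only which behavior in each pair was chosen, yet the signs of the learned weights should recover what the simulated human likes and dislikes. In a full system the learned reward would then serve as the training signal for a policy, a step this sketch omits.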

Key takeaways

Algorithm infers human preferences from pairwise comparisons.
Reduces the need for hand-written goal functions in AI development.
Aims to improve AI safety by lowering the risk of misaligned goals.

OpenAI #ai-safety #human-preferences #alignment