Research · Sep 19
Fine-tuning GPT-2 from human preferences
Researchers fine-tuned the 774M-parameter GPT-2 model with human feedback on several tasks. The model learned to match the labelers' preferences, but those preferences did not always match what the researchers wanted. In summarization, for example, labelers preferred sentences copied verbatim from the input, so the model learned to copy. The experiment shows that fine-tuning on human feedback optimizes for what labelers actually reward, which can produce unexpected behavior.
Key takeaways
- Fine-tuning GPT-2 with human feedback led to unexpected behaviors.
- Human labelers preferred verbatim copying in summarization tasks.
- Summarization tasks required 60k human labels.
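
To make the copying takeaway concrete, here is a minimal toy sketch of how a reward signal fitted to pairwise preferences can end up rewarding copying. It is not the researchers' method: the one-feature linear reward, the Bradley-Terry style pairwise loss, the helper names (`copy_fraction`, `preference_loss`, `fit_weight`), and the example pairs are all assumptions chosen for illustration.

```python
import math

def copy_fraction(summary: str, source: str) -> float:
    """Fraction of summary sentences that appear verbatim in the source.
    Hypothetical feature: a crude stand-in for 'how much was copied'."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    if not sentences:
        return 0.0
    return sum(s in source for s in sentences) / len(sentences)

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed stably.
    Assumed pairwise loss; the source does not state which loss was used."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def fit_weight(pairs, steps: int = 200, lr: float = 0.5) -> float:
    """Fit reward(x) = w * x by gradient descent on the pairwise loss.
    Each pair is (feature of preferred sample, feature of rejected sample)."""
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for chosen, rejected in pairs:
            diff = chosen - rejected
            # d/dw of -log(sigmoid(w * diff)) = -diff * (1 - sigmoid(w * diff))
            grad += -diff * (1.0 - 1.0 / (1.0 + math.exp(-w * diff)))
        w -= lr * grad / len(pairs)
    return w

# Made-up preference pairs in which the labeler picked the more-copied summary.
toy_pairs = [(1.0, 0.0), (0.5, 0.0), (1.0, 0.5)]
learned_w = fit_weight(toy_pairs)
```

In this toy setup the learned weight comes out positive, so any policy optimized against the fitted reward is pushed toward higher `copy_fraction`. That is the same mechanism, in miniature, by which labeler preferences for copied text can turn into a model that copies.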