Direct Preference Optimization: Your Language Model is Secretly a Reward Model

2024 article dpo_paper

Authors

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

URL

arXiv