Direct Preference Optimization: Your Language Model is Secretly a Reward Model
2024 article dpo_paper
Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
URL: https://arxiv.org/abs/2305.18290
arXiv: 2305.18290