ORPO: Preference Optimization without the Supervised Fine-tuning (SFT) Step

A much cheaper alignment method performing as well as DPO

There are now many methods to align large language models (LLMs) with human preferences. Reinforcement learning with human feedback (RLHF) was one of the first and brought us ChatGPT, but RLHF is very costly. DPO, IPO, and KTO are notably cheaper than RLHF as they don't need a reward model.

While DPO and IPO are cheaper than RLHF, they still require training two different models: one model for the supervised fine-tuning (SFT) step, i.e., training the model to answer instructions, and then a second model aligned with human preferences, using the SFT model for initialization and as a reference.

ORPO is yet another new method for LLM alignment, but this one doesn't even need the SFT model. With ORPO, the LLM jointly learns to answer instructions and to follow human preferences.

In this article, I explain ORPO and review its performance. I show how to use it to turn Mistral 7B into a chat model using consumer hardware.

ORPO is presented in this paper:

ORPO: Monolithic Preference Optimization without Reference Model

The authors motivate ORPO very well by demonstrating that the SFT step is not ideal in the alignment pipeline. While fine-tuning the model on instruction datasets does adapt it to answer instructions in a particular domain, the probability of generating answers that humans would reject also increases.

Figure from the ORPO paper: log probabilities of the chosen and rejected responses during SFT (source)

This is intuitive. Chosen and rejected responses may share a lot in common: same domain, same format, etc. Hence, the probability of generating an answer that is relevant to the task but incorrect also increases.

Techniques like DPO are then necessary to decrease the probability of the rejected responses while increasing the probability of the chosen responses, i.e., to widen the gap between the curves in the figure above. Preference optimization techniques are…