Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Episode 4
Talk
Details
This talk covers Direct Nash Optimization (DNO): teaching language models to self-improve using a general preference oracle such as GPT-4. Post-training is framed as a two-player game whose solution is the optimal policy at a Nash equilibrium, and the resulting models achieve state-of-the-art win rates against GPT-4 Turbo on benchmarks such as AlpacaEval and MT-Bench.
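The self-improvement loop described above can be sketched in miniature. The toy below is an illustrative assumption, not the paper's algorithm: a discrete set of candidate responses stands in for a language model's outputs, a scripted scoring function stands in for the GPT-4 preference oracle, and a simple contrastive update (nudging the policy toward the oracle-preferred response in each sampled pair) stands in for the batched on-policy training DNO performs.

```python
import math
import random

# Hypothetical toy setup: four canned responses with hidden quality scores.
# The "oracle" compares two responses, mimicking a preference oracle like GPT-4.
RESPONSES = ["bad", "okay", "good", "great"]
QUALITY = {"bad": 0.0, "okay": 1.0, "good": 2.0, "great": 3.0}

def oracle_prefers(a: str, b: str) -> bool:
    """Stand-in preference oracle: True if response a beats response b."""
    return QUALITY[a] > QUALITY[b]

def softmax(logits: dict) -> dict:
    m = max(logits.values())
    exps = {r: math.exp(v - m) for r, v in logits.items()}
    z = sum(exps.values())
    return {r: e / z for r, e in exps.items()}

def self_improve(steps: int = 2000, lr: float = 0.1, seed: int = 0) -> dict:
    """Iteratively sample response pairs from the current policy and
    apply a contrastive update toward the oracle-preferred side."""
    rng = random.Random(seed)
    logits = {r: 0.0 for r in RESPONSES}  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        # Sample two responses on-policy, query the oracle, update contrastively.
        a, b = rng.choices(RESPONSES, weights=[probs[r] for r in RESPONSES], k=2)
        if a == b:
            continue
        winner, loser = (a, b) if oracle_prefers(a, b) else (b, a)
        logits[winner] += lr
        logits[loser] -= lr
    return softmax(logits)

policy = self_improve()
```

Because the best response wins every comparison it appears in, repeated contrastive updates concentrate the policy's probability mass on it, a small-scale analogue of self-play converging toward the preferred-response equilibrium.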
Session Code: E4 D