Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Episode 4

Talk
Details

This talk discussed teaching language models to self-improve using a general preference oracle such as GPT-4, framing alignment as a two-player game whose optimal policy lies at a Nash equilibrium, and achieving state-of-the-art win rates against GPT-4-Turbo on benchmarks such as AlpacaEval and MT-Bench.
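The self-play loop behind this framing can be illustrated with a minimal sketch (all names here are hypothetical, and a toy heuristic stands in for the GPT-4 preference oracle): sample several candidate responses from the current policy, have the oracle compare them pairwise, and keep the most- and least-preferred responses as a contrastive training pair for the next iteration.

```python
import random

def preference_oracle(prompt, a, b):
    # Stand-in for a general preference oracle (e.g. GPT-4):
    # returns True if response `a` is preferred over `b`.
    # Toy heuristic: prefer the longer response.
    return len(a) >= len(b)

def sample_responses(policy, prompt, n=4):
    # Sample n candidate responses from the current policy.
    return [policy(prompt) for _ in range(n)]

def build_preference_pair(prompt, responses):
    # Play candidates against each other; the response winning the
    # most pairwise comparisons becomes the "chosen" example, the
    # one winning the fewest the "rejected" example.
    wins = [sum(preference_oracle(prompt, r, other)
                for other in responses if other is not r)
            for r in responses]
    chosen = responses[wins.index(max(wins))]
    rejected = responses[wins.index(min(wins))]
    return (prompt, chosen, rejected)

# Toy policy: echoes the prompt a random number of times.
def toy_policy(prompt):
    return " ".join([prompt] * random.randint(1, 3))

random.seed(0)
prompt = "Explain Nash equilibrium"
pair = build_preference_pair(prompt, sample_responses(toy_policy, prompt))
print(len(pair[1]) >= len(pair[2]))  # -> True: chosen beats rejected
```

In the actual method, the (prompt, chosen, rejected) pairs would drive a contrastive policy update each iteration, so the policy improves against the preference oracle rather than against a fixed scalar reward.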

Speakers (1)
Corby Rosset
Senior Researcher, Microsoft Research AI Frontiers

Session Code: E4 D