We propose the Trust Region Preference Approximation (TRPA) algorithm ⚙️, which integrates rule-based optimization with preference-based optimization for LLM reasoning tasks 🤖🧠. As a ...
Abstract: Function approximation has experienced significant success in the field of reinforcement learning (RL). Despite a handful of progress on developing theory for nonstationary RL with function ...
Mark Jerrum, Alistair Sinclair (UC Berkeley) and Eric Vigoda (Georgia Tech) received the Association for Computing Machinery (ACM) Test of Time Award at a virtual ceremony on Wednesday 23 June at the ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results