Jiaxin Wen1, Ruiqi Zhong2, Akbir Khan3, Ethan Perez3, Jacob Steinhardt2, Minlie Huang1, Samuel R. Bowman3,4, He He4, Shi Feng4,5
1Tsinghua University 2University of California, Berkeley 3Anthropic 4New York University 5George Washington University

Abstract

Language models (LMs) can produce errors that are hard for humans to detect, especially when the task is complex. RLHF, the most pop

