Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks
March 6, 2025
Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Daniel Egert, Ellie Evans, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
cs.AI
Abstract
Inference-Time Scaling has been critical to the success of recent models such
as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for
inference-time scaling require tasks to have verifiable answers, limiting
their application to domains such as math, coding, and logical reasoning. We
take inspiration from how humans make first attempts, ask others for detailed
feedback, and improve based on such feedback across a wide spectrum of
open-ended endeavors. To this end, we collect data for and train dedicated
Feedback and Edit Models that are capable of performing inference-time scaling
for open-ended general-domain tasks. In our setup, one model generates an
initial response, a second model provides feedback on it, and a third model
uses that feedback to edit the response. We show that performance on Arena
Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by
scaling the number of initial response drafts, effective feedback, and edited
responses. When scaled optimally, our setup based on 70B models from the
Llama 3 family reaches SoTA performance on Arena Hard at 92.7 as of 5 Mar
2025, surpassing OpenAI o1-preview-2024-09-12 (90.4) and DeepSeek R1 (92.3).
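To make the three-model setup concrete, below is a minimal sketch of the Feedback-Edit inference-time scaling loop as the abstract describes it: widen the search by sampling multiple drafts, multiple pieces of feedback per draft, and multiple edits per feedback, then pick the best candidate. All names here (`generate_draft`, `give_feedback`, `apply_edit`, `select_best`, and the scaling parameters) are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Hypothetical sketch of the Feedback-Edit inference-time scaling loop.
# The four callables are assumptions standing in for the paper's three
# dedicated models plus a final candidate selector (e.g. a reward model).

def feedback_edit_scaling(prompt, generate_draft, give_feedback, apply_edit,
                          select_best, n_drafts=4, n_feedbacks=2, n_edits=2):
    """Scale inference along three axes: drafts, feedback, and edits."""
    candidates = []
    for _ in range(n_drafts):
        draft = generate_draft(prompt)                 # first attempt
        for _ in range(n_feedbacks):
            feedback = give_feedback(prompt, draft)    # detailed critique
            for _ in range(n_edits):
                edited = apply_edit(prompt, draft, feedback)  # revision
                candidates.append(edited)
    # Return the strongest edited response among all sampled candidates.
    return select_best(prompt, candidates)
```

Under this sketch, total compute grows as `n_drafts * n_feedbacks * n_edits` edit calls, which is the knob the abstract refers to when it says performance improves "when scaled optimally".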