Competitive Programming with Large Reasoning Models

Abstract

We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. We also compare two general-purpose reasoning models, OpenAI o1 and an early checkpoint of o3, with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains such as competitive programming.