Kimi k1.5: Scaling Reinforcement Learning with LLMs

January 22, 2025
Authors: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, Zonghan Yang
cs.AI

Abstract

Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).
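
To make the abstract's structural claim concrete -- that policy optimization can be driven by outcome rewards alone, without a value function or a process reward model -- the sketch below shows a minimal REINFORCE-style loop on a toy task. This is not the Kimi k1.5 algorithm or code: the task, reward function, vocabulary size, and hyperparameters are all hypothetical placeholders, and a batch-mean baseline stands in for a learned critic.

```python
# Minimal illustrative sketch, NOT the Kimi k1.5 training code: a REINFORCE-style
# policy-gradient loop driven only by a final outcome reward, with a batch-mean
# baseline instead of a learned value function and no process reward model.
# The toy task, vocabulary size, and hyperparameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ_LEN, BATCH, STEPS, LR = 5, 3, 32, 200, 0.5
TARGET = np.array([3, 1, 4])             # the "correct answer" a verifier would check

logits = np.zeros((SEQ_LEN, VOCAB))      # per-position policy parameters (toy "model")

def sample_sequence():
    """Sample one token sequence from the current softmax policy."""
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    tokens = np.array([rng.choice(VOCAB, p=probs[t]) for t in range(SEQ_LEN)])
    return tokens, probs

def outcome_reward(tokens):
    """Outcome-only reward: 1.0 if the whole answer is correct, else 0.0."""
    return float(np.array_equal(tokens, TARGET))

for step in range(STEPS):
    sample_grads, rewards = [], []
    for _ in range(BATCH):
        tokens, probs = sample_sequence()
        rewards.append(outcome_reward(tokens))
        g = -probs.copy()
        g[np.arange(SEQ_LEN), tokens] += 1.0     # d log pi(tokens) / d logits
        sample_grads.append(g)
    baseline = float(np.mean(rewards))           # simple baseline, no learned critic
    update = sum((r - baseline) * g for r, g in zip(rewards, sample_grads))
    logits += LR * update / BATCH                # gradient ascent on expected reward

print("greedy answer after training:", logits.argmax(axis=1), "target:", TARGET)
```

In the actual system the policy is an LLM, the reward comes from checking the final answer of a sampled chain of thought, and the policy optimization method is more refined than plain REINFORCE; the sketch only illustrates the structural point that neither a value function nor a process reward model is required.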
