

FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

September 21, 2025
Authors: Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang
cs.AI

Abstract

We conduct a moderate-scale and, to some extent, contamination-free evaluation of current large reasoning models (LRMs), and report some preliminary findings. We also release ROME, our evaluation benchmark for vision-language models, intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and other updates are available at: https://flageval-baai.github.io/LRM-Eval/