FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
September 21, 2025
Authors: Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang
cs.AI
Abstract
We conduct a moderate-scale, contamination-free (to some extent) evaluation of current large reasoning models (LRMs) and report some preliminary findings. We also release ROME, our evaluation benchmark for vision-language models, intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and other updates are available on this website:
https://flageval-baai.github.io/LRM-Eval/