

IntentGrasp: A Comprehensive Benchmark for Intent Understanding

May 7, 2026
作者: Yuwei Yin, Chuyuan Li, Giuseppe Carenini
cs.AI

Abstract

Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source-dataset curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60% on the All Set and below 25% on the Gem Set. Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on the Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes the models on the IntentGrasp training set, yielding significant gains of 30+ F1 points on the All Set and 20+ points on the Gem Set. Tellingly, the leave-one-domain-out (LODO) experiments further demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefit and social good.
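The abstract reports gains measured in F1 points, which suggests intent understanding is scored as a multi-class labeling task. The exact metric and label set are not specified here, so the following is only a minimal pure-Python sketch of macro-averaged F1 over hypothetical intent labels (the labels and predictions below are illustrative, not from the benchmark):

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: per-label F1 scores, averaged with equal label weight."""
    labels = set(gold) | set(pred)
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
        fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical gold vs. predicted intent labels for four utterances.
gold = ["book_flight", "book_flight", "cancel_order", "greeting"]
pred = ["book_flight", "cancel_order", "cancel_order", "greeting"]
print(round(macro_f1(gold, pred), 3))  # → 0.778
```

Macro averaging weights every intent label equally regardless of frequency, which matches the paper's emphasis on a "more balanced" Gem Set where rare intents matter as much as common ones.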