

IntentGrasp: A Comprehensive Benchmark for Intent Understanding

May 7, 2026
作者: Yuwei Yin, Chuyuan Li, Giuseppe Carenini
cs.AI

Abstract

Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source-dataset curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60% on the All Set and below 25% on the Gem Set. Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on the Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes the models on the IntentGrasp training set, yielding significant gains of 30+ F1 points on the All Set and 20+ points on the Gem Set. Tellingly, the leave-one-domain-out (LODO) experiments further demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefit and social good.
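The abstract reports gains measured in F1 points, which suggests intent understanding is scored as a multi-class labeling task. The exact metric and label set are not specified here, so the following is only a minimal pure-Python sketch of macro-averaged F1 over hypothetical intent labels (the labels and predictions below are illustrative, not from the benchmark):

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: per-label F1 scores, averaged with equal label weight."""
    labels = set(gold) | set(pred)
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
        fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical gold vs. predicted intent labels for four utterances.
gold = ["book_flight", "book_flight", "cancel_order", "greeting"]
pred = ["book_flight", "cancel_order", "cancel_order", "greeting"]
print(round(macro_f1(gold, pred), 3))  # → 0.778
```

Macro averaging weights every intent label equally regardless of frequency, which matches the paper's emphasis on a "more balanced" Gem Set where rare intents matter as much as common ones.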