IntentGrasp: A Comprehensive Benchmark for Intent Understanding

Abstract

Accurately understanding the intent behind speech, conversation, and writing is crucial to developing helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source dataset curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations of 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) reveal unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem Set. Notably, 17 of the 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while estimated human performance is approximately 81.1%, leaving substantial room for improvement. To enhance this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes models on the IntentGrasp training set, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Tellingly, the leave-one-domain-out (Lodo) experiments further demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study charts a path toward more intentional, capable, and safe AI assistants for human benefit and social good.
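
The leave-one-domain-out (Lodo) protocol mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `lodo_splits` helper, the `domain`, `text`, and `intent` field names, and the toy instances are all assumptions made for illustration. The idea shown is only that each domain is held out in turn for testing while the remaining domains form the training pool.

```python
from collections import defaultdict


def lodo_splits(instances):
    """Yield (held_out_domain, train, test) once per domain.

    `instances` is a list of dicts carrying a hypothetical "domain" key;
    the real IntentGrasp schema may differ.
    """
    by_domain = defaultdict(list)
    for inst in instances:
        by_domain[inst["domain"]].append(inst)

    # Hold out one domain at a time; train on every other domain.
    for held_out in sorted(by_domain):
        train = [x for d, xs in by_domain.items() if d != held_out for x in xs]
        test = list(by_domain[held_out])
        yield held_out, train, test


# Toy data invented for illustration only (not drawn from IntentGrasp).
toy = [
    {"domain": "banking", "text": "I lost my card", "intent": "report_lost_card"},
    {"domain": "banking", "text": "What is my balance?", "intent": "check_balance"},
    {"domain": "travel", "text": "Book a flight to Paris", "intent": "book_flight"},
    {"domain": "health", "text": "I need a checkup", "intent": "schedule_appointment"},
]

splits = list(lodo_splits(toy))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

In a setup like this, a model would be fine-tuned on each `train` pool and scored on the matching `test` split, so that every reported score comes from a domain the model did not see during fine-tuning.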
