

ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

August 10, 2025
作者: Shubhra Ghosh, Abhilekh Borah, Aditya Kumar Guru, Kripabandhu Ghosh
cs.AI

Abstract

The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question answering (QA). However, no known study tests the robustness of LLMs when they are presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte, and, leveraging it, introduce ObfusQA, a comprehensive, first-of-its-kind framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs tend to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.
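To make the three obfuscation dimensions concrete, the sketch below constructs hypothetical obfuscated variants of a single factual question. The example question and all rewrites are invented for illustration; they are not drawn from the ObfusQA dataset, and the actual obfuscation prompts used by ObfusQAte are defined in the paper.

```python
# Illustrative sketch only: hypothetical examples of the three obfuscation
# dimensions named in the abstract (not taken from the ObfusQA benchmark).

BASE_QUESTION = "Who wrote the novel Nineteen Eighty-Four?"

# (i) Named-Entity Indirection: refer to the key entity indirectly,
# through a description that still identifies it uniquely.
named_entity_indirection = (
    "Who wrote the dystopian novel, published in 1949, whose title "
    "reverses the last two digits of its publication year?"
)

# (ii) Distractor Indirection: surround the question with plausible
# but irrelevant entities that could mislead the model.
distractor_indirection = (
    "Aldous Huxley wrote Brave New World and Ray Bradbury wrote "
    "Fahrenheit 451, but who wrote the novel Nineteen Eighty-Four?"
)

# (iii) Contextual Overload: bury the question under excess context.
contextual_overload = (
    "In the mid-twentieth century, amid the rise of totalitarian "
    "regimes and the early Cold War, many authors turned to dystopian "
    "fiction to critique state surveillance. One such novel, published "
    "in 1949, depicts a society ruled by Big Brother. Who wrote it?"
)

variants = {
    "base": BASE_QUESTION,
    "named_entity_indirection": named_entity_indirection,
    "distractor_indirection": distractor_indirection,
    "contextual_overload": contextual_overload,
}
```

A robustness evaluation in this style would pose each variant to an LLM and check whether the answer to the base question (here, "George Orwell") is preserved as the obfuscation level increases.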