What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance

Abstract

Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query's form can also shape a listener's (and a model's) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding, all of which are known to affect human comprehension. Using 369,837 real-world queries, we ask: are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent "risk landscape": certain features, such as deep clause nesting and underspecification, align with higher hallucination propensity. In contrast, clear intention grounding and answerability align with lower hallucination rates. Other features, including domain specificity, show mixed, dataset- and model-dependent effects. These findings establish an empirically observable query-feature representation correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.
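To make the idea of a query feature vector concrete, here is a minimal sketch in Python. It is not the paper's 22-dimension extractor: the function name `toy_query_features`, the word lists, and the four features are all assumptions chosen for illustration, and they are crude lexical proxies for the negation, clause complexity, and anaphora dimensions named in the abstract.

```python
import re

# Hypothetical word lists; a real extractor would likely use a syntactic parser.
NEGATION_WORDS = {"not", "no", "never", "none", "neither", "nor", "without"}
SUBORDINATORS = {"that", "which", "who", "whose", "because", "although",
                 "if", "when", "while", "unless", "since", "where"}
PRONOUNS = {"it", "this", "they", "them", "he", "she", "these", "those"}


def toy_query_features(query: str) -> dict:
    """Return a few surface-level features of a query (illustrative only)."""
    # Lowercase and keep alphabetic tokens, preserving contractions like "didn't".
    tokens = re.findall(r"[a-z']+", query.lower())
    return {
        "num_tokens": len(tokens),
        # Negation proxy: explicit negators plus "n't" contractions.
        "negation_count": sum(
            t in NEGATION_WORDS or t.endswith("n't") for t in tokens
        ),
        # Clause-complexity proxy: count of subordinating words.
        "subordinator_count": sum(t in SUBORDINATORS for t in tokens),
        # Anaphora proxy: count of pronouns that may need an antecedent.
        "pronoun_count": sum(t in PRONOUNS for t in tokens),
    }


features = toy_query_features(
    "Why didn't it work when I ran the script that you wrote?"
)
print(features)
```

Counts like these could then be correlated with per-query hallucination labels. Word-list proxies miss structure that a parser would capture, such as the actual depth of clause nesting.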
