Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

Abstract

Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT (supervised fine-tuning) with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO (Group Relative Policy Optimization) alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses (cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding) applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces the hallucination rate to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides only marginal improvement over well-designed SFT for structured citation tasks. All results are reported on six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).
