Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

Abstract

Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses (cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding), applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces the hallucination rate to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) after SFT, smaller models match larger models on English tasks; and (v) GRPO provides only marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).
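The abstract names Citation-F1 as one of the six metrics but does not define it. As a minimal sketch only, the snippet below shows one plausible set-based reading: the F1 score between the passage indices a response cites and a gold set of supporting passages. The bracketed marker format (`[1]`, `[2]`), the function names, and the handling of empty sets are all assumptions made for illustration, not the paper's definition.

```python
import re
from typing import Set

# Assumed citation format: bracketed integers such as "[1]" or "[3]".
CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def extract_citations(text: str) -> Set[int]:
    """Return the set of passage indices cited in a response (hypothetical helper)."""
    return {int(idx) for idx in CITATION_PATTERN.findall(text)}


def citation_f1(response: str, gold: Set[int]) -> float:
    """Set-based F1 between cited and gold passage indices (illustrative definition)."""
    predicted = extract_citations(response)
    # Assumption: a response that correctly cites nothing scores 1.0.
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, under these assumptions a response citing passages 1 and 3 against gold passages 1 and 2 has precision 0.5 and recall 0.5, so `citation_f1("Delhi is the capital of India [1][3].", {1, 2})` evaluates to 0.5.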
