
Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

March 19, 2026
Author: Vedant Pandya
cs.AI

Abstract

Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue supervised fine-tuning (SFT) with citation grounding, (3) bilingual dialogue SFT, and (4) Group Relative Policy Optimization (GRPO) alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses (cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding) applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides only marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics: BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate.
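Of the metrics listed, Citation-F1 can be understood as a set-overlap F1 between the citation identifiers a model emits and those in the reference. The sketch below is a minimal illustration under that assumption; the paper's exact matching criteria (e.g. span-level vs. ID-level matching) are not specified here, and the function name is hypothetical.

```python
def citation_f1(predicted, gold):
    """F1 between predicted and gold citation ID sets.

    Assumes citations are comparable identifiers such as "[1]", "[2]".
    Returns 1.0 when both sets are empty (nothing to cite, nothing cited).
    """
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0
    tp = len(pred & ref)                       # correctly cited sources
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two predicted citations, one of which appears in the gold set.
print(citation_f1(["[1]", "[2]"], ["[1]", "[3]"]))  # → 0.5
```

A hallucination-rate metric would complement this by counting responses containing claims unsupported by any cited source; the 0.0% result reported above corresponds to no such responses after Stage 2 SFT for encoder-decoder models.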