ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
October 10, 2025
Authors: Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng
cs.AI
Abstract
Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine
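
To make the three properties concrete, here is a minimal, hypothetical sketch of what a tag-based trace and a structural reward check could look like. The tag names (plan, facts, solution, self_check, confidence), the example trace, and the parse_trace/structural_reward helpers below are illustrative assumptions, not the exact schema or reward function used by ReFIne; consult the repository above for the actual implementation.

# Hypothetical sketch: tag names and reward terms are assumptions,
# not the exact schema used by ReFIne.
import re

EXAMPLE_TRACE = """\
<plan>Set up the linear equation, isolate x, then verify by substitution.</plan>
<facts>Given: 3x + 5 = 20. Decisive fact: subtracting and dividing both sides preserves equality.</facts>
<solution>3x = 20 - 5 = 15, so x = 15 / 3 = 5 (using the facts above).</solution>
<self_check>Substituting back: 3*5 + 5 = 20, which matches the given equation.</self_check>
<confidence>0.95</confidence>
"""

REQUIRED_TAGS = ["plan", "facts", "solution", "self_check", "confidence"]

def parse_trace(trace: str) -> dict:
    """Extract each tagged section; raise if a required tag is missing."""
    sections = {}
    for tag in REQUIRED_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", trace, re.DOTALL)
        if match is None:
            raise ValueError(f"missing required <{tag}> section")
        sections[tag] = match.group(1).strip()
    return sections

def structural_reward(trace: str) -> float:
    """Toy reward: 1.0 if all sections are present and the stated
    confidence is a valid probability, else 0.0."""
    try:
        sections = parse_trace(trace)
        conf = float(sections["confidence"])
        return 1.0 if 0.0 <= conf <= 1.0 else 0.0
    except ValueError:
        return 0.0

if __name__ == "__main__":
    print(parse_trace(EXAMPLE_TRACE)["plan"])
    print("reward:", structural_reward(EXAMPLE_TRACE))

In a full ReFIne-style pipeline, a structural check like this would presumably be only one term of the GRPO reward, combined with answer correctness and the calibration of the stated confidence.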