

MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment

December 10, 2025
作者: Mengxi Xiao, Kailai Yang, Pengde Zhao, Enze Zhang, Ziyan Kuang, Zhiwei Liu, Weiguang Han, Shu Liao, Lianting Huang, Jinpeng Hu, Min Peng, Qianqian Xie, Sophia Ananiadou
cs.AI

Abstract

Mental health disorders affect hundreds of millions of people globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy that filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performance in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.
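The abstract does not specify how the inconsistency-detection reward is computed. A minimal toy sketch of one possible formulation, assuming a caller-supplied contradiction predicate and a fixed per-violation penalty (all names and constants here are hypothetical, not the paper's actual method), might look like:

```python
def inconsistency_reward(steps, answer, contradicts, penalty_per_hit=0.25):
    """Return a reward in [0, 1]: 1.0 for a fully consistent reasoning
    trace, reduced by a fixed penalty for each detected inconsistency.

    steps       -- list of reasoning-step strings from the model's trace
    answer      -- the final answer string
    contradicts -- predicate (str, str) -> bool flagging a contradiction
    """
    penalty = 0.0
    # Pairwise consistency check between reasoning steps.
    for i in range(len(steps)):
        for j in range(i + 1, len(steps)):
            if contradicts(steps[i], steps[j]):
                penalty += penalty_per_hit
    # Each step must also be consistent with the final answer.
    for step in steps:
        if contradicts(step, answer):
            penalty += penalty_per_hit
    return max(0.0, 1.0 - penalty)
```

In an RL loop this scalar would be combined with a task-correctness reward so that the policy is optimized jointly for answer accuracy and trace faithfulness; in practice the contradiction predicate would likely be a learned verifier rather than a rule.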