VoxMind: An End-to-End Agentic Spoken Dialogue System
April 17, 2026
Authors: Tianle Liang, Yifu Chen, Shengpeng Ji, Yijun Chen, Zhiyang Jia, Jingyu Lu, Fan Zhuo, Xueyi Pu, Yangzhuo Li, Zhou Zhao
cs.AI
Abstract
Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, these models can extend their knowledge boundaries and better solve real-world tasks. Yet, existing research has largely concentrated on core perception and generation, with comparatively limited exploration of such tool-augmented extensions. To bridge this gap, we present VoxMind, an integrated framework designed to equip end-to-end spoken dialogue models with comprehensive agentic abilities. Leveraging our curated 470-hour AgentChat dataset, we incorporate a "Think-before-Speak" mechanism, enabling the model to internalize structured reasoning as a critical prerequisite for planning and response generation. Furthermore, to mitigate latency bottlenecks caused by large-scale tool integration, we propose a Multi-Agent Dynamic Tool Management architecture. By asynchronously delegating retrieval tasks to an auxiliary agent aligned with the main model's reasoning trajectory, this system effectively decouples inference latency from toolset size. Experimental results confirm that VoxMind achieves significant improvements in agent performance: compared with strong baselines, the task completion rate increases from 34.88% to 74.57%, outperforming Gemini-2.5-Pro on spoken agent tasks while preserving general conversational quality. The source code and associated data are publicly available at https://github.com/MM-Speech/VoxMind.
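The core idea behind the Multi-Agent Dynamic Tool Management architecture can be illustrated with a minimal concurrency sketch. This is a hypothetical toy illustration, not the paper's implementation: the function names (`retrieve_tools`, `think`, `respond`), the placeholder toolset, and the sleep-based timing are all assumptions. The point it shows is that when an auxiliary agent performs tool retrieval concurrently with the main model's reasoning pass, retrieval cost is hidden behind reasoning rather than added to end-to-end latency, decoupling response time from toolset size.

```python
import asyncio

# Hypothetical toolset; in practice this could contain thousands of tools,
# which is exactly why synchronous retrieval would become a latency bottleneck.
TOOLSET = {f"tool_{i}": f"description of tool {i}" for i in range(1000)}

async def retrieve_tools(query: str, k: int = 3) -> list[str]:
    """Auxiliary agent: rank tools against the query (placeholder scoring)."""
    await asyncio.sleep(0.01)  # stands in for a retrieval pass over TOOLSET
    return sorted(TOOLSET)[:k]

async def think(query: str) -> str:
    """Main agent: structured reasoning produced before any spoken response
    ("Think-before-Speak")."""
    await asyncio.sleep(0.01)  # stands in for the model's reasoning pass
    return f"plan for: {query}"

async def respond(query: str) -> str:
    # Launch retrieval and reasoning concurrently: the auxiliary agent's
    # retrieval overlaps the main agent's reasoning, so total latency is
    # roughly max(retrieval, reasoning) instead of their sum.
    tools, plan = await asyncio.gather(retrieve_tools(query), think(query))
    return f"{plan} using {tools[0]}"

print(asyncio.run(respond("what's the weather in Paris?")))
```

In a synchronous design, retrieval time grows with the toolset and is paid before reasoning starts; overlapping the two calls with `asyncio.gather` is one simple way to realize the decoupling the abstract describes.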