STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media
May 24, 2026
Authors: Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng, Yang Liu
cs.AI
Abstract
Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become stale. We propose Stream, a data-centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high-value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role-grounded persona construction with Conversational Blueprint construction; it further adopts retrieval-augmented generation (RAG) to support knowledge-aware responses. Based on Stream, we release StreamDial, a large-scale multi-domain dataset covering the Automotive, Restaurant, and Hotel domains. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet $\langle P_u, P_a, B, H \rangle$ that pairs the dialogue history with explicit user and agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and that models trained with StreamDial improve Dialogue State Tracking across backbones; we further report a completed human-evaluation set and encouraging multilingual transfer on Qwen3-8B under a controlled training budget. The data is available at https://github.com/hitxueliang/DialogDataSetBySTREAM.
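To make the session quadruplet concrete, the following is a minimal Python sketch of how one session might be represented in memory. It is an illustration only: the class names, field names, and example strings are assumptions and do not reflect the released dataset's actual schema, which the abstract does not specify. The sketch also recomputes the reported average of 17.11 turns per session from the two totals given in the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One utterance in the dialogue history H (hypothetical layout)."""
    speaker: str  # "user" or "agent"
    text: str


@dataclass
class Session:
    """Hypothetical container for the quadruplet <P_u, P_a, B, H>."""
    user_persona: str                                   # P_u
    agent_persona: str                                  # P_a
    blueprint: List[str]                                # B, here a list of planned stages
    history: List[Turn] = field(default_factory=list)   # H


def average_turns(total_turns: int, total_sessions: int) -> float:
    """Average turns per session, rounded to two decimals."""
    return round(total_turns / total_sessions, 2)


# Invented example content, for illustration only.
example = Session(
    user_persona="Budget-conscious traveler looking for a quiet room",
    agent_persona="Hotel front-desk agent",
    blueprint=["requirement mining", "constraint conflict", "negotiation", "recovery"],
    history=[
        Turn("user", "Do you have a quiet room under my budget?"),
        Turn("agent", "Let me check what is available for your dates."),
    ],
)

# Totals taken from the abstract: 1,497,320 turns over 87,498 sessions.
reported_average = average_turns(1_497_320, 87_498)
print(reported_average)
```

Keeping the personas and blueprint alongside the history, rather than in separate files, is one possible design choice that lets a training pipeline condition on all four components of a session at once.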