STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

Abstract

Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become stale. We propose Stream, a data-centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high-value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role-grounded persona construction with Conversational Blueprint construction; it further adopts retrieval-augmented generation (RAG) to support knowledge-aware responses. Based on Stream, we release StreamDial, a large-scale multi-domain dataset covering the Automotive, Restaurant, and Hotel domains. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet ⟨P_u, P_a, B, H⟩ that pairs the dialogue history H with explicit user and agent personas (P_u, P_a) and a Conversational Blueprint B, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and that models trained on StreamDial improve Dialogue State Tracking across backbones. We further report a completed human-evaluation set and encouraging multilingual transfer on Qwen3-8B under a controlled training budget. The data is available at https://github.com/hitxueliang/DialogDataSetBySTREAM.
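
To make the session structure concrete, the following is a minimal Python sketch of how one quadruplet ⟨P_u, P_a, B, H⟩ might be represented. It is not the released data format: the class names, field names, speaker labels, and the choice to model the blueprint as a list of stage names are all assumptions made for illustration. The last line recomputes the average turns per session from the two totals reported above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One utterance in the dialogue history (hypothetical schema)."""
    speaker: str  # assumed labels: "user" or "agent"
    text: str


@dataclass
class Session:
    """One <P_u, P_a, B, H> quadruplet; field names are illustrative only."""
    user_persona: str     # P_u
    agent_persona: str    # P_a
    blueprint: List[str]  # B, modeled here as an assumed list of planned stages
    history: List[Turn] = field(default_factory=list)  # H


def average_turns(total_turns: int, total_sessions: int) -> float:
    """Mean number of turns per session, rounded to two decimals."""
    return round(total_turns / total_sessions, 2)


# Totals reported in the abstract: 1,497,320 turns over 87,498 sessions.
reported_average = average_turns(1_497_320, 87_498)
```

Under these assumptions, `reported_average` matches the 17.11 turns per session stated in the abstract.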