Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs
January 9, 2026
Authors: Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal, Manish Gupta
cs.AI
Abstract
Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts the upcoming characters in a live chat from partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds its predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat into benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs between accuracy and efficiency. We present Router-Suggest, a routing framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly outperform textual models in user satisfaction, notably reducing typing effort and improving completion quality in multi-turn conversations. These findings underscore the need for multimodal context in auto-completion, paving the way toward smarter, user-aware assistants.
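The routing idea described in the abstract can be illustrated with a minimal sketch. This is a toy illustration under assumed interfaces, not the paper's actual router: the model stubs (`text_model_complete`, `vlm_complete`), the word-count heuristic, and the threshold value are all invented here for demonstration. The key point it captures is that a cheap text model handles most requests, while the slower VLM is invoked only when the dialog context suggests the completion depends on the shared image.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def text_model_complete(prefix: str) -> str:
    """Stub for a fast text-only completion model (hypothetical)."""
    return prefix + " [text completion]"


def vlm_complete(prefix: str, image: object) -> str:
    """Stub for a slower vision-language completion model (hypothetical)."""
    return prefix + " [multimodal completion]"


@dataclass
class RouterSuggest:
    """Toy router: send visually-dependent prefixes to the VLM, the rest to the text model."""

    threshold: float = 0.5  # assumed cutoff; a real router would learn this from dialog context

    def visual_dependence(self, prefix: str, has_image: bool) -> float:
        """Toy heuristic: short, ambiguous prefixes with an image present lean on visual context."""
        if not has_image:
            return 0.0
        return 1.0 / (1 + len(prefix.split()))

    def complete(self, prefix: str, image: Optional[object] = None) -> Tuple[str, str]:
        """Return (model_used, completion) for a partially typed message."""
        score = self.visual_dependence(prefix, image is not None)
        if score >= self.threshold:
            return "vlm", vlm_complete(prefix, image)
        return "text", text_model_complete(prefix)
```

In this sketch, a one-word prefix accompanied by an image routes to the VLM, while longer prefixes or image-free turns fall back to the text model, which is the source of the speedup the paper reports: most calls avoid the expensive VLM path.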