

Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs

January 9, 2026
Authors: Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal, Manish Gupta
cs.AI

Abstract

Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on a shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats from partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat into benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs between accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly outperform textual models on user satisfaction, notably by saving typing effort and improving completion quality in multi-turn conversations. These findings underscore the need for multimodal context in auto-completion, pointing toward smarter, user-aware assistants.
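The listing does not describe the router's features, thresholds, or models, so the following is only a minimal Python sketch of the general idea: dynamically choosing between a fast text-only completer and a slower VLM based on the dialog context. All class names, the heuristic, and the stub models are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of dialog-context routing between a text model and a VLM.
# Names and the routing heuristic are assumptions made for illustration only.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialogState:
    """Live chat state available to the auto-completion system."""
    typed_prefix: str                                 # partially typed text in the current turn
    turns: List[str] = field(default_factory=list)    # previous textual turns
    image_ref: Optional[str] = None                   # reference to the shared image, if any


def needs_visual_grounding(state: DialogState) -> bool:
    """Toy routing policy: use the VLM only when the dialog appears to depend on
    the shared visual context; otherwise a lightweight text model is likely enough."""
    if state.image_ref is None:
        return False
    # Hypothetical heuristic: recent turns or the prefix mention visual cues.
    visual_markers = ("this", "that", "picture", "photo", "left", "right", "color")
    recent_text = " ".join(state.turns[-2:] + [state.typed_prefix]).lower()
    return any(marker in recent_text for marker in visual_markers)


def complete(state: DialogState, text_model, vlm) -> str:
    """Route to the VLM or the textual baseline and return the predicted characters."""
    if needs_visual_grounding(state):
        return vlm(state.typed_prefix, state.turns, state.image_ref)
    return text_model(state.typed_prefix, state.turns)


if __name__ == "__main__":
    # Stub callables standing in for a real textual LM and a real VLM.
    fast_text_model = lambda prefix, turns: prefix + " sounds good!"
    slow_vlm = lambda prefix, turns, image: prefix + " the red jacket in the photo"

    state = DialogState(
        typed_prefix="I really like",
        turns=["Check out this photo from the trip."],
        image_ref="trip.jpg",
    )
    print(complete(state, fast_text_model, slow_vlm))
```

A real router would presumably learn this decision from dialog features rather than keyword matching; the sketch only shows where the accuracy/latency trade-off enters the pipeline.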