Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs

Abstract

Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs between accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly outperform textual models in user satisfaction, notably reducing user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completion and point toward smarter, user-aware assistants.
