Multimodal Policy Internalization for Conversational Agents
October 10, 2025
Authors: Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
cs.AI
Abstract
Modern conversational agents like ChatGPT and Alexa+ rely on predefined
policies specifying metadata, response styles, and tool-usage rules. As these
LLM-based systems expand to support diverse business and user queries, such
policies, often implemented as in-context prompts, are becoming increasingly
complex and lengthy, making faithful adherence difficult and imposing large
fixed computational costs. With the rise of multimodal agents, policies that
govern visual and multimodal behaviors are critical but remain understudied.
Prior prompt-compression work mainly shortens task templates and
demonstrations, while existing policy-alignment studies focus only on
text-based safety rules. We introduce Multimodal Policy Internalization (MPI),
a new task that internalizes reasoning-intensive multimodal policies into model
parameters, enabling stronger policy-following without including the policy
during inference. MPI poses unique data and algorithmic challenges. We build
two datasets spanning synthetic and real-world decision-making and tool-using
tasks and propose TriMPI, a three-stage training framework. TriMPI first
injects policy knowledge via continual pretraining, then performs supervised
finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement
learning extension that augments rollouts with policy-aware responses for
grounded exploration. TriMPI achieves notable gains in end-to-end accuracy,
generalization, and robustness to forgetting. As the first work on multimodal
policy internalization, we provide datasets, training recipes, and
comprehensive evaluations to foster future research. Project page:
https://mikewangwzhl.github.io/TriMPI.
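
The abstract only sketches how PolicyRollout extends GRPO-style training. Below is a minimal, self-contained Python sketch of the core idea as described here: mixing a few "policy-aware" rollouts (generated with the policy in context) into the rollout group before computing group-normalized advantages, so exploration stays grounded in the policy. All names (sample_response, reward, policy_rollout_group, n_plain, n_aware) are hypothetical stand-ins for illustration, not the authors' actual API; the real training recipe is on the project page.

```python
# Hypothetical sketch of a PolicyRollout-style rollout group, based only on
# the abstract's description. Stubs stand in for the model and reward model.
import random
from statistics import mean, stdev

def sample_response(prompt: str) -> str:
    # Stand-in for model.generate(prompt); returns a dummy response.
    return f"response to: {prompt[:30]!r} ({random.random():.3f})"

def reward(response: str) -> float:
    # Stand-in for a policy-compliance reward (e.g., end-to-end correctness).
    return random.random()

def policy_rollout_group(user_query: str, policy_text: str,
                         n_plain: int = 6, n_aware: int = 2):
    """Collect one GRPO-style rollout group with policy-aware augmentation.

    Most rollouts see only the user query, testing the internalized policy;
    a few policy-aware rollouts see the policy in context, providing
    grounded exploration targets within the same group.
    """
    plain = [sample_response(user_query) for _ in range(n_plain)]
    aware = [sample_response(policy_text + "\n\n" + user_query)
             for _ in range(n_aware)]
    group = plain + aware
    rewards = [reward(r) for r in group]
    # GRPO-style group-normalized advantages: (r - mean) / std.
    mu = mean(rewards)
    sigma = stdev(rewards) or 1.0  # guard against a zero-variance group
    return [(resp, (r - mu) / sigma) for resp, r in zip(group, rewards)]

if __name__ == "__main__":
    pairs = policy_rollout_group("Which tool should I call for this image?",
                                 "POLICY: always check metadata first.")
    for resp, adv in pairs:
        print(f"adv={adv:+.2f}  {resp}")
```

In an actual RL loop, each (response, advantage) pair would feed a policy-gradient update, with stages 1 and 2 (continual pretraining on the policy, then supervised finetuning) run beforehand as the abstract describes.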