

Multimodal Policy Internalization for Conversational Agents

October 10, 2025
作者: Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
cs.AI

Abstract

Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.
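The abstract describes TriMPI's final stage, PolicyRollout, as a GRPO-style extension that augments the rollout group with policy-aware responses. A minimal sketch of the two core ideas — group-relative advantage normalization and mixing policy-free with policy-conditioned rollouts — is below. The function names, the mixing scheme, and the `model.generate` interface are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a GRPO-style rollout group with policy-aware augmentation.
# The mixing scheme and all helper names are assumptions for illustration only.
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero on uniform rewards
    return [(r - mu) / sigma for r in rewards]


def policy_rollout_group(model, query, policy_text, n_plain=4, n_policy_aware=4):
    """Build a rollout group mixing policy-free and policy-conditioned
    responses, so RL exploration stays grounded in the policy being
    internalized. `model` is any object with a `generate(prompt)` method."""
    plain = [model.generate(query) for _ in range(n_plain)]
    # Policy-aware rollouts see the full policy in context during training only;
    # after internalization the model is expected to answer without it.
    aware = [model.generate(policy_text + "\n" + query)
             for _ in range(n_policy_aware)]
    return plain + aware
```

The advantages would then weight a standard policy-gradient update over the mixed group; the exact reward model and loss follow the GRPO recipe, which the abstract does not detail.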
PDF · October 14, 2025