TLDR: Token-Level Detective Reward Model for Large Vision Language Models

October 7, 2024
作者: Deqing Fu, Tong Xiao, Rui Wang, Wang Zhu, Pengchuan Zhang, Guan Pang, Robin Jia, Lawrence Chen
cs.AI

Abstract

Although reward models have been successful in improving multimodal large language models, the reward models themselves remain coarse and carry minimal information. Notably, existing reward models mimic human annotations by assigning only a single binary feedback to any text, no matter how long the text is. In the realm of multimodal language models, where models must process both images and text, a naive reward model may learn implicit biases toward the text and become less grounded in the image. In this paper, we propose a Token-Level Detective Reward Model (TLDR) that provides fine-grained annotations for each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels for training TLDR models. We then show the rich usefulness of TLDR models, both in assisting off-the-shelf models to self-correct their generations and in serving as a hallucination evaluation tool. Finally, we show that TLDR models can speed up human annotation by a factor of 3 to acquire a broader range of high-quality vision-language data.
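
To make the perturbation-based data generation concrete, below is a minimal Python sketch, not the authors' implementation: it swaps selected tokens of a ground-truth caption for same-category distractors to form a synthetic hard negative, and assigns each token a binary label (1 for tokens that remain grounded in the image, 0 for perturbed ones). The function name perturb_caption and the distractor dictionary are hypothetical stand-ins for whatever perturbation source the paper actually uses.

```python
import random

def perturb_caption(tokens, distractors, num_swaps=1, seed=0):
    """Swap a few caption tokens for plausible same-category distractors.

    Returns the perturbed token list and per-token labels:
    1 = token still grounded in the image, 0 = perturbed (hallucinated) token.
    """
    rng = random.Random(seed)
    perturbed = list(tokens)
    labels = [1] * len(tokens)
    # Only positions with a known distractor list are eligible, so the
    # resulting negative stays fluent and is hard to tell apart from the original.
    candidates = [i for i, t in enumerate(tokens) if t in distractors]
    for i in rng.sample(candidates, min(num_swaps, len(candidates))):
        perturbed[i] = rng.choice(distractors[tokens[i]])
        labels[i] = 0
    return perturbed, labels

# Toy usage: "dog" may be swapped for a same-category distractor such as "cat";
# only the swapped token receives a negative (0) label.
caption = ["a", "brown", "dog", "on", "the", "grass"]
distractors = {"dog": ["cat", "horse"], "brown": ["black", "white"]}
negative, token_labels = perturb_caption(caption, distractors, num_swaps=1)
print(negative, token_labels)
```

A token-level reward model trained on such pairs can then flag exactly which tokens of a generation are ungrounded, which is what enables the self-correction and hallucination-evaluation uses described in the abstract.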

