TLDR: Token-Level Detective Reward Model for Large Vision Language Models

Abstract

Although reward models have been successful in improving multimodal large language models, the reward models themselves remain crude and carry minimal information. Notably, existing reward models mimic human annotations by assigning a single binary feedback signal to a text, no matter how long the text is. In the realm of multimodal language models, where models must process both images and text, a naive reward model may learn implicit biases toward text and become less grounded in images. In this paper, we propose a Token-Level Detective Reward Model (TLDR) to provide fine-grained annotations for each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels to train TLDR models. We then show the usefulness of TLDR models both in assisting off-the-shelf models to self-correct their generations and in serving as a hallucination evaluation tool. Finally, we show that TLDR models can speed up human annotation by a factor of 3, making it possible to acquire a broader range of high-quality vision language data.
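To make the contrast between a single binary reward and token-level labels concrete, the following is a minimal sketch, not the authors' method. It assumes whitespace tokenization and assumes that a hard negative is a lightly perturbed copy of a correct caption, so that token labels can be recovered by diffing the two texts. The function names `token_labels` and `response_label` are hypothetical, and the abstract does not specify how perturbations or labels are actually produced.

```python
from difflib import SequenceMatcher


def token_labels(original, perturbed):
    """Label each token of `perturbed`: 1 if unchanged from `original`, 0 if altered.

    Assumption: whitespace tokens and a diff against the original caption;
    the paper's real tokenization and labeling may differ.
    """
    orig_tokens = original.split()
    pert_tokens = perturbed.split()
    labels = [0] * len(pert_tokens)  # start by treating every token as altered
    matcher = SequenceMatcher(a=orig_tokens, b=pert_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":  # tokens shared with the original stay positive
            for j in range(j1, j2):
                labels[j] = 1
    return list(zip(pert_tokens, labels))


def response_label(labeled):
    """Collapse token labels into the single binary label a naive reward model uses."""
    return int(all(label == 1 for _, label in labeled))


labeled = token_labels(
    "a brown dog sits on the grass",
    "a black dog sits on the grass",
)
print(labeled)                  # per-token labels: only "black" is marked 0
print(response_label(labeled))  # the whole response collapses to 0
```

The response-level label only says that the text is wrong somewhere, while the token-level labels point at the specific word that no longer matches the original caption.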
