VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

Abstract

Effectively retrieving, reasoning over, and understanding visually rich information remains a challenge for retrieval-augmented generation (RAG) methods. Traditional text-based methods cannot handle visual information. Current vision-based RAG approaches, in turn, are often limited by fixed pipelines and frequently struggle to reason effectively because the fundamental capabilities of the models are insufficiently activated. As reinforcement learning (RL) has been shown to benefit model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, vision-language models (VLMs) interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) prior multi-modal RAG approaches tend to merely incorporate images into the context, which leads to insufficient reasoning token allocation and neglects visual-specific perception; and (ii) when models interact with search engines, their queries often fail to retrieve relevant information because the models cannot articulate their requirements, which leads to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.
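As a rough illustration of the two ideas above (a crop-and-scale perception action, and a reward that mixes retrieval quality with a model-based score), the following sketch may help. It is not the authors' implementation: the names Box, crop_and_scale, and combined_reward, the zoom factor, and the weight alpha are all hypothetical, and the weighted sum is only one plausible way to combine the two signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def crop_and_scale(image_size, region, zoom=2.0):
    """Clamp `region` to the image bounds and return (clamped_box, output_size).

    Assumed perception action: the model names a region of a retrieved page,
    and the environment would hand back that crop enlarged by `zoom`, so the
    model moves from a coarse view to a fine one.
    """
    img_w, img_h = image_size
    box = Box(
        left=max(0, min(region.left, img_w)),
        top=max(0, min(region.top, img_h)),
        right=max(0, min(region.right, img_w)),
        bottom=max(0, min(region.bottom, img_h)),
    )
    if box.width <= 0 or box.height <= 0:
        raise ValueError("region does not overlap the image")
    return box, (round(box.width * zoom), round(box.height * zoom))


def combined_reward(retrieval_score, answer_score, alpha=0.3):
    """Weighted mix of a retrieval score and a model-based answer score.

    Both scores are assumed to lie in [0, 1]. `alpha` is an illustrative
    weight, not a value taken from the paper.
    """
    if not (0.0 <= retrieval_score <= 1.0 and 0.0 <= answer_score <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return alpha * retrieval_score + (1.0 - alpha) * answer_score
```

In this sketch, the clamped box and output size would be passed to whatever image library performs the actual crop and resize, and the retrieval score would come from a ranking metric computed over the pages returned for the rewritten query.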
