Visually Interpretable Subtask Reasoning for Visual Question Answering

Abstract

Answering complex visual questions such as "Which red furniture can be used for sitting?" requires multi-step reasoning that combines object recognition, attribute filtering, and relational understanding. Recent work improves the interpretability of multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate because they adapt poorly to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
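
To make the multi-step reasoning behind the example question concrete, the following is a minimal illustrative sketch in Python. It is not VISTAR's implementation: it only shows how a step-by-step rationale could narrow a set of objects down to an answer while exposing each intermediate result. The toy scene, the step names, and the function `run_rationale` are all hypothetical, and treating "can be used for sitting" as an affordance filter is an assumption made for illustration.

```python
# Illustrative sketch only: the scene, the step names, and run_rationale are
# hypothetical and are not taken from the VISTAR codebase.

# Toy scene: each object has a category, a color, and a set of affordances.
SCENE = [
    {"name": "chair", "category": "furniture", "color": "red", "affordances": {"sit"}},
    {"name": "table", "category": "furniture", "color": "red", "affordances": {"place"}},
    {"name": "sofa", "category": "furniture", "color": "blue", "affordances": {"sit"}},
    {"name": "apple", "category": "food", "color": "red", "affordances": {"eat"}},
]

# An assumed step-by-step rationale for
# "Which red furniture can be used for sitting?"
RATIONALE = [
    ("select_category", "furniture"),  # object recognition
    ("filter_color", "red"),           # attribute filtering
    ("filter_affordance", "sit"),      # stand-in for relational understanding
]


def run_rationale(scene, rationale):
    """Apply each subtask in order and record the objects left after each step."""
    objects = list(scene)
    trace = []
    for step, arg in rationale:
        if step == "select_category":
            objects = [o for o in objects if o["category"] == arg]
        elif step == "filter_color":
            objects = [o for o in objects if o["color"] == arg]
        elif step == "filter_affordance":
            objects = [o for o in objects if arg in o["affordances"]]
        else:
            raise ValueError(f"unknown subtask: {step}")
        trace.append((step, arg, [o["name"] for o in objects]))
    return [o["name"] for o in objects], trace


answer, trace = run_rationale(SCENE, RATIONALE)
for step, arg, remaining in trace:
    print(f"{step}({arg!r}) -> {remaining}")
print("answer:", answer)
```

Each entry in the trace pairs a subtask with the objects that survive it, which is the sense in which an intermediate rationale can be inspected rather than only the final answer.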
