Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

Abstract

We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content to the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
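The abstract describes the first two metrics only in words, so the following is a minimal sketch of how such ratios could be computed, not the paper's actual definitions. It assumes each thought-stream sentence already carries a scene-content label (for example, from a judge model) and that thought and final output have each been reduced to a set of content items. The names `Sentence`, `is_scene_content`, `contentfulness`, `thought_final_coverage`, and `unsupported_final_items` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    is_scene_content: bool  # hypothetical label, e.g. assigned by a judge model


def contentfulness(thought: list[Sentence]) -> float:
    """Share of thought-stream sentences labeled as scene content (assumed definition)."""
    if not thought:
        return 0.0
    return sum(s.is_scene_content for s in thought) / len(thought)


def thought_final_coverage(thought_items: set[str], final_items: set[str]) -> float:
    """Share of thought-stream items that also appear in the final output (assumed definition)."""
    if not thought_items:
        return 0.0
    return len(thought_items & final_items) / len(thought_items)


def unsupported_final_items(thought_items: set[str], final_items: set[str]) -> set[str]:
    """Items in the final output that never appeared in the thought stream."""
    return final_items - thought_items


# Toy data, invented purely for illustration.
thought = [
    Sentence("A man walks a dog in a park.", True),
    Sentence("Let me check the frames again.", False),
    Sentence("The dog is pulling on a red leash.", True),
    Sentence("I should keep the summary short.", False),
]
thought_items = {"man", "dog", "park", "leash"}
final_items = {"man", "dog", "park", "bicycle"}

score = contentfulness(thought)
coverage = thought_final_coverage(thought_items, final_items)
unsupported = unsupported_final_items(thought_items, final_items)
print(score, coverage, unsupported)
```

In this toy case, half of the thought sentences are scene content, three of the four thought items reach the final output, and "bicycle" appears only in the final output, which is the kind of item the abstract calls a compression-step hallucination.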


