Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

Abstract

We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite on scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. *Contentfulness* measures how much of the thought stream is useful scene content rather than meta-commentary. *Thought-Final Coverage* measures how faithfully the thought stream carries over into the final output. *Dominant Entity Analysis* identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most of the improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Under tight reasoning budgets, the model adds content to the final output that it never reasoned about, a form of hallucination during the compression step. Although they belong to different model tiers, Flash and Flash Lite produce similar thought streams that differ mainly in style: Flash discusses its reasoning process, while Flash Lite focuses on describing the scene.
