EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Abstract

Multimodal Large Language Models (MLLMs) have recently been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth: single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance on complex tasks. (ii) The guidance remains invariant during decoding, which prevents the Diffusion Transformer (DiT) from progressively decomposing complex instructions into actionable denoising steps, even when the MLLM encodings are correct. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework with two components. First, an iterative thought guidance module activates MLLMs' reasoning potential by iteratively refining latent thought states and bridges these states to the DiT's denoising process. Second, a terminal thought grounding module keeps the reasoning trajectory grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers carefully reasoned guidance, which the DiT executes progressively to solve complex tasks step by step. In extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku), EndoCoT achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
