Universal YOCO for Efficient Depth Scaling

Abstract

The rise of test-time scaling has markedly improved the reasoning and agentic capabilities of Large Language Models (LLMs). Yet standard Transformers struggle to scale inference-time compute efficiently, because conventional looping strategies incur high computational overhead and a KV cache that grows with model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a combined effect greater than either delivers alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, these properties let YOCO-U improve token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive on general and long-context benchmarks, demonstrating that integrating efficient-attention architectures with recursive computation is a promising direction for scalable LLMs.
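To make the KV-cache argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' accounting. It assumes that a looped standard Transformer stores separate keys and values for every layer on every loop pass, and that the efficient-attention layers of the Universal Self-Decoder behave like sliding-window attention with a fixed window. The function names, the `window` parameter, and the example sizes are all hypothetical and chosen only for illustration.

```python
def looped_transformer_kv_entries(seq_len, num_layers, num_loops):
    """Cached token positions when every layer uses full attention.

    Assumption: each loop pass sees different activations, so it keeps
    its own keys/values for every layer and every token.
    """
    return seq_len * num_layers * num_loops


def yoco_u_kv_entries(seq_len, self_layers, num_loops, window):
    """Cached token positions for a YOCO-U-style layout (illustrative only).

    Assumption: self-decoder layers keep at most `window` positions each
    (sliding-window style), once per loop pass. A single global cache is
    written once and reused by all cross-decoder layers.
    """
    local_cache = min(seq_len, window) * self_layers * num_loops
    global_cache = seq_len  # shared, so it does not scale with depth
    return local_cache + global_cache


if __name__ == "__main__":
    seq_len = 65536
    baseline = looped_transformer_kv_entries(seq_len, num_layers=24, num_loops=2)
    yoco_u = yoco_u_kv_entries(seq_len, self_layers=12, num_loops=2, window=1024)
    print(baseline, yoco_u)  # 3145728 90112
```

Under these assumptions the baseline cache grows with both depth and loop count, while the YOCO-U-style cache adds only a sequence-length-independent term per extra loop pass.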
