

Universal YOCO for Efficient Depth Scaling

April 1, 2026
Authors: Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang, Furu Wei
cs.AI

Abstract

The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
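The decoder-decoder split with a recursively applied self-decoder can be illustrated in a few lines. The sketch below is a toy interpretation, not the paper's implementation: the block names, weight shapes, and the stand-in `tanh` layer are all assumptions. It shows the two properties the abstract highlights, namely that one set of self-decoder weights is reused for R iterations (depth without new parameters), and that a single global KV cache is produced once and read by every cross-decoder layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # hidden size (toy value)
T = 4          # sequence length (toy value)
R = 3          # recursion steps: the SAME self-decoder weights reused R times
n_cross = 2    # cross-decoder layers reading one shared global KV cache

# One set of self-decoder weights, shared across all R iterations
# (the "Universal Self-Decoder": depth grows without new parameters).
W_self = rng.normal(size=(d, d)) / np.sqrt(d)

def self_decoder_step(h):
    # Stand-in for an efficient-attention + FFN block (hypothetical).
    return np.tanh(h @ W_self)

h = rng.normal(size=(T, d))
for _ in range(R):                 # recursive computation, parameters shared
    h = self_decoder_step(h)

# The self-decoder emits ONE global KV cache; its size does not grow with
# the number of cross-decoder layers stacked on top (constant global cache).
W_k, W_v = rng.normal(size=(2, d, d)) / np.sqrt(d)
K, V = h @ W_k, h @ W_v            # produced once, reused by every layer below

def cross_attend(q, K, V):
    # Plain softmax cross-attention over the shared global K, V.
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

x = h
for _ in range(n_cross):           # cross-decoder: reads shared K, V only
    x = x + cross_attend(x, K, V)
```

However deep the cross-decoder stack is made, only the single `(T, d)` pair `K, V` is ever cached, which is the constant-global-KV-cache property; the recursion over `self_decoder_step` adds representational depth without adding parameters or cache entries.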