

Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

May 19, 2025
作者: Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng
cs.AI

Abstract

Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
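
The abstract describes the mechanism only at a high level: for each test instance, LatentSeek holds the model weights fixed and applies policy-gradient updates to latent representations, scored by a self-generated reward. The following is a minimal, self-contained PyTorch sketch of that kind of loop; the toy decoder, the reward function, and all hyperparameters are invented placeholders for illustration, not the paper's actual parameterization.

```python
# Minimal sketch of test-time instance-level policy gradient in latent space.
# NOTE: everything below (toy decoder, reward, hyperparameters) is a
# hypothetical stand-in; it illustrates the loop structure described in the
# abstract, not the paper's implementation.
import torch

torch.manual_seed(0)
VOCAB, HIDDEN, STEPS = 50, 16, 8

# Stand-in for a frozen LLM head: maps a latent state to token logits.
# (A real setup would decode autoregressively from the model's hidden states.)
W_out = torch.randn(HIDDEN, VOCAB)

def decode_dist(z: torch.Tensor) -> torch.distributions.Categorical:
    """Token distribution conditioned on the current latent representation."""
    return torch.distributions.Categorical(logits=z @ W_out)

def self_reward(tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical self-generated reward. The paper would score the decoded
    reasoning trace with the model itself; here we just reward even tokens."""
    return (tokens % 2 == 0).float().mean()

# Instance-level adaptation: only this latent z is optimized, and only for
# this one problem; the "model" (W_out) stays frozen, i.e., no parameter
# updating at test time.
z = torch.zeros(HIDDEN, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

for it in range(20):
    dist = decode_dist(z)
    tokens = dist.sample((STEPS,))          # sample a short "reasoning trace"
    reward = self_reward(tokens)            # score it (no gradient through sampling)
    log_prob = dist.log_prob(tokens).sum()  # REINFORCE-style policy gradient
    loss = -reward * log_prob               # ascend the expected reward
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"iter {it:2d}  reward {reward:.2f}")
```

The one property the sketch preserves from the abstract is the division of labor: test-time computation goes into optimizing a per-instance latent under a self-generated reward signal, while the model's weights never change.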
