

Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

May 19, 2025
作者: Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng
cs.AI

Abstract

Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, challenges remain, particularly issues with training algorithms such as catastrophic forgetting and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without updating parameters. Unlike prior methods in this paradigm, which operate in the token space, we propose leveraging the latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
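
To make the mechanism described in the abstract concrete, below is a minimal, self-contained sketch of test-time instance-level adaptation via a policy gradient in a latent space, written against a toy decoder. All names and settings here (the `proj` projection standing in for a frozen decoder, the `self_reward` stub, the latent size, the optimizer and learning rate) are illustrative assumptions, not the authors' implementation; in LatentSeek the decoder is a pretrained LLM and the reward is self-generated by the model.

```python
# Minimal sketch of Test-Time Instance-level Adaptation (TTIA) in latent space:
# a REINFORCE-style policy gradient that updates only a per-instance latent,
# guided by a self-generated reward. Toy stand-ins throughout (assumptions):
# `proj` replaces a frozen LLM decoder, `self_reward` replaces the model's
# self-generated reward signal.
import torch

torch.manual_seed(0)

VOCAB = 16        # toy vocabulary size (assumption)
LATENT_DIM = 32   # toy latent width (assumption)

# Stand-in "decoder": maps the latent to a next-token distribution; kept frozen.
proj = torch.nn.Linear(LATENT_DIM, VOCAB)
for p in proj.parameters():
    p.requires_grad_(False)

def toy_policy(z: torch.Tensor) -> torch.distributions.Categorical:
    """Token distribution conditioned on the latent z."""
    return torch.distributions.Categorical(logits=proj(z))

def self_reward(tokens: torch.Tensor) -> torch.Tensor:
    """Reward stub: fraction of even token ids. In LatentSeek this would be a
    model-produced score of the sampled reasoning trace, with no ground truth."""
    return (tokens % 2 == 0).float().mean()

# Instance-level latent, optimized at test time; no model parameters change.
z = torch.zeros(LATENT_DIM, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

for step in range(20):
    dist = toy_policy(z)
    tokens = dist.sample((8,))              # sample a short "reasoning trace"
    reward = self_reward(tokens)            # score it with the self-reward stub
    log_prob = dist.log_prob(tokens).sum()  # log-likelihood under the current latent
    loss = -reward * log_prob               # REINFORCE: raise probability of rewarded traces
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"reward after adaptation: {self_reward(toy_policy(z).sample((8,))).item():.2f}")
```

The point the sketch illustrates is that adaptation is per instance and parameter-free for the model: only the latent `z` changes across iterations, which matches the abstract's framing of increasing test-time computation without parameter updating.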