Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

Abstract

Reasoning ability, a core component of human intelligence, remains a significant challenge for Large Language Models (LLMs) in the pursuit of artificial general intelligence (AGI). Although model performance has improved in line with the training scaling law, significant challenges remain, notably catastrophic forgetting in training algorithms and the limited availability of novel training data. As an alternative, test-time scaling improves reasoning performance by increasing test-time computation without updating model parameters. Unlike prior methods in this paradigm, which focus on token space, we propose leveraging the latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek uses policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis shows that LatentSeek is highly efficient: it typically converges within a few iterations on problems of average complexity, yet still benefits from additional iterations, which highlights the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
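The core idea of updating latent representations with policy gradient while leaving model weights untouched can be illustrated with a toy sketch. This is not the authors' implementation: the frozen output head `W`, the stand-in reward `toy_self_reward`, the learning rate, the greedy decoding step, and the stop-on-positive-reward rule are all assumptions chosen for illustration, and a single latent vector stands in for the latent representations of a real LLM.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(W, z):
    # Frozen toy output head: logits[v] = sum_j W[v][j] * z[j].
    logits = [sum(w * zj for w, zj in zip(row, z)) for row in W]
    probs = softmax(logits)
    token = max(range(len(probs)), key=probs.__getitem__)  # greedy decoding
    return token, probs

def grad_log_prob(W, probs, token):
    # Gradient of log softmax(W z)[token] with respect to z:
    # W[token] - sum_v probs[v] * W[v]
    dim = len(W[0])
    return [W[token][j] - sum(p * row[j] for p, row in zip(probs, W))
            for j in range(dim)]

def latent_policy_gradient(W, z0, reward_fn, lr=1.0, max_iters=30):
    """REINFORCE-style update of the latent z only; W is never modified."""
    z = list(z0)
    for step in range(max_iters):
        token, probs = decode(W, z)
        reward = reward_fn(token)
        if reward > 0:  # assumed stopping rule: the reward signal is satisfied
            return token, z, step
        grad = grad_log_prob(W, probs, token)
        # z <- z + lr * reward * grad_z log pi(token | z)
        z = [zj + lr * reward * g for zj, g in zip(z, grad)]
    token, _ = decode(W, z)
    return token, z, max_iters

# Hypothetical single problem instance: 4 "tokens", identity output head.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
z0 = [1.0, 0.5, 0.0, -0.5]   # initial latent prefers token 0
TARGET = 3                   # known only to the toy reward below

def toy_self_reward(token):
    # Stand-in for a self-generated reward signal: +1 if acceptable, else -1.
    return 1.0 if token == TARGET else -1.0

token, z, steps = latent_policy_gradient(W, z0, toy_self_reward)
print(token, steps)
```

In this sketch, a negative reward lowers the probability of the currently decoded token, so the latent moves away from rejected outputs until the reward turns positive. All of the adaptation happens per instance and in the latent vector, which mirrors the test-time, no-parameter-update setting described in the abstract.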
