TTCS: Test-Time Curriculum Synthesis for Self-Evolving

Abstract

Test-time training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: the raw test questions are often too hard to yield high-quality pseudo-labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co-evolving test-time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. The two policies evolve through iterative optimization. The synthesizer generates progressively more challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver's current capability. The solver updates itself using self-consistency rewards computed from multiple sampled responses on both the original test questions and the synthetic questions. Crucially, the solver's feedback guides the synthesizer to generate questions aligned with the model's current capability, and the generated question variants in turn stabilize the solver's test-time training. Experiments show that TTCS consistently improves reasoning ability on challenging mathematical benchmarks across different LLM backbones and transfers to general-domain tasks, highlighting a scalable path towards dynamically constructing test-time curricula for self-evolution. Our code and implementation details are available at https://github.com/XMUDeepLIT/TTCS.
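To make the self-consistency reward concrete, here is a minimal sketch in Python. It assumes the common majority-vote formulation: the most frequent final answer among the sampled responses is taken as the pseudo-label, and each response receives a reward of 1 if it matches that label and 0 otherwise. The abstract does not specify the exact reward used by TTCS, so the function name `self_consistency_rewards`, the binary reward, and the agreement rate returned here are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def self_consistency_rewards(answers):
    """Majority-vote pseudo-labeling over sampled final answers.

    Hypothetical helper for illustration only; not taken from the TTCS code.
    Returns (pseudo_label, rewards, agreement), where rewards[i] is 1.0 if
    answers[i] equals the majority answer and 0.0 otherwise, and agreement
    is the fraction of samples that voted for the pseudo-label.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the answer with the highest vote count;
    # ties are broken by first occurrence in the input.
    pseudo_label, votes = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    agreement = votes / len(answers)
    return pseudo_label, rewards, agreement


# Example: four sampled final answers to one question.
label, rewards, agreement = self_consistency_rewards(["42", "42", "41", "42"])
```

A low agreement rate suggests the pseudo-label is unreliable, which is one way to read the abstract's point that raw test questions can be too hard to yield high-quality pseudo-labels. Whether TTCS uses the agreement rate as a difficulty signal for the synthesizer is not stated in the abstract.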

