Self-Improving Language Models via Bidirectional Evolutionary Search

Abstract

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, which restricts exploration to regions with substantial model probability mass. To address these limitations, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories, generating candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides the forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell, whereas evolution operators can escape it, and that backward search can exponentially reduce the number of samples required to find a correct answer. Experiments show that BES enables consistent gains on challenging post-training tasks where mainstream post-training algorithms fail to improve. At inference time, on three open problem solving benchmarks, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at https://github.com/Embodied-Minds-Lab/BES.
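As a rough illustration of the two ideas named in the abstract, the toy sketch below scores candidates by how many checkable subgoals they satisfy (dense feedback) and recombines partial candidates with one-point crossover. It is not the BES implementation: the integer-list candidates, the two hand-written subgoals, and the names `dense_score`, `sparse_score`, `recombine`, and `evolve_step` are all assumptions introduced for this example.

```python
from typing import Callable, List, Sequence

# Hypothetical toy setting: a candidate is a list of four integers.
Candidate = List[int]
Subgoal = Callable[[Candidate], bool]


def sparse_score(candidate: Candidate, subgoals: Sequence[Subgoal]) -> int:
    """Single pass/fail bit: 1 only if every subgoal holds."""
    return int(all(goal(candidate) for goal in subgoals))


def dense_score(candidate: Candidate, subgoals: Sequence[Subgoal]) -> int:
    """Number of satisfied subgoals, usable as intermediate feedback."""
    return sum(1 for goal in subgoals if goal(candidate))


def recombine(a: Candidate, b: Candidate, cut: int) -> Candidate:
    """One-point crossover: prefix of `a` joined to suffix of `b`."""
    return a[:cut] + b[cut:]


def evolve_step(population: Sequence[Candidate],
                subgoals: Sequence[Subgoal]) -> Candidate:
    """Try every crossover of every ordered pair; return the best by dense score."""
    best = max(population, key=lambda c: dense_score(c, subgoals))
    for a in population:
        for b in population:
            for cut in range(1, len(a)):
                child = recombine(a, b, cut)
                if dense_score(child, subgoals) > dense_score(best, subgoals):
                    best = child
    return best


# Assumed subgoals standing in for a decomposed task.
subgoals: List[Subgoal] = [
    lambda c: c[0] + c[1] == 3,   # first half is correct
    lambda c: c[2] + c[3] == 7,   # second half is correct
]

# Each parent solves one subgoal, so the sparse signal rates both as failures.
population = [[1, 2, 0, 0], [0, 0, 3, 4]]
best = evolve_step(population, subgoals)
print(best, dense_score(best, subgoals), sparse_score(best, subgoals))
```

In this toy case both parents receive a sparse score of 0, yet each satisfies one subgoal; the dense score exposes that partial progress, and a single crossover combines the two halves into a candidate that satisfies both.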