Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

Abstract

We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across trials. The central empirical finding is that lineage feedback lets agents turn evaluator outcomes, including crashes, budget overruns, size failures, and accuracy-gate misses, into later program-level recipe edits rather than one-shot suggestions. The search spans 1,197 headline-run trials plus 600 Parameter Golf control trials; after one-time setup and launch, humans did not choose proposals, edit recipes, override scores, or repair failed trials. In the three headline runs, the same submitted-trial loop reduces Parameter Golf validation bpb by 0.81%, raises NanoChat-D12 CORE by 38.7%, and reduces CIFAR-10 Airbench96 wallclock by 4.59%, with each task measured by its own external evaluator and legality checks. The trace includes a strict architecture-domain audit of 157 headline-run submissions and program rewrites such as a NanoChat attention-kernel path change. Within this scope, the loop autonomously writes code, submits experiments, absorbs feedback, applies and combines known techniques inside each environment, and improves public starting recipes.
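To make the shape of the submitted-trial loop concrete, the following is a minimal sketch and not the system described here. It assumes that a trial can be stored as a small record and that lineage is an append-only list passed back to the proposer. Every name in it (`Trial`, `Outcome`, `run_loop`, `best_trial`, `toy_propose`, `toy_evaluate`) is hypothetical, and the toy evaluator with its size limit is invented purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Outcome(Enum):
    # Failure labels named in the abstract, plus a success label (assumed).
    OK = "ok"
    CRASH = "crash"
    BUDGET_OVERRUN = "budget_overrun"
    SIZE_FAILURE = "size_failure"
    ACCURACY_GATE_MISS = "accuracy_gate_miss"


@dataclass(frozen=True)
class Trial:
    hypothesis: str
    code_diff: str
    outcome: Outcome            # assigned by the evaluator, never by the proposer
    score: Optional[float]      # None when the trial produced no legal score
    parent: Optional[int]       # index of the trial this edit builds on


Proposer = Callable[[List[Trial]], Tuple[str, str, Optional[int]]]
Evaluator = Callable[[str], Tuple[Outcome, Optional[float]]]


def run_loop(propose: Proposer, evaluate: Evaluator, n_trials: int) -> List[Trial]:
    """Append-only trajectory: each proposal sees all earlier measured trials."""
    lineage: List[Trial] = []
    for _ in range(n_trials):
        hypothesis, code_diff, parent = propose(lineage)
        outcome, score = evaluate(code_diff)
        lineage.append(Trial(hypothesis, code_diff, outcome, score, parent))
    return lineage


def best_trial(lineage: List[Trial], lower_is_better: bool = True) -> Optional[Trial]:
    """Best legal trial; failed trials stay in the trace but cannot win."""
    legal = [t for t in lineage if t.outcome is Outcome.OK and t.score is not None]
    if not legal:
        return None
    pick = min if lower_is_better else max
    return pick(legal, key=lambda t: t.score)


def toy_evaluate(code_diff: str) -> Tuple[Outcome, Optional[float]]:
    # Hypothetical evaluator: widths above 8 violate a size limit.
    width = int(code_diff.split("=")[1])
    if width > 8:
        return Outcome.SIZE_FAILURE, None
    return Outcome.OK, abs(width - 6) / 10


def toy_propose(lineage: List[Trial]) -> Tuple[str, str, Optional[int]]:
    # Hypothetical proposer: reacts to the last evaluator outcome.
    if not lineage:
        return "start wide", "width=16", None
    last = lineage[-1]
    width = int(last.code_diff.split("=")[1])
    if last.outcome is Outcome.SIZE_FAILURE:
        return "size failure, so halve the width", f"width={width // 2}", len(lineage) - 1
    return "legal, so try one step narrower", f"width={width - 1}", len(lineage) - 1
```

Calling `run_loop(toy_propose, toy_evaluate, n_trials=5)` yields a five-entry trajectory in which the first trial is a recorded size failure and each later proposal names its parent. The point of the sketch is only that the failure label, not just the score, is available to the next proposal.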
