Agent-Based Discovery of Neural Network Architectures: AIRA-Compose and AIRA-Design

Abstract

As a step toward recursive self-improvement, we investigate whether LLM agents can autonomously design foundation models that go beyond the standard Transformer. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. The agents evaluate candidates at the million-parameter scale and extrapolate the top designs to 350M, 1B, and 3B parameters. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba hybrids). Pre-trained at the 1B scale, these models consistently outperform Llama 3.2 and the baselines found by Composer. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy over Llama 3.2 by 2.4% and 3.8%, respectively. AIRA-Compose also finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% faster than Llama 3.2 and 71% faster than Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures come within 2.3% and 2.6% of the human state of the art on document matching and text classification, respectively. On the Autoresearch benchmark, Greedy Opus 4.5 achieves a validation bits-per-byte of 0.968 under a fixed time budget, improving on the lowest published value. Together, these frameworks show that AI agents can autonomously discover architectures and algorithmic optimizations that match or surpass hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models and marks a clear step toward recursive self-improvement.
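The abstract reports validation bits-per-byte but does not say how the benchmark computes it. As a reading aid, the following minimal sketch uses the standard definition of the metric: the summed negative log-likelihood, converted from nats to bits, divided by the number of raw bytes in the evaluated text. The function name and the numbers in the usage lines are hypothetical and are not taken from the paper or from the Autoresearch benchmark.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Assumes the standard definition: total bits needed to encode the text,
    divided by the number of raw bytes in that text. Lower is better.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (math.log(2) * total_bytes)


# Hypothetical illustration: a mean loss of 2.0 nats per token over
# 1000 tokens that together cover 4000 bytes of text.
mean_loss_nats_per_token = 2.0
num_tokens = 1000
num_bytes = 4000
bpb = bits_per_byte(mean_loss_nats_per_token * num_tokens, num_bytes)
print(f"{bpb:.4f} bits per byte")
```

Because the metric is normalized by bytes rather than by tokens, it allows comparisons across models that use different tokenizers, and a lower value indicates a better model.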