Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

Abstract

Toward recursive self-improvement, we investigate whether LLM agents can autonomously design foundation models that go beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. The agents evaluate million-parameter candidates and extrapolate the top designs to 350M, 1B, and 3B parameters. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba hybrids). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy over Llama 3.2 by 2.4% and 3.8%, respectively. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, respectively, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design, in turn, tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures come within 2.3% and 2.6% of the human state of the art on document matching and text classification, respectively. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show that AI agents can autonomously discover architectures and algorithmic optimizations that match or surpass hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models and marks a clear step toward recursive self-improvement.