Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

Abstract

Toward recursive self-improvement, we investigate how LLM agents can autonomously design foundation models that go beyond the standard Transformer. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. The agents evaluate million-parameter candidates and extrapolate the top designs to 350M, 1B, and 3B parameters, yielding 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba hybrids). When pre-trained at the 1B scale, these models consistently outperform Llama 3.2 and the baselines found by Composer. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy over Llama 3.2 by 2.4% and 3.8%, respectively. AIRA-Compose also finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% faster than Llama 3.2 and 71% faster than Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures come within 2.3% and 2.6% of the human state of the art on document matching and text classification, respectively. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, below the lowest published value. Together, these frameworks show that AI agents can autonomously discover architectures and algorithmic optimizations that match or surpass hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models and marks a clear step toward recursive self-improvement.
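
For readers unfamiliar with the bits-per-byte metric, it is commonly obtained by converting the summed negative log-likelihood over a validation set from nats to bits and dividing by the number of UTF-8 bytes in the evaluated text, so lower values are better. The following is a minimal sketch under that assumption; the function name `bits_per_byte` and the toy inputs are hypothetical, and the benchmark's exact evaluation procedure may differ.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    total_nll_nats: sum of per-token cross-entropy losses, natural-log base.
    total_bytes: number of UTF-8 bytes in the text those tokens encode.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # makes the score independent of the tokenizer's vocabulary.
    return total_nll_nats / (math.log(2) * total_bytes)


# Toy illustration with made-up numbers (not results from the paper):
# a model that spends exactly one bit on every byte of the text.
text = "hello world"
n_bytes = len(text.encode("utf-8"))
toy_nll = n_bytes * math.log(2)
toy_bpb = bits_per_byte(toy_nll, n_bytes)
```

Because the score is normalized by bytes rather than tokens, it allows models with different tokenizers to be compared on the same validation text.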