Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital
April 28, 2026
Authors: T. J. Barton, Chris Constantakis, Patti Hauseman, Annie Mous, Alaska Hoffman, Brian Bergeron, Hunter Goodreau
cs.AI
Abstract
We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
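The abstract names typed controls and policy validation as parts of the operating layer but does not describe how they are implemented. As a rough illustration only, the sketch below shows one way a proposed trade might be checked against user-set limits before submission. Every name here (`VaultControls`, `ProposedTrade`, `validate_trade`, and each field and threshold) is a hypothetical assumption for illustration and is not taken from DX Terminal Pro.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Literal


@dataclass(frozen=True)
class VaultControls:
    """Hypothetical typed controls a user might set on a vault (assumed fields)."""
    max_trade_eth: Decimal          # largest single trade allowed
    allowed_tokens: frozenset[str]  # tokens the agent may trade
    max_fee_fraction: Decimal       # ceiling on fee / trade size


@dataclass(frozen=True)
class ProposedTrade:
    """Hypothetical structured action emitted by the agent."""
    side: Literal["buy", "sell"]
    token: str
    amount_eth: Decimal
    est_fee_eth: Decimal


def validate_trade(trade: ProposedTrade, controls: VaultControls,
                   balance_eth: Decimal) -> list[str]:
    """Return a list of policy violations; an empty list means policy-valid."""
    violations: list[str] = []
    if trade.side not in ("buy", "sell"):
        violations.append("unknown side")
    if trade.token not in controls.allowed_tokens:
        violations.append("token not allowed")
    if trade.amount_eth <= 0:
        violations.append("non-positive amount")
    elif trade.est_fee_eth / trade.amount_eth > controls.max_fee_fraction:
        violations.append("fee fraction too high")
    if trade.amount_eth > controls.max_trade_eth:
        violations.append("exceeds max trade size")
    if trade.side == "buy" and trade.amount_eth + trade.est_fee_eth > balance_eth:
        violations.append("insufficient balance")
    return violations


# Usage: only trades with no violations would be forwarded for settlement.
controls = VaultControls(Decimal("1"), frozenset({"AAA"}), Decimal("0.05"))
trade = ProposedTrade("buy", "AAA", Decimal("0.5"), Decimal("0.001"))
print(validate_trade(trade, controls, balance_eth=Decimal("2")))
```

Returning every violation, instead of stopping at the first, is a design choice made here so that a rejected action could be logged with its full set of reasons, in the spirit of the trace-level observability the abstract mentions.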