Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

Abstract

We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.