OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Abstract

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
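
As an illustration of what an auditable partial-credit reward over application state could look like, the following is a minimal sketch and not OpenComputer's actual implementation. The `Check` class, the `partial_credit` function, the weighting scheme, and the spreadsheet state snapshot are all hypothetical names and data introduced here for illustration; the abstract does not specify the verifier interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Check:
    """One machine-checkable condition on application state (hypothetical)."""
    name: str
    predicate: Callable[[Dict[str, Any]], bool]
    weight: float = 1.0


def partial_credit(state: Dict[str, Any], checks: List[Check]) -> Dict[str, Any]:
    """Return a weighted score in [0, 1] plus a per-check audit log."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must carry positive total weight")
    audit = []
    earned = 0.0
    for c in checks:
        try:
            passed = bool(c.predicate(state))
        except (KeyError, TypeError):
            passed = False  # missing or malformed state counts as a failed check
        audit.append({"check": c.name, "passed": passed, "weight": c.weight})
        if passed:
            earned += c.weight
    success = all(entry["passed"] for entry in audit)
    return {"reward": earned / total, "success": success, "audit": audit}


# Assumed state snapshot, as a spreadsheet verifier endpoint might return it.
state = {"sheet": "Q1", "cells": {"A1": "Revenue", "B1": 1200}, "saved": False}
checks = [
    Check("header_set", lambda s: s["cells"]["A1"] == "Revenue"),
    Check("value_set", lambda s: s["cells"]["B1"] == 1200),
    Check("file_saved", lambda s: s["saved"] is True, weight=2.0),
]
result = partial_credit(state, checks)
```

In this sketch the agent edited both cells but never saved the file, so it earns partial credit without end-to-end success, and the audit log records which check failed. Deterministic predicates over inspected state are what make such a score reproducible, in contrast to a judgment made from screenshots or trajectory text alone.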