A Zero-Shot Language Agent for Computer Control with Structured Reflection

Abstract

Large language models (LLMs) have shown increasing capacity for planning and executing high-level goals in live computer environments (e.g., MiniWoB++). To perform a task, recent work often requires a model to learn from trace examples of the task, via either supervised learning or few-/many-shot prompting. Without these trace examples, it remains a challenge for an agent to autonomously learn and improve its control of a computer, which limits its ability to perform new tasks. We approach this problem with a zero-shot agent that requires no expert traces. Our agent plans executable actions in a partially observed environment and iteratively progresses through a task by identifying and learning from its mistakes via self-reflection and structured thought management. On the easy tasks of MiniWoB++, we show that our zero-shot agent often outperforms recent state-of-the-art (SoTA) methods while reasoning more efficiently. On more complex tasks, our reflective agent performs on par with the prior best models, even though those works had the advantage of access to expert traces or additional screen information.
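The plan, execute, and reflect cycle described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the LLM planner is replaced by a deterministic stub, the environment is a hidden action sequence rather than MiniWoB++, and the names `ToyEnv`, `ReflectiveAgent`, and `solve` are hypothetical. The sketch only shows how a reflection memory keyed by step index might let an agent avoid repeating a known mistake across trials without any expert traces.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a task: succeed by issuing the hidden sequence."""
    solution: tuple

    def run(self, plan):
        """Return (success, index of the first wrong step or None)."""
        if len(plan) != len(self.solution):
            raise ValueError("this sketch assumes plans of the same length as the solution")
        for i, (got, want) in enumerate(zip(plan, self.solution)):
            if got != want:
                return False, i
        return True, None


@dataclass
class ReflectiveAgent:
    """Assumed agent: a deterministic stub replaces the LLM planner."""
    actions: tuple
    horizon: int
    # Structured reflection memory: step index -> actions known to fail there.
    rejected: dict = field(default_factory=dict)

    def plan(self):
        """Pick, for each step, the first action not yet rejected at that step."""
        plan = []
        for step in range(self.horizon):
            banned = self.rejected.get(step, set())
            choice = next((a for a in self.actions if a not in banned), None)
            if choice is None:
                raise RuntimeError(f"no untried action left for step {step}")
            plan.append(choice)
        return plan

    def reflect(self, plan, failed_step):
        """Record the action that failed so later plans avoid it at that step."""
        self.rejected.setdefault(failed_step, set()).add(plan[failed_step])


def solve(env, agent, max_trials=10):
    """Alternate planning and reflection until success or the trial budget ends."""
    for trial in range(1, max_trials + 1):
        plan = agent.plan()
        success, failed_step = env.run(plan)
        if success:
            return plan, trial
        agent.reflect(plan, failed_step)
    return None, max_trials


env = ToyEnv(solution=("type_name", "click_submit"))
agent = ReflectiveAgent(actions=("click_ok", "type_name", "click_submit"), horizon=2)
print(solve(env, agent))  # (['type_name', 'click_submit'], 4)
```

Keying the memory by step index, rather than keeping a flat list of failed plans, is one simple way to keep reflections structured: a mistake at step 1 does not discard what was already learned about step 0.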
