A Case Study of Web App Coding with OpenAI Reasoning Models

Abstract

This paper presents a case study of coding tasks performed by OpenAI's latest reasoning models, o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. To raise the difficulty, we introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the performance of the o1 models declines significantly, falling behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that this performance variability stems from instruction comprehension. Specifically, the reasoning mechanism boosts performance when all expectations are captured, but exacerbates errors when key expectations are missed, an effect potentially influenced by input length. We therefore argue that the coding success of reasoning models hinges on a top-notch base model and on SFT that ensures meticulous adherence to instructions.

