

A Case Study of Web App Coding with OpenAI Reasoning Models

September 19, 2024
Author: Yi Cui
cs.AI

Abstract

This paper presents a case study of coding tasks performed by OpenAI's latest reasoning models, o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. To probe their limits, we introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the performance of the o1 models declines significantly, falling behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that this performance variability stems from instruction comprehension: the reasoning mechanism boosts performance when all expectations are captured, but exacerbates errors when key expectations are missed, an effect potentially influenced by input length. We therefore argue that the coding success of reasoning models hinges on a top-notch base model and SFT to ensure meticulous adherence to instructions.

