Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

Abstract

Chain-of-thought (CoT) prompting has become a widely used strategy for working with large language and multimodal models. While CoT has been shown to improve performance across many tasks, determining the settings in which it is effective remains an ongoing effort. In particular, it is still an open question in what settings CoT systematically reduces model performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, looking at cases where (i) verbal thinking or deliberation hurts performance in humans, and (ii) the constraints governing human performance generalize to language models. Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions. In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., a drop of up to 36.3% in absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts. We also identify three tasks that satisfy condition (i) but not (ii), and find that while verbal thinking reduces human performance in these tasks, CoT retains or increases model performance. Overall, our results show that while there is not an exact parallel between the cognitive processes of models and those of humans, considering cases where thinking has negative consequences for human performance can help us identify settings where it negatively impacts models. By connecting the literature on human deliberation with evaluations of CoT, we offer a new tool that can be used in understanding the impact of prompt choices and inference-time reasoning.
