Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

Abstract

Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.
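The branch-and-merge step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that the K candidates are drawn independently with replacement from the next-token distribution, that the merge is a plain average of their embeddings, and that the log-probability of a multiplex token is the sum of the log-probabilities of its sampled candidates. The function name `multiplex_token`, the three-token vocabulary, and the two-dimensional embeddings are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiplex_token(logits, embeddings, k, rng):
    """Branch: sample k token ids from softmax(logits).
    Merge: average their embeddings into one continuous vector.

    Assumptions (not confirmed by the abstract): i.i.d. sampling with
    replacement, unweighted mean as the aggregation, and a joint
    log-probability equal to the sum of per-sample log-probabilities.
    """
    probs = softmax(logits)
    ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [sum(embeddings[i][d] for i in ids) / k for d in range(dim)]
    log_prob = sum(math.log(probs[i]) for i in ids)
    return ids, merged, log_prob


# Hypothetical toy vocabulary: 3 tokens with 2-dimensional embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
rng = random.Random(0)

# Confident case: one logit dominates, so all K samples pick token 0
# and the merged vector equals that token's embedding.
confident_ids, confident_vec, confident_lp = multiplex_token(
    [50.0, 0.0, 0.0], EMBEDDINGS, k=4, rng=rng
)

# Uncertain case: uniform logits, so the samples may differ and the merged
# vector may blend several embeddings while still occupying one position.
uncertain_ids, uncertain_vec, uncertain_lp = multiplex_token(
    [0.0, 0.0, 0.0], EMBEDDINGS, k=4, rng=rng
)
```

Under these assumptions the self-adaptive behaviour falls out of the sampling itself: a peaked distribution yields K identical samples and therefore an ordinary discrete embedding, while a flat distribution yields a blend, and in both cases only one sequence position is consumed.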
