Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

Abstract

Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.
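
The branch-and-merge step described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the aggregation rule, so a plain average of the K sampled embeddings is assumed here, and the names `softmax`, `multiplex_token`, and `EMBEDDINGS`, as well as the toy 3-token vocabulary, are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiplex_token(logits, embeddings, k, rng):
    """Branch: sample k token ids from softmax(logits), with replacement.
    Merge: average the sampled embeddings into one continuous vector.
    ASSUMPTION: plain averaging; the source does not specify the rule.
    """
    probs = softmax(logits)
    ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [sum(embeddings[i][d] for i in ids) / k for d in range(dim)]
    return ids, merged


# Hypothetical toy vocabulary of 3 tokens with 2-dimensional embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
rng = random.Random(0)

# Confident model: one logit dominates, so every sample is token 0 and
# the merged vector equals that token's discrete embedding.
confident_ids, confident_vec = multiplex_token(
    [50.0, 0.0, 0.0], EMBEDDINGS, k=4, rng=rng
)

# Uncertain model: uniform logits, so the samples can differ and the
# merged vector blends several candidate embeddings in a single step.
uncertain_ids, uncertain_vec = multiplex_token(
    [0.0, 0.0, 0.0], EMBEDDINGS, k=4, rng=rng
)

print(confident_ids, confident_vec)
print(uncertain_ids, uncertain_vec)
```

Under these assumptions the sketch shows the self-adaptive behaviour in miniature: with a sharply peaked distribution the merged vector collapses to a single token embedding, as in standard CoT, while with a flat distribution it mixes several candidates without adding a sequence position.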
