(1D) Ordered Tokens Enable Efficient Test-Time Search

Abstract

Tokenization is a key component of autoregressive (AR) generative models, converting raw data into units that are easier to model. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structure affects the ability to steer generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with a coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states of coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.
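To make the notion of verifier-guided test-time search concrete, the following is a minimal sketch of two of the search algorithms named above, best-of-N and beam search, over token sequences. It is not the authors' implementation. Everything in it is an illustrative assumption: the toy vocabulary, the uniform `propose` function standing in for an AR prior, and the `verifier` that counts matches against a hypothetical `TARGET` sequence. The one property it relies on is the one the abstract emphasizes: the verifier can score a partial prefix, which is what lets beam search prune candidates before generation finishes.

```python
import random
from typing import Callable, List, Tuple

Tokens = Tuple[int, ...]

# Toy setup (all hypothetical): 8-token vocabulary, sequences of length 6.
VOCAB = 8
LENGTH = 6
TARGET: Tokens = (3, 1, 4, 1, 5, 2)


def propose(prefix: Tokens, rng: random.Random) -> int:
    """Stand-in for an AR prior: samples the next token uniformly."""
    return rng.randrange(VOCAB)


def sample_full(rng: random.Random) -> Tokens:
    """Draws one complete sequence token by token from the stand-in prior."""
    seq: Tokens = ()
    for _ in range(LENGTH):
        seq = seq + (propose(seq, rng),)
    return seq


def verifier(prefix: Tokens) -> int:
    """Toy verifier: number of positions where the prefix matches TARGET.

    It accepts partial sequences, mimicking a verifier that can evaluate
    intermediate states of a coarse-to-fine token sequence.
    """
    return sum(a == b for a, b in zip(prefix, TARGET))


def best_of_n(
    sampler: Callable[[random.Random], Tokens],
    score: Callable[[Tokens], int],
    n: int,
    rng: random.Random,
) -> Tokens:
    """Samples n complete sequences and returns the highest-scoring one."""
    candidates = [sampler(rng) for _ in range(n)]
    return max(candidates, key=score)


def beam_search(
    score: Callable[[Tokens], int],
    length: int,
    beam_width: int,
    expansions: int,
    rng: random.Random,
) -> Tokens:
    """Grows prefixes one token at a time, keeping the top-scoring prefixes."""
    beams: List[Tokens] = [()]
    for _ in range(length):
        # Expand every kept prefix with several sampled next tokens.
        candidates = [
            prefix + (propose(prefix, rng),)
            for prefix in beams
            for _ in range(expansions)
        ]
        # Remove duplicates while preserving order, then rank by verifier score.
        candidates = list(dict.fromkeys(candidates))
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


if __name__ == "__main__":
    rng = random.Random(0)
    bon = best_of_n(sample_full, verifier, n=16, rng=rng)
    beam = beam_search(verifier, LENGTH, beam_width=4, expansions=4, rng=rng)
    print("best-of-N:", bon, "score", verifier(bon))
    print("beam search:", beam, "score", verifier(beam))
```

In this sketch, best-of-N only consults the verifier on finished sequences, whereas beam search consults it at every step. The latter is only useful when intermediate prefixes are meaningful to the verifier, which is the property the abstract attributes to coarse-to-fine ordered tokens.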