Neighboring Autoregressive Modeling for Efficient Visual Generation

Abstract

Visual autoregressive models typically follow a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens correlate significantly more strongly with their spatially or temporally adjacent tokens than with distant ones. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the number of model forward steps required for generation. Experiments on ImageNet 256×256 and UCF101 demonstrate that NAR achieves 2.4× and 8.6× higher throughput, respectively, while obtaining superior FID/FVD scores on both image and video generation tasks compared to the PAR-4X approach. When evaluated on the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4% of the training data. Code is available at https://github.com/ThisisBillhe/NAR.
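
The near-to-far decoding order described in the abstract can be illustrated with a small sketch. It only enumerates which grid positions would share a decoding step; it does not include the model or the dimension-oriented decoding heads, and it is not taken from the authors' repository. The function name `manhattan_schedule` and the choice of the origin corner as the initial token are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def manhattan_schedule(shape, start=None):
    """Group grid positions by Manhattan distance from `start`.

    Each returned group is one decoding step: all positions in a group
    are equally far from the start and could be predicted in parallel.
    """
    if start is None:
        # Assumption: the initial token sits at the origin corner of the grid.
        start = (0,) * len(shape)
    groups = defaultdict(list)
    for pos in product(*(range(n) for n in shape)):
        dist = sum(abs(p - s) for p, s in zip(pos, start))
        groups[dist].append(pos)
    # Ascending distance gives the near-to-far order.
    return [groups[d] for d in sorted(groups)]


# 4x4 image token grid: 7 steps instead of 16 raster-order steps.
image_steps = manhattan_schedule((4, 4))
print(len(image_steps))
print(image_steps[2])  # positions at distance 2: [(0, 2), (1, 1), (2, 0)]

# 2x4x4 video token grid (time, height, width): 8 steps instead of 32.
video_steps = manhattan_schedule((2, 4, 4))
print(len(video_steps))
```

Under these assumptions, a grid with side lengths n_1, ..., n_k is covered in (n_1 - 1) + ... + (n_k - 1) + 1 steps rather than n_1 × ... × n_k, which is where the reduction in forward steps comes from.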
