ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

Abstract

Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on large-scale datasets is computationally expensive and inefficient because of redundancy in the data, which motivates multimodal data selection as a way to improve training efficiency. Existing data selection methods for VIT require either costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, which limits scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting the visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.
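
To make the two stages described in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the pooling rule or the scoring formula, so several choices here are assumptions made for illustration: instruction-to-visual attention is averaged over instruction tokens, the `top_r` most-attended visual tokens are mean-pooled, and importance is taken to be the squared projection of each representation onto the top-`k` eigenvectors of the d-by-d Gram matrix. All function names and parameters (`attention_guided_representation`, `subspace_importance_scores`, `select_top_fraction`, `top_r`, `k`) are hypothetical, and random arrays stand in for real VLM features and attention maps.

```python
import numpy as np


def attention_guided_representation(visual_feats, attn, top_r):
    """Pool the visual tokens that instruction tokens attend to most.

    visual_feats: (m, d) visual token features for one sample.
    attn: (t, m) attention weights from t instruction tokens to m visual tokens.
    top_r: number of most-attended visual tokens to keep (assumed knob).
    """
    token_attention = attn.mean(axis=0)          # (m,) average over instruction tokens
    keep = np.argsort(token_attention)[-top_r:]  # indices of the top_r visual tokens
    return visual_feats[keep].mean(axis=0)       # (d,) pooled representation


def subspace_importance_scores(reps, k):
    """Score each sample by its energy in the top-k dominant subspace.

    reps: (n, d) matrix of sample representations.
    The Gram matrix costs O(n d^2) and its eigendecomposition O(d^3),
    so the cost is linear in n and no n-by-n similarity matrix is formed.
    """
    gram = reps.T @ reps                         # (d, d)
    _, eigvecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    basis = eigvecs[:, -k:]                      # (d, k) dominant directions
    projected = reps @ basis                     # (n, k)
    return np.sum(projected ** 2, axis=1)        # (n,) assumed scoring rule


def select_top_fraction(scores, fraction):
    """Return indices of the highest-scoring samples within the budget."""
    budget = max(1, int(round(fraction * len(scores))))
    return np.argsort(scores)[::-1][:budget]


# Synthetic stand-ins for real VLM outputs (purely illustrative).
rng = np.random.default_rng(0)
n, m, t, d = 200, 16, 4, 8
reps = np.stack([
    attention_guided_representation(
        rng.normal(size=(m, d)), rng.random(size=(t, m)), top_r=4
    )
    for _ in range(n)
])
scores = subspace_importance_scores(reps, k=3)
selected = select_top_fraction(scores, fraction=0.16)
print(len(selected), "of", n, "samples selected")
```

The point of the sketch is the cost structure: every sample is scored against a fixed d-by-k basis, so doubling the dataset roughly doubles the work, whereas a pairwise-similarity approach would quadruple it. Plain top-score selection is only one possible way to use such scores and may differ from the selection rule in the released code.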
