Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM

Abstract

Large multimodal models (LMMs) excel in multimodal tasks but face significant computational challenges due to excessive computation on visual tokens. Unlike token reduction methods, which focus on token-level redundancy, we identify and study computation-level redundancy on vision tokens, which avoids any loss of information. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that uses proxy vision tokens to alleviate the computational burden on the original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further. The code will be made public at https://github.com/penghao-wu/ProxyV.
