LSNet: See Large, Focus Small

Abstract

Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet their complex computations pose challenges for practical deployment, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly rely on self-attention mechanisms and convolutions for token mixing. This dependence limits both the effectiveness and the efficiency of the perception and aggregation processes in lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Code and models are available at https://github.com/jameslahm/lsnet.
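The abstract describes LS convolution only at a high level, as large-kernel perception combined with small-kernel aggregation. The following is a minimal single-channel sketch of that idea in pure Python, not the released implementation. The function name `ls_conv`, the `proj` vector that turns the perceived context into per-pixel weights, the softmax normalization, and the zero padding are all illustrative assumptions.

```python
import math


def pad(x, r):
    """Zero-pad a 2D list of floats by r cells on every side."""
    h, w = len(x), len(x[0])
    out = [[0.0] * (w + 2 * r) for _ in range(h + 2 * r)]
    for i in range(h):
        for j in range(w):
            out[i + r][j + r] = x[i][j]
    return out


def window(xp, i, j, k):
    """Flattened k x k patch of the padded map xp centred on original pixel (i, j)."""
    return [xp[i + di][j + dj] for di in range(k) for dj in range(k)]


def ls_conv(x, large_kernel, proj, small=3):
    """Toy 'see large, focus small' operator on one H x W channel.

    large_kernel: K x K static weights used for wide-context perception.
    proj: small*small coefficients (hypothetical) mapping the perceived
          context to one logit per tap of the small kernel.
    """
    K = len(large_kernel)
    flat_large = [v for row in large_kernel for v in row]
    x_large, x_small = pad(x, K // 2), pad(x, small // 2)
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # See large: a static large-kernel filter summarises wide context.
            context = sum(a * b for a, b in zip(flat_large, window(x_large, i, j, K)))
            # Turn that context into per-pixel weights (softmax is an assumption).
            logits = [p * context for p in proj]
            m = max(logits)
            exps = [math.exp(v - m) for v in logits]
            total = sum(exps)
            weights = [e / total for e in exps]
            # Focus small: aggregate only the small neighbourhood with those weights.
            out[i][j] = sum(wt * v for wt, v in zip(weights, window(x_small, i, j, small)))
    return out
```

In this sketch the receptive field that decides the weights is K x K, while the aggregation itself touches only small x small neighbours per pixel, which is the cost asymmetry the strategy relies on. With `proj` set to all zeros the weights become uniform and the operator reduces to a plain box filter.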
