LSNet: See Large, Focus Small

Abstract

Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet their complex computations pose challenges for practical deployment, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly rely on self-attention mechanisms and convolutions for token mixing. This dependence limits the effectiveness and efficiency of the perception and aggregation processes in lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Code and models are available at https://github.com/jameslahm/lsnet.
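To make the pairing of large-kernel perception and small-kernel aggregation more concrete, the following is a toy 1D sketch in plain Python. It is not the LSNet implementation: the abstract does not say how the perceived context drives the aggregation, so the sketch assumes that a wide kernel summarizes the surroundings of each position and that this summary is mapped linearly to position-dependent weights for a narrow neighborhood. The function names, the linear mapping (`proj`, `bias`), the kernel sizes, and the zero padding are all hypothetical choices made for illustration.

```python
# Toy 1D sketch of the "See Large, Focus Small" idea (illustrative only).
# All names and parameter values are hypothetical, not taken from LSNet.

def large_kernel_perception(signal, kernel):
    """Slide a wide kernel over the signal (zero padding) to summarize context."""
    radius = len(kernel) // 2
    context = []
    for i in range(len(signal)):
        total = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                total += w * signal[j]
        context.append(total)
    return context


def small_kernel_aggregation(signal, context, proj, bias):
    """Aggregate a narrow neighborhood with context-dependent weights.

    Assumption: the weight for neighbor offset k at position i is
    proj[k] * context[i] + bias[k], so the small kernel varies per position.
    """
    radius = len(proj) // 2
    output = []
    for i in range(len(signal)):
        total = 0.0
        for k in range(len(proj)):
            j = i + k - radius
            if 0 <= j < len(signal):
                weight = proj[k] * context[i] + bias[k]
                total += weight * signal[j]
        output.append(total)
    return output


def ls_conv_1d(signal, large_kernel, proj, bias):
    """See large (wide context), then focus small (narrow dynamic aggregation)."""
    context = large_kernel_perception(signal, large_kernel)
    return small_kernel_aggregation(signal, context, proj, bias)


if __name__ == "__main__":
    signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0]
    large_kernel = [1.0 / 7.0] * 7   # width-7 box filter as the "large" view
    proj = [0.1, 0.0, -0.1]          # hypothetical context-to-weight mapping
    bias = [0.25, 0.5, 0.25]         # hypothetical static part of the weights
    print(ls_conv_1d(signal, large_kernel, proj, bias))
```

In this sketch the cost of the wide view is paid once per position, while the actual mixing touches only three neighbors. With `proj` set to zeros the aggregation reduces to an ordinary static small-kernel filter, which is one way to see what the context-dependent term adds.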

