Pushing to the Limit: An Attention-Based Dual-Prune Approach for Highly Compact CNN Filter Pruning
-
Abstract
Filter pruning is an important technique for compressing convolutional neural networks (CNNs), enabling the acquisition of lightweight, high-performance models for practical deployment. However, existing filter pruning methods often suffer a sharp performance drop when the pruning ratio is large, possibly due to the unrecoverable information loss caused by aggressive pruning. In this paper, we propose a dual-attention-based pruning approach, called DualPrune, designed to push the limits of network pruning at ultra-high compression ratios. First, it employs a graph attention network to automatically extract filter-level and layer-level features from a CNN, based on the role each filter plays in the entire computation graph. These comprehensive features are then fed into a side-attention network, which generates sparse attention weights for individual filters to guide model pruning. To avoid layer collapse, the side-attention network adopts a side-path design that properly preserves the information flow through the CNN model. This allows the CNN model to be pruned at a high compression ratio at initialization and then trained from scratch. Extensive experiments on several well-known CNN models and real-world datasets demonstrate that the proposed DualPrune method outperforms state-of-the-art methods, with particularly significant performance improvements at high pruning ratios.
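To make the core idea concrete, the sketch below shows how per-filter attention scores can be used to mask out low-scoring filters at a target keep ratio. This is only a minimal illustration, not the paper's implementation: the `prune_by_attention` helper, the layer shapes, and the random stand-in scores are all hypothetical.

```python
import numpy as np

def prune_by_attention(filter_weights, attention_scores, keep_ratio):
    """Zero out the filters with the lowest attention scores.

    filter_weights:   array of shape (num_filters, in_channels, k, k)
    attention_scores: one score per filter (higher = more important)
    keep_ratio:       fraction of filters to keep, e.g. 0.1 for 90% pruning
    """
    num_filters = filter_weights.shape[0]
    # Keep at least one filter per layer so information can still flow (no layer collapse).
    num_keep = max(1, int(round(num_filters * keep_ratio)))
    # Indices of the filters with the highest attention scores.
    keep_idx = np.argsort(attention_scores)[-num_keep:]
    mask = np.zeros(num_filters, dtype=bool)
    mask[keep_idx] = True
    # Broadcast the per-filter mask over the remaining weight dimensions.
    pruned = filter_weights * mask[:, None, None, None]
    return pruned, mask

# Hypothetical example: a conv layer with 64 filters of shape 3x3x3,
# pruned to 10% of its filters using random stand-in attention scores.
rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 3, 3, 3))
scores = rng.random(64)
pruned_weights, kept = prune_by_attention(weights, scores, keep_ratio=0.1)
print(f"kept {kept.sum()} of {kept.size} filters")
```

In the actual method, the attention scores would come from the graph-attention and side-attention networks rather than being supplied directly, and pruning is applied at initialization before training the compact model from scratch.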