- Paper: Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales. PyConv contains a pyramid of kernels, where each level involves different types of filters with varying size and depth, which are able to capture different levels of details in the scene.
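- The opening paragraph notes that most current CNNs use 3x3 kernels (larger kernels add many parameters and much computational cost), and that CNNs rely heavily on downsampling to shrink the input size and thereby enlarge the receptive field. This raises two problems: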
First, even though for many current CNNs the theoretical receptive field can cover a big part of the input or even the whole input, it has been shown that the empirical receptive field is much smaller than the theoretical one, even more than 2.7 times smaller in the higher layers of the network.
Second, downsampling the input without previously having access to enough context information (especially in complex scenes as in Fig. 1) may affect the learning process and the recognition performance of the network, as useful details are lost since the receptive field is not large enough to capture different dependencies in the scene before performing the downsampling.
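To make the link between downsampling and receptive field concrete, here is a minimal sketch of the standard recurrence for the theoretical receptive field of a stack of convolution or pooling layers. The function name and the example layer stacks are illustrative assumptions, not taken from the paper, and the sketch ignores dilation and padding. It computes only the theoretical value, which the first point above says overestimates the empirical one.

```python
def theoretical_receptive_field(layers):
    """Return the theoretical receptive field (in input pixels) of a layer stack.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance, in input pixels, between adjacent output positions
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view by (k - 1) jumps
        jump *= stride                  # striding spreads later positions further apart
    return rf


# Three 3x3 convolutions without any downsampling.
print(theoretical_receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# The same stack, but the first convolution downsamples with stride 2.
print(theoretical_receptive_field([(3, 2), (3, 1), (3, 1)]))  # 11
```

With the same three kernels, one stride-2 step grows the theoretical receptive field from 7 to 11 pixels, which is why downsampling is the usual way to enlarge it without resorting to large kernels.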
To be able to capture such a diversity of categories and such a variability in their scales, the use of a single type of kernel (as in standard convolution) with a single spatial size may not be an optimal solution for such complexity.
(1) We introduce pyramidal convolution (PyConv), which contains different levels of kernels with varying size and depth. Besides enlarging the receptive field, PyConv can process the input using increasing kernel sizes in parallel, to capture different levels of details. On top of these advantages, PyConv is very efficient and, with our formulation, it can maintain a similar number of parameters and computational costs as the standard convolution. PyConv is very flexible and extendable, opening the door for a large variety of network architectures for numerous tasks of computer vision (see Section 3).
(2) We propose two network architectures for the image classification task that outperform the baselines by a significant margin. Moreover, they are efficient in terms of number of parameters and computational costs and can outperform other more complex architectures (see Section 4).
(3) We propose a new framework for semantic segmentation. Our novel head for parsing the output provided by a backbone can capture different levels of context information from local to global. It provides state-of-the-art results on scene parsing (see Section 5).
(4) We present network architectures based on PyConv for object detection and video classification tasks, where we report significant improvements in recognition performance over the baseline (see Appendix).
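Contribution (1) claims that PyConv keeps the parameter count close to that of a standard convolution even though it uses larger kernels. A rough way to see how this can work is to count weights, assuming each pyramid level is a grouped convolution whose kernel depth is the number of input feature maps divided by the number of groups. The helper names and the 4-level configuration below are hypothetical and chosen for illustration only.

```python
def conv_params(kernel_size, fm_in, fm_out, groups=1):
    """Number of weights in a (grouped) 2D convolution, ignoring the bias."""
    if fm_in % groups or fm_out % groups:
        raise ValueError("groups must divide both fm_in and fm_out")
    # Each output map only sees fm_in // groups input maps.
    return kernel_size * kernel_size * (fm_in // groups) * fm_out


def pyconv_params(fm_in, levels):
    """Total weights of a pyramid given (kernel_size, fm_out, groups) per level."""
    return sum(conv_params(k, fm_in, fm_out, g) for k, fm_out, g in levels)


# Hypothetical pyramid: 64 input maps, four levels with 16 output maps each.
levels = [(3, 16, 1), (5, 16, 4), (7, 16, 8), (9, 16, 16)]

standard = conv_params(3, 64, 64)   # plain 3x3 convolution, 64 -> 64 maps
pyramid = pyconv_params(64, levels)
print(standard, pyramid)  # 36864 27072
```

In this assumed configuration the pyramid needs fewer weights than the plain 3x3 layer despite kernels up to 9x9, because the larger kernels are made shallower through grouping.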
In practice, when building a PyConv there are several additional rules. The denominator of $FM_i$ at each level of the pyramid in Equation 1 should be a divisor of $FM_i$. In other words, at each pyramid level, the number of feature maps from each created group should be equal. Therefore, as an approximation, when choosing the number of groups for each level of the pyramid (and thus the depth of the kernel), we can take the closest number to the denominator of $FM_i$ from the list of possible divisors of $FM_i$. Furthermore, the number of groups for each level should also be a divisor of the number of output feature maps of each level of PyConv. To be able to easily create different network architectures with PyConv, it is recommended that the number of input feature maps, the groups for each level of the pyramid, and the number of output feature maps for each level of PyConv be powers of 2. Next sections show practical examples.
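The divisor rule above can be sketched as a small helper. Equation 1 is not reproduced in these notes, so the sketch assumes that the denominator of $FM_i$ at a level is the kernel-area ratio between that level's kernel and the smallest kernel of the pyramid. The function name, the tie-breaking choice, and the example sizes are likewise assumptions for illustration.

```python
from math import gcd


def choose_groups(fm_in, fm_out_level, kernel_size, base_kernel_size=3):
    """Pick a group count for one pyramid level.

    Assumption: the target denominator is (kernel_size / base_kernel_size) ** 2.
    The result is the closest value that divides both fm_in and fm_out_level;
    ties go to the smaller group count.
    """
    target = (kernel_size / base_kernel_size) ** 2
    common = gcd(fm_in, fm_out_level)
    # Divisors of the gcd are exactly the common divisors of both counts.
    divisors = [d for d in range(1, common + 1) if common % d == 0]
    return min(divisors, key=lambda d: (abs(d - target), d))


# 64 input maps, 16 output maps per level, smallest kernel 3x3.
for k in (3, 5, 7, 9):
    print(k, choose_groups(64, 16, k))
```

Because both feature-map counts are powers of 2 here, every candidate group count is also a power of 2, which is the situation the last recommendation above is aiming for. The helper only illustrates the rule as stated; an actual architecture may round the group counts differently.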
- Multi-Scale Processing
To build an effective pipeline for scene parsing, it is necessary to create a head that can parse the feature maps provided by the backbone and obtain not only local but also global information.
We propose a novel head for scene parsing (image segmentation) task, PyConv Parsing Head (PyConvPH). The proposed PyConvPH is able to deal with both local and global information at multiple scales.
- Local PyConv block: described in detail below; its structure is shown in Fig. 6a above:
- Global PyConv block: described in detail below; its structure is shown in Fig. 6b above:
- Merge Local-Global PyConv block
In this paper we proposed pyramidal convolution (PyConv), which contains several levels of kernels with varying scales. PyConv shows significant improvements for different visual recognition tasks and, at the same time, it is also efficient and flexible, providing a very large pool of potential network architectures. Our novel framework for image segmentation provides state-of-the-art results. In addition to a broad range of visual recognition tasks, PyConv can have a significant impact in many other directions, such as image restoration, completion/inpainting, noise/artifact removal, enhancement and image/video super-resolution.