1. Preface

2. Abstract

  • Introduces PyConv, the method proposed in the paper.

This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales. PyConv contains a pyramid of kernels, where each level involves different types of filters with varying size and depth, which are able to capture different levels of details in the scene.

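  • In short: PyConv processes the input at multiple filter scales. It holds a pyramid of kernels in which each level uses filters of a different size and depth, so the levels capture different levels of detail in the scene.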

3. Introduction

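  • The opening paragraph notes that most current CNNs use 3x3 kernels (larger kernels add many parameters and much complexity), and that CNNs rely heavily on downsampling to shrink the input size and thereby enlarge the receptive field. This raises two problems: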

First, even though for many of current CNNs the theoretical receptive field can cover a big part of the input or even the whole input, in [19] it is shown that the empirical receptive field is much smaller than the theoretical one, even more than 2.7 times smaller in the higher layers of the network.

Second, downsampling the input without previously having access to enough context information (especially in complex scenes as in Fig. 1) may affect the learning process and the recognition performance of the network, as useful details are lost since the receptive field is not large enough to capture different dependencies in the scene before performing the downsampling.
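A small worked example may help with the receptive-field argument above. The sketch below computes the theoretical receptive field of a stack of layers with the standard recurrence: each layer grows the receptive field by (kernel size - 1) times the accumulated stride. The function name and the two layer lists are illustrative assumptions, not taken from the paper, and the smaller empirical receptive field reported in [19] is not modelled.

```python
def theoretical_receptive_field(layers):
    """Return the theoretical receptive field (in input pixels) of a layer stack.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance, in input pixels, between neighbouring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Three 3x3 layers, first without and then with stride-2 downsampling.
no_downsampling = [(3, 1), (3, 1), (3, 1)]
with_downsampling = [(3, 2), (3, 2), (3, 2)]

print(theoretical_receptive_field(no_downsampling))    # 7
print(theoretical_receptive_field(with_downsampling))  # 15
```

With the same three 3x3 kernels, downsampling roughly doubles the theoretical receptive field, which is why it is used so widely despite the loss of detail described in the second problem.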

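  • Referring to Fig. 1, the paper then points out that objects in natural images vary in size, and that even objects of the same category (for example, cars) appear at different scales. It identifies the shortcoming of standard convolution in handling this: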

To be able to capture such a diversity of categories and such a variability in their scales, the use of a single type of kernel (as in standard convolution) with a single spatial size, may not be an optimal solution for such complexity.

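  • The paper's four contributions:
  • PyConv, which captures multi-scale information while keeping the number of parameters and the computational cost similar to standard convolution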

(1) We introduce pyramidal convolution (PyConv), which contains different levels of kernels with varying size and depth. Besides enlarging the receptive field, PyConv can process the input using increasing kernel sizes in parallel, to capture different levels of details. On top of these advantages, PyConv is very efficient and, with our formulation, it can maintain a similar number of parameters and computational costs as the standard convolution. PyConv is very flexible and extendable, opening the door for a large variety of network architectures for numerous tasks of computer vision (see Section 3).

  • Image classification networks that outperform the baselines

We propose two network architectures for image classification task that outperform the baselines by a significant margin. Moreover, they are efficient in terms of number of parameters and computational costs and can outperform other more complex architectures (see Section 4).

  • A new semantic segmentation framework that reaches state-of-the-art results

We propose a new framework for semantic segmentation. Our novel head for parsing the output provided by a backbone can capture different levels of context information from local to global. It provides state-of-the-art results on scene parsing (see Section 5)

  • Significant improvements on object detection and video classification as well

We present network architectures based on PyConv for object detection and video classification tasks, where we report significant improvements in recognition performance over the baseline (see Appendix).

4. Pyramidal Convolution

  • The paper first shows diagrams of standard convolution and PyConv:

  • The paper begins with a brief description of standard convolution and its parameters (see the diagram), as follows:

  • The following passage from the paper describes how PyConv's kernel size increases and its kernel depth decreases from level to level:

  • Because PyConv uses kernels of different depths, the input must be split into groups and processed with grouped convolution. The original text reads:
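Separately from the paper's wording, the idea of grouping can be illustrated with a few lines of plain Python. In a grouped convolution the input feature maps are split into equal groups and each filter only sees the maps of its own group, so the kernel depth becomes the number of input maps divided by the number of groups. The helper name and the 64-map example below are hypothetical and chosen only for illustration.

```python
def split_into_groups(in_channels, groups):
    """Return one range of input-channel indices per group."""
    if in_channels % groups != 0:
        raise ValueError("groups must divide in_channels")
    depth = in_channels // groups  # kernel depth seen by each filter
    return [range(g * depth, (g + 1) * depth) for g in range(groups)]


# 64 input feature maps split into 4 groups: each kernel has depth 16.
ranges = split_into_groups(64, 4)
kernel_depth = len(ranges[0])

print(kernel_depth)          # 16
print(list(ranges[1])[:3])   # [16, 17, 18]
```

Using more groups at a pyramid level therefore gives shallower kernels, which is how a level with a large spatial kernel can keep its cost down.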

  • The kernel size and depth at each PyConv level, and the corresponding parameter count and FLOPs, are computed as follows:
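The paper's own equations are not reproduced in these notes. As a rough check of the efficiency claim, the sketch below counts weights using the general formula for a grouped convolution (kernel area times kernel depth times output maps) and compares a hypothetical four-level pyramid with a plain 3x3 convolution. The level configuration (kernel sizes 3, 5, 7, 9 with 1, 4, 8, 16 groups and 16 output maps each, on 64 input maps) is an assumption for illustration, and the cost function counts multiply-accumulates, which may differ from the paper's FLOPs convention.

```python
def conv_params(kernel_size, in_channels, out_channels, groups=1):
    """Weight count of a square 2D grouped convolution (bias ignored)."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("groups must divide both channel counts")
    return kernel_size ** 2 * (in_channels // groups) * out_channels


def conv_macs(kernel_size, in_channels, out_channels, out_h, out_w, groups=1):
    """Multiply-accumulates: every weight is used once per output position."""
    return conv_params(kernel_size, in_channels, out_channels, groups) * out_h * out_w


# Hypothetical pyramid levels: (kernel_size, out_channels, groups).
levels = [(3, 16, 1), (5, 16, 4), (7, 16, 8), (9, 16, 16)]
in_channels = 64

pyconv_params = sum(conv_params(k, in_channels, out, g) for k, out, g in levels)
standard_params = conv_params(3, in_channels, 64)

print(pyconv_params)    # 27072
print(standard_params)  # 36864
```

Under these assumptions the pyramid produces the same 64 output maps with kernels up to 9x9 while using fewer weights than a single 3x3 convolution, because the larger kernels are proportionally shallower.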

  • I did not fully understand the following paragraph, so I am pasting it here for now:

In practice, when building a PyConv there are several additional rules. The denominator of FM_i at each level of the pyramid in Equations 1, should be a divisor of FM_i. In other words, at each pyramid level, the number of feature maps from each created group should be equal. Therefore, as an approximation, when choosing the number of groups for each level of the pyramid (and thus the depth of the kernel), we can take the closest number to the denominator of FM_i from the list of possible divisors of FM_i. Furthermore, the number of groups for each level should be also a divisor for the number of output feature maps of each level of PyConv. To be able to easily create different network architectures with PyConv, it is recommended that the number of input feature maps, the groups for each level of pyramid, and the number of output feature maps for each level of PyConv, to be numbers of power of 2. Next sections show practical examples.
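One possible reading of these rules, written out as code: the ideal number of groups at a level is the ratio between that level's kernel area and the smallest kernel's area, but it has to be replaced by a nearby divisor of the number of input maps, and it must also divide the number of output maps of that level. The sketch assumes that the "denominator of FM_i" is this kernel-area ratio, which is my interpretation of Equation 1 and is not shown in these notes. The function names and the 64-in, 16-out example are hypothetical.

```python
def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def closest_divisor(n, target):
    """Divisor of n closest to target; ties go to the smaller divisor."""
    return min(divisors(n), key=lambda d: (abs(d - target), d))


def is_valid_level(in_channels, out_channels, groups):
    """Groups must split both the input and the output maps evenly."""
    return in_channels % groups == 0 and out_channels % groups == 0


in_channels = 64
base_kernel = 3
for kernel in (5, 7, 9):
    ratio = kernel ** 2 / base_kernel ** 2  # ideal depth-reduction factor
    groups = closest_divisor(in_channels, ratio)
    print(kernel, groups, is_valid_level(in_channels, 16, groups))
# 5 2 True
# 7 4 True
# 9 8 True
```

The quoted text presents the closest divisor only as an approximation, so any other number of groups that divides both the input and the output maps of a level is equally valid. Keeping all three quantities as powers of two makes such choices easy to find.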

  • The paper lists the main advantages of PyConv:
  • Multi-Scale Processing

  • Efficiency

  • Flexibility

5. PyConv Network on Semantic Segmentation

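  • The paper gives the architecture diagram of the segmentation network, which must capture both local and global information at multiple scales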

To build an effective pipeline for scene parsing, it is necessary to create a head that can parse the feature maps provided by the backbone and obtain not only local but also global information.

We propose a novel head for scene parsing (image segmentation) task, PyConv Parsing Head (PyConvPH). The proposed PyConvPH is able to deal with both local and global information at multiple scales.

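  • As shown in the figure above, PyConvPH consists of three main modules: local, global, and merge
  • The Local PyConv block is described below; its structure is shown in Fig. 6a above: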

  • The Global PyConv block is described below; its structure is shown in Fig. 6b above

  • Merge Local-Global PyConv block

6. Experiments

6.1 PyConv results on semantic segmentation

7. Conclusion

In this paper we proposed pyramidal convolution (PyConv), which contains several levels of kernels with varying scales. PyConv shows significant improvements for different visual recognition tasks and, at the same time, it is also efficient and flexible, providing a very large pool of potential network architectures. Our novel framework for image segmentation provides state-of-the-art results. In addition to a broad range of visual recognition tasks, PyConv can have a significant impact in many other directions, such as image restoration, completion/inpainting, noise/artifact removal, enhancement and image/video super-resolution.
