ACNet-RGBD Semantic Segmentation

1. Preface

2. Abstract

  • Problem background:

    Compared to RGB semantic segmentation, RGBD semantic segmentation can achieve better performance by taking depth information into consideration. However, it is still problematic for contemporary segmenters to effectively exploit RGBD information since the feature distributions of RGB and depth (D) images vary significantly in different scenes.

  • Compared with RGB semantic segmentation, RGBD semantic segmentation takes depth information into account and can achieve better performance; however, since the feature distributions of RGB and depth (D) images differ significantly across scenes, how contemporary segmenters can effectively exploit RGBD information remains an open problem.

  • Contributions of this paper:

In this paper, we propose an Attention Complementary Network (ACNet) that selectively gathers features from RGB and depth branches. The main contributions lie in the Attention Complementary Module (ACM) and the architecture with three parallel branches. More precisely, ACM is a channel attention-based module that extracts weighted features from RGB and depth branches. The architecture preserves the inference of the original RGB and depth branches, and enables the fusion branch at the same time. Based on the above structures, ACNet is capable of exploiting more high-quality features from different channels.

  • In this paper, the authors propose ACNet, a network that selectively gathers features from the RGB and depth branches.
  • The main contributions are the ACM module and the architecture with three parallel branches. More precisely, ACM is a channel-attention-based module that extracts weighted features from the RGB and depth branches. The architecture preserves the inference of the original RGB and depth branches while enabling the fusion branch at the same time.

3. Introduction

  • An RGBD image is the combination of an RGB image and a depth (Depth) image: four channels in total, the three RGB channels plus one depth channel that encodes each pixel's distance to the camera.
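As a minimal sketch of this four-channel composition (the array sizes and random data are illustrative assumptions, not from the paper), stacking RGB and depth along the channel axis might look like:

```python
import numpy as np

# Hypothetical image size, just for illustration
H, W = 4, 6
rgb = np.random.rand(H, W, 3)    # three color channels
depth = np.random.rand(H, W, 1)  # one channel: per-pixel distance to the camera

# Concatenate along the channel axis -> a 4-channel RGBD image
rgbd = np.concatenate([rgb, depth], axis=-1)
print(rgbd.shape)  # (4, 6, 4)
```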

  • The paper first reviews several earlier approaches to RGBD semantic segmentation, then points out two remaining problems:

These networks designed for RGBD semantic segmentation have achieved break-through results. However, there are still some issues that need to be solved:

  1. Although the geometric information encoded in the depth image can clearly provide additional benefits for image segmentation, the information contained in the RGB image and the depth image is not equivalent for every scene (shown in Fig. 1). In other words, the features extracted from the RGB branch and the depth branch by current networks may not be appropriate.
  2. Conventional RGBD segmentation networks can be divided into two types of architectures. One of them, such as [8], employs two encoders to extract features from the RGB and depth images respectively, and combines the features of both before or during upsampling. The other, like [5][9], just fuses the RGBD features at the downsampling stage. The former can’t sufficiently combine RGBD information, and the latter tends to lose the original RGB and depth branches since the fusion branches take their place.
  • Problem 1: although the information in the depth image benefits segmentation, the features extracted from the RGB and depth images are not equally useful in every scene, i.e., in some scenes incorporating the depth information does not help. The paper gives the figure below as an illustration: in the left example, combining the D information improves the segmentation result, while in the right example the opposite holds;

  • Problem 2: existing RGBD semantic segmentation methods follow two lines of thought: one uses two encoders to extract features from the RGB image and the depth image separately and combines them before upsampling; the other fuses the two kinds of features directly during downsampling. The former cannot fuse the two kinds of features sufficiently, and the latter does not consider how much each kind of feature contributes to the final result; for RGBD images in which the RGB and depth information may not be equally informative, neither achieves good results;

  • The ACNet architecture proposed in the paper is shown in the figure below:

  • As shown in the figure: blue/gray: two independent ResNet-based branches extract features from the RGB image and the depth image respectively;
  • multiple Attention Complementary Modules (ACMs), designed according to the amount of information contained in each layer's features, balance the feature distributions so that the network pays more attention to the informative regions of the image;
  • orange: an independent ResNet-based branch fuses the RGB and depth features; the segmentation result is finally obtained after several upsampling steps.

4. Framework

4.1 Attention Complementary Module (ACM)

  • As noted in Problem 1 and Fig. 1 above, for indoor scenes the feature distributions of the RGB image and the depth image of an RGBD pair are completely different. To make the network focus on the informative regions of the target, the paper designs multiple Attention Complementary Modules (ACMs); the structure of a single ACM is shown in the figure below:

  • The paper explains the figure above as follows:

  • The rough steps in the paper are: global average pooling produces Z; a 1x1 convolution captures the relationships among channels; a sigmoid function activates these relationships to obtain the weight vector V; finally, the input A is multiplied channel-wise by V to obtain the output U.
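The steps above can be sketched in numpy as follows. This is a toy illustration, not the paper's implementation: the 1x1 convolution over the pooled channel vector is modeled as a CxC linear map `W` with bias `b` (both hypothetical names), and the shapes are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def acm(A, W, b):
    """Toy ACM: A is a (C, H, W) feature tensor; W (CxC) and b (C,)
    stand in for the 1x1 convolution applied to the pooled vector."""
    z = A.mean(axis=(1, 2))    # global average pooling -> Z, shape (C,)
    v = sigmoid(W @ z + b)     # 1x1 conv + sigmoid -> weight vector V
    U = A * v[:, None, None]   # reweight each channel of A -> output U
    return U

# Toy usage with hypothetical shapes (identity weights, zero bias)
C, H, W_sp = 8, 5, 5
A = np.random.rand(C, H, W_sp)
U = acm(A, np.eye(C), np.zeros(C))
print(U.shape)  # (8, 5, 5)
```

Note that with sigmoid outputs in (0, 1), each channel of A is attenuated according to how informative the pooled statistics suggest it is, rather than zeroed out entirely.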

4.2 Architecture for Feature Fusion

In order to keep the original RGB and depth features flow during downsampling, we propose a specialized architecture for RGBD feature fusion. As illustrated in Fig. 2, two complete ResNets are deployed to extract RGB and depth features separately. Note that here the ResNet can be replaced with other networks, e.g., ERF-PSPNet [2] in efficiency-critical domains. Vitally, these two branches can preserve RGB and depth features before upsampling. After that, the fusion branch is leveraged to extract features from the merged feature maps.

  • A specialized architecture for RGBD feature fusion is proposed. As illustrated in Fig. 2, two complete ResNets are deployed to extract RGB and depth features separately. Note that the ResNet here can be replaced with other networks, e.g., ERF-PSPNet [2]. Vitally, these two branches preserve the RGB and depth features before upsampling. After that, the fusion branch is leveraged to extract features from the merged feature maps.

4.3 Attention Complementary Network (ACNet)

We design an integrated network called ACNet for RGBD semantic segmentation. The backbone of ACNet is shown in Fig. 2. The RGB image and the depth image are input and processed by separate ResNet branches. During inference, each aforementioned branch provides a group of feature maps at every module stage, such as Conv, Layer1, etc. Then the feature maps are reorganized by ACM. After passing through Conv, the feature maps are further added element-wise as the input of the fusion branch, while the others are added to the output of the fusion branch. In this way, both low-level and high-level features can be extracted, reorganized and fused by our ACNet. As for upsampling, we apply skip connections as in [5], which append the features from downsampling to upsampling at quite low computational cost.

5. Experiments

  • Implementation details of the experiments

  • Analysis of the ACM

To understand ACM better, we visualize the feature maps from layer2 (shown in Fig. 4) since layer2’s low-level features are more consistent with visual intuitions. Note that we only visualize the first 16 of 128 feature maps for better illustration.

  • The feature maps from layer2 are visualized (shown in Fig. 4) since layer2's low-level features better match visual intuition; of the weight matrices shown afterwards, only a 4x4 portion is displayed;

  • The paper describes the values of the weight matrices marked in the figure above as follows:

At (0,0), the feature map of the RGB branch visually contains more valid information than the feature map from the depth branch, so ACM tends to give a higher weight to the RGB branch. In contrast, at (2,2), the feature map of the depth branch contains more information; therefore, the depth branch gets a higher weight.

  • The paper also plots the trend of the weight magnitudes; the original description is as follows:

  • Fig. 5 is shown below:

  • Ablation experiments on the model structure:

Ablation Study. To verify the functionality of both ACM and the multi-branch architecture, we perform an ablation study by comparing the original model with two defective models: Model-1 and Model-2. In Model-1, we remove all ACMs and the RGB and D branches after the Conv layer. In Model-2, we remove all ACMs but retain the multi-branch architecture. Our ablation study on NYUDv2 shows that the mIoU of Model-1 and Model-2 is 44.3% and 46.8% respectively, verifying that the multi-branch architecture and ACM lead to significant accuracy boosts of 2.5% and 1.5%, respectively.

6. Conclusions

In this paper, we propose a novel multi-branch attention-based network for RGBD semantic segmentation. The multi-branch architecture is able to gather features efficiently and does not destroy the inference of the original RGB and depth branches. The attention module can selectively gather features from the RGB and depth branches according to the amount of information they contain, and complement the fusion branch by using these weighted features. Our model can resolve the problem that RGB images and depth images often contain unequal amounts of information as well as different context distributions. We evaluate our model on the NYUDv2 and SUN-RGBD datasets, and the experiments show that our model outperforms state-of-the-art methods.

-------------The End-------------