- Paper: ACNET: ATTENTION BASED NETWORK TO EXPLOIT COMPLEMENTARY FEATURES FOR RGBD SEMANTIC SEGMENTATION
Compared to RGB semantic segmentation, RGBD semantic segmentation can achieve better performance by taking depth information into consideration. However, it is still problematic for contemporary segmenters to effectively exploit RGBD information, since the feature distributions of RGB and depth (D) images vary significantly across scenes.
In this paper, we propose an Attention Complementary Network (ACNet) that selectively gathers features from RGB and depth branches. The main contributions lie in the Attention Complementary Module (ACM) and the architecture with three parallel branches. More precisely, ACM is a channel attention-based module that extracts weighted features from RGB and depth branches. The architecture preserves the inference of the original RGB and depth branches, and enables the fusion branch at the same time. Based on the above structures, ACNet is capable of exploiting more high-quality features from different channels.
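The channel-attention idea behind ACM can be sketched in a few lines of NumPy. This is a simplified, hypothetical stand-in, not the paper's implementation: ACM operates inside a trained network, whereas the projection weights `w`, `b` below are illustrative placeholders.

```python
import numpy as np

def channel_attention(feat, w, b):
    """SE-style channel attention: pool -> projection -> sigmoid -> reweight.

    feat: (C, H, W) feature map; w: (C, C) projection; b: (C,) bias.
    """
    pooled = feat.mean(axis=(1, 2))                  # global average pool -> (C,)
    gate = 1.0 / (1.0 + np.exp(-(w @ pooled + b)))   # sigmoid gate in (0, 1)
    return feat * gate[:, None, None]                # channel-wise reweighting

# Toy example with illustrative (untrained) weights.
rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 4, 4))
w, b = np.eye(2), np.zeros(2)
out = channel_attention(feat, w, b)
```

Because the gate lies strictly inside (0, 1), each channel is scaled down in proportion to its attention weight; in ACNet the learned weights let the network emphasize whichever branch carries more information in a given scene.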
These networks designed for RGBD semantic segmentation have achieved breakthrough results. However, some issues still need to be solved:
- Although the geometric information encoded in the depth image clearly provides additional benefits for image segmentation, the information contained in the RGB image and the depth image is not equivalent in every scene (shown in Fig. 1). In other words, the features that current networks extract from the RGB and depth branches may not be appropriate.
- Conventional RGBD segmentation networks fall into two types of architectures. One type employs two encoders to extract features from the RGB and depth images respectively, and combines the features of both before or during upsampling. The other type fuses the RGBD features only at the downsampling stage. The former cannot sufficiently combine RGBD information, and the latter tends to lose the original RGB and depth branches, since the fusion branch takes their place.
- Multiple Attention Complementary Modules (ACMs) are designed according to the amount of information contained in the features at each layer; they balance the feature distribution and make the network pay more attention to the informative regions of the image;
In order to keep the original RGB and depth features flowing during downsampling, we propose a specialized architecture for RGBD feature fusion. As illustrated in Fig. 2, two complete ResNets are deployed to extract RGB and depth features separately. Note that the ResNet can be replaced with other networks, e.g., ERF-PSPNet in efficiency-critical domains. Crucially, these two branches preserve the RGB and depth features before upsampling. After that, the fusion branch is leveraged to extract features from the merged feature maps.
We design an integrated network called ACNet for RGBD semantic segmentation. The backbone of ACNet is shown in Fig. 2. The RGB image and depth image are fed in and processed by separate ResNet branches. During inference, each branch provides a group of feature maps at every module stage, such as Conv, Layer1, etc. The feature maps are then reweighted by ACM. After the Conv stage, the feature maps are added element-wise to form the input of the fusion branch, while those from later stages are added to the output of the fusion branch. In this way, both low-level and high-level features can be extracted, reorganized, and fused by ACNet. For upsampling, we apply skip connections, which append the features from downsampling to upsampling at very low computational cost.
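One downsampling stage of this flow can be sketched as follows. This is a minimal NumPy mock-up under stated assumptions: `acm` is a hypothetical stand-in for the real learned module (here an identity-projection sigmoid gate), and the three branches are plain arrays rather than ResNet stages.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def acm(feat):
    """Hypothetical ACM stand-in: channel gate from globally pooled features."""
    gate = sigmoid(feat.mean(axis=(1, 2)))   # (C,) attention weights
    return feat * gate[:, None, None]

def fuse_stage(rgb_feat, depth_feat, fusion_feat):
    """One stage: ACM-weighted RGB/depth features are added element-wise
    into the fusion branch, while the original branches flow on unchanged."""
    fused = fusion_feat + acm(rgb_feat) + acm(depth_feat)
    return rgb_feat, depth_feat, fused       # all three branches continue

rgb = np.ones((3, 8, 8))
depth = np.zeros((3, 8, 8))
fusion = np.zeros((3, 8, 8))
rgb2, depth2, fusion2 = fuse_stage(rgb, depth, fusion)
```

The key property mirrored here is that fusion is additive and non-destructive: the RGB and depth feature streams are preserved for the next stage, so the original branches' inference is never replaced by the fusion branch.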
- Analysis of the ACM
To understand ACM better, we visualize the feature maps from layer2 (shown in Fig. 4), since layer2's low-level features are more consistent with visual intuition. Note that we only visualize the first 16 of the 128 feature maps for better illustration.
At position (0,0), the feature map of the RGB branch visually contains more valid information than that of the depth branch, so ACM tends to assign a higher weight to the RGB branch. In contrast, at (2,2), the feature map of the depth branch contains more information, so the depth branch receives the higher weight.
Ablation Study. To verify the functionality of both ACM and the multi-branch architecture, we perform an ablation study comparing the original model with two reduced models: Model-1 and Model-2. In Model-1, we remove all ACMs as well as the RGB and depth branches after the Conv layer. In Model-2, we remove all ACMs but retain the multi-branch architecture. Our ablation study on NYUDv2 shows that Model-1 and Model-2 reach 44.3% and 46.8% mIoU, verifying that the multi-branch architecture and ACM lead to significant accuracy boosts of 2.5% and 1.5%, respectively.
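The reported numbers are internally consistent; the plain-Python arithmetic below (values copied from the study) also shows the mIoU implied for the full model:

```python
# Ablation arithmetic on NYUDv2 (mIoU values from the study above).
model_1 = 44.3    # no ACMs, no multi-branch architecture
model_2 = 46.8    # multi-branch architecture restored, still no ACMs
acm_boost = 1.5   # reported gain from adding ACMs on top of Model-2

multibranch_boost = round(model_2 - model_1, 1)  # gain from multi-branch
full_model = round(model_2 + acm_boost, 1)       # implied full-ACNet mIoU
```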
In this paper, we propose a novel multi-branch attention-based network for RGBD semantic segmentation. The multi-branch architecture gathers features efficiently and does not destroy the inference of the original RGB and depth branches. The attention module selectively gathers features from the RGB and depth branches according to the amount of information they contain, and complements the fusion branch with these weighted features. Our model resolves the problem that RGB images and depth images often contain unequal amounts of information as well as different context distributions. We evaluate our model on the NYUDv2 and SUN-RGBD datasets, and the experiments show that it outperforms state-of-the-art methods.