When it comes to the key problems in the wide field of computer vision, semantic segmentation is regarded as the most problematic one. While looking at the bigger picture, semantic segmentation is that high-level task that will pave the complete way towards complete understanding of the scene. Scene understanding is regarded as a core problem for computer vision as an increasing number of new applications nourish the deducing knowledge right from imagery. Human-computer interaction and virtual reality are some of those applications.
With the widespread popularity of deep learning in the recent years, various problems related to semantic segmentation are now being tackled with the use of deep architectures, with Convolutional Neural Nets being the most common of all.
What is the Actual Concept of Semantic Segmentation?
Semantic segmentation can be regarded as the natural step of progression from coarse right to fine inference. The origin can be located at the classification level that consists of predictions for a complete input. The next step that comes in the picture is detection or localization that not only provides the classes but also comes with additional information regarding the location of the classes. In the end, semantic segmentation achieves fine inference with the help of making dense predictions inferring the labels for each and every pixel. It is necessary to make sure that each of the pixels is labeled with classes of its own enclosing object.
Existing Approaches of Semantic Segmentation
The general architecture for semantic segmentation can be regarded as an encoder network which is followed by a decoder network.
- The encoder is a pre-trained network just like ResNet/VGG which is followed by the decoder network.
- The overall task of the decoder is to project the features which the encoder learnt.
Unlike the process of classification, where the only important thing is the end result of the network, semantic segmentation not only needs discrimination at the pixel level but it also needs a mechanism for projecting the features which are learnt by the encoder at the various stages of pixel space. There are three main approaches which are employed for the mechanism of decoding. Let’s have a look at them.
Region-based Semantic Segmentation
This method generally follows the segmentation process using the pipeline of recognition. It first extracts the free-form regions from the image and then describes them. It is followed by a classification which is generally region-based. During testing, the predictions which are region-based are transformed into predictions of pixels by labeling the pixels with the highest scoring region that holds it.
Fully Convolutional Semantic Segmentation
The original FCN or fully convolutional network learns mapping from pixel to pixel. It does not extract the region proposals during the process. The FCN network can be regarded as the extension of the CNN. The main objective is to make the CNN take up inputs as arbitrary-sized images. FCN comes with only pooling layers that give them the complete ability for making the predictions on the inputs which are arbitrary-sized.
Weakly Supervised Semantic Segmentation
The relevant methods for semantic segmentation rely on large number of images with masks of pixel-wise segmentation. Annotating all these masks manually is quite frustrating and time-consuming. It is quite expensive as well. Thus, weakly supervised methods have been proposed which can fulfill the process of semantic segmentation by using the bounding boxes that are annotated.