Performance Prediction for Semantic Segmentation by a Self-Supervised Image Reconstruction Decoder

Abstract

In supervised learning, a deep neural network’s performance is measured using ground truth data. In semantic segmentation, ground truth data is sparse, requires an expensive annotation process, and, most importantly, is not available during online operation. To tackle this problem, recent works propose various forms of performance prediction. However, they either rely on inference data histograms, additional sensors, or additional training data. In this paper, we propose a novel per-image performance prediction for semantic segmentation, with (i) no need for additional sensors (sensor efficiency), (ii) no need for additional training data (data efficiency), and (iii) no need for a dedicated retraining of the semantic segmentation (training efficiency). Specifically, we extend an already trained semantic segmentation network having fixed parameters with an image reconstruction decoder. After training and a subsequent regression, the image reconstruction quality is evaluated to predict the semantic segmentation performance. We demonstrate our method’s effectiveness with a new state-of-the-art benchmark both on KITTI and Cityscapes for image-only input methods, on Cityscapes even surpassing a LiDAR-supported benchmark.
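The following is a minimal PyTorch sketch of the idea described above, not the authors' implementation: a frozen, already trained segmentation encoder is extended with a small reconstruction decoder that is trained self-supervised with an L1 image loss, and a per-image reconstruction quality score (here PSNR, chosen only for illustration) is mapped to a predicted segmentation performance via a simple linear regression fitted offline. All class, function, and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReconstructionDecoder(nn.Module):
    """Lightweight decoder that maps frozen encoder features back to RGB."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.up = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, feats):
        return self.up(feats)


def train_decoder(encoder, decoder, loader, epochs=10, lr=1e-4, device="cuda"):
    """Self-supervised training: the segmentation encoder stays frozen,
    only the reconstruction decoder is optimized (training efficiency)."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for images in loader:                      # unlabeled images suffice
            images = images.to(device)
            with torch.no_grad():
                feats = encoder(images)            # frozen features
            recon = decoder(feats)
            recon = F.interpolate(recon, size=images.shape[-2:],
                                  mode="bilinear", align_corners=False)
            loss = F.l1_loss(recon, images)
            opt.zero_grad()
            loss.backward()
            opt.step()


def psnr(recon, target, max_val=1.0):
    """Reconstruction quality metric used here as the regression input."""
    mse = torch.mean((recon - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)


def predict_performance(image, encoder, decoder, a, b):
    """Per-image performance prediction: map reconstruction PSNR to a
    segmentation-performance estimate via a linear regression whose
    coefficients (a, b) were fitted offline on a labeled held-out set."""
    with torch.no_grad():
        feats = encoder(image)
        recon = decoder(feats)
        recon = F.interpolate(recon, size=image.shape[-2:],
                              mode="bilinear", align_corners=False)
    return a * psnr(recon, image).item() + b
```

At inference time, only the input image is needed: no ground truth, no additional sensor, and no retraining of the segmentation network, matching the sensor, data, and training efficiency properties claimed in the abstract.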

Publication
In Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) - Workshops