Publications

Frozen Feature Augmentation for Few-Shot Image Classification
Training a linear classifier or lightweight model on top of pretrained vision model outputs, so-called ‘frozen features’, leads to impressive performance on a number of downstream few-shot tasks. Currently, frozen features are not modified during training. On the other hand, when networks are trained directly on images, data augmentation is a standard recipe that improves performance with no substantial overhead. In this paper, we conduct an extensive pilot study on few-shot image classification that explores applying data augmentations in the frozen feature space, dubbed ‘frozen feature augmentation (FroFA)’, covering twenty augmentations in total. Our study demonstrates that adopting a deceptively simple pointwise FroFA, such as brightness, can improve few-shot performance consistently across three network architectures, three large pretraining datasets, and eight transfer datasets.
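As an illustration of the idea, the sketch below applies a brightness-style pointwise augmentation to cached frozen features before fitting a lightweight classifier; the tensor shapes, the `max_delta` range, and the per-sample offset are illustrative assumptions, not the paper’s exact FroFA recipe.

```python
import torch

def brightness_frofa(features: torch.Tensor, max_delta: float = 0.2) -> torch.Tensor:
    """Pointwise 'brightness' augmentation on frozen features (illustrative sketch).

    Each sample's feature vector is shifted by one random offset, analogous to
    image brightness jitter; max_delta is a hypothetical hyperparameter."""
    delta = (2.0 * torch.rand(features.shape[0], 1) - 1.0) * max_delta
    return features + delta

# Hypothetical usage: augment cached embeddings of a few-shot support set
frozen = torch.randn(64, 768)          # e.g., pretrained ViT embeddings
augmented = brightness_frofa(frozen)   # feed these to a linear probe during training
```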
Improvements to Image Reconstruction-Based Performance Prediction for Semantic Segmentation in Highly Automated Driving
The performance of deep neural networks is typically measured with ground truth data, which is expensive and not available during operation. At the same time, safety-critical applications, such as highly automated driving, require an awareness of the current performance, especially during operation with distorted inputs. Recently, performance prediction for semantic segmentation by an image reconstruction decoder was proposed. In this work, we investigate three approaches to improve its predictive power: parameter initialization, parameter sharing, and inter-decoder lateral connections. Our best setup establishes a new state of the art in performance prediction with image-only inputs on Cityscapes and KITTI and even surpasses a method exploiting both point cloud and image inputs on Cityscapes. Further, our investigations reveal that the best Pearson correlation between the segmentation quality and the reconstruction quality does not always lead to the best predictive power. Code is available at https://github.com/ifnspaml/PerfPredRecV2.
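A minimal sketch of the prediction principle shared by this and the related work below: the reconstruction quality of each image (here PSNR) is mapped to an estimated segmentation quality by a simple regressor. The linear fit, the calibration values, and the function names are illustrative assumptions.

```python
import numpy as np

def psnr(reconstruction: np.ndarray, image: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a reconstruction and its input image."""
    mse = np.mean((reconstruction.astype(np.float64) - image.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# Hypothetical calibration data: per-image reconstruction PSNR and measured mIoU
psnr_vals = np.array([28.1, 25.4, 30.2, 22.7])
miou_vals = np.array([0.71, 0.62, 0.76, 0.55])
a, b = np.polyfit(psnr_vals, miou_vals, deg=1)   # least-squares fit: mIoU ~ a * PSNR + b

def predict_miou(reconstruction: np.ndarray, image: np.ndarray) -> float:
    """Predict segmentation performance from reconstruction quality alone."""
    return a * psnr(reconstruction, image) + b
```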
A Novel Benchmark for Refinement of Noisy Localization Labels in Autolabeled Datasets for Object Detection
Autolabeling approaches are attractive w.r.t. time and cost as they allow fast annotation without human intervention. However, can we really trust the label quality of autolabeling? And further, which potential consequences arise from resulting label noise? In this work, we address these questions for localization, a subtask of object detection, by investigating the effects on a state-of-the-art deep neural network (DNN) for object detection and the widely used Pascal VOC 2012 dataset. Our contributions are threefold: First, we propose a method to inject noise into localization labels, enabling us to simulate localization label errors of autolabeling methods. Afterwards, we train a state-of-the-art object detection DNN with these noisy labels. Second, we propose a refinement network which takes a noisy localization label and its respective image as input and performs a localization refinement. Third, we again train a state-of-the-art object detection DNN, however, this time with refined localization labels. Our insights are: Training a state-of-the-art DNN for object detection on noisy localization labels leads to a severe performance drop. Our proposed localization label refinement network is able to refine the noisy localization labels. We are able to retain the performance to some extent by retraining the state-of-the-art DNN for object detection on the refined localization labels. Our study motivates a new challenging task ‘refinement of noisy localization labels’ and sets a first benchmark for Pascal VOC 2012. Code is available at https://github.com/ifnspaml/LocalizationLabelNoise.
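To make the noise-injection step concrete, one possible way to simulate autolabeling errors is to jitter each ground-truth box by a fraction of its size, as sketched below; the uniform noise model and the `noise_level` parameter are assumptions, not the paper’s exact protocol.

```python
import random

def perturb_box(box, noise_level: float = 0.1):
    """Jitter a ground-truth box (x_min, y_min, x_max, y_max) to simulate
    localization label noise; each coordinate is shifted by up to
    noise_level times the box width/height (illustrative noise model)."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    return (x_min + random.uniform(-noise_level, noise_level) * w,
            y_min + random.uniform(-noise_level, noise_level) * h,
            x_max + random.uniform(-noise_level, noise_level) * w,
            y_max + random.uniform(-noise_level, noise_level) * h)

noisy_box = perturb_box((48.0, 60.0, 210.0, 300.0))   # hypothetical Pascal VOC box
```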
Detecting Adversarial Perturbations in Multi-Task Perception
While deep neural networks (DNNs) achieve impressive performance on environment perception tasks, their sensitivity to adversarial perturbations limits their use in practical applications. In this paper, we (i) propose a novel adversarial perturbation detection scheme based on multi-task perception of complex vision tasks (i.e., depth estimation and semantic segmentation). Specifically, adversarial perturbations are detected by inconsistencies between extracted edges of the input image, the depth output, and the segmentation output. To further improve this technique, we (ii) develop a novel edge consistency loss between all three modalities, thereby improving their initial consistency, which in turn supports our detection scheme. We verify our detection scheme’s effectiveness by employing various known attacks and image noises. In addition, we (iii) develop a multi-task adversarial attack, aiming at fooling both tasks as well as our detection scheme. Experimental evaluation on the Cityscapes and KITTI datasets shows that, under an assumption of a 5% false positive rate, up to 100% of images are correctly detected as adversarially perturbed, depending on the strength of the perturbation. Code is available at this https URL. A short video at this https URL provides qualitative results.
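The detection idea can be sketched as follows: extract edges from the input image, the depth output, and the segmentation output, and flag the input as adversarial when the pairwise edge agreement drops below a threshold. The simple gradient-based edge extractor, the IoU-style agreement score, and the threshold `tau` are illustrative assumptions.

```python
import numpy as np

def edges(x: np.ndarray, thresh: float = 0.1) -> np.ndarray:
    """Binary edge map from finite differences (stand-in for a proper edge extractor)."""
    gx = np.abs(np.diff(x, axis=1, prepend=x[:, :1]))
    gy = np.abs(np.diff(x, axis=0, prepend=x[:1, :]))
    g = gx + gy
    return g > thresh * (g.max() + 1e-12)

def agreement(a: np.ndarray, b: np.ndarray) -> float:
    """IoU-style overlap between two binary edge maps."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / max(union, 1)

def is_adversarial(image_gray: np.ndarray, depth: np.ndarray,
                   seg_ids: np.ndarray, tau: float = 0.2) -> bool:
    """Flag an input when the edges of image, depth, and segmentation disagree
    (tau is a hypothetical detection threshold)."""
    e_img, e_dep, e_seg = edges(image_gray), edges(depth), edges(seg_ids.astype(float))
    score = min(agreement(e_img, e_dep), agreement(e_img, e_seg), agreement(e_dep, e_seg))
    return score < tau
```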
Performance Prediction for Semantic Segmentation by a Self-Supervised Image Reconstruction Decoder
In supervised learning, a deep neural network’s performance is measured using ground truth data. In semantic segmentation, ground truth data is sparse, requires an expensive annotation process, and, most importantly, is not available during online operation. To tackle this problem, recent works propose various forms of performance prediction. However, they either rely on inference data histograms, additional sensors, or additional training data. In this paper, we propose a novel per-image performance prediction for semantic segmentation, with (i) no need for additional sensors (sensor efficiency), (ii) no need for additional training data (data efficiency), and (iii) no need for a dedicated retraining of the semantic segmentation (training efficiency). Specifically, we extend an already trained semantic segmentation network having fixed parameters with an image reconstruction decoder. After training and a subsequent regression, the image reconstruction quality is evaluated to predict the semantic segmentation performance. We demonstrate our method’s effectiveness with a new state-of-the-art benchmark both on KITTI and Cityscapes for image-only input methods, on Cityscapes even surpassing a LiDAR-supported benchmark.
Adaptive Bitrate Quantization Scheme Without Codebook for Learned Image Compression
We propose a generic approach to quantization without a codebook in learned image compression, called one-hot max (OHM, Ω) quantization. It reorganizes the feature space, resulting in an additional dimension, along which vector quantization yields one-hot vectors by comparing activations. Furthermore, we show how to integrate Ω quantization into a compression system with bitrate adaptation, i.e., full control over the bitrate during inference. We perform experiments on both MNIST and Kodak and report rate-distortion trade-offs compared with the integer rounding reference. For low bitrates (< 0.4 bpp), our proposed quantizer yields better performance while also exhibiting other advantageous training and inference properties. Code is available at https://github.com/ifnspaml/OHMQ.
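A rough sketch of the one-hot max idea: the channel axis is split into sub-vectors, the largest activation in each sub-vector becomes a one-hot code, and a straight-through estimator keeps training differentiable. The group size, the softmax surrogate, and the tensor layout are assumptions rather than the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def ohm_quantize(z: torch.Tensor, levels: int = 8) -> torch.Tensor:
    """One-hot max style quantization over channel groups (illustrative sketch)."""
    b, c, h, w = z.shape
    assert c % levels == 0, "channel count must be divisible by the group size"
    groups = z.view(b, c // levels, levels, h, w)
    hard = F.one_hot(groups.argmax(dim=2), num_classes=levels)   # (b, c/levels, h, w, levels)
    hard = hard.permute(0, 1, 4, 2, 3).float()                   # back to the group layout
    soft = F.softmax(groups, dim=2)                              # differentiable surrogate
    quantized = soft + (hard - soft).detach()                    # straight-through estimator
    return quantized.reshape(b, c, h, w)
```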
Joint Optimization for DNN Model Compression and Corruption Robustness
Modern deep neural networks (DNNs) are achieving state-of-the-art results due to their capability to learn a faithful representation of the data they are trained on. In this chapter, we address two insufficiencies of DNNs, namely the lack of robustness to corruptions in the data and the lack of real-time deployment capabilities, which need to be addressed to enable their safe and efficient deployment in real-time environments. We introduce hybrid corruption-robustness focused compression (HCRC), an approach that jointly optimizes a neural network for network compression and improved robustness to commonly observed corruptions such as noise and blurring artifacts. For this study, we primarily consider the task of semantic segmentation for automated driving and focus on the interactions between robustness and compression of the network. HCRC improves the robustness of the DeepLabv3+ network by 8.39% absolute mean performance under corruption (mPC) on the Cityscapes dataset, and by 2.93% absolute mPC on the Sim KI-A dataset, while generalizing even to augmentations not seen by the network in the training process. This is achieved with only minor degradations on undisturbed data. Our approach is evaluated over two strong compression ratios (30% and 50%) and consistently outperforms all considered baseline approaches. Additionally, we perform extensive ablation studies to further leverage and extend existing state-of-the-art methods.
Inspect, Understand, Overcome: A Survey of Practical Methods for AI Safety
Deployment of modern data-driven machine learning methods, most often realized by deep neural networks (DNNs), in safety-critical applications such as health care, industrial plant control, or autonomous driving is highly challenging due to numerous model-inherent shortcomings. These shortcomings are diverse and range from a lack of generalization over insufficient interpretability and implausible predictions to directed attacks by means of malicious inputs. Cyber-physical systems employing DNNs are therefore likely to suffer from so-called safety concerns, properties that preclude their deployment as no argument or experimental setup can help to assess the remaining risk. In recent years, an abundance of state-of-the-art techniques aiming to address these safety concerns has emerged. This chapter provides a structured and broad overview of them. We first identify categories of insufficiencies and then describe research activities aiming at their detection, quantification, or mitigation. Our work addresses machine learning experts and safety engineers alike: the former might profit from the broad range of machine learning topics covered and the discussions on limitations of recent methods, while the latter might gain insights into the specifics of modern machine learning methods. We hope that this contribution fuels discussions on desiderata for machine learning systems and strategies on how to help advance existing approaches accordingly.
Improving Transferability of Generated Universal Adversarial Perturbations for Image Classification and Segmentation
Although deep neural networks (DNNs) are high-performance methods for various complex tasks, e.g., environment perception in automated vehicles (AVs), they are vulnerable to adversarial perturbations. Recent works have proven the existence of universal adversarial perturbations (UAPs), which, when added to most images, destroy the output of the respective perception function. Existing attack methods often show a low success rate when attacking target models that differ from the one the attack was optimized on. To address this weak transferability, we propose a novel learning criterion that combines a low-level feature loss, addressing the similarity of feature representations in the first layer of various model architectures, with a cross-entropy loss. Experimental results on the ImageNet and Cityscapes datasets show that our method effectively generates universal adversarial perturbations achieving state-of-the-art fooling rates across different models, tasks, and datasets. Due to their effectiveness, we propose the use of such generated UAPs in the robustness evaluation of DNN-based environment perception functions for AVs.
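One way to write down such a combined criterion is sketched below: the perturbation is trained to maximize both the deviation of first-layer features and the cross-entropy of the attacked prediction. `first_layer`, `classifier`, the MSE feature distance, and the weighting `alpha` are illustrative assumptions, not the exact published loss.

```python
import torch
import torch.nn.functional as F

def uap_generator_loss(first_layer, classifier, image, uap, label, alpha=0.5):
    """Combined objective for training a UAP generator (hedged sketch):
    minimizing this loss pushes first-layer features away from their clean
    values and reduces confidence in the correct class."""
    adv = torch.clamp(image + uap, 0.0, 1.0)
    feature_term = -F.mse_loss(first_layer(adv), first_layer(image))  # low-level feature loss
    task_term = -F.cross_entropy(classifier(adv), label)              # negated task loss
    return alpha * feature_term + (1.0 - alpha) * task_term
```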
From a Fourier-Domain Perspective on Adversarial Examples to a Wiener Filter Defense for Semantic Segmentation
Despite recent advancements, deep neural networks are not robust against adversarial perturbations. Many of the proposed adversarial defense approaches use computationally expensive training mechanisms that do not scale to complex real-world tasks such as semantic segmentation and offer only marginal improvements. In addition, fundamental questions on the nature of adversarial perturbations and their relation to the network architecture are largely understudied. In this work, we study the adversarial problem from a frequency-domain perspective. More specifically, we analyze discrete Fourier transform (DFT) spectra of several adversarial images and report two major findings: First, there exists a strong connection between a model architecture and the nature of adversarial perturbations that can be observed and addressed in the frequency domain. Second, the observed frequency patterns are largely image- and attack-type independent, which is important for the practical impact of any defense making use of such patterns. Motivated by these findings, we additionally propose an adversarial defense method based on the well-known Wiener filters that captures and suppresses adversarial frequencies in a data-driven manner. Our proposed method not only generalizes across unseen attacks but also outperforms five existing state-of-the-art methods across two models in a variety of attack settings.
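A minimal single-channel sketch of a Wiener-filter-style defense: a frequency-domain filter is estimated from paired clean and perturbed images and then applied to incoming inputs. The classical S_cc / (S_cc + S_nn) estimate below is one standard way to build such a filter and is not claimed to be the paper’s exact procedure.

```python
import numpy as np

def estimate_wiener_filter(clean_images, perturbed_images, eps: float = 1e-8) -> np.ndarray:
    """Estimate H = S_cc / (S_cc + S_nn) from average clean and perturbation power spectra."""
    s_clean = np.mean([np.abs(np.fft.fft2(c)) ** 2 for c in clean_images], axis=0)
    s_noise = np.mean([np.abs(np.fft.fft2(p - c)) ** 2
                       for c, p in zip(clean_images, perturbed_images)], axis=0)
    return s_clean / (s_clean + s_noise + eps)

def wiener_defense(image: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Suppress (adversarial) frequencies by applying the precomputed filter in the DFT domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * H))
```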
Detection of Collective Anomalies in Images for Automated Driving Using an Earth Mover’s Deviation (EMDEV) Measure
For visual perception in automated driving, a reliable detection of so-called corner cases is important. Corner cases appear in many different forms and can be image frame- or sequence-related. In this work, we consider a specific type of corner case: collective anomalies. These are instances that appear in unusually large numbers in an image. We propose a detection method for collective anomalies based on a comparison of a test (sub-)set instance distribution to a training (i.e., reference) instance distribution, both distributions obtained by an instance-based semantic segmentation. For this comparison, we propose a novel so-called earth mover’s deviation (EMDEV) measure, which is able to provide signed deviations of instance distributions. Further, we propose a sliding window approach to allow the comparison of instance distributions in an online application in the vehicle. With our approach, we are able to identify collective anomalies by the proposed EMDEV measure and to detect deviations from the instance distribution of the reference dataset.
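To give a flavor of a signed deviation between instance distributions, the sketch below compares normalized per-class instance histograms via their cumulative difference; the exact EMDEV definition in the paper may differ, so treat this as an assumption-laden illustration. In an online setting, the test histogram could be accumulated over the proposed sliding window of recent frames.

```python
import numpy as np

def emdev_sketch(test_counts: np.ndarray, ref_counts: np.ndarray) -> float:
    """Signed earth-mover-style deviation between two per-class instance histograms
    (illustrative 1-D variant based on the cumulative difference of the
    normalized distributions; not necessarily the paper's exact measure)."""
    p = test_counts / max(test_counts.sum(), 1)
    q = ref_counts / max(ref_counts.sum(), 1)
    return float(np.sum(np.cumsum(p - q)))

# Hypothetical per-class instance counts (e.g., car, person, bicycle, rider)
reference = np.array([120, 40, 15, 5])
window = np.array([118, 95, 14, 4])   # unusually many 'person' instances
print(emdev_sketch(window, reference))
```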
Improving Online Performance Prediction for Semantic Segmentation
In this work, we address the task of observing the performance of a semantic segmentation deep neural network (DNN) during online operation, i.e., during inference, which is of high importance in safety-critical applications such as autonomous driving. Here, many high-level decisions rely on such DNNs, which are usually evaluated offline, while their performance in online operation remains unknown. To solve this problem, we propose an improved online performance prediction scheme, building on a recently proposed concept of predicting the primary semantic segmentation task’s performance. This can be achieved by evaluating the auxiliary task of monocular depth estimation with a measurement supplied by a LiDAR sensor and a subsequent regression to the semantic segmentation performance. In particular, we propose (i) sequential training methods for both tasks in a multi-task training setup, (ii) to share the encoder as well as parts of the decoder between both tasks’ networks for improved efficiency, and (iii) a temporal statistics aggregation method, which significantly reduces the performance prediction error at the cost of a small algorithmic latency. Evaluation on the KITTI dataset shows that all three aspects improve the performance prediction compared to previous approaches.
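The temporal statistics aggregation in (iii) can be pictured as a sliding-window statistic over per-frame performance predictions, as in the sketch below; the mean aggregation and the window length are assumptions.

```python
from collections import deque

class TemporalAggregator:
    """Sliding-window mean over per-frame performance predictions (illustrative sketch)."""

    def __init__(self, window: int = 10):
        self.buffer = deque(maxlen=window)

    def update(self, predicted_miou: float) -> float:
        """Add the newest per-frame prediction and return the aggregated estimate."""
        self.buffer.append(predicted_miou)
        return sum(self.buffer) / len(self.buffer)
```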
An Unsupervised Temporal Consistency (TC) Loss To Improve the Performance of Semantic Segmentation Networks
Deep neural networks (DNNs) for highly automated driving are often trained on a large and diverse dataset, and evaluation metrics are usually reported on a per-frame basis. However, when evaluated on video sequences, the predictions are often unstable between consecutive frames. As such unstable predictions over time can lead to severe safety consequences, there is a growing need to understand, evaluate, and improve the temporal consistency of DNNs. In this paper, we explore such a temporal characteristic and propose a novel unsupervised temporal consistency (TC) loss that penalizes unstable semantic segmentation predictions. This loss function is used in a two-stage training scheme to jointly optimize for both the accuracy of semantic segmentation predictions and their temporal consistency based on video sequences. We demonstrate that our training strategy helps in improving the temporal consistency of two state-of-the-art semantic segmentation networks on two different road-scene datasets. We report an absolute 4.25% improvement in the mean temporal consistency (mTC) of the HRNetV2 network and an absolute 2.78% improvement for the DeepLabv3+ network, both evaluated on the Cityscapes dataset, with only a slight decrease in accuracy. When evaluating on the same video sequences using the synthetic dataset Sim KI-A, we show absolute improvements in both accuracy (2.19% mIoU) and temporal consistency (0.21% mTC) for the DeepLabv3+ network. We confirm similar improvements for the HRNetV2 network.
The Vulnerability of Semantic Segmentation Networks to Adversarial Attacks in Autonomous Driving: Enhancing Extensive Environment Sensing
Enabling autonomous driving (AD) can be considered one of the biggest challenges in today’s technology. AD is a complex task accomplished by several functionalities, with environment perception being one of its core functions. Environment perception is usually performed by combining the semantic information captured by several sensors, e.g., lidar or camera. The semantic information from the respective sensor can be extracted by using convolutional neural networks (CNNs) for dense prediction. In the past, CNNs have consistently shown state-of-the-art performance on several vision-related tasks, such as semantic segmentation of traffic scenes using nothing but the red-green-blue (RGB) images provided by a camera. Although CNNs obtain state-of-the-art performance on clean images, almost imperceptible changes to the input, referred to as adversarial perturbations, may lead to fatal deception. The goal of this article is to illuminate the vulnerability aspects of CNNs used for semantic segmentation with respect to adversarial attacks, and to share insights into some of the existing known adversarial defense strategies. We aim to clarify the advantages and disadvantages associated with applying CNNs for environment perception in AD to serve as a motivation for future research in this field.
Transferable Universal Adversarial Perturbations Using Generative Models
Deep neural networks tend to be vulnerable to adversarial perturbations, which, when added to a natural image, can fool a respective model with high confidence. Recently, the existence of image-agnostic perturbations, also known as universal adversarial perturbations (UAPs), was discovered. However, existing UAPs still lack a sufficiently high fooling rate when being applied to an unknown target model. In this paper, we propose a novel deep learning technique for generating more transferable UAPs. We utilize a perturbation generator and given pretrained networks, so-called source models, to generate UAPs using the ImageNet dataset. Due to the similar feature representation of various model architectures in the first layer, we propose a loss formulation that focuses on the adversarial energy only in the respective first layer of the source models. This supports the transferability of our generated UAPs to any other target model. We further empirically analyze our generated UAPs and demonstrate that these perturbations generalize very well towards different target models. Surpassing the current state of the art in both fooling rate and model transferability, we show the superiority of our proposed approach. Using our generated non-targeted UAPs, we obtain an average fooling rate of 93.36% on the source models (state of the art: 82.16%). Generating our UAPs on the deep ResNet-152, we obtain about a 12% absolute fooling rate advantage over cutting-edge methods on VGG-16 and VGG-19 target models.
Focussing Learned Image Compression to Semantic Classes for V2X Applications
Cooperative perception with many sensors involved greatly improves the performance of perceptual systems in autonomous vehicles. However, the increasing amount of sensor data leads to a bottleneck due to the limited capacity of vehicle-to-X (V2X) communication channels. We leverage lossy learned image compression by means of an autoencoder with an adversarial loss function to reduce the overall bitrate. Our key contribution is to focus image compression on regions of interest (ROIs) governed by a binary mask. A transmitter-sided semantic segmentation network extracts semantically important classes, which form the basis for generating the ROI. A second key contribution is that the mask is not transmitted as side information; only the quantized bottleneck data is transmitted. To train the network, we use a loss function operating only on the pixels in the ROI. We report the peak signal-to-noise ratio (PSNR) both in the entire image and only in the ROI, evaluating various fusion architectures and fusion operations involving the input image and mask. Showing the high generalizability of our approach, we achieve consistent improvements in the ROI in all experiments on the Cityscapes dataset.
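The ROI-restricted training objective can be sketched as a distortion loss that only counts pixels inside the binary mask; the L1 distortion and the normalization by the mask area are assumptions for illustration.

```python
import torch

def roi_reconstruction_loss(reconstruction: torch.Tensor,
                            original: torch.Tensor,
                            roi_mask: torch.Tensor) -> torch.Tensor:
    """Distortion loss restricted to a binary region of interest (illustrative sketch):
    the compression network only has to be faithful inside the ROI."""
    mask = roi_mask.float()
    masked_error = (reconstruction - original).abs() * mask
    return masked_error.sum() / mask.sum().clamp(min=1.0)
```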
Class-Incremental Learning for Semantic Segmentation Re-Using Neither Old Data Nor Old Labels
While neural networks trained for semantic segmentation are essential for perception in autonomous driving, most current algorithms assume a fixed number of classes, presenting a major limitation when developing new autonomous driving systems with the need for additional classes. In this paper, we present a technique implementing class-incremental learning for semantic segmentation without using the labeled data the model was initially trained on. Previous approaches still either rely on labels for both old and new classes, or fail to properly distinguish between them. We show how to overcome these problems with a novel class-incremental learning technique, which nonetheless requires labels only for the new classes. Specifically, (i) we introduce a new loss function that neither relies on old data nor on old labels, (ii) we show how new classes can be integrated in a modular fashion into pretrained semantic segmentation models, and finally (iii) we re-implement previous approaches in a unified setting to compare them to ours. We evaluate our method on the Cityscapes dataset, where we exceed the mIoU performance of all baselines by 3.5% absolute, reaching a result that is only 2.2% absolute below the upper performance limit of single-stage training, which relies on all data and labels simultaneously.
Unsupervised Temporal Consistency Metric for Video Segmentation in Highly-Automated Driving
Commonly used metrics to evaluate semantic segmentation, such as mean intersection over union (mIoU), do not incorporate temporal consistency. A straightforward extension of existing metrics towards evaluating the consistency of segmentation of video sequences does not exist, since labelled videos are rare and very expensive to obtain. For safety-critical applications such as highly automated driving, there is, however, a need for a metric that measures such temporal consistency of video segmentation networks to possibly support safety requirements. In this paper, (a) we introduce a metric which does not require segmentation labels for measuring the stability of the predictions of segmentation networks over a series of images; (b) we perform an in-depth analysis of the proposed metric and observe strong correlations to the supervised mIoU metric; (c) we evaluate five state-of-the-art semantic segmentation networks of varying complexity and architecture on two public datasets, namely Cityscapes and CamVid. Finally, we perform timing evaluations and propose the use of the metric either as an online observer for identification of possibly unstable segmentation predictions, or as an offline method to evaluate or improve semantic segmentation networks, e.g., by selecting additional training data with critical temporal consistency.
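In its simplest form, such a label-free stability measure can be sketched as the mean IoU between the predicted segmentation maps of consecutive frames; the sketch below omits the motion handling that a full metric for moving scenes requires.

```python
import numpy as np

def temporal_consistency(pred_t: np.ndarray, pred_t1: np.ndarray, num_classes: int) -> float:
    """Mean IoU between the class maps of two consecutive frames (simplified sketch,
    no motion compensation); higher values indicate more stable predictions."""
    ious = []
    for c in range(num_classes):
        a, b = pred_t == c, pred_t1 == c
        union = np.logical_or(a, b).sum()
        if union > 0:
            ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious)) if ious else 1.0
```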
Robust Semantic Segmentation by Redundant Networks With a Layer-Specific Loss Contribution and Majority Vote
The lack of robustness shown by deep neural networks (DNNs) questions their deployment in safety-critical tasks, such as autonomous driving. We pick up the recently introduced redundant teacher-student frameworks (3 DNNs) and propose in this work a novel error detection and correction scheme with application to semantic segmentation. It obtains its robustness from an online-adapted and therefore hard-to-attack student DNN during vehicle operation, which builds upon a novel layer-dependent inverse feature matching (IFM) loss. We conduct experiments on the Cityscapes dataset showing that this loss renders the adaptive student more than 20% absolute mean intersection-over-union (mIoU) better than in previous works. Moreover, the entire error correction virtually always delivers the performance of the best non-attacked network, resulting in an mIoU of about 50% even under the strongest attacks (instead of 1…2%), while keeping the performance on clean data at about the original level (ca. 75.7%).
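The correction step of such a redundant setup can be sketched as a pixel-wise majority vote over the three class maps; the tie-breaking rule (falling back to the adaptive student) is an assumption.

```python
import torch

def majority_vote(pred_t: torch.Tensor, pred_s: torch.Tensor, pred_a: torch.Tensor) -> torch.Tensor:
    """Pixel-wise majority vote over teacher (T), static student (S), and adaptive
    student (A) class maps (illustrative sketch)."""
    stacked = torch.stack([pred_t, pred_s, pred_a])      # (3, H, W) integer class ids
    vote, _ = torch.mode(stacked, dim=0)                 # most frequent class per pixel
    agree = (stacked == vote).sum(dim=0) >= 2            # at least two networks agree
    return torch.where(agree, vote, pred_a)              # fall back to the adaptive student
```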
Improved Noise and Attack Robustness for Semantic Segmentation by Using Multi-Task Training with Self-Supervised Depth Estimation
While current approaches for neural network training often aim at improving performance, less focus is put on training methods aiming at robustness towards varying noise conditions or directed attacks by adversarial examples. In this paper, we propose to improve robustness by a multi-task training, which extends supervised semantic segmentation by a self-supervised monocular depth estimation on unlabeled videos. This additional task is only performed during training to improve the semantic segmentation model’s robustness at test time under several input perturbations. Moreover, we even find that our joint training approach also improves the performance of the model on the original (supervised) semantic segmentation task. Our evaluation exhibits a particular novelty in that it allows a mutual comparison of the effect of input noises and adversarial attacks on the robustness of the semantic segmentation. We show the effectiveness of our method on the Cityscapes dataset, where our multi-task training approach consistently outperforms the single-task semantic segmentation baseline in terms of robustness to both noise and adversarial attacks, without the need for depth labels in training.
On the Robustness of Redundant Teacher-Student Frameworks for Semantic Segmentation
The trend towards autonomous systems in today’s technology comes with the need for environment perception. Deep neural networks (DNNs) have consistently shown state-of-the-art performance over the last few years in visual machine perception, e.g., semantic segmentation. While DNNs work fine on uncorrupted data, recently introduced adversarial examples (AEs) lead to misclassification with high confidence. This lack of robustness against such adversarial attacks questions the use of DNNs in safety-critical autonomous systems, e.g., autonomous driving vehicles. In this work, we address the mentioned problem with the use of a redundant teacher-student framework, consisting of a static teacher network (T), a static student network (S), and a constantly adapting student network (A). By using this triplet in combination with a novel inverse feature matching (IFM) loss, we show that a significant robustness increase of student DNNs against adversarial attacks is achievable, while maintaining semantic segmentation quality at a reasonably high level. With our approach, we manage to increase the mean intersection over union (mean IoU) ratio between static student adversarial examples and clean images from about 35% to about 80% on the Cityscapes dataset. Moreover, our proposed method can be integrated into any DNN-based perception mechanism to increase the (online) robustness in an adversarial environment, created from static model knowledge.
Towards Corner Case Detection for Autonomous Driving
The progress in autonomous driving is also due to the increased availability of vast amounts of training data for the underlying machine learning approaches. Machine learning systems are generally known to lack robustness, e.g., if the training data rarely or never covered critical situations. The challenging task of corner case detection in video, which is related to unusual event or anomaly detection, aims at detecting these unusual situations, which could become critical, and at communicating this to the autonomous driving system (online use case). Such a system, however, could also be used in offline mode to screen vast amounts of data and select only the relevant situations for storing and (re)training machine learning algorithms. So far, the approaches for corner case detection have been limited to videos recorded from a fixed camera, mostly for security surveillance. In this paper, we provide a formal definition of a corner case and propose a system framework for both the online and the offline use case that can handle video signals from front cameras of a naturally moving vehicle and can output a corner case score.
On Low-Bitrate Image Compression for Distributed Automotive Perception: Higher Peak SNR Does Not Mean Better Semantic Segmentation
The high number of sensors required for autonomous driving poses enormous challenges to the capacity of automotive bus systems. There is a need to understand trade-offs between bitrate and perception performance. In this paper, we compare the image compression standards JPEG, JPEG2000, and WebP to a modern encoder/decoder image compression approach based on generative adversarial networks (GANs). We evaluate both the pure compression performance using typical metrics such as peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and others, and the performance of a subsequent perception function, namely semantic segmentation (characterized by the mean intersection over union (mIoU) measure). Not surprisingly, for all investigated compression methods, a higher bitrate means better results in all investigated quality metrics. Interestingly, however, we show that the semantic segmentation mIoU of the GAN autoencoder in the highly relevant low-bitrate regime (at 0.0625 bit/pixel) is better by 3.9% absolute than that of JPEG2000, although the latter is still considerably better in terms of PSNR (5.91 dB difference). This effect can be greatly enlarged by training the semantic segmentation model with images originating from the decoder, so that the mIoU using the segmentation model trained on GAN reconstructions exceeds that of the model trained with original images by almost 20% absolute. We conclude that distributed perception in future autonomous driving will most probably not solve the automotive bus capacity bottleneck by using standard compression schemes such as JPEG2000, but will require modern coding approaches, with the GAN encoder/decoder method being a promising candidate.