Commonly used metrics for evaluating semantic segmentation, such as mean intersection over union (mIoU), do not incorporate temporal consistency. Existing metrics cannot be straightforwardly extended to evaluate the consistency of segmentation across video sequences, since labelled videos are rare and very expensive to obtain. For safety-critical applications such as highly automated driving, there is, however, a need for a metric that measures the temporal consistency of video segmentation networks to possibly support safety requirements. In this paper, (a) we introduce a metric that measures the stability of segmentation network predictions over a series of images without requiring segmentation labels; (b) we perform an in-depth analysis of the proposed metric and observe a strong correlation with the supervised mIoU metric; (c) we evaluate five state-of-the-art semantic segmentation networks of varying complexity and architecture on two public datasets, namely Cityscapes and CamVid. Finally, we perform timing evaluations and propose the use of the metric either as an online observer to identify potentially unstable segmentation predictions, or as an offline method to evaluate or improve semantic segmentation networks, e.g., by selecting additional training data with critical temporal consistency.
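The abstract does not specify how the label-free metric is computed. As a purely illustrative sketch (not the paper's method), one common way to score temporal consistency without labels is to warp the previous frame's predicted segmentation into the current frame using optical flow and compute a pseudo-mIoU between the warped and current predictions; the Farneback flow, the `temporal_consistency` helper, and all parameters below are our own assumptions for illustration.

```python
# Illustrative sketch only: label-free temporal consistency between two
# consecutive segmentation predictions, assuming flow-based warping.
import numpy as np
import cv2


def temporal_consistency(prev_frame, curr_frame, prev_pred, curr_pred, num_classes):
    """Pseudo-mIoU between the flow-warped previous prediction and the current one.

    prev_frame, curr_frame: grayscale images (H, W), uint8
    prev_pred, curr_pred:   class-index maps (H, W), int
    """
    # Dense optical flow from the current frame back to the previous frame,
    # so each current pixel can be traced to its source position.
    flow = cv2.calcOpticalFlowFarneback(
        curr_frame, prev_frame, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    h, w = curr_pred.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)

    # Backward-warp the previous prediction; pixels warped from outside
    # the image are marked invalid (-1) and excluded from the score.
    warped_prev = cv2.remap(prev_pred.astype(np.float32), map_x, map_y,
                            interpolation=cv2.INTER_NEAREST,
                            borderMode=cv2.BORDER_CONSTANT, borderValue=-1)
    valid = warped_prev >= 0

    ious = []
    for c in range(num_classes):
        pred_c = (curr_pred == c) & valid
        warp_c = (warped_prev == c) & valid
        union = np.logical_or(pred_c, warp_c).sum()
        if union == 0:
            continue  # class absent in both predictions for this frame pair
        ious.append(np.logical_and(pred_c, warp_c).sum() / union)
    return float(np.mean(ious)) if ious else 1.0
```

Averaging this per-frame-pair score over a sequence yields a single consistency value per video; a low score flags frames where predictions flicker between classes even though the scene changes little, which is the kind of instability an online observer would report.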