Like many other research fields, remote sensing has been greatly impacted by machine and deep learning and benefits from technological and computational advances. In recent years, considerable effort has been spent on deriving not just accurate, but also reliable modeling techniques. In the particular framework of image classification, this reliability is validated by, e.g., checking whether the confidence in the model prediction adequately describes the true certainty of the model when confronted with unseen data. We investigate this reliability in the framework of classifying satellite images into different land cover classes. More precisely, we use the So2Sat LCZ42 data set, comprised of Sentinel-1 and Sentinel-2 image pairs. These were classified into 17 categories by a team of two labelers, following the Local Climate Zone (LCZ) classification scheme.
As a novelty, we make explicit use of the so-termed evaluation set that was additionally produced by the authors of the LCZ42 data set. In this supplementary study, a subset of the initial data was re-labeled by 10 remote sensing experts, who independently of one another re-cast their label votes for each satellite image. The resulting sets of label votes carry a notion of human uncertainty associated with the underlying satellite images. In the following, we explicitly incorporate this uncertainty into the training process of a neural network classifier and investigate its impact on model performance. We also evaluate the reliability defined above and compare against a more common modeling approach, which uses a single ground truth label derived from the majority vote of the individual expert votes.
The 17 LCZs describe the urbanization of cities and are comprised of 10 classes related to built-up areas (urban classes) and 7 classes related to the surrounding land cover (non-urban classes). The evaluation data set, which we use for modeling purposes in the following, covers 10 European cities as well as additional areas from around the globe, which were added for class-balancing reasons. A total of ca. 250,000 Sentinel-1 and Sentinel-2 image pairs are included, with 10 spectral bands and 8 statistics derived from the VV-VH dual-pol SLC Sentinel-1 data. Each image is of size 32 by 32 pixels and covers an area of 320 m by 320 m. For simplicity, we focus our analysis on the Sentinel-2 data only. Accompanying each satellite image, 10 individual expert label votes are provided. These votes are aggregated for each image by forming the empirical distribution over the different classes. The resulting distributional label preserves the information from the individual label votes. Additionally, we store the majority vote of the experts for each image, which serves as a pseudo ground truth label. In case of a tie, the label cast by the two initial labelers was also considered when determining the majority vote. Due to the overall high rate of agreement among the voters within the non-urban classes, only the images associated with the urban classes are considered for modeling the distributional labels in the following.
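The aggregation described above can be sketched as follows; the function names and the example votes are illustrative, not the authors' implementation, and the tie-break mirrors the fallback to the initial labelers' vote:

```python
import numpy as np

N_CLASSES = 17  # full LCZ scheme

def distributional_label(votes, n_classes=N_CLASSES):
    """Empirical distribution over classes formed from individual expert votes."""
    counts = np.bincount(votes, minlength=n_classes)
    return counts / counts.sum()

def majority_vote(votes, tiebreak_label=None, n_classes=N_CLASSES):
    """Mode of the votes; on a tie, fall back to the initial labelers' label."""
    counts = np.bincount(votes, minlength=n_classes)
    winners = np.flatnonzero(counts == counts.max())
    if len(winners) == 1:
        return int(winners[0])
    if tiebreak_label is not None and tiebreak_label in winners:
        return int(tiebreak_label)
    return int(winners[0])

votes = np.array([2, 2, 2, 5, 5, 5, 8, 2, 5, 2])  # 10 expert votes (0-indexed)
p = distributional_label(votes)
# p[2] == 0.5, p[5] == 0.4, p[8] == 0.1; majority_vote(votes) == 2
```

The distributional label thus retains how split the experts were, which the majority vote alone discards.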
As a result, the human uncertainty is now encoded in the derived label distribution. To integrate this information into the classification task, two main changes are made to an existing deep neural network. First, the usual one-hot encoded labels are replaced by the computed distributional labels. Second, the typical cross-entropy loss is replaced by the Kullback-Leibler (KL) divergence, which better reflects the task of approximating the ground truth distribution formed by the label votes from an information-theoretic perspective. Training then proceeds as usual by backpropagating the loss through the network. To evaluate the predictive uncertainty of the model, we investigate the so-called expected calibration error (ECE). The ECE is derived by comparing the model confidence (i.e., the highest predicted class probability) with the corresponding accuracy on the hold-out test set. The discrepancies between the two quantities can further be visualized in a bar plot known as a reliability diagram.
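The relation between the two losses can be made explicit in a minimal numpy sketch (the function names are our own): for a fixed label distribution p and prediction q, KL(p‖q) equals the cross-entropy H(p, q) minus the label entropy H(p). Since H(p) does not depend on the model parameters, both losses yield identical gradients, but the KL value is more interpretable, reaching zero exactly when the prediction matches the label distribution.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): extra expected code length when modeling p with q."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) with a soft target distribution p."""
    q = np.clip(q, eps, 1.0)
    return float(-np.sum(p * np.log(q)))

p = np.array([0.5, 0.4, 0.1])  # distributional label from the expert votes
q = np.array([0.6, 0.3, 0.1])  # softmax output of the network
# KL(p||q) = H(p, q) - H(p), so gradients w.r.t. q coincide
gap = kl_divergence(p, q) - (cross_entropy(p, q) - cross_entropy(p, p))
```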
3. Experiments & Results
We use the benchmark model for the data set from a previous study, in which the authors found this model to be superior to many common Convolutional Neural Network (CNN)-based architectures. This benchmark model, termed Sen2LCZ, builds on a combination of conventional convolutional blocks, the fusion of multiple intermediate deep features, and double pooling. Our implementation results in a network depth of 17 and applies dropout after the second and third block. The evaluation data set was split into geographically separated training and testing sets; the latter was furthermore randomly split into validation and testing data.
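A city-wise split of this kind can be sketched as follows; the sample structure and function name are illustrative assumptions, not the authors' implementation:

```python
import random

def split_by_city(samples, test_cities, seed=0):
    """Hold out whole cities so training and test areas are geographically
    separated, then randomly halve the held-out pool into val and test."""
    train = [s for s in samples if s["city"] not in test_cities]
    held_out = [s for s in samples if s["city"] in test_cities]
    random.Random(seed).shuffle(held_out)
    half = len(held_out) // 2
    return train, held_out[:half], held_out[half:]
```

Splitting by city rather than by image avoids spatially adjacent (and thus highly correlated) patches ending up on both sides of the split.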
Two separate implementations of the benchmark model were evaluated in order to identify the impact of explicitly modeling the human uncertainty in the labels. The classical approach employed the one-hot encoded labels based on the majority vote, together with the typically used cross-entropy loss. The modified model utilized the earlier described distributional labels as well as the KL divergence as loss. Apart from that, identical architectures, hyperparameters, and training setups were applied, and the usual performance metrics were derived on the same test set for both implementations.
As a first result, all metrics, including overall accuracy, average accuracy (both macro and weighted) as well as the kappa score, improved by at least 1 percentage point when using the distributional labels. Note that, for deriving these metrics in the presence of distributional labels, the majority vote (i.e., the mode of the distributional label) was taken as ground truth, and a prediction was counted as correct if the class with the highest predicted probability matched this ground truth. Furthermore, the cross-entropy between the predicted probabilities and the ground truth one-hot labels was reduced by ca. 20% on the test set when training with distributional labels. The central result of this work is shown in the accompanying visualization of the reliability diagrams of the two implementations: incorporating the label distributions reduced the expected calibration error by more than half and avoided overconfidence. The average confidence matches the overall accuracy, and the two quantities track each other closely across almost the entire confidence spectrum.
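The ECE underlying such reliability diagrams can be computed by binning the test-set predictions by confidence; a minimal sketch, with function and parameter names of our own choosing:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: sample-weighted gap between accuracy and confidence per bin.

    probs:  (n_samples, n_classes) predicted class probabilities
    labels: (n_samples,) integer ground-truth classes (here: majority votes)
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()      # bin accuracy
            conf = confidences[in_bin].mean() # bin confidence
            ece += in_bin.mean() * abs(acc - conf)
    return ece
```

The reliability diagram plots per-bin accuracy against confidence; a perfectly calibrated model lies on the diagonal and has an ECE of zero.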
The last reported result shows the clear advantage of integrating label uncertainty into the training process of a neural network for the task of classifying satellite images into LCZs. In addition, the integration is superior to classical calibration methods, as it also led to improved performance metrics and a reduced loss on the test set. The derivation and implementation of the distributional labels are straightforward and easy to use. As a main outcome, we would like to emphasize the large improvement in the calibration of the predictive distribution: the predicted probabilities of the model using the distributional labels can be reliably interpreted and adequately reflect the uncertainty in the prediction.
Zhu, X. X., Hu, J., Qiu, C., Shi, Y., Kang, J., Mou, L., Bagheri, H., Hua, Y., Huang, R., Hughes, L. H., Li, H., Sun, Y., Zhang, G., Han, S., Schmitt, M., & Wang, Y. (2020). So2Sat LCZ42: A benchmark data set for the classification of global local climate zones. IEEE Geoscience and Remote Sensing Magazine, 8(3), 76-89.
Qiu, C., Tong, X., Schmitt, M., Bechtel, B., & Zhu, X. X. (2020). Multilevel feature fusion-based CNN for local climate zone classification from Sentinel-2 images: Benchmark results on the So2Sat LCZ42 dataset. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 2793-2806.