Day 4

Detailed paper information

Paper title Critical Components of Strong Supervised Baselines for Building Damage Assessment in Satellite Imagery and their Limitations
  1. Sebastian Gerard KTH Royal Institute of Technology Speaker
  2. Yonk Shi KTH Royal Institute of Technology
  3. Dávid Kerekes KTH Royal Institute of Technology
  4. Yifang Ban KTH Royal Institute of Technology
  5. Hossein Azizpour KTH Royal Institute of Technology
  6. Josephine Sullivan KTH Royal Institute of Technology
Form of presentation Poster
  • C1. AI and Data Analytics
    • C1.07 ML4Earth: Machine Learning for Earth Sciences
Abstract text

Deep learning is a powerful approach to semantic segmentation in computer vision [1] and medical image analysis [2]. Variations of encoder-decoder networks, such as the U-Net, have consistently shown strong, repeatable results when trained in a supervised fashion on appropriately labelled training data. These encoder-decoder architectures and training approaches are now increasingly explored and exploited for semantic segmentation tasks in satellite image analysis. Several challenges within this field, including the xView2 Challenge [3], have been won with such approaches. However, from reading the summaries, reports, and code of high-performing solutions, it is frequently unclear which aspects of the training, network architecture, and pre- and post-processing steps are critical to obtaining strong performance. This opacity arises mainly because top solutions can be somewhat over-engineered and computationally expensive in the pursuit of the small gains needed to win challenge competitions or achieve state-of-the-art results on standard benchmarks. This makes it difficult for practitioners who want to mimic high-performing systems, subject to their own computational restrictions at training and test time, to decide what to include in systems built for their specific problem.

Thus, in this paper we dissect the winning solution of the xView2 challenge, a late-fusion U-Net [4] network architecture, and identify which of its components matter most when training a network for building localization and for classification of building damage caused by natural disasters, while still maintaining strong performance. We focus on the xView2 challenge because its dataset contains pre- and post-disaster satellite images from a large and diverse set of global locations and disaster types, together with manually verified labels; these qualities are not abundant in publicly available remote sensing datasets. Our results show that many of the bells and whistles of the winning system, such as the pre- and post-processing applied, the ensembling of models with large backbone networks, and extensive data augmentation, are not necessary to obtain 90-95% of the performance of the winning method. Our experiments lead to the following conclusions:
1) The choice of loss function is critical: a carefully weighted combination of the focal and dice losses is important for stable training.
2) A U-Net architecture with a ResNet-34 backbone is sufficient for good performance.
3) Late fusion of features from the pre- and post-disaster images via an appropriately pre-trained U-Net is important.
4) A per-class weighted loss is very helpful, but optimizing the weights beyond inverse relative frequency does not yield much improvement.
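
Conclusions 1) and 4) can be illustrated with a minimal NumPy sketch of a class-weighted focal plus dice loss, with class weights set to the inverse relative frequency of each class. This is an illustration under stated assumptions, not the winning solution's code: the function names, the focusing parameter `gamma=2.0`, the mixing coefficients `focal_weight` and `dice_weight`, and the normalization of the class weights are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Per-class weights proportional to 1 / relative class frequency.

    Weights are rescaled so that classes present in `labels` average to 1
    (an assumed normalization); absent classes get weight 0.
    """
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(float)
    freq = counts / counts.sum()
    weights = np.zeros(num_classes)
    present = freq > 0
    weights[present] = 1.0 / freq[present]
    return weights / weights[present].mean()


def focal_loss(probs, labels, class_weights, gamma=2.0, eps=1e-7):
    """Class-weighted focal loss.

    probs:  (N, C) array of per-pixel class probabilities.
    labels: (N,) array of integer class labels.
    """
    p_true = np.clip(probs[np.arange(labels.size), labels], eps, 1.0)
    per_pixel = -class_weights[labels] * (1.0 - p_true) ** gamma * np.log(p_true)
    return float(per_pixel.mean())


def dice_loss(probs, labels, class_weights, eps=1e-7):
    """Class-weighted soft dice loss, averaged over classes."""
    one_hot = np.eye(probs.shape[1])[labels]
    intersection = (probs * one_hot).sum(axis=0)
    denominator = probs.sum(axis=0) + one_hot.sum(axis=0)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return float(np.sum(class_weights * (1.0 - dice)) / np.sum(class_weights))


def combined_loss(probs, labels, class_weights, focal_weight=1.0, dice_weight=1.0):
    """Weighted sum of the focal and dice losses (coefficients are assumptions)."""
    return (focal_weight * focal_loss(probs, labels, class_weights)
            + dice_weight * dice_loss(probs, labels, class_weights))
```

Here `probs` would be the flattened softmax output of the network and `labels` the flattened ground-truth mask. In an actual training loop the same computation would be written with differentiable tensor operations in the deep learning framework being used.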

We also identify a problem with the evaluation criterion of the xView2 challenge dataset. Images from the same disaster sites, both pre- and post-disaster, are included in both the training set (and therefore, by default, in any validation set created from it) and the test set. The reported performance numbers are therefore not very meaningful for the common use case in which a disaster occurs at a site unseen during training. Our preliminary results show that when the test disaster sites are not present in the training set, performance on the unseen test sites can fall by more than 50%, with damage classification being much more affected than building localization. These results demonstrate that generalization of networks trained in a supervised fashion to unseen sites is still far from solved, and that supervised networks are perhaps not the final word on semantic segmentation for real-world satellite applications.
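
One way to avoid this leakage is to split the data by disaster site rather than by image. The following is a minimal sketch using only the Python standard library. The function name `split_by_disaster`, the `(image_id, disaster_name)` sample format, and the `test_fraction` and `seed` parameters are assumptions made for illustration; they do not describe the xView2 tooling or the split used in our experiments.

```python
import random
from collections import defaultdict


def split_by_disaster(samples, test_fraction=0.2, seed=0):
    """Split samples so that no disaster site appears in both train and test.

    samples: iterable of (image_id, disaster_name) pairs (hypothetical format).
    Returns (train_ids, test_ids, test_sites).
    """
    by_site = defaultdict(list)
    for image_id, disaster in samples:
        by_site[disaster].append(image_id)

    sites = sorted(by_site)
    if len(sites) < 2:
        raise ValueError("need at least two disaster sites for a disjoint split")

    # Shuffle whole sites, not individual images, so sites stay intact.
    random.Random(seed).shuffle(sites)
    n_test = min(len(sites) - 1, max(1, round(test_fraction * len(sites))))
    test_sites = set(sites[:n_test])

    train_ids = [i for s in sites if s not in test_sites for i in by_site[s]]
    test_ids = [i for s in sites if s in test_sites for i in by_site[s]]
    return train_ids, test_ids, test_sites
```

A validation set can be carved out of the training sites in the same way, so that model selection is also performed on sites unseen during training.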

[1] Semantic Segmentation on Cityscapes test; Semantic Segmentation on PASCAL VOC 2012 test
[2] Medical Image Segmentation on Medical Segmentation Decathlon
[3] xView2: Assess Building Damage. Computer Vision for Building Damage Assessment using satellite imagery of natural disasters
[4] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015