Day 4

Detailed paper information

Paper title Estimating the area of applicability of machine learning models in EO applications
Authors
  1. Hanna Meyer WWU Münster - ILÖK Speaker
  2. Edzer Pebesma University of Muenster
Form of presentation Poster
Topics
  • C1. AI and Data Analytics
    • C1.07 ML4Earth: Machine Learning for Earth Sciences
Abstract text Machine learning algorithms have become very popular for spatial mapping of the environment, even on a global scale. Model training is usually based on limited field observations, and the trained model is then applied to make predictions far beyond the geographic location of these data, assuming that the learned relationships still hold. However, while the algorithms allow fitting complex relationships, this comes with the disadvantage that trained models can only be applied to new data if these resemble the training data. Because new geographic space often goes along with new environmental properties, this often cannot be ensured, and predictions for unsampled environments have to be considered highly uncertain.

We suggest a methodology that delineates the ‘area of applicability’ (AOA) that we define as the area where we enabled the model to learn about relationships based on the training data, and where the estimated cross-validation performance holds. We first propose a ‘dissimilarity index’ (DI) that is based on the minimum distance to the training data in the multidimensional predictor space, with predictors being weighted by their respective importance in the model. The AOA is derived by applying a threshold which is the maximum DI of the training data derived via cross-validation. We further use the relationship between the DI and the cross-validation performance to map the estimated performance of predictions. To illustrate the approach, we present a simulated case study of biodiversity mapping and compare prediction performance inside and outside the AOA.
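
As a rough illustration of the DI and the AOA threshold described above, the following minimal Python sketch may help. It is not the authors' implementation. It assumes that predictors are already standardized, that the minimum distance is scaled by the mean pairwise distance between training points (a normalisation choice made for this sketch only), and that the DI of each training point is measured against training points from other cross-validation folds. All function names, weights and data values are hypothetical.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance after scaling each predictor by its importance weight."""
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))

def mean_pairwise_distance(points, weights):
    """Mean weighted distance over all pairs of training points (assumed scaling)."""
    n = len(points)
    dists = [weighted_distance(points[i], points[j], weights)
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)

def dissimilarity_index(point, reference, weights, scale):
    """Minimum weighted distance from `point` to the reference data, divided by `scale`."""
    return min(weighted_distance(point, r, weights) for r in reference) / scale

def aoa_threshold(train, folds, weights, scale):
    """Maximum DI of the training data, each point compared with the other CV folds."""
    di_values = []
    for i, p in enumerate(train):
        other_folds = [q for q, f in zip(train, folds) if f != folds[i]]
        di_values.append(dissimilarity_index(p, other_folds, weights, scale))
    return max(di_values)

# Hypothetical toy data: two standardized predictors, two cross-validation folds.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
folds = [0, 0, 1, 1]
weights = (1.0, 0.5)  # assumed variable importances

scale = mean_pairwise_distance(train, weights)
threshold = aoa_threshold(train, folds, weights, scale)

# DI of two new locations: one close to the training data, one far away.
di_near = dissimilarity_index((0.1, 0.1), train, weights, scale)
di_far = dissimilarity_index((5.0, 5.0), train, weights, scale)

# A location is treated as inside the AOA when its DI does not exceed the threshold.
print(di_near <= threshold, di_far <= threshold)
```

In this toy setting, the location close to a training point falls inside the AOA and the distant one falls outside. How the threshold behaves on real data depends on the cross-validation design and on the importance weights supplied by the model.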

We suggest adding the AOA computation to the modeller's standard toolkit and limiting predictions to this area. The (global) maps that we create using remote sensing, field data and machine learning are not just nice colorful figures; they are also distributed digitally, often as open data, and are used for decision-making or planning, e.g. in the context of nature conservation, with high requirements on quality. To avoid large error propagation or misplanning, it should be the obligation of the map developer to clearly communicate the limitations, working towards more reliable EO products.