Day 4

Detailed paper information

Paper title Effects of sample size on machine learning regression models for biophysical parameter retrieval from spectral data
  1. Hannes Feilhauer Leipzig University Speaker
Form of presentation Poster
  • C1. AI and Data Analytics
    • C1.07 ML4Earth: Machine Learning for Earth Sciences
Abstract text Machine learning (ML) regression is a frequently used approach for the retrieval of biophysical vegetation properties from spectral data. ML regression is often preferred in this context over conventional multiple linear regression models because ML approaches are able to cope with one or more of the following challenges that impair conventional regression models:
(1) Spectral data are highly inter-correlated. This strong correlation between bands or wavelengths violates the assumption in linear regression that the predictor variables are statistically independent and impairs the interpretation of regression coefficients.
(2) The relation between spectral data and the response variable is non-linear and not well described by linear models.
(3) The relation between individual spectral bands and the response variable is rather weak and many bands are necessary to build an adequate prediction model.
In addition, some ML approaches promise to deliver robust models from comparatively small samples. This makes ML-based approaches attractive for data sets that contain fewer samples than spectral bands. In practice, the training sample size in remote sensing studies targeting biophysical variables is most often determined by data availability and is frequently limited to n < 100. The practice of using rather small samples, and the promise that ML needs only a few observations for sufficient model training, are countered by reports that these techniques are prone to over-fitting. So far, no systematic analysis of the effects of sample size on ML regression performance in biophysical property retrieval is available. The advent of spectral data archives such as the EcoSIS repository enables such an analysis. This study hence addresses the question ‘How does the training sample size affect the model performance in machine-learning-based biophysical trait retrieval?’

For a comprehensive analysis, two parameters were selected that are physically linked to the spectral signal of vegetation and are frequently addressed at the leaf and at the canopy level: leaf chlorophyll (LC, two data sets at the leaf and two at the canopy level) and leaf mass per area (LMA, seven and two data sets, respectively). LC has a very distinct influence on the spectral signal due to its pronounced absorption in the visible region and shows a strong statistical relation to a few spectral bands. LMA has a rather broad and unspecific absorption in the NIR and SWIR range and shows a weaker relation to the spectral signal in individual bands. Due to the differences in their spectral absorption features, these two parameters were expected to behave differently in regression analysis.

With these data, three different ML regression techniques were tested for effects of training sample size on their performance: Partial Least Squares regression (PLSR), Random Forest regression (RFR) and Support Vector Machine regression (SVMR). For each data set and regression technique, the target variable was repeatedly modeled with a successively growing training sample size. Trends in the model performances were identified and analyzed.
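
The resampling design described above (repeated model fits on training subsets of growing size, each scored on held-out data) can be sketched as follows. This is only an illustration under loud assumptions: the spectra are synthetic, the function names are hypothetical, and a small k-nearest-neighbour regressor stands in for PLSR, RFR and SVMR so that the sketch runs with the Python standard library alone. It is not the study's implementation.

```python
import math
import random
import statistics


def make_synthetic_spectra(n_samples, n_bands, rng):
    """Hypothetical data: inter-correlated 'bands' that depend non-linearly on one trait."""
    X, y = [], []
    for _ in range(n_samples):
        trait = rng.uniform(0.0, 1.0)
        # Every band carries the same trait signal plus noise, so bands are strongly correlated.
        spectrum = [
            math.exp(-trait * (1.0 + b / n_bands)) + rng.gauss(0.0, 0.02)
            for b in range(n_bands)
        ]
        X.append(spectrum)
        y.append(trait)
    return X, y


def knn_predict(X_train, y_train, x, k=3):
    """Stand-in regressor (assumption): mean target of the k nearest training spectra."""
    nearest = sorted(
        (math.dist(row, x), target) for row, target in zip(X_train, y_train)
    )[:k]
    return sum(target for _, target in nearest) / len(nearest)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def learning_curve(X_pool, y_pool, X_val, y_val, sizes, n_repeats, rng):
    """For each training size, draw random subsets repeatedly and score on fixed validation data."""
    indices = list(range(len(X_pool)))
    curve = {}
    for n in sizes:
        scores = []
        for _ in range(n_repeats):
            subset = rng.sample(indices, n)  # random training subset without replacement
            X_tr = [X_pool[i] for i in subset]
            y_tr = [y_pool[i] for i in subset]
            preds = [knn_predict(X_tr, y_tr, x) for x in X_val]
            scores.append(rmse(y_val, preds))
        curve[n] = (statistics.mean(scores), statistics.stdev(scores))
    return curve


rng = random.Random(0)  # fixed seed so the sketch is repeatable
X, y = make_synthetic_spectra(260, 50, rng)
X_pool, y_pool = X[:200], y[:200]  # pool from which training subsets are drawn
X_val, y_val = X[200:], y[200:]    # fixed validation set, never used for training
sizes = [10, 25, 50, 100, 150]
curve = learning_curve(X_pool, y_pool, X_val, y_val, sizes, n_repeats=10, rng=rng)
for n in sizes:
    mean_rmse, sd_rmse = curve[n]
    print(f"n_cal={n:4d}  RMSE={mean_rmse:.3f} +/- {sd_rmse:.3f}")
```

Keeping the validation set fixed while only the training subset changes means that differences between sizes reflect the training sample alone. The spread across repeats gives a rough impression of how erratic small-sample models can be.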

The results show that the performance of ML regression techniques clearly depends on the sample size of the training data. At both the leaf and the canopy level, for both LC and LMA, and for all three regression techniques, model performance increased with a growing sample size. This increase is, however, non-linear and tends to saturate. The saturation in the validation fits emerges for training sample sizes larger than n_cal = 100 to n_cal = 150. While it may be possible to build a model with an adequate fit and robustness even with a rather small training data set, the risks of weak performance, of an over-fitted and thus non-transferable model, and of erratic band importance metrics increase considerably.
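
One way to make "tends to saturate" operational is to look for the smallest training size beyond which further growth no longer improves the validation error by more than a chosen relative tolerance. The sketch below is a hypothetical helper, not the criterion used in the study. The function name, the 5 % tolerance and the error curve are all assumptions, and the curve is generated from a made-up formula purely for illustration; it is not measured data.

```python
def saturation_point(sizes, errors, rel_tol=0.05):
    """Smallest size after which no step to a larger size reduces the error
    by more than rel_tol (relative to the error before the step).

    Assumes an error metric where lower is better (e.g. RMSE) and errors > 0.
    Returns None if the final step still improves by more than rel_tol.
    """
    pairs = sorted(zip(sizes, errors))
    # Relative improvement gained by each step from one size to the next.
    gains = [
        (pairs[i][1] - pairs[i + 1][1]) / pairs[i][1]
        for i in range(len(pairs) - 1)
    ]
    for i, (n, _) in enumerate(pairs[:-1]):
        if all(gain <= rel_tol for gain in gains[i:]):
            return n
    return None


# Made-up saturating error curve (assumption): a floor of 0.10 plus a term that decays with n.
toy_sizes = [10, 25, 50, 100, 150, 200, 300]
toy_errors = [0.10 + 2.0 / n for n in toy_sizes]
print(saturation_point(toy_sizes, toy_errors, rel_tol=0.05))
```

The result depends directly on the tolerance and on the spacing of the tested sizes, so such a threshold is best read as a rough indication of where the curve flattens.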