Day 4

Detailed paper information


Paper title Spatiotemporal data Cube at 1-km resolution 1982-2020 to enable dynamic system modeling
  1. Chris van Diemen OpenGeoHub Speaker
  2. Tomislav Hengl OpenGeoHub Foundation
  3. Leandro Parente OpenGeoHub Foundation
Form of presentation Poster
  • C1. AI and Data Analytics
    • C1.07 ML4Earth: Machine Learning for Earth Sciences
Abstract text Recently, several groups have put significant effort into releasing consistent time-series datasets that represent our environmental history. Examples include the HILDAplus GLOBv-v1.0 land cover time-series dataset; MODIS-AVHRR NDVI time series with monthly values for 1982–2020; TMF long-term (1990–2020) deforestation and degradation in tropical moist forests; TerraClimate monthly historic climate (precipitation, mean, minimum and maximum temperature, and snow cover); DMSP NTL time-series data (1992–2018) at 1-km spatial resolution; HYDE v3.2 annual land use time series 1982–2016 (occurrence fractions) at 10-km resolution; the Vegetation Continuous Fields (VCF5KYR) Version 1 dataset; daily global Snow Cover Fraction - viewable (SCFV) from AVHRR (1982–2019), version 1.0; and the WAD2M global dataset of wetland area. We have combined, harmonized, gap-filled and, where necessary, downscaled these datasets to produce a Spatiotemporal Earth-Science data Cube at 1-km resolution for 1982–2020, hosted as Cloud-Optimized GeoTIFFs via our data portal. The dataset covers all land on the planet and can be useful for any researcher modelling parts of the Earth system over the 1982–2020 time frame.
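Gap-filling per-pixel time series is one of the harmonization steps mentioned above. A minimal sketch of how such a step might look (not the authors' actual pipeline; `gapfill_linear` is a hypothetical helper) uses simple per-pixel linear interpolation along the time axis:

```python
import numpy as np

def gapfill_linear(stack):
    """Fill NaN gaps along the time axis of a (time, y, x) array by
    per-pixel linear interpolation (np.interp also extends the nearest
    valid value to NaNs at the ends of the series)."""
    t = np.arange(stack.shape[0])
    filled = stack.copy()
    for i in range(stack.shape[1]):
        for j in range(stack.shape[2]):
            series = filled[:, i, j]
            gaps = np.isnan(series)
            if gaps.any() and not gaps.all():
                series[gaps] = np.interp(t[gaps], t[~gaps], series[~gaps])
    return filled
```

A production version would vectorize over pixels and work tile by tile rather than looping in Python, but the per-pixel logic is the same.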
We discuss the process of generating this data Cube. We show examples of using geospatial packages such as GDAL and the Python package rasterio to generate harmonized datasets. We discuss the feature engineering done to enhance the final product and demonstrate uses of the data for spatiotemporal machine learning, i.e. fitting models to predict dynamic changes in target variables. For feature engineering we use the Python package eumap, which optimizes the computation of features for large datasets. Eumap implements a parallelization approach that divides large geospatial datasets into tiles and distributes the calculation per tile. In this way we can quickly generate new features from large datasets, ultimately helping machine learning models find patterns in the data. The focus here is on generating features that make the data cube useful for modelling systems influenced by processes that take multiple decades to develop, such as accumulated values for land use classes.
To exemplify the usefulness of these data for processes that unfold over decades, we present a case study in which we model soil organic carbon globally. In particular, we discuss the benefit of generating features from long-term land cover datasets such as HYDE and HILDA and combining them with reflectance data in machine learning approaches.
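One example of a multi-decadal feature of the kind described above is the accumulated number of years a pixel was mapped as a given land-cover class (e.g. cropland), which can then be combined with reflectance covariates in a machine learning model. A minimal sketch with a hypothetical helper, not the eumap implementation:

```python
import numpy as np

def accumulated_class_years(lc_stack, class_id):
    """Given an annual land-cover stack of shape (years, y, x) with
    integer class codes, return for each pixel the cumulative number
    of years it was mapped as `class_id` up to and including each year."""
    return np.cumsum(lc_stack == class_id, axis=0)
```

For the soil organic carbon case study, features like these would be stacked with reflectance and climate layers into the covariate matrix of a regression model.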
Finally, we hope this example of a harmonized, open-source dataset can inspire more researchers to publish data in a systematic, open manner in the future.