|Paper title||Object-based Land Cover Mapping of Satellite Image Time Series via Attention-Based CNN|
|Form of presentation||Poster|
Nowadays, modern Earth observation systems continuously collect massive amounts of satellite information, commonly referred to as Earth Observation (EO) data.
A notable example is the Sentinel-2 mission of the Copernicus programme, which supplies optical imagery with a revisit time between 5 and 10 days thanks to a constellation of two twin satellites. Owing to this high revisit frequency, the acquired images can be organized into Satellite Image Time Series (SITS), which represent a practical tool to monitor a particular spatial area through time. SITS data can support a wide range of application domains such as ecology, agriculture, mobility, health, risk assessment, land management planning, and forest and natural habitat monitoring; for this reason, they constitute a valuable source of information to follow the dynamics of the Earth's surface. The huge amount of regularly acquired SITS data opens new challenges for remote sensing regarding how knowledge can be effectively extracted and how the spatio-temporal interplay can be exploited to get the most out of such a rich information source.
One of the main tasks in SITS data analysis is land cover mapping, where a predictive model is trained to link satellite data (i.e., SITS) to the associated land cover classes. SITS data capture the temporal dynamics exhibited by land cover classes, thus supporting a more effective discrimination among them.
Despite the increasing need for large-scale (i.e., regional or national) land cover maps, the amount of labeled information available to train such models is still limited, sparse (annotated polygons are scattered all over the study site) and, most of the time, at a coarser scale than pixel precision. This is because the labeling task is generally labour-intensive and time-consuming when a sufficient number of samples must be collected with respect to the extent of the study site.
Object-Based Image Analysis (OBIA) refers to a category of digital remote sensing image analysis approaches that study geographic entities or phenomena by delineating and analyzing image-objects rather than individual pixels. In supervised Land Use / Land Cover (LULC) classification, the recourse to OBIA approaches is motivated by the fact that, in modern remote sensing imagery, most common land cover classes present a heterogeneous radiometric composition, and classical pixel-based approaches typically fail to capture such complexity. This effect is even more pronounced when the aforementioned complexity also manifests in the temporal dimension, which is the case for SITS data.
To address this issue, the main idea in the OBIA framework is to group adjacent pixels together prior to the classification process, and subsequently work on the so-obtained object layer, in which segments correspond to more representative samples of such complex LULC classes (e.g., ``land units''). This is typically achieved by tuning the segmentation algorithm to provide object layers at an appropriate spatial scale, at which objects are generally not radiometrically homogeneous, especially for the most complex LULC classes. As a matter of fact, most common segmentation techniques used in remote sensing allow the spatial scale to be parametrized, e.g., via a heterogeneity threshold, via a bandwidth parameter defined specifically for the spatial domain as in Mean-Shift or, more recently, via the number of required objects as in SLIC.
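The grouping step described above can be sketched in a deliberately simplified form: a region-growing pass that merges adjacent pixels whose values stay within a heterogeneity threshold, yielding an object (label) layer. This toy sketch is only an illustration of how such a threshold controls the spatial scale; it is not the actual segmentation used in the work (e.g., Mean-Shift or SLIC):

```python
import numpy as np
from collections import deque

def segment(image, threshold):
    """Toy region growing: adjacent pixels join an object when their
    value differs from the region's seed by at most `threshold`."""
    h, w = image.shape
    labels = -np.ones((h, w), dtype=int)  # -1 means "not yet assigned"
    current = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] != -1:
                continue
            seed = image[sy, sx]
            queue = deque([(sy, sx)])
            labels[sy, sx] = current
            while queue:  # breadth-first flood fill from the seed
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny, nx] == -1
                            and abs(image[ny, nx] - seed) <= threshold):
                        labels[ny, nx] = current
                        queue.append((ny, nx))
            current += 1
    return labels

# A 3x3 toy "image" with three radiometrically distinct regions.
img = np.array([[0., 0., 9.],
                [0., 9., 9.],
                [5., 5., 9.]])
labs = segment(img, threshold=1.0)
# Raising the threshold merges regions, i.e., yields a coarser spatial scale.
```

With `threshold=1.0` the toy image splits into three objects; with a much larger threshold everything collapses into a single object, mimicking the scale parameter of the segmentation algorithms mentioned above.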
Based on these assumptions, the typical approach in the OBIA framework for automatic LULC mapping is to leverage aggregate descriptors (i.e., object-based radiometric statistics) to build proper samples for training and classification, without explicitly managing within-object information diversity. Consider, for instance, a single segment derived from an urban scene: it typically contains, simultaneously, sets of pixels associated with buildings, streets, gardens, and so on, which are all equally important in the recognition of the Urban LULC class. However, in many cases, the components of a single segment do not contribute equally to its identification as belonging to a certain land cover class.
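The limitation of aggregate descriptors can be made concrete with a minimal sketch (the pixel values below are invented for illustration): averaging the temporal profiles of a heterogeneous object flattens the within-object diversity that the classifier would need.

```python
import numpy as np

# Toy SITS values for one labelled object: 4 pixels x 3 dates (single band).
# Two pixels follow a crop-like growth profile, two a stable tree-like one,
# so the object is radiometrically heterogeneous.
pixels = np.array([[0.2, 0.6, 0.9],   # crop-like temporal profile
                   [0.3, 0.7, 0.8],   # crop-like temporal profile
                   [0.5, 0.5, 0.5],   # tree-like (stable) profile
                   [0.5, 0.4, 0.5]])  # tree-like (stable) profile

# Classical object-based descriptor: per-date mean over all pixels.
object_mean = pixels.mean(axis=0)
# The averaged profile mixes both behaviours, diluting the crop dynamics
# that would best discriminate the land cover class of this object.
```

This is exactly the within-object diversity that the per-component modelling introduced below aims to preserve instead of averaging away.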
In this abstract, we propose TASSEL, a new deep-learning framework for object-based SITS land cover mapping that can be ascribed to the weakly supervised learning (WSL) setting. We place our contribution in the WSL framework since, in the object-based land cover classification task, the label information intrinsically carries a certain degree of approximation and inaccurate supervision for training the corresponding learning model, owing to the presence of non-discriminative SITS components within a single labelled object.
The architecture of our framework is depicted in the first image associated with this abstract: firstly, the different components that constitute the object are identified. Secondly, a CNN block extracts information from each of the object components. Then, the outputs of the CNN blocks are combined via attention. Finally, the classification is performed via dedicated Fully Connected layers. The outputs of the process are the prediction for the input object SITS as well as the extra information alpha, which quantifies the contribution of each object component.
Our framework thus includes several stages: firstly, it identifies the different multifaceted components of which an object is composed. Secondly, a Convolutional Neural Network (CNN) extracts an internal representation from each of the object components. Here, the CNN is especially tailored to model the temporal behavior exhibited by the object component.
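As a rough illustration of what a temporal convolution over a component's time series computes (the actual TASSEL architecture, filter sizes and non-linearities are not specified here, and the kernels below are hand-made for the example), a valid 1D convolution followed by a ReLU can be sketched as:

```python
import numpy as np

def temporal_conv1d(series, kernels):
    """Valid 1D convolution of a (T,) time series with a bank of
    (K, k) kernels, followed by a ReLU: one feature map per kernel."""
    T = series.shape[0]
    K, k = kernels.shape
    out = np.empty((K, T - k + 1))
    for i in range(T - k + 1):
        out[:, i] = kernels @ series[i:i + k]  # dot product per window
    return np.maximum(out, 0.0)  # ReLU non-linearity

# One component's temporal profile over 6 dates (e.g., a vegetation index).
profile = np.array([0.2, 0.3, 0.6, 0.8, 0.7, 0.4])
# Two hand-made kernels: a local smoother and a growth (derivative) detector;
# in a trained CNN these would be learnt from the data.
kernels = np.array([[1 / 3, 1 / 3, 1 / 3],
                    [-1.0, 0.0, 1.0]])
feats = temporal_conv1d(profile, kernels)  # shape (2, 4)
```

The second feature map responds to the growth phase of the profile and is zeroed by the ReLU during the decline, showing how such filters encode the temporal behavior of a component.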
Then, the per-component representations are aggregated and used to decide the land cover class of the object. Beyond pure model performance, our framework also allows us to go a step further in the analysis by providing extra information on the contribution of each component to the final decision. Such extra information can be easily visualized to give additional feedback to the end user, supporting the spatial interpretability of the model prediction.
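The attention-based aggregation of per-component representations can be sketched as follows. This is a generic softmax attention pooling under assumed shapes (the embeddings and the scoring vector `w` below are placeholders; in the real model they come from the CNN blocks and from training):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def attention_pool(H, w):
    """Combine per-component embeddings H (n_components, d) into a single
    object embedding via attention scores against a learned vector w.
    Returns the pooled embedding and the attention weights alpha."""
    scores = H @ w            # one scalar relevance score per component
    alpha = softmax(scores)   # normalized contribution of each component
    pooled = alpha @ H        # convex combination of component embeddings
    return pooled, alpha

# Three hypothetical component embeddings (d = 2).
H = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
w = np.array([1.0, -1.0])     # learned in practice; fixed here for the sketch
obj_embedding, alpha = attention_pool(H, w)
# alpha can be mapped back onto the object's components to visualize
# which parts of the object drove the prediction.
```

The weights alpha are exactly the kind of per-component extra information that can be rendered as an attention map over the object, as in the qualitative analysis below.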
In order to assess the quality of TASSEL, we performed an extensive evaluation on two real-world scenarios over large areas with contrasting land cover characteristics and sparsely annotated ground truth data. The evaluation considers state-of-the-art land cover mapping approaches for sparsely annotated data in the OBIA framework. Our framework gains around 2 points of F-Measure, on average, over the best competing approaches, demonstrating the added value of explicitly managing intra-object heterogeneity.
Finally, we perform a qualitative analysis to underline the ability of our framework to provide extra information that can be effectively leveraged to support the comprehension of the classification decision. The second image of the associated image file shows an example where the extra information supplied by TASSEL is used to interpret the final decision. The yellow lines represent object contours. The example refers to the Annual Crops land cover class. The legend on the right reports the scale (discretized by quantiles) associated with the attention map. Here, we can note that TASSEL assigns more attention (dark blue) to the portion of the object directly related to the Annual Crops land cover class, while lower attention (light blue) is assigned to the Shea Trees that are not representative of the Annual Crops class.
The main contributions of our work can be summarized as follows:
i) We propose a new deep-learning framework for object-based SITS classification, devoted to managing the within-object information diversity exhibited in the context of land cover mapping; ii) We design our framework to provide as outcomes not only the model decision but also extra information that yields insights into (spatial) model interpretability; and iii) We conduct an extensive evaluation of our framework, considering both quantitative and qualitative analyses, on real-world benchmarks involving ground truth data collected during field campaigns and subject to operational constraints.