CLEOS (Cloud Earth Observation Services) is the e-GEOS satellite data and information platform, offering access to a wide variety of geospatial datasets and enabling the development, test and large-scale deployment of geospatial data processing pipelines. CLEOS adopts a multi-cloud approach and fosters interoperability through standards and practices such as OpenEO for APIs and STAC for the metadata catalogue.
High Level Architecture.
The CLEOS architecture is modular and based on five main components:
• The Marketplace, where Customers can access available products and services through a user-friendly UI;
• The API Layer, which exposes purchasing, processing and data access capabilities through a RESTful web interface, also enabling machine-to-machine integration with external systems;
• The Processing Platform, which orchestrates the scalable execution of Data Processing Pipelines for both Optical and SAR data, using e-GEOS proprietary algorithms and Third-Party tools. It includes the AI Factory for the management and development of AI-based applications;
• The Data Layer, which hosts the metadata catalogue of the federated data sources connected to the infrastructure and a pool of locally available assets and resources, including EO and non-EO data;
• The Help Desk, used by e-GEOS Customer Service to provide the necessary support to CLEOS Customers.
CLEOS accesses multiple satellite and non-satellite data sources through a set of data collectors adapted to the interfaces offered by each data/content provider (e.g. the Ground Segments of satellite missions). CLEOS also has a multi-cloud orchestration service that allows the deployment of the whole CLEOS platform, or of single processing jobs, in different commercial cloud infrastructures, including all Copernicus DIAS.
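The collector pattern described above can be sketched as an adapter per provider: each collector speaks the provider's native interface and normalizes the result into a common shape. A minimal illustration (class and endpoint names are hypothetical, not CLEOS internals):

```python
from abc import ABC, abstractmethod


class DataCollector(ABC):
    """Adapter for one external data/content provider interface."""

    @abstractmethod
    def search(self, bbox, start, end):
        """Return (granule_id, endpoint_href) pairs for the query window."""


class Sentinel2Collector(DataCollector):
    """Hypothetical adapter showing only the normalization step."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def search(self, bbox, start, end):
        # A real collector would call the provider's API here; this stub
        # stands in for the provider-native response.
        raw = [{"granule": "S2A_20240101", "path": "S2A_20240101"}]
        return [(r["granule"], f"{self.endpoint}/{r['path']}") for r in raw]


collector = Sentinel2Collector("https://example-dias/api")
records = collector.search(bbox=(9.0, 45.0, 9.5, 45.5),
                           start="2024-01-01", end="2024-01-31")
```

Because every collector returns the same normalized shape, downstream modules (catalogue ingestion, ordering) stay independent of any single provider's interface.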
Design principles and innovations.
CLEOS Platform design adheres to technical design principles widely shared in the geospatial community.
1. Multi-Cloud & Data Locality. Space Earth Observation data are notably very large datasets (e.g. the Sentinel-2 satellites acquire about 10 TB of data daily). The “data gravity” associated with these huge data archives requires a shift in the processing paradigm: bringing the processing close to the data, to minimize network congestion and increase the throughput of the overall system. This is called the Data Locality principle. However, space and geospatial data are not located in a single infrastructure, since today there are several endpoints offering access to the same datasets (e.g. Sentinel-2 data can be accessed in AWS, Google Cloud and five Copernicus DIAS). Additionally, the selection of the infrastructure in which to access and/or process data is driven by multiple considerations: not only price/performance, but also constraints from Customers (including the option to process data in a local infrastructure for certain workloads).
This scenario had the following impact on CLEOS design:
• The Data Catalogue needs to register multiple endpoints where the same resource can be accessed. CLEOS has adopted the SpatioTemporal Asset Catalog (STAC) metadata structure, as it allows an extensible definition of spatial assets and resources and standardizes the indexing and discovery process;
• The Processing Platform must have the flexibility to pilot processing requests on different cloud platforms, including on-premises infrastructures, allowing also hybrid cloud workloads. CLEOS has adopted the Max-ICS platform, which exploits open source frameworks such as Mesos, Marathon, Puppet, and Terraform, to manage Infrastructure as a Service (IaaS) resources in multiple cloud infrastructures, abstracting the complexity of the platform control and service orchestration layers.
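The multi-endpoint registration described above can be illustrated with a minimal STAC Item whose asset is reachable from more than one infrastructure. The "alternate" structure follows the STAC Alternate Assets extension; the collection name and hrefs are illustrative, not actual CLEOS entries:

```python
# Minimal STAC Item sketch: one asset, two endpoints (primary + alternate).
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "S2B_MSIL2A_20240115T101309",
    "collection": "sentinel-2-l2a",
    "geometry": {"type": "Polygon", "coordinates": [
        [[9.0, 45.0], [9.5, 45.0], [9.5, 45.5], [9.0, 45.5], [9.0, 45.0]]]},
    "properties": {"datetime": "2024-01-15T10:13:09Z"},
    "assets": {
        "B04": {
            "href": "s3://example-aws-bucket/sentinel-2/B04.tif",  # primary
            "alternate": {
                "dias": {"href": "https://example-dias/sentinel-2/B04.tif"}
            },
        }
    },
    "links": [],
}


def pick_href(asset, prefer="dias"):
    """Resolve the endpoint closest to the target compute (Data Locality)."""
    alternate = asset.get("alternate", {})
    return alternate.get(prefer, {"href": asset["href"]})["href"]
```

A platform scheduling a job on a DIAS would call `pick_href(asset, prefer="dias")` and fall back to the primary href when no alternate exists — which is exactly how a multi-endpoint catalogue supports processing close to the data.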
2. Elasticity & Scalability. Geospatial data processing use-cases define a great variety of workload types, from large batch processes in the case of multi-temporal analysis over large areas, to synchronous data analytics requests on newly acquired data, or even stream processing. The simultaneous management of such heterogeneous workloads requires a strong optimization of resource usage and deployment. This scenario requires CLEOS to scale the available worker nodes up and down elastically, according to the active workload. The CLEOS infrastructure is based on microservices, fragmenting complex workflows into elementary steps that can be executed by independent nodes and easily orchestrated. CLEOS can therefore dynamically scale up those microservices to fulfil an increasing number of processing jobs, ultimately deploying new on-demand virtual hosts that exploit the elastic resource allocation made available by almost all cloud providers.
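The scale-up/scale-down behaviour described above can be reduced to a simple sizing rule: derive the desired worker count from the job backlog, bounded by a minimum and maximum pool size. This is an illustrative policy, not CLEOS's actual autoscaling logic, and the parameter values are assumptions:

```python
import math


def desired_workers(queued_jobs, jobs_per_worker=10,
                    min_workers=1, max_workers=100):
    """Size the worker pool from the queue backlog, within fixed bounds."""
    needed = math.ceil(queued_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))
```

An orchestrator would evaluate this rule periodically: an empty queue keeps the minimum pool warm, a burst of 5,000 jobs saturates the cap, and anything in between scales proportionally.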
3. Microservices and Data Processing Pipelines. In CLEOS, all processing services are made available as nodes and pipelines. Nodes (microservices) and pipelines (sets of nodes linked to each other to perform a workflow) are the two main components upon which to build a collaborative ecosystem, in which CLEOS developers can reuse available standalone blocks to create new services with a modular, LEGO-like logic. The definition of Data Processing Pipelines is central in modern platforms, as it allows the design and implementation of a workflow that is activated once data reach the first node of the pipeline and flow through it to produce a result that can be used as the input of another pipeline or be delivered to the user.
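The node/pipeline model can be sketched in a few lines: each node is an independent step, and a pipeline chains nodes so one node's output feeds the next. The node names below are illustrative stand-ins, not actual CLEOS building blocks:

```python
def make_pipeline(*nodes):
    """Chain nodes: each node's output becomes the next node's input."""
    def run(data):
        for node in nodes:
            data = node(data)
        return data
    return run


def calibrate(scene):    # hypothetical node 1
    return {**scene, "calibrated": True}


def cloud_mask(scene):   # hypothetical node 2
    return {**scene, "cloud_free": True}


# Nodes are reusable "LEGO bricks": the same nodes could be recombined
# into other pipelines without modification.
prep_pipeline = make_pipeline(calibrate, cloud_mask)
result = prep_pipeline({"id": "S2_tile"})
```

Because `prep_pipeline` is itself a callable with the same shape as a node, its result can feed another pipeline — mirroring how one pipeline's output can become another's input.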
4. Platform Federation & Interoperability. Today, a platform cannot behave as a standalone system; it needs to be interoperable and federated with other platforms on both sides of the market:
• Supply: the platform needs to be able to access several heterogeneous suppliers of data and services;
• Demand: the platform needs to offer its data and services to heterogeneous customers that, more and more often, will be other platforms and not humans.
To achieve this objective, the API layer plays a central role. In particular, CLEOS has also adopted the OpenEO standard to manage the end-to-end process of searching, configuring, buying, monitoring and accessing data and services. This choice was made to enable OpenEO clients to connect to CLEOS with minimal effort. Through this API definition, it is possible to unify access to the different service backends, abstracting the proprietary implementations made by each vendor with their own internal interfaces.
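In OpenEO, a processing request is expressed as a backend-independent process graph: nodes reference each other via `{"from_node": ...}` and exactly one node is flagged as the result. A minimal sketch in its JSON form (collection name and extents are illustrative, not a real CLEOS offering):

```python
# Hedged sketch of an OpenEO process graph: load a collection, then reduce
# the temporal dimension with a mean. The reducer is itself a child process
# graph, as the OpenEO API specifies.
process_graph = {
    "load": {
        "process_id": "load_collection",
        "arguments": {
            "id": "SENTINEL2_L2A",
            "spatial_extent": {"west": 9.0, "south": 45.0,
                               "east": 9.5, "north": 45.5},
            "temporal_extent": ["2024-01-01", "2024-01-31"],
        },
    },
    "reduce": {
        "process_id": "reduce_dimension",
        "arguments": {
            "data": {"from_node": "load"},   # wires this node to "load"
            "dimension": "t",
            "reducer": {"process_graph": {
                "mean1": {"process_id": "mean",
                          "arguments": {"data": {"from_parameter": "data"}},
                          "result": True}
            }},
        },
        "result": True,   # exactly one top-level node carries the result flag
    },
}
```

Any OpenEO-compliant backend can execute this same graph, which is what lets a client switch backends — or a platform federate with another — without rewriting the request.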
CLEOS Data Layer
The CLEOS Data Layer is a set of modules dedicated to the storage and cataloguing (Technical Catalogue) of available resources, whether data, metadata or capabilities available through the Processing Platform. CLEOS storage relies on available Object Storage services in different cloud infrastructures, since most data collections are available at different endpoints. The Technical Catalogue in the Data Layer is therefore of paramount importance, insofar as it gives the Marketplace and the Processing Platform a unique point of reference on which resources/services are available and how/where they can be accessed. These catalogues follow the STAC (SpatioTemporal Asset Catalog) specification. The aim of STAC is to define a common language to describe a range of geospatial information, so that it can more easily be indexed and discovered. CLEOS users/developers take full benefit of this adoption, which prevents them from writing new code each time a new dataset or service becomes available. The Data Layer also offers methods and interfaces (APIs in the OpenEO standard) to access available data in multiple ways and for multiple purposes (View, Download, Subset, …).
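The "no new code per dataset" benefit comes from the uniform STAC API query shape: the same `/search` request body works against any STAC-compliant catalogue, so only the collection identifier changes from one dataset to the next. A minimal sketch (the collection name is an assumption, not a confirmed CLEOS collection):

```python
# Illustrative STAC API /search request body for the Technical Catalogue:
# bbox + datetime + collections is the standard query shape, identical for
# every STAC-compliant dataset.
search_body = {
    "collections": ["sentinel-2-l2a"],
    "bbox": [9.0, 45.0, 9.5, 45.5],                       # west, south, east, north
    "datetime": "2024-01-01T00:00:00Z/2024-01-31T23:59:59Z",  # closed interval
    "limit": 10,
}
```

A client would POST this body to the catalogue's `/search` endpoint and receive a GeoJSON FeatureCollection of STAC Items, regardless of which dataset is being queried.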
CLEOS Processing Platform, Developer Portal & AI Factory
The CLEOS Processing Platform is the module responsible for the execution of all processing tasks, from the simple retrieval and delivery of a product up to the orchestration of complex and long batch processing jobs involving thousands of processing nodes. The Processing Platform operates and deploys the Processing Pipelines created in the Developer Portal. The Developer Portal is the environment where e-GEOS and external developers can build new Processing Pipelines by reusing available modules and building blocks, taking advantage of an Integrated Development Environment (IDE) and of the CLEOS API to streamline data access and processing operations.
The Processing Platform is able to:
• manage the provisioning of the necessary infrastructure resources in a dynamic way, with elastic scaling up and down across multiple cloud and on-premises infrastructures;
• manage the processing pipelines using a data-driven, message-based approach where requests are queued and progressively processed, enlarging or reducing the pool of available worker nodes according to demand;
• manage DevOps through a Continuous-Integration/Continuous-Delivery (CI/CD) pipeline;
• monitor resource usage node by node, pipeline by pipeline and infrastructure-wide;
• centralize all the microservice logs so that they can be accessed, filtered and analysed cost-effectively.
While the processing jobs flow as a stream thanks to the event-based backend architecture, the CLEOS Processing Platform also offers the capability to retrieve and control the overall status of a request, managing customer notifications and updates as well.
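The data-driven, message-based model with per-request status tracking can be sketched as follows. This is a deliberately minimal in-memory version: a production system would use a durable message broker and persistent status store, and all names here are illustrative:

```python
import queue

jobs = queue.Queue()   # stand-in for a durable message broker
status = {}            # per-request status record, queryable by customers


def submit(job_id, payload):
    """Queue a request and register its initial status."""
    status[job_id] = "queued"
    jobs.put((job_id, payload))


def worker():
    """Drain the queue, updating the status record at each transition."""
    while not jobs.empty():
        job_id, payload = jobs.get()
        status[job_id] = "running"
        # The actual processing pipeline would execute here; status
        # transitions are what enable customer notifications and updates.
        status[job_id] = "done"
        jobs.task_done()


submit("req-1", {"product": "flood-map"})
submit("req-2", {"product": "ndvi"})
worker()
```

Scaling then amounts to running more `worker` consumers against the same queue, which is why the pool of worker nodes can grow or shrink with demand without changing the submission side.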
Finally, the AI Factory is the platform section dedicated to the development and management of AI models and corpora, where AI developers and users work together to develop, test and scale new AI-based applications.
The AI Factory allows:
• to access a large set of pre-defined AI models or to import custom ones;
• to import training corpora or to build new ones using a simple and intuitive interface;
• to train, re-train models, benchmark performance metrics and manage model versioning;
• to directly include trained AI models into processing pipelines via their corresponding inference nodes.