The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualisations and narrative text. Uses within the Earth Observation community include data cleaning and transformation, numerical simulation, statistical modelling, data visualisation, machine learning, and much more. A Jupyter Notebook allows you to combine rich documentation, live and adaptable code, and data visualisations. It can also be used as a tool to share your data analysis with others, collaborate, teach, and promote reproducible science.
We are at a particularly exciting time for this technology, with many archives deploying Jupyter Notebook services. These services allow unprecedented access to petabytes of data, enabling users from any part of the globe to engage with EO data in a very powerful way. Jupyter Notebooks produced during a research project can very often be the best starting point for new users to engage with data deposited with an archive; however, this raises unique challenges. While Jupyter Notebooks can be a valuable resource, there are issues surrounding input data, processing, technical dependencies, and quality. Poor-quality notebooks with hidden dependencies may cause new users a lot of problems.
To deal with these issues, CEOS (Committee on Earth Observation Satellites) conducted a number of surveys and ran webinars on Jupyter Notebooks to gain a better understanding of the EO community's needs. We engaged with over 500 people from over 50 countries, and two core needs for the wider community became evident. The first was the need for a Jupyter Notebooks best practice document to support the creation and preservation of high-quality, reusable notebooks. The second was the need for basic training to get the next generation of researchers ready to engage with emerging services.
We will discuss in greater detail the following key areas to be addressed by a CEOS Jupyter Notebooks Best Practice.
• Notebook description and function
• Structure, workflow, and documentation
• Technical dependencies and Virtual Environments (illustrated by the sketch after this list)
• Citation of input data and data access
• Association with archived data
• Integration with data cubes
• Version control, preservation and archival
• Open-source software licensing
• Publishing software and getting a DOI
• Interoperability and reuse on alternate platforms
• Creating a binder deployment
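As a minimal sketch of the dependency-documentation point above, the first cell of a notebook can record the interpreter and key library versions so that readers can recreate a matching virtual environment (for example when building a Binder deployment). The packages listed here are purely illustrative, not a prescribed stack.

```python
# Minimal reproducibility header for the first cell of a notebook (illustrative):
# record the Python and key library versions so readers can rebuild a matching
# virtual environment, e.g. for a Binder deployment.
import sys
from importlib.metadata import version, PackageNotFoundError

print("Python", sys.version.split()[0])
for pkg in ("numpy", "xarray", "matplotlib"):   # hypothetical core dependencies
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```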
From recent CEOS WGCapD (Working Group on Capacity Building and Data Democracy) and WGISS (Working Group on Information Systems and Services) meetings, we have seen how many different CEOS agencies are employing Jupyter Notebooks in several different ways. To introduce the broader community to these capabilities, we developed a set of demonstrators that trace a technical arc of what is currently possible, beginning with simple baseline notebooks with integrated training materials and progressing to notebooks that drive heavy-duty processing on the Earth Analytics Interoperability Lab.
JupyterHub and Notebooks on Data Analysis Platforms: We looked at two examples from the UK's JASMIN Jupyter Notebook service, which can access over 20 petabytes of data held in the CEDA archive. First, we explored the global Sentinel-5P archive and demonstrated how a very basic notebook can be used to interrogate the data and answer valuable questions, e.g. how did pollution levels change in large cities during the Covid-19 pandemic? We also looked at a smaller-scale specialist example, regional NCEO biomass maps. This helped to demonstrate how, in addition to helping users obtain domain-specific information from data with Jupyter Notebooks, we can also help them learn technical knowledge and skills related to libraries, modules, and shapefiles.
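The snippet below is a minimal sketch of this kind of basic notebook: it extracts a city-scale NO2 time series and compares a pre-pandemic year with a pandemic year. The file name, variable name, and coordinates are placeholders and do not reflect the actual JASMIN/CEDA archive layout.

```python
# Illustrative only: file name, variable name, and coordinates are placeholders,
# not the actual JASMIN/CEDA archive layout.
import xarray as xr

ds = xr.open_dataset("s5p_no2_monthly_gridded.nc")    # hypothetical gridded NO2 product
no2 = ds["tropospheric_no2_column"]                   # hypothetical variable name

# Extract a time series near a city (approximate coordinates for London)
# and compare 2019 with 2020 to gauge the change during Covid-19 lockdowns.
city = no2.sel(lat=51.5, lon=-0.13, method="nearest")
print("2019 mean:", city.sel(time=slice("2019-01", "2019-12")).mean().item())
print("2020 mean:", city.sel(time=slice("2020-01", "2020-12")).mean().item())
```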
Open Data Cube and Google Earth Engine – A Jupyter Notebook Sandbox Demonstration: The Open Data Cube (ODC) Google Sandbox is a free and open programming interface that connects users to Google Earth Engine datasets. This open-source tool allows users to run Python application algorithms using Google’s Colab Notebook environment. This demonstration showed two examples of Landsat applications focused on scene-based cloud statistics and historic water extent. Basic operation of the tool will support unlimited users for small-scale analyses and training but can also be scaled in size and scope with Google Cloud resources to support enhanced user needs.
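A minimal sketch of the scene-based cloud statistics idea, using the Earth Engine Python API as available in Colab, is shown below; the collection ID, point of interest, and date range are illustrative rather than the exact demonstration inputs.

```python
# Sketch only: collection, location, and dates are illustrative.
import ee

ee.Authenticate()   # interactive sign-in when running in Colab
ee.Initialize()

point = ee.Geometry.Point(149.12, -35.28)             # hypothetical area of interest
scenes = (ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")
          .filterBounds(point)
          .filterDate("2020-01-01", "2020-12-31"))

# Per-scene cloud cover is available as the CLOUD_COVER metadata property.
cloud_cover = scenes.aggregate_array("CLOUD_COVER").getInfo()
print("scenes:", len(cloud_cover),
      "mean cloud cover (%):", sum(cloud_cover) / max(len(cloud_cover), 1))
```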
ESA PDGS (European Space Agency -- Payload Data Ground Segment) Data Cube and Time Series Data: The ESA PDGS Data Cube is a pixel-based access service that enables human and machine-to-machine interfaces for Heritage Missions (HM), Third-Party Missions (TPM) and Earth Explorer (EE) datasets handled at the European Space Agency. The pixel-based access service provides the users with advanced retrieval capabilities, such as time series extraction, data subsetting, mosaicking, band combinations, and index generation (e.g. normalized difference vegetation index (NDVI), anomalies, and more) directly from the EO-SIP packages with no need for data duplication or data preparation.
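To illustrate the index-generation capability, the sketch below computes NDVI from a subset of the kind such a pixel-based service might return; the file and band names are placeholders, not the PDGS Data Cube API.

```python
# Sketch of on-the-fly index generation from a retrieved subset;
# the file and band names are placeholders, not the PDGS Data Cube API.
import xarray as xr

cube = xr.open_dataset("aoi_subset.nc")        # hypothetical subset over an area of interest
red, nir = cube["red"], cube["nir"]

ndvi = (nir - red) / (nir + red)               # normalized difference vegetation index
ndvi.mean(dim=("x", "y")).plot()               # area-averaged NDVI time series
```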
The ESA PDGS Data Cube service provides both the web-based Explorer user interface (https://datacube.pdgs.eo.esa.int) and a Jupyter Notebook service (https://jupyter.pdgs.eo.esa.int) to allow users to import, write, and execute code that runs close to the data. This demonstration showcased how to retrieve a soil moisture time series using the Jupyter environment in order to generate thematic maps (e.g. monthly anomaly maps) over an area of interest. The benefit of the pixel-based service compared with traditional access services, in terms of resource usage, was also highlighted.
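A minimal sketch of the monthly-anomaly step in that demonstration is given below, assuming a soil-moisture time series has already been retrieved through the Jupyter service; the variable and dimension names are placeholders.

```python
# Illustrative anomaly computation; the dataset and names are placeholders.
import xarray as xr

sm = xr.open_dataset("soil_moisture_timeseries.nc")["soil_moisture"]

climatology = sm.groupby("time.month").mean("time")   # long-term monthly means
anomalies = sm.groupby("time.month") - climatology    # departures from the climatology

anomalies.sel(time="2020-07").mean("time").plot()     # thematic map for a single month
```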
Earth Analytics and Interoperability Lab – Big Data Processing: The CEOS Earth Analytics Interoperability Lab (EAIL) is a platform for CEOS projects to test interoperability in a live Earth Observation (EO) ecosystem. EAIL is hosted on Amazon Web Services and includes facilities for Jupyter Notebooks, scalable compute infrastructure for integrated analysis, and data pipelines that can connect to new and existing CEOS data discovery and access services. This demonstration showed how we use Jupyter Notebooks with the Python Dask library to efficiently perform large-scale analyses (tens of gigabytes) with interactive plotting and scalable compute resources in EAIL.
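The sketch below conveys the pattern used in that demonstration: open a large dataset lazily in chunks and let a Dask cluster execute the computation. The local cluster and dataset here are stand-ins for the EAIL setup, not its actual configuration.

```python
# Stand-in for the EAIL setup: a local Dask cluster and a placeholder dataset.
import xarray as xr
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4)     # on EAIL this would be a scalable remote cluster
client = Client(cluster)

# Open a large stack lazily in chunks; nothing is computed yet.
ds = xr.open_dataset("large_landsat_stack.nc", chunks={"time": 12})
annual_mean = ds["red"].groupby("time.year").mean("time")

result = annual_mean.compute()          # Dask distributes the work across the workers
print(result)
```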
Going forward, there is a great deal of interest in collaborating on and developing these activities further. We will discuss how we will be creating baseline notebooks aimed at developing key EO data science skills, together with exemplars for the best practice. We anticipate holding a CEOS Jupyter Notebooks day later in 2022, the aim of which will be to stimulate other agencies and organisations to produce similar resources that will benefit students and early career researchers, enabling them to engage with the Jupyter Notebook services that are emerging globally.