Sentinel-2 data cube

Trouble using the data cube or satisfied user?
Let’s get in touch! Share your thoughts: datacube @ geotiles.nl.

2023-03-06: The use of simplecache may raise exceptions of empty chunks. This is known, and under investigation. Removing simplecache:: will fix the issue. To open the cube, use the following instead:

cube = xr.open_zarr('https://geotiles.nl/cube.zarr',
                    group='/S2/R051', consolidated=True)

This publicly available data cube contains Sentinel-2 satellite imagery that overlap with the Netherlands, corrected for atmospheric effects, from 2015 onward.

The data cube allows for easy access to the full time series at full (10 m) resolution. The data can be used both for visualisation as well as more advanced analytics in Python and other programming languages.

For a quick overview, see the poster presented at the ESA Living Planet Symposium ’22.

The cube is stores as Zarr array, that allows for direct access to individual pixels or regions of the 3+ TiB data archive. Access to the cube is free and unmetered and is provided on the basis of ‘best effort’. Currently a single orbit (relative orbit 51) is available via de the data cube.

Getting started (in Python)

Connect

You will need the xarray and zarr packages to access the data cube. These may be installed from Conda (command line, Anaconda) or pip, depending on your installation. An example for Google Colab (notebook).

As of zarr 2.8.3 opening the data cube is as easy as:

import xarray as xr

cube = xr.open_zarr('simplecache::https://geotiles.nl/cube.zarr',
                    group='/S2/R051', consolidated=True)

Users of earlier versions of zarr need a bit of extra code to ensure the variable names are properly transmitted (details):

import xarray as xr, zarr

cube = xr.open_zarr(
    # Request the datacube from https://geotiles.nl/cube.zarr.
    # The second option, 'normalize_keys', is True by default and should mannually be set to False.
    # 'simplecache::' is added to keep a temporary, local, copy of the data for faster repeated access.
    # See https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations for details.
    store=zarr.storage.FSStore('simplecache::https://geotiles.nl/cube.zarr', normalize_keys=False),
    # We would like the data set of Sentinel-2, relative orbit 51.
    # Currently, this is the only data set available.
    group='/S2/R051',
    # A technicallity, instead of searching for variables, a pre-generated file is used instead.
    consolidated=True
)

Explore

The cube is split in variables, according to the bands of the MSI instrument onboard the Sentinel-2 satellites. In addition, information on the cloud probability (CLD), snow probability (SNW), classification (SCL) and water vapour path (WVP), extracted during atmospheric correction, are available.

Each variable is a three-dimensional array with coordinates: time, eating (x), northing (y) in RD coordinates (EPSG:28992). Even individual variable or band would require 1½ TB of RAM to be kept in memory. The Zarr data structure enables us to seamlessly fetch small portions of each variable. Furthermore, the system is based on ‘delayed execution’, an overview such as shown below is merely a description of the data, no actual data was fetched (yet).

Introductory example: NDVI in the Hoekschewaard

A standard operation is the calculation of the normalized difference vegetation index (NDVI). The code necessary to create this example can be found in this notebook (Google Colab).

To calculate the NDVI on the data cube, we simply apply the formula to the cube.

NDVI = (cube.B08-cube.B04)/(cube.B08+cube.B04)

The result is an immense array of 354 × 34 375 × 30 000 = 365 062 500 000 elements (at the time of writing). Fortunately, we rarely need all simultaneous access to all historic NDVI values over the Netherlands!

Instead, we select either a region or a time of interest via .sel(). To select a time-series of a single point, set x and y.

We will be looking at the point marked with the red dot, in the Hoekschewaard.

point = NDVI.sel(x=94486, y=415307, method='nearest')

The result is a vector of NDVI values at all time steps. Note that still no data resides in memory. Only once we access the values, the data is fetched from the server and the NDVI calculated. To load and calculate the data, either access point.values, that will return a numpy vector, or point.load() to retrieve an xarray array, including coordinate axes.

To explore the data, a handy .plot() method is provided. In our example the NDVI (y-axis) is plotted against time (x-axis).

point.plot()

Although noisy, the plot shows the dense vegetation cover in summer (high NDVI), cut off by harvesting and barren land in winter (low NDVI).

The noise is due to the abundant cloud cover in the Netherlands. The cycle of barren land, crop growth and harvest is visible on the various fields, but is often obscured by clouds.

Shown here is a time series of true colour images from the data cube.

The estimated cloud coverage, included with the data, comes in handy here. The variable CLD describes on a scale from 0 to 1 how likely a pixel is to be a cloud.

This graph shows the average cloud probability per acquisition date for the region shown in the previous clip.

Filtering every image with a mean cloud probability larger than 0.1, the time series is already a lot cleaner. Out of 354 images, 118 are left.

However, the time series is still not as ‘clean’ as we had hoped. With some interpretation a seasonal signal is visible. Note that the labels now indicate summer and winter.

Example: aggregation of the NDVI over Delft

To query a larger region, the data cube is sliced. These coordinates (Rijksdriehoekscoördinaten, EPSG:28992) delineate an area around Delft. The code of this example is available in this notebook (Google Colab).

Here we continue with the NDVI, as defined in the previous example. However, instead of a point, a slice of the same data cube is retrieved. Note that the y slice is inverted!

Delft = NDVI.sel({'x': slice(79550, 92180), 'y': slice(450495, 442146)})

To select a single time step, use the .sel() method again. Here followed by a plot command:

Delft.sel(time='2015-08-01', method='nearest') \
     .plot(cmap='RdYlGn', vmin=-1, vmax=1)

Here a single time step, closest to 2015-08-01, is retrieved, the NDVI calculated and the result shown as map. The pasture fields surrounding Delft in their late summer intensity stand out from the inner city.

As regions grow, calculations may take some time to complete. A convenient progress bar is provided by Dask. This code will add a progress bar to all computations on the data cube.

from dask.diagnostics import ProgressBar
ProgressBar().register()

An example of a more complex aggregation query is the standard deviation of the NDVI over time as an indication of the variation in land cover. This will provide us with an indication of whether the NDVI is stable or variable over time.

Delft.std('time').plot.imshow(cmap='viridis')

Buildings and greenhouses are never green. Trees show a slightly lower standard deviation in NDVI than pasture fields. Local knowledge is required for full interpretation of the dynamics contained therein.

Note that all acquisitions, including clouded imagery, are included in the calculation.

The required calculation of the NDVI for all pixels 373 625 760 pixels (at the time of writing) is done under the hood, and may take up to ten minutes to complete. Other aggregation functions include: max, mean, median, min, prod, sum, std, and var.

Example: the Hoeksche Waard over time

To illustrate the number of acquisitions available in a single orbit, they may be combined into a single image with this notebook (Google Colab). The following image is made up of all orbit 51 acquisitions of Sentinel-2 over the Hoeksche Waard, from July 2015 till spring 2022.

Unfortunately, a large portion of the imagery is cloudy, blocking our view of the island. Furthermore, time series analysis would require a cloud free time series.

The previous example had limited success in filtering clouds based on the cloud detection provided with the atmospheric correction. However, since we have the full time series available, we may compare the reflectance to previous acquisitions, while the atmospheric correction has to work without such information.

Histogram of per pixel reflectance over a five year period.

A histogram of the reflectance values across all dimensions of the data cube reveals that there are at least two processes at work. The reflection from the land cover (low reflectance values) and cloud cover (high reflectance values). However, there is no clear separation between the two and there is a relation between the position and width of the first peak and the land cover (e.g. sand versus forest).

To evaluate the processes at pixel level, the cumulative distribution function may be expressed as reflectance vs. quantiles. As long as clouds have a higher reflectance than the underlying land cover, and thus end up in higher quantiles, this figure may be used to identify the transition between the two. After the ~0.4 quantile there is a significant increase in reflection, suggesting the ‘darkest’ 40% of images might be cloud free.

Cumulative distribution function of reflectance, as function of (per pixel) quantile.

Working from this, the image may be reproduced, based on only the darkest 25% of acquisitions. The result is a colourful composition of six years of Sentinel-2 data, with a small number of clouded images. Given the size of the area (~ 12½×30 km), the notebook (Google Colab) can be slow to run.

Example: exporting/saving parts of the data cube

Selections of the cube may be exported to more conventional formats, such as GeoTIFF. An example of such procedure is provided in this notebook (Google Colab). In the example, all bands of a single time step (2021-03-31) are exported to a single GeoTIFF with many layers.

It is possible to export to netCDF as well. See this notebook as an example. Unfortunately, Google Colab has strict limits on the size of the download and large downloads may stall.

If you require a complete copy, it is best to get in touch first. We will discuss the possibilities and ensure you will receive a consistent copy of the cube.

Additional resources

Some additional resources are included within the data cube. These resources are useful as training data for classification experiments, or as aggregation regions for statistics. The data is aligned with the Sentinel-2 acquisitions, and directly compatible, for example in a .where() condition to select areas with a specific land cover or within a municipality. It is highly recommended to limit your region of interest first, via .sel() on both data products, otherwise the mask will cover the full Netherlands. An example on how to connect is available as notebook (Google Colab).

GroupData setYear
/NLBestuurlijke grenzen
(municipalities, provinces, national border)
2021
/NL/BBGBestand Bodemgebruik
(land cover)
2008, 2010, 2012, 2015, 2017
/NL/BRPBasisregistratie Gewaspercelen
(agriculture, crops)
2009 – 2021

Notes on full scale analyses

For very large regions, it might be beneficial to coarsen the data first. For example by taking the mean of every section of 100×100 pixels while reading to reduce the memory footprint (see .coarsen()). However, unlike Google Earth Engine, no ‘pyramid’ is available. Therefore, all data has to be read from the server and averaged on the client before further processing. Furthermore, data is stored in heaps of five consecutive time steps. Requesting a single time step, or sequential time steps without local caching, will trigger a download of the four closest acquisitions. As a consequence a full RGB image requires 16 GB of data to be fetched. This may take some time to complete, and may be accelerated by maintaining a local cache.

Upcoming changes

The current data cube contains all images, including imagery largely or fully dominated by clouds. Future updates may remove some of this imagery, to create space for the incorporation of a second orbit into the data cube.

For an up to date changelog, see the GeoTiles overview.

Acknowledgements

Thanks to the support of the NCG Talent Program, the data cube is publicly available.

The data cube contains modified Copernicus Sentinel data, retrieved from Copernicus SciHub and CNES PEPS. Software used to generate the data cube: GDAL/OGR, python (GeoPandas, rasterio, xarray), Zarr, and atmospheric corrections are applied using sen2cor.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.