Sentinel-2 data cube

With the support of the NCG Talent Program, a publicly available data cube of Sentinel-2 satellite imagery is now available. The data cube allows for easy access to the full time series at full (10 m) resolution.

This cube contains acquisitions that overlap with the Netherlands, corrected for atmospheric effects, from 2015 onward. Data can be used both for visualisation as well as more advanced analytics in Python and other programming languages.

The cube is stores as Zarr array, that allows for direct access to individual pixels or regions of the 3+ TiB data archive. Access to the cube is free and unmetered and is provided on the basis of ‘best effort’. Currently a single orbit (relative orbit 51) is available via de the data cube.

Getting started (in Python)

Connect

You will need the xarray and zarr packages to access the data cube. These may be installed from Conda (command line, Anaconda) or pip, depending on your installation. An example for Google Colab (notebook).

As of zarr 2.8.3 opening the data cube is as easy as:

import xarray as xr

cube = xr.open_zarr('simplecache::https://geotiles.nl/cube.zarr',
                    group='/S2/R051', consolidated=True)

Users of earlier versions of zarr need a bit of extra code to ensure the variable names are properly transmitted (details):

import xarray as xr, zarr

cube = xr.open_zarr(
    # Request the datacube from https://geotiles.nl/cube.zarr.
    # The second option, 'normalize_keys', is True by default and should mannually be set to False.
    # 'simplecache::' is added to keep a temporary, local, copy of the data for faster repeated access.
    # See https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations for details.
    store=zarr.storage.FSStore('simplecache::https://geotiles.nl/cube.zarr', normalize_keys=False),
    # We would like the data set of Sentinel-2, relative orbit 51.
    # Currently, this is the only data set available.
    group='/S2/R051',
    # A technicallity, instead of searching for variables, a pre-generated file is used instead.
    consolidated=True
)

Explore

The cube is split in variables, according to the bands of the MSI instrument onboard the Sentinel-2 satellites. In addition, information on the cloud probability (CLD), snow probability (SNW), classification (SCL) and water vapour path (WVP), extracted during atmospheric correction, are available.

Each variable is a three-dimensional array with coordinates: time, eating (x), northing (y) in RD coordinates (EPSG:28992). Even individual variable or band would require 1½ TB of RAM to be kept in memory. The Zarr data structure enables us to seamlessly fetch small portions of each variable. Furthermore, the system is based on ‘delayed execution’, an overview such as shown below is merely a description of the data, no actual data was fetched (yet).

Introductory example: NDVI in the Hoekschewaard

A standard operation is the calculation of the normalized difference vegetation index (NDVI). The code necessary to create this example can be found in this notebook (Google Colab).

To calculate the NDVI on the data cube, we simply apply the formula to the cube.

NDVI = (cube.B08-cube.B04)/(cube.B08+cube.B04)

The result is an immense array of 354 × 34 375 × 30 000 = 365 062 500 000 elements (at the time of writing). Fortunately, we rarely need all simultaneous access to all historic NDVI values over the Netherlands!

Instead, we select either a region or a time of interest via .sel(). To select a time-series of a single point, set x and y.

We will be looking at the point marked with the red dot, in the Hoekschewaard.

point = NDVI.sel(x=94486, y=415307, method='nearest')

The result is a vector of NDVI values at all time steps. Note that still no data resides in memory. Only once we access the values, the data is fetched from the server and the NDVI calculated. To load and calculate the data, either access point.values, that will return a numpy vector, or point.load() to retrieve an xarray array, including coordinate axes.

To explore the data, a handy .plot() method is provided. In our example the NDVI (y-axis) is plotted against time (x-axis).

point.plot()

Although noisy, the plot shows the dense vegetation cover in summer (high NDVI), cut off by harvesting and barren land in winter (low NDVI).

The noise is due to the abundant cloud cover in the Netherlands. The cycle of barren land, crop growth and harvest is visible on the various fields, but is often obscured by clouds.

Shown here is a time series of true colour images from the data cube.

The additional cloud cover comes in handy here. The variable CLD describes on a scale from 0 to 1 how likely a pixel is to be a cloud.

This graph shows the average cloud probability per acquisition date for the region shown in the previous clip.

Filtering every image with a mean cloud probability larger than 0.1, the time series is already a lot cleaner. Out of 354 images, 118 are left.

However, the time series is still not as ‘clean’ as we had hoped. With some interpretation a seasonal signal is visible. Note that the labels now indicate summer and winter.

Example: aggregation of the NDVI over Delft

To query a larger region, the data cube is sliced. These coordinates (Rijksdriehoekscoördinaten, EPSG:28992) delineate an area around Delft. The code of this example is available in this notebook (Google Colab).

Here we continue with the NDVI, as defined in the previous example. However, instead of a point, a slice of the same data cube is retrieved. Note that the y slice is inverted!

Delft = NDVI.sel({'x': slice(79550, 92180), 'y': slice(450495, 442146)})

To select a single time step, use the .sel() method again. Here followed by a plot command:

Delft.sel(time='2015-08-01', method='nearest') \
     .plot(cmap='RdYlGn', vmin=-1, vmax=1)

Here a single time step, closest to 2015-08-01, is retrieved, the NDVI calculated and the result shown as map. The pasture fields surrounding Delft in their late summer intensity stand out from the inner city.

As regions grow, calculations may take some time to complete. A convenient progress bar is provided by Dask. This code will add a progress bar to all computations on the data cube.

from dask.diagnostics import ProgressBar
ProgressBar().register()

An example of a more complex aggregation query is the standard deviation of the NDVI over time as an indication of the variation in land cover. This will provide us with an indication of whether the NDVI is stable or variable over time.

Delft.std('time').plot.imshow(cmap='viridis')

Buildings and greenhouses are never green. Trees show a slightly lower standard deviation in NDVI than pasture fields. Local knowledge is required for full interpretation of the dynamics contained therein.

Note that all acquisitions, including clouded imagery, are included in the calculation.

The required calculation of the NDVI for all pixels 373 625 760 pixels (at the time of writing) is done under the hood, and may take up to ten minutes to complete. Other aggregation functions include: max, mean, median, min, prod, sum, std, and var.

Notes on full scale analyses

For very large regions, it might be beneficial to coarsen the data first. For example by taking the mean of every section of 100×100 pixels while reading to reduce the memory footprint (see .coarsen()). However, unlike Google Earth Engine, no ‘pyramid’ is available. Therefore, all data has to be read from the server and averaged on the client before further processing. Furthermore, data is stored in heaps of five consecutive time steps. Requesting a single time step, or sequential time steps without local caching, will trigger a download of the four closest acquisitions. As a consequence a full RGB image requires 16 GB of data to be fetched. This may take some time to complete, and may be accelerated by maintaining a local cache.

Exporting/saving parts of the data cube

Selections of the cube may be exported to more conventional formats, such as GeoTIFF. An example of such procedure is provided in this notebook (Google Colab). In the example, all bands of a single time step (2021-03-31) are exported to a single GeoTIFF with many layers.

It is possible to export to netCDF as well. See this notebook as an example. Unfortunately, Google Colab has strict limits on the size of the download and large downloads may stall.

If you require a complete copy, it is best to get in touch first. We will discuss the possibilities and ensure you will receive a consistent copy of the cube.

Upcoming changes

The current data cube contains all images, including imagery largely or fully dominated by clouds. Future updates may remove some of this imagery, to create space for the incorporation of a second orbit into the data cube.

For an up to date changelog, see the GeoTiles overview.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.