3. ESDC Access¶
As introduced in the last section, the ESDC physically consists of a set of NetCDF files on disk, which can be accessed in a number of different ways which are described in this section.
3.1. Download ESDC Data¶
The simplest approach to access the ESDC data is downloading it to you computer using the ESDC FTP server.
Since the ESDC is basically a directory of NetCDF files, you can use a variety of software packages and programming
languages to access the data. In each cube directory, you find a text file
cube.config which describes the overall
data cube layout.
Within the ESDC Project, dedicated data access packages have been developed for the Python and Julia
programming languages. These packages “understand” the ESDC’s
cube.config files and represent the cube data
by a convenient data structures. The section Using Python describes how to access the data from Python.
3.2. OPeNDAP and WCS Services¶
The ESDC’ data variables can also be accessed via a dedicated ESDC THREDDS server.
The server supports the OPeNDAP and OGC-compliant Web Coverage Service (WCS) data access protocols. You can use it for accessing subsets of the ESDC’s data variables and also for visual exploration of the data, and finally for downloading the data as a NetCDF file or of plain text.
Depending on the variable subsets, and the region and time period of interest, the transferred data volume might be much lower than a complete download of the ESDC via FTP. However, the disadvantage of using OPeNDAP and WCS is that the actual structure of the ESDC gets lost, so that it can’t be accessed anymore using the aforementioned ESDC Python/Julia data access packages.
A dedicated ESDC E-Laboratory has been developed to access the ESDC data via distributed Jupyter Notebooks for Julia and Python. This is the most resource efficient and convenient way of exploring the ESDC.
These notebooks have direct access to the ESDC data so there is no need to download it. In addition they provide the ESDC Python and Julia APIs comprising the Data Access API and the Data Analytics Toolkit.
The E-Laboratory provides some example notebooks in the shared ESDC community repository.
The E-Laboratory is based on the JupyterHub platform and currently comprises three 16-core computers running in a Cloud environment.
3.4. Using Python¶
Note: if you use the E-Laboratory you don’t need to install any additional packages for accessing the data. This section is only relevant if you’ve downloaded a ESDC instance to your local computer.
While in principle the NetCDF files comprising the ESDC can be used with any tool of choice, we developed specifically tailored Data Access APIs for Python 3.X and Julia. Furthermore, a set of high-level routines for data analysis, the Data Analytics Toolkit, greatly facilitates standard operations on the large amount of data in the ESDC. While in the E-laboratory, the Data Access API and the DAT are already pre-installed, the user has to download and install the cube library when working on a local computer.
To get started on your local computer with Python, clone the cablab-core repository from https://github.com/CAB-LAB:
git clone https://github.com/CAB-LAB/cablab-core
The following command will create a new Python 3.5 environment named
esdc with all required dependencies, namely
- dask >= 0.14
- gridtools >= 0.1 (from Conda channel
- h5netcdf >= 0.3
- h5py >= 2.7
- netcdf4 >= 1.2
- scipy >= 0.16
- scikit_image >= 0.11
- matplotlib >= 2.0
- xarray >= 0.9
$ conda env create environment.yml
To active new Python environment named
esdc you must source on Linux/Darwin
$ source activate.sh esdc
> activate esdc
Now change into new folder
cablab-core and install the
cablab Python package using the
$ cd cablab-core $ python setup.py develop
You can now easily change source code in
cablab-core without reinstalling it.
When you do not plan to add or modify any code (e.g. add a new source data provider), you can also permanently
install the sources using
$ python setup.py install
However, if you now change any code, make sure to the install command again.
After download of a ESDC including the corresponding
cube.config file and successful installation of the ESDC,
you are ready to explore the data in the ESDC using the Using Python.
The following example code demonstrates how to access a locally stored ESDC, query its content, and get data chunks of different sizes for further analysis.
Open a cube
from cablab import Cube from datetime import datetime import numpy as np cube = Cube.open("/path/to/datacube")
Note, in order to work properly the
/path/to/datacube/ passed to
must be the path to an existing ESDC cube directory which contains a valid configuration file named
It contains essential metadata about the ESDC to be opened.
['aerosol_optical_thickness_1610', 'aerosol_optical_thickness_550', 'aerosol_optical_thickness_555', 'aerosol_optical_thickness_659', 'aerosol_optical_thickness_865', 'air_temperature_2m', 'bare_soil_evaporation', 'black_sky_albedo', 'burnt_area', 'country_mask', 'c_emissions', ...]
After successful opening the ESDC, chunks of data or the entire data set can be accessed via the
get() functions. The first returns a xarray.Dataset object in which all
cube variables are represented as xarray.DataArray objects. More about these objects can also be
found in DAT for Python section. The second function can be used to read subsets of the data.
In contrast it returns a list of Numpy ndarray arrays, one for each requested variable.
The corresponding API for Julia is very similar and illustrated in DAT for Julia.
Accessing the cube data
cube.data.dataset() has an optional argument which is a list of variable names to include in the returned
xarray.DataArray object. If omitted, all variables will be included. Note it can take up to a few seconds to open
generate the dataset object with all variables.
ds = cube.data.dataset() ds
<xarray.Dataset> Dimensions: (bnds: 2, lat: 720, lon: 1440, time: 506) Coordinates: * time (time) datetime64[ns] 2001-01-05 ... * lon (lon) float32 -179.875 -179.625 ... lon_bnds (lon, bnds) float32 -180.0 -179.75 ... lat_bnds (lat, bnds) float32 89.75 90.0 89.5 ... * lat (lat) float32 89.875 89.625 89.375 ... time_bnds (time, bnds) datetime64[ns] 2001-01-01 ... Dimensions without coordinates: bnds Data variables: aerosol_optical_thickness_1610 (time, lat, lon) float64 nan nan nan ... aerosol_optical_thickness_550 (time, lat, lon) float64 nan nan nan ... aerosol_optical_thickness_555 (time, lat, lon) float64 nan nan nan ... aerosol_optical_thickness_659 (time, lat, lon) float64 nan nan nan ... aerosol_optical_thickness_865 (time, lat, lon) float64 nan nan nan ... air_temperature_2m (time, lat, lon) float64 243.4 243.4 ... bare_soil_evaporation (time, lat, lon) float64 nan nan nan ... black_sky_albedo (time, lat, lon) float64 nan nan nan ... burnt_area (time, lat, lon) float64 0.0 0.0 0.0 ... country_mask (time, lat, lon) float64 nan nan nan ... ...
lst = ds['land_surface_temperature'] lst
<xarray.DataArray 'land_surface_temperature' (time: 506, lat: 720, lon: 1440)> dask.array<concatenate, shape=(506, 720, 1440), dtype=float64, chunksize=(46, 720, 1440)> Coordinates: * time (time) datetime64[ns] 2001-01-05 2001-01-13 2001-01-21 ... * lon (lon) float32 -179.875 -179.625 -179.375 -179.125 -178.875 ... * lat (lat) float32 89.875 89.625 89.375 89.125 88.875 88.625 88.375 ... Attributes: url: http://data.globtemperature.info/ long_name: land surface temperature source_name: LST standard_name: surface_temperature comment: Advanced Along Track Scanning Radiometer pixel land surfa... units: K
lst can now be used like a Numpy ndarray. Howver, using
xarray there are a
number of more convenient data access methods that take care of the actual coordinates provided for every
dimenstion. For example, the
sel() method can be used to extract slices and subsets from a data array.
Here a point is extract from
lst, and the result is a 1-element data array:
lst_point = lst.sel(time='2006-06-15', lat=53, lon=11, method='nearest') lst_point
<xarray.DataArray 'land_surface_temperature' ()> dask.array<getitem, shape=(), dtype=float64, chunksize=()> Coordinates: time datetime64[ns] 2006-06-14 lon float32 11.125 lat float32 53.125 Attributes: url: http://data.globtemperature.info/ long_name: land surface temperature source_name: LST standard_name: surface_temperature comment: Advanced Along Track Scanning Radiometer pixel land surfa... units: K
Data arrays also have a handy
plot() method. Try:
lst.sel(lat=53, lon=11, method='nearest').plot() lst.sel(time='2006-06-15', method='nearest').plot() lst.sel(lon=11, method='nearest').plot() lst.sel(lat=53, method='nearest').plot()
Closing the cube
If you no longer require access to the cube, it should be closed to release file handles and reserved memory.
Some more demonstrations are included in the ESDC community notebooks.