Task 4 – Big data management and analytics of climate simulations

Leader: S. Denvil (IPSL)

Contributors: IPSL, LSCE, IDRIS, CNRS-GAME, CERFACS, MDLS

Objectives:

Many challenges remain for climate scientists, and new ones will emerge as more and more data pour in from climate simulations run on ever more powerful high-performance computing (HPC) platforms and from increasingly high-resolution satellites and instruments. From these current and future challenges, we envision a scientific data universe in which data creation, collection, documentation, analysis and dissemination are expanded. To prepare for such an expansion, with continual delivery of a comprehensive end-to-end solution, we will tackle specific challenges in this task. Bringing together large volumes of diverse data to generate new insights raises three challenges:

  • Variety: managing complex data of multiple types and schemas (models and observations),
  • Velocity: ingesting and distributing live data streams and large data volumes quickly and efficiently,
  • Volume: analyzing large data volumes (from terabytes to exabytes) in place for big data analytics.

Recognizing that large-scale scientific endeavors are becoming ever more data-centric, we have begun to move toward a data services model. As part of this transition, over the past year we successfully operated the ESGF stack as a means of integrating and delivering scientific data services to the climate community. We will build upon our existing capabilities to efficiently produce, manage, discover and analyse distributed climate data. The French community will strengthen its position in the international community by contributing to this effort. By leveraging the METAFOR CIM, we can output simulation and model documentation in a standardized format. By leveraging the ES-DOC client tools, we have a clear pathway for creating, publishing and exploiting model run documentation that is inherently conformant to an internationally recognized standard.

  • Task 4.1: XIOS implemented within project models (IPSL, MDLS, CERFACS, CNRM)

Without XIOS, output files are written piecewise, process by process, during the run. The time needed to reconstruct the files, done sequentially in the post-processing phase, grows steeply with resolution and, in some very high-resolution cases, even exceeds the simulation time. This task will implement XIOS within the IPSL-CM model components (LMDz, ORCHIDEE, INCA) and the CNRM-CM model components (ARPEGE, GELATO, SURFEX). NEMO, used by both IPSL-CM and CNRM-CM, already implements XIOS but will have to integrate new functionalities (Tasks 2.2 and 4.2). This work will impact the source code of the cited components and will cover the ability to switch off the previous I/O system, to integrate the new system and to validate the numerous fields generated by climate models (more than a thousand quantities). Furthermore, developments done in Tasks 2.2 and 4.2 will be readily available to the IPSL-CM and CNRM-CM families.
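Validating the large number of output fields after the I/O switch could, for instance, rely on an automated comparison of the new XIOS output against the legacy output. The following is a minimal sketch of such a check, assuming NetCDF files and hypothetical file names; it is illustrative only and not part of the XIOS tool chain.

```python
"""Illustrative check: compare fields written through XIOS with the same
fields produced by the legacy I/O path. File names and the tolerance
are hypothetical."""
import numpy as np
from netCDF4 import Dataset  # assumes the netCDF4-python package is installed


def compare_outputs(legacy_path, xios_path, rtol=1e-12):
    """Return the list of common variables that differ beyond the tolerance."""
    mismatches = []
    with Dataset(legacy_path) as ref, Dataset(xios_path) as new:
        common = set(ref.variables) & set(new.variables)
        for name in sorted(common):
            a = ref.variables[name][...]
            b = new.variables[name][...]
            if a.shape != b.shape or not np.allclose(a, b, rtol=rtol, equal_nan=True):
                mismatches.append(name)
    return mismatches


if __name__ == "__main__":
    diffs = compare_outputs("histmth_legacy.nc", "histmth_xios.nc")
    print("differing fields:", diffs or "none")
```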

  • Task 4.2: XIOS, a bridge towards standardisation (IPSL, MDLS, CERFACS, CNRM)

The XIOS output format, structure and description will (1) conform to the Climate and Forecast (CF) convention for describing the spatio-temporal distribution of data, (2) conform to the CMIP controlled vocabulary requirements, and (3) conform to the CIM ontology describing simulation and model documentation. This will enable easier and faster systematic ingestion of outputs by the data services developed in Task 4.3, and ensure a high level of documentation, provenance, standardization and reuse. Until now, steps 1 and 2 were done through costly post-processing, and step 3 by time-consuming manual intervention using an on-line questionnaire. The CF and CMIP controlled vocabulary requirements will be fulfilled by adapting the XIOS code so as to precisely define and fill the variables describing the space and time discretization, and by generating XIOS configuration files embedding the project controlled vocabulary (CMIP and followers). XIOS project-specific configuration files will be generated from the information system operated in Task 4.3. The innovative ES-DOC meta-programming approach will forward-engineer C++ and Python CIM client tools, ensuring that the client tools are speedily updated in response to changes in the CIM standard. These client tools will be implemented within XIOS to create CIM-compliant documents, and within the running environment to publish those documents. CIM instances will be created on the local file system in either XML or JSON format, then published to the remote ES-DOC CIM API.
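As a rough illustration of the configuration-generation step, the sketch below builds an XIOS-style field-definition fragment from a small controlled-vocabulary extract. The vocabulary entries and XML attribute names are simplified placeholders and do not claim to match the exact XIOS schema or the CMIP tables.

```python
"""Illustrative sketch: generate an XIOS-style field-definition fragment from
a project controlled vocabulary. Entries and attribute names are placeholders."""
import xml.etree.ElementTree as ET

# Hypothetical controlled-vocabulary entries (CF standard names + short names).
VOCABULARY = [
    {"id": "tas", "standard_name": "air_temperature", "unit": "K"},
    {"id": "pr", "standard_name": "precipitation_flux", "unit": "kg m-2 s-1"},
]


def build_field_definition(entries):
    """Return an XML fragment declaring one <field> per vocabulary entry."""
    root = ET.Element("field_definition")
    for entry in entries:
        ET.SubElement(root, "field", id=entry["id"],
                      standard_name=entry["standard_name"], unit=entry["unit"])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_field_definition(VOCABULARY))
```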

  • Task 4.3: Data and metadata services (IPSL, IDRIS, CNRS-GAME)

Part of the intellectual content forming the basis of current and future climate research is being assembled in massive digital collections scattered across different places. As the size and complexity of these holdings increase, so do the complexities arising from interactions with the data, including use, reuse and repackaging for unanticipated uses, as well as managing over time the historical metadata that keep the data relevant. Scientific data centres also need infrastructures that support a comprehensive, end-to-end approach to managing climate data: full information life-cycle management. Ultimately, we have to be able to manage diverse collections within a uniform, coordinated environment. We will further develop the ESGF and ES-DOC environment capabilities to that end. This environment will be our communication bus, accessible through RESTful interfaces, and will be an integrative middle layer on which various data services, tailored to the needs of a diverse and growing user base, will be built. So far, ESGF ingestion capabilities have been limited to parsing NetCDF files organized in THREDDS catalogs, which was enough to support the global CMIP5 effort and related climate projects (CORDEX, PMIP3, Obs4MIPs, etc.). We will develop services to greatly enhance the variety of resources that can be published and discovered through the system. The targets are simulation data and CIM documents produced by the climate modelling platform (standardized in Task 4.1), reference datasets to sustain the Climate Evaluation Framework, and diagnostic images produced in Task 5. The architecture will support publication through both a “pull” mechanism, whereby clients request the service to harvest a remote metadata repository, and a “push” mechanism, whereby clients send complete metadata records directly to the service for ingestion. The system will support the ability to plug in new harvesting algorithms, as sketched below. The new publishing client will be used within the runtime environment developed in Task 3 and, to that end, will have to be compliant with the requirements of the GENCI HPC centres and of the CNRS-GAME computing resources.
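The following is a minimal sketch of the intended publication layer, showing the push/pull distinction and a plug-in registry for harvesting algorithms. Class and function names are hypothetical and do not represent the actual ESGF publisher API.

```python
"""Sketch of a pluggable metadata-ingestion layer supporting both a "push"
mode (complete records sent by clients) and a "pull" mode (records harvested
from a remote repository by a plug-in). Names are hypothetical."""
from abc import ABC, abstractmethod


class Harvester(ABC):
    """Base class for pull-mode harvesting plug-ins."""

    @abstractmethod
    def harvest(self, endpoint: str) -> list[dict]:
        """Return metadata records fetched from a remote repository."""


_HARVESTERS: dict[str, Harvester] = {}


def register_harvester(scheme: str, harvester: Harvester) -> None:
    """Plug in a new harvesting algorithm for a given metadata scheme."""
    _HARVESTERS[scheme] = harvester


def ingest_push(record: dict, catalogue: list[dict]) -> None:
    """Push mode: the client sends a complete metadata record for ingestion."""
    catalogue.append(record)


def ingest_pull(scheme: str, endpoint: str, catalogue: list[dict]) -> None:
    """Pull mode: the service harvests a remote repository via a plug-in."""
    catalogue.extend(_HARVESTERS[scheme].harvest(endpoint))
```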

  • Task 4.4: Big Data Analytics (IPSL, CNRS-GAME, IDRIS, CERFACS)

We will focus on specific developments to enable server-side computation capabilities at the data location, balanced and orchestrated by local cluster computation with a high-capacity link to the data location. This will greatly expand the capability of the ESGF Compute Node software stack, continuing the effort done in the G8 ExArch project (finishing in June 2014) and contributing to the international development of the ESGF stack. Some core/generic functions have to be available at the data location. Primitive functions needed by typical climate analysis processes (average, interpolation, Taylor diagram) will be implemented on the ESGF compute node at the data location, primarily chosen to feed the needs of Task 5.
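As an indication of what such a server-side primitive looks like, the sketch below implements an area-weighted spatial average of a two-dimensional field. The function signature is illustrative only and is not the ESGF compute-node API.

```python
"""Sketch of one server-side "core" primitive: an area-weighted spatial
average of a (lat, lon) field, as could be exposed next to the data."""
import numpy as np


def area_weighted_mean(field: np.ndarray, lat: np.ndarray) -> float:
    """Average a (lat, lon) field with cos(latitude) weights."""
    weights = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(field)
    return float(np.average(field, weights=weights))


if __name__ == "__main__":
    lat = np.linspace(-89.5, 89.5, 180)
    field = np.random.rand(180, 360)  # placeholder data
    print("global mean:", area_weighted_mean(field, lat))
```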

The ESGF compute node will allow for large-scale manipulation and analysis of data. These computations will be coordinated within an organizational unit as well as across organizational units, directed by data locality. Institutional clusters will be used to orchestrate the analysis by querying the compute node’s core functions and performing “non-core” computations directly. This capability, coupled with the existing distributed ‘data-space’ capability, will give us the ability to take advantage of local and remote computing resources to efficiently analyse data wherever they reside. Each participating organization will be able to provide policies governing its local resource utilization. The ESGF authentication and authorization system will be used to restrict access to the computational capabilities when necessary.
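The orchestration pattern can be summarised by the sketch below: the institutional cluster delegates “core” reductions to the remote compute node co-located with the data and keeps “non-core” steps for local execution. The endpoint, payload and operation names are assumptions for illustration, not an existing service interface.

```python
"""Sketch of the orchestration pattern: core operations run server side at the
data location, non-core operations stay on the local cluster. The endpoint and
payload shown here are hypothetical."""
import requests  # assumes the requests package is installed

CORE_OPERATIONS = {"average", "interpolation", "taylor_diagram"}


def run_operation(op: str, dataset_id: str, compute_node_url: str):
    """Dispatch a core operation to the remote compute node, or signal that the
    operation must be handled locally by the institutional cluster."""
    if op in CORE_OPERATIONS:
        # Core function: executed server side, next to the data.
        resp = requests.post(f"{compute_node_url}/compute",
                             json={"operation": op, "dataset": dataset_id},
                             timeout=300)
        resp.raise_for_status()
        return resp.json()
    # Non-core function: to be computed locally after retrieving reduced data.
    raise NotImplementedError(f"{op!r} must be handled by the local cluster")
```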

Success criteria:

Standardized data and metadata can be seamlessly ingested at simulation runtime. The ESGF and ES-DOC APIs are widely used to discover, reuse and analyse simulations, reference datasets and images.

Identified risks:

Not all climate outputs, metadata or images can be ingested at run time, either because XIOS has not been implemented in all components or because the ingestion process is not fast enough for near-real-time ingestion. Manual post-processing will have to be implemented to overcome this issue for a subset of simulations. Server-side core functions may not be generic or efficient enough to suit Task 5 needs. This has to be identified as early as possible so as to define a fall-back position, which could take the form of an additional post-processing procedure at the HPC centre, triggered manually for simulation subsets.