Slide 1: Highly Available ESGF Services for the Copernicus Climate Data Store
Matt Pryor, Phil Kershaw, Alan Iwi (CEDA)
Sebastien Gardoll (IPSL)
Carsten Ebrecht (DKRZ)
Luca Cinquini (NASA JPL/UCAR)
ESGF Container Working Group
ESGF F2F, Washington D.C.
December 2018
Slide 2: Contents
Context
What is the Copernicus Climate Data Store?
Requirements for data discovery and download services
Load-balanced Architecture
Overview
Challenges and Compromises
Containerised ESGF services
Motivation
Current state
Challenges and Solutions
Future work
Slide 3: Context
Slide 4: Context
What is the Copernicus Climate Data Store?
The Climate Data Store (CDS) is part of the Copernicus Climate Change Service (C3S)
C3S is operated by ECMWF on behalf of the European Union
It aims to provide key indicators of climate change drivers, supporting all sectors
The CDS provides a single, freely available interface to a range of climate-related observations and simulations
Wide range of data sources from many participating organisations
In-situ observations, models, reanalyses, satellite products
Slide 5: Context
Requirements for data discovery and download services
CEDA, IPSL and DKRZ provide a quality-controlled subset of CMIP5 to the CDS via ESGF services
User-facing services (e.g. search and download) must be highly available at a single set of URLs
≥ 98% uptime, i.e. at most ~7 days of downtime per year
Publishing not subject to this restriction
No single site can meet this requirement
Geographically distributed, load-balanced service is required
Not the same as the traditional federated approach
Some inconsistency is accepted as a trade-off for high availability
Slide 6: Current Architecture
Slide 7: Load-balanced Architecture
Overview
(Diagram stages: normal operation, publication, data replication)
Separate master index node for publishing
Publishing does not have to be highly available
Replication to slaves is turned off during publishing
Data node and slave index node at each site
Data replication using Synda
DNS load-balancing across sites
Each DNS query returns an A record for an available site at random
Available sites determined by health check
Short time-to-live (TTL) means clients perform lookups regularly
No need for proxy server (which is a single point of failure)
Cloud-based DNS service (Amazon Route 53)
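The random, health-checked A-record behaviour described above can be sketched in a few lines of Python. This is purely an illustration: the site names and addresses are invented placeholders, not the real ESGF hosts or the Route 53 implementation.

```python
import random

# Hypothetical site pool: names and addresses are illustrative placeholders,
# not the real ESGF hosts.
SITES = {
    "ceda": "192.0.2.10",
    "dkrz": "192.0.2.20",
    "ipsl": "192.0.2.30",
}

def resolve(healthy):
    """Return the A record of one healthy site, chosen at random.

    Mirrors the behaviour on this slide: sites failing their health check
    are withdrawn from the answer pool, and a short TTL means clients
    re-resolve often enough to pick up such changes quickly.
    """
    candidates = [ip for site, ip in SITES.items() if site in healthy]
    if not candidates:
        raise RuntimeError("no healthy sites available")
    return random.choice(candidates)
```

Because each answer carries a short TTL, a site dropping out of the healthy pool disappears from new client lookups within roughly one TTL, with no proxy in the request path.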
[Architecture diagram: a cloud-based DNS service directs end users to a slave index node and data node at each of CEDA, DKRZ and IPSL; the publisher writes to a separate master index node; Synda replicates data between the sites' data nodes.]
Slide 8: Load-balanced Architecture
Challenges and Compromises
To maintain high availability when publishing, some consistency must be sacrificed
Data may be available for download via THREDDS at one site but not at others
Slave indexes may be inconsistent after publication to the master index
Data replication via Synda needs to target a specific data node
Requires modifying Solr records after initial publication
Non-deterministic catalog paths generated during publication
Patch from Alan Iwi (CEDA) uses DRS in path instead of an integer
DNS load-balancing is not perfect
Reliant on clients to respect TTL for correct behaviour
Reliant on third-party service (running a DNS server is difficult)
Sophisticated algorithms are a lot more expensive on cloud-based providers
Sophisticated health checks are also more expensive
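The Solr record modification mentioned above can be sketched with Solr's atomic-update syntax. The `data_node` field name follows the ESGF search schema, but this is a hypothetical helper, not the actual publisher or replication code.

```python
import json

def retarget_record(record_id, new_data_node):
    """Build a Solr atomic-update payload that points an existing dataset
    record at a different data node.

    The `data_node` field name follows the ESGF search schema; treat this
    as a sketch, not the exact publisher logic.
    """
    doc = {
        "id": record_id,
        # Solr's atomic "set" operation replaces just this field,
        # leaving the rest of the record untouched.
        "data_node": {"set": new_data_node},
    }
    return json.dumps([doc])
```

The resulting JSON would be POSTed to the index's update handler (for example `.../solr/datasets/update?commit=true`) with a JSON content type.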
Slide 9: Containerised ESGF Services
Slide 10: Containerised ESGF Services
Motivation for Containers
Containers simplify installation
A container encapsulates an application and its dependencies as a single unit
No more dependency hell
Containers increase confidence
A container is packaged once and used multiple times
Same code in test and production
Containers increase portability
A container can run anywhere there is a Linux kernel
Containers encourage modularity
Each container runs a single application
Containers work together to provide an integrated system
Containers allow better usage of resources
Higher density than a VM per application
More isolation than processes on a shared host
[Diagram: side-by-side stacks comparing traditional installation (server, host OS, shared libraries), virtualised installation (hypervisor, guest OS and libraries per application) and containerised installation (shared kernel, isolated process space per application)]
Slide 11: Containerised ESGF Services
Motivation for Kubernetes
Containers excel when used with an orchestrator
Automated management of containerised applications across a cluster
Kubernetes is now the de facto standard
Resilience and scaling are core features of the platform
Zero downtime rolling upgrade
In-cluster service discovery and load-balancing
Storage abstraction
Slide 12: Containerised ESGF Services
Current State
https://github.com/ESGF/esgf-docker
All core ESGF services have been containerised
Currently no support for GridFTP/Globus, node manager or dashboard
MyProxy deprecated in favour of SLCS
Single-node deployment using Docker Compose working
Kubernetes deployment using Helm charts working
Each Tomcat and Django application is fully self-contained
SSL termination and client authentication using Nginx proxy
Container images built, tested and pushed by Jenkins for every commit to master and devel
Thanks to Sebastien Gardoll (IPSL)
Slide 13: Containerised ESGF Services
Challenges and Solutions
Very different paradigm to traditional monolithic installer
Shared configuration files in traditional installer are difficult to untangle for each application
Initial implementation by Luca made large steps towards addressing this problem
Initial implementation closely followed traditional installer
Refactored to be more “cloud-native”
No need for process managers like supervisord
Use official base containers where possible
Reduce container bloat
Slide 14: Containerised ESGF Services
Challenges and Solutions
ESGF applications with multiple responsibilities
ESGF applications could be refactored to better suit a micro-services architecture
Would allow better use of scaling features in Kubernetes
SSL client authentication
Kubernetes has no native support for SSL client authentication
Current solution requires proxy container for SSL handshake
Ideally, we would allow Kubernetes to handle ingress
Could replace SSL certificates for authentication with OAuth tokens
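If OAuth tokens did replace SSL client certificates as suggested, the Kubernetes ingress could pass requests straight through and the application (or an auth sidecar) would read a bearer token from the request instead of relying on a proxy's TLS handshake. A minimal, hypothetical helper for the first step:

```python
from typing import Optional

def extract_bearer_token(authorization_header: str) -> Optional[str]:
    """Pull the bearer token out of an HTTP Authorization header.

    Hypothetical sketch: the extracted token would then be validated
    against the OAuth/SLCS server, replacing the SSL client-certificate
    handshake currently done by the Nginx proxy container.
    """
    # Split into scheme and credentials, e.g. "Bearer" and the token itself.
    parts = authorization_header.split(None, 1)
    if len(parts) == 2 and parts[0].lower() == "bearer":
        return parts[1].strip()
    return None
```

Token validation itself (introspection or signature checking) would depend on the OAuth server chosen, so it is deliberately left out of this sketch.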
Slide 15: Containerised ESGF Services
Future work
More flexible deployment
Work is currently underway to support partial deployments
Build Tomcat applications from source
Pre-built WARs are included from the ESGF distribution site at build time
Should build Tomcat applications from source at a particular version
Also useful for testing (e.g. build an image from a dev branch)
Implement more of the ESGF test suite for Docker build
Feature parity with traditional installer
Subject to specific deprecations
Automated publication using Kubernetes jobs
Slide 16: Questions