Oracle Data Guard at CERN
Emil Pilecki
Credit: Luca Canali, Marcin Blaszczyk, Steffen Pade

Agenda
About CERN
Oracle and Data Guard at CERN
DG perks and benefits
Zero data loss over long distances (far sync)
Far sync testing results

About CERN
European Organization for Nuclear Research, founded in 1954
21 member states, 2 candidates, 6 observers + UNESCO and the EU
60 non-member states collaborate with CERN
2500 staff members and 10 000 scientists

LHC and Experiments
Large Hadron Collider (LHC): a particle accelerator that collides beams at very high energy
27 km long circular tunnel
Located ~100 m underground
Protons travel at 99.9999972% of the speed of light
Collisions are analysed with dedicated detectors and software in the LHC experiments
New particle discovered!
Consistent with the Higgs boson
Announced on July 4th, 2012

Oracle at CERN
Since 1982, version 2.3
Oracle DBs play a key role in the LHC production chains
Accelerator logging and monitoring systems
Online acquisition, offline data (re)processing, data distribution, analysis
Grid infrastructure and operation services
Monitoring, dashboards, etc.
Data management services
File catalogues, file transfers, etc.
Metadata and transaction processing for tape storage system
Administrative services

CERN’s Databases
Over 100 Oracle databases, mostly RAC
NAS storage plus some SAN with ASM
~400 TB of data files for production DBs
Examples of CERN's critical DBs:
LHC logging database ~170 TB, expected growth up to 70 TB / year
13 production experiments' databases, ~140 TB in total
15 production systems protected with Data Guard
Active Data Guard since 11g

Our Data Guard architecture
1. Low load: primary database plus one Active Data Guard standby used for both read-only workloads and disaster recovery
2. Busy & critical: primary database plus two Active Data Guard standbys, one for disaster recovery and one for read-only workloads
Both run in maximum performance mode with asynchronous redo transport:
LOG_ARCHIVE_DEST_X='SERVICE=<tns_alias> OPTIONAL ASYNC NOAFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=<standby_db_unique_name>'

(Active) Data Guard benefits
Features and functionalities we profit from:
Data protection for disaster recovery
Replication and offloading of read-only workloads
Database backups from standby
Safeguarding against logical data corruptions with flashback
Snapshot standby for testing
Fast upgrades and hardware migrations
Detection of lost writes
Automatic block media recovery

Disaster recovery
We have been using it for a few years
Switchover/failover is our first line of defence
Has already saved the day for production services
Current disaster recovery site is 10 km from our main datacentre
Remote site in Hungary to be used soon
Over 1000 km distance
Network latency of 25 ms is a challenge
Plan to move most of the standby databases there within 1 year

Offloading production databases
Efficient replication of the whole database
Workload distribution:
Transactional workload runs on the primary
Read-only workload can be moved to ADG
Read-mostly workload: DMLs can be redirected to the primary with a dblink
Database backups from standby
Significantly reduces load on the primary: removes the sequential I/O of a full backup
ADG allows block change tracking to be used for fast incremental backups

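For a read-mostly application on the standby, the dblink redirection can be sketched as below (schematic SQL only; the link, user, service, and table names are hypothetical):

```sql
-- Sketch, hypothetical names (to_primary, app_user, orders, primary_tns).
-- Create the database link on the primary; its definition is replicated
-- to the standby through redo apply.
CREATE DATABASE LINK to_primary
  CONNECT TO app_user IDENTIFIED BY app_password
  USING 'primary_tns';

-- On the Active Data Guard standby, reads are served locally:
SELECT status FROM orders WHERE id = 1001;

-- The occasional DML is redirected to the primary through the link:
UPDATE orders@to_primary SET status = 'SHIPPED' WHERE id = 1001;
COMMIT;
```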
Flashback and snapshot standby
Flashback enabled on standby only
Recover from human errors and data corruptions
Avoid impacting the primary database with flashback log generation
Snapshot standby
Testing changes before implementing them on the primary
Safe: redo is still sent to the standby
Very easy to use:
SQL> ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;

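Enabling flashback on the standby only can be sketched as below (schematic; redo apply has to be paused while the setting changes):

```sql
-- On the standby only: stop redo apply, enable Flashback Database,
-- then resume redo apply. The primary keeps flashback disabled, so it
-- is not impacted by flashback log generation.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE FLASHBACK ON;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
```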
Fast upgrades and migrations
[Diagram, steps 1-6: primary RAC database on Clusterware 11g + RDBMS 11g; a new cluster with Clusterware 12c + RDBMS 11g hosts a Data Guard standby RAC database fed by redo transport; read-write access moves to the new cluster; the RDBMS is upgraded to 12c (database downtime); upgrade complete on Clusterware 12c + RDBMS 12c]

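The role transition in the middle of the migration can be sketched as a manual switchover (schematic SQL; a Data Guard Broker SWITCHOVER command achieves the same):

```sql
-- Sketch of a manual switchover. On the old primary:
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;

-- On the standby running on the new cluster:
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
ALTER DATABASE OPEN;
```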
Fast upgrades and migrations
Risk mitigation:
Fresh installation of the new clusterware
Old system stays untouched
Allows a full upgrade test
Allows stress testing of the new system
Downtime reduction:
~1h for the RDBMS upgrade
Additional hardware required, unless migration to new hardware is expected anyway

Lost write detection and ABMR
Slave exiting with ORA-752 exception:
Errors in file /ORA/dbs0a/PDBR_RAC50/diag/rdbms/pdbr_rac50/PDBR1/trace/PDBR1_pr0l_92600.trc:
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 67, block# 57976209, file offset is 2494701568 bytes)
ORA-10564: tablespace STRMMON
ORA-01110: data file 67: '/ORA/dbs03/PDBR_RAC50/datafile/STRMMON_67.dbf'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 435213427
Mon Apr 14 06:52:02 2014
Recovery Slave PR0L previously exited with exception 752
Stops redo application when a lost write is detected
Previous consistent block version still on standby
Helps to diagnose and repair the error
Automatic Block Media Recovery with ADG:
Fixes physical block corruptions
Works both ways: Primary ↔ ADG

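With ADG the block repair is automatic; the equivalent manual repair of the block reported in the log above could be sketched in RMAN as:

```sql
-- RMAN sketch: repair the block reported by ORA-10567 from a good copy
-- (with ADG, Automatic Block Media Recovery does this automatically).
RMAN> RECOVER DATAFILE 67 BLOCK 57976209;
```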
Zero data loss replication
Uses the synchronous redo transport method
DML statements are impacted, since commits wait for acknowledgment from the standby
LOG_ARCHIVE_DEST_X='SERVICE=<tns_alias> OPTIONAL SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=<standby_db_unique_name>'
[Diagram: primary database ships redo to the Data Guard standby; the commit acknowledgment returns from the standby]
Network latency matters!!!

Far Sync concepts
Long distances = high network latency = slow commit acknowledgment with SYNC redo transport
[Diagram: with 25 ms of latency to the remote site, the primary ships redo synchronously to a nearby Far Sync instance, which forwards it asynchronously to the distant standby, instead of shipping sync or async directly over the long distance]

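A Far Sync configuration could be sketched with two redo destinations (schematic; the SERVICE and DB_UNIQUE_NAME values are hypothetical):

```sql
-- Sketch, hypothetical names. On the primary: synchronous, zero data
-- loss shipping to the nearby Far Sync instance (low latency).
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2=
  'SERVICE=farsync_tns SYNC AFFIRM
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=farsync';

-- On the Far Sync instance: asynchronous forwarding to the remote
-- standby, so the 25 ms latency does not delay primary commits.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2=
  'SERVICE=standby_tns ASYNC NOAFFIRM
   VALID_FOR=(STANDBY_LOGFILES,STANDBY_ROLE)
   DB_UNIQUE_NAME=standby';
```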
Far Sync testing at CERN
Functional: does it work? Are there any bugs?
Performance:
Simulated heavy DML workload with and without Far Sync
Oracle Real Application Testing: workload captured from production databases
[Diagram: primary → Far Sync → standby, 25 ms latency]

Far Sync testing results
Functional tests: it works well!!! but...
3-7013523981: FRA not cleaned up automatically on the FAR SYNC instance
3-7023772221: Failover to an alternate destination does not work with FAR SYNC
Both bugs still present in 12.1.0.1 production
Some configuration issues with Data Guard Broker

Far Sync testing results
Performance tests with simulated heavy DML workload
256 parallel sessions inserting data in 500-row batches, 50 batches per session. The target table is partitioned and indexed: 4 local b-tree indexes, 6 local bitmap indexes, and a global primary key index with reversed keys.
Each session inserts data into its own partition.

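One test session's workload can be sketched roughly as follows (the table, sequence, and column names are hypothetical; each session targets its own partition key):

```sql
-- Sketch only: test_table, test_seq and the columns are hypothetical.
-- Each of the 256 sessions runs 50 batches of 500-row inserts into
-- its own partition (partition key = session id).
DECLARE
  v_session_id NUMBER := 42;  -- unique per test session
BEGIN
  FOR batch IN 1 .. 50 LOOP
    INSERT INTO test_table (id, part_key, payload)
    SELECT test_seq.NEXTVAL, v_session_id, RPAD('x', 100, 'x')
    FROM dual CONNECT BY LEVEL <= 500;
    COMMIT;
  END LOOP;
END;
/
```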
Far Sync testing results
Performance tests with the Oracle Real Application Testing framework
Real production workload captured per schema
Workload replayed with and without Far Sync (25 ms latency)
Replay parameters: connect_time_scale=0, think_time_scale=0
CMSR: DML-mostly workload
LCGR: read-only workload

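The replay parameters above map to DBMS_WORKLOAD_REPLAY roughly as follows (a sketch; the replay name and directory object are hypothetical):

```sql
-- Sketch: prepare a Real Application Testing replay with connect and
-- think times scaled to zero, as in the tests.
BEGIN
  DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(
    replay_name => 'farsync_test',   -- hypothetical
    replay_dir  => 'REPLAY_DIR');    -- hypothetical directory object
  DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(
    connect_time_scale => 0,
    think_time_scale   => 0);
END;
/
```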
Far Sync summary
Very promising for long distance replication when data loss is not acceptable
Up to 60% performance gain (DML-only workloads) with 25 ms network latency
Lightweight and easy to deploy (virtual machine)
If latency is <5 ms, you most likely don't need Far Sync
There are still bugs that need fixing

Discussion