Oracle Dataguard is crucial for High Availability and Disaster Recovery, as it protects data by maintaining one or more standby databases. If the standby database falls behind the primary, it can create a lag issue that impacts your Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
What is Data Guard Lag?
Data Guard Lag is the delay in shipping or applying redo data from the primary database to the standby database. This means standby database is not fully synchronized with the primary, increasing risk of data loss in a failover scenario.
There are three primary types of lag:
- Transport Lag (Redo Shipping Lag): The time delay of redo data from the primary database to the standby database, often caused by network issues or primary I/O bottlenecks.
- Apply lag (Redo Apply Lag): The amount of redo data that has been received but not yet applied to the standby database, usually due to I/O, CPU, or memory bottlenecks on the standby.
- Gap Resolution Lag: Happens when the redo transport services find a missing archived redo log file and pause log application until the missing log is received and recorded.
How to check Dataguard lag
Check the Transport lag and Apply lag
SELECT name, value, unit, time_computed FROM v$dataguard_stats WHERE name in ('transport lag','apply lag');
- Transport lag: Time difference between the last redo data generated on primary and the last redo data received on standby.
- Apply Lag: Time difference between the last redo data generated on primary and the last redo data applied on standby.
Check Redo Process Status (Transport and Apply)
SELECT PROCESS, CLIENT_PROCESS, SEQUENCE#, STATUS, THREAD# FROM V$MANAGED_STANDBY;
- Look for the RFS (Remote File Server) process to check Redo Transport (receiving logs).
- Look for the MRP0 (Media Recovery Process) to check redo apply. A common Status for MRP is APPLYING_LOG or WAIT_FOR_LOG
Check for Redo Gaps
SELECT * FROM V$ARCHIVE_GAP;
If this query returns rows, it indicates missing archive log sequences that need to be resolved.
Troubleshooting and Tuning Tips:
Network TroubleShooting ( Transport Lag)
If Transport lag is high, the problem is often between the primary and standby databases.
- Check Bandwidth: Make sure the network has enough bandwidth.
- Reduce Packets Drops: Check network statistics to confirm there are minimal packet drops or errors.
- Enable Compression: Use the COMPRESSION=ENABLE parameter in the LOG_ARCHIVE_DEST_n configuration to reduce the amount of data transferred.
- Monitor Latency: Check VPN/firewall latency that might be delaying communication.
- Use ASYNC transport: Use LGWR ASYNC over LGWR SYNC ( Maximum Performance mode) if the network is unreliable or has high latency, prioritizing performance over absolute zero data loss(Maximum Protection/ Availability Modes).
MRP (APPLY) Tuning Tips ( APPLY LAG)
If Apply lag is high, the standby database is struggling to keep up with the redo apply rate.
- Increase Parallel Recovery: Increase the degree of parallelism for recovery using the RECOVERY_PARALLELISM parameter. For Example: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE PARALLEL 4;
- Enable Real-Time Apply: Ensure you are using Real-time Apply, where the MRP applies redo directly from the standby Redo Log Files (SRLs) as soon as they are filled, rather then waiting for them to be archived.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
- Resource Allocation: Ensure the standby server has sufficient CPU and RAM to handle the redo apply workload.
- Remove Resource Limits: Check and remove unnecessary resource limits ( e.g. in a database resource manager) that could throttle the MRP process.
Redo Generation issues
A suddenly surge in redo generation can overwhelm the sytem.
- Tune Application Commits: Frequent small commits generate more redo and log switched than larger, less frequent commits.
- Avoid Unnecessary Batch Operations: Minimize operations that generate excessive redo.
- Use Additional Standby Redo Logs: Ensure the number and size of SRLs on the standby are equal to or greater than the standby are equal to or greater than the online redo logs on the primary to avoid waiting for logs to be reused.