Split brain syndrome in Oracle Real Application Clusters (RAC) refers to a scenario where two or more instances of an Oracle RAC cluster believe they are the only active instance of the database. This can lead to data corruption, as each instance might try to independently access and modify the same data blocks. Split brain typically occurs due to network failures or communication issues between the nodes in the cluster, causing a loss of synchronization.
Causes of Split Brain Syndrome
- Network Partitioning: When the network links between nodes fail, each node might think the other nodes are down and attempt to take over resources.
- Clusterware Misconfigurations: Incorrect configuration of the Oracle Clusterware can lead to improper failover handling.
- Hardware Failures: Failures in the network interface cards (NICs), switches, or other hardware components can cause communication issues.
- Software Bugs: Bugs in the Oracle RAC software can sometimes lead to synchronization issues.
Mechanisms to Prevent Split Brain
- Voting Disks: Oracle Clusterware uses voting disks to determine which nodes are active members of the cluster. If a node cannot access a majority (more than half) of the voting disks, it is evicted and shuts down its cluster stack to prevent split brain.
- Network Heartbeats: Nodes in the cluster exchange heartbeat messages over the private network. If heartbeats are missed, nodes will check the voting disks to make decisions.
- Disk Heartbeats: Nodes periodically write to and read from the shared voting disks to indicate they are active.
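The two checks described above can be sketched as plain shell arithmetic. This is only an illustration of the rules, not Oracle's implementation: the function names has_quorum and heartbeat_missed are hypothetical, and the 30-second timeout in the example is an assumed value (on a real cluster the configured network heartbeat timeout can be read with crsctl get css misscount).

```shell
# Illustrative sketch only: these helpers are hypothetical and are not
# part of Oracle Clusterware.

# A node keeps its membership only if it can access a strict majority
# (more than half) of the configured voting disks.
has_quorum() {
  # $1 = voting disks this node can access, $2 = voting disks configured
  if [ $(( $1 * 2 )) -gt "$2" ]; then
    echo "yes"
  else
    echo "no"
  fi
}

# A heartbeat counts as missed once the time since the last one
# reaches the timeout.
heartbeat_missed() {
  # $1 = seconds since the last heartbeat, $2 = timeout in seconds (assumed)
  if [ "$1" -ge "$2" ]; then
    echo "missed"
  else
    echo "ok"
  fi
}

has_quorum 2 3          # prints yes: 2 of 3 is a majority
has_quorum 1 2          # prints no: exactly half is not a majority
heartbeat_missed 10 30  # prints ok
heartbeat_missed 45 30  # prints missed
```

Because exactly half is not a majority, clusters are normally configured with an odd number of voting disks, such as three or five.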
Resolving Split Brain Syndrome
When split brain syndrome occurs, it must be resolved to ensure data consistency and cluster integrity. Here are steps to resolve it:
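- Automatic Node Fencing: Oracle Clusterware will automatically evict nodes that it deems to be in a split brain state. Depending on the release, an evicted node either reboots or first tries to restart only its Clusterware stack and reboots if that fails, which keeps it from writing to shared storage and clears any potential corruption.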
- Manual Intervention:
- Identify the Issue: Use logs (alert.log, crsd.log, etc.) to identify the nodes involved in the split brain.
- Shut Down Conflicting Instances: If necessary, manually shut down the conflicting database instances.
- Cluster Reconfiguration: Reconfigure the cluster if misconfigurations are found.
- Restart Cluster Services: Restart the Oracle Clusterware services (crsctl start crs).
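As a starting point for the log review, a small filter like the one below can pull candidate lines out of a log file. It is a sketch under stated assumptions: scan_cluster_log is a hypothetical helper, the search terms are guesses at useful keywords rather than exact Oracle message text, and log file locations differ between releases, so the path in the usage comment is a placeholder.

```shell
# Illustrative sketch only: scan_cluster_log is a hypothetical helper.
# The keywords are assumptions; actual message wording varies by version.
scan_cluster_log() {
  # $1 = path to a log file such as alert.log or crsd.log
  # -i ignores case, -n prefixes each match with its line number
  grep -inE 'evict|heartbeat|split' "$1"
}

# Example usage (placeholder path):
#   scan_cluster_log /path/to/crsd.log
```

Running the same filter on every node and comparing timestamps can help show which node lost contact first.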
Example Scenario and Resolution
Scenario: Assume a two-node RAC setup with nodes rac1 and rac2. Due to a network failure, rac1 and rac2 lose communication with each other but continue to function independently, leading to a split brain situation.
Resolution Steps:
- Check Voting Disk Status:
crsctl query css votedisk
- Examine Logs:
- Review the alert.log and crsd.log on both nodes to determine the state of the cluster and the split brain cause.
- Check for messages indicating loss of network heartbeat or node eviction.
- Manual Node Fencing (if needed):
- If automatic eviction has not occurred, manually stop the Clusterware stack on one of the nodes (run as root on that node) to consolidate cluster control.
crsctl stop crs -f
- Restart Oracle Clusterware Services:
- On the surviving node, ensure clusterware services are running.
crsctl start crs
- Bring Up the Database:
- Start the database with srvctl. While the other node is down, only the instance on the surviving node can start.
srvctl start database -d <dbname>
- Reconfigure Cluster (if needed):
- Fix any underlying network issues and reconfigure the cluster as necessary.
- Restart the Other Node:
- Once the network issue is resolved, start Oracle Clusterware on the other node (as root) so that it rejoins the cluster.
crsctl start crs
- Monitor the Cluster:
- Ensure that both nodes are communicating properly and that the cluster is functioning without any split brain issues.
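The voting disk check and the final monitoring step can be combined into a short script. This is a minimal sketch, not an official procedure: count_online_votedisks is a hypothetical helper, and it assumes that each voting disk line in the crsctl query css votedisk output carries the state (for example ONLINE) in its second column. The live commands only run if crsctl is found on the PATH.

```shell
# Illustrative sketch only: count_online_votedisks is a hypothetical helper.
# Assumption: each voting disk line of "crsctl query css votedisk" has the
# disk state (e.g. ONLINE) as its second whitespace-separated field.
count_online_votedisks() {
  # Reads the command output on stdin and prints the number of ONLINE disks.
  awk '$2 == "ONLINE" { n++ } END { print n + 0 }'
}

# Run the live checks only where Oracle Clusterware is installed.
if command -v crsctl >/dev/null 2>&1; then
  online=$(crsctl query css votedisk | count_online_votedisks)
  echo "ONLINE voting disks: $online"
  # Report the state of the cluster services on all nodes.
  crsctl check cluster -all
fi
```

If the reported count is not a majority of the configured voting disks, or any node shows its services offline, the cluster has not fully recovered.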
By carefully following these steps, you can resolve split brain syndrome in Oracle RAC and restore normal cluster operations.