Oracle RAC (Real Application Clusters) is designed to provide high availability (HA), ensuring that a database remains operational even if one or more components (such as a node or instance) fail. RAC’s architecture offers failover and recovery mechanisms, client-side failover solutions like FAN (Fast Application Notification) and TAF (Transparent Application Failover), and can integrate with Oracle Data Guard for disaster recovery. This chapter will explain the core concepts and provide commands and examples on how to configure and test high availability features in Oracle RAC.
1. Oracle RAC and High Availability
High Availability in Oracle RAC refers to the ability of a system to continue operating and providing services, despite the failure of one or more components. RAC provides several mechanisms for high availability:
- Instance Failover: If an instance fails, the surviving instances take over its workload.
- Node Failover: If a node fails, the Oracle Clusterware automatically reconfigures the cluster, redistributing the workload.
- Client Failover: Clients connected to a failed instance can reconnect to surviving instances using technologies like FAN and TAF.
Key Benefits of Oracle RAC for High Availability:
- Reduced downtime during planned maintenance and unplanned outages.
- Load balancing across multiple nodes.
- Automatic failover and recovery.
Oracle RAC is commonly used in environments where high availability is crucial, such as financial, retail, and healthcare industries.
2. RAC Failover and Recovery Mechanisms
Oracle RAC ensures continuous availability and minimizes downtime by automatically detecting failures and redistributing workloads among surviving nodes. Key components and mechanisms include:
- Oracle Clusterware: Monitors the health of nodes and instances, and manages the cluster’s resources.
- Cache Fusion: Maintains data consistency across instances by sharing data blocks through the interconnect.
- Automatic Workload Management (AWM): Distributes client connections and workloads across instances.
Example Scenario:
Suppose you have a two-node RAC cluster with nodes node1 and node2. If node1 fails, Oracle Clusterware detects the failure and performs the following actions:
- Instance Recovery: A surviving instance reads the failed instance's redo thread, rolls forward committed changes, and rolls back uncommitted transactions.
- Service Failover: Services running on the failed node are relocated to the surviving node.
- Connection Redistribution: New client connections are directed to the surviving instance.
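The service failover step in this scenario can be sketched as a small simulation. This is a toy model, not Oracle Clusterware's actual placement logic: the Service class, the service names oltp and report, and the "first available instance wins" rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical model of a RAC service with preferred and available instances."""
    name: str
    preferred: list
    available: list
    running_on: set = field(default_factory=set)

    def start(self, up_instances):
        # A service starts on every preferred instance that is up.
        self.running_on = {i for i in self.preferred if i in up_instances}

    def handle_instance_down(self, failed, up_instances):
        # Drop the failed instance, then fall back to the first available
        # instance that is up and not already running the service.
        if failed not in self.running_on:
            return
        self.running_on.discard(failed)
        for candidate in self.available:
            if candidate in up_instances and candidate not in self.running_on:
                self.running_on.add(candidate)
                break


def simulate_failure(services, up_instances, failed):
    """Relocate services away from `failed` and return the surviving instances."""
    survivors = set(up_instances) - {failed}
    for svc in services:
        svc.handle_instance_down(failed, survivors)
    return survivors


oltp = Service("oltp", preferred=["racdb1"], available=["racdb2"])
report = Service("report", preferred=["racdb1", "racdb2"], available=[])
up = {"racdb1", "racdb2"}
for s in (oltp, report):
    s.start(up)

survivors = simulate_failure([oltp, report], up, failed="racdb1")
print(sorted(survivors), sorted(oltp.running_on), sorted(report.running_on))
```

After the simulated failure of racdb1, both services run only on racdb2: oltp because it moved to its available instance, report because it simply lost one of its two preferred instances.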
Commands to Monitor RAC Status:
# Check the status of the RAC database
srvctl status database -d racdb
# Check the status of instances
srvctl status instance -d racdb -i racdb1
srvctl status instance -d racdb -i racdb2
# Check cluster resources
crsctl status resource -t
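For scripted monitoring, the text printed by srvctl status database can be parsed rather than read by eye. The sketch below works on a sample string instead of calling srvctl, and it assumes the common "Instance ... is running on node ..." line format; the sample text and function names are illustrative, and the exact wording can vary between releases.

```python
import re

# Sample text in the shape that `srvctl status database -d racdb` prints.
SAMPLE_OUTPUT = """\
Instance racdb1 is running on node node1
Instance racdb2 is not running on node node2
"""

LINE_RE = re.compile(r"^Instance (\S+) is (not )?running on node (\S+)$")


def parse_status(text):
    """Map instance name -> (node, is_running) for each recognised line."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            instance, negation, node = match.groups()
            result[instance] = (node, negation is None)
    return result


def down_instances(text):
    """Return the sorted names of instances reported as not running."""
    return sorted(name for name, (_, up) in parse_status(text).items() if not up)


print(down_instances(SAMPLE_OUTPUT))  # ['racdb2']
```

In a real script the sample string would be replaced by the captured output of the command, and unrecognised lines (which this sketch silently skips) would deserve their own handling.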
Instance Failover:
When one instance in a RAC cluster fails, a surviving instance performs instance recovery on its behalf: it applies the failed instance's redo and rolls back any uncommitted transactions, which are not resumed elsewhere. The Global Cache Service (GCS) and Global Enqueue Service (GES) remaster the resources the failed instance held, so the affected data blocks become available to the remaining instances.
Node Failover:
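If a node fails, Oracle Clusterware detects the failure and automatically:
- Evicts the node from the cluster membership.
- Relocates the node's virtual IP (VIP) to a surviving node, so clients that try the failed address get an immediate error instead of waiting for a TCP timeout, and can then connect to a surviving node.
- Starts the affected services on surviving instances, according to each service's preferred and available instance settings.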
Example: Viewing Cluster Node Status
To see the status of all nodes in the cluster, use the following command:
crsctl stat res -t
Example: Manually Reconfiguring After Node Failure
In rare cases, manual intervention might be necessary. Use srvctl to stop the instance on the failed node and relocate services. For an administrator-managed database, relocate a service by naming the old and new instances:
srvctl stop instance -d <dbname> -n <node_name>
srvctl relocate service -d <dbname> -s <service_name> -i <old_instance_name> -t <new_instance_name>
3. Configuring Fast Application Notification (FAN) for Failover
Fast Application Notification (FAN) is a RAC feature that publishes events to inform applications about cluster changes, such as node failures or service status changes. FAN enables applications to respond quickly, minimizing the impact of failures.
Steps to Enable FAN:
1. Enable Oracle Notification Service (ONS): FAN uses ONS to propagate events to applications. On a RAC cluster, ONS runs as part of the node applications, so start it together with them:
srvctl start nodeapps
2. Configure FAN with Services: FAN works best when used in conjunction with services.
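To enable FAN notifications for OCI and ODP.NET clients on a service (the -q flag turns on AQ HA notifications; JDBC and UCP clients receive FAN events through ONS instead):
srvctl modify service -d <dbname> -s <service_name> -q TRUE
3. Client Configuration for FAN: Ensure that the client-side connection pool uses Fast Connection Failover (FCF), which subscribes to FAN events. Java clients also need to know where ONS is listening, either through the pool's ONS configuration or, when a local ONS daemon is used, through the oracle.ons.oraclehome system property (-Doracle.ons.oraclehome=<oracle_home>).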
Configuring FAN:
1. Ensure Oracle Notification Service (ONS) is Running: ONS is responsible for publishing FAN events.
# Check ONS status
srvctl status nodeapps -n node1
# Start ONS if not running
srvctl start nodeapps -n node1
2. Configure Applications to Use FAN:
- Java Applications: Use Oracle Universal Connection Pool (UCP) or JDBC with FAN support.
- OCI Clients: Enable FAN by setting environment variables or using APIs.
3. Enable FAN Callouts (Optional):
FAN callouts allow you to execute custom scripts in response to FAN events.
# Create a callout script directory
mkdir -p $GRID_HOME/racg/usrco
# Place your custom script in the directory
cp my_callout_script.sh $GRID_HOME/racg/usrco/
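Callouts are usually shell scripts, but any executable placed in the directory is invoked with the event as its arguments. As a rough illustration of what a callout has to do with those arguments, the sketch below parses an event into its type and key=value fields. The sample event is hand-written in the general FAN layout rather than captured from a cluster, and the field set differs by event type and release, so treat the names here as assumptions.

```python
import re

# key=value pairs; a value runs until the next " key=" or the end of the string,
# so values containing a space (such as a timestamp) stay intact.
PAIR_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_fan_event(args):
    """Parse callout arguments into (event_type, fields)."""
    payload = " ".join(args)
    event_type, _, rest = payload.partition(" ")
    fields = {key.lower(): value for key, value in PAIR_RE.findall(rest)}
    return event_type, fields


def is_down_event(fields):
    """True for events whose status reports a component going down."""
    return fields.get("status", "").lower() in {"down", "nodedown"}


# Hypothetical argument list, shaped like what a callout would receive.
sample_args = [
    "SERVICEMEMBER", "VERSION=1.0", "service=myservice", "database=racdb",
    "instance=racdb1", "host=node1", "status=down", "reason=FAILURE",
    "timestamp=2024-01-15", "10:30:00",
]

event_type, fields = parse_fan_event(sample_args)
print(event_type, fields["instance"], fields["status"], is_down_event(fields))
```

A real callout would typically append the parsed event to a log file or trigger an alert, and it should return quickly because Clusterware runs callouts for every event.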
Example:
To configure a JDBC application to use FAN:
// Enable FAN for Oracle JDBC
Properties props = new Properties();
props.put(OracleConnection.CONNECTION_PROPERTY_FAN_ENABLED, "true");
Testing FAN Configuration:
1. Simulate Node Failure:
# Stop an instance to simulate failure
srvctl stop instance -d racdb -i racdb1 -o immediate
2. Observe Application Behavior:
The application should quickly receive the FAN event and take appropriate action, such as reconnecting to a surviving instance.
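What "appropriate action" means for a connection pool can be sketched as follows. This is a toy pool, not UCP: the SimplePool class, its method names, and the event dictionary are assumptions for illustration. The point is that a FAN-aware pool discards connections to the failed instance as soon as the down event arrives, instead of discovering them one by one through errors or timeouts.

```python
class SimplePool:
    """A toy connection pool that tracks which instance each connection uses."""

    def __init__(self, connections):
        # connections: dict of connection id -> instance name
        self.connections = dict(connections)

    def on_fan_event(self, event):
        """Discard connections to an instance reported down; return their ids."""
        if event.get("status") != "down":
            return []
        dead = sorted(
            cid for cid, inst in self.connections.items()
            if inst == event.get("instance")
        )
        for cid in dead:
            del self.connections[cid]
        return dead

    def borrow(self):
        """Hand out the lowest remaining connection id, or None if the pool is empty."""
        return min(self.connections) if self.connections else None


pool = SimplePool({1: "racdb1", 2: "racdb2", 3: "racdb1"})
removed = pool.on_fan_event({"status": "down", "instance": "racdb1"})
print(removed, pool.borrow())  # [1, 3] 2
```

After the event, every connection the pool hands out points at a surviving instance, which is the behaviour to look for when testing a real FAN-enabled pool.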
4. Transparent Application Failover (TAF): Configuration and Testing
Transparent Application Failover (TAF) provides automatic failover for user sessions in the event of an instance failure. If a client session is connected to a failed instance, TAF reconnects it to a surviving instance in the cluster. Depending on the failover type, TAF either re-establishes the session only (SESSION) or also resumes open queries (SELECT). TAF does not replay transactions: uncommitted changes are rolled back, and the application must resubmit them.
Configuring TAF for a Service:
To configure TAF, define a service that enables TAF and specify the failover method and type.
1. Create a TAF-enabled service:
srvctl add service -d <dbname> -s taf_service -r <preferred_instance> -a <available_instance> -P BASIC -e SELECT -z 180 -w 5
Here:
-P BASIC: TAF policy (BASIC reconnects at failure time).
-e SELECT: Failover type (SELECT lets open queries resume after failover).
-z 180: Number of reconnection attempts.
-w 5: Delay between attempts, in seconds.
2. Start the service:
srvctl start service -d <dbname> -s taf_service
Testing TAF Failover:
1. Connect to the database through a TNS alias that resolves to the TAF service:
sqlplus user/password@taf_service
2. Simulate a node or instance failure by stopping the instance:
srvctl stop instance -d <dbname> -n <node_name>
You should see the session reconnect to another instance. In-flight SELECT statements resume fetching from where they left off; uncommitted DML is rolled back and must be re-run.
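Conceptually, resuming a SELECT after failover means re-executing the query on the surviving instance and skipping the rows the client has already received. The sketch below imitates that idea with plain Python lists; it is a simplified model, not the TAF implementation, and it assumes the re-executed query returns the same rows in the same order (real TAF checks this and raises an error if the results differ). The function names are hypothetical.

```python
def run_query(rows):
    """Stand-in for executing a query on one instance: yields rows in order."""
    return iter(rows)


def fetch_with_select_failover(rows, fail_after):
    """Fetch all rows, simulating one instance failure after `fail_after` rows."""
    fetched = []
    for row in run_query(rows):
        fetched.append(row)
        if len(fetched) == fail_after:
            break  # the connected instance "fails" here

    # Failover: re-execute the same query on the surviving instance and
    # discard the rows that were already returned to the client.
    for index, row in enumerate(run_query(rows)):
        if index < len(fetched):
            continue
        fetched.append(row)
    return fetched


result = fetch_with_select_failover([10, 20, 30, 40, 50], fail_after=3)
print(result)  # [10, 20, 30, 40, 50]
```

The client sees one uninterrupted result set, but the surviving instance does the work of the whole query again, which is why failover of a long-running query can take noticeably longer than the reconnect itself.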
Configuring TAF:
Client-Side Configuration (tnsnames.ora):
RACDB_TAF =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = racdb)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 180)
        (DELAY = 5)
      )
    )
  )
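RETRIES and DELAY together bound how long a client keeps trying to reconnect before giving up. A quick way to sanity-check the values is to multiply them; the sketch below does that for the FAILOVER_MODE clause above. The parsing is deliberately naive (it only handles simple NAME = value pairs), and the function names are hypothetical.

```python
import re

TNS_FAILOVER_MODE = """
(FAILOVER_MODE =
  (TYPE = SELECT)
  (METHOD = BASIC)
  (RETRIES = 180)
  (DELAY = 5)
)
"""


def failover_params(text):
    """Extract the simple NAME = value pairs inside a FAILOVER_MODE clause."""
    pairs = re.findall(r"\(\s*(\w+)\s*=\s*(\w+)\s*\)", text)
    return {name.upper(): value for name, value in pairs}


def max_retry_window_seconds(params):
    """Upper bound on the time spent waiting between reconnect attempts."""
    return int(params["RETRIES"]) * int(params["DELAY"])


params = failover_params(TNS_FAILOVER_MODE)
print(max_retry_window_seconds(params))  # 900
```

With 180 retries at 5-second intervals the client waits up to 900 seconds (15 minutes), not counting the time each connection attempt itself takes. Whether that is appropriate depends on how long an instance restart or node reboot takes in your environment.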
Server-Side Configuration (Using Services):
- Create a Service with TAF Policy:
srvctl add service -d racdb -s myservice \
-r racdb1,racdb2 \
-P BASIC -e SELECT -z 180 -w 5
-P: TAF policy (BASIC, PRECONNECT, or NONE)
-e: Failover type (SESSION, SELECT, or NONE)
-z: Number of failover retries
-w: Delay between retries (seconds)
Start the Service:
srvctl start service -d racdb -s myservice
Testing TAF:
- Connect Using TAF Service:
sqlplus user/password@myservice
Execute a Long-Running Query:
SELECT COUNT(*) FROM large_table;
Simulate Instance Failure:
srvctl stop instance -d racdb -i racdb1 -o immediate
Observe Query Continuation:
After a short pause while the session reconnects, the query should resume on the surviving instance and return its result without an error.
5. Oracle Data Guard and RAC Integration for Disaster Recovery
Oracle Data Guard is Oracle’s solution for disaster recovery and high availability by maintaining standby databases. When integrated with Oracle RAC, Data Guard provides additional protection against both site-level and node-level failures.
Configuring Oracle Data Guard with RAC:
- Configure Primary and Standby Databases: The primary and the standby can each be a RAC or a single-instance database; a RAC standby keeps comparable capacity available after a role change. Use Data Guard broker or SQL commands to manage and configure the Data Guard setup.
- Create the Broker Configuration:
dgmgrl
DGMGRL> create configuration 'DGConfig' as
primary database is 'PrimaryDB'
connect identifier is 'primarydb';
Add the RAC Standby to the Data Guard Configuration and Enable It:
DGMGRL> add database 'StandbyDB' as
connect identifier is 'standbydb'
maintained as physical;
DGMGRL> enable configuration;
Switchover and Failover: You can perform a switchover or failover between the RAC primary and RAC standby databases in case of failures. Data Guard also provides automatic failover using Fast-Start Failover (FSFO).
Switchover Example:
dgmgrl
DGMGRL> switchover to 'StandbyDB';
Failover Example:
dgmgrl
DGMGRL> failover to 'StandbyDB';
Configuring Data Guard with RAC:
- Prepare the Standby Environment:
- Ensure the standby site has compatible hardware and software.
- Configure network connectivity between primary and standby sites.
- Configure Standby Redo Logs: Create them on both the primary and standby databases, for each redo thread.
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 10 ('/u02/oradata/racdb/srl1.log') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 GROUP 11 ('/u02/oradata/racdb/srl2.log') SIZE 500M;
Set Initialization Parameters:
On Primary:
ALTER SYSTEM SET LOG_ARCHIVE_CONFIG='DG_CONFIG=(racdb,stbydb)';
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=stbydb ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stbydb';
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2=ENABLE;
On Standby:
ALTER SYSTEM SET LOG_ARCHIVE_CONFIG='DG_CONFIG=(racdb,stbydb)';
-- Used only after a role change, when this database becomes the primary
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=racdb ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=racdb';
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2=ENABLE;
Enable Data Guard Broker (Optional):
On Both Primary and Standby:
ALTER SYSTEM SET DG_BROKER_START=TRUE;
Start Redo Apply on Standby:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
Testing a Data Guard Role Transition:
- Perform Switchover:
-- On primary
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
-- On standby
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
Verify Roles Swapped:
SELECT DATABASE_ROLE FROM V$DATABASE;
6. Managing Node Failures and Reconfiguration
In the event of a node failure, Oracle Clusterware automatically detects and handles the reconfiguration of the cluster to maintain high availability. RAC uses Global Cache and Enqueue Services to manage resources and lock information across the remaining nodes, ensuring the database remains available.
Steps to Handle Node Failures:
- Monitor Node Failures Using crsctl: Oracle Clusterware automatically monitors node and instance availability. You can check the health of the cluster stack using:
crsctl check cluster
Reconfigure the Cluster After a Node Failure: If necessary, you can manually reconfigure the cluster after a node failure. Use srvctl to remove or relocate services and manage instances on surviving nodes.
Remove the failed node's instance from the database configuration (only when the node is not coming back):
srvctl remove instance -d <dbname> -i <instance_name>
Add a Node Back to the Cluster: If the failed node is restored, its instance can be added back to the RAC configuration:
srvctl add instance -d <dbname> -i <instance_name> -n <node_name>
Relocate Services to Surviving Nodes: In case of a node failure, services can be relocated to an instance on another node:
srvctl relocate service -d <dbname> -s <service_name> -i <old_instance_name> -t <new_instance_name>
Example of Node Recovery:
1. Stop the failed instance:
srvctl stop instance -d racdb -n racnode1
2. Relocate the services to the instance on another node (assuming racdb1 ran on racnode1 and racdb2 runs on racnode2):
srvctl relocate service -d racdb -s myservice -i racdb1 -t racdb2
Oracle RAC handles node failures by reconfiguring the cluster and redistributing workloads.
Detecting Node Failures:
- Clusterware Monitoring:
crsctl check cluster
View Cluster Nodes:
olsnodes -n
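The output of crsctl check cluster can also be checked by a script. The sketch below scans sample output for any component that is not reported as online. The healthy sample follows the usual three-line layout; the degraded sample and the function name are illustrative assumptions, and the exact message text can differ between releases.

```python
HEALTHY = """\
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
"""

# Hypothetical degraded output for illustration only.
DEGRADED = """\
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
"""


def unhealthy_lines(output):
    """Return CRS message lines that do not report a component as online."""
    problems = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("CRS-") and not line.endswith("is online"):
            problems.append(line)
    return problems


print(len(unhealthy_lines(HEALTHY)), len(unhealthy_lines(DEGRADED)))  # 0 1
```

An empty result means every reported component is online on the node where the command ran; checking the other nodes requires running the command there or using a cluster-wide option.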
Managing Failed Nodes:
- Confirm Node Failure:
# On surviving nodes
crsctl check cluster
Check Resource Status:
crsctl status resource -t
Remove Failed Node from Cluster (If Necessary):
# As root on surviving node
crsctl delete node -n failed_node
Re-add Node to Cluster After Repair:
# From an existing cluster node, as the Grid Infrastructure owner
$GRID_HOME/addnode/addnode.sh -silent "CLUSTER_NEW_NODES={failed_node}"
Start Clusterware on Re-added Node:
crsctl start crs
Reconfiguring the Cluster:
Oracle Clusterware automatically reconfigures the cluster when nodes join or leave.
Verify Cluster Status:
crsctl check cluster -all
Check Interconnect Configuration:
oifcfg getif
Verify Services and Instances:
srvctl status service -d racdb
srvctl status instance -d racdb -n failed_node
Handling Workload Redistribution:
After a node failure, workloads are redistributed to surviving nodes. Ensure that:
- Services Are Running on Surviving Nodes:
srvctl status service -d racdb
Instances Are Balanced:
Monitor CPU and memory utilization to prevent overloading surviving nodes.
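Whether the surviving nodes can absorb a failed node's workload is simple arithmetic. The sketch below assumes the load is spread evenly and moves in full to the survivors, which is a simplification; the function names and the 100 percent ceiling are assumptions for illustration.

```python
def utilization_after_failure(node_count, utilization_pct, failed=1):
    """Per-node utilization once `failed` nodes are lost, assuming even load."""
    survivors = node_count - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return utilization_pct * node_count / survivors


def max_safe_utilization(node_count, ceiling_pct=100, failed=1):
    """Highest normal per-node utilization that stays under the ceiling after failure."""
    survivors = node_count - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return ceiling_pct * survivors / node_count


print(utilization_after_failure(2, 60))  # 120.0
print(utilization_after_failure(4, 60))  # 80.0
print(max_safe_utilization(2))           # 50.0
```

A two-node cluster running at 60 percent per node cannot absorb a node failure (the survivor would need 120 percent), while a four-node cluster at the same utilization ends up at 80 percent. In a two-node cluster, each node has to stay at or below 50 percent in normal operation for failover to be safe.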
Summary
Oracle RAC provides a robust high-availability solution through its failover and recovery mechanisms. By configuring features like FAN and TAF, you can enhance the application’s ability to respond to cluster events, minimizing downtime. Integrating Oracle RAC with Data Guard offers comprehensive disaster recovery capabilities. Effective management of node failures and cluster reconfiguration ensures continuous database availability and optimal performance.
Key Commands Recap:
- Check RAC Database Status:
srvctl status database -d racdb
Start/Stop Instances:
srvctl start instance -d racdb -i racdb1
srvctl stop instance -d racdb -i racdb1 -o immediate
Manage Services:
srvctl add service -d racdb -s myservice -r racdb1,racdb2
srvctl start service -d racdb -s myservice
srvctl status service -d racdb
Clusterware Commands:
crsctl check cluster
crsctl status resource -t
crsctl stop crs
crsctl start crs
By understanding and implementing these configurations and commands, you can ensure that your Oracle RAC environment provides the high availability and resilience required for mission-critical applications.