7: Oracle RAC and High Availability

Oracle RAC (Real Application Clusters) is designed to provide high availability (HA), ensuring that a database remains operational even if one or more components (such as a node or instance) fail. RAC's architecture offers failover and recovery mechanisms and client-side failover solutions such as FAN (Fast Application Notification) and TAF (Transparent Application Failover), and it can integrate with Oracle Data Guard for disaster recovery. This chapter explains the core concepts and provides commands and examples showing how to configure and test high availability features in Oracle RAC.


1. Oracle RAC and High Availability

High Availability in Oracle RAC refers to the ability of a system to continue operating and providing services, despite the failure of one or more components. RAC provides several mechanisms for high availability:

  • Instance Failover: If an instance fails, the surviving instances take over its workload.
  • Node Failover: If a node fails, the Oracle Clusterware automatically reconfigures the cluster, redistributing the workload.
  • Client Failover: Clients connected to a failed instance can reconnect to surviving instances using technologies like FAN and TAF.

Key Benefits of Oracle RAC for High Availability:

  • Minimal downtime during planned or unplanned outages.
  • Load balancing across multiple nodes.
  • Automatic failover and recovery.

Oracle RAC is commonly used in environments where high availability is crucial, such as financial, retail, and healthcare industries.

2. RAC Failover and Recovery Mechanisms

Oracle RAC ensures continuous availability and minimizes downtime by automatically detecting failures and redistributing workloads among surviving nodes. Key components and mechanisms include:

  • Oracle Clusterware: Monitors the health of nodes and instances, and manages the cluster’s resources.
  • Cache Fusion: Maintains data consistency across instances by sharing data blocks through the interconnect.
  • Automatic Workload Management: Distributes client connections and workloads across instances using services.

Example Scenario:

Suppose you have a two-node RAC cluster with nodes node1 and node2. If node1 fails, Oracle Clusterware detects the failure and performs the following actions:

  1. Instance Recovery: Surviving instances recover any in-flight transactions from the failed instance.
  2. Service Failover: Services running on the failed node are relocated to the surviving node.
  3. Connection Redistribution: New client connections are directed to the surviving instance.
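The three recovery actions above can be pictured with a small, self-contained simulation (plain Python, no Oracle libraries; the node and service names are hypothetical):

```python
# Toy model of service failover: when a node fails, its services are
# relocated to the surviving nodes and new connections avoid the dead node.
class ToyCluster:
    def __init__(self, placement):
        # placement: {node_name: [service, ...]}
        self.placement = {node: list(services) for node, services in placement.items()}

    def fail_node(self, failed):
        """Relocate every service from the failed node to survivors."""
        orphaned = self.placement.pop(failed)
        survivors = list(self.placement)
        for i, service in enumerate(orphaned):
            # Round-robin the orphaned services across surviving nodes.
            self.placement[survivors[i % len(survivors)]].append(service)

    def route_connection(self, service):
        """Direct a new connection to any node currently hosting the service."""
        for node, services in self.placement.items():
            if service in services:
                return node
        raise LookupError(f"service {service} is not running anywhere")

cluster = ToyCluster({"node1": ["sales_svc"], "node2": ["hr_svc"]})
cluster.fail_node("node1")
print(cluster.route_connection("sales_svc"))  # sales_svc now runs on node2
```

Real Clusterware bases these decisions on each service's preferred and available instance lists rather than simple round-robin, but the effect is the same: services survive the node they were running on.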

Commands to Monitor RAC Status:

# Check the status of the RAC database
srvctl status database -d racdb

# Check the status of instances
srvctl status instance -d racdb -i racdb1
srvctl status instance -d racdb -i racdb2

# Check cluster resources
crsctl status resource -t

Instance Failover:

When one instance in a RAC cluster fails, a surviving instance performs instance recovery on its behalf: it rolls forward using the failed instance's redo and rolls back its uncommitted transactions. The background processes that implement the Global Cache Service (GCS) and Global Enqueue Service (GES) remaster the necessary cache and lock resources so the data remains available to the remaining nodes.

Node Failover:

If a node fails, Oracle Clusterware detects the failure and automatically:

  1. Marks the node as unavailable.
  2. Redirects client connections to surviving nodes.
  3. Restarts the Oracle instances on surviving nodes, if needed.

Example: Viewing Cluster Node Status

To see the status of the nodes and resources in the cluster, use the following commands:

# List cluster nodes and their status
olsnodes -s

# Show cluster resources
crsctl stat res -t

Example: Manually Reconfiguring After Node Failure

In rare cases, manual reconfiguration might be necessary. Use srvctl to stop the instance on the failed node and redistribute services:

srvctl stop instance -d <dbname> -n <node_name>
srvctl relocate service -d <dbname> -s <service_name> -n <new_node_name>

3. Configuring Fast Application Notification (FAN) for Failover

Fast Application Notification (FAN) is a RAC feature that publishes events to inform applications about cluster changes, such as node failures or service status changes. FAN enables applications to respond quickly, minimizing the impact of failures.

Steps to Enable FAN:

  1. Enable Oracle Notification Service (ONS): FAN uses ONS to propagate events to applications. To start ONS on each RAC node:

     srvctl start ons

  2. Configure FAN with Services: FAN works best when used in conjunction with services. To enable FAN (AQ HA) notifications for a service:

     srvctl modify service -d <dbname> -s <service_name> -q TRUE

  3. Client Configuration for FAN: Ensure that the client-side connection uses Fast Connection Failover (FCF), which subscribes to FAN events. For Java clients, the ONS endpoints must be configured, for example via the oracle.ons.nodes system property.

Configuring FAN:

  1. Ensure Oracle Notification Service (ONS) is Running: ONS is responsible for publishing FAN events.

     # Check the status of the node applications (including ONS)
     srvctl status nodeapps -n node1

     # Start the node applications if ONS is not running
     srvctl start nodeapps -n node1

  2. Configure Applications to Use FAN:

     • Java Applications: Use Oracle Universal Connection Pool (UCP) or JDBC with FAN support.
     • OCI Clients: Enable FAN by setting environment variables or using APIs.

  3. Enable FAN Callouts (Optional): FAN callouts allow you to execute custom scripts in response to FAN events.

     # Create a callout script directory
     mkdir -p $GRID_HOME/racg/usrco

     # Place your custom script in the directory
     cp my_callout_script.sh $GRID_HOME/racg/usrco/

Example:

To configure a JDBC application to use FAN:

// Enable FAN for Oracle JDBC
Properties props = new Properties();
props.put(OracleConnection.CONNECTION_PROPERTY_FAN_ENABLED, "true");

Testing FAN Configuration:

  1. Simulate Node Failure:

     # Stop an instance to simulate failure
     srvctl stop instance -d racdb -i racdb1 -o immediate

  2. Observe Application Behavior: The application should quickly receive the FAN event and take appropriate action, such as reconnecting to a surviving instance.
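What a FAN-aware client does with a DOWN event can be sketched in a few lines: instead of waiting for a TCP timeout, it immediately evicts the dead instance's connections from the pool. This is a toy illustration (plain Python with a stubbed pool; the event fields loosely mimic FAN's instance/status payload, and all names are illustrative):

```python
# Sketch of Fast Connection Failover behaviour: a DOWN event for an
# instance immediately evicts that instance's connections from the pool.
class StubPool:
    def __init__(self):
        # connection id -> instance it is attached to
        self.connections = {1: "racdb1", 2: "racdb2", 3: "racdb1"}

    def evict_instance(self, instance):
        dead = [cid for cid, inst in self.connections.items() if inst == instance]
        for cid in dead:
            del self.connections[cid]
        return dead

def on_fan_event(pool, event):
    """React to a FAN-style event dict, e.g. delivered via ONS."""
    if event["status"] == "down" and event["type"] == "instance":
        return pool.evict_instance(event["instance"])
    return []

pool = StubPool()
evicted = on_fan_event(pool, {"type": "instance", "instance": "racdb1", "status": "down"})
print(evicted)            # connections 1 and 3 were on racdb1
print(pool.connections)   # only the racdb2 connection survives
```

Real FCF-capable drivers (UCP, OCI session pools) do this internally once FAN is enabled; no application-level event handling code is required.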

4. Transparent Application Failover (TAF): Configuration and Testing

Transparent Application Failover (TAF) provides automatic failover for user sessions in the event of an instance failure. If a client session is connected to a failed instance, TAF redirects the connection to a surviving instance in the cluster. Depending on the TAF configuration, Oracle can resume in-flight queries or simply re-establish the connection; uncommitted transactions are rolled back.

Configuring TAF for a Service:

To configure TAF, define a service that enables TAF and specify the failover method and type.

  1. Create a TAF-enabled service:

     srvctl add service -d <dbname> -s taf_service -r <preferred_instance> -a <available_instance> -P BASIC -e SELECT -z 180 -w 5

     Here:

     • -P BASIC: Failover method (BASIC establishes the backup connection at failover time).
     • -e SELECT: Failover type (SELECT resumes in-flight queries after failover).
     • -z 180: Number of connection retries on failover.
     • -w 5: Delay in seconds between retries.

  2. Start the service:

     srvctl start service -d <dbname> -s taf_service
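The -z (retries) and -w (delay) settings, like the client-side RETRIES and DELAY parameters, boil down to a simple retry loop: attempt the connection, wait the delay between attempts, and give up after the retry count is exhausted. A minimal sketch of that semantics (plain Python with a stubbed connect callable, no real Oracle client):

```python
import time

def connect_with_retries(connect, retries=180, delay=5, sleep=time.sleep):
    """Mimic TAF's RETRIES/DELAY semantics around a connect callable."""
    last_error = None
    for attempt in range(retries):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # wait `delay` seconds before the next attempt
    raise last_error

# Stub: the first two attempts hit the failed instance, the third succeeds.
attempts = []
def flaky_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("instance down")
    return "session-on-surviving-instance"

session = connect_with_retries(flaky_connect, retries=5, delay=0)
print(session)  # session-on-surviving-instance after 3 attempts
```

With the chapter's values (-z 180, -w 5), a client would keep retrying for up to about 15 minutes before giving up.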

Testing TAF Failover:

  1. Connect to the database using the TAF service:

     sqlplus user/password@taf_service

  2. Simulate a node or instance failure by stopping the instance:

     srvctl stop instance -d <dbname> -n <node_name>

  3. You should see the session reconnect to another instance, and any in-flight SELECT statements should continue from where they left off.

Configuring TAF:

Client-Side Configuration (tnsnames.ora):

RACDB_TAF =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = racdb)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 180)
        (DELAY = 5)
      )
    )
  )

Server-Side Configuration (Using Services):

  1. Create a Service with TAF Policy:

     srvctl add service -d racdb -s myservice \
       -r racdb1,racdb2 \
       -P BASIC -e SELECT -z 180 -w 5

     • -P: Failover method (BASIC or PRECONNECT)
     • -e: Failover type (SESSION, SELECT, or NONE)
     • -z: Failover retries
     • -w: Failover delay (seconds)

  2. Start the Service:

     srvctl start service -d racdb -s myservice

Testing TAF:

  1. Connect Using the TAF Service:

     sqlplus user/password@myservice

  2. Execute a Long-Running Query:

     SELECT COUNT(*) FROM large_table;

  3. Simulate Instance Failure:

     srvctl stop instance -d racdb -i racdb1 -o immediate

  4. Observe Query Continuation: The query should continue executing without interruption on the surviving instance.

5. Oracle Data Guard and RAC Integration for Disaster Recovery

Oracle Data Guard is Oracle's solution for disaster recovery and high availability, maintaining one or more standby databases. When integrated with Oracle RAC, Data Guard provides additional protection against both site-level and node-level failures.

Configuring Oracle Data Guard with RAC:

  1. Configure Primary and Standby Databases: Ensure that both the primary and standby databases are RAC-enabled. Use the Data Guard broker or SQL commands to manage and configure the Data Guard setup.

  2. Create the Broker Configuration:

     dgmgrl
     DGMGRL> create configuration 'DGConfig' as
         primary database is 'PrimaryDB'
         connect identifier is 'primarydb';

  3. Add the RAC Standby to the Data Guard Configuration and Enable It:

     DGMGRL> add database 'StandbyDB' as
         connect identifier is 'standbydb'
         maintained as physical;
     DGMGRL> enable configuration;

  4. Switchover and Failover: You can perform a switchover or failover between the RAC primary and RAC standby databases in case of failures. Data Guard also provides automatic failover using Fast-Start Failover (FSFO).

Switchover Example:

dgmgrl
DGMGRL> switchover to 'StandbyDB';

Failover Example:

dgmgrl
DGMGRL> failover to 'StandbyDB';

Configuring Data Guard with RAC:

  1. Prepare the Standby Environment:

     • Ensure the standby site has compatible hardware and software.
     • Configure network connectivity between the primary and standby sites.

  2. Configure Standby Redo Logs: On both the primary and standby databases, create standby redo logs for each thread, sized like the online redo logs:

     ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 10 ('/u02/oradata/racdb/srl1.log') SIZE 500M;
     ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 GROUP 11 ('/u02/oradata/racdb/srl2.log') SIZE 500M;

  3. Set Initialization Parameters:

     On the primary:

     ALTER SYSTEM SET LOG_ARCHIVE_CONFIG='DG_CONFIG=(racdb,stbydb)';
     ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=stbydb ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stbydb';
     ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2=ENABLE;

     On the standby (so that redo ships back after a role transition):

     ALTER SYSTEM SET LOG_ARCHIVE_CONFIG='DG_CONFIG=(racdb,stbydb)';
     ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=racdb ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=racdb';
     ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2=ENABLE;

  4. Enable Data Guard Broker (Optional): On both the primary and standby:

     ALTER SYSTEM SET DG_BROKER_START=TRUE;

  5. Start Redo Apply on the Standby:

     ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
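Conceptually, ASYNC redo transport means the primary keeps committing while the standby applies redo slightly behind; the gap between the two is the potential data loss on failover. A toy illustration of that idea (plain Python, no real redo, just transaction labels):

```python
# Toy ASYNC redo transport: the primary generates redo records and the
# standby applies whatever has been shipped so far; the difference is the lag.
class ToyPrimary:
    def __init__(self):
        self.redo = []            # committed redo records, in order

    def commit(self, change):
        self.redo.append(change)  # commit does not wait for the standby (ASYNC)

class ToyStandby:
    def __init__(self):
        self.applied = []

    def apply_shipped(self, redo, shipped_up_to):
        # Redo apply consumes whatever has arrived so far.
        self.applied = list(redo[:shipped_up_to])

primary, standby = ToyPrimary(), ToyStandby()
for change in ["txn1", "txn2", "txn3"]:
    primary.commit(change)

standby.apply_shipped(primary.redo, shipped_up_to=2)  # network lag: txn3 not shipped yet
lag = len(primary.redo) - len(standby.applied)
print(lag)  # 1 transaction at risk if the primary is lost right now
```

With SYNC transport the primary would instead wait for the standby's acknowledgment before completing each commit, trading commit latency for zero data loss.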

Testing Data Guard Failover:

  1. Perform a Switchover:

     -- On the primary
     ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;

     -- On the (old) standby
     ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;

  2. Verify the Roles Have Swapped (on each database):

     SELECT DATABASE_ROLE FROM V$DATABASE;

6. Managing Node Failures and Reconfiguration

In the event of a node failure, Oracle Clusterware automatically detects the failure and reconfigures the cluster to maintain high availability. RAC uses the Global Cache Service and Global Enqueue Service to remaster resource and lock information across the remaining nodes, ensuring the database remains available.
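Failure detection itself is heartbeat-based: a node that stops responding on the interconnect within the CSS timeout (the misscount, commonly 30 seconds) is evicted from the cluster. A toy sketch of that idea (plain Python; the timeout and node names are illustrative, not Oracle internals):

```python
# Toy heartbeat monitor: a node whose latest heartbeat is older than the
# timeout window is declared failed (loosely analogous to CSS misscount).
def detect_failed_nodes(last_heartbeat, now, timeout=30):
    """Return nodes whose latest heartbeat is more than `timeout` seconds old."""
    return sorted(node for node, t in last_heartbeat.items() if now - t > timeout)

heartbeats = {"node1": 100.0, "node2": 128.0}     # heartbeat timestamps, arbitrary clock
print(detect_failed_nodes(heartbeats, now=135.0))  # node1 silent for 35s -> failed
```

Real Clusterware combines network heartbeats with disk heartbeats to the voting files before deciding to evict a node, which avoids false evictions from a transient network glitch.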

Steps to Handle Node Failures:

  1. Monitor Node Status Using crsctl: Oracle Clusterware automatically monitors node and instance availability. You can check cluster health using:

     crsctl check cluster

  2. Reconfigure the Cluster After a Node Failure: If necessary, you can manually reconfigure the cluster after a node failure. Use srvctl to remove or relocate services and manage instances on surviving nodes.

     Remove a failed node's node applications:

     srvctl remove nodeapps -n <failed_node_name>

  3. Add a Node Back to the Cluster: If the failed node is restored, its instance can be added back to the RAC configuration:

     srvctl add instance -d <dbname> -i <instance_name> -n <node_name>

  4. Relocate Services to Surviving Nodes: In case of a node failure, services can be relocated to other nodes:

     srvctl relocate service -d <dbname> -s <service_name> -n <new_node_name>

Example of Node Recovery:

  1. Stop the failed instance:

     srvctl stop instance -d racdb -n racnode1

  2. Start the services on another node:

     srvctl relocate service -d racdb -s myservice -n racnode2

Oracle RAC handles node failures by reconfiguring the cluster and redistributing workloads.

Detecting Node Failures:

  • Clusterware Monitoring:

    crsctl check cluster

  • View Cluster Nodes:

    olsnodes -n

Managing Failed Nodes:

  1. Confirm Node Failure:

     # On surviving nodes
     crsctl check cluster

  2. Check Resource Status:

     crsctl status resource -t

  3. Remove Failed Node from Cluster (If Necessary):

     # As root on a surviving node
     crsctl delete node -n failed_node

  4. Re-add Node to Cluster After Repair:

     # From an existing cluster node
     $GRID_HOME/addnode/addnode.sh -silent "CLUSTER_NEW_NODES={failed_node}"

  5. Start Clusterware on the Re-added Node:

     crsctl start crs

Reconfiguring the Cluster:

Oracle Clusterware automatically reconfigures the cluster when nodes join or leave.

  • Verify Cluster Status:

    crsctl check cluster -all

  • Check Interconnect Configuration:

    oifcfg getif

  • Verify Services and Instances:

    srvctl status service -d racdb
    srvctl status instance -d racdb -n failed_node

Handling Workload Redistribution:

After a node failure, workloads are redistributed to surviving nodes. Ensure that:

  • Services Are Running on Surviving Nodes:

    srvctl status service -d racdb

  • Instances Are Balanced: Monitor CPU and memory utilization to prevent overloading the surviving nodes.
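A quick way to reason about balance is to compare per-instance session counts and flag skew beyond some threshold. A small sketch (plain Python; in practice the counts would come from a view such as GV$SESSION, and the threshold here is arbitrary):

```python
def is_balanced(sessions_per_instance, max_skew=1.5):
    """True if the busiest instance carries at most max_skew times the average load."""
    counts = list(sessions_per_instance.values())
    average = sum(counts) / len(counts)
    return max(counts) <= max_skew * average

print(is_balanced({"racdb2": 120, "racdb3": 100}))  # load is spread evenly
print(is_balanced({"racdb2": 300, "racdb3": 20}))   # racdb2 is overloaded
```

If the check fails, relocating one or more services with srvctl relocate service is the usual remedy.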

Summary

Oracle RAC provides a robust high-availability solution through its failover and recovery mechanisms. By configuring features like FAN and TAF, you can improve an application's ability to respond to cluster events, minimizing downtime. Integrating Oracle RAC with Data Guard adds comprehensive disaster recovery capabilities. Effective management of node failures and cluster reconfiguration ensures continuous database availability and optimal performance.

Key Commands Recap:

  • Check RAC Database Status:

    srvctl status database -d racdb

  • Start/Stop Instances:

    srvctl start instance -d racdb -i racdb1
    srvctl stop instance -d racdb -i racdb1 -o immediate

  • Manage Services:

    srvctl add service -d racdb -s myservice -r racdb1,racdb2
    srvctl start service -d racdb -s myservice
    srvctl status service -d racdb

  • Clusterware Commands:

    crsctl check cluster
    crsctl status resource -t
    crsctl stop crs
    crsctl start crs

By understanding and implementing these configurations and commands, you can ensure that your Oracle RAC environment provides the high availability and resilience required for mission-critical applications.


      About SandeepSingh

Hi, I have been working in the IT industry for more than 15 years. I have worked as an Oracle DBA, handling databases such as Oracle, SQL Server, and DB2, and as a development and database administrator.
