About this task

This scenario is based on a PowerHA environment that uses external storage replication, either Metro Mirror or Global Mirror. The example environment in this scenario uses Global Mirror, although the same sequence of steps applies to a Metro Mirror environment.

In this scenario:

  • Node A is the Primary cluster node with the production copy of the IASP.

  • Node B is the Backup cluster node, with the mirror copy of the IASP.

  • Replication is occurring in the direction from Node A to Node B.

  • A site failure in Data Center A causes the need to fail production over to Node B making it the Primary cluster node.

Procedure

Tip

Did you know?
All of the steps below leave the data at Data Center A as-is. There is no automatic reverse replication from the new primary back to Data Center A.

Follow these steps as soon as the failure at Data Center A occurs. The decision about where to restart the production workload (at Data Center B, or at Data Center A once it is back online) can be made later. Performing these steps keeps both options open and helps meet your recovery time objective (RTO) if the decision is made to restart the production workload at Data Center B.

  1. Begin with the environment pictured above.

  2. When an unplanned failure of Node A, External Storage 1, or both occurs in Data Center A, follow the procedures below to fail over to Data Center B:

Possible Failure Scenarios:

Automatic Failover from Node A to Node B

In many instances, Node A will send out a distress message as the node is going down. This indicates to Node B that it should take over and causes automatic failover processing to start.

Requirements for Automatic Failover:

  • The Cluster Resource Group (CRG) must have had a status of active prior to the failure.

  • The target node in the CRG recovery domain must have had a status of active in the CRG recovery domain prior to the failure.

  • The Metro Mirror or Global Mirror replication must have a copy status of active at the time of the failover.

  • The failing node must have an opportunity to send out a distress message.

  • There cannot be a QCST_CRG_CANCEL_FAILOVER policy disabling automatic failover for the type of failure event that occurred.

In the event that the requirements for an automatic failover are not met, the steps below walk through appropriate manual failover steps.
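
The requirements listed above must all hold at once, which can be read as a simple checklist. The following Python sketch is illustrative only: the `FailoverPreconditions` class, its field names, and the `unmet_requirements` function are hypothetical and are not part of any PowerHA interface. It assumes an operator has already gathered each value from the cluster displays.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailoverPreconditions:
    """Hypothetical snapshot of cluster state at the time of the failure."""
    crg_was_active: bool                  # CRG status was Active before the failure
    target_node_was_active: bool          # target node was Active in the recovery domain
    replication_copy_active: bool         # Metro Mirror / Global Mirror copy status was active
    distress_message_sent: bool           # failing node had the chance to send a distress message
    cancel_failover_policy_applies: bool  # a QCST_CRG_CANCEL_FAILOVER policy covers this failure type


def unmet_requirements(state: FailoverPreconditions) -> List[str]:
    """Return the automatic-failover requirements that are not met.

    An empty list means every listed requirement holds.
    """
    checks = [
        (state.crg_was_active, "CRG was not Active before the failure"),
        (state.target_node_was_active, "target node was not Active in the recovery domain"),
        (state.replication_copy_active, "replication copy status was not active"),
        (state.distress_message_sent, "failing node could not send a distress message"),
        (not state.cancel_failover_policy_applies,
         "a QCST_CRG_CANCEL_FAILOVER policy disables automatic failover"),
    ]
    return [reason for ok, reason in checks if not ok]
```

Any non-empty result points toward the manual procedures (Resolving a Cluster Partition or Detaching Replication) rather than the automatic path.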

Automatic Failover Procedure

  1. On the node that will become the new primary node (Node B): use either WRKCLU, option 6 Work with Cluster Nodes or the DSPCLUINF command to display the current cluster node status.

  2. Verify that the status of the original primary node (Node A) is Inactive P, Inactive, or Failed.

Note: If the status of the node shows as Partition, continue to Resolving a Cluster Partition below.

3. If the status is Inactive P, wait until the status changes to either Inactive or Failed, pressing F5=Refresh to refresh the panel.

4. If a failover message queue is defined for either the CRG or the Cluster, a message will be present in the failover message queue asking to proceed or cancel the failover.

Answering the Failover Message Queue
  1. Display the message queue on Node B. For example, if the message queue is QSYSOPR, the command is: DSPMSG MSGQ(QSYS/QSYSOPR).

  2. If a failover message queue is defined, and a failover has been triggered, message CPABB02: Cluster resource groups are failing over to node NODEB will be in the message queue.

  3. Answer G (Go) to continue with the failover.

Note: If the CPABB02 message was already answered with C (Cancel) to cancel the failover, continue to Detaching Replication below.

5. Wait for the PowerHA failover processing to complete.

Monitoring the PowerHA automatic failover process
  1. Display the cluster resource groups using the DSPCRGINF (Display CRG Information) command by typing DSPCRGINF with no parameters.

  2. If the CRG shows a status of Switchover Pending, the cluster resource group is in the process of performing a failover. Use F5=Refresh to refresh the screen until the CRG no longer has a status of Switchover Pending.

  3. Once the CRG no longer shows a status of Switchover Pending, follow the appropriate steps below, depending on the CRG status:

    Active CRG
    If the CRG shows a status of Active, and the primary node is now the new primary (Node B in this example), the failover is now complete. Continue to varying on the IASP below.


    Inactive CRG - Primary node has not switched
    If the CRG shows a status of Inactive and the primary node is still Node A, continue to Detaching Replication below.

6. Vary on the IASP on the new primary node (NODEB)

Varying on the IASP on the new primary node

Work with the configuration status of the independent ASP (IASP) by using the WRKCFGSTS command. For example: WRKCFGSTS CFGTYPE(*DEV) CFGD(MYIASP). Follow the steps below depending on the IASP status:

VARIED OFF

  1. If the IASP has a status of VARIED OFF, use Option 1 to vary on the IASP.

  2. In another session, use the Display ASP Status Command (DSPASPSTS) to monitor the vary on progress.

AVAILABLE

If the IASP has a status of AVAILABLE, the IASP is already varied on and data on Node B can be accessed.

Tip

The CRG automatically varies on the IASP as part of the failover if the device has Configuration object online set to *ONLINE in the cluster resource group.

7. Once the vary on of the IASP is complete, data on Node B can be accessed.

8. For information on choosing a direction to restart replication see Restoring the Environment below.
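
The branching in the automatic failover procedure above can be summarized as a small decision function. This is a sketch under stated assumptions, not a PowerHA interface: `next_step` is a hypothetical helper, the status strings are the ones shown in this procedure, and the node names NODEA and NODEB come from the example environment. The failover message queue step is left out for brevity, and status combinations the procedure does not describe raise an error rather than guess.

```python
def next_step(node_a_status: str, crg_status: str, crg_primary: str) -> str:
    """Map the displayed statuses to the next step of the automatic failover procedure."""
    # Step 2 and its note: check the status of the original primary node.
    if node_a_status == "Partition":
        return "Resolving a Cluster Partition"
    if node_a_status == "Inactive P":
        return "Wait: refresh until Node A is Inactive or Failed"
    if node_a_status not in ("Inactive", "Failed"):
        raise ValueError(f"node status not covered by the procedure: {node_a_status}")

    # Step 5: monitor the cluster resource group.
    if crg_status == "Switchover Pending":
        return "Wait: refresh until the CRG leaves Switchover Pending"
    if crg_status == "Active" and crg_primary == "NODEB":
        return "Vary on the IASP on Node B"
    if crg_status == "Inactive" and crg_primary == "NODEA":
        return "Detaching Replication"
    raise ValueError(
        f"CRG state not covered by the procedure: {crg_status}, primary {crg_primary}"
    )
```

For example, `next_step("Failed", "Active", "NODEB")` leads to the vary-on step, while `next_step("Failed", "Inactive", "NODEA")` leads to Detaching Replication.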

Resolving a Cluster Partition

In some instances, Node A does not have the opportunity to send out a distress message before it fails. In these instances, PowerHA cannot automatically determine whether the failure is a true system failure or a temporary communication failure that will resolve itself.

Tip

Did you know?

You can increase PowerHA's ability to detect failures by utilizing PowerHA's HMC Advanced Node Failure Detection. See Advanced Node Failure Detection for more information.

Resolving a Cluster Partition Procedure

  1. On the node that will become the new primary node (Node B): use either WRKCLU, option 6 Work with Cluster Nodes or the DSPCLUINF command to display the current cluster node status.

  2. Verify that the status of the original primary node (Node A) is Partition.

  3. Use the Change Cluster Node Entry (CHGCLUNODE) command with the *CHGSTS option to change the status of the node from Partition to Failed. This indicates to PowerHA that the node is actually down, and that the partition condition is not the result of a temporary network communication issue. In this example, the command is: CHGCLUNODE CLUSTER(MYCLU) NODE(NODEA) OPTION(*CHGSTS)

  4. Display the cluster resource groups using the DSPCRGINF (Display CRG Information) command by typing DSPCRGINF with no parameters.

  5. Follow the appropriate steps below, depending on the CRG status and primary node:

    1. If the CRG status is Inactive and the primary node is the new primary node (NODEB), continue to step 6, varying on the IASP.

    2. If the CRG status is Inactive and the primary node is the original primary node (NODEA), continue to Detaching Replication below.

  6. Vary on the IASP on the new primary node (NODEB)

Varying on the IASP on the new primary node

Work with the configuration status of the independent ASP (IASP) by using the WRKCFGSTS command. For example: WRKCFGSTS CFGTYPE(*DEV) CFGD(MYIASP). Follow the steps below:

  1. If the IASP has a status of VARIED OFF, use Option 1 to vary on the IASP.

  2. In another session, use the Display ASP Status Command (DSPASPSTS) to monitor the vary on progress.

7. Once the vary on of the IASP is complete, data on Node B can be accessed.

8. For information on choosing a direction to restart replication see Restoring the Environment below.
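
As a rough illustration of the partition procedure above, the sketch below builds the CHGCLUNODE command string from the example in step 3 and then selects the follow-up step from the CRG display. Both function names are hypothetical helpers, and the cluster and node names are the example values from this scenario. The code only formats text; it does not run anything on the system.

```python
def change_status_command(cluster: str, node: str) -> str:
    """Format the CHGCLUNODE command that marks a partitioned node as Failed."""
    return f"CHGCLUNODE CLUSTER({cluster}) NODE({node}) OPTION(*CHGSTS)"


def step_after_status_change(crg_status: str, crg_primary: str,
                             new_primary: str, original_primary: str) -> str:
    """Choose the follow-up step once the node status has been changed to Failed."""
    if crg_status != "Inactive":
        # The procedure above only describes the Inactive case.
        raise ValueError(f"CRG status not covered by the procedure: {crg_status}")
    if crg_primary == new_primary:
        return "Vary on the IASP on the new primary node"
    if crg_primary == original_primary:
        return "Detaching Replication"
    raise ValueError(f"unexpected primary node: {crg_primary}")
```

With the example names, `change_status_command("MYCLU", "NODEA")` reproduces the command shown in step 3.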

Detaching Replication

These steps should be followed only if there is still no access to the IASP on Node B after following the steps in Automatic Failover and Resolving a Cluster Partition.

The steps under Automatic Failover and Resolving a Cluster Partition enable access to data on Node B only when replication is active at the time of failover processing. When replication is not active at that time, the data at Data Center B may be back-level, so additional procedures are required.

Warning: In Global Mirror environments, because Global Mirror is asynchronous, some data may not have been received by the storage at Data Center B when these procedures are followed. This potential loss of data is the Recovery Point Objective (RPO) trade-off of allowing Global Mirror to span the globe.

Detaching Replication Procedure

  1. Display the cluster resource groups using the DSPCRGINF (Display CRG Information) command by typing DSPCRGINF with no parameters.

  2. If the status of the CRG is anything other than Inactive, end the cluster resource group with the ENDCRG command. In this scenario, the command is: ENDCRG CLUSTER(MYCLU) CRG(MYCRG)

  3. Detach replication by using the *DETACH option on the Change Session Command. See the appropriate procedure depending on the type of replication:

Detaching SAN Volume Controller (SVC) or Storwize Metro Mirror or Global Mirror Sessions

To detach SVC/Storwize Metro Mirror or Global Mirror Sessions, use the CHGSVCSSN command. For example: CHGSVCSSN SSN(MYGMIRSSN) OPTION(*DETACH)

Detaching Copy Services Manager (CSM) Metro Mirror or Global Mirror Sessions

To detach DS8000 CSM Metro Mirror or Global Mirror Sessions, use the CHGCSMSSN command. For example: CHGCSMSSN SSN(MYGMIRSSN) OPTION(*DETACH)

Detaching DS8000 ASP Metro Mirror or Global Mirror Sessions

To detach DS8000 ASP Metro Mirror or Global Mirror Sessions, use the CHGASPSSN command. For example: CHGASPSSN SSN(MYGMIRSSN) OPTION(*DETACH)

4. Vary on the IASP on Node B:

Varying on the IASP on the new primary node
  1. Work with the configuration status of the independent ASP (IASP) by using the WRKCFGSTS command. For example: WRKCFGSTS CFGTYPE(*DEV) CFGD(MYIASP).

  2. Use Option 1 to vary on the IASP.

  3. In another session, use the Display ASP Status Command (DSPASPSTS) to monitor the vary on progress.

5. Once the vary on of the IASP is complete, data on Node B can be accessed.

6. For information on choosing a direction to restart replication see Restoring the Environment below.
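
Steps 2 and 3 of the detach procedure above depend on the CRG status and on the type of replication. The sketch below, offered as an illustration only, collects the matching command strings from the examples in this section. The dictionary keys (`SVC`, `CSM`, `DS8000_ASP`) and the `detach_commands` function are hypothetical labels chosen for this example. The output is text for an operator to review, not something the code executes.

```python
from typing import List

# Change Session command for each replication type, as listed in this section.
# The keys are hypothetical labels used only by this sketch.
DETACH_COMMANDS = {
    "SVC": "CHGSVCSSN",         # SAN Volume Controller or Storwize sessions
    "CSM": "CHGCSMSSN",         # DS8000 Copy Services Manager sessions
    "DS8000_ASP": "CHGASPSSN",  # DS8000 ASP sessions
}


def detach_commands(crg_status: str, replication_type: str,
                    cluster: str, crg: str, session: str) -> List[str]:
    """List the CL commands for steps 2 and 3 of the detach procedure, in order."""
    commands = []
    # Step 2: end the CRG unless it is already Inactive.
    if crg_status != "Inactive":
        commands.append(f"ENDCRG CLUSTER({cluster}) CRG({crg})")
    # Step 3: detach the session with the command for this replication type.
    command = DETACH_COMMANDS[replication_type]
    commands.append(f"{command} SSN({session}) OPTION(*DETACH)")
    return commands
```

An unknown replication type raises `KeyError`, which is deliberate: the sketch refuses to guess a command that this section does not list.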

Restoring the Environment

  1. On the new primary cluster node (Node B): use either WRKCLU, option 6 Work with Cluster Nodes or the DSPCLUINF command to display the current cluster node status.

  2. Verify that the status of all nodes is Active.

  3. If one or more nodes have a status of either Inactive or Failed, start clustering by using the STRCLUNOD command. For example: STRCLUNOD CLUSTER(MYCLU) NODE(NODEA)

  4. End the cluster resource group if it currently has a status of Active or Indoubt by using the ENDCRG command. For example: ENDCRG CLUSTER(MYCLU) CRG(MYCRG)

  5. Display the PowerHA replication session using the display session command. See the appropriate procedure depending on the type of replication:

Displaying SAN Volume Controller (SVC) or Storwize Metro Mirror or Global Mirror Sessions

To display SVC/Storwize Metro Mirror or Global Mirror Sessions, use the DSPSVCSSN command. For example: DSPSVCSSN SSN(MYGMIRSSN)

Displaying Copy Services Manager (CSM) Metro Mirror or Global Mirror Sessions

To display DS8000 CSM Metro Mirror or Global Mirror Sessions, use the DSPCSMSSN command. For example: DSPCSMSSN SSN(MYGMIRSSN)

Displaying DS8000 ASP Metro Mirror or Global Mirror Sessions

To display DS8000 ASP Metro Mirror or Global Mirror Sessions, use the DSPASPSSN command. For example: DSPASPSSN SSN(MYGMIRSSN)

6. Verify that the copy status is detached.

7. Verify the source node in the PowerHA session is the node that contains the copy of the data to keep. Data at the target node will be overwritten. If the source and target node are reversed, use the CHGCRG command to correct the source and target node. For example, if Node B is currently the primary and Node A is the backup node, but the desired copy of the IASP to keep is the copy on Node A, a command similar to the following would be used:
CHGCRG CLUSTER(MYCLU)
CRG(MYCRG)
CRGTYPE(*DEV)
RCYDMNACN(*CHGCUR)
RCYDMN((NODEB *BACKUP 1 DATACTRB *SAME *NONE)
(NODEA *PRIMARY *LAST DATACTRA *SAME *NONE))

8. Verify the IASP is varied off on the target node.

9. Reattach the PowerHA session using the Change Session command. See the appropriate procedure depending on the type of replication:

Reattaching SAN Volume Controller (SVC) or Storwize Metro Mirror or Global Mirror Sessions

To reattach SVC/Storwize Metro Mirror or Global Mirror Sessions, use the CHGSVCSSN command. For example: CHGSVCSSN SSN(MYGMIRSSN) OPTION(*REATTACH)

Reattaching Copy Services Manager (CSM) Metro Mirror or Global Mirror Sessions

To reattach DS8000 CSM Metro Mirror or Global Mirror Sessions, use the CHGCSMSSN command. For example: CHGCSMSSN SSN(MYGMIRSSN) OPTION(*REATTACH)

Reattaching DS8000 ASP Metro Mirror or Global Mirror Sessions

To reattach DS8000 ASP Metro Mirror or Global Mirror Sessions, use the CHGASPSSN command. For example: CHGASPSSN SSN(MYGMIRSSN) OPTION(*REATTACH)

10. A confirmation panel for the reattach is displayed. Verify that the source and target nodes are correct, then press F16 to confirm.

Warning

Important: The data on the node listed as the target node on the confirmation panel will be overwritten by the data on the node listed as the source node. If the nodes are incorrect, press F12 to cancel the operation and return to step 7 to correct the recovery domain using the CHGCRG command.

11. Start the Cluster Resource Group using the STRCRG command. For example: STRCRG CLUSTER(MYCLU) CRG(MYCRG)

12. If the current primary node is not the desired primary node, perform a switchover using the CHGCRGPRI command. For example: CHGCRGPRI CLUSTER(MYCLU) CRG(MYCRG)
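
Because a reattach overwrites the data on the target node, it can help to treat steps 6 through 8 above as explicit preconditions. The following sketch is a hypothetical checklist and not a PowerHA interface: the `SessionView` class and its fields are assumed to be filled in by hand from the session display and from WRKCFGSTS, and `reattach_blockers` only reports which checks fail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SessionView:
    """Hypothetical record of what the operator reads from the displays."""
    copy_status: str         # copy status shown for the replication session
    source_node: str         # node currently listed as the source
    target_node: str         # node currently listed as the target
    target_iasp_status: str  # IASP status on the target node, from WRKCFGSTS


def reattach_blockers(view: SessionView, node_to_keep: str) -> List[str]:
    """Return the reasons a reattach should not be issued yet (empty list = proceed)."""
    blockers = []
    # Step 6: the copy status must be detached.
    if view.copy_status.lower() != "detached":
        blockers.append("copy status is not detached")
    # Step 7: the source must hold the copy to keep; the target is overwritten.
    if view.source_node != node_to_keep:
        blockers.append("source node is not the copy to keep; correct the recovery domain with CHGCRG")
    # Step 8: the IASP must be varied off on the target node.
    if view.target_iasp_status != "VARIED OFF":
        blockers.append("IASP is not varied off on the target node")
    return blockers
```

An empty result corresponds to the point in the procedure where the reattach command is issued and the confirmation panel is reviewed. The panel itself remains the final check.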