About this task

This scenario is based on a PowerHA environment that uses external storage Metro Mirror or Global Mirror. The example environment in this scenario uses Global Mirror, although the same sequence of steps applies to a Metro Mirror environment.

In this scenario:

Node A is the Primary cluster node with the production copy of the IASP.
Node B is the Backup cluster node, with the mirror copy of the IASP.
Replication is occurring in the direction from Node A to Node B.
A site failure in Data Center A requires production to fail over to Node B, which then becomes the Primary cluster node.

Procedure

Did you know?
All of the steps below leave the data at Data Center A as-is. There is no automatic reverse replication from the new primary back to Data Center A.

Follow these steps as soon as the failure at Data Center A occurs. The decision about where to restart the production workload (at Data Center B, or at Data Center A once it is back online) can be made later. Performing these steps keeps both options open and shortens the recovery time, which helps meet your recovery time objective (RTO), if the decision is made to restart the production workload at Data Center B.

Begin with the environment described above.
When an unplanned failure of Node A and/or External Storage 1 in Data Center A occurs, use the following procedures to fail over to Data Center B:

Possible Failure Scenarios:

Automatic Failover from Node A to Node B

In many instances, Node A will send out a distress message as the node is going down. This indicates to Node B that it should take over and causes automatic failover processing to start.

Requirements for Automatic Failover:

The Cluster Resource Group (CRG) must have had a status of active prior to the failure.
The target node in the CRG recovery domain must have had a status of active in the CRG recovery domain prior to the failure.
The Metro Mirror or Global Mirror replication must have a copy status of active at the time of the failover.
The failing node must have an opportunity to send out a distress message.
There cannot be a QCST_CRG_CANCEL_FAILOVER policy disabling automatic failover for the type of failure event that occurred.

If the requirements for an automatic failover are not met, the procedures below describe the appropriate manual failover steps.
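The requirements above behave like a checklist in which every item must hold. As a minimal Python sketch (this is not PowerHA code; the class, field, and function names are hypothetical and only mirror the list above), the check could be modeled like this:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreFailureState:
    """Hypothetical snapshot of the conditions listed above."""
    crg_active: bool              # CRG status was Active before the failure
    target_node_active: bool      # target node was Active in the recovery domain
    replication_active: bool      # Metro Mirror / Global Mirror copy status was Active
    distress_message_sent: bool   # failing node was able to send a distress message
    cancel_policy_applies: bool   # a QCST_CRG_CANCEL_FAILOVER policy covers this event


def unmet_requirements(state: PreFailureState) -> List[str]:
    """Return the requirements that block an automatic failover (empty if none)."""
    checks = [
        (state.crg_active, "CRG was not active"),
        (state.target_node_active, "target node was not active in the recovery domain"),
        (state.replication_active, "replication copy status was not active"),
        (state.distress_message_sent, "failing node could not send a distress message"),
        (not state.cancel_policy_applies, "a cancel-failover policy applies to this event"),
    ]
    return [reason for ok, reason in checks if not ok]
```

An empty result means all five requirements were met; any entry in the list points toward the manual procedures that follow.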

Automatic Failover Procedure

1. On the node that will become the new primary node (Node B): use either WRKCLU, option 6 Work with Cluster Nodes, or the DSPCLUINF command to display the current cluster node status.
2. Verify that the status of the original primary node (Node A) is Inactive P, Inactive, or Failed.

Note: If the status of the node shows as Partition, continue to Resolving a Cluster Partition below.

3. If the status is Inactive P, wait until the status changes to either Inactive or Failed, pressing F5=Refresh to refresh the panel.

4. If a failover message queue is defined for either the CRG or the Cluster, a message will be present in the failover message queue asking to proceed or cancel the failover.

Display the message queue on Node B. For example, if the message queue is QSYSOPR, the command is: DSPMSG MSGQ(QSYS/QSYSOPR).
If a failover message queue is defined, and a failover has been triggered, message CPABB02: Cluster resource groups are failing over to node NODEB will be in the message queue.
Answer G (Go) to continue with the failover.

Note: If the CPABB02 message already was answered with C (Cancel) to cancel the failover, continue to Detaching Replication below.

5. Wait for the PowerHA failover processing to complete.

Display the cluster resource groups using the DSPCRGINF (Display CRG Information) command by typing DSPCRGINF with no parameters.
If the CRG shows a status of Switchover Pending, the cluster resource group is in the process of performing a failover. Use F5=Refresh to refresh the screen until the CRG no longer has a status of Switchover Pending.
Once the CRG no longer shows a status of Switchover Pending, follow the appropriate steps below, depending on the CRG status:

Active CRG
If the CRG shows a status of Active, and the primary node is now the new primary (Node B in this example), the failover is complete. Continue to step 6 to vary on the IASP.

Inactive CRG - Primary node has not switched
If the CRG shows a status of Inactive and the primary node is still Node A, continue to Detaching Replication below.

6. Vary on the IASP on the new primary node (NODEB)

Work with the configuration status of the independent ASP (IASP) by using the WRKCFGSTS command. For example: WRKCFGSTS CFGTYPE(*DEV) CFGD(MYIASP). Follow the steps below depending on the IASP status:

VARIED OFF

If the IASP has a status of VARIED OFF, use Option 1 to vary on the IASP.
In another session, use the Display ASP Status Command (DSPASPSTS) to monitor the vary on progress.

AVAILABLE

If the IASP has a status of AVAILABLE, the IASP already is varied on and data on Node B can be accessed.

7. Once the vary on of the IASP is complete, data on Node B can be accessed.

8. For information on choosing a direction to restart replication see Restoring the Environment below.
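The branching in the Automatic Failover Procedure can be summarized as two small lookups: one on the status of the original primary node, and one on the CRG status once failover processing has run. The following Python sketch is illustrative only; the function names and returned action strings are hypothetical, and the status values are the ones named in the steps above.

```python
def next_step_for_node_status(status: str) -> str:
    """Map the status of the original primary node (Node A) to the next action."""
    if status == "Inactive P":
        return "wait and refresh"            # step 3: press F5 until the status changes
    if status in ("Inactive", "Failed"):
        return "continue failover"           # proceed with step 4
    if status == "Partition":
        return "resolve cluster partition"   # see Resolving a Cluster Partition
    return "not covered by this procedure"


def next_step_for_crg(crg_status: str, primary: str, new_primary: str = "NODEB") -> str:
    """Map the CRG status and current primary node to the next action (step 5)."""
    if crg_status == "Switchover Pending":
        return "wait and refresh"            # failover is still in progress
    if crg_status == "Active" and primary == new_primary:
        return "vary on IASP"                # failover complete, go to step 6
    if crg_status == "Inactive" and primary != new_primary:
        return "detach replication"          # primary has not switched
    return "not covered by this procedure"
```

For example, `next_step_for_crg("Inactive", "NODEA")` points to Detaching Replication, matching the "Inactive CRG - Primary node has not switched" case.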

Resolving a Cluster Partition

In some instances, Node A does not have the opportunity to send out a distress message before the node fails. In these instances, PowerHA is unable to automatically determine if the failure is a true failure of a system, or a temporary communication failure that will automatically resolve itself.

Resolving a Cluster Partition Procedure

1. On the node that will become the new primary node (Node B): use either WRKCLU, option 6 Work with Cluster Nodes, or the DSPCLUINF command to display the current cluster node status.
2. Verify that the status of the original primary node (Node A) is Partition.
3. Use the Change Cluster Node Entry (CHGCLUNODE) command with the *CHGSTS option to change the status of the node from Partition to Failed. This indicates to PowerHA that the node is actually down, and that the partition condition is not the result of a temporary network communication issue. In this example, the command is: CHGCLUNODE CLUSTER(MYCLU) NODE(NODEA) OPTION(*CHGSTS)
4. Display the cluster resource groups using the Display CRG Information (DSPCRGINF) command by typing DSPCRGINF with no parameters.
5. Follow the appropriate steps below, depending on the CRG status and primary node:

If the CRG status is Inactive and the primary node is the new primary node (NODEB), continue to step 6, varying on the IASP.
If the CRG status is Inactive and the primary node is the original primary node (NODEA), continue to Detaching Replication below.

6. Vary on the IASP on the new primary node (NODEB)

Work with the configuration status of the independent ASP (IASP) by using the WRKCFGSTS command. For example: WRKCFGSTS CFGTYPE(*DEV) CFGD(MYIASP). Follow the steps below:

If the IASP has a status of VARIED OFF, use Option 1 to vary on the IASP.
In another session, use the Display ASP Status Command (DSPASPSTS) to monitor the vary on progress.

7. Once the vary on of the IASP is complete, data on Node B can be accessed.

8. For information on choosing a direction to restart replication see Restoring the Environment below.
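As a rough illustration of the partition procedure above, the Python sketch below lists the actions in order: the status change, the CRG display, and then the branch on what the CRG display shows. The function name, parameter names, and action strings are hypothetical; the CHGCLUNODE command text is the example from step 3, and the CRG status passed in is assumed to be the one observed after that command has run.

```python
from typing import List


def partition_recovery_actions(
    crg_status: str,
    primary_node: str,
    failed_node: str = "NODEA",
    new_primary: str = "NODEB",
    cluster: str = "MYCLU",
) -> List[str]:
    """List the actions for a partitioned original primary node, in order."""
    actions = [
        # Step 3: tell PowerHA the partitioned node is really down.
        f"CHGCLUNODE CLUSTER({cluster}) NODE({failed_node}) OPTION(*CHGSTS)",
        # Step 4: display the cluster resource groups.
        "DSPCRGINF",
    ]
    # Step 5: branch on the CRG status and the current primary node.
    if crg_status == "Inactive" and primary_node == new_primary:
        actions.append("vary on IASP")
    elif crg_status == "Inactive" and primary_node == failed_node:
        actions.append("detach replication")
    else:
        actions.append("not covered by this procedure")
    return actions
```

The last entry of the returned list names the branch to follow: varying on the IASP, or the Detaching Replication procedure.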

Detaching Replication

These steps should be followed only if there is still no access to the IASP on Node B after following the steps in Automatic Failover and Resolving a Cluster Partition.

The steps under Automatic Failover and Resolving a Cluster Partition only enable access to data on Node B when replication is active at the time of the failover processing. When replication is not active at the time of failover processing, the data at Data Center B may be back-level, so additional procedures are required.

Detaching Replication Procedure

1. Display the cluster resource groups using the Display CRG Information (DSPCRGINF) command by typing DSPCRGINF with no parameters.
2. If the status of the CRG is anything other than Inactive, end the cluster resource group with the ENDCRG command. In this scenario, the command is: ENDCRG CLUSTER(MYCLU) CRG(MYCRG)
3. Detach replication by using the *DETACH option on the Change Session Command. See the appropriate procedure depending on the type of replication:

To detach SVC/Storwize Metro Mirror or Global Mirror Sessions, use the CHGSVCSSN command. For example: CHGSVCSSN SSN(MYGMIRSSN) OPTION(*DETACH)

4. Vary on the IASP on Node B:

5. Once the vary on of the IASP is complete, data on Node B can be accessed.

6. For information on choosing a direction to restart replication see Restoring the Environment below.
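The only conditional part of the Detaching Replication Procedure is whether the CRG must be ended first. A minimal Python sketch of that ordering follows; the function name is hypothetical, the command strings are the examples from this scenario, and the CHGSVCSSN line assumes an SVC/Storwize session as in the example above.

```python
from typing import List


def detach_plan(crg_status: str) -> List[str]:
    """Return the ordered actions for detaching replication, given the CRG status."""
    steps = []
    # Step 2: the CRG must be Inactive before the session is detached.
    if crg_status != "Inactive":
        steps.append("ENDCRG CLUSTER(MYCLU) CRG(MYCRG)")
    # Step 3: detach the replication session (SVC/Storwize example).
    steps.append("CHGSVCSSN SSN(MYGMIRSSN) OPTION(*DETACH)")
    # Step 4: described in words because this procedure gives no command for it.
    steps.append("Vary on the IASP on Node B")
    return steps
```

With an Inactive CRG the plan starts directly at the detach; with any other status, ENDCRG comes first.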

Restoring the Environment

1. On the new primary cluster node (Node B): use either WRKCLU, option 6 Work with Cluster Nodes, or the DSPCLUINF command to display the current cluster node status.
2. Verify that the status of all nodes is Active.
3. If one or more nodes have a status of either Inactive or Failed, start clustering by using the STRCLUNOD command. For example: STRCLUNOD CLUSTER(MYCLU) NODE(NODEA)
4. End the cluster resource group if it currently has a status of Active or Indoubt by using the ENDCRG command. For example: ENDCRG CLUSTER(MYCLU) CRG(MYCRG)
5. Display the PowerHA replication session using the display session command. See the appropriate procedure depending on the type of replication:

6. Verify that the copy status is detached.

7. Verify the source node in the PowerHA session is the node that contains the copy of the data to keep. Data at the target node will be overwritten. If the source and target node are reversed, use the CHGCRG command to correct the source and target node. For example, if Node B is currently the primary and Node A is the backup node, but the desired copy of the IASP to keep is the copy on Node A, a command similar to the following would be used:
CHGCRG CLUSTER(MYCLU)
CRG(MYCRG)
CRGTYPE(*DEV)
RCYDMNACN(*CHGCUR)
RCYDMN((NODEB *BACKUP 1 DATACTRB *SAME *NONE)
(NODEA *PRIMARY *LAST DATACTRA *SAME *NONE))

8. Verify the IASP is varied off on the target node.

9. Reattach the PowerHA session using the Change Session command. See the appropriate procedure depending on the type of replication:

10. A confirmation panel confirming the reattach is displayed. Verify that the source and target node are correct and press F16 to confirm.

11. Start the Cluster Resource Group using the STRCRG command. For example: STRCRG CLUSTER(MYCLU) CRG(MYCRG)

12. If the current primary node is not the desired primary node, perform a switchover using the CHGCRGPRI command. For example: CHGCRGPRI CLUSTER(MYCLU) CRG(MYCRG)
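Because the reattach overwrites the data at the target node, the three verifications before it (the copy status, the source node, and the IASP status on the target) are worth treating as hard gates. The Python sketch below models them; the function and parameter names are hypothetical, and the status strings are the ones used in this document.

```python
from typing import List


def reattach_blockers(
    copy_status: str,
    session_source_node: str,
    node_to_keep: str,
    target_iasp_status: str,
) -> List[str]:
    """Return the reasons the session should not be reattached yet (empty if none)."""
    blockers = []
    # The copy status must be detached before the reattach.
    if copy_status.lower() != "detached":
        blockers.append("copy status is not detached")
    # The source must hold the data to keep; the target copy is overwritten.
    if session_source_node != node_to_keep:
        blockers.append("source node does not hold the copy to keep; correct it with CHGCRG")
    # The IASP must be varied off on the target node.
    if target_iasp_status.upper() != "VARIED OFF":
        blockers.append("IASP is not varied off on the target node")
    return blockers
```

An empty list corresponds to the point where the reattach and the F16 confirmation are safe to perform; any entry should be resolved first.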

Recovering from an unplanned failure in a Metro Mirror or Global Mirror environment