IBM i Parallel Save and Restore Functions

Many customers are interested in spreading their backups across multiple drives simultaneously in order to shorten their backup time. The IBM i platform offers several techniques to do this, some of which have special considerations. This article will begin by outlining the various multi-streamed backup options. It will then discuss a situation with referential database constraints where care is required with multi-streamed saves. Finally, the article will delve into each of the three multi-streamed backup options, explaining how to use the save technique, and the best way to restore that type of save.

Overview of techniques for running multi-streamed backups on the IBM i platform

There are three different techniques for running multi-streamed backups on the IBM i platform:

Concurrent saves
Parallel-parallel saves
Parallel-serial saves

Concurrent saves have been available since the early days of the AS/400. They are being used very successfully by many customers around the world, with very few considerations.

Parallel-parallel saves were introduced at V4R4 and were intended for customers who needed to split a single large object across multiple drives. To use them successfully, use Backup, Recovery and Media Services (BRMS) for the save, recover to a system that has access to the BRMS information about the save, and use the same number of drives for the restore as you used for the save.

Parallel-serial saves were introduced at V5R1 by allowing parallel saves to run against special values like *ALLPROD, *ALLTEST, *ALLUSR, *ASP01–*ASP99, *IBM, generics etc. They basically run a set of concurrent saves, but BRMS divides the libraries into the save streams rather than the user. To use parallel-serial saves successfully, use BRMS for the save, and use a special recovery technique to get multiple drives working on the restore simultaneously.

NOTES:

In concept, concurrent saves and parallel-serial saves of multiple libraries seem similar; however, each has its own benefits and disadvantages.
During parallel-serial backups, the save manages the libraries being saved, the devices and the media. However, since the save runs in a single-threaded job, there may be significant delays before data is written to each device, because each library is preprocessed in turn.
During concurrent backups, each backup runs in a different job, so preprocessing can be performed on multiple libraries at the same time. However, to run concurrent saves, the user must manage the libraries being saved, the devices and the media.

Referential constraints

When a database has referential constraints, there are some scenarios where multi-streamed restores can fail due to seize issues with the various objects related to the constraint. As one object is being restored, the system has a seize lock on it. If a related object is restored at the same time, then the second restore will be unable to get the necessary seize on the first object, and the restore will fail.

For customers who have referential constraints in their databases, there are three ways to work around this issue:

Use a single-stream recovery rather than a multi-streamed recovery for the libraries that have files involved in the referential constraint.
If the related objects are all in a single library, and if that library is saved using a save-while-active command, then IBM i is able to handle this situation.
If the save is a parallel-parallel save, and the recovery is done with the same number of drives as the save, then IBM i is able to handle the situation.

These considerations for customers with referential constraints are in addition to those described in the sections below.

Concurrent saves

For this technique, the user decides how to allocate the objects among the tape drives, then issues one command or job for each drive. The drives run independently of one another, e.g.

Drive 1: SAVLIB LIB(A* B* C* ... H*)
Drive 2: SAVLIB LIB(I* J* K* ... R*)
Drive 3: SAVLIB LIB(S* T* U* ... Z*)
etc.

Some tinkering will be required to make the save streams approximately the same size so that they end at approximately the same time. Care must be taken to ensure that all desired objects are included in the save, including libraries that start with non-alphabetic characters, the Document Library Objects (DLO), the Integrated File System (IFS), spooled files, etc.
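The balancing step can be roughed out before any save commands are written. The following Python sketch is only a planning aid and not part of IBM i or BRMS: it assumes you already know the approximate size of each library, and it uses a simple greedy heuristic (largest library first, always onto the least-loaded drive). The library names and sizes are hypothetical.

```python
import heapq

def balance_streams(library_sizes, drive_count):
    """Assign libraries to drives so the streams end up roughly equal in size.

    Greedy heuristic: take libraries largest first and always give the next
    one to the drive that has the least data assigned so far.
    """
    # Heap entries are (total size assigned so far, drive index).
    heap = [(0, drive) for drive in range(drive_count)]
    heapq.heapify(heap)
    streams = [[] for _ in range(drive_count)]
    by_size = sorted(library_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in by_size:
        total, drive = heapq.heappop(heap)
        streams[drive].append(name)
        heapq.heappush(heap, (total + size, drive))
    return streams

# Hypothetical library sizes in GB.
sizes = {"APLIB": 400, "GLLIB": 300, "ORDLIB": 250,
         "HRLIB": 200, "ARCHIVE": 150, "TESTLIB": 100}
streams = balance_streams(sizes, 3)
totals = [sum(sizes[name] for name in stream) for stream in streams]
for drive, (stream, total) in enumerate(zip(streams, totals), start=1):
    print(f"Drive {drive}: {stream} ({total} GB)")
```

Each resulting list would become the library list of one save command. Size is only a proxy for save time, so some tinkering is still likely to be needed after the first real run.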

Recovery is fairly simple. The sets of tapes can be restored using the same number of drives or fewer drives, by mounting the tapes and issuing the corresponding restore command. The key consideration on the restore is for customers who have their physical/logical files or their journals/receivers in different libraries: they need to ensure that the files are restored in the proper order, otherwise an error message will occur, and the missed files will need to be restored separately later. Careful planning of the save streams can usually allow the files to be restored in a single pass, even for customers with physicals/logicals and journals/receivers in different libraries.
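One way to catch ordering problems at planning time is to check a proposed restore order against the known dependencies between libraries. The Python sketch below is illustrative only: the library names are hypothetical, and the dependency table (which library holds logical files over which physical files, and which holds receivers for which journals) is something you would have to supply yourself.

```python
def out_of_order(restore_order, depends_on):
    """Return (library, base) pairs where the library is restored before its base."""
    position = {lib: i for i, lib in enumerate(restore_order)}
    problems = []
    for lib in restore_order:
        for base in depends_on.get(lib, []):
            # A base library that is missing from the plan is also a problem.
            if base not in position or position[base] > position[lib]:
                problems.append((lib, base))
    return problems

# Hypothetical layout: LFLIB holds logical files over physical files in PFLIB,
# and RCVLIB holds the receivers for journals in JRNLIB.
depends_on = {"LFLIB": ["PFLIB"], "RCVLIB": ["JRNLIB"]}

# A plain alphabetical restore puts LFLIB ahead of PFLIB.
problems = out_of_order(["JRNLIB", "LFLIB", "PFLIB", "RCVLIB"], depends_on)
print(problems)
```

An empty result means the planned order restores every base library before its dependents. The check covers one stream at a time; dependencies that cross streams still need the streams themselves to be restored in a suitable order.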

Any data on the IBM i can be saved concurrently.

Parallel-parallel saves

There was a group of customers who needed to shorten their backups, but could not do so with concurrent saves, because their data included a single large object that made up the bulk of the save. In the early days, there was no way to split this object across drives, so that one stream became the limiting factor in shortening the backup window. Adding more drives to the save was no help.

To assist these customers, IBM introduced parallel-parallel saves at V4R4. For a parallel-parallel save, the user issues a single save command, but asks IBM i to run it across multiple drives, e.g.,

SAVLIB LIB(biglib) DEV(*MEDDFN) MEDDFN(biglibdfn)
(where “biglib” is the library with the large object, and where the “biglibdfn” media definition indicates multiple drives)

Although it is possible to implement this using IBM i commands, it is fairly difficult since media definition files need to be created for both the save and the restore. For example, the media definition for the restore tells IBM i the details of the data on the tapes and the file sequence numbers where the data can be found. Rather than doing this manually, customers are encouraged to use BRMS since it creates the media definition files in the background using the tape exit programs and the file sequence information in the BRMS database. Customers simply tell BRMS how many drives they would like to use for the save or restore, and BRMS handles the rest.

There are three key considerations for restoring this type of save:

In order to restore the data, you need access to the BRMS database so the media definition can be created. This is easy for a customer who is restoring an object to a system in the same BRMS network where it was saved, or for a customer doing a full system recovery who will reload the BRMS information as part of the restore. However, it is more difficult to take a parallel-parallel tape to a separate system and try to restore it, since the media definition will need to be created manually.
When doing the recovery, IBM i will restore all parts of the first library from each tape, then go back and restore all parts of the second library from each tape, and so on. If you have the same number of drives on the restore as you had on the save, then IBM i mounts all the tapes, moves quickly among them and pulls the data off each tape in order. The recovery typically takes 1-2 times as long as the save. However, if the restore is done with fewer drives than the save, then IBM i needs to mount each tape once for each library that is saved on it. Due to all the tape mounts, searches, rewinds, etc., this can take a very long time and is likely not practical.
As discussed above, using the same number of drives for the save and restore will avoid seize problems when there are referential database constraints.

The net of this is that parallel-parallel saves can be used successfully so long as saves are done with BRMS, restored to a system that has access to the BRMS information, and restored using the same number of drives as the save. As with concurrent saves, customers who have their physical / logical files and journals / receivers in different libraries need to plan their strategies so these objects are restored in the proper order.

Parallel-serial saves

Starting at V5R1, it became possible to specify generics (e.g. ABC*) and special values like *ALLUSR, *ALLPROD, *ALLTEST, *ASP01-*ASP99, *IBM, etc. on a parallel save. This type of save became known as a parallel-serial save since it was a cross between a parallel save and a regular (“serial”) save. It was considered parallel since IBM i decided how to spread the libraries among drives, unlike the concurrent save where the customer decided how to spread the libraries. However, individual objects are not necessarily spread across multiple drives: instead, IBM i uses a round-robin algorithm to allocate the various libraries, one to each drive, working from A-Z.

When you ask for a parallel save, IBM i decides whether to do a parallel-parallel or parallel-serial save, depending on the type of objects to be saved as follows:

single object - parallel-parallel
single library - parallel-parallel
list of libraries - IBM i decides which to do depending on various factors
generic libraries (e.g. LIB*) - parallel-serial
special values (e.g. *ALLUSR, *ALLPROD) - parallel-serial

The tapes that result from a parallel-serial save have the libraries jumbled on them alphabetically due to the round-robin algorithm. For example, with a 3-drive save, if all the libraries were approximately the same size, the libraries might be spread something like the following:

Tape 1: A-lib, D-lib, G-lib, etc.
Tape 2: B-lib, E-lib, H-lib, etc.
Tape 3: C-lib, F-lib, I-lib, etc.
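The layout above can be reproduced with a few lines of Python. This is only a sketch of the idealized case described here, where all libraries are about the same size so that a plain round-robin deal in alphabetical order applies; the actual allocation made by IBM i may differ, and the library names are hypothetical.

```python
def round_robin(libraries, drive_count):
    """Deal libraries out to drives one at a time, in alphabetical order."""
    tapes = [[] for _ in range(drive_count)]
    for index, library in enumerate(sorted(libraries)):
        tapes[index % drive_count].append(library)
    return tapes

# Hypothetical library names, assumed to be of similar size.
libraries = ["ALIB", "BLIB", "CLIB", "DLIB", "ELIB",
             "FLIB", "GLIB", "HLIB", "ILIB"]
tapes = round_robin(libraries, 3)
for number, tape in enumerate(tapes, start=1):
    print(f"Tape {number}: {', '.join(tape)}")
```

Note that alphabetically adjacent libraries always land on different tapes, which is what makes the restore considerations below matter.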

Now as an aside: as mentioned above, programming convention on the IBM i platform says that customers should try to put their physicals / logicals or their journals / receivers in the same library. In this case, IBM i is able to restore these related files successfully. For customers who need these objects to be in separate libraries, programming convention says to name the libraries alphabetically so that IBM i will restore the base files first and the dependent files second when it does a restore in alphabetical order.

When restoring a parallel-serial save, BRMS tries to follow the above convention by restoring the objects in alphabetical order. Customers are often surprised to learn that regardless of how many drives they ask BRMS to use, the default restore only reads from a single drive at a time as follows:

If a customer asks BRMS to use the same number of drives for the restore as for the save, then the first tape of each set is loaded into each drive. However, IBM i then moves through the drives one by one, restoring the libraries in alphabetical order, such that only one drive is in use at any given time.

If customers ask BRMS to use a single drive for the restore, then IBM i mounts the first tape and restores the first library. BRMS then rewinds/dismounts the first tape and mounts/loads the second tape and restores the second library. This continues, cycling through the tapes one by one until all libraries are restored. Not only is the restore being performed on a single drive, but there are also numerous mount/search/rewind/dismount cycles to wait for.
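To see why the single-drive case is so slow, it helps to count the tape mounts. The Python sketch below models only the behaviour described above (libraries restored in alphabetical order, with a new mount whenever the next library is on a different tape). It is not a model of BRMS internals, and the tape contents are hypothetical.

```python
def count_mounts(tapes):
    """Count tape mounts for an alphabetical restore using a single drive."""
    tape_of = {library: number
               for number, tape in enumerate(tapes)
               for library in tape}
    mounts = 0
    mounted = None  # index of the tape currently in the drive
    for library in sorted(tape_of):
        if tape_of[library] != mounted:
            mounted = tape_of[library]
            mounts += 1
    return mounts

# Hypothetical result of a 3-drive parallel-serial save of nine libraries.
tapes = [["ALIB", "DLIB", "GLIB"],
         ["BLIB", "ELIB", "HLIB"],
         ["CLIB", "FLIB", "ILIB"]]
mounts = count_mounts(tapes)
print(mounts)
```

With a round-robin layout every library is on a different tape from the one before it, so the number of mounts equals the number of libraries rather than the number of tapes.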

Since a single-drive restore is typically not practical for customers, a work-around was devised whereby multiple restore jobs are submitted to get multiple drives working simultaneously. At V5R3 and lower, customers need to create these streams manually by carefully selecting the items to be restored. At V5R4, new function was added to the STRRCYBRM screens to make this simpler. Here is a description:

During a recovery, the STRRCYBRM command brings up a screen showing all the objects that are available for restore. This list is in alphabetical order. There are three ways to handle this screen:

If you ask BRMS to go ahead and do the default restore, you will find yourself in one of the situations described above where only a single drive is active on the recovery at a given time.
The workaround for V5R3 and earlier releases is to submit the recovery in multiple sections. This needs to be done very carefully. Start by figuring out all the tapes that are in the first set. Then open the STRRCYBRM screen and go down through the list and select all the items that are on that set of tapes, then submit the command for restore. Then go through the list again, and select the items that are on the second set of tapes, and submit them for restore. Continue until all items have been selected from the list.
Starting at V5R4, the same technique can be used, but BRMS offers assistance in selecting the items to be restored in each set. On the STRRCYBRM ACTION(*RESTORE) screen that lists the items for restore, there are two selection fields in the top right corner that let you specify *VOLSET and a volume number prior to using the F16=Select key. All saved items that are on the same set of tapes as that volume are marked with a ‘1’. You then use F9 for recovery defaults, and F4 to submit the restore. You then repeat this for each different save stream until all restores have been submitted.
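Whether the selection is done by hand or with *VOLSET, the underlying idea is to group the saved items by the set of tapes they were written to, and to submit one restore per group. The Python sketch below illustrates just that grouping. The input list of (library, volume set) pairs and the volume set names are hypothetical; on a real system this information comes from the BRMS recovery screens and reports, not from this code.

```python
from collections import defaultdict

def restore_streams(saved_items):
    """Group saved libraries by the volume set they were written to."""
    streams = defaultdict(list)
    for library, volume_set in saved_items:
        streams[volume_set].append(library)
    # Within each stream, keep the libraries in alphabetical order.
    return {volume_set: sorted(libs) for volume_set, libs in streams.items()}

# Hypothetical saved items, listed alphabetically as on the restore screen.
saved_items = [("ALIB", "SET1"), ("BLIB", "SET2"), ("CLIB", "SET3"),
               ("DLIB", "SET1"), ("ELIB", "SET2"), ("FLIB", "SET3")]
streams = restore_streams(saved_items)
for volume_set, libs in sorted(streams.items()):
    print(f"{volume_set}: {', '.join(libs)}")
```

Each group corresponds to one submitted restore, so the number of groups is the number of drives that can be kept busy at the same time.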

As with concurrent and parallel-parallel saves, customers who have their physicals / logicals and journals / receivers in different libraries may receive error messages indicating that files could not be restored because their base file was not yet on the system. Operators need to check for these messages and go back and restore the missing files later. Alternatively, it may be possible to plan the save strategy to minimize these concerns, using one of the following techniques:

Design the save strategy so all traditional libraries are in a single stream, then use the other streams for other items like DLO, IFS, etc. That way the traditional library stream will be on a single drive and will restore in alphabetical order without dependencies between drives.
Design the streams so one stream saves all the dependent files (e.g. logical files and journal receivers). During the recovery, restore that stream last, after all the other streams are already back on the system.

The net of this is that parallel-serial saves can be used successfully so long as the special recovery technique is used to get multiple drives running simultaneously during the recovery, and the restore order of physical / logical files and journals / receivers is handled.

NOTE: Parallel-serial saves are very similar to concurrent saves, except that IBM i carves the backup into multiple streams and attempts to have them all finish at the same time, based on the round-robin algorithm. Parallel-serial saves offer a benefit for customers whose data varies from day to day, such that it would be difficult to pre-plan concurrent streams. However, customers who are able to design their own concurrent streams may find that concurrent saves are a better strategy, due to the simpler recovery prior to V5R4 and the lower overhead on the system.

Conclusion
Concurrent and parallel saves offer customers options to shorten their backup window. If customers understand the advantages and considerations for each, they can plan a save/restore strategy that will be a good fit for their organization.

(Article by Nancy Roper appeared in the COMMON Connect magazine)