ZFS/NFS HA with NON-shared discs

First of all, I would like to thank Jim Dunham and his team for a great piece of software, and for the help with this configuration. Here is his blog, and here is a great article!

The following procedure is intended to be a "simple" howto for configuring a "highly available" ZFS/NFS server (with NON-shared discs), using AVS for the synchronization/replication task.

PLEASE BE AWARE THAT ANY INFORMATION YOU MAY FIND HERE MAY BE INACCURATE.

USE AT YOUR OWN RISK!

-+-

Conventions:
NODE1: The hostname of the primary host in the cluster.
NODE2: The hostname of the secondary host in the cluster.
POOLNAME: The name of the ZFS pool.
#: The superuser prompt. In Solaris you should execute these commands as root, or with a role that has the necessary privileges.

-+-

Environment:
1) Two servers (NODE1 and NODE2), each with two 150GB SATA discs.

Here are the two SATA discs of NODE1:
#format < /dev/null | grep 149.01GB
       4. c2d0 < ST316081-4LS50QX-0001-149.01GB >
       5. c3d0 < ST316081-4LS46T1-0001-149.01GB >

and NODE2:
#format < /dev/null | grep 149.01GB
       4. c2d0 < WDC WD16-WD-WCANM463514-0001-149.01GB >
       5. c3d0 < WDC WD16-WD-WCANM463733-0001-149.01GB >

2) These two servers form a two-node cluster (Sun Cluster 3.2 is available for free at sun.com).
3) The Operating System is Solaris 10 u3.
4) AVS was installed using the source code available at opensolaris.org (the installation procedure is here).
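
Item 2 above assumes the two-node cluster is already up and running. If you want to double-check the node names before going on, Sun Cluster 3.2 provides the clnode command (just a quick sanity check; NODE1 and NODE2 stand for whatever your nodes are really called):

# clnode list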

-+-

Procedure

We need some math. We have just two discs, and we will use them to build a ZFS mirror. So, we need to create two slices on each disc:
s0 for the zfs mirror
s1 for the bitmap volume
P.S.: You can use other discs for the bitmap volumes (if you have them), and use the whole disc for the ZFS mirror, which is the recommended way.
We will take one disc and create a temporary ZFS pool on it, using the zpool utility to correctly format the disc: it gets an EFI label, with all available blocks in slice 0. After that, we can destroy the pool.

# zpool create -f TEMP c2d0; zpool destroy TEMP

The dsbitmap command calculates the size of the Availability Suite bitmap volume required for use with the specified data volume. You can run dsbitmap on its own to see the results (e.g. dsbitmap -r /dev/rdsk/c2d0s0); here we are just scripting the process. You can use format or any other utility to configure the discs; fmthard is just an automatic, "non-interactive" version of format. The prtvtoc utility does its job of printing the VTOC information…

We will use some variables to simplify our job. Two of them are the important ones:

VOLUME_SIZE is the size of the whole (usable) disc, in blocks.
BITMAP_SIZE is the required bitmap volume size, also in blocks.

NODE1

# VOLUME_SIZE="`dsbitmap -r /dev/rdsk/c2d0s0 | \
grep 'size: [0-9]' | awk '{print $5}'`"
# BITMAP_SIZE="`dsbitmap -r /dev/rdsk/c2d0s0 | \
grep 'Sync ' | awk '{print $3}'`"
# PART_0_SIZE=$(( VOLUME_SIZE - BITMAP_SIZE ))
# LAST_0_SECT=$(( 34 + PART_0_SIZE ))
# prtvtoc /dev/rdsk/c2d0
# fmthard -d 0:4:0:34:$PART_0_SIZE /dev/rdsk/c2d0
# fmthard -d 1:4:0:$LAST_0_SECT:$BITMAP_SIZE /dev/rdsk/c2d0
# prtvtoc /dev/rdsk/c2d0
# fmthard -d 0:4:0:34:$PART_0_SIZE /dev/rdsk/c3d0
# fmthard -d 1:4:0:$LAST_0_SECT:$BITMAP_SIZE /dev/rdsk/c3d0
# prtvtoc /dev/rdsk/c3d0
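
Before touching the second node, it does not hurt to eyeball the numbers we just computed (a small sketch, assuming the variables above are still set in the current shell; the s0 and s1 sizes should add up to the usable disc size reported by dsbitmap, and should match the prtvtoc output above):

# echo "usable disc: $VOLUME_SIZE blocks"
# echo "bitmap (s1): $BITMAP_SIZE blocks / data (s0): $PART_0_SIZE blocks"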

We need to do the same on NODE2

# VOLUME_SIZE="`dsbitmap -r /dev/rdsk/c2d0s0 | \
grep 'size: [0-9]' | awk '{print $5}'`"
# BITMAP_SIZE="`dsbitmap -r /dev/rdsk/c2d0s0 | \
grep 'Sync ' | awk '{print $3}'`"
# PART_0_SIZE=$(( VOLUME_SIZE - BITMAP_SIZE ))
# LAST_0_SECT=$(( 34 + PART_0_SIZE ))
# prtvtoc /dev/rdsk/c2d0
# fmthard -d 0:4:0:34:$PART_0_SIZE /dev/rdsk/c2d0
# fmthard -d 1:4:0:$LAST_0_SECT:$BITMAP_SIZE /dev/rdsk/c2d0
# prtvtoc /dev/rdsk/c2d0
# fmthard -d 0:4:0:34:$PART_0_SIZE /dev/rdsk/c3d0
# fmthard -d 1:4:0:$LAST_0_SECT:$BITMAP_SIZE /dev/rdsk/c3d0

-+-

As you can see above, we created two slices on each disc: one for the bitmap volume (s1, with $BITMAP_SIZE blocks) and another (s0) with the remaining blocks. In this case, to replicate a ~150GB disc using the sync replication mode, BITMAP_SIZE is 1194 blocks. Hence, our bitmap slice has 597K (512 bytes/sector * 1194 blocks = 611328 bytes). So, each byte of the bitmap covers roughly 512 data blocks, most likely because each bit in the bitmap tracks a 32KB segment (64 blocks per bit, thus 8 * 64 = 512 blocks per byte). If that count is wrong, I would like to see a correction in the comments…
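
If you want to redo that arithmetic on the box itself, here is a quick sketch (again assuming VOLUME_SIZE and BITMAP_SIZE are still set from the fmthard step; the numbers will differ for other disc sizes). The first command prints the bitmap slice size in bytes (611328 in our case), the second prints roughly how many data blocks each bitmap byte covers (about 511-512):

# echo $(( BITMAP_SIZE * 512 ))
# echo $(( VOLUME_SIZE / (BITMAP_SIZE * 512) ))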

AVS: Replicas' configuration

Here is the real thing: we are about to issue the sndradm commands, and a little explanation is in order:

NODE1

# sndradm -C local -nE NODE1 /dev/rdsk/c2d0s0 /dev/rdsk/c2d0s1 \
NODE2 /dev/rdsk/c2d0s0 /dev/rdsk/c2d0s1 ip sync g POOLNAME
# sndradm -C local -nE NODE1 /dev/rdsk/c3d0s0 /dev/rdsk/c3d0s1 \
NODE2 /dev/rdsk/c3d0s0 /dev/rdsk/c3d0s1 ip sync g POOLNAME

-+-

"-C local" is a tag that is mandatory in a cluster environment. The tag can be "local" or "global". The discs we are using (the whole solution) are NON-shared discs, so our tag is local.
"-n" does not prompt the user after starting a Remote Mirror operation using sndradm (from the sndradm man page). The default is to ask for confirmation (Y/N).
"-E": "When using "sndradm -E …" (upper-case 'E'), SNDR is configured such that the SNDR primary and secondary volumes are indicated as 0% different, no differences recorded as zeros (0s) in the bitmap. There is no need to perform a bit-order traversal, as there are no bits set. This works in the example above, as the "zpool create …" command assumes that both volumes contain no valid data, hence invalid data = invalid data, thus the volumes are equal. When the "zpool create …" command is invoked, write I/Os happen to the SNDR primary volume, which are then replicated in write-order to the SNDR secondary." – Jim Dunham
"ip" specifies the network transfer protocol (from the sndradm man page).
"sync" specifies the Remote Mirror operating mode. sync is the Remote Mirror mode where the I/O operation is not confirmed as complete until the remote volume has been updated.
"g POOLNAME" specifies our io_groupname. We will use that io_groupname in future sndradm commands to refer to the whole ZFS pool.
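
Before repeating the commands on NODE2, you can print the volume sets that were just configured and make sure everything looks sane (a quick sketch; the exact output layout depends on the AVS release):

# sndradm -P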

NODE2

It is the same operation on both nodes (the same commands, now on NODE2).

# sndradm -C local -nE NODE1 /dev/rdsk/c2d0s0 /dev/rdsk/c2d0s1 \
NODE2 /dev/rdsk/c2d0s0 /dev/rdsk/c2d0s1 ip sync g POOLNAME
# sndradm -C local -nE NODE1 /dev/rdsk/c3d0s0 /dev/rdsk/c3d0s1 \
NODE2 /dev/rdsk/c3d0s0 /dev/rdsk/c3d0s1 ip sync g POOLNAME

-+-

NODE1

Now we can create the ZFS pool (a mirror), using s0 of both discs (c2d0 and c3d0):

# zpool create -m legacy POOLNAME mirror c2d0s0 c3d0s0
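
A quick look at the new pool does not hurt; it should show a single mirror vdev built from c2d0s0 and c3d0s0:

# zpool status POOLNAME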

Now the update procedure: after creating the ZFS pool, we need to use the "-u" option to update the Remote Mirror volume set. Only the blocks logged as changed in the Remote Mirror scoreboards are copied. This is a synchronization procedure, hence a block-order update.

# sndradm -C local -g POOLNAME -n -u

Because a synchronization procedure is block-order, the secondary volume will be unusable until both the primary and secondary volumes are 100% synchronized. After that, I guess, the primary volume goes back into replication mode. So, to wait for the synchronization process, we issue another sndradm command, this time with the "-w" option, which waits for a synchronization copy to complete or abort. When this command returns, both volumes are 100% synchronized.

# sndradm -C local -g POOLNAME -n -w
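
If you prefer to watch the copy progress instead of just waiting at the prompt, dsstat can report on the Remote Mirror sets from another terminal (a sketch; the column layout varies between releases, and the 5-second interval is just an example):

# dsstat -m sndr 5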

Now we can make a simple consistency test on the ZFS pool. We can use the "-l" option to stop the Remote Mirror replication and copy operations between primary and secondary volumes. Because our volumes were in replication mode, the secondary volumes are always a valid point-in-time copy of the corresponding primary volumes. So, by stopping the replication, we can import the ZFS pool on NODE2.

# sndradm -C local -g POOLNAME -n -l

NODE2

# zpool import
# zpool import -f POOLNAME
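
Note that the pool was created with "-m legacy", so nothing gets mounted automatically on import. A minimal sketch for actually poking at the data (the /mnt mountpoint and the file name are just placeholders; compare the checksums against the ones taken on NODE1):

# mount -F zfs POOLNAME /mnt
# digest -a md5 /mnt/some-test-file
# umount /mnt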

Now we can use the ZFS pool if we want (just to test it). Read some files, calculate an md5sum for others, for example… But remember that the discs on NODE2 are secondary volumes, and after we restart the synchronization/replication process, any changes made to those discs will be lost. So, let's export the ZFS pool…

# zpool export POOLNAME

NODE1

When we stopped the replication process with the "-l" option, the system automatically dropped into logging mode (the primary and secondary rdc kernel modules start independent Remote Mirror scoreboard logging on these volumes). So, the AVS software knows exactly what needs to be synchronized to put the volumes back into replication mode.

# sndradm -C local -g POOLNAME -n -u

And I think we should use the "-w" option again, just to make sure…

# sndradm -C local -g POOLNAME -n -w

Now, we are back in the game, in really good shape. Thanks to Availability Suite…

Go to Part II (Sun Cluster integration)