Computing Science, posix rules, life rules, no rules…

Solaris 10 u3 – SC 3.2 ZFS/NFS HA with NON-shared discs using AVS (part II)

First of all, I would like to thank Suraj Verma, Venkateswarlu Tella, Venkateswara Chennuru, and the great OpenSolaris community.

If you do not have the AVS setup already, take a look at part I of this howto. Ok, let’s go..

First, we need to create a custom agent (resource type), because the SUNW.HAStoragePlus resource type needs global devices for HA, and this howto is about HA using local devices.
So, let’s create a resource type named MRSL.NONsharedDevice.
We will use the directory /usr/scnonshared:

   # mkdir -p /usr/scnonshared/bin

In that directory let’s create a file named MRSL.NONsharedDevice.rt with the following lines:

##############################################
# Resource Type: NONsharedDevice             #
# byLeal                                     #
# 17/ago/2007                                #
##############################################
RESOURCE_TYPE   = NONsharedDevice;
RT_DESCRIPTION  = "NON-Shared Device Service";
VENDOR_ID       = MRSL;
RT_VERSION      = "1.0";
RT_BASEDIR      = "/usr/scnonshared/bin";
START           = nsd_svc_start;
STOP            = nsd_svc_stop;

{
        PROPERTY = Zpools;
        EXTENSION;
        STRINGARRAY;
        TUNABLE = AT_CREATION;
        DESCRIPTION = "The Pool name";
}
{
        PROPERTY = Mount_Point;
        EXTENSION;
        STRINGARRAY;
        TUNABLE = AT_CREATION;
        DESCRIPTION = "The mount point";
}

Let’s go through some lines of that file:
p.s.: Lines starting with “#” are just comments.
1) RESOURCE_TYPE: The name of our brand new resource type (agent).
2) RT_BASEDIR: The base directory of the resource type, where the start/stop methods must be located.
3,4) START/STOP: The two methods that we need to implement to mount/unmount our pool/fs. The filenames will be “nsd_svc_start” and “nsd_svc_stop”.
Here you can download the two Perl scripts: nsd_svc_start, nsd_svc_stop.
Ok, ok, I’m not a “real” Perl hacker, but I think the two scripts do the job. They are basically the same; I don’t know if a single script could serve as both the start and stop methods, but at least we could put the shared functions in an “include” file. Feel free to enhance it!
p.s.: You will need to change the string “primaryhostname” in both scripts to the name of your server (the master of the SNDR volumes).
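On the question of using one script for both methods: a common trick is to install the same file under both names and let it decide what to do from the name it was called with. The sketch below shows the idea in plain shell rather than Perl; nsd_action is a made-up helper name, and nothing here is taken from the downloadable scripts.

```shell
#!/bin/sh
# Sketch: one script installed under two names (nsd_svc_start and
# nsd_svc_stop, for example via a hard link). It picks its action from
# the name it was invoked as. nsd_action is a hypothetical helper.
nsd_action() {
    case "$(basename "$1")" in
        nsd_svc_start) echo start ;;
        nsd_svc_stop)  echo stop ;;
        *)             echo "unknown method: $1" >&2; return 1 ;;
    esac
}

# In a real script this would be used as:
#   action=$(nsd_action "$0") || exit 1
```

The START and STOP entries in the RT file would stay as they are, since both names still exist in RT_BASEDIR.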
The sections between “{}” are extension properties, which we need in order to customize each resource that we create from this resource type:
1) Zpools: Like in the SUNW.HAStoragePlus RT, we need this extension to associate a ZFS pool with the MRSL.NONsharedDevice resource.
2) Mount_Point: We need this extension to mount the ZFS pool/fs, because the mountpoint property must be set to “legacy” for each ZFS pool that we want to use in an HA solution with “local” devices.
Both extensions must be provided at creation time (“AT_CREATION”), and there is no default.
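To make the legacy mount idea concrete, here is a dry-run sketch that only prints the commands involved. legacy_mount_cmds is a hypothetical helper, the two arguments stand for the Zpools and Mount_Point values, and the exact sequence used by the real Perl scripts may differ.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# legacy_mount_cmds is a hypothetical helper name.
legacy_mount_cmds() {
    pool=$1
    mount_point=$2
    # One-time setup: stop ZFS from mounting the pool automatically.
    echo "zfs set mountpoint=legacy $pool"
    # What a start method would do after importing the pool.
    echo "mount -F zfs $pool $mount_point"
    # What a stop method would do before exporting the pool.
    echo "umount $mount_point"
}

legacy_mount_cmds POOLNAME /dir1/dir2/POOLNAME
```

With mountpoint=legacy, importing the pool does not mount anything by itself, which is what lets the agent decide when the filesystem becomes visible.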
If you change the Perl scripts, keep in mind that the important point is the control of the AVS synchronization/replication. As Jim Dunham told me:
Golden Rule of AVS and ZFS:

When replicating data between Solaris hosts, never
allow ZFS to have a storage pool imported on both nodes
when replication is active, or becomes active. ZFS does not
support shared writer, and thus if ZFS on one node, sees a
replicated data block from ZFS on another node, it will
panic Solaris.

and more..

Now in a failback or switchback scenario, you do have a
decision to make. Do you keep the changes made to the
SNDR secondary volume (most likely), or do you discard
any changes, and just continue to reuse the data on the
SNDR primary volume (least likely). The first thing that
needs to be done to switchback, which is automatically
provided by the NFS resource group, is to ZFS legacy
unmount the ZFS storage pool on the SNDR secondary node.

- If you want to retain the changes made on the SNDR secondary, in a script perform a “sndradm -n -m -r
- If you want to dispose of the changes made on the SNDR secondary, do nothing.

Next allow the NFS resource group to ZFS legacy mount
the ZFS storage pool on the SNDR primary node, and now
you are done.

In this HOWTO we will “retain the changes made on the SNDR secondary”, so the stop/start scripts must handle that. If you are implementing a solution that does not need to retain the changes, you can edit the scripts and remove the sndradm lines. Remember that you need to put those files in the directory “/usr/scnonshared/bin/” (the RT_BASEDIR), name them “nsd_svc_start” and “nsd_svc_stop”, and make them executable.
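If you rewrite the start method, one cheap safety net for the golden rule is to refuse to continue when the pool is already imported on the local node. This is only a sketch: pool_is_imported and safe_to_import are made-up names, the ZPOOL variable exists just so the check can be stubbed out, and it says nothing about the state of the other node or of the replication.

```shell
#!/bin/sh
# Sketch of a local "golden rule" guard. ZPOOL can be overridden for
# testing; pool_is_imported and safe_to_import are hypothetical helpers.
ZPOOL=${ZPOOL:-zpool}

# Succeeds when the pool is currently imported on this node
# ("zpool list" exits non-zero for a pool it does not know).
pool_is_imported() {
    "$ZPOOL" list -H -o name "$1" >/dev/null 2>&1
}

# Refuse to go on if the pool is already imported locally.
safe_to_import() {
    if pool_is_imported "$1"; then
        echo "pool $1 is already imported here, refusing" >&2
        return 1
    fi
    return 0
}
```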
You can run the start/stop scripts from the command line to test them on your system. You will need to create two directories on both nodes:

 # mkdir -p /var/log/scnonshared
 # mkdir -p /var/scnonshared/run

Be sure that the ZFS pool is unmounted and exported, and keep in mind the “AVS/ZFS golden rule”. You can try the scripts by running commands similar to:

/usr/scnonshared/bin/nsd_svc_start -R poolname-nonshareddevice-rs \
-T MRSL.NONsharedDevice -G poolname-rg

and

/usr/scnonshared/bin/nsd_svc_stop -R poolname-nonshareddevice-rs \
-T MRSL.NONsharedDevice -G poolname-rg

The SC subsystem will call the scripts with the same options we used above.
p.s.: I think this is information I should be able to find without having to write a script that prints “$@”. It would be nice to find it in the scha_* man pages…
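For anyone writing their own methods, the three options (-R resource, -T resource type, -G resource group) can be picked up with getopts. A minimal sketch in shell (the scripts from this howto are Perl); parse_rgm_args and the variable names are my own invention:

```shell
#!/bin/sh
# Sketch: parse the options a start/stop method is called with.
# parse_rgm_args, RS_NAME, RT_NAME and RG_NAME are hypothetical names.
parse_rgm_args() {
    OPTIND=1
    RS_NAME= RT_NAME= RG_NAME=
    while getopts 'R:T:G:' opt "$@"; do
        case "$opt" in
            R) RS_NAME=$OPTARG ;;   # resource name
            T) RT_NAME=$OPTARG ;;   # resource type name
            G) RG_NAME=$OPTARG ;;   # resource group name
            *) return 1 ;;
        esac
    done
    # Succeed only if all three options were given.
    [ -n "$RS_NAME" ] && [ -n "$RT_NAME" ] && [ -n "$RG_NAME" ]
}

# Typical use at the top of a method:
#   parse_rgm_args "$@" || exit 1
```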

The scripts log everything to the screen and to log files named nsd_svc_stop.`date`.log and nsd_svc_start.`date`.log.
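The same “screen plus log file” behaviour is easy to reproduce if you write your own methods. A rough shell sketch, where log_msg is a hypothetical helper and the date format in the file name is an assumption (the Perl scripts may format it differently):

```shell
#!/bin/sh
# Sketch of a logging helper. LOG_DIR defaults to the directory created
# earlier in this howto; the date format is an assumption.
LOG_DIR=${LOG_DIR:-/var/log/scnonshared}

# log_msg <method-name> <message...>
# Prints the message and appends it to <LOG_DIR>/<method-name>.<date>.log
log_msg() {
    method=$1
    shift
    logfile="$LOG_DIR/$method.$(date +%Y%m%d).log"
    echo "$(date '+%H:%M:%S') $*" | tee -a "$logfile"
}

# Example:
#   log_msg nsd_svc_start "importing pool POOLNAME"
```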
Ok, after `cd /usr/scnonshared`, we can register our new RT:

 # clresourcetype register -f MRSL.NONsharedDevice.rt MRSL.NONsharedDevice

…and configure the whole resource group:

 # clresourcegroup create -p \
PathPrefix=/dir1/dir2/POOLNAME/HA poolname-rg
 # clreslogicalhostname create -g poolname-rg -h \
servernfs servernfs-lh-rs
 # clresource create -g poolname-rg -t \
MRSL.NONsharedDevice -p Zpools=POOLNAME \
-p Mount_Point=/dir1/dir2/POOLNAME poolname-nonshareddevice-rs
 # clresourcegroup online -M poolname-rg

Now the resource group associated with the ZFS pool (POOLNAME) is online, and hence the ZFS pool too. Before configuring the resource for the NFS services, you will need to create a file named dfstab.poolname-nfs-rs in the directory /dir1/dir2/POOLNAME/HA/SUNW.nfs/.
p.s.: As you know, dfstab is the file containing commands for sharing resources across a network (NFS shares; see “man dfstab” for information about the syntax of that file).
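As an illustration, a one-line dfstab could be written like this. write_dfstab is a hypothetical helper, and the share options (rw, the description text) are just examples; adjust them to what you actually want to export.

```shell
#!/bin/sh
# Sketch: write a one-line dfstab file for the SUNW.nfs resource.
# write_dfstab is a hypothetical helper; the share options are examples.
write_dfstab() {
    pathprefix=$1   # the resource group's PathPrefix
    resource=$2     # name of the SUNW.nfs resource
    shared_dir=$3   # directory to export over NFS
    mkdir -p "$pathprefix/SUNW.nfs" || return 1
    cat > "$pathprefix/SUNW.nfs/dfstab.$resource" <<EOF
share -F nfs -o rw -d "HA NFS share" $shared_dir
EOF
}

# Example with the values from this howto:
#   write_dfstab /dir1/dir2/POOLNAME/HA poolname-nfs-rs /dir1/dir2/POOLNAME
```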

To pass the SC “validate” check when configuring the SUNW.nfs resource, we need to have the ZFS pool mounted on both nodes. So, let’s put the AVS software in logging mode (on the primary node):

   # sndradm -C local -g POOLNAME -n -l

After that, we can import the pool (on the secondary node):

   # zpool import -f POOLNAME

Now we can proceed with the cluster configuration:

   # clresource create -g poolname-rg -t SUNW.nfs -p \
Resource_dependencies=poolname-nonshareddevice-rs \
poolname-nfs-rs

Now we can unmount/export the ZFS pool on the secondary node, put the AVS software back in replication mode, and go to the last step: bringing all the resources online.
On the secondary node:

   # umount POOLNAME
   # zpool export POOLNAME

On the primary node:

   # sndradm -C local -g POOLNAME -n -u
   # sndradm -C local -g POOLNAME -n -w
   # clresourcegroup online -M poolname-rg

That’s it. You can test the failover/switchback scenarios with the command:

 # clnode evacuate <nodename>

The command above takes all the resources off nodename and brings them online on another cluster node. Alternatively, you can use the clresourcegroup command to switch the resource group to one specific host (nodename):

 # clresourcegroup switch -n <nodename> poolname-rg

WARNING: After the resources have been evacuated from a node, there is a period (60 seconds by default) during which resource groups are kept from switching back onto that node (see man clresourcegroup).

If you need to undo the whole configuration (if something goes wrong), here is the step-by-step procedure:

   # clresourcegroup offline poolname-rg
   # clresource disable poolname-nonshareddevice-rs
   # clresource disable poolname-nfs-rs
   # clresource delete poolname-nfs-rs
   # clresource delete poolname-nonshareddevice-rs
   # clresource disable servernfs-lh-rs
   # clresource delete servernfs-lh-rs
   # clresourcegroup delete poolname-rg

I think we can enhance this procedure by using a point-in-time copy of the data, to avoid “inconsistency” issues during the synchronization task. But that is something I will leave for you to comment on! That’s all.

Edited by MSL (09/24/2007):
This “agent” is changing over time (for the better, I guess :), and I will use the comments section as a “Changelog”. So, if you want to implement this solution, I recommend that you read the comments section to see whether there are any changes to the above procedure. The “stop” and “start” scripts are always permanent links, and the updated RT file can be downloaded here.


21 Comments to "Solaris 10 u3 – SC 3.2 ZFS/NFS HA with NON-shared discs using AVS (part II)"

  1. Suraj Verma
    August 23, 2007 - 1:37 pm | Permalink

    Very cool and pretty useful I am sure.
    Regarding your comment “Would be nice find it in the scha_* man pages…”, I guess you can find this information in rt_callbacks(1HA) or in Developer’s Guide (link found at http://opensolaris.org/os/community/ha-clusters/ohac/Documentation)

  2. MSL
    August 24, 2007 - 12:50 pm | Permalink

    Suraj Verma made a comment to me that I think is very relevant, so I’m posting it here:
    “You say that the extension properties can only be set when creating the
    resource (AT_CREATION). However I guess it will be safe to have them
    modified when the resource is disabled (WHEN_DISABLED). This will give
    you the flexibility of adding/removing mountpoints without deleting the
    resource”.

  3. Jeremy
    April 25, 2008 - 5:34 pm | Permalink

    What happens if one of the secondary nodes disks fail? It looks like you’re opening yourself up to a scenario where NODE2 is not in a good state ( would be degraded if switching over to it ).

    What if the primary NODE1 detects a bad disk and starts to resilver, that will generate a LOT of IO across this AVS solution.

    My question is, is it better to forget about zfs and use this solution with a UFS meta device?

  4. June 13, 2009 - 2:05 pm | Permalink

    Hi, interest post. I’ll write you later about few questions!

  5. September 5, 2009 - 6:51 pm | Permalink

    I agree with the comments! I’ll add this to my bookmarks!

  6. September 16, 2009 - 12:37 pm | Permalink

    Thanks :) You helped a lot =-*

  7. September 23, 2009 - 3:09 am | Permalink

    Thanks :) There is something interesting here :)

  8. October 5, 2009 - 5:29 am | Permalink

    Exactly what I was looking for, thanks a lot!

  9. January 10, 2010 - 5:36 pm | Permalink

    Very interesting and informative site! Good job done by you guys, Thanks

  10. Chris
    March 17, 2010 - 3:06 am | Permalink

    Hi,

    this is a fantastic agent, and i have it up and running so far flawlessly however when my cluster restarts AVS appears to loose its configuration?

    if i run “sndradm -C local -g POOLNAME -n -u” im told:

    Remote Mirror: avs1 /dev/rdsk/c8t1d0s0 /dev/rdsk/c8t1d0s1 avs2 /dev/rdsk/c8t1d0s0 /dev/rdsk/c8t1d0s1
    sndradm: warning: SNDR: /dev/rdsk/c8t1d0s0 ==> /dev/rdsk/c8t1d0s0 not already enabled

    dsstat also returns nothing…

    so far my solution is running on both nodes:
    dscfgadm -d
    rm /etc/dscfg_cluster && rm /etc/dscfg_local
    echo "/dev/did/rdsk/d2s0" > /etc/dscfg_cluster
    dscfgadm

    and then syncing the primary to the slave again.. obviously not ideal :(

    any suggestions??

    and cheers for this awesome agent… would be nice if the AVS comunity got behind this!!

  11. Chris
    March 17, 2010 - 9:49 pm | Permalink

    from all the googling i did find alot of emails from you and alot of pushback from devs saying it wasnt a problem, however here i am with the same problem as you, using opensolaris 2009.6.. :(

    i may give this a shot with the latest solaris release and the packages from the sun website, see if its just an opensolaris issue.

    such a shame that this problem is here as your solution is fantastic providing AVS works – which it does until a node restarts!!!

    i gather your current agent code is here “http://www.eall.com.br//hp/Solaris/MRSLnonshareddevice-2.2.tar.gz” ?

    once again… awesome work with the agent!

  12. Chris
    March 18, 2010 - 2:29 am | Permalink

    well i tried using the supported stuff from sun website and i have the same problem…

    what i have noticed is that prior to running sndradm if i run dscfg -l -s /dev/did/rdsk/d5s0 | grep -v "#" – the contents is empty. Once i setup the replication with sndradm and rerun the dscfg command it returns one line with setid: 1 setid-ctag

    i would of thought it would return the same as /etc/dscfg_local which in my case has:
    cm: 128 64 - - - - - - -
    sndr: nas1 /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 nas2 /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 ip sync tank1 - setid=1; -
    sv: /dev/rdsk/c1t1d0s0 - -
    sv: /dev/rdsk/c1t1d0s1 - -
    dsvol: /dev/rdsk/c1t1d0s0 - sndr
    dsvol: /dev/rdsk/c1t1d0s1 - sndr

    so perhaps the problem isnt that configuration is lost, but that configuration isnt even being written to the cluster database on the shared disk???

    wonder how i could manually get that info on the shared storage.. i did try dd if=/etc/dscfg_local of=/dev/did/rdsk/d5s0 bs=512k count=11 but without any luck heh

    maybe you have some ideas? :)

  13. Chris
    March 18, 2010 - 3:20 am | Permalink

    apparently i can update the contents of the cluster database via dscfg -C - -a file

    however this appears to be only updating dscfg_local

    arggh :(

    time to give up…

  14. March 22, 2010 - 9:38 am | Permalink

    Thanks for a wonderful post, I’ve been looking for such information, I will join your RSS feed now.
