I've been running some tests with the AVS software, and the sndradm command is not working very well…
I have a two-node cluster configuration: sndradm prints the expected output on the secondary node (node2), but returns nothing on the primary (node1).

Take a look:

 # sndradm -C local -P
/dev/rdsk/c2d0s0        <-      node1:/dev/rdsk/c2d0s0
autosync: off, max q writes: 4096, max q fbas: 16384, \
async threads: 2, mode: sync, group: B2007, state: logging
/dev/rdsk/c3d0s0        <-      node1:/dev/rdsk/c3d0s0
autosync: off, max q writes: 4096, max q fbas: 16384, \
async threads: 2, mode: sync, group: B2007, state: logging

Four lines of output:

 # sndradm -C local -P | wc -l
    4

Running the same command on node1:

 # sndradm -C local -P
 # echo $?
0
 #

So I started to look for the problem. Take a look at this DTrace script, first executed on node2:

 # dtrace -n 'syscall::write:entry / pid == $target / \
{ ustack(); }' -c "sndradm -C local -P"

dtrace: description 'syscall::write:entry ' matched 1 probe

/dev/rdsk/c2d0s0        <-      node1:/dev/rdsk/c2d0s0
autosync: off, max q writes: 4096, max q fbas: 16384, \
async threads: 2, mode: sync, group: B2007, state: logging
/dev/rdsk/c3d0s0        <-      node1:/dev/rdsk/c3d0s0
autosync: off, max q writes: 4096, max q fbas: 16384, \
async threads: 2, mode: sync, group: B2007, state: logging
dtrace: pid 28104 has exited
CPU     ID                    FUNCTION:NAME
  0 100087                      write:entry
              libc.so.1`_write+0x15
              libc.so.1`_ndoprnt+0x2816
              libc.so.1`printf+0xa8
              sndradm`rdc_print+0x646
              sndradm`main+0xceb
              sndradm`_start+0x7a

  0 100087                      write:entry
              libc.so.1`_write+0x15
              libc.so.1`_ndoprnt+0x2816
              libc.so.1`printf+0xa8
              sndradm`rdc_print+0x8ab
              sndradm`main+0xceb
              sndradm`_start+0x7a

  0 100087                      write:entry
              libc.so.1`_write+0x15
              libc.so.1`_ndoprnt+0x2816
              libc.so.1`printf+0xa8
              sndradm`rdc_print+0x646
              sndradm`main+0xceb
              sndradm`_start+0x7a

  0 100087                      write:entry
              libc.so.1`_write+0x15
              libc.so.1`_ndoprnt+0x2816
              libc.so.1`printf+0xa8
              sndradm`rdc_print+0x8ab
              sndradm`main+0xceb
              sndradm`_start+0x7a

  1 100087                      write:entry
              libc.so.1`_write+0x15
              libdscfg.so.1`cfg_writepid+0x5b
              libdscfg.so.1`cfg_enterpid+0x7d
              libdscfg.so.1`cfg_lockd_init+0x126
              libdscfg.so.1`cfg_open+0x40d
              sndradm`rdc_print+0x21f
              sndradm`main+0xceb
              sndradm`_start+0x7a

OK, but on node1:

# dtrace -n 'syscall::write:entry / pid == $target / \
{ ustack(); }' -c "sndradm -C local -P"
dtrace: description 'syscall::write:entry ' matched 1 probe

dtrace: pid 3729 has exited
CPU     ID                    FUNCTION:NAME
  1  98411                      write:entry
              libc.so.1`_write+0x15
              libdscfg.so.1`cfg_writepid+0x5b
              libdscfg.so.1`cfg_enterpid+0x7d
              libdscfg.so.1`cfg_lockd_init+0x126
              libdscfg.so.1`cfg_open+0x40d
              sndradm`rdc_print+0x21f
              sndradm`main+0xceb
              sndradm`_start+0x7a

So the normal output has five writes below the rdc_print function (four printf calls, plus one from cfg_writepid via cfg_open), but the second sndradm command (node1) has just one! With these stack traces we at least know which function to look at...
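
(As an aside, a more compact way to make this comparison would be to aggregate the write stacks instead of printing each one. A minimal variant of the script above, which prints each unique user stack with its count when the command exits:)

 # dtrace -n 'syscall::write:entry /pid == $target/ \
{ @[ustack()] = count(); }' -c "sndradm -C local -P"
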
Let's follow the entry/return of the rdc_print function:

# dtrace -n 'pid$target::rdc_print:entry /pid == $target/ \
{self->follow = 1;} pid$target::rdc_print:return \
{self->follow = 0; trace(arg0);} pid$target:::entry \
/self->follow/ {} pid$target:::return /self->follow/ \
{trace(arg0);}' -c "sndradm -C local -P" | grep rdc_

dtrace: description 'pid$target::rdc_print:entry ' matched 11034 probes
  0 100881                  rdc_print:entry
  0 100930                  rdc_ioctl:entry
  0 106451                 rdc_ioctl:return               133
dtrace: pid 28113 has exited
  0 100891            rdc_print_state:entry
  0 106412           rdc_print_state:return               573
  0 100891            rdc_print_state:entry
  0 106412           rdc_print_state:return               573
  0 100882                 rdc_print:return              2370

On node1 (just the output):

dtrace: description 'pid$target::rdc_print:entry ' matched 11034 probes
  0  98991                  rdc_print:entry
  0  99040                  rdc_ioctl:entry
  0 104561                 rdc_ioctl:return               133
dtrace: pid 3745 has exited
  0  98992                 rdc_print:return              2370

As we can see, when the command is executed on node1 there is no call to the rdc_print_state function. Looking at the sndradm.c source code, we can see that the rdc_print function calls rdc_print_state to print lines 2 and 4 of a normal execution.
Here we can see it:

(void) printf(gettext(", state: %s"), rdc_print_state(urdc));

The execution on node1 is not calling rdc_print_state!
Let's confirm it:

 # dtrace -n 'pid$target::rdc_print_state: {}' -c "sndradm -C local -P"
dtrace: description 'pid$target::rdc_print_state: ' matched 175 probes
dtrace: pid 3785 has exited

So I wrote a similar DTrace script to look at each rdc_print instruction, and took a diff...
node2:

 # dtrace -n 'pid$target::rdc_print: {}' -c "sndradm -C local -P" |\
 /usr/sfw/bin/ggrep -A 5 -B 6 280
dtrace: description 'pid$target::rdc_print: ' matched 702 probes
dtrace: pid 28156 has exited
CPU     ID                    FUNCTION:NAME
  0 101045                    rdc_print:253
  0 101046                    rdc_print:254
  0 101047                    rdc_print:259
  0 101048                    rdc_print:25c
  0 101049                    rdc_print:25f
  0 101057                    rdc_print:279
  0 101058                    rdc_print:280
  0 101059                    rdc_print:283
  0 101060                    rdc_print:286
  0 101061                    rdc_print:28c
  0 101062                    rdc_print:28f
  0 101063                    rdc_print:291

node1:

 # dtrace -n 'pid$target::rdc_print: {}' -c "sndradm -C local -P" |\
 /usr/sfw/bin/ggrep -A 5 -B 6 280
dtrace: description 'pid$target::rdc_print: ' matched 702 probes
dtrace: pid 3801 has exited
  0  99155                    rdc_print:253
  0  99156                    rdc_print:254
  0  99157                    rdc_print:259
  0  99158                    rdc_print:25c
  0  99159                    rdc_print:25f
  0  99167                    rdc_print:279
  0  99168                    rdc_print:280
  0  99169                    rdc_print:283
  0  99170                    rdc_print:286
  0  99650                    rdc_print:8c1
  0  99651                    rdc_print:8c4
  0  99652                    rdc_print:8c7

I think the important line is rdc_print:286. After that line, the same code takes different directions on node1 and node2. So we can use mdb to disassemble the code (we can run this on either node):

 # echo rdc_print+271::dis | mdb /usr/sbin/sndradm
rdc_print+0x253:                pushl  %eax
rdc_print+0x254:                call   -0xcd8   < PLT:cfg_lock >
rdc_print+0x259:                addl   $0x8,%esp
rdc_print+0x25c:                cmpl   $0x0,%eax
rdc_print+0x25f:                jne    +0x1a    < rdc_print+0x279 >
rdc_print+0x261:                pushl  $0x8061f58
rdc_print+0x266:                call   -0xd4a   < PLT:gettext >
rdc_print+0x26b:                addl   $0x4,%esp
rdc_print+0x26e:                pushl  %eax
rdc_print+0x26f:                pushl  $0x0
rdc_print+0x271:                call   +0xe87f  < rdc_err >
rdc_print+0x276:                addl   $0x8,%esp
rdc_print+0x279:                movl   $0x0,-0x18(%ebp)
rdc_print+0x280:                movl   -0x18(%ebp),%eax
rdc_print+0x283:                cmpl   -0x20(%ebp),%eax
rdc_print+0x286:                jge    +0x63b   < rdc_print+0x8c1 >
rdc_print+0x28c:                movl   -0x18(%ebp),%ecx
rdc_print+0x28f:                movl   %ecx,%eax
rdc_print+0x291:                shll   $0x4,%eax
rdc_print+0x294:                movl   %ecx,%edx
rdc_print+0x296:                shll   $0x5,%edx

OK, let's go! Here, the first important line is:

rdc_print+0x254:                call   -0xcd8   < PLT:cfg_lock >

In the rdc_print function there is just "one" call to the cfg_lock function, on line 273. That instruction is executed on both nodes, and the result is fine, as we can see here:

rdc_print+0x25f:                jne    +0x1a    < rdc_print+0x279 >

On both nodes the next executed instruction is rdc_print+0x279: the jne is taken, which means cfg_lock returned non-zero (success) and the rdc_err branch was skipped.
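
(If we wanted extra confirmation, the pid provider could show us the cfg_lock return value directly; in a :return probe, arg1 holds the function's return value. A quick sketch, assuming cfg_lock lives in libdscfg.so.1, which the PLT call and the other libdscfg frames suggest:)

 # dtrace -n 'pid$target:libdscfg.so.1:cfg_lock:return \
{ trace(arg1); }' -c "sndradm -C local -P"

Running that on both nodes should show the same non-zero value if locking really is fine on both.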

But the problem is in the next comparison, and in the jump that follows:

rdc_print+0x286:                jge    +0x63b   < rdc_print+0x8c1 >

Looking at the source code one more time, we can see that the next evaluation is on line 277:

 for (i = 0; i < max; i++) {

That evaluation gives the "wrong" result on node1! The jge at +0x286 is taken on the very first pass (when i is still 0), so max must be zero or negative; for some reason the max variable "must" be corrupted, the "for" body never runs, and that is why I got "nothing" as output. On line 263, we can see:

max = min(rdc_status->nset, rdc_maxsets);
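
Just to sketch how we could read those values without leaving DTrace (untested, and assuming the stack layout from the disassembly above, with i at -0x18(%ebp) and max at -0x20(%ebp)):

 # dtrace -n 'pid$target::rdc_print:283 \
{ printf("i=%d max=%d", \
    *(int *)copyin(uregs[R_EBP] - 0x18, 4), \
    *(int *)copyin(uregs[R_EBP] - 0x20, 4)); }' -c "sndradm -C local -P"

If max prints as 0 on node1, the jge at +0x286 is taken right away and the loop body never runs, which matches exactly what we are seeing.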

So DTrace gave us the answer, and rdc_print+0x286 was the culprit...

I think we can continue the debugging with mdb, or another tool, to find out the value of "max".
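
For instance, a minimal mdb session could look something like this (just an untested sketch: it sets a breakpoint on the comparison at rdc_print+0x283 and, once stopped there, dumps the 4-byte value at -0x20(%ebp), where the disassembly says max lives):

 # mdb /usr/sbin/sndradm
 > rdc_print+0x283::bp
 > ::run -C local -P
 ...
 > <ebp-0x20/D
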
But maybe the "devel team" already knows about this problem, so I will post it on the OpenSolaris forums and see what I get...
That's all