AP Photo/Oliver Multhaup

Hello there, this will be a long post… so, if you have no time to waste, think again before you go ahead.
And, no, I’m not Morpheus, so I cannot offer you the truth. Actually, I can just offer you my doubts. And no pills; I do not like them.
My blog is like a notepad, and I’m not a kernel developer, so what I describe in this post is just my findings and is in no way an authoritative or official resource about the OpenSolaris kernel or the scsi interface. It’s here in the hope it can be:
1) Useful
2) Fixed
In any order…
DO NOT TRY IT IN PRODUCTION. USE AT YOUR OWN RISK!

Well, for my two or three loyal readers… here we go!
After my last post about performance (II), I’m trying (really hard) to understand the “background” of the Solaris/OpenSolaris device driver interface: mainly the scsi driver (sd), the elevator/reordering algorithm, and their impact on workloads (performance and consistency, if any).
I do think it’s an important point in storage, and I guess we do not look at it this way most of the time. It’s like F1: the driver and the car are the subjects of all the interest, but many races are decided in the pit stop.
In my quest, the first thing that got my attention was this comment:

 ..
  * A one-way scan is natural because of the way UNIX read-ahead
  * blocks are allocated.
...

The phrase above is out of context, and we will talk more about the function this comment is trying to explain, but I really think we need to review our concepts.
So, if you read my last post about performance, you know that I talked about the disk queue and command reordering inside the device driver. The comment block above is from the function sd_add_buf_to_waitq, the first function that caught my attention. It’s through this function that an I/O command is sent to the disk. As always, and a common feature of OpenSolaris code, it is very well commented and simple to read. Here is the prototype of the function we will discuss in this post:

   ...
static void
sd_add_buf_to_waitq(struct sd_lun *un, struct buf *bp)
   ...

So, the function receives two parameters, and basically: one is the target and the other is the command. It’s important for us to understand where we are… we are not talking about filesystems, about the ZPL, the DMU, or the SPA. We are down low, really close to the disks, and more: we have the power to live and let die.
Sorry for the simplistic approach, but I will refer to the first argument as the “disk” we want to write to, and to the second argument as the “data” we want to write to that disk. The *bp is a pointer to a buffer that carries much more information the device driver needs to do the job.
Before talking about that procedure in particular, let’s agree that in a normal scenario we would expect a FIFO algorithm for this kind of work. So, if we need to write data “B”, “C”, “D”, “E”, and “F” at the locations “10”, “14”, “18”, “19”, and “21”, that would be the order in which we would execute the task.
Now, back to that procedure: the *bp holds the one piece of information everything depends on: the location where the data will be written. The BLKNO tells the absolute block where that data will be written, and so, if the disksort flag is enabled on that device (it is by default), everything changes.
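
To make that concrete, here is a minimal, simplified view of the two struct buf fields this post cares about. The real definition lives in sys/buf.h and is much bigger, and the field types here are simplified for illustration:

/* Simplified sketch; see sys/buf.h for the real struct buf. */
struct buf {
	long		b_blkno;	/* absolute block number on the device */
	struct buf	*av_forw;	/* next request on the driver's list   */
	/* ... b_bcount, b_flags, the data pointer, and much more ... */
};
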
Let’s say we have a scsi disk, sd1, with a ZFS filesystem on it. The requests are sent to that disk through the sd_add_buf_to_waitq function, and are either sent directly to the device or placed on the wait queue. Now, imagine we have a queue (the letters above) like this (lower BLKNOs first):

+------------------------+
| 10 | 14 | 18 | 19 | 21 |
+------------------------+

So those I/Os were sent to the driver, and now we want to write the letter “A”, whose BLKNO is “8”. Let’s follow the logic in sd_add_buf_to_waitq and understand how this new I/O will be delivered:

...
/* If the queue is empty, add the buf as the only entry & return. */
 if (un->un_waitq_headp == NULL) {
...

But the queue is not empty, no luck for us… what’s next?

/*
 * If sorting is disabled, just add the buf to the tail end of
 * the wait queue and return.
 */
...
	if (un->un_f_disksort_disabled || un->un_f_enable_rmw) {
...

That’s not the case either, but if the answer were “yes” (disksort disabled), this would be FIFO and we would be placed after the “21”. But sorting is enabled by default for all devices. So, the logic to insert the “A” into the wait queue is as follows:

 -> BLKNO of (A) < BLKNO of first request (B)?
  -> No, then... search for the first inversion on the list,
       because it should be ascending. If it finds an inversion,
       it adds the "A" before or after it, depending on the BLKNO.
       In this case there is no inversion, so the "A" will be placed
       at the end, and will be the first request of the "second"
       list:

+-----------------------------+
| 10 | 14 | 18 | 19 | 21 | 08 |
+-----------------------------+


Yes, second list. We have two lists implemented on the waitq:
– The first holds the commands whose BLKNO is greater than the current position;
– The second holds the requests that came in after their BLKNO had already been passed.
Well, it’s not FIFO, but we get the same behaviour in our example, don’t you agree? The last request (A) will be the last one to be served. Fair enough…
To keep the math simple, assuming this disk can handle each request in ~10ms, our request will be “done” in ~60ms (six requests × ~10ms each).
But remember: this is not FIFO, and that first example was just a coincidence. Luck for some, providence for others. So, let’s say that in the next millisecond six more requests arrive: “O”, “P”, “Q”, “R”, “S”, and “T”,
with BLKNOs “4”, “7”, “1000”, “6”, “2000”, and “3000” respectively (random, as life is).
Let’s see how our waitq would look (following the algorithm):

+-----------------------------------------------------------------+
| 10 | 14 | 18 | 19 | 21 | 1000 | 2000 | 3000 | 04 | 06 | 07 | 08 |
+-----------------------------------------------------------------+
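
To check my reading of the algorithm, here is a toy user-space sketch that mimics the insertion logic of sd_add_buf_to_waitq as I understand it: no locking, no real struct buf, just block numbers, and the names are mine, not sd.c’s. Feeding it the requests from the example reproduces the queue above:

/*
 * waitq_sim.c -- toy model of the one-way scan ("disksort") insertion.
 * A sketch of my reading of sd_add_buf_to_waitq(), not the driver code.
 */
#include <stdio.h>
#include <stdlib.h>

struct req {
	long		blkno;		/* stands in for bp->b_blkno */
	struct req	*av_forw;	/* stands in for bp->av_forw */
};

static struct req *waitq_headp;		/* stands in for un->un_waitq_headp */

static void
add_to_waitq(long blkno)
{
	struct req *bp = calloc(1, sizeof (*bp));
	struct req *ap = waitq_headp;

	bp->blkno = blkno;

	/* If the queue is empty, the new request is the only entry. */
	if (ap == NULL) {
		waitq_headp = bp;
		return;
	}

	if (bp->blkno < ap->blkno) {
		/*
		 * We already passed this block number: the request
		 * belongs on the "second" list. Walk to the first
		 * inversion (that is where the second list starts)...
		 */
		while (ap->av_forw != NULL && ap->av_forw->blkno >= ap->blkno)
			ap = ap->av_forw;
		/*
		 * ...then keep ascending order inside the second list:
		 * stop before the first larger block number, or at the
		 * tail (becoming the first request of the second list).
		 */
		while (ap->av_forw != NULL && bp->blkno >= ap->av_forw->blkno)
			ap = ap->av_forw;
	} else {
		/*
		 * The request is at/after the current head: keep the
		 * first list ascending, stopping at the inversion or
		 * before the first larger block number.
		 */
		while (ap->av_forw != NULL &&
		    ap->av_forw->blkno >= ap->blkno &&
		    bp->blkno >= ap->av_forw->blkno)
			ap = ap->av_forw;
	}

	/* Link the new request in after 'ap'. */
	bp->av_forw = ap->av_forw;
	ap->av_forw = bp;
}

int
main(void)
{
	/* B..F first, then A (8), then O, P, Q, R, S, T. */
	long blknos[] = { 10, 14, 18, 19, 21, 8, 4, 7, 1000, 6, 2000, 3000 };
	struct req *ap;
	size_t i;

	for (i = 0; i < sizeof (blknos) / sizeof (blknos[0]); i++)
		add_to_waitq(blknos[i]);

	/* Prints: waitq: 10 14 18 19 21 1000 2000 3000 4 6 7 8 */
	printf("waitq:");
	for (ap = waitq_headp; ap != NULL; ap = ap->av_forw)
		printf(" %ld", ap->blkno);
	printf("\n");
	return (0);
}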

Hmmm, now our latency changed a little… from ~60ms to ~120ms (twelve requests × ~10ms)! We can have I/Os taking half a second, and so somebody waiting for the 4K block of an email header, or a really bad browser experience. But the logic in the sd.c code is not dumb; it has a reason, a purpose, and it is really simple: provide more bandwidth.
The idea is this:
– We do a “head seek”, and do the maximum work we can there…
– Then we “seek again” and do the same thing.
The purpose is to get the most out of each head seek. Whether that is the better default behaviour is not so clear to me.
But the main concern should be the length of that list. The bigger it gets, the bigger the latencies, and you will not have control of your workload’s service times. If you need “controlled” latencies, you need to configure that list down to a few elements, or none at all. You really want to check this parameter if you need to handle a heavy random IOPS workload.
The latest OpenSolaris distro is 2009.06, and that version has the old default of 35 elements (zfs_vdev_max_pending) on that list. That number can give you all the random latency you don’t want in a critical IOPS environment. The new value of 10 is better, but not perfect for all workloads. So, take a close look at your disks’ latency times.
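
For reference, these are the knobs I’m talking about; a minimal sketch. The /etc/system tunable is the standard one; for the sd side I’m assuming a build that accepts the name:value sd-config-list format, and “VENDOR  MODEL” is a placeholder for your device’s inquiry string, not a real entry:

* /etc/system -- lower the per-vdev queue depth (takes effect on reboot):
set zfs:zfs_vdev_max_pending = 10

# /kernel/drv/sd.conf -- or disable the sorting itself for a device:
sd-config-list = "VENDOR  MODEL", "disksort:false";

You can also poke zfs_vdev_max_pending on a live system with mdb -kw while testing.
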
I wrote a D script to look further into this situation, and what I saw is that the ideal number of elements is totally dependent on the workload. I’m in the middle of these tests, but what I can tell you is that 10 was always better than 35. But just two or three elements was not better in every situation I tested; in some workloads 15 was the better number. It all depends on the disk queue, service times, workload, retries, etc…
The last point about reordering is consistency, and the comment I quoted at the beginning of this post. Besides the performance problem we can have with this configuration, I think there are some consistency issues we can face too. My script prints the following information (my guessed column mapping comes after the list):

– Device throttle (default 256, and not a problem here; the max cmds allowed in transport…);
– ncmds_in_driver (I think this counts the cmds in the waitq plus “others” I could not trace; not sure);
– ncmds_in_transport (the commands sent to the device; I hope there is no other sort logic down there);
– Write cache enabled;
– The BLKNO of the request in question (buf);
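
To help read the output below, here is my guess at how each column maps to the sd_lun fields in sd.c. DID and BCT are my own abbreviations, so treat this mapping as an assumption rather than sd.c’s naming:

/*
 * Column -> source (my mapping; an assumption):
 *   DID    -> device instance id
 *   BCT    -> un->un_blockcount            (capacity in blocks)
 *   WCE    -> un->un_f_write_cache_enabled
 *   THR    -> un->un_throttle              (max cmds in transport)
 *   NCMDSD -> un->un_ncmds_in_driver
 *   NCMDST -> un->un_ncmds_in_transport
 *   BS     -> bp->b_blkno                  (the BLKNO of this request)
 */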

I ran my initial tests on a ZFS environment with just one disk in the pool, and a really idle system to make things easier. Here are two spa_syncs with the output from that test, with zfs_vdev_max_pending = 10:

2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426757
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  3  BS: 68987048
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 100771666
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  2  BS: 69233318
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  2  BS: 100771666
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233378
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 100771670
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 32530
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 33042
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  2  BS: 154239250
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  3  BS: 154239762
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 101426647
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42218576
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233386
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 68986953
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233322
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 100771642
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101482890
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 68987055
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233372
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426764
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42115694
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42218578
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752890
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171462
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426771
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 101171464
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  1  BS: 69233376
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752873
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 68987055
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233372
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171460
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426764
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  0  BS: 101426646
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233282
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426663
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752866
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 100771670
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42115694
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  1  BS: 101426648
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101039490
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171406
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 100771666
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  1  BS: 69233318
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  0  BS: 101426764
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  3  BS: 68987055
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233386
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  2  BS: 101171400
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101040002
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42115694
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233374
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426774
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42218576
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101482898
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171460
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752873
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 68987062
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101482898
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752890
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233318
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171466
2010 Sep  4 20:38:07 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426777

Here is the second, 30 seconds later…

2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  1  BS: 101426787
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233394
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 100771690
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  2  BS: 69233450
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42115702
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233466
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 101426784
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426790
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752898
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233450
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 101171514
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  1  BS: 69233460
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  0  BS: 101426813
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  3  BS: 68987101
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42115702
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233466
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233462
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 32532
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 33044
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  2  BS: 154239252
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  3  BS: 154239764
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  0  BS: 101426793
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233384
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233454
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171506
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42218586
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42218594
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171508
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171516
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42218584
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 100771690
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  1  BS: 101426780
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426796
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 68987108
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233462
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 68987124
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233448
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101482906
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171468
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42218584
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  2  BS: 100771690
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 68987108
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 68987108
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426820
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426836
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233418
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 100771674
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42115702
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426820
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752905
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752930
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426820
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752905
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171506
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 68987071
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233448
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  1 NCMDST:  0  BS: 101171514
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 69233460
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  3 NCMDST:  2  BS: 42218592
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101482906
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101426833
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  4 NCMDST:  3  BS: 68987121
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 42752930
2010 Sep  4 20:38:37 - DID:  1 BCT: 154296320 WCE:  0 THR: 256 NCMDSD:  2 NCMDST:  1  BS: 101171516

Above you can see two spa_syncs, with 30 seconds between one and the other. Everything’s fine… but…
In the first execution, at the “beginning”, we have this:

2010 Sep  4 20:38:07 -  ... NCMDSD:  1 NCMDST:  0  BS: 32530
2010 Sep  4 20:38:07 -  ... NCMDSD:  2 NCMDST:  1  BS: 33042
2010 Sep  4 20:38:07 -  ... NCMDSD:  3 NCMDST:  2  BS: 154239250
2010 Sep  4 20:38:07 -  ... NCMDSD:  4 NCMDST:  3  BS: 154239762

33042 – 32530 = 512 blocks × 512 bytes = 256KB = the size of a ZFS label: these are L0 and L1
154239762 – 154239250 = 512 blocks × 512 bytes = 256KB = the size of a ZFS label: these are L2 and L3

First, why are those blocks at the beginning of the report, when they should be the last update?
The two-phase commit says that ZFS updates L0 and L2 first; after that, labels L1 and L3 are updated. L0 and L1 are at the beginning of the vdev (disk), and L2 and L3 are at the end. So, with the two-phase commit, updating one label at the head and one at the tail of the disk, ZFS makes it harder to lose access to a pool when there is a failure, since damage is more likely to be contiguous. It is interesting to see that these are the only commands that show the 1, 2, 3, 4 pattern in ncmds in driver (and 0, 1, 2, 3 in transport). It will be interesting to see the latencies for these updates.
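
For reference, this is roughly how the four labels sit on a vdev; each label is 256KB, with the 3.5MB boot block reservation right after L1:

+------+------+------+-----------------+------+------+
|  L0  |  L1  | boot |  pool data ...  |  L2  |  L3  |
+------+------+------+-----------------+------+------+
 256KB   256KB  3.5MB                    256KB  256KB
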
I changed zfs_vdev_max_pending to 1 and the problem almost went away completely, but I still saw a few label updates in the middle of the reports sometimes. To be continued…

peace