PLEASE BE AWARE THAT ANY INFORMATION YOU MAY FIND HERE MAY BE INACCURATE, AND COULD INCLUDE TECHNICAL INACCURACIES, TYPOGRAPHICAL ERRORS, AND EVEN SPELLING ERRORS.

 From the MANUAL page:
 The zdb command is used by  support  engineers  to  diagnose
 failures and gather statistics. Since the ZFS file system is
 always consistent on disk and is self-repairing, zdb  should
 only be run under the direction by a support engineer.

DO NOT TRY IT IN PRODUCTION. USE AT YOUR OWN RISK!

This is something tricky, and I think it is important to write about… I’m talking about ZFS vdevs.
ZFS has two types of vdevs: logical and physical. So, from the ZFS on-disk specification, we know that a physical vdev is a writeable media device, and a logical vdev is a grouping of physical vdevs.
Let’s see a simple diagram using a RAIDZ logical vdev, and five physical vdevs:

                             +---------------------+
                             |      Root vdev      |
                             +---------------------+
                                        |
                                +--------------+
                                |    RAIDZ     |            VIRTUAL VDEV
                                +--------------+
                                       |
                                  +----------+
                                  |  128KB   |              CHECKSUM
                                  +----------+
                                       |
           32KB         32KB         32KB          32KB          Parity
        .------.      .------.      .------.     .------.       .------.
       |-______-|    |-______-|    |-______-|   |-______-|     |-______-|
       | Vdev1  |    | Vdev2  |    | Vdev3  |   | Vdev4  |     | Vdev5  |    PHYSICAL VDEVS
        '-____-'      '-____-'      '-____-'     '-____-'       '-_____-'

The diagram above is just an example, and in it the data handled by the RAIDZ virtual vdev is a 128KB block. I picked that size only to make my math easy, so I could divide it equally across the physical vdevs (32KB on each of four, plus parity on the fifth). ;-)
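To make the arithmetic of the diagram concrete, here is a minimal Python sketch. It is a toy model of the picture only, not the real RAIDZ on-disk layout: the function name split_with_parity is hypothetical, and a single XOR parity column is assumed.

```python
from functools import reduce


def split_with_parity(block: bytes, data_vdevs: int):
    """Split a block into equal data columns plus one XOR parity column.

    Toy model of the diagram; real RAIDZ lays data out per sector.
    """
    if len(block) % data_vdevs:
        raise ValueError("this toy example needs an evenly divisible block")
    width = len(block) // data_vdevs
    columns = [block[i * width:(i + 1) * width] for i in range(data_vdevs)]
    # Parity byte N is the XOR of byte N of every data column.
    parity = bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*columns))
    return columns, parity


block = bytes(i % 251 for i in range(128 * 1024))  # a 128KB block of test data
columns, parity = split_with_parity(block, 4)
print([len(c) // 1024 for c in columns], len(parity) // 1024)  # [32, 32, 32, 32] 32
```

Any single column can be rebuilt by XORing the parity with the other three, which is what makes the stripe survive the loss of one physical vdev.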

But remember that with RAIDZ we always have a full stripe, no matter the size of the data.
The important part here is the filesystem block. When I first saw a video presentation about ZFS, I had the wrong perception about the diagram above. As we can see, if the system reads a block, let’s say the 128KB block above, and physical vdev 1 returns the wrong data, ZFS just fixes the data on that physical vdev, right? Wrong… and that was my wrong perception. ;-)
The ZFS RAIDZ virtual vdev does not know which physical vdev (disk) returned the wrong data. Here I think there is a great level of abstraction that shows the beauty of ZFS… the filesystems are there (on the physical vdevs), but there is no explicit relation! A filesystem block has nothing to do with a disk block. The checksum of the data block is not kept at the physical vdev level, so ZFS cannot directly know which disk returned the wrong data without a “combinatorial reconstruction” to identify the culprit. From vdev_raidz.c:

    static void
    vdev_raidz_io_done(zio_t *zio)
    {
...
    	/*
    	 * If the number of errors we saw was correctable -- less than or equal
    	 * to the number of parity disks read -- attempt to produce data that
    	 * has a valid checksum. Naturally, this case applies in the absence of
    	 * any errors.
    	 */
...

That gives a good understanding of the design of ZFS. I really like this way of solving problems, with specialized parts like this one. Some may think that this behaviour is not optimal, but remember that this is something that should not happen all the time.
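The idea behind combinatorial reconstruction can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the logic of vdev_raidz.c: it assumes single XOR parity, uses SHA-256 as a stand-in for the block checksum, and the names xor_columns and read_block are hypothetical.

```python
import hashlib


def xor_columns(columns):
    """XOR equally sized byte strings together, position by position."""
    out = bytearray(len(columns[0]))
    for col in columns:
        for i, b in enumerate(col):
            out[i] ^= b
    return bytes(out)


def read_block(data_cols, parity, expected_cksum):
    """Return (data, index of the bad column or None)."""
    def ok(cols):
        return hashlib.sha256(b"".join(cols)).digest() == expected_cksum

    if ok(data_cols):
        return b"".join(data_cols), None
    # The checksum covers the whole block, so it cannot point at a column.
    # Assume each column in turn is the bad one, rebuild it from the others
    # plus parity, and keep the first combination whose checksum matches.
    for bad in range(len(data_cols)):
        others = data_cols[:bad] + data_cols[bad + 1:]
        rebuilt = xor_columns(others + [parity])
        candidate = data_cols[:bad] + [rebuilt] + data_cols[bad + 1:]
        if ok(candidate):
            return b"".join(candidate), bad
    raise OSError("checksum error: no single-column reconstruction matches")


data = [bytes([n]) * 32 for n in (1, 2, 3, 4)]
parity = xor_columns(data)
cksum = hashlib.sha256(b"".join(data)).digest()

damaged = list(data)
damaged[2] = b"\xff" * 32  # silently corrupt the third column
block, bad = read_block(damaged, parity, cksum)
print(bad)  # 2
```

Only after a candidate passes the checksum does the reader learn which column was the culprit, which is the point made above.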

With a mirror we have a completely different situation: all the data is on every device, so when the checksum does not match, ZFS can read the other vdevs looking for the right answer. Remember that we can have an n-way mirror…

In the source we can see that a normal read goes to just one device:

    static int
    vdev_mirror_io_start(zio_t *zio)
    {

...
    		/*
    		 * For normal reads just pick one child.
    		 */
    		c = vdev_mirror_child_select(zio);
    		children = (c >= 0);
...

So ZFS knows whether the data is OK, and if it is not, it can fix it. Not in terms of
which disk, though, but of which physical vdev. ;-) The procedure is the same as for RAIDZ,
minus the combinatorial reconstruction. As a final note, the resilver of a block is not
copy-on-write, and in the code we have a comment about it:


    			/*
    			 * Don't rewrite known good children.
    			 * Not only is it unnecessary, it could
    			 * actually be harmful: if the system lost
    			 * power while rewriting the only good copy,
    			 * there would be no good copies left!
    			 */

So the physical vdev that has a good copy is not touched.
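The mirror behaviour described above can be sketched in Python. This is a toy model and not the code of vdev_mirror.c: the children are plain in-memory buffers, SHA-256 stands in for the block checksum, child selection is simply first to last, and mirror_read is a hypothetical name.

```python
import hashlib


def mirror_read(children, expected_cksum):
    """Read from a toy mirror, repairing only the copies known to be bad.

    children is a list of bytearray copies of the same block.
    Returns (good data, indexes of the children that were rewritten).
    """
    def ok(buf):
        return hashlib.sha256(bytes(buf)).digest() == expected_cksum

    bad = []
    for idx, copy in enumerate(children):
        if ok(copy):
            good = bytes(copy)
            # Rewrite only the children that failed the checksum; the
            # child holding the good copy is never written to.
            for b in bad:
                children[b][:] = good
            return good, bad
        bad.append(idx)
    raise OSError("checksum error on every child")


good = b"sendmail.cf contents"
cksum = hashlib.sha256(good).digest()
children = [bytearray(b"garbage garbage garb"), bytearray(good)]

data, repaired = mirror_read(children, cksum)
print(repaired, bytes(children[0]) == good)  # [0] True
```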
Since seeing is believing…

mkfile 100m /var/fakedevices/disk1
mkfile 100m /var/fakedevices/disk2
zpool create cow mirror /var/fakedevices/disk1 /var/fakedevices/disk2
zfs create cow/fs01
cp -pRf /etc/mail/sendmail.cf /cow/fs01/
ls -i /cow/fs01/
 4 sendmail.cf
zdb -dddddd cow/fs01 4
Dataset cow/fs01 [ZPL], ID 30, cr_txg 15, 58.5K, 5 objects, rootbp [L0 DMU objset]  \\
400L/200P DVA[0]=<0:21200:200> DVA[1]=<0:1218c00:200> fletcher4 lzjb LE contiguous \\
birth=84 fill=5 cksum=99a40530b:410673cd31e:df83eb73e794:207fa6d2b71da7

    Object  lvl   iblk   dblk  lsize  asize  type
         4    1    16K  39.5K  39.5K  39.5K  ZFS plain file (K=inherit) (Z=inherit)
                                 264  bonus  ZFS znode
        path    /sendmail.cf
        uid     0
        gid     2
        atime   Mon Jul 13 19:01:42 2009
        mtime   Wed Nov 19 22:35:39 2008
        ctime   Mon Jul 13 18:30:19 2009
        crtime  Mon Jul 13 18:30:19 2009
        gen     17
        mode    100444
        size    40127
        parent  3
        links   1
        xattr   0
        rdev    0x0000000000000000
Indirect blocks:
               0 L0 0:11200:9e00 9e00L/9e00P F=1 B=17

                segment [0000000000000000, 0000000000009e00) size 39.5K

So, we have a mirror of two disks, and a little file on it… let’s do a little math, and
smash the data block on the first disk:

zpool export cow
perl -e "\$x = ((0x400000 + 0x11200) / 512); printf \"\$x\\n\";"
dd if=/tmp/garbage.txt of=/var/fakedevices/disk1 bs=512 seek=8329 count=79 conv="nocreat,notrunc"
zpool import -d /var/fakedevices/ cow
cat /cow/fs01/sendmail.cf > /dev/null
zpool status cow
  pool: cow
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        cow             ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            /var/disk1  ONLINE       0     0     1
            /var/disk2  ONLINE       0     0     0

errors: No known data errors
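
The seek and count values passed to dd above come straight from the DVA that zdb printed (0:11200:9e00). Here is a small Python sketch of that math. It assumes 512-byte sectors and the 4MB (0x400000) reserved at the front of each vdev for labels and boot block, the same constant used in the perl one-liner; dva_to_dd_args is a hypothetical helper name.

```python
LABEL_AND_BOOT = 0x400000  # 4MB at the front of each vdev (assumed, as in the perl line)
SECTOR = 512               # assumed sector size, matching bs=512


def dva_to_dd_args(dva: str):
    """Turn a 'vdev:offset:asize' DVA (hex fields) into dd skip/seek and count."""
    vdev, offset, asize = (int(part, 16) for part in dva.split(":"))
    skip = (LABEL_AND_BOOT + offset) // SECTOR
    count = asize // SECTOR
    return vdev, skip, count


print(dva_to_dd_args("0:11200:9e00"))  # (0, 8329, 79)
```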

Now let’s export the pool and read our data from the same offset in both disks:

dd if=/var/fakedevices/disk1 of=/tmp/dump.txt bs=512 skip=8329 count=79
dd if=/var/fakedevices/disk2 of=/tmp/dump2.txt bs=512 skip=8329 count=79
diff /tmp/dump.txt /tmp/dump2.txt
head /tmp/dump.txt
#
# Copyright (c) 1998-2004 Sendmail, Inc. and its suppliers.
#       All rights reserved.
# Copyright (c) 1983, 1995 Eric P. Allman.  All rights reserved.
# Copyright (c) 1988, 1993
#       The Regents of the University of California.  All rights reserved.
#
# Copyright 1993, 1997-2006 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#

head /etc/mail/sendmail.cf
#
# Copyright (c) 1998-2004 Sendmail, Inc. and its suppliers.
#       All rights reserved.
# Copyright (c) 1983, 1995 Eric P. Allman.  All rights reserved.
# Copyright (c) 1988, 1993
#       The Regents of the University of California.  All rights reserved.
#
# Copyright 1993, 1997-2006 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#
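
The dd plus diff comparison above can also be done in one step. Here is a minimal Python sketch, demonstrated on two throwaway files rather than on real vdevs; read_region is a hypothetical helper, and 512-byte sectors are assumed.

```python
import os
import tempfile


def read_region(path, skip_sectors, count_sectors, sector=512):
    """Read count_sectors sectors starting at skip_sectors, like dd skip=/count=."""
    with open(path, "rb") as f:
        f.seek(skip_sectors * sector)
        return f.read(count_sectors * sector)


# Two small files stand in for the two sides of the mirror.
with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, name) for name in ("side1", "side2")]
    for p in paths:
        with open(p, "wb") as f:
            f.write(b"\0" * 1024 + b"same payload" + b"\0" * 500)
    regions = [read_region(p, 2, 1) for p in paths]
    identical = regions[0] == regions[1]

print(identical)  # True
```

For the pool in this post, the equivalent call would use the two files under /var/fakedevices/ with skip 8329 and count 79.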

So, never burn your physical vdevs, because you can (almost) always get some files from them.
Even if ZFS can’t. ;-)
peace