PLEASE BE AWARE THAT ANY INFORMATION YOU MAY FIND HERE MAY BE INACCURATE, AND COULD INCLUDE TECHNICAL INACCURACIES, TYPOGRAPHICAL ERRORS, AND EVEN SPELLING ERRORS.
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction by a support engineer.
DO NOT TRY IT IN PRODUCTION. USE AT YOUR OWN RISK!
This is a tricky subject, and I think it is important to write about… I’m talking about ZFS vdevs.
ZFS has two types of vdevs: logical and physical. So, from the ZFS on-disk specification, we know that a physical vdev is a writeable media device, and a logical vdev is a grouping of physical vdevs.
Let’s see a simple diagram using a RAIDZ logical vdev, and five physical vdevs:
                  +---------------------+
                  |      Root vdev      |
                  +---------------------+
                             |
                      +--------------+
                      |    RAIDZ     |                          VIRTUAL VDEV
                      +--------------+
                             |
                        +----------+
                        |  128KB   |                            CHECKSUM
                        +----------+
                             |
    32KB        32KB        32KB        32KB       Parity
  .------.    .------.    .------.    .------.    .------.
 |-______-|  |-______-|  |-______-|  |-______-|  |-______-|
 | vdev1  |  | vdev2  |  | vdev3  |  | vdev4  |  | vdev5  |     PHYSICAL VDEVS
  '-____-'    '-____-'    '-____-'    '-____-'    '-____-'
The diagram above is just an example, and in it the data we are handling in the RAIDZ virtual vdev is a 128KB block. That was just to make my math easy, so I could divide it equally among the physical vdevs. ;-)
But remember that with RAIDZ we always have a full stripe, no matter the size of the data.
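To make the diagram concrete, here is a minimal Python sketch of the split it shows: a 128KB block cut into four 32KB columns plus one XOR parity column. It follows the diagram's simplified layout only. The names `split_with_parity` and `xor_columns` are made up for illustration, and the real allocation logic in vdev_raidz.c works on sectors and variable stripe widths.

```python
import os

COLUMN_SIZE = 32 * 1024   # 32KB per column, as in the diagram
DATA_COLUMNS = 4          # vdev1..vdev4 hold data, vdev5 holds parity


def xor_columns(columns):
    """XOR a list of equal-length byte strings together."""
    acc = 0
    for col in columns:
        acc ^= int.from_bytes(col, "big")
    return acc.to_bytes(len(columns[0]), "big")


def split_with_parity(block):
    """Split a block into DATA_COLUMNS columns and append one XOR parity column."""
    if len(block) != COLUMN_SIZE * DATA_COLUMNS:
        raise ValueError("this sketch only handles the 128KB case")
    columns = [block[i * COLUMN_SIZE:(i + 1) * COLUMN_SIZE]
               for i in range(DATA_COLUMNS)]
    return columns + [xor_columns(columns)]


block = os.urandom(128 * 1024)   # a 128KB filesystem block
stripe = split_with_parity(block)
print(len(stripe), [len(c) // 1024 for c in stripe])   # 5 [32, 32, 32, 32, 32]
```

XOR-ing all five columns together gives zeros, which is what lets any single missing column be rebuilt from the other four.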
The important part here is the filesystem block. When I first saw a video presentation about ZFS, I had the wrong perception of the diagram above. As we can see, if the system requests a block, let’s say the 128KB block above, and physical vdev 1 returns the wrong data, ZFS just fixes the data on that physical vdev, right? Wrong… and that was my wrong perception. ;-)
The ZFS RAIDZ virtual vdev does not know which physical vdev (disk) returned the wrong data. And here I think there is a great level of abstraction that shows the beauty of ZFS… because the filesystems are there (on the physical vdevs), but there is no explicit relation! A filesystem block has nothing to do with a disk block. The checksum of the data block is not kept at the physical vdev level, so ZFS cannot directly know which disk returned the wrong data without a “combinatorial reconstruction” to identify the culprit. From vdev_raidz.c:
    static void
    vdev_raidz_io_done(zio_t *zio)
    {
    ...
            /*
             * If the number of errors we saw was correctable -- less than or equal
             * to the number of parity disks read -- attempt to produce data that
             * has a valid checksum. Naturally, this case applies in the absence of
             * any errors.
             */
    ...
That gives a good understanding of the design of ZFS. I really like this way of solving problems, with specialized parts like this one. Some might think that this behaviour is not optimal, but remember that this is something that should not happen all the time.
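The idea of a combinatorial reconstruction can be sketched in a few lines of Python. This is only my illustration of the principle for a single XOR parity column, not a translation of vdev_raidz_io_done(). SHA-256 stands in for the block checksum, and `read_block` and `xor_columns` are hypothetical names.

```python
import hashlib
import os


def checksum(data):
    # Stand-in for the block checksum, which lives above the vdev level.
    return hashlib.sha256(data).digest()


def xor_columns(columns):
    """XOR a list of equal-length byte strings together."""
    acc = 0
    for col in columns:
        acc ^= int.from_bytes(col, "big")
    return acc.to_bytes(len(columns[0]), "big")


def read_block(columns, expected):
    """columns holds the data columns followed by one XOR parity column.

    Returns (data, culprit): culprit is the index of the column that had
    to be rebuilt, or None when the data was fine as read.
    """
    data_cols, parity = columns[:-1], columns[-1]
    data = b"".join(data_cols)
    if checksum(data) == expected:
        return data, None
    # No disk reported an error, so try each data column as the suspect.
    for suspect in range(len(data_cols)):
        others = [c for i, c in enumerate(data_cols) if i != suspect]
        rebuilt = xor_columns(others + [parity])
        candidate = data_cols[:suspect] + [rebuilt] + data_cols[suspect + 1:]
        data = b"".join(candidate)
        if checksum(data) == expected:
            return data, suspect
    raise IOError("more damage than one parity column can repair")


cols = [os.urandom(32 * 1024) for _ in range(4)]
good = b"".join(cols)
expected = checksum(good)
stripe = cols + [xor_columns(cols)]
stripe[1] = os.urandom(32 * 1024)   # silent corruption on the second disk
data, culprit = read_block(stripe, expected)
print(culprit)
```

Only the checksum of the whole block can say which guess was right. That is why the culprit is found by trial, and why the extra work only happens when a checksum actually fails.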
With a mirror we have a whole different situation, because all the data is on every device, so ZFS can verify the checksum and, if it does not match, read the other vdevs looking for the right answer. Remember that we can have an n-way mirror…
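A rough Python sketch of that mirror behaviour, under my own simplifying assumptions: each child is modelled as a plain byte string, SHA-256 stands in for the block checksum, and `mirror_read` is a hypothetical name, not a ZFS function.

```python
import hashlib


def checksum(data):
    # Stand-in for the checksum stored with the block pointer.
    return hashlib.sha256(data).digest()


def mirror_read(children, expected, preferred=0):
    """Read one child, and fall back to the others if the checksum fails.

    children is a list with one copy of the block per mirror child.
    Returns (data, repaired), where repaired lists the children that
    returned bad data and were overwritten with the good copy.
    """
    order = [preferred] + [i for i in range(len(children)) if i != preferred]
    bad = []
    for i in order:
        if checksum(children[i]) == expected:
            good = children[i]
            # Rewrite only the children already seen to be bad.
            for j in bad:
                children[j] = good
            return good, bad
        bad.append(i)
    raise IOError("no child holds a copy that matches the checksum")


block = b"sendmail.cf contents\n" * 100
expected = checksum(block)
children = [b"\0" * len(block), block, block]   # 3-way mirror, child 0 damaged
data, repaired = mirror_read(children, expected)
print(repaired)   # [0]
```

Here each child holds a complete copy, so a failed checksum points straight at the child that returned it, and no guessing is needed.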
In the source we can see that a normal read goes to just one device:
    static int
    vdev_mirror_io_start(zio_t *zio)
    {
    ...
            /*
             * For normal reads just pick one child.
             */
            c = vdev_mirror_child_select(zio);
            children = (c >= 0);
    ...

So, ZFS knows whether this data is OK or not, and if it is not, it can fix it. But without knowing which disk, only which physical vdev. ;-) The procedure is the same, just without the combinatorial reconstruction. And as a final note, the resilver of a block is not copy-on-write, so in the code we have a comment about it:

            /*
             * Don't rewrite known good children.
             * Not only is it unnecessary, it could
             * actually be harmful: if the system lost
             * power while rewriting the only good copy,
             * there would be no good copies left!
             */

So the physical vdev that has a good copy is not touched. As we need to see to believe...

    mkfile 100m /var/fakedevices/disk1
    mkfile 100m /var/fakedevices/disk2
    zpool create cow mirror /var/fakedevices/disk1 /var/fakedevices/disk2
    zfs create cow/fs01
    cp -pRf /etc/mail/sendmail.cf /cow/fs01/
    ls -i /cow/fs01/
             4 sendmail.cf
    zdb -dddddd cow/fs01 4
    Dataset cow/fs01 [ZPL], ID 30, cr_txg 15, 58.5K, 5 objects, rootbp [L0 DMU objset] \\
    400L/200P DVA[0]=<0:21200:200> DVA[1]=<0:1218c00:200> fletcher4 lzjb LE contiguous \\
    birth=84 fill=5 cksum=99a40530b:410673cd31e:df83eb73e794:207fa6d2b71da7

        Object  lvl   iblk   dblk  lsize  asize  type
             4    1    16K  39.5K  39.5K  39.5K  ZFS plain file (K=inherit) (Z=inherit)
                                     264  bonus  ZFS znode
            path    /sendmail.cf
            uid     0
            gid     2
            atime   Mon Jul 13 19:01:42 2009
            mtime   Wed Nov 19 22:35:39 2008
            ctime   Mon Jul 13 18:30:19 2009
            crtime  Mon Jul 13 18:30:19 2009
            gen     17
            mode    100444
            size    40127
            parent  3
            links   1
            xattr   0
            rdev    0x0000000000000000
    Indirect blocks:
                   0 L0 0:11200:9e00 9e00L/9e00P F=1 B=17

                    segment [0000000000000000, 0000000000009e00) size 39.5K

So, we have a mirror of two disks, and a little file on it.
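The L0 entry in the zdb output above, 0:11200:9e00, reads as vdev:offset:size, all in hex. Here is a small Python sketch of how that turns into sector numbers for dd with bs=512. It assumes, as the ZFS on-disk format describes, that the offset is counted from after the 4MB (0x400000) of labels and boot block at the front of each device. `dva_to_sectors` is just a name I made up.

```python
SECTOR = 512
LABEL_AREA = 0x400000   # assumed: 4MB of labels and boot block before the data


def dva_to_sectors(dva):
    """Convert a 'vdev:offset:size' string (hex fields, as printed by zdb)
    into (first_sector, sector_count) for dd with bs=512."""
    _vdev, offset, size = (int(field, 16) for field in dva.split(":"))
    first = (LABEL_AREA + offset) // SECTOR
    count = -(-size // SECTOR)   # round up to whole sectors
    return first, count


print(dva_to_sectors("0:11200:9e00"))   # (8329, 79)
```

So the file's only data block starts at sector 8329 of each mirror half and is 79 sectors (39.5K) long.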
Let's do a little math, and smash the data block on the first disk:

    zpool export cow
    perl -e "\$x = ((0x400000 + 0x11200) / 512); printf \"\$x\\n\";"
    dd if=/tmp/garbage.txt of=/var/disk1 bs=512 seek=8329 count=79 conv="nocreat,notrunc"
    zpool import -d /var/fakedevices/ cow
    cat /cow/fs01/sendmail.cf > /dev/null
    zpool status cow
      pool: cow
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested
    config:

            NAME            STATE     READ WRITE CKSUM
            cow             ONLINE       0     0     0
              mirror        ONLINE       0     0     0
                /var/disk1  ONLINE       0     0     1
                /var/disk2  ONLINE       0     0     0

    errors: No known data errors

Now let's export the pool and read our data from the same offset on both disks:

    dd if=/var/fakedevices/disk1 of=/tmp/dump.txt bs=512 skip=8329 count=79
    dd if=/var/fakedevices/disk2 of=/tmp/dump2.txt bs=512 skip=8329 count=79
    diff /tmp/dump.txt /tmp/dump2.txt
    head /tmp/dump.txt
    #
    # Copyright (c) 1998-2004 Sendmail, Inc. and its suppliers.
    # All rights reserved.
    # Copyright (c) 1983, 1995 Eric P. Allman. All rights reserved.
    # Copyright (c) 1988, 1993
    # The Regents of the University of California. All rights reserved.
    #
    # Copyright 1993, 1997-2006 Sun Microsystems, Inc. All rights reserved.
    # Use is subject to license terms.
    #
    head /etc/mail/sendmail.cf
    #
    # Copyright (c) 1998-2004 Sendmail, Inc. and its suppliers.
    # All rights reserved.
    # Copyright (c) 1983, 1995 Eric P. Allman. All rights reserved.
    # Copyright (c) 1988, 1993
    # The Regents of the University of California. All rights reserved.
    #
    # Copyright 1993, 1997-2006 Sun Microsystems, Inc. All rights reserved.
    # Use is subject to license terms.
    #

So, never burn your physical vdevs, because you can (almost) always get some files back from them, even if ZFS can't. ;-)
peace