ZFS Internals (part #1)
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction by a support engineer.
Do we have a deal? ;-)
A few days ago I was trying to figure out how ZFS copy-on-write semantics really work and to understand the ZFS on-disk layout. My friends were the zfsondiskformat specification, the source code, and zdb. I searched the web for something like what I'm writing here and could not find anything. That's why I'm writing this article, thinking it can be useful for somebody else.
Let's start with a two-disk pool (disk0 and disk1):
# mkfile 100m /var/fakedevices/disk0
# mkfile 100m /var/fakedevices/disk1
# zpool create cow /var/fakedevices/disk0 /var/fakedevices/disk1
# zfs create cow/fs01
The recordsize is the default (128K):
# zfs get recordsize cow/fs01
NAME      PROPERTY    VALUE  SOURCE
cow/fs01  recordsize  128K   default
OK, we can use the THIRDPARTYLICENSEREADME.html file from “/opt/staroffice8/” as a good test file (size: 211045 bytes), copied into /cow/fs01/. First, we need the object ID (aka inode number):
# ls -i /cow/fs01/
4 THIRDPARTYLICENSEREADME.html
Now the nice part…
# zdb -dddddd cow/fs01 4
... snipped ...
Indirect blocks:
      0 L1 0:9800:400 1:9800:400 4000L/400P F=2 B=190
      0 L0 1:40000:20000 20000L/20000P F=1 B=190
  20000 L0 0:40000:20000 20000L/20000P F=1 B=190
segment [0000000000000000, 0000000001000000) size 16M
Now we need the concepts from the zfsondiskformat doc. Let's look at the first block line:
0 L1 0:9800:400 1:9800:400 4000L/400P F=2 B=190
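L1 means level 1: a block of block pointers sitting one level above the data (the level is the number of block pointers which need to be traversed to arrive at the data). Together with the L0 data blocks below it, this file has two levels. The “0:9800:400” is: the device where this block is (0 = /var/fakedevices/disk0), the offset within that device in hexadecimal (0x9800; per the on-disk format document it is counted after the space reserved for labels and boot block at the start of the disk), and the size of the block (0x400 = 1K), respectively. The second triple, “1:9800:400”, is another copy of the same block on the other device. So, ZFS is using two disk blocks to hold the pointers to the file data…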
ps.: 0:9800 is the Data Virtual Address 1 (dva1)
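To make the triple concrete, here is a minimal sketch that splits a DVA as printed by zdb into its three fields. The names `Dva`, `parse_dva` and `LABEL_AREA` are made up for this illustration, and the 4 MB constant is an assumption taken from my reading of the on-disk format document, not something shown in the zdb output above.

```python
from typing import NamedTuple

# Assumption: 4 MB are reserved for labels and boot block before the
# allocatable space of each device (see the on-disk format document).
LABEL_AREA = 0x400000


class Dva(NamedTuple):
    vdev: int    # index of the device in the pool (0 = disk0, 1 = disk1)
    offset: int  # byte offset inside the device's allocatable space
    size: int    # allocated size in bytes


def parse_dva(text: str) -> Dva:
    """Parse 'vdev:offset:size'; offset and size are hexadecimal."""
    vdev, offset, size = text.split(":")
    return Dva(int(vdev), int(offset, 16), int(size, 16))


dva = parse_dva("0:9800:400")
print(dva)                           # Dva(vdev=0, offset=38912, size=1024)
print(hex(dva.offset + LABEL_AREA))  # 0x409800 under the assumption above
```

The second print is only where the block would start in the backing file if the 4 MB assumption holds; zdb -R takes the DVA form directly, so none of the commands in this article need that conversion.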
At the end of the line there are two other important pieces of information: F=2 and B=190. The first is the fill count, which describes the number of non-zero block pointers under this block pointer. Remember our file is larger than 128K (the default recordsize), so ZFS needs two file system blocks (FSB) to hold it. The second is the birth time, which is the number of the transaction group (txg 190) that created the block.
Now, let's get our data! Looking at the second block line, we have:
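The numbers in the listing can be checked with a bit of arithmetic. The sketch below assumes that a block pointer takes 128 bytes (that figure comes from the on-disk format document, not from the zdb output) and reads “4000L” as the logical size of the L1 block; all the constant names are just labels for this example.

```python
RECORDSIZE = 128 * 1024   # default recordsize, 0x20000
FILE_SIZE = 211045        # size of the test file in bytes
BLKPTR_SIZE = 128         # assumption: bytes per block pointer
INDIRECT_SIZE = 0x4000    # logical size of the L1 block ("4000L")

# Ceiling division: how many 128K blocks does the file need?
data_blocks = -(-FILE_SIZE // RECORDSIZE)

# How many block pointers fit in one L1 block, and how much file data
# can a single L1 block address?
pointers_per_l1 = INDIRECT_SIZE // BLKPTR_SIZE
l1_coverage = pointers_per_l1 * RECORDSIZE

print(data_blocks)       # 2
print(pointers_per_l1)   # 128
print(hex(l1_coverage))  # 0x1000000
```

Two data blocks agrees with F=2 on the L1 line, and 0x1000000 (16M) agrees with the "segment ... size 16M" line that zdb prints.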
0 L0 1:40000:20000 20000L/20000P F=1 B=190
Based on the zfsondiskformat doc, we know that L0 is the block level that holds data (we can have up to six levels). At this level, the fill count has a slightly different interpretation: F= says whether the block has data or not (1 or 0), whereas at level 1 and above it means “how many” non-zero block pointers are under this block pointer. So, we can see our data using the -R option of zdb:
# zdb -R cow:1:40000:20000 | head -10
Found vdev: /var/fakedevices/disk1

cow:1:40000:20000
         0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  505954434f44213c  50206c6d74682045  <!DOCTYPE html P
000010:  2d222043494c4255  442f2f4333572f2f  UBLIC "-//W3C//D
000020:  204c4d5448204454  6172542031302e34  TD HTML 4.01 Tra
000030:  616e6f697469736e  2220224e452f2f6c  nsitional//EN" "
000040:  772f2f3a70747468  726f2e33772e7777  http://www.w3.or
000050:  6d74682f52542f67  65736f6f6c2f346c  g/TR/html4/loose
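The hex columns look scrambled next to the text column because each group is printed as one 64-bit number. A small sketch, assuming the dump was taken on a little-endian machine (the helper name `decode_words` is made up here), reverses the bytes of each word to get back the bytes as they sit on disk:

```python
def decode_words(*words: str) -> str:
    """Turn 64-bit words printed as hex back into the on-disk byte order.

    Assumes a little-endian host, so every word is byte-reversed.
    """
    raw = b"".join(bytes.fromhex(word)[::-1] for word in words)
    return raw.decode("ascii")


# First line of the dump above
print(decode_words("505954434f44213c", "50206c6d74682045"))
# <!DOCTYPE html P
```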
That's nice! 16 bytes per line, and that is our file. Let's read it for real:
# zdb -R cow:1:40000:20000:r
... snipped ...
The intent of this document is to state the conditions under which VIGRA may be copied, such that the author maintains some semblance of artistic control over the development of the library, while giving the users of the library the right to use and distribute VIGRA in a more-or-less customary fashion, plus the right to
ps.: Don't forget that this is only the first 128K of our file…
We can assemble the whole file like this:
# zdb -R cow:1:40000:20000:r 2> /tmp/file1.dump
# zdb -R cow:0:40000:20000:r 2> /tmp/file2.dump
# cat /tmp/file2.dump >> /tmp/file1.dump
# diff /tmp/file1.dump /cow/fs01/THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file1.dump
5032d5031
<
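The extra last line in the dump is consistent with simple arithmetic: two full 128K blocks hold more bytes than the file has, so the tail of the second block is presumably padding. Here is a minimal sketch of that calculation, plus a hypothetical helper (`assemble` is my name, not a zdb feature) that cuts the concatenated dumps at the real file size. The data in the example is synthetic, not the real file.

```python
RECORDSIZE = 0x20000   # 128K, the block size in the zdb listing
FILE_SIZE = 211045     # file size quoted earlier in the article


def assemble(blocks, file_size):
    """Concatenate raw block dumps and cut the result at the file size."""
    return b"".join(blocks)[:file_size]


padding = 2 * RECORDSIZE - FILE_SIZE
print(padding)  # 51099 bytes past the end of the file

# Synthetic stand-ins for the two dumps (hypothetical contents)
first = b"a" * RECORDSIZE
second = b"b" * (FILE_SIZE - RECORDSIZE) + b"\0" * padding
print(len(assemble([first, second], FILE_SIZE)))  # 211045
```

With the tail trimmed like this, a byte-for-byte comparison against the original file becomes possible instead of relying on diff.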
OK, that warning is something we can understand. But let's change something in that file to see copy-on-write in action... We will use vi to change the "END OF TERMS AND CONDITIONS" line (four lines before the EOF) to "FIM OF TERMS AND CONDITIONS".
# vi THIRDPARTYLICENSEREADME.html
# zdb -dddddd cow/fs01 4
... snipped ...
Indirect blocks:
      0 L1 0:1205800:400 1:b400:400 4000L/400P F=2 B=1211
      0 L0 0:60000:20000 20000L/20000P F=1 B=1211
  20000 L0 0:1220000:20000 20000L/20000P F=1 B=1211
segment [0000000000000000, 0000000001000000) size 16M
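The two listings can also be compared mechanically. This is a rough sketch that only understands block lines in the shape shown in this article; the regular expression and the helper names `parse` and `moved` are made up for the example and are not part of zdb.

```python
import re

BEFORE = """\
0 L1 0:9800:400 1:9800:400 4000L/400P F=2 B=190
0 L0 1:40000:20000 20000L/20000P F=1 B=190
20000 L0 0:40000:20000 20000L/20000P F=1 B=190
"""

AFTER = """\
0 L1 0:1205800:400 1:b400:400 4000L/400P F=2 B=1211
0 L0 0:60000:20000 20000L/20000P F=1 B=1211
20000 L0 0:1220000:20000 20000L/20000P F=1 B=1211
"""

# file offset, level, one or more DVAs, sizes, fill count, birth txg
LINE = re.compile(
    r"^\s*([0-9a-f]+)\s+L(\d)\s+(.*?)\s+[0-9a-f]+L/[0-9a-f]+P"
    r"\s+F=(\d+)\s+B=(\d+)$"
)


def parse(listing):
    """Map (file offset, level) -> (list of DVAs, birth txg)."""
    blocks = {}
    for line in listing.splitlines():
        match = LINE.match(line)
        if match:
            offset, level, dvas, _fill, birth = match.groups()
            blocks[(int(offset, 16), int(level))] = (dvas.split(), int(birth))
    return blocks


def moved(before, after):
    """Keys of the blocks whose DVAs differ between the two listings."""
    old, new = parse(before), parse(after)
    return sorted(key for key in old if key in new and old[key][0] != new[key][0])


print(moved(BEFORE, AFTER))  # [(0, 0), (0, 1), (131072, 0)]
```

Every (offset, level) pair from the first listing shows up in the result, so none of the three blocks kept its old address.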
All blocks were reallocated! The L1 block and the two L0 (data) blocks. That's a little strange... I was expecting to see all the block pointers (metadata) reallocated, plus only the data block that holds the bytes I changed. The data block that holds the first 128K of our file is now on the first device (0), and the second block is still on the first device (0), but in another location. We can be sure by looking at the new offsets and the new birth txg (B=1211). Let's see our data again, getting it from the new locations:
# zdb -R cow:0:60000:20000:r 2> /tmp/file3.dump
# zdb -R cow:0:1220000:20000:r 2> /tmp/file4.dump
# cat /tmp/file4.dump >> /tmp/file3.dump
# diff /tmp/file3.dump THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file3.dump
5032d5031
<
OK, and the old blocks: are they still there?
# zdb -R cow:1:40000:20000:r 2> /tmp/file1.dump
# zdb -R cow:0:40000:20000:r 2> /tmp/file2.dump
# cat /tmp/file2.dump >> /tmp/file1.dump
# diff /tmp/file1.dump THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file1.dump
5027c5027
< END OF TERMS AND CONDITIONS
---
> FIM OF TERMS AND CONDITIONS
5032d5031
<
Really nice! In our test, ZFS copy-on-write moved the whole file from one region of the disk to another. But what if we were talking about a really big file, let's say 1GB? Many 128K data blocks, and just a 1K change. Would ZFS copy-on-write reallocate all the data blocks too? And why did ZFS reallocate the "untouched" block in our example (the first L0 data block)?
Something to look at another time. Stay tuned... ;-)