PLEASE BE AWARE THAT ANY INFORMATION YOU MAY FIND HERE MAY BE INACCURATE, AND COULD INCLUDE TECHNICAL INACCURACIES, TYPOGRAPHICAL ERRORS, AND EVEN SPELLING ERRORS.

 From the MANUAL page:
 The zdb command is used by  support  engineers  to  diagnose
 failures and gather statistics. Since the ZFS file system is
 always consistent on disk and is self-repairing, zdb  should
 only be run under the direction by a support engineer.

DO NOT TRY IT IN PRODUCTION. USE AT YOUR OWN RISK!

Do we have a deal? ;-)

A few days ago I was trying to figure out how ZFS copy-on-write semantics really work and to understand the ZFS on-disk layout. My friends were the zfsondiskformat specification, the source code, and zdb. I searched the web for something like what I'm writing here and could not find anything. That's why I'm writing this article, thinking it can be useful for somebody else.

Let's start with a two-disk pool (disk0 and disk1):

# mkfile 100m /var/fakedevices/disk0
# mkfile 100m /var/fakedevices/disk1
# zpool create cow /var/fakedevices/disk0 /var/fakedevices/disk1
# zfs create cow/fs01

The recordsize is the default (128K):

# zfs get recordsize cow/fs01
NAME      PROPERTY    VALUE     SOURCE
cow/fs01  recordsize  128K      default

Ok, we can use the THIRDPARTYLICENSEREADME.html file from “/opt/staroffice8/” as a good test file (size: 211045 bytes). First, we need the object ID (aka inode number):

# ls -i /cow/fs01/
         4 THIRDPARTYLICENSEREADME.html

Now the nice part…

# zdb -dddddd cow/fs01 4
... snipped ...
 Indirect blocks:
               0 L1  0:9800:400 1:9800:400 4000L/400P F=2 B=190
               0  L0 1:40000:20000 20000L/20000P F=1 B=190
           20000  L0 0:40000:20000 20000L/20000P F=1 B=190

                segment [0000000000000000, 0000000001000000) size   16M

Now we need the concepts in the zfsondiskformat doc. Let's look at the first block line:

0 L1  0:9800:400 1:9800:400 4000L/400P F=2 B=190



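The L1 is the block level (the number of block pointers which need to be traversed to arrive at the data), so this is an indirect block one level above the L0 data blocks, and the file uses two levels in total. The “0:9800:400” is: the device where this block is (0 = /var/fakedevices/disk0), the offset within that device (0x9800), and the size of the block (0x400 = 1K), respectively. All three fields are hexadecimal. The line carries two such addresses, one on each disk, and 4000L/400P gives the logical and physical sizes (0x4000 = 16K logical, stored in 0x400 = 1K). So, ZFS is using two disk blocks to hold the pointers to the file data: two copies of the same pointer block, one per disk…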

ps.: 0:9800 is Data Virtual Address 1 (dva1), and 1:9800 is dva2
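
To make the three fields of a DVA concrete, here is a minimal sketch that splits the string printed by zdb and converts the hex values to decimal. The function name decode_dva is made up for this example. The 0x400000 (4 MB) added at the end is an assumption taken from the on-disk format document, which describes the DVA offset as counted after the labels and boot block at the front of the vdev; treat the "physical" number as illustrative, not as something verified on this pool.

```shell
# decode_dva DVA: split "vdev:offset:size" (hex fields, as printed by zdb).
# decode_dva is a hypothetical helper written for this article.
decode_dva() {
  dva=$1
  vdev=${dva%%:*}            # text before the first colon
  rest=${dva#*:}             # text after the first colon
  offset=$(( 0x${rest%%:*} ))
  size=$(( 0x${rest#*:} ))
  # Assumption: 4 MB of labels and boot block precede the allocatable area.
  physical=$(( offset + 0x400000 ))
  echo "vdev=$vdev offset=$offset size=$size physical=$physical"
}

decode_dva 0:9800:400
decode_dva 1:40000:20000
```

The second call decodes the address of the first data block, which is used further down with zdb -R.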

At the end of the line there are two other important pieces of information: F=2 and B=190. The first is the fill count, and describes the number of non-zero block pointers under this block pointer. Remember our file is larger than 128K (the default recordsize), so ZFS needs two file system blocks (FSB) to hold it. The second is the birth time, which is the same as the number of the txg (190) that created the block.

Now, let's get our data! Looking at the second block line, we have:
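
The "two blocks" above is just a rounding-up division of the file size by the recordsize. A small sketch of the arithmetic, using the numbers from this article (the function name blocks_needed is made up for the example):

```shell
# blocks_needed SIZE RECORDSIZE: records needed to hold SIZE bytes, rounded up.
blocks_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

file_size=211045                  # bytes, size of the test file
recordsize=$(( 128 * 1024 ))      # default recordsize, 131072 = 0x20000

blocks=$(blocks_needed "$file_size" "$recordsize")
echo "blocks needed: $blocks"

# File offset where each block starts, in hex like the zdb listing.
i=0
while [ "$i" -lt "$blocks" ]; do
  printf 'block %d starts at file offset 0x%x\n' "$i" $(( i * recordsize ))
  i=$(( i + 1 ))
done

echo "bytes used in the last block: $(( file_size - (blocks - 1) * recordsize ))"
```

The offsets 0x0 and 0x20000 are the same values that appear in the first column of the two L0 lines.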

0  L0 1:40000:20000 20000L/20000P F=1 B=190

Based on the zfsondiskformat doc, we know that L0 is the block level that holds data (we can have up to six levels). At this level, the fill count has a slightly different interpretation: F= says whether the block has data or not (0 or 1), whereas at levels 1 and above it means “how many” non-zero block pointers are under this block pointer. So, we can see our data using the -R option of zdb:
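
As a quick sanity check of that reading of F=, the fill count of the L1 line should equal the sum of the fill counts of the L0 lines under it. This is only a sketch: it assumes the listing has a single L1 block, as in our small file, and the function name check_fill is made up for the example.

```shell
# check_fill: read "Indirect blocks" lines on stdin and compare the L1 fill
# count with the sum of the L0 fill counts. Assumes a single L1 block.
check_fill() {
  awk '
    $2 ~ /^L[0-9]+$/ {
      level = substr($2, 2) + 0
      fill = 0
      for (i = 3; i <= NF; i++)
        if ($i ~ /^F=/) fill = substr($i, 3) + 0
      if (level == 1) parent = fill
      if (level == 0) leaves += fill
    }
    END { printf "L1 fill=%d, sum of L0 fills=%d\n", parent, leaves }
  '
}

check_fill <<'EOF'
               0 L1  0:9800:400 1:9800:400 4000L/400P F=2 B=190
               0  L0 1:40000:20000 20000L/20000P F=1 B=190
           20000  L0 0:40000:20000 20000L/20000P F=1 B=190
EOF
```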

# zdb -R cow:1:40000:20000 | head -10
Found vdev: /var/fakedevices/disk1

cow:1:40000:20000
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  505954434f44213c  50206c6d74682045  <!DOCTYPE html P
000010:  2d222043494c4255  442f2f4333572f2f  UBLIC "-//W3C//D
000020:  204c4d5448204454  6172542031302e34  TD HTML 4.01 Tra
000030:  616e6f697469736e  2220224e452f2f6c  nsitional//EN" "
000040:  772f2f3a70747468  726f2e33772e7777  http://www.w3.or
000050:  6d74682f52542f67  65736f6f6c2f346c  g/TR/html4/loose
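
A side note on reading this dump: the hex columns look "backwards" compared to the text on the right. That is consistent with each column being printed as a 64-bit number on a little-endian machine, so the first byte on disk is the last pair of hex digits. The sketch below decodes one column under that assumption (the function name decode_word is made up for the example):

```shell
# decode_word HEX: print the characters held in one 16-digit hex word,
# assuming the word is a little-endian 64-bit integer (lowest byte first on disk).
decode_word() {
  word=$1
  while [ -n "$word" ]; do
    byte=${word#"${word%??}"}     # last two hex digits of what is left
    word=${word%??}               # drop them
    # Turn the hex byte into an octal escape and let printf emit the character.
    printf "\\$(printf '%03o' "0x$byte")"
  done
  printf '\n'
}

decode_word 505954434f44213c
decode_word 50206c6d74682045
```

The two calls decode the two columns of the first dump line, and together they give the same 16 characters shown in the text column.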

That's nice! 16 bytes per line, and that is our file. Let's read it for real:

# zdb -R cow:1:40000:20000:r
... snipped ...
The intent of this document is to state the conditions under which
VIGRA may be copied, such that the author maintains some
semblance of artistic control over the development of the library,
while giving the users of the library the right to use and
distribute VIGRA in a more-or-less customary fashion, plus the
right to

ps.: Don't forget that this is only the first 128K of our file.

We can assemble the whole file like this:

# zdb -R cow:1:40000:20000:r 2> /tmp/file1.dump
# zdb -R cow:0:40000:20000:r 2> /tmp/file2.dump
# cat /tmp/file2.dump >> /tmp/file1.dump
# diff /tmp/file1.dump /cow/fs01/THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file1.dump
5032d5031
<
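
About that extra last line: the two dumps together are 2 x 128K = 262144 bytes, but the file is only 211045 bytes, so the tail of the second block is most likely padding and not file data. A possible way to get a byte-for-byte match is to cut the dump at the real file size before comparing. The sketch below simulates the situation with temporary files instead of touching the pool. It assumes the padding is zero-filled and that head accepts -c (GNU and BSD do, older Solaris may not). The function name trim_dump is made up for the example.

```shell
# trim_dump DUMP SIZE OUT: copy the first SIZE bytes of DUMP to OUT.
trim_dump() {
  head -c "$2" "$1" > "$3"
}

work=$(mktemp -d)

# Stand-in for the real file: 211045 bytes of arbitrary data.
head -c 211045 /dev/urandom > "$work/original"

# Stand-in for the concatenated dumps: the file plus assumed zero padding
# up to two full 128K records.
cp "$work/original" "$work/dump"
head -c $(( 2 * 131072 - 211045 )) /dev/zero >> "$work/dump"

trim_dump "$work/dump" 211045 "$work/trimmed"
if cmp -s "$work/original" "$work/trimmed"; then
  echo "identical after trimming"
fi
```

With the real dump, the same trim_dump call on /tmp/file1.dump followed by cmp against the file on /cow/fs01 would be the equivalent check.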

Ok, that warning is something we can understand. But let's change something in that file to see copy-on-write in action... We will use vi to change the "END OF TERMS AND CONDITIONS" line (four lines before the EOF) to "FIM OF TERMS AND CONDITIONS".

# vi THIRDPARTYLICENSEREADME.html
# zdb -dddddd cow/fs01 4
... snipped ...
Indirect blocks:
               0 L1  0:1205800:400 1:b400:400 4000L/400P F=2 B=1211
               0  L0 0:60000:20000 20000L/20000P F=1 B=1211
           20000  L0 0:1220000:20000 20000L/20000P F=1 B=1211

                segment [0000000000000000, 0000000001000000) size   16M

All blocks were reallocated! The L1 block and the two L0 (data) blocks. That's a little strange... I was expecting to see only the block pointers (metadata) reallocated, plus the data block that holds the bytes I changed. The data block that holds the first 128K of our file is now on the first device (0), and the second block is still on the first device (0), but in another location. We can be sure by looking at the new offsets and the new creation txg (B=1211). Let's see our data again, getting it from the new locations:

# zdb -R cow:0:60000:20000:r 2> /tmp/file3.dump
# zdb -R cow:0:1220000:20000:r 2> /tmp/file4.dump
# cat /tmp/file4.dump >> /tmp/file3.dump
# diff /tmp/file3.dump THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file3.dump
5032d5031
<
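
The comparison between the two zdb listings (before and after the edit) can also be done mechanically. A minimal sketch that takes the L0 addresses from both listings and reports, block by block, whether the address changed. The function name report_moves is made up for the example, and both lists are assumed to have the same number of entries.

```shell
# report_moves "OLD DVAS" "NEW DVAS": compare two space-separated DVA lists
# position by position. Assumes both lists have the same length.
report_moves() {
  new_list=$2
  set -- $1                       # old DVAs become $1, $2, ...
  i=0
  for new in $new_list; do
    old=$1
    shift
    if [ "$old" = "$new" ]; then
      state="in place"
    else
      state="reallocated"
    fi
    echo "L0 block $i: $old -> $new ($state)"
    i=$(( i + 1 ))
  done
}

# L0 addresses at txg 190 and at txg 1211, copied from the two listings.
report_moves "1:40000:20000 0:40000:20000" "0:60000:20000 0:1220000:20000"
```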

Ok, and the old blocks, are they still there?

# zdb -R cow:1:40000:20000:r 2> /tmp/file1.dump
# zdb -R cow:0:40000:20000:r 2> /tmp/file2.dump
# cat /tmp/file2.dump >> /tmp/file1.dump
# diff /tmp/file1.dump THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file1.dump
5027c5027
< END OF TERMS AND CONDITIONS
---
> FIM OF TERMS AND CONDITIONS
5032d5031
<

Really nice! In our test, ZFS copy-on-write moved the whole file from one region of the disk to another. But what if we were talking about a really big file, let's say 1GB? Many 128K data blocks, and just a 1K change. Would ZFS copy-on-write reallocate all the data blocks too? And why did ZFS reallocate the "untouched" block in our example (the first L0 data block)?
Something to look into another time. Stay tuned... ;-)
peace.