PLEASE BE AWARE THAT ANY INFORMATION YOU MAY FIND HERE MAY BE INACCURATE, AND COULD INCLUDE TECHNICAL INACCURACIES, TYPOGRAPHICAL ERRORS, AND EVEN SPELLING ERRORS.

 From the MANUAL page:
 The zdb command is used by  support  engineers  to  diagnose
 failures and gather statistics. Since the ZFS file system is
 always consistent on disk and is self-repairing, zdb  should
 only be run under the direction of a support engineer.

DO NOT TRY IT IN PRODUCTION. USE AT YOUR OWN RISK!

In ZFS Internals (part #8) I talked about a misunderstanding I had regarding ZFS physical and logical vdevs when I started to study this software. This filesystem (actually, I do not like to call it “just” that) is all about specialized parts and “rampant layering violation”.

PS: This is one of the posts I like most, maybe because I’m not good at math, and it just seems like magic to me. Another text I really like is this one, which I have a copy of on my site. I read it many years ago in a book from O’Reilly, and looked for it on the internet…

As we have a volume manager, RAID, and a filesystem all in one piece of software, we sometimes think about it as a whole, and that can lead to a wrong perception of some important concepts and separations we have inside ZFS.

ZFS is a very well designed piece of software that is really easy to use, and it is easy to see its beauty, but it has three fundamental parts that we need to understand in our quest: ZPL, DMU, and SPA. In part #8 I talked more about the SPA, so today I will focus on the other two.

Let’s start at the top of the stack…

Basically, the ZPL is the component that gives us files and directories. This component is the ZFS POSIX Layer, something we are used to, and something the ZFS creators could not leave behind. Most software expects a POSIX-compliant filesystem, and it is a well-known interface in Unix and Unix-like OSes (e.g. GNU/Linux and FreeBSD). So, we can imagine this piece as a “simple” translator, the “face” of ZFS.

OK, so there should not be any complexity or big deal in this layer. I mean, when I started to think about it, my guess was: “Solaris already has the whole UFS implementation; I think the engineers took that implementation and just changed the glue to talk to the next layer: the DMU”. If we are talking about an API we have known and used for more than 30 years, I thought the better approach would be to reuse the UFS code, which is pretty much stable, and just change the communication with the underlying layers. You know, talk is always easier… change some parameters, add and delete some functions, and it’s done. Something like a day. Two, if I’m not inspired. ;-)

Before going any further with the ZPL, let’s think about a few ZFS features:

1 – transactional semantics;

2 – no fsck;

3 – always consistent on-disk state;

4 – copy-on-write.

UFS has none of those features.

So, for ZFS to accomplish all those goals, and since those features are not in the ZPL, they must be implemented somewhere else down the stack. But it is not that simple, and it is a combination rather than a separation, as we will see next…

The next component down the stack is the DMU (Data Management Unit). This component works with datasets, objects, and offsets, and gives an object transaction interface to the ZPL (and to any other component that wants to use it, like ZVOL).
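
To make that interface a little more concrete, here is a minimal sketch (mine, not from the ZFS source) of a DMU consumer writing a few bytes into an object through a transaction. The dmu_* calls are the real in-kernel interface declared in sys/dmu.h; the wrapper function and its name are hypothetical, and this would only build inside the ZFS source tree:

#include <sys/dmu.h>
#include <sys/dmu_tx.h>

/*
 * Hedged sketch: a DMU consumer (the ZPL, a ZVOL, ...) writing "len"
 * bytes at offset "off" into object "obj" of the objset "os".
 */
static int
dmu_consumer_write_sketch(objset_t *os, uint64_t obj, uint64_t off,
    int len, const void *buf)
{
	dmu_tx_t *tx = dmu_tx_create(os);	/* start a new transaction */
	int error;

	dmu_tx_hold_write(tx, obj, off, len);	/* declare what we will modify */

	error = dmu_tx_assign(tx, TXG_WAIT);	/* bind it to a transaction group */
	if (error != 0) {
		dmu_tx_abort(tx);		/* e.g. the pool is really out of space */
		return (error);
	}

	dmu_write(os, obj, off, len, buf, tx);	/* the copy-on-write happens below us */
	dmu_tx_commit(tx);			/* the whole change commits atomically */
	return (0);
}

Note how it follows the same skeleton we will see quoted from the ZPL code below: create, hold, assign, do the work, commit.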

So, we have the “transactional semantics of ZFS” implemented here, in the DMU. And yes, the DMU keeps the “always consistent on-disk” state, because the copy-on-write of blocks is implemented in it too. One thing I would risk saying is: “the DMU is the heart of ZFS”.
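
Just to illustrate the copy-on-write idea (a toy userland program of mine, nothing to do with the real ZFS code): a modification never overwrites the live data; the new version is written to a brand-new block, and then a single “root” pointer is flipped. Whoever reads through the root always sees either the complete old state or the complete new one:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy copy-on-write: "root" plays the role of the uberblock pointer. */
struct block {
	char data[32];
};

static struct block *root;

static void
cow_update(const char *newdata)
{
	/* write the new version in a new block, never in place */
	struct block *nb = malloc(sizeof (*nb));

	if (nb == NULL)
		return;
	(void) strncpy(nb->data, newdata, sizeof (nb->data) - 1);
	nb->data[sizeof (nb->data) - 1] = '\0';

	/* one pointer flip publishes the new state */
	root = nb;
	/* the old block could now be freed, or kept around by a snapshot */
}

int
main(void)
{
	struct block first = { "state A" };

	root = &first;
	cow_update("state B");
	(void) printf("current state: %s\n", root->data);
	return (0);
}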

But even implementing two and a half of the four features I listed above, the DMU does not know anything about fsck, and the object transaction interface exported by the DMU needs to be used appropriately. So the transition (in the sense of filesystem consistency) is 100% the responsibility of the ZPL.

The important concept behind this is that the DMU guarantees 100% atomicity in the transition from one on-disk state to another, but a consistent logical view of the filesystem must be handled above this layer. In this case: the ZPL.
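
For example, when the ZPL creates a file it puts every object the POSIX operation touches (the parent directory’s ZAP and the new file’s znode) into one single DMU transaction, so the whole logical change commits atomically or not at all. A hedged sketch of the holds such an operation would declare (the function and variable names are mine; dmu_tx_hold_zap() and dmu_tx_hold_bonus() are the real calls):

#include <sys/dmu.h>
#include <sys/dmu_tx.h>

/*
 * Hedged sketch: both objects go into the SAME transaction; that is
 * what keeps the POSIX-level view (directory entry + new file)
 * consistent without any fsck.
 */
static void
zpl_create_holds_sketch(dmu_tx_t *tx, uint64_t parent_dir_obj, const char *name)
{
	/* the new directory entry (name -> object number), kept in a ZAP */
	dmu_tx_hold_zap(tx, parent_dir_obj, B_TRUE, name);

	/* the new file's attributes (the znode), stored in the bonus buffer */
	dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);

	/* ...then dmu_tx_assign(), the real work, and dmu_tx_commit() as above */
}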

So, the ZPL uses the transactional semantics and the always consistent on-disk state given by the DMU (using copy-on-write), and eliminates the FSCK!

The ZPL’s job is to group the transactions and commit them in such a way that the logical view of the POSIX filesystem implemented in this layer is always consistent. I think it is like the famous quote from many movies: “Guns do not kill people, people do”.

Looking at the well-commented OpenSolaris source code, we can see this:

 * Each vnode op performs some logical unit of work.  To do this, the ZPL must
 * properly lock its in-core state, create a DMU transaction, do the work,
 * record this work in the intent log (ZIL), commit the DMU transaction,
 * and wait for the intent log to commit if it is a synchronous operation.
...

In other words, it was not the gun; it was the people that killed the fsck.

In that same C file we can see the algorithm:

 *	ZFS_ENTER(zfsvfs);		// exit if unmounted
 * top:
 *	zfs_dirent_lock(&dl, ...)	// lock directory entry (may VN_HOLD())
 *	rw_enter(...);			// grab any other locks you need
 *	tx = dmu_tx_create(...);	// get DMU tx
 *	dmu_tx_hold_*();		// hold each object you might modify
 *	error = dmu_tx_assign(tx, TXG_NOWAIT);	// try to assign
 *	if (error) {
 *		rw_exit(...);		// drop locks
 *		zfs_dirent_unlock(dl);	// unlock directory entry
 *		VN_RELE(...);		// release held vnodes
 *		if (error == ERESTART) {
 *			dmu_tx_wait(tx);
 *			dmu_tx_abort(tx);
 *			goto top;
 *		}
 *		dmu_tx_abort(tx);	// abort DMU tx
 *		ZFS_EXIT(zfsvfs);	// finished in zfs
 *		return (error);		// really out of space
 *	}
 *	error = do_real_work();		// do whatever this VOP does
 *	if (error == 0)
 *		zfs_log_*(...);		// on success, make ZIL entry
 *	dmu_tx_commit(tx);		// commit DMU tx -- error or not
 *	rw_exit(...);			// drop locks
 *	zfs_dirent_unlock(dl);		// unlock directory entry
 *	VN_RELE(...);			// release held vnodes
 *	zil_commit(zilog, seq, foid);	// synchronous when necessary
 *	ZFS_EXIT(zfsvfs);		// finished in zfs
 *	return (error);			// done, report error

Sorry to state the obvious, but since I think in the future we will see more components at the same layer as the ZPL and ZVOL, we need to know that logical consistency will not be given to us. We could, for example, create a filesystem on top of the DMU that does need fsck, just as we can get an improper service on top of the ZPL by disabling the ZIL.

So we see that we have many levels of consistency, and each layer is responsible for one of them. As an example of another level, now from an application’s point of view: the ZPL implements the ZFS Intent Log (ZIL) to guarantee synchronous semantics for application requests and provide a reliable service. This has nothing to do with the SPA or the DMU; it is ZPL business.
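
From the application side that promise looks like this (an ordinary userland example; the path is just an illustration): opening a file with O_DSYNC, or calling fsync(), is what makes the ZPL commit the corresponding ZIL records before returning to the application.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Synchronous semantics from the application's point of view: when
 * write() returns, the data must already be on stable storage.  On
 * ZFS it is the ZPL, via the ZIL, that honors this request.
 */
int
main(void)
{
	const char msg[] = "this must be durable when write() returns\n";
	/* /tank/data/important.log is just an example path */
	int fd = open("/tank/data/important.log",
	    O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);

	if (fd == -1) {
		perror("open");
		return (1);
	}
	if (write(fd, msg, sizeof (msg) - 1) == -1)
		perror("write");
	(void) close(fd);
	return (0);
}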

And even with all the concepts and techniques implemented in those layers to provide a good service, all of that would be nothing without proper use.

Allocation, deallocation, striping, redundancy, and everything else related to disk blocks are handled by the SPA component. This is the bottom layer of ZFS, the one that talks to the device drivers directly. The SPA has an important role in ZFS performance because, even without knowing about the copy-on-write mechanism itself, it needs an efficient block allocation strategy: the DMU, which does use copy-on-write, needs good amounts of contiguous free space to perform well (the SPA uses an algorithm derived from the slab allocator for this purpose).

FSCK RIP