These days I was working to understand a workload: random I/O over many small files (e.g. web servers), running in an NFS/ZFS environment.

It was interesting to see that the discs were really busy with a very heavy random workload, but with only a few NFS read and write requests. On top of that, the SSD configured as a cache device for the ZFS pool was idle.

With the above scenario, you can conclude that the service times for those discs were really high (around 30ms on average, sometimes 40-50ms).

I have an opinion that performance is not a problem. OK, I can hear you laughing… OK, OK, I will try to explain my point.

Actually the problem does exist, but I think it is a consequence, and what really matters is “why” we are facing a performance problem. In my day-to-day job my main concern is availability, and it has been for a long time, because I understand that availability is the real value we can add in our job. So you will tell me that a performance problem can lead to an availability/reliability problem, and that in the end what decides whether it is one or the other is the expectations for the service. I do agree.

What I want to say is that performance is always a composition, and we can provide performance with capacity planning, flexibility, levels and levels of cache, and… respecting the limits. If we set aside a solution that starts out “wrong”, I mean one where the requirement is 100x and the solution was built to deliver only 50x ;-), all the rest about performance is a matter of monitoring. I think performance administration is like money: if you don’t know how to manage US$ 1K, because you spent US$ 1K and one dollar, you will not know how to manage US$ 1M (because going even one cent over is an overflow in both cases). As a last argument, I think we can explain a performance problem, but we can’t justify it.

Back to the problem at the beginning of this post… when I talked about “a few NFS reads and writes”, I was talking about 200, sometimes 300 or so NFS ops of this kind, but about 8K or 9K NFS metadata ops (lookup, getattr, etc.).

Looking further, we could see that the workload involved many, many files in the same directory. More than 60% of the NFS operations were READDIRPLUS, around 15% to 20% were GETATTR, and just 1% or 2% were READs and WRITEs (combined). So the discs were working, true, but for the wrong purpose…
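To make that breakdown concrete, here is a minimal sketch of how an operation mix like the one above can be computed from per-operation counters. The counter values are hypothetical, picked only to resemble the proportions described in this post; they are not the real measurements, and `op_mix` is just an illustrative helper name.

```python
def op_mix(counters):
    """Return each operation's share of the total, in percent."""
    total = sum(counters.values())
    if total == 0:
        return {op: 0.0 for op in counters}
    return {op: 100.0 * count / total for op, count in counters.items()}


# Hypothetical NFS ops per second (illustration only, not measured data)
sample = {
    "readdirplus": 5600,
    "getattr": 1500,
    "lookup": 1000,
    "access": 720,
    "read": 100,
    "write": 80,
}

mix = op_mix(sample)

# Print the operations from the most to the least frequent
for op, pct in sorted(mix.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:12s} {pct:5.1f}%")

# Share of operations that actually move file data
data_ops = mix["read"] + mix["write"]
print(f"data ops     {data_ops:5.1f}%")
```

With a mix like this, the share of operations that move file data stays tiny next to the metadata traffic, which is the pattern we were looking at.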

READDIRPLUS is a “prefetch” feature of NFSv3. IBM (one of the best knowledge bases on the net, IMHO) has a good summary:

“In NFS Version 3, file handle and attribute information is returned along with directory entries via the READDIRPLUS operation. This relieves the client from having to query the server for that information separately for each entry, as is done with NFS Version 2, and is thus much more efficient. However, in some environments with large directories where only the information of a small subset of directory entries is used by the client, the NFS Version 3 READDIRPLUS operation might cause slower performance”.

This metadata overhead is one of the arguments in that very old “maildir vs mailbox” discussion, for example, and here we have a document about it:

“Currently, we’re using version 2 of the NFS protocol. While version 3 does give some significant performance benefits, we give it all back because of the implementation of the READDIRPLUS procedure which prefetches attribute information on all files in that directory, whether we need them or not. Since we store a large number of files in the same directory and are only interested in operating on one of them at a time, this is significant overhead that we don’t need”.

OK, our percentage of “real work” matches the quote above. As we can guess, there are many ways to fix this, and we don’t actually need to switch back to NFSv2 (and trade “one problem” for “many others”)… one solution would be to ask the clients to use READDIR instead (e.g. the GNU/Linux mount option -o nordirplus), while still using NFSv3.

As an example of this possible overhead, imagine a scenario where you are working on a few files, reading and writing a few random bytes in each. You have many files in few directories (thousands of files in hundreds of directories). Every change in a directory is enough to invalidate the client cache for it, and the READDIRPLUS calls that follow retrieve attributes for thousands of files (in each directory).
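A rough back-of-the-envelope sketch of that scenario follows. Every number in it is hypothetical, including the number of entries returned per READDIRPLUS reply, and it assumes the client re-lists each changed directory in full after its cache is invalidated. The helper name `readdirplus_cost` is mine, not something from NFS or ZFS.

```python
import math


def readdirplus_cost(dirs_changed, files_per_dir, entries_per_reply):
    """Estimate the work needed to re-list every changed directory once.

    Returns (calls, entries): the READDIRPLUS calls issued and the
    directory entries (each carrying attributes and a file handle)
    returned by the server.
    """
    calls_per_dir = math.ceil(files_per_dir / entries_per_reply)
    return dirs_changed * calls_per_dir, dirs_changed * files_per_dir


# Hypothetical numbers, for illustration only
DIRS_CHANGED = 10          # directories whose client cache was invalidated
FILES_PER_DIR = 5000       # "thousands of files" in each directory
ENTRIES_PER_REPLY = 100    # assumed entries per READDIRPLUS reply
FILES_ACTUALLY_USED = 20   # files the application really reads or writes

calls, entries = readdirplus_cost(DIRS_CHANGED, FILES_PER_DIR, ENTRIES_PER_REPLY)
overhead = entries / FILES_ACTUALLY_USED

print(f"READDIRPLUS calls:        {calls}")
print(f"entries with attributes:  {entries}")
print(f"entries per file used:    {overhead:.0f}")
```

The exact figures do not matter. The point is that the attribute work grows with the size of the directories, not with the number of files the application really touches.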

But we are not done yet, because our cache was not working well, and we suspected that we were doing even more work on those discs. Well, we have another “prefetch” in our solution: ZFS prefetch. And the guess was that ZFS prefetch was making the discs go crazy because of all those metadata operations.

Let’s see RFC 1813 (NFS v3):

“Procedure READDIRPLUS retrieves a variable number of entries from a file system directory and returns complete information about each along with information to allow the client to request additional directory entries in a subsequent READDIRPLUS. READDIRPLUS differs from READDIR only in the amount of information returned for each entry. In READDIR, each entry returns the filename and the fileid. In READDIRPLUS, each entry returns the name, the fileid, attributes (including the fileid), and file handle”.

Source: http://www.faqs.org/rfcs/rfc1813.html

The question is: what does this “complete information” mean for the ZFS implementation? Well, before digging into the ZFS source code, we can just try disabling ZFS prefetch on the live kernel:

echo zfs_prefetch_disable/W0t1 | mdb -kw

Bingo!

Discs now 0% busy, latencies down to 5-10ms…

You are working on a performance problem, and you say that performance is not a problem? ;-) The difference between one scenario and the other was the limit of each one, nothing more. In both scenarios we always start with everything working, and we need to monitor so that we do not pass that limit, be that limit 1K or 1M.

What we did, actually, was not to solve a performance problem, but to optimize resources.

peace