This is one point that has always been really difficult for me to understand about the ZFS implementation. Writes in ZFS are done in transaction groups and have lower priority than reads. OK, but it seems to me that we have a well defined “write cycle”, and no equally clear definition of the options for read preemption. In one of my ZFS environments, I’m facing some strange behaviour that I *think* could be related to the way the ZFS IO scheduler works.
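Just to make the priority point concrete: from what I’ve read so far, every zio carries a priority, and the bulk txg writes issued by spa_sync() rank below demand reads. Something roughly like this (a simplified sketch from memory, with illustrative names and values, not the literal source):

    /*
     * Simplified sketch of ZFS IO priorities (illustrative values,
     * not the literal source). Lower number = served sooner. The
     * priority is folded into a per-IO deadline, and everything
     * still sits in a single queue per vdev.
     */
    typedef enum io_priority {
            IO_PRIORITY_SYNC_READ   = 0,  /* a thread is blocked on this read */
            IO_PRIORITY_SYNC_WRITE  = 0,  /* ZIL commit (fsync, O_DSYNC)      */
            IO_PRIORITY_ASYNC_READ  = 4,  /* prefetch                         */
            IO_PRIORITY_ASYNC_WRITE = 6,  /* txg writes from spa_sync()       */
            IO_PRIORITY_SCRUB       = 20, /* scrub/resilver reads             */
    } io_priority_t;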
The txg_sync_time target can be 5 seconds, and with 30 seconds of dirty data accumulating between each spa_sync, we can really get there. So, in that environment, I do see some read contention from time to time, and I think it is because of syncs that take 1 or 2 seconds…
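To put rough numbers on that: with ~30 seconds between syncs and a sync that takes 2 seconds to flush, the disks are saturated with txg writes only about 7% of the wall-clock time (2/30). Not much on average, but every read that lands inside that 2-second window queues behind a burst of writes, which matches the intermittent latency spikes I’m seeing.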
So, ZFS really handles the write workload well, and with an SSD taking the synchronous writes, that should be no problem. But for reads… L2ARC is the way to go, but don’t forget that someday we will need to go to the disks. And differently from writes, there will, for sure, be someone waiting.
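For reference, the usual way to set up that split (pool and device names here are just placeholders):

    # SSD as a separate intent log (slog), to absorb synchronous writes
    zpool add tank log c4t0d0

    # SSD as a cache device (L2ARC), to absorb read misses
    zpool add tank cache c4t1d0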
I’m reading the source code to understand the scheduler better, and what I could see is that the preemption of read requests depends on the device queue. And from reading the ZFS Evil Tuning Guide, it seems like this really can be a problem: if we have a priority but no separate slots to service it, our intent is good but not very effective. It’s like bank queues: one for pregnant women and the elderly, and another for everybody else. With just one queue, we can make the lady the next in line, but she still has to wait for the customer already in service (and that can take a long time ;-). Two queues alone would have that problem too, but response time would be much better if the queues had different priorities, and with two queues we could have different queue depths, which would make latency more predictable. In summary, we have a priority implementation in ZFS but not on the disks.
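In code form, what I’m imagining is that each class would get its own slot budget at the device, instead of one shared queue where priority only reorders. A toy sketch of the idea (my illustration, not the ZFS source):

    /* Toy sketch: per-class device slots instead of one shared queue. */
    enum {
            CLASS_SYNC_READ,        /* someone is waiting on these */
            CLASS_SYNC_WRITE,       /* ZIL commits                 */
            CLASS_ASYNC_WRITE,      /* txg sync                    */
            CLASS_SCRUB,
            NCLASSES
    };

    typedef struct class_limits {
            int min_active;         /* slots guaranteed to this class          */
            int max_active;         /* cap so one class can't flood the device */
    } class_limits_t;

    /* Reads keep a guaranteed lane; bulk writes and scrub are capped. */
    static const class_limits_t limits[NCLASSES] = {
            [CLASS_SYNC_READ]   = { .min_active = 10, .max_active = 10 },
            [CLASS_SYNC_WRITE]  = { .min_active = 10, .max_active = 10 },
            [CLASS_ASYNC_WRITE] = { .min_active = 1,  .max_active = 5 },
            [CLASS_SCRUB]       = { .min_active = 1,  .max_active = 2 },
    };

    static int active[NCLASSES];    /* IOs currently issued to the device */
    static int pending[NCLASSES];   /* IOs waiting in each class's queue  */

    /*
     * Pick the next class to issue from: first honor every class's
     * guaranteed minimum, then fill the leftover device slots in
     * priority (array) order, never letting a class exceed its cap.
     */
    static int
    next_class_to_issue(void)
    {
            int c;

            for (c = 0; c < NCLASSES; c++)
                    if (pending[c] > 0 && active[c] < limits[c].min_active)
                            return (c);
            for (c = 0; c < NCLASSES; c++)
                    if (pending[c] > 0 && active[c] < limits[c].max_active)
                            return (c);
            return (-1);            /* nothing eligible right now */
    }

That way the pregnant lady (a sync read) never even shares a line with the bulk writes; at worst she waits for another read.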
Hmm… so it seems like this is not a simple “free slot” problem, but a need for real capacity to preempt an in-flight IO (which depends on the underlying device). But I guess some kind of queue separation would help too (scrub, resilver, normal reads, writes, etc.). Still, ZFS already does what no other filesystem does about IO priorities. The problem seems to be with large writes; if we are talking about IOs per second, I think the current implementation does a really good job.
peace