What I do not like about ZFS…
Well, it has been a long time since my last post; I have been quite busy with my day-to-day job and other projects that are consuming a lot of my time.
I really like to blog, but right now I do not have the time I need to write here. It was nice to read the comments on my ZFS Internals series (and on the PDF version, which I still need to update). I do not even remember my last ZFS post, and I still have a lot of comments pending on those sections. Thank you!
I'm doing a lot of interesting things at work, and on other projects that will be nice to write about soon… but what I really feel obliged to write about is my opinion on the not-so-good points of ZFS.
If you are used to reading my blog, I do not need to tell you that I think this is a great piece of software, and that I think it is the way to go for storage these days. But like everything else in life, it has problems, and I will say (sorry), huge ones.
Actually, I would have titled this post "ZFS errors", but who am I to say that? So I prefer to present this as my personal opinion after using it and hacking on it a little. Besides, it is not fair to talk only about the good points of something, because that is not the real picture. And the problems we have faced with some technology, even when they were not so critical for us, can be a huge problem for other people, and our experience can help them.
Sorry ZFS hackers, but here we go…
Number 1: RAIDZ: Simple as that: it solves a real problem that maybe two guys in the whole universe knew about, and creates many others… RAIDZ simply cannot be used in production if you need even a little performance. And that leaves you with just one configuration to run in production: mirror. If you want to diversify, you can do RAID1. Compared with other storage vendors, your cost per gigabyte will always be haunted by this ghost: raw/2.
It is not useless: you can use it at home, or even for backups in a datacenter solution, but you need a robust infrastructure to deal with long resilvers and really poor restore procedures. "Everything is a full stripe" solves one problem and creates both a performance bottleneck and a nightmare for the resilver process (ZFS needs to traverse the whole filesystem to resilver a single disk). If you care, here is my advice: if you want to use RAIDZ, three disks per set is the maximum width I would go.
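To make the raw/2 and stripe-width points concrete, here is a back-of-the-envelope sketch in Python. The numbers and the model are mine, not the author's: it uses the deliberately crude assumption that a RAIDZ vdev delivers roughly one disk's worth of random-read IOPS (every read touches a full stripe), while every disk in a mirrored pool can serve reads independently.

```python
# Hypothetical, simplified capacity/IOPS model -- illustration only.

def mirror_pool(disks, disk_tb, disk_iops):
    """Pool built from 2-way mirror vdevs: raw/2 capacity, reads scale per disk."""
    vdevs = disks // 2
    return {
        "usable_tb": vdevs * disk_tb,    # half the raw capacity
        "read_iops": disks * disk_iops,  # every disk can serve reads
    }

def raidz_pool(disks, disk_tb, disk_iops, width=3, parity=1):
    """Pool built from RAIDZ vdevs of `width` disks each."""
    vdevs = disks // width
    return {
        "usable_tb": vdevs * (width - parity) * disk_tb,
        "read_iops": vdevs * disk_iops,  # ~one disk of random-read IOPS per vdev
    }

if __name__ == "__main__":
    # 12 disks, 2 TB each, ~100 random IOPS each (7,200 RPM SATA)
    print("mirror:", mirror_pool(12, 2, 100))  # more IOPS, less space
    print("raidz :", raidz_pool(12, 2, 100))   # more space, far fewer IOPS
```

With narrow (three-disk) RAIDZ sets you at least keep more vdevs, and therefore more random-read IOPS, which is the reasoning behind the advice above.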
Number 2: L2ARC: I wrote about this some time ago, but it is worth bringing up again in this list. The ZFS ARC is damn good code, and it really can bring a huge performance gain over time! But looking at the specifications of the Oracle Storage 7420, I was scared:
Up to 1.15PB (petabyte) raw capacity
Up to 1TB DDR3 DRAM
Up to 4TB flash-enabled read-cache
Up to 1.7TB flash-enabled write-cache
If ZFS solved the write problem using SSDs, I think it created another one for reads. If we lose 4TB of read cache, we will not have the universe as we know it anymore. No, don't tell me that this will not happen…
Solid state drives are persistent devices, so I would guess it should be a priority not to lose their contents on reboots, failovers, or SSD failures (e.g. by mirroring them). It is a huge performance impact while the cache warms up again, during which the system is effectively not available, and the second most critical problem in storage is availability (the first one is losing data, and ZFS cares a lot about our data). Every presentation, every paper, everything about L2ARC gives the same "answer": the data is on disk, and we can read it from there if we lose the SSD contents. No, no, we cannot read it from there! ZFS loves cheap disks, and our disks are 7,200 RPM SATA drives… they know how to store data, but they do not know how to read it back. These 7,200 RPM SATA drives should ship with a banner saying: pay to write, pray to read.
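For reference, this is roughly how log and cache devices are attached to a pool (the pool name `tank` and the device names are placeholders). Note the asymmetry the paragraph above complains about: a slog can be mirrored, but cache (L2ARC) devices cannot, and their contents do not survive a reboot or failover.

```shell
# Placeholder pool/device names -- adjust for your system.
# Log (slog) devices can be mirrored...
zpool add tank log mirror c4t0d0 c4t1d0
# ...but cache (L2ARC) devices cannot, and are not persistent:
zpool add tank cache c5t0d0 c5t1d0
# Watch per-device activity to see how warm the cache is:
zpool iostat -v tank
```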
Number 3: Fragmentation
No tears, no tears… this is something that can force you to do a full ZFS send and recreate the dataset or pool. Incremental snapshot replication on a fragmented pool can be a terrible experience. Copy-on-write for the win, defrag to not lose the title!
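In practice, the only "defrag" available is rewriting the data, which is exactly the full send mentioned above. A sketch with hypothetical dataset names (`tank/data`), assuming you have enough free space for a second copy:

```shell
# Rewriting every block via send/receive is the only way to
# "defrag" a copy-on-write dataset. Names are placeholders.
zfs snapshot tank/data@migrate
zfs send tank/data@migrate | zfs receive tank/data-new
# Once you are happy with the copy, swap the datasets:
zfs rename tank/data tank/data-old
zfs rename tank/data-new tank/data
```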
Number 4: Myths
I could fill a whole blog entry here, but let's stay with tuning for now. When you have a huge installation, there is no code that can give you the best numbers without tuning. Prefetch, max_pending, recordsize, the L2ARC population task, txg sync time, write throttle, scrub_limit, resilver_min_time… and those are just some of the numbers I think every "medium" ZFS installation should tune from the start. My book has others… ;-)
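For the record, on Solaris most of these knobs end up in /etc/system. The exact parameter names and defaults change between releases, so treat this fragment as illustrative only (check the names against your kernel), not as recommended values:

```
* /etc/system fragment -- illustrative names and values only,
* verify against your Solaris release before using.
set zfs:zfs_prefetch_disable = 1
set zfs:zfs_vdev_max_pending = 10
set zfs:zfs_resilver_min_time_ms = 3000
set zfs:zfs_scrub_limit = 10
```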
And we are not even talking about other OS-specific numbers outside ZFS that can make the whole difference between "everything is working fine" and "the performance is crap": disksort, the Nagle algorithm, DNLC, etc.
Ok, done. Feel free to curse at me in the comments section.