ZFS Internals (part #9)

February 21st, 2010


PLEASE BE AWARE THAT ANY INFORMATION YOU MAY FIND HERE MAY BE INACCURATE, AND COULD INCLUDE TECHNICAL INACCURACIES, TYPOGRAPHICAL ERRORS, AND EVEN SPELLING ERRORS.

 From the MANUAL page:
 The zdb command is used by  support  engineers  to  diagnose
 failures and gather statistics. Since the ZFS file system is
 always consistent on disk and is self-repairing, zdb  should
 only be run under the direction by a support engineer.

DO NOT TRY IT IN PRODUCTION. USE AT YOUR OWN RISK!

Some builds ago there was a great change in OpenSolaris regarding ZFS. It was not a change in ZFS itself: the change was the addition of a new scheduling class (but with great impact on ZFS).

OpenSolaris had six scheduling classes until then:
- Timeshare (TS);
This is the classic one. Each process (thread) has an amount of time to use the processor resources, and that “amount of time” is based on priorities. This scheduling class works by changing the process priority.
- Interactive (IA);
This is something interesting in OpenSolaris, because it is designed to give a better response time to the desktop user: the window that is active gets a priority boost from the OS.
- Fair Share (FSS);
Here the processor is divided (fair? ;-) into units (shares), so the administrator can allocate the processor resources in a controlled way. I have a screencast series about Solaris/OpenSolaris features where you can see a live demonstration of resource management and this FSS scheduling class. Take a look at the containers series…
- Fixed Priority (FX);
As the name suggests, the OS does not change the priority of the thread, so the time quantum of the thread is always the same.
- Real Time (RT);
This is intended to guarantee a good response time (latency). So it is like a special queue at the bank (if you have special needs, are a pregnant lady, an elder, or have many, many dollars). Actually, that kind of person does not go to the bank… hmmm, bad example, anyway…
- System (SYS);
For the bank owner. ;-)
Hilarious, because here was the problem with ZFS! Actually, the SYS class was not prepared for ZFS’s transaction group sync processing.

There were many problems with the behaviour of the ZFS I/O scheduler:

6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops
6494473 ZFS needs a way to slow down resilvering
6721168 slog latency impacted by I/O scheduler during spa_sync
6586537 async zio taskqs can block out userland commands
ps.: I would add to this list the scrub process too…

The solution, the new scheduling class (#7), is called:

System Duty Cycle Scheduling Class (SDC);

The first thing I thought when reading the theory statement from the project was that it was not so clear why you would fix an I/O problem by changing the scheduling class, actually messing with the management of the processor resources. Well, that’s why I’m not a kernel engineer… thinking it over, it seems like a clever solution and, given the complexity of ZFS, the easy way to control it.
As you know, ZFS has I/O priorities and deadlines, and synchronous I/O (like sync writes and reads) has the same priority. My first idea was to have separate slots for different types of operation. It’s interesting because this problem was the subject of a post from Bill Moore about how ZFS handled a massive write while keeping up with the reads.
So, there were some discussions about why to create another scheduling class and not just use the SYS class. The answer was that the SYS class was not designed to run kernel threads that are large consumers of CPU time, and by definition SYS class threads run without preemption from anything other than real-time and interrupt threads.
And more:

Each (ZFS) sync operation generates a large batch of I/Os, and each I/O may need to be compressed and/or checksummed before it is written to storage. The taskq threads which perform the compression and checksums will run nonstop as long as they have work to do; a large sync operation on a compression-heavy dataset can keep them busy for seconds on end.
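
To picture what a “duty cycle” could mean for such threads, here is a toy Python sketch. It is not the OpenSolaris implementation. It assumes that a thread's duty cycle is its on-CPU time divided by on-CPU plus runnable-but-waiting time, and that a thread above its target is dropped to a low priority until it falls back under the target. The class name, the `target` parameter and the two priority values are made up for illustration.

```python
from dataclasses import dataclass

HIGH_PRIO = 99   # hypothetical priority while the thread is within its budget
LOW_PRIO = 0     # hypothetical priority while the thread is over its budget


@dataclass
class DutyCycleThread:
    """Toy model of a kernel thread with a CPU duty-cycle target (0.0 to 1.0)."""
    target: float
    on_cpu: float = 0.0    # time spent running
    waiting: float = 0.0   # time spent runnable but not running

    def account(self, ran: float, waited: float) -> None:
        """Record one scheduling interval."""
        self.on_cpu += ran
        self.waiting += waited

    def duty_cycle(self) -> float:
        total = self.on_cpu + self.waiting
        return self.on_cpu / total if total else 0.0

    def priority(self) -> int:
        """Over the target: yield to others. Otherwise: run at high priority."""
        return LOW_PRIO if self.duty_cycle() > self.target else HIGH_PRIO


t = DutyCycleThread(target=0.8)
t.account(ran=9.0, waited=1.0)        # ran 9 of 10 time units: above the target
print(t.duty_cycle(), t.priority())
t.account(ran=0.0, waited=5.0)        # held back for a while: below the target
print(t.duty_cycle(), t.priority())
```

The point of the sketch is only that a CPU-hungry thread is not allowed to run nonstop: once it uses more than its share, other threads (userland included) get a chance to run.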

ps.: And we were not even talking about dedup yet; this seems like a must-fix for the evolution of ZFS…
You see how wonderful our work is: by definition NAS servers have no CPU bottleneck, and that is why ZFS has all these wonderful features, and new features like dedup are coming. But the fact that CPU is not a problem was actually the problem. ;-) It’s like giving me Lewis Hamilton’s MP4-25. ;-)))
There is another important point in this PSARC integration: now we can observe the ZFS I/O processing, because a new system process was introduced with the name zpool-poolname, which gives us observability using simple commands like ps and prstat. Cool!
I confess I have not had the time to do a good test of this new scheduling class implementation and how it performs. This fix was committed in build 129, so it should be in the production-ready release of OpenSolaris (2010.03). It would be nice to hear comments from people who are already using this new OpenSolaris implementation, and how dedup is performing with it.
peace

PS3 and Music, movies…

February 13th, 2010

Wonderful combination: PS3, Mac OS X and PS3 Media Server. I was looking for a solution for streaming movies and music to my PS3 console, and PS3 Media Server just works. Without any previous conversion of the movie files, it converts on the fly, and we can watch the movies without any problems (with subtitles) over a wifi connection! I’m working on a NAS server for my home (stay tuned ;-), using OpenSolaris and ZFS for sure, so I found this post about some problems (fixed by following the post) using PS3 Media Server on Solaris.

This software supports:

- Ready to launch and play. No codec packs to install. No folder configuration and pre-parsing or this kind of annoying thing. All your folders are directly browsed by the PS3, there’s an automatic refresh also.
- Real-time video transcoding of MKV/FLV/OGM/AVI, etc.
- Direct streaming of DTS / DTS-HD core to the receiver
- Remux H264/MPEG2 video and all audio tracks to AC3/DTS/LPCM in real time with tsMuxer when H264 is PS3/Level4.1 compliant
- Full seeking support when transcoding
- DVD ISOs images / VIDEO_TS Folder transcoder
- OGG/FLAC/MPC/APE audio transcoding
- Thumbnail generation for Videos
- You can choose with a virtual folder system your audio/subtitle language on the PS3!
- Simple streaming of formats PS3 natively supports: MP3/JPG/PNG/GIF/TIFF, all kind of videos (AVI, MP4, TS, M2TS, MPEG)
- Display camera RAWs thumbnails (Canon / Nikon, etc.)
- ZIP/RAR files as browsable folders
- Support for pictures based feeds, such as Flickr and Picasaweb
- Internet TV / Web Radio support with VLC, MEncoder or MPlayer
- Podcasts audio/ Video feeds support
- Basic Xbox360 support
- FLAC 96kHz/24bits/5.1 support
- Windows Only: DVR-MS remuxer and AviSynth alternative transcoder support

Excellent Java application, just run it and play! You just need to point it at the shares for Music and Movies…

peace

D tails…

February 12th, 2010

DTrace, like any other programming language, has many details, and I was bitten by one of them…
Some time ago I did a post about some strange values in the latency times provided by one script on the NFSv3 provider wiki. Well, the culprit was the predicate, or rather the predicate that was not there… and, surprise, it was missed by Brendan Gregg. If he can make a mistake in D, I’m fine with mine. ;-)
This time the problem was (again) strange values, actually the same values in many tests. So, looking at the script execution, I saw:

dtrace: xxx dynamic variable drops

Seems not good…
So I found a post from Bryan pointing to documentation on how to fix it. And looking at my script, I remembered that when I wrote it, I had to put in this:

#pragma D option dynvarsize=16m

But actually without understanding why… it worked, and I forgot about it. Well, the script did not. ;-) The little programming error came alive again when the data was big enough, so to confirm it, I just changed the 16m to 256m. Everything was fine again. But now I had to fix it properly.
So I started to look for the error, and it is incredible how hard it can be to find such a simple mistake in a simple script. I forgot to do a “must” for D thread-local variables:

“Always assign zero to thread-local variables that are no longer in use”.

Well, something simple like:

self->read = 0;

problem fixed. ;-)
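
To see why forgetting that assignment ends in dynamic variable drops, here is a toy Python model. It is not DTrace, and the class, function and sizes are invented for illustration. It assumes the behaviour described in the DTrace documentation: thread-local variables live in a fixed-size dynamic variable space, and assigning zero releases the storage.

```python
class DynVarSpace:
    """Toy stand-in for a fixed-size dynamic variable space."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity   # plays the role of dynvarsize
        self.slots = {}            # (thread id, variable name) -> value
        self.drops = 0             # assignments that found no free space

    def assign(self, tid: int, name: str, value: int) -> None:
        key = (tid, name)
        if value == 0:
            # Assigning zero frees the storage held by the variable.
            self.slots.pop(key, None)
        elif key in self.slots or len(self.slots) < self.capacity:
            self.slots[key] = value
        else:
            self.drops += 1        # the toy version of a "dynamic variable drop"


def trace(threads: int, capacity: int, zero_when_done: bool) -> int:
    """Each thread sets self->read on entry and (maybe) clears it on return."""
    space = DynVarSpace(capacity)
    for tid in range(threads):
        space.assign(tid, "read", 1 + tid)   # entry probe: remember a value
        if zero_when_done:
            space.assign(tid, "read", 0)     # return probe: self->read = 0;
    return space.drops


print(trace(threads=1000, capacity=16, zero_when_done=False))
print(trace(threads=1000, capacity=256, zero_when_done=False))
print(trace(threads=1000, capacity=16, zero_when_done=True))
```

In this model a bigger space only delays the drops, which matches the experience of going from 16m to 256m: the problem comes back as soon as the data is big enough. Clearing the variable is what makes the drops go away at any size.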

But what I did not understand was how I could change the dynamic variable size to 16MB or 256MB without changing the DTrace kernel parameter (16KB). I thought I would need something like:

echo 'dtrace_global_maxsize/Z F000000' | mdb -kw

right? wrong? ;-)

ps.: I think DTrace should exit with an error on something like dynamic variable drops. Bryan wrote in the paper: “Must be eliminated for correct results”! Or is there some option in D to make such an error more critical?

peace

Fishworks RFE (2)

February 10th, 2010

The Sun Storage 7000 systems have many old, old ZFS replication problems…
Some are old ZFS problems, and others are old problems from fishworks itself…

- We cannot replicate per dataset, just entire projects. The only way to replicate one dataset is to create a project for each dataset… well, why would I want to create directories just to put one file inside each one?

- If we do a manual replication, and that replication runs for 10 days and fails, we do not have the replication as a whole. We lose the snapshot that was created for the replication too.

- If the replication fails, all the datasets fail. If we have a project with 10 datasets, each one 10TB, and the replication fails after a week, we lose everything.

- We cannot see the replicated datasets, just the project. Not even with read-only access. You need to believe your data is there. You need to clone the project to see it…

- You cannot take a snapshot while the replication is running (what??? This is ZFS B/C), and the replication can run for a lllllonnnngggg time.

- If you remove a replication schedule, you lose all the related snapshots (now you have two problems instead of one), and run, because your storage will go down!

- If you have a storage with 5TB of data and your replication schedule fails, you will not have the storage replicating anymore. The performance is from ZFS B/C too… Actually, the only way to have it working again is to remove the scheduled replication (again: run!), lose all the snapshots, and start a full replication again (with little chance of an “X” TB replication actually working).

- Replication performance is terrible!

It seems to be a really nice product from the performance perspective, but we need to improve the reliability, and the replication needs to come to 2010. I remember that kind of problem from using the first version of ZFS on Solaris 10.
We see a great product that we really want Oracle to improve, because we want to use it for real. But right now our home-made ZFS storage is much better in general. I think Oracle/Sun can do a better job than us at assembling a storage system; we are not a hardware manufacturer. And I was expecting a cleaner fix procedure than a reinstall (firmware update) to fix any kind of bug.

peace

Conspiracy Theory

February 8th, 2010

Well, like many of you, I remember all the criticism Sun and the OpenSolaris project received at the start (the license, the company behind it, etc). I think I’m not radical about Open Source software; I have used GPL, BSD, CDDL, and even proprietary software. I have my personal opinion about it, but I don’t need everybody to agree with me. I think there is space for Microsoft Windows, Apple, Solaris, OpenSolaris, GNU/Linux, etc. But let me say this: if there is someone in computer science that I *really* respect, it is Richard Stallman. He created a *masterpiece*, and gave it to developers, users, and companies. Yes, for fun and profit!
I said that because I want to make it *clear* that I think the Open Source community is about philosophy (yes, I do). It’s not just about lines of code. So the license and everything surrounding open software is important too. And that is not radicalism, it is just how I see the ecosystem. But my profession, my work, is not Open Source. So my point is: when talking about technology, we have an Open World for Oracle or MySQL software… when talking about Open Source/Community it is much more complicated. How many companies have solutions using GNU/Linux? They have guarantees. Principles from the GPL.
First, I like technology, innovation, and quality. When Sun released Solaris as an open project it was a dream come true for me, because I could access the source and learn a lot using the best Operating System that I know. More, maybe I could help a little bit and participate in such technology. Just great. Well, you can imagine all the things I heard about participating in this… it reminds me of the movie 2012, there is a funny quote: “Always remember, folks. You heard it first from Charlie”. Charlie represents many of my friends…
Now I’m receiving many emails and reading many blogs from people that I know and really respect who are leaving Oracle/Sun. 100% saying: “Don’t worry, everything is ok”.
(Another quote from 2012: When they tell you not to panic… that’s when you run!)
I know many people who say that Oracle wanted to buy MySQL and could not, so it did an indirect deal using Sun (that’s why the title of this post ;-). And others say that this is b**** …
In the end, I don’t think this matters much… it’s done. And thinking about the $$ Sun paid for MySQL…
What about companies that invested in OpenSolaris as their business foundation? We have many people working around that project, and all we have is: “?” (Thanks Benr for that).

ps.: Sorry if you are looking for the movie, go here. ;-)
peace

Colorados!

November 29th, 2009

Lá blah pé blah lá lá…
Lá blah lá vier…
Lá lá nós estaremos…
Lá com blah blah estiver.

Brazilian football needs a final!

November 18th, 2009

It seems I’m in the minority, but I really don’t like this new points-table (round-robin) format of the Brazilian championship. I think it is a cheap copy of the European model and, as could be expected, it is going down the same path…

The thing is that over there, each country has two or three teams that compete for the title, cruising around the country with their millionaire squads. Here it is very different: when the Brazilian championship starts, we have at least 10 teams competing for the title!

Imagine if, in this final stretch of the championship, the top eight teams qualified for a title playoff: São Paulo, Flamengo, Palmeiras, Internacional, Atlético MG, Cruzeiro, Avaí, Goiás!

We would have Maracanã, Beira-Rio, Morumbi, Mineirão… all packed! The emotion is totally different, we can follow every game, and the whole atmosphere of the finals is different.

I bet everybody remembers the Brazilian championship finals their team won, as well as the World Cup final, the Libertadores final, the NBA finals, whatever. Now, how has our national championship been ending?


I can only imagine this championship working if a minimum points gap between first and second place is required for the championship to be decided on points. Something like 9 points (a three-match difference). Anything other than that rewards regularity and not exceptional play, which is what I got used to seeing in Brazilian football.

Imagine that at the end of the championship the leader has only two points more than the runner-up, even though the runner-up won both matches against the leader. But the runner-up played team “B” in the first match of the second half of the season, when team “B” was fired up to escape relegation, playing at home with a new coach. The runner-up drew. Then the leader faced this same team “B” in the last round of the championship, playing at home, with team “B” already mathematically relegated and fielding its under-15 squad. ;-)

The best players grow in the finals, and all of this because football is art; judging the championship of the best football in the world by regularity, even if it is the best regularity in the championship, is far too little to pick the “best” team.

NFS Performance

November 8th, 2009

Like many of you, I started to work with NFS when performance was not a problem. Right? Well, more or less… performance was not the first problem. The first real problem was getting the whole thing working! Compatibility was a huge problem, and I’m not talking about GNU/Linux and Unix compatibility, where getting things working was some kind of art. I’m talking about GNU/Linux and GNU/Linux. The same distro with the same version was a problem, imagine different distros!
So, after that, in general after removing all the NFS stuff from the distros and installing nfs-utils, rpc, etc. from source (homogeneous install), everything was working… ok, now performance was the problem. ;-)

Well, welcome back to 2009! Is compatibility a problem? No, NFSv3 is an established standard and I have no experience of incompatibility these days. And performance? I admit that I thought that was another problem we had left behind too. But I was wrong. Yeah, I know that if we want to deploy a high performance environment, or we need to get *more* than a normal configuration would give us, we need to do some dirty work. But the performance numbers I saw on GNU/Linux RedHat clients were some crazy ones!
I did a simple test like:

 iozone -s1g -r8 -i 0 -i 2 -t 1 -c -o -O


And the performance for random writes was 1500 ops! Changing the threads to 2, 3, 6 and 12 did not change *anything*. The numbers were 1200, 1600, 1500.

An OpenSolaris client got much better numbers, like 1500, 2500, 4000, 5000! The same as the local ZFS performance on the NFS server (with one Intel SSD). The difference was that locally on the NFS server, I got 3300 ops with a single thread.
At least the Ubuntu client was better, with numbers not as good as the OpenSolaris client (2009.06), but way better than RedHat!

So, if Brendan Gregg or Roch see this post by accident, I think you will need to do something like the NetApp GNU/Linux patches. Brendan is working on finding the fishworks limits (I think the clients they are using in the tests are just OpenSolaris/Solaris). But as RedHat is an enterprise GNU/Linux distro, present in a lot of datacenters, I fear the Storage 7000 series from Sun/Oracle may be like a Ferrari without a road to run on. I know that thinking quickly we can say: “That’s not my proble…”, but indirectly it can be.
peace

Olhar Digital III (English)

November 2nd, 2009

These days two of my kids took part in a television show about their experience with GNU/Linux. The program is in Brazilian Portuguese, but I found this YouTube version with English subtitles. So, if you want to watch…

ps.: I thought they were learning MS Windows at school… now I know I will need to buy a license and teach them at home. ;-)
peace

Storage 7410 RFE

October 27th, 2009


Hello there! I did a post about IO on demand some time ago, and that post was about some aspects where I think we need more power in storage land.
I think the 7410 (fishworks) is a great step in the Unified Storage Model, but at the same time, the “Unified” word is a complicated one. The problem with that word is the notion of “all-in-one”, and the lack, IMO, of some controls needed in such an environment.

So, here i can post my wishes ;-)

1) Would be nice if we could set limits for IOPS based on datasets;
2) Would be nice if we could set latency objectives per dataset too;
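
Wish 1 could look something like a token bucket per dataset. The Python below is only a sketch of that idea; nothing like it is claimed to exist in the 7410, and the names (`IopsLimiter`, `limit_iops`, the dataset paths) are made up for illustration.

```python
class IopsLimiter:
    """Token bucket: a dataset may issue at most `limit_iops` operations per second."""

    def __init__(self, limit_iops: float) -> None:
        self.limit = limit_iops
        self.tokens = limit_iops   # start with a full bucket (one second of I/O)
        self.last = 0.0            # time of the previous refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if one I/O may be issued at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to the elapsed time, never above the limit.
        self.tokens = min(self.limit, self.tokens + elapsed * self.limit)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical per-dataset limits.
limits = {"pool/db": IopsLimiter(1000), "pool/backup": IopsLimiter(100)}

# Both datasets try to issue 500 I/Os at the same instant (t = 0).
issued = {name: sum(lim.allow(now=0.0) for _ in range(500))
          for name, lim in limits.items()}
print(issued)
```

With these made-up limits the database dataset gets all of its 500 operations through, while the backup dataset is held at 100 and has to wait for the bucket to refill. A latency objective (wish 2) would need a different mechanism, since it is about how long each operation takes and not how many are issued.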

And it’s more complicated when we have other vendors with solutions for that. So, we want ZFS, but that is not enough to win the war.

If we are going to have a giant source of power, we need a way to master it.

peace.