Computing Science, posix rules, life rules, no rules…
01:22 - Mon 6 September, 2010

2010 World Rally Championship

Performance

These days I was working to understand a workload: random, with many small files (e.g., web servers), running in an NFS/ZFS environment.

It was interesting to see that the discs were really busy with a very heavy random workload, but with only a few NFS read and write requests. On top of that, there was an idle SSD as the ZFS pool cache device.

With the above scenario, you can conclude that the service times for those discs were really high (around 30ms on average), sometimes 40-50ms.

I have an opinion that performance is not a problem. OK, I can hear you LOL… OK… OK, I will try to explain my point.

Actually the problem does exist, but I think it is a consequence, and what really matters is “why” we are facing a performance problem. In my day-to-day job my main concern is availability. And it has been for a long time, because I understand that “that” is the real value we can add in our job: availability. So, you will tell me that a performance problem can lead to an availability/reliability problem, and that in the end what decides whether it is one or the other is the expectations for the service. I do agree.

What I want to say is that performance is always a composition, and we can provide performance with capacity planning, flexibility, levels and levels of cache, and… respecting the limits. If we set aside a solution that starts “wrong”, I mean, one where the requirement is 100x and the solution was created delivering 50x ;-), all the rest about performance is a matter of monitoring. I think performance administration is like money: if you don’t know how to manage US$ 1K, because you spent US$ 1K and one dollar, you will not know how to manage US$ 1M (because going over by one cent is an overflow in both cases). As a last argument, I think we can explain a performance problem, but we can’t justify it.

Back to the problem at the beginning of this post… when I talked about “a few NFS reads and writes”, I was talking about 200, sometimes 300 or so NFS ops of this kind, compared with about 8K to 9K NFS ops (lookup, getattr, etc.).

Looking further, we could see that the workload involved many, many files in the same directory. So, we could see that more than 60% of the NFS operations were READDIRPLUS, around 15% or 20% were GETATTR, and just 1% or 2% were READs and WRITEs (both combined). So the discs were working, true, but for the wrong purpose…
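To make that kind of breakdown concrete, here is a minimal sketch of how the operation mix can be computed from per-operation counters (the sort of numbers a tool like nfsstat reports). The `op_mix` helper and the counter values are assumptions for illustration only; the numbers are made up to resemble the proportions described above and are not the real measurements.

```python
def op_mix(counters):
    """Return each operation's share of the total, as a percentage."""
    total = sum(counters.values())
    if total == 0:
        return {}
    return {op: 100.0 * count / total for op, count in counters.items()}


# Hypothetical counters, NOT the real measurements from this environment.
sample = {
    "readdirplus": 6200,
    "getattr": 1800,
    "lookup": 1100,
    "access": 700,
    "read": 120,
    "write": 80,
}

mix = op_mix(sample)
# Print the busiest operations first.
for op, pct in sorted(mix.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:12s} {pct:5.1f}%")
```

Seen this way, the share of “real work” (read plus write) is easy to compare against the metadata operations.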

READDIRPLUS is a “prefetch” feature of NFSv3. IBM (one of the best knowledge bases on the net, IMHO) has a good summary:

“In NFS Version 3, file handle and attribute information is returned along with directory entries via the READDIRPLUS operation. This relieves the client from having to query the server for that information separately for each entry, as is done with NFS Version 2, and is thus much more efficient. However, in some environments with large directories where only the information of a small subset of directory entries is used by the client, the NFS Version 3 READDIRPLUS operation might cause slower performance”.

This metadata overhead is one of the arguments in that very old “maildir vs mailbox” discussion, for example, and here we have a document about it:

“Currently, we’re using version 2 of the NFS protocol. While version 3 does give some significant performance benefits, we give it all back because of the implementation of the READDIRPLUS procedure which prefetches attribute information on all files in that directory, whether we need them or not. Since we store a large number of files in the same directory and are only interested in operating on one of them at a time, this is significant overhead that we don’t need”.

OK, our percentage of “real work” matches the quote above. And as we can guess, there are many ways to fix this, and actually we don’t need to switch back to NFSv2 (and trade “one problem” for “many others”)… one solution would be to ask the clients to use READDIR instead (e.g., the GNU/Linux mount option -o nordirplus), while still using NFSv3.

As an example of this possible overhead, imagine a scenario where you are working on a few files (reading and writing a few random bytes to them). You have many files in a few directories (thousands of files in hundreds of directories). Every change in a directory is sufficient to invalidate the client cache, and then READDIRPLUS calls retrieve attributes for thousands of files (in each directory).
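The scenario above can be put into rough numbers. This is only a back-of-the-envelope sketch under simplifying assumptions: every invalidated directory is re-read completely, and without READDIRPLUS the client asks for attributes only for the files it really touches. The function names and the figures are hypothetical.

```python
def attrs_with_readdirplus(files_per_dir, dirs_invalidated):
    """Attributes returned when every invalidated directory is re-read in full."""
    return files_per_dir * dirs_invalidated


def attrs_on_demand(files_touched):
    """Attributes fetched when the client asks only for the files it uses."""
    return files_touched


# Hypothetical workload: 5000 files per directory, 10 directories changed,
# and only 20 files actually read or written by the application.
prefetched = attrs_with_readdirplus(files_per_dir=5000, dirs_invalidated=10)
needed = attrs_on_demand(files_touched=20)
print(f"prefetched attributes: {prefetched}")
print(f"attributes really needed: {needed}")
print(f"overhead factor: {prefetched / needed:.0f}x")
```

The exact factor depends entirely on the workload; the point is only that the prefetch grows with the directory size while the useful work does not.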

But we are not done yet, because our cache was not working well, and we suspected that we were doing even more work on those discs. Well, we have another “prefetch” in our solution: ZFS prefetch. And the guess was that ZFS prefetch was making the discs go crazy because of those many metadata operations.

Let’s look at RFC 1813 (NFSv3):

“Procedure READDIRPLUS retrieves a variable number of entries from a file system directory and returns complete information about each along with information to allow the client to request additional directory entries in a subsequent READDIRPLUS. READDIRPLUS differs from READDIR only in the amount of information returned for each entry. In READDIR, each entry returns the filename and the fileid. In READDIRPLUS, each entry returns the name, the fileid, attributes (including the fileid), and file handle”.

Read more: http://www.faqs.org/rfcs/rfc1813.html
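The difference the RFC describes can be sketched as two entry types. This is a simplified model, not the XDR definitions: `FileAttributes` holds only a small subset of the real attribute structure, the file handle is a fake byte string, and `readdir`, `readdirplus`, and `fake_getattr` are hypothetical helpers that just count how many attribute lookups a server would have to do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReaddirEntry:
    """What READDIR returns per entry: the name and the fileid."""
    fileid: int
    name: str
    cookie: int


@dataclass
class FileAttributes:
    """A small, illustrative subset of the per-file attributes."""
    fileid: int
    size: int
    mode: int
    mtime: float


@dataclass
class ReaddirplusEntry:
    """What READDIRPLUS returns per entry: name, fileid, attributes, handle."""
    fileid: int
    name: str
    cookie: int
    name_attributes: Optional[FileAttributes]
    name_handle: Optional[bytes]


def readdir(directory):
    """List a directory without touching any per-file metadata."""
    return [ReaddirEntry(fileid, name, cookie)
            for cookie, (name, fileid) in enumerate(sorted(directory.items()), start=1)]


def readdirplus(directory, getattr_fn):
    """List a directory and fetch attributes plus a handle for every entry."""
    entries = []
    for cookie, (name, fileid) in enumerate(sorted(directory.items()), start=1):
        attrs = getattr_fn(fileid)            # one metadata lookup per entry
        handle = f"fh-{fileid}".encode()      # fake file handle
        entries.append(ReaddirplusEntry(fileid, name, cookie, attrs, handle))
    return entries


calls = []

def fake_getattr(fileid):
    """Record each attribute lookup and return dummy attributes."""
    calls.append(fileid)
    return FileAttributes(fileid=fileid, size=0, mode=0o644, mtime=0.0)


directory = {f"msg{i}": 1000 + i for i in range(5)}   # hypothetical directory
plain = readdir(directory)
plus = readdirplus(directory, fake_getattr)
print(len(plain), len(plus), len(calls))
```

In this toy model the plain listing needs no attribute lookups at all, while the “plus” listing needs one per directory entry, whether the client will use it or not.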

The question is: what is this “complete information” in the ZFS implementation? Well, before going further into the ZFS source code, we can just try it:

echo zfs_prefetch_disable/W0t1 | mdb -kw

Bingo!

Discs now 0% busy, latencies down to 5-10ms…

You are working on a performance problem, and you say that performance is not a problem? ;-) The difference between one scenario and the other was the limit of each one, nothing more. In both scenarios we always start with everything working, and we need to monitor so that we do not pass that limit, whether that limit is 1K or 1M.

What we did, actually, was not solving a performance problem, but a resource optimization.

peace

Ubuntu and kids (part II)…

Well, once more it was time to update the kids’ desktop. And now I see that two years have passed and I have not solved this “desktop” problem yet…

In the last post my intention was to change it to OpenSolaris, but I really like Ubuntu and GNU/Linux, and I think it is just fine for the kids to learn that the OS is just a tool to get the job done. And there are many, many tools out there.

The OpenSolaris idea was just because of… ZFS. ;-)

Well, that Ubuntu desktop is a “monster”. Really, every time I need to do anything on it, I see what a bunch of applications and customizations it has (for my kids and my wife). Well, and now with these problems around the OpenSolaris distro, maybe it would not be a good idea anyway…

Many, many games and tweaks made my sons love GNU/Linux. But none of that makes much sense right now, because many of the new games (Steam) and Flash stuff run on it out of the box. But there are some games that I configured a long time ago, and after many updates they are still there (aptitude keeps telling me that), and working! I have an environment with many apps and shortcuts that was the “first” contact between my wife and a computer. ;-) That is “home” for her…

Every time I ask: “Do you still play this?” the answer is always: “Yes!!”

So, the problem is not about the OS or ZFS. The latter would make my life easier in the upgrade procedure, and I always remember it when a new Ubuntu LTS (Long Term Support) is released. But the real problem is a shame: backup. In fact, I saw a lot of silent data corruption on the boot drive while I was doing the upgrade, and a few days ago we lost the secondary drive that had one little partition where I had “some” backup. But the sad reality is that I failed… badly. I need a decent backup procedure for that desktop.

All the “real” (professional/family) important stuff is on my Mac OS X laptop, which has a spare USB drive as a backup (\o/). But that drive that died was, for a long time, “home” for all our family pictures. Imagine… no, better not.

In the end, the new Ubuntu LTS desktop is nice. The look is great and it is really, really fast! It’s wonderful to have this level of quality and support for free! It’s unbelievable! And with a Debian foundation, the distro is rock solid. Canonical elevates the GNU/Linux OS to a whole new level. If we pick the “mission statement” from Canonical, we realize that they get straight to the point.

Thanks to the whole Ubuntu community and obviously to Canonical, which sponsors this whole project!

No more Hardy Heron… now it’s time for Lucid Lynx.

ps.: Let’s see if I will post about my ZFS NAS/backup server in the future, or another sad story about this desktop. Well, I have three more years… ;-)

peace

Missing some racing action…

Chinese Grand Prix (McLaren Race 1.0b):

0809: SAFETY CAR IS IN, RACE HAS RESTARTED. JENSON NOW SECOND, LEWIS 14TH AFTER HIS PIT STOP.
0817: LEWIS FIGHTING HIS WAY BACK THROUGH THE TRAFFIC NOW – HE’S PAST BARRICHELLO FOR 11TH.
0820: LEWIS RACING WITH THE REDBULLS FOR NINTH AND TENTH.
0820: LEWIS PASSES WEBBER FOR TENTH PLACE.
0823: LEWIS PASSES KOVALAINEN FOR NINTH.
0824: LEWIS PASSES SUTIL AND VETTEL IN ONE MOVE TO TAKE SEVENTH PLACE.
0827: LEWIS SETS THE FASTEST LAP OF THE RACE, 1’43.276.
0828: ANOTHER FASTEST LAP FOR LEWIS, 1’42.061
0829: LEWIS IS CATCHING MICHAEL SCHUMACHER WHO’S FIFTH.
0833: LEWIS GETS PAST SCHUMACHER AT THE HAIRPIN FOR FIFTH PLACE.
0841: LEWIS BACK ON TRACK ON INTERMEDIATE TYRES BUT HE’S LOST TRACK POSITION TO WEBBER.
0853: LEWIS UP TO FIFTH AT THE RESTART – WEBBER GOES OFF AND HE’S PASSED SCHUMACHER.
0854: LEWIS PASSES PETROV FOR FOURTH PLACE.
0857: LEWIS’S NEXT TARGET IS ROBERT KUBICA WHO’S THIRD
0858: LEWIS TRYING TO FIND A WAY PAST KUBICA AT THE HAIRPIN.
0858: HE’S DONE IT – LEWIS TAKES THIRD PLACE.  “LEWIS IS P2. WE’VE JUMPED ROSBERG.”
0917: JENSON AND LEWIS ONE-TWO IN THE CHINESE GRAND PRIX.
That’s why LH did not like the end of the Turkish GP… Jenson tried to pass him, and that was not fair. And worse, the McLaren team did not try to warn him. Was the team trying to swap places?

ZFS Internals (part #10)

PLEASE BE AWARE THAT ANY INFORMATION YOU MAY FIND HERE MAY BE INACCURATE, AND COULD INCLUDE TECHNICAL INACCURACIES, TYPOGRAPHICAL ERRORS, AND EVEN SPELLING ERRORS.

 From the MANUAL page:
 The zdb command is used by  support  engineers  to  diagnose
 failures and gather statistics. Since the ZFS file system is
 always consistent on disk and is self-repairing, zdb  should
 only be run under the direction by a support engineer.

DO NOT TRY IT IN PRODUCTION. USE AT YOUR OWN RISK!

In ZFS Internals (part #8) I talked about a misunderstanding I had regarding ZFS physical and logical vdevs when I started to study this software. This filesystem (actually I do not like to call it “just” that) is all about specialized parts and rampant layering violation.

ps.: This is one of the posts I like most, maybe because I’m not good at math, and it just seems like magic to me. Another text I really like is this one, of which I keep a copy on my site. I read it many years ago in a book from O’Reilly, and looked for it on the internet…

As we have a volume manager, RAID, and filesystem all in one piece of software, sometimes we think about it as a whole, and this can lead to a wrong perception of some important concepts and isolations we have inside ZFS.

ZFS is very well designed software, really easy to use and with a beauty that is easy to see, but it has three fundamental parts that we need to understand in our quest: ZPL, DMU, and SPA. In part #8 I talked more about the SPA, so today I will focus on the other two.

Let’s start at the top of the stack…

Basically, the ZPL is the component that gives us files and directories. This component is the ZFS POSIX Layer, something we are used to, and that the ZFS creators could not leave behind. Most software expects a POSIX-compliant filesystem, and it is a well-known interface in Unix and Unix-like OSes (e.g., GNU/Linux and FreeBSD). So, we can imagine this piece as a “simple” translator, the “face” of ZFS.

OK, so there should not be any complexity or big deal with this layer. I mean, when I started to think about it, my guess was: “Solaris already has the whole UFS implementation; I think the engineers picked that implementation and changed the glue to talk to the next layer: DMU”. I mean, if we are talking about an API we have known and used for more than 30 years, I thought the better approach would be to use the UFS code, pretty much stable, and just change the communication with the underlying layers. You know, talk is always easier… change some parameters, add and delete some functions, and it’s done. Something like a day. Two, if I’m not inspired. ;-)

Without going any further into the ZPL, let’s think about a few ZFS features like:

1 – transactional semantics;

2 – no fsck;

3 – always consistent on-disk;

4 – copy-on-write.

UFS has none of those features.

So, for ZFS to accomplish all those goals, if all those features are not in the ZPL, they must be implemented elsewhere down the stack. But it is not that simple, and it is a combination rather than a separation, as we will see next…

The next component in order is the DMU (Data Management Unit). This component works with datasets, objects, and offsets, and gives an object transaction interface to the ZPL (and to any other component that wants to use it, like ZVOL).

So, we have the “transactional semantics of ZFS” implemented here, in the DMU. And yes, the DMU keeps the “always consistent on-disk” state, because the copy-on-write of blocks is implemented in it too. One thing I can risk saying is: “DMU is the heart of ZFS”.

But even doing two and a half of the four features I listed above, the DMU does not know anything about fsck, and the object transaction interface exported by the DMU needs to be used appropriately. So the transition (in the sense of filesystem consistency) is 100% the responsibility of the ZPL.

The important concept behind this is that the DMU will guarantee 100% atomicity in the transition from one disk state to another, but a consistent logical view of the filesystem must be handled above this layer. In this case: the ZPL.

So, the ZPL uses the transactional semantics and the always consistent on-disk state given by the DMU (using copy-on-write), and eliminates fsck!

The ZPL’s job is to group the transactions and commit them in a way that the logical view of the POSIX filesystem implemented on this layer is always consistent. I think it is like the famous quote from many movies: “Guns do not kill people, people do”.
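A toy model may help to see this division of responsibilities. Everything here is an assumption for illustration: `ToyStore` is a stand-in for an atomic, copy-on-write object store (it is not the DMU API), and the two `create_file_*` functions are hypothetical upper layers. The store only guarantees that each commit is all-or-nothing; whether the resulting state makes sense as a filesystem depends on how the layer above groups its changes.

```python
import copy


class ToyStore:
    """Toy atomic store: each commit replaces the whole state at once."""

    def __init__(self):
        self._committed = {}                      # last consistent "on-disk" state

    def begin(self):
        return copy.deepcopy(self._committed)     # private working copy (copy-on-write)

    def commit(self, working):
        self._committed = working                 # single atomic switch

    def snapshot(self):
        return copy.deepcopy(self._committed)


def create_file_grouped(store, dirname, fname, obj_id):
    """Upper layer done right: object and directory entry in ONE transaction."""
    tx = store.begin()
    tx.setdefault("objects", {})[obj_id] = b""
    tx.setdefault("dirs", {}).setdefault(dirname, {})[fname] = obj_id
    store.commit(tx)


def create_file_split(store, dirname, fname, obj_id, crash_between=False):
    """Upper layer done wrong: two transactions, with a possible crash between."""
    tx = store.begin()
    tx.setdefault("dirs", {}).setdefault(dirname, {})[fname] = obj_id
    store.commit(tx)
    if crash_between:
        return                                    # simulate losing power here
    tx = store.begin()
    tx.setdefault("objects", {})[obj_id] = b""
    store.commit(tx)


def dangling_entries(state):
    """Directory entries pointing to objects that do not exist (fsck territory)."""
    objects = state.get("objects", {})
    return [(d, name)
            for d, entries in state.get("dirs", {}).items()
            for name, oid in entries.items()
            if oid not in objects]


good = ToyStore()
create_file_grouped(good, "/", "a", 1)
print(dangling_entries(good.snapshot()))

bad = ToyStore()
create_file_split(bad, "/", "a", 1, crash_between=True)
print(dangling_entries(bad.snapshot()))
```

In both cases the store itself never sees a half-written commit; only the second upper layer leaves a state that would need something like fsck to repair.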

Looking at the well commented OpenSolaris source code, we can see this:

 * Each vnode op performs some logical unit of work.  To do this, the ZPL must
 * properly lock its in-core state, create a DMU transaction, do the work,
 * record this work in the intent log (ZIL), commit the DMU transaction,
 * and wait for the intent log to commit if it is a synchronous operation.
...

In other words, that was not the gun, that was the people killing the fsck.

In that same C file we can see the algorithm:

 *	ZFS_ENTER(zfsvfs);		// exit if unmounted
 * top:
 *	zfs_dirent_lock(&dl, ...)	// lock directory entry (may VN_HOLD())
 *	rw_enter(...);			// grab any other locks you need
 *	tx = dmu_tx_create(...);	// get DMU tx
 *	dmu_tx_hold_*();		// hold each object you might modify
 *	error = dmu_tx_assign(tx, TXG_NOWAIT);	// try to assign
 *	if (error) {
 *		rw_exit(...);		// drop locks
 *		zfs_dirent_unlock(dl);	// unlock directory entry
 *		VN_RELE(...);		// release held vnodes
 *		if (error == ERESTART) {
 *			dmu_tx_wait(tx);
 *			dmu_tx_abort(tx);
 *			goto top;
 *		}
 *		dmu_tx_abort(tx);	// abort DMU tx
 *		ZFS_EXIT(zfsvfs);	// finished in zfs
 *		return (error);		// really out of space
 *	}
 *	error = do_real_work();		// do whatever this VOP does
 *	if (error == 0)
 *		zfs_log_*(...);		// on success, make ZIL entry
 *	dmu_tx_commit(tx);		// commit DMU tx -- error or not
 *	rw_exit(...);			// drop locks
 *	zfs_dirent_unlock(dl);		// unlock directory entry
 *	VN_RELE(...);			// release held vnodes
 *	zil_commit(zilog, seq, foid);	// synchronous when necessary
 *	ZFS_EXIT(zfsvfs);		// finished in zfs
 *	return (error);			// done, report error
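To follow the control flow of that comment, here is a small user-space sketch of the same pattern: take the locks, try to assign the transaction without waiting, and on a “restart” error drop everything, wait, and go back to the top. It is only an illustration: `ToyPool`, `vnode_op`, the string error codes, and the `synchronous` flag are all assumptions, and nothing here calls real ZFS functions.

```python
ERESTART = "ERESTART"    # toy error codes, not the kernel's errno values
ENOSPC = "ENOSPC"


class ToyPool:
    """Toy pool whose tx assignment can be busy for a few rounds, or full."""

    def __init__(self, busy_rounds=0, full=False):
        self.busy_rounds = busy_rounds
        self.full = full
        self.log = []                      # records each step, in order

    def tx_assign(self):
        if self.full:
            return ENOSPC
        if self.busy_rounds > 0:
            return ERESTART
        return None                        # assigned successfully

    def tx_wait(self):
        self.busy_rounds -= 1              # pretend one transaction group went by


def vnode_op(pool, do_real_work, synchronous=False):
    while True:                            # plays the role of the "top:" label
        pool.log.append("lock")
        error = pool.tx_assign()           # try to assign, without blocking
        if error is None:
            break
        pool.log.append("unlock")          # drop locks before waiting or failing
        if error == ERESTART:
            pool.tx_wait()
            pool.log.append("abort")
            continue                       # "goto top"
        pool.log.append("abort")
        return error                       # really out of space
    error = do_real_work()
    if error is None:
        pool.log.append("zil-entry")       # on success, make the intent log entry
    pool.log.append("commit")              # commit the tx, error or not
    pool.log.append("unlock")
    if synchronous:
        pool.log.append("zil-commit")      # wait for the intent log if needed
    return error


busy = ToyPool(busy_rounds=2)
busy_result = vnode_op(busy, lambda: None, synchronous=True)
full = ToyPool(full=True)
full_result = vnode_op(full, lambda: None)
print(busy_result, busy.log.count("lock"), full_result)
```

Note that the locks are always released before waiting, so a transaction that cannot be assigned right now never blocks other operations while holding them.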

Sorry to state the obvious, but as I think we will see more components in the same layer as ZPL and ZVOL in the future, we need to know that logical consistency will not be handed to us. We can create a filesystem on top of the DMU that needs fsck, for example, just as we can have an improper service on top of the ZPL by disabling the ZIL.

Then we see that we have many levels of consistency, and each layer is responsible for one. As an example of another level, now from an application point of view, the ZPL implements the ZFS intent log to guarantee synchronous semantics for application requests and provide a reliable service. This has nothing to do with the SPA or DMU; it is ZPL business.

And even with all the concepts and techniques implemented in those layers to provide a good service, all of that would be nothing without proper use.

The allocation, deallocation, striping, redundancy, and everything else related to disk blocks is handled by the SPA component. This is the bottom layer of ZFS, the one that talks to the device drivers directly. The SPA has an important role in ZFS performance, because even without knowing about copy-on-write itself, it needs to be efficient in its block allocation strategy: the DMU, which uses copy-on-write, needs good amounts of contiguous free space to perform well (the SPA uses a derivative of the slab allocator algorithm for this purpose).

FSCK RIP

Blogs and Feeds

Hello there…

Today I was doing something that had been on my to-do list for a long time. When I started to blog, my intention was not (just) to talk about technical stuff, but to write about everything I like or do not like: music, movies, politics, and a lot of things that I’m sure no one else is interested in reading. I really like to have a good conversation about just about everything… and my blog was my partner when nobody else had the patience to listen (most of the time). ;-)

Well, for some reason, I started to write more about technical stuff than about all the things that are really nice for me, like: F1, Sport Club Internacional, music [1][2][3][...], movies, politics, Lost, sports, etc. And on the other hand, I wanted to create another space for a web site, more professional and with more serious content.

It’s been a while since I started to blog, and there are some crazy sites that aggregate the general content of my blog, and most of them are specialized sites. So, when I write my posts about football (I really like it, I would not be a Brazilian if I did not), it’s not fair to have unrelated content on those sites.

So, now I think I have organized these things… I updated my WordPress blog, here you can see my site, and here are the main aggregated subjects of my blog (so if you are interested, you can subscribe to just the feeds you really want):

http://feeds.feedburner.com/LealsBlog/OpenSolaris

http://feeds.feedburner.com/LealsBlog/Linux

http://feeds.feedburner.com/LealsBlog/NFS

http://feeds.feedburner.com/LealsBlog/OHAC

http://feeds.feedburner.com/LealsBlog/Storage

http://feeds.feedburner.com/LealsBlog/Ubuntu

http://feeds.feedburner.com/LealsBlog/ZFS

And if you want all this *******, you can have it here:

http://feeds.feedburner.com/LealsBlog

I will try to contact the main sites that aggregate my blog (thanks, Google Analytics) and send them the feed I think is most appropriate. For me this is really good, and I think now I will feel better, more comfortable, when I want to write about non-technical stuff. Expect the worst.

peace

Oracle Open Storage Forum 2010/SP (Pictures)







Oracle Open Storage Forum 2010/RJ (Pictures)








Oracle Open Storage Forum

These days I was invited to give a presentation about Open Storage and ZFS at two Oracle/Sun events here in Brazil: one in Rio de Janeiro, and another here in São Paulo. Well, it was very interesting, first because I lived in Rio de Janeiro (1998/1999), and it is a very, very beautiful place! I got to be at Copacabana for the Réveillon. This was my first time back, and it was a really good feeling to set foot (again) in that wonderful city.
Second, I could see some old friends from Sun at both events, and talk a lot about many things. And yes, I talked a lot about all the doubts we have in the OpenSolaris community, and this whole cloud we have above us. Obviously, no answers. But, anyway… it is good to talk with somebody from inside.
I’m not participating in the discussions on opensolaris.org, because I’m tired of being ignored. I’m not posting, but I’m reading, and I have seen disrespect and other people being ignored. We can see many developers who were not talking some time ago, and I think that was because they simply do not like Open Source. And now it seems like they are smiling, and pretty active on the mailing lists. Well, I’m the guy from outside; they work for Oracle/Sun, they build the software. That’s OK with me. I will not try to be part of something where I’m not welcome. If Oracle/Sun just wants users, it’s easier for everyone.


Oracle Open Storage Forum 2010



That said, in my presentation I talked as a client. And that is a point I think some people at Oracle are not thinking about. We have petabytes of storage on ZFS. Probably the largest OpenSolaris/ZFS installation in Latin America. We have 7410 storage systems here, and that is not because some vendor came here and “sold” them to us. We use OpenSolaris/ZFS because we know ZFS, and because we were using the hybrid storage model of ZFS before the Oracle/Sun 7410 product. We are an example of voluntary marketing, and I was there talking about the beauty of OpenSolaris/ZFS because I know OpenSolaris/ZFS. I cannot say the same about WAFL and NetApp.
So, I talked about all the goodness of ZFS, and as a client, I think Oracle has a great advantage over its competitors: the OpenSolaris project. I’m not buying magic from EMC Matrix Storage (which is really good, and I was a happy user some time ago), or other magic from Hitachi. I’m buying a technology I used to participate in, where I used to send emails to people I knew, and even received replies. The open model of OpenSolaris/ZFS is the way for Oracle/Sun to enter a storage market that is already dominated. The unified model of Fishworks seems to be the natural direction for storage. But that is just my opinion, first as a community member, and second as a client.
If Oracle/Sun prefers to come to our door like EMC, Hitachi, or any other closed/black-box hardware company, good, I can play that game. I’m used to that market, and it’s simple to add Oracle/Sun to that “cloud”.

In the end, they were interesting events with interesting people, and I really like to talk about such a great piece of software. The Oracle/Sun Storage team in Brazil really knows the potential of ZFS, and it’s easy to show the value of this technology.
At each event I gave one book as a gift to one person, and I hope they enjoy it.
peace

Qualified!!