Computing Science, posix rules, life rules, no rules…
16:23 - Fri 3 September, 2010

Performance II

Picture from joeldyke.com


In my last post about “Performance”, I talked about a ZFS tuning parameter: zfs_prefetch_disable. That one was a ZFS read parameter…
In this post, we will take a look at another one, with the same “water to wine” effect. This time the focus is on writes…
We make a set of statistics available to our clients on our storage solution, so a lot of performance-related information can be verified. For some reason, a few storage systems were showing high latency for write requests. And we were not talking about that many write requests…
Well, writes were supposed to be served by the slog device, and so should be very fast. Indeed, we were hitting the SSDs and the latency was good at that stage. But the data must go to the disks (really?), and there things (asvc_t) were not so good (+100ms, +150ms, +200ms) during spa_sync. And with that “low” performance, we were having commits much more frequently than every 30 seconds.
The next step was to look at the workload and see if we were talking about a really random write workload. Using the iopattern DTrace script, we could see that during spa_sync we were facing 99% random writes, and so not writing much to the disks (90MB/s, 100MB/s for the whole pool). The question was: what is random for iopattern? ;-)
Actually, I wanted to know if the size of the seek was considered, and the answer was (as always) in the manual (the D script itself):

#  An event is considered random when the heads seek.
#  This program prints the percentage of events
#  that are random. The size of the seek is not
#  measured - it's either random or not.


RTFM and answered.
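To make that definition concrete, here is a minimal sketch in C (not the D script itself). It assumes an I/O counts as sequential only when it starts exactly where the previous I/O on the same device ended; everything else is random, no matter how short the seek. The io_event struct and the random_percent function are hypothetical names, just for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one completed disk I/O (names are illustrative). */
struct io_event {
    uint64_t blkno;   /* starting block, in 512-byte sectors */
    uint64_t bcount;  /* size of the transfer, in bytes */
};

/*
 * Return the percentage of events that are "random": an event is random
 * when it does not start exactly where the previous one ended. The size
 * of the seek is ignored; a one-sector gap counts the same as a full
 * stroke. The first event has no predecessor and is counted as random.
 */
double random_percent(const struct io_event *ev, size_t n)
{
    size_t random = 0;
    uint64_t expected = 0;
    bool have_prev = false;

    if (n == 0)
        return 0.0;

    for (size_t i = 0; i < n; i++) {
        if (!have_prev || ev[i].blkno != expected)
            random++;
        /* Where a strictly sequential next I/O would have to start. */
        expected = ev[i].blkno + ev[i].bcount / 512;
        have_prev = true;
    }
    return 100.0 * (double)random / (double)n;
}
```

With this rule, four 4KB writes at sectors 0, 8, 16 and 24 come out as 25% random (only the first one, which has no predecessor), while the same writes at sectors 0, 9, 18 and 27 come out as 100% random. So 99% random does not necessarily mean the heads were crossing the whole disk.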
One possible answer could be ZFS fragmentation, but just 10% of the pool was used.
We did not have the answer to the problem yet, but we needed to consider what we had (some good points):
– A few NFS writes per second;
– The slog device working just fine (good latencies);
– A short interval between spa_syncs (less than 10 seconds);
– 99% random writes during spa_sync;
– High latency on the disks (+100ms asvc_t);
– Pool almost empty (10% used);
– A few big files being updated (something like a hundred).

My bet here was that the system was complicating itself over simple work (I do that most of the time too). The point is that we had (again) a busy disk, but for the wrong reason. That reminded me of a great blog post about I/O reordering and disk queue depth. And an important phrase:

“The disk IO sub-systems are built to provide maximum throughput”…

From that post we can get this useful script too (sd for Intel):

/* Record the time each buf enters the sd driver. */
fbt:sd:sdstrategy:entry
{
        start[(struct buf *)arg0] = timestamp;
}

/* On the interrupt, look up the buf behind the SCSI packet. */
fbt:sd:sdintr:entry
/ start[(this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private)] != 0 /
{
        this->un = ((struct sd_xbuf *) this->buf->b_private)->xb_un;
        /* Elapsed time in ms, aggregated per device (sd unit). */
        @[this->un] = lquantize((timestamp - start[this->buf])/1000000,
            60000, 600000, 60000);
        @q[this->un] = quantize((timestamp - start[this->buf])/1000000);

        start[this->buf] = 0;
}

Running that on a storage system with the problem, we could see thousands of operations that took ~200ms. A really bad scenario…
And a really important piece of information (here you get to know the ZFS tunable): the default zfs_vdev_max_pending on these storage systems was the old 35. To understand the whole story about the LUN queue, there is another great post here.
So let’s stop talking and see if my guess is right. To do that, the simple procedure is to limit the queue to 2, so we know there will be just one active I/O and another waiting (without reordering of any sort). If that changes the results, we are heading in the right direction; if not, back to the lab.

 echo zfs_vdev_max_pending/W0t2 | mdb -kw
zfs_vdev_max_pending:           0x23             =       0x2

Let’s look at iostat… and… Bingo! Latency down to a few ms, and running the above DTrace script again, all operations took something like 5ms, 10ms!
So, that was just to see if we were heading in the right direction, but changing something from 35 to 2 is radical. So let’s set it to 10 (the current value in the ZFS implementation) and see if we get good performance as well…

 echo zfs_vdev_max_pending/W0t10 | mdb -kw
zfs_vdev_max_pending:           0x2             =       0xa

Good! Not as good as 0x2 (with zfs_vdev_max_pending=10 we saw times like 30ms), but now we need some time to understand the whole impact (e.g. on reads) and configure a good (definitive) number for this workload.
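A back-of-the-envelope sketch of why the queue depth can matter so much. It assumes, as a simplification, that the disk completes one request at a time, that each random I/O costs about the same, and that the device queue is always full, so a new request waits for everything ahead of it. The 5 ms service time is an assumed figure, not one measured on these storage systems, and the function name is hypothetical:

```c
/*
 * Worst-case time (in ms) for a request entering a full device queue,
 * under the simplifying assumptions that the disk completes one request
 * at a time and every request costs the same service_ms.
 * Hypothetical helper, for illustration only.
 */
double worst_case_latency_ms(unsigned queue_depth, double service_ms)
{
    return (double)queue_depth * service_ms;
}
```

With the assumed 5 ms per I/O, a depth of 35 gives 175 ms, 10 gives 50 ms and 2 gives 10 ms. These are the same order of magnitude as the latencies seen above, which is at least consistent with the queue being the problem.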

OBS: So, one more DTrace script for the utilities folder, and here is another one.
peace

Ganglia on OpenSolaris

As system administrators we need some essential information about our servers as a minimum requirement for our job, to identify patterns and learn about trends in our workload. And when we need to use DTrace or MDB, or need to understand an FMA ereport… we do not have much time. Actually, I want to post some notes about the last one in a future blog entry.
I think the big problem with these tools is that we, as sysadmins, do not use them daily the way a developer would. So we are not debugging with these tools all the time. Which is good! So I think the solution is to develop some tools/scripts to make our life easier (like the DTrace Toolkit, NFS Block Size Monitor, dcmds for MDB), and to have some scripts to parse ereports in an easy way. Well, but that is for future posts…
In the old days, managing Solaris servers, I used Orca to gather the crucial, must-have performance information about the servers. It was pretty simple, extensible, and made for Solaris. The first thing I did was install Orca on the Solaris servers and “see the big picture”. But those were the old days…
In the transition to OpenSolaris I tried to use Zabbix for this job, and as you can imagine it did not go so well. Not because of Zabbix, which is a really fantastic tool (when used for the right job). Actually, I was thinking of using Zabbix for other things too, but it brought too much complexity, and the better approach was to divide the administration tasks into specific areas. We have a homebrew administration system for our storage business, so for standard information like CPU, network, memory, etc., we just needed a tool to be the right replacement for the handy old Orca. The answer was: Ganglia (Wikipedia: ganglion).
Well, there is no package for OpenSolaris (OpenSolaris???)… so I will put here some notes on how I installed it on OpenSolaris, and some tips so you do not waste much time on it.
Simple things first… you will need to install SUNWapr13.

# pkg install SUNWapr13

Second, you will need libconfuse (what a name)… what I needed here was to compile it with a specific configure option to create the shared library. Without it, the standard build did not create the shared library, just the static one (.a). There is a note about it on the Ganglia website:

# cd confuse-2.7/
# ./configure --enable-shared
# make && make install

After that you can compile the Ganglia software. The tip here is to use a specific LDFLAGS in the configure step. Without it, the software failed at run time. I tried the --with-libapr option with the absolute path to apr-1-config, without luck. So, as we need things working…:

# LDFLAGS="-R/usr/apr/1.3/lib/" ./configure --with-libconfuse=/usr/local/ --enable-gexec --sysconfdir=/etc/ganglia
# make && make install


The above configure line will configure and install the Ganglia monitor software under “/usr/”. It is one of the few packages whose installation does not default to “/usr/local”. Not so good… if you want to change that, you can do it on the configure line.

I’m assuming that you want just the monitor part on your OpenSolaris machines, without gmetad, because you have it on another system (you only need gmetad on the system where you will centralize the data). If you want to install gmetad on an OpenSolaris system, you will need to append the option “--with-gmetad” to the configure line.
You can create the configuration file gmond needs using gmond itself:

# gmond --default_config > /etc/ganglia/gmond.conf

And to get it up and running, the configuration can be as simple as this:
– In the gmond cluster section, change the lines as you wish…

 cluster {
  name = "Servers"
  owner = "Company"
  latlong = "unspecified"
  url = "unspecified"
}

I used the udp_send_channel, so it was just a matter of uncommenting the bind_hostname line and setting the gmetad host:

udp_send_channel {
  bind_hostname = yes
  host = gmetadserver
  port = 8649
  ttl = 1
}

That’s it. This same file can be used for all your servers, and obviously you can customize it as you like! But with these few settings, you will have all your hosts working with all the essential performance monitoring (just like the old Orca ;-).
At the beginning of the gmond.conf file there are some handy generic parameters:

 globals {
  daemonize = yes
  setuid = yes
  user = ganglia
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  allow_extra_data = yes
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  send_metadata_interval = 0 /*secs */
}

I created a ganglia user to run this software on OpenSolaris, so we can use all the RBAC features with it. But you could use the standard nobody user, for example, and it should work just fine. You can change the daemonize option to no so the gmond process stays in the foreground. Or you can just start gmond with an option like “-d 2”, and the process will automatically stay in the foreground with many useful debug messages. You can test gmond on your machines using a simple telnet command like:

# telnet localhost 8649

That command should produce a lot of information in XML format. Cool!
Finally, Ganglia has a powerful gexec feature that I’m not going to cover in this post, but you can enable it just by changing the gexec line to “yes”.
After you start Ganglia on your servers, you can use gstat on the gmetad server to see them:

 gstat -a | head -7
CLUSTER INFORMATION
       Name: Servers
      Hosts: 380
Gexec Hosts: 0
 Dead Hosts: 0
  Localtime: Wed Aug 25 11:56:42 2010

We do not have an open repository yet, but we are thinking of creating one soon. So, if you want to create a package, here is the .ips I created for the Ganglia monitor and the binaries and libraries it needs:

set name=pkg.name            value="Ganglia"
set name=pkg.description     value="Ganglia Monitor"
dir mode=0755 owner=root group=bin  path=/lib
dir mode=0755 owner=root group=bin  path=/lib/svc
dir mode=0755 owner=root group=bin  path=/lib/svc/method
dir mode=0755 owner=root group=sys  path=/usr
dir mode=0755 owner=root group=bin  path=/usr/lib
dir mode=0755 owner=root group=sys  path=/usr/lib/ganglia
dir mode=0755 owner=root group=sys  path=/usr/lib/ganglia/python_modules
dir mode=0755 owner=root group=bin  path=/usr/sbin
dir mode=0755 owner=root group=bin  path=/usr/bin
dir mode=0755 owner=root group=sys  path=/etc
dir mode=0755 owner=root group=root path=/etc/ganglia
dir mode=0755 owner=root group=root path=/etc/ganglia/conf.d
dir mode=0755 owner=root group=sys  path=/var
dir mode=0755 owner=root group=sys  path=/var/svc
dir mode=0755 owner=root group=sys  path=/var/svc/manifest
dir mode=0755 owner=root group=sys  path=/var/svc/manifest/network
file usr/sbin/gmond mode=0755 owner=root group=root path=/usr/sbin/gmond
file usr/bin/gstat mode=0755 owner=root group=root path=/usr/bin/gstat
file usr/bin/gmetric mode=0755 owner=root group=root path=/usr/bin/gmetric
file lib/svc/method/ganglia mode=0755 owner=root group=root path=/lib/svc/method/ganglia
file etc/ganglia/gmond.conf mode=0644 owner=root group=root path=/etc/ganglia/gmond.conf
file etc/ganglia/conf.d/modpython.conf mode=0644 owner=root group=root path=/etc/ganglia/conf.d/modpython.conf
file var/svc/manifest/network/ganglia.xml mode=0644 owner=root group=root path=/var/svc/manifest/network/ganglia.xml
file usr/lib/ganglia/modmulticpu.so mode=0644 owner=root group=bin path=/usr/lib/ganglia/modmulticpu.so
file usr/lib/ganglia/modcpu.so mode=0644 owner=root group=bin path=/usr/lib/ganglia/modcpu.so
file usr/lib/ganglia/modsys.so mode=0644 owner=root group=bin path=/usr/lib/ganglia/modsys.so
file usr/lib/ganglia/modmem.so mode=0644 owner=root group=bin path=/usr/lib/ganglia/modmem.so
file usr/lib/ganglia/modnet.so mode=0644 owner=root group=bin path=/usr/lib/ganglia/modnet.so
file usr/lib/ganglia/modpython.so mode=0644 owner=root group=bin path=/usr/lib/ganglia/modpython.so
file usr/lib/ganglia/modload.so mode=0644 owner=root group=bin path=/usr/lib/ganglia/modload.so
file usr/lib/ganglia/moddisk.so mode=0644 owner=root group=bin path=/usr/lib/ganglia/moddisk.so
file usr/lib/ganglia/modproc.so mode=0644 owner=root group=bin path=/usr/lib/ganglia/modproc.so
file usr/lib/libganglia-3.1.7.so.0.0.0 mode=0644 owner=root group=bin path=/usr/lib/libganglia-3.1.7.so.0.0.0
file usr/lib/libganglia.a mode=0644 owner=root group=bin path=/usr/lib/libganglia.a
file usr/lib/libganglia.la mode=0644 owner=root group=bin path=/usr/lib/libganglia.la
link mode=0555 owner=root group=bin path=/usr/lib/libganglia-3.1.7.so.0 target=libganglia-3.1.7.so.0.0.0
link mode=0555 owner=root group=bin path=/usr/lib/libganglia.so target=libganglia-3.1.7.so.0.0.0

And here you can get the XML file for the SMF service, and the start method as well. I hope it is useful for you…
peace

America is red again!!!

PS3 (KEM-410 ACA)

Transformers


The kids are happy again… after a few days without playing FIFA, GOW III, or GTA IV, the PS3 is finally working. It was a long story that I will make short, to maybe help others with the same problem. It seems to be a pretty common scenario for the first PS3 models, the old ones… the console just stops reading Blu-ray discs, but everything else keeps working just fine.
I did a little search on the internet to see if that was a common case, and found a lot of information and two possible problems:
Software (the loader having problems after an upgrade or something else):
I saw many cases like this where a simple reset of the console fixed the problem. Actually, I discovered many commands and hidden features of the console while trying to fix it. But no luck, that was not my case…
Hardware (the optical reader dies):
Here we need to replace it, and the saga starts for me…

I called the Sony representative here in São Paulo, and actually there are many who do the service. But… US$175.00 for the component plus US$125.00 for the service itself. Something like US$300.00, and I was thinking of buying another one and using the old one as a computer for my little kid (now two years old).
I tried to find out the version of the component without opening the console, using some kind of external serial number, but could not find out for sure (there are many models, and one does not work in place of another). Then I opened the PS3, and like in the videos on the internet, it is pretty simple. I did some tests with the console open, and it seemed like the ‘motor’ that should make the disc spin was not working. But it could be a problem with the laser too, so to be sure of the fix I needed to replace the whole component: KEM-410 ACA.
The first option was to buy the KEM part from the Sony representative and do the service myself. So I called the Sony service to try to buy just the component. They told me they could not sell it unless I bought the service too (Oracle making friends ;-).
BTW, I was a great fan of Nintendo, which makes really robust consoles and used to be the champion in games (Ocarina of Time, a classic). It lost ground to Sony exactly because of the open specs and the developer support that Sony created.
In my experience the myth is true: the Sony console is not as robust as the Nintendo ones. I had no problems with the Nintendo 64, GameCube or Nintendo Wii.
So, let’s do like Will Smith in The Pursuit of Happyness, and fix the scanner! ;-)
Well, if Optimus Prime used eBay to save the world, I could use it to fix a broken PS3 console too.
Then, last week I bought the KEM-410 ACA component on eBay (US$79.00), received it yesterday, and the console is working like a charm! Plus, I gained some extra time before having to buy new games, because I reset the console while trying to fix it… so all the game progress was reset too… my kids will need to play all the games again. I played a little (MGS4 and GT Prologue), nothing serious…
As always, in the end I still have one screw left on my desk… Manufacturers… always using more than necessary…
peace

Arithmetic exception

The One


Ok, if you are a noob C programmer, you will find this post really fun; seriously, it will make you LOL. If you are a noob, a really noob “XYZ” programmer, maybe you will find it helpful… maybe… no. You will LOL too…
So, one more time we have a core file sitting around on our servers. That is serious stuff, and we could have a really big problem and a difficult debugging task ahead… a good opportunity to blog about it and learn new things about gdb/mdb, dtrace… cool! Let’s start soft: gdb.
Something simple like gdb --core, so I could see which program generated it:

...
Core was generated by `/usr/local/bin/myprogram'.
...

… SILENCE…
Hmmm, that is a ten-line program… signal 8? Arithmetic exception… better not blog about this. ;-)
Let’s recompile myprogram with options that make life easier for newbies like me…

gcc -g -lm -lumem -o myprogram myprogram.c

Ok, now let’s see it again (gdb /usr/local/bin/myprogram --core=newcore):

GNU gdb 6.3.50_2004-11-23-cvs
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-pc-solaris2.11"...
Core was generated by `/usr/local/bin/myprogram'.
Program terminated with signal 8, Arithmetic exception.
Reading symbols from /lib/libumem.so.1...done.
Loaded symbols for /lib/libumem.so.1
Reading symbols from /lib/libm.so.2...done.
Loaded symbols for /lib/libm.so.2
Reading symbols from /lib/libc.so.1...done.
Loaded symbols for /lib/libc.so.1
#0  0x08051854 in main (argc=1, argv=0x8047e8c) at myprogram.c:144
144            variance = somae / (*ptw - 1);
(gdb)

The more I look, the more disappointed I am with myself (vi myprogram.c)…

(gdb) print variance
$6 = 4278169904
(gdb) print somae
$7 = 0
(gdb) print *ptw
$8 = 1

You did not initialize the variance, “Neo”… so it’s pointing to Zion or maybe to the Matrix itself…
Ok, the “Chosen One” is making a division by zero (*ptw - 1 = 0). We have a test to see if *ptw is greater than zero, and actually it is, but the code calculates the average and the standard deviation, Morpheus, and for that we need at least two values (that’s enough to realize we need another savior, or we are lost).
The *ptw is a pointer to the total writes, and we have a dataset that had just one write in a 180-second period (something interesting at least, a really idle “log” share, topic for another post).
So, it was a matter of changing the code from if (*ptw > 0) to if (*ptw > 1), and adding an else branch to assign the actual write value to the average latency (and zero to the standard deviation). And doing the same for *ptr too (reads)…
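A minimal sketch of that fix in C. This is not the original myprogram.c: the function and parameter names are hypothetical, and it uses doubles instead of whatever types the real program has. The point is only the guard around the n - 1 divisor:

```c
#include <stddef.h>

/*
 * Hypothetical helper, not the original myprogram.c: compute the average
 * and the sample variance of n latency values.
 * The n - 1 divisor is only used when n > 1. With a single value the
 * average is that value and the variance is zero; with no values both
 * results are zero.
 */
void latency_stats(const double *lat, size_t n, double *avg, double *variance)
{
    double sum = 0.0, sq = 0.0;

    *avg = 0.0;
    *variance = 0.0;
    if (n == 0)
        return;

    for (size_t i = 0; i < n; i++)
        sum += lat[i];
    *avg = sum / (double)n;

    if (n > 1) {
        /* Sum of squared deviations from the average. */
        for (size_t i = 0; i < n; i++)
            sq += (lat[i] - *avg) * (lat[i] - *avg);
        *variance = sq / (double)(n - 1);
    }
}
```

The standard deviation is then the square root of the variance. With an integer divisor of zero the original expression ends in signal 8; here a single sample simply yields the sample itself as the average and a variance of zero.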
Ok, stop complaining and do the right thing… Cypher…
peace

Senna

How did I miss this video? Fantastic! I need to put it here…

Qualified!!!

Colorado!!!

Colorado has qualified!



Match highlights:

Music? They are fast…

Do you need L2ARC redundancy? I do…

Hello there…
I think you agree that the storage problem is the READ requests, synchronous by nature. And, as I have said many times before, I think the solution to all problems (the answer to all questions ;-) is cache. Many levels, many flavors.
I have read many times about the recommended redundancy for ZFS slog devices. In the past, in the early days of ZFS, we had a serious availability problem if we lost a slog device. So, mirroring was the way to go.
But thinking about the ZIL concept, it takes two failures in a row for a mirrored slog device to make sense (the slog has to die and the system has to crash before the data reaches the pool). And we will lose many IOPS doing so…
On the other hand, the MTTR of a slog device is the best, compared with a regular vdev or an L2ARC.
Everything will be fine the moment you replace the slog device (e.g. an SSD).
And the L2ARC? Here we need time… and believe me, it can be a long time.
We configure a 100GB SSD device, delivering a lot of IOPS and very good latencies… and crash! We lose it!
Do you think the applications will be happy with SATA latencies? Will we have a performance problem or an availability problem? Or will we have no problem at all?
Well, as I said at the beginning of this post, no one seems to think that the failure of a warmed L2ARC device is a big deal. I would like to agree, but I don’t. And as I really like the ZFS architecture, I would guess the vdev concept for redundancy should be independent of the physical vdev. So, we could mirror the L2ARC… but no, we can’t.
So, I can understand that I may be the only one who thought about mirroring a cache device, but the fact that we cannot create a mirror (logical vdev) from a physical vdev seems like a ZFS bug.
peace

2010 World Rally Championship