Solaris 10/OpenSolaris have many new features, and users coming from GNU/Linux, or even from older versions of Solaris, often don't know about some of them. We have heard a lot about ZFS and DTrace, for example, but there is much more!
In my previous post, A contract between OpenSolaris and GNU/Linux users, I talked about another great feature: SMF.
Today we will see a little tip about FMA (Fault Management Architecture). If you think what I will post here is all there is to FMA, or if you think this facility is not “as nice as ZFS or DTrace”, think again!
FMA is a complex (for the developers ;), enterprise-grade fault management architecture, open source and available to use and learn from in the OpenSolaris OS. Take a look at this simple example: we can monitor hardware faults on our server just by issuing a simple command:

# fmdump 
TIME                 UUID                                 SUNW-MSG-ID
Apr 01 18:57:21.0621 c5541e9b-ddb1-c214-c2af-e3d358ae1a8e ZFS-8000-FD

As you can see in the output above, there is one fault with the message ID ZFS-8000-FD. All errors reported by the FMA facility have a message ID, and you can look any of them up on the Sun message site. If you look specifically for the msg ID ZFS-8000-FD, you will see that the description for that message is:

The number of I/O errors associated with a ZFS device exceeded acceptable levels.

and the FMA subsystem has taken some action for us (Automated Response):

The device has been offlined and marked as faulted. An attempt will be made to activate a hot spare if available.

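As an aside, a hot spare can be attached to a pool ahead of time; a minimal sketch, where the device name c0t6d0 is just a hypothetical example:

# zpool add test spare c0t6d0
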
Well, we do not have a hot spare in this test system, so our pool must be in a degraded state. Let's look:

# zpool status
  pool: test
 state: DEGRADED
 scrub: resilver completed after 0h1m with 0 errors on Thu Nov 20 18:03:58 2008
config:

        NAME         STATE     READ WRITE CKSUM
        test         DEGRADED     0     0     0
          mirror     DEGRADED     0     0     0
            c0t2d0   REMOVED      0     0     0
            c0t3d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c0t5d0   ONLINE       0     0     0

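To get the pool healthy again, we swap the failed disk and let ZFS resilver. A minimal sketch, assuming the replacement shows up under the same name (c0t2d0):

# zpool replace test c0t2d0

(If the new disk gets a different name, it is passed as a second device argument to zpool replace.)
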
So, after changing the faulted disk and seeing our pool in good shape again, we must inform the FMA subsystem about our action:

# fmadm repair c5541e9b-ddb1-c214-c2af-e3d358ae1a8e
TIME                 UUID                                 SUNW-MSG-ID
Apr 01 18:57:21.0621 c5541e9b-ddb1-c214-c2af-e3d358ae1a8e ZFS-8000-FD
Apr 01 19:30:20.0122 c5541e9b-ddb1-c214-c2af-e3d358ae1a8e FMD-8000-4M
Repaired

We can look up that msg ID (FMD-8000-4M) on the Sun site to see its description:

All faults associated with an event id have been addressed. 

and the automated response:

Some system components offlined because of the original fault may have been brought back online. 

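To double-check, we can ask the fault manager which resources it still believes are faulty; on my systems this prints nothing when everything is healthy:

# fmadm faulty
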
I think it is good practice (just my opinion, not a recommendation) to rotate the log after that:

# fmadm rotate fltlog
fmadm: fltlog has been rotated out and can now be archived

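Rotating does not throw the data away: fmdump accepts a log file as an argument, so the rotated file (fltlog.0 on my system; the exact name depends on the logadm configuration) can still be inspected:

# fmdump /var/fm/fmd/fltlog.0
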
So…

# fmdump 
TIME                 UUID                                 SUNW-MSG-ID
fmdump: /var/fm/fmd/fltlog is empty

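Note that this only rotated the fault log; the raw error telemetry lives in a separate log (errlog), which fmdump can show with the -e option:

# fmdump -e
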
OK, you say, but I can see errors in a ZFS pool using the “zpool status” command, no big deal… but keep in mind that FMA is a facility for the whole system. Tell me, how would you see this error:

# fmdump 
TIME                 UUID                                 SUNW-MSG-ID
Apr 01 00:23:54.7851 79e662bf-d4fb-771e-8553-9876ca7912c5 INTEL-8001-94

If you look at the description for that error, you will see this:

This message indicates that the Solaris Fault Manager has received reports of single bit correctable errors from a Memory Module at a rate exceeding acceptable levels, and a Memory Module fault has been diagnosed. No data has been lost, and pages of the affected Memory Module are being retired as errors are encountered. The recommended service action for this event is to schedule replacement of the affected Memory Module at the earliest possible convenience. The errors are correctable in nature so they do not present an immediate threat to system availability, however they may be an indication of an impending uncorrectable failure mode. Use 'fmadm faulty' to identify the dimm to replace.

And there is more:

# fmdump -v -u 79e662bf-d4fb-771e-8553-9876ca7912c5
TIME                 UUID                                 SUNW-MSG-ID
Apr 01 00:23:54.7851 79e662bf-d4fb-771e-8553-9876ca7912c5 INTEL-8001-94
  100%  fault.memory.intel.dimm_ce

        Problem in: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=testserver:serial=02020208170712cf50:revision=C1/motherboard=0/memory-controller=0/dram-channel=1/dimm=1/rank=3
           Affects: mem:///motherboard=0/memory-controller=0/dram-channel=1/dimm=1/rank=3
               FRU: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=testserver:serial=02020208170712cf50:revision=C1/motherboard=0/memory-controller=0/dram-channel=1/dimm=1
          Location: DIMM2B

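And once the faulty DIMM has been replaced, we close the loop just as we did in the ZFS case, telling FMA about the repair:

# fmadm repair 79e662bf-d4fb-771e-8553-9876ca7912c5
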
So, show some respect for FMA! ;-)
peace.