MTTR is going up, did you know?
Some time ago I wrote about caches, and about how little (or no) difference there is between poor performance and unavailability.
I still think a cache is the solution to all problems, and it can be the root of all evil too.
Today I was reading about the problems with 37signals’ Campfire service, and what reminded me of that old post was this:
“The current issue is that both our database servers (slave and master) have been rebooted. Which have left us with a cold cache, so new traffic overwhelms the servers. We’re working on warming the caches and will bring back Campfire as soon as we can”.
Obviously, they had the same problems anyone else may suffer: hardware failures and the sad return to life.
I guess many readers will think that “a lot of things” could have been done better… that “x” is not right, “y” is better, and so on. But my point is that any solution has a trade-off, and they had “two” database servers down!
Taking storage as an example, if we lose both sides of a mirror, we lose the data and need to restore it from backup. If we don’t want this, we need to add more copies, or we need to replicate the whole storage (no cache either). OK, we want the cache replicated too… I think you get the point.
So, after the incident, we can have many solutions for the problem; actually, everyone has the solution for other people’s problems. But there is a price for an infrastructure that has little (or no) performance impact when it goes down and returns to life. And it is a fact that we have a big density of resources going on, and a really big MTTR from zero to hero.
2TB to resilver?
1TB to warm?
How big is your routing table?
How many VMs do you need to start (and how many services on them)?
How much does this “just switch it on and it is up and running again” cost?
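Just to put rough numbers on the first two questions, here is a back-of-the-envelope sketch. It only counts the time needed to move the bytes, and the sustained rates (100 MB/s and 50 MB/s) are made-up values for illustration, not measurements from any real system; the function name `hours_to_move` is mine too.

```python
# Rough recovery-time estimate: how long does it take just to move the bytes?
# All rates below are made-up illustrative numbers, not measurements.

TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def hours_to_move(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to read or write size_bytes at a sustained rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s / 3600


if __name__ == "__main__":
    # Hypothetical sustained rates while the system is also serving traffic.
    resilver_h = hours_to_move(2 * TB, 100 * MB)
    warm_h = hours_to_move(1 * TB, 50 * MB)
    print(f"2TB resilver at 100 MB/s: {resilver_h:.1f} h")
    print(f"1TB cache warm-up at 50 MB/s: {warm_h:.1f} h")
```

With these assumed rates both cases take hours, not minutes, and that is before counting anything else that has to come back.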
Our systems are still “looooooooad and run”, and there is a trade-off, always. There is a price to pay, or a technology to be developed, to solve this problem. I think the better MTTR of ZFS, using COW and no fsck, is clear, but it seems not so clear that the 40% hit ratio difference between down and up is part of this MTTR too.
Filesystem cache, database cache, BGP convergence, VMs coming back to life, you name it!
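To see why a hit ratio difference belongs in the MTTR, here is a minimal sketch of what the backend sees when the cache is cold. The numbers are assumptions for illustration only: I read the 40% above as 40 percentage points (say, 90% warm versus 50% cold) and pick 1000 requests per second out of thin air. The function names are hypothetical as well.

```python
# How much traffic reaches the backend for a given cache hit ratio?
# Hit ratios and request rate are assumed values, not measurements.


def backend_rate(request_rate: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and hit the backend."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return request_rate * (1.0 - hit_ratio)


def miss_amplification(warm_hit_ratio: float, cold_hit_ratio: float) -> float:
    """How many times more backend traffic a cold cache causes than a warm one."""
    warm_miss = 1.0 - warm_hit_ratio
    if warm_miss <= 0:
        raise ValueError("a warm hit ratio of 1.0 leaves no baseline to compare")
    return (1.0 - cold_hit_ratio) / warm_miss


if __name__ == "__main__":
    rate = 1000.0            # assumed requests per second
    warm, cold = 0.90, 0.50  # assumed hit ratios, 40 points apart
    print(f"warm backend load: {backend_rate(rate, warm):.0f} req/s")
    print(f"cold backend load: {backend_rate(rate, cold):.0f} req/s")
    print(f"amplification: {miss_amplification(warm, cold):.1f}x")
```

Under these assumptions the servers are "up" but take about five times their normal load until the cache is warm again, which is exactly the kind of recovery time that does not show up if you only measure the reboot.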
There is no money if everyone wants the cash. Sorry, but if your bank has it, what is your business?