Zabbix is used to monitor many of the world’s biggest environments. But it also monitors its own (albeit small) environment. And this environment recently saw a shark attack.
Early one Friday morning, arriving at the Zabbix office revealed suspicious sounds coming from the server room. It sounded a bit as if a plane were locked up in there and trying to break free. Opening the door resulted in… no, not this sound as the title might have made you guess, and not even in Wiresharks. Instead it resulted in a significant heat wave, which also immediately explained the plane-like sound: at such a temperature, all the fans were trying really hard to keep the hardware from frying.
It turned out that the electricity supply for the air conditioning unit had died, and the devices, busily serving various Zabbix resources, were fighting for fresh air.
Here we can see three days of ambient temperature readings for one server (temperatures for CPU and disks were higher). The temperature starts rising before midnight and keeps climbing until a sharp decrease several hours later.
We can see that before the incident the temperature was very stable at 21 degrees, barely deviating from this value. During the non-global warming it jumped up to 36 degrees, followed by a very rapid drop lasting about one hour. The drop stops around 6 in the morning, and the temperature doesn’t go back to the previous level of 21, instead sticking to ~23 degrees. While not significant, there’s a small rise around 9 o’clock to about 24 degrees.
If we also look at a graph for one of the virtualisation servers for one week, we can see the same rise of temperature, here to 37 degrees (with a very steady 21 before that, same as for the previous server).
This graph does show one weird thing, though – a decent guess would be that during the temperature rise fan RPM goes up as well. Here it’s not like that – fan RPM drops instead from about 3485 RPM to 3465 RPM. If anybody has insight into this unexpected response, please share that with us.
Let’s look at another graph. This one shows UPS data for one day, including output voltage and UPS load. While other values don’t change in response to the temperature increase, UPS load does, and does notably. It goes from 76% to 85% during the temperature increase period, with the same sharp drop afterwards.
While some of that can be attributed to increased fan RPM (except for that virtualisation system), there might be something else. Again, if you have some experience with power draw monitoring at different temperatures, please share it.
While the graphs are interesting visually, they also show some useful data, like the times when the temperature started to increase and when it dropped. The puzzle of one system’s fan RPM going down during the temperature increase is very interesting as well. Additionally, this incident gave us some ideas on what additional values could be monitored in the Zabbix environment – maybe it also gave some new ideas to you?
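Reading the start and end of the temperature rise off a graph can also be done programmatically. The following is only a minimal Python sketch, not anything taken from the Zabbix setup described here: the function name `find_excursion`, the 3-degree margin, and the sample timestamps and readings are all made up for illustration, loosely following the numbers mentioned above.

```python
def find_excursion(samples, baseline, margin=3.0):
    """Return (start, end) timestamps of the first run of readings that
    exceed baseline + margin. `end` is None if the run never finishes,
    and the whole result is None if no reading exceeds the threshold."""
    start = None
    for timestamp, temperature in samples:
        if temperature > baseline + margin:
            if start is None:
                start = timestamp  # first reading above the threshold
        elif start is not None:
            return start, timestamp  # first reading back under the threshold
    return (start, None) if start is not None else None


# Hypothetical samples, invented for this example only.
samples = [
    ("22:00", 21.0),
    ("23:00", 24.5),
    ("02:00", 33.0),
    ("05:00", 36.0),
    ("06:00", 23.0),
    ("09:00", 24.0),
]

excursion = find_excursion(samples, baseline=21.0)
print(excursion)
```

With a baseline of 21 degrees and a margin of 3, the post-incident level of 23 to 24 degrees counts as back to normal; a tighter margin would treat it as a continuing excursion.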
Ah, that’s what happened… zabbix.com and zabbix.org were down when I was trying to toy around with pre-2.0 on Thursday evening 🙂
Perhaps you can retrieve at what temperature the servers gave up, and have zabbix shut them down before temps get that high? Perhaps an SNMP-aware thermometer?
About the fans: perhaps they’re set to always spin at the same RPM instead of being temperature-controlled? I don’t think that ~3500 RPM is the max they can do. Apart from that: I flunked just about every science course I had over the years, but how about this one: temperature starts rising, materials expand, bearings in the fans have more friction, RPM lowers. Plausible? Busted? Myth confirmed? 🙂
hmm. as far as i know, servers didn’t give up – they continued working. i used jira & forum that morning a bit.
as for the fan rpm theory… doesn’t sound likely to me (i wouldn’t expect the increase to cause such drastic effect), but then i don’t know anything about that area either 😉
Shouldn’t you have triggers and alerts for this? 😉
oh surely there are such things, although maybe not for all items yet. but not always all media types work perfectly 😉
This example is great. When I worked for BMC I was able to take the ProactiveNet components and use these metrics to monitor a very large vendor’s data center that was football fields in size. What was done is that servers were grouped based on the floorspace grid they occupied. Then I created aggregate monitors for temperature, humidity, and power for each grid block. From this I could even create screens that showed the room and how it was performing. We could see the hot and cold zones of the floorspace. Also, with the aggregate metrics we had dynamic thresholding in place to alert when a metric was 5% out of the daily, weekly, and monthly norm.
What I would like to see is this aggregate monitoring put in place against each zone and the room as a whole. Then report back to us. Then maybe look into the weather chart code someone took from Nagios and put in place within Zabbix. Maybe even later incorporate this functionality into the product.
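The grid-block aggregation and 5% deviation alert described in this comment could be sketched roughly as below. This is a hedged Python illustration only, not ProactiveNet or Zabbix code: the helper names `aggregate_by_block` and `out_of_norm`, the server names, the block labels, and all readings are hypothetical, and a single fixed baseline per block stands in for the daily, weekly, and monthly norms mentioned above.

```python
from statistics import mean


def aggregate_by_block(readings, block_of):
    """Average the latest reading of every server within each grid block."""
    per_block = {}
    for server, value in readings.items():
        per_block.setdefault(block_of[server], []).append(value)
    return {block: mean(values) for block, values in per_block.items()}


def out_of_norm(current, baseline, tolerance=0.05):
    """True when `current` deviates from `baseline` by more than `tolerance`,
    where `tolerance` is a fraction of the baseline (0.05 means 5%)."""
    return abs(current - baseline) > tolerance * abs(baseline)


# Hypothetical layout and readings, invented for this example only.
block_of = {"srv-a": "A1", "srv-b": "A1", "srv-c": "B2"}
readings = {"srv-a": 21.0, "srv-b": 37.0, "srv-c": 21.0}
baselines = {"A1": 21.0, "B2": 21.0}  # assumed per-block norm

averages = aggregate_by_block(readings, block_of)
alerts = sorted(
    block for block, average in averages.items()
    if out_of_norm(average, baselines[block])
)
print(alerts)
```

Averaging per block means a single hot server raises the value for its whole zone, which is what makes hot and cold areas of the floor visible; a real setup would compute the baselines from history rather than hard-coding them.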