Picture by Ed Schipul

Zabbix is used to monitor many of the world’s biggest environments. But it also monitors its own (albeit small) environment. And this environment recently saw a shark attack.

Early one Friday morning, suspicious sounds were coming from the server room when we arrived at the Zabbix office. It sounded a bit as if a plane was locked in there and trying to break free. Opening the door resulted in… no, not this sound, as the title might have made you guess, and not even in Wiresharks. Instead it released a significant heat wave, which immediately explained the plane-like sound: at that temperature, all fans were working hard to keep the hardware from frying.
It turned out that the electricity supply for the air conditioning unit had failed, and the devices, busily serving various Zabbix resources, were fighting for fresh air.

Here we can see three days of ambient temperature for one server (temperatures for CPU and disks were higher). The temperature starts rising before midnight and keeps rising until a sharp decrease several hours later.

We can see that before the incident the temperature was very stable at 21 degrees, barely deviating from this value. During the non-global warming it jumped to 36 degrees, followed by a very rapid drop lasting about one hour. The drop stops around 6 in the morning, and the temperature doesn’t return to the previous level of 21, instead settling at ~23 degrees. While not significant, there’s a small rise around 9 o’clock to about 24 degrees.
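A pattern like this (a stable baseline followed by a sustained excursion) is the kind of thing a simple threshold check can flag. Below is a minimal sketch in Python, not a Zabbix trigger definition. The function name `find_excursions`, the 5-degree tolerance, and the sample readings are all assumptions made for illustration; the readings only loosely follow the figures quoted above and were not exported from Zabbix.

```python
from statistics import median


def find_excursions(samples, baseline_window=6, tolerance=5.0):
    """Return indices of samples that exceed the baseline by more than `tolerance`.

    The baseline is the median of the first `baseline_window` samples,
    which are assumed to come from normal operation.
    """
    if len(samples) < baseline_window:
        raise ValueError("not enough samples to establish a baseline")
    baseline = median(samples[:baseline_window])
    return [i for i, value in enumerate(samples) if value - baseline > tolerance]


# Illustrative readings in degrees (hypothetical, not real Zabbix data):
# a steady 21, a climb to 36, then a drop that settles slightly above baseline.
readings = [21, 21, 21, 21, 21, 21, 24, 28, 32, 35, 36, 30, 23, 23, 24]

hot = find_excursions(readings)
print(hot)  # → [7, 8, 9, 10, 11]
```

Note that with this tolerance the new ~23 degree level after the incident is not flagged, so a tighter tolerance would be needed to catch a baseline that has quietly shifted.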

If we also look at a one-week graph for one of the virtualisation servers, we can see the same temperature rise, here to 37 degrees (with a very steady 21 before that, same as for the previous server).

This graph does show one weird thing, though. A decent guess would be that fan RPM goes up during the temperature rise. Here it doesn’t: fan RPM instead drops from about 3485 to 3465. If anybody has insight into this unexpected response, please share it with us.

Let’s look at another graph. This one shows UPS data for one day, including output voltage and UPS load. While other values don’t change in response to the temperature increase, UPS load does, and notably so. It goes from 76% to 85% during the temperature increase, with the same sharp drop afterwards.

While some of that can be attributed to increased fan RPM (except on that virtualisation system), there might be something else. Again, if you have experience with monitoring power draw at different temperatures, please share it.
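To put the UPS load change in numbers: going from 76% to 85% is a rise of 9 percentage points, which is roughly a 12% relative increase in load. A small Python sketch of that arithmetic follows; the helper name `load_change` is made up for this example.

```python
def load_change(before_pct, after_pct):
    """Return the change in percentage points and the relative change in percent."""
    points = after_pct - before_pct
    relative = points / before_pct * 100
    return points, relative


# UPS load figures quoted above: 76% before, 85% during the temperature increase.
points, relative = load_change(76, 85)
print(f"{points} percentage points, {relative:.1f}% relative increase")
# → 9 percentage points, 11.8% relative increase
```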

While the graphs are visually interesting, they also show useful data, such as when the temperature started to increase and when it dropped. The puzzle of one system’s fan RPM going down during the temperature increase is very interesting as well. Additionally, this incident gave us some ideas about what additional values could be monitored in the Zabbix environment. Maybe it gave you some new ideas as well?