Historically, issues that might arise when using Zabbix have not been easy to troubleshoot. Not everything that would be useful is always logged. The log level can be increased, but those who still remember the first time they saw what DebugLevel=4 can do will understand why that option mostly helped more advanced users. Besides, changing the log level required a restart of the daemon, which in many cases would obscure the problem or introduce other problems, such as delayed data from proxies in larger installations.
In what could be claimed to be one of the greatest improvements in 2.4, and maybe even beyond that, the ability to change the log level of a running daemon is coming.
Articles in 2.4 feature series:
- Part 1 – Multiple LLD filters
- Part 2 – Controlling redirects and header retrieval for web monitoring
- Part 3 – SSL verification and authentication controls
- Part 4 – Web monitoring URL limit increased to 2048
- Part 5 – Custom action condition formula
- Part 6 – Runtime loglevel changing
- Part 7 – Improved troubleshooting
Usually debugging Zabbix problems at log level 3 (default) was not that easy. It logged something, but a lot of things were missing. One could increase log level to 4, but that carried a small warning in the configuration file:
It has been confirmed as correct by many users, too:
<TheRedBaron> Good lord, the conf wasn’t lying when it said log level 4 produces a lot of information
One needed some Zabbix experience to find anything of use in that amount of information. A small Zabbix server could easily log tens of megabytes over the course of a few minutes. A larger system would easily produce more than a gigabyte in ten minutes. Even if you only wanted to debug the web monitoring process, all of the Zabbix daemon processes would be put at log level 4 and would eagerly try to tell you what they were doing every second.
The requirement to restart the daemon to change the log level was a concern, too. In some cases the problem would manifest rarely enough to discourage constantly running at log level 4, but restarting the daemon would hide the problem. In larger installations a Zabbix server restart could take more than 10 minutes, which would mean collected values piling up on proxies and then overloading the server…
In short, restarting a large daemon to change the log level was not something users enjoyed.
Change log level for a running daemon
Zabbix 2.4 expands on what has so far been the only runtime option, introduced back in 1.8.6: the ability to reload the configuration cache. There are now two additional runtime options that allow increasing and decreasing the log level for a running daemon. Looking at the manpage or the output of --help:
Thus the following would increase log level for Zabbix server – same as changing DebugLevel in the configuration file and restarting the daemon would.
$ zabbix_server --runtime-control log_level_increase
On the command line, we would see a confirmation that the signal was sent:
zabbix_server : command sent successfully
And in the logfile a whole bunch of entries like these would appear (assuming you were running with the default log level, 3):
26555:20140907:152329.029 log level has been increased to 4 (debug)
26556:20140907:152329.029 log level has been increased to 4 (debug)
26560:20140907:152329.030 log level has been increased to 4 (debug)
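To check which processes actually acknowledged the change, the confirmation entries can be filtered out of the log. The following is a minimal sketch, assuming the "PID:date:time message" layout of the entries above; the function name `changed_pids` and the log path in the usage comment are illustrative, not part of Zabbix.

```shell
# List the PIDs that confirmed a log level change, one per line.
# Assumes each log entry starts with "PID:YYYYMMDD:HHMMSS.mmm".
# Reads log text on stdin, so it works with a file or a pipe.
changed_pids() {
    awk -F: '/log level has been (increased|decreased) to/ {
        pid = $1
        gsub(/[^0-9]/, "", pid)   # drop any padding around the PID
        if (pid != "") print pid
    }' | sort -un
}

# Hypothetical usage; the path is an assumption, check LogFile in your config:
# changed_pids < /tmp/zabbix_server.log
```

Comparing this list against the output of ps gives a quick view of which sub-processes are now logging at the new level.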
And a lot of stuff would get logged right away, of course.
If one tried to increase the log level past the maximum, the signal would still be sent, but the receiving processes would log a message, refusing to change the log level:
27228:20140907:160252.618 log level has been increased to 5 (trace)
27228:20140907:160258.936 cannot increase log level: maximum level has been already set
The same would happen with decreasing:
27228:20140907:160121.287 log level has been decreased to 0 (none)
27228:20140907:160121.801 cannot decrease log level: minimum level has been already set
Change log level for a single sub-process only
Changing the log level for all processes without a daemon restart is very nice already, but this feature also allows changing the log level for only some processes. Again, examining the manpage or the output of --help:
Assuming we would like to debug some issue with trapper processes, we can easily increase log level for all of them:
$ zabbix_server --runtime-control log_level_increase=trapper
Available process types can be seen in the internal item section in the Zabbix manual. What if the process name contains spaces? Same as in any other situation, we have to quote it to protect it from the shell, for example:
$ zabbix_server --runtime-control log_level_decrease="unreachable poller"
In case you noticed some suspicious activity in top, ps or some other location and would like to debug a process with a specific PID, you don’t have to figure out which Zabbix process it is. You can pass the PID directly:
$ zabbix_server --runtime-control log_level_increase=10771
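The three forms of the option (all processes, a process type, a PID) differ only in what follows the equals sign, so a small wrapper can build the value and keep the quoting in one place. This is a sketch only; `rc_option` is a hypothetical helper name, and it only assembles the strings shown in the examples above.

```shell
# Build the value for --runtime-control from an action and an optional target.
# action: "increase" or "decrease"
# target: empty (all processes), a process type, or a PID
rc_option() {
    action=$1
    target=${2:-}
    case $action in
        increase|decrease) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    if [ -n "$target" ]; then
        printf 'log_level_%s=%s\n' "$action" "$target"
    else
        printf 'log_level_%s\n' "$action"
    fi
}

# Usage: quote the expansion so "unreachable poller" stays a single argument.
# zabbix_server --runtime-control "$(rc_option decrease 'unreachable poller')"
```

Keeping the command substitution inside double quotes is what protects a process type containing spaces from being split by the shell.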
Extra information regarding this feature for the curious follows.
Proxy and agent join the party
It’s probably worth reminding that the ability to change the log level for a running daemon also covers Zabbix proxy and agent, although on UNIX-like systems only: it is not available for the Windows agent.
What happens if multiple daemons of the same type are running on a system, how do we specify which one should change the log level? That is determined the same way as for the configuration cache reload – to cite the manpage:
Default configuration file (unless -c option is specified) will be used to find PID file and signal will be sent to process, listed in PID file.
Under the hood, the signal is received by the main process, which is responsible for passing it on to the target process. The current implementation does not allow changing the log level for the main process itself, though. If you feel that would be useful, vote on the feature request ZBXNEXT-2427.
Forcing or reading the log level
Currently only the ability to increase or decrease the log level has been introduced. At this time there is no way to obtain the current log level or to force a specific one. If there is uncertainty about current log levels and a desire to unify them, one would probably have to restart the daemon, or decrease the log level for all processes 5 times (enough to reach level 0 even from level 5), then increase it to the desired level. If such functionality seems useful to you, consider voting on the requests to allow setting a specific loglevel and querying the current loglevel.
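The decrease-then-increase workaround can be scripted. The sketch below relies on the log excerpts earlier in this article, which show levels running from 0 (none) to 5 (trace), so it sends five decreases to be sure every process reaches 0. `force_log_level`, `ZABBIX_CMD` and `PAUSE` are hypothetical names introduced here, not Zabbix settings; the pause between commands is a precaution, as this article does not say how back-to-back signals are handled.

```shell
# Drive all processes to a known log level using only increase/decrease.
# ZABBIX_CMD (hypothetical) overrides the binary, e.g. "echo" for a dry run.
# PAUSE (hypothetical) is the delay in seconds between commands.
force_log_level() {
    want=$1
    cmd=${ZABBIX_CMD:-zabbix_server}
    pause=${PAUSE:-1}
    case $want in
        [0-5]) ;;
        *) echo "level must be between 0 and 5" >&2; return 1 ;;
    esac
    # Five decreases reach level 0 from any starting level; surplus
    # decreases are refused by the processes and merely logged.
    i=0
    while [ "$i" -lt 5 ]; do
        $cmd --runtime-control log_level_decrease || return 1
        sleep "$pause"
        i=$((i + 1))
    done
    # Then climb to the requested level.
    i=0
    while [ "$i" -lt "$want" ]; do
        $cmd --runtime-control log_level_increase || return 1
        sleep "$pause"
        i=$((i + 1))
    done
}

# Dry run that only prints the commands:
# ZABBIX_CMD=echo PAUSE=0 force_log_level 3
```

Each command signals every process, so on a busy system this sequence is worth running with care and only when a restart is the worse option.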
How it can fail
Besides trying to increase or decrease it past the limits, as mentioned above, there are other cases where the log level cannot be changed.
If you tried to send such a signal while the daemon was not running, it would fail to find the PID file:
zabbix_server : cannot open PID file [/var/run/zabbix/zabbix_server.pid]:  No such file or directory
If the PID file was there, but did not contain a proper PID:
zabbix_server : cannot retrieve PID from file [/tmp/zabbix_server.pid]
If it contained an incorrect PID:
zabbix_server : cannot send command to PID :  No such process
If we tried to send the signal to a non-existent Zabbix sub-process, the terminal message would indicate success, but the following would be logged:
27218:20140907:162835.702 failed to redirect signal: "unreachable poller #2" process does not exist
If we tried to send a signal to a process we don’t own:
zabbix_server : cannot send command to PID :  Operation not permitted
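Most of these failures can be spotted before sending anything. The following sketch checks a PID file in the same order as the cases above: file missing, no usable PID inside, process gone or not ours. `check_pidfile` is a hypothetical helper, its messages are its own rather than Zabbix output, and the PID file path must come from your own configuration.

```shell
# Pre-flight check for a daemon PID file.
# Prints the PID and returns 0 if the process exists and can be signalled.
# Return codes (chosen for this sketch): 1 unreadable file, 2 no valid PID,
# 3 process missing or owned by another user.
check_pidfile() {
    pidfile=$1
    if [ ! -r "$pidfile" ]; then
        echo "check: cannot open PID file [$pidfile]" >&2
        return 1
    fi
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*)
            echo "check: no valid PID in [$pidfile]" >&2
            return 2 ;;
    esac
    # kill -0 delivers no signal; it only tests existence and permission.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "check: PID $pid does not exist or is not ours" >&2
        return 3
    fi
    echo "$pid"
}

# Usage, with the PID file path taken from the example above:
# check_pidfile /var/run/zabbix/zabbix_server.pid
```

It cannot detect the non-existent sub-process case, since that is only reported in the daemon log after the main process has received the signal.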
Note that an issue has been identified during the development of this feature – when a signal is sent to a sleeping process (for example, housekeeper), it will start working. On a busy system changing log level for all processes could easily overload the system as all processes start working instantly. It is considered to be a bug, but at the time of this writing it has not been fixed yet. Follow and vote on bug report ZBX-8699.
A desired feature
Oh, by the way, this was one of the features put up for voting back at the Zabbix Conference 2013. Features for voting are chosen from the top 20 voted issues, so there certainly was enough interest in better debugging controls. It did not win back then (multiple filters for LLD rules did), but it’s really great to see this excellent troubleshooting help appear a year later.