Mysterious Zabbix Problems: How We Debug Them

This came as the first assignment for me, a new Zabbix employee – to fix a bug ZBX-3788 zabbix daemon processes hang on futex
Sometimes terminating a Zabbix agent with “killall -15 zabbix_agentd” resulted in one hanging process and other processes in the <defunct> state. strace utility showed the process was hanging on futex:

Process 10468 attached - interrupt to quit
futex(0xb783ca50, FUTEX_WAIT_PRIVATE, 2, NULL

The first step – how to reproduce the problem. As the bug manifested only occasionally, it was clear that it may take time and patience to catch it. Perhaps we will even have to start/stop the agent a thousand times before observing the bug. So, I set out for making an automatic script with a built-in “flight recorder”.

Tools: strace and ltrace

How can we see what is going inside Zabbix agent before it hangs ? One popular utility is strace. It is a tool which shows how your program is talking with kernel. But strace does not show what goes on inside the program between system calls.

There is another great tool ltrace, capable of showing not only system calls but also library calls your program is making (with arguments passed and results received), like printf().

You don’t have to trace processes from the very start. Both tools can attach to already running processes. For our purposes we will attach strace or ltrace just before terminating the process.

Reproducing the problem

A test environment

Zabbix 1.8.11rc1 source code was downloaded and compiled on a Debian GNU/Linux 6.0 (64-bit) machine. Hoping to increase the probability of process hang the configuration file zabbix_agentd.conf was edited to maximize the number of processes and logging level by setting:

DebugLevel=4
StartAgents=100

The script starts the zabbix_agentd processes and issues “killall -15 zabbix_agentd“. The starting/stopping is repeated until at least one of the zabbix_agentd processes hangs.
Here is a text of the test script, test-ZBX-3788.sh:

#!/bin/sh

ZBX_AGENTD_PATH=~/ZBX-3788/sbin
ZBX_AGENTD_CONF=~/ZBX-3788/zabbix_agentd.conf

NPROC_TARGET=103          # If "StartAgents=100" is configured in
                          # zabbix_agentd.conf,
                          # 103 processes are started.
OUT_DIR=~/ZBX-3788
WITH_TRACING=0            # 0 - run without tracing, 1 - run with tracing.
USER=$(whoami)

NRUN=1

while [ 1 ]; do
    echo NRUN=$NRUN Starting ...
    $ZBX_AGENTD_PATH/zabbix_agentd -c $ZBX_AGENTD_CONF
    sleep 4

    # Log processes.
    ps -fu $USER > $OUT_DIR/ps_after_start_$NRUN

    # Count processes.
    NPROC=$(ps -fu $USER | grep -v grep \
                | grep './sbin/zabbix_agentd -c' | wc -l)

    echo "$NPROC" processes started.

    if [ $NPROC -ne $NPROC_TARGET ]; then
         echo ERROR: "$NPROC process(es) started."
    fi

    # Find the 'zabbix_agentd' parent process PID.
    PID=$(ps -fu $USER \
         |  awk '/zabbix_agentd -c /{ if ( $3 == "1" ) print $2 }')

    # Trace the 'zabbix_agentd' parent process.
    if [ $WITH_TRACING -eq 1 ]; then
        ltrace -tt -T -S -o $OUT_DIR/ltrace_$NRUN -p $PID &
    fi

    # Trace "killall" to see in what order processes were killed.
    echo NRUN=$NRUN Stopping $PID
    if [ $WITH_TRACING -eq 1 ]; then
        strace -tt -T -v -o $OUT_DIR/strace_killall_$NRUN \
                                      killall -15 zabbix_agentd
    else
        killall -15 zabbix_agentd
    fi
    sleep 4

    # Log processes.
    ps -fu $USER > $OUT_DIR/ps_after_stop__$NRUN

    # Count processes again.
    NPROC=$(ps -fu $USER | grep -v grep \
                | grep './sbin/zabbix_agentd -c' | wc -l)
    if [ $NPROC -gt 0 ]; then
         echo GOT IT: "$NPROC process(es) hanging. See NRUN $NRUN"
         exit
    fi
    NRUN=$(( NRUN + 1 ))
done

The process hang reproduction

Let’s start the script and leave it running. Be patient, it can take hundreds of a attempts before the process hang occurs (it took half an hour in our test).

$ ./test-ZBX-3788.sh
NRUN=1 Starting ...
103 processes started.
NRUN=1 Stopping 10128
NRUN=2 Starting ...
103 processes started.
NRUN=2 Stopping 10252
...
NRUN=196 Starting ...
103 processes started.
NRUN=196 Stopping 1884
GOT IT: 1 process(es) hanging. See NRUN 196

Let’s look at what processes are hanging:

$ ps -fu zabbix18 | less
UID        PID  PPID  C STIME TTY          TIME CMD
zabbix18  1884     1  0 16:40 ?        00:00:00 /home/zabbix18/ZBX-3788/sbin/zabbix_agentd -c /home/zabbix18/ZBX-3788/zabbix_agentd.conf
zabbix18  1885  1884  0 16:40 ?        00:00:00 [zabbix_agentd] <defunct>
zabbix18  1949  1884  0 16:40 ?        00:00:00 [zabbix_agentd] <defunct>
....

Out of 103 processes one process (parent zabbix_agentd) is hanging and the rest are <defunct>. The waiting is on futex:

$ strace -p 1884
Process 1884 attached - interrupt to quit
futex(0x7f386ddc49f4, FUTEX_WAIT, 2, NULL

Our test script is working – at least one problem case is reproduced.

Now let’s terminate hanging processes with “killall -9 zabbix_agentd” and edit test script – enable tracing by setting WITH_TRACING=1. Start the scipt again, it will be writing a lot of strace and ltrace output files. The next step is comparing these output files in a normal case and in a problem case.

Log and trace file analysis

A normal termination (without process hang) in zabbix_agentd.log file looks like this (with added comments to explain the sequence of events):


3768:20120213:134343.929 Sleeping for 1 second(s)
— some zabbix_agentd child processes get SIGTERM from “killall”: —
3667:20120213:134344.094 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
3668:20120213:134344.094 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …

3677:20120213:134344.098 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
3678:20120213:134344.098 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
— zabbix_agentd parent process gets SIGCHLD signal: —
3666:20120213:134344.096 One child process died (PID:3667,exitcode/signal:255). Exiting …
— “killall” manages to terminate few more child processes: —
3679:20120213:134344.099 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
3680:20120213:134344.099 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …

3684:20120213:134344.101 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
— parent process starts exiting: —
3666:20120213:134344.101 zbx_on_exit() called
— “killall” continues to terminate child processes: —
3685:20120213:134344.102 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
3686:20120213:134344.102 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
3687:20120213:134344.102 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
3688:20120213:134344.103 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
3689:20120213:134344.103 Got signal [signal:15(SIGTERM),sender_pid:3780,sender_uid:1001,reason:0]. Exiting …
— the rest of child processes get SIGTERM from the parent process: —
3690:20120213:134344.104 Got signal [signal:15(SIGTERM),sender_pid:3666,sender_uid:1001,reason:0]. Exiting …
3692:20120213:134344.104 Got signal [signal:15(SIGTERM),sender_pid:3666,sender_uid:1001,reason:0]. Exiting …
3691:20120213:134344.104 Got signal [signal:15(SIGTERM),sender_pid:3666,sender_uid:1001,reason:0]. Exiting …

The problem case (with process hang) in log file looks like this:


15945:20120213:135706.109 Sleeping for 1 second(s)
— all zabbix_agentd child processes get SIGTERM from “killall”: —
15844:20120213:135706.323 Got signal [signal:15(SIGTERM),sender_pid:15957,sender_uid:1001,reason:0]. Exiting …
15845:20120213:135706.324 Got signal [signal:15(SIGTERM),sender_pid:15957,sender_uid:1001,reason:0]. Exiting …
15846:20120213:135706.325 Got signal [signal:15(SIGTERM),sender_pid:15957,sender_uid:1001,reason:0]. Exiting …

Hmm… There are no records from the parent process in the log file.

ltrace output for the parent process in normal case looks like this:


3666 13:43:44.043310 SYS_wait4(-1, 0x7fff14a3f21c, 0, 0, 3666)               = -512 <0.050239>
— zabbix_agentd parent process gets SIGTERM signal from “killall”: —
3666 13:43:44.093935 — SIGTERM (Terminated) —
— and starts logging it: —
3666 13:43:44.094040 __xstat(1, “/tmp/zabbix_agentd.log”, 0x7fff14a3df50 <unfinished …>
3666 13:43:44.094179 SYS_stat(“/tmp/zabbix_agentd.log”, 0x7fff14a3df50)      = 0 <0.000035>
3666 13:43:44.094364 <… __xstat resumed> )                                 = 0 <0.000233>
3666 13:43:44.094432 fopen(“/tmp/zabbix_agentd.log”, “a+” <unfinished …>
— zabbix_agentd parent process gets SIGCHLD signal from a child process: —
3666 13:43:44.094970 — SIGCHLD (Child exited) —
— and starts logging it, too: —
3666 13:43:44.095051 getpid()                                                = 3666 <0.000051>
3666 13:43:44.095217 __xstat(1, “/tmp/zabbix_agentd.log”, 0x7fff14a2d2d0 <unfinished …>
3666 13:43:44.095335 SYS_stat(“/tmp/zabbix_agentd.log”, 0x7fff14a2d2d0)      = 0 <0.000379>
3666 13:43:44.095828 <… __xstat resumed> )                                 = 0 <0.000528>
3666 13:43:44.095890 fopen(“/tmp/zabbix_agentd.log”, “a+” <unfinished …>
3666 13:43:44.096021 SYS_open(“/tmp/zabbix_agentd.log”, 1090, 0666)          = 5 <0.000028>
3666 13:43:44.096155 <… fopen resumed> )                                   = 0xda0bf0 <0.000188>

3666 13:43:44.096582 gettimeofday(0x7fff14a2d380, NULL)                      = 0 <0.000054>
3666 13:43:44.096745 localtime(0x7fff14a2d380 <unfinished …>
3666 13:43:44.096865 SYS_stat(“/etc/localtime”, 0x7fff14a1c900)              = 0 <0.000026>
3666 13:43:44.097355 <… localtime resumed> )                               = 0x7f5b0b8eb380 <0.000547>
3666 13:43:44.097420 getpid()                                                = 3666 <0.000050>
3666 13:43:44.097559 fprintf(0xda0bf0, “%6li:%.4d%.2d%.2d:%.2d%.2d%.2d.%”…, 3666, 2012, 2, 13, 13, 43, 44 <unfinished …>
3666 13:43:44.097726 SYS_fstat(5, 0x7fff14a1c170)                            = 0 <0.000365>
3666 13:43:44.098157 SYS_mmap(0, 4096, 3, 34, 0xffffffff)                    = 0x7f5b0bf9d000 <0.000024>
3666 13:43:44.098388 <… fprintf resumed> )                                 = 27 <0.000705>
3666 13:43:44.098446 vfprintf(0xda0bf0, “One child process died (PID:%d,e”…, 0x7fff14a2d360) = 66 <0.000399>
3666 13:43:44.098997 fputc(‘\n’, 0xda0bf0)                                   = 10 <0.000048>

3666 13:43:44.099165 fclose(0xda0bf0 <unfinished …>
3666 13:43:44.099270 SYS_write(5, ”  3666:20120213:134344.096 One c”…, 94) = 94 <0.000363>
3666 13:43:44.099730 SYS_close(5)                                            = 0 <0.000019>
3666 13:43:44.099813 SYS_munmap(0x7f5b0bf9d000, 4096)                        = 0 <0.000029>
3666 13:43:44.099937 <… fclose resumed> )                                  = 0 <0.000697>
3666 13:43:44.099990 __xstat(1, “/tmp/zabbix_agentd.log”, 0x7fff14a2d250 <unfinished …>
3666 13:43:44.100101 SYS_stat(“/tmp/zabbix_agentd.log”, 0x7fff14a2d250)      = 0 <0.000024>
3666 13:43:44.100577 <… __xstat resumed> )                                 = 0 <0.000507>
3666 13:43:44.100630 fopen(“/tmp/zabbix_agentd.log”, “a+” <unfinished …>
3666 13:43:44.100739 SYS_open(“/tmp/zabbix_agentd.log”, 1090, 0666)          = 5 <0.000025>
3666 13:43:44.100865 <… fopen resumed> )                                   = 0xda0bf0 <0.000161>

3666 13:43:44.100944 gettimeofday(0x7fff14a2d300, NULL)                      = 0 <0.000045>
3666 13:43:44.101084 localtime(0x7fff14a2d300 <unfinished …>
3666 13:43:44.101218 SYS_stat(“/etc/localtime”, 0x7fff14a1c880)              = 0 <0.000021>
3666 13:43:44.101348 <… localtime resumed> )                               = 0x7f5b0b8eb380 <0.000208>
3666 13:43:44.101404 getpid()                                                = 3666 <0.000152>
3666 13:43:44.101607 fprintf(0xda0bf0, “%6li:%.4d%.2d%.2d:%.2d%.2d%.2d.%”…, 3666, 2012, 2, 13, 13, 43, 44 <unfinished …>
3666 13:43:44.101653 SYS_fstat(5, 0x7fff14a1c0f0)                            = 0 <0.000005>
3666 13:43:44.101675 SYS_mmap(0, 4096, 3, 34, 0xffffffff)                    = 0x7f5b0bf9d000 <0.000005>
3666 13:43:44.101733 <… fprintf resumed> )                                 = 27 <0.000090>
3666 13:43:44.101749 vfprintf(0xda0bf0, “zbx_on_exit() called”, 0x7fff14a2d2e0) = 20 <0.000018>
3666 13:43:44.101929 fputc(‘\n’, 0xda0bf0)                                   = 10 <0.000014>

3666 13:43:44.101977 fclose(0xda0bf0 <unfinished …>
3666 13:43:44.102007 SYS_write(5, ”  3666:20120213:134344.101 zbx_o”…, 48) = 48 <0.000006>
3666 13:43:44.102037 SYS_close(5)                                            = 0 <0.000007>
3666 13:43:44.102060 SYS_munmap(0x7f5b0bf9d000, 4096)                        = 0 <0.000008>
3666 13:43:44.102093 <… fclose resumed> )                                  = 0 <0.000097>
3666 13:43:44.102107 sigemptyset(0x7fff14a2d400)                             = 0 <0.000012>
3666 13:43:44.102152 sigaddset(0x7fff14a2d400, 17)                           = 0 <0.000015>
3666 13:43:44.102198 sigprocmask(0, 0x7fff14a2d400, NULL <unfinished …>
3666 13:43:44.102229 SYS_rt_sigprocmask(0, 0x7fff14a2d400, 0, 8, 0x7f5b0b8e7e40) = 0 <0.000005>
3666 13:43:44.102276 <… sigprocmask resumed> )                             = 0 <0.000056>
3666 13:43:44.102291 kill(3667, 15 <unfinished …>
3666 13:43:44.102328 SYS_kill(3667, 15)                                      = 0 <0.000006>
3666 13:43:44.102357 <… kill resumed> )                                    = 0 <0.000045>
3666 13:43:44.102371 kill(3668, 15 <unfinished …>
3666 13:43:44.102399 SYS_kill(3668, 15)                                      = 0 <0.000005>
3666 13:43:44.102427 <… kill resumed> )                                    = 0 <0.000036>
3666 13:43:44.102444 kill(3669, 15 <unfinished …>
3666 13:43:44.102470 SYS_kill(3669, 15)                                      = 0 <0.000005>
3666 13:43:44.102501 <… kill resumed> )                                    = 0 <0.000037>

— zabbix_agentd parent process releases resources and terminates: —

3666 13:43:46.155592 free(0x00d9eb60)                                        = <void> <0.000085>
3666 13:43:46.155807 free(0x00d9f000)                                        = <void> <0.000086>
3666 13:43:46.156086 shmctl(294918, 0, 0, 123936, 0x7f5b0bf8e700 <unfinished …>
3666 13:43:46.156328 SYS_shmctl(294918, 0, 0, 0, 0x7f5b0bf8e700)             = 0 <0.000113>
3666 13:43:46.156720 <… shmctl resumed> )                                  = 0 <0.000483>
3666 13:43:46.156845 exit(0 <unfinished …>
3666 13:43:46.157268 SYS_exit_group(0 <no return …>
3666 13:43:46.157805 +++ exited (status 0) +++

In problem case – process hang – things look different:


15842 13:57:06.169775 SYS_wait4(-1, 0x7fffb698ff8c, 0, 0, 15842)             = -512 <0.151751>
— zabbix_agentd parent process gets SIGTERM signal from “killall”: —
15842 13:57:06.322104 — SIGTERM (Terminated) —
— and starts logging it: —
15842 13:57:06.322346 __xstat(1, “/tmp/zabbix_agentd.log”, 0x7fffb698ecd0 <unfinished …>
15842 13:57:06.322527 SYS_stat(“/tmp/zabbix_agentd.log”, 0x7fffb698ecd0)     = 0 <0.000026>
15842 13:57:06.322659 <… __xstat resumed> )                                = 0 <0.000178>
15842 13:57:06.322728 fopen(“/tmp/zabbix_agentd.log”, “a+” <unfinished …>
15842 13:57:06.322875 SYS_open(“/tmp/zabbix_agentd.log”, 1090, 0666)         = 5 <0.000032>
15842 13:57:06.323027 <… fopen resumed> )                                  = 0x1ce0bf0 <0.000200>

15842 13:57:06.323085 gettimeofday(0x7fffb698ed80, NULL)                     = 0 <0.000046>
15842 13:57:06.323229 localtime(0x7fffb698ed80 <unfinished …>
15842 13:57:06.323342 SYS_stat(“/etc/localtime”, 0x7fffb697e300)             = 0 <0.000023>
— logging of SIGTERM is in progress, but another signal arrives: —
15842 13:57:06.323440  SIGCHLD (Child exited) 
15842 13:57:06.323502 getpid()                                               = 15842 <0.000054>
— … and another logging is started while the first is somewhere in localtime().  —
15842 13:57:06.323666 __xstat(1, “/tmp/zabbix_agentd.log”, 0x7fffb697dc70 <unfinished …>
15842 13:57:06.323775 SYS_stat(“/tmp/zabbix_agentd.log”, 0x7fffb697dc70)     = 0 <0.000023>
15842 13:57:06.323893 <… __xstat resumed> )                                = 0 <0.000154>
15842 13:57:06.323947 fopen(“/tmp/zabbix_agentd.log”, “a+” <unfinished …>
15842 13:57:06.324071 SYS_open(“/tmp/zabbix_agentd.log”, 1090, 0666)         = 6 <0.000027>
15842 13:57:06.324208 <… fopen resumed> )                                  = 0x1ce0e30 <0.000176>
15842 13:57:06.324261 gettimeofday(0x7fffb697dd20, NULL)                     = 0 <0.000044>
15842 13:57:06.324399 localtime(0x7fffb697dd20 <unfinished …>

— here the parent process hangs, its child processes are <defunct>… —
— … until we terminate it with “killall -9”—
— Then “ltrace” reveals the hanging was on futex. —
15842 13:57:06.324486 SYS_futex(0x7f6a133689f4, 0, 2, 0, 4096 <no return …>
15842 15:12:49.992621 +++ killed by SIGKILL +++

You could say: localtime() is not a reentrant function. Right. But why does it hang ?
Let’s look into libc source code. On our Debian GNU/Linux 6.0 machine the interesting part is in eglibc-2.11.3/time/localtime.c:

...
struct tm *
__localtime_r (t, tp)
     const time_t *t;
     struct tm *tp;
{
  return __tz_convert (t, 1, tp);
}
weak_alias (__localtime_r, localtime_r)

/* Return the `struct tm' representation of *T in local time.  */
struct tm *
localtime (t)
     const time_t *t;
{
  return __tz_convert (t, 1, &_tmbuf);
}
libc_hidden_def (localtime)

Here localtime() and localtime_r() are almost the same, they both call __tz_convert(). The __tz_convert() “lives” in eglibc-2.11.3/time/tzset.c:

...
/* This locks all the state variables in tzfile.c and this file.  */
__libc_lock_define_initialized (static, tzset_lock)
...
/* Return the `struct tm' representation of *TIMER in the local timezone.
   Use local time if USE_LOCALTIME is nonzero, UTC otherwise.  */
struct tm *
__tz_convert (const time_t *timer, int use_localtime, struct tm *tp)
{
  long int leap_correction;
  int leap_extra_secs;
   
  if (timer == NULL)
    {
      __set_errno (EINVAL);
      return NULL;
    }
   
  __libc_lock_lock (tzset_lock);

  /* Update internal database according to current TZ setting.
     POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname.
     This is a good idea since this allows at least a bit more parallelism.  */
  tzset_internal (tp == &_tmbuf && use_localtime, 1);
                                               
  if (__use_tzfile)
    __tzfile_compute (*timer, use_localtime, &leap_correction,
                      &leap_extra_secs, tp);
  else
    {
      if (! __offtime (timer, 0, tp))
        tp = NULL;
      else
        __tz_compute (*timer, tp, use_localtime);
      leap_correction = 0L;
      leap_extra_secs = 0;
    }
   
  if (tp)
    {
      if (! use_localtime)
        {
          tp->tm_isdst = 0;
          tp->tm_zone = "GMT";
          tp->tm_gmtoff = 0L;
        }
       
      if (__offtime (timer, tp->tm_gmtoff - leap_correction, tp))
        tp->tm_sec += leap_extra_secs;
      else
        tp = NULL;
    }

  __libc_lock_unlock (tzset_lock);

  return tp;
}

Apparently the blocking on futex is related with this lock/unlock
__libc_lock_lock (tzset_lock)

__libc_lock_unlock (tzset_lock)
hidden in localtime().

Proposed explanation

I propose the following explanation of hanging:

  1. When some event is to be logged (e.g. receiving the SIGTERM signal), the localtime() is invoked. It sets up an internal locking while setting time zone information.
  2. While the localtime() is in progress, a signal arrives (e.g. SIGCHLD). The signal handler is immediately invoked, it tries to log the signal information and invokes the localtime().
  3. Now there is a dead-lock: the first invocation of localtime() was interrupted, it still keeps the lock, another localtime() was invoked from the signal handler and cannot obtain the lock. The process is blocked by its own signal handler !

Should we replace localtime() with localtime_r() ?

Linux “man localtime” says:
“The …functions …. localtime() return a pointer to static data and hence are not thread-safe. Thread-safe versions …localtime_r() are specified by SUSv2, …”

Similarly the IBM AIX documentation on localtime() says:
“…Do not use the ….localtime … subroutine in a multithreaded environment. See the multithread alternatives in the ….localtime_r …. subroutine article.”

But Zabbix on UNIX/Linux is not multithreaded. It is multi-process. The process hang occurs when a process execution flow is interrupted by it’s own signal handler and the localtime() is used in both at the same time. So, replacing localtime() with localtime_r() will not solve the problem.

The right solution: using only asynchronous-signal-safe functions in signal handlers

It turned out that signal handlers are special. What is allowed in a normal function, may be unacceptable in a signal handler.
A good source is CERT C Secure Coding Standard which puts forward a number of recommendations and rules for implementing signal handlers, for example,
SIG30-C. Call only asynchronous-safe functions within signal handlers (includes a list of safe functions)

Unfortunately, the list of asynchronous-signal-safe functions does not include many popular functions, even printf() is not safe. Obviously, localtime() is not in the list either.

Quick mitigation: a small change in signal handler logic

Changing signal handlers to use only safe functions is a larger task, which requires redesigning our logging architecture. Properly designed signal handlers do something simple and lightweight, leaving the real work to others. It turned out there is a quicker practical solution how to significantly reduce probability of process hanging.
Let’s turn to Zabbix source code, file src/libs/zbxnix/daemon.c:

....
static void     child_signal_handler(int sig, siginfo_t *siginfo, void *context)
{
...
        switch (sig)
        {
...
                case SIGQUIT:
                case SIGINT:
                case SIGTERM:
                        zabbix_log(parent_pid == CHECKED_FIELD(siginfo, si_pid) ? LOG_LEVEL_DEBUG : LOG_LEVEL_WARNING,
                                        "Got signal [signal:%d(%s),sender_pid:%d,sender_uid:%d,reason:%d]. Exiting ...",
                                        sig, get_signal_name(sig),
                                        CHECKED_FIELD(siginfo, si_pid),
                                        CHECKED_FIELD(siginfo, si_uid),
                                        CHECKED_FIELD(siginfo, si_code));

                        if (!PARENT_PROCESS)
                                exit(FAIL);

                        if (0 == exiting)
                        {
                                exiting = 1;
                                zbx_on_exit();
                        }

                        break;
...
        }
}

static void     parent_signal_handler(int sig, siginfo_t *siginfo, void *context)
{
        switch (sig)
        {
                case SIGCHLD:
                        if (!PARENT_PROCESS)
                                exit(FAIL);

                        if (0 == exiting)
                        {
                                zabbix_log(LOG_LEVEL_CRIT, "One child process died (PID:%d,exitcode/signal:%d). Exiting ...",
                                                CHECKED_FIELD(siginfo, si_pid), CHECKED_FIELD(siginfo, si_status));
                                exiting = 1;
                                zbx_on_exit();
                        }

                        break;
                default:
                        child_signal_handler(sig, siginfo, context);
        }
}
....

This code fragment shows that in the parent process, when SIGTERM or SIGCHLD is received:

  1. the signal reception is logged using zabbix_log().
  2. exiting variable is checked and, if necessary, set and exit sequence started by calling zbx_on_exit().

The proposed mitigation is to change the logic to:
When the parent process receives SIGTERM or SIGCHLD:

  • check the exiting variable.
  • If it is 1, this means that the exit sequence is already in progress and no further actions are taken in signal handler (simply return).
  • if it is 0, then:
    • set it to 1 immediately,
    • log the signal reception (zabbix_log()),
    • start exit sequence (zbx_on_exit()).

This will significantly reduce a probability of invoking localtime() multiple times and blocking. (Although it will not completely eliminate it as there is still a small time interval between checking of the exiting variable and setting it to 1.)

The proposed mitigation was successfully tested with zabbix_agentd on a server with 4 CPU cores and later merged into Zabbix codebase.

It was an interesting and educating assignment with learning about ltrace, signals and diving into source code. A good warm-up for a Zabbix newcomer.

Author: Andris Mednis

Developer at Zabbix.

2 thoughts on “Mysterious Zabbix Problems: How We Debug Them”

  1. Avoid extra piping:

    $ ps -fu zabbix | grep -v grep | grep ‘./sbin/zabbix_agentd -c’ | wc -l
    6

    $ pgrep -u zabbix -fc ‘./sbin/zabbix_agentd -c’
    6

Leave a Reply