Zabbix allows building a High Availability solution for IT infrastructure monitoring. Learn how to deploy a simple HA solution to understand the basics of how the components work and interact with each other.
Watch the video from the Zabbix Summit 2019 presentation.
1. Introduction (5:32:58)
2. Database cluster (5:38:51)
2.1. CLI (5:38:51)
2.2. GUI (5:41:10)
2.3. MariaDB installation (5:42:30)
2.4. Replication setup: node 1 (5:43:52)
2.5. Replication setup: node 2 (5:44:20)
2.6. Replication setup: node 3 (5:45:54)
2.7. Zabbix database preparation (5:47:10)
3. Server cluster (5:48:28)
4. Front end cluster (5:51:16)
5. Conclusion (5:54:05)
Hi! I’m Edmunds Vesmanis. As you may already know, I’m responsible for training, and this will be a small training session. You’ll learn how to deploy a very simple HA solution. I call it Zabbix HA cluster 333, which stands for 3 nodes for the database, 3 nodes for the server, and 3 nodes for the front end.
I want to keep things at a basic, starting level. Of course, there are some very sophisticated solutions out there, but my aim is to show a bare minimum of what to start with.
This HA solution:
- is tested, common, and time-proven;
- uses open source components;
- starts with 3 components (that’s why it is 333).
Here is a plan of what we want to achieve. As I already mentioned, this is a bare minimum; we don’t want to make it too complicated because, first, you need to understand the basics of how the components work and interact with each other.
As you can see, there are 3 database nodes, 3 server nodes, and 3 front end nodes. For every cluster, there is a virtual IP (VIP) that shows which of the nodes is active at the moment. The nodes will switch automatically if the basic resources die or connections fail. Manual control is also in place to override in case of problems or to perform updates.
The idea is that a user can switch the nodes at any moment via Zabbix. So, if something is wrong or if you simply want to bring down the first server node, you can click on the second one and move resources to it. This can be done from the Zabbix interface.
What you need is the official out-of-the-box template; no additional scripting is required. All you need to do is just add your VIP address and a link to the default links template. Of course, you must have the Zabbix agent, and remote commands must be enabled on all nodes.
In order to understand how MySQL replication works, let’s have a look at this circular master-slave setup with 3 nodes.
Replication works through binary logs and can be asynchronous. Let’s say an update or a test is required, and the middle node is disabled so the first one can be worked on. The replication will stop. If the binary log expiry period is set, for example, to three days and the node is reactivated within those three days, the changes will still be pushed to all of the nodes. It doesn’t matter which node is disabled; the replication will continue where it left off after reactivation.
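To make the ring concrete, here is a minimal sketch that computes which node each node replicates from. It is an illustration only: the function name `ring_master` is hypothetical, and it assumes the direction used later in this walkthrough (node 2 replicates from node 1, node 3 from node 2, and node 1 from node 3).

```shell
# Hypothetical helper: print the index of the master for a given node
# in a circular replication ring of N nodes (nodes are numbered 1..N).
ring_master() {
    node=$1
    total=$2
    if [ "$node" -eq 1 ]; then
        echo "$total"          # the first node closes the ring
    else
        echo $((node - 1))     # every other node follows its predecessor
    fi
}

# Print the topology for the 3-node database cluster.
for n in 1 2 3; do
    echo "node $n replicates from node $(ring_master "$n" 3)"
done
```

Because every node has exactly one master, taking one node down only pauses the chain at that point; the binary logs on its master hold the pending changes until it returns.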
How can this be achieved? First of all, prepare the cluster. There will be nine virtual machines, each with a separate IP and a meaningful hostname in the Hosts file. Also, we have 3 VIPs with hostnames assigned to them to make things simple.
Note. Remember, you have a bare minimum which you want to be robust, so you don’t want to rely on other systems such as DNS or DHCP.
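As a rough sketch of the hosts file described above, the helper below prints the entries for one cluster. Only the three VIP addresses and the zabbix-ha-* node names come from this walkthrough; the node addresses in 192.168.7.0/24, the "-vip" hostnames, and the function name `hosts_entries` are assumptions chosen for illustration.

```shell
# Sketch: print /etc/hosts entries for one 3-node cluster plus its VIP.
# Arguments: role (db|srv|fe), last octet of the first node, VIP address.
# Node addresses and "-vip" hostnames are assumed, not taken from the talk.
hosts_entries() {
    role=$1
    first_octet=$2
    vip=$3
    i=1
    while [ "$i" -le 3 ]; do
        printf '192.168.7.%d zabbix-ha-%s%d\n' $((first_octet + i - 1)) "$role" "$i"
        i=$((i + 1))
    done
    printf '%s zabbix-ha-%s-vip\n' "$vip" "$role"
}

# Review the output before appending it to /etc/hosts on every VM.
hosts_entries db  96 192.168.7.89
hosts_entries srv 93 192.168.7.87
hosts_entries fe  90 192.168.7.88
```

Keeping the same generated block on all nine machines means name resolution keeps working even when DNS is unavailable.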
Then, prepare each VM by setting:
- clock synchronization;
- firewall (although at first you might want to switch it off);
- SELinux (always a troublemaker, so we just disable it);
- Hosts file;
- storage (better to use separate block devices for DB, logs, apps, and configuration);
- Zabbix Agent on all nodes (enable remote commands, set proper IP addresses).
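One item from this checklist that is easy to forget is the remote commands setting. The sketch below checks an agent configuration file for it. It assumes the `EnableRemoteCommands` agent parameter used by the Zabbix versions current at the time of the talk and a default path of /etc/zabbix/zabbix_agentd.conf; the function name is hypothetical.

```shell
# Sketch: succeed only if the given agent config enables remote commands.
# The default path is an assumption; pass another file as the first argument.
remote_commands_enabled() {
    conf=${1:-/etc/zabbix/zabbix_agentd.conf}
    # Match an uncommented "EnableRemoteCommands=1" line.
    grep -Eq '^[[:space:]]*EnableRemoteCommands[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$conf"
}

# Only run the check where an agent config is actually present.
if [ -r /etc/zabbix/zabbix_agentd.conf ]; then
    if remote_commands_enabled; then
        echo "remote commands: enabled"
    else
        echo "remote commands: disabled"
    fi
fi
```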
Start by creating the database cluster. It can be done in the CLI as well as in the GUI.
1. Install all HA components with either of these commands:
## Install HA components:
yum groupinstall 'High Availability' -y
## OR:
yum groupinstall ha -y
2. Create hacluster user with a secure password:
## Create user for cluster:
echo <CLUSTER_PASSWORD> | passwd --stdin hacluster
3. Once done on every node, authenticate the nodes using the same password:
# Authenticate cluster nodes:
pcs cluster auth zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3
username: hacluster
password: <CLUSTER_PASSWORD>
zabbix-ha-db1: Authorized
zabbix-ha-db2: Authorized
zabbix-ha-db3: Authorized
The next steps will be done only on one node — it doesn’t matter which one because the nodes will synchronize.
1. Create the database cluster and add resources. In this bare minimum setup, our only resource is a VIP address for the DB cluster:
# Create zabbix-db-cluster:
pcs cluster setup --name zabbix_db_cluster \
zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3 --force
## Create resource for cluster virtual IP (VIP)
pcs resource create virtual_ip ocf:heartbeat:IPaddr2 \
ip=192.168.7.89 op monitor interval=5s --group zabbix_db_cluster
2. When it’s done, check to see if there are any problems:
## check:
pcs status
Usually, there will be some problems. To fix them, the cluster should be stopped and restarted so that the nodes can resynchronize and the authentication tokens can update:
## Restart cluster services in case of:
## "cluster is not currently running on this node" error
pcs cluster stop --all && pcs cluster start --all
3. If a firewall is used, you will need to add an exception for the HA cluster:
# in case you have a firewall:
firewall-cmd --permanent --add-service=high-availability && firewall-cmd --reload
4. Prevent resources from moving after recovery:
## Prevent Resources from Moving after Recovery
pcs resource defaults resource-stickiness=100
This setting tells the cluster that once a resource has moved to another node, it should stay there. By default, a resource is activated on one node, migrates to another one if something happens, and jumps back when the problem goes away. We don’t want that.
5. Then, disable STONITH (Shoot The Other Node In The Head). This is mandatory for this setup because no fencing devices are configured; otherwise, you won’t be able to start your resources.
## if you are not using fencing disable STONITH:
pcs property set stonith-enabled=false
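Since the same sequence is repeated for the server and front end clusters, it can help to print the commands for review before running them. The following is only a sketch: the function name and argument order are hypothetical, the pcs syntax is copied from the steps above, and the ordering (start the cluster, then set properties, then create the VIP) is an assumption rather than the order shown in the talk.

```shell
# Sketch: print (not run) the pcs commands for one cluster.
# Arguments: cluster name, VIP resource name, VIP address, node names...
print_cluster_setup() {
    name=$1
    vip_res=$2
    vip=$3
    shift 3
    echo "pcs cluster auth $*"
    echo "pcs cluster setup --name $name $* --force"
    echo "pcs cluster stop --all && pcs cluster start --all"
    echo "pcs property set stonith-enabled=false"
    echo "pcs resource defaults resource-stickiness=100"
    echo "pcs resource create $vip_res ocf:heartbeat:IPaddr2 ip=$vip op monitor interval=5s --group $name"
}

# Database cluster values from this walkthrough.
print_cluster_setup zabbix_db_cluster virtual_ip 192.168.7.89 \
    zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3
```

Printing instead of executing keeps the helper harmless on a machine without pcs and leaves the decision to run each command with the operator.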
Even I didn’t know it for some time, but there’s a GUI. And it is really helpful if you need to see what was done a year or two ago and what is happening now because you don’t do setups every day.
The GUI can be used to create clusters. You need to enter the hostnames and ports, and some advanced options are available too.
If you already have a cluster, you can navigate to your GUI and just add one of the nodes from the existing cluster.
In the GUI, you can immediately see the composition of your cluster, whether the components are connected, and the uptime. You have control, and you don’t need to type anymore.
Note. More information on how to configure a High Availability cluster with the pcsd Web UI can be found here.
In this scenario, we use MariaDB, so we proceed with MariaDB installation.
1. Install the MariaDB server on all DB nodes:
## install MariaDB server on all 3 DB nodes:
yum install mariadb-server -y
2. Configure the DB settings:
## tune/configure db settings:
cp ./zabbixdb.cnf /etc/my.cnf.d/
3. Start and enable MariaDB:
## Start and enable to start on boot:
systemctl start mariadb
systemctl enable mariadb
4. Secure the installation with a password:
## secure your installation and create <MYSQL_ROOT_PASSWORD>:
mysql_secure_installation
In a test environment it won’t be required, but here is my configuration file example. I have basic settings for Zabbix DB and binary logs:
cat zabbixdb.cnf
[mysqld]
# ZABBIX specific settings and tuning
default-storage-engine = InnoDB
innodb = FORCE
innodb_file_per_table = 1
innodb_buffer_pool_size = 512M # 50-75% of total RAM
innodb_buffer_pool_instances = 8 # For MySQL 5.5 - 4, for 5.6+ - 8
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_io_capacity = 800 # HDD disks 500-800, SSD disks - 2000
sync-binlog = 0
query-cache-size = 0
server_id = 96 # for id settings IPs last number used
report_host = zabbix-ha-db1
log-slave-updates
log_bin = /var/lib/mysql/log-bin
log_bin_index = /var/lib/mysql/log-bin.index
relay_log = /var/lib/mysql/relay-bin
relay_log_index = /var/lib/mysql/relay-bin.index
binlog_format = mixed
binlog_cache_size = 64M
max_binlog_size = 1G
expire_logs_days = 5
binlog_checksum = crc32
max_allowed_packet = 500M
As you can see, my logs expire in five days, which means that I can bring a node down for up to five days and the replication will still work. Of course, sufficient storage is required, but this functionality is still very useful.
5. Deploy the configuration file on all nodes, adapting the server ID and hostname on every single node:
## Must be set on every db node accordingly
vi /etc/my.cnf.d/zabbixdb.cnf
server_id = 96 ## Last number of IP
report_host = zabbix-ha-db1 ## Hostname
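The "last number of IP" convention can be sketched in a few lines of shell. The function name `server_id_from_ip` and the example address are assumptions for illustration; the address is simply one that is consistent with server_id = 96.

```shell
# Sketch: derive server_id from the last octet of an IPv4 address.
server_id_from_ip() {
    # Strip everything up to and including the last dot.
    echo "${1##*.}"
}

# Hypothetical values for the first DB node.
node_ip=192.168.7.96
node_name=zabbix-ha-db1

# Print the two per-node lines to place in zabbixdb.cnf.
printf 'server_id = %s\nreport_host = %s\n' "$(server_id_from_ip "$node_ip")" "$node_name"
```

Deriving the ID from the address keeps it unique within a subnet without any extra bookkeeping, which is the only property replication needs from it.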
Replication setup: node 1
Replication is the trickiest part. I always keep my plan on a separate monitor to make sure it is properly set up. Start with node 1.
1. Log in to MySQL:
## Log in to MySQL:
mysql -uroot -p<MYSQL_ROOT_PASSWORD>
2. Stop the slave:
MariaDB [(none)]> STOP SLAVE;
3. Grant the replication privilege to the replication user, specifying the IP of node 2:
MariaDB [(none)]> GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'<NODE2_IP>' identified by '<REPLICATOR_PASSWORD>';
3. Show the master status:
MariaDB [(none)]> SHOW MASTER STATUS\G
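The File and Position values in this output are what the next node needs for its CHANGE MASTER TO statement. As a sketch of how to avoid copying them by hand, the hypothetical helper below reads the `\G`-style output on standard input and prints the statement. The sample values are the ones used in this walkthrough, and passwords containing backslashes would need extra care because awk interprets escapes in `-v` values.

```shell
# Sketch: turn "SHOW MASTER STATUS\G" output (stdin) into a
# CHANGE MASTER TO statement for the next node in the ring.
# Arguments: master IP, replication user, replication password.
change_master_sql() {
    awk -v host="$1" -v user="$2" -v pass="$3" '
        $1 == "File:"     { file = $2 }
        $1 == "Position:" { pos = $2 }
        END {
            # Fail if the expected fields were not found.
            if (file == "" || pos == "") exit 1
            printf "CHANGE MASTER TO MASTER_HOST = \047%s\047, MASTER_USER = \047%s\047, MASTER_PASSWORD = \047%s\047, MASTER_LOG_FILE = \047%s\047, MASTER_LOG_POS = %s;\n", host, user, pass, file, pos
        }'
}

# Example with sample output; on node 1 the input would come from
#   mysql -uroot -p -e 'SHOW MASTER STATUS\G'
change_master_sql '<NODE1_IP>' replicator '<REPLICATOR_PASSWORD>' <<'EOF'
*************************** 1. row ***************************
            File: log-bin.000001
        Position: 245
EOF
```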
Replication setup: node 2
4. Then, go to node 2. Again, log in, stop the slave, then change the master:
## Log in to MySQL:
mysql -uroot -p<MYSQL_ROOT_PASSWORD>
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST = '<NODE1_IP>', MASTER_USER = 'replicator', MASTER_PASSWORD = '<REPLICATOR_PASSWORD>', MASTER_LOG_FILE = 'log-bin.000001', MASTER_LOG_POS = 245;
This command tells node 2 that node 1 is its master, so node 2 becomes the slave of node 1. The MASTER_LOG_FILE and MASTER_LOG_POS values come from the master status shown on node 1 in the previous step.
5. Grant the replication slave privilege to node 3 identified by some secure password:
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'<NODE3_IP>' identified by '<REPLICATOR_PASSWORD>';
6. At this point you can reset the master on node 2:
RESET MASTER;
7. Then, start the slave and, again, get the slave status:
START SLAVE;
SHOW SLAVE STATUS\G
The output should show 0 errors. Also, if it says “Waiting for master to send event”, it means that the setup was successful. Otherwise, some debugging might be required; usually the cause is a wrong IP address or DNS name.
8. Proceed with node 2 and get the master status, the log-bin file, and its position.
Replication setup: node 3
9. Repeat the steps for node 3:
## Log in to MySQL:
mysql -uroot -p<MYSQL_ROOT_PASSWORD>
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST = '<NODE2_IP>', MASTER_USER = 'replicator', MASTER_PASSWORD = '<REPLICATOR_PASSWORD>', MASTER_LOG_FILE = 'log-bin.000001', MASTER_LOG_POS = 245;
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'<NODE1_IP>' identified by '<REPLICATOR_PASSWORD>';
RESET MASTER;
START SLAVE;
10. Set up node 1 as the slave for node 3. Check the slave status:
If it says “Waiting for master to send event”, we are done with this one.
11. Show the master status on node 3:
12. Use the same commands for node 1:
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST = '<NODE3_IP>', MASTER_USER = 'replicator', MASTER_PASSWORD = '<REPLICATOR_PASSWORD>', MASTER_LOG_FILE = 'log-bin.000001', MASTER_LOG_POS = 245;
START SLAVE;
SHOW SLAVE STATUS\G
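Once the ring is closed, the slave status has to be checked on all three nodes, which is easy to script. The sketch below is a hypothetical helper that reads `SHOW SLAVE STATUS\G` output on standard input and succeeds only when both the `Slave_IO_Running` and `Slave_SQL_Running` fields report `Yes`. The sample input is illustrative, not captured from a real node.

```shell
# Sketch: succeed only if both replication threads are running.
# Input: the output of "SHOW SLAVE STATUS\G" on stdin.
replication_ok() {
    awk '
        $1 == "Slave_IO_Running:"  { io = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }'
}

# Example with sample output; on a DB node the input would come from
#   mysql -uroot -p -e 'SHOW SLAVE STATUS\G'
replication_ok <<'EOF' && echo "replication healthy"
   Slave_IO_State: Waiting for master to send event
 Slave_IO_Running: Yes
Slave_SQL_Running: Yes
EOF
```

The exit status makes the helper usable in monitoring checks or in a loop over the three DB nodes.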
Zabbix database preparation
The next step is to create a Zabbix database and a user:
## Login to mysql and create zabbix db/user:
create database zabbix character set utf8 collate utf8_bin;
grant all privileges on zabbix.* to zabbix@'%' identified by '<DB_ZABBIX_PASS>';
quit
Note. Don’t forget about utf8 and collation.
You might need to create separate users for your server and for the web interface, but in a test environment I will stick with the same one.
And, of course, we need the schema. The schema file, which also includes basic settings and images, can be obtained from the Zabbix server nodes. I would prepare it in advance.
## upload db schema and basic conf:
## create.sql.gz copied from main zabbix server
## located in /usr/share/doc/zabbix-server-mysql-*/create.sql.gz
zcat create.sql.gz | mysql -uzabbix -p zabbix
The next step would be to introduce partitioning, but this is a story for another time.
Note. These basic commands are useful for debugging the DB cluster:
SHOW BINARY LOGS;
SHOW SLAVE STATUS;
show master status\g
RESET MASTER;
## removes all binary log files that are listed in the index file, leaving
## only a single, empty binary log file with a numeric suffix of .000001
RESET MASTER TO 1234;
## same as RESET MASTER, but the new binary log file starts at the given number
PURGE BINARY LOGS BEFORE '2019-10-11 00:20:00';
## Numbering is not reset, may be safely used while replication
## slaves are running.
flush binary logs;
## Closes the current binary log file and opens a new one with the next number
When the DB cluster setup is complete, proceed to create the server cluster.
1. Again, install HA components, and then create the cluster user:
## Install HA components:
yum groupinstall ha -y
## Create user for cluster:
echo <CLUSTER_PASSWORD> | passwd --stdin hacluster
2. Install the Zabbix server daemon, but do not start or enable it; the HA cluster will take care of that.
yum install -y zabbix-server
3. The Zabbix server configuration file is already prepared and can be modified:
## Copy default zabbix_server.conf file:
cp zabbix_server.conf /etc/zabbix/zabbix_server.conf
## and modify accordingly
vi /etc/zabbix/zabbix_server.conf
First of all, change the source IP address — put the VIP address in:
# VIP for zabbix-server cluster
SourceIP=192.168.7.87
For the DBHost, use the VIP address from the DB cluster. And, of course, there is the DB password which must be deployed on all server nodes:
DBHost=192.168.7.89
DBName=zabbix
DBUser=zabbix
DBPassword=<DB_ZABBIX_PASS>
4. Authenticate the nodes:
pcs cluster auth zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3
username: hacluster
password: <CLUSTER_PASSWORD>
5. Create the server cluster:
pcs cluster setup --name zabbix_server_cluster \
zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3 --force
6. Disable STONITH, since fencing is not used:
pcs property set stonith-enabled=false
7. Restart the cluster to reload the certificates and notifications:
pcs cluster stop --all && pcs cluster start --all
8. Again, switch on stickiness so that after a manual migration the VIP address and the Zabbix server daemon stay on the chosen node:
pcs resource defaults resource-stickiness=100
9. Introduce the resources. First, the VIP address:
pcs resource create virtual_ip_server ocf:heartbeat:IPaddr2 ip=192.168.7.87 op monitor interval=5s --group zabbix_server_cluster
Then, the Zabbix server daemon:
pcs resource create ZabbixServer systemd:zabbix-server op monitor interval=10s --group zabbix_server_cluster
Two Zabbix daemons can’t run simultaneously, so you need to make sure that the Zabbix server is online on only one node at a time.
10. Let’s go further. Set up the colocation of resources so that both VIP and server daemon can run only on the same node:
## Add colocation: resources must run on same node:
pcs constraint colocation add virtual_ip_server ZabbixServer INFINITY --force
11. Make sure that the VIP starts before the Zabbix daemon; otherwise, the daemon will crash:
## in specific order:
pcs constraint order virtual_ip_server then ZabbixServer
12. Set the timeout settings for the resources:
## Set start/stop timeout operations
pcs resource op add ZabbixServer start interval=0s timeout=60s
pcs resource op add ZabbixServer stop interval=0s timeout=120s
13. Check the cluster status:
pcs status
You will see the cluster name, stack, and the node on which the resources are running, and, of course, the resources themselves.
Front end cluster
The setup for the front end cluster is similar.
1. Install the HA components, create a user and then install the Zabbix front end. Do not start or enable it manually.
## Install HA components:
yum groupinstall ha -y
## Create user for cluster:
echo <CLUSTER_PASSWORD> | passwd --stdin hacluster
## install zabbix frontend:
yum install -y zabbix-web-mysql
2. Prepare the configuration file for the front end, with VIPs for the server cluster nodes and DB server nodes. Deploy it to all front end nodes in the same location.
## Prepare zabbix-FE config:
cat /etc/zabbix/web/zabbix.conf.php
$DB['TYPE'] = 'MYSQL';
$DB['SERVER'] = '192.168.7.89';
$DB['PORT'] = '0';
$DB['DATABASE'] = 'zabbix';
$DB['USER'] = 'zabbix';
$DB['PASSWORD'] = '<DB_ZABBIX_PASS>';
...
$ZBX_SERVER = '192.168.7.87';
$ZBX_SERVER_PORT = '10051';
$ZBX_SERVER_NAME = 'ZABBIX-HA';
## Deploy to all FE nodes on same location: /etc/zabbix/web/
3. Create a virtual host in Apache to monitor the status of the Apache server itself:
## create resource for apache
## Enable the server-status page.
vi /etc/httpd/conf.d/serverstatus.conf
Listen 127.0.0.1:8080
<VirtualHost 127.0.0.1:8080>
<Location /server-status>
RewriteEngine Off
SetHandler server-status
Allow from 127.0.0.1
Order deny,allow
Deny from all
</Location>
</VirtualHost>
4. Configure Apache to listen on VIP of the front end cluster:
## set apache to listen only on VIP
vi /etc/httpd/conf/httpd.conf "+/Listen 80"
## change to:
...
Listen 192.168.7.88:80
...
Alternatively, you can leave the default Apache listening settings, which include all IP addresses, and use a master-master-master setup, meaning that all nodes will be active. However, that might cause some difficulties.
5. Authenticate the cluster nodes:
pcs cluster auth zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3
username: hacluster
password: <CLUSTER_PASSWORD>
6. Create the cluster:
pcs cluster setup --name zabbix_fe_cluster \
zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3 --force
7. Then restart, and disable STONITH:
pcs cluster stop --all && pcs cluster start --all
pcs property set stonith-enabled=false
8. Introduce the resources. Again, first comes the VIP:
pcs resource create virtual_ip_fe ocf:heartbeat:IPaddr2 ip=192.168.7.88 op monitor interval=5s --group zabbix_fe_cluster
The second resource is the control of the Apache service:
pcs resource create zabbix_fe ocf:heartbeat:apache \
configfile=/etc/httpd/conf/httpd.conf \
statusurl="http://localhost:8080/server-status" op \
monitor interval=30s --group zabbix_fe_cluster
So, whenever we switch to another node, Apache will be started on that node.
9. Configure the colocation (VIP and Apache must run on the same node):
pcs constraint colocation add virtual_ip_fe zabbix_fe INFINITY --force
10. Configure which resource starts first:
## in specific order:
pcs constraint order virtual_ip_fe then zabbix_fe
11. Switch on resource stickiness:
pcs resource defaults resource-stickiness=100
12. Set start/stop timeout operations:
pcs resource op add zabbix_fe start interval=0s timeout=60s
pcs resource op add zabbix_fe stop interval=0s timeout=120s
Now the HA cluster is ready. Simple, right? Well, maybe not quite. Still, this is a bare minimum.
See also: Presentation slides