High Availability cluster building with Zabbix for continued service

Zabbix HA Cluster

Zabbix allows building a High Availability solution for IT infrastructure monitoring. Learn how to deploy a simple HA solution to understand the basics of how the components work and interact with each other.

Watch the video from the Zabbix Summit 2019 presentation.

Contents

1. Introduction (5:32:58)
2. Database cluster (5:38:51)
2.1. CLI (5:38:51)
2.2. GUI (5:41:10)
2.3. MariaDB installation (5:42:30)
2.4. Replication setup: node 1 (5:43:52)
2.5. Replication setup: node 2 (5:44:20)
2.6. Replication setup: node 3 (5:45:54)
2.7. Zabbix database preparation (5:47:10)
3. Server cluster (5:48:28)
4. Front end cluster (5:51:16)
5. Conclusion (5:54:05)

Introduction

Hi! I’m Edmunds Vesmanis. As you may already know, I’m responsible for training, and this will be a small training session. You’ll learn how to deploy a very simple HA solution. I call it Zabbix HA cluster 333, which stands for 3 nodes for the database, 3 nodes for the server, and 3 nodes for the front end.

I want to keep things at a basic, starting level. Of course, there are some very sophisticated solutions out there, but my aim is to show the bare minimum to start with.

This HA solution:

  • is tested, common, and time-proven;
  • uses open source components;
  • starts with 3 components (that’s why it is 333).

Here is a plan of what we want to achieve. As I already mentioned, this is a bare minimum; we don't want to make it too complicated, because first you need to understand the basics of how the components work and interact with each other.

As you can see, there are 3 database nodes, 3 server nodes, and 3 front end nodes. For every cluster, there is a virtual IP (VIP) that shows which of the nodes is active at the moment. The nodes will switch automatically if the basic resources die or connections fail. Manual control is also in place to override in case of problems or to perform updates.

The idea is that a user can switch the nodes at any moment via Zabbix. So, if something is wrong or if you simply want to bring down the first server node, you can click on the second one and move resources to it. This can be done from the Zabbix interface.

Moving resources 

What you need is the official out-of-the-box template; no additional scripting is required. All you need to do is add your VIP address and link the default template. Of course, you must have the Zabbix agent installed, and remote commands must be enabled on all nodes.
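For reference, this is what the agent side boils down to. A minimal sketch, assuming a 2019-era agent (newer agents use AllowKey=system.run[*] instead of EnableRemoteCommands) and that the "move resources" action simply calls pcs on the target node:

## /etc/zabbix/zabbix_agentd.conf on every cluster node (sketch):
EnableRemoteCommands=1        # allow the server to run commands on this node
LogRemoteCommands=1           # log executed commands for auditing
Server=192.168.7.87           # VIP of the Zabbix server cluster
ServerActive=192.168.7.87
Hostname=zabbix-ha-db1        # node-specific hostname

## The remote command behind such an action is essentially (assumed example):
pcs resource move zabbix_server_cluster zabbix-ha-srv2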

In order to understand how MySQL replication works, let’s have a look at this circular master-slave setup with 3 nodes.

Replication works through binary logs and can be asynchronous. Let's say an update or a test is required, so the middle node is disabled so it can be worked on. The replication will stop. If the binary log expiry period is set, for example, to three days, and the node is kept down for three days and then reactivated, the replication will catch up and the changes will be pushed to all of the nodes. It doesn't matter which node is disabled; the replication will continue where it left off after reactivation.

How can this be achieved? First of all, prepare the cluster. There will be nine virtual machines, each with a separate IP and a meaningful hostname in the Hosts file. Also, we have 3 VIPs with hostnames assigned to them to make things simple.

Note. Remember, this is a bare minimum that you want to be robust, so you don't want to rely on other systems such as DNS or DHCP.
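A hosts file along these lines on every VM keeps the setup independent of DNS. The node addresses below are examples (only the three VIPs, 192.168.7.87, .88, and .89, are used throughout this article; the VIP hostnames are assumptions):

## /etc/hosts (sketch)
192.168.7.96   zabbix-ha-db1
192.168.7.97   zabbix-ha-db2
192.168.7.98   zabbix-ha-db3
192.168.7.91   zabbix-ha-srv1
192.168.7.92   zabbix-ha-srv2
192.168.7.93   zabbix-ha-srv3
192.168.7.81   zabbix-ha-fe1
192.168.7.82   zabbix-ha-fe2
192.168.7.83   zabbix-ha-fe3
192.168.7.89   zabbix-db-vip
192.168.7.87   zabbix-srv-vip
192.168.7.88   zabbix-fe-vip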

Then, prepare each VM by setting (a command sketch follows this list):

  • clock synchronization;
  • localization;
  • firewall (although at first you might want to switch it off);
  • SELinux (always a troublemaker, so we just disable it);
  • Hosts file;
  • storage (better to use separate block devices for DB, logs, apps, and configuration);
  • Zabbix Agent on all nodes (enable remote commands, set proper IP addresses).
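A minimal sketch of that preparation, assuming CentOS/RHEL 7 and that the Zabbix repository is already configured:

## Run on every VM (sketch):
yum install -y chrony zabbix-agent
timedatectl set-timezone Europe/Riga          ## pick your timezone
systemctl enable --now chronyd                ## clock synchronization
localectl set-locale LANG=en_US.UTF-8         ## localization
systemctl disable --now firewalld             ## or keep it and open the needed ports
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config && setenforce 0
systemctl enable --now zabbix-agent           ## remote commands enabled as shown earlier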

Database cluster

Start by creating the database cluster. It can be done in the CLI as well as in the GUI.

CLI

1. Install all HA components with either of these commands:

## Install HA components:
yum groupinstall 'High Availability' -y
## OR:
yum groupinstall ha -y

2. Set a secure password for the hacluster user (the user itself is created when the HA packages are installed):

## Set the password for the cluster user:
echo <CLUSTER_PASSWORD> | passwd --stdin hacluster

3. Once done on every node, authenticate the nodes using the same password:

# Authenticate cluster nodes:
pcs cluster auth zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3
Username: hacluster
Password: <CLUSTER_PASSWORD>

zabbix-ha-db1: Authorized
zabbix-ha-db2: Authorized
zabbix-ha-db3: Authorized

Cluster node authentication

The next steps will be done only on one node — it doesn’t matter which one because the nodes will synchronize.

1. Create the database cluster and add resources. In this bare minimum setup, our only resource is a VIP address for the DB cluster:

# Create zabbix-db-cluster:
pcs cluster setup --name zabbix_db_cluster \
zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3 --force

## Create resource for cluster virtual IP (VIP):
pcs resource create virtual_ip ocf:heartbeat:IPaddr2 \
ip=192.168.7.89 op monitor interval=5s --group zabbix_db_cluster

2. When it’s done, check to see if there are any problems:

## check:
pcs status

Usually, there will be some problems. To fix them, the cluster should be stopped and restarted so that the nodes can resynchronize and the authentication tokens can update:

## Restart cluster services in case of:
## “cluster is not currently running on this node” error
pcs cluster stop --all && pcs cluster start --all

3. If a firewall is used, you will need to add an exception for the HA cluster:

# in case you have a firewall:
firewall-cmd --permanent --add-service=high-availability && firewall-cmd --reload

4. Prevent resources from moving after recovery:

## Prevent Resources from Moving after Recovery
pcs resource defaults resource-stickiness=100

This tells the cluster that once a resource has been moved to another node, it should stay there. By default, a resource runs on one node, migrates to another one if something happens, and jumps back as soon as the problem goes away, which is not what we want here.

5. Then, disable STONITH (Shoot The Other Node In The Head). This is mandatory for this setup, otherwise you won’t be able to start your resources.

## if you are not using fencing disable STONITH:
pcs property set stonith-enabled=false

GUI

Even I didn't know about it for some time, but there is a GUI. It is really helpful when you need to see what was done a year or two ago and what is happening now, because you don't do these setups every day.

GUI

The GUI can be used to create clusters. You need to enter the hostnames and ports, and some advanced options are available too.

Cluster creation in GUI

If you already have a cluster, you can navigate to your GUI and just add one of the nodes from the existing cluster.

Adding an existing cluster

In the GUI, you can immediately see the composition of your cluster, whether the components are connected, and the uptime. You have control, and you don’t need to type anymore.

Editing a node

Note. More information on how to configure a High Availability cluster with the pcsd Web UI can be found here.
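If you want to try it, the pcsd daemon that ships with the HA group serves the web UI on TCP port 2224 of every node; log in with the hacluster user:

## Enable the pcsd daemon and open the web UI:
systemctl enable --now pcsd
## then browse to https://zabbix-ha-db1:2224 and log in as hacluster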

MariaDB installation

In this scenario, we use MariaDB, so we proceed with MariaDB installation.

1. Install the MariaDB server on all DB nodes:

## install MariaDB server on all 3 DB nodes:
yum install mariadb-server -y

2. Configure the DB settings:

## tune/configure db settings:
cp ./zabbixdb.cnf /etc/my.cnf.d/

3. Start and enable MariaDB:

## Start and enable to start on boot:
systemctl start mariadb
systemctl enable mariadb

4. Secure the installation with a password:

## secure your installation and create <MYSQL_ROOT_PASSWORD>:
mysql_secure_installation

In a test environment tuning isn't strictly required, but here is my configuration file example with basic settings for the Zabbix DB and the binary logs:

cat zabbixdb.cnf
[mysqld]
# ZABBIX specific settings and tuning
default-storage-engine          = InnoDB
innodb                          = FORCE
innodb_file_per_table           = 1
innodb_buffer_pool_size         = 512M           # 50-75% of total RAM
innodb_buffer_pool_instances    = 8            # For MySQL 5.5 - 4, for 5.6+ - 8
innodb_flush_log_at_trx_commit  = 2
innodb_flush_method             = O_DIRECT
innodb_io_capacity              = 800           # HDD disks 500-800,    SSD disks - 2000
sync-binlog                     = 0
query-cache-size                = 0
server_id                       = 96            # for id settings IPs last number used
report_host                     = zabbix-ha-db1
log-slave-updates
log_bin                         = /var/lib/mysql/log-bin
log_bin_index                   = /var/lib/mysql/log-bin.index
relay_log                       = /var/lib/mysql/relay-bin
relay_log_index                 = /var/lib/mysql/relay-bin.index
binlog_format                   = mixed
binlog_cache_size               = 64M
max_binlog_size                 = 1G
expire_logs_days                = 5
binlog_checksum                 = crc32
max_allowed_packet              = 500M

As you can see, my logs will expire in five days, so I can bring a node down for five days and the replication will still catch up afterwards. Of course, sufficient storage is required, but this functionality is still very useful.
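A quick sanity check (a sketch; the log path matches the configuration above) of the expiry setting and of how much space the binary logs currently take:

## Check binary logging and its disk usage:
mysql -uroot -p<MYSQL_ROOT_PASSWORD> -e "SHOW VARIABLES LIKE 'log_bin'; SHOW VARIABLES LIKE 'expire_logs_days';"
du -sch /var/lib/mysql/log-bin.*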

5. Deploy the configuration file on all nodes, adapting the server ID and hostname on every single node:

## Must be set on every DB node accordingly

vi /etc/my.cnf.d/zabbixdb.cnf
server_id                       = 96                ## Last number of the node's IP
report_host                     = zabbix-ha-db1     ## Node hostname

Replication setup: node 1

Replication is the trickiest part. I always keep my plan on a separate monitor to make sure it is properly set up. Start with node 1.

1. Log in to MySQL:

## Log in to MySQL:
mysql -uroot -p<MYSQL_ROOT_PASSWORD>

2. Stop the slave:

MariaDB [(none)]> STOP SLAVE;

3. Grant the replication privilege to the user providing the IP of node 2:

MariaDB [(none)]> GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'<NODE2_IP>' IDENTIFIED BY '<REPLICATOR_PASSWORD>';

Then, show the master status:

MariaDB [(none)]> SHOW MASTER STATUS\G

Node 1 master status

Save the details about the log-bin file and its position for later.
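For reference, the output looks roughly like this (illustrative; the file name and position on your node may differ, and these two values are what get plugged into CHANGE MASTER TO on the next node):

MariaDB [(none)]> SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: log-bin.000001
        Position: 245
    Binlog_Do_DB:
Binlog_Ignore_DB: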

Replication setup: node 2

4. Then, go to node 2. Again, log in, stop the slave, then change the master:

## Log in to MySQL: 
mysql -uroot -p<MYSQL_ROOT_PASSWORD>

STOP SLAVE;

CHANGE MASTER TO MASTER_HOST = '<NODE1_IP>', MASTER_USER = 'replicator', MASTER_PASSWORD = '<REPLICATOR_PASSWORD>', MASTER_LOG_FILE = 'log-bin.000001', MASTER_LOG_POS = 245;

This command introduces the master to node 2, which becomes the slave of node 1. The master_log_file and its position saved in the previous step are also specified here.

5. Grant the replication slave privilege to node 3 identified by some secure password:

GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'<NODE3_IP>' IDENTIFIED BY '<REPLICATOR_PASSWORD>';

6. At this point you can reset the master on node 2:

RESET MASTER;

7. Then, start the slave and, again, get the slave status:

START SLAVE;
SHOW SLAVE STATUS\G

Node 2 slave status

As you can see, there are 0 errors. Also, if it says “Waiting for master to send event”, it means that the setup was successful. Otherwise, some debugging might be required; usually the cause is a wrong IP address or DNS name.
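These are the fields worth checking in the SHOW SLAVE STATUS\G output (illustrative excerpt):

Slave_IO_State: Waiting for master to send event
Master_Host: <NODE1_IP>
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_IO_Errno: 0
Last_SQL_Errno: 0
Seconds_Behind_Master: 0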

8. Proceed with node 2 and get the master status, the log-bin file, and its position.

Node 2 master status

If you reset the master on node 2, the position will be the same as it is for node 1.

Replication setup: node 3

9. Repeat the steps for node 3:

## Log in to MySQL: 
mysql -uroot -p<MYSQL_ROOT_PASSWORD>

STOP SLAVE;

CHANGE MASTER TO MASTER_HOST = '<NODE2_IP>', MASTER_USER = 'replicator', MASTER_PASSWORD = '<REPLICATOR_PASSWORD>', MASTER_LOG_FILE = 'log-bin.000001', MASTER_LOG_POS = 245;

GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'<NODE1_IP>' IDENTIFIED BY '<REPLICATOR_PASSWORD>';

RESET MASTER;
START SLAVE;

10. Set up node 1 as the slave for node 3. Check the slave status:

Node 3 slave status

If it says “Waiting for master to send event”, we are done with this one.

11. Show the master status on node 3:

Node 3 master status

12. Use the same commands for node 1:

STOP SLAVE;
CHANGE MASTER TO MASTER_HOST = '<NODE3_IP>', MASTER_USER = 'replicator', MASTER_PASSWORD = '<REPLICATOR_PASSWORD>', MASTER_LOG_FILE = 'log-bin.000001', MASTER_LOG_POS = 245;
START SLAVE;

SHOW SLAVE STATUS\G

This is how the circular setup is done. Now you can go to any of the three nodes, do whatever SQL queries you want, and they will be replicated to all the other nodes.
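A quick smoke test of the circle (a sketch; the test database name is arbitrary):

## On node 1:
mysql -uroot -p<MYSQL_ROOT_PASSWORD> -e "CREATE DATABASE repl_test;"
## On node 2 and node 3 the database should appear:
mysql -uroot -p<MYSQL_ROOT_PASSWORD> -e "SHOW DATABASES LIKE 'repl_test';"
## Clean up (the drop is replicated as well):
mysql -uroot -p<MYSQL_ROOT_PASSWORD> -e "DROP DATABASE repl_test;"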

Zabbix database preparation

The next step is to create a Zabbix database and a user:

## Log in to MySQL and create the Zabbix DB and user:
create database zabbix character set utf8 collate utf8_bin;

grant all privileges on zabbix.* to zabbix@'%' identified by '<DB_ZABBIX_PASS>';
quit

Note. Don’t forget about utf8 and collation.

You might need to create separate users for your server and for the web interface, but in a test environment I will stick with the same one.
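If you do want separate users, the grants would look roughly like this (a sketch; the user names and password placeholders are assumptions):

grant all privileges on zabbix.* to 'zabbix_srv'@'%' identified by '<DB_SRV_PASS>';
grant all privileges on zabbix.* to 'zabbix_web'@'%' identified by '<DB_WEB_PASS>';
flush privileges;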

And, of course, we need the schema. The basic settings and images are included in this file, which can be obtained from the Zabbix server nodes. I would prepare it in advance.

## upload db schema and basic conf: 
## create.sql.gz copied from main zabbix server 
## located in /usr/share/doc/zabbix-server-mysql-*/create.sql.gz
zcat create.sql.gz | mysql -uzabbix -p zabbix

The next step would be to introduce partitioning, but this is a story for another time.

Note. These basic commands are useful for debugging the DB cluster:

SHOW BINARY LOGS;

SHOW SLAVE STATUS\G

SHOW MASTER STATUS\G

RESET MASTER;           ## removes all binary log files listed in the index file, leaving only
                        ## a single, empty binary log file with the numeric suffix .000001

RESET MASTER TO 1234;   ## start binary log numbering from a specific file number

PURGE BINARY LOGS BEFORE '2019-10-11 00:20:00';
                        ## numbering is not reset; may be safely used while replication
                        ## slaves are running

FLUSH BINARY LOGS;      ## closes the current binary log file and starts a new one with the next number

Server cluster

When the DB cluster setup is complete, proceed to create the server cluster.

1. Again, install HA components, and then create the cluster user:

## Install HA components: 
yum groupinstall ha -y

## Create user for cluster:
echo zabbix123 | passwd --stdin hacluster

2. Install the Zabbix daemon binaries. But you do not need to start or enable it — HA will take care of that.

yum install -y zabbix-server-mysql

3. Copy the prepared Zabbix server configuration file into place and modify it:

## Copy default zabbix_server.conf file:
cp zabbix_server.conf /etc/zabbix/zabbix_server.conf

## and modify accordingly:
vi /etc/zabbix/zabbix_server.conf

First of all, change the source IP address — put the VIP address in:

SourceIP=192.168.7.87 #VIP for zabbix-server cluster

For DBHost, use the VIP address of the DB cluster. And, of course, there is the DB password; the same configuration must be deployed on all server nodes:

DBHost=192.168.7.89
DBName=zabbix
DBUser=zabbix
DBPassword=<DB_ZABBIX_PASS>

4. Authenticate the nodes:

pcs cluster auth zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3
Username: hacluster
Password: <CLUSTER_PASSWORD>

5. Create the server cluster:

pcs cluster setup --name zabbix_server_cluster \
zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3 --force

6. Disable STONITH, since fencing is not used:

pcs property set stonith-enabled=false

7. Restart the cluster to reload the certificates and notifications:

pcs cluster stop --all && pcs cluster start --all

8. Again, switch on resource stickiness so that after a manual migration the VIP address and the Zabbix server daemon stay on the node they were moved to:

pcs resource defaults resource-stickiness=100

9. Introduce the resources. First, the VIP address:

pcs resource create virtual_ip_server ocf:heartbeat:IPaddr2 ip=192.168.7.87 op monitor interval=5s --group zabbix_server_cluster

Then, the Zabbix server daemon:

pcs resource create ZabbixServer systemd:zabbix-server op monitor interval=10s --group zabbix_server_cluster

Two Zabbix server daemons can't run simultaneously, which is why you need to make sure that the Zabbix server is online on only one node at a time.

10. Let's go further. Set up colocation of the resources so that the VIP and the server daemon always run on the same node:

## Add colocation: resources must run on the same node:
pcs constraint colocation add ZabbixServer with virtual_ip_server INFINITY

11. Make sure that the VIP starts before the Zabbix daemon, otherwise it will crash:

## in specific order:
pcs constraint order virtual_ip_server then ZabbixServer

12. Set the timeout settings for the resources:

## Set start/stop timeout operations
pcs resource op add ZabbixServer start interval=0s timeout=60s
pcs resource op add ZabbixServer stop interval=0s timeout=120s

13. Check the cluster status:

pcs status

You will see the cluster name, the stack, the node on which the resources are running, and, of course, the resources themselves.

Checking the cluster
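For reference, the output looks roughly like this (illustrative; hostnames, counts, and versions will differ):

Cluster name: zabbix_server_cluster
Stack: corosync
Current DC: zabbix-ha-srv1 - partition with quorum

3 nodes configured
2 resources configured

Online: [ zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3 ]

Full list of resources:

 Resource Group: zabbix_server_cluster
     virtual_ip_server  (ocf::heartbeat:IPaddr2):   Started zabbix-ha-srv1
     ZabbixServer       (systemd:zabbix-server):    Started zabbix-ha-srv1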

Front end cluster

The setup for the front end cluster is similar.

1. Install the HA components, create a user and then install the Zabbix front end. Do not start or enable it manually.

## Install HA components: 
yum groupinstall ha -y

## Create user for cluster:
echo zabbix123 | passwd --stdin hacluster

## install zabbix frontend:
yum install -y zabbix-web-mysql

2. Prepare the configuration file for the front end with the VIPs of the server cluster and of the DB cluster. Deploy it to all front end nodes in the same location.

## Prepare zabbix-FE config:
cat /etc/zabbix/web/zabbix.conf.php 
$DB['TYPE']     = 'MYSQL';
$DB['SERVER']   = '192.168.7.89';
$DB['PORT']     = '0';
$DB['DATABASE'] = 'zabbix';
$DB['USER']     = 'zabbix';
$DB['PASSWORD'] = '<DB_ZABBIX_PASS>';
...
$ZBX_SERVER      = '192.168.7.87';
$ZBX_SERVER_PORT = '10051';
$ZBX_SERVER_NAME = 'ZABBIX-HA';

## Deploy to all FE nodes on same location: /etc/zabbix/web/ 

3. Create a virtual host in Apache to monitor the status of the Apache server itself:

## Create a resource for Apache: enable the server-status page

vi /etc/httpd/conf.d/serverstatus.conf

Listen 127.0.0.1:8080
<VirtualHost 127.0.0.1:8080>
    <Location /server-status>
        RewriteEngine Off
        SetHandler server-status
        Allow from 127.0.0.1
        Order deny,allow
        Deny from all
    </Location>
</VirtualHost>
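Once the cluster has started Apache on a node, a quick check (a sketch) that the status page responds, which is what the cluster's apache resource agent polls:

## Should return the Apache server-status page:
curl -s http://127.0.0.1:8080/server-status | head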

4. Configure Apache to listen on VIP of the front end cluster:

## set apache to listen only on VIP

vi /etc/httpd/conf/httpd.conf +/Listen 80

## change to:
...
Listen 192.168.7.88:80
...

Alternatively, you can leave the default Apache listening settings, which include all IP addresses, and use a master-master-master setup, meaning that all nodes will be active. However, that might cause some difficulties.

5. Authenticate the cluster nodes:

pcs cluster auth zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3
Username: hacluster
Password: <CLUSTER_PASSWORD>

6. Create the cluster:

pcs cluster setup --name zabbix_fe_cluster \
zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3 --force

7. Then restart, and disable STONITH:

pcs cluster stop --all && pcs cluster start --all

pcs property set stonith-enabled=false

8. Introduce the resources. Again, first comes the VIP:

pcs resource create virtual_ip_fe ocf:heartbeat:IPaddr2 ip=192.168.7.88 op monitor interval=5s --group zabbix_fe_cluster

The second resource is the control of the Apache service:

pcs resource create zabbix_fe ocf:heartbeat:apache \
configfile=/etc/httpd/conf/httpd.conf \
statusurl="http://localhost:8080/server-status" op \
monitor interval=30s --group zabbix_fe_cluster

So, whenever we switch to the next node, Apache is started on that node.

9. Configure the colocation (VIP and Apache must run on the same node):

pcs constraint colocation add zabbix_fe with virtual_ip_fe INFINITY

10. Configure which resource starts first:

## in specific order:
pcs constraint order virtual_ip_fe then zabbix_fe

11. Switch on resource stickiness:

pcs resource defaults resource-stickiness=100

12. Set start/stop timeout operations:

pcs resource op add zabbix_fe start interval=0s timeout=60s
pcs resource op add zabbix_fe stop interval=0s timeout=120s

Conclusion

Now the HA cluster is ready. Simple, right? Well, maybe not quite. Still, this is a bare minimum.
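As a final sanity check, you can rehearse a failover manually. A sketch, run on any node of the server cluster (the same works for the DB and front end clusters):

## Put the active node into standby and watch the resources move:
pcs cluster standby zabbix-ha-srv1
pcs status
## Bring the node back; thanks to resource stickiness the resources stay where they are:
pcs cluster unstandby zabbix-ha-srv1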

Red Hat documentation and ClusterLabs will be useful if you need information on this topic. You could also consult with the Zabbix sales team.

See also: Presentation slides

Author: Edmunds Vesmanis

Zabbix Certified Expert & Trainer
