Learn how to build a monitoring system for Ceph Storage using Zabbix, improving visibility into the health of your storage solution and working proactively to identify failure events and performance issues before they impact your applications and even business continuity.

 

Introduction 

Storage prices have been decreasing, business demands are growing fast, and companies are storing more data than ever before. Following this growth, a demand is emerging for monitoring and data protection around software-defined storage solutions. Downtime has a high cost: it can directly impact business continuity and cause irreversible damage to organizations. Some after-effects are loss of assets and information; interruption of services and operations; and violation of laws, regulations, or contracts. Beyond the direct financial impact, it can cost you customers and damage a company’s reputation. Gartner estimates that a minute of downtime costs enterprises $5,600, and an hour more than $300,000. In a DevOps context, it’s also essential to think about Continuous Monitoring: a proactive approach to monitoring throughout the whole application life cycle and its components. This helps to identify the root cause of problems and to work quickly and proactively to prevent performance degradation or future outages. In this article, you will see how to implement monitoring of your storage solution (Ceph) using an enterprise-grade open source tool (Zabbix).

What’s Ceph Storage?

Ceph Storage is open source, software-defined, petabyte-scale distributed storage, designed mainly for cloud workloads. While traditional NAS or SAN storage solutions are based on expensive proprietary hardware, software-defined storage is usually designed to run on commodity hardware, which can make these systems less expensive than traditional storage appliances. Ceph is designed primarily for the following use cases:

  • Storing images and virtual block device storage for an OpenStack environment (using Glance, Cinder, and Nova)
  • Applications that use standard APIs to access object-based storage
  • Persistent storage for containers

According to the Ceph documentation, whether you want to provide object storage and/or block device services to cloud platforms, deploy a file system, or use Ceph for another purpose, all storage cluster deployments begin with setting up each node, your network, and the storage cluster itself. A Ceph storage cluster requires at least one Monitor (ceph-mon), Manager (ceph-mgr), and Object Storage Daemon (ceph-osd). The Metadata Server (ceph-mds) is also required when running Ceph File System clients. These are some of the many components that will be monitored by Zabbix. To learn more about what each component does, check the product documentation.
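
As a quick illustration, each of these daemon types can also be checked individually from any node with admin credentials (ceph -s, shown later in this article, aggregates the same information):

[user@mons-0 ~]$ sudo ceph mon stat   # monitor quorum and membership
[user@mons-0 ~]$ sudo ceph osd stat   # how many OSDs exist and are up/in
[user@mons-0 ~]$ sudo ceph mds stat   # MDS state for the Ceph File System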

This article proposes a lab setup, but if you are planning to do this in production, you should review the hardware and operating system recommendations.

What’s Zabbix and how can it help?

Zabbix is an enterprise-class open source distributed monitoring solution. It monitors numerous network parameters as well as the health and integrity of servers. Zabbix uses a flexible notification mechanism that allows users to configure e-mail alerts for virtually any event, enabling fast reaction to server problems. It offers excellent reporting and data visualization features based on the stored data, which makes it ideal for capacity planning. Zabbix supports both polling and trapping, and all reports and statistics, as well as configuration parameters, are accessed through a web-based frontend, so the status of your network and the health of your servers can be assessed from any location. Properly configured, it can play an important role in monitoring IT infrastructure, for small organizations with a few servers and for large companies with a multitude of servers alike. I won’t be covering the Zabbix installation here, but there’s a great guide and a video in the official documentation.

How did everything start?

Starting with Red Hat Ceph Storage 3 (based on the Luminous release), the Ceph Manager daemon (ceph-mgr) is required for normal operations and runs alongside the monitor daemons to provide additional monitoring and interfaces to external monitoring and management systems. You can also create modules to extend the manager with new features. Here we will use this extension ability through a Zabbix Python module, which is responsible for exporting overall cluster status and performance to the Zabbix server: the central process that performs monitoring, interacts with Zabbix proxies and agents, calculates triggers, sends notifications, and acts as a central repository of data. You can of course still actively collect traditional metrics about your operating systems, but the Zabbix module will start to gather storage-specific status and performance metrics and send them to the Zabbix server.

Here are some examples of available metrics:

  • Ceph performance: I/O operations, bandwidth, latency …
  • Storage utilization and overview
  • OSD status and how many are IN or UP
  • Number of Mons and OSDs
  • Number of Pools and Placement groups
  • Overall Ceph status and much more!

How about my lab environment?

The Ceph cluster installation will not be covered here, but you can find more information about how to do that in the Ceph documentation. My storage cluster was installed using ceph-ansible.

The computing resources used were 12 instances, all with the same configuration (2 CPU cores and 4 GB RAM), as follows:

  • 3 Monitor and Manager nodes (collocated)
  • 3 OSD nodes with 3 disks per node (9 OSDs in total)
  • 2 MDS nodes
  • 2 RADOS Gateway nodes
  • 1 Ansible management node
  • 1 Zabbix server node (Zabbix server, MariaDB server, and Zabbix frontend collocated)

 Figure 1 – Lab topology

The software resources used:

  • Base OS for all instances: Red Hat Enterprise Linux 7.7
  • Cluster Storage nodes: Red Hat Ceph Storage 4.0
  • Management & Automation: Ansible 2.8
  • Monitoring: Zabbix 4.4

With my cluster installed and ready, here are the health, services, and task statuses:

[user@mons-0 ~]$ sudo ceph -s
  cluster:
    id:     7f528221-4110-40d7-84ff-5fbf939dd451
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum mons-1,mons-2,mons-0 (age 37m)
    mgr: mons-0(active, since 3d), standbys: mons-1, mons-2
    mds: cephfs:1 {0=mdss-0=up:active} 1 up:standby
    osd: 9 osds: 9 up (since 35m), 9 in (since 3d)
    rgw: 2 daemons active (rgws-0.rgw0, rgws-1.rgw0)
  task status:
  data:
    pools:   8 pools, 312 pgs
    objects: 248 objects, 6.1 KiB
    usage:   9.1 GiB used, 252 GiB / 261 GiB avail
    pgs:     312 active+clean

How to enable the Zabbix module?

The Zabbix module is included in the ceph-mgr package, and your Ceph cluster must be deployed with the Manager service enabled. You can then enable the Zabbix module with a single command on one of the ceph-mgr nodes:

[user@mons-0 ~]$ sudo ceph mgr module enable zabbix

You can check whether the Zabbix module is enabled with the following command:

[user@mons-0 ~]$ sudo ceph mgr module ls | head -5
{
    "enabled_modules": [
        "dashboard",
        "prometheus",
        "zabbix"

Sending data from the Ceph cluster to Zabbix

This solution uses the Zabbix sender utility, a command-line tool that sends performance data to the Zabbix server for processing. The utility is usually used in long-running user scripts for periodically sending availability and performance data. It can be installed on most distributions using the package manager. For high availability, you should install the zabbix_sender executable on all machines running ceph-mgr.
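
Under the hood, the module invokes zabbix_sender to push each metric to the server as a trapper item. As a rough illustration, here is what an equivalent manual call looks like. This is a sketch only: the server name and identifier match the configuration used later in this article, and the item key ceph.overall_status is assumed from the Ceph template:

[user@mons-0 ~]$ zabbix_sender -z zabbix.lab.example -p 10051 -s "ceph4-cluster-example" -k ceph.overall_status -o HEALTH_OK
# -z Zabbix server, -p trapper port, -s host as named in Zabbix,
# -k item key, -o value being reported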

Let’s enable the Zabbix repository and install zabbix_sender on all Ceph Manager nodes:

[user@mons-0 ~]$ sudo rpm -Uvh https://repo.zabbix.com/zabbix/4.4/rhel/7/x86_64/zabbix-release-4.4-1.el7.noarch.rpm

[user@mons-0 ~]$ sudo yum clean all

[user@mons-0 ~]$ sudo yum install zabbix-sender -y

Alternatively, you can automate this with Ansible and run these commands at once on all three mgr nodes:

[user@mgmt ~]$ ansible mgrs -m command -a "sudo rpm -Uvh https://repo.zabbix.com/zabbix/4.4/rhel/7/x86_64/zabbix-release-4.4-1.el7.noarch.rpm"

[user@mgmt ~]$ ansible mgrs -m command -a "sudo yum clean all"

[user@mgmt ~]$ ansible mgrs -m command -a "sudo yum install zabbix-sender -y"
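
To confirm that the binary landed on every manager node, a quick check against the same mgrs inventory group:

[user@mgmt ~]$ ansible mgrs -m command -a "zabbix_sender -V"   # prints the installed zabbix_sender version on each node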

Configuring the module

Now that you understand how everything works, only a little configuration is needed to make this module work accurately:

  • zabbix_host: The Zabbix server hostname or IP address to which zabbix_sender will send the items as traps.
  • identifier: The Ceph cluster identifier in Zabbix. It controls the identifier/hostname used as the source when sending items to Zabbix, and it should match the name of the host in your Zabbix server. If you don’t configure the identifier parameter, the ceph-<fsid> of the cluster will be used when sending data to Zabbix; for example, ceph-c6d33a98-8e90-790f-bd3a-1d22d8a7d354.

Optionally, there are several other configuration keys, shown here with their default values (an example of overriding one follows the list):

  • zabbix_port: 10051 – The TCP port on which the Zabbix server (trapper) listens
  • zabbix_sender: /usr/bin/zabbix_sender – The default path to the zabbix_sender binary
  • interval: 60 – The interval, in seconds, at which zabbix_sender sends data to the Zabbix server
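
Any of these can be overridden with the same ceph zabbix config-set syntax used in the next section; for example, to set the port explicitly (redundant here, since 10051 is already the default):

[user@mons-0 ~]$ sudo ceph zabbix config-set zabbix_port 10051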

Configuring your keys

Configuration keys can be set on any server with the proper cephx credentials; these are usually the Monitors, where the client.admin key is available.

[user@mons-0 ~]$ sudo ceph zabbix config-set zabbix_host zabbix.lab.example

[user@mons-0 ~]$ sudo ceph zabbix config-set identifier ceph4-cluster-example

[user@mons-0 ~]$ sudo ceph zabbix config-set interval 120

The current configuration of the module can also be shown using the following command:

[user@mons-0 ~]$ sudo ceph zabbix config-show 

{"zabbix_port": 10051, "zabbix_host": "zabbix.lab.example", "identifier": "ceph4-cluster-example", "zabbix_sender": "/usr/bin/zabbix_sender", "interval": 120}

Exploring Zabbix: Templates, Host creation and Dashboard

First of all, it’s time to import your template. In the Zabbix world, a template is a set of entities that can be conveniently applied to multiple hosts. The entities may be items, triggers, graphs, discovery rules, and so on. Your base will be the items: keep in mind that an item is a particular piece of data that you want to receive from a host, an individual metric. When a template is linked to a host, all entities of the template are added to the host. Templates are assigned to each individual host directly.

Download the Zabbix template for Ceph, which is available in the source directory as an XML file. It’s important to download the template file locally in raw mode, or you will have problems importing it in the next step.

[user@mylaptop ~]$ curl https://raw.githubusercontent.com/ceph/ceph/master/src/pybind/mgr/zabbix/zabbix_template.xml -o zabbix_template.xml
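
A quick sanity check that you received the raw XML rather than an HTML page (the first line should be an XML declaration, not an HTML tag):

[user@mylaptop ~]$ head -1 zabbix_template.xml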

To import the template, do the following:

  • Go to: Configuration → Templates
  • Click on Import to the right
  • Select the import file
  • Click on Import

Figure 2 – Importing a Zabbix template

A success or failure message of the import will be displayed in the frontend.

Configure a host in the Zabbix frontend and link it to the newly imported template:

  • Go to: Configuration → Hosts
  • Click on the Create host button to the right
  • Enter the hostname and group
  • Link the Ceph template

Figure 3 – Creating your Ceph cluster host and adding to a group

Hostname and group are required fields. Make sure the host has the same name as the identifier configured in the Ceph config-key parameter. Many groups are available; you can choose one or create a new one. Choose Linux servers for this lab.

In the Templates tab, choose the ceph-mgr Zabbix module template that you imported before, click Select, and then click the Add button.


Figure 4 – Linking Ceph template to the host

Configuration is done. After a few minutes, data should start to appear in the Zabbix web interface under the Monitoring → Latest data menu, and graphs will start to populate for the host. Many triggers are already configured in the template and will send out notifications if you configure your actions and operations.
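
If data does not appear, the log of the active ceph-mgr daemon is the first place to look, since the Zabbix module reports its send failures there. A sketch, assuming the active manager runs on mons-0:

[user@mons-0 ~]$ sudo journalctl -u ceph-mgr@mons-0 | grep -i zabbix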


Figure 5 – Latest data collected by Zabbix

After the data is collected, you can easily create Ceph dashboards like the one below and have fun with Zabbix:


Figure 6 – Zabbix Ceph Dashboard

Kudos:

Renato Puccini and Rodney Beauclair from Red Hat, for their early review and insights.


Bio:

Alessandro Silva works at Red Hat as a Senior Cloud Success Architect, where he is responsible for supporting strategic customers in Latin America with cloud adoption. He’s a Red Hat Certified Architect, an LPIC-3 Security Specialist, and one of the first Zabbix Certified Specialists in Brazil. He’s a Zabbix advocate and has given many presentations at conferences, including Zabbix Conference LatAm 2016, where he presented the Zabbix Security Insights solution. Alessandro is available for connection through his LinkedIn: https://linkedin.com/in/alessandro-silva-236b4b42
