In preparation for Zabbix Summit 2023, we sat down for a short chat with Rihards Olups, a SaaS Architect at Nokia and a man who has devoted his career to battling the problem of alert fatigue.
Please tell us about yourself and your work.
I primarily work with solutions for monitoring, security, and automation. Apart from that, I enjoy vehicles on two wheels, and I’m fascinated by how we navigate – in particular, I like cognitive sciences, maps, and the wonderful OpenStreetMap project.
How long have you been using Zabbix? What kind of daily Zabbix tasks are you involved in at your company?
I started using Zabbix in 2001, the year it was released. After submitting too many bug reports, I got invited to join the Zabbix team and had the pleasure of working with some fantastic colleagues in the Riga office. In my current day-to-day Zabbix work, I aim to keep things manageable, and I am migrating a highly integrated Zabbix installation from 5.0 to 6.4.3.
Can you give us a sneak peek at what we can expect to hear during your Zabbix Summit speech?
I hope to share a topic that’s been close to my heart for many years – alert noise and alert fatigue. I’ll briefly cover why it’s important, look at some basic technical solutions, and share some large-scale process and cultural approaches that could be helpful.
Do you think it’s possible to prevent unwanted alerts and alert fatigue at an early “design” stage, or is it always a matter of learning about sensitive triggers “the hard way?”
It’s not only possible but very important to think about multiple aspects, including alerting and alert noise, in the design and implementation phases of a project. This can happen on the monitoring and alerting side – triggers, alert aggregators, and processes can be designed to send out fewer alerts in general and to avoid an “alert flood” when a bigger outage happens. It can also happen on the general architecture side, by designing for redundancy so that the reaction to individual component failures can be postponed until business hours.
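To make the alert aggregator idea a little more concrete, here is a minimal Python sketch that folds bursts of alerts from the same service into a single notification. It is an illustration only, not something described in the interview and not a Zabbix feature: the `Alert` class, the `aggregate` function, and the 300-second window are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str      # hypothetical grouping key, e.g. "db" or "web"
    message: str


def aggregate(alerts, window=300):
    """Fold alerts into notifications, one list of alerts per notification.

    An alert joins the current group for its service if it arrives within
    `window` seconds of that group's first alert; otherwise it starts a
    new group. The 300-second default is an arbitrary assumption.
    """
    notifications = []
    current = {}  # service -> the group currently collecting its alerts
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = current.get(alert.service)
        if group is None or alert.timestamp - group[0].timestamp > window:
            group = [alert]
            current[alert.service] = group
            notifications.append(group)
        else:
            group.append(alert)
    return notifications


if __name__ == "__main__":
    burst = [
        Alert(0, "db", "replica lag"),
        Alert(10, "db", "slow queries"),
        Alert(15, "web", "5xx rate"),
        Alert(20, "db", "connection errors"),
        Alert(400, "db", "replica lag"),
    ]
    for group in aggregate(burst):
        print(group[0].service, len(group), "alert(s)")
```

With the sample data, five raw alerts become three notifications. A real setup would also need to decide when a group is sent and how it is summarized, which this sketch leaves out.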
What’s your opinion on proactive monitoring using predictive Zabbix functions? Can people avoid last-minute alerts and alert fatigue that way?
I would say that predictive trigger functions can be confusing for new users, both in the initial setup phase and in further maintenance. They definitely can be a useful tool for experienced teams, and their deterministic (even if often non-obvious) nature makes them fairly reliable when compared to machine learning approaches.
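As a rough illustration of why such functions are deterministic, the Python sketch below fits a straight line through recent samples and extrapolates when a threshold would be crossed, which is the general idea behind Zabbix's `forecast` and `timeleft` trigger functions. It is not Zabbix's implementation; the helper names `linear_fit` and `time_left` and the sample values are assumptions made for this example.

```python
def linear_fit(points):
    """Ordinary least-squares line through (time, value) pairs.

    Returns (slope, intercept).
    """
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("need samples taken at two or more distinct times")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    return slope, mean_v - slope * mean_t


def time_left(points, threshold):
    """Seconds from the last sample until the fitted line hits `threshold`.

    Returns None if the line is flat or the crossing is not in the future.
    """
    slope, intercept = linear_fit(points)
    if slope == 0:
        return None
    remaining = (threshold - intercept) / slope - points[-1][0]
    return remaining if remaining > 0 else None


if __name__ == "__main__":
    # Hypothetical free disk space in GB, sampled once a minute.
    free_space = [(0, 100.0), (60, 90.0), (120, 80.0)]
    print(time_left(free_space, threshold=0.0))
```

The same input always produces the same answer, which is the deterministic behavior mentioned above. The non-obvious part is how much the result depends on the chosen sample window and the type of fit.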