New OverOps feature: Critical error alerts via email, Slack, HipChat, and PagerDuty
We’re pretty excited this week, as we just launched the new alerts engine we’ve been working on over the last couple of months. The goal is to let you get notified about the most critical errors in real time, and fix them before they affect your users.
In this post, I want to share some of the rationale that guided us in building this feature, and show you some of the things it can do.
— Takipi (@takipid) June 1, 2016
When we listened to our customers, developers and ops teams alike, alerts came up as one of the most requested features. Folks wanted to be alerted on errors in context: “Let me know if a new deployment introduced new errors, or alert us when the number of logged errors or uncaught exceptions exceeds a certain threshold, so we can fix it in real time.” So we went ahead and built it.
The main idea behind this feature was to let each user customize when and how they receive alerts. Naturally, a dev team lead and a DevOps engineer might have different requirements and different things they want to be alerted on. We wanted to allow for maximal personalization while keeping the UI simple and intuitive, building on top of our new views dashboard.
The new views pane in the dashboard allows users to cut through the noise, and focus only on the events that they’re interested in. For example, if you’ve deployed a new version, and want to see whether the new deployment introduced new errors that need to be dealt with, just click on “new today”.
If you’re only interested in critical uncaught exceptions, there’s a view for that as well, and you can add customized views on top of that. Our team, for example, created a dedicated view for NullPointerExceptions, as we hate those and want to be able to zoom in on them without noise.
The alerting mechanism uses these views to send contextual alerts for each use case. If you only want to be alerted on errors from new deployments, on NullPointerExceptions, or on uncaught exceptions, you can just click the bell icon next to that view. You can set alerts and thresholds for any view and, equally important, decide how you want to be notified.
Common Use Cases
The alert settings dialog consists of two sections: setting alerts, and configuring how to receive them. OverOps integrates out of the box with a variety of messaging and alerting tools, including Slack, HipChat and PagerDuty.
For each of the predefined or custom views in my dashboard, OverOps can be set to send two types of notifications: alert on any new error, or when the total number of errors in the view exceeds a target threshold within a rolling hour window.
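To make the first notification type concrete, here is a minimal Python sketch of "alert on any new error" logic. It is not OverOps's implementation: the `NewErrorAlert` class, the `observe` method, and the idea of identifying an error by a string fingerprint are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NewErrorAlert:
    """Hypothetical rule: fire once for each error not seen before in a view."""

    view_name: str
    seen: set = field(default_factory=set)

    def observe(self, error_fingerprint: str) -> bool:
        """Return True only the first time a given fingerprint shows up."""
        if error_fingerprint in self.seen:
            return False
        self.seen.add(error_fingerprint)
        return True


# Assumed fingerprint format (exception type plus location), for illustration only.
rule = NewErrorAlert(view_name="new today")
events = [
    "NullPointerException@OrderService.submit",
    "NullPointerException@OrderService.submit",
    "IllegalStateException@CartService.checkout",
]
fired = [rule.observe(event) for event in events]
```

In this sketch the repeated NullPointerException does not fire a second alert, because only the first occurrence of each fingerprint counts as new.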
Let’s say we’re deploying a major version. We’d probably want to get notified if this deployment creates any new errors (Scenario 1).
On the other hand, even if that deployment has not introduced any new types of errors, something may still have gone wrong: the logged errors and warnings that already existed in the system (part of the control flow) may have spiked or significantly increased in rate (Scenario 2). In that case, we’d also want to take a deeper look into what caused the spike, something that looks like this graph:
Let’s examine how we can address both of these scenarios:
1. To set up an alert for any new errors introduced today, I can click on the bell icon next to the “new today” view, and enable it:
This will open the new alerts dialog. Here I can choose how I want to receive this alert. Since new errors are critical for us, I choose to have them sent via email, Slack, HipChat and PagerDuty, so there’s no way we’d miss them.
Now, we’ll be alerted on each new error introduced by a new deployment. We can also group these alerts into email digests to avoid multiple alerts:
2. To support our second scenario, we can use the new thresholds feature. Let’s say our baseline for logged warnings is 1,000 messages per hour, but if it passes 2,500 messages in a rolling hour window, it means something went wrong.
In this case we can simply go to the “Log warnings” view in the alerts dialog, and set the hourly threshold to 2,500. Since this alert may not be critical for us, we choose to receive it only on our shared Slack channel:
Here’s what the alert would look like if that threshold is exceeded:
Clicking on it will take me to the exact timeframe in which these errors spiked, so I can zoom in on exactly what caused it. I can use the Split function to see the actual errors and transactions contributing to the spike.
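For readers curious how a rolling hour threshold behaves, the following Python sketch counts events inside a sliding 3,600-second window and reports when the count goes above 2,500. It only illustrates the general technique and is not how OverOps implements it: the `RollingThresholdAlert` class and its `record` method are hypothetical, timestamps are assumed to arrive in non-decreasing order, and treating "exceeds" as a strict comparison is also an assumption.

```python
from collections import deque


class RollingThresholdAlert:
    """Hypothetical rule: fire when events in a sliding window exceed a threshold."""

    def __init__(self, threshold: int, window_seconds: int = 3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one event; return True if the window count now exceeds the threshold.

        Assumes timestamps are passed in non-decreasing order.
        """
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the rolling window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


alert = RollingThresholdAlert(threshold=2500)
# 2,500 warnings within one hour sit exactly at the threshold, so nothing fires.
results = [alert.record(float(t)) for t in range(2500)]
# One more warning inside the same hour brings the count to 2,501.
tipped = alert.record(2500.0)
```

Because the window slides with each event rather than resetting on the hour, a burst that straddles two clock hours is still counted together. A production system would also need to suppress repeat notifications while the count stays above the threshold, which this sketch leaves out.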
In the same manner, I can create my own customized views and be alerted whenever new errors are introduced into them, or when the errors in them exceed a certain threshold within a rolling hour window. For example, we added a dedicated view only for NullPointerExceptions, since we have a zero tolerance policy for those. Then we set a threshold of 1, so that every time a NullPointerException is thrown in our production environment, we’ll get alerted:
This feature has a lot of neat little tricks and customizations. For further reading click here. Have any feedback? Think we should add some more capabilities to this feature? Feel free to leave us a note here, or email me at firstname.lastname@example.org.