Waiting for your code to compile? Trying to find the root cause of the issue you've just encountered? Here are a few GIFs that can help you cope with the situation.
— OverOps (@overopshq) May 10, 2018
Picture the following scenario. It’s Friday morning and you’ve just sat down with your hot cup of Java, ready to write some code. You fire up your environment, and suddenly you receive a call from support.
1. Something is wrong with the last deployment, and the call center is flooded with complaints.
2. Denial kicks in. You thought you had everything under control, only to see that you were wrong
3. At this point, you go through the logs trying to find what happened in the code that caused an exception to be thrown
4. Which, if you ask us, looks more like this:
5. At this point, you try to remember that the most important thing you need to do is keep calm and debug on
6. You find (what you think is the) root cause of the issue, apply a fix or a workaround, deploy it and hope it’ll work
7. And usually, you’ll have to go deeper into the logs to try and find the actual root cause
8. But what happens if you didn’t take the time to log the issue in the first place?
9. And now you just don’t know what’s going on
10. It’s getting late, and at this point, you realize you’ve spent half of your workday trying to track this one issue
11. Or you do manage to find and fix the issue in your staging environment, but production is a whole different story
12. And remember, it IS a Friday. Deploying a fix might mean spending your weekend checking that everything is working as it should
There must be a better way, right?
And there is, Kevin!
Let’s start with the basics – log files… suck. Counting on them to find issues in the code is like… counting on the cats to not push cups off the shelf.
To be more specific, at least 20% of exceptions that occur in production will never make it to the logs at all. Add on top of that the fact that 63% of logging statements aren’t running in production, and more than 50% of logging statements don’t include ANY information about the variable state at the time of an error. Yikes.
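To make the "variable state" point concrete, here is a minimal sketch of a logging statement that does record the values in play when an error occurs. It uses Python's standard `logging` module; the `apply_discount` function, the `orders` logger name, and the parameters are hypothetical and exist only for illustration.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def apply_discount(order_id, total, discount_pct):
    """Hypothetical business function used only to illustrate the pattern."""
    if not 0 <= discount_pct <= 100:
        raise ValueError("discount_pct out of range")
    return total * (1 - discount_pct / 100)


def safe_apply_discount(order_id, total, discount_pct):
    try:
        return apply_discount(order_id, total, discount_pct)
    except ValueError:
        # logger.exception logs at ERROR level and appends the stack trace.
        # The message itself captures the variable state at the time of the
        # failure, which a bare "something went wrong" message would lose.
        logger.exception(
            "apply_discount failed: order_id=%s total=%s discount_pct=%s",
            order_id, total, discount_pct,
        )
        return None
```

Even a statement like this only helps if it actually runs in production and the exception is caught somewhere that logs it, which is exactly where the numbers above say things tend to break down.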
Why does this matter? If not caught in time, these issues will affect your customers and users. That can lead to frustration, complaints, and angry tweets, and your team will have to spend 20% of their work week trying to find what went wrong and how to fix it.
Instead, you want to focus on building a reliable product and improving your developers’ productivity by cutting down the time it takes to detect the root cause of issues. And you’re not the only one who feels this way. Companies like TripAdvisor, Intuit, Comcast, Zynga, and others have also made the shift towards product reliability by automating their error resolution workflows.
In return, these companies were able to cut the time spent identifying and resolving production errors from days to mere minutes. If you too want to add valuable automation to your workflow, you can read our complete guide to root cause analysis.