We all want to find the root cause of various issues and bugs in our system, but do we know what we’re looking for?
Whenever an error arises, the top priority on everyone’s mind is finding its root cause and solving it, and that’s how the following post came to be. We went on a journey to understand what the ingredients of a real root cause are and how to find them. Let’s check out what can make or break your application.
— OverOps (@overopshq) December 14, 2017
What do you mean when you say root cause?
The definition of root cause states that it’s an initiating cause of either a condition or a causal chain that leads to an outcome or effect of interest. On the practical side, root cause is the element that, when fixed or removed, should prevent the issue from recurring.
This concept stays the same when it comes to our application, and finding the root cause is critical to understanding the issues we’re facing. Errors and exceptions can throw a lot of information at us (that is, if we took the time to log them), and it’s our job to solve the puzzle and understand what happened.
What should a root cause include?
The term root cause is often used in the monitoring and error resolution ecosystem to describe the source of the issues and errors that arise in our application. The information on offer varies from tool to tool, leaving us with a lot of data to process and sift through to find that desired root cause.
A valuable root cause should hold the answers that’ll help us prioritize, analyze and solve issues easily. We’ve narrowed it down to the top 3 questions that, if answered, will give you a complete overview of what happened in your application:
- When did this issue happen?
- Where did the error occur within the application?
- Why did it happen in the first place?
When searching for the answer, each question should focus on a number of elements that will guide us on the right path towards solving the issue.
To sum it up:
- When – Know errors are happening before customers report them
- Where – Route errors to the developer who is responsible for solving them
- Why – Gather the data needed to solve the error
Now, let’s try to see how we can answer these questions and get the relevant information needed to identify any sneaky root cause.
1. When did the issue happen?
The first thing we want to know is when a new issue has been introduced into our system, and we want to know it as soon as it happens. It’s critical for us to discover issues before they affect our customers.
Since we’re all realistic folks around here, we know that errors are something that occurs all the time in every application and environment. As a matter of fact, most developers waste a lot of time chasing after issues (25% of our time on average, to be exact).
That’s why we should also look at the rate of every issue and its trend over time, so we can separate the wheat from the chaff and tell whether an issue is critical or not. We also want to know when the issue was first introduced, and whether we need to solve it immediately or if it can wait a little longer.
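To make the "when" a bit more concrete, here is a minimal Java sketch that remembers when each error was first seen and compares its count in the current time window with the previous one. The class `ErrorRateTracker`, the error key format and the `factor` threshold are all assumptions made up for illustration; this is not how any particular monitoring tool does it.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: tracks first-seen time and a simple two-window trend per error.
public class ErrorRateTracker {

    // Per-error statistics: first sighting plus counts for two adjacent windows.
    private static final class Stats {
        final Instant firstSeen;
        long previousWindow;
        long currentWindow;

        Stats(Instant firstSeen) {
            this.firstSeen = firstSeen;
        }
    }

    private final Map<String, Stats> statsByError = new HashMap<>();

    // Records one occurrence; returns true if this error has never been seen before.
    public boolean record(String errorKey, Instant when) {
        Stats stats = statsByError.get(errorKey);
        boolean isNew = (stats == null);
        if (isNew) {
            stats = new Stats(when);
            statsByError.put(errorKey, stats);
        }
        stats.currentWindow++;
        return isNew;
    }

    // Call at each window boundary (for example, once an hour).
    public void rollWindow() {
        for (Stats stats : statsByError.values()) {
            stats.previousWindow = stats.currentWindow;
            stats.currentWindow = 0;
        }
    }

    // True if the current window has more than `factor` times the previous window's count.
    public boolean isRising(String errorKey, double factor) {
        Stats stats = statsByError.get(errorKey);
        return stats != null && stats.currentWindow > stats.previousWindow * factor;
    }

    // When the error was first recorded, or null if it was never seen.
    public Instant firstSeen(String errorKey) {
        Stats stats = statsByError.get(errorKey);
        return stats == null ? null : stats.firstSeen;
    }
}
```

A caller would invoke `record` from its error handler and alert when it returns `true` (a brand new error) or when `isRising` reports a jump. A real setup would also need thread safety and persistence, which this sketch leaves out.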
2. Where did the error occur within the application?
Once we know something has happened, it’s time to answer the following question: where did this issue happen within the code? It could be a new feature or deployment that pushed a bug, a new line of code or even a piece of legacy code that no one is taking care of.
The where should include the method names involved in the transaction, as well as the code that’s related to each of the methods in the stack trace.
Taking it one level deeper, we also want to know on which machine, microservice or server this issue happened. Having the answers to where the code breaks can give us an extra set of eyes into our application.
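As a rough illustration of the "where", the sketch below pulls the class, method and line of every frame out of an exception's stack trace and adds the name of the machine it ran on. `ErrorLocation` and its methods are hypothetical names; the calls underneath (`Throwable.getStackTrace()`, `InetAddress.getLocalHost()`) are standard Java.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: answers "where" from a stack trace plus the local host name.
public class ErrorLocation {

    // One "class.method:line" entry per frame, innermost frame first.
    public static List<String> methodsInvolved(Throwable error) {
        List<String> frames = new ArrayList<>();
        for (StackTraceElement frame : error.getStackTrace()) {
            frames.add(frame.getClassName() + "." + frame.getMethodName()
                    + ":" + frame.getLineNumber());
        }
        return frames;
    }

    // Name of the machine this JVM runs on, or a placeholder if it cannot be resolved.
    public static String hostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-host";
        }
    }

    // A one-line summary: the error, the host, and the innermost frame if there is one.
    public static String describe(Throwable error) {
        List<String> frames = methodsInvolved(error);
        String top = frames.isEmpty() ? "(no stack trace)" : frames.get(0);
        return error + " on host " + hostName() + " at " + top;
    }
}
```

In a microservices setup you would probably also attach a service name and deployment version, which have to come from your own configuration.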
3. Why did it happen in the first place?
We know that something happened, but in order for us to fix it we need to understand it. That’s why we need to answer the most important question of them all: why did this issue happen in the first place?
We want to know the variable state at the moment of the error, including the state of the object that initiated the thread. We also want to see the DEBUG and LOG level log statements that led to the error, as well as the application state with active threads, GC and heap utilization.
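Some of this "why" data can be collected by hand at the moment an exception is caught. The sketch below is one possible approach using the JDK's standard `java.lang.management` beans; `ErrorContext` and the map keys are invented for illustration. Note the assumption that the caller passes in the variables it cares about, since ordinary Java code cannot read another frame's local variables without a debugger or agent.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: snapshots variables, thread, heap and GC state when an error is caught.
public class ErrorContext {

    public static Map<String, Object> capture(Throwable error, Map<String, Object> variables) {
        Map<String, Object> snapshot = new LinkedHashMap<>();
        snapshot.put("error", error.toString());
        snapshot.put("thread", Thread.currentThread().getName());
        // Copy the caller-supplied variables so later changes do not alter the snapshot.
        snapshot.put("variables", new LinkedHashMap<>(variables));

        // Number of live threads in this JVM.
        snapshot.put("liveThreads", ManagementFactory.getThreadMXBean().getThreadCount());

        // Heap utilization; getMax() returns -1 when the maximum is undefined.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        snapshot.put("heapUsedBytes", heap.getUsed());
        snapshot.put("heapMaxBytes", heap.getMax());

        // Total collections across all collectors; -1 means "undefined" and is skipped.
        long gcCollections = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                gcCollections += count;
            }
        }
        snapshot.put("gcCollections", gcCollections);
        return snapshot;
    }
}
```

Calling `ErrorContext.capture(e, vars)` inside a `catch` block and logging the result gives every error a consistent context record, though only for the variables someone remembered to pass in.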
For most engineering teams, this would be the time to hit the log files and start searching for that needle in our log haystack. That is, if the error was logged in the first place, and if it was logged correctly.
We talked with engineering teams in leading companies like Intuit and TripAdvisor, and learned that by depending on logs, they were often late to detect critical issues. Sumit Nagal, Principal Engineer in Quality at Intuit, points out that “Even if we did find the issues within the logs, some of them were not reproducible. Finding, reproducing and solving issues within these areas is a real challenge.”
To enhance their error resolution process, Intuit chose to use OverOps. With OverOps, Intuit’s development team was just a single click away from an automated root cause analysis for each issue that occurred in their application, giving them the right answers to when, where and why their code broke in production.
To find out how companies like Intuit, Zynga, TripAdvisor, Comcast and others are automating their error resolution workflow, check out our new eBook: The Complete Guide to Automated Root Cause Analysis.
How developers spend their time should be a top priority, and there’s no reason to compromise on any of these parameters. Even if we’re using APMs and log management tools to try and understand what happens when the application fails, we’re usually left with blind spots that can hurt us.
That’s why you should focus on developing a strategy that answers all three questions we’ve presented here. Together, they will give you a complete overview of your application, its current state and how to fix it or make it better.