Meet MDC and Log View and how they add context to solving errors in production
OverOps is used to troubleshoot production errors and pinpoint their root cause down to the variable values that caused them. One of the biggest requests we’ve received is the ability to see the thread-local state of the transaction, which adds even more context about the application’s state at the moment of error.
In this post, I’d like to share the story behind two new features that originated from user feedback, and how they push the boundaries of what you can expect when solving Java application errors in production.
TL;DR: View the 30-second video
Let’s start with MDC
MDC, short for Mapped Diagnostic Context, is the thread-local state that Java logging frameworks such as SLF4J and Logback associate with an executing thread. This information is loaded by the application into the log framework at the beginning of a transaction, and emitted by the log framework into the log file to describe the application / business context in which the code was executing.
This can be a customer ID, username or unique transaction identifier maintained across a distributed processing chain.
When troubleshooting code, having access to this context is paramount – it’s with this information you can better understand why (and for whom) the code was executing, or search across your log files for statements with that same identifier to see the full story that befell a failed transaction across multiple machines or services.
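To make the idea concrete, here is a minimal sketch of what "thread-local state" means for an MDC. It uses only the JDK and is a simplified stand-in, not the SLF4J or Logback implementation; the class name `MdcSketch`, its methods, and the `txId` / `customerId` keys are all assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for a logging framework's MDC (hypothetical, JDK only).
final class MdcSketch {
    // Each thread gets its own map, so concurrent transactions never share context.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }

    static String get(String key) { return CONTEXT.get().get(key); }

    static void clear() { CONTEXT.get().clear(); }

    // Prefix a message with the calling thread's context, roughly what a log pattern does.
    // TreeMap is used only to print the keys in a stable, sorted order.
    static String format(String message) {
        return new TreeMap<>(CONTEXT.get()) + " " + message;
    }

    // Read a key from a brand-new thread to show that the context does not leak across threads.
    static String getOnNewThread(String key) {
        String[] seen = new String[1];
        Thread t = new Thread(() -> seen[0] = get(key));
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }

    public static void main(String[] args) {
        // Loaded once at the beginning of a transaction (hypothetical identifiers).
        put("txId", "tx-1001");
        put("customerId", "42");
        try {
            System.out.println(format("payment failed"));
            System.out.println("other thread sees txId = " + getOnNewThread("txId"));
        } finally {
            // Clear the context so a pooled thread does not carry it into the next transaction.
            clear();
        }
    }
}
```

In a real application you would use the MDC API of your logging framework rather than rolling your own; the sketch only shows why every statement logged by the thread can carry the same transaction identifier.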
The good news is OverOps now captures the full MDC state at the moment of error for every source, stack and state analysis. This is powerful as it provides you with the full context in which code failed. You can see an example of MDC state in the following screenshot from our actual production environment – it holds a wealth of information related to the executing code, which is now a part of every OverOps data capture.
Going beyond MDC (and into something very cool)
Once we added MDC capture, a sneaky thought crept into our minds: why capture just the log MDC? Why not capture the log statements themselves?
So this is exactly what we’ve just added – the ability to see the last 250 log statements within the thread leading up to an error. The cool thing about it is we don’t capture those statements from the log file, but directly in-memory as they are logged by the code.
What can you do with Log View?
1. DEBUG and TRACE in production
Using Log View you can see DEBUG and TRACE statements leading to an error, even if they were not written into the log file in production. This is because we capture log statements as they happen inside the application in real time, regardless of whether the log verbosity level allows them to be persisted.
Since these statements are only captured from memory when OverOps takes a snapshot, without relying on log files, there is no additional overhead to the size of your logs. You can have your cake and eat it too! 🙂
This is a huge win for devs, as this information is almost never accessible in production and is beneficial in troubleshooting issues. Getting to it usually requires turning verbosity on and recreating the issue to get to those statements, lengthening the resolution process and requiring involvement from Ops, and at times even the customer.
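The mechanism described above can be pictured as a small per-thread buffer that sits in front of the verbosity filter. The sketch below is an assumption about the general technique, not OverOps's actual implementation: `LogViewSketch`, its `FILE` list (a stand-in for the log file), and the `threshold` field are hypothetical names, and only the figure of 250 statements comes from the post.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: keep the last statements per thread in memory, whatever the log level.
final class LogViewSketch {
    enum Level { TRACE, DEBUG, INFO, WARN, ERROR }

    // Number of statements retained per thread (the figure quoted in the post).
    static final int CAPACITY = 250;

    // Stand-in for the log file: only statements at or above the threshold end up here.
    static final List<String> FILE = Collections.synchronizedList(new ArrayList<>());
    static volatile Level threshold = Level.INFO;

    // One bounded buffer per thread, so a snapshot contains only that thread's statements.
    private static final ThreadLocal<Deque<String>> RECENT =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void log(Level level, String message) {
        String line = level + " " + message;

        // Always remember the statement in memory, dropping the oldest one when full.
        Deque<String> recent = RECENT.get();
        if (recent.size() == CAPACITY) {
            recent.removeFirst();
        }
        recent.addLast(line);

        // Persist only what the configured verbosity allows.
        if (level.compareTo(threshold) >= 0) {
            FILE.add(line);
        }
    }

    // What would be attached to an error snapshot: the calling thread's recent statements.
    static List<String> snapshot() {
        return new ArrayList<>(RECENT.get());
    }

    public static void main(String[] args) {
        for (int i = 0; i < 300; i++) {
            log(Level.DEBUG, "step " + i);
        }
        log(Level.ERROR, "boom");
        System.out.println("statements in snapshot: " + snapshot().size());
        System.out.println("statements in file: " + FILE.size());
    }
}
```

The design point is the ordering inside `log`: the statement is buffered before the level check, so DEBUG and TRACE lines are available to a snapshot without ever being written to disk.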
2. Access logs directly from JIRA, Slack and HipChat
As log statements are now a part of any state captured by OverOps, alerts about new errors become much more powerful as they not only include the source, stack and state, but also the log data most relevant to the error. So once you receive an alert about a new error that was just deployed, it will provide you with the most relevant log data (and MDC) without having to pull and grep for the log data out of the production system.
3. Focus on the right log events
We capture the log statements related to the thread in which the error happened (vs. statements from the 300+ threads you may have running concurrently), so you can focus immediately on the data relevant to the error.
OverOps automatically highlights the log statement in which the transaction started, so you can immediately know which log statements are relevant to the transaction vs. ones that are pure noise.
4. Diskless logging
As log statements are persisted as part of the snapshot OverOps captures, without reliance on physical logs, you’re not dependent on having access to the log files. This is very powerful in elastic environments where, by the time you’ve been made aware of an error, the log files may already have been lost because the machine was taken down.
You can access snapshots for up to 90 days in our SaaS version, or for even longer periods of time in our on-premises version. As these are periodic snapshots, they can be retained for long periods without straining your storage infrastructure. You’re not limited to a short log retention policy before logs are recycled.
We’re super excited about this feature and would love for you to take a look at it and tell us what you think in the comments section below.