Why can’t production logs help you find the real root cause of your errors?

Asking if you’re using log files to monitor your application is almost like asking… do you drink water. We all use logs, but HOW we use them is a whole different question.

In the following post we’ll take a deeper look into logs and see how they are used and what’s written to them. Let’s go.

Big shout out to Aviv Danziger from our R&D team for his huge help pulling and crunching the data for us.

The Groundwork

Our quest for answers requires a large amount of data, and that’s why we turned to Google BigQuery. A few months ago we used it for the first time to see how GitHub’s top Java projects use logs.

For our current post, we took the top 400,000 Java repositories on GitHub, ranked by number of stars they were given in 2016. Out of those repositories we filtered out Android, sample projects and simple testers, which left us with 15,797 repositories.

Then, we extracted the repositories that had over 100 logging statements, which left us with 1,463 repos to work on. Now, it’s time for the fun part of finding the answers for all those questions that kept us awake at night.

TL;DR: Main Takeaways

If you’re not into pie, column or bar charts and want to skip the main course and head straight for the dessert, here are the 5 key points we learned about logging and how it’s really done:

1. Logs don’t really have as much information as we think, even though they can add up to hundreds of GBs per day. Over 50% of statements have no information about the variable state of the application
2. In production, 64% of overall logging statements are deactivated
3. The logging statements that do reach production have 35% less variables than the average development level logging statement
4. “This should never happen” always happens
5. There’s a better way to troubleshoot errors in production

Now let’s back up these points with some data.

1. How Many Logging Statements Actually Contain Variables?

The first thing we wanted to check is how many variables are sent out in each statement. We chose to slice the data on a scale from 0 variables up to 5 and above, in each repository. We then took the total count, and got a sense of the average breakdown over all of the projects in the research.

Average Java Project by Number of Variables
Average Java Project by Number of Variables

As you can see, the average Java project doesn’t log any variables in over 50% of its logging statements. We can also see that only 0.95% of logging statements send out 5 variables or more.

This means that there’s limited information about the application that is captured by the log, and finding out what actually happened might feel like searching for a needle in a log file.

2. How Many Logging Statements Are Activated in Production?

Development and production environments are different for many reasons, one of them is their relation to logging. In development, all log levels are activated. However, in production only ERROR and WARN are activated. Let’s see how this breakdown looks like.

Production vs. Development Logging
Production vs. Development Logging

The chart shows that the average Java application has 35.5% unique logging statements that have the potential to be activated in production (ERROR, WARN), and 64.5% statements that are only activated in development (TRACE, INFO, DEBUG).

Most information is lost. Ouch.

3. What’s the Average Number of Variables per Each Log Level?

So, not only do developers skimp on variables in their statements, the average Java application doesn’t send out that much statements to production logs in the first place.

Now, we’ve decided to look at each log level individually and calculate the average number of variables in the corresponding statements.

Average Number of Variables per Logging Statement
Average Number of Variables per Logging Statement

The average shows that TRACE, DEBUG and INFO statements contain more variables than WARN and ERROR. “More” is a polite word, considering the average number of variables in the first three is 0.78, and 0.5 in the last 2.

That means that production logging statements hold 35% less variables than development logging statements. In addition, as we’ve seen earlier, their overall number is also much lower.

If you’re searching the log for clues as to what happened to your application, but come up blank – this is why it happens. Not to worry, there’s a better way.

OverOps lets you see the variables behind any exception, logged error or warning, without relying on the information that was actually logged. You’ll be able to see the complete source code and variable state across the entire call stack of the event. Even if it wasn’t printed to the log file. OverOps also shows you the 250 DEBUG, TRACE and INFO level statements that were logged prior to the error, in production, even if they’re turned off and never reach the log file.

We’d be happy to show you how it works, click here to schedule a demo.

4. This Should Never Happen

Since we already have information about all of those logging statements, we’ve decided to have a little fun. We found 58 mentions to “This should never happen”.

All we can say is that if it should never happen, at least have the decency to print out a variable or 2, so you’ll be able to see why it happened anyway 🙂

How we did it?

As we mentioned, to get this data we first had to filter out irrelevant Java repositories and focus on those which had over 100 logging statements, which left us with 1,463 repos that made the cut.

Then, we added some regex magic and pulled out all of the log lines:

Now that we had the data, we started slicing it up. First we filtered out the number of variables per log level:

Then calculated the average use of each tier. That’s how we got the average percent of total repositories statements.

You can check out the calculations in our raw data file.

Final Thoughts

We all use log files, but it seems that most of us take them for granted. With the numerous log management tools out there we forget to take control of our own code – and make it meaningful for us to understand, debug and fix.

There’s a better way.

Schedule a Demo

I write about Java, Scala and everything in between. Lover of gadgets, apps, technology and tea.
  • andrewglynn

    Being able to monitor the state of an app using a fluent API such as the monitoring library in Pharo would be much better. You can write statements such as “x should not be more than 100” and it can find any cases where the check fails throughout the code run. Then again, with REST, microservices frameworks like Seneca, and JavaScript, many “applications” no longer *have* a determinate state to check.

    • Henn Idan

      Hey Andrew, thank you for your comment.

      The unique thing about OverOps compared to transmitting custom metrics, is the ability to capture the full state of the application at the moment of the event – automatically reproducing its cause without any code changes.

      On top of that, you can also set alerts through a web UI based on the number of times an event should happen. You can see how it works for TripAdvisor in the following video:

  • Jonathan Morlet

    Looks like the analysis is biased a little bit 🙂 But as it is an “infovertisement”…

    You state that Loglevel INFO only in dev and not in prod. This is may or may not be correct, and it changes the ratio in you graph. In our organisation, INFO is activated in prod, and DEBUG/TRACE filtered, but can be activated on demands…

    Second point, SL4J template syntaxe only allow for 3 parameters max, then you pass an array object.
    Object[] paramsReturn = {accountModel, productLine, subscriptionContext, nrcContext, eligibleProducts};
    logger.trace(“checkContext({},{},{},{}) – Return {}”, paramsReturn);
    It will only count as useless with only one parameter in your calculation…

    But then its give figure that made easier to promote a commercial product ( $109/month per JVM wich is not that costly in enterprise)

    • Henn Idan

      Hey Jonathan, thank you for your interest in the research for this post.
      Any log level can be activated on demand, but most production applications set their logging level to WARN – so that’s why the graph shows it this way. The full breakdown to each level is also available on our previous post right here:

      In our calculations we counted all of the parameters, and not just one of them.