Lately we’ve seen a variety of new tools that help you make sense of your logs. Open-source projects such as Scribe and LogStash, on-premise tools like Splunk, and hosted services such as SumoLogic and PaperTrail. These all help you reduce mass amounts of log data into something more meaningful.

But there’s one thing none of them can help you with. They’re all dependent on the data you actually put into the log. The task of getting that quality and quantity of the data right falls on to you. So when the chips are down and you need to debug your code with partial or missing log data, you may want to cancel that dinner reservation.

To reduce the chances of that happening I wanted to share 5 things you should be constantly mindful of when logging in production:

1. Hello my (Thread) name is

Just like Ringo, the Thread name property is one of Java’s most undervalued methods. The reason is that it’s mainly descriptive. So what? so are our names, and we put some meaning into them.

The place where thread names play the biggest part is in multi-threaded logging. Most logging frameworks will log the current thread name from which a call was made. Sadly enough, this would mostly look something like “http-nio-8080-exec-3″, a name assigned to it by the thread-pool or container.

For some reason, there’s a misconception I’ve heard more than once that thread names are immutable. They’re not. The thread name is prime real-estate in your log, and you want to make sure you use it right. This would most likely mean assigning it with detailed context such as the name of the Servlet or task its fulfilling right now, and some dynamic context such as a user or message ID.

As such, an entry point into your code should look something like this:

Thread.currentThread().setName(ProcessTask.class.getName() + “: “+ message.getID);

A more advanced version would be to load a thread local variable into the current thread, and configure a log appender that will automatically add it to every log entry.

This is extremely useful when multiple threads are writing to the server log and you need to focus on a single thread. If you’re running in a distributed / SOA environment, this has an added advantaged which we’ll see in a moment.

Find the Crap in Your Java App

Show me how >>


2. Distributed identifiers

In an SOA or message driven architecture, the execution of a specific task will most likely span across multiple machines. When processing of one such fails, connecting the dots between the machines involved and their state will be key to understanding what happened. Most log analyzers will group those log messages for you, assuming you provided them with a unique ID by which they can do that as part of the actual log message.

From a design perspective this means that every inbound operation going into your system should have a unique ID which will travel with it until it’s completed. Note that a persistent identifier such as user ID may not be a good vessel for this, as one user may have multiple operations happening for him in the course of a log file, which will make isolating a specific flow much harder. UUIDs can be a good choice here. The value can be loaded into the actual thread name or as a TLS – thread local storage.

3. Dont text + drive. Don’t log + loop

Often times you’ll see a piece of code that’s running in a tight loop and performs a log operation. The underlying assumption is that the code will run a limited number of times.

That may very well be the case when things are working well. But when the code gets unexpected input the loop may not break. In that case you’re not just dealing with an infinite loop (which is bad enough), you’re dealing with code that’s writing infinite amounts of data to disk or network.

Left to its own devices this can take a server down, or in a distributed scenario, an entire cluster. So when possible, do not log within tight loops. This is especially true when catching errors.

Let’s look at an example, where we log an exception inside a while loop:

void read() {
    while (hasNext()) {
        try {
        } catch {
            Exception e) {
            // this isn’t recommend
            logger.error(“error reading data“, e);

If readData throws, and hasNext returns true, we end up with infinite logging. One way to solve this is to make sure we won’t log everything:

void read() {
        int exceptionsThrown = 0;
        while (hasNext()) {
            try {
            } catch {
                Exception e) {
                if (exceptionsThrown & lt; THRESHOLD) {
                    logger.error(“error reading data ", e);
                    else {
                        // Now the error won’t choke the system.

Another approach is to remove the logging from the loop altogether and save the first/last exception object to log it elsewhere.

4. Uncaught handlers

Westeros has the Wall as its last line of defense. You have Thread.uncaughtExceptionHandler. So make sure you use them. If you don’t install one of these handlers, you run the risk of throwing exceptions into the wild with very little context, and very little control of if and where they end up being logged.

Finding out that your code has been encountering massive errors without them being logged, or with them being logged with little to no state data is an epic fail.

Notice that even within an uncaught exception handler, which on the surface of things has no access to any variable within that thread (which has already terminated) you still get a reference to the actual Thread object. If you adhered to step #1, you’ll still get a meaningful thread.getName() value which you can log.

5. Catch external calls

Whenever you make an external API call which leaves the JVM, chances of an exception increase dramatically. This includes Web service, HTTP, DB, File system, OS and any other JNI calls. Treat each call as it will explode (it most likely will at some point).

For the most part, the reasons for external API failures have to do with unexpected input provided to them. Having those inputs readily available for you in the log is a key part in fixing the code.

You may choose not to log an error at this point, but have the exception thrown, which is fine. In that case, simply collect as many of the relevant arguments passed to that call and parse them into the exception error message.

Just make sure that exception is caught and logged at a higher level within the call stack.

try {
    return s3client.generatePresignedUrl(request);
} catch (Exception e) {
    String err = String.format(“Error generating request: % s bucket: % s key: % s.method: % s ", request, bucket, path, method);
        log.error(err, e); //you can also throw a nested exception here with err instead.

Have any other logging best practices you’ve learned the hard way, or completely disagree with the above? Hit me in the comments section.


Compiling Lambda Expressions: Scala vs. Java 8 – read more

Tal is the CEO of OverOps. Tal has been designing scalable, real-time Java and C++ applications for the past 15 years. He still enjoys analyzing a good bug though, and instrumenting code. In his free time Tal plays Jazz drums.
  • Yeroc

    Regarding point #1 I agree that threads should have meaningful names. I don’t think it’s a good practice to rename your threads though. That would make tracking down, say a ThreadLocal leak much, much more difficult. For logging information like a messageId you should use a mapped diagnostic context (MDC) which log4j, logback and slf4j all support…

    • http://www.takipi.com/ Tal Weiss

      Hey Yeroc, I agree and also noted in the above that using TLSs as a means of transporting and logging thread context is an advanced / efficient mechanism to do this, and one which we use. Re TLS leaks – I’m curious, can you provide an example?

      • Fabien Baligand

        Other things I wish to add on your list :
        – log one info log Line per external call with id info (ex: ws opération name) and duration time.
        – for each http request, log one Line with id info (uri, …) and duration time
        – log using csv format for further processing by a spreadsheet tool or a stats batch.

        Durations are very useful to analyse a bench result or a performance problem in production. Thus, tools like logstash/elasticsearch/kibana can process it very well.

        • http://www.takipi.com/ Tal Weiss

          Hey Fabien,

          Good stuff. Thanks!

          • Fabien Baligand

            Happy that you appreciate, thanks 😉

  • Dan Howard

    For #3 you should put the TRY outside the while loop.

    • http://www.takipi.com/ Tal Weiss

      That really depends on the logic you want yo implement. If the failure of one iteration does not mean you want to cancel all the rest, then you’ll want to catch inside, otherwise – outside. The catch ones that are nested within a loop run a higher risk of creating an infinite log loop.

  • Skyr

    Be careful with #5, or you end up with some Pokemon Exception Handling (“gotta catch them all”). It is perfectly ok to wrap such shaky sections like the one in you example with a “catch Exception e”, but either do some error handling there or re-throw a generic (probably application-specific) exception.

    • http://www.takipi.com/ Tal Weiss

      Hey Skyr, thanks for the comment.

      I agree you shouldn’t spend half the time wrapping every call you make with a try / catch. The key test IMO is when a call throws an exception – will you have all the variable context you need to understand why?

      For example if you’re doing a file.open which is the kind of method which likes to throw exceptions at you, you need to need to know in advance whether the file name, which will be part of an exception message, would be enough.

      If it is, great – don’t catch. If on the other hand the name of the file was parsed dynamically based on variable state which isn’t expressed in the final file name then I would recommend to catch, parse a much more meaningful exception message – comprising of those variables and then re-throw or log.

      • Skyr

        The only thing I wanted to emphasize: For the matter of logging, don’t just catch and log (and let the program run on), but log and re-throw the exception.

  • Yonatan

    about #1 – you are totally right, thing is that asynchronous platforms are become more and more popular. This make the thread name irrelevant for debugging

    • http://www.takipi.com/ Tal Weiss

      This is where #2 comes in – distributed identifiers. It now becomes the role of the developer to make sure a unique identifier travels which is logged for each message as it travels through queues, DBs and different machines.

  • David Bernard

    #0 Design logging. Logging should be deliberate not the result of discovery during debugging.

  • Alberto Gori

    You should mention that a thread name change is explicitely prohibited by EJB specification, in case you are running under such environment.

  • bloggerkedar

    Useful post! Given the way the Java logging works however, don’t you want to format the log string after making sure that the log level is not too high? It’s a sad repetition, but something that is required. In other words, shouldn’t the application logging should always be:

    if (logger.isLoggable(Level.INFO)) {
    String s = “carefully-formatted-message-with-lots-of-contextual-information”;

    Agreed, that .info is going to check the level again, but that is okay, as compared to the creation of the formatted string. Or am I missing something?

  • Jason Wang

    Thanks for sharing Tal. Looks like I came to this party a little late. Nevertheless, my tips for better loggings are:

    1. Use MDC/NDC when you need contextual logging info.

    2. No excuse to still use log4j for new projects — logback and log4j 2 are just better in almost every way.

    3. Always treat business exceptions separately: a rare business case is still a valid use case. You should log those as info/warning NOT error.If you treated those as exceptions, you probably should be smarter with the ugly stacktrace — no need to spit those out.

    4. Need to know what NOT to log. I would argue that most of the APM solutions do partially rely on logs, but you also have more direct reporting facilities for metrics gathering through annotating your code. You have to remember logs take space and extra file IO as well as computing power to analyse — even worse if you pipe those into ELK, everything will be copied at least once. If you can send metrics such as service duration, login request info directly to monitoring service, think again if you still need to log everything.

    5. Security. Security is easily overlooked. You may find users password, personal information, credit card numbers in the logs easily. A good masking utility should be made available in your codebase to universally handle those sensitive information.

    Also, its good practice to separate security related information into its own logs with specially designed tools — https://www.owasp.org/index.php/OWASP_Security_Logging_Project

    Though we have not tried it yet.

    Jason W