Application performance is at the forefront of our minds, and garbage collection optimization is a good place to make small but meaningful advancements.

Automated garbage collection (along with the JIT HotSpot Compiler) is one of the most advanced and most valued components of the JVM, but many developers and engineers are far less familiar with Garbage Collection (GC), how it works and how it impacts application performance.

First, what is GC even for? Garbage collection is the memory management process for objects in the heap. Once allocated on the heap, objects move through a few collection phases – usually rather quickly, as the majority of objects in the heap have short lifespans.

Garbage collection events consist of three phases – marking, deletion and copying/compaction. In the first phase, the GC runs through the heap and marks everything as either a live (referenced) object, an unreferenced object or available memory space. Unreferenced objects are then deleted, and the remaining objects are compacted. In generational garbage collection, objects “age” and are promoted through three spaces over their lifetimes – the Eden, Survivor and Tenured (Old) spaces. This shifting also occurs as part of the compaction phase.

But enough about that, let’s get to the fun part!


Getting to Know Garbage Collection (GC) in Java

One of the great things about automated GC is that developers don’t really need to understand how it works. Unfortunately, that means that many developers DON’T understand how it works. Understanding garbage collection and the many available GCs is somewhat like knowing Linux CLI commands: you don’t technically need to use them, but knowing and becoming comfortable with them can have a significant impact on your productivity.

Just like with CLI commands, there are the absolute basics: the ls command to list the contents of a directory, mv to move a file from one location to another, and so on. In GC terms, that level of knowledge is equivalent to knowing that there is more than one GC to choose from, and that GC can cause performance concerns. Of course, there is so much more to learn (about the Linux CLI AND about garbage collection).

The purpose of learning about Java’s garbage collection process isn’t just to collect gratuitous (and boring) conversation starters; it’s to learn how to effectively implement and maintain the right GC, with optimal performance, for your specific environment. Knowing that garbage collection affects application performance is just the basics – there are many advanced techniques for enhancing GC performance and reducing its impact on application reliability.

GC Performance Concerns

1. Memory Leaks –

With some knowledge of heap structure and how garbage collection is performed, we know that memory usage gradually increases until a garbage collection event occurs and usage drops back down. Heap utilization for referenced objects usually remains steady, so each drop should bring usage back to more or less the same level.

With a memory leak, each GC event clears a smaller portion of heap objects (although many objects left behind are not in use) so heap utilization will continue to increase until the heap memory is full and an OutOfMemoryError exception will be thrown. The cause for this is that the GC only marks unreferenced objects for deletion. So, even if a referenced object is no longer in use, it won’t be cleared from the heap. There are some helpful coding tricks for preventing this that we’ll cover a bit later.
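As a quick sketch of how a referenced-but-unused object escapes collection (the class and method names here are ours, purely for illustration), a static collection is one of the most common culprits:

```java
import java.util.ArrayList;
import java.util.List;

public class LeakDemo {
    // Classic leak: a static collection keeps every entry reachable for
    // the life of the class, so the GC can never reclaim the entries.
    static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest(byte[] payload) {
        CACHE.add(payload); // still referenced after the request ends
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000; i++) {
            handleRequest(new byte[1024]);
        }
        // All 1,000 buffers are still "live" even though no request
        // needs them anymore; heap usage only ever grows.
        System.out.println(CACHE.size());
        CACHE.clear(); // dropping the references makes them collectible
    }
}
```

Because every buffer stays reachable through `CACHE`, the GC marks it as live on every cycle; only explicitly dropping the references (or scoping the collection properly) lets the memory be reclaimed.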

2. Continuous “Stop the World” Events –

In some scenarios, garbage collection can be called a Stop the World event because when it occurs, all threads in the JVM (and thus, the application that’s running on it) are stopped to allow GC to execute. In healthy applications, GC execution time is relatively low and doesn’t have a large effect on application performance.

In suboptimal situations, however, Stop the World events can greatly impact the performance and reliability of an application. If a GC event requires a Stop the World pause and takes 2 seconds to execute, the end-user of that application will experience a 2-second delay while the threads running the application are stopped to allow GC.

When memory leaks occur, continuous Stop the World events are also problematic. As less heap memory space is purged with every execution of the GC, it takes less time for the remaining memory to fill up. When the memory is full, the JVM triggers another GC event. Eventually, the JVM will be running repeated Stop the World events causing major performance concerns.
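To see how long these pauses actually take in your own application, GC logging can be enabled at JVM startup. The flags below are standard HotSpot options (the jar name is just a placeholder):

```shell
# JDK 9 and later: unified GC logging
java -Xlog:gc*:file=gc.log -jar myapp.jar

# JDK 8 and earlier
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -jar myapp.jar
```

The resulting log records each collection’s cause, duration and pause time, which makes the repeated-Stop-the-World pattern described above easy to spot.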

3. CPU Usage –

And it all comes down to CPU usage. A major symptom of continuous GC / Stop the World events is a spike in CPU usage. GC is a computationally heavy operation, and so can take more than its fair share of CPU power. For GCs that run concurrent threads, CPU usage can be even higher. Choosing the right GC for your application will have the biggest impact on CPU usage, but there are also other ways to optimize for better performance in this area.

We can understand from these performance concerns that however advanced GCs get (and they’re getting pretty advanced), their Achilles’ heel remains the same: redundant and unpredictable object allocations. To improve application performance, choosing the right GC isn’t enough. We need to know how the process works, and we need to optimize our code so that our GCs don’t pull excessive resources or cause excessive pauses in our application.

Generational GC

Before we dive into the different Java GCs and their performance impact, it’s important to understand the basics of generational garbage collection. The basic concept of generational GC is based on the idea that the longer a reference exists to an object in the heap, the less likely it is to be marked for deletion. By tagging objects with a figurative “age,” they could be separated into different storage spaces to be marked by the GC less frequently.

When an object is allocated to the heap, it’s placed in what’s called the Eden space. That’s where objects start out, and in most cases that’s where they are marked for deletion. Objects that survive that stage “celebrate a birthday” and are copied to the Survivor space.

The Eden and Survivor spaces make up what’s called the Young Generation. This is where the bulk of the action occurs. When (if) an object in the Young Generation reaches a certain age, it is promoted to the Tenured (also called Old) space. The benefit of dividing object memory based on age is that the GC can operate at different levels.

A Minor GC is a collection that focuses only on the Young Generation, effectively ignoring the Tenured space altogether. Generally, the majority of objects in the Young Generation are marked for deletion, and a Major or Full GC (which includes the Old Generation) isn’t necessary to free memory on the heap. Of course, a Major or Full GC will be triggered when necessary.

One quick trick for optimizing GC operation based on this is to adjust the sizes of the heap areas to best fit your application’s needs.
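For example, the standard HotSpot flags below control the overall heap size and the relative size of the young generation (the jar name is just a placeholder):

```shell
# Fixed 4 GB heap with a 1.5 GB young generation
java -Xms4g -Xmx4g -Xmn1536m -jar myapp.jar

# Alternatively, size the generations relative to each other:
# NewRatio=2 means old:young = 2:1; SurvivorRatio=8 means Eden is
# 8x the size of each survivor space
java -XX:NewRatio=2 -XX:SurvivorRatio=8 -jar myapp.jar
```

The right values depend entirely on the application’s allocation profile, so changes like these should always be validated against GC logs.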

Collector Types

There are many available GCs to choose from. Although G1 became the default GC in Java 9, it was originally intended to replace the CMS collector, which is a Low Pause collector – so applications running with Throughput collectors may be better off staying with their current collector. Either way, understanding the operational differences between Java garbage collectors, and the differences in their performance impact, is still important.
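Each of the collectors covered below is selected with a single HotSpot flag (note that CMS was deprecated in JDK 9 and removed in JDK 14; the jar name is a placeholder):

```shell
java -XX:+UseSerialGC -jar myapp.jar         # Serial
java -XX:+UseParallelGC -jar myapp.jar       # Parallel (throughput)
java -XX:+UseConcMarkSweepGC -jar myapp.jar  # CMS (low pause)
java -XX:+UseG1GC -jar myapp.jar             # G1 (default since Java 9)
```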

Throughput Collectors

Better for applications that need to be optimized for high throughput and can trade higher latency to achieve it.

Serial –

The serial collector is the simplest one, and the one you’re least likely to be using, as it’s mainly designed for single-threaded environments (e.g. 32-bit JVMs or Windows) and for small heaps. This collector can vertically scale memory usage in the JVM, but requires several Major/Full GCs to release unused heap resources. This causes frequent Stop the World pauses, which disqualifies it, for all intents and purposes, from use in user-facing environments.

Parallel –

As its name describes, this GC uses multiple threads running in parallel to scan through and compact the heap. Although the Parallel GC uses multiple threads for garbage collection, it still pauses all application threads while running. The Parallel collector is best suited for apps that need to be optimized for throughput and can tolerate higher latency in exchange.

Low Pause Collectors

Most user-facing applications require a low-pause GC, so that user experience isn’t affected by long or frequent pauses. These GCs are all about optimizing for responsiveness and strong short-term performance.

Concurrent Mark Sweep (CMS) –

Similar to the Parallel collector, the Concurrent Mark Sweep (CMS) collector utilizes multiple threads to mark and sweep (remove) unreferenced objects. However, this GC initiates Stop the World events in only two specific instances:

(1) when performing the initial marking of roots (objects in the old generation that are reachable from thread entry points or static variables), any references from the main() method, and a few others

(2) when the application has changed the state of the heap while the algorithm was running concurrently, forcing the GC to go back and do some final touches to make sure it has marked the right objects

G1 –

The Garbage-First collector (commonly known as G1) utilizes multiple background threads to scan through the heap, which it divides into regions. It works by scanning the regions that contain the most garbage objects first, which gives the collector its name.

This strategy reduces the chance of the heap being depleted before background threads have finished scanning for unused objects, in which case the collector would have to stop the application. Another advantage for the G1 collector is that it compacts the heap on-the-go, something the CMS collector only does during full Stop the World collections.

Improving GC Performance

Application performance is directly impacted by the frequency and duration of garbage collections, meaning that GC optimization comes down to reducing those two metrics. There are two major ways to do this: first, by adjusting the sizes of the young and old generations of the heap, and second, by reducing the rate of object allocation and promotion.

In terms of adjusting heap sizes, it’s not as straightforward as one might expect. The logical conclusion would be that increasing the heap size would decrease GC frequency while increasing duration, and decreasing the heap size would decrease GC duration while increasing frequency.

The fact of the matter, though, is that the duration of a Minor GC is reliant not on the size of the heap, but on the number of objects that survive the collection. That means that for applications that mostly create short-lived objects, increasing the size of the young generation can actually reduce both GC duration and frequency. However, if increasing the size of the young generation will lead to a significant increase in objects needing to be copied in survivor spaces, GC pauses will take longer leading to increased latency.

3 Tips for Writing GC-Efficient Code

Tip #1: Predict Collection Capacities –

All standard Java collections, as well as most custom and extended implementations (such as Trove and Google’s Guava), use underlying arrays (either primitive- or object-based). Since an array’s size is fixed once it’s allocated, adding items to a collection may in many cases cause an old underlying array to be dropped in favor of a larger, newly-allocated one.

Most collection implementations try to optimize this re-allocation process and keep it to an amortized minimum, even if the expected size of the collection is not provided. However, the best results can be achieved by providing the collection with its expected size upon construction.
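A minimal sketch of the difference, using ArrayList (the same applies to HashMap and friends via their initial-capacity constructors; the class name here is ours):

```java
import java.util.ArrayList;
import java.util.List;

public class PresizedList {
    static List<Integer> fill(List<Integer> list, int n) {
        for (int i = 0; i < n; i++) list.add(i);
        return list;
    }

    public static void main(String[] args) {
        int expected = 1_000_000;

        // Default capacity: the backing array is re-allocated and copied
        // repeatedly as the list grows, leaving dead arrays for the GC.
        List<Integer> grown = fill(new ArrayList<>(), expected);

        // Pre-sized: a single backing array, no intermediate garbage.
        List<Integer> presized = fill(new ArrayList<>(expected), expected);

        System.out.println(grown.size() == presized.size());
    }
}
```

Both lists end up identical; the pre-sized one simply avoids creating a series of intermediate arrays for the collector to clean up.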

Tip #2: Process Streams Directly –

When processing streams of data, such as data read from files or downloaded over the network, it’s very common to see something along the lines of:
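For instance, reading an entire file into a single array before any parsing starts (the class name and file path below are placeholders of ours):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWholeFile {
    // Anti-pattern: the whole file is buffered in memory before any
    // parsing begins, so peak heap usage scales with file size.
    static byte[] readAll(Path file) throws IOException {
        return Files.readAllBytes(file);
    }
}
```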

The resulting byte array could then be parsed into an XML document, JSON object or Protocol Buffer message, to name a few popular options.

When dealing with large files or ones of unpredictable size, this is obviously a bad idea, as it exposes us to OutOfMemoryErrors in case the JVM can’t actually allocate a buffer the size of the whole file.

A better way to approach this is to use the appropriate InputStream (FileInputStream in this case) and feed it directly into the parser, without first reading the whole thing into a byte array. All major libraries expose APIs to parse streams directly, for example:
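As one concrete example using only the JDK, the built-in XML parser accepts an InputStream directly (Jackson, Gson and the Protocol Buffer runtimes expose equivalent stream-based entry points; the class name here is ours):

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class StreamParse {
    // The parser pulls bytes from the stream as it needs them; the file
    // is never materialized as one giant byte array first.
    static Document parseXml(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
    }
}
```

Feeding a FileInputStream straight into a method like this keeps peak memory bounded by the parser’s internal buffers rather than by the file size.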

Tip #3: Use Immutable Objects –

Immutability has many advantages. One that’s rarely given the attention it deserves is its effect on garbage collection.

An immutable object is an object whose fields (and specifically non-primitive fields in our case) cannot be modified after the object has been constructed.
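A minimal sketch of such a class (the name and fields are ours, for illustration):

```java
// All fields are final and set exactly once in the constructor, so the
// object's references can never change after construction completes.
public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int x() { return x; }
    public int y() { return y; }

    // "Mutation" produces a new instance instead of modifying this one.
    public Point translate(int dx, int dy) {
        return new Point(x + dx, y + dy);
    }
}
```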

Immutability implies that all objects referenced by an immutable container have been created before the construction of the container completes. In GC terms: The container is at least as young as the youngest reference it holds. This means that when performing garbage collection cycles on young generations, the GC can skip immutable objects that lie in older generations, since it knows for sure they cannot reference anything in the generation that’s being collected.

Fewer objects to scan means fewer memory pages to scan, and fewer memory pages to scan means shorter GC cycles, which means shorter GC pauses and better overall throughput.

For more tips and detailed examples, check out this post covering in-depth tactics for writing more memory-efficient code.

*** Huge thanks to Amit Hurvitz from OverOps’ R&D Team for his passion and insight that went into this post!

Tali studied theoretical mathematics at Northeastern University and loves to explore the intersection of numbers and the human condition. In her free time, she enjoys drawing and spending time with animals.
  • Moshe Latin

    Great Summary

  • Benjamin Shults

    I was surprised to see the recommendation to use immutable objects and collections for improving GC performance. I’m sure it depends on the application but I would have expected the opposite because when using immutable objects and collections, a lot more garbage can be generated. (E.g., lots of string concatenation is a recipe for GC hell and using a mutable StringBuilder improves the situation.)

    Can you refer to any GC benchmarks that demonstrate situations where immutability improves GC performance? I would be interested in reading more details.

    • Mark McKenna

      I can’t refer a GC benchmark, but the logic is sound. Your logic is also pretty sound; as in most things I think there are competing factors here. A few:
      * When modifying immutable objects, you must copy that object into a new region, with replaced values. As you say, this allocates a new object in the young generation, and often makes the older object go to GC. As long as the mutable object doesn’t grow in size, you can avoid the reallocation, which reduces GC load.
      * When creating immutable objects you can take advantage of optimizations based on immutability, such as interning. Interned (pooled, memoized) instances are avoided allocations, which reduces GC load.
      * The one mentioned above: constructor-initialized, immutable objects cannot be younger than their referents. That means that old-generation immutable objects can’t refer to young-generation objects, and therefore need not be considered when computing reachability for the young-generation object. This is extra juicy when you consider how much more frequent young-generation collections are. Note though that the restriction required to take advantage of this goes beyond ‘shallow’ immutability implied by `final`: the object has to be immutable ‘all the way down’.
      * Thread-local persistent allocated buffers can be a very valuable technique for making mutable data fast. For example consider something like parsing JSON: you can efficiently process the inbound JSON string in 4k blocks (or whatever buffer unit you want to use); as long as you have 8k allocated, you can just flip-flop the preallocated buffers, and avoid allocations for anything that isn’t part of the parser output. And if you allocate thread-locally (or are prepared to assume single threaded usage), you can even reuse the buffer across parsing events, pushing that buffer into the tenured zone and out of general GC consideration (note also that a character array can’t contain references, so even if it’s mutable, it won’t factor into root-finding for other objects).
      * Not related to GC directly, but immutable objects also make parallelism a relative breeze, which can help to dramatically improve overall app performance, even if it does increase GC load.

      So basically yeah, going wholesale to immutables isn’t a silver bullet for GC performance, but it does bring some benefits. My current philosophy is to use immutables until I have a performance problem and profiling tells me it’s caused by heap thrashing.

      • Benjamin Shults

        Thanks for the thoughtful response. It is complicated and a lot to think about. I certainly learned a lot from the original article and your comment here.
        Your concluding “current philosophy” is generally what I go by as well with the exception of well-known situations (e.g., string concatenation in a loop) that we know cause problems for GC.