Synchronized sections are kind of like visiting your parents-in-law: you want to be there as little as possible. When it comes to locking, the rules are the same – you want to spend the shortest possible amount of time acquiring the lock and inside the critical section, to prevent bottlenecks from forming.

The core language idiom for locking has always been the synchronized keyword, applied to methods and discrete blocks. This keyword is really hardwired into the HotSpot JVM. Each object we allocate in our code, be it a String, an Array or a full-blown JSON document, has locking capabilities built right into its header at the native GC level. The same goes for the JIT compiler, which compiles and recompiles bytecode depending on the specific state and contention level of a specific lock.

The problem with synchronized blocks is that they’re all or nothing – you can’t have more than one thread inside a critical section. This is especially a bummer in consumer/producer scenarios, where some threads are trying to edit some data exclusively, while others are only trying to read it and are fine with sharing access.

ReadWriteLocks were meant to be the perfect solution for this. You can specify which threads block everyone else (writers), and which ones play well with others for consuming content (readers). A happy ending? Afraid not.
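For reference, the basic ReentrantReadWriteLock idiom looks something like this (a minimal sketch – the class and field names are mine):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RwCounter {
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    private long value;

    public long read() {
        rw.readLock().lock();    // shared mode: any number of readers may enter
        try {
            return value;
        } finally {
            rw.readLock().unlock();
        }
    }

    public void write(long v) {
        rw.writeLock().lock();   // exclusive mode: blocks readers and writers alike
        try {
            value = v;
        } finally {
            rw.writeLock().unlock();
        }
    }
}
```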

Unlike synchronized blocks, RW locks are not built into the JVM and have the same capabilities as mere mortal code. Still, to implement a locking idiom you need to instruct the CPU to perform specific operations atomically, or in a specific order, to avoid race conditions. This is traditionally done through the magical portal-hole into the JVM – the Unsafe class. RW locks use Compare-And-Swap (CAS) operations to set values directly into memory as part of their thread queuing algorithm.
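The same CAS primitive is exposed through the public atomic classes, which also delegate to Unsafe under the hood. A minimal illustration (class name is mine):

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasDemo {
    public static void main(String[] args) {
        AtomicLong state = new AtomicLong(0);

        // Atomically flip 0 -> 1; succeeds only if the value is still 0
        boolean first = state.compareAndSet(0, 1);

        // A second attempt with the now-stale expected value fails, which is
        // how a lock can detect that another thread beat it to the punch
        boolean second = state.compareAndSet(0, 2);

        System.out.println(first + " " + second + " " + state.get()); // true false 1
    }
}
```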

Even so, RW locks are just not fast enough, and at times prove to be really darn slow – to the point of not being worth bothering with. However, help is on the way: the good folks at the JDK haven't given up, and are now back with the new StampedLock. This RW lock employs a new set of algorithms and memory fencing features added to the Java 8 JDK to help make it faster and more robust.

Does it deliver on its promise? Let’s see.


Using the lock. On the face of it, StampedLocks are more complex to use. They employ a concept of stamps – long values that serve as tickets for any lock / unlock operation. This means that to release a R/W lock you need to pass in its correlating stamp. Pass the wrong stamp and you’re risking an exception, or worse – unexpected behavior.

Another key piece to be really mindful of is that, unlike RW locks, StampedLocks are not reentrant. So while they may be faster, they have the downside that threads can now deadlock against themselves. In practice, this means that more than ever, you should make sure that locks and stamps do not escape their enclosing code blocks.

long stamp = lock.writeLock();  // blocking lock, returns a stamp

try {
    write(stamp); // this is a bad move - you’re letting the stamp escape
} finally {
    lock.unlock(stamp); // release the lock in the same block - way better
}

Another pet peeve I have with this design is that stamps are served as long values that don’t really mean anything to you. I would have preferred lock operations to return an object describing the stamp – its type (R/W), lock time, owner thread, etc. This would have made debugging and logging easier. This is probably intentional though, meant both to prevent developers from passing stamps between different parts of the code, and to save on the cost of allocating an object.
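If you do want stamps that carry context, nothing stops you from wrapping them yourself – a hypothetical helper, paying exactly the allocation cost the JDK designers avoided:

```java
import java.util.concurrent.locks.StampedLock;

// Hypothetical wrapper that pairs a stamp with its lock, mode and acquisition
// time, making logging and debugging easier at the cost of an allocation
public class StampedTicket implements AutoCloseable {
    final StampedLock lock;
    final long stamp;
    final boolean write;
    final long acquiredAtNanos = System.nanoTime();

    private StampedTicket(StampedLock lock, long stamp, boolean write) {
        this.lock = lock;
        this.stamp = stamp;
        this.write = write;
    }

    public static StampedTicket writeLock(StampedLock lock) {
        return new StampedTicket(lock, lock.writeLock(), true);
    }

    public static StampedTicket readLock(StampedLock lock) {
        return new StampedTicket(lock, lock.readLock(), false);
    }

    @Override
    public void close() {
        lock.unlock(stamp); // unlock(stamp) releases either mode
    }
}
```

Used with try-with-resources – `try (StampedTicket t = StampedTicket.writeLock(lock)) { ... }` – the ticket also guarantees the stamp never escapes its block.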

Optimistic locking. The most important piece in terms of new capabilities for this lock is the new optimistic locking mode. Research and practical experience show that read operations are for the most part not contended with write operations. As a result, acquiring a full-blown read lock may prove to be overkill. A better approach may be to go ahead and perform the read, and at the end of it check whether the value was actually modified in the meanwhile. If it was, you can retry the read, or upgrade to a heavier lock.

long stamp = lock.tryOptimisticRead(); // non blocking
read(); // read the value optimistically

if (!lock.validate(stamp)) { // if a write occurred, try again with a read lock
    stamp = lock.readLock();
    try {
        read();
    } finally {
        lock.unlock(stamp);
    }
}

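StampedLock also lets you convert a stamp between modes instead of releasing and re-acquiring. A sketch of the upgrade pattern from the class's javadoc, applied to a hypothetical counter:

```java
import java.util.concurrent.locks.StampedLock;

public class BoundedCounter {
    private final StampedLock lock = new StampedLock();
    private long counter;

    // Increment only while below a cap, upgrading from read to write mode
    public void incrementIfBelow(long cap) {
        long stamp = lock.readLock();
        try {
            while (counter < cap) {
                long ws = lock.tryConvertToWriteLock(stamp);
                if (ws != 0L) {           // upgrade succeeded
                    stamp = ws;
                    counter++;
                    break;
                }
                lock.unlockRead(stamp);   // upgrade failed: fall back to a
                stamp = lock.writeLock(); // blocking write lock and re-check
            }
        } finally {
            lock.unlock(stamp);           // releases whichever mode we ended up in
        }
    }

    public long get() {
        long stamp = lock.tryOptimisticRead();
        long current = counter;
        if (!lock.validate(stamp)) {      // a write slipped in; take a real read lock
            stamp = lock.readLock();
            try {
                current = counter;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return current;
    }
}
```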

One of the biggest hassles in picking a lock, is that its actual behavior in production will differ depending on application state. This means that the choice of a lock idiom cannot be done in a vacuum, and must take into consideration the real-world conditions under which the code will execute.

The number of concurrent reader vs. writer threads will change which lock you should use – a synchronized section or a RW lock. This gets harder as these numbers can change during the lifecycle of the JVM, depending on application state and thread contention.

To illustrate this, I stress-tested four modes of locking – synchronized, RW lock, stamped RW lock and RW optimistic locking – under different contention levels and R/W thread combinations. Reader threads consume the value of a counter, while writer threads increment it from 0 to 1M.
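The harness boils down to something like this – a simplified sketch of the synchronized variant, not the actual benchmark code:

```java
// Simplified sketch of the synchronized variant; the class name and structure
// are mine, not the actual benchmark code
public class SyncCounter {
    private final long target;
    private long counter;

    public SyncCounter(long target) {
        this.target = target;
    }

    // Reader threads spin on this until the writers are done
    public synchronized long get() {
        return counter;
    }

    // Writer threads spin on this; returns false once the target is reached
    public synchronized boolean increment() {
        if (counter >= target) {
            return false;
        }
        counter++;
        return true;
    }
}
```

The score for each lock type is the elapsed time until the writers push the counter to the target.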

5 readers vs. 5 writers: Stacking up five concurrent reader and five writer threads, we see that the stamped lock shines, outperforming synchronized by a factor of 3X. RW lock also performed well. The strange thing is that optimistic locking, which on the face of it should be the fastest, is actually the slowest here.


10 readers vs. 10 writers: Next, I increased the levels of contention to ten writer and ten reader threads. Here things start to materially change. RW lock is now an order of magnitude slower than stamped and synchronized locks, which perform at the same level. Notice that optimistic locking, surprisingly, is still slower than stamped RW locking.


16 readers vs. 4 writers: Next, I maintained a high level of contention while tilting the balance in favor of reader threads: sixteen readers vs. four writers. The RW lock continues to demonstrate the reason why it’s essentially being replaced – it’s a hundred times slower. Stamped and optimistic locking perform well, with synchronized not that far behind.


19 readers vs. 1 writer:  Last, I looked at how a single writer thread does against nineteen readers. Notice that the results are much slower, as the single thread takes longer to complete the work. Here we get some pretty interesting results. Not surprisingly, the RW lock takes infinity to complete. Stamped locking isn’t doing much better though… Optimistic locking is the clear winner here, beating RW lock by a factor of 100. Even so, keep in mind that this locking mode may fail you, as a write may occur during the read. Synchronized, our old faithful, continues to deliver solid results.


The full results can be found here. Hardware: MBP quad-core i7.

The benchmark code can be found here.


It seems that on average the best performance overall is still being delivered by the intrinsic synchronized lock. Even so, the point here is not to say that it will perform the best in all situations. It’s mainly to show that your choice of locking idiom should be made based on testing both the expected level of contention, and the division between reader and writer threads before you take your code to production. Otherwise you run the risk of some serious production debugging pain.

Additional reading about StampedLocks here.

Questions, comments, notes on the benchmark? Let me know below 🙂


Tal is the CEO of OverOps. Tal has been designing scalable, real-time Java and C++ applications for the past 15 years. He still enjoys analyzing a good bug though, and instrumenting code. In his free time Tal plays Jazz drums.
  • Kirk Pepperdine

    I seem to have missed the link to the code used in the benchmark. Can you post it?

    • Tal Weiss

      Hi Kirk,

      You’re right. I’ve added the links to both the code and the full results right above the conclusions Thanks again!

  • Chris

    What’s the GC cost of StampedLocks ? The issue with ReadWriteLocks has been memory cost and the resultant collector cost e.g thread locals / variable creation over synchronized that seemed to take more than it delivered in a lot of real world scenarios.

    • Tal Weiss

      Hi Chris,

That’s a great question! In this post I mostly looked at throughput, but the GC effects of locks in high scale environments are definitely something that needs to be looked at as well. I’ll see if we can get some research done on that in the near future.

  • Timothy Stewart

    Looking at the code, I don’t think the implementation is quite correct. You have:

    public long getCounter() {
        long stamp = rwlock.tryOptimisticRead();

        if (rwlock.validate(stamp))
            return counter;

        return counter;
    }

    both of the “return counter” lines are dirty reads since there is nothing preventing the writer thread from updating while the counter is being read. Instead, you would want to get the optimistic read stamp, read the value, then validate that the stamp is still okay, and return the gotten value:

    public long getCounter() {
        long stamp = rwlock.tryOptimisticRead();
        long value = counter;
        if (rwlock.validate(stamp)) {
            return value;
        } else {
            // the value is dirty; repeat the optimistic read, or fall back to a full read lock
        }
    }
    Instead of an else, you could retry the optimistic lock by putting it in a while(true) loop, or you could fallback to a full read lock. A straight return of the dirty value isn’t a good option (otherwise you’re not testing the optimistic lock pattern).

    Thanks for putting the benchmark code together. Hope you don’t mind — I’ve grabbed it to run myself, and added a ReentrantLock implementation to the list which seems very close to synchronized, and changed your main class looping a bit so I can run them all in turn and print a summary.

    I’m shocked at how bad RWLock is. I really expected it to be a better option with many readers, few writers.

    • Timothy Stewart

      Just because I’m finding this really interesting, I changed up the scenario a bit. Instead of having X threads constantly writing, and X threads constantly reading, I thought more along the lines of a web application that might have a backing cache. The cache is generally read only, but occasionally there is a write.

      So I set it up so that there is constant 10 threads running (as if my web app had 10 worker threads) and a 2000:1 ratio on reads to writes (randomly determined whether to write or read in each worker).

      The results come out a bit differently:

      SUMMARY — threads: 10. reader/writer ratio: 2000:1. rounds: 5. Target: 100001
      DIRTY … 1089.4
      VOLATILE … 1028.6
      ATOMIC … 1247.2
      ADDER … 1562.2
      OPTIMISTIC … 5260
      STAMPED … 12814.2
      REENTRANT … 9512.6
      RWLOCK … 18882.8
      SYNCHRONIZED … 22696.4

      All of these are from your implementation code except ReentrantLock and the Optimistic lock will retry optimistically until it succeeds in getting a clean read. RWLock is slightly better than synchronized in this scenario. Obviously, I’m on different hardware and my scenario is totally different, so the numbers only compare relative to each other, not to your own results.

      Note that of the single variable access methods, volatile comes out ahead, but if you flip the scenario to make reads light and writes heavy atomic performs better, and adder far better than that. Of the fully synchronized methods, synchronized becomes twice as bad as the others, with the others coming out pretty close. For example, 100 writes to a read (so reads are pretty sparse):

      SUMMARY — threads: 10. reader/writer ratio: 1:100. rounds: 5. Target: 100000001

      DIRTY … 2119.2
      VOLATILE … 3373
      ATOMIC … 1772.2
      ADDER … 899.4
      OPTIMISTIC … 4388.8
      STAMPED … 6022.4
      REENTRANT … 5071.6
      RWLOCK … 6114.4
      SYNCHRONIZED … 12009.6

      Thanks again for writing this code and article — all very interesting. I think my biggest takeaway here is that the usage pattern really should drive which approach you choose for both simple value access and full synchronization.

      • Tal Weiss

        Hey, thanks again Timothy for posting these. Really good stuff.

        I think it’s no surprise that atomics do a much better job at increasing a counter, but by nature they’re limited to that, vs. locks’ ability to make critical sections of entire blocks of code.

        Volatiles are only good if one thread is doing the modification, and dirty counters.. well they’re just that 🙂 Btw, I’m a big fan of the new Java 8 Adders – I think the internal approach is brilliant.

        It’s really interesting to see that in this scenario stamped isn’t really outperforming RW and RE-RW. I was surprised to see that synchronized is 2X slower in the second scenario.

        I think you’re right and that was also how I felt when I ran these tests – you can’t really choose a lock idiom based on documentation, or what looks like the “cleanest” solution on paper. You really have to tailor your code to the real-world environment and conditions in which it will perform, and that’s no easy feat 🙂

    • Tal Weiss

      Hi Timothy – thanks for the great comment!

      You’re absolutely right that I’m returning the counter regardless of whether the stamp was invalidated. This is something I thought about, but I decided to go with just measuring the cost of optimistic locking, the reason being that different apps might react to this situation differently, by either looping the optimistic locking or upgrading to a heavier read lock.

      For this I decided to measure just the straight up cost of optimistic locking. You’re right that in extremely contended situations you may find yourself spin-locking if you try to continuously lock optimistically. As for RW lock performance, that’s why we now have StampedLock 🙂

      • Phillip Henry

        Hi, Tal. I agree with Timothy. Using OptimisticStamped in your metrics is misleading. You’re not measuring the cost of optimistic locking if you don’t attain the optimistic lock. Should the results be amended so the optimistic lock is retried if it fails? Then the code is semantically consistent with the other methods that use locks. Otherwise, we might as well be returning a dirty value, which of course is going to be faster.
        – Phillip

        • Phillip Henry

          Well, that’s a surprise. When I implemented this, the results on my E5-2687W were *better* than dirty-ish reads. I am perplexed…

          While I look into it, I noted that the ADDER (or rather the backing Java class, LongAdder) also does not guarantee a valid value either (see the javadoc of LongAdder.sum()). Basically, this method is only accurate in the uncontended case so not suitable for some uses.

          Thanks for bringing Adders to my attention! Interesting post, Tal.

          • Phillip Henry

            At the risk of sounding slightly obsessive, I’ve put together some graphs of the time it takes for the different locking techniques using permutations of read and write threads at

            Feedback appreciated!

          • Tal Weiss

            Hey Philip – Beautiful work – we’ll make sure to tweet and share it!

  • Jason

    This is a pretty skewed benchmark. The point of a ReadWriteLock is to increase throughput for reads when writes are relatively less common. From looking at the benchmark code, though, the top-line number is wholly dependent on how long the writer threads take to push a counter to a target value. In other words, it ignores that the ReadWriteLock could be getting dramatically higher read throughput compared to the other approaches, and only reflects that an increasingly busy set of readers can increasingly delay a shrinking pool of writers, which should be somewhat obvious.

    What would be useful is instead of measuring how long it takes just the writers to do their thing, run the test for a fixed amount of time, with each reader/writer thread maintaining a local count of how many locked operations it performs. Looking at the relative counts of read and write operations achieved in a given timeslice can give us a sense for the tradeoffs in read, write, and total throughput that each type of lock entails.

    • Tal Weiss

      Hi Jason,

      Thanks for the great comment.

      The point of the benchmark was to try and get as much contention as possible between the R/W threads, which is why I picked a relatively small operation to be carried out. I was also interested in seeing how those levels of contention change in relation to the number of readers and writers.

      You could put a sleep or IO call within the write critical section to alleviate the lock contention, and make the write operation longer (making life for the lock a bit easier).

      There’s really no one benchmark that can cover all possible scenarios, and that’s why the point I make is that you have to test your own scenario in production (or good enough simulation) to pick the right locking idiom across your different R/W states and combinations.

      If you want, please change the benchmark to fit your scenario of choice, run it and I’ll definitely post those results. Thanks again!

  • Gad Meyer

    Hi Tal,
    Great article as usual.
    I find it interesting whether the ratios change in each scenario if:
    a. We change the OS the JVM runs on.
    b. Or especially if we run the code on a virtual machine (As it happens in the cloud). I develop C# on a Windows machine, however in production my code probably runs on a Linux machine which runs Windows Server on a VM. I believe many developers face the same situation.

    Anyway, I’ll definitely use your code to experiment with C#’s lock and ReaderWriterLockSlim 🙂

    • Tal Weiss

      Hi Gad,

      Thanks for the note and kind words!

      Other than the synchronized lock which is hard wired into the JVM, most other locks are implemented mostly in Java and are not dependent on, or are wrappers of native POSIX / WIN idioms.

      The majority of the native calls outside of the JVM (e.g. CAS operations, memory fencing) are made using the Unsafe class, directly into the JIT compiler and processor. As a result, I don’t believe you’ll see much of a difference in performance between OSs.

      With regards to virtualization, that’s a great question. When running in a hypervisor things can change materially. I don’t have hard data on that though – a good idea for a post in itself! 🙂

  • Richard Richter

    Very interesting result. It’s quite satisfying to hear that synchronized isn’t that bad at all. I tried to switch the Java Simon library to ReadWriteLock once as an experiment – and for some thread counts it was a bit faster, for others a bit slower than synchronized – but all within 10%. So I didn’t bother at all, for the sake of much simpler code with synchronized. Now, seeing these results for the new classes, I will think twice before I try these locks (after I move the library to JDK 8, which will not be very soon).

    All in all, it is really good that synchronized performs quite satisfactorily in all cases. I never understood why so many coders frown upon it, very often not even knowing the difference between an atomic change of one field and consistency across multiple fields. Maybe it’s based on something very old, or on cases of very specific applications…? Don’t know. Good article in any case, thanks.

  • Richárd Olivér Legéndi

    The Peter Griffin link is broken unfortunately 🙂 Anyway, great description, thanks!

  • Michael Huang

    Another factor to point out: the time consumed during each read and write. If these values are very quick and tiny, then intrinsic monitor (synchronized construct) isn’t too bad ; otherwise, there can be some other surprisingly different results.

  • Christophe Lorenz

    You must never use optimistic locking. Although it worked in the past and was a cool hack, it is now obsolete on today's CPUs and will fail at some point in the future.
    Modern CPUs execute instructions out of order to improve efficiency.
    That means whatever code you propose can be re-arranged for speed optimisations.
    Effectively re-ordering your “lock” and read operations, breaking the optimistic lock fallback.
    This is already true today, but future processors will push the optimisations further, breaking existing code.
    The only way around this is being explicit in your code. Use locking to create a barrier that will block re-ordering.