Production errors get in the way of a great customer experience, whether your customers are reporting the bugs to you or to all of their friends via Twitter.

CI/CD workflows are all the rage these days, with teams automating the build, test and deploy stages of the application lifecycle. But for some reason, the next stage in the lifecycle – when something breaks – isn't getting the same treatment.

Any error that makes its way into production has the potential to cause damage, whether that means it's affecting all of your users by taking down the entire system or it's affecting just one user's experience. Lack of visibility in production means that we don't know anything's wrong until errors or bugs are already causing trouble for our users. Why is all of this so important?

Let’s dig in.

How Many Monthly Support Tickets Are Opened…?

According to a benchmark done by Zendesk, the customer service software company, the average ticket volume that support teams receive via their service is 259 tickets per month (not to mention natural spikes around the holiday season). It's not hard to guess that fluctuations in ticket volume from month to month have a tangible correlation with customer satisfaction, but let's look at an example from this same benchmark.

Looking at the data from one year to the next, the people at Zendesk saw that customer satisfaction drops in the last quarter of the fiscal year. To understand why it dropped 6% in one year compared to 2% the previous year, they looked at the change in ticket volume. What they found is that ticket volume had increased proportionally – that is to say, when ticket volume grew to 3X its previous level from one year to the next, the drop in customer satisfaction was 3X as large as well.

Now, this report doesn't give much clarity into which of these is the cause and which is the effect. But it does show that there is a connection between system errors and customer satisfaction, and by understanding how the two have been related in the past, we can better understand how such factors influence our business in the present.

How Long Do Customers Have to Wait for a Solution…?

Continuing with the theme of customer satisfaction and how it relates to application errors or failures, what happens after the initial support ticket is received also has an impact. On average, it takes a support technician 8 messages to resolve a user-reported issue.

That's 8 messages from the support tech to the customer or client, asking for more information on the error, checking with the dev team and hopefully identifying the cause of the issue. In cases where the end-user is the initial reporter of an issue, the support team may be able to resolve it on their own, or will otherwise work with the dev team to resolve it.

Although in many cases the support team is more than capable of understanding and resolving the issue on their own, only 39% of issues can be resolved with 1 reply from a customer service representative. On top of that, the benchmark for full resolution time (across a range of industries) is 20 hours – more than 2 full work days.

The Application Lifecycle and How It All Falls Apart

So, how does this happen?

Most applications, regardless of the industry they're built for or the language they're written in, go through a similar release lifecycle from development to production. Once the code has gone through development, it's tested for expected outcomes. Once it passes those tests, it goes to staging, where it's tested against more “real life” situations.

Unfortunately, an application is only really put to the test when it reaches production and real users start to interact with it. A good QA team can catch most instances where an application could be expected to fail, but no QA team can anticipate all of the edge cases of user behavior before the application is sent to production.
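To make that gap concrete, here's a deliberately simplified sketch (the function, prices and tests are hypothetical, not taken from any of the companies mentioned here): the QA suite covers every outcome the team expected, and the code still misbehaves on inputs only a real user would send.

```python
def shipping_cost(weight_kg: float) -> float:
    """Flat rate under 5 kg, per-kilo pricing at or above it."""
    if weight_kg < 5:
        return 4.99
    return round(weight_kg * 1.25, 2)

# The QA suite covers the outcomes the team expected...
def test_light_package():
    assert shipping_cost(2) == 4.99

def test_heavy_package():
    assert shipping_cost(10) == 12.50

# ...but in production a user submits an empty cart, or a buggy client sends a
# negative weight, and the "fully tested" code happily returns a nonsense price.
print(shipping_cost(0))   # 4.99 to ship nothing
print(shipping_cost(-3))  # 4.99 for a negative weight
```

Both tests pass, QA signs off, and the bug only surfaces once real traffic starts exercising inputs nobody thought to try.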

And worse? As the application progresses through these steps of the lifecycle, visibility into why things happen the way they do decreases exponentially. So, when something breaks in production, it can be close to impossible to get to the root cause of the issue.

Visibility into root cause of errors decreases exponentially as the release cycle progresses

Due to this lack of visibility, it's not uncommon for users to be the main source of information for teams working on solving an issue, which leads to more support tickets. And I think we can all agree that the end-user usually isn't the best source for trying to pinpoint an issue (which leads to longer resolution times).
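For contrast, here's a minimal sketch of the alternative – automated error capture at the moment of failure, so the first report comes from the system rather than from a frustrated user. It's purely illustrative (the logger name, release string and context fields are hypothetical, and it isn't tied to any particular monitoring product):

```python
import json
import logging
import sys
import traceback
from datetime import datetime, timezone

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("production-errors")

def report_uncaught(exc_type, exc_value, exc_tb):
    """Record the full stack trace and context as soon as something breaks,
    instead of waiting for an end-user to open a support ticket."""
    report = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error": exc_type.__name__,
        "message": str(exc_value),
        "stack_trace": traceback.format_exception(exc_type, exc_value, exc_tb),
        "release": "2016.08.1",  # hypothetical build identifier
    }
    # Here the report is only logged; a monitoring tool would ship it to a
    # dashboard or alerting channel so the dev team sees it before the first
    # support ticket ever arrives.
    logger.error(json.dumps(report, indent=2))

sys.excepthook = report_uncaught

if __name__ == "__main__":
    orders = {}
    # Simulate the kind of edge case QA never anticipated.
    print(orders["missing-customer"])  # raises KeyError, captured by the hook
```

The point isn't this particular hook – it's that the stack trace and context get captured where they still exist, in production, instead of being reconstructed later from a user's description.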

Whether the user only acts as the initial reporter of the issue or is involved throughout the troubleshooting process, the effects that this has on customer satisfaction and on the company's bottom line are undeniable. Not to mention the stress that high numbers of support tickets can inflict on the entire development team, and the high potential for disaster.

Why We Can’t Afford 20-Hour Resolution Times

These days, the impact of an error that takes up to 20 hours to resolve has the potential to be catastrophic. Even errors that can be identified and resolved in less time can leave a lot of damage in their wake.

John Allspaw, the former CTO of Etsy, once said in a talk, “The increasing significance of our systems, the increasing potential for economic, political, and human damage when they don’t work properly, the proliferation of dependencies and associated uncertainty — all make me very worried. And, if you look at your own system and its problems, I think you will agree that we need to do more than just acknowledge this.”

When errors cripple the entire system, they may limit your customers' ability to use your product or do their jobs and, as a result, they sometimes start to make headlines in the news and across the Twittersphere.

Just ask these companies:

Now, let’s take a closer look at 3 stories from this list that reveal some of the different ways that application errors can negatively impact customer experience:

1. United Airlines, July 2015

How 1 Hour of Downtime Grounded 4,900 Flights and Caused Delays for More Than a Day

One morning in July 2015, United Airlines experienced “a network connectivity issue” that required officials to handwrite boarding passes (without confirmation that passengers were not on a no-fly list) and led to an imposed ground stop, meaning United planes couldn't take off at all. The technical issue was resolved in a little more than an hour, but in the meantime it affected 4,900 flights worldwide and caused a domino effect of delays and cancellations throughout the day and into the next.

James Record, a professor of aviation at Dowling College, explained that “the schedules are so tight that it will take a long time for United to restore normal operations after so many flights are disrupted, even if it is only for a bit more than an hour.” He also mentioned that system failures like this happen at least several times a year.

For us, it's not hard to believe or to understand why issues like these happen. All it takes is one small line of code out of millions being out of place or broken. Still, flying is stressful, and the average person trying to get from Chicago to LA on a Wednesday morning may not understand how this happens – and they certainly won't understand when their flight is still delayed hours later. To them, we all just look like idiots.

2. GitHub, January 2016

How 2 Hours of Downtime Prevented Users from Getting Work Done

Luckily for GitHub, they have a client base composed almost entirely of developers, who are at least slightly more forgiving when issues arise. The difference, of course, is that their users can more easily identify with the situation. Hence:

But that doesn't mean that system downtime doesn't affect them – it definitely does.

When GitHub.com was unavailable for 2 hours on January 28, 2016, due to a brief power outage that caused “a cascading failure”, their users found themselves simply unable to work in some cases, or at least less productive in others. Fortunately, they kept their sense of humor about it:

3. SSP Pure Broking, August 2016

How Zero Visibility Can Lead to 2 Full Weeks of System Downtime

In the worst of many software failures that plagued the customers of SSP Pure Broking, a power outage caused a malfunction in one of the company's data centers, knocking it offline for close to 2 weeks.

During the downtime, insurance brokers using this system (about 40% of UK brokers) were unable to trade with new clients; worse, some of their current clients' coverage may have lapsed without anyone being notified. A system failure like this can have life-or-death consequences.

Aside from the outrage at the effects this would have on people’s lives and businesses, many of those affected just couldn’t believe that something like this was happening in this day and age:

Final Thoughts

Think we’re being dramatic? Maybe a little. But it doesn’t change the facts.

We have lower visibility into the root cause of errors in production, which means that in many cases we have to rely on our end-users to identify the problem and to help solve it. In minor cases, this might mean an increase in support tickets, which has been shown to correlate with lower customer satisfaction. In more serious cases, the error can be bad enough to cause a system failure in which ALL of your end-users are affected.

Luckily, it doesn’t have to be this way. Join our on-demand webinar to find out how TripAdvisor solves production errors in minutes.

Data source: Zendesk Benchmark, 2013

Tali studied theoretical mathematics at Northeastern University and loves to explore the intersection of numbers and the human condition. In her free time, she enjoys drawing and spending time with animals.