Today, there is little dispute that software is indeed eating the world. Every part of our life includes software, from the way we watch movies, to the way we rideshare, reserve hotels, communicate, shop, etc.
Marc Andreessen’s article, “Why Software Is Eating the World”, tells the story and explains why all companies are now software companies. Shifting to a software-driven mentality has created many challenges, as new competition and the pace of innovation put pressure on leadership, lines of business, and IT organizations.
New World, New Expectations
In this new era where software is eating the world, customer expectations have changed dramatically. Patience is a thing of the past. Slow response times, errors and security holes send customers elsewhere with the click of a mouse (they don’t have to drive across town anymore). What’s worse, not only do customers move on to your competition, but they let everyone else know about their experience via social media. Your applications need to be reliable, or you’ll quickly feel the wrath of your customers. Brand damage in this new world is real, and it lasts longer than ever before.
How is IT trying to deal with application reliability?
Application reliability is an ongoing challenge for every company, small and large. To deal with the challenges in this new world, IT is arming itself with tools, processes and people. Among those tools you can find automated testing, log aggregators, APMs, static code analyzers, bug trackers and more. The processes include agile methodologies, peer reviews and regression testing. New teams and roles, like Site Reliability Engineers (SRE) and DevOps, were created to bridge the gap between development, deployment and operations.
Despite all of these initiatives, application stability and reliability are still an ongoing challenge.
Does anyone really know what is happening inside the application?
I have worked with hundreds of companies over the past ten years and have learned a great deal about how applications are deployed and managed. Companies accept daily application reboots as a solution for hidden problems, including memory leaks, file handle limits, and other unhandled code issues that cause application failures. Many errors happen thousands of times a day but are never addressed. I often hear the argument, “with tens of thousands of errors, how do I even know which errors are worth my time to investigate?” Excessive logging practices cause log files to become very large, hurt application performance and create noise when outages and sev1 issues occur.
To make things more complicated, logs only account for areas where the engineers have been expecting to see an error. This creates data gaps so it is impossible to determine the full scope of errors occurring. When did it become acceptable to have applications full of issues with no resolution in sight?
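To make that data gap concrete, here is a minimal sketch of the difference between an error the engineer expected (and logged) and a swallowed exception. The function names and the "orders" logger are hypothetical and chosen only for illustration; the point is that both calls fail in the same way, yet only one leaves a trace in the log.

```python
import io
import logging

# Hypothetical setup: an in-memory log so the example is self-contained.
log_stream = io.StringIO()
logger = logging.getLogger("orders")
logger.setLevel(logging.WARNING)
logger.propagate = False
logger.addHandler(logging.StreamHandler(log_stream))


def parse_quantity_logged(raw):
    """Expected failure path: the engineer anticipated bad input and logs it."""
    try:
        return int(raw)
    except ValueError:
        logger.error("invalid quantity: %r", raw)
        return 0


def parse_quantity_swallowed(raw):
    """Swallowed exception: the error is caught and silently discarded."""
    try:
        return int(raw)
    except ValueError:
        return 0  # no log line, no metric, no trace of the failure


logged_result = parse_quantity_logged("ten")
swallowed_result = parse_quantity_swallowed("ten")

# Two errors occurred, but the log only knows about one of them.
log_lines = log_stream.getvalue().splitlines()
print(f"errors that occurred: 2, errors in the log: {len(log_lines)}")
```

Both functions return the same fallback value, so callers cannot tell them apart. Anyone relying on the log alone would undercount the errors by half in this toy case.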
Why should I care if my application has errors if no one is complaining?
I get asked this question a lot. If a customer or user is not complaining that something is broken, why should I care? There are really two arguments here.
The first is: just because your customer didn’t call to complain doesn’t mean they didn’t log on to Twitter to complain. It doesn’t mean that they didn’t leave for your competitors, and it doesn’t mean that the issue they experienced won’t lead to outages and more errors down the road.
The second is: you should care because every error has a cost (see below). Inbox zero should be the goal for every development team. This means fewer outages, late night support calls and war rooms... you get your life back.
“If the error is not recorded in logs did it really happen? Errors are most likely occurring and OverOps can quantify it and increase application confidence. I’ve seen developers find issues in 5 minutes with OverOps that they were convinced didn’t exist.”
-Derek D’Alessandro, 20-year technology developer & operations veteran
What is the true cost of an application error?
There are two areas of cost that come from application errors. There are soft costs such as loss of customers, brand tarnishment, additional staff requirements and lower job satisfaction (developers quickly get tired of debugging code issues). These are hard to quantify, and we use formulas like “lifetime value of a customer” and “real cost of losing an employee” to help measure the cost.
The other costs are what are called “hard costs”. This overhead consists of the cost of log aggregators (including the ever-growing logging storage and indexing costs), additional hardware/storage/network, and APM tools. Excessive errors affect CPU performance, can significantly increase log volume and leave behind unknown side effects that can cause outages at a later time.
“Applications need to be as solid as reasonably possible, long time customers that have a single or multiple bad experience(s) are a powerful net demoter in social media. OverOps gives valuable insights into operational quality before a single customer sees the new application release.”
-Derek D’Alessandro, 20-year technology developer & operations veteran
Culture of Accountability
Before I became a field engineer, I ran an engineering team for over ten years writing software for major insurance companies. Their SLAs for scale, performance and reliability kept my team and me working long hours and always finding ways to improve. We had a team goal of 98% bug-free and we held each other accountable.
One way we did this: if someone broke an application, their name went on the board. This meant everyone (including our leadership) had visibility into our development process. A name on the board meant there was a chance to learn and improve. As a team, we discussed the issue, then searched our code base for similar occurrences and fixed them.
What came from this exercise was a team of developers that trusted, respected and learned from each other. This sort of accountability is harder to find these days, with the growing pressure on teams to deliver new features faster than the competition, leading to shorter deadlines and less time to learn from your mistakes.
Now, I see code get “thrown over the fence” for others to deal with. This creates animosity, bottlenecks, finger pointing and ultimately, less reliable applications. It is time for a change…in culture.
Changing the Status Quo
The time has come for teams and team members to be accountable for their code, beyond writing it and forgetting about it. In order for that to happen, application teams need to know what is going on inside their applications. They should know how many errors and logging events are occurring every minute of every day. As discussed above, each of these has a cost, and the goal should be to have no known or unknown errors.
If everyone is held accountable, applications will perform at a higher standard. The effects of no errors would be tremendous.
- There would be no more rebooting due to memory leaks or file handle limits reached.
- The logs would be cleaner, and it would be so much easier to find the signal through the noise.
- When real application issues occur, it would be easier and faster to troubleshoot.
- Applications would scale better as CPU would not be wasted on error handling.
- Customers would not be complaining or leaving for your competitor.
How do I get visibility inside my application at runtime?
As I mentioned above, we should know what errors are occurring in our application throughout the software development lifecycle. How is this possible? OverOps is an application reliability platform that captures every error and log.warn and log.error event that occurs in your application, including previously unknown events like “Swallowed Exceptions” that never show up in the logs.
The data that is captured is game changing as dev and ops teams will have 100% visibility into the running application. Events are granular down to the method in the code with error counts and rates. The last important component is the ARC™ (Automated Root Cause) screen which shows the code that was executing along with the data that was moving through the application at the time of the error.
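As a rough illustration of what "error counts and rates down to the method" can look like, the sketch below groups a handful of made-up error events by method and error type, then ranks them by frequency. This is not OverOps' implementation or API; the event data, the names and the two-minute window are all assumptions for the example.

```python
from collections import Counter

# Hypothetical error events: (method where the error occurred, error type).
events = [
    ("CartService.checkout", "NullPointerError"),
    ("CartService.checkout", "NullPointerError"),
    ("PricingClient.fetch", "TimeoutError"),
    ("CartService.checkout", "NullPointerError"),
    ("UserStore.load", "KeyError"),
    ("PricingClient.fetch", "TimeoutError"),
]
WINDOW_MINUTES = 2  # assumed length of the observation window


def rank_errors(events, window_minutes):
    """Return (method, error, count, rate per minute), most frequent first."""
    counts = Counter(events)
    return [
        (method, error, count, count / window_minutes)
        for (method, error), count in counts.most_common()
    ]


ranked = rank_errors(events, WINDOW_MINUTES)
for method, error, count, rate in ranked:
    print(f"{method:22} {error:18} count={count} rate={rate:.1f}/min")
```

Sorting by frequency is only one way to prioritize. A team might instead weight each error by the business impact of the method it occurs in.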
“Show me the data. If you can’t measure it, you can’t understand it and can’t fix it.”
-Anonymous
OverOps’ Examples of Success
In the first example, a large SaaS company had 1,500 unique errors occurring every day, with some happening over a billion times a day. The CPU consumed and the volume of logging were causing issues with scale and cloud cost. The visibility in OverOps helped them prioritize the most costly errors, and by fixing those they were able to cut their number of application servers from 200 to 100, reducing cloud spend and giving them room to scale.
Our second example is a large credit card company that had been struggling with scale and intermittent outages. They turned on OverOps and immediately saw hundreds of errors, 65% of which were swallowed exceptions. These types of errors never show up in a log file and are unknown to all teams. A simple cost/performance calculation uncovered hundreds of hours of lost CPU time. After the code was fixed, latencies decreased by about 10%, increasing application throughput.
Technology has changed, and will continue to change, the way we do business in the future. Now that software is the center of how every company interacts with its customers, it is important to deliver the highest quality experience every time. To that end, it is time to start a culture of accountability and build a future where developers spend more time building new features and less time debugging legacy debt, and Operations teams have full visibility into the applications, down to the line of code.
I have worked with several companies this year who are on the journey to creating a culture of accountability. They believe in inbox zero, full visibility into their applications and changing the status quo of what is expected from their development and operations teams. It can be challenging to get started, but the outcome is well worth the trouble! The keys to success are visibility into your running applications, prioritization of the errors found and the drive to fix these issues.