See how Viator, a TripAdvisor company, debugs Java in production
One of the biggest pains we all experience is trying to debug applications in production. It becomes a real challenge for companies that serve millions of users on a daily basis, users who share their private information with the company.
We had a chat with Viator, a TripAdvisor company, and learned how they solve errors in production and cut days of work down to minutes. Join us and learn about the new way to debug Java in production. For a deeper dive, check out this session by their Director of Software Development.
7 Million Visitors, 120 Million API Requests
Viator lets travelers all over the world search for, find and book tours and attractions. Their application has 7 million visitors per month, and their backend receives 120 million API requests per month.
All of this is handled by a team of 60 developers and 100 AWS instances, spread across 3 front ends and 20 backend services. Viator is PCI (Payment Card Industry) Level One compliant, which means it meets a set of high-level security standards that keep credit card information secure.
Working in a PCI environment means that it’s nearly impossible to replicate issues locally, and it limits production access. User data has to remain secure, even from the developer’s eyes.
The workflow consists of two-week sprints across 12 teams, with each team encouraged to set aside some time in each sprint for diagnosing critical exceptions.
On top of that there’s legacy code whose original developers are long gone, and things are logged just in case, which leads to very noisy logs. To be more specific, the company writes about 100GB of logs per day.
This makes identifying, finding and understanding errors and exceptions a big challenge.
Viator uses a number of monitoring tools, such as Pingdom, New Relic and PagerDuty, to keep track of everything that happens on the servers, but sometimes these tools are not enough. For example, when looking through performance monitoring tools for a certain error, it’s hard to pinpoint what in the code might have caused it.
Reducing Days of Work to Minutes
Steve Rogers, Software Development Director at Viator, told us about an error the company encountered in production, and how they solved it.
After releasing a major version, they saw a sudden spike of 500 errors on one of their APIs, and the logs were filling up with exceptions. In order to debug this issue, the developers had to:
1. Roll back the release
2. Look through the logs and the code, where nothing was obvious
3. Create a new hotfix release with extra logging
4. Release the new version
5. Wait for the issue to replicate, which did not take long
6. Get the new logs and roll back the release
7. Fix the issue
8. Finally, release the new version
That whole process took 3 days. Ouch.
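To make step 3 concrete, here is a minimal sketch of what an "extra logging" hotfix often looks like in Java. It is not Viator's code: the class `BookingPriceCalculator`, the method `pricePerTraveler` and its parameters are hypothetical names made up for illustration. The idea is simply to record the inputs at the point of failure, while keeping anything sensitive (such as card data in a PCI environment) out of the log.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example class; none of these names come from Viator's code base.
public class BookingPriceCalculator {
    private static final Logger LOG =
            Logger.getLogger(BookingPriceCalculator.class.getName());

    // Splits a booking total evenly between travelers.
    public static int pricePerTraveler(String bookingId, int totalCents, int travelers) {
        try {
            // Integer division throws ArithmeticException when travelers == 0.
            return totalCents / travelers;
        } catch (ArithmeticException e) {
            // The "extra logging": capture the non-sensitive inputs that led to
            // the failure, then rethrow so the caller's behavior is unchanged.
            LOG.log(Level.SEVERE, describeFailure(bookingId, totalCents, travelers), e);
            throw e;
        }
    }

    // Builds the log message; only non-sensitive fields are included.
    public static String describeFailure(String bookingId, int totalCents, int travelers) {
        return "pricePerTraveler failed: bookingId=" + bookingId
                + " totalCents=" + totalCents
                + " travelers=" + travelers;
    }

    public static void main(String[] args) {
        System.out.println(pricePerTraveler("B-1", 30000, 3));
    }
}
```

The catch is that you have to guess in advance which values matter, ship a release to capture them, and wait for the error to happen again. That is where most of the three days went.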
There’s got to be a better way, right?
After installing OverOps, the development team started getting real-time alerts on errors that occur in each new version they release. The alerts include the variable values at the moment of the error, which lets the team reproduce and solve issues within minutes.
OverOps email digests alert developers to new errors that were detected in their application during the last 24 hours. These emails also notify them about specific errors that exceed target volumes that were set up in advance.
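The "target volume" idea can be illustrated with a few lines of Java. This is only a sketch of the general technique, not OverOps' implementation or API: the class `ErrorDigest`, the method `overTarget` and the sample numbers are all assumptions made up for this example. Given error counts for the last 24 hours and a target per error type, it returns the errors that went over their target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; this is not how OverOps is implemented.
public class ErrorDigest {

    // Returns the names of errors whose count in the window is strictly above
    // their target. Errors without an explicit target use defaultTarget.
    // Names come back in alphabetical order so the digest is stable.
    public static List<String> overTarget(Map<String, Integer> counts,
                                          Map<String, Integer> targets,
                                          int defaultTarget) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : new TreeMap<>(counts).entrySet()) {
            int target = targets.getOrDefault(entry.getKey(), defaultTarget);
            if (entry.getValue() > target) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts for a 24-hour window.
        Map<String, Integer> counts = Map.of(
                "NullPointerException", 120,
                "TimeoutException", 8);
        Map<String, Integer> targets = Map.of("NullPointerException", 50);

        // prints [NullPointerException]
        System.out.println(overTarget(counts, targets, 10));
    }
}
```

Setting the targets in advance means the team decides what counts as "too many" once, instead of rediscovering it while scrolling through 100GB of logs.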