A step-by-step walkthrough for choosing a Graphite monitoring architecture for your application (Updated: 14/8/17)
Graphite is an excellent open source tool for handling visualizations and metrics. It has a powerful querying API and a fairly feature-rich setup. In fact, the Graphite metric protocol is often chosen as the de facto format by many metrics gatherers. However, Graphite isn’t always a straightforward tool to deploy and use. It runs into some issues at scale, thanks to its design and its reliance on a huge number of small I/O operations, and can be a bit of a pain to deploy.
Part of the reason for the deployment pains is that it’s made up of three distinct elements (well, it requires four if you include the metric gatherer), and depending on your environment, one or more of the default elements may not be satisfactory for what you need.
While having three components can cause some implementation headaches, there’s a positive result as well. Each piece is a distinct unit, so you can mix and match which of the three elements you actually use. That means you can build a fully customized Graphite deployment just for you.
Let’s take a piece-by-piece look at what you’ll need to build a Graphite setup and some non-Graphite alternatives for each component.
1. Metrics Gatherers – Dropwizard Metrics, StatsD, and more
The first piece of your Graphite deployment isn’t part of Graphite at all. That’s because Graphite doesn’t collect any metrics on its own; it requires metrics to be sent to it. This isn’t usually a particularly large limitation, since most metric gatherers out there deliver metrics in the Graphite format, but it’s still something to keep in mind. There’s a large list of different tools available here, and no one tool is packaged in with basic Graphite monitoring.
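To make "the Graphite format" concrete: the plaintext protocol is one line per data point, in the form `<metric path> <value> <timestamp>`, sent to the Carbon listener. The snippet below is a minimal sketch of that, not a production client. The host, the port (2003 is Carbon’s usual plaintext port, but check your own configuration), and the metric name are assumptions for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: '<path> <value> <timestamp>' plus a newline."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Open a TCP connection to the listener and send a single data point.

    host and port are assumptions; adjust them to match your deployment.
    """
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example (hypothetical metric name):
# send_metric("app.requests.count", 42)
```

Because the format is this simple, almost any language or shell script can act as a metrics gatherer, which is a big part of why so many tools support it.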
Picking your Metric Gatherer(s) – The Graphite documentation has a list of collection tools that may work for you, including popular choices like CollectD and Diamond, but it’s infrequently updated, so here are a couple more you may want to consider:
Dropwizard Metrics – Metrics is a Java library that provides you with visibility into your production environment through a range of metrics. It has a Graphite reporter that can send all this data to your Graphite deployment. It’s a solid foundation for using Graphite with Java.
StatsD – StatsD is a network daemon from Etsy that runs on node.js. It listens for a range of statistics/metrics and aggregates them out to tools like Graphite. StatsD works with a lot of other visualization and metrics tools as well.
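As a rough illustration of what an application sends to StatsD, here is a minimal sketch that formats a counter and a timer in the StatsD line format (`name:value|type`) and fires them over UDP. The host, the port (8125 is the common default), and the metric names are assumptions; in practice you would more likely use an existing StatsD client library.

```python
import socket


def statsd_counter(name, delta=1):
    """Format a counter increment, e.g. b'app.logins:1|c'."""
    return f"{name}:{delta}|c".encode("ascii")


def statsd_timer(name, millis):
    """Format a timing in milliseconds, e.g. b'app.db.query:320|ms'."""
    return f"{name}:{millis}|ms".encode("ascii")


def send_statsd(payload, host="localhost", port=8125):
    """Send one datagram to the StatsD daemon (fire and forget over UDP).

    host and port are assumptions; adjust them to match your deployment.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Example (hypothetical metric names):
# send_statsd(statsd_counter("app.logins"))
# send_statsd(statsd_timer("app.db.query", 320))
```

StatsD then aggregates these raw events over its flush interval and forwards the results to Graphite.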
Takeaway – Graphite doesn’t come bundled with a metrics gatherer. However, the Graphite metrics protocol is super common, so it’s not hard to find one or more that work with your application. Since so many metrics gatherers play well with Graphite, you don’t need to pick just one. You can send metrics from several sources.
2. The Listener – Carbon, graphite-ng, and Riemann
The next piece of Graphite is the component for listening for the metrics you send it and writing them to disk. This piece is Carbon. Carbon is made up of daemon(s) and has some built-in flexibility in terms of the way it works. In the basic small-scale setup, your Carbon daemon will listen for metrics and report them to your Whisper storage database. However, as you grow you can add an aggregation element to it, which buffers metrics over a period of time before sending them to Whisper in one chunk. You can also use a Carbon relay to replicate metrics to multiple Carbon backends. This is particularly useful as you reach higher scale and need multiple Carbon daemons to handle the incoming metrics.
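The aggregation idea is easier to see in code. The following is a simplified sketch of buffering incoming points into fixed time buckets and emitting one averaged point per metric per bucket. It is not Carbon’s actual implementation; the class name, the 60-second interval, and the choice of averaging are all assumptions made for illustration.

```python
from collections import defaultdict


class BufferingAggregator:
    """Hypothetical aggregator: buffers raw points, emits one average per bucket."""

    def __init__(self, interval=60):
        self.interval = interval
        # (metric path, bucket start time) -> list of raw values
        self._buckets = defaultdict(list)

    def add(self, path, value, timestamp):
        """Buffer a raw point in the bucket that contains its timestamp."""
        ts = int(timestamp)
        bucket_start = ts - ts % self.interval
        self._buckets[(path, bucket_start)].append(value)

    def flush(self, now):
        """Return (path, average, bucket_start) for every bucket that has ended."""
        ready = [key for key in self._buckets if key[1] + self.interval <= now]
        points = []
        for key in sorted(ready):
            values = self._buckets.pop(key)
            points.append((key[0], sum(values) / len(values), key[1]))
        return points
```

Each flushed tuple would then be written to storage as a single point, so the database sees one write per metric per interval instead of one write per raw sample.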
Downsides and Potential Issues – Usually, the issues that people run into with Carbon are about matters of scale. Here are several downsides to Carbon where scale is concerned:
- An individual Carbon daemon can only handle so much at once, because it’s written in Python. Python’s global interpreter lock keeps multiple threads from executing Python code at the same time, so you run into scenarios where the Carbon daemon just starts dropping metrics.
- Carbon has a load threshold for the amount it can handle at once, but this threshold isn’t communicated to you.
- Carbon doesn’t keep any persistent open file handles to Whisper, so storing each metric requires an entire multi-step read/write sequence.
Within standard Graphite, the workaround in these situations is to split the work up into Carbon relays and Carbon caches. You still have to keep an eye out for loads though, as exceeding the Carbon loads will lead to lost data. If that’s too much of a pain or just not possible for you, you can look at a Carbon alternative.
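To illustrate the idea of splitting work across several caches, here is a minimal sketch of how a relay might route each metric to one of several backends by hashing the metric name, so the same metric always lands on the same cache. This is a plain modulo hash, not Carbon’s own routing code (the Carbon relay has its own consistent-hashing and rule-based methods), and the backend addresses are hypothetical.

```python
import hashlib


def pick_cache(metric_path, caches):
    """Map a metric name to one backend using a stable hash of the name."""
    digest = hashlib.sha256(metric_path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(caches)
    return caches[index]


# Hypothetical cache instances on one host
caches = ["127.0.0.1:2004", "127.0.0.1:2104", "127.0.0.1:2204"]
destination = pick_cache("app.requests.count", caches)
```

One design note: with modulo hashing, changing the number of caches remaps most metrics, which is why consistent hashing is generally preferred for this kind of routing.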
One alternative is graphite-ng, which is essentially Carbon rewritten in Go to avoid several of the above problems. The focus on this project has so far been on improving the Carbon relay and aggregation functions. If you like the Carbon functionality, but want to get around some performance limitations, this is a good choice.
Another alternative is Riemann. Written in Clojure, Riemann is used to aggregate and process “event streams.” Both events and streams are fairly straightforward concepts once you look at them, and Riemann can send them to the rest of your Graphite deployment in place of Carbon. It provides the additional benefit of adding some alerting capabilities to this step in the process. This is a good choice if you want to move further away from Carbon as an architecture and want to get some alerting capabilities involved.
Takeaway – Carbon listens for metrics and writes them to your storage database, but often runs into performance issues at scale. There are alternatives available that can get around this problem.
3. The Storage Database – Whisper, InfluxDB, and Cyanite
Downsides and Potential Issues: Whisper is based on RRD (Round-Robin Database), but was written with some key differences, such as the ability to backfill entries for historical data and handle irregularly occurring data entries. These are useful features for a metrics and visualization tool, but they come with tradeoffs.
- Performance is slower for Whisper, since it’s written in Python.
- As designed, it runs into some issues with storage space, since each metric requires a file and is all single instance. This was an intentional tradeoff to facilitate some of the previously mentioned benefits, but there’s no denying that Whisper is disk space inefficient.
- Carbon and Whisper end up involving a lot of IO calls thanks to their design. Scale again becomes an issue on the disk IO front as you move beyond small size deployments.
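A quick back-of-the-envelope calculation shows why the one-file-per-metric design gets expensive. Whisper preallocates each file for its full retention, with every stored point taking 12 bytes (a 4-byte timestamp plus an 8-byte value) on top of a small header. The sketch below estimates the file size for one metric; the retention schedule (10-second points for 6 hours, 1-minute points for 7 days, 10-minute points for 5 years) is only an example, and the header sizes are stated here as assumptions about the on-disk format.

```python
POINT_SIZE = 12         # 4-byte timestamp + 8-byte double per stored point
METADATA_SIZE = 16      # file-level header (assumed from the Whisper format)
ARCHIVE_INFO_SIZE = 12  # per-archive header entry (assumed from the Whisper format)


def whisper_file_size(archives):
    """Estimate bytes on disk for one metric.

    archives is a list of (seconds_per_point, number_of_points) tuples.
    """
    total_points = sum(count for _, count in archives)
    header = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    return header + POINT_SIZE * total_points


# Example retention: 10s for 6 hours, 1m for 7 days, 10m for 5 years
example_archives = [
    (10, 6 * 3600 // 10),     # 2,160 points
    (60, 7 * 24 * 60),        # 10,080 points
    (600, 5 * 365 * 24 * 6),  # 262,800 points
]
size_per_metric = whisper_file_size(example_archives)  # 3,300,532 bytes
```

At roughly 3.3 MB per metric, 100,000 metrics under this schedule would need around 330 GB of disk, whether or not the metrics ever receive data for the whole period.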
Whisper Alternatives: You can get around some of Whisper’s performance issues with SSDs and some design implementations, but only to a point. If the database portion is causing fits for you, there are a few alternatives to consider checking out.
One of the main ones today is Influxdata (InfluxDB). Based on LevelDB, Influxdata is a time series database that’s written in Go. Influxdata is able to work around the disk IO issues with some write optimizations and by not requiring one file per metric. Influxdata supports the protocol that Carbon uses, making it an able Whisper replacement, but implements its own SQL-like query language. There are even projects designed to make using Influxdata as a Whisper replacement as easy as possible, like graphite-influxdb, which makes communicating with the Graphite API mostly seamless. Influxdata is on the newer side, but is very promising and works well with a large range of other tools.
Another option is to use a Cassandra-based storage database. Thanks to work from graphite-cyanite, a Cyanite database is a good option for this route. Cyanite has been developed with the aim of replacing Whisper in the Graphite architecture, meaning that it works with Carbon and Graphite-web (with a few dependencies). Using Cyanite helps solve some of the performance and availability issues of Whisper when running at scale. Cyanite is a good choice if you only want to replace Whisper and not build a more extensive Graphite alternative.
Takeaways – The storage database component for Graphite monitoring is Whisper. At higher scale, Whisper runs into some performance and availability issues unless you go very fancy on the hardware front and break it up into complex manual clusters. There are several database alternatives that you can use for improved performance and availability if this is an issue for you.
4. The Visualizer – Graphite-Web and Grafana
Once you’ve gathered and stored the metrics, the third step is visualizing them. That’s the role of Graphite-web. Graphite-web is a Django-based webapp that allows you to visualize and play with your metrics. It provides a fairly solid amount of capabilities in terms of what you can do with your data, but the visualization component isn’t exactly beautiful. It’s the front-end component, so we’re talking more user experience here, but that’s an important consideration too.
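Graphite-web also exposes your metrics over HTTP through its render API, which is what most alternative dashboards build on. Here is a minimal sketch of building a render URL that asks for the last hour of a metric as JSON. The base URL and metric name are hypothetical, and only a few common parameters (`target`, `from`, `until`, `format`) are shown.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, from_="-1h", until="now", fmt="json"):
    """Build a Graphite-web /render URL for one or more target expressions."""
    params = [("target", target) for target in targets]
    params += [("from", from_), ("until", until), ("format", fmt)]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


# Example (hypothetical host and metric name):
url = render_url("http://graphite.example.com", ["app.requests.count"])
```

Fetching that URL with any HTTP client returns the data points for each target, which you can then feed into whatever front end you prefer.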
Alternatives: Thanks to the great Graphite API, there are a huge number of alternative dashboards out there (see the Visualization section here). Since there are so many visualization options and a lot of them come down to personal taste, I won’t list many alternatives, but I do want to call one out in particular. Perhaps the biggest alternative visualization tool for Graphite today, or at least the most buzzed about, is Grafana.
Grafana is an open source dashboard tool that works with both Graphite and InfluxDB. It used to be a front-end only tool that required ElasticSearch for storing dashboards, but with the v2.0 release, it comes with a backend storage component for storing dashboards you create. Grafana was designed with creating a better visualization component for Graphite in mind, so it’s very well suited for replacing the default Graphite-web. It’s quite feature-rich and is being worked on at a steady rate. Grafana does have a backend component now, but there are tools that are pure front-end if that’s something you want. The tools list in the Graphite documentation has a list of some of those.
Takeaways – There are loads of visualization options available for Graphite if you find the default visualizer too basic or unappealing. Some of them are pure client-side, and some have a backend component for storing dashboards you build. No matter what you want, you’ll be able to find something here.
5. Code Level Metrics – Trends
OverOps is releasing a new feature that will let you send code level metrics to Graphite from the JVM for the errors in your application – tied together with the variable state that caused them. To sign up and be notified when Trends becomes available for your Graphite deployment, you can check out OverOps right here: https://www.overops.com
For all the complaints people levy against Graphite (it’s not worked on consistently enough! The dashboards are ugly! It’s a pain at scale!), there’s a reason it’s a popular tool. If you want an open source metrics and visualization tool with features on par with many enterprise tools, Graphite is worth giving a shot. And one of the great things about it is that you can customize it to your heart’s content. It’s not exactly plug-and-play with the different components, but where’s the fun in that? With a little trial and error, you can build a completely customized Graphite (or Graphite-like) deployment that works great for your environment.