In this blog post I will show you how to get started with custom Grafana dashboards for Kubernetes. We will also learn the basics of the Prometheus query language (PromQL). Both tools are very useful in the everyday life of cluster admins and users.
Why you should learn the Prometheus query language
Grafana and Prometheus are very powerful tools that enable you to monitor almost anything about your Kubernetes cluster. They are, though, difficult to master. I recently installed Kubernetes 1.9.2 and deployed Prometheus with Grafana. After installing a few dashboards, it turned out that most of the graphs did not show any data. After digging a bit deeper, I learned that the dashboards were set up with metrics that had been deprecated, removed, or renamed. Therefore I started to customize dashboards in order to get metrics from the newest Kubernetes.
I have also noticed that dashboards do not provide very valuable metrics out of the box. So you will still need to customize Grafana in order to get the most out of your setup.
Getting started
Before we dive in, you need to deploy Prometheus and Grafana to your cluster. You can find my setup on GitHub here. I have deployed Grafana using kubectl-compatible YAML files and Prometheus using Helm. You will also need a Grafana data source configured to fetch data from Prometheus.
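As a rough sketch, the deployment steps above boil down to something like the commands below. The chart name, release name, namespace, and manifest path are assumptions based on a typical Helm v2 setup of that era, so adapt them to your own repository layout:

```shell
# Deploy Prometheus via Helm (chart/release names are assumptions)
helm install stable/prometheus --name prometheus --namespace monitoring

# Deploy Grafana from kubectl-compatible YAML manifests
# (the grafana/ directory is a placeholder for your own manifests)
kubectl apply -f grafana/
```

Once both are running, add Prometheus as a data source in the Grafana UI before creating any panels.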
To begin with, create a new dashboard and add a new graph to it. The pictures below will help you locate the edit button in a panel.
If you see the same screen as the one below, we can start writing Prometheus queries.
Let’s start with something simple and vital to your system. Before diving deeper into your metrics, you should make sure that your Kubernetes cluster has enough memory and CPU. Measuring memory and CPU usage based on host, namespace and pod will enable you to detect problems quickly. If you divide namespaces by teams or departments, it will also give you insights into who’s using your cluster the most.
Let’s start by summing up memory usage by namespace. Remember that some of the metrics might have been renamed. If you cannot find the metrics that I am using below, you will need to find the replacements yourself.
The query below sums up all container memory usage by namespace while filtering out memory usage that doesn’t have a namespace.
sum(container_memory_usage_bytes{image!=""}) by (namespace)
The results are there and look quite okay. However, I bet you have no idea how much memory 1.2 Bil actually is. Let’s fix this in the Axes tab: change the Y axis unit type to bytes as shown below.
The graph already looks much better.
Let’s also change the labels under the chart, as I don’t like the brackets {} around namespace names. You can change this by setting the legend format to {{ namespace }} as shown below.
You can do the same with pods and nodes for both memory and CPU. In order to monitor CPU, use container_cpu_usage_seconds_total instead of container_memory_usage_bytes.
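As a sketch, the per-pod and per-node variants might look like the queries below. The exact label names (pod_name vs. pod, and instance for the node) depend on your Kubernetes and cAdvisor versions, so treat them as assumptions and verify them in your own setup:

```promql
# Memory usage summed per pod (the label may be pod_name or pod,
# depending on your Kubernetes version)
sum(container_memory_usage_bytes{image!=""}) by (pod_name)

# CPU usage per node; the instance label usually identifies the host
sum(rate(container_cpu_usage_seconds_total{id!=""}[10m]) * 100) by (instance)
```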
This is a very handy and easy feature. It doesn't require you to change the previous Prometheus query. Just duplicate your graph and go to the Display settings again.
You will see the Stacking option. You should also enable nulls as zero in this case.
Although the graphs we just created already look quite nice, CPU usage is still shown in an arbitrary unit that is not easy to decipher. Let’s modify one graph to show CPU usage as a percentage of CPU capacity.
You can start with something like this. 100% means that 1 CPU core is fully utilized over a given period of time.
NOTE: You have to be careful with the rate window over which the metrics are shown (the [10m] part in the query below). If you use too small an interval (such as 1m), Grafana might tell you that there are no data points available. This is because the scrape points are too far apart in time, and Prometheus will not be able to calculate the rate between them.
sum(rate (container_cpu_usage_seconds_total{id!=""}[10m]) * 100) by (namespace)
Let’s raise the bar a bit. Now we want to show, per namespace, the percentage of CPU used relative to the total number of CPUs in the cluster. This metric will give us good insight into how many resources we still have left in our cluster. Let’s change our previous query and divide it by the number of CPUs in the cluster.
Because the previous query returns a vector, we need to convert machine_cpu_cores to a scalar in order to divide each namespace’s resource usage by it. To get a better grasp of this, you can test both queries in the Prometheus query explorer.
sum (rate (container_cpu_usage_seconds_total{id!="", namespace!=""}[10m]) * 100) by (namespace) / scalar(sum(machine_cpu_cores))
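The same idea works for memory. Assuming your setup exposes machine_memory_bytes (a cAdvisor metric; verify the exact name in your Prometheus query explorer, as this is an assumption about your metric set), you can divide namespace memory usage by total cluster memory:

```promql
# Memory usage per namespace as a percentage of total cluster memory
# (machine_memory_bytes is assumed to be available from cAdvisor)
sum(container_memory_usage_bytes{image!="", namespace!=""}) by (namespace)
  / scalar(sum(machine_memory_bytes)) * 100
```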
That’s all for the basics of creating Grafana charts. This guide should enable you to create simple charts and monitor the basic metrics of your Kubernetes cluster. In coming posts I will dive deeper into getting meaningful information from your cluster metrics.
You can see my end result below.