Prometheus counters don't behave

…when the numbers are not increasing consistently.

The problem

We switched from running Packit’s API server under httpd + mod_wsgi to launching it with mod_wsgi-express. That change simplified how we run Packit - look at the number of removed lines! Sadly, it also broke a Grafana chart I used almost daily. The chart shows the amount of traffic from GitHub webhooks that Packit processes.

The chart is wrong: it should display the “current load”, not these constantly increasing lines that reset at some point:

Incorrect chart for amount of processed traffic

Suddenly we have this chart which doesn’t tell us anything meaningful.

Analysis

There are a few things to understand in order to resolve the problem. I learnt all of these while researching :)

  • Prometheus counters are meant to only increase, and to reset to 0 only when the process restarts.

    Hold on, but that’s what is happening in the chart above, so… why is it wrong?

  • The increase() function (it is part of PromQL; we use it in the Grafana query) should be used to visualize the data, since it shows the difference between measurements and hence the load.

    Well, we used it before changing the webserver and it worked as expected. So what change broke the chart?

  • The increase() function can deal with the resets to 0. Except that based on the chart above it seems it cannot.

  • Finally (and most importantly), we have a /metrics endpoint in Packit which provides the numbers Prometheus uses for the chart above. The data is stored in the Python WSGI process; the numbers are not in any database. They are just regular Python variables defined using the Python Prometheus client library. They are at the mercy of Python, WSGI, Linux and OpenShift.
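To make the points about counters and resets concrete, here is a minimal sketch in plain Python of how a reset-aware increase can be computed over scraped counter values. It is a simplification and not Prometheus' actual implementation: the real increase() also extrapolates to the edges of the time window, which is ignored here. The function name and the sample values are made up for illustration.

```python
def simple_increase(samples):
    """Total growth of a counter over a list of scraped values.

    A drop between two consecutive samples is treated as a restart:
    the counter is assumed to have started again from 0, so the whole
    new value counts as growth.
    """
    total = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # drop detected, assume a reset to 0
    return total


# Hypothetical single process: counts up to 10, restarts, counts up to 7.
single_process = [0, 4, 10, 0, 3, 7]
print(simple_increase(single_process))  # 17
```

With a single process, the restart in the middle does no harm: 10 events before the reset plus 7 after it gives 17.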

For the complete analysis of the problem with real examples, please read the original GitHub issue comment.

This is just a quick blog post, so here’s the actual bug:

TL;DR: the /metrics endpoint is served by multiple httpd processes, each with its own numbers. Successive scrapes can hit different processes, so the values are no longer monotonically increasing, and that confuses the increase() function.
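The bug can be reproduced with a small plain-Python sketch. It assumes two worker processes whose scrapes happen to alternate, and it uses a simplified reset-aware increase (no window extrapolation, unlike the real PromQL function). The worker names and all numbers are invented for illustration.

```python
def reset_aware_increase(samples):
    """Sum the growth between consecutive samples; a drop counts as a reset to 0."""
    total = 0
    for previous, current in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total


# Each hypothetical worker keeps its own counter in its own memory.
worker_a = [5, 6, 7]
worker_b = [2, 3, 4]

# Scrapes alternate between the workers but land in ONE time series.
scraped = [value for pair in zip(worker_a, worker_b) for value in pair]
print(scraped)  # [5, 2, 6, 3, 7, 4]

# What really happened: each worker handled 2 more events.
actual = (worker_a[-1] - worker_a[0]) + (worker_b[-1] - worker_b[0])
reported = reset_aware_increase(scraped)
print(actual, reported)  # 4 17
```

Every switch from the higher counter to the lower one looks like a restart, and every switch back looks like a burst of traffic, so the result (17) has little to do with the real load (4).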

Resolution

\o/

Corrected chart for amount of processed traffic

I resolved this by labelling the measurements per process (with the process ID in this case) and aggregating them with the sum() function. You can see the updated query in the chart above, and here’s the corresponding PR. Our Grafana is not public.
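Why the labels help can be sketched in plain Python as well. The idea, under the same simplifying assumptions as a reset-aware increase without extrapolation, is to split the scrapes into one series per process ID, compute the increase for each series separately, and only then add the results up. This roughly mirrors a query of the general shape sum(increase(<metric>[<window>])); the PIDs and values below are hypothetical, and this is not the actual query from the PR.

```python
from collections import defaultdict


def reset_aware_increase(samples):
    """Sum the growth between consecutive samples; a drop counts as a reset to 0."""
    total = 0
    for previous, current in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total


def summed_increase(labelled_scrapes):
    """Group (pid, value) scrapes by pid, then sum the per-pid increases."""
    series = defaultdict(list)
    for pid, value in labelled_scrapes:
        series[pid].append(value)
    return sum(reset_aware_increase(values) for values in series.values())


# Hypothetical scrapes, now tagged with the PID of the process that answered.
scrapes = [(101, 5), (102, 2), (101, 6), (102, 3), (101, 7), (102, 4)]
print(summed_increase(scrapes))  # 4
```

Each per-process series is monotonic again, so the total matches the number of events the two processes really handled.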

Happy measuring!
