【Prometheus】Best Practice

Posted by 西维蜀黍 on 2021-01-01, Last Modified on 2022-01-03

Cardinality

Prometheus performance almost always comes down to one thing: label cardinality.

Cardinality is the number of unique values something has. For example, a label containing HTTP methods has a cardinality of 2 if your application only uses GET and POST.
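As a minimal sketch of that definition, the snippet below counts the distinct values of a label across a handful of series. The label sets and the `label_cardinality` helper are made up for illustration and are not part of any Prometheus client library.

```python
# Hypothetical label sets, one dict per time series.
series_labels = [
    {"method": "GET", "path": "/"},
    {"method": "POST", "path": "/"},
    {"method": "GET", "path": "/login"},
]


def label_cardinality(series, label):
    """Return the number of distinct values seen for one label."""
    return len({labels[label] for labels in series if label in labels})


method_cardinality = label_cardinality(series_labels, "method")
path_cardinality = label_cardinality(series_labels, "path")
print(method_cardinality, path_cardinality)
```

Here the `method` label only ever takes the values GET and POST, so its cardinality is 2 no matter how many series carry it.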

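It’s fairly common for things to start out reasonable. You might have a histogram covering 2 HTTP methods, 7 HTTP paths, and 5 machines, and a Prometheus typically only monitors one environment and datacenter. That gives 2 x 7 x 5 x 12 = 840 time series, where the factor of 12 is presumably the number of series the histogram exposes for each label combination (its buckets plus the sum and count). That is well within the capabilities of a single Prometheus.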

What tends to catch you out is that things usually don’t grow in only one dimension. Increased traffic means more machines, and more users usually means more features and therefore new endpoints. So you might now have, say, 3 x 8 x 6 x 12 = 1728 series. Each of the first three factors grew by just 1, yet the total is more than double the original!
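The multiplication can be checked with a short sketch. The `series_count` helper and the constant name are hypothetical, and the factor of 12 is simply taken from the figures in the text rather than derived from any particular client library.

```python
from math import prod

# Assumption: 12 series per label combination, as used in the text.
SERIES_PER_HISTOGRAM = 12


def series_count(methods, paths, machines, per_histogram=SERIES_PER_HISTOGRAM):
    """Multiply the label cardinalities to get the total number of series."""
    return prod((methods, paths, machines, per_histogram))


before = series_count(2, 7, 5)  # 2 methods, 7 paths, 5 machines
after = series_count(3, 8, 6)   # each dimension grown by one
growth = after / before

print(before, after, round(growth, 2))  # prints: 840 1728 2.06
```

Because the dimensions multiply rather than add, small increases in several labels at once compound into a much larger jump in the total.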

It’s still small overall, but this is just one metric, from one subsystem, and only one minor growth spurt. Over time growth accumulates and compounds, and it can gradually bring you to the point where your Prometheus starts to creak. No single change caused it, but it still needs to be dealt with before your monitoring falls over. A Prometheus 2.x can handle somewhere north of ten million series over a time window, which is rather generous, but unwise label choices can eat that up surprisingly quickly.

Scrape Interval

Reference