Hi Conrad,
On Fri, Feb 25, 2022 at 04:08:32PM +0000, Conrad Wood wrote:
> I think I'm missing something. The "predicted" on the panel[1] looks
> rather different to grafana.
> The screenshots were taken about 5 seconds apart.
> Is that expected?
Yes. The prediction algorithm was explained here:
https://lists.bitfolk.com/lurker/message/20220223.204720.d86cf277.en.html
The two things (emailed notifications about data transfer, and the
figures on the panel about data transfer) have their predictions
calculated differently.
For the email notifications:
- First it does a lightweight prediction by taking the sum of data
  transferred this period, averaging it to a per-second value and
  then multiplying that by the number of seconds in 30 days.
- If:
    - This prediction indicates a state transition, either from
      "predicted okay" to "predicted over" or the reverse, AND
    - You're in the first 15 days of the period, THEN:
  a more heavyweight prediction is done by adding up the actual
  figures for the last 30 days. That is then used as the prediction.
For the panel:
- Just uses the simple lightweight prediction.
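The two paths above can be sketched roughly as follows. This is only an
illustration in Python, not the real code: the function names, the use of
bytes as the unit, the "greater than the quota means over" test and the way
the previous state is passed in are all assumptions.

```python
SECONDS_IN_30_DAYS = 30 * 24 * 60 * 60
FIFTEEN_DAYS = 15 * 24 * 60 * 60


def lightweight_prediction(bytes_this_period, seconds_elapsed):
    """Scale this period's average per-second rate up to 30 days."""
    if seconds_elapsed <= 0:
        return 0.0
    return bytes_this_period / seconds_elapsed * SECONDS_IN_30_DAYS


def email_prediction(bytes_this_period, seconds_elapsed, quota_bytes,
                     was_predicted_over, sum_last_30_days):
    """Prediction used by the email notification batch job.

    sum_last_30_days is a hypothetical callable standing in for the slow
    SQL sum of the actual figures; it is only called when needed.
    """
    light = lightweight_prediction(bytes_this_period, seconds_elapsed)
    # Assumption: "over" means strictly greater than the quota.
    state_changed = (light > quota_bytes) != was_predicted_over
    in_first_15_days = seconds_elapsed < FIFTEEN_DAYS
    if state_changed and in_first_15_days:
        return sum_last_30_days()
    return light


def panel_prediction(bytes_this_period, seconds_elapsed):
    """The panel only ever uses the lightweight prediction."""
    return lightweight_prediction(bytes_this_period, seconds_elapsed)
```

Passing the heavyweight sum in as a callable keeps the expensive query
out of the common case where no state transition is predicted.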
Reasoning: If we're going to send a notification about a state
change, either to tell you to worry or to tell you not to worry, it
would be better to use more accurate figures for that prediction.
The lightweight prediction might be using only a small amount of
data. In your case right now about 3 days' worth to guess about a 30
day period.
If you look back at the last email notification sent to you, you'll
see that it predicted you'll go over again this period, and its
predicted figure matches what our Prometheus is now telling you,
because that's always how it's worked.
It's true that the simple prediction for you is right now returning
a much lower figure. You can look back in your bandwidth graph and
see why that is: There was a prolonged period of high usage a while
ago, but within the last 30 days. If we ignore that period then of
course the prediction will come out much lower. Is it correct to
ignore that period? Who knows. We just have to pick an algorithm;
you could argue either way.
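To put some numbers on that, here is a small Python illustration. The
figures are entirely made up (they are not your real usage) and the daily
granularity is an assumption chosen to keep the example short.

```python
GB = 10**9
DAY = 24 * 60 * 60

# Made-up daily totals for the trailing 30 days. The first 27 days fall
# in the previous reporting period and include a 10-day burst; the last
# 3 days are the current period so far.
daily_usage = [2 * GB] * 12 + [40 * GB] * 10 + [2 * GB] * 5 + [2 * GB] * 3

current_period = daily_usage[-3:]

# Lightweight: average rate over the current period only, scaled to 30 days.
lightweight = sum(current_period) / (len(current_period) * DAY) * (30 * DAY)

# Heavyweight: the actual total transferred over the last 30 days.
heavyweight = sum(daily_usage)

print(f"lightweight: {lightweight / GB:.0f} GB")
print(f"heavyweight: {heavyweight / GB:.0f} GB")
```

With these invented figures the lightweight prediction is about 60 GB
while the heavyweight one is 440 GB, because only the latter sees the
burst.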
So why not always use the more accurate prediction on the panel as
well?
The reason is historical. The figures are stored in a regular SQL
database and summing up 30 days of metrics from that source takes an
appreciable amount of time. This setup pre-dates Prometheus or any
other sort of time series database we had going.
So, we have only been doing it when it's considered important to do
so, in the batch job that sends out email notifications about data
transfer limits, and then the calculated predictions were thrown
away. It wasn't considered feasible to do those calculations in a
web page.
By the time you are 15 days into this reporting period the two
figures will match.
So, what you are seeing is expected but it doesn't mean we can't
improve things.
I do not want to calculate the official billing figures from
Prometheus because I don't want to make Prometheus (or Grafana) an
essential piece of infrastructure.
I could store the calculated predictions back in the database and
use those on the panel.
It's always worked like this and no one has really noticed in more
than 10 years, probably because hardly anyone reaches their limits
and when they do it's usually nearer to the end of the reporting
period. It should be pretty easy to store the calculated predictions
though, I've just never thought of doing it before.
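A minimal sketch of that idea, in Python with SQLite purely so that the
example runs on its own. The table name, columns and function names are
hypothetical and are not the real schema.

```python
import sqlite3
import time


def store_prediction(conn, account_id, predicted_bytes, now=None):
    """Batch job side: keep the latest calculated prediction per account."""
    now = int(time.time()) if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO transfer_prediction "
        "(account_id, predicted_bytes, calculated_at) VALUES (?, ?, ?)",
        (account_id, predicted_bytes, now),
    )
    conn.commit()


def stored_prediction(conn, account_id):
    """Panel side: a cheap single-row lookup, or None if nothing is stored."""
    row = conn.execute(
        "SELECT predicted_bytes FROM transfer_prediction WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0] if row else None


# Hypothetical table, created in memory for the sake of the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transfer_prediction ("
    "account_id TEXT PRIMARY KEY, "
    "predicted_bytes INTEGER NOT NULL, "
    "calculated_at INTEGER NOT NULL)"
)
```

The panel would then fall back to the lightweight prediction whenever
`stored_prediction` returns None.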
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting