Hello,
Long email about DNS timers and alerting based on them. Unless you
have domains on BitFolk's secondary DNS platform you probably won't
care about this, and even then you still probably don't care unless
you've been receiving alerts about them. Turn back now!
Still here? OK.
I've recently implemented DNS secondary domain zone age alerts. They
send alerts when the zone on BitFolk's nameservers is too old. This
saves me having to read logs and open a support ticket to advise
customers that the zone transfers are failing, so I'm all in favour
of that.
The definition of "too old" differs on a per-domain basis. Two of the
values in a domain's SOA record matter here: refresh and expire.
The refresh value tells secondary servers how often to check in
with the primary.
The expire value tells secondary servers how long they should
consider themselves valid for without successful contact with the
primary. If there is no contact with the primary for the expire
period then the secondary server stops serving the domain and
returns SERVFAIL for every query.
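For reference, both timers can be read straight out of a domain's SOA
record. This is only a minimal sketch: it assumes the record data is in
the standard order (MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM)
with the timers as plain seconds, and the function name and the
example.com record are made up for illustration.

```python
from typing import NamedTuple


class SoaTimers(NamedTuple):
    serial: int
    refresh: int  # seconds between checks with the primary
    retry: int    # seconds to wait before retrying a failed refresh
    expire: int   # seconds the zone stays valid without contact
    minimum: int


def parse_soa_rdata(rdata: str) -> SoaTimers:
    """Split SOA record data into its numeric fields.

    Assumes timers are plain seconds (no "1d"-style zone file units).
    """
    fields = rdata.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 SOA fields, got {len(fields)}")
    # fields[0] is MNAME and fields[1] is RNAME; the rest are integers.
    return SoaTimers(*(int(f) for f in fields[2:]))


# Hypothetical record, for illustration only.
soa = parse_soa_rdata(
    "ns1.example.com. hostmaster.example.com. 2024010101 86400 7200 3600000 3600"
)
print(soa.refresh, soa.expire)
```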
So, based on the above, a DNS domain should never be "older" than
refresh. If it is older then that means that at least one refresh
attempt failed. If the age approaches expire then the domain is in
danger of not being served.
At the moment I have decided to send a warning alert on 150% of
refresh and a critical alert on 50% of expire.
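Those two thresholds could be sketched roughly like this. It is not the
actual check that BitFolk runs; the function name is hypothetical, the
age is assumed to be seconds since the last successful refresh, and
treating "exactly on the threshold" as alerting is my assumption.

```python
def classify_zone_age(age: int, refresh: int, expire: int) -> str:
    """Return "OK", "WARNING" or "CRITICAL" for a zone of the given age.

    All arguments are in seconds. Integer arithmetic avoids float rounding.
    """
    if 2 * age >= expire:        # age is at least 50% of expire
        return "CRITICAL"
    if 2 * age >= 3 * refresh:   # age is at least 150% of refresh
        return "WARNING"
    return "OK"


HOUR = 3600
print(classify_zone_age(12 * HOUR, 86400, 3600000))
print(classify_zone_age(40 * HOUR, 86400, 3600000))
print(classify_zone_age(600 * HOUR, 86400, 3600000))
```

The critical test comes first so that a zone past both thresholds is
reported as critical rather than warning.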
RIPE recommends 86400 (one day) for refresh and 3600000 (1000 hours;
almost 6 weeks) for expire:
http://www.ripe.net/ripe/docs/ripe-203
RFC1912 (1996) recommends one day for refresh and 2-4 weeks for
expire:
http://www.faqs.org/rfcs/rfc1912.html
So let's say you go with RIPE's recommendations. You'd receive
a warning alert after your secondary DNS setup was broken for 36 hours,
and you'd receive a critical alert if it was still broken after 500
hours (almost 3 weeks). 500 hours after that, your domain stops
being served on the secondary servers.
That seems reasonable.
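The arithmetic behind those figures, as a small sketch. The names are
hypothetical and the times are assumed to be measured from the last
successful refresh.

```python
HOUR = 3600

REFRESH = 86400    # RIPE's suggested refresh: one day
EXPIRE = 3600000   # RIPE's suggested expire: 1000 hours


def alert_timeline(refresh: int, expire: int) -> tuple[float, float, float]:
    """Return (warning, critical, expiry) in hours since the last good refresh."""
    warning = refresh * 3 / 2 / HOUR   # 150% of refresh
    critical = expire / 2 / HOUR       # 50% of expire
    expiry = expire / HOUR             # zone stops being served
    return warning, critical, expiry


warning_h, critical_h, expiry_h = alert_timeline(REFRESH, EXPIRE)
print(f"warning after {warning_h:g} h")
print(f"critical after {critical_h:g} h ({critical_h / 24:.1f} days)")
print(f"expired after {expiry_h:g} h ({expiry_h / 24:.1f} days)")
```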
Finally getting around to the point of this email: what do you think
I should do about problematic SOA values that customers have chosen?
For example, there are some domains currently on BitFolk's servers
where the refresh and expire are both set to 300 seconds (5
minutes). Ignoring what happens with alerts for a moment, that means
that every 5 minutes the secondary servers check the primary, and if
that fails even once, the domain will return SERVFAIL for all
queries until contact is made again.
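One way to put a number on that fragility is the gap between the first
missed refresh and expiry. This sketch is a simplification: it ignores
the SOA retry timer, and the function name is hypothetical.

```python
def expiry_slack(refresh: int, expire: int) -> int:
    """Seconds between the first missed refresh and the zone expiring.

    Zero or negative means one failed refresh is enough to expire the zone.
    """
    return expire - refresh


print(expiry_slack(300, 300))        # the 5 minute / 5 minute domains
print(expiry_slack(86400, 3600000))  # RIPE's suggested values
```

With refresh and expire both at 300 the slack is zero, whereas RIPE's
values leave 3513600 seconds (a bit over 40 days) to fix things.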
I can't understand what the use is of such a fragile setting; it
looks erroneous to me. This isn't just DNS purism saying, "ooh, I
don't like your non-standard values!" It will actually cause
breakage very easily. But perhaps it is not for me to reason why.
Those domains have been like that for a long time and I assume no
one has noticed. It must have caused some problems any time the
primary nameserver was unreachable by the secondary servers. But
arguably that is not my problem.
When combined with this new alerting though, what happens is that
the zone goes up to 5 minutes between refreshes, and 2.5 minutes after
each refresh a critical alert fires because the zone age is half way
to expire (5 minutes). All being well there should be a recovery about
2.5 minutes later when the next refresh succeeds. In reality these
times will vary because BitFolk's Nagios doesn't check DNS every few
minutes; it's more like every hour or longer.
That is the most extreme example of this problem, but there are a few
other domains in there where refresh and expire have been set to the
same value. Each of those will cycle between alert and recovery,
forever.
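The condition for that cycle can be sketched as follows. It assumes a
healthy zone's age runs from zero up to refresh between successful
refreshes, and the function name is hypothetical. Whenever 50% of
expire is less than refresh, part of every refresh interval is spent
above the critical threshold even though nothing is broken.

```python
def critical_fraction_when_healthy(refresh: int, expire: int) -> float:
    """Fraction of each refresh interval a healthy zone spends in critical.

    A healthy zone's age is assumed to run from 0 up to `refresh` seconds.
    """
    critical_after = expire / 2          # 50% of expire
    if critical_after >= refresh:
        return 0.0                       # never critical while refreshes work
    return (refresh - critical_after) / refresh


print(critical_fraction_when_healthy(300, 300))
print(critical_fraction_when_healthy(86400, 3600000))
```

For any domain with refresh equal to expire this comes out at 0.5, so
a check made at a random moment has roughly even odds of alerting.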
So, what do you think I should do?
I'm not willing to give up on the alerts because I think most people
would like to know when their DNS setup is broken (or in danger of
being broken), and it saves me having to personally interact to tell
people this. Intentional DNS breakage is not my problem, but
answering/opening support tickets is.
Alerting can be disabled on a per-domain basis. Currently only by
asking support, but eventually you'll be able to flip that on the
Panel¹.
So how about having the Panel warn on the web page about SOA values
that are considered unwise, and just allowing the alerts to be
disabled if for some reason this sort of fragile DNS setup is
intentional?
Cheers,
Andy
¹
https://panel.bitfolk.com/dns/#toc-secondary-dns
--
http://bitfolk.com/ -- No-nonsense VPS hosting