On Monday, December 30, 2013, Andy Smith wrote:
Hi,
I'd like to revisit a topic that has never really been resolved -
what to do when someone goes past the limit of their backup space.
When I say backups I'm talking about the backups service as
described here:
https://bitfolk.com/customer_information.html#toc_2_Local_backups
It's by no means an awesome service - I recognise that everyone has
their own preferred methods of doing backups and there's no way to
please everyone - but it is taken advantage of by 38 people at
present.
The way it works currently with regard to disk usage is:
- A backup job runs
- Disk usage is calculated and the usage is recorded in a database
- Nagios sends warnings when that usage goes above 95%, sends
critical alerts if it goes above 100%
Critical at 100% is maybe too late?
I would say ~90% could be critical, as by then you are clearly running out of space and soon won't be able to back up anymore.
Or, if seven times the size of the last backup (i.e. in a week you won't be able to back up anymore) would take usage past x%, then critical. But maybe it's a PITA to get the size of the last backup?
- Backups keep on running anyway
- Both I and the customer see those Nagios alerts
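To make the threshold idea concrete, here is a minimal Python sketch. Only the 95% and 100% levels come from the description above; the function names and the seven-day projection are my own assumptions for illustration, not how the real checks are implemented. Return values follow the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def usage_status(used_bytes, quota_bytes, warn_pct=95.0, crit_pct=100.0):
    """Classify usage; an alert fires when usage goes above a threshold."""
    pct = 100.0 * used_bytes / quota_bytes
    if pct > crit_pct:
        return CRITICAL
    if pct > warn_pct:
        return WARNING
    return OK


def projected_status(used_bytes, last_backup_bytes, quota_bytes,
                     days=7, crit_pct=100.0):
    """Hypothetical extra check: CRITICAL if `days` more backups of the
    same size as the last one would push usage above crit_pct."""
    projected = used_bytes + days * last_backup_bytes
    # Using the same value for both thresholds yields only OK or CRITICAL.
    return usage_status(projected, quota_bytes,
                        warn_pct=crit_pct, crit_pct=crit_pct)
```

Lowering the critical level to ~90% would then just be a matter of calling `usage_status(used, quota, warn_pct=80.0, crit_pct=90.0)`, where the 80% warning level is again only a guess on my part.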
So, let's say someone goes above 100% usage. Here's what I tend to
do:
- Leave it for a bit to see if the usage starts going down. If it
does then it will probably go below 100% again as the customer
fixed whatever got backed up that shouldn't
- If it keeps going upwards or is so far beyond 100% that it would
take ages to drop, then I open a ticket with the customer asking
them what they want to do.
- Most of the time I get no reply, so assuming the overage is only
small I wait a week or two before asking them to respond.
- Eventually I do get a response and it will usually be a request
for one of two things:
a. Buy more disk space for backups, or
b. Go into the backups and delete every instance of some directory
that should never have been backed up
I really, really dislike doing (b) because I don't want to mess
about in customer files, I might make a mistake, I might see things
I don't want to see, etc. But I will do it if the customer insists.
If you have the size of the last backup, is it possible to add a check to see whether the current backup is X% larger than the last one?
It seems to me (and I'm totally inexperienced and have never dealt with this) that this could detect early when something got backed up that shouldn't have been.
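Something like the following is what I have in mind, as a rough Python sketch. The function name and the 20% default are made up for illustration, and it assumes the sizes of the previous and current backups are already available (for instance from the database where usage is recorded).

```python
def grew_too_much(previous_bytes, current_bytes, max_growth_pct=20.0):
    """Return True if the current backup is more than max_growth_pct
    larger than the previous one."""
    if previous_bytes <= 0:
        # No usable baseline (first backup, or an empty previous run),
        # so there is nothing to compare against: don't alert.
        return False
    growth_pct = 100.0 * (current_bytes - previous_bytes) / previous_bytes
    return growth_pct > max_growth_pct
```

A sudden jump would then trigger an alert well before the quota is reached, while normal day-to-day growth stays quiet.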
As you can probably see, all of this is quite a hassle to resolve.
Basically I don't want to be sending emails and deleting files by
hand.
I can think of a couple of ways to reduce the hassle, and I was
wondering if any of you who currently take advantage of the backups
have any thoughts on this:
1. I could stop providing the local backups service.
38 people isn't a huge amount, and it probably won't be a big
hardship to find other backup strategies. Most other solutions
are quite complex and in these days of "unlimited backup space"
that many services offer, maybe I should just not bother?
2. When the customer goes over 100% I could automatically add disk
space to cover the usage, and invoice them.
2a. Like (2) but just leave it a couple of weeks before doing that,
to give them chance to fix it first.
3. Something else?
If usage is at 100% I would abort the current backup with an error (and send the corresponding alert). After all, that's what happens when you run out of space...
And, if possible, I would lower the Nagios alert thresholds and try to detect when the backup grew by X% from one day to the next, and alert on that too.
But in any case, the most reasonable thing to me is to abort subsequent backups until there is free space.
But bear in mind that I don't use the service :)
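As a rough Python sketch of the "abort when full" idea: every name here is hypothetical, `do_backup` stands in for whatever actually runs the backup job, and the `print` is only a placeholder for sending the real alert.

```python
class BackupSpaceExhausted(Exception):
    """Raised when a backup is refused because the quota is used up."""


def check_space_before_backup(used_bytes, quota_bytes):
    """Refuse to start a backup once usage has reached 100% of quota."""
    if used_bytes >= quota_bytes:
        raise BackupSpaceExhausted(
            f"backup space full: {used_bytes} of {quota_bytes} bytes used"
        )


def run_backup(used_bytes, quota_bytes, do_backup):
    """Run do_backup() only if there is space; return True if it ran."""
    try:
        check_space_before_backup(used_bytes, quota_bytes)
    except BackupSpaceExhausted as exc:
        print(f"CRITICAL: {exc}")  # placeholder for the real alert
        return False
    do_backup()
    return True
```

Backups would resume on their own as soon as usage drops below the quota again, since the check runs before every job.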
Note that although "just suspend the customer's backups as soon as
they go past 100%" initially sounds like a good idea, it may not be
as it prevents the customer from removing whatever it was they
backed up that they didn't mean to, i.e. fixing it themselves.
Sorry, I don't follow you here :-S
Thanks,
Rodrigo