What to do when a customer's backups go above 100%

List overview All Threads
Download

newer

older

Strange DDOS attack?

2013-12-21 2331Z and ongoing...

Andy Smith

30 Dec 2013 30 Dec '13

12:05 p.m.

Hi, I'd like to revisit a topic that has never really been resolved - what to do when someone goes past the limit of their backup space. When I say backups I'm talking about the backups service as described here: https://bitfolk.com/customer_information.html#toc_2_Local_backups It's by no means an awesome service - I recognise that everyone has their own preferred methods of doing backups and there's no way to please everyone - but it is taken advantage of by 38 people at present. The way it works currently with regard to disk usage is: - A backup job runs - Disk usage is calculated and the usage is recorded in a database - Nagios sends warnings when that usage goes above 95%, sends critical alerts if it goes above 100% - Backups keep on running anyway - Both I and the customer see those Nagios alerts So, let's say someone goes above 100% usage. Here's what I tend to do: - Leave it for a bit to see if the usage starts going down. If it does then it will probably go below 100% again as the customer fixed whatever got backed up that shouldn't - If it keeps going upwards or is so far beyond 100% that it would take ages to drop, then I open a ticket with the customer asking them what they want to do. - Most of the time I get no reply, so assuming the overage is only small I wait a week or two before asking them to respond. - Eventually I do get a response and it will usually be a request for one of two things: a. Buy more disk space for backups, or b. Go into the backups and delete every instance of some directory that should never have been backed up I really, really dislike doing (b) because I don't want to mess about in customer files, I might make a mistake, I might see things I don't want to see, etc. But I will do it if the customer insists. As you can probably see, all of this is quite a hassle to resolve. Basically I don't want to be sending emails and deleting files by hand. I can think of a couple of ways to reduce the hassle, and I was wondering if any of you who currently take advantage of the backups have any thoughts on this: 1. I could stop providing the local backups service. 38 people isn't a huge amount, and it probably won't be a big hardship to find other backup strategies. Most other solutions are quite complex and in these days of "unlimited backup space" that many services offer, maybe I should just not bother? 2. When the customer goes over 100% I could automatically add disk space to cover the usage, and invoice them. 2a. Like (2) but just leave it a couple of weeks before doing that, to give them chance to fix it first. 3. Something else? What are your thoughts on (2)? Or any further suggestions? Would you need any modification to the alerting settings before this would be acceptable? Note that although "just suspend the customer's backups as soon as they go past 100%" initially sounds like a good idea, it may not be as it prevents the customer from removing whatever it was they backed up that they didn't mean to, i.e. fixing it themselves. Would (2) be more workable if there was some mechanism for the customer to go in and delete stuff from the backups in the space of time they have before they will actually get invoiced? Cheers, Andy -- http://bitfolk.com/ -- No-nonsense VPS hosting "I am the permanent milk monitor of all hobbies!" — Simon Quinlank

Attachments:

signature.asc (application/pgp-signature — 198 bytes)

Show replies by date

Rodrigo Campos

30 Dec 30 Dec

2:44 p.m.

On Monday, December 30, 2013, Andy Smith wrote:

...

Critical on 100% is maybe too late? I would say maybe ~90% can be critical, as you are clearly running out of space and won't be able to backup anymore. Or if the size of the last backup times 7 (like in a week you won't be able to backup anymore) is more than x%, then critical. But maybe is a pita to get the size of the last backup?

...

- Backups keep on running anyway - Both I and the customer see those Nagios alerts So, let's say someone goes above 100% usage. Here's what I tend to do: - Leave it for a bit to see if the usage starts going down. If it does then it will probably go below 100% again as the customer fixed whatever got backed up that shouldn't - If it keeps going upwards or is so far beyond 100% that it would take ages to drop, then I open a ticket with the customer asking them what they want to do. - Most of the time I get no reply, so assuming the overage is only small I wait a week or two before asking them to respond. - Eventually I do get a response and it will usually be a request for one of two things: a. Buy more disk space for backups, or b. Go into the backups and delete every instance of some directory that should never have been backed up I really, really dislike doing (b) because I don't want to mess about in customer files, I might make a mistake, I might see things I don't want to see, etc. But I will do it if the customer insists.

If you have the size of the last backup, is it possible to add a check to see if the current backup is X% more than the last one? This seems to me, that I'm totally inexperienced and never dealt with this, that can detect early when something got backed up when it shouldn't?

...

As you can probably see, all of this is quite a hassle to resolve. Basically I don't want to be sending emails and deleting files by hand. I can think of a couple of ways to reduce the hassle, and I was wondering if any of you who currently take advantage of the backups have any thoughts on this: 1. I could stop providing the local backups service. 38 people isn't a huge amount, and it probably won't be a big hardship to find other backup strategies. Most other solutions are quite complex and in these days of "unlimited backup space" that many services offer, maybe I should just not bother? 2. When the customer goes over 100% I could automatically add disk space to cover the usage, and invoice them. 2a. Like (2) but just leave it a couple of weeks before doing that, to give them chance to fix it first. 3. Something else?

If usage is 100% I would abort the current backup with an error (and send the corresponding alert). After all, that's what happens when you run out of space... And, if possible, I would lower the nagios alert % and try to detect when the backup increased from one day to the other by X% and alert about that too. But in any case, the most reasonable thing to do for me is to abort the next backups until there is free space. But take into account that I don't use the service :)

...

Note that although "just suspend the customer's backups as soon as they go past 100%" initially sounds like a good idea, it may not be as it prevents the customer from removing whatever it was they backed up that they didn't mean to, i.e. fixing it themselves.

Sorry, don't follow you here :-S Thanks, Rodrigo

Andy Smith

3:49 p.m.

Hi Rodrigo, On Mon, Dec 30, 2013 at 12:44:59PM -0200, Rodrigo Campos wrote:

...

On Monday, December 30, 2013, Andy Smith wrote:

- Nagios sends warnings when that usage goes above 95%, sends critical alerts if it goes above 100%

Critical on 100% is maybe too late?

Having just checked it is actually 95% for warning and 99% for critical.

...

I would say maybe ~90% can be critical, as you are clearly running out of space and won't be able to backup anymore.

That is a fair point although the words "warning" and "critical" are at the moment just words used in template text in the alert and there is therefore no significance between them except what the recipient reads. Also doubtless different people will consider different percentages to be what they want. There isn't a concept of "running out of space" at the moment - if you go above 100% then your backups still work. You just eventually get asked by me to pay for more space or have some stuff deleted. Perhaps it is best if the critical alerts stay at 99% and I allow the warning percentage to be configurable.

...

Or if the size of the last backup times 7 (like in a week you won't be able to backup anymore) is more than x%, then critical. But maybe is a pita to get the size of the last backup?

It doesn't really work like this as there isn't a concept of "size of last backup" - only files that change are backed up, so if you had ten 1GB files that did not change since last time then the usage would be 10GB even though there are two sets of backups. If one file changed then both versions would be stored, so the usage across both sets of backups would be 11GB. So, there is a *differential* of 1GB per backup run, and it is true that I could take note of this and compare it to how much space is left then guess how many of these backup runs would fit given the same amount of diffs every time. That is really complicated though and I'm not convinced there'd be very much value in this compared to just the used percentage.

...

While possible, these just sound like more alerts that people are not going to be very interested in. For those who do use the backups service, do you feel that a simple percent used alert isn't good enough and you need to know about rates of change?

...

But in any case, the most reasonable thing to do for me is to abort the next backups until there is free space.

I'm not sure that is reasonable, and I will explain why below..

...

Sorry, don't follow you here :-S

The backups are incremental. They aren't just X amount of files times Y backup points. It's X amount of files plus the amount of changes over a configurable time period that in the default case is 6 months but some people have it set to 12 months or more. The default backup schedule looks like this: - Once every four hours, keep 6. - Once every day, keep 7. - Once every week, keep 4. - Once every month, keep 6. This means that (without you contacting support to ask for stuff to be deleted out of backups), once a file is backed up, it isn't going away for 6 months. Even if you delete it off your disk. e.g., you create: /var/tmp/dvd_rip of 8GB or whatever and it gets backed up, so it's now accessible via: /srv/backups/hourly.0/var/tmp/dvd_rip Noticing your backup space usage went up by 8GB you delete /var/tmp/dvd_rip or otherwise mask it from being backed up. The file doesn't disappear out of your backups though. At the next run it'll be accessible as: /srv/backups/hourly.1/var/tmp/dvd_rip and tomorrow it will be: /srv/backups/daily.0/var/tmp/dvd_rip and so on. By now you're probably wondering where I am going with this since it doesn't explain how a customer can take some action to reduce the space their backups use, in fact all I have done is explain how a customer CAN'T fix it. Well the thing is that at every backup run the oldest iteration is being deleted, so on the 6th daily run hourly.5 is being deleted and on the 6th monthly run monthly.5 is being deleted. Therefore if you identify things that have been backed up for a long time but which don't actually need to be, you can delete them from your disk or else mask them from being backed up, and as they age out they won't take up disk space any more. An example might be the files in /var/log/ which change all the time so at every hourly run you will back up a new set of them. If you decide that you don't want a backup of them every four hours then you might mask them from being backed up. This will have immediate effect with the next backup run since that is a set of logs that got aged out of hourly.5 and never appeared in hourly.0. I do take your point though, because there is nothing stopping anyone doing the above well before 100% is reached. What I just described is also a fairly rare case - normal cause of suddenly going past 100% is mistakenly letting some big transient file be backed up, and there's currently no way for the customer to fix that by themselves. At the moment there is no negative effect from going past 100% except that I will write to you and ask you to sort it out. So I could be wrong about suspending their backups being an unreasonable thing to do. I had suggested the option of "you will automatically order more disk and be charged for it" as one possible negative consequence, and it appeals to me because it's very simple! You suggest an alternative negative consequence of "once usage goes above 100%, suspend backups". That is also fairly simple, and has the advantage that no one gets a bill that they don't intend to pay. It has the downside that the customer's backups now will never get re-enabled unless they contact me to buy more disk or ask me to delete things. Which of these makes the most sense? Should both options exist for people to choose between? If I implemented a way (from the Panel) to nuke the most recent set of backups then would that make the "suspend" option the best one as the customer can still fix it themselves? That is, upon receiving the critical alert that they had now used more than 100% of backup space and their backups have been disabled, they determine that this is because something large got backed up that shouldn't have been backed up. They could then go to the Panel and delete the most recent backup run, and then backups start working again at the next run, all of this without needing to submit support tickets and without being charged any extra. Now I have typed it out, that does sound rather more friendly than sending people bills. Cheers, Andy -- http://bitfolk.com/ -- No-nonsense VPS hosting

Keith Williams

4:12 p.m.

Andy wrote:- *If I implemented a way (from the Panel) to nuke the most recent set* *of backups then would that make the "suspend" option the best one as* *the customer can still fix it themselves?* *That is, upon receiving the critical alert that they had now used* *more than 100% of backup space and their backups have been disabled,* *they determine that this is because something large got backed up* *that shouldn't have been backed up. They could then go to the Panel* *and delete the most recent backup run, and then backups start* *working again at the next run, all of this without needing to submit* *support tickets and without being charged any extra.*

...

That sounds the best way, for customers at least. No surprise bills and,

in the end, less work for you. Keith -- Keith Williams Keith's Place www.keiths-place.co.uk Tailor Made English www.tmenglish.org West Norfolk RSPCA www.westnorfolkrspca.org.uk

Nigel Barker

5:15 p.m.

Hi Andy Personally I'm send a warn at 80%, a critical at 90% and fail any backups taking use beyond 100%. If you could give users the ability to delete either an entire backup, or all backups of a specific file/directory, then it is their problem. I'm not personally fond of the idea of billing people whose backups grow beyond their limit, might require a check of the t&c's they agree to, but whatever it certainly shouldn't load you with work when a customer uses more than they've paid for! Nige

...

Hi Rodrigo, On Mon, Dec 30, 2013 at 12:44:59PM -0200, Rodrigo Campos wrote:

On Monday, December 30, 2013, Andy Smith wrote:

- Nagios sends warnings when that usage goes above 95%, sends critical alerts if it goes above 100%

Critical on 100% is maybe too late?

Having just checked it is actually 95% for warning and 99% for critical.

I would say maybe ~90% can be critical, as you are clearly running out of space and won't be able to backup anymore.

Or if the size of the last backup times 7 (like in a week you won't be able to backup anymore) is more than x%, then critical. But maybe is a pita to get the size of the last backup?

But in any case, the most reasonable thing to do for me is to abort the next backups until there is free space.

I'm not sure that is reasonable, and I will explain why below..

Sorry, don't follow you here :-S

Duggie

6:39 p.m.

Hi Andy, As a customer who has been in this position recently (the shame!) I would be happy to allow backups to fail past 100%, and an alert sent to point me at a control panel feature which allows me to crudely admin my backups ie delete one, several or all backups. If you would send a weekly reminder until the problem is resolved, then I couldn't expect you to do very much more. I don't expect you to have to manually intervene with this service. It's low tech and low cost. And I appreciate the service. This also sounds reasonably straight forward to implement. It is certainly straight forward to explain in the FAQs. Thanks, Duggie On Monday, December 30, 2013, Nigel Barker wrote:

...

Hi Rodrigo, On Mon, Dec 30, 2013 at 12:44:59PM -0200, Rodrigo Campos wrote:

On Monday, December 30, 2013, Andy Smith wrote:

- Nagios sends warnings when that usage goes above 95%, sends critical alerts if it goes above 100%

Critical on 100% is maybe too late?

Having just checked it is actually 95% for warning and 99% for critical.

I would say maybe ~90% can be critical, as you are clearly running out of space and won't be able to backup anymore.

Or if the size of the last backup times 7 (like in a week you won't be able to backup anymore) is more than x%, then critical. But maybe is a pita to get the size of the last backup?

But in any case, the most reasonable thing to do for me is to abort the next backups until there is free space.

I'm not sure that is reasonable, and I will explain why below..

Sorry, don't follow you here :-S

_______________________________________________

users mailing list users(a)lists.bitfolk.com <javascript:;> https://lists.bitfolk.com/mailman/listinfo/users

_______________________________________________ users mailing list users(a)lists.bitfolk.com <javascript:;> https://lists.bitfolk.com/mailman/listinfo/users

-- Duggie

Ross Younger

7:27 p.m.

On 31/12/13 04:49, Andy Smith wrote:

...

Which of these makes the most sense? Should both options exist for people to choose between?

As a user I'd personally go for suspend-with-alert, but I do wonder whether automatic-extend would be preferable for some? I guess it's down to what the users want, bearing in mind that 38 customers isn't worth spending too much time on ;-)

...

If I implemented a way (from the Panel) to nuke the most recent set of backups then would that make the "suspend" option the best one as the customer can still fix it themselves?

I like the sound of this. In an ideal world (pie in the sky etc etc) the panel would show me a simple listing of the backup sets - timestamp and size - and allow me to blow any of them away; this covers the case that some large file was accidentally backed up a few runs ago. Cheers, Ross

Aaron B. Russell

8:05 p.m.

On Dec 30, 2013, at 7:27pm, Ross Younger <ross(a)impropriety.org.uk> wrote:

...

On 31/12/13 04:49, Andy Smith wrote:

Which of these makes the most sense? Should both options exist for people to choose between?

If I implemented a way (from the Panel) to nuke the most recent set of backups then would that make the "suspend" option the best one as the customer can still fix it themselves?

This sounds like the best option, although perhaps a slightly simpler-to-build approach would be to just automatically delete the oldest backup (despite the schedule choices) to make space for the larger new ones, a la OS X's Time Machine, though I'd like to be notified if this occurs so I can take action to fix it.

3862

days inactive

3862

days old

users@mailman.bitfolk.com

Manage subscription

7 comments

7 participants

tags (0)

participants (7)

Aaron B. Russell
Andy Smith
Duggie
Keith Williams
Nigel Barker
Rodrigo Campos
Ross Younger