Hi folks,
Short version:
Two of our servers appear to be subject to a now-fixed kernel bug
affecting IPv6, and require a reboot for a kernel upgrade. Host
bellini.bitfolk.com will be rebooted on Monday 20th February at
2200Z.
Provided that does fix the problem, host president.bitfolk.com will
be similarly rebooted the following day, Tuesday 21st February also
at 2200Z.
Longer version:
While investigating some recent reports of poor IPv6 performance, it
seems that both bellini.bitfolk.com and president.bitfolk.com are
affected by a bug in the igb Intel gigabit Ethernet driver as
described here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=630730
The symptom is very poor IPv6 performance, with a maximum of around
100 kilobit/sec. On doing a tcpdump you will see packets
of length greater than 1500 bytes, followed by an ICMP6 "packet too
big" message coming from our server, and then a retransmit.
These hosts have been up for 138 days (bellini) and 223 days
(president), and unfortunately neither we nor any customers noticed
any problems until recently. On casual inspection IPv6 works. It's
only noticeable when trying to do a larger data transfer.
Since the current impact is so low, I am not going to rush to reboot
these hosts. I would rather give plenty of notice to those who need
it.
When it's time for the reboot we will shut down all VPSes on these
servers cleanly, reboot the machine and then begin booting them up
again. Downtime is expected to be in the region of 15 minutes. If
you are in any doubt as to whether your VPS starts cleanly with all
required services running, you should test this ahead of time.
I am fairly confident that the problems observed are caused by that
bug and therefore that a kernel upgrade will fix it, but
unfortunately we do not have any other hardware that uses the igb
driver. If doing the upgrade on bellini does not resolve the issue
then we will have to consider our options.
In the meantime, if your VPS is hosted on bellini or president then
you may wish to set your VPS to prefer IPv4 DNS results ahead of
IPv6 results:
https://tools.bitfolk.com/wiki/IPv6#Preferring_IPv4_over_IPv6
Cheers,
Andy
--
http://bitfolk.com/ -- No-nonsense VPS hosting
_______________________________________________
announce mailing list
announce(a)lists.bitfolk.com
https://lists.bitfolk.com/mailman/listinfo/announce
Dear customer,
[ Apologies if you've received duplicate copies of this email;
we felt it was of sufficient importance to send direct to the
contacts for each account. ]
If you've been following our mailing lists you'll know that the time
has come where we need you to change the IP address(es) associated
with your VPS.
Basically:
- For each IP address on your VPS you need to enable a new address,
which has already been routed to you.
- You then need to reconfigure your services to use only the new IP
addresses.
- Finally you need to disable the old IP addresses.
Full information on what you need to do can be found here:
https://tools.bitfolk.com/wiki/Renumbering_for_customers
The next time you need to reboot your VPS you should take care to
instead shut it down and boot it again from your Xen Shell,
otherwise you will lose the routes to the new IP addresses that have
been added.
The old IP addresses will be disabled approximately *three months*
from now, so if you don't make these changes before then YOU WILL
EXPERIENCE A LOSS OF SERVICE.
Therefore if you have any questions please do not hesitate to ask,
preferably on our users list:
https://lists.bitfolk.com/mailman/listinfo/users
Cheers,
Andy
--
http://bitfolk.com/ -- No-nonsense VPS hosting
After making sure that my VPS is receiving packets on the right address,
I'm now getting warnings that it's sending on the old address. To my
knowledge I don't have any software installed for which I needed to specify
the VPS' IP, so my guess is that this will end when I remove the old
address from network/interfaces. Is that right? Is there a way to test
before deleting the old IP?
Thanks,
Mike
On Sat, Jan 28, 2012 at 7:14 AM, <andy(a)bitfolk.com> wrote:
> Dear customer,
>
> You're receiving this email because our monitoring has detected that, in
> the last 24 hours, your VPS "science" is still sending packets from the
> address 212.13.195.254, which is part of our deprecated range of IP
> addresses, 212.13.194.0/23.
>
> Back at the start of November 2011 we announced that a renumbering would
> be necessary, and we set a deadline of Monday 6th February 2012 for this
> to be completed:
>
> * Announcement:
>
> http://lists.bitfolk.com/lurker/message/20111104.221826.7f38e097.en.html
>
> * Full instructions:
>
> https://tools.bitfolk.com/wiki/Renumbering_for_customers
>
> There's now just over a week before the IP addresses that you are still
> making use of are taken out of service. If you don't act to stop using
> these addresses then YOU WILL EXPERIENCE A LOSS OF SERVICE once
> 212.13.194.0/23 is decommissioned.
>
> Therefore if you have any questions about the renumbering process or
> need help to complete it, it is important that you get in touch with us
> as soon as possible.
>
> Questions regarding how to renumber would be best asked on the users
> mailing list:
>
> http://lists.bitfolk.com/lurker/list/users.en.html
>
> BitFolk will only be able to carry out the work on your behalf as a
> chargeable consultancy service. Please contact support if you would like
> us to do this.
>
> It is possible that you have received this email purely because you
> still have one or more of the deprecated IP addresses active on your
> VPS. Internet hosts receive a constant "background noise" of random
> traffic. It is recommended to remove the deprecated IP addresses once
> you think you don't need them any longer, both to avoid this and to
> reassure yourself that they are really not in use by other systems.
>
> Best regards,
> Andy Smith
> BitFolk Ltd
>
Hi,
If you don't currently take advantage of BitFolk's free backups
service then the following doesn't apply to you.
Two extra checks were added last night on the backups that BitFolk
manages for you.
All this info is going to go on a page I will create at
https://tools.bitfolk.com/wiki/Backups but you need to know it now.
- Backup age
This checks that your most recent successful¹ backup is not too
old. The warning threshold is 2.5x the appropriate interval. So if your
backups happen every 4 hours, you'll receive an alert if they are
ever older than 10 hours. If your backups happen daily then
you'll receive an alert if they are ever more than 60 hours old.
And so on.
This alerting replaces the manual process of me being sent
excerpts of log files that say a customer's backups are failing,
then opening a support ticket with the customer to make them
aware.
If you receive this alert then your backups are definitely not
happening.
The alert looks like this:
From: nagios(a)bitfolk.com
Subject: ** PROBLEM alert - backup0.bitfolk.com/Backup age youraccount is CRITICAL **
***** Nagios *****
Notification Type: PROBLEM
Service: Backup age youraccount
Host: backup0.bitfolk.com
Address: 85.119.80.240
State: CRITICAL
Date/Time: Tue Jan 31 12:37:38 UTC 2012
Additional Info:
FILE_AGE CRITICAL: /data/backup/rsnapshot.6-7-4-6/hourly.0/85.119.82.121 is 16842912 seconds old and 4096 bytes
(Those who haven't had a successful backup run in the last couple
of days will have a huge number of seconds listed there because
the backup system was only modified to record last successful
contact recently)
- Backup space usage
This checks that your total backup space usage is not approaching
your current quota. The thresholds are 95% for a warning and 99%
for a critical.
This alerting replaces the manual process of me being warned about
customers who are exceeding their quota and then opening support
tickets with them to discuss what they want to do about it².
If you receive this alert then your backups are still happening,
but you're in danger of (or already are) using more than the
agreed space. If you exceed your quota then we may disable your
backups, so that would eventually cause the above backup age alert
to fire.
Please note that we can't update the measurement of how much space
you're using very often. Backup directories contain hundreds of
millions of files, many of which are copies of each other, but
it's not possible to tell without looking. Adding it all up takes
quite a long time and stresses disk IO quite badly. So we only
update quotas every day at best at the moment.
Also please note that since this is a backup system, deleting
files on your VPS does not immediately result in using less disk
space for backups. It would be a rather pointless backup system if
it threw away deleted files immediately. :) Anything that gets
backed up is going to be kept for as long as your chosen backup
schedule dictates, e.g. 6 months by default. If you have blown
your quota by accidentally allowing large amounts of data to be
backed up, you are still going to have to contact support to get
it deleted.
The alert looks like this:
From: nagios(a)bitfolk.com
Subject: ** PROBLEM alert - backup2.bitfolk.com/Backup space usage youraccount is WARNING **
***** Nagios *****
Notification Type: PROBLEM
Service: Backup space usage youraccount
Host: backup2.bitfolk.com
Address: 85.119.80.230
State: WARNING
Date/Time: Tue Jan 31 15:52:28 UTC 2012
Additional Info:
WARNING 98.50% (394/400MiB) used
It is possible for the usage to go above 100% because we do allow
you to go over your quota for short periods of time.
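As a rough illustration of the two thresholds described above, here is
a minimal Python sketch. It is not the actual Nagios check: the
function names are made up for the example, and treating exactly 95%
and 99% as already over the threshold is an assumption.

```python
from enum import Enum


class State(Enum):
    OK = "OK"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"


def max_backup_age_hours(interval_hours: float, factor: float = 2.5) -> float:
    """Oldest acceptable age (in hours) of the most recent successful backup."""
    return interval_hours * factor


def space_usage_state(used_mib: float, quota_mib: float) -> State:
    """Classify backup space usage against the 95% / 99% thresholds.

    Assumption: a value exactly on a threshold counts as over it.
    """
    percent = 100.0 * used_mib / quota_mib
    if percent >= 99.0:
        return State.CRITICAL
    if percent >= 95.0:
        return State.WARNING
    return State.OK


print(max_backup_age_hours(4))      # backups every 4 hours
print(max_backup_age_hours(24))     # daily backups
print(space_usage_state(394, 400))  # the figures from the example alert
```

With the figures from the example alert (394 of 400MiB) the usage
works out at 98.5%, which is in the warning band.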
Cheers,
Andy
¹ "Successful" as in, rsync connected to your host, did some stuff
and then exited with a non-error exit code. It does not
necessarily mean that what you think should be backed up is being
backed up. As with any backup solution you need to assure yourself
on a regular basis that it's doing what you expect.
² Generally one or more of:
- Buy some more disk space for backups
- Back up fewer files
- Back up less often
--
http://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
Some of this was easy (changing where the slave DNSes got their info
from was a simple sed search and replace) and some tedious (changing all
the master DNS zones, because the serial number had to be changed for
each of them, was done by hand via a control panel - if there was an
easier way, I am not sure I want to know now :) ).
The one thing not mentioned on the wiki is that doing
> grep -r 212.13.19 *
in /etc now comes up with one hit, in ntp.conf:
> # and admin.curacao.bitfolk.com (nagios)
> restrict 212.13.194.71
so I changed that to 85.119.80.238 and 85.119.80.244 per the customer
information page on the website and restarted ntp.
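For anyone who wants a stricter search than that grep (the pattern
212.13.19 also matches addresses outside the deprecated range), here is
a minimal Python sketch. It assumes the deprecated range is
212.13.194.0/23 as stated in the renumbering emails; the function names
are made up for the example and it only looks at regular files.

```python
import ipaddress
import os
import re

# Assumption: the deprecated range quoted in the renumbering emails.
DEPRECATED = ipaddress.ip_network("212.13.194.0/23")

# Candidate dotted quads; validity is checked by ipaddress below.
IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def deprecated_addresses(text):
    """Return every address in text that falls inside DEPRECATED."""
    found = []
    for candidate in IPV4_RE.findall(text):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # looked like an address but is not a valid one
        if address in DEPRECATED:
            found.append(candidate)
    return found


def scan_tree(root):
    """Yield (path, line number, addresses) for each matching line under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip sockets, FIFOs and dangling symlinks
            try:
                with open(path, "r", errors="replace") as handle:
                    for lineno, line in enumerate(handle, start=1):
                        hits = deprecated_addresses(line)
                        if hits:
                            yield path, lineno, hits
            except OSError:
                continue  # unreadable file; ignore it


print(deprecated_addresses("restrict 212.13.194.71"))
print(deprecated_addresses("restrict 85.119.80.238"))
```

Running `for hit in scan_tree("/etc"): print(hit)` as root would list
each file and line that still mentions an old address.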
Now I think it's just a case of waiting for the rest of the net to
notice the DNS changes...
Ian
Hello,
Long email about DNS timers and alerting based on them. Unless you
have domains on BitFolk's secondary DNS platform you probably won't
care about this, and even then you still probably don't care unless
you've been receiving alerts about them. Turn back now!
Still here? OK.
I've recently implemented DNS secondary domain zone age alerts. They
send alerts when the zone on BitFolk's nameservers is too old. This
saves me having to read logs and open a support ticket to advise
customers that the zone transfers are failing, so I'm all in favour
of that.
The definition of "too old" differs on a per-domain basis. There are
two values in the SOA record of a DNS domain; refresh and expire.
The refresh value tells secondary servers how often to check in
with the primary.
The expire value tells secondary servers how long they should
consider themselves valid for without successful contact with the
primary. If there is no contact with the primary for the expire
period then the secondary server stops serving the domain and
returns SERVFAIL for every query.
So, based on the above, a DNS domain should never be "older" than
refresh. If it is older then that means that at least one refresh
attempt failed. If the age approaches expire then the domain is in
danger of not being served.
At the moment I have decided to send a warning alert on 150% of
refresh and a critical alert on 50% of expire.
RIPE recommends 86400 (one day) for refresh and 3600000 (1000 hours;
almost 6 weeks) for expire:
http://www.ripe.net/ripe/docs/ripe-203
RFC1912 (1996) recommends one day for refresh and 2-4 weeks for
expire:
http://www.faqs.org/rfcs/rfc1912.html
So let's say you go with RIPE's recommendations. You'd receive
a warning alert after your secondary DNS setup was broken for 36 hours,
and you'd receive a critical alert if it was still broken after 500
hours (almost 3 weeks). 500 hours after that, your domain stops
being served on the secondary servers.
That seems reasonable.
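To make the arithmetic concrete, here is a minimal Python sketch of how
the two alert ages fall out of the SOA timers. It only illustrates the
150% / 50% rule described above and is not the actual Nagios check; the
class and function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoaTimers:
    refresh: int  # seconds between checks with the primary
    expire: int   # seconds a secondary may serve without contact


def alert_thresholds(soa, warn_factor=1.5, crit_factor=0.5):
    """Return (warning_age, critical_age) in seconds for a zone."""
    return soa.refresh * warn_factor, soa.expire * crit_factor


# RIPE's recommended values: one day refresh, 1000 hours expire.
ripe = SoaTimers(refresh=86400, expire=3600000)
warning_age, critical_age = alert_thresholds(ripe)
print(warning_age / 3600)   # hours of zone age before a warning
print(critical_age / 3600)  # hours of zone age before a critical
```

For the RIPE values this gives 36 hours and 500 hours, matching the
figures above.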
Finally getting around to the point of this email: what do you think
I should do about problematic SOA values that customers have chosen?
For example, there are some domains currently on BitFolk's servers
where the refresh and expire are both set to 300 seconds (5
minutes). Ignoring what happens with alerts for a moment, that means
that every 5 minutes the secondary servers check the primary, and if
that fails even once, the domain will return SERVFAIL for all
queries until contact is made again.
I can't understand what the use is of such a fragile setting; it
looks erroneous to me. This isn't just DNS purism saying, "ooh, I
don't like your non-standard values!" It will actually cause
breakage very easily. But perhaps it is not for me to reason why.
Those domains have been like that for a long time and I assume no
one has noticed. It must have caused some problems any time the
primary nameserver was unreachable by the secondary servers. But
arguably that is not my problem.
When combined with this new alerting though, what happens is that
there isn't a refresh for 5 minutes then 2.5 minutes into that a
critical alert fires since we're half way to expire (5 minutes). All
being well there should be a recovery ~2.5 mins later. In reality
these times will be variable because BitFolk's Nagios doesn't check
DNS every few minutes, more like an hour plus.
That is the most extreme example of this problem, but there are a few
other domains in there where refresh and expire have been set to the
same value. It will lead to a cycle of alert and then recovery,
forever.
So, what do you think I should do?
I'm not willing to give up on the alerts because I think most people
would like to know when their DNS setup is broken (or in danger of
being broken), and it saves me having to personally interact to tell
people this. Intentional DNS breakage is not my problem, but
answering/opening support tickets is.
Alerting can be disabled on a per-domain basis. Currently only by
asking support, but eventually you'll be able to flip that on the
Panel¹.
So how about having the Panel warn on the web page about what are
considered unwise SOA values, and just allowing the alerts to be
disabled if for some reason this sort of fragile DNS setup is
intentional?
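As a sketch of what such a warning might check (the rules here are
only an assumption derived from the alert thresholds above, not
anything the Panel currently does, and the function name is made up):

```python
def soa_warnings(refresh, expire, warn_factor=1.5, crit_factor=0.5):
    """Return human-readable warnings about fragile SOA timer values.

    refresh and expire are in seconds. The two rules are assumptions
    based on the alerting described above (warning at 150% of refresh,
    critical at 50% of expire).
    """
    warnings = []
    if expire <= refresh:
        warnings.append(
            "expire <= refresh: one failed refresh is enough "
            "for the zone to expire"
        )
    elif expire * crit_factor < refresh * warn_factor:
        warnings.append(
            "expire is less than 3x refresh: the critical alert "
            "would fire before the warning alert"
        )
    return warnings


print(soa_warnings(300, 300))        # the 5 minute example
print(soa_warnings(86400, 3600000))  # RIPE's recommended values
```

The 300/300 example is flagged, while RIPE's recommended values
produce no warnings.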
Cheers,
Andy
¹ https://panel.bitfolk.com/dns/#toc-secondary-dns
--
http://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
I've just enabled some additional alerting for the DNS secondary
service. It will copy you in on alerts regarding BitFolk nameservers
that are not correctly serving your domain.
They look like this:
From: nagios(a)bitfolk.com
Subject: ** PROBLEM alert - a.authns.bitfolk.com/Auth. DNS example.com is UNKNOWN **
***** Nagios *****
Notification Type: PROBLEM
Service: Auth. DNS example.com
Host: a.authns.bitfolk.com
Address: 85.119.80.222
State: UNKNOWN
Date/Time: Fri Jan 27 19:46:10 UTC 2012
Additional Info:
DNS UNKNOWN - 1.444 seconds response time (No ANSWER SECTION found)
This would indicate that a.authns.bitfolk.com is not serving the
domain example.com when we would expect it to be.
The reason for this change is that there are quite a few customer
domains that aren't being served correctly and rather than keep
opening tickets and chasing it up, I would rather let Nagios do what
it is designed for.
I am also working on a check that AXFRs are happening.
Cheers,
Andy
--
http://bitfolk.com/ -- No-nonsense VPS hosting
Hello,
It appears that recently, CentOS 6.x and Scientific Linux 6.x
installers started to require 512MiB RAM. Our smallest and most
popular VPS plan currently has 480MiB RAM. That means that the
average¹ BitFolk customer now cannot self-install derivatives of
RHEL 6.x.
This is extremely annoying since I suspect that these distributions
work just the same in 480MiB RAM now as they did a few months ago.
I can't find a simple way to override that check (please let me know
if you know of one), and I'm not quite ready to increase the default
RAM allocation to 512MiB.
In the short term I am tempted to make the installer boot with
512MiB RAM if you have 480MiB. It will then revert to 480MiB upon
normal use.
Any comments?
Cheers,
Andy
¹ Mode and median. The mean is 641MiB.
--
http://bitfolk.com/ -- No-nonsense VPS hosting
Q. How many mathematicians does it take to change a light bulb?
A. Only one - who gives it to six Californians, thereby reducing the problem
to an earlier joke.
Hi
I'm pretty sure that my host has been fully configured for the new IP, but
I got a notification that someone has used the old IP recently.
Does anyone have a simple way of finding out where those connections to
the old IP are coming from, and over which protocol?
Thanks,
Taavi
Hi everyone,
I have decided to venture down what I hope is a well trodden path by now;
upgrading my VPS from Debian Lenny to Squeeze.
I have scoured the list archives and tried to make the most of
http://www.debian.org/releases/squeeze/i386/release-notes/ch-upgrading.en.h…
however I'm not ashamed to admit that I'm no expert in this regard and
very much still learning so would appreciate a critique of my plan of
action:
- Ask Support kindly to perform a temporary disk snapshot
- Login via Xen console
- Verify no pending actions required for currently installed packages:
aptitude (Then hit 'g' once in 'visual mode')
- Verify that all packages are in an upgradable state:
dpkg --audit
- Show currently installed kernel(s):
dpkg -l | grep linux-image
Mine currently shows:
ii linux-image-2.6-xen-686 2.6.26+17+lenny1 Linux 2.6 image on i686, oldstyle Xen suppor
ii linux-image-2.6.26-1-xen-686 2.6.26-13lenny2 Linux 2.6.26 image on i686, oldstyle Xen sup
ii linux-image-2.6.26-2-xen-686 2.6.26-26lenny2 Linux 2.6.26 image on i686, oldstyle Xen sup
- Confirm non-usage of grub2:
dpkg -l | grep grub
Mine currently shows:
ii grub 0.97-47lenny2 GRand Unified Bootloader (Legacy version)
ii grub-common 1.96+20080724-16 GRand Unified Bootloader, version 2 (common
- Update the apt sources list from lenny to squeeze:
sed -i s/lenny/squeeze/g /etc/apt/sources.list
- Manually edit /etc/apt/sources.list to confirm success of the above step
and comment out any other repositories (non-Debian, backports etc)?
- Upgrade the kernel: (*** Am I aiming for the right one here? ***)
aptitude install linux-image-2.6-686-bigmem
- Update grub configuration:
update-grub
- Remove clocksource=jiffies from kopt directive in /boot/grub/menu.lst
and confirm correct kernel will be loaded (i.e. default # matches new
kernel position)
- Upgrade udev (to minimise the risk of running the old udev with the new
kernel):
apt-get install udev
- Reboot
- Record a transcript of the upgrade session:
script -t 2>~/upgrade-squeeze.time -a ~/upgrade-squeeze.script
(This can be reviewed at a later date with scriptreplay
~/upgrade-squeeze.time ~/upgrade-squeeze.script)
- Update the package list:
apt-get update
- Perform a minimal upgrade (i.e. upgrade those packages that don't
require installation/removal of any other package(s)):
apt-get upgrade
- Complete the rest of the upgrade:
apt-get dist-upgrade
- Remove old/obsolete packages no longer required:
apt-get autoremove
- (Hopefully:) After the dust settles, advise Support that the snapshot of
the old system can be removed
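For the sources.list step, one way to see what the sed command would
change before running it is to preview the substitution as a diff. This
is only a minimal Python sketch; the function name and the example.org
mirror line are made up for illustration.

```python
import difflib


def preview_release_switch(text, old="lenny", new="squeeze"):
    """Return a unified diff of replacing every `old` with `new` in text."""
    before = text.splitlines(keepends=True)
    after = [line.replace(old, new) for line in before]
    diff = difflib.unified_diff(
        before, after, fromfile="sources.list", tofile="sources.list.new"
    )
    return "".join(diff)


# Hypothetical sources.list content, for illustration only.
sample = (
    "deb http://example.org/debian/ lenny main\n"
    "# a comment that is left alone\n"
)
print(preview_release_switch(sample), end="")
```

To preview the real file, pass in the contents of
/etc/apt/sources.list instead of the sample; an empty result means
nothing would change.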
How does that all look? Please don't hold back...
Regards,
Mathew