On Mon, Dec 05, 2022 at 12:25:59AM +0000, Andy Smith wrote:
> We are likely going to have to do an emergency reboot
> in a moment.
What happened was, the SAS controller (which is built in to the
motherboard) did something strange and stopped responding:
Dec 5 00:00:20 talisker kernel: [15457308.417397] sd 0:0:1:0: attempting task
abort!scmd(0x000000000eb85a6f), outstanding for 7040 ms & timeout 7000 ms
Dec 5 00:00:20 talisker kernel: [15457308.417397] sd 0:0:1:0: [sdb] tag#2370 CDB: ATA
command pass through(16) 85 08 2e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00
Dec 5 00:00:20 talisker kernel: [15457308.417397] scsi target0:0:1: handle(0x000a),
sas_address(0x4433221101000000), phy(1)
Dec 5 00:00:20 talisker kernel: [15457308.417397] scsi target0:0:1: enclosure logical
id(0x500304801ce84801), slot(1)
Dec 5 00:00:20 talisker kernel: [15457308.417397] scsi target0:0:1: enclosure
level(0x0000), connector name( )
Dec 5 00:00:52 talisker kernel: [15457339.698559] mpt3sas_cm0: In func:
mpt3sas_scsih_issue_tm
Dec 5 00:00:52 talisker kernel: [15457339.699439] mpt3sas_cm0: Command Timeout
Dec 5 00:00:52 talisker kernel: [15457339.700221] mf:
Dec 5 00:00:52 talisker kernel: [15457339.700221]
Dec 5 00:00:52 talisker kernel: [15457339.700221] 0100000a
Dec 5 00:00:52 talisker kernel: [15457339.700221] 00000100
Dec 5 00:00:52 talisker kernel: [15457339.700221] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700248] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700249] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700249] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700249] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700249] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700249]
Dec 5 00:00:52 talisker kernel: [15457339.700249]
Dec 5 00:00:52 talisker kernel: [15457339.700249] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700249] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700258] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700258] 00000000
Dec 5 00:00:52 talisker kernel: [15457339.700258] 00000943
Dec 5 00:00:52 talisker kernel: [15457339.700258]
Dec 5 00:01:02 talisker kernel: [15457349.938722] mpt3sas_cm0: sending diag reset !!
Dec 5 00:01:03 talisker kernel: [15457351.223529] mpt3sas_cm0: diag reset: SUCCESS
Dec 5 00:01:03 talisker kernel: [15457351.293977] mpt3sas_cm0: CurrentHostPageSize is 0:
Setting default host page size to 4k
Dec 5 00:01:18 talisker kernel: [15457366.578292] mpt3sas_cm0:
_base_display_fwpkg_version: complete
Dec 5 00:01:18 talisker kernel: [15457366.578296] mpt3sas_cm0:
_base_display_fwpkg_version: timeout
Dec 5 00:01:18 talisker kernel: [15457366.579119] mf:
Dec 5 00:01:18 talisker kernel: [15457366.579119]
Dec 5 00:01:18 talisker kernel: [15457366.579121] 12000001
Dec 5 00:01:18 talisker kernel: [15457366.579123] 00000000
Dec 5 00:01:18 talisker kernel: [15457366.579124] 00000000
Dec 5 00:01:18 talisker kernel: [15457366.579125] 00000000
Dec 5 00:01:18 talisker kernel: [15457366.579126] 00000000
Dec 5 00:01:18 talisker kernel: [15457366.579127] 00000000
Dec 5 00:01:18 talisker kernel: [15457366.579129] 00000000
Dec 5 00:01:18 talisker kernel: [15457366.579130] 00000100
Dec 5 00:01:18 talisker kernel: [15457366.579131]
Dec 5 00:01:18 talisker kernel: [15457366.579131]
Dec 5 00:01:18 talisker kernel: [15457366.579132] 0758a000
Dec 5 00:01:18 talisker kernel: [15457366.579134] 00000022
Dec 5 00:01:18 talisker kernel: [15457366.579135] 00000100
Dec 5 00:01:18 talisker kernel: [15457366.579136] 40000000
Dec 5 00:01:18 talisker kernel: [15457366.579137]
Dec 5 00:01:18 talisker kernel: [15457366.579143] mpt3sas_cm0:
mpt3sas_base_hard_reset_handler: FAILED
Dec 5 00:01:18 talisker kernel: [15457366.579154] sd 0:0:1:0: task abort: FAILED
scmd(0x000000000eb85a6f)
Dec 5 00:01:18 talisker kernel: [15457366.579162] sd 0:0:3:0: attempting task
abort!scmd(0x00000000b20a175c), outstanding for 65088 ms & timeout 30000 ms
Dec 5 00:01:18 talisker kernel: [15457366.579169] sd 0:0:3:0: [sdd] tag#2380 CDB:
Read(16) 88 00 00 00 00 00 a4 fc 47 50 00 00 00 08 00 00
Dec 5 00:01:18 talisker kernel: [15457366.579173] scsi target0:0:3: handle(0x000c),
sas_address(0x4433221103000000), phy(3)
Dec 5 00:01:18 talisker kernel: [15457366.579176] scsi target0:0:3: enclosure logical
id(0x500304801ce84801), slot(3)
Dec 5 00:01:18 talisker kernel: [15457366.579179] scsi target0:0:3: enclosure
level(0x0000), connector name( )
Dec 5 00:01:18 talisker kernel: [15457366.579182] sd 0:0:3:0: No reference found at
driver, assuming scmd(0x00000000b20a175c) might have completed
…and then pages and pages more of much the same.
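
For anyone wanting to pick through a log like the one above, here is a minimal Python sketch that pulls out the "attempting task abort" entries and decodes a Read(16) CDB. The function names are made up for illustration and are not part of any tool used on the host. The regular expression is based only on the message format visible in this excerpt, so other kernel or driver versions may word it differently. The Read(16) layout used (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13) is the standard SCSI one.

```python
import re

# Matches the "attempting task abort" message as it appears in the excerpt.
# Assumption: other driver versions may format this message differently.
ABORT_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): attempting task abort!\s*"
    r"scmd\((?P<scmd>0x[0-9a-f]+)\), "
    r"outstanding for (?P<outstanding>\d+) ms & timeout (?P<timeout>\d+) ms"
)


def find_task_aborts(log_text):
    """Return one dict per 'attempting task abort' message in log_text."""
    # Long log lines were wrapped in the mail, so collapse whitespace first.
    flat = " ".join(log_text.split())
    return [
        {
            "device": m.group("hctl"),
            "scmd": m.group("scmd"),
            "outstanding_ms": int(m.group("outstanding")),
            "timeout_ms": int(m.group("timeout")),
        }
        for m in ABORT_RE.finditer(flat)
    ]


def decode_read16(cdb_hex):
    """Decode a SCSI Read(16) CDB given as hex bytes; return (lba, blocks)."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between bytes are ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte Read(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks
```

Run against the excerpt, this should report two aborts: one on 0:0:1:0 (sdb) outstanding for 7040 ms against a 7000 ms timeout, and one on 0:0:3:0 (sdd) outstanding for 65088 ms against a 30000 ms timeout. The sdd Read(16) CDB decodes to LBA 0xa4fc4750 with a transfer length of 8 blocks.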
Drives sda, sdb, sdc and sdd are SSDs that are attached to the onboard
SAS ports and these hold all the customer VM storage on that server
(your xvd* block devices that aren't archive storage). They had all
disappeared so no I/O was possible for VMs. The host itself still worked
because its storage is on other devices.
I tried for a while to reset the controller, make it re-scan etc., but
all of this just met timeouts so in the end I had to send a forcible
poweroff sysctl to every running VM. Some of those either didn't
complete their poweroff or are set to ignore that sysctl, so for those I
then had to send a "destroy" (like yanking power). There wasn't really
anything that I or the VMs could do to ensure a clean shutdown as they
could do no I/O at all.
On boot no problems were evident. The SAS controller could see all of
its drives. The RAID arrays assembled again without incident and VMs
were able to start up again.
I think this will have effectively been like a power loss on the
storage, which your filesystems and applications should hopefully be
able to handle without corruption. One customer VM is showing as down
but when I had a look at the console it seemed likely to be a
configuration error in the VM rather than anything related to this
incident.
After the host booted I made a mistake and accidentally broke IPv4
networking to talisker and its VMs for about 15 minutes. I didn't notice
because I was using IPv6 myself, and I wasn't paying close attention to
alerts. Once I started going through checking everything had recovered
properly I did notice and fixed the issue.
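
As an illustration of how a breakage like this could be caught from a dual-stacked machine, here is a minimal Python sketch that tries a TCP connection over IPv4 and IPv6 separately instead of letting the resolver pick one. It is not the check that was used here. The function name, port and timeout are hypothetical choices, and it only tests TCP reachability of a single service.

```python
import socket


def reachable_by_family(host, port, timeout=3.0):
    """Try a TCP connect over each address family; return {"IPv4": bool, "IPv6": bool}."""
    results = {}
    for name, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        ok = False
        try:
            # Restrict resolution to one family so the other cannot mask a failure.
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            infos = []  # no address of this family for the host
        for fam, socktype, proto, _canonname, sockaddr in infos:
            try:
                with socket.socket(fam, socktype, proto) as s:
                    s.settimeout(timeout)
                    s.connect(sockaddr)
                ok = True
                break
            except OSError:
                continue  # try the next address of this family, if any
        results[name] = ok
    return results
```

Calling something like reachable_by_family("vm.example.com", 22), with a placeholder hostname, returns a result per family, so a host that is up on IPv6 but down on IPv4 shows up as {"IPv4": False, "IPv6": True} rather than looking healthy.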
I don't know if the above SAS controller issue was a software bug or an
indicator of hardware failure or some freak occurrence. We will keep an
eye on things and if it ends up being necessary we'll move customers to
a different host.
If any customer on talisker is still experiencing problems, please check
your Xen Shell consoles and if the solution is not then evident, please
email support@bitfolk.com to open a support ticket.
Apologies for the disruption, and for the error that lengthened the
outage by about 15 minutes.
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting