[bitfolk] Re: There are currently problems with host "talisker"

5 Dec 2022

Thanks for your nocturnal efforts, Andy. Sudden loss of connectivity 
with my VPS sent me to the panel, but that was giving 502 and later 503 
errors. Was that a knock-on effect of the problems with Talisker? It's 
working again now.

Cheers,

Simon.

On 05/12/2022 01:46, Andy Smith via BitFolk Users wrote:
...
  On Mon, Dec 05, 2022 at 12:25:59AM +0000, Andy Smith wrote:
  We are likely going to have to do an emergency reboot in a moment.

 What happened was, the SAS controller (which is built in to the
 motherboard) did something strange and stopped responding:

 Dec  5 00:00:20 talisker kernel: [15457308.417397] sd 0:0:1:0: attempting task abort!scmd(0x000000000eb85a6f), outstanding for 7040 ms & timeout 7000 ms
 Dec  5 00:00:20 talisker kernel: [15457308.417397] sd 0:0:1:0: [sdb] tag#2370 CDB: ATA command pass through(16) 85 08 2e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00
 Dec  5 00:00:20 talisker kernel: [15457308.417397] scsi target0:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
 Dec  5 00:00:20 talisker kernel: [15457308.417397] scsi target0:0:1: enclosure logical id(0x500304801ce84801), slot(1)
 Dec  5 00:00:20 talisker kernel: [15457308.417397] scsi target0:0:1: enclosure level(0x0000), connector name(     )
 Dec  5 00:00:52 talisker kernel: [15457339.698559] mpt3sas_cm0: In func: mpt3sas_scsih_issue_tm
 Dec  5 00:00:52 talisker kernel: [15457339.699439] mpt3sas_cm0: Command Timeout
 Dec  5 00:00:52 talisker kernel: [15457339.700221] mf:
 Dec  5 00:00:52 talisker kernel: [15457339.700221] 	
 Dec  5 00:00:52 talisker kernel: [15457339.700221] 0100000a
 Dec  5 00:00:52 talisker kernel: [15457339.700221] 00000100
 Dec  5 00:00:52 talisker kernel: [15457339.700221] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700248] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700249] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700249] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700249] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700249] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700249]
 Dec  5 00:00:52 talisker kernel: [15457339.700249] 	
 Dec  5 00:00:52 talisker kernel: [15457339.700249] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700249] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700258] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700258] 00000000
 Dec  5 00:00:52 talisker kernel: [15457339.700258] 00000943
 Dec  5 00:00:52 talisker kernel: [15457339.700258]
 Dec  5 00:01:02 talisker kernel: [15457349.938722] mpt3sas_cm0: sending diag reset !!
 Dec  5 00:01:03 talisker kernel: [15457351.223529] mpt3sas_cm0: diag reset: SUCCESS
 Dec  5 00:01:03 talisker kernel: [15457351.293977] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
 Dec  5 00:01:18 talisker kernel: [15457366.578292] mpt3sas_cm0: _base_display_fwpkg_version: complete
 Dec  5 00:01:18 talisker kernel: [15457366.578296] mpt3sas_cm0: _base_display_fwpkg_version: timeout
 Dec  5 00:01:18 talisker kernel: [15457366.579119] mf:
 Dec  5 00:01:18 talisker kernel: [15457366.579119] 	
 Dec  5 00:01:18 talisker kernel: [15457366.579121] 12000001
 Dec  5 00:01:18 talisker kernel: [15457366.579123] 00000000
 Dec  5 00:01:18 talisker kernel: [15457366.579124] 00000000
 Dec  5 00:01:18 talisker kernel: [15457366.579125] 00000000
 Dec  5 00:01:18 talisker kernel: [15457366.579126] 00000000
 Dec  5 00:01:18 talisker kernel: [15457366.579127] 00000000
 Dec  5 00:01:18 talisker kernel: [15457366.579129] 00000000
 Dec  5 00:01:18 talisker kernel: [15457366.579130] 00000100
 Dec  5 00:01:18 talisker kernel: [15457366.579131]
 Dec  5 00:01:18 talisker kernel: [15457366.579131] 	
 Dec  5 00:01:18 talisker kernel: [15457366.579132] 0758a000
 Dec  5 00:01:18 talisker kernel: [15457366.579134] 00000022
 Dec  5 00:01:18 talisker kernel: [15457366.579135] 00000100
 Dec  5 00:01:18 talisker kernel: [15457366.579136] 40000000
 Dec  5 00:01:18 talisker kernel: [15457366.579137]
 Dec  5 00:01:18 talisker kernel: [15457366.579143] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: FAILED
 Dec  5 00:01:18 talisker kernel: [15457366.579154] sd 0:0:1:0: task abort: FAILED scmd(0x000000000eb85a6f)
 Dec  5 00:01:18 talisker kernel: [15457366.579162] sd 0:0:3:0: attempting task abort!scmd(0x00000000b20a175c), outstanding for 65088 ms & timeout 30000 ms
 Dec  5 00:01:18 talisker kernel: [15457366.579169] sd 0:0:3:0: [sdd] tag#2380 CDB: Read(16) 88 00 00 00 00 00 a4 fc 47 50 00 00 00 08 00 00
 Dec  5 00:01:18 talisker kernel: [15457366.579173] scsi target0:0:3: handle(0x000c), sas_address(0x4433221103000000), phy(3)
 Dec  5 00:01:18 talisker kernel: [15457366.579176] scsi target0:0:3: enclosure logical id(0x500304801ce84801), slot(3)
 Dec  5 00:01:18 talisker kernel: [15457366.579179] scsi target0:0:3: enclosure level(0x0000), connector name(     )
 Dec  5 00:01:18 talisker kernel: [15457366.579182] sd 0:0:3:0: No reference found at driver, assuming scmd(0x00000000b20a175c) might have completed

 …and then pages and pages more of much the same.
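
 For anyone wanting to sift through pages of output like this, here is a
 minimal sketch that pulls the "attempting task abort" entries out of a
 kernel log and groups them by SCSI address. The function name, the sample
 text and the regular expression are illustrative assumptions based only on
 the two abort lines quoted above; they are not part of the mpt3sas driver
 or of any BitFolk tooling, and other kernel versions may word the message
 differently.

```python
import re

# Pattern modelled on the two abort lines quoted in this message (an assumption,
# not an official log format). Whitespace is collapsed first, so it does not
# matter whether a mail client wrapped the long lines.
ABORT_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): attempting task abort!\s*scmd\((0x[0-9a-f]+)\), "
    r"outstanding for (\d+) ms & timeout (\d+) ms"
)


def summarise_aborts(log_text):
    """Return {scsi_address: [(outstanding_ms, timeout_ms), ...]}."""
    flat = " ".join(log_text.split())  # undo line wrapping and repeated spaces
    summary = {}
    for addr, _scmd, outstanding, timeout in ABORT_RE.findall(flat):
        summary.setdefault(addr, []).append((int(outstanding), int(timeout)))
    return summary


# Two entries copied from the excerpt above, still wrapped as they arrived.
SAMPLE = """\
 Dec  5 00:00:20 talisker kernel: [15457308.417397] sd 0:0:1:0: attempting task
abort!scmd(0x000000000eb85a6f), outstanding for 7040 ms & timeout 7000 ms
 Dec  5 00:01:18 talisker kernel: [15457366.579162] sd 0:0:3:0: attempting task
abort!scmd(0x00000000b20a175c), outstanding for 65088 ms & timeout 30000 ms
"""

if __name__ == "__main__":
    for addr, events in summarise_aborts(SAMPLE).items():
        for outstanding, timeout in events:
            print(f"{addr}: outstanding {outstanding} ms (timeout {timeout} ms)")
```

 Grouping by address makes it easier to see whether one drive or every
 drive behind the controller was affected.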

 Drives sda, sdb, sdc and sdd are SSDs that are attached to the onboard
 SAS ports and these hold all the customer VM storage on that server
 (your xvd* block devices that aren't archive storage). They had all
 disappeared so no I/O was possible for VMs. The host itself still worked
 because its storage is on other devices.

 I tried for a while to reset the controller, make it re-scan etc, but
 all of this just met timeouts so in the end I had to send a forcible
 poweroff sysctl to every running VM. Some of those either didn't
 complete their poweroff or are set to ignore that sysctl, so for those I
 then had to send a "destroy" (like yanking power). There wasn't really
 anything that I or the VMs could do to ensure a clean shutdown as they
 could do no I/O at all.

 On boot no problems were evident. The SAS controller could see all of
 its drives. The RAID arrays assembled again without incident and VMs
 were able to start up again.

 I think this will have effectively been like a power loss on the
 storage, which your filesystems and applications should hopefully be
 able to handle without corruption. One customer VM is showing as down
 but when I had a look at the console it seemed likely to be a
 configuration error in the VM rather than anything related to this
 incident.

 After the host booted I made a mistake and accidentally broke IPv4
 networking to talisker and its VMs for about 15 minutes. I didn't notice
 because I was using IPv6 myself, and I wasn't paying close attention to
 alerts. Once I started going through checking everything had recovered
 properly I did notice and fixed the issue.
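
 A check that tests each address family separately would catch this kind of
 one-protocol outage even when you happen to be connected over the other
 one. The following is a rough sketch only, assuming a TCP service (for
 example SSH) is listening on the host; the function names and the
 hostname in the usage comment are hypothetical and this is not how
 BitFolk's monitoring works.

```python
import socket


def reachable(host, port, family, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds over `family`."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no address of this family could be resolved
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next address, if any
    return False


def check_both(host, port, probe=reachable):
    """Probe IPv4 and IPv6 independently; `probe` is injectable for testing."""
    return {
        "IPv4": probe(host, port, socket.AF_INET),
        "IPv6": probe(host, port, socket.AF_INET6),
    }


# Example usage (hypothetical hostname):
#   status = check_both("vps.example.com", 22)
#   failed = [name for name, ok in status.items() if not ok]
```

 Reporting the two families separately is the point of the design: a
 single "is it up?" probe would have succeeded over IPv6 throughout.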

 I don't know if the above SAS controller issue was a software bug or an
 indicator of hardware failure or some freak occurrence. We will keep an
 eye on things and if it ends up being necessary we'll move customers to
 a different host.

 If any customer on talisker is still experiencing problems, please check
 your Xen Shell consoles and if the solution is not then evident, please
 email support(a)bitfolk.com to open a support ticket.

 Apologies for the disruption, and for the error that lengthened the
 outage by about 15 minutes.

 Thanks,
 Andy

 _______________________________________________
 BitFolk Users mailing list <users(a)mailman.bitfolk.com>
 You're subscribed as <simon(a)thekelleys.org.uk>
 Unsubscribe:
<https://mailman.bitfolk.com/mailman/postorius/lists/users.mailman.bitfolk.com/>
 or send an email to <users-leave(a)mailman.bitfolk.com>
