This unfortunately has happened again today, at about 14:23Z.
This time I was logging the serial console to a file and so am able
to see that there was the equivalent of a kernel panic in the
hypervisor.
That is, I do not believe that hen's hardware is at fault. I think
it's tripping against a bug in Xen, and it's happened to the same
host twice because it's been triggered by the same guest doing
something (which I do not believe is malicious at this stage).
I've not got a quick fix for this because moving all customers on hen
to new hardware is likely just going to crash the hypervisor on the
other hardware. I need to discuss the problem with the Xen
developers and see if I get anywhere.
In between last time and this I also built a new version of the
hypervisor and set every host to boot into it, so hen is now
actually running a very slightly newer version than everything else
(and also compared to what it was running before). This possibly
could help, just by chance, though as far as I am aware it is not a
known bug.
So I am very sorry but I am going to have to ask you to bear with me
for a little while, while I investigate this more. Until I can
establish which guest triggered it I can't move any of the customers
on host hen to other hosts because that might just trigger it
elsewhere. And it could still happen elsewhere anyway.
If I don't make headway with this then I can revert to earlier
versions that we've been stable on for a long time, but security
issues have been fixed since then so I'm not going to do that except
as a last resort.
I will provide more information as soon as I can.
Thanks,
Andy
If you have a spare HV, why not try to identify the guest involved by
moving half the guests off of hen? Whichever HV crashes has the errant
guest; move half of the guests off that HV again and see which HV
crashes. Continue until you've identified it, or until you've a small
enough number to be worth contacting users and asking what their guests
were doing at the time of the panic.
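The halving idea above could be sketched roughly as follows. This is only an illustration under some assumptions: that exactly one guest triggers the panic, and that the panic reliably recurs on whichever HV holds that guest. The function name `bisect_guests`, the `crashes` callback and the `stop_at` threshold are hypothetical; in practice "crashes" means moving the guests and waiting to see which HV panics, not calling a function.

```python
from typing import Callable, List, Sequence


def bisect_guests(
    guests: Sequence[str],
    crashes: Callable[[Sequence[str]], bool],
    stop_at: int = 1,
) -> List[str]:
    """Narrow down which guest triggers the panic by repeated halving.

    `crashes(group)` is a hypothetical stand-in for "move this group to the
    spare HV, wait, and report whether that HV panicked". Assumes exactly
    one errant guest, so if the moved half does not crash, the culprit is
    in the half that stayed behind.
    """
    suspects = list(guests)
    while len(suspects) > stop_at:
        mid = len(suspects) // 2
        moved, stayed = suspects[:mid], suspects[mid:]
        # Keep only the half that was on the HV that panicked.
        suspects = moved if crashes(moved) else stayed
    return suspects


if __name__ == "__main__":
    # Made-up guest names purely for demonstration.
    all_guests = [f"guest{i}" for i in range(16)]
    errant = "guest11"
    print(bisect_guests(all_guests, lambda group: errant in group))
```

With N guests this takes about log2(N) rounds, each of which costs one more hypervisor panic, so setting `stop_at` to a handful and contacting those users directly may be the kinder option.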