[bitfolk] Single parity redundancy and "large" HDDs (Was Re: Re: Sending emails from Bitfolk VPS)

14 Oct 2022

Hello,

On Fri, Oct 14, 2022 at 04:10:58PM +0100, Dom Latter via BitFolk Users wrote:
> ...
> I've not considered single parity raid to be risky - what do other
> people think?

I wouldn't do single parity with devices as big as 6TB, especially
not at Hetzner who are very likely using consumer HDDs (not ones
recommended for enterprise/NAS use), and often you're not the first
user of them (i.e. they took them out of a server that was rented by
others previously).

The issue is that if you lose one device and then replace it, it
will take a really long time to read 6TB of data off of the other
devices and during that whole time there is no redundancy at all, so
if there are any further unreadable bits on other devices, the data
is lost.

The risk is a little bit lower if you have been regularly scrubbing
your array as that forces it to read everything and find any
unreadable bits. Most Linux distributions using MD RAID do schedule
this for once per month.
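
For anyone who wants to kick off a scrub by hand rather than wait for
the monthly job, here is a minimal sketch. It relies on the kernel's
sysfs interface for MD, where writing "check" to the array's
md/sync_action file starts a read-only check. The function name, the
default array name "md0" and the sysfs_root parameter are my own
choices for illustration, and it needs root to work on a real system.

```python
from pathlib import Path


def request_scrub(array="md0", sysfs_root="/sys/block"):
    """Ask the kernel to start a read-only check of one MD array.

    Returns True if a check was requested, or False if the array was
    already busy (resyncing, recovering, checking and so on).
    "md0" and sysfs_root are illustrative defaults, not taken from
    any particular system.
    """
    action = Path(sysfs_root) / array / "md" / "sync_action"
    # The file reads "idle" when no sync-type operation is in progress.
    if action.read_text().strip() != "idle":
        return False
    action.write_text("check\n")
    return True


# Example, as root, on a machine that really has /dev/md0:
#   request_scrub("md0")
```

Progress then shows up in /proc/mdstat like any other resync.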

It's probably also acceptable if you can restore from backups
without too much grief.

I get why people would want to do RAID-5 or ZFS raidz1 when
there are only four devices, as RAID-6 or raidz2 loses half the
capacity…

Sometimes people try to draw conclusions from the Bit Error Rate
/ Unrecoverable Error Rate of the devices, e.g. Toshiba X300 6TB is
1 per 10¹⁴ bits read:

    https://storage.toshiba.com/docs/support-docs/X300-SalesSheet_English_Web_r…

So they will say once you've read 10¹⁴ bits (12½TB) off of one of those
drives you could expect 1 bit of that to be unreadable. That starts
to look quite worrying when you need to read 6TB off of three of them to
heal the array each time, with normal read load on top.
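
To put numbers on that naive reading, here is a rough sketch. It
assumes every bit read is an independent trial with a 1 in 10¹⁴
chance of being unreadable, which is the interpretation being
described and not something the spec sheet actually states. The
names are made up for the example.

```python
import math

UER_BITS = 1e14       # spec sheet figure: 1 unreadable bit per 10^14 bits read
DEVICE_BYTES = 6e12   # 6TB, counted in decimal as drive vendors do
SURVIVORS = 3         # four-device array with one device lost


def naive_ure_estimate(bytes_read, uer_bits=UER_BITS):
    """Return (expected bad bits, chance of at least one) for a given read.

    Assumes independent per-bit failures at a rate of 1/uer_bits.
    """
    bits = bytes_read * 8
    expected = bits / uer_bits
    # 1 - (1 - 1/uer_bits)**bits, computed in a numerically stable way.
    p_any = -math.expm1(bits * math.log1p(-1.0 / uer_bits))
    return expected, p_any


expected, p_any = naive_ure_estimate(SURVIVORS * DEVICE_BYTES)
print(f"data read per expected bad bit: {UER_BITS / 8 / 1e12:.1f}TB")
print(f"expected bad bits in one rebuild: {expected:.2f}")
print(f"naive chance of at least one: {p_any:.0%}")
```

Under those assumptions a single rebuild of the four-device array
expects more than one unreadable bit, which is why the figure looks
so alarming when read this way.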

The thing is, the UER can't mean that, as observed in practice:

    https://louwrentius.com/dont-be-afraid-of-raid.html
    http://www.raidtips.com/raid5-ure.aspx

HDDs just aren't as bad as that. So I don't know exactly what UER
means or how we're supposed to think of it. Perhaps it is some sort
of minimum reliability guarantee in that if you read 12½ TB off of
one of those drives and 1 bit of it comes out wrong then that's
still functioning as designed, but if 2 bits are wrong then that is
unacceptable and will be replaced under warranty?

Though even that interpretation has difficulties since you can't
read just 1 bit from an HDD, you can only read a sector, which at
these capacities will be 4KiB, and either it will all read correctly
or it won't (and you'll get an error for all of that sector).
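
As a quick illustration of the sector point, the same figures can be
restated in 4KiB sectors. This is only a sketch under the simplest
assumption I can think of, that one unreadable bit makes exactly one
sector unreadable. The spec sheet says nothing of the sort.

```python
SECTOR_BYTES = 4096          # 4KiB physical sector
UER_BITS = 10**14            # spec sheet figure, in bits read
DEVICE_BYTES = 6 * 10**12    # 6TB, decimal

bits_per_sector = SECTOR_BYTES * 8
sectors_per_device = DEVICE_BYTES // SECTOR_BYTES
# Whole sectors read per 10^14 bits, i.e. per "allowed" error.
sectors_per_uer_interval = UER_BITS // bits_per_sector

print(f"bits per sector: {bits_per_sector}")
print(f"sectors on one 6TB device: {sectors_per_device}")
print(f"sectors read per 10^14 bits: {sectors_per_uer_interval}")
```

So "1 bit in 10¹⁴" would in practice mean losing a whole 4KiB sector
somewhere in roughly every three billion sectors read, which is about
two full passes over one of these devices.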

Anyway, my main concern wouldn't be the published error rate but
just the fact that even at 100MB/sec sustained write it would take
about 17 hours to heal the array, during which time there is no
redundancy. And 100MB/s is really optimistic for an HDD in a system
under other load.
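
The rebuild time is just capacity divided by sustained write rate, so
it is easy to see how it stretches as the rate drops. A small sketch,
where the 50 and 25MB/s rates are numbers I picked to stand in for a
busy system and not measurements:

```python
def rebuild_hours(device_bytes, bytes_per_second):
    """Hours needed to write a whole replacement device at a steady rate."""
    return device_bytes / bytes_per_second / 3600


DEVICE_BYTES = 6e12  # 6TB, decimal

for mb_per_s in (100, 50, 25):
    hours = rebuild_hours(DEVICE_BYTES, mb_per_s * 1e6)
    print(f"{mb_per_s:>3}MB/s: {hours:5.1f} hours ({hours / 24:.1f} days)")
```

Every halving of the sustained rate doubles the window in which the
array has no redundancy.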

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting
