Hello,
This email is a bit of a ramble about block device IO and SSDs and
contains no information immediately relevant to your service, so
feel free to skip it.
In considering what the next iteration of BitFolk infrastructure
will be like, I wonder about the best ways to use SSDs.
As you may be aware, IO load is the biggest deal in virtual hosting.
It's the limit everyone hits first. It's probably what will dismay
you first on Amazon EC2. Read
http://wiki.postgresql.org/images/7/7f/Adam-lowry-postgresopen2011.pdf
or at least pages 8, 29 and 30 of it.
Usually it is IO load that tells us when it's time to stop putting
customers on a server, even if it has a bunch of RAM and disk
space left. If disk latency gets too high, everything will suck,
people will complain and cancel their accounts. When the disk
latency approaches 10ms we know it's time to stop adding VMs.
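(For anyone who wants to check that sort of figure themselves: it's
roughly the per-request average wait you can derive from the kernel's
per-device counters. A minimal sketch in Python, assuming a Linux box
with /proc/diskstats; the device name 'sda' and the 10 second
sampling interval are just examples.)

    #!/usr/bin/env python
    # Rough average IO latency for one block device, from /proc/diskstats.
    # Fields used (0-indexed after splitting the line):
    #   3 reads completed, 6 ms spent reading,
    #   7 writes completed, 10 ms spent writing.
    import time

    def diskstats(dev):
        with open('/proc/diskstats') as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return (int(fields[3]), int(fields[6]),
                            int(fields[7]), int(fields[10]))
        raise ValueError('no such device: %s' % dev)

    def avg_latency_ms(dev, interval=10):
        r1, rms1, w1, wms1 = diskstats(dev)
        time.sleep(interval)
        r2, rms2, w2, wms2 = diskstats(dev)
        ios = (r2 - r1) + (w2 - w1)
        if ios == 0:
            return 0.0
        return float((rms2 - rms1) + (wms2 - wms1)) / ios

    if __name__ == '__main__':
        print('%.2f ms' % avg_latency_ms('sda'))  # 'sda' is an example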
Over the years we've experimented with various solutions. We built
a server with 10kRPM SAS drives, and that works nicely, but the
storage then costs so much that it's just not economical.
After that we started building bigger servers with 8 disks instead of
4, and that's where we are now. This has worked out well, as we can usually
get around twice as many VMs on one server, and it saves having to
pay for an extra chassis, motherboard, PSUs and RAID controller.
SSD prices have now dropped enough that it's probably worth looking
at how they can be used here. I can think of several ways to go:
- Give you the option of purchasing SSD-backed capacity
=====================================================
Say SSD capacity costs 10 times what SATA capacity does. For any
additional storage you might like to purchase, you'd get to choose
between 5G of SATA-backed storage and 0.5G of SSD-backed storage
at the same price.
Advantages:
- The space is yours alone; you get to put what you like on it. If
you've determined where your storage hot spots are, you can put
them on SSD and know they're on SSD (one crude way of finding them
is sketched at the end of this section).
Disadvantages:
- In my experience most people do not appreciate choice; they just
want it to work.
Most people aren't in a position to analyse their storage use
and find hot spots. They lack either the inclination or the
capability or both - the service is fine until it's not.
- It means buying two expensive SSDs that will spend most of their
time being unused.
Two required because they'll have to be in a RAID-1.
Most of the time unused because the capacity won't be sold
immediately.
Expensive because they will need to be large enough to cater to
as large a demand as I can imagine for each server.
Unfortunately I have a hard time guessing what that demand would
be like so I'll probably guess wrong.
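(As an aside: if you did want to find your own hot spots, one crude
starting point from inside a Linux VM is to rank processes by how
many bytes they've pushed through the block layer. A minimal sketch,
assuming /proc/<pid>/io is readable, which usually means running it
as root; it only shows *which process*, not which files or
partitions, so it's a starting point rather than an answer.)

    #!/usr/bin/env python
    # Rank processes by cumulative block-layer IO, via /proc/<pid>/io.
    import os

    def proc_io(pid):
        stats = {}
        with open('/proc/%s/io' % pid) as f:
            for line in f:
                key, value = line.split(':')
                stats[key.strip()] = int(value)
        return stats

    def top_io(n=10):
        totals = []
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                io = proc_io(pid)
                with open('/proc/%s/comm' % pid) as f:
                    comm = f.read().strip()
            except (IOError, OSError):
                continue  # process exited or permission denied
            totals.append((io['read_bytes'] + io['write_bytes'], pid, comm))
        return sorted(totals, reverse=True)[:n]

    if __name__ == '__main__':
        for total, pid, comm in top_io():
            print('%12d bytes  pid %-6s %s' % (total, pid, comm))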
- Find some means of using SSDs as a form of tiered storage
=========================================================
We could continue deploying the majority of your storage from SATA
disks while also employing SSDs to cache these slower disks in
some manner.
The idea is that frequently-accessed data is backed on SSD whereas
data that is accessed less often is left on the larger-capacity
SATA, and *this remains transparent to the end user*, i.e. the VM.
This is not a new idea; plenty of storage hardware already does
it, ZFS can do it and so can BTRFS.
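To make the idea concrete, here's a toy model of what such a cache
does. This is not how any particular product implements it; it just
illustrates the principle: blocks that get read are promoted into a
small fast tier, cold ones fall back out, and the caller never knows
which tier answered.

    #!/usr/bin/env python
    # Toy model of a transparent SSD cache in front of SATA (Python 3).
    from collections import OrderedDict

    class TieredStore:
        def __init__(self, backing, cache_blocks):
            self.backing = backing      # dict: block number -> data ("SATA")
            self.cache = OrderedDict()  # LRU of hot blocks ("SSD")
            self.cache_blocks = cache_blocks

        def read(self, block):
            if block in self.cache:     # hit: served from the fast tier
                self.cache.move_to_end(block)
                return self.cache[block]
            data = self.backing[block]  # miss: served from the slow tier
            self._promote(block, data)
            return data

        def _promote(self, block, data):
            self.cache[block] = data
            if len(self.cache) > self.cache_blocks:
                self.cache.popitem(last=False)  # evict least recently used

The important property is that read() looks exactly the same to the
caller whichever tier the data came from.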
Advantages:
- For whatever benefit there is, everyone gets to feel it. If done
right, any VM that needs more IOPS should get more IOPS.
- Expensive SSDs purchased can be used immediately, in full.
Disadvantages:
- Since we can't use ZFS or expensive storage hardware, any
short-term solution is likely to be rather hacky. Do we want to
be pioneers here? This is your data.
- Customers with VMs that don't have heavy IO requirements (most)
will be subsidising those who *do* have heavy IO requirements.
It's very unlikely we will put prices up, but SSDs are not free
so it has the effect of delaying the usual progression of
more-for-less that this type of service goes through.[1]
- Beyond what might be quite a blunt instrument, customers will
have no way to request faster storage and rely on it being
present. You have "some storage" and if that storage isn't
performing as fast as you would like, all we would be able to do
is try to see why it's not being cached on SSD.
- Both?
=====
Perhaps there is some way to do both? Maybe start off using the
whole SSD as cache, then shrink the cache as requests to purchase
SSD-backed storage come in?
Advantages:
- Again everyone feels the benefit immediately and hardware isn't
wasted.
- If the customer needs to buy SSD-backed storage then they can.
Disadvantages:
- If the caching is good enough then no one would feel the need to
buy SSD anyway, so why add complexity?
Questionable:
- If people buy all of the SSD, does that reduce caching benefit
to zero and suddenly screw everyone else over?
Presumably SSD-backed storage could be priced such that if a lot
of people did buy it, it would be economical to go out and buy a
pair of larger ones and swap them over without downtime[2].
So, if anyone has any thoughts on this I'd be interested in hearing
them.
If you had an IO latency problem, would you know how to diagnose it
to determine that it was something you were doing as opposed to
"BitFolk's storage is overloaded but it's not me"?
If you could do that, would you be likely to spend more money on
SSD-backed storage?
If we came to you and said that your VPS service was IO-bound and
would run faster if you bought some SSD-backed storage, do you think
that you would?[3]
My gut feeling at the moment is that while I would love to be
feeding the geek inside everyone and offering eleventy-billion
choices, demand for SSD-backed storage at an additional cost will be
low.
I also think it's going to be very difficult for an admin of a
virtualised block device to tell the difference between:
"All my processes are really slow at talking to storage; it's
because of my process ID 12345 which is a heavy DB query"
and:
"All my processes are really slow at talking to storage; that's
definitely a problem with BitFolk's storage and not anything I
am doing."
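About the roughest check I can think of from inside the guest only
gets you partway: compare the per-request wait your virtual disk is
reporting with how much IO your own processes are actually issuing
over the same interval. High wait plus very little of your own IO
suggests the contention is below you, but it's suggestive rather
than conclusive. A minimal sketch, again assuming a Linux guest; the
device name 'xvda' is just an example, and IO from processes that
exit during the interval disappears from the count.

    #!/usr/bin/env python
    # Crude "is it me?" check: device-level wait vs my own processes' IO.
    import glob, time

    def device_totals(dev):
        with open('/proc/diskstats') as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    ios = int(fields[3]) + int(fields[7])   # reads + writes
                    ms = int(fields[6]) + int(fields[10])   # time waiting
                    return ios, ms
        raise ValueError('no such device: %s' % dev)

    def my_io_bytes():
        total = 0
        for path in glob.glob('/proc/[0-9]*/io'):
            try:
                with open(path) as f:
                    for line in f:
                        if line.startswith(('read_bytes', 'write_bytes')):
                            total += int(line.split(':')[1])
            except (IOError, OSError):
                continue  # process exited or permission denied
        return total

    if __name__ == '__main__':
        dev = 'xvda'                    # example device name
        ios1, ms1 = device_totals(dev)
        bytes1 = my_io_bytes()
        time.sleep(10)
        ios2, ms2 = device_totals(dev)
        bytes2 = my_io_bytes()
        ios = ios2 - ios1
        wait = float(ms2 - ms1) / ios if ios else 0.0
        print('avg wait %.2f ms; my processes did %d bytes of IO'
              % (wait, bytes2 - bytes1))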
By the way, I think we've done reasonably well at keeping IO latency
down, over the years:
barbar:
http://tools.bitfolk.com/cacti/graphs/graph_1634_6.png
bellini:
http://tools.bitfolk.com/cacti/graphs/graph_2918_4.png
cosmo:
http://tools.bitfolk.com/cacti/graphs/graph_2282_4.png
curacao:
http://tools.bitfolk.com/cacti/graphs/graph_1114_6.png
dunkel:
http://tools.bitfolk.com/cacti/graphs/graph_1485_6.png
faustino:
http://tools.bitfolk.com/cacti/graphs/graph_1314_6.png
kahlua:
http://tools.bitfolk.com/cacti/graphs/graph_1192_6.png
kwak:
http://tools.bitfolk.com/cacti/graphs/graph_1113_6.png
obstler:
http://tools.bitfolk.com/cacti/graphs/graph_1115_6.png
president:
http://tools.bitfolk.com/cacti/graphs/graph_2639_4.png
urquell:
http://tools.bitfolk.com/cacti/graphs/graph_2013_6.png
(Play at home quiz: which four of the above do you think have eight
disks instead of four? Which one has four 10kRPM SAS disks? Answers
at [4])
In general we've found that keeping the IO latency below 10ms keeps
people happy.
There have been short periods where we've failed to keep it below
10ms and I'm sure that many of you can remember times when you've
found your VPS sluggish. Conversely I suspect that not many
customers can think of times when their VPSes have been the *cause*
of high IO load, yet high IO load is in general only caused by
customer VMs! So for every time you have experienced this, someone
else was causing it![5]
I think that, being in the business of providing virtual
infrastructure at commodity prices, we can't really expect too many
people to want or be able to take the time to profile their storage
use and make a call on what needs to be backed by SATA or SSD.
I think we first need to try to make it as good as possible for
everyone, always. There may be a time in the future where it's
commonplace for customers to evaluate storage in terms of IO
operations per second instead of gigabytes, but I don't think we are
there yet.
As for the "low-end customers subsidise higher-end customers"
argument, that's just how shared infrastructure works and is already
the case for plenty of other resources, so what's one more? While we
continue to lack a good way to ration out IO capacity, it is
difficult to add it as a line item.
So, at the moment I'm more drawn to the "both" option but with the
main focus being on caching with a view to making it better for
everyone, and hopefully overall reducing our costs. If we can sell
some dedicated SSD storage to those who have determined that they
need it then that would be a bonus.
Thoughts? Don't say, "buy a big SAN!" :-)
Cheers,
Andy
[1] You know, when we double the RAM or whatever but keep the price
to you the same.
[2] Hot swap trays plus Linux md = online array grow. In theory.
[3] "Nice virtual machine you have here. Would be a real shame if
the storage latency were to go through the roof, yanno? We got
some uh… extras… that can help you right out of that mess.
Pauly will drop by tomorrow with an invoice."
— Tony Soprano's Waste Management and Virtual Server Hosting,
Inc.
[4] echo "oryyvav, pbfzb, cerfvqrag naq hedhryy unir rvtug qvfxf.
oneone unf sbhe FNF qvfxf." | rot13
[5] Barring *very* occasional problems like a disk broken in such a
way that it doesn't die but delays every IO request, or a
battery on a RAID controller failing, which disables the
write cache.
--
http://bitfolk.com/ -- No-nonsense VPS hosting