Some perspective on the DIY storage server mentioned at Storagemojo

Yesterday I received some mails and tweets pointing me to a “Thumper for the poor” DIY chassis. They asked for my opinion on this piece of hardware and whether it is competition for our X4500/X4540. Those questions arose after Robin Harris wrote his article “Cloud storage for $100 a terabyte”, which referred to the company Backblaze, which built a storage server on its own and described it on its blog in the article “Petabytes on a budget: How to build cheap cloud storage”. Sorry that this article took so long and that there may be a higher rate of typos than usual: my sinusitis came back with a vengeance … right in the second week of my vacation. But now this rather long article is ready :)
First of all: no, it isn’t a system comparable to an X4540 … even without the considerations of DIY versus Tier-1 vendor. I have a rather long opinion about it, but let’s state one thing up front: I see several problems, but I think it fits Backblaze’s needs, so it’s an optimal design for them and they designed it to be exactly that. I assume many of these problems are addressed in their application logic. The nice thing about a custom build is that you can build a system exactly for your needs, and the Backblaze system is a system reduced to the minimum. This device is that cheap because it cuts several corners. That’s okay for them, but for general purpose these cuts create problems. I want to share my concerns just to show you that you can’t compare this device to an X4540. And even more important: I have to disagree with the conclusions of the Backblaze people. This isn’t a good design, even when you just need cheap storage, unless you own a middleware that does a lot of the stuff that, for example, ZFS would do in the filesystem or a RAID controller would do in hardware. On the other hand, it supports my argument regarding the waning importance of RAID controllers: the more intelligent your application is, the less intelligent your storage needs to be. So … what are my objections to this DIY device:

  • The DIY Thumper has no power-distribution grid, so when one PSU fails, all devices connected to that power supply fail with it. And if PSU2 fails, the system board loses power and the whole machine fails. Game over ... until power comes back.
  • Connected to the last problem: given the disk layout, the power distribution isn't right. They use the disks in RAID6, but RAID6 only protects you against two failures. I don't see a sensible layout of three RAID6 groups that would allow the system to lose 25 disks at once. A more reasonable RAID level would be RAID10, but even there you have 5 disks without a partner in the other PSU failure domain.
  • I don't know if I consider a foam sleeve around the disks and some nylon screws enough vibration dampening, especially when you put that many hard disks into a single chassis. I'm looking forward to the next article they announced, which is supposed to cover exactly this topic. It will be even more interesting to hear about the performance and the longevity of the disks in such an environment over time. Just an example from the real world: we once found out that disks near a fan were a tad slower than the ones farther away from the fan. This led to changes in the vibration handling of that system.
  • This baby cries for ZFS: so much capacity, no battery-backed RAID controller, only disks with an unrecoverable error rate of 1 in 10^14 bits. But I see the reason why this choice wasn't feasible for them: until a few weeks ago, the OpenSolaris SATA framework had no support for port multipliers. This was introduced with the putback of PSARC/2009/394 into OpenSolaris, and now it's integrated. And given that this baby just speaks HTTPS to the outside and the software relies on Tomcat, it should be a piece of cake to move to OpenSolaris and ZFS now.
  • This design isn't really performance-oriented. As they use port multipliers to couple their disks to cheap SATA PCIe/PCI controllers, one 3 GBit/s interface has to feed 5 disks. One ST31500341AS delivers roughly 120 MByte/s (I saw several benchmarks suggesting such a value). Five of them deliver 600 MByte/s, a little bit less than 6 GBit/s. So each SATA channel is oversubscribed by a factor of two.
  • Even more important, three of the connections to the port multipliers are attached to a standard PCI port. One PCI Conventional 3.0 port (I didn't find information about what the board provides, so I assumed the fastest variant; source is the German Wikipedia page about PCI) is capable of delivering roughly 4 GBit/s (to be exact, 4.226 GBit/s). Thus you connect 18 GBit/s worth of hard disks to 4 GBit/s worth of connectivity.
  • I have similar objections to the PCIe connections of the SATA cards. Those ports are PCIe x1, and one PCIe 1.x lane has a theoretical throughput of 250 MByte/s. Such a port would already be fully loaded by just two of these hard disks, but this baby connects ten disks to a single PCIe lane (a back-of-the-envelope calculation of these oversubscription ratios follows after this list).
  • Of course those hard disks don't run at max speed all the time; I assume the load pattern will be very random in Backblaze's special use case. But this leads to a high mechanical load on the disks and to some additional objections. Based on the manual of the hard disk, I see several problems here:
    • The ST31500341AS is a desktop disk. They don't even use one of those nearline disks like the ones we use in the X4500/X4540. When you look into the disk manual, all reliability calculations were done on the basis of 2400 hours of operation per year. But a year has 8760 hours. If you don't believe me about those 2400 hours, just look at page 24 of the manual (a small sketch of what this means for the annualized failure rate follows after this list).
    • The reliability considerations of Seagate assume a desktop usage pattern, not a server usage pattern.
    • Seagate itself writes in the manual: "The AFR and MTBF will be degraded if used in an enterprise application". But given the long credits list at the end of their article, I assume they've read the manual and considered this in their choice of hard disks.
    • There is another important point about the reliability of the disks: the AFR and the MTBF of the 7200.11 are only valid at an ambient temperature of 25 degrees Celsius. Running it above this temperature reduces the MTBF and increases the AFR. Other hard disks built with enterprise usage in mind use a vastly higher nominal temperature as the foundation of these calculations.
  • But due to the usage of RAID6 those disks will see high throughput in any case: by its nature, RAID6 relies on a read/modify/write cycle for partial-stripe updates, so you read and write vastly more than just the modified data (see the sketch after this list). This may even eat into the scarce throughput of the system. We introduced RAIDZ, RAIDZ2 and RAIDZ3 to circumvent exactly this kind of problem.
  • No battery backup for the caches, but RAID6 ... well ... "Warning ... write holes ahead".
  • This system uses a desktop board, the DG43NB, so system resources are a little bit sparse: just one processor and just 4 GB of RAM. I find the latter a little bit problematic. For general purpose, a lot more memory would be advisable; there are good reasons to have 32 GB or 64 GB in an X4540. Without a large amount of cache you aren't able to shave off part of the IOPS load to get back to a moderate load on the disks, so the choice of desktop disks gets even more problematic here.
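
For those who want to redo the bandwidth arithmetic, here is a rough sketch of it in a few lines of Python. The per-disk throughput of roughly 120 MByte/s and the bus figures are the assumptions stated above, not measurements, and the disk counts per card follow my reading of the Backblaze description.

```python
# Back-of-the-envelope sketch of the oversubscription figures above. All the
# numbers are the assumptions stated in the text (roughly 120 MByte/s per
# ST31500341AS, five disks behind each port multiplier), not measurements.

DISK_MBPS = 120          # assumed sequential throughput of one disk
SATA_3G_MBPS = 300       # payload of one 3 GBit/s SATA link
PCI_MBPS = 528           # conventional PCI, the ~4.2 GBit/s assumed above
PCIE_X1_MBPS = 250       # one PCIe 1.x lane

def oversubscription(disks: int, link_mbps: float) -> float:
    """Aggregate disk throughput divided by the capacity of the link feeding it."""
    return disks * DISK_MBPS / link_mbps

print(f"port multiplier (5 disks, 3 GBit/s):    {oversubscription(5, SATA_3G_MBPS):.1f}x")
print(f"PCI bus (3 multipliers, 15 disks):      {oversubscription(15, PCI_MBPS):.1f}x")
print(f"PCIe x1 card (2 multipliers, 10 disks): {oversubscription(10, PCIE_X1_MBPS):.1f}x")
```

Counted in SATA line rate (with its 8b/10b encoding) the ratios shift a little, but the conclusion stays the same: every path between the disks and the system board is oversubscribed by at least a factor of two.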
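
To make the duty-cycle and MTBF remarks a bit more tangible, here is a minimal sketch that converts a datasheet MTBF into an annualized failure rate, assuming the usual exponential-lifetime model. The MTBF values are the ones quoted in this article, and the point of the list above is exactly that the desktop figure is only specified for 2400 power-on hours per year at 25 degrees Celsius, so extrapolating it to 24/7 operation is optimistic at best.

```python
# Minimal sketch: datasheet MTBF -> annualized failure rate (AFR) under the
# usual exponential-lifetime model. The MTBF figures are the ones quoted in
# this article; the extrapolation to 8760 hours/year is outside the conditions
# Seagate specifies for the 7200.11, which is exactly the problem.

from math import exp

def afr(mtbf_hours: float, power_on_hours_per_year: float) -> float:
    """Probability that a disk fails within one year at the given duty cycle."""
    return 1.0 - exp(-power_on_hours_per_year / mtbf_hours)

DESKTOP_MTBF = 700_000      # Seagate 7200.11, specified at 25 C and 2400 h/year
NEARLINE_MTBF = 1_200_000   # Hitachi Ultrastar A7K1000, specified for 24/7 use

print(f"7200.11 at its specified 2400 h/year: {afr(DESKTOP_MTBF, 2400):.2%}")   # ~0.34%
print(f"7200.11 extrapolated to 8760 h/year:  {afr(DESKTOP_MTBF, 8760):.2%}")   # ~1.24%
print(f"Ultrastar at 8760 h/year:             {afr(NEARLINE_MTBF, 8760):.2%}")  # ~0.73%
```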
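
And to put a number on the read/modify/write remark: a write that only touches part of a RAID6 stripe has to read the old data and both parities before it can write the new data and the recomputed parities back. The little sketch below just counts those I/Os; it illustrates the general RAID6 small-write penalty, not Backblaze's specific stripe geometry.

```python
# Sketch of the RAID6 small-write penalty: updating n data blocks inside an
# existing stripe means reading the old data and both parities (P and Q),
# then writing the new data and both recomputed parities.

def raid6_partial_stripe_ios(modified_blocks: int = 1) -> int:
    """Total disk I/Os for a partial-stripe update touching `modified_blocks`."""
    reads = modified_blocks + 2    # old data blocks + old P + old Q
    writes = modified_blocks + 2   # new data blocks + new P + new Q
    return reads + writes

print(raid6_partial_stripe_ios(1))   # 6 I/Os where a plain disk would need 1 write
```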
I think Robin Harris is correct with his comment that this system is a DC-3. It flies, it can transport goods and passengers from A to B at a reasonable, but not fast, speed ... but don't forget your parachutes ;) It's the same with this storage: this hardware needs its parachute in the form of the software in front of the device.

But, and this is one of the key takeaways for you: even when other systems are more expensive, they are not overpriced. First, don't compare the mentioned list prices with the street prices of components. Second: of course you can save a dollar at one place or another, but: the Seagate hard disk costs you 100 Euro at a big German computer online shop, while the HUA721010KLA330 (aka Hitachi Ultrastar A7K1000 1TB) costs roundabout 200 Euro after a quick Google search. Just using other (in my opinion correct for general purpose) disks would double the price while offering less storage. And even this price isn't indicative, as most often there are special agreements between drive manufacturers and system manufacturers covering quality standards, quality management and conditions. The technical differences of the Ultrastar: 1 error in 10^15 bits instead of 1 in 10^14, qualified for 24/7 operation by the manufacturer, qualified for an enterprise work pattern (and even there only a lighter one), and 1.2 million hours MTBF normalized at 40 degrees (AFAIK) instead of 0.7 million hours at 25 degrees. Quality costs. Period. The same goes for the desktop board in the DIY "Thumper" instead of a board custom-built for optimal performance (a SATA controller for each disk, or an 8x-lane PCIe slot for 8 disks instead of a 1x-lane slot for 10 disks, for example). I'm pretty sure Sun could build an equally priced system if you took the bare metal of the X4500 chassis and ripped out all the specialities of the X4500/X4540 systems. But such a system, with so many corners cut, wouldn't be a system you expect from Sun. And yes, the X4540 has less capacity at the moment, but I think it's not too far-fetched that the X4540 gets 2 TB drives as soon as they reach the same quality standards and qualification as the current drives, giving the X4540 a capacity of 96 TB.

To close this article: it's about making decisions. Application and hardware have to be seen as one. When your application is capable of overcoming the limitations and problems of such ultra-cheap storage (and the software of Backblaze seems to have these capabilities), such a DIY thing may be a good solution for you. If you have to run normal applications without these capabilities, the general-purpose system looks like the much better road in my opinion.