Some perspective on the DIY storage server mentioned at Storagemojo

Yesterday I received some mails and tweets pointing me to a “Thumper for the poor” DIY chassis. They asked for my opinion on this piece of hardware and whether it is competition for our X4500/X4540. Those questions arose after Robin Harris wrote his article “Cloud storage for $100 a terabyte”, which referred to the company Backblaze, which built a storage server on its own and described it on its blog in the article “Petabytes on a budget: How to build cheap cloud storage”. Sorry that this article took so long and that there may be a higher rate of typos than usual: my sinusitis came back with a vengeance … right in the second week of my vacation. But now this rather long article is ready :)
First of all: no, it isn’t a system comparable to an X4540 … even without the considerations of DIY versus Tier-1 vendor. I have a rather long opinion about it, but let’s state one thing up front: I see several problems, but I think it fits Backblaze’s needs, so it’s an optimal design for them and they designed it to be exactly that. I assume many of these problems are addressed in their application logic. The nice thing about a custom build is that you can build a system exactly for your needs, and the Backblaze system is a system reduced to the minimum. This device is that cheap because it cuts several corners. That’s okay for them, but for general purpose these cuts create problems. I want to share my concerns just to show you that you can’t compare this device to an X4540. And even more important: I have to disagree with the conclusions of the Backblaze people. This isn’t a good design, even when you just need cheap storage, unless you own a middleware that does a lot of the stuff that, for example, ZFS would do in the filesystem or a RAID controller would do in hardware. On the other hand, it supports my argument regarding the waning importance of RAID controllers: the more intelligent your application is, the less intelligent your storage needs to be. So … what are my objections to this DIY device:

  • The DIY Thumper has no power-distribution grid, so when one PSU fails, all devices connected to that power supply fail with it. And if PSU2 fails, the system board loses power and the whole machine fails. Game over ... until power comes back.
  • Connected to the last problem: given the disk layout, the power distribution isn't right. They use the disks in RAID6, but RAID6 only protects you against two failures. I don't see a sensible layout of three RAID6 groups that would allow the system to lose 25 disks at once. A more reasonable RAID level would be RAID10, but even there you have 5 disks without a partner in the other PSU failure domain.
  • I don't know if I consider a foam sleeve around the disks and some nylon screws enough vibration dampening, especially when you put that many hard disks into a single chassis. I'm looking forward to the next article they announced, which is supposed to cover exactly this topic. It will be even more interesting to hear about the performance and the longevity of the disks in such an environment over time. Just an example from the real world: we once found out that disks near a fan were a tad slower than the ones farther away from the fan. This led to changes in the vibration handling of that system.
  • This baby cries for ZFS: so much capacity, no battery-backed RAID controller, only disks with an unrecoverable error rate of 1 in 10^14 bits. But I see the reason why this choice wasn't feasible for them: until a few weeks ago, the OpenSolaris SATA framework had no support for port multipliers. This was introduced with the putback of PSARC/2009/394 into OpenSolaris, and now it's integrated. And given that this baby just speaks HTTPS to the outside and the software relies on Tomcat, it should be a piece of cake to move to OpenSolaris and ZFS now.
  • This design isn't really performance-oriented. As they use port multipliers to couple their disks to cheap SATA PCIe/PCI controllers, one 3 GBit/s interface has to feed 5 disks. One ST31500341AS delivers roughly 120 MByte/s (I saw several benchmarks suggesting such a value). Five of them deliver 600 MByte/s, a little bit less than 6 GBit/s. So each SATA channel is oversubscribed by a factor of two.
  • Even more important, three of the connections to the port multipliers are attached to a standard PCI port. One PCI Conventional 3.0 port (I didn't find information about what the board provides, so I assumed the fastest variant; source is the German Wikipedia page about PCI) is capable of delivering roughly 4 GBit/s (to be exact, 4.226 GBit/s). Thus you connect 18 GBit/s worth of hard disks to 4 GBit/s worth of connectivity.
  • I have similar objections to the PCIe connections of the SATA cards. Those ports are PCIe x1, and one PCIe 1.x lane has a theoretical throughput of 250 MByte/s. Such a port would already be fully loaded by just two of these hard disks, but this baby connects ten disks to a single PCIe lane (a back-of-the-envelope calculation of these oversubscription ratios follows after this list).
  • Of course those hard disks don't run at max speed all the time; I assume the load pattern will be very random in Backblaze's special use case. But this leads to a high mechanical load on the disks and to some additional objections. Based on the manual of the hard disk, I see several problems here:
    • The ST31500341AS is a desktop disk. They don't even use one of those nearline disks like the ones we use in the X4500/X4540. When you look into the disk manual, all reliability calculations were done on the basis of 2400 hours of operation per year. But a year has 8760 hours. If you don't believe me about those 2400 hours, just look at page 24 of the manual (a small sketch of what this means for the annualized failure rate follows after this list).
    • The reliability considerations of Seagate assume a desktop usage pattern, not a server usage pattern.
    • Seagate itself writes in the manual: "The AFR and MTBF will be degraded if used in an enterprise application". But given the long credits list at the end of their article, I assume they've read the manual and considered this in their choice of hard disks.
    • There is another important point about the reliability of the disks: the AFR and the MTBF of the 7200.11 are only valid at an ambient temperature of 25 degrees Celsius. Running it above this temperature reduces the MTBF and increases the AFR. Other hard disks built with enterprise usage in mind use a vastly higher nominal temperature as the foundation of these calculations.
  • But due to the usage of RAID6 those disks will see high throughput in any case: by its nature, RAID6 relies on a read/modify/write cycle for partial-stripe updates, so you read and write vastly more than just the modified data (see the sketch after this list). This may even eat into the scarce throughput of the system. We introduced RAIDZ, RAIDZ2 and RAIDZ3 to circumvent exactly this kind of problem.
  • No battery backup for the caches, but RAID6 ... well ... "Warning ... write holes ahead".
  • This system uses a desktop board, the DG43NB, so system resources are a little bit sparse: just one processor and just 4 GB of RAM. I find the latter a little bit problematic. For general purpose, a lot more memory would be advisable; there are good reasons to have 32 GB or 64 GB in an X4540. Without a large amount of cache you aren't able to shave off part of the IOPS load to get back to a moderate load on the disks, so the choice of desktop disks gets even more problematic here.
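
For those who want to redo the bandwidth arithmetic, here is a rough sketch of it in a few lines of Python. The per-disk throughput of roughly 120 MByte/s and the bus figures are the assumptions stated above, not measurements, and the disk counts per card follow my reading of the Backblaze description.

```python
# Back-of-the-envelope sketch of the oversubscription figures above. All the
# numbers are the assumptions stated in the text (roughly 120 MByte/s per
# ST31500341AS, five disks behind each port multiplier), not measurements.

DISK_MBPS = 120          # assumed sequential throughput of one disk
SATA_3G_MBPS = 300       # payload of one 3 GBit/s SATA link
PCI_MBPS = 528           # conventional PCI, the ~4.2 GBit/s assumed above
PCIE_X1_MBPS = 250       # one PCIe 1.x lane

def oversubscription(disks: int, link_mbps: float) -> float:
    """Aggregate disk throughput divided by the capacity of the link feeding it."""
    return disks * DISK_MBPS / link_mbps

print(f"port multiplier (5 disks, 3 GBit/s):    {oversubscription(5, SATA_3G_MBPS):.1f}x")
print(f"PCI bus (3 multipliers, 15 disks):      {oversubscription(15, PCI_MBPS):.1f}x")
print(f"PCIe x1 card (2 multipliers, 10 disks): {oversubscription(10, PCIE_X1_MBPS):.1f}x")
```

Counted in SATA line rate (with its 8b/10b encoding) the ratios shift a little, but the conclusion stays the same: every path between the disks and the system board is oversubscribed by at least a factor of two.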
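
To make the duty-cycle and MTBF remarks a bit more tangible, here is a minimal sketch that converts a datasheet MTBF into an annualized failure rate, assuming the usual exponential-lifetime model. The MTBF values are the ones quoted in this article, and the point of the list above is exactly that the desktop figure is only specified for 2400 power-on hours per year at 25 degrees Celsius, so extrapolating it to 24/7 operation is optimistic at best.

```python
# Minimal sketch: datasheet MTBF -> annualized failure rate (AFR) under the
# usual exponential-lifetime model. The MTBF figures are the ones quoted in
# this article; the extrapolation to 8760 hours/year is outside the conditions
# Seagate specifies for the 7200.11, which is exactly the problem.

from math import exp

def afr(mtbf_hours: float, power_on_hours_per_year: float) -> float:
    """Probability that a disk fails within one year at the given duty cycle."""
    return 1.0 - exp(-power_on_hours_per_year / mtbf_hours)

DESKTOP_MTBF = 700_000      # Seagate 7200.11, specified at 25 C and 2400 h/year
NEARLINE_MTBF = 1_200_000   # Hitachi Ultrastar A7K1000, specified for 24/7 use

print(f"7200.11 at its specified 2400 h/year: {afr(DESKTOP_MTBF, 2400):.2%}")   # ~0.34%
print(f"7200.11 extrapolated to 8760 h/year:  {afr(DESKTOP_MTBF, 8760):.2%}")   # ~1.24%
print(f"Ultrastar at 8760 h/year:             {afr(NEARLINE_MTBF, 8760):.2%}")  # ~0.73%
```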
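
And to put a number on the read/modify/write remark: a write that only touches part of a RAID6 stripe has to read the old data and both parities before it can write the new data and the recomputed parities back. The little sketch below just counts those I/Os; it illustrates the general RAID6 small-write penalty, not Backblaze's specific stripe geometry.

```python
# Sketch of the RAID6 small-write penalty: updating n data blocks inside an
# existing stripe means reading the old data and both parities (P and Q),
# then writing the new data and both recomputed parities.

def raid6_partial_stripe_ios(modified_blocks: int = 1) -> int:
    """Total disk I/Os for a partial-stripe update touching `modified_blocks`."""
    reads = modified_blocks + 2    # old data blocks + old P + old Q
    writes = modified_blocks + 2   # new data blocks + new P + new Q
    return reads + writes

print(raid6_partial_stripe_ios(1))   # 6 I/Os where a plain disk would need 1 write
```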
I think Robin Harris is correct with his comment that this system is a DC-3. It flies, it can transport goods and passengers from A to B at a reasonable, but not fast, speed ... but don't forget your parachutes ;) It's the same with this storage: this hardware needs its parachute in the form of the software in front of the device.

But, and this is one of the key takeaways for you: even when other systems are more expensive, they are not overpriced. First, don't compare the mentioned list prices with the street prices of components. Second: of course you can save a dollar at one place or another, but: the Seagate hard disk costs you 100 Euro at a big German computer online shop, while the HUA721010KLA330 (aka Hitachi Ultrastar A7K1000 1TB) costs roundabout 200 Euro after a quick Google search. Just using other (in my opinion correct for general purpose) disks would double the price while offering less storage. And even this price isn't indicative, as most often there are special agreements between drive manufacturers and system manufacturers covering quality standards, quality management and conditions. The technical differences of the Ultrastar: 1 error in 10^15 bits instead of 1 in 10^14, qualified for 24/7 operation by the manufacturer, qualified for an enterprise work pattern (and even there only a lighter one), and 1.2 million hours MTBF normalized at 40 degrees (AFAIK) instead of 0.7 million hours at 25 degrees. Quality costs. Period. The same goes for the desktop board in the DIY "Thumper" instead of a board custom-built for optimal performance (a SATA controller for each disk, or an 8x-lane PCIe slot for 8 disks instead of a 1x-lane slot for 10 disks, for example). I'm pretty sure Sun could build an equally priced system if you took the bare metal of the X4500 chassis and ripped out all the specialities of the X4500/X4540 systems. But such a system, with so many corners cut, wouldn't be a system you expect from Sun. And yes, the X4540 has less capacity at the moment, but I think it's not too far-fetched that the X4540 gets 2 TB drives as soon as they reach the same quality standards and qualification as the current drives, giving the X4540 a capacity of 96 TB.

To close this article: it's about making decisions. Application and hardware have to be seen as one. When your application is capable of overcoming the limitations and problems of such ultra-cheap storage (and the software of Backblaze seems to have these capabilities), such a DIY thing may be a good solution for you. If you have to run normal applications without these capabilities, the general-purpose system looks like the much better road in my opinion.