The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Friday, August 28. 2009
While commenting on a tweet by @alecmuffet, a question arose in my head: why do so many people talk about SAN boot, and why do people so seldom ask for NFS boot?
Aside from my opinion that systems should be able to boot without help from the outside (debugging is much easier when you have an OS on your metal), SAN boot looks to me like combining the worst of all worlds: you depend on centralized infrastructure, you have to put HBAs (at least two) in every server, you have to provide an additional fabric, and encryption is still an unsolved problem.
On the other side you have NFS: deduplication and cloning are a non-problem when you configure your boot environment in a clever way, and encryption is available by means of IPsec. Caching to take load off the network is possible with CacheFS, for example. Or think about combining NFS boot with the snapshot/clone function of ZFS to clone your boot environments.
So why is everybody looking in the direction of SAN boot instead of NFS boot when they look at centralized boot storage? The issue of needing an additional fabric is known, so solutions are being developed, but instead of simply going in the direction of NFS, the industry is running in the direction of blocks over IP or blocks over Ethernet. Strange world ...
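The ZFS part of that idea takes only a handful of commands on the boot server. A minimal sketch, assuming a pool named tank and made-up client names (populate the golden image once, then every boot environment is a cheap clone):

```shell
# One "golden" root image per OS release; clones share blocks with the
# snapshot, so a hundred boot environments cost little more than one.
zfs create tank/boot/golden          # fill this once with a root image
zfs snapshot tank/boot/golden@ready
zfs clone tank/boot/golden@ready tank/boot/web01
zfs clone tank/boot/golden@ready tank/boot/web02
# Export each clone read-write to exactly one client:
zfs set sharenfs='rw=web01,root=web01' tank/boot/web01
zfs set sharenfs='rw=web02,root=web02' tank/boot/web02
```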
Posted by Joerg Moellenkamp in English, The IT Business at 21:53 | Comments (28)
It's probably because that's the only way that Windows can do a remote boot.
#1 Gary Mills on 2009-08-29 02:59
Okay, that's a reason, but why do I see SAN-based diskless boot on Unix/Linux systems, too?
On the question of Linux diskless boot, I would say: PXE GRUB, perhaps?
NFSROOT is commonly used for shared FS in clusters...
Since the majority of HPC clusters is probably on Linux, the cluster builders limited their search for NFS servers to boot from to Linux, found the early Linux NFS implementations lacking in scalability, and developed other means. Those means worked, and the evolution toward alternatives was halted.
Sorry, only partially true: recent Windows versions offer software initiators to add extra disks, but if you want to boot from the SAN you need some software or microcode to start the process and get to the master boot record, which is on the machine on the other side of the SAN. Like:
The real reason, I think, is that the Windows NTLDR code is cranky about disk geometry.
The solution that got to market first became prevalent, to the point that it became the accepted wisdom.
Apropos "accepted wisdom": "After breaking through the technology barrier, you run into the barrier of accepted wisdom."
Sounds like we need a Joerg-style tutorial to educate and inform.
Yes please, Tutorial, Tutorial!
Problem imho is that there is no support for mounting NFS from the BIOS (neither motherboard, NIC, nor FC card).
You could get the initial stage of a system that mounts NFS booted with gPXE/Etherboot, though.
Of course I have strong opinions on the subject, but I don't necessarily agree with them
In a company I worked at, we used to have lots of small servers with CacheFS for all data and config files. It worked well, and it was very easy to add or replace a node.
Problem is that CacheFS is going to be EOL'ed: http://docs.sun.com/app/docs/doc/820-7273/eos-36?l=es&a=view&q=820-7273
I think it's a cultural/psychological thing because, frankly, a lot of administrators don't "get" network filesystems when you exceed the common "net use h: \\server\share" usage.
And using iSCSI (or ATAoE, or ...) just extends the local-disk knowledge a single step. Additionally, there are a lot of rather cheap "NAS servers" that speak these protocols and need less CPU horsepower for them than for NFS (or so I have been told).
Personally, I administered NFS-booted (Linux/BSD) workstations many years ago and found them rather easy to implement.
#5 Christian Vogel on 2009-08-29 11:20
OK. I will try to give my perspective on why I want my SUN box to boot off a SAN and not from NFS or any other means.
1. I already have a SAN because I am running a modern data centre.
2. Everything else is booting from the SAN.
3. I want the OS-disks to be included in the same DR regime as the rest of the servers.
4. For most Operating Systems booting off a SAN is trivial.
1. The SUN community fights hard against SAN. So what should I do? Play along and create an exclusive infrastructure, DR regime and monitoring just for the SUN servers, or go against their advice?
2. It is a technical challenge, especially combined with third-party failover drivers.
We are just now in the design phase for a customer that is going to centralize (datacentre and DR), and these are the questions we must answer. Everything else is easily centralized, but what should we do with the SUN servers? Ditch them and change to Linux or AIX? What should SUN's response be? Try to convince us to change the overall data centre design (and maybe lose a SUN customer), or help us implement it - with SUN servers.
#6 lparvirt on 2009-08-29 11:55
1. Of course you have a SAN, and I see the advantages of a SAN, but you have an Ethernet network as well, and you have your disaster recovery scheme there. Many problems are solved in a standardized way, like encrypting the data while it's on the cable. Gigabit Ethernet is ubiquitous, and I can use copper instead of fibre.
2. Is it really feasible to put a SAN connection on a system that wouldn't otherwise need a SAN, just for SAN boot? I've seen web servers and other small systems attached to a SAN just to boot over it; that doesn't look like a good investment. And instead of managing a boot instance, you have to manage the storage network, which is vastly more complex than a local boot disk or even an Ethernet network.
3. I don't have anything against SAN boot per se; I just have some problems with its economic feasibility and with the fact that you can't start up your system without the help of other components. Simple example: your storage is from EMC, your SAN from Brocade, your server from HP, IBM or Sun. The server doesn't boot. Who is responsible? I would put bootable local disks into the system even when using SAN boot or NFS boot, to be sure that a non-booting system really is a problem of my system and not some weird problem between my system and my boot storage. When my system comes up from the local boot disk, I can tell my storage vendor: "It's your turn, fix your sh...".
And there is another reason why hard disks should be in the system: using expensive centralized storage for boot disks may have its advantages, but using expensive centralized storage for swap disks looks like a waste of money to me.
4. DR is a good point, but you can still do your centralized DR on your storage when you put one or two NFS servers in front of it; NFS HA is really a no-brainer. But I have my doubts that centralized boot environments for DR are really a killer argument for network boot. It looks more feasible to me to use a standardized local boot disk and a self-contained application repository in /opt/companyname that I mount via NFS or via a block device. Is something as easily replicable as an OS installation really worth putting onto expensive storage? In the end, I just need a small tarball with everything required to re-personalize a system after a crash. And in any case, I can't simply switch over to another data center (which I consider a big advantage of centralized replicated storage), as I have to adapt the system to the new network. Even when your boot disk has moved into the new data center, the IP network hasn't necessarily moved to the other side.
Of course, for the data disks a SAN is still a very good choice when your application isn't capable of doing the data management on its own. But I'm not convinced that boot disks are such a case.
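The "small tarball" mentioned in point 4 can be as trivial as this - a sketch with Solaris paths; the file list is an illustrative minimum, not a complete recipe:

```shell
# Capture the few files that personalize an otherwise standard build.
tar cvf - /etc/hosts /etc/hostname.* /etc/netmasks \
    /etc/defaultrouter /etc/nsswitch.conf /etc/resolv.conf \
    /var/spool/cron/crontabs \
  | gzip > /var/tmp/identity-`hostname`.tar.gz
# After a crash: reinstall the standard image, unpack this archive, reboot.
```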
5. SAN boot isn't a property of the OS; it's more a matter of the HBA. When it's well implemented in the HBA BIOS, it's trivial for every OS.
6. I'm not aware that Sun is fighting SAN. Why do you have this idea? Or do you take my stance of "SAN where SAN makes sense, DAS where DAS makes sense, no SAN for SAN's sake" as the opinion of the community?
7. You write "modern datacentre". Well ... SAN isn't modern, it's standard today. I used my first SAN in 1998 (copper, 1 Gb) and implemented a vastly bigger one in 2000. I don't know why SAN still carries this moniker "modern". And in the end, every kind of technology will be substituted by something more cost-effective.
Look, you asked why customers want SAN boot and not NFS, and I gave you some strong reasons. Customers are important, and SUN should focus on making life easier for all the customers that use SAN and DR by making SUN/Solaris and SAN/DR a better match (it is not a good fit now). SUN should not focus on "educating" the customer into a SUN mindset where there is only NFS and ZFS JBODs. If they continue to do that, they will eventually lose money and be up for grabs by companies that are a lot more customer-oriented. Oh wait...
And are you trying to convince me that you have a fair and balanced view on SAN block storage? Come on.
1. "Encryption is still an unsolved problem." You can encrypt your SAN disk in the OS like every other disk, you can encrypt via the HBA driver, you can encrypt with SAN modules, or you can have the disk system encrypt your LUNs for you. There are a lot of methods for in-flight or data-at-rest encryption. Most customers don't use any of these features, since local SANs tend to be really simple and can be seen as an extension of your SCSI/SAS cable. But the technology exists at FIPS Level 2/3. For long-distance DR, encryption modules exist (FIPS Level 2), or if it is FCIP you can use your normal IP encryption gear. Some systems (Clariion at least) use iSCSI directly for long-distance block replication.
2. SAN boot for small web servers: OK, I agree that SAN boot for a small stateless web server is too high a cost. That is where virtualization comes in: put your small web server on a VM with shared I/O. For a larger web server like the T5440, I think a $1000 HBA is justified cost-wise.
3. "Complexity and troubleshooting": I think SAN boot is really simple with AIX and have had no problems. That IP boot is simpler is not my experience, and you still depend on an external system. I have spent more time troubleshooting Jumpstart than troubleshooting SANs. The problems and complexity of SAN are a myth - but of course that myth is good for storage administrators and their paychecks.
Swap disks in a SAN are not about cost. The reluctance to put swap onto the SAN is mostly that a thrashing system may exhaust the disk cache. But that problem will vanish as more and more disk vendors implement QoS at the LUN level.
4. DR and NFS. So you present a solution with SAN->NFS->client. Didn't you just talk about complexity? I do not understand the problem with the network on the DR side; it is the same problem/solution regardless of your data replication methodology. And from doing a number of DR exercises for customers, I can testify that the boot disk is important even if it holds no application data. A lot of times there are crontab entries, tuning entries, environment settings, passwords etc. that haven't been copied from production to DR (someone forgot, or a script failed), and this creates a lot of problems and wasted time in starting the DR. Syncing the boot disks with the rest helps a lot. There are always some problems establishing DR, but you want as few of them as possible.
5. "SAN boot isn't a property of the OS". Yes it is, because it has to do with device handling, and with how well HBA drivers and failover software play with the OS. DR using SVM is a real nightmare.
6. I have no idea why SUN is so sceptical when it comes to SAN. Almost no impact in the SAN market may be a reason.
#6.1.1 lparvirt on 2009-08-31 12:28
I don't consider your reasons strong ones. At least you didn't convince me to use SAN boot.
0. First of all: don't take my opinion as the opinion of Sun. They might match, but they don't have to. This is my private blog, and I express my own opinion here.
1. I do have a fair and balanced view of SAN block storage. Precisely because it's fair and balanced, I don't think SAN is the solution for every task, and boot and swap disks are among those tasks. I'm not convinced that it's economically feasible to use expensive SAN disks to substitute for relatively cheap disks in the server. On the other hand, I wouldn't use NFS for a large ERP database. Technology offers me a rich toolset, and I try to use an optimal selection of it that matches the customer's mode of operation.
2. First, just to make it clear: I strongly dislike any network boot functionality. In my opinion, a system should come up without any help from an external system. So I wouldn't use SAN boot, just as I wouldn't use NFS boot. But I see the advantages of centralized boot storage, so my question was why customers don't use NFS for this task, at least for small systems. NFS boot is there, it doesn't need an additional fabric, it just uses IP. I can use the personnel from the networking department.
On the other side, I still have to buy the local disks for swap, so no money saved there. As I said before, my preferred technology is local booting, as I would need the local hard disks anyway.
With NFS booting I could at least use those disks for caching, when I don't want 146 GB of swap. That's the same reason why I consider the CacheFS EOL a big mistake, and why I'm currently advocating for a follow-on feature.
3. Obviously I could use virtualization to counter the price argument. But now you have to pay for the virtualization software, and you have an additional level of complexity. Is it really feasible to put two HBAs in a 1 RU Nehalem server, for example? And it doesn't end at the HBA: you have to calculate two switch ports at two switches into the equation, the partial costs of your centralized storage array, and so on. The costs of SAN and virtualization vastly overshadow the costs of the system, not to mention the increased complexity.
4. I was thinking about link encryption, not data encryption. And yes, there are existing encryption solutions, but either they use the same IP mechanisms you would use for NFS or IPsec anyway, or they are proprietary ones like the Paranoia boxes.
5. Sun has no impact on the SAN market? Well, then explain to me why we sell so many SAN storage boxes. Perhaps some people have a more differentiated opinion towards SAN because Sun doesn't have to protect a storage business consisting of large systems. Of course EMC would say "put everything on SAN", because their main business is SAN; there is a business to protect. SAN boot looks like the classic "when all you have is a hammer, every problem looks like a nail".
6. We didn't talk about Jumpstart, we talked about NFS diskless systems, and those are really easy to configure even without additional tools. And honestly ... I don't find any difficult tasks in configuring Jumpstart, and it's even easier with the Jumpstart Enterprise Toolkit.
7. I'm pretty sure that SAN as a separate fabric is on its way down. There are some signs of this development, like Brocade buying Foundry Networks, or Cisco investing heavily in FCoE. And on the way down, the way of storing data will change too, because with a converged network for networking and storage you no longer have the artificial border between two fabrics that forces you to use separate protocols today.
8. In my opinion, neither block-level nor file-level protocols as we know them will be the future of data storage in enterprises. I also don't think that block-level or file-level DR will be the future of DR; both functions will move into the application over time. But we discussed this a while ago, and I just want to refer to the Forrester study for external support. As far as I know, it wasn't a study sponsored by Sun.
9. I just used the example of an NFS server in front of your storage boxes to show that centralized storage DR and NFS servers aren't mutually exclusive. I would use other means of replication to ensure DR, like the AVS functionality, to do the replication in the boxes.
10. Of course you have to document your changes and replay them on another system, but that looks like a brute-force solution to paper over holes in documentation and errors in DR scripting, as you describe them. And what do you do when you have to completely recreate a boot environment from scratch? Looks like an expensive solution for a problem that has to be resolved at its root cause anyway ...
I've used flash archives in conjunction with a Jumpstart server to establish DR procedures and never had a problem with it. And I didn't have to connect each and every device to centralized storage.
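For readers who haven't used it: the flash archive workflow is essentially two steps - a sketch with hypothetical host and path names:

```shell
# 1. On the running system: cut a flash archive, excluding scratch space.
flarcreate -n web01-dr -x /var/tmp /net/jumpserver/export/flar/web01-dr.flar

# 2. In the JumpStart profile of the DR client, point at that archive:
#      install_type      flash_install
#      archive_location  nfs jumpserver:/export/flar/web01-dr.flar
#      partitioning      explicit
```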
11. By the way, you still haven't answered the problem of fault isolation in the case of a failed system. That's my largest hassle with any network boot technology.
12. If you find Sun's SAN support horrible, you've never worked with Linux SAN ... And so far, Solaris in conjunction with MPxIO has worked like a charm for me ...
We had a customer very interested in SAN boot, wanting a third mirror of the root disk on the SAN. But it was such a pain to do, because the hardware had to be exactly alike to allow failover to a different box.
Since then we started just putting zones on the SAN, because that gave us all the ease of failing over between machines without the hardware pain.
And lately I've put my LDoms on the SAN while keeping the boot disks for the control domain internal. Not that it's a preferred situation, but internal disks aren't an option if you want to avoid the single point of failure that the SAS controller is.
Why not NFS? Speed, maybe. It can be slow compared to more direct connections to disks, especially with IPsec.
#8 Tim on 2009-08-31 15:57
Interestingly, there isn't much of a difference, and the differences that remain only matter to a small number of shops out there.
On the other side: there is a rising number of VMware installations that use NFS instead of disks, because the number of LUNs just got outrageous. They wouldn't do that if it were slow, and encryption latency hits all IP protocols.
I guess I'm "doing it wrong" then, because for me (with NFS on Solaris) the performance is just shockingly slow. Maybe I need to spend more time tuning.
#8.1.1 Tim on 2009-08-31 17:50
Just two questions and a hope:
1. Do you use NFS with ZFS?
2. If yes, did you read the ZFS tuning guide at solarisinternals.com?
3. BTW: I hope you didn't compare it with Linux NFS using the async option ...
1. No. With both UFS and VxFS.
3. No, I haven't done comparisons to Linux. Just to local disk and SAN connections.
#8.1.1.1 Tim on 2009-08-31 23:34
1. Did you use jumbo frames?
2. Any options for your NFS mount?
3. Configuration of the server? Config of the client?
If something is "shockingly slow" in a network service, most often you have an issue with the network itself and not with the service.
Try ftp between the servers (from both sides) to rule out network issues. If you have Unix on both sides, you can try an ftp command that bypasses the disks and tests different block sizes. The default read and write size for NFS is probably 16K or 4K, depending on the version:
200 Type set to I.
ftp> put "|dd if=/dev/zero bs=16k count=10000" /dev/null
If the network is slow at all block sizes, you may have the classic full/half duplex mismatch, if your NIC is at 100 Mbit.
If the network is slow only at some block sizes (the higher ones), you may have a firewall between NFS server and client. Test NFS with lower wsize and rsize settings.
Anyway check the network first.
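If a firewall or fragmentation problem is the suspect, a test mount with explicitly reduced transfer sizes is a quick check - server name, export path and sizes below are just example values to experiment with:

```shell
# Solaris client: force small NFS transfer sizes over TCP, rerun the workload.
mount -F nfs -o vers=3,proto=tcp,rsize=8192,wsize=8192 \
    nfsserver:/export/home /mnt
# Linux equivalent:
#   mount -t nfs -o tcp,rsize=8192,wsize=8192 nfsserver:/export/home /mnt
```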
#8.1.1.1.1 lparvirt on 2009-09-01 12:22
Yes, obviously you are right about checking the network first, but with the rise of gigabit Ethernet even in desktop switches, this easy explanation for mediocre performance has become rare. That's at least my experience from five years of this blog and the side-band communication in my mail. And sometimes people react a little annoyed when you ask them about their network.
Speed might really be an issue for most customers, as well as stability. You don't want to put the root disks of a bunch of servers on an already burdened Ethernet infrastructure. From my experience working in an environment with SAN and LAN mixed in equal shares, I always found the SAN to be a lot more reliable and to scale better in high-load situations.
Booting directly from NFS is interesting.
I recently published a patch for Etherboot.org's gPXE network bootloader to boot directly from NFS:
No PXE/TFTP necessary - point it at your kernel + bootarchive/initramfs and go.
My code is experimental, and I'd appreciate feedback if you try it out. I have not tried Solaris, but I would like to see it booting, too.
#10 Stefan on 2009-09-01 12:22
Solaris no longer supports NFS diskless boot, except as part of the JumpStart/AI mechanism.
NFS diskless boot has a number of problems. For example, the client and its /, /usr, /var, ... need to be tightly coupled, so that the server knows to grant that client full privilege -- but servers don't have a way to support this, so they don't. An iSCSI diskless boot has no such problems.
You mean it's no longer supported in OpenSolaris. The support is still included in Solaris 10.
We're trying to avoid an expensive SAN as well... we're using iSCSI for the VMs' system and data disks, and running dedicated gigE cables between the hosts.
If you are going to run dedicated links, like in a SAN, why not just get more gigabit Ethernet ports and run dedicated gigE cables between the iSCSI server and the client? No proprietary, expensive HBAs, and no proprietary, expensive, weird cables or switches. GigE is usually good enough. Most modern NICs have auto-uplink anyway, so you can just plug them together directly.
BTW: I'd love to hear folks' IP addressing ideas for all these dedicated point-to-point Ethernet links
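One common convention (certainly not the only one) is to carve a reserved block into /30 subnets, one per cable - a sketch assuming Linux hosts and that 192.168.255.0/24 is free in your network:

```shell
# Each /30 holds a network address, two usable hosts and a broadcast:
#   link 1: 192.168.255.0/30 -> host A = .1, host B = .2
#   link 2: 192.168.255.4/30 -> host A = .5, host C = .6
ip addr add 192.168.255.1/30 dev eth2   # run on host A, first link
ip addr add 192.168.255.2/30 dev eth2   # run on host B, first link
```

If your stacks support RFC 3021, /31 subnets halve the address burn per link.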
#12 Matt on 2009-09-03 05:27
And besides that, with 10GbE or with InfiniBand you have a fabric that is capable of carrying both kinds of traffic without problems. And InfiniBand isn't that expensive per port; it's even cheaper than FC. Of course, you buy this advantage at the cost of distance.
A bit late for the discussion....
We used to SAN-boot blades with Linux + GFS (a cluster FS).
It's relatively easy when you basically do your zoning so that all nodes can see all presentations from the SAN, and use the SAN to do the access control.
Then you don't need much admin intervention when you want to boot a different WWN with an existing LUN because the original blade broke.
But once you tighten the FC zones (yeah, another "zone"), it becomes much more work, because you have to adjust the configuration on the FC switches, too.
I wanted to try running VMware ESXi via iSCSI from a ZFS filer (to try out the iSCSI part of Solaris) - only to find that the majority of people seem to use NFS. Seems to work well enough.
One reason I can see shops going SAN boot is that they are weighing the costs of SAN boot (FC ports) against the license costs of bare-metal recovery for their backup software.
(Few shops think about bare-metal recovery - and of those who do, most stop thinking once they find out that their backup software vendor will basically charge them again for the privilege...)
With RH's GFS, you could actually boot a large number of blades from a single image (using special symlinks to branch to individualized files like ifcfg-ethX etc.).
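If I remember the GFS feature correctly, those special symlinks are the context-dependent path names (CDPN), where a link target containing @hostname is resolved per node - a sketch with made-up blade and path names:

```shell
# Per-host branches under one shared image (names are illustrative):
mkdir -p /gfs/perhost/blade01 /gfs/perhost/blade02
# GFS resolves "@hostname" to the local node's name, so each blade
# follows the same link into its own directory:
ln -s perhost/@hostname /gfs/hostconf
ln -s /gfs/hostconf/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth0
```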
But with totally diskless blades, you were also swapping to the SAN, which can kill its IOPS...
FreeBSD has a special configuration-framework for its NFS-boot feature so that you can do this similarly (without a SAN, of course).
I don't know about Solaris, but I suspect it's where the idea came from