Shared sZIL without an external storage box
This was yet another thought I had while waiting for the tea water to boil: there is a challenge when deploying an SSD as a separate ZIL (sZIL) in a cluster. Let's assume you have a cluster for an NFS service on ZFS, for example. Without a separate ZIL this is easy. You use shared storage: connect your storage to both cluster nodes, configure the system accordingly, and the storage follows the service.

But there is a challenge with an sZIL: whereas a zpool is still consistent even without the data on the sZIL, you need both the storage with the pool and the storage with the sZIL to get the state of the storage at the point in time of the last completed synchronous write. So you have to ensure that the sZIL is available on the other node as well. The classic way to do this: putting an SSD into the SAN (you need one with FC or SAS connectivity) or into a storage array. But is this really necessary? Could you put the SSD for the sZIL into the servers and still ensure the failover capability?

I had two ideas for doing so. Both leverage the fact that you have relatively cheap high-speed interconnects like InfiniBand or 10GbE today, and both leverage the point that we have target implementations for SCSI in Solaris. As an SSD in the SAN would have switch and wire latencies as well, it shouldn't make such a difference latency-wise from the physical perspective. Perhaps it's different for the latency introduced by Solaris, but on the other side SAS expanders in the storage boxes or the RAID controller CPUs introduce latency, too. In these thoughts I'm just focused on two-node clusters (it should be possible with more nodes, but the configuration will be complex). Both ideas have in common that the sZIL SSDs are in the servers, and not in a third box.
- Mirroring the sZIL using RAID-1: In ZFS you can mirror the sZIL on multiple devices. Now let's assume we split our SSD into two halves: one for the local sZIL and one for a remote sZIL used by the other node of the cluster. The remote sZIL half is exported over the network via iSCSI, FCoE, iSER, SRP or FC (at best with a direct connection without switches, when possible, to reduce latencies). You need an interconnect anyway, so you could make this interconnect as fast and as low-latency as possible. You use this half on one node to mirror the sZIL of the other cluster node. Thus when the service does a failover to the other node, that node still has a mirror of the sZIL locally available.
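Just to make the idea a bit more concrete, here is a rough sketch of how this could look with COMSTAR as the iSCSI target. I haven't tested this; all device names, the GUID, the IQN and the pool name `tank` are just placeholders for illustration:

```shell
# --- On node A: export the "remote" half of the SSD via COMSTAR iSCSI ---
# Assumption: the SSD is sliced in two, s0 for the local sZIL and
# s1 for the half offered to the other node. Device names are hypothetical.
svcadm enable stmf
sbdadm create-lu /dev/rdsk/c2t1d0s1           # make a logical unit from slice 1
stmfadm add-view 600144F0AAAAAAAAAAAAAAAAAAAA # use the GUID printed by sbdadm
svcadm enable -r svc:/network/iscsi/target:default
itadm create-target                           # create an iSCSI target port

# --- On node B: import node A's half and build the mirrored log ---
iscsiadm modify discovery --static enable
iscsiadm add static-config iqn.2009-06.org.example:nodea-szil,192.168.10.1:3260
devfsadm -i iscsi                             # create device nodes for the LUN

# Mirror the sZIL across the local slice and the imported remote LUN:
zpool add tank log mirror c2t1d0s0 c3t600144F0AAAAAAAAAAAAAAAAAAAAd0
```

The same would have to be done in the opposite direction, of course, so that each node mirrors its sZIL onto the other node's exported half.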
- Mirroring the sZIL using the Availability Suite: Another, really obvious idea would be to configure a synchronous remote mirror of the sZIL with the help of AVS instead of the ZFS mirroring. That way it's guaranteed that the sZIL on one cluster node is equal to the sZIL on the other cluster node. Otherwise this idea is similar to the RAID-1 mirror idea, although the code path should be much longer compared to the RAID-1 solution.
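For this variant, a sketch with the Remote Mirror (SNDR) component of AVS could look like this. Again untested; hostnames, device paths and the bitmap volumes are placeholders, and `sndradm -e` has to be run on both hosts:

```shell
# Enable a synchronous remote mirror of the sZIL slice from nodea to nodeb.
# SNDR needs a small bitmap volume per replicated device to track changes.
sndradm -e nodea /dev/rdsk/c2t1d0s0 /dev/rdsk/c2t2d0s0 \
           nodeb /dev/rdsk/c2t1d0s0 /dev/rdsk/c2t2d0s0 ip sync

# Start the full synchronization; the set is named by the secondary side:
sndradm -m nodeb:/dev/rdsk/c2t1d0s0
```

With `sync` every write to the sZIL is acknowledged only after it has reached the secondary node, which is exactly what you need here, but it's also where the longer code path hurts.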
It's just an idea, I haven't checked it so far. I'm sure that there are some pitfalls, but no unresolvable ones. What do you think about it?