Disclaimer: The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
About ZFS and high-end storage
Wednesday, November 18, 2009
Comments
I had to grin when I read this article, because that's exactly the trouble we had over the last few weeks/months: HDS Tagmastores, and corrupted or unavailable ZFS pools in the case of corrupted Informix chunks. Switching to raw devices, switching back to ZFS, changing block sizes from 2k (the Informix standard block size) to 32k. Resizing the HDS setup.
Also a lot of database restores. A scenario I wouldn't wish on my worst enemies.
We are using Sun/HDS 9985s, with SVM mirrors across multiple 9985s for our Informix databases; all LUNs are used as raw metasets. We were just researching what performance or other problems we would see if we were to start using ZFS zvols. We would really be interested in your experience with ZFS/zvols and Informix databases, if you don't mind sharing.
We also had some discussions with our Sun sales rep (we need another storage array) about the benefits of using a 9985 versus a JBOD/ZFS/7410/etc. combination. We didn't get a clear explanation; a good one would really help.
There are good reasons for high-end storage, and only some of them are technical. Thus, without knowing your architecture and business needs, nobody can really answer this question. This is also why I don't want to interfere with my colleagues by giving a simple answer here. Perhaps you should set up a meeting with a local storage specialist from Sun to discuss your needs. They can certainly help you.
Total disaster... No, the issue is not solved yet. We changed the whole LUN layout, striped our ZFS pools over the RAID groups and used raw filesystems under ZFS control. But from my point of view the issue is not completely solved. The I/O problems are ongoing, though things are slowly getting better. We thought about caching with SSDs or putting a DB logspace on a separate JBOD. A lot of ideas, effort, commitment and brainstorming together with Sun.
Yeah, layering ZFS on top of other RAIDs seems like it's a disadvantage.
But the one advantage is that you're at least building out a ZFS environment that can be easily migrated to a newer, cheaper disk platform down the road. You can migrate filesystems one by one using ZFS internal mechanisms instead of relying on rsync.
Interesting post. We run ZFS on top of 9990V arrays. While I did look at using RAIDZ to allow ZFS to repair data corruption, we had to also look at the cost of that very expensive parity LUN and the performance impact of RAIDZ. I have yet to see ZFS report data corruption on LUNs from the 9990V.
We use multiple LUNs across RGs from the 9990Vs to get performance, but they are merely pooled and not RAIDZ. I would love to use RAIDZ on top of the 9990Vs, but 9990V disk is very expensive. There is definitely a case for ZFS+cheap disk, but there are still reasons to go with high-end storage arrays.
We use a somewhat in-between solution for our Sun StorageTek 2530 DAS: we have multiple RAID-1 LUNs, each consisting of two full hard drives, and those LUNs are then striped together (RAID-0) in ZFS, effectively giving us a RAID 1+0 array (a stripe across mirrors).
The reason we have the RAID-1 on the StorageTek and not in ZFS is partly the thought of having to send all written data twice over the SAS cable to the StorageTek, even though that may not be the bottleneck. All in all, I think we might have been better off just buying a JBOD disk array instead of the 2530, to ease management, since it would all be done in ZFS.
Nice article. A little correction: even in the single-LUN case, ZFS will usually be able to detect and repair corrupted data if it happens to be a metadata block, thanks to the multiple copies of metadata blocks that ZFS keeps regardless of whether RAID is done in ZFS. Additionally, it makes sense to give ZFS 3 LUNs and do the striping in ZFS. In that case ZFS will try to spread the copies of each metadata block across different LUNs. ZFS will also be able to issue/queue more I/Os to the pool in total (the default is 35 per LUN, and it was recently lowered to 15, IIRC), which might be beneficial on large disk arrays.
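To put rough numbers on the queueing remark above, here is a back-of-the-envelope sketch. It assumes the pool-wide limit is simply the per-LUN queue depth times the number of LUNs, which is a simplification; the function name is made up for illustration, and the depths 35 and 15 are the values quoted in the comment, not verified defaults.

```python
def max_outstanding_ios(lun_count: int, per_lun_queue_depth: int) -> int:
    """Upper bound on I/Os queued to the pool in this simplified model."""
    if lun_count < 1 or per_lun_queue_depth < 1:
        raise ValueError("lun_count and per_lun_queue_depth must be >= 1")
    return lun_count * per_lun_queue_depth


# Compare one big LUN against three LUNs striped in ZFS,
# for both per-LUN depths mentioned in the comment.
for luns in (1, 3):
    for depth in (35, 15):
        total = max_outstanding_ios(luns, depth)
        print(f"{luns} LUN(s) x depth {depth} = {total} outstanding I/Os")
```

In this model, three LUNs at the lower depth of 15 can still hold more I/Os in flight than a single LUN at 35.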
Here's my experience: In 2005/2006, soon after I got my hands on Sol10, I implemented ZFS on top of LUNs from an EMC array. During the testing phase I knowingly ignored an HBA/driver/switch/EMC mismatch. Things worked fine and soon I forgot about the mismatch. I also implemented ZFS without RAIDZ or RAID-1, since I figured the EMC LUNs were already protected.
After a few months ZFS started complaining loudly about corruption. Working with the storage team, we determined that a switch port was injecting corruption into the data stream. The EMC array, to its credit, faithfully wrote and returned the corrupted bits correctly! Since this was a test system, it wasn't a big problem. The moral of the story is exactly what Joerg says: we are only as safe as the least reliable component in the system. I'm a firm believer in JBODs with ZFS protection. In addition to the protection, I also get the I (inexpensive/independent) back in RAID. Do the high-end (as measured in $) storage devices have a place in the datacenter? Yes, but not for every workload. A large percentage of DC storage needs can be met by ZFS on JBOD (FishWorks, anybody?).
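The story above can be sketched as a toy model: the host computes a checksum before the data leaves, a faulty path flips a bit, the array stores exactly what it received, and the mismatch shows up on read. This is only an illustration of end-to-end checksumming, not ZFS's on-disk format; SHA-256 stands in for whatever checksum the pool uses, and all class and function names here are hypothetical.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    # SHA-256 is an illustrative choice for this sketch.
    return hashlib.sha256(block).digest()


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of block with a single bit inverted."""
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


class FaithfulArray:
    """Stores and returns exactly the bytes it was handed."""

    def __init__(self):
        self.blocks = {}

    def write(self, addr: int, data: bytes) -> None:
        self.blocks[addr] = data

    def read(self, addr: int) -> bytes:
        return self.blocks[addr]


def write_via_path(array, addr, data, corrupt_bit=None):
    """Checksum on the host, then send the data through a possibly faulty path."""
    expected = checksum(data)
    on_wire = data if corrupt_bit is None else flip_bit(data, corrupt_bit)
    array.write(addr, on_wire)
    return expected  # kept by the host, separate from the data block


def verified_read(array, addr, expected):
    """Read the block back and report whether it matches the host's checksum."""
    data = array.read(addr)
    return data, checksum(data) == expected
```

The array never misbehaves in this model, yet the host still notices the damage, because the checksum was taken before the data entered the faulty path.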
Joerg:
Don't forget about: # zfs set copies=2 zwimming (or copies=3) for single SAN-based LUNs. ZFS will repair bad blocks from the copies. This is true for laptops, SSDs, flash drives, etc.: places where you wouldn't expect or fund RAID as an approach due to bus design limits, cost, etc. Checking those checksums and having another copy is key to the ZFS repair process/protection scheme. Richard Elling wrote about this idea in 2007: http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection So, even if you never expect the enterprise SAN LUNs to fail, adding ZFS copies helps to limit silent data corruption events at the time they appear. Now, if all copies were equally subject to the source failure, they would all have bad data and fail their checksums, and you'd have a seriously broken array.
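As a rough illustration of the repair idea described above, here is a toy model in which each block is stored several times and a read falls back to any replica whose checksum still matches, rewriting the bad ones. It is a sketch of the principle only; the class name, method names and use of SHA-256 are assumptions for the example and do not reflect how ZFS lays out or heals data internally.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class ToyDataset:
    """Keeps `copies` replicas of every block plus one checksum per block."""

    def __init__(self, copies: int = 1):
        if copies < 1:
            raise ValueError("copies must be >= 1")
        self.copies = copies
        self.replicas = {}   # block name -> list of bytearray replicas
        self.checksums = {}  # block name -> checksum of the original data

    def write(self, name: str, data: bytes) -> None:
        self.replicas[name] = [bytearray(data) for _ in range(self.copies)]
        self.checksums[name] = checksum(data)

    def corrupt(self, name: str, replica: int, offset: int = 0) -> None:
        """Simulate bit rot by inverting one byte of one replica."""
        self.replicas[name][replica][offset] ^= 0xFF

    def read(self, name: str) -> bytes:
        expected = self.checksums[name]
        replicas = self.replicas[name]
        good = next(
            (bytes(r) for r in replicas if checksum(bytes(r)) == expected),
            None,
        )
        if good is None:
            # Every replica is damaged: detection works, repair does not.
            raise OSError(f"no replica of {name!r} matches its checksum")
        # Rewrite any replica that failed verification from the good one.
        for i, r in enumerate(replicas):
            if checksum(bytes(r)) != expected:
                replicas[i] = bytearray(good)
        return good
```

With `ToyDataset(copies=2)`, corrupting one replica still yields the original data and the damaged replica is rewritten; with `copies=1`, the same corruption is detected but cannot be repaired, and damaging both replicas gives the "seriously broken" case.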
Hello Dave:
I'm aware of this method ... but as it doubles the space consumption just as RAID1 does, the customer has to make exactly the same considerations. Of course there is another advantage: you can control the "RAID1-ness" of a filesystem at a finer granularity ... But in the end you double the consumption of the more expensive storage.
So maybe this is a dumb question, but how does Solaris handle possible data corruption when using a single disk with zfs root?
From the article: "by using just a single LUN you give up an essential feature of ZFS: The auto repair capability"
I am still having difficulty understanding the issue, possibly due to semantics. When you say "single LUN", do you explicitly mean a RAID0 concatenated LUN? Does this concern apply to a single RAIDZ or RAIDZ2 LUN? I thought a single RAIDZ LUN has enough parity to reconstruct a whole drive, never mind a little bit rot... and RAIDZ2 has enough parity to reconstruct 2 whole drives, never mind a little bit rot.
From the comment (regarding copies=2 to allow repairing of bad blocks): "But at the end you double the consumption of more expensive storage"
How about using compression in conjunction with copies? This seems like a reasonable way to mitigate space consumption while protecting against bit rot. (I am curious whether someone has tested ZFS copies with ZFS dedup... I could imagine a bug lurking in there...)
By "single LUN" I meant: you do the RAID in a storage array and map just a single LUN out of it. Thus, from the perspective of ZFS, there is no redundancy. Of course there is redundancy in the box, but ZFS can't access it directly to check whether the redundant data has the correct checksum.
When you do RAID1 or RAIDZ in ZFS, you have the redundancy that ZFS needs to repair the data. Compression would only make up for the extra space when your data shrinks to half its size after compression, as the copies feature writes the data twice to separate locations.
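The space argument above comes down to simple arithmetic, sketched here under the simplifying assumption that physical usage is the logical size times the number of copies, divided by the compression ratio (metadata and allocation overhead are ignored). The function name is hypothetical.

```python
def physical_usage(logical_gb: float, copies: int = 1,
                   compress_ratio: float = 1.0) -> float:
    """Approximate space consumed on disk, in the same unit as logical_gb."""
    if copies < 1:
        raise ValueError("copies must be >= 1")
    if compress_ratio <= 0:
        raise ValueError("compress_ratio must be positive")
    return logical_gb * copies / compress_ratio


# 100 GB of data stored with copies=2 at different compression ratios.
for ratio in (1.0, 1.5, 2.0):
    used = physical_usage(100, copies=2, compress_ratio=ratio)
    print(f"ratio {ratio:.1f}: {used:.1f} GB on disk")
```

Only at a ratio of 2.0 does copies=2 cost the same as storing the uncompressed data once; at lower ratios it still consumes more.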
Thanks for helping me understand...
Silent block corruption can be corrected on a single LUN if there is redundancy in the LUN (i.e. RAID1, RAIDZ, RAIDZ2, etc.). Silent block corruption cannot be corrected on a single LUN if there is no redundancy (i.e. RAID0). In a single-drive scenario (i.e. USB flash stick, laptop hard drive, laptop flash disk, etc.), bit rot can be corrected using "copies=2" (unlike in any other OS), at the cost of space, but this space concern can be mitigated using compression. Now I have the full picture!!! Thanks!!!
I'll take one more try to get you to correct something inaccurate in your writing. You "know about copies" but you still wrote:
"But by using just a single LUN you give up an essential feature of ZFS: The auto repair capability. It simply doesn't have the data to copy or recalculate the correct data. ... But using just a single LUN doesn't allow this." But copies=2 does allow a single LUN to repair a corrupt block under ZFS. You can add the objection about doubling the amount of disk, but you should edit the post since it's not accurate as written. Propagating the idea that ZFS must have multiple disks to be of value stops people from wanting it for laptops, USB drives, etc. Teach them to think of copies as extra protection against silent data corruption for important data. NTFS, HFS, ext3, etc. don't offer the option of extra copies for important file systems (I think). Copies can be enabled at the file system level rather than at the pool level, so the user has control over the decision to use it where it's needed. I wonder if copies has any performance benefits for reads... statistically reducing overall read head movement. The essence of data protection always boils down to having copies somewhere. Lots of copies...
This is not a new story... Oracle uses checksums to verify data. Some high-end boxes are (or were) able to verify the checksums before writing to physical disks and reject the write operation if they are wrong. Such cooperation between server and storage guarantees that information is stored exactly as it was written by the server. This is a more proactive approach, which minimizes the need for recovery.
That is interesting. Those high-end boxes: how much do they cost, and can you provide links or model names so I can look them up? Are they boxes costing 100,000 USD and up? Compare that to open source ZFS and no special hardware.
It seems great that they reject write operations if the data does not validate against the checksum. But what happens if the block changes after it has been written? Is it verified before being read?
There are two types of checksums: those generated by Oracle and those generated by the storage. Some storage models can verify the Oracle checksums before writing the data, to ensure that the data was delivered correctly.
Storage checksums are used to verify that the data is stored correctly. I don't know whether the Oracle checksums are verified by the storage before the data is sent to the servers, but Oracle will do that anyway. A list of validated storage models is available at: http://www.oracle.com/technology/deploy/availability/htdocs/vendors_hard.html
I have yet to find an article discussing ZFS copies=2 that explains what happens if I have a zpool spread across multiple drives with copies=2, a drive fails and is replaced. Now you have files that have only 1 copy. Do new copies of those files get re-written to the newly replaced drive?
Hi Elliot,
I am sure someone will correct me if I mistype here. If you have 2 or more drives, you are using "copies" and a drive fails, you can only recover if you have mirroring, RAIDZ, or some other type of RAID with redundancy. If you are running with copies={2 or more} in a two-way mirrored environment, you actually have 4 copies of your data (2 copies of your data files on each drive). The loss of a single drive in a two-way mirror with copies=2 will not reduce you to 1 copy; you will still have 2 copies of all your data files on the working disk of the mirror, but your extra disk and the two additional copies of your data have been lost. When you replace that disk, the resilvering process will reconstruct the other drive, restoring the additional 2 copies and giving you a total of 4 data copies again. The "copies" parameter does not protect you from drive failure, but from "bit rot" and other types of data corruption (sun spots, degrading media, etc.). This is very useful on single-drive systems. You can use the "copies" parameter on a pool striped across 2 or more drives, but if you configure it as a pure RAID 0 stripe with "copies=2", the copies will not protect you against drive failure, only against "bit rot" or internal data corruption. In this environment, if you lose a single drive in a 3-drive RAID 0 striped pool, you will lose the entire pool. How is that for a stab at it?
That's not exactly true. Even when striping across two disk drives with copies set to 2, ZFS will try to write each copy to a different disk, so if you lose a disk it should still protect you like mirroring. However, there is no guarantee that ZFS will be able to place each copy on a different disk.
That said, in a two-disk configuration simple mirroring would be better and would guarantee access to all data if one disk fails. copies is an attempt to provide mirror-*like* capabilities for a selected dataset while the pool is non-redundant. It also has other uses, but one needs to be very careful about expectations.
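The difference between "tries to" and "guarantees" above can be made concrete with a small sketch. It takes a hypothetical placement map (which disks hold the copies of each block) and reports which blocks still have a readable copy after one disk is gone. The placement map and function name are invented for illustration; the sketch models only block placement and says nothing about whether a real pool remains importable after losing a device.

```python
def surviving_blocks(placements: dict, failed_disk: int) -> set:
    """Blocks that keep at least one copy on a disk other than failed_disk."""
    return {
        block
        for block, disks in placements.items()
        if any(disk != failed_disk for disk in disks)
    }


# Hypothetical two-disk stripe with copies=2:
# block "a" was spread over both disks, "b" and "c" were not.
placements = {
    "a": (0, 1),
    "b": (0, 0),
    "c": (1, 1),
}

for failed in (0, 1):
    kept = sorted(surviving_blocks(placements, failed))
    lost = sorted(set(placements) - set(kept))
    print(f"disk {failed} fails: kept {kept}, lost {lost}")
```

A true mirror corresponds to every block having the placement `(0, 1)`, which is why it gives a guarantee where copies only gives a best effort.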
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Germany License.