Disclaimer: The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
No, ZFS really doesn't need a fsck
Wednesday, November 4, 2009
Comments
ZFS doesn't need a fsck tool that checks the entire filesystem... well, it has one (IMO): "zpool scrub". However, it should have a way to help repair a pool that is damaged or for some reason won't mount, at least enough to make it attachable so that intact data can be salvaged, even if only in a read-only state.
For those who say ZFS doesn't need repair tools, I point you to: http://opensolaris.org/jive/thread.jspa?messageID=289537 and to the hundreds of other cases where people have lost their entire pool for one reason or another. Most of the data should be intact, but the drives ignored CACHE FLUSH (or something like that), a crash occurred, and now ZFS panics or won't re-attach the pool due to inconsistencies. The inconsistency may be the user's fault, but ZFS is not off the hook. There should be easy-to-use recovery tools, just like there are for other filesystems.
Jimmy, in the thread you linked to, you'll notice the final solution is exactly what was discussed in this blog entry: the pool was rolled back to the last usable uberblock and only the last few transactions were lost.
Have you actually read Joerg's blog entry that you are commenting on? He specifically addresses the issue, which has been fixed in snv_126.
If you read the link you provided more carefully, you will learn that the user was able to import his pool and access its data thanks to help from Victor. The new functionality implemented in ZFS in build 126 basically provides what Victor did manually: importing a pool by using a previous uberblock. For details, please re-read the post you are commenting on...
Jimmy, I think the solution that you ask for has already been put back, and is on the way.
The repair method that Jeff, Mike, and others mention on the discussion thread that you linked to is the same idea that is now the pool recovery option in PSARC 2009/479. -cheers, CSB
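The "roll back to the last usable uberblock" idea that several commenters refer to can be sketched in a few lines. This is only a toy model, not the ZFS on-disk format or the PSARC 2009/479 implementation: the `Uberblock` class, the `newest_usable` helper, and the use of SHA-256 over a single byte string as a stand-in for a whole block tree are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def checksum(payload: bytes) -> str:
    """Stand-in checksum; the real filesystem has its own checksum functions."""
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Uberblock:
    txg: int            # transaction group number
    root: bytes         # hypothetical stand-in for the block tree it points to
    root_checksum: str  # checksum of that tree, recorded when the txg was committed


def make_uberblock(txg: int, root: bytes) -> Uberblock:
    return Uberblock(txg, root, checksum(root))


def newest_usable(ring: List[Uberblock]) -> Optional[Uberblock]:
    """Walk from the newest txg backwards and return the first uberblock
    whose tree still matches the checksum recorded for it."""
    for ub in sorted(ring, key=lambda u: u.txg, reverse=True):
        if checksum(ub.root) == ub.root_checksum:
            return ub
    return None


# Three committed transaction groups; the newest one did not reach disk intact.
ring = [make_uberblock(t, f"tree@{t}".encode()) for t in (100, 101, 102)]
ring[2].root = b"garbage"

chosen = newest_usable(ring)
print(chosen.txg if chosen else "no usable uberblock")
```

In this model the pool comes back in the state of txg 101, and only the work of the damaged txg 102 is lost, which matches the outcome described in the linked thread.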
Yes, that is all well and good, but what about memory corruption (to which ECC memory is not immune) and bugs in ZFS, drivers or other parts of the kernel that cause memory corruption, or more specifically, corruption of metadata?
That is something that can pass undetected, be safely written to disk (with a correct checksum and all), and be very long-lived, to the point where you can't roll back to a previous uberblock. In fact, this has already happened, both as memory corruption caused by hardware and as metadata corruption caused by bugs in ZFS. That is something that an fsck-like tool could fix, and it would save users from having to restore from tape.
I'm sorry, but what you are describing is utter nonsense. No fsck tool for any filesystem will be able to fix data that was already corrupted in memory before it was written to disk.
And by the nature of the way fsck works in most filesystems, it doesn't even try to fix your data; rather, it tries to guess and fix the metadata, with mixed results. ZFS does much more than that.
No, it is not "utter nonsense".
Read my post again. I didn't say fsck fixes data corruption, I said it fixes corruption of metadata (yes, metadata can be fixed/cleared and even reconstructed by fsck, as in the case of bitmap/spacemap corruption).
Please show me the filesystem and OS that can handle silent memory corruption, lying I/O controllers, meteor strike, &c.
ZFS does everything that it can to preserve integrity, however no software can be clever enough to get around the fundamental problem of "garbage in, garbage out". Did the hardware lie? Then GIGO applies, whether it's a bad disk controller, or an uncorrected, undetected memory fault. As for "bugs in ZFS", please be more specific as to which bugs you think are corrupting data. Same goes for bugs in drivers, and the kernel -- give CR numbers, please. ZFS tries to be robust in the face of the most likely threats. Now that it's been around for a while, the implementors are working on some less likely stuff. Seems like the right priority to me.
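The "garbage in, garbage out" point can be made concrete with a small sketch. It assumes nothing about ZFS internals; `write_block` and `verify` are hypothetical helpers, and SHA-256 merely stands in for whatever checksum a filesystem computes at write time.

```python
import hashlib
from typing import Tuple


def write_block(data: bytes) -> Tuple[bytes, str]:
    """Checksum whatever buffer is handed over, as a filesystem would at write time."""
    return data, hashlib.sha256(data).hexdigest()


def verify(data: bytes, digest: str) -> bool:
    return hashlib.sha256(data).hexdigest() == digest


intended = b"balance=1000"

# Case 1: a bit flips in RAM *before* the checksum is computed.
in_memory = bytearray(intended)
in_memory[-1] ^= 0x01
stored, digest = write_block(bytes(in_memory))
passes_verification = verify(stored, digest)  # wrong data, matching checksum

# Case 2: a bit flips on disk *after* the checksum was computed.
on_disk = bytearray(stored)
on_disk[0] ^= 0x01
detected = not verify(bytes(on_disk), digest)

print(stored != intended, passes_verification, detected)
```

The first case is the one both sides of this argument are talking about: the checksum faithfully protects the wrong bytes, so neither a scrub nor a checksum-based repair can know what the value ought to have been. The second case is the kind of damage a checksum does catch.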
I think you don't know how an fsck tool works.
An fsck tool works by traversing the entire filesystem metadata and trying to spot inconsistencies. Once it spots an inconsistency (at least in the case of ext3/4's fsck), it tries to fix it in the way that makes the most sense, by assigning a score to each piece of inconsistent information and trying to see which one is more likely to be correct. Of course, that is a heuristic and is not guaranteed to fix the problem in the correct way, but at least it brings the filesystem metadata back to a consistent state, with (in most cases) a minimal amount of lost data. Many problems are very easy to fix, for example orphaned files and bitmap/spacemap corruption (e.g. doubly allocated/crosslinked blocks, leaked blocks, etc.); there are tons of things fsck fixes. All of these problems can also happen with ZFS; it is not immune to hardware memory corruption and bugs. And at least with ext3/ext4, you can bring your filesystem back to a consistent state (read: a state in which the implementation doesn't refuse to import/mount and which doesn't cause panics). See CR 6458218 for a bug in ZFS that caused metadata corruption; feel free to search for others (which I know there are), because I have more interesting things to do :p
Anonymous,
Yes, fsck can return an inconsistent filesystem to a consistent state. But what do those consistent sectors contain now? Is the data the same as before? Would the FS know the difference? fsck cannot always guarantee that the consistent FS retains the data integrity that it had before. So fsck can succeed, and still leave you with missing or corrupted data. I've seen it a few times over the years, broken fragments of data in lost+found. On the other hand -- given non-lying hardware -- ZFS is not vulnerable to this same variety of inconsistency. Is it free from any vulnerability or bug? No, of course not. But it is designed to avoid this type of inconsistency. The idea with this recovery utility is that you can roll back a few transaction groups, and find something guaranteed to be consistent, both the FS structure and the referenced data. So I disagree with your statement that ZFS is vulnerable to the same sort of corruption that can be fixed with fsck on other filesystems. Again if you have lying hardware, then all bets are off, regardless of OS or FS. No FS or fsck in the world can guess as to what a bit ought to have been, if it was written as the result of an undetected memory error, or a broken I/O controller. Adding fsck to ZFS wouldn't help this, either. As for CR 6634517, this is a dup of 6458218, which was reported back in 2006, and then fixed. Nobody should be affected by it these days. If you're having trouble with this one, then perhaps you're using a very old OpenSolaris build? I would suggest that it's time to patch or upgrade, like you would with any OS or FS. At this point, I think that everyone here has done well to explain the difference between ZFS and other filesystems, when it comes to maintaining consistency and data integrity.
You're missing the point completely.
Fsck can succeed without leaving you with missing or corrupt data. Corrupted free bitmaps/spacemaps can be fully reconstructed, for example. ZFS is not immune to spacemap corruption (I pointed you to one of the bugs). Yes, the bug is fixed now, but there were people who had to restore from tape because of that bug. It could have been prevented if ZFS had an fsck tool. The exact same thing can happen even if ZFS didn't have bugs. Hardware can corrupt a spacemap in memory. Fsck could fix it and prevent users from restoring from tape. I'm not going to repeat this again. Also, even if fsck caused you to lose some data, at least you wouldn't lose the entire pool. And in many cases you could easily find out which files had been corrupted. In fact, fsck even tells you which files had corrupted metadata. If you think a bit, you'd realize that typically when metadata is corrupted (for whatever reason), the damage is very localized and doesn't even affect 99% of the filesystem. Also, zip files/tarballs have embedded checksums, for example. git and mercurial have a command to verify the consistency of the workspaces, with the help of their checksums. I suppose databases could do the same (if they don't already). What I'm trying to say is that fsck gives you choices that you don't have without it. Yes, it doesn't guarantee that all the data is 100% intact, but if you are fully aware of its limitations, you could use it to your advantage and save a lot of time. It doesn't mean that you have to use it, but at least give others a chance not to lose their time. And guess what, even with ZFS you don't know whether your data is corrupted or not. Again, hardware memory corruption and bugs in ZFS/drivers/the OS can cause your data to be corrupted; it could be a 1-in-a-billion event and you wouldn't know it. Hardware is not perfect, software is not perfect, ZFS is not magical, and chances are that none of these will ever change.
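To illustrate the claim that a damaged allocation map can be reconstructed from the rest of the metadata, here is a minimal sketch of the generic fsck-style technique. It is not how ZFS space maps are stored or how any particular fsck is implemented; the `files` mapping, the block numbers, and the `rebuild_allocation` helper are made up for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rebuild_allocation(files: Dict[str, List[int]],
                       total_blocks: int) -> Tuple[List[bool], List[int]]:
    """Rebuild the allocation bitmap purely from the block lists that the
    file metadata references, and report blocks claimed more than once."""
    refs = Counter(b for blocks in files.values() for b in blocks)
    allocated = [b in refs for b in range(total_blocks)]
    crosslinked = sorted(b for b, n in refs.items() if n > 1)
    return allocated, crosslinked


# Hypothetical metadata: which blocks each file claims to own.
files = {"a.txt": [0, 1], "b.txt": [3], "c.txt": [1, 4]}

# Hypothetical damaged on-disk bitmap (True = marked allocated).
stored_bitmap = [True, True, True, False, True, False]

rebuilt, crosslinked = rebuild_allocation(files, len(stored_bitmap))

# Marked allocated but referenced by nothing: leaked space.
leaked = [b for b, used in enumerate(stored_bitmap) if used and not rebuilt[b]]
# Referenced by a file but marked free: would be overwritten by a later allocation.
wrongly_free = [b for b, used in enumerate(stored_bitmap) if rebuilt[b] and not used]

print(rebuilt, crosslinked, leaked, wrongly_free)
```

The rebuilt bitmap is fully determined by the file metadata, which is the commenter's point. The crosslinked block is where the heuristics come in: the tool can tell that two files claim block 1, but not which of them is right.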
Friend, I tried to think of another way to present the differences between a transactional COW filesystem and other (more traditional) types of FS, but words fail me, so I'll stop now.
I'll leave you with this constructive thought -- you seem passionate about the possibilities of fsck, and about the shortcomings of ZFS. That's just fine, as fortunately for us all, ZFS is open-source. That being the case, I encourage you to put together a prototype of a ZFS fsck tool as you describe. This would be something fundamentally different than txg rollback. If you can make ZFS better, i.e. allow it to maintain or recover data integrity whereas the current methods cannot, then I'm sure that the ZFS team would gladly welcome your contribution, and give due credit. Beyond that, I think you will eventually discover that there is no programmatic way to get around the universal phenomenon of GIGO. It can never be eliminated entirely. Thanks all... see you next topic. =-) -cheers, CSB
I think Anonymous is thinking of not-yet-known bugs that may be in ZFS. And if memory corruption happens, he would like to recover as much of the remaining data as possible.
This seems reasonable to me, but you can't predict what a hidden error looks like. What the ZFS team could do is provide a tool that gives you the files on disk, without the path, if this is possible. Making predictions about what can happen is pretty useless, like making precise plans for "The Big One", which may hit earth in the future. You don't know when and where, so it's nice to talk about it while having a beer, but there is no urge to start working on it in the near future.
You can get the files ... you can do this with zdb ... nevertheless it isn't an easy task ...
This whole debate seems rather political to me, with the fsck/bazaar style mentality opposed to the zfs-integrity/cathedral one.
I'm no fs designer, but, as far as I can understand, one side says "shit happens, so you'd better be prepared to recover from a bad situation, whatever it is, trying to salvage as much data as you can, and leave it up to the user to sort out whether the recovered stuff is useful". The other side says "we build so much checking into the system that data is always good, and in the extremely unlikely case that it is bad, we do not want to spend time trying to fix it, but rather go back to a known state where it was good". When data loss happened to me, it was because of cheap hardware and feeding it too much current from the wrong adapter. After the disk's onboard chips were fried, the only option I had left (not being willing to pay 500-plus EUR for professional data recovery) was to recover the data from another disk from which it had previously been deleted on purpose. I think in such a situation the 'recover as much as you can' line of thinking would bear more fruit, but undelete is a different thing altogether from fsck... would ZFS be better/worse/equal in such a case?
Well ... I don't think of the two opinions as a cathedral/bazaar distinction. It's more about data availability vs. data integrity thinking. Linux leans towards availability, especially when you see performance as a property of availability. Solaris leans towards data integrity, even when it costs performance. Linux doesn't use write barriers by default, thus it sacrifices integrity for availability, whereas Solaris switches off the write cache with UFS, for example, to ensure that no power failure can harm data integrity, even when this slows down the system.
I think fsck vs. rollback is the same. With fsck you essentially don't know in what state you are getting the data back. With rollback you know exactly that you have a consistent state at the moment designated by the timestamp. Regarding the undelete: in this case I assume you are better off with ZFS. First, there is a practically unlimited number of snapshots (I'm doing one each evening on my fileserver). Second, it's relatively easy to get back a file with zdb (as long as you know a little bit about the on-disk structure), and with copy-on-write and the deferred reuse of blocks it should even be possible to recover an older version of a file you have accidentally overwritten (like echo "blah" > wrongfile), provided you react immediately.
Simple question (but slightly off-topic): how can I tell sub-sub-substandard hardware from standard hardware?
As no vendor would say that their devices are substandard, and there is no way to check it with a program, the only way is tripping the wire by simulating such a failure ...
True, but there are ways to tell that some shortcuts have been taken. If things tell you they have worked much faster than is possible, for instance, you can assume they are lying. It should be possible to test for many common shortcuts and report on them.
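A rough sketch of the "faster than is possible" check: on a rotating disk, a synced rewrite of the same block cannot complete much more often than once per platter rotation, so a drive that acknowledges far more synced writes per second than its spindle speed allows is probably answering from a volatile cache. The function names, the 1.5x slack factor, and the assumption that the RPM is known are all illustrative. The reasoning does not apply to SSDs or to controllers with battery-backed cache, and on some operating systems `os.fsync` does not force the drive's own cache to flush, so treat the measurement as a hint only.

```python
import os
import tempfile
import time


def max_synced_writes_per_sec(rpm: int) -> float:
    """Upper bound for same-block synced rewrites: one per platter rotation."""
    return rpm / 60.0


def looks_like_lying(measured_per_sec: float, rpm: int, slack: float = 1.5) -> bool:
    """Flag a device that acknowledges synced writes implausibly fast."""
    return measured_per_sec > slack * max_synced_writes_per_sec(rpm)


def measure_synced_writes(path: str, count: int = 200) -> float:
    """Rewrite the same 512 bytes `count` times, calling fsync after each write,
    and return the achieved rate in writes per second."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for i in range(count):
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, bytes([i % 256]) * 512)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Writes into the current directory; run it on the filesystem under test.
    with tempfile.TemporaryDirectory(dir=".") as tmp:
        rate = measure_synced_writes(os.path.join(tmp, "probe.bin"))
    print(f"{rate:.0f} synced writes/s, suspicious for 7200 rpm: "
          f"{looks_like_lying(rate, 7200)}")
```

This covers only one shortcut (ignored cache flushes). As the comment above it says, a negative result does not certify the hardware; the only conclusive test remains provoking the failure itself.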
Honestly, for old FS lags like me you could've summarized that as "ZFS has a virtual log structure" and I wouldn't need the rest. A huge oversimplification I know but it expresses the kernel of the reasoning.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Germany License
Tracked: Nov 06, 13:16
There are many "gems" in Linux that are really no-gos for me ... like the oom-killer. Robert pointed to another one in "Linux, O_SYNC and Write Barriers", where he links to a discussion about the incorrectness of the O_SYNC implementation.
Tracked: Dec 07, 08:13