The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Monday, August 15. 2011
Sometimes you “know” the problem from the first moment. But sometimes your feeling in the gut results in something that is perceived as a large change, so you have to find the smoking gun, the undeniable proof for your hypothesis.
This is the story of such a search. It started with a telephone call of a colleague. He got my name from another colleague. An Oracle database running on a Solaris system, the datafiles and logs are located on a Veritas File System. The customer saw massive delays (in the range of hundreds of seconds) when excuting certain commands. One of the commands was “truncate table”.
A hypothesis - but the proof?And in the beginning it started with a red herring.
In this case the thread is trying to execute something on a semaphore, but it wasn’t able to do so. However the semtimedop is timebombed. When the timeout is reached without being able to execute on the semaphore , it terminates with error 11. All the timeouts were consistent with the waiting time seen from the SQL commands perspective.
Obviously the customer and other involved parties were tempted to see this as the problem, but already thought that this may be just the harbinger of bad news. And after a short look into the truss files, I was pretty sure that they were right with their doubts in regard of passing the . It was just the harbinger of bad news.
After a short amount of research I suspected, that we were talking about a locking problem here. There was just a problem: vxfs. At first I worked seldomly with it, thus it’s not really my center of expertise.
One point that diverted the attention of the customer from the locking stuff is a small but important difference: The customer knew that Oracle likes Direct I/O. With UFS the "Direct I/O" is doing a little bit more than just making the I/O direct by disabling buffering. It also removes the inode r/w lock mandated by POSIX rules.The customer knew about UFS Direct I/O that and thus activated Direct I/O on vxfs. And thus I found lines like
However i stil suspected a lock contention problem, and had a reason for it: Direct I/O isn't the same with vxfs than it's in UFS. In vxfs Direct I/O is really just the direct part. It doesn't enable concurrent I/O (explain that moniker later) to a file. The removal of the inode r/w-lock isn't part of the feature. You have to use either Quick I/O (QIO) or the ODM module for vxfs. As both features weren't activated, that was the moment where i told the customer "Hey, choose the ODM module for vxfs or QIO, activate it and the problem should go away". Both remove that lock contention and thus are of big help in order to get better Oracle performance when using vxfs. Just to remove a misunderstanding: ODM (Oracle Disk Management) is an API in Oracle, not of Veritas. Oracles DNFS (direct NFS) is implemented via an ODM module as well.
The problem: You used to pay for both vxfs, neither of them is really cheap and before doing the change, the customer wanted to know that i was right with my diagnosis (according to the release notes, ODM and QIO are now part of the SF except in basic).
I wrote of two problems, but just wrote of one so far. Normally, finding out this inode rwlock contention problems are quite easy to find . But not in this case. vxfs is different than UFS in a multitude of ways. It doesn’t use the locking primitives of Solaris but has its own instead. And thus all values reported by prefered diagnosis tools were pretty useless. Damned … how should you find problems, when your instruments can’t show the problems. Without instrumentation troubleshooting is just guesswork and experience.
At this point a question on a mail alias (it’s great to have people on internal aliases, that have forgotten more about Solaris than I know) and some research via google yielded the same result in a few minutes of time:
Digging in the dirtI asked the customer to put a dtrace script into a script that is executed in the moment of the wait:
The result was interesting, as it clearly showed a peak of 307 events in the range 34359738368 nanoseconds (34.36 seconds) to 68719476735 nanoseconds (68.72 seconds).
This was especially interesting as the same dtrace script didn't showed such a peak during times where the system ran flawlessly.
Okay ... well ... next step ... what parts of the system were executing this vxfs`vx_rwsleep_rec_lock function. I could have used dtrace for this task as well, but i wanted some additional insight in one step. Thus i used a nice little command of the modular debugger in Solaris:
The output is quite long on a loaded solaris system. It prints something like this for each thread:
I hate multiple line outputs when searching for patterns. There is nothing better than two monitors, an terminal streched on both and the two glibberish grep-implementations on the front side of your skull. But this works best, if one event is just in one line.
So i did some grepsed-fu on it.
Each thread is now in a single line. Yeah … perhaps there is a more elegant way to do this, but that was the first that came into my mind
Just a quick check.
At the moment of the hang, 1008 processes were in
When you further dig down into the large heap of data:
From all this threads belonging to the
At that moment i thought: That isn't a smoking gun, that's a smoking howitzer.
An attempt to explainMost threads excuting this function are part of the database writers. When you think about it, that's not so astonishing, especially when you think about the nature of an rwlock. At first: There is a rwlock for each inode in a filesystem. Their function: Multiple readers can get the lock and so they can read concurrently from the file, but just one writer is able to hold it and thus to write into the file. Equally important: You can't write to the file as long one or more readers is in the codepath protected by the rwlock for this file, and no one can read from the file as long there is a writer in protected codepath.
In really basic rwlock implementations this can lead to writer starvation, as it's hard for the writer to get the lock, because all readers have to relinquish the rwlock and no new readers should start before the writer can get the lock. Out of this reason, the Solaris threads implementation tends to favour writers before readers. However when you have many writers, it may take a long time before the backlog of writes. Blindly prefering writers is not a solution as well, because then readers would starve which is even more problematic, because reads are always synchronous by nature. As i wrote at other locations. While a system can chose the time of an physical write to some extent, it can't chose the time of a read. A function won't execute as long the data isn't available. But that's out of scope of this article. For the capability to write and read in parallel to a file the name cocurrent I/O was coined.
I just wrote that it can take a moment before the backlog of writes has been executed. In this case it was even worse: The inode r/w lock adds insult to injury. Because basically the inode r/w lock limits you to just a single write I/O operation in parallel to a file, no matter how many HBA, how many disks you have in your system. And now you've made a while out of a moment. Even when the changes in the file are totally unrelated, e.g. changing a block belonging to the user table stored in it and another block in the article database or you want to read a block into the sga containing the customer database and writing the new salary for the promited assistant. You can't do this in parallel due to the inode rw lock. And with many updates in your workload it's not that astonishing that database writer threads start to twiddeling fingers in an increasing number in order to wait for their turn to write to the file.
You may ask yourself, why the heck there is such a mechanism. The r/w lock is something mandatory in order to be Posix compliant. You need it to ensure write ordering and consistent reads, when updates occur in parallel to read. Obviously you really want such a protection when working with files. However especially with databases a file is just a container for a large heap of things. Independent things. And things are now different.
Out of this reason there were some developments in the database realm to get rid of the inode rwlock and put this mechanism elsewere. Oracle allows you to use a raw disk, and so it has to do the consistent read and write ordering stuff anyways and as it’s aware of the inner structure of the heaps of data, it can do it with a much greater granularity than just per inode and thus per file. The inode r/w lock is just a bottleneck without any use in this case.
Out of this reason Direct I/O of UFS for example offers a mode that removes the lock. It's not the way, that those write ordering things or consistency protections are away. They are just in a layer that knows more about the structure inside the file and thus can do a better job at doing this job. vxfs knows similar mechanisms. QIO or ODM don't have such an inodewise locking. They are working differently compared with UFS direct I/O but as an earlier chancelor of the Federal Republic of Germany said: Outcome matters.
One question was still open. Why was this problem reproducible by a "TRUNCATE TABLE" command? That’s pretty easy however you have to dig deep into the internals of Oracle. When Oracle executes a TRUNCATE TABLE command, it checkpoints the database. In such a situation it writes all dirty blocks from the SGA into the database datafiles. This must be done for recovery purposes.
Such checkpointing may trigger a storm of writes via the database writer, especially when you have a SGA with a lot of dirty blocks. The checkpoint has to complete, before the TRUNCATE TABLE executes. And then we are at another red herring at the end: It's not the TRUNCATE TABLE command that was slow ... it's the checkpoint occuring before. You can check this pretty easy, when a "TRUNCATE TABLE" takes too long for your taste, trigger a checkpoint manually and do the TRUNCATE TABLE directly afterwards. TRUNCATE TABLE does still a checkpoint, but as you've already cleaned up the SGA from dirty buffers, it doesn't have to do much writing. It should run much faster now.
ConclusionAt the end i had to tell the customer, that in essence everything works as designed. It would be a bug, when the system would act just a little bit different differently. However that's seldom the answer a customer wants.
So: The solution for the issue? It's as old as it's easy. Getting rid of the inode rwlock. Get concurrent I/O: Either by using raw disks, by using ASM, by using UFS or by using ODM or QIO for vxfs. I just can reiterate something i've already said: When you put your Oracle database file into a filesystem, you want to use direct I/O and concurrent I/O!
Ein schönes Beispiel für systematisches Troubleshooting
Drüben beim anscheinend untoten C0t0d0s0 gibt es ein schönes Beispiel für systematisches Troubleshooting mittels Dtrace für Oracle und VxFS. Sehr lesenswert, vor allem unabhängig von der Thematik, ist es einfach gut zu sehen das hier einer systematisch vo
Tracked: Aug 17, 10:34
Display comments as (Linear | Threaded)
Presumably the customer has licensed Veritas Storage Foundation for Oracle. If that is the case, then odm is likely to offer relief as well. While I don't know the implementation details of odm, it is described as making database file access much more like raw device access. I've run into several situations where enabling odm was the solution to abysmal Oracle performance.
Of course, another option is to use ASM. From what I remember, its feature set includes a lot of things that look a lot like ZFS (snapshots, clones, etc.) but is specifically designed to offer raw performance for Oracle databases without sacrificing the features of file systems.
#1 Mike Gerdts on 2011-08-15 23:17
Yes, that's the reason why i wrote that QIO or ODM for vxfs are possible solutions. ODM is a really interesting thing. It ignores the usual system calls and uses just a single one bypassing many systemcalls ...
Of course ASM is the best solution, however it's a migration task. Most customers i know want to keep changes small and thus it was out of scope as customers wanted the most minimal change.
Assuming this is 9i, 10g, or 11g and Veritas 5.x then I would argue that the issue is asyncio isn't enabled correctly in ether the init.ora or the vxfs mount (or both). There is no need or Oracle edition (dbed), SF /HA v5.0 or v5.1 will perform exceptionally well for all but the largest and most intensive databases. It will perform well for both oltp and throughput if configured correctly.
Things were done differently in earlier versions, don't assume the same VxVM or VxFS rules apply, in many cases they don't and applying some tuning that worked in previous versions will actually cause serious performance issues.
Can't really comment in detail since we don't have any specifics on Oracle DB version or SF version in use. It is well documented in the release notes and install guide though.
#2 Jeff on 2011-08-16 09:27
to be correct in my comments, SF /HA v5.1 now includes many of the features that used to be in dbed. My guess here is few were buying dbed and they are trying to compete with ZFS and ASM where possible.
Storage Foundation Standard HA
Veritas File System
Veritas Volume Manager
Veritas Cluster Server
Veritas Quick I/O option
Veritas Extension for Oracle Disk Manager option
Optionally licensed features:
Veritas Volume Replicator
Veritas Global Cluster Option
#2.1 Jeff on 2011-08-16 10:12
I looked this up right after your first comment and integrated that already ... however you have still to activate both, for example by linking the correct libodm into Oracle or doing the symlink-stuff for QIO. Without that no ODM or QIO and thus you are running in the mentioned problem. It's nothing wrong with VXFS in this situation, but you have to use what's offered.
However one question: Could you explain me how a missing ASYNCIO config send all the threads of the database writer into twiddeling fingers around vx_rwsleep_rec_lock .... and where all this threads are coming from?
To make it short: The async I/O configuration isn't the probleme here. However as far as i know KAIO is dependent on QIO (ODM is a completely different beast using it own mechanism). When you haven't activated QIO, it falls back to threaded AIO as it would do with UFS and thus you are seeing the large numbers of threads in the dbwriter processes. But perhaps i'm wrong here.
Thanks for a nice article. If you are interested in another way of transforming your mdb output, try this:
# echo "::threadlist -v" | mdb –k | nawk '$1=$1' RS= OFS=" "
Cheers and keep 'em coming
#3 Bartek on 2011-08-16 19:40
> When Oracle executes a TRUNCATE TABLE command, it checkpoints the database
It is not a whole database checkpoint. It is 'only' a object level checkpoint to be exact.
#4 Ronny Egner on 2011-08-17 07:16
Here is a nice explanation why:
#4.1 Ronny Egner on 2011-08-17 07:44
I know ... but when i'm writing about segment-level checkpoints, i have to explain them ... and this is an article about Direct I/O and not about the database
Hi Joerg, I just love these kind of articles. Very interesting and real world examples how to debug problems. Perfect - thank you very much!
#5 Bernhard on 2011-08-18 13:07
is your DTrace script strictly safe ? Could there be multiple lwps running that function ? If so should self->in be an aggregation with perhaps the tid as a key ?
Very useful info gleamed via this btw, thanks!
#6 Seán on 2011-08-18 23:42
The self-in is thread local ... so the script should yield the information that i want .... my question wasn't "which threads are in that function
", but "how long are threads in that function and thus wait for the "lock" "
You're quite correct, I was thinking of nested functions.
#6.1.1 sean on 2011-08-20 21:52
Very good article...
I analyzed a similar problem of reader/writer lock ... For french readers you will find it here: http://www.gloumps.org/article-analyse-des-alertes-oracle-warning-aiowait-timed-out-81971908.html
welcome back to the blogosphere, glad to see your posts.
#8 Anantha on 2011-08-28 05:17
welcome back to the blogosphere, glad to see your posts.
#9 Anantha on 2011-08-28 05:18
The author does not allow comments to this entry
The LKSF book
The book with the consolidated Less known Solaris Tutorials is available for download here
Martin about End of c0t0d0s0.org
Mon, 01.05.2017 11:21
Thank you for many interesting blog posts. Good luck with al l new endeavours!
Hosam about End of c0t0d0s0.org
Mon, 01.05.2017 08:58
Joerg Moellenkamp about tar -x and NFS - or: The devil in the details
Fri, 28.04.2017 13:47
At least with ZFS this isn't c orrect. A rmdir for example do esn't trigger a zil_commit, as long as you don't speci [...]
Thu, 27.04.2017 22:31
You say: "The following dat a modifying procedures are syn chronous: WRITE (with stable f lag set to FILE_SYNC), C [...]