Sunday, August 21. 2011
Yesterday I gave my deduplication talk at FrOSCon 2011. I think it went acceptably, and lecture room 3 was really full. To be honest: I didn't expect such an audience at that time of day. So a big thank you to all who attended the talk.
My talk started at 10:00, and thanks to apron parking, a rental car pickup in what felt like the middle of nowhere in Düsseldorf, and a larger traffic jam near Cologne, I arrived at 09:58 … with a presentation at 10:00 and the urgent need to visit the restroom before a disaster happened.
I had just a short time at FrOSCon … I had a date with my house, which still needs work and doesn't accept it when I say "No … not this weekend". And thus I was back in the rental car at 11:45 and back in Lueneburg shortly after 15:00, ten hours after leaving home. 3:30h for 430 km … not that bad for an Opel Corsa.
However, there is something I've realized: I really miss standing in front of customers and technically interested people, trying to transfer my enthusiasm for technology. I really miss being on the road.
Monday, August 15. 2011
Sometimes you "know" the problem from the first moment. But sometimes your gut feeling points to something that would be perceived as a large change, so you have to find the smoking gun, the undeniable proof of your hypothesis.
This is the story of such a search. It started with a telephone call from a colleague, who had gotten my name from another colleague. An Oracle database running on a Solaris system, with the datafiles and logs located on a Veritas File System (vxfs). The customer saw massive delays (in the range of hundreds of seconds) when executing certain commands. One of these commands was "TRUNCATE TABLE".
A hypothesis - but the proof?

And in the beginning it started with a red herring.
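The red herring was a pile of timed-out semtimedop calls in the truss output, roughly of this shape (LWP number, semaphore id and addresses are purely illustrative):

    /1:     semtimedop(9, 0xFFFFFD7FFFDFB2D0, 1, 0xFFFFFD7FFFDFB2E0) Err#11 EAGAIN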
In this case the thread was trying to execute an operation on a semaphore, but it wasn't able to do so. However, the semtimedop call carries a timeout. When the timeout is reached without the operation on the semaphore having been executed, the call terminates with error 11 (EAGAIN). All the timeouts were consistent with the waiting times seen from the SQL commands' perspective.
Obviously the customer and the other involved parties were tempted to see this as the problem, but they already suspected that it might be just the harbinger of bad news. And after a short look into the truss files, I was pretty sure their doubts were justified. It was just the harbinger of bad news.
After a short amount of research I suspected that we were talking about a locking problem here. There was just one problem: vxfs. I have seldom worked with it, thus it's not really my center of expertise.
One point that diverted the customer's attention from the locking issue is a small but important difference: The customer knew that Oracle likes Direct I/O. With UFS, Direct I/O does a little bit more than just making the I/O direct by disabling buffering. It also removes the inode r/w lock mandated by POSIX rules. The customer knew this about UFS Direct I/O and thus activated Direct I/O on vxfs. And thus I found lines like this:
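(A sketch of the kind of configuration meant here - vxfs Direct I/O is enabled via the mincache=direct and convosync=direct mount options; device and mount point names are hypothetical.)

    # mount -F vxfs -o mincache=direct,convosync=direct /dev/vx/dsk/oradg/oravol01 /oradata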
However, I still suspected a lock contention problem, and had a reason for it: Direct I/O isn't the same in vxfs as it is in UFS. In vxfs, Direct I/O is really just the direct part. It doesn't enable concurrent I/O (I'll explain that moniker later) to a file. The removal of the inode r/w lock isn't part of the feature. You have to use either Quick I/O (QIO) or the ODM module for vxfs. As neither feature was activated, that was the moment where I told the customer: "Hey, choose the ODM module for vxfs or QIO, activate it, and the problem should go away." Both remove that lock contention and thus are a big help in getting better Oracle performance when using vxfs. Just to clear up a misunderstanding: ODM (Oracle Disk Manager) is an API of Oracle, not of Veritas. Oracle's dNFS (Direct NFS) is implemented via an ODM module as well.
The problem: You used to pay extra for both features of vxfs, neither of them is really cheap, and before making the change the customer wanted to know that I was right with my diagnosis (according to the release notes, ODM and QIO are now part of Storage Foundation, except in the Basic edition).
I wrote of two problems, but have only written about one so far. Normally, this kind of inode rwlock contention problem is quite easy to find. But not in this case. vxfs is different from UFS in a multitude of ways. It doesn't use the locking primitives of Solaris but has its own instead. And thus all the values reported by my preferred diagnosis tools were pretty useless. Damned … how should you find problems when your instruments can't show them? Without instrumentation, troubleshooting is just guesswork and experience.
At this point a question on a mail alias (it's great to have people on internal aliases who have forgotten more about Solaris than I know) and some research via Google yielded the same result within a few minutes: look at the vxfs-internal locking routine vx_rwsleep_rec_lock.
Digging in the dirt

I asked the customer to embed a dtrace script into a script that would be executed at the moment of the wait:
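A minimal sketch of such a script, assuming the vxfs routine vx_rwsleep_rec_lock is visible to DTrace's fbt provider:

    #!/usr/sbin/dtrace -s

    /* note the entry timestamp for each thread entering the vxfs lock routine */
    fbt:vxfs:vx_rwsleep_rec_lock:entry
    {
            self->ts = timestamp;
    }

    /* on return, aggregate the time spent in the routine into a
       power-of-two histogram of nanoseconds */
    fbt:vxfs:vx_rwsleep_rec_lock:return
    /self->ts/
    {
            @waits = quantize(timestamp - self->ts);
            self->ts = 0;
    }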
The result was interesting, as it clearly showed a peak of 307 events in the range of 34359738368 nanoseconds (34.36 seconds) to 68719476735 nanoseconds (68.72 seconds).
This was especially interesting as the same dtrace script didn't show such a peak during times when the system ran flawlessly.
Okay ... well ... next step ... which parts of the system were executing this vxfs`vx_rwsleep_rec_lock function? I could have used dtrace for this task as well, but I wanted some additional insight in one step. Thus I used a nice little command of the modular debugger (mdb) in Solaris:
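A sketch of such an invocation, dumping the kernel stack of every thread with mdb in kernel mode:

    # echo "::walk thread | ::findstack" | mdb -k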
The output is quite long on a loaded Solaris system. It prints something like this for each thread:
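(Roughly this shape - addresses and offsets are purely illustrative:)

    stack pointer for thread fffffe8000c9cc60: fffffe8000c9c9f0
      swtch+0x185()
      cv_wait+0x68()
      vx_rwsleep_rec_lock+0x58()
      ...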
I hate multi-line output when searching for patterns. There is nothing better than two monitors, a terminal stretched across both, and the two gibberish grep implementations on the front side of your skull. But this works best if one event is just one line.
So I did some grep/sed-fu on it.
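An equivalent sketch in nawk, assuming each thread's block in the saved output (hypothetically threads.out) starts with the "stack pointer for thread" line:

    # nawk '/^stack pointer for thread/ { if (line != "") print line; line = $0; next }
            { line = line " " $1 }
            END { if (line != "") print line }' threads.out > threads.flat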
Each thread is now on a single line. Yeah … perhaps there is a more elegant way to do this, but that was the first thing that came to mind.
Just a quick check.
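Something like this, with threads.flat being the hypothetical flattened file from the previous step:

    # grep -c vx_rwsleep_rec_lock threads.flat
    1008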
At the moment of the hang, 1008 threads were sitting in vx_rwsleep_rec_lock.
When you dig further down into this large heap of data, it turns out that almost all of these threads belonged to the database writer processes of the Oracle instance.
At that moment I thought: that isn't a smoking gun, that's a smoking howitzer.
An attempt to explain

Most threads executing this function were part of the database writers. When you think about it, that's not so astonishing, especially when you consider the nature of an rwlock. First: there is an rwlock for each inode in a filesystem. Its function: multiple readers can take the lock and so read concurrently from the file, but only one writer can hold it and thus write to the file. Equally important: you can't write to the file as long as one or more readers are in the codepath protected by the rwlock for this file, and no one can read from the file as long as there is a writer in the protected codepath.
In really basic rwlock implementations this can lead to writer starvation, as it's hard for the writer to get the lock: all readers have to relinquish the rwlock, and no new readers should start before the writer can get it. For this reason the Solaris threads implementation tends to favour writers over readers. However, when you have many writers, it may take a long time before the backlog of writes is worked off. Blindly preferring writers is not a solution either, because then readers would starve, which is even more problematic, because reads are always synchronous by nature. As I've written elsewhere: while a system can choose the time of a physical write to some extent, it can't choose the time of a read. A function won't execute as long as the data isn't available. But that's out of the scope of this article. For the capability to write and read to a file in parallel, the name concurrent I/O was coined.
I just wrote that it can take a moment before the backlog of writes has been executed. In this case it was even worse: the inode r/w lock adds insult to injury, because it basically limits you to a single write I/O operation to a file at a time, no matter how many HBAs or how many disks you have in your system. And now you've made a while out of a moment. Even when the changes in the file are totally unrelated - e.g. changing one block belonging to the user table stored in it and another block in the article table, or reading a block of the customer table into the SGA while writing the new salary for the promoted assistant - you can't do this in parallel, due to the inode rw lock. And with many updates in your workload it's not that astonishing that an increasing number of database writer threads start twiddling their thumbs while waiting for their turn to write to the file.
You may ask yourself why the heck there is such a mechanism at all. The r/w lock is mandatory in order to be POSIX compliant. You need it to ensure write ordering and consistent reads when updates occur in parallel with reads. Obviously you really want such protection when working with files. However, especially with databases, a file is just a container for a large heap of things. Independent things. And for those, the situation is different.
For this reason there were developments in the database realm to get rid of the inode rwlock and put this mechanism elsewhere. Oracle allows you to use a raw disk, so it has to do the consistent-read and write-ordering work itself anyway, and as it's aware of the inner structure of the heaps of data, it can do so with a much finer granularity than just per inode and thus per file. The inode r/w lock is just a bottleneck without any use in this case.
For this reason, Direct I/O in UFS for example offers a mode that removes the lock. It's not that the write-ordering and consistency protections go away. They just move into a layer that knows more about the structure inside the file and thus can do a better job of it. vxfs offers similar mechanisms: QIO and ODM don't have such per-inode locking. They work differently compared with UFS Direct I/O, but as a former chancellor of the Federal Republic of Germany said: it's the outcome that matters.
One question was still open: why was this problem reproducible with a "TRUNCATE TABLE" command? That's pretty easy to answer, though you have to dig a bit into the internals of Oracle. When Oracle executes a TRUNCATE TABLE command, it checkpoints the database. In such a situation it writes all dirty blocks from the SGA into the database datafiles. This must be done for recovery purposes.
Such checkpointing may trigger a storm of writes via the database writer, especially when you have an SGA with a lot of dirty blocks. The checkpoint has to complete before the TRUNCATE TABLE executes. And then we arrive at another red herring at the end: it's not the TRUNCATE TABLE command that was slow ... it's the checkpoint occurring before it. You can check this pretty easily: when a "TRUNCATE TABLE" takes too long for your taste, trigger a checkpoint manually and do the TRUNCATE TABLE directly afterwards. TRUNCATE TABLE still does a checkpoint, but as you've already cleaned the dirty buffers out of the SGA, it doesn't have much writing to do. It should run much faster now.
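A sketch of that test in SQL (the table name is hypothetical):

    SQL> ALTER SYSTEM CHECKPOINT;
    SQL> TRUNCATE TABLE big_table;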
Conclusion

In the end I had to tell the customer that, in essence, everything works as designed. It would be a bug if the system acted even a little bit differently. However, that's seldom the answer a customer wants.
So: the solution for the issue? It's as old as it is simple: get rid of the inode rwlock. Get concurrent I/O - either by using raw disks, by using ASM, by using UFS with Direct I/O, or by using ODM or QIO for vxfs. I can only reiterate something I've already said: when you put your Oracle database files into a filesystem, you want to use direct I/O and concurrent I/O!
Sunday, August 7. 2011
Since the closure of the blog I've gotten many, many requests to give content to people in the form of an excerpt from the database. Even more: I got a significant number of angry mails. It seems that I was the best source for some things, and people were really angry about the removal of their source of information. I was somewhat astonished about the anger; in absolute numbers it was significant, but in relative numbers it was rather small. Nevertheless, I didn't expect angry people.
After thinking about it the whole weekend, I decided to put selected content back online instead of creating a PDF of the blog and sending it to the people asking for content. As a side effect, this repairs a lot of links on Solaris-related pages.
However, as the reasons for my blog closure are still valid, the blog stays in this frozen state. Thus I decided to do this as a selective republication. Certain classes of documents will stay offline. In that regard I made the following decision about which content gets back into the blog: personal content on one side, and work-related technical content, as long as it's related to Solaris, on the other.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Germany License