The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Sunday, October 30. 2011
Damned axe ... is it really that hard to believe that an automatic system in Debian makes a dumb decision based on dumb assumptions (classic BIBO: bullshit in, bullshit out)? It has nothing to do with a distribution butchered by a hosting provider or an error between the keyboard and the chair. I would consider it a bug if it delivered a different result, given the input to the system and the ruleset codified in the system.
Per default an init.d script is inserted by the
This information is used to put the init.d script at the right place in the startup sequence.
A number of scripts depend only on the existence of the local filesystem. These are run with sequence number 01, and thus first.
One of the services started first is syslog, and as you may have noticed in the head of the ssh init.d script, ssh depends on it. So it gets a higher sequence number and thus it's started later:
But wait, what has happened with 02 and 03? They are used by a single service each:
Why do they get this special treatment? It's pretty simple. Both services are flagged as being interactive, as they could ask for some user interaction.
In this case both might ask for a password for the key. They have to start early to be sure that nothing else can grab and block the tty, so that both scripts are able to show the password dialog. Keep in mind that we are in a state where no getty is running. However, they can't start earlier, as both rely on syslog. Essentially they are started as soon as possible and before other scripts with the same dependencies. And that's basically a part of the problem.
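The ordering rules described above can be sketched in a few lines of Python. This is a deliberately simplified model and not the actual algorithm Debian uses: it assumes a script's sequence number only depends on how deep its dependency chain is, and that every interactive script gets a slot of its own ahead of the non-interactive scripts with the same dependencies. The script names and their dependencies are hypothetical; `keyd` in particular is a made-up stand-in for the second interactive service.

```python
# Simplified model of dependency-based sequence numbers (an assumption,
# not the real Debian implementation).
# Hypothetical scripts: name -> (dependencies, interactive flag).
SCRIPTS = {
    "rsyslog": ((), False),
    "acpid": ((), False),
    "apache2": (("rsyslog",), True),
    "keyd": (("rsyslog",), True),   # made-up second interactive service
    "ssh": (("rsyslog",), False),
    "cron": (("rsyslog",), False),
}


def depth(name, scripts):
    """Length of the longest dependency chain ending at `name`."""
    deps, _ = scripts[name]
    return 1 + max((depth(d, scripts) for d in deps), default=0)


def assign_sequence(scripts):
    """Map each script to a sequence number under the simplified rules."""
    by_depth = {}
    for name in scripts:
        by_depth.setdefault(depth(name, scripts), []).append(name)

    seq, slot = {}, 0
    for d in sorted(by_depth):
        names = sorted(by_depth[d])
        # Interactive scripts go first and never share a slot.
        for name in (n for n in names if scripts[n][1]):
            slot += 1
            seq[name] = slot
        # All remaining scripts of this depth share the next slot.
        shared = [n for n in names if not scripts[n][1]]
        if shared:
            slot += 1
            for name in shared:
                seq[name] = slot
    return seq


if __name__ == "__main__":
    result = assign_sequence(SCRIPTS)
    for name, number in sorted(result.items(), key=lambda kv: (kv[1], kv[0])):
        print(f"{number:02d} {name}")
```

With this input the two interactive scripts end up alone at 02 and 03, and ssh lands behind them at 04 although it has exactly the same dependency.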
And this essentially leads to the situation, that
Another reader pointed out to me that lighttpd starts after ssh. First, this is just of cursory interest in a situation where the webserver in question is apache. Second, the dependencies are different. At first the init.d script for lighty has no
I won't comment on what I think about such a ... a ... a ... solution ...
Thursday, October 27. 2011
Surely you've noticed that my blog was down for a few days, and with it all services on the system. The problem that led to this situation was a really dumb one. Perhaps this article is more a story about not thinking about a failure mode just because it's not a problem under your preferred operating system (or to be exact: it was a problem before Solaris 10, but afterwards it was solved). And most of all it's a story about being totally problem-blind in the first moment.
Perhaps I should explain first that c0t0d0s0.org isn't run with Solaris; it uses this-other-unixoid-operating-system in a well-known non-commercial variant. That's the dirty secret of c0t0d0s0.org. There is no technical reason for it, but webserving and mail can be done by any operating system, and thus I used the operating system with ubiquitous availability at almost all providers of dedicated servers. I'm able to migrate the server from one dedicated server provider to another within 2 hours, including moving the data, and did this three times in the past (from 1&1 to Hetzner, and two times within Hetzner). This has saved quite significant money until now, and that's the basic reason why I don't want donations (and if you do donate, I would pass this money on to kiva.org).
Hetzner has reasonably priced dedicated servers and I had no problems in the past. However, they have one important shortfall: no serial console in the standard product. When you need a console, you have to make a support call and they connect one. As you need it really seldom, it's okay. As I found out later: with this serial console I would have recognized the problem within a minute and fixed it in a second. However, the console was exactly the thing that I didn't have at my disposal at this moment.
So it was a lot harder to find out what had happened. However, I wanted my server back as soon as possible (for personal reasons I was only able to start the recovery in the evening, and as I have a job to do I could only do the further work in the evenings as well), and thus I just reimaged the server after keeping a copy of the logfiles. I have a quite extensive backup regimen with very regular rsyncs and database replication on my server at home, thus I knew I would perhaps lose just an hour or some minutes of data, and that was okay for me. Additionally, I was able to mount the disks of the non-working installation and to copy the delta of mails between the last backup and the last mail in the queue to my backup.
What had happened: At 10:something my server provider had a large power outage. The UPS didn't take over as planned and thus a lot of servers rebooted. One of them was mine. Damned ... but that's the basic reason why I'm a fan of proper enterprise architecture and not of some singular availability features, no matter what marketing tells you. Real availability is hard work and often expensive. But: when you really bet your business on IT, you need an architecture that is even capable of covering a UPS that proves to be not so uninterruptible. The availability feature UPS may fail (and did fail in my case), but a proper enterprise architecture keeps your service up and running. Even more important: with a proper enterprise architecture you don't need the feature UPS for availability reasons at all, because your service can survive the outage of some parts. Perhaps you want the UPS for other reasons like "don't want the hassle of bringing up all the systems again". But you don't need it with such an architecture to keep your business running. By having a properly planned enterprise architecture with servers at two separate sites on different power grids you may forget about the UPS, because a UPS won't help you with a prolonged power outage, for example because of a region-wide blackout. Such an outage may take out the connectivity as well, as it's not that improbable that your local carrier has the same power problem. However, my business is being an architect, not blogging, and thus I didn't have such an architecture in place for c0t0d0s0.org. But that's a different story.
Okay: After a while my system worked again and thus I had time to find out what had happened. I knew that the system was still responding to pings, thus I knew the kernel of this-other-unixoid-operating-system in a well-known non-commercial variant was working. Looking into the logfiles I saw a complete bootup of the kernel, and some of the daemons were starting up ... like acpid, for example. However, I couldn't log in via SSH. There were no signs of an ssh daemon startup in the logfiles. The apache was in a half-reacting state: port 80 was open, but it didn't react to HTTP commands.
Out of this I concluded: The kernel and the boot configuration are okay. The bring-up of the services is at least partially working, because otherwise it wouldn't start any services at all. And as it reacted on the network, there must have been at least a working boot of the services mandated by rcS.d, as otherwise there would be no networking. The problem must be in apache, which is frozen halfway. And for some other reason ssh isn't started at all. There must have been a major fsckup in the startup of the services. As I had no console, as explained before, I needed to conclude from the leftovers what had happened.
And now I was a little bit puzzled. 5-6 years ago I would have recognized this problem in an instant (because sometimes I've produced ... well ... suboptimal startup scripts) ... but today it took a while, because I hadn't fallen prey to such a problem for 5-6 years. It took me a lot more thought to work out what might have happened. When you do one operating system for a living and one as a hobby, you tend to project the mindset of one onto the other, and you don't do justice to this other OS.
As you all may know, Solaris ditched init.d with Solaris 10 in order to introduce SMF (not to forget the equally important features like the contract filesystem and the Fault Management Architecture). One of the nice advantages of SMF is that services that aren't interdependent are started in parallel without waiting for one another. This has two advantages: first, the system can start up much faster; second, a service that is unable to start up can't block the startup of the rest (short of services needed by all others).
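The second advantage can be illustrated with a small Python sketch. It is only a toy model of dependency-driven startup and has nothing to do with the real SMF interfaces; the service names and the dependency table are hypothetical. The point is simply that a service that never finishes starting only holds back the services that depend on it.

```python
def startable(deps, hung):
    """Return the services that come up when those in `hung` never finish.

    `deps` maps a service name to the list of services it needs.
    A service starts as soon as everything it needs is up.
    """
    up = set()
    changed = True
    while changed:
        changed = False
        for service, needs in deps.items():
            if service in up or service in hung:
                continue
            if all(n in up for n in needs):
                up.add(service)
                changed = True
    return up


# Hypothetical dependency table, for illustration only.
DEPS = {
    "network": [],
    "syslog": ["network"],
    "ssh": ["network", "syslog"],
    "apache": ["network", "syslog"],
    "webapp": ["apache"],
}

if __name__ == "__main__":
    # apache waits forever for a key password; ssh still comes up.
    print(sorted(startable(DEPS, hung={"apache"})))
    # ['network', 'ssh', 'syslog']
```

Only `webapp` is lost together with the hanging `apache`; the login service is untouched.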
The init.d concept is different. All services are started in sequence. The sequence is numerical and then alphabetical. That is of course slower, but more important ... depending on the way you write your script, a script or binary hanging or waiting for user interaction can block the startup. My variant of this-other-unixoid-operating-system uses init.d.
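A minimal Python sketch of this behaviour, assuming start links named with a two-digit sequence number followed by the service name. The link names are hypothetical and chosen only to show the ordering; the sketch models the strictly sequential start and is not taken from any real init implementation.

```python
def run_in_sequence(links, hangs):
    """Start the links one after another in sorted order.

    Sorting the names as plain strings gives "numerical, then alphabetical"
    as long as all sequence numbers have the same number of digits.
    Returns the links that were started before one of them hung.
    """
    started = []
    for link in sorted(links):
        if link in hangs:
            # init waits for this script to return, so nothing after it runs.
            break
        started.append(link)
    return started


# Hypothetical start links, for illustration only.
LINKS = ["S20ssh", "S20apache2", "S20acpid", "S10syslog"]

if __name__ == "__main__":
    print(run_in_sequence(LINKS, hangs={"S20apache2"}))
    # ['S10syslog', 'S20acpid']
```

Everything sorted behind the hanging script never gets its turn, no matter whether it depends on that script or not.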
And it's quite easy to block a service, for example by integrating a new SSL key and certificate. My key had a password, and apache was asking for this password in order to start up. Exactly this happened. Acpi started up because it was started before Apache (guess what: ACpi is before APache in alphabetical order, and way before Ssh). This is the basic reason why you strip the password from your key. Guess what I did last week: I put a new key and certificate on my server and forgot to strip the password from it.
And that is exactly what happened: in my version of this-other-unixoid-operating-system the ssh daemon is started after the apache daemon. When Apache waits for something, you won't get SSH. Damned ... it's facepalm time. Basically I fell prey to a beginner's error because I'm working with an operating system that reacts totally differently in such situations. On Solaris such a situation just doesn't matter at all ... you at least get your ssh login, and the system non-availability is just a service non-availability you can fix within a second. However, given the init.d system of this-other-unixoid-operating-system, the outcome was somewhat more problematic. However: what a dumb error on my side ...
However: The reinstallation wasn't that bad ... the system could use a reinstallation because of some tests and experiments. So it was worth the work in the evenings. And on the other hand: who had the glorious idea to start apache before ssh?
That said, newer variants of this-other-unixoid-operating-system have a different startup mechanism to upstart all the services. My heating control is running on a beagleboard-XM at the moment, using a really current variant of this-other-unixoid-operating-system released just a few days ago. It uses this new startup mechanism. And it's just my unimportant personal preference that doesn't matter, but I don't like it. And I have a lot of reasons for that. It looks by far too much designed for desktop needs. However, explaining my dislike would require an article I won't write in this blog nowadays. But as I wrote: that's my personal opinion, and it doesn't matter.
However, it's really important that this-other-unixoid-operating-system gets away from the old init.d mechanism to something more current. I think in 2011 every operating system deserves something more functional, something better than init.d ... init.d is simple and well understood, however it creates classes of problems that are unnecessary today. Especially in order to keep die-hard Solaris admins from falling prey to such a beginner's error, because such problems are part of their distant past. And now I will start to cut holes for my eyes in the brown paper bag for my head.
Friday, October 7. 2011
Thursday, October 6. 2011
Wednesday, October 5. 2011
BTW ... the Hungarian folk dancers explain other sort algorithms as well: merge-sort, insert-sort, bubble-sort, shell-sort and select-sort
Wednesday, October 5. 2011
Tuesday, October 4. 2011
Here is a nice example of the power of boot environments. Boot environments are something like snapshots of your operating system installation made writeable. As you may already assume, they are based on ZFS snapshots and the clone functionality. This is possible due to the usage of ZFS as the root filesystem.
So: Please don't try this at home. If you do try it, don't try it on any Solaris 11 Express installation of any value. But don't try it. I don't want to hear any story that you've deleted your ERP system by accident because you used the wrong terminal window. Leave that to trained professional stunt admins with the right equipment (Solaris 11 Express).
Assume you have a system, configured with all your applications, and everything is running fine. So you think it would be nice to have something like a frozen state of this situation. No problem. This command will do the trick.
When you reboot your system you will see it as a new entry in the grub menu.
Okay, but boot into the old environment starting "Oracle Solaris ..." first by selecting it in the grub menu (it should be already selected, or you used
Essentially we've just nuked the installation. After a moment the system should just freeze. Reset the system and boot again via grub into the boot environment starting with "Oracle Solaris ...":
Okay ... on a normal system this would send you to the tapes. With Solaris 11: Reset the system. Boot into the boot environment "rescuenet" via selecting it in grub.
Tada! Just creating a boot environment with a single command after a config change may save your butt later ... and btw ... this even works in zones ... they know the concept of boot environments, too.
Tuesday, October 4. 2011
You may have noticed that there are no product announcements for Oracle products on this page, even though there are now a lot of announcements that I had been waiting on for a long time. And I will keep it this way. So in case you want information about announcements, you have to look for it at other locations. I want to draw your attention to a blog by my colleagues working in the eSTEP (EMEA Systems Technology Enablement Program) program. They just started it. So I would like to ask for your kind attention for: "The official eSTEP blog"
Monday, October 3. 2011
Just a short hint: The What's New document of Solaris 10 Update 9 states that support for IPoIB Connected Mode has been added in the release. However, you have to search a bit to find information on how to activate it. The necessary step is documented in the manpage for the ibd driver. Let's assume you have two instances of the ibd driver running (ibd0 and ibd1). In this case you have to change one line at the end of
PS: The process for Solaris 11 is better, as you just use dladm for it. However connected mode is the default there anyway. In Solaris 10 unreliable datagram was kept as the default, as one of the rules in Solaris is that you have to opt-in to such changes between updates.