The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Saturday, April 7. 2012
Ruud van der Pas and Jared Smolens wrote a really interesting whitepaper about the SPARC T4 and its behaviour with certain kinds of code: How the SPARC T4 Processor Optimizes Throughput Capacity: A Case Study. In this article the authors compare and explain the behaviour of the SPARC T4 and UltraSPARC T2+ processors in order to highlight some of the strengths of the SPARC T-series processors in general and the T4 in particular.
Wednesday, April 4. 2012
Another Unix on a very small device: "UNIX ON PIC32 – MEET RETROBSD FOR DUINOMITE". The Duinomite is a small board based on a PIC32 microcontroller.
Serge did an amazing job porting 2.11BSD Unix, which in the old days ran on the PDP-11, to the PIC32 (MIPS). Within a footprint of just 128 KB of RAM he managed to boot a UNIX OS, and you still have 96 KB left for applications.
Wednesday, June 29. 2011
Robin Harris of StorageMojo pointed to an interesting article on ACM Queue about deduplication and its impact on the resiliency of your data against data corruption.
The problem in short: A considerable number of filesystems store important metadata at multiple locations. For example, the ZFS rootblock is copied to three locations; other filesystems have similar provisions to protect their metadata. You can easily prove that the rootblock pointer in the uberblock of ZFS, for example, points to blocks with absolutely identical content in all three locations (with zdb -uu and zdb -r). It has to be that way, because they are protected by the same checksum. Now, a number of devices offer block-level dedup, either as an option or as part of their inner workings. When you store three identical blocks on such a device and the device does block-level dedup internally, it may deduplicate your redundant metadata into a single block on the non-volatile storage. When this block is corrupted, you essentially have three corrupted copies. Three hits with one bullet.
This is indeed an interesting problem: a device doing deduplication doesn't know whether a block is important metadata or just a data block. This is the reason why I like deduplication the way it's done in ZFS: it's an integrated part of the filesystem, so important parts don't get deduplicated away. A disk accessed through a block-level interface doesn't know anything about the importance of a block. To its inner mechanisms a metadata block is no different from a normal data block, because there is no way to tell the device that this block is important and that its redundant copies must not fall prey to some clever deduplication mechanism.
Robin talks about this with regard to the SandForce disk controllers, which use a kind of dedup to reduce some of the nasty effects of writing data to flash, but the problem is much broader: it is relevant whenever you are using a device with block-level deduplication. For most implementations you have to activate it explicitly by command, whereas certain devices do it by default or by design without you knowing about it. And even then I wouldn't be perfectly sure ... given that storage administration and server administration are often different groups with different business objectives, I would ask your storage guys whether they have activated dedup on their boxes without telling anybody, in order to speak less often with the storage sales rep.
The problem is even more interesting with ZFS. You may use ditto blocks to store multiple copies of important data in the pool to increase redundancy, even when your pool consists of just one disk or a striped set of disks. But when your device does dedup internally, it may remove your redundancy before the data hits the non-volatile storage. You've won nothing; you've just spent your disk quota on the LUNs in the SAN and made your disk admin happy because of the good dedup ratio. However, you can only fall into this specific "deduped ditto block" trap when your pool consists of a single device, because ZFS writes ditto blocks to different disks when there is more than one disk. Yet another reason why you should spend some extra thought when putting your zpool on a single LUN, especially when the LUN is sliced and diced out of a large heap of storage devices by a storage controller.
However, I have one problem with the article's specific mention of ZFS: you can only be hit by this problem when you are using the deduplicating device for the pool itself. In the specifically mentioned case of SSDs this isn't the use case. Most implementations of SSDs in conjunction with ZFS are hybrid storage pools, where rotating rust is used as the pool and SSDs are used as L2ARC/sZIL. And there it simply doesn't matter: when you really have to resort to the sZIL (your system went down), it doesn't matter whether one block or several blocks on the device are corrupt, because you have to fall back to the last known good transaction group anyway. On the other side, when a block in the L2ARC is corrupt, you simply read it from the pool, and in HSP implementations that is the already mentioned rotating rust.
In conjunction with ZFS this is more interesting when you use a storage array that is capable of doing dedup and you use its LUNs for your pool. However, as mentioned before, on those devices dedup is a decision made by the user, so it's less probable that you are deduplicating your redundancies without knowing it.
Other filesystems lacking a capability similar to hybrid storage pools are more "haunted" by this problem of SSDs using dedup-like mechanisms internally, because those filesystems really store the data on the SSD instead of using it just as an accelerating device.
In the end, however, Robin is correct: it's yet another reason why protecting your data by creating redundancy, by dispersing it over several disks (by mirroring or parity RAID), is really important. No dedup mechanism inside a device can dedup away your redundancy when you write it to a totally different and independent device.
Thursday, February 3. 2011
Sometimes you just think ... "Hell ... this screams to be misunderstood". Asynchronous filesystem semantics and asynchronous I/O are such concepts. They sound the same, but they aren't, and people often get it wrong, even when you are not talking about the application but just about the filesystem.
As you know, there are two important concepts of executing writes: synchronous and asynchronous, or to say it differently, blocking and non-blocking. When you trigger an asynchronous write, the write call comes back right after you issued it. A synchronous write call, by definition, is only allowed to come back when the system has ensured (as far as technically possible) that the data is on some kind of non-volatile storage.
Synchronous writes are absolutely essential, for example, for mail servers: the MTA can't send the "OK, got the mail" to the MUA or to another MTA as long as it isn't sure that the mail is on non-volatile storage on the receiving server, because the sending MTA deletes the mail from its queue after such an OK. A mail could be lost if the power failed after the OK but before the non-volatile storing of the mail. So you send the OK afterwards. And how do you ensure that the OK comes after the non-volatile storing? Yes ... with a sync write.
It's the same for databases: one of the important foundations of ACID is the synchronous write. Without synchronous writes, forget about the D (durability) ...
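To make this concrete, here is a minimal sketch in C of such a synchronous write (the file name is made up for illustration): with the O_DSYNC flag, every write() only returns once the data has reached non-volatile storage, which is exactly the guarantee the MTA and the database rely on before sending their OK or committing a transaction.

    /* A minimal sketch of a synchronous write on a POSIX system such as
     * Solaris. The file name is made up for illustration. With O_DSYNC
     * every write() only returns once the data has reached non-volatile
     * storage; alternatively you could write() normally and call fsync()
     * before acknowledging anything to the outside world. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        const char msg[] = "mail queued\n";

        /* O_DSYNC: every write() blocks until the data is stable. */
        int fd = open("/var/tmp/queuefile", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        if (fd == -1) { perror("open"); return 1; }

        /* Only when this call has returned may we send the "OK" to the peer. */
        if (write(fd, msg, sizeof(msg) - 1) == -1) { perror("write"); return 1; }

        close(fd);
        return 0;
    }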
However, a synchronous write has a problem, and it stems from its greatest advantage: it doesn't come back until the write has been completed, and the thread can't do anything else in the meantime.
Translating this to an example in the real world, you can imagine it like being the dispatcher of a large group of employees. When you give your employees their tasks synchronously, you hand one of them a job and wait for its completion before you dispatch another job. You can be sure that the task has been completed before you throw away the piece of paper with the to-do on it, but it's not really efficient.
Switching back to asynchronous writes isn't an option either, as the only way to check whether a write call was really processed through the complete code path (including the code paths inside the HBA and the controller adjacent to the disk) is to read back the location (okay, there are other reasons to do so, but that's a completely different discussion). The asynchronous write call gives you no feedback on whether the write has been completed in terms of putting the data on non-volatile storage. Basically you trust the operating system to do everything right, and beyond that there is absolutely no guarantee that asynchronously written data is on disk after a system crash.
Translated to the real-world example, this would be like giving a task to an employee, throwing away the piece of paper where you wrote down the task ... and forgetting about it. Obviously you can work very fast this way, and often it works well, but when an employee quits or is absent due to illness, the job may never be completed.
By the way: it's the same for reads. A read blocks the calling process until the data has been delivered, and there is no way around this by normal means; reading is synchronous by necessity. The process can't say "Well ... if I have to wait for this data ... I'll just take some other data that is already available".
So: processes waiting until the write call returns on one side, missing guarantees and no feedback on the other side. And reads are synchronous anyway. Even more devastating: a blocking write or read call stalls its thread while it waits for all the stuff between the OS and the disk to complete, no matter how many parallel disks or filesystems you have. How do you get out of this challenge?
There is a way around this, and it is asynchronous I/O. The reads and writes are still synchronous or asynchronous in the sense of blocking or non-blocking semantics, so you can have a synchronous write within an asynchronous I/O model. This is important. As a colleague said years ago: you can configure Oracle to use an asynchronous I/O model, but you can't tell Oracle to use non-synchronous write semantics. Remember the D.
To use this model, there are a number of calls in the OS that support asynchronous I/O.
The write call aiowrite and the read call aioread return right after you issue them. That sounds like asynchronous write semantics, but there is an important difference: the read or write now runs concurrently with your thread, and while aioread and aiowrite return immediately after being issued, you do, contrary to the asynchronous write semantics, get feedback about the result of the operation.
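As an illustration, here is a minimal sketch of this issue-and-return pattern. It uses the portable POSIX AIO interface from <aio.h> (aio_write and friends) as a stand-in for the Solaris-native aioread/aiowrite calls discussed here; the file name is made up:

    /* Minimal sketch of the non-blocking issue-and-return semantics,
     * using the portable POSIX AIO interface (<aio.h>) as a stand-in
     * for the Solaris aiowrite() call discussed in the text. */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        static char buf[] = "some data\n";
        int fd = open("/var/tmp/aiodemo", O_WRONLY | O_CREAT, 0644);
        if (fd == -1) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf) - 1;

        /* This call returns immediately; the write itself proceeds
         * concurrently in the background. */
        if (aio_write(&cb) == -1) { perror("aio_write"); return 1; }

        /* ... the thread is free to do other work here ... */

        /* Wait for completion before exiting; the different ways to
         * collect the result are discussed below. */
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);
        printf("write result: %zd\n", aio_return(&cb));
        close(fd);
        return 0;
    }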
There are essentially two important ways to obtain this information. The first is to call the aiowait function, which returns as soon as an outstanding read or write triggered by aiowrite or aioread completes. It waits either forever (until an outstanding aio request completes), not at all ("Has something completed? ... No ... okay ... let's move on"; you use this mode to implement a polling mechanism), or for a certain time.
The other way is to install a signal handler. Whenever an I/O request made by those calls completes, the signal SIGPOLL is sent, and the signal handler can then dequeue the notification of the completed I/O request with aiowait (in fact, it has to, because that's the only way to get those notifications out of the queue).
However: both mechanisms notify you (SIGPOLL fires, aiowait returns) when one of the outstanding asynchronous requests has completed; they don't report which I/O call it was. You have to find out yourself. Most often this is done by scanning the result buffers of the aiowrite/aioread requests after initializing them with a value expressing "in progress". When an I/O request has been completed, its buffer is set to something else, so you just have to scan for result buffers that are no longer in the "in progress" state.
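Expressed as a sketch with the same portable POSIX AIO interface (an assumption for illustration; the Solaris-native calls would use aiowait and aio_result_t instead), the scanning pattern looks like this, with aio_error() playing the role of the "in progress" marker by returning EINPROGRESS until a request has completed:

    /* Sketch: find the completed request among several outstanding
     * ones, as described above. With POSIX AIO the "in progress"
     * marker is aio_error(), which returns EINPROGRESS until the
     * request completes. The file name is made up for illustration. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    enum { NREQ = 2 };

    int main(void) {
        static char bufs[NREQ][16] = { "first block\n", "second block\n" };
        struct aiocb cb[NREQ];
        const struct aiocb *list[NREQ];
        int fd = open("/var/tmp/aiodemo", O_WRONLY | O_CREAT, 0644);
        if (fd == -1) { perror("open"); return 1; }

        for (int i = 0; i < NREQ; i++) {
            memset(&cb[i], 0, sizeof(cb[i]));
            cb[i].aio_fildes = fd;
            cb[i].aio_buf    = bufs[i];
            cb[i].aio_nbytes = strlen(bufs[i]);
            cb[i].aio_offset = (off_t)(i * 16);
            if (aio_write(&cb[i]) == -1) { perror("aio_write"); return 1; }
            list[i] = &cb[i];
        }

        int done = 0;
        while (done < NREQ) {
            /* Blocks until at least one outstanding request completes
             * (the counterpart of aiowait in the text). */
            aio_suspend(list, NREQ, NULL);

            /* Scan all requests: which one is no longer "in progress"? */
            for (int i = 0; i < NREQ; i++) {
                if (list[i] != NULL && aio_error(&cb[i]) != EINPROGRESS) {
                    printf("request %d finished: %zd bytes\n",
                           i, aio_return(&cb[i]));
                    list[i] = NULL;   /* don't look at it again */
                    done++;
                }
            }
        }
        close(fd);
        return 0;
    }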
Translated into the real world, it's like dispatching a steady stream of tasks to your employees, except that you don't monitor their progress yourself: you have some colleagues for that. Those colleagues just send you a notification "task xyz has been completed with this result" and can throw away the piece of paper with the task as soon as this notification has been sent to you. You could collect the notifications at a jour fixe, you could wait in front of your colleagues' desks, or you could wait five minutes for completion notifications and then do something else for the next hour.
The advantage of asynchronous I/O is obvious: the application can issue I/O requests without waiting for the completion of others, and you can even issue asynchronous reads. A single application thread can thereby have multiple I/O requests in flight. In addition you still get feedback that an I/O request has been successfully completed, so you can use this model in areas where you would have opted for synchronous, aka blocking, write semantics in the past. Important to know: your application must be enabled by its developers to use an asynchronous I/O pattern.
Wednesday, January 12. 2011
Mike Swingler of Apple announced yesterday that Apple has started to contribute to the OpenJDK project:
I'm very happy to let you know that today we made the first public contribution of code to the OpenJDK project for Mac OS X. This initial contribution builds on the hard work of the BSD port, and initially has the same functionality. Today's contribution simply modifies the build process to create universal binary, and produces a .jdk bundle which is recognized by Java Preferences and the JVM detection logic in Mac OS X.
You will find a list of important links regarding this project in the announcement mail.
Friday, October 29. 2010
Sometimes I can observe trends in reader interest in the webserver logs: lately there have been a lot of accesses to an article dating back to September 2007, in which I talked about SSL acceleration and made the case for the T2.
In the first few seconds I asked myself why there was this new interest in the article. But then I just thought: "Damned, Firesheep".
Just in case you haven't heard about it: with Firesheep you can hijack user sessions by sniffing the unencrypted data flow in your network. This works because the authentication form itself is often encrypted, while the communication afterwards is not. The authorization to access those pages with your login credentials is carried forward by cookies, and when you are able to gather such a cookie, you can take over the session of a user. This isn't really a new attack vector, but Firesheep makes it really easy ... script-kiddie easy. It's integrated into Firefox as a plugin, and hijacking other people's sessions is as easy as changing tabs. If you want an explanation, please look at Eric Butler's presentation "Hey Web 2.0: Start protecting user privacy instead of pretending to". The attack vector is based on the fact that someone can sniff your data. You may think at first "Well, we have switches, no sniffing", but think about unencrypted (and I consider WEP as unencrypted) WLANs.
There is just one solution: encrypt everything. HTTPS to the rescue. While this isn't a problem on the client side, it creates a lot of problems on the server side. Like the problem that virtual hosts each need their own IP address (there is a hen-and-egg problem in SSL). Or that you have to provide the other end of the cryptographic connection, and thus have to do all the computation on the server side.
I'm still pretty sure that SPARC T-class systems and Solaris can really help you here. The T-series because of its cryptographic units: it isn't just a rather small extension like the additional instructions in Nehalem to accelerate AES, it's a full-fledged crypto coprocessor. In the case of the SPARC T3 you have 16 of them, and they got a significant update to accelerate more ciphers and to handle many small packets much better. There is an interesting presentation available about this piece of silicon in the T3. The T3 already contains the third generation of crypto acceleration.
However, hardware is nothing without software, and thus I want to point your view to kssl, an in-kernel proxy for SSL. This in-kernel proxy uses the hardware acceleration of the T3 via the Solaris Cryptographic Framework (SCF), another nice thing that provides the abstraction between the hardware (or non-existing hardware, when you do cryptography on your normal CPU) and the services using cryptography. Why is the in-kernel proxy so useful? First, it's easy to configure. More importantly, you don't have to hop between user- and kernelspace, as you do when a usermode library like OpenSSL uses the kernel driver of your hardware. This gives you additional performance.
This kssl and SCF combination is, however, not limited to the T-class. It's available on x86 Solaris as well, and if you purchase a supported crypto acceleration card, kssl uses it via SCF, too. However, a PCI-based card can never be as fast as crypto accelerators sitting directly on the die and on the core like the ones in the T-class.
Monday, February 8. 2010
In some discussions via mail and in some community forums linking to the articles about deduplication and hashing there was a slight misunderstanding, so I should explain some things.
Continue reading "To dedup or not to dedup - that results in a lot of questions"
Friday, January 1. 2010
Frequent readers of my blog know that I'm using a notebook via iSCSI to emulate an SSD. At the moment my "SSD" is an old Acer Aspire notebook. It works reasonably well: it's much faster than a hard disk, and I can test the behavior of applications with an SSD without buying one.
Continue reading "Selfmade SSDs - or: the tale of thinking too complex ...."
Friday, November 6. 2009
On December 16th I will talk about Hadoop on Sun at the Apache Hadoop Get Together in Berlin. The talk will be divided into three parts:
My presentation will not be the only one: Richard Hutton from nugg.ad will talk about data processing with Hadoop in his presentation "Moving from five days to one hour", and Nikolaus Pohle from Nurago will talk about data analysis with Hadoop for online market research.
Tuesday, October 20. 2009
Yesterday the usual news outlets speculated about the rumours circulating about the reasons for the data loss at Danger. For the uninitiated: Danger is the company behind the Sidekick.
TG Daily started by reporting information provided by a reader in the article "Oracle, Linux, Sun to blame for Sidekick Danger data loss":
TG Daily reader Tommy T informed us earlier today that: “Microsoft was running on Sun Solaris/Linux/Oracle-based Danger servers at the Verizon Business Center in San Jose as part of a contractual obligation to T-Mobile.”
That isn't really a surprise. Despite all the press fuss, many, many sites still deploy and use Solaris for their mission-critical stuff, while using Linux for stuff that scales well on clusters.
Of course Microsoft was asked for the reasons and just reported:
Sidekick runs on Danger’s proprietary service that Microsoft inherited when it acquired Danger in 2008. The Danger service is built on a mix of Danger created technologies and 3rd party technologies.
I think that's a good answer: we don't blame anyone, but we tell the world that the application wasn't developed at Microsoft and the infrastructure wasn't Microsoft either. Especially as Microsoft doesn't need additional doubts about the feasibility of its operating and infrastructure environment after it lost the .NET implementation at the London Stock Exchange in favor of a Solaris/Linux implementation. I don't expect Microsoft to shed more light on the situation.
And it isn't as if they were the only shop in the world using RAC on Sun. The Inquirer expresses the situation in a nice way: "Oracle and Sun fingered for Sidekick fiasco".
That would all be well and good but for the fact that the Sun-Oracle combo is not exactly rocket science and is popular among those who manage very large, mission critical corporate databases.
I think the truth about this outage is much more complex than the various media outlets state. IT systems are called systems for a reason: they are large, complex interrelationships of components. There is a database, there is a server, there is storage, there is a SAN, there is a network. And in any system there are many devils in the details. Whoever says something different simply lies.
But in the end it's pretty irrelevant. It's like with aircraft crashes: there isn't a single reason why an aircraft crash happens. When there is a major CNN moment at a service, there are several reasons why it happened. It may start with a small hiccup, and other effects lead to a disaster.
In the end, a large heap of rumors surrounds this situation. Daniel Eran Dilger writes in "Microsoft’s Pink/Danger backup problem blamed on Roz Ho":
According to the source, the real problem was that a Microsoft manager directed the technicians performing scheduled maintenance to work without a safety net in order to save time and money. The insider reported:
As far as I understand the rumor, they were making a backup, deleted the only other backup in the course of this procedure, then stopped the running backup and did a firmware update. Sorry ... I don't even do an update of my notebook without starting a backup first, and my blog replicates its data every six hours to a system in my office at home.
A backup is a safety net. You don't do any task that could put your data at risk without having a backup. And you should better test the restore, too. Of course this takes time: for my blog a few minutes ... at Danger a few days. But then you can be sure you won't appear on CNN. (Okay, I wouldn't appear on CNN, but I would get some questions in our coffee kitchen at work.) Sh.t happens, customers understand that, but they don't accept that you don't value their data. Perhaps this is the real story of this mishap.
“In preparation for this [SAN] upgrade, they were performing a backup, but it was 2 days into a 6 day backup procedure (it’s a lot of data). Someone from Microsoft (Roz Ho) told them to stop the backup procedure and proceed with the upgrade after assurances from Hitachi that a backup wasn’t necessary. This was done against the objections of Danger engineers.
I really can't imagine a single reason why an HDS/HP/EMC/Sun storage support engineer would say: "Hey, you don't need a backup. We will do the firmware upgrade without it." I can only imagine that the engineer said: "Well ... I have done several upgrades so far and there wasn't a problem that sent me to the tapes."
By the way: there is one thing that keeps me really puzzled. It's a piece of information at The Register:
It involves 20 or so CentOS Linux servers [...]
Sorry, CentOS is a nice variant of Linux, but I don't really think it's a good idea to use a Linux distribution with at best questionable commercial support (the "Commercial Support" menu item at centos.org delivers a page stating that there is no content) in a mission-critical environment. I know that an enterprise Red Hat or SUSE distribution/support agreement costs you some bucks, but in such an environment I would opt for the professionally supported variant. But again: this is an example of an attitude, and this attitude is not one of valuing the data of the users.
A hiccup in a service most often has its reason in the realm of technology, but a real disaster has more reasons, deeper reasons; just 10% of them are technological. Or to say it more precisely: a disaster can be root-caused by a technical reason, but the step from a hiccup to a disaster always needs reasons outside the realm of technology. And if the rumors surrounding the Danger outage are true, this outage is one of the best examples to show it.
Monday, October 12. 2009
Sun has now announced the F5100 officially. The specs are really impressive: the largest version with 80 FMods yields 1.6 million read IOPS and 1.2 million write IOPS with 4k blocks, 12.8 GB/s read and 9.7 GB/s write bandwidth, and 1,920 GB of capacity, while drawing 386 watts when active. More specs are available at the Sun specs page for the Sun Storage F5100 Flash Array. Furthermore, the documentation for this device should reappear soon at docs.sun.com; until then you will find some additional information about the device at Sun's website. The device isn't as expensive as you may think: it starts at $45,995 (with 20 flash modules giving you 480 GB, 397K IOPS read and 304K IOPS write).
When you look into such a device you will see 80 densely packed FMods.
At the back of the device you will find its SAS connectors. There are four domains with four SAS connectors each. For a maximum-performance configuration you configure 5 FMods and one SAS connector into a zone.
Most important: the FMods don't each carry a battery to protect the caches. This is done centrally by supercapacitors in so-called Energy Storage Modules, which you plug into the front of the device.
Just a warning: those ESMs are quite potent supercaps. Read the manual with care before handling these modules. You need a lot of power to protect all the caches, and the ESMs are capable of delivering this power.
Sunday, October 11. 2009
Robin Harris points to an interesting study about DRAM failures in his blog StorageMojo (BTW: Robin's blog is really a great read). In his article he points to the paper "DRAM Errors in the Wild: A Large-Scale Field Study", written by Bianca Schroeder (University of Toronto), Eduardo Pinheiro and Wolf-Dietrich Weber (both from Google).
Some of the numbers are really terrifying: 4.15% unrecoverable errors for one of the platforms is much more than I had thought, and I'm somewhat conservative in how far I trust hardware. Furthermore, hard errors (as in "bit permanently flipped; put the module in the trash bin") are a vastly more common cause of errors than most people think.
As a sidenote: after the discussion about DRAM prices in the M3000 I got some flak because of the prices of memory of different quality grades, and many comments along the lines of "a DIMM has never failed in my home PC". But given that Google is said to use cheaper hardware, and given the amount of errors (especially the unrecoverable ones), there may be a point behind Sun's fixation on memory quality.
Friday, October 9. 2009
A reader thinks it makes no sense to use the STEC SSDs and that we should switch to the Intel X25-E drives. That sounds reasonable at first, as the X25-E is much cheaper than the STECs. But as usual the devil is in the details. So why do many people still use the more expensive STECs? Do they have too much money? Are they morons? I will tell you a dirty little secret: no, not at all. When you take it really seriously, the X25-E isn't enterprise-ready. At least in its default setting.
Continue reading "Somewhat stable Solid State "
Tuesday, October 6. 2009
Nice to see that I'm not the only one who thinks that IBM will run into the same challenges as Sun in regard to massively multicore processors. Great to see this position somewhat confirmed by someone who isn't known as friendly to Sun and is said to be firmly on the blue side.
BlueToTheBone writes in his column "The Four Hundred" about Moore's Law and the Performance Wall:
Well, with the Power7 chips coming next year, IBM has to get the multithreading fixed in DB2 for i or get a whole lot of excuses ready for why customers buy more cores and threads, running at lower clock speeds, and don't see performance go up.
But he points to an even more interesting issue, one that isn't really known territory to an open-systems guy like me. Aside from the "bytecode-compiled is slower than iron-compiled" stuff (which hasn't been true since the invention of just-in-time compilers), he has a very valid point: much of the software is really old, and it wasn't written for environments with hundreds of cores. We learned the hard way that there was a vast amount of pitfalls in the open-systems world, which has been thinking multithreaded/multiprocess for quite some time now. Now software developed in RPG and COBOL (with many code lines originating from a time when many of us weren't even a glint in the eyes of our parents) hits, with Power7, an environment that doesn't fulfill the promise of ever-increasing single-thread performance. BlueToTheBone comes to a conclusion similar to the one I made a while ago: perhaps many applications will stay on Power6 while newer developments move to Power7:
It might even mean putting off a move to Power7 iron and sticking with Power6 or Power6+ boxes as you dig through your code and see how parallelization can and cannot be used to make your applications run faster as well as offer more capacity to support more work.
Monday, October 5. 2009
Looks like IBM is somewhat concerned about the Exadata V2 announcement and the announcement that is likely to happen on October 14th. At least there is a loud "We have it, too - kind of" from Armonk. They tout their product IBM DB2 pureScale: Power machines with AIX, some InfiniBand gear and a special edition of DB2.
Some educated guesses
Mr. BlueToTheBone is right: details are sparse at the moment. But just to write down some points you'll want to have a deeper look at after an official announcement; those points are purely speculative, given that the information in the El Reg article is really sparse:
There are some other points, but those are much deeper in the realm of speculation, so I'll just wait for the announcement before commenting further.
Frankenbase
Some people say that Exadata V2 in its current incarnation was just rushed together in the light of the upcoming merger. But the stuff that IBM told exclusively into the ears of BlueToTheBone sounds like it was kludged together just to counter a competitor's announcement. I'm sure that IBM will show off a strange configuration, calling it DB2 pureScale. But Frankenstein's creation will look like a beauty beside this solution.