Zebras, anchors, bottlenecks and link aggregation
- Rationale of this article
- About link aggregation
- Searching for a pattern
- Being anchored
- Searching for the problem
Diagnosing problems with computers has a lot in common with diagnosing illnesses (that’s the reason why I’m working on an article about patching and vaccination and their inherent similarities). Therefore I think you can reuse parts of the mindset of people who heal other people: concepts and ideas that doctors introduced to prevent diagnostic errors. This article covers two of them.
Years ago I read the book „House of God“. It’s a book about the life of medical interns (whoever frequently watches US series set in hospitals knows that this is a part of medical education in the US after medical school). It’s quite an interesting read. That said, there was an important term in it which I found very useful for explaining a problem you run into when solving problems. I heard the term later in series like House M.D., ER and Scrubs (yeah, I’m guilty of loving such series). The term was „zebras“. In the book or the shows a zebra is an exceedingly rare, very unlikely diagnosis made when something more general, more common is the correct explanation. The matching admonition is „if you hear hoof beats, think horses, not zebras“. It’s not that you will never find a zebra at the end, but most of the time it’s really a horse, and so this admonition holds a lot of truth.
I found some very strange zebras in my career but when searching for issues there were horses in 99.9% of all cases.
Another concept when thinking about diagnostic errors is anchoring or anchoring bias. It’s a cognitive bias with impact in almost all areas of human life, but I heard about it many years ago in a weekly health news broadcast on radio when driving to work. To get all the news for the day, I frequently listen to NDR Info and in the time slot I was usually driving to Hamburg for work, they have this broadcast „Visite“ once a week.
As music radio stations are exceptionally horrible around Hamburg I usually stay with info radio (or switch to Spotify) and thus got some interesting medical news. You are usually totally fed up with the world („What did Boris Johnson do?“, „Trump didn’t really say that“ or the usual „F*ck AFD!“ thoughts) after listening to news radio for a while, but at least it’s an informed fed up. But I digress from the topic of this blog entry.
I found this concept really interesting. The idea is that doctors can fixate on the early information they get when they try to diagnose the problem. By fixating on them, they fixate on a certain diagnosis, preventing them from finding the correct one, even when over the course of time they get information that points in a different direction contrary to the initial finding. I first heard about it in this medical news broadcast, but the concept is more common. You find this “jumping to conclusions too early when not all information is on the table” in many areas of your life. And often it’s hard to get rid of this anchor.
Rationale of this article
Why am I writing this article? I would like to share my experience in this regard with an example, because this example showed me how important it is to make frequent zebra checks in your thought process: you have to step back for a moment rather frequently while you build your house of working theories to solve your problem.
Imagine the following situation: You configured a 4-port trunk using LACP on a Server A. There is a second server with another 4-port trunk. Both systems are in the same Layer-2 network. Let’s assume those ports are 10 GBit/s. You make a load test, and whatever you do on the systems, it won’t budge: You are still not able to measure more than 10 GBit/s worth of data between the systems. Even with load generator software that doesn’t access any hard disks (so it isn’t held back by disk speed) it doesn’t go higher than 10 GBit/s.
About link aggregation
There is some basic knowledge that you need to have: When working with trunking you always have to keep in mind that traffic isn’t balanced round robin (there are implementations that can do this, but it creates a lot of problems, so it’s generally discouraged). There is a multitude of reasons why you want the packets of one traffic flow to stay in order on the network. When your system is getting a reply from a web server, you don’t want to receive the fourth packet before the first three, because then you have to reorder the packets. So the load balancing of link aggregation or trunking works in a way that all packets of a single network flow are always transmitted over the same cable as long as the port is functional. This ensures that the packets are received in order and prevents this reordering of the data stream in the IP stack.
The interesting question now is how you choose one of the links for your packet in a way that fulfills the condition mentioned above. This is done by employing an algorithm that always delivers the same result for all packets of a single traffic flow. Let’s say the flow transporting your image from Facebook always yields the result 4, for example. The obvious way to do this is to use data that is already in the traffic: information that is used to transport the traffic to its destination, like the Ethernet or IP header, and do some nifty calculations on it.
You can do this on Layer 2 of the networking stack by using the MAC addresses or parts of them as the input for an algorithm determining which member port of an LACP trunk has to be used. You can use the source MAC address, the destination MAC address or both. For example, if you use both, all network traffic with the same combination of MAC addresses will use the same cable in the trunk. So if you use Layer 2, no matter how many IP connections you have and no matter how many IP addresses you have configured on a single MAC address, the maximum speed between two systems is the speed of a single member interface: all traffic using this combination of MAC addresses yields the same result in the algorithm determining the link, thus uses the same link, and so you are bound to the limit of this link.
If you use Layer 3 as an input, it uses the IP addresses (source, destination or both). So all connections between two systems will use the same cable. Example: Everything between 192.168.1.1 and 192.168.1.2 will yield the same result, for example „2“, and thus always use the same member interface of the trunk. No matter how many TCP/IP connections you have.
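To make this a bit more tangible, here is a minimal sketch in Python. It is not the algorithm of any particular switch or operating system; the function `pick_link_l3` and its XOR-and-modulo calculation are assumptions made purely for illustration. What it shows is the property described above: as long as only the IP addresses go into the calculation, every connection between the same two hosts ends up on the same member link.

```python
import ipaddress

def pick_link_l3(src_ip: str, dst_ip: str, n_links: int) -> int:
    """Toy Layer 3 policy (an assumption, not a vendor algorithm):
    XOR both addresses and take the result modulo the number of links."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % n_links

# Sixteen connections between the same two hosts. The ports differ,
# but this policy never looks at them, so they cannot influence the result.
links = {pick_link_l3("192.168.1.1", "192.168.1.2", 4) for _port in range(16)}
print(links)  # a set with exactly one link index
```

Real implementations use different and often configurable calculations, but the consequence is the same for any deterministic function of the two addresses.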
Doing so is a reasonable choice when you are using trunking for links between switches, where a multitude of systems communicates over the trunk. You then have many IP addresses as sources and destinations on your trunk, and it will roughly balance things out.
However today link aggregation is often used as well to connect servers with multiple Ethernet connections to a network and it’s not uncommon that you do it to make the connection to a particular server especially fast … for example a server speaking with an NFS server. A server speaking with a backup system. Something like that. The problem: You speak from one IP address to another IP address.
If you do balancing based on Layer 3 information, it’s one flow for the balancing algorithm, because the source and the destination IP are always the same, and everything will go over one cable. In this case you need additional information to separate the flows from each other, and the obvious choice is the TCP or UDP port number. Again, you can often configure whether the load balancing decision is based on the source port, the destination port or both.
When you have multiple TCP connections running on the aggregation, it can distribute them based on the fact that each connection has a unique combination of source and destination port.
That said, it can make sense to combine the layers, so that data flows that by chance have the same source and destination port, just with different IP addresses, can use different links.
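A small sketch of such a combined policy, again in Python and again with a made-up calculation: `pick_link_l34` is a hypothetical helper that mixes the port numbers into the same toy XOR scheme. The addresses, the source port range and the destination port 2049 are only example values.

```python
import ipaddress

def pick_link_l34(src_ip, dst_ip, src_port, dst_port, n_links):
    """Toy Layer 3+4 policy (an assumption for illustration):
    XOR the addresses and the ports, then take modulo the number of links."""
    ip_part = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    port_part = src_port ^ dst_port
    return (ip_part ^ port_part) % n_links

# Eight connections between the same two hosts, differing only in source port.
used = sorted({
    pick_link_l34("192.168.1.1", "192.168.1.2", sport, 2049, 4)
    for sport in range(50000, 50008)
})
print(used)  # [0, 1, 2, 3]
```

With the ports as additional input, the connections between one pair of IP addresses spread over all four member links. How evenly they spread depends on the ports that happen to be in use, so this is a statistical balance, not a guaranteed one.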
Searching for a pattern
But back to the problem: One thing that caught my eye from the beginning was a kind of perfection. The limits weren’t at some arbitrary numbers, let’s say 13 Gbit/s or something. Those numbers were perfectly aligned to the speed of a single interface. Everywhere. No matter what you measured, you always measured 10 Gbit/s (or to be exact, something close to but less than that).
Why did I write about anchoring before? In this situation I got an initial telephone call with a comment that already included a kind of diagnosis: „Jörg, we have a problem here. Could it be that feature A doesn’t work with feature B? Or do we have a tuning problem here?“
I jumped to conclusions and talked about possible tuning mechanisms („You could use feature C to get better performance“) and tried to prove that feature A either works together with feature B or doesn’t. I anchored on some initial information and some initial ideas without really having enough information.
After hanging up (yeah, still one of those strange words that have mostly lost their meaning, because you don’t put a mobile on a hook) and doing some research in the docs, I thought „What the heck are you doing here?“. I didn’t have any information to support or exclude any of these ideas. I had an anchoring bias.
Sometimes you get anchored by the questions people ask, and you use your preexisting knowledge to help solve the problem they describe and not the problem at hand. You stay at this anchor for a while. You assume.
This bias is only human, even the best fall for it and so mere mortals like me obviously fall for it, too. So I have a certain rule that works for me. After a while I check which information my current understanding and conclusion of a problem are based on. If this is still predominantly the initial information, I put my conclusion aside for possible later use and rethink everything.
That said, sometimes you don’t have to think about an anchored diagnosis. I think when your bones are sticking out of the skin of your leg after being hit by a car, the conclusion is obvious and we don’t have to think about anchoring. If right at the start you see a glaring configuration error that you have seen before and that you know always leads to a problem, you don’t have to think about anchoring for an initial idea.
However in this case I anchored at first due to the initial questions and I had to throw away the anchor. I simply didn’t have enough information.
Searching for the problem
So the problem in this situation seemed to be that you only used the performance of one interface. The obvious idea is: The stuff that should load balance on your system isn’t balancing. Then I looked at the configuration and it was perfectly configured, all status information that you got about this system indicated that the link aggregation was working just fine.
Of course when hearing about this problem and having some preliminary information, there are a lot of zebras here. One zebra would be the theory that even with the balancing algorithm using the most diverse input data (L2+L3+L4+VLAN) the data is always put on the same interface. Because you used this configuration in the past and it always worked, you start to think about rather strange ideas, zebras. And because it’s perfectly set up, your mind strays even more to strange ideas. However, this idea is really a big fat zebra. In a correctly set up system I never saw such a thing. And it was set up correctly.
Well, to be exact … I saw this once. Out of curiosity I carefully constructed a set of TCP/IP connections that really were all balanced onto the same line. But that was a synthetic load. It was specifically constructed to prove the point that LACP trunks aren’t perfectly balanced.
And knowing this is a big fat zebra, I shooed away this thought pretty much immediately after it crossed my mind. Just don’t think about such theories. When all traffic goes over one interface, it’s never because of a very strange property of the data flows that leads a perfectly configured load balancing to always choose the same link. It’s always about a balancing algorithm that is inadequate to balance your data flows. That’s a much simpler explanation, and often it’s really the simplest explanation.
A slightly smaller zebra is: There could be a strange incompatibility between features but then you have to take into consideration that the configuration you are doing is most often not unique and there are others using similar configurations.
So you stand in a herd of zebras: nothing makes sense because everything looks perfect. Before you assume a bug, assume an error in your configuration.
Working Theory 1
You have to step back in this case and simplify the problem. Starting with very simple working theories.
Of course with all the knowledge about trunking you could ask yourself: Is the traffic „load balancable“ with the means of the link aggregation load balancing algorithms? The first working theory is „The traffic might not be suitable for load balancing because it’s just a single connection“. Are there multiple separate traffic flows you could put on different ports? Easy to check with tcpdump. In this case there were multiple connections. And in the test case we knew it beforehand, because we had set up the load generator to use multiple connections, let’s say 4, 8 or 16.
Determining this is sometimes not easy, because you have to know something about system defaults. You have to know how the application establishes its communication with other systems. I know a number of applications and system services that multiplex all their traffic between two systems on a single TCP connection, and when the application works this way, the link aggregation can’t deliver more than the performance of one interface because it can’t distribute the traffic. It is just a single indivisible connection.
It’s really important to know what the communication pattern of your application looks like, and when you don’t know it, check it with a sniffer first. Don’t assume anything here. Don’t assume that you have load that you can balance on separate links.
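If you have a text capture at hand, counting the distinct flows can be done with a few lines of Python. The following is only a sketch: it assumes IPv4 lines in the usual `tcpdump -nn` text format, the helper `distinct_flows` is my own name, and the three sample lines are made up for the example, not taken from the real case.

```python
import re

# Matches the address part of an IPv4 line as printed by `tcpdump -nn`,
# e.g. "IP 192.168.1.1.50000 > 192.168.1.2.2049:"
LINE = re.compile(r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):")

def distinct_flows(lines):
    """Collect the distinct (src, sport, dst, dport) tuples seen in the capture."""
    flows = set()
    for line in lines:
        match = LINE.search(line)
        if match:
            src, sport, dst, dport = match.groups()
            flows.add((src, int(sport), dst, int(dport)))
    return flows

# Made-up sample lines: three packets, but only two different connections.
sample = [
    "12:00:00.000001 IP 192.168.1.1.50000 > 192.168.1.2.2049: Flags [.], length 1448",
    "12:00:00.000002 IP 192.168.1.1.50001 > 192.168.1.2.2049: Flags [.], length 1448",
    "12:00:00.000003 IP 192.168.1.1.50000 > 192.168.1.2.2049: Flags [.], length 1448",
]
flows = distinct_flows(sample)
print(len(flows))  # 2
```

If this number is 1 for the traffic you care about, there is nothing the link aggregation could distribute, no matter how it is configured.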
Working Theory 2
Next working theory: „You get the performance of 1 interface out of a 4 interface trunk because the system isn’t balancing and puts everything on one interface“. Setting up link aggregation only tells the system what to do with outbound traffic. The other side makes its own decision based on its own configuration.
So let’s check if the outbound traffic is sent out on multiple links. Let’s check if the system is doing its load balancing. This is easy to check. Log into one server, measure the load on the interfaces. In this example there was outbound traffic on all interfaces. However it summed up to 10 GBit/s. It was indeed load balancing, it sent out data on all interfaces, just not the amount you expected.
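How you read the counters depends on your operating system, so the following Python sketch only covers the arithmetic. It assumes you have taken two snapshots of the cumulative outbound byte counters per interface. The interface names `net0` to `net3`, the 10 second interval and the counter values are invented for illustration (I simply assumed an even split); they are not the measurements from the real case.

```python
def rates_gbit(before, after, interval_s):
    """Turn two snapshots of cumulative byte counters into Gbit/s per interface."""
    return {
        nic: (after[nic] - before[nic]) * 8 / interval_s / 1e9
        for nic in before
    }

# Invented snapshots of outbound byte counters, taken 10 seconds apart.
before = {"net0": 0, "net1": 0, "net2": 0, "net3": 0}
after = {
    "net0": 3_125_000_000,
    "net1": 3_125_000_000,
    "net2": 3_125_000_000,
    "net3": 3_125_000_000,
}

rates = rates_gbit(before, after, interval_s=10)
total = sum(rates.values())
print(rates)   # 2.5 Gbit/s on each interface
print(total)   # 10.0
```

The pattern to look for is exactly this one: traffic on every member link, but a sum that matches the speed of a single link.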
Working Theory 3
As there are always two systems involved, the next theory is obvious: „Perhaps there is no link aggregation on the other server“. Perhaps we have just borked the configuration of the other side. Okay. Easy to check again. A short look into the configuration and status information: everything is fine, a nice and clean LACP-based aggregate. Checking the load on the interfaces: it was indeed balancing on all four connections again, but again it just summed up to 10 Gbit/s.
Working Theory 4
Okay, a step back. Server A is doing outbound load balancing and Server B is doing outbound load balancing. The servers are doing their job just fine in this regard. At this point it would be a zebra to discuss a bottleneck on the servers (that the servers are just not able to push more traffic), for example by discussing interrupt distribution, PCI bus limits or whether the CPU is able to move the traffic. We aren’t there yet. The simpler explanation would be: There must be a bottleneck in between. You can just put this working theory aside for now.
The assumption of correctness
Here is an important experience I made over time: Zebras thrive under the assumption of correctness. Don’t assume it’s not a component until that component has proven its innocence. This isn’t a court: in diagnosing problems, at first everyone is guilty. I know people say „No, it’s not the SAN“ or „No, it’s not the LAN“, augmented with comments like „we had someone here to check error counters“.
This is one important lesson life gave to me: When you assume that something is working correctly without checking, you will always jump to zebras. Never ever consider something fine just because someone has told you it’s fine. I can remember a number of occasions where questioning everything would have saved me a lot of time, and so I doubt everything nowadays. In this case it was tempting to assume that everything is right in the LAN because LACP had set up a perfectly fine link aggregation, and that there must be some weird problem on the servers.
Working Theory 5
Under these circumstances it could be tempting to think about very strange networking problems: borked transceivers, strange spanning trees. After all, you may assume the „everything is alright and well designed“ comment is correct, and on a diagram of the network it may really look that way.
But the best next working theory is based on doubting this: „We have a bottleneck in the network which chokes down the performance. Perhaps there is a connection between the switches in between that is just a 10 Gbit/s link.“ This could be checked and ruled out in a few seconds in our case. Server A and Server B are connected to the very same switch. There is just the backplane in between.
Beware of the zebra: It must be something with the backplane. Not today, not with 10 GBit/s on any halfway decent switch. And anyway … the admins tell you that they have ample bandwidth between their switches with big trunks and fat 100 Gbit/s pipes. Even if the traffic had to leave the switch, there would be no bottleneck at 10 GBit/s. And as I wrote before, all the tests showed this perfect 10 Gbit/s limit.
Getting closer to the horses
Again, take a deep breath. There is one logical conclusion. Remember that I wrote that setting up link aggregation just controls the balancing for outbound traffic and that each system on the way between your endpoints has its own balancing configuration. Is it possible that the switch isn’t balancing the traffic? A short check with some nice diagnostic tools shows: Yup … from the switch’s perspective there is outbound traffic on only one of the interfaces, plus some residual traffic besides the main communication streams between a server and a server acting as a client. Your main traffic flow, for example from your server to your NFS server, isn’t balanced.
A zebra at this point would be to assume some strange kind of bug in the switch. But remember again what I told you about how load balancing is done in networks: with an algorithm that ensures that all traffic of a flow goes over one Ethernet connection. Traffic that arrives inbound at the switch leaves it as outbound traffic, and the switch has to make its own balancing decision.
Working Theory 6
„The data arrives perfectly load balanced in the switch and then the switch makes its own balancing decision, which always uses the same interface.“ Again forget about zebras like the strange pattern I mentioned at the beginning. The obvious question is „What is the algorithm used by the switch to do the load balancing?“. Is the configuration adequate?
And bingo … a moment later the problem was found: the switch was using the source and destination IP as input for the decision which interface has to be used. So all outbound traffic from the perspective of the switch between Server A and Server B is forwarded on a single port between the switch and the server. It uses only one interface to transport the traffic from the switch to the servers, no matter how balanced the traffic is from the servers to the switch.
And with this information everything fits. Whatever you could possibly push into the switch over 4 interfaces, the switch was only able to send 10 Gbit/s to the destination server, because the algorithm used to decide which port has to be used always resulted in the same interface.
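The whole situation can be replayed in a few lines of Python. This is a sketch under stated assumptions: both policies are toy XOR-and-modulo calculations that I made up, the server side is assumed to balance on Layer 3+4 and the switch on Layer 3 only, and the eight flows are example values. The point is only that the end-to-end ceiling is the minimum over both hops.

```python
import ipaddress

LINK_GBIT = 10  # line rate of one member interface
N_LINKS = 4     # member interfaces per trunk

def ip_bits(flow):
    src, dst, _sport, _dport = flow
    return int(ipaddress.ip_address(src)) ^ int(ipaddress.ip_address(dst))

def server_policy(flow):
    """Toy Layer 3+4 policy, assumed for the server side."""
    _src, _dst, sport, dport = flow
    return (ip_bits(flow) ^ sport ^ dport) % N_LINKS

def switch_policy(flow):
    """Toy Layer 3 only policy, standing in for the switch default."""
    return ip_bits(flow) % N_LINKS

def ceiling_gbit(policy, flows):
    """Upper bound for one hop: every member link in use adds one line rate."""
    return len({policy(flow) for flow in flows}) * LINK_GBIT

# Eight example connections from Server A to Server B.
flows = [("192.168.1.1", "192.168.1.2", 50000 + i, 2049) for i in range(8)]

server_to_switch = ceiling_gbit(server_policy, flows)
switch_to_server = ceiling_gbit(switch_policy, flows)
end_to_end = min(server_to_switch, switch_to_server)
print(server_to_switch, switch_to_server, end_to_end)  # 40 10 10
```

The server can fill all four links towards the switch, but the switch squeezes everything onto one link towards the destination, and so the measurement never exceeds the speed of a single interface.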
In this case it was indeed quite an ordinary horse. Just an incorrect configuration by keeping the default active on a switch. Very easy to overlook. And not the first thing you may think about.
In hindsight there was an opportunity to catch the error earlier, and I almost got it at that point. Remember that I wrote in the „Working Theory 2“ section that the outbound traffic was perfectly balanced. At the same time I saw an obscure absence of inbound traffic on an interface: there was no load balancing on the inbound traffic. At this point I could have caught the issue, but I didn’t, because I didn’t step back there and didn’t really think about it. Additionally I got a question right at that moment. Sometimes you don’t see the horse because there are not only zebras there but other horses as well. And sometimes you just don’t see them because someone is looking over your virtual shoulder.
The blog entry had two objectives. On the technical side, you have to check all switches between two systems if link aggregation doesn’t deliver the expected result.
But there is another point: describing a thought process I’m universally using in order to prevent myself from drifting away into the realm of zebras. I’ve learned that deep knowledge about a topic increases both the ability to think about zebras and the susceptibility to them. And this is not totally offset by the experience of many years in the business that tells you that most of the time you should think about horses.
This is the reason why I’m always trying to divide the problem into simple working theories. If this doesn’t lead to a result, there is always time to search for zebras. But don’t do it at first. And this is perhaps the very short summary of a rather long article.