|zone_reclaim_mode is the essence of all evil||Created: 19.03.2010 00:00|
For months, I've been trying to track down a problem we had on one of our new (mini-)clusters. Symptoms included (but were not limited to):
When I say "memory allocation problems", I do not mean that the machine was out of memory. That would of course explain all sorts of strange behaviour, but that wasn't the case: The machines were never out of memory; they always had a few gigabytes free. They just behaved as if they were.
The machines were dual socket "Nehalem" Xeon nodes with 12 GB of RAM. We had another test cluster with basically identical nodes, just from a different manufacturer and with 24 GB of RAM. We were never able to reproduce the strange behaviour there, no matter how hard we tried, not even by making the machines really run out of memory.
The thing with these dual socket "Nehalems" is that they are ccNUMA nodes: Each processor has some memory attached locally, and access to memory attached to the other CPU goes through that other CPU. This is transparent to software, but due to the additional hop it is a bit slower, more "expensive". Linux understands this and tries to keep memory allocations local on such machines. It will, however, allocate memory from the other socket if it has no choice. At least it usually does, but not in this case: as it turned out, whenever the symptoms appeared, one socket was out of memory. It was as if Linux simply refused to get memory from the other socket. That was why pinning the process helped: Pinning the program to a core on the other socket meant the memory there was suddenly "local", so Linux would happily start to allocate memory from there, and the stuck program would continue executing. However, why allocating from the other socket worked as designed for hours, days, sometimes weeks before suddenly breaking horribly and completely remained unclear.
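The pinning workaround described above can be sketched in a few lines of Python. This is only an illustration, not the tool that was actually used on the cluster: the helper names parse_cpulist and pin_to_node are made up for this example, and it assumes a Linux system that exposes the per-node CPU list under /sys/devices/system/node/nodeN/cpulist.

```python
import os


def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-3,8-11' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(pid, node, sysfs_root="/sys/devices/system/node"):
    """Restrict process `pid` to the CPUs of NUMA node `node` (Linux only).

    Hypothetical helper: it reads the node's CPU list from sysfs and
    hands it to os.sched_setaffinity().
    """
    with open(f"{sysfs_root}/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)
    return cpus
```

Calling pin_to_node(pid, 1) for a stuck process would move it to the CPUs of node 1, which makes that node's memory the "local" one for new allocations.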
I finally traced it all to the zone_reclaim_mode. This is a setting in the kernel (available through /proc/sys/vm/zone_reclaim_mode) that is supposed to activate a mode where the kernel will first try to reclaim local memory before resorting to memory on other sockets. Apparently, the implementation of this is completely broken, at least in 2.6.24. In reality, under certain conditions (probably when memory got a little fragmented?), it leads to endless attempts to free memory when there is nothing to free, sucking up all available CPU doing that, and never resorting to the other socket that has more than enough memory available.
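To see whether a given machine has the mode switched on, it is enough to read that file. A minimal sketch, assuming a Linux system where /proc/sys/vm/zone_reclaim_mode exists; the function names are invented for this example, and the only interpretation used is that 0 means "off" and anything else means "on":

```python
def parse_zone_reclaim_mode(text):
    """Parse the contents of the zone_reclaim_mode file into an integer."""
    return int(text.strip())


def zone_reclaim_enabled(path="/proc/sys/vm/zone_reclaim_mode"):
    """Return True if the kernel currently has zone reclaim switched on."""
    with open(path) as f:
        return parse_zone_reclaim_mode(f.read()) != 0
```

Running zone_reclaim_enabled() on each node of both clusters would have shown the difference between them right away.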
So why did we see this problem only on one cluster and never on the other, even when both had the same kernel settings? Well, WE never turned on that zone_reclaim_mode crap. Linux (the kernel, not the distribution!) does that automatically during boot when it detects that getting pages from another socket ("zone") is too expensive. And while the kernel decided it was too expensive on the one cluster, it didn't on the other. Apparently, the condition for "too expensive" is ">20". On the one cluster, numactl --hardware lists the distance to the remote node as 21, on the other it is 20. The kernel turns on the zone_reclaim_mode automatically on the nodes where the distance is 21, and leaves it off on the ones with distance 20.
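The decision described above can be mimicked from user space. The following is a sketch of the rule as observed here, not the kernel's actual code: the threshold of 20 is inferred from the two clusters, the function names are made up, and it assumes the node distances are readable from /sys/devices/system/node/nodeN/distance (the same numbers numactl --hardware prints). Other kernel versions may use a different threshold or not enable the mode automatically at all.

```python
import glob

# Assumption taken from the observations in this post: a remote distance
# above 20 is what makes the kernel consider remote memory "too expensive".
RECLAIM_THRESHOLD = 20


def parse_distance_row(text):
    """Turn one sysfs distance line such as '10 21' into [10, 21]."""
    return [int(field) for field in text.split()]


def read_node_distances(sysfs_root="/sys/devices/system/node"):
    """Read the distance row of every NUMA node found under sysfs."""
    rows = []
    for path in glob.glob(f"{sysfs_root}/node[0-9]*/distance"):
        with open(path) as f:
            rows.append(parse_distance_row(f.read()))
    return rows


def would_enable_zone_reclaim(rows, threshold=RECLAIM_THRESHOLD):
    """Return True if any node distance exceeds the threshold."""
    return any(d > threshold for row in rows for d in row)
```

With the values from the two clusters, would_enable_zone_reclaim([[10, 21], [21, 10]]) is True, while would_enable_zone_reclaim([[10, 20], [20, 10]]) is False.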
Luckily, a simple sysctl.conf entry fixes this nonsense by turning off zone_reclaim_mode. Not turning this experimental piece of crap "feature" on in the first place would however have been better, as it could have saved me months of annoying search.
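For reference, the sysctl key that corresponds to /proc/sys/vm/zone_reclaim_mode is vm.zone_reclaim_mode, so the sysctl.conf entry in question is the line "vm.zone_reclaim_mode = 0". The same thing can be done at runtime by writing 0 to the proc file. A minimal sketch of that, assuming root privileges on a Linux machine; the function name is invented for this example, and the change does not survive a reboot unless the sysctl.conf line is also present:

```python
def disable_zone_reclaim(path="/proc/sys/vm/zone_reclaim_mode"):
    """Switch zone reclaim off by writing 0 to the kernel setting.

    Runtime equivalent of the sysctl.conf line 'vm.zone_reclaim_mode = 0'.
    Writing to the real file under /proc requires root.
    """
    with open(path, "w") as f:
        f.write("0\n")
```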
There is hope the kernel developers might fix or have fixed this feature already: The changelog for 188.8.131.52 lists several bugfixes for it, but the author doesn't seem to be sure to have caught them all, and asks for bug reports if problems still arise. Luckily, it will be some time before we change to a kernel >184.108.40.206, so I will not be a guinea pig again anytime soon.