PoempelFox Blog


Fri, 19. Mar 2010

zone_reclaim_mode is the essence of all evil Created: 19.03.2010 00:00
For months, I've been trying to track a problem we had on one of our new (mini-)clusters. Symptoms included (but were not limited to):
  • Trivial system calls, such as reading a file in /proc, took a full second to execute.
  • Programs would hang indefinitely doing basically nothing, sucking up 100% CPU while doing that. As it turned out later, they were not completely stuck, they progressed at a rate of about one instruction per second.
  • All symptoms could turn up at any time, but could also completely disappear again within a second.
  • After a reboot, the symptoms would usually not show again for some time.
  • Running a program that requested too much memory (thus invoking the OOM killer) would greatly increase the chances of the problems appearing afterwards.
  • Pinning a program to a single core, i.e. restricting it to run only on one specific core (via "taskset"), would usually make it run normally - just which core was the "right" one would vary. This also worked with programs that were already running and "stuck": they could be revived by moving them to some other core externally.
  • Turning OFF swap would greatly decrease the likelihood of the problems appearing.
  • Using a vanilla kernel instead of the Ubuntu 8.04 LTS 2.6.24 kernel would decrease the likelihood of the problems appearing, but not by much.
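The pinning trick from the list above can be done with plain taskset from util-linux; the core numbers and the PID here are of course made-up examples:

```shell
# Start a program pinned to core 3 (core number is an arbitrary example):
taskset -c 3 ./some_program

# "Revive" an already running, stuck process by moving it to the cores
# of the other socket (PID 12345 and cores 4-7 are examples):
taskset -cp 4-7 12345
```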
I don't know why it took me so long to find the problem - I should have tracked it down faster, because I had narrowed it down to memory allocation problems pretty quickly. But from there it still took quite some time to locate the actual cause.
When I say "memory allocation problems", I do not mean that the machine was out of memory. That would of course explain all sorts of strange behaviour, but it wasn't the case: the machines were never out of memory, they always had a few gigabytes free. They just behaved as if they were out of memory.
The machines were dual-socket "Nehalem" Xeon nodes with 12 GB of RAM. We had another test cluster with basically identical nodes, just from a different manufacturer and with 24 GB of RAM. We were never able to reproduce the strange behaviour there, no matter how hard we tried - not even by making the machines really run out of memory.
The thing about these dual-socket "Nehalems" is that they are ccNUMA nodes: each processor has some memory attached locally, and access to the memory attached to the other CPU goes through that other CPU. This is transparent to software, but due to the additional hop it is a bit slower, more "expensive". Linux understands this and tries to keep memory allocations local on such machines. It will, however, allocate memory from the other socket if it has no choice - at least that is what it usually does.

Not in this case, though: as it turned out, whenever the symptoms appeared, one socket was out of memory, and Linux simply refused to get memory from the other socket. That is also why pinning processes helped: pinning the program to a core on the other socket meant the memory there was suddenly "local", Linux would happily start allocating from it, and the execution of the stuck program would continue. Why allocating from the other socket worked as designed for hours, days, sometimes weeks before suddenly and completely breaking remained unclear, however.
I finally traced it all to zone_reclaim_mode. This is a kernel setting (available through /proc/sys/vm/zone_reclaim_mode) that is supposed to activate a mode where the kernel will first try to reclaim local memory before resorting to memory on other sockets. Apparently, the implementation of this is completely broken, at least in 2.6.24. What it does in reality is that, under certain conditions (probably when memory has become a little fragmented?), it leads to permanent attempts to free local memory when there is nothing left to free, sucking up all available CPU in the process, and it never resorts to the other socket, which has more than enough memory available.
So why did we see this problem only on one cluster and never on the other, even though both had the same kernel settings? Well, WE never turned that zone_reclaim_mode crap on. Linux (the kernel, not the distribution!) does that automatically during boot when it detects that getting pages from another socket ("zone") is too expensive. And while the kernel decided it was too expensive on the one cluster, it did not on the other. Apparently, the condition for "too expensive" is a node distance of >20: numactl --hardware lists the distance to the remote node as 21 on the one cluster, and as 20 on the other. The kernel turns zone_reclaim_mode on automatically on the nodes where the distance is 21, and leaves it off on the ones with distance 20.
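You can check what the kernel decided on a given node like this (the numactl call is guarded in case the tool is not installed; the distance values you see will of course depend on your hardware):

```shell
# Print the NUMA topology, including inter-node distances - on our
# nodes the remote distance was either 20 or 21:
command -v numactl >/dev/null && numactl --hardware

# The kernel's boot-time decision (0 = zone reclaim off, nonzero = on):
cat /proc/sys/vm/zone_reclaim_mode
```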
Luckily, a simple sysctl.conf entry fixes this nonsense by turning zone_reclaim_mode off. Not turning this experimental piece of crap "feature" on in the first place would have been even better, though - it would have saved me months of annoying searching.
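For reference, the sysctl.conf entry in question looks like this (the same value can also be set at runtime by writing to /proc/sys/vm/zone_reclaim_mode):

```
# /etc/sysctl.conf - never let the kernel enable zone reclaim
vm.zone_reclaim_mode = 0
```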
There is hope that the kernel developers might fix, or have already fixed, this feature: the changelog for lists several bugfixes for it, but the author does not seem sure he has caught them all, and asks for bug reports if problems still arise. Luckily, it will be some time before we switch to a kernel >, so I will not be a guinea pig again anytime soon.
Thanks for the post, really helpful! I guess this happens on multi-cpu systems, 20+ cores.
mAcRoS 02.11.2012 11:35

This post has poisoned the well for a lot of applications which *need* zone-aware allocation. You won, and the setting is disabled by default, but now your post is inspiring a lot of people out there to resist the setting even when testing shows their application needs it.

zone_reclaim_mode is for HPC workloads: tightly synchronized lock-step codes which cannot tolerate variations in memory latency and which cannot benefit from VFS caching. HPC may be a minority of Linux workloads, but it's important.

It's not a "crap feature", just a poor selection of defaults. I continue (going into 2016) to see this article used to justify a lot of new factually incorrect FUD and poor design choices.
Phil 17.12.2015 22:22

I'm well aware that HPC is an important workload, as this post was about one of our clusters which are used exclusively for HPC.

However, I doubt there is a large share of HPC workloads that NEEDS this setting. It would only be useful if your workload did NOT allocate most of its memory on startup (first touch) and at the same time did a lot of I/O that fills the VFS cache. Of course, you do need to make sure that memory is not still filled by cache usage from a previous job when a new job starts - on our clusters, the batch system tries to ensure that (echo 3 > /proc/sys/vm/drop_caches).
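That cache cleanup is essentially a one-liner in a batch-system prologue script. A minimal sketch (the hook mechanism itself depends on your batch system, and writing to drop_caches needs root, hence the guard):

```shell
#!/bin/sh
# Sketch of a job-prologue step: write out dirty data first, then drop
# page cache, dentries and inodes left over from the previous job.
sync
if [ -w /proc/sys/vm/drop_caches ]; then
    # 3 = free page cache + dentries/inodes (harmless, only drops caches)
    echo 3 > /proc/sys/vm/drop_caches
fi
```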

As for using this article as justification: this entry does state quite clearly which (by now extinct) kernel versions it applies to, and that newer kernels could contain fixes. We do have zone_reclaim_mode enabled on our newer cluster, and so far it has not exploded like it did on old kernel versions (AFAICT).
PoempelFox 14.01.2016 08:09

Setting zone_reclaim_mode=1 may create what I like to call "paging bouncing" on your system. If you use 'sar -B', you can see paging numbers bounce from low to really insanely high values. Additio (unfortunately, the rest of this comment was cut due to admin error)
Teddy Knab 27.11.2018 19:04


EOPage - generated with blosxom