Fast, Easy, Cheap: Pick One

Just some other blog about computers and programming

New OOM Killer Implementation in Linux 2.6.36

As reported on Kernel Newbies, the recently released 2.6.36 kernel contains an almost complete rewrite of the Out Of Memory (OOM) killer algorithm.

The LWN article “Another OOM killer rewrite” has a detailed explanation of the changes.

The new heuristics for fork bomb detection and the changes to child process detection are welcome, but I wonder about the modifications to the calculation of the badness score. It appears that process run-times are no longer considered, only the overall percentage of total memory used. For a multi-job batch processing system, such as the ones run on compute clusters, this is less than ideal.

Consider the following scenario:

  1. Job A, a long-running high-memory job, has been humming along on a node for quite some time

  2. Job B begins to run on the node

  3. Job B allocates a bunch of memory and causes the OOM killer to be triggered

  4. Job A gets killed because it’s using more memory

In the old OOM killer implementation, Job A would be safe because its badness score would be considerably lower than that of Job B by virtue of it having run for a longer period. In the new implementation, it’s selected for killing simply because it’s using more memory.
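The scenario can be put into numbers with a rough Python model of the two scoring approaches. This is a simplified sketch, not the kernel's code: the function names and the job figures are made up for illustration, and factors such as CPU time, niceness, child memory and the /proc adjustments are deliberately left out. It also assumes the same page count can stand in for both the virtual size used by the old score and the resident size used by the new one.

```python
import math

def old_badness(vm_pages, run_time_s):
    """Simplified model of the pre-2.6.36 score: memory size, discounted
    by how long the process has been running."""
    points = vm_pages
    # Runtime is counted in units of roughly 1000 seconds (seconds >> 10).
    run_units = run_time_s >> 10
    if run_units:
        # The score is divided by the integer fourth root of that value.
        points //= math.isqrt(math.isqrt(run_units))
    return points

def new_badness(used_pages, total_pages):
    """Simplified model of the 2.6.36 score: share of memory in use,
    scaled to the range 0..1000. Runtime plays no part."""
    points = used_pages * 1000 // total_pages
    return max(0, min(points, 1000))

# Hypothetical node with 16 GiB of RAM in 4 KiB pages.
TOTAL_PAGES = 16 * 1024 * 1024 // 4
JOB_A = {"pages": 8 * 1024 * 1024 // 4, "run_time_s": 7 * 24 * 3600}  # 8 GiB, 7 days
JOB_B = {"pages": 6 * 1024 * 1024 // 4, "run_time_s": 60}             # 6 GiB, 1 minute

for name, job in (("A", JOB_A), ("B", JOB_B)):
    print(name,
          "old:", old_badness(job["pages"], job["run_time_s"]),
          "new:", new_badness(job["pages"], TOTAL_PAGES))
```

Under these assumptions the old model scores Job B higher than Job A, while the new model ranks them the other way around, which is the reversal described above.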

As a cluster administrator, this is not what I want. In most cases the desired policy is that existing jobs should take precedence over newly launched ones. The old implementation gets this right pretty much by default, but the new one does not.

Speaking of defaults, another modification that comes along with the new badness heuristic is the new oom_score_adj variable in /proc which is meant to replace the oom_adj variable. The new oom_score_adj is simply an absolute value that can be used to linearly alter the badness score. The oom_adj variable in the previous version was a bit-shift of the badness, which I think was more useful for our purposes.
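For reference, both knobs are plain text files under /proc/<pid>/, so setting them from a script is straightforward. The helper below is a minimal sketch: the function name and the proc_root parameter (which is only there so the code can be tried against a scratch directory) are hypothetical, and the ranges are the ones documented for the two files (-17 to +15 for oom_adj, -1000 to +1000 for oom_score_adj).

```python
import os

# Documented value ranges for the two /proc files.
RANGES = {"oom_adj": (-17, 15), "oom_score_adj": (-1000, 1000)}

def set_oom_knob(pid, knob, value, proc_root="/proc"):
    """Write `value` to <proc_root>/<pid>/<knob> after a range check.

    `knob` is either "oom_adj" or "oom_score_adj". Returns the path written.
    """
    low, high = RANGES[knob]
    if not low <= value <= high:
        raise ValueError(f"{knob} must be between {low} and {high}")
    path = os.path.join(proc_root, str(pid), knob)
    with open(path, "w") as f:
        f.write(str(value))
    return path
```

Calling set_oom_knob(os.getpid(), "oom_score_adj", 500) would make the current process a more likely victim. Lowering a value usually requires elevated privileges, so a real deployment would run this as root or from the shepherd's startup script.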

For example, in our cluster we would set the oom_adj value of the node shepherd processes (pbs_mom or sge_execd) to +10 or so. Because a shift of 10 bits multiplies the score by 1024, this ensured that the processes controlling the jobs and, more importantly, their children (the actual jobs) received a badness score several orders of magnitude higher than the system processes that actually keep the nodes alive.
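The difference between a shift and a linear offset is easy to see with a few lines of Python. This is only an illustration: the helper names and the sample scores are invented, and the clamping of the new score to 0..1000 is an assumption about how the adjusted value is bounded.

```python
def apply_oom_adj(points, oom_adj):
    """Old knob: shift the badness score by oom_adj bits."""
    if oom_adj > 0:
        # A zero score is bumped to 1 so the shift still has an effect.
        return max(points, 1) << oom_adj
    return points >> -oom_adj

def apply_oom_score_adj(points, oom_score_adj):
    """New knob: add a fixed offset to a score on a 0..1000 scale
    (assumed to be clamped to that range)."""
    return max(0, min(points + oom_score_adj, 1000))

# Old scheme: a small job under a shepherd with oom_adj=+10 still far
# outranks a larger system daemon left at the default of 0.
daemon_old = apply_oom_adj(20000, 0)
job_old = apply_oom_adj(5000, 10)
print("old:", daemon_old, job_old)

# New scheme: the same +500 offset is applied to every job, so two jobs
# keep their relative order and large ones saturate at the ceiling.
print("new:", apply_oom_score_adj(100, 500), apply_oom_score_adj(700, 500))
```

The shift scales a score multiplicatively, so it preserves the ratios between jobs while lifting all of them far above unadjusted processes. The offset only moves scores along a fixed scale.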

Combined with runtime calculations this simple tweak meant that the first processes to be considered for killing would be those launched by the most recently executed job, exactly what we want. With the new algorithm it’s still unclear to me how to achieve the same behavior.

It seems that the new implementation is designed more with desktop usage in mind. It assumes that the most desirable outcome is to reclaim as much memory as possible by killing the fewest processes.

While the invocation of the OOM killer by the kernel can typically be avoided through careful tuning of your resource management / scheduling system, the facilities to enforce the limits are not perfect. Most resource managers implement memory “quotas” through a combination of the system’s ulimits facility and resource tracking by the shepherd process.

The major limitation of the ulimits approach is that it applies on a per-process basis. If a job is granted a limit of 1 GB of memory, nothing stops it from spawning multiple processes, each of which consumes up to the limit. This is one reason why a second level of enforcement is performed by the shepherd process. In GridEngine this is implemented by assigning a unique job-specific extended group ID to the job processes, and then tallying up the memory usage of all processes with that ID. This catches the case where multiple processes together use more memory than requested, but to keep overhead down the usage is only polled every few minutes. A job could still allocate and dirty more than its share of memory between polls and cause the OOM killer to be triggered.
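The group-ID tally described above can be approximated in a few lines of Python by walking /proc. This is a sketch of the general idea, not GridEngine's actual implementation: the function name and the proc_root parameter are hypothetical, and it assumes the job's extra group ID shows up in the Groups: line of /proc/<pid>/status and that the VmRSS: line is an acceptable measure of usage.

```python
import os

def job_rss_kb(job_gid, proc_root="/proc"):
    """Sum VmRSS (in kB) over every process whose supplementary
    groups include job_gid."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # process exited between listdir() and open()
        if str(job_gid) in fields.get("Groups", "").split():
            # Kernel threads have no VmRSS line; treat them as zero.
            total += int(fields.get("VmRSS", "0 kB").split()[0])
    return total
```

A shepherd would call something like job_rss_kb(20001) on a timer and kill the job once the total exceeds its request. Summing RSS counts shared pages more than once, and anything the job allocates between two calls goes unnoticed until the next poll, which is the window described above.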

Since it seems the OOM killer can never be avoided entirely, it’s important that the algorithm is tunable enough that an administrator can control its behavior with a high degree of certainty. The new implementation in 2.6.36 seems to take some steps backwards in that respect.