Braindead thread scheduling in Linux

In my previous post I showed some of the scaling benchmarks I was getting with my newly multithreaded molecular dynamics code.  Those scaling figures were done on a dedicated node though, and I found that when I wanted to actually start shoving out dozens of SMP jobs to my cluster, my scaling performance absolutely tanked.

As it turns out, Linux (or at least, the kernel that comes with Ubuntu Server 10.04) is very close to braindead when it comes to scheduling and binding multithreaded processes.  
Given my nodes' hierarchy, with two quad-core sockets per node, it makes the most sense to run either two four-thread simulations or one eight-thread simulation per node.  Since I've been running simulation batches of ten or twenty jobs at once and I've only got eight nodes on my cluster, running two four-way jobs per node was the way to go.  I was hoping for a layout where each of the four threads gets a single core on the same socket.

SGE claims to have topology-aware scheduling by means of the -binding parameter, but this appears to have absolutely no effect on SMP jobs.  The next thing I tried was to use taskset to restrict each four-thread job to a single socket via something like
taskset -c 0,2,4,6 ./mdvggsmp.x
This should work, right?  As it turns out, what I actually got was a range of variations on the same broken pattern: Linux would put two threads on a single core within the affinity range and leave another core completely idle.  As one would expect, this caused severe performance degradation.

As it turns out, not only do I have to bind the parent process to cores 0, 2, 4, and 6, but I then need to bind each individual thread to its own core.  Logging into one of the nodes and checking which processor each thread was running on made the problem obvious:
$ ps -mo pid,tid,fname,user,psr -p `pgrep mdv`
21654     - mdvgg230 glock      -
    - 21654 -        glock      0
    - 21655 -        glock      2
    - 21656 -        glock      2
    - 21657 -        glock      6
    - 21658 -        glock      4
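As an aside, if you want to check which socket each of those psr values maps to, the sysfs topology files spell it out.  A quick sketch (the paths are standard on Linux; the even/odd split in the comment is inferred from the taskset core list above, so verify it on your own hardware):

```shell
# For each logical CPU, print the physical socket it belongs to.
# On these nodes, cores 0,2,4,6 should land on one socket (matching
# the taskset list above) and 1,3,5,7 on the other.
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  echo "$(basename "$c"): socket $(cat "$c/topology/physical_package_id")"
done
```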
In some cases, the thread allocations were so screwed up that it was easier to simply re-specify the affinity for every single thread manually:
$ taskset -p -c 0 21654
$ taskset -p -c 0 21655
$ taskset -p -c 2 21656
$ taskset -p -c 4 21657
$ taskset -p -c 6 21658
Since the original taskset call had already kept all the threads on the same socket (and NUMA pool), migrating them to specific cores caused no major problems in terms of context switching or hitting non-local memory.  Once this was done, my jobs started running at (or above) the benchmarked speedups I reported earlier.
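Those per-thread taskset calls are easy to script.  Here's a rough sketch that rebinds every thread of the process round-robin across the socket-0 cores; it assumes the even-numbered core list from earlier and the same pgrep pattern, so adjust both for your own layout:

```shell
# Walk the thread list of the mdv process and pin each thread to the
# next core in the socket-0 list.  "ps -mo tid=" prints one TID per
# line plus a "-" row for the process itself, which the grep drops.
cores="0 2 4 6"
i=0
for tid in $(ps -mo tid= -p "$(pgrep mdv)" | grep -o '[0-9][0-9]*'); do
  core=$(echo $cores | cut -d' ' -f$((i % 4 + 1)))
  taskset -p -c "$core" "$tid"
  i=$((i + 1))
done
```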

Intel and GNU's implementations of OpenMP provide an environment-variable-based API for specifying thread affinity that is quite powerful and far less tedious than this taskset method.  Unfortunately, our SGE configuration isn't smart enough to know which cores are already bound when it launches a job, so I had to come up with a very ugly locking scheme that lets each task in a job array call dibs on a processor socket and have the other tasks honor the claim.  Since my nodes have only two sockets and therefore support only two SMP jobs apiece, it was relatively straightforward:
#$ -t 1-10
#$ -cwd
#$ -pe openmp 4
#$ -binding linear:4
#$ -S /bin/bash
mynode=$(uname -n)
# taskid-dependent sleep staggers the order in which tasks check locks
sleep $((SGE_TASK_ID*10))
if [ ! -f "lock.$mynode" ]; then
  # no lock file yet: call dibs on socket 0 (the even-numbered cores)
  echo $SGE_TASK_ID > "lock.$mynode"
  PROCSET=0
  export GOMP_CPU_AFFINITY="0 2 4 6"
else
  # socket 0 is already claimed, so this task takes socket 1
  PROCSET=1
  export GOMP_CPU_AFFINITY="1 3 5 7"
fi
echo "Binding to socket #${PROCSET} on ${mynode}"
./mdvggsmp.x
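Concretely, the environment variables I'm referring to look something like this; the core list here is the even-numbered socket from earlier, and you should check each compiler's documentation for the full syntax:

```shell
# GNU OpenMP (libgomp): pin threads, in order, to the listed CPUs
export GOMP_CPU_AFFINITY="0 2 4 6"

# Intel OpenMP: the equivalent explicit pinning
export KMP_AFFINITY="granularity=fine,proclist=[0,2,4,6],explicit"

# then launch the binary as usual:
# ./mdvggsmp.x
```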
Surely there is an easier way to do this though.  It seems silly that this much effort is required to run multiple SMP jobs on a single node.