scheduling bug in nfsd on SMP servers

Gabriel Barazer gabriel at oxeva.fr
Fri Feb 23 10:39:31 EST 2007


On 02/21/2007 1:28:28 +0100, Greg Banks <gnb at sgi.com> wrote:

> Yes, this is precisely the intended behaviour and is important for
> achieving performance on NUMA machines with multiple NICs.  The
> two problems I see here are that
> 
> 1) the heuristics in svc_pool_map_choose_mode() are choosing an
>    inappropriate cpu->pool mapping mode for your machine, and
> 
> To help with 1) can you please send the output of "cat /proc/cpuinfo" ?
> I have an open SGI bug to fix 2), I'll try to get a patch out soonish.

I have nearly the same cpuinfo as Neil (see bottom). Maybe this core
mapping is due to Hyperthreading: Hyperthreading doesn't emulate
multiple cores, but identical logical processors on the same physical
socket (one socket, with 2 logical processors that each report 1 core).

I think multi-core servers report 2 different core ids on the same
socket, while Hyperthreading reports the same core id twice. This can
lead to wrong behavior when detecting the CPU topology.
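
A minimal Python sketch of how this difference shows up in the data:
it groups logical processors by the (physical id, core id) pair found
in /proc/cpuinfo text. The function name group_by_core and the two
sample strings are assumptions made for illustration; the samples copy
only the three relevant fields from the dumps at the bottom of this
mail. It is not how the kernel detects topology.

```python
from collections import defaultdict


def group_by_core(cpuinfo_text):
    """Map (physical id, core id) -> list of logical processor numbers."""
    groups = defaultdict(list)
    current = {}
    # A blank line ends each processor stanza; add one so the last
    # stanza is flushed even if the text does not end with a blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            if "processor" in current:
                key = (int(current.get("physical id", 0)),
                       int(current.get("core id", 0)))
                groups[key].append(int(current["processor"]))
            current = {}
            continue
        name, _, value = line.partition(":")
        current[name.strip()] = value.strip()
    return dict(groups)


# Hyperthreading: two logical processors share one (socket, core) pair.
HT_SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 2
physical id : 0
core id : 0
"""

# Dual-core: two logical processors, each on its own core.
DUAL_CORE_SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1
"""

print(group_by_core(HT_SAMPLE))
print(group_by_core(DUAL_CORE_SAMPLE))
```

A group with more than one processor number means Hyperthreading
siblings; several groups under the same physical id mean real cores.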

I don't know the NUMA architecture well, but on my server (and other
classic Linux servers with 1 to 4 CPUs / cores, no NUMA), the NFS
threads are scheduled well across all the CPUs.

What I don't understand is how the pool design can work, because with
SVC_POOL_PERCPU each thread is "locked" to 1 CPU and can't service
another CPU's interrupt.

Because NIC interrupts (with one NIC and no trunking, the most common
case) go to only 1 CPU (the APIC redirects interrupts to only 1 CPU),
all nfs kernel threads which are locked to CPUs other than the NIC's
CPU will never get a query. So the thread pool design doesn't depend
only on the CPU count but, as Greg suggested, on the NIC count too
(there is another case with multiple NICs, where 1 is used for
servicing NFS queries and the other for something else).
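
The argument can be pictured with a toy Python model. It assumes
exactly what the paragraph above assumes: one pool per CPU, and a
request is only ever handed to the pool of the CPU on which the NIC
interrupt arrived. The function name is hypothetical and nothing here
is kernel code.

```python
def pools_receiving_requests(num_cpus, irq_cpus):
    """Toy model: pool i serves only requests that arrive on CPU i."""
    served = {cpu: False for cpu in range(num_cpus)}
    for cpu in irq_cpus:
        if cpu not in served:
            raise ValueError("IRQ routed to a CPU that does not exist")
        # A request arriving on this CPU is queued to this CPU's pool.
        served[cpu] = True
    return served


# 4 CPUs, a single NIC whose interrupt is delivered to CPU 0 only.
status = pools_receiving_requests(4, irq_cpus=[0])
idle = [cpu for cpu, busy in status.items() if not busy]
print(idle)  # [1, 2, 3]
```

With a second NIC routed to another CPU, one more pool becomes busy,
which is why the NIC count matters as much as the CPU count.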

The only dirty workaround with SVC_POOL_PERCPU is to map all the threads
to the NIC's CPU by manually changing the cpu mapping with
/proc/fs/nfsd/pool_threads. And this is dirty because on a file server
with 4 processors, only 1 can be used to serve NFS queries (which is a
silly waste of processing power). And that is without mentioning the
irq balancing utilities (e.g. irqbalance), which can change the NIC's
IRQ affinity to another CPU with no nfs threads bound to it, resulting
in NOT servicing ANY nfs query anymore.
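
A hedged Python sketch of the workaround and of the irqbalance hazard.
It assumes that pool_threads takes one whitespace-separated thread
count per pool, that smp_affinity holds a hexadecimal CPU bitmask, and
that in per-CPU mode pool N corresponds to CPU N. All function names
are hypothetical, and the code only builds and checks strings; it does
not touch /proc.

```python
def pool_threads_line(num_pools, nic_pool, total_threads):
    """Build a pool_threads line that puts every thread in one pool."""
    counts = [0] * num_pools
    counts[nic_pool] = total_threads
    return " ".join(str(n) for n in counts)


def cpus_in_affinity_mask(mask_text):
    """Decode a hex CPU bitmask into a list of CPU numbers."""
    # Wide masks are written as comma-separated groups; drop the commas.
    mask = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def irq_cpus_without_threads(mask_text, counts_line):
    """List CPUs that may take the IRQ but whose pool has no threads."""
    counts = [int(n) for n in counts_line.split()]
    return [cpu for cpu in cpus_in_affinity_mask(mask_text)
            if cpu >= len(counts) or counts[cpu] == 0]


line = pool_threads_line(4, nic_pool=0, total_threads=8)
print(line)                                  # 8 0 0 0
print(irq_cpus_without_threads("1", line))   # []  (IRQ on CPU 0: fine)
print(irq_cpus_without_threads("4", line))   # [2] (IRQ moved to CPU 2)
```

The last call is the case described above: once the IRQ is moved to
CPU 2, the only pool that receives requests has no threads.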

The only way to have all the CPUs working with only 1 NIC is to have 1
global thread pool with no CPU lock restrictions (by forcing this
behavior inside svc_pool_map_choose_mode()).
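
One way to picture "forcing" the behavior is an explicit override that
is consulted before any autodetection runs. The Python sketch below
only models that control flow. The PoolMode names mirror the
SVC_POOL_* naming but are assumptions here, and the autodetect
callable is a stand-in: the real heuristics live in
svc_pool_map_choose_mode() and are not reproduced.

```python
from enum import Enum


class PoolMode(Enum):
    GLOBAL = "global"    # one pool, threads free to run on any CPU
    PERCPU = "percpu"    # one pool per CPU
    PERNODE = "pernode"  # one pool per NUMA node


def choose_mode(autodetect, override=None):
    """Return the forced mode if one is given, else the detected one."""
    if override is not None:
        # Raises ValueError for an unknown mode name.
        return PoolMode(override)
    return autodetect()


def detect_percpu():
    """Stand-in for a heuristic that picked per-CPU pools."""
    return PoolMode.PERCPU


print(choose_mode(detect_percpu))                     # PoolMode.PERCPU
print(choose_mode(detect_percpu, override="global"))  # PoolMode.GLOBAL
```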

I just got some /proc/cpuinfo output from friends for Core2 Duo and Xeon
dual-core servers; it is at the bottom of this mail. Note how the core
ID numbering differs between dual-core and Hyperthreading. I'll try to
get some other cpuinfos from dual-cpu dual-core AMD servers (which are
NUMA if I remember right).


> 2) there is no way short of code changes to override its choice.

The best way is to have a kernel switch as proposed by Olaf, instead of
an autodetect function, because detecting the right behavior is very
difficult in kernel mode. In the case where thread pool mapping is
wanted, one can set the kernel switch and then use a userland daemon
which does the work of mapping NIC interrupts and nfs threads to the
right CPUs with /proc/fs/nfsd/pool_threads and
/proc/irq/*/smp_affinity (a kind of irqbalance daemon which takes care
of nfs kernel threads too).
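
A minimal Python sketch of the planning step such a daemon might
perform, under stated assumptions: per-CPU pools where pool N maps to
CPU N, one thread count per pool in pool_threads, and a hex bitmask in
each smp_affinity file. The function name plan_mapping and the IRQ
numbers 24 and 25 are hypothetical. The sketch only computes the
values; a real daemon would still have to write them to the /proc
files and react to changes.

```python
def plan_mapping(num_cpus, irq_to_cpu, total_threads):
    """Return (pool_threads line, {irq: hex affinity mask})."""
    cpus = sorted(set(irq_to_cpu.values()))
    if not cpus or any(cpu < 0 or cpu >= num_cpus for cpu in cpus):
        raise ValueError("each NIC IRQ must be pinned to an existing CPU")

    # Split the threads as evenly as possible over the CPUs that
    # actually receive NIC interrupts; every other pool gets none.
    base, extra = divmod(total_threads, len(cpus))
    counts = [0] * num_cpus
    for i, cpu in enumerate(cpus):
        counts[cpu] = base + (1 if i < extra else 0)

    # Pin each IRQ to exactly one CPU: a mask with a single bit set.
    masks = {irq: format(1 << cpu, "x") for irq, cpu in irq_to_cpu.items()}
    return " ".join(str(n) for n in counts), masks


line, masks = plan_mapping(4, {24: 0, 25: 2}, total_threads=9)
print(line)   # 5 0 4 0
print(masks)  # {24: '1', 25: '4'}
```

Keeping the thread counts and the IRQ masks in one plan is the point
of the design: the two settings can never drift apart the way they do
when irqbalance moves an IRQ on its own.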

Gabriel

--- [DUAL XEON w/ HYPERTHREADING] /proc/cpuinfo ---
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      :                   Intel(R) Xeon(TM) CPU 2.80GHz
stepping        : 4
cpu MHz         : 2800.095
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl cid xtpr
bogomips        : 5605.30
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      :                   Intel(R) Xeon(TM) CPU 2.80GHz
stepping        : 4
cpu MHz         : 2800.095
cache size      : 1024 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl cid xtpr
bogomips        : 5600.29
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      :                   Intel(R) Xeon(TM) CPU 2.80GHz
stepping        : 4
cpu MHz         : 2800.095
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl cid xtpr
bogomips        : 5600.32
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      :                   Intel(R) Xeon(TM) CPU 2.80GHz
stepping        : 4
cpu MHz         : 2800.095
cache size      : 1024 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl cid xtpr
bogomips        : 5600.54
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:
----------------------

--- [CORE2 DUO] ---
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
stepping        : 6
cpu MHz         : 2393.989
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 4794.58
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
stepping        : 6
cpu MHz         : 2393.989
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 4788.21
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

--- [DUAL XEON 5xxx DUAL-CORE] ---
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
stepping        : 6
cpu MHz         : 2666.859
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5342.02
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
stepping        : 6
cpu MHz         : 2666.859
cache size      : 4096 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5333.90
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
stepping        : 6
cpu MHz         : 2666.859
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5333.97
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
stepping        : 6
cpu MHz         : 2666.859
cache size      : 4096 KB
physical id     : 3
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall 
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 5333.99
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
---------------------------

