scheduling bug in nfsd on SMP servers
Gabriel Barazer
gabriel at oxeva.fr
Fri Feb 23 10:39:31 EST 2007
On 02/21/2007 1:28:28 +0100, Greg Banks <gnb at sgi.com> wrote:
> Yes, this is precisely the intended behaviour and is important for
> achieving performance on NUMA machines with multiple NICs. The
> two problems I see here are that
>
> 1) the heuristics in svc_pool_map_choose_mode() are choosing an
> inappropriate cpu->pool mapping mode for your machine, and
>
> To help with 1) can you please send the output of "cat /proc/cpuinfo" ?
> I have a open SGI bug to fix 2), I'll try to get a patch out soonish.
I have nearly the same cpuinfo as Neil (see bottom). Maybe this core
mapping is due to Hyperthreading: Hyperthreading doesn't emulate
multiple cores but identical cores on the same physical socket (it is
the same socket, with 2 logical processors with 1 core each).
I think multi-core servers have 2 different core ids on the same socket.
This can lead to wrong behavior when detecting the CPU topology.
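To make the difference concrete, here is a minimal sketch that groups
the logical processors of a /proc/cpuinfo dump by "physical id" and
"core id". The function names and the three labels are only
illustrative, and it assumes the x86 field names shown in the dumps at
the bottom of this mail; it is not what the kernel does.

```python
from collections import defaultdict

def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per logical processor."""
    cpus = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # wrapped continuation lines (long "flags") have no colon
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpus.append({})  # every "processor" line starts a new record
        if cpus:
            cpus[-1][key] = value
    return cpus

def classify_sockets(cpus):
    """Label each physical id (illustrative labels, not kernel terms)."""
    per_socket = defaultdict(lambda: defaultdict(int))
    for cpu in cpus:
        per_socket[cpu["physical id"]][cpu["core id"]] += 1
    result = {}
    for phys, per_core in per_socket.items():
        if any(count > 1 for count in per_core.values()):
            # several logical processors share one core id
            result[phys] = "hyperthreaded"
        elif len(per_core) > 1:
            # distinct core ids on the same socket
            result[phys] = "multi-core"
        else:
            result[phys] = "single"
    return result
```

Fed with the dumps below, the dual Xeon with Hyperthreading should come
out as "hyperthreaded" for physical ids 0 and 3, and the Core2 Duo as
"multi-core" for physical id 0.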
I don't know the NUMA architecture well, but on my server (and other
classic Linux servers with 1 to 4 CPUs/cores, no NUMA), the NFS
threads are well scheduled across all the CPUs.
What I don't understand is how the pool design can work, because with
SVC_POOL_PERCPU each thread is "locked" to 1 CPU and can't service
another CPU's interrupt.
Because NIC interrupts (with one NIC and no trunking, the most common
case) go to only 1 CPU (the APIC redirects interrupts to only 1 CPU), all
nfs kernel threads which are locked to CPUs other than the NIC's CPU will
never get a query. So the thread pool design doesn't depend only on the
CPU count but, as Greg suggested, on the NIC count too (there is
another case with multiple NICs, where 1 is used for servicing NFS
queries and the other for something else).
The only dirty workaround with SVC_POOL_PERCPU is to map all the threads
to the NIC's CPU by manually changing the CPU mapping through
/proc/fs/nfsd/pool_threads. And this is dirty because on a file server
with 4 processors, only 1 can be used to serve NFS queries (which is a
silly waste of processing power). And that is without mentioning the IRQ
balancing utilities (e.g. irqbalance), which can change the NIC's IRQ
affinity to another CPU with no nfs threads bound, resulting in NOT
servicing ANY nfs query anymore.
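The failure case can be checked from userland. The sketch below decodes
the hex mask found in /proc/irq/<n>/smp_affinity and compares it with
the per-pool thread counts from /proc/fs/nfsd/pool_threads. It assumes
that in SVC_POOL_PERCPU mode pool i serves CPU i, which is an assumption
of this sketch and not something verified here; the function names are
hypothetical.

```python
def parse_affinity_mask(mask_text):
    """Decode an smp_affinity hex mask into a set of CPU numbers.

    Accepts a single group such as '00000004' as well as comma-separated
    zero-padded 32-bit groups such as '00000000,00000004'.
    """
    value = int(mask_text.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}

def parse_pool_threads(text):
    """Decode whitespace-separated per-pool thread counts."""
    return [int(count) for count in text.split()]

def nfsd_starved(irq_cpus, pool_threads):
    """True if no CPU receiving the NIC IRQ has nfsd threads in its pool.

    Assumes pool i corresponds to CPU i (per-CPU pool mode).
    """
    return not any(
        cpu < len(pool_threads) and pool_threads[cpu] > 0
        for cpu in irq_cpus
    )
```

With all threads in pool 0 ("8 0 0 0") and the NIC's IRQ moved to CPU 2
(mask "00000004"), nfsd_starved() returns True, which is the situation
described above.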
The only way to have all the CPUs working with only 1 NIC is to have 1
global thread pool with no CPU lock restrictions (by forcing this
behavior inside svc_pool_map_choose_mode()).
Friends just sent me some /proc/cpuinfo output from Core2 Duo and Xeon
dual-core servers; it is at the bottom of this mail. Note how the core ID
numbering differs between dual-core and Hyperthreading. I'll try to
get some other cpuinfos from dual-CPU dual-core AMD servers (which are
NUMA if I remember right).
> 2) there is no way short of code changes to override it's choice.
>
The best way is to have a kernel switch as proposed by Olaf, instead of
an autodetect function, because detecting the right behavior is
very difficult in kernel mode. In the case where thread pool mapping is
wanted, one can set the kernel switch and then use a userland daemon
which does the work of mapping NIC interrupts and nfs threads to the
right CPUs with /proc/fs/nfsd/pool_threads and
/proc/irq/*/smp_affinity (a kind of irqbalance daemon which takes care of
nfs kernel threads too).
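As a rough idea of the planning step such a daemon would need, the
sketch below spreads a given number of nfsd threads over the pools
whose CPUs receive NIC interrupts and formats the result as a
whitespace-separated line for /proc/fs/nfsd/pool_threads. The function
names are hypothetical, it assumes pool i corresponds to CPU i, and it
deliberately does not write to /proc itself.

```python
def plan_pool_threads(irq_cpus, n_pools, total_threads):
    """Spread total_threads evenly over the pools of the IRQ-receiving CPUs.

    Assumes pool i corresponds to CPU i; pools of other CPUs get 0 threads.
    """
    targets = sorted(cpu for cpu in irq_cpus if cpu < n_pools)
    if not targets:
        raise ValueError("no pool matches the CPUs receiving NIC interrupts")
    base, extra = divmod(total_threads, len(targets))
    counts = [0] * n_pools
    for index, cpu in enumerate(targets):
        # the first `extra` pools take one additional thread each
        counts[cpu] = base + (1 if index < extra else 0)
    return counts

def format_pool_threads(counts):
    """Format per-pool counts as one whitespace-separated line."""
    return " ".join(str(count) for count in counts) + "\n"
```

For example, with NIC interrupts on CPUs 1 and 3, 4 pools and 9
threads, the plan is [0, 5, 0, 4]. A real daemon would also have to
re-run this whenever the IRQ affinity changes.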
Gabriel
--- [DUAL XEON w/ HYPERTHREADING] /proc/cpuinfo ---
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Xeon(TM) CPU 2.80GHz
stepping : 4
cpu MHz : 2800.095
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 5605.30
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Xeon(TM) CPU 2.80GHz
stepping : 4
cpu MHz : 2800.095
cache size : 1024 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 5600.29
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Xeon(TM) CPU 2.80GHz
stepping : 4
cpu MHz : 2800.095
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 5600.32
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Xeon(TM) CPU 2.80GHz
stepping : 4
cpu MHz : 2800.095
cache size : 1024 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 5600.54
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:
----------------------
--- [CORE2 DUO] ---
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
stepping : 6
cpu MHz : 2393.989
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 4794.58
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
stepping : 6
cpu MHz : 2393.989
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 4788.21
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
--- [DUAL XEON 5xxx DUAL-CORE] ---
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
stepping : 6
cpu MHz : 2666.859
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 5342.02
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
stepping : 6
cpu MHz : 2666.859
cache size : 4096 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 5333.90
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
stepping : 6
cpu MHz : 2666.859
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 5333.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
stepping : 6
cpu MHz : 2666.859
cache size : 4096 KB
physical id : 3
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall
lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 5333.99
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
---------------------------