NFSv4 performance issues on a lagged network
Bryce Harrington
bryce at osdl.org
Tue Aug 29 16:48:22 EDT 2006
Hi all,
Our Summer of Code student, Dipankar Sarkar, has been developing and
experimenting with tools for analyzing NFS when running in a degraded
network environment (dropped or delayed packets, etc.)
Thomas Talpey of NetApp gave us good guidance that rate control would be
a better thing to test than other network effects such as dropped
packets and packet reordering, as the latter simply test Linux's TCP/IP
stack. It turned out he was exactly right; Dipankar found that
situations such as dropped packets did not produce interesting results.
But as Thomas had suspected, there are some very interesting failure
conditions when operating NFS with delayed packets. Please see
Dipankar's page for charts and discussion:
http://www.desinerd.com/nfsv4-netem/index.html
This is against the 2.6.17-rc1 and 2.6.17-rc2 CITI patches (later
versions of the CITI patch were crashing -- these crashes were
separately investigated and are now solved).
As can be seen in the charts, with no packet delay the behavior is
fairly smooth, with the read curve approaching the maximum possible with
the hardware.
But when some packet delay is added, both read and write performance get
hit hard. In fact, it also leads to crashes or "can't read superblock"
errors; this is why most of the graphs appear incomplete. We also see
cases where throughput is reduced to near zero for all measurements.
Initially, it was suspected that this might be an effect of using
network emulation, and that it might be a bug in NetEm rather than in
NFS. To test this, runs were made using the emulator but with no delay,
and with only a 0.001 ms delay. As you can see, in these cases the
results approximate the situation where NetEm is not used at all,
within iozone's normal fluctuations.
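For anyone who wants to repeat these control runs by hand, here is a
rough sketch of how the delay settings could be generated. It assumes
the stock "tc ... netem delay" syntax from iproute2; the interface name
eth0 and the helper name netem_delay_command are placeholders of mine,
not something taken from Crucible or from Dipankar's scripts.

```python
import shlex


def netem_delay_command(interface, delay_ms, first_time=False):
    """Build the tc argv that applies a fixed netem delay to an interface.

    interface  -- network device name (hypothetical example: "eth0")
    delay_ms   -- one-way delay to add, in milliseconds
    first_time -- True to install the qdisc ("add"), False to update an
                  existing one ("change")
    """
    action = "add" if first_time else "change"
    # The :g format keeps small values readable: 0.001 -> "0.001", 10 -> "10"
    delay = f"{delay_ms:g}ms"
    return ["tc", "qdisc", action, "dev", interface,
            "root", "netem", "delay", delay]


# Print the commands for the control cases and a delayed case.
# Running them requires root; here they are only printed.
for index, delay_ms in enumerate([0, 0.001, 10]):
    argv = netem_delay_command("eth0", delay_ms, first_time=(index == 0))
    print(shlex.join(argv))
```

The commands are only printed, so they can be reviewed before being run
as root on the test machine.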
You'll notice that there isn't a gradual degradation in performance;
rather, there are large and sudden fall-offs. Sometimes these drops
occur during the run; in other cases the test starts off very poorly and
recovers very slowly. There doesn't seem to be a strong correlation
between the amount of packet delay and the degree of performance
degradation; some of the 10 ms delay cases actually show better
performance than the shorter delay cases.
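If someone wants to put a number on that lack of correlation, a simple
Pearson coefficient over (delay, throughput) pairs would be one way to
do it. The sketch below is only an illustration; the function name is
mine, and no measured values are included.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

Here xs would be the configured packet delay for each run and ys the
throughput iozone reported for it. A value near -1 would mean throughput
falls steadily as delay grows; a value near 0 would match what the
charts suggest.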
To test if this issue is NFSv4-specific, runs were also made against
NFSv3. The issue is apparent for NFSv3 as well, for 1-10 ms packet
delays.
Does anyone have an idea what could be causing these problems? Are they
things that can be fixed in NFSv4? Are there any tunables we could
experiment with, or other information we could gather to help narrow
down the cause?
Due to Dipankar's efforts, we not only have this analysis but also have
the capability added to Crucible to easily conduct further testing of
client/server issues under different network conditions (or to re-run
Dipankar's tests against any proposed fixes). If you'd like to get a
feel for the types of settings that can be varied in Crucible, here is
one of the config files for a NetEm test:
http://crucible.osdl.org/runs/1650/run_profile.txt
Bryce