[nfsv4] NFSv4 performance issues on a lagged network
Talpey, Thomas
Thomas.Talpey at netapp.com
Sat Sep 9 11:45:56 EDT 2006
That second observation is *very* interesting. You mean with a large
RPC slot table, iozone does not loop, even with the pathological
delays in the network? This may indicate an issue in RPC task scheduling.
Perhaps when tasks are stalled on slot availability, there are wait/wakeup
issues.
When you describe iozone being in a loop - do you know exactly where
it's looping? If you mount with "intr", is it killable?
Tom.
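
For anyone reproducing the slot table change discussed in this thread (raising the TCP slot entries to 64), here is a minimal sketch for inspecting and setting the value from Python. It assumes a Linux client that exposes the sunrpc sysctl at /proc/sys/sunrpc/tcp_slot_table_entries; the helper names are hypothetical, writing requires root, and the new value most likely applies only to transports created afterwards, so set it before mounting.

```python
from pathlib import Path

# Assumption: the sunrpc module is loaded and exposes this sysctl file.
SLOT_TABLE_PATH = Path("/proc/sys/sunrpc/tcp_slot_table_entries")


def read_slot_table_entries(path=SLOT_TABLE_PATH):
    """Return the current number of TCP RPC slot table entries."""
    return int(Path(path).read_text().strip())


def write_slot_table_entries(value, path=SLOT_TABLE_PATH):
    """Set the number of TCP RPC slot table entries (needs root on a real system)."""
    if value < 1:
        raise ValueError("slot table size must be a positive integer")
    Path(path).write_text(f"{value}\n")
```

Calling `write_slot_table_entries(64)` as root and then remounting the export should correspond to the "64 slot entries" configuration described in the quoted message below. The `path` parameter exists only so the helpers can be exercised against an ordinary file.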
At 01:47 AM 9/9/2006, Dipankar Sarkar wrote:
>Hey Trond
>
>I ran the test again with about 50 seconds of packet delay, after which the delay is switched off. It works great, as we can see in the output:
>http://crucible.osdl.org/runs/1982/test_output/iozone.sys.log.png
>
>I changed the TCP slot table entries to 64 and reran the previous full packet delay test (where I do not switch the delay off). Again it seems to work great (I mean iozone does not go into a loop). This output shows the performance hit caused by the packet delay:
>
>http://crucible.osdl.org/runs/1983/test_output/iozone.sys.log.png
>http://crucible.osdl.org/runs/1983/
>
>Dipankar
>
>On 9/9/06, Trond Myklebust <trond.myklebust at fys.uio.no> wrote:
>On Fri, 2006-09-08 at 18:29 -0700, Bryce Harrington wrote:
>> On Wed, Aug 30, 2006 at 05:01:30PM -0400, Trond Myklebust wrote:
>> > On Tue, 2006-08-29 at 17:07 -0400, Trond Myklebust wrote:
>> > > See http://crucible.osdl.org/runs/1621/sysinfo/nfs05.console
>> > >
>> > > It looks like the networking layer is passing back an ENETUNREACH error
>> > > that we weren't expecting, and consequently weren't handling.
>> > >
>> > > Hmm... Not much you can do in those circumstances except delay and then
>> > > retry. I suppose the same goes for EHOSTUNREACH (which we didn't
>> > > receive, but could conceivably happen too).
>> > >
>> > > I'll look into drafting a patch.
>> >
>> > OK... Please could you see if the attached patch has an effect on those
>> > errors.
>> >
>> > Cheers,
>> > Trond
>>
>> Hi Trond,
>>
>> Thanks for the patch. This has the effect of causing iozone to go into
>> a loop and fail to finish. Here are three runs on this patch with
>> identical conditions:
>>
>> http://crucible.osdl.org/runs/1791/test_output/iozone.sys.log
>> http://crucible.osdl.org/runs/1793/test_output/iozone.sys.log
>> http://crucible.osdl.org/runs/1794/test_output/iozone.sys.log
>>
>> All three runs are getting stuck at almost exactly the same spot in the
>> test.
>>
>> On the console, the output is:
>>
>> [http://crucible.osdl.org/runs/1793/sysinfo/nfs05.console]
>> ...
>> nfs05 login: nfsd: last server has exited
>> nfsd: unexporting all filesystems
>> NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery
>> directory
>> NFSD: starting 90-second grace period
>> *** Run 1793: Running 'src/current/iozone -+q 1 -i 0 -i 1 -g 128M -Race
>> -U /mnt/192.168.10.4 -f /mnt/192.168.10.4/iozone.tmp' ***
>> nfs: server 192.168.10.4 not responding, still trying
>> [-- MARK -- Tue Sep 5 23:00:00 2006]
>> [-- MARK -- Wed Sep 6 00:00:00 2006]
>> ...
>>
>> Ideas on what to try next?
>
>What happens if you turn off your packet delay mechanism _after_ you
>start seeing the above hang?
>When we start seeing ENETUNREACH errors, then that is a sign that the
>networking layer is having trouble actually routing the packets. My
>guess is that you have tuned the packet delay to the point where the
>networking layer is spending all its time trying to establish contact
>with the router/gateway ('cos it isn't able to talk to anything on the
>network). If you turn off the delay, then presumably the system will
>recover.
>
>Note also that you might want to try changing the setup so that the
>delay happens between two routers instead of between the client and its
>primary network.
>
>Cheers,
> Trond
>
>
>_______________________________________________
>NFSv4 mailing list
>NFSv4 at linux-nfs.org
>http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
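
The "delay and then retry" handling for ENETUNREACH and EHOSTUNREACH mentioned in the quoted thread can be illustrated with a small userspace sketch. This is not the kernel patch under test, which is not reproduced here; the function name, attempt limit, and fixed delay are assumptions made only for illustration.

```python
import errno
import time

# Errors treated as transient routing failures, following the quoted discussion.
TRANSIENT_ERRNOS = {errno.ENETUNREACH, errno.EHOSTUNREACH}


def retry_on_unreachable(op, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call op(); on ENETUNREACH or EHOSTUNREACH, wait `delay` seconds and retry.

    Any other OSError, or a transient error on the final attempt, is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except OSError as exc:
            if exc.errno not in TRANSIENT_ERRNOS or attempt == max_attempts:
                raise
            sleep(delay)
```

A caller would wrap a network operation, for example `retry_on_unreachable(lambda: socket.create_connection((host, port)))` with `socket` imported and `host` and `port` supplied by the caller. A fixed delay keeps the sketch short; a real implementation might back off between attempts instead.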