[nfsv4] NFSv4 performance issues on a lagged network
Trond Myklebust
trond.myklebust at fys.uio.no
Fri Sep 8 23:28:27 EDT 2006
On Fri, 2006-09-08 at 18:29 -0700, Bryce Harrington wrote:
> On Wed, Aug 30, 2006 at 05:01:30PM -0400, Trond Myklebust wrote:
> > On Tue, 2006-08-29 at 17:07 -0400, Trond Myklebust wrote:
> > > See http://crucible.osdl.org/runs/1621/sysinfo/nfs05.console
> > >
> > > It looks like the networking layer is passing back an ENETUNREACH error
> > > that we weren't expecting, and consequently weren't handling.
> > >
> > > Hmm... Not much you can do in those circumstances except delay and then
> > > retry. I suppose the same goes for EHOSTUNREACH (which we didn't
> > > receive, but could conceivably happen too).
> > >
> > > I'll look into drafting a patch.
> >
> > OK... Please could you see if the attached patch has an effect on those
> > errors.
> >
> > Cheers,
> > Trond
>
> Hi Trond,
>
> Thanks for the patch. This has the effect of causing iozone to go into
> a loop and fail to finish. Here are three runs on this patch with
> identical conditions:
>
> http://crucible.osdl.org/runs/1791/test_output/iozone.sys.log
> http://crucible.osdl.org/runs/1793/test_output/iozone.sys.log
> http://crucible.osdl.org/runs/1794/test_output/iozone.sys.log
>
> All three runs are getting stuck at almost exactly the same spot in the
> test.
>
> On the console, the output is:
>
> [http://crucible.osdl.org/runs/1793/sysinfo/nfs05.console]
> ...
> nfs05 login: nfsd: last server has exited
> nfsd: unexporting all filesystems
> NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery
> directory
> NFSD: starting 90-second grace period
> *** Run 1793: Running 'src/current/iozone -+q 1 -i 0 -i 1 -g 128M -Race
> -U /mnt/192.168.10.4 -f /mnt/192.168.10.4/iozone.tmp' ***
> nfs: server 192.168.10.4 not responding, still trying
> [-- MARK -- Tue Sep 5 23:00:00 2006]
> [-- MARK -- Wed Sep 6 00:00:00 2006]
> ...
>
> Ideas on what to try next?

What happens if you turn off your packet delay mechanism _after_ you
start seeing the above hang?

When we start seeing ENETUNREACH errors, that is a sign that the
networking layer is having trouble actually routing the packets. My
guess is that you have tuned the packet delay to the point where the
networking layer is spending all its time trying to establish contact
with the router/gateway (because it isn't able to talk to anything on
the network). If you turn off the delay, then presumably the system
will recover.
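
For illustration only, the "delay and then retry" handling of
ENETUNREACH and EHOSTUNREACH discussed earlier in the thread can be
sketched in userspace C. This is not the attached kernel patch; the
names is_transient_net_error(), retry_with_delay() and
fake_unreachable_op() are hypothetical, and the retry count and delay
are arbitrary assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: errors that suggest a routing problem which may
 * clear up on its own, so the caller should wait and try again. */
static bool is_transient_net_error(int err)
{
    return err == ENETUNREACH || err == EHOSTUNREACH;
}

/* Hypothetical retry loop. op(arg) returns 0 on success or a negative
 * errno value. Transient errors are retried after delay_ms; any other
 * result is returned to the caller immediately. */
static int retry_with_delay(int (*op)(void *), void *arg,
                            int max_tries, unsigned int delay_ms)
{
    int ret = -EINVAL; /* returned if max_tries <= 0 */

    for (int i = 0; i < max_tries; i++) {
        ret = op(arg);
        if (ret == 0 || !is_transient_net_error(-ret))
            return ret;
        if (i + 1 < max_tries) {
            struct timespec ts = {
                .tv_sec = delay_ms / 1000,
                .tv_nsec = (long)(delay_ms % 1000) * 1000000L,
            };
            nanosleep(&ts, NULL); /* back off before the next attempt */
        }
    }
    return ret; /* still failing with a transient error */
}

/* Stand-in operation for demonstration: counts its calls and always
 * reports that the network is unreachable. */
static int fake_unreachable_op(void *arg)
{
    (*(int *)arg)++;
    return -ENETUNREACH;
}
```

With the stand-in operation, retry_with_delay(fake_unreachable_op,
&calls, 3, 1) makes three attempts and then hands back -ENETUNREACH.
A bounded loop like this gives up eventually, whereas the hang above
looks more like retrying indefinitely while the network stays
unreachable.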

Note also that you might want to try changing the setup so that the
delay happens between two routers instead of between the client and its
primary network.

Cheers,
Trond