System lockup with recent pull from Bruce's git tree

Frank Filz ffilzlnx at us.ibm.com
Thu Sep 7 12:20:18 EDT 2006


On Thu, 2006-09-07 at 11:01 -0400, J. Bruce Fields wrote:
> On Wed, Sep 06, 2006 at 01:58:36PM -0700, Frank Filz wrote:
> > I have been working on trying to eliminate the BKL from the NFS client.
> > I was starting to think there was a problem with some async RPC calls,
> > so a couple of days ago I pulled Bruce's current git tree
> > (2.6.18-rc5+CITI_NFS4_ALL-1) and ran my test on it to see whether the
> > problem showed up there. What I'm seeing is worse. Before, an NFSv4
> > operation would wait forever for a response, and other threads that
> > accessed the same inode would then hang (waiting for the inode mutex).
> 
> Do you know what operation it was that the server wasn't responding to,
> and how to reproduce the problem?

It seems like it might be a direct read or write. Unfortunately, with the
hard hang, I'm not sure what the latest code is actually tripping on. I
could go back to one of the earlier code drops (without any of my code
changes) and try to reproduce it there, especially if I go back to before
the mutex debugging was changed, so that it's absolutely clear who is
involved in the hang.

> > With the latest code (and definitely no changes of my
> > own), my system is completely locking up.
> > 
> > The testing I am running is:
> > 
> > Mounts:
> > 
> > mount -t nfs -o hard,intr localhost:/home /mnt
> > mount -t nfs4 -o hard,intr localhost:/ /mnt2
> 
> Note that the client and server are not designed to be deadlock-free
> when they're on the same machine.  (The client may not be able to write
> out dirty pages until the server gets some memory it needs to handle the
> requests, and the server may not be able to get that memory until the
> client can free those dirty pages.)  I run the connectathon tests and
> such regularly over localhost mounts, but those don't touch a lot of data.
> 
> So can you reproduce this with client and server on separate machines?
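
For the two-machine run suggested above, the same two mounts would point at
the server's host name instead of localhost. A minimal sketch follows; the
helper name nfs_mount_cmds and the host name nfsserver are hypothetical, and
the function only prints the commands so they can be checked before being
run as root.

```shell
# Hypothetical helper: print the two test mounts with the server host
# as a parameter instead of localhost. It prints only; nothing is mounted.
nfs_mount_cmds() {
    server="$1"
    printf 'mount -t nfs -o hard,intr %s:/home /mnt\n' "$server"
    printf 'mount -t nfs4 -o hard,intr %s:/ /mnt2\n' "$server"
}

# "nfsserver" is a placeholder host name, not one from this thread.
nfs_mount_cmds nfsserver
```

The export paths and mount options are unchanged from the localhost setup,
so any difference in behaviour should come from separating client and server.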

I can certainly try that. I did a spot check of the sysrq task dump from
an earlier run. That run may be tainted by my code changes, so I'm not
dead certain it's the same problem, but unless something has changed here
recently, I suspect this is a lingering problem. My sense is that nfsd is
not part of the problem: all nfsd threads were in the normal idle state,
and client access was successful as long as the specific inode that was
hung was not accessed. Of course, if nfsd dropped the request due to some
interaction with the client, that would be different.
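
As a lighter-weight complement to the sysrq dump, and only while the machine
still responds, the tasks stuck in uninterruptible sleep can be listed from
ps. This is a sketch, not what was run for this report: the helper name
blocked_tasks is hypothetical, and it assumes a procps-style ps.

```shell
# Hypothetical helper: read "ps" output on stdin and keep the header line
# plus every task whose STAT column (field 2) starts with D, i.e. tasks in
# uninterruptible sleep.
blocked_tasks() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Assumes a procps-style ps; the wchan column shows where each task sleeps.
ps -eo pid,stat,wchan:32,comm | blocked_tasks
```

A full task dump (echo t > /proc/sysrq-trigger as root, read back with
dmesg) is still needed for the stack traces; the listing above only narrows
down which tasks to look at.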

Frank



