System lockup with recent pull from Bruce's git tree

Frank Filz ffilzlnx at us.ibm.com
Wed Sep 6 16:58:36 EDT 2006


I have been working on eliminating the BKL from the NFS client. I was
starting to think there is a problem with some async RPC calls, so a
couple of days ago I pulled Bruce's current git tree
(2.6.18-rc5+CITI_NFS4_ALL-1) and ran my test on that tree to see whether
the problem shows up there too. What I'm seeing is worse. Before, an
NFS v4 operation would wait forever for a response, and other threads
that accessed the same inode would then hang (waiting for the inode
mutex). With the latest code (and definitely no changes of my own), my
system is completely locking up.

The testing I am running is:

Mounts:

mount -t nfs -o hard,intr localhost:/home /mnt
mount -t nfs4 -o hard,intr localhost:/ /mnt2

Exports:


/home *(rw,fsid=0,insecure,no_subtree_check,no_root_squash,async)
/home gss/krb5(rw,fsid=0,insecure,no_subtree_check,no_root_squash,async)
/home gss/krb5i(rw,fsid=0,insecure,no_subtree_check,no_root_squash,async)
/home gss/krb5p(rw,fsid=0,insecure,no_subtree_check,no_root_squash,async)

Test:

nohup fsstress -d /mnt/stress/d1 -l 0 -n 1000 -p 4 -r &
nohup fsstress -d /mnt/stress/d2 -l 0 -n 1000 -p 4 -r &
nohup fsstress -d /mnt/stress/d3 -l 0 -n 1000 -p 4 -r &
nohup fsstress -d /mnt/stress/d4 -l 0 -n 1000 -p 4 -r &
nohup fsstress -d /mnt2/stress/d5 -l 0 -n 1000 -p 4 -r &
nohup fsstress -d /mnt2/stress/d6 -l 0 -n 1000 -p 4 -r &
nohup fsstress -d /mnt2/stress/d7 -l 0 -n 1000 -p 4 -r &
nohup fsstress -d /mnt2/stress/d8 -l 0 -n 1000 -p 4 -r &

And since fsstress doesn't give any indication that it's hung, I also
run the following script:

while [ 1 -eq 1 ]
do
        find /mnt2
done
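
Because a hung find just sits there silently, a variant of the loop
above could report when a pass takes too long. This is only a minimal
sketch: the run_with_deadline name and the deadline value are
assumptions, not part of the original test. It polls the background job
instead of waiting on it, since a task blocked on a mutex may not respond
to signals.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background and report if it
# is still running after a deadline (in seconds).
run_with_deadline() {
    deadline=$1; shift
    "$@" >/dev/null 2>&1 &
    pid=$!
    elapsed=0
    # kill -0 only checks whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            echo "$(date): '$*' (pid $pid) still running after ${deadline}s, possible hang"
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # The command finished; its own exit status is ignored on purpose,
    # since find may fail when files vanish under it during the stress run.
    wait "$pid" 2>/dev/null
    return 0
}
```

The find loop would then become something like
"while run_with_deadline 300 find /mnt2; do :; done", which stops and
prints a timestamp the first time a pass exceeds the (assumed) 300 second
deadline. It cannot help with a hard lockup of the whole system.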

I noticed a change in the mutex debugging: it used to report who was
blocked on a mutex, which made it very clear which fsstress thread was
hung and causing the find thread to hang as well. When I was running
with 2.6.18-rc2, it was not so clear, but the find clearly hung (though
once or twice I did see the system completely hang). Now, twice in a
row, after a couple of hours of running, my system has hung hard.

Is anyone aware of any recent changes that might have gone in? From what
I was seeing earlier, it looks like some of the async RPC calls end up
waiting for the response forever (there seem to be several ways the NFS
client makes RPC calls, which adds quite a bit of confusion). My
assumption is that the response arrived before the NFS thread started
waiting. I do notice that in most cases there is a
lock_kernel()/unlock_kernel() pair surrounding calls to rpc_execute(),
but in a couple of cases these calls are missing. Direct I/O, which uses
a different mechanism to wait than the rest of the code, seems to be the
problem, and I do note that it doesn't hold the BKL across the entire
make-request-wait-for-response path, which I thought might be the
cause.

Thanks for any thoughts

Frank Filz



