Debugging a hung connection

J. Bruce Fields bfields at fieldses.org
Sun May 25 13:23:42 EDT 2008


On Wed, May 21, 2008 at 12:31:21PM -0400, Norman Elton wrote:
> >> Dave / Bruce,
> >>
> >> >
> >> > Well, the simplest thing to look for would just be any rpc calls that
> >> > don't get replies.  What exactly are the client and server?
> >> >
> >>
> >> Thanks for the leads. I've enabled rpcdebug (very handy, hadn't seen
> >> it before!) on the server and both a working and non-working client to
> >> compare results.
> >>
> >> The dead client stops at:
> >>
> >> May 21 11:42:09 auth3 kernel: NFS: atomic_lookup(0:15/1769473), test
> >> May 21 11:42:09 auth3 kernel: RPC:     looking up RPCSEC_GSS cred
> >>
> >> In this case, I'm trying to touch a file called "test". Interestingly,
> >> in the working client, I never see a call to atomic_lookup.
> >>
> >> What would cause one client to call this function, but another not to?
> >
> > I don't know, but that may be a red herring--the RPCSEC_GSS lookup seems
> > a more likely cause for delay.  Is gssd running on that client?  Have you
> > tried turning on its debugging?  Could it be having trouble reaching the
> > kerberos server?
> >
> > --b.
> >
> 
> Yes, I've enabled debug on gssd, and don't get any output from either
> the working or frozen client. That line ("looking up RPCSEC_GSS cred")
> appears a few dozen times on both clients.
> 
> I realized I didn't answer one question earlier. All are RHEL5
> machines, clients are 2.6.18-53.1.14.el5, server is 2.6.18-53.1.4.el5.
> 
> May have stumbled on something, though. On the working client, if I
> unmount the share, restart gssd for good measure, and remount, I see
> the output from gssd creating a context, ending with "doing downcall".
> I suppose this is gssd writing the gss context into the kernel. All
> seems well, and the mount works as expected.
> 
> On the nonworking client, when I do the same, I get the same output,
> followed by "destroying client clnt5". Is gssd somehow destroying or
> invalidating this context?
> 
> The debug output from rpcsvcgssd on the server is identical for both
> clients, no signs of errors there.
> 
> Am I on the right track?

I don't know.  The "destroying client clnt5" thing is actually gssd
noticing that the kernel destroyed that client.  Usually that means the
client attempted a mount which failed.

Hm, actually, more likely: the client also, at least with recent
kernels, creates a new rpc client for the setclientid call.  That
setclientid would happen on the first open, which would explain why "ls"
works but not "touch".  That upcall will request credentials for "uid
0", which gssd by convention will take to mean machine credentials.

If you check a packet trace with wireshark, you'll see the "good client"
do a setclientid, setclientid_confirm, open, and close on the touched
file.  The bad client will never get to that first setclientid.
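
If you'd rather not eyeball the trace, a rough Python sketch of that
check follows.  It assumes you have already written down the NFSv4
operation names from each trace by hand, in order; the list format and
the first_missing_step() helper are made up for illustration and are
not part of wireshark or nfs-utils.

```python
# Operations we expect to see, in order, when a file is touched for the
# first time on a fresh mount (per the description above).
EXPECTED = ["SETCLIENTID", "SETCLIENTID_CONFIRM", "OPEN", "CLOSE"]


def first_missing_step(ops_seen, expected=EXPECTED):
    """Return the first expected op that never appears in order, or None.

    ops_seen is a hypothetical, hand-built list of operation names taken
    from a packet trace.  Unrelated operations between the expected ones
    are ignored.
    """
    remaining = iter(ops_seen)
    for step in expected:
        # any() consumes the iterator up to and including the match, so
        # the next expected step is only searched for after this one.
        if not any(op == step for op in remaining):
            return step
    return None


# Made-up sample data, not taken from a real trace.
good_client = ["LOOKUP", "SETCLIENTID", "SETCLIENTID_CONFIRM", "OPEN", "CLOSE"]
bad_client = ["LOOKUP"]

print("good client stalls before:", first_missing_step(good_client))
print("bad client stalls before:", first_missing_step(bad_client))
```

Whatever step it reports for the bad client is where to start looking.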

Maybe there's something different about the keytab (check klist -k) or
the /etc/hosts file between the two machines.
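
One way to make that comparison less error-prone: save the output of
"klist -k" (or a copy of /etc/hosts) from each client and diff the two.
A minimal sketch using Python's difflib is below; the sample text is
invented, and changed_lines() is just a helper name I picked.

```python
import difflib


def changed_lines(good_text, bad_text):
    """Return only the lines that differ between two saved text outputs.

    Lines starting with "- " exist only on the good client, lines
    starting with "+ " only on the bad one.
    """
    diff = difflib.ndiff(good_text.splitlines(), bad_text.splitlines())
    # ndiff also emits "  " (unchanged) and "? " (hint) lines; drop them.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Invented sample contents, standing in for each client's /etc/hosts.
good_hosts = "127.0.0.1 localhost\n192.0.2.10 client.example.com client\n"
bad_hosts = "127.0.0.1 localhost client.example.com client\n"

for line in changed_lines(good_hosts, bad_hosts):
    print(line)
```

An empty result means the two files match and the difference is
somewhere else.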

And what are the kernel versions on the two clients?

--b.

