READ/RENEW loop bug

Muntz, Daniel Dan.Muntz at netapp.com
Thu Jan 21 17:40:46 EST 2010


 

> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields at fieldses.org] 
> Sent: Thursday, January 21, 2010 2:21 PM
> To: Myklebust, Trond
> Cc: nfsv4 at linux-nfs.org
> Subject: Re: READ/RENEW loop bug
> 
> On Thu, Jan 21, 2010 at 05:12:48PM -0500, Trond Myklebust wrote:
> > On Thu, 2010-01-21 at 16:56 -0500, J. Bruce Fields wrote: 
> > > On Tue, Jan 19, 2010 at 09:28:45PM -0500, Steven Brudenell wrote:
> > > > To clarify, since I can't remember if I said this in private to 
> > > > Bruce or to the list:
> > > > 
> > > > There isn't any trigger for this bug such as a 
> rebooting server. 
> > > > Both client and server are humming along fine, when 
> suddenly the 
> > > > client ends up with a stateid twenty hours older than 
> the clientid.
> > > 
> > > Well, then maybe it is possible there's some client bug that's 
> > > causing it to use the wrong stateid.  I'm not sure how to 
> look for that.
> > 
> > There was a time when the renew daemon was being turned off 
> every time 
> > you unmounted a filesystem from the same server. That bug 
> was fixed by 
> > commit 3050141bae57984dd660e6861632ccf9b8bca77e (NFSv4: Kill 
> > nfs4_renewd_prepare_shutdown()).
> > 
> > However I can see that even in today's tree, we don't guarantee 
> > recovery from the NFS4ERR_STALE_STATEID error. I'm 
> therefore thinking 
> > of committing something like the appended patch to ensure 
> that we do...
> 
> Makes sense.  The one disadvantage from my point of view is 
> that it might cover up a server problem that we'd rather know 
> about.  But there's not much to be done about that.  (Will 
> the application wind up getting an IO error in this case, or 
> does the client try to get a new
> open?)

Yeah, I was wondering if there might be a server error here that would
get masked.  When I get some time, there's an error I can reproduce
occasionally that _seems_ like the server is "forgetting" a stateid.
There's not much to see in traces when this happens--the client gets a
stateid and the server (sometimes quickly, sometimes after some time),
starts returning NFS4ERR_BAD_STATEID, and eventually starts returning
NFS4ERR_EXPIRED.  The client (intentionally) is stubbornly using the
same stateid because there's no obvious reason that the stateid should
be invalid.

  -Dan

> 
> --b.
> 
> > 
> > Cheers
> >   Trond
> > 
> ----------------------------------------------------------------------
> > -----------------
> > NFS: Ensure that we handle NFS4ERR_STALE_STATEID correctly
> > 
> > From: Trond Myklebust <Trond.Myklebust at netapp.com>
> > 
> > Even if the server is crazy, we should be able to mark the 
> stateid as 
> > being bad, to ensure it gets recovered.
> > 
> > Signed-off-by: Trond Myklebust <Trond.Myklebust at netapp.com>
> > ---
> > 
> >  fs/nfs/nfs4_fs.h   |    1 +
> >  fs/nfs/nfs4proc.c  |   38 +++++++++++++++++++++++++-------------
> >  fs/nfs/nfs4state.c |    2 +-
> >  3 files changed, 27 insertions(+), 14 deletions(-)
> > 
> > 
> > diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h index 
> > ea2f41b..0c6fda3 100644
> > --- a/fs/nfs/nfs4_fs.h
> > +++ b/fs/nfs/nfs4_fs.h
> > @@ -278,6 +278,7 @@ extern void nfs4_state_set_mode_locked(struct 
> > nfs4_state *, fmode_t);  extern void 
> > nfs4_schedule_state_recovery(struct nfs_client *);  extern void 
> > nfs4_schedule_state_manager(struct nfs_client *);  extern int 
> > nfs4_state_mark_reclaim_nograce(struct nfs_client *clp, struct 
> > nfs4_state *state);
> > +extern int nfs4_state_mark_reclaim_reboot(struct nfs_client *clp, 
> > +struct nfs4_state *state);
> >  extern void nfs41_handle_sequence_flag_errors(struct 
> nfs_client *clp, 
> > u32 flags);  extern void nfs4_put_lock_state(struct nfs4_lock_state 
> > *lsp);  extern int nfs4_set_lock_state(struct nfs4_state *state, 
> > struct file_lock *fl); diff --git a/fs/nfs/nfs4proc.c 
> > b/fs/nfs/nfs4proc.c index afbfe67..dca0ebc 100644
> > --- a/fs/nfs/nfs4proc.c
> > +++ b/fs/nfs/nfs4proc.c
> > @@ -249,14 +249,14 @@ static int 
> nfs4_handle_exception(const struct nfs_server *server, int errorcode,
> >  			if (state == NULL)
> >  				break;
> >  			nfs4_state_mark_reclaim_nograce(clp, state);
> > -		case -NFS4ERR_STALE_CLIENTID:
> > +			goto do_state_recovery;
> >  		case -NFS4ERR_STALE_STATEID:
> > +			if (state == NULL)
> > +				break;
> > +			nfs4_state_mark_reclaim_reboot(clp, state);
> > +		case -NFS4ERR_STALE_CLIENTID:
> >  		case -NFS4ERR_EXPIRED:
> > -			nfs4_schedule_state_recovery(clp);
> > -			ret = nfs4_wait_clnt_recover(clp);
> > -			if (ret == 0)
> > -				exception->retry = 1;
> > -			break;
> > +			goto do_state_recovery;
> >  #if defined(CONFIG_NFS_V4_1)
> >  		case -NFS4ERR_BADSESSION:
> >  		case -NFS4ERR_BADSLOT:
> > @@ -289,6 +289,12 @@ static int nfs4_handle_exception(const 
> struct nfs_server *server, int errorcode,
> >  	}
> >  	/* We failed to handle the error */
> >  	return nfs4_map_errors(ret);
> > +do_state_recovery:
> > +	nfs4_schedule_state_recovery(clp);
> > +	ret = nfs4_wait_clnt_recover(clp);
> > +	if (ret == 0)
> > +		exception->retry = 1;
> > +	return ret;
> >  }
> >  
> > 
> > @@ -3420,15 +3426,14 @@ _nfs4_async_handle_error(struct 
> rpc_task *task, const struct nfs_server *server,
> >  			if (state == NULL)
> >  				break;
> >  			nfs4_state_mark_reclaim_nograce(clp, state);
> > -		case -NFS4ERR_STALE_CLIENTID:
> > +			goto do_state_recovery;
> >  		case -NFS4ERR_STALE_STATEID:
> > +			if (state == NULL)
> > +				break;
> > +			nfs4_state_mark_reclaim_reboot(clp, state);
> > +		case -NFS4ERR_STALE_CLIENTID:
> >  		case -NFS4ERR_EXPIRED:
> > -			rpc_sleep_on(&clp->cl_rpcwaitq, task, NULL);
> > -			nfs4_schedule_state_recovery(clp);
> > -			if (test_bit(NFS4CLNT_MANAGER_RUNNING, 
> &clp->cl_state) == 0)
> > -				
> rpc_wake_up_queued_task(&clp->cl_rpcwaitq, task);
> > -			task->tk_status = 0;
> > -			return -EAGAIN;
> > +			goto do_state_recovery;
> >  #if defined(CONFIG_NFS_V4_1)
> >  		case -NFS4ERR_BADSESSION:
> >  		case -NFS4ERR_BADSLOT:
> > @@ -3456,6 +3461,13 @@ _nfs4_async_handle_error(struct 
> rpc_task *task, const struct nfs_server *server,
> >  	}
> >  	task->tk_status = nfs4_map_errors(task->tk_status);
> >  	return 0;
> > +do_state_recovery:
> > +	rpc_sleep_on(&clp->cl_rpcwaitq, task, NULL);
> > +	nfs4_schedule_state_recovery(clp);
> > +	if (test_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) == 0)
> > +		rpc_wake_up_queued_task(&clp->cl_rpcwaitq, task);
> > +	task->tk_status = 0;
> > +	return -EAGAIN;
> >  }
> >  
> >  static int
> > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c index 
> > 6d263ed..c1e2733 100644
> > --- a/fs/nfs/nfs4state.c
> > +++ b/fs/nfs/nfs4state.c
> > @@ -901,7 +901,7 @@ void 
> nfs4_schedule_state_recovery(struct nfs_client *clp)
> >  	nfs4_schedule_state_manager(clp);
> >  }
> >  
> > -static int nfs4_state_mark_reclaim_reboot(struct nfs_client *clp, 
> > struct nfs4_state *state)
> > +int nfs4_state_mark_reclaim_reboot(struct nfs_client *clp, struct 
> > +nfs4_state *state)
> >  {
> >  
> >  	set_bit(NFS_STATE_RECLAIM_REBOOT, &state->flags);
> > 
> _______________________________________________
> NFSv4 mailing list
> NFSv4 at linux-nfs.org
> http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
> 


More information about the NFSv4 mailing list