READ/RENEW loop bug
Muntz, Daniel
Dan.Muntz at netapp.com
Thu Jan 21 17:40:46 EST 2010
> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields at fieldses.org]
> Sent: Thursday, January 21, 2010 2:21 PM
> To: Myklebust, Trond
> Cc: nfsv4 at linux-nfs.org
> Subject: Re: READ/RENEW loop bug
>
> On Thu, Jan 21, 2010 at 05:12:48PM -0500, Trond Myklebust wrote:
> > On Thu, 2010-01-21 at 16:56 -0500, J. Bruce Fields wrote:
> > > On Tue, Jan 19, 2010 at 09:28:45PM -0500, Steven Brudenell wrote:
> > > > To clarify, since I can't remember if I said this in private to
> > > > Bruce or to the list:
> > > >
> > > > There isn't any trigger for this bug such as a
> rebooting server.
> > > > Both client and server are humming along fine, when
> suddenly the
> > > > client ends up with a stateid twenty hours older than
> the clientid.
> > >
> > > Well, then maybe it is possible there's some client bug that's
> > > causing it to use the wrong stateid. I'm not sure how to
> look for that.
> >
> > There was a time when the renew daemon was being turned off
> every time
> > you unmounted a filesystem from the same server. That bug
> was fixed by
> > commit 3050141bae57984dd660e6861632ccf9b8bca77e (NFSv4: Kill
> > nfs4_renewd_prepare_shutdown()).
> >
> > However I can see that even in today's tree, we don't guarantee
> > recovery from the NFS4ERR_STALE_STATEID error. I'm
> therefore thinking
> > of committing something like the appended patch to ensure
> that we do...
>
> Makes sense. The one disadvantage from my point of view is
> that it might cover up a server problem that we'd rather know
> about. But there's not much to be done about that. (Will
> the application wind up getting an IO error in this case, or
> does the client try to get a new
> open?)
Yeah, I was wondering if there might be a server error here that would
get masked. When I get some time, there's an error I can reproduce
occasionally that _seems_ like the server is "forgetting" a stateid.
There's not much to see in traces when this happens--the client gets a
stateid and the server (sometimes quickly, sometimes after some time),
starts returning NFS4ERR_BAD_STATEID, and eventually starts returning
NFS4ERR_EXPIRED. The client (intentionally) is stubbornly using the
same stateid because there's no obvious reason that the stateid should
be invalid.
-Dan
>
> --b.
>
> >
> > Cheers
> > Trond
> >
> ----------------------------------------------------------------------
> > -----------------
> > NFS: Ensure that we handle NFS4ERR_STALE_STATEID correctly
> >
> > From: Trond Myklebust <Trond.Myklebust at netapp.com>
> >
> > Even if the server is crazy, we should be able to mark the
> stateid as
> > being bad, to ensure it gets recovered.
> >
> > Signed-off-by: Trond Myklebust <Trond.Myklebust at netapp.com>
> > ---
> >
> > fs/nfs/nfs4_fs.h | 1 +
> > fs/nfs/nfs4proc.c | 38 +++++++++++++++++++++++++-------------
> > fs/nfs/nfs4state.c | 2 +-
> > 3 files changed, 27 insertions(+), 14 deletions(-)
> >
> >
> > diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h index
> > ea2f41b..0c6fda3 100644
> > --- a/fs/nfs/nfs4_fs.h
> > +++ b/fs/nfs/nfs4_fs.h
> > @@ -278,6 +278,7 @@ extern void nfs4_state_set_mode_locked(struct
> > nfs4_state *, fmode_t); extern void
> > nfs4_schedule_state_recovery(struct nfs_client *); extern void
> > nfs4_schedule_state_manager(struct nfs_client *); extern int
> > nfs4_state_mark_reclaim_nograce(struct nfs_client *clp, struct
> > nfs4_state *state);
> > +extern int nfs4_state_mark_reclaim_reboot(struct nfs_client *clp,
> > +struct nfs4_state *state);
> > extern void nfs41_handle_sequence_flag_errors(struct
> nfs_client *clp,
> > u32 flags); extern void nfs4_put_lock_state(struct nfs4_lock_state
> > *lsp); extern int nfs4_set_lock_state(struct nfs4_state *state,
> > struct file_lock *fl); diff --git a/fs/nfs/nfs4proc.c
> > b/fs/nfs/nfs4proc.c index afbfe67..dca0ebc 100644
> > --- a/fs/nfs/nfs4proc.c
> > +++ b/fs/nfs/nfs4proc.c
> > @@ -249,14 +249,14 @@ static int
> nfs4_handle_exception(const struct nfs_server *server, int errorcode,
> > if (state == NULL)
> > break;
> > nfs4_state_mark_reclaim_nograce(clp, state);
> > - case -NFS4ERR_STALE_CLIENTID:
> > + goto do_state_recovery;
> > case -NFS4ERR_STALE_STATEID:
> > + if (state == NULL)
> > + break;
> > + nfs4_state_mark_reclaim_reboot(clp, state);
> > + case -NFS4ERR_STALE_CLIENTID:
> > case -NFS4ERR_EXPIRED:
> > - nfs4_schedule_state_recovery(clp);
> > - ret = nfs4_wait_clnt_recover(clp);
> > - if (ret == 0)
> > - exception->retry = 1;
> > - break;
> > + goto do_state_recovery;
> > #if defined(CONFIG_NFS_V4_1)
> > case -NFS4ERR_BADSESSION:
> > case -NFS4ERR_BADSLOT:
> > @@ -289,6 +289,12 @@ static int nfs4_handle_exception(const
> struct nfs_server *server, int errorcode,
> > }
> > /* We failed to handle the error */
> > return nfs4_map_errors(ret);
> > +do_state_recovery:
> > + nfs4_schedule_state_recovery(clp);
> > + ret = nfs4_wait_clnt_recover(clp);
> > + if (ret == 0)
> > + exception->retry = 1;
> > + return ret;
> > }
> >
> >
> > @@ -3420,15 +3426,14 @@ _nfs4_async_handle_error(struct
> rpc_task *task, const struct nfs_server *server,
> > if (state == NULL)
> > break;
> > nfs4_state_mark_reclaim_nograce(clp, state);
> > - case -NFS4ERR_STALE_CLIENTID:
> > + goto do_state_recovery;
> > case -NFS4ERR_STALE_STATEID:
> > + if (state == NULL)
> > + break;
> > + nfs4_state_mark_reclaim_reboot(clp, state);
> > + case -NFS4ERR_STALE_CLIENTID:
> > case -NFS4ERR_EXPIRED:
> > - rpc_sleep_on(&clp->cl_rpcwaitq, task, NULL);
> > - nfs4_schedule_state_recovery(clp);
> > - if (test_bit(NFS4CLNT_MANAGER_RUNNING,
> &clp->cl_state) == 0)
> > -
> rpc_wake_up_queued_task(&clp->cl_rpcwaitq, task);
> > - task->tk_status = 0;
> > - return -EAGAIN;
> > + goto do_state_recovery;
> > #if defined(CONFIG_NFS_V4_1)
> > case -NFS4ERR_BADSESSION:
> > case -NFS4ERR_BADSLOT:
> > @@ -3456,6 +3461,13 @@ _nfs4_async_handle_error(struct
> rpc_task *task, const struct nfs_server *server,
> > }
> > task->tk_status = nfs4_map_errors(task->tk_status);
> > return 0;
> > +do_state_recovery:
> > + rpc_sleep_on(&clp->cl_rpcwaitq, task, NULL);
> > + nfs4_schedule_state_recovery(clp);
> > + if (test_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) == 0)
> > + rpc_wake_up_queued_task(&clp->cl_rpcwaitq, task);
> > + task->tk_status = 0;
> > + return -EAGAIN;
> > }
> >
> > static int
> > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c index
> > 6d263ed..c1e2733 100644
> > --- a/fs/nfs/nfs4state.c
> > +++ b/fs/nfs/nfs4state.c
> > @@ -901,7 +901,7 @@ void
> nfs4_schedule_state_recovery(struct nfs_client *clp)
> > nfs4_schedule_state_manager(clp);
> > }
> >
> > -static int nfs4_state_mark_reclaim_reboot(struct nfs_client *clp,
> > struct nfs4_state *state)
> > +int nfs4_state_mark_reclaim_reboot(struct nfs_client *clp, struct
> > +nfs4_state *state)
> > {
> >
> > set_bit(NFS_STATE_RECLAIM_REBOOT, &state->flags);
> >
> _______________________________________________
> NFSv4 mailing list
> NFSv4 at linux-nfs.org
> http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
>
More information about the NFSv4
mailing list