[pnfs] callback problem
William A. (Andy) Adamson
andros at citi.umich.edu
Thu May 24 11:43:27 EDT 2007
On 5/24/07, Benny Halevy <bhalevy at panasas.com> wrote:
>
> Benny Halevy wrote:
> > I tracked down a couple of problems that prevented me
> > from mounting a root directory...
> >
> > 1) client side encoding of SEQUENCE was changed in commit
> > 9894b9f2f5b442238be118ff60eaa918ef6d4843. The server side wasn't
> > causing the server to read illegal opnum '1' after SEQUENCE.
> >
> > 2) commit 01ca2177853ab2488f02867f5de8f44852e9c23f went in
> > assuming commit d02755f9a27767cd3c46a0e4c073c5ccacce04be is already
> > applied. I don't see the patchset I sent on 5/17 in the branch.
> > what happened? This causes the client structure to get free'd
>
> looks like they still sit in origin/my-sessions-5-21
> 34254650a682785a6151a6c690bba772ffcfe210 to
> 3aab918fc037bcbec21f8cc5f40b559f307024fa (head)
please push them to the 4.1-sessions and master branches
>
> >
> > I'm testing the fix and will send patches soon.
>
> patches will follow on a different thread.
I don't know what i did to create the remote branch origin/my-sessions-5-21.
I thought all patches were applied to the 4.1-sessions branch....
i'll apply and push them now.
sorry for the confusion.
>
> > Benny
> >
> > Iyer, Rahul wrote:
> >>
> >>
> >>> -----Original Message-----
> >>> From: William A. (Andy) Adamson [mailto:andros at citi.umich.edu]
> >>> Sent: Wednesday, May 23, 2007 2:17 PM
> >>> To: Iyer, Rahul
> >>> Cc: pnfs at linux-nfs.org
> >>> Subject: Re: [pnfs] callback problem
> >>>
> >>>
> >>>
> >>> On 5/23/07, Iyer, Rahul <Rahul.Iyer at netapp.com> wrote:
> >>>
> >>>
> >>>
> >>> > -----Original Message-----
> >>> > From: William A. (Andy) Adamson [mailto:andros at citi.umich.edu]
> >>> > Sent: Wednesday, May 23, 2007 2:08 PM
> >>> > To: Iyer, Rahul
> >>> > Cc: pnfs at linux-nfs.org
> >>> > Subject: Re: [pnfs] callback problem
> >>> >
> >>> > hi rahul
> >>> >
> >>> > what branch?
> >>> >
> >>> It's the 4.1-sessions branch. Also, oddly enough, I'm not
> >> hitting the
> >>> mempool error.
> >>>
> >>>
> >>>
> >>> i got the mempool error when exchange_id failed in setup_session. the
> >>> "Too many channel attr bitmaps" message is in
> >>> nfsd4_decode_create_session which implies that exchange_id succeeded.
> >> Yes. I grepped through the server and noticed that. I'm confused now.
> >> Are you sure it was exchange_id that failed and not create_session?
> >>
> >> If the create_session call was made by the client and in response to
> it,
> >> the server issued a cb_null (it's async) and then went on and crashed
> >> resulting in create_session to fail on the client, there *might* be a
> >> case where we call mempool_free when a request is allocated. That could
> >> be the BUG(). Do you have the line number, filename for that mempool
> >> BUG() handy?
> >> Thanks,
> >> Regards
> >> Rahul
> >>
> >>> -->Andy
> >>>
> >>>
> >>>
> >>>
> >>> This time around, I had the v4.1 mount fail and the v4.0
> >>> mount succeed. The server did have a NULL pointer dereference
> >> though.
> >>> Regards
> >>> Rahul.
> >>>
> >>> >
> >>> >
> >>> > On 5/23/07, Iyer, Rahul <Rahul.Iyer at netapp.com> wrote:
> >>> >
> >>> > Hi Andy,
> >>> > I get a NULL pointer dereference on the server side.
> >>> >
> >>> > Here's what I see on the server side:
> >>> >
> >>> > <start>
> >>> > Too many channel attr bitmaps!
> >>> > Unable to handle kernel NULL pointer dereference at
> >>> > 00000000000000b0
> >>> > RIP:
> >>> > [<ffffffff880568e2>] :nfsd:cmp_creds+0x2/0x10
> >>> > PGD 1fc581067 PUD 1fc55f067 PMD 0
> >>> > Oops: 0000 [1] PREEMPT SMP
> >>> > CPU 1
> >>> > Modules linked in: nfs nfsd exportfs lockd sunrpc
> >>> > Pid: 4910, comm: nfsd Not tainted 2.6.18.3 #47
> >>> > RIP: 0010:[<ffffffff880568e2>] [<ffffffff880568e2>]
> >>> > :nfsd:cmp_creds+0x2/0x10
> >>> > RSP: 0018:ffff8101fce01a78 EFLAGS: 00010286
> >>> > RAX: 0000000000000000 RBX: ffff8101fe7bf448 RCX:
> >>> > 0000000000000000
> >>> > RDX: 2222222222222222 RSI: ffff8101fe7bf448 RDI:
> >>> > 00000000000000b0
> >>> > RBP: ffff8101fce01b10 R08: ffffffff88079990 R09:
> >>> > ffffffff88079990
> >>> > R10: ffff8101fce01a90 R11: ffff8101fe7e9800 R12:
> >>> > ffff8101fcf23080
> >>> > R13: ffff8101fe7bf400 R14: ffff8101fce01a90 R15:
> >>> > 0000000000000003
> >>> > FS: 00002baa68a666d0(0000) GS:ffff8101ffc01340(0000)
> >>> > knlGS:0000000000000000
> >>> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>> > CR2: 00000000000000b0 CR3: 00000001fcab6000 CR4:
> >>> > 00000000000006e0
> >>> > Process nfsd (pid: 4910, threadinfo
> >>> ffff8101fce00000, task
> >>> > ffff8101fde60140)
> >>> > Stack: ffffffff8805a14d ffff8101fe7e9800
> >>> 190c1eac80244b79
> >>> > 3864663632396533
> >>> > 3436393164623462 6335633234653932 3730383534633231
> >>> > ffff8101fe7bf400
> >>> > ffff8101fcf23000 ffff810100000024 ffff8101fce5c088
> >>> > 590e5b2216a85446
> >>> > Call Trace:
> >>> > [<ffffffff8805a14d>]
> >>> :nfsd:nfsd4_exchange_id+0x16d/0x2a0
> >>> > [<ffffffff8804fd5a>]
> >>> :nfsd:nfsd4_proc_compound+0x15ba/0x1820
> >>> > [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>> > [<ffffffff80487e5f>] release_sock+0x2f/0xe0
> >>> > [<ffffffff88009fd9>]
> >>> :sunrpc:svc_sock_enqueue+0x49/0x2c0
> >>> > [<ffffffff80242ba0>] check_usage+0x40/0x2b0
> >>> > [<ffffffff80244d7b>] trace_hardirqs_on+0x11b/0x150
> >>> > [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>> > [<ffffffff8800a22d>]
> >>> :sunrpc:svc_sock_enqueue+0x29d/0x2c0
> >>> > [<ffffffff8800ff08>]
> >>> :sunrpc:sunrpc_cache_lookup+0x98/0x170
> >>> > [<ffffffff804f39e0>] _read_unlock+0x30/0x60
> >>> > [<ffffffff8800ff08>]
> >>> :sunrpc:sunrpc_cache_lookup+0x98/0x170
> >>> > [<ffffffff8800c801>] :sunrpc:ip_map_lookup+0xc1/0xd0
> >>> > [<ffffffff8805557a>]
> >>> > :nfsd:nfs4svc_decode_compoundargs+0xfaa/0x1080
> >>> > [<ffffffff8803e121>] :nfsd:nfsd_dispatch+0x101/0x1e0
> >>> > [<ffffffff880093fa>] :sunrpc:svc_process+0x46a/0x740
> >>> > [<ffffffff8803e6f8>] :nfsd:nfsd+0x1d8/0x380
> >>> > [<ffffffff8020ad2c>] child_rip+0xa/0x12
> >>> > [<ffffffff804f3bbb>] _spin_unlock_irq+0x2b/0x60
> >>> > [<ffffffff8020a33c>] restore_args+0x0/0x30
> >>> > [<ffffffff8803e520>] :nfsd:nfsd+0x0/0x380
> >>> > [<ffffffff8020ad22>] child_rip+0x0/0x12
> >>> >
> >>> >
> >>> > Code: 39 07 55 48 89 e5 c9 0f 94 c0 0f b6 c0 c3 48 8b
> >>> > 05 b1 a4 02
> >>> > RIP [<ffffffff880568e2>] :nfsd:cmp_creds+0x2/0x10
> >>> > RSP <ffff8101fce01a78>
> >>> > CR2: 00000000000000b0
> >>> >
> >>> > =====================================
> >>> > [ BUG: lock held at task exit time! ]
> >>> > -------------------------------------
> >>> > nfsd/4910 is exiting with locks still held!
> >>> > 2 locks held by nfsd/4910:
> >>> > #0: (hash_sem){..--}, at: [<ffffffff88045200>]
> >>> > exp_readlock+0x10/0x20
> >>> > [nfsd]
> >>> > #1: (client_mutex){--..}, at: [<ffffffff804f2015>]
> >>> > mutex_lock+0x25/0x30
> >>> >
> >>> > stack backtrace:
> >>> >
> >>> > Call Trace:
> >>> > [<ffffffff80241ab9>] debug_check_no_locks_held+0x89/0xa0
> >>> > [<ffffffff8022acf3>] do_exit+0x923/0x990
> >>> > [<ffffffff8021d4b3>] do_page_fault+0x7e3/0x8e0
> >>> > [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>> > [<ffffffff8026fae8>] kfree+0xf8/0x110
> >>> > [<ffffffff804f1fbb>] __mutex_lock_slowpath+0x23b/0x270
> >>> > [<ffffffff80244b79>] mark_held_locks+0x79/0xa0
> >>> > [<ffffffff8020ab5d>] error_exit+0x0/0x96
> >>> > [<ffffffff880568e2>] :nfsd:cmp_creds+0x2/0x10
> >>> > [<ffffffff8805a14d>] :nfsd:nfsd4_exchange_id+0x16d/0x2a0
> >>> > [<ffffffff8804fd5a>]
> >>> :nfsd:nfsd4_proc_compound+0x15ba/0x1820
> >>> > [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>> > [<ffffffff80487e5f>] release_sock+0x2f/0xe0
> >>> > [<ffffffff88009fd9>] :sunrpc:svc_sock_enqueue+0x49/0x2c0
> >>> > [<ffffffff80242ba0>] check_usage+0x40/0x2b0
> >>> > [<ffffffff80244d7b>] trace_hardirqs_on+0x11b/0x150
> >>> > [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>> > [<ffffffff8800a22d>]
> >>> :sunrpc:svc_sock_enqueue+0x29d/0x2c0
> >>> > [<ffffffff8800ff08>]
> >>> :sunrpc:sunrpc_cache_lookup+0x98/0x170
> >>> > [<ffffffff804f39e0>] _read_unlock+0x30/0x60
> >>> > [<ffffffff8800ff08>]
> >>> :sunrpc:sunrpc_cache_lookup+0x98/0x170
> >>> > [<ffffffff8800c801>] :sunrpc:ip_map_lookup+0xc1/0xd0
> >>> > [<ffffffff8805557a>]
> >>> > :nfsd:nfs4svc_decode_compoundargs+0xfaa/0x1080
> >>> > [<ffffffff8803e121>] :nfsd:nfsd_dispatch+0x101/0x1e0
> >>> > [<ffffffff880093fa>] :sunrpc:svc_process+0x46a/0x740
> >>> > [<ffffffff8803e6f8>] :nfsd:nfsd+0x1d8/0x380
> >>> > [<ffffffff8020ad2c>] child_rip+0xa/0x12
> >>> > [<ffffffff804f3bbb>] _spin_unlock_irq+0x2b/0x60
> >>> > [<ffffffff8020a33c>] restore_args+0x0/0x30
> >>> > [<ffffffff8803e520>] :nfsd:nfsd+0x0/0x380
> >>> > [<ffffffff8020ad22>] child_rip+0x0/0x12
> >>> >
> >>> >
> >>> > <end>
> >>> >
> >>> > Regards
> >>> > Rahul
> >>> >
> >>> >
> >>> > > -----Original Message-----
> >>> > > From: William A. (Andy) Adamson
> >>> [mailto:andros at citi.umich.edu ]
> >>> > > Sent: Wednesday, May 23, 2007 1:42 PM
> >>> > > To: Marc Eshel
> >>> > > Cc: Iyer, Rahul; pnfs at linux-nfs.org
> >>> > <mailto: pnfs at linux-nfs.org <mailto:pnfs at linux-nfs.org> >
> >>> > > Subject: Re: [pnfs] callback problem
> >>> > >
> >>> > >
> >>> > >
> >>> > > On 5/23/07, Marc Eshel <
> >>> eshel at almaden.ibm.com <mailto:eshel at almaden.ibm.com> > wrote:
> >>> > >
> >>> > > Some more observation I will continue to debug.
> >>> > >
> >>> > > On the client side I got the following
> >>> messages after
> >>> > > mount and 2 of them
> >>> > > where incorrect. The first one
> >>> "Couldn't mount...." but
> >>> > > mount did complete
> >>> > > and I was able to use it. The seconds
> >>> one "started cb
> >>> > > service!" is wrong
> >>> > > callback failed since the IP address
> >>> was 0.0.0.0 on the
> >>> > > server side.
> >>> > >
> >>> > > [ root at fin30 ~]# mount -t nfs4 fin18:/ /mnt
> >>> > > Message from syslogd at fin30 at Wed May 23
> >>> > 11:34:03 2007 ...
> >>> > > fin30 kernel: started cb service!
> >>> > > Message from syslogd at fin30 at Wed May 23
> >>> > 11:34:05 2007 ...
> >>> > > fin30 kernel: Couldn't mount using
> >>> minorversion 1
> >>> > > Message from syslogd at fin30 at Wed May 23
> >>> > 11:34:05 2007 ...
> >>> > > fin30 kernel: started cb service!
> >>> > >
> >>> > >
> >>> > >
> >>> > > On the server side I had to comment
> >>> list_move_tail line
> >>> > > since it was
> >>> > > crashing with a null pointer.
> >>> > >
> >>> > > static inline void
> >>> > > renew_client(struct nfs4_client *clp)
> >>> > > {
> >>> > > /*
> >>> > > * Move client to the end to the
> >>> LRU list.
> >>> > > */
> >>> > > dprintk("renewing client (clientid
> >>> > %08x/%08x)\n",
> >>> > >
> >>> clp->cl_clientid.cl_boot,
> >>> > > clp->cl_clientid.cl_id);
> >>> > > //??? list_move_tail(&clp->cl_lru,
> >>> &client_lru);
> >>> > > clp->cl_time = get_seconds();
> >>> > > }
> >>> > >
> >>> > >
> >>> > > i think there is a race on the server regarding the
> >>> > > nfs_client - i've failed in
> >>> > >
> >>> > > 1) find_confirmed_client_by_str called by
> >>> nfsd4_exchange_id
> >>> > > 2) succeeded in calling renew_client, and failed in
> >>> > nfsd4_sequence.
> >>> > >
> >>> > > -->Andy
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>>
> >>>
> >> _______________________________________________
> >> pNFS mailing list
> >> pNFS at linux-nfs.org
> >> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> >
> >
>
>
> --
> Benny Halevy
> Software Architect
> Tel/Fax: +972-3-647-8340
> Mobile: +972-54-802-8340
> bhalevy at panasas.com
>
> Panasas, Inc.
> The Leader in Parallel Storage
> www.panasas.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://linux-nfs.org/pipermail/pnfs/attachments/20070524/f27292ae/attachment-0001.htm
More information about the pNFS
mailing list