[pnfs] callback problem

William A. (Andy) Adamson andros at citi.umich.edu
Thu May 24 11:43:27 EDT 2007


On 5/24/07, Benny Halevy <bhalevy at panasas.com> wrote:
>
> Benny Halevy wrote:
> > I tracked down a couple of problems that prevented me
> > from mounting a root directory...
> >
> > 1) client side encoding of SEQUENCE was changed in commit
> > 9894b9f2f5b442238be118ff60eaa918ef6d4843.  The server side wasn't
> > causing the server to read illegal opnum '1' after SEQUENCE.
> >
> > 2) commit 01ca2177853ab2488f02867f5de8f44852e9c23f went in
> > assuming commit d02755f9a27767cd3c46a0e4c073c5ccacce04be is already
> > applied.  I don't see the patchset I sent on 5/17 in the branch.
> > what happened?  This causes the client structure to get free'd
>
> looks like they still sit in origin/my-sessions-5-21
> 34254650a682785a6151a6c690bba772ffcfe210 to
> 3aab918fc037bcbec21f8cc5f40b559f307024fa (head)



please push them to the 4.1-sessions and master branches
>
> >
> > I'm testing the fix and will send patches soon.
>
> patches will follow on a different thread.



I don't know what i did to create the remote branch origin/my-sessions-5-21.
I thought all patches were applied to the 4.1-sessions branch....

i'll apply and push them now.

sorry for the confusion.

>
> > Benny
> >
> > Iyer, Rahul wrote:
> >>
> >>
> >>> -----Original Message-----
> >>> From: William A. (Andy) Adamson [mailto:andros at citi.umich.edu]
> >>> Sent: Wednesday, May 23, 2007 2:17 PM
> >>> To: Iyer, Rahul
> >>> Cc: pnfs at linux-nfs.org
> >>> Subject: Re: [pnfs] callback problem
> >>>
> >>>
> >>>
> >>> On 5/23/07, Iyer, Rahul <Rahul.Iyer at netapp.com> wrote:
> >>>
> >>>
> >>>
> >>>     > -----Original Message-----
> >>>     > From: William A. (Andy) Adamson [mailto:andros at citi.umich.edu]
> >>>     > Sent: Wednesday, May 23, 2007 2:08 PM
> >>>     > To: Iyer, Rahul
> >>>     > Cc: pnfs at linux-nfs.org
> >>>     > Subject: Re: [pnfs] callback problem
> >>>     >
> >>>     > hi rahul
> >>>     >
> >>>     > what branch?
> >>>     >
> >>>     It's the 4.1-sessions branch. Also, oddly enough, I'm not
> >> hitting the
> >>>     mempool error.
> >>>
> >>>
> >>>
> >>> i got the mempool error when exchange_id failed in setup_session. the
> >>> "Too many channel attr bitmaps" message is in
> >>> nfsd4_decode_create_session which implies that exchange_id succeeded.
> >> Yes. I grepped through the server and noticed that. I'm confused now.
> >> Are you sure it was exchange_id that failed and not create_session?
> >>
> >> If the create_session call was made by the client and in response to
> it,
> >> the server issued a cb_null (it's async) and then went on and crashed
> >> resulting in create_session to fail on the client, there *might* be a
> >> case where we call mempool_free when a request is allocated. That could
> >> be the BUG(). Do you have the line number, filename for that mempool
> >> BUG() handy?
> >> Thanks,
> >> Regards
> >> Rahul
> >>
> >>> -->Andy
> >>>
> >>>
> >>>
> >>>
> >>>     This time around, I had the v4.1 mount fail and the v4.0
> >>>     mount succeed. The server did have a NULL pointer dereference
> >> though.
> >>>     Regards
> >>>     Rahul.
> >>>
> >>>     >
> >>>     >
> >>>     > On 5/23/07, Iyer, Rahul <Rahul.Iyer at netapp.com> wrote:
> >>>     >
> >>>     >       Hi Andy,
> >>>     >       I get a NULL pointer dereference on the server side.
> >>>     >
> >>>     >       Here's what I see on the server side:
> >>>     >
> >>>     >       <start>
> >>>     >       Too many channel attr bitmaps!
> >>>     >       Unable to handle kernel NULL pointer dereference at
> >>>     > 00000000000000b0
> >>>     >       RIP:
> >>>     >       [<ffffffff880568e2>] :nfsd:cmp_creds+0x2/0x10
> >>>     >       PGD 1fc581067 PUD 1fc55f067 PMD 0
> >>>     >       Oops: 0000 [1] PREEMPT SMP
> >>>     >       CPU 1
> >>>     >       Modules linked in: nfs nfsd exportfs lockd sunrpc
> >>>     >       Pid: 4910, comm: nfsd Not tainted 2.6.18.3 #47
> >>>     >       RIP: 0010:[<ffffffff880568e2>]  [<ffffffff880568e2>]
> >>>     >       :nfsd:cmp_creds+0x2/0x10
> >>>     >       RSP: 0018:ffff8101fce01a78  EFLAGS: 00010286
> >>>     >       RAX: 0000000000000000 RBX: ffff8101fe7bf448 RCX:
> >>>     > 0000000000000000
> >>>     >       RDX: 2222222222222222 RSI: ffff8101fe7bf448 RDI:
> >>>     > 00000000000000b0
> >>>     >       RBP: ffff8101fce01b10 R08: ffffffff88079990 R09:
> >>>     > ffffffff88079990
> >>>     >       R10: ffff8101fce01a90 R11: ffff8101fe7e9800 R12:
> >>>     > ffff8101fcf23080
> >>>     >       R13: ffff8101fe7bf400 R14: ffff8101fce01a90 R15:
> >>>     > 0000000000000003
> >>>     >       FS:  00002baa68a666d0(0000) GS:ffff8101ffc01340(0000)
> >>>     >       knlGS:0000000000000000
> >>>     >       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>>     >       CR2: 00000000000000b0 CR3: 00000001fcab6000 CR4:
> >>>     > 00000000000006e0
> >>>     >       Process nfsd (pid: 4910, threadinfo
> >>> ffff8101fce00000, task
> >>>     >       ffff8101fde60140)
> >>>     >       Stack:  ffffffff8805a14d ffff8101fe7e9800
> >>> 190c1eac80244b79
> >>>     >       3864663632396533
> >>>     >       3436393164623462 6335633234653932 3730383534633231
> >>>     > ffff8101fe7bf400
> >>>     >       ffff8101fcf23000 ffff810100000024 ffff8101fce5c088
> >>>     > 590e5b2216a85446
> >>>     >       Call Trace:
> >>>     >       [<ffffffff8805a14d>]
> >>> :nfsd:nfsd4_exchange_id+0x16d/0x2a0
> >>>     >       [<ffffffff8804fd5a>]
> >>> :nfsd:nfsd4_proc_compound+0x15ba/0x1820
> >>>     >       [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>>     >       [<ffffffff80487e5f>] release_sock+0x2f/0xe0
> >>>     >       [<ffffffff88009fd9>]
> >>> :sunrpc:svc_sock_enqueue+0x49/0x2c0
> >>>     >       [<ffffffff80242ba0>] check_usage+0x40/0x2b0
> >>>     >       [<ffffffff80244d7b>] trace_hardirqs_on+0x11b/0x150
> >>>     >       [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>>     >       [<ffffffff8800a22d>]
> >>> :sunrpc:svc_sock_enqueue+0x29d/0x2c0
> >>>     >       [<ffffffff8800ff08>]
> >>> :sunrpc:sunrpc_cache_lookup+0x98/0x170
> >>>     >       [<ffffffff804f39e0>] _read_unlock+0x30/0x60
> >>>     >       [<ffffffff8800ff08>]
> >>> :sunrpc:sunrpc_cache_lookup+0x98/0x170
> >>>     >       [<ffffffff8800c801>] :sunrpc:ip_map_lookup+0xc1/0xd0
> >>>     >       [<ffffffff8805557a>]
> >>>     > :nfsd:nfs4svc_decode_compoundargs+0xfaa/0x1080
> >>>     >       [<ffffffff8803e121>] :nfsd:nfsd_dispatch+0x101/0x1e0
> >>>     >       [<ffffffff880093fa>] :sunrpc:svc_process+0x46a/0x740
> >>>     >       [<ffffffff8803e6f8>] :nfsd:nfsd+0x1d8/0x380
> >>>     >       [<ffffffff8020ad2c>] child_rip+0xa/0x12
> >>>     >       [<ffffffff804f3bbb>] _spin_unlock_irq+0x2b/0x60
> >>>     >       [<ffffffff8020a33c>] restore_args+0x0/0x30
> >>>     >       [<ffffffff8803e520>] :nfsd:nfsd+0x0/0x380
> >>>     >       [<ffffffff8020ad22>] child_rip+0x0/0x12
> >>>     >
> >>>     >
> >>>     >       Code: 39 07 55 48 89 e5 c9 0f 94 c0 0f b6 c0 c3 48 8b
> >>>     > 05 b1 a4 02
> >>>     >       RIP  [<ffffffff880568e2>] :nfsd:cmp_creds+0x2/0x10
> >>>     >       RSP <ffff8101fce01a78>
> >>>     >       CR2: 00000000000000b0
> >>>     >
> >>>     >       =====================================
> >>>     >       [ BUG: lock held at task exit time! ]
> >>>     >       -------------------------------------
> >>>     >       nfsd/4910 is exiting with locks still held!
> >>>     >       2 locks held by nfsd/4910:
> >>>     >       #0:  (hash_sem){..--}, at: [<ffffffff88045200>]
> >>>     > exp_readlock+0x10/0x20
> >>>     >       [nfsd]
> >>>     >       #1:  (client_mutex){--..}, at: [<ffffffff804f2015>]
> >>>     >       mutex_lock+0x25/0x30
> >>>     >
> >>>     >       stack backtrace:
> >>>     >
> >>>     >       Call Trace:
> >>>     >       [<ffffffff80241ab9>] debug_check_no_locks_held+0x89/0xa0
> >>>     >       [<ffffffff8022acf3>] do_exit+0x923/0x990
> >>>     >       [<ffffffff8021d4b3>] do_page_fault+0x7e3/0x8e0
> >>>     >       [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>>     >       [<ffffffff8026fae8>] kfree+0xf8/0x110
> >>>     >       [<ffffffff804f1fbb>] __mutex_lock_slowpath+0x23b/0x270
> >>>     >       [<ffffffff80244b79>] mark_held_locks+0x79/0xa0
> >>>     >       [<ffffffff8020ab5d>] error_exit+0x0/0x96
> >>>     >       [<ffffffff880568e2>] :nfsd:cmp_creds+0x2/0x10
> >>>     >       [<ffffffff8805a14d>] :nfsd:nfsd4_exchange_id+0x16d/0x2a0
> >>>     >       [<ffffffff8804fd5a>]
> >>> :nfsd:nfsd4_proc_compound+0x15ba/0x1820
> >>>     >       [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>>     >       [<ffffffff80487e5f>] release_sock+0x2f/0xe0
> >>>     >       [<ffffffff88009fd9>] :sunrpc:svc_sock_enqueue+0x49/0x2c0
> >>>     >       [<ffffffff80242ba0>] check_usage+0x40/0x2b0
> >>>     >       [<ffffffff80244d7b>] trace_hardirqs_on+0x11b/0x150
> >>>     >       [<ffffffff8024465f>] __lock_acquire+0xc3f/0xd20
> >>>     >       [<ffffffff8800a22d>]
> >>> :sunrpc:svc_sock_enqueue+0x29d/0x2c0
> >>>     >       [<ffffffff8800ff08>]
> >>> :sunrpc:sunrpc_cache_lookup+0x98/0x170
> >>>     >       [<ffffffff804f39e0>] _read_unlock+0x30/0x60
> >>>     >       [<ffffffff8800ff08>]
> >>> :sunrpc:sunrpc_cache_lookup+0x98/0x170
> >>>     >       [<ffffffff8800c801>] :sunrpc:ip_map_lookup+0xc1/0xd0
> >>>     >       [<ffffffff8805557a>]
> >>>     > :nfsd:nfs4svc_decode_compoundargs+0xfaa/0x1080
> >>>     >       [<ffffffff8803e121>] :nfsd:nfsd_dispatch+0x101/0x1e0
> >>>     >       [<ffffffff880093fa>] :sunrpc:svc_process+0x46a/0x740
> >>>     >       [<ffffffff8803e6f8>] :nfsd:nfsd+0x1d8/0x380
> >>>     >       [<ffffffff8020ad2c>] child_rip+0xa/0x12
> >>>     >       [<ffffffff804f3bbb>] _spin_unlock_irq+0x2b/0x60
> >>>     >       [<ffffffff8020a33c>] restore_args+0x0/0x30
> >>>     >       [<ffffffff8803e520>] :nfsd:nfsd+0x0/0x380
> >>>     >       [<ffffffff8020ad22>] child_rip+0x0/0x12
> >>>     >
> >>>     >
> >>>     >       <end>
> >>>     >
> >>>     >       Regards
> >>>     >       Rahul
> >>>     >
> >>>     >
> >>>     >       > -----Original Message-----
> >>>     >       > From: William A. (Andy) Adamson
> >>> [mailto:andros at citi.umich.edu ]
> >>>     >       > Sent: Wednesday, May 23, 2007 1:42 PM
> >>>     >       > To: Marc Eshel
> >>>     >       > Cc: Iyer, Rahul; pnfs at linux-nfs.org
> >>>     > <mailto: pnfs at linux-nfs.org <mailto:pnfs at linux-nfs.org> >
> >>>     >       > Subject: Re: [pnfs] callback problem
> >>>     >       >
> >>>     >       >
> >>>     >       >
> >>>     >       > On 5/23/07, Marc Eshel <
> >>> eshel at almaden.ibm.com <mailto:eshel at almaden.ibm.com> > wrote:
> >>>     >       >
> >>>     >       >       Some more observation I will continue to debug.
> >>>     >       >
> >>>     >       >       On the client side I got the following
> >>> messages after
> >>>     >       > mount and 2 of them
> >>>     >       >       where incorrect. The first one
> >>> "Couldn't mount...." but
> >>>     >       > mount did complete
> >>>     >       >       and I was able to use it. The seconds
> >>> one "started cb
> >>>     >       > service!" is wrong
> >>>     >       >       callback failed since the IP address
> >>> was 0.0.0.0 on the
> >>>     >       > server side.
> >>>     >       >
> >>>     >       >       [ root at fin30 ~]# mount -t nfs4 fin18:/ /mnt
> >>>     >       >       Message from syslogd at fin30 at Wed May 23
> >>>     > 11:34:03 2007 ...
> >>>     >       >       fin30 kernel: started cb service!
> >>>     >       >       Message from syslogd at fin30 at Wed May 23
> >>>     > 11:34:05 2007 ...
> >>>     >       >       fin30 kernel: Couldn't mount using
> >>> minorversion 1
> >>>     >       >       Message from syslogd at fin30 at Wed May 23
> >>>     > 11:34:05 2007 ...
> >>>     >       >       fin30 kernel: started cb service!
> >>>     >       >
> >>>     >       >
> >>>     >       >
> >>>     >       >       On the server side I had to comment
> >>> list_move_tail line
> >>>     >       > since it was
> >>>     >       >       crashing with a null pointer.
> >>>     >       >
> >>>     >       >       static inline void
> >>>     >       >       renew_client(struct nfs4_client *clp)
> >>>     >       >       {
> >>>     >       >               /*
> >>>     >       >               * Move client to the end to the
> >>> LRU list.
> >>>     >       >               */
> >>>     >       >               dprintk("renewing client (clientid
> >>>     > %08x/%08x)\n",
> >>>     >       >
> >>> clp->cl_clientid.cl_boot,
> >>>     >       >                               clp->cl_clientid.cl_id);
> >>>     >       >       //???   list_move_tail(&clp->cl_lru,
> >>> &client_lru);
> >>>     >       >               clp->cl_time = get_seconds();
> >>>     >       >       }
> >>>     >       >
> >>>     >       >
> >>>     >       > i think there is a race on the server regarding the
> >>>     >       > nfs_client -  i've failed in
> >>>     >       >
> >>>     >       > 1) find_confirmed_client_by_str called by
> >>> nfsd4_exchange_id
> >>>     >       > 2) succeeded in calling renew_client, and failed in
> >>>     > nfsd4_sequence.
> >>>     >       >
> >>>     >       > -->Andy
> >>>     >       >
> >>>     >       >
> >>>     >       >
> >>>     >       >
> >>>     >
> >>>     >
> >>>     >
> >>>     >
> >>>     >
> >>>
> >>>
> >>>
> >>>
> >>>
> >> _______________________________________________
> >> pNFS mailing list
> >> pNFS at linux-nfs.org
> >> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> >
> >
>
>
> --
> Benny Halevy
> Software Architect
> Tel/Fax: +972-3-647-8340
> Mobile: +972-54-802-8340
> bhalevy at panasas.com
>
> Panasas, Inc.
> The Leader in Parallel Storage
> www.panasas.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://linux-nfs.org/pipermail/pnfs/attachments/20070524/f27292ae/attachment-0001.htm 


More information about the pNFS mailing list