[pnfs] Potential circular waiting when bl_read/write_pagelist return -EAGAIN

Zhang Jingwang yyalone at gmail.com
Sun Jan 24 23:01:41 EST 2010


I'd appreciate any comments on this problem.

The error message in /var/log/messages looks like this:

Blocklayout_Client kernel: INFO: task events/0:6 blocked for more than 120 seconds.
Blocklayout_Client kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Blocklayout_Client kernel: events/0      D 0000000000000000  3672 6      2 0x00000000
Blocklayout_Client kernel: ffff88003eb81c90 0000000000000046 ffff88003eb74860 ffff880024759c00
Blocklayout_Client kernel: ffff88003eb81c20 0000000000000246 ffff88003eb81fd8 ffff88003eb81fd8
Blocklayout_Client kernel: ffff88003eb74c50 000000000000fa40 00000000001d5b80 ffff88003eb74c50
Blocklayout_Client kernel: Call Trace:
Blocklayout_Client kernel: [<ffffffffa021a5bd>] ? pnfs_return_layout_barrier+0xb2/0xe6 [nfs]
Blocklayout_Client kernel: [<ffffffffa021bb12>] _pnfs_return_layout+0x26e/0x3ca [nfs]
Blocklayout_Client kernel: [<ffffffff8106f036>] ? autoremove_wake_function+0x0/0x39
Blocklayout_Client kernel: [<ffffffffa021befe>] pnfs_read_done+0x9b/0xca [nfs]
Blocklayout_Client kernel: [<ffffffffa003c355>] bl_read_cleanup+0x40/0x44 [blocklayoutdriver]
Blocklayout_Client kernel: [<ffffffff8106b086>] worker_thread+0x257/0x350
Blocklayout_Client kernel: [<ffffffff8106b02e>] ? worker_thread+0x1ff/0x350
Blocklayout_Client kernel: [<ffffffffa003c315>] ? bl_read_cleanup+0x0/0x44 [blocklayoutdriver]
Blocklayout_Client kernel: [<ffffffff8106f036>] ? autoremove_wake_function+0x0/0x39
Blocklayout_Client kernel: [<ffffffff8106ae2f>] ? worker_thread+0x0/0x350
Blocklayout_Client kernel: [<ffffffff8106ed64>] kthread+0x7f/0x87
Blocklayout_Client kernel: [<ffffffff81012cea>] child_rip+0xa/0x20
Blocklayout_Client kernel: [<ffffffff81012650>] ? restore_args+0x0/0x30
Blocklayout_Client kernel: [<ffffffff8106ece5>] ? kthread+0x0/0x87
Blocklayout_Client kernel: [<ffffffff81012ce0>] ? child_rip+0x0/0x20
Blocklayout_Client kernel: 2 locks held by events/0/6:
Blocklayout_Client kernel: #0:  (events){+.+.+.}, at: [<ffffffff8106b02e>] worker_thread+0x1ff/0x350
Blocklayout_Client kernel: #1:  (&rdata->task.u.tk_work){+.+...}, at: [<ffffffff8106b02e>] worker_thread+0x1ff/0x350

Removing the lseg refcount check in pnfs_return_layout_barrier avoids
this hang:
static bool
pnfs_return_layout_barrier(struct nfs_inode *nfsi,
                           struct nfs4_pnfs_layout_segment *range)
{
        struct pnfs_layout_segment *lseg;
        bool ret = false;

        spin_lock(&nfsi->lo_lock);
        list_for_each_entry (lseg, &nfsi->layout.segs, fi_list) {
                if (!should_free_lseg(lseg, range))
                        continue;
                lseg->valid = false;
//                if (!_pnfs_can_return_lseg(lseg)) {
//                        dprintk("%s: wait on lseg %p refcount %d\n",
//                                __func__, lseg,
//                                atomic_read(&lseg->kref.refcount));
//                        ret = true;
//                }
        }
        if (atomic_read(&nfsi->layout.lgetcount))
                ret = true;
        spin_unlock(&nfsi->lo_lock);

        dprintk("%s:Return %d\n", __func__, ret);
        return ret;
}

However, with this change nfsi->layout->ld_data can be released before
bl_cleanup_layoutcommit is called, which causes a NULL pointer
dereference panic.

The latter problem can be solved by holding a reference on nfsi->layout.
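
To make the idea concrete, here is a minimal userspace sketch of keeping
ld_data alive through a reference count. It is only an illustration under
stated assumptions: model_layout, model_layout_alloc, model_layout_get and
model_layout_put are hypothetical names, not the real pNFS structures or
functions, and a kernel version would need an atomic counter (for example
a kref) instead of a plain int.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the layout; not the real pNFS structure. */
struct model_layout {
        int refcount;   /* plain int for illustration; not thread-safe */
        void *ld_data;  /* stands in for the layout driver's private data */
};

/* Allocate a layout holding one initial reference. */
static struct model_layout *model_layout_alloc(void)
{
        struct model_layout *lo = calloc(1, sizeof(*lo));

        if (!lo)
                return NULL;
        lo->ld_data = malloc(16);
        if (!lo->ld_data) {
                free(lo);
                return NULL;
        }
        lo->refcount = 1;
        return lo;
}

/* Take an extra reference, e.g. before starting a layoutcommit. */
static void model_layout_get(struct model_layout *lo)
{
        lo->refcount++;
}

/*
 * Drop a reference. ld_data is freed only when the last reference
 * goes away. Returns 1 if the layout was freed, 0 otherwise.
 */
static int model_layout_put(struct model_layout *lo)
{
        if (--lo->refcount > 0)
                return 0;
        free(lo->ld_data);
        free(lo);
        return 1;
}
```

In this model, a cleanup path that took its own reference with
model_layout_get still sees a non-NULL ld_data after the layout return
path has dropped its reference.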

My question is: is this the right way to solve this problem? Thanks!

2010/1/8 Zhang Jingwang <yyalone at gmail.com>:
> Take bl_read_pagelist as an example:
>
> 1). Set up the arguments.
> 2). Call pnfs_update_layout to get a layout (lseg) and take a
> reference to it.
> 3). Call the layout driver's read_pagelist.
> 4). When the read is done, the layout driver calls pNFS's callback
> pnfs_read_done.
> 5). pnfs_read_done releases the reference to the layout (lseg).
> 6). If the error is -EAGAIN, call return_layout and read again.
> return_layout will wait for others to release the lseg.
>
> In step 4:
> bl_end_par_io_read(void *data)
> {
>        struct nfs_read_data *rdata = data;
>
>        INIT_WORK(&rdata->task.u.tk_work, bl_read_cleanup);
>        schedule_work(&rdata->task.u.tk_work);
> }
> This code constructs a work_struct and hands it to the keventd thread.
> keventd will call bl_read_cleanup, and bl_read_cleanup calls
> pnfs_read_done to finish step 5 and step 6.
>
> Assume that two read requests use the same lseg, and each takes a
> reference to it. When the first returns -EAGAIN and goes into step 6,
> it finds that another request still holds the lseg, so it waits
> until the other one releases it. But when the second read request
> wants to release its reference to the lseg, it queues that work to the
> keventd thread, which is itself waiting for the lseg to be released.
> Because the pending functions on the work queue are executed serially
> on each CPU, the second never gets a chance to release the lseg,
> since the first never finishes its work.
>
> So the keventd thread will be blocked forever.
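
The circular wait described above can be reproduced in a small userspace
model. This is only a sketch under stated assumptions: model_lseg,
model_work and model_run_queue are hypothetical names, the refcount counts
only the references taken by in-flight reads, and every remaining holder
is assumed to be queued on the same single worker thread. Instead of
actually blocking, the function reports which queued item would block
forever.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an lseg; only the I/O references are counted. */
struct model_lseg {
        int refcount;
};

enum model_result {
        MODEL_READ_OK,      /* read finished without error */
        MODEL_READ_EAGAIN   /* read finished with -EAGAIN */
};

/* One queued completion, standing in for a bl_read_cleanup work item. */
struct model_work {
        enum model_result result;
        struct model_lseg *lseg;
};

/*
 * Run the queue the way a single worker thread would: strictly in
 * order, one item at a time. Returns the index of the item that would
 * wait forever, or -1 if every item completes.
 */
static int model_run_queue(struct model_work *q, size_t n)
{
        for (size_t i = 0; i < n; i++) {
                struct model_lseg *lseg = q[i].lseg;

                /* Step 5: drop this request's own reference. */
                lseg->refcount--;

                /*
                 * Step 6: on -EAGAIN the layout is returned, which
                 * waits for all other references to be dropped. The
                 * only code that could drop them is queued behind
                 * this item on the same thread, so it never runs.
                 */
                if (q[i].result == MODEL_READ_EAGAIN && lseg->refcount > 0)
                        return (int)i;
        }
        return -1;
}
```

With two reads sharing one lseg (refcount 2), queueing the -EAGAIN
completion first makes model_run_queue report item 0 as stuck, while the
opposite order lets both items complete.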

-- 
Zhang Jingwang
National Research Centre for High Performance Computers
Institute of Computing Technology, Chinese Academy of Sciences
No. 6, South Kexueyuan Road, Haidian District
Beijing, China

