[pnfs] Permanent inode access stall after large writes with LAYOUT_NFSV4_FILES

Mc Carthy, Fergal Fergal.McCarthy at hp.com
Wed Oct 4 07:31:00 EDT 2006


I'm working with the PNFS 2.6.17-CITI_NFS4_ALL-pnfs-1-pnfs code base,
using LAYOUT_NFSV4_FILES with STRIPE_DENSE stripes. 

 

I'm repeatedly seeing a problem whereby if I try to write any
significantly sized amount of data in one go, e.g. 1GB, then the command
hangs forever, with a stack trace like the one shown below:

 

[root@zen12 ~]# dd if=/dev/zero of=/imports/test/dd.out count=1K bs=1M
1024+0 records in
1024+0 records out
<hangs and never returns to prompt>

 

crash> set 3406
    PID: 3406
COMMAND: "dd"
   TASK: e0000040dbcd8000  [THREAD_INFO: e0000040dbcd8f00]
    CPU: 0
  STATE: TASK_INTERRUPTIBLE

crash> bt
PID: 3406   TASK: e0000040dbcd8000  CPU: 0   COMMAND: "dd"
 #0 [BSP:e0000040dbcd92e0] schedule at a000000100672010
 #1 [BSP:e0000040dbcd92c0] nfs_wait_bit_interruptible at a000000201b88b90
 #2 [BSP:e0000040dbcd9270] __wait_on_bit at a000000100674500
 #3 [BSP:e0000040dbcd9238] out_of_line_wait_on_bit at a000000100674660
 #4 [BSP:e0000040dbcd9208] nfs_wait_on_request at a000000201b88ca0
 #5 [BSP:e0000040dbcd91b8] nfs_wait_on_requests_locked at a000000201b95400
 #6 [BSP:e0000040dbcd9168] nfs_writepages at a000000201b9a840
 #7 [BSP:e0000040dbcd9138] do_writepages at a0000001000dfe90
 #8 [BSP:e0000040dbcd9100] __filemap_fdatawrite_range at a0000001000cfe00
 #9 [BSP:e0000040dbcd90e0] filemap_fdatawrite at a0000001000cfe50
#10 [BSP:e0000040dbcd90b8] nfs_file_release at a000000201b744a0
#11 [BSP:e0000040dbcd9058] __fput at a00000010012e670
#12 [BSP:e0000040dbcd9038] fput at a00000010012e6c0
#13 [BSP:e0000040dbcd9008] filp_close at a00000010012aa70
#14 [BSP:e0000040dbcd8f88] sys_close at a00000010012abc0
#15 [BSP:e0000040dbcd8f88] ia64_ret_from_syscall at a00000010000b600
  EFRAME: e0000040dbcdfe40
      B0: 40000000000024a0      CR_IIP: a000000000010640
 CR_IPSR: 00001213081a6018      CR_IFS: 0000000000000004
  AR_PFS: c000000000000004      AR_RSC: 000000000000000f
 AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
  AR_CCV: 0000000000000000     AR_FPSR: 0009804c8a70033f
  LOADRS: 0000000000280000 AR_BSPSTORE: 60000fff7fffc160
      B6: 20000000001d0dc0          B7: 2000000000118350
      PR: 000000000555a261          R1: 2000000000254238
      R2: 2000000000279874          R3: 0000000000000015
      R8: 0000000000000001          R9: 20000000002798d4
     R10: 0000000000000073         R11: c000000000000289
     R12: 60000fffff593700         R13: 20000000002c06a0
     R14: 00000000fffff940         R15: 0000000000000405
     R16: 00000000fbad8004         R17: 20000000002b7bb8
     R18: 60000fffff593128         R19: 60000fffff5930a0
     R20: 20000000002b7c28         R21: 60000fffff593178
     R22: 20000000002b5728         R23: 00000000000020e8
     R24: 20000000002b5728         R25: 200000000003d560
     R26: 20000000002798c8         R27: 20000000002b5760
     R28: 20000000002b5728         R29: 0000000000000025
     R30: 400000000000a768         R31: 400000000000a768
      F6: 1003ecccccccccccccccd     F7: 1003e0000000000000000
      F8: 000000000000000000000     F9: 000000000000000000000
     F10: 000000000000000000000    F11: 000000000000000000000
#16 [BSP:e0000040dbcd8f88] __kernel_syscall_via_break at a000000000010640

 

The inode in question here is verifiably the inode of the file on the
mounted PNFS/MDS fs.

 

The PNFS fs in this case has 2 DS nodes each serving 1 DS filesystem,
and if I write a 1GB file using a fill pattern of '@' and check the
stripe files on the DS filesystems I see the expected 512MB of densely
packed '@' characters in each stripe file:

 

[od -A x -a of DS0 stripe file]
000000   @   @   @   @   @   @   @   @   @   @   @   @   @   @   @   @
*
20000000

[od -A x -a of DS1 stripe file]
000000   @   @   @   @   @   @   @   @   @   @   @   @   @   @   @   @
*
20000000
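
As a cross-check of the sizes above, here is a minimal sketch of how a
dense layout maps file offsets onto the stripe files. It is an
illustration only, not the CITI pNFS client code: the stripe unit size
is not stated in this report, so the 64KB figure used in the usage note
and the names dense_map(), dense_stripe_file_size() and struct ds_loc
are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which data server holds a byte, and where. */
struct ds_loc {
    uint32_t ds_index;   /* index of the data server, 0 .. num_ds-1 */
    uint64_t ds_offset;  /* byte offset inside that server's stripe file */
};

/*
 * Map a file offset to a data server location under dense striping:
 * stripe units are dealt round-robin to the servers and packed back to
 * back in each stripe file, so the stripe files contain no holes.
 */
static struct ds_loc dense_map(uint64_t file_offset, uint64_t stripe_unit,
                               uint32_t num_ds)
{
    struct ds_loc loc;
    uint64_t unit, within;

    assert(stripe_unit > 0 && num_ds > 0);
    unit = file_offset / stripe_unit;    /* global stripe unit number */
    within = file_offset % stripe_unit;  /* offset inside that unit */

    loc.ds_index = (uint32_t)(unit % num_ds);
    loc.ds_offset = (unit / num_ds) * stripe_unit + within;
    return loc;
}

/*
 * Expected size of one server's stripe file after file_size bytes have
 * been written sequentially from offset 0.
 */
static uint64_t dense_stripe_file_size(uint64_t file_size,
                                       uint64_t stripe_unit,
                                       uint32_t num_ds, uint32_t ds_index)
{
    uint64_t full_units, tail, units_here, size;

    assert(stripe_unit > 0 && num_ds > 0 && ds_index < num_ds);
    full_units = file_size / stripe_unit;  /* complete stripe units */
    tail = file_size % stripe_unit;        /* bytes in a partial last unit */

    /* Every server gets full_units / num_ds units; the first
     * full_units % num_ds servers get one more. */
    units_here = full_units / num_ds;
    if (ds_index < full_units % num_ds)
        units_here++;

    size = units_here * stripe_unit;
    if (ds_index == full_units % num_ds)
        size += tail;                      /* partial unit lands here */
    return size;
}
```

With an assumed 64KB stripe unit and 2 data servers,
dense_stripe_file_size(1ULL << 30, 65536, 2, i) gives 0x20000000 for
both i = 0 and i = 1, which agrees with the final offset reported by od
for each stripe file.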

 

Checking in the kernel I find that there are no stalled RPCs queued on
any of the DS rpc_xprt queues, and in fact all (16) of the rqst/slot
buffers are back on the free list.

 

This suggests to me that completion of the PNFS I/O is not being
successfully propagated upstream to the MDS inode, i.e. that the
rpc_call_done callback passed to rpc_call_async() is not doing the
right thing. Tracing through the code, I believe I have identified the
likely routine, namely nfs_writeback_done_full(), which is registered
in the nfs_write_full_ops callbacks used for async NFSv4 writes. There
is an interesting comment above nfs_writeback_done_full() which, if
accurate and up to date, suggests that there is a known potential race
associated with this callback routine.

 

[extract from linux/fs/nfs/write.c]

/*
 * Handle a write reply that flushes a whole page.
 *
 * FIXME: There is an inherent race with invalidate_inode_pages and
 *    writebacks since the page->count is kept > 1 for as long
 *    as the page has a write request pending.
 */
static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
{
    struct nfs_write_data   *data = calldata;

    if (nfs_writeback_done(task, data) != 0)
        return;

    nfs_writeback_full_page(data, task->tk_status);
}

static const struct rpc_call_ops nfs_write_full_ops = {
    .rpc_call_done = nfs_writeback_done_full,
    .rpc_release = nfs_writedata_release,
};
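
To make the suspected failure mode concrete, here is a userspace toy
model of it. It is only a sketch of the hypothesis in this report, not
kernel code, and it does not show that this is the actual bug: every
name in it (toy_req, toy_inode, toy_submit_all, toy_reply,
toy_close_would_block) is hypothetical, and the single "busy" flag is a
stand-in for whatever state nfs_wait_on_request() sleeps on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NREQ 4  /* arbitrary number of outstanding write requests */

/* Hypothetical stand-in for a per-page write request. */
struct toy_req {
    bool busy;              /* models the "request still pending" state */
};

/* Hypothetical stand-in for the inode and its transport. */
struct toy_inode {
    struct toy_req req[TOY_NREQ];
    int rpc_slots_in_use;   /* models slots that are not on the free list */
};

/* Submit every request: mark it busy and take one transport slot. */
static void toy_submit_all(struct toy_inode *ino)
{
    for (size_t i = 0; i < TOY_NREQ; i++) {
        ino->req[i].busy = true;
        ino->rpc_slots_in_use++;
    }
}

/*
 * A reply arrives for request i. The transport slot is always returned,
 * but the request is only marked done if the completion callback really
 * completes it (callback_completes == true).
 */
static void toy_reply(struct toy_inode *ino, size_t i, bool callback_completes)
{
    assert(i < TOY_NREQ);
    ino->rpc_slots_in_use--;
    if (callback_completes)
        ino->req[i].busy = false;
}

/* What a flush-on-close path would ask: is any request still pending? */
static bool toy_close_would_block(const struct toy_inode *ino)
{
    for (size_t i = 0; i < TOY_NREQ; i++)
        if (ino->req[i].busy)
            return true;
    return false;
}
```

In this model, if even one reply is handled with callback_completes set
to false, rpc_slots_in_use drops back to 0 while
toy_close_would_block() still returns true. That is the combination
observed here: all slot buffers back on the free list, yet close()
waiting forever on a request.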

 

I did a quick check of the LKML but nothing obvious jumped out at me.

 

Therefore I'd appreciate any help/pointers on how to go about getting
past this problem... ;-)

 

Fergal.

 

--

Fergal.McCarthy at HP.com
