1. 29 9月, 2012 3 次提交
    • D
      ext4: serialize dio nonlocked reads with defrag workers · 17335dcc
      Dmitry Monakhov 提交于
      Inode's block defrag and ext4_change_inode_journal_flag() may
      affect nonlocked DIO reads result, so proper synchronization
      required.
      
      - Add missed inode_dio_wait() calls where appropriate
      - Check inode state under extra i_dio_count reference.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      17335dcc
    • D
      ext4: completed_io locking cleanup · 28a535f9
      Dmitry Monakhov 提交于
      Current unwritten extent conversion state-machine is very fuzzy.
      - For unknown reason it performs conversion under i_mutex. What for?
        My diagnosis:
        We already protect extent tree with i_data_sem, truncate and punch_hole
        should wait for DIO, so the only data we have to protect is end_io->flags
        modification, but only flush_completed_IO and end_io_work modified this
        flags and we can serialize them via i_completed_io_lock.
      
        Currently all these games with mutex_trylock result in the following deadlock
         truncate:                          kworker:
          ext4_setattr                       ext4_end_io_work
          mutex_lock(i_mutex)
          inode_dio_wait(inode)  ->BLOCK
                                   DEADLOCK<- mutex_trylock()
                                              inode_dio_done()
        #TEST_CASE1_BEGIN
        MNT=/mnt_scrach
        unlink $MNT/file
        fallocate -l $((1024*1024*1024)) $MNT/file
        aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
        sleep 2
        truncate -s 0 $MNT/file
        #TEST_CASE1_END
      
      Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
      
      This patch makes state machine simple and clean:
      
      (1) xxx_end_io schedule final extent conversion simply by calling
          ext4_add_complete_io(), which append it to ei->i_completed_io_list
          NOTE1: because of (2A) work should be queued only if
          ->i_completed_io_list was empty, otherwise the work is scheduled already.
      
      (2) ext4_flush_completed_IO is responsible for handling all pending
          end_io from ei->i_completed_io_list
          Flushing sequence consists of following stages:
          A) LOCKED: Atomically drain completed_io_list to local_list
          B) Perform extents conversion
          C) LOCKED: move converted io's to to_free list for final deletion
             	     This logic depends on context which we was called from.
          D) Final end_io context destruction
          NOTE1: i_mutex is no longer required because end_io->flags modification
          is protected by ei->ext4_complete_io_lock
      
      Full list of changes:
      - Move all completion end_io related routines to page-io.c in order to improve
        logic locality
      - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
      - remove EXT4_IO_END_FSYNC
      - Improve SMP scalability by removing useless i_mutex which does not
        protect io->flags anymore.
      - Reduce lock contention on i_completed_io_lock by optimizing list walk.
      - Rename ext4_end_io_nolock to end4_end_io and make it static
      - Check flush completion status to ext4_ext_punch_hole(). Because it is
        not good idea to punch blocks from corrupted inode.
      
      Changes since V3 (in request to Jan's comments):
        Fall back to active flush_completed_IO() approach in order to prevent
        performance issues with nolocked DIO reads.
      Changes since V2:
        Fix use-after-free caused by race truncate vs end_io_work
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      28a535f9
    • D
      ext4: ext4_inode_info diet · f45ee3a1
      Dmitry Monakhov 提交于
      Generic inode has unused i_private pointer which may be used as cur_aio_dio
      storage.
      
      TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
            to have concurent AIO_DIO requests.
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      f45ee3a1
  2. 27 9月, 2012 1 次提交
  3. 20 9月, 2012 1 次提交
    • T
      ext4: fix potential deadlock in ext4_nonda_switch() · 00d4e736
      Theodore Ts'o 提交于
      In ext4_nonda_switch(), if the file system is getting full we used to
      call writeback_inodes_sb_if_idle().  The problem is that we can be
      holding i_mutex already, and this causes a potential deadlock when
      writeback_inodes_sb_if_idle() when it tries to take s_umount.  (See
      lockdep output below).
      
      As it turns out we don't need need to hold s_umount; the fact that we
      are in the middle of the write(2) system call will keep the superblock
      pinned.  Unfortunately writeback_inodes_sb() checks to make sure
      s_umount is taken, and the VFS uses a different mechanism for making
      sure the file system doesn't get unmounted out from under us.  The
      simplest way of dealing with this is to just simply grab s_umount
      using a trylock, and skip kicking the writeback flusher thread in the
      very unlikely case that we can't take a read lock on s_umount without
      blocking.
      
      Also, we now check the cirteria for kicking the writeback thread
      before we decide to whether to fall back to non-delayed writeback, so
      if there are any outstanding delayed allocation writes, we try to get
      them resolved as soon as possible.
      
         [ INFO: possible circular locking dependency detected ]
         3.6.0-rc1-00042-gce894ca #367 Not tainted
         -------------------------------------------------------
         dd/8298 is trying to acquire lock:
          (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
      
         but task is already holding lock:
          (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         which lock already depends on the new lock.
      
         2 locks held by dd/8298:
          #0:  (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
          #1:  (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         stack backtrace:
         Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
         Call Trace:
          [<c015b79c>] ? console_unlock+0x345/0x372
          [<c06d62a1>] print_circular_bug+0x190/0x19d
          [<c019906c>] __lock_acquire+0x86d/0xb6c
          [<c01999db>] ? mark_held_locks+0x5c/0x7b
          [<c0199724>] lock_acquire+0x66/0xb9
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c06db935>] down_read+0x28/0x58
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
          [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
          [<c0271ece>] ext4_da_write_begin+0x27/0x193
          [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
          [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
          [<c01ddce7>] generic_file_aio_write+0x78/0xd3
          [<c026d336>] ext4_file_write+0x480/0x4a6
          [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
          [<c0180944>] ? sched_clock_cpu+0x11a/0x13e
          [<c01967e9>] ? trace_hardirqs_off+0xb/0xd
          [<c018099f>] ? local_clock+0x37/0x4e
          [<c0209f2c>] do_sync_write+0x67/0x9d
          [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
          [<c020a7b9>] vfs_write+0x7b/0xe6
          [<c020a9a6>] sys_write+0x3b/0x64
          [<c06dd4bd>] syscall_call+0x7/0xb
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      00d4e736
  4. 18 9月, 2012 1 次提交
    • C
      ext4: fix possible non-initialized variable in htree_dirblock_to_tree() · 90b0a973
      Carlos Maiolino 提交于
      htree_dirblock_to_tree() declares a non-initialized 'err' variable,
      which is passed as a reference to another functions expecting them to
      set this variable with their error codes.
      
      It's passed to ext4_bread(), which then passes it to ext4_getblk(). If
      ext4_map_blocks() returns 0 due to a lookup failure, leaving the
      ext4_getblk() buffer_head uninitialized, it will make ext4_getblk()
      return to ext4_bread() without initialize the 'err' variable, and
      ext4_bread() will return to htree_dirblock_to_tree() with this variable
      still uninitialized.  htree_dirblock_to_tree() will pass this variable
      with garbage back to ext4_htree_fill_tree(), which expects a number of
      directory entries added to the rb-tree. which, in case, might return a
      fake non-zero value due the garbage left in the 'err' variable, leading
      the kernel to an Oops in ext4_dx_readdir(), once this is expecting a
      filled rb-tree node, when in turn it will have a NULL-ed one, causing an
      invalid page request when trying to get a fname struct from this NULL-ed
      rb-tree node in this line:
      
      fname = rb_entry(info->curr_node, struct fname, rb_hash);
      
      The patch itself initializes the err variable in
      htree_dirblock_to_tree() to avoid usage mistakes by the called
      functions, and also fix ext4_getblk() to return a initialized 'err'
      variable when ext4_map_blocks() fails a lookup.
      Signed-off-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      90b0a973
  5. 20 8月, 2012 1 次提交
  6. 31 7月, 2012 1 次提交
  7. 23 7月, 2012 4 次提交
  8. 10 7月, 2012 1 次提交
    • Z
      ext4: add a new nolock flag in ext4_map_blocks · 729f52c6
      Zheng Liu 提交于
      EXT4_GET_BLOCKS_NO_LOCK flag is added to indicate that we don't need
      to acquire i_data_sem lock in ext4_map_blocks.  Meanwhile, it changes
      ext4_get_block() to not start a new journal because when we do a
      overwrite dio, there is no any metadata that needs to be modified.
      
      We define a new function called ext4_get_block_write_nolock, which is
      used in dio overwrite nolock.  In this function, it doesn't try to
      acquire i_data_sem lock and doesn't start a new journal as it does a
      lookup.
      
      CC: Tao Ma <tm@tao.ma>
      CC: Eric Sandeen <sandeen@redhat.com>
      CC: Robin Dong <hao.bigrat@gmail.com>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      729f52c6
  9. 01 6月, 2012 1 次提交
  10. 16 5月, 2012 1 次提交
  11. 30 4月, 2012 3 次提交
  12. 22 3月, 2012 3 次提交
  13. 20 3月, 2012 3 次提交
  14. 05 3月, 2012 1 次提交
    • J
      ext4: clean up the flags passed to __blockdev_direct_IO · 93ef8541
      Jeff Moyer 提交于
      For extent-based files, you can perform DIO to holes, as mentioned in
      the comments in ext4_ext_direct_IO.  However, that function passes
      DIO_SKIP_HOLES to __blockdev_direct_IO, which is *really* confusing to
      the uninitiated reader.  The key, here, is that the get_block function
      passed in, ext4_get_block_write, completely ignores the create flag
      that is passed to it (the create flag is passed in from the direct I/O
      code, which uses the DIO_SKIP_HOLES flag to determine whether or not
      it should be cleared).
      
      This is a long-winded way of saying that the DIO_SKIP_HOLES flag is
      ultimately ignored.  So let's remove it.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      93ef8541
  15. 03 3月, 2012 1 次提交
  16. 21 2月, 2012 2 次提交
    • J
      ext4: fix race between unwritten extent conversion and truncate · 266991b1
      Jeff Moyer 提交于
      The following comment in ext4_end_io_dio caught my attention:
      
      	/* XXX: probably should move into the real I/O completion handler */
              inode_dio_done(inode);
      
      The truncate code takes i_mutex, then calls inode_dio_wait.  Because the
      ext4 code path above will end up dropping the mutex before it is
      reacquired by the worker thread that does the extent conversion, it
      seems to me that the truncate can happen out of order.  Jan Kara
      mentioned that this might result in error messages in the system logs,
      but that should be the extent of the "damage."
      
      The fix is pretty straight-forward: don't call inode_dio_done until the
      extent conversion is complete.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      266991b1
    • L
      ext4: ignore EXT4_INODE_JOURNAL_DATA flag with delalloc · 3d2b1582
      Lukas Czerner 提交于
      Ext4 does not support data journalling with delayed allocation enabled.
      We even do not allow to mount the file system with delayed allocation
      and data journalling enabled, however it can be set via FS_IOC_SETFLAGS
      so we can hit the inode with EXT4_INODE_JOURNAL_DATA set even on file
      system mounted with delayed allocation (default) and that's where
      problem arises. The easies way to reproduce this problem is with the
      following set of commands:
      
       mkfs.ext4 /dev/sdd
       mount /dev/sdd /mnt/test1
       dd if=/dev/zero of=/mnt/test1/file bs=1M count=4
       chattr +j /mnt/test1/file
       dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 conv=notrunc
       chattr -j /mnt/test1/file
      
      Additionally it can be reproduced quite reliably with xfstests 272 and
      269. In fact the above reproducer is a part of test 272.
      
      To fix this we should ignore the EXT4_INODE_JOURNAL_DATA inode flag if
      the file system is mounted with delayed allocation. This can be easily
      done by fixing ext4_should_*_data() functions do ignore data journal
      flag when delalloc is set (suggested by Ted). We also have to set the
      appropriate address space operations for the inode (again, ignoring data
      journal flag if delalloc enabled).
      
      Additionally this commit introduces ext4_inode_journal_mode() function
      because ext4_should_*_data() has already had a lot of common code and
      this change is putting it all into one function so it is easier to
      read.
      
      Successfully tested with xfstests in following configurations:
      
      delalloc + data=ordered
      delalloc + data=writeback
      data=journal
      nodelalloc + data=ordered
      nodelalloc + data=writeback
      nodelalloc + data=journal
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      3d2b1582
  17. 09 1月, 2012 1 次提交
  18. 05 1月, 2012 1 次提交
  19. 29 12月, 2011 4 次提交
  20. 14 12月, 2011 3 次提交
    • Y
      ext4: correctly handle pages w/o buffers in ext4_discard_partial_buffers() · 093e6e36
      Yongqiang Yang 提交于
      If a page has been read into memory and never been written, it has no
      buffers, but we should handle the page in truncate or punch hole.
      
      VFS code of writing operations has handled holes correctly, so this
      patch removes the code handling holes in writing operations.
      Signed-off-by: NYongqiang Yang <xiaoqiangnk@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
      093e6e36
    • Y
      ext4: avoid potential hang in mpage_submit_io() when blocksize < pagesize · 13a79a47
      Yongqiang Yang 提交于
      If there is an unwritten but clean buffer in a page and there is a
      dirty buffer after the buffer, then mpage_submit_io does not write the
      dirty buffer out.  As a result, da_writepages loops forever.
      
      This patch fixes the problem by checking dirty flag.
      Signed-off-by: NYongqiang Yang <xiaoqiangnk@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
      13a79a47
    • A
      ext4: avoid hangs in ext4_da_should_update_i_disksize() · ea51d132
      Andrea Arcangeli 提交于
      If the pte mapping in generic_perform_write() is unmapped between
      iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the
      "copied" parameter to ->end_write can be zero. ext4 couldn't cope with
      it with delayed allocations enabled. This skips the i_disksize
      enlargement logic if copied is zero and no new data was appeneded to
      the inode.
      
       gdb> bt
       #0  0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\
       08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
       #1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
       xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
       #2  0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value o\
       ptimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440
       #3  generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, p\
       os=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482
       #4  0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\
       xffff88001e26be40) at mm/filemap.c:2600
       #5  0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimi\
       zed out>, pos=<value optimized out>) at mm/filemap.c:2632
       #6  0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\
       t fs/ext4/file.c:136
       #7  0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, \
       ppos=0xffff88001e26bf48) at fs/read_write.c:406
       #8  0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4\
       000, pos=0xffff88001e26bf48) at fs/read_write.c:435
       #9  0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x\
       4000) at fs/read_write.c:487
       #10 <signal handler called>
       #11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
       #12 0x0000000000000000 in ?? ()
       gdb> print offset
       $22 = 0xffffffffffffffff
       gdb> print idx
       $23 = 0xffffffff
       gdb> print inode->i_blkbits
       $24 = 0xc
       gdb> up
       #1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
       xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
       2512                    if (ext4_da_should_update_i_disksize(page, end)) {
       gdb> print start
       $25 = 0x0
       gdb> print end
       $26 = 0xffffffffffffffff
       gdb> print pos
       $27 = 0x108000
       gdb> print new_i_size
       $28 = 0x108000
       gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
       $29 = 0xd9000
       gdb> down
       2467            for (i = 0; i < idx; i++)
       gdb> print i
       $30 = 0xd44acbee
      
      This is 100% reproducible with some autonuma development code tuned in
      a very aggressive manner (not normal way even for knumad) which does
      "exotic" changes to the ptes. It wouldn't normally trigger but I don't
      see why it can't happen normally if the page is added to swap cache in
      between the two faults leading to "copied" being zero (which then
      hangs in ext4). So it should be fixed. Especially possible with lumpy
      reclaim (albeit disabled if compaction is enabled) as that would
      ignore the young bits in the ptes.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
      ea51d132
  21. 12 12月, 2011 1 次提交
  22. 06 12月, 2011 1 次提交
  23. 02 12月, 2011 1 次提交