1. 22 Feb, 2013 1 commit
    • mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong committed
      Create a helper function to check if a backing device requires stable
      page writes and, if so, performs the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
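
The helper described above can be pictured with a small userspace model. This is a sketch under stated assumptions, not the kernel code: `backing_dev`, `page_model`, `wait_on_writeback` and `wait_for_stable_page_model` are hypothetical stand-ins, and "waiting" is simulated by a counter rather than by blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a backing device: only the capability bit matters. */
struct backing_dev {
    bool requires_stable_pages;
};

/* Hypothetical model of a page that may currently be under writeback. */
struct page_model {
    const struct backing_dev *bdi;
    bool under_writeback;
    int waits;                 /* how many times a writer had to wait */
};

/* Stand-in for blocking until writeback of the page has completed. */
static void wait_on_writeback(struct page_model *page)
{
    page->waits++;
    page->under_writeback = false;
}

/*
 * Called wherever a page is about to become writable: wait only if the
 * device needs the page contents to stay stable during writeback.
 */
static void wait_for_stable_page_model(struct page_model *page)
{
    assert(page != NULL && page->bdi != NULL);
    if (!page->bdi->requires_stable_pages)
        return;                /* ordinary device: no wait at all */
    if (page->under_writeback)
        wait_on_writeback(page);
}
```

In this model, a page on a device without the capability is left untouched even while under writeback, which corresponds to the "eliminating unnecessary waiting" part of the description.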
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 26 Dec, 2012 2 commits
    • ext4: fix deadlock in journal_unmap_buffer() · 53e87268
      Jan Kara committed
      We cannot wait for transaction commit in journal_unmap_buffer()
      because we hold the page lock, which ranks below transaction start.  We
      solve the issue by bailing out of journal_unmap_buffer() and
      jbd2_journal_invalidatepage() with -EBUSY.  The caller is then responsible
      for waiting for the transaction commit to finish and trying the invalidation
      again.  Since the issue can happen only for a page straddling i_size, it
      is simple enough to manually call jbd2_journal_invalidatepage() for
      such a page from ext4_setattr(), check the return value, and wait if
      necessary.
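
The "bail out with -EBUSY and let the caller retry" pattern can be sketched as follows. This is a simplified userspace model: `journal_model`, `invalidate_page_model`, `wait_for_commit_model` and `setattr_invalidate_model` are hypothetical names, and the commit wait is simulated by flipping a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical journal state: is a commit still using the buffer? */
struct journal_model {
    bool commit_running;
    int commit_waits;          /* times the caller waited for a commit */
};

/*
 * Runs with the page lock held, so it must never wait for a commit.
 * It reports -EBUSY instead and leaves the waiting to the caller.
 */
static int invalidate_page_model(struct journal_model *j)
{
    if (j->commit_running)
        return -EBUSY;
    return 0;
}

/* Stand-in for waiting until the running transaction has committed. */
static void wait_for_commit_model(struct journal_model *j)
{
    j->commit_waits++;
    j->commit_running = false;
}

/*
 * Caller side: the wait happens here, outside the page lock, and the
 * invalidation is retried until it no longer reports -EBUSY.
 */
static int setattr_invalidate_model(struct journal_model *j)
{
    int err;

    while ((err = invalidate_page_model(j)) == -EBUSY)
        wait_for_commit_model(j);
    return err;
}
```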
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: split off ext4_journalled_invalidatepage() · 4520fb3c
      Jan Kara committed
      In data=journal mode we don't need delalloc or DIO handling in invalidatepage
      and similarly in other modes we don't need the journal handling. So split
      invalidatepage implementations.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  3. 11 Dec, 2012 6 commits
  4. 03 Dec, 2012 1 commit
  5. 30 Nov, 2012 1 commit
  6. 16 Nov, 2012 1 commit
    • ext4: remove calls to ext4_jbd2_file_inode() from delalloc write path · f3b59291
      Theodore Ts'o committed
      The calls to ext4_jbd2_file_inode() are needed to guarantee that we do
      not expose stale data in the data=ordered mode.  However, they are not
      necessary here because in all of the cases where we have newly allocated
      blocks in the delayed allocation write path, we immediately submit the
      dirty pages for I/O.  Hence, we can avoid the overhead of adding the
      inode to the list of inodes whose data pages will be flushed out
      to disk completely during the next commit operation.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  7. 15 Nov, 2012 1 commit
  8. 09 Nov, 2012 3 commits
  9. 01 Oct, 2012 1 commit
    • ext4: fix mtime update in nodelalloc mode · 041bbb6d
      Theodore Ts'o committed
      Commits 5e8830dc and 41c4d25f introduced a regression into
      v3.6-rc1 for ext4 in nodelalloc mode, such that mtime updates would not
      take place for files modified via mmap if the page was already in the
      page cache.  This would also affect ext3 file systems mounted using
      the ext4 file system driver.
      
      The problem was that ext4_page_mkwrite() had a shortcut which would
      avoid calling __block_page_mkwrite() under some circumstances, and the
      above two commits transferred the responsibility of calling
      file_update_time() to __block_page_mkwrite(), which wouldn't get
      called in some circumstances.
      
      Since __block_page_mkwrite() only has three callers,
      block_page_mkwrite(), ext4_page_mkwrite(), and nilfs_page_mkwrite(), the
      best way to solve this is to move the responsibility for calling
      file_update_time() to its callers.
      
      This problem was found via xfstests #215 with a file system mounted
      with -o nodelalloc.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: stable@vger.kernel.org
  10. 29 Sep, 2012 6 commits
    • ext4: serialize truncate with owerwrite DIO workers · 1f555cfa
      Dmitry Monakhov committed
      Jan Kara spotted an interesting issue: there is a potential data
      corruption issue with direct IO overwrites racing with truncate,
      like this:
        dio write                      truncate_task
        ->ext4_ext_direct_IO
         ->overwrite == 1
          ->down_read(&EXT4_I(inode)->i_data_sem);
          ->mutex_unlock(&inode->i_mutex);
                                     ->ext4_setattr()
                                      ->inode_dio_wait()
                                      ->truncate_setsize()
                                      ->ext4_truncate()
                                       ->down_write(&EXT4_I(inode)->i_data_sem);
          ->__blockdev_direct_IO
           ->ext4_get_block
           ->submit_io()
          ->up_read(&EXT4_I(inode)->i_data_sem);
                                       # truncate data blocks, allocate them to
                                       # other inode - bad stuff happens because
                                       # dio is still in flight.
      
      In order to serialize with truncate, the dio worker should grab an extra
      i_dio_count reference before dropping i_mutex.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: endless truncate due to nonlocked dio readers · 1b65007e
      Dmitry Monakhov committed
      If we have enough aggressive DIO readers, truncate and other dio
      waiters will wait forever inside inode_dio_wait(). It is reasonable
      to disable the nonlocked DIO read optimization during truncate.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: serialize unlocked dio reads with truncate · 1c9114f9
      Dmitry Monakhov committed
      The current serialization works only for DIO which holds
      i_mutex, but with nonlocked DIO the following race is possible:
      
      dio_nolock_read_task            truncate_task
      				->ext4_setattr()
      				 ->inode_dio_wait()
      ->ext4_ext_direct_IO
        ->ext4_ind_direct_IO
          ->__blockdev_direct_IO
            ->ext4_get_block
      				 ->truncate_setsize()
      				 ->ext4_truncate()
      				 #alloc truncated blocks
      				 #to other inode
            ->submit_io()
           #INFORMATION LEAK
      
      In order to serialize with unlocked DIO reads we have to
      rearrange the wait sequence:
      1) update i_size first
      2) if i_size is about to be reduced, wait for outstanding DIO requests
      3) and only after that truncate inode blocks
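
The three-step ordering above can be illustrated with a single-threaded model. Everything here is an assumption made for illustration: `inode_model`, `dio_read_begin`, `dio_read_end` and `truncate_model` are hypothetical, the reader's check against `i_size` is a simplification, and where the kernel would sleep the model just returns `false`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inode state relevant to the race. */
struct inode_model {
    long i_size;        /* current file size in bytes */
    int  i_dio_count;   /* in-flight direct I/O requests */
    long blocks_end;    /* bytes still backed by allocated blocks */
};

/* Unlocked reader: pin the inode first, then check i_size. */
static bool dio_read_begin(struct inode_model *inode, long offset)
{
    inode->i_dio_count++;
    if (offset >= inode->i_size) {   /* truncate already shrank the file */
        inode->i_dio_count--;
        return false;
    }
    return true;
}

static void dio_read_end(struct inode_model *inode)
{
    inode->i_dio_count--;
}

/*
 * Truncate in the order listed above.  Returns false at the point where
 * real code would sleep waiting for outstanding DIO to drain.
 */
static bool truncate_model(struct inode_model *inode, long new_size)
{
    bool busy = inode->i_dio_count > 0 && inode->blocks_end > new_size;

    inode->i_size = new_size;            /* 1) update i_size first */
    if (busy)                            /* 2) outstanding DIO: must wait */
        return false;
    if (inode->blocks_end > new_size)    /* 3) only now release blocks */
        inode->blocks_end = new_size;
    return true;
}
```

Because `i_size` shrinks before the wait, readers that start later are turned away, while blocks are released only after the earlier readers have finished.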
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: serialize dio nonlocked reads with defrag workers · 17335dcc
      Dmitry Monakhov committed
      Inode block defrag and ext4_change_inode_journal_flag() may
      affect the results of nonlocked DIO reads, so proper synchronization
      is required.
      
      - Add missing inode_dio_wait() calls where appropriate
      - Check inode state under an extra i_dio_count reference.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: completed_io locking cleanup · 28a535f9
      Dmitry Monakhov committed
      The current unwritten extent conversion state machine is very fuzzy.
      - For an unknown reason it performs conversion under i_mutex. What for?
        My diagnosis:
        We already protect the extent tree with i_data_sem, and truncate and punch_hole
        should wait for DIO, so the only data we have to protect is the end_io->flags
        modification. But only flush_completed_IO and end_io_work modify these
        flags, and we can serialize them via i_completed_io_lock.
      
        Currently all these games with mutex_trylock result in the following deadlock:
         truncate:                          kworker:
          ext4_setattr                       ext4_end_io_work
          mutex_lock(i_mutex)
          inode_dio_wait(inode)  ->BLOCK
                                   DEADLOCK<- mutex_trylock()
                                              inode_dio_done()
        #TEST_CASE1_BEGIN
        MNT=/mnt_scrach
        unlink $MNT/file
        fallocate -l $((1024*1024*1024)) $MNT/file
        aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
        sleep 2
        truncate -s 0 $MNT/file
        #TEST_CASE1_END
      
      Or use xfstests test 286: https://github.com/dmonakhov/xfstests/blob/devel/286
      
      This patch makes the state machine simple and clean:
      
      (1) xxx_end_io schedules the final extent conversion simply by calling
          ext4_add_complete_io(), which appends it to ei->i_completed_io_list.
          NOTE1: because of (2A), work should be queued only if
          ->i_completed_io_list was empty; otherwise the work is scheduled already.
      
      (2) ext4_flush_completed_IO is responsible for handling all pending
          end_io from ei->i_completed_io_list.
          The flushing sequence consists of the following stages:
          A) LOCKED: Atomically drain completed_io_list to local_list
          B) Perform extents conversion
          C) LOCKED: move converted io's to to_free list for final deletion
             This logic depends on the context we were called from.
          D) Final end_io context destruction
          NOTE1: i_mutex is no longer required because end_io->flags modification
          is protected by ei->ext4_complete_io_lock
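
Stages (1), (2A) and (2B) of this state machine can be sketched in userspace as below. This is a simplified model, not the ext4 implementation: all names ending in `_model` are hypothetical, a C11 `atomic_flag` spinlock stands in for the completed-io lock, entries are pushed at the list head for brevity (the commit message says they are appended), and stages C and D are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one finished I/O awaiting extent conversion. */
struct io_end_model {
    struct io_end_model *next;
    bool converted;
};

struct inode_io_model {
    atomic_flag lock;                 /* plays the role of the completed-io lock */
    struct io_end_model *completed;   /* pending list */
    int work_queued;                  /* times the worker was scheduled */
};

static void io_lock(struct inode_io_model *ei)
{
    while (atomic_flag_test_and_set_explicit(&ei->lock, memory_order_acquire))
        ;   /* spin until the flag is released */
}

static void io_unlock(struct inode_io_model *ei)
{
    atomic_flag_clear_explicit(&ei->lock, memory_order_release);
}

/* (1) Queue a finished I/O; schedule work only if the list was empty. */
static void add_complete_io_model(struct inode_io_model *ei,
                                  struct io_end_model *io)
{
    io_lock(ei);
    if (ei->completed == NULL)
        ei->work_queued++;
    io->next = ei->completed;
    ei->completed = io;
    io_unlock(ei);
}

/* (2) Flush: drain under the lock, convert without it. Returns the count. */
static int flush_completed_io_model(struct inode_io_model *ei)
{
    struct io_end_model *local, *io;
    int n = 0;

    io_lock(ei);                      /* A) atomically take the whole list */
    local = ei->completed;
    ei->completed = NULL;
    io_unlock(ei);

    for (io = local; io != NULL; io = io->next) {
        io->converted = true;         /* B) conversion happens unlocked */
        n++;
    }
    return n;
}
```

Draining the whole list in one locked step is what lets the "queue work only if the list was empty" rule in NOTE1 hold: once the list is non-empty, a flush that will pick up every entry is already pending.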
      
      Full list of changes:
      - Move all completion end_io related routines to page-io.c in order to improve
        logic locality
      - Move open-coded logic from various xx_end_xx routines to ext4_add_complete_io()
      - Remove EXT4_IO_END_FSYNC
      - Improve SMP scalability by removing the useless i_mutex, which does not
        protect io->flags anymore.
      - Reduce lock contention on i_completed_io_lock by optimizing the list walk.
      - Rename ext4_end_io_nolock to ext4_end_io and make it static
      - Check flush completion status in ext4_ext_punch_hole(), because it is
        not a good idea to punch blocks from a corrupted inode.
      
      Changes since V3 (in response to Jan's comments):
        Fall back to the active flush_completed_IO() approach in order to prevent
        performance issues with nolocked DIO reads.
      Changes since V2:
        Fix use-after-free caused by race truncate vs end_io_work
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: ext4_inode_info diet · f45ee3a1
      Dmitry Monakhov committed
      The generic inode has an unused i_private pointer which may be used as cur_aio_dio
      storage.
      
      TODO: If cur_aio_dio is passed as an argument to get_block_t, this will allow
            concurrent AIO_DIO requests.
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  11. 27 Sep, 2012 1 commit
  12. 20 Sep, 2012 1 commit
    • ext4: fix potential deadlock in ext4_nonda_switch() · 00d4e736
      Theodore Ts'o committed
      In ext4_nonda_switch(), if the file system is getting full we used to
      call writeback_inodes_sb_if_idle().  The problem is that we can be
      holding i_mutex already, and this causes a potential deadlock when
      writeback_inodes_sb_if_idle() tries to take s_umount.  (See
      lockdep output below).
      
      As it turns out we don't need to hold s_umount; the fact that we
      are in the middle of the write(2) system call will keep the superblock
      pinned.  Unfortunately writeback_inodes_sb() checks to make sure
      s_umount is taken, and the VFS uses a different mechanism for making
      sure the file system doesn't get unmounted out from under us.  The
      simplest way of dealing with this is to just simply grab s_umount
      using a trylock, and skip kicking the writeback flusher thread in the
      very unlikely case that we can't take a read lock on s_umount without
      blocking.
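
The "take the lock with a trylock, otherwise skip the optional work" approach can be sketched like this. It is a userspace model under stated assumptions: `rwsem_model`, `sb_model` and the `_model` functions are hypothetical, and the read-write semaphore is reduced to one atomic counter (-1 for a writer, n >= 0 for n readers).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reader-writer lock: -1 = writer holds it, n >= 0 = n readers. */
struct rwsem_model {
    atomic_int state;
};

/* Try to take the lock for reading; never blocks. */
static bool down_read_trylock_model(struct rwsem_model *sem)
{
    int cur = atomic_load(&sem->state);

    while (cur >= 0) {
        /* On failure, cur is refreshed and the loop condition is rechecked. */
        if (atomic_compare_exchange_weak(&sem->state, &cur, cur + 1))
            return true;
    }
    return false;   /* a writer holds the lock */
}

static void up_read_model(struct rwsem_model *sem)
{
    atomic_fetch_sub(&sem->state, 1);
}

/* Hypothetical superblock: the lock plus a counter of writeback kicks. */
struct sb_model {
    struct rwsem_model s_umount;
    int writeback_kicks;
};

/*
 * Kicking writeback is an optimization, so if the lock cannot be taken
 * immediately it is simply skipped; the caller may already hold another
 * lock and must not block here.
 */
static void try_kick_writeback_model(struct sb_model *sb)
{
    if (!down_read_trylock_model(&sb->s_umount))
        return;
    sb->writeback_kicks++;
    up_read_model(&sb->s_umount);
}
```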
      
      Also, we now check the criteria for kicking the writeback thread
      before we decide whether to fall back to non-delayed writeback, so
      if there are any outstanding delayed allocation writes, we try to get
      them resolved as soon as possible.
      
         [ INFO: possible circular locking dependency detected ]
         3.6.0-rc1-00042-gce894ca #367 Not tainted
         -------------------------------------------------------
         dd/8298 is trying to acquire lock:
          (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
      
         but task is already holding lock:
          (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         which lock already depends on the new lock.
      
         2 locks held by dd/8298:
          #0:  (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
          #1:  (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         stack backtrace:
         Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
         Call Trace:
          [<c015b79c>] ? console_unlock+0x345/0x372
          [<c06d62a1>] print_circular_bug+0x190/0x19d
          [<c019906c>] __lock_acquire+0x86d/0xb6c
          [<c01999db>] ? mark_held_locks+0x5c/0x7b
          [<c0199724>] lock_acquire+0x66/0xb9
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c06db935>] down_read+0x28/0x58
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
          [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
          [<c0271ece>] ext4_da_write_begin+0x27/0x193
          [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
          [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
          [<c01ddce7>] generic_file_aio_write+0x78/0xd3
          [<c026d336>] ext4_file_write+0x480/0x4a6
          [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
          [<c0180944>] ? sched_clock_cpu+0x11a/0x13e
          [<c01967e9>] ? trace_hardirqs_off+0xb/0xd
          [<c018099f>] ? local_clock+0x37/0x4e
          [<c0209f2c>] do_sync_write+0x67/0x9d
          [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
          [<c020a7b9>] vfs_write+0x7b/0xe6
          [<c020a9a6>] sys_write+0x3b/0x64
          [<c06dd4bd>] syscall_call+0x7/0xb
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  13. 18 Sep, 2012 1 commit
    • ext4: fix possible non-initialized variable in htree_dirblock_to_tree() · 90b0a973
      Carlos Maiolino committed
      htree_dirblock_to_tree() declares an uninitialized 'err' variable,
      which is passed by reference to other functions that are expected to
      set it to their error codes.
      
      It's passed to ext4_bread(), which then passes it to ext4_getblk(). If
      ext4_map_blocks() returns 0 due to a lookup failure, leaving the
      ext4_getblk() buffer_head uninitialized, ext4_getblk() will
      return to ext4_bread() without initializing the 'err' variable, and
      ext4_bread() will return to htree_dirblock_to_tree() with this variable
      still uninitialized.  htree_dirblock_to_tree() will pass this variable,
      containing garbage, back to ext4_htree_fill_tree(), which expects the number of
      directory entries added to the rb-tree.  That function, in turn, might return a
      fake non-zero value due to the garbage left in the 'err' variable, leading
      the kernel to an Oops in ext4_dx_readdir(), which expects a
      filled rb-tree node but instead gets a NULL one, causing an
      invalid page request when trying to get a fname struct from this NULL
      rb-tree node in this line:
      
      fname = rb_entry(info->curr_node, struct fname, rb_hash);
      
      The patch itself initializes the err variable in
      htree_dirblock_to_tree() to avoid usage mistakes by the called
      functions, and also fixes ext4_getblk() to return an initialized 'err'
      variable when ext4_map_blocks() fails a lookup.
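
The two halves of the fix (initialize in the caller, and set the out-parameter on every path in the callee) can be shown with a small model. `buf_model`, `getblk_model` and `dirblock_to_tree_model` are hypothetical names that only mimic the shape of the call chain described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer returned by a block lookup. */
struct buf_model {
    int data;
};

/*
 * Callee: returns NULL when nothing is mapped.  The status is stored in
 * *err on every path, including the lookup miss, so the caller never
 * reads an indeterminate value.
 */
static struct buf_model *getblk_model(struct buf_model *found, int *err)
{
    *err = 0;
    return found;      /* NULL models "lookup found no block" */
}

/*
 * Caller: returns the number of entries added, or a negative error code.
 * 'err' is also initialized here as a defense against callees that
 * forget to set it.
 */
static int dirblock_to_tree_model(struct buf_model *found)
{
    int err = 0;
    struct buf_model *bh = getblk_model(found, &err);

    if (bh == NULL)
        return err;    /* well defined: 0 entries or a negative error */
    return 1;          /* pretend one directory entry was added */
}
```

With both changes in place, a lookup miss yields a count of 0 instead of whatever happened to be on the stack.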
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  14. 02 Sep, 2012 1 commit
  15. 20 Aug, 2012 1 commit
  16. 04 Aug, 2012 2 commits
  17. 31 Jul, 2012 1 commit
  18. 23 Jul, 2012 4 commits
  19. 10 Jul, 2012 1 commit
    • ext4: add a new nolock flag in ext4_map_blocks · 729f52c6
      Zheng Liu committed
      The EXT4_GET_BLOCKS_NO_LOCK flag is added to indicate that we don't need
      to acquire the i_data_sem lock in ext4_map_blocks.  Meanwhile, this changes
      ext4_get_block() to not start a new journal because when we do an
      overwrite dio, there is no metadata that needs to be modified.
      
      We define a new function called ext4_get_block_write_nolock, which is
      used in dio overwrite nolock.  This function doesn't try to
      acquire the i_data_sem lock and doesn't start a new journal, as it only does a
      lookup.
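
A flag that lets one mapping routine serve both the locked and the lock-free caller might look roughly like this. The sketch is a userspace model: the flag value, `map_inode_model`, and the three `_model` functions are all hypothetical, and locking and journal starts are reduced to counters.

```c
#include <assert.h>
#include <stdbool.h>

#define MAP_NO_LOCK_MODEL 0x1   /* hypothetical flag value */

/* Hypothetical inode: a trivial linear mapping plus bookkeeping counters. */
struct map_inode_model {
    long first_pblk;            /* physical block backing logical block 0 */
    int  lock_acquisitions;     /* times the mapping lock was taken */
    int  journal_starts;        /* times a journal handle was started */
};

/* Lookup-only mapping; takes the lock unless the caller opts out. */
static long map_blocks_model(struct map_inode_model *inode, long lblk, int flags)
{
    bool nolock = (flags & MAP_NO_LOCK_MODEL) != 0;

    assert(lblk >= 0);
    if (!nolock)
        inode->lock_acquisitions++;   /* stands in for taking the semaphore */
    return inode->first_pblk + lblk;
}

/* Overwrite path: blocks already exist, so no lock and no journal handle. */
static long get_block_write_nolock_model(struct map_inode_model *inode, long lblk)
{
    return map_blocks_model(inode, lblk, MAP_NO_LOCK_MODEL);
}

/* Ordinary path, for comparison: starts a journal handle and locks. */
static long get_block_model(struct map_inode_model *inode, long lblk)
{
    inode->journal_starts++;
    return map_blocks_model(inode, lblk, 0);
}
```

Both paths return the same mapping; the nolock variant is only safe in this model because a pure overwrite changes no metadata.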
      
      CC: Tao Ma <tm@tao.ma>
      CC: Eric Sandeen <sandeen@redhat.com>
      CC: Robin Dong <hao.bigrat@gmail.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  20. 01 Jun, 2012 1 commit
  21. 16 May, 2012 1 commit
  22. 30 Apr, 2012 2 commits