1. 09 Oct, 2012 (1 commit)
  2. 21 Sep, 2012 (1 commit)
  3. 20 Sep, 2012 (1 commit)
    • ext4: fix potential deadlock in ext4_nonda_switch() · 00d4e736
      Committed by Theodore Ts'o
      In ext4_nonda_switch(), if the file system is getting full we used to
      call writeback_inodes_sb_if_idle().  The problem is that we can be
      holding i_mutex already, and this causes a potential deadlock in
      writeback_inodes_sb_if_idle() when it tries to take s_umount.  (See
      the lockdep output below.)
      
      As it turns out we don't need to hold s_umount; the fact that we
      are in the middle of the write(2) system call will keep the superblock
      pinned.  Unfortunately writeback_inodes_sb() checks to make sure
      s_umount is taken, and the VFS uses a different mechanism for making
      sure the file system doesn't get unmounted out from under us.  The
      simplest way of dealing with this is to simply grab s_umount
      using a trylock, and skip kicking the writeback flusher thread in the
      very unlikely case that we can't take a read lock on s_umount without
      blocking.
      
      Also, we now check the criteria for kicking the writeback thread
      before we decide whether to fall back to non-delayed writeback, so
      if there are any outstanding delayed allocation writes, we try to get
      them resolved as soon as possible.
      
         [ INFO: possible circular locking dependency detected ]
         3.6.0-rc1-00042-gce894ca #367 Not tainted
         -------------------------------------------------------
         dd/8298 is trying to acquire lock:
          (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
      
         but task is already holding lock:
          (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         which lock already depends on the new lock.
      
         2 locks held by dd/8298:
          #0:  (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
          #1:  (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         stack backtrace:
         Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
         Call Trace:
          [<c015b79c>] ? console_unlock+0x345/0x372
          [<c06d62a1>] print_circular_bug+0x190/0x19d
          [<c019906c>] __lock_acquire+0x86d/0xb6c
          [<c01999db>] ? mark_held_locks+0x5c/0x7b
          [<c0199724>] lock_acquire+0x66/0xb9
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c06db935>] down_read+0x28/0x58
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
          [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
          [<c0271ece>] ext4_da_write_begin+0x27/0x193
          [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
          [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
          [<c01ddce7>] generic_file_aio_write+0x78/0xd3
          [<c026d336>] ext4_file_write+0x480/0x4a6
          [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
          [<c0180944>] ? sched_clock_cpu+0x11a/0x13e
          [<c01967e9>] ? trace_hardirqs_off+0xb/0xd
          [<c018099f>] ? local_clock+0x37/0x4e
          [<c0209f2c>] do_sync_write+0x67/0x9d
          [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
          [<c020a7b9>] vfs_write+0x7b/0xe6
          [<c020a9a6>] sys_write+0x3b/0x64
          [<c06dd4bd>] syscall_call+0x7/0xb
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      00d4e736
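
      A minimal sketch of the trylock idea described above (simplified and
      hypothetical, not the actual ext4_nonda_switch() hunk; the free/dirty
      block comparison is illustrative only):

          /* inside ext4_nonda_switch(), roughly: */
          if (free_blocks < 2 * dirty_blocks) {
                  struct super_block *sb = inode->i_sb;

                  /*
                   * We may already hold i_mutex here, so never block on
                   * s_umount; in the unlikely case the trylock fails, just
                   * skip kicking the flusher this time around.
                   */
                  if (down_read_trylock(&sb->s_umount)) {
                          writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
                          up_read(&sb->s_umount);
                  }
          }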
  4. 11 Sep, 2012 (1 commit)
  5. 01 Aug, 2012 (1 commit)
  6. 23 Jul, 2012 (1 commit)
  7. 09 Jun, 2012 (2 commits)
  8. 06 May, 2012 (7 commits)
    • writeback: Avoid iput() from flusher thread · 169ebd90
      Committed by Jan Kara
      Doing iput() from the flusher thread (writeback_sb_inodes()) can create
      problems because iput() can do a lot of work - for example truncate the
      inode if it's the last iput on an unlinked file. Some filesystems depend
      on the flusher thread making progress (e.g. because they need to flush
      delayed-allocation blocks to reduce allocation uncertainty), so the
      flusher thread doing truncate creates interesting dependencies and
      possibilities for deadlocks.
      
      We get rid of iput() in the flusher thread by using the fact that the
      I_SYNC inode flag effectively pins the inode in memory. So if we take
      care to either hold i_lock or have I_SYNC set, we can get away without
      taking an inode reference in writeback_sb_inodes().
      
      As a side effect of these changes, we also fix a possible use-after-free
      in wb_writeback(), because the inode_wait_for_writeback() call could try
      to reacquire i_lock on an inode that had already been freed.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      169ebd90
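
      Roughly, the flusher-side pattern after this change looks like the
      sketch below (heavily simplified; the surrounding list handling and
      error paths are omitted):

          /* in writeback_sb_inodes(), per inode, roughly: */
          spin_lock(&inode->i_lock);
          inode->i_state |= I_SYNC;       /* from here on the inode is pinned */
          spin_unlock(&inode->i_lock);

          /* no igrab()/iput() pair needed any more */
          __writeback_single_inode(inode, wb, &wbc);

          spin_lock(&inode->i_lock);
          inode_sync_complete(inode);     /* clears I_SYNC and wakes waiters */
          spin_unlock(&inode->i_lock);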
    • writeback: Refactor writeback_single_inode() · 4f8ad655
      Committed by Jan Kara
      The code in writeback_single_inode() is relatively complex. The list
      requeueing logic makes sense only for the flusher thread, not really for
      sync_inode() or write_inode_now() callers. Also, when we want to get rid
      of the inode references held by the flusher thread, we will need special
      I_SYNC handling there.
      
      So separate the part of writeback_single_inode() which does the real
      writeback work into __writeback_single_inode() and make
      writeback_single_inode() do only the stuff necessary for callers writing
      a single inode, moving the special list handling into
      writeback_sb_inodes(). As a side effect this fixes a possible race where
      we could skip some inode during sync(2) because another writer refiled
      it from the b_io to the b_dirty list. Also, I_SYNC handling is moved
      into the callers of __writeback_single_inode() to make locking easier.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      4f8ad655
    • writeback: Remove wb->list_lock from writeback_single_inode() · f0d07b7f
      Committed by Jan Kara
      writeback_single_inode() doesn't need wb->list_lock for anything on entry now.
      So remove the requirement. This makes locking of writeback_single_inode()
      temporarily awkward (entering with i_lock, returning with i_lock and
      wb->list_lock) but it will be sanitized in the next patch.
      
      Also inode_wait_for_writeback() doesn't need wb->list_lock for anything. It was
      just taking it to make usage convenient for callers, but with
      writeback_single_inode() changing, it's not very convenient anymore. So remove
      the lock from that function.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      f0d07b7f
    • writeback: Separate inode requeueing after writeback · ccb26b5a
      Committed by Jan Kara
      Move inode requeueing after inode has been written out into a separate
      function.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      ccb26b5a
    • writeback: Move I_DIRTY_PAGES handling · 6290be1c
      Committed by Jan Kara
      Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed
      in writing them all, clear the bit only when we succeeded in writing all
      the pages. We also move the clearing of the bit close to the other
      i_state handling to separate it from the writeback list handling. This is
      desirable because list handling will differ for the flusher thread and
      other writeback_single_inode() callers in the future. No filesystem plays
      any tricks with I_DIRTY_PAGES (like checking it in its ->writepages or
      ->write_inode implementation) so this movement is safe.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      6290be1c
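
      The resulting i_state handling looks roughly like this (simplified
      sketch, not the exact hunk):

          struct address_space *mapping = inode->i_mapping;
          int dirty;

          spin_lock(&inode->i_lock);
          /* drop I_DIRTY_PAGES only if no dirty pages remain in the mapping */
          if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
                  inode->i_state &= ~I_DIRTY_PAGES;
          dirty = inode->i_state & I_DIRTY;
          inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
          spin_unlock(&inode->i_lock);
          /* "dirty" later decides whether ->write_inode needs to be called */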
    • writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() · cc1676d9
      Committed by Jan Kara
      When writeback_single_inode() is called on an inode which already has
      I_SYNC set while doing WB_SYNC_NONE, the inode is moved to the b_more_io
      list. However this makes sense only if the caller is the flusher thread.
      For other callers of writeback_single_inode() it doesn't really make
      sense and may even be wrong - the flusher thread may be doing WB_SYNC_ALL
      writeback in parallel.
      
      So we move requeueing from writeback_single_inode() to writeback_sb_inodes().
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      cc1676d9
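
      In writeback_sb_inodes() the check then looks roughly like the following
      (sketch; tracing and the exact comment wording omitted):

          /* inside the inode loop of writeback_sb_inodes(), roughly: */
          if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
                  /*
                   * Only the flusher should park a busy inode on b_more_io
                   * and move on; data-integrity callers wait for I_SYNC
                   * instead.
                   */
                  spin_unlock(&inode->i_lock);
                  requeue_io(inode, wb);
                  continue;
          }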
    • writeback: Move clearing of I_SYNC into inode_sync_complete() · 365b94ae
      Committed by Jan Kara
      Move clearing of I_SYNC into inode_sync_complete().  It is more logical to have
      clearing of I_SYNC bit and waking of waiters in one place. Also later we will
      have two places needing to clear I_SYNC and wake up waiters so this allows them
      to use the common helper. Moving of I_SYNC clearing to a later stage of
      writeback_single_inode() is safe since we hold i_lock all the time.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      365b94ae
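
      The resulting helper is small; roughly (sketch of the post-patch shape):

          static void inode_sync_complete(struct inode *inode)
          {
                  /* callers hold i_lock, so clearing the flag here is safe */
                  inode->i_state &= ~I_SYNC;
                  /* waiters must see I_SYNC cleared before being woken up */
                  smp_mb();
                  wake_up_bit(&inode->i_state, __I_SYNC);
          }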
  9. 21 Mar, 2012 (2 commits)
  10. 07 Mar, 2012 (1 commit)
  11. 29 Feb, 2012 (1 commit)
  12. 01 Feb, 2012 (1 commit)
  13. 08 Jan, 2012 (1 commit)
  14. 04 Jan, 2012 (1 commit)
  15. 18 Dec, 2011 (2 commits)
  16. 29 Nov, 2011 (1 commit)
  17. 22 Nov, 2011 (1 commit)
    • freezer: implement and use kthread_freezable_should_stop() · 8a32c441
      Committed by Tejun Heo
      Writeback and thinkpad_acpi have been using thaw_process() to prevent
      deadlock between the freezer and kthread_stop(); unfortunately, this
      is inherently racy - nothing prevents freezing from happening between
      thaw_process() and kthread_stop().
      
      This patch implements kthread_freezable_should_stop() which enters
      refrigerator if necessary but is guaranteed to return if
      kthread_stop() is invoked.  Both thaw_process() users are converted to
      use the new function.
      
      Note that this deadlock condition exists for many freezable
      kthreads.  They need to be converted to use the new should_stop
      helper or a freezable workqueue.
      
      Tested with synthetic test case.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      8a32c441
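
      A typical freezable kthread loop then looks like this (hypothetical
      example thread; do_flush_work() is a stand-in for the real
      per-iteration work):

          #include <linux/kthread.h>
          #include <linux/freezer.h>
          #include <linux/delay.h>

          static int example_flusher_thread(void *data)
          {
                  bool was_frozen;

                  set_freezable();

                  /*
                   * Enters the refrigerator when asked to freeze, but is
                   * guaranteed to return (true) once kthread_stop() is called.
                   */
                  while (!kthread_freezable_should_stop(&was_frozen)) {
                          if (was_frozen)
                                  pr_debug("woke up after being frozen\n");
                          do_flush_work(data);            /* hypothetical */
                          msleep_interruptible(5000);
                  }
                  return 0;
          }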
  18. 31 Oct, 2011 (2 commits)
  19. 03 Oct, 2011 (2 commits)
    • writeback: per-bdi background threshold · b00949aa
      Committed by Wu Fengguang
      One thing that puzzled me is that in the JBOD case, the per-disk writeout
      performance is smaller than in the corresponding single-disk case even
      when they have comparable bdi_thresh. Tracing shows that in the
      single-disk case, bdi_writeback is always kept high while in the JBOD
      case it could drop low from time to time, and correspondingly
      bdi_reclaimable could sometimes rush high.
      
      The fix is to watch bdi_reclaimable and kick background writeback as
      soon as it goes high. This resembles the global background threshold
      but in per-bdi manner. The trick is, as long as bdi_reclaimable does
      not go high, bdi_writeback naturally won't go low because
      bdi_reclaimable+bdi_writeback ~= bdi_thresh.
      
      With less fluctuated writeback pages, JBOD performance is observed to
      increase noticeably in various cases.
      
      vmstat:nr_written values before/after patch:
      
        3.1.0-rc4-wo-underrun+      3.1.0-rc4-bgthresh3+  
      ------------------------  ------------------------  
                     125596480       +25.9%    158179363  JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
                      61790815      +110.4%    130032231  JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
                      58853546        -0.1%     58823828  JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
                     110159811       +24.7%    137355377  JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
                      69544762       +10.8%     77080047  JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
                      50644862        +0.5%     50890006  JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
                      42677090       +28.0%     54643527  JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
                      47491324       +13.3%     53785605  JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
                      52548986        +0.9%     53001031  JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
                      26783091       +36.8%     36650248  JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
                      35526347       +14.0%     40492312  JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                      44670723        -1.1%     44177606  JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
                     127996037       +22.4%    156719990  JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
                      57518856        +3.8%     59677625  JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
                      51919909       +12.2%     58269894  JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
                      86410514       +79.0%    154660433  JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
                      40132519       +38.6%     55617893  JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
                      48423248        +7.5%     52042927  JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
                     206041046       +44.1%    296846536  JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
                      72312903       -19.4%     58272885  JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
                      50635672        -0.5%     50384787  JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
                      68308534      +115.7%    147324758  JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
                      57882933       +14.5%     66269621  JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
                      52183472       +12.8%     58855181  JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
                      53788956       +94.2%    104460352  JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
                      44493342       +35.5%     60298210  JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
                      42641209       +18.9%     50681038  JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      b00949aa
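
      After the change the background check considers the bdi as well, roughly
      along these lines (sketch; helper names follow the 3.1-era writeback
      code):

          static bool over_bground_thresh(struct backing_dev_info *bdi)
          {
                  unsigned long background_thresh, dirty_thresh;

                  global_dirty_limits(&background_thresh, &dirty_thresh);

                  /* global background threshold, as before */
                  if (global_page_state(NR_FILE_DIRTY) +
                      global_page_state(NR_UNSTABLE_NFS) > background_thresh)
                          return true;

                  /* new: per-bdi background threshold on bdi_reclaimable */
                  if (bdi_stat(bdi, BDI_RECLAIMABLE) >
                      bdi_dirty_limit(bdi, background_thresh))
                          return true;

                  return false;
          }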
    • writeback: add bg_threshold parameter to __bdi_update_bandwidth() · af6a3113
      Committed by Wu Fengguang
      No behavior change.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      af6a3113
  20. 31 Jul, 2011 (1 commit)
  21. 24 Jul, 2011 (1 commit)
    • writeback: don't busy retry writeback on new/freeing inodes · fcc5c222
      Committed by Wu Fengguang
      Fix a system hang bug introduced by commits b7a2441f ("writeback:
      remove writeback_control.more_io") and e8dfc305 ("writeback: elevate
      queue_io() into wb_writeback()"), easily reproducible under high memory
      pressure with lots of file creations and deletions, for example a kernel
      build in limited memory.
      
      It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE
      state: the flusher gets stuck busy-retrying that inode, never releasing
      wb->list_lock. The lock in turn blocks all kinds of other tasks when
      they try to grab it.
      
      As put by Jan, it's a safe change regarding data integrity. I_FREEING or
      I_WILL_FREE inodes are written back by iput_final() and it is reclaim
      code that is responsible for eventually removing them. So writeback code
      can safely ignore them. I_NEW inodes should move out of this state when
      they are fully set up and in the writeback round following that, we will
      consider them for writeback. So the change makes sense.                                                         
      
      CC: Jan Kara <jack@suse.cz> 
      Reported-by: Hugh Dickins <hughd@google.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      fcc5c222
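
      The fix essentially requeues such inodes instead of busy-retrying them,
      roughly (sketch of the writeback_sb_inodes() hunk):

          /* inside the inode loop of writeback_sb_inodes(), roughly: */
          spin_lock(&inode->i_lock);
          if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                  spin_unlock(&inode->i_lock);
                  /*
                   * Park the inode for a later round instead of spinning
                   * on it while holding wb->list_lock.
                   */
                  redirty_tail(inode, wb);
                  continue;
          }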
  22. 20 Jul, 2011 (1 commit)
    • superblock: move pin_sb_for_writeback() to fs/super.c · 12ad3ab6
      Committed by Dave Chinner
      The per-sb shrinker has the same requirement as the writeback
      threads of ensuring that the superblock is usable and pinned for the
      time it takes to run the work. Both need to take a passive reference
      to the sb, take a read lock on the s_umount lock and then only
      continue if an unmount is not in progress.
      
      pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
      rename it to grab_super_passive() and export it via fs/internal.h so
      that all the VFS code can use it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      12ad3ab6
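
      The moved helper looks roughly like this in fs/super.c (sketch; the
      passive reference and trylock follow the description above):

          bool grab_super_passive(struct super_block *sb)
          {
                  spin_lock(&sb_lock);
                  if (list_empty(&sb->s_instances)) {
                          spin_unlock(&sb_lock);
                          return false;
                  }

                  sb->s_count++;          /* passive reference */
                  spin_unlock(&sb_lock);

                  if (down_read_trylock(&sb->s_umount)) {
                          if (sb->s_root) /* not being unmounted */
                                  return true;
                          up_read(&sb->s_umount);
                  }

                  put_super(sb);
                  return false;
          }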
  23. 10 Jul, 2011 (4 commits)
    • writeback: scale IO chunk size up to half device bandwidth · 1a12d8bd
      Committed by Wu Fengguang
      Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
      concern of not holding I_SYNC for too long.  (At least, that was the
      comment previously.)  This doesn't make sense now because the only
      time we wait for I_SYNC is if we are calling sync or fsync, and in
      that case we need to write out all of the data anyway.  Previously
      there may have been other code paths that waited on I_SYNC, but not
      any more.					    -- Theodore Ts'o
      
      So remove the MAX_WRITEBACK_PAGES constraint. The writeback chunk will
      adapt to as many pages as the storage device can write within 500ms.
      
      XFS is observed to do IO completions in a batch, and the batch size is
      equal to the write chunk size. To avoid dirty pages suddenly dropping
      out of balance_dirty_pages()'s dirty control scope and creating large
      fluctuations, the chunk size is also limited to half the control scope.
      
      The balance_dirty_pages() control scope is
      
      	[(background_thresh + dirty_thresh) / 2, dirty_thresh]
      
      which is by default [15%, 20%] of global dirty pages, whose range size
      is dirty_thresh / DIRTY_FULL_SCOPE.
      
      The adaptive write chunk size will be rounded to the nearest 4MB
      boundary.
      
      http://bugzilla.kernel.org/show_bug.cgi?id=13930
      
      CC: Theodore Ts'o <tytso@mit.edu>
      CC: Dave Chinner <david@fromorbit.com>
      CC: Chris Mason <chris.mason@oracle.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      1a12d8bd
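
      The resulting chunk-size computation looks roughly like the sketch below
      (MIN_WRITEBACK_PAGES is the 4MB rounding granularity; avg_write_bandwidth
      is in pages per second, so dividing by two gives about 500ms worth of
      pages):

          static long writeback_chunk_size(struct backing_dev_info *bdi,
                                           struct wb_writeback_work *work)
          {
                  long pages;

                  /* sync(2)-style work has to write everything out anyway */
                  if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
                          pages = LONG_MAX;
                  else {
                          /*
                           * About half a second of writeout, capped at half
                           * the dirty control scope and rounded to 4MB.
                           */
                          pages = min(bdi->avg_write_bandwidth / 2,
                                      global_dirty_limit / DIRTY_SCOPE);
                          pages = min(pages, work->nr_pages);
                          pages = round_down(pages + MIN_WRITEBACK_PAGES,
                                             MIN_WRITEBACK_PAGES);
                  }

                  return pages;
          }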
    • writeback: introduce smoothed global dirty limit · c42843f2
      Committed by Wu Fengguang
      The start of a heavyweight application (i.e. KVM) may instantly knock
      down determine_dirtyable_memory() if swap is not enabled or is full.
      global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
      dirty thresholds that are _much_ lower than the global/bdi dirty pages.
      
      balance_dirty_pages() will then heavily throttle all dirtiers including
      the light ones, until the dirty pages drop below the new dirty thresholds.
      During this _deep_ dirty-exceeded state, the system may appear rather
      unresponsive to the users.
      
      About "deep" dirty-exceeded: task_dirty_limit() assigns a 1/8 lower dirty
      threshold to heavy dirtiers than to light ones, and the dirty pages will
      be throttled around the heavy dirtiers' dirty threshold and reasonably
      below the light dirtiers' dirty threshold. In this state, only the heavy
      dirtiers will be throttled and the dirty pages are carefully controlled
      to not exceed the light dirtiers' dirty threshold. However if the
      threshold itself suddenly drops below the number of dirty pages, the
      light dirtiers will get heavily throttled.
      
      So introduce global_dirty_limit for tracking the global dirty threshold
      with policies
      
      - follow downwards slowly
      - follow up in one shot
      
      global_dirty_limit can effectively mask out the impact of a sudden drop
      in dirtyable memory. It will be used in the next patch for two new types of
      dirty limits. Note that the new dirty limits are not going to avoid
      throttling the light dirtiers, but could limit their sleep time to 200ms.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      c42843f2
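
      The "follow down slowly, follow up in one shot" policy can be expressed
      roughly as (sketch):

          static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
          {
                  unsigned long limit = global_dirty_limit;

                  /* follow up in one shot */
                  if (limit < thresh) {
                          limit = thresh;
                          goto update;
                  }

                  /*
                   * Follow down slowly. Use the higher of thresh and dirty as
                   * the target, because thresh may drop below the current
                   * number of dirty pages.
                   */
                  thresh = max(thresh, dirty);
                  if (limit > thresh) {
                          limit -= (limit - thresh) >> 5;
                          goto update;
                  }
                  return;
          update:
                  global_dirty_limit = limit;
          }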
    • writeback: bdi write bandwidth estimation · e98be2d5
      Committed by Wu Fengguang
      The estimation value will start from 100MB/s and adapt to the real
      bandwidth in seconds.
      
      It tries to update the bandwidth only when disk is fully utilized.
      Any inactive period of more than one second will be skipped.
      
      The estimated bandwidth will reflect how fast the device can write out
      when _fully utilized_, and won't drop to 0 when it goes idle. The value
      will remain constant while the disk is idle. At busy write time, if not
      considering fluctuations, it will also remain high unless it is knocked
      down by possible concurrent reads that compete with the async writes for
      disk time and bandwidth.
      
      The estimation is not done purely in the flusher because there is no
      guarantee that write_cache_pages() will return in time to update the
      bandwidth.
      
      The bdi->avg_write_bandwidth smoothing is very effective for filtering
      out sudden spikes, however it may be a little biased in the long term.
      
      The overheads are low because the bdi bandwidth update only occurs at
      200ms intervals.
      
      The 200ms update interval is suitable, because it's not possible to get
      the real instantaneous bandwidth at all, due to large fluctuations.
      
      NFS commits can be as large as several seconds' worth of data. One XFS
      completion may be as large as half a second's worth of data if we are
      going to increase the write chunk to half a second's worth of data. In
      ext4, fluctuations with a time period of around 5 seconds are observed.
      And there is another pattern of irregular periods of up to 20 seconds in
      SSD tests.
      
      That's why we are not only doing the estimation at 200ms intervals, but
      also averaging the samples over a period of 3 seconds and then going
      further to do another level of smoothing in avg_write_bandwidth.
      
      CC: Li Shaohua <shaohua.li@intel.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      e98be2d5
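
      A simplified sketch of the two-level smoothing described above
      (illustrative only, not the exact kernel arithmetic; the bandwidth
      fields are the ones this patch adds to backing_dev_info):

          /* called at most every ~200ms with the pages written since last time */
          static void sample_write_bandwidth(struct backing_dev_info *bdi,
                                             unsigned long elapsed,  /* jiffies */
                                             unsigned long written)  /* pages */
          {
                  const unsigned long period = 3 * HZ;    /* ~3 second window */
                  unsigned long bw;

                  if (!elapsed)
                          return;
                  bw = written * HZ / elapsed;    /* pages per second;
                                                     64-bit overflow handling omitted */
                  if (elapsed > period)
                          elapsed = period;

                  /* first level: fold the sample into a ~3s moving average */
                  bdi->write_bandwidth = (bw * elapsed +
                          bdi->write_bandwidth * (period - elapsed)) / period;

                  /* second level: smooth again to filter out sudden spikes */
                  bdi->avg_write_bandwidth =
                          (bdi->avg_write_bandwidth * 7 + bdi->write_bandwidth) / 8;
          }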
    • writeback: make writeback_control.nr_to_write straight · d46db3d5
      Committed by Wu Fengguang
      Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
      and initialize the struct writeback_control there.
      
      struct writeback_control is basically designed to control writeback of a
      single file, but we keep abusing it for writing multiple files in
      writeback_sb_inodes() and its callers.
      
      It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
      work->nr_pages starts to make sense, and instead of saving and restoring
      pages_skipped in writeback_sb_inodes() it can always start with a clean
      zero value.
      
      It also makes a neat IO pattern change: large dirty files are now
      written in the full 4MB writeback chunk size, rather than whatever
      quota remained in wbc->nr_to_write.
      Acked-by: Jan Kara <jack@suse.cz>
      Proposed-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      d46db3d5
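
      After the change, writeback_sb_inodes() sets up its own writeback_control
      from the work item, roughly (sketch; locking, requeueing and termination
      checks are omitted):

          static long writeback_sb_inodes(struct super_block *sb,
                                          struct bdi_writeback *wb,
                                          struct wb_writeback_work *work)
          {
                  struct writeback_control wbc = {
                          .sync_mode              = work->sync_mode,
                          .tagged_writepages      = work->tagged_writepages,
                          .for_kupdate            = work->for_kupdate,
                          .for_background         = work->for_background,
                          .range_cyclic           = work->range_cyclic,
                          .range_start            = 0,
                          .range_end              = LLONG_MAX,
                  };

                  while (!list_empty(&wb->b_io)) {
                          struct inode *inode = wb_inode(wb->b_io.prev);
                          long write_chunk = writeback_chunk_size(wb->bdi, work);

                          /* each inode gets a fresh quota and clean counters */
                          wbc.nr_to_write = write_chunk;
                          wbc.pages_skipped = 0;

                          writeback_single_inode(inode, wb, &wbc);

                          work->nr_pages -= write_chunk - wbc.nr_to_write;
                  }
                  return 0;
          }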
  24. 08 Jun, 2011 (3 commits)