1. 28 6月, 2013 1 次提交
    • D
      xfs: Inode create log items · 3ebe7d2d
      Dave Chinner 提交于
      Introduce the inode create log item type for logical inode create logging.
      Instead of logging the changes in buffers, pass the range to be
      initialised through the log by a new transaction type.  This reduces
      the amount of log space required to record initialisation during
      allocation from about 128 bytes per inode to a small fixed amount
      per inode extent to be initialised.
      
      This requires a new log item type to track it through the log
      and the AIL. This is a relatively simple item - most callbacks are
      noops as this item has the same life cycle as the transaction.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      3ebe7d2d
  2. 20 6月, 2013 1 次提交
    • J
      xfs: Remove XFS_MOUNT_RETERR · 39a45d84
      Jie Liu 提交于
      XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if
      mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)"
      branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so
      we always be emitting a warning and returning an error.
      
      Hence, we can remove it and get rid of a couple of redundant
      check up against it at xfs_upate_alignment().
      
      Thanks Dave Chinner for the suggestions of simplify the code
      in xfs_parseargs().
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      39a45d84
  3. 06 6月, 2013 1 次提交
  4. 04 3月, 2013 1 次提交
    • E
      fs: Limit sys_mount to only request filesystem modules. · 7f78e035
      Eric W. Biederman 提交于
      Modify the request_module to prefix the file system type with "fs-"
      and add aliases to all of the filesystems that can be built as modules
      to match.
      
      A common practice is to build all of the kernel code and leave code
      that is not commonly needed as modules, with the result that many
      users are exposed to any bug anywhere in the kernel.
      
      Looking for filesystems with a fs- prefix limits the pool of possible
      modules that can be loaded by mount to just filesystems trivially
      making things safer with no real cost.
      
      Using aliases means user space can control the policy of which
      filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
      with blacklist and alias directives.  Allowing simple, safe,
      well understood work-arounds to known problematic software.
      
      This also addresses a rare but unfortunate problem where the filesystem
      name is not the same as it's module name and module auto-loading
      would not work.  While writing this patch I saw a handful of such
      cases.  The most significant being autofs that lives in the module
      autofs4.
      
      This is relevant to user namespaces because we can reach the request
      module in get_fs_type() without having any special permissions, and
      people get uncomfortable when a user specified string (in this case
      the filesystem type) goes all of the way to request_module.
      
      After having looked at this issue I don't think there is any
      particular reason to perform any filtering or permission checks beyond
      making it clear in the module request that we want a filesystem
      module.  The common pattern in the kernel is to call request_module()
      without regards to the users permissions.  In general all a filesystem
      module does once loaded is call register_filesystem() and go to sleep.
      Which means there is not much attack surface exposed by loading a
      filesytem module unless the filesystem is mounted.  In a user
      namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
      which most filesystems do not set today.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Reported-by: NKees Cook <keescook@google.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      7f78e035
  5. 14 1月, 2013 1 次提交
  6. 09 11月, 2012 1 次提交
  7. 18 10月, 2012 10 次提交
    • D
      xfs: rename xfs_sync.[ch] to xfs_icache.[ch] · 6d8b79cf
      Dave Chinner 提交于
      xfs_sync.c now only contains inode reclaim functions and inode cache
      iteration functions. It is not related to sync operations anymore.
      Rename to xfs_icache.c to reflect it's contents and prepare for
      consolidation with the other inode cache file that exists
      (xfs_iget.c).
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      6d8b79cf
    • D
      xfs: xfs_quiesce_attr() should quiesce the log like unmount · c75921a7
      Dave Chinner 提交于
      xfs_quiesce_attr() is supposed to leave the log empty with an
      unmount record written. Right now it does not wait for the AIL to be
      emptied before writing the unmount record, not does it wait for
      metadata IO completion, either. Fix it to use the same method and
      code as xfs_log_unmount().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      c75921a7
    • D
      xfs: move xfs_quiesce_attr() into xfs_super.c · c7eea6f7
      Dave Chinner 提交于
      Both callers of xfs_quiesce_attr() are in xfs_super.c, and there's
      nothing really sync-specific about this functionality so it doesn't
      really matter where it lives. Move it to benext to it's callers, so
      all the remount/sync_fs code is in the one place.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      c7eea6f7
    • D
      xfs: xfs_sync_fsdata is redundant · 34061f5c
      Dave Chinner 提交于
      Why do we need to write the superblock to disk once we've written
      all the data?  We don't actually - the reasons for doing this are
      lost in the mists of time, and go back to the way Irix used to drive
      VFS flushing.
      
      On linux, this code is only called from two contexts: remount and
      .sync_fs. In the remount case, the call is followed by a metadata
      sync, which unpins and writes the superblock.  In the sync_fs case,
      we only need to force the log to disk to ensure that the superblock
      is correctly on disk, so we don't actually need to write it. Hence
      the functionality is either redundant or superfluous and thus can be
      removed.
      
      Seeing as xfs_quiesce_data is essentially now just a log force,
      remove it as well and fold the code back into the two callers.
      Neither of them need the log covering check, either, as that is
      redundant for the remount case, and unnecessary for the .sync_fs
      case.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      34061f5c
    • D
      xfs: syncd workqueue is no more · 5889608d
      Dave Chinner 提交于
      With the syncd functions moved to the log and/or removed, the syncd
      workqueue is the only remaining bit left. It is used by the log
      covering/ail pushing work, as well as by the inode reclaim work.
      
      Given how cheap workqueues are these days, give the log and inode
      reclaim work their own work queues and kill the syncd work queue.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      5889608d
    • D
      xfs: xfs_sync_data is redundant. · 9aa05000
      Dave Chinner 提交于
      We don't do any data writeback from XFS any more - the VFS is
      completely responsible for that, including for freeze. We can
      replace the remaining caller with a VFS level function that
      achieves the same thing, but without conflicting with current
      writeback work.
      
      This means we can remove the flush_work and xfs_flush_inodes() - the
      VFS functionality completely replaces the internal flush queue for
      doing this writeback work in a separate context to avoid stack
      overruns.
      
      This does have one complication - it cannot be called with page
      locks held.  Hence move the flushing of delalloc space when ENOSPC
      occurs back up into xfs_file_aio_buffered_write when we don't hold
      any locks that will stall writeback.
      
      Unfortunately, writeback_inodes_sb_if_idle() is not sufficient to
      trigger delalloc conversion fast enough to prevent spurious ENOSPC
      whent here are hundreds of writers, thousands of small files and GBs
      of free RAM.  Hence we need to use sync_sb_inodes() to block callers
      while we wait for writeback like the previous xfs_flush_inodes
      implementation did.
      
      That means we have to hold the s_umount lock here, but because this
      call can nest inside i_mutex (the parent directory in the create
      case, held by the VFS), we have to use down_read_trylock() to avoid
      potential deadlocks. In practice, this trylock will succeed on
      almost every attempt as unmount/remount type operations are
      exceedingly rare.
      
      Note: we always need to pass a count of zero to
      generic_file_buffered_write() as the previously written byte count.
      We only do this by accident before this patch by the virtue of ret
      always being zero when there are no errors. Make this explicit
      rather than needing to specifically zero ret in the ENOSPC retry
      case.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Tested-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      9aa05000
    • D
      xfs: sync work is now only periodic log work · f661f1e0
      Dave Chinner 提交于
      The only thing the periodic sync work does now is flush the AIL and
      idle the log. These are really functions of the log code, so move
      the work to xfs_log.c and rename it appropriately.
      
      The only wart that this leaves behind is the xfssyncd_centisecs
      sysctl, otherwise the xfssyncd is dead. Clean up any comments that
      related to xfssyncd to reflect it's passing.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      f661f1e0
    • D
      xfs: don't run the sync work if the filesystem is read-only · 7f7bebef
      Dave Chinner 提交于
      If the filesystem is mounted or remounted read-only, stop the sync
      worker that tries to flush or cover the log if the filesystem is
      dirty. It's read-only, so it isn't dirty. Restart it on a remount,rw
      as necessary. This avoids the need for RO checks in the work.
      
      Similarly, stop the sync work when the filesystem is frozen, and
      start it again when the filesysetm is thawed. This avoids the need
      for special freeze checks in the work.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      7f7bebef
    • D
      xfs: rationalise xfs_mount_wq users · 7e18530b
      Dave Chinner 提交于
      Instead of starting and stopping background work on the xfs_mount_wq
      all at the same time, separate them to where they really are needed
      to start and stop.
      
      The xfs_sync_worker, only needs to be started after all the mount
      processing has completed successfully, while it needs to be stopped
      before the log is unmounted.
      
      The xfs_reclaim_worker is started on demand, and can be
      stopped before the unmount process does it's own inode reclaim pass.
      
      The xfs_flush_inodes work is run on demand, and so we really only
      need to ensure that it has stopped running before we start
      processing an unmount, freeze or remount,ro.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      7e18530b
    • D
      xfs: xfs_syncd_stop must die · 33c7a2bc
      Dave Chinner 提交于
      xfs_syncd_start and xfs_syncd_stop tie a bunch of unrelated
      functionailty together that actually have different start and stop
      requirements. Kill these functions and open code the start/stop
      methods for each of the background functions.
      
      Subsequent patches will move the start/stop functions around to the
      correct places to avoid races and shutdown issues.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      33c7a2bc
  8. 03 10月, 2012 1 次提交
  9. 27 9月, 2012 6 次提交
  10. 19 9月, 2012 1 次提交
    • B
      xfs: stop the sync worker before xfs_unmountfs · 0ba6e536
      Ben Myers 提交于
      Cancel work of the xfs_sync_worker before teardown of the log in
      xfs_unmountfs.  This prevents occasional crashes on unmount like so:
      
      PID: 21602  TASK: ee9df060  CPU: 0   COMMAND: "kworker/0:3"
       #0 [c5377d28] crash_kexec at c0292c94
       #1 [c5377d80] oops_end at c07090c2
       #2 [c5377d98] no_context at c06f614e
       #3 [c5377dbc] __bad_area_nosemaphore at c06f6281
       #4 [c5377df4] bad_area_nosemaphore at c06f629b
       #5 [c5377e00] do_page_fault at c070b0cb
       #6 [c5377e7c] error_code (via page_fault) at c070892c
          EAX: f300c6a8  EBX: f300c6a8  ECX: 000000c0  EDX: 000000c0  EBP: c5377ed0
          DS:  007b      ESI: 00000000  ES:  007b      EDI: 00000001  GS:  ffffad20
          CS:  0060      EIP: c0481ad0  ERR: ffffffff  EFLAGS: 00010246
       #7 [c5377eb0] atomic64_read_cx8 at c0481ad0
       #8 [c5377ebc] xlog_assign_tail_lsn_locked at f7cc7c6e [xfs]
       #9 [c5377ed4] xfs_trans_ail_delete_bulk at f7ccd520 [xfs]
      #10 [c5377f0c] xfs_buf_iodone at f7ccb602 [xfs]
      #11 [c5377f24] xfs_buf_do_callbacks at f7cca524 [xfs]
      #12 [c5377f30] xfs_buf_iodone_callbacks at f7cca5da [xfs]
      #13 [c5377f4c] xfs_buf_iodone_work at f7c718d0 [xfs]
      #14 [c5377f58] process_one_work at c024ee4c
      #15 [c5377f98] worker_thread at c024f43d
      #16 [c5377fbc] kthread at c025326b
      #17 [c5377fe8] kernel_thread_helper at c070e834
      
      PID: 26653  TASK: e79143b0  CPU: 3   COMMAND: "umount"
       #0 [cde0fda0] __schedule at c0706595
       #1 [cde0fe28] schedule at c0706b89
       #2 [cde0fe30] schedule_timeout at c0705600
       #3 [cde0fe94] __down_common at c0706098
       #4 [cde0fec8] __down at c0706122
       #5 [cde0fed0] down at c025936f
       #6 [cde0fee0] xfs_buf_lock at f7c7131d [xfs]
       #7 [cde0ff00] xfs_freesb at f7cc2236 [xfs]
       #8 [cde0ff10] xfs_fs_put_super at f7c80f21 [xfs]
       #9 [cde0ff1c] generic_shutdown_super at c0333d7a
      #10 [cde0ff38] kill_block_super at c0333e0f
      #11 [cde0ff48] deactivate_locked_super at c0334218
      #12 [cde0ff58] deactivate_super at c033495d
      #13 [cde0ff68] mntput_no_expire at c034bc13
      #14 [cde0ff7c] sys_umount at c034cc69
      #15 [cde0ffa0] sys_oldumount at c034ccd4
      #16 [cde0ffb0] system_call at c0707e66
      
      commit 11159a05 added this to xfs_log_unmount and needs to be cleaned up
      at a later date.
      Signed-off-by: NBen Myers <bpm@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      0ba6e536
  11. 18 9月, 2012 1 次提交
    • B
      xfs: stop the sync worker before xfs_unmountfs · 4026c9fd
      Ben Myers 提交于
      Cancel work of the xfs_sync_worker before teardown of the log in
      xfs_unmountfs.  This prevents occasional crashes on unmount like so:
      
      PID: 21602  TASK: ee9df060  CPU: 0   COMMAND: "kworker/0:3"
       #0 [c5377d28] crash_kexec at c0292c94
       #1 [c5377d80] oops_end at c07090c2
       #2 [c5377d98] no_context at c06f614e
       #3 [c5377dbc] __bad_area_nosemaphore at c06f6281
       #4 [c5377df4] bad_area_nosemaphore at c06f629b
       #5 [c5377e00] do_page_fault at c070b0cb
       #6 [c5377e7c] error_code (via page_fault) at c070892c
          EAX: f300c6a8  EBX: f300c6a8  ECX: 000000c0  EDX: 000000c0  EBP: c5377ed0
          DS:  007b      ESI: 00000000  ES:  007b      EDI: 00000001  GS:  ffffad20
          CS:  0060      EIP: c0481ad0  ERR: ffffffff  EFLAGS: 00010246
       #7 [c5377eb0] atomic64_read_cx8 at c0481ad0
       #8 [c5377ebc] xlog_assign_tail_lsn_locked at f7cc7c6e [xfs]
       #9 [c5377ed4] xfs_trans_ail_delete_bulk at f7ccd520 [xfs]
      #10 [c5377f0c] xfs_buf_iodone at f7ccb602 [xfs]
      #11 [c5377f24] xfs_buf_do_callbacks at f7cca524 [xfs]
      #12 [c5377f30] xfs_buf_iodone_callbacks at f7cca5da [xfs]
      #13 [c5377f4c] xfs_buf_iodone_work at f7c718d0 [xfs]
      #14 [c5377f58] process_one_work at c024ee4c
      #15 [c5377f98] worker_thread at c024f43d
      #16 [c5377fbc] kthread at c025326b
      #17 [c5377fe8] kernel_thread_helper at c070e834
      
      PID: 26653  TASK: e79143b0  CPU: 3   COMMAND: "umount"
       #0 [cde0fda0] __schedule at c0706595
       #1 [cde0fe28] schedule at c0706b89
       #2 [cde0fe30] schedule_timeout at c0705600
       #3 [cde0fe94] __down_common at c0706098
       #4 [cde0fec8] __down at c0706122
       #5 [cde0fed0] down at c025936f
       #6 [cde0fee0] xfs_buf_lock at f7c7131d [xfs]
       #7 [cde0ff00] xfs_freesb at f7cc2236 [xfs]
       #8 [cde0ff10] xfs_fs_put_super at f7c80f21 [xfs]
       #9 [cde0ff1c] generic_shutdown_super at c0333d7a
      #10 [cde0ff38] kill_block_super at c0333e0f
      #11 [cde0ff48] deactivate_locked_super at c0334218
      #12 [cde0ff58] deactivate_super at c033495d
      #13 [cde0ff68] mntput_no_expire at c034bc13
      #14 [cde0ff7c] sys_umount at c034cc69
      #15 [cde0ffa0] sys_oldumount at c034ccd4
      #16 [cde0ffb0] system_call at c0707e66
      
      commit 11159a05 added this to xfs_log_unmount and needs to be cleaned up
      at a later date.
      Signed-off-by: NBen Myers <bpm@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      4026c9fd
  12. 21 8月, 2012 1 次提交
    • T
      workqueue: deprecate flush[_delayed]_work_sync() · 43829731
      Tejun Heo 提交于
      flush[_delayed]_work_sync() are now spurious.  Mark them deprecated
      and convert all users to flush[_delayed]_work().
      
      If you're cc'd and wondering what's going on: Now all workqueues are
      non-reentrant and the regular flushes guarantee that the work item is
      not pending or running on any CPU on return, so there's no reason to
      use the sync flushes at all and they're going away.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Ian Campbell <ian.campbell@citrix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mattia Dongili <malattia@linux.it>
      Cc: Kent Yoder <key@linux.vnet.ibm.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Karsten Keil <isdn@linux-pingi.de>
      Cc: Bryan Wu <bryan.wu@canonical.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: linux-wireless@vger.kernel.org
      Cc: Anton Vorontsov <cbou@mail.ru>
      Cc: Sangbeom Kim <sbkim73@samsung.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Avi Kivity <avi@redhat.com> 
      43829731
  13. 30 7月, 2012 1 次提交
  14. 22 7月, 2012 1 次提交
  15. 02 7月, 2012 2 次提交
    • D
      xfs: remove struct xfs_dabuf and infrastructure · 1d9025e5
      Dave Chinner 提交于
      The struct xfs_dabuf now only tracks a single xfs_buf and all the
      information it holds can be gained directly from the xfs_buf. Hence
      we can remove the struct dabuf and pass the xfs_buf around
      everywhere.
      
      Kill the struct dabuf and the associated infrastructure.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      1d9025e5
    • D
      xfs: struct xfs_buf_log_format isn't variable sized. · 77c1a08f
      Dave Chinner 提交于
      The struct xfs_buf_log_format wants to think the dirty bitmap is
      variable sized.  In fact, it is variable size on disk simply due to
      the way we map it from the in-memory structure, but we still just
      use a fixed size memory allocation for the in-memory structure.
      
      Hence it makes no sense to set the function up as a variable sized
      structure when we already know it's maximum size, and we always
      allocate it as such. Simplify the structure by making the dirty
      bitmap a fixed sized array and just using the size of the structure
      for the allocation size.
      
      This will make it much simpler to allocate and manipulate an array
      of format structures for discontiguous buffer support.
      
      The previous struct xfs_buf_log_item size according to
      /proc/slabinfo was 224 bytes. pahole doesn't give the same size
      because of the variable size definition. With this modification,
      pahole reports the same as /proc/slabinfo:
      
      	/* size: 224, cachelines: 4, members: 6 */
      
      Because the xfs_buf_log_item size is now determined by the maximum
      supported block size we introduce a dependency on xfs_alloc_btree.h.
      Avoid this dependency by moving the idefines for the maximum block
      sizes supported to xfs_types.h with all the other max/min type
      defines to avoid any new dependencies.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      77c1a08f
  16. 15 5月, 2012 4 次提交
    • D
      xfs: clean up xfs_bit.h includes · ad1e95c5
      Dave Chinner 提交于
      With the removal of xfs_rw.h and other changes over time, xfs_bit.h
      is being included in many files that don't actually need it. Clean
      up the includes as necessary.
      
      Also move the only-used-once xfs_ialloc_find_free() static inline
      function out of a header file that is widely included to reduce
      the number of needless dependencies on xfs_bit.h.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      ad1e95c5
    • D
      xfs: Do background CIL flushes via a workqueue · 4c2d542f
      Dave Chinner 提交于
      Doing background CIL flushes adds significant latency to whatever
      async transaction that triggers it. To avoid blocking async
      transactions on things like waiting for log buffer IO to complete,
      move the CIL push off into a workqueue.  By moving the push work
      into a workqueue, we remove all the latency that the commit adds
      from the foreground transaction commit path. This also means that
      single threaded workloads won't do the CIL push procssing, leaving
      them more CPU to do more async transactions.
      
      To do this, we need to keep track of the sequence number we have
      pushed work for. This avoids having many transaction commits
      attempting to schedule work for the same sequence, and ensures that
      we only ever have one push (background or forced) in progress at a
      time. It also means that we don't need to take the CIL lock in write
      mode to check for potential background push races, which reduces
      lock contention.
      
      To avoid potential issues with "smart" IO schedulers, don't use the
      workqueue for log force triggered flushes. Instead, do them directly
      so that the log IO is done directly by the process issuing the log
      force and so doesn't get stuck on IO elevator queue idling
      incorrectly delaying the log IO from the workqueue.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      4c2d542f
    • C
      xfs: on-stack delayed write buffer lists · 43ff2122
      Christoph Hellwig 提交于
      Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
      and write back the buffers per-process instead of by waking up xfsbufd.
      
      This is now easily doable given that we have very few places left that write
      delwri buffers:
      
       - log recovery:
      	Only done at mount time, and already forcing out the buffers
      	synchronously using xfs_flush_buftarg
      
       - quotacheck:
      	Same story.
      
       - dquot reclaim:
      	Writes out dirty dquots on the LRU under memory pressure.  We might
      	want to look into doing more of this via xfsaild, but it's already
      	more optimal than the synchronous inode reclaim that writes each
      	buffer synchronously.
      
       - xfsaild:
      	This is the main beneficiary of the change.  By keeping a local list
      	of buffers to write we reduce latency of writing out buffers, and
      	more importably we can remove all the delwri list promotions which
      	were hitting the buffer cache hard under sustained metadata loads.
      
      The implementation is very straight forward - xfs_buf_delwri_queue now gets
      a new list_head pointer that it adds the delwri buffers to, and all callers
      need to eventually submit the list using xfs_buf_delwi_submit or
      xfs_buf_delwi_submit_nowait.  Buffers that already are on a delwri list are
      skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
      list.  The biggest change to pass down the buffer list was done to the AIL
      pushing. Now that we operate on buffers the trylock, push and pushbuf log
      item methods are merged into a single push routine, which tries to lock the
      item, and if possible add the buffer that needs writeback to the buffer list.
      This leads to much simpler code than the previous split but requires the
      individual IOP_PUSH instances to unlock and reacquire the AIL around calls
      to blocking routines.
      
      Given that xfsailds now also handle writing out buffers, the conditions for
      log forcing and the sleep times needed some small changes.  The most
      important one is that we consider an AIL busy as long we still have buffers
      to push, and the other one is that we do increment the pushed LSN for
      buffers that are under flushing at this moment, but still count them towards
      the stuck items for restart purposes.  Without this we could hammer on stuck
      items without ever forcing the log and not make progress under heavy random
      delete workloads on fast flash storage devices.
      
      [ Dave Chinner:
      	- rebase on previous patches.
      	- improved comments for XBF_DELWRI_Q handling
      	- fix XBF_ASYNC handling in queue submission (test 106 failure)
      	- rename delwri submit function buffer list parameters for clarity
      	- xfs_efd_item_push() should return XFS_ITEM_PINNED ]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      43ff2122
    • S
      xfs: using GFP_NOFS for blkdev_issue_flush · 7582df51
      Shaohua Li 提交于
      Issuing a block device flush request in transaction context using GFP_KERNEL
      directly can cause deadlocks due to memory reclaim recursion. Use GFP_NOFS to
      avoid recursion from reclaim context.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      7582df51
  17. 06 5月, 2012 1 次提交
  18. 18 4月, 2012 1 次提交
    • D
      xfs: Ensure inode reclaim can run during quotacheck · 8a00ebe4
      Dave Chinner 提交于
      Because the mount process can run a quotacheck and consume lots of
      inodes, we need to be able to run periodic inode reclaim during the
      mount process. This will prevent running the system out of memory
      during quota checks.
      
      This essentially reverts 2bcf6e97, but that is safe to do now that
      the quota sync code that was causing problems during long quotacheck
      executions is now gone.
      
      The reclaim work is currently protected from running during the
      unmount process by a check against MS_ACTIVE. Unfortunately, this
      also means that the reclaim work cannot run during mount.  The
      unmount process should stop the reclaim cleanly before freeing
      anything that the reclaim work depends on, so there is no need to
      have this guard in place.
      
      Also, the inode reclaim work is demand driven, so there is no need
      to start it immediately during mount. It will be started the moment
      an inode is queued for reclaim, so qutoacheck will trigger it just
      fine.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      8a00ebe4
  19. 17 4月, 2012 1 次提交
  20. 27 3月, 2012 1 次提交
    • D
      xfs: don't cache inodes read through bulkstat · 5132ba8f
      Dave Chinner 提交于
      When we read inodes via bulkstat, we generally only read them once
      and then throw them away - they never get used again. If we retain
      them in cache, then it simply causes the working set of inodes and
      other cached items to be reclaimed just so the inode cache can grow.
      
      Avoid this problem by marking inodes read by bulkstat not to be
      cached and check this flag in .drop_inode to determine whether the
      inode should be added to the VFS LRU or not. If the inode lookup
      hits an already cached inode, then don't set the flag. If the inode
      lookup hits an inode marked with no cache flag, remove the flag and
      allow it to be cached once the current reference goes away.
      
      Inodes marked as not cached will get cleaned up by the background
      inode reclaim or via memory pressure, so they will still generate
      some short term cache pressure. They will, however, be reclaimed
      much sooner and in preference to cache hot inodes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      5132ba8f
  21. 23 3月, 2012 1 次提交
    • D
      xfs: introduce an allocation workqueue · c999a223
      Dave Chinner 提交于
      We currently have significant issues with the amount of stack that
      allocation in XFS uses, especially in the writeback path. We can
      easily consume 4k of stack between mapping the page, manipulating
      the bmap btree and allocating blocks from the free list. Not to
      mention btree block readahead and other functionality that issues IO
      in the allocation path.
      
      As a result, we can no longer fit allocation in the writeback path
      in the stack space provided on x86_64. To alleviate this problem,
      introduce an allocation workqueue and move all allocations to a
      seperate context. This can be easily added as an interposing layer
      into xfs_alloc_vextent(), which takes a single argument structure
      and does not return until the allocation is complete or has failed.
      
      To do this, add a work structure and a completion to the allocation
      args structure. This allows xfs_alloc_vextent to queue the args onto
      the workqueue and wait for it to be completed by the worker. This
      can be done completely transparently to the caller.
      
      The worker function needs to ensure that it sets and clears the
      PF_TRANS flag appropriately as it is being run in an active
      transaction context. Work can also be queued in a memory reclaim
      context, so a rescuer is needed for the workqueue.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      c999a223
  22. 21 3月, 2012 1 次提交