1. 20 Jul, 2016 1 commit
  2. 28 May, 2016 1 commit
  3. 20 May, 2016 1 commit
  4. 19 May, 2016 2 commits
  5. 18 May, 2016 19 commits
    • xfs: move reclaim tagging functions · ad438c40
      Dave Chinner committed
      Rearrange the inode tagging functions so that they are higher up in
      xfs_icache.c and so there is no need for forward prototypes to be
      defined. This is purely code movement, no other change.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      ad438c40
    • xfs: simplify inode reclaim tagging interfaces · 545c0889
      Dave Chinner committed
      Inode radix tree tagging for reclaim passes a lot of unnecessary
      variables around. Over time the xfs_perag has grown an xfs_mount
      backpointer and an internal agno, so we don't need to pass other
      variables into the tagging functions to supply this information.
      
      Rework the functions to pass the minimal variable set required
      and simplify the internal logic and flow.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      545c0889
    • xfs: rename variables in xfs_iflush_cluster for clarity · 19429363
      Dave Chinner committed
      The cluster inode variable uses unconventional naming - iq - which
      makes it hard to distinguish from the inode passed into the
      function - ip - and that is a vector for mistakes to be made.
      Rename all the cluster inode variables to use more conventional
      prefixes to reduce potential future confusion (cilist, cilist_size,
      cip).
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      19429363
    • xfs: xfs_iflush_cluster has range issues · 5a90e53e
      Dave Chinner committed
      xfs_iflush_cluster() does a gang lookup on the radix tree, meaning
      it can find inodes beyond the current cluster if there is sparse
      cache population. Gang lookups return results in ascending index
      order, so stop trying to cluster inodes once the first inode outside
      the cluster mask is detected.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      5a90e53e
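The range check described in this commit can be illustrated with a minimal user-space sketch. This is not the kernel code: `count_in_cluster`, its parameters, and the plain array standing in for the radix tree gang lookup results are all hypothetical, and the sketch assumes the cluster size is a power of two so that a mask identifies the cluster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for walking gang lookup results. `inos` holds
 * inode numbers in ascending order, as a gang lookup would return them.
 * Returns how many leading entries belong to the cluster that starts
 * at `first_ino`.
 */
size_t count_in_cluster(const uint64_t *inos, size_t nr_found,
                        uint64_t first_ino, uint64_t cluster_mask)
{
    size_t n = 0;

    assert(inos != NULL || nr_found == 0);

    for (size_t i = 0; i < nr_found; i++) {
        /*
         * Results ascend, so the first inode outside the cluster
         * means every later one is outside it too: stop here.
         */
        if ((inos[i] & cluster_mask) != first_ino)
            break;
        n++;
    }
    return n;
}
```

For example, with a 64-inode cluster starting at inode 64 and lookup results {64, 65, 70, 128, 129}, the loop stops at 128 and reports three inodes in the cluster.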
    • xfs: mark reclaimed inodes invalid earlier · 8a17d7dd
      Dave Chinner committed
      The last thing we do before using call_rcu() on an xfs_inode to be
      freed is mark it as invalid. This means there is a window between
      when we know for certain that the inode is going to be freed and
      when we do actually mark it as "freed".
      
      This is important in the context of RCU lookups - we can look up the
      inode, find that it is valid, and then use it as such not realising
      that it is in the final stages of being freed.
      
      As such, mark the inode as being invalid the moment we know it is
      going to be reclaimed. This can be done while we still hold the
      XFS_ILOCK_EXCL and the flush lock in xfs_reclaim_inode, meaning that
      it occurs well before we remove it from the radix tree, and that
      the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
      synchronisation points for detecting that an inode is about to go
      away.
      
      For defensive purposes, this allows us to add a further check to
      xfs_iflush_cluster to ensure we skip inodes that are being freed
      after we grab the XFS_ILOCK_SHARED and the flush lock - if the
      inode number is valid while we have these locks held, we know
      that it has not progressed through reclaim to the point where it is
      clean and is about to be freed.
      
      [bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it
      	  had already been zeroed.]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      8a17d7dd
    • xfs: xfs_inode_free() isn't RCU safe · 1f2dcfe8
      Dave Chinner committed
      The xfs_inode freed in xfs_inode_free() has multiple allocated
      structures attached to it. We free these in xfs_inode_free() before
      we mark the inode as invalid, and before we run call_rcu() to queue
      the structure for freeing.
      
      Unfortunately, this freeing can race with other accesses that are in
      the current RCU grace period and have found the inode in the radix
      tree with a valid state.  This includes xfs_iflush_cluster(), which
      calls xfs_inode_clean(), and that accesses the inode log item on the
      xfs_inode.
      
      The log item structure is freed in xfs_inode_free(), so there is the
      possibility we can be accessing freed memory in xfs_iflush_cluster()
      after validating the xfs_inode structure as being valid for this RCU
      context. Hence we can get spuriously incorrect clean state returned
      from such checks. This can lead to us thinking the inode is dirty
      when it is, in fact, clean, and so incorrectly attaching it to the
      buffer for IO and completion processing.
      
      This then leads to use-after-free situations on the xfs_inode itself
      if the IO completes after the current RCU grace period expires. The
      buffer callbacks will access the xfs_inode and try to do all sorts
      of things it shouldn't with freed memory.
      
      IOWs, xfs_iflush_cluster() only works correctly when racing with
      inode reclaim if the inode log item is present and correctly stating
      the inode is clean. If the inode is being freed, then reclaim has
      already made sure the inode is clean, and hence xfs_iflush_cluster
      can skip it. However, we are accessing the inode under RCU
      read lock protection and so also must ensure that all dynamically
      allocated memory we reference in this context is not freed until the
      RCU grace period expires.
      
      To fix this, move all the potential memory freeing into
      xfs_inode_free_callback() so that we are guaranteed RCU protected
      lookup code will always have the memory structures it needs
      available during the RCU grace period that lookup races can occur
      in.
      Discovered-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      1f2dcfe8
    • xfs: optimise xfs_iext_destroy · 32b43ab6
      Alex Lyakas committed
      When unmounting XFS, we call:
      
      xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy
      
      This goes over the whole indirection array and calls
      xfs_iext_irec_remove for each one of the erps (from the last one to
      the first one). As a result, we keep shrinking (in fact,
      reallocating) the indirection array until we have removed all of
      its elements. When we have files with huge numbers of extents,
      umount takes 30-80 sec, depending on the number of files that XFS
      loaded and the number of indirection entries of each file. The
      unmount stack looks like:
      
      [<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
      [<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs]
      [<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs]
      [<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs]
      [<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs]
      [<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs]
      [<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs]
      [<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs]
      [<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs]
      [<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs]
      [<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100
      [<ffffffff811ea207>] kill_block_super+0x27/0x70
      [<ffffffff811ea519>] deactivate_locked_super+0x49/0x60
      [<ffffffff811eaaee>] deactivate_super+0x4e/0x70
      [<ffffffff81207593>] cleanup_mnt+0x43/0x90
      [<ffffffff81207632>] __cleanup_mnt+0x12/0x20
      [<ffffffff8108f8e7>] task_work_run+0xa7/0xe0
      [<ffffffff81014ff7>] do_notify_resume+0x97/0xb0
      [<ffffffff81717c6f>] int_signal+0x12/0x17
      
      Further, this reallocation prevents us from freeing the extent list
      from an RCU callback as allocation can block. Hence if the extent
      list is in indirect format, optimise the freeing of the extent list
      to only use kmem_free calls by freeing entire extent buffer pages at
      a time, rather than extent by extent.
      
      [dchinner: simplified freeing loop based on Christoph's suggestion]
      Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      32b43ab6
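The teardown strategy this commit moves to can be sketched in user space roughly as follows. The structure and function names are hypothetical, and `malloc`/`free` stand in for the kernel allocator. The point is only that each extent buffer is freed directly and the indirection array is freed once at the end, so the teardown path never reallocates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical indirection entry: each one owns a buffer of extents. */
struct irec_sketch {
    void *extent_buf;
};

/* Hypothetical fork: an array of indirection entries. */
struct ifork_sketch {
    struct irec_sketch *irecs;
    size_t nlists;
};

/*
 * Free every extent buffer, then free the indirection array once.
 * No allocation happens here, unlike a loop that shrinks the array
 * with a reallocation after removing each entry.
 */
void ifork_destroy(struct ifork_sketch *f)
{
    assert(f != NULL);

    for (size_t i = 0; i < f->nlists; i++)
        free(f->irecs[i].extent_buf);   /* free(NULL) is a no-op */

    free(f->irecs);
    f->irecs = NULL;
    f->nlists = 0;
}
```

Because this version only frees memory, it is the kind of teardown that could run from a context where blocking allocation is not allowed, which is the property the commit message asks for.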
    • xfs: skip stale inodes in xfs_iflush_cluster · 7d3aa7fe
      Dave Chinner committed
      We don't write back stale inodes so we should skip them in
      xfs_iflush_cluster, too.

      cc: <stable@vger.kernel.org> # 3.10.x-
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      7d3aa7fe
    • xfs: fix inode validity check in xfs_iflush_cluster · 51b07f30
      Dave Chinner committed
      Some careless idiot(*) wrote crap code in commit 1a3e8f3d ("xfs:
      convert inode cache lookups to use RCU locking") back in late 2010,
      and so xfs_iflush_cluster checks the wrong inode for whether it is
      still valid under RCU protection. Fix it to lock and check the
      correct inode.
      
      (*) Careless-idiot: Dave Chinner <dchinner@redhat.com>
      
      cc: <stable@vger.kernel.org> # 3.10.x-
      Discovered-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      51b07f30
    • xfs: xfs_iflush_cluster fails to abort on error · b1438f47
      Dave Chinner committed
      When a failure due to an inode buffer occurs, the error handling
      fails to abort the inode writeback correctly. This can result in the
      inode being reclaimed whilst still in the AIL, leading to
      use-after-free situations as well as filesystems that cannot be
      unmounted as the inode log items left in the AIL never get removed.
      
      Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
      the inode flush being aborted correctly.
      
      cc: <stable@vger.kernel.org> # 3.10.x-
      Reported-by: Shyam Kaushik <shyam@zadarastorage.com>
      Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com>
      Tested-by: Shyam Kaushik <shyam@zadarastorage.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      b1438f47
    • xfs: remove xfs_fs_evict_inode() · 8179c036
      Dave Chinner committed
      Joe Lawrence reported a list_add corruption with 4.6-rc1 when
      testing some custom md administration code that made its own
      block device nodes for the md array. The simple test loop of:
      
      for i in {0..100}; do
      	mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR
      	mdadm --detail --export $tmp/tmp_node > /dev/null
      	rm -f $tmp/tmp_node
      done
      
      
      Would produce this warning in bd_acquire() when mdadm opened the
      device node:
      
      list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8.
      
      And then produce this from bd_forget from kdevtmpfs evicting a block
      dev inode:
      
      list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8
      
      This is a regression caused by commit c19b3b05 ("xfs: move di_mode
      to vfs inode"). The issue is that xfs_inactive() frees the
      unlinked inode, and the above commit meant that this freeing zeroed
      the mode in the struct inode. The problem is that after evict() has
      called ->evict_inode, it expects the i_mode to be intact so that it
      can call bd_forget() or cd_forget() to drop the reference to the
      block device inode attached to the XFS inode.
      
      In reality, the only thing we do in xfs_fs_evict_inode() that is not
      generic is call xfs_inactive(). We can move the xfs_inactive() call
      to xfs_fs_destroy_inode() without any problems at all, and this
      will leave the VFS inode intact until it is completely done with it.
      
      So, remove xfs_fs_evict_inode(), and do the work it used to do in
      ->destroy_inode instead.
      
      cc: <stable@vger.kernel.org> # 4.6
      Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      8179c036
    • xfs: add "fail at unmount" error handling configuration · e6b3bb78
      Carlos Maiolino committed
      If we take "retry forever" literally on metadata IO errors, we can
      hang at unmount, as it retries those writes forever. This is the
      default behavior, unfortunately.

      Add an error configuration option for this behavior and default it
      to "fail" so that an unmount will trigger actual errors, a shutdown
      and allow the unmount to succeed. It will be noisy, though, as it
      will log the errors and shutdown that occurs.
      
      To fix this, we need to mark the filesystem as being in the process
      of unmounting. Do this with a mount flag that is added at the
      appropriate time (i.e. before the blocking AIL sync). We also need
      to add this flag if mount fails after the initial phase of log
      recovery has been run.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      e6b3bb78
    • xfs: add configuration handlers for specific errors · e0a431b3
      Carlos Maiolino committed
      Now that most of the infrastructure is in place, we can start adding
      support for configuring specific errors such as ENODEV, ENOSPC, EIO,
      etc. Add these error configurations and configure them all to have
      appropriate behaviours. That is, all will be configured to retry
      forever by default, except for ENODEV, which is an unrecoverable
      error, so it will be configured to not retry on error.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      e0a431b3
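One way to picture the defaults described in this commit is a small table of per-error settings copied into each mount's configuration. The sketch below is user-space C with hypothetical names throughout (`err_kind`, `err_cfg_sketch`, `init_error_cfgs`, and the `ERR_DEFAULT` catch-all entry are assumptions, not the XFS definitions). Only the default values themselves (retry forever, except ENODEV) come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define RETRY_FOREVER (-1)  /* assumed sentinel for "retry forever" */

/* Hypothetical error kinds; the commit names ENODEV, ENOSPC and EIO. */
enum err_kind {
    ERR_DEFAULT,            /* assumed catch-all entry */
    ERR_EIO,
    ERR_ENOSPC,
    ERR_ENODEV,
    ERR_KIND_MAX
};

struct err_cfg_sketch {
    const char *name;
    int max_retries;
};

/* Default behaviour per error: everything retries forever but ENODEV. */
static const struct err_cfg_sketch err_defaults[ERR_KIND_MAX] = {
    [ERR_DEFAULT] = { "default", RETRY_FOREVER },
    [ERR_EIO]     = { "EIO",     RETRY_FOREVER },
    [ERR_ENOSPC]  = { "ENOSPC",  RETRY_FOREVER },
    [ERR_ENODEV]  = { "ENODEV",  0 },   /* unrecoverable: do not retry */
};

/* Copy the table into a per-mount configuration array. */
void init_error_cfgs(struct err_cfg_sketch cfg[ERR_KIND_MAX])
{
    assert(cfg != NULL);

    for (size_t i = 0; i < ERR_KIND_MAX; i++)
        cfg[i] = err_defaults[i];
}
```

Keeping the defaults in one table means adding a new error kind is a one-line change, and the per-mount copy can then be adjusted independently at runtime.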
    • xfs: add configuration of error failure speed · a5ea70d2
      Carlos Maiolino committed
      On reception of an error, we can fail immediately, perform some
      bounded number of retries or retry indefinitely. The current
      behaviour we have is to retry forever.

      However, we'd like the ability to choose how long the filesystem
      should try after an error: it can either fail immediately, retry a
      few times, or retry forever. This is implemented by using the
      max_retries sysfs attribute to hold the number of times we allow
      the filesystem to retry after an error, with -1 being a special
      case where the filesystem will retry indefinitely.
      
      Add both a maximum retry count and a retry timeout so that we can
      bound by time and/or physical IO attempts.
      
      Finally, plumb these into xfs_buf_iodone error processing so that
      the error behaviour follows the selected configuration.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      a5ea70d2
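The retry policy described in this commit (a retry count bound, a time bound, and -1 meaning "retry forever") can be sketched as a small decision function. This is a user-space illustration with hypothetical names, not the logic in xfs_buf_iodone. In particular, treating a timeout of 0 as "no time limit" and measuring `elapsed` in arbitrary ticks are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RETRY_FOREVER (-1)  /* max_retries value meaning "no count limit" */

struct retry_cfg_sketch {
    int  max_retries;       /* -1: retry forever, 0: fail immediately */
    long retry_timeout;     /* assumed: 0 means no time limit */
};

/*
 * Decide whether retry number `attempt` (1 for the first retry) may be
 * issued, given `elapsed` ticks since the first failure.
 */
bool should_retry(const struct retry_cfg_sketch *cfg, int attempt,
                  long elapsed)
{
    assert(cfg != NULL && attempt >= 1 && elapsed >= 0);

    /* Bound by number of attempts, unless configured to retry forever. */
    if (cfg->max_retries != RETRY_FOREVER && attempt > cfg->max_retries)
        return false;

    /* Bound by time, if a timeout is configured. */
    if (cfg->retry_timeout != 0 && elapsed > cfg->retry_timeout)
        return false;

    return true;
}
```

With `max_retries` set to 0 the first retry is already refused, which corresponds to failing immediately. With -1 and no timeout the function always allows another attempt.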
    • xfs: introduce table-based init for error behaviors · ef6a50fb
      Carlos Maiolino committed
      Before we start expanding the number of error classes and errors we
      can configure behaviour for, we need a simple and clear way to
      define the default behaviour that we initialize each mount with.
      Introduce a table-based method for keeping the initial configuration
      in, and apply that to the existing initialization code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ef6a50fb
    • xfs: add configurable error support to metadata buffers · df309390
      Carlos Maiolino committed
      With the error configuration handle for async metadata write errors
      in place, we can now add initial support to the IO error processing
      in xfs_buf_iodone_error().

      Add an infrastructure function to look up the configuration handle,
      and rearrange the error handling to prepare the way for different
      error handling configurations to be used.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      df309390
    • xfs: introduce metadata IO error class · ffd40ef6
      Carlos Maiolino committed
      Now we have the basic infrastructure, add the first error class so
      we can build up the infrastructure in a meaningful way. Add the
      metadata async write IO error class and sysfs entry, and introduce a
      default configuration that matches the existing "retry forever"
      behavior for async write metadata buffers.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ffd40ef6
    • xfs: configurable error behavior via sysfs · 192852be
      Carlos Maiolino committed
      We need to be able to change the way XFS behaves in error
      conditions depending on the type of underlying storage. This is
      necessary for handling non-traditional block devices with extended
      error cases, such as thin provisioned devices that can return ENOSPC
      as an IO error.

      Introduce the basic sysfs infrastructure needed to define and
      configure error behaviours. This is done to be generic enough to
      extend to configuring behaviour in other error conditions, such as
      ENOMEM, which also has different desired behaviours according to
      machine configuration.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      192852be
    • xfs: buffer ->bi_end_io function requires irq-safe lock · 9bdd9bd6
      Brian Foster committed
      Reports have surfaced of a lockdep splat complaining about an
      irq-safe -> irq-unsafe locking order in the xfs_buf_bio_end_io() bio
      completion handler. This only occurs when I/O errors are present
      because bp->b_lock is only acquired in this context to protect
      setting an error on the buffer. The problem is that this lock can be
      acquired with the (request_queue) q->queue_lock held. See
      scsi_end_request() or ata_qc_schedule_eh(), for example.
      
      Replace the locked test/set of b_io_error with a cmpxchg() call.
      This eliminates the need for the lock and thus the lock ordering
      problem goes away.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      9bdd9bd6
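The lock-free "record only the first error" pattern this commit describes can be sketched in user space with C11 atomics. Here `atomic_compare_exchange_strong()` stands in for the kernel's cmpxchg(), and `buf_sketch`, `io_error` and `buf_record_io_error` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical buffer with an error field that is set at most once. */
struct buf_sketch {
    _Atomic int io_error;   /* 0 means no error recorded yet */
};

/*
 * Record `error` only if no error has been recorded so far. The
 * compare-and-exchange replaces a lock-protected "test, then set",
 * so no lock needs to be taken in the completion path.
 */
void buf_record_io_error(struct buf_sketch *bp, int error)
{
    int expected = 0;

    assert(bp != NULL);

    /* Succeeds only while io_error is still 0; later errors are dropped. */
    atomic_compare_exchange_strong(&bp->io_error, &expected, error);
}
```

Calling the function twice with different error codes leaves the first code in place, which is the behaviour the locked test/set provided.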
  6. 17 May, 2016 2 commits
  7. 03 May, 2016 1 commit
  8. 02 May, 2016 4 commits
  9. 11 Apr, 2016 1 commit
  10. 06 Apr, 2016 8 commits