1. 18 5月, 2016 8 次提交
    • D
      xfs: xfs_iflush_cluster has range issues · 5a90e53e
      Dave Chinner 提交于
      xfs_iflush_cluster() does a gang lookup on the radix tree, meaning
      it can find inodes beyond the current cluster if there is sparse
      cache population. gang lookups return results in ascending index
      order, so stop trying to cluster inodes once the first inode outside
      the cluster mask is detected.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5a90e53e
    • D
      xfs: mark reclaimed inodes invalid earlier · 8a17d7dd
      Dave Chinner 提交于
      The last thing we do before using call_rcu() on an xfs_inode to be
      freed is mark it as invalid. This means there is a window between
      when we know for certain that the inode is going to be freed and
      when we do actually mark it as "freed".
      
      This is important in the context of RCU lookups - we can look up the
      inode, find that it is valid, and then use it as such not realising
      that it is in the final stages of being freed.
      
      As such, mark the inode as being invalid the moment we know it is
      going to be reclaimed. This can be done while we still hold the
      XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that
      it occurs well before we remove it from the radix tree, and that
      the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
      synchronisation points for detecting that an inode is about to go
      away.
      
      For defensive purposes, this allows us to add a further check to
      xfs_iflush_cluster to ensure we skip inodes that are being freed
      after we grab the XFS_ILOCK_SHARED and the flush lock - we know that
      if the inode number if valid while we have these locks held we know
      that it has not progressed through reclaim to the point where it is
      clean and is about to be freed.
      
      [bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it
      	  had already been zeroed.]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8a17d7dd
    • D
      xfs: xfs_inode_free() isn't RCU safe · 1f2dcfe8
      Dave Chinner 提交于
      The xfs_inode freed in xfs_inode_free() has multiple allocated
      structures attached to it. We free these in xfs_inode_free() before
      we mark the inode as invalid, and before we run call_rcu() to queue
      the structure for freeing.
      
      Unfortunately, this freeing can race with other accesses that are in
      the RCU current grace period that have found the inode in the radix
      tree with a valid state.  This includes xfs_iflush_cluster(), which
      calls xfs_inode_clean(), and that accesses the inode log item on the
      xfs_inode.
      
      The log item structure is freed in xfs_inode_free(), so there is the
      possibility we can be accessing freed memory in xfs_iflush_cluster()
      after validating the xfs_inode structure as being valid for this RCU
      context. Hence we can get spuriously incorrect clean state returned
      from such checks. This can lead to use thinking the inode is dirty
      when it is, in fact, clean, and so incorrectly attaching it to the
      buffer for IO and completion processing.
      
      This then leads to use-after-free situations on the xfs_inode itself
      if the IO completes after the current RCU grace period expires. The
      buffer callbacks will access the xfs_inode and try to do all sorts
      of things it shouldn't with freed memory.
      
      IOWs, xfs_iflush_cluster() only works correctly when racing with
      inode reclaim if the inode log item is present and correctly stating
      the inode is clean. If the inode is being freed, then reclaim has
      already made sure the inode is clean, and hence xfs_iflush_cluster
      can skip it. However, we are accessing the inode inode under RCU
      read lock protection and so also must ensure that all dynamically
      allocated memory we reference in this context is not freed until the
      RCU grace period expires.
      
      To fix this, move all the potential memory freeing into
      xfs_inode_free_callback() so that we are guarantee RCU protected
      lookup code will always have the memory structures it needs
      available during the RCU grace period that lookup races can occur
      in.
      Discovered-by: NBrain Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      1f2dcfe8
    • A
      xfs: optimise xfs_iext_destroy · 32b43ab6
      Alex Lyakas 提交于
      When unmounting XFS, we call:
      
      xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy
      
      This goes over the whole indirection array and calls
      xfs_iext_irec_remove for each one of the erps (from the last one to
      the first one). As a result, we keep shrinking (reallocating
      actually) the indirection array until we shrink out all of its
      elements. When we have files with huge numbers of extents, umount
      takes 30-80 sec, depending on the amount of files that XFS loaded
      and the amount of indirection entries of each file. The unmount
      stack looks like:
      
      [<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
      [<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs]
      [<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs]
      [<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs]
      [<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs]
      [<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs]
      [<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs]
      [<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs]
      [<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs]
      [<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs]
      [<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100
      [<ffffffff811ea207>] kill_block_super+0x27/0x70
      [<ffffffff811ea519>] deactivate_locked_super+0x49/0x60
      [<ffffffff811eaaee>] deactivate_super+0x4e/0x70
      [<ffffffff81207593>] cleanup_mnt+0x43/0x90
      [<ffffffff81207632>] __cleanup_mnt+0x12/0x20
      [<ffffffff8108f8e7>] task_work_run+0xa7/0xe0
      [<ffffffff81014ff7>] do_notify_resume+0x97/0xb0
      [<ffffffff81717c6f>] int_signal+0x12/0x17
      
      Further, this reallocation prevents us from freeing the extent list
      from a RCU callback as allocation can block. Hence if the extent
      list is in indirect format, optimise the freeing of the extent list
      to only use kmem_free calls by freeing entire extent buffer pages at
      a time, rather than extent by extent.
      
      [dchinner: simplified freeing loop based on Christoph's suggestion]
      Signed-off-by: NAlex Lyakas <alex@zadarastorage.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      32b43ab6
    • D
      xfs: skip stale inodes in xfs_iflush_cluster · 7d3aa7fe
      Dave Chinner 提交于
      We don't write back stale inodes so we should skip them in
      xfs_iflush_cluster, too.
      
      cc: <stable@vger.kernel.org> # 3.10.x-
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      7d3aa7fe
    • D
      xfs: fix inode validity check in xfs_iflush_cluster · 51b07f30
      Dave Chinner 提交于
      Some careless idiot(*) wrote crap code in commit 1a3e8f3d ("xfs:
      convert inode cache lookups to use RCU locking") back in late 2010,
      and so xfs_iflush_cluster checks the wrong inode for whether it is
      still valid under RCU protection. Fix it to lock and check the
      correct inode.
      
      (*) Careless-idiot: Dave Chinner <dchinner@redhat.com>
      
      cc: <stable@vger.kernel.org> # 3.10.x-
      Discovered-by: NBrain Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      51b07f30
    • D
      xfs: xfs_iflush_cluster fails to abort on error · b1438f47
      Dave Chinner 提交于
      When a failure due to an inode buffer occurs, the error handling
      fails to abort the inode writeback correctly. This can result in the
      inode being reclaimed whilst still in the AIL, leading to
      use-after-free situations as well as filesystems that cannot be
      unmounted as the inode log items left in the AIL never get removed.
      
      Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
      the inode flush being aborted correctly.
      
      cc: <stable@vger.kernel.org> # 3.10.x-
      Reported-by: NShyam Kaushik <shyam@zadarastorage.com>
      Diagnosed-by: NShyam Kaushik <shyam@zadarastorage.com>
      Tested-by: NShyam Kaushik <shyam@zadarastorage.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      b1438f47
    • D
      xfs: remove xfs_fs_evict_inode() · 8179c036
      Dave Chinner 提交于
      Joe Lawrence reported a list_add corruption with 4.6-rc1 when
      testing some custom md administration code that made it's own
      block device nodes for the md array. The simple test loop of:
      
      for i in {0..100}; do
      	mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR
      	mdadm --detail --export $tmp/tmp_node > /dev/null
      	rm -f $tmp/tmp_node
      done
      
      
      Would produce this warning in bd_acquire() when mdadm opened the
      device node:
      
      list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8.
      
      And then produce this from bd_forget from kdevtmpfs evicting a block
      dev inode:
      
      list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8
      
      This is a regression caused by commit c19b3b05 ("xfs: mode di_mode
      to vfs inode"). The issue is that xfs_inactive() frees the
      unlinked inode, and the above commit meant that this freeing zeroed
      the mode in the struct inode. The problem is that after evict() has
      called ->evict_inode, it expects the i_mode to be intact so that it
      can call bd_forget() or cd_forget() to drop the reference to the
      block device inode attached to the XFS inode.
      
      In reality, the only thing we do in xfs_fs_evict_inode() that is not
      generic is call xfs_inactive(). We can move the xfs_inactive() call
      to xfs_fs_destroy_inode() without any problems at all, and this
      will leave the VFS inode intact until it is completely done with it.
      
      So, remove xfs_fs_evict_inode(), and do the work it used to do in
      ->destroy_inode instead.
      
      cc: <stable@vger.kernel.org> # 4.6
      Reported-by: NJoe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8179c036
  2. 27 3月, 2016 6 次提交
    • L
      Linux 4.6-rc1 · f55532a0
      Linus Torvalds 提交于
      f55532a0
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · d5a38f6e
      Linus Torvalds 提交于
      Pull Ceph updates from Sage Weil:
       "There is quite a bit here, including some overdue refactoring and
        cleanup on the mon_client and osd_client code from Ilya, scattered
        writeback support for CephFS and a pile of bug fixes from Zheng, and a
        few random cleanups and fixes from others"
      
      [ I already decided not to pull this because of it having been rebased
        recently, but ended up changing my mind after all.  Next time I'll
        really hold people to it.  Oh well.   - Linus ]
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
        libceph: use KMEM_CACHE macro
        ceph: use kmem_cache_zalloc
        rbd: use KMEM_CACHE macro
        ceph: use lookup request to revalidate dentry
        ceph: kill ceph_get_dentry_parent_inode()
        ceph: fix security xattr deadlock
        ceph: don't request vxattrs from MDS
        ceph: fix mounting same fs multiple times
        ceph: remove unnecessary NULL check
        ceph: avoid updating directory inode's i_size accidentally
        ceph: fix race during filling readdir cache
        libceph: use sizeof_footer() more
        ceph: kill ceph_empty_snapc
        ceph: fix a wrong comparison
        ceph: replace CURRENT_TIME by current_fs_time()
        ceph: scattered page writeback
        libceph: add helper that duplicates last extent operation
        libceph: enable large, variable-sized OSD requests
        libceph: osdc->req_mempool should be backed by a slab pool
        libceph: make r_request msg_size calculation clearer
        ...
      d5a38f6e
    • L
      Merge tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux · 698f415c
      Linus Torvalds 提交于
      Pull orangefs filesystem from Mike Marshall.
      
      This finally merges the long-pending orangefs filesystem, which has been
      much cleaned up with input from Al Viro over the last six months.  From
      the documentation file:
      
       "OrangeFS is an LGPL userspace scale-out parallel storage system.  It
        is ideal for large storage problems faced by HPC, BigData, Streaming
        Video, Genomics, Bioinformatics.
      
        Orangefs, originally called PVFS, was first developed in 1993 by Walt
        Ligon and Eric Blumer as a parallel file system for Parallel Virtual
        Machine (PVM) as part of a NASA grant to study the I/O patterns of
        parallel programs.
      
        Orangefs features include:
      
          - Distributes file data among multiple file servers
          - Supports simultaneous access by multiple clients
          - Stores file data and metadata on servers using local file system
            and access methods
          - Userspace implementation is easy to install and maintain
          - Direct MPI support
          - Stateless"
      
      see Documentation/filesystems/orangefs.txt for more in-depth details.
      
      * tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (174 commits)
        orangefs: fix orangefs_superblock locking
        orangefs: fix do_readv_writev() handling of error halfway through
        orangefs: have ->kill_sb() evict the VFS side of things first
        orangefs: sanitize ->llseek()
        orangefs-bufmap.h: trim unused junk
        orangefs: saner calling conventions for getting a slot
        orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer
        orangefs: get rid of readdir_handle_s
        ornagefs: ensure that truncate has an up to date inode size
        orangefs: move code which sets i_link to orangefs_inode_getattr
        orangefs: remove needless wrapper around GFP_KERNEL
        orangefs: remove wrapper around mutex_lock(&inode->i_mutex)
        orangefs: refactor inode type or link_target change detection
        orangefs: use new getattr for revalidate and remove old getattr
        orangefs: use new getattr in inode getattr and permission
        orangefs: use new orangefs_inode_getattr to get size in write and llseek
        orangefs: use new orangefs_inode_getattr to create new inodes
        orangefs: rename orangefs_inode_getattr to orangefs_inode_old_getattr
        orangefs: remove inode->i_lock wrapper
        orangefs: put register_chrdev immediately before register_filesystem
        ...
      698f415c
    • L
      Merge tag 'ntb-4.6' of git://github.com/jonmason/ntb · b4cec5f6
      Linus Torvalds 提交于
      Pull NTB bug fixes from Jon Mason:
       "NTB bug fixes for tasklet from spinning forever, link errors,
        translation window setup, NULL ptr dereference, and ntb-perf errors.
      
        Also, a modification to the driver API that makes _addr functions
        optional"
      
      * tag 'ntb-4.6' of git://github.com/jonmason/ntb:
        NTB: Remove _addr functions from ntb_hw_amd
        NTB: Make _addr functions optional in the API
        NTB: Fix incorrect clean up routine in ntb_perf
        NTB: Fix incorrect return check in ntb_perf
        ntb: fix possible NULL dereference
        ntb: add missing setup of translation window
        ntb: stop link work when we do not have memory
        ntb: stop tasklet from spinning forever during shutdown.
        ntb: perf test: fix address space confusion
      b4cec5f6
    • L
      Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 895a1067
      Linus Torvalds 提交于
      Pull more SCSI updates from James Bottomley:
       "The only new stuff which missed the first pull request is an update to
        the UFS driver.
      
        The rest is an assortment of bug fixes and minor tweaks which appeared
        recently (some are fixes for recent code and some are stuff spotted
        recently by the checkers or the new gcc-6 compiler [most of Arnd's
        stuff])"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (32 commits)
        scsi_common: do not clobber fixed sense information
        scsi: ufs: select CONFIG_NLS
        scsi: fc: use get/put_unaligned64 for wwn access
        fnic: move printk()s outside of the critical code section.
        qla2xxx: avoid maybe_uninitialized warning
        megaraid_sas: add missing curly braces in ioctl handler
        lpfc: fix misleading indentation
        scsi_transport_sas: add 'scsi_target_id' sysfs attribute
        scsi_dh_alua: uninitialized variable in alua_check_vpd()
        scsi: ufs-qcom: add printouts of testbus debug registers
        scsi: ufs-qcom: enable/disable the device ref clock
        scsi: ufs-qcom: set PA_Local_TX_LCC_Enable before link startup
        scsi: ufs: add device quirk delay before putting UFS rails in LPM
        scsi: ufs: fix leakage during link off state
        scsi: ufs: tune UniPro parameters to optimize hibern8 exit time
        scsi: ufs: handle non spec compliant bkops behaviour by device
        scsi: ufs: add retry for query descriptors
        scsi: ufs: add error recovery after DL NAC error
        scsi: ufs: make error handling bit faster
        scsi: ufs: disable vccq if it's not needed by UFS device
        ...
      895a1067
    • L
      f2fs/crypto: fix xts_tweak initialization · 02fc59a0
      Linus Torvalds 提交于
      Commit 0b81d077 ("fs crypto: move per-file encryption from f2fs
      tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and
      renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_"
      to just "FS" for preprocessor symbols).
      
      Because of the symbol renaming, it's a bit hard to see it as a file
      move: use
      
          git show -M30 0b81d077
      
      to lower the rename detection to just 30% similarity and make git show
      the files as renamed (the header file won't be shown as a rename even
      then - since all it contains is symbol definitions, it looks almost
      completely different).
      
      Even with the renames showing as renames, the diffs are not all that
      easy to read, since so much is just the renames.  But Eric Biggers
      noticed that it's not just all renames: the initialization of the
      xts_tweak had been broken too, using the inode number rather than the
      page offset.
      
      That's not right - it makes the xfs_tweak the same for all pages of each
      inode.  It _might_ make sense to make the xfs_tweak contain both the
      offset _and_ the inode number, but not just the inode number.
      Reported-by: NEric Biggers <ebiggers3@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02fc59a0
  3. 26 3月, 2016 26 次提交