1. 05 Jun, 2017 (7 commits)
  2. 04 Jun, 2017 (1 commit)
  3. 03 Jun, 2017 (1 commit)
    • dax: fix race between colliding PMD & PTE entries · e2093926
      Authored by Ross Zwisler
      We currently have two related PMD vs PTE races in the DAX code.  These
      can both be easily triggered by having two threads reading and writing
      simultaneously to the same private mapping, with the key being that
      private mapping reads can be handled with PMDs but private mapping
      writes are always handled with PTEs so that we can COW.
      
      Here is the first race:
      
        CPU 0					CPU 1
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
          handle_pte_fault()
            passes check for pmd_devmap()
      
      					(private mapping read)
      					__handle_mm_fault()
      					  create_huge_pmd()
      					    dax_iomap_pmd_fault() inserts PMD
      
            dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
            			  installed in our page tables at this spot.
      
      Here's the second race:
      
        CPU 0					CPU 1
      
        (private mapping read)
        __handle_mm_fault()
          passes check for pmd_none()
          create_huge_pmd()
            dax_iomap_pmd_fault() inserts PMD
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
      					(private mapping read)
      					__handle_mm_fault()
      					  passes check for pmd_none()
      					  create_huge_pmd()
      
          handle_pte_fault()
            dax_iomap_pte_fault() inserts PTE
      					    dax_iomap_pmd_fault() inserts PMD,
      					       but we already have a PTE at
      					       this spot.
      
      The core of the issue is that while there is isolation between faults to
      the same range in the DAX fault handlers via our DAX entry locking,
      there is no isolation between faults in the code in mm/memory.c.  This
      means for instance that this code in __handle_mm_fault() can run:
      
      	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
      		ret = create_huge_pmd(&vmf);
      
      But by the time we actually get to run the fault handler called by
      create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
      fault has installed a normal PMD here as a parent.  This is the cause of
      the 2nd race.  The first race is similar - there is the following check
      in handle_pte_fault():
      
      	} else {
      		/* See comment in pte_alloc_one_map() */
      		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
      			return 0;
      
      So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
      will bail and retry the fault.  This is correct, but there is nothing
      preventing the PMD from being installed after this check but before we
      actually get to the DAX PTE fault handlers.
      
      In my testing these races result in the following types of errors:
      
        BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
        BUG: non-zero nr_ptes on freeing mm: 15
      
      Fix this issue by having the DAX fault handlers verify that it is safe
      to continue their fault after they have taken an entry lock to block
      other racing faults.
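      A minimal sketch of that verification, loosely following the post-fix
      PTE fault path (simplified, not the verbatim upstream diff; the label
      and variable names here are illustrative):

      	/*
      	 * After taking the DAX entry lock, a racing PMD fault may
      	 * already have installed a huge entry covering this address.
      	 * If so, back out and let the fault be retried.
      	 */
      	if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
      		ret = VM_FAULT_NOPAGE;	/* retry the fault from scratch */
      		goto unlock_entry;	/* drops the DAX entry lock */
      	}

      The PMD path does the mirror-image check (bail out if a PTE page table
      has raced in underneath it) under the same entry lock.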
      
      [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
        Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2093926
  4. 31 May, 2017 (2 commits)
    • xfs: use ->b_state to fix buffer I/O accounting release race · 63db7c81
      Authored by Brian Foster
      We've had user reports of unmount hangs in xfs_wait_buftarg() that
      analysis shows is due to btp->bt_io_count == -1. bt_io_count
      represents the count of in-flight asynchronous buffers and thus
      should always be >= 0. xfs_wait_buftarg() waits for this value to
      stabilize to zero in order to ensure that all untracked (with
      respect to the lru) buffers have completed I/O processing before
      unmount proceeds to tear down in-core data structures.
      
      The value of -1 implies an I/O accounting decrement race. Indeed,
      the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
      (where the buffer lock is no longer held) means that bp->b_flags can
      be updated from an unsafe context. While a user-level reproducer is
      currently not available, some intrusive hacks to run racing buffer
      lookups/ioacct/releases from multiple threads were used to
      successfully manufacture this problem.
      
      Existing callers do not expect to acquire the buffer lock from
      xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
      this context. It turns out that we already have separate buffer
      state bits and associated serialization for dealing with buffer LRU
      state in the form of ->b_state and ->b_lock. Therefore, replace the
      _XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O
      accounting wrappers appropriately and make sure they are used with
      the correct locking. This ensures that buffer in-flight state can be
      modified at buffer release time without racing with modifications
      from a buffer lock holder.
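      A rough sketch of the reworked increment helper (simplified from the
      post-fix fs/xfs/xfs_buf.c; the decrement side is symmetric):

      	static inline void xfs_buf_ioacct_inc(struct xfs_buf *bp)
      	{
      		if (bp->b_flags & XBF_NO_IOACCT)
      			return;
      		ASSERT(bp->b_flags & XBF_ASYNC);
      		spin_lock(&bp->b_lock);		/* serializes ->b_state */
      		if (!(bp->b_state & XFS_BSTATE_IN_FLIGHT)) {
      			bp->b_state |= XFS_BSTATE_IN_FLIGHT;
      			percpu_counter_inc(&bp->b_target->bt_io_count);
      		}
      		spin_unlock(&bp->b_lock);
      	}

      Because the in-flight bit lives in ->b_state and is only touched under
      ->b_lock, xfs_buf_rele() can clear it without holding the buffer lock.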
      
      Fixes: 9c7504aa ("xfs: track and serialize in-flight async buffers against unmount")
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Libor Pechacek <lpechacek@suse.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      63db7c81
    • "Yes, people use FOLL_FORCE ;)" · f511c0b1
      Authored by Linus Torvalds
      This effectively reverts commit 8ee74a91 ("proc: try to remove use
      of FOLL_FORCE entirely")
      
      It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem
      case, and we're talking not just debuggers.  Talking to the affected
      people, the use-cases are:
      
      Keno Fischer:
       "We used these semantics as a hardening mechanism in the julia JIT. By
        opening /proc/self/mem and using these semantics, we could avoid
        needing RWX pages, or a dual mapping approach. We do have fallbacks to
        these other methods (though getting EIO here actually causes an assert
        in released versions - we'll update that to make sure to take the
        fallback in that case).
      
        Nevertheless the /proc/self/mem approach was our favored approach
        because it a) required an attacker to be able to execute syscalls,
        which is a taller order than getting memory write, and b) didn't
        double the virtual address space requirements (as a dual-mapping
        approach would).
      
        I think in general this feature is very useful for anybody who needs
        to precisely control the execution of some other process. Various
        debuggers (gdb/lldb/rr) certainly fall into that category, but there's
        another class of such processes (wine, various emulators) which may
        want to do that kind of thing.
      
        Now, I suspect most of these will have the other process under ptrace
        control, so maybe allowing (same_mm || ptraced) would be ok, but at
        least for the sandbox/remote-jit use case, it would be perfectly
        reasonable to not have the jit server be a ptracer"
      
      Robert O'Callahan:
       "We write to readonly code and data mappings via /proc/.../mem in lots
        of different situations, particularly when we're adjusting program
        state during replay to match the recorded execution.
      
        Like Julia, we can add workarounds, but they could be expensive."
      
      so not only do people use FOLL_FORCE for both reads and writes, but they
      use it for both the local mm and remote mm.
      
      With these comments in mind, we likely also cannot add the "are we
      actively ptracing" check either, so this keeps the new code organization
      and does not do a real revert that would add back the original comment
      about "Maybe we should limit FOLL_FORCE to actual ptrace users?"
      Reported-by: Keno Fischer <keno@juliacomputing.com>
      Reported-by: Robert O'Callahan <robert@ocallahan.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f511c0b1
  5. 29 May, 2017 (5 commits)
    • ovl: filter trusted xattr for non-admin · a082c6f6
      Authored by Miklos Szeredi
      Filesystems filter out extended attributes in the "trusted." domain for
      unprivileged callers.
      
      Overlayfs calls the underlying filesystem's method with elevated privs,
      so it needs to do the filtering in overlayfs too.
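      A sketch of the filter, simplified from the overlayfs listxattr path
      (the overlay-private xattr handling is omitted here):

      	static bool ovl_can_list(const char *s)
      	{
      		/* List all non-"trusted." xattrs */
      		if (strncmp(s, XATTR_TRUSTED_PREFIX,
      			    XATTR_TRUSTED_PREFIX_LEN) != 0)
      			return true;

      		/* "trusted." xattrs are for the superuser only */
      		return capable(CAP_SYS_ADMIN);
      	}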
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      a082c6f6
    • ovl: mark upper merge dir with type origin entries "impure" · f3a15685
      Authored by Amir Goldstein
      An upper dir is marked "impure" to let ovl_iterate() know that this
      directory may contain non-pure upper entries whose d_ino may need to be
      read from the origin inode.
      
      We already mark a non-merge dir "impure" when moving a non-pure child
      entry inside it, to let ovl_iterate() know not to iterate the non-merge
      dir directly.
      
      Also mark a merge dir "impure" when moving a non-pure child entry inside
      it, and when copying up a child entry inside it.
      
      This can be used to optimize ovl_iterate() to perform a "pure merge" of
      upper and lower directories, merging the content of the directories,
      without having to read d_ino from origin inodes.
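      Concretely, "impure" is just an overlay xattr on the upper directory; an
      illustrative sketch (the cached-flag checks and wrappers are omitted):

      	/* Mark an upper dir so ovl_iterate() knows some of its entries
      	 * may need d_ino taken from their origin inodes. */
      	err = ovl_do_setxattr(upperdentry, "trusted.overlay.impure",
      			      "y", 1, 0);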
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      f3a15685
    • ocfs2: Use ERR_CAST() to avoid cross-structure cast · 7585d12f
      Authored by Kees Cook
      When trying to propagate an error result, the error return path attempts
      to retain the error, but does this with an open cast across very different
      types, which the upcoming structure layout randomization plugin flags as
      being potentially dangerous in the face of randomization. This is a false
      positive, but what this code actually wants to do is use ERR_CAST() to
      retain the error value.
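      The pattern, in a generic before/after form (hypothetical struct names;
      the same change applies to the ntfs and NFS commits below):

      	struct foo *f = lookup_foo();	/* may be an ERR_PTR() value */

      	if (IS_ERR(f))
      		return (struct bar *)f;	/* before: open cross-type cast */

      	if (IS_ERR(f))
      		return ERR_CAST(f);	/* after: intent is explicit */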
      
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      7585d12f
    • ntfs: Use ERR_CAST() to avoid cross-structure cast · fee2aa75
      Authored by Kees Cook
      When trying to propagate an error result, the error return path attempts
      to retain the error, but does this with an open cast across very different
      types, which the upcoming structure layout randomization plugin flags as
      being potentially dangerous in the face of randomization. This is a false
      positive, but what this code actually wants to do is use ERR_CAST() to
      retain the error value.
      
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      fee2aa75
    • NFS: Use ERR_CAST() to avoid cross-structure cast · fe3b81b4
      Authored by Kees Cook
      When the call to nfs_devname() fails, the error path attempts to retain
      the error via the mnt variable, but this requires a cast across very
      different types (char * to struct vfsmount *), which the upcoming
      structure layout randomization plugin flags as being potentially
      dangerous in the face of randomization. This is a false positive, but
      what this code actually wants to do is retain the error value, so this
      patch explicitly sets it, instead of using what seems to be an
      unexpected cast.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      fe3b81b4
  6. 26 May, 2017 (5 commits)
  7. 25 May, 2017 (1 commit)
  8. 24 May, 2017 (7 commits)
    • NFSv4.0: Fix a lock leak in nfs40_walk_client_list · b49c15f9
      Authored by Trond Myklebust
      Xiaolong Ye's kernel test robot detected the following Oops:
      [  299.158991] BUG: scheduling while atomic: mount.nfs/9387/0x00000002
      [  299.169587] 2 locks held by mount.nfs/9387:
      [  299.176165]  #0:  (nfs_clid_init_mutex){......}, at: [<ffffffff8130cc92>] nfs4_discover_server_trunking+0x47/0x1fc
      [  299.201802]  #1:  (&(&nn->nfs_client_lock)->rlock){......}, at: [<ffffffff813125fa>] nfs40_walk_client_list+0x2e9/0x338
      [  299.221979] CPU: 0 PID: 9387 Comm: mount.nfs Not tainted 4.11.0-rc7-00021-g14d1bbb0 #45
      [  299.235584] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
      [  299.251176] Call Trace:
      [  299.255192]  dump_stack+0x61/0x7e
      [  299.260416]  __schedule_bug+0x65/0x74
      [  299.266208]  __schedule+0x5d/0x87c
      [  299.271883]  schedule+0x89/0x9a
      [  299.276937]  schedule_timeout+0x232/0x289
      [  299.283223]  ? detach_if_pending+0x10b/0x10b
      [  299.289935]  schedule_timeout_uninterruptible+0x2a/0x2c
      [  299.298266]  ? put_rpccred+0x3e/0x115
      [  299.304327]  ? schedule_timeout_uninterruptible+0x2a/0x2c
      [  299.312851]  msleep+0x1e/0x22
      [  299.317612]  nfs4_discover_server_trunking+0x102/0x1fc
      [  299.325644]  nfs4_init_client+0x13f/0x194
      
      It looks as if we recently added a spin_lock() leak to
      nfs40_walk_client_list() when cleaning up the code.
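      The shape of the fix, as a simplified sketch of the list walk: every
      path that sleeps or restarts must drop nn->nfs_client_lock first.

      	spin_lock(&nn->nfs_client_lock);
      	list_for_each_entry(pos, &nn->nfs_client_list, cl_share_link) {
      		if (pos->cl_cons_state > NFS_CS_READY) {
      			atomic_inc(&pos->cl_count);
      			/* the unlock was missing on one such path */
      			spin_unlock(&nn->nfs_client_lock);
      			status = nfs_wait_client_init_complete(pos);
      			nfs_put_client(pos);
      			spin_lock(&nn->nfs_client_lock);
      			continue;
      		}
      		/* match logic omitted; returns with the lock dropped */
      	}
      	spin_unlock(&nn->nfs_client_lock);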
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Fixes: 14d1bbb0 ("NFS: Create a common nfs4_match_client() function")
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      b49c15f9
    • pnfs: Fix the check for requests in range of layout segment · 08cb5b0f
      Authored by Benjamin Coddington
      It's possible and acceptable for NFS to attempt to add requests beyond the
      range of the current pgio->pg_lseg, a case which should be caught and
      limited by the pg_test operation.  However, the current handling of this
      case replaces pgio->pg_lseg with a new layout segment (after a WARN) within
      that pg_test operation.  That will cause all the previously added requests
      to be submitted with this new layout segment, which may not be valid for
      those requests.
      
      Fix this problem by only returning zero for the number of bytes to coalesce
      from pg_test for this case which allows any previously added requests to
      complete on the current layout segment.  The check for requests starting
      out of range of the layout segment moves to pg_init, so that the
      replacement of pgio->pg_lseg will be done when the next request is added.
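      A sketch of the pg_test side of the change (simplified; the helper name
      here is hypothetical):

      	/* Requests past the end of the current layout segment no longer
      	 * swap in a new lseg here; returning 0 bytes to coalesce forces
      	 * the queued requests to complete against the lseg they were
      	 * added under. */
      	if (pgio->pg_lseg && !req_in_lseg_range(pgio->pg_lseg, req))
      		return 0;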
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      08cb5b0f
    • pNFS/flexfiles: missing error code in ff_layout_alloc_lseg() · 662f9a10
      Authored by Dan Carpenter
      If xdr_inline_decode() fails then we end up returning ERR_PTR(0).  The
      caller treats NULL returns as -ENOMEM so it doesn't really hurt runtime,
      but obviously we intended to set an error code here.
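      The bug in sketch form (simplified): the cleanup label returns
      ERR_PTR(rc), so a bare goto while rc is still 0 yields ERR_PTR(0),
      i.e. NULL.  The fix sets the error code before jumping:

      	p = xdr_inline_decode(&stream, 4);
      	if (!p) {
      		rc = -EIO;	/* previously a bare goto, leaving rc == 0 */
      		goto out_err_free;
      	}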
      
      Fixes: d67ae825 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      662f9a10
    • NFS: fix COMMIT after COPY · 6d3b5d8d
      Authored by Olga Kornievskaia
      Fix a typo in commit e0926934
      ("NFS append COMMIT after synchronous COPY").
      Reported-by: Eryu Guan <eguan@redhat.com>
      Fixes: e0926934 ("NFS append COMMIT after synchronous COPY")
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Tested-by: Eryu Guan <eguan@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      6d3b5d8d
    • reiserfs: Make flush bios explicitly sync · d8747d64
      Authored by Jan Kara
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
      definitions.  generic_make_request_checks() however strips REQ_FUA and
      REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
      write cache, and thus the write effectively becomes asynchronous, which
      can lead to performance regressions.
      
      Fix the problem by making sure all bios which are synchronous are
      properly marked with REQ_SYNC.
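      The change is essentially one line per flush bio, sketched here (the
      gfs2 commit below makes the same change):

      	/* REQ_PREFLUSH/REQ_FUA may be stripped by the block layer on
      	 * write-through storage, so mark the bio synchronous explicitly. */
      	bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA;
      	submit_bio(bio);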
      
      Fixes: b685d3d6
      CC: reiserfs-devel@vger.kernel.org
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      d8747d64
    • gfs2: Make flush bios explicitly sync · 0f0b9b63
      Authored by Jan Kara
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
      definitions.  generic_make_request_checks() however strips REQ_FUA and
      REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
      write cache, and thus the write effectively becomes asynchronous, which
      can lead to performance regressions.
      
      Fix the problem by making sure all bios which are synchronous are
      properly marked with REQ_SYNC.
      
      Fixes: b685d3d6
      CC: Steven Whitehouse <swhiteho@redhat.com>
      CC: cluster-devel@redhat.com
      CC: stable@vger.kernel.org
      Acked-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      0f0b9b63
    • nfsd4: fix null dereference on replay · 9a307403
      Authored by J. Bruce Fields
      If we receive a compound such that:
      
      	- the sessionid, slot, and sequence number in the SEQUENCE op
      	  match a cached successful reply with N ops, and
      	- the Nth operation of the compound is a PUTFH, PUTPUBFH,
      	  PUTROOTFH, or RESTOREFH,
      
      then nfsd4_sequence will return 0 and set cstate->status to
      nfserr_replay_cache.  The current filehandle will not be set.  This will
      cause us to call check_nfsd_access with first argument NULL.
      
      To nfsd4_compound it looks like we just successfully executed an
      operation that set a filehandle, but the current filehandle is not set.
      
      Fix this by moving the nfserr_replay_cache check earlier.  There was
      never any reason to have it after the encode_op label, since the only
      case where we hit that is when opdesc->op_func sets it.
      
      Note that there are two ways we could hit this case:
      
      	- a client is resending a previously sent compound that ended
      	  with one of the four PUTFH-like operations, or
      	- a client is sending a *new* compound that (incorrectly) shares
      	  sessionid, slot, and sequence number with a previously sent
      	  compound, and the length of the previously sent compound
      	  happens to match the position of a PUTFH-like operation in the
      	  new compound.
      
      The second is obviously incorrect client behavior.  The first is also
      very strange--the only purpose of a PUTFH-like operation is to set the
      current filehandle to be used by the following operation, so there's no
      point in having it as the last in a compound.
      
      So it's likely this requires a buggy or malicious client to reproduce.
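      A sketch of the reordering in the compound processing loop (simplified):

      	/* Check for a cached replay right after SEQUENCE, before any
      	 * further op (such as a PUTFH-like op) is dispatched or its
      	 * result encoded. */
      	if (cstate->status == nfserr_replay_cache) {
      		status = op->status;
      		goto out;	/* reply already encoded from the cache */
      	}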
      Reported-by: Scott Mayhew <smayhew@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      9a307403
  9. 19 May, 2017 (4 commits)
  10. 18 May, 2017 (3 commits)
  11. 17 May, 2017 (4 commits)