1. 14 July 2017, 17 commits
  2. 11 June 2017, 1 commit
  3. 10 June 2017, 10 commits
  4. 05 June 2017, 1 commit
  5. 04 June 2017, 1 commit
  6. 03 June 2017, 1 commit
    • dax: fix race between colliding PMD & PTE entries · e2093926
      Ross Zwisler authored
      We currently have two related PMD vs PTE races in the DAX code.  These
      can both be easily triggered by having two threads reading and writing
      simultaneously to the same private mapping, with the key being that
      private mapping reads can be handled with PMDs but private mapping
      writes are always handled with PTEs so that we can COW.
      
      Here is the first race:
      
        CPU 0					CPU 1
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
          handle_pte_fault()
            passes check for pmd_devmap()
      
      					(private mapping read)
      					__handle_mm_fault()
      					  create_huge_pmd()
      					    dax_iomap_pmd_fault() inserts PMD
      
            dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
            			  installed in our page tables at this spot.
      
      Here's the second race:
      
        CPU 0					CPU 1
      
        (private mapping read)
        __handle_mm_fault()
          passes check for pmd_none()
          create_huge_pmd()
            dax_iomap_pmd_fault() inserts PMD
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
      					(private mapping read)
      					__handle_mm_fault()
      					  passes check for pmd_none()
      					  create_huge_pmd()
      
          handle_pte_fault()
            dax_iomap_pte_fault() inserts PTE
      					    dax_iomap_pmd_fault() inserts PMD,
      					       but we already have a PTE at
      					       this spot.
      
      The core of the issue is that while there is isolation between faults to
      the same range in the DAX fault handlers via our DAX entry locking,
      there is no isolation between faults in the code in mm/memory.c.  This
      means for instance that this code in __handle_mm_fault() can run:
      
      	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
      		ret = create_huge_pmd(&vmf);
      
      But by the time we actually get to run the fault handler called by
      create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
      fault has installed a normal PMD here as a parent.  This is the cause of
      the 2nd race.  The first race is similar - there is the following check
      in handle_pte_fault():
      
      	} else {
      		/* See comment in pte_alloc_one_map() */
      		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
      			return 0;
      
      So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
      will bail and retry the fault.  This is correct, but there is nothing
      preventing the PMD from being installed after this check but before we
      actually get to the DAX PTE fault handlers.
      
      In my testing these races result in the following types of errors:
      
        BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
        BUG: non-zero nr_ptes on freeing mm: 15
      
      Fix this issue by having the DAX fault handlers verify that it is safe
      to continue their fault after they have taken an entry lock to block
      other racing faults.
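      
      A minimal sketch of the shape of that recheck in the PTE fault path (the
      exact guards and labels in the committed patch may differ):
      
      	/*
      	 * Re-check after taking the DAX entry lock: a racing PMD fault may
      	 * already have installed a huge entry at this address, in which
      	 * case this PTE fault should back off and let the fault be retried.
      	 */
      	if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
      		ret = VM_FAULT_NOPAGE;
      		goto unlock_entry;
      	}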
      
      [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
        Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 01 June 2017, 3 commits
    • btrfs: fix race with relocation recovery and fs_root setup · a9b3311e
      Jeff Mahoney authored
      If we have to recover relocation during mount, we'll ultimately have to
      evict the orphan inode.  That goes through the reservation dance, where
      priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
      to be valid.  That's the next thing to be set up during mount, so we
      crash, almost always in flush_space trying to join the transaction,
      but priority_reclaim_metadata_space is possible as well.  This call
      path has been problematic in the past WRT whether ->fs_root is valid
      yet.  Commit 957780eb (Btrfs: introduce ticketed enospc
      infrastructure) added new users that are called in the direct path
      instead of the async path that had already been worked around.
      
      The thing is that we don't actually need the fs_root, specifically, for
      anything.  We either use it to determine whether the root is the
      chunk_root for use in choosing an allocation profile, or as a root to pass
      to btrfs_join_transaction before immediately committing it.  Anything that
      isn't the chunk root works in the former case and any root works in
      the latter.
      
      A simple fix is to use a root we know will always be there: the
      extent_root.
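      
      The shape of the change at those call sites is roughly the following (a
      sketch only; the committed diff may differ in detail, and treating
      btrfs_commit_transaction() as single-argument assumes this kernel era):
      
      	/* any valid root works here; extent_root is set up before recovery */
      	trans = btrfs_join_transaction(fs_info->extent_root);
      	if (IS_ERR(trans))
      		return PTR_ERR(trans);
      	return btrfs_commit_transaction(trans);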
      
      Cc: <stable@vger.kernel.org> # v4.8+
      Fixes: 957780eb (Btrfs: introduce ticketed enospc infrastructure)
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix memory leak in update_space_info failure path · 896533a7
      Jeff Mahoney authored
      If we fail to add the space_info kobject, we'll leak the memory
      for the percpu counter.
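      
      A sketch of the failure path with the leak plugged (identifier names are
      recalled from the btrfs sources and may not match the diff exactly):
      
      	ret = kobject_init_and_add(&space_info->kobj, &space_info_ktype,
      				   info->space_info_kobj, "%s",
      				   alloc_name(space_info->flags));
      	if (ret) {
      		/* undo the percpu_counter_init() done earlier in this path */
      		percpu_counter_destroy(&space_info->total_bytes_pinned);
      		kfree(space_info);
      		return ret;
      	}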
      
      Fixes: 6ab0a202 (btrfs: publish allocation data in sysfs)
      Cc: <stable@vger.kernel.org> # v3.14+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use correct types for page indices in btrfs_page_exists_in_range · cc2b702c
      David Sterba authored
      Variables start_idx and end_idx are supposed to hold a page index
      derived from the file offsets. The int type is not the right one though,
      offsets larger than 1 << 44 will get silently trimmed off the high bits.
      (1 << 44 is 16TiB)
      
      What can go wrong, if start is below the boundary and end gets trimmed:
      - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
      - the final check "if (page->index <= end_idx)" will unexpectedly fail
      
      The function will return false, i.e. "there's no page in the range",
      although there is at least one.
      
      btrfs_page_exists_in_range is used to prevent races in:
      
      * in hole punching, where we make sure there are not pages in the
        truncated range, otherwise we'll wait for them to finish and redo
        truncation, but we're going to replace the pages with holes anyway so
        the only problem is the intermediate state
      
      * lock_extent_direct: we want to make sure there are no pages before we
        lock and start DIO, to prevent stale data reads
      
      For a practical occurrence of the bug, there are several constraints.  The
      file must be quite large, the affected range must cross the 16TiB
      boundary, and the internal state of the file pages and pending operations
      must match.  Also, we must not have started any ordered data in the
      range, otherwise we don't even reach the buggy function check.
      
      DIO locking tries hard in several places to avoid deadlocks with
      buffered IO and avoids waiting for ranges. The worst consequence seems
      to be stale data read.
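      
      The fix itself is essentially a type change for the two index variables,
      along the lines of the sketch below (whether the patch uses pgoff_t or
      unsigned long is an assumption here):
      
      	/* derived from u64 file offsets, so an int would truncate high bits */
      	pgoff_t start_idx = start >> PAGE_SHIFT;
      	pgoff_t end_idx = end >> PAGE_SHIFT;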
      
      CC: Liu Bo <bo.li.liu@oracle.com>
      CC: stable@vger.kernel.org	# 3.16+
      Fixes: fc4adbff ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 31 May 2017, 2 commits
    • xfs: use ->b_state to fix buffer I/O accounting release race · 63db7c81
      Brian Foster authored
      We've had user reports of unmount hangs in xfs_wait_buftarg() that
      analysis shows is due to btp->bt_io_count == -1. bt_io_count
      represents the count of in-flight asynchronous buffers and thus
      should always be >= 0. xfs_wait_buftarg() waits for this value to
      stabilize to zero in order to ensure that all untracked (with
      respect to the lru) buffers have completed I/O processing before
      unmount proceeds to tear down in-core data structures.
      
      The value of -1 implies an I/O accounting decrement race. Indeed,
      the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
      (where the buffer lock is no longer held) means that bp->b_flags can
      be updated from an unsafe context. While a user-level reproducer is
      currently not available, some intrusive hacks to run racing buffer
      lookups/ioacct/releases from multiple threads were used to
      successfully manufacture this problem.
      
      Existing callers do not expect to acquire the buffer lock from
      xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
      this context. It turns out that we already have separate buffer
      state bits and associated serialization for dealing with buffer LRU
      state in the form of ->b_state and ->b_lock. Therefore, replace the
      _XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O
      accounting wrappers appropriately and make sure they are used with
      the correct locking. This ensures that buffer in-flight state can be
      modified at buffer release time without racing with modifications
      from a buffer lock holder.
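      
      A sketch of what the increment side looks like with the in-flight bit
      kept in ->b_state and serialized by ->b_lock (illustrative, not the
      verbatim patch):
      
      	static void
      	xfs_buf_ioacct_inc(
      		struct xfs_buf	*bp)
      	{
      		if (bp->b_flags & XBF_NO_IOACCT)
      			return;
      
      		ASSERT(bp->b_flags & XBF_ASYNC);
      		spin_lock(&bp->b_lock);
      		if (!(bp->b_state & XFS_BSTATE_IN_FLIGHT)) {
      			bp->b_state |= XFS_BSTATE_IN_FLIGHT;
      			percpu_counter_inc(&bp->b_target->bt_io_count);
      		}
      		spin_unlock(&bp->b_lock);
      	}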
      
      Fixes: 9c7504aa ("xfs: track and serialize in-flight async buffers against unmount")
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Libor Pechacek <lpechacek@suse.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • "Yes, people use FOLL_FORCE ;)" · f511c0b1
      Linus Torvalds authored
      This effectively reverts commit 8ee74a91 ("proc: try to remove use
      of FOLL_FORCE entirely")
      
      It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem
      case, and we're talking not just debuggers. Talking to the affected
      people, the use-cases are:
      
      Keno Fischer:
       "We used these semantics as a hardening mechanism in the julia JIT. By
        opening /proc/self/mem and using these semantics, we could avoid
        needing RWX pages, or a dual mapping approach. We do have fallbacks to
        these other methods (though getting EIO here actually causes an assert
        in released versions - we'll updated that to make sure to take the
        fall back in that case).
      
        Nevertheless the /proc/self/mem approach was our favored approach
        because it a) Required an attacker to be able to execute syscalls
        which is a taller order than getting memory write and b) didn't double
        the virtual address space requirements (as a dual mapping approach
        would).
      
        I think in general this feature is very useful for anybody who needs
        to precisely control the execution of some other process. Various
        debuggers (gdb/lldb/rr) certainly fall into that category, but there's
        another class of such processes (wine, various emulators) which may
        want to do that kind of thing.
      
        Now, I suspect most of these will have the other process under ptrace
        control, so maybe allowing (same_mm || ptraced) would be ok, but at
        least for the sandbox/remote-jit use case, it would be perfectly
        reasonable to not have the jit server be a ptracer"
      
      Robert O'Callahan:
       "We write to readonly code and data mappings via /proc/.../mem in lots
        of different situations, particularly when we're adjusting program
        state during replay to match the recorded execution.
      
        Like Julia, we can add workarounds, but they could be expensive."
      
      so not only do people use FOLL_FORCE for both reads and writes, but they
      use it for both the local mm and remote mm.
      
      With these comments in mind, we likely also cannot add the "are we
      actively ptracing" check either, so this keeps the new code organization
      and does not do a real revert that would add back the original comment
      about "Maybe we should limit FOLL_FORCE to actual ptrace users?"
      Reported-by: Keno Fischer <keno@juliacomputing.com>
      Reported-by: Robert O'Callahan <robert@ocallahan.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 30 May 2017, 1 commit
    • ext4: fix fdatasync(2) after extent manipulation operations · 67a7d5f5
      Jan Kara authored
      Currently, extent manipulation operations such as hole punch, range
      zeroing, or extent shifting do not record the fact that file data has
      changed and thus fdatasync(2) has work to do. As a result, if we crash
      e.g. after a punch hole and fdatasync, the user can still possibly see the
      punched-out data after journal replay. Test generic/392 fails due to
      these problems.
      
      Fix the problem by properly marking that file data has changed in these
      operations.
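      
      Concretely, the marking amounts to a call like the following in each of
      the affected code paths, so that a later fdatasync(2) knows it must force
      a journal commit (a sketch of the idea, not the complete diff):
      
      	/* datasync flag set: the change touches file data, not just metadata */
      	ext4_update_inode_fsync_trans(handle, inode, 1);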
      
      CC: stable@vger.kernel.org
      Fixes: a4bb6b64
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  10. 29 May 2017, 3 commits
    • ovl: filter trusted xattr for non-admin · a082c6f6
      Miklos Szeredi authored
      Filesystems filter out extended attributes in the "trusted." domain for
      unprivileged callers.
      
      Overlayfs calls the underlying filesystem's method with elevated
      privileges, so the filtering needs to be done in overlayfs too.
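      
      A sketch of the kind of check needed when listing xattrs through
      overlayfs (illustrative; the committed helper may be structured
      differently):
      
      	/* hide "trusted." attributes from callers without CAP_SYS_ADMIN */
      	if (!strncmp(name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN) &&
      	    !capable(CAP_SYS_ADMIN))
      		continue;	/* skip this entry in the listxattr output */
      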
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: mark upper merge dir with type origin entries "impure" · f3a15685
      Amir Goldstein authored
      An upper dir is marked "impure" to let ovl_iterate() know that this
      directory may contain non-pure upper entries whose d_ino may need to be
      read from the origin inode.
      
      We already mark a non-merge dir "impure" when moving a non-pure child
      entry inside it, to let ovl_iterate() know not to iterate the non-merge
      dir directly.
      
      Mark also a merge dir "impure" when moving a non-pure child entry inside
      it and when copying up a child entry inside it.
      
      This can be used to optimize ovl_iterate() to perform a "pure merge" of
      upper and lower directories, merging the content of the directories,
      without having to read d_ino from origin inodes.
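      
      Marking a directory "impure" boils down to persistently setting an
      overlay xattr on its upper dir, roughly as below (illustrative only;
      overlayfs has its own setxattr helpers for this):
      
      	/* flag the upper dir so ovl_iterate() knows d_ino may need the origin */
      	err = vfs_setxattr(upperdentry, "trusted.overlay.impure", "y", 1, 0);
      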
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ocfs2: Use ERR_CAST() to avoid cross-structure cast · 7585d12f
      Kees Cook authored
      When trying to propagate an error result, the error return path attempts
      to retain the error, but does this with an open cast across very different
      types, which the upcoming structure layout randomization plugin flags as
      being potentially dangerous in the face of randomization. This is a false
      positive, but what this code actually wants to do is use ERR_CAST() to
      retain the error value.
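      
      The pattern being fixed is the usual one for propagating an ERR_PTR
      across pointer types, e.g. (a generic illustration with a hypothetical
      lookup helper, not the exact ocfs2 hunk):
      
      	struct inode *inode = some_lookup(sb);	/* hypothetical helper */
      
      	if (IS_ERR(inode))
      		return ERR_CAST(inode);	/* keeps the errno, no raw cross-type cast */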
      
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>