1. 20 Jun 2017: 1 commit
  2. 19 Jun 2017: 1 commit
    • mm: larger stack guard gap, between vmas · 1be7107f
      Authored by Hugh Dickins
      Stack guard page is a useful feature to reduce a risk of stack smashing
      into a different mapping. We have been using a single page gap which
      is sufficient to prevent having stack adjacent to a different mapping.
      But this seems to be insufficient in the light of the stack usage in
      userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
      used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
      which is 256kB or stack strings with MAX_ARG_STRLEN.
      
      This is especially dangerous for suid binaries with the default
      unlimited stack size limit, because those applications can be tricked
      into consuming a large portion of the stack, after which a single glibc
      call could jump over the guard page. These attacks are not theoretical,
      unfortunately.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somehow inherent, but it should reduce attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and the vma tree's subtree_gap support for that.
      Original-patch-by: Oleg Nesterov <oleg@redhat.com>
      Original-patch-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be7107f
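      A hedged sketch of the gap-aware helpers described above, showing where the
      guard gap now lives. This is simplified, illustrative code in the spirit of
      the patch, not the exact mainline hunk; the wrap-around handling in
      particular is an assumption.

          /* Gap kept below a downward-growing stack, in bytes; defaults to 1MB
           * with 4k pages and is tunable via the stack_guard_gap= boot option
           * (which takes the size in page units). */
          unsigned long stack_guard_gap = 256UL << PAGE_SHIFT;

          static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
          {
                  unsigned long vm_start = vma->vm_start;

                  if (vma->vm_flags & VM_GROWSDOWN) {
                          vm_start -= stack_guard_gap;
                          if (vm_start > vma->vm_start)   /* wrapped below zero */
                                  vm_start = 0;
                  }
                  return vm_start;
          }

          static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
          {
                  unsigned long vm_end = vma->vm_end;

                  if (vma->vm_flags & VM_GROWSUP) {
                          vm_end += stack_guard_gap;
                          if (vm_end < vma->vm_end)       /* wrapped past the top */
                                  vm_end = -1UL;
                  }
                  return vm_end;
          }

      Callers that must respect the gap, such as arch_get_unmapped_area() and the
      vma tree's subtree_gap computation, compare against these helpers instead of
      vm_start/vm_end, so the gap is never charged to RLIMIT_AS, RLIMIT_MLOCK or
      strict overcommit accounting.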
  3. 17 Jun 2017: 1 commit
  4. 16 Jun 2017: 1 commit
  5. 15 Jun 2017: 13 commits
  6. 12 Jun 2017: 2 commits
    • configfs: Introduce config_item_get_unless_zero() · 19e72d3a
      Authored by Bart Van Assche
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      [hch: minor style tweak]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      19e72d3a
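      The commit body is not quoted here; as a rough guide, a "get unless zero"
      helper for a kref-counted object usually takes the shape below. Treat the
      field name ci_kref as an assumption rather than a quote of the configfs
      change.

          struct config_item *config_item_get_unless_zero(struct config_item *item)
          {
                  /* kref_get_unless_zero() grabs a reference only while the count
                   * is still non-zero, so an item whose final put is already in
                   * flight can no longer be resurrected by a late getter. */
                  if (item && kref_get_unless_zero(&item->ci_kref))
                          return item;
                  return NULL;
          }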
    • configfs: Fix race between create_link and configfs_rmdir · ba80aa90
      Authored by Nicholas Bellinger
      This patch closes a long standing race in configfs between
      the creation of a new symlink in create_link(), while the
      symlink target's config_item is being concurrently removed
      via configfs_rmdir().
      
      This can happen because the symlink target's reference
      is obtained by config_item_get() in create_link() before
      the CONFIGFS_USET_DROPPING bit set by configfs_detach_prep()
      during configfs_rmdir() shutdown is actually checked.
      
      This originally manifested itself on ppc64 on v4.8.y under
      heavy load using ibmvscsi target ports with Novalink API:
      
      [ 7877.289863] rpadlpar_io: slot U8247.22L.212A91A-V1-C8 added
      [ 7879.893760] ------------[ cut here ]------------
      [ 7879.893768] WARNING: CPU: 15 PID: 17585 at ./include/linux/kref.h:46 config_item_get+0x7c/0x90 [configfs]
      [ 7879.893811] CPU: 15 PID: 17585 Comm: targetcli Tainted: G           O 4.8.17-customv2.22 #12
      [ 7879.893812] task: c00000018a0d3400 task.stack: c0000001f3b40000
      [ 7879.893813] NIP: d000000002c664ec LR: d000000002c60980 CTR: c000000000b70870
      [ 7879.893814] REGS: c0000001f3b43810 TRAP: 0700   Tainted: G O     (4.8.17-customv2.22)
      [ 7879.893815] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28222242  XER: 00000000
      [ 7879.893820] CFAR: d000000002c664bc SOFTE: 1
                      GPR00: d000000002c60980 c0000001f3b43a90 d000000002c70908 c0000000fbc06820
                      GPR04: c0000001ef1bd900 0000000000000004 0000000000000001 0000000000000000
                      GPR08: 0000000000000000 0000000000000001 d000000002c69560 d000000002c66d80
                      GPR12: c000000000b70870 c00000000e798700 c0000001f3b43ca0 c0000001d4949d40
                      GPR16: c00000014637e1c0 0000000000000000 0000000000000000 c0000000f2392940
                      GPR20: c0000001f3b43b98 0000000000000041 0000000000600000 0000000000000000
                      GPR24: fffffffffffff000 0000000000000000 d000000002c60be0 c0000001f1dac490
                      GPR28: 0000000000000004 0000000000000000 c0000001ef1bd900 c0000000f2392940
      [ 7879.893839] NIP [d000000002c664ec] config_item_get+0x7c/0x90 [configfs]
      [ 7879.893841] LR [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
      [ 7879.893842] Call Trace:
      [ 7879.893844] [c0000001f3b43ac0] [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
      [ 7879.893847] [c0000001f3b43b10] [c000000000329770] do_dentry_open+0x2c0/0x460
      [ 7879.893849] [c0000001f3b43b70] [c000000000344480] path_openat+0x210/0x1490
      [ 7879.893851] [c0000001f3b43c80] [c00000000034708c] do_filp_open+0xfc/0x170
      [ 7879.893853] [c0000001f3b43db0] [c00000000032b5bc] do_sys_open+0x1cc/0x390
      [ 7879.893856] [c0000001f3b43e30] [c000000000009584] system_call+0x38/0xec
      [ 7879.893856] Instruction dump:
      [ 7879.893858] 409d0014 38210030 e8010010 7c0803a6 4e800020 3d220000 e94981e0 892a0000
      [ 7879.893861] 2f890000 409effe0 39200001 992a0000 <0fe00000> 4bffffd0 60000000 60000000
      [ 7879.893866] ---[ end trace 14078f0b3b5ad0aa ]---
      
      To close this race, go ahead and obtain the symlink's target
      config_item reference only after the existing CONFIGFS_USET_DROPPING
      check succeeds.
      
      This way, if configfs_rmdir() wins create_link() will return -ENOENT,
      and if create_link() wins configfs_rmdir() will return -EBUSY.
      Reported-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
      Tested-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      ba80aa90
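      A hedged sketch of the reordering the message describes: check the DROPPING
      state first, and only then pin the symlink target. The surrounding locking,
      the s_type field and the exact return values are illustrative assumptions,
      not the literal fs/configfs/symlink.c hunk.

          /* Inside create_link(), under configfs' own dirent locking: */
          if (target_sd->s_type & CONFIGFS_USET_DROPPING)
                  return -ENOENT;         /* configfs_rmdir() already won */

          /* Only now take the reference; a later rmdir will see the link
           * and fail with -EBUSY instead of racing with us. */
          config_item_get(target);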
  7. 11 Jun 2017: 1 commit
  8. 10 Jun 2017: 10 commits
  9. 08 Jun 2017: 2 commits
    • xfs: fix spurious spin_is_locked() assert failures on non-smp kernels · 95989c46
      Authored by Brian Foster
      The 0-day kernel test robot reports assertion failures on
      !CONFIG_SMP kernels due to failed spin_is_locked() checks. As it
      turns out, spin_is_locked() is hardcoded to return zero on
      !CONFIG_SMP kernels and so this function cannot be relied on to
      verify spinlock state in this configuration.
      
      To avoid this problem, replace the associated asserts with lockdep
      variants that do the right thing regardless of kernel configuration.
      Drop the one assert that checks for an unlocked lock as there is no
      suitable lockdep variant for that case. This moves the spinlock
      checks from XFS debug code to lockdep, but generally provides the
      same level of protection.
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      95989c46
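      To make the swap concrete, a before/after sketch; "lock" stands in for
      whichever XFS spinlock each affected assert guarded, and the code is
      illustrative rather than the literal patch hunk.

          /* Before: fires spuriously on !CONFIG_SMP, where spin_is_locked()
           * is hardcoded to return 0 even while the lock is held. */
          ASSERT(spin_is_locked(lock));

          /* After: verified against the current holder when lockdep is enabled,
           * compiled away otherwise, so it is correct in every configuration. */
          lockdep_assert_held(lock);

          /* The single "must be unlocked" assert is dropped outright, since
           * lockdep has no equivalent check for a completely unheld spinlock. */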
    • crypto: Work around deallocated stack frame reference gcc bug on sparc. · d41519a6
      Authored by David Miller
      On sparc, if we have an alloca() like situation, as is the case with
      SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
      memory.  The result can be that the value is clobbered if a trap
      or interrupt arrives at just the right instruction.
      
      It only occurs if the function ends returning a value from that
      alloca() area and that value can be placed into the return value
      register using a single instruction.
      
      For example, in lib/libcrc32c.c:crc32c() we end up with a return
      sequence like:
      
              return  %i7+8
               lduw   [%o5+16], %o0   ! MEM[(u32 *)__shash_desc.1_10 + 16B],
      
      %o5 holds the base of the on-stack area allocated for the shash
      descriptor.  But the return released the stack frame and the
      register window.
      
      So if an interrupt arrives between 'return' and 'lduw', then
      the value read at %o5+16 can be corrupted.
      
      Add a data compiler barrier to work around this problem.  This is
      exactly what the gcc fix will end up doing as well, and it absolutely
      should not change the code generated for other cpus (unless gcc
      on them has the same bug :-)
      
      With crucial insight from Eric Sandeen.
      
      Cc: <stable@vger.kernel.org>
      Reported-by: Anatoly Pugachev <matorola@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      d41519a6
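      A hedged sketch of the workaround pattern, modelled on the lib/libcrc32c.c
      case the message cites; the function body is illustrative rather than a
      quote of the patch. barrier_data() is the compiler barrier that also keeps
      the pointed-to stack memory "live" as far as gcc is concerned.

          u32 crc32c(u32 crc, const void *address, unsigned int length)
          {
                  SHASH_DESC_ON_STACK(shash, tfm);
                  u32 *ctx = (u32 *)shash_desc_ctx(shash);
                  u32 ret;

                  shash->tfm = tfm;
                  *ctx = crc;

                  BUG_ON(crypto_shash_update(shash, address, length));

                  /* Copy the result out of the on-stack descriptor first ... */
                  ret = *ctx;
                  /* ... then pin ctx so the load cannot be scheduled into the
                   * delay slot after the frame and register window are gone. */
                  barrier_data(ctx);
                  return ret;
          }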
  10. 05 Jun 2017: 1 commit
  11. 04 Jun 2017: 1 commit
  12. 03 Jun 2017: 1 commit
    • dax: fix race between colliding PMD & PTE entries · e2093926
      Authored by Ross Zwisler
      We currently have two related PMD vs PTE races in the DAX code.  These
      can both be easily triggered by having two threads reading and writing
      simultaneously to the same private mapping, with the key being that
      private mapping reads can be handled with PMDs but private mapping
      writes are always handled with PTEs so that we can COW.
      
      Here is the first race:
      
        CPU 0					CPU 1
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
          handle_pte_fault()
            passes check for pmd_devmap()
      
      					(private mapping read)
      					__handle_mm_fault()
      					  create_huge_pmd()
      					    dax_iomap_pmd_fault() inserts PMD
      
            dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
            			  installed in our page tables at this spot.
      
      Here's the second race:
      
        CPU 0					CPU 1
      
        (private mapping read)
        __handle_mm_fault()
          passes check for pmd_none()
          create_huge_pmd()
            dax_iomap_pmd_fault() inserts PMD
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
      					(private mapping read)
      					__handle_mm_fault()
      					  passes check for pmd_none()
      					  create_huge_pmd()
      
          handle_pte_fault()
            dax_iomap_pte_fault() inserts PTE
      					    dax_iomap_pmd_fault() inserts PMD,
      					       but we already have a PTE at
      					       this spot.
      
      The core of the issue is that while there is isolation between faults to
      the same range in the DAX fault handlers via our DAX entry locking,
      there is no isolation between faults in the code in mm/memory.c.  This
      means for instance that this code in __handle_mm_fault() can run:
      
      	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
      		ret = create_huge_pmd(&vmf);
      
      But by the time we actually get to run the fault handler called by
      create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
      fault has installed a normal PMD here as a parent.  This is the cause of
      the 2nd race.  The first race is similar - there is the following check
      in handle_pte_fault():
      
      	} else {
      		/* See comment in pte_alloc_one_map() */
      		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
      			return 0;
      
      So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
      will bail and retry the fault.  This is correct, but there is nothing
      preventing the PMD from being installed after this check but before we
      actually get to the DAX PTE fault handlers.
      
      In my testing these races result in the following types of errors:
      
        BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
        BUG: non-zero nr_ptes on freeing mm: 15
      
      Fix this issue by having the DAX fault handlers verify that it is safe
      to continue their fault after they have taken an entry lock to block
      other racing faults.
      
      [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
        Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2093926
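      A hedged sketch of the shape of the fix: once the DAX entry lock is held,
      each fault handler re-checks the page table and backs off if a racing fault
      of the other size already installed something. Simplified, not the literal
      fs/dax.c hunks.

          /* PTE fault path, after the DAX entry for this index is locked: */
          if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
                  /* A racing PMD fault got here first; return and let the
                   * fault be retried against the PMD that now exists. */
                  ret = VM_FAULT_NOPAGE;
                  goto unlock_entry;
          }

          /* PMD fault path, same idea in the other direction: */
          if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) &&
              !pmd_devmap(*vmf->pmd)) {
                  /* A racing PTE fault installed a page table here; back off
                   * and let the fault be retried. */
                  goto unlock_entry;
          }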
  13. 01 Jun 2017: 3 commits
    • btrfs: fix race with relocation recovery and fs_root setup · a9b3311e
      Authored by Jeff Mahoney
      If we have to recover relocation during mount, we'll ultimately have to
      evict the orphan inode.  That goes through the reservation dance, where
      priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
      to be valid.  That's the next thing to be set up during mount, so we
      crash, almost always in flush_space trying to join the transaction
      but priority_reclaim_metadata_space is possible as well.  This call
      path has been problematic in the past with respect to whether ->fs_root is valid
      yet.  Commit 957780eb (Btrfs: introduce ticketed enospc
      infrastructure) added new users that are called in the direct path
      instead of the async path that had already been worked around.
      
      The thing is that we don't actually need the fs_root, specifically, for
      anything.  We either use it to determine whether the root is the
      chunk_root for use in choosing an allocation profile or as a root to pass
      btrfs_join_transaction before immediately committing it.  Anything that
      isn't the chunk root works in the former case and any root works in
      the latter.
      
      A simple fix is to use a root we know will always be there: the
      extent_root.
      
      Cc: <stable@vger.kernel.org> # v4.8+
      Fixes: 957780eb (Btrfs: introduce ticketed enospc infrastructure)
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a9b3311e
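      A hedged sketch of the substitution; the exact call site and error handling
      are illustrative, not the literal extent-tree.c lines.

          /* flush_space() only needs *a* root to hand to btrfs_join_transaction();
           * fs_info->fs_root may not be set up yet during relocation recovery at
           * mount, while the extent root always is. */
          struct btrfs_root *root = fs_info->extent_root;

          trans = btrfs_join_transaction(root);
          if (IS_ERR(trans))
                  return PTR_ERR(trans);
          return btrfs_commit_transaction(trans);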
    • btrfs: fix memory leak in update_space_info failure path · 896533a7
      Authored by Jeff Mahoney
      If we fail to add the space_info kobject, we'll leak the memory
      for the percpu counter.
      
      Fixes: 6ab0a202 (btrfs: publish allocation data in sysfs)
      Cc: <stable@vger.kernel.org> # v3.14+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      896533a7
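      A hedged sketch of the error path being fixed; the ktype, parent kobject and
      name variable are illustrative stand-ins for the btrfs internals.

          ret = percpu_counter_init(&space_info->total_bytes_pinned, 0, GFP_KERNEL);
          if (ret) {
                  kfree(space_info);
                  return ret;
          }

          ret = kobject_init_and_add(&space_info->kobj, &space_info_ktype,
                                     parent_kobj, "%s", space_info_name);
          if (ret) {
                  /* The leak: returning here without destroying the counter
                   * loses its per-cpu allocation for good. */
                  percpu_counter_destroy(&space_info->total_bytes_pinned);
                  kfree(space_info);
                  return ret;
          }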
    • btrfs: use correct types for page indices in btrfs_page_exists_in_range · cc2b702c
      Authored by David Sterba
      Variables start_idx and end_idx are supposed to hold a page index
      derived from the file offsets. The int type is not the right one though,
      offsets larger than 1 << 44 will get silently trimmed off the high bits.
      (1 << 44 is 16TiB)
      
      What can go wrong, if start is below the boundary and end gets trimmed:
      - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
      - the final check "if (page->index <= end_idx)" will unexpectedly fail
      
      The function will return false, ie. "there's no page in the range",
      although there is at least one.
      
      btrfs_page_exists_in_range is used to prevent races in:
      
      * in hole punching, where we make sure there are not pages in the
        truncated range, otherwise we'll wait for them to finish and redo
        truncation, but we're going to replace the pages with holes anyway so
        the only problem is the intermediate state
      
      * lock_extent_direct: we want to make sure there are no pages before we
        lock and start DIO, to prevent stale data reads
      
      For a practical occurrence of the bug, there are several constraints.  The
      file must be quite large, the affected range must cross the 16TiB
      boundary and the internal state of the file pages and pending operations
      must match.  Also, we must not have started any ordered data in the
      range, otherwise we don't even reach the buggy function check.
      
      DIO locking tries hard in several places to avoid deadlocks with
      buffered IO and avoids waiting for ranges. The worst consequence seems
      to be stale data read.
      
      CC: Liu Bo <bo.li.liu@oracle.com>
      CC: stable@vger.kernel.org	# 3.16+
      Fixes: fc4adbff ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cc2b702c
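      A hedged sketch of the type change; the variable names follow the message,
      the rest is illustrative.

          /* Before: a 32-bit int silently drops the high bits of the page index
           * for file offsets beyond 1 << 44 bytes (16TiB). */
          int start_idx = start >> PAGE_SHIFT;
          int end_idx = end >> PAGE_SHIFT;

          /* After: pgoff_t (an unsigned long) is the kernel's type for page-cache
           * indices and keeps the full value. */
          pgoff_t start_idx = start >> PAGE_SHIFT;
          pgoff_t end_idx = end >> PAGE_SHIFT;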
  14. 31 May 2017: 2 commits
    • xfs: use ->b_state to fix buffer I/O accounting release race · 63db7c81
      Authored by Brian Foster
      We've had user reports of unmount hangs in xfs_wait_buftarg() that
      analysis shows is due to btp->bt_io_count == -1. bt_io_count
      represents the count of in-flight asynchronous buffers and thus
      should always be >= 0. xfs_wait_buftarg() waits for this value to
      stabilize to zero in order to ensure that all untracked (with
      respect to the lru) buffers have completed I/O processing before
      unmount proceeds to tear down in-core data structures.
      
      The value of -1 implies an I/O accounting decrement race. Indeed,
      the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
      (where the buffer lock is no longer held) means that bp->b_flags can
      be updated from an unsafe context. While a user-level reproducer is
      currently not available, some intrusive hacks to run racing buffer
      lookups/ioacct/releases from multiple threads were used to
      successfully manufacture this problem.
      
      Existing callers do not expect to acquire the buffer lock from
      xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
      this context. It turns out that we already have separate buffer
      state bits and associated serialization for dealing with buffer LRU
      state in the form of ->b_state and ->b_lock. Therefore, replace the
      _XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O
      accounting wrappers appropriately and make sure they are used with
      the correct locking. This ensures that buffer in-flight state can be
      modified at buffer release time without racing with modifications
      from a buffer lock holder.
      
      Fixes: 9c7504aa ("xfs: track and serialize in-flight async buffers against unmount")
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Libor Pechacek <lpechacek@suse.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      63db7c81
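      A hedged sketch of the new accounting scheme the message describes: the
      in-flight marker becomes a ->b_state bit serialized by ->b_lock, which
      xfs_buf_rele() can safely take. Names follow the description; this is not
      the literal xfs_buf.c code.

          static void xfs_buf_ioacct_inc(struct xfs_buf *bp)
          {
                  spin_lock(&bp->b_lock);
                  if (!(bp->b_state & XFS_BSTATE_IN_FLIGHT)) {
                          bp->b_state |= XFS_BSTATE_IN_FLIGHT;
                          percpu_counter_inc(&bp->b_target->bt_io_count);
                  }
                  spin_unlock(&bp->b_lock);
          }

          static void xfs_buf_ioacct_dec(struct xfs_buf *bp)
          {
                  /* Safe from xfs_buf_rele(), which holds b_lock but not the
                   * buffer lock: the bit and the counter always move together,
                   * so bt_io_count can never be decremented twice per buffer. */
                  spin_lock(&bp->b_lock);
                  if (bp->b_state & XFS_BSTATE_IN_FLIGHT) {
                          bp->b_state &= ~XFS_BSTATE_IN_FLIGHT;
                          percpu_counter_dec(&bp->b_target->bt_io_count);
                  }
                  spin_unlock(&bp->b_lock);
          }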
    • "Yes, people use FOLL_FORCE ;)" · f511c0b1
      Authored by Linus Torvalds
      This effectively reverts commit 8ee74a91 ("proc: try to remove use
      of FOLL_FORCE entirely")
      
      It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem
      case, and not just debuggers.  Talking to the affected people, the
      use-cases are:
      
      Keno Fischer:
       "We used these semantics as a hardening mechanism in the julia JIT. By
        opening /proc/self/mem and using these semantics, we could avoid
        needing RWX pages, or a dual mapping approach. We do have fallbacks to
        these other methods (though getting EIO here actually causes an assert
        in released versions - we'll updated that to make sure to take the
        fall back in that case).
      
        Nevertheless the /proc/self/mem approach was our favored approach
        because it a) required an attacker to be able to execute syscalls
        which is a taller order than getting memory write and b) didn't double
        the virtual address space requirements (as a dual mapping approach
        would).
      
        I think in general this feature is very useful for anybody who needs
        to precisely control the execution of some other process. Various
        debuggers (gdb/lldb/rr) certainly fall into that category, but there's
        another class of such processes (wine, various emulators) which may
        want to do that kind of thing.
      
        Now, I suspect most of these will have the other process under ptrace
        control, so maybe allowing (same_mm || ptraced) would be ok, but at
        least for the sandbox/remote-jit use case, it would be perfectly
        reasonable to not have the jit server be a ptracer"
      
      Robert O'Callahan:
       "We write to readonly code and data mappings via /proc/.../mem in lots
        of different situations, particularly when we're adjusting program
        state during replay to match the recorded execution.
      
        Like Julia, we can add workarounds, but they could be expensive."
      
      so not only do people use FOLL_FORCE for both reads and writes, but they
      use it for both the local mm and remote mm.
      
      With these comments in mind, we likely also cannot add the "are we
      actively ptracing" check either, so this keeps the new code organization
      and does not do a real revert that would add back the original comment
      about "Maybe we should limit FOLL_FORCE to actual ptrace users?"
      Reported-by: Keno Fischer <keno@juliacomputing.com>
      Reported-by: Robert O'Callahan <robert@ocallahan.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f511c0b1