1. 20 6月, 2017 6 次提交
  2. 19 6月, 2017 1 次提交
    • H
      mm: larger stack guard gap, between vmas · 1be7107f
      Hugh Dickins 提交于
      Stack guard page is a useful feature to reduce a risk of stack smashing
      into a different mapping. We have been using a single page gap which
      is sufficient to prevent having stack adjacent to a different mapping.
      But this seems to be insufficient in the light of the stack usage in
      userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
      used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
      which is 256kB or stack strings with MAX_ARG_STRLEN.
      
      This will become especially dangerous for suid binaries and the default
      no limit for the stack size limit because those applications can be
      tricked to consume a large portion of the stack and a single glibc call
      could jump over the guard page. These attacks are not theoretical,
      unfortunatelly.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somehow inherent, but it should reduce attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and and the vma tree's subtree_gap support for that.
      Original-patch-by: NOleg Nesterov <oleg@redhat.com>
      Original-patch-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1be7107f
  3. 17 6月, 2017 1 次提交
  4. 16 6月, 2017 1 次提交
  5. 15 6月, 2017 13 次提交
  6. 12 6月, 2017 2 次提交
    • B
      configfs: Introduce config_item_get_unless_zero() · 19e72d3a
      Bart Van Assche 提交于
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      [hch: minor style tweak]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      19e72d3a
    • N
      configfs: Fix race between create_link and configfs_rmdir · ba80aa90
      Nicholas Bellinger 提交于
      This patch closes a long standing race in configfs between
      the creation of a new symlink in create_link(), while the
      symlink target's config_item is being concurrently removed
      via configfs_rmdir().
      
      This can happen because the symlink target's reference
      is obtained by config_item_get() in create_link() before
      the CONFIGFS_USET_DROPPING bit set by configfs_detach_prep()
      during configfs_rmdir() shutdown is actually checked..
      
      This originally manifested itself on ppc64 on v4.8.y under
      heavy load using ibmvscsi target ports with Novalink API:
      
      [ 7877.289863] rpadlpar_io: slot U8247.22L.212A91A-V1-C8 added
      [ 7879.893760] ------------[ cut here ]------------
      [ 7879.893768] WARNING: CPU: 15 PID: 17585 at ./include/linux/kref.h:46 config_item_get+0x7c/0x90 [configfs]
      [ 7879.893811] CPU: 15 PID: 17585 Comm: targetcli Tainted: G           O 4.8.17-customv2.22 #12
      [ 7879.893812] task: c00000018a0d3400 task.stack: c0000001f3b40000
      [ 7879.893813] NIP: d000000002c664ec LR: d000000002c60980 CTR: c000000000b70870
      [ 7879.893814] REGS: c0000001f3b43810 TRAP: 0700   Tainted: G O     (4.8.17-customv2.22)
      [ 7879.893815] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28222242  XER: 00000000
      [ 7879.893820] CFAR: d000000002c664bc SOFTE: 1
                      GPR00: d000000002c60980 c0000001f3b43a90 d000000002c70908 c0000000fbc06820
                      GPR04: c0000001ef1bd900 0000000000000004 0000000000000001 0000000000000000
                      GPR08: 0000000000000000 0000000000000001 d000000002c69560 d000000002c66d80
                      GPR12: c000000000b70870 c00000000e798700 c0000001f3b43ca0 c0000001d4949d40
                      GPR16: c00000014637e1c0 0000000000000000 0000000000000000 c0000000f2392940
                      GPR20: c0000001f3b43b98 0000000000000041 0000000000600000 0000000000000000
                      GPR24: fffffffffffff000 0000000000000000 d000000002c60be0 c0000001f1dac490
                      GPR28: 0000000000000004 0000000000000000 c0000001ef1bd900 c0000000f2392940
      [ 7879.893839] NIP [d000000002c664ec] config_item_get+0x7c/0x90 [configfs]
      [ 7879.893841] LR [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
      [ 7879.893842] Call Trace:
      [ 7879.893844] [c0000001f3b43ac0] [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
      [ 7879.893847] [c0000001f3b43b10] [c000000000329770] do_dentry_open+0x2c0/0x460
      [ 7879.893849] [c0000001f3b43b70] [c000000000344480] path_openat+0x210/0x1490
      [ 7879.893851] [c0000001f3b43c80] [c00000000034708c] do_filp_open+0xfc/0x170
      [ 7879.893853] [c0000001f3b43db0] [c00000000032b5bc] do_sys_open+0x1cc/0x390
      [ 7879.893856] [c0000001f3b43e30] [c000000000009584] system_call+0x38/0xec
      [ 7879.893856] Instruction dump:
      [ 7879.893858] 409d0014 38210030 e8010010 7c0803a6 4e800020 3d220000 e94981e0 892a0000
      [ 7879.893861] 2f890000 409effe0 39200001 992a0000 <0fe00000> 4bffffd0 60000000 60000000
      [ 7879.893866] ---[ end trace 14078f0b3b5ad0aa ]---
      
      To close this race, go ahead and obtain the symlink's target
      config_item reference only after the existing CONFIGFS_USET_DROPPING
      check succeeds.
      
      This way, if configfs_rmdir() wins create_link() will return -ENONET,
      and if create_link() wins configfs_rmdir() will return -EBUSY.
      Reported-by: NBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Tested-by: NBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      ba80aa90
  7. 11 6月, 2017 1 次提交
  8. 10 6月, 2017 10 次提交
  9. 08 6月, 2017 2 次提交
    • B
      xfs: fix spurious spin_is_locked() assert failures on non-smp kernels · 95989c46
      Brian Foster 提交于
      The 0-day kernel test robot reports assertion failures on
      !CONFIG_SMP kernels due to failed spin_is_locked() checks. As it
      turns out, spin_is_locked() is hardcoded to return zero on
      !CONFIG_SMP kernels and so this function cannot be relied on to
      verify spinlock state in this configuration.
      
      To avoid this problem, replace the associated asserts with lockdep
      variants that do the right thing regardless of kernel configuration.
      Drop the one assert that checks for an unlocked lock as there is no
      suitable lockdep variant for that case. This moves the spinlock
      checks from XFS debug code to lockdep, but generally provides the
      same level of protection.
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      95989c46
    • D
      crypto: Work around deallocated stack frame reference gcc bug on sparc. · d41519a6
      David Miller 提交于
      On sparc, if we have an alloca() like situation, as is the case with
      SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
      memory.  The result can be that the value is clobbered if a trap
      or interrupt arrives at just the right instruction.
      
      It only occurs if the function ends returning a value from that
      alloca() area and that value can be placed into the return value
      register using a single instruction.
      
      For example, in lib/libcrc32c.c:crc32c() we end up with a return
      sequence like:
      
              return  %i7+8
               lduw   [%o5+16], %o0   ! MEM[(u32 *)__shash_desc.1_10 + 16B],
      
      %o5 holds the base of the on-stack area allocated for the shash
      descriptor.  But the return released the stack frame and the
      register window.
      
      So if an intererupt arrives between 'return' and 'lduw', then
      the value read at %o5+16 can be corrupted.
      
      Add a data compiler barrier to work around this problem.  This is
      exactly what the gcc fix will end up doing as well, and it absolutely
      should not change the code generated for other cpus (unless gcc
      on them has the same bug :-)
      
      With crucial insight from Eric Sandeen.
      
      Cc: <stable@vger.kernel.org>
      Reported-by: NAnatoly Pugachev <matorola@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      d41519a6
  10. 05 6月, 2017 1 次提交
  11. 04 6月, 2017 1 次提交
  12. 03 6月, 2017 1 次提交
    • R
      dax: fix race between colliding PMD & PTE entries · e2093926
      Ross Zwisler 提交于
      We currently have two related PMD vs PTE races in the DAX code.  These
      can both be easily triggered by having two threads reading and writing
      simultaneously to the same private mapping, with the key being that
      private mapping reads can be handled with PMDs but private mapping
      writes are always handled with PTEs so that we can COW.
      
      Here is the first race:
      
        CPU 0					CPU 1
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
          handle_pte_fault()
            passes check for pmd_devmap()
      
      					(private mapping read)
      					__handle_mm_fault()
      					  create_huge_pmd()
      					    dax_iomap_pmd_fault() inserts PMD
      
            dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
            			  installed in our page tables at this spot.
      
      Here's the second race:
      
        CPU 0					CPU 1
      
        (private mapping read)
        __handle_mm_fault()
          passes check for pmd_none()
          create_huge_pmd()
            dax_iomap_pmd_fault() inserts PMD
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
      					(private mapping read)
      					__handle_mm_fault()
      					  passes check for pmd_none()
      					  create_huge_pmd()
      
          handle_pte_fault()
            dax_iomap_pte_fault() inserts PTE
      					    dax_iomap_pmd_fault() inserts PMD,
      					       but we already have a PTE at
      					       this spot.
      
      The core of the issue is that while there is isolation between faults to
      the same range in the DAX fault handlers via our DAX entry locking,
      there is no isolation between faults in the code in mm/memory.c.  This
      means for instance that this code in __handle_mm_fault() can run:
      
      	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
      		ret = create_huge_pmd(&vmf);
      
      But by the time we actually get to run the fault handler called by
      create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
      fault has installed a normal PMD here as a parent.  This is the cause of
      the 2nd race.  The first race is similar - there is the following check
      in handle_pte_fault():
      
      	} else {
      		/* See comment in pte_alloc_one_map() */
      		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
      			return 0;
      
      So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
      will bail and retry the fault.  This is correct, but there is nothing
      preventing the PMD from being installed after this check but before we
      actually get to the DAX PTE fault handlers.
      
      In my testing these races result in the following types of errors:
      
        BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
        BUG: non-zero nr_ptes on freeing mm: 15
      
      Fix this issue by having the DAX fault handlers verify that it is safe
      to continue their fault after they have taken an entry lock to block
      other racing faults.
      
      [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
        Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NPawel Lebioda <pawel.lebioda@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2093926