1. 09 September 2016, 2 commits
    • x86/pkeys: Make mprotect_key() mask off additional vm_flags · a8502b67
      Authored by Dave Hansen
      Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
      Three of those bits (READ/WRITE/EXEC) get translated directly into
      vma->vm_flags by calc_vm_prot_bits().  If a bit is unset in
      mprotect()'s 'prot' argument, it must be cleared in vma->vm_flags
      during the mprotect() call.
      
      We do this clearing today by first calculating the VMA flags we
      want set, then clearing the ones we do not want to inherit from
      the original VMA:
      
      	vm_flags = calc_vm_prot_bits(prot, key);
      	...
      	newflags = vm_flags;
      	newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
      
      However, we *also* want to mask off the bits of the original VMA's
      vm_flags in which we store the protection key.
      
      To do that, this patch adds a new macro:
      
      	ARCH_VM_PKEY_FLAGS
      
      which allows the architecture to specify additional bits that it would
      like cleared.  We use that to ensure that the VM_PKEY_BIT* bits get
      cleared.
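      
      For illustration, a sketch of the masking with ARCH_VM_PKEY_FLAGS
      folded in.  The x86 definition below covers the four VM_PKEY_BIT*
      bits; the surrounding lines mirror the snippet above and are a
      sketch, not the exact diff:
      
      	/* x86: the four vm_flags bits that hold the protection key. */
      	#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | \
      				    VM_PKEY_BIT2 | VM_PKEY_BIT3)
      
      	/* Generic fallback: architectures without pkeys clear nothing extra. */
      	#ifndef ARCH_VM_PKEY_FLAGS
      	#define ARCH_VM_PKEY_FLAGS 0
      	#endif
      
      	vm_flags = calc_vm_prot_bits(prot, key);
      	...
      	newflags = vm_flags;
      	/* Clear the old R/W/X bits *and* the old protection-key bits. */
      	newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC |
      				       ARCH_VM_PKEY_FLAGS));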
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: Dave Hansen <dave@sr71.net>
      Cc: arnd@arndb.de
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/20160729163013.E48D6981@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • mm: Implement new pkey_mprotect() system call · 7d06d9c9
      Authored by Dave Hansen
      pkey_mprotect() is just like mprotect(), except it also takes a
      protection key as an argument.  On systems that do not support
      protection keys, it still works, but requires that key=0.
      Otherwise it does exactly what mprotect() does.
      
      I expect it to get used like this, if you want to guarantee that
      any mapping you create can *never* be accessed without the right
      protection keys set up.
      
      	int real_prot = PROT_READ|PROT_WRITE;
      	pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
      	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
      
      This way, there is *no* window where the mapping is accessible
      since it was always either PROT_NONE or had a protection key set
      that denied all access.
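      
      A fuller, compilable version of that sequence might look like the
      sketch below.  It assumes a kernel with these syscalls wired up;
      glibc wrappers did not exist at the time, so raw syscall numbers
      are used (329/330 are the x86_64 numbers that were assigned), and
      PKEY_DISABLE_ACCESS is the uapi flag name corresponding to
      PKEY_DENY_ACCESS above:
      
      	#define _GNU_SOURCE
      	#include <sys/mman.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <stdio.h>
      
      	#ifndef PKEY_DISABLE_ACCESS
      	#define PKEY_DISABLE_ACCESS 0x1	/* uapi value; fallback if headers lack it */
      	#endif
      
      	int main(void)
      	{
      		long page = sysconf(_SC_PAGESIZE);
      
      		/* Allocate a key that denies all access by default. */
      		int pkey = syscall(330 /* __NR_pkey_alloc, x86_64 */, 0,
      				   PKEY_DISABLE_ACCESS);
      		if (pkey < 0) { perror("pkey_alloc"); return 1; }
      
      		/* Map PROT_NONE first: no window of accessibility. */
      		void *ptr = mmap(NULL, page, PROT_NONE,
      				 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
      		if (ptr == MAP_FAILED) { perror("mmap"); return 1; }
      
      		/* Apply the real protections and the key in one step. */
      		if (syscall(329 /* __NR_pkey_mprotect, x86_64 */, ptr, page,
      			    PROT_READ | PROT_WRITE, pkey) < 0) {
      			perror("pkey_mprotect");
      			return 1;
      		}
      		return 0;
      	}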
      
      We settled on 'unsigned long' for the type of the key here.  We
      only need 4 bits on x86 today, but I figured that other
      architectures might need some more space.
      
      Semantically, we have a bit of a problem if we combine this
      syscall with our previously-introduced execute-only support:
      What do we do when we mix execute-only pkey use with
      pkey_mprotect() use?  For instance:
      
      	pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
      	mprotect(ptr, PAGE_SIZE, PROT_EXEC);  // set pkey=X_ONLY_PKEY?
      	mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?
      
      To solve that, we make the plain-mprotect()-initiated execute-only
      support only apply to VMAs that have the default protection key (0)
      set on them.
      
      Proposed semantics:
      1. protection key 0 is special and represents the default,
         "unassigned" protection key.  It is always allocated.
      2. mprotect() never affects a mapping's pkey_mprotect()-assigned
         protection key. A protection key of 0 (even if set explicitly)
         represents an unassigned protection key.
         2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
             key may or may not result in a mapping with execute-only
             properties.  pkey_mprotect() plus pkey_set() on all threads
             should be used to _guarantee_ execute-only semantics if this
             is not a strong enough semantic; a per-thread sketch follows
             this list.
      3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
         kernel will internally attempt to allocate and dedicate a
         protection key for the purpose of execute-only mappings.  This
         may not be possible in cases where there are no free protection
         keys available.  It can also happen, of course, in situations
         where there is no hardware support for protection keys.
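      
      As an illustration of rule 2a's per-thread half, a hedged sketch
      for x86: after pkey_mprotect() assigns the key, each thread also
      revokes its own access rights in PKRU.  The _rdpkru_u32()/_wrpkru()
      intrinsics require GCC with -mpku, and the 2-bits-per-key layout
      (access-disable, then write-disable) follows the x86 pkeys design;
      treat this as an assumption-laden sketch, not kernel API:
      
      	#include <immintrin.h>
      
      	/* Deny this thread all access to 'pkey': set the access-disable
      	 * bit of the key's 2-bit field in the PKRU register. */
      	static inline void thread_deny_access(int pkey)
      	{
      		unsigned int pkru = _rdpkru_u32();	/* requires -mpku */
      
      		pkru |= 1u << (2 * pkey);		/* AD bit for this key */
      		_wrpkru(pkru);
      	}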
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: linux-arch@vger.kernel.org
      Cc: Dave Hansen <dave@sr71.net>
      Cc: arnd@arndb.de
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. 02 September 2016, 3 commits
  3. 27 August 2016, 4 commits
  4. 23 August 2016, 2 commits
    • usercopy: fix overlap check for kernel text · 94cd97af
      Authored by Josh Poimboeuf
      When running with a local patch which moves the '_stext' symbol to the
      very beginning of the kernel text area, I got the following panic with
      CONFIG_HARDENED_USERCOPY:
      
        usercopy: kernel memory exposure attempt detected from ffff88103dfff000 (<linear kernel text>) (4096 bytes)
        ------------[ cut here ]------------
        kernel BUG at mm/usercopy.c:79!
        invalid opcode: 0000 [#1] SMP
        ...
        CPU: 0 PID: 4800 Comm: cp Not tainted 4.8.0-rc3.after+ #1
        Hardware name: Dell Inc. PowerEdge R720/0X3D66, BIOS 2.5.4 01/22/2016
        task: ffff880817444140 task.stack: ffff880816274000
        RIP: 0010:[<ffffffff8121c796>] __check_object_size+0x76/0x413
        RSP: 0018:ffff880816277c40 EFLAGS: 00010246
        RAX: 000000000000006b RBX: ffff88103dfff000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffff88081f80dfa8 RDI: ffff88081f80dfa8
        RBP: ffff880816277c90 R08: 000000000000054c R09: 0000000000000000
        R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000001000
        R13: ffff88103e000000 R14: ffff88103dffffff R15: 0000000000000001
        FS:  00007fb9d1750800(0000) GS:ffff88081f800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000021d2000 CR3: 000000081a08f000 CR4: 00000000001406f0
        Stack:
         ffff880816277cc8 0000000000010000 000000043de07000 0000000000000000
         0000000000001000 ffff880816277e60 0000000000001000 ffff880816277e28
         000000000000c000 0000000000001000 ffff880816277ce8 ffffffff8136c3a6
        Call Trace:
         [<ffffffff8136c3a6>] copy_page_to_iter_iovec+0xa6/0x1c0
         [<ffffffff8136e766>] copy_page_to_iter+0x16/0x90
         [<ffffffff811970e3>] generic_file_read_iter+0x3e3/0x7c0
         [<ffffffffa06a738d>] ? xfs_file_buffered_aio_write+0xad/0x260 [xfs]
         [<ffffffff816e6262>] ? down_read+0x12/0x40
         [<ffffffffa06a61b1>] xfs_file_buffered_aio_read+0x51/0xc0 [xfs]
         [<ffffffffa06a6692>] xfs_file_read_iter+0x62/0xb0 [xfs]
         [<ffffffff812224cf>] __vfs_read+0xdf/0x130
         [<ffffffff81222c9e>] vfs_read+0x8e/0x140
         [<ffffffff81224195>] SyS_read+0x55/0xc0
         [<ffffffff81003a47>] do_syscall_64+0x67/0x160
         [<ffffffff816e8421>] entry_SYSCALL64_slow_path+0x25/0x25
        RIP: 0033:[<00007fb9d0c33c00>] 0x7fb9d0c33c00
        RSP: 002b:00007ffc9c262f28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
        RAX: ffffffffffffffda RBX: fffffffffff8ffff RCX: 00007fb9d0c33c00
        RDX: 0000000000010000 RSI: 00000000021c3000 RDI: 0000000000000004
        RBP: 00000000021c3000 R08: 0000000000000000 R09: 00007ffc9c264d6c
        R10: 00007ffc9c262c50 R11: 0000000000000246 R12: 0000000000010000
        R13: 00007ffc9c2630b0 R14: 0000000000000004 R15: 0000000000010000
        Code: 81 48 0f 44 d0 48 c7 c6 90 4d a3 81 48 c7 c0 bb b3 a2 81 48 0f 44 f0 4d 89 e1 48 89 d9 48 c7 c7 68 16 a3 81 31 c0 e8 f4 57 f7 ff <0f> 0b 48 8d 90 00 40 00 00 48 39 d3 0f 83 22 01 00 00 48 39 c3
        RIP  [<ffffffff8121c796>] __check_object_size+0x76/0x413
         RSP <ffff880816277c40>
      
      The checked object's range [ffff88103dfff000, ffff88103e000000) is
      valid, so there shouldn't have been a BUG.  The hardened usercopy code
      got confused because the range's ending address is the same as the
      kernel text's starting address at 0xffff88103e000000.  Both ranges are
      half-open, so an object that ends exactly where the kernel text begins
      does not actually overlap it, but the overlap check treated this
      boundary case as an overlap.
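      
      For reference, a minimal sketch of the corrected comparison (names
      are illustrative, not the exact mm/usercopy.c diff):
      
      	#include <stdbool.h>
      
      	/* Both ranges are half-open: [check_low, check_high) vs [low, high).
      	 * They are disjoint iff one range ends at or before the other
      	 * begins, so check_high == low must *not* count as an overlap. */
      	static bool overlaps(unsigned long check_low, unsigned long check_high,
      			     unsigned long low, unsigned long high)
      	{
      		return !(check_low >= high || check_high <= low);
      	}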
      
      Fixes: f5509cc1 ("mm: Hardened usercopy")
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • usercopy: avoid potentially undefined behavior in pointer math · 7329a655
      Authored by Eric Biggers
      check_bogus_address() checked for pointer overflow using this expression,
      where 'ptr' has type 'const void *':
      
      	ptr + n < ptr
      
      Since pointer wraparound is undefined behavior, gcc at -O2 by default
      treats it like the following, which would not behave as intended:
      
      	(long)n < 0
      
      Fortunately, this doesn't currently happen for kernel code because kernel
      code is compiled with -fno-strict-overflow.  But the expression should be
      fixed anyway to use well-defined integer arithmetic, since it could be
      treated differently by different compilers in the future or could be
      reported by tools checking for undefined behavior.
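      
      A minimal sketch of the well-defined form, assuming the check is
      done on an unsigned long copy of the pointer (the helper name is
      illustrative):
      
      	#include <stdbool.h>
      
      	/* Unsigned arithmetic wraps modulo 2^N by definition, so this
      	 * comparison is legal C and cannot be optimized away.  The sum
      	 * wrapped iff it ended up smaller than the starting address. */
      	static inline bool object_wraps(const void *ptr, unsigned long n)
      	{
      		unsigned long addr = (unsigned long)ptr;
      
      		return addr + n < addr;
      	}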
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
  5. 12 August 2016, 7 commits
  6. 11 August 2016, 6 commits
  7. 10 August 2016, 1 commit
    • mm: memcontrol: only mark charged pages with PageKmemcg · c4159a75
      Authored by Vladimir Davydov
      To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
      which sets page->_mapcount to -512.  Currently, we set/clear PageKmemcg
      in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
      with __GFP_ACCOUNT, including those that aren't actually charged to any
      cgroup, i.e. allocated from the root cgroup context.  To avoid overhead
      in case cgroups are not used, we only do that if memcg_kmem_enabled() is
      true.  The latter is set iff there are kmem-enabled memory cgroups
      (online or offline).  The root cgroup is not considered kmem-enabled.
      
      As a result, if a page is allocated with __GFP_ACCOUNT for the root
      cgroup when there are kmem-enabled memory cgroups and is freed after all
      kmem-enabled memory cgroups were removed, e.g.
      
        # no memory cgroup has been created yet; create one
        mkdir /sys/fs/cgroup/memory/test
        # run something allocating pages with __GFP_ACCOUNT, e.g.
        # a program using pipe
        dmesg | tail
        # remove the memory cgroup
        rmdir /sys/fs/cgroup/memory/test
      
      we'll get a bad page state bug complaining about page->_mapcount != -1:
      
        BUG: Bad page state in process swapper/0  pfn:1fd945c
        page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
        flags: 0x1000000000000000()
      
      To avoid that, let's mark with PageKmemcg only those pages that are
      actually charged to and hence pin a non-root memory cgroup.
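      
      To make the broken invariant concrete, a small userspace
      illustration (the struct, constants, and helper are stand-ins
      mirroring the text above, not kernel code): PageKmemcg is encoded
      in page->_mapcount as -512, a freed page must be back at -1, and
      the bad-page report prints _mapcount + 1, which is why the log
      above shows mapcount:-511:
      
      	#include <stdio.h>
      
      	#define MAPCOUNT_FREE		(-1)	/* neutral value of a free page */
      	#define PAGE_KMEMCG_VALUE	(-512)	/* per the text above */
      
      	struct fake_page { int _mapcount; };	/* stand-in for struct page */
      
      	int main(void)
      	{
      		/* A page marked PageKmemcg but never uncharged before free: */
      		struct fake_page page = { PAGE_KMEMCG_VALUE };
      
      		if (page._mapcount != MAPCOUNT_FREE)
      			printf("BUG: Bad page state, mapcount:%d\n",
      			       page._mapcount + 1);	/* prints -511 */
      		return 0;
      	}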
      
      Fixes: 4949148a ("mm: charge/uncharge kmemcg from generic page allocator paths")
      Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 08 August 2016, 2 commits
  9. 05 August 2016, 7 commits
  10. 04 August 2016, 1 commit
  11. 03 August 2016, 5 commits