1. 12 9月, 2013 1 次提交
    • C
      mm: track vma changes with VM_SOFTDIRTY bit · d9104d1c
      Cyrill Gorcunov 提交于
      Pavel reported that in case if vma area get unmapped and then mapped (or
      expanded) in-place, the soft dirty tracker won't be able to recognize this
      situation since it works on pte level and ptes are get zapped on unmap,
      loosing soft dirty bit of course.
      
      So to resolve this situation we need to track actions on vma level, there
      VM_SOFTDIRTY flag comes in.  When new vma area created (or old expanded)
      we set this bit, and keep it here until application calls for clearing
      soft dirty bit.
      
      Thus when user space application track memory changes now it can detect if
      vma area is renewed.
      Reported-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rob Landley <rob@landley.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9104d1c
  2. 16 8月, 2013 1 次提交
    • L
      Fix TLB gather virtual address range invalidation corner cases · 2b047252
      Linus Torvalds 提交于
      Ben Tebulin reported:
      
       "Since v3.7.2 on two independent machines a very specific Git
        repository fails in 9/10 cases on git-fsck due to an SHA1/memory
        failures.  This only occurs on a very specific repository and can be
        reproduced stably on two independent laptops.  Git mailing list ran
        out of ideas and for me this looks like some very exotic kernel issue"
      
      and bisected the failure to the backport of commit 53a59fc6 ("mm:
      limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
      
      That commit itself is not actually buggy, but what it does is to make it
      much more likely to hit the partial TLB invalidation case, since it
      introduces a new case in tlb_next_batch() that previously only ever
      happened when running out of memory.
      
      The real bug is that the TLB gather virtual memory range setup is subtly
      buggered.  It was introduced in commit 597e1c35 ("mm/mmu_gather:
      enable tlb flush range in generic mmu_gather"), and the range handling
      was already fixed at least once in commit e6c495a9 ("mm: fix the TLB
      range flushed when __tlb_remove_page() runs out of slots"), but that fix
      was not complete.
      
      The problem with the TLB gather virtual address range is that it isn't
      set up by the initial tlb_gather_mmu() initialization (which didn't get
      the TLB range information), but it is set up ad-hoc later by the
      functions that actually flush the TLB.  And so any such case that forgot
      to update the TLB range entries would potentially miss TLB invalidates.
      
      Rather than try to figure out exactly which particular ad-hoc range
      setup was missing (I personally suspect it's the hugetlb case in
      zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
      did), this patch just gets rid of the problem at the source: make the
      TLB range information available to tlb_gather_mmu(), and initialize it
      when initializing all the other tlb gather fields.
      
      This makes the patch larger, but conceptually much simpler.  And the end
      result is much more understandable; even if you want to play games with
      partial ranges when invalidating the TLB contents in chunks, now the
      range information is always there, and anybody who doesn't want to
      bother with it won't introduce subtle bugs.
      
      Ben verified that this fixes his problem.
      Reported-bisected-and-tested-by: NBen Tebulin <tebulin@googlemail.com>
      Build-testing-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Build-testing-by: NRichard Weinberger <richard.weinberger@gmail.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b047252
  3. 04 7月, 2013 3 次提交
  4. 29 6月, 2013 1 次提交
  5. 26 6月, 2013 1 次提交
  6. 01 5月, 2013 2 次提交
  7. 30 4月, 2013 2 次提交
  8. 25 4月, 2013 1 次提交
  9. 28 2月, 2013 1 次提交
  10. 26 2月, 2013 1 次提交
  11. 23 2月, 2013 1 次提交
  12. 12 1月, 2013 1 次提交
    • X
      fs/exec.c: work around icc miscompilation · 6d92d4f6
      Xi Wang 提交于
      The tricky problem is this check:
      
      	if (i++ >= max)
      
      icc (mis)optimizes this check as:
      
      	if (++i > max)
      
      The check now becomes a no-op since max is MAX_ARG_STRINGS (0x7FFFFFFF).
      
      This is "allowed" by the C standard, assuming i++ never overflows,
      because signed integer overflow is undefined behavior.  This
      optimization effectively reverts the previous commit 362e6663
      ("exec.c, compat.c: fix count(), compat_count() bounds checking") that
      tries to fix the check.
      
      This patch simply moves ++ after the check.
      Signed-off-by: NXi Wang <xi.wang@gmail.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d92d4f6
  13. 21 12月, 2012 1 次提交
    • K
      exec: do not leave bprm->interp on stack · b66c5984
      Kees Cook 提交于
      If a series of scripts are executed, each triggering module loading via
      unprintable bytes in the script header, kernel stack contents can leak
      into the command line.
      
      Normally execution of binfmt_script and binfmt_misc happens recursively.
      However, when modules are enabled, and unprintable bytes exist in the
      bprm->buf, execution will restart after attempting to load matching
      binfmt modules.  Unfortunately, the logic in binfmt_script and
      binfmt_misc does not expect to get restarted.  They leave bprm->interp
      pointing to their local stack.  This means on restart bprm->interp is
      left pointing into unused stack memory which can then be copied into the
      userspace argv areas.
      
      After additional study, it seems that both recursion and restart remains
      the desirable way to handle exec with scripts, misc, and modules.  As
      such, we need to protect the changes to interp.
      
      This changes the logic to require allocation for any changes to the
      bprm->interp.  To avoid adding a new kmalloc to every exec, the default
      value is left as-is.  Only when passing through binfmt_script or
      binfmt_misc does an allocation take place.
      
      For a proof of concept, see DoTest.sh from:
      
         http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: halfdog <me@halfdog.net>
      Cc: P J P <ppandit@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b66c5984
  14. 20 12月, 2012 1 次提交
  15. 18 12月, 2012 1 次提交
    • K
      exec: use -ELOOP for max recursion depth · d7402698
      Kees Cook 提交于
      To avoid an explosion of request_module calls on a chain of abusive
      scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
      as maximum recursion depth is hit, the error will fail all the way back
      up the chain, aborting immediately.
      
      This also has the side-effect of stopping the user's shell from attempting
      to reexecute the top-level file as a shell script. As seen in the
      dash source:
      
              if (cmd != path_bshell && errno == ENOEXEC) {
                      *argv-- = cmd;
                      *argv = cmd = path_bshell;
                      goto repeat;
              }
      
      The above logic was designed for running scripts automatically that lacked
      the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
      things continue to behave as the shell expects.
      
      Additionally, when tracking recursion, the binfmt handlers should not be
      involved. The recursion being tracked is the depth of calls through
      search_binary_handler(), so that function should be exclusively responsible
      for tracking the depth.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: halfdog <me@halfdog.net>
      Cc: P J P <ppandit@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7402698
  16. 29 11月, 2012 5 次提交
  17. 19 11月, 2012 1 次提交
  18. 26 10月, 2012 1 次提交
  19. 13 10月, 2012 2 次提交
    • J
      vfs: make path_openat take a struct filename pointer · 669abf4e
      Jeff Layton 提交于
      ...and fix up the callers. For do_file_open_root, just declare a
      struct filename on the stack and fill out the .name field. For
      do_filp_open, make it also take a struct filename pointer, and fix up its
      callers to call it appropriately.
      
      For filp_open, add a variant that takes a struct filename pointer and turn
      filp_open into a wrapper around it.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      669abf4e
    • J
      vfs: define struct filename and have getname() return it · 91a27b2a
      Jeff Layton 提交于
      getname() is intended to copy pathname strings from userspace into a
      kernel buffer. The result is just a string in kernel space. It would
      however be quite helpful to be able to attach some ancillary info to
      the string.
      
      For instance, we could attach some audit-related info to reduce the
      amount of audit-related processing needed. When auditing is enabled,
      we could also call getname() on the string more than once and not
      need to recopy it from userspace.
      
      This patchset converts the getname()/putname() interfaces to return
      a struct instead of a string. For now, the struct just tracks the
      string in kernel space and the original userland pointer for it.
      
      Later, we'll add other information to the struct as it becomes
      convenient.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      91a27b2a
  20. 09 10月, 2012 2 次提交
    • M
      mm: avoid taking rmap locks in move_ptes() · 38a76013
      Michel Lespinasse 提交于
      During mremap(), the destination VMA is generally placed after the
      original vma in rmap traversal order: in move_vma(), we always have
      new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >=
      vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one.
      
      When the destination VMA is placed after the original in rmap traversal
      order, we can avoid taking the rmap locks in move_ptes().
      
      Essentially, this reintroduces the optimization that had been disabled in
      "mm anon rmap: remove anon_vma_moveto_tail".  The difference is that we
      don't try to impose the rmap traversal order; instead we just rely on
      things being in the desired order in the common case and fall back to
      taking locks in the uncommon case.  Also we skip the i_mmap_mutex in
      addition to the anon_vma lock: in both cases, the vmas are traversed in
      increasing vm_pgoff order with ties resolved in tree insertion order.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38a76013
    • O
      exec: make de_thread() killable · d5bbd43d
      Oleg Nesterov 提交于
      Change de_thread() to use KILLABLE rather than UNINTERRUPTIBLE while
      waiting for other threads.  The only complication is that we should
      clear ->group_exit_task and ->notify_count before we return, and we
      should do this under tasklist_lock.  -EAGAIN is used to match the
      initial signal_group_exit() check/return, it doesn't really matter.
      
      This fixes the (unlikely) race with coredump.  de_thread() checks
      signal_group_exit() before it starts to kill the subthreads, but this
      can't help if another CLONE_VM (but non CLONE_THREAD) task starts the
      coredumping after de_thread() unlocks ->siglock.  In this case the
      killed sub-thread can block in exit_mm() waiting for coredump_finish(),
      execing thread waits for that sub-thead, and the coredumping thread
      waits for execing thread.  Deadlock.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5bbd43d
  21. 06 10月, 2012 2 次提交
  22. 03 10月, 2012 1 次提交
  23. 01 10月, 2012 2 次提交
    • A
      generic sys_execve() · 38b983b3
      Al Viro 提交于
      Selected by __ARCH_WANT_SYS_EXECVE in unistd.h.  Requires
      	* working current_pt_regs()
      	* *NOT* doing a syscall-in-kernel kind of kernel_execve()
      implementation.  Using generic kernel_execve() is fine.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      38b983b3
    • A
      generic kernel_execve() · 282124d1
      Al Viro 提交于
      based mostly on arm and alpha versions.  Architectures can define
      __ARCH_WANT_KERNEL_EXECVE and use it, provided that
      	* they have working current_pt_regs(), even for kernel threads.
      	* kernel_thread-spawned threads do have space for pt_regs
      in the normal location.  Normally that's as simple as switching to
      generic kernel_thread() and making sure that kernel threads do *not*
      go through return from syscall path; call the payload from equivalent
      of ret_from_fork if we are in a kernel thread (or just have separate
      ret_from_kernel_thread and make copy_thread() use it instead of
      ret_from_fork in kernel thread case).
      	* they have ret_from_kernel_execve(); it is called after
      successful do_execve() done by kernel_execve() and gets normal
      pt_regs location passed to it as argument.  It's essentially
      a longjmp() analog - it should set sp, etc. to the situation
      expected at the return for syscall and go there.  Eventually
      the need for that sucker will disappear, but that'll take some
      surgery on kernel_thread() payloads.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      282124d1
  24. 27 9月, 2012 3 次提交
  25. 20 9月, 2012 1 次提交
  26. 31 7月, 2012 1 次提交