1. 11 February 2015, 18 commits
  2. 06 February 2015, 3 commits
  3. 30 January 2015, 2 commits
    • vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUS · 9c145c56
      Authored by Linus Torvalds
      The stack guard page error case has long incorrectly caused a SIGBUS
      rather than a SIGSEGV, but nobody actually noticed until commit
      fee7e49d ("mm: propagate error from stack expansion even for guard
      page") because that error case was never actually triggered in any
      normal situations.
      
      Now that we actually report the error, people noticed the wrong signal
      that resulted.  So far, only the test suite of libsigsegv seems to have
      actually cared, but there are real applications that use libsigsegv, so
      let's not wait for any of those to break.
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
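
      As a rough sketch of the change (hedged: it assumes the guard-page
      check sits in do_anonymous_page() in mm/memory.c, as it did in this
      era, and elides the surrounding function):

          /* Check if we need to add a guard page to the stack */
          if (check_stack_guard_page(vma, address) < 0)
                  return VM_FAULT_SIGSEGV;        /* was VM_FAULT_SIGBUS */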
    • vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Authored by Linus Torvalds
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it is mostly a matter of tying that new
      return value into the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
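
      The per-architecture change follows a single pattern; here is a sketch
      of the x86 flavour in mm_fault_error() (arch/x86/mm/fault.c), with the
      caveat that other architectures name their SIGSEGV helpers differently:

          if (fault & (VM_FAULT_SIGBUS | VM_FAULT_HWPOISON |
                       VM_FAULT_HWPOISON_LARGE))
                  do_sigbus(regs, error_code, address, fault);
          else if (fault & VM_FAULT_SIGSEGV)
                  /* route the new return value into the existing SIGSEGV path */
                  bad_area_nosemaphore(regs, error_code, address);
          else
                  BUG();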
  4. 27 January 2015, 3 commits
  5. 13 January 2015, 1 commit
    • mm: mmu_gather: use tlb->end != 0 only for TLB invalidation · 721c21c1
      Authored by Will Deacon
      When batching up address ranges for TLB invalidation, we check tlb->end
      != 0 to indicate that some pages have actually been unmapped.
      
      As of commit f045bbb9 ("mmu_gather: fix over-eager
      tlb_flush_mmu_free() calling"), we use the same check for freeing these
      pages in order to avoid a performance regression where we call
      free_pages_and_swap_cache even when no pages are actually queued up.
      
      Unfortunately, the range could have been reset (tlb->end = 0) by
      tlb_end_vma, which has been shown to cause memory leaks on arm64.
      Furthermore, investigation into these leaks revealed that the fullmm
      case on task exit no longer invalidates the TLB, by virtue of
      tlb->end == 0 (in 3.18, need_flush would have been set).
      
      This patch resolves the problem by reverting commit f045bbb9, using
      instead tlb->local.nr as the predicate for page freeing in
      tlb_flush_mmu_free and ensuring that tlb->end is initialised to a
      non-zero value in the fullmm case.
      Tested-by: Mark Langsdorf <mlangsdo@redhat.com>
      Tested-by: Dave Hansen <dave@sr71.net>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
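
      A sketch of the repaired predicates, assuming the 3.19-era generic
      mmu_gather code in mm/memory.c (helper names hedged accordingly):

          static void __tlb_reset_range(struct mmu_gather *tlb)
          {
                  if (tlb->fullmm) {
                          /* non-zero end: task exit still flushes the TLB */
                          tlb->start = tlb->end = ~0;
                  } else {
                          tlb->start = TASK_SIZE;
                          tlb->end = 0;
                  }
          }

          static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
          {
                  if (!tlb->end)  /* tlb->end != 0 only gates invalidation */
                          return;
                  tlb_flush(tlb);
                  __tlb_reset_range(tlb);
          }

          static void tlb_flush_mmu_free(struct mmu_gather *tlb)
          {
                  struct mmu_gather_batch *batch;

                  /* batch->nr, not tlb->end, decides whether pages are freed,
                   * so a reset by tlb_end_vma() can no longer leak them */
                  for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
                          free_pages_and_swap_cache(batch->pages, batch->nr);
                          batch->nr = 0;
                  }
                  tlb->active = &tlb->local;
          }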
  6. 12 January 2015, 2 commits
  7. 09 January 2015, 6 commits
    • mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed · 9e5e3661
      Authored by Vlastimil Babka
      Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
      stuck in a busy loop with nothing left to balance, but
      kswapd_try_to_sleep() failing to sleep.  Their analysis found the cause
      to be a combination of several factors:
      
      1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
      
      2. The process has been killed (by OOM in this case), but has not yet been
         scheduled to remove itself from the waitqueue and die.
      
      3. kswapd checks for throttled processes in prepare_kswapd_sleep():
      
              if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
                      wake_up(&pgdat->pfmemalloc_wait);
                      return false; /* kswapd will not go to sleep */
              }
      
         However, for a process that was already killed, wake_up() does not remove
         the process from the waitqueue, since try_to_wake_up() checks its state
         first and returns false when the process is no longer waiting.
      
      4. kswapd is running on the same CPU as the only CPU that the process is
         allowed to run on (through cpus_allowed, or possibly single-cpu system).
      
      5. A CONFIG_PREEMPT_NONE=y kernel is used.  If there's nothing to balance,
         kswapd encounters no voluntary preemption points and repeatedly fails
         prepare_kswapd_sleep(), blocking the process from running and removing
         itself from the waitqueue, which would let kswapd sleep.
      
      So, the source of the problem is that we prevent kswapd from going to
      sleep while there are processes waiting on the pfmemalloc_wait queue,
      and a process waiting on a queue is guaranteed to be removed from the
      queue only when it gets scheduled.  This was done to make sure that no
      process is left sleeping on pfmemalloc_wait when kswapd itself goes to
      sleep.
      
      However, it isn't necessary to postpone kswapd sleep until the
      pfmemalloc_wait queue actually empties.  To prevent processes from being
      left sleeping, it's actually enough to guarantee that all processes
      waiting on pfmemalloc_wait queue have been woken up by the time we put
      kswapd to sleep.
      
      This patch therefore fixes this issue by substituting 'wake_up' with
      'wake_up_all' and removing 'return false' in the code snippet from
      prepare_kswapd_sleep() above.  Note that if any process puts itself in
      the queue after this waitqueue_active() check, or after the wake up
      itself, it means that the process will also wake up kswapd - and since
      we are under prepare_to_wait(), the wake-up won't be missed.  Also, we
      update the comment in prepare_kswapd_sleep() to more clearly describe
      the races it is preventing.
      
      Fixes: 5515061d ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
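
      A sketch of the fixed check in prepare_kswapd_sleep() (mm/vmscan.c),
      assuming the surrounding code matches the snippet quoted above:

          if (waitqueue_active(&pgdat->pfmemalloc_wait))
                  wake_up_all(&pgdat->pfmemalloc_wait);
          /* no "return false" here any more: kswapd may go to sleep once
           * every throttled process has been woken */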
    • memcg: fix destination cgroup leak on task charges migration · 4bdfc1c4
      Authored by Vladimir Davydov
      We are supposed to take one css reference per each memory page and per
      each swap entry accounted to a memory cgroup.  However, during task
      charges migration we take a reference to the destination cgroup twice
      per each swap entry: first in mem_cgroup_do_precharge()->try_charge()
      and then in mem_cgroup_move_swap_account(), permanently leaking the
      destination cgroup.
      
      The hunk taking the second reference seems to be a leftover from the
      pre-00501b53 ("mm: memcontrol: rewrite charge API") era.  Remove it
      to fix the leak.
      
      Fixes: e8ea14cc ("mm: memcontrol: take a css reference for each charged page")
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
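
      A sketch of the fix, assuming the duplicated reference was taken in
      mem_cgroup_move_swap_account() (mm/memcontrol.c) roughly as follows:

          if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
                  mem_cgroup_swap_statistics(from, false);
                  mem_cgroup_swap_statistics(to, true);
                  /* removed: css_get(&to->css);
                   * try_charge() already pinned the destination cgroup
                   * during precharge, so this reference leaked */
                  return 0;
          }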
    • mm: memcontrol: switch soft limit default back to infinity · 24d404dc
      Authored by Johannes Weiner
      Commit 3e32cb2e ("mm: memcontrol: lockless page counters")
      accidentally switched the soft limit default from infinity to zero,
      which turns all memcgs with even a single page into soft limit excessors
      and engages soft limit reclaim on all of them during global memory
      pressure.  This makes global reclaim generally more aggressive, but also
      inverts the meaning of existing soft limit configurations where unset
      soft limits are usually more generous than set ones.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
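
      A sketch of the restored default, assuming it is applied where the
      page counters are initialised in mm/memcontrol.c:

          page_counter_init(&memcg->memory, NULL);
          /* "unset" means infinity, not zero, so a memcg with a single
           * page is not treated as a soft limit excessor */
          memcg->soft_limit = PAGE_COUNTER_MAX;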
    • mm/debug_pagealloc: remove obsolete Kconfig options · 70ecb3cb
      Authored by Joonsoo Kim
      These are obsolete since commit e30825f1 ("mm/debug-pagealloc:
      prepare boottime configurable") was merged.  So remove them.
      
      [pebolle@tiscali.nl: find obsolete Kconfig options]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: protect set_page_dirty() from ongoing truncation · 2d6d7f98
      Authored by Johannes Weiner
      Tejun, while reviewing the code, spotted the following race condition
      between the dirtying and truncation of a page:
      
      __set_page_dirty_nobuffers()       __delete_from_page_cache()
        if (TestSetPageDirty(page))
                                           page->mapping = NULL
                                           if (PageDirty())
                                             dec_zone_page_state(page, NR_FILE_DIRTY);
                                             dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
          if (page->mapping)
            account_page_dirtied(page)
              __inc_zone_page_state(page, NR_FILE_DIRTY);
              __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
      
      which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
      
      Dirtiers usually lock out truncation, either by holding the page lock
      directly, or in case of zap_pte_range(), by pinning the mapcount with
      the page table lock held.  The notable exception to this rule, though,
      is do_wp_page(), for which this race exists.  However, do_wp_page()
      already waits for a locked page to unlock before setting the dirty bit,
      in order to prevent a race where clear_page_dirty() misses the page bit
      in the presence of dirty ptes.  Upgrade that wait to a fully locked
      set_page_dirty() to also cover the situation explained above.
      
      Afterwards, the code in set_page_dirty() dealing with a truncation race
      is no longer needed.  Remove it.
      Reported-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
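
      A sketch of the upgraded wait in the do_wp_page() dirty path
      (mm/memory.c); the declarations around it are elided:

          lock_page(dirty_page);
          dirtied = set_page_dirty(dirty_page);
          /* read ->mapping under the page lock, before truncation can
           * clear it */
          mapping = dirty_page->mapping;
          unlock_page(dirty_page);

          if (dirtied && mapping)
                  balance_dirty_pages_ratelimited(mapping);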
    • mm: prevent endless growth of anon_vma hierarchy · 7a3ef208
      Authored by Konstantin Khlebnikov
      A constantly forking task causes unlimited growth of the anon_vma
      chain.  Each new child allocates a new level of anon_vmas and links its
      vma to all previous levels, because pages might be inherited from any
      level.

      This patch adds a heuristic which decides to reuse an existing anon_vma
      instead of forking a new one.  It adds a counter, anon_vma->degree,
      which counts linked vmas and directly descending anon_vmas, and reuses
      an anon_vma if that counter is lower than two.  As a result, each
      anon_vma has either a vma or at least two descending anon_vmas.  In
      such trees, half of the nodes are leaves with live vmas, so the number
      of anon_vmas is at most twice the number of vmas.

      The heuristic reuses anon_vmas as rarely as possible, because each
      reuse adds false aliasing among vmas, and the rmap walker then has to
      scan more ptes when it searches for where a page might be mapped.
      
      Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
      Fixes: 5beb4930 ("mm: change anon_vma linking to fix multi-process server scalability issue")
      [akpm@linux-foundation.org: fix typo, per Rik]
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
      Tested-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Jerome Marchand <jmarchan@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[2.6.34+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
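
      A sketch of the reuse heuristic in anon_vma_clone() (mm/rmap.c),
      assuming the degree accounting matches the description above:

          /* Reuse an existing anon_vma if its degree is lower than two,
           * i.e. it has no vma and only one anon_vma child.  The root
           * anon_vma is never reused: it has a self-parent reference and
           * at least one child. */
          if (!dst->anon_vma && anon_vma != src->anon_vma &&
              anon_vma->degree < 2)
                  dst->anon_vma = anon_vma;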
  8. 07 January 2015, 2 commits
    • mm: propagate error from stack expansion even for guard page · fee7e49d
      Authored by Linus Torvalds
      Jay Foad reports that the address sanitizer test (asan) sometimes gets
      confused by a stack pointer that ends up being outside the stack vma
      that is reported by /proc/maps.
      
      This happens due to an interaction between RLIMIT_STACK and the guard
      page: when we do the guard page check, we ignore the potential error
      from the stack expansion, which effectively results in a missing guard
      page, since the expected stack expansion won't have been done.
      
      And since /proc/maps explicitly ignores the guard page (commit
      d7824370: "mm: fix up some user-visible effects of the stack guard
      page"), the stack pointer ends up being outside the reported stack area.
      
      This is the minimal patch: it just propagates the error.  It also
      effectively makes the guard page part of the stack limit, which in turn
      means that the actual real stack is one page less than the stack limit.
      
      Let's see if anybody notices.  We could teach acct_stack_growth() to
      allow an extra page for a grow-up/grow-down stack in the rlimit test,
      but I don't want to add more complexity if it isn't needed.
      Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
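
      A simplified sketch of the propagation in check_stack_guard_page()
      (mm/memory.c); the abutting-vma checks and the grows-up case are
      elided:

          static int check_stack_guard_page(struct vm_area_struct *vma,
                                            unsigned long address)
          {
                  address &= PAGE_MASK;
                  if ((vma->vm_flags & VM_GROWSDOWN) && address == vma->vm_start)
                          /* was a bare call whose result was dropped */
                          return expand_downwards(vma, address - PAGE_SIZE);
                  return 0;
          }

          /* caller, in do_anonymous_page(): */
          if (check_stack_guard_page(vma, address) < 0)
                  return VM_FAULT_SIGBUS; /* SIGSEGV as of 9c145c56 above */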
    • rcu: Make SRCU optional by using CONFIG_SRCU · 83fe27ea
      Authored by Pranith Kumar
      SRCU does not need to be compiled by default in all cases.  For
      tinification efforts, not compiling SRCU unless necessary is desirable.
      
      This patch makes compiling SRCU optional by introducing a new Kconfig
      option, CONFIG_SRCU, which is selected when any of the components
      making use of SRCU are selected.
      
      If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
      
         text    data     bss     dec     hex filename
         2007       0       0    2007     7d7 kernel/rcu/srcu.o
      
      Size of arch/powerpc/boot/zImage changes from
      
         text    data     bss     dec     hex filename
       831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
       829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after
      
      so the savings are about ~2000 bytes.
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
  9. 30 December 2014, 1 commit
    • mm: get rid of radix tree gfp mask for pagecache_get_page · 45f87de5
      Authored by Michal Hocko
      Commit 2457aec6 ("mm: non-atomically mark page accessed during page
      cache allocation where possible") has added a separate parameter for
      specifying gfp mask for radix tree allocations.
      
      Not only is this less than optimal from the API point of view, because
      it is error prone, it is also currently buggy: grab_cache_page_write_begin
      uses GFP_KERNEL for the radix tree, and if fgp_flags doesn't contain
      FGP_NOFS (mostly controlled by the filesystem via the AOP_FLAG_NOFS
      flag) while mapping_gfp_mask has __GFP_FS cleared, the radix tree
      allocation won't obey the restriction and might recurse into the
      filesystem and cause deadlocks.  This is unfortunately the case for
      most filesystems, because only ext4 and gfs2 use AOP_FLAG_NOFS.
      
      Let's simply remove the radix_gfp_mask parameter, because the
      allocation context is the same for both the page cache and the radix
      tree.  Just make sure that the radix tree gets only the sane subset of
      the mask (e.g. do not pass __GFP_WRITE).
      
      Long term, it would be preferable to convert the remaining users of
      AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
      interface even further.
      Reported-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
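
      A sketch of the simplified allocation in pagecache_get_page()
      (mm/filemap.c), assuming GFP_RECLAIM_MASK is the "sane subset" used
      for the tree:

          page = __page_cache_alloc(gfp_mask);
          if (!page)
                  return NULL;

          /* one mask drives both allocations; the radix tree only sees the
           * reclaim-relevant bits (no __GFP_WRITE and friends) */
          err = add_to_page_cache_lru(page, mapping, offset,
                                      gfp_mask & GFP_RECLAIM_MASK);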
  10. 23 December 2014, 1 commit
  11. 19 December 2014, 1 commit
    • mm/zsmalloc: adjust order of functions · 66cdef66
      Authored by Ganesh Mahendran
      Currently the functions in zsmalloc.c are not arranged in a readable
      and reasonable sequence.  As more and more functions are added, we run
      into inconveniences such as the following.  For example:
      
      Current functions:
      
          void zs_init()
          {
          }
      
          static void get_maxobj_per_zspage()
          {
          }
      
      Now suppose I want to add a func_1() which is called from zs_init(),
      and this newly added func_1() uses get_maxobj_per_zspage(), which is
      defined below zs_init().
      
          void func_1()
          {
              get_maxobj_per_zspage()
          }
      
          void zs_init()
          {
              func_1()
          }
      
          static void get_maxobj_per_zspage()
          {
          }
      
      This will cause a compile error, so we must add a forward declaration:

          static void get_maxobj_per_zspage();

      before func_1() if we do not move get_maxobj_per_zspage() above
      func_1().
      
      In addition, putting the module_[init|exit] functions at the bottom of
      the file follows the usual convention.
      
      So, this patch adjusts the function order as follows:
      
          /* helper functions */
          ...
          obj_location_to_handle()
          ...
      
          /* Some exported functions */
          ...
      
          zs_map_object()
          zs_unmap_object()
      
          zs_malloc()
          zs_free()
      
          zs_init()
          zs_exit()
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>