1. 26 7月, 2011 1 次提交
  2. 16 6月, 2011 1 次提交
    • R
      mm: fix negative commitlimit when gigantic hugepages are allocated · b0320c7b
      Rafael Aquini 提交于
      When 1GB hugepages are allocated on a system, free(1) reports less
      available memory than what really is installed in the box.  Also, if the
      total size of hugepages allocated on a system is over half of the total
      memory size, CommitLimit becomes a negative number.
      
      The problem is that gigantic hugepages (order > MAX_ORDER) can only be
      allocated at boot with bootmem, thus its frames are not accounted to
      'totalram_pages'.  However, they are accounted to hugetlb_total_pages()
      
      What happens to turn CommitLimit into a negative number is this
      calculation, in fs/proc/meminfo.c:
      
              allowed = ((totalram_pages - hugetlb_total_pages())
                      * sysctl_overcommit_ratio / 100) + total_swap_pages;
      
      A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.
      
      Also, every vm statistic which depends on 'totalram_pages' will render
      confusing values, as if system were 'missing' some part of its memory.
      
      Impact of this bug:
      
      When gigantic hugepages are allocated and sysctl_overcommit_memory ==
      OVERCOMMIT_NEVER.  In a such situation, __vm_enough_memory() goes through
      the mentioned 'allowed' calculation and might end up mistakenly returning
      -ENOMEM, thus forcing the system to start reclaiming pages earlier than it
      would be ususal, and this could cause detrimental impact to overall
      system's performance, depending on the workload.
      
      Besides the aforementioned scenario, I can only think of this causing
      annoyances with memory reports from /proc/meminfo and free(1).
      
      [akpm@linux-foundation.org: standardize comment layout]
      Reported-by: NRuss Anderson <rja@sgi.com>
      Signed-off-by: NRafael Aquini <aquini@linux.com>
      Acked-by: NRuss Anderson <rja@sgi.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b0320c7b
  3. 06 6月, 2011 1 次提交
  4. 27 5月, 2011 1 次提交
  5. 25 5月, 2011 1 次提交
  6. 10 4月, 2011 1 次提交
  7. 31 3月, 2011 1 次提交
  8. 23 3月, 2011 1 次提交
  9. 14 1月, 2011 5 次提交
  10. 03 12月, 2010 1 次提交
  11. 27 10月, 2010 1 次提交
  12. 08 10月, 2010 9 次提交
  13. 24 9月, 2010 2 次提交
  14. 11 8月, 2010 5 次提交
  15. 10 8月, 2010 1 次提交
  16. 25 5月, 2010 1 次提交
    • M
      cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Miao Xie 提交于
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing all
      old unallowed bits later.  But in the way, the allocator may find that
      there is no node to alloc memory.
      
      The reason is that cpuset rebinds the task's mempolicy, it cleans the
      nodes which the allocater can alloc pages on, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes this problem by expanding the nodes range first(set newly
      allowed bits) and shrink it lazily(clear newly disallowed bits).  So we
      use a variable to tell the write-side task that read-side task is reading
      nodemask, and the write-side task clears newly disallowed nodes after
      read-side task ends the current memory allocation.
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0ff7453
  17. 12 5月, 2010 1 次提交
    • M
      hugetlbfs: kill applications that use MAP_NORESERVE with SIGBUS instead of OOM-killer · 4a6018f7
      Mel Gorman 提交于
      Ordinarily, application using hugetlbfs will create mappings with
      reserves.  For shared mappings, these pages are reserved before mmap()
      returns success and for private mappings, the caller process is guaranteed
      and a child process that cannot get the pages gets killed with sigbus.
      
      An application that uses MAP_NORESERVE gets no reservations and mmap()
      will always succeed at the risk the page will not be available at fault
      time.  This might be used for example on very large sparse mappings where
      the developer is confident the necessary huge pages exist to satisfy all
      faults even though the whole mapping cannot be backed by huge pages.
      Unfortunately, if an allocation does fail, VM_FAULT_OOM is returned to the
      fault handler which proceeds to trigger the OOM-killer.  This is
      unhelpful.
      
      Even without hugetlbfs mounted, a user using mmap() can trivially trigger
      the OOM-killer because VM_FAULT_OOM is returned (will provide example
      program if desired - it's a whopping 24 lines long).  It could be
      considered a DOS available to an unprivileged user.
      
      This patch alters hugetlbfs to kill a process that uses MAP_NORESERVE
      where huge pages were not available with SIGBUS instead of triggering the
      OOM killer.
      
      This change affects hugetlb_cow() as well.  I feel there is a failure case
      in there, but I didn't create one.  It would need a fairly specific target
      in terms of the faulting application and the hugepage pool size.  The
      hugetlb_no_page() path is much easier to hit but both might as well be
      closed.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a6018f7
  18. 25 4月, 2010 1 次提交
  19. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  20. 21 2月, 2010 1 次提交
    • R
      MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself · 4b3073e1
      Russell King 提交于
      On VIVT ARM, when we have multiple shared mappings of the same file
      in the same MM, we need to ensure that we have coherency across all
      copies.  We do this via make_coherent() by making the pages
      uncacheable.
      
      This used to work fine, until we allowed highmem with highpte - we
      now have a page table which is mapped as required, and is not available
      for modification via update_mmu_cache().
      
      Ralf Beache suggested getting rid of the PTE value passed to
      update_mmu_cache():
      
        On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
        to construct a pointer to the pte again.  Passing a pte_t * is much
        more elegant.  Maybe we might even replace the pte argument with the
        pte_t?
      
      Ben Herrenschmidt would also like the pte pointer for PowerPC:
      
        Passing the ptep in there is exactly what I want.  I want that
        -instead- of the PTE value, because I have issue on some ppc cases,
        for I$/D$ coherency, where set_pte_at() may decide to mask out the
        _PAGE_EXEC.
      
      So, pass in the mapped page table pointer into update_mmu_cache(), and
      remove the PTE value, updating all implementations and call sites to
      suit.
      
      Includes a fix from Stephen Rothwell:
      
        sparc: fix fallout from update_mmu_cache API change
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      4b3073e1
  21. 03 2月, 2010 1 次提交
  22. 12 1月, 2010 1 次提交
  23. 16 12月, 2009 1 次提交