1. 19 Oct, 2009 2 commits
    • A
      HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page · 4779cb31
      Andi Kleen committed
      When returning because of a poisoned page, drop the page count.

      It wasn't a fatal problem because no one cares about the page count
      on a poisoned page (except when it wraps), but it's cleaner to fix it.
      
      Pointed out by Linus.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      4779cb31
    • W
      HWPOISON: return early on non-LRU pages · e43c3afb
      Wu Fengguang committed
      Right now we have some trouble with non-atomic access
      to page flags when locking the page. To plug this hole
      for now, limit error recovery to LRU pages.

      This could be better fixed by defining a suitable protocol,
      but let's go this simple way for now.
      
      This avoids unnecessary races with __set_page_locked() and
      __SetPageSlab*() and maybe more non-atomic page flag operations.
      
      This loses isolated pages which are currently in page reclaim, but these
      are relatively limited compared to the total memory.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      [AK: new description, bug fixes, cleanups]
      e43c3afb
  2. 02 Oct, 2009 4 commits
    • K
      memcg: reduce check for softlimit excess · ef8745c1
      KAMEZAWA Hiroyuki committed
      In the charge/uncharge/reclaim paths, usage_in_excess is calculated
      repeatedly, and each calculation takes the res_counter's spin_lock.

      This patch removes unnecessary calls to res_counter_soft_limit_excess().
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef8745c1
    • K
      memcg: some modification to softlimit under hierarchical memory reclaim. · 4e649152
      KAMEZAWA Hiroyuki committed
      This patch cleans up and fixes memcg's uncharge soft limit path.

      Problems:
        Currently, res_counter_charge()/uncharge() handle softlimit information
        at charge/uncharge, and the softlimit check is done when the per-memcg
        event counter goes over its limit. The per-memcg event counter is
        updated only when memory usage is over the soft limit. With hierarchical
        memcg management, ancestors should be taken care of as well.

        Ancestors (the hierarchy) are handled in charge() but not in uncharge().
        This is not good.

        Problems:
        1. memcg's event counter is incremented only when the softlimit is hit.
           That's bad: it makes the event counter hard to reuse for other purposes.

        2. At uncharge, only the lowest-level res_counter is handled. This is a bug.
           Because the ancestors' event counters are not incremented, children
           should take care of them.

        3. res_counter_uncharge()'s 3rd argument is NULL in most cases.
           Operations under res_counter->lock should be small; having no "if"
           statement there is better.

      Fixes:
        * Removed the soft_limit_xx pointer and checks in charge and uncharge.
          The do-check-only-when-necessary scheme works well enough without them.

        * Make the memcg event counter increment at every charge/uncharge.
          (The per-cpu area will be accessed soon anyway.)

        * All ancestors are checked at the soft-limit check. This is necessary
          because an ancestor's event counter may never be modified, so ancestors
          should be checked at the same time.
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e649152
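The "check all ancestors at soft-limit check" idea from the commit above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct group`, `soft_limit_excess()`, `count_over_soft_limit()` and the demo hierarchy are hypothetical names and values, not the kernel's memcg structures or functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup; not a kernel structure. */
struct group {
    unsigned long usage;       /* current charge */
    unsigned long soft_limit;  /* configured soft limit */
    struct group *parent;      /* NULL for the root of the hierarchy */
};

/* Amount by which usage exceeds the soft limit, or 0 if it does not. */
unsigned long soft_limit_excess(const struct group *g)
{
    return g->usage > g->soft_limit ? g->usage - g->soft_limit : 0;
}

/* Check the group and every ancestor; return how many are in excess. */
int count_over_soft_limit(const struct group *g)
{
    int over = 0;

    for (; g != NULL; g = g->parent)
        if (soft_limit_excess(g) > 0)
            over++;
    return over;
}

/* Example hierarchy: the child is under its limit, the root is over. */
int demo_child_under_root_over(void)
{
    struct group root  = { 100, 50, NULL };
    struct group child = { 10, 20, &root };

    return count_over_soft_limit(&child);
}
```

In this sketch a check that looked only at the lowest-level group would miss the root's excess; walking the parent chain finds it.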
    • K
      memcg: fix refcnt going negative · 26251eaf
      KAMEZAWA Hiroyuki committed
      __mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz"
      with an incremented mz->mem->css refcnt.  The caller of this function
      then has to call css_put(mz->mem->css).

      But mz can be !NULL even if "not found", i.e. without css_get().  Because
      of this, css->refcnt can go negative.

      This may cause various problems; one result is an
      infinite loop in css_tryget(), as shown here:
      
      INFO: RCU detected CPU 0 stall (t=10000 jiffies)
      sending NMI to all CPUs:
      NMI backtrace for cpu 0
      CPU 0:
      <snip>
      
       <<EOE>>  <IRQ>  [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
        [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
        [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
        [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
        [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
        [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
        [<ffffffff81063a26>] update_process_times+0x46/0x70
        [<ffffffff81085930>] tick_sched_timer+0x60/0x160
        [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
        [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
        [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
        [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
        [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
        [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
        <EOI>  [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
        [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
        [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
        [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
        [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
        [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
      
      The above shows CPU0 caught in css_tryget()'s infinite loop because
      of a bad refcnt.

      The fix is to set mz=NULL at the top of the retry path.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26251eaf
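The stale-pointer problem described above can be reproduced in a small userspace simulation. This is a sketch under assumptions, not the kernel code: `struct node`, `struct queue`, `pick()` and `refcnt_after_lookup()` are hypothetical, and the `reset_on_retry` flag exists only to compare the behavior with and without the reset at the top of the retry path.

```c
#include <stddef.h>

/* Hypothetical refcounted object standing in for the css. */
struct node {
    int refcnt;
    int usable;   /* non-zero if a new reference may be taken */
};

/* Hypothetical candidate container; the last item is tried first. */
struct queue {
    struct node **items;
    size_t len;
};

/* Take a reference only if allowed; mimics a tryget that can fail. */
static int node_tryget(struct node *n)
{
    if (!n->usable)
        return 0;
    n->refcnt++;
    return 1;
}

/* Return a candidate with a reference held, or NULL if none is found. */
struct node *pick(struct queue *q, int reset_on_retry)
{
    struct node *mz = NULL;

retry:
    if (reset_on_retry)
        mz = NULL;               /* the fix: forget the failed candidate */
    if (q->len == 0)
        goto done;               /* nothing left to try */
    mz = q->items[--q->len];     /* remove the candidate from the queue */
    if (!node_tryget(mz))
        goto retry;
done:
    return mz;
}

/* Caller pattern: drop the reference whenever pick() returns non-NULL. */
int refcnt_after_lookup(int reset_on_retry)
{
    struct node n = { 0, 0 };    /* refcnt 0, tryget will fail */
    struct node *items[1] = { &n };
    struct queue q = { items, 1 };
    struct node *mz = pick(&q, reset_on_retry);

    if (mz)
        mz->refcnt--;            /* put without a matching get if mz is stale */
    return n.refcnt;
}
```

Without the reset, `pick()` returns the candidate whose tryget failed, and the caller's put drives the count to -1. With the reset, `pick()` returns NULL and the count stays at 0.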
    • H
      mm/rmap.c: fix comment · bf89c8c8
      Huang Shijie committed
      page_address_in_vma() is not used only in unuse_vma().
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf89c8c8
  3. 29 Sep, 2009 5 commits
    • T
      percpu: make allocation failures more verbose · f2badb0c
      Tejun Heo committed
      Warn and dump stack when percpu allocation fails.  The percpu allocator is
      still young, and unchecked NULL percpu pointer usage can result in
      random memory corruption when combined with the pointer shifting in
      access macros.  Allocation failures should be rare, and the warning
      message will be disabled after a certain number of occurrences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f2badb0c
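A warning that disables itself after a fixed number of occurrences, as described in the commit above, might look like the following userspace sketch. The names `struct warn_limiter`, `warn_alloc_failure()` and `count_warnings()` are hypothetical, the limit is a caller-chosen assumption, and `fprintf()` stands in for the kernel's logging and stack dump.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical limiter: warn only while `remaining` is positive. */
struct warn_limiter {
    int remaining;
};

/* Returns 1 if a warning was printed, 0 if warnings are already disabled. */
int warn_alloc_failure(struct warn_limiter *wl, size_t size)
{
    if (wl->remaining <= 0)
        return 0;
    fprintf(stderr, "allocation of %zu bytes failed\n", size);
    if (--wl->remaining == 0)
        fprintf(stderr, "limit reached, disabling warning\n");
    return 1;
}

/* Simulate `failures` failed allocations and count the warnings printed. */
int count_warnings(int limit, int failures)
{
    struct warn_limiter wl = { limit };
    int i, printed = 0;

    for (i = 0; i < failures; i++)
        printed += warn_alloc_failure(&wl, 64);
    return printed;
}
```

Keeping the counter in a struct rather than in a function-local static makes the sketch easy to exercise with different limits.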
    • T
      percpu: make pcpu_setup_first_chunk() failures more verbose · 635b75fc
      Tejun Heo committed
      The parameters to pcpu_setup_first_chunk() come from different sources
      depending on architecture and can be quite complex.  The function runs
      various sanity checks on the parameters and triggers BUG() if
      something isn't right.  However, this happens very early during boot,
      and not reporting exactly what the problem is makes debugging even
      harder.

      Add a PCPU_SETUP_BUG() macro which prints out enough information about
      the parameters.  As the macro still puts a separate BUG() for each
      check, it won't lose any information even in situations where only
      the program counter can be retrieved.

      While at it, also bump the pcpu_dump_alloc_info() message to KERN_INFO so
      that it's visible on the console if boot fails to complete.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      635b75fc
    • T
      percpu: make embedding first chunk allocator check vmalloc space size · 6ea529a2
      Tejun Heo committed
      Embedding first chunk allocator maintains the distances between units
      in the vmalloc area and thus needs vmalloc space to be larger than the
      maximum distances between units; otherwise, it wouldn't be able to
      create any dynamic chunks.  This patch makes the embedding first chunk
      allocator check vmalloc space size and if the maximum distance between
      units is larger than 75% of it, print warning and, if page mapping
      allocator is available, fail initialization so that the system falls
      back onto it.
      
      This should work around percpu allocation failure problems on certain
      sparc64 configurations where distances between NUMA nodes are larger
      than the vmalloc area and makes percpu allocator more robust for
      future configurations.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      6ea529a2
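The 75% rule described in the commit above reduces to one comparison. The following is a minimal sketch, assuming both quantities are available as byte counts; `embed_distance_too_large()` is a hypothetical helper, not the kernel's function, and the division is done before the multiplication only to avoid overflow in this illustration.

```c
/*
 * Return 1 if the maximum distance between units is larger than
 * (approximately) 75% of the vmalloc space, 0 otherwise.
 */
int embed_distance_too_large(unsigned long long max_distance,
                             unsigned long long vmalloc_total)
{
    return max_distance > vmalloc_total / 4 * 3;
}
```

Per the commit message, a caller that gets 1 here would print a warning and, if the page mapping allocator is available, fail initialization so the system falls back to it.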
    • T
      percpu: make pcpu_build_alloc_info() clear static buffers · fb59e72e
      Tejun Heo committed
      pcpu_build_alloc_info() may be called multiple times when percpu is
      falling back to a different first chunk allocator.  Make it clear static
      buffers so that they don't contain values from previous runs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      fb59e72e
    • T
      percpu: fix unit_map[] verification in pcpu_setup_first_chunk() · ffe0d5a5
      Tejun Heo committed
      pcpu_setup_first_chunk() incorrectly used NR_CPUS as the impossible
      unit number, while the unit number can equal or exceed NR_CPUS with a
      sparse unit map.  This triggers BUG_ON() spuriously on machines which
      have a non-power-of-two number of cpus.  Use UINT_MAX instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Tony Vroon <tony@linx.net>
      ffe0d5a5
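Why a sentinel equal to the CPU limit breaks with a sparse unit map can be shown with a small userspace sketch. Everything here is illustrative: `DEMO_NR_CPUS`, `all_cpus_mapped()`, `demo_sparse_map()` and the example map are hypothetical and only mirror the idea in the commit message.

```c
#include <limits.h>
#include <stddef.h>

#define DEMO_NR_CPUS 3u   /* hypothetical build-time CPU limit */

/* Return 1 if no CPU's entry equals the `unassigned` sentinel. */
int all_cpus_mapped(const unsigned int *unit_map, size_t nr_cpus,
                    unsigned int unassigned)
{
    size_t cpu;

    for (cpu = 0; cpu < nr_cpus; cpu++)
        if (unit_map[cpu] == unassigned)
            return 0;
    return 1;
}

/* Sparse map: CPUs use units 0, 1 and 3; unit 2 is unused padding. */
int demo_sparse_map(unsigned int unassigned)
{
    const unsigned int unit_map[DEMO_NR_CPUS] = { 0, 1, 3 };

    return all_cpus_mapped(unit_map, DEMO_NR_CPUS, unassigned);
}
```

With the CPU limit as the sentinel, the valid unit number 3 is mistaken for "unassigned" and the check fails spuriously. With `UINT_MAX` the same map passes.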
  4. 28 Sep, 2009 1 commit
  5. 27 Sep, 2009 1 commit
    • L
      x86: Fix hwpoison code related build failure on 32-bit NUMAQ · d949f36f
      Linus Torvalds committed
      This build failure triggers:
      
       In file included from include/linux/suspend.h:8,
                       from arch/x86/kernel/asm-offsets_32.c:11,
                       from arch/x86/kernel/asm-offsets.c:2:
       include/linux/mm.h:503:2: error: #error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
      
      This happens because the hwpoison page flag made us run out of
      page flags on 32-bit.

      Don't turn on hwpoison on 32-bit NUMA (it's rare in any
      case).
      
      Also clean up the Kconfig dependencies in the generic MM
      code by introducing ARCH_SUPPORTS_MEMORY_FAILURE.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d949f36f
  6. 26 Sep, 2009 4 commits
  7. 25 Sep, 2009 3 commits
    • D
      NOMMU: Ignore mmap() address param as it is a hint · 06aab5a3
      David Howells committed
      Ignore the address parameter given to NOMMU mmap() as it is a hint, rather
      than giving an error if it's non-zero.  MAP_FIXED still gets an error.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06aab5a3
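The behavior change in the commit above can be sketched as two small validation helpers. This is an illustration only: `check_addr_old()` and `check_addr_new()` are hypothetical names, and returning `-EINVAL` is an assumption, since the commit message says only that an error is given.

```c
#include <errno.h>
#include <sys/mman.h>

/* Before the change: any non-zero address was treated as an error. */
int check_addr_old(unsigned long addr, int flags)
{
    if ((flags & MAP_FIXED) || addr != 0)
        return -EINVAL;   /* assumed error code */
    return 0;
}

/* After the change: the address is only a hint and is ignored. */
int check_addr_new(unsigned long addr, int flags)
{
    (void)addr;           /* hint only; deliberately unused */
    if (flags & MAP_FIXED)
        return -EINVAL;   /* MAP_FIXED still gets an error */
    return 0;
}
```

A caller passing a non-zero hint without MAP_FIXED is rejected by the old helper and accepted by the new one.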
    • D
      NOMMU: Fix MAP_PRIVATE mmap() of objects where the data can be mapped directly · 645d83c5
      David Howells committed
      Fix MAP_PRIVATE mmap() of files and devices where the data in the backing store
      might be mapped directly.  Use the BDI_CAP_MAP_DIRECT capability flag to govern
      whether or not we should be trying to map a file directly.  This can be used to
      determine whether or not a region has been filled in at the point where we call
      do_mmap_shared() or do_mmap_private().
      
      The BDI_CAP_MAP_DIRECT capability flag is cleared by validate_mmap_request() if
      there's any reason we can't use it.  It's also cleared in do_mmap_pgoff() if
      f_op->get_unmapped_area() fails.
      
      Without this fix, attempting to run a program from a RomFS image on a
      non-mappable MTD partition results in a BUG as the kernel attempts XIP, and
      this can be caught in gdb:
      
      Program received signal SIGABRT, Aborted.
      0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
      (gdb) bt
      #0  0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
      #1  0xc005f168 in do_mmap_pgoff (file=0xc31a6620, addr=<value optimized out>, len=3808, prot=3, flags=6146, pgoff=0) at mm/nommu.c:1373
      #2  0xc00a96b8 in elf_fdpic_map_file (params=0xc33fbbec, file=0xc31a6620, mm=0xc31bef60, what=0xc0213144 "executable") at mm.h:1145
      #3  0xc00aa8b4 in load_elf_fdpic_binary (bprm=0xc316cb00, regs=<value optimized out>) at fs/binfmt_elf_fdpic.c:343
      #4  0xc006b588 in search_binary_handler (bprm=0x6, regs=0xc33fbce0) at fs/exec.c:1234
      #5  0xc006c648 in do_execve (filename=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460, regs=0xc33fbce0) at fs/exec.c:1356
      #6  0xc0008cf0 in sys_execve (name=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460) at arch/frv/kernel/process.c:263
      #7  0xc00075dc in __syscall_call () at arch/frv/kernel/entry.S:897
      
      Note that this fix implements the following commit differently:
      
      	commit a190887b
      	Author: David Howells <dhowells@redhat.com>
      	Date:   Sat Sep 5 11:17:07 2009 -0700
      	nommu: fix error handling in do_mmap_pgoff()
      Reported-by: Graff Yang <graff.yang@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      645d83c5
    • A
      procfs: disable per-task stack usage on NOMMU · c44972f1
      Andrew Morton committed
      It needs walk_page_range().
      Reported-by: Michal Simek <monstr@monstr.eu>
      Tested-by: Michal Simek <monstr@monstr.eu>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c44972f1
  8. 24 Sep, 2009 14 commits
  9. 23 Sep, 2009 3 commits
    • K
      kcore: register module area in generic way · 81ac3ad9
      KAMEZAWA Hiroyuki committed
      Some archs define MODULES_VADDR/MODULES_END, which is not in the VMALLOC area.
      This is handled only on x86-64.  This patch makes it more generic.  And we
      can use vread/vwrite to access the area.  Fix it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81ac3ad9
    • K
      walk system ram range · 908eedc6
      KAMEZAWA Hiroyuki committed
      Originally, walk_memory_resource() was introduced to traverse all memory
      of "System RAM" for detecting memory hotplug/unplug ranges.  For doing so,
      the flags IORESOURCE_MEM|IORESOURCE_BUSY were used, and this was enough for
      memory hotplug.

      But for another purpose, /proc/kcore, this may include some firmware
      areas marked as IORESOURCE_BUSY | IORESOURCE_MEM.  This patch makes the
      check strict so that it finds only busy "System RAM".

      Note: PPC64 keeps its own walk_memory_resource(), which walks through
      ppc64's lmb information.  Because the old kclist_add() is called per lmb,
      this patch makes no difference in behavior there.

      This patch also removes the CONFIG_MEMORY_HOTPLUG check from this function.
      Because pfn_valid() only shows "there is memmap or not" and cannot be used
      for "there is physical memory or not", this function is generally useful
      for scanning physical memory ranges.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Américo Wang <xiyou.wangcong@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      908eedc6
    • S
      procfs: provide stack information for threads · d899bf7b
      Stefani Seibold committed
      A patch to give a better overview of userland application stack usage,
      especially for embedded Linux.

      Currently you are only able to dump the main process/thread stack usage,
      which is shown in /proc/pid/status by the "VmStk" value.  But you get no
      information about the consumed stack memory of the threads.

      There is an enhancement in /proc/<pid>/{task/*,}/*maps which marks
      the vm mapping where the thread stack pointer resides with "[thread stack
      xxxxxxxx]".  xxxxxxxx is the maximum size of the stack.  This is valuable
      information, because libpthread doesn't set the start of the stack to the
      top of the mapped area, depending on the pthread usage.
      
      A sample output of /proc/<pid>/task/<tid>/maps looks like:
      
      08048000-08049000 r-xp 00000000 03:00 8312       /opt/z
      08049000-0804a000 rw-p 00001000 03:00 8312       /opt/z
      0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
      a7d12000-a7d13000 ---p 00000000 00:00 0
      a7d13000-a7f13000 rw-p 00000000 00:00 0          [thread stack: 001ff4b4]
      a7f13000-a7f14000 ---p 00000000 00:00 0
      a7f14000-a7f36000 rw-p 00000000 00:00 0
      a7f36000-a8069000 r-xp 00000000 03:00 4222       /lib/libc.so.6
      a8069000-a806b000 r--p 00133000 03:00 4222       /lib/libc.so.6
      a806b000-a806c000 rw-p 00135000 03:00 4222       /lib/libc.so.6
      a806c000-a806f000 rw-p 00000000 00:00 0
      a806f000-a8083000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
      a8083000-a8084000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
      a8084000-a8085000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
      a8085000-a8088000 rw-p 00000000 00:00 0
      a8088000-a80a4000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
      a80a4000-a80a5000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
      a80a5000-a80a6000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
      afaf5000-afb0a000 rw-p 00000000 00:00 0          [stack]
      ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
      
      Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
      which will give you the current stack usage in kB.
      
      A sample output of /proc/self/status looks like:
      
      Name:	cat
      State:	R (running)
      Tgid:	507
      Pid:	507
      .
      .
      .
      CapBnd:	fffffffffffffeff
      voluntary_ctxt_switches:	0
      nonvoluntary_ctxt_switches:	0
      Stack usage:	12 kB
      
      I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
      address of the associated thread stack and not the one of the main
      process.  This makes more sense.
      
      [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
      Signed-off-by: Stefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d899bf7b
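A userspace tool could pick the "Stack usage" line out of status text shaped like the sample above. The following is a minimal sketch: `parse_stack_usage_kb()` is a hypothetical helper, it assumes the "Stack usage:\t12 kB" layout shown in the sample, and it returns -1 when the field is absent, as it would be on a kernel without this patch.

```c
#include <stdio.h>
#include <string.h>

/* Return the "Stack usage" value in kB from status text, or -1 if absent. */
long parse_stack_usage_kb(const char *status_text)
{
    const char *key = "Stack usage:";
    const char *p = strstr(status_text, key);
    long kb;

    if (p == NULL)
        return -1;
    /* %ld skips the tab or spaces that follow the colon. */
    if (sscanf(p + strlen(key), "%ld kB", &kb) != 1)
        return -1;
    return kb;
}
```

The helper works on a string so that the text can come from /proc/self/status, from a per-task status file, or from a saved sample.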
  10. 22 Sep, 2009 3 commits