  1. 04 Jul 2009 (6 commits)
    • percpu: simplify pcpu_setup_first_chunk() · 38a6be52
      Committed by Tejun Heo
      Now that all first chunk allocator helpers allocate and map the first
      chunk themselves, there's no need to have optional default alloc/map
      in pcpu_setup_first_chunk().  Drop @populate_pte_fn and only leave
      @dyn_size optional and make all other params mandatory.
      
      This makes it much easier to follow what pcpu_setup_first_chunk() is
      doing and what actual differences tweaking each parameter results in.
      
      [ Impact: drop unused code path ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • x86,percpu: generalize lpage first chunk allocator · 8c4bfc6e
      Committed by Tejun Heo
      Generalize and move x86 setup_pcpu_lpage() into
      pcpu_lpage_first_chunk().  setup_pcpu_lpage() now is a simple wrapper
      around the generalized version.  Other than taking size parameters and
      using arch supplied callbacks to allocate/free/map memory,
      pcpu_lpage_first_chunk() is identical to the original implementation.
      
      This simplifies arch code and will help convert more archs to the
      dynamic percpu allocator.
      
      While at it, factor out pcpu_calc_fc_sizes() which is common to
      pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().
      
      [ Impact: code reorganization and generalization ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • percpu: make 4k first chunk allocator map memory · 8f05a6a6
      Committed by Tejun Heo
      At first, the percpu first chunk was always set up page-by-page by the
      generic code.  To add other allocators, different parts of the generic
      initialization were made optional.  Now we have three allocators -
      embed, remap and 4k.  embed and remap fully handle allocation and
      mapping of the first chunk, while 4k still depends on generic code for
      those.  This makes the generic alloc/map paths specific to 4k and
      makes the code unnecessarily complicated with optional generic
      behaviors.
      
      This patch makes the 4k allocator allocate and map memory directly
      instead of depending on the generic code.  The only outside-visible
      change is that the dynamic area in the first chunk is now allocated
      up-front instead of on-demand.  This doesn't make any meaningful
      difference as the area is minimal (usually less than a page, just
      enough to fill the alignment) with the 4k allocator.  Plus, the dynamic
      area in the first chunk usually gets fully used anyway.
      
      This will allow simplification of pcpu_setup_first_chunk() and removal
      of chunk->page array.
      
      [ Impact: no outside visible change other than up-front allocation of dyn area ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
    • x86,percpu: generalize 4k first chunk allocator · d4b95f80
      Committed by Tejun Heo
      Generalize and move x86 setup_pcpu_4k() into pcpu_4k_first_chunk().
      setup_pcpu_4k() now is a simple wrapper around the generalized
      version.  Other than taking size parameters and using arch supplied
      callbacks to allocate/free memory, pcpu_4k_first_chunk() is identical
      to the original implementation.
      
      This simplifies arch code and will help convert more archs to the
      dynamic percpu allocator.
      
      While at it, s/pcpu_populate_pte_fn_t/pcpu_fc_populate_pte_fn_t/ for
      consistency.
      
      [ Impact: code reorganization and generalization ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
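
      As described above, the generalized helper takes arch-supplied
      callbacks.  A sketch of what those callback types look like: the
      populate_pte type is named in the commit message, while the alloc/free
      prototypes are reconstructed from the description and may not match
      the tree exactly:

      #include <linux/types.h>

      /* arch-supplied hooks for building the first chunk page by page */
      typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size);
      typedef void   (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
      typedef void   (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);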
    • percpu: drop @unit_size from embed first chunk allocator · 788e5abc
      Committed by Tejun Heo
      The only extra feature @unit_size provides is making dead space at the
      end of the first chunk, which doesn't have any valid use case.  Drop the
      parameter.  This will increase consistency with the generalized 4k
      allocator.
      
      James Bottomley spotted missing conversion for the default
      setup_per_cpu_areas() which caused build breakage on all archs which
      use it.
      
      [ Impact: drop unused code path ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Ingo Molnar <mingo@elte.hu>
    • x86: make pcpu_chunk_addr_search() matching stricter · 79ba6ac8
      Committed by Tejun Heo
      The @addr passed into pcpu_chunk_addr_search() is a unit0-based address
      and thus should be matched inside the unit0 area.  Currently, it uses
      the chunk size when determining whether the address falls in the first
      chunk.  Addresses in unitN where N>0 shouldn't be passed in anyway, so
      this doesn't cause any malfunction, but fix it for consistency.
      
      [ Impact: mostly cleanup ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
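
      A minimal sketch of the stricter check described above, with names
      chosen for illustration rather than copied from mm/percpu.c: a
      unit0-relative address belongs to the first chunk only if it falls
      within a single unit, not anywhere within the whole chunk.

      #include <linux/types.h>

      /* illustrative helper: match against the unit size, not the chunk size */
      static bool addr_in_first_chunk(unsigned long addr, unsigned long unit0_start,
                                      size_t unit_size)
      {
              return addr >= unit0_start && addr < unit0_start + unit_size;
      }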
  2. 02 Jul 2009 (1 commit)
    • kmemleak: Fix scheduling-while-atomic bug · 57d81f6f
      Committed by Ingo Molnar
      One of the kmemleak changes caused the following
      scheduling-while-holding-the-tasklist-lock regression on x86:
      
      BUG: sleeping function called from invalid context at mm/kmemleak.c:795
      in_atomic(): 1, irqs_disabled(): 0, pid: 1737, name: kmemleak
      2 locks held by kmemleak/1737:
       #0:  (scan_mutex){......}, at: [<c10c4376>] kmemleak_scan_thread+0x45/0x86
       #1:  (tasklist_lock){......}, at: [<c10c3bb4>] kmemleak_scan+0x1a9/0x39c
      Pid: 1737, comm: kmemleak Not tainted 2.6.31-rc1-tip #59266
      Call Trace:
       [<c105ac0f>] ? __debug_show_held_locks+0x1e/0x20
       [<c102e490>] __might_sleep+0x10a/0x111
       [<c10c38d5>] scan_yield+0x17/0x3b
       [<c10c3970>] scan_block+0x39/0xd4
       [<c10c3bc6>] kmemleak_scan+0x1bb/0x39c
       [<c10c4331>] ? kmemleak_scan_thread+0x0/0x86
       [<c10c437b>] kmemleak_scan_thread+0x4a/0x86
       [<c104d73e>] kthread+0x6e/0x73
       [<c104d6d0>] ? kthread+0x0/0x73
       [<c100959f>] kernel_thread_helper+0x7/0x10
      kmemleak: 834 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      
      The bit causing it is highly dubious:
      
      static void scan_yield(void)
      {
              might_sleep();
      
              if (time_is_before_eq_jiffies(next_scan_yield)) {
                      schedule();
                      next_scan_yield = jiffies + jiffies_scan_yield;
              }
      }
      
      It is called deep inside the codepath and in a conditional way,
      and that is what crapped up when one of the new scan_block()
      uses grew a tasklist_lock dependency.
      
      This minimal patch removes that yielding stuff and adds the
      proper cond_resched().
      
      The background scanning thread could probably also be reniced
      to +10.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
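
      A minimal sketch of the shape of the fix (the helper name here is
      illustrative; this is not the actual patch): the unconditional
      cond_resched() happens at a point in the scan loop where no spinlocks
      such as tasklist_lock are held, instead of the conditional schedule()
      buried inside scan_block():

      #include <linux/sched.h>

      static void scan_pass_example(void)
      {
              /* ... scan one block with the required locks held, then drop them ... */

              /* yield only here, with every lock released */
              cond_resched();
      }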
  3. 01 Jul 2009 (3 commits)
  4. 30 Jun 2009 (2 commits)
  5. 27 Jun 2009 (4 commits)
  6. 26 Jun 2009 (2 commits)
  7. 25 Jun 2009 (4 commits)
  8. 24 Jun 2009 (9 commits)
    • switch shmem to inode->i_acl · 06b16e9f
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • percpu: clean up percpu variable definitions · 245b2e70
      Committed by Tejun Heo
      Percpu variable definition is about to be updated such that all percpu
      symbols including the static ones must be unique.  Update percpu
      variable definitions accordingly.
      
      * as,cfq: rename ioc_count uniquely
      
      * cpufreq: rename cpu_dbs_info uniquely
      
      * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
      
      * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
        rename it
      
      * ipv4,6: rename cookie_scratch uniquely
      
      * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
        pmc_irq_entry and nmi_entry to pmc_nmi_entry
      
      * perf_counter: rename disable_count to perf_disable_count
      
      * ftrace: rename test_event_disable to ftrace_test_event_disable
      
      * kmemleak: rename test_pointer to kmemleak_test_pointer
      
      * mce: rename next_interval to mce_next_interval
      
      [ Impact: percpu usage cleanups, no duplicate static percpu var names ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
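
      For illustration, the kind of change one of the renames listed above
      entails (a minimal before/after sketch, not the full patch):

      #include <linux/percpu.h>

      /* before: static percpu symbol with a generic name */
      static DEFINE_PER_CPU(int, disable_count);

      /* after: prefixed so the static percpu symbol name is globally unique */
      static DEFINE_PER_CPU(int, perf_disable_count);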
    • percpu: cleanup percpu array definitions · 204fba4a
      Committed by Tejun Heo
      Currently, the following three different ways to define percpu arrays
      are in use.
      
      1. DEFINE_PER_CPU(elem_type[array_len], array_name);
      2. DEFINE_PER_CPU(elem_type, array_name[array_len]);
      3. DEFINE_PER_CPU(elem_type, array_name)[array_len];
      
      Unify to #1 which correctly separates the roles of the two parameters
      and thus allows more flexibility in the way percpu variables are
      defined.
      
      [ Impact: cleanup ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: linux-mm@kvack.org
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David S. Miller <davem@davemloft.net>
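
      A short example of the unified form #1, using a hypothetical array
      name: the element type, including the array length, is the first
      argument and the variable name is the second:

      #include <linux/percpu.h>

      #define EXAMPLE_LEN 4

      /* form #1: DEFINE_PER_CPU(elem_type[array_len], array_name) */
      DEFINE_PER_CPU(int[EXAMPLE_LEN], example_percpu_array);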
    • percpu: use dynamic percpu allocator as the default percpu allocator · e74e3962
      Committed by Tejun Heo
      This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use the
      dynamic percpu allocator.  The first chunk is allocated using the
      embedding helper and 8k is reserved for modules.  This ensures that
      the new allocator behaves almost identically to the original allocator
      as far as static percpu variables are concerned, so it shouldn't
      introduce much breakage.
      
      s390 and alpha use a custom SHIFT_PERCPU_PTR() to work around the
      addressing range limit their addressing model imposes.  Unfortunately,
      this breaks
      if the address is specified using a variable, so for now, the two
      archs aren't converted.
      
      The following architectures are affected by this change.
      
      * sh
      * arm
      * cris
      * mips
      * sparc(32)
      * blackfin
      * avr32
      * parisc (broken, under investigation)
      * m32r
      * powerpc(32)
      
      As this change makes the dynamic allocator the default one,
      CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert -
      CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted
      archs.  These archs implement their own setup_per_cpu_areas() and the
      conversion is not trivial.
      
      * powerpc(64)
      * sparc(64)
      * ia64
      * alpha
      * s390
      
      Boot and batch alloc/free tested on x86_32 with debug code (x86_32
      doesn't use the default first chunk initialization).  Compile tested on
      sparc(32), powerpc(32), arm and alpha.
      
      Kyle McMartin reported that this change breaks parisc.  The problem is
      still under investigation and he is okay with pushing this patch
      forward and fixing parisc later.
      
      [ Impact: use dynamic allocator for most archs w/o custom percpu setup ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Bryan Wu <cooloney@kernel.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
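
      A rough, reconstructed sketch of the generic default described above
      (the helper's signature and the constants are assumed from the
      surrounding commits, not copied from the tree): the first chunk is
      built with the embedding helper and 8k is reserved for modules.

      #include <linux/init.h>
      #include <linux/percpu.h>
      #include <asm/sections.h>        /* __per_cpu_start / __per_cpu_end */

      static void __init default_setup_example(void)
      {
              size_t static_size = __per_cpu_end - __per_cpu_start;

              /* PERCPU_MODULE_RESERVE is 8k; a negative dyn_size is assumed
               * to mean "size the dynamic area automatically" */
              pcpu_embed_first_chunk(static_size, PERCPU_MODULE_RESERVE, -1);
      }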
    • mm: fix handling of pagesets for downed cpus · 364df0eb
      Committed by Dimitri Sivanich
      After downing/upping a cpu, an attempt to set
      /proc/sys/vm/percpu_pagelist_fraction results in an oops in
      percpu_pagelist_fraction_sysctl_handler().
      
      If a processor is downed then we need to set the pageset pointer back to
      the boot pageset.
      
      Updates of the high water marks should not access pagesets of
      unpopulated zones (those pointers go to the boot pagesets, which would
      no longer be functional if their size were increased beyond zero).
      Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: pass mm to grab_swap_token · a5c9b696
      Committed by Hugh Dickins
      If a kthread happens to use get_user_pages() on an mm (as KSM does),
      there's a chance that it will end up trying to read in a swap page, then
      oops in grab_swap_token() because the kthread has no mm: GUP passes down
      the right mm, so grab_swap_token() ought to be using it.
      
      We have not identified a stronger case than KSM's daemon (not yet in
      mainline), but the issue must have come up before, since RHEL has included
      a fix for this for years (though a different fix, they just back out of
      grab_swap_token if current->mm is unset: which is what we first proposed,
      but using the right mm here seems more correct).
      Reported-by: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
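
      A sketch of the interface change described above (the caller context
      is illustrative): the swap-token helper now takes the faulting mm
      explicitly instead of reading current->mm, which is NULL for a kthread
      doing get_user_pages().

      #include <linux/mm_types.h>

      void grab_swap_token(struct mm_struct *mm);   /* was: void grab_swap_token(void) */

      static void swapin_example(struct mm_struct *fault_mm)
      {
              grab_swap_token(fault_mm);            /* was: grab_swap_token() */
      }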
    • mm: don't rely on flags coincidence · d26ed650
      Committed by Hugh Dickins
      Indeed FOLL_WRITE matches FAULT_FLAG_WRITE, matches GUP_FLAGS_WRITE,
      and it's tempting to devise a set of Grand Unified Paging flags;
      but not today.  So until then, let's rely upon the compiler to spot
      the coincidence, "rather than have that subtle dependency and a
      comment for it" - as you remarked in another context yesterday.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
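
      A minimal sketch of the kind of explicit conversion this refers to (the
      helper name is illustrative): derive the fault flag from the GUP flag
      rather than assuming the two definitions share a value; when they do
      coincide, the compiler can optimize the conversion away.

      #include <linux/mm.h>

      static unsigned int gup_to_fault_flags(unsigned int foll_flags)
      {
              return (foll_flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0;
      }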
    • hugetlb: fault flags instead of write_access · 788c7df4
      Committed by Hugh Dickins
      handle_mm_fault() is now passing fault flags rather than write_access
      down to hugetlb_fault(), so better recognize that in hugetlb_fault(),
      and in hugetlb_no_page().
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix incorrect page removal from LRU · cb4cbcf6
      Committed by KAMEZAWA Hiroyuki
      The isolated page is "cursor_page" not "page".
      
      This could cause LRU list corruption under memory pressure, caught by
      CONFIG_DEBUG_LIST.
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 23 Jun 2009 (1 commit)
  10. 22 Jun 2009 (4 commits)
    • x86: implement percpu_alloc kernel parameter · fa8a7094
      Committed by Tejun Heo
      According to Andi, it isn't clear whether the lpage allocator is worth
      the trouble as there are many processors where the PMD TLB is far
      scarcer than the PTE TLB.  The advantage or disadvantage probably
      depends on the actual size of the percpu area and the specific
      processor.  As performance
      degradation due to TLB pressure tends to be highly workload specific
      and subtle, it is difficult to decide which way to go without more
      data.
      
      This patch implements the percpu_alloc kernel parameter to allow selecting
      which first chunk allocator to use to ease debugging and testing.
      
      While at it, make sure all the failure paths report why something
      failed, to help determine why a certain allocator isn't working.  Also,
      kill the "Great future plan" comment which had already been realized
      quite some time ago.
      
      [ Impact: allow explicit percpu first chunk allocator selection ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
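
      For example, one of the page-granularity allocators can be selected
      from the kernel command line (the exact keywords accepted depend on the
      kernel version; "4k" and "lpage" are the allocators discussed in this
      series):

      percpu_alloc=4k
      percpu_alloc=lpage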
    • percpu: fix too lazy vunmap cache flushing · 85ae87c1
      Committed by Tejun Heo
      In pcpu_unmap(), flushing virtual cache on vunmap can't be delayed as
      the page is going to be returned to the page allocator.  Only TLB
      flushing can be put off such that vmalloc code can handle it lazily.
      Fix it.
      
      [ Impact: fix subtle virtual cache flush bug ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
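
      A sketch of the ordering constraint described above (the helper name
      and parameters are illustrative, not the actual pcpu_unmap() body): the
      virtual cache flush must happen before the pages are handed back, while
      the TLB flush may be left to the lazy vmalloc machinery.

      #include <linux/types.h>
      #include <asm/cacheflush.h>      /* flush_cache_vunmap() */
      #include <asm/tlbflush.h>        /* flush_tlb_kernel_range() */

      static void unmap_example(unsigned long start, unsigned long end, bool flush_tlb)
      {
              flush_cache_vunmap(start, end);              /* must not be delayed */

              /* ... unmap the kernel range here ... */

              if (flush_tlb)
                      flush_tlb_kernel_range(start, end);  /* may be deferred */
      }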
    • Move FAULT_FLAG_xyz into handle_mm_fault() callers · d06063cc
      Committed by Linus Torvalds
      This allows the callers to now pass down the full set of FAULT_FLAG_xyz
      flags to handle_mm_fault().  All callers have been (mechanically)
      converted to the new calling convention; there's almost certainly room
      for architectures to clean up their code and then add FAULT_FLAG_RETRY
      when that support is added.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
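
      A minimal sketch of the mechanical caller-side conversion, assuming an
      arch fault handler that previously passed a plain 'write' boolean:

      #include <linux/mm.h>

      static int fault_example(struct mm_struct *mm, struct vm_area_struct *vma,
                               unsigned long address, int write)
      {
              /* build the flags word at the call site */
              return handle_mm_fault(mm, vma, address,
                                     write ? FAULT_FLAG_WRITE : 0);
      }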
    • Remove internal use of 'write_access' in mm/memory.c · 30c9f3a9
      Committed by Linus Torvalds
      The fault handling routines really want more fine-grained flags than a
      single "was it a write fault" boolean - the callers will want to set
      flags like "you can return a retry error" etc.
      
      And that's actually how the VM works internally, but right now the
      top-level fault handling functions in mm/memory.c all pass just the
      'write_access' boolean around.
      
      This switches them over to pass around the FAULT_FLAG_xyzzy 'flags'
      variable instead.  The 'write_access' calling convention still exists
      for the exported 'handle_mm_fault()' function, but that is next.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
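
      A minimal sketch of the pattern this describes, with an illustrative
      helper name: internal fault helpers receive a flags word and test
      FAULT_FLAG_WRITE rather than a write_access boolean.

      #include <linux/mm.h>

      static int write_fault_example(unsigned int flags)
      {
              if (flags & FAULT_FLAG_WRITE) {
                      /* write-fault path, e.g. break COW / mark the page dirty */
                      return 1;
              }
              /* read-fault path */
              return 0;
      }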
  11. 21 Jun 2009 (1 commit)
  12. 20 Jun 2009 (1 commit)
  13. 19 Jun 2009 (2 commits)