1. 11 12月, 2012 28 次提交
    • M
      mm: numa: Add THP migration for the NUMA working set scanning fault case. · b32967ff
      Mel Gorman 提交于
      Note: This is very heavily based on a patch from Peter Zijlstra with
      	fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner.  That patch
      	put a lot of migration logic into mm/huge_memory.c where it does
      	not belong. This version puts tries to share some of the migration
      	logic with migrate_misplaced_page.  However, it should be noted
      	that now migrate.c is doing more with the pagetable manipulation
      	than is preferred. The end result is barely recognisable so as
      	before, the signed-offs had to be removed but will be re-added if
      	the original authors are ok with it.
      
      Add THP migration for the NUMA working set scanning fault case.
      
      It uses the page lock to serialize. No migration pte dance is
      necessary because the pte is already unmapped when we decide
      to migrate.
      
      [dhillf@gmail.com: Fix memory leak on isolation failure]
      [dhillf@gmail.com: Fix transfer of last_nid information]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      b32967ff
    • M
      mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node · 5bca2303
      Mel Gorman 提交于
      Due to the fact that migrations are driven by the CPU a task is running
      on there is no point tracking NUMA faults until one task runs on a new
      node. This patch tracks the first node used by an address space. Until
      it changes, PTE scanning is disabled and no NUMA hinting faults are
      trapped. This should help workloads that are short-lived, do not care
      about NUMA placement or have bound themselves to a single node.
      
      This takes advantage of the logic in "mm: sched: numa: Implement slow
      start for working set sampling" to delay when the checks are made. This
      will take advantage of processes that set their CPU and node bindings
      early in their lifetime. It will also potentially allow any initial load
      balancing to take place.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      5bca2303
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Mel Gorman 提交于
      This patch adds Kconfig options and kernel parameters to allow the
      enabling and disabling of automatic NUMA balancing. The existance
      of such a switch was and is very important when debugging problems
      related to transparent hugepages and we should have the same for
      automatic NUMA placement.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      1a687c2e
    • M
      mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Mel Gorman 提交于
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
      now higher than the previous default. Once every minute it will reset
      the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      b8593bfd
    • M
      mm: numa: Introduce last_nid to the page frame · 57e0a030
      Mel Gorman 提交于
      This patch introduces a last_nid field to the page struct. This is used
      to build a two-stage filter in the next patch that is aimed at
      mitigating a problem whereby pages migrate to the wrong node when
      referenced by a process that was running off its home node.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      57e0a030
    • M
      mm: numa: Rate limit setting of pte_numa if node is saturated · e14808b4
      Mel Gorman 提交于
      If there are a large number of NUMA hinting faults and all of them
      are resulting in migrations it may indicate that memory is just
      bouncing uselessly around. NUMA balancing cost is likely exceeding
      any benefit from locality. Rate limit the PTE updates if the node
      is migration rate-limited. As noted in the comments, this distorts
      the NUMA faulting statistics.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      e14808b4
    • A
      mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting · 8177a420
      Andrea Arcangeli 提交于
      This defines the per-node data used by Migrate On Fault in order to
      rate limit the migration. The rate limiting is applied independently
      to each destination node.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      8177a420
    • M
      mm: numa: Migrate on reference policy · 5606e387
      Mel Gorman 提交于
      This is the simplest possible policy that still does something of note.
      When a pte_numa is faulted, it is moved immediately. Any replacement
      policy must at least do better than this and in all likelihood this
      policy regresses normal workloads.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      5606e387
    • M
      mm: numa: Add pte updates, hinting and migration stats · 03c5a6e1
      Mel Gorman 提交于
      It is tricky to quantify the basic cost of automatic NUMA placement in a
      meaningful manner. This patch adds some vmstats that can be used as part
      of a basic costing model.
      
      u    = basic unit = sizeof(void *)
      Ca   = cost of struct page access = sizeof(struct page) / u
      Cpte = Cost PTE access = Ca
      Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
      	where Cpte is incurred twice for a read and a write and Wlock
      	is a constant representing the cost of taking or releasing a
      	lock
      Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
      Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
      Ci = Cost of page isolation = Ca + Wi
      	where Wi is a constant that should reflect the approximate cost
      	of the locking operation
      Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
      	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
      	would imply that remote accesses are 20% more expensive
      
      Balancing cost = Cpte * numa_pte_updates +
      		Cnumahint * numa_hint_faults +
      		Ci * numa_pages_migrated +
      		Cpagecopy * numa_pages_migrated
      
      Note that numa_pages_migrated is used as a measure of how many pages
      were isolated even though it would miss pages that failed to migrate. A
      vmstat counter could have been added for it but the isolation cost is
      pretty marginal in comparison to the overall cost so it seemed overkill.
      
      The ideal way to measure automatic placement benefit would be to count
      the number of remote accesses versus local accesses and do something like
      
      	benefit = (remote_accesses_before - remove_access_after) * Wnuma
      
      but the information is not readily available. As a workload converges, the
      expection would be that the number of remote numa hints would reduce to 0.
      
      	convergence = numa_hint_faults_local / numa_hint_faults
      		where this is measured for the last N number of
      		numa hints recorded. When the workload is fully
      		converged the value is 1.
      
      This can measure if the placement policy is converging and how fast it is
      doing it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      03c5a6e1
    • P
      mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Peter Zijlstra 提交于
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they better stick to the node
      they were started on. As tasks mature and rebalance to other CPUs
      and nodes, so does their NUMA placement have to change and so
      does it start to matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
      
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward, most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
      knob.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      4b96a29b
    • P
      mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate · 6e5fb223
      Peter Zijlstra 提交于
      Previously, to probe the working set of a task, we'd use
      a very simple and crude method: mark all of its address
      space PROT_NONE.
      
      That method has various (obvious) disadvantages:
      
       - it samples the working set at dissimilar rates,
         giving some tasks a sampling quality advantage
         over others.
      
       - creates performance problems for tasks with very
         large working sets
      
       - over-samples processes with large address spaces but
         which only very rarely execute
      
      Improve that method by keeping a rotating offset into the
      address space that marks the current position of the scan,
      and advance it by a constant rate (in a CPU cycles execution
      proportional manner). If the offset reaches the last mapped
      address of the mm then it then it starts over at the first
      address.
      
      The per-task nature of the working set sampling functionality in this tree
      allows such constant rate, per task, execution-weight proportional sampling
      of the working set, with an adaptive sampling interval/frequency that
      goes from once per 100ms up to just once per 8 seconds.  The current
      sampling volume is 256 MB per interval.
      
      As tasks mature and converge their working set, so does the
      sampling rate slow down to just a trickle, 256 MB per 8
      seconds of CPU time executed.
      
      This, beyond being adaptive, also rate-limits rarely
      executing systems and does not over-sample on overloaded
      systems.
      
      [ In AutoNUMA speak, this patch deals with the effective sampling
        rate of the 'hinting page fault'. AutoNUMA's scanning is
        currently rate-limited, but it is also fundamentally
        single-threaded, executing in the knuma_scand kernel thread,
        so the limit in AutoNUMA is global and does not scale up with
        the number of CPUs, nor does it scan tasks in an execution
        proportional manner.
      
        So the idea of rate-limiting the scanning was first implemented
        in the AutoNUMA tree via a global rate limit. This patch goes
        beyond that by implementing an execution rate proportional
        working set sampling rate that is not implemented via a single
        global scanning daemon. ]
      
      [ Dan Carpenter pointed out a possible NULL pointer dereference in the
        first version of this patch. ]
      Based-on-idea-by: NAndrea Arcangeli <aarcange@redhat.com>
      Bug-Found-By: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote changelog and fixed bug. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      6e5fb223
    • P
      mm: numa: Add fault driven placement and migration · cbee9f88
      Peter Zijlstra 提交于
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler that when faulted later will be faulted onto the
      node the CPU is running on.  In itself this does nothing useful but any
      placement policy will fundamentally depend on receiving hints on placement
      from fault context and doing something intelligent about it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      cbee9f88
    • M
      mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now · a720094d
      Mel Gorman 提交于
      The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
      explicitly request lazy migration is a good idea but the actual
      API has not been well reviewed and once released we have to support it.
      For now this patch prevents an application using the services. This
      will need to be revisited.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      a720094d
    • M
      mm: mempolicy: Implement change_prot_numa() in terms of change_protection() · 4b10e7d5
      Mel Gorman 提交于
      This patch converts change_prot_numa() to use change_protection(). As
      pte_numa and friends check the PTE bits directly it is necessary for
      change_protection() to use pmd_mknuma(). Hence the required
      modifications to change_protection() are a little clumsy but the
      end result is that most of the numa page table helpers are just one or
      two instructions.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      4b10e7d5
    • L
      mm: mempolicy: Add MPOL_MF_LAZY · b24f53a0
      Lee Schermerhorn 提交于
      NOTE: Once again there is a lot of patch stealing and the end result
      	is sufficiently different that I had to drop the signed-offs.
      	Will re-add if the original authors are ok with that.
      
      This patch adds another mbind() flag to request "lazy migration".  The
      flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
      pages are marked PROT_NONE. The pages will be migrated in the fault
      path on "first touch", if the policy dictates at that time.
      
      "Lazy Migration" will allow testing of migrate-on-fault via mbind().
      Also allows applications to specify that only subsequently touched
      pages be migrated to obey new policy, instead of all pages in range.
      This can be useful for multi-threaded applications working on a
      large shared data area that is initialized by an initial thread
      resulting in all pages on one [or a few, if overflowed] nodes.
      After PROT_NONE, the pages in regions assigned to the worker threads
      will be automatically migrated local to the threads on 1st touch.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      b24f53a0
    • M
      mm: mempolicy: Use _PAGE_NUMA to migrate pages · 4daae3b4
      Mel Gorman 提交于
      Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
      	sufficiently different that the signed-off-bys were dropped
      
      Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
      pieces into an effective migrate on fault scheme.
      
      Note that (on x86) we rely on PROT_NONE pages being !present and avoid
      the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
      page-migration performance.
      Based-on-work-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      4daae3b4
    • P
      mm: migrate: Introduce migrate_misplaced_page() · 7039e1db
      Peter Zijlstra 提交于
      Note: This was originally based on Peter's patch "mm/migrate: Introduce
      	migrate_misplaced_page()" but borrows extremely heavily from Andrea's
      	"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
      	collection". The end result is barely recognisable so signed-offs
      	had to be dropped. If original authors are ok with it, I'll
      	re-add the signed-off-bys.
      
      Add migrate_misplaced_page() which deals with migrating pages from
      faults.
      Based-on-work-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Based-on-work-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Based-on-work-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      7039e1db
    • L
      mm: mempolicy: Check for misplaced page · 771fb4d8
      Lee Schermerhorn 提交于
      This patch provides a new function to test whether a page resides
      on a node that is appropriate for the mempolicy for the vma and
      address where the page is supposed to be mapped.  This involves
      looking up the node where the page belongs.  So, the function
      returns that node so that it may be used to allocated the page
      without consulting the policy again.
      
      A subsequent patch will call this function from the fault path.
      Because of this, I don't want to go ahead and allocate the page, e.g.,
      via alloc_page_vma() only to have to free it if it has the correct
      policy.  So, I just mimic the alloc_page_vma() node computation
      logic--sort of.
      
      Note:  we could use this function to implement a MPOL_MF_STRICT
      behavior when migrating pages to match mbind() mempolicy--e.g.,
      to ensure that pages in an interleaved range are reinterleaved
      rather than left where they are when they reside on any page in
      the interleave nodemask.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      [ Added MPOL_F_LAZY to trigger migrate-on-fault;
        simplified code now that we don't have to bother
        with special crap for interleaved ]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      771fb4d8
    • L
      mm: mempolicy: Add MPOL_NOOP · d3a71033
      Lee Schermerhorn 提交于
      This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
      to mbind().  When the NOOP policy is used with the 'MOVE and 'LAZY
      flags, mbind() will map the pages PROT_NONE so that they will be
      migrated on the next touch.
      
      This allows an application to prepare for a new phase of operation
      where different regions of shared storage will be assigned to
      worker threads, w/o changing policy.  Note that we could just use
      "default" policy in this case.  However, this also allows an
      application to request that pages be migrated, only if necessary,
      to follow any arbitrary policy that might currently apply to a
      range of pages, without knowing the policy, or without specifying
      multiple mbind()s for ranges with different policies.
      
      [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]
      Bug-Reported-by: NReported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      d3a71033
    • P
      mm: mempolicy: Make MPOL_LOCAL a real policy · 479e2802
      Peter Zijlstra 提交于
      Make MPOL_LOCAL a real and exposed policy such that applications that
      relied on the previous default behaviour can explicitly request it.
      Requested-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      479e2802
    • M
      mm: numa: Create basic numa page hinting infrastructure · d10e63f2
      Mel Gorman 提交于
      Note: This patch started as "mm/mpol: Create special PROT_NONE
      	infrastructure" and preserves the basic idea but steals *very*
      	heavily from "autonuma: numa hinting page faults entry points" for
      	the actual fault handlers without the migration parts.	The end
      	result is barely recognisable as either patch so all Signed-off
      	and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
      	this version, I will re-add the signed-offs-by to reflect the history.
      
      In order to facilitate a lazy -- fault driven -- migration of pages, create
      a special transient PAGE_NUMA variant, we can then use the 'spurious'
      protection faults to drive our migrations from.
      
      The meaning of PAGE_NUMA depends on the architecture but on x86 it is
      effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
      NUMA faults for the reason that the page fault code checks the permission on
      the VMA (and will throw a segmentation fault on actual PROT_NONE mappings),
      before it ever calls handle_mm_fault.
      
      [dhillf@gmail.com: Fix typo]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      d10e63f2
    • A
      mm: numa: Support NUMA hinting page faults from gup/gup_fast · 0b9d7052
      Andrea Arcangeli 提交于
      Introduce FOLL_NUMA to tell follow_page to check
      pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
      so because it always invokes handle_mm_fault and retries the
      follow_page later.
      
      KVM secondary MMU page faults will trigger the NUMA hinting page
      faults through gup_fast -> get_user_pages -> follow_page ->
      handle_mm_fault.
      
      Other follow_page callers like KSM should not use FOLL_NUMA, or they
      would fail to get the pages if they use follow_page instead of
      get_user_pages.
      
      [ This patch was picked up from the AutoNUMA tree. ]
      Originally-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ ported to this tree. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      0b9d7052
    • A
      mm: numa: pte_numa() and pmd_numa() · be3a7284
      Andrea Arcangeli 提交于
      Implement pte_numa and pmd_numa.
      
      We must atomically set the numa bit and clear the present bit to
      define a pte_numa or pmd_numa.
      
      Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
      a thread touches a virtual address in the corresponding virtual range,
      a NUMA hinting page fault will trigger. The NUMA hinting page fault
      will clear the NUMA bit and set the present bit again to resolve the
      page fault.
      
      The expectation is that a NUMA hinting page fault is used as part
      of a placement policy that decides if a page should remain on the
      current node or migrated to a different node.
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      be3a7284
    • M
      mm: compaction: Add scanned and isolated counters for compaction · 397487db
      Mel Gorman 提交于
      Compaction already has tracepoints to count scanned and isolated pages
      but it requires that ftrace be enabled and if that information has to be
      written to disk then it can be disruptive. This patch adds vmstat counters
      for compaction called compact_migrate_scanned, compact_free_scanned and
      compact_isolated.
      
      With these counters, it is possible to define a basic cost model for
      compaction. This approximates of how much work compaction is doing and can
      be compared that with an oprofile showing TLB misses and see if the cost of
      compaction is being offset by THP for example. Minimally a compaction patch
      can be evaluated in terms of whether it increases or decreases cost. The
      basic cost model looks like this
      
      Fundamental unit u:	a word	sizeof(void *)
      
      Ca  = cost of struct page access = sizeof(struct page) / u
      
      Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
      Cmf = Cost migrate failure   = Ca * 2
      Ci  = Cost page isolation    = (Ca + Wi)
      	where Wi is a constant that should reflect the approximate
      	cost of the locking operation.
      
      Csm = Cost migrate scanning = Ca
      Csf = Cost free    scanning = Ca
      
      Overall cost =	(Csm * compact_migrate_scanned) +
      	      	(Csf * compact_free_scanned)    +
      	      	(Ci  * compact_isolated)	+
      		(Cmc * pgmigrate_success)	+
      		(Cmf * pgmigrate_failed)
      
      Where the values are read from /proc/vmstat.
      
      This is very basic and ignores certain costs such as the allocation cost
      to do a migrate page copy but any improvement to the model would still
      use the same vmstat counters.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      397487db
    • M
      mm: migrate: Add a tracepoint for migrate_pages · 7b2a2d4a
      Mel Gorman 提交于
      The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
      about migration activity but not the type or the reason. This patch adds
      a tracepoint to identify the type of page migration and why the page is
      being migrated.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      7b2a2d4a
    • M
      mm: compaction: Move migration fail/success stats to migrate.c · 5647bc29
      Mel Gorman 提交于
      The compact_pages_moved and compact_pagemigrate_failed events are
      convenient for determining if compaction is active and to what
      degree migration is succeeding but it's at the wrong level. Other
      users of migration may also want to know if migration is working
      properly and this will be particularly true for any automated
      NUMA migration. This patch moves the counters down to migration
      with the new events called pgmigrate_success and pgmigrate_fail.
      The compact_blocks_moved counter is removed because while it was
      useful for debugging initially, it's worthless now as no meaningful
      conclusions can be drawn from its value.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      5647bc29
    • P
      mm: Count the number of pages affected in change_protection() · 7da4d641
      Peter Zijlstra 提交于
      This will be used for three kinds of purposes:
      
       - to optimize mprotect()
      
       - to speed up working set scanning for working set areas that
         have not been touched
      
       - to more accurately scan per real working set
      
      No change in functionality from this patch.
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      7da4d641
    • R
      x86/mm: Introduce pte_accessible() · 2c3cf556
      Rik van Riel 提交于
      We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
      the pte is associated with a page.
      
      However, for TLB flushing purposes, we would like to know whether the pte
      points to an actually accessible page.  This allows us to skip remote TLB
      flushes for pages that are not actually accessible.
      
      Fill in this method for x86 and provide a safe (but slower) method
      on other architectures.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Fixed-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org
      [ Added Linus's review fixes. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2c3cf556
  2. 17 11月, 2012 4 次提交
    • A
      revert "mm: fix-up zone present pages" · 5576646f
      Andrew Morton 提交于
      Revert commit 7f1290f2 ("mm: fix-up zone present pages")
      
      That patch tried to fix a issue when calculating zone->present_pages,
      but it caused a regression on 32bit systems with HIGHMEM.  With that
      change, reset_zone_present_pages() resets all zone->present_pages to
      zero, and fixup_zone_present_pages() is called to recalculate
      zone->present_pages when the boot allocator frees core memory pages into
      buddy allocator.  Because highmem pages are not freed by bootmem
      allocator, all highmem zones' present_pages becomes zero.
      
      Various options for improving the situation are being discussed but for
      now, let's return to the 3.6 code.
      
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Tested-by: NChris Clayton <chris2553@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5576646f
    • R
      rapidio: fix kernel-doc warnings · 2ca3cb50
      Randy Dunlap 提交于
      Fix rapidio kernel-doc warnings:
      
        Warning(drivers/rapidio/rio.c:415): No description found for parameter 'local'
        Warning(drivers/rapidio/rio.c:415): Excess function parameter 'lstart' description in 'rio_map_inb_region'
        Warning(include/linux/rio.h:290): No description found for parameter 'switches'
        Warning(include/linux/rio.h:290): No description found for parameter 'destid_table'
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Acked-by: NAlexandre Bounine <alexandre.bounine@idt.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ca3cb50
    • H
      memcg: fix hotplugged memory zone oops · bea8c150
      Hugh Dickins 提交于
      When MEMCG is configured on (even when it's disabled by boot option),
      when adding or removing a page to/from its lru list, the zone pointer
      used for stats updates is nowadays taken from the struct lruvec.  (On
      many configurations, calculating zone from page is slower.)
      
      But we have no code to update all the lruvecs (per zone, per memcg) when
      a memory node is hotadded.  Here's an extract from the oops which
      results when running numactl to bind a program to a newly onlined node:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
        IP:  __mod_zone_page_state+0x9/0x60
        Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
        Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
        Call Trace:
          __pagevec_lru_add_fn+0xdf/0x140
          pagevec_lru_move_fn+0xb1/0x100
          __pagevec_lru_add+0x1c/0x30
          lru_add_drain_cpu+0xa3/0x130
          lru_add_drain+0x2f/0x40
         ...
      
      The natural solution might be to use a memcg callback whenever memory is
      hotadded; but that solution has not been scoped out, and it happens that
      we do have an easy location at which to update lruvec->zone.  The lruvec
      pointer is discovered either by mem_cgroup_zone_lruvec() or by
      mem_cgroup_page_lruvec(), and both of those do know the right zone.
      
      So check and set lruvec->zone in those; and remove the inadequate
      attempt to set lruvec->zone from lruvec_init(), which is called before
      NODE_DATA(node) has been allocated in such cases.
      
      Ah, there was one exceptionr.  For no particularly good reason,
      mem_cgroup_force_empty_list() has its own code for deciding lruvec.
      Change it to use the standard mem_cgroup_zone_lruvec() and
      mem_cgroup_get_lru_size() too.  In fact it was already safe against such
      an oops (the lru lists in danger could only be empty), but we're better
      proofed against future changes this way.
      
      I've marked this for stable (3.6) since we introduced the problem in 3.5
      (now closed to stable); but I have no idea if this is the only fix
      needed to get memory hotadd working with memcg in 3.6, and received no
      answer when I enquired twice before.
      Reported-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bea8c150
    • D
      mm, oom: reintroduce /proc/pid/oom_adj · fa0cbbf1
      David Rientjes 提交于
      This is mostly a revert of 01dc52eb ("oom: remove deprecated oom_adj")
      from Davidlohr Bueso.
      
      It reintroduces /proc/pid/oom_adj for backwards compatibility with earlier
      kernels.  It simply scales the value linearly when /proc/pid/oom_score_adj
      is written.
      
      The major difference is that its scheduled removal is no longer included
      in Documentation/feature-removal-schedule.txt.  We do warn users with a
      single printk, though, to suggest the more powerful and supported
      /proc/pid/oom_score_adj interface.
      Reported-by: NArtem S. Tashkinov <t.artem@lycos.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa0cbbf1
  3. 16 11月, 2012 1 次提交
    • I
      clk: remove inline usage from clk-provider.h · 93532c8a
      Igor Mazanov 提交于
      Users of GCC 4.7 have reported compiler errors due to having inline
      applied to function declarations in clk-provider.h.  The definitions
      exist in drivers/clk/clk.c.  An example error:
      
      In file included from arch/arm/mach-omap2/clockdomain.c:25:0:
      arch/arm/mach-omap2/clockdomain.c: In function ‘clkdm_clk_disable’:
      include/linux/clk-provider.h:338:12: error: inlining failed in call to always_inline ‘__clk_get_enable_count’: function body not available
      arch/arm/mach-omap2/clockdomain.c:1001:28: error: called from here
      make[1]: *** [arch/arm/mach-omap2/clockdomain.o] Error 1
      make: *** [arch/arm/mach-omap2] Error 2
      
      This patch removes the use of inline from include/linux/clk-provider.h
      but keeps the function definitions in drivers/clk/clk.c as inlined since
      they are one-liners.
      Signed-off-by: NIgor Mazanov <i.mazanov@gmail.com>
      Acked-by: NPaul Walmsley <paul@pwsan.com>
      Signed-off-by: NMike Turquette <mturquette@linaro.org>
      [mturquette@linaro.org: improved subject, added changelog]
      93532c8a
  4. 10 11月, 2012 1 次提交
    • A
      of/address: sparc: Declare of_address_to_resource() as an extern function for sparc again · 0bce04be
      Andreas Larsson 提交于
      This bug-fix makes sure that of_address_to_resource is defined extern for sparc
      so that the sparc-specific implementation of of_address_to_resource() is once
      again used when including include/linux/of_address.h in a sparc context. A
      number of drivers in mainline relies on this function working for sparc.
      
      The bug was introduced in a850a755, "of/address:
      add empty static inlines for !CONFIG_OF". Contrary to that commit title, the
      static inlines are added for !CONFIG_OF_ADDRESS, and CONFIG_OF_ADDRESS is never
      defined for sparc. This is good behavior for the other functions in
      include/linux/of_address.h, as the extern functions defined in
      drivers/of/address.c only gets linked when OF_ADDRESS is configured. However,
      for of_address_to_resource there exists a sparc-specific implementation in
      arch/sparc/arch/sparc/kernel/of_device_common.c
      
      Solution suggested by: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: NAndreas Larsson <andreas@gaisler.com>
      Acked-by: NRob Herring <rob.herring@calxeda.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0bce04be
  5. 09 11月, 2012 1 次提交
    • A
      revert "epoll: support for disabling items, and a self-test app" · a80a6b85
      Andrew Morton 提交于
      Revert commit 03a7beb5 ("epoll: support for disabling items, and a
      self-test app") pending resolution of the issues identified by Michael
      Kerrisk, copied below.
      
      We'll revisit this for 3.8.
      
      : I've taken a look at this patch as it currently stands in 3.7-rc1, and
      : done a bit of testing. (By the way, the test program
      : tools/testing/selftests/epoll/test_epoll.c does not compile...)
      :
      : There are one or two places where the behavior seems a little strange,
      : so I have a question or two at the end of this mail. But other than
      : that, I want to check my understanding so that the interface can be
      : correctly documented.
      :
      : Just to go though my understanding, the problem is the following
      : scenario in a multithreaded application:
      :
      : 1. Multiple threads are performing epoll_wait() operations,
      :    and maintaining a user-space cache that contains information
      :    corresponding to each file descriptor being monitored by
      :    epoll_wait().
      :
      : 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
      :    a file descriptor from the epoll interest list, and
      :    delete the corresponding record from the user-space cache.
      :
      : 3. The problem with (2) is that some other thread may have
      :    previously done an epoll_wait() that retrieved information
      :    about the fd in question, and may be in the middle of using
      :    information in the cache that relates to that fd. Thus,
      :    there is a potential race.
      :
      : 4. The race can't solved purely in user space, because doing
      :    so would require applying a mutex across the epoll_wait()
      :    call, which would of course blow thread concurrency.
      :
      : Right?
      :
      : Your solution is the EPOLL_CTL_DISABLE operation. I want to
      : confirm my understanding about how to use this flag, since
      : the description that has accompanied the patches so far
      : has been a bit sparse
      :
      : 0. In the scenario you're concerned about, deleting a file
      :    descriptor means (safely) doing the following:
      :    (a) Deleting the file descriptor from the epoll interest list
      :        using EPOLL_CTL_DEL
      :    (b) Deleting the corresponding record in the user-space cache
      :
      : 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
      :    conjunction with EPOLLONESHOT.
      :
      : 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
      :    conjunction is a logical error.
      :
      : 3. The correct way to code multithreaded applications using
      :    EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
      :
      :    a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
      :       should EPOLLONESHOT.
      :
      :    b. When a thread wants to delete a file descriptor, it
      :       should do the following:
      :
      :       [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
      :       [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
      :           was zero, then the file descriptor can be safely
      :           deleted by the thread that made this call.
      :       [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
      :           then the descriptor is in use. In this case, the calling
      :           thread should set a flag in the user-space cache to
      :           indicate that the thread that is using the descriptor
      :           should perform the deletion operation.
      :
      : Is all of the above correct?
      :
      : The implementation depends on checking on whether
      : (events & ~EP_PRIVATE_BITS) == 0
      : This replies on the fact that EPOLL_CTL_AD and EPOLL_CTL_MOD always
      : set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
      : causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
      : cleared.
      :
      : A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
      : is only useful in conjunction with EPOLLONESHOT. However, as things
      : stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
      : not have EPOLLONESHOT set in 'events' This results in the following
      : (slightly surprising) behavior:
      :
      : (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
      :     (the indicator that the file descriptor can be safely deleted).
      : (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
      :
      : This doesn't seem particularly useful, and in fact is probably an
      : indication that the user made a logic error: they should only be using
      : epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
      : EPOLLONESHOT was set in 'events'. If that is correct, then would it
      : not make sense to return an error to user space for this case?
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paton J. Lewis" <palewis@adobe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a80a6b85
  6. 08 11月, 2012 4 次提交
  7. 07 11月, 2012 1 次提交