1. 11 December 2012 (8 commits)
    • mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Committed by Mel Gorman
      This patch adds Kconfig options and kernel parameters to allow the
      enabling and disabling of automatic NUMA balancing. The existence
      of such a switch was and is very important when debugging problems
      related to transparent hugepages and we should have the same for
      automatic NUMA placement.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
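      [ A minimal sketch of what such a boot-time switch looks like; the
        parameter string, variable, and helper names are illustrative
        assumptions, not necessarily the commit's exact code: ]

	/* default would be chosen by the new Kconfig option */
	static bool numa_balancing_enabled __read_mostly;

	static int __init setup_numa_balancing(char *str)
	{
		if (!str)
			return 0;
		if (!strcmp(str, "enable"))
			numa_balancing_enabled = true;
		else if (!strcmp(str, "disable"))
			numa_balancing_enabled = false;
		else
			return 0;	/* unrecognized value */
		return 1;
	}
	__setup("numa_balancing=", setup_numa_balancing);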
    • mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Committed by Mel Gorman
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
      now higher than the previous default. Once every minute it will reset
      the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
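      [ A minimal sketch of the back-off, assuming the sysctl names from
        the changelog and an illustrative helper; not the exact patch: ]

	extern unsigned int sysctl_numa_balancing_scan_period_min;
	extern unsigned int sysctl_numa_balancing_scan_period_max;

	static void numa_adjust_scan_period(struct task_struct *p, bool migrated)
	{
		if (migrated) {
			/* placement is still improving: keep scanning fast */
			p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
		} else {
			/* nothing moved: back off, bounded by the raised max;
			 * a once-a-minute reset elsewhere restarts fast
			 * scanning in case the workload changes phase */
			p->numa_scan_period = min(2 * p->numa_scan_period,
						  sysctl_numa_balancing_scan_period_max);
		}
	}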
    • sched: numa: Slowly increase the scanning period as NUMA faults are handled · fb003b80
      Committed by Mel Gorman
      Currently the rate of scanning for an address space is controlled
      by the individual tasks. The next scan is simply determined by
      2*p->numa_scan_period.
      
      The 2*p->numa_scan_period is arbitrary and never changes. At this point
      there is still no proper policy that decides if a task or process is
      properly placed. It just scans and assumes the next NUMA fault will
      place it properly. As it is assumed that pages will get properly placed
      over time, increase the scan window each time a fault is incurred. This
      is a big assumption as noted in the comments.
      
      It should be noted that changing to p->numa_scan_period will increase
      system CPU usage because now the scanning rate has effectively doubled.
      If that is a problem then the min_rate should be made 200ms instead of
      restoring the 2* logic.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
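      [ A sketch of the two pieces: scheduling the next scan directly from
        p->numa_scan_period instead of the old 2* factor, and growing the
        period on each fault. The increment and helper names are
        illustrative: ]

	extern unsigned int sysctl_numa_balancing_scan_period_max;

	static void numa_reschedule_scan(struct task_struct *p)
	{
		/* was: jiffies + msecs_to_jiffies(2 * p->numa_scan_period) */
		p->mm->numa_next_scan = jiffies +
			msecs_to_jiffies(p->numa_scan_period);
	}

	static void numa_grow_scan_period(struct task_struct *p)
	{
		/* assume the fault placed the page properly, so scan a
		 * bit less often from now on (increment is illustrative) */
		p->numa_scan_period = min(p->numa_scan_period + 10u,
					  sysctl_numa_balancing_scan_period_max);
	}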
    • mm: numa: Rate limit setting of pte_numa if node is saturated · e14808b4
      Committed by Mel Gorman
      If there are a large number of NUMA hinting faults and all of them
      are resulting in migrations it may indicate that memory is just
      bouncing uselessly around. NUMA balancing cost is likely exceeding
      any benefit from locality. Rate limit the PTE updates if the node
      is migration rate-limited. As noted in the comments, this distorts
      the NUMA faulting statistics.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
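      [ A sketch of the gate, assuming a per-node query along the lines of
        the migrate_ratelimited() helper the changelog implies: ]

	static void numa_scan_vma(struct vm_area_struct *vma,
				  unsigned long start, unsigned long end)
	{
		/* if migration to this node is already throttled, skip the
		 * PTE update pass entirely: more hinting faults would only
		 * request migrations that will be rate-limited anyway */
		if (migrate_ratelimited(numa_node_id()))
			return;

		change_prot_numa(vma, start, end);
	}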
    • mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Committed by Peter Zijlstra
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they'd better stick to the node
      they were started on. As tasks mature and rebalance to other CPUs
      and nodes, their NUMA placement has to change and starts to matter
      more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
      
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward, most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
      knob.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
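      [ Most of the patch is plumbing for the tunable; the core is a
        one-line initialization at fork time, sketched here with
        illustrative names: ]

	extern unsigned int sysctl_numa_balancing_scan_delay;	/* in ms, default 1000 */

	static void numa_init_scan_delay(struct mm_struct *mm)
	{
		/* short-lived tasks exit before this expires and are
		 * never scanned at all */
		mm->numa_next_scan = jiffies +
			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
	}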
    • sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges · 9f40604c
      Committed by Mel Gorman
      By accounting against the present PTEs, scanning speed reflects the
      actual present (mapped) memory.
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
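      [ A sketch of the changed accounting, assuming change_prot_numa()
        returns the number of present PTEs it actually marked: ]

	static void numa_scan_charge(struct mm_struct *mm, unsigned long start,
				     long pages /* scan budget, in pages */)
	{
		struct vm_area_struct *vma;

		for (vma = find_vma(mm, start); vma && pages > 0;
		     vma = vma->vm_next) {
			/* charge what was really mapped, not the virtual
			 * span that was covered */
			pages -= change_prot_numa(vma, vma->vm_start,
						  vma->vm_end);
		}
	}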
    • mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate · 6e5fb223
      Committed by Peter Zijlstra
      Previously, to probe the working set of a task, we'd use
      a very simple and crude method: mark all of its address
      space PROT_NONE.
      
      That method has various (obvious) disadvantages:
      
       - it samples the working set at dissimilar rates,
         giving some tasks a sampling quality advantage
         over others;
      
       - it creates performance problems for tasks with very
         large working sets;
      
       - it over-samples processes with large address spaces but
         which only very rarely execute.
      
      Improve that method by keeping a rotating offset into the
      address space that marks the current position of the scan,
      and advance it by a constant rate (in a CPU cycles execution
      proportional manner). If the offset reaches the last mapped
      address of the mm, it starts over at the first
      address.
      
      The per-task nature of the working set sampling functionality in this tree
      allows such constant rate, per task, execution-weight proportional sampling
      of the working set, with an adaptive sampling interval/frequency that
      goes from once per 100ms up to just once per 8 seconds.  The current
      sampling volume is 256 MB per interval.
      
      As tasks mature and converge their working set, so does the
      sampling rate slow down to just a trickle, 256 MB per 8
      seconds of CPU time executed.
      
      This, beyond being adaptive, also rate-limits rarely
      executing tasks and does not over-sample on overloaded
      systems.
      
      [ In AutoNUMA speak, this patch deals with the effective sampling
        rate of the 'hinting page fault'. AutoNUMA's scanning is
        currently rate-limited, but it is also fundamentally
        single-threaded, executing in the knuma_scand kernel thread,
        so the limit in AutoNUMA is global and does not scale up with
        the number of CPUs, nor does it scan tasks in an execution
        proportional manner.
      
        So the idea of rate-limiting the scanning was first implemented
        in the AutoNUMA tree via a global rate limit. This patch goes
        beyond that by implementing an execution rate proportional
        working set sampling rate that is not implemented via a single
        global scanning daemon. ]
      
      [ Dan Carpenter pointed out a possible NULL pointer dereference in the
        first version of this patch. ]
      Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
      Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote changelog and fixed bug. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
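      [ A sketch of the rotating window, using the 256 MB volume and the
        field names from the changelog; locking and budget accounting are
        omitted: ]

	static void task_numa_work_sketch(struct mm_struct *mm)
	{
		unsigned long start = mm->numa_scan_offset;
		unsigned long end;
		struct vm_area_struct *vma = find_vma(mm, start);

		if (!vma) {		/* ran past the last mapped address */
			start = 0;
			vma = mm->mmap;
		}
		end = start + (256UL << 20);	/* 256 MB per interval */

		for (; vma && vma->vm_start < end; vma = vma->vm_next)
			change_prot_numa(vma, max(start, vma->vm_start),
					 min(end, vma->vm_end));

		mm->numa_scan_offset = end;	/* resume here next time */
	}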
    • mm: numa: Add fault driven placement and migration · cbee9f88
      Committed by Peter Zijlstra
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for marking PTEs pte_numa from
      scheduler context; when such a PTE is faulted on later, the fault is
      taken on the node the CPU is running on.  In itself this does nothing
      useful, but any placement policy will fundamentally depend on receiving
      placement hints from fault context and doing something intelligent
      about them.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
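      [ A sketch of the two halves of the foundation: a scheduler-side pass
        that marks PTEs pte_numa, and a fault-side hook that records the
        node of the faulting CPU. Function names are simplified: ]

	static void numa_hint_pass(struct vm_area_struct *vma)
	{
		/* scheduler context: mark PTEs pte_numa so that the next
		 * access takes a hinting fault */
		change_prot_numa(vma, vma->vm_start, vma->vm_end);
	}

	static void numa_hint_fault(struct page *page)
	{
		int target_nid = numa_node_id();	/* node of the faulting CPU */

		/* a real placement policy would decide here whether to
		 * migrate @page toward target_nid; this foundation only
		 * generates and counts the hint */
		task_numa_fault(target_nid, 1);
	}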
  2. 01 November 2012 (1 commit)
    • futex: Handle futex_pi OWNER_DIED take over correctly · 59fa6245
      Committed by Thomas Gleixner
      Siddhesh analyzed a failure in the take over of pi futexes in case the
      owner died and provided a workaround.
      See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076
      
      The detailed problem analysis shows:
      
      Futex F is initialized with PTHREAD_PRIO_INHERIT and
      PTHREAD_MUTEX_ROBUST_NP attributes.
      
      T1 lock_futex_pi(F);
      
      T2 lock_futex_pi(F);
         --> T2 blocks on the futex and creates pi_state which is associated
             to T1.
      
      T1 exits
         --> exit_robust_list() runs
             --> Futex F userspace value TID field is set to 0 and
                 FUTEX_OWNER_DIED bit is set.
      
      T3 lock_futex_pi(F);
         --> Succeeds due to the check for F's userspace TID field == 0
         --> Claims ownership of the futex and sets its own TID into the
             userspace TID field of futex F
         --> returns to user space
      
      T1 --> exit_pi_state_list()
             --> Transfers pi_state to waiter T2 and wakes T2 via
             	   rt_mutex_unlock(&pi_state->mutex)
      
      T2 --> acquires pi_state->mutex and gains real ownership of the
             pi_state
         --> Claims ownership of the futex and sets its own TID into the
             userspace TID field of futex F
         --> returns to user space
      
      T3 --> observes inconsistent state
      
      This problem is independent of UP/SMP, preemptible/non preemptible
      kernels, or process shared vs. private. The only difference is that
      certain configurations are more likely to expose it.
      
      So as Siddhesh correctly analyzed the following check in
      futex_lock_pi_atomic() is the culprit:
      
      	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
      
      We check the userspace value for a TID value of 0 and take over the
      futex unconditionally if that's true.
      
      AFAICT this check is there as it is correct for a different corner
      case of futexes: the WAITERS bit became stale.
      
      Now the proposed change
      
      -	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
      +       if (unlikely(ownerdied ||
      +                       !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {
      
      solves the problem, but it's not obvious why, and it wrecks the
      "stale WAITERS bit" case.
      
      What happens is that, due to the WAITERS bit being set (T2 is blocked
      on that futex), T3 is forced to go through lookup_pi_state(), which
      in the above case returns an existing pi_state and therefore forces T3
      to legitimately fight with T2 over the ownership of the pi_state (via
      pi_state->mutex). Problem solved!
      
      Though that does not work for the "WAITERS bit is stale" problem,
      because if lookup_pi_state() does not find an existing pi_state it
      returns -ESRCH (due to TID == 0), which causes futex_lock_pi() to
      return -ESRCH to user space because the OWNER_DIED bit is not set.
      
      Now there is a different solution to that problem. Do not look at the
      user space value at all and enforce a lookup of possibly available
      pi_state. If pi_state can be found, then the new incoming locker T3
      blocks on that pi_state and legitimately races with T2 to acquire the
      rt_mutex and the pi_state and therefore the proper ownership of the
      user space futex.
      
      lookup_pi_state() has the correct order of checks. It first tries to
      find a pi_state associated with the user space futex and only if that
      fails it checks for futex TID value = 0. If no pi_state is available
      nothing can create new state at that point because this happens with
      the hash bucket lock held.
      
      So the above scenario changes to:
      
      T1 lock_futex_pi(F);
      
      T2 lock_futex_pi(F);
         --> T2 blocks on the futex and creates pi_state which is associated
             to T1.
      
      T1 exits
         --> exit_robust_list() runs
             --> Futex F userspace value TID field is set to 0 and
                 FUTEX_OWNER_DIED bit is set.
      
      T3 lock_futex_pi(F);
         --> Finds pi_state and blocks on pi_state->rt_mutex
      
      T1 --> exit_pi_state_list()
             --> Transfers pi_state to waiter T2 and wakes it via
             	   rt_mutex_unlock(&pi_state->mutex)
      
      T2 --> acquires pi_state->mutex and gains ownership of the pi_state
         --> Claims ownership of the futex and sets its own TID into the
             userspace TID field of futex F
         --> returns to user space
      
      This covers all gazillion points on which T3 might come in between
      T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
      also solves the "WAITERS bit stale" problem by forcing the take over.
      
      Another benefit of changing the code this way is that it makes it less
      dependent on untrusted user space values and therefore minimizes the
      possible wreckage which might be inflicted.
      
      As usual after staring for too long at the futex code my brain hurts
      so much that I really want to ditch that whole optimization of
      avoiding the syscall for the non contended case for PI futexes and rip
      out the maze of corner case handling code. Unfortunately we can't as
      user space relies on that existing behaviour, but at least thinking
      about it helps me to preserve my mental sanity. Maybe we should
      nevertheless :)
      Reported-and-tested-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
      Acked-by: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
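      [ A heavily simplified sketch of the fixed ordering; the function
        name is illustrative, and the real code runs with the hash bucket
        lock held and handles many more cases: ]

	static int futex_take_over_sketch(u32 uval, struct futex_hash_bucket *hb,
					  union futex_key *key,
					  struct futex_pi_state **ps)
	{
		int ret;

		/* 1) always try to attach to existing pi_state first; if
		 * T2 is queued this succeeds and the caller blocks on
		 * pi_state->rt_mutex, racing legitimately with T2 */
		ret = lookup_pi_state(uval, hb, key, ps);
		if (ret != -ESRCH)
			return ret;

		/* 2) only when no pi_state exists may the TID == 0 /
		 * OWNER_DIED shortcut take the futex over; nothing can
		 * create new state behind our back while the hash bucket
		 * lock is held */
		if (!(uval & FUTEX_TID_MASK) || (uval & FUTEX_OWNER_DIED))
			return 1;	/* caller claims ownership atomically */

		return -EAGAIN;	/* owner is alive: retry or block */
	}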
  3. 31 October 2012 (1 commit)
    • module: fix out-by-one error in kallsyms · 59ef28b1
      Committed by Rusty Russell
      Masaki found and patched a kallsyms issue: the last symbol in a
      module's symtab wasn't transferred.  This is because we manually copy
      the zero'th entry (which is always empty) then copy the rest in a loop
      starting at 1 while still reading from src[0], so the last symbol is
      dropped.  His fix was minimal; I prefer to
      rewrite the loops in more standard form.
      
      There are two loops: one to get the size, and one to copy.  Make these
      identical: always count entry 0 and any defined symbol in an allocated
      non-init section.
      
      This bug has existed since the following commit was introduced:
         module: reduce symbol table for loaded modules (v2)
         commit: 4a496226
      
      LKML: http://lkml.org/lkml/2012/10/24/27
      Reported-by: Masaki Kimura <masaki.kimura.kz@hitachi.com>
      Cc: stable@kernel.org
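      [ A sketch of the standardized loop shape: one predicate drives both
        the sizing pass and the copying pass so they cannot drift apart.
        is_core_symbol() stands for "defined symbol in an allocated
        non-init section"; the wrapper itself is illustrative: ]

	static unsigned int copy_core_symbols(Elf_Sym *dst, const Elf_Sym *src,
					      unsigned int nsrc,
					      const Elf_Shdr *sechdrs,
					      unsigned int shnum)
	{
		unsigned int i, ndst = 0;

		/* the old code copied entry 0 by hand, then looped from
		 * i = 1 while still reading src[0], losing the last symbol */
		for (i = 0; i < nsrc; i++) {
			if (i != 0 && !is_core_symbol(&src[i], sechdrs, shnum))
				continue;
			if (dst)
				dst[ndst] = src[i];
			ndst++;	/* with dst == NULL this is the sizing pass */
		}
		return ndst;
	}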
  4. 26 October 2012 (2 commits)
    • Makefile: Documentation for external tool should be correct · 2008713c
      Committed by H. Peter Anvin
      If one includes documentation for an external tool, it should be
      correct.  This is not:
      
      1. Overriding the input to rngd should typically be neither
         necessary nor desired.  This is especially so since newer
         versions of rngd support a number of different *types* of sources.
      2. The default kernel-exported device is called /dev/hwrng, not
         /dev/hwrandom nor /dev/hw_random (both of which were used in the
         past; however, kernel and udev seem to have converged on
         /dev/hwrng.)
      
      Overall it is better if the documentation for rngd is kept with rngd
      rather than in a kernel Makefile.
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jeff Garzik <jgarzik@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pidns: limit the nesting depth of pid namespaces · f2302505
      Committed by Andrew Vagin
      'struct pid' is a "variable sized struct" - a header with an array of
      upids at the end.
      
      The size of the array depends on the level (depth) of pid namespaces.
      Currently the nesting level of a pidns is not limited, so 'struct pid'
      can grow beyond one page.
      
      It looks reasonable that it should be less than a page.  MAX_PID_NS_LEVEL
      is not calculated from PAGE_SIZE, because in that case it would depend on
      the architecture and config options, and it would have to be reduced if
      someone added new fields to struct pid or struct upid.
      
      I suggest setting MAX_PID_NS_LEVEL = 32, because it preserves the ability
      to expand "struct pid" and it's more than enough for all use cases known
      to me.  When someone finds a reasonable use case, we can add a config
      option or a sysctl parameter.
      
      In addition it will reduce the effect of another problem, when we have
      many nested namespaces and the oldest one starts dying.
      zap_pid_ns_processes will be called for each namespace, and find_vpid
      will be called for each process in a namespace.  find_vpid will be
      called a minimum of max_level^2 / 2 times.  The reason is that when we
      find a bit in the pidmap, we can't determine whether this pidns is the
      topmost one for this process or not.
      
      find_vpid is a heavy operation, so a fork bomb which creates many nested
      namespaces can make a system inaccessible for a long time.  For example,
      my system becomes inaccessible for a few minutes with 4000 processes.
      
      [akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM]
      Signed-off-by: Andrew Vagin <avagin@openvz.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
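      [ The core of the change is a simple depth check, sketched here; the
        helper name is illustrative: ]

	#define MAX_PID_NS_LEVEL 32

	static int check_pidns_nesting(struct pid_namespace *parent_pid_ns)
	{
		unsigned int level = parent_pid_ns->level + 1;

		/* excessive nesting is a caller error, not memory
		 * pressure, hence -EINVAL rather than -ENOMEM */
		if (level > MAX_PID_NS_LEVEL)
			return -EINVAL;
		return 0;
	}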
  5. 25 October 2012 (1 commit)
  6. 22 October 2012 (1 commit)
  7. 20 October 2012 (6 commits)
  8. 17 October 2012 (2 commits)
  9. 13 October 2012 (5 commits)
    • audit: make audit_inode take struct filename · adb5c247
      Committed by Jeff Layton
      Keep a pointer to the audit_names "slot" in struct filename.
      
      Have all of the audit_inode callers pass a struct filename ponter to
      audit_inode instead of a string pointer. If the aname field is already
      populated, then we can skip walking the list altogether and just use it
      directly.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
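      [ A sketch of the fast path, assuming the slot pointer lives in an
        aname field; audit_find_name() here stands in for the existing
        names_list walk and is hypothetical: ]

	void audit_inode_sketch(struct filename *name, const struct dentry *dentry)
	{
		struct audit_names *n = name->aname;

		if (!n)
			/* slow path: walk context->names_list matching
			 * on name->name, as before */
			n = audit_find_name(name->name);

		audit_copy_inode(n, dentry, dentry->d_inode);
	}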
    • vfs: make path_openat take a struct filename pointer · 669abf4e
      Committed by Jeff Layton
      ...and fix up the callers. For do_file_open_root, just declare a
      struct filename on the stack and fill out the .name field. For
      do_filp_open, make it also take a struct filename pointer, and fix up its
      callers to call it appropriately.
      
      For filp_open, add a variant that takes a struct filename pointer and turn
      filp_open into a wrapper around it.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
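      [ A sketch of the wrapper pattern: the struct-filename variant does
        the real work and the old string interface becomes a thin shim;
        treat the exact signatures as illustrative: ]

	struct file *file_open_name(struct filename *name, int flags, umode_t mode);

	struct file *filp_open(const char *filename, int flags, umode_t mode)
	{
		/* wrap the kernel string in a stack filename and delegate */
		struct filename name = { .name = filename };

		return file_open_name(&name, flags, mode);
	}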
    • audit: allow audit code to satisfy getname requests from its names_list · 7ac86265
      Committed by Jeff Layton
      Currently, if we call getname() on a userland string more than once,
      we'll get multiple copies of the string and multiple audit_names
      records.
      
      Add a function that will allow the audit_names code to satisfy getname
      requests using info from the audit_names list, avoiding a new allocation
      and audit_names records.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
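      [ A sketch of the reuse path in getname(): ask the audit code for an
        already-copied instance before falling back to a fresh copy from
        userspace; the overall shape is illustrative: ]

	struct filename *getname_sketch(const char __user *uptr)
	{
		struct filename *res = audit_reusename(uptr);

		if (res)
			return res;	/* no new copy, no new audit_names record */

		return getname_flags(uptr, 0, NULL);	/* normal path */
	}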
    • vfs: define struct filename and have getname() return it · 91a27b2a
      Committed by Jeff Layton
      getname() is intended to copy pathname strings from userspace into a
      kernel buffer. The result is just a string in kernel space. It would
      however be quite helpful to be able to attach some ancillary info to
      the string.
      
      For instance, we could attach some audit-related info to reduce the
      amount of audit-related processing needed. When auditing is enabled,
      we could also call getname() on the string more than once and not
      need to recopy it from userspace.
      
      This patchset converts the getname()/putname() interfaces to return
      a struct instead of a string. For now, the struct just tracks the
      string in kernel space and the original userland pointer for it.
      
      Later, we'll add other information to the struct as it becomes
      convenient.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
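      [ At this stage the struct is deliberately small; a sketch matching
        the description above: ]

	struct filename {
		const char		*name;	/* pointer to the kernel-space copy */
		const __user char	*uptr;	/* original userland pointer */
		/* ancillary fields (e.g. audit info) are added by later patches */
	};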
    • infrastructure for saner ret_from_kernel_thread semantics · a74fb73c
      Committed by Al Viro
      * allow kernel_execve() to leave the actual return to userland to the
      caller (selected by CONFIG_GENERIC_KERNEL_EXECVE).  Callers
      updated accordingly.
      * an architecture that selects GENERIC_KERNEL_EXECVE in its
      Kconfig should have its ret_from_kernel_thread() do this:
      	- call schedule_tail
      	- call the callback left for it by copy_thread(); if it ever
      	  returns, that's because it has just done a successful kernel_execve()
      	- jump to the return-from-syscall path
      IOW, its only difference from ret_from_fork() is that it does call the
      callback.
      * such an architecture should also get rid of ret_from_kernel_execve()
      and __ARCH_WANT_KERNEL_EXECVE
      
      This is the last part of infrastructure patches in that area - from
      that point on work on different architectures can live independently.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 12 October 2012 (13 commits)