1. 19 December 2012, 3 commits
  2. 18 December 2012, 9 commits
  3. 15 December 2012, 3 commits
  4. 14 December 2012, 8 commits
    • Revert "sched: Update_cfs_shares at period edge" · 17bc14b7
      Committed by Linus Torvalds
      This reverts commit f269ae04.
      
      It turns out it causes a very noticeable interactivity regression with
      CONFIG_SCHED_AUTOGROUP (test-case: "make -j32" of the kernel in a
      terminal window, while scrolling in a browser - the autogrouping means
      that the two end up in separate cgroups, and the browser should be
      smooth as silk despite the high load).
      
      Says Paul Turner:
       "It seems that the update-throttling on the wake-side is reducing the
        interactive tasks' ability to preempt.  While I suspect the right
        longer term answer here is force these updates only in the
        cross-cgroup case; this is less trivial.  For this release I believe
        the right answer is either going to be a revert or restore the updates
        on the enqueue-side."
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Bisected-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Paul Turner <pjt@google.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17bc14b7
    •
    • MODSIGN: Avoid using .incbin in C source · 919aa45e
      Committed by Takashi Iwai
      Using the asm .incbin statement in C sources breaks any gcc wrapper which
      assumes that preprocessed C source is self-contained. Use a separate .S
      file to include the signing key and certificate.
      
      [ This means we no longer need SYMBOL_PREFIX which is defined in kernel.h
        from cbdbf2ab, so I removed it -- RR ]
      Tested-by: Michal Marek <mmarek@suse.cz>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: James Hogan <james.hogan@imgtec.com>
      919aa45e
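
      A minimal sketch of the approach, assuming illustrative file and symbol
      names (the real tree may differ): the binary blobs are pulled in by a
      small standalone .S file, and the C side only sees extern symbols, so
      the preprocessed C translation unit stays self-contained.

        /* sketch of the separate assembly file (certs.S, name illustrative):
         *
         *         .section ".init.data","aw"
         *         .globl modsign_certificate_list
         * modsign_certificate_list:
         *         .incbin "signing_key.x509"
         *         .incbin "extra_certificates"
         *         .globl modsign_certificate_list_end
         * modsign_certificate_list_end:
         */

        /* the C side is now plain, wrapper-safe C: */
        extern const unsigned char modsign_certificate_list[];
        extern const unsigned char modsign_certificate_list_end[];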
    • modules: don't hand 0 to vmalloc. · 82fab442
      Committed by Rusty Russell
      In commit d0a21265 David Rientjes unified various archs'
      module_alloc implementations (including x86's) and removed the gratuitous
      shortcut for size == 0.
      
      Then, in commit de7d2b56, Joe Perches added a warning for
      zero-length vmallocs, which can happen without kallsyms on modules
      with no init sections (e.g. zlib_deflate).
      
      Fix this once and for all; the module code has to handle zero length
      anyway, so get it right at the caller and remove the now-gratuitous
      checks within the arch-specific module_alloc implementations.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=42608
      Reported-by: Conrad Kostecki <ConiKost@gmx.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      82fab442
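
      A hedged sketch of the shape of the fix (the function name here is
      made up, not the patch's): the generic loader handles size == 0 once,
      so the arch module_alloc hooks no longer need the check.

        /* kernel-context sketch; module_region_alloc is an illustrative name */
        static void *module_region_alloc(unsigned long size)
        {
        	/* e.g. a module with no init sections asks for 0 bytes */
        	if (size == 0)
        		return NULL;
        	return module_alloc(size);	/* arch code may assume size > 0 */
        }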
    • module: Remove an extra null character at the top of module->strtab. · 54523ec7
      Committed by Satoru Takeuchi
      There is an extra null character ('\0') at the top of module->strtab for
      each module. Commit 59ef28b1 introduced this bug; this patch fixes it.
      
      Live dump log of the current Linus git kernel (HEAD is 2844a487):
      ============================================================================
      crash> mod | grep loop
      ffffffffa01db0a0  loop             16689  (not loaded)  [CONFIG_KALLSYMS]
      crash> module.core_symtab ffffffffa01db0a0
        core_symtab = 0xffffffffa01db320
      crash> rd 0xffffffffa01db320 12
      ffffffffa01db320:  0000005500000001 0000000000000000   ....U...........
      ffffffffa01db330:  0000000000000000 0002007400000002   ............t...
      ffffffffa01db340:  ffffffffa01d8000 0000000000000038   ........8.......
      ffffffffa01db350:  001a00640000000e ffffffffa01daeb0   ....d...........
      ffffffffa01db360:  00000000000000a0 0002007400000019   ............t...
      ffffffffa01db370:  ffffffffa01d8068 000000000000001b   h...............
      crash> module.core_strtab ffffffffa01db0a0
        core_strtab = 0xffffffffa01dbb30 ""
      crash> rd 0xffffffffa01dbb30 4
      ffffffffa01dbb30:  615f70616d6b0000 66780063696d6f74   ..kmap_atomic.xf
      ffffffffa01dbb40:  73636e75665f7265 72665f646e696600   er_funcs.find_fr
      ============================================================================
      
      We expect just one leading '\0' byte, but there are actually two. Here
      is the relationship between symtab and strtab.
      
      	symtab_idx	strtab_idx	symbol
      	-----------------------------------------------
      	0		0x1		"\0" # strtab_idx should be 0
      	1		0x2		"kmap_atomic"
      	2		0xe		"xfer_funcs"
      	3		0x19		"find_fr..."
      
      By applying this patch, it becomes as follows.
      
      	symtab_idx	strtab_idx	symbol
      	-----------------------------------------------
      	0		0x0		"\0"	# extra byte is removed
      	1		0x1		"kmap_atomic"
      	2		0xd		"xfer_funcs"
      	3		0x18		"find_fr..."
      Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Cc: Masaki Kimura <masaki.kimura.kz@hitachi.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      54523ec7
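
      The indexing is easy to demonstrate in plain userspace C; a small demo
      of the fixed layout (this is not the kernel code, just the arithmetic):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
        	char strtab[64];
        	size_t s = 0;

        	strtab[s++] = '\0';	/* the single mandatory leading NUL */

        	size_t kmap = s;	/* 0x1, matching the fixed table */
        	s += strlen(strcpy(&strtab[s], "kmap_atomic")) + 1;

        	size_t xfer = s;	/* 0xd, matching the fixed table */
        	s += strlen(strcpy(&strtab[s], "xfer_funcs")) + 1;

        	printf("kmap_atomic at 0x%zx, xfer_funcs at 0x%zx\n", kmap, xfer);
        	return 0;	/* the bug wrote a second '\0' first, shifting
        			   every index up by one, as in the first table */
        }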
    • security: introduce kernel_module_from_file hook · 2e72d51b
      Committed by Kees Cook
      Now that kernel module origins can be reasoned about, provide a hook to
      the LSMs to make policy decisions about the module file. This will let
      Chrome OS enforce that loadable kernel modules can only come from its
      read-only hash-verified root filesystem. Other LSMs can, for example,
      read extended attributes for signatures, etc.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Acked-by: Eric Paris <eparis@redhat.com>
      Acked-by: Mimi Zohar <zohar@us.ibm.com>
      Acked-by: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      2e72d51b
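
      A hedged sketch of what an LSM implementation of the hook might look
      like; the policy test is a hypothetical helper, not an API from the
      patch:

        /* kernel-context sketch */
        static int example_kernel_module_from_file(struct file *file)
        {
        	/* a NULL file means old-style init_module() from a memory
        	 * blob: there is no origin to reason about */
        	if (!file)
        		return -EPERM;
        	/* hypothetical helper standing in for a real policy check */
        	if (!file_on_verified_rootfs(file))
        		return -EPERM;
        	return 0;
        }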
    • module: add flags arg to sys_finit_module() · 2f3238ae
      Committed by Rusty Russell
      Thanks to Michael Kerrisk for keeping us honest.  These flags are actually
      useful for eliminating the only case where kmod has to mangle a module's
      internals: for overriding module versioning.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Acked-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
      Acked-by: Kees Cook <keescook@chromium.org>
      2f3238ae
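
      For reference, the flags this introduced, as found in the released
      uapi header (include/uapi/linux/module.h; worth verifying against your
      tree):

        #define MODULE_INIT_IGNORE_MODVERSIONS	1  /* skip symbol version CRC checks */
        #define MODULE_INIT_IGNORE_VERMAGIC	2  /* skip the vermagic string check */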
    • module: add syscall to load module from fd · 34e1169d
      Committed by Kees Cook
      As part of the effort to create a stronger boundary between root and
      kernel, Chrome OS wants to be able to enforce that kernel modules are
      being loaded only from our read-only crypto-hash verified (dm_verity)
      root filesystem. Since the init_module syscall hands the kernel a module
      as a memory blob, no reasoning about the origin of the blob can be made.
      
      Earlier proposals for appending signatures to kernel modules would not be
      useful in Chrome OS, since it would involve adding an additional set of
      keys to our kernel and builds for no good reason: we already trust the
      contents of our root filesystem. We don't need to verify those kernel
      modules a second time. Having to do signature checking on module loading
      would slow us down and be redundant. All we need to know is where a
      module is coming from so we can say yes/no to loading it.
      
      If a file descriptor is used as the source of a kernel module, many more
      things can be reasoned about. In Chrome OS's case, we could enforce that
      the module lives on the filesystem we expect it to live on.  In the case
      of IMA (or other LSMs), it would be possible, for example, to examine
      extended attributes that may contain signatures over the contents of
      the module.
      
      This introduces a new syscall (on x86), similar to init_module, that has
      only two arguments. The first argument is used as a file descriptor to
      the module and the second argument is a pointer to a NUL-terminated
      string of module arguments.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (merge fixes)
      34e1169d
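
      A minimal userspace sketch of the call as it ended up: three arguments,
      once the flags from the entry above landed alongside it. The module
      path is illustrative, and SYS_finit_module needs reasonably recent
      libc headers.

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
        	/* illustrative path; any .ko you are entitled to load */
        	int fd = open("/lib/modules/example.ko", O_RDONLY);
        	if (fd < 0) {
        		perror("open");
        		return 1;
        	}
        	/* fd, NUL-terminated module args, flags (0 = no overrides) */
        	if (syscall(SYS_finit_module, fd, "", 0) != 0) {
        		perror("finit_module");
        		return 1;
        	}
        	return 0;
        }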
  5. 13 December 2012, 3 commits
  6. 11 December 2012, 11 commits
    • mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node · 5bca2303
      Committed by Mel Gorman
      Because migrations are driven by the CPU a task is running on, there is
      no point tracking NUMA faults until a task runs on a new node. This
      patch tracks the first node used by an address space. Until
      it changes, PTE scanning is disabled and no NUMA hinting faults are
      trapped. This should help workloads that are short-lived, do not care
      about NUMA placement or have bound themselves to a single node.
      
      This takes advantage of the logic in "mm: sched: numa: Implement slow
      start for working set sampling" to delay when the checks are made. This
      will take advantage of processes that set their CPU and node bindings
      early in their lifetime. It will also potentially allow any initial load
      balancing to take place.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      5bca2303
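
      A hedged sketch of the gate this describes (field and constant names
      are approximations of the patch, not verified):

        /* kernel-context sketch: in the scan path */
        if (mm->first_nid == NUMA_PTE_SCAN_INIT)
        	mm->first_nid = numa_node_id();		/* remember the first node */
        if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
        	if (mm->first_nid == numa_node_id())
        		return;				/* still on the first node: skip */
        	mm->first_nid = NUMA_PTE_SCAN_ACTIVE;	/* new node: scanning is on */
        }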
    • mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG · 3105b86a
      Committed by Mel Gorman
      The "mm: sched: numa: Control enabling and disabling of NUMA balancing"
      depends on scheduling debug being enabled but it's perfectly legimate to
      disable automatic NUMA balancing even without this option. This should
      take care of it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      3105b86a
    • mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Committed by Mel Gorman
      This patch adds Kconfig options and kernel parameters to allow the
      enabling and disabling of automatic NUMA balancing. The existence
      of such a switch was and is very important when debugging problems
      related to transparent hugepages, and we should have the same for
      automatic NUMA placement.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      1a687c2e
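
      Usage, sketched from what the released feature exposes (hedged; verify
      the exact names in your tree's Documentation/kernel-parameters.txt):

        # build time
        CONFIG_NUMA_BALANCING=y

        # boot time, on the kernel command line
        numa_balancing=enable
        numa_balancing=disable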
    • mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Committed by Mel Gorman
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault, up to sysctl_numa_balancing_scan_period_max, which
      is now higher than the previous default. Once every minute it will
      reset the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      b8593bfd
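
      A hedged sketch of the backoff (names approximate; the once-a-minute
      reset lives in the scanner itself):

        /* kernel-context sketch: in the hinting-fault accounting path */
        if (!migrated)
        	p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
        				  p->numa_scan_period * 2);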
    • sched: numa: Slowly increase the scanning period as NUMA faults are handled · fb003b80
      Committed by Mel Gorman
      Currently the rate of scanning for an address space is controlled
      by the individual tasks. The next scan is simply determined by
      2*p->numa_scan_period.
      
      The 2*p->numa_scan_period is arbitrary and never changes. At this point
      there is still no proper policy that decides if a task or process is
      properly placed. It just scans and assumes the next NUMA fault will
      place it properly. As it is assumed that pages will get properly placed
      over time, increase the scan window each time a fault is incurred. This
      is a big assumption as noted in the comments.
      
      It should be noted that changing to p->numa_scan_period will increase
      system CPU usage because now the scanning rate has effectively doubled.
      If that is a problem then the min_rate should be made 200ms instead of
      restoring the 2* logic.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      fb003b80
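
      And the growth side, as a hedged sketch (compare the migrate-conditional
      backoff in the entry above; the increment is approximate):

        /* kernel-context sketch: the next scan now fires after
         * p->numa_scan_period (not 2x), and the period itself creeps
         * up a little on every handled fault */
        p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
        			  p->numa_scan_period + 10);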
    • mm: numa: Rate limit setting of pte_numa if node is saturated · e14808b4
      Committed by Mel Gorman
      If there are a large number of NUMA hinting faults and all of them
      are resulting in migrations it may indicate that memory is just
      bouncing uselessly around. NUMA balancing cost is likely exceeding
      any benefit from locality. Rate limit the PTE updates if the node
      is migration rate-limited. As noted in the comments, this distorts
      the NUMA faulting statistics.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      e14808b4
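
      A hedged sketch of the guard (the rate-limit predicate is my guess at
      the helper this tree used):

        /* kernel-context sketch: in the PTE-update path */
        if (migrate_ratelimited(numa_node_id()))
        	return;		/* node already refusing migrations; marking
        			   more PTEs would only distort the stats */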
    • mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Committed by Peter Zijlstra
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they better stick to the node
      they were started on. As tasks mature and rebalance to other CPUs
      and nodes, so does their NUMA placement have to change and so
      does it start to matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
      
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward; most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms
      tunable knob.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      4b96a29b
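
      A hedged sketch of the mechanics (the tunable is named in the
      changelog; the field name and default are approximations):

        /* kernel-context sketch: at mm setup, push the first scan out by
         * the numa_balancing_scan_delay_ms tunable (1000ms default) */
        mm->numa_next_scan = jiffies +
        	msecs_to_jiffies(sysctl_numa_balancing_scan_delay);

        /* in the scanner: short-lived tasks never get past this test */
        if (time_before(jiffies, mm->numa_next_scan))
        	return;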
    • sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges · 9f40604c
      Committed by Mel Gorman
      By accounting against the present PTEs, scanning speed reflects the
      actual present (mapped) memory.
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      9f40604c
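
      A hedged sketch of the accounting change (the return-value contract is
      inferred from the description):

        /* kernel-context sketch: charge the scan budget by PTEs actually
         * present, not by bytes of virtual address range walked */
        nr = change_prot_numa(vma, start, end);	/* returns PTEs updated */
        pages -= nr;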
    • mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate · 6e5fb223
      Committed by Peter Zijlstra
      Previously, to probe the working set of a task, we'd use
      a very simple and crude method: mark all of its address
      space PROT_NONE.
      
      That method has various (obvious) disadvantages:
      
       - it samples the working set at dissimilar rates,
         giving some tasks a sampling quality advantage
         over others.
      
       - creates performance problems for tasks with very
         large working sets
      
       - over-samples processes with large address spaces but
         which only very rarely execute
      
      Improve that method by keeping a rotating offset into the
      address space that marks the current position of the scan,
      and advance it by a constant rate (in a CPU cycles execution
      proportional manner). If the offset reaches the last mapped
      address of the mm, it starts over at the first address.
      
      The per-task nature of the working set sampling functionality in this tree
      allows such constant rate, per task, execution-weight proportional sampling
      of the working set, with an adaptive sampling interval/frequency that
      goes from once per 100ms up to just once per 8 seconds.  The current
      sampling volume is 256 MB per interval.
      
      As tasks mature and converge their working set, so does the
      sampling rate slow down to just a trickle, 256 MB per 8
      seconds of CPU time executed.
      
      This, beyond being adaptive, also rate-limits rarely
      executing systems and does not over-sample on overloaded
      systems.
      
      [ In AutoNUMA speak, this patch deals with the effective sampling
        rate of the 'hinting page fault'. AutoNUMA's scanning is
        currently rate-limited, but it is also fundamentally
        single-threaded, executing in the knuma_scand kernel thread,
        so the limit in AutoNUMA is global and does not scale up with
        the number of CPUs, nor does it scan tasks in an execution
        proportional manner.
      
        So the idea of rate-limiting the scanning was first implemented
        in the AutoNUMA tree via a global rate limit. This patch goes
        beyond that by implementing an execution rate proportional
        working set sampling rate that is not implemented via a single
        global scanning daemon. ]
      
      [ Dan Carpenter pointed out a possible NULL pointer dereference in the
        first version of this patch. ]
      Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
      Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote changelog and fixed bug. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      6e5fb223
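
      A hedged sketch of the rotating scan (the 256MB chunk is from the
      changelog; the VMA walk is illustrative):

        /* kernel-context sketch */
        unsigned long start = mm->numa_scan_offset;
        unsigned long end   = min(start + (256UL << 20), TASK_SIZE);

        for (vma = find_vma(mm, start); vma && vma->vm_start < end;
             vma = vma->vm_next)
        	change_prot_numa(vma, max(start, vma->vm_start),
        			      min(end, vma->vm_end));

        /* wrap to the first address once the last mapped address is passed */
        mm->numa_scan_offset = (end >= TASK_SIZE) ? 0 : end;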
    • mm: numa: Add fault driven placement and migration · cbee9f88
      Committed by Peter Zijlstra
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for marking PTEs pte_numa from
      scheduler context; when such a PTE faults later, the page is faulted
      onto the node the CPU is running on. In itself this does nothing useful, but any
      placement policy will fundamentally depend on receiving hints on placement
      from fault context and doing something intelligent about it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      cbee9f88
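
      A hedged sketch of the two halves (scheduler side marks, fault side
      hints; the fault-side shape is approximate):

        /* kernel-context sketch */

        /* 1) scanner, from scheduler context: make PTEs fault on next use */
        change_prot_numa(vma, start, end);		/* sets pte_numa */

        /* 2) fault path: the access itself is the placement hint */
        if (pte_numa(pte)) {
        	pte = pte_mknonnuma(pte);		/* clear the marker */
        	task_numa_fault(numa_node_id(), 1);	/* record: 1 page, this node */
        	/* any real placement policy builds on hints like this one */
        }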
    • Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled" · c1ad41f1
      Committed by Ingo Molnar
      This reverts commit 5258f386,
      because the underlying autogroups bug got fixed upstream in
      a better way, via:
      
        fd8ef117 Revert "sched, autogroup: Stop going ahead if autogroup is disabled"
      
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c1ad41f1
  7. 07 December 2012, 1 commit
  8. 06 December 2012, 2 commits