  1. 19 Jun, 2009 4 commits
  2. 17 Jun, 2009 12 commits
    • slow-work: use round_jiffies() for thread pool's cull and OOM timers · 009789f0
      Chris Peterson committed
      Round the slow work queue's cull and OOM timeouts to whole-second boundaries
      with round_jiffies().  The slow work queue uses a pair of timers to cull
      idle threads and, after OOM, to delay new thread creation.
      
      This patch also extracts the mod_timer() logic for the cull timer into a
      separate helper function.
      
      Rounding non-time-critical timers such as these to whole seconds batches
      them up to fire at the same time rather than being spread out.
      This lets the CPU wake up less often, which saves power.
      Signed-off-by: Chris Peterson <cpeterso@cpeterso.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
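The batching effect described in the slow-work commit can be illustrated with a small userspace sketch. It is not the kernel's round_jiffies(): the tick rate `SKETCH_HZ` and the function name `sketch_round_ticks` are hypothetical, and the quarter-second threshold is an assumption made for illustration. The real helper also applies a per-CPU skew and refuses to return a time in the past, both of which are left out here.

```c
#include <assert.h>

/* Hypothetical tick rate for this sketch; the kernel's HZ is a build-time constant. */
#define SKETCH_HZ 250UL

/*
 * Round an absolute tick count to a whole-second boundary.
 * Values just past a boundary (first quarter second) are rounded down,
 * everything else is rounded up to the next second.
 */
static unsigned long sketch_round_ticks(unsigned long ticks)
{
    unsigned long rem = ticks % SKETCH_HZ;
    unsigned long rounded;

    if (rem < SKETCH_HZ / 4)
        rounded = ticks - rem;              /* round down */
    else
        rounded = ticks - rem + SKETCH_HZ;  /* round up */

    assert(rounded % SKETCH_HZ == 0);       /* always lands on a second boundary */
    return rounded;
}
```

Two timers that would expire at ticks 1100 and 1240 both round to 1250 in this sketch, so a single wakeup can serve both.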
    • groups: move code to kernel/groups.c · 30639b6a
      Alexey Dobriyan committed
      Move the supplementary groups implementation to kernel/groups.c.
      kernel/sys.c has already accumulated quite a lot of random stuff.
      
      Do strictly copy/paste + add required headers to compile.  Compile-tested
      on many configs and archs.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove CONFIG_UNEVICTABLE_LRU config option · 68377659
      KOSAKI Motohiro committed
      Currently, nobody wants to turn UNEVICTABLE_LRU off.  Thus this
      configurability is unnecessary.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, PM/Freezer: Disable OOM killer when tasks are frozen · 7f33d49a
      Rafael J. Wysocki committed
      Currently, the following scenario appears to be possible in theory:
      
      * Tasks are frozen for hibernation or suspend.
      * Free pages are almost exhausted.
      * A certain piece of code in the suspend code path attempts to allocate
        some memory using GFP_KERNEL and allocation order less than or
        equal to PAGE_ALLOC_COSTLY_ORDER.
      * __alloc_pages_internal() cannot find a free page so it invokes the
        OOM killer.
      * The OOM killer attempts to kill a task, but the task is frozen, so
        it doesn't die immediately.
      * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
        to find a free page and invokes the OOM killer.
      * No progress can be made.
      
      Although it is now hard to trigger during hibernation due to the memory
      shrinking carried out by the hibernation code, it is theoretically
      possible to trigger during suspend after the memory shrinking has been
      removed from that code path.  Moreover, since memory allocations are
      going to be used for the hibernation memory shrinking, it will be even
      more likely to happen during hibernation.
      
      To prevent it from happening, introduce the oom_killer_disabled switch
      that will cause __alloc_pages_internal() to fail in the situations in
      which the OOM killer would have been called and make the freezer set
      this switch after tasks have been successfully frozen.
      
      [akpm@linux-foundation.org: be nicer to the namespace]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • page allocator: do not check NUMA node ID when the caller knows the node is valid · 6484eb3e
      Mel Gorman committed
      Callers of alloc_pages_node() can optionally specify -1 as a node to mean
      "allocate from the current node".  However, a number of the callers in
      fast paths know for a fact their node is valid.  To avoid a comparison and
      branch, this patch adds alloc_pages_exact_node() that only checks the nid
      with VM_BUG_ON().  Callers that know their node is valid are then
      converted.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Paul Mundt <lethal@linux-sh.org> [for the SLOB NUMA bits]
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
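The difference between the general entry point and the "exact node" variant in the page allocator commit can be sketched in userspace as follows. The names `sketch_pick_node`, `sketch_pick_exact_node`, `sketch_current_node` and `SKETCH_MAX_NODES` are hypothetical; the sketch only models the node selection, not the allocation itself, and `assert()` stands in for the role that VM_BUG_ON() plays in the patch.

```c
#include <assert.h>

#define SKETCH_MAX_NODES 4            /* hypothetical number of NUMA nodes */

static int sketch_current_node = 1;   /* stand-in for "the current node" */

/* General path: a negative nid means "allocate from the current node". */
static int sketch_pick_node(int nid)
{
    if (nid < 0)                      /* the comparison and branch the patch avoids */
        nid = sketch_current_node;
    return nid;
}

/* Fast path: the caller guarantees nid is valid, so only a debug check remains. */
static int sketch_pick_exact_node(int nid)
{
    assert(nid >= 0 && nid < SKETCH_MAX_NODES);
    return nid;
}
```

In a build without debug checks the second function reduces to returning its argument, which is the saving the commit message describes for fast-path callers.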
    • cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Miao Xie committed
      Fix the allocation of page cache/slab objects on a disallowed node when
      memory spread is set, by updating tasks' mems_allowed as soon as their
      cpuset's mems is changed.

      To update tasks' mems_allowed in time, we must modify the memory policy
      code, because the memory policy was originally applied in the process's
      own context.  After this patch, one task directly manipulates another's
      mems_allowed, and we use alloc_lock in the task_struct to protect the
      mems_allowed and memory policy of the task.

      In the fast path, however, we do not take a lock to protect them, because
      adding a lock may lead to a performance regression.  Without a lock, the
      task might see no nodes at all while the cpuset's mems_allowed is changed
      to some non-overlapping set.  To avoid that, we first set all newly
      allowed nodes, then clear the newly disallowed ones.
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mempolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES' flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
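The "set first, then clear" ordering from the cpuset commit can be shown with a plain bitmask. This is only a single-threaded illustration under stated assumptions: `sketch_nodemask` and the three function names are hypothetical, one bit stands for one node, and the memory-ordering details a lockless reader would need are ignored.

```c
#include <assert.h>

typedef unsigned long sketch_nodemask;   /* hypothetical: one bit per node */

/* Step 1: grant every newly allowed node while keeping the old ones. */
static sketch_nodemask sketch_set_new_nodes(sketch_nodemask cur, sketch_nodemask next)
{
    return cur | next;
}

/* Step 2: drop the nodes that are no longer allowed. */
static sketch_nodemask sketch_clear_old_nodes(sketch_nodemask cur, sketch_nodemask next)
{
    return cur & next;
}

/* Switch to a new, non-empty mask without ever passing through an empty one. */
static void sketch_change_mems(sketch_nodemask *mems_allowed, sketch_nodemask next)
{
    assert(next != 0);
    *mems_allowed = sketch_set_new_nodes(*mems_allowed, next);
    /* A reader looking here sees old | new, which cannot be empty. */
    *mems_allowed = sketch_clear_old_nodes(*mems_allowed, next);
}
```

Doing the steps in the opposite order would, for non-overlapping old and new sets, leave a window in which the mask is empty, which is the situation the commit message wants to avoid.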
    • cpusets: update tasks' page/slab spread flags in time · 950592f7
      Miao Xie committed
      Fix the bug where the kernel did not spread page cache/slab objects evenly
      over all the allowed nodes when spread flags were set, by updating tasks'
      page/slab spread flags in time.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpusets: restructure the function cpuset_update_task_memory_state() · f3b39d47
      Miao Xie committed
      The kernel still allocates page cache on the old node after its cpuset's
      mems is modified while 'memory_spread_page' is set, or it does not spread
      the page cache evenly over all the nodes that the faulting task is allowed
      to use after memory_spread_page is set.  This is caused by the task's stale
      mem_allowed and flags: the current kernel does not update them unless some
      function invokes cpuset_update_task_memory_state(), which is sometimes too
      late.  We must update the mem_allowed and the flags of the tasks in time.
      
      Slab has the same problem.
      
      The following patches fix this bug by updating tasks' mem_allowed and
      spread flag after its cpuset's mems or spread flag is changed.
      
      This patch:
      
      Extract a function from cpuset_update_task_memory_state().  It will be
      used later to update tasks' page/slab spread flags after their cpuset's
      flag is set.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: add KERN_DEFAULT loglevel to print_modules() · b231125a
      Linus Torvalds committed
      Several WARN_ON() messages omit the '\n' at the end of the string, which
      is a simple (and understandable) error.  The next line printed after
      that warning line is usually the current module list, and that printk
      does not have a log-level marker - resulting in one long mixed-up line.
      
      Adding this loglevel marker will now avoid this unreadable mess.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: Add KERN_DEFAULT printk log-level · e28d7137
      Linus Torvalds committed
      This adds a KERN_DEFAULT loglevel marker, for when you cannot decide
      which loglevel you want, and just want to keep an existing printk
      with the default loglevel.
      
      The difference between having KERN_DEFAULT and having no log-level
      marker at all is two-fold:
      
       - having the log-level marker will now force a new-line if the
         previous printout had not added one (perhaps because it forgot,
         but perhaps because it expected a continuation)
      
       - having a log-level marker is required if you are printing out a
         message that could otherwise itself be mistaken for a log-level
         marker.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: clean up handling of log-levels and newlines · 5fd29d6c
      Linus Torvalds committed
      It used to be that we would only look at the log-level in a printk()
      after explicit newlines, which can cause annoying problems when the
      previous printk() did not end with a '\n'. In that case, the log-level
      marker would be just printed out in the middle of the line, and be
      seen as just noise rather than change the logging level.
      
      This changes things to always look at the log-level in the first
      bytes of the printout. If a log level marker is found, it is always
      used as the log-level. Additionally, if no newline existed, one is
      added (unless the log-level is the explicit KERN_CONT marker, to
      explicitly show that it's a continuation of a previous line).
      Acked-by: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
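The log-level and newline rule described in the printk cleanup commit can be modelled with a small userspace buffer. This is a sketch under assumptions, not the kernel's printk(): `struct sketch_log` and the `sketch_*` functions are hypothetical, and the marker strings "<0>" to "<7>", "<d>" (default) and "<c>" (continuation) are assumed to match the markers of that kernel era.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_DEFAULT_LEVEL 4       /* hypothetical default log level */

struct sketch_log {
    char buf[256];                   /* collected output, always NUL-terminated */
    size_t len;
    bool at_line_start;              /* true if the last character was '\n' */
    int last_level;                  /* level taken from the most recent marker */
};

static void sketch_log_init(struct sketch_log *log)
{
    log->buf[0] = '\0';
    log->len = 0;
    log->at_line_start = true;
    log->last_level = SKETCH_DEFAULT_LEVEL;
}

static void sketch_put(struct sketch_log *log, char c)
{
    if (log->len + 1 < sizeof(log->buf)) {
        log->buf[log->len++] = c;
        log->buf[log->len] = '\0';
        log->at_line_start = (c == '\n');
    }
}

/* Always inspect the first bytes of the message for a log-level marker. */
static void sketch_printk(struct sketch_log *log, const char *msg)
{
    if (msg[0] == '<' && msg[1] != '\0' && msg[2] == '>') {
        char m = msg[1];

        if ((m >= '0' && m <= '7') || m == 'd') {
            /* A real level starts a new line if the previous one was left open. */
            if (!log->at_line_start)
                sketch_put(log, '\n');
            log->last_level = (m == 'd') ? SKETCH_DEFAULT_LEVEL : m - '0';
            msg += 3;
        } else if (m == 'c') {
            msg += 3;                /* explicit continuation: no forced newline */
        }
    }
    while (*msg)
        sketch_put(log, *msg++);
}
```

With this model, a warning printed without a trailing newline followed by a "<d>"-prefixed module list ends up on two separate lines, while a "<c>"-prefixed fragment continues the current line.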
  3. 16 Jun, 2009 2 commits
    • debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem. · 156f5a78
      GeunSik Lim committed
      Developers mount the debugfs filesystem for ftrace on "/debug/",
      "/debugfs/", or "/sys/kernel/debug/", following the
      ./Documentation/tracers/ftrace.txt file.

      All three directory names (/debug/, /debugfs/, /sys/kernel/debug/)
      appear as debugfs mount points in the kernel source, for example in
      ftrace, DRM, Wireless, Documentation, and Network [sky2] files.

      debugfs is the debug filesystem by Greg Kroah-Hartman, meant to make
      debugging easy.  "/sys/kernel/debug/" is the suitable directory name
      for the debugfs filesystem.
      - debugfs related reference: http://lwn.net/Articles/334546/

      Fix the inconsistent directory names used to mount the debugfs filesystem.
      
      * From Steven Rostedt
        - find_debugfs() and tracing_files() in this patch.
      Signed-off-by: GeunSik Lim <geunsik.lim@samsung.com>
      Acked-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: James Smart <james.smart@emulex.com>
      Cc: Jiri Kosina <trivial@kernel.org>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Masami Hiramatsu <mhiramat@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • sched: delayed cleanup of user_struct · 3959214f
      Kay Sievers committed
      During bootup performance tracing we see repeated occurrences of
      /sys/kernel/uid/* events for the same uid, leading to a,
      in this case, rather pointless userspace processing for the
      same uid over and over.
      
      This is usually caused by tools which change their uid to "nobody",
      to run without privileges to read data supplied by untrusted users.
      
      This change delays the execution of the (already existing) scheduled
      work that cleans up the uid by one second, so the allocated and announced
      uid can possibly be re-used by another process.
      
      This is the current behavior, where almost every invocation of a
      binary, which changes the uid, creates two events:
        $ read START < /sys/kernel/uevent_seqnum; \
        for i in `seq 100`; do su --shell=/bin/true bin; done; \
        read END < /sys/kernel/uevent_seqnum; \
        echo $(($END - $START))
        178
      
      With the delayed cleanup, we get only two events, and userspace finishes
      a bit faster too:
        $ read START < /sys/kernel/uevent_seqnum; \
        for i in `seq 100`; do su --shell=/bin/true bin; done; \
        read END < /sys/kernel/uevent_seqnum; \
        echo $(($END - $START))
        1
      Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  4. 15 Jun, 2009 3 commits
    • signal: fix __send_signal() false positive kmemcheck warning · 7a0aeb14
      Vegard Nossum committed
      This false positive is due to field padding in struct sigqueue. When
      this dynamically allocated structure is copied to the stack (in arch-
      specific delivery code), kmemcheck sees a read from the padding, which
      is, naturally, uninitialized.
      
      Hide the false positive using the __GFP_NOTRACK_FALSE_POSITIVE flag.
      Also made the rlimit override code a bit clearer by introducing a new
      variable.
      
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
    • trace: annotate bitfields in struct ring_buffer_event · 1744a21d
      Vegard Nossum committed
      This gets rid of a heap of false-positive warnings from the tracer
      code due to the use of bitfields.
      
      [rebased for mainline inclusion]
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
    • kmemcheck: add mm functions · 2dff4405
      Vegard Nossum committed
      With kmemcheck enabled, the slab allocator needs to do this:
      
      1. Tell kmemcheck to allocate the shadow memory which stores the status of
         each byte in the allocation proper, e.g. whether it is initialized or
         uninitialized.
      2. Tell kmemcheck which parts of memory should be marked uninitialized.
         There are actually a few more states, such as "not yet allocated" and
         "recently freed".
      
      If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
      memory that can take page faults because of kmemcheck.
      
      If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
      request memory with the __GFP_NOTRACK flag. This does not prevent the page
      faults from occurring, but it marks the object in question as being
      initialized so that no warnings will ever be produced for this object.
      
      In addition to (and in contrast to) __GFP_NOTRACK, the
      __GFP_NOTRACK_FALSE_POSITIVE flag indicates that the allocation should
      not be tracked _because_ it would produce a false positive. Their values
      are identical, but need not be so in the future (for example, we could now
      enable/disable false positives with a config option).
      
      Parts of this patch were contributed by Pekka Enberg but merged for
      atomicity.
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      
      [rebased for mainline inclusion]
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
  5. 13 Jun, 2009 12 commits
  6. 12 Jun, 2009 7 commits
    • sched: export kick_process · b43e3521
      Rusty Russell committed
      lguest needs kick_process: wake_up_process() does nothing if a process
      is running, which isn't sufficient (we need it in the kernel).
      
      And lguest support is usually modular.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Add forward/backward attribute ABI compatibility · 974802ea
      Peter Zijlstra committed
      Provide for means of extending the perf_counter_attr in a 'natural' way.
      
      We allow growing the structure by appending fields at the end by specifying
      the full structure size inside it.
      
      When a new kernel sees a smaller (old) structure, it will 0 pad the tail.
      When an old kernel sees a larger (new) structure, it will verify the tail
      consists of 0s, otherwise fail.
      
      If we fail due to a size-mismatch, we return -E2BIG and write the kernel's
      native attribute size back into the provided structure.
      
      Furthermore, add some attribute verification, so that we'll fail counter
      creation when unknown bits are present (PERF_SAMPLE, PERF_FORMAT, or in
      the __reserved fields).
      
      (This ABI detail is introduced while keeping the existing syscall ABI.)
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
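The size-based compatibility scheme in the perf_counter commit can be sketched in userspace roughly as follows. `struct sketch_attr`, `SKETCH_ATTR_SIZE_VER0` and `sketch_copy_attr` are hypothetical names, the layout is not the real perf_counter_attr, and plain memory copies stand in for the user/kernel copy helpers the kernel would use. The sketch assumes the caller's buffer is at least as large as the size it declares.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute layout, grown by appending fields at the end. */
struct sketch_attr {
    uint32_t type;
    uint32_t size;         /* structure size as known to the caller */
    uint64_t config;
    uint64_t newer_field;  /* stands for a field added in a later version */
};

/* Size of the first version of the layout: type, size, config. */
#define SKETCH_ATTR_SIZE_VER0 offsetof(struct sketch_attr, newer_field)

/* Copy a caller-provided attribute of possibly different size into dst. */
static int sketch_copy_attr(struct sketch_attr *dst, void *uattr)
{
    unsigned char *src = uattr;
    uint32_t ksize = sizeof(*dst);
    uint32_t usize;

    assert(dst != NULL && uattr != NULL);

    memcpy(&usize, src + offsetof(struct sketch_attr, size), sizeof(usize));
    if (usize < SKETCH_ATTR_SIZE_VER0)
        goto err_size;

    if (usize > ksize) {
        /* Newer caller: the tail we do not understand must be all zeroes. */
        for (uint32_t i = ksize; i < usize; i++)
            if (src[i] != 0)
                goto err_size;
        usize = ksize;
    }

    /* Older caller: fields it does not know about stay zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, usize);
    return 0;

err_size:
    /* Report the size this side understands back to the caller. */
    memcpy(src + offsetof(struct sketch_attr, size), &ksize, sizeof(ksize));
    return -E2BIG;
}
```

A caller that receives -E2BIG can read the size field back and retry with a structure of that size, which is what makes the scheme work in both directions.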
    • perf_counter: Remove PERF_TYPE_RAW special casing · 081fad86
      Peter Zijlstra committed
      The PERF_TYPE_RAW special case seems superfluous these days. Remove
      it and add it to the switch() stmt like the others.
      
      [ Impact: cleanup ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • module: trim exception table on init free. · ad6561df
      Rusty Russell committed
      It's theoretically possible that there are exception table entries
      which point into the (freed) init text of modules.  These could cause
      future problems if other modules get loaded into that memory and cause
      an exception as we'd see the wrong fixup.  The only case I know of is
      kvm-intel.ko (when CONFIG_CC_OPTIMIZE_FOR_SIZE=n).
      
      Amerigo fixed this long-standing FIXME in the x86 version, but this
      patch is more general.
      
      This implements trim_init_extable(); most archs are simple since they
      use the standard lib/extable.c sort code.  Alpha and IA64 use relative
      addresses in their fixups, so their trimming is a slight variation.
      
      Sparc32 is unique; it doesn't seem to define ARCH_HAS_SORT_EXTABLE,
      yet it defines its own sort_extable() which overrides the one in lib.
      It doesn't sort, so we have to mark deleted entries instead of
      actually trimming them.
      Inspired-by: Amerigo Wang <amwang@redhat.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: linux-alpha@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
    • module_param: allow 'bool' module_params to be bool, not just int. · fddd5201
      Rusty Russell committed
      Impact: API cleanup
      
      For historical reasons, 'bool' parameters must be an int, not a bool.
      But there are around 600 users, so a conversion seems like useless churn.
      
      So we use __same_type() to distinguish, and handle both cases.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module_param: split perm field into flags and perm · 45fcc70c
      Rusty Russell committed
      Impact: cleanup
      
      Rather than hack KPARAM_KMALLOCED into the perm field, separate it out.
      Since the perm field was 32 bits and only needs 16, we don't add bloat.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • module_param: invbool should take a 'bool', not an 'int' · 9a71af2c
      Rusty Russell committed
      It takes an 'int' for historical reasons, and there are only two
      users: simply switch it over to bool.
      
      The other user (uvesafb.c) will get a (harmless-on-x86) warning until
      the next patch is applied.
      
      Cc: Brad Douglas <brad@neruo.com>
      Cc: Michal Januszewski <spock@gentoo.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>