1. 28 May, 2010 3 commits
    • signals: make task_struct->signal immutable/refcountable · ea6d290c
      Committed by Oleg Nesterov
      We have a lot of problems with accessing task_struct->signal, it can
      "disappear" at any moment.  Even current can't use its ->signal safely
      after exit_notify().  ->siglock helps, but it is not convenient, not
      always possible, and sometimes it makes sense to use task->signal even
      after the task is already dead.
      
      This patch adds the reference counter, sigcnt, into signal_struct.  This
      reference is owned by task_struct and it is dropped in
      __put_task_struct().  Perhaps it makes sense to export
      get/put_signal_struct() later, but currently I don't see the immediate
      reason.
      
      Rename __cleanup_signal() to free_signal_struct() and unexport it.  With
      the previous changes it does nothing except kmem_cache_free().
      
      Change __exit_signal() to not clear/free ->signal, it will be freed when
      the last reference to any thread in the thread group goes away.
      
      Note:
      	- when the last thread exits signal->tty can point to nowhere, see
      	  the next patch.
      
      	- with or without this patch signal_struct->count should go away,
      	  or at least it should be "int nr_threads" for fs/proc. This will
      	  be addressed later.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
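The ownership rule described in this commit (each task holds one reference on the shared signal structure, and the structure is freed only when the last task is released) can be sketched in user-space C. This is a minimal illustration, not the kernel code: `struct sig_shared`, `struct task`, `sig_get()`, `sig_put()` and `freed_count` are hypothetical names, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for signal_struct, shared by a thread group. */
struct sig_shared {
	atomic_int sigcnt;	/* one reference per task pointing here */
};

/* Hypothetical stand-in for task_struct. */
struct task {
	struct sig_shared *signal;	/* set once, never cleared while the task exists */
};

static int freed_count;	/* instrumentation for this example only */

static struct sig_shared *sig_alloc(void)
{
	struct sig_shared *sig = malloc(sizeof(*sig));

	if (sig)
		atomic_init(&sig->sigcnt, 0);
	return sig;
}

static void sig_get(struct sig_shared *sig)
{
	atomic_fetch_add(&sig->sigcnt, 1);
}

static void sig_put(struct sig_shared *sig)
{
	/* fetch_sub returns the old value: 1 means this was the last reference. */
	int old = atomic_fetch_sub(&sig->sigcnt, 1);

	assert(old > 0);
	if (old == 1) {
		free(sig);
		freed_count++;
	}
}

static struct task *task_create(struct sig_shared *sig)
{
	struct task *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	sig_get(sig);
	t->signal = sig;
	return t;
}

/* Analogue of the final task release: the reference is dropped only here. */
static void task_release(struct task *t)
{
	sig_put(t->signal);
	free(t);
}
```

With two tasks sharing one structure, releasing the first leaves `->signal` valid for the second; the structure is freed only by the second release.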
    • exit: change zap_other_threads() to count sub-threads · 09faef11
      Committed by Oleg Nesterov
      Change zap_other_threads() to return the number of other sub-threads found
      on ->thread_group list.
      
      Other changes are cosmetic:
      
      	- change the code to use while_each_thread() helper
      
      	- remove the obsolete comment about SIGKILL/SIGSTOP
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
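A rough user-space sketch of the counting loop: walk a circular thread list starting from the caller, skip the caller itself, and return how many other threads were visited. `struct thread` and `zap_others()` are hypothetical names; the loop shape only imitates what a while_each_thread()-style helper does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thread node on a circular thread-group list. */
struct thread {
	struct thread *next;
	int killed;
};

/* Mark every thread except `self` as killed; return the number found. */
static int zap_others(struct thread *self)
{
	struct thread *t = self;
	int count = 0;

	assert(self != NULL && self->next != NULL);
	/* Advance first, then compare: the caller is never visited. */
	while ((t = t->next) != self) {
		count++;
		t->killed = 1;
	}
	return count;
}
```

For a single-threaded group (a node whose `next` points to itself) the loop body never runs and the result is 0.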
    • cpusets: new round-robin rotor for SLAB allocations · 6adef3eb
      Committed by Jack Steiner
      We have observed several workloads running on multi-node systems where
      memory is assigned unevenly across the nodes in the system.  There are
      numerous reasons for this but one is the round-robin rotor in
      cpuset_mem_spread_node().
      
      For example, a simple test that writes a multi-page file will allocate
      pages on nodes 0 2 4 6 ...  Odd nodes are skipped.  (Sometimes it
      allocates on odd nodes & skips even nodes).
      
      An example is shown below.  The program "lfile" writes a file consisting
      of 10 pages.  The program then mmaps the file & uses get_mempolicy(...,
      MPOL_F_NODE) to determine the nodes where the file pages were allocated.
      The output is shown below:
      
      	# ./lfile
      	 allocated on nodes: 2 4 6 0 1 2 6 0 2
      
      There is a single rotor that is used for allocating both file pages & slab
      pages.  Writing the file allocates both a data page & a slab page
      (buffer_head).  This advances the RR rotor 2 nodes for each page
      allocated.
      
      A quick test seems to confirm this is the cause of the uneven
      allocation:
      
      	# echo 0 >/dev/cpuset/memory_spread_slab
      	# ./lfile
      	 allocated on nodes: 6 7 8 9 0 1 2 3 4 5
      
      This patch introduces a second rotor that is used for slab allocations.
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
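The skipping effect can be reproduced with a small simulation. This is a sketch under stated assumptions: an 8-node system, a rotor that simply increments modulo the node count, and exactly one slab allocation per file page. `rotor_next()` and `write_file()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

#define NR_NODES 8	/* assumed node count for the illustration */

/* Advance a round-robin rotor and return the node it selects. */
static int rotor_next(int *rotor)
{
	*rotor = (*rotor + 1) % NR_NODES;
	return *rotor;
}

/*
 * Write `pages` file pages. Each page takes one data page from the page
 * rotor and one slab object (think buffer_head) from the slab rotor.
 * Returns a bitmask of the nodes that received data pages.
 */
static unsigned int write_file(int pages, int *page_rotor, int *slab_rotor)
{
	unsigned int used = 0;

	assert(pages >= 0);
	for (int i = 0; i < pages; i++) {
		used |= 1u << rotor_next(page_rotor);
		(void)rotor_next(slab_rotor);
	}
	return used;
}
```

Passing the same rotor for both arguments models the old behaviour: the rotor advances twice per page, so data pages land on every other node. Passing two separate rotors models the patch, and the data pages then cover all nodes.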
  2. 25 May, 2010 2 commits
    • kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN · 4be929be
      Committed by Alexey Dobriyan
      - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
        USHORT_MAX/SHORT_MAX/SHORT_MIN.
      
      - Make SHRT_MIN of type s16, not int, for consistency.
      
      [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
      [akpm@linux-foundation.org: fix security/keys/keyring.c]
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Committed by Miao Xie
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing all
      old unallowed bits later.  But in between, the allocator may find that
      there is no node to allocate memory on.
      
      The reason is that when cpuset rebinds the task's mempolicy, it clears the
      nodes on which the allocator can allocate pages, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes this problem by expanding the node range first (setting
      newly allowed bits) and shrinking it lazily (clearing newly disallowed
      bits).  We use a variable to tell the write-side task that the read-side
      task is reading the nodemask, and the write-side task clears the newly
      disallowed nodes after the read-side task ends the current memory allocation.
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
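The interleaving in the table above can be replayed deterministically. The sketch below is an assumption-laden model, not the kernel mechanism: a two-node system, an allocator that scans nodes in order, and a writer that runs between the check of node 0 and the check of node 1. It models only the ordering of the mask updates, not the variable the patch uses to detect an in-progress reader. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 2	/* assumed: nodes 0 and 1, as in the table */

/* Writer callback, invoked after the allocator has rejected `after_node`. */
typedef void (*writer_step)(unsigned int *mask, int after_node);

/* Scan nodes in order; return the first allowed node, or -1 ("goto oom"). */
static int try_alloc(unsigned int *mask, writer_step writer)
{
	assert(mask != NULL && writer != NULL);
	for (int node = 0; node < NR_NODES; node++) {
		if (*mask & (1u << node))
			return node;
		writer(mask, node);	/* the other task runs here */
	}
	return -1;
}

/* Old behaviour: set the new bit and clear the old bit in one go. */
static void eager_rebind(unsigned int *mask, int after_node)
{
	if (after_node != 0)
		return;
	*mask |= 1u << 0;
	*mask &= ~(1u << 1);
}

/* Patched behaviour, step 1: only expand while a reader may be active. */
static void lazy_expand(unsigned int *mask, int after_node)
{
	if (after_node == 0)
		*mask |= 1u << 0;
}

/* Patched behaviour, step 2: shrink after the reader has finished. */
static void lazy_shrink(unsigned int *mask)
{
	*mask &= ~(1u << 1);
}
```

Starting from a mask that allows only node 1, the eager writer makes the scan fail even though a valid node existed at every instant; with the expand-then-shrink order the same scan succeeds on node 1, and the mask still ends up allowing only node 0.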
  3. 12 May, 2010 1 commit
    • revert "procfs: provide stack information for threads" and its fixup commits · 34441427
      Committed by Robin Holt
      Originally, commit d899bf7b ("procfs: provide stack information for
      threads") attempted to introduce a new feature for showing where the
      thread stack was located and how many pages are being utilized by the
      stack.
      
      Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
      applied to fix the NO_MMU case.
      
      Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
      64-bit") was applied to fix a bug in ia32 executables being loaded.
      
      Commit 9ebd4eba ("procfs: fix /proc/<pid>/stat stack pointer for kernel
      threads") was applied to fix a bug which had kernel threads printing a
      userland stack address.
      
      Commit 1306d603 ('proc: partially revert "procfs: provide stack
      information for threads"') was then applied to revert the stack pages
      being used to solve a significant performance regression.
      
      This patch nearly undoes the effect of all these patches.
      
      The reason for reverting these is it provides an unusable value in
      field 28.  For x86_64, a fork will result in the task->stack_start
      value being updated to the current user top of stack and not the stack
      start address.  This unpredictability of the stack_start value makes
      it worthless.  That includes the intended use of showing how much stack
      space a thread has.
      
      Other architectures will get different values.  As an example, ia64
      gets 0.  The do_fork() and copy_process() functions appear to treat the
      stack_start and stack_size parameters as architecture specific.
      
      I only partially reverted c44972f1 ("procfs: disable per-task stack usage
      on NOMMU").  If I had completely reverted it, I would have had to change
      mm/Makefile to only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
      configured.  Since I could not test the builds without significant effort,
      I decided to not change mm/Makefile.
      
      I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
      information for threads on 64-bit").  I left the KSTK_ESP() change in
      place as that seemed worthwhile.
      Signed-off-by: Robin Holt <holt@sgi.com>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 23 April, 2010 1 commit
  5. 03 April, 2010 3 commits
    • sched: Add enqueue/dequeue flags · 371fd7e7
      Committed by Peter Zijlstra
      In order to reduce the dependency on TASK_WAKING rework the enqueue
      interface to support a proper flags field.
      
      Replace the int wakeup, bool head arguments with an int flags argument
      and create the following flags:
      
        ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
        ENQUEUE_WAKING - the enqueue has relative vruntime due to
                         having sched_class::task_waking() called,
        ENQUEUE_HEAD - the waking task should be placed at the head
                       of the priority queue (where appropriate).
      
      For symmetry also convert sched_class::dequeue() to a flags scheme.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
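A small sketch of the interface change: the separate `wakeup` and `head` arguments collapse into one flags word whose bits can be tested independently. The numeric flag values, `enqueue_flags_from_legacy()`, `struct enqueue_decision` and this `enqueue_task()` are assumptions made for the illustration, not the scheduler's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values are assumptions chosen for this sketch. */
#define ENQUEUE_WAKEUP	0x1	/* enqueue is a wakeup of a sleeping task */
#define ENQUEUE_WAKING	0x2	/* vruntime is relative (task_waking() ran) */
#define ENQUEUE_HEAD	0x4	/* place at the head of the queue */

#define ENQUEUE_ALL	(ENQUEUE_WAKEUP | ENQUEUE_WAKING | ENQUEUE_HEAD)

/* Translate the old (int wakeup, bool head) argument pair into flags. */
static int enqueue_flags_from_legacy(int wakeup, bool head)
{
	int flags = 0;

	if (wakeup)
		flags |= ENQUEUE_WAKEUP;
	if (head)
		flags |= ENQUEUE_HEAD;
	return flags;
}

/* Hypothetical record of what an enqueue would do, for illustration. */
struct enqueue_decision {
	bool is_wakeup;
	bool relative_vruntime;
	bool at_head;
};

static struct enqueue_decision enqueue_task(int flags)
{
	struct enqueue_decision d;

	assert((flags & ~ENQUEUE_ALL) == 0);	/* reject unknown bits */
	d.is_wakeup = (flags & ENQUEUE_WAKEUP) != 0;
	d.relative_vruntime = (flags & ENQUEUE_WAKING) != 0;
	d.at_head = (flags & ENQUEUE_HEAD) != 0;
	return d;
}
```

A flags word leaves room for conditions such as ENQUEUE_WAKING that the old two-argument form could not express without adding yet another parameter.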
    • sched: Fix TASK_WAKING vs fork deadlock · 0017d735
      Committed by Peter Zijlstra
      Oleg noticed a few races with the TASK_WAKING usage on fork.
      
       - since TASK_WAKING is basically a spinlock, it should be IRQ safe
       - since we set TASK_WAKING (*) without holding rq->lock, there could
         still be a rq->lock holder, thereby not actually
         providing full serialization.
      
      (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.
      
      Cure the second issue by not setting TASK_WAKING in sched_fork(), but
      only temporarily in wake_up_new_task() while calling select_task_rq().
      
      Cure the first by holding rq->lock around the select_task_rq() call,
      which disables IRQs.  This, however, requires that we push down the
      rq->lock release into select_task_rq_fair()'s cgroup stuff.
      
      Because select_task_rq_fair() still needs to drop the rq->lock we
      cannot fully get rid of TASK_WAKING.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: _cpu_down(): Don't play with current->cpus_allowed · 6a1bdc1b
      Committed by Oleg Nesterov
      _cpu_down() changes the current task's affinity and then recovers it at
      the end. The problems are well known: we can't restore old_allowed if it
      was bound to the now-dead-cpu, and we can race with the userspace which
      can change cpu-affinity during unplug.
      
      _cpu_down() should not play with current->cpus_allowed at all. Instead,
      take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
      removes the dying cpu from cpu_online_mask.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091023.GA9148@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 26 March, 2010 1 commit
    • x86, perf, bts, mm: Delete the never used BTS-ptrace code · faa4602e
      Committed by Peter Zijlstra
      Support for the PMU's BTS features has been upstreamed in
      v2.6.32, but we still have the old and disabled ptrace-BTS,
      as Linus noticed it not so long ago.
      
      It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
      regard for other uses (perf) and doesn't provide the flexibility
      needed for perf either.
      
      Its users are ptrace-block-step and ptrace-bts.  ptrace-bts was
      never used, and ptrace-block-step can be implemented using a
      much simpler approach.
      
      So axe all 3000 lines of it. That includes the *locked_memory*()
      APIs in mm/mlock.c as well.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Markus Metzger <markus.t.metzger@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <20100325135413.938004390@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 13 March, 2010 1 commit
  8. 12 March, 2010 3 commits
    • sched: Remove avg_overlap · e12f31d3
      Committed by Mike Galbraith
      Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy
      was detrimentally affected by cross-cpu wakeups, this because we are missing
      the necessary call to update_curr().  This can't be fixed without increasing
      overhead in our already too fat fastpath.
      
      Additionally, with recent load balancing changes making us prefer to place tasks
      in an idle cache domain (which is good for compute bound loads), communicating
      tasks suffer when a sync wakeup, which would enable affine placement, is turned
      into a non-sync wakeup by SYNC_LESS.  With one task on the runqueue, wake_affine()
      rejects the affine wakeup request, leaving the unfortunate where placed, taking
      frequent cache misses.
      
      Remove it, and recover some fastpath cycles.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301121.6785.30.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Remove avg_wakeup · b42e0c41
      Committed by Mike Galbraith
      Testing the load which led to this heuristic (nfs4 kbuild) shows that it has
      outlived its usefulness.  With intervening load balancing changes, I cannot
      see any difference with/without, so recover those fastpath cycles.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301062.6785.29.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Rate-limit nohz · 39c0cbe2
      Committed by Mike Galbraith
      Entering nohz code on every micro-idle is costing ~10% throughput for netperf
      TCP_RR when scheduling cross-cpu.  Rate limiting entry fixes this, but raises
      ticks a bit.  On my Q6600, an idle box goes from ~85 interrupts/sec to 128.
      
      The higher the context switch rate, the more nohz entry costs.  With this patch
      and some cycle recovery patches in my tree, max cross cpu context switch rate is
      improved by ~16%, a large portion of which is this ratelimiting.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301003.6785.28.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  9. 11 March, 2010 1 commit
  10. 07 March, 2010 2 commits
  11. 04 March, 2010 1 commit
  12. 23 February, 2010 1 commit
  13. 16 February, 2010 1 commit
  14. 09 February, 2010 1 commit
  15. 01 February, 2010 1 commit
    • softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume · d6ad3e28
      Committed by Jason Wessel
      When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, sched_clock() gets
      the time from hardware such as the TSC on x86. In this
      configuration kgdb will report a softlockup warning message on
      resuming or detaching from a debug session.
      
      Sequence of events in the problem case:
      
       1) "cpu sched clock" and "hardware time" are at 100 sec prior
          to a call to kgdb_handle_exception()
      
       2) Debugger waits in kgdb_handle_exception() for 80 sec and on
          exit the following is called ...  touch_softlockup_watchdog() -->
          __raw_get_cpu_var(touch_timestamp) = 0;
      
       3) "cpu sched clock" = 100s (it was not updated, because the
          interrupt was disabled in kgdb) but the "hardware time" = 180 sec
      
       4) The first timer interrupt after resuming from
          kgdb_handle_exception updates the watchdog from the "cpu sched clock"
      
      update_process_times() {
          ...
          run_local_timers() --> softlockup_tick()
            --> check (touch_timestamp == 0)
                (it is "YES" here, we have set "touch_timestamp = 0" at kgdb)
            --> __touch_softlockup_watchdog()
                ***(A)--> reset "touch_timestamp" to "get_timestamp()"
                (Here, the "touch_timestamp" will still be set to 100s.)
          ...
          scheduler_tick()
                ***(B)--> sched_clock_tick()
                (update "cpu sched clock" to "hardware time" = 180s)
          ...
      }
      
       5) The second timer interrupt handler appears to have a large
          jump and trips the softlockup warning.
      
      update_process_times() {
          ...
          run_local_timers() --> softlockup_tick()
            --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s
            --> printk "soft lockup error messages"
          ...
      }
      
      note: ***(A) reset "touch_timestamp" to
      "get_timestamp(this_cpu)"
      
      Why is "touch_timestamp" 100 sec, instead of 180 sec?
      
      When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, the call trace of
      get_timestamp() is:
      
      get_timestamp(this_cpu)
       -->cpu_clock(this_cpu)
       -->sched_clock_cpu(this_cpu)
       -->__update_sched_clock(sched_clock_data, now)
      
      The __update_sched_clock() function uses the GTOD tick value to
      create a window to normalize the "now" values.  So if "now"
      value is too big for sched_clock_data, it will be ignored.
      
      The fix is to invoke sched_clock_tick() to update "cpu sched
      clock" in order to recover from this state.  This is done by
      introducing the function touch_softlockup_watchdog_sync(). This
      allows kgdb to request that the sched clock is updated when the
      watchdog thread runs the first time after a resume from kgdb.
      
      [yong.zhang0@gmail.com: Use per cpu instead of an array]
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com>
      Cc: kgdb-bugreport@lists.sourceforge.net
      Cc: peterz@infradead.org
      LKML-Reference: <1264631124-4837-2-git-send-email-jason.wessel@windriver.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
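The sequence of events above can be replayed with plain integers. The sketch below is a simplified model using the numbers from the text (100 s before the stop, 80 s in the debugger, a 60 s threshold); `struct cpu_state`, `timer_interrupt()`, `kgdb_resume()` and the `sync_needed` field are hypothetical stand-ins for the real per-cpu state and functions.

```c
#include <assert.h>
#include <stdbool.h>

#define SOFTLOCKUP_THRESH 60	/* seconds, from the "> 60s" in the text */

struct cpu_state {
	long hw_time;		/* "hardware time" */
	long sched_clock;	/* "cpu sched clock", may be stale */
	long touch_timestamp;	/* 0 means "reset on the next tick" */
	bool sync_needed;	/* models a touch_softlockup_watchdog_sync() request */
};

static void sched_clock_tick(struct cpu_state *c)
{
	c->sched_clock = c->hw_time;
}

/* Debugger held the CPU for `secs` seconds, then resumed. */
static void kgdb_resume(struct cpu_state *c, long secs, bool use_sync)
{
	assert(secs >= 0);
	c->hw_time += secs;		/* hardware kept counting */
	c->touch_timestamp = 0;		/* watchdog was "touched" */
	c->sync_needed = use_sync;
}

/* One timer interrupt; returns true if a soft lockup would be reported. */
static bool timer_interrupt(struct cpu_state *c)
{
	bool warn = false;

	if (c->touch_timestamp == 0) {
		if (c->sync_needed) {		/* the fix: refresh the clock first */
			sched_clock_tick(c);
			c->sync_needed = false;
		}
		c->touch_timestamp = c->sched_clock;	/* step (A) */
	} else if (c->sched_clock - c->touch_timestamp > SOFTLOCKUP_THRESH) {
		warn = true;
	}
	sched_clock_tick(c);				/* step (B) */
	return warn;
}
```

Without the sync request, the first interrupt stores the stale 100 s and the second one sees 180 - 100 = 80 > 60 and warns. With the sync request, the timestamp is taken from the refreshed clock and no warning fires.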
  16. 30 January, 2010 1 commit
    • Split 'flush_old_exec' into two functions · 221af7f8
      Committed by Linus Torvalds
      'flush_old_exec()' is the point of no return when doing an execve(), and
      it is pretty badly misnamed.  It doesn't just flush the old executable
      environment, it also starts up the new one.
      
      Which is very inconvenient for things like setting up the new
      personality, because we want the new personality to affect the starting
      of the new environment, but at the same time we do _not_ want the new
      personality to take effect if flushing the old one fails.
      
      As a result, the x86-64 '32-bit' personality is actually done using this
      insane "I'm going to change the ABI, but I haven't done it yet" bit
      (TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
      personality, but just the "pending" bit, so that "flush_thread()" can do
      the actual personality magic.
      
      This patch in no way changes any of that insanity, but it does split the
      'flush_old_exec()' function up into a preparatory part that can fail
      (still called flush_old_exec()), and a new part that will actually set
      up the new exec environment (setup_new_exec()).  All callers are changed
      to trivially comply with the new world order.
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 23 January, 2010 1 commit
  18. 21 January, 2010 2 commits
  19. 17 January, 2010 1 commit
  20. 04 January, 2010 1 commit
    • resource: add helpers for fetching rlimits · 3e10e716
      Committed by Jiri Slaby
      We want to be sure that the compiler fetches the limit variable only
      once, so add helpers for fetching current and maximal resource
      limits which do that.
      
      Add them to sched.h (instead of resource.h) due to circular dependency
       sched.h->resource.h->task_struct
      Alternative would be to create a separate res_access.h or similar.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: James Morris <jmorris@namei.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
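A user-space sketch of the "fetch only once" idea: the helper reads the limit through a volatile-qualified pointer so the compiler emits a single load, and callers then work on the local copy. Everything here is illustrative: `struct limit`, `struct task`, `READ_ONCE_UL()`, `LIMIT_INFINITY`, `task_limit_cur()`, `task_limit_max()` and `within_limit()` are hypothetical names, loosely modelled on the kind of helper the commit describes.

```c
#include <assert.h>

#define NR_LIMITS 16		/* assumed table size */
#define LIMIT_INFINITY (~0UL)	/* assumed "no limit" marker */

/* Force exactly one load of x by going through a volatile pointer. */
#define READ_ONCE_UL(x) (*(const volatile unsigned long *)&(x))

struct limit {
	unsigned long cur;	/* soft limit */
	unsigned long max;	/* hard limit */
};

struct task {
	struct limit rlim[NR_LIMITS];
};

static inline unsigned long task_limit_cur(const struct task *t, unsigned int which)
{
	assert(which < NR_LIMITS);
	return READ_ONCE_UL(t->rlim[which].cur);
}

static inline unsigned long task_limit_max(const struct task *t, unsigned int which)
{
	assert(which < NR_LIMITS);
	return READ_ONCE_UL(t->rlim[which].max);
}

/* Both comparisons use the same snapshot, even if another thread updates it. */
static int within_limit(const struct task *t, unsigned int which, unsigned long want)
{
	unsigned long lim = task_limit_cur(t, which);

	return lim == LIMIT_INFINITY || want <= lim;
}
```

Without the single fetch, a check such as `within_limit()` could compare against two different values if the limit changed between the two reads.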
  21. 18 December, 2009 1 commit
  22. 17 December, 2009 6 commits
  23. 16 December, 2009 4 commits
    • signals: kill force_sig_specific() · ad09750b
      Committed by Oleg Nesterov
      Kill force_sig_specific(), this trivial wrapper has no callers.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: SEND_SIG_NOINFO should be considered as SI_FROMUSER() · 614c517d
      Committed by Oleg Nesterov
      No changes in compiled code. The patch adds the new helper, si_fromuser()
      and changes check_kill_permission() to use this helper.
      
      The real effect of this patch is that from now on we "officially" consider
      SEND_SIG_NOINFO signals to be "from user-space" signals. This is already true
      if we look at the code which uses SEND_SIG_NOINFO, except __send_signal()
      has another opinion - see the next patch.
      
      The naming of these special SEND_SIG_XXX siginfo's is really bad
      imho.  From __send_signal()'s pov they mean
      
      	SEND_SIG_NOINFO		from user
      	SEND_SIG_PRIV		from kernel
      	SEND_SIG_FORCED		no info
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
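A user-space model of the classification described above, written as a sketch: the three special values are small integers cast to pointers, and a real siginfo counts as "from user" when its code is zero or negative. The struct, the sentinel values and the `si_code <= 0` rule are assumptions of this model, so treat it as an illustration of the idea rather than the kernel's definition of the helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, minimal stand-in for struct siginfo. */
struct sig_info {
	int si_code;
};

/* Special sentinel "pointers"; the concrete values are assumptions. */
#define SEND_SIG_NOINFO	((struct sig_info *)0)	/* from user */
#define SEND_SIG_PRIV	((struct sig_info *)1)	/* from kernel */
#define SEND_SIG_FORCED	((struct sig_info *)2)	/* no info */

static int is_special(const struct sig_info *info)
{
	return (uintptr_t)info <= (uintptr_t)SEND_SIG_FORCED;
}

/* Dereferencing is only valid for real siginfo pointers. */
static int si_code_of(const struct sig_info *info)
{
	assert(!is_special(info));
	return info->si_code;
}

/* "From user" means: the NOINFO sentinel, or a real siginfo with code <= 0. */
static int si_fromuser(const struct sig_info *info)
{
	return info == SEND_SIG_NOINFO ||
	       (!is_special(info) && si_code_of(info) <= 0);
}
```

The point of a single helper is that permission checks and the signal-sending path can agree on one definition of "from user" instead of each testing the sentinels separately.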
    • memcg: coalesce uncharge during unmap/truncate · 569b846d
      Committed by KAMEZAWA Hiroyuki
      In a massively parallel environment, res_counter can be a performance
      bottleneck.  One strong technique to reduce lock contention is reducing
      calls by coalescing some number of calls into one.
      
      Considering charge/uncharge characteristics,
      	- charge is done one by one via demand-paging.
      	- uncharge is done by
      		- in chunk at munmap, truncate, exit, execve...
      		- one by one via vmscan/paging.
      
      It seems we have a chance to coalesce uncharges for improving scalability
      at unmap/truncation.
      
      This patch is for coalescing uncharges.  To avoid scattering memcg's
      structure to functions under /mm, this patch adds memcg batch uncharge
      information to the task.  A reason for per-task batching is to make use
      of the caller's context information.  We do batched uncharge (delayed
      uncharge) when truncation/unmap occurs but do direct uncharge when
      uncharge is called by memory reclaim (vmscan.c).
      
      The degree of coalescing depends on the caller:
        - at invalidate/truncate... pagevec size
        - at unmap ....ZAP_BLOCK_SIZE
      (memory itself will be freed in this degree.)
      So we will not coalesce too much.
      
      On an x86-64 8-CPU server, I tested the overhead of memcg at page fault by
      running a program which does map/fault/unmap in a loop, running
      one task per CPU via taskset and summing the number of page faults
      in 60 seconds.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 miss/faults
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 miss/faults
      [child with this patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 miss/faults
      
      We can see some improvement.
      (The root cgroup isn't affected by this patch.)
      Another patch for "charge" will follow this, and the above will improve further.
      
      Changelog(since 2009/10/02):
       - renamed fields of memcg_batch (as pages to bytes, memsw to memsw_bytes)
       - some clean up and commentary/description updates.
       - added initialize code to copy_process(). (possible bug fix)
      
      Changelog(old):
       - fixed !CONFIG_MEM_CGROUP case.
       - rebased onto the latest mmotm + softlimit fix patches.
       - unified patch for callers
       - added comments.
       - make ->do_batch a bool.
       - removed css_get() et al. We don't need it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
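The coalescing idea can be sketched as follows: while a batch is open, uncharges only accumulate a byte count, and the shared counter is touched once when the batch ends. This is a simplified model with hypothetical names (`struct counter`, `struct uncharge_batch`, `uncharge_start()`, `uncharge()`, `uncharge_end()`); a `lock_trips` field counts how often the contended lock would have been taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared counter; lock_trips counts lock acquisitions. */
struct counter {
	unsigned long usage;		/* bytes currently charged */
	unsigned long lock_trips;
};

static void counter_uncharge(struct counter *c, unsigned long bytes)
{
	assert(bytes <= c->usage);
	c->lock_trips++;		/* stands in for lock + unlock */
	c->usage -= bytes;
}

/* Per-task batching state, as the commit message describes it. */
struct uncharge_batch {
	bool do_batch;
	struct counter *counter;	/* counter the pending bytes belong to */
	unsigned long bytes;
};

static void uncharge_start(struct uncharge_batch *b)
{
	b->do_batch = true;
	b->counter = NULL;
	b->bytes = 0;
}

static void uncharge(struct uncharge_batch *b, struct counter *c, unsigned long bytes)
{
	/* Direct path: no batch open, or the batch belongs to another counter. */
	if (!b->do_batch || (b->counter && b->counter != c)) {
		counter_uncharge(c, bytes);
		return;
	}
	b->counter = c;
	b->bytes += bytes;
}

static void uncharge_end(struct uncharge_batch *b)
{
	if (b->counter && b->bytes)
		counter_uncharge(b->counter, b->bytes);
	b->do_batch = false;
	b->counter = NULL;
	b->bytes = 0;
}
```

Sixteen page uncharges inside one batch cost a single trip to the counter, while the same calls outside a batch (the reclaim case in the text) each take the lock.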
    • task_struct: make journal_info conditional · e4c570c4
      Committed by Hiroshi Shimamoto
      journal_info in task_struct is used only by journaling file systems.  So
      introduce CONFIG_FS_JOURNAL_INFO and make it conditional.
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>