1. 16 Nov 2014, 1 commit
    • sched/numa: Init numa balancing fields of init_task · d8b163c4
      Authored by Kirill Tkhai
      We do not initialize init_task.numa_preferred_nid,
      but this value is inherited by userspace "init"
      process:
      
      rest_init()->kernel_thread(kernel_init)->do_fork(CLONE_VM);
      
      __sched_fork()
      {
      	if (clone_flags & CLONE_VM)
      		p->numa_preferred_nid = current->numa_preferred_nid;
      	else
      		p->numa_preferred_nid = -1;
      }
      
      kernel_init() becomes userspace "init" process.
      
So we propagate a garbage nid to userspace, and it may be used
during NUMA balancing.
      
There are no reports of this causing a problem so far, but it seems
we should set it to be sure.
      
Even if init_task.numa_preferred_nid happens to be zero, we may run
on an unusual configuration without node #0. On sparc64, where
processors are numbered physically, I have seen a machine without
cpu#1 while cpu#2 existed; something similar may be possible with
NUMA nodes. So let's initialize the field and be sure we're safe.
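
The fix itself just initializes the NUMA balancing state in
init_task's static initializer. A minimal sketch, assuming the usual
init_task.h macro style of that era (the field list is illustrative,
not the verbatim patch):

	#ifdef CONFIG_NUMA_BALANCING
	/* Give init_task a well-defined "no preferred node" state. */
	# define INIT_NUMA_BALANCING(tsk)				\
		.numa_preferred_nid = -1,				\
		.numa_group = NULL,					\
		.numa_faults = NULL,
	#else
	# define INIT_NUMA_BALANCING(tsk)
	#endif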
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sergey Dyasly <dserrg@gmail.com>
Link: http://lkml.kernel.org/r/1415699189.15631.6.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d8b163c4
  2. 30 Oct 2014, 1 commit
  3. 08 Sep 2014, 3 commits
    • rcu: Remove local_irq_disable() in rcu_preempt_note_context_switch() · 1d082fd0
      Authored by Paul E. McKenney
      The rcu_preempt_note_context_switch() function is on a scheduling fast
      path, so it would be good to avoid disabling irqs.  The reason that irqs
      are disabled is to synchronize process-level and irq-handler access to
      the task_struct ->rcu_read_unlock_special bitmask.  This commit therefore
      makes ->rcu_read_unlock_special instead be a union of bools with a short
      allowing single-access checks in RCU's __rcu_read_unlock().  This results
      in the process-level and irq-handler accesses being simple loads and
      stores, so that irqs need no longer be disabled.  This commit therefore
      removes the irq disabling from rcu_preempt_note_context_switch().
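
A sketch of the resulting type, assuming the union layout this commit
introduced (treat the details as illustrative):

	/* Two flag bytes overlaid with a short: process level and irq
	 * handlers store the individual bools, while __rcu_read_unlock()
	 * checks "anything special?" with a single load of ->s. */
	union rcu_special {
		struct {
			bool blocked;	/* preempted in a read-side section */
			bool need_qs;	/* quiescent state needed */
		} b;
		short s;
	};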
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      1d082fd0
    • rcu: Make TASKS_RCU handle nohz_full= CPUs · 176f8f7a
      Authored by Paul E. McKenney
      Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
      usermode execution.  There would be neither a context switch nor a
      scheduling-clock interrupt to tell TASKS_RCU that the task in question
      had passed through a quiescent state.  The grace period would therefore
      extend indefinitely.  This commit therefore makes RCU's dyntick-idle
      subsystem record the task_struct structure of the task that is running
      in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
      then access this information and record a quiescent state on
      behalf of any CPU running in dyntick-idle usermode.
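
Conceptually, the dyntick-idle entry/exit path records the occupying
task; a hypothetical sketch (the names are illustrative, not the exact
kernel symbols):

	static DEFINE_PER_CPU(struct task_struct *, tasks_rcu_idle_task);

	static void note_dyntick_idle_entry(void)
	{
		/* TASKS_RCU can now credit this task with a QS. */
		__this_cpu_write(tasks_rcu_idle_task, current);
	}

	static void note_dyntick_idle_exit(void)
	{
		__this_cpu_write(tasks_rcu_idle_task, NULL);
	}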
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      176f8f7a
    • rcu: Add call_rcu_tasks() · 8315f422
      Authored by Paul E. McKenney
      This commit adds a new RCU-tasks flavor of RCU, which provides
      call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
      context switch (not preemption!) and userspace execution (not the idle
      loop -- use some sort of schedule_on_each_cpu() if you need to handle the
idle tasks).  Note that unlike other RCU flavors, these quiescent states
      occur in tasks, not necessarily CPUs.  Includes fixes from Steven Rostedt.
      
      This RCU flavor is assumed to have very infrequent latency-tolerant
      updaters.  This assumption permits significant simplifications, including
      a single global callback list protected by a single global lock, along
      with a single task-private linked list containing all tasks that have not
      yet passed through a quiescent state.  If experience shows this assumption
      to be incorrect, the required additional complexity will be added.
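
Usage mirrors call_rcu(); a minimal sketch (my_obj and its callback
are hypothetical):

	struct my_obj {
		struct rcu_head rh;
		/* ... payload ... */
	};

	static void my_obj_free_cb(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct my_obj, rh));
	}

	static void my_obj_release(struct my_obj *p)
	{
		/* Freed once all tasks have voluntarily switched or
		 * run in userspace. */
		call_rcu_tasks(&p->rh, my_obj_free_cb);
	}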
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      8315f422
  4. 10 Jul 2014, 1 commit
  5. 22 Jan 2014, 1 commit
    • introduce for_each_thread() to replace the buggy while_each_thread() · 0c740d0a
      Authored by Oleg Nesterov
while_each_thread() and next_thread() should die; almost every lockless
usage is wrong.
      
      1. Unless g == current, the lockless while_each_thread() is not safe.
      
   while_each_thread(g, t) can loop forever if g exits; next_thread()
   can't reach the unhashed thread in this case. Note that this can
   happen even if g is the group leader, since the leader can exec.
      
      2. Even if while_each_thread() itself was correct, people often use
         it wrongly.
      
   It was never safe to just take rcu_read_lock() and loop unless
   you verify that pid_alive(g) == T; even the first next_thread()
   can point to already freed/reused memory.
      
      This patch adds signal_struct->thread_head and task->thread_node to
      create the normal rcu-safe list with the stable head.  The new
      for_each_thread(g, t) helper is always safe under rcu_read_lock() as
      long as this task_struct can't go away.
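
A minimal usage sketch, assuming the caller holds a reference that
keeps g's task_struct from going away:

	struct task_struct *t;

	rcu_read_lock();
	for_each_thread(g, t) {
		/* inspect t; safe even if the group is exiting */
	}
	rcu_read_unlock();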
      
Note: of course it is ugly to have both task_struct->thread_node and
the old task_struct->thread_group; we will kill the latter later, after
we change the users of while_each_thread() to use for_each_thread().
      
Perhaps we can kill it even before we convert all users; we could
reimplement next_thread(t) using the new thread_head/thread_node. But
we can't do this right now because it would lead to subtle behavioural
changes. For example, do/while_each_thread() always sees at least one
task, while for_each_thread() can do nothing if the whole thread group
has died. Or thread_group_empty(): currently its semantics are unclear
unless thread_group_leader(p), and we need to audit the callers before
we can change it.
      
So this patch adds the new interface, which has to coexist with the
old one for some time; hopefully the next changes will be more or less
straightforward and the old one will go away soon.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Sergey Dyasly <dserrg@gmail.com>
Tested-by: Sergey Dyasly <dserrg@gmail.com>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: "Ma, Xindong" <xindong.ma@intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c740d0a
  6. 14 Jan 2014, 1 commit
  7. 13 Jan 2014, 1 commit
  8. 06 Nov 2013, 1 commit
  9. 19 Feb 2013, 1 commit
  10. 28 Jan 2013, 1 commit
    • cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Authored by Frederic Weisbecker
While remotely reading the cputime of a task running on a
full dynticks CPU, the values stored in the utime/stime fields
of struct task_struct may be stale. They may reflect the last
kernel <-> user transition snapshot, and we need to add the
tickless time spent since that snapshot.
      
      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then on top of this snapshot and the current
      time, perform the fixup on the reader side from task_times()
      accessors.
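
The reader-side fixup amounts to "snapshot plus elapsed tickless
time"; a simplified sketch with illustrative types and names:

	struct cputime_snap {
		u64 utime;	/* user time accumulated at snapshot */
		u64 stime;	/* system time accumulated at snapshot */
		u64 snap_ns;	/* when the snapshot was taken */
		bool in_user;	/* context at the snapshot */
	};

	static u64 fixed_up_utime(const struct cputime_snap *s, u64 now_ns)
	{
		/* Tickless time since the last kernel <-> user transition
		 * is credited to the context recorded at the snapshot. */
		return s->in_user ? s->utime + (now_ns - s->snap_ns)
				  : s->utime;
	}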
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
      6a61671b
  11. 18 Sep 2012, 1 commit
    • userns: Convert the audit loginuid to be a kuid · e1760bd5
      Authored by Eric W. Biederman
      Always store audit loginuids in type kuid_t.
      
      Print loginuids by converting them into uids in the appropriate user
      namespace, and then printing the resulting uid.
      
      Modify audit_get_loginuid to return a kuid_t.
      
      Modify audit_set_loginuid to take a kuid_t.
      
      Modify /proc/<pid>/loginuid on read to convert the loginuid into the
      user namespace of the opener of the file.
      
Modify /proc/<pid>/loginuid on write to convert the loginuid
from the user namespace of the opener of the file.
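
The conversion itself uses the kuid helpers; roughly (a sketch, not
the verbatim proc handler):

	static uid_t loginuid_for_opener(struct task_struct *task,
					 struct user_namespace *ns)
	{
		kuid_t loginuid = audit_get_loginuid(task);

		/* Returns (uid_t)-1 if loginuid has no mapping in ns. */
		return from_kuid(ns, loginuid);
	}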
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
      Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      e1760bd5
  12. 24 Jul 2012, 1 commit
  13. 03 Jul 2012, 1 commit
  14. 30 May 2012, 1 commit
  15. 22 Mar 2012, 1 commit
    • cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Authored by Mel Gorman
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
investigating a large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
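
The read side then follows the standard seqcount pattern; a sketch
(the allocation helper is hypothetical):

	static struct page *alloc_respecting_mems_allowed(gfp_t gfp)
	{
		unsigned int seq;
		struct page *page;

		do {
			seq = read_seqcount_begin(&current->mems_allowed_seq);
			page = try_alloc(gfp);	/* hypothetical attempt */
		} while (!page &&
			 read_seqcount_retry(&current->mems_allowed_seq, seq));

		return page;
	}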
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
To test the actual bug the commit fixed, I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc9a6c87
  16. 22 Feb 2012, 1 commit
  17. 13 Dec 2011, 1 commit
    • threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem · 257058ae
      Authored by Tejun Heo
Make the following renames to prepare for extension of threadgroup
locking (a usage sketch follows the list).
      
      * s/signal->threadgroup_fork_lock/signal->group_rwsem/
      * s/threadgroup_fork_read_lock()/threadgroup_change_begin()/
      * s/threadgroup_fork_read_unlock()/threadgroup_change_end()/
      * s/threadgroup_fork_write_lock()/threadgroup_lock()/
      * s/threadgroup_fork_write_unlock()/threadgroup_unlock()/
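
After the rename, callers pair up as before; a hedged sketch using the
new names:

	static void change_threadgroup(struct task_struct *tsk)
	{
		threadgroup_lock(tsk);	/* was threadgroup_fork_write_lock() */
		/* ... modify state that needs a stable thread group ... */
		threadgroup_unlock(tsk); /* was threadgroup_fork_write_unlock() */
	}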
      
      This patch doesn't cause any behavior change.
      
      -v2: Rename threadgroup_change_done() to threadgroup_change_end() per
           KAMEZAWA's suggestion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul Menage <paul@paulmenage.org>
      257058ae
  18. 17 Nov 2011, 1 commit
  19. 14 Nov 2011, 1 commit
  20. 13 Sep 2011, 1 commit
  21. 12 Jul 2011, 1 commit
    • fixlet: Remove fs_excl from struct task. · 4aede84b
      Authored by Justin TerAvest
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
Let's kill it.
Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      4aede84b
  22. 27 May 2011, 1 commit
  23. 24 Apr 2011, 1 commit
  24. 04 Apr 2011, 1 commit
  25. 09 Dec 2010, 1 commit
  26. 30 Nov 2010, 1 commit
  27. 28 Oct 2010, 1 commit
  28. 20 Aug 2010, 2 commits
    • rcu: Add a TINY_PREEMPT_RCU · a57eb940
      Authored by Paul E. McKenney
      Implement a small-memory-footprint uniprocessor-only implementation of
      preemptible RCU.  This implementation uses but a single blocked-tasks
      list rather than the combinatorial number used per leaf rcu_node by
      TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies
      processing.  This version also takes advantage of uniprocessor execution
      to accelerate grace periods in the case where there are no readers.
      
      The general design is otherwise broadly similar to that of TREE_PREEMPT_RCU.
      
      This implementation is a step towards having RCU implementation driven
      off of the SMP and PREEMPT kernel configuration variables, which can
      happen once this implementation has accumulated sufficient experience.
      
      Removed ACCESS_ONCE() from __rcu_read_unlock() and added barrier() as
      suggested by Steve Rostedt in order to avoid the compiler-reordering
      issue noted by Mathieu Desnoyers (http://lkml.org/lkml/2010/8/16/183).
      
      As can be seen below, CONFIG_TINY_PREEMPT_RCU represents almost 5Kbyte
      savings compared to CONFIG_TREE_PREEMPT_RCU.  Of course, for non-real-time
      workloads, CONFIG_TINY_RCU is even better.
      
      	CONFIG_TREE_PREEMPT_RCU
      
      	   text	   data	    bss	    dec	   filename
      	     13	      0	      0	     13	   kernel/rcupdate.o
      	   6170	    825	     28	   7023	   kernel/rcutree.o
      				   ----
			   7036    Total
      
      	CONFIG_TINY_PREEMPT_RCU
      
      	   text	   data	    bss	    dec	   filename
      	     13	      0	      0	     13	   kernel/rcupdate.o
      	   2081	     81	      8	   2170	   kernel/rcutiny.o
      				   ----
      				   2183    Total
      
      	CONFIG_TINY_RCU (non-preemptible)
      
      	   text	   data	    bss	    dec	   filename
      	     13	      0	      0	     13	   kernel/rcupdate.o
      	    719	     25	      0	    744	   kernel/rcutiny.o
      				    ---
      				    757    Total
Requested-by: Loïc Minier <loic.minier@canonical.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      a57eb940
    • kernel: __rcu annotations · 4d2deb40
      Authored by Arnd Bergmann
This adds annotations for RCU operations in core kernel components.
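
For example, a pointer that must be accessed through rcu_dereference()
and rcu_assign_pointer() gets marked so sparse can check it; a sketch
with hypothetical structures:

	struct foo {
		struct bar __rcu *next;	/* RCU-protected pointer */
	};

	static struct bar *foo_next(struct foo *f)
	{
		/* Caller must be inside rcu_read_lock(). */
		return rcu_dereference(f->next);
	}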
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      4d2deb40
  29. 28 May 2010, 4 commits
  30. 12 May 2010, 1 commit
  31. 13 Mar 2010, 1 commit
  32. 03 Mar 2010, 1 commit
  33. 18 Dec 2009, 1 commit
  34. 16 Dec 2009, 1 commit