1. 11 6月, 2013 1 次提交
    • S
      rcu: Don't call wakeup() with rcu_node structure ->lock held · 016a8d5b
      Steven Rostedt 提交于
      This commit fixes a lockdep-detected deadlock by moving a wake_up()
      call out from a rnp->lock critical section.  Please see below for
      the long version of this story.
      
      On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:
      
      > [12572.705832] ======================================================
      > [12572.750317] [ INFO: possible circular locking dependency detected ]
      > [12572.796978] 3.10.0-rc3+ #39 Not tainted
      > [12572.833381] -------------------------------------------------------
      > [12572.862233] trinity-child17/31341 is trying to acquire lock:
      > [12572.870390]  (rcu_node_0){..-.-.}, at: [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
      > [12572.878859]
      > but task is already holding lock:
      > [12572.894894]  (&ctx->lock){-.-...}, at: [<ffffffff811390ed>] perf_lock_task_context+0x7d/0x2d0
      > [12572.903381]
      > which lock already depends on the new lock.
      >
      > [12572.927541]
      > the existing dependency chain (in reverse order) is:
      > [12572.943736]
      > -> #4 (&ctx->lock){-.-...}:
      > [12572.960032]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12572.968337]        [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12572.976633]        [<ffffffff8113c987>] __perf_event_task_sched_out+0x2e7/0x5e0
      > [12572.984969]        [<ffffffff81088953>] perf_event_task_sched_out+0x93/0xa0
      > [12572.993326]        [<ffffffff816ea0bf>] __schedule+0x2cf/0x9c0
      > [12573.001652]        [<ffffffff816eacfe>] schedule_user+0x2e/0x70
      > [12573.009998]        [<ffffffff816ecd64>] retint_careful+0x12/0x2e
      > [12573.018321]
      > -> #3 (&rq->lock){-.-.-.}:
      > [12573.034628]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.042930]        [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12573.051248]        [<ffffffff8108e6a7>] wake_up_new_task+0xb7/0x260
      > [12573.059579]        [<ffffffff810492f5>] do_fork+0x105/0x470
      > [12573.067880]        [<ffffffff81049686>] kernel_thread+0x26/0x30
      > [12573.076202]        [<ffffffff816cee63>] rest_init+0x23/0x140
      > [12573.084508]        [<ffffffff81ed8e1f>] start_kernel+0x3f1/0x3fe
      > [12573.092852]        [<ffffffff81ed856f>] x86_64_start_reservations+0x2a/0x2c
      > [12573.101233]        [<ffffffff81ed863d>] x86_64_start_kernel+0xcc/0xcf
      > [12573.109528]
      > -> #2 (&p->pi_lock){-.-.-.}:
      > [12573.125675]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.133829]        [<ffffffff816ebe9b>] _raw_spin_lock_irqsave+0x4b/0x90
      > [12573.141964]        [<ffffffff8108e881>] try_to_wake_up+0x31/0x320
      > [12573.150065]        [<ffffffff8108ebe2>] default_wake_function+0x12/0x20
      > [12573.158151]        [<ffffffff8107bbf8>] autoremove_wake_function+0x18/0x40
      > [12573.166195]        [<ffffffff81085398>] __wake_up_common+0x58/0x90
      > [12573.174215]        [<ffffffff81086909>] __wake_up+0x39/0x50
      > [12573.182146]        [<ffffffff810fc3da>] rcu_start_gp_advanced.isra.11+0x4a/0x50
      > [12573.190119]        [<ffffffff810fdb09>] rcu_start_future_gp+0x1c9/0x1f0
      > [12573.198023]        [<ffffffff810fe2c4>] rcu_nocb_kthread+0x114/0x930
      > [12573.205860]        [<ffffffff8107a91d>] kthread+0xed/0x100
      > [12573.213656]        [<ffffffff816f4b1c>] ret_from_fork+0x7c/0xb0
      > [12573.221379]
      > -> #1 (&rsp->gp_wq){..-.-.}:
      > [12573.236329]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.243783]        [<ffffffff816ebe9b>] _raw_spin_lock_irqsave+0x4b/0x90
      > [12573.251178]        [<ffffffff810868f3>] __wake_up+0x23/0x50
      > [12573.258505]        [<ffffffff810fc3da>] rcu_start_gp_advanced.isra.11+0x4a/0x50
      > [12573.265891]        [<ffffffff810fdb09>] rcu_start_future_gp+0x1c9/0x1f0
      > [12573.273248]        [<ffffffff810fe2c4>] rcu_nocb_kthread+0x114/0x930
      > [12573.280564]        [<ffffffff8107a91d>] kthread+0xed/0x100
      > [12573.287807]        [<ffffffff816f4b1c>] ret_from_fork+0x7c/0xb0
      
      Notice the above call chain.
      
      rcu_start_future_gp() is called with the rnp->lock held. Then it calls
      rcu_start_gp_advance, which does a wakeup.
      
      You can't do wakeups while holding the rnp->lock, as that would mean
      that you could not do a rcu_read_unlock() while holding the rq lock, or
      any lock that was taken while holding the rq lock. This is because...
      (See below).
      
      > [12573.295067]
      > -> #0 (rcu_node_0){..-.-.}:
      > [12573.309293]        [<ffffffff810b8d36>] __lock_acquire+0x1786/0x1af0
      > [12573.316568]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.323825]        [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12573.331081]        [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
      > [12573.338377]        [<ffffffff810760a6>] __rcu_read_unlock+0x96/0xa0
      > [12573.345648]        [<ffffffff811391b3>] perf_lock_task_context+0x143/0x2d0
      > [12573.352942]        [<ffffffff8113938e>] find_get_context+0x4e/0x1f0
      > [12573.360211]        [<ffffffff811403f4>] SYSC_perf_event_open+0x514/0xbd0
      > [12573.367514]        [<ffffffff81140e49>] SyS_perf_event_open+0x9/0x10
      > [12573.374816]        [<ffffffff816f4dd4>] tracesys+0xdd/0xe2
      
      Notice the above trace.
      
      perf took its own ctx->lock, which can be taken while holding the rq
      lock. While holding this lock, it did a rcu_read_unlock(). The
      perf_lock_task_context() basically looks like:
      
      rcu_read_lock();
      raw_spin_lock(ctx->lock);
      rcu_read_unlock();
      
      Now, what looks to have happened, is that we scheduled after taking that
      first rcu_read_lock() but before taking the spin lock. When we scheduled
      back in and took the ctx->lock, the following rcu_read_unlock()
      triggered the "special" code.
      
      The rcu_read_unlock_special() takes the rnp->lock, which gives us a
      possible deadlock scenario.
      
      	CPU0		CPU1		CPU2
      	----		----		----
      
      				     rcu_nocb_kthread()
          lock(rq->lock);
      		    lock(ctx->lock);
      				     lock(rnp->lock);
      
      				     wake_up();
      
      				     lock(rq->lock);
      
      		    rcu_read_unlock();
      
      		    rcu_read_unlock_special();
      
      		    lock(rnp->lock);
          lock(ctx->lock);
      
      **** DEADLOCK ****
      
      > [12573.382068]
      > other info that might help us debug this:
      >
      > [12573.403229] Chain exists of:
      >   rcu_node_0 --> &rq->lock --> &ctx->lock
      >
      > [12573.424471]  Possible unsafe locking scenario:
      >
      > [12573.438499]        CPU0                    CPU1
      > [12573.445599]        ----                    ----
      > [12573.452691]   lock(&ctx->lock);
      > [12573.459799]                                lock(&rq->lock);
      > [12573.467010]                                lock(&ctx->lock);
      > [12573.474192]   lock(rcu_node_0);
      > [12573.481262]
      >  *** DEADLOCK ***
      >
      > [12573.501931] 1 lock held by trinity-child17/31341:
      > [12573.508990]  #0:  (&ctx->lock){-.-...}, at: [<ffffffff811390ed>] perf_lock_task_context+0x7d/0x2d0
      > [12573.516475]
      > stack backtrace:
      > [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
      > [12573.545357]  ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
      > [12573.552868]  ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
      > [12573.560353]  0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
      > [12573.567856] Call Trace:
      > [12573.575011]  [<ffffffff816e375b>] dump_stack+0x19/0x1b
      > [12573.582284]  [<ffffffff816dfa5d>] print_circular_bug+0x200/0x20f
      > [12573.589637]  [<ffffffff810b8d36>] __lock_acquire+0x1786/0x1af0
      > [12573.596982]  [<ffffffff810918f5>] ? sched_clock_cpu+0xb5/0x100
      > [12573.604344]  [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.611652]  [<ffffffff811054ff>] ? rcu_read_unlock_special+0x9f/0x4c0
      > [12573.619030]  [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12573.626331]  [<ffffffff811054ff>] ? rcu_read_unlock_special+0x9f/0x4c0
      > [12573.633671]  [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
      > [12573.640992]  [<ffffffff811390ed>] ? perf_lock_task_context+0x7d/0x2d0
      > [12573.648330]  [<ffffffff810b429e>] ? put_lock_stats.isra.29+0xe/0x40
      > [12573.655662]  [<ffffffff813095a0>] ? delay_tsc+0x90/0xe0
      > [12573.662964]  [<ffffffff810760a6>] __rcu_read_unlock+0x96/0xa0
      > [12573.670276]  [<ffffffff811391b3>] perf_lock_task_context+0x143/0x2d0
      > [12573.677622]  [<ffffffff81139070>] ? __perf_event_enable+0x370/0x370
      > [12573.684981]  [<ffffffff8113938e>] find_get_context+0x4e/0x1f0
      > [12573.692358]  [<ffffffff811403f4>] SYSC_perf_event_open+0x514/0xbd0
      > [12573.699753]  [<ffffffff8108cd9d>] ? get_parent_ip+0xd/0x50
      > [12573.707135]  [<ffffffff810b71fd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
      > [12573.714599]  [<ffffffff81140e49>] SyS_perf_event_open+0x9/0x10
      > [12573.721996]  [<ffffffff816f4dd4>] tracesys+0xdd/0xe2
      
      This commit delays the wakeup via irq_work(), which is what
      perf and ftrace use to perform wakeups in critical sections.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      016a8d5b
  2. 04 6月, 2013 1 次提交
  3. 04 5月, 2013 1 次提交
    • F
      rcu: Fix full dynticks' dependency on wide RCU nocb mode · 73c30828
      Frederic Weisbecker 提交于
      Commit 0637e029
      ("nohz: Select wide RCU nocb for full dynticks") intended
      to force CONFIG_RCU_NOCB_CPU_ALL=y when full dynticks is
      enabled.
      
      However this option is part of a choice menu and Kconfig's
      "select" instruction has no effect on such targets.
      
      Fix this by using reverse dependencies on the targets we
      don't want instead.
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      73c30828
  4. 01 5月, 2013 1 次提交
    • M
      init/Kconfig: re-order CONFIG_EXPERT options to fix menuconfig display · 657a5209
      Mike Frysinger 提交于
      The kconfig language requires that dependent options all follow the
      menuconfig symbol in order to be collapsed below it.  Recently some hidden
      options were added below the EXPERT menuconfig, but did not depend on
      EXPERT (because hidden options can't).  This broke the display.  So
      re-order all these options, and while we're here stick the PCI quirks
      under the EXPERT menu (since it isn't sitting with any related options).
      
      Before this commit, we get:
      	[*] Configure standard kernel features (expert users)  --->
      	[ ] Sysctl syscall support
      	[*] Load all symbols for debugging/ksymoops
      	...
      	[ ] Embedded system
      
      Now we get the older (and correct) behavior:
      	[*] Configure standard kernel features (expert users)  --->
      	[ ] Embedded system
      And if you go into the expert menu you get the expert options:
      	[ ] Sysctl syscall support
      	[*] Load all symbols for debugging/ksymoops
      	...
      Signed-off-by: NMike Frysinger <vapier@gentoo.org>
      Acked-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      657a5209
  5. 27 4月, 2013 1 次提交
    • F
      nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config · c58b0df1
      Frederic Weisbecker 提交于
      Turn the full dynticks passive dependency on VIRT_CPU_ACCOUNTING_GEN
      to an active one.
      
      The full dynticks Kconfig is currently hidden behind the full dynticks
      cputime accounting, which is an awkward and counter-intuitive layout:
      the user first has to select the dynticks cputime accounting in order
      to make the full dynticks feature to be visible.
      
      We definetly want it the other way around. The usual way to perform
      this kind of active dependency is use "select" on the depended target.
      Now we can't use the Kconfig "select" instruction when the target is
      a "choice".
      
      So this patch inspires on how the RCU subsystem Kconfig interact
      with its dependencies on SMP and PREEMPT: we make sure that cputime
      accounting can't propose another option than VIRT_CPU_ACCOUNTING_GEN
      when NO_HZ_FULL is selected by using the right "depends on" instruction
      for each cputime accounting choices.
      
      v2: Keep full dynticks cputime accounting available even without
      full dynticks, as per Paul McKenney's suggestion.
      Reported-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      c58b0df1
  6. 03 4月, 2013 1 次提交
    • F
      nohz: Rename CONFIG_NO_HZ to CONFIG_NO_HZ_COMMON · 3451d024
      Frederic Weisbecker 提交于
      We are planning to convert the dynticks Kconfig options layout
      into a choice menu. The user must be able to easily pick
      any of the following implementations: constant periodic tick,
      idle dynticks, full dynticks.
      
      As this implies a mutual exclusion, the two dynticks implementions
      need to converge on the selection of a common Kconfig option in order
      to ease the sharing of a common infrastructure.
      
      It would thus seem pretty natural to reuse CONFIG_NO_HZ to
      that end. It already implements all the idle dynticks code
      and the full dynticks depends on all that code for now.
      So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and
      CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ.
      
      On the other hand we want to stay backward compatible: if
      CONFIG_NO_HZ is set in an older config file, we want to
      enable CONFIG_NO_HZ_IDLE by default.
      
      But we can't afford both at the same time or we run into
      a circular dependency:
      
      1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select
         CONFIG_NO_HZ
      2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE
      
      We might be able to support that from Kconfig/Kbuild but it
      may not be wise to introduce such a confusing behaviour.
      
      So to solve this, create a new CONFIG_NO_HZ_COMMON option
      which gathers the common code between idle and full dynticks
      (that common code for now is simply the idle dynticks code)
      and select it from their referring Kconfig.
      
      Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ
      to it for backward compatibility.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      3451d024
  7. 26 3月, 2013 3 次提交
  8. 13 3月, 2013 2 次提交
    • K
      final removal of CONFIG_EXPERIMENTAL · 3d374d09
      Kees Cook 提交于
      Remove "config EXPERIMENTAL" itself, now that every "depends on" it has
      been removed from the tree.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d374d09
    • P
      rcu: Remove restrictions on no-CBs CPUs · 34ed6246
      Paul E. McKenney 提交于
      Currently, CPU 0 is constrained to not be a no-CBs CPU, and furthermore
      at least one no-CBs CPU must remain online at any given time.  These
      restrictions are problematic in some situations, such as cases where
      all CPUs must run a real-time workload that needs to be insulated from
      OS jitter and latencies due to RCU callback invocation.  This commit
      therefore provides no-CBs CPUs a (very crude and energy-inefficient)
      way to start and to wait for grace periods independently of the normal
      RCU callback mechanisms.  This approach allows any or all of the CPUs to
      be designated as no-CBs CPUs, and allows any proper subset of the CPUs
      (whether no-CBs CPUs or not) to be offlined.
      
      This commit also provides a fix for a locking bug spotted by Xie
      ChanglongX <changlongx.xie@intel.com>.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      34ed6246
  9. 08 3月, 2013 1 次提交
    • F
      context_tracking: Enable probes by default for selftesting · 8b438766
      Frederic Weisbecker 提交于
      Until we provide the nohz_mask boot parameter, keeping
      the context tracking probes disabled by default is pointless
      since what we want is to runtime test this code anyway.
      
      It's furthermore confusing for the users which don't expect
      the probes to be off when they select RCU user mode or full
      dynticks cputime accounting.
      
      Let's enable these probes selftests by default for now.
      
      Suggested: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Mats Liljegren <mats.liljegren@enea.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      8b438766
  10. 16 2月, 2013 1 次提交
  11. 13 2月, 2013 8 次提交
  12. 12 2月, 2013 2 次提交
  13. 08 2月, 2013 1 次提交
  14. 29 1月, 2013 2 次提交
  15. 28 1月, 2013 1 次提交
    • F
      cputime: Generic on-demand virtual cputime accounting · abf917cd
      Frederic Weisbecker 提交于
      If we want to stop the tick further idle, we need to be
      able to account the cputime without using the tick.
      
      Virtual based cputime accounting solves that problem by
      hooking into kernel/user boundaries.
      
      However implementing CONFIG_VIRT_CPU_ACCOUNTING require
      low level hooks and involves more overhead. But we already
      have a generic context tracking subsystem that is required
      for RCU needs by archs which plan to shut down the tick
      outside idle.
      
      This patch implements a generic virtual based cputime
      accounting that relies on these generic kernel/user hooks.
      
      There are some upsides of doing this:
      
      - This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
      if context tracking is already built (already necessary for RCU in full
      tickless mode).
      
      - We can rely on the generic context tracking subsystem to dynamically
      (de)activate the hooks, so that we can switch anytime between virtual
      and tick based accounting. This way we don't have the overhead
      of the virtual accounting when the tick is running periodically.
      
      And one downside:
      
      - There is probably more overhead than a native virtual based cputime
      accounting. But this relies on hooks that are already set anyway.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      abf917cd
  16. 27 1月, 2013 1 次提交
  17. 25 1月, 2013 2 次提交
  18. 17 1月, 2013 1 次提交
    • K
      Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE · 3a55fb0d
      Kirill Smelkov 提交于
      In commit 281dc5c5 ("Give up on pushing CC_OPTIMIZE_FOR_SIZE") we
      already changed the actual default value, but the help-text still
      suggested 'y'. Fix the help text too, for all the same reasons.
      
      Sadly, -Os keeps on generating some very suboptimal code for certain
      cases, to the point where any I$ miss upside is swamped by the downside.
      The main ones are:
      
       - using "rep movsb" for memcpy, even on CPU's where that is
         horrendously bad for performance.
      
       - not honoring branch prediction information, so any I$ footprint you
         win from smaller code, you lose from less code density in the I$.
      
       - using divide instructions when that is very expensive.
      Signed-off-by: NKirill Smelkov <kirr@mns.spb.ru>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a55fb0d
  19. 12 1月, 2013 2 次提交
    • K
      init: remove depends on CONFIG_EXPERIMENTAL · 19c92399
      Kees Cook 提交于
      The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
      while now and is almost always enabled by default. As agreed during the
      Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
      
      CC: "Eric W. Biederman" <ebiederm@xmission.com>
      CC: Serge Hallyn <serge.hallyn@canonical.com>
      CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NSerge Hallyn <serge.hallyn@ubuntu.com>
      19c92399
    • K
      make CONFIG_EXPERIMENTAL invisible and default · 5a958db3
      Kees Cook 提交于
      This config item has not carried much meaning for a while now and is
      almost always enabled by default (especially in distro builds). As agreed
      during the Linux kernel summit, it should be removed. As a first step,
      remove it from being listed, and default it to on. Once it has been
      removed from all subsystem Kconfigs, it will be dropped entirely.
      
      For items that really are experimental, maintainers should use "default
      n", optionally include "(EXPERIMENTAL)" in the title, and add language to
      the help text indicating why the item should be considered experimental.
      
      For items that are dangerously experimental, the maintainer is encouraged
      to follow the above title recommendation, add stronger language to the
      help text, and optionally use (depending on the extent of the danger,
      from least to most dangerous): printk(), add_taint(TAINT_WARN),
      add_taint(TAINT_CRAP), WARN_ON(1), and CONFIG_BROKEN.
      
      CC: Greg KH <gregkh@linuxfoundation.org>
      CC: "Eric W. Biederman" <ebiederm@xmission.com>
      CC: Serge Hallyn <serge.hallyn@canonical.com>
      CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      5a958db3
  20. 10 1月, 2013 1 次提交
  21. 19 12月, 2012 2 次提交
    • G
      memcg: infrastructure to match an allocation to the right cache · d7f25f8a
      Glauber Costa 提交于
      The page allocator is able to bind a page to a memcg when it is
      allocated.  But for the caches, we'd like to have as many objects as
      possible in a page belonging to the same cache.
      
      This is done in this patch by calling memcg_kmem_get_cache in the
      beginning of every allocation function.  This function is patched out by
      static branches when kernel memory controller is not being used.
      
      It assumes that the task allocating, which determines the memcg in the
      page allocator, belongs to the same cgroup throughout the whole process.
      Misaccounting can happen if the task calls memcg_kmem_get_cache() while
      belonging to a cgroup, and later on changes.  This is considered
      acceptable, and should only happen upon task migration.
      
      Before the cache is created by the memcg core, there is also a possible
      imbalance: the task belongs to a memcg, but the cache being allocated from
      is the global cache, since the child cache is not yet guaranteed to be
      ready.  This case is also fine, since in this case the GFP_KMEMCG will not
      be passed and the page allocator will not attempt any cgroup accounting.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7f25f8a
    • G
      memcg: kmem accounting basic infrastructure · 510fc4e1
      Glauber Costa 提交于
      Add the basic infrastructure for the accounting of kernel memory.  To
      control that, the following files are created:
      
       * memory.kmem.usage_in_bytes
       * memory.kmem.limit_in_bytes
       * memory.kmem.failcnt
       * memory.kmem.max_usage_in_bytes
      
      They have the same meaning of their user memory counterparts.  They
      reflect the state of the "kmem" res_counter.
      
      Per cgroup kmem memory accounting is not enabled until a limit is set for
      the group.  Once the limit is set the accounting cannot be disabled for
      that group.  This means that after the patch is applied, no behavioral
      changes exists for whoever is still using memcg to control their memory
      usage, until memory.kmem.limit_in_bytes is set for the first time.
      
      We always account to both user and kernel resource_counters.  This
      effectively means that an independent kernel limit is in place when the
      limit is set to a lower value than the user memory.  A equal or higher
      value means that the user limit will always hit first, meaning that kmem
      is effectively unlimited.
      
      People who want to track kernel memory but not limit it, can set this
      limit to a very high number (like RESOURCE_MAX - 1page - that no one will
      ever hit, or equal to the user memory)
      
      [akpm@linux-foundation.org: MEMCG_MMEM only works with slab and slub]
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      510fc4e1
  22. 11 12月, 2012 2 次提交
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Mel Gorman 提交于
      This patch adds Kconfig options and kernel parameters to allow the
      enabling and disabling of automatic NUMA balancing. The existance
      of such a switch was and is very important when debugging problems
      related to transparent hugepages and we should have the same for
      automatic NUMA placement.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      1a687c2e
    • A
      mm: numa: pte_numa() and pmd_numa() · be3a7284
      Andrea Arcangeli 提交于
      Implement pte_numa and pmd_numa.
      
      We must atomically set the numa bit and clear the present bit to
      define a pte_numa or pmd_numa.
      
      Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
      a thread touches a virtual address in the corresponding virtual range,
      a NUMA hinting page fault will trigger. The NUMA hinting page fault
      will clear the NUMA bit and set the present bit again to resolve the
      page fault.
      
      The expectation is that a NUMA hinting page fault is used as part
      of a placement policy that decides if a page should remain on the
      current node or migrated to a different node.
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      be3a7284
  23. 01 12月, 2012 1 次提交
    • F
      context_tracking: New context tracking susbsystem · 91d1aa43
      Frederic Weisbecker 提交于
      Create a new subsystem that probes on kernel boundaries
      to keep track of the transitions between level contexts
      with two basic initial contexts: user or kernel.
      
      This is an abstraction of some RCU code that use such tracking
      to implement its userspace extended quiescent state.
      
      We need to pull this up from RCU into this new level of indirection
      because this tracking is also going to be used to implement an "on
      demand" generic virtual cputime accounting. A necessary step to
      shutdown the tick while still accounting the cputime.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      [ paulmck: fix whitespace error and email address. ]
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      91d1aa43
  24. 18 11月, 2012 1 次提交
    • F
      printk: Wake up klogd using irq_work · 74876a98
      Frederic Weisbecker 提交于
      klogd is woken up asynchronously from the tick in order
      to do it safely.
      
      However if printk is called when the tick is stopped, the reader
      won't be woken up until the next interrupt, which might not fire
      for a while. As a result, the user may miss some message.
      
      To fix this, lets implement the printk tick using a lazy irq work.
      This subsystem takes care of the timer tick state and can
      fix up accordingly.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      74876a98