1. 18 3月, 2016 10 次提交
    • J
      lib/bug.c: use common WARN helper · 2553b67a
      Josh Poimboeuf 提交于
      The traceoff_on_warning option doesn't have any effect on s390, powerpc,
      arm64, parisc, and sh because there are two different types of WARN
      implementations:
      
      1) The above mentioned architectures treat WARN() as a special case of a
         BUG() exception.  They handle warnings in report_bug() in lib/bug.c.
      
      2) All other architectures just call warn_slowpath_*() directly.  Their
         warnings are handled in warn_slowpath_common() in kernel/panic.c.
      
      Support traceoff_on_warning on all architectures and prevent any future
      divergence by using a single common function to emit the warning.
      
      Also remove the '()' from '%pS()', because the parentheses look funky:
      
        [   45.607629] WARNING: at /root/warn_mod/warn_mod.c:17 .init_dummy+0x20/0x40 [warn_mod]()
      Reported-by: NChunyu Hu <chuhu@redhat.com>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: NPrarit Bhargava <prarit@redhat.com>
      Acked-by: NPrarit Bhargava <prarit@redhat.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2553b67a
    • K
      param: convert some "on"/"off" users to strtobool · 4cc7ecb7
      Kees Cook 提交于
      This changes several users of manual "on"/"off" parsing to use
      strtobool.
      
      Some side-effects:
      - these uses will now parse y/n/1/0 meaningfully too
      - the early_param uses will now bubble up parse errors
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Amitkumar Karwar <akarwar@marvell.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kalle Valo <kvalo@codeaurora.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nishant Sarmukadam <nishants@marvell.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Steve French <sfrench@samba.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4cc7ecb7
    • I
      printk: add clear_idx symbol to vmcoreinfo · f468908b
      Ivan Delalande 提交于
      This allows us to extract from the vmcore only the messages emitted
      since the last time the ring buffer was cleared.  We just have to make
      sure its value is always up-to-date, when old messages are discarded to
      free space in log_make_free_space() for example.
      Signed-off-by: NZeyu Zhao <zzy8200@gmail.com>
      Signed-off-by: NIvan Delalande <colona@arista.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f468908b
    • S
      printk: check CON_ENABLED in have_callable_console() · adaf6590
      Sergey Senozhatsky 提交于
      have_callable_console() must also test CON_ENABLED bit, not just
      CON_ANYTIME.  We may have disabled CON_ANYTIME console so printk can
      wrongly assume that it's safe to call_console_drivers().
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NPetr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Calvin Owens <calvinowens@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      adaf6590
    • S
      printk: set may_schedule for some of console_trylock() callers · 6b97a20d
      Sergey Senozhatsky 提交于
      console_unlock() allows to cond_resched() if its caller has set
      `console_may_schedule' to 1, since 8d91f8b1 ("printk: do
      cond_resched() between lines while outputting to consoles").
      
      The rules are:
      -- console_lock() always sets `console_may_schedule' to 1
      -- console_trylock() always sets `console_may_schedule' to 0
      
      However, console_trylock() callers (among them is printk()) do not
      always call printk() from atomic contexts, and some of them can
      cond_resched() in console_unlock(), so console_trylock() can set
      `console_may_schedule' to 1 for such processes.
      
      For !CONFIG_PREEMPT_COUNT kernels, however, console_trylock() always
      sets `console_may_schedule' to 0.
      
      It's possible to drop explicit preempt_disable()/preempt_enable() in
      vprintk_emit(), because console_unlock() and console_trylock() are now
      smart enough:
       a) console_unlock() does not cond_resched() when it's unsafe
          (console_trylock() takes care of that)
       b) console_unlock() does can_use_console() check.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NPetr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Calvin Owens <calvinowens@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b97a20d
    • S
      printk: move can_use_console() out of console_trylock_for_printk() · a8199371
      Sergey Senozhatsky 提交于
      console_unlock() allows to cond_resched() if its caller has set
      `console_may_schedule' to 1 (this functionality is present since
      8d91f8b1 ("printk: do cond_resched() between lines while outputting
      to consoles").
      
      The rules are:
      -- console_lock() always sets `console_may_schedule' to 1
      -- console_trylock() always sets `console_may_schedule' to 0
      
      printk() calls console_unlock() with preemption desabled, which
      basically can lead to RCU stalls, watchdog soft lockups, etc.  if
      something is simultaneously calling printk() frequent enough (IOW,
      console_sem owner always has new data to send to console divers and
      can't leave console_unlock() for a long time).
      
      printk()->console_trylock() callers do not necessarily execute in atomic
      contexts, and some of them can cond_resched() in console_unlock().
      console_trylock() can set `console_may_schedule' to 1 (allow
      cond_resched() later in consoe_unlock()) when it's safe.
      
      This patch (of 3):
      
      vprintk_emit() disables preemption around console_trylock_for_printk()
      and console_unlock() calls for a strong reason -- can_use_console()
      check.  The thing is that vprintl_emit() can be called on a CPU that is
      not fully brought up yet (!cpu_online()), which potentially can cause
      problems if console driver wants to access per-cpu data.  A console
      driver can explicitly state that it's safe to call it from !online cpu
      by setting CON_ANYTIME bit in console ->flags.  That's why for
      !cpu_online() can_use_console() iterates all the console to find out if
      there is a CON_ANYTIME console, otherwise console_unlock() must be
      avoided.
      
      can_use_console() ensures that console_unlock() call is safe in
      vprintk_emit() only; console_lock() and console_trylock() are not
      covered by this check.  Even though call_console_drivers(), invoked from
      console_cont_flush() and console_unlock(), tests `!cpu_online() &&
      CON_ANYTIME' for_each_console(), it may be too late, which can result in
      messages loss.
      
      Assume that we have 2 cpus -- CPU0 is online, CPU1 is !online, and no
      CON_ANYTIME consoles available.
      
      CPU0 online                        CPU1 !online
                                       console_trylock()
                                       ...
                                       console_unlock()
                                         console_cont_flush
                                           spin_lock logbuf_lock
                                           if (!cont.len) {
                                              spin_unlock logbuf_lock
                                              return
                                           }
                                         for (;;) {
      vprintk_emit
        spin_lock logbuf_lock
        log_store
        spin_unlock logbuf_lock
                                           spin_lock logbuf_lock
        !console_trylock_for_printk        msg_print_text
       return                              console_idx = log_next()
                                           console_seq++
                                           console_prev = msg->flags
                                           spin_unlock logbuf_lock
      
                                           call_console_drivers()
                                             for_each_console(con) {
                                               if (!cpu_online() &&
                                                   !(con->flags & CON_ANYTIME))
                                                       continue;
                                               }
                                         /*
                                          * no message printed, we lost it
                                          */
      vprintk_emit
        spin_lock logbuf_lock
        log_store
        spin_unlock logbuf_lock
        !console_trylock_for_printk
       return
                                         /*
                                          * go to the beginning of the loop,
                                          * find out there are new messages,
                                          * lose it
                                          */
                                         }
      
      console_trylock()/console_lock() call on CPU1 may come from cpu
      notifiers registered on that CPU.  Since notifiers are not getting
      unregistered when CPU is going DOWN, all of the notifiers receive
      notifications during CPU UP.  For example, on my x86_64, I see around 50
      notification sent from offline CPU to itself
      
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING hotplug_hrtick
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_main_cpu_notify
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_queue_reinit_notify
       [swapper/2] from cpu:2 to:2 action:CPU_STARTING console_cpu_notify
      
      while doing
        echo 0 > /sys/devices/system/cpu/cpu2/online
        echo 1 > /sys/devices/system/cpu/cpu2/online
      
      So grabbing the console_sem lock while CPU is !online is possible,
      in theory.
      
      This patch moves can_use_console() check out of
      console_trylock_for_printk().  Instead it calls it in console_unlock(),
      so now console_lock()/console_unlock() are also 'protected' by
      can_use_console().  This also means that console_trylock_for_printk() is
      not really needed anymore and can be removed.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NPetr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kyle McMartin <kyle@kernel.org>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Calvin Owens <calvinowens@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8199371
    • J
      timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      John Stultz 提交于
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to be able to set the timerslack value
      on other processes in order to save power by avoiding wakeups (Something
      Android currently does via out-of-tree patches).
      
      The first patch tries to fix the internal timer_slack_ns usage which was
      defined as a long, which limits the slack range to ~4 seconds on 32bit
      systems.  It converts it to a u64, which provides the same basically
      unlimited slack (500 years) on both 32bit and 64bit machines.
      
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      
      $ time sleep 1
      
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      
      Other than that things are fairly straightforward.
      
      This patch (of 2):
      
      The timer_slack_ns value in the task struct is currently a unsigned
      long.  This means that on 32bit applications, the maximum slack is just
      over 4 seconds.  However, on 64bit machines, its much much larger (~500
      years).
      
      This disparity could make application development a little (as well as
      the default_slack) to a u64.  This means both 32bit and 64bit systems
      have the same effective internal slack range.
      
      Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
      the interface as a unsigned long, so we preserve that limitation on
      32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
      long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
      actually larger then what can be stored by an unsigned long.
      
      This patch also modifies hrtimer functions which specified the slack
      delta as a unsigned long.
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da8b44d5
    • J
      mm: scale kswapd watermarks in proportion to memory · 795ae7a0
      Johannes Weiner 提交于
      In machines with 140G of memory and enterprise flash storage, we have
      seen read and write bursts routinely exceed the kswapd watermarks and
      cause thundering herds in direct reclaim.  Unfortunately, the only way
      to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
      system's emergency reserves - which is entirely unrelated to the
      system's latency requirements.  In order to get kswapd to maintain a
      250M buffer of free memory, the emergency reserves need to be set to 1G.
      That is a lot of memory wasted for no good reason.
      
      On the other hand, it's reasonable to assume that allocation bursts and
      overall allocation concurrency scale with memory capacity, so it makes
      sense to make kswapd aggressiveness a function of that as well.
      
      Change the kswapd watermark scale factor from the currently fixed 25% of
      the tunable emergency reserve to a tunable 0.1% of memory.
      
      Beyond 1G of memory, this will produce bigger watermark steps than the
      current formula in default settings.  Ensure that the new formula never
      chooses steps smaller than that, i.e.  25% of the emergency reserve.
      
      On a 140G machine, this raises the default watermark steps - the
      distance between min and low, and low and high - from 16M to 143M.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      795ae7a0
    • V
      mm: memcontrol: report kernel stack usage in cgroup2 memory.stat · 12580e4b
      Vladimir Davydov 提交于
      Show how much memory is allocated to kernel stacks.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12580e4b
    • J
      watchdog: don't run proc_watchdog_update if new value is same as old · a1ee1932
      Joshua Hunt 提交于
      While working on a script to restore all sysctl params before a series of
      tests I found that writing any value into the
      /proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
      causes them to call proc_watchdog_update().
      
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
        NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
      
      There doesn't appear to be a reason for doing this work every time a write
      occurs, so only do it when the values change.
      Signed-off-by: NJosh Hunt <johunt@akamai.com>
      Acked-by: NDon Zickus <dzickus@redhat.com>
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.1.x+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1ee1932
  2. 17 3月, 2016 2 次提交
  3. 16 3月, 2016 4 次提交
    • A
      kallsyms: add support for relative offsets in kallsyms address table · 2213e9a6
      Ard Biesheuvel 提交于
      Similar to how relative extables are implemented, it is possible to emit
      the kallsyms table in such a way that it contains offsets relative to
      some anchor point in the kernel image rather than absolute addresses.
      
      On 64-bit architectures, it cuts the size of the kallsyms address table
      in half, since offsets between kernel symbols can typically be expressed
      in 32 bits.  This saves several hundreds of kilobytes of permanent
      .rodata on average.  In addition, the kallsyms address table is no
      longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
      effect, so the relocation work done after decompression now doesn't have
      to do relocation updates for all these values.  This saves up to 24
      bytes (i.e., the size of a ELF64 RELA relocation table entry) per value,
      which easily adds up to a couple of megabytes of uncompressed __init
      data on ppc64 or arm64.  Even if these relocation entries typically
      compress well, the combined size reduction of 2.8 MB uncompressed for a
      ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
      KB space saving in the compressed image.
      
      Since it is useful for some architectures (like x86) to retain the
      ability to emit absolute values as well, this patch also adds support
      for capturing both absolute and relative values when
      KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
      addresses as positive 32-bit values, and addresses relative to the
      lowest encountered relative symbol as negative values, which are
      subtracted from the runtime address of this base symbol to produce the
      actual address.
      
      Support for the above is enabled by default for all architectures except
      IA-64 and Tile-GX, whose symbols are too far apart to capture in this
      manner.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Tested-by: NKees Cook <keescook@chromium.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2213e9a6
    • L
      mm/page_poisoning.c: allow for zero poisoning · 1414c7f4
      Laura Abbott 提交于
      By default, page poisoning uses a poison value (0xaa) on free.  If this
      is changed to 0, the page is not only sanitized but zeroing on alloc
      with __GFP_ZERO can be skipped as well.  The tradeoff is that detecting
      corruption from the poisoning is harder to detect.  This feature also
      cannot be used with hibernation since pages are not guaranteed to be
      zeroed after hibernation.
      
      Credit to Grsecurity/PaX team for inspiring this work
      Signed-off-by: NLaura Abbott <labbott@fedoraproject.org>
      Acked-by: NRafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1414c7f4
    • A
      mm: fix two typos in comments for to_vmem_altmap() · 07061aab
      Andreas Ziegler 提交于
      Commit 4b94ffdc ("x86, mm: introduce vmem_altmap to augment
      vmemmap_populate()"), introduced the to_vmem_altmap() function.
      
      The comments in this function contain two typos (one misspelling of the
      Kconfig option CONFIG_SPARSEMEM_VMEMMAP, and one missing letter 'n'),
      let's fix them up.
      Signed-off-by: NAndreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07061aab
    • P
      tags: Fix DEFINE_PER_CPU expansions · 25528213
      Peter Zijlstra 提交于
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      
      Which are all the result of the DEFINE_PER_CPU pattern:
      
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      
      The below cures them. All except the workqueue one are within reasonable
      distance of the 80 char limit. TJ do you have any preference on how to
      fix the wq one, or shall we just not care its too long?
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25528213
  4. 13 3月, 2016 1 次提交
  5. 12 3月, 2016 1 次提交
  6. 11 3月, 2016 2 次提交
    • T
      cpu/hotplug: Fix smpboot thread ordering · 2a58c527
      Thomas Gleixner 提交于
      Commit 931ef163 moved the smpboot thread park/unpark invocation to the
      state machine. The move of the unpark invocation was premature as it depends
      on work in progress patches.
      
      As a result cpu down can fail, because rcu synchronization in takedown_cpu()
      eventually requires a functional softirq thread. I never encountered the
      problem in testing, but 0day testing managed to provide a reliable reproducer.
      
      Remove the smpboot_threads_park() call from the state machine for now and put
      it back into the original place after the rcu synchronization.
      
      I'm embarrassed as I knew about the dependency and still managed to get it
      wrong. Hotplug induced brain melt seems to be the only sensible explanation
      for that.
      
      Fixes: 931ef163 "cpu/hotplug: Unpark smpboot threads from the state machine"
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2a58c527
    • R
      cpufreq: Move scheduler-related code to the sched directory · adaf9fcd
      Rafael J. Wysocki 提交于
      Create cpufreq.c under kernel/sched/ and move the cpufreq code
      related to the scheduler to that file and to sched.h.
      
      Redefine cpufreq_update_util() as a static inline function to avoid
      function calls at its call sites in the scheduler code (as suggested
      by Peter Zijlstra).
      
      Also move the definition of struct update_util_data and declaration
      of cpufreq_set_update_util_data() from include/linux/cpufreq.h to
      include/linux/sched.h.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      adaf9fcd
  7. 10 3月, 2016 11 次提交
  8. 09 3月, 2016 3 次提交
    • R
      cpufreq: Add mechanism for registering utilization update callbacks · 34e2c555
      Rafael J. Wysocki 提交于
      Introduce a mechanism by which parts of the cpufreq subsystem
      ("setpolicy" drivers or the core) can register callbacks to be
      executed from cpufreq_update_util() which is invoked by the
      scheduler's update_load_avg() on CPU utilization changes.
      
      This allows the "setpolicy" drivers to dispense with their timers
      and do all of the computations they need and frequency/voltage
      adjustments in the update_load_avg() code path, among other things.
      
      The update_load_avg() changes were suggested by Peter Zijlstra.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      34e2c555
    • C
      x86/ACPI/PCI: Recognize that Interrupt Line 255 means "not connected" · e237a551
      Chen Fan 提交于
      Per the x86-specific footnote to PCI spec r3.0, sec 6.2.4, the value 255 in
      the Interrupt Line register means "unknown" or "no connection."
      Previously, when we couldn't derive an IRQ from the _PRT, we fell back to
      using the value from Interrupt Line as an IRQ.  It's questionable whether
      we should do that at all, but the spec clearly suggests we shouldn't do it
      for the value 255 on x86.
      
      Calling request_irq() with IRQ 255 may succeed, but the driver won't
      receive any interrupts.  Or, if IRQ 255 is shared with another device, it
      may succeed, and the driver's ISR will be called at random times when the
      *other* device interrupts.  Or it may fail if another device is using IRQ
      255 with incompatible flags.  What we *want* is for request_irq() to fail
      predictably so the driver can fall back to polling.
      
      On x86, assume 255 in the Interrupt Line means the INTx line is not
      connected.  In that case, set dev->irq to IRQ_NOTCONNECTED so request_irq()
      will fail gracefully with -ENOTCONN.
      
      We found this problem on a system where Secure Boot firmware assigned
      Interrupt Line 255 to an i801_smbus device and another device was already
      using MSI-X IRQ 255.  This was in v3.10, where i801_probe() fails if
      request_irq() fails:
      
        i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143)
        i801_smbus 0000:00:1f.3: can't derive routing for PCI INT C
        i801_smbus 0000:00:1f.3: PCI INT C: no GSI
        genirq: Flags mismatch irq 255. 00000080 (i801_smbus) vs. 00000000 (megasa)
        CPU: 0 PID: 2487 Comm: kworker/0:1 Not tainted 3.10.0-229.el7.x86_64 #1
        Hardware name: FUJITSU PRIMEQUEST 2800E2/D3736, BIOS PRIMEQUEST 2000 Serie5
        Call Trace:
          dump_stack+0x19/0x1b
          __setup_irq+0x54a/0x570
          request_threaded_irq+0xcc/0x170
          i801_probe+0x32f/0x508 [i2c_i801]
          local_pci_probe+0x45/0xa0
        i801_smbus 0000:00:1f.3: Failed to allocate irq 255: -16
        i801_smbus: probe of 0000:00:1f.3 failed with error -16
      
      After aeb8a3d1 ("i2c: i801: Check if interrupts are disabled"),
      i801_probe() will fall back to polling if request_irq() fails.  But we
      still need this patch because request_irq() may succeed or fail depending
      on other devices in the system.  If request_irq() fails, i801_smbus will
      work by falling back to polling, but if it succeeds, i801_smbus won't work
      because it expects interrupts that it may not receive.
      Signed-off-by: NChen Fan <chen.fan.fnst@cn.fujitsu.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      e237a551
    • J
      futex: Replace barrier() in unqueue_me() with READ_ONCE() · 29b75eb2
      Jianyu Zhan 提交于
      Commit e91467ec ("bug in futex unqueue_me") introduced a barrier() in
      unqueue_me() to prevent the compiler from rereading the lock pointer which
      might change after a check for NULL.
      
      Replace the barrier() with a READ_ONCE() for the following reasons:
      
      1) READ_ONCE() is a weaker form of barrier() that affects only the specific
         load operation, while barrier() is a general compiler level memory barrier.
         READ_ONCE() was not available at the time when the barrier was added.
      
      2) Aside of that READ_ONCE() is descriptive and self explainatory while a
         barrier without comment is not clear to the casual reader.
      
      No functional change.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: NJianyu Zhan <nasa4836@gmail.com>
      Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: NDarren Hart <dvhart@linux.intel.com>
      Cc: dave@stgolabs.net
      Cc: peterz@infradead.org
      Cc: linux@rasmusvillemoes.dk
      Cc: akpm@linux-foundation.org
      Cc: fengguang.wu@intel.com
      Cc: bigeasy@linutronix.de
      Link: http://lkml.kernel.org/r/1457314344-5685-1-git-send-email-nasa4836@gmail.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      29b75eb2
  9. 08 3月, 2016 4 次提交
    • C
      sched/cputime: Fix steal_account_process_tick() to always return jiffies · f9c904b7
      Chris Friesen 提交于
      The callers of steal_account_process_tick() expect it to return
      whether a jiffy should be considered stolen or not.
      
      Currently the return value of steal_account_process_tick() is in
      units of cputime, which vary between either jiffies or nsecs
      depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.
      
      If cputime has nsecs granularity and there is a tiny amount of
      stolen time (a few nsecs, say) then we will consider the entire
      tick stolen and will not account the tick on user/system/idle,
      causing /proc/stats to show invalid data.
      
      The fix is to change steal_account_process_tick() to accumulate
      the stolen time and only account it once it's worth a jiffy.
      
      (Thanks to Frederic Weisbecker for suggestions to fix a bug in my
      first version of the patch.)
      Signed-off-by: NChris Friesen <chris.friesen@windriver.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.caSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f9c904b7
    • L
      sched/deadline: Remove dl_new from struct sched_dl_entity · 72f9f3fd
      Luca Abeni 提交于
      The dl_new field of struct sched_dl_entity is currently used to
      identify new deadline tasks, so that their deadline and runtime
      can be properly initialised.
      
      However, these tasks can be easily identified by checking if
      their deadline is smaller than the current time when they switch
      to SCHED_DEADLINE. So, dl_new can be removed by introducing this
      check in switched_to_dl(); this allows to simplify the
      SCHED_DEADLINE code.
      Signed-off-by: NLuca Abeni <luca.abeni@unitn.it>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1457350024-7825-2-git-send-email-luca.abeni@unitn.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
      72f9f3fd
    • A
      perf/core: Fix perf_sched_count derailment · 927a5570
      Alexander Shishkin 提交于
      The error path in perf_event_open() is such that asking for a sampling
      event on a PMU that doesn't generate interrupts will end up in dropping
      the perf_sched_count even though it hasn't been incremented for this
      event yet.
      
      Given a sufficient amount of these calls, we'll end up disabling
      scheduler's jump label even though we'd still have active events in the
      system, thereby facilitating the arrival of the infernal regions upon us.
      
      I'm fixing this by moving account_event() inside perf_event_alloc().
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/1456917854-29427-1-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      927a5570
    • I
      time/timekeeping: Work around false positive GCC warning · 6436257b
      Ingo Molnar 提交于
      Newer GCC versions trigger the following warning:
      
        kernel/time/timekeeping.c: In function ‘get_device_system_crosststamp’:
        kernel/time/timekeeping.c:987:5: warning: ‘clock_was_set_seq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
          if (discontinuity) {
           ^
        kernel/time/timekeeping.c:1045:15: note: ‘clock_was_set_seq’ was declared here
          unsigned int clock_was_set_seq;
                       ^
      
      GCC clearly is unable to recognize that the 'do_interp' boolean tracks
      the initialization status of 'clock_was_set_seq'.
      
      The GCC version used was:
      
        gcc version 5.3.1 20151207 (Red Hat 5.3.1-2) (GCC)
      
      Work it around by initializing clock_was_set_seq to 0. Compilers that
      are able to recognize the code flow will eliminate the unnecessary
      initialization.
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6436257b
  10. 06 3月, 2016 1 次提交
  11. 04 3月, 2016 1 次提交
    • S
      tracing: Do not have 'comm' filter override event 'comm' field · e57cbaf0
      Steven Rostedt (Red Hat) 提交于
      Commit 9f616680 "tracing: Allow triggers to filter for CPU ids and
      process names" added a 'comm' filter that will filter events based on the
      current tasks struct 'comm'. But this now hides the ability to filter events
      that have a 'comm' field too. For example, sched_migrate_task trace event.
      That has a 'comm' field of the task to be migrated.
      
       echo 'comm == "bash"' > events/sched_migrate_task/filter
      
      will now filter all sched_migrate_task events for tasks named "bash" that
      migrates other tasks (in interrupt context), instead of seeing when "bash"
      itself gets migrated.
      
      This fix requires a couple of changes.
      
      1) Change the look up order for filter predicates to look at the events
         fields before looking at the generic filters.
      
      2) Instead of basing the filter function off of the "comm" name, have the
         generic "comm" filter have its own filter_type (FILTER_COMM). Test
         against the type instead of the name to assign the filter function.
      
      3) Add a new "COMM" filter that works just like "comm" but will filter based
         on the current task, even if the trace event contains a "comm" field.
      
      Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU".
      
      Cc: stable@vger.kernel.org # v4.3+
      Fixes: 9f616680 "tracing: Allow triggers to filter for CPU ids and process names"
      Reported-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      e57cbaf0