1. 24 Mar 2018, 1 commit
  2. 22 Mar 2018, 1 commit
  3. 21 Mar 2018, 2 commits
  4. 20 Mar 2018, 5 commits
    • sched/debug: Adjust newlines for better alignment · e9ca2670
      By Joe Lawrence
      Scheduler debug stats include newlines that display out of alignment
      when prefixed by timestamps.  For example, the dmesg utility:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        [   83.124251]
        runnable tasks:
         S           task   PID         tree-key  switches  prio     wait-time
        sum-exec        sum-sleep
        -----------------------------------------------------------------------------------------------------------
      
      At the same time, some syslog utilities (like rsyslog in its default
      configuration) don't like the additional newline control characters,
      saving lines like this to /var/log/messages:
      
        Mar 16 16:02:29 localhost kernel: #012runnable tasks:#012 S           task   PID         tree-key ...
                                          ^^^^               ^^^^
      Clean these up by moving the newline characters to their own SEQ_printf
      invocation.  This leaves /proc/sched_debug unchanged, but brings the
      entire output into alignment when prefixed:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        [   62.410368] runnable tasks:
        [   62.410368]  S           task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
        [   62.410369] -----------------------------------------------------------------------------------------------------------
        [   62.410369]  I  kworker/u12:0     5      1932.215593       332   120         0.000000         3.621252         0.000000 0 0 /
      
      and no escaped control characters from rsyslog in /var/log/messages:
      
        Mar 16 16:15:06 localhost kernel: runnable tasks:
        Mar 16 16:15:06 localhost kernel: S           task   PID         tree-key  ...
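
      The change itself is mechanical; a hedged sketch of the pattern (not
      the exact diff):

        /* Before: the newline travels with the header text, so the
         * timestamp prefix ends up on the empty line instead: */
        SEQ_printf(m, "\nrunnable tasks:\n");

        /* After: the newline gets its own invocation, and every visible
         * line starts flush with its own prefix: */
        SEQ_printf(m, "\n");
        SEQ_printf(m, "runnable tasks:\n");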
      Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1521484555-8620-3-git-send-email-joe.lawrence@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Fix per-task line continuation for console output · a8c024cd
      By Joe Lawrence
      When the SEQ_printf() macro prints to the console, it runs a simple
      printk() without KERN_CONT "continued" line printing.  The result of
      this is oddly wrapped task info, for example:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        runnable tasks:
        ...
        [   29.608611]  I
        [   29.608613]       rcu_sched     8      3252.013846      4087   120
        [   29.608614]         0.000000        29.090111         0.000000
        [   29.608615]  0 0
        [   29.608616]  /
      
      Modify SEQ_printf to use pr_cont() for expected one-line results:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        runnable tasks:
        ...
        [  106.716329]  S        cpuhp/5    37      2006.315026        14   120         0.000000         0.496893         0.000000 0 0 /
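
      A minimal sketch of the modified macro, assuming the pre-existing
      SEQ_printf() shape in kernel/sched/debug.c:

        #define SEQ_printf(m, x...)                 \
         do {                                       \
                if (m)                              \
                        seq_printf(m, x);           \
                else                                \
                        pr_cont(x); /* was printk(x), which broke line continuation */ \
         } while (0)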
      Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1521484555-8620-2-git-send-email-joe.lawrence@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/cgroup: Fix child event counting bug · c917e0f2
      By Song Liu
      When a perf_event is attached to parent cgroup, it should count events
      for all children cgroups:
      
         parent_group   <---- perf_event
           \
            - child_group  <---- process(es)
      
      However, in our tests, we found this perf_event cannot report reliable
      results. Here is an example case:
      
        # create cgroups
        mkdir -p /sys/fs/cgroup/p/c
        # start perf for parent group
        perf stat -e instructions -G "p"
      
        # on another console, run test process in child cgroup:
        stressapptest -s 2 -M 1000 & echo $! > /sys/fs/cgroup/p/c/cgroup.procs
      
        # after the test process is done, stopping perf in the first console shows
      
             <not counted>      instructions              p
      
      The instruction count should not be "not counted", since the process
      runs in the child cgroup.
      
      We found this is because perf_event->cgrp and cpuctx->cgrp are not
      identical, and thus perf_event->cgrp is not updated properly.
      
      This patch fixes this by updating perf_cgroup properly for ancestor
      cgroup(s).
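
      A hedged sketch of the idea (the exact call sites differ in the real
      patch): treat an event as active when the running cgroup is a
      descendant of the event's cgroup, and stamp every ancestor when
      setting cgroup timestamps:

        /* the event counts if the current cgroup is event->cgrp or below it */
        if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup))
                __update_cgrp_time(event->cgrp);

        /* propagate the context timestamp up the cgroup ancestry */
        for (css = &cgrp->css; css; css = css->parent) {
                struct perf_cgroup *pc = container_of(css, struct perf_cgroup, css);
                struct perf_cgroup_info *info = this_cpu_ptr(pc->info);

                info->timestamp = ctx->timestamp;
        }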
      Reported-by: Ephraim Park <ephiepark@fb.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <jolsa@redhat.com>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20180312165943.1057894-1-songliubraving@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Disable jump labels in __exit code · 578ae447
      By Josh Poimboeuf
      With the following commit:
      
        33352244 ("jump_label: Explicitly disable jump labels in __init code")
      
      ... we explicitly disabled jump labels in __init code, so they could be
      detected and not warned about in the following commit:
      
        dc1dd184 ("jump_label: Warn on failed jump_label patching attempt")
      
      In-kernel __exit code has the same issue.  It's never used, so it's
      freed along with the rest of initmem.  But jump label entries in __exit
      code aren't explicitly disabled, so we get the following warning when
      enabling pr_debug() in __exit code:
      
        can't patch jump_label at dmi_sysfs_exit+0x0/0x2d
        WARNING: CPU: 0 PID: 22572 at kernel/jump_label.c:376 __jump_label_update+0x9d/0xb0
      
      Fix the warning by disabling all jump labels in initmem (which includes
      both __init and __exit code).
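
      A hedged sketch of the idea (identifier names approximate): when
      initmem is freed, invalidate every jump entry whose code address lies
      anywhere in initmem, not only in __init text:

        /* called on free_initmem(): zap entries in __init and __exit */
        for (entry = __start___jump_table; entry < __stop___jump_table; entry++) {
                if (init_section_contains((void *)(unsigned long)entry->code, 1))
                        entry->code = 0; /* disabled; the patcher skips it */
        }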
      Reported-and-tested-by: Li Wang <liwang@redhat.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: dc1dd184 ("jump_label: Warn on failed jump_label patching attempt")
      Link: http://lkml.kernel.org/r/7121e6e595374f06616c505b6e690e275c0054d1.1521483452.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/mutex: Improve documentation · 45dbac0e
      By Matthew Wilcox
      On Wed, Mar 14, 2018 at 01:56:31PM -0700, Andrew Morton wrote:
      
      > My memory is weak and our documentation is awful.  What does
      > mutex_lock_killable() actually do and how does it differ from
      > mutex_lock_interruptible()?
      
      Add kernel-doc for mutex_lock_killable() and mutex_lock_io().  Reword the
      kernel-doc for mutex_lock_interruptible().
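
      The additions follow the usual kernel-doc shape; roughly (wording
      approximate, not the committed text):

        /**
         * mutex_lock_killable() - Acquire the mutex, interruptible by fatal signals.
         * @lock: The mutex to be acquired.
         *
         * Lock the mutex like mutex_lock().  If a fatal signal is delivered
         * while the process sleeps, this function returns without acquiring
         * the mutex.
         *
         * Return: 0 if the lock was acquired, -EINTR on a fatal signal.
         */
        int __sched mutex_lock_killable(struct mutex *lock);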
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: cl@linux.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/20180315115812.GA9949@bombadil.infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 14 Mar 2018, 3 commits
  6. 12 Mar 2018, 1 commit
  7. 10 Mar 2018, 1 commit
  8. 09 Mar 2018, 3 commits
    • rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsites · 6b0ef92f
      By Boqun Feng
      When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and
      kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a
      HARDIRQ-safe->HARDIRQ-unsafe deadlock:
      
       ================================
       WARNING: inconsistent lock state
       4.16.0-rc4+ #1 Not tainted
       --------------------------------
       inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
       takes:
       __schedule+0xbe/0xaf0
       {IN-HARDIRQ-W} state was registered at:
         _raw_spin_lock+0x2a/0x40
         scheduler_tick+0x47/0xf0
      ...
       other info that might help us debug this:
        Possible unsafe locking scenario:
              CPU0
              ----
         lock(&rq->lock);
         <Interrupt>
           lock(&rq->lock);
        *** DEADLOCK ***
       1 lock held by rcu_torture_rea/724:
       rcu_torture_read_lock+0x0/0x70
       stack backtrace:
       CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
       Call Trace:
        lock_acquire+0x90/0x200
        ? __schedule+0xbe/0xaf0
        _raw_spin_lock+0x2a/0x40
        ? __schedule+0xbe/0xaf0
        __schedule+0xbe/0xaf0
        preempt_schedule_irq+0x2f/0x60
        retint_kernel+0x1b/0x2d
       RIP: 0010:rcu_read_unlock_special+0x0/0x680
        ? rcu_torture_read_unlock+0x60/0x60
        __rcu_read_unlock+0x64/0x70
        rcu_torture_read_unlock+0x17/0x60
        rcu_torture_reader+0x275/0x450
        ? rcutorture_booster_init+0x110/0x110
        ? rcu_torture_stall+0x230/0x230
        ? kthread+0x10e/0x130
        kthread+0x10e/0x130
        ? kthread_create_worker_on_cpu+0x70/0x70
        ? call_usermodehelper_exec_async+0x11a/0x150
        ret_from_fork+0x3a/0x50
      
      This happens with the following event sequence:
      
      	preempt_schedule_irq();
      	  local_irq_enable();
      	  __schedule():
      	    local_irq_disable(); // irq off
      	    ...
      	    rcu_note_context_switch():
      	      rcu_note_preempt_context_switch():
      	        rcu_read_unlock_special():
      	          local_irq_save(flags);
      	          ...
      		  raw_spin_unlock_irqrestore(...,flags); // irq remains off
      	          rt_mutex_futex_unlock():
      	            raw_spin_lock_irq();
      	            ...
      	            raw_spin_unlock_irq(); // accidentally set irq on
      
      	    <return to __schedule()>
      	    rq_lock():
      	      raw_spin_lock(); // acquiring rq->lock with irq on
      
      which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause
      deadlocks in scheduler code.
      
      This problem was introduced by commit 02a7c234 ("rcu: Suppress
      lockdep false-positive ->boost_mtx complaints"), which introduced a
      caller of rt_mutex_futex_unlock() with irqs off.
      
      To fix this, replace the *lock_irq() calls in rt_mutex_futex_unlock()
      with *lock_irq{save,restore}() so that it is safe to call
      rt_mutex_futex_unlock() with irqs off.
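
      A minimal sketch of the change (wakeup details elided):

        void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
        {
                unsigned long flags;

                /* was raw_spin_lock_irq(): force-enabled irqs on unlock */
                raw_spin_lock_irqsave(&lock->wait_lock, flags);
                /* ... wake the top waiter as before ... */
                raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
        }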
      
      Fixes: 02a7c234 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints")
      Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com
    • perf/core: Fix ctx_event_type in ctx_resched() · bd903afe
      By Song Liu
      In ctx_resched(), EVENT_FLEXIBLE events should be scheduled out when
      EVENT_PINNED events are added.  However, ctx_resched() calculates
      ctx_event_type before checking this condition.  As a result, pinned
      events will NOT get higher priority than flexible events.
      
      The following shows this issue on an Intel CPU (where ref-cycles can
      only use one hardware counter).
      
        1. First start:
             perf stat -C 0 -e ref-cycles  -I 1000
        2. Then, in the second console, run:
             perf stat -C 0 -e ref-cycles:D -I 1000
      
      The second perf invocation uses pinned events, which are expected to
      have higher priority.  However, because rescheduling fails in
      ctx_resched(), it never runs.
      
      This patch fixes this by calculating ctx_event_type after re-evaluating
      event_type.
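
      A minimal sketch of the reordering, following the description above:

        static void ctx_resched(struct perf_cpu_context *cpuctx,
                                struct perf_event_context *task_ctx,
                                enum event_type_t event_type)
        {
                enum event_type_t ctx_event_type; /* was computed here, too early */
                bool cpu_event = !!(event_type & EVENT_CPU);

                /* adding pinned events must also schedule out flexible ones */
                if (event_type & EVENT_PINNED)
                        event_type |= EVENT_FLEXIBLE;

                ctx_event_type = event_type & EVENT_ALL; /* now sees the fixup */
                /* ... sched_out/sched_in as before ... */
        }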
      Reported-by: Ephraim Park <ephiepark@fb.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <jolsa@redhat.com>
      Cc: <kernel-team@fb.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 487f05e1 ("perf/core: Optimize event rescheduling on active contexts")
      Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • module: propagate error in modules_open() · 3f553b30
      By Leon Yu
      Otherwise the kernel can oops later in seq_release() due to
      dereferencing a NULL file->private_data, which is only set if
      seq_open() succeeds.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      IP: seq_release+0xc/0x30
      Call Trace:
       close_pdeo+0x37/0xd0
       proc_reg_release+0x5d/0x60
       __fput+0x9d/0x1d0
       ____fput+0x9/0x10
       task_work_run+0x75/0x90
       do_exit+0x252/0xa00
       do_group_exit+0x36/0xb0
       SyS_exit_group+0xf/0x10
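
      A hedged sketch of the fix, close to (but not necessarily) the exact
      diff:

        static int modules_open(struct inode *inode, struct file *file)
        {
                int err = seq_open(file, &modules_op);

                if (!err) { /* only touch private_data on success */
                        struct seq_file *m = file->private_data;

                        m->private = kallsyms_show_value() ? NULL : (void *)8ul;
                }

                return err; /* was: unconditionally returned 0 */
        }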
      
      Fixes: 516fb7f2 ("/proc/module: use the same logic as /proc/kallsyms for address exposure")
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org # 4.15+
      Signed-off-by: Leon Yu <chianglungyu@gmail.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
  9. 07 Mar 2018, 1 commit
  10. 03 Mar 2018, 2 commits
    • memremap: fix softlockup reports at teardown · 949b9325
      By Dan Williams
      The cond_resched() currently in the setup path needs to be duplicated in
      the teardown path. Rather than require each instance of
      for_each_device_pfn() to open code the same sequence, embed it in the
      helper.
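
      A minimal sketch of the idea, assuming the existing pfn_first() and
      pfn_end() helpers in kernel/memremap.c:

        static unsigned long pfn_next(unsigned long pfn)
        {
                /* yield periodically so huge teardowns do not trigger the
                 * soft-lockup watchdog */
                if (pfn % 1024 == 0)
                        cond_resched();
                return pfn + 1;
        }

        #define for_each_device_pfn(pfn, map) \
                for (pfn = pfn_first(map); pfn < pfn_end(map); pfn = pfn_next(pfn))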
      
      Link: https://github.com/intel/ixpdimm_sw/issues/11
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Fixes: 71389703 ("mm, zone_device: Replace {get, put}_zone_device_page()...")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • signals: Move put_compat_sigset to compat.h to silence hardened usercopy · fde9fc76
      By Matt Redfearn
      Since commit afcc90f8 ("usercopy: WARN() on slab cache usercopy
      region violations"), MIPS systems booting with a compat root filesystem
      emit a warning when copying compat siginfo to userspace:
      
      WARNING: CPU: 0 PID: 953 at mm/usercopy.c:81 usercopy_warn+0x98/0xe8
      Bad or missing usercopy whitelist? Kernel memory exposure attempt
      detected from SLAB object 'task_struct' (offset 1432, size 16)!
      Modules linked in:
      CPU: 0 PID: 953 Comm: S01logging Not tainted 4.16.0-rc2 #10
      Stack : ffffffff808c0000 0000000000000000 0000000000000001 65ac85163f3bdc4a
      	65ac85163f3bdc4a 0000000000000000 90000000ff667ab8 ffffffff808c0000
      	00000000000003f8 ffffffff808d0000 00000000000000d1 0000000000000000
      	000000000000003c 0000000000000000 ffffffff808c8ca8 ffffffff808d0000
      	ffffffff808d0000 ffffffff80810000 fffffc0000000000 ffffffff80785c30
      	0000000000000009 0000000000000051 90000000ff667eb0 90000000ff667db0
      	000000007fe0d938 0000000000000018 ffffffff80449958 0000000020052798
      	ffffffff808c0000 90000000ff664000 90000000ff667ab0 00000000100c0000
      	ffffffff80698810 0000000000000000 0000000000000000 0000000000000000
      	0000000000000000 0000000000000000 ffffffff8010d02c 65ac85163f3bdc4a
      	...
      Call Trace:
      [<ffffffff8010d02c>] show_stack+0x9c/0x130
      [<ffffffff80698810>] dump_stack+0x90/0xd0
      [<ffffffff80137b78>] __warn+0x100/0x118
      [<ffffffff80137bdc>] warn_slowpath_fmt+0x4c/0x70
      [<ffffffff8021e4a8>] usercopy_warn+0x98/0xe8
      [<ffffffff8021e68c>] __check_object_size+0xfc/0x250
      [<ffffffff801bbfb8>] put_compat_sigset+0x30/0x88
      [<ffffffff8011af24>] setup_rt_frame_n32+0xc4/0x160
      [<ffffffff8010b8b4>] do_signal+0x19c/0x230
      [<ffffffff8010c408>] do_notify_resume+0x60/0x78
      [<ffffffff80106f50>] work_notifysig+0x10/0x18
      ---[ end trace 88fffbf69147f48a ]---
      
      Commit 5905429a ("fork: Provide usercopy whitelisting for
      task_struct") noted that:
      
      "While the blocked and saved_sigmask fields of task_struct are copied to
      userspace (via sigmask_to_save() and setup_rt_frame()), it is always
      copied with a static length (i.e. sizeof(sigset_t))."
      
      However, this is not true in the case of compat signals, whose sigset
      is copied by put_compat_sigset(), which receives the size as an
      argument.
      
      At most call sites, put_compat_sigset() copies a sigset from the
      current task_struct.  This triggers a warning when
      CONFIG_HARDENED_USERCOPY is active.  However, by marking this function
      static inline, the warning can be avoided, because in all of these
      cases the size is constant at compile time, which is allowed.  The only
      site where this is not the case is the rt_sigpending syscall handler,
      but there the copy is made from a stack-local variable, so it does not
      trigger the warning.
      
      Move put_compat_sigset to compat.h, and mark it static inline. This
      fixes the WARN on MIPS.
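
      A hedged sketch of the moved helper (little-endian path only; the
      big-endian variant additionally swizzles 32-bit word order):

        /* include/linux/compat.h */
        static inline int
        put_compat_sigset(compat_sigset_t __user *compat, const sigset_t *set,
                          unsigned int size)
        {
                /* once inlined, size is a compile-time constant at every
                 * call site, so hardened usercopy takes its static check */
                return copy_to_user(compat, set, size) ? -EFAULT : 0;
        }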
      
      Fixes: afcc90f8 ("usercopy: WARN() on slab cache usercopy region violations")
      Signed-off-by: Matt Redfearn <matt.redfearn@mips.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: "Dmitry V . Levin" <ldv@altlinux.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/18639/
      Signed-off-by: James Hogan <jhogan@kernel.org>
  11. 01 Mar 2018, 1 commit
    • timers: Forward timer base before migrating timers · c52232a4
      By Lingutla Chandrasekhar
      On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
      live CPU. This happens from the control thread which initiated the unplug.
      
      If the CPU on which the control thread runs has just come out of a
      long idle period, then the base clock of that CPU might be stale,
      because the control thread runs prior to any event which forwards the
      clock.
      
      In such a case the timers from the unplugged CPU are queued on the
      live CPU based on the stale clock, which can cause large delays due to
      the increased granularity of the outer timer wheels that are far away
      from the base clock.
      
      But there is a worse problem than that. The following sequence of events
      illustrates it:
      
       - CPU0 timer1 is queued expires = 59969 and base->clk = 59131.
      
         The timer is queued at wheel level 2, with resulting expiry time = 60032
         (due to level granularity).
      
       - CPU1 enters idle @60007, with next timer expiry @60020.
      
 - CPU0 is hot-unplugged @60009
      
       - CPU1 exits idle and runs the control thread which migrates the
         timers from CPU0
      
         timer1 is now queued in level 0 for immediate handling in the next
         softirq because the requested expiry time 59969 is before CPU1 base->clk
         60007
      
 - CPU1 runs code which forwards the base clock.  This succeeds because
   the next expiring timer, which was collected at idle entry time, is
   still set to 60020.
      
         So it forwards beyond 60007 and therefore misses to expire the migrated
         timer1. That timer gets expired when the wheel wraps around again, which
         takes between 63 and 630ms depending on the HZ setting.
      
      Address both problems by invoking forward_timer_base() for the control
      CPU's timer base.  All other places which might run into a similar
      problem (mod_timer()/add_timer_on()) already invoke forward_timer_base()
      to avoid it.
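
      A minimal sketch of the fix inside timers_dead_cpu(), after both
      bases are locked:

        raw_spin_lock_irq(&new_base->lock);
        raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);

        /*
         * The control CPU's base clock might be stale after a long idle
         * period; forward it before enqueueing the migrated timers.
         */
        forward_timer_base(new_base);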
      
      [ tglx: Massaged comment and changelog ]
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Co-developed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: linux-arm-msm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
  12. 27 Feb 2018, 1 commit
    • printk: Wake klogd when passing console_lock owner · c14376de
      By Petr Mladek
      wake_klogd is a local variable in console_unlock().  Its value is lost
      when the console_lock owner hands the lock over to a busy-waiting
      waiter, a mechanism added by commit dbdda842 ("printk: Add console
      owner and waiter logic to load balance console writes").  The
      following race is possible:
      
      CPU0				CPU1
      console_unlock()
      
        for (;;)
           /* calling console for last message */
      
      				printk()
      				  log_store()
      				    log_next_seq++;
      
           /* see new message */
           if (seen_seq != log_next_seq) {
      	wake_klogd = true;
      	seen_seq = log_next_seq;
           }
      
           console_lock_spinning_enable();
      
      				  if (console_trylock_spinning())
      				     /* spinning */
      
           if (console_lock_spinning_disable_and_check()) {
      	printk_safe_exit_irqrestore(flags);
      	return;
      
      				  console_unlock()
      				    if (seen_seq != log_next_seq) {
      				    /* already seen */
      				    /* nothing to do */
      
      Result: nobody would wake up klogd.
      
      One solution would be to make wake_klogd a global variable, but then
      it would need to be manipulated under a lock or similar.
      
      This patch wakes klogd also when console_lock is passed to the
      spinning waiter. It looks like the right way to go. Also userspace
      should have a chance to see and store any "flood" of messages.
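
      A hedged sketch of the shape of the fix: the early return on lock
      handover now funnels through the same exit path that wakes klogd:

        if (console_lock_spinning_disable_and_check()) {
                printk_safe_exit_irqrestore(flags);
                goto out; /* was: return; which skipped the wakeup */
        }
        /* ... */
        out:
                if (wake_klogd)
                        wake_up_klogd();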
      
      Note that the very late klogd wake up was a historic solution.
      It made sense on single CPU systems or when sys_syslog() operations
      were synchronized using the big kernel lock like in v2.1.113.
      But it is questionable these days.
      
      Fixes: dbdda842 ("printk: Add console owner and waiter logic to load balance console writes")
      Link: http://lkml.kernel.org/r/20180226155734.dzwg3aovqnwtvkoy@pathway.suse.cz
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linux-kernel@vger.kernel.org
      Cc: Tejun Heo <tj@kernel.org>
      Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
  13. 24 Feb 2018, 1 commit
    • bpf: allow xadd only on aligned memory · ca369602
      By Daniel Borkmann
      The requirements around atomic_add() / atomic64_add() and their JIT
      implementations differ across architectures.  E.g. while x86_64 seems
      just fine with BPF's xadd on unaligned memory, on arm64 it triggers
      the following crash via both the interpreter and the JIT:
      
        [  830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703
        [...]
        [  830.916161] Internal error: Oops: 96000021 [#1] SMP
        [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
        [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
        [  830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO)
        [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
        [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
        [  831.012485] sp : ffff00001ccabc20
        [  831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00
        [  831.021087] x27: 0000000000000001 x26: 0000000000000000
        [  831.026387] x25: 000000c168d9db98 x24: 0000000000000000
        [  831.031686] x23: ffff000008203878 x22: ffff000009488000
        [  831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0
        [  831.042286] x19: ffff0000097b5080 x18: 0000000000000a03
        [  831.047585] x17: 0000000000000000 x16: 0000000000000000
        [  831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000
        [  831.058184] x13: 0000000000000000 x12: 0000000000000000
        [  831.063484] x11: 0000000000000001 x10: 0000000000000000
        [  831.068783] x9 : 0000000000000000 x8 : 0000000000000000
        [  831.074083] x7 : 0000000000000000 x6 : 000580d428000000
        [  831.079383] x5 : 0000000000000018 x4 : 0000000000000000
        [  831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001
        [  831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001
        [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044)
        [  831.102577] Call trace:
        [  831.105012]  __ll_sc_atomic_add+0x4/0x18
        [  831.108923]  __bpf_prog_run32+0x4c/0x70
        [  831.112748]  bpf_test_run+0x78/0xf8
        [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
        [  831.120567]  SyS_bpf+0x77c/0x1110
        [  831.123873]  el0_svc_naked+0x30/0x34
        [  831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31)
      
      The reason is that the memory operand is required to be aligned.  In
      the case of BPF, we always enforce alignment for stack accesses, but
      not when accessing map values or packet data when the underlying
      arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.
      
      xadd on packet data, which is local to us anyway, is just wrong, so
      forbid this case entirely.  In fact, the only place where xadd makes
      sense is map values; xadd on the stack is wrong as well, but it has
      been around for much longer.  Specifically enforce strict alignment
      for xadd, so that we handle this case generically and avoid such
      crashes in the first place.
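
      A hedged sketch of the verifier-side checks (the real patch also
      threads a one-off strict-alignment flag into check_ptr_alignment();
      helper names follow the description above):

        /* BPF_XADD is only sensible on map values: reject context and
         * packet pointers outright */
        if (is_ctx_reg(env, insn->dst_reg) ||
            is_pkt_reg(env, insn->dst_reg)) {
                verbose(env, "BPF_XADD stores into R%d are not allowed\n",
                        insn->dst_reg);
                return -EACCES;
        }

        /* check the access with alignment enforced, even when the arch
         * advertises efficient unaligned accesses */
        err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off,
                               BPF_SIZE(insn->code), BPF_READ, -1, true);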
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  14. 23 Feb 2018, 4 commits
    • genirq/matrix: Handle CPU offlining proper · 651ca2c0
      By Thomas Gleixner
      At CPU hotunplug the corresponding per-CPU matrix allocator is shut down and
      the allocated interrupt bits are discarded under the assumption that all
      allocated bits have been either migrated away or shut down through the
      managed interrupts mechanism.
      
      This is not true, because interrupts which have not been started up
      might have a vector allocated on the outgoing CPU.  When such an
      interrupt is started up later, or is completely shut down and freed,
      the allocated vector is handed back, triggering warnings or causing
      accounting issues which result in suspend failures and other problems.
      
      Change the CPU hotplug mechanism of the matrix allocator so that the
      remaining allocations at unplug time are preserved and global accounting at
      hotplug is correctly readjusted to take the dormant vectors into account.
      
      Fixes: 2f75d9e1 ("genirq: Implement bitmap matrix allocator")
      Reported-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180222112316.849980972@linutronix.de
    • bpf: fix rcu lockdep warning for lpm_trie map_free callback · 6c5f6102
      By Yonghong Song
      Commit 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      fixed a memory leak and removed unnecessary locks in the map_free callback
      function.  Unfortunately, it introduced a lockdep warning.  With lockdep
      checking turned on, running tools/testing/selftests/bpf/test_lpm_map
      produces:
      
        [   98.294321] =============================
        [   98.294807] WARNING: suspicious RCU usage
        [   98.295359] 4.16.0-rc2+ #193 Not tainted
        [   98.295907] -----------------------------
        [   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
        [   98.297657]
        [   98.297657] other info that might help us debug this:
        [   98.297657]
        [   98.298663]
        [   98.298663] rcu_scheduler_active = 2, debug_locks = 1
        [   98.299536] 2 locks held by kworker/2:1/54:
        [   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
        [   98.301381]  #1:  ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
      
      Since actual trie tree removal happens only after no other
      accesses to the tree are possible, replacing
        rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock))
      with
        rcu_dereference_protected(*slot, 1)
      fixed the issue.
      
      Fixes: 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add schedule points in percpu arrays management · 32fff239
      By Eric Dumazet
      syzbot managed to trigger RCU-detected stalls in
      bpf_array_free_percpu().
      
      It takes time to allocate a huge percpu map, but even more time to free
      it.
      
      Since we run in process context, use cond_resched() to yield the CPU
      when needed.
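
      A minimal sketch of the free-side change (the allocation side gets
      the same treatment):

        static void bpf_array_free_percpu(struct bpf_array *array)
        {
                int i;

                for (i = 0; i < array->map.max_entries; i++) {
                        free_percpu(array->pptrs[i]);
                        cond_resched(); /* huge maps: don't stall RCU */
                }
        }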
      
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • efivarfs: Limit the rate for non-root to read files · bef3efbe
      By Tony Luck
      Each read from a file in efivarfs results in two calls to EFI
      (one to get the file size, another to get the actual data).
      
      On X86 these EFI calls result in broadcast system management
      interrupts (SMI) which affect performance of the whole system.
      A malicious user can loop performing reads from efivarfs bringing
      the system to its knees.
      
      Linus suggested per-user rate limit to solve this.
      
      So we add a ratelimit structure to "user_struct", initialized with no
      limit for the root user.  When allocating a user_struct for other
      users, we set the limit to 100 per second.  This could be reused in
      other places that want to limit the rate of some detrimental user
      action.
      
      In efivarfs if the limit is exceeded when reading, we take an
      interruptible nap for 50ms and check the rate limit again.
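
      A hedged sketch of the read-side check (field placement assumed from
      the description above):

        /* fs/efivarfs/file.c, at the top of the read handler */
        while (!__ratelimit(&file->f_cred->user->ratelimit)) {
                if (msleep_interruptible(50))
                        return -EINTR; /* a signal arrived during the nap */
        }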
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 22 Feb 2018, 4 commits
  16. 21 Feb 2018, 3 commits
  17. 17 Feb 2018, 1 commit
  18. 16 Feb 2018, 4 commits
    • irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macro · 0b24a0bb
      By Andy Shevchenko
      ...instead of open-coding file operations followed by a custom ->open()
      callback for each attribute.
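
      For reference, the pattern this converts to (a sketch; the show
      function name is illustrative):

        static int irq_domain_debug_show(struct seq_file *m, void *p)
        {
                /* dump the domain hierarchy, as before */
                return 0;
        }
        DEFINE_SHOW_ATTRIBUTE(irq_domain_debug);
        /* expands to an irq_domain_debug_open() wrapping single_open()
         * plus a matching irq_domain_debug_fops */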
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
    • kprobes: Propagate error from disarm_kprobe_ftrace() · 297f9233
      By Jessica Yu
      Improve error handling when disarming ftrace-based kprobes. Like with
      arm_kprobe_ftrace(), propagate any errors from disarm_kprobe_ftrace() so
      that we do not disable/unregister kprobes that are still armed. In other
      words, unregister_kprobe() and disable_kprobe() should not report success
      if the kprobe could not be disarmed.
      
      disarm_all_kprobes() keeps its current behavior and attempts to
      disarm all kprobes. It returns the last encountered error and gives a
      warning if not all probes could be disarmed.
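
      A hedged sketch of the propagation (mirroring the arm-side change in
      the next commit below):

        static int disarm_kprobe(struct kprobe *kp, bool reopt)
        {
                if (unlikely(kprobe_ftrace(kp)))
                        return disarm_kprobe_ftrace(kp); /* may now fail */

                cpus_read_lock();
                mutex_lock(&text_mutex);
                __disarm_kprobe(kp, reopt);
                mutex_unlock(&text_mutex);
                cpus_read_unlock();
                return 0;
        }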
      
      This patch is based on Petr Mladek's original patchset (patches 2 and 3)
      back in 2015, which improved kprobes error handling, found here:
      
         https://lkml.org/lkml/2015/2/26/452
      
      However, further work on this had been paused since then and the patches
      were not upstreamed.
      Based-on-patches-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180109235124.30886-3-jeyu@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • kprobes: Propagate error from arm_kprobe_ftrace() · 12310e34
      By Jessica Yu
      Improve error handling when arming ftrace-based kprobes. Specifically, if
      we fail to arm a ftrace-based kprobe, register_kprobe()/enable_kprobe()
      should report an error instead of success.  Previously, this has led to
      confusing situations where register_kprobe() would return 0 indicating
      success, but the kprobe would not be functional if ftrace registration
      during the kprobe arming process had failed. We should therefore take any
      errors returned by ftrace into account and propagate this error so that we
      do not register/enable kprobes that cannot be armed. This can happen if,
      for example, register_ftrace_function() finds an IPMODIFY conflict (since
      kprobe_ftrace_ops has this flag set) and returns an error. Such a conflict
      is possible since livepatches also set the IPMODIFY flag for their ftrace_ops.
      
      arm_all_kprobes() keeps its current behavior and attempts to arm all
      kprobes. It returns the last encountered error and gives a warning if
      not all probes could be armed.
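
      On the register_kprobe() side the shape is roughly (a sketch, error
      unwinding abbreviated):

        if (!kprobes_all_disarmed && !kprobe_disabled(p)) {
                ret = arm_kprobe(p);
                if (ret) { /* undo the registration on arming failure */
                        hlist_del_rcu(&p->hlist);
                        synchronize_sched();
                        goto out;
                }
        }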
      
      This patch is based on Petr Mladek's original patchset (patches 2 and 3)
      back in 2015, which improved kprobes error handling, found here:
      
         https://lkml.org/lkml/2015/2/26/452
      
      However, further work on this had been paused since then and the patches
      were not upstreamed.
      Based-on-patches-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180109235124.30886-2-jeyu@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: fix mlock precharge on arraymaps · 9c2d63b8
      By Daniel Borkmann
      syzkaller recently triggered an OOM during percpu map allocation;
      while there is work in progress by Dennis Zhou to add __GFP_NORETRY
      semantics for the percpu allocator under pressure, there is also a
      missing bpf_map_precharge_memlock() check in array map allocation.
      
      Given that today the actual bpf_map_charge_memlock() happens after
      find_and_alloc_map() in the syscall path, bpf_map_precharge_memlock()
      is there to bail out early, before we go and do the map setup work,
      when we would hit the limits anyway.  Therefore add this check for
      array maps as well.
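
      A hedged sketch of the added check in array_map_alloc() (cost math
      abbreviated):

        u64 cost = array_size; /* struct bpf_array plus elements */
        int ret;

        if (percpu)
                cost += (u64)attr->max_entries * elem_size * num_possible_cpus();
        if (cost >= U32_MAX - PAGE_SIZE)
                return ERR_PTR(-ENOMEM);

        /* bail out early if RLIMIT_MEMLOCK would be exceeded anyway */
        ret = bpf_map_precharge_memlock(round_up(cost, PAGE_SIZE) >> PAGE_SHIFT);
        if (ret < 0)
                return ERR_PTR(ret);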
      
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Dennis Zhou <dennisszhou@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  19. 15 Feb 2018, 1 commit
    • bpf: fix bpf_prog_array_copy_to_user warning from perf event prog query · 9c481b90
      By Daniel Borkmann
      syzkaller tried to perform a prog query in perf_event_query_prog_array()
      where struct perf_event_query_bpf had an ids_len of 1,073,741,353 and
      thus causing a warning due to failed kcalloc() allocation out of the
      bpf_prog_array_copy_to_user() helper.  Given we cannot attach more than
      64 programs to a perf event, there's no point in allowing a huge ids_len.
      Therefore, allow a buffer that fits at most the maximum number of ids,
      and also add __GFP_NOWARN to the temporary ids buffer allocation.
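
      A hedged sketch of the two changes (BPF_TRACE_MAX_PROGS is the
      tracing-side cap of 64 attached programs):

        /* perf_event_query_prog_array(): refuse oversized requests */
        if (query.ids_len > BPF_TRACE_MAX_PROGS)
                return -E2BIG;

        /* bpf_prog_array_copy_to_user(): don't warn on allocation failure */
        ids = kcalloc(cnt, sizeof(u32), GFP_KERNEL | __GFP_NOWARN);
        if (!ids)
                return -ENOMEM;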
      
      Fixes: f371b304 ("bpf/tracing: allow user space to query prog array on the same tp")
      Fixes: 0911287c ("bpf: fix bpf_prog_array_copy_to_user() issues")
      Reported-by: syzbot+cab5816b0edbabf598b3@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>