1. 14 1月, 2017 3 次提交
    • J
      perf/x86/intel: Account interrupts for PEBS errors · 475113d9
      Jiri Olsa 提交于
      It's possible to set up PEBS events to get only errors and not
      any data, like on SNB-X (model 45) and IVB-EP (model 62)
      via 2 perf commands running simultaneously:
      
          taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
      
      This leads to a soft lock up, because the error path of the
      intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
      for error PEBS interrupts, so in case you're getting ONLY
      errors you don't have a way to stop the event when it's over
      the max_samples_per_tick limit:
      
        NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
        ...
        RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
        ...
        Call Trace:
         ? trace_hardirqs_on_caller+0xf5/0x1b0
         ? perf_cgroup_attach+0x70/0x70
         perf_install_in_context+0x199/0x1b0
         ? ctx_resched+0x90/0x90
         SYSC_perf_event_open+0x641/0xf90
         SyS_perf_event_open+0x9/0x10
         do_syscall_64+0x6c/0x1f0
         entry_SYSCALL64_slow_path+0x25/0x25
      
      Add perf_event_account_interrupt() which does the interrupt
      and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
      error path.
      
      We keep the pending_kill and pending_wakeup logic only in the
      __perf_event_overflow() path, because they make sense only if
      there's any data to deliver.
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      475113d9
    • P
      perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race · 321027c1
      Peter Zijlstra 提交于
      Di Shen reported a race between two concurrent sys_perf_event_open()
      calls where both try and move the same pre-existing software group
      into a hardware context.
      
      The problem is exactly that described in commit:
      
        f63a8daa ("perf: Fix event->ctx locking")
      
      ... where, while we wait for a ctx->mutex acquisition, the event->ctx
      relation can have changed under us.
      
      That very same commit failed to recognise sys_perf_event_context() as an
      external access vector to the events and thereby didn't apply the
      established locking rules correctly.
      
      So while one sys_perf_event_open() call is stuck waiting on
      mutex_lock_double(), the other (which owns said locks) moves the group
      about. So by the time the former sys_perf_event_open() acquires the
      locks, the context we've acquired is stale (and possibly dead).
      
      Apply the established locking rules as per perf_event_ctx_lock_nested()
      to the mutex_lock_double() for the 'move_group' case. This obviously means
      we need to validate state after we acquire the locks.
      
      Reported-by: Di Shen (Keen Lab)
      Tested-by: NJohn Dias <joaodias@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Min Chong <mchong@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: f63a8daa ("perf: Fix event->ctx locking")
      Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      321027c1
    • P
      perf/core: Fix sys_perf_event_open() vs. hotplug · 63cae12b
      Peter Zijlstra 提交于
      There is problem with installing an event in a task that is 'stuck' on
      an offline CPU.
      
      Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
      blocked task doesn't run and doesn't require a CPU etc.. Only on
      wakeup do we ammend the situation and place the task on a available
      CPU.
      
      If we hit such a task with perf_install_in_context() we'll loop until
      either that task wakes up or the CPU comes back online, if the task
      waking depends on the event being installed, we're stuck.
      
      While looking into this issue, I also spotted another problem, if we
      hit a task with perf_install_in_context() that is in the middle of
      being migrated, that is we observe the old CPU before sending the IPI,
      but run the IPI (on the old CPU) while the task is already running on
      the new CPU, things also go sideways.
      
      Rework things to rely on task_curr() -- outside of rq->lock -- which
      is rather tricky. Imagine the following scenario where we're trying to
      install the first event into our task 't':
      
      CPU0            CPU1            CPU2
      
                      (current == t)
      
      t->perf_event_ctxp[] = ctx;
      smp_mb();
      cpu = task_cpu(t);
      
                      switch(t, n);
                                      migrate(t, 2);
                                      switch(p, t);
      
                                      ctx = t->perf_event_ctxp[]; // must not be NULL
      
      smp_function_call(cpu, ..);
      
                      generic_exec_single()
                        func();
                          spin_lock(ctx->lock);
                          if (task_curr(t)) // false
      
                          add_event_to_ctx();
                          spin_unlock(ctx->lock);
      
                                      perf_event_context_sched_in();
                                        spin_lock(ctx->lock);
                                        // sees event
      
      So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
      Because if CPU2's load of that variable were to observe NULL, it would
      not try to schedule the ctx and we'd have a task running without its
      counter, which would be 'bad'.
      
      As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
      first and not see the event yet, then CPU0 must observe task_curr()
      and retry. If the install happens first, then we must see the event on
      sched-in and all is well.
      
      I think we can translate the first part (until the 'must not be NULL')
      of the scenario to a litmus test like:
      
        C C-peterz
      
        {
        }
      
        P0(int *x, int *y)
        {
                int r1;
      
                WRITE_ONCE(*x, 1);
                smp_mb();
                r1 = READ_ONCE(*y);
        }
      
        P1(int *y, int *z)
        {
                WRITE_ONCE(*y, 1);
                smp_store_release(z, 1);
        }
      
        P2(int *x, int *z)
        {
                int r1;
                int r2;
      
                r1 = smp_load_acquire(z);
      	  smp_mb();
                r2 = READ_ONCE(*x);
        }
      
        exists
        (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
      
      Where:
        x is perf_event_ctxp[],
        y is our tasks's CPU, and
        z is our task being placed on the rq of CPU2.
      
      The P0 smp_mb() is the one added by this patch, ordering the store to
      perf_event_ctxp[] from find_get_context() and the load of task_cpu()
      in task_function_call().
      
      The smp_store_release/smp_load_acquire model the RCpc locking of the
      rq->lock and the smp_mb() of P2 is the context switch switching from
      whatever CPU2 was running to our task 't'.
      
      This litmus test evaluates into:
      
        Test C-peterz Allowed
        States 7
        0:r1=0; 2:r1=0; 2:r2=0;
        0:r1=0; 2:r1=0; 2:r2=1;
        0:r1=0; 2:r1=1; 2:r2=1;
        0:r1=1; 2:r1=0; 2:r2=0;
        0:r1=1; 2:r1=0; 2:r2=1;
        0:r1=1; 2:r1=1; 2:r2=0;
        0:r1=1; 2:r1=1; 2:r2=1;
        No
        Witnesses
        Positive: 0 Negative: 7
        Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
        Observation C-peterz Never 0 7
        Hash=e427f41d9146b2a5445101d3e2fcaa34
      
      And the strong and weak model agree.
      Reported-by: NMark Rutland <mark.rutland@arm.com>
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: jeremy.linton@arm.com
      Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      63cae12b
  2. 12 1月, 2017 2 次提交
  3. 11 1月, 2017 4 次提交
    • F
      nohz: Fix collision between tick and other hrtimers · 24b91e36
      Frederic Weisbecker 提交于
      When the tick is stopped and an interrupt occurs afterward, we check on
      that interrupt exit if the next tick needs to be rescheduled. If it
      doesn't need any update, we don't want to do anything.
      
      In order to check if the tick needs an update, we compare it against the
      clockevent device deadline. Now that's a problem because the clockevent
      device is at a lower level than the tick itself if it is implemented
      on top of hrtimer.
      
      Every hrtimer share this clockevent device. So comparing the next tick
      deadline against the clockevent device deadline is wrong because the
      device may be programmed for another hrtimer whose deadline collides
      with the tick. As a result we may end up not reprogramming the tick
      accidentally.
      
      In a worst case scenario under full dynticks mode, the tick stops firing
      as it is supposed to every 1hz, leaving /proc/stat stalled:
      
            Task in a full dynticks CPU
            ----------------------------
      
            * hrtimer A is queued 2 seconds ahead
            * the tick is stopped, scheduled 1 second ahead
            * tick fires 1 second later
            * on tick exit, nohz schedules the tick 1 second ahead but sees
              the clockevent device is already programmed to that deadline,
              fooled by hrtimer A, the tick isn't rescheduled.
            * hrtimer A is cancelled before its deadline
            * tick never fires again until an interrupt happens...
      
      In order to fix this, store the next tick deadline to the tick_sched
      local structure and reuse that value later to check whether we need to
      reprogram the clock after an interrupt.
      
      On the other hand, ts->sleep_length still wants to know about the next
      clock event and not just the tick, so we want to improve the related
      comment to avoid confusion.
      Reported-by: NJames Hartsock <hartsjc@redhat.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1483539124-5693-1-git-send-email-fweisbec@gmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      24b91e36
    • J
      signal: protect SIGNAL_UNKILLABLE from unintentional clearing. · 2d39b3cd
      Jamie Iles 提交于
      Since commit 00cd5c37 ("ptrace: permit ptracing of /sbin/init") we
      can now trace init processes.  init is initially protected with
      SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
      there are a number of paths during tracing where SIGNAL_UNKILLABLE can
      be implicitly cleared.
      
      This can result in init becoming stoppable/killable after tracing.  For
      example, running:
      
        while true; do kill -STOP 1; done &
        strace -p 1
      
      and then stopping strace and the kill loop will result in init being
      left in state TASK_STOPPED.  Sending SIGCONT to init will resume it, but
      init will now respond to future SIGSTOP signals rather than ignoring
      them.
      
      Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
      that we don't clear SIGNAL_UNKILLABLE.
      
      Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.comSigned-off-by: NJamie Iles <jamie.iles@oracle.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d39b3cd
    • D
      mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done} · f931ab47
      Dan Williams 提交于
      Both arch_add_memory() and arch_remove_memory() expect a single threaded
      context.
      
      For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
      not hold any locks over this check and branch:
      
          if (pgd_val(*pgd)) {
          	pud = (pud_t *)pgd_page_vaddr(*pgd);
          	paddr_last = phys_pud_init(pud, __pa(vaddr),
          				   __pa(vaddr_end),
          				   page_size_mask);
          	continue;
          }
      
          pud = alloc_low_page();
          paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
          			   page_size_mask);
      
      The result is that two threads calling devm_memremap_pages()
      simultaneously can end up colliding on pgd initialization.  This leads
      to crash signatures like the following where the loser of the race
      initializes the wrong pgd entry:
      
          BUG: unable to handle kernel paging request at ffff888ebfff0000
          IP: memcpy_erms+0x6/0x10
          PGD 2f8e8fc067 PUD 0 /* <---- Invalid PUD */
          Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 54 PID: 3818 Comm: systemd-udevd Not tainted 4.6.7+ #13
          task: ffff882fac290040 ti: ffff882f887a4000 task.ti: ffff882f887a4000
          RIP: memcpy_erms+0x6/0x10
          [..]
          Call Trace:
            ? pmem_do_bvec+0x205/0x370 [nd_pmem]
            ? blk_queue_enter+0x3a/0x280
            pmem_rw_page+0x38/0x80 [nd_pmem]
            bdev_read_page+0x84/0xb0
      
      Hold the standard memory hotplug mutex over calls to
      arch_{add,remove}_memory().
      
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Link: http://lkml.kernel.org/r/148357647831.9498.12606007370121652979.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f931ab47
    • M
      bpf: do not use KMALLOC_SHIFT_MAX · 7984c27c
      Michal Hocko 提交于
      Commit 01b3f521 ("bpf: fix allocation warnings in bpf maps and
      integer overflow") has added checks for the maximum allocateable size.
      It (ab)used KMALLOC_SHIFT_MAX for that purpose.
      
      While this is not incorrect it is not very clean because we already have
      KMALLOC_MAX_SIZE for this very reason so let's change both checks to use
      KMALLOC_MAX_SIZE instead.
      
      The original motivation for using KMALLOC_SHIFT_MAX was to work around
      an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
      but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
      will fit into MAX_ORDER".
      
      Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7984c27c
  4. 04 1月, 2017 1 次提交
    • J
      audit: Fix sleep in atomic · be29d20f
      Jan Kara 提交于
      Audit tree code was happily adding new notification marks while holding
      spinlocks. Since fsnotify_add_mark() acquires group->mark_mutex this can
      lead to sleeping while holding a spinlock, deadlocks due to lock
      inversion, and probably other fun. Fix the problem by acquiring
      group->mark_mutex earlier.
      
      CC: Paul Moore <paul@paul-moore.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NPaul Moore <paul@paul-moore.com>
      be29d20f
  5. 27 12月, 2016 1 次提交
  6. 26 12月, 2016 2 次提交
    • T
      ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner 提交于
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a Seconds and Nanoseconds part of a time value
      needs to be converted. For anything where the Seconds argument is 0, this
      is pointless and can be replaced with a simple assignment.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8b0e1953
    • T
      ktime: Get rid of the union · 2456e855
      Thomas Gleixner 提交于
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
      variant for 32bit machines. The Y2038 cleanup removed the timespec variant
      and switched everything to scalar nanoseconds. The union remained, but
      become completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2456e855
  7. 25 12月, 2016 4 次提交
  8. 24 12月, 2016 1 次提交
    • J
      fsnotify: Remove fsnotify_duplicate_mark() · e3ba7307
      Jan Kara 提交于
      There are only two calls sites of fsnotify_duplicate_mark(). Those are
      in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
      for audit tree, inode pointer and group gets set in
      fsnotify_add_mark_locked() later anyway, mask and free_mark are already
      set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
      actively harmful because following fsnotify_add_mark_locked() will leak
      group reference by overwriting the group pointer. So just remove the two
      calls to fsnotify_duplicate_mark() and the function.
      Signed-off-by: NJan Kara <jack@suse.cz>
      [PM: line wrapping to fit in 80 chars]
      Signed-off-by: NPaul Moore <paul@paul-moore.com>
      e3ba7307
  9. 23 12月, 2016 1 次提交
    • A
      move aio compat to fs/aio.c · c00d2c7e
      Al Viro 提交于
      ... and fix the minor buglet in compat io_submit() - native one
      kills ioctx as cleanup when put_user() fails.  Get rid of
      bogus compat_... in !CONFIG_AIO case, while we are at it - they
      should simply fail with ENOSYS, same as for native counterparts.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c00d2c7e
  10. 21 12月, 2016 3 次提交
  11. 18 12月, 2016 4 次提交
    • M
      uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation · 297e765e
      Marcin Nowakowski 提交于
      Commit:
      
        72e6ae28 ('ARM: 8043/1: uprobes need icache flush after xol write'
      
      ... has introduced an arch-specific method to ensure all caches are
      flushed appropriately after an instruction is written to an XOL page.
      
      However, when the XOL area is created and the out-of-line breakpoint
      instruction is copied, caches are not flushed at all and stale data may
      be found in icache.
      
      Replace a simple copy_to_page() with arch_uprobe_copy_ixol() to allow
      the arch to ensure all caches are updated accordingly.
      
      This change fixes uprobes on MIPS InterAptiv (tested on Creator Ci40).
      Signed-off-by: NMarcin Nowakowski <marcin.nowakowski@imgtec.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Victor Kamensky <victor.kamensky@linaro.org>
      Cc: linux-mips@linux-mips.org
      Link: http://lkml.kernel.org/r/1481625657-22850-1-git-send-email-marcin.nowakowski@imgtec.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      297e765e
    • D
      bpf: fix mark_reg_unknown_value for spilled regs on map value marking · 6760bf2d
      Daniel Borkmann 提交于
      Martin reported a verifier issue that hit the BUG_ON() for his
      test case in the mark_reg_unknown_value() function:
      
        [  202.861380] kernel BUG at kernel/bpf/verifier.c:467!
        [...]
        [  203.291109] Call Trace:
        [  203.296501]  [<ffffffff811364d5>] mark_map_reg+0x45/0x50
        [  203.308225]  [<ffffffff81136558>] mark_map_regs+0x78/0x90
        [  203.320140]  [<ffffffff8113938d>] do_check+0x226d/0x2c90
        [  203.331865]  [<ffffffff8113a6ab>] bpf_check+0x48b/0x780
        [  203.343403]  [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440
        [  203.355705]  [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230
        [  203.369158]  [<ffffffff812d8188>] ? security_capable+0x48/0x60
        [  203.382035]  [<ffffffff811351a4>] SyS_bpf+0x124/0x960
        [  203.393185]  [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490
        [  203.406258]  [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94
      
      This issue got uncovered after the fix in a08dd0da ("bpf: fix
      regression on verifier pruning wrt map lookups"). The reason why it
      wasn't noticed before was, because as mentioned in a08dd0da,
      mark_map_regs() was doing the id matching incorrectly based on the
      uncached regs[regno].id. So, in the first loop, we walked all regs
      and as soon as we found regno == i, then this reg's id was cleared
      when calling mark_reg_unknown_value() thus that every subsequent
      register was probed against id of 0 (which, in combination with the
      PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other
      register state can hold), and therefore wasn't type transitioned such
      as in the spilled register case for the second loop.
      
      Now since that got fixed, it turned out that 57a09bf0 ("bpf:
      Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
      mark_reg_unknown_value() incorrectly for the spilled regs, and thus
      hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.
      
      Although spilled regs have the same type as the non-spilled regs
      for the verifier state, that is, struct bpf_reg_state, they are
      semantically different from the non-spilled regs. In other words,
      there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
      in the stack, for example, register R<x> could have been spilled by
      the program to stack location X, Y, Z, and in mark_map_regs() we
      need to scan these stack slots of type STACK_SPILL for potential
      registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
      Therefore, depending on the location, the spilled_regs regno can
      be a lot higher than just MAX_BPF_REG's value since we operate on
      stack instead. The reset in mark_reg_unknown_value() itself is
      just fine, only that the BUG_ON() was inappropriate for this. Fix
      it by making a __mark_reg_unknown_value() version that can be
      called from mark_map_reg() generically; we know for the non-spilled
      case that the regno is always < MAX_BPF_REG anyway.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Reported-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6760bf2d
    • D
      bpf: fix overflow in prog accounting · 5ccb071e
      Daniel Borkmann 提交于
      Commit aaac3ba9 ("bpf: charge user for creation of BPF maps and
      programs") made a wrong assumption of charging against prog->pages.
      Unlike map->pages, prog->pages are still subject to change when we
      need to expand the program through bpf_prog_realloc().
      
      This can for example happen during verification stage when we need to
      expand and rewrite parts of the program. Should the required space
      cross a page boundary, then prog->pages is not the same anymore as
      its original value that we used to bpf_prog_charge_memlock() on. Thus,
      we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
      is freed eventually. I noticed this that despite having unlimited
      memlock, programs suddenly refused to load with EPERM error due to
      insufficient memlock.
      
      There are two ways to fix this issue. One would be to add a cached
      variable to struct bpf_prog that takes a snapshot of prog->pages at the
      time of charging. The other approach is to also account for resizes. I
      chose to go with the latter for a couple of reasons: i) We want accounting
      rather to be more accurate instead of further fooling limits, ii) adding
      yet another page counter on struct bpf_prog would also be a waste just
      for this purpose. We also do want to charge as early as possible to
      avoid going into the verifier just to find out later on that we crossed
      limits. The only place that needs to be fixed is bpf_prog_realloc(),
      since only here we expand the program, so we try to account for the
      needed delta and should we fail, call-sites check for outcome anyway.
      On cBPF to eBPF migrations, we don't grab a reference to the user as
      they are charged differently. With that in place, my test case worked
      fine.
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ccb071e
    • D
      bpf: dynamically allocate digest scratch buffer · aafe6ae9
      Daniel Borkmann 提交于
      Geert rightfully complained that 7bd509e3 ("bpf: add prog_digest
      and expose it via fdinfo/netlink") added a too large allocation of
      variable 'raw' from bss section, and should instead be done dynamically:
      
        # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
        add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
        function                                     old     new   delta
        raw                                            -   32832  +32832
        [...]
      
      Since this is only relevant during program creation path, which can be
      considered slow-path anyway, lets allocate that dynamically and be not
      implicitly dependent on verifier mutex. Move bpf_prog_calc_digest() at
      the beginning of replace_map_fd_with_map_ptr() and also error handling
      stays straight forward.
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aafe6ae9
  12. 17 12月, 2016 1 次提交
    • D
      bpf: fix regression on verifier pruning wrt map lookups · a08dd0da
      Daniel Borkmann 提交于
      Commit 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
      registers") introduced a regression where existing programs stopped
      loading due to reaching the verifier's maximum complexity limit,
      whereas prior to this commit they were loading just fine; the affected
      program has roughly 2k instructions.
      
      What was found is that state pruning couldn't be performed effectively
      anymore due to mismatches of the verifier's register state, in particular
      in the id tracking. It doesn't mean that 57a09bf0 is incorrect per
      se, but rather that verifier needs to perform a lot more work for the
      same program with regards to involved map lookups.
      
      Since commit 57a09bf0 is only about tracking registers with type
      PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
      until they are promoted through pattern matching with a NULL check to
      either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
      id becomes irrelevant for the transitioned types.
      
      For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
      but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
      transferred further into other types that don't make use of it. Among
      others, one example is where UNKNOWN_VALUE is set on function call
      return with RET_INTEGER return type.
      
      states_equal() will then fall through the memcmp() on register state;
      note that the second memcmp() uses offsetofend(), so the id is part of
      that since d2a4dd37 ("bpf: fix state equivalence"). But the bisect
      pointed already to 57a09bf0, where we really reach beyond complexity
      limit. What I found was that states_equal() often failed in this
      case due to id mismatches in spilled regs with registers in type
      PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
      a memcmp() on their reg state and don't have any other optimizations
      in place, therefore also id was relevant in this case for making a
      pruning decision.
      
      We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
      For the affected program, it resulted in a ~17 fold reduction of
      complexity and let the program load fine again. Selftest suite also
      runs fine. The only other place where env->id_gen is used currently is
      through direct packet access, but for these cases id is long living, thus
      a different scenario.
      
      Also, the current logic in mark_map_regs() is not fully correct when
      marking NULL branch with UNKNOWN_VALUE. We need to cache the destination
      reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE,
      it's id is reset and any subsequent registers that hold the original id
      and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
      anymore, since mark_map_reg() reuses the uncached regs[regno].id that
      was just overridden. Note, we don't need to cache it outside of
      mark_map_regs(), since it's called once on this_branch and the other
      time on other_branch, which are both two independent verifier states.
      A test case for this is added here, too.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a08dd0da
  13. 16 12月, 2016 2 次提交
  14. 15 12月, 2016 11 次提交
    • G
      genirq/affinity: Fix node generation from cpumask · c0af5243
      Guilherme G. Piccoli 提交于
      Commit 34c3d981 ("genirq/affinity: Provide smarter irq spreading
      infrastructure") introduced a better IRQ spreading mechanism, taking
      account of the available NUMA nodes in the machine.
      
      Problem is that the algorithm of retrieving the nodemask iterates
      "linearly" based on the number of online nodes - some architectures
      present non-linear node distribution among the nodemask, like PowerPC.
      If this is the case, the algorithm lead to a wrong node count number
      and therefore to a bad/incomplete IRQ affinity distribution.
      
      For example, this problem were found in a machine with 128 CPUs and two
      nodes, namely nodes 0 and 8 (instead of 0 and 1, if it was linearly
      distributed). This led to a wrong affinity distribution which then led to
      a bad mq allocation for nvme driver.
      
      Finally, we take the opportunity to fix a comment regarding the affinity
      distribution when we have _more_ nodes than vectors.
      
      Fixes: 34c3d981 ("genirq/affinity: Provide smarter irq spreading infrastructure")
      Reported-by: NGabriel Krisman Bertazi <gabriel@krisman.be>
      Signed-off-by: NGuilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NGabriel Krisman Bertazi <gabriel@krisman.be>
      Reviewed-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: hch@lst.de
      Link: http://lkml.kernel.org/r/1481738472-2671-1-git-send-email-gpiccoli@linux.vnet.ibm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      c0af5243
    • T
      tick/broadcast: Prevent NULL pointer dereference · c1a9eeb9
      Thomas Gleixner 提交于
      When a disfunctional timer, e.g. dummy timer, is installed, the tick core
      tries to setup the broadcast timer.
      
      If no broadcast device is installed, the kernel crashes with a NULL pointer
      dereference in tick_broadcast_setup_oneshot() because the function has no
      sanity check.
      Reported-by: NMason <slash.tmp@free.fr>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Richard Cochran <rcochran@linutronix.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>,
      Cc: Sebastian Frias <sf84@laposte.net>
      Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
      c1a9eeb9
    • L
      printk: remove console flushing special cases for partial buffered lines · 5c2992ee
      Linus Torvalds 提交于
      It actively hurts proper merging, and makes for a lot of special cases.
      There was a good(ish) reason for doing it originally, but it's getting
      too painful to maintain.  And most of the original reasons for it are
      long gone.
      
      So instead of having special code to flush partial lines to the console
      (as opposed to the record buffers), do _all_ the console writing from
      the record buffer, and be done with it.
      
      If an oops happens (or some other synchronous event), we will flush the
      partial lines due to the oops printing activity, so this does not affect
      that.  It does mean that if you have a completely hung machine, a
      partial preceding line may not have been printed out.
      
      That was some of the original reason for this complexity, in fact, back
      when we used to test for the historical i386 "halt" instruction problem
      by doing
      
      	pr_info("Checking 'hlt' instruction... ");
      
      	if (!boot_cpu_data.hlt_works_ok) {
      		pr_cont("disabled\n");
      		return;
      	}
      	halt();
      	halt();
      	halt();
      	halt();
      	pr_cont("OK\n");
      
      and that model no longer works (it the 'hlt' instruction kills the
      machine, the partial line won't have been flushed, so you won't even see
      it).
      
      Of course, that was also back in the days when people actually had
      textual console output rather than a graphical splash-screen at bootup.
      How times change..
      
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Tested-by: NPetr Mladek <pmladek@suse.com>
      Tested-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c2992ee
    • L
      printk: remove games with previous record flags · 5aa068ea
      Linus Torvalds 提交于
      The record logging code looks at the previous record flags in various
      ways, and they are all wrong.
      
      You can't use the previous record flags to determine anything about the
      next record, because they may simply not be related.  In particular, the
      reason the previous record was a continuation record may well be exactly
      _because_ the new record was printed by a different process, which is
      why the previous record was flushed.
      
      So all those games are simply wrong, and make the code hard to
      understand (because the code fundamentally cdoes not make sense).
      
      So remove it.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5aa068ea
    • L
      mm: add locked parameter to get_user_pages_remote() · 5b56d49f
      Lorenzo Stoakes 提交于
      Patch series "mm: unexport __get_user_pages_unlocked()".
      
      This patch series continues the cleanup of get_user_pages*() functions
      taking advantage of the fact we can now pass gup_flags as we please.
      
      It firstly adds an additional 'locked' parameter to
      get_user_pages_remote() to allow for its callers to utilise
      VM_FAULT_RETRY functionality.  This is necessary as the invocation of
      __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
      this and no other existing higher level function would allow it to do
      so.
      
      Secondly existing callers of __get_user_pages_unlocked() are replaced
      with the appropriate higher-level replacement -
      get_user_pages_unlocked() if the current task and memory descriptor are
      referenced, or get_user_pages_remote() if other task/memory descriptors
      are referenced (having acquiring mmap_sem.)
      
      This patch (of 2):
      
      Add a int *locked parameter to get_user_pages_remote() to allow
      VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
      
      Taking into account the previous adjustments to get_user_pages*()
      functions allowing for the passing of gup_flags, we are now in a
      position where __get_user_pages_unlocked() need only be exported for his
      ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
      subsequently unexport __get_user_pages_unlocked() as well as allowing
      for future flexibility in the use of get_user_pages_remote().
      
      [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
        Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.comSigned-off-by: NLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b56d49f
    • B
      kernel/watchdog.c: move hardlockup detector to separate file · 73ce0511
      Babu Moger 提交于
      Separate hardlockup code from watchdog.c and move it to watchdog_hld.c.
      It is mostly straight forward.  Remove everything inside
      CONFIG_HARDLOCKUP_DETECTORS.  This code will go to file watchdog_hld.c.
      Also update the makefile accordigly.
      
      Link: http://lkml.kernel.org/r/1478034826-43888-3-git-send-email-babu.moger@oracle.comSigned-off-by: NBabu Moger <babu.moger@oracle.com>
      Acked-by: NDon Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Josh Hunt <johunt@akamai.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73ce0511
    • B
      kernel/watchdog.c: move shared definitions to nmi.h · 249e52e3
      Babu Moger 提交于
      Patch series "Clean up watchdog handlers", v2.
      
      This is an attempt to cleanup watchdog handlers.  Right now,
      kernel/watchdog.c implements both softlockup and hardlockup detectors.
      Softlockup code is generic.  Hardlockup code is arch specific.  Some
      architectures don't use hardlockup detectors.  They use their own
      watchdog detectors.  To make both these combination work, we have
      numerous #ifdefs in kernel/watchdog.c.
      
      We are trying here to make these handlers independent of each other.
      Also provide an interface for architectures to implement their own
      handlers.  watchdog_nmi_enable and watchdog_nmi_disable will be defined
      as weak such that architectures can override its definitions.
      
      Thanks to Don Zickus for his suggestions.
      Here are our previous discussions
      http://www.spinics.net/lists/sparclinux/msg16543.html
      http://www.spinics.net/lists/sparclinux/msg16441.html
      
      This patch (of 3):
      
      Move shared macros and definitions to nmi.h so that watchdog.c, new file
      watchdog_hld.c or any other architecture specific handler can use those
      definitions.
      
      Link: http://lkml.kernel.org/r/1478034826-43888-2-git-send-email-babu.moger@oracle.comSigned-off-by: NBabu Moger <babu.moger@oracle.com>
      Acked-by: NDon Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Josh Hunt <johunt@akamai.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      249e52e3
    • N
      posix-timers: give lazy compilers some help optimizing code away · b6f8a92c
      Nicolas Pitre 提交于
      The OpenRISC compiler (so far) fails to optimize away a large portion of
      code containing a reference to posix_timer_event in alarmtimer.c when
      CONFIG_POSIX_TIMERS is unset.  Let's give it a direct clue to let the
      build succeed.
      
      This fixes
      [linux-next:master 6682/7183] alarmtimer.c:undefined reference to `posix_timer_event'
      reported by kbuild test robot.
      Signed-off-by: NNicolas Pitre <nico@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6f8a92c
    • P
      kdb: call vkdb_printf() from vprintk_default() only when wanted · 34aaff40
      Petr Mladek 提交于
      kdb_trap_printk allows to pass normal printk() messages to kdb via
      vkdb_printk().  For example, it is used to get backtrace using the
      classic show_stack(), see kdb_show_stack().
      
      vkdb_printf() tries to avoid a potential infinite loop by disabling the
      trap.  But this approach is racy, for example:
      
      CPU1					CPU2
      
      vkdb_printf()
        // assume that kdb_trap_printk == 0
        saved_trap_printk = kdb_trap_printk;
        kdb_trap_printk = 0;
      
      					kdb_show_stack()
      					  kdb_trap_printk++;
      
      Problem1: Now, a nested printk() on CPU0 calls vkdb_printf()
      	  even when it should have been disabled. It will not
      	  cause a deadlock but...
      
         // using the outdated saved value: 0
         kdb_trap_printk = saved_trap_printk;
      
      					  kdb_trap_printk--;
      
      Problem2: Now, kdb_trap_printk == -1 and will stay like this.
         It means that all messages will get passed to kdb from
         now on.
      
      This patch removes the racy saved_trap_printk handling.  Instead, the
      recursion is prevented by a check for the locked CPU.
      
      The solution is still kind of racy.  A non-related printk(), from
      another process, might get trapped by vkdb_printf().  And the wanted
      printk() might not get trapped because kdb_printf_cpu is assigned.  But
      this problem existed even with the original code.
      
      A proper solution would be to get_cpu() before setting kdb_trap_printk
      and trap messages only from this CPU.  I am not sure if it is worth the
      effort, though.
      
      In fact, the race is very theoretical.  When kdb is running any of the
      commands that use kdb_trap_printk there is a single active CPU and the
      other CPUs should be in a holding pen inside kgdb_cpu_enter().
      
      The only time this is violated is when there is a timeout waiting for
      the other CPUs to report to the holding pen.
      
      Finally, note that the situation is a bit schizophrenic.  vkdb_printf()
      explicitly allows recursion but only from KDB code that calls
      kdb_printf() directly.  On the other hand, the generic printk()
      recursion is not allowed because it might cause an infinite loop.  This
      is why we could not hide the decision inside vkdb_printf() easily.
      
      Link: http://lkml.kernel.org/r/1480412276-16690-4-git-send-email-pmladek@suse.comSigned-off-by: NPetr Mladek <pmladek@suse.com>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34aaff40
    • P
      kdb: properly synchronize vkdb_printf() calls with other CPUs · d5d8d3d0
      Petr Mladek 提交于
      kdb_printf_lock does not prevent other CPUs from entering the critical
      section because it is ignored when KDB_STATE_PRINTF_LOCK is set.
      
      The problematic situation might look like:
      
      CPU0					CPU1
      
      vkdb_printf()
        if (!KDB_STATE(PRINTF_LOCK))
          KDB_STATE_SET(PRINTF_LOCK);
          spin_lock_irqsave(&kdb_printf_lock, flags);
      
      					vkdb_printf()
      					  if (!KDB_STATE(PRINTF_LOCK))
      
      BANG: The PRINTF_LOCK state is set and CPU1 is entering the critical
      section without spinning on the lock.
      
      The problem is that the code tries to implement locking using two state
      variables that are not handled atomically.  Well, we need a custom
      locking because we want to allow reentering the critical section on the
      very same CPU.
      
      Let's use solution from Petr Zijlstra that was proposed for a similar
      scenario, see
      https://lkml.kernel.org/r/20161018171513.734367391@infradead.org
      
      This patch uses the same trick with cmpxchg().  The only difference is
      that we want to handle only recursion from the same context and
      therefore we disable interrupts.
      
      In addition, KDB_STATE_PRINTF_LOCK is removed.  In fact, we are not able
      to set it a non-racy way.
      
      Link: http://lkml.kernel.org/r/1480412276-16690-3-git-send-email-pmladek@suse.comSigned-off-by: NPetr Mladek <pmladek@suse.com>
      Reviewed-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5d8d3d0
    • P
      kdb: remove unused kdb_event handling · d1bd8ead
      Petr Mladek 提交于
      kdb_event state variable is only set but never checked in the kernel
      code.
      
      http://www.spinics.net/lists/kdb/msg01733.html suggests that this
      variable affected WARN_CONSOLE_UNLOCKED() in the original
      implementation.  But this check never went upstream.
      
      The semantic is unclear and racy.  The value is updated after the
      kdb_printf_lock is acquired and after it is released.  It should be
      symmetric at minimum.  The value should be manipulated either inside or
      outside the locked area.
      
      Fortunately, it seems that the original function is gone and we could
      simply remove the state variable.
      
      Link: http://lkml.kernel.org/r/1480412276-16690-2-git-send-email-pmladek@suse.comSigned-off-by: NPetr Mladek <pmladek@suse.com>
      Suggested-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1bd8ead