1. 20 July 2018 (6 commits)
    • sched/clock: Close a hole in sched_clock_init() · 9407f5a7
      Committed by Peter Zijlstra
      All data required for the 'unstable' sched_clock must be set-up _before_
      enabling it -- setting sched_clock_running. This includes the
      __gtod_offset but also a recent scd stamp.
      
      Make the gtod-offset update also set the scd stamp -- it requires the
      same two clock reads _anyway_. This doesn't hurt in the
      sched_clock_tick_stable() case and ensures sched_clock_init() gets
      everything set-up before use.
      
      Also switch to unconditional IRQ-disable/enable because the static key
      stuff already requires that this is not run with IRQs disabled.
      
      Fixes: 857baa87 ("sched/clock: Enable sched clock early")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180720080911.GM2494@hirez.programming.kicks-ass.net
      9407f5a7
    • sched/clock: Use static key for sched_clock_running · 46457ea4
      Committed by Pavel Tatashin
      sched_clock_running may be read every time sched_clock_cpu() is called.
      Yet, this variable is updated only twice during boot and never changes
      again, so it is better to make it a static key.
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-25-pasha.tatashin@oracle.com
      46457ea4
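      A minimal sketch of the static-key pattern described above (illustrative
      only, not the exact upstream diff; DEFINE_STATIC_KEY_FALSE() and friends
      come from <linux/jump_label.h>):

        #include <linux/jump_label.h>

        /* Was: int sched_clock_running; now a patched branch instead of a load. */
        DEFINE_STATIC_KEY_FALSE(sched_clock_running);

        static void sched_clock_start(void)         /* illustrative helper name */
        {
                /* Flipped once during boot, never changes again. */
                static_branch_enable(&sched_clock_running);
        }

        u64 sched_clock_cpu(int cpu)
        {
                if (!static_branch_likely(&sched_clock_running))
                        return 0;                   /* clock not set up yet */
                /* ... regular (possibly unstable) per-CPU clock read ... */
                return sched_clock();
        }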
    • sched/clock: Enable sched clock early · 857baa87
      Committed by Pavel Tatashin
      Allow sched_clock() to be used before sched_clock_init() is called. This
      provides a way to get early boot timestamps on machines with unstable
      clocks.
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-24-pasha.tatashin@oracle.com
      857baa87
    • sched/clock: Move sched clock initialization and merge with generic clock · 5d2a4e91
      Committed by Pavel Tatashin
      sched_clock_postinit() initializes a generic clock on systems where no
      other clock is provided. This function may be called only after
      timekeeping_init().
      
      Rename sched_clock_postinit() to generic_clock_init() and call it from
      sched_clock_init(). Move the call to sched_clock_init() to after
      time_init().
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-23-pasha.tatashin@oracle.com
      5d2a4e91
    • timekeeping: Default boot time offset to local_clock() · 4b1b7f80
      Committed by Pavel Tatashin
      read_persistent_wall_and_boot_offset() is called during boot to read the
      persistent clock and return the offset between the boot time and the
      value of the persistent clock.
      
      Change the default boot_offset from zero to local_clock(), so that
      architectures that do not have a dedicated boot clock but do have an
      early sched_clock(), such as SPARCv9, x86, and possibly more, benefit
      from a better and more consistent estimate of the boot time without
      needing an arch-specific implementation.
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-17-pasha.tatashin@oracle.com
      4b1b7f80
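      A sketch of what such a default can look like (assumed shape;
      read_persistent_clock64(), ns_to_timespec64() and local_clock() are
      existing timekeeping/sched helpers):

        void __weak __init
        read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
                                             struct timespec64 *boot_offset)
        {
                read_persistent_clock64(wall_time);
                /* No dedicated boot clock: approximate the boot offset with
                 * the time counted so far by the early sched_clock(). */
                *boot_offset = ns_to_timespec64(local_clock());
        }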
    • timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset() · 3eca9937
      Committed by Pavel Tatashin
      If the architecture does not support an exact boot time, it is challenging
      to estimate the boot time without a reference to the current persistent
      clock value. Yet, it cannot read the persistent clock again, because this
      may lead to math discrepancies with the caller of read_boot_clock64(),
      who may have read the persistent clock at a different time.
      
      This is why it is better to provide two values simultaneously: the
      persistent clock value, and the boot time.
      
      Replace read_boot_clock64() with:
      read_persistent_wall_and_boot_offset(wall_time, boot_offset)
      
      where wall_time is the value returned by read_persistent_clock() and
      boot_offset is wall_time - boot time, which defaults to 0.
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Cc: linux@armlinux.org.uk
      Cc: schwidefsky@de.ibm.com
      Cc: heiko.carstens@de.ibm.com
      Cc: john.stultz@linaro.org
      Cc: sboyd@codeaurora.org
      Cc: hpa@zytor.com
      Cc: douly.fnst@cn.fujitsu.com
      Cc: peterz@infradead.org
      Cc: prarit@redhat.com
      Cc: feng.tang@intel.com
      Cc: pmladek@suse.com
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: linux-s390@vger.kernel.org
      Cc: boris.ostrovsky@oracle.com
      Cc: jgross@suse.com
      Cc: pbonzini@redhat.com
      Link: https://lkml.kernel.org/r/20180719205545.16512-16-pasha.tatashin@oracle.com
      3eca9937
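      A hedged sketch of how a caller can combine the two values, following the
      relation stated above (variable names are illustrative):

        struct timespec64 wall_time, boot_offset, boot_time;

        read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
        /* boot_offset == wall_time - boot time, so with a zero default the
         * boot is simply assumed to have happened at wall_time. */
        boot_time = timespec64_sub(wall_time, boot_offset);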
  2. 18 July 2018 (1 commit)
    • Mark HI and TASKLET softirq synchronous · 3c53776e
      Committed by Linus Torvalds
      Way back in 4.9, we committed 4cd13c21 ("softirq: Let ksoftirqd do
      its job"), and ever since we've had small nagging issues with it.  For
      example, we've had:
      
        1ff68820 ("watchdog: core: make sure the watchdog_worker is not deferred")
        8d5755b3 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
        217f6974 ("net: busy-poll: allow preemption in sk_busy_loop()")
      
      all of which worked around some of the effects of that commit.
      
      The DVB people have also complained that the commit causes excessive USB
      URB latencies, which seems to be due to the USB code using tasklets to
      schedule USB traffic.  This seems to be an issue mainly when already
      living on the edge, but waiting for ksoftirqd to handle it really does
      seem to cause excessive latencies.
      
      Now Hanna Hawa reports that this issue isn't just limited to USB URB and
      DVB, but also causes timeout problems for the Marvell SoC team:
      
       "I'm facing kernel panic issue while running raid 5 on sata disks
        connected to Macchiatobin (Marvell community board with Armada-8040
        SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine
        and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2)
        uses a tasklet to clean the done descriptors from the queue"
      
      The latency problem causes a panic:
      
        mv_xor_v2 f0400000.xor: dma_sync_wait: timeout!
        Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction
      
      We've discussed simply just reverting the original commit entirely, and
      also much more involved solutions (with per-softirq threads etc).  This
      patch is intentionally stupid and fairly limited, because the issue
      still remains, and the other solutions either got sidetracked or had
      other issues.
      
      We should probably also consider the timer softirqs to be synchronous
      and not be delayed to ksoftirqd (since they were the issue with the
      earlier watchdog problems), but that should be done as a separate patch.
      This does only the tasklet cases.
      Reported-and-tested-by: Hanna Hawa <hannah@marvell.com>
      Reported-and-tested-by: Josef Griebichler <griebichler.josef@gmx.at>
      Reported-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c53776e
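      One way the described behaviour can be expressed (a sketch of the idea,
      not necessarily the exact upstream hunk): treat pending HI/TASKLET
      softirqs as "handle now" even when ksoftirqd is active.

        /* Softirqs in this mask are never deferred to ksoftirqd. */
        #define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))

        static bool ksoftirqd_running(unsigned long pending)
        {
                struct task_struct *tsk = __this_cpu_read(ksoftirqd);

                if (pending & SOFTIRQ_NOW_MASK)
                        return false;   /* handle these synchronously */
                return tsk && (tsk->state == TASK_RUNNING);
        }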
  3. 13 July 2018 (2 commits)
    • tracing: Reorder display of TGID to be after PID · f8494fa3
      Committed by Joel Fernandes (Google)
      Currently ftrace displays data in trace output like so:
      
                                             _-----=> irqs-off
                                            / _----=> need-resched
                                           | / _---=> hardirq/softirq
                                           || / _--=> preempt-depth
                                           ||| /     delay
                  TASK-PID   CPU    TGID   ||||    TIMESTAMP  FUNCTION
                     | |       |      |    ||||       |         |
                  bash-1091  [000] ( 1091) d..2    28.313544: sched_switch:
      
      However, Android's trace visualization tools expect a slightly different
      format due to an out-of-tree patch that has been carried for a decade;
      notice that the TGID and CPU fields are reversed:
      
                                             _-----=> irqs-off
                                            / _----=> need-resched
                                           | / _---=> hardirq/softirq
                                           || / _--=> preempt-depth
                                           ||| /     delay
                  TASK-PID    TGID   CPU   ||||    TIMESTAMP  FUNCTION
                     | |        |      |   ||||       |         |
                  bash-1091  ( 1091) [002] d..2    64.965177: sched_switch:
      
      From kernel v4.13 onwards, where TGID support was introduced, tracing
      with systrace on all Android kernels will break (most Android kernels
      have been on 4.9 with Android patches, so this issue hasn't been seen
      yet).
      
      The chrome browser's tracing tools also embed the systrace viewer which
      uses the legacy TGID format and updates to that are known to be
      difficult to make.
      
      Considering this, I suggest we make this change to the upstream kernel
      and backport it to all Android kernels. I believe this feature is merged
      recently enough into the upstream kernel that it shouldn't be a problem.
      Also logically, IMO it makes more sense to group the TGID with the
      TASK-PID and the CPU after these.
      
      Link: http://lkml.kernel.org/r/20180626000822.113931-1-joel@joelfernandes.org
      
      Cc: jreck@google.com
      Cc: tkjos@google.com
      Cc: stable@vger.kernel.org
      Fixes: 441dae8f ("tracing: Add support for display of tgid in trace output")
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      f8494fa3
    • bpf: don't leave partial mangled prog in jit_subprogs error path · c7a89784
      Committed by Daniel Borkmann
      syzkaller managed to trigger the following bug through fault injection:
      
        [...]
        [  141.043668] verifier bug. No program starts at insn 3
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51
        [  141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014
        [  141.049877] Call Trace:
        [  141.050324]  __dump_stack lib/dump_stack.c:77 [inline]
        [  141.050324]  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
        [  141.050950]  ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60
        [  141.051837]  panic+0x238/0x4e7 kernel/panic.c:184
        [  141.052386]  ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385
        [  141.053101]  ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537
        [  141.053814]  ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530
        [  141.054506]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.054506]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.054506]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.055163]  __warn.cold.8+0x163/0x1ba kernel/panic.c:538
        [  141.055820]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.055820]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.055820]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [...]
      
      What happens in jit_subprogs() is that the kcalloc() for the subprog func
      buffer fails with NULL and we then bail out. The latter is a plain
      return -ENOMEM, and this is definitely not okay, since earlier in the
      loop we walk all subprogs and temporarily rewrite insn->off to remember
      the subprog id as well as insn->imm to temporarily point the call to
      __bpf_call_base + 1 for the initial JIT pass. Thus, bailing out in such
      a state and handing this over to the interpreter is troublesome since
      subsequent lookups such as find_subprog() are based on the wrong
      insn->imm.
      
      Therefore, once we hit this point, we need to jump to the out_free path,
      where we undo all changes from the earlier loop, so that the interpreter
      can work on unmodified insn->{off,imm}.
      
      Another point is that, should find_subprog() fail in jit_subprogs() due
      to a verifier bug, we also should not simply defer the program to the
      interpreter, since here too we have made partial modifications. Instead
      we should just bail out entirely and return an error to the user who is
      trying to load the program.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c7a89784
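      The fix follows the usual "undo partial modifications on error" pattern.
      A generic, self-contained illustration of that pattern (plain C, not the
      verifier code itself):

        #include <stdlib.h>

        static int transform_all(int *vals, int n)
        {
                int i, *scratch;

                for (i = 0; i < n; i++)
                        vals[i] += 1000;        /* temporary in-place rewrite */

                scratch = calloc(n, sizeof(*scratch));
                if (!scratch)
                        goto out_undo;          /* don't leave vals half-mangled */

                /* ... use scratch, commit the transformation ... */
                free(scratch);
                return 0;

        out_undo:
                for (i = 0; i < n; i++)
                        vals[i] -= 1000;        /* restore the original values */
                return -1;
        }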
  4. 12 July 2018 (2 commits)
  5. 11 July 2018 (5 commits)
    • rseq: uapi: Declare rseq_cs field as union, update includes · ec9c82e0
      Committed by Mathieu Desnoyers
      Declaring the rseq_cs field as a union between __u64 and two __u32
      allows both 32-bit and 64-bit kernels to read the full __u64, and
      therefore validate that a 32-bit user-space cleared the upper 32
      bits, thus ensuring a consistent behavior between native 32-bit
      kernels and 32-bit compat tasks on 64-bit kernels.
      
      Check that the rseq_cs value read is < TASK_SIZE.
      
      The asm/byteorder.h header needs to be included by rseq.h, now
      that it is not using linux/types_32_64.h anymore.
      
      Considering that only __u32 and __u64 types are declared in linux/rseq.h,
      the linux/types.h header should always be included for both kernel and
      user-space code: including stdint.h is just for u64 and u32, which are
      not used in this header at all.
      
      Use copy_from_user()/clear_user() to interact with a 64-bit field,
      because arm32 does not implement a 64-bit __get_user(), and ppc32 does
      not have a 64-bit get_user(). Considering that the rseq_cs pointer does
      not need to
      be loaded/stored with single-copy atomicity from the kernel anymore, we
      can simply use copy_from_user()/clear_user().
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com
      ec9c82e0
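      A sketch of the union idea (field names illustrative and layout
      abbreviated; the authoritative definition is in include/uapi/linux/rseq.h):

        #include <linux/types.h>

        /* Both 32-bit and 64-bit kernels read the full ptr64; 32-bit
         * user-space writes ptr.ptr32 and must keep ptr.padding cleared.
         * Little-endian layout shown; the two words are swapped on
         * big-endian. */
        union rseq_cs_ptr {
                __u64 ptr64;
                struct {
                        __u32 ptr32;
                        __u32 padding;
                } ptr;
        };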
    • rseq: uapi: Update uapi comments · 0fb9a1ab
      Committed by Mathieu Desnoyers
      Update rseq uapi header comments to reflect that user-space needs to do
      thread-local loads/stores from/to the struct rseq fields.
      
      As a consequence of this added requirement, the kernel does not need
      to perform loads/stores with single-copy atomicity.
      
      Update the comment associated with the "flags" fields to describe
      more accurately that it's only useful to facilitate single-stepping
      through rseq critical sections with debuggers.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com
      0fb9a1ab
    • rseq: Use get_user/put_user rather than __get_user/__put_user · 8f281770
      Committed by Mathieu Desnoyers
      __get_user()/__put_user() are used to access values in address ranges
      that were already checked with access_ok() at rseq registration time.
      
      It has been recognized that __get_user/__put_user are optimizing the
      wrong thing. Replace them by get_user/put_user across rseq instead.
      
      If those end up showing up in benchmarks, the proper approach would be to
      use user_access_begin() / unsafe_{get,put}_user() / user_access_end()
      anyway.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: https://lkml.kernel.org/r/20180709195155.7654-3-mathieu.desnoyers@efficios.com
      8f281770
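      A hedged before/after sketch of the change ("flags" is just an example
      field, and t stands for the task's task_struct):

        u32 flags;

        /* before: relied solely on the access_ok() done at registration */
        if (__get_user(flags, &t->rseq->flags))
                return -EFAULT;

        /* after: plain get_user() performs its own range check */
        if (get_user(flags, &t->rseq->flags))
                return -EFAULT;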
    • rseq: Use __u64 for rseq_cs fields, validate user inputs · e96d7135
      Committed by Mathieu Desnoyers
      Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip
      fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather
      than ignoring the upper 32 bits on 32-bit kernels. This ensures we have a
      consistent behavior for a 32-bit binary executed on 32-bit kernels and in
      compat mode on 64-bit kernels.
      
      Validating that the abort_ip field is below TASK_SIZE ensures the kernel
      doesn't return to an invalid address when returning to userspace after
      an abort. I don't fully trust each architecture's code to consistently
      deal with invalid return addresses.
      
      Validating the value of the start_ip and post_commit_offset fields
      prevents overflow on arithmetic performed on those values, used to
      check whether abort_ip is within the rseq critical section.
      
      If validation fails, the process is killed with a segmentation fault.
      
      When the signature encountered before abort_ip does not match the expected
      signature, return -EINVAL rather than -EPERM to be consistent with other
      input validation return codes from rseq_get_rseq_cs().
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com
      e96d7135
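      A sketch of the validation described above (the exact bound checks in
      rseq_get_rseq_cs() may be arranged differently):

        /* All three fields are __u64 now; reject anything outside the user
         * address space, avoiding overflow in the start_ip + offset math.
         * The caller turns the error into a SIGSEGV for the task. */
        if (rseq_cs.start_ip >= TASK_SIZE ||
            rseq_cs.post_commit_offset > TASK_SIZE - rseq_cs.start_ip ||
            rseq_cs.abort_ip >= TASK_SIZE)
                return -EINVAL;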
    • Revert "tick: Prefer a lower rating device only if it's CPU local device" · 5b5ccbc2
      Committed by Sudeep Holla
      This reverts commit 1332a905.
      
      The original issue was not caused by incorrect checking of the cpumask
      for both the new and the old tick device. It was incorrectly analysed
      due to a misunderstanding of the comment and a misinterpretation of the
      return value from tick_check_preferred(). The main issue is with the
      clockevent driver that sets the cpumask to cpu_all_mask instead of
      cpu_possible_mask.
      Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Kevin Hilman <khilman@baylibre.com>
      Tested-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/1531151136-18297-1-git-send-email-sudeep.holla@arm.com
      5b5ccbc2
  6. 08 July 2018 (6 commits)
    • xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      Committed by Toshiaki Makita
      Otherwise we end up attempting to send packets from down devices or to
      send oversized packets, which may cause unexpected driver/device
      behaviour. Generic XDP already does this check, so reuse the logic in
      native XDP.
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d8d7218a
    • bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      Committed by John Fastabend
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in its
      entirety. Then, when v4.14 was merged into the v4.15-rc1 net-next tree,
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      When it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage. Using the bpf_skc_data_end mapping is not correct because it
      expects a qdisc_skb_cb object but at the sock layer this is not the
      case. Even though it happens to work here because we don't overwrite
      any data in-use at the socket layer and the cb structure is cleared
      later this has potential to create some subtle issues. But, even
      more concretely the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offsetof() to track down the byte offset we get 40 in
      either case and everything continues to work. Fix this mess and use the
      correct structures; it's unclear how long this might actually keep
      working until someone moves the structs around.
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0ea488ff
    • bpf: sockmap, consume_skb in close path · 7ebc14d5
      Committed by John Fastabend
      Currently, when a sock is closed and the bpf_tcp_close() callback is
      used we remove memory but do not free the skb. Call consume_skb() if
      the skb is attached to the buffer.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 1aa12bdf ("bpf: sockmap, add sock close() hook to remove socks")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7ebc14d5
    • bpf: sockhash, disallow bpf_tcp_close and update in parallel · 99ba2b5a
      Committed by John Fastabend
      After the latest lock updates there is no longer anything preventing a
      close and a recvmsg call from running in parallel. Additionally, we can
      race an update with a close if we close a socket and simultaneously
      update it via the BPF userspace API (note the cgroup ops are already run
      with sock_lock held).
      
      To resolve this, take sock_lock in the close and update paths.
      
      Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      99ba2b5a
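      A sketch of the serialization described above (illustrative; both the
      close path and the map-update path end up bracketed like this):

        lock_sock(sk);
        /* ... tear down or update psock state for this socket ... */
        release_sock(sk);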
    • bpf: sockmap, hash table is RCU so readers do not need locks · 1d1ef005
      Committed by John Fastabend
      This removes locking from readers of the RCU hash table. It's not
      necessary.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1d1ef005
    • bpf: sockmap, error path can not release psock in multi-map case · 547b3aa4
      Committed by John Fastabend
      The current code, in the error path of sock_hash_ctx_update_elem,
      checks if the sock has a psock in the user data and if so decrements
      the reference count of the psock. However, if the error happens early
      in the error path we may have never incremented the psock reference
      count and if the psock exists because the sock is in another map then
      we may inadvertently decrement the reference count.
      
      Fix this by making the error path only call smap_release_sock if the
      error happens after the increment.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      547b3aa4
  7. 04 July 2018 (7 commits)
  8. 03 July 2018 (6 commits)
    • kthread, sched/core: Fix kthread_parkme() (again...) · 1cef1150
      Committed by Peter Zijlstra
      Gaurav reports that commit:
      
        85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      
      isn't working for him. Because of the following race:
      
      > controller Thread                               CPUHP Thread
      > takedown_cpu
      > kthread_park
      > kthread_parkme
      > Set KTHREAD_SHOULD_PARK
      >                                                 smpboot_thread_fn
      >                                                 set Task interruptible
      >
      >
      > wake_up_process
      >  if (!(p->state & state))
      >                 goto out;
      >
      >                                                 Kthread_parkme
      >                                                 SET TASK_PARKED
      >                                                 schedule
      >                                                 raw_spin_lock(&rq->lock)
      > ttwu_remote
      > waiting for __task_rq_lock
      >                                                 context_switch
      >
      >                                                 finish_lock_switch
      >
      >
      >
      >                                                 Case TASK_PARKED
      >                                                 kthread_park_complete
      >
      >
      > SET Running
      
      Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
      handling is buggered because the TASK_DEAD thing is done with
      preemption disabled; the current code can still complete early on
      preemption :/
      
      So basically revert that earlier fix and go with a variant of the
      alternative mentioned in the commit. Promote TASK_PARKED to a special
      state to avoid the store-store issue on task->state leading to the
      WARN in kthread_unpark() -> __kthread_bind().
      
      But in addition, add wait_task_inactive() to kthread_park() to ensure
      the task really is PARKED when we return from kthread_park(). This
      avoids the whole "kthread still gets migrated" nonsense -- although it
      would be really good to get this done differently.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1cef1150
    • sched/util_est: Fix util_est_dequeue() for throttled cfs_rq · 3482d98b
      Committed by Vincent Guittot
      When a cfs_rq is throttled, the parent cfs_rq->nr_running is decreased
      and everything happens at the cfs_rq level. Currently util_est stays
      unchanged in such a case and keeps accounting the utilization of
      throttled tasks. This can somewhat make sense, as we don't dequeue the
      tasks but only the throttled cfs_rq.
      
      If a task of another group is enqueued/dequeued and the root cfs_rq
      becomes idle during the dequeue, util_est will be cleared, whereas it was
      accounting the util_est of throttled tasks before. So the behavior of
      util_est is not always the same regarding throttled tasks and depends on
      side activity. Furthermore, util_est will not be updated when the cfs_rq
      is unthrottled, as everything happens at the cfs_rq level. The main
      result is that util_est will stay null even though we now have running
      tasks. We have to wait for the next dequeue/enqueue of the previously
      throttled tasks to get an up-to-date util_est.
      
      Remove the assumption that cfs_rq's estimated utilization of a CPU is 0
      if there is no running task so the util_est of a task remains until the
      latter is dequeued even if its cfs_rq has been throttled.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 7f65ea42 ("sched/fair: Add util_est on top of PELT")
      Link: http://lkml.kernel.org/r/1528972380-16268-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3482d98b
    • sched/fair: Advance global expiration when period timer is restarted · f1d1be8a
      Committed by Xunlei Pang
      When the period gets restarted after some idle time, start_cfs_bandwidth()
      doesn't update the expiration information; expire_cfs_rq_runtime() will
      then see cfs_rq->runtime_expires smaller than the rq clock and go to the
      clock drift logic, wasting needless CPU cycles on the scheduler hot path.
      
      Update the global expiration in start_cfs_bandwidth() to avoid frequent
      expire_cfs_rq_runtime() calls once a new period begins.
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f1d1be8a
    • sched/fair: Fix bandwidth timer clock drift condition · 512ac999
      Committed by Xunlei Pang
      I noticed that cgroup task groups constantly get throttled even
      if they have low CPU usage; this causes some jitter in the response
      time of some of our business containers when enabling CPU quotas.
      
      It's very simple to reproduce:
      
        mkdir /sys/fs/cgroup/cpu/test
        cd /sys/fs/cgroup/cpu/test
        echo 100000 > cpu.cfs_quota_us
        echo $$ > tasks
      
      then repeat:
      
        cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily
      
      After some analysis, we found that cfs_rq::runtime_remaining will
      be cleared by expire_cfs_rq_runtime() due to two equal but stale
      "cfs_{b|q}->runtime_expires" after period timer is re-armed.
      
      The current condition to judge clock drift in expire_cfs_rq_runtime()
      is wrong: the two runtime_expires are actually the same when clock
      drift happens, so this condition can never hit. The original design was
      correctly done by this commit:
      
        a9cf55b2 ("sched: Expire invalid runtime")
      
      ... but was changed to be the current implementation due to its locking bug.
      
      This patch introduces another way: it adds a new field to both the
      cfs_rq and cfs_bandwidth structures to record the expiration update
      sequence, and uses them to figure out whether clock drift has happened
      (true if they are equal).
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 51f2176d ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
      Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      512ac999
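      A sketch of the new drift check (field names as described above; the
      surrounding expire_cfs_rq_runtime() logic is abbreviated):

        if (cfs_rq->expires_seq == cfs_b->expires_seq) {
                /* Still in the same global period: the rq clock merely
                 * drifted ahead, so extend the local deadline a bit. */
                cfs_rq->runtime_expires += TICK_NSEC;
        } else {
                /* The global period really moved on: runtime has expired. */
                cfs_rq->runtime_remaining = 0;
        }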
    • sched/rt: Fix call to cpufreq_update_util() · 296b2ffe
      Committed by Vincent Guittot
      With commit:
      
        8f111bc3 ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
      
      the schedutil governor uses rq->rt.rt_nr_running to detect whether an
      RT task is currently running on the CPU and to set frequency to max
      if necessary.
      
      cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but
      rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is
      called so schedutil still considers that an RT task is running when the
      last task is dequeued. The update of rq->rt.rt_nr_running happens later
      in dequeue_rt_stack().
      
      In fact, we can take advantage of the fact that rt entities are dequeued
      and then re-enqueued when an RT task is enqueued or dequeued; as a
      result, enqueue_top_rt_rq() is always called when a task is enqueued or
      dequeued, and also when groups are throttled or unthrottled. The only
      place that does not use enqueue_top_rt_rq() is when the root rt_rq is
      throttled.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: juri.lelli@redhat.com
      Cc: patrick.bellasi@arm.com
      Cc: viresh.kumar@linaro.org
      Fixes: 8f111bc3 ('cpufreq/schedutil: Rewrite CPUFREQ_RT support')
      Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      296b2ffe
    • sched/nohz: Skip remote tick on idle task entirely · d9c0ffca
      Committed by Frederic Weisbecker
      Some people have reported that the warning in sched_tick_remote()
      occasionally triggers, especially under some RCU-Torture
      pressure:
      
      	WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0
      	Modules linked in:
      	CPU: 11 PID: 906 Comm: kworker/u32:3 Not tainted 4.18.0-rc2+ #1
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      	Workqueue: events_unbound sched_tick_remote
      	RIP: 0010:sched_tick_remote+0xb6/0xc0
      	Code: e8 0f 06 b8 00 c6 03 00 fb eb 9d 8b 43 04 85 c0 75 8d 48 8b 83 e0 0a 00 00 48 85 c0 75 81 eb 88 48 89 df e8 bc fe ff ff eb aa <0f> 0b eb
      	+c5 66 0f 1f 44 00 00 bf 17 00 00 00 e8 b6 2e fe ff 0f b6
      	Call Trace:
      	 process_one_work+0x1df/0x3b0
      	 worker_thread+0x44/0x3d0
      	 kthread+0xf3/0x130
      	 ? set_worker_desc+0xb0/0xb0
      	 ? kthread_create_worker_on_cpu+0x70/0x70
      	 ret_from_fork+0x35/0x40
      
      This happens when the remote tick applies on an idle task. Usually the
      idle_cpu() check avoids that, but it is performed before we lock the
      runqueue and it is therefore racy. It was intended to be that way in
      order to prevent from useless runqueue locks since idle task tick
      callback is a no-op.
      
      Now if the racy check slips out of our hands and we end up remotely
      ticking an idle task, the empty task_tick_idle() is harmless. Still
      it won't pass the WARN_ON_ONCE() test that ensures rq_clock_task() is
      not too far from curr->se.exec_start because update_curr_idle() doesn't
      update the exec_start value like other scheduler policies. Hence the
      reported false positive.
      
      So let's have another check, while the rq is locked, to make sure we
      don't remote tick on an idle task. The lockless idle_cpu() still applies
      to avoid unnecessary rq lock contention.
      Reported-by: Jacek Tomaka <jacekt@dug.com>
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1530203381-31234-1-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d9c0ffca
  9. 01 July 2018 (4 commits)
    • bpf: sockhash, add release routine · caac76a5
      Committed by John Fastabend
      Add map_release_uref pointer to hashmap ops. This was dropped when
      original sockhash code was ported into bpf-next before initial
      commit.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      caac76a5
    • bpf: sockhash fix omitted bucket lock in sock_close · e9db4ef6
      Committed by John Fastabend
      First, the sk_callback_lock was being used to protect both the
      sock callback hooks and the psock->maps list. This got overly
      convoluted after the addition of sockhash (in sockmap it made
      some sense because maps and callbacks were tightly coupled), so
      let's split out a specific lock for maps and only use the callback
      lock for its intended purpose. This fixes a couple of cases where
      we missed using the maps lock when it was in fact needed. Also this
      makes it easier to follow the code because now we can put the
      locking closer to the actual code it's serializing.
      
      Next, in sock_hash_delete_elem() the pattern was as follows,
      
        sock_hash_delete_elem()
           [...]
           spin_lock(bucket_lock)
           l = lookup_elem_raw()
           if (l)
              hlist_del_rcu()
              write_lock(sk_callback_lock)
               .... destroy psock ...
              write_unlock(sk_callback_lock)
           spin_unlock(bucket_lock)
      
      The ordering is necessary because we only know the {p}sock after
      dereferencing the hash table which we can't do unless we have the
      bucket lock held. Once we have the bucket lock and the psock element
      it is deleted from the hashmap to ensure any other path doing a lookup
      will fail. Finally, the refcnt is decremented and if zero the psock
      is destroyed.
      
      In parallel with the above (or free'ing the map) a tcp close event
      may trigger tcp_close(). Which at the moment omits the bucket lock
      altogether (oops!) where the flow looks like this,
      
        bpf_tcp_close()
           [...]
           write_lock(sk_callback_lock)
           for each psock->maps // list of maps this sock is part of
               hlist_del_rcu(ref_hash_node);
               .... destroy psock ...
           write_unlock(sk_callback_lock)
      
      Obviously, as demonstrated by syzbot, this is broken because
      we can have multiple threads deleting entries via hlist_del_rcu().
      
      To fix this we might be tempted to wrap the hlist operation in a
      bucket lock, but that would create a lock inversion problem. In
      summary, to follow the locking rules the psock's maps list needs the
      sk_callback_lock (after this patch, maps_lock), but we need the bucket
      lock to do the hlist_del_rcu().
      
      To resolve the lock inversion problem, pop the head of the maps list
      repeatedly and remove the reference until no more are left. If a
      delete happens in parallel from the BPF API that is OK as well, because
      it will do a similar action: look up the sock in the map/hash, delete
      it from the map/hash, and dec the refcnt. We check for this case
      before doing a destroy on the psock to ensure we don't have two
      threads tearing down a psock. The new logic is as follows,
      
        bpf_tcp_close()
        e = psock_map_pop(psock->maps) // done with map lock
        bucket_lock() // lock hash list bucket
        l = lookup_elem_raw(head, hash, key, key_size);
        if (l) {
           //only get here if elmnt was not already removed
           hlist_del_rcu()
           ... destroy psock...
        }
        bucket_unlock()
      
      And finally, for all the above to work, add the missing locking around
      map operations per the above. Then add RCU annotations and use
      rcu_dereference/rcu_assign_pointer to manage values relying on RCU so
      that the object is not free'd from sock_hash_free() while it is being
      referenced in bpf_tcp_close().
      
      Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      e9db4ef6
    • bpf: sockmap, fix smap_list_map_remove when psock is in many maps · 54fedb42
      Committed by John Fastabend
      If a hashmap is free'd with open socks it removes the reference to
      the hash entry from the psock. If that is the last reference to the
      psock then it will also be free'd by the reference counting logic.
      However the current logic that removes the hash reference from the
      list of references is broken. In smap_list_remove() we first check
      if the sockmap entry matches and then check if the hashmap entry
      matches. But, the sockmap entry will always match because it's NULL in
      this case, which causes the first entry to be removed from the list.
      If this is always the "right" entry (because the user adds/removes
      entries in order) then everything is OK but otherwise a subsequent
      bpf_tcp_close() may reference a free'd object.
      
      To fix this create two list handlers one for sockmap and one for
      sockhash.
      
      Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      54fedb42
    • bpf: sockmap, fix crash when ipv6 sock is added · 9901c5d7
      Committed by John Fastabend
      This fixes a crash where we assign tcp_prot to IPv6 sockets instead
      of tcpv6_prot.
      
      Previously we overwrote the sk->prot field with tcp_prot even in the
      AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
      are used.
      
      Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
      'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
      crashing case here.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9901c5d7
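      A simplified sketch of the selection (the actual fix is more involved and
      also adds the ESTABLISHED check mentioned above):

        /* Pick the proto template matching the socket's address family
         * instead of unconditionally using the IPv4 one. */
        struct proto *base = (sk->sk_family == AF_INET6) ? &tcpv6_prot
                                                         : &tcp_prot;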
  10. 30 June 2018 (1 commit)
    • bpf: undo prog rejection on read-only lock failure · 85782e03
      Committed by Daniel Borkmann
      Partially undo commit 9facc336 ("bpf: reject any prog that failed
      read-only lock") since it caused a regression, that is, syzkaller was
      able to manage to cause a panic via fault injection deep in set_memory_ro()
      path by letting an allocation fail: In x86's __change_page_attr_set_clr()
      it was able to change the attributes of the primary mapping but not in
      the alias mapping via cpa_process_alias(), so the second, inner call
      to the __change_page_attr() via __change_page_attr_set_clr() had to split
      a larger page and failed in the alloc_pages() with the artifically triggered
      allocation error which is then propagated down to the call site.
      
      Thus, for set_memory_ro() this means that it returned with an error, but
      debugging with probe_kernel_write() revealed EFAULT on that memory since
      the primary mapping had successfully been changed. Therefore the
      subsequent hdr->locked = 0 reset triggered the panic as it was performed
      on read-only memory, so the call-site assumption that it would either
      succeed /or/ not succeed at all was in fact wrong, since there's no such
      rollback in set_memory_*() calls on a partial change of mappings; in
      other words, we're left in a state that is "half done". A later undo via
      set_memory_rw() succeeds though, due to matching permissions on that part
      (aka due to try_preserve_large_page() succeeding). While reproducing
      locally with explicitly triggering this error, the initial splitting only
      happens on rare occasions, and in the real world it would additionally
      need OOM conditions,
      but that said, it could partially fail. Therefore, it is definitely wrong
      to bail out on set_memory_ro() error and reject the program with the
      set_memory_*() semantics we have today. We shouldn't have gone the extra
      mile, since no other in-tree user today in fact checks for any set_memory_*()
      errors, e.g. neither module_enable_ro() / module_disable_ro() for module
      RO/NX handling which is mostly default these days nor kprobes core with
      alloc_insn_page() / free_insn_page() as examples that could be invoked long
      after bootup and original 314beb9b ("x86: bpf_jit_comp: secure bpf jit
      against spraying attacks") did neither when it got first introduced to BPF
      so "improving" with bailing out was clearly not right when set_memory_*()
      cannot handle it today.
      
      Kees suggested that if set_memory_*() can fail, we should annotate it with
      __must_check, and all callers need to deal with it gracefully given those
      set_memory_*() markings aren't "advisory", but they're expected to actually
      do what they say. This might be an option worth moving forward with in
      the future, but it would at the same time require that set_memory_*()
      calls from supporting archs are guaranteed to be "atomic" in that they
      provide rollback if part of the range fails. Once that happened, the
      transition from RW -> RO could be made more robust that way, while the
      subsequent RO -> RW transition /must/ continue to guarantee that the
      undo part always succeeds.
      
      Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
      Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
      Fixes: 9facc336 ("bpf: reject any prog that failed read-only lock")
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      85782e03
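      A sketch of Kees' suggestion from the last paragraph (prototype shape
      only; the real declarations live in the arch set_memory headers):

        /* Make set_memory_*() failures impossible to silently ignore. */
        int __must_check set_memory_ro(unsigned long addr, int numpages);
        int __must_check set_memory_rw(unsigned long addr, int numpages);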