1. 25 7月, 2018 4 次提交
    • H
      sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE · f3d133ee
      Hailong Liu 提交于
      NO_RT_RUNTIME_SHARE feature is used to prevent a CPU borrow enough
      runtime with a spin-rt-task.
      
      However, if RT_RUNTIME_SHARE feature is enabled and rt_rq has borrowd
      enough rt_runtime at the beginning, rt_runtime can't be restored to
      its initial bandwidth rt_runtime after we disable RT_RUNTIME_SHARE.
      
      E.g. on my PC with 4 cores, procedure to reproduce:
      1) Make sure  RT_RUNTIME_SHARE is enabled
       cat /sys/kernel/debug/sched_features
        GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
        CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
        LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
        NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
        ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
      2) Start a spin-rt-task
       ./loop_rr &
      3) set affinity to the last cpu
       taskset -p 8 $pid_of_loop_rr
      4) Observe that last cpu have borrowed enough runtime.
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      5) Disable RT_RUNTIME_SHARE
       echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
      6) Observe that rt_runtime can not been restored
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      
      This patch help to restore rt_runtime after we disable
      RT_RUNTIME_SHARE.
      Signed-off-by: NHailong Liu <liu.hailong6@zte.com.cn>
      Signed-off-by: NJiang Biao <jiang.biao2@zte.com.cn>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cnSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f3d133ee
    • D
      sched/deadline: Update rq_clock of later_rq when pushing a task · 840d7196
      Daniel Bristot de Oliveira 提交于
      Daniel Casini got this warn while running a DL task here at RetisLab:
      
        [  461.137582] ------------[ cut here ]------------
        [  461.137583] rq->clock_update_flags < RQCF_ACT_SKIP
        [  461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 assert_clock_updated.isra.32.part.33+0x17/0x20
            [a ton of modules]
        [  461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3
        [  461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013
        [  461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20
        [  461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b
        [  461.137673] RSP: 0018:ffffa77e08cafc68 EFLAGS: 00010082
        [  461.137674] RAX: 0000000000000000 RBX: ffff8b3fc1702d80 RCX: 0000000000000006
        [  461.137674] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8b3fded164b0
        [  461.137675] RBP: ffffa77e08cafc68 R08: 0000000000000026 R09: 0000000000000339
        [  461.137676] R10: ffff8b3fd060d410 R11: 0000000000000026 R12: ffffffffa4e14e20
        [  461.137677] R13: ffff8b3fdec22940 R14: ffff8b3fc1702da0 R15: ffff8b3fdec22940
        [  461.137678] FS:  00007efe43ee5700(0000) GS:ffff8b3fded00000(0000) knlGS:0000000000000000
        [  461.137679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  461.137680] CR2: 00007efe30000010 CR3: 0000000301744003 CR4: 00000000001606e0
        [  461.137680] Call Trace:
        [  461.137684]  push_dl_task.part.46+0x3bc/0x460
        [  461.137686]  task_woken_dl+0x60/0x80
        [  461.137689]  ttwu_do_wakeup+0x4f/0x150
        [  461.137690]  ttwu_do_activate+0x77/0x80
        [  461.137692]  try_to_wake_up+0x1d6/0x4c0
        [  461.137693]  wake_up_q+0x32/0x70
        [  461.137696]  do_futex+0x7e7/0xb50
        [  461.137698]  __x64_sys_futex+0x8b/0x180
        [  461.137701]  do_syscall_64+0x5a/0x110
        [  461.137703]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [  461.137705] RIP: 0033:0x7efe4918ca26
        [  461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2
        [  461.137738] RSP: 002b:00007efe43ee4928 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca
        [  461.137739] RAX: ffffffffffffffda RBX: 0000000005094df0 RCX: 00007efe4918ca26
        [  461.137740] RDX: 0000000000000001 RSI: 0000000000000085 RDI: 0000000005094e24
        [  461.137741] RBP: 00007efe43ee49c0 R08: 0000000005094e20 R09: 0000000004000001
        [  461.137741] R10: 0000000000000001 R11: 0000000000000283 R12: 0000000000000000
        [  461.137742] R13: 0000000005094df8 R14: 0000000000000001 R15: 0000000000448a10
        [  461.137743] ---[ end trace 187df4cad2bf7649 ]---
      
      This warning happened in the push_dl_task(), because
      __add_running_bw()->cpufreq_update_util() is getting the rq_clock of
      the later_rq before its update, which takes place at activate_task().
      The fix then is to update the rq_clock before calling add_running_bw().
      
      To avoid double rq_clock_update() call, we set ENQUEUE_NOCLOCK flag to
      activate_task().
      Reported-by: NDaniel Casini <daniel.casini@santannapisa.it>
      Signed-off-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NJuri Lelli <juri.lelli@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
      Fixes: e0367b12 sched/deadline: Move CPU frequency selection triggering points
      Link: http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bristot@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      840d7196
    • I
      stop_machine: Disable preemption after queueing stopper threads · 2610e889
      Isaac J. Manjarres 提交于
      This commit:
      
        9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      
      does not fully address the race condition that can occur
      as follows:
      
      On one CPU, call it CPU 3, thread 1 invokes
      cpu_stop_queue_two_works(2, 3,...), and the execution is such
      that thread 1 queues the works for migration/2 and migration/3,
      and is preempted after releasing the locks for migration/2 and
      migration/3, but before waking the threads.
      
      Then, On CPU 2, a kworker, call it thread 2, is running,
      and it invokes cpu_stop_queue_two_works(1, 2,...), such that
      thread 2 queues the works for migration/1 and migration/2.
      Meanwhile, on CPU 3, thread 1 resumes execution, and wakes
      migration/2 and migration/3. This means that when CPU 2
      releases the locks for migration/1 and migration/2, but before
      it wakes those threads, it can be preempted by migration/2.
      
      If thread 2 is preempted by migration/2, then migration/2 will
      execute the first work item successfully, since migration/3
      was woken up by CPU 3, but when it goes to execute the second
      work item, it disables preemption, calls multi_cpu_stop(),
      and thus, CPU 2 will wait forever for migration/1, which should
      have been woken up by thread 2. However migration/1 cannot be
      woken up by thread 2, since it is a kworker, so it is affine to
      CPU 2, but CPU 2 is running migration/2 with preemption
      disabled, so thread 2 will never run.
      
      Disable preemption after queueing works for stopper threads
      to ensure that the operation of queueing the works and waking
      the stopper threads is atomic.
      Co-Developed-by: NPrasad Sodagudi <psodagud@codeaurora.org>
      Co-Developed-by: NPavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: NIsaac J. Manjarres <isaacm@codeaurora.org>
      Signed-off-by: NPrasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: NPavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: gregkh@linuxfoundation.org
      Cc: matt@codeblueprint.co.uk
      Fixes: 9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      Link: http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isaacm@codeaurora.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2610e889
    • Y
      sched/topology: Check variable group before dereferencing it · 6cd0c583
      Yi Wang 提交于
      The 'group' variable in sched_domain_debug_one() is not checked
      when firstly used in cpumask_test_cpu(cpu, sched_group_span(group)),
      but it might be NULL (it is checked later in the following while loop)
      and may cause NULL pointer dereference.
      
      We need to check it before using to avoid NULL dereference.
      Signed-off-by: NYi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NJiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1532319547-33335-1-git-send-email-wang.yi59@zte.com.cnSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6cd0c583
  2. 22 7月, 2018 3 次提交
    • L
      mm: make vm_area_alloc() initialize core fields · 490fc053
      Linus Torvalds 提交于
      Like vm_area_dup(), it initializes the anon_vma_chain head, and the
      basic mm pointer.
      
      The rest of the fields end up being different for different users,
      although the plan is to also initialize the 'vm_ops' field to a dummy
      entry.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      490fc053
    • L
      mm: make vm_area_dup() actually copy the old vma data · 95faf699
      Linus Torvalds 提交于
      .. and re-initialize th eanon_vma_chain head.
      
      This removes some boiler-plate from the users, and also makes it clear
      why it didn't need use the 'zalloc()' version.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95faf699
    • L
      mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5
      Linus Torvalds 提交于
      The vm_area_struct is one of the most fundamental memory management
      objects, but the management of it is entirely open-coded evertwhere,
      ranging from allocation and freeing (using kmem_cache_[z]alloc and
      kmem_cache_free) to initializing all the fields.
      
      We want to unify this in order to end up having some unified
      initialization of the vmas, and the first step to this is to at least
      have basic allocation functions.
      
      Right now those functions are literally just wrappers around the
      kmem_cache_*() calls.  This is a purely mechanical conversion:
      
          # new vma:
          kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
      
          # copy old vma
          kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
      
          # free vma
          kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
      
      to the point where the old vma passed in to the vm_area_dup() function
      isn't even used yet (because I've left all the old manual initialization
      alone).
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3928d4f5
  3. 18 7月, 2018 1 次提交
    • L
      Mark HI and TASKLET softirq synchronous · 3c53776e
      Linus Torvalds 提交于
      Way back in 4.9, we committed 4cd13c21 ("softirq: Let ksoftirqd do
      its job"), and ever since we've had small nagging issues with it.  For
      example, we've had:
      
        1ff68820 ("watchdog: core: make sure the watchdog_worker is not deferred")
        8d5755b3 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
        217f6974 ("net: busy-poll: allow preemption in sk_busy_loop()")
      
      all of which worked around some of the effects of that commit.
      
      The DVB people have also complained that the commit causes excessive USB
      URB latencies, which seems to be due to the USB code using tasklets to
      schedule USB traffic.  This seems to be an issue mainly when already
      living on the edge, but waiting for ksoftirqd to handle it really does
      seem to cause excessive latencies.
      
      Now Hanna Hawa reports that this issue isn't just limited to USB URB and
      DVB, but also causes timeout problems for the Marvell SoC team:
      
       "I'm facing kernel panic issue while running raid 5 on sata disks
        connected to Macchiatobin (Marvell community board with Armada-8040
        SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine
        and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2)
        uses a tasklet to clean the done descriptors from the queue"
      
      The latency problem causes a panic:
      
        mv_xor_v2 f0400000.xor: dma_sync_wait: timeout!
        Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction
      
      We've discussed simply just reverting the original commit entirely, and
      also much more involved solutions (with per-softirq threads etc).  This
      patch is intentionally stupid and fairly limited, because the issue
      still remains, and the other solutions either got sidetracked or had
      other issues.
      
      We should probably also consider the timer softirqs to be synchronous
      and not be delayed to ksoftirqd (since they were the issue with the
      earlier watchdog problems), but that should be done as a separate patch.
      This does only the tasklet cases.
      Reported-and-tested-by: NHanna Hawa <hannah@marvell.com>
      Reported-and-tested-by: NJosef Griebichler <griebichler.josef@gmx.at>
      Reported-by: NMauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c53776e
  4. 16 7月, 2018 16 次提交
  5. 15 7月, 2018 1 次提交
    • I
      stop_machine: Disable preemption when waking two stopper threads · 9fb8d5dc
      Isaac J. Manjarres 提交于
      When cpu_stop_queue_two_works() begins to wake the stopper threads, it does
      so without preemption disabled, which leads to the following race
      condition:
      
      The source CPU calls cpu_stop_queue_two_works(), with cpu1 as the source
      CPU, and cpu2 as the destination CPU. When adding the stopper threads to
      the wake queue used in this function, the source CPU stopper thread is
      added first, and the destination CPU stopper thread is added last.
      
      When wake_up_q() is invoked to wake the stopper threads, the threads are
      woken up in the order that they are queued in, so the source CPU's stopper
      thread is woken up first, and it preempts the thread running on the source
      CPU.
      
      The stopper thread will then execute on the source CPU, disable preemption,
      and begin executing multi_cpu_stop(), and wait for an ack from the
      destination CPU's stopper thread, with preemption still disabled. Since the
      worker thread that woke up the stopper thread on the source CPU is affine
      to the source CPU, and preemption is disabled on the source CPU, that
      thread will never run to dequeue the destination CPU's stopper thread from
      the wake queue, and thus, the destination CPU's stopper thread will never
      run, causing the source CPU's stopper thread to wait forever, and stall.
      
      Disable preemption when waking the stopper threads in
      cpu_stop_queue_two_works().
      
      Fixes: 0b26351b ("stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock")
      Co-Developed-by: NPrasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: NPrasad Sodagudi <psodagud@codeaurora.org>
      Co-Developed-by: NPavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: NPavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: NIsaac J. Manjarres <isaacm@codeaurora.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: peterz@infradead.org
      Cc: matt@codeblueprint.co.uk
      Cc: bigeasy@linutronix.de
      Cc: gregkh@linuxfoundation.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1530655334-4601-1-git-send-email-isaacm@codeaurora.org
      9fb8d5dc
  6. 13 7月, 2018 2 次提交
    • J
      tracing: Reorder display of TGID to be after PID · f8494fa3
      Joel Fernandes (Google) 提交于
      Currently ftrace displays data in trace output like so:
      
                                             _-----=> irqs-off
                                            / _----=> need-resched
                                           | / _---=> hardirq/softirq
                                           || / _--=> preempt-depth
                                           ||| /     delay
                  TASK-PID   CPU    TGID   ||||    TIMESTAMP  FUNCTION
                     | |       |      |    ||||       |         |
                  bash-1091  [000] ( 1091) d..2    28.313544: sched_switch:
      
      However Android's trace visualization tools expect a slightly different
      format due to an out-of-tree patch patch that was been carried for a
      decade, notice that the TGID and CPU fields are reversed:
      
                                             _-----=> irqs-off
                                            / _----=> need-resched
                                           | / _---=> hardirq/softirq
                                           || / _--=> preempt-depth
                                           ||| /     delay
                  TASK-PID    TGID   CPU   ||||    TIMESTAMP  FUNCTION
                     | |        |      |   ||||       |         |
                  bash-1091  ( 1091) [002] d..2    64.965177: sched_switch:
      
      From kernel v4.13 onwards, during which TGID was introduced, tracing
      with systrace on all Android kernels will break (most Android kernels
      have been on 4.9 with Android patches, so this issues hasn't been seen
      yet). From v4.13 onwards things will break.
      
      The chrome browser's tracing tools also embed the systrace viewer which
      uses the legacy TGID format and updates to that are known to be
      difficult to make.
      
      Considering this, I suggest we make this change to the upstream kernel
      and backport it to all Android kernels. I believe this feature is merged
      recently enough into the upstream kernel that it shouldn't be a problem.
      Also logically, IMO it makes more sense to group the TGID with the
      TASK-PID and the CPU after these.
      
      Link: http://lkml.kernel.org/r/20180626000822.113931-1-joel@joelfernandes.org
      
      Cc: jreck@google.com
      Cc: tkjos@google.com
      Cc: stable@vger.kernel.org
      Fixes: 441dae8f ("tracing: Add support for display of tgid in trace output")
      Signed-off-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      f8494fa3
    • D
      bpf: don't leave partial mangled prog in jit_subprogs error path · c7a89784
      Daniel Borkmann 提交于
      syzkaller managed to trigger the following bug through fault injection:
      
        [...]
        [  141.043668] verifier bug. No program starts at insn 3
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51
        [  141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014
        [  141.049877] Call Trace:
        [  141.050324]  __dump_stack lib/dump_stack.c:77 [inline]
        [  141.050324]  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
        [  141.050950]  ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60
        [  141.051837]  panic+0x238/0x4e7 kernel/panic.c:184
        [  141.052386]  ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385
        [  141.053101]  ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537
        [  141.053814]  ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530
        [  141.054506]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.054506]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.054506]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.055163]  __warn.cold.8+0x163/0x1ba kernel/panic.c:538
        [  141.055820]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.055820]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.055820]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [...]
      
      What happens in jit_subprogs() is that kcalloc() for the subprog func
      buffer is failing with NULL where we then bail out. Latter is a plain
      return -ENOMEM, and this is definitely not okay since earlier in the
      loop we are walking all subprogs and temporarily rewrite insn->off to
      remember the subprog id as well as insn->imm to temporarily point the
      call to __bpf_call_base + 1 for the initial JIT pass. Thus, bailing
      out in such state and handing this over to the interpreter is troublesome
      since later/subsequent e.g. find_subprog() lookups are based on wrong
      insn->imm.
      
      Therefore, once we hit this point, we need to jump to out_free path
      where we undo all changes from earlier loop, so that interpreter can
      work on unmodified insn->{off,imm}.
      
      Another point is that should find_subprog() fail in jit_subprogs() due
      to a verifier bug, then we also should not simply defer the program to
      the interpreter since also here we did partial modifications. Instead
      we should just bail out entirely and return an error to the user who is
      trying to load the program.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c7a89784
  7. 12 7月, 2018 2 次提交
  8. 11 7月, 2018 5 次提交
    • M
      rseq: uapi: Declare rseq_cs field as union, update includes · ec9c82e0
      Mathieu Desnoyers 提交于
      Declaring the rseq_cs field as a union between __u64 and two __u32
      allows both 32-bit and 64-bit kernels to read the full __u64, and
      therefore validate that a 32-bit user-space cleared the upper 32
      bits, thus ensuring a consistent behavior between native 32-bit
      kernels and 32-bit compat tasks on 64-bit kernels.
      
      Check that the rseq_cs value read is < TASK_SIZE.
      
      The asm/byteorder.h header needs to be included by rseq.h, now
      that it is not using linux/types_32_64.h anymore.
      
      Considering that only __32 and __u64 types are declared in linux/rseq.h,
      the linux/types.h header should always be included for both kernel and
      user-space code: including stdint.h is just for u64 and u32, which are
      not used in this header at all.
      
      Use copy_from_user()/clear_user() to interact with a 64-bit field,
      because arm32 does not implement 64-bit __get_user, and ppc32 does not
      64-bit get_user. Considering that the rseq_cs pointer does not need to
      be loaded/stored with single-copy atomicity from the kernel anymore, we
      can simply use copy_from_user()/clear_user().
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com
      ec9c82e0
    • M
      rseq: uapi: Update uapi comments · 0fb9a1ab
      Mathieu Desnoyers 提交于
      Update rseq uapi header comments to reflect that user-space need to do
      thread-local loads/stores from/to the struct rseq fields.
      
      As a consequence of this added requirement, the kernel does not need
      to perform loads/stores with single-copy atomicity.
      
      Update the comment associated to the "flags" fields to describe
      more accurately that it's only useful to facilitate single-stepping
      through rseq critical sections with debuggers.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com
      0fb9a1ab
    • M
      rseq: Use get_user/put_user rather than __get_user/__put_user · 8f281770
      Mathieu Desnoyers 提交于
      __get_user()/__put_user() is used to read values for address ranges that
      were already checked with access_ok() on rseq registration.
      
      It has been recognized that __get_user/__put_user are optimizing the
      wrong thing. Replace them by get_user/put_user across rseq instead.
      
      If those end up showing up in benchmarks, the proper approach would be to
      use user_access_begin() / unsafe_{get,put}_user() / user_access_end()
      anyway.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: https://lkml.kernel.org/r/20180709195155.7654-3-mathieu.desnoyers@efficios.com
      8f281770
    • M
      rseq: Use __u64 for rseq_cs fields, validate user inputs · e96d7135
      Mathieu Desnoyers 提交于
      Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip
      fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather
      that ignoring the 32 upper bits on 32-bit kernels. This ensures we have a
      consistent behavior for a 32-bit binary executed on 32-bit kernels and in
      compat mode on 64-bit kernels.
      
      Validating the value of abort_ip field to be below TASK_SIZE ensures the
      kernel don't return to an invalid address when returning to userspace
      after an abort. I don't fully trust each architecture code to consistently
      deal with invalid return addresses.
      
      Validating the value of the start_ip and post_commit_offset fields
      prevents overflow on arithmetic performed on those values, used to
      check whether abort_ip is within the rseq critical section.
      
      If validation fails, the process is killed with a segmentation fault.
      
      When the signature encountered before abort_ip does not match the expected
      signature, return -EINVAL rather than -EPERM to be consistent with other
      input validation return codes from rseq_get_rseq_cs().
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com
      e96d7135
    • S
      Revert "tick: Prefer a lower rating device only if it's CPU local device" · 5b5ccbc2
      Sudeep Holla 提交于
      This reverts commit 1332a905.
      
      The original issue was not because of incorrect checking of cpumask for
      both new and old tick device. It was incorrectly analysed was due to the
      misunderstanding of the comment and misinterpretation of the return value
      from tick_check_preferred. The main issue is with the clockevent driver
      that sets the cpumask to cpu_all_mask instead of cpu_possible_mask.
      Signed-off-by: NSudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NKevin Hilman <khilman@baylibre.com>
      Tested-by: NMartin Blumenstingl <martin.blumenstingl@googlemail.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/1531151136-18297-1-git-send-email-sudeep.holla@arm.com
      5b5ccbc2
  9. 08 7月, 2018 6 次提交
    • T
      xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      Toshiaki Makita 提交于
      Otherwise we end up with attempting to send packets from down devices
      or to send oversized packets, which may cause unexpected driver/device
      behaviour. Generic XDP has already done this check, so reuse the logic
      in native XDP.
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d8d7218a
    • J
      bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      John Fastabend 提交于
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in it
      entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      When it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage. Using the bpf_skc_data_end mapping is not correct because it
      expects a qdisc_skb_cb object but at the sock layer this is not the
      case. Even though it happens to work here because we don't overwrite
      any data in-use at the socket layer and the cb structure is cleared
      later this has potential to create some subtle issues. But, even
      more concretely the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offset_of() to track down the byte offset we get 40 in
      either case and everything continues to work. Fix this mess and use
      correct structures its unclear how long this might actually work for
      until someone moves the structs around.
      Reported-by: NMartin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      0ea488ff
    • J
      bpf: sockmap, consume_skb in close path · 7ebc14d5
      John Fastabend 提交于
      Currently, when a sock is closed and the bpf_tcp_close() callback is
      used we remove memory but do not free the skb. Call consume_skb() if
      the skb is attached to the buffer.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 1aa12bdf ("bpf: sockmap, add sock close() hook to remove socks")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      7ebc14d5
    • J
      bpf: sockhash, disallow bpf_tcp_close and update in parallel · 99ba2b5a
      John Fastabend 提交于
      After latest lock updates there is no longer anything preventing a
      close and recvmsg call running in parallel. Additionally, we can
      race update with close if we close a socket and simultaneously update
      if via the BPF userspace API (note the cgroup ops are already run
      with sock_lock held).
      
      To resolve this take sock_lock in close and update paths.
      
      Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      99ba2b5a
    • J
      bpf: sockmap, hash table is RCU so readers do not need locks · 1d1ef005
      John Fastabend 提交于
      This removes locking from readers of RCU hash table. Its not
      necessary.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1d1ef005
    • J
      bpf: sockmap, error path can not release psock in multi-map case · 547b3aa4
      John Fastabend 提交于
      The current code, in the error path of sock_hash_ctx_update_elem,
      checks if the sock has a psock in the user data and if so decrements
      the reference count of the psock. However, if the error happens early
      in the error path we may have never incremented the psock reference
      count and if the psock exists because the sock is in another map then
      we may inadvertently decrement the reference count.
      
      Fix this by making the error path only call smap_release_sock if the
      error happens after the increment.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      547b3aa4