1. 27 December 2019, 29 commits
    • bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id() · eda04c10
      Peng Sun authored
      mainline inclusion
      from mainline-5.0
      commit 781e62823cb81b972dc8652c1827205cda2ac9ac
      category: bugfix
      bugzilla: 11101
      CVE: NA
      
      -------------------------------------------------
      In bpf/syscall.c, bpf_map_get_fd_by_id() uses bpf_map_inc_not_zero()
      to take a reference, increasing both map->refcnt and map->usercnt. If
      bpf_map_new_fd() then fails, the error path must drop map->usercnt too.
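
      A minimal sketch of the fixed error path (assuming the upstream shape of
      bpf_map_get_fd_by_id(); not a verbatim diff):

        fd = bpf_map_new_fd(map, f_flags);
        if (fd < 0)
                /* drop usercnt as well as refcnt, matching the reference
                 * taken by bpf_map_inc_not_zero() above */
                bpf_map_put_with_uref(map);

        return fd;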
      
      Fixes: bd5f5f4e ("bpf: Add BPF_MAP_GET_FD_BY_ID")
      Signed-off-by: Peng Sun <sironhide0null@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      (cherry picked from commit 781e62823cb81b972dc8652c1827205cda2ac9ac)
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      eda04c10
    • bpf: fix sanitation rewrite in case of non-pointers · 223cd1fc
      Daniel Borkmann authored
      commit 3612af783cf52c74a031a2f11b82247b2599d3cd upstream.
      
      Marek reported an issue with the snippet below: timing measurements
      were off when the program was loaded as unprivileged, while the results
      were reasonable when it was loaded as privileged:
      
          [...]
          uint64_t a = bpf_ktime_get_ns();
          uint64_t b = bpf_ktime_get_ns();
          uint64_t delta = b - a;
          if ((int64_t)delta > 0) {
          [...]
      
      It turns out a corner case is missing from the fix in d3bd7413e0ca
      ("bpf: fix sanitation of alu op with pointer / scalar type from
      different paths"): fixup_bpf_calls() only checks whether aux has a
      non-zero alu_state, but it also needs to test for the case of
      BPF_ALU_NON_POINTER, since in both cases the masking rewrite must be
      skipped (there is nothing to mask).
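
      A hedged sketch of the extra condition in fixup_bpf_calls() (surrounding
      context abbreviated, not the exact upstream diff):

        aux = &env->insn_aux_data[i + delta];
        if (!aux->alu_state ||
            aux->alu_state == BPF_ALU_NON_POINTER)
                continue;       /* nothing to mask for this insn */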
      
      Fixes: d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
      Reported-by: Marek Majkowski <marek@cloudflare.com>
      Reported-by: Arthur Fabre <afabre@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      223cd1fc
    • bpf: fix lockdep false positive in stackmap · ec9d64f0
      Alexei Starovoitov authored
      mainline inclusion
      from mainline-5.0-rc8
      commit 3defaf2f
      category: bugfix
      bugzilla: 10760
      CVE: NA
      
      -------------------------------------------------
      
      Lockdep warns about false positive:
      [   11.211460] ------------[ cut here ]------------
      [   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
      [   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
      [   11.213134] Modules linked in:
      [   11.214954] RIP: 0010:lock_release+0x1ad/0x280
      [   11.223508] Call Trace:
      [   11.223705]  <IRQ>
      [   11.223874]  ? __local_bh_enable+0x7a/0x80
      [   11.224199]  up_read+0x1c/0xa0
      [   11.224446]  do_up_read+0x12/0x20
      [   11.224713]  irq_work_run_list+0x43/0x70
      [   11.225030]  irq_work_run+0x26/0x50
      [   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
      [   11.225662]  irq_work_interrupt+0xf/0x20
      
      since the rw_semaphore is released in a different task than the one that
      acquired it. That is expected behavior here.
      Fix the warning with up_read_non_owner() and an rwsem_release() annotation.
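
      Roughly, the fix looks like the following sketch (names from the stackmap
      code; treat it as an illustration rather than the exact diff):

        static void do_up_read(struct irq_work *entry)
        {
                struct stack_map_irq_work *work;

                work = container_of(entry, struct stack_map_irq_work, irq_work);
                /* released from irq_work, i.e. a different task than the
                 * one that took the semaphore */
                up_read_non_owner(work->sem);
                work->sem = NULL;
        }

        /* where the irq_work is queued in the NMI path, tell lockdep the
         * lock is considered released from this task's point of view */
        irq_work_queue(&work->irq_work);
        rwsem_release(&current->mm->mmap_sem.dep_map, 1, _RET_IP_);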
      
      Fixes: bae77c5e ("bpf: enable stackmap with build_id in nmi context")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ec9d64f0
    • bpf, lpm: fix lookup bug in map_delete_elem · 5f549338
      Alban Crequy authored
      mainline inclusion
      from mainline-5.0
      commit 7c0cdf0b3940
      category: bugfix
      bugzilla: 10936
      CVE: NA
      
      -------------------------------------------------
      
      trie_delete_elem() was deleting an entry even when the key did not match,
      as long as the prefixlen was correct. This patch adds a check on matchlen.
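
      A sketch of the tightened condition in trie_delete_elem() (abbreviated;
      field names follow the upstream LPM trie code):

        if (!node || node->prefixlen != key->prefixlen ||
            node->prefixlen != matchlen ||
            (node->flags & LPM_TREE_NODE_FLAG_IM)) {
                ret = -ENOENT;
                goto out;
        }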
      
      Reproducer:
      
      $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
      $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
      $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
      key: 10 00 00 00 aa bb cc dd  value: 01
      Found 1 element
      $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
      $ echo $?
      0
      $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
      Found 0 elements
      
      A similar reproducer is added in the selftests.
      
      Without the patch:
      
      $ sudo ./tools/testing/selftests/bpf/test_lpm_map
      test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
      Aborted
      
      With the patch: test_lpm_map runs without errors.
      
      Fixes: e454cf59 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
      Cc: Craig Gallek <kraig@google.com>
      Signed-off-by: Alban Crequy <alban@kinvolk.io>
      Acked-by: Craig Gallek <kraig@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5f549338
    • bpf, lpm: make longest_prefix_match() faster · 7e35327e
      Eric Dumazet authored
      mainline inclusion
      from mainline-5.0
      commit 7c0cdf0b3940
      category: bugfix
      bugzilla: 10936
      CVE: NA
      
      -------------------------------------------------
      
      At LPC 2018 in Vancouver, Vlad Dumitrescu mentioned that longest_prefix_match()
      has a high cost [1].
      
      One reason for that cost is a loop handling one byte at a time.
      
      We can handle more bytes at a time if enough attention is paid
      to endianness.
      
      I was able to remove ~55% of longest_prefix_match() CPU costs.
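
      The idea, as a standalone hedged sketch (not the kernel code itself,
      which also has 8- and 2-byte steps and a bit-granular tail): compare
      whole big-endian words and use the first differing bit to compute the
      matched length.

        #include <stdint.h>
        #include <string.h>
        #include <arpa/inet.h>

        /* Number of leading bits (up to limit_bits, a multiple of 32 here
         * for brevity) that a and b have in common. */
        static unsigned int prefix_match32(const uint8_t *a, const uint8_t *b,
                                           unsigned int limit_bits)
        {
                unsigned int matched = 0;

                while (matched + 32 <= limit_bits) {
                        uint32_t wa, wb;

                        memcpy(&wa, a + matched / 8, sizeof(wa));
                        memcpy(&wb, b + matched / 8, sizeof(wb));
                        if (wa != wb)
                                /* first differing bit, counted from the MSB */
                                return matched + __builtin_clz(ntohl(wa ^ wb));
                        matched += 32;
                }
                return matched;
        }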
      
      [1] https://linuxplumbersconf.org/event/2/contributions/88/attachments/76/87/lpc-bpf-2018-shaping.pdf
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Dumitrescu <vladum@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7e35327e
    • bpf: zero out build_id for BPF_STACK_BUILD_ID_IP · 1f98ce54
      Stanislav Fomichev authored
      [ Upstream commit 4af396ae4836c4ecab61e975b8e61270c551894d ]
      
      When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset,
      make sure that the build_id field is empty. Since we are using a percpu
      free list, there is a possibility that we might reuse a previous
      bpf_stack_build_id with a non-zero build_id.
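
      The fix is essentially the following (sketch of the BPF_STACK_BUILD_ID_IP
      fallback path in stack_map_get_build_id_offset()):

        id_offs[i].status = BPF_STACK_BUILD_ID_IP;
        /* clear any stale build_id left over from a reused element */
        memset(id_offs[i].build_id, 0, BPF_BUILD_ID_SIZE);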
      
      Fixes: 615755a7 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1f98ce54
    • bpf: don't assume build-id length is always 20 bytes · 49facf53
      Stanislav Fomichev authored
      [ Upstream commit 0b698005a9d11c0e91141ec11a2c4918a129f703 ]
      
      Build-id length is not fixed at 20 bytes; it can be (see `man ld`, --build-id):
        * 128-bit (uuid)
        * 160-bit (sha1)
        * any length specified in ld --build-id=0xhexstring
      
      To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids,
      assume that build-id is somewhere in the range of 1 .. 20.
      Set the remaining bytes to zero.
      
      v2:
      * don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)",
        we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE if we enter
        this 'if' condition
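
      A hedged sketch of the parsing change (note_start and new_offs are
      abbreviated placeholders, not the exact upstream variable names; the
      memset of the tail is the part described above):

        if (nhdr->n_type == BPF_BUILD_ID &&
            nhdr->n_namesz == sizeof("GNU") &&
            nhdr->n_descsz > 0 &&
            nhdr->n_descsz <= BPF_BUILD_ID_SIZE) {
                memcpy(build_id, note_start + new_offs, nhdr->n_descsz);
                /* shorter build-ids: zero the remaining bytes instead of
                 * leaving stale data behind */
                memset(build_id + nhdr->n_descsz, 0,
                       BPF_BUILD_ID_SIZE - nhdr->n_descsz);
                return 0;
        }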
      
      Fixes: 615755a7 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      49facf53
    • bpf: fix panic in stack_map_get_build_id() on i386 and arm32 · 6a263990
      Song Liu authored
      [ Upstream commit beaf3d1901f4ea46fbd5c9d857227d99751de469 ]
      
      As Naresh reported, test_stacktrace_build_id() causes a panic on i386 and
      arm32 systems. This is caused by page_address() returning NULL in certain
      cases.
      
      This patch fixes this error by using kmap_atomic/kunmap_atomic instead
      of page_address.
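
      Conceptually the change is the following sketch (parse_build_id_from_page
      is a hypothetical stand-in for the actual parsing helper):

        void *page_addr = kmap_atomic(page);    /* usable on 32-bit highmem,
                                                 * where page_address() may
                                                 * return NULL */
        ret = parse_build_id_from_page(page_addr, build_id);
        kunmap_atomic(page_addr);
        put_page(page);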
      
      Fixes: 615755a7 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6a263990
    • bpf: fix lockdep false positive in percpu_freelist · f72ddcb4
      Alexei Starovoitov authored
      mainline inclusion
      from mainline-5.0
      commit a89fac57b5d0
      category: bugfix
      bugzilla: 9352
      CVE: NA
      
      -------------------------------------------------
      
      Lockdep warns about false positive:
      [   12.492084] 00000000e6b28347 (&head->lock){+...}, at: pcpu_freelist_push+0x2a/0x40
      [   12.492696] but this lock was taken by another, HARDIRQ-safe lock in the past:
      [   12.493275]  (&rq->lock){-.-.}
      [   12.493276]
      [   12.493276]
      [   12.493276] and interrupts could create inverse lock ordering between them.
      [   12.493276]
      [   12.494435]
      [   12.494435] other info that might help us debug this:
      [   12.494979]  Possible interrupt unsafe locking scenario:
      [   12.494979]
      [   12.495518]        CPU0                    CPU1
      [   12.495879]        ----                    ----
      [   12.496243]   lock(&head->lock);
      [   12.496502]                                local_irq_disable();
      [   12.496969]                                lock(&rq->lock);
      [   12.497431]                                lock(&head->lock);
      [   12.497890]   <Interrupt>
      [   12.498104]     lock(&rq->lock);
      [   12.498368]
      [   12.498368]  *** DEADLOCK ***
      [   12.498368]
      [   12.498837] 1 lock held by dd/276:
      [   12.499110]  #0: 00000000c58cb2ee (rcu_read_lock){....}, at: trace_call_bpf+0x5e/0x240
      [   12.499747]
      [   12.499747] the shortest dependencies between 2nd lock and 1st lock:
      [   12.500389]  -> (&rq->lock){-.-.} {
      [   12.500669]     IN-HARDIRQ-W at:
      [   12.500934]                       _raw_spin_lock+0x2f/0x40
      [   12.501373]                       scheduler_tick+0x4c/0xf0
      [   12.501812]                       update_process_times+0x40/0x50
      [   12.502294]                       tick_periodic+0x27/0xb0
      [   12.502723]                       tick_handle_periodic+0x1f/0x60
      [   12.503203]                       timer_interrupt+0x11/0x20
      [   12.503651]                       __handle_irq_event_percpu+0x43/0x2c0
      [   12.504167]                       handle_irq_event_percpu+0x20/0x50
      [   12.504674]                       handle_irq_event+0x37/0x60
      [   12.505139]                       handle_level_irq+0xa7/0x120
      [   12.505601]                       handle_irq+0xa1/0x150
      [   12.506018]                       do_IRQ+0x77/0x140
      [   12.506411]                       ret_from_intr+0x0/0x1d
      [   12.506834]                       _raw_spin_unlock_irqrestore+0x53/0x60
      [   12.507362]                       __setup_irq+0x481/0x730
      [   12.507789]                       setup_irq+0x49/0x80
      [   12.508195]                       hpet_time_init+0x21/0x32
      [   12.508644]                       x86_late_time_init+0xb/0x16
      [   12.509106]                       start_kernel+0x390/0x42a
      [   12.509554]                       secondary_startup_64+0xa4/0xb0
      [   12.510034]     IN-SOFTIRQ-W at:
      [   12.510305]                       _raw_spin_lock+0x2f/0x40
      [   12.510772]                       try_to_wake_up+0x1c7/0x4e0
      [   12.511220]                       swake_up_locked+0x20/0x40
      [   12.511657]                       swake_up_one+0x1a/0x30
      [   12.512070]                       rcu_process_callbacks+0xc5/0x650
      [   12.512553]                       __do_softirq+0xe6/0x47b
      [   12.512978]                       irq_exit+0xc3/0xd0
      [   12.513372]                       smp_apic_timer_interrupt+0xa9/0x250
      [   12.513876]                       apic_timer_interrupt+0xf/0x20
      [   12.514343]                       default_idle+0x1c/0x170
      [   12.514765]                       do_idle+0x199/0x240
      [   12.515159]                       cpu_startup_entry+0x19/0x20
      [   12.515614]                       start_kernel+0x422/0x42a
      [   12.516045]                       secondary_startup_64+0xa4/0xb0
      [   12.516521]     INITIAL USE at:
      [   12.516774]                      _raw_spin_lock_irqsave+0x38/0x50
      [   12.517258]                      rq_attach_root+0x16/0xd0
      [   12.517685]                      sched_init+0x2f2/0x3eb
      [   12.518096]                      start_kernel+0x1fb/0x42a
      [   12.518525]                      secondary_startup_64+0xa4/0xb0
      [   12.518986]   }
      [   12.519132]   ... key      at: [<ffffffff82b7bc28>] __key.71384+0x0/0x8
      [   12.519649]   ... acquired at:
      [   12.519892]    pcpu_freelist_pop+0x7b/0xd0
      [   12.520221]    bpf_get_stackid+0x1d2/0x4d0
      [   12.520563]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.520887]
      [   12.521008] -> (&head->lock){+...} {
      [   12.521292]    HARDIRQ-ON-W at:
      [   12.521539]                     _raw_spin_lock+0x2f/0x40
      [   12.521950]                     pcpu_freelist_push+0x2a/0x40
      [   12.522396]                     bpf_get_stackid+0x494/0x4d0
      [   12.522828]                     ___bpf_prog_run+0x8b4/0x11a0
      [   12.523296]    INITIAL USE at:
      [   12.523537]                    _raw_spin_lock+0x2f/0x40
      [   12.523944]                    pcpu_freelist_populate+0xc0/0x120
      [   12.524417]                    htab_map_alloc+0x405/0x500
      [   12.524835]                    __do_sys_bpf+0x1a3/0x1a90
      [   12.525253]                    do_syscall_64+0x4a/0x180
      [   12.525659]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   12.526167]  }
      [   12.526311]  ... key      at: [<ffffffff838f7668>] __key.13130+0x0/0x8
      [   12.526812]  ... acquired at:
      [   12.527047]    __lock_acquire+0x521/0x1350
      [   12.527371]    lock_acquire+0x98/0x190
      [   12.527680]    _raw_spin_lock+0x2f/0x40
      [   12.527994]    pcpu_freelist_push+0x2a/0x40
      [   12.528325]    bpf_get_stackid+0x494/0x4d0
      [   12.528645]    ___bpf_prog_run+0x8b4/0x11a0
      [   12.528970]
      [   12.529092]
      [   12.529092] stack backtrace:
      [   12.529444] CPU: 0 PID: 276 Comm: dd Not tainted 5.0.0-rc3-00018-g2fa53f892422 #475
      [   12.530043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   12.530750] Call Trace:
      [   12.530948]  dump_stack+0x5f/0x8b
      [   12.531248]  check_usage_backwards+0x10c/0x120
      [   12.531598]  ? ___bpf_prog_run+0x8b4/0x11a0
      [   12.531935]  ? mark_lock+0x382/0x560
      [   12.532229]  mark_lock+0x382/0x560
      [   12.532496]  ? print_shortest_lock_dependencies+0x180/0x180
      [   12.532928]  __lock_acquire+0x521/0x1350
      [   12.533271]  ? find_get_entry+0x17f/0x2e0
      [   12.533586]  ? find_get_entry+0x19c/0x2e0
      [   12.533902]  ? lock_acquire+0x98/0x190
      [   12.534196]  lock_acquire+0x98/0x190
      [   12.534482]  ? pcpu_freelist_push+0x2a/0x40
      [   12.534810]  _raw_spin_lock+0x2f/0x40
      [   12.535099]  ? pcpu_freelist_push+0x2a/0x40
      [   12.535432]  pcpu_freelist_push+0x2a/0x40
      [   12.535750]  bpf_get_stackid+0x494/0x4d0
      [   12.536062]  ___bpf_prog_run+0x8b4/0x11a0
      
      It has been explained that this is a false positive here:
      https://lkml.org/lkml/2018/7/25/756
      Recap:
      - stackmap uses pcpu_freelist
      - The lock in pcpu_freelist is a percpu lock
      - stackmap is only used by tracing bpf_prog
      - A tracing bpf_prog cannot be run if another bpf_prog
        has already been running (ensured by the percpu bpf_prog_active counter).
      
      Eric pointed out that this lockdep splat stops other
      legit lockdep splats in selftests/bpf/test_progs.c.
      
      Fix this by calling local_irq_save/restore for stackmap.
      
      Another false positive had also been worked around by calling
      local_irq_save in commit 89ad2fa3 ("bpf: fix lockdep splat").
      That commit added unnecessary irq_save/restore to the fast path of
      the bpf hash map: IRQs are already disabled at that point, since htab
      holds the per-bucket spin_lock with irqsave.
      
      Let's reduce the overhead for htab by introducing __pcpu_freelist_push/pop
      functions without irqsave and converting pcpu_freelist_push/pop to irqsave,
      to be used elsewhere (right now only in stackmap).
      This stops the lockdep false positive in stackmap with a bit of acceptable overhead.
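
      A sketch of the resulting split (close to, but not a verbatim copy of,
      the upstream helpers): the raw variant assumes IRQs are already disabled
      by the caller (htab bucket lock), while the public variant wraps it with
      local_irq_save() for stackmap and friends.

        void __pcpu_freelist_push(struct pcpu_freelist *s,
                                  struct pcpu_freelist_node *node)
        {
                struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);

                raw_spin_lock(&head->lock);
                node->next = head->first;
                head->first = node;
                raw_spin_unlock(&head->lock);
        }

        void pcpu_freelist_push(struct pcpu_freelist *s,
                                struct pcpu_freelist_node *node)
        {
                unsigned long flags;

                local_irq_save(flags);
                __pcpu_freelist_push(s, node);
                local_irq_restore(flags);
        }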
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      f72ddcb4
    • bpf: Fix syscall's stackmap lookup potential deadlock · 2b8e56c9
      Martin KaFai Lau authored
      mainline inclusion
      from mainline-5.0
      commit 7c4cd051add3
      category: bugfix
      bugzilla: 9355
      CVE: NA
      
      -------------------------------------------------
      
      The map_lookup_elem used to not acquire a spinlock
      in order to optimize the reader.
      
      It was true until commit 557c0c6e ("bpf: convert stackmap to pre-allocation")
      The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
      bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
      If that is the case, pcpu_freelist_push() saves this elem for reuse later.
      This push requires a spinlock.
      
      If a tracing bpf_prog gets run in the middle of the syscall's
      map_lookup_elem(stackmap) and this tracing bpf_prog calls
      bpf_get_stackid(stackmap), which also requires the same pcpu_freelist's
      spinlock, it may end up in a deadlock situation, as reported by
      Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/
      
      The situation is the same as the syscall's map_update_elem() which
      needs to acquire the pcpu_freelist's spinlock and could race
      with tracing bpf_prog.  Hence, this patch fixes it by protecting
      bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
      to prevent tracing bpf_prog from running.
      
      A later syscall's map_lookup_elem commit f1a2e44a3aec ("bpf: add queue and stack maps")
      also acquires a spinlock and races with tracing bpf_prog similarly.
      Hence, this patch is forward looking and protects the majority
      of the map lookups.  bpf_map_offload_lookup_elem() is the exception
      since it is for network bpf_prog only (i.e. never called by tracing
      bpf_prog).
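
      The protection in the syscall path is, in sketch form (assuming the usual
      bpf_prog_active pattern used by the tracing side):

        /* in the BPF_MAP_LOOKUP_ELEM syscall handler, for stackmaps */
        preempt_disable();
        this_cpu_inc(bpf_prog_active);  /* tracing progs on this CPU now bail out */
        err = bpf_stackmap_copy(map, key, value);
        this_cpu_dec(bpf_prog_active);
        preempt_enable();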
      
      Fixes: 557c0c6e ("bpf: convert stackmap to pre-allocation")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2b8e56c9
    • bpf: error handling when map_lookup_elem isn't supported · d7cb6182
      Prashant Bhole authored
      mainline inclusion
      from mainline-4.20
      commit 509db2833e0d
      category: bugfix
      bugzilla: 9355
      CVE: NA
      
      -------------------------------------------------
      
      The error value returned by map_lookup_elem doesn't differentiate
      whether the lookup failed because of an invalid key or because lookup
      is not supported.
      
      Let's add handling for an -EOPNOTSUPP return value from the map's
      map_lookup_elem() method, with the expectation that a map implementation
      returns -EOPNOTSUPP if lookup is not supported.
      
      The errno of the bpf syscall's BPF_MAP_LOOKUP_ELEM command will then be
      set to EOPNOTSUPP if map lookup is not supported.
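
      In sketch form, the syscall side becomes (abbreviated, not the exact
      upstream diff):

        rcu_read_lock();
        ptr = map->ops->map_lookup_elem(map, key);
        if (IS_ERR(ptr)) {
                err = PTR_ERR(ptr);     /* e.g. -EOPNOTSUPP from the map */
        } else if (!ptr) {
                err = -ENOENT;
        } else {
                err = 0;
                memcpy(value, ptr, value_size);
        }
        rcu_read_unlock();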
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d7cb6182
    • bpf: support raw tracepoints in modules · 00d0f1b1
      Matt Mullins authored
      mainline inclusion
      from mainline-5.0
      commit a38d1107f937
      category: bugfix
      bugzilla: 9347
      CVE: NA
      
      -------------------------------------------------
      
      Distributions build drivers as modules, including network and filesystem
      drivers which export numerous tracepoints.  This enables
      bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      00d0f1b1
    • Remove 'type' argument from access_ok() function · 4983cb67
      Linus Torvalds authored
      mainline inclusion
      from mainline-5.0-rc1
      commit 96d4f267
      category: cleanup
      bugzilla: 9284
      CVE: NA
      
      This is a cleanup patch that prepares for applying the CVE-2018-20669
      patch 594cc251 ("make 'user_access_begin()' do 'access_ok()'").
      
      -------------------------------------------------
      
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
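
      The conversion itself is mechanical, for example:

        /* before */
        if (!access_ok(VERIFY_READ, buf, len))
                return -EFAULT;

        /* after */
        if (!access_ok(buf, len))
                return -EFAULT;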
      
      Conflicts:
      	drivers/media/v4l2-core/v4l2-compat-ioctl32.c
      	drivers/infiniband/core/uverbs_main.c
      	drivers/platform/goldfish/goldfish_pipe.c
      	fs/namespace.c
      	fs/select.c
      	kernel/compat.c
      	arch/powerpc/include/asm/uaccess.h
      	arch/arm64/kernel/perf_callchain.c
      	arch/arm64/include/asm/uaccess.h
      	arch/ia64/kernel/signal.c
      	arch/x86/entry/vsyscall/vsyscall_64.c
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4983cb67
    • bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K · fbbbd7cd
      Daniel Borkmann authored
      mainline inclusion
      from mainline-4.20
      commit fdadd04931c2d7cd294dc5b2b342863f94be53a3
      category: bugfix
      bugzilla: 5995
      CVE: NA
      
      -------------------------------------------------
      
      Michael and Sandipan report:
      
        Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
        JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
        and is adjusted again at init time if MODULES_VADDR is defined.
      
        For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
        the compile-time default at boot-time, which is 0x9c400000 when
        using 64K page size. This overflows the signed 32-bit bpf_jit_limit
        value:
      
        root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
        -1673527296
      
        and can cause various unexpected failures throughout the network
        stack. In one case `strace dhclient eth0` reported:
      
        setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
                   16) = -1 ENOTSUPP (Unknown error 524)
      
        and similar failures can be seen with tools like tcpdump. This doesn't
        always reproduce however, and I'm not sure why. The more consistent
        failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
        host would time out on systemd/netplan configuring a virtio-net NIC
        with no noticeable errors in the logs.
      
      Given this and also given that in near future some architectures like
      arm64 will have a custom area for BPF JIT image allocations we should
      get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
      4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
      so therefore add another overridable bpf_jit_alloc_exec_limit() helper
      function which returns the possible size of the memory area for deriving
      the default heuristic in bpf_jit_charge_init().
      
      Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
      bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
      JIT memory provider, and therefore in case archs implement their custom
      module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
      vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
      
      Additionally, for archs supporting large page sizes, we should change
      the sysctl to be handled as long to not run into sysctl restrictions
      in future.
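
      A sketch of the resulting helpers (close to the upstream shape, shown
      here for illustration only):

        u64 __weak bpf_jit_alloc_exec_limit(void)
        {
        #if defined(MODULES_VADDR)
                return MODULES_END - MODULES_VADDR;
        #else
                return VMALLOC_END - VMALLOC_START;
        #endif
        }

        static int __init bpf_jit_charge_init(void)
        {
                /* Only used as a heuristic here to derive the limit. */
                bpf_jit_limit = min_t(u64, round_up(bpf_jit_alloc_exec_limit() >> 2,
                                                    PAGE_SIZE), LONG_MAX);
                return 0;
        }
        pure_initcall(bpf_jit_charge_init);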
      
      Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
      Reported-by: Sandipan Das <sandipan@linux.ibm.com>
      Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fbbbd7cd
    • bpf: add bpf_jit_limit knob to restrict unpriv allocations · 1bb640c7
      Daniel Borkmann authored
      mainline inclusion
      from mainline-4.20
      commit ede95a63b5e8
      category: bugfix
      bugzilla: 5995
      CVE: NA
      
      -------------------------------------------------
      
      Rick reported that the BPF JIT could potentially fill the entire module
      space with BPF programs from unprivileged users which would prevent later
      attempts to load normal kernel modules or privileged BPF programs, for
      example. If the JIT was enabled but was unsuccessful in generating the
      image, then before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON
      config") we would always fall back to the BPF interpreter. Nowadays,
      when CONFIG_BPF_JIT_ALWAYS_ON is set, the load will instead abort with a
      failure since the BPF interpreter was compiled out.
      
      Add a global limit and enforce it for unprivileged users such that, in
      case the BPF interpreter is compiled out, we fail once the limit has been
      reached, or we fall back to the BPF interpreter earlier without using
      module memory if the latter was compiled in. In a next step, fair sharing
      among unprivileged users can be resolved, in particular for the case
      where we would fail hard once the limit is reached.
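
      The enforcement is roughly the following sketch (unprivileged allocations
      are charged against the limit; privileged ones may exceed it):

        static int bpf_jit_charge_modmem(u32 pages)
        {
                if (atomic_long_add_return(pages, &bpf_jit_current) >
                    (bpf_jit_limit >> PAGE_SHIFT)) {
                        if (!capable(CAP_SYS_ADMIN)) {
                                /* over the limit and unprivileged: back out */
                                atomic_long_sub(pages, &bpf_jit_current);
                                return -EPERM;
                        }
                }
                return 0;
        }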
      
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1bb640c7
    • bpf: fix inner map masking to prevent oob under speculation · fb1032f6
      Daniel Borkmann authored
      [ commit 9d5564ddcf2a0f5ba3fa1c3a1f8a1b59ad309553 upstream ]
      
      During review I noticed that the inner meta map setup for map in
      map is buggy in that it does not propagate all needed data
      from the reference map which the verifier later accesses.
      
      In particular, one such case is index masking to prevent out of
      bounds access under speculative execution, due to the missing
      propagation of the map's unpriv_array/index_mask fields. Fix this such
      that the verifier generates the correct code for inlined
      lookups in case of unprivileged use.
      
      Before patch (test_verifier's 'map in map access' dump):
      
        # bpftool prog dump xla id 3
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:4]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking for
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+11
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     | Index masking missing (!)
          22: (35) if r0 >= 0x1 goto pc+3   | for inner map despite
          23: (67) r0 <<= 3                 | map->unpriv_array set.
          24: (0f) r0 += r1                 |
          25: (05) goto pc+1                |
          26: (b7) r0 = 0                   |
          27: (b7) r0 = 0
          28: (95) exit
      
      After patch:
      
        # bpftool prog dump xla id 1
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:2]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Same inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking due to
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+12
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     |
          22: (35) if r0 >= 0x1 goto pc+4   | Now fixed inlined inner map
          23: (54) (u32) r0 &= (u32) 0      | lookup with proper index masking
          24: (67) r0 <<= 3                 | for map->unpriv_array.
          25: (0f) r0 += r1                 |
          26: (05) goto pc+1                |
          27: (b7) r0 = 0                   |
          28: (b7) r0 = 0
          29: (95) exit
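
      The missing propagation is, in sketch form, of this kind (in
      bpf_map_meta_alloc(); abbreviated, not a verbatim diff):

        inner_map_meta->ops = inner_map->ops;
        if (inner_map->ops == &array_map_ops) {
                /* carry over what the verifier needs for index masking */
                inner_map_meta->unpriv_array = inner_map->unpriv_array;
                container_of(inner_map_meta, struct bpf_array, map)->index_mask =
                        container_of(inner_map, struct bpf_array, map)->index_mask;
        }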
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fb1032f6
    • bpf: fix sanitation of alu op with pointer / scalar type from different paths · 52cf1114
      Daniel Borkmann authored
      [ commit d3bd7413e0ca40b60cf60d4003246d067cafdeda upstream ]
      
      While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer
      arithmetic") took care of rejecting alu op on pointer when e.g. pointer
      came from two different map values with different map properties such as
      value size, Jann reported that a case was not covered yet when a given
      alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
      different branches where we would incorrectly try to sanitize based
      on the pointer's limit. Catch this corner case and reject the program
      instead.
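
      The rejection boils down to a per-instruction consistency check of the
      recorded sanitation state, roughly:

        static int update_alu_sanitation_state(struct bpf_insn_aux_data *aux,
                                               u32 alu_state, u32 alu_limit)
        {
                /* If we arrived here from different branches with different
                 * state or limits to sanitize, then this won't work.
                 */
                if (aux->alu_state &&
                    (aux->alu_state != alu_state ||
                     aux->alu_limit != alu_limit))
                        return -EACCES;

                /* Corresponding fixup done in fixup_bpf_calls(). */
                aux->alu_state = alu_state;
                aux->alu_limit = alu_limit;
                return 0;
        }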
      
      Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      52cf1114
    • bpf: prevent out of bounds speculation on pointer arithmetic · cedaa820
      Daniel Borkmann authored
      [ commit 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 upstream ]
      
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
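
      The limit derivation described above corresponds to a helper of roughly
      this shape (a sketch; field names follow the verifier's bpf_reg_state):

        static int retrieve_ptr_limit(const struct bpf_reg_state *ptr_reg,
                                      u32 *ptr_limit, u8 opcode, bool off_is_neg)
        {
                bool mask_to_left = (opcode == BPF_ADD &&  off_is_neg) ||
                                    (opcode == BPF_SUB && !off_is_neg);
                u32 off;

                switch (ptr_reg->type) {
                case PTR_TO_STACK:
                        off = ptr_reg->off + ptr_reg->var_off.value;
                        if (mask_to_left)
                                *ptr_limit = MAX_BPF_STACK + off;
                        else
                                *ptr_limit = -off;
                        return 0;
                case PTR_TO_MAP_VALUE:
                        if (mask_to_left) {
                                *ptr_limit = ptr_reg->umax_value + ptr_reg->off;
                        } else {
                                off = ptr_reg->smin_value + ptr_reg->off;
                                *ptr_limit = ptr_reg->map_ptr->value_size - off;
                        }
                        return 0;
                default:
                        return -EINVAL;
                }
        }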
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that test_verifier
      and test_progs suite with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 runs successfully.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cedaa820
    • bpf: fix check_map_access smin_value test when pointer contains offset · fdc6205e
      Daniel Borkmann authored
      [ commit b7137c4eab85c1cf3d46acdde90ce1163b28c873 upstream ]
      
      In check_map_access() we probe actual bounds through __check_map_access()
      with offset of reg->smin_value + off for lower bound and offset of
      reg->umax_value + off for the upper bound. However, even though the
      reg->smin_value could have a negative value, the final result of the
      sum with off could be positive when pointer arithmetic with known and
      unknown scalars is combined. In this case we reject the program with
      an error such as "R<x> min value is negative, either use unsigned index
      or do a if (index >=0) check." even though the access itself would be
      fine. Therefore extend the check to probe whether the actual resulting
      reg->smin_value + off is less than zero.
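
      In sketch form, the check in check_map_access() becomes:

        /* reject only when the resulting lower bound is itself negative */
        if (reg->smin_value + off < 0) {
                verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
                        regno);
                return -EACCES;
        }
        err = __check_map_access(env, regno, reg->smin_value + off, size,
                                 zero_size_allowed);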
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fdc6205e
    • bpf: restrict unknown scalars of mixed signed bounds for unprivileged · 85cb4258
      Daniel Borkmann authored
      [ commit 9d7eceede769f90b66cfa06ad5b357140d5141ed upstream ]
      
      For unknown scalars of mixed signed bounds, meaning their smin_value is
      negative and their smax_value is positive, we need to reject arithmetic
      with pointer to map value. For unprivileged the goal is to mask every
      map pointer arithmetic and this cannot reliably be done when it is
      unknown at verification time whether the scalar value is negative or
      positive. Given this is a corner case, the likelihood of breaking should
      be very small.
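
      Illustratively, the rule amounts to the following check (not the exact
      verifier diff; off_reg and dst are assumed names for the scalar operand
      and the destination register index):

        if (!env->allow_ptr_leaks &&
            off_reg->smin_value < 0 && off_reg->smax_value > 0) {
                verbose(env, "R%d has unknown scalar with mixed signed bounds\n",
                        dst);
                return -EACCES;
        }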
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      85cb4258
    • bpf: restrict stack pointer arithmetic for unprivileged · a05f6d49
      Daniel Borkmann authored
      [ commit e4298d25830a866cc0f427d4bccb858e76715859 upstream ]
      
      Restrict stack pointer arithmetic for unprivileged users in that
      arithmetic itself must not go out of bounds as opposed to the actual
      access later on. Therefore after each adjust_ptr_min_max_vals() with
      a stack pointer as a destination we simulate a check_stack_access()
      of 1 byte on the destination and once that fails the program is
      rejected for unprivileged program loads. This is analog to map
      value pointer arithmetic and needed for masking later on.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a05f6d49
    • bpf: restrict map value pointer arithmetic for unprivileged · 5e71d26c
      Daniel Borkmann authored
      [ commit 0d6303db7970e6f56ae700fa07e11eb510cda125 upstream ]
      
      Restrict map value pointer arithmetic for unprivileged users in that
      arithmetic itself must not go out of bounds as opposed to the actual
      access later on. Therefore after each adjust_ptr_min_max_vals() with a
      map value pointer as a destination it will simulate a check_map_access()
      of 1 byte on the destination and once that fails the program is rejected
      for unprivileged program loads. We use this later on for masking any
      pointer arithmetic with the remainder of the map value space. The
      likelihood of breaking any existing real-world unprivileged eBPF
      program is very small for this corner case.
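
      The mechanism is a simulated access right after the arithmetic, roughly
      (a sketch, with helper names as used elsewhere in the verifier; the stack
      pointer variant in the previous commit follows the same pattern):

        /* after adjust_ptr_min_max_vals() computed the new dst_reg */
        if (!env->allow_ptr_leaks &&
            dst_reg->type == PTR_TO_MAP_VALUE &&
            check_map_access(env, dst, dst_reg->off, 1, false)) {
                verbose(env, "R%d pointer arithmetic of map value goes out of range, prohibited for !root\n",
                        dst);
                return -EACCES;
        }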
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5e71d26c
    • D
      bpf: enable access to ax register also from verifier rewrite · 2facace8
      Committed by Daniel Borkmann
      [ commit 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 upstream ]
      
      Right now we are using the BPF ax register in the JIT for constant
      blinding as well as in the interpreter as a temporary variable. The
      verifier would not be able to use it simply because its use would get
      overridden by the former in bpf_jit_blind_insn(). However, it can be
      made to work in that blinding is skipped if there is a prior use of ax
      in either the source or destination register of the instruction. Taking
      the constraints of ax into account, the verifier is then free to use it
      in rewrites under some constraints. Note, the ax register already has
      mappings in every eBPF JIT.
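
      A minimal sketch of the blinding skip (hedged: modeled on
      bpf_jit_blind_insn(), where from points at the instruction being
      blinded; not guaranteed verbatim):

        /* insn already uses ax (e.g. a verifier rewrite): nothing to blind */
        if (from->dst_reg == BPF_REG_AX || from->src_reg == BPF_REG_AX)
                goto out;
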
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      2facace8
    • D
      bpf: move tmp variable into ax register in interpreter · 648f6859
      Committed by Daniel Borkmann
      [ commit 144cd91c4c2bced6eb8a7e25e590f6618a11e854 upstream ]
      
      This change moves the on-stack 64-bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use it
      for the cases where tmp has been used earlier. This is needed to later
      allow restricted hidden use of ax in both the interpreter and the JITs.
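
      A minimal sketch of the substitution (hedged: modeled on the
      ___bpf_prog_run() opcode handlers, shown for one 64-bit modulo case):

        #define AX      regs[BPF_REG_AX]        /* hidden scratch register */

        ALU64_MOD_X:
                div64_u64_rem(DST, SRC, &AX);    /* AX replaces on-stack tmp */
                DST = AX;
                CONT;
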
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      648f6859
    • D
      bpf: move {prev_,}insn_idx into verifier env · f38a7d20
      Committed by Daniel Borkmann
      [ commit c08435ec7f2bc8f4109401f696fd55159b4b40cb upstream ]
      
      Move prev_insn_idx and insn_idx from the do_check() function into the
      verifier environment, so they can be read inside the various helper
      functions for handling the instructions. It's easier to put these into
      the environment than to change all call-sites just to pass them along.
      insn_idx is useful in particular since this later allows holding state
      in env->insn_aux_data[env->insn_idx].
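
      In sketch form (hedged: field placement and surrounding members are
      assumptions):

        struct bpf_verifier_env {
                /* ... existing members ... */
                u32 insn_idx;
                u32 prev_insn_idx;
        };

        /* helpers handling one instruction can then do, for example: */
        struct bpf_insn_aux_data *aux = &env->insn_aux_data[env->insn_idx];
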
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f38a7d20
    • A
      bpf: add per-insn complexity limit · 6f89dd32
      Committed by Alexei Starovoitov
      [ commit ceefbc96fa5c5b975d87bf8e89ba8416f6b764d9 upstream ]
      
      A malicious bpf program may try to force the verifier to remember
      a lot of distinct verifier states.
      Put a limit on the number of per-insn 'struct bpf_verifier_state'.
      Note that hitting the limit doesn't reject the program.
      It potentially makes the verifier do more steps to analyze the program.
      It means that malicious programs will hit BPF_COMPLEXITY_LIMIT_INSNS
      sooner instead of spending cpu time walking a long linked list.
      
      The limit of BPF_COMPLEXITY_LIMIT_STATES==64 affects cilium progs with a
      slight increase in the number of "steps" it takes to successfully verify
      the programs:
                             before    after
      bpf_lb-DLB_L3.o         1940      1940
      bpf_lb-DLB_L4.o         3089      3089
      bpf_lb-DUNKNOWN.o       1065      1065
      bpf_lxc-DDROP_ALL.o     28052  |  28162
      bpf_lxc-DUNKNOWN.o      35487  |  35541
      bpf_netdev.o            10864     10864
      bpf_overlay.o           6643      6643
      bpf_lcx_jit.o           38437     38437
      
      But it also makes a malicious program be rejected in 0.4 seconds vs 6.5.
      Hence, apply this limit to unprivileged programs only.
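
      A minimal sketch of the cap (hedged: modeled on is_state_visited(), with
      states_cnt as the number of states already queued at this insn; the
      exact condition is an assumption, not the literal patch):

        #define BPF_COMPLEXITY_LIMIT_STATES     64

        /* unprivileged only: stop caching states here, keep on verifying */
        if (!env->allow_ptr_leaks && states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
                return 0;
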
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      6f89dd32
    • A
      bpf: improve verifier branch analysis · 027fc602
      Committed by Alexei Starovoitov
      [ commit 4f7b3e82589e0de723780198ec7983e427144c0a upstream ]
      
      Pathological bpf programs may try to force the verifier to explode in
      the number of branch states:
        20: (d5) if r1 s<= 0x24000028 goto pc+0
        21: (b5) if r0 <= 0xe1fa20 goto pc+2
        22: (d5) if r1 s<= 0x7e goto pc+0
        23: (b5) if r0 <= 0xe880e000 goto pc+0
        24: (c5) if r0 s< 0x2100ecf4 goto pc+0
        25: (d5) if r1 s<= 0xe880e000 goto pc+1
        26: (c5) if r0 s< 0xf4041810 goto pc+0
        27: (d5) if r1 s<= 0x1e007e goto pc+0
        28: (b5) if r0 <= 0xe86be000 goto pc+0
        29: (07) r0 += 16614
        30: (c5) if r0 s< 0x6d0020da goto pc+0
        31: (35) if r0 >= 0x2100ecf4 goto pc+0
      
      Teach the verifier to recognize always-taken and never-taken branches.
      This analysis is already done for == and != comparisons. Expand it to
      all other branches; a sketch follows the table below.
      
      It also helps real bpf programs to be verified faster:
                             before  after
      bpf_lb-DLB_L3.o         2003    1940
      bpf_lb-DLB_L4.o         3173    3089
      bpf_lb-DUNKNOWN.o       1080    1065
      bpf_lxc-DDROP_ALL.o     29584   28052
      bpf_lxc-DUNKNOWN.o      36916   35487
      bpf_netdev.o            11188   10864
      bpf_overlay.o           6679    6643
      bpf_lcx_jit.o           39555   38437
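
      A sketch of the idea (hedged: a trimmed is_branch_taken()-style helper
      showing only two branch kinds, using the verifier's bounds fields):

        /* 1: always taken, 0: never taken, -1: unknown at verification time */
        static int branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode)
        {
                switch (opcode) {
                case BPF_JGT:                           /* unsigned > */
                        if (reg->umin_value > val)
                                return 1;
                        if (reg->umax_value <= val)
                                return 0;
                        break;
                case BPF_JSGT:                          /* signed > */
                        if (reg->smin_value > (s64)val)
                                return 1;
                        if (reg->smax_value <= (s64)val)
                                return 0;
                        break;
                }
                return -1;
        }
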
      Reported-by: NAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      027fc602
    • J
      bpf: relax verifier restriction on BPF_MOV | BPF_ALU · 51c713ba
      Committed by Jiong Wang
      [ Upstream commit e434b8cdf788568ba65a0a0fd9f3cb41f3ca1803 ]
      
      Currently, the destination register is marked as unknown for a 32-bit
      sub-register move (BPF_MOV | BPF_ALU) whenever the source register type
      is SCALAR_VALUE.
      
      This is too conservative, in that some valid cases will be rejected.
      In particular, this may turn a constant scalar value into an unknown
      value, which could break some assumptions of the verifier.
      
      For example, test_l4lb_noinline.c has the following C code:
      
          struct real_definition *dst
      
      1:  if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
      2:    return TC_ACT_SHOT;
      3:
      4:  if (dst->flags & F_IPV6) {
      
      get_packet_dst is responsible for initializing "dst" into a valid
      pointer and returning true (1), otherwise returning false (0). The
      compiled instruction sequence using alu32 will be:
      
        412: (54) (u32) r7 &= (u32) 1
        413: (bc) (u32) r0 = (u32) r7
        414: (95) exit
      
      Insn 413, a BPF_MOV | BPF_ALU, however, will turn r0 into an unknown
      value even though r7 contains SCALAR_VALUE 1.
      
      This causes trouble when the verifier is walking the code path that
      hasn't initialized "dst" inside get_packet_dst, in which case 0 is
      returned, and we would then expect the verifier to conclude that line 1
      in the above C code passes the "if" check and therefore skip the
      fall-through path starting at line 4. Now, because r0 returned from the
      callee has become an unknown value, the verifier won't skip analyzing
      the path starting at line 4, and "dst->flags" requires dereferencing the
      pointer "dst", which actually hasn't been initialized for this path.
      
      This patch relaxes the code marking the sub-register move destination.
      For a SCALAR_VALUE, it is safe to just copy the value from the source
      and then truncate it to 32 bits, as sketched below.
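
      A sketch of the relaxed handling (hedged: modeled on check_alu_op(),
      names and structure not verbatim):

        if (BPF_CLASS(insn->code) == BPF_ALU64) {
                *dst_reg = *src_reg;             /* R1 = R2: full 64-bit copy */
        } else if (src_reg->type == SCALAR_VALUE) {
                *dst_reg = *src_reg;             /* keep the known scalar ... */
                coerce_reg_to_size(dst_reg, 4);  /* ... and truncate to 32 bit */
        } else {
                mark_reg_unknown(env, regs, insn->dst_reg);
                coerce_reg_to_size(dst_reg, 4);
        }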
      
      A unit test is also included to demonstrate this issue. The test fails
      before this patch.
      
      This relaxation could let the verifier skip more paths for conditional
      comparisons against an immediate. It also lets the verifier record a
      more accurate/strict value for a register at a given state; if that
      state ends up going through exit without rejection and is used for
      state comparison later, it is possible that an inaccurate/permissive
      value would have been better. So the real impact on the number of insns
      the verifier processes is complex. But all in all, without this fix,
      valid programs could be rejected.
      
      From real benchmarking on kernel selftests and Cilium bpf tests, there
      is no impact on the processed instruction number when tests are compiled
      with the default compilation options. There are slight improvements when
      they are compiled with -mattr=+alu32 after this patch.
      
      Also, test_xdp_noinline/-mattr=+alu32 now passes verification. It was
      rejected before this fix.
      
      Insn processed before/after this patch:
      
                              default     -mattr=+alu32
      
      Kernel selftest
      
      ===
      test_xdp.o              371/371      369/369
      test_l4lb.o             6345/6345    5623/5623
      test_xdp_noinline.o     2971/2971    rejected/2727
      test_tcp_estates.o      429/429      430/430
      
      Cilium bpf
      ===
      bpf_lb-DLB_L3.o:        2085/2085     1685/1687
      bpf_lb-DLB_L4.o:        2287/2287     1986/1982
      bpf_lb-DUNKNOWN.o:      690/690       622/622
      bpf_lxc.o:              95033/95033   N/A
      bpf_netdev.o:           7245/7245     N/A
      bpf_overlay.o:          2898/2898     3085/2947
      
      NOTE:
        - bpf_lxc.o and bpf_netdev.o compiled with -mattr=+alu32 are rejected
          by the verifier due to another issue inside the verifier with
          supporting alu32 binaries.
        - Each cilium bpf program can produce several processed insn numbers;
          the number above is the sum of them.
      
      v1->v2:
       - Restrict the change on SCALAR_VALUE.
       - Update benchmark numbers on Cilium bpf tests.
      Signed-off-by: NJiong Wang <jiong.wang@netronome.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      51c713ba
    • A
      bpf: Allow narrow loads with offset > 0 · f7180374
      Committed by Andrey Ignatov
      [ Upstream commit 46f53a65d2de3e1591636c22b626b09d8684fd71 ]
      
      Currently BPF verifier allows narrow loads for a context field only with
      offset zero. E.g. if there is a __u32 field then only the following
      loads are permitted:
        * off=0, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=0, size=4 (full).
      
      On the other hand, LLVM can generate a load with an offset different
      from zero that makes sense from a program-logic point of view, but the
      verifier doesn't accept it.
      
      E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:
      
        #define DST_IP4			0xC0A801FEU /* 192.168.1.254 */
        ...
        	if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
      
      where ctx is struct bpf_sock_addr.
      
      Some versions of LLVM can produce the following byte code for it:
      
             8:       71 12 07 00 00 00 00 00         r2 = *(u8 *)(r1 + 7)
             9:       67 02 00 00 18 00 00 00         r2 <<= 24
            10:       18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00         r3 = 4261412864 ll
            12:       5d 32 07 00 00 00 00 00         if r2 != r3 goto +7 <LBB0_6>
      
      where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1
      and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
      rejected by verifier.
      
      The verifier code that rejects such loads is in
      bpf_ctx_narrow_access_ok(), which means any is_valid_access
      implementation that uses the function works this way, e.g.
      bpf_skb_is_valid_access() for __sk_buff or sock_addr_is_valid_access()
      for bpf_sock_addr.
      
      The patch adds support for such loads. The offset can be in
      [0; size_default) but has to be a multiple of the load size (a sketch of
      the rule follows the list below). E.g. for a __u32 field the following
      loads are now supported:
        * off=0, size=1 (narrow);
        * off=1, size=1 (narrow);
        * off=2, size=1 (narrow);
        * off=3, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=2, size=2 (narrow);
        * off=0, size=4 (full).
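
      A sketch of the relaxed rule as stated above (hedged: a standalone
      helper illustrating the check, not the kernel's
      bpf_ctx_narrow_access_ok() verbatim):

        static bool narrow_load_ok(u32 off, u32 size, u32 size_default)
        {
                u32 off_in_field = off % size_default;  /* offset inside field */

                return size <= size_default &&
                       off_in_field % size == 0 &&      /* multiple of size */
                       off_in_field + size <= size_default;
        }
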
      Reported-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f7180374
  2. 21 Dec 2018, 1 commit
  3. 17 Dec 2018, 2 commits
    • E
      bpf: fix off-by-one error in adjust_subprog_starts · bddeb449
      Committed by Edward Cree
      commit afd594240806acc138cf696c09f2f4829d55d02f upstream.
      
      When patching in a new sequence for the first insn of a subprog, the
      start of that subprog does not change (it's the first insn of the
      sequence), so adjust_subprog_starts should check start <= off (rather
      than < off). A test was also added to test_verifier.c (it's essentially
      the syz reproducer).
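
      A hedged reconstruction of the fixed helper (modeled on the verifier's
      adjust_subprog_starts(); not guaranteed verbatim):

        static void adjust_subprog_starts(struct bpf_verifier_env *env,
                                          u32 off, u32 len)
        {
                int i;

                if (len == 1)
                        return;
                for (i = 0; i <= env->subprog_cnt; i++) {
                        /* was '< off', which wrongly shifted a subprog whose
                         * first insn is the one being patched
                         */
                        if (env->subprog_info[i].start <= off)
                                continue;
                        env->subprog_info[i].start += len - 1;
                }
        }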
      
      Fixes: cc8b0b92 ("bpf: introduce function calls (function boundaries)")
      Reported-by: syzbot+4fc427c7af994b0948be@syzkaller.appspotmail.com
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bddeb449
    • R
      bpf: allocate local storage buffers using GFP_ATOMIC · 5689666a
      Committed by Roman Gushchin
      [ Upstream commit 569a933b03f3c48b392fe67c0086b3a6b9306b5a ]
      
      Naresh reported an issue with the non-atomic memory allocation of
      cgroup local storage buffers:
      
      [   73.047526] BUG: sleeping function called from invalid context at
      /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421
      [   73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name: test_cgroup_sto
      [   73.068342] INFO: lockdep is turned off.
      [   73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted
      4.20.0-rc2-next-20181113 #1
      [   73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
      2.0b 07/27/2017
      [   73.088018] Call Trace:
      [   73.090463]  dump_stack+0x70/0xa5
      [   73.093783]  ___might_sleep+0x152/0x240
      [   73.097619]  __might_sleep+0x4a/0x80
      [   73.101191]  __kmalloc_node+0x1cf/0x2f0
      [   73.105031]  ? cgroup_storage_update_elem+0x46/0x90
      [   73.109909]  cgroup_storage_update_elem+0x46/0x90
      
      cgroup_storage_update_elem() (as well as other map update callbacks) is
      called with preemption disabled, so a GFP_ATOMIC allocation should be
      used, as e.g. alloc_htab_elem() in hashtab.c already does.
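
      A minimal sketch of the fix (hedged: modeled on
      cgroup_storage_update_elem(), exact flag combination assumed):

        new = kmalloc_node(sizeof(struct bpf_storage_buffer) + map->value_size,
                           __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN,
                           map->numa_node);
        if (!new)
                return -ENOMEM;
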
      Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
      Tested-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5689666a
  4. 27 Nov 2018, 1 commit
  5. 14 Nov 2018, 3 commits
  6. 11 Oct 2018, 1 commit
  7. 06 Oct 2018, 1 commit
    • J
      bpf: 32-bit RSH verification must truncate input before the ALU op · b799207e
      Committed by Jann Horn
      When I wrote commit 468f6eaf ("bpf: fix 32-bit ALU op verification"), I
      assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it
      is sufficient to just truncate the output to 32 bits; and so I just moved
      the register size coercion that used to be at the start of the function to
      the end of the function.
      
      That assumption is true for almost every op, but not for 32-bit right
      shifts, because those can propagate information towards the least
      significant bit. Fix it by always truncating inputs for 32-bit ops to 32
      bits.
      
      Also get rid of the coerce_reg_to_size() after the ALU op, since that has
      no effect.
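
      A minimal sketch of the fix (hedged: modeled on the verifier's scalar
      ALU path, using coerce_reg_to_size() as named above; not verbatim):

        if (BPF_CLASS(insn->code) != BPF_ALU64) {
                /* 32-bit ALU ops are (32,32)->32: truncate inputs up front */
                coerce_reg_to_size(dst_reg, 4);
                coerce_reg_to_size(&src_reg, 4);
        }
        /* ... then simulate the op on the 32-bit-bounded values ... */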
      
      Fixes: 468f6eaf ("bpf: fix 32-bit ALU op verification")
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b799207e
  8. 02 Oct 2018, 1 commit
  9. 28 Sep 2018, 1 commit