1. 07 2月, 2018 1 次提交
  2. 03 2月, 2018 3 次提交
    • R
      Revert "defer call to mem_cgroup_sk_alloc()" · edbe69ef
      Roman Gushchin 提交于
      This patch effectively reverts commit 9f1c2674 ("net: memcontrol:
      defer call to mem_cgroup_sk_alloc()").
      
      Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
      memcg socket memory accounting, as packets received before memcg
      pointer initialization are not accounted and are causing refcounting
      underflow on socket release.
      
      Actually the free-after-use problem was fixed by
      commit c0576e39 ("net: call cgroup_sk_alloc() earlier in
      sk_clone_lock()") for the cgroup pointer.
      
      So, let's revert it and call mem_cgroup_sk_alloc() just before
      cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
      we're cloning, and it holds a reference to the memcg.
      
      Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
      mem_cgroup_sk_alloc(). I see no reasons why bumping the root
      memcg counter is a good reason to panic, and there are no realistic
      ways to hit it.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      edbe69ef
    • E
      soreuseport: fix mem leak in reuseport_add_sock() · 4db428a7
      Eric Dumazet 提交于
      reuseport_add_sock() needs to deal with attaching a socket having
      its own sk_reuseport_cb, after a prior
      setsockopt(SO_ATTACH_REUSEPORT_?BPF)
      
      Without this fix, not only a WARN_ONCE() was issued, but we were also
      leaking memory.
      
      Thanks to sysbot and Eric Biggers for providing us nice C repros.
      
      ------------[ cut here ]------------
      socket already in reuseport group
      WARNING: CPU: 0 PID: 3496 at net/core/sock_reuseport.c:119  
      reuseport_add_sock+0x742/0x9b0 net/core/sock_reuseport.c:117
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 0 PID: 3496 Comm: syzkaller869503 Not tainted 4.15.0-rc6+ #245
      Hardware name: Google Google Compute Engine/Google Compute Engine,
      BIOS  
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:17 [inline]
        dump_stack+0x194/0x257 lib/dump_stack.c:53
        panic+0x1e4/0x41c kernel/panic.c:183
        __warn+0x1dc/0x200 kernel/panic.c:547
        report_bug+0x211/0x2d0 lib/bug.c:184
        fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
        fixup_bug arch/x86/kernel/traps.c:247 [inline]
        do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
        do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
        invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
      
      Fixes: ef456144 ("soreuseport: define reuseport groups")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: syzbot+c0ea2226f77a42936bf7@syzkaller.appspotmail.com
      Acked-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4db428a7
    • P
      cls_u32: add missing RCU annotation. · 058a6c03
      Paolo Abeni 提交于
      In a couple of points of the control path, n->ht_down is currently
      accessed without the required RCU annotation. The accesses are
      safe, but sparse complaints. Since we already held the
      rtnl lock, let use rtnl_dereference().
      
      Fixes: a1b7c5fd ("net: sched: add cls_u32 offload hooks for netdevs")
      Fixes: de5df632 ("net: sched: cls_u32 changes to knode must appear atomic to readers")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      058a6c03
  3. 02 2月, 2018 1 次提交
    • E
      net: igmp: add a missing rcu locking section · e7aadb27
      Eric Dumazet 提交于
      Newly added igmpv3_get_srcaddr() needs to be called under rcu lock.
      
      Timer callbacks do not ensure this locking.
      
      =============================
      WARNING: suspicious RCU usage
      4.15.0+ #200 Not tainted
      -----------------------------
      ./include/linux/inetdevice.h:216 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by syzkaller616973/4074:
       #0:  (&mm->mmap_sem){++++}, at: [<00000000bfce669e>] __do_page_fault+0x32d/0xc90 arch/x86/mm/fault.c:1355
       #1:  ((&im->timer)){+.-.}, at: [<00000000619d2f71>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
       #1:  ((&im->timer)){+.-.}, at: [<00000000619d2f71>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1316
       #2:  (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
       #2:  (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] igmpv3_send_report+0x98/0x5b0 net/ipv4/igmp.c:600
      
      stack backtrace:
      CPU: 0 PID: 4074 Comm: syzkaller616973 Not tainted 4.15.0+ #200
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4592
       __in_dev_get_rcu include/linux/inetdevice.h:216 [inline]
       igmpv3_get_srcaddr net/ipv4/igmp.c:329 [inline]
       igmpv3_newpack+0xeef/0x12e0 net/ipv4/igmp.c:389
       add_grhead.isra.27+0x235/0x300 net/ipv4/igmp.c:432
       add_grec+0xbd3/0x1170 net/ipv4/igmp.c:565
       igmpv3_send_report+0xd5/0x5b0 net/ipv4/igmp.c:605
       igmp_send_report+0xc43/0x1050 net/ipv4/igmp.c:722
       igmp_timer_expire+0x322/0x5c0 net/ipv4/igmp.c:831
       call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
       expire_timers kernel/time/timer.c:1363 [inline]
       __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
       run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
       __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
       invoke_softirq kernel/softirq.c:365 [inline]
       irq_exit+0x1cc/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:541 [inline]
       smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
       apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:938
      
      Fixes: a46182b0 ("net: igmp: Use correct source address on IGMPv3 reports")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7aadb27
  4. 01 2月, 2018 6 次提交
  5. 31 1月, 2018 11 次提交
    • P
      netfilter: on sockopt() acquire sock lock only in the required scope · 3f34cfae
      Paolo Abeni 提交于
      Syzbot reported several deadlocks in the netfilter area caused by
      rtnl lock and socket lock being acquired with a different order on
      different code paths, leading to backtraces like the following one:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.15.0-rc9+ #212 Not tainted
      ------------------------------------------------------
      syzkaller041579/3682 is trying to acquire lock:
        (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>] lock_sock
      include/net/sock.h:1463 [inline]
        (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>]
      do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
      
      but task is already holding lock:
        (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (rtnl_mutex){+.+.}:
              __mutex_lock_common kernel/locking/mutex.c:756 [inline]
              __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
              mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
              rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
              register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607
              tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106
              xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
              check_target net/ipv6/netfilter/ip6_tables.c:538 [inline]
              find_check_entry.isra.7+0x935/0xcf0
      net/ipv6/netfilter/ip6_tables.c:580
              translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749
              do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline]
              do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691
              nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
              nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
              ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
              udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
              sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
              SYSC_setsockopt net/socket.c:1849 [inline]
              SyS_setsockopt+0x189/0x360 net/socket.c:1828
              entry_SYSCALL_64_fastpath+0x29/0xa0
      
      -> #0 (sk_lock-AF_INET6){+.+.}:
              lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
              lock_sock_nested+0xc2/0x110 net/core/sock.c:2780
              lock_sock include/net/sock.h:1463 [inline]
              do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
              ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
              udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
              sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
              SYSC_setsockopt net/socket.c:1849 [inline]
              SyS_setsockopt+0x189/0x360 net/socket.c:1828
              entry_SYSCALL_64_fastpath+0x29/0xa0
      
      other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(rtnl_mutex);
                                      lock(sk_lock-AF_INET6);
                                      lock(rtnl_mutex);
         lock(sk_lock-AF_INET6);
      
        *** DEADLOCK ***
      
      1 lock held by syzkaller041579/3682:
        #0:  (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20
      net/core/rtnetlink.c:74
      
      The problem, as Florian noted, is that nf_setsockopt() is always
      called with the socket held, even if the lock itself is required only
      for very tight scopes and only for some operation.
      
      This patch addresses the issues moving the lock_sock() call only
      where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt()
      does not need anymore to acquire both locks.
      
      Fixes: 22265a5c ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
      Reported-by: syzbot+a4c2dc980ac1af699b36@syzkaller.appspotmail.com
      Suggested-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3f34cfae
    • V
      tls: Add support for encryption using async offload accelerator · a54667f6
      Vakul Garg 提交于
      Async crypto accelerators (e.g. drivers/crypto/caam) support offloading
      GCM operation. If they are enabled, crypto_aead_encrypt() return error
      code -EINPROGRESS. In this case tls_do_encryption() needs to wait on a
      completion till the time the response for crypto offload request is
      received.
      Signed-off-by: NVakul Garg <vakul.garg@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a54667f6
    • N
      ip6mr: fix stale iterator · 4adfa79f
      Nikolay Aleksandrov 提交于
      When we dump the ip6mr mfc entries via proc, we initialize an iterator
      with the table to dump but we don't clear the cache pointer which might
      be initialized from a prior read on the same descriptor that ended. This
      can result in lock imbalance (an unnecessary unlock) leading to other
      crashes and hangs. Clear the cache pointer like ipmr does to fix the issue.
      Thanks for the reliable reproducer.
      
      Here's syzbot's trace:
       WARNING: bad unlock balance detected!
       4.15.0-rc3+ #128 Not tainted
       syzkaller971460/3195 is trying to release lock (mrt_lock) at:
       [<000000006898068d>] ipmr_mfc_seq_stop+0xe1/0x130 net/ipv6/ip6mr.c:553
       but there are no more locks to release!
      
       other info that might help us debug this:
       1 lock held by syzkaller971460/3195:
        #0:  (&p->lock){+.+.}, at: [<00000000744a6565>] seq_read+0xd5/0x13d0
       fs/seq_file.c:165
      
       stack backtrace:
       CPU: 1 PID: 3195 Comm: syzkaller971460 Not tainted 4.15.0-rc3+ #128
       Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
       Google 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:17 [inline]
        dump_stack+0x194/0x257 lib/dump_stack.c:53
        print_unlock_imbalance_bug+0x12f/0x140 kernel/locking/lockdep.c:3561
        __lock_release kernel/locking/lockdep.c:3775 [inline]
        lock_release+0x5f9/0xda0 kernel/locking/lockdep.c:4023
        __raw_read_unlock include/linux/rwlock_api_smp.h:225 [inline]
        _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
        ipmr_mfc_seq_stop+0xe1/0x130 net/ipv6/ip6mr.c:553
        traverse+0x3bc/0xa00 fs/seq_file.c:135
        seq_read+0x96a/0x13d0 fs/seq_file.c:189
        proc_reg_read+0xef/0x170 fs/proc/inode.c:217
        do_loop_readv_writev fs/read_write.c:673 [inline]
        do_iter_read+0x3db/0x5b0 fs/read_write.c:897
        compat_readv+0x1bf/0x270 fs/read_write.c:1140
        do_compat_preadv64+0xdc/0x100 fs/read_write.c:1189
        C_SYSC_preadv fs/read_write.c:1209 [inline]
        compat_SyS_preadv+0x3b/0x50 fs/read_write.c:1203
        do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
        do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
        entry_SYSENTER_compat+0x51/0x60 arch/x86/entry/entry_64_compat.S:125
       RIP: 0023:0xf7f73c79
       RSP: 002b:00000000e574a15c EFLAGS: 00000292 ORIG_RAX: 000000000000014d
       RAX: ffffffffffffffda RBX: 000000000000000f RCX: 0000000020a3afb0
       RDX: 0000000000000001 RSI: 0000000000000067 RDI: 0000000000000000
       RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       BUG: sleeping function called from invalid context at lib/usercopy.c:25
       in_atomic(): 1, irqs_disabled(): 0, pid: 3195, name: syzkaller971460
       INFO: lockdep is turned off.
       CPU: 1 PID: 3195 Comm: syzkaller971460 Not tainted 4.15.0-rc3+ #128
       Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
       Google 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:17 [inline]
        dump_stack+0x194/0x257 lib/dump_stack.c:53
        ___might_sleep+0x2b2/0x470 kernel/sched/core.c:6060
        __might_sleep+0x95/0x190 kernel/sched/core.c:6013
        __might_fault+0xab/0x1d0 mm/memory.c:4525
        _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
        copy_to_user include/linux/uaccess.h:155 [inline]
        seq_read+0xcb4/0x13d0 fs/seq_file.c:279
        proc_reg_read+0xef/0x170 fs/proc/inode.c:217
        do_loop_readv_writev fs/read_write.c:673 [inline]
        do_iter_read+0x3db/0x5b0 fs/read_write.c:897
        compat_readv+0x1bf/0x270 fs/read_write.c:1140
        do_compat_preadv64+0xdc/0x100 fs/read_write.c:1189
        C_SYSC_preadv fs/read_write.c:1209 [inline]
        compat_SyS_preadv+0x3b/0x50 fs/read_write.c:1203
        do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
        do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
        entry_SYSENTER_compat+0x51/0x60 arch/x86/entry/entry_64_compat.S:125
       RIP: 0023:0xf7f73c79
       RSP: 002b:00000000e574a15c EFLAGS: 00000292 ORIG_RAX: 000000000000014d
       RAX: ffffffffffffffda RBX: 000000000000000f RCX: 0000000020a3afb0
       RDX: 0000000000000001 RSI: 0000000000000067 RDI: 0000000000000000
       RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       WARNING: CPU: 1 PID: 3195 at lib/usercopy.c:26 _copy_to_user+0xb5/0xc0
       lib/usercopy.c:26
      Reported-by: Nsyzbot <bot+eceb3204562c41a438fa1f2335e0fe4f6886d669@syzkaller.appspotmail.com>
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4adfa79f
    • U
      net/sched: kconfig: Remove blank help texts · 11eab148
      Ulf Magnusson 提交于
      Blank help texts are probably either a typo, a Kconfig misunderstanding,
      or some kind of half-committing to adding a help text (in which case a
      TODO comment would be clearer, if the help text really can't be added
      right away).
      
      Best to remove them, IMO.
      Signed-off-by: NUlf Magnusson <ulfalizer@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11eab148
    • G
      openvswitch: meter: Use 64-bit arithmetic instead of 32-bit · 5b7789e8
      Gustavo A. R. Silva 提交于
      Add suffix LL to constant 1000 in order to give the compiler
      complete information about the proper arithmetic to use. Notice
      that this constant is used in a context that expects an expression
      of type long long int (64 bits, signed).
      
      The expression (band->burst_size + band->rate) * 1000 is currently
      being evaluated using 32-bit arithmetic.
      
      Addresses-Coverity-ID: 1461563 ("Unintentional integer overflow")
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b7789e8
    • G
      tcp_nv: fix potential integer overflow in tcpnv_acked · e4823fbd
      Gustavo A. R. Silva 提交于
      Add suffix ULL to constant 80000 in order to avoid a potential integer
      overflow and give the compiler complete information about the proper
      arithmetic to use. Notice that this constant is used in a context that
      expects an expression of type u64.
      
      The current cast to u64 effectively applies to the whole expression
      as an argument of type u64 to be passed to div64_u64, but it does
      not prevent it from being evaluated using 32-bit arithmetic instead
      of 64-bit arithmetic.
      
      Also, once the expression is properly evaluated using 64-bit arithmentic,
      there is no need for the parentheses and the external cast to u64.
      
      Addresses-Coverity-ID: 1357588 ("Unintentional integer overflow")
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4823fbd
    • C
      rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK · 5bb8ed07
      Christian Brauner 提交于
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_NEWLINK supports the
        IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
        with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
        does not include IFLA_IF_NETNSID userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property userspace
        can send an RTM_NEWLINK with a IFLA_IF_NETNSID property. If they receive
        EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
        with RTM_NEWLINK. Userpace should then fallback to other means.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5bb8ed07
    • D
      netfilter: ipt_CLUSTERIP: fix out-of-bounds accesses in clusterip_tg_check() · 1a38956c
      Dmitry Vyukov 提交于
      Commit 136e92bb switched local_nodes from an array to a bitmask
      but did not add proper bounds checks. As the result
      clusterip_config_init_nodelist() can both over-read
      ipt_clusterip_tgt_info.local_nodes and over-write
      clusterip_config.local_nodes.
      
      Add bounds checks for both.
      
      Fixes: 136e92bb ("[NETFILTER] CLUSTERIP: use a bitmap to store node responsibility data")
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      1a38956c
    • D
      netfilter: x_tables: fix pointer leaks to userspace · 1e98ffea
      Dmitry Vyukov 提交于
      Several netfilter matches and targets put kernel pointers into
      info objects, but don't set usersize in descriptors.
      This leads to kernel pointer leaks if a match/target is set
      and then read back to userspace.
      
      Properly set usersize for these matches/targets.
      
      Found with manual code inspection.
      
      Fixes: ec231890 ("xtables: extend matches and targets with .usersize")
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      1e98ffea
    • J
      netfilter: ipset: Fix wraparound in hash:*net* types · 0b8d9073
      Jozsef Kadlecsik 提交于
      Fix wraparound bug which could lead to memory exhaustion when adding an
      x.x.x.x-255.255.255.255 range to any hash:*net* types.
      
      Fixes Netfilter's bugzilla id #1212, reported by Thomas Schwark.
      
      Fixes: 48596a8d ("netfilter: ipset: Fix adding an IPv4 range containing more than 2^31 addresses")
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      0b8d9073
    • D
      nl80211: Sanitize array index in parse_txq_params · 259d8c1e
      Dan Williams 提交于
      Wireless drivers rely on parse_txq_params to validate that txq_params->ac
      is less than NL80211_NUM_ACS by the time the low-level driver's ->conf_tx()
      handler is called. Use a new helper, array_index_nospec(), to sanitize
      txq_params->ac with respect to speculation. I.e. ensure that any
      speculation into ->conf_tx() handlers is done with a value of
      txq_params->ac that is within the bounds of [0, NL80211_NUM_ACS).
      Reported-by: NChristian Lamparter <chunkeey@gmail.com>
      Reported-by: NElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NJohannes Berg <johannes@sipsolutions.net>
      Cc: linux-arch@vger.kernel.org
      Cc: kernel-hardening@lists.openwall.com
      Cc: gregkh@linuxfoundation.org
      Cc: linux-wireless@vger.kernel.org
      Cc: torvalds@linux-foundation.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: alan@linux.intel.com
      Link: https://lkml.kernel.org/r/151727419584.33451.7700736761686184303.stgit@dwillia2-desk3.amr.corp.intel.com
      259d8c1e
  6. 30 1月, 2018 16 次提交
    • J
      ipmr: Fix ptrdiff_t print formatting · 91e6dd82
      James Hogan 提交于
      ipmr_vif_seq_show() prints the difference between two pointers with the
      format string %2zd (z for size_t), however the correct format string is
      %2td instead (t for ptrdiff_t).
      
      The same bug in ip6mr_vif_seq_show() was already fixed long ago by
      commit d430a227 ("bogus format in ip6mr").
      Signed-off-by: NJames Hogan <jhogan@kernel.org>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: netdev@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91e6dd82
    • L
      tcp: release sk_frag.page in tcp_disconnect · 9b42d55a
      Li RongQing 提交于
      socket can be disconnected and gets transformed back to a listening
      socket, if sk_frag.page is not released, which will be cloned into
      a new socket by sk_clone_lock, but the reference count of this page
      is increased, lead to a use after free or double free issue
      Signed-off-by: NLi RongQing <lirongqing@baidu.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b42d55a
    • T
      ipv4: Get the address of interface correctly. · 30e948a3
      Tonghao Zhang 提交于
      When using ioctl to get address of interface, we can't
      get it anymore. For example, the command is show as below.
      
      	# ifconfig eth0
      
      In the patch ("03aef17b"), the devinet_ioctl does not
      return a suitable value, even though we can find it in
      the kernel. Then fix it now.
      
      Fixes: 03aef17b ("devinet_ioctl(): take copyin/copyout to caller")
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30e948a3
    • E
      net_sched: gen_estimator: fix lockdep splat · 40ca54e3
      Eric Dumazet 提交于
      syzbot reported a lockdep splat in gen_new_estimator() /
      est_fetch_counters() when attempting to lock est->stats_lock.
      
      Since est_fetch_counters() is called from BH context from timer
      interrupt, we need to block BH as well when calling it from process
      context.
      
      Most qdiscs use per cpu counters and are immune to the problem,
      but net/sched/act_api.c and net/netfilter/xt_RATEEST.c are using
      a spinlock to protect their data. They both call gen_new_estimator()
      while object is created and not yet alive, so this bug could
      not trigger a deadlock, only a lockdep splat.
      
      Fixes: 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40ca54e3
    • E
      ipv6: addrconf: break critical section in addrconf_verify_rtnl() · e64e469b
      Eric Dumazet 提交于
      Heiner reported a lockdep splat [1]
      
      This is caused by attempting GFP_KERNEL allocation while RCU lock is
      held and BH blocked.
      
      We believe that addrconf_verify_rtnl() could run for a long period,
      so instead of using GFP_ATOMIC here as Ido suggested, we should break
      the critical section and restart it after the allocation.
      
      [1]
      [86220.125562] =============================
      [86220.125586] WARNING: suspicious RCU usage
      [86220.125612] 4.15.0-rc7-next-20180110+ #7 Not tainted
      [86220.125641] -----------------------------
      [86220.125666] kernel/sched/core.c:6026 Illegal context switch in RCU-bh read-side critical section!
      [86220.125711]
                     other info that might help us debug this:
      
      [86220.125755]
                     rcu_scheduler_active = 2, debug_locks = 1
      [86220.125792] 4 locks held by kworker/0:2/1003:
      [86220.125817]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.125895]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.125959]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
      [86220.126017]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
      [86220.126111]
                     stack backtrace:
      [86220.126142] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
      [86220.126185] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
      [86220.126250] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
      [86220.126288] Call Trace:
      [86220.126312]  dump_stack+0x70/0x9e
      [86220.126337]  lockdep_rcu_suspicious+0xce/0xf0
      [86220.126365]  ___might_sleep+0x1d3/0x240
      [86220.126390]  __might_sleep+0x45/0x80
      [86220.126416]  kmem_cache_alloc_trace+0x53/0x250
      [86220.126458]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.126498]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.126538]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.126580]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.126623]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.126664]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.126708]  addrconf_verify_work+0xe/0x20 [ipv6]
      [86220.126738]  process_one_work+0x258/0x680
      [86220.126765]  worker_thread+0x35/0x3f0
      [86220.126790]  kthread+0x124/0x140
      [86220.126813]  ? process_one_work+0x680/0x680
      [86220.126839]  ? kthread_create_worker_on_cpu+0x40/0x40
      [86220.126869]  ? umh_complete+0x40/0x40
      [86220.126893]  ? call_usermodehelper_exec_async+0x12a/0x160
      [86220.126926]  ret_from_fork+0x4b/0x60
      [86220.126999] BUG: sleeping function called from invalid context at mm/slab.h:420
      [86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: kworker/0:2
      [86220.127082] 4 locks held by kworker/0:2/1003:
      [86220.127107]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.127179]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.127242]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
      [86220.127300]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
      [86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
      [86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
      [86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
      [86220.127568] Call Trace:
      [86220.127591]  dump_stack+0x70/0x9e
      [86220.127616]  ___might_sleep+0x14d/0x240
      [86220.127644]  __might_sleep+0x45/0x80
      [86220.127672]  kmem_cache_alloc_trace+0x53/0x250
      [86220.127717]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.127762]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.127807]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.127854]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.127903]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.127950]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.127998]  addrconf_verify_work+0xe/0x20 [ipv6]
      [86220.128032]  process_one_work+0x258/0x680
      [86220.128063]  worker_thread+0x35/0x3f0
      [86220.128091]  kthread+0x124/0x140
      [86220.128117]  ? process_one_work+0x680/0x680
      [86220.128146]  ? kthread_create_worker_on_cpu+0x40/0x40
      [86220.128180]  ? umh_complete+0x40/0x40
      [86220.128207]  ? call_usermodehelper_exec_async+0x12a/0x160
      [86220.128243]  ret_from_fork+0x4b/0x60
      
      Fixes: f3d9832e ("ipv6: addrconf: cleanup locking in ipv6_add_addr")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e64e469b
    • W
      ipv6: change route cache aging logic · 31afeb42
      Wei Wang 提交于
      In current route cache aging logic, if a route has both RTF_EXPIRE and
      RTF_GATEWAY set, the route will only be removed if the neighbor cache
      has no NTF_ROUTER flag. Otherwise, even if the route has expired, it
      won't get deleted.
      Fix this logic to always check if the route has expired first and then
      do the gateway neighbor cache check if previous check decide to not
      remove the exception entry.
      
      Fixes: 1859bac0 ("ipv6: remove from fib tree aged out RTF_CACHE dst")
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31afeb42
    • D
      net: ipv6: send unsolicited NA after DAD · c76fe2d9
      David Ahern 提交于
      Unsolicited IPv6 neighbor advertisements should be sent after DAD
      completes. Update ndisc_send_unsol_na to skip tentative, non-optimistic
      addresses and have those sent by addrconf_dad_completed after DAD.
      
      Fixes: 4a6e3c5d ("net: ipv6: send unsolicited NA on admin up")
      Reported-by: NVivek Venkatraman <vivek@cumulusnetworks.com>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c76fe2d9
    • C
      net_sched: implement ->change_tx_queue_len() for pfifo_fast · 7007ba63
      Cong Wang 提交于
      pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
      so we have to resize skb array when we change tx_queue_len.
      
      Other qdiscs which read tx_queue_len are fine because they
      all save it to sch->limit or somewhere else in qdisc during init.
      They don't have to implement this, it is nicer if they do so
      that users don't have to re-configure qdisc after changing
      tx_queue_len.
      
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7007ba63
    • C
      net_sched: plug in qdisc ops change_tx_queue_len · 48bfd55e
      Cong Wang 提交于
      Introduce a new qdisc ops ->change_tx_queue_len() so that
      each qdisc could decide how to implement this if it wants.
      Previously we simply read dev->tx_queue_len, after pfifo_fast
      switches to skb array, we need this API to resize the skb array
      when we change dev->tx_queue_len.
      
      To avoid handling race conditions with TX BH, we need to
      deactivate all TX queues before change the value and bring them
      back after we are done, this also makes implementation easier.
      
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48bfd55e
    • C
      net: introduce helper dev_change_tx_queue_len() · 6a643ddb
      Cong Wang 提交于
      This patch promotes the local change_tx_queue_len() to a core
      helper function, dev_change_tx_queue_len(), so that rtnetlink
      and net-sysfs could share the code. This also prepares for the
      following patch.
      
      Note, the -EFAULT in the original code doesn't make sense,
      we should propagate the errno from notifiers.
      
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a643ddb
    • N
      dev: advertise the new ifindex when the netns iface changes · 38e01b30
      Nicolas Dichtel 提交于
      The goal is to let the user follow an interface that moves to another
      netns.
      
      CC: Jiri Benc <jbenc@redhat.com>
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38e01b30
    • N
      dev: always advertise the new nsid when the netns iface changes · c36ac8e2
      Nicolas Dichtel 提交于
      The user should be able to follow any interface that moves to another
      netns.  There is no reason to hide physical interfaces.
      
      CC: Jiri Benc <jbenc@redhat.com>
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c36ac8e2
    • M
      ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only · 7ece54a6
      Martin KaFai Lau 提交于
      If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it
      implicitly implies it is an ipv6only socket.  However, in inet6_bind(),
      this addr_type checking and setting sk->sk_ipv6only to 1 are only done
      after sk->sk_prot->get_port(sk, snum) has been completed successfully.
      
      This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses
      the 'get_port()'.
      
      In particular, when binding SO_REUSEPORT UDP sockets,
      udp_reuseport_add_sock(sk,...) is called.  udp_reuseport_add_sock()
      checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to
      sk2->sk_reuseport_cb.  In this case, ipv6_only_sock(sk2) could be
      1 while ipv6_only_sock(sk) is still 0 here.  The end result is,
      reuseport_alloc(sk) is called instead of adding sk to the existing
      sk2->sk_reuseport_cb.
      
      It can be reproduced by binding two SO_REUSEPORT UDP sockets on an
      IPv6 address (!ANY and !MAPPED).  Only one of the socket will
      receive packet.
      
      The fix is to set the implicit sk_ipv6only before calling get_port().
      The original sk_ipv6only has to be saved such that it can be restored
      in case get_port() failed.  The situation is similar to the
      inet_reset_saddr(sk) after get_port() has failed.
      
      Thanks to Calvin Owens <calvinowens@fb.com> who created an easy
      reproduction which leads to a fix.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ece54a6
    • C
      rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK · b61ad68a
      Christian Brauner 提交于
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_DELLINK supports the
        IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
        with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
        does not include IFLA_IF_NETNSID userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property userspace
        can send an RTM_DELLINK with a IFLA_IF_NETNSID property. If they receive
        EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
        with RTM_DELLINK. Userpace should then fallback to other means.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b61ad68a
    • C
      rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK · c310bfcb
      Christian Brauner 提交于
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_SETLINK supports the
        IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
        with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
        does not include IFLA_IF_NETNSID userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property userspace
        can send an RTM_SETLINK with a IFLA_IF_NETNSID property. If they receive
        EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
        with RTM_SETLINK. Userpace should then fallback to other means.
      
        To retain backwards compatibility the kernel will first check whether a
        IFLA_NET_NS_PID or IFLA_NET_NS_FD property has been passed. If either
        one is found it will be used to identify the target network namespace.
        This implies that users who do not care whether their running kernel
        supports IFLA_IF_NETNSID with RTM_SETLINK can pass both
        IFLA_NET_NS_{FD,PID} and IFLA_IF_NETNSID referring to the same network
        namespace.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c310bfcb
    • C
      rtnetlink: enable IFLA_IF_NETNSID in do_setlink() · 7c4f63ba
      Christian Brauner 提交于
      RTM_{NEW,SET}LINK already allow operations on other network namespaces
      by identifying the target network namespace through IFLA_NET_NS_{FD,PID}
      properties. This is done by looking for the corresponding properties in
      do_setlink(). Extend do_setlink() to also look for the IFLA_IF_NETNSID
      property. This introduces no functional changes since all callers of
      do_setlink() currently block IFLA_IF_NETNSID by reporting an error before
      they reach do_setlink().
      
      This introduces the helpers:
      
      static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net, struct
                                                     nlattr *tb[])
      
      static struct net *rtnl_link_get_net_capable(const struct sk_buff *skb,
                                                   struct net *src_net,
      					     struct nlattr *tb[], int cap)
      
      to simplify permission checks and target network namespace retrieval for
      RTM_* requests that already support IFLA_NET_NS_{FD,PID} but get extended
      to IFLA_IF_NETNSID. To perserve backwards compatibility the helpers look
      for IFLA_NET_NS_{FD,PID} properties first before checking for
      IFLA_IF_NETNSID.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c4f63ba
  7. 27 1月, 2018 2 次提交
    • D
      bpf: fix subprog verifier bypass by div/mod by 0 exception · f6b1b3bf
      Daniel Borkmann 提交于
      One of the ugly leftovers from the early eBPF days is that div/mod
      operations based on registers have a hard-coded src_reg == 0 test
      in the interpreter as well as in JIT code generators that would
      return from the BPF program with exit code 0. This was basically
      adopted from cBPF interpreter for historical reasons.
      
      There are multiple reasons why this is very suboptimal and prone
      to bugs. To name one: the return code mapping for such abnormal
      program exit of 0 does not always match with a suitable program
      type's exit code mapping. For example, '0' in tc means action 'ok'
      where the packet gets passed further up the stack, which is just
      undesirable for such cases (e.g. when implementing policy) and
      also does not match with other program types.
      
      While trying to work out an exception handling scheme, I also
      noticed that programs crafted like the following will currently
      pass the verifier:
      
        0: (bf) r6 = r1
        1: (85) call pc+8
        caller:
         R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        callee:
         frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
        10: (b4) (u32) r2 = (u32) 0
        11: (b4) (u32) r3 = (u32) 1
        12: (3c) (u32) r3 /= (u32) r2
        13: (61) r0 = *(u32 *)(r1 +76)
        14: (95) exit
        returning from callee:
         frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
                 R1=ctx(id=0,off=0,imm=0) R2_w=inv0
                 R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
                 R10=fp0,call_1
        to caller at 2:
         R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
      
        from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
                      R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        2: (bf) r1 = r6
        3: (61) r1 = *(u32 *)(r1 +80)
        4: (bf) r2 = r0
        5: (07) r2 += 8
        6: (2d) if r2 > r1 goto pc+1
         R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
         R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
        7: (71) r0 = *(u8 *)(r0 +0)
        8: (b7) r0 = 1
        9: (95) exit
      
        from 6 to 8: safe
        processed 16 insns (limit 131072), stack depth 0+0
      
      Basically what happens is that in the subprog we make use of a
      div/mod by 0 exception and in the 'normal' subprog's exit path
      we just return skb->data back to the main prog. This has the
      implication that the verifier thinks we always get a pkt pointer
      in R0 while we still have the implicit 'return 0' from the div
      as an alternative unconditional return path earlier. Thus, R0
      then contains 0, meaning back in the parent prog we get the
      address range of [0x0, skb->data_end] as read and writeable.
      Similar can be crafted with other pointer register types.
      
      Since i) BPF_ABS/IND is not allowed in programs that contain
      BPF to BPF calls (and generally it's also disadvised to use in
      native eBPF context), ii) unknown opcodes don't return zero
      anymore, iii) we don't return an exception code in dead branches,
      the only last missing case affected and to fix is the div/mod
      handling.
      
      What we would really need is some infrastructure to propagate
      exceptions all the way to the original prog unwinding the
      current stack and returning that code to the caller of the
      BPF program. In user space such exception handling for similar
      runtimes is typically implemented with setjmp(3) and longjmp(3)
      as one possibility which is not available in the kernel,
      though (kgdb used to implement it in kernel long time ago). I
      implemented a PoC exception handling mechanism into the BPF
      interpreter with porting setjmp()/longjmp() into x86_64 and
      adding a new internal BPF_ABRT opcode that can use a program
      specific exception code for all exception cases we have (e.g.
      div/mod by 0, unknown opcodes, etc). While this seems to work
      in the constrained BPF environment (meaning, here, we don't
      need to deal with state e.g. from memory allocations that we
      would need to undo before going into exception state), it still
      has various drawbacks: i) we would need to implement the
      setjmp()/longjmp() for every arch supported in the kernel and
      for x86_64, arm64, sparc64 JITs currently supporting calls,
      ii) it has unconditional additional cost on main program
      entry to store CPU register state in initial setjmp() call,
      and we would need some way to pass the jmp_buf down into
      ___bpf_prog_run() for main prog and all subprogs, but also
      storing on stack is not really nice (other option would be
      per-cpu storage for this, but it also has the drawback that
      we need to disable preemption for every BPF program types).
      All in all this approach would add a lot of complexity.
      
      Another poor-man's solution would be to have some sort of
      additional shared register or scratch buffer to hold state
      for exceptions, and test that after every call return to
      chain returns and pass R0 all the way down to BPF prog caller.
      This is also problematic in various ways: i) an additional
      register doesn't map well into JITs, and some other scratch
      space could only be on per-cpu storage, which, again has the
      side-effect that this only works when we disable preemption,
      or somewhere in the input context which is not available
      everywhere either, and ii) this adds significant runtime
      overhead by putting conditionals after each and every call,
      as well as implementation complexity.
      
      Yet another option is to teach verifier that div/mod can
      return an integer, which however is also complex to implement
      as verifier would need to walk such fake 'mov r0,<code>; exit;'
      sequeuence and there would still be no guarantee for having
      propagation of this further down to the BPF caller as proper
      exception code. For parent prog, it is also is not distinguishable
      from a normal return of a constant scalar value.
      
      The approach taken here is a completely different one with
      little complexity and no additional overhead involved in
      that we make use of the fact that a div/mod by 0 is undefined
      behavior. Instead of bailing out, we adapt the same behavior
      as on some major archs like ARMv8 [0] into eBPF as well:
      X div 0 results in 0, and X mod 0 results in X. aarch64 and
      aarch32 ISA do not generate any traps or otherwise aborts
      of program execution for unsigned divides. I verified this
      also with a test program compiled by gcc and clang, and the
      behavior matches with the spec. Going forward we adapt the
      eBPF verifier to emit such rewrites once div/mod by register
      was seen. cBPF is not touched and will keep existing 'return 0'
      semantics. Given the options, it seems the most suitable from
      all of them, also since major archs have similar schemes in
      place. Given this is all in the realm of undefined behavior,
      we still have the option to adapt if deemed necessary and
      this way we would also have the option of more flexibility
      from LLVM code generation side (which is then fully visible
      to verifier). Thus, this patch i) fixes the panic seen in
      above program and ii) doesn't bypass the verifier observations.
      
        [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
            http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
            1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
               "A division by zero results in a zero being written to
                the destination register, without any indication that
                the division by zero occurred."
            2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
               "For the SDIV and UDIV instructions, division by zero
                always returns a zero result."
      
      Fixes: f4d7e40a ("bpf: introduce function calls (verification)")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f6b1b3bf
    • D
      bpf: xor of a/x in cbpf can be done in 32 bit alu · 1d621674
      Daniel Borkmann 提交于
      Very minor optimization; saves 1 byte per program in x86_64
      JIT in cBPF prologue.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1d621674