1. 07 5月, 2018 1 次提交
  2. 20 4月, 2018 1 次提交
  3. 18 4月, 2018 1 次提交
    • T
      vlan: Fix reading memory beyond skb->tail in skb_vlan_tagged_multi · 7ce23672
      Toshiaki Makita 提交于
      Syzkaller spotted an old bug which leads to reading skb beyond tail by 4
      bytes on vlan tagged packets.
      This is caused because skb_vlan_tagged_multi() did not check
      skb_headlen.
      
      BUG: KMSAN: uninit-value in eth_type_vlan include/linux/if_vlan.h:283 [inline]
      BUG: KMSAN: uninit-value in skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
      BUG: KMSAN: uninit-value in vlan_features_check include/linux/if_vlan.h:672 [inline]
      BUG: KMSAN: uninit-value in dflt_features_check net/core/dev.c:2949 [inline]
      BUG: KMSAN: uninit-value in netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
      CPU: 1 PID: 3582 Comm: syzkaller435149 Not tainted 4.16.0+ #82
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:17 [inline]
        dump_stack+0x185/0x1d0 lib/dump_stack.c:53
        kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
        __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
        eth_type_vlan include/linux/if_vlan.h:283 [inline]
        skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
        vlan_features_check include/linux/if_vlan.h:672 [inline]
        dflt_features_check net/core/dev.c:2949 [inline]
        netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
        validate_xmit_skb+0x89/0x1320 net/core/dev.c:3084
        __dev_queue_xmit+0x1cb2/0x2b60 net/core/dev.c:3549
        dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
        packet_snd net/packet/af_packet.c:2944 [inline]
        packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
        sock_sendmsg_nosec net/socket.c:630 [inline]
        sock_sendmsg net/socket.c:640 [inline]
        sock_write_iter+0x3b9/0x470 net/socket.c:909
        do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
        do_iter_write+0x30d/0xd40 fs/read_write.c:932
        vfs_writev fs/read_write.c:977 [inline]
        do_writev+0x3c9/0x830 fs/read_write.c:1012
        SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
        SyS_writev+0x56/0x80 fs/read_write.c:1082
        do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
        entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x43ffa9
      RSP: 002b:00007fff2cff3948 EFLAGS: 00000217 ORIG_RAX: 0000000000000014
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043ffa9
      RDX: 0000000000000001 RSI: 0000000020000080 RDI: 0000000000000003
      RBP: 00000000006cb018 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004018d0
      R13: 0000000000401960 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
        kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
        kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
        kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
        kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
        slab_post_alloc_hook mm/slab.h:445 [inline]
        slab_alloc_node mm/slub.c:2737 [inline]
        __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
        __kmalloc_reserve net/core/skbuff.c:138 [inline]
        __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
        alloc_skb include/linux/skbuff.h:984 [inline]
        alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
        sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
        packet_alloc_skb net/packet/af_packet.c:2803 [inline]
        packet_snd net/packet/af_packet.c:2894 [inline]
        packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
        sock_sendmsg_nosec net/socket.c:630 [inline]
        sock_sendmsg net/socket.c:640 [inline]
        sock_write_iter+0x3b9/0x470 net/socket.c:909
        do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
        do_iter_write+0x30d/0xd40 fs/read_write.c:932
        vfs_writev fs/read_write.c:977 [inline]
        do_writev+0x3c9/0x830 fs/read_write.c:1012
        SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
        SyS_writev+0x56/0x80 fs/read_write.c:1082
        do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
        entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 58e998c6 ("offloading: Force software GSO for multiple vlan tags.")
      Reported-and-tested-by: syzbot+0bbe42c764feafa82c5a@syzkaller.appspotmail.com
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ce23672
  4. 13 4月, 2018 2 次提交
    • W
      net: fix deadlock while clearing neighbor proxy table · 53b76cdf
      Wolfgang Bumiller 提交于
      When coming from ndisc_netdev_event() in net/ipv6/ndisc.c,
      neigh_ifdown() is called with &nd_tbl, locking this while
      clearing the proxy neighbor entries when eg. deleting an
      interface. Calling the table's pndisc_destructor() with the
      lock still held, however, can cause a deadlock: When a
      multicast listener is available an IGMP packet of type
      ICMPV6_MGM_REDUCTION may be sent out. When reaching
      ip6_finish_output2(), if no neighbor entry for the target
      address is found, __neigh_create() is called with &nd_tbl,
      which it'll want to lock.
      
      Move the elements into their own list, then unlock the table
      and perform the destruction.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199289
      Fixes: 6fd6ce20 ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
      Signed-off-by: NWolfgang Bumiller <w.bumiller@proxmox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53b76cdf
    • E
      net: validate attribute sizes in neigh_dump_table() · 7dd07c14
      Eric Dumazet 提交于
      Since neigh_dump_table() calls nlmsg_parse() without giving policy
      constraints, attributes can have arbirary size that we must validate
      
      Reported by syzbot/KMSAN :
      
      BUG: KMSAN: uninit-value in neigh_master_filtered net/core/neighbour.c:2292 [inline]
      BUG: KMSAN: uninit-value in neigh_dump_table net/core/neighbour.c:2348 [inline]
      BUG: KMSAN: uninit-value in neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438
      CPU: 1 PID: 3575 Comm: syzkaller268891 Not tainted 4.16.0+ #83
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       neigh_master_filtered net/core/neighbour.c:2292 [inline]
       neigh_dump_table net/core/neighbour.c:2348 [inline]
       neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438
       netlink_dump+0x9ad/0x1540 net/netlink/af_netlink.c:2225
       __netlink_dump_start+0x1167/0x12a0 net/netlink/af_netlink.c:2322
       netlink_dump_start include/linux/netlink.h:214 [inline]
       rtnetlink_rcv_msg+0x1435/0x1560 net/core/rtnetlink.c:4598
       netlink_rcv_skb+0x355/0x5f0 net/netlink/af_netlink.c:2447
       rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4653
       netlink_unicast_kernel net/netlink/af_netlink.c:1311 [inline]
       netlink_unicast+0x1672/0x1750 net/netlink/af_netlink.c:1337
       netlink_sendmsg+0x1048/0x1310 net/netlink/af_netlink.c:1900
       sock_sendmsg_nosec net/socket.c:630 [inline]
       sock_sendmsg net/socket.c:640 [inline]
       ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
       __sys_sendmsg net/socket.c:2080 [inline]
       SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
       SyS_sendmsg+0x54/0x80 net/socket.c:2087
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x43fed9
      RSP: 002b:00007ffddbee2798 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fed9
      RDX: 0000000000000000 RSI: 0000000020005000 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
      R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401800
      R13: 0000000000401890 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
       kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
       kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
       slab_post_alloc_hook mm/slab.h:445 [inline]
       slab_alloc_node mm/slub.c:2737 [inline]
       __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
       __kmalloc_reserve net/core/skbuff.c:138 [inline]
       __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
       alloc_skb include/linux/skbuff.h:984 [inline]
       netlink_alloc_large_skb net/netlink/af_netlink.c:1183 [inline]
       netlink_sendmsg+0x9a6/0x1310 net/netlink/af_netlink.c:1875
       sock_sendmsg_nosec net/socket.c:630 [inline]
       sock_sendmsg net/socket.c:640 [inline]
       ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
       __sys_sendmsg net/socket.c:2080 [inline]
       SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
       SyS_sendmsg+0x54/0x80 net/socket.c:2087
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 21fdd092 ("net: Add support for filtering neigh dump by master device")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7dd07c14
  5. 09 4月, 2018 1 次提交
    • J
      devlink: convert occ_get op to separate registration · fc56be47
      Jiri Pirko 提交于
      This resolves race during initialization where the resources with
      ops are registered before driver and the structures used by occ_get
      op is initialized. So keep occ_get callbacks registered only when
      all structs are initialized.
      
      The example flows, as it is in mlxsw:
      1) driver load/asic probe:
         mlxsw_core
            -> mlxsw_sp_resources_register
              -> mlxsw_sp_kvdl_resources_register
                -> devlink_resource_register IDX
         mlxsw_spectrum
            -> mlxsw_sp_kvdl_init
              -> mlxsw_sp_kvdl_parts_init
                -> mlxsw_sp_kvdl_part_init
                  -> devlink_resource_size_get IDX (to get the current setup
                                                    size from devlink)
              -> devlink_resource_occ_get_register IDX (register current
                                                        occupancy getter)
      2) reload triggered by devlink command:
        -> mlxsw_devlink_core_bus_device_reload
          -> mlxsw_sp_fini
            -> mlxsw_sp_kvdl_fini
      	-> devlink_resource_occ_get_unregister IDX
          (struct mlxsw_sp *mlxsw_sp is freed at this point, call to occ get
           which is using mlxsw_sp would cause use-after free)
          -> mlxsw_sp_init
            -> mlxsw_sp_kvdl_init
              -> mlxsw_sp_kvdl_parts_init
                -> mlxsw_sp_kvdl_part_init
                  -> devlink_resource_size_get IDX (to get the current setup
                                                    size from devlink)
              -> devlink_resource_occ_get_register IDX (register current
                                                        occupancy getter)
      
      Fixes: d9f9b9a4 ("devlink: Add support for resource abstraction")
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc56be47
  6. 08 4月, 2018 2 次提交
    • E
      net: fix uninit-value in __hw_addr_add_ex() · 77d36398
      Eric Dumazet 提交于
      syzbot complained :
      
      BUG: KMSAN: uninit-value in memcmp+0x119/0x180 lib/string.c:861
      CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.16.0+ #82
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: ipv6_addrconf addrconf_dad_work
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       memcmp+0x119/0x180 lib/string.c:861
       __hw_addr_add_ex net/core/dev_addr_lists.c:60 [inline]
       __dev_mc_add+0x1c2/0x8e0 net/core/dev_addr_lists.c:670
       dev_mc_add+0x6d/0x80 net/core/dev_addr_lists.c:687
       igmp6_group_added+0x2db/0xa00 net/ipv6/mcast.c:662
       ipv6_dev_mc_inc+0xe9e/0x1130 net/ipv6/mcast.c:914
       addrconf_join_solict net/ipv6/addrconf.c:2078 [inline]
       addrconf_dad_begin net/ipv6/addrconf.c:3828 [inline]
       addrconf_dad_work+0x427/0x2150 net/ipv6/addrconf.c:3954
       process_one_work+0x12c6/0x1f60 kernel/workqueue.c:2113
       worker_thread+0x113c/0x24f0 kernel/workqueue.c:2247
       kthread+0x539/0x720 kernel/kthread.c:239
      
      Fixes: f001fde5 ("net: introduce a list of device addresses dev_addr_list (v6)")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77d36398
    • E
      net: initialize skb->peeked when cloning · b13dda9f
      Eric Dumazet 提交于
      syzbot reported __skb_try_recv_from_queue() was using skb->peeked
      while it was potentially unitialized.
      
      We need to clear it in __skb_clone()
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b13dda9f
  7. 06 4月, 2018 2 次提交
  8. 01 4月, 2018 3 次提交
  9. 31 3月, 2018 5 次提交
    • A
      bpf: Post-hooks for sys_bind · aac3fc32
      Andrey Ignatov 提交于
      "Post-hooks" are hooks that are called right before returning from
      sys_bind. At this time IP and port are already allocated and no further
      changes to `struct sock` can happen before returning from sys_bind but
      BPF program has a chance to inspect the socket and change sys_bind
      result.
      
      Specifically it can e.g. inspect what port was allocated and if it
      doesn't satisfy some policy, BPF program can force sys_bind to fail and
      return EPERM to user.
      
      Another example of usage is recording the IP:port pair to some map to
      use it in later calls to sys_connect. E.g. if some TCP server inside
      cgroup was bound to some IP:port_n, it can be recorded to a map. And
      later when some TCP client inside same cgroup is trying to connect to
      127.0.0.1:port_n, BPF hook for sys_connect can override the destination
      and connect application to IP:port_n instead of 127.0.0.1:port_n. That
      helps forcing all applications inside a cgroup to use desired IP and not
      break those applications if they e.g. use localhost to communicate
      between each other.
      
      == Implementation details ==
      
      Post-hooks are implemented as two new attach types
      `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
      existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
      
      Separate attach types for IPv4 and IPv6 are introduced to avoid access
      to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
      `inet6_bind()` since those fields might not make sense in such cases.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      aac3fc32
    • A
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov 提交于
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d74bad4e
    • A
      bpf: Hooks for sys_bind · 4fbac77d
      Andrey Ignatov 提交于
      == The problem ==
      
      There is a use-case when all processes inside a cgroup should use one
      single IP address on a host that has multiple IP configured.  Those
      processes should use the IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. It should not require changing application
      code since it's often not possible.
      
      Currently it's solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. It's done by a shared library that
      is preloaded for every process in a cgroup so that whenever TCP/UDP
      server calls `bind(2)`, the library replaces IP in sockaddr before
      passing arguments to syscall. When application calls `connect(2)` the
      library transparently binds the local end of connection to that IP
      (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
      
      Shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 1st
      part of the problem: binding TCP/UDP servers on desired IP. It does not
      depend on application environment and implementation details (whether
      glibc is used or not).
      
      It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and provided by user `struct sockaddr`. Pointers to both of
      them are parts of the context passed to programs of newly added types.
      
      The new attach types provides hooks in `bind(2)` system call for both
      IPv4 and IPv6 so that one can write a program to override IP addresses
      and ports user program tries to bind to and apply such a program for
      whole cgroup.
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using
      special field as an additional "register".
      
      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with value to write and `dst` with pointer to context that can't be
      changed not to break later instructions. But the fields, allowed to
      write to, are not available directly and to access them address of
      corresponding pointer has to be loaded first. To get additional register
      the 1st not used by `src` and `dst` one is taken, its content is saved
      to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
      address of pointer field, and finally the register's content is restored
      from the temporary field after writing `src` value.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4fbac77d
    • A
      bpf: Check attach type at prog load time · 5e43f899
      Andrey Ignatov 提交于
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g. context structure may have fields for both IPv4 and IPv6 but it
      doesn't make sense to read from / write to IPv6 field when attach point
      is somewhere in IPv4 stack.
      
      Same applies to BPF-helpers: it may make sense to call some helper from
      some attach point, but not from other for same prog type.
      
      == The solution ==
      
      Introduce `expected_attach_type` field in in `struct bpf_attr` for
      `BPF_PROG_LOAD` command. If scenario described in "The problem" section
      is the case for some prog type, the field will be checked twice:
      
      1) At load time prog type is checked to see if attach type for it must
         be known to validate program permissions correctly. Prog will be
         rejected with EINVAL if it's the case and `expected_attach_type` is
         not specified or has invalid value.
      
      2) At attach time `attach_type` is compared with `expected_attach_type`,
         if prog type requires to have one, and, if they differ, attach will
         be rejected with EINVAL.
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()` () and can be used to check context
      accesses and calls to helpers correspondingly.
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5e43f899
    • T
      net: Fix untag for vlan packets without ethernet header · ae474573
      Toshiaki Makita 提交于
      In some situation vlan packets do not have ethernet headers. One example
      is packets from tun devices. Users can specify vlan protocol in tun_pi
      field instead of IP protocol, and skb_vlan_untag() attempts to untag such
      packets.
      
      skb_vlan_untag() (more precisely, skb_reorder_vlan_header() called by it)
      however did not expect packets without ethernet headers, so in such a case
      size argument for memmove() underflowed and triggered crash.
      
      ====
      BUG: unable to handle kernel paging request at ffff8801cccb8000
      IP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43
      PGD 9cee067 P4D 9cee067 PUD 1d9401063 PMD 1cccb7063 PTE 2810100028101
      Oops: 000b [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 1 PID: 17663 Comm: syz-executor2 Not tainted 4.16.0-rc7+ #368
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:__memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43
      RSP: 0018:ffff8801cc046e28 EFLAGS: 00010287
      RAX: ffff8801ccc244c4 RBX: fffffffffffffffe RCX: fffffffffff6c4c2
      RDX: fffffffffffffffe RSI: ffff8801cccb7ffc RDI: ffff8801cccb8000
      RBP: ffff8801cc046e48 R08: ffff8801ccc244be R09: ffffed0039984899
      R10: 0000000000000001 R11: ffffed0039984898 R12: ffff8801ccc244c4
      R13: ffff8801ccc244c0 R14: ffff8801d96b7c06 R15: ffff8801d96b7b40
      FS:  00007febd562d700(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff8801cccb8000 CR3: 00000001ccb2f006 CR4: 00000000001606e0
      DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      Call Trace:
       memmove include/linux/string.h:360 [inline]
       skb_reorder_vlan_header net/core/skbuff.c:5031 [inline]
       skb_vlan_untag+0x470/0xc40 net/core/skbuff.c:5061
       __netif_receive_skb_core+0x119c/0x3460 net/core/dev.c:4460
       __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4627
       netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4701
       netif_receive_skb+0xae/0x390 net/core/dev.c:4725
       tun_rx_batched.isra.50+0x5ee/0x870 drivers/net/tun.c:1555
       tun_get_user+0x299e/0x3c20 drivers/net/tun.c:1962
       tun_chr_write_iter+0xb9/0x160 drivers/net/tun.c:1990
       call_write_iter include/linux/fs.h:1782 [inline]
       new_sync_write fs/read_write.c:469 [inline]
       __vfs_write+0x684/0x970 fs/read_write.c:482
       vfs_write+0x189/0x510 fs/read_write.c:544
       SYSC_write fs/read_write.c:589 [inline]
       SyS_write+0xef/0x220 fs/read_write.c:581
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      RIP: 0033:0x454879
      RSP: 002b:00007febd562cc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00007febd562d6d4 RCX: 0000000000454879
      RDX: 0000000000000157 RSI: 0000000020000180 RDI: 0000000000000014
      RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000006b0 R14: 00000000006fc120 R15: 0000000000000000
      Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20
      RIP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43 RSP: ffff8801cc046e28
      CR2: ffff8801cccb8000
      ====
      
      We don't need to copy headers for packets which do not have preceding
      headers of vlan headers, so skip memmove() in that case.
      
      Fixes: 4bbb3e0e ("net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off")
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae474573
  10. 30 3月, 2018 8 次提交
    • K
      net: Close race between {un, }register_netdevice_notifier() and setup_net()/cleanup_net() · 328fbe74
      Kirill Tkhai 提交于
      {un,}register_netdevice_notifier() iterate over all net namespaces
      hashed to net_namespace_list. But pernet_operations register and
      unregister netdevices in unhashed net namespace, and they are not
      seen for netdevice notifiers. This results in asymmetry:
      
      1)Race with register_netdevice_notifier()
        pernet_operations::init(net)	...
         register_netdevice()		...
          call_netdevice_notifiers()  ...
            ... nb is not called ...
        ...				register_netdevice_notifier(nb) -> net skipped
        ...				...
        list_add_tail(&net->list, ..) ...
      
        Then, userspace stops using net, and it's destructed:
      
        pernet_operations::exit(net)
         unregister_netdevice()
          call_netdevice_notifiers()
            ... nb is called ...
      
      This always happens with net::loopback_dev, but it may be not the only device.
      
      2)Race with unregister_netdevice_notifier()
        pernet_operations::init(net)
         register_netdevice()
          call_netdevice_notifiers()
            ... nb is called ...
      
        Then, userspace stops using net, and it's destructed:
      
        list_del_rcu(&net->list)	...
        pernet_operations::exit(net)  unregister_netdevice_notifier(nb) -> net skipped
         dev_change_net_namespace()	...
          call_netdevice_notifiers()
            ... nb is not called ...
         unregister_netdevice()
          call_netdevice_notifiers()
            ... nb is not called ...
      
      This race is more danger, since dev_change_net_namespace() moves real
      network devices, which use not trivial netdevice notifiers, and if this
      will happen, the system will be left in unpredictable state.
      
      The patch closes the race. During the testing I found two places,
      where register_netdevice_notifier() is called from pernet init/exit
      methods (which led to deadlock) and fixed them (see previous patches).
      
      The review moved me to one more unusual registration place:
      raw_init() (can driver). It may be a reason of problems,
      if someone creates in-kernel CAN_RAW sockets, since they
      will be destroyed in exit method and raw_release()
      will call unregister_netdevice_notifier(). But grep over
      kernel tree does not show, someone creates such sockets
      from kernel space.
      
      Theoretically, there can be more places like this, and which are
      hidden from review, but we found them on the first bumping there
      (since there is no a race, it will be 100% reproducible).
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      328fbe74
    • R
      sfp/phylink: move module EEPROM ethtool access into netdev core ethtool · e679c9c1
      Russell King 提交于
      Provide a pointer to the SFP bus in struct net_device, so that the
      ethtool module EEPROM methods can access the SFP directly, rather
      than needing every user to provide a hook for it.
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e679c9c1
    • G
      net: Call add/kill vid ndo on vlan filter feature toggling · 9daae9bd
      Gal Pressman 提交于
      NETIF_F_HW_VLAN_[CS]TAG_FILTER features require more than just a bit
      flip in dev->features in order to keep the driver in a consistent state.
      These features notify the driver of each added/removed vlan, but toggling
      of vlan-filter does not notify the driver accordingly for each of the
      existing vlans.
      
      This patch implements a similar solution to NETIF_F_RX_UDP_TUNNEL_PORT
      behavior (which notifies the driver about UDP ports in the same manner
      that vids are reported).
      
      Each toggling of the features propagates to the 8021q module, which
      iterates over the vlans and call add/kill ndo accordingly.
      Signed-off-by: NGal Pressman <galp@mellanox.com>
      Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9daae9bd
    • J
      bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT: · fa246693
      John Fastabend 提交于
      Add support for the BPF_F_INGRESS flag in skb redirect helper. To
      do this convert skb into a scatterlist and push into ingress queue.
      This is the same logic that is used in the sk_msg redirect helper
      so it should feel familiar.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      fa246693
    • J
      bpf: sockmap redirect ingress support · 8934ce2f
      John Fastabend 提交于
      Add support for the BPF_F_INGRESS flag in sk_msg redirect helper.
      To do this add a scatterlist ring for receiving socks to check
      before calling into regular recvmsg call path. Additionally, because
      the poll wakeup logic only checked the skb recv queue we need to
      add a hook in TCP stack (similar to write side) so that we have
      a way to wake up polling socks when a scatterlist is redirected
      to that sock.
      
      After this all that is needed is for the redirect helper to
      push the scatterlist into the psock receive queue.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      8934ce2f
    • D
      net: Move call_fib_rule_notifiers up in fib_nl_newrule · 9776d325
      David Ahern 提交于
      Move call_fib_rule_notifiers up in fib_nl_newrule to the point right
      before the rule is inserted into the list. At this point there are no
      more failure paths within the core rule code, so if the notifier
      does not fail then the rule will be inserted into the list.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9776d325
    • D
      net: Fix fib notifer to return errno · c30d9356
      David Ahern 提交于
      Notifier handlers use notifier_from_errno to convert any potential error
      to an encoded format. As a consequence the other side, call_fib_notifier{s}
      in this case, needs to use notifier_to_errno to return the error from
      the handler back to its caller.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c30d9356
    • K
      net: Introduce net_rwsem to protect net_namespace_list · f0b07bb1
      Kirill Tkhai 提交于
      rtnl_lock() is used everywhere, and contention is very high.
      When someone wants to iterate over alive net namespaces,
      he/she has no a possibility to do that without exclusive lock.
      But the exclusive rtnl_lock() in such places is overkill,
      and it just increases the contention. Yes, there is already
      for_each_net_rcu() in kernel, but it requires rcu_read_lock(),
      and this can't be sleepable. Also, sometimes it may be need
      really prevent net_namespace_list growth, so for_each_net_rcu()
      is not fit there.
      
      This patch introduces new rw_semaphore, which will be used
      instead of rtnl_mutex to protect net_namespace_list. It is
      sleepable and allows not-exclusive iterations over net
      namespaces list. It allows to stop using rtnl_lock()
      in several places (what is made in next patches) and makes
      less the time, we keep rtnl_mutex. Here we just add new lock,
      while the explanation of we can remove rtnl_lock() there are
      in next patches.
      
      Fine grained locks generally are better, then one big lock,
      so let's do that with net_namespace_list, while the situation
      allows that.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0b07bb1
  11. 29 3月, 2018 1 次提交
  12. 28 3月, 2018 4 次提交
  13. 27 3月, 2018 3 次提交
    • E
      net: fix possible out-of-bound read in skb_network_protocol() · 1dfe82eb
      Eric Dumazet 提交于
      skb mac header is not necessarily set at the time skb_network_protocol()
      is called. Use skb->data instead.
      
      BUG: KASAN: slab-out-of-bounds in skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739
      Read of size 2 at addr ffff8801b3097a0b by task syz-executor5/14242
      
      CPU: 1 PID: 14242 Comm: syz-executor5 Not tainted 4.16.0-rc6+ #280
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x24d lib/dump_stack.c:53
       print_address_description+0x73/0x250 mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report+0x23c/0x360 mm/kasan/report.c:412
       __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:443
       skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739
       harmonize_features net/core/dev.c:2924 [inline]
       netif_skb_features+0x509/0x9b0 net/core/dev.c:3011
       validate_xmit_skb+0x81/0xb00 net/core/dev.c:3084
       validate_xmit_skb_list+0xbf/0x120 net/core/dev.c:3142
       packet_direct_xmit+0x117/0x790 net/packet/af_packet.c:256
       packet_snd net/packet/af_packet.c:2944 [inline]
       packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2969
       sock_sendmsg_nosec net/socket.c:629 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:639
       ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047
       __sys_sendmsg+0xe5/0x210 net/socket.c:2081
      
      Fixes: 19acc327 ("gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Pravin B Shelar <pshelar@ovn.org>
      Reported-by: NReported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1dfe82eb
    • I
      ethtool: Add support for configuring PFC stall prevention in ethtool · e1577c1c
      Inbar Karmy 提交于
      In the event where the device unexpectedly becomes unresponsive
      for a long period of time, flow control mechanism may propagate
      pause frames which will cause congestion spreading to the entire
      network.
      To prevent this scenario, when the device is stalled for a period
      longer than a pre-configured timeout, flow control mechanisms are
      automatically disabled.
      
      This patch adds support for the ETHTOOL_PFC_STALL_PREVENTION
      as a tunable.
      This API provides support for configuring flow control storm prevention
      timeout (msec).
      Signed-off-by: NInbar Karmy <inbark@mellanox.com>
      Cc: Michal Kubecek <mkubecek@suse.cz>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      e1577c1c
    • J
      net: Use octal not symbolic permissions · d6444062
      Joe Perches 提交于
      Prefer the direct use of octal for permissions.
      
      Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
      and some typing.
      
      Miscellanea:
      
      o Whitespace neatening around these conversions.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6444062
  14. 26 3月, 2018 4 次提交
    • K
      net: Drop NETDEV_UNREGISTER_FINAL · 070f2d7e
      Kirill Tkhai 提交于
      Last user is gone after bdf5bd7f "rds: tcp: remove
      register_netdevice_notifier infrastructure.", so we can
      remove this netdevice command. This allows to delete
      rtnl_lock() in netdev_run_todo(), which is hot path for
      net namespace unregistration.
      
      dev_change_net_namespace() and netdev_wait_allrefs()
      have rcu_barrier() before NETDEV_UNREGISTER_FINAL call,
      and the source commits say they were introduced to
      delemit the call with NETDEV_UNREGISTER, but this patch
      leaves them on the places, since they require additional
      analysis, whether we need in them for something else.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      070f2d7e
    • K
      net: Make NETDEV_XXX commands enum { } · ede2762d
      Kirill Tkhai 提交于
      This patch is preparation to drop NETDEV_UNREGISTER_FINAL.
      Since the cmd is used in usnic_ib_netdev_event_to_string()
      to get cmd name, after plain removing NETDEV_UNREGISTER_FINAL
      from everywhere, we'd have holes in event2str[] in this
      function.
      
      Instead of that, let's make NETDEV_XXX commands names
      available for everyone, and to define netdev_cmd_to_name()
      in the way we won't have to shaffle names after their
      numbers are changed.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ede2762d
    • S
      net/utils: Introduce inet_addr_is_any · a470143f
      Sagi Grimberg 提交于
      Can be useful to check INET_ANY address for both ipv4/ipv6 addresses.
      Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a470143f
    • Y
      net: permit skb_segment on head_frag frag_list skb · 13acc94e
      Yonghong Song 提交于
      One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
      function skb_segment(), line 3667. The bpf program attaches to
      clsact ingress, calls bpf_skb_change_proto to change protocol
      from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
      to send the changed packet out.
      
      3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
      3473                             netdev_features_t features)
      3474 {
      3475         struct sk_buff *segs = NULL;
      3476         struct sk_buff *tail = NULL;
      ...
      3665                 while (pos < offset + len) {
      3666                         if (i >= nfrags) {
      3667                                 BUG_ON(skb_headlen(list_skb));
      3668
      3669                                 i = 0;
      3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
      3671                                 frag = skb_shinfo(list_skb)->frags;
      3672                                 frag_skb = list_skb;
      ...
      
      call stack:
      ...
       #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
       #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
       #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
       #4 [ffff883ffef03668] die at ffffffff8101deb2
       #5 [ffff883ffef03698] do_trap at ffffffff8101a700
       #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
       #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
       #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
          [exception RIP: skb_segment+3044]
          RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
          RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
          RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
          RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
          R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
          R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13acc94e
  15. 23 3月, 2018 2 次提交