1. 06 11月, 2018 3 次提交
    • A
      sock_diag: fix autoloading of the raw_diag module · c34c1287
      Andrei Vagin 提交于
      IPPROTO_RAW isn't registred as an inet protocol, so
      inet_protos[protocol] is always NULL for it.
      
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Fixes: bf2ae2e4 ("sock_diag: request _diag module only when the family or proto has been registered")
      Signed-off-by: NAndrei Vagin <avagin@gmail.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c34c1287
    • M
      net: core: netpoll: Enable netconsole IPv6 link local address · d016b4a3
      Matwey V. Kornilov 提交于
      There is no reason to discard using source link local address when
      remote netconsole IPv6 address is set to be link local one.
      
      The patch allows administrators to use IPv6 netconsole without
      explicitly configuring source address:
      
          netconsole=@/,@fe80::5054:ff:fe2f:6012/
      Signed-off-by: NMatwey V. Kornilov <matwey@sai.msu.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d016b4a3
    • A
      rtnetlink: restore handling of dumpit return value in rtnl_dump_all() · 5e1acb4a
      Alexey Kodanev 提交于
      For non-zero return from dumpit() we should break the loop
      in rtnl_dump_all() and return the result. Otherwise, e.g.,
      we could get the memory leak in inet6_dump_fib() [1]. The
      pointer to the allocated struct fib6_walker there (saved
      in cb->args) can be lost, reset on the next iteration.
      
      Fix it by partially restoring the previous behavior before
      commit c63586dc ("net: rtnl_dump_all needs to propagate
      error from dumpit function"). The returned error from
      dumpit() is still passed further.
      
      [1]:
      unreferenced object 0xffff88001322a200 (size 96):
        comm "sshd", pid 1484, jiffies 4296032768 (age 1432.542s)
        hex dump (first 32 bytes):
          00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
          18 09 41 36 00 88 ff ff 18 09 41 36 00 88 ff ff  ..A6......A6....
        backtrace:
          [<0000000095846b39>] kmem_cache_alloc_trace+0x151/0x220
          [<000000007d12709f>] inet6_dump_fib+0x68d/0x940
          [<000000002775a316>] rtnl_dump_all+0x1d9/0x2d0
          [<00000000d7cd302b>] netlink_dump+0x945/0x11a0
          [<000000002f43485f>] __netlink_dump_start+0x55d/0x800
          [<00000000f76bbeec>] rtnetlink_rcv_msg+0x4fa/0xa00
          [<000000009b5761f3>] netlink_rcv_skb+0x29c/0x420
          [<0000000087a1dae1>] rtnetlink_rcv+0x15/0x20
          [<00000000691b703b>] netlink_unicast+0x4e3/0x6c0
          [<00000000b5be0204>] netlink_sendmsg+0x7f2/0xba0
          [<0000000096d2aa60>] sock_sendmsg+0xba/0xf0
          [<000000008c1b786f>] __sys_sendto+0x1e4/0x330
          [<0000000019587b3f>] __x64_sys_sendto+0xe1/0x1a0
          [<00000000071f4d56>] do_syscall_64+0x9f/0x300
          [<000000002737577f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
          [<0000000057587684>] 0xffffffffffffffff
      
      Fixes: c63586dc ("net: rtnl_dump_all needs to propagate error from dumpit function")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e1acb4a
  2. 04 11月, 2018 1 次提交
    • E
      net: do not abort bulk send on BQL status · fe60faa5
      Eric Dumazet 提交于
      Before calling dev_hard_start_xmit(), upper layers tried
      to cook optimal skb list based on BQL budget.
      
      Problem is that GSO packets can end up comsuming more than
      the BQL budget.
      
      Breaking the loop is not useful, since requeued packets
      are ahead of any packets still in the qdisc.
      
      It is also more expensive, since next TX completion will
      push these packets later, while skbs are not in cpu caches.
      
      It is also a behavior difference with TSO packets, that can
      break the BQL limit by a large amount.
      
      Note that drivers should use __netdev_tx_sent_queue()
      in order to have optimal xmit_more support, and avoid
      useless atomic operations as shown in the following patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe60faa5
  3. 03 11月, 2018 1 次提交
  4. 30 10月, 2018 1 次提交
    • I
      rtnetlink: Disallow FDB configuration for non-Ethernet device · da715775
      Ido Schimmel 提交于
      When an FDB entry is configured, the address is validated to have the
      length of an Ethernet address, but the device for which the address is
      configured can be of any type.
      
      The above can result in the use of uninitialized memory when the address
      is later compared against existing addresses since 'dev->addr_len' is
      used and it may be greater than ETH_ALEN, as with ip6tnl devices.
      
      Fix this by making sure that FDB entries are only configured for
      Ethernet devices.
      
      BUG: KMSAN: uninit-value in memcmp+0x11d/0x180 lib/string.c:863
      CPU: 1 PID: 4318 Comm: syz-executor998 Not tainted 4.19.0-rc3+ #49
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x14b/0x190 lib/dump_stack.c:113
        kmsan_report+0x183/0x2b0 mm/kmsan/kmsan.c:956
        __msan_warning+0x70/0xc0 mm/kmsan/kmsan_instr.c:645
        memcmp+0x11d/0x180 lib/string.c:863
        dev_uc_add_excl+0x165/0x7b0 net/core/dev_addr_lists.c:464
        ndo_dflt_fdb_add net/core/rtnetlink.c:3463 [inline]
        rtnl_fdb_add+0x1081/0x1270 net/core/rtnetlink.c:3558
        rtnetlink_rcv_msg+0xa0b/0x1530 net/core/rtnetlink.c:4715
        netlink_rcv_skb+0x36e/0x5f0 net/netlink/af_netlink.c:2454
        rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4733
        netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
        netlink_unicast+0x1638/0x1720 net/netlink/af_netlink.c:1343
        netlink_sendmsg+0x1205/0x1290 net/netlink/af_netlink.c:1908
        sock_sendmsg_nosec net/socket.c:621 [inline]
        sock_sendmsg net/socket.c:631 [inline]
        ___sys_sendmsg+0xe70/0x1290 net/socket.c:2114
        __sys_sendmsg net/socket.c:2152 [inline]
        __do_sys_sendmsg net/socket.c:2161 [inline]
        __se_sys_sendmsg+0x2a3/0x3d0 net/socket.c:2159
        __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2159
        do_syscall_64+0xb8/0x100 arch/x86/entry/common.c:291
        entry_SYSCALL_64_after_hwframe+0x63/0xe7
      RIP: 0033:0x440ee9
      Code: e8 cc ab 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
      48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
      ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fff6a93b518 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000440ee9
      RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000003
      RBP: 0000000000000000 R08: 00000000004002c8 R09: 00000000004002c8
      R10: 00000000004002c8 R11: 0000000000000213 R12: 000000000000b4b0
      R13: 0000000000401ec0 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
        kmsan_save_stack_with_flags mm/kmsan/kmsan.c:256 [inline]
        kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:181
        kmsan_kmalloc+0x98/0x100 mm/kmsan/kmsan_hooks.c:91
        kmsan_slab_alloc+0x10/0x20 mm/kmsan/kmsan_hooks.c:100
        slab_post_alloc_hook mm/slab.h:446 [inline]
        slab_alloc_node mm/slub.c:2718 [inline]
        __kmalloc_node_track_caller+0x9e7/0x1160 mm/slub.c:4351
        __kmalloc_reserve net/core/skbuff.c:138 [inline]
        __alloc_skb+0x2f5/0x9e0 net/core/skbuff.c:206
        alloc_skb include/linux/skbuff.h:996 [inline]
        netlink_alloc_large_skb net/netlink/af_netlink.c:1189 [inline]
        netlink_sendmsg+0xb49/0x1290 net/netlink/af_netlink.c:1883
        sock_sendmsg_nosec net/socket.c:621 [inline]
        sock_sendmsg net/socket.c:631 [inline]
        ___sys_sendmsg+0xe70/0x1290 net/socket.c:2114
        __sys_sendmsg net/socket.c:2152 [inline]
        __do_sys_sendmsg net/socket.c:2161 [inline]
        __se_sys_sendmsg+0x2a3/0x3d0 net/socket.c:2159
        __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2159
        do_syscall_64+0xb8/0x100 arch/x86/entry/common.c:291
        entry_SYSCALL_64_after_hwframe+0x63/0xe7
      
      v2:
      * Make error message more specific (David)
      
      Fixes: 090096bf ("net: generic fdb support for drivers without ndo_fdb_<op>")
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Reported-and-tested-by: syzbot+3a288d5f5530b901310e@syzkaller.appspotmail.com
      Reported-and-tested-by: syzbot+d53ab4e92a1db04110ff@syzkaller.appspotmail.com
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da715775
  5. 29 10月, 2018 1 次提交
  6. 27 10月, 2018 2 次提交
    • E
      net/neigh: fix NULL deref in pneigh_dump_table() · aab456df
      Eric Dumazet 提交于
      pneigh can have NULL device pointer, so we need to make
      neigh_master_filtered() and neigh_ifindex_filtered() more robust.
      
      syzbot report :
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 15867 Comm: syz-executor2 Not tainted 4.19.0+ #276
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:179 [inline]
      RIP: 0010:list_empty include/linux/list.h:203 [inline]
      RIP: 0010:netdev_master_upper_dev_get+0xa1/0x250 net/core/dev.c:6467
      RSP: 0018:ffff8801bfaf7220 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000000000001 RCX: ffffc90005e92000
      RDX: 0000000000000016 RSI: ffffffff860b44d9 RDI: 0000000000000005
      RBP: ffff8801bfaf72b0 R08: ffff8801c4c84080 R09: fffffbfff139a580
      R10: fffffbfff139a580 R11: ffffffff89cd2c07 R12: 1ffff10037f5ee45
      R13: 0000000000000000 R14: ffff8801bfaf7288 R15: 00000000000000b0
      FS:  00007f65cc68d700(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b33a21000 CR3: 00000001c6116000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       neigh_master_filtered net/core/neighbour.c:2367 [inline]
       pneigh_dump_table net/core/neighbour.c:2456 [inline]
       neigh_dump_info+0x7a9/0x1ce0 net/core/neighbour.c:2577
       netlink_dump+0x606/0x1080 net/netlink/af_netlink.c:2244
       __netlink_dump_start+0x59a/0x7c0 net/netlink/af_netlink.c:2352
       netlink_dump_start include/linux/netlink.h:216 [inline]
       rtnetlink_rcv_msg+0x809/0xc20 net/core/rtnetlink.c:4898
       netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4953
       netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
       netlink_unicast+0x5a5/0x760 net/netlink/af_netlink.c:1336
       netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:631
       sock_write_iter+0x35e/0x5c0 net/socket.c:900
       call_write_iter include/linux/fs.h:1808 [inline]
       new_sync_write fs/read_write.c:474 [inline]
       __vfs_write+0x6b8/0x9f0 fs/read_write.c:487
       vfs_write+0x1fc/0x560 fs/read_write.c:549
       ksys_write+0x101/0x260 fs/read_write.c:598
       __do_sys_write fs/read_write.c:610 [inline]
       __se_sys_write fs/read_write.c:607 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:607
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457569
      
      Fixes: 6f52f80e ("net/neigh: Extend dump filter to proxy neighbor dumps")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Tested-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aab456df
    • D
      bpf: fix wrong helper enablement in cgroup local storage · d8fd9e10
      Daniel Borkmann 提交于
      Commit cd339431 ("bpf: introduce the bpf_get_local_storage()
      helper function") enabled the bpf_get_local_storage() helper also
      for BPF program types where it does not make sense to use them.
      
      They have been added both in sk_skb_func_proto() and sk_msg_func_proto()
      even though both program types are not invoked in combination with
      cgroups, and neither through BPF_PROG_RUN_ARRAY(). In the latter the
      bpf_cgroup_storage_set() is set shortly before BPF program invocation.
      
      Later, the helper bpf_get_local_storage() retrieves this prior set
      up per-cpu pointer and hands the buffer to the BPF program. The map
      argument in there solely retrieves the enum bpf_cgroup_storage_type
      from a local storage map associated with the program and based on the
      type returns either the global or per-cpu storage. However, there
      is no specific association between the program's map and the actual
      content in bpf_cgroup_storage[].
      
      Meaning, any BPF program that would have been properly run from the
      cgroup side through BPF_PROG_RUN_ARRAY() where bpf_cgroup_storage_set()
      was performed, and that is later unloaded such that prog / maps are
      teared down will cause a use after free if that pointer is retrieved
      from programs that are not run through BPF_PROG_RUN_ARRAY() but have
      the cgroup local storage helper enabled in their func proto.
      
      Lets just remove it from the two sock_map program types to fix it.
      Auditing through the types where this helper is enabled, it appears
      that these are the only ones where it was mistakenly allowed.
      
      Fixes: cd339431 ("bpf: introduce the bpf_get_local_storage() helper function")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Roman Gushchin <guro@fb.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d8fd9e10
  7. 26 10月, 2018 3 次提交
  8. 25 10月, 2018 2 次提交
    • S
      net: udp: fix handling of CHECKSUM_COMPLETE packets · db4f1be3
      Sean Tranchetti 提交于
      Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
      incorrect for any packet that has an incorrect checksum value.
      
      udp4/6_csum_init() will both make a call to
      __skb_checksum_validate_complete() to initialize/validate the csum
      field when receiving a CHECKSUM_COMPLETE packet. When this packet
      fails validation, skb->csum will be overwritten with the pseudoheader
      checksum so the packet can be fully validated by software, but the
      skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
      the stack can later warn the user about their hardware spewing bad
      checksums. Unfortunately, leaving the SKB in this state can cause
      problems later on in the checksum calculation.
      
      Since the the packet is still marked as CHECKSUM_COMPLETE,
      udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
      from skb->csum instead of adding it, leaving us with a garbage value
      in that field. Once we try to copy the packet to userspace in the
      udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
      to checksum the packet data and add it in the garbage skb->csum value
      to perform our final validation check.
      
      Since the value we're validating is not the proper checksum, it's possible
      that the folded value could come out to 0, causing us not to drop the
      packet. Instead, we believe that the packet was checksummed incorrectly
      by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
      to warn the user with netdev_rx_csum_fault(skb->dev);
      
      Unfortunately, since this is the UDP path, skb->dev has been overwritten
      by skb->dev_scratch and is no longer a valid pointer, so we end up
      reading invalid memory.
      
      This patch addresses this problem in two ways:
      	1) Do not use the dev pointer when calling netdev_rx_csum_fault()
      	   from skb_copy_and_csum_datagram_msg(). Since this gets called
      	   from the UDP path where skb->dev has been overwritten, we have
      	   no way of knowing if the pointer is still valid. Also for the
      	   sake of consistency with the other uses of
      	   netdev_rx_csum_fault(), don't attempt to call it if the
      	   packet was checksummed by software.
      
      	2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
      	   If we receive a packet that's CHECKSUM_COMPLETE that fails
      	   verification (i.e. skb->csum_valid == 0), check who performed
      	   the calculation. It's possible that the checksum was done in
      	   software by the network stack earlier (such as Netfilter's
      	   CONNTRACK module), and if that says the checksum is bad,
      	   we can drop the packet immediately instead of waiting until
      	   we try and copy it to userspace. Otherwise, we need to
      	   mark the SKB as CHECKSUM_NONE, since the skb->csum field
      	   no longer contains the full packet checksum after the
      	   call to __skb_checksum_validate_complete().
      
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Fixes: c84d9490 ("udp: copy skb->truesize in the first cache line")
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NSean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db4f1be3
    • D
      net: rtnl_dump_all needs to propagate error from dumpit function · c63586dc
      David Ahern 提交于
      If an address, route or netconf dump request is sent for AF_UNSPEC, then
      rtnl_dump_all is used to do the dump across all address families. If one
      of the dumpit functions fails (e.g., invalid attributes in the dump
      request) then rtnl_dump_all needs to propagate that error so the user
      gets an appropriate response instead of just getting no data.
      
      Fixes: effe6792 ("net: Enable kernel side filtering of route dumps")
      Fixes: 5fcd266a ("net/ipv4: Add support for dumping addresses for a specific device")
      Fixes: 6371a71f ("net/ipv6: Add support for dumping addresses for a specific device")
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c63586dc
  9. 24 10月, 2018 2 次提交
    • M
      cgroup, netclassid: add a preemption point to write_classid · a90e90b7
      Michal Hocko 提交于
      We have seen a customer complaining about soft lockups on !PREEMPT
      kernel config with 4.4 based kernel
      
      [1072141.435366] NMI watchdog: BUG: soft lockup - CPU#21 stuck for 22s! [systemd:1]
      [1072141.444090] Modules linked in: mpt3sas raid_class binfmt_misc af_packet 8021q garp mrp stp llc xfs libcrc32c bonding iscsi_ibft iscsi_boot_sysfs msr ext4 crc16 jbd2 mbcache cdc_ether usbnet mii joydev hid_generic usbhid intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif mgag200 i2c_algo_bit ttm ipmi_devintf drbg ixgbe drm_kms_helper vxlan ansi_cprng ip6_udp_tunnel drm aesni_intel udp_tunnel aes_x86_64 iTCO_wdt syscopyarea ptp xhci_pci lrw iTCO_vendor_support pps_core gf128mul ehci_pci glue_helper sysfillrect mdio pcspkr sb_edac ablk_helper cryptd ehci_hcd sysimgblt xhci_hcd fb_sys_fops edac_core mei_me lpc_ich ses usbcore enclosure dca mfd_core ipmi_si mei i2c_i801 scsi_transport_sas usb_common ipmi_msghandler shpchp fjes wmi processor button acpi_pad btrfs xor raid6_pq sd_mod crc32c_intel megaraid_sas sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod md_mod autofs4
      [1072141.444146] Supported: Yes
      [1072141.444149] CPU: 21 PID: 1 Comm: systemd Not tainted 4.4.121-92.80-default #1
      [1072141.444150] Hardware name: LENOVO Lenovo System x3650 M5 -[5462P4U]- -[5462P4U]-/01GR451, BIOS -[TCE136H-2.70]- 06/13/2018
      [1072141.444151] task: ffff880191bd0040 ti: ffff880191bd4000 task.ti: ffff880191bd4000
      [1072141.444153] RIP: 0010:[<ffffffff815229f9>]  [<ffffffff815229f9>] update_classid_sock+0x29/0x40
      [1072141.444157] RSP: 0018:ffff880191bd7d58  EFLAGS: 00000286
      [1072141.444158] RAX: ffff883b177cb7c0 RBX: 0000000000000000 RCX: 0000000000000000
      [1072141.444159] RDX: 00000000000009c7 RSI: ffff880191bd7d5c RDI: ffff8822e29bb200
      [1072141.444160] RBP: ffff883a72230980 R08: 0000000000000101 R09: 0000000000000000
      [1072141.444161] R10: 0000000000000008 R11: f000000000000000 R12: ffffffff815229d0
      [1072141.444162] R13: 0000000000000000 R14: ffff881fd0a47ac0 R15: ffff880191bd7f28
      [1072141.444163] FS:  00007f3e2f1eb8c0(0000) GS:ffff882000340000(0000) knlGS:0000000000000000
      [1072141.444164] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1072141.444165] CR2: 00007f3e2f200000 CR3: 0000001ffea4e000 CR4: 00000000001606f0
      [1072141.444166] Stack:
      [1072141.444166]  ffffffa800000246 00000000000009c7 ffffffff8121d583 ffff8818312a05c0
      [1072141.444168]  ffff8818312a1100 ffff880197c3b280 ffff881861422858 ffffffffffffffea
      [1072141.444170]  ffffffff81522b1c ffffffff81d0ca20 ffff8817fa17b950 ffff883fdd8121e0
      [1072141.444171] Call Trace:
      [1072141.444179]  [<ffffffff8121d583>] iterate_fd+0x53/0x80
      [1072141.444182]  [<ffffffff81522b1c>] write_classid+0x4c/0x80
      [1072141.444187]  [<ffffffff8111328b>] cgroup_file_write+0x9b/0x100
      [1072141.444193]  [<ffffffff81278bcb>] kernfs_fop_write+0x11b/0x150
      [1072141.444198]  [<ffffffff81201566>] __vfs_write+0x26/0x100
      [1072141.444201]  [<ffffffff81201bed>] vfs_write+0x9d/0x190
      [1072141.444203]  [<ffffffff812028c2>] SyS_write+0x42/0xa0
      [1072141.444207]  [<ffffffff815f58c3>] entry_SYSCALL_64_fastpath+0x1e/0xca
      [1072141.445490] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xca
      
      If a cgroup has many tasks with many open file descriptors then we would
      end up in a large loop without any rescheduling point throught the
      operation. Add cond_resched once per task.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a90e90b7
    • K
      Revert "net: simplify sock_poll_wait" · 89ab066d
      Karsten Graul 提交于
      This reverts commit dd979b4d.
      
      This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
      internal TCP socket for the initial handshake with the remote peer.
      Whenever the SMC connection can not be established this TCP socket is
      used as a fallback. All socket operations on the SMC socket are then
      forwarded to the TCP socket. In case of poll, the file->private_data
      pointer references the SMC socket because the TCP socket has no file
      assigned. This causes tcp_poll to wait on the wrong socket.
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      89ab066d
  10. 21 10月, 2018 2 次提交
    • R
      Revert "neighbour: force neigh_invalidate when NUD_FAILED update is from admin" · d2fb4fb8
      Roopa Prabhu 提交于
      This reverts commit 8e326289.
      
      This patch results in unnecessary netlink notification when one
      tries to delete a neigh entry already in NUD_FAILED state. Found
      this with a buggy app that tries to delete a NUD_FAILED entry
      repeatedly. While the notification issue can be fixed with more
      checks, adding more complexity here seems unnecessary. Also,
      recent tests with other changes in the neighbour code have
      shown that the INCOMPLETE and PROBE checks are good enough for
      the original issue.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2fb4fb8
    • J
      bpf: sk_msg program helper bpf_msg_push_data · 6fff607e
      John Fastabend 提交于
      This allows user to push data into a msg using sk_msg program types.
      The format is as follows,
      
      	bpf_msg_push_data(msg, offset, len, flags)
      
      this will insert 'len' bytes at offset 'offset'. For example to
      prepend 10 bytes at the front of the message the user can,
      
      	bpf_msg_push_data(msg, 0, 10, 0);
      
      This will invalidate data bounds so BPF user will have to then recheck
      data bounds after calling this. After this the msg size will have been
      updated and the user is free to write into the added bytes. We allow
      any offset/len as long as it is within the (data, data_end) range.
      However, a copy will be required if the ring is full and its possible
      for the helper to fail with ENOMEM or EINVAL errors which need to be
      handled by the BPF program.
      
      This can be used similar to XDP metadata to pass data between sk_msg
      layer and lower layers.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      6fff607e
  11. 20 10月, 2018 6 次提交
    • D
      net: fix pskb_trim_rcsum_slow() with odd trim offset · d55bef50
      Dimitris Michailidis 提交于
      We've been getting checksum errors involving small UDP packets, usually
      59B packets with 1 extra non-zero padding byte. netdev_rx_csum_fault()
      has been complaining that HW is providing bad checksums. Turns out the
      problem is in pskb_trim_rcsum_slow(), introduced in commit 88078d98
      ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends").
      
      The source of the problem is that when the bytes we are trimming start
      at an odd address, as in the case of the 1 padding byte above,
      skb_checksum() returns a byte-swapped value. We cannot just combine this
      with skb->csum using csum_sub(). We need to use csum_block_sub() here
      that takes into account the parity of the start address and handles the
      swapping.
      
      Matches existing code in __skb_postpull_rcsum() and esp_remove_trailer().
      
      Fixes: 88078d98 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
      Signed-off-by: NDimitris Michailidis <dmichail@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d55bef50
    • D
      netpoll: allow cleanup to be synchronous · c9fbd71f
      Debabrata Banerjee 提交于
      This fixes a problem introduced by:
      commit 2cde6acd ("netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock")
      
      When using netconsole on a bond, __netpoll_cleanup can asynchronously
      recurse multiple times, each __netpoll_free_async call can result in
      more __netpoll_free_async's. This means there is now a race between
      cleanup_work queues on multiple netpoll_info's on multiple devices and
      the configuration of a new netpoll. For example if a netconsole is set
      to enable 0, reconfigured, and enable 1 immediately, this netconsole
      will likely not work.
      
      Given the reason for __netpoll_free_async is it can be called when rtnl
      is not locked, if it is locked, we should be able to execute
      synchronously. It appears to be locked everywhere it's called from.
      
      Generalize the design pattern from the teaming driver for current
      callers of __netpoll_free_async.
      
      CC: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NDebabrata Banerjee <dbanerje@akamai.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9fbd71f
    • J
      bpf: skmsg, fix psock create on existing kcm/tls port · 5032d079
      John Fastabend 提交于
      Before using the psock returned by sk_psock_get() when adding it to a
      sockmap we need to ensure it is actually a sockmap based psock.
      Previously we were only checking this after incrementing the reference
      counter which was an error. This resulted in a slab-out-of-bounds
      error when the psock was not actually a sockmap type.
      
      This moves the check up so the reference counter is only used
      if it is a sockmap psock.
      
      Eric reported the following KASAN BUG,
      
      BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
      BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
      Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387
      
      CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
       print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
       atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
       refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
       sk_psock_get include/linux/skmsg.h:379 [inline]
       sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
       sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
       sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
       map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5032d079
    • S
      bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB · b39b5f41
      Song Liu 提交于
      BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
      skb. This patch enables direct access of skb for these programs.
      
      Two helper functions bpf_compute_and_save_data_end() and
      bpf_restore_data_end() are introduced. There are used in
      __cgroup_bpf_run_filter_skb(), to compute proper data_end for the
      BPF program, and restore original data afterwards.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b39b5f41
    • M
      bpf: add queue and stack maps · f1a2e44a
      Mauricio Vasquez B 提交于
      Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
      These maps support peek, pop and push operations that are exposed to eBPF
      programs through the new bpf_map[peek/pop/push] helpers.  Those operations
      are exposed to userspace applications through the already existing
      syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM            -> peek
      BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
      BPF_MAP_UPDATE_ELEM            -> push
      
      Queue/stack maps are implemented using a buffer, tail and head indexes,
      hence BPF_F_NO_PREALLOC is not supported.
      
      As opposite to other maps, queue and stack do not use RCU for protecting
      maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
      argument that is a pointer to a memory zone where to save the value of a
      map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
      be passed as an extra argument.
      
      Our main motivation for implementing queue/stack maps was to keep track
      of a pool of elements, like network ports in a SNAT, however we forsee
      other use cases, like for exampling saving last N kernel events in a map
      and then analysing from userspace.
      Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f1a2e44a
    • D
      Revert "bond: take rcu lock in netpoll_send_skb_on_dev" · 48995423
      David S. Miller 提交于
      This reverts commit 6fe94878.
      
      It is causing more serious regressions than the RCU warning
      it is fixing.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48995423
  12. 16 10月, 2018 8 次提交
    • E
      net: extend sk_pacing_rate to unsigned long · 76a9ebe8
      Eric Dumazet 提交于
      sk_pacing_rate has beed introduced as a u32 field in 2013,
      effectively limiting per flow pacing to 34Gbit.
      
      We believe it is time to allow TCP to pace high speed flows
      on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
      
      This patch adds no cost for 32bit kernels.
      
      The tcpi_pacing_rate and tcpi_max_pacing_rate were already
      exported as 64bit, so iproute2/ss command require no changes.
      
      Unfortunately the SO_MAX_PACING_RATE socket option will stay
      32bit and we will need to add a new option to let applications
      control high pacing rates.
      
      State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
      ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                       timer:(on,003ms,0) ino:91863 sk:2 <->
       skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
       ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
       rcvmss:536 advmss:1448
       cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
       segs_in:3916318 data_segs_out:177279175
       bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
       send 28045.5Mbps lastrcv:73333
       pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
       busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
       notsent:2085120 minrtt:0.013
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76a9ebe8
    • M
      FDDI: defza: Support capturing outgoing SMT traffic · 9f9a742d
      Maciej W. Rozycki 提交于
      DEC FDDIcontroller 700 (DEFZA) uses a Tx/Rx queue pair to communicate
      SMT frames with adapter's firmware.  Any SMT frame received from the RMC
      via the Rx queue is queued back by the driver to the SMT Rx queue for
      the firmware to process.  Similarly the firmware uses the SMT Tx queue
      to supply the driver with SMT frames which are queued back to the Tx
      queue for the RMC to send to the ring.
      
      When a network tap is attached to an FDDI interface handled by `defza'
      any incoming SMT frames captured are queued to our usual processing of
      network data received, which in turn delivers them to any listening
      taps.
      
      However the outgoing SMT frames produced by the firmware bypass our
      network protocol stack and are therefore not delivered to taps.  This in
      turn means that taps are missing a part of network traffic sent by the
      adapter, which may make it more difficult to track down network problems
      or do general traffic analysis.
      
      Call `dev_queue_xmit_nit' then in the SMT Tx path, having checked that
      a network tap is attached, with a newly-created `dev_nit_active' helper
      wrapping the usual condition used in the transmit path.
      Signed-off-by: NMaciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f9a742d
    • W
      ethtool: fix a privilege escalation bug · 58f5bbe3
      Wenwen Wang 提交于
      In dev_ethtool(), the eth command 'ethcmd' is firstly copied from the
      use-space buffer 'useraddr' and checked to see whether it is
      ETHTOOL_PERQUEUE. If yes, the sub-command 'sub_cmd' is further copied from
      the user space. Otherwise, 'sub_cmd' is the same as 'ethcmd'. Next,
      according to 'sub_cmd', a permission check is enforced through the function
      ns_capable(). For example, the permission check is required if 'sub_cmd' is
      ETHTOOL_SCOALESCE, but it is not necessary if 'sub_cmd' is
      ETHTOOL_GCOALESCE, as suggested in the comment "Allow some commands to be
      done by anyone". The following execution invokes different handlers
      according to 'ethcmd'. Specifically, if 'ethcmd' is ETHTOOL_PERQUEUE,
      ethtool_set_per_queue() is called. In ethtool_set_per_queue(), the kernel
      object 'per_queue_opt' is copied again from the user-space buffer
      'useraddr' and 'per_queue_opt.sub_command' is used to determine which
      operation should be performed. Given that the buffer 'useraddr' is in the
      user space, a malicious user can race to change the sub-command between the
      two copies. In particular, the attacker can supply ETHTOOL_PERQUEUE and
      ETHTOOL_GCOALESCE to bypass the permission check in dev_ethtool(). Then
      before ethtool_set_per_queue() is called, the attacker changes
      ETHTOOL_GCOALESCE to ETHTOOL_SCOALESCE. In this way, the attacker can
      bypass the permission check and execute ETHTOOL_SCOALESCE.
      
      This patch enforces a check in ethtool_set_per_queue() after the second
      copy from 'useraddr'. If the sub-command is different from the one obtained
      in the first copy in dev_ethtool(), an error code EINVAL will be returned.
      
      Fixes: f38d138a ("net/ethtool: support set coalesce per queue")
      Signed-off-by: NWenwen Wang <wang6495@umn.edu>
      Reviewed-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      58f5bbe3
    • W
      ethtool: fix a missing-check bug · 2bb3207d
      Wenwen Wang 提交于
      In ethtool_get_rxnfc(), the eth command 'cmd' is compared against
      'ETHTOOL_GRXFH' to see whether it is necessary to adjust the variable
      'info_size'. Then the whole structure of 'info' is copied from the
      user-space buffer 'useraddr' with 'info_size' bytes. In the following
      execution, 'info' may be copied again from the buffer 'useraddr' depending
      on the 'cmd' and the 'info.flow_type'. However, after these two copies,
      there is no check between 'cmd' and 'info.cmd'. In fact, 'cmd' is also
      copied from the buffer 'useraddr' in dev_ethtool(), which is the caller
      function of ethtool_get_rxnfc(). Given that 'useraddr' is in the user
      space, a malicious user can race to change the eth command in the buffer
      between these copies. By doing so, the attacker can supply inconsistent
      data and cause undefined behavior because in the following execution 'info'
      will be passed to ops->get_rxnfc().
      
      This patch adds a necessary check on 'info.cmd' and 'cmd' to confirm that
      they are still same after the two copies in ethtool_get_rxnfc(). Otherwise,
      an error code EINVAL will be returned.
      Signed-off-by: NWenwen Wang <wang6495@umn.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bb3207d
    • J
      bpf: Fix IPv6 dport byte-order in bpf_sk_lookup · 5ef0ae84
      Joe Stringer 提交于
      Commit 6acc9b43 ("bpf: Add helper to retrieve socket in BPF")
      mistakenly passed the destination port in network byte-order to the IPv6
      TCP/UDP socket lookup functions, which meant that BPF writers would need
      to either manually swap the byte-order of this field or otherwise IPv6
      sockets could not be located via this helper.
      
      Fix the issue by swapping the byte-order appropriately in the helper.
      This also makes the API more consistent with the IPv4 version.
      
      Fixes: 6acc9b43 ("bpf: Add helper to retrieve socket in BPF")
      Signed-off-by: NJoe Stringer <joe@wand.net.nz>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      5ef0ae84
    • J
      bpf: Allow sk_lookup with IPv6 module · 8a615c6b
      Joe Stringer 提交于
      This is a more complete fix than d71019b5 ("net: core: Fix build
      with CONFIG_IPV6=m"), so that IPv6 sockets may be looked up if the IPv6
      module is loaded (not just if it's compiled in).
      Signed-off-by: NJoe Stringer <joe@wand.net.nz>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      8a615c6b
    • D
      tls: convert to generic sk_msg interface · d829e9c4
      Daniel Borkmann 提交于
      Convert kTLS over to make use of sk_msg interface for plaintext and
      encrypted scattergather data, so it reuses all the sk_msg helpers
      and data structure which later on in a second step enables to glue
      this to BPF.
      
      This also allows to remove quite a bit of open coded helpers which
      are covered by the sk_msg API. Recent changes in kTLs 80ece6a0
      ("tls: Remove redundant vars from tls record structure") and
      4e6d4720 ("tls: Add support for inplace records encryption")
      changed the data path handling a bit; while we've kept the latter
      optimization intact, we had to undo the former change to better
      fit the sk_msg model, hence the sg_aead_in and sg_aead_out have
      been brought back and are linked into the sk_msg sgs. Now the kTLS
      record contains a msg_plaintext and msg_encrypted sk_msg each.
      
      In the original code, the zerocopy_from_iter() has been used out
      of TX but also RX path. For the strparser skb-based RX path,
      we've left the zerocopy_from_iter() in decrypt_internal() mostly
      untouched, meaning it has been moved into tls_setup_from_iter()
      with charging logic removed (as not used from RX). Given RX path
      is not based on sk_msg objects, we haven't pursued setting up a
      dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it
      could be an option to prusue in a later step.
      
      Joint work with John.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d829e9c4
    • D
      bpf, sockmap: convert to generic sk_msg interface · 604326b4
      Daniel Borkmann 提交于
      Add a generic sk_msg layer, and convert current sockmap and later
      kTLS over to make use of it. While sk_buff handles network packet
      representation from netdevice up to socket, sk_msg handles data
      representation from application to socket layer.
      
      This means that sk_msg framework spans across ULP users in the
      kernel, and enables features such as introspection or filtering
      of data with the help of BPF programs that operate on this data
      structure.
      
      Latter becomes in particular useful for kTLS where data encryption
      is deferred into the kernel, and as such enabling the kernel to
      perform L7 introspection and policy based on BPF for TLS connections
      where the record is being encrypted after BPF has run and came to
      a verdict. In order to get there, first step is to transform open
      coding of scatter-gather list handling into a common core framework
      that subsystems can use.
      
      The code itself has been split and refactored into three bigger
      pieces: i) the generic sk_msg API which deals with managing the
      scatter gather ring, providing helpers for walking and mangling,
      transferring application data from user space into it, and preparing
      it for BPF pre/post-processing, ii) the plain sock map itself
      where sockets can be attached to or detached from; these bits
      are independent of i) which can now be used also without sock
      map, and iii) the integration with plain TCP as one protocol
      to be used for processing L7 application data (later this could
      e.g. also be extended to other protocols like UDP). The semantics
      are the same with the old sock map code and therefore no change
      of user facing behavior or APIs. While pursuing this work it
      also helped finding a number of bugs in the old sockmap code
      that we've fixed already in earlier commits. The test_sockmap
      kselftest suite passes through fine as well.
      
      Joint work with John.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      604326b4
  13. 14 10月, 2018 1 次提交
  14. 13 10月, 2018 2 次提交
    • N
      net: bridge: add support for per-port vlan stats · 9163a0fc
      Nikolay Aleksandrov 提交于
      This patch adds an option to have per-port vlan stats instead of the
      default global stats. The option can be set only when there are no port
      vlans in the bridge since we need to allocate the stats if it is set
      when vlans are being added to ports (and respectively free them
      when being deleted). Also bump RTNL_MAX_TYPE as the bridge is the
      largest user of options. The current stats design allows us to add
      these without any changes to the fast-path, it all comes down to
      the per-vlan stats pointer which, if this option is enabled, will
      be allocated for each port vlan instead of using the global bridge-wide
      one.
      
      CC: bridge@lists.linux-foundation.org
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9163a0fc
    • D
      net: Evict neighbor entries on carrier down · 859bd2ef
      David Ahern 提交于
      When a link's carrier goes down it could be a sign of the port changing
      networks. If the new network has overlapping addresses with the old one,
      then the kernel will continue trying to use neighbor entries established
      based on the old network until the entries finally age out - meaning a
      potentially long delay with communications not working.
      
      This patch evicts neighbor entries on carrier down with the exception of
      those marked permanent. Permanent entries are managed by userspace (either
      an admin or a routing daemon such as FRR).
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      859bd2ef
  15. 11 10月, 2018 5 次提交
    • S
      net: ipv4: update fnhe_pmtu when first hop's MTU changes · af7d6cce
      Sabrina Dubroca 提交于
      Since commit 5aad1de5 ("ipv4: use separate genid for next hop
      exceptions"), exceptions get deprecated separately from cached
      routes. In particular, administrative changes don't clear PMTU anymore.
      
      As Stefano described in commit e9fa1495 ("ipv6: Reflect MTU changes
      on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
      the local MTU change can become stale:
       - if the local MTU is now lower than the PMTU, that PMTU is now
         incorrect
       - if the local MTU was the lowest value in the path, and is increased,
         we might discover a higher PMTU
      
      Similarly to what commit e9fa1495 did for IPv6, update PMTU in those
      cases.
      
      If the exception was locked, the discovered PMTU was smaller than the
      minimal accepted PMTU. In that case, if the new local MTU is smaller
      than the current PMTU, let PMTU discovery figure out if locking of the
      exception is still needed.
      
      To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
      notifier. By the time the notifier is called, dev->mtu has been
      changed. This patch adds the old MTU as additional information in the
      notifier structure, and a new call_netdevice_notifiers_u32() function.
      
      Fixes: 5aad1de5 ("ipv4: use separate genid for next hop exceptions")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af7d6cce
    • D
      rtnetlink: Update comment in rtnl_stats_dump regarding strict data checking · e75fa073
      David Ahern 提交于
      The NLM_F_DUMP_PROPER_HDR netlink flag was replaced by a setsockopt.
      Update the comment in rtnl_stats_dump.
      
      Fixes: 841891ec ("rtnetlink: Update rtnl_stats_dump for strict data checking")
      Reported-by: NChristian Brauner <christian@brauner.io>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e75fa073
    • D
      rtnetlink: Move ifm in valid_fdb_dump_legacy to closer to use · 4565d7e5
      David Ahern 提交于
      Move setting of local variable ifm to after the message parsing in
      valid_fdb_dump_legacy. Avoid potential future use of unchecked variable.
      
      Fixes: 8dfbda19 ("rtnetlink: Move input checking for rtnl_fdb_dump to helper")
      Reported-by: NChristian Brauner <christian@brauner.io>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4565d7e5
    • E
      net: make skb_partial_csum_set() more robust against overflows · 52b5d6f5
      Eric Dumazet 提交于
      syzbot managed to crash in skb_checksum_help() [1] :
      
              BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb));
      
      Root cause is the following check in skb_partial_csum_set()
      
      	if (unlikely(start > skb_headlen(skb)) ||
      	    unlikely((int)start + off > skb_headlen(skb) - 2))
      		return false;
      
      If skb_headlen(skb) is 1, then (skb_headlen(skb) - 2) becomes 0xffffffff
      and the check fails to detect that ((int)start + off) is off the limit,
      since the compare is unsigned.
      
      When we fix that, then the first condition (start > skb_headlen(skb))
      becomes obsolete.
      
      Then we should also check that (skb_headroom(skb) + start) wont
      overflow 16bit field.
      
      [1]
      kernel BUG at net/core/dev.c:2880!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 7330 Comm: syz-executor4 Not tainted 4.19.0-rc6+ #253
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:skb_checksum_help+0x9e3/0xbb0 net/core/dev.c:2880
      Code: 85 00 ff ff ff 48 c1 e8 03 42 80 3c 28 00 0f 84 09 fb ff ff 48 8b bd 00 ff ff ff e8 97 a8 b9 fb e9 f8 fa ff ff e8 2d 09 76 fb <0f> 0b 48 8b bd 28 ff ff ff e8 1f a8 b9 fb e9 b1 f6 ff ff 48 89 cf
      RSP: 0018:ffff8801d83a6f60 EFLAGS: 00010293
      RAX: ffff8801b9834380 RBX: ffff8801b9f8d8c0 RCX: ffffffff8608c6d7
      RDX: 0000000000000000 RSI: ffffffff8608cc63 RDI: 0000000000000006
      RBP: ffff8801d83a7068 R08: ffff8801b9834380 R09: 0000000000000000
      R10: ffff8801d83a76d8 R11: 0000000000000000 R12: 0000000000000001
      R13: 0000000000010001 R14: 000000000000ffff R15: 00000000000000a8
      FS:  00007f1a66db5700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f7d77f091b0 CR3: 00000001ba252000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       skb_csum_hwoffload_help+0x8f/0xe0 net/core/dev.c:3269
       validate_xmit_skb+0xa2a/0xf30 net/core/dev.c:3312
       __dev_queue_xmit+0xc2f/0x3950 net/core/dev.c:3797
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3838
       packet_snd net/packet/af_packet.c:2928 [inline]
       packet_sendmsg+0x422d/0x64c0 net/packet/af_packet.c:2953
      
      Fixes: 5ff8dda3 ("net: Ensure partial checksum offset is inside the skb head")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52b5d6f5
    • M
      devlink: Add helper function for safely copy string param · bde74ad1
      Moshe Shemesh 提交于
      Devlink string param buffer is allocated at the size of
      DEVLINK_PARAM_MAX_STRING_VALUE. Add helper function which makes sure
      this size is not exceeded.
      Renamed DEVLINK_PARAM_MAX_STRING_VALUE to
      __DEVLINK_PARAM_MAX_STRING_VALUE to emphasize that it should be used by
      devlink only. The driver should use the helper function instead to
      verify it doesn't exceed the allowed length.
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bde74ad1