1. 25 10月, 2018 4 次提交
  2. 24 10月, 2018 2 次提交
    • E
      tcp: add tcp_reset_xmit_timer() helper · 3f80e08f
      Eric Dumazet 提交于
      With EDT model, SRTT no longer is inflated by pacing delays.
      
      This means that RTO and some other xmit timers might be setup
      incorrectly. This is particularly visible with either :
      
      - Very small enforced pacing rates (SO_MAX_PACING_RATE)
      - Reduced rto (from the default 200 ms)
      
      This can lead to TCP flows aborts in the worst case,
      or spurious retransmits in other cases.
      
      For example, this session gets far more throughput
      than the requested 80kbit :
      
      $ netperf -H 127.0.0.2 -l 100 -- -q 10000
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      540000 262144 262144    104.00      2.66
      
      With the fix :
      
      $ netperf -H 127.0.0.2 -l 100 -- -q 10000
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      540000 262144 262144    104.00      0.12
      
      EDT allows for better control of rtx timers, since TCP has
      a better idea of the earliest departure time of each skb
      in the rtx queue. We only have to eventually add to the
      timer the difference of the EDT time with current time.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3f80e08f
    • K
      Revert "net: simplify sock_poll_wait" · 89ab066d
      Karsten Graul 提交于
      This reverts commit dd979b4d.
      
      This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
      internal TCP socket for the initial handshake with the remote peer.
      Whenever the SMC connection can not be established this TCP socket is
      used as a fallback. All socket operations on the SMC socket are then
      forwarded to the TCP socket. In case of poll, the file->private_data
      pointer references the SMC socket because the TCP socket has no file
      assigned. This causes tcp_poll to wait on the wrong socket.
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      89ab066d
  3. 23 10月, 2018 11 次提交
    • E
      llc: do not use sk_eat_skb() · 604d415e
      Eric Dumazet 提交于
      syzkaller triggered a use-after-free [1], caused by a combination of
      skb_get() in llc_conn_state_process() and usage of sk_eat_skb()
      
      sk_eat_skb() is assuming the skb about to be freed is only used by
      the current thread. TCP/DCCP stacks enforce this because current
      thread holds the socket lock.
      
      llc_conn_state_process() wants to make sure skb does not disappear,
      and holds a reference on the skb it manipulates. But as soon as this
      skb is added to socket receive queue, another thread can consume it.
      
      This means that llc must use regular skb_unlink() and kfree_skb()
      so that both producer and consumer can safely work on the same skb.
      
      [1]
      BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
      BUG: KASAN: use-after-free in refcount_read include/linux/refcount.h:43 [inline]
      BUG: KASAN: use-after-free in skb_unref include/linux/skbuff.h:967 [inline]
      BUG: KASAN: use-after-free in kfree_skb+0xb7/0x580 net/core/skbuff.c:655
      Read of size 4 at addr ffff8801d1f6fba4 by task ksoftirqd/1/18
      
      CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.19.0-rc8+ #295
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c4/0x2b6 lib/dump_stack.c:113
       print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
       atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
       refcount_read include/linux/refcount.h:43 [inline]
       skb_unref include/linux/skbuff.h:967 [inline]
       kfree_skb+0xb7/0x580 net/core/skbuff.c:655
       llc_sap_state_process+0x9b/0x550 net/llc/llc_sap.c:224
       llc_sap_rcv+0x156/0x1f0 net/llc/llc_sap.c:297
       llc_sap_handler+0x65e/0xf80 net/llc/llc_sap.c:438
       llc_rcv+0x79e/0xe20 net/llc/llc_input.c:208
       __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4913
       __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5023
       process_backlog+0x218/0x6f0 net/core/dev.c:5829
       napi_poll net/core/dev.c:6249 [inline]
       net_rx_action+0x7c5/0x1950 net/core/dev.c:6315
       __do_softirq+0x30c/0xb03 kernel/softirq.c:292
       run_ksoftirqd+0x94/0x100 kernel/softirq.c:653
       smpboot_thread_fn+0x68b/0xa00 kernel/smpboot.c:164
       kthread+0x35a/0x420 kernel/kthread.c:246
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413
      
      Allocated by task 18:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
       kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
       kmem_cache_alloc_node+0x144/0x730 mm/slab.c:3644
       __alloc_skb+0x119/0x770 net/core/skbuff.c:193
       alloc_skb include/linux/skbuff.h:995 [inline]
       llc_alloc_frame+0xbc/0x370 net/llc/llc_sap.c:54
       llc_station_ac_send_xid_r net/llc/llc_station.c:52 [inline]
       llc_station_rcv+0x1dc/0x1420 net/llc/llc_station.c:111
       llc_rcv+0xc32/0xe20 net/llc/llc_input.c:220
       __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4913
       __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5023
       process_backlog+0x218/0x6f0 net/core/dev.c:5829
       napi_poll net/core/dev.c:6249 [inline]
       net_rx_action+0x7c5/0x1950 net/core/dev.c:6315
       __do_softirq+0x30c/0xb03 kernel/softirq.c:292
      
      Freed by task 16383:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
       kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
       __cache_free mm/slab.c:3498 [inline]
       kmem_cache_free+0x83/0x290 mm/slab.c:3756
       kfree_skbmem+0x154/0x230 net/core/skbuff.c:582
       __kfree_skb+0x1d/0x20 net/core/skbuff.c:642
       sk_eat_skb include/net/sock.h:2366 [inline]
       llc_ui_recvmsg+0xec2/0x1610 net/llc/af_llc.c:882
       sock_recvmsg_nosec net/socket.c:794 [inline]
       sock_recvmsg+0xd0/0x110 net/socket.c:801
       ___sys_recvmsg+0x2b6/0x680 net/socket.c:2278
       __sys_recvmmsg+0x303/0xb90 net/socket.c:2390
       do_sys_recvmmsg+0x181/0x1a0 net/socket.c:2466
       __do_sys_recvmmsg net/socket.c:2484 [inline]
       __se_sys_recvmmsg net/socket.c:2480 [inline]
       __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2480
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8801d1f6fac0
       which belongs to the cache skbuff_head_cache of size 232
      The buggy address is located 228 bytes inside of
       232-byte region [ffff8801d1f6fac0, ffff8801d1f6fba8)
      The buggy address belongs to the page:
      page:ffffea000747dbc0 count:1 mapcount:0 mapping:ffff8801d9be7680 index:0xffff8801d1f6fe80
      flags: 0x2fffc0000000100(slab)
      raw: 02fffc0000000100 ffffea0007346e88 ffffea000705b108 ffff8801d9be7680
      raw: ffff8801d1f6fe80 ffff8801d1f6f0c0 000000010000000b 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8801d1f6fa80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
       ffff8801d1f6fb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff8801d1f6fb80: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
                                     ^
       ffff8801d1f6fc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8801d1f6fc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      604d415e
    • W
      net: dsa: legacy: simplify getting .driver_data · 2af1ccd5
      Wolfram Sang 提交于
      We should get 'driver_data' from 'struct device' directly. Going via
      platform_device is an unneeded step back and forth.
      Signed-off-by: NWolfram Sang <wsa+renesas@sang-engineering.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2af1ccd5
    • D
      net/sched: act_police: disallow 'goto chain' on fallback control action · c08f5ed5
      Davide Caratti 提交于
      in the following command:
      
       # tc action add action police rate <r> burst <b> conform-exceed <c1>/<c2>
      
      'goto chain x' is allowed only for c1: setting it for c2 makes the kernel
      crash with NULL pointer dereference, since TC core doesn't initialize the
      chain handle.
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c08f5ed5
    • D
      net/sched: act_gact: disallow 'goto chain' on fallback control action · 9469f375
      Davide Caratti 提交于
      in the following command:
      
       # tc action add action <c1> random <rand_type> <c2> <rand_param>
      
      'goto chain x' is allowed only for c1: setting it for c2 makes the kernel
      crash with NULL pointer dereference, since TC core doesn't initialize the
      chain handle.
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9469f375
    • O
      4b78030b
    • D
      net/ipv6: Add support for dumping addresses for a specific device · 6371a71f
      David Ahern 提交于
      If an RTM_GETADDR dump request has ifa_index set in the ifaddrmsg
      header, then return only the addresses for that device.
      
      Since inet6_dump_addr is reused for multicast and anycast addresses,
      this adds support for device specfic dumps of RTM_GETMULTICAST and
      RTM_GETANYCAST as well.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6371a71f
    • D
      net/ipv4: Add support for dumping addresses for a specific device · 5fcd266a
      David Ahern 提交于
      If an RTM_GETADDR dump request has ifa_index set in the ifaddrmsg
      header, then return only the addresses for that device.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5fcd266a
    • D
      net/ipv6: Remove ip_idx arg to in6_dump_addrs · fe884c2b
      David Ahern 提交于
      ip_idx is always 0 going into in6_dump_addrs; it is passed as a pointer
      to save the last good index into cb. Since cb is already argument to
      in6_dump_addrs, just save the value there.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe884c2b
    • D
      net/ipv4: Move loop over addresses on a device into in_dev_dump_addr · 1c98eca4
      David Ahern 提交于
      Similar to IPv6 move the logic that walks over the ipv4 address list
      for a device into a helper.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c98eca4
    • J
      tipc: eliminate message disordering during binding table update · 988f3f16
      Jon Maloy 提交于
      We have seen the following race scenario:
      1) named_distribute() builds a "bulk" message, containing a PUBLISH
         item for a certain publication. This is based on the contents of
         the binding tables's 'cluster_scope' list.
      2) tipc_named_withdraw() removes the same publication from the list,
         bulds a WITHDRAW message and distributes it to all cluster nodes.
      3) tipc_named_node_up(), which was calling named_distribute(), sends
         out the bulk message built under 1)
      4) The WITHDRAW message arrives at the just detected node, finds
         no corresponding publication, and is dropped.
      5) The PUBLISH item arrives at the same node, is added to its binding
         table, and remains there forever.
      
      This arrival disordering was earlier taken care of by the backlog queue,
      originally added for a different purpose, which was removed in the
      commit referred to below, but we now need a different solution.
      In this commit, we replace the rcu lock protecting the 'cluster_scope'
      list with a regular RW lock which comprises even the sending of the
      bulk message. This both guarantees both the list integrity and the
      message sending order. We will later add a commit which cleans up
      this code further.
      
      Note that this commit needs recently added commit d3092b2e ("tipc:
      fix unsafe rcu locking when accessing publication list") to apply
      cleanly.
      
      Fixes: 37922ea4 ("tipc: permit overlapping service ranges in name table")
      Reported-by: NTuong Lien Tong <tuong.t.lien@dektech.com.au>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      988f3f16
    • G
      tipc: use destination length for copy string · 29e270fc
      Guoqing Jiang 提交于
      Got below warning with gcc 8.2 compiler.
      
      net/tipc/topsrv.c: In function ‘tipc_topsrv_start’:
      net/tipc/topsrv.c:660:2: warning: ‘strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=]
        strncpy(srv->name, name, strlen(name) + 1);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      net/tipc/topsrv.c:660:27: note: length computed here
        strncpy(srv->name, name, strlen(name) + 1);
                                 ^~~~~~~~~~~~
      So change it to correct length and use strscpy.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29e270fc
  4. 21 10月, 2018 4 次提交
  5. 20 10月, 2018 7 次提交
    • D
      net: fix pskb_trim_rcsum_slow() with odd trim offset · d55bef50
      Dimitris Michailidis 提交于
      We've been getting checksum errors involving small UDP packets, usually
      59B packets with 1 extra non-zero padding byte. netdev_rx_csum_fault()
      has been complaining that HW is providing bad checksums. Turns out the
      problem is in pskb_trim_rcsum_slow(), introduced in commit 88078d98
      ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends").
      
      The source of the problem is that when the bytes we are trimming start
      at an odd address, as in the case of the 1 padding byte above,
      skb_checksum() returns a byte-swapped value. We cannot just combine this
      with skb->csum using csum_sub(). We need to use csum_block_sub() here
      that takes into account the parity of the start address and handles the
      swapping.
      
      Matches existing code in __skb_postpull_rcsum() and esp_remove_trailer().
      
      Fixes: 88078d98 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
      Signed-off-by: NDimitris Michailidis <dmichail@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d55bef50
    • D
      netpoll: allow cleanup to be synchronous · c9fbd71f
      Debabrata Banerjee 提交于
      This fixes a problem introduced by:
      commit 2cde6acd ("netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock")
      
      When using netconsole on a bond, __netpoll_cleanup can asynchronously
      recurse multiple times, each __netpoll_free_async call can result in
      more __netpoll_free_async's. This means there is now a race between
      cleanup_work queues on multiple netpoll_info's on multiple devices and
      the configuration of a new netpoll. For example if a netconsole is set
      to enable 0, reconfigured, and enable 1 immediately, this netconsole
      will likely not work.
      
      Given the reason for __netpoll_free_async is it can be called when rtnl
      is not locked, if it is locked, we should be able to execute
      synchronously. It appears to be locked everywhere it's called from.
      
      Generalize the design pattern from the teaming driver for current
      callers of __netpoll_free_async.
      
      CC: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NDebabrata Banerjee <dbanerje@akamai.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9fbd71f
    • J
      bpf: skmsg, fix psock create on existing kcm/tls port · 5032d079
      John Fastabend 提交于
      Before using the psock returned by sk_psock_get() when adding it to a
      sockmap we need to ensure it is actually a sockmap based psock.
      Previously we were only checking this after incrementing the reference
      counter which was an error. This resulted in a slab-out-of-bounds
      error when the psock was not actually a sockmap type.
      
      This moves the check up so the reference counter is only used
      if it is a sockmap psock.
      
      Eric reported the following KASAN BUG,
      
      BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
      BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
      Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387
      
      CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
       print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
       atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
       refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
       sk_psock_get include/linux/skmsg.h:379 [inline]
       sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
       sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
       sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
       map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5032d079
    • S
      bpf: add tests for direct packet access from CGROUP_SKB · 2cb494a3
      Song Liu 提交于
      Tests are added to make sure CGROUP_SKB cannot access:
        tc_classid, data_meta, flow_keys
      
      and can read and write:
        mark, prority, and cb[0-4]
      
      and can read other fields.
      
      To make selftest with skb->sk work, a dummy sk is added in
      bpf_prog_test_run_skb().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      2cb494a3
    • S
      bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB · b39b5f41
      Song Liu 提交于
      BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
      skb. This patch enables direct access of skb for these programs.
      
      Two helper functions bpf_compute_and_save_data_end() and
      bpf_restore_data_end() are introduced. There are used in
      __cgroup_bpf_run_filter_skb(), to compute proper data_end for the
      BPF program, and restore original data afterwards.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b39b5f41
    • M
      bpf: add queue and stack maps · f1a2e44a
      Mauricio Vasquez B 提交于
      Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
      These maps support peek, pop and push operations that are exposed to eBPF
      programs through the new bpf_map[peek/pop/push] helpers.  Those operations
      are exposed to userspace applications through the already existing
      syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM            -> peek
      BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
      BPF_MAP_UPDATE_ELEM            -> push
      
      Queue/stack maps are implemented using a buffer, tail and head indexes,
      hence BPF_F_NO_PREALLOC is not supported.
      
      As opposite to other maps, queue and stack do not use RCU for protecting
      maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
      argument that is a pointer to a memory zone where to save the value of a
      map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
      be passed as an extra argument.
      
      Our main motivation for implementing queue/stack maps was to keep track
      of a pool of elements, like network ports in a SNAT, however we forsee
      other use cases, like for exampling saving last N kernel events in a map
      and then analysing from userspace.
      Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f1a2e44a
    • D
      Revert "bond: take rcu lock in netpoll_send_skb_on_dev" · 48995423
      David S. Miller 提交于
      This reverts commit 6fe94878.
      
      It is causing more serious regressions than the RCU warning
      it is fixing.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48995423
  6. 19 10月, 2018 12 次提交
    • P
      Revert "netfilter: xt_quota: fix the behavior of xt_quota module" · af510ebd
      Pablo Neira Ayuso 提交于
      This reverts commit e9837e55.
      
      When talking to Maze and Chenbo, we agreed to keep this back by now
      due to problems in the ruleset listing path with 32-bit arches.
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      af510ebd
    • W
      netfilter: remove two unused variables. · da8a705c
      Weongyo Jeong 提交于
      nft_dup_netdev_ingress_ops and nft_fwd_netdev_ingress_ops variables are
      no longer used at the code.
      Signed-off-by: NWeongyo Jeong <weongyo.linux@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      da8a705c
    • T
      netfilter: nf_flow_table: do not remove offload when other netns's interface is down · a3fb3698
      Taehee Yoo 提交于
      When interface is down, offload cleanup function(nf_flow_table_do_cleanup)
      is called and that checks whether interface index of offload and
      index of link down interface is same. but only interface index checking
      is not enough because flowtable is not pernet list.
      So that, if other netns's interface that has index is same with offload
      is down, that offload will be removed.
      This patch adds netns checking code to the offload cleanup routine.
      
      Fixes: 59c466dd ("netfilter: nf_flow_table: add a new flow state for tearing down offloading")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      a3fb3698
    • T
      netfilter: nf_flow_table: remove unnecessary parameter of nf_flow_table_cleanup() · 5f1be84a
      Taehee Yoo 提交于
      parameter net of nf_flow_table_cleanup() is not used.
      So that it can be removed.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      5f1be84a
    • T
      netfilter: nf_flow_table: remove flowtable hook flush routine in netns exit routine · b7f1a16d
      Taehee Yoo 提交于
      When device is unregistered, flowtable flush routine is called
      by notifier_call(nf_tables_flowtable_event). and exit callback of
      nftables pernet_operation(nf_tables_exit_net) also has flowtable flush
      routine. but when network namespace is destroyed, both notifier_call
      and pernet_operation are called. hence flowtable flush routine in
      pernet_operation is unnecessary.
      
      test commands:
         %ip netns add vm1
         %ip netns exec vm1 nft add table ip filter
         %ip netns exec vm1 nft add flowtable ip filter w \
      	{ hook ingress priority 0\; devices = { lo }\; }
         %ip netns del vm1
      
      splat looks like:
      [  265.187019] WARNING: CPU: 0 PID: 87 at net/netfilter/core.c:309 nf_hook_entry_head+0xc7/0xf0
      [  265.187112] Modules linked in: nf_flow_table_ipv4 nf_flow_table nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip_tables x_tables
      [  265.187390] CPU: 0 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc3+ #5
      [  265.187453] Workqueue: netns cleanup_net
      [  265.187514] RIP: 0010:nf_hook_entry_head+0xc7/0xf0
      [  265.187546] Code: 8d 81 68 03 00 00 5b c3 89 d0 83 fa 04 48 8d 84 c7 e8 11 00 00 76 81 0f 0b 31 c0 e9 78 ff ff ff 0f 0b 48 83 c4 08 31 c0 5b c3 <0f> 0b 31 c0 e9 65 ff ff ff 0f 0b 31 c0 e9 5c ff ff ff 48 89 0c 24
      [  265.187573] RSP: 0018:ffff88011546f098 EFLAGS: 00010246
      [  265.187624] RAX: ffffffff8d90e135 RBX: 1ffff10022a8de1c RCX: 0000000000000000
      [  265.187645] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff880116298040
      [  265.187645] RBP: ffff88010ea4c1a8 R08: 0000000000000000 R09: 0000000000000000
      [  265.187645] R10: ffff88011546f1d8 R11: ffffed0022c532c1 R12: ffff88010ea4c1d0
      [  265.187645] R13: 0000000000000005 R14: dffffc0000000000 R15: ffff88010ea4c1c4
      [  265.187645] FS:  0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000
      [  265.187645] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  265.187645] CR2: 00007fdfb8d00000 CR3: 0000000057a16000 CR4: 00000000001006f0
      [  265.187645] Call Trace:
      [  265.187645]  __nf_unregister_net_hook+0xca/0x5d0
      [  265.187645]  ? nf_hook_entries_free.part.3+0x80/0x80
      [  265.187645]  ? save_trace+0x300/0x300
      [  265.187645]  nf_unregister_net_hooks+0x2e/0x40
      [  265.187645]  nf_tables_exit_net+0x479/0x1340 [nf_tables]
      [  265.187645]  ? find_held_lock+0x39/0x1c0
      [  265.187645]  ? nf_tables_abort+0x30/0x30 [nf_tables]
      [  265.187645]  ? inet_frag_destroy_rcu+0xd0/0xd0
      [  265.187645]  ? trace_hardirqs_on+0x93/0x210
      [  265.187645]  ? __bpf_trace_preemptirq_template+0x10/0x10
      [  265.187645]  ? inet_frag_destroy_rcu+0xd0/0xd0
      [  265.187645]  ? inet_frag_destroy_rcu+0xd0/0xd0
      [  265.187645]  ? __mutex_unlock_slowpath+0x17f/0x740
      [  265.187645]  ? wait_for_completion+0x710/0x710
      [  265.187645]  ? bucket_table_free+0xb2/0x1f0
      [  265.187645]  ? nested_table_free+0x130/0x130
      [  265.187645]  ? __lock_is_held+0xb4/0x140
      [  265.187645]  ops_exit_list.isra.10+0x94/0x140
      [  265.187645]  cleanup_net+0x45b/0x900
      [ ... ]
      
      This WARNING means that hook unregisteration is failed because
      all flowtables hooks are already unregistered by notifier_call.
      
      Network namespace exit routine guarantees that all devices will be
      unregistered first. then, other exit callbacks of pernet_operations
      are called. so that removing flowtable flush routine in exit callback of
      pernet_operation(nf_tables_exit_net) doesn't make flowtable leak.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      b7f1a16d
    • S
      ip6_tunnel: Fix encapsulation layout · d4d576f5
      Stefano Brivio 提交于
      Commit 058214a4 ("ip6_tun: Add infrastructure for doing
      encapsulation") added the ip6_tnl_encap() call in ip6_tnl_xmit(), before
      the call to ipv6_push_frag_opts() to append the IPv6 Tunnel Encapsulation
      Limit option (option 4, RFC 2473, par. 5.1) to the outer IPv6 header.
      
      As long as the option didn't actually end up in generated packets, this
      wasn't an issue. Then commit 89a23c8b ("ip6_tunnel: Fix missing tunnel
      encapsulation limit option") fixed sending of this option, and the
      resulting layout, e.g. for FoU, is:
      
      .-------------------.------------.----------.-------------------.----- - -
      | Outer IPv6 Header | UDP header | Option 4 | Inner IPv6 Header | Payload
      '-------------------'------------'----------'-------------------'----- - -
      
      Needless to say, FoU and GUE (at least) won't work over IPv6. The option
      is appended by default, and I couldn't find a way to disable it with the
      current iproute2.
      
      Turn this into a more reasonable:
      
      .-------------------.----------.------------.-------------------.----- - -
      | Outer IPv6 Header | Option 4 | UDP header | Inner IPv6 Header | Payload
      '-------------------'----------'------------'-------------------'----- - -
      
      With this, and with 84dad559 ("udp6: fix encap return code for
      resubmitting"), FoU and GUE work again over IPv6.
      
      Fixes: 058214a4 ("ip6_tun: Add infrastructure for doing encapsulation")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4d576f5
    • E
      tcp: fix TCP_REPAIR xmit queue setup · 79861919
      Eric Dumazet 提交于
      Andrey reported the following warning triggered while running CRIU tests:
      
      tcp_clean_rtx_queue()
      ...
      	last_ackt = tcp_skb_timestamp_us(skb);
      	WARN_ON_ONCE(last_ackt == 0);
      
      This is caused by 5f6188a8 ("tcp: do not change tcp_wstamp_ns
      in tcp_mstamp_refresh"), as we end up having skbs in retransmit queue
      with a zero skb->skb_mstamp_ns field.
      
      We could fix this bug in different ways, like making sure
      tp->tcp_wstamp_ns is not zero at socket creation, but as Neal pointed
      out, we also do not want that pacing status of a repaired socket
      could push tp->tcp_wstamp_ns far ahead in the future.
      
      So we prefer changing tcp_write_xmit() to not call tcp_update_skb_after_send()
      and instead do what is requested by TCP_REPAIR logic.
      
      Fixes: 5f6188a8 ("tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Vagin <avagin@openvz.org>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79861919
    • J
      tipc: fix info leak from kernel tipc_event · b06f9d9f
      Jon Maloy 提交于
      We initialize a struct tipc_event allocated on the kernel stack to
      zero to avert info leak to user space.
      
      Reported-by: syzbot+057458894bc8cada4dee@syzkaller.appspotmail.com
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b06f9d9f
    • W
      net: socket: fix a missing-check bug · b6168562
      Wenwen Wang 提交于
      In ethtool_ioctl(), the ioctl command 'ethcmd' is checked through a switch
      statement to see whether it is necessary to pre-process the ethtool
      structure, because, as mentioned in the comment, the structure
      ethtool_rxnfc is defined with padding. If yes, a user-space buffer 'rxnfc'
      is allocated through compat_alloc_user_space(). One thing to note here is
      that, if 'ethcmd' is ETHTOOL_GRXCLSRLALL, the size of the buffer 'rxnfc' is
      partially determined by 'rule_cnt', which is actually acquired from the
      user-space buffer 'compat_rxnfc', i.e., 'compat_rxnfc->rule_cnt', through
      get_user(). After 'rxnfc' is allocated, the data in the original user-space
      buffer 'compat_rxnfc' is then copied to 'rxnfc' through copy_in_user(),
      including the 'rule_cnt' field. However, after this copy, no check is
      re-enforced on 'rxnfc->rule_cnt'. So it is possible that a malicious user
      race to change the value in the 'compat_rxnfc->rule_cnt' between these two
      copies. Through this way, the attacker can bypass the previous check on
      'rule_cnt' and inject malicious data. This can cause undefined behavior of
      the kernel and introduce potential security risk.
      
      This patch avoids the above issue via copying the value acquired by
      get_user() to 'rxnfc->rule_cn', if 'ethcmd' is ETHTOOL_GRXCLSRLALL.
      Signed-off-by: NWenwen Wang <wang6495@umn.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6168562
    • P
      net: sched: Fix for duplicate class dump · 3c53ed8f
      Phil Sutter 提交于
      When dumping classes by parent, kernel would return classes twice:
      
      | # tc qdisc add dev lo root prio
      | # tc class show dev lo
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      | # tc class show dev lo parent 8001:
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      
      This comes from qdisc_match_from_root() potentially returning the root
      qdisc itself if its handle matched. Though in that case, root's classes
      were already dumped a few lines above.
      
      Fixes: cb395b20 ("net: sched: optimize class dumps")
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c53ed8f
    • X
      sctp: use sk_wmem_queued to check for writable space · cd305c74
      Xin Long 提交于
      sk->sk_wmem_queued is used to count the size of chunks in out queue
      while sk->sk_wmem_alloc is for counting the size of chunks has been
      sent. sctp is increasing both of them before enqueuing the chunks,
      and using sk->sk_wmem_alloc to check for writable space.
      
      However, sk_wmem_alloc is also increased by 1 for the skb allocked
      for sending in sctp_packet_transmit() but it will not wake up the
      waiters when sk_wmem_alloc is decreased in this skb's destructor.
      
      If msg size is equal to sk_sndbuf and sendmsg is waiting for sndbuf,
      the check 'msg_len <= sctp_wspace(asoc)' in sctp_wait_for_sndbuf()
      will keep waiting if there's a skb allocked in sctp_packet_transmit,
      and later even if this skb got freed, the waiting thread will never
      get waked up.
      
      This issue has been there since very beginning, so we change to use
      sk->sk_wmem_queued to check for writable space as sk_wmem_queued is
      not increased for the skb allocked for sending, also as TCP does.
      
      SOCK_SNDBUF_LOCK check is also removed here as it's for tx buf auto
      tuning which I will add in another patch.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd305c74
    • X
      sctp: count both sk and asoc sndbuf with skb truesize and sctp_chunk size · 605c0ac1
      Xin Long 提交于
      Now it's confusing that asoc sndbuf_used is doing memory accounting with
      SCTP_DATA_SNDSIZE(chunk) + sizeof(sk_buff) + sizeof(sctp_chunk) while sk
      sk_wmem_alloc is doing that with skb->truesize + sizeof(sctp_chunk).
      
      It also causes sctp_prsctp_prune to count with a wrong freed memory when
      sndbuf_policy is not set.
      
      To make this right and also keep consistent between asoc sndbuf_used, sk
      sk_wmem_alloc and sk_wmem_queued, use skb->truesize + sizeof(sctp_chunk)
      for them.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      605c0ac1