1. 23 10月, 2018 9 次提交
  2. 21 10月, 2018 4 次提交
  3. 20 10月, 2018 7 次提交
    • D
      net: fix pskb_trim_rcsum_slow() with odd trim offset · d55bef50
      Dimitris Michailidis 提交于
      We've been getting checksum errors involving small UDP packets, usually
      59B packets with 1 extra non-zero padding byte. netdev_rx_csum_fault()
      has been complaining that HW is providing bad checksums. Turns out the
      problem is in pskb_trim_rcsum_slow(), introduced in commit 88078d98
      ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends").
      
      The source of the problem is that when the bytes we are trimming start
      at an odd address, as in the case of the 1 padding byte above,
      skb_checksum() returns a byte-swapped value. We cannot just combine this
      with skb->csum using csum_sub(). We need to use csum_block_sub() here
      that takes into account the parity of the start address and handles the
      swapping.
      
      Matches existing code in __skb_postpull_rcsum() and esp_remove_trailer().
      
      Fixes: 88078d98 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
      Signed-off-by: NDimitris Michailidis <dmichail@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d55bef50
    • D
      netpoll: allow cleanup to be synchronous · c9fbd71f
      Debabrata Banerjee 提交于
      This fixes a problem introduced by:
      commit 2cde6acd ("netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock")
      
      When using netconsole on a bond, __netpoll_cleanup can asynchronously
      recurse multiple times, each __netpoll_free_async call can result in
      more __netpoll_free_async's. This means there is now a race between
      cleanup_work queues on multiple netpoll_info's on multiple devices and
      the configuration of a new netpoll. For example if a netconsole is set
      to enable 0, reconfigured, and enable 1 immediately, this netconsole
      will likely not work.
      
      Given the reason for __netpoll_free_async is it can be called when rtnl
      is not locked, if it is locked, we should be able to execute
      synchronously. It appears to be locked everywhere it's called from.
      
      Generalize the design pattern from the teaming driver for current
      callers of __netpoll_free_async.
      
      CC: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NDebabrata Banerjee <dbanerje@akamai.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9fbd71f
    • J
      bpf: skmsg, fix psock create on existing kcm/tls port · 5032d079
      John Fastabend 提交于
      Before using the psock returned by sk_psock_get() when adding it to a
      sockmap we need to ensure it is actually a sockmap based psock.
      Previously we were only checking this after incrementing the reference
      counter which was an error. This resulted in a slab-out-of-bounds
      error when the psock was not actually a sockmap type.
      
      This moves the check up so the reference counter is only used
      if it is a sockmap psock.
      
      Eric reported the following KASAN BUG,
      
      BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
      BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
      Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387
      
      CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
       print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
       atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
       refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
       sk_psock_get include/linux/skmsg.h:379 [inline]
       sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
       sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
       sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
       map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5032d079
    • S
      bpf: add tests for direct packet access from CGROUP_SKB · 2cb494a3
      Song Liu 提交于
      Tests are added to make sure CGROUP_SKB cannot access:
        tc_classid, data_meta, flow_keys
      
      and can read and write:
        mark, prority, and cb[0-4]
      
      and can read other fields.
      
      To make selftest with skb->sk work, a dummy sk is added in
      bpf_prog_test_run_skb().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      2cb494a3
    • S
      bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB · b39b5f41
      Song Liu 提交于
      BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
      skb. This patch enables direct access of skb for these programs.
      
      Two helper functions bpf_compute_and_save_data_end() and
      bpf_restore_data_end() are introduced. There are used in
      __cgroup_bpf_run_filter_skb(), to compute proper data_end for the
      BPF program, and restore original data afterwards.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b39b5f41
    • M
      bpf: add queue and stack maps · f1a2e44a
      Mauricio Vasquez B 提交于
      Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
      These maps support peek, pop and push operations that are exposed to eBPF
      programs through the new bpf_map[peek/pop/push] helpers.  Those operations
      are exposed to userspace applications through the already existing
      syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM            -> peek
      BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
      BPF_MAP_UPDATE_ELEM            -> push
      
      Queue/stack maps are implemented using a buffer, tail and head indexes,
      hence BPF_F_NO_PREALLOC is not supported.
      
      As opposite to other maps, queue and stack do not use RCU for protecting
      maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
      argument that is a pointer to a memory zone where to save the value of a
      map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
      be passed as an extra argument.
      
      Our main motivation for implementing queue/stack maps was to keep track
      of a pool of elements, like network ports in a SNAT, however we forsee
      other use cases, like for exampling saving last N kernel events in a map
      and then analysing from userspace.
      Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f1a2e44a
    • D
      Revert "bond: take rcu lock in netpoll_send_skb_on_dev" · 48995423
      David S. Miller 提交于
      This reverts commit 6fe94878.
      
      It is causing more serious regressions than the RCU warning
      it is fixing.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48995423
  4. 19 10月, 2018 10 次提交
    • P
      Revert "netfilter: xt_quota: fix the behavior of xt_quota module" · af510ebd
      Pablo Neira Ayuso 提交于
      This reverts commit e9837e55.
      
      When talking to Maze and Chenbo, we agreed to keep this back by now
      due to problems in the ruleset listing path with 32-bit arches.
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      af510ebd
    • W
      netfilter: remove two unused variables. · da8a705c
      Weongyo Jeong 提交于
      nft_dup_netdev_ingress_ops and nft_fwd_netdev_ingress_ops variables are
      no longer used at the code.
      Signed-off-by: NWeongyo Jeong <weongyo.linux@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      da8a705c
    • T
      netfilter: nf_flow_table: remove unnecessary parameter of nf_flow_table_cleanup() · 5f1be84a
      Taehee Yoo 提交于
      parameter net of nf_flow_table_cleanup() is not used.
      So that it can be removed.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      5f1be84a
    • S
      ip6_tunnel: Fix encapsulation layout · d4d576f5
      Stefano Brivio 提交于
      Commit 058214a4 ("ip6_tun: Add infrastructure for doing
      encapsulation") added the ip6_tnl_encap() call in ip6_tnl_xmit(), before
      the call to ipv6_push_frag_opts() to append the IPv6 Tunnel Encapsulation
      Limit option (option 4, RFC 2473, par. 5.1) to the outer IPv6 header.
      
      As long as the option didn't actually end up in generated packets, this
      wasn't an issue. Then commit 89a23c8b ("ip6_tunnel: Fix missing tunnel
      encapsulation limit option") fixed sending of this option, and the
      resulting layout, e.g. for FoU, is:
      
      .-------------------.------------.----------.-------------------.----- - -
      | Outer IPv6 Header | UDP header | Option 4 | Inner IPv6 Header | Payload
      '-------------------'------------'----------'-------------------'----- - -
      
      Needless to say, FoU and GUE (at least) won't work over IPv6. The option
      is appended by default, and I couldn't find a way to disable it with the
      current iproute2.
      
      Turn this into a more reasonable:
      
      .-------------------.----------.------------.-------------------.----- - -
      | Outer IPv6 Header | Option 4 | UDP header | Inner IPv6 Header | Payload
      '-------------------'----------'------------'-------------------'----- - -
      
      With this, and with 84dad559 ("udp6: fix encap return code for
      resubmitting"), FoU and GUE work again over IPv6.
      
      Fixes: 058214a4 ("ip6_tun: Add infrastructure for doing encapsulation")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4d576f5
    • E
      tcp: fix TCP_REPAIR xmit queue setup · 79861919
      Eric Dumazet 提交于
      Andrey reported the following warning triggered while running CRIU tests:
      
      tcp_clean_rtx_queue()
      ...
      	last_ackt = tcp_skb_timestamp_us(skb);
      	WARN_ON_ONCE(last_ackt == 0);
      
      This is caused by 5f6188a8 ("tcp: do not change tcp_wstamp_ns
      in tcp_mstamp_refresh"), as we end up having skbs in retransmit queue
      with a zero skb->skb_mstamp_ns field.
      
      We could fix this bug in different ways, like making sure
      tp->tcp_wstamp_ns is not zero at socket creation, but as Neal pointed
      out, we also do not want that pacing status of a repaired socket
      could push tp->tcp_wstamp_ns far ahead in the future.
      
      So we prefer changing tcp_write_xmit() to not call tcp_update_skb_after_send()
      and instead do what is requested by TCP_REPAIR logic.
      
      Fixes: 5f6188a8 ("tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Vagin <avagin@openvz.org>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79861919
    • J
      tipc: fix info leak from kernel tipc_event · b06f9d9f
      Jon Maloy 提交于
      We initialize a struct tipc_event allocated on the kernel stack to
      zero to avert info leak to user space.
      
      Reported-by: syzbot+057458894bc8cada4dee@syzkaller.appspotmail.com
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b06f9d9f
    • W
      net: socket: fix a missing-check bug · b6168562
      Wenwen Wang 提交于
      In ethtool_ioctl(), the ioctl command 'ethcmd' is checked through a switch
      statement to see whether it is necessary to pre-process the ethtool
      structure, because, as mentioned in the comment, the structure
      ethtool_rxnfc is defined with padding. If yes, a user-space buffer 'rxnfc'
      is allocated through compat_alloc_user_space(). One thing to note here is
      that, if 'ethcmd' is ETHTOOL_GRXCLSRLALL, the size of the buffer 'rxnfc' is
      partially determined by 'rule_cnt', which is actually acquired from the
      user-space buffer 'compat_rxnfc', i.e., 'compat_rxnfc->rule_cnt', through
      get_user(). After 'rxnfc' is allocated, the data in the original user-space
      buffer 'compat_rxnfc' is then copied to 'rxnfc' through copy_in_user(),
      including the 'rule_cnt' field. However, after this copy, no check is
      re-enforced on 'rxnfc->rule_cnt'. So it is possible that a malicious user
      race to change the value in the 'compat_rxnfc->rule_cnt' between these two
      copies. Through this way, the attacker can bypass the previous check on
      'rule_cnt' and inject malicious data. This can cause undefined behavior of
      the kernel and introduce potential security risk.
      
      This patch avoids the above issue via copying the value acquired by
      get_user() to 'rxnfc->rule_cn', if 'ethcmd' is ETHTOOL_GRXCLSRLALL.
      Signed-off-by: NWenwen Wang <wang6495@umn.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6168562
    • P
      net: sched: Fix for duplicate class dump · 3c53ed8f
      Phil Sutter 提交于
      When dumping classes by parent, kernel would return classes twice:
      
      | # tc qdisc add dev lo root prio
      | # tc class show dev lo
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      | # tc class show dev lo parent 8001:
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      | class prio 8001:1 parent 8001:
      | class prio 8001:2 parent 8001:
      | class prio 8001:3 parent 8001:
      
      This comes from qdisc_match_from_root() potentially returning the root
      qdisc itself if its handle matched. Though in that case, root's classes
      were already dumped a few lines above.
      
      Fixes: cb395b20 ("net: sched: optimize class dumps")
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c53ed8f
    • X
      sctp: use sk_wmem_queued to check for writable space · cd305c74
      Xin Long 提交于
      sk->sk_wmem_queued is used to count the size of chunks in out queue
      while sk->sk_wmem_alloc is for counting the size of chunks has been
      sent. sctp is increasing both of them before enqueuing the chunks,
      and using sk->sk_wmem_alloc to check for writable space.
      
      However, sk_wmem_alloc is also increased by 1 for the skb allocked
      for sending in sctp_packet_transmit() but it will not wake up the
      waiters when sk_wmem_alloc is decreased in this skb's destructor.
      
      If msg size is equal to sk_sndbuf and sendmsg is waiting for sndbuf,
      the check 'msg_len <= sctp_wspace(asoc)' in sctp_wait_for_sndbuf()
      will keep waiting if there's a skb allocked in sctp_packet_transmit,
      and later even if this skb got freed, the waiting thread will never
      get waked up.
      
      This issue has been there since very beginning, so we change to use
      sk->sk_wmem_queued to check for writable space as sk_wmem_queued is
      not increased for the skb allocked for sending, also as TCP does.
      
      SOCK_SNDBUF_LOCK check is also removed here as it's for tx buf auto
      tuning which I will add in another patch.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd305c74
    • X
      sctp: count both sk and asoc sndbuf with skb truesize and sctp_chunk size · 605c0ac1
      Xin Long 提交于
      Now it's confusing that asoc sndbuf_used is doing memory accounting with
      SCTP_DATA_SNDSIZE(chunk) + sizeof(sk_buff) + sizeof(sctp_chunk) while sk
      sk_wmem_alloc is doing that with skb->truesize + sizeof(sctp_chunk).
      
      It also causes sctp_prsctp_prune to count with a wrong freed memory when
      sndbuf_policy is not set.
      
      To make this right and also keep consistent between asoc sndbuf_used, sk
      sk_wmem_alloc and sk_wmem_queued, use skb->truesize + sizeof(sctp_chunk)
      for them.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      605c0ac1
  5. 18 10月, 2018 10 次提交
    • N
      net: ipmr: fix unresolved entry dumps · eddf016b
      Nikolay Aleksandrov 提交于
      If the skb space ends in an unresolved entry while dumping we'll miss
      some unresolved entries. The reason is due to zeroing the entry counter
      between dumping resolved and unresolved mfc entries. We should just
      keep counting until the whole table is dumped and zero when we move to
      the next as we have a separate table counter.
      Reported-by: NColin Ian King <colin.king@canonical.com>
      Fixes: 8fb472c0 ("ipmr: improve hash scalability")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eddf016b
    • P
      udp6: fix encap return code for resubmitting · 84dad559
      Paolo Abeni 提交于
      The commit eb63f296 ("udp6: add missing checks on edumux packet
      processing") used the same return code convention of the ipv4 counterpart,
      but ipv6 uses the opposite one: positive values means resubmit.
      
      This change addresses the issue, using positive return value for
      resubmitting. Also update the related comment, which was broken, too.
      
      Fixes: eb63f296 ("udp6: add missing checks on edumux packet processing")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84dad559
    • N
      tcp_bbr: centralize code to set gains · cf33e25c
      Neal Cardwell 提交于
      Centralize the code that sets gains used for computing cwnd and pacing
      rate. This simplifies the code and makes it easier to change the state
      machine or (in the future) dynamically change the gain values and
      ensure that the correct gain values are always used.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NPriyaranjan Jha <priyarjha@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf33e25c
    • N
      tcp_bbr: adjust TCP BBR for departure time pacing · a87c83d5
      Neal Cardwell 提交于
      Adjust TCP BBR for the new departure time pacing model in the recent
      commit ab408b6d ("tcp: switch tcp and sch_fq to new earliest
      departure time model").
      
      With TSQ and pacing at lower layers, there are often several skbs
      queued in the pacing layer, and thus there is less data "in the
      network" than "in flight".
      
      With departure time pacing at lower layers (e.g. fq or potential
      future NICs), the data in the pacing layer now has a pre-scheduled
      ("baked-in") departure time that cannot be changed, even if the
      congestion control algorithm decides to use a new pacing rate.
      
      This means that there can be a non-trivial lag between when BBR makes
      a pacing rate change and when the inter-skb pacing delays
      change. After a pacing rate change, the number of packets in the
      network can gradually evolve to be higher or lower, depending on
      whether the sending rate is higher or lower than the delivery
      rate. Thus ignoring this lag can cause significant overshoot, with the
      flow ending up with too many or too few packets in the network.
      
      This commit changes BBR to adapt its pacing rate based on the amount
      of data in the network that it estimates has already been "baked in"
      by previous departure time decisions. We estimate the number of our
      packets that will be in the network at the earliest departure time
      (EDT) for the next skb scheduled as:
      
         in_network_at_edt = inflight_at_edt - (EDT - now) * bw
      
      If we're increasing the amount of data in the network ("in_network"),
      then we want to know if the transmit of the EDT skb will push
      in_network above the target, so our answer includes
      bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing
      in_network, then we want to know if in_network will sink too low just
      before the EDT transmit, so our answer does not include the segments
      from the skb departing at EDT.
      
      Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case
      differently? The in_network curve is a step function: in_network goes
      up on transmits, and down on ACKs. To accurately predict when
      in_network will go beyond our target value, this will happen on
      different events, depending on whether we're concerned about
      in_network potentially going too high or too low:
      
       o if pushing in_network up (pacing_gain > 1.0),
         then in_network goes above target upon a transmit event
      
       o if pushing in_network down (pacing_gain < 1.0),
         then in_network goes below target upon an ACK event
      
      This commit changes the BBR state machine to use this estimated
      "packets in network" value to make its decisions.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a87c83d5
    • V
      net/ncsi: Add NCSI Broadcom OEM command · cb10c7c0
      Vijay Khemka 提交于
      This patch adds OEM Broadcom commands and response handling. It also
      defines OEM Get MAC Address handler to get and configure the device.
      
      ncsi_oem_gma_handler_bcm: This handler send NCSI broadcom command for
      getting mac address.
      ncsi_rsp_handler_oem_bcm: This handles response received for all
      broadcom OEM commands.
      ncsi_rsp_handler_oem_bcm_gma: This handles get mac address response and
      set it to device.
      Signed-off-by: NVijay Khemka <vijaykhemka@fb.com>
      Reviewed-by: NSamuel Mendoza-Jonas <sam@mendozajonas.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb10c7c0
    • X
      sctp: not free the new asoc when sctp_wait_for_connect returns err · c863850c
      Xin Long 提交于
      When sctp_wait_for_connect is called to wait for connect ready
      for sp->strm_interleave in sctp_sendmsg_to_asoc, a panic could
      be triggered if cpu is scheduled out and the new asoc is freed
      elsewhere, as it will return err and later the asoc gets freed
      again in sctp_sendmsg.
      
      [  285.840764] list_del corruption, ffff9f0f7b284078->next is LIST_POISON1 (dead000000000100)
      [  285.843590] WARNING: CPU: 1 PID: 8861 at lib/list_debug.c:47 __list_del_entry_valid+0x50/0xa0
      [  285.846193] Kernel panic - not syncing: panic_on_warn set ...
      [  285.846193]
      [  285.848206] CPU: 1 PID: 8861 Comm: sctp_ndata Kdump: loaded Not tainted 4.19.0-rc7.label #584
      [  285.850559] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  285.852164] Call Trace:
      ...
      [  285.872210]  ? __list_del_entry_valid+0x50/0xa0
      [  285.872894]  sctp_association_free+0x42/0x2d0 [sctp]
      [  285.873612]  sctp_sendmsg+0x5a4/0x6b0 [sctp]
      [  285.874236]  sock_sendmsg+0x30/0x40
      [  285.874741]  ___sys_sendmsg+0x27a/0x290
      [  285.875304]  ? __switch_to_asm+0x34/0x70
      [  285.875872]  ? __switch_to_asm+0x40/0x70
      [  285.876438]  ? ptep_set_access_flags+0x2a/0x30
      [  285.877083]  ? do_wp_page+0x151/0x540
      [  285.877614]  __sys_sendmsg+0x58/0xa0
      [  285.878138]  do_syscall_64+0x55/0x180
      [  285.878669]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is a similar issue with the one fixed in Commit ca3af4dd
      ("sctp: do not free asoc when it is already dead in sctp_sendmsg").
      But this one can't be fixed by returning -ESRCH for the dead asoc
      in sctp_wait_for_connect, as it will break sctp_connect's return
      value to users.
      
      This patch is to simply set err to -ESRCH before it returns to
      sctp_sendmsg when any err is returned by sctp_wait_for_connect
      for sp->strm_interleave, so that no asoc would be freed due to
      this.
      
      When users see this error, they will know the packet hasn't been
      sent. And it also makes sense to not free asoc because waiting
      connect fails, like the second call for sctp_wait_for_connect in
      sctp_sendmsg_to_asoc.
      
      Fixes: 668c9beb ("sctp: implement assign_number for sctp_stream_interleave")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c863850c
    • M
      sctp: fix race on sctp_id2asoc · b336deca
      Marcelo Ricardo Leitner 提交于
      syzbot reported an use-after-free involving sctp_id2asoc.  Dmitry Vyukov
      helped to root cause it and it is because of reading the asoc after it
      was freed:
      
              CPU 1                       CPU 2
      (working on socket 1)            (working on socket 2)
      	                         sctp_association_destroy
      sctp_id2asoc
         spin lock
           grab the asoc from idr
         spin unlock
                                         spin lock
      				     remove asoc from idr
      				   spin unlock
      				   free(asoc)
         if asoc->base.sk != sk ... [*]
      
      This can only be hit if trying to fetch asocs from different sockets. As
      we have a single IDR for all asocs, in all SCTP sockets, their id is
      unique on the system. An application can try to send stuff on an id
      that matches on another socket, and the if in [*] will protect from such
      usage. But it didn't consider that as that asoc may belong to another
      socket, it may be freed in parallel (read: under another socket lock).
      
      We fix it by moving the checks in [*] into the protected region. This
      fixes it because the asoc cannot be freed while the lock is held.
      
      Reported-by: syzbot+c7dd55d7aec49d48e49a@syzkaller.appspotmail.com
      Acked-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b336deca
    • T
      net: bpfilter: use get_pid_task instead of pid_task · 84258438
      Taehee Yoo 提交于
      pid_task() dereferences rcu protected tasks array.
      But there is no rcu_read_lock() in shutdown_umh() routine so that
      rcu_read_lock() is needed.
      get_pid_task() is wrapper function of pid_task. it holds rcu_read_lock()
      then calls pid_task(). if task isn't NULL, it increases reference count
      of task.
      
      test commands:
         %modprobe bpfilter
         %modprobe -rv bpfilter
      
      splat looks like:
      [15102.030932] =============================
      [15102.030957] WARNING: suspicious RCU usage
      [15102.030985] 4.19.0-rc7+ #21 Not tainted
      [15102.031010] -----------------------------
      [15102.031038] kernel/pid.c:330 suspicious rcu_dereference_check() usage!
      [15102.031063]
      	       other info that might help us debug this:
      
      [15102.031332]
      	       rcu_scheduler_active = 2, debug_locks = 1
      [15102.031363] 1 lock held by modprobe/1570:
      [15102.031389]  #0: 00000000580ef2b0 (bpfilter_lock){+.+.}, at: stop_umh+0x13/0x52 [bpfilter]
      [15102.031552]
                     stack backtrace:
      [15102.031583] CPU: 1 PID: 1570 Comm: modprobe Not tainted 4.19.0-rc7+ #21
      [15102.031607] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [15102.031628] Call Trace:
      [15102.031676]  dump_stack+0xc9/0x16b
      [15102.031723]  ? show_regs_print_info+0x5/0x5
      [15102.031801]  ? lockdep_rcu_suspicious+0x117/0x160
      [15102.031855]  pid_task+0x134/0x160
      [15102.031900]  ? find_vpid+0xf0/0xf0
      [15102.032017]  shutdown_umh.constprop.1+0x1e/0x53 [bpfilter]
      [15102.032055]  stop_umh+0x46/0x52 [bpfilter]
      [15102.032092]  __x64_sys_delete_module+0x47e/0x570
      [ ... ]
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84258438
    • K
      net: fix warning in af_unix · 33c4368e
      Kyeongdon Kim 提交于
      This fixes the "'hash' may be used uninitialized in this function"
      
      net/unix/af_unix.c:1041:20: warning: 'hash' may be used uninitialized in this function [-Wmaybe-uninitialized]
        addr->hash = hash ^ sk->sk_type;
      Signed-off-by: NKyeongdon Kim <kyeongdon.kim@lge.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33c4368e
    • I
      bridge: switchdev: Allow clearing FDB entry offload indication · e9ba0fbc
      Ido Schimmel 提交于
      Currently, an FDB entry only ceases being offloaded when it is deleted.
      This changes with VxLAN encapsulation.
      
      Devices capable of performing VxLAN encapsulation usually have only one
      FDB table, unlike the software data path which has two - one in the
      bridge driver and another in the VxLAN driver.
      
      Therefore, bridge FDB entries pointing to a VxLAN device are only
      offloaded if there is a corresponding entry in the VxLAN FDB.
      
      Allow clearing the offload indication in case the corresponding entry
      was deleted from the VxLAN FDB.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Reviewed-by: NPetr Machata <petrm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e9ba0fbc