1. 09 August 2018, 5 commits
    • dsa: slave: eee: Allow ports to use phylink · 1be52e97
      Authored by Andrew Lunn
      For a port to be able to use EEE, both the MAC and the PHY must
      support EEE. A PHY can be provided either by a phydev or by phylink.
      Verify that at least one of these exists, not just phydev.
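
      A minimal sketch of that check, assuming the DSA port keeps its phylink
      instance in dp->pl (treat the field name as an assumption):

        /* The port's PHY and MAC both need EEE support; the PHY may be
         * attached either as a phydev or via phylink.
         */
        if (!dev->phydev && !dp->pl)
                return -ENODEV;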
      
      Fixes: aab9c406 ("net: dsa: Plug in PHYLINK support")
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: move sock lock in smc_ioctl() · 7311d665
      Authored by Ursula Braun
      While an SMC socket is connecting, it is decided whether fallback to
      TCP is needed. To avoid races between connect() and ioctl(), take the
      sock lock before the use_fallback check.
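
      A hedged sketch of the reordering; the helper below is hypothetical and
      only illustrates where the lock now sits relative to the check:

        lock_sock(&smc->sk);            /* take the sock lock first...        */
        if (smc->use_fallback) {        /* ...then inspect the fallback state */
                rc = fallback_ioctl(smc, cmd, arg);   /* hypothetical helper  */
                release_sock(&smc->sk);
                return rc;
        }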
      
      Reported-by: syzbot+5b2cece1a8ecb2ca77d8@syzkaller.appspotmail.com
      Reported-by: syzbot+19557374321ca3710990@syzkaller.appspotmail.com
      Fixes: 1992d998 ("net/smc: take sock lock in smc_ioctl()")
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: allow sysctl rmem and wmem defaults for servers · bd58c7e0
      Authored by Ursula Braun
      Without setsockopt() SO_SNDBUF and SO_RCVBUF settings, the sysctl
      defaults net.ipv4.tcp_wmem and net.ipv4.tcp_rmem should be the basis
      for the sizes of the SMC sndbuf and rcvbuf. Any TCP buffer size
      optimizations applied for servers should be ignored.
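
      A hedged sketch of seeding the buffers from the netns sysctl defaults
      instead of inheriting whatever tuning the internal TCP socket carries
      (index 1 of the tcp_wmem/tcp_rmem arrays holds the default value):

        struct net *net = sock_net(&smc->sk);

        smc->sk.sk_sndbuf = net->ipv4.sysctl_tcp_wmem[1];
        smc->sk.sk_rcvbuf = net->ipv4.sysctl_tcp_rmem[1];
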
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: no shutdown in state SMC_LISTEN · caa21e19
      Authored by Ursula Braun
      Invoking shutdown for a socket in state SMC_LISTEN does not make
      sense. Nevertheless, programs like syzbot fuzzing the kernel may
      try to do this, and for SMC it leads to a socket refcounting problem.
      This patch makes sure a shutdown call for an SMC socket in state
      SMC_LISTEN simply returns with -ENOTCONN.
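
      One way to express the rule above in smc_shutdown() (a sketch, not the
      exact upstream hunk):

        if (sk->sk_state == SMC_LISTEN) {
                rc = -ENOTCONN;          /* shutdown on a listening socket */
                goto out;
        }
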
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Fix the keepalive generator [ver #2] · 330bdcfa
      Authored by David Howells
      AF_RXRPC has a keepalive message generator that generates a message for a
      peer ~20s after the last transmission to that peer to keep firewall ports
      open.  The implementation is incorrect in the following ways:
      
       (1) It mixes up ktime_t and time64_t types.
      
       (2) It uses ktime_get_real(), the output of which may jump forward or
           backward due to adjustments to the time of day.
      
       (3) If the current time jumps forward too much or jumps backwards, the
           generator function will crank the base of the time ring round one slot
           at a time (ie. a 1s period) until it catches up, spewing out VERSION
           packets as it goes.
      
      Fix the problem by:
      
       (1) Only using time64_t.  There's no need for sub-second resolution.
      
       (2) Use ktime_get_seconds() rather than ktime_get_real() so that time
           isn't perceived to go backwards.
      
       (3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two
           parts:
      
           (a) The "worker" function that manages the buckets and the timer.
      
           (b) The "dispatch" function that takes the pending peers and
           	 potentially transmits a keepalive packet before putting them back
           	 in the ring into the slot appropriate to the revised last-Tx time.
      
       (4) Taking everything that's pending out of the ring and splicing it into
           a temporary collector list for processing.
      
           In the case that there's been a significant jump forward, the ring
           gets entirely emptied and then the time base can be warped forward
           before the peers are processed.
      
           The warping can't happen if the ring isn't empty because the slot a
           peer is in is keepalive-time dependent, relative to the base time.
      
       (5) Limit the number of iterations of the bucket array when scanning it.
      
       (6) Set the timer to skip any empty slots as there's no point waking up if
           there's nothing to do yet.
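
      A minimal sketch of points (1) and (4) above: placing a peer into the
      keepalive ring using whole-second time64_t arithmetic (identifiers are
      assumptions, not the exact mainline names):

        time64_t now = ktime_get_seconds();
        time64_t keepalive_at = peer->last_tx_at + RXRPC_KEEPALIVE_TIME;
        s64 slot = keepalive_at - base;   /* seconds past the ring's base time */

        if (slot < 0)
                slot = 0;                 /* overdue: service on the next pass */
        slot = (slot + cursor) % ARRAY_SIZE(rxnet->peer_keepalive);
        hlist_add_head(&peer->keepalive_link, &rxnet->peer_keepalive[slot]);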
      
      This can be triggered by an incoming call from a server after a reboot with
      AF_RXRPC and AFS built into the kernel causing a peer record to be set up
      before userspace is started.  The system clock is then adjusted by
      userspace, thereby potentially causing the keepalive generator to have a
      meltdown - which leads to a message like:
      
      	watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23]
      	...
      	Workqueue: krxrpcd rxrpc_peer_keepalive_worker
      	EIP: lock_acquire+0x69/0x80
      	...
      	Call Trace:
      	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
      	 ? _raw_spin_lock_bh+0x29/0x60
      	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
      	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
      	 ? __lock_acquire+0x3d3/0x870
      	 ? process_one_work+0x110/0x340
      	 ? process_one_work+0x166/0x340
      	 ? process_one_work+0x110/0x340
      	 ? worker_thread+0x39/0x3c0
      	 ? kthread+0xdb/0x110
      	 ? cancel_delayed_work+0x90/0x90
      	 ? kthread_stop+0x70/0x70
      	 ? ret_from_fork+0x19/0x24
      
      Fixes: ace45bec ("rxrpc: Fix firewall route keepalive")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 08 August 2018, 4 commits
    • llc: use refcount_inc_not_zero() for llc_sap_find() · 0dcb8225
      Authored by Cong Wang
      llc_sap_put() decreases the refcnt before deleting sap
      from the global list. Therefore, there is a chance
      llc_sap_find() could find a sap with zero refcnt
      in this global list.
      
      Close this race by checking in llc_sap_find() whether the refcnt is
      zero; if it is zero, the sap is being removed, so treat it as gone.
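
      A sketch of the lookup with the zero-refcount check (helper name
      assumed):

        rcu_read_lock_bh();
        sap = __llc_sap_find(sap_value);               /* unlocked list walk */
        if (sap && !refcount_inc_not_zero(&sap->refcnt))
                sap = NULL;                /* refcnt already 0: being removed */
        rcu_read_unlock_bh();
        return sap;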
      
      Reported-by: <syzbot+278893f3f7803871f7ce@syzkaller.appspotmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart() · 61ef4b07
      Authored by Alexey Kodanev
      The shift of 'cwnd' with '(now - hc->tx_lsndtime) / hc->tx_rto' value
      can lead to undefined behavior [1].
      
      In order to fix this, use a gradual shift of the window in a 'while'
      loop, similar to what tcp_cwnd_restart() does.

      When comparing delta and RTO there is a minor difference between TCP
      and DCCP: the latter also invokes dccp_cwnd_restart() and reduces
      'cwnd' when delta equals RTO. That case is preserved in this change.
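
      A sketch of the gradual shift, mirroring tcp_cwnd_restart() (variable
      names follow the CCID-2 code but should be read as assumptions):

        /* iwnd: the RFC 3390 initial window, computed earlier */
        u32 cwnd = hc->tx_cwnd, restart_cwnd = min(cwnd, iwnd);
        s32 delta = now - hc->tx_lsndtime;

        /* halve cwnd once per elapsed RTO, never below the restart window */
        while ((delta -= hc->tx_rto) >= 0 && cwnd > restart_cwnd)
                cwnd >>= 1;
        hc->tx_cwnd = max(cwnd, restart_cwnd);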
      
      [1]:
      [40850.963623] UBSAN: Undefined behaviour in net/dccp/ccids/ccid2.c:237:7
      [40851.043858] shift exponent 67 is too large for 32-bit type 'unsigned int'
      [40851.127163] CPU: 3 PID: 15940 Comm: netstress Tainted: G        W   E     4.18.0-rc7.x86_64 #1
      ...
      [40851.377176] Call Trace:
      [40851.408503]  dump_stack+0xf1/0x17b
      [40851.451331]  ? show_regs_print_info+0x5/0x5
      [40851.503555]  ubsan_epilogue+0x9/0x7c
      [40851.548363]  __ubsan_handle_shift_out_of_bounds+0x25b/0x2b4
      [40851.617109]  ? __ubsan_handle_load_invalid_value+0x18f/0x18f
      [40851.686796]  ? xfrm4_output_finish+0x80/0x80
      [40851.739827]  ? lock_downgrade+0x6d0/0x6d0
      [40851.789744]  ? xfrm4_prepare_output+0x160/0x160
      [40851.845912]  ? ip_queue_xmit+0x810/0x1db0
      [40851.895845]  ? ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
      [40851.963530]  ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
      [40852.029063]  dccp_xmit_packet+0x1d3/0x720 [dccp]
      [40852.086254]  dccp_write_xmit+0x116/0x1d0 [dccp]
      [40852.142412]  dccp_sendmsg+0x428/0xb20 [dccp]
      [40852.195454]  ? inet_dccp_listen+0x200/0x200 [dccp]
      [40852.254833]  ? sched_clock+0x5/0x10
      [40852.298508]  ? sched_clock+0x5/0x10
      [40852.342194]  ? inet_create+0xdf0/0xdf0
      [40852.388988]  sock_sendmsg+0xd9/0x160
      ...
      
      Fixes: 113ced1f ("dccp ccid-2: Perform congestion-window validation")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: fix an interrupt unsafe locking scenario · 37436d9c
      Authored by Ying Xue
      Commit 9faa89d4 ("tipc: make function tipc_net_finalize() thread
      safe") tries to make setting the node address thread safe, so it uses
      node_list_lock to serialize the whole process of setting the node
      address in tipc_net_finalize(). But it causes the following interrupt
      unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        rht_deferred_worker()
        rhashtable_rehash_table()
        lock(&(&ht->lock)->rlock)
      			       tipc_nl_compat_doit()
                                     tipc_net_finalize()
                                     local_irq_disable();
                                     lock(&(&tn->node_list_lock)->rlock);
                                     tipc_sk_reinit()
                                     rhashtable_walk_enter()
                                     lock(&(&ht->lock)->rlock);
        <Interrupt>
        tipc_disc_rcv()
        tipc_node_check_dest()
        tipc_node_create()
        lock(&(&tn->node_list_lock)->rlock);
      
       *** DEADLOCK ***
      
      When rhashtable_rehash_table() holds ht->lock on CPU0, it doesn't
      disable BH. So if an interrupt happens after the lock, it can create
      an inverse lock ordering between ht->lock and tn->node_list_lock. As
      a consequence, deadlock might happen.
      
      The inverse lock ordering above arises because node_list_lock was
      never designed to serialize the setting of the node address.

      As cmpxchg() guarantees that the compare-and-swap (CAS) operation is
      atomic, use it to replace node_list_lock so that setting the node
      address is finished atomically. It turns out the potential deadlock
      is avoided as well.
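
      A hedged sketch of the cmpxchg()-based approach (field and helper
      names are assumptions):

        /* Publish the node address exactly once; a concurrent caller that
         * loses the race returns without repeating the finalization work.
         */
        if (cmpxchg(&tn->node_addr, 0, addr) == 0)
                tipc_net_finalize_addr(net, addr);   /* hypothetical helper */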
      
      Fixes: 9faa89d4 ("tipc: make function tipc_net_finalize() thread safe")
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <maloy@donjonn.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vsock: split dwork to avoid reinitializations · 455f05ec
      Authored by Cong Wang
      syzbot reported that we reinitialize an active delayed
      work in vsock_stream_connect():
      
      	ODEBUG: init active (active state 0) object type: timer_list hint:
      	delayed_work_timer_fn+0x0/0x90 kernel/workqueue.c:1414
      	WARNING: CPU: 1 PID: 11518 at lib/debugobjects.c:329
      	debug_print_object+0x16a/0x210 lib/debugobjects.c:326
      
      The pattern is wrong: the delayed work should be initialized only
      once and may then be scheduled repeatedly, so the initializations
      have to move to the allocation side. And to avoid confusion, split
      the shared dwork into two separate works instead of reusing the
      same one.
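
      A sketch of the split, with both works initialized once at socket
      allocation time (field names assumed):

        /* in the vsock allocation path: initialize exactly once */
        INIT_DELAYED_WORK(&vsk->connect_work, vsock_connect_timeout);
        INIT_DELAYED_WORK(&vsk->pending_work, vsock_pending_work);

        /* later, e.g. in vsock_stream_connect(): only schedule, never re-init */
        sock_hold(sk);
        schedule_delayed_work(&vsk->connect_work, timeout);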
      
      Fixes: d021c344 ("VSOCK: Introduce VM Sockets")
      Reported-by: <syzbot+8a9b1bd330476a4f3db6@syzkaller.appspotmail.com>
      Cc: Andy king <acking@vmware.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Jorgen Hansen <jhansen@vmware.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 07 August 2018, 1 commit
    • packet: refine ring v3 block size test to hold one frame · 4576cd46
      Authored by Willem de Bruijn
      TPACKET_V3 stores variable length frames in fixed length blocks.
      Blocks must be able to store a block header, optional private space
      and at least one minimum sized frame.
      
      Frames, even for a zero snaplen packet, store metadata headers and
      optional reserved space.
      
      In the block size bounds check, ensure that the frame of the
      chosen configuration fits. This includes sockaddr_ll and optional
      tp_reserve.
      
      Syzbot was able to construct a ring with insufficient room for the
      sockaddr_ll in the header of a zero-length frame, triggering an
      out-of-bounds write in dev_parse_header.
      
      Convert the comparison to less than, as zero is a valid snap len.
      This matches the test for minimum tp_frame_size immediately below.
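
      A sketch of the refined bound (macro and field names taken from the
      af_packet code, but treat them as assumptions):

        unsigned int min_frame_size = po->tp_hdrlen + po->tp_reserve;

        /* a block must hold its header, the private area and one minimal frame */
        if (req->tp_block_size <
            BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv) + min_frame_size)
                goto out;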
      
      Fixes: f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation.")
      Fixes: eb73190f ("net/packet: refine check for priv area size")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 06 August 2018, 2 commits
  5. 05 August 2018, 2 commits
  6. 04 August 2018, 1 commit
  7. 02 August 2018, 4 commits
  8. 01 August 2018, 2 commits
    • ipv4: frags: handle possible skb truesize change · 4672694b
      Authored by Eric Dumazet
      ip_frag_queue() might call pskb_pull() on one skb that
      is already in the fragment queue.
      
      We need to take care of a possible truesize change, or we might
      end up with an imbalance in the netns frags memory accounting.
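
      A hedged sketch of the accounting (helper names as in the ipv4 defrag
      code, treated here as assumptions):

        int delta = -skb->truesize;          /* truesize before the pull  */

        if (!pskb_pull(skb, skb_network_offset(skb) + ihl))
                goto err;

        delta += skb->truesize;              /* pskb_pull() may change it */
        if (delta)
                add_frag_mem_limit(qp->q.net, delta);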
      
      IPv6 is immune to this bug, because RFC 5722, Section 4, as amended
      by Errata ID 3089, states:

        When reassembling an IPv6 datagram, if
        one or more of its constituent fragments is determined to be an
        overlapping fragment, the entire datagram (and any constituent
        fragments) MUST be silently discarded.
      
      Fixes: 158f323b ("net: adjust skb->truesize in pskb_expand_head()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frag: enforce memory limits earlier · 56e2c94f
      Authored by Eric Dumazet
      We currently check the frags memory usage only when
      a new frag queue is created. This allows attackers to first
      consume the memory budget (default: 4 MB) by creating thousands
      of frag queues, then send tiny skbs to exceed the high_thresh
      limit by 2 to 3 orders of magnitude.

      Note that before commit 648700f7 ("inet: frags: use rhashtables
      for reassembly units"), the work queue could be starved under DoS,
      getting no CPU cycles.
      After commit 648700f7, only the per-frag-queue timer can eventually
      remove an incomplete frag queue and its skbs.
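
      A sketch of checking the budget on every lookup rather than only when
      a new queue is allocated (exact placement assumed):

        /* in the per-fragment lookup path, before creating or reusing a queue */
        if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh)
                return NULL;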
      
      Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Peter Oskolkov <posk@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 31 July 2018, 3 commits
  10. 30 July 2018, 2 commits
  11. 29 July 2018, 6 commits
    • tcp_bbr: fix bw probing to raise in-flight data for very small BDPs · 383d4709
      Authored by Neal Cardwell
      For some very small BDPs (with just a few packets) there was a
      quantization effect where the target number of packets in flight
      during the super-unity-gain (1.25x) phase of gain cycling was
      implicitly truncated to a number of packets no larger than the normal
      unity-gain (1.0x) phase of gain cycling. This meant that in multi-flow
      scenarios some flows could get stuck with a lower bandwidth, because
      they did not push enough packets inflight to discover that there was
      more bandwidth available. This was really only an issue in multi-flow
      LAN scenarios, where RTTs and BDPs are low enough for this to be an
      issue.
      
      This fix ensures that gain cycling can raise inflight for small BDPs
      by ensuring that in PROBE_BW mode target inflight values with a
      super-unity gain are always greater than inflight values with a gain
      <= 1. Importantly, this applies whether the inflight value is
      calculated for use as a cwnd value, or as a target inflight value for
      the end of the super-unity phase in bbr_is_next_cycle_phase() (both
      need to be bigger to ensure we can probe with more packets in flight
      reliably).
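
      A sketch of the adjustment at the end of the inflight/cwnd target
      computation (this mirrors the shape of the fix; names are assumptions):

        if (bbr->mode == BBR_PROBE_BW && gain > BBR_UNIT)
                cwnd += 2;   /* make >1.0x gain probe with more packets than 1.0x */

        return cwnd;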
      
      This is a candidate fix for stable releases.
      
      Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Priyaranjan Jha <priyarjha@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: socket: Fix potential spectre v1 gadget in sock_is_registered · e978de7a
      Authored by Jeremy Cline
      'family' can be a user-controlled value, so sanitize it after the bounds
      check to avoid speculative out-of-bounds access.
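
      A sketch of clamping the index after the bounds check, following the
      kernel's usual Spectre v1 pattern:

        #include <linux/nospec.h>

        bool sock_is_registered(int family)
        {
                return family < NPROTO &&
                       rcu_access_pointer(net_families[array_index_nospec(family, NPROTO)]);
        }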
      
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jeremy Cline <jcline@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: socket: fix potential spectre v1 gadget in socketcall · c8e8cd57
      Authored by Jeremy Cline
      'call' is a user-controlled value, so sanitize the array index after the
      bounds check to avoid speculating past the bounds of the 'nargs' array.
      
      Found with the help of Smatch:
      
      net/socket.c:2508 __do_sys_socketcall() warn: potential spectre issue
      'nargs' [r] (local cap)
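
      The same pattern applied to the syscall multiplexer (bounds as in the
      upstream code, treated here as assumptions):

        if (call < 1 || call > SYS_SENDMMSG)
                return -EINVAL;
        call = array_index_nospec(call, SYS_SENDMMSG + 1);

        len = nargs[call];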
      
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jeremy Cline <jcline@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: remove BUG_ON() from fib_compute_spec_dst · 9fc12023
      Authored by Lorenzo Bianconi
      Remove the BUG_ON() from the fib_compute_spec_dst routine and check
      the in_dev pointer while initializing the flowi4 data structure.
      fib_compute_spec_dst can run concurrently with device removal,
      where the ip_ptr net_device pointer is set to NULL. This can happen
      if userspace enables pktinfo on a UDP rx socket and the device
      is removed while traffic is flowing.
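
      A hedged sketch of tolerating a vanishing in_dev while filling in the
      flowi4 key (field usage assumed):

        struct in_device *in_dev = __in_dev_get_rcu(dev);
        /* dev->ip_ptr may already be NULL during unregistration */
        bool vmark = in_dev && IN_DEV_SRC_VMARK(in_dev);
        struct flowi4 fl4 = {
                .flowi4_iif   = LOOPBACK_IFINDEX,
                .daddr        = ip_hdr(skb)->saddr,
                .flowi4_tos   = RT_TOS(ip_hdr(skb)->tos),
                .flowi4_scope = scope,
                .flowi4_mark  = vmark ? skb->mark : 0,
        };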
      
      Fixes: 35ebf65e ("ipv4: Create and use fib_compute_spec_dst() helper")
      Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: use GFP_ATOMIC instead of GFP_KERNEL in bpf_parse_prog() · 71eb5255
      Authored by Taehee Yoo
      bpf_parse_prog() is protected by rcu_read_lock(),
      so GFP_KERNEL is not allowed in bpf_parse_prog().
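
      A sketch of the allocation change ('nla' and 'copy' are stand-ins for
      the attribute and destination used in bpf_parse_prog()):

        /* sleeping allocations are illegal under rcu_read_lock() */
        copy = kmemdup(nla_data(nla), nla_len(nla), GFP_ATOMIC);
        if (!copy)
                return -ENOMEM;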
      
      [51015.579396] =============================
      [51015.579418] WARNING: suspicious RCU usage
      [51015.579444] 4.18.0-rc6+ #208 Not tainted
      [51015.579464] -----------------------------
      [51015.579488] ./include/linux/rcupdate.h:303 Illegal context switch in RCU read-side critical section!
      [51015.579510] other info that might help us debug this:
      [51015.579532] rcu_scheduler_active = 2, debug_locks = 1
      [51015.579556] 2 locks held by ip/1861:
      [51015.579577]  #0: 00000000a8c12fd1 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x2e0/0x910
      [51015.579711]  #1: 00000000bf815f8e (rcu_read_lock){....}, at: lwtunnel_build_state+0x96/0x390
      [51015.579842] stack backtrace:
      [51015.579869] CPU: 0 PID: 1861 Comm: ip Not tainted 4.18.0-rc6+ #208
      [51015.579891] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [51015.579911] Call Trace:
      [51015.579950]  dump_stack+0x74/0xbb
      [51015.580000]  ___might_sleep+0x16b/0x3a0
      [51015.580047]  __kmalloc_track_caller+0x220/0x380
      [51015.580077]  kmemdup+0x1c/0x40
      [51015.580077]  bpf_parse_prog+0x10e/0x230
      [51015.580164]  ? kasan_kmalloc+0xa0/0xd0
      [51015.580164]  ? bpf_destroy_state+0x30/0x30
      [51015.580164]  ? bpf_build_state+0xe2/0x3e0
      [51015.580164]  bpf_build_state+0x1bb/0x3e0
      [51015.580164]  ? bpf_parse_prog+0x230/0x230
      [51015.580164]  ? lock_is_held_type+0x123/0x1a0
      [51015.580164]  lwtunnel_build_state+0x1aa/0x390
      [51015.580164]  fib_create_info+0x1579/0x33d0
      [51015.580164]  ? sched_clock_local+0xe2/0x150
      [51015.580164]  ? fib_info_update_nh_saddr+0x1f0/0x1f0
      [51015.580164]  ? sched_clock_local+0xe2/0x150
      [51015.580164]  fib_table_insert+0x201/0x1990
      [51015.580164]  ? lock_downgrade+0x610/0x610
      [51015.580164]  ? fib_table_lookup+0x1920/0x1920
      [51015.580164]  ? lwtunnel_valid_encap_type.part.6+0xcb/0x3a0
      [51015.580164]  ? rtm_to_fib_config+0x637/0xbd0
      [51015.580164]  inet_rtm_newroute+0xed/0x1b0
      [51015.580164]  ? rtm_to_fib_config+0xbd0/0xbd0
      [51015.580164]  rtnetlink_rcv_msg+0x331/0x910
      [ ... ]
      
      Fixes: 3a0af8fd ("bpf: BPF for lightweight tunnel infrastructure")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: fix bpf_skb_load_bytes_relative pkt length check · 3eee1f75
      Authored by Daniel Borkmann
      The 'len > skb_headlen(skb)' check cannot be used as an upper bound
      for the packet length, since it has no relation to the full linear
      packet length when the helper is invoked from upper layers (e.g.
      from reuseport BPF programs): by then skb->data and skb->len have
      already been mangled by __skb_pull() and others.
      
      Fixes: 4e1ec56c ("bpf: add skb_load_bytes_relative helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
  12. 27 July 2018, 3 commits
    • xdp: add NULL pointer check in __xdp_return() · 36e0f12b
      Authored by Taehee Yoo
      rhashtable_lookup() can return NULL, so a NULL pointer
      check should be added.
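
      A sketch of the guard in __xdp_return() (surrounding context assumed):

        xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
        if (!xa) {
                WARN(1, "mem id not registered, driver bug?");
                return;
        }
        xa->zc_alloc->free(xa->zc_alloc, handle);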
      
      Fixes: 02b55e56 ("xdp: add MEM_TYPE_ZERO_COPY")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • RDS: RDMA: Fix the NULL-ptr deref in rds_ib_get_mr · 9e630bcb
      Authored by Avinash Repaka
      Registration of a memory region (MR) through FRMR/fastreg (unlike FMR)
      needs a connection/qp. With a proxy qp, this dependency on a connection
      will be removed, but that needs more infrastructure patches, which are
      still work in progress.
      
      As an intermediate fix, the get_mr returns EOPNOTSUPP when connection
      details are not populated. The MR registration through sendmsg() will
      continue to work even with fast registration, since connection in this
      case is formed upfront.
      
      This patch fixes the following crash:
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 4244 Comm: syzkaller468044 Not tainted 4.16.0-rc6+ #361
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:rds_ib_get_mr+0x5c/0x230 net/rds/ib_rdma.c:544
      RSP: 0018:ffff8801b059f890 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff8801b07e1300 RCX: ffffffff8562d96e
      RDX: 000000000000000d RSI: 0000000000000001 RDI: 0000000000000068
      RBP: ffff8801b059f8b8 R08: ffffed0036274244 R09: ffff8801b13a1200
      R10: 0000000000000004 R11: ffffed0036274243 R12: ffff8801b13a1200
      R13: 0000000000000001 R14: ffff8801ca09fa9c R15: 0000000000000000
      FS:  00007f4d050af700(0000) GS:ffff8801db300000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f4d050aee78 CR3: 00000001b0d9b006 CR4: 00000000001606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __rds_rdma_map+0x710/0x1050 net/rds/rdma.c:271
       rds_get_mr_for_dest+0x1d4/0x2c0 net/rds/rdma.c:357
       rds_setsockopt+0x6cc/0x980 net/rds/af_rds.c:347
       SYSC_setsockopt net/socket.c:1849 [inline]
       SyS_setsockopt+0x189/0x360 net/socket.c:1828
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      RIP: 0033:0x4456d9
      RSP: 002b:00007f4d050aedb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 00000000006dac3c RCX: 00000000004456d9
      RDX: 0000000000000007 RSI: 0000000000000114 RDI: 0000000000000004
      RBP: 00000000006dac38 R08: 00000000000000a0 R09: 0000000000000000
      R10: 0000000020000380 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fffbfb36d6f R14: 00007f4d050af9c0 R15: 0000000000000005
      Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 cc 01 00 00 4c 8b bb 80 04 00 00
      48
      b8 00 00 00 00 00 fc ff df 49 8d 7f 68 48 89 fa 48 c1 ea 03 <80> 3c 02
      00 0f
      85 9c 01 00 00 4d 8b 7f 68 48 b8 00 00 00 00 00
      RIP: rds_ib_get_mr+0x5c/0x230 net/rds/ib_rdma.c:544 RSP:
      ffff8801b059f890
      ---[ end trace 7e1cea13b85473b0 ]---
      
      Reported-by: syzbot+b51c77ef956678a65834@syzkaller.appspotmail.com
      Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rollback orig value on failure of dev_qdisc_change_tx_queue_len · 7effaf06
      Authored by Tariq Toukan
      Fix dev_change_tx_queue_len so it rolls back the original value
      upon a failure in dev_qdisc_change_tx_queue_len.
      This is already done for notifiers' failures; share that code.

      In case of a failure in dev_qdisc_change_tx_queue_len, some tx queues
      would still have the new length while they should be reverted.
      Currently that revert is not done and is marked with a TODO label
      in dev_qdisc_change_tx_queue_len; it still needs a proper solution.
      Even so, it is better not to apply the newly requested value.
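
      A sketch of the shared rollback path (labels and variable names are
      assumptions):

        dev->tx_queue_len = new_len;
        res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
        res = notifier_to_errno(res);
        if (res)
                goto err_rollback;

        res = dev_qdisc_change_tx_queue_len(dev);
        if (res)
                goto err_rollback;

        return 0;

        err_rollback:
                netdev_err(dev, "refused to change device tx_queue_len\n");
                dev->tx_queue_len = orig_len;
                return res;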
      
      Fixes: 48bfd55e ("net_sched: plug in qdisc ops change_tx_queue_len")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reported-by: Ran Rozenstein <ranro@mellanox.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 26 July 2018, 3 commits
    • xsk: fix poll/POLLIN premature returns · d24458e4
      Authored by Björn Töpel
      Polling for the ingress queues relies on reading the producer/consumer
      pointers of the Rx queue.
      
      Prior to this commit, a cached consumer pointer could be used instead
      of the actual consumer pointer, and POLLIN could therefore be reported
      prematurely.

      This patch makes sure that the non-cached consumer pointer is used
      instead.
      Reported-by: Qi Zhang <qi.z.zhang@intel.com>
      Tested-by: Qi Zhang <qi.z.zhang@intel.com>
      Fixes: c497176c ("xsk: add Rx receive functions and poll support")
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net: igmp: make function __ip_mc_inc_group() static · b87bac10
      Authored by Wei Yongjun
      Fixes the following sparse warning:
      
      net/ipv4/igmp.c:1391:6: warning:
       symbol '__ip_mc_inc_group' was not declared. Should it be static?
      
      Fixes: 6e2059b5 ("ipv4/igmp: init group mode as INCLUDE when join source group")
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: ack immediately when a cwr packet arrives · 9aee4000
      Authored by Lawrence Brakmo
      We observed high 99th and 99.9th percentile latencies when doing RPCs with DCTCP. The
      problem is triggered when the last packet of a request arrives CE
      marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
      to 1 (because there are no packets in flight). When the 1st packet of
      the next request arrives, the ACK was sometimes delayed even though it
      is CWR marked, adding up to 40ms to the RPC latency.
      
      This patch ensures that arriving CWR-marked data packets are acked
      immediately.
      
      Packetdrill script to reproduce the problem:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 2:3(1) ack 2001
      
      0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
      0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
      0.200 > [ect01] . 3:3(0) ack 4001
      
      0.210 < [ce] P. 4001:4501(500) ack 3 win 257
      
      +0.001 read(4, ..., 4500) = 4500
      +0 write(4, ..., 1) = 1
      +0 > [ect01] PE. 3:4(1) ack 4501
      
      +0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
      // Previously the ACK sequence below would be 4501, causing a long RTO
      +0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack
      
      +0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
      +0 > [ect01] . 4:4(0) ack 6501     // now acks everything
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      
      Modified based on comments by Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 25 July 2018, 1 commit
    • ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull · 2efd4fca
      Authored by Willem de Bruijn
      Syzbot reported a read beyond the end of the skb head when returning
      IPV6_ORIGDSTADDR:
      
        BUG: KMSAN: kernel-infoleak in put_cmsg+0x5ef/0x860 net/core/scm.c:242
        CPU: 0 PID: 4501 Comm: syz-executor128 Not tainted 4.17.0+ #9
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
        Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x185/0x1d0 lib/dump_stack.c:113
          kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1125
          kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1219
          kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1261
          copy_to_user include/linux/uaccess.h:184 [inline]
          put_cmsg+0x5ef/0x860 net/core/scm.c:242
          ip6_datagram_recv_specific_ctl+0x1cf3/0x1eb0 net/ipv6/datagram.c:719
          ip6_datagram_recv_ctl+0x41c/0x450 net/ipv6/datagram.c:733
          rawv6_recvmsg+0x10fb/0x1460 net/ipv6/raw.c:521
          [..]
      
      This logic and its ipv4 counterpart read the destination port from
      the packet at skb_transport_offset(skb) + 4.
      
      With MSG_MORE and a local SOCK_RAW sender, syzbot was able to cook a
      packet that stores headers exactly up to skb_transport_offset(skb) in
      the head and the remainder in a frag.
      
      Call pskb_may_pull() before accessing the pointer to ensure that it
      lies in the skb head.
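
      A sketch of the guard for the IPv4 case (the IPv6 path is analogous;
      details assumed):

        __be16 *ports;
        int end = skb_transport_offset(skb) + 4;

        /* the port bytes may sit in a frag; pull them into the head first */
        if (!pskb_may_pull(skb, end))
                return;

        ports = (__be16 *)skb_transport_header(skb);  /* now safely in the head */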
      
      Link: http://lkml.kernel.org/r/CAF=yD-LEJwZj5a1-bAAj2Oy_hKmGygV6rsJ_WOrAYnv-fnayiQ@mail.gmail.com
      Reported-by: syzbot+9adb4b567003cac781f0@syzkaller.appspotmail.com
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 24 July 2018, 1 commit