1. 18 4月, 2018 13 次提交
    • D
      net/ipv6: Defer initialization of dst to data path · 6edb3c96
      David Ahern 提交于
      Defer setting dst input, output and error until fib entry is copied.
      
      The reject path from ip6_route_info_create is moved to a new function
      ip6_rt_init_dst_reject with a helper doing the conversion from fib6_type
      to dst error.
      
      The remainder of the new ip6_rt_init_dst is an amalgamtion of dst code
      from addrconf_dst_alloc and the non-reject path of ip6_route_info_create.
      The dst output function is always ip6_output and the input function is
      either ip6_input (local routes), ip6_mc_input (multicast routes) or
      ip6_forward (anything else).
      
      A couple of places using dst.error are updated to look at rt6i_flags.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6edb3c96
    • D
      net/ipv6: Move nexthop data to fib6_nh · 5e670d84
      David Ahern 提交于
      Introduce fib6_nh structure and move nexthop related data from
      rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
      lwtstate from a FIB lookup perspective are converted to use fib6_nh;
      datapath references to dst version are left as is.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e670d84
    • D
      net/ipv6: Save route type in rt6_info · e8478e80
      David Ahern 提交于
      The RTN_ type for IPv6 FIB entries is currently embedded in rt6i_flags
      and dst.error. Since dst is going to be removed, it can no longer be
      relied on for FIB dumps so save the route type as fib6_type.
      
      fc_type is set in current users based on the algorithm in rt6_fill_node:
        - rt6i_flags contains RTF_LOCAL: fc_type = RTN_LOCAL
        - rt6i_flags contains RTF_ANYCAST: fc_type = RTN_ANYCAST
        - else fc_type = RTN_UNICAST
      
      Similarly, fib6_type is set in the rt6_info templates based on the
      RTF_REJECT section of rt6_fill_node converting dst.error to RTN type.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e8478e80
    • D
      net/ipv6: Move support functions up in route.c · ae90d867
      David Ahern 提交于
      Code move only.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae90d867
    • D
      net/ipv6: Pass net namespace to route functions · afb1d4b5
      David Ahern 提交于
      Pass network namespace reference into route add, delete and get
      functions.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      afb1d4b5
    • D
      net/ipv6: Pass net to fib6_update_sernum · 7aef6859
      David Ahern 提交于
      Pass net namespace to fib6_update_sernum. It can not be marked const
      as fib6_new_sernum will change ipv6.fib6_sernum.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7aef6859
    • D
      net: Handle null dst in rtnl_put_cacheinfo · 3940746d
      David Ahern 提交于
      Need to keep expires time for IPv6 routes in a dump of FIB entries.
      Update rtnl_put_cacheinfo to allow dst to be NULL in which case
      rta_cacheinfo will only contain non-dst data.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3940746d
    • D
      net: Move fib_convert_metrics to metrics file · a919525a
      David Ahern 提交于
      Move logic of fib_convert_metrics into ip_metrics_convert. This allows
      the code that converts netlink attributes into metrics struct to be
      re-used in a later patch by IPv6.
      
      This is mostly a code move with the following changes to variable names:
        - fi->fib_net becomes net
        - fc_mx and fc_mx_len are passed as inputs pulled from fib_config
        - metrics array is passed as an input from fi->fib_metrics->metrics
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a919525a
    • L
      ipv6: send netlink notifications for manually configured addresses · a2d481b3
      Lorenzo Bianconi 提交于
      Send a netlink notification when userspace adds a manually configured
      address if DAD is enabled and optimistic flag isn't set.
      Moreover send RTM_DELADDR notifications for tentative addresses.
      
      Some userspace applications (e.g. NetworkManager) are interested in
      addr netlink events albeit the address is still in tentative state,
      however events are not sent if DAD process is not completed.
      If the address is added and immediately removed userspace listeners
      are not notified. This behaviour can be easily reproduced by using
      veth interfaces:
      
      $ ip -b - <<EOF
      > link add dev vm1 type veth peer name vm2
      > link set dev vm1 up
      > link set dev vm2 up
      > addr add 2001:db8:a:b:1:2:3:4/64 dev vm1
      > addr del 2001:db8:a:b:1:2:3:4/64 dev vm1
      EOF
      
      This patch reverts the behaviour introduced by the commit f784ad3d
      ("ipv6: do not send RTM_DELADDR for tentative addresses")
      Suggested-by: NThomas Haller <thaller@redhat.com>
      Signed-off-by: NLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2d481b3
    • S
      net/ncsi: Refactor MAC, VLAN filters · 062b3e1b
      Samuel Mendoza-Jonas 提交于
      The NCSI driver defines a generic ncsi_channel_filter struct that can be
      used to store arbitrarily formatted filters, and several generic methods
      of accessing data stored in such a filter.
      However in both the driver and as defined in the NCSI specification
      there are only two actual filters: VLAN ID filters and MAC address
      filters. The splitting of the MAC filter into unicast, multicast, and
      mixed is also technically not necessary as these are stored in the same
      location in hardware.
      
      To save complexity, particularly in the set up and accessing of these
      generic filters, remove them in favour of two specific structs. These
      can be acted on directly and do not need several generic helper
      functions to use.
      
      This also fixes a memory error found by KASAN on ARM32 (which is not
      upstream yet), where response handlers accessing a filter's data field
      could write past allocated memory.
      
      [  114.926512] ==================================================================
      [  114.933861] BUG: KASAN: slab-out-of-bounds in ncsi_configure_channel+0x4b8/0xc58
      [  114.941304] Read of size 2 at addr 94888558 by task kworker/0:2/546
      [  114.947593]
      [  114.949146] CPU: 0 PID: 546 Comm: kworker/0:2 Not tainted 4.16.0-rc6-00119-ge156398bfcad #13
      ...
      [  115.170233] The buggy address belongs to the object at 94888540
      [  115.170233]  which belongs to the cache kmalloc-32 of size 32
      [  115.181917] The buggy address is located 24 bytes inside of
      [  115.181917]  32-byte region [94888540, 94888560)
      [  115.192115] The buggy address belongs to the page:
      [  115.196943] page:9eeac100 count:1 mapcount:0 mapping:94888000 index:0x94888fc1
      [  115.204200] flags: 0x100(slab)
      [  115.207330] raw: 00000100 94888000 94888fc1 0000003f 00000001 9eea2014 9eecaa74 96c003e0
      [  115.215444] page dumped because: kasan: bad access detected
      [  115.221036]
      [  115.222544] Memory state around the buggy address:
      [  115.227384]  94888400: fb fb fb fb fc fc fc fc 04 fc fc fc fc fc fc fc
      [  115.233959]  94888480: 00 00 00 fc fc fc fc fc 00 04 fc fc fc fc fc fc
      [  115.240529] >94888500: 00 00 04 fc fc fc fc fc 00 00 04 fc fc fc fc fc
      [  115.247077]                                             ^
      [  115.252523]  94888580: 00 04 fc fc fc fc fc fc 06 fc fc fc fc fc fc fc
      [  115.259093]  94888600: 00 00 06 fc fc fc fc fc 00 00 04 fc fc fc fc fc
      [  115.265639] ==================================================================
      Reported-by: NJoel Stanley <joel@jms.id.au>
      Signed-off-by: NSamuel Mendoza-Jonas <sam@mendozajonas.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      062b3e1b
    • E
      KEYS: DNS: limit the length of option strings · c210f7b4
      Eric Biggers 提交于
      Adding a dns_resolver key whose payload contains a very long option name
      resulted in that string being printed in full.  This hit the WARN_ONCE()
      in set_precision() during the printk(), because printk() only supports a
      precision of up to 32767 bytes:
      
          precision 1000000 too large
          WARNING: CPU: 0 PID: 752 at lib/vsprintf.c:2189 vsnprintf+0x4bc/0x5b0
      
      Fix it by limiting option strings (combined name + value) to a much more
      reasonable 128 bytes.  The exact limit is arbitrary, but currently the
      only recognized option is formatted as "dnserror=%lu" which fits well
      within this limit.
      
      Also ratelimit the printks.
      
      Reproducer:
      
          perl -e 'print "#", "A" x 1000000, "\x00"' | keyctl padd dns_resolver desc @s
      
      This bug was found using syzkaller.
      Reported-by: NMark Rutland <mark.rutland@arm.com>
      Fixes: 4a2d7892 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c210f7b4
    • S
      ipv6: Count interface receive statistics on the ingress netdev · bdb7cc64
      Stephen Suryaputra 提交于
      The statistics such as InHdrErrors should be counted on the ingress
      netdev rather than on the dev from the dst, which is the egress.
      Signed-off-by: NStephen Suryaputra <ssuryaextr@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdb7cc64
    • D
      net/ipv6: Make __inet6_bind static · 032234d8
      David Ahern 提交于
      BPF core gets access to __inet6_bind via ipv6_bpf_stub_impl, so it is
      not invoked directly outside of af_inet6.c. Make it static and move
      inet6_bind after to avoid forward declaration.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      032234d8
  2. 17 4月, 2018 12 次提交
    • J
      xdp: avoid leaking info stored in frame data on page reuse · 6dfb970d
      Jesper Dangaard Brouer 提交于
      The bpf infrastructure and verifier goes to great length to avoid
      bpf progs leaking kernel (pointer) info.
      
      For queueing an xdp_buff via XDP_REDIRECT, xdp_frame info stores
      kernel info (incl pointers) in top part of frame data (xdp->data_hard_start).
      Checks are in place to assure enough headroom is available for this.
      
      This info is not cleared, and if the frame is reused, then a
      malicious user could use bpf_xdp_adjust_head helper to move
      xdp->data into this area.  Thus, making this area readable.
      
      This is not super critical as XDP progs requires root or
      CAP_SYS_ADMIN, which are privileged enough for such info.  An
      effort (is underway) towards moving networking bpf hooks to the
      lesser privileged mode CAP_NET_ADMIN, where leaking such info
      should be avoided.  Thus, this patch to clear the info when
      needed.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6dfb970d
    • J
      xdp: transition into using xdp_frame for ndo_xdp_xmit · 44fa2dbd
      Jesper Dangaard Brouer 提交于
      Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
      xdp_buff.  This brings xdp_return_frame and ndp_xdp_xmit in sync.
      
      This builds towards changing the API further to become a bulk API,
      because xdp_buff is not a queue-able object while xdp_frame is.
      
      V4: Adjust for commit 59655a5b ("tuntap: XDP_TX can use native XDP")
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44fa2dbd
    • J
      xdp: transition into using xdp_frame for return API · 03993094
      Jesper Dangaard Brouer 提交于
      Changing API xdp_return_frame() to take struct xdp_frame as argument,
      seems like a natural choice. But there are some subtle performance
      details here that needs extra care, which is a deliberate choice.
      
      When de-referencing xdp_frame on a remote CPU during DMA-TX
      completion, result in the cache-line is change to "Shared"
      state. Later when the page is reused for RX, then this xdp_frame
      cache-line is written, which change the state to "Modified".
      
      This situation already happens (naturally) for, virtio_net, tun and
      cpumap as the xdp_frame pointer is the queued object.  In tun and
      cpumap, the ptr_ring is used for efficiently transferring cache-lines
      (with pointers) between CPUs. Thus, the only option is to
      de-referencing xdp_frame.
      
      It is only the ixgbe driver that had an optimization, in which it can
      avoid doing the de-reference of xdp_frame.  The driver already have
      TX-ring queue, which (in case of remote DMA-TX completion) have to be
      transferred between CPUs anyhow.  In this data area, we stored a
      struct xdp_mem_info and a data pointer, which allowed us to avoid
      de-referencing xdp_frame.
      
      To compensate for this, a prefetchw is used for telling the cache
      coherency protocol about our access pattern.  My benchmarks show that
      this prefetchw is enough to compensate the ixgbe driver.
      
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      V8: Adjust for commit bd658dda ("net/mlx5e: Separate dma base address
      and offset in dma_sync call")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03993094
    • J
      xdp: allow page_pool as an allocator type in xdp_return_frame · 57d0a1c1
      Jesper Dangaard Brouer 提交于
      New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.
      
      The registered allocator page_pool pointer is not available directly
      from xdp_rxq_info, but it could be (if needed).  For now, the driver
      should keep separate track of the page_pool pointer, which it should
      use for RX-ring page allocation.
      
      As suggested by Saeed, to maintain a symmetric API it is the drivers
      responsibility to allocate/create and free/destroy the page_pool.
      Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
      responsibility to free the page_pool, but with a RCU free call.  This
      is done easily via the page_pool helper page_pool_destroy() (which
      avoids touching any driver code during the RCU callback, which could
      happen after the driver have been unloaded).
      
      V8: address issues found by kbuild test robot
       - Address sparse should be static warnings
       - Allow xdp.o to be compiled without page_pool.o
      
      V9: Remove inline from .c file, compiler knows best
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57d0a1c1
    • J
      page_pool: refurbish version of page_pool code · ff7d6b27
      Jesper Dangaard Brouer 提交于
      Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
      pages on DMA-TX completion time, which have good cross CPU
      performance, given DMA-TX completion time can happen on a remote CPU.
      
      Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
      Adapted page_pool code to not depend the page allocator and
      integration into struct page.  The DMA mapping feature is kept,
      even-though it will not be activated/used in this patchset.
      
      [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
      
      V2: Adjustments requested by Tariq
       - Changed page_pool_create return codes, don't return NULL, only
         ERR_PTR, as this simplifies err handling in drivers.
      
      V4: many small improvements and cleanups
      - Add DOC comment section, that can be used by kernel-doc
      - Improve fallback mode, to work better with refcnt based recycling
        e.g. remove a WARN as pointed out by Tariq
        e.g. quicker fallback if ptr_ring is empty.
      
      V5: Fixed SPDX license as pointed out by Alexei
      
      V6: Adjustments requested by Eric Dumazet
       - Adjust ____cacheline_aligned_in_smp usage/placement
       - Move rcu_head in struct page_pool
       - Free pages quicker on destroy, minimize resources delayed an RCU period
       - Remove code for forward/backward compat ABI interface
      
      V8: Issues found by kbuild test robot
       - Address sparse should be static warnings
       - Only compile+link when a driver use/select page_pool,
         mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff7d6b27
    • J
      xdp: rhashtable with allocator ID to pointer mapping · 8d5d8852
      Jesper Dangaard Brouer 提交于
      Use the IDA infrastructure for getting a cyclic increasing ID number,
      that is used for keeping track of each registered allocator per
      RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
      uses a radix tree, use a dynamic rhashtable, for creating ID to
      pointer lookup table, because this is faster.
      
      The problem that is being solved here is that, the xdp_rxq_info
      pointer (stored in xdp_buff) cannot be used directly, as the
      guaranteed lifetime is too short.  The info is needed on a
      (potentially) remote CPU during DMA-TX completion time . In an
      xdp_frame the xdp_mem_info is stored, when it got converted from an
      xdp_buff, which is sufficient for the simple page refcnt based recycle
      schemes.
      
      For more advanced allocators there is a need to store a pointer to the
      registered allocator.  Thus, there is a need to guard the lifetime or
      validity of the allocator pointer, which is done through this
      rhashtable ID map to pointer. The removal and validity of of the
      allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
      allocator will be created by the driver, and registered with
      xdp_rxq_info_reg_mem_model().
      
      It is up-to debate who is responsible for freeing the allocator
      pointer or invoking the allocator destructor function.  In any case,
      this must happen via RCU freeing.
      
      Use the IDA infrastructure for getting a cyclic increasing ID number,
      that is used for keeping track of each registered allocator per
      RX-queue xdp_rxq_info.
      
      V4: Per req of Jason Wang
      - Use xdp_rxq_info_reg_mem_model() in all drivers implementing
        XDP_REDIRECT, even-though it's not strictly necessary when
        allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
      
      V6: Per req of Alex Duyck
      - Introduce rhashtable_lookup() call in later patch
      
      V8: Address sparse should be static warnings (from kbuild test robot)
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d5d8852
    • J
      xdp: introduce xdp_return_frame API and use in cpumap · 5ab073ff
      Jesper Dangaard Brouer 提交于
      Introduce an xdp_return_frame API, and convert over cpumap as
      the first user, given it have queued XDP frame structure to leverage.
      
      V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
      V6: Remove comment that id will be added later (Req by Alex Duyck)
      V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ab073ff
    • E
      tcp: implement mmap() for zero copy receive · 93ab6cc6
      Eric Dumazet 提交于
      Some networks can make sure TCP payload can exactly fit 4KB pages,
      with well chosen MSS/MTU and architectures.
      
      Implement mmap() system call so that applications can avoid
      copying data without complex splice() games.
      
      Note that a successful mmap( X bytes) on TCP socket is consuming
      bytes, as if recvmsg() has been done. (tp->copied += X)
      
      Only PROT_READ mappings are accepted, as skb page frags
      are fundamentally shared and read only.
      
      If tcp_mmap() finds data that is not a full page, or a patch of
      urgent data, -EINVAL is returned, no bytes are consumed.
      
      Application must fallback to recvmsg() to read the problematic sequence.
      
      mmap() wont block,  regardless of socket being in blocking or
      non-blocking mode. If not enough bytes are in receive queue,
      mmap() would return -EAGAIN, or -EIO if socket is in a state
      where no other bytes can be added into receive queue.
      
      An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
      to efficiently use mmap()
      
      On the sender side, MSG_EOR might help to clearly separate unaligned
      headers and 4K-aligned chunks if necessary.
      
      Tested:
      
      mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
      MTU set to 4168  (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
      
      Without mmap() (tcp_mmap -s)
      
      received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
        cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
      received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
        cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
      received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
        cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
      received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
        cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches
      
      With mmap() on receiver (tcp_mmap -s -z)
      
      received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
        cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
        cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
      received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
        cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
      received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
        cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93ab6cc6
    • E
      tcp: avoid extra wakeups for SO_RCVLOWAT users · 03f45c88
      Eric Dumazet 提交于
      SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
      generated when enough bytes are available in receive queue, after
      David change (commit c7004482 "tcp: Respect SO_RCVLOWAT in tcp_poll().")
      
      But TCP still calls sk->sk_data_ready() for each chunk added in receive
      queue, meaning thread is awaken, and goes back to sleep shortly after.
      
      Tested:
      
      tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB
      
      -> Should get ~2 wakeups (c-switches) per MB, regardless of how many
      (tiny or big) packets were received.
      
      High speed (mostly full size GRO packets)
      
      received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
        cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches
      
      received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
        cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches
      
      Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)
      
      received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
        cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03f45c88
    • E
      tcp: fix delayed acks behavior for SO_RCVLOWAT · 796f82ea
      Eric Dumazet 提交于
      We should not delay acks if there are not enough bytes
      in receive queue to satisfy SO_RCVLOWAT.
      
      Since [E]POLLIN event is not going to be generated, there is little
      hope for a delayed ack to be useful.
      
      In fact, delaying ACK prevents sender from completing
      the transfer.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      796f82ea
    • E
      tcp: fix SO_RCVLOWAT and RCVBUF autotuning · d1361840
      Eric Dumazet 提交于
      Applications might use SO_RCVLOWAT on TCP socket hoping to receive
      one [E]POLLIN event only when a given amount of bytes are ready in socket
      receive queue.
      
      Problem is that receive autotuning is not aware of this constraint,
      meaning sk_rcvbuf might be too small to allow all bytes to be stored.
      
      Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
      can override the default setsockopt(SO_RCVLOWAT) behavior.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1361840
    • L
      ipv6: remove unnecessary check in addrconf_prefix_rcv_add_addr() · f85f94b8
      Lorenzo Bianconi 提交于
      Remove unnecessary check on update_lft variable in
      addrconf_prefix_rcv_add_addr routine since it is always set to 0.
      Moreover remove update_lft re-initialization to 0
      Signed-off-by: NLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f85f94b8
  3. 12 4月, 2018 2 次提交
    • G
      l2tp: fix race in duplicate tunnel detection · f6cd651b
      Guillaume Nault 提交于
      We can't use l2tp_tunnel_find() to prevent l2tp_nl_cmd_tunnel_create()
      from creating a duplicate tunnel. A tunnel can be concurrently
      registered after l2tp_tunnel_find() returns. Therefore, searching for
      duplicates must be done at registration time.
      
      Finally, remove l2tp_tunnel_find() entirely as it isn't use anywhere
      anymore.
      
      Fixes: 309795f4 ("l2tp: Add netlink control API for L2TP")
      Signed-off-by: NGuillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6cd651b
    • G
      l2tp: fix races in tunnel creation · 6b9f3423
      Guillaume Nault 提交于
      l2tp_tunnel_create() inserts the new tunnel into the namespace's tunnel
      list and sets the socket's ->sk_user_data field, before returning it to
      the caller. Therefore, there are two ways the tunnel can be accessed
      and freed, before the caller even had the opportunity to take a
      reference. In practice, syzbot could crash the module by closing the
      socket right after a new tunnel was returned to pppol2tp_create().
      
      This patch moves tunnel registration out of l2tp_tunnel_create(), so
      that the caller can safely hold a reference before publishing the
      tunnel. This second step is done with the new l2tp_tunnel_register()
      function, which is now responsible for associating the tunnel to its
      socket and for inserting it into the namespace's list.
      
      While moving the code to l2tp_tunnel_register(), a few modifications
      have been done. First, the socket validation tests are done in a helper
      function, for clarity. Also, modifying the socket is now done after
      having inserted the tunnel to the namespace's tunnels list. This will
      allow insertion to fail, without having to revert theses modifications
      in the error path (a followup patch will check for duplicate tunnels
      before insertion). Either the socket is a kernel socket which we
      control, or it is a user-space socket for which we have a reference on
      the file descriptor. In any case, the socket isn't going to be closed
      from under us.
      
      Reported-by: syzbot+fbeeb5c3b538e8545644@syzkaller.appspotmail.com
      Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
      Signed-off-by: NGuillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b9f3423
  4. 11 4月, 2018 1 次提交
  5. 10 4月, 2018 1 次提交
  6. 09 4月, 2018 4 次提交
    • E
      inetpeer: fix uninit-value in inet_getpeer · b6a37e5e
      Eric Dumazet 提交于
      syzbot/KMSAN reported that p->dtime was read while it was
      not yet initialized in :
      
      	delta = (__u32)jiffies - p->dtime;
      	if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
      		gc_stack[i] = NULL;
      
      This is a false positive, because the inetpeer wont be erased
      from rb-tree if the refcount_dec_if_one(&p->refcnt) does not
      succeed. And this wont happen before first inet_putpeer() call
      for this inetpeer has been done, and ->dtime field is written
      exactly before the refcount_dec_and_test(&p->refcnt).
      
      The KMSAN report was :
      
      BUG: KMSAN: uninit-value in inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
      BUG: KMSAN: uninit-value in inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
      CPU: 0 PID: 9494 Comm: syz-executor5 Not tainted 4.16.0+ #82
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
       inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
       inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
       icmpv4_xrlim_allow net/ipv4/icmp.c:330 [inline]
       icmp_send+0x2b44/0x3050 net/ipv4/icmp.c:725
       ip_options_compile+0x237c/0x29f0 net/ipv4/ip_options.c:472
       ip_rcv_options net/ipv4/ip_input.c:284 [inline]
       ip_rcv_finish+0xda8/0x16d0 net/ipv4/ip_input.c:365
       NF_HOOK include/linux/netfilter.h:288 [inline]
       ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
       __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
       __netif_receive_skb net/core/dev.c:4627 [inline]
       netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
       netif_receive_skb+0x230/0x240 net/core/dev.c:4725
       tun_rx_batched drivers/net/tun.c:1555 [inline]
       tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
       tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
       do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
       do_iter_write+0x30d/0xd40 fs/read_write.c:932
       vfs_writev fs/read_write.c:977 [inline]
       do_writev+0x3c9/0x830 fs/read_write.c:1012
       SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
       SyS_writev+0x56/0x80 fs/read_write.c:1082
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x455111
      RSP: 002b:00007fae0365cba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
      RAX: ffffffffffffffda RBX: 000000000000002e RCX: 0000000000455111
      RDX: 0000000000000001 RSI: 00007fae0365cbf0 RDI: 00000000000000fc
      RBP: 0000000020000040 R08: 00000000000000fc R09: 0000000000000000
      R10: 000000000000002e R11: 0000000000000293 R12: 00000000ffffffff
      R13: 0000000000000658 R14: 00000000006fc8e0 R15: 0000000000000000
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
       kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
       kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
       inet_getpeer+0xed8/0x1e70 net/ipv4/inetpeer.c:210
       inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
       ip4_frag_init+0x4d1/0x740 net/ipv4/ip_fragment.c:153
       inet_frag_alloc net/ipv4/inet_fragment.c:369 [inline]
       inet_frag_create net/ipv4/inet_fragment.c:385 [inline]
       inet_frag_find+0x7da/0x1610 net/ipv4/inet_fragment.c:418
       ip_find net/ipv4/ip_fragment.c:275 [inline]
       ip_defrag+0x448/0x67a0 net/ipv4/ip_fragment.c:676
       ip_check_defrag+0x775/0xda0 net/ipv4/ip_fragment.c:724
       packet_rcv_fanout+0x2a8/0x8d0 net/packet/af_packet.c:1447
       deliver_skb net/core/dev.c:1897 [inline]
       deliver_ptype_list_skb net/core/dev.c:1912 [inline]
       __netif_receive_skb_core+0x314a/0x4a80 net/core/dev.c:4545
       __netif_receive_skb net/core/dev.c:4627 [inline]
       netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
       netif_receive_skb+0x230/0x240 net/core/dev.c:4725
       tun_rx_batched drivers/net/tun.c:1555 [inline]
       tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
       tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
       do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
       do_iter_write+0x30d/0xd40 fs/read_write.c:932
       vfs_writev fs/read_write.c:977 [inline]
       do_writev+0x3c9/0x830 fs/read_write.c:1012
       SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
       SyS_writev+0x56/0x80 fs/read_write.c:1082
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6a37e5e
    • J
      devlink: convert occ_get op to separate registration · fc56be47
      Jiri Pirko 提交于
      This resolves race during initialization where the resources with
      ops are registered before driver and the structures used by occ_get
      op is initialized. So keep occ_get callbacks registered only when
      all structs are initialized.
      
      The example flows, as it is in mlxsw:
      1) driver load/asic probe:
         mlxsw_core
            -> mlxsw_sp_resources_register
              -> mlxsw_sp_kvdl_resources_register
                -> devlink_resource_register IDX
         mlxsw_spectrum
            -> mlxsw_sp_kvdl_init
              -> mlxsw_sp_kvdl_parts_init
                -> mlxsw_sp_kvdl_part_init
                  -> devlink_resource_size_get IDX (to get the current setup
                                                    size from devlink)
              -> devlink_resource_occ_get_register IDX (register current
                                                        occupancy getter)
      2) reload triggered by devlink command:
        -> mlxsw_devlink_core_bus_device_reload
          -> mlxsw_sp_fini
            -> mlxsw_sp_kvdl_fini
      	-> devlink_resource_occ_get_unregister IDX
          (struct mlxsw_sp *mlxsw_sp is freed at this point, call to occ get
           which is using mlxsw_sp would cause use-after free)
          -> mlxsw_sp_init
            -> mlxsw_sp_kvdl_init
              -> mlxsw_sp_kvdl_parts_init
                -> mlxsw_sp_kvdl_part_init
                  -> devlink_resource_size_get IDX (to get the current setup
                                                    size from devlink)
              -> devlink_resource_occ_get_register IDX (register current
                                                        occupancy getter)
      
      Fixes: d9f9b9a4 ("devlink: Add support for resource abstraction")
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc56be47
    • C
      tipc: use the right skb in tipc_sk_fill_sock_diag() · e41f0548
      Cong Wang 提交于
      Commit 4b2e6877 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
      tried to fix the crash but failed, the crash is still 100% reproducible
      with it.
      
      In tipc_sk_fill_sock_diag(), skb is the diag dump we are filling, it is not
      correct to retrieve its NETLINK_CB(), instead, like other protocol diag,
      we should use NETLINK_CB(cb->skb).sk here.
      
      Reported-by: <syzbot+326e587eff1074657718@syzkaller.appspotmail.com>
      Fixes: 4b2e6877 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
      Fixes: c30b70de (tipc: implement socket diagnostics for AF_TIPC)
      Cc: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e41f0548
    • E
      sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6 · 81e98370
      Eric Dumazet 提交于
      Check must happen before call to ipv6_addr_v4mapped()
      
      syzbot report was :
      
      BUG: KMSAN: uninit-value in sctp_sockaddr_af net/sctp/socket.c:359 [inline]
      BUG: KMSAN: uninit-value in sctp_do_bind+0x60f/0xdc0 net/sctp/socket.c:384
      CPU: 0 PID: 3576 Comm: syzkaller968804 Not tainted 4.16.0+ #82
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       sctp_sockaddr_af net/sctp/socket.c:359 [inline]
       sctp_do_bind+0x60f/0xdc0 net/sctp/socket.c:384
       sctp_bind+0x149/0x190 net/sctp/socket.c:332
       inet6_bind+0x1fd/0x1820 net/ipv6/af_inet6.c:293
       SYSC_bind+0x3f2/0x4b0 net/socket.c:1474
       SyS_bind+0x54/0x80 net/socket.c:1460
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x43fd49
      RSP: 002b:00007ffe99df3d28 EFLAGS: 00000213 ORIG_RAX: 0000000000000031
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fd49
      RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
      R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401670
      R13: 0000000000401700 R14: 0000000000000000 R15: 0000000000000000
      
      Local variable description: ----address@SYSC_bind
      Variable was created at:
       SYSC_bind+0x6f/0x4b0 net/socket.c:1461
       SyS_bind+0x54/0x80 net/socket.c:1460
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      81e98370
  7. 08 4月, 2018 7 次提交
    • A
      net: dsa: Discard frames from unused ports · fc5f3376
      Andrew Lunn 提交于
      The Marvell switches under some conditions will pass a frame to the
      host with the port being the CPU port. Such frames are invalid, and
      should be dropped. Not dropping them can result in a crash when
      incrementing the receive statistics for an invalid port.
      Reported-by: NChris Healy <cphealy@gmail.com>
      Fixes: 91da11f8 ("net: Distributed Switch Architecture protocol support")
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc5f3376
    • E
      sctp: do not leak kernel memory to user space · 6780db24
      Eric Dumazet 提交于
      syzbot produced a nice report [1]
      
      Issue here is that a recvmmsg() managed to leak 8 bytes of kernel memory
      to user space, because sin_zero (padding field) was not properly cleared.
      
      [1]
      BUG: KMSAN: uninit-value in copy_to_user include/linux/uaccess.h:184 [inline]
      BUG: KMSAN: uninit-value in move_addr_to_user+0x32e/0x530 net/socket.c:227
      CPU: 1 PID: 3586 Comm: syzkaller481044 Not tainted 4.16.0+ #82
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       kmsan_internal_check_memory+0x164/0x1d0 mm/kmsan/kmsan.c:1176
       kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199
       copy_to_user include/linux/uaccess.h:184 [inline]
       move_addr_to_user+0x32e/0x530 net/socket.c:227
       ___sys_recvmsg+0x4e2/0x810 net/socket.c:2211
       __sys_recvmmsg+0x54e/0xdb0 net/socket.c:2313
       SYSC_recvmmsg+0x29b/0x3e0 net/socket.c:2394
       SyS_recvmmsg+0x76/0xa0 net/socket.c:2378
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x4401c9
      RSP: 002b:00007ffc56f73098 EFLAGS: 00000217 ORIG_RAX: 000000000000012b
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
      RDX: 0000000000000001 RSI: 0000000020003ac0 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 0000000020003bc0 R09: 0000000000000010
      R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000401af0
      R13: 0000000000401b80 R14: 0000000000000000 R15: 0000000000000000
      
      Local variable description: ----addr@___sys_recvmsg
      Variable was created at:
       ___sys_recvmsg+0xd5/0x810 net/socket.c:2172
       __sys_recvmmsg+0x54e/0xdb0 net/socket.c:2313
      
      Bytes 8-15 of 16 are uninitialized
      
      ==================================================================
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 3586 Comm: syzkaller481044 Tainted: G    B            4.16.0+ #82
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       panic+0x39d/0x940 kernel/panic.c:183
       kmsan_report+0x238/0x240 mm/kmsan/kmsan.c:1083
       kmsan_internal_check_memory+0x164/0x1d0 mm/kmsan/kmsan.c:1176
       kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199
       copy_to_user include/linux/uaccess.h:184 [inline]
       move_addr_to_user+0x32e/0x530 net/socket.c:227
       ___sys_recvmsg+0x4e2/0x810 net/socket.c:2211
       __sys_recvmmsg+0x54e/0xdb0 net/socket.c:2313
       SYSC_recvmmsg+0x29b/0x3e0 net/socket.c:2394
       SyS_recvmmsg+0x76/0xa0 net/socket.c:2378
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc:	Vlad Yasevich <vyasevich@gmail.com>
      Cc:	Neil Horman <nhorman@tuxdriver.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6780db24
    • E
      soreuseport: initialise timewait reuseport field · 3099a529
      Eric Dumazet 提交于
      syzbot reported an uninit-value in inet_csk_bind_conflict() [1]
      
      It turns out we never propagated sk->sk_reuseport into timewait socket.
      
      [1]
      BUG: KMSAN: uninit-value in inet_csk_bind_conflict+0x5f9/0x990 net/ipv4/inet_connection_sock.c:151
      CPU: 1 PID: 3589 Comm: syzkaller008242 Not tainted 4.16.0+ #82
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       inet_csk_bind_conflict+0x5f9/0x990 net/ipv4/inet_connection_sock.c:151
       inet_csk_get_port+0x1d28/0x1e40 net/ipv4/inet_connection_sock.c:320
       inet6_bind+0x121c/0x1820 net/ipv6/af_inet6.c:399
       SYSC_bind+0x3f2/0x4b0 net/socket.c:1474
       SyS_bind+0x54/0x80 net/socket.c:1460
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x4416e9
      RSP: 002b:00007ffce6d15c88 EFLAGS: 00000217 ORIG_RAX: 0000000000000031
      RAX: ffffffffffffffda RBX: 0100000000000000 RCX: 00000000004416e9
      RDX: 000000000000001c RSI: 0000000020402000 RDI: 0000000000000004
      RBP: 0000000000000000 R08: 00000000e6d15e08 R09: 00000000e6d15e08
      R10: 0000000000000004 R11: 0000000000000217 R12: 0000000000009478
      R13: 00000000006cd448 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was stored to memory at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
       kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
       __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
       tcp_time_wait+0xf17/0xf50 net/ipv4/tcp_minisocks.c:283
       tcp_rcv_state_process+0xebe/0x6490 net/ipv4/tcp_input.c:6003
       tcp_v6_do_rcv+0x11dd/0x1d90 net/ipv6/tcp_ipv6.c:1331
       sk_backlog_rcv include/net/sock.h:908 [inline]
       __release_sock+0x2d6/0x680 net/core/sock.c:2271
       release_sock+0x97/0x2a0 net/core/sock.c:2786
       tcp_close+0x277/0x18f0 net/ipv4/tcp.c:2269
       inet_release+0x240/0x2a0 net/ipv4/af_inet.c:427
       inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:435
       sock_release net/socket.c:595 [inline]
       sock_close+0xe0/0x300 net/socket.c:1149
       __fput+0x49e/0xa10 fs/file_table.c:209
       ____fput+0x37/0x40 fs/file_table.c:243
       task_work_run+0x243/0x2c0 kernel/task_work.c:113
       exit_task_work include/linux/task_work.h:22 [inline]
       do_exit+0x10e1/0x38d0 kernel/exit.c:867
       do_group_exit+0x1a0/0x360 kernel/exit.c:970
       SYSC_exit_group+0x21/0x30 kernel/exit.c:981
       SyS_exit_group+0x25/0x30 kernel/exit.c:979
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      Uninit was stored to memory at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
       kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
       __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
       inet_twsk_alloc+0xaef/0xc00 net/ipv4/inet_timewait_sock.c:182
       tcp_time_wait+0xd9/0xf50 net/ipv4/tcp_minisocks.c:258
       tcp_rcv_state_process+0xebe/0x6490 net/ipv4/tcp_input.c:6003
       tcp_v6_do_rcv+0x11dd/0x1d90 net/ipv6/tcp_ipv6.c:1331
       sk_backlog_rcv include/net/sock.h:908 [inline]
       __release_sock+0x2d6/0x680 net/core/sock.c:2271
       release_sock+0x97/0x2a0 net/core/sock.c:2786
       tcp_close+0x277/0x18f0 net/ipv4/tcp.c:2269
       inet_release+0x240/0x2a0 net/ipv4/af_inet.c:427
       inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:435
       sock_release net/socket.c:595 [inline]
       sock_close+0xe0/0x300 net/socket.c:1149
       __fput+0x49e/0xa10 fs/file_table.c:209
       ____fput+0x37/0x40 fs/file_table.c:243
       task_work_run+0x243/0x2c0 kernel/task_work.c:113
       exit_task_work include/linux/task_work.h:22 [inline]
       do_exit+0x10e1/0x38d0 kernel/exit.c:867
       do_group_exit+0x1a0/0x360 kernel/exit.c:970
       SYSC_exit_group+0x21/0x30 kernel/exit.c:981
       SyS_exit_group+0x25/0x30 kernel/exit.c:979
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
       kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
       kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
       inet_twsk_alloc+0x13b/0xc00 net/ipv4/inet_timewait_sock.c:163
       tcp_time_wait+0xd9/0xf50 net/ipv4/tcp_minisocks.c:258
       tcp_rcv_state_process+0xebe/0x6490 net/ipv4/tcp_input.c:6003
       tcp_v6_do_rcv+0x11dd/0x1d90 net/ipv6/tcp_ipv6.c:1331
       sk_backlog_rcv include/net/sock.h:908 [inline]
       __release_sock+0x2d6/0x680 net/core/sock.c:2271
       release_sock+0x97/0x2a0 net/core/sock.c:2786
       tcp_close+0x277/0x18f0 net/ipv4/tcp.c:2269
       inet_release+0x240/0x2a0 net/ipv4/af_inet.c:427
       inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:435
       sock_release net/socket.c:595 [inline]
       sock_close+0xe0/0x300 net/socket.c:1149
       __fput+0x49e/0xa10 fs/file_table.c:209
       ____fput+0x37/0x40 fs/file_table.c:243
       task_work_run+0x243/0x2c0 kernel/task_work.c:113
       exit_task_work include/linux/task_work.h:22 [inline]
       do_exit+0x10e1/0x38d0 kernel/exit.c:867
       do_group_exit+0x1a0/0x360 kernel/exit.c:970
       SYSC_exit_group+0x21/0x30 kernel/exit.c:981
       SyS_exit_group+0x25/0x30 kernel/exit.c:979
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: da5e3630 ("soreuseport: TCP/IPv4 implementation")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3099a529
    • E
      ipv4: fix uninit-value in ip_route_output_key_hash_rcu() · d0ea2b12
      Eric Dumazet 提交于
      syzbot complained that res.type could be used while not initialized.
      
      Using RTN_UNSPEC as initial value seems better than using garbage.
      
      BUG: KMSAN: uninit-value in __mkroute_output net/ipv4/route.c:2200 [inline]
      BUG: KMSAN: uninit-value in ip_route_output_key_hash_rcu+0x31f0/0x3940 net/ipv4/route.c:2493
      CPU: 1 PID: 12207 Comm: syz-executor0 Not tainted 4.16.0+ #81
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       __mkroute_output net/ipv4/route.c:2200 [inline]
       ip_route_output_key_hash_rcu+0x31f0/0x3940 net/ipv4/route.c:2493
       ip_route_output_key_hash net/ipv4/route.c:2322 [inline]
       __ip_route_output_key include/net/route.h:126 [inline]
       ip_route_output_flow+0x1eb/0x3c0 net/ipv4/route.c:2577
       raw_sendmsg+0x1861/0x3ed0 net/ipv4/raw.c:653
       inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
       sock_sendmsg_nosec net/socket.c:630 [inline]
       sock_sendmsg net/socket.c:640 [inline]
       SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
       SyS_sendto+0x8a/0xb0 net/socket.c:1715
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x455259
      RSP: 002b:00007fdc0625dc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007fdc0625e6d4 RCX: 0000000000455259
      RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000013
      RBP: 000000000072bea0 R08: 0000000020000080 R09: 0000000000000010
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000004f7 R14: 00000000006fa7c8 R15: 0000000000000000
      
      Local variable description: ----res.i.i@ip_route_output_flow
      Variable was created at:
       ip_route_output_flow+0x75/0x3c0 net/ipv4/route.c:2576
       raw_sendmsg+0x1861/0x3ed0 net/ipv4/raw.c:653
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d0ea2b12
    • E
      dccp: initialize ireq->ir_mark · b855ff82
      Eric Dumazet 提交于
      syzbot reported an uninit-value read of skb->mark in iptable_mangle_hook()
      
      Thanks to the nice report, I tracked the problem to dccp not caring
      of ireq->ir_mark for passive sessions.
      
      BUG: KMSAN: uninit-value in ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
      BUG: KMSAN: uninit-value in iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
      CPU: 0 PID: 5300 Comm: syz-executor3 Not tainted 4.16.0+ #81
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
       iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
       nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
       nf_hook_slow+0x158/0x3d0 net/netfilter/core.c:483
       nf_hook include/linux/netfilter.h:243 [inline]
       __ip_local_out net/ipv4/ip_output.c:113 [inline]
       ip_local_out net/ipv4/ip_output.c:122 [inline]
       ip_queue_xmit+0x1d21/0x21c0 net/ipv4/ip_output.c:504
       dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
       dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
       dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
       dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
       inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
       sock_sendmsg_nosec net/socket.c:630 [inline]
       sock_sendmsg net/socket.c:640 [inline]
       ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
       __sys_sendmsg net/socket.c:2080 [inline]
       SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
       SyS_sendmsg+0x54/0x80 net/socket.c:2087
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x455259
      RSP: 002b:00007f1a4473dc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007f1a4473e6d4 RCX: 0000000000455259
      RDX: 0000000000000000 RSI: 0000000020b76fc8 RDI: 0000000000000015
      RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000004f0 R14: 00000000006fa720 R15: 0000000000000000
      
      Uninit was stored to memory at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
       kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
       __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
       ip_queue_xmit+0x1e35/0x21c0 net/ipv4/ip_output.c:502
       dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
       dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
       dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
       dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
       inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
       sock_sendmsg_nosec net/socket.c:630 [inline]
       sock_sendmsg net/socket.c:640 [inline]
       ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
       __sys_sendmsg net/socket.c:2080 [inline]
       SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
       SyS_sendmsg+0x54/0x80 net/socket.c:2087
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      Uninit was stored to memory at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
       kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
       __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
       inet_csk_clone_lock+0x503/0x580 net/ipv4/inet_connection_sock.c:797
       dccp_create_openreq_child+0x7f/0x890 net/dccp/minisocks.c:92
       dccp_v4_request_recv_sock+0x22c/0xe90 net/dccp/ipv4.c:408
       dccp_v6_request_recv_sock+0x290/0x2000 net/dccp/ipv6.c:414
       dccp_check_req+0x7b9/0x8f0 net/dccp/minisocks.c:197
       dccp_v4_rcv+0x12e4/0x2630 net/dccp/ipv4.c:840
       ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:288 [inline]
       ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:449 [inline]
       ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
       NF_HOOK include/linux/netfilter.h:288 [inline]
       ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
       __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
       __netif_receive_skb net/core/dev.c:4627 [inline]
       process_backlog+0x62d/0xe20 net/core/dev.c:5307
       napi_poll net/core/dev.c:5705 [inline]
       net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
       __do_softirq+0x56d/0x93d kernel/softirq.c:285
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
       kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
       kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
       reqsk_alloc include/net/request_sock.h:88 [inline]
       inet_reqsk_alloc+0xc4/0x7f0 net/ipv4/tcp_input.c:6145
       dccp_v4_conn_request+0x5cc/0x1770 net/dccp/ipv4.c:600
       dccp_v6_conn_request+0x299/0x1880 net/dccp/ipv6.c:317
       dccp_rcv_state_process+0x2ea/0x2410 net/dccp/input.c:612
       dccp_v4_do_rcv+0x229/0x340 net/dccp/ipv4.c:682
       dccp_v6_do_rcv+0x16d/0x1220 net/dccp/ipv6.c:578
       sk_backlog_rcv include/net/sock.h:908 [inline]
       __sk_receive_skb+0x60e/0xf20 net/core/sock.c:513
       dccp_v4_rcv+0x24d4/0x2630 net/dccp/ipv4.c:874
       ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:288 [inline]
       ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:449 [inline]
       ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
       NF_HOOK include/linux/netfilter.h:288 [inline]
       ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
       __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
       __netif_receive_skb net/core/dev.c:4627 [inline]
       process_backlog+0x62d/0xe20 net/core/dev.c:5307
       napi_poll net/core/dev.c:5705 [inline]
       net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
       __do_softirq+0x56d/0x93d kernel/softirq.c:285
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b855ff82
    • E
      net: fix uninit-value in __hw_addr_add_ex() · 77d36398
      Eric Dumazet 提交于
      syzbot complained :
      
      BUG: KMSAN: uninit-value in memcmp+0x119/0x180 lib/string.c:861
      CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.16.0+ #82
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: ipv6_addrconf addrconf_dad_work
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       memcmp+0x119/0x180 lib/string.c:861
       __hw_addr_add_ex net/core/dev_addr_lists.c:60 [inline]
       __dev_mc_add+0x1c2/0x8e0 net/core/dev_addr_lists.c:670
       dev_mc_add+0x6d/0x80 net/core/dev_addr_lists.c:687
       igmp6_group_added+0x2db/0xa00 net/ipv6/mcast.c:662
       ipv6_dev_mc_inc+0xe9e/0x1130 net/ipv6/mcast.c:914
       addrconf_join_solict net/ipv6/addrconf.c:2078 [inline]
       addrconf_dad_begin net/ipv6/addrconf.c:3828 [inline]
       addrconf_dad_work+0x427/0x2150 net/ipv6/addrconf.c:3954
       process_one_work+0x12c6/0x1f60 kernel/workqueue.c:2113
       worker_thread+0x113c/0x24f0 kernel/workqueue.c:2247
       kthread+0x539/0x720 kernel/kthread.c:239
      
      Fixes: f001fde5 ("net: introduce a list of device addresses dev_addr_list (v6)")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77d36398
    • E
      net: initialize skb->peeked when cloning · b13dda9f
      Eric Dumazet 提交于
      syzbot reported __skb_try_recv_from_queue() was using skb->peeked
      while it was potentially unitialized.
      
      We need to clear it in __skb_clone()
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b13dda9f