1. 10 Aug, 2018 7 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · e91e2189
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-08-10
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix cpumap and devmap on teardown as they're under RCU context
         and won't have same assumption as running under NAPI protection,
         from Jesper.
      
      2) Fix various sockmap bugs in bpf_tcp_sendmsg() code, e.g. we had
         a bug where socket error was not propagated correctly, from Daniel.
      
      3) Fix incompatible libbpf header license for BTF code and match it
         before it gets officially released with the rest of libbpf which
         is LGPL-2.1, from Martin.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-fix-cpu-and-devmap-teardown' · 9c954201
      Daniel Borkmann committed
      Jesper Dangaard Brouer says:
      
      ====================
      Removing entries from cpumap and devmap goes through a number of
      synchronization steps to make sure no new xdp_frames can be enqueued.
      But there is a small chance that some xdp_frames remain which have not
      been flushed/processed yet.  Flushing these during teardown happens
      from RCU context and not, as usual, under RX NAPI context.
      
      The optimization introduced in commit 389ab7f0 ("xdp: introduce
      xdp_return_frame_rx_napi") missed that the flush operation can also
      be called from RCU context.  Thus, we cannot always use the
      xdp_return_frame_rx_napi call, which takes advantage of the protection
      provided by XDP RX running under NAPI protection.
      
      The samples/bpf program xdp_redirect_cpu has a --stress-mode, which is
      adjusted to make the race easier to reproduce (verified by Red Hat QA).
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xdp: fix bug in devmap teardown code path · 1bf9116d
      Jesper Dangaard Brouer committed
      Like cpumap teardown, the devmap teardown code also flushes remaining
      xdp_frames, via bq_xmit_all(), in case a map entry is removed.  The code
      can call xdp_return_frame_rx_napi from the wrong context in case
      ndo_xdp_xmit() fails.
      
      Fixes: 389ab7f0 ("xdp: introduce xdp_return_frame_rx_napi")
      Fixes: 735fc405 ("xdp: change ndo_xdp_xmit API to support bulking")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples/bpf: xdp_redirect_cpu adjustment to reproduce teardown race easier · 37d7ff25
      Jesper Dangaard Brouer committed
      The teardown race in cpumap is really hard to reproduce.  These changes
      make it easier to reproduce, for QA.
      
      The --stress-mode now has a case with a very small queue size of 8, which
      helps the teardown flush encounter a full queue, which results in calling
      the xdp_return_frame API in a non-NAPI protected context.
      
      Also increase MAX_CPUS, as my QA department has larger machines than me.
      Tested-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xdp: fix bug in cpumap teardown code path · ad0ab027
      Jesper Dangaard Brouer committed
      When removing a cpumap entry, a number of synchronization steps happen.
      Eventually the teardown code __cpu_map_entry_free() is invoked from/via
      call_rcu.
      
      The teardown code __cpu_map_entry_free() flushes remaining xdp_frames,
      by invoking bq_flush_to_queue, which calls xdp_return_frame_rx_napi().
      The issue is that the teardown code is not running in the RX NAPI
      code path.  Thus, it is not allowed to invoke the NAPI variant of
      xdp_return_frame.
      
      This bug was found and triggered by using the --stress-mode option to
      the samples/bpf program xdp_redirect_cpu.  It is hard to trigger,
      because the ptr_ring has to be full, the cpumap bulk queue contains at
      most 8 packets, and a remote CPU is racing to empty the ptr_ring
      queue.
      
      Fixes: 389ab7f0 ("xdp: introduce xdp_return_frame_rx_napi")
      Tested-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
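The idea behind the cpumap and devmap fixes can be sketched in user space: a flush routine that may run in more than one context is told which context it is in and picks the matching release path. This is a minimal sketch, not the kernel code; `struct frame`, `frame_return_napi()`, `frame_return_any()`, `drop_frames()` and the two counters are hypothetical stand-ins for xdp_frame and the two xdp_return_frame variants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an xdp_frame. */
struct frame { int id; };

/* Counters so the chosen path is observable in this sketch. */
static int fast_returns;
static int safe_returns;

/* Fast path: only valid when the caller runs under NAPI protection. */
static void frame_return_napi(struct frame *f) { (void)f; fast_returns++; }

/* General path: safe from any context, e.g. an RCU callback. */
static void frame_return_any(struct frame *f) { (void)f; safe_returns++; }

/* Release every frame, choosing the variant that matches the caller's context. */
static void drop_frames(struct frame *frames, size_t n, bool in_napi_ctx)
{
    assert(frames != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (in_napi_ctx)
            frame_return_napi(&frames[i]);
        else
            frame_return_any(&frames[i]);
    }
}
```

A teardown caller would pass `false` for `in_napi_ctx`, while the regular RX path would pass `true`.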
    • Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 112cbae2
      Linus Torvalds committed
      Pull crypto fix from Herbert Xu:
       "This fixes a performance regression in arm64 NEON crypto as well as a
        crash in x86 aegis/morus on unsupported CPUs"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: x86/aegis,morus - Fix and simplify CPUID checks
        crypto: arm64 - revert NEON yield for fast AEAD implementations
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 6395ad85
      Linus Torvalds committed
      Pull networking fixes from David Miller:
      
       1) The real fix for the ipv6 route metric leak Sabrina was seeing, from
          Cong Wang.
      
       2) Fix syzbot-triggered AF_PACKET v3 ring buffer insufficient-room
          conditions, from Willem de Bruijn.
      
       3) vsock can reinitialize active work struct, fix from Cong Wang.
      
       4) RXRPC keepalive generator can wedge a cpu, fix from David Howells.
      
       5) Fix locking in AF_SMC ioctl, from Ursula Braun.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        dsa: slave: eee: Allow ports to use phylink
        net/smc: move sock lock in smc_ioctl()
        net/smc: allow sysctl rmem and wmem defaults for servers
        net/smc: no shutdown in state SMC_LISTEN
        net: aquantia: Fix IFF_ALLMULTI flag functionality
        rxrpc: Fix the keepalive generator [ver #2]
        net/mlx5e: Cleanup of dcbnl related fields
        net/mlx5e: Properly check if hairpin is possible between two functions
        vhost: reset metadata cache when initializing new IOTLB
        llc: use refcount_inc_not_zero() for llc_sap_find()
        dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart()
        tipc: fix an interrupt unsafe locking scenario
        vsock: split dwork to avoid reinitializations
        net: thunderx: check for failed allocation lmac->dmacs
        cxgb4: mk_act_open_req() buggers ->{local, peer}_ip on big-endian hosts
        packet: refine ring v3 block size test to hold one frame
        ip6_tunnel: use the right value for ipv4 min mtu check in ip6_tnl_xmit
        ipv6: fix double refcount of fib6_metrics
  2. 09 Aug, 2018 17 commits
  3. 08 Aug, 2018 6 commits
    • llc: use refcount_inc_not_zero() for llc_sap_find() · 0dcb8225
      Cong Wang committed
      llc_sap_put() decreases the refcnt before deleting sap
      from the global list. Therefore, there is a chance
      llc_sap_find() could find a sap with zero refcnt
      in this global list.
      
      Close this race condition by checking in llc_sap_find() whether
      refcnt is zero. If it is zero, the sap is being removed, so we can
      just treat it as gone.
      
      Reported-by: <syzbot+278893f3f7803871f7ce@syzkaller.appspotmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
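The "increment unless zero" pattern described above can be sketched with C11 atomics. This is an illustration only, not the kernel's refcount_inc_not_zero(); `struct obj` and `obj_get_unless_zero()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted object such as a sap. */
struct obj {
    atomic_uint refcnt;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns false when the object is already on its way to being freed,
 * so a lookup can treat it as gone.
 */
static bool obj_get_unless_zero(struct obj *o)
{
    assert(o != NULL);
    unsigned int old = atomic_load(&o->refcnt);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
            return true;
    }
    return false;
}
```

A lookup that walks a shared list would call this on each candidate and skip entries for which it returns false.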
    • dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart() · 61ef4b07
      Alexey Kodanev committed
      The shift of 'cwnd' with '(now - hc->tx_lsndtime) / hc->tx_rto' value
      can lead to undefined behavior [1].
      
      In order to fix this use a gradual shift of the window with a 'while'
      loop, similar to what tcp_cwnd_restart() is doing.
      
      When comparing delta and RTO there is a minor difference between TCP
      and DCCP: the latter also invokes ccid2_cwnd_restart() and reduces
      'cwnd' if delta equals RTO. That case is preserved in this change.
      
      [1]:
      [40850.963623] UBSAN: Undefined behaviour in net/dccp/ccids/ccid2.c:237:7
      [40851.043858] shift exponent 67 is too large for 32-bit type 'unsigned int'
      [40851.127163] CPU: 3 PID: 15940 Comm: netstress Tainted: G        W   E     4.18.0-rc7.x86_64 #1
      ...
      [40851.377176] Call Trace:
      [40851.408503]  dump_stack+0xf1/0x17b
      [40851.451331]  ? show_regs_print_info+0x5/0x5
      [40851.503555]  ubsan_epilogue+0x9/0x7c
      [40851.548363]  __ubsan_handle_shift_out_of_bounds+0x25b/0x2b4
      [40851.617109]  ? __ubsan_handle_load_invalid_value+0x18f/0x18f
      [40851.686796]  ? xfrm4_output_finish+0x80/0x80
      [40851.739827]  ? lock_downgrade+0x6d0/0x6d0
      [40851.789744]  ? xfrm4_prepare_output+0x160/0x160
      [40851.845912]  ? ip_queue_xmit+0x810/0x1db0
      [40851.895845]  ? ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
      [40851.963530]  ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
      [40852.029063]  dccp_xmit_packet+0x1d3/0x720 [dccp]
      [40852.086254]  dccp_write_xmit+0x116/0x1d0 [dccp]
      [40852.142412]  dccp_sendmsg+0x428/0xb20 [dccp]
      [40852.195454]  ? inet_dccp_listen+0x200/0x200 [dccp]
      [40852.254833]  ? sched_clock+0x5/0x10
      [40852.298508]  ? sched_clock+0x5/0x10
      [40852.342194]  ? inet_create+0xdf0/0xdf0
      [40852.388988]  sock_sendmsg+0xd9/0x160
      ...
      
      Fixes: 113ced1f ("dccp ccid-2: Perform congestion-window validation")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
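The gradual shift described above can be sketched as a stand-alone function. This is a simplified illustration under assumed types, not the code in net/dccp/ccids/ccid2.c; `cwnd_restart()` and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Halve cwnd once per elapsed RTO, never going below floor_cwnd.
 * Each step shifts by exactly 1, so the shift count can never reach the
 * bit width of the type, unlike a single 'cwnd >> (delta / rto)'.
 * The '>= 0' comparison keeps the case where delta equals one RTO.
 */
static uint32_t cwnd_restart(uint32_t cwnd, uint32_t floor_cwnd,
                             int32_t delta, int32_t rto)
{
    assert(rto > 0 && delta >= 0);
    while ((delta -= rto) >= 0 && cwnd > floor_cwnd)
        cwnd >>= 1;
    return cwnd > floor_cwnd ? cwnd : floor_cwnd;
}
```

With an idle time of 67 RTOs, as in the UBSAN report, the loop simply stops once the floor is reached instead of attempting a 67-bit shift.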
    • tipc: fix an interrupt unsafe locking scenario · 37436d9c
      Ying Xue committed
      Commit 9faa89d4 ("tipc: make function tipc_net_finalize() thread
      safe") tries to make it thread safe to set node address, so it uses
      node_list_lock lock to serialize the whole process of setting node
      address in tipc_net_finalize(). But it causes the following interrupt
      unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        rht_deferred_worker()
        rhashtable_rehash_table()
        lock(&(&ht->lock)->rlock)
                                     tipc_nl_compat_doit()
                                     tipc_net_finalize()
                                     local_irq_disable();
                                     lock(&(&tn->node_list_lock)->rlock);
                                     tipc_sk_reinit()
                                     rhashtable_walk_enter()
                                     lock(&(&ht->lock)->rlock);
        <Interrupt>
        tipc_disc_rcv()
        tipc_node_check_dest()
        tipc_node_create()
        lock(&(&tn->node_list_lock)->rlock);
      
       *** DEADLOCK ***
      
      When rhashtable_rehash_table() holds ht->lock on CPU0, it doesn't
      disable BH. So if an interrupt happens after the lock, it can create
      an inverse lock ordering between ht->lock and tn->node_list_lock. As
      a consequence, deadlock might happen.
      
      The inverse lock ordering scenario above arises because
      node_list_lock was never designed to serialize node address
      setting.
      
      As cmpxchg() can guarantee CAS (compare-and-swap) process is atomic,
      we use it to replace node_list_lock to ensure setting node address can
      be atomically finished. It turns out the potential deadlock can be
      avoided as well.
      
      Fixes: 9faa89d4 ("tipc: make function tipc_net_finalize() thread safe")
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <maloy@donjonn.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
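Replacing a lock with a compare-and-swap for a one-time assignment can be sketched with C11 atomics. This is a user-space illustration, not the TIPC code; `node_addr` and `set_node_addr_once()` are hypothetical, and the sketch assumes that 0 means "no address assigned yet".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node state: 0 is assumed to mean "address not yet assigned". */
static _Atomic uint32_t node_addr;

/*
 * Set the address exactly once without taking a lock.
 * Only the caller that wins the compare-and-swap gets true back and may
 * go on to do the one-time follow-up work; later callers get false.
 */
static bool set_node_addr_once(uint32_t addr)
{
    uint32_t expected = 0;

    assert(addr != 0);
    return atomic_compare_exchange_strong(&node_addr, &expected, addr);
}
```

Because no lock is held while the follow-up work runs, there is no lock ordering to invert.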
    • vsock: split dwork to avoid reinitializations · 455f05ec
      Cong Wang committed
      syzbot reported that we reinitialize an active delayed
      work in vsock_stream_connect():
      
      	ODEBUG: init active (active state 0) object type: timer_list hint:
      	delayed_work_timer_fn+0x0/0x90 kernel/workqueue.c:1414
      	WARNING: CPU: 1 PID: 11518 at lib/debugobjects.c:329
      	debug_print_object+0x16a/0x210 lib/debugobjects.c:326
      
      The pattern is apparently wrong: we should initialize the delayed
      work only once and can then schedule it repeatedly. So we have to
      move the initializations out to the allocation side.
      And to avoid confusion, we can split the shared dwork
      into two, instead of re-using the same one.
      
      Fixes: d021c344 ("VSOCK: Introduce VM Sockets")
      Reported-by: <syzbot+8a9b1bd330476a4f3db6@syzkaller.appspotmail.com>
      Cc: Andy King <acking@vmware.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Jorgen Hansen <jhansen@vmware.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: thunderx: check for failed allocation lmac->dmacs · a94cead7
      Colin Ian King committed
      The allocation of lmac->dmacs is not being checked for allocation
      failure. Add the check.
      
      Fixes: 3a34ecfd ("net: thunderx: add MAC address filter tracking for LMAC")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: mk_act_open_req() buggers ->{local, peer}_ip on big-endian hosts · adfb442d
      Al Viro committed
      Unlike fs.val.lport and fs.val.fport, cxgb4_process_flow_match()
      sets fs.val.{l,f}ip to net-endian values without conversion - they come
      straight from flow_dissector_key_ipv4_addrs ->dst and ->src resp.  So
      the assignment in mk_act_open_req() ought to be a straight copy.
      
      As far as I know, T4 PCIe cards do exist, so it's not as if that
      thing could only be found on little-endian systems...
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 07 Aug, 2018 4 commits
    • crypto: x86/aegis,morus - Fix and simplify CPUID checks · 877ccce7
      Ondrej Mosnacek committed
      It turns out I had misunderstood how the x86_match_cpu() function works.
      It evaluates a logical OR of the matching conditions, not logical AND.
      This caused the CPU feature checks for AEGIS to pass even if only SSE2
      (but not AES-NI) was supported (or vice versa), leading to potential
      crashes if something tried to use the registered algs.
      
      This patch switches the checks to a simpler method that is used e.g. in
      the Camellia x86 code.
      
      The patch also removes the MODULE_DEVICE_TABLE declarations which
      actually seem to cause the modules to be auto-loaded at boot, which is
      not desired. The crypto API on-demand module loading is sufficient.
      
      Fixes: 1d373d4e ("crypto: x86 - Add optimized AEGIS implementations")
      Fixes: 6ecc9d9f ("crypto: x86 - Add optimized MORUS implementations")
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Tested-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
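The OR-versus-AND confusion described above can be illustrated with a plain bitmask. This is a simplified sketch, not x86_match_cpu() or the kernel's feature checks; `FEAT_SSE2`, `FEAT_AES`, `match_any()` and `has_all()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature bits, for illustration only. */
enum { FEAT_SSE2 = 1 << 0, FEAT_AES = 1 << 1 };

/* Table-style match: succeeds as soon as ANY listed feature is present (OR). */
static bool match_any(unsigned int cpu_feats, const unsigned int *table, size_t n)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (cpu_feats & table[i])
            return true;
    return false;
}

/* Explicit check: succeeds only when ALL required features are present (AND). */
static bool has_all(unsigned int cpu_feats, unsigned int required)
{
    return (cpu_feats & required) == required;
}
```

On a CPU with only `FEAT_SSE2`, `match_any()` over a table listing both features succeeds while `has_all()` correctly fails, which is the difference the fix relies on.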
    • crypto: arm64 - revert NEON yield for fast AEAD implementations · f10dc56c
      Ard Biesheuvel committed
      As it turns out, checking the TIF_NEED_RESCHED flag after each
      iteration results in a significant performance regression (~10%)
      when running fast algorithms (i.e., ones that use special instructions
      and operate in the < 4 cycles per byte range) on in-order cores with
      comparatively slow memory accesses such as the Cortex-A53.
      
      Given the speed of these ciphers, and the fact that the page based
      nature of the AEAD scatterwalk API guarantees that the core NEON
      transform is never invoked with more than a single page's worth of
      input, we can estimate the worst case duration of any resulting
      scheduling blackout: on a 1 GHz Cortex-A53 running with 64k pages,
      processing a page's worth of input at 4 cycles per byte results in
      a delay of ~250 us, which is a reasonable upper bound.
      
      So let's remove the yield checks from the fused AES-CCM and AES-GCM
      routines entirely.
      
      This reverts commit 7b67ae4d and
      partially reverts commit 7c50136a.
      
      Fixes: 7c50136a ("crypto: arm64/aes-ghash - yield NEON after every ...")
      Fixes: 7b67ae4d ("crypto: arm64/aes-ccm - yield NEON after every ...")
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
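The worst-case estimate in the message above can be checked by hand, taking a 64k page to mean 65536 bytes (an assumption; the message only says "64k pages"):

```latex
t_{\max} \approx \frac{65536\ \text{bytes} \times 4\ \text{cycles/byte}}{10^{9}\ \text{cycles/s}}
         = \frac{262144}{10^{9}}\ \text{s}
         \approx 262\ \mu\text{s}
```

This agrees with the quoted figure of ~250 us to within rounding.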
    • Merge tag 'gpio-v4.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio · 1236568e
      Linus Torvalds committed
      Pull GPIO fix from Linus Walleij:
       "This is a single fix affecting X86 ACPI, and as such pretty important.
      
        It is going to stable as well and has all the high-notch x86 platform
        developers agreeing on it"
      
      * tag 'gpio-v4.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
        gpiolib-acpi: make sure we trigger edge events at least once on boot
    • packet: refine ring v3 block size test to hold one frame · 4576cd46
      Willem de Bruijn committed
      TPACKET_V3 stores variable length frames in fixed length blocks.
      Blocks must be able to store a block header, optional private space
      and at least one minimum sized frame.
      
      Frames, even for a zero snaplen packet, store metadata headers and
      optional reserved space.
      
      In the block size bounds check, ensure that the frame of the
      chosen configuration fits. This includes sockaddr_ll and optional
      tp_reserve.
      
      Syzbot was able to construct a ring with insufficient room for the
      sockaddr_ll in the header of a zero-length frame, triggering an
      out-of-bounds write in dev_parse_header.
      
      Convert the comparison to less than, as zero is a valid snap len.
      This matches the test for minimum tp_frame_size immediately below.
      
      Fixes: f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation.")
      Fixes: eb73190f ("net/packet: refine check for priv area size")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
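The bounds check described above can be sketched as a stand-alone predicate. This is an illustration, not the code in net/packet/af_packet.c; `block_size_ok()` and its parameters are hypothetical, and the caller is assumed to pass a frame header length that already includes sockaddr_ll.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A block must hold its own header, the optional private area, and at
 * least one minimum-sized frame: frame metadata headers plus any reserved
 * space. The frame payload itself may be empty, because a snap length of
 * zero is valid, so an exact fit is accepted ("less than" rejects).
 */
static bool block_size_ok(uint32_t block_size, uint32_t block_hdr_len,
                          uint32_t priv_len, uint32_t frame_hdr_len,
                          uint32_t reserve)
{
    assert(block_hdr_len > 0); /* a block always carries a header */

    /* Sum in 64 bits so large 32-bit inputs cannot wrap around. */
    uint64_t needed = (uint64_t)block_hdr_len + priv_len +
                      frame_hdr_len + reserve;

    return block_size >= needed;
}
```

Rejecting only when the block is strictly smaller than the required size mirrors the "convert the comparison to less than" step in the message.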
  5. 06 Aug, 2018 6 commits