1. 30 1月, 2018 33 次提交
    • E
      ipv6: addrconf: break critical section in addrconf_verify_rtnl() · e64e469b
      Eric Dumazet 提交于
      Heiner reported a lockdep splat [1]
      
      This is caused by attempting GFP_KERNEL allocation while RCU lock is
      held and BH blocked.
      
      We believe that addrconf_verify_rtnl() could run for a long period,
      so instead of using GFP_ATOMIC here as Ido suggested, we should break
      the critical section and restart it after the allocation.
      
      [1]
      [86220.125562] =============================
      [86220.125586] WARNING: suspicious RCU usage
      [86220.125612] 4.15.0-rc7-next-20180110+ #7 Not tainted
      [86220.125641] -----------------------------
      [86220.125666] kernel/sched/core.c:6026 Illegal context switch in RCU-bh read-side critical section!
      [86220.125711]
                     other info that might help us debug this:
      
      [86220.125755]
                     rcu_scheduler_active = 2, debug_locks = 1
      [86220.125792] 4 locks held by kworker/0:2/1003:
      [86220.125817]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.125895]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.125959]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
      [86220.126017]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
      [86220.126111]
                     stack backtrace:
      [86220.126142] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
      [86220.126185] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
      [86220.126250] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
      [86220.126288] Call Trace:
      [86220.126312]  dump_stack+0x70/0x9e
      [86220.126337]  lockdep_rcu_suspicious+0xce/0xf0
      [86220.126365]  ___might_sleep+0x1d3/0x240
      [86220.126390]  __might_sleep+0x45/0x80
      [86220.126416]  kmem_cache_alloc_trace+0x53/0x250
      [86220.126458]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.126498]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.126538]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.126580]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.126623]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.126664]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.126708]  addrconf_verify_work+0xe/0x20 [ipv6]
      [86220.126738]  process_one_work+0x258/0x680
      [86220.126765]  worker_thread+0x35/0x3f0
      [86220.126790]  kthread+0x124/0x140
      [86220.126813]  ? process_one_work+0x680/0x680
      [86220.126839]  ? kthread_create_worker_on_cpu+0x40/0x40
      [86220.126869]  ? umh_complete+0x40/0x40
      [86220.126893]  ? call_usermodehelper_exec_async+0x12a/0x160
      [86220.126926]  ret_from_fork+0x4b/0x60
      [86220.126999] BUG: sleeping function called from invalid context at mm/slab.h:420
      [86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: kworker/0:2
      [86220.127082] 4 locks held by kworker/0:2/1003:
      [86220.127107]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.127179]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
      [86220.127242]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
      [86220.127300]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
      [86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
      [86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
      [86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
      [86220.127568] Call Trace:
      [86220.127591]  dump_stack+0x70/0x9e
      [86220.127616]  ___might_sleep+0x14d/0x240
      [86220.127644]  __might_sleep+0x45/0x80
      [86220.127672]  kmem_cache_alloc_trace+0x53/0x250
      [86220.127717]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.127762]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
      [86220.127807]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.127854]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
      [86220.127903]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.127950]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
      [86220.127998]  addrconf_verify_work+0xe/0x20 [ipv6]
      [86220.128032]  process_one_work+0x258/0x680
      [86220.128063]  worker_thread+0x35/0x3f0
      [86220.128091]  kthread+0x124/0x140
      [86220.128117]  ? process_one_work+0x680/0x680
      [86220.128146]  ? kthread_create_worker_on_cpu+0x40/0x40
      [86220.128180]  ? umh_complete+0x40/0x40
      [86220.128207]  ? call_usermodehelper_exec_async+0x12a/0x160
      [86220.128243]  ret_from_fork+0x4b/0x60
      
      Fixes: f3d9832e ("ipv6: addrconf: cleanup locking in ipv6_add_addr")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e64e469b
    • W
      ipv6: change route cache aging logic · 31afeb42
      Wei Wang 提交于
      In current route cache aging logic, if a route has both RTF_EXPIRE and
      RTF_GATEWAY set, the route will only be removed if the neighbor cache
      has no NTF_ROUTER flag. Otherwise, even if the route has expired, it
      won't get deleted.
      Fix this logic to always check if the route has expired first and then
      do the gateway neighbor cache check if previous check decide to not
      remove the exception entry.
      
      Fixes: 1859bac0 ("ipv6: remove from fib tree aged out RTF_CACHE dst")
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31afeb42
    • A
      i40e/i40evf: Update DESC_NEEDED value to reflect larger value · 0a797db3
      Alexander Duyck 提交于
      When compared to ixgbe and other previous Intel drivers the i40e and i40evf
      drivers actually reserve 2 additional descriptors in maybe_stop_tx for
      cache line alignment. We need to update DESC_NEEDED to reflect this as
      otherwise we are more likely to return TX_BUSY which will cause issues with
      things like xmit_more.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a797db3
    • A
      bnxt_en: cleanup DIM work on device shutdown · 0bc0b97f
      Andy Gospodarek 提交于
      Make sure to cancel any pending work that might update driver coalesce
      settings when taking down an interface.
      
      Fixes: 6a8788f2 ("bnxt_en: add support for software dynamic interrupt moderation")
      Signed-off-by: NAndy Gospodarek <gospo@broadcom.com>
      Cc: Michael Chan <michael.chan@broadcom.com>
      Acked-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0bc0b97f
    • D
      net: ipv6: send unsolicited NA after DAD · c76fe2d9
      David Ahern 提交于
      Unsolicited IPv6 neighbor advertisements should be sent after DAD
      completes. Update ndisc_send_unsol_na to skip tentative, non-optimistic
      addresses and have those sent by addrconf_dad_completed after DAD.
      
      Fixes: 4a6e3c5d ("net: ipv6: send unsolicited NA on admin up")
      Reported-by: NVivek Venkatraman <vivek@cumulusnetworks.com>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c76fe2d9
    • A
      gianfar: prevent integer wrapping in the rx handler · 202a0a70
      Andy Spencer 提交于
      When the frame check sequence (FCS) is split across the last two frames
      of a fragmented packet, part of the FCS gets counted twice, once when
      subtracting the FCS, and again when subtracting the previously received
      data.
      
      For example, if 1602 bytes are received, and the first fragment contains
      the first 1600 bytes (including the first two bytes of the FCS), and the
      second fragment contains the last two bytes of the FCS:
      
        'skb->len == 1600' from the first fragment
      
        size  = lstatus & BD_LENGTH_MASK; # 1602
        size -= ETH_FCS_LEN;              # 1598
        size -= skb->len;                 # -2
      
      Since the size is unsigned, it wraps around and causes a BUG later in
      the packet handling, as shown below:
      
        kernel BUG at ./include/linux/skbuff.h:2068!
        Oops: Exception in kernel mode, sig: 5 [#1]
        ...
        NIP [c021ec60] skb_pull+0x24/0x44
        LR [c01e2fbc] gfar_clean_rx_ring+0x498/0x690
        Call Trace:
        [df7edeb0] [c01e2c1c] gfar_clean_rx_ring+0xf8/0x690 (unreliable)
        [df7edf20] [c01e33a8] gfar_poll_rx_sq+0x3c/0x9c
        [df7edf40] [c023352c] net_rx_action+0x21c/0x274
        [df7edf90] [c0329000] __do_softirq+0xd8/0x240
        [df7edff0] [c000c108] call_do_irq+0x24/0x3c
        [c0597e90] [c00041dc] do_IRQ+0x64/0xc4
        [c0597eb0] [c000d920] ret_from_except+0x0/0x18
        --- interrupt: 501 at arch_cpu_idle+0x24/0x5c
      
      Change the size to a signed integer and then trim off any part of the
      FCS that was received prior to the last fragment.
      
      Fixes: 6c389fc9 ("gianfar: fix size of scatter-gathered frames")
      Signed-off-by: NAndy Spencer <aspencer@spacex.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      202a0a70
    • D
      Merge branch 'net_sched-reflect-tx_queue_len-change-for-pfifo_fast' · f7dd5215
      David S. Miller 提交于
      Cong Wang says:
      
      ====================
      net_sched: reflect tx_queue_len change for pfifo_fast
      
      This pathcset restores the pfifo_fast qdisc behavior of dropping
      packets based on latest dev->tx_queue_len. Patch 1 introduces
      a helper, patch 2 introduces a new Qdisc ops which is called when
      we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.
      
      Please see each patch for details.
      
      ---
      v3: use skb_array_resize_multiple()
      v2: handle error case for ->change_tx_queue_len()
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f7dd5215
    • C
      net_sched: implement ->change_tx_queue_len() for pfifo_fast · 7007ba63
      Cong Wang 提交于
      pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
      so we have to resize skb array when we change tx_queue_len.
      
      Other qdiscs which read tx_queue_len are fine because they
      all save it to sch->limit or somewhere else in qdisc during init.
      They don't have to implement this, it is nicer if they do so
      that users don't have to re-configure qdisc after changing
      tx_queue_len.
      
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7007ba63
    • C
      net_sched: plug in qdisc ops change_tx_queue_len · 48bfd55e
      Cong Wang 提交于
      Introduce a new qdisc ops ->change_tx_queue_len() so that
      each qdisc could decide how to implement this if it wants.
      Previously we simply read dev->tx_queue_len, after pfifo_fast
      switches to skb array, we need this API to resize the skb array
      when we change dev->tx_queue_len.
      
      To avoid handling race conditions with TX BH, we need to
      deactivate all TX queues before change the value and bring them
      back after we are done, this also makes implementation easier.
      
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48bfd55e
    • C
      net: introduce helper dev_change_tx_queue_len() · 6a643ddb
      Cong Wang 提交于
      This patch promotes the local change_tx_queue_len() to a core
      helper function, dev_change_tx_queue_len(), so that rtnetlink
      and net-sysfs could share the code. This also prepares for the
      following patch.
      
      Note, the -EFAULT in the original code doesn't make sense,
      we should propagate the errno from notifiers.
      
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a643ddb
    • J
      vhost_net: stop device during reset owner · 4cd87951
      Jason Wang 提交于
      We don't stop device before reset owner, this means we could try to
      serve any virtqueue kick before reset dev->worker. This will result a
      warn since the work was pending at llist during owner resetting. Fix
      this by stopping device during owner reset.
      
      Reported-by: syzbot+eb17c6162478cc50632c@syzkaller.appspotmail.com
      Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server")
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cd87951
    • D
      Merge branch 'net-Ease-to-follow-an-interface-that-moves-to-another-netns' · e8368d9e
      David S. Miller 提交于
      Nicolas Dichtel says:
      
      ====================
      net: Ease to follow an interface that moves to another netns
      
      The goal of this series is to ease the user to follow an interface that
      moves to another netns.
      
      After this series, with a patched iproute2:
      
      $ ip netns
      bar
      foo
      $ ip monitor link &
      $ ip link set dummy0 netns foo
      Deleted 5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
          link/ether 6e:a7:82:35:96:46 brd ff:ff:ff:ff:ff:ff new-nsid 0 new-ifindex 6
      
      => new nsid: 0, new ifindex: 6 (was 5 in the previous netns)
      
      $ ip link set eth1 netns bar
      Deleted 3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
          link/ether 52:54:01:12:34:57 brd ff:ff:ff:ff:ff:ff new-nsid 1 new-ifindex 3
      
      => new nsid: 1, new ifindex: 3 (same ifindex)
      
      $ ip netns
      bar (id: 1)
      foo (id: 0)
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e8368d9e
    • N
      dev: advertise the new ifindex when the netns iface changes · 38e01b30
      Nicolas Dichtel 提交于
      The goal is to let the user follow an interface that moves to another
      netns.
      
      CC: Jiri Benc <jbenc@redhat.com>
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38e01b30
    • N
      dev: always advertise the new nsid when the netns iface changes · c36ac8e2
      Nicolas Dichtel 提交于
      The user should be able to follow any interface that moves to another
      netns.  There is no reason to hide physical interfaces.
      
      CC: Jiri Benc <jbenc@redhat.com>
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c36ac8e2
    • V
      net: ethernet: cavium: Correct Cavium Thunderx NIC driver names accordingly to module name · 6b9e6547
      Vadim Lomovtsev 提交于
      It was found that ethtool provides unexisting module name while
      it queries the specified network device for associated driver
      information. Then user tries to unload that module by provided
      module name and fails.
      
      This happens because ethtool reads value of DRV_NAME macro,
      while module name is defined at the driver's Makefile.
      
      This patch is to correct Cavium CN88xx Thunder NIC driver names
      (DRV_NAME macro) 'thunder-nicvf' to 'nicvf' and 'thunder-nic'
      to 'nicpf', sync bgx and xcv driver names accordingly to their
      module names.
      Signed-off-by: NVadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b9e6547
    • D
      Merge branch 'ptr_ring-fixes' · bfbe5bab
      David S. Miller 提交于
      Michael S. Tsirkin says:
      
      ====================
      ptr_ring fixes
      
      This fixes a bunch of issues around ptr_ring use in net core.
      One of these: "tap: fix use-after-free" is also needed on net,
      but can't be backported cleanly.
      
      I will post a net patch separately.
      
      Lightly tested - Jason, could you pls confirm this
      addresses the security issue you saw with ptr_ring?
      Testing reports would be appreciated too.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Tested-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      bfbe5bab
    • M
      tools/virtio: fix smp_mb on x86 · 491847f3
      Michael S. Tsirkin 提交于
      Offset 128 overlaps the last word of the redzone.
      Use 132 which is always beyond that.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      491847f3
    • M
      tools/virtio: copy READ/WRITE_ONCE · b4eab7de
      Michael S. Tsirkin 提交于
      This is to make ptr_ring test build again.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4eab7de
    • M
      6dd42157
    • M
      tools/virtio: switch to __ptr_ring_empty · 30f1d370
      Michael S. Tsirkin 提交于
      We don't rely on lockless guarantees, but it
      seems cleaner than inverting __ptr_ring_peek.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30f1d370
    • M
      ptr_ring: prevent queue load/store tearing · a07d29c6
      Michael S. Tsirkin 提交于
      In theory compiler could tear queue loads or stores in two. It does not
      seem to be happening in practice but it seems easier to convert the
      cases where this would be a problem to READ/WRITE_ONCE than worry about
      it.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a07d29c6
    • M
      skb_array: use __ptr_ring_empty · f417dc28
      Michael S. Tsirkin 提交于
      __skb_array_empty should use __ptr_ring_empty since that's the only
      legal lockless function.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f417dc28
    • M
      Revert "net: ptr_ring: otherwise safe empty checks can overrun array bounds" · 9fb582b6
      Michael S. Tsirkin 提交于
      This reverts commit bcecb4bb.
      
      If we try to allocate an extra entry as the above commit did, and when
      the requested size is UINT_MAX, addition overflows causing zero size to
      be passed to kmalloc().
      
      kmalloc then returns ZERO_SIZE_PTR with a subsequent crash.
      
      Reported-by: syzbot+87678bcf753b44c39b67@syzkaller.appspotmail.com
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9fb582b6
    • M
      ptr_ring: disallow lockless __ptr_ring_full · 84328342
      Michael S. Tsirkin 提交于
      Similar to bcecb4bb ("net: ptr_ring: otherwise safe empty checks can
      overrun array bounds") a lockless use of __ptr_ring_full might
      cause an out of bounds access.
      
      We can fix this, but it's easier to just disallow lockless
      __ptr_ring_full for now.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84328342
    • M
      tap: fix use-after-free · 88fae873
      Michael S. Tsirkin 提交于
      Lockless access to __ptr_ring_full is only legal if ring is
      never resized, otherwise it might cause use-after free errors.
      Simply drop the lockless test, we'll drop the packet
      a bit later when produce fails.
      
      Fixes: 362899b8 ("macvtap: switch to use skb array")
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      88fae873
    • M
      ptr_ring: READ/WRITE_ONCE for __ptr_ring_empty · a259df36
      Michael S. Tsirkin 提交于
      Lockless __ptr_ring_empty requires that consumer head is read and
      written at once, atomically. Annotate accordingly to make sure compiler
      does it correctly.  Switch locked callers to __ptr_ring_peek which does
      not support the lockless operation.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a259df36
    • M
      ptr_ring: clean up documentation · 8619d384
      Michael S. Tsirkin 提交于
      The only function safe to call without locks
      is __ptr_ring_empty. Move documentation about
      lockless use there to make sure people do not
      try to use __ptr_ring_peek outside locks.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8619d384
    • M
      ptr_ring: keep consumer_head valid at all times · 406de755
      Michael S. Tsirkin 提交于
      The comment near __ptr_ring_peek says:
      
       * If ring is never resized, and if the pointer is merely
       * tested, there's no need to take the lock - see e.g.  __ptr_ring_empty.
      
      but this was in fact never possible since consumer_head would sometimes
      point outside the ring. Refactor the code so that it's always
      pointing within a ring.
      
      Fixes: c5ad119f ("net: sched: pfifo_fast use skb_array")
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      406de755
    • M
      ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only · 7ece54a6
      Martin KaFai Lau 提交于
      If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it
      implicitly implies it is an ipv6only socket.  However, in inet6_bind(),
      this addr_type checking and setting sk->sk_ipv6only to 1 are only done
      after sk->sk_prot->get_port(sk, snum) has been completed successfully.
      
      This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses
      the 'get_port()'.
      
      In particular, when binding SO_REUSEPORT UDP sockets,
      udp_reuseport_add_sock(sk,...) is called.  udp_reuseport_add_sock()
      checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to
      sk2->sk_reuseport_cb.  In this case, ipv6_only_sock(sk2) could be
      1 while ipv6_only_sock(sk) is still 0 here.  The end result is,
      reuseport_alloc(sk) is called instead of adding sk to the existing
      sk2->sk_reuseport_cb.
      
      It can be reproduced by binding two SO_REUSEPORT UDP sockets on an
      IPv6 address (!ANY and !MAPPED).  Only one of the socket will
      receive packet.
      
      The fix is to set the implicit sk_ipv6only before calling get_port().
      The original sk_ipv6only has to be saved such that it can be restored
      in case get_port() failed.  The situation is similar to the
      inet_reset_saddr(sk) after get_port() has failed.
      
      Thanks to Calvin Owens <calvinowens@fb.com> who created an easy
      reproduction which leads to a fix.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ece54a6
    • D
      Merge branch 'rtnetlink-enable-IFLA_IF_NETNSID-for-RTM_DELLINK-RTM_SETINK' · 2479c2c9
      David S. Miller 提交于
      Christian Brauner says:
      
      ====================
      rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK
      
      Based on the previous discussion this enables passing a IFLA_IF_NETNSID
      property along with RTM_SETLINK and RTM_DELLINK requests. The patch for
      RTM_NEWLINK will be sent out in a separate patch since there are more
      corner-cases to think about.
      
      Changelog 2018-01-24:
      * Preserve old behavior and report -ENODEV when either ifindex or ifname is
        provided and IFLA_GROUP is set. Spotted by Wolfgang Bumiller.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2479c2c9
    • C
      rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK · b61ad68a
      Christian Brauner 提交于
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_DELLINK supports the
        IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
        with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
        does not include IFLA_IF_NETNSID userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property userspace
        can send an RTM_DELLINK with a IFLA_IF_NETNSID property. If they receive
        EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
        with RTM_DELLINK. Userpace should then fallback to other means.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b61ad68a
    • C
      rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK · c310bfcb
      Christian Brauner 提交于
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_SETLINK supports the
        IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
        with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply
        does not include IFLA_IF_NETNSID userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property userspace
        can send an RTM_SETLINK with a IFLA_IF_NETNSID property. If they receive
        EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
        with RTM_SETLINK. Userpace should then fallback to other means.
      
        To retain backwards compatibility the kernel will first check whether a
        IFLA_NET_NS_PID or IFLA_NET_NS_FD property has been passed. If either
        one is found it will be used to identify the target network namespace.
        This implies that users who do not care whether their running kernel
        supports IFLA_IF_NETNSID with RTM_SETLINK can pass both
        IFLA_NET_NS_{FD,PID} and IFLA_IF_NETNSID referring to the same network
        namespace.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c310bfcb
    • C
      rtnetlink: enable IFLA_IF_NETNSID in do_setlink() · 7c4f63ba
      Christian Brauner 提交于
      RTM_{NEW,SET}LINK already allow operations on other network namespaces
      by identifying the target network namespace through IFLA_NET_NS_{FD,PID}
      properties. This is done by looking for the corresponding properties in
      do_setlink(). Extend do_setlink() to also look for the IFLA_IF_NETNSID
      property. This introduces no functional changes since all callers of
      do_setlink() currently block IFLA_IF_NETNSID by reporting an error before
      they reach do_setlink().
      
      This introduces the helpers:
      
      static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net, struct
                                                     nlattr *tb[])
      
      static struct net *rtnl_link_get_net_capable(const struct sk_buff *skb,
                                                   struct net *src_net,
      					     struct nlattr *tb[], int cap)
      
      to simplify permission checks and target network namespace retrieval for
      RTM_* requests that already support IFLA_NET_NS_{FD,PID} but get extended
      to IFLA_IF_NETNSID. To perserve backwards compatibility the helpers look
      for IFLA_NET_NS_{FD,PID} properties first before checking for
      IFLA_IF_NETNSID.
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c4f63ba
  2. 29 1月, 2018 5 次提交
    • D
    • D
      Merge tag 'wireless-drivers-next-for-davem-2018-01-26' of... · 868c36dc
      David S. Miller 提交于
      Merge tag 'wireless-drivers-next-for-davem-2018-01-26' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
      
      Kalle Valo says:
      
      ====================
      wireless-drivers-next patches for 4.16
      
      Major changes:
      
      wil6210
      
      * add PCI device id for Talyn
      
      * support flashless device
      
      ath9k
      
      * improve RSSI/signal accuracy on AR9003 series
      
      mt76
      
      * validate CCMP PN from received frames to avoid replay attacks
      
      qtnfmac
      
      * support 64-bit network stats
      
      * report more hardware information to kernel log and some via ethtool
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      868c36dc
    • K
      sfc: mark some unexported symbols as static · e7345ba3
      kbuild test robot 提交于
      efx_default_channel_want_txqs() is only used in efx.c, while
       efx_ptp_want_txqs() and efx_ptp_channel_type (a struct) are only used
       in ptp.c.  In all cases these symbols should be static.
      
      Fixes: 2935e3c3 ("sfc: on 8000 series use TX queues for TX timestamps")
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      [ecree@solarflare.com: rewrote commit message]
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7345ba3
    • D
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 5abe9ead
      David S. Miller 提交于
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2018-01-26
      
      This series contains updates to i40e and i40evf.
      
      Michal updates the driver to pass critical errors from the firmware to
      the caller.
      
      Patryk fixes an issue of creating multiple identical filters with the
      same location, by simply moving the functions so that we remove the
      existing filter and then add the new filter.
      
      Paweł adds back in the ability to turn off offloads when VLAN is set for
      the VF driver.  Fixed an issue where the number of TC queue pairs was
      exceeding MSI-X vectors count, causing messages about invalid TC mapping
      and wrong selected Tx queue.
      
      Alex cleans up the i40e/i40evf_set_itr_per_queue() by dropping all the
      unneeded pointer chases.  Puts to use the reg_idx value, which was going
      unused, so that we can avoid having to compute the vector every time
      throughout the driver.
      
      Upasana enable the driver to display LLDP information on the vSphere Web
      Client by exposing DCB parameters.
      
      Alice converts our flags from 32 to 64 bit size, since we have added
      more flags.
      
      Dave implements a private ethtool flag to disable the processing of LLDP
      packets by the firmware, so that the firmware will not consume LLDPDU
      and cause them to be sent up the stack.
      
      Alan adds a mechanism for detecting/storing the flag for processing of
      LLDP packets by the firmware, so that its current state is persistent
      across reboots/reloads of the driver.
      
      Avinash fixes kdump with i40e due to resource constraints.  We were
      enabling VMDq and iWARP when we just have a single CPU, which was
      starving kdump for the lack of IRQs.
      
      Jake adds support to program the fragmented IPv4 input set PCTYPE.
      Fixed the reported masks to properly report that the entire field is
      masked, since we had accidentally swapped the mask values for the IPv4
      addresses with the L4 port numbers.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5abe9ead
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 457740a9
      David S. Miller 提交于
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2018-01-26
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) A number of extensions to tcp-bpf, from Lawrence.
          - direct R or R/W access to many tcp_sock fields via bpf_sock_ops
          - passing up to 3 arguments to bpf_sock_ops functions
          - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
          - optionally calling bpf_sock_ops program when RTO fires
          - optionally calling bpf_sock_ops program when packet is retransmitted
          - optionally calling bpf_sock_ops program when TCP state changes
          - access to tclass and sk_txhash
          - new selftest
      
      2) div/mod exception handling, from Daniel.
          One of the ugly leftovers from the early eBPF days is that div/mod
          operations based on registers have a hard-coded src_reg == 0 test
          in the interpreter as well as in JIT code generators that would
          return from the BPF program with exit code 0. This was basically
          adopted from cBPF interpreter for historical reasons.
          There are multiple reasons why this is very suboptimal and prone
          to bugs. To name one: the return code mapping for such abnormal
          program exit of 0 does not always match with a suitable program
          type's exit code mapping. For example, '0' in tc means action 'ok'
          where the packet gets passed further up the stack, which is just
          undesirable for such cases (e.g. when implementing policy) and
          also does not match with other program types.
          After considering _four_ different ways to address the problem,
          we adapt the same behavior as on some major archs like ARMv8:
          X div 0 results in 0, and X mod 0 results in X. aarch64 and
          aarch32 ISA do not generate any traps or otherwise aborts
          of program execution for unsigned divides.
          Given the options, it seems the most suitable from
          all of them, also since major archs have similar schemes in
          place. Given this is all in the realm of undefined behavior,
          we still have the option to adapt if deemed necessary.
      
      3) sockmap sample refactoring, from John.
      
      4) lpm map get_next_key fixes, from Yonghong.
      
      5) test cleanups, from Alexei and Prashant.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      457740a9
  3. 28 1月, 2018 2 次提交
    • D
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 6b2e2829
      David S. Miller 提交于
      Jeff Kirsher says:
      
      ====================
      10GbE Intel Wired LAN Driver Updates 2018-01-26
      
      This series contains updates to ixgbe and ixgbevf.
      
      Emil updates ixgbevf to match ixgbe functionality, starting with the
      consolidating of functions that represent logical steps in the receive
      process so we can later update them more easily.  Updated ixgbevf to
      only synchronize the length of the frame, which will typically be the
      MTU or smaller.  Updated the VF driver to use the length of the packet
      instead of the DD status bit to determine if a new descriptor is ready
      to be processed, which saves on reads and we can save time on
      initialization.  Added support for DMA_ATTR_SKIP_CPU_SYNC/WEAK_ORDERING
      to help improve performance on some platforms.  Updated the VF driver to
      do bulk updates of the page reference count instead of just incrementing
      it by one reference at a time.  Updated the VF driver to only go through
      the region of the receive ring that was designated to be cleaned up,
      rather than process the entire ring.
      
      Colin Ian King adds the use of ARRAY_SIZE() on various arrays.
      
      Miroslav Lichvar fixes an issue where ethtool was reporting timestamping
      filters unsupported for X550, which is incorrect.
      
      Paul adds support for reporting 5G link speed for some devices.
      
      Dan Carpenter fixes a typo where && was used when it should have been
      ||.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b2e2829
    • L
      net/rocker: Remove unreachable return instruction · 751c45bd
      Leon Romanovsky 提交于
      The "return 0" instruction follows other return instruction
      and it makes it impossible to execute, hence remove it.
      
      Fixes: 00fc0c51 ("rocker: Change world_ops API and implementation to be switchdev independant")
      Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      751c45bd