1. 21 Jun, 2021 1 commit
  2. 16 Jun, 2021 1 commit
  3. 13 Jun, 2021 1 commit
    • net: make get_net_ns return error if NET_NS is disabled · ea6932d7
      Authored by Changbin Du
      There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled.
      The reason is that nsfs tries to access ns->ops but the proc_ns_operations
      is not implemented in this case.
      
      [7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010
      [7.670268] pgd = 32b54000
      [7.670544] [00000010] *pgd=00000000
      [7.671861] Internal error: Oops: 5 [#1] SMP ARM
      [7.672315] Modules linked in:
      [7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2 #16
      [7.673309] Hardware name: Generic DT based system
      [7.673642] PC is at nsfs_evict+0x24/0x30
      [7.674486] LR is at clear_inode+0x20/0x9c
      
      The same applies to the tun SIOCGSKNS command.
      
      To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is
      disabled. We also move it to its proper place, net/core/net_namespace.c.
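
The behaviour described above can be sketched in user space as follows. This is a minimal model, not the kernel patch: CONFIG_NET_NS is treated as an ordinary preprocessor macro, ERR_PTR()/PTR_ERR()/IS_ERR() are simplified stand-ins for the kernel helpers of the same names, and struct ns_common is reduced to a single hypothetical field.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's error-pointer helpers (assumption). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, heavily reduced namespace object. */
struct ns_common {
	int inum;
};

#ifdef CONFIG_NET_NS
/* With namespaces enabled, hand back the namespace that was passed in. */
static struct ns_common *get_net_ns(struct ns_common *ns)
{
	return ns;
}
#else
/* With namespaces disabled, report an error instead of a half-built object. */
static struct ns_common *get_net_ns(struct ns_common *ns)
{
	(void)ns;
	return ERR_PTR(-EINVAL);
}
#endif
```

A caller would check IS_ERR() on the result and propagate PTR_ERR() to user space rather than dereferencing the pointer.
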
      Signed-off-by: Changbin Du <changbin.du@gmail.com>
      Fixes: c62cce2c ("net: add an ioctl to get a socket network namespace")
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 11 Jun, 2021 1 commit
  5. 10 Jun, 2021 1 commit
    • rtnetlink: Fix regression in bridge VLAN configuration · d2e381c4
      Authored by Ido Schimmel
      Cited commit started returning errors when notification info is not
      filled by the bridge driver, resulting in the following regression:
      
       # ip link add name br1 type bridge vlan_filtering 1
       # bridge vlan add dev br1 vid 555 self pvid untagged
       RTNETLINK answers: Invalid argument
      
      As long as the bridge driver does not fill notification info for the
      bridge device itself, an empty notification should not be considered as
      an error. This is explained in commit 59ccaaaa ("bridge: dont send
      notification when skb->len == 0 in rtnl_bridge_notify").
      
      Fix by removing the error and adding a comment to avoid future bugs.
      
      Fixes: a8db57c1 ("rtnetlink: Fix missing error code in rtnl_bridge_notify()")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 08 Jun, 2021 1 commit
  7. 04 Jun, 2021 2 commits
  8. 02 Jun, 2021 1 commit
  9. 28 May, 2021 1 commit
  10. 21 May, 2021 1 commit
  11. 15 May, 2021 3 commits
    • mm: fix struct page layout on 32-bit systems · 9ddb3c14
      Authored by Matthew Wilcox (Oracle)
      32-bit architectures which expect 8-byte alignment for 8-byte integers and
      need 64-bit DMA addresses (arm, mips, ppc) had their struct page
      inadvertently expanded in 2019.  When the dma_addr_t was added, it forced
      the alignment of the union to 8 bytes, which inserted a 4 byte gap between
      'flags' and the union.
      
      Fix this by storing the dma_addr_t in one or two adjacent unsigned longs.
      This restores the alignment to that of an unsigned long.  We always
      store the low bits in the first word to prevent the PageTail bit from
      being inadvertently set on a big endian platform.  If that happened,
      get_user_pages_fast() racing against a page which was freed and
      reallocated to the page_pool could dereference a bogus compound_head(),
      which would be hard to trace back to this cause.
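
The storage scheme described above can be sketched as follows. This is an illustrative user-space model, not the kernel code: struct pool_page and the two helper names are hypothetical, and dma_addr_t is assumed to be 64 bits wide. The low bits always go into the first word, as the commit message requires.

```c
#include <stdint.h>

typedef uint64_t dma_addr_t;	/* assumption: 64-bit DMA addresses */

/* Hypothetical stand-in for the relevant part of struct page. */
struct pool_page {
	/* [0] holds the low bits; [1] holds the high bits when needed. */
	unsigned long dma_addr[2];
};

static void pool_set_dma_addr(struct pool_page *page, dma_addr_t addr)
{
	/* Low word first, regardless of endianness. */
	page->dma_addr[0] = (unsigned long)addr;
	/* The second word is only needed when unsigned long is narrower. */
	if (sizeof(dma_addr_t) > sizeof(unsigned long))
		page->dma_addr[1] = (unsigned long)(addr >> 32);
}

static dma_addr_t pool_get_dma_addr(const struct pool_page *page)
{
	dma_addr_t ret = page->dma_addr[0];

	/* Shift in two steps so the expression stays valid for any width. */
	if (sizeof(dma_addr_t) > sizeof(unsigned long))
		ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
	return ret;
}
```

Because the field is an array of unsigned long, its alignment is that of unsigned long, so no padding is forced in front of it on 32-bit targets.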
      
      Link: https://lkml.kernel.org/r/20210510153211.1504886-1-willy@infradead.org
      Fixes: c25fff71 ("mm: add dma_addr_t to struct page")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Matteo Croce <mcroce@linux.microsoft.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net: sched: fix tx action reschedule issue with stopped queue · dcad9ee9
      Authored by Yunsheng Lin
      The netdev queue might be stopped when the byte queue limit has
      been reached or the tx hw ring is full. net_tx_action() may still
      be rescheduled if STATE_MISSED is set, which consumes CPU
      unnecessarily without dequeuing and transmitting any skb because
      the netdev queue is stopped, see qdisc_run_end().
      
      This patch fixes it by checking the netdev queue state before
      calling qdisc_run() and clearing STATE_MISSED if the netdev queue
      is stopped during qdisc_run(). net_tx_action() is rescheduled
      again when the netdev queue is restarted, see netif_tx_wake_queue().
      
      As there is a time window between the netif_xmit_frozen_or_stopped()
      check and the clearing of STATE_MISSED, during which STATE_MISSED
      may be set by a net_tx_action() scheduled by netif_tx_wake_queue(),
      set STATE_MISSED again if the netdev queue has been restarted.
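
The clear-then-recheck step from the last paragraph can be modelled roughly as below. This is a sketch under stated assumptions: struct qdisc_model, struct txq_model, txq_stopped() and maybe_clear_missed() are hypothetical names, the bit value of STATE_MISSED is invented, and C11 sequentially consistent atomics stand in for the kernel's bit operations and memory barrier.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define STATE_MISSED (1u << 0)	/* hypothetical bit value */

/* Hypothetical, reduced models of the tx queue and the qdisc. */
struct txq_model {
	atomic_bool stopped;
};

struct qdisc_model {
	atomic_uint state;
};

static bool txq_stopped(struct txq_model *txq)
{
	return atomic_load(&txq->stopped);
}

/*
 * Called when the queue was found stopped: drop STATE_MISSED so that
 * net_tx_action() is not rescheduled for nothing, then look at the queue
 * again. If it was restarted in the meantime, put STATE_MISSED back so
 * the wake-up is not lost.
 */
static void maybe_clear_missed(struct qdisc_model *q, struct txq_model *txq)
{
	atomic_fetch_and(&q->state, ~STATE_MISSED);
	/* seq_cst atomics order the clear before the re-check below. */
	if (!txq_stopped(txq))
		atomic_fetch_or(&q->state, STATE_MISSED);
}
```

The re-check after the clear is what closes the window: a restart that lands between the first stopped check and the clear is observed by the second check.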
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Reported-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix tx action rescheduling issue during deactivation · 102b55ee
      Authored by Yunsheng Lin
      Currently qdisc_run() checks the STATE_DEACTIVATED flag of a lockless
      qdisc before calling __qdisc_run(), which ultimately clears
      STATE_MISSED when all skbs are dequeued. If STATE_DEACTIVATED
      is set before STATE_MISSED is cleared, net_tx_action() may be
      rescheduled at the end of qdisc_run_end(), see below:
      
      CPU0(net_tx_action)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
                .                   .                     .
                .            set STATE_MISSED             .
                .           __netif_schedule()            .
                .                   .           set STATE_DEACTIVATED
                .                   .                qdisc_reset()
                .                   .                     .
                .<---------------   .              synchronize_net()
      clear __QDISC_STATE_SCHED  |  .                     .
                .                |  .                     .
                .                |  .            some_qdisc_is_busy()
                .                |  .               return *false*
                .                |  .                     .
        test STATE_DEACTIVATED   |  .                     .
      __qdisc_run() *not* called |  .                     .
                .                |  .                     .
         test STATE_MISSED       |  .                     .
       __netif_schedule()--------|  .                     .
                .                   .                     .
                .                   .                     .
      
      __qdisc_run() is not called by net_tx_action() on CPU0 because
      CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate(), and
      STATE_MISSED is only cleared in __qdisc_run(). __netif_schedule()
      is therefore called at the end of qdisc_run_end(), causing the tx
      action rescheduling problem.
      
      qdisc_run() called by net_tx_action() runs in softirq context,
      which should have the same semantics as the qdisc_run() called by
      __dev_xmit_skb() protected by rcu_read_lock_bh(). And since there is a
      synchronize_net() between the STATE_DEACTIVATED flag being set and
      qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we can safely
      bail out for the deactivated lockless qdisc in net_tx_action(), and
      qdisc_reset() will reset all skbs not dequeued yet.
      
      So add the rcu_read_lock() explicitly to protect the qdisc_run()
      and do the STATE_DEACTIVATED checking in net_tx_action() before
      calling qdisc_run_begin(). Another option is to do the checking in
      the qdisc_run_end(), but it will add unnecessary overhead for
      non-tx_action case, because __dev_queue_xmit() will not see qdisc
      with STATE_DEACTIVATED after synchronize_net(), the qdisc with
      STATE_DEACTIVATED can only be seen by net_tx_action() because of
      __netif_schedule().
      
      The STATE_DEACTIVATED check in qdisc_run() is there to avoid a race
      between net_tx_action() and qdisc_reset(), see:
      commit d518d2ed ("net/sched: fix race between deactivation
      and dequeue for NOLOCK qdisc"). As the bailout added above for the
      deactivated lockless qdisc in net_tx_action() provides better
      protection against the race without calling qdisc_run() at all,
      remove the STATE_DEACTIVATED check in qdisc_run().
      
      After qdisc_reset(), there is no skb in qdisc to be dequeued, so
      clear the STATE_MISSED in dev_reset_queue() too.
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      V8: Clearing STATE_MISSED before calling __netif_schedule() has
          avoided the endless rescheduling problem, but there may still
          be an unnecessary rescheduling, so adjust the commit log.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 13 May, 2021 1 commit
  13. 01 May, 2021 2 commits
  14. 29 Apr, 2021 1 commit
  15. 24 Apr, 2021 2 commits
    • devlink: Extend SF port attributes to have external attribute · a1ab3e45
      Authored by Parav Pandit
      Extend SF port attributes to have an optional external flag, similar to
      PCI PF and VF port attributes.
      
      The external attribute is required to generate a unique phys_port_name when
      the PF number and SF number overlap between two controllers, similar to
      SR-IOV VFs.
      
      When an SF is for an external controller, an example view of the external
      SF port and config sequence is as follows.
      
      On eswitch system:
      $ devlink dev eswitch set pci/0033:01:00.0 mode switchdev
      
      $ devlink port show
      pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
      pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
        function:
          hw_addr 00:00:00:00:00:00
      
      $ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
      pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
        function:
          hw_addr 00:00:00:00:00:00 state inactive opstate detached
      
      phys_port_name construction:
      $ cat /sys/class/net/eth1/phys_port_name
      c1pf0sf77
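
The name construction shown above can be sketched like this. It is a minimal illustration rather than the devlink implementation: sf_port_name() is a hypothetical helper, and the assumption that a non-external port simply omits the controller prefix is not stated in the commit message.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Build a phys_port_name for an SF port into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 * Assumption: the "c<controller>" prefix is emitted only for external ports.
 */
static int sf_port_name(char *buf, size_t len, bool external,
			unsigned int controller, unsigned int pfnum,
			unsigned int sfnum)
{
	int prefix = 0;
	int rest;

	if (external)
		prefix = snprintf(buf, len, "c%u", controller);
	if (prefix < 0 || (size_t)prefix >= len)
		return -1;

	rest = snprintf(buf + prefix, len - (size_t)prefix, "pf%usf%u",
			pfnum, sfnum);
	if (rest < 0 || (size_t)rest >= len - (size_t)prefix)
		return -1;

	return prefix + rest;
}
```

With controller 1, pfnum 0 and sfnum 77 on an external port this yields the c1pf0sf77 string from the example, so two controllers that reuse the same PF and SF numbers still get distinct names.
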
      Signed-off-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Vu Pham <vuhuong@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net: sock: remove the unnecessary check in proto_register · ed744d81
      Authored by Tonghao Zhang
      tw_prot_cleanup() already checks twsk_prot, so the check in
      proto_register() is unnecessary.
      
      Fixes: 0f5907af ("net: Fix potential memory leak in proto_register()")
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 23 Apr, 2021 1 commit
  17. 22 Apr, 2021 1 commit
    • neighbour: Prevent Race condition in neighbour subsytem · eefb45ee
      Authored by Chinmay Agarwal
      The following race condition was detected:
      
      <CPU A, t0> - Executing: __netif_receive_skb() -> __netif_receive_skb_core()
      -> arp_rcv() -> arp_process(). arp_process() calls __neigh_lookup(), which
      takes a reference on neighbour entry 'n'.
      Further along, arp_process() calls neigh_update() ->
      __neigh_update(). The neighbour entry is unlocked just before the call to
      neigh_update_gc_list().
      
      This unlocking paves the way for another thread to take a reference on
      the same entry, mark it dead and remove it from gc_list.
      
      <CPU B, t1> - neigh_flush_dev() is executing and calls
      neigh_mark_dead(n), marking the neighbour entry 'n' as dead. 'n' is also
      removed from gc_list.
      Further along, neigh_flush_dev() calls
      neigh_cleanup_and_release(n), but since the reference count was increased in t1,
      'n' cannot be destroyed.
      
      <CPU A, t3> - The code reaches neigh_update_gc_list() with the neighbour
      entry marked as dead.
      
      <CPU A, t4> - arp_process() finally calls neigh_release(n), destroying
      the neighbour entry, and we are left with a destroyed entry that is still
      part of gc_list.
      
      Fixes: eb4e8fac ("neighbour: Prevent a dead entry from updating gc_list")
      Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 21 Apr, 2021 1 commit
  19. 20 Apr, 2021 1 commit
  20. 17 Apr, 2021 2 commits
  21. 16 Apr, 2021 1 commit
  22. 15 Apr, 2021 1 commit
    • skbuff: revert "skbuff: remove some unnecessary operation in skb_segment_list()" · 17c3df70
      Authored by Paolo Abeni
      Commit 1ddc3229 ("skbuff: remove some unnecessary operation
      in skb_segment_list()") introduces an issue very similar to the
      one already fixed by commit 53475c5d ("net: fix use-after-free when
      UDP GRO with shared fraglist").
      
      If the GSO skb goes through skb_clone() and pskb_expand_head() before
      entering skb_segment_list(), the latter will unshare the frag_list
      skbs and will release the old list. With the reverted commit in place,
      when skb_segment_list() completes, skb->next points to the just
      released list, and later on the kernel will hit a use-after-free (UaF).
      
      Note that since commit e0e3070a ("udp: properly complete L4 GRO
      over UDP tunnel packet") the critical scenario can also be reproduced
      by receiving UDP over vxlan traffic with:
      
      NIC (NETIF_F_GRO_FRAGLIST enabled) -> vxlan -> UDP sink
      
      Attaching a packet socket to the NIC will cause skb_clone() and the
      tunnel decapsulation will call pskb_expand_head().
      
      Fixes: 1ddc3229 ("skbuff: remove some unnecessary operation in skb_segment_list()")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 14 Apr, 2021 1 commit
    • gro: ensure frag0 meets IP header alignment · 38ec4944
      Authored by Eric Dumazet
      After commit 0f6925b3 ("virtio_net: Do not pull payload in skb->head")
      Guenter Roeck reported one failure in his tests using sh architecture.
      
      After much debugging, we have been able to spot silent unaligned accesses
      in inet_gro_receive().
      
      The issue at hand is that upper networking stacks assume their header
      is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN
      bytes before the Ethernet header to make that happen.
      
      This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path
      if the fragment is not properly aligned.
      
      Some arches like x86, arm64 and powerpc do not care and define NET_IP_ALIGN
      as 0; this extra check will be a NOP for them.
      
      Note that if frag0 is not used, GRO will call pskb_may_pull()
      as many times as needed to pull network and transport headers.
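
The alignment condition can be sketched as below. This is a hedged illustration only: frag0_fast_path_ok() is a hypothetical helper, and the 4-byte mask and the sum of fragment offset and network header offset are assumptions about how "properly aligned" is evaluated, not a copy of skb_gro_reset_offset().

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether the frag0 fast path may be used.
 *
 * net_ip_align: the architecture's NET_IP_ALIGN value (0 on arches that
 *               tolerate unaligned accesses).
 * frag_off:     offset of the first fragment within its page.
 * nhoff:        offset of the network header within the fragment.
 */
static bool frag0_fast_path_ok(unsigned int net_ip_align, size_t frag_off,
			       size_t nhoff)
{
	/* Arches with NET_IP_ALIGN == 0 skip the check entirely. */
	if (net_ip_align == 0)
		return true;

	/* Assumed rule: the network header must start on a 4-byte boundary. */
	return ((frag_off + nhoff) & 3) == 0;
}
```

When the helper returns false, the slower path applies: the headers are pulled into the linear area with pskb_may_pull(), as the note above describes.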
      
      Fixes: 0f6925b3 ("virtio_net: Do not pull payload in skb->head")
      Fixes: 78a478d0 ("gro: Inline skb_gro_header and cache frag0 virtual address")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 12 Apr, 2021 2 commits
  25. 10 Apr, 2021 1 commit
  26. 09 Apr, 2021 1 commit
  27. 08 Apr, 2021 2 commits
  28. 07 Apr, 2021 1 commit
    • bpf, sockmap: Fix incorrect fwd_alloc accounting · 144748eb
      Authored by John Fastabend
      Incorrect fwd_alloc accounting can result in a warning when the socket
      is torn down:
      
       [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
       [...]
       [18455.319543] Call Trace:
       [18455.319556]  inet_csk_destroy_sock+0xba/0x1f0
       [18455.319577]  tcp_rcv_state_process+0x1b4e/0x2380
       [18455.319593]  ? lock_downgrade+0x3a0/0x3a0
       [18455.319617]  ? tcp_finish_connect+0x1e0/0x1e0
       [18455.319631]  ? sk_reset_timer+0x15/0x70
       [18455.319646]  ? tcp_schedule_loss_probe+0x1b2/0x240
       [18455.319663]  ? lock_release+0xb2/0x3f0
       [18455.319676]  ? __release_sock+0x8a/0x1b0
       [18455.319690]  ? lock_downgrade+0x3a0/0x3a0
       [18455.319704]  ? lock_release+0x3f0/0x3f0
       [18455.319717]  ? __tcp_close+0x2c6/0x790
       [18455.319736]  ? tcp_v4_do_rcv+0x168/0x370
       [18455.319750]  tcp_v4_do_rcv+0x168/0x370
       [18455.319767]  __release_sock+0xbc/0x1b0
       [18455.319785]  __tcp_close+0x2ee/0x790
       [18455.319805]  tcp_close+0x20/0x80
      
      This currently happens because in the redirect case we do skb_set_owner_r()
      with the original sock. This increments the fwd_alloc memory accounting
      on the original sock. Then on redirect we may push this into the queue
      of the psock we are redirecting to. When the skb is flushed from the
      queue we give the memory back to the original sock. The problem is that if
      the original sock is destroyed/closed with skbs on another psock's queue,
      the original sock will not have a way to reclaim the memory before
      being destroyed. Then the above warning will be thrown.
      
        sockA                          sockB
      
        sk_psock_strp_read()
         sk_psock_verdict_apply()
           -- SK_REDIRECT --
           sk_psock_skb_redirect()
                                      skb_queue_tail(psock_other->ingress_skb..)
      
        sk_close()
         sock_map_unref()
           sk_psock_put()
             sk_psock_drop()
               sk_psock_zap_ingress()
      
      At this point we have torn down our own psock, but have the outstanding
      skb in psock_other. Note that SK_PASS doesn't have this problem because
      the sk_psock_drop() logic releases the skb; it's still associated with
      our psock.
      
      To resolve this, let's only account for sockets on the ingress queue that are
      still associated with the current socket. In the redirect case we will
      check memory limits per 6fa9201a, but will omit fwd_alloc accounting
      until the skb is actually enqueued. When the skb is sent via skb_send_sock_locked()
      or received with sk_psock_skb_ingress(), memory will be claimed on psock_other.
      
      Fixes: 6fa9201a ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/161731444013.68884.4021114312848535993.stgit@john-XPS-13-9370
  29. 06 Apr, 2021 1 commit
  30. 02 Apr, 2021 3 commits