1. 10 May, 2020 1 commit
    • Z
      netprio_cgroup: Fix unlimited memory leak of v2 cgroups · 090e28b2
      Zefan Li committed
      If systemd is configured to use hybrid mode which enables the use of
      both cgroup v1 and v2, systemd will create new cgroup on both the default
      root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
      task to the two cgroups. If the task does any networking, then the v2
      cgroup can never be freed after the session exits.
      
      One of our machines ran into OOM due to this memory leak.
      
      In the scenario described above when sk_alloc() is called
      cgroup_sk_alloc() thought it's in v2 mode, so it stores
      the cgroup pointer in sk->sk_cgrp_data and increments
      the cgroup refcnt, but then sock_update_netprioidx()
      thought it's in v1 mode, so it stores netprioidx value
      in sk->sk_cgrp_data, so the cgroup refcnt will never be freed.
      
      Currently we do the mode switch when someone writes to the ifpriomap
      cgroup control file. The easiest fix is to also do the switch when
      a task is attached to a new cgroup.
      
      Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      Reported-by: Yang Yingliang <yangyingliang@huawei.com>
      Tested-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
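The leak described in this commit can be illustrated with a small userspace toy model. This is only a sketch, not kernel code: the class names, the `run_session` helper, and the `switch_on_attach` flag are hypothetical, and the single `sk_cgrp_data` attribute stands in for the shared field that holds either a cgroup pointer or a netprioidx value.

```python
# Toy model of the refcount leak; names mirror the commit message,
# but nothing here is the kernel's actual implementation.

class Cgroup:
    def __init__(self, name):
        self.name = name
        self.refcnt = 1  # reference held by the session itself


class Sock:
    def __init__(self):
        # Holds EITHER a Cgroup reference (v2 mode) OR a prioidx int (v1 mode).
        self.sk_cgrp_data = None


def cgroup_sk_alloc(sk, cgrp, v1_mode):
    # In v2 mode the socket pins the cgroup it was created in.
    if not v1_mode:
        cgrp.refcnt += 1
        sk.sk_cgrp_data = cgrp


def sock_update_netprioidx(sk, prioidx):
    # The v1 path reuses the same field and overwrites whatever was there.
    sk.sk_cgrp_data = prioidx


def sk_free(sk):
    # The reference can only be dropped if the field still holds the cgroup.
    if isinstance(sk.sk_cgrp_data, Cgroup):
        sk.sk_cgrp_data.refcnt -= 1
    sk.sk_cgrp_data = None


def run_session(switch_on_attach):
    """Return the cgroup refcount left over after the session has exited."""
    cgrp = Cgroup("session")
    sk = Sock()
    # With the fix, attaching the task already switched the mode to v1.
    cgroup_sk_alloc(sk, cgrp, v1_mode=switch_on_attach)
    sock_update_netprioidx(sk, prioidx=3)
    sk_free(sk)
    cgrp.refcnt -= 1  # the session drops its own reference
    return cgrp.refcnt


leaked = run_session(switch_on_attach=False)
fixed = run_session(switch_on_attach=True)
```

Without the switch on attach, one reference is left behind and the cgroup can never be freed; with it, the count drops to zero.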
  2. 09 May, 2020 2 commits
  3. 08 May, 2020 3 commits
    • C
      net: fix a potential recursive NETDEV_FEAT_CHANGE · dd912306
      Cong Wang committed
      syzbot managed to trigger a recursive NETDEV_FEAT_CHANGE event
      between bonding master and slave. I managed to find a reproducer
      for this:
      
        ip li set bond0 up
        ifenslave bond0 eth0
        brctl addbr br0
        ethtool -K eth0 lro off
        brctl addif br0 bond0
        ip li set br0 up
      
      When a NETDEV_FEAT_CHANGE event is triggered on a bonding slave,
      it captures this and calls bond_compute_features() to fixup its
      master's and other slaves' features. However, when syncing with
      its lower devices by netdev_sync_lower_features() this event is
      triggered again on slaves when the LRO feature fails to change,
      so it goes back and forth recursively until the kernel stack is
      exhausted.
      
      Commit 17b85d29 intentionally lets __netdev_update_features()
      return -1 for such a failure case, so we have to just rely on
      the existing check inside netdev_sync_lower_features() and skip
      NETDEV_FEAT_CHANGE event only for this specific failure case.
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Reported-by: syzbot+e73ceacfd8560cc8a3ca@syzkaller.appspotmail.com
      Reported-by: syzbot+c2fb6f9ddcea95ba49b5@syzkaller.appspotmail.com
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      mptcp: set correct vfs info for subflows · 7d14b0d2
      Paolo Abeni committed
      When a subflow is created via mptcp_subflow_create_socket(),
      a new 'struct socket' is allocated, with a new i_ino value.
      
      When inspecting TCP sockets via procfs and/or the diag
      interface, the above ones are not related to the process owning
      the MPTCP master socket, even if they are a logical part of it
      ('ss -p' shows an empty process field).
      
      Additionally, subflows created by the path manager get
      the uid/gid from the running workqueue.
      
      Subflows are part of the owning MPTCP master socket, let's
      adjust the vfs info to reflect this.
      
      After this patch, 'ss' correctly displays subflows as belonging
      to the msk socket creator.
      
      Fixes: 2303f994 ("mptcp: Associate MPTCP context with TCP socket")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      Revert "ipv6: add mtu lock check in __ip6_rt_update_pmtu" · 09454fd0
      Maciej Żenczykowski committed
      This reverts commit 19bda36c:
      
      | ipv6: add mtu lock check in __ip6_rt_update_pmtu
      |
      | Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu.
      | It leaded to that mtu lock doesn't really work when receiving the pkt
      | of ICMPV6_PKT_TOOBIG.
      |
      | This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4
      | did in __ip_rt_update_pmtu.
      
      The above reasoning is incorrect.  IPv6 *requires* icmp based pmtu to work.
      There's already a comment to this effect elsewhere in the kernel:
      
        $ git grep -p -B1 -A3 'RTAX_MTU lock'
        net/ipv6/route.c=4813=
      
        static int rt6_mtu_change_route(struct fib6_info *f6i, void *p_arg)
        ...
          /* In IPv6 pmtu discovery is not optional,
             so that RTAX_MTU lock cannot disable it.
             We still use this lock to block changes
             caused by addrconf/ndisc.
          */
      
      This reverts to the pre-4.9 behaviour.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Fixes: 19bda36c ("ipv6: add mtu lock check in __ip6_rt_update_pmtu")
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 07 May, 2020 5 commits
    • P
      net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE · 16f80360
      Pablo Neira Ayuso committed
      This patch adds FLOW_ACTION_HW_STATS_DONT_CARE which tells the driver
      that the frontend does not need counters, this hw stats type request
      never fails. The FLOW_ACTION_HW_STATS_DISABLED type explicitly requests
      the driver to disable the stats, however, if the driver cannot disable
      counters, it bails out.
      
      TCA_ACT_HW_STATS_* maintains the 1:1 mapping with FLOW_ACTION_HW_STATS_*
      except for disabled, which is mapped to FLOW_ACTION_HW_STATS_DISABLED
      (this is 0 in tc). Add tc_act_hw_stats() to perform the mapping between
      TCA_ACT_HW_STATS_* and FLOW_ACTION_HW_STATS_*.
      
      Fixes: 319a1d19 ("flow_offload: check for basic action hw stats type")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • F
      net: dsa: Do not leave DSA master with NULL netdev_ops · 050569fc
      Florian Fainelli committed
      When ndo_get_phys_port_name() for the CPU port was added we introduced
      an early check for when the DSA master network device in
      dsa_master_ndo_setup() already implements ndo_get_phys_port_name(). When
      we perform the teardown operation in dsa_master_ndo_teardown() we would
      not be checking that cpu_dp->orig_ndo_ops was successfully allocated and
      non-NULL initialized.
      
      With network device drivers such as virtio_net, this leads to a NPD as
      soon as the DSA switch hanging off of it gets torn down because we are
      now assigning the virtio_net device's netdev_ops a NULL pointer.
      
      Fixes: da7b9e9b ("net: dsa: Add ndo_get_phys_port_name() for CPU port")
      Reported-by: Allen Pais <allen.pais@oracle.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Tested-by: Allen Pais <allen.pais@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: dsa: remove duplicate assignment in dsa_slave_add_cls_matchall_mirred · 65722159
      Vladimir Oltean committed
      This was caused by a poor merge conflict resolution on my side. The
      "act = &cls->rule->action.entries[0];" assignment was already present in
      the code prior to the patch mentioned below.
      
      Fixes: e13c2075 ("net: dsa: refactor matchall mirred action to separate function")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      seg6: fix SRH processing to comply with RFC8754 · 0cb7498f
      Ahmed Abdelsalam committed
      The Segment Routing Header (SRH) which defines the SRv6 dataplane is defined
      in RFC8754.
      
      RFC8754 (section 4.1) defines the SR source node behavior which encapsulates
      packets into an outer IPv6 header and SRH. The SR source node encodes the
      full list of Segments that defines the packet path in the SRH. Then, the
      first segment from list of Segments is copied into the Destination address
      of the outer IPv6 header and the packet is sent to the first hop in its path
      towards the destination.
      
      If the Segment list has only one segment, the SR source node can omit the SRH
      as the only segment is added in the destination address.
      
      RFC8754 (section 4.1.1) defines the Reduced SRH, when a source does not
      require the entire SID list to be preserved in the SRH. A reduced SRH does
      not contain the first segment of the related SR Policy (the first segment is
      the one already in the DA of the IPv6 header), and the Last Entry field is
      set to n-2, where n is the number of elements in the SR Policy.
      
      RFC8754 (section 4.3.1.1) defines the SRH processing and the logic to
      validate the SRH (S09, S10, S11) which works for both reduced and
      non-reduced behaviors.
      
      This patch updates seg6_validate_srh() to validate the SRH as per RFC8754.
      
      Signed-off-by: Ahmed Abdelsalam <ahabdels@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
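The field checks referred to above can be sketched in a few lines. This is a hedged reading of the RFC 8754 section 4.3.1.1 pseudocode, not the kernel's seg6_validate_srh(); the function names are hypothetical, and the helper assumes an SRH without TLVs and a source that sets Segments Left to n-1.

```python
def srh_fields_valid(hdr_ext_len, last_entry, segments_left):
    """Check Last Entry and Segments Left against the header length.

    hdr_ext_len is in 8-octet units, not counting the first 8 octets,
    so each 16-octet segment contributes 2 units.
    """
    max_last_entry = hdr_ext_len // 2 - 1
    if last_entry > max_last_entry:
        return False
    # A reduced SRH may legitimately have Segments Left == Last Entry + 1.
    if segments_left > last_entry + 1:
        return False
    return True


def reduced_srh_fields(n):
    """Fields for a reduced SRH of an n-segment policy (n >= 2, no TLVs)."""
    carried = n - 1  # the first segment lives only in the IPv6 DA
    return {
        "hdr_ext_len": 2 * carried,
        "last_entry": n - 2,
        "segments_left": n - 1,
    }


reduced_ok = srh_fields_valid(**reduced_srh_fields(3))
full_ok = srh_fields_valid(hdr_ext_len=6, last_entry=2, segments_left=2)
bad_last_entry = srh_fields_valid(hdr_ext_len=4, last_entry=2, segments_left=1)
bad_segments_left = srh_fields_valid(hdr_ext_len=4, last_entry=1, segments_left=3)
```

The same two comparisons accept both the reduced and the non-reduced layout, which is why a single check can serve both behaviors.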
    • M
      net: hsr: fix incorrect type usage for protocol variable · f5dda315
      Murali Karicheri committed
      Fix the following sparse checker warning:
      
      net/hsr/hsr_slave.c:38:18: warning: incorrect type in assignment (different base types)
      net/hsr/hsr_slave.c:38:18:    expected unsigned short [unsigned] [usertype] protocol
      net/hsr/hsr_slave.c:38:18:    got restricted __be16 [usertype] h_proto
      net/hsr/hsr_slave.c:39:25: warning: restricted __be16 degrades to integer
      net/hsr/hsr_slave.c:39:57: warning: restricted __be16 degrades to integer
      
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 06 May, 2020 3 commits
  6. 05 May, 2020 6 commits
    • C
      atm: fix a memory leak of vcc->user_back · 8d9f73c0
      Cong Wang committed
      In lec_arp_clear_vccs() only entry->vcc is freed, but vcc
      could be installed on entry->recv_vcc too in lec_vcc_added().
      
      This fixes the following memory leak:
      
      unreferenced object 0xffff8880d9266b90 (size 16):
        comm "atm2", pid 425, jiffies 4294907980 (age 23.488s)
        hex dump (first 16 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 6b 6b 6b a5  ............kkk.
        backtrace:
          [<(____ptrval____)>] kmem_cache_alloc_trace+0x10e/0x151
          [<(____ptrval____)>] lane_ioctl+0x4b3/0x569
          [<(____ptrval____)>] do_vcc_ioctl+0x1ea/0x236
          [<(____ptrval____)>] svc_ioctl+0x17d/0x198
          [<(____ptrval____)>] sock_do_ioctl+0x47/0x12f
          [<(____ptrval____)>] sock_ioctl+0x2f9/0x322
          [<(____ptrval____)>] vfs_ioctl+0x1e/0x2b
          [<(____ptrval____)>] ksys_ioctl+0x61/0x80
          [<(____ptrval____)>] __x64_sys_ioctl+0x16/0x19
          [<(____ptrval____)>] do_syscall_64+0x57/0x65
          [<(____ptrval____)>] entry_SYSCALL_64_after_hwframe+0x49/0xb3
      
      Cc: Gengming Liu <l.dmxcsnsbh@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      atm: fix a UAF in lec_arp_clear_vccs() · 93a2014a
      Cong Wang committed
      Gengming reported a UAF in lec_arp_clear_vccs(),
      where we add a vcc socket to an entry in a per-device
      list but free the socket without removing it from the
      list when vcc->dev is NULL.
      
      We need to call lec_vcc_close() to search and remove
      those entries that contain the vcc being destroyed. This can
      be done by calling vcc->push(vcc, NULL) unconditionally
      in vcc_destroy_socket().
      
      Another issue discovered by Gengming's reproducer is
      the vcc->dev may point to the static device lecatm_dev,
      for which we don't need to register/unregister device,
      so we can just check for vcc->dev->ops->owner.
      
      Reported-by: Gengming Liu <l.dmxcsnsbh@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • C
      net_sched: fix tcm_parent in tc filter dump · a7df4870
      Cong Wang committed
      When we tell kernel to dump filters from root (ffff:ffff),
      those filters on ingress (ffff:0000) are matched, but their
      true parents must be dumped as they are. However, kernel
      dumps just whatever we tell it, that is either ffff:ffff
      or ffff:0000:
      
       $ nl-cls-list --dev=dummy0 --parent=root
       cls basic dev dummy0 id none parent root prio 49152 protocol ip match-all
       cls basic dev dummy0 id :1 parent root prio 49152 protocol ip match-all
       $ nl-cls-list --dev=dummy0 --parent=ffff:
       cls basic dev dummy0 id none parent ffff: prio 49152 protocol ip match-all
       cls basic dev dummy0 id :1 parent ffff: prio 49152 protocol ip match-all
      
      This is confusing and misleading, more importantly this is
      a regression since 4.15, so the old behavior must be restored.
      
      And, when tc filters are installed on a tc class, the parent
      should be the classid, rather than the qdisc handle. Commit
      edf6711c ("net: sched: remove classid and q fields from tcf_proto")
      removed the classid we save for filters, we can just restore
      this classid in tcf_block.
      
      Steps to reproduce this:
       ip li set dev dummy0 up
       tc qd add dev dummy0 ingress
       tc filter add dev dummy0 parent ffff: protocol arp basic action pass
       tc filter show dev dummy0 root
      
      Before this patch:
       filter protocol arp pref 49152 basic
       filter protocol arp pref 49152 basic handle 0x1
      	action order 1: gact action pass
      	 random type none pass val 0
      	 index 1 ref 1 bind 1
      
      After this patch:
       filter parent ffff: protocol arp pref 49152 basic
       filter parent ffff: protocol arp pref 49152 basic handle 0x1
       	action order 1: gact action pass
       	 random type none pass val 0
      	 index 1 ref 1 bind 1
      
      Fixes: a10fa201 ("net: sched: propagate q and parent from caller down to tcf_fill_node")
      Fixes: edf6711c ("net: sched: remove classid and q fields from tcf_proto")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      devlink: Fix reporter's recovery condition · bea0c5c9
      Aya Levin committed
      Devlink health core conditions the reporter's recovery with the
      expiration of the grace period. This is not relevant for the first
      recovery. Explicitly demand that the grace period will only apply to
      recoveries other than the first.
      
      Fixes: c8e1da0b ("devlink: Add health report functionality")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
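The intended recovery condition can be sketched as a small predicate. This is only an illustration under simplifying assumptions: the function and parameter names are hypothetical, timestamps are plain numbers, and the reporter's health state (which the real code also considers) is left out.

```python
def auto_recovery_allowed(recovery_count, last_recovery_ts, now, grace_period):
    """Decide whether an automatic recovery may run.

    The grace period is measured from the previous recovery, so it has
    nothing to be measured against when no recovery has happened yet.
    """
    if recovery_count == 0:
        return True  # first recovery: the grace period does not apply
    return now - last_recovery_ts >= grace_period


first_attempt = auto_recovery_allowed(0, last_recovery_ts=0, now=5, grace_period=100)
too_soon = auto_recovery_allowed(1, last_recovery_ts=0, now=5, grace_period=100)
after_grace = auto_recovery_allowed(1, last_recovery_ts=0, now=100, grace_period=100)
```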
    • T
      tipc: fix partial topology connection closure · 980d6927
      Tuong Lien committed
      When an application connects to the TIPC topology server and subscribes
      to some services, a new connection is created along with some objects -
      'tipc_subscription' to store related data correspondingly...
      However, there is one omission in the connection handling: when the
      connection or application is shut down in an orderly way (e.g. via
      SIGQUIT), the connection is not closed in the kernel and the
      'tipc_subscription' objects are not freed either.
      This results in:
      - The maximum number of subscriptions (65535) will soon be reached, and
      new subscriptions will be rejected;
      - The TIPC module cannot be removed (unless the objects are somehow forced
      to release first).
      
      The commit fixes the issue by closing the connection if 'recvmsg()'
      returns '0', i.e. when the peer has shut down gracefully. It also
      includes the other unexpected cases.
      
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
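The "a zero-length receive means the peer shut down, so close the connection" rule has a direct userspace analogue. The sketch below uses an ordinary stream socket pair from Python's standard library rather than TIPC, and `serve_one` is a hypothetical helper, so treat it as an illustration of the pattern only.

```python
import socket


def serve_one(conn):
    """Receive until the peer shuts down; return (payload, closed_by_peer)."""
    chunks = []
    while True:
        data = conn.recv(1024)
        if not data:
            # An empty result signals an orderly shutdown by the peer, so the
            # connection and everything attached to it should be released.
            conn.close()
            return b"".join(chunks), True
        chunks.append(data)


client, server = socket.socketpair()
client.sendall(b"subscribe")
client.close()  # orderly shutdown on the application side

payload, closed_by_peer = serve_one(server)
```

If the empty read were ignored instead, the server side would keep the connection and its per-connection objects alive indefinitely, which is the resource exhaustion described above.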
    • F
      net: dsa: Do not make user port errors fatal · 86f8b1c0
      Florian Fainelli committed
      Prior to 1d27732f ("net: dsa: setup and teardown ports"), we would
      not treat failures to set up a user port as fatal, but after this
      commit we would, which is a regression for some systems where interfaces
      may be declared in the Device Tree, but the underlying hardware may not
      be present (pluggable daughter cards for instance).
      
      Fixes: 1d27732f ("net: dsa: setup and teardown ports")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 04 May, 2020 1 commit
  8. 02 May, 2020 3 commits
    • A
      drop_monitor: work around gcc-10 stringop-overflow warning · dc30b405
      Arnd Bergmann committed
      The current gcc-10 snapshot produces a false-positive warning:
      
      net/core/drop_monitor.c: In function 'trace_drop_common.constprop':
      cc1: error: writing 8 bytes into a region of size 0 [-Werror=stringop-overflow=]
      In file included from net/core/drop_monitor.c:23:
      include/uapi/linux/net_dropmon.h:36:8: note: at offset 0 to object 'entries' with size 4 declared here
         36 |  __u32 entries;
            |        ^~~~~~~
      
      I reported this in the gcc bugzilla, but in case it does not get
      fixed in the release, work around it by using a temporary variable.
      
      Fixes: 9a8afc8d ("Network Drop Monitor: Adding drop monitor implementation & Netlink protocol")
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94881
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      devlink: fix return value after hitting end in region read · 610a9346
      Jakub Kicinski committed
      Commit d5b90e99 ("devlink: report 0 after hitting end in region read")
      fixed region dump, but region read still returns a spurious error:
      
      $ devlink region read netdevsim/netdevsim1/dummy snapshot 0 addr 0 len 128
      0000000000000000 a6 f4 c4 1c 21 35 95 a6 9d 34 c3 5b 87 5b 35 79
      0000000000000010 f3 a0 d7 ee 4f 2f 82 7f c6 dd c4 f6 a5 c3 1b ae
      0000000000000020 a4 fd c8 62 07 59 48 03 70 3b c7 09 86 88 7f 68
      0000000000000030 6f 45 5d 6d 7d 0e 16 38 a9 d0 7a 4b 1e 1e 2e a6
      0000000000000040 e6 1d ae 06 d6 18 00 85 ca 62 e8 7e 11 7e f6 0f
      0000000000000050 79 7e f7 0f f3 94 68 bd e6 40 22 85 b6 be 6f b1
      0000000000000060 af db ef 5e 34 f0 98 4b 62 9a e3 1b 8b 93 fc 17
      devlink answers: Invalid argument
      0000000000000070 61 e8 11 11 66 10 a5 f7 b1 ea 8d 40 60 53 ed 12
      
      This is a minimal fix, I'll follow up with a restructuring
      so we don't have two checks for the same condition.
      
      Fixes: fdd41ec2 ("devlink: Return right error code in case of errors for region read")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: Use global sernum for dst validation with nexthop objects · 8f34e53b
      David Ahern committed
      Nik reported a bug with pcpu dst cache when nexthop objects are
      used illustrated by the following:
          $ ip netns add foo
          $ ip -netns foo li set lo up
          $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
          $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
          $ ip li add veth1 type veth peer name veth2
          $ ip li set veth1 up
          $ ip addr add 2001:db8:10::1/64 dev veth1
          $ ip li set dev veth2 netns foo
          $ ip -netns foo li set veth2 up
          $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
          $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Create a pcpu entry on cpu 0:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
      
          Re-add the route entry:
          $ ip -6 ro del 2001:db8:11::1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Route get on cpu 0 returns the stale pcpu:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
          RTNETLINK answers: Network is unreachable
      
          While cpu 1 works:
          $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
          2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
      
      Conversion of FIB entries to work with external nexthop objects
      missed an important difference between IPv4 and IPv6 - how dst
      entries are invalidated when the FIB changes. IPv4 has a per-network
      namespace generation id (rt_genid) that is bumped on changes to the FIB.
      Checking if a dst_entry is still valid means comparing rt_genid in the
      rtable to the current value of rt_genid for the namespace.
      
      IPv6 also has a per network namespace counter, fib6_sernum, but the
      count is saved per fib6_node. With the per-node counter only dst_entries
      based on fib entries under the node are invalidated when changes are
      made to the routes - limiting the scope of invalidations. IPv6 uses a
      reference in the rt6_info, 'from', to track the corresponding fib entry
      used to create the dst_entry. When validating a dst_entry, the 'from'
      is used to backtrack to the fib6_node and check the sernum of it to the
      cookie passed to the dst_check operation.
      
      With the inline format (nexthop definition inline with the fib6_info),
      dst_entries cached in the fib6_nh have a 1:1 correlation between fib
      entries, nexthop data and dst_entries. With external nexthops, IPv6
      looks more like IPv4 which means multiple fib entries across disparate
      fib6_nodes can all reference the same fib6_nh. That means validation
      of dst_entries based on external nexthops needs to use the IPv4 format
      - the per-network namespace counter.
      
      Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
      rt6_get_cookie to return sernum if it is set and update dst_check for
      IPv6 to look for sernum set and base the check on it if so. Finally,
      rt6_get_pcpu_route needs to validate the cached entry before returning
      a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
      __mkroute_output for IPv4).
      
      This problem only affects routes using the new, external nexthops.
      
      Thanks to the kbuild test robot for catching the IS_ENABLED needed
      around rt_genid_ipv6 before I sent this out.
      
      Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
      Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
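The idea of stamping a cached per-CPU entry with the namespace-wide counter and re-checking it on lookup can be shown with a toy model. Everything below is a simplified sketch: the class names, the `validate` switch, and the dictionary cache are hypothetical stand-ins, not the kernel's rt6_info, fib6_nh, or rt6_get_pcpu_route().

```python
class Netns:
    """Holds a namespace-wide counter that is bumped on every FIB change."""

    def __init__(self):
        self.sernum = 1

    def fib_changed(self):
        self.sernum += 1


class PcpuDst:
    def __init__(self, sernum):
        self.sernum = sernum  # counter value at creation time


class Nexthop:
    """External nexthop object; its per-CPU cache outlives any single route."""

    def __init__(self):
        self.pcpu = {}


def get_pcpu_route(net, nh, cpu, validate):
    cached = nh.pcpu.get(cpu)
    # With validation, an entry stamped with an old counter value is discarded.
    if cached is not None and validate and cached.sernum != net.sernum:
        cached = None
    if cached is None:
        cached = PcpuDst(net.sernum)
        nh.pcpu[cpu] = cached
    return cached


net, nh = Netns(), Nexthop()
first = get_pcpu_route(net, nh, cpu=0, validate=True)

# Deleting and re-adding the route changes the FIB twice.
net.fib_changed()
net.fib_changed()

stale = get_pcpu_route(net, nh, cpu=0, validate=False)
fresh = get_pcpu_route(net, nh, cpu=0, validate=True)
```

Without the check, the lookup keeps handing back the entry created before the route was replaced; with it, the outdated entry is dropped and rebuilt.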
  9. 01 May, 2020 7 commits
    • I
      net: bridge: vlan: Add a schedule point during VLAN processing · 7979457b
      Ido Schimmel committed
      User space can request to delete a range of VLANs from a bridge slave in
      one netlink request. For each deleted VLAN the FDB needs to be traversed
      in order to flush all the affected entries.
      
      If a large range of VLANs is deleted and the number of FDB entries is
      large or the FDB lock is contended, it is possible for the kernel to
      loop through the deleted VLANs for a long time. In case preemption is
      disabled, this can result in a soft lockup.
      
      Fix this by adding a schedule point after each VLAN is deleted to yield
      the CPU, if needed. This is safe because the VLANs are traversed in
      process context.
      
      Fixes: bdced7ef ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      mptcp: fix uninitialized value access · ac2b47fb
      Paolo Abeni committed
      tcp_v{4,6}_syn_recv_sock() set 'own_req' only when returning
      a non-NULL 'child', so let's check 'own_req' only if child is
      available to avoid a harmless UBSAN splat.
      
      v1 -> v2:
       - reference the correct hash
      
      Fixes: 4c8941de ("mptcp: avoid flipping mp_capable field in syn_recv_sock()")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      mptcp: initialize the data_fin field for mpc packets · a77895db
      Paolo Abeni committed
      When parsing MPC+data packets we set the dss field, so
      we must also initialize data_fin, or we can find a stray
      value there.
      
      Fixes: 9a19371b ("mptcp: fix data_fin handing in RX path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      mptcp: fix 'use_ack' option access. · 5a91e32b
      Paolo Abeni committed
      The mentioned RX option field is initialized only for DSS
      packets, so we must access it only if 'dss' is set too, or
      the subflow will end up in a bad state, leading to
      RFC violations.
      
      Fixes: d22f4988 ("mptcp: process MP_CAPABLE data option")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      mptcp: avoid a WARN on bad input. · d6085fe1
      Paolo Abeni committed
      Syzkaller has found a way to trigger the WARN_ON_ONCE condition
      in check_fully_established().
      
      The root cause is a legitimate fallback-to-TCP scenario, so replace
      the WARN with a plain message under a stricter condition.
      
      Fixes: f296234c ("mptcp: Add handling of incoming MP_JOIN requests")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      mptcp: move option parsing into mptcp_incoming_options() · cfde141e
      Paolo Abeni committed
      The mptcp_options_received structure carries several per-packet
      flags (mp_capable, mp_join, etc.). Such fields must
      be cleared on each packet, even on dropped ones or packets
      not carrying any MPTCP options, but the current mptcp
      code clears them only on TCP option reset.
      
      In several races/corner cases we end up with stray bits in
      incoming options, leading to WARN_ON splats, e.g.:
      
      [  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
      [  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
      [  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
      [  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      [  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
      [  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
      [  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
      [  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
      [  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
      [  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
      [  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
      [  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
      [  171.232586] Call Trace:
      [  171.233109]  <IRQ>
      [  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
      [  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
      [  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
      [  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
      [  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
      [  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
      [  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
      [  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
      [  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
      [  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
      [  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
      [  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
      [  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
      [  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
      [  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
      [  171.282358]  </IRQ>
      
      We could address the issue by explicitly clearing the relevant fields
      in several places - tcp_parse_option, tcp_fast_parse_options,
      possibly others.
      
      Instead we move the MPTCP option parsing into the already existing
      mptcp ingress hook, so that we need to clear the fields in a single
      place.
      
      This allows us to drop an MPTCP hook from the TCP code and
      remove the quite large mptcp_options_received from the tcp_sock
      struct. On the flip side, the MPTCP sockets will traverse the
      option space twice (in tcp_parse_option() and in
      mptcp_incoming_options()). That looks acceptable: we already
      do that for syn and 3rd ack packets, plain TCP sockets will
      benefit from it, and even MPTCP sockets will experience better
      code locality, reducing the jumps between TCP and MPTCP code.
      
      v1 -> v2:
       - rebased on current '-net' tree
      
      Fixes: 648ef4b8 ("mptcp: Implement MPTCP receive path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfde141e
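The "clear the fields in a single place" idea above can be illustrated with a minimal userspace sketch. It is not the MPTCP code: `demo_opts`, `demo_parse_options()` and the option kind value 200 are hypothetical names and values chosen only for this example. The point is that the parser zeroes its output on entry, so a packet without the option can never inherit state left over from an earlier packet.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a per-packet "parsed options" struct. */
struct demo_opts {
    bool     has_mapping;
    uint32_t data_len;
};

#define DEMO_OPT_END     0
#define DEMO_OPT_NOP     1
#define DEMO_OPT_MAPPING 200 /* made-up kind, for illustration only */

/*
 * Walk a kind/length/value option area. The output struct is cleared
 * once, here, before any option is looked at.
 */
void demo_parse_options(const uint8_t *buf, size_t len, struct demo_opts *out)
{
    size_t i = 0;

    memset(out, 0, sizeof(*out)); /* the single clearing point */

    while (i < len) {
        uint8_t kind = buf[i];
        uint8_t olen;

        if (kind == DEMO_OPT_END)
            break;
        if (kind == DEMO_OPT_NOP) {
            i++;
            continue;
        }
        if (i + 1 >= len)
            break;              /* truncated: no length byte */
        olen = buf[i + 1];
        if (olen < 2 || i + olen > len)
            break;              /* malformed length */
        if (kind == DEMO_OPT_MAPPING && olen == 6) {
            out->has_mapping = true;
            out->data_len = ((uint32_t)buf[i + 2] << 24) |
                            ((uint32_t)buf[i + 3] << 16) |
                            ((uint32_t)buf[i + 4] << 8) |
                            (uint32_t)buf[i + 5];
        }
        i += olen;
    }
}
```

Calling `demo_parse_options()` first on a buffer that carries the made-up option and then on one that does not leaves `has_mapping` false after the second call, which is the property the single clearing point is meant to give.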
    • P
      mptcp: consolidate synack processing. · 263e1201
      Paolo Abeni committed
      Currently the MPTCP code uses 2 hooks to process syn-ack
      packets, mptcp_rcv_synsent() and the sk_rx_dst_set()
      callback.
      
      We can drop the first, moving the relevant code into the
      latter, reducing the hooking into the TCP code. This is
      also needed by the next patch.
      
      v1 -> v2:
       - use local tcp sock ptr instead of casting the sk variable
         several times - DaveM
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      263e1201
  10. 30 Apr, 2020 2 commits
  11. 29 Apr, 2020 2 commits
    • Y
      net/x25: Fix null-ptr-deref in x25_disconnect · 8999dc89
      YueHaibing committed
      We should check for NULL before calling x25_neigh_put() in
      x25_disconnect(), otherwise it may cause a null-ptr-deref like this:
      
       #include <sys/socket.h>
       #include <unistd.h>
       #include <linux/x25.h>
      
       int main() {
          int sck_x25;
          sck_x25 = socket(AF_X25, SOCK_SEQPACKET, 0);
          close(sck_x25);
          return 0;
       }
      
      BUG: kernel NULL pointer dereference, address: 00000000000000d8
      CPU: 0 PID: 4817 Comm: t2 Not tainted 5.7.0-rc3+ #159
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-
      RIP: 0010:x25_disconnect+0x91/0xe0
      Call Trace:
       x25_release+0x18a/0x1b0
       __sock_release+0x3d/0xc0
       sock_close+0x13/0x20
       __fput+0x107/0x270
       ____fput+0x9/0x10
       task_work_run+0x6d/0xb0
       exit_to_usermode_loop+0x102/0x110
       do_syscall_64+0x23c/0x260
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      
      Reported-by: syzbot+6db548b615e5aeefdce2@syzkaller.appspotmail.com
      Fixes: 4becb7ee ("net/x25: Fix x25_neigh refcnt leak when x25 disconnect")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8999dc89
    • N
      SUNRPC: defer slow parts of rpc_free_client() to a workqueue. · 7c4310ff
      NeilBrown committed
      The rpciod workqueue is on the write-out path for freeing dirty memory,
      so it is important that it never block waiting for memory to be
      allocated - this can lead to a deadlock.
      
      rpc_execute() - which is often called by an rpciod work item - calls
      rpc_task_release_client() which can lead to rpc_free_client().
      
      rpc_free_client() makes two calls which could potentially block waiting
      for memory allocation.
      
      rpc_clnt_debugfs_unregister() calls into debugfs and will block while
      any of the debugfs files are being accessed.  In particular it can block
      while any of the 'open' methods are being called and all of these use
      malloc for one thing or another.  So this can deadlock if the memory
      allocation waits for NFS to complete some writes via rpciod.
      
      rpc_clnt_remove_pipedir() can take the inode_lock() and while it isn't
      obvious that memory allocations can happen while the lock is held, it is
      safer to assume they might and to not let rpciod call
      rpc_clnt_remove_pipedir().
      
      So this patch moves these two calls (together with the final kfree() and
      rpciod_down()) into a work-item to be run from the system work-queue.
      rpciod can continue its important work, and the final stages of the free
      can happen whenever they happen.
      
      I have seen this deadlock on a 4.12 based kernel where debugfs used
      synchronize_srcu() when removing objects.  synchronize_srcu() requires a
      workqueue and there were no free worker threads and none could be
      allocated.  While debugfs no longer uses SRCU, I believe the deadlock
      is still possible.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      7c4310ff
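The deferral described above (the caller only queues the slow teardown, and some other context runs it later) can be modelled in plain userspace C. This is a single-threaded sketch under stated assumptions, not the SUNRPC patch and not the kernel workqueue API: `demo_work`, `demo_queue_work()`, `demo_run_pending()` and `demo_client` are all hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical deferred-work item: a callback plus a list link. */
struct demo_work {
    struct demo_work *next;
    void (*fn)(struct demo_work *w);
};

static struct demo_work *demo_pending; /* stand-in for a system work-queue */

/* Queueing only links the item in; it never blocks or allocates. */
void demo_queue_work(struct demo_work *w)
{
    w->next = demo_pending;
    demo_pending = w;
}

/* Run from a context that is allowed to block; returns items processed. */
int demo_run_pending(void)
{
    int n = 0;

    while (demo_pending) {
        struct demo_work *w = demo_pending;

        demo_pending = w->next; /* unlink first: fn() may free w */
        w->fn(w);
        n++;
    }
    return n;
}

struct demo_client {
    struct demo_work free_work; /* first member, so the cast below is valid */
    int *teardown_counter;      /* lets a caller observe the slow part */
};

/* The "slow parts": in this model, just a counter bump and the final free. */
static void demo_client_free_slow(struct demo_work *w)
{
    struct demo_client *c = (struct demo_client *)w;

    (*c->teardown_counter)++;
    free(c);
}

struct demo_client *demo_client_new(int *teardown_counter)
{
    struct demo_client *c = malloc(sizeof(*c));

    if (c)
        c->teardown_counter = teardown_counter;
    return c;
}

/* Called on the latency-critical path: defer, do not tear down inline. */
void demo_client_release(struct demo_client *c)
{
    c->free_work.fn = demo_client_free_slow;
    demo_queue_work(&c->free_work);
}
```

After `demo_client_release()` returns, the counter is still untouched. The teardown happens only when `demo_run_pending()` is called, which mirrors "the final stages of the free can happen whenever they happen".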
  12. 28 Apr, 2020 5 commits
    • X
      bpf: Fix sk_psock refcnt leak when receiving message · 18f02ad1
      Xiyu Yang committed
      tcp_bpf_recvmsg() invokes sk_psock_get(), which returns a reference of
      the specified sk_psock object to "psock" with increased refcnt.
      
      When tcp_bpf_recvmsg() returns, local variable "psock" becomes invalid,
      so the refcount should be decreased to keep refcount balanced.
      
      The reference counting issue happens in several exception handling paths
      of tcp_bpf_recvmsg(). When those error scenarios occur such as "flags"
      includes MSG_ERRQUEUE, the function forgets to decrease the refcnt
      increased by sk_psock_get(), causing a refcnt leak.
      
      Fix this issue by calling sk_psock_put() or pulling up the error queue
      read handling when those error scenarios occur.
      
      Fixes: e7a5f1f1 ("bpf/sockmap: Read psock ingress_msg before sk_receive_queue")
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/1587872115-42805-1-git-send-email-xiyuyang19@fudan.edu.cn
      18f02ad1
    • E
      sch_sfq: validate silly quantum values · df4953e4
      Eric Dumazet committed
      syzbot managed to set up sfq so that q->scaled_quantum was zero,
      triggering an infinite loop in sfq_dequeue().
      
      More generally, we must only accept quantum between 1 and 2^18 - 7,
      meaning scaled_quantum must be in [1, 0x7FFF] range.
      
      Otherwise, we also could have a loop in sfq_dequeue()
      if scaled_quantum happens to be 0x8000, since slot->allot
      could indefinitely switch between 0 and 0x8000.
      
      Fixes: eeaeb068 ("sch_sfq: allow big packets and be fair")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+0251e883fe39e7a0cb0a@syzkaller.appspotmail.com
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df4953e4
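A range check on the scaled value, as the message above describes, might look like the following sketch. It assumes the scaling is a round-up division by 8 done in unsigned arithmetic (the shift of 3 is an assumption of this example). `demo_scale_quantum()` and `demo_quantum_ok()` are hypothetical helpers, not functions from sch_sfq.c.

```c
#include <limits.h>
#include <stdbool.h>

#define DEMO_ALLOT_SHIFT 3        /* assumed: quantum is scaled down by 8 */
#define DEMO_ALLOT_MAX   0x7FFFu  /* largest value a signed 16-bit allot can hold */

/* Round-up division in unsigned arithmetic; wraps to a small value for huge inputs. */
static unsigned int demo_scale_quantum(unsigned int quantum)
{
    return (quantum + (1u << DEMO_ALLOT_SHIFT) - 1) >> DEMO_ALLOT_SHIFT;
}

/* Accept only quanta whose scaled value lies in [1, 0x7FFF]. */
bool demo_quantum_ok(unsigned int quantum)
{
    unsigned int scaled = demo_scale_quantum(quantum);

    return scaled >= 1 && scaled <= DEMO_ALLOT_MAX;
}
```

Under these assumptions a quantum close to `UINT_MAX` wraps during the round-up and scales to zero, and a quantum of 2^18 scales to 0x8000. Both are rejected by the check. Note that this sketch also rejects a quantum of 0, whereas a real configuration path may treat 0 as "not specified".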
    • E
      sch_choke: avoid potential panic in choke_reset() · 8738c85c
      Eric Dumazet committed
      If choke_init() could not allocate q->tab, we would crash later
      in choke_reset().
      
      BUG: KASAN: null-ptr-deref in memset include/linux/string.h:366 [inline]
      BUG: KASAN: null-ptr-deref in choke_reset+0x208/0x340 net/sched/sch_choke.c:326
      Write of size 8 at addr 0000000000000000 by task syz-executor822/7022
      
      CPU: 1 PID: 7022 Comm: syz-executor822 Not tainted 5.7.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x188/0x20d lib/dump_stack.c:118
       __kasan_report.cold+0x5/0x4d mm/kasan/report.c:515
       kasan_report+0x33/0x50 mm/kasan/common.c:625
       check_memory_region_inline mm/kasan/generic.c:187 [inline]
       check_memory_region+0x141/0x190 mm/kasan/generic.c:193
       memset+0x20/0x40 mm/kasan/common.c:85
       memset include/linux/string.h:366 [inline]
       choke_reset+0x208/0x340 net/sched/sch_choke.c:326
       qdisc_reset+0x6b/0x520 net/sched/sch_generic.c:910
       dev_deactivate_queue.constprop.0+0x13c/0x240 net/sched/sch_generic.c:1138
       netdev_for_each_tx_queue include/linux/netdevice.h:2197 [inline]
       dev_deactivate_many+0xe2/0xba0 net/sched/sch_generic.c:1195
       dev_deactivate+0xf8/0x1c0 net/sched/sch_generic.c:1233
       qdisc_graft+0xd25/0x1120 net/sched/sch_api.c:1051
       tc_modify_qdisc+0xbab/0x1a00 net/sched/sch_api.c:1670
       rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5454
       netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469
       netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
       netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
       netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:672
       ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362
       ___sys_sendmsg+0x100/0x170 net/socket.c:2416
       __sys_sendmsg+0xec/0x1b0 net/socket.c:2449
       do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
      
      Fixes: 77e62da6 ("sch_choke: drop all packets in queue during reset")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8738c85c
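The failure mode above (a reset routine that unconditionally clears a table which init may have failed to allocate) can be reproduced and guarded against in a small userspace model. `demo_queue` and its helpers are hypothetical, and the `simulate_oom` flag exists only so the failed-allocation path can be exercised deterministically. This is a sketch of the guard, not the sch_choke.c change itself.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical queue with a power-of-two slot table. */
struct demo_queue {
    void       **tab;
    unsigned int tab_mask;
    unsigned int head, tail;
};

/* Returns 0 on success, -1 if the table could not be allocated. */
int demo_queue_init(struct demo_queue *q, unsigned int slots, int simulate_oom)
{
    memset(q, 0, sizeof(*q));
    if (simulate_oom)
        return -1;              /* q->tab stays NULL, as after a failed alloc */
    q->tab = calloc(slots, sizeof(*q->tab));
    if (!q->tab)
        return -1;
    q->tab_mask = slots - 1;
    return 0;
}

/* Reset must tolerate a queue whose init failed. */
void demo_queue_reset(struct demo_queue *q)
{
    if (q->tab)                 /* guard: never memset() a NULL table */
        memset(q->tab, 0, (q->tab_mask + 1) * sizeof(*q->tab));
    q->head = q->tail = 0;
}
```

With the guard in place, calling `demo_queue_reset()` after a failed `demo_queue_init()` is harmless, while a successfully initialised queue still has every slot cleared.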
    • E
      fq_codel: fix TCA_FQ_CODEL_DROP_BATCH_SIZE sanity checks · 14695212
      Eric Dumazet committed
      My intent was to not let users set a zero drop_batch_size,
      it seems I once again messed with min()/max().
      
      Fixes: 9d18562a ("fq_codel: add batch ability to fq_codel_drop()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14695212
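The message does not show the expressions involved, so the following is only an illustration of how a min()/max() mix-up of this kind can defeat a "must not be zero" check. The helper names are hypothetical and the two variants are assumptions of this sketch, not quotes from the patch.

```c
#include <stdint.h>

static uint32_t demo_min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t demo_max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Mixed-up variant: caps the value at 1 and still lets 0 through. */
uint32_t demo_batch_size_buggy(uint32_t requested)
{
    return demo_min_u32(1, requested);
}

/* Intended variant: enforces a floor of 1 and keeps larger values. */
uint32_t demo_batch_size_fixed(uint32_t requested)
{
    return demo_max_u32(1, requested);
}
```

For a requested size of 0 the mixed-up variant returns 0, and for 64 it returns 1. The intended variant returns 1 and 64 respectively, so a lower bound is expressed with max() and an upper bound with min().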
    • X
      net/tls: Fix sk_psock refcnt leak when in tls_data_ready() · 62b4011f
      Xiyu Yang committed
      tls_data_ready() invokes sk_psock_get(), which returns a reference of
      the specified sk_psock object to "psock" with increased refcnt.
      
      When tls_data_ready() returns, local variable "psock" becomes invalid,
      so the refcount should be decreased to keep refcount balanced.
      
      The reference counting issue happens in one exception handling path of
      tls_data_ready(). When "psock->ingress_msg" is empty but "psock" is not
      NULL, the function forgets to decrease the refcnt increased by
      sk_psock_get(), causing a refcnt leak.
      
      Fix this issue by calling sk_psock_put() on all paths when "psock" is
      not NULL.
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62b4011f
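The rule stated above, that every successful get is paired with a put on all paths, can be shown with a toy reference-counted object. `demo_psock`, `demo_psock_get()`, `demo_psock_put()` and `demo_data_ready()` are hypothetical stand-ins written for this sketch. They are not the kernel's sk_psock API, and the plain integer counter is not thread-safe.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy object with a plain integer reference count. */
struct demo_psock {
    int  refcnt;
    bool ingress_empty;
    int  wakeups;
};

/* Take a reference; returns NULL if there is no object. */
struct demo_psock *demo_psock_get(struct demo_psock *p)
{
    if (p)
        p->refcnt++;
    return p;
}

void demo_psock_put(struct demo_psock *p)
{
    p->refcnt--;
}

/*
 * The put sits outside the inner condition, so the reference is dropped
 * whether or not there was ingress data to act on.
 */
void demo_data_ready(struct demo_psock *maybe_psock)
{
    struct demo_psock *psock = demo_psock_get(maybe_psock);

    if (psock) {
        if (!psock->ingress_empty)
            psock->wakeups++;
        demo_psock_put(psock);  /* reached on both inner branches */
    }
}
```

Placing the put inside the `!psock->ingress_empty` branch instead would reproduce the leak described in the message: the count would grow by one on every call made while the ingress list is empty.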