1. 16 Oct, 2018 8 commits
    • ipv6: rate-limit probes for neighbourless routes · f547fac6
      Sabrina Dubroca committed
      When commit 27097255 ("[IPV6]: ROUTE: Add Router Reachability
      Probing (RFC4191).") introduced router probing, the rt6_probe() function
      required that a neighbour entry existed. This neighbour entry is used to
      record the timestamp of the last probe via the ->updated field.
      
      Later, commit 2152caea ("ipv6: Do not depend on rt->n in rt6_probe().")
      removed the requirement for a neighbour entry. Neighbourless routes skip
      the interval check and are not rate-limited.
      
      This patch adds rate-limiting for neighbourless routes, by recording the
      timestamp of the last probe in the fib6_info itself.
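
The rate-limiting idea can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: `struct route_sketch`, `tick_after()` and `probe_allowed()` are hypothetical names, and an unsigned tick counter stands in for jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-route state; not the kernel's fib6_info. */
struct route_sketch {
    unsigned long last_probe; /* tick count of the most recent probe */
};

/* Wraparound-safe "a is after b", in the spirit of the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Allow a probe only if more than `interval` ticks have passed since the
 * last one, and record the new timestamp in the route itself.
 */
static bool probe_allowed(struct route_sketch *rt, unsigned long now,
                          unsigned long interval)
{
    assert(rt != NULL);
    if (!tick_after(now, rt->last_probe + interval))
        return false;
    rt->last_probe = now;
    return true;
}
```

Because the timestamp lives in the route, the check works even when no neighbour entry exists to hold it.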
      
      Fixes: 2152caea ("ipv6: Do not depend on rt->n in rt6_probe().")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: use correct kvec num when sending BUSY response packet · d6672a5a
      YueHaibing committed
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      net/rxrpc/output.c: In function 'rxrpc_reject_packets':
      net/rxrpc/output.c:527:11: warning:
       variable 'ioc' set but not used [-Wunused-but-set-variable]
      
      'ioc' is the correct kvec num when sending a BUSY (or an ABORT) response
      packet.
      
      Fixes: ece64fec ("rxrpc: Emit BUSY packets when supposed to rather than ABORTs")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Fix an uninitialised variable · d7b4c24f
      David Howells committed
      Fix an uninitialised variable introduced by the last patch.  This can cause
      a crash when a new call comes in to a local service, such as when an AFS
      fileserver calls back to the local cache manager.
      
      Fixes: c1e15b49 ("rxrpc: Fix the packet reception routine")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: initialize broadcast link stale counter correctly · 4af00f4c
      Jon Maloy committed
      In the commit referred to below we added link tolerance as an additional
      criterion for declaring broadcast transmission "stale" and resetting the
      unicast links to the affected node.
      
      Unfortunately, this 'improvement' introduced two bugs, each of which
      alone causes only limited problems, but which combined lead to seemingly
      stochastic unicast link resets, depending on the amount of broadcast
      traffic transmitted.
      
      The first issue, a missing initialization of the 'tolerance' field of
      the receiver broadcast link, was recently fixed by commit 047491ea
      ("tipc: set link tolerance correctly in broadcast link").
      
      The second issue, where we omit to reset the 'stale_cnt' field of
      the same link after a 'stale' period is over, leads to this counter
      accumulating over time, and in the absence of the 'tolerance' criterion
      leads to the symptoms described above. This commit adds the missing
      initialization.
      
      Fixes: a4dc70d4 ("tipc: extend link reset criteria for stale packet retransmission")
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • llc: set SOCK_RCU_FREE in llc_sap_add_socket() · 5a8e7aea
      Cong Wang committed
      When an llc sock is added into the sk_laddr_hash of an llc_sap,
      it is not marked with SOCK_RCU_FREE.
      
      As a result, the sock could be freed while it is still being
      read by __llc_lookup_established() under the RCU read lock. The sock
      is refcounted, but under the RCU read lock nothing prevents readers
      from seeing a zero refcnt.
      
      Fix it by setting SOCK_RCU_FREE in llc_sap_add_socket().
      
      Reported-by: syzbot+11e05f04c15e03be5254@syzkaller.appspotmail.com
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: cls_api: add missing validation of netlink attributes · e331473f
      Davide Caratti committed
      Similarly to what has been done in 8b4c3cdd ("net: sched: Add policy
      validation for tc attributes"), fix classifier code to add validation of
      TCA_CHAIN and TCA_KIND netlink attributes.
      
      tested with:
       # ./tdc.py -c filter
      
      v2: Let sch_api and cls_api share nla_policy they have in common, thanks
          to David Ahern.
      v3: Avoid EXPORT_SYMBOL(), as validation of those attributes is not done
          by TC modules, thanks to Cong Wang.
          While at it, restore the 'Delete / get qdisc' comment to its original
          position, just above tc_get_qdisc() function prototype.
      
      Fixes: 5bc17018 ("net: sched: introduce multichain support for filters")
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: fix a privilege escalation bug · 58f5bbe3
      Wenwen Wang committed
      In dev_ethtool(), the eth command 'ethcmd' is first copied from the
      user-space buffer 'useraddr' and checked to see whether it is
      ETHTOOL_PERQUEUE. If yes, the sub-command 'sub_cmd' is further copied from
      the user space. Otherwise, 'sub_cmd' is the same as 'ethcmd'. Next,
      according to 'sub_cmd', a permission check is enforced through the function
      ns_capable(). For example, the permission check is required if 'sub_cmd' is
      ETHTOOL_SCOALESCE, but it is not necessary if 'sub_cmd' is
      ETHTOOL_GCOALESCE, as suggested in the comment "Allow some commands to be
      done by anyone". The following execution invokes different handlers
      according to 'ethcmd'. Specifically, if 'ethcmd' is ETHTOOL_PERQUEUE,
      ethtool_set_per_queue() is called. In ethtool_set_per_queue(), the kernel
      object 'per_queue_opt' is copied again from the user-space buffer
      'useraddr' and 'per_queue_opt.sub_command' is used to determine which
      operation should be performed. Given that the buffer 'useraddr' is in the
      user space, a malicious user can race to change the sub-command between the
      two copies. In particular, the attacker can supply ETHTOOL_PERQUEUE and
      ETHTOOL_GCOALESCE to bypass the permission check in dev_ethtool(). Then
      before ethtool_set_per_queue() is called, the attacker changes
      ETHTOOL_GCOALESCE to ETHTOOL_SCOALESCE. In this way, the attacker can
      bypass the permission check and execute ETHTOOL_SCOALESCE.
      
      This patch enforces a check in ethtool_set_per_queue() after the second
      copy from 'useraddr'. If the sub-command is different from the one obtained
      in the first copy in dev_ethtool(), an error code EINVAL will be returned.
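
The double-fetch check described above can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: `struct per_queue_req_sketch`, `fetch_sub_command()` and `handle_per_queue()` are hypothetical names, and a `volatile` pointer merely models a buffer that another thread may modify between the two reads.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical layout of the user-controlled request; not the kernel's struct. */
struct per_queue_req_sketch {
    unsigned int cmd;
    unsigned int sub_command;
};

/* First fetch: the caller reads the sub-command and permission-checks it. */
static unsigned int fetch_sub_command(const volatile struct per_queue_req_sketch *user)
{
    return user->sub_command;
}

/*
 * Second fetch: the handler copies the request again. If the sub-command no
 * longer matches the value that was permission-checked, reject the request
 * instead of acting on the new, unchecked value.
 */
static int handle_per_queue(const volatile struct per_queue_req_sketch *user,
                            unsigned int checked_sub_cmd,
                            unsigned int *executed)
{
    struct per_queue_req_sketch copy;

    assert(user != NULL && executed != NULL);
    copy.cmd = user->cmd;
    copy.sub_command = user->sub_command;

    if (copy.sub_command != checked_sub_cmd)
        return -EINVAL;

    *executed = copy.sub_command;
    return 0;
}
```

If the buffer changes between the two fetches, the handler returns -EINVAL and never acts on the unchecked sub-command.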
      
      Fixes: f38d138a ("net/ethtool: support set coalesce per queue")
      Signed-off-by: Wenwen Wang <wang6495@umn.edu>
      Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: fix a missing-check bug · 2bb3207d
      Wenwen Wang committed
      In ethtool_get_rxnfc(), the eth command 'cmd' is compared against
      'ETHTOOL_GRXFH' to see whether it is necessary to adjust the variable
      'info_size'. Then the whole structure of 'info' is copied from the
      user-space buffer 'useraddr' with 'info_size' bytes. In the following
      execution, 'info' may be copied again from the buffer 'useraddr' depending
      on the 'cmd' and the 'info.flow_type'. However, after these two copies,
      there is no check between 'cmd' and 'info.cmd'. In fact, 'cmd' is also
      copied from the buffer 'useraddr' in dev_ethtool(), which is the caller
      function of ethtool_get_rxnfc(). Given that 'useraddr' is in the user
      space, a malicious user can race to change the eth command in the buffer
      between these copies. By doing so, the attacker can supply inconsistent
      data and cause undefined behavior because in the following execution 'info'
      will be passed to ops->get_rxnfc().
      
      This patch adds a necessary check on 'info.cmd' and 'cmd' to confirm that
      they are still the same after the two copies in ethtool_get_rxnfc().
      Otherwise, an error code EINVAL will be returned.
      Signed-off-by: Wenwen Wang <wang6495@umn.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 12 Oct, 2018 1 commit
    • tipc: eliminate possible recursive locking detected by LOCKDEP · a1f8dd34
      Ying Xue committed
      When booting a kernel with the LOCKDEP option enabled, the following warning was found:
      
      WARNING: possible recursive locking detected
      4.19.0-rc7+ #14 Not tainted
      --------------------------------------------
      swapper/0/1 is trying to acquire lock:
      00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
      include/linux/spinlock.h:334 [inline]
      00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at:
      tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
      
      but task is already holding lock:
      00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
      include/linux/spinlock.h:334 [inline]
      00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
      tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&list->lock)->rlock#4);
        lock(&(&list->lock)->rlock#4);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      2 locks held by swapper/0/1:
       #0: 00000000f7539d34 (pernet_ops_rwsem){+.+.}, at:
      register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051
       #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
      spin_lock_bh include/linux/spinlock.h:334 [inline]
       #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
      tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
      
      stack backtrace:
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1af/0x295 lib/dump_stack.c:113
       print_deadlock_bug kernel/locking/lockdep.c:1759 [inline]
       check_deadlock kernel/locking/lockdep.c:1803 [inline]
       validate_chain kernel/locking/lockdep.c:2399 [inline]
       __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411
       lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900
       __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
       _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
       spin_lock_bh include/linux/spinlock.h:334 [inline]
       tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
       tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526
       tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521
       tipc_init_net+0x472/0x610 net/tipc/core.c:82
       ops_init+0xf7/0x520 net/core/net_namespace.c:129
       __register_pernet_operations net/core/net_namespace.c:940 [inline]
       register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011
       register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052
       tipc_init+0x83/0x104 net/tipc/core.c:140
       do_one_initcall+0x109/0x70a init/main.c:885
       do_initcall_level init/main.c:953 [inline]
       do_initcalls init/main.c:961 [inline]
       do_basic_setup init/main.c:979 [inline]
       kernel_init_freeable+0x4bd/0x57f init/main.c:1144
       kernel_init+0x13/0x180 init/main.c:1063
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413
      
      LOCKDEP reports this warning because tipc_link_reset() holds
      l->wakeupq.lock and l->inputq->lock nested inside each other. In fact,
      it is unnecessary to hold both locks at the same time while moving skb
      buffers from the l->wakeupq queue to the l->inputq queue. Instead, we
      can first move the skb buffers in the l->wakeupq queue to a temporary
      list and then move the buffers from the temporary list to the
      l->inputq queue, which is equally safe.
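
The temporary-list approach can be sketched with toy queues in userspace C. This is an illustration under stated assumptions, not the TIPC code: `struct queue_sketch` and its helpers are hypothetical, the "locks" are plain flags, and a counter records how many are held at once so the property can be checked.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 8

/* Toy queue with a flag standing in for its spinlock (hypothetical type). */
struct queue_sketch {
    int locked;
    int items[QUEUE_CAP];
    size_t len;
};

static int locks_held;     /* locks currently held */
static int max_locks_held; /* highest number ever held at once */

static void queue_lock(struct queue_sketch *q)
{
    assert(!q->locked);
    q->locked = 1;
    locks_held++;
    if (locks_held > max_locks_held)
        max_locks_held = locks_held;
}

static void queue_unlock(struct queue_sketch *q)
{
    assert(q->locked);
    q->locked = 0;
    locks_held--;
}

/* Append every item of src to dst and empty src. */
static void queue_splice(struct queue_sketch *dst, struct queue_sketch *src)
{
    assert(dst->len + src->len <= QUEUE_CAP);
    for (size_t i = 0; i < src->len; i++)
        dst->items[dst->len++] = src->items[i];
    src->len = 0;
}

/* Move wakeupq to inputq through a private list, one lock at a time. */
static void move_via_tmp(struct queue_sketch *wakeupq, struct queue_sketch *inputq)
{
    struct queue_sketch tmp = { 0 }; /* private, so it needs no lock */

    queue_lock(wakeupq);
    queue_splice(&tmp, wakeupq);
    queue_unlock(wakeupq);

    queue_lock(inputq);
    queue_splice(inputq, &tmp);
    queue_unlock(inputq);
}
```

Since the temporary list is visible only to the caller, the two queue locks are never nested, which removes the pattern that the lock validator flags.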
      
      Fixes: 3f32d0be ("tipc: lock wakeup & inputq at tipc_link_reset()")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 11 Oct, 2018 11 commits
    • xsk: do not call synchronize_net() under RCU read lock · cee27167
      Björn Töpel committed
      The XSKMAP update and delete functions called synchronize_net(), which
      can sleep. Sleeping is not allowed inside an RCU read-side critical section.
      
      Instead we need to make sure that the sock sk_destruct (xsk_destruct)
      function is asynchronously called after an RCU grace period. Setting
      the SOCK_RCU_FREE flag for XDP sockets takes care of this.
      
      Fixes: fbfc504a ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • tipc: queue socket protocol error messages into socket receive buffer · e7eb0582
      Parthasarathy Bhuvaragan committed
      In tipc_sk_filter_rcv(), when we detect protocol messages with error we
      call tipc_sk_conn_proto_rcv() and let it reset the connection and notify
      the socket by calling sk->sk_state_change().
      
      However, tipc_sk_filter_rcv() may have been called from the function
      tipc_backlog_rcv(), in which case the socket lock is held and the socket
      is already awake. This means that the sk_state_change() call is ignored
      and the error notification is lost. The receive queue will then remain
      empty and the socket sleeps forever.
      
      In this commit, we convert the protocol message into a connection abort
      message and enqueue it into the socket's receive queue. By this addition
      to the above state change we cover all conditions.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: set link tolerance correctly in broadcast link · 047491ea
      Jon Maloy committed
      In the patch referred to below we added link tolerance as an additional
      criterion for declaring broadcast transmission "stale" and resetting the
      affected links.
      
      However, the 'tolerance' field of the broadcast link is never set, and
      remains at zero. This renders the whole commit without the intended
      improving effect, but luckily also with no negative effect.
      
      In this commit we add the missing initialization.
      
      Fixes: a4dc70d4 ("tipc: extend link reset criteria for stale packet retransmission")
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: don't let PMTU updates increase route MTU · 28d35bcd
      Sabrina Dubroca committed
      When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
      received, we must clamp its value. However, we can receive a PMTU
      exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
      increase in PMTU.
      
      To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.
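
The clamping rule can be sketched as a small pure function. This is an illustration under stated assumptions rather than the kernel code: `clamp_pmtu_sketch()` is a hypothetical name, `min_pmtu` stands in for ip_rt_min_pmtu, and the early return for a PMTU larger than the old MTU is an assumption about how increases are rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the MTU to store for an exception after a PMTU update.
 * *locked is set when the reported PMTU is below the accepted minimum.
 */
static unsigned int clamp_pmtu_sketch(unsigned int old_mtu,
                                      unsigned int new_pmtu,
                                      unsigned int min_pmtu,
                                      bool *locked)
{
    assert(locked != NULL);
    *locked = false;

    /* Assumption: an update never raises the MTU above the old value. */
    if (old_mtu < new_pmtu)
        return old_mtu;

    if (new_pmtu < min_pmtu) {
        *locked = true;
        /* Take the smaller of old MTU and minimum, so the MTU cannot grow. */
        return old_mtu < min_pmtu ? old_mtu : min_pmtu;
    }
    return new_pmtu;
}
```

With an old MTU of 500, a reported PMTU of 400 and a minimum of 552, clamping to the minimum alone would raise the MTU to 552; taking the smaller value keeps it at 500.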
      
      Before this patch, in case of an update, the exception's MTU would
      always change. Now, an exception can have only its lock flag updated,
      but not the MTU, so we need to add a check on locking to the following
      "is this exception getting updated, or close to expiring?" test.
      
      Fixes: d52e5a7e ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: update fnhe_pmtu when first hop's MTU changes · af7d6cce
      Sabrina Dubroca committed
      Since commit 5aad1de5 ("ipv4: use separate genid for next hop
      exceptions"), exceptions get deprecated separately from cached
      routes. In particular, administrative changes don't clear PMTU anymore.
      
      As Stefano described in commit e9fa1495 ("ipv6: Reflect MTU changes
      on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
      the local MTU change can become stale:
       - if the local MTU is now lower than the PMTU, that PMTU is now
         incorrect
       - if the local MTU was the lowest value in the path, and is increased,
         we might discover a higher PMTU
      
      Similarly to what commit e9fa1495 did for IPv6, update PMTU in those
      cases.
      
      If the exception was locked, the discovered PMTU was smaller than the
      minimal accepted PMTU. In that case, if the new local MTU is smaller
      than the current PMTU, let PMTU discovery figure out if locking of the
      exception is still needed.
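
One plausible reading of the rules above, written as a userspace sketch. The names `struct exception_sketch` and `on_link_mtu_change()` are hypothetical, and the exact comparisons used by the kernel may differ from this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a next hop exception. */
struct exception_sketch {
    unsigned int pmtu;
    bool locked;
};

/* React to the local link MTU changing from orig_mtu to new_mtu. */
static void on_link_mtu_change(struct exception_sketch *ex,
                               unsigned int orig_mtu,
                               unsigned int new_mtu)
{
    assert(ex != NULL);

    if (ex->locked) {
        /* New MTU below the locked PMTU: adopt it, let discovery re-lock. */
        if (new_mtu < ex->pmtu) {
            ex->pmtu = new_mtu;
            ex->locked = false;
        }
    } else if (new_mtu < ex->pmtu || orig_mtu == ex->pmtu) {
        /*
         * Either the stored PMTU is now too high, or the local link was
         * the bottleneck and a higher PMTU may be discoverable.
         */
        ex->pmtu = new_mtu;
    }
}
```

The second branch is why the old link MTU has to be known when the change is reported: without it, the "local link was the bottleneck" case cannot be recognised.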
      
      To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
      notifier. By the time the notifier is called, dev->mtu has been
      changed. This patch adds the old MTU as additional information in the
      notifier structure, and a new call_netdevice_notifiers_u32() function.
      
      Fixes: 5aad1de5 ("ipv4: use separate genid for next hop exceptions")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: stop leaking percpu memory in fib6 info · 7abab7b9
      Mike Rapoport committed
      The fib6_info_alloc() function allocates percpu memory to hold per CPU
      pointers to rt6_info, but this memory is never freed. Fix it.
      
      Fixes: a64efe14 ("net/ipv6: introduce fib6_info struct and helpers")
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rds: RDS (tcp) hangs on sendto() to unresponding address · 9a4890bd
      Ka-Cheong Poon committed
      In rds_send_mprds_hash(), if the calculated hash value is non-zero and
      the MPRDS connections are not yet up, it will wait.  But it should not
      wait if the send is non-blocking.  In this case, it should just use the
      base c_path for sending the message.
      Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: make skb_partial_csum_set() more robust against overflows · 52b5d6f5
      Eric Dumazet committed
      syzbot managed to crash in skb_checksum_help() [1] :
      
              BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb));
      
      Root cause is the following check in skb_partial_csum_set()
      
      	if (unlikely(start > skb_headlen(skb)) ||
      	    unlikely((int)start + off > skb_headlen(skb) - 2))
      		return false;
      
      If skb_headlen(skb) is 1, then (skb_headlen(skb) - 2) becomes 0xffffffff
      and the check fails to detect that ((int)start + off) is off the limit,
      since the compare is unsigned.
      
      When we fix that, then the first condition (start > skb_headlen(skb))
      becomes obsolete.
      
      Then we should also check that (skb_headroom(skb) + start) won't
      overflow a 16-bit field.
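
The unsigned underflow can be reproduced in userspace. The sketch below is an assumption-laden illustration, not the kernel patch: `csum_check_buggy()` mirrors the quoted check with the implicit conversion written out as an explicit cast, and `csum_check_fixed()` shows one way to compute the bounds in 32-bit arithmetic so nothing wraps. Both function names and the plain integer parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the old check: headlen - 2 wraps to 0xffffffff when headlen < 2. */
static bool csum_check_buggy(uint16_t start, uint16_t off, unsigned int headlen)
{
    if ((unsigned int)start > headlen ||
        (unsigned int)((int)start + off) > headlen - 2u)
        return false;
    return true;
}

/*
 * Sketch of a wrap-free check: compute the end of the 2-byte checksum field
 * by addition only, and make sure headroom + start fits in 16 bits.
 */
static bool csum_check_fixed(uint16_t start, uint16_t off,
                             unsigned int headlen, unsigned int headroom)
{
    uint32_t csum_end = (uint32_t)start + (uint32_t)off +
                        (uint32_t)sizeof(uint16_t);
    uint32_t csum_start = headroom + (uint32_t)start;

    return csum_start <= UINT16_MAX && csum_end <= headlen;
}
```

With a head length of 1, the old check accepts start = 0 and off = 0 even though the checksum field would end at byte 2. The addition-only form rejects it, and it also covers the case the separate start > headlen test used to handle.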
      
      [1]
      kernel BUG at net/core/dev.c:2880!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 7330 Comm: syz-executor4 Not tainted 4.19.0-rc6+ #253
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:skb_checksum_help+0x9e3/0xbb0 net/core/dev.c:2880
      Code: 85 00 ff ff ff 48 c1 e8 03 42 80 3c 28 00 0f 84 09 fb ff ff 48 8b bd 00 ff ff ff e8 97 a8 b9 fb e9 f8 fa ff ff e8 2d 09 76 fb <0f> 0b 48 8b bd 28 ff ff ff e8 1f a8 b9 fb e9 b1 f6 ff ff 48 89 cf
      RSP: 0018:ffff8801d83a6f60 EFLAGS: 00010293
      RAX: ffff8801b9834380 RBX: ffff8801b9f8d8c0 RCX: ffffffff8608c6d7
      RDX: 0000000000000000 RSI: ffffffff8608cc63 RDI: 0000000000000006
      RBP: ffff8801d83a7068 R08: ffff8801b9834380 R09: 0000000000000000
      R10: ffff8801d83a76d8 R11: 0000000000000000 R12: 0000000000000001
      R13: 0000000000010001 R14: 000000000000ffff R15: 00000000000000a8
      FS:  00007f1a66db5700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f7d77f091b0 CR3: 00000001ba252000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       skb_csum_hwoffload_help+0x8f/0xe0 net/core/dev.c:3269
       validate_xmit_skb+0xa2a/0xf30 net/core/dev.c:3312
       __dev_queue_xmit+0xc2f/0x3950 net/core/dev.c:3797
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3838
       packet_snd net/packet/af_packet.c:2928 [inline]
       packet_sendmsg+0x422d/0x64c0 net/packet/af_packet.c:2953
      
      Fixes: 5ff8dda3 ("net: Ensure partial checksum offset is inside the skb head")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Add helper function for safely copy string param · bde74ad1
      Moshe Shemesh committed
      Devlink string param buffer is allocated at the size of
      DEVLINK_PARAM_MAX_STRING_VALUE. Add helper function which makes sure
      this size is not exceeded.
      Renamed DEVLINK_PARAM_MAX_STRING_VALUE to
      __DEVLINK_PARAM_MAX_STRING_VALUE to emphasize that it should be used by
      devlink only. The driver should use the helper function instead to
      verify it doesn't exceed the allowed length.
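
A bounded-copy helper of this kind might look like the following userspace sketch. The names `PARAM_MAX_STRING_VALUE_SKETCH`, `struct param_value_sketch` and `param_value_str_fill_sketch()` are hypothetical, and the buffer size of 32 is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed buffer size for illustration only. */
#define PARAM_MAX_STRING_VALUE_SKETCH 32

struct param_value_sketch {
    char vstr[PARAM_MAX_STRING_VALUE_SKETCH];
};

/*
 * Copy src into the fixed-size buffer, always NUL-terminating.
 * Returns true if src fit completely, false if it had to be truncated.
 */
static bool param_value_str_fill_sketch(struct param_value_sketch *dst,
                                        const char *src)
{
    assert(dst != NULL && src != NULL);

    size_t len = strlen(src);
    size_t n = len < sizeof(dst->vstr) - 1 ? len : sizeof(dst->vstr) - 1;

    memcpy(dst->vstr, src, n);
    dst->vstr[n] = '\0';
    return len < sizeof(dst->vstr);
}
```

Keeping the size constant private to the helper means callers cannot size their own copies against it and get the bound wrong.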
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Fix param cmode driverinit for string type · 1276534c
      Moshe Shemesh committed
      The driverinit configuration mode value is held by devlink to enable the
      driver to fetch the value after a reload command. If the param type is
      string, devlink should copy the value from the driver string buffer to
      the devlink string buffer on devlink_param_driverinit_value_set() and
      vice-versa on devlink_param_driverinit_value_get().
      
      Fixes: ec01aeb1 ("devlink: Add support for get/set driverinit value")
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Fix param set handling for string type · f355cfcd
      Moshe Shemesh committed
      When the devlink param type is string, devlink needs to copy the string
      value it got from the input to devlink_param_value.
      
      Fixes: e3b7ca18 ("devlink: Add param set command")
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 09 Oct, 2018 3 commits
    • rxrpc: Fix the packet reception routine · c1e15b49
      David Howells committed
      The rxrpc_input_packet() function and its call tree were built around the
      assumption that the data_ready() handler, called from UDP to inform a
      kernel service that there is data to be had, was non-reentrant.  This
      means that certain locking could be dispensed with.
      
      This, however, turns out not to be the case with a multi-queue network card
      that can deliver packets to multiple cpus simultaneously.  Each of those
      cpus can be in the rxrpc_input_packet() function at the same time.
      
      Fix by adding or changing some structure members:
      
       (1) Add peer->rtt_input_lock to serialise access to the RTT buffer.
      
       (2) Make conn->service_id into a 32-bit variable so that it can be
           cmpxchg'd on all arches.
      
       (3) Add call->input_lock to serialise access to the Rx/Tx state.  Note
           that although the Rx and Tx states are (almost) entirely separate,
           there's no point completing the separation and having separate locks
           since it's a bi-phasal RPC protocol rather than a bi-directional
           streaming protocol.  Data transmission and data reception do not take
           place simultaneously on any particular call.
      
      and making the following functional changes:
      
       (1) In rxrpc_input_data(), hold call->input_lock around the core to
           prevent simultaneous producing of packets into the Rx ring and
           updating of tracking state for a particular call.
      
       (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
           check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
           The bit test and bit clear can then be combined.  No further locking
           is needed here.
      
       (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
           the ACK packet.  The superseded ACK check is then done both before and
           after the lock is taken.
      
           The handling of ackinfo data is split, parsing before the lock is taken
           and processing with it held.  This is keyed on rxMTU being non-zero.
      
           Congestion management is also done within the locked section.
      
       (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
           rotation.  The ACKALL packet carries no information and is only really
           useful after all packets have been transmitted since it's imprecise.
      
       (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
           prevent calls being simultaneously implicitly ended on two cpus and
           also to prevent any races with incoming call setup.
      
       (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
           on a connection.  It is only permitted to happen once for a
           connection.
      
       (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
           rx->incoming_lock to see if someone else set up the call, connection
           or peer whilst we were getting there.  We can't trust the values from
           the earlier routing check unless we pin refs on them - which we want
           to avoid.
      
           Further, we need to allow for an incoming call to have its state
           changed on another CPU between us making it live and us adjusting it
           because the conn is now in the RXRPC_CONN_SERVICE state.
      
       (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
           to the RTT buffer.  Don't need to lock around setting peer->rtt.
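
The once-only service upgrade in item (6) can be illustrated with C11 atomics in userspace. This is a sketch under stated assumptions: `struct conn_sketch` and `try_service_upgrade()` are hypothetical names, and `atomic_compare_exchange_strong()` stands in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection record; a 32-bit ID can be swapped atomically. */
struct conn_sketch {
    _Atomic uint32_t service_id;
};

/*
 * Upgrade the service ID from old_id to new_id. Only the first caller that
 * still observes old_id wins; any concurrent or later attempt fails.
 */
static bool try_service_upgrade(struct conn_sketch *conn,
                                uint32_t old_id, uint32_t new_id)
{
    assert(conn != NULL);

    uint32_t expected = old_id;
    return atomic_compare_exchange_strong(&conn->service_id,
                                          &expected, new_id);
}
```

If two CPUs race to upgrade the same connection, exactly one compare-and-exchange succeeds, so the upgrade happens at most once without taking a lock.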
      
      For reference, the inventory of state-accessing or state-altering functions
      used by the packet input procedure is:
      
      > rxrpc_input_packet()
        * PACKET CHECKING
      
        * ROUTING
          > rxrpc_post_packet_to_local()
          > rxrpc_find_connection_rcu() - uses RCU
            > rxrpc_lookup_peer_rcu() - uses RCU
            > rxrpc_find_service_conn_rcu() - uses RCU
            > idr_find() - uses RCU
      
        * CONNECTION-LEVEL PROCESSING
          - Service upgrade
            - Can only happen once per conn
            ! Changed to use cmpxchg
          > rxrpc_post_packet_to_conn()
          - Setting conn->hi_serial
            - Probably safe not using locks
            - Maybe use cmpxchg
      
        * CALL-LEVEL PROCESSING
          > Old-call checking
            > rxrpc_input_implicit_end_call()
              > rxrpc_call_completed()
      	> rxrpc_queue_call()
      	! Need to take rx->incoming_lock
      	> __rxrpc_disconnect_call()
      	> rxrpc_notify_socket()
          > rxrpc_new_incoming_call()
            - Uses rx->incoming_lock for the entire process
              - Might be able to drop this earlier in favour of the call lock
            > rxrpc_incoming_call()
            	! Conflicts with rxrpc_input_implicit_end_call()
          > rxrpc_send_ping()
            - Don't need locks to check rtt state
            > rxrpc_propose_ACK
      
        * PACKET DISTRIBUTION
          > rxrpc_input_call_packet()
            > rxrpc_input_data()
      	* QUEUE DATA PACKET ON CALL
      	> rxrpc_reduce_call_timer()
      	  - Uses timer_reduce()
      	! Needs call->input_lock()
      	> rxrpc_receiving_reply()
      	  ! Needs locking around ack state
      	  > rxrpc_rotate_tx_window()
      	  > rxrpc_end_tx_phase()
      	> rxrpc_proto_abort()
      	> rxrpc_input_dup_data()
      	- Fills the Rx buffer
      	- rxrpc_propose_ACK()
      	- rxrpc_notify_socket()
      
            > rxrpc_input_ack()
      	* APPLY ACK PACKET TO CALL AND DISCARD PACKET
      	> rxrpc_input_ping_response()
      	  - Probably doesn't need any extra locking
      	  ! Need READ_ONCE() on call->ping_serial
      	  > rxrpc_input_check_for_lost_ack()
      	    - Takes call->lock to consult Tx buffer
      	  > rxrpc_peer_add_rtt()
      	    ! Needs to take a lock (peer->rtt_input_lock)
      	    ! Could perhaps manage with cmpxchg() and xadd() instead
      	> rxrpc_input_requested_ack
      	  - Consults Tx buffer
      	    ! Probably needs a lock
      	  > rxrpc_peer_add_rtt()
      	> rxrpc_propose_ack()
      	> rxrpc_input_ackinfo()
      	  - Changes call->tx_winsize
      	    ! Use cmpxchg to handle change
      	    ! Should perhaps track serial number
      	  - Uses peer->lock to record MTU specification changes
      	> rxrpc_proto_abort()
      	! Need to take call->input_lock
      	> rxrpc_rotate_tx_window()
      	> rxrpc_end_tx_phase()
      	> rxrpc_input_soft_acks()
      	- Consults the Tx buffer
      	> rxrpc_congestion_management()
      	  - Modifies the Tx annotations
      	  ! Needs call->input_lock()
      	  > rxrpc_queue_call()
      
            > rxrpc_input_abort()
      	* APPLY ABORT PACKET TO CALL AND DISCARD PACKET
      	> rxrpc_set_call_completion()
      	> rxrpc_notify_socket()
      
            > rxrpc_input_ackall()
      	* APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
      	! Need to take call->input_lock
      	> rxrpc_rotate_tx_window()
      	> rxrpc_end_tx_phase()
      
          > rxrpc_reject_packet()
      
      There are some functions used by the above that queue the packet, after
      which the procedure is terminated:
      
       - rxrpc_post_packet_to_local()
         - local->event_queue is an sk_buff_head
         - local->processor is a work_struct
       - rxrpc_post_packet_to_conn()
         - conn->rx_queue is an sk_buff_head
         - conn->processor is a work_struct
       - rxrpc_reject_packet()
         - local->reject_queue is an sk_buff_head
         - local->processor is a work_struct
      
      And some that offload processing to process context:
      
       - rxrpc_notify_socket()
         - Uses RCU lock
         - Uses call->notify_lock to call call->notify_rx
         - Uses call->recvmsg_lock to queue recvmsg side
       - rxrpc_queue_call()
         - call->processor is a work_struct
       - rxrpc_propose_ACK()
         - Uses call->lock to wrap __rxrpc_propose_ACK()
      
      And a bunch that complete a call, all of which use call->state_lock to
      protect the call state:
      
       - rxrpc_call_completed()
       - rxrpc_set_call_completion()
       - rxrpc_abort_call()
       - rxrpc_proto_abort()
         - Also uses rxrpc_queue_call()
      
      Fixes: 17926a79 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
      Signed-off-by: David Howells <dhowells@redhat.com>
      c1e15b49
    • D
      rxrpc: Fix connection-level abort handling · 64753092
      David Howells committed
      Fix connection-level abort handling to cache the abort and error codes
      properly so that a new incoming call can be properly aborted if it races
      with the parent connection being aborted by another CPU.
      
      The abort_code and error parameters can then be dropped from
      rxrpc_abort_calls().
      
      Fixes: f5c17aae ("rxrpc: Calls should only have one terminal state")
      Signed-off-by: David Howells <dhowells@redhat.com>
      64753092
    • D
      rxrpc: Only take the rwind and mtu values from latest ACK · 298bc15b
      David Howells committed
      Move the out-of-order and duplicate ACK packet check to before the call to
      rxrpc_input_ackinfo() so that the receive window size and MTU size are only
      checked in the latest ACK packet and don't regress.
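
A minimal sketch of the ordering described above, written as standalone userspace C rather than the actual rxrpc code. The struct, its fields and both function names are hypothetical; only the idea (reject duplicate or out-of-order ACKs by serial number before the window and MTU values are applied) comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-call state; the field names are illustrative only. */
struct call_state {
	uint32_t latest_ack_serial; /* serial of the newest ACK applied so far */
	uint32_t tx_winsize;        /* peer's advertised receive window */
	uint32_t peer_mtu;          /* peer's advertised MTU */
};

/* True if serial a is newer than serial b, allowing for 32-bit wraparound. */
static bool serial_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/*
 * Apply the window and MTU values only if this ACK is newer than the last
 * one applied.  Returns true if the values were taken, false if the ACK was
 * a duplicate or arrived out of order and was therefore ignored.
 */
static bool apply_ack_info(struct call_state *call, uint32_t serial,
			   uint32_t rwind, uint32_t mtu)
{
	if (!serial_after(serial, call->latest_ack_serial))
		return false; /* stale ACK: checked before touching any state */

	call->latest_ack_serial = serial;
	call->tx_winsize = rwind;
	call->peer_mtu = mtu;
	return true;
}
```

With the check placed first, a late-arriving older ACK cannot shrink the window or MTU that a newer ACK has already set.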
      
      Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
      Signed-off-by: David Howells <dhowells@redhat.com>
      298bc15b
  5. 08 Oct, 2018 6 commits
    • D
      rxrpc: Carry call state out of locked section in rxrpc_rotate_tx_window() · dfe99522
      David Howells committed
      Carry the call state out of the locked section in rxrpc_rotate_tx_window()
      rather than sampling it afterwards.  This is only used to select tracepoint
      data, but could have changed by the time we do the tracepoint.
      Signed-off-by: David Howells <dhowells@redhat.com>
      dfe99522
    • D
      rxrpc: Don't check RXRPC_CALL_TX_LAST after calling rxrpc_rotate_tx_window() · c479d5f2
      David Howells committed
      We should only call the function to end a call's Tx phase if we rotated the
      marked-last packet out of the transmission buffer.
      
      Make rxrpc_rotate_tx_window() return an indication of whether it just
      rotated the packet marked as the last out of the transmit buffer, carrying
      the information out of the locked section in that function.
      
      We can then check the return value instead of examining RXRPC_CALL_TX_LAST.
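
The pattern can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the kernel function: the struct layout, the ANNOT_LAST bit, the buffer size and the use of a C11 atomic_flag as a stand-in spinlock are all hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TX_BUF_SIZE 8
#define ANNOT_LAST  0x1 /* hypothetical "last packet" annotation bit */

struct tx_window {
	atomic_flag lock;           /* stand-in for the call's spinlock */
	uint32_t hard_ack;          /* highest sequence number rotated out */
	uint8_t annot[TX_BUF_SIZE]; /* per-slot annotations */
};

/*
 * Rotate packets up to and including sequence number 'to' out of the
 * window.  Whether the marked-last packet was among them is decided while
 * the lock is held and handed back to the caller as the return value, so
 * the caller never has to re-read shared state after the lock is dropped.
 */
static bool rotate_tx_window(struct tx_window *w, uint32_t to)
{
	bool rot_last = false;

	while (atomic_flag_test_and_set(&w->lock))
		; /* spin until the lock is free */

	while (w->hard_ack < to) {
		uint32_t slot;

		w->hard_ack++;
		slot = w->hard_ack % TX_BUF_SIZE;
		if (w->annot[slot] & ANNOT_LAST)
			rot_last = true;
		w->annot[slot] = 0;
	}

	atomic_flag_clear(&w->lock);
	return rot_last;
}
```

A caller would then end the Tx phase only when the function returns true, instead of testing a flag that another CPU may have changed in the meantime.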
      
      Fixes: 70790dbe ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
      Signed-off-by: David Howells <dhowells@redhat.com>
      c479d5f2
    • D
      rxrpc: Don't need to take the RCU read lock in the packet receiver · bfd28211
      David Howells committed
      We don't need to take the RCU read lock in the rxrpc packet receive
      function because it's held further up the stack in the IP input routine
      around the UDP receive routines.
      
      Fix this by dropping the RCU read lock calls from rxrpc_input_packet().
      This simplifies the code.
      
      Fixes: 70790dbe ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
      Signed-off-by: David Howells <dhowells@redhat.com>
      bfd28211
    • D
      rxrpc: Use the UDP encap_rcv hook · 5271953c
      David Howells committed
      Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception
      in which a packet is placed onto the UDP receive queue and then immediately
      removed again by rxrpc.  Going via the queue in this manner seems like it
      should be unnecessary.
      
      This does, however, require the invention of a value to place in encap_type
      as that's one of the conditions to switch packets out to the encap_rcv
      hook.  Possibly the value doesn't actually matter for anything other than
      sockopts on the UDP socket, which aren't accessible outside of rxrpc
      anyway.
      
      This seems to cut a bit of time out of the time elapsed between each
      sk_buff being timestamped and turning up in rxrpc (the final number in the
      following trace excerpts).  I measured this by making the rxrpc_rx_packet
      trace point print the time elapsed between the skb being timestamped and
      the current time (in ns), e.g.:
      
      	... 424.278721: rxrpc_rx_packet: ...  ACK 25026
      
      So doing a 512MiB DIO read from my test server, with an unmodified kernel:
      
      	N       min     max     sum		mean    stddev
      	27605   2626    7581    7.83992e+07     2840.04 181.029
      
      and with the patch applied:
      
      	N       min     max     sum		mean    stddev
      	27547   1895    12165   6.77461e+07     2459.29 255.02
      Signed-off-by: David Howells <dhowells@redhat.com>
      5271953c
    • A
      net: sched: cls_u32: fix hnode refcounting · 6d4c4077
      Al Viro committed
      cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references
      via ->hlist and via ->tp_root together.  u32_destroy() drops the former
      and, in case when there had been links, leaves the sucker on the list.
      As the result, there's nothing to protect it from getting freed once links
      are dropped.
      That also makes the "is it busy" check incapable of catching the root
      hnode - it *is* busy (there's a reference from tp), but we don't see it as
      something separate.  "Is it our root?" check partially covers that, but
      the problem exists for others' roots as well.
      
      AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
      include oopsen, that is) would be this:
              * count tp->root and tp_c->hlist as separate references.  I.e.
      have u32_init() set refcount to 2, not 1.
      	* in u32_destroy() we always drop the former;
      in u32_destroy_hnode() - the latter.
      
      	That way we have *all* references contributing to refcount.  List
      removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
      and in u32_destroy() in case of tc_u_common going away, along with
      everything reachable from it.  IOW, that way we know that
      u32_destroy_key() won't free something still on the list (or pointed to by
      someone's ->root).
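
A toy model of the counting scheme just described, in standalone C. It is not the cls_u32 code: the struct, the helper names and the 'freed' flag (standing in for the actual free) are hypothetical, and locking and RCU are left out entirely.

```c
#include <stdbool.h>

/* Toy stand-in for a hash node; 'freed' replaces the real free. */
struct hnode {
	int refcnt;
	bool on_list;
	bool freed;
};

/* A new node starts with two references: one for tp->root, one for the list. */
static void hnode_init(struct hnode *h)
{
	h->refcnt = 2;
	h->on_list = true;
	h->freed = false;
}

/* A link from another table takes its own reference. */
static void hnode_get(struct hnode *h)
{
	h->refcnt++;
}

static void hnode_put(struct hnode *h)
{
	if (--h->refcnt == 0)
		h->freed = true;
}

/*
 * Explicit destruction is allowed only when the list holds the sole
 * remaining reference; it unlists the node and drops that reference.
 */
static bool hnode_destroy(struct hnode *h)
{
	if (h->refcnt != 1)
		return false; /* still busy */
	h->on_list = false;
	hnode_put(h);
	return true;
}

/* The classifier instance goes away: drop only the root reference. */
static void tp_destroy(struct hnode *root)
{
	hnode_put(root);
	hnode_destroy(root); /* does nothing while links still exist */
}
```

In this model a node that is still linked from elsewhere survives tp_destroy() with its list reference intact, so dropping the last link cannot free something that is still on the list.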
      
      Reproducer:
      
      tc qdisc add dev eth0 ingress
      tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: \
      u32 divisor 1
      tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: \
      u32 divisor 1
      tc filter add dev eth0 parent ffff: protocol ip prio 100 \
      handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 \
      plus 0 eat match ip protocol 6 ff
      tc filter delete dev eth0 parent ffff: protocol ip prio 200
      tc filter change dev eth0 parent ffff: protocol ip prio 100 \
      handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 \
      eat match ip protocol 6 ff
      tc filter delete dev eth0 parent ffff: protocol ip prio 100
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6d4c4077
    • J
      udp: Unbreak modules that rely on external __skb_recv_udp() availability · 7e823644
      Jiri Kosina committed
      Commit 2276f58a ("udp: use a separate rx queue for packet reception")
      turned static inline __skb_recv_udp() from being a trivial helper around
      __skb_recv_datagram() into a UDP specific implementation, making it
      EXPORT_SYMBOL_GPL() at the same time.
      
      There are external modules that got broken by __skb_recv_udp() not being
      visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().
      
      One rationale for why this is actually the "technically correct" thing
      to do: __skb_recv_udp() used to be an inline wrapper around
      __skb_recv_datagram(), which itself (still, and correctly so, I believe)
      is EXPORT_SYMBOL().
      
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: 2276f58a ("udp: use a separate rx queue for packet reception")
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e823644
  6. 06 Oct, 2018 5 commits
    • K
      treewide: Replace more open-coded allocation size multiplications · 329e0989
      Kees Cook committed
      As done treewide earlier, this catches several more open-coded
      allocation size calculations that were added to the kernel during the
      merge window. This performs the following mechanical transformations
      using Coccinelle:
      
      	kvmalloc(a * b, ...) -> kvmalloc_array(a, b, ...)
      	kvzalloc(a * b, ...) -> kvcalloc(a, b, ...)
      	devm_kzalloc(..., a * b, ...) -> devm_kcalloc(..., a, b, ...)
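
The point of the array-style helpers is that the element count and element size are passed separately, so the product can be checked for overflow before anything is allocated. A userspace sketch of that check follows; alloc_array() is a hypothetical name used for illustration and is not a kernel API.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate room for n elements of 'size' bytes each, refusing the request
 * if n * size would not fit in a size_t.  An open-coded malloc(n * size)
 * would silently wrap around and return a buffer that is too small.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL; /* the multiplication would overflow */
	return malloc(n * size);
}
```

A request such as alloc_array(SIZE_MAX, 2) is rejected outright, whereas malloc(SIZE_MAX * 2) would be asked for a wrapped-around, much smaller size.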
      Signed-off-by: Kees Cook <keescook@chromium.org>
      329e0989
    • W
      ipv6: take rcu lock in rawv6_send_hdrinc() · a688caa3
      Wei Wang committed
      In rawv6_send_hdrinc(), in order to avoid an extra dst_hold(), we
      directly assign the dst to the skb and set the passed-in dst to NULL to
      avoid a double free.
      However, in the error case, we free the skb and then do the stats update
      with the dst pointer passed in. This causes a use-after-free on the dst.
      Fix it by taking the RCU read lock right before the dst could get
      released, to make sure the dst does not get freed until the stats update
      is done.
      Note: we don't have this issue in IPv4 because the dst is not used for
      the stats update in v4.
      
      Syzkaller reported the following crash:
      BUG: KASAN: use-after-free in rawv6_send_hdrinc net/ipv6/raw.c:692 [inline]
      BUG: KASAN: use-after-free in rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921
      Read of size 8 at addr ffff8801d95ba730 by task syz-executor0/32088
      
      CPU: 1 PID: 32088 Comm: syz-executor0 Not tainted 4.19.0-rc2+ #93
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
       print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
       rawv6_send_hdrinc net/ipv6/raw.c:692 [inline]
       rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:631
       ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114
       __sys_sendmsg+0x11d/0x280 net/socket.c:2152
       __do_sys_sendmsg net/socket.c:2161 [inline]
       __se_sys_sendmsg net/socket.c:2159 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457099
      Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f83756edc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007f83756ee6d4 RCX: 0000000000457099
      RDX: 0000000000000000 RSI: 0000000020003840 RDI: 0000000000000004
      RBP: 00000000009300a0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000004d4b30 R14: 00000000004c90b1 R15: 0000000000000000
      
      Allocated by task 32088:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
       kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
       kmem_cache_alloc+0x12e/0x730 mm/slab.c:3554
       dst_alloc+0xbb/0x1d0 net/core/dst.c:105
       ip6_dst_alloc+0x35/0xa0 net/ipv6/route.c:353
       ip6_rt_cache_alloc+0x247/0x7b0 net/ipv6/route.c:1186
       ip6_pol_route+0x8f8/0xd90 net/ipv6/route.c:1895
       ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2093
       fib6_rule_lookup+0x277/0x860 net/ipv6/fib6_rules.c:122
       ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2121
       ip6_route_output include/net/ip6_route.h:88 [inline]
       ip6_dst_lookup_tail+0xe27/0x1d60 net/ipv6/ip6_output.c:951
       ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079
       rawv6_sendmsg+0x12d9/0x4630 net/ipv6/raw.c:905
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:631
       ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114
       __sys_sendmsg+0x11d/0x280 net/socket.c:2152
       __do_sys_sendmsg net/socket.c:2161 [inline]
       __se_sys_sendmsg net/socket.c:2159 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 5356:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
       kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
       __cache_free mm/slab.c:3498 [inline]
       kmem_cache_free+0x83/0x290 mm/slab.c:3756
       dst_destroy+0x267/0x3c0 net/core/dst.c:141
       dst_destroy_rcu+0x16/0x19 net/core/dst.c:154
       __rcu_reclaim kernel/rcu/rcu.h:236 [inline]
       rcu_do_batch kernel/rcu/tree.c:2576 [inline]
       invoke_rcu_callbacks kernel/rcu/tree.c:2880 [inline]
       __rcu_process_callbacks kernel/rcu/tree.c:2847 [inline]
       rcu_process_callbacks+0xf23/0x2670 kernel/rcu/tree.c:2864
       __do_softirq+0x30b/0xad8 kernel/softirq.c:292
      
      Fixes: 1789a640 ("raw: avoid two atomics in xmit")
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a688caa3
    • D
      net: sched: Add policy validation for tc attributes · 8b4c3cdd
      David Ahern committed
      A number of TC attributes are processed without proper validation
      (e.g., length checks). Add a tca policy for all input attributes and use
      it when invoking nlmsg_parse().
      
      The 2 Fixes tags below cover the latest additions. The other attributes
      are a string (KIND), a nested attribute (OPTIONS, which does seem to have
      validation in most cases), attributes used for dumps only, or a flag.
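
To illustrate what a per-attribute policy buys, here is a small userspace sketch of length validation over a type-length-value buffer. It only mimics the idea: struct attr_hdr, struct attr_policy and validate_attrs() are hypothetical and are not the kernel's nlattr, nla_policy or nlmsg_parse().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute header; 'len' counts the header plus the payload. */
struct attr_hdr {
	uint16_t len;
	uint16_t type;
};

/* Hypothetical policy entry: required payload length, 0 meaning "no check". */
struct attr_policy {
	uint16_t payload_len;
};

#define ATTR_ALIGN(n) (((n) + 3u) & ~(size_t)3u)

/*
 * Walk the attributes in buf and reject the message if any attribute is
 * truncated or does not have the payload length its policy entry requires.
 * Returns 0 if every attribute passes, -1 otherwise.
 */
static int validate_attrs(const uint8_t *buf, size_t buflen,
			  const struct attr_policy *policy, uint16_t maxtype)
{
	size_t off = 0;

	while (buflen - off >= sizeof(struct attr_hdr)) {
		struct attr_hdr h;
		size_t payload, step;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > buflen - off)
			return -1; /* malformed or truncated attribute */

		payload = h.len - sizeof(h);
		if (h.type <= maxtype && policy[h.type].payload_len != 0 &&
		    payload != policy[h.type].payload_len)
			return -1; /* length does not match the policy */

		step = ATTR_ALIGN((size_t)h.len);
		if (step > buflen - off)
			break; /* trailing padding is missing: stop here */
		off += step;
	}
	return 0;
}
```

Without such a table, code that later reads a 32-bit value from an attribute would have to trust that the sender really supplied four bytes.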
      
      Fixes: 5bc17018 ("net: sched: introduce multichain support for filters")
      Fixes: d47a6b0e ("net: sched: introduce ingress/egress block index attributes for qdisc")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b4c3cdd
    • M
      rtnetlink: fix rtnl_fdb_dump() for ndmsg header · bd961c9b
      Mauricio Faria de Oliveira committed
      Currently, rtnl_fdb_dump() assumes the family header is 'struct ifinfomsg',
      which is not always true -- 'struct ndmsg' is used by iproute2 ('ip neigh').
      
      The problem is, the function bails out early if nlmsg_parse() fails, which
      does occur for iproute2 usage of 'struct ndmsg' because the payload length
      is shorter than the family header alone (as 'struct ifinfomsg' is assumed).
      
      This breaks backward compatibility with userspace -- nothing is sent back.
      
      Some examples with iproute2 and netlink library for go [1]:
      
       1) $ bridge fdb show
          33:33:00:00:00:01 dev ens3 self permanent
          01:00:5e:00:00:01 dev ens3 self permanent
          33:33:ff:15:98:30 dev ens3 self permanent
      
            This one works, as it uses 'struct ifinfomsg'.
      
            fdb_show() @ iproute2/bridge/fdb.c
              """
              .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
              ...
              if (rtnl_dump_request(&rth, RTM_GETNEIGH, [...]
              """
      
       2) $ ip --family bridge neigh
          RTNETLINK answers: Invalid argument
          Dump terminated
      
            This one fails, as it uses 'struct ndmsg'.
      
            do_show_or_flush() @ iproute2/ip/ipneigh.c
              """
              .n.nlmsg_type = RTM_GETNEIGH,
              .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg)),
              """
      
       3) $ ./neighlist
          < no output >
      
            This one fails, as it is 'struct ndmsg'-based.
      
            neighList() @ netlink/neigh_linux.go
              """
              req := h.newNetlinkRequest(unix.RTM_GETNEIGH, [...]
              msg := Ndmsg{
              """
      
      The actual breakage was introduced by commit 0ff50e83 ("net: rtnetlink:
      bail out from rtnl_fdb_dump() on parse error"), because nlmsg_parse() fails
      if the payload length (with the _actual_ family header) is less than the
      family header length alone (which is assumed, in parameter 'hdrlen').
      This is true in the examples above with struct ndmsg, with size and payload
      length shorter than struct ifinfomsg.
      
      However, that commit just intends to fix something under the assumption the
      family header is indeed a 'struct ifinfomsg' - by preventing access to the
      payload as such (via 'ifm' pointer) if the payload length is not sufficient
      to actually contain it.
      
      The assumption was introduced by commit 5e6d2435 ("bridge: netlink dump
      interface at par with brctl"), to support iproute2's 'bridge fdb' command
      (not 'ip neigh') which indeed uses 'struct ifinfomsg', thus is not broken.
      
      So, in order to unbreak the 'struct ndmsg' family headers and still allow
      'struct ifinfomsg' to continue to work, check for the known message sizes
      used with 'struct ndmsg' in iproute2 (with zero or one attribute which is
      not used in this function anyway) then do not parse the data as ifinfomsg.
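
The size test described in the paragraph above can be sketched in userspace C using the UAPI headers. The helper name payload_looks_like_ndmsg() is hypothetical, and the assumption that the optional attribute carries a 32-bit payload is made for this illustration; the sketch shows only the length comparison, not the rest of the dump handler.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/netlink.h>   /* NLA_HDRLEN */
#include <linux/neighbour.h> /* struct ndmsg */
#include <linux/rtnetlink.h> /* struct ifinfomsg */

/*
 * Decide from the payload length alone whether a request carries a
 * 'struct ndmsg' header: either the bare header, or the header followed by
 * one attribute with a 32-bit payload (an assumption of this sketch).
 * Anything else would be parsed as 'struct ifinfomsg' as before.
 */
static bool payload_looks_like_ndmsg(size_t payload_len)
{
	return payload_len == sizeof(struct ndmsg) ||
	       payload_len == sizeof(struct ndmsg) + NLA_HDRLEN + sizeof(uint32_t);
}
```

The approach relies on these two lengths not colliding with a 'struct ifinfomsg' header with or without the same single attribute, which is why a plain length comparison is enough here.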
      
      Same examples with this patch applied (or with the original fix reverted):
      
          $ bridge fdb show
          33:33:00:00:00:01 dev ens3 self permanent
          01:00:5e:00:00:01 dev ens3 self permanent
          33:33:ff:15:98:30 dev ens3 self permanent
      
          $ ip --family bridge neigh
          dev ens3 lladdr 33:33:00:00:00:01 PERMANENT
          dev ens3 lladdr 01:00:5e:00:00:01 PERMANENT
          dev ens3 lladdr 33:33:ff:15:98:30 PERMANENT
      
          $ ./neighlist
          netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0x0, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
          netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x1, 0x0, 0x5e, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
          netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0xff, 0x15, 0x98, 0x30}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
      
      Tested on mainline (v4.19-rc6) and net-next (3bd09b05).
      
      References:
      
      [1] netlink library for go (test-case)
          https://github.com/vishvananda/netlink
      
          $ cat ~/go/src/neighlist/main.go
          package main
          import ("fmt"; "syscall"; "github.com/vishvananda/netlink")
          func main() {
              neighs, _ := netlink.NeighList(0, syscall.AF_BRIDGE)
              for _, neigh := range neighs { fmt.Printf("%#v\n", neigh) }
          }
      
          $ export GOPATH=~/go
          $ go get github.com/vishvananda/netlink
          $ go build neighlist
          $ ~/go/src/neighlist/neighlist
      
      Thanks to David Ahern for suggestions to improve this patch.
      
      Fixes: 0ff50e83 ("net: rtnetlink: bail out from rtnl_fdb_dump() on parse error")
      Fixes: 5e6d2435 ("bridge: netlink dump interface at par with brctl")
      Reported-by: Aidan Obley <aobley@pivotal.io>
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd961c9b
    • S
      net: bpfilter: Fix type cast and pointer warnings · 33aa8da1
      Shanthosh RK committed
      Fixes the following Sparse warnings:
      
      net/bpfilter/bpfilter_kern.c:62:21: warning: cast removes address space
      of expression
      net/bpfilter/bpfilter_kern.c:101:49: warning: Using plain integer as
      NULL pointer
      Signed-off-by: Shanthosh RK <shanthosh.rk@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33aa8da1
  7. 05 Oct, 2018 4 commits
    • D
      rxrpc: Fix the data_ready handler · 2cfa2271
      David Howells committed
      Fix the rxrpc_data_ready() function to pick up all packets and to not miss
      any.  There are two problems:
      
       (1) The sk_data_ready pointer on the UDP socket is set *after* it is
           bound.  This means that it's open for business before we're ready to
           dequeue packets, and a tiny window exists in which a packet can
           sneak onto the receive queue without us ever knowing about it.
      
           Fix this by setting the pointers on the socket prior to binding it.
      
       (2) skb_recv_udp() will return an error (such as ENETUNREACH) if there was
           an error on the transmission side, even though we set the
           sk_error_report hook.  Because rxrpc_data_ready() returns immediately
           in such a case, it never actually removes its packet from the receive
           queue.
      
           Fix this by abstracting out the UDP dequeuing and checksumming into a
           separate function that keeps hammering on skb_recv_udp() until it
           returns -EAGAIN, passing the packets extracted to the remainder of the
           function.
      
      and two potential problems:
      
       (3) It might be possible in some circumstances or in the future for
           packets to be added to the UDP receive queue whilst rxrpc is
           running and consuming them, so the data_ready() handler might get
           called less often than once per packet.
      
           Allow for this by fully draining the queue on each call, as in (2).
      
       (4) If a packet fails the checksum check, the code currently returns after
           discarding the packet without checking for more.
      
           Allow for this by fully draining the queue on each call, as in (2).
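
The "keep reading until -EAGAIN" approach has a close userspace analogue, sketched below with a non-blocking recv() loop on a datagram socket. drain_socket() and the retry bound are hypothetical names and choices for this illustration; the kernel code uses skb_recv_udp() rather than recv().

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

#define MAX_CONSECUTIVE_ERRORS 16 /* arbitrary bound for this sketch */

/*
 * Drain every datagram currently queued on fd.  An error other than
 * EAGAIN (for example a reported transmit-side failure) does not end the
 * loop, because packets may still be queued behind it.  Returns the number
 * of datagrams consumed, or -1 if errors keep repeating.
 */
static int drain_socket(int fd)
{
	char buf[2048];
	int packets = 0;
	int errors = 0;

	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);

		if (len >= 0) {
			packets++; /* a real handler would process buf here */
			errors = 0;
			continue;
		}
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return packets; /* the queue is now empty */
		if (++errors > MAX_CONSECUTIVE_ERRORS)
			return -1; /* give up instead of spinning forever */
	}
}
```

Because the loop stops only when the queue is reported empty, one wake-up handles any number of queued packets, which covers cases (2) to (4) in the same way.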
      
      Fixes: 17926a79 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      2cfa2271
    • D
      rxrpc: Fix some missed refs to init_net · 5e33a23b
      David Howells committed
      Fix some refs to init_net that should've been changed to the appropriate
      network namespace.
      
      Fixes: 2baec2c3 ("rxrpc: Support network namespacing")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      5e33a23b
    • J
      net/packet: fix packet drop as of virtio gso · 9d2f67e4
      Jianfeng Tan committed
      When a raw socket is used as the vhost backend, a packet from virtio
      carrying GSO offload information fails later validation in the xmit
      path and cannot be sent, because skb->protocol, which is used to look up
      the GSO function, was not set correctly.
      
      To fix this, set this field according to the virtio hdr information.
      
      Fixes: e858fae2 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
      Signed-off-by: Jianfeng Tan <jianfeng.tan@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d2f67e4
    • F
      openvswitch: load NAT helper · 17c357ef
      Flavio Leitner committed
      Load the respective NAT helper module if the flow uses it.
      Signed-off-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17c357ef
  8. 04 Oct, 2018 1 commit
  9. 03 Oct, 2018 1 commit