1. 23 10月, 2019 8 次提交
  2. 22 10月, 2019 1 次提交
  3. 20 10月, 2019 3 次提交
  4. 19 10月, 2019 3 次提交
  5. 18 10月, 2019 5 次提交
    • W
      ipv4: fix race condition between route lookup and invalidation · 5018c596
      Wei Wang 提交于
      Jesse and Ido reported the following race condition:
      <CPU A, t0> - Received packet A is forwarded and cached dst entry is
      taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
      
      <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
      from multiple ISPs"), route is added / deleted and rt_cache_flush() is
      called
      
      <CPU B, t2> - Received packet B tries to use the same cached dst entry
      from t0, but rt_cache_valid() is no longer true and it is replaced in
      rt_cache_route() by the newer one. This calls dst_dev_put() on the
      original dst entry which assigns the blackhole netdev to 'dst->dev'
      
      <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
      to 'dst->dev' being the blackhole netdev
      
      There are 2 issues in the v4 routing code:
      1. A per-netns counter is used to do the validation of the route. That
      means whenever a route is changed in the netns, users of all routes in
      the netns needs to redo lookup. v6 has an implementation of only
      updating fn_sernum for routes that are affected.
      2. When rt_cache_valid() returns false, rt_cache_route() is called to
      throw away the current cache, and create a new one. This seems
      unnecessary because as long as this route does not change, the route
      cache does not need to be recreated.
      
      To fully solve the above 2 issues, it probably needs quite some code
      changes and requires careful testing, and does not suite for net branch.
      
      So this patch only tries to add the deleted cached rt into the uncached
      list, so user could still be able to use it to receive packets until
      it's done.
      
      Fixes: 95c47f9c ("ipv4: call dst_dev_put() properly")
      Signed-off-by: NWei Wang <weiwan@google.com>
      Reported-by: NIdo Schimmel <idosch@idosch.org>
      Reported-by: NJesse Hathaway <jesse@mbuki-mvuki.org>
      Tested-by: NJesse Hathaway <jesse@mbuki-mvuki.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5018c596
    • S
      ipv4: Return -ENETUNREACH if we can't create route but saddr is valid · 595e0651
      Stefano Brivio 提交于
      ...instead of -EINVAL. An issue was found with older kernel versions
      while unplugging a NFS client with pending RPCs, and the wrong error
      code here prevented it from recovering once link is back up with a
      configured address.
      
      Incidentally, this is not an issue anymore since commit 4f8943f8
      ("SUNRPC: Replace direct task wakeups from softirq context"), included
      in 5.2-rc7, had the effect of decoupling the forwarding of this error
      by using SO_ERROR in xs_wake_error(), as pointed out by Benjamin
      Coddington.
      
      To the best of my knowledge, this isn't currently causing any further
      issue, but the error code doesn't look appropriate anyway, and we
      might hit this in other paths as well.
      
      In detail, as analysed by Gonzalo Siero, once the route is deleted
      because the interface is down, and can't be resolved and we return
      -EINVAL here, this ends up, courtesy of inet_sk_rebuild_header(),
      as the socket error seen by tcp_write_err(), called by
      tcp_retransmit_timer().
      
      In turn, tcp_write_err() indirectly calls xs_error_report(), which
      wakes up the RPC pending tasks with a status of -EINVAL. This is then
      seen by call_status() in the SUN RPC implementation, which aborts the
      RPC call calling rpc_exit(), instead of handling this as a
      potentially temporary condition, i.e. as a timeout.
      
      Return -EINVAL only if the input parameters passed to
      ip_route_output_key_hash_rcu() are actually invalid (this is the case
      if the specified source address is multicast, limited broadcast or
      all zeroes), but return -ENETUNREACH in all cases where, at the given
      moment, the given source address doesn't allow resolving the route.
      
      While at it, drop the initialisation of err to -ENETUNREACH, which
      was added to __ip_route_output_key() back then by commit
      0315e382 ("net: Fix behaviour of unreachable, blackhole and
      prohibit routes"), but actually had no effect, as it was, and is,
      overwritten by the fib_lookup() return code assignment, and anyway
      ignored in all other branches, including the if (fl4->saddr) one:
      I find this rather confusing, as it would look like -ENETUNREACH is
      the "default" error, while that statement has no effect.
      
      Also note that after commit fc75fc83 ("ipv4: dont create routes
      on down devices"), we would get -ENETUNREACH if the device is down,
      but -EINVAL if the source address is specified and we can't resolve
      the route, and this appears to be rather inconsistent.
      Reported-by: NStefan Walter <walteste@inf.ethz.ch>
      Analysed-by: NBenjamin Coddington <bcodding@redhat.com>
      Analysed-by: NGonzalo Siero <gsierohu@redhat.com>
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      595e0651
    • M
      net: sched: Avoid using yield() in a busy waiting loop · 4eab421b
      Marc Kleine-Budde 提交于
      With threaded interrupts enabled, the interrupt thread runs as SCHED_RR
      with priority 50. If a user application with a higher priority preempts
      the interrupt thread and tries to shutdown the network interface then it
      will loop forever. The kernel will spin in the loop waiting for the
      device to become idle and the scheduler will never consider the
      interrupt thread because its priority is lower.
      
      Avoid the problem by sleeping for a jiffy giving other tasks,
      including the interrupt thread, a chance to run and make progress.
      
      In the original thread it has been suggested to use wait_event() and
      properly waiting for the state to occur. DaveM explained that this would
      require to add expensive checks in the fast paths of packet processing.
      
      Link: https://lkml.kernel.org/r/1393976987-23555-1-git-send-email-mkl@pengutronix.deSigned-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      [bigeasy: Rewrite commit message, add comment, use
                schedule_timeout_uninterruptible()]
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4eab421b
    • Y
      net/rds: Remove unnecessary null check · ce753e66
      YueHaibing 提交于
      Null check before dma_pool_destroy is redundant, so remove it.
      This is detected by coccinelle.
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce753e66
    • Y
      pktgen: remove unnecessary assignment in pktgen_xmit() · a8c41a68
      Yunsheng Lin 提交于
      variable ret is not used after jumping to "unlock" label, so
      the assignment is redundant.
      Signed-off-by: NYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8c41a68
  6. 17 10月, 2019 2 次提交
  7. 16 10月, 2019 10 次提交
    • X
      sctp: change sctp_prot .no_autobind with true · 63dfb793
      Xin Long 提交于
      syzbot reported a memory leak:
      
        BUG: memory leak, unreferenced object 0xffff888120b3d380 (size 64):
        backtrace:
      
          [...] slab_alloc mm/slab.c:3319 [inline]
          [...] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
          [...] sctp_bucket_create net/sctp/socket.c:8523 [inline]
          [...] sctp_get_port_local+0x189/0x5a0 net/sctp/socket.c:8270
          [...] sctp_do_bind+0xcc/0x200 net/sctp/socket.c:402
          [...] sctp_bindx_add+0x4b/0xd0 net/sctp/socket.c:497
          [...] sctp_setsockopt_bindx+0x156/0x1b0 net/sctp/socket.c:1022
          [...] sctp_setsockopt net/sctp/socket.c:4641 [inline]
          [...] sctp_setsockopt+0xaea/0x2dc0 net/sctp/socket.c:4611
          [...] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3147
          [...] __sys_setsockopt+0x10f/0x220 net/socket.c:2084
          [...] __do_sys_setsockopt net/socket.c:2100 [inline]
      
      It was caused by when sending msgs without binding a port, in the path:
      inet_sendmsg() -> inet_send_prepare() -> inet_autobind() ->
      .get_port/sctp_get_port(), sp->bind_hash will be set while bp->port is
      not. Later when binding another port by sctp_setsockopt_bindx(), a new
      bucket will be created as bp->port is not set.
      
      sctp's autobind is supposed to call sctp_autobind() where it does all
      things including setting bp->port. Since sctp_autobind() is called in
      sctp_sendmsg() if the sk is not yet bound, it should have skipped the
      auto bind.
      
      THis patch is to avoid calling inet_autobind() in inet_send_prepare()
      by changing sctp_prot .no_autobind with true, also remove the unused
      .get_port.
      
      Reported-by: syzbot+d44f7bbebdea49dbc84a@syzkaller.appspotmail.com
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63dfb793
    • V
      sched: etf: Fix ordering of packets with same txtime · 28aa7c86
      Vinicius Costa Gomes 提交于
      When a application sends many packets with the same txtime, they may
      be transmitted out of order (different from the order in which they
      were enqueued).
      
      This happens because when inserting elements into the tree, when the
      txtime of two packets are the same, the new packet is inserted at the
      left side of the tree, causing the reordering. The only effect of this
      change should be that packets with the same txtime will be transmitted
      in the order they are enqueued.
      
      The application in question (the AVTP GStreamer plugin, still in
      development) is sending video traffic, in which each video frame have
      a single presentation time, the problem is that when packetizing,
      multiple packets end up with the same txtime.
      
      The receiving side was rejecting packets because they were being
      received out of order.
      
      Fixes: 25db26a9 ("net/sched: Introduce the ETF Qdisc")
      Reported-by: NEderson de Souza <ederson.desouza@intel.com>
      Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28aa7c86
    • E
      net: avoid potential infinite loop in tc_ctl_action() · 39f13ea2
      Eric Dumazet 提交于
      tc_ctl_action() has the ability to loop forever if tcf_action_add()
      returns -EAGAIN.
      
      This special case has been done in case a module needed to be loaded,
      but it turns out that tcf_add_notify() could also return -EAGAIN
      if the socket sk_rcvbuf limit is hit.
      
      We need to separate the two cases, and only loop for the module
      loading case.
      
      While we are at it, add a limit of 10 attempts since unbounded
      loops are always scary.
      
      syzbot repro was something like :
      
      socket(PF_NETLINK, SOCK_RAW|SOCK_NONBLOCK, NETLINK_ROUTE) = 3
      write(3, ..., 38) = 38
      setsockopt(3, SOL_SOCKET, SO_RCVBUF, [0], 4) = 0
      sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{..., 388}], msg_controllen=0, msg_flags=0x10}, ...)
      
      NMI backtrace for cpu 0
      CPU: 0 PID: 1054 Comm: khungtaskd Not tainted 5.4.0-rc1+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       nmi_cpu_backtrace.cold+0x70/0xb2 lib/nmi_backtrace.c:101
       nmi_trigger_cpumask_backtrace+0x23b/0x28b lib/nmi_backtrace.c:62
       arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
       trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
       check_hung_uninterruptible_tasks kernel/hung_task.c:205 [inline]
       watchdog+0x9d0/0xef0 kernel/hung_task.c:289
       kthread+0x361/0x430 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      Sending NMI from CPU 0 to CPUs 1:
      NMI backtrace for cpu 1
      CPU: 1 PID: 8859 Comm: syz-executor910 Not tainted 5.4.0-rc1+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:arch_local_save_flags arch/x86/include/asm/paravirt.h:751 [inline]
      RIP: 0010:lockdep_hardirqs_off+0x1df/0x2e0 kernel/locking/lockdep.c:3453
      Code: 5c 08 00 00 5b 41 5c 41 5d 5d c3 48 c7 c0 58 1d f3 88 48 ba 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 10 00 0f 85 d3 00 00 00 <48> 83 3d 21 9e 99 07 00 0f 84 b9 00 00 00 9c 58 0f 1f 44 00 00 f6
      RSP: 0018:ffff8880a6f3f1b8 EFLAGS: 00000046
      RAX: 1ffffffff11e63ab RBX: ffff88808c9c6080 RCX: 0000000000000000
      RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffff88808c9c6914
      RBP: ffff8880a6f3f1d0 R08: ffff88808c9c6080 R09: fffffbfff16be5d1
      R10: fffffbfff16be5d0 R11: 0000000000000003 R12: ffffffff8746591f
      R13: ffff88808c9c6080 R14: ffffffff8746591f R15: 0000000000000003
      FS:  00000000011e4880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffffffff600400 CR3: 00000000a8920000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       trace_hardirqs_off+0x62/0x240 kernel/trace/trace_preemptirq.c:45
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
       _raw_spin_lock_irqsave+0x6f/0xcd kernel/locking/spinlock.c:159
       __wake_up_common_lock+0xc8/0x150 kernel/sched/wait.c:122
       __wake_up+0xe/0x10 kernel/sched/wait.c:142
       netlink_unlock_table net/netlink/af_netlink.c:466 [inline]
       netlink_unlock_table net/netlink/af_netlink.c:463 [inline]
       netlink_broadcast_filtered+0x705/0xb80 net/netlink/af_netlink.c:1514
       netlink_broadcast+0x3a/0x50 net/netlink/af_netlink.c:1534
       rtnetlink_send+0xdd/0x110 net/core/rtnetlink.c:714
       tcf_add_notify net/sched/act_api.c:1343 [inline]
       tcf_action_add+0x243/0x370 net/sched/act_api.c:1362
       tc_ctl_action+0x3b5/0x4bc net/sched/act_api.c:1410
       rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5386
       netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5404
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x8a5/0xd60 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:657
       ___sys_sendmsg+0x803/0x920 net/socket.c:2311
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2356
       __do_sys_sendmsg net/socket.c:2365 [inline]
       __se_sys_sendmsg net/socket.c:2363 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2363
       do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x440939
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: syzbot+cf0adbb9c28c8866c788@syzkaller.appspotmail.com
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      39f13ea2
    • E
      net_sched: sch_fq: remove one obsolete check in fq_dequeue() · e9c43add
      Eric Dumazet 提交于
      After commit eeb84aa0 ("net_sched: sch_fq: do not assume EDT
      packets are ordered"), all skbs get a non zero time_to_send
      in flow_queue_add()
      
      This means @time_next_packet variable in fq_dequeue()
      can no longer be zero.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e9c43add
    • E
      tcp: fix a possible lockdep splat in tcp_done() · cab209e5
      Eric Dumazet 提交于
      syzbot found that if __inet_inherit_port() returns an error,
      we call tcp_done() after inet_csk_prepare_forced_close(),
      meaning the socket lock is no longer held.
      
      We might fix this in a different way in net-next, but
      for 5.4 it seems safer to relax the lockdep check.
      
      Fixes: d983ea6f ("tcp: add rcu protection around tp->fastopen_rsk")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cab209e5
    • A
      net: core: use listified Rx for GRO_NORMAL in napi_gro_receive() · 6570bc79
      Alexander Lobakin 提交于
      Commit 323ebb61 ("net: use listified RX for handling GRO_NORMAL
      skbs") made use of listified skb processing for the users of
      napi_gro_frags().
      The same technique can be used in a way more common napi_gro_receive()
      to speed up non-merged (GRO_NORMAL) skbs for a wide range of drivers
      including gro_cells and mac80211 users.
      This slightly changes the return value in cases where skb is being
      dropped by the core stack, but it seems to have no impact on related
      drivers' functionality.
      gro_normal_batch is left untouched as it's very individual for every
      single system configuration and might be tuned in manual order to
      achieve an optimal performance.
      Signed-off-by: NAlexander Lobakin <alobakin@dlink.ru>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6570bc79
    • H
      hv_sock: use HV_HYP_PAGE_SIZE for Hyper-V communication · 77ffe333
      Himadri Pandya 提交于
      Current code assumes PAGE_SIZE (the guest page size) is equal
      to the page size used to communicate with Hyper-V (which is
      always 4K). While this assumption is true on x86, it may not
      be true for Hyper-V on other architectures. For example,
      Linux on ARM64 may have PAGE_SIZE of 16K or 64K. A new symbol,
      HV_HYP_PAGE_SIZE, has been previously introduced to use when
      the Hyper-V page size is intended instead of the guest page size.
      
      Make this code work on non-x86 architectures by using the new
      HV_HYP_PAGE_SIZE symbol instead of PAGE_SIZE, where appropriate.
      Also replace the now redundant PAGE_SIZE_4K with HV_HYP_PAGE_SIZE.
      The change has no effect on x86, but lays the groundwork to run
      on ARM64 and others.
      Signed-off-by: NHimadri Pandya <himadrispandya@gmail.com>
      Reviewed-by: NMichael Kelley <mikelley@microsoft.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77ffe333
    • D
      net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions · fa4e0f88
      Davide Caratti 提交于
      the following script:
      
       # tc qdisc add dev eth0 clsact
       # tc filter add dev eth0 egress protocol ip matchall \
       > action mpls push protocol mpls_uc label 0x355aa bos 1
      
      causes corruption of all IP packets transmitted by eth0. On TC egress, we
      can't rely on the value of skb->mac_len, because it's 0 and a MPLS 'push'
      operation will result in an overwrite of the first 4 octets in the packet
      L2 header (e.g. the Destination Address if eth0 is an Ethernet); the same
      error pattern is present also in the MPLS 'pop' operation. Fix this error
      in act_mpls data plane, computing 'mac_len' as the difference between the
      network header and the mac header (when not at TC ingress), and use it in
      MPLS 'push'/'pop' core functions.
      
      v2: unbreak 'make htmldocs' because of missing documentation of 'mac_len'
          in skb_mpls_pop(), reported by kbuild test robot
      
      CC: Lorenzo Bianconi <lorenzo@kernel.org>
      Fixes: 2a2ea508 ("net: sched: add mpls manipulation actions to TC")
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Acked-by: NJohn Hurley <john.hurley@netronome.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa4e0f88
    • D
      net: avoid errors when trying to pop MLPS header on non-MPLS packets · dedc5a08
      Davide Caratti 提交于
      the following script:
      
       # tc qdisc add dev eth0 clsact
       # tc filter add dev eth0 egress matchall action mpls pop
      
      implicitly makes the kernel drop all packets transmitted by eth0, if they
      don't have a MPLS header. This behavior is uncommon: other encapsulations
      (like VLAN) just let the packet pass unmodified. Since the result of MPLS
      'pop' operation would be the same regardless of the presence / absence of
      MPLS header(s) in the original packet, we can let skb_mpls_pop() return 0
      when dealing with non-MPLS packets.
      
      For the OVS use-case, this is acceptable because __ovs_nla_copy_actions()
      already ensures that MPLS 'pop' operation only occurs with packets having
      an MPLS Ethernet type (and there are no other callers in current code, so
      the semantic change should be ok).
      
      v2: better documentation of use-cases for skb_mpls_pop(), thanks to Simon
          Horman
      
      Fixes: 2a2ea508 ("net: sched: add mpls manipulation actions to TC")
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Acked-by: NJohn Hurley <john.hurley@netronome.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dedc5a08
    • M
      blackhole_netdev: fix syzkaller reported issue · b0818f80
      Mahesh Bandewar 提交于
      While invalidating the dst, we assign backhole_netdev instead of
      loopback device. However, this device does not have idev pointer
      and hence no ip6_ptr even if IPv6 is enabled. Possibly this has
      triggered the syzbot reported crash.
      
      The syzbot report does not have reproducer, however, this is the
      only device that doesn't have matching idev created.
      
      Crash instruction is :
      
      static inline bool ip6_ignore_linkdown(const struct net_device *dev)
      {
              const struct inet6_dev *idev = __in6_dev_get(dev);
      
              return !!idev->cnf.ignore_routes_with_linkdown; <= crash
      }
      
      Also ipv6 always assumes presence of idev and never checks for it
      being NULL (as does the above referenced code). So adding a idev
      for the blackhole_netdev to avoid this class of crashes in the future.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0818f80
  8. 14 10月, 2019 8 次提交