1. 13 Nov, 2017 13 commits
  2. 11 Nov, 2017 19 commits
  3. 10 Nov, 2017 8 commits
    • tcp: fix tcp_fastretrans_alert warning · 0eb96bf7
      Yuchung Cheng committed
      This patch fixes the cause of a WARNING indicating that TCP has a pending
      retransmission in Open state in tcp_fastretrans_alert().
      
      The root cause is a bad interaction between path mtu probing,
      if enabled, and RACK loss detection. Upon receiving a SACK
      above the sequence of the MTU probing packet, RACK could mark the
      probe packet lost in tcp_fastretrans_alert(), prior to calling
      tcp_simple_retransmit().
      
      tcp_simple_retransmit() only enters Loss state if it newly marks
      the probe packet lost. If the probe packet is already identified as
      lost by RACK, the sender remains in Open state with some packets
      marked lost and retransmitted. Then the next SACK would trigger
      the warning. The likely scenario is that the probe packet was
      lost due to its size or network congestion. The actual impact behind
      this warning is small: fast recovery is potentially entered an
      ACK later.
      
      The simple fix is to always enter recovery (Loss) state if some
      packet is marked lost during path MTU probing.
      
      Fixes: a0370b3f ("tcp: enable RACK loss detection to trigger recovery")
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Reported-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
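
The state decision described in this commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `toy_sock`, `toy_ca_state` and both `toy_simple_retransmit_*` functions are hypothetical stand-ins, not the kernel's data structures or code. It contrasts "enter Loss only if the probe was newly marked lost" with "enter Loss whenever some packet is marked lost".

```c
#include <stdbool.h>

/* Toy congestion states; names are illustrative, not the kernel's enum. */
enum toy_ca_state { TOY_CA_OPEN, TOY_CA_LOSS };

struct toy_sock {
    enum toy_ca_state state;
    unsigned int lost_out;          /* packets currently marked lost */
};

/* Mark the MTU probe lost unless RACK has already marked it. */
static void toy_mark_probe_lost(struct toy_sock *sk, bool rack_marked)
{
    if (!rack_marked)
        sk->lost_out++;
}

/* Behaviour described as buggy: Loss is entered only on a NEW mark. */
static void toy_simple_retransmit_old(struct toy_sock *sk, bool rack_marked)
{
    unsigned int prior_lost = sk->lost_out;

    toy_mark_probe_lost(sk, rack_marked);
    if (prior_lost == sk->lost_out)
        return;                     /* stays in Open with lost packets */
    sk->state = TOY_CA_LOSS;
}

/* Behaviour described as the fix: Loss is entered if anything is lost. */
static void toy_simple_retransmit_new(struct toy_sock *sk, bool rack_marked)
{
    toy_mark_probe_lost(sk, rack_marked);
    if (sk->lost_out == 0)
        return;
    sk->state = TOY_CA_LOSS;
}
```

In this model, a socket whose probe RACK has already marked lost stays in Open under the old check and moves to Loss under the new one.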
    • tcp: gso: avoid refcount_t warning from tcp_gso_segment() · 7ec318fe
      Eric Dumazet committed
      When a GSO skb of truesize O is segmented into 2 new skbs of truesize N1
      and N2, we want to transfer socket ownership to the new fresh skbs.
      
      In order to avoid expensive atomic operations on a cache line subject to
      cache bouncing, we replace the sequence :
      
      refcount_add(N1, &sk->sk_wmem_alloc);
      refcount_add(N2, &sk->sk_wmem_alloc); // repeated by number of segments
      
      refcount_sub(O, &sk->sk_wmem_alloc);
      
      by a single
      
      refcount_add(sum_of(N) - O, &sk->sk_wmem_alloc);
      
      The problem is:
      
      In some pathological cases, sum(N) - O might be a negative number, and
      syzkaller bot was apparently able to trigger this trace [1].
      
      atomic_t was ok with this construct, but we need to take care of the
      negative delta with refcount_t.
      
      [1]
      refcount_t: saturated; leaking memory.
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 8404 at lib/refcount.c:77 refcount_add_not_zero+0x198/0x200 lib/refcount.c:77
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 0 PID: 8404 Comm: syz-executor2 Not tainted 4.14.0-rc5-mm1+ #20
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x41c kernel/panic.c:183
       __warn+0x1c4/0x1e0 kernel/panic.c:546
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177
       do_trap_no_signal arch/x86/kernel/traps.c:211 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:260
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:refcount_add_not_zero+0x198/0x200 lib/refcount.c:77
      RSP: 0018:ffff8801c606e3a0 EFLAGS: 00010282
      RAX: 0000000000000026 RBX: 0000000000001401 RCX: 0000000000000000
      RDX: 0000000000000026 RSI: ffffc900036fc000 RDI: ffffed0038c0dc68
      RBP: ffff8801c606e430 R08: 0000000000000001 R09: 0000000000000000
      R10: ffff8801d97f5eba R11: 0000000000000000 R12: ffff8801d5acf73c
      R13: 1ffff10038c0dc75 R14: 00000000ffffffff R15: 00000000fffff72f
       refcount_add+0x1b/0x60 lib/refcount.c:101
       tcp_gso_segment+0x10d0/0x16b0 net/ipv4/tcp_offload.c:155
       tcp4_gso_segment+0xd4/0x310 net/ipv4/tcp_offload.c:51
       inet_gso_segment+0x60c/0x11c0 net/ipv4/af_inet.c:1271
       skb_mac_gso_segment+0x33f/0x660 net/core/dev.c:2749
       __skb_gso_segment+0x35f/0x7f0 net/core/dev.c:2821
       skb_gso_segment include/linux/netdevice.h:3971 [inline]
       validate_xmit_skb+0x4ba/0xb20 net/core/dev.c:3074
       __dev_queue_xmit+0xe49/0x2070 net/core/dev.c:3497
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3538
       neigh_hh_output include/net/neighbour.h:471 [inline]
       neigh_output include/net/neighbour.h:479 [inline]
       ip_finish_output2+0xece/0x1460 net/ipv4/ip_output.c:229
       ip_finish_output+0x85e/0xd10 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:238 [inline]
       ip_output+0x1cc/0x860 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:459 [inline]
       ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
       ip_queue_xmit+0x8c6/0x18e0 net/ipv4/ip_output.c:504
       tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1137
       tcp_write_xmit+0x663/0x4de0 net/ipv4/tcp_output.c:2341
       __tcp_push_pending_frames+0xa0/0x250 net/ipv4/tcp_output.c:2513
       tcp_push_pending_frames include/net/tcp.h:1722 [inline]
       tcp_data_snd_check net/ipv4/tcp_input.c:5050 [inline]
       tcp_rcv_established+0x8c7/0x18a0 net/ipv4/tcp_input.c:5497
       tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1460
       sk_backlog_rcv include/net/sock.h:909 [inline]
       __release_sock+0x124/0x360 net/core/sock.c:2264
       release_sock+0xa4/0x2a0 net/core/sock.c:2776
       tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
       inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
       sock_sendmsg_nosec net/socket.c:632 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:642
       ___sys_sendmsg+0x31c/0x890 net/socket.c:2048
       __sys_sendmmsg+0x1e6/0x5f0 net/socket.c:2138
      
      Fixes: 14afee4b ("net: convert sock.sk_wmem_alloc from atomic_t to refcount_t")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
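
The negative-delta problem described above can be sketched in user space with C11 atomics. The `toy_refcount` type and its helpers are hypothetical stand-ins for `refcount_t`, assumed here to accept only non-negative amounts. The idea is to keep a single atomic operation per segmentation but pick add or sub from the sign of `sum(N) - O`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for refcount_t: an unsigned counter whose
 * helpers take a non-negative amount. */
typedef struct {
    atomic_uint refs;
} toy_refcount;

static void toy_refcount_add(unsigned int n, toy_refcount *r)
{
    atomic_fetch_add(&r->refs, n);
}

/* Subtract n and report whether the counter reached zero. */
static bool toy_refcount_sub_and_test(unsigned int n, toy_refcount *r)
{
    return atomic_fetch_sub(&r->refs, n) == n;
}

/* Apply delta = sum(N) - O with one atomic operation.
 * Returns true if the counter dropped to zero, which the caller
 * would treat as unexpected while the socket still owns skbs. */
static bool toy_apply_delta(int delta, toy_refcount *r)
{
    if (delta >= 0) {
        toy_refcount_add((unsigned int)delta, r);
        return false;
    }
    /* 0u - (unsigned)delta is the magnitude of a negative delta and
     * is well defined even for the most negative int. */
    return toy_refcount_sub_and_test(0u - (unsigned int)delta, r);
}
```

Passing a negative delta straight to an unsigned add would instead wrap to a huge value, which is the kind of input that makes a saturating counter report a leak.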
    • act_vlan: VLAN action rewrite to use RCU lock/unlock and update · 4c5b9d96
      Manish Kurup committed
      Using a spinlock in the VLAN action causes performance issues when the VLAN
      action is used on multiple cores. Rewrote the VLAN action to use RCU read
      locking for reads and updates instead.
      All functions now use an RCU dereferenced pointer to access the VLAN action
      context. Modified helper functions used by other modules to use RCU
      instead of directly accessing the structure.
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Manish Kurup <manish.kurup@verizon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • act_vlan: Change stats update to use per-core stats · e0496cbb
      Manish Kurup committed
      The VLAN action maintains one set of stats across all cores, and uses a
      spinlock to synchronize updates to it from those cores. Changed this to
      use a per-CPU stats context instead.
      This avoids lock contention on the stats update path and should result
      in better performance.
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Manish Kurup <manish.kurup@verizon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
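
The per-CPU stats idea can be sketched as one counter slot per CPU that is summed only when the totals are read. All names below (`toy_percpu_stats`, `TOY_NR_CPUS`, the helpers) are hypothetical, and the fixed-size array is an assumption for illustration; it does not show the kernel's per-CPU allocation API.

```c
#include <stdint.h>

#define TOY_NR_CPUS 4   /* hypothetical fixed CPU count for the sketch */

struct toy_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* One slot per CPU: each CPU only ever writes its own slot, so the
 * update path needs no shared lock. */
struct toy_percpu_stats {
    struct toy_stats cpu[TOY_NR_CPUS];
};

/* Fast path: account one packet of `len` bytes on the given CPU. */
static void toy_stats_update(struct toy_percpu_stats *s,
                             unsigned int cpu, uint64_t len)
{
    s->cpu[cpu].packets += 1;
    s->cpu[cpu].bytes += len;
}

/* Slow path: fold all per-CPU slots into a single total. */
static struct toy_stats toy_stats_read(const struct toy_percpu_stats *s)
{
    struct toy_stats total = { 0, 0 };

    for (unsigned int i = 0; i < TOY_NR_CPUS; i++) {
        total.packets += s->cpu[i].packets;
        total.bytes += s->cpu[i].bytes;
    }
    return total;
}
```

The trade-off is that reads become more expensive, since they touch every slot, in exchange for contention-free updates on the packet path.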
    • rds: ib: Fix NULL pointer dereference in debug code · 1cb483a5
      Håkon Bugge committed
      rds_ib_recv_refill() is a function that refills an IB receive
      queue. It can be called from both the CQE handler (tasklet) and a
      worker thread.
      
      Just after the call to ib_post_recv(), a debug message is printed with
      rdsdebug():
      
                  ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
                  rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
                           recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
                           (long) ib_sg_dma_address(
                                  ic->i_cm_id->device,
                                  &recv->r_frag->f_sg),
                          ret);
      
      Now consider an invocation of rds_ib_recv_refill() from the worker
      thread, which is preemptible. Further, assume that the worker thread
      is preempted between the ib_post_recv() and rdsdebug() statements.
      
      Then, if the preemption is due to a receive CQE event, the
      rds_ib_recv_cqe_handler() will be invoked. This function processes
      receive completions, including freeing up data structures, such as the
      recv->r_frag.
      
      In this scenario, rds_ib_recv_cqe_handler() will process the receive
      WR posted above. This implies that recv->r_frag has been freed
      before the above rdsdebug() statement has been executed. When it is
      later executed, we will have a NULL pointer dereference:
      
      [ 4088.068008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
      [ 4088.076754] IP: rds_ib_recv_refill+0x87/0x620 [rds_rdma]
      [ 4088.082686] PGD 0 P4D 0
      [ 4088.085515] Oops: 0000 [#1] SMP
      [ 4088.089015] Modules linked in: rds_rdma(OE) rds(OE) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) mlx4_ib(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) ib_cm(E) iw_cm(E) ib_core(E) binfmt_misc(E) sb_edac(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) aesni_intel(E) crypto_simd(E) iTCO_wdt(E) glue_helper(E) iTCO_vendor_support(E) sg(E) cryptd(E) pcspkr(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) shpchp(E) ioatdma(E) i2c_i801(E) wmi(E) lpc_ich(E) mei_me(E) mei(E) mfd_core(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ip_tables(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E)
      [ 4088.168486]  fb_sys_fops(E) ahci(E) ixgbe(E) libahci(E) ttm(E) mdio(E) ptp(E) pps_core(E) drm(E) sd_mod(E) libata(E) crc32c_intel(E) mlx4_core(E) i2c_core(E) dca(E) megaraid_sas(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: rds]
      [ 4088.193442] CPU: 20 PID: 1244 Comm: kworker/20:2 Tainted: G           OE   4.14.0-rc7.master.20171105.ol7.x86_64 #1
      [ 4088.205097] Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017
      [ 4088.216074] Workqueue: ib_cm cm_work_handler [ib_cm]
      [ 4088.221614] task: ffff885fa11d0000 task.stack: ffffc9000e598000
      [ 4088.228224] RIP: 0010:rds_ib_recv_refill+0x87/0x620 [rds_rdma]
      [ 4088.234736] RSP: 0018:ffffc9000e59bb68 EFLAGS: 00010286
      [ 4088.240568] RAX: 0000000000000000 RBX: ffffc9002115d050 RCX: ffffc9002115d050
      [ 4088.248535] RDX: ffffffffa0521380 RSI: ffffffffa0522158 RDI: ffffffffa0525580
      [ 4088.256498] RBP: ffffc9000e59bbf8 R08: 0000000000000005 R09: 0000000000000000
      [ 4088.264465] R10: 0000000000000339 R11: 0000000000000001 R12: 0000000000000000
      [ 4088.272433] R13: ffff885f8c9d8000 R14: ffffffff81a0a060 R15: ffff884676268000
      [ 4088.280397] FS:  0000000000000000(0000) GS:ffff885fbec80000(0000) knlGS:0000000000000000
      [ 4088.289434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4088.295846] CR2: 0000000000000020 CR3: 0000000001e09005 CR4: 00000000001606e0
      [ 4088.303816] Call Trace:
      [ 4088.306557]  rds_ib_cm_connect_complete+0xe0/0x220 [rds_rdma]
      [ 4088.312982]  ? __dynamic_pr_debug+0x8c/0xb0
      [ 4088.317664]  ? __queue_work+0x142/0x3c0
      [ 4088.321944]  rds_rdma_cm_event_handler+0x19e/0x250 [rds_rdma]
      [ 4088.328370]  cma_ib_handler+0xcd/0x280 [rdma_cm]
      [ 4088.333522]  cm_process_work+0x25/0x120 [ib_cm]
      [ 4088.338580]  cm_work_handler+0xd6b/0x17aa [ib_cm]
      [ 4088.343832]  process_one_work+0x149/0x360
      [ 4088.348307]  worker_thread+0x4d/0x3e0
      [ 4088.352397]  kthread+0x109/0x140
      [ 4088.355996]  ? rescuer_thread+0x380/0x380
      [ 4088.360467]  ? kthread_park+0x60/0x60
      [ 4088.364563]  ret_from_fork+0x25/0x30
      [ 4088.368548] Code: 48 89 45 90 48 89 45 98 eb 4d 0f 1f 44 00 00 48 8b 43 08 48 89 d9 48 c7 c2 80 13 52 a0 48 c7 c6 58 21 52 a0 48 c7 c7 80 55 52 a0 <4c> 8b 48 20 44 89 64 24 08 48 8b 40 30 49 83 e1 fc 48 89 04 24
      [ 4088.389612] RIP: rds_ib_recv_refill+0x87/0x620 [rds_rdma] RSP: ffffc9000e59bb68
      [ 4088.397772] CR2: 0000000000000020
      [ 4088.401505] ---[ end trace fe922e6ccf004431 ]---
      
      This bug was provoked by compiling rds out-of-tree with
      EXTRA_CFLAGS="-DRDS_DEBUG -DDEBUG" and inserting an artificial delay
      between the ib_post_recv() and rdsdebug() statements:
      
         	       /* XXX when can this fail? */
      	       ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
      +		if (can_wait)
      +			usleep_range(1000, 5000);
      	       rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
      			recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
      			(long) ib_sg_dma_address(
      
      The fix is simply to move the rdsdebug() statement up before the
      ib_post_recv() and remove the printing of ret, which is taken care of
      anyway by the non-debug code.
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Reviewed-by: Knut Omang <knut.omang@oracle.com>
      Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
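
The ordering rule behind this fix can be modelled in a few lines of user-space C. The types and functions here (`toy_recv`, `toy_frag`, `toy_post_recv`, `toy_refill`) are hypothetical. The sketch assumes the worst case from the description: the completion handler runs immediately after posting and frees the fragment. The debug read therefore has to happen before the post.

```c
#include <stdlib.h>

struct toy_frag {
    unsigned long dma_addr;
};

struct toy_recv {
    struct toy_frag *frag;
};

/* Models posting the work request in the worst case: the completion
 * handler runs right away and frees the fragment. */
static void toy_post_recv(struct toy_recv *recv)
{
    free(recv->frag);
    recv->frag = NULL;
}

/* Read everything the debug message needs BEFORE posting; after the
 * post, recv->frag may already be gone. Returns the value that the
 * debug message would have printed. */
static unsigned long toy_refill(struct toy_recv *recv)
{
    unsigned long logged_addr = recv->frag->dma_addr;  /* safe: pre-post */

    toy_post_recv(recv);
    /* Touching recv->frag here would dereference NULL in this model. */
    return logged_addr;
}
```

Once ownership of a buffer has been handed to an asynchronous consumer, nothing after the hand-off should touch it, including debug output.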
    • ip_gre: add the support for i/o_flags update via ioctl · a0efab67
      Xin Long committed
      As the patch 'ip_gre: add the support for i/o_flags update via netlink'
      did for netlink, we also need to do the same for updates made
      via ioctl.
      
      This patch is to update i/o_flags and call ipgre_link_update to
      recalculate these gre properties after ip_tunnel_ioctl does the
      common update.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_gre: add the support for i/o_flags update via netlink · dd9d598c
      Xin Long committed
      Now ip_gre is using ip_tunnel_changelink to update its properties, but
      ip_tunnel_changelink in ip_tunnel doesn't update i/o_flags as a common
      function.
      
      Updating o_flags requires tunnel->tun_hlen / hlen and dev->mtu /
      needed_headroom to be recalculated, and dev->(hw_)features to be
      updated as well.
      
      Therefore, we can't just add the update into ip_tunnel_update called
      in ip_tunnel_changelink, and it's also better not to touch the ip_tunnel
      code.
      
      This patch updates i/o_flags and calls ipgre_link_update to recalculate
      these gre properties after ip_tunnel_changelink does the common update.
      
      Note that since ipgre_link_update doesn't know the lower dev, it will
      update gre->hlen, dev->mtu and dev->needed_headroom with the value of
      'new tun_hlen - old tun_hlen'. In this way, we can avoid much redundant
      code, unlike ip6_gre.
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
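
The "new tun_hlen - old tun_hlen" adjustment can be illustrated with a small model. Everything below is a sketch under assumptions: the `toy_*` names and flag bits are hypothetical, the header size uses the GRE layout of a 4-byte base header plus 4 bytes for each of the optional checksum, key and sequence fields, and the MTU is assumed to shrink by the same amount the header grows.

```c
/* Hypothetical flag bits; the kernel uses its own TUNNEL_* constants. */
#define TOY_GRE_CSUM 0x1u
#define TOY_GRE_KEY  0x2u
#define TOY_GRE_SEQ  0x4u

struct toy_tunnel {
    unsigned int o_flags;
    int tun_hlen;           /* GRE header length */
    int hlen;               /* total encapsulation header length */
    int mtu;
    int needed_headroom;
};

/* 4-byte base GRE header plus 4 bytes per optional field. */
static int toy_gre_calc_hlen(unsigned int o_flags)
{
    int len = 4;

    if (o_flags & TOY_GRE_CSUM)
        len += 4;
    if (o_flags & TOY_GRE_KEY)
        len += 4;
    if (o_flags & TOY_GRE_SEQ)
        len += 4;
    return len;
}

/* Adjust dependent fields by the header-length delta only, so the
 * lower device never has to be looked up. */
static void toy_link_update(struct toy_tunnel *t, unsigned int new_o_flags)
{
    int old_tun_hlen = t->tun_hlen;
    int delta;

    t->o_flags = new_o_flags;
    t->tun_hlen = toy_gre_calc_hlen(new_o_flags);
    delta = t->tun_hlen - old_tun_hlen;

    t->hlen += delta;
    t->needed_headroom += delta;
    t->mtu -= delta;        /* assumption: larger header, smaller MTU */
}
```

For example, a tunnel that starts with no optional fields (`tun_hlen` 4, `hlen` 24, `mtu` 1476) and gains a key ends up with `tun_hlen` 8, `hlen` 28 and `mtu` 1472 in this model.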
    • tcp: Namespace-ify sysctl_tcp_rmem and sysctl_tcp_wmem · 356d1833
      Eric Dumazet committed
      Note that when a new netns is created, it inherits its
      sysctl_tcp_rmem and sysctl_tcp_wmem from the initial netns.
      
      This change is needed so that we can refine TCP rcvbuf autotuning,
      to take RTT into consideration.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
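
The inheritance rule mentioned above amounts to copying the initial namespace's values once, at creation time, after which each namespace owns its own copy. A minimal sketch, assuming hypothetical `toy_netns` and `toy_netns_init` names and treating each sysctl as a three-element array (min, default, max):

```c
#include <string.h>

/* Hypothetical per-namespace TCP buffer settings: min, default, max. */
struct toy_netns {
    int tcp_rmem[3];
    int tcp_wmem[3];
};

/* A new namespace starts from a snapshot of the initial namespace's
 * values; later changes on either side do not affect the other. */
static void toy_netns_init(struct toy_netns *net,
                           const struct toy_netns *init_net)
{
    memcpy(net->tcp_rmem, init_net->tcp_rmem, sizeof(net->tcp_rmem));
    memcpy(net->tcp_wmem, init_net->tcp_wmem, sizeof(net->tcp_wmem));
}
```

Because the copy is a snapshot, tuning the values inside one namespace leaves the initial namespace untouched.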