1. 18 Dec 2018 (1 commit)
  2. 16 Dec 2018 (2 commits)
    • net: clear skb->tstamp in forwarding paths · 8203e2d8
      Committed by Eric Dumazet
      Sergey reported that forwarding was no longer working
      if the fq packet scheduler was used.

      This is caused by the recent switch to the EDT model, since incoming
      packets might have been timestamped by __net_timestamp().

      __net_timestamp() uses ktime_get_real(), while fq expects packets
      to carry CLOCK_MONOTONIC timestamps.
      
      The fix is to clear skb->tstamp in forwarding paths.
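
      A minimal sketch of the change described (the exact placement inside
      ip_forward()/ip6_forward() is an assumption here, not a quote of the
      patch):

         /* skb->tstamp may hold a CLOCK_REALTIME stamp written by
          * __net_timestamp() on receive; fq on the egress path interprets
          * skb->tstamp as a CLOCK_MONOTONIC departure time, so clear it
          * before the forwarded skb reaches the qdisc.
          */
         skb->tstamp = 0;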
      
      Fixes: 80b14dee ("net: Add a new socket option for a future transmit time.")
      Fixes: fb420d5d ("tcp/fq: move back to CLOCK_MONOTONIC")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Sergey Matyukevich <geomatsi@gmail.com>
      Tested-by: Sergey Matyukevich <geomatsi@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: do not handle duplicate fragments as overlapping · ade44640
      Committed by Michal Kubecek
      Since commit 7969e5c4 ("ip: discard IPv4 datagrams with overlapping
      segments.") the IPv4 reassembly code drops the whole queue whenever an
      overlapping fragment is received. However, the test is written in a way
      which detects duplicate fragments as overlapping, so that in environments
      with many duplicate packets, fragmented packets may be undeliverable.
      
      Add an extra test and, for a (potentially) duplicate fragment, drop only
      the new fragment rather than the whole queue. Only the starting offset
      and length are checked, not the contents of the fragments, as that would
      be too expensive. For the same reason, the linear list ("run") of an
      rbtree node is not iterated; we only check whether the new fragment is a
      subset of the interval covered by existing consecutive fragments.
      
      v2: instead of an exact check iterating through the linear list of an
      rbtree node, only check whether the new fragment is a subset of the
      "run" (suggested by Eric Dumazet)
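
      A sketch of the subset test described above (helper name and run bounds
      are illustrative, not the patch's code):

         /* A fragment covering [offset, end) is a harmless duplicate when it
          * lies entirely inside a run [run_start, run_end) already stored in
          * the rbtree; in that case only the new skb is dropped.
          */
         static bool frag_is_dup_of_run(u32 run_start, u32 run_end,
                                        u32 offset, u32 end)
         {
                 return offset >= run_start && end <= run_end;
         }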
      
      Fixes: 7969e5c4 ("ip: discard IPv4 datagrams with overlapping segments.")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 15 Dec 2018 (1 commit)
    • net: Allow class-e address assignment via ifconfig ioctl · 65cab850
      Committed by Dave Taht
      While most distributions long ago switched to the iproute2 suite
      of utilities, which allow class-e (240.0.0.0/4) address assignment,
      distributions relying on busybox, toybox and other forms of
      ifconfig cannot assign class-e addresses without this kernel patch.
      
      While classful addressing has been obsolete for two decades (CIDR
      replaced it), and a survey of all the open source code in the world
      shows the IN_* macros are also obsolete... rather than removing the
      classful checks from this ioctl entirely, this patch merely enables
      class-e assignment, sanely.
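
      For reference, a small sketch of what "class-e" means here (the helper
      is illustrative, not from the patch):

         #include <stdint.h>

         /* 240.0.0.0/4: top nibble 0b1111, excluding the limited broadcast
          * address 255.255.255.255.
          */
         static int is_class_e(uint32_t addr_host_order)
         {
                 return (addr_host_order & 0xf0000000u) == 0xf0000000u &&
                        addr_host_order != 0xffffffffu;
         }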
      Signed-off-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 11 Dec 2018 (1 commit)
  5. 08 Dec 2018 (1 commit)
  6. 06 Dec 2018 (4 commits)
    • ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes · ebaf39e6
      Committed by Jiri Wiesner
      The *_frag_reasm() functions are susceptible to miscalculating the byte
      count of packet fragments in case the truesize of a head buffer changes.
      The truesize member may be changed by the call to skb_unclone(), leaving
      the fragment memory limit counter unbalanced even if all fragments are
      processed. This miscalculation goes unnoticed as long as the network
      namespace which holds the counter is not destroyed.
      
      Should an attempt be made to destroy a network namespace that holds an
      unbalanced fragment memory limit counter, the cleanup of the namespace
      never finishes. The thread handling the cleanup gets stuck in
      inet_frags_exit_net() waiting for the percpu counter to reach zero. The
      thread is usually in running state with a stacktrace similar to:
      
       PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
        #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
        #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
        #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
        #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
        #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
       #10 [ffff880621563e38] process_one_work at ffffffff81096f14
      
      It is not possible to create new network namespaces, and processes
      that call unshare() end up stuck in an uninterruptible sleep state,
      waiting to acquire the net_mutex.
      
      The bug was observed in the IPv6 netfilter code by Per Sundstrom.
      I thank him for his analysis of the problem. The parts of this patch
      that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
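
      A sketch of the rebalancing pattern applied in the *_frag_reasm() paths
      (variable names assumed for illustration):

         int delta = -head->truesize;

         if (skb_unclone(head, GFP_ATOMIC))
                 goto out_nomem;

         delta += head->truesize;        /* skb_unclone() may change truesize */
         if (delta)
                 add_frag_mem_limit(qp->q.net, delta);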
      Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
      Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
      Acked-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix NULL ref in tail loss probe · b2b7af86
      Committed by Yuchung Cheng
      The TCP loss probe timer may fire when the retransmission queue is empty
      but the tp->packets_out counter is non-zero. tcp_send_loss_probe will
      call tcp_rearm_rto, which triggers a NULL pointer dereference by fetching
      the retransmission queue head in its subroutines.
      
      Add a more detailed warning to help catch the root cause of the inflight
      accounting inconsistency.
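
      A sketch of the guard this adds to tcp_send_loss_probe() (warning text
      abbreviated; reconstructed from the description, not a verbatim hunk):

         struct sk_buff *skb = skb_rb_last(&sk->tcp_rtx_queue);

         if (unlikely(!skb)) {
                 /* Empty rtx queue with packets_out != 0 means the inflight
                  * accounting is inconsistent; warn instead of letting
                  * tcp_rearm_rto() dereference a NULL queue head.
                  */
                 WARN_ONCE(tcp_sk(sk)->packets_out,
                           "invalid inflight: %u", tcp_sk(sk)->packets_out);
                 inet_csk(sk)->icsk_pending = 0;
                 return;
         }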
      Reported-by: Rafael Tinoco <rafael.tinoco@linaro.org>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Do not underestimate rwnd_limited · 41727549
      Committed by Eric Dumazet
      If the available rwnd is too small, tcp_tso_should_defer()
      can decide it is worth waiting before splitting a TSO packet.
      
      This really means we are rwnd limited.
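
      A sketch of the accounting fix on the defer path of
      tcp_tso_should_defer() (variable names follow that function but are
      assumptions here):

         defer:
                 /* We are waiting because the receive window, not cwnd, is
                  * the constraint: record it so this time is counted as
                  * rwnd-limited by the instrumentation.
                  */
                 if (cong_win < send_win && cong_win <= skb->len)
                         *is_rwnd_limited = true;
                 return true;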
      
      Fixes: 5615f886 ("tcp: instrument how long TCP is limited by receive window")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use skb_list_del_init() to remove from RX sublists · 22f6bbb7
      Committed by Edward Cree
      list_del() leaves the skb->next pointer poisoned, which can then lead to
      a crash in e.g. OVS forwarding. For example, setting up an OVS VXLAN
      forwarding bridge on sfc as per:
      
      ========
      $ ovs-vsctl show
      5dfd9c47-f04b-4aaa-aa96-4fbb0a522a30
          Bridge "br0"
              Port "br0"
                  Interface "br0"
                      type: internal
              Port "enp6s0f0"
                  Interface "enp6s0f0"
              Port "vxlan0"
                  Interface "vxlan0"
                      type: vxlan
                      options: {key="1", local_ip="10.0.0.5", remote_ip="10.0.0.4"}
          ovs_version: "2.5.0"
      ========
      (where 10.0.0.5 is an address on enp6s0f1)
      and sending traffic across it will lead to the following panic:
      ========
      general protection fault: 0000 [#1] SMP PTI
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-rc3-ehc+ #701
      Hardware name: Dell Inc. PowerEdge R710/0M233H, BIOS 6.4.0 07/23/2013
      RIP: 0010:dev_hard_start_xmit+0x38/0x200
      Code: 53 48 89 fb 48 83 ec 20 48 85 ff 48 89 54 24 08 48 89 4c 24 18 0f 84 ab 01 00 00 48 8d 86 90 00 00 00 48 89 f5 48 89 44 24 10 <4c> 8b 33 48 c7 03 00 00 00 00 48 8b 05 c7 d1 b3 00 4d 85 f6 0f 95
      RSP: 0018:ffff888627b437e0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: dead000000000100 RCX: ffff88862279c000
      RDX: ffff888614a342c0 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888618a88000 R08: 0000000000000001 R09: 00000000000003e8
      R10: 0000000000000000 R11: ffff888614a34140 R12: 0000000000000000
      R13: 0000000000000062 R14: dead000000000100 R15: ffff888616430000
      FS:  0000000000000000(0000) GS:ffff888627b40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6d2bc6d000 CR3: 000000000200a000 CR4: 00000000000006e0
      Call Trace:
       <IRQ>
       __dev_queue_xmit+0x623/0x870
       ? masked_flow_lookup+0xf7/0x220 [openvswitch]
       ? ep_poll_callback+0x101/0x310
       do_execute_actions+0xaba/0xaf0 [openvswitch]
       ? __wake_up_common+0x8a/0x150
       ? __wake_up_common_lock+0x87/0xc0
       ? queue_userspace_packet+0x31c/0x5b0 [openvswitch]
       ovs_execute_actions+0x47/0x120 [openvswitch]
       ovs_dp_process_packet+0x7d/0x110 [openvswitch]
       ovs_vport_receive+0x6e/0xd0 [openvswitch]
       ? dst_alloc+0x64/0x90
       ? rt_dst_alloc+0x50/0xd0
       ? ip_route_input_slow+0x19a/0x9a0
       ? __udp_enqueue_schedule_skb+0x198/0x1b0
       ? __udp4_lib_rcv+0x856/0xa30
       ? __udp4_lib_rcv+0x856/0xa30
       ? cpumask_next_and+0x19/0x20
       ? find_busiest_group+0x12d/0xcd0
       netdev_frame_hook+0xce/0x150 [openvswitch]
       __netif_receive_skb_core+0x205/0xae0
       __netif_receive_skb_list_core+0x11e/0x220
       netif_receive_skb_list+0x203/0x460
       ? __efx_rx_packet+0x335/0x5e0 [sfc]
       efx_poll+0x182/0x320 [sfc]
       net_rx_action+0x294/0x3c0
       __do_softirq+0xca/0x297
       irq_exit+0xa6/0xb0
       do_IRQ+0x54/0xd0
       common_interrupt+0xf/0xf
       </IRQ>
      ========
      So, in all listified-receive handling, instead pull skbs off the lists
      with skb_list_del_init().
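
      A sketch of what such a helper looks like (close to the skbuff.h
      definition, but reproduced from memory, so treat it as a sketch):

         static inline void skb_list_del_init(struct sk_buff *skb)
         {
                 __list_del_entry(&skb->list);   /* unlink without poisoning */
                 skb->next = NULL;               /* list_del() would have left
                                                  * LIST_POISON1 here: the
                                                  * dead000000000100 visible in
                                                  * RBX/R14 above.
                                                  */
         }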
      
      Fixes: 9af86f93 ("net: core: fix use-after-free in __netif_receive_skb_list_core")
      Fixes: 7da517a3 ("net: core: Another step of skb receive list processing")
      Fixes: a4ca8b7d ("net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()")
      Fixes: d8269e2c ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 01 Dec 2018 (3 commits)
  8. 27 Nov 2018 (2 commits)
    • netfilter: nat: fix double register in masquerade modules · 095faf45
      Committed by Taehee Yoo
      There is a reference counter to ensure that masquerade modules register
      notifiers only once. However, the existing reference counter approach is
      not safe. Test commands:
      
         while :
         do
                 modprobe ip6t_MASQUERADE &
                 modprobe nft_masq_ipv6 &
                 modprobe -rv ip6t_MASQUERADE &
                 modprobe -rv nft_masq_ipv6 &
         done
      
      The numbers below represent the reference counter.
      --------------------------------------------------------
      CPU0        CPU1        CPU2        CPU3        CPU4
      [insmod]    [insmod]    [rmmod]     [rmmod]     [insmod]
      --------------------------------------------------------
      0->1
      register    1->2
                  returns     2->1
      			returns     1->0
                                                      0->1
                                                      register <--
                                          unregister
      --------------------------------------------------------
      
      The unregistration on CPU3 should be processed before the
      registration on CPU4.

      In order to fix this, use a mutex instead of the reference counter.
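
      A sketch of the mutex-guarded pattern (names are illustrative; the patch
      applies this shape to the IPv4/IPv6 masquerade notifiers):

         static DEFINE_MUTEX(masq_mutex);
         static unsigned int masq_refcnt;        /* protected by masq_mutex */

         int masq_register_notifier(void)
         {
                 int ret = 0;

                 mutex_lock(&masq_mutex);
                 /* Only the first user registers; unregister mirrors this
                  * under the same mutex, closing the race shown above.
                  */
                 if (++masq_refcnt == 1) {
                         ret = register_netdevice_notifier(&masq_dev_notifier);
                         if (ret)
                                 masq_refcnt--;
                 }
                 mutex_unlock(&masq_mutex);
                 return ret;
         }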
      
      The splat looks like:
      [  323.869557] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:1381]
      [  323.869574] Modules linked in: nf_tables(+) nf_nat_ipv6(-) nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 n]
      [  323.869574] irq event stamp: 194074
      [  323.898930] hardirqs last  enabled at (194073): [<ffffffff90004a0d>] trace_hardirqs_on_thunk+0x1a/0x1c
      [  323.898930] hardirqs last disabled at (194074): [<ffffffff90004a29>] trace_hardirqs_off_thunk+0x1a/0x1c
      [  323.898930] softirqs last  enabled at (182132): [<ffffffff922006ec>] __do_softirq+0x6ec/0xa3b
      [  323.898930] softirqs last disabled at (182109): [<ffffffff90193426>] irq_exit+0x1a6/0x1e0
      [  323.898930] CPU: 0 PID: 1381 Comm: modprobe Not tainted 4.20.0-rc2+ #27
      [  323.898930] RIP: 0010:raw_notifier_chain_register+0xea/0x240
      [  323.898930] Code: 3c 03 0f 8e f2 00 00 00 44 3b 6b 10 7f 4d 49 bc 00 00 00 00 00 fc ff df eb 22 48 8d 7b 10 488
      [  323.898930] RSP: 0018:ffff888101597218 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
      [  323.898930] RAX: 0000000000000000 RBX: ffffffffc04361c0 RCX: 0000000000000000
      [  323.898930] RDX: 1ffffffff26132ae RSI: ffffffffc04aa3c0 RDI: ffffffffc04361d0
      [  323.898930] RBP: ffffffffc04361c8 R08: 0000000000000000 R09: 0000000000000001
      [  323.898930] R10: ffff8881015972b0 R11: fffffbfff26132c4 R12: dffffc0000000000
      [  323.898930] R13: 0000000000000000 R14: 1ffff110202b2e44 R15: ffffffffc04aa3c0
      [  323.898930] FS:  00007f813ed41540(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
      [  323.898930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  323.898930] CR2: 0000559bf2c9f120 CR3: 000000010bc80000 CR4: 00000000001006f0
      [  323.898930] Call Trace:
      [  323.898930]  ? atomic_notifier_chain_register+0x2d0/0x2d0
      [  323.898930]  ? down_read+0x150/0x150
      [  323.898930]  ? sched_clock_cpu+0x126/0x170
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  323.898930]  register_netdevice_notifier+0xbb/0x790
      [  323.898930]  ? __dev_close_many+0x2d0/0x2d0
      [  323.898930]  ? __mutex_unlock_slowpath+0x17f/0x740
      [  323.898930]  ? wait_for_completion+0x710/0x710
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  323.898930]  ? up_write+0x6c/0x210
      [  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  324.127073]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
      [  324.127073]  nft_chain_filter_init+0x1e/0xe8a [nf_tables]
      [  324.127073]  nf_tables_module_init+0x37/0x92 [nf_tables]
      [ ... ]
      
      Fixes: 8dd33cc9 ("netfilter: nf_nat: generalize IPv4 masquerading support for nf_tables")
      Fixes: be6b635c ("netfilter: nf_nat: generalize IPv6 masquerading support for nf_tables")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: add missing error handling code for register functions · 584eab29
      Committed by Taehee Yoo
      register_{netdevice/inetaddr/inet6addr}_notifier may return an error
      value; this patch adds code to handle these error paths.
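
      A sketch of the unwind pattern involved (illustrative, not the patch's
      exact hunks):

         int masq_init(void)
         {
                 int err;

                 err = register_netdevice_notifier(&masq_dev_notifier);
                 if (err)
                         return err;

                 err = register_inetaddr_notifier(&masq_inet_notifier);
                 if (err)
                         goto err_unregister_netdev;

                 return 0;

         err_unregister_netdev:
                 unregister_netdevice_notifier(&masq_dev_notifier);
                 return err;
         }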
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  9. 25 Nov 2018 (2 commits)
    • net: always initialize pagedlen · aba36930
      Committed by Willem de Bruijn
      In ip packet generation, pagedlen is initialized for each skb at the
      start of the loop in __ip(6)_append_data, before the alloc_new_skb
      label.

      Depending on compiler options, code can be generated that jumps to
      this label, triggering use of an uninitialized variable.
      
      In practice, at -O2, the generated code moves the initialization below
      the label. But the code should not rely on that for correctness.
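
      The fix is to (re)initialize at the label itself; a sketch of the idea
      (the patch's exact hunk may differ):

         alloc_new_skb:
                 pagedlen = 0;   /* reached by both loop entry and goto, so
                                  * initialize here unconditionally */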
      
      Fixes: 15e36f5b ("udp: paged allocation with gso")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: address problems caused by EDT mishaps · 9efdda4e
      Committed by Eric Dumazet
      When a qdisc setup including pacing FQ is dismantled and recreated,
      some TCP packets are sent earlier than instructed by the TCP stack.

      TCP can be fooled when the ACK comes back, because the following
      operation can return a negative value:

          tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;

      Some paths in the TCP stack were not dealing properly with this;
      this patch addresses four of them.
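
      A sketch of the defensive pattern applied at those call sites
      (illustrative shape, not the literal hunks):

         u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;

         /* A packet that left earlier than instructed can make the echoed
          * timestamp appear to come from the future; treat the negative
          * difference as zero elapsed time.
          */
         if ((s32)delta < 0)
                 delta = 0;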
      
      Fixes: ab408b6d ("tcp: switch tcp and sch_fq to new earliest departure time model")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 22 Nov 2018 (1 commit)
    • tcp: defer SACK compression after DupThresh · 86de5921
      Committed by Eric Dumazet
      Jean-Louis reported a TCP regression and bisected to recent SACK
      compression.
      
      After a loss episode (receiver not able to keep up and dropping
      packets because its backlog is full), the Linux TCP stack sends
      a single SACK (DUPACK).
      
      The sender then waits a full RTO before recovering the losses.
      
      While RFC 6675 says in section 5, "Algorithm Details",
      
         (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
             indicating at least three segments have arrived above the current
             cumulative acknowledgment point, which is taken to indicate loss
             -- go to step (4).
      ...
         (4) Invoke fast retransmit and enter loss recovery as follows:
      
      there are old TCP stacks not implementing this strategy, and
      still counting the dupacks before starting fast retransmit.
      
      While these stacks probably perform poorly when receivers implement
      LRO/GRO, we should be a little more gentle to them.
      
      This patch makes sure we do not enable SACK compression unless
      3 dupacks have been sent since the last rcv_nxt update.
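
      A sketch of the gating this implies in the ACK-sending path (field and
      threshold names follow the surrounding TCP code; the exact hunk is not
      quoted):

         /* Send the first DupThresh (3) dupacks immediately and
          * uncompressed, so legacy senders counting dupacks still trigger
          * fast retransmit; only compress SACKs beyond that.
          */
         if (tp->compressed_ack <= TCP_FASTRETRANS_THRESH)
                 goto send_now;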
      
      Ideally we should even rearm the timer to send one or two
      more DUPACKs if no more packets are coming, but that work
      is aimed at linux-4.21.
      
      Many thanks to Jean-Louis for bisecting the issue, providing
      packet captures and testing this patch.
      
      Fixes: 5d9f4262 ("tcp: add SACK compression")
      Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
      Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 21 Nov 2018 (1 commit)
  12. 18 Nov 2018 (1 commit)
  13. 09 Nov 2018 (1 commit)
  14. 06 Nov 2018 (1 commit)
    • net: bpfilter: fix iptables failure if bpfilter_umh is disabled · 97adadda
      Committed by Taehee Yoo
      When an iptables command is executed, ip_{set/get}sockopt() tries to
      load bpfilter.ko if bpfilter is enabled; if bpfilter.ko cannot be
      found, the command fails.
      bpfilter.ko is generated only if CONFIG_BPFILTER_UMH is enabled, but
      ip_{set/get}sockopt() checks only CONFIG_BPFILTER.
      So if CONFIG_BPFILTER is enabled and CONFIG_BPFILTER_UMH is disabled,
      the iptables command always fails.
      
      test config:
         CONFIG_BPFILTER=y
         # CONFIG_BPFILTER_UMH is not set
      
      test command:
         %iptables -L
         iptables: No chain/target/match by that name.
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 02 Nov 2018 (1 commit)
    • net: drop skb on failure in ip_check_defrag() · 7de414a9
      Committed by Cong Wang
      Most callers of pskb_trim_rcsum() simply drop the skb when
      it fails; however, ip_check_defrag() still continues to pass
      the skb up the stack. This is suspicious.

      In ip_check_defrag(), after we learn the skb is an IP fragment,
      passing the skb to callers makes no sense, because callers expect
      fragments to have been defragmented on success. So, dropping the
      skb when we can't defrag it is reasonable.
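
      A sketch of the resulting error path in ip_check_defrag() (shape
      reconstructed from the description, not a verbatim hunk):

         if (pskb_trim_rcsum(skb, netoff + len)) {
                 kfree_skb(skb);         /* don't pass a broken fragment up */
                 return NULL;
         }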
      
      Note: prior to commit 88078d98, this was not a big problem, as the
      checksum would be fixed up anyway. After it, the checksum is not
      correct on failure.
      
      Found this during code review.
      
      Fixes: 88078d98 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 31 Oct 2018 (2 commits)
  17. 30 Oct 2018 (1 commit)
    • ipv4/igmp: fix v1/v2 switchback timeout based on rfc3376, 8.12 · 966c37f2
      Committed by Hangbin Liu
      Similar to ipv6 mcast commit 89225d1c ("net: ipv6: mld: fix v1/v2
      switchback timeout to rfc3810, 9.12.")
      
      i) RFC3376 8.12. Older Version Querier Present Timeout says:
      
         The Older Version Querier Interval is the time-out for transitioning
         a host back to IGMPv3 mode once an older version query is heard.
         When an older version query is received, hosts set their Older
         Version Querier Present Timer to Older Version Querier Interval.
      
         This value MUST be ((the Robustness Variable) times (the Query
         Interval in the last Query received)) plus (one Query Response
         Interval).
      
      Currently we only use the hardcoded values IGMP_V1/V2_ROUTER_PRESENT_TIMEOUT.
      Fix it by adding two new fields, mr_qi (Query Interval) and mr_qri (Query
      Response Interval), to struct in_device.

      Now we can calculate the switchback time via (mr_qrv * mr_qi) + mr_qri.
      We need to update these values when we receive IGMPv3 queries.
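
      A sketch of the computation (the helper name is hypothetical; mr_qrv,
      mr_qi and mr_qri are the fields named above):

         /* RFC 3376 8.12: Older Version Querier Present Timeout =
          * (Robustness Variable * Query Interval) + Query Response Interval
          */
         static unsigned long in_dev_older_querier_timeout(const struct in_device *in_dev)
         {
                 return (in_dev->mr_qrv * in_dev->mr_qi) + in_dev->mr_qri;
         }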
      Reported-by: Ying Xu <yinxu@redhat.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 29 Oct 2018 (1 commit)
    • net: diag: document swapped src/dst in udp_dump_one. · 747569b0
      Committed by Lorenzo Colitti
      Since its inception, udp_dump_one has had a bug where userspace
      needs to swap src and dst addresses and ports in order to find
      the socket it wants. This is because it passes the socket source
      address to __udp[46]_lib_lookup's saddr argument, but those
      functions are intended to find local sockets matching received
      packets, so saddr is the remote address, not the local address.
      
      This can no longer be fixed for backwards compatibility reasons,
      so add a brief comment explaining that this is the case. This
      will avoid confusion and help ensure SOCK_DIAG implementations
      of new protocols don't have the same problem.
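
      A sketch of the situation being documented (the udp_diag lookup is
      paraphrased here; argument order as described above):

         /* __udp4_lib_lookup() matches *received* packets, so its
          * saddr/sport arguments are the remote end. Passing the socket's
          * own src here is why userspace must swap src and dst to find the
          * socket it wants.
          */
         sk = __udp4_lib_lookup(net,
                                req->id.idiag_src[0], req->id.idiag_sport,
                                req->id.idiag_dst[0], req->id.idiag_dport,
                                req->id.idiag_if, 0, tbl, NULL);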
      
      Fixes: a925aa00 ("udp_diag: Implement the get_exact dumping functionality")
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 27 Oct 2018 (1 commit)
  20. 26 Oct 2018 (1 commit)
  21. 25 Oct 2018 (3 commits)
    • net: udp: fix handling of CHECKSUM_COMPLETE packets · db4f1be3
      Committed by Sean Tranchetti
      Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
      incorrect for any packet that has an incorrect checksum value.
      
      udp4/6_csum_init() will both make a call to
      __skb_checksum_validate_complete() to initialize/validate the csum
      field when receiving a CHECKSUM_COMPLETE packet. When this packet
      fails validation, skb->csum will be overwritten with the pseudoheader
      checksum so the packet can be fully validated by software, but the
      skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
      the stack can later warn the user about their hardware spewing bad
      checksums. Unfortunately, leaving the SKB in this state can cause
      problems later on in the checksum calculation.
      
      Since the packet is still marked as CHECKSUM_COMPLETE,
      udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
      from skb->csum instead of adding it, leaving us with a garbage value
      in that field. Once we try to copy the packet to userspace in
      udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
      to checksum the packet data and add it to the garbage skb->csum value
      to perform our final validation check.
      
      Since the value we're validating is not the proper checksum, it's possible
      that the folded value could come out to 0, causing us not to drop the
      packet. Instead, we believe that the packet was checksummed incorrectly
      by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
      to warn the user with netdev_rx_csum_fault(skb->dev);
      
      Unfortunately, since this is the UDP path, skb->dev has been overwritten
      by skb->dev_scratch and is no longer a valid pointer, so we end up
      reading invalid memory.
      
      This patch addresses this problem in two ways:
      	1) Do not use the dev pointer when calling netdev_rx_csum_fault()
      	   from skb_copy_and_csum_datagram_msg(). Since this gets called
      	   from the UDP path where skb->dev has been overwritten, we have
      	   no way of knowing if the pointer is still valid. Also for the
      	   sake of consistency with the other uses of
      	   netdev_rx_csum_fault(), don't attempt to call it if the
      	   packet was checksummed by software.
      
      	2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
      	   If we receive a packet that's CHECKSUM_COMPLETE that fails
      	   verification (i.e. skb->csum_valid == 0), check who performed
      	   the calculation. It's possible that the checksum was done in
      	   software by the network stack earlier (such as Netfilter's
      	   CONNTRACK module), and if that says the checksum is bad,
      	   we can drop the packet immediately instead of waiting until
      	   we try and copy it to userspace. Otherwise, we need to
      	   mark the SKB as CHECKSUM_NONE, since the skb->csum field
      	   no longer contains the full packet checksum after the
      	   call to __skb_checksum_validate_complete().
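
      A sketch of fix (2) in udp4/6_csum_init() (reconstructed from the
      description above, not a verbatim hunk):

         if (skb->ip_summed == CHECKSUM_COMPLETE && !skb->csum_valid) {
                 /* If software computed the bad value (e.g. conntrack),
                  * the packet really is corrupt: drop it now.
                  */
                 if (skb->csum_complete_sw)
                         return 1;

                 /* Hardware said the value is bad, but skb->csum no longer
                  * holds a full packet checksum: fall back to CHECKSUM_NONE
                  * so it is recomputed later.
                  */
                 skb->ip_summed = CHECKSUM_NONE;
         }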
      
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Fixes: c84d9490 ("udp: copy skb->truesize in the first cache line")
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Don't return invalid table id error when dumping all families · ae677bbb
      Committed by David Ahern
      When doing a route dump across all address families, do not error out
      if the table does not exist. This allows a route dump for AF_UNSPEC
      with a table id that may only exist for some of the families.
      
      Do return the "table does not exist" error if dumping routes for a
      specific family and the table does not exist.
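
      A sketch of the per-table check this implies (the flag name follows the
      description and is an assumption here):

         tb = fib_get_table(net, filter.table_id);
         if (!tb) {
                 if (filter.dump_all_families)
                         return skb->len;        /* skip silently for AF_UNSPEC */

                 NL_SET_ERR_MSG(cb->extack, "FIB table does not exist");
                 return -ENOENT;
         }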
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4: Put target net when address dump fails due to bad attributes · d7e38611
      Committed by David Ahern
      If tgt_net is set based on the IFA_TARGET_NETNSID attribute in the dump
      request, make sure all error paths call put_net.
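
      A sketch of the pattern (the netnsid test is an assumption based on how
      tgt_net is obtained):

         if (err < 0) {
                 if (fillargs.netnsid >= 0)
                         put_net(tgt_net);       /* balance the NETNSID get */
                 return err;
         }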
      
      Fixes: 5fcd266a ("net/ipv4: Add support for dumping addresses for a specific device")
      Fixes: c33078e3 ("net/ipv4: Update inet_dump_ifaddr for strict data checking")
      Reported-by: Li RongQing <lirongqing@baidu.com>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 24 Oct 2018 (2 commits)
    • tcp: add tcp_reset_xmit_timer() helper · 3f80e08f
      Committed by Eric Dumazet
      With the EDT model, SRTT is no longer inflated by pacing delays.

      This means that the RTO and some other xmit timers might be set up
      incorrectly. This is particularly visible with either:

      - Very small enforced pacing rates (SO_MAX_PACING_RATE)
      - Reduced rto (from the default 200 ms)

      This can lead to TCP flow aborts in the worst case,
      or spurious retransmits in other cases.

      For example, this session gets far more throughput
      than the requested 80 kbit/s:
      
      $ netperf -H 127.0.0.2 -l 100 -- -q 10000
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      540000 262144 262144    104.00      2.66
      
      With the fix:
      
      $ netperf -H 127.0.0.2 -l 100 -- -q 10000
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      540000 262144 262144    104.00      0.12
      
      EDT allows for better control of rtx timers, since TCP has
      a better idea of the earliest departure time of each skb
      in the rtx queue. We only have to add to the timer the
      difference between the EDT and the current time.
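
      A sketch of the helper's shape (field names from the TCP stack of that
      era; reconstructed, not a verbatim quote):

         static void tcp_reset_xmit_timer(struct sock *sk, int what,
                                          unsigned long when,
                                          unsigned long max_when,
                                          const struct sk_buff *skb)
         {
                 /* Push the timer out by the pending pacing delay: the skb's
                  * earliest departure time minus the cached clock.
                  */
                 s64 delay_ns = (skb ? skb->tstamp : tcp_sk(sk)->tcp_wstamp_ns)
                                - tcp_sk(sk)->tcp_clock_cache;

                 if (delay_ns > 0)
                         when += nsecs_to_jiffies(delay_ns);
                 inet_csk_reset_xmit_timer(sk, what, when, max_when);
         }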
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "net: simplify sock_poll_wait" · 89ab066d
      Committed by Karsten Graul
      This reverts commit dd979b4d.
      
      This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
      internal TCP socket for the initial handshake with the remote peer.
      Whenever the SMC connection cannot be established, this TCP socket is
      used as a fallback. All socket operations on the SMC socket are then
      forwarded to the TCP socket. In case of poll, the file->private_data
      pointer references the SMC socket because the TCP socket has no file
      assigned. This causes tcp_poll to wait on the wrong socket.
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 23 Oct 2018 (2 commits)
  24. 19 Oct 2018 (1 commit)
  25. 18 Oct 2018 (3 commits)
    • net: ipmr: fix unresolved entry dumps · eddf016b
      Committed by Nikolay Aleksandrov
      If the skb runs out of space on an unresolved entry while dumping,
      we'll miss some unresolved entries. The reason is that the entry
      counter is zeroed between dumping the resolved and unresolved mfc
      entries. We should just keep counting until the whole table is dumped,
      and zero it only when we move to the next table, as we have a separate
      table counter.
      Reported-by: Colin Ian King <colin.king@canonical.com>
      Fixes: 8fb472c0 ("ipmr: improve hash scalability")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_bbr: centralize code to set gains · cf33e25c
      Committed by Neal Cardwell
      Centralize the code that sets gains used for computing cwnd and pacing
      rate. This simplifies the code and makes it easier to change the state
      machine or (in the future) dynamically change the gain values and
      ensure that the correct gain values are always used.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_bbr: adjust TCP BBR for departure time pacing · a87c83d5
      Committed by Neal Cardwell
      Adjust TCP BBR for the new departure time pacing model in the recent
      commit ab408b6d ("tcp: switch tcp and sch_fq to new earliest
      departure time model").
      
      With TSQ and pacing at lower layers, there are often several skbs
      queued in the pacing layer, and thus there is less data "in the
      network" than "in flight".
      
      With departure time pacing at lower layers (e.g. fq or potential
      future NICs), the data in the pacing layer now has a pre-scheduled
      ("baked-in") departure time that cannot be changed, even if the
      congestion control algorithm decides to use a new pacing rate.
      
      This means that there can be a non-trivial lag between when BBR makes
      a pacing rate change and when the inter-skb pacing delays
      change. After a pacing rate change, the number of packets in the
      network can gradually evolve to be higher or lower, depending on
      whether the sending rate is higher or lower than the delivery
      rate. Thus ignoring this lag can cause significant overshoot, with the
      flow ending up with too many or too few packets in the network.
      
      This commit changes BBR to adapt its pacing rate based on the amount
      of data in the network that it estimates has already been "baked in"
      by previous departure time decisions. We estimate the number of our
      packets that will be in the network at the earliest departure time
      (EDT) for the next skb scheduled as:
      
         in_network_at_edt = inflight_at_edt - (EDT - now) * bw
      
      If we're increasing the amount of data in the network ("in_network"),
      then we want to know if the transmit of the EDT skb will push
      in_network above the target, so our answer includes
      bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing
      in_network, then we want to know if in_network will sink too low just
      before the EDT transmit, so our answer does not include the segments
      from the skb departing at EDT.
      
      Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case
      differently? The in_network curve is a step function: in_network goes
      up on transmits, and down on ACKs. To accurately predict when
      in_network will go beyond our target value, this will happen on
      different events, depending on whether we're concerned about
      in_network potentially going too high or too low:
      
       o if pushing in_network up (pacing_gain > 1.0),
         then in_network goes above target upon a transmit event
      
       o if pushing in_network down (pacing_gain < 1.0),
         then in_network goes below target upon an ACK event
      
      This commit changes the BBR state machine to use this estimated
      "packets in network" value to make its decisions.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>