1. 13 Sep, 2018 1 commit
    • ipv6: use rt6_info members when dst is set in rt6_fill_node · 22d0bd82
      Xin Long committed
      In inet6_rtm_getroute, since Commit 93531c67 ("net/ipv6: separate
      handling of FIB entries from dst based routes"), it has used rt->from
      to dump route info instead of rt.
      
      However, for some routes, such as cached ones, some of the information
      (flags or gateway, for example) is not the same as that of the 'from' entry.
      This caused 'ip route get' to dump the wrong route information.
      
      In Jianlin's testing, the output information even lost the expiration
      time for a pmtu route cache due to the wrong fib6_flags.
      
      So change to use rt6_info members for dst addr, src addr, flags and
      gateway when it tries to dump a route entry without fibmatch set.
      
      v1->v2:
        - not use rt6i_prefsrc.
        - also fix the gw dump issue.
      
      Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 10 Sep, 2018 1 commit
    • ip: frags: fix crash in ip_do_fragment() · 5d407b07
      Taehee Yoo committed
      A kernel crash occurs when a defragmented packet is fragmented
      in ip_do_fragment().
      In the defragment routine, skb_orphan() is called and
      skb->ip_defrag_offset is set. However, skb->sk and
      skb->ip_defrag_offset are members of the same union, so
      frag->sk is not NULL.
      Hence the crash occurs in the skb->sk check in ip_do_fragment() when
      a defragmented packet is fragmented.
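
The aliasing described above can be reproduced in user space. The following is a minimal sketch, assuming a hypothetical `struct fake_skb` that only mimics the union of `sk` and `ip_defrag_offset`; it is not the kernel's `sk_buff`, and the helper names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket; its contents do not matter here. */
struct fake_sock {
    int unused;
};

/* Hypothetical stand-in for sk_buff: only the shared union is modelled. */
struct fake_skb {
    union {
        struct fake_sock *sk;   /* owning socket, NULL once orphaned */
        int ip_defrag_offset;   /* reused while queued for reassembly */
    };
};

/* Mimics the defrag path: orphan the skb, then record the offset. */
static void fake_defrag_enqueue(struct fake_skb *skb, int offset)
{
    assert(skb != NULL);
    skb->sk = NULL;                 /* state an orphaned skb would have */
    skb->ip_defrag_offset = offset; /* overwrites the storage shared with sk */
}

/* Mimics the v2 approach in the message: clear sk again at reassembly. */
static void fake_reassemble(struct fake_skb *skb)
{
    assert(skb != NULL);
    skb->sk = NULL;
}
```

After `fake_defrag_enqueue()` with a non-zero offset, reading `sk` yields a non-NULL value even though no socket owns the buffer, which is the condition the check in the fragmentation path trips over.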
      
      test commands:
         %iptables -t nat -I POSTROUTING -j MASQUERADE
         %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000
      
      splat looks like:
      [  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
      [  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
      [  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
      [  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
      [  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
      [  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
      [  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
      [  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
      [  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
      [  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
      [  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
      [  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
      [  261.198158] Call Trace:
      [  261.199018]  ? dst_output+0x180/0x180
      [  261.205011]  ? save_trace+0x300/0x300
      [  261.209018]  ? ip_copy_metadata+0xb00/0xb00
      [  261.213034]  ? sched_clock_local+0xd4/0x140
      [  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
      [  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
      [  261.227014]  ? find_held_lock+0x39/0x1c0
      [  261.233008]  ip_finish_output+0x51d/0xb50
      [  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
      [  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
      [  261.250152]  ? rcu_is_watching+0x77/0x120
      [  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
      [  261.261033]  ? nf_hook_slow+0xb1/0x160
      [  261.265007]  ip_output+0x1c7/0x710
      [  261.269005]  ? ip_mc_output+0x13f0/0x13f0
      [  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
      [  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
      [  261.282996]  ? nf_hook_slow+0xb1/0x160
      [  261.287007]  raw_sendmsg+0x21f9/0x4420
      [  261.291008]  ? dst_output+0x180/0x180
      [  261.297003]  ? sched_clock_cpu+0x126/0x170
      [  261.301003]  ? find_held_lock+0x39/0x1c0
      [  261.306155]  ? stop_critical_timings+0x420/0x420
      [  261.311004]  ? check_flags.part.36+0x450/0x450
      [  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
      [  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
      [  261.326142]  ? cyc2ns_read_end+0x10/0x10
      [  261.330139]  ? raw_bind+0x280/0x280
      [  261.334138]  ? sched_clock_cpu+0x126/0x170
      [  261.338995]  ? check_flags.part.36+0x450/0x450
      [  261.342991]  ? __lock_acquire+0x4500/0x4500
      [  261.348994]  ? inet_sendmsg+0x11c/0x500
      [  261.352989]  ? dst_output+0x180/0x180
      [  261.357012]  inet_sendmsg+0x11c/0x500
      [ ... ]
      
      v2:
       - clear skb->sk at reassembly routine. (Eric Dumazet)
      
      Fixes: fa0f5273 ("ip: use rb trees for IP frag queue.")
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 04 Sep, 2018 1 commit
  4. 03 Sep, 2018 1 commit
    • net/ipv6: Only update MTU metric if it set · 15a81b41
      David Ahern committed
      Jan reported a regression after an update to 4.18.5. In this case the ipv6
      default route is set up by systemd-networkd based on data from an RA. The
      RA contains an MTU of 1492, which is used when the route is first inserted,
      but then systemd-networkd pushes down updates to the default route
      without the mtu set.
      
      Prior to the change to fib6_info, metrics such as MTU were held in the
      dst_entry and rt6i_pmtu in rt6_info contained an update to the mtu if
      any. ip6_mtu would look at rt6i_pmtu first and use it if set. If not,
      the value from the metrics was used if set, finally falling
      back to the idev value.
      
      After the fib6_info change metrics are contained in the fib6_info struct
      and there is no equivalent to rt6i_pmtu. To maintain consistency with
      the old behavior the new code should only reset the MTU in the metrics
      if the route update has it set.
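
The rule in the last paragraph can be sketched as follows. This is a simplified illustration, assuming a hypothetical `struct fake_metrics` and treating an MTU of 0 as "not set in the update"; the real patch operates on fib6_info and its metrics array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced view of the per-route metrics. */
struct fake_metrics {
    unsigned int mtu;
};

/*
 * Apply a route update. update_mtu == 0 is assumed to mean that the
 * update did not carry an MTU, in which case the stored value is kept.
 */
static void fake_apply_route_update(struct fake_metrics *m,
                                    unsigned int update_mtu)
{
    assert(m != NULL);
    if (update_mtu)
        m->mtu = update_mtu;
}
```

With this rule, an MTU of 1492 learned from the RA survives a later update that leaves the MTU unset.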
      
      Fixes: d4ead6b3 ("net/ipv6: move metrics from dst to rt6_info")
      Reported-by: Jan Janssen <medhefgo@web.de>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 02 Sep, 2018 1 commit
    • ipv6: don't get lwtstate twice in ip6_rt_copy_init() · 93bbadd6
      Alexey Kodanev committed
      Commit 80f1a0f4 ("net/ipv6: Put lwtstate when destroying fib6_info")
      partially fixed the kmemleak [1]. lwtstate can be copied from fib6_info
      with ip6_rt_copy_init(), and it should be done only once there.
      
      rt->dst.lwtstate is set by ip6_rt_init_dst(), at the start of the function
      ip6_rt_copy_init(), so there is no need to get it again at the end.
      
      With this patch, lwtstate also isn't copied from RTF_REJECT routes.
      
      [1]:
      unreferenced object 0xffff880b6aaa14e0 (size 64):
        comm "ip", pid 10577, jiffies 4295149341 (age 1273.903s)
        hex dump (first 32 bytes):
          01 00 04 00 04 00 00 00 10 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<0000000018664623>] lwtunnel_build_state+0x1bc/0x420
          [<00000000b73aa29a>] ip6_route_info_create+0x9f7/0x1fd0
          [<00000000ee2c5d1f>] ip6_route_add+0x14/0x70
          [<000000008537b55c>] inet6_rtm_newroute+0xd9/0xe0
          [<000000002acc50f5>] rtnetlink_rcv_msg+0x66f/0x8e0
          [<000000008d9cd381>] netlink_rcv_skb+0x268/0x3b0
          [<000000004c893c76>] netlink_unicast+0x417/0x5a0
          [<00000000f2ab1afb>] netlink_sendmsg+0x70b/0xc30
          [<00000000890ff0aa>] sock_sendmsg+0xb1/0xf0
          [<00000000a2e7b66f>] ___sys_sendmsg+0x659/0x950
          [<000000001e7426c8>] __sys_sendmsg+0xde/0x170
          [<00000000fe411443>] do_syscall_64+0x9f/0x4a0
          [<000000001be7b28b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000006d21f353>] 0xffffffffffffffff
      
      Fixes: 6edb3c96 ("net/ipv6: Defer initialization of dst to data path")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 30 Aug, 2018 3 commits
    • ipv6: fix cleanup ordering for pingv6 registration · a03dc36b
      Sabrina Dubroca committed
      Commit 6d0bfe22 ("net: ipv6: Add IPv6 support to the ping socket.")
      contains an error in the cleanup path of inet6_init(): when
      proto_register(&pingv6_prot, 1) fails, we try to unregister
      &pingv6_prot. When rawv6_init() fails, we skip unregistering
      &pingv6_prot.
      
      Example of panic (triggered by faking a failure of
       proto_register(&pingv6_prot, 1)):
      
          general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
          [...]
          RIP: 0010:__list_del_entry_valid+0x79/0x160
          [...]
          Call Trace:
           proto_unregister+0xbb/0x550
           ? trace_preempt_on+0x6f0/0x6f0
           ? sock_no_shutdown+0x10/0x10
           inet6_init+0x153/0x1b8
      
      Fixes: 6d0bfe22 ("net: ipv6: Add IPv6 support to the ping socket.")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: fix cleanup ordering for ip6_mr failure · afe49de4
      Sabrina Dubroca committed
      Commit 15e66807 ("ipv6: reorder icmpv6_init() and ip6_mr_init()")
      moved the cleanup label for ipmr_fail, but should have changed the
      contents of the cleanup labels as well. Now we can end up cleaning up
      icmpv6 even though it hasn't been initialized (jump to icmp_fail or
      ipmr_fail).
      
      Simply undo things in the reverse order of their initialization.
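
The "undo in reverse order" rule can be illustrated with a small goto-based error path. This is a generic sketch with hypothetical steps A, B and C and a hypothetical log structure; it does not reproduce inet6_init().

```c
#include <assert.h>
#include <stddef.h>

enum fake_step { STEP_A = 1, STEP_B, STEP_C };

/* Records which steps were undone, in the order they were undone. */
struct fake_log {
    int undone[3];
    int count;
};

/* Pretend to run one init step; fails only when step == fail_at. */
static int fake_do(int step, int fail_at)
{
    return step == fail_at ? -1 : 0;
}

static void fake_undo(struct fake_log *log, int step)
{
    log->undone[log->count++] = step;
}

/* Each label undoes exactly the steps that completed before the failure. */
static int fake_init(int fail_at, struct fake_log *log)
{
    int err;

    assert(log != NULL);
    log->count = 0;

    err = fake_do(STEP_A, fail_at);
    if (err)
        goto out;
    err = fake_do(STEP_B, fail_at);
    if (err)
        goto undo_a;
    err = fake_do(STEP_C, fail_at);
    if (err)
        goto undo_b;
    return 0;

undo_b:
    fake_undo(log, STEP_B);
undo_a:
    fake_undo(log, STEP_A);
out:
    return err;
}
```

Because the labels fall through from the latest step to the earliest, a failure never triggers cleanup of a step that was not initialized, which is the kind of mismatch both cleanup-ordering fixes address.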
      
      Example of panic (triggered by faking a failure of icmpv6_init):
      
          kasan: GPF could be caused by NULL-ptr deref or user memory access
          general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
          [...]
          RIP: 0010:__list_del_entry_valid+0x79/0x160
          [...]
          Call Trace:
           ? lock_release+0x8a0/0x8a0
           unregister_pernet_operations+0xd4/0x560
           ? ops_free_list+0x480/0x480
           ? down_write+0x91/0x130
           ? unregister_pernet_subsys+0x15/0x30
           ? down_read+0x1b0/0x1b0
           ? up_read+0x110/0x110
           ? kmem_cache_create_usercopy+0x1b4/0x240
           unregister_pernet_subsys+0x1d/0x30
           icmpv6_cleanup+0x1d/0x30
           inet6_init+0x1b5/0x23f
      
      Fixes: 15e66807 ("ipv6: reorder icmpv6_init() and ip6_mr_init()")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vti6: remove !skb->ignore_df check from vti6_xmit() · 9f289546
      Alexey Kodanev committed
      Before commit d6990976 ("vti6: fix PMTU caching and reporting
      on xmit"), the '!skb->ignore_df' check was always true because
      skb_scrub_packet() was called before it, resetting ignore_df to zero.
      
      In that commit, skb_scrub_packet() was moved below the check, so the check
      can now be false for a packet, e.g. when it is sent in two fragments.
      This prevents successful PMTU updates in such a case, and the next attempts
      to send the packet lead to the same tx error. Moreover, the vti6 initial
      MTU value relies on PMTU adjustments.
      
      This issue can be reproduced with the following LTP test script:
          udp_ipsec_vti.sh -6 -p ah -m tunnel -s 2000
      
      Fixes: ccd740cb ("vti6: Add pmtu handling to vti6_xmit.")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 28 Aug, 2018 1 commit
  8. 23 Aug, 2018 2 commits
  9. 21 Aug, 2018 1 commit
  10. 20 Aug, 2018 2 commits
    • ip6_vti: fix a null pointer deference when destroy vti6 tunnel · 9c86336c
      Haishuang Yan committed
      If the ip6_vti module is loaded and a network namespace is created while
      fb_tunnels_only_for_init_net is set to 1, exiting the namespace
      causes the following crash:
      
      [ 6601.677036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [ 6601.679057] PGD 8000000425eca067 P4D 8000000425eca067 PUD 424292067 PMD 0
      [ 6601.680483] Oops: 0000 [#1] SMP PTI
      [ 6601.681223] CPU: 7 PID: 93 Comm: kworker/u16:1 Kdump: loaded Tainted: G            E     4.18.0+ #3
      [ 6601.683153] Hardware name: Fedora Project OpenStack Nova, BIOS seabios-1.7.5-11.el7 04/01/2014
      [ 6601.684919] Workqueue: netns cleanup_net
      [ 6601.685742] RIP: 0010:vti6_exit_batch_net+0x87/0xd0 [ip6_vti]
      [ 6601.686932] Code: 7b 08 48 89 e6 e8 b9 ea d3 dd 48 8b 1b 48 85 db 75 ec 48 83 c5 08 48 81 fd 00 01 00 00 75 d5 49 8b 84 24 08 01 00 00 48 89 e6 <48> 8b 78 08 e8 90 ea d3 dd 49 8b 45 28 49 39 c6 4c 8d 68 d8 75 a1
      [ 6601.690735] RSP: 0018:ffffa897c2737de0 EFLAGS: 00010246
      [ 6601.691846] RAX: 0000000000000000 RBX: 0000000000000000 RCX: dead000000000200
      [ 6601.693324] RDX: 0000000000000015 RSI: ffffa897c2737de0 RDI: ffffffff9f2ea9e0
      [ 6601.694824] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000
      [ 6601.696314] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8dc323c07e00
      [ 6601.697812] R13: ffff8dc324a63100 R14: ffffa897c2737e30 R15: ffffa897c2737e30
      [ 6601.699345] FS:  0000000000000000(0000) GS:ffff8dc33fdc0000(0000) knlGS:0000000000000000
      [ 6601.701068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6601.702282] CR2: 0000000000000008 CR3: 0000000424966002 CR4: 00000000001606e0
      [ 6601.703791] Call Trace:
      [ 6601.704329]  cleanup_net+0x1b4/0x2c0
      [ 6601.705268]  process_one_work+0x16c/0x370
      [ 6601.706145]  worker_thread+0x49/0x3e0
      [ 6601.706942]  kthread+0xf8/0x130
      [ 6601.707626]  ? rescuer_thread+0x340/0x340
      [ 6601.708476]  ? kthread_bind+0x10/0x10
      [ 6601.709266]  ret_from_fork+0x35/0x40
      
      Reproduce:
      modprobe ip6_vti
      echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net
      unshare -n
      exit
      
      This is because ip6n->tnls_wc[0] points to the fallback device by default, but
      in a non-default namespace ip6n->tnls_wc[0] will be NULL, so add the
      corresponding NULL check.
      
      Fixes: e2948e5a ("ip6_vti: fix creating fallback tunnel device for vti6")
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_vti: fix creating fallback tunnel device for vti6 · e2948e5a
      Haishuang Yan committed
      When fb_tunnels_only_for_init_net is set to 1, don't create the fallback tunnel
      device for vti6 when a new namespace is created.
      
      Tested:
      [root@builder2 ~]# modprobe ip6_tunnel
      [root@builder2 ~]# modprobe ip6_vti
      [root@builder2 ~]# echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net
      [root@builder2 ~]# unshare -n
      [root@builder2 ~]# ip link
      1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
      default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 19 Aug, 2018 1 commit
  12. 17 Aug, 2018 1 commit
  13. 13 Aug, 2018 1 commit
    • ipv6: Add icmp_echo_ignore_all support for ICMPv6 · e6f86b0f
      Virgile Jarry committed
      Preventing the kernel from responding to ICMP Echo Requests messages
      can be useful in several ways. The sysctl parameter
      'icmp_echo_ignore_all' can be used to prevent the kernel from
      responding to IPv4 ICMP echo requests. For IPv6 pings, such
      a sysctl kernel parameter did not exist.
      
      Add the ability to prevent the kernel from responding to IPv6
      ICMP echo requests through the use of the following sysctl
      parameter : /proc/sys/net/ipv6/icmp/echo_ignore_all.
      Update the documentation to reflect this change.
      Signed-off-by: Virgile Jarry <virgile@acceis.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 11 Aug, 2018 1 commit
    • bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection · 8217ca65
      Martin KaFai Lau committed
      This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
      SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
      the earlier patch.  "bpf_run_sk_reuseport()" will return -ECONNREFUSED
      when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
      The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
      handle this case and return NULL immediately instead of continuing the
      sk search from its hashtable.
      
      It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
      BPF_PROG_TYPE_SK_REUSEPORT.  The "sk_reuseport_attach_bpf()" will check
      if the attaching bpf prog is in the new SK_REUSEPORT or the existing
      SOCKET_FILTER type and then check different things accordingly.
      
      One level of "__reuseport_attach_prog()" call is removed.  The
      "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
      back to "reuseport_attach_prog()" in sock_reuseport.c.  sock_reuseport.c
      seems to have more knowledge on those test requirements than filter.c.
      In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
      the old_prog (if any) is also directly freed instead of returning the
      old_prog to the caller and asking the caller to free.
      
      The sysctl_optmem_max check is moved back to
      "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
      As with other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
      bounded by the usual "bpf_prog_charge_memlock()" during load time
      instead of by both bpf_prog_charge_memlock and sysctl_optmem_max.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  15. 10 Aug, 2018 1 commit
    • net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap · eb95f52f
      Maria Pasechnik committed
      IPv6 GRO over GRE tap does not work when GRO is not enabled
      on the native interface.
      
      gro_list_prepare function updates the same_flow variable
      of existing sessions to 1 if their mac headers match the one
      of the incoming packet.
      same_flow is used to filter out non-matching sessions and keep
      potential ones for aggregation.
      
      The number of bytes to compare should be the number of bytes
      in the mac headers. In gro_list_prepare this number is set to
      skb->dev->hard_header_len. For GRE interfaces, hard_header_len
      should keep the value set in the initialization process (when the GRE
      device is created) and should not be overridden. Currently it is overridden
      by the value that is actually supposed to represent the needed_headroom.
      Therefore, the number of bytes compared in order to decide whether
      the mac headers are the same is greater than the length of the headers.
      
      As it's documented in netdevice.h, hard_header_len is the maximum
      hardware header length, and needed_headroom is the extra headroom
      the hardware may need.
      hard_header_len is basically all the bytes received by the physical
      interface up to the layer 3 header of the packet received by the interface.
      For example, if the interface is a GRE tap then the needed_headroom
      should be the total length of the following headers:
      IP header of the physical, GRE header, mac header of GRE.
      It is often used to calculate the MTU of the created interface.
      
      This patch removes the override of the hard_header_len, and
      assigns the calculated value to needed_headroom.
      This way, the comparison in gro_list_prepare is really of
      the mac headers, and if the packets have the same mac headers
      the same_flow will be set to 1.
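
The effect of comparing too many bytes can be sketched in user space. This is an illustration only, assuming a 14-byte Ethernet header and a hypothetical helper `fake_same_mac_header()`; it is not the code of gro_list_prepare.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FAKE_ETH_HLEN 14 /* Ethernet header length in bytes */

/*
 * Compare the first hard_header_len bytes of two packets, the way a
 * same_flow decision based on the mac header would. If hard_header_len
 * is larger than the real mac header, bytes beyond it take part in the
 * comparison and packets of the same flow may no longer match.
 */
static bool fake_same_mac_header(const unsigned char *a,
                                 const unsigned char *b,
                                 size_t hard_header_len)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, hard_header_len) == 0;
}
```

Two packets with identical mac headers but different bytes further in match with a length of `FAKE_ETH_HLEN` and stop matching with an inflated length, which corresponds to the GRO failure described in the message.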
      
      Performance testing: 45% higher bandwidth.
      Measuring bandwidth of single-stream IPv4 TCP traffic over IPv6
      GRE tap while GRO is not set on the native.
      NIC: ConnectX-4LX
      Before (GRO not working) : 7.2 Gbits/sec
      After (GRO working): 10.5 Gbits/sec
      Signed-off-by: Maria Pasechnik <mariap@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 08 Aug, 2018 1 commit
  17. 06 Aug, 2018 5 commits
    • ip6_tunnel: use the right value for ipv4 min mtu check in ip6_tnl_xmit · 82a40777
      Xin Long committed
      According to RFC791, 68 bytes is the minimum size of IPv4 datagram every
      device must be able to forward without further fragmentation while 576
      bytes is the minimum size of IPv4 datagram every device has to be able
      to receive, so 68 (IPV4_MIN_MTU) should be the right
      value for the ipv4 min mtu check in ip6_tnl_xmit().
      
      While at it, change to use max() instead of if statement.
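
The clamp described above can be sketched as a lower bound chosen by inner protocol. This is a simplified illustration with hypothetical names (`fake_clamp_tunnel_mtu`, the `FAKE_*` constants); the values 68 and 1280 are the IPv4 and IPv6 minimum MTUs, and the conditional expression stands in for the kernel's max() helper.

```c
#include <assert.h>
#include <stdbool.h>

#define FAKE_IPV4_MIN_MTU 68u
#define FAKE_IPV6_MIN_MTU 1280u

static_assert(FAKE_IPV4_MIN_MTU < FAKE_IPV6_MIN_MTU,
              "the IPv4 floor is expected to be the smaller one");

/* Never let the tunnel MTU drop below the floor of the inner protocol. */
static unsigned int fake_clamp_tunnel_mtu(unsigned int mtu, bool inner_is_ipv6)
{
    unsigned int floor = inner_is_ipv6 ? FAKE_IPV6_MIN_MTU : FAKE_IPV4_MIN_MTU;

    return mtu > floor ? mtu : floor;
}
```

Using 576 as the IPv4 floor instead of 68 would raise small but valid path MTUs, which is the behavior this commit corrects.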
      
      Fixes: c9fefa08 ("ip6_tunnel: get the min mtu properly in ip6_tnl_xmit")
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: fix double refcount of fib6_metrics · e70a3aad
      Cong Wang committed
      All the callers of ip6_rt_copy_init()/rt6_set_from() hold a refcnt
      on the "from" fib6_info, so there is no need to hold the fib6_metrics
      refcnt again: the fib6_metrics refcnt is only released when
      fib6_info is gone, that is, they have the same lifetime. The
      whole fib6_metrics refcnt can actually be removed.
      
      This fixes a kmemleak warning reported by Sabrina.
      
      Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: David Ahern <dsahern@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: defrag: drop non-last frags smaller than min mtu · 0ed4229b
      Florian Westphal committed
      Don't bother with pathological cases; they only waste cycles.
      IPv6 requires a minimum MTU of 1280 so we should never see fragments
      smaller than this (except last frag).
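
The drop rule can be expressed as a small predicate. This is a sketch under assumptions: `fake_drop_small_frag()` is a hypothetical helper, `frag_len` stands for the length of the received fragment, and `more_frags` for the "more fragments" flag of the fragment header.

```c
#include <assert.h>
#include <stdbool.h>

#define FAKE_IPV6_MIN_MTU 1280u

static_assert(FAKE_IPV6_MIN_MTU == 1280u,
              "IPv6 requires a minimum MTU of 1280");

/*
 * A fragment that is not the last one (more_frags set) and is shorter
 * than the IPv6 minimum MTU is treated as pathological and dropped.
 * The last fragment may legitimately be short.
 */
static bool fake_drop_small_frag(unsigned int frag_len, bool more_frags)
{
    return more_frags && frag_len < FAKE_IPV6_MIN_MTU;
}
```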
      
      v3: don't use awkward "-offset + len"
      v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
          There were concerns that there could be even smaller frags
          generated by intermediate nodes, e.g. on radio networks.
      
      Cc: Peter Oskolkov <posk@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: use rb trees for IP frag queue. · fa0f5273
      Peter Oskolkov committed
      Similar to TCP OOO RX queue, it makes sense to use rb trees to store
      IP fragments, so that OOO fragments are inserted faster.
      
      Tested:
      
      - a follow-up patch contains a rather comprehensive ip defrag
        self-test (functional)
      - ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
          netstat --statistics
          Ip:
              282078937 total packets received
              0 forwarded
              0 incoming packets discarded
              946760 incoming packets delivered
              18743456 requests sent out
              101 fragments dropped after timeout
              282077129 reassemblies required
              944952 packets reassembled ok
              262734239 packet reassembles failed
         (The numbers/stats above are somewhat better re:
          reassemblies vs a kernel without this patchset. More
          comprehensive performance testing TBD).
      Reported-by: Jann Horn <jannh@google.com>
      Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: icmp: Updating pmtu for link local route · 5f379ef5
      Georg Kohmann committed
      When an ICMPV6_PKT_TOOBIG is received from a link local address the pmtu will
      be updated on a route with an arbitrary interface index. Subsequent packets
      sent back to the same link local address may therefore end up not
      considering the updated pmtu.
      
      Current behavior breaks TAHI v6LC4.1.4 Reduce PMTU On-link. Referring to RFC
      1981: Section 3: "Note that Path MTU Discovery must be performed even in
      cases where a node "thinks" a destination is attached to the same link as
      itself. In a situation such as when a neighboring router acts as proxy [ND]
      for some destination, the destination can to appear to be directly
      connected but is in fact more than one hop away."
      
      Use the interface index from the incoming ICMPV6_PKT_TOOBIG when updating
      the pmtu.
      Signed-off-by: Georg Kohmann <geokohma@cisco.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 02 Aug, 2018 2 commits
  19. 31 Jul, 2018 2 commits
  20. 30 Jul, 2018 1 commit
  21. 25 Jul, 2018 3 commits
  22. 24 Jul, 2018 3 commits
    • ip: hash fragments consistently · 3dd1c9a1
      Paolo Abeni committed
      The skb hash for locally generated ip[v6] fragments belonging
      to the same datagram can vary in several circumstances:
      * for connected UDP[v6] sockets, the first fragment gets its hash
        via set_owner_w()/skb_set_hash_from_sk()
      * for unconnected IPv6 UDPv6 sockets, the first fragment can get
        its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if
        auto_flowlabel is enabled
      
      For the following frags the hash is usually computed via
      skb_get_hash().
      The above can cause out-of-order (OoO) delivery for unconnected IPv6 UDPv6
      sockets: in that scenario the egress tx queue can be selected on a per
      packet basis via the skb hash.
      It may also fool flow-oriented schedulers into placing fragments belonging
      to the same datagram in different flows.
      
      Fix the issue by copying the skb hash from the head frag into
      the others at fragmentation time.
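
The fix can be pictured as copying three fields from the head fragment to every later fragment. This is a sketch on a hypothetical `struct fake_frag` that only carries the fields shown in the perf output below (hash, l4_hash, sw_hash); it is not the kernel's sk_buff or its helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced view of the hash-related skb fields. */
struct fake_frag {
    unsigned int hash;
    unsigned int l4_hash:1;
    unsigned int sw_hash:1;
};

/* Give a later fragment the same hash state as the head fragment. */
static void fake_copy_hash(struct fake_frag *to, const struct fake_frag *from)
{
    assert(to != NULL && from != NULL);
    to->hash = from->hash;
    to->l4_hash = from->l4_hash;
    to->sw_hash = from->sw_hash;
}
```

Once every fragment carries the head fragment's hash, tx queue selection and flow-oriented schedulers see the whole datagram as one flow.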
      
      Before this commit:
      perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 skb->sw_hash:b1@1/8"
      netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n &
      perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1
      perf script
      probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0
      probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=0 l4_hash=0 sw_hash=0
      
      After this commit:
      probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
      probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
      
      Fixes: b73c3d0e ("net: Save TX flow hash in sock and set in skbuf on xmit")
      Fixes: 67800f9b ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: use fib6_info_hold_safe() when necessary · e873e4b9
      Wei Wang committed
      In code paths where only the rcu read lock is held, e.g. the route
      lookup code path, it is not safe to call fib6_info_hold() directly
      because the fib6_info may already have been deleted but still exist
      within the rcu grace period. Holding a reference to it could cause a double
      free and crash the kernel.
      
      This patch adds a new function fib6_info_hold_safe() and uses it to replace
      fib6_info_hold() in all necessary places.
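
The idea behind a "safe" hold is to take a reference only if the count has not already dropped to zero. The following user-space sketch uses C11 atomics and hypothetical names (`struct fake_info`, `fake_hold_safe()`); it illustrates the increment-unless-zero pattern and is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object. */
struct fake_info {
    atomic_int ref;
};

/*
 * Try to take a reference. Returns false if the count is already zero,
 * meaning the object is being torn down and must not be revived.
 */
static bool fake_hold_safe(struct fake_info *f)
{
    int old;

    assert(f != NULL);
    old = atomic_load(&f->ref);
    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&f->ref, &old, old + 1))
            return true;
    }
    return false;
}
```

A caller that gets `false` back has to treat the entry as gone instead of using it, which avoids the negative refcounts visible in the trace below.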
      
      Syzbot reported 3 crash traces because of this. One of them is:
      8021q: adding VLAN 0 to HW filter on device team0
      IPv6: ADDRCONF(NETDEV_CHANGE): team0: link becomes ready
      dst_release: dst:(____ptrval____) refcnt:-1
      dst_release: dst:(____ptrval____) refcnt:-2
      WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 dst_hold include/net/dst.h:239 [inline]
      WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204
      dst_release: dst:(____ptrval____) refcnt:-1
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 4845 Comm: syz-executor493 Not tainted 4.18.0-rc3+ #10
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
       panic+0x238/0x4e7 kernel/panic.c:184
      dst_release: dst:(____ptrval____) refcnt:-2
      dst_release: dst:(____ptrval____) refcnt:-3
       __warn.cold.8+0x163/0x1ba kernel/panic.c:536
      dst_release: dst:(____ptrval____) refcnt:-4
       report_bug+0x252/0x2d0 lib/bug.c:186
       fixup_bug arch/x86/kernel/traps.c:178 [inline]
       do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
      dst_release: dst:(____ptrval____) refcnt:-5
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
       invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
      RIP: 0010:dst_hold include/net/dst.h:239 [inline]
      RIP: 0010:ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204
      Code: c1 ed 03 89 9d 18 ff ff ff 48 b8 00 00 00 00 00 fc ff df 41 c6 44 05 00 f8 e9 2d 01 00 00 4c 8b a5 c8 fe ff ff e8 1a f6 e6 fa <0f> 0b e9 6a fc ff ff e8 0e f6 e6 fa 48 8b 85 d0 fe ff ff 48 8d 78
      RSP: 0018:ffff8801a8fcf178 EFLAGS: 00010293
      RAX: ffff8801a8eba5c0 RBX: 0000000000000000 RCX: ffffffff869511e6
      RDX: 0000000000000000 RSI: ffffffff869515b6 RDI: 0000000000000005
      RBP: ffff8801a8fcf2c8 R08: ffff8801a8eba5c0 R09: ffffed0035ac8338
      R10: ffffed0035ac8338 R11: ffff8801ad6419c3 R12: ffff8801a8fcf720
      R13: ffff8801a8fcf6a0 R14: ffff8801ad6419c0 R15: ffff8801ad641980
       ip6_make_skb+0x2c8/0x600 net/ipv6/ip6_output.c:1768
       udpv6_sendmsg+0x2c90/0x35f0 net/ipv6/udp.c:1376
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:641 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:651
       ___sys_sendmsg+0x51d/0x930 net/socket.c:2125
       __sys_sendmmsg+0x240/0x6f0 net/socket.c:2220
       __do_sys_sendmmsg net/socket.c:2249 [inline]
       __se_sys_sendmmsg net/socket.c:2246 [inline]
       __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2246
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x446ba9
      Code: e8 cc bb 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fb39a469da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00000000006dcc54 RCX: 0000000000446ba9
      RDX: 00000000000000b8 RSI: 0000000020001b00 RDI: 0000000000000003
      RBP: 00000000006dcc50 R08: 00007fb39a46a700 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 45c828efc7a64843
      R13: e6eeb815b9d8a477 R14: 5068caf6f713c6fc R15: 0000000000000001
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
      Reported-by: syzbot+902e2a1bcd4f7808cef5@syzkaller.appspotmail.com
      Reported-by: syzbot+8ae62d67f647abeeceb9@syzkaller.appspotmail.com
      Reported-by: syzbot+3f08feb14086930677d0@syzkaller.appspotmail.com
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e873e4b9
    • Y
      ipv6: sr: Use kmemdup instead of duplicating it in parse_nla_srh · 7fa41efa
      YueHaibing committed
      Replace calls to kmalloc followed by a memcpy with a direct call to
      kmemdup.
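      The pattern being replaced can be sketched in userspace C. This is only
      an illustration, not the code from parse_nla_srh: memdup_sketch is a
      hypothetical stand-in for the kernel's kmemdup (which also takes GFP
      flags), and malloc stands in for kmalloc.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for kmemdup(): allocate len bytes and
 * copy src into them. Returns NULL if the allocation fails. */
static void *memdup_sketch(const void *src, size_t len)
{
    void *p = malloc(len);

    if (p)
        memcpy(p, src, len);
    return p;
}

/* Before: open-coded allocate-then-copy (malloc stands in for kmalloc). */
static void *dup_open_coded(const void *src, size_t len)
{
    void *p = malloc(len);

    if (!p)
        return NULL;
    memcpy(p, src, len);
    return p;
}

/* After: a single call to the duplication helper. */
static void *dup_with_helper(const void *src, size_t len)
{
    return memdup_sketch(src, len);
}
```

      Both variants return an independent copy or NULL on allocation failure;
      the helper form simply removes the repeated boilerplate.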
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7fa41efa
  23. 22 Jul, 2018 3 commits
    • H
      multicast: do not restore deleted record source filter mode to new one · 08d3ffcc
      Hangbin Liu committed
      There are two scenarios in which we restore deleted records. The first
      is when a device goes down and up (or unmap/remap). In this scenario the
      new filter mode is the same as the previous one, because we get it from
      in_dev->mc_list and do not touch it while the device goes down and up.
      
      The other scenario is when a new socket joins a group that was just
      deleted and has not finished sending status reports. In this scenario,
      we should use the current filter mode instead of restoring the old one.
      There are 4 cases in total.
      
      old_socket        new_socket       before_fix       after_fix
        IN(A)             IN(A)           ALLOW(A)         ALLOW(A)
        IN(A)             EX( )           TO_IN( )         TO_EX( )
        EX( )             IN(A)           TO_EX( )         ALLOW(A)
        EX( )             EX( )           TO_EX( )         TO_EX( )
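      The four cases in the table can be encoded as a small lookup. This is a
      sketch of the table only, not the kernel's IGMP/MLD code; the enum and
      function names are hypothetical and chosen for illustration.

```c
/* Filter modes and report record types named in the table (hypothetical
 * enum names, not the kernel's constants). */
enum filter_mode { MODE_INCLUDE, MODE_EXCLUDE };
enum record_type { REC_ALLOW, REC_TO_IN, REC_TO_EX };

/* before_fix column: the outcome still depends on the old socket's mode. */
static enum record_type report_before_fix(enum filter_mode old_mode,
                                          enum filter_mode new_mode)
{
    if (old_mode == MODE_EXCLUDE)
        return REC_TO_EX;
    return new_mode == MODE_INCLUDE ? REC_ALLOW : REC_TO_IN;
}

/* after_fix column: only the new socket's (current) mode matters. */
static enum record_type report_after_fix(enum filter_mode old_mode,
                                         enum filter_mode new_mode)
{
    (void)old_mode; /* deliberately ignored after the fix */
    return new_mode == MODE_INCLUDE ? REC_ALLOW : REC_TO_EX;
}
```

      The two functions differ in rows 2 and 3 of the table, which are the
      cases where the old and new sockets use different filter modes.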
      
      Fixes: 24803f38 (igmp: do not remove igmp souce list info when set link down)
      Fixes: 1666d49e (mld: do not remove mld souce list info when set link down)
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08d3ffcc
    • H
      multicast: remove useless parameter for group add · 0ae0d60a
      Hangbin Liu committed
      Remove the mode parameter for igmp/igmp6_group_added, as we can get it
      from the first parameter.
      
      Fixes: 6e2059b5 (ipv4/igmp: init group mode as INCLUDE when join source group)
      Fixes: c7ea20c9 (ipv6/mcast: init as INCLUDE when join SSM INCLUDE group)
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ae0d60a
    • D
      net/ipv6: Fix linklocal to global address with VRF · 24b711ed
      David Ahern committed
      Example setup:
          host: ip -6 addr add dev eth1 2001:db8:104::4
                 where eth1 is enslaved to a VRF
      
          switch: ip -6 ro add 2001:db8:104::4/128 dev br1
                  where br1 only has an LLA
      
                 ping6 2001:db8:104::4
                 ssh   2001:db8:104::4
      
      (NOTE: UDP works fine if the PKTINFO has the address set to the global
      address and ifindex is set to the index of eth1, with an LLA as the
      destination).
      
      For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
      L3 master. If it is then return the ifindex from rt6i_idev similar
      to what is done for loopback.
      
      For TCP, restore the original tcp_v6_iif definition which is needed in
      most places and add a new tcp_v6_iif_l3_slave that considers the
      l3_slave variability. This latter check is only needed for socket
      lookups.
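      The ICMP part of the change can be modelled in userspace C. This is a
      simplified sketch under stated assumptions, not the kernel function:
      dev_sketch and skb_sketch are hypothetical stand-ins for the kernel
      structures, and route_dev stands in for the device reached through
      rt6i_idev.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a network device. */
struct dev_sketch {
    int ifindex;
    bool is_loopback;
    bool is_l3_master; /* true for a VRF (L3 master) device */
};

/* Hypothetical stand-in for a packet buffer. */
struct skb_sketch {
    const struct dev_sketch *dev;       /* device the packet is seen on */
    const struct dev_sketch *route_dev; /* device behind the route, or NULL */
};

/* Pick the input interface index: for loopback or an L3 master device,
 * prefer the device attached to the route when one is present. */
static int iif_sketch(const struct skb_sketch *skb)
{
    int iif = skb->dev->ifindex;

    if ((skb->dev->is_loopback || skb->dev->is_l3_master) && skb->route_dev)
        iif = skb->route_dev->ifindex;
    return iif;
}
```

      With a packet seen on the VRF device and a route pointing at the
      enslaved interface, the sketch returns the enslaved interface's index
      rather than the VRF's.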
      
      Fixes: 9ff74384 ("net: vrf: Handle ipv6 multicast and link-local addresses")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      24b711ed
  24. 19 Jul, 2018 1 commit