1. 05 8月, 2020 1 次提交
    • S
      tunnels: PMTU discovery support for directly bridged IP packets · 4cb47a86
      Stefano Brivio 提交于
      It's currently possible to bridge Ethernet tunnels carrying IP
      packets directly to external interfaces without assigning them
      addresses and routes on the bridged network itself: this is the case
      for UDP tunnels bridged with a standard bridge or by Open vSwitch.
      
      PMTU discovery is currently broken with those configurations, because
      the encapsulation effectively decreases the MTU of the link, and
      while we are able to account for this using PMTU discovery on the
      lower layer, we don't have a way to relay ICMP or ICMPv6 messages
      needed by the sender, because we don't have valid routes to it.
      
      On the other hand, as a tunnel endpoint, we can't fragment packets
      as a general approach: this is for instance clearly forbidden for
      VXLAN by RFC 7348, section 4.3:
      
         VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
         fragment encapsulated VXLAN packets due to the larger frame size.
         The destination VTEP MAY silently discard such VXLAN fragments.
      
      The same paragraph recommends that the MTU over the physical network
      accomodates for encapsulations, but this isn't a practical option for
      complex topologies, especially for typical Open vSwitch use cases.
      
      Further, it states that:
      
         Other techniques like Path MTU discovery (see [RFC1191] and
         [RFC1981]) MAY be used to address this requirement as well.
      
      Now, PMTU discovery already works for routed interfaces, we get
      route exceptions created by the encapsulation device as they receive
      ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
      we already rebuild those messages with the appropriate MTU and route
      them back to the sender.
      
      Add the missing bits for bridged cases:
      
      - checks in skb_tunnel_check_pmtu() to understand if it's appropriate
        to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
        RFC 4443 section 2.4 for ICMPv6. This function is already called by
        UDP tunnels
      
      - a new function generating those ICMP or ICMPv6 replies. We can't
        reuse icmp_send() and icmp6_send() as we don't see the sender as a
        valid destination. This doesn't need to be generic, as we don't
        cover any other type of ICMP errors given that we only provide an
        encapsulation function to the sender
      
      While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
      we might receive GSO buffers here, and the passed headroom already
      includes the inner MAC length, so we don't have to account for it
      a second time (that would imply three MAC headers on the wire, but
      there are just two).
      
      This issue became visible while bridging IPv6 packets with 4500 bytes
      of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
      bytes of encapsulation headroom, we would advertise MTU as 3950, and
      we would reject fragmented IPv6 datagrams of 3958 bytes size on the
      wire. We're exclusively dealing with network MTU here, though, so we
      could get Ethernet frames up to 3964 octets in that case.
      
      v2:
      - moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
      - split IPv4/IPv6 functions (David Ahern)
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cb47a86
  2. 08 7月, 2020 1 次提交
    • M
      net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb · 394de110
      Martin Varghese 提交于
      The packets from tunnel devices (eg bareudp) may have only
      metadata in the dst pointer of skb. Hence a pointer check of
      neigh_lookup is needed in dst_neigh_lookup_skb
      
      Kernel crashes when packets from bareudp device is processed in
      the kernel neighbour subsytem.
      
      [  133.384484] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [  133.385240] #PF: supervisor instruction fetch in kernel mode
      [  133.385828] #PF: error_code(0x0010) - not-present page
      [  133.386603] PGD 0 P4D 0
      [  133.386875] Oops: 0010 [#1] SMP PTI
      [  133.387275] CPU: 0 PID: 5045 Comm: ping Tainted: G        W         5.8.0-rc2+ #15
      [  133.388052] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  133.391076] RIP: 0010:0x0
      [  133.392401] Code: Bad RIP value.
      [  133.394029] RSP: 0018:ffffb79980003d50 EFLAGS: 00010246
      [  133.396656] RAX: 0000000080000102 RBX: ffff9de2fe0d6600 RCX: ffff9de2fe5e9d00
      [  133.399018] RDX: 0000000000000000 RSI: ffff9de2fe5e9d00 RDI: ffff9de2fc21b400
      [  133.399685] RBP: ffff9de2fe5e9d00 R08: 0000000000000000 R09: 0000000000000000
      [  133.400350] R10: ffff9de2fbc6be22 R11: ffff9de2fe0d6600 R12: ffff9de2fc21b400
      [  133.401010] R13: ffff9de2fe0d6628 R14: 0000000000000001 R15: 0000000000000003
      [  133.401667] FS:  00007fe014918740(0000) GS:ffff9de2fec00000(0000) knlGS:0000000000000000
      [  133.402412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  133.402948] CR2: ffffffffffffffd6 CR3: 000000003bb72000 CR4: 00000000000006f0
      [  133.403611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  133.404270] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  133.404933] Call Trace:
      [  133.405169]  <IRQ>
      [  133.405367]  __neigh_update+0x5a4/0x8f0
      [  133.405734]  arp_process+0x294/0x820
      [  133.406076]  ? __netif_receive_skb_core+0x866/0xe70
      [  133.406557]  arp_rcv+0x129/0x1c0
      [  133.406882]  __netif_receive_skb_one_core+0x95/0xb0
      [  133.407340]  process_backlog+0xa7/0x150
      [  133.407705]  net_rx_action+0x2af/0x420
      [  133.408457]  __do_softirq+0xda/0x2a8
      [  133.408813]  asm_call_on_stack+0x12/0x20
      [  133.409290]  </IRQ>
      [  133.409519]  do_softirq_own_stack+0x39/0x50
      [  133.410036]  do_softirq+0x50/0x60
      [  133.410401]  __local_bh_enable_ip+0x50/0x60
      [  133.410871]  ip_finish_output2+0x195/0x530
      [  133.411288]  ip_output+0x72/0xf0
      [  133.411673]  ? __ip_finish_output+0x1f0/0x1f0
      [  133.412122]  ip_send_skb+0x15/0x40
      [  133.412471]  raw_sendmsg+0x853/0xab0
      [  133.412855]  ? insert_pfn+0xfe/0x270
      [  133.413827]  ? vvar_fault+0xec/0x190
      [  133.414772]  sock_sendmsg+0x57/0x80
      [  133.415685]  __sys_sendto+0xdc/0x160
      [  133.416605]  ? syscall_trace_enter+0x1d4/0x2b0
      [  133.417679]  ? __audit_syscall_exit+0x1d9/0x280
      [  133.418753]  ? __prepare_exit_to_usermode+0x5d/0x1a0
      [  133.419819]  __x64_sys_sendto+0x24/0x30
      [  133.420848]  do_syscall_64+0x4d/0x90
      [  133.421768]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  133.422833] RIP: 0033:0x7fe013689c03
      [  133.423749] Code: Bad RIP value.
      [  133.424624] RSP: 002b:00007ffc7288f418 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [  133.425940] RAX: ffffffffffffffda RBX: 000056151fc63720 RCX: 00007fe013689c03
      [  133.427225] RDX: 0000000000000040 RSI: 000056151fc63720 RDI: 0000000000000003
      [  133.428481] RBP: 00007ffc72890b30 R08: 000056151fc60500 R09: 0000000000000010
      [  133.429757] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
      [  133.431041] R13: 000056151fc636e0 R14: 000056151fc616bc R15: 0000000000000080
      [  133.432481] Modules linked in: mpls_iptunnel act_mirred act_tunnel_key cls_flower sch_ingress veth mpls_router ip_tunnel bareudp ip6_udp_tunnel udp_tunnel macsec udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc ebtable_filter ebtables overlay ip6table_filter ip6_tables iptable_filter sunrpc ext4 mbcache jbd2 pcspkr i2c_piix4 virtio_balloon joydev ip_tables xfs libcrc32c ata_generic qxl pata_acpi drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ata_piix libata virtio_net net_failover virtio_console failover virtio_blk i2c_core virtio_pci virtio_ring serio_raw floppy virtio dm_mirror dm_region_hash dm_log dm_mod
      [  133.444045] CR2: 0000000000000000
      [  133.445082] ---[ end trace f4aeee1958fd1638 ]---
      [  133.446236] RIP: 0010:0x0
      [  133.447180] Code: Bad RIP value.
      [  133.448152] RSP: 0018:ffffb79980003d50 EFLAGS: 00010246
      [  133.449363] RAX: 0000000080000102 RBX: ffff9de2fe0d6600 RCX: ffff9de2fe5e9d00
      [  133.450835] RDX: 0000000000000000 RSI: ffff9de2fe5e9d00 RDI: ffff9de2fc21b400
      [  133.452237] RBP: ffff9de2fe5e9d00 R08: 0000000000000000 R09: 0000000000000000
      [  133.453722] R10: ffff9de2fbc6be22 R11: ffff9de2fe0d6600 R12: ffff9de2fc21b400
      [  133.455149] R13: ffff9de2fe0d6628 R14: 0000000000000001 R15: 0000000000000003
      [  133.456520] FS:  00007fe014918740(0000) GS:ffff9de2fec00000(0000) knlGS:0000000000000000
      [  133.458046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  133.459342] CR2: ffffffffffffffd6 CR3: 000000003bb72000 CR4: 00000000000006f0
      [  133.460782] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  133.462240] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  133.463697] Kernel panic - not syncing: Fatal exception in interrupt
      [  133.465226] Kernel Offset: 0xfa00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  133.467025] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
      
      Fixes: aaa0c23c ("Fix dst_neigh_lookup/dst_neigh_lookup_skb return value handling bug")
      Signed-off-by: NMartin Varghese <martin.varghese@nokia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      394de110
  3. 24 3月, 2020 1 次提交
  4. 25 12月, 2019 3 次提交
    • H
      net/dst: do not confirm neighbor for vxlan and geneve pmtu update · f081042d
      Hangbin Liu 提交于
      When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
      we should not call dst_confirm_neigh() as there is no two-way communication.
      
      So disable the neigh confirm for vxlan and geneve pmtu update.
      
      v5: No change.
      v4: No change.
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      
      Fixes: a93bf0ff ("vxlan: update skb dst pmtu on tx path")
      Fixes: 52a589d5 ("geneve: update skb dst pmtu on tx path")
      Reviewed-by: NGuillaume Nault <gnault@redhat.com>
      Tested-by: NGuillaume Nault <gnault@redhat.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f081042d
    • H
      net/dst: add new function skb_dst_update_pmtu_no_confirm · 07dc35c6
      Hangbin Liu 提交于
      Add a new function skb_dst_update_pmtu_no_confirm() for callers who need
      update pmtu but should not do neighbor confirm.
      
      v5: No change.
      v4: No change.
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      Reviewed-by: NGuillaume Nault <gnault@redhat.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      07dc35c6
    • H
      net: add bool confirm_neigh parameter for dst_ops.update_pmtu · bd085ef6
      Hangbin Liu 提交于
      The MTU update code is supposed to be invoked in response to real
      networking events that update the PMTU. In IPv6 PMTU update function
      __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
      confirmed time.
      
      But for tunnel code, it will call pmtu before xmit, like:
        - tnl_update_pmtu()
          - skb_dst_update_pmtu()
            - ip6_rt_update_pmtu()
              - __ip6_rt_update_pmtu()
                - dst_confirm_neigh()
      
      If the tunnel remote dst mac address changed and we still do the neigh
      confirm, we will not be able to update neigh cache and ping6 remote
      will failed.
      
      So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
      should not be invoking dst_confirm_neigh() as we have no evidence
      of successful two-way communication at this point.
      
      On the other hand it is also important to keep the neigh reachability fresh
      for TCP flows, so we cannot remove this dst_confirm_neigh() call.
      
      To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
      to choose whether we should do neigh update or not. I will add the parameter
      in this patch and set all the callers to true to comply with the previous
      way, and fix the tunnel code one by one on later patches.
      
      v5: No change.
      v4: No change.
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      Suggested-by: NDavid Miller <davem@davemloft.net>
      Reviewed-by: NGuillaume Nault <gnault@redhat.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd085ef6
  5. 21 12月, 2019 1 次提交
    • G
      net: dst: Force 4-byte alignment of dst_metrics · 258a980d
      Geert Uytterhoeven 提交于
      When storing a pointer to a dst_metrics structure in dst_entry._metrics,
      two flags are added in the least significant bits of the pointer value.
      Hence this assumes all pointers to dst_metrics structures have at least
      4-byte alignment.
      
      However, on m68k, the minimum alignment of 32-bit values is 2 bytes, not
      4 bytes.  Hence in some kernel builds, dst_default_metrics may be only
      2-byte aligned, leading to obscure boot warnings like:
      
          WARNING: CPU: 0 PID: 7 at lib/refcount.c:28 refcount_warn_saturate+0x44/0x9a
          refcount_t: underflow; use-after-free.
          Modules linked in:
          CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: G        W         5.5.0-rc2-atari-01448-g114a1a1038af891d-dirty #261
          Stack from 10835e6c:
      	    10835e6c 0038134f 00023fa6 00394b0f 0000001c 00000009 00321560 00023fea
      	    00394b0f 0000001c 001a70f8 00000009 00000000 10835eb4 00000001 00000000
      	    04208040 0000000a 00394b4a 10835ed4 00043aa8 001a70f8 00394b0f 0000001c
      	    00000009 00394b4a 0026aba8 003215a4 00000003 00000000 0026d5a8 00000001
      	    003215a4 003a4361 003238d6 000001f0 00000000 003215a4 10aa3b00 00025e84
      	    003ddb00 10834000 002416a8 10aa3b00 00000000 00000080 000aa038 0004854a
          Call Trace: [<00023fa6>] __warn+0xb2/0xb4
           [<00023fea>] warn_slowpath_fmt+0x42/0x64
           [<001a70f8>] refcount_warn_saturate+0x44/0x9a
           [<00043aa8>] printk+0x0/0x18
           [<001a70f8>] refcount_warn_saturate+0x44/0x9a
           [<0026aba8>] refcount_sub_and_test.constprop.73+0x38/0x3e
           [<0026d5a8>] ipv4_dst_destroy+0x5e/0x7e
           [<00025e84>] __local_bh_enable_ip+0x0/0x8e
           [<002416a8>] dst_destroy+0x40/0xae
      
      Fix this by forcing 4-byte alignment of all dst_metrics structures.
      
      Fixes: e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely")
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      258a980d
  6. 02 7月, 2019 1 次提交
  7. 30 6月, 2019 1 次提交
    • F
      net: make skb_dst_force return true when dst is refcounted · b60a7738
      Florian Westphal 提交于
      netfilter did not expect that skb_dst_force() can cause skb to lose its
      dst entry.
      
      I got a bug report with a skb->dst NULL dereference in netfilter
      output path.  The backtrace contains nf_reinject(), so the dst might have
      been cleared when skb got queued to userspace.
      
      Other users were fixed via
      if (skb_dst(skb)) {
      	skb_dst_force(skb);
      	if (!skb_dst(skb))
      		goto handle_err;
      }
      
      But I think its preferable to make the 'dst might be cleared' part
      of the function explicit.
      
      In netfilter case, skb with a null dst is expected when queueing in
      prerouting hook, so drop skb for the other hooks.
      
      v2:
       v1 of this patch returned true in case skb had no dst entry.
       Eric said:
         Say if we have two skb_dst_force() calls for some reason
         on the same skb, only the first one will return false.
      
       This now returns false even when skb had no dst, as per Erics
       suggestion, so callers might need to check skb_dst() first before
       skb_dst_force().
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b60a7738
  8. 22 3月, 2019 1 次提交
  9. 18 10月, 2018 1 次提交
  10. 20 7月, 2018 1 次提交
    • B
      xfrm: Remove xfrmi interface ID from flowi · bc56b334
      Benedict Wong 提交于
      In order to remove performance impact of having the extra u32 in every
      single flowi, this change removes the flowi_xfrm struct, prefering to
      take the if_id as a method parameter where needed.
      
      In the inbound direction, if_id is only needed during the
      __xfrm_check_policy() function, and the if_id can be determined at that
      point based on the skb. As such, xfrmi_decode_session() is only called
      with the skb in __xfrm_check_policy().
      
      In the outbound direction, the only place where if_id is needed is the
      xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is
      directly passed into the xfrm_lookup_with_ifid() call. All existing
      callers can still call xfrm_lookup(), which uses a default if_id of 0.
      
      This change does not change any behavior of XFRMIs except for improving
      overall system performance via flowi size reduction.
      
      This change has been tested against the Android Kernel Networking Tests:
      
      https://android.googlesource.com/kernel/tests/+/master/net/testSigned-off-by: NBenedict Wong <benedictwong@google.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      bc56b334
  11. 06 3月, 2018 1 次提交
  12. 26 1月, 2018 1 次提交
  13. 30 11月, 2017 8 次提交
  14. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  15. 27 10月, 2017 1 次提交
    • P
      net: updating dst lastusage is an unlikely event. · 32d18ab1
      Paolo Abeni 提交于
      Since commit 0da4af00 ("ipv6: only update __use and lastusetime
      once per jiffy at most"), updating the dst lastuse field is an
      unlikely action: it happens at most once per jiffy, out of
      potentially millions of calls per second.
      
      Mark explicitly the code as such, and let the compiler generate
      better code.
      
      Note: gcc 7.2 and several older versions do actually generate
      different - better - code when the unlikely() hint is in place,
      avoid jump in the fast path and keeping better code locality.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32d18ab1
  16. 17 10月, 2017 1 次提交
    • W
      ipv6: only update __use and lastusetime once per jiffy at most · 0da4af00
      Wei Wang 提交于
      In order to not dirty the cacheline too often, we try to only update
      dst->__use and dst->lastusetime at most once per jiffy.
      As dst->lastusetime is only used by ipv6 garbage collector, it should
      be good enough time resolution.
      And __use is only used in ipv6_route_seq_show() to show how many times a
      dst has been used. And as __use is not atomic_t right now, it does not
      show the precise number of usage times anyway. So we think it should be
      OK to only update it at most once per jiffy.
      
      According to my latest syn flood test on a machine with intel Xeon 6th
      gen processor and 2 10G mlx nics bonded together, each with 8 rx queues
      on 2 NUMA nodes:
      With this patch, the packet process rate increases from ~3.49Mpps to
      ~3.75Mpps with a 7% increase rate.
      
      Note: dst_use() is being renamed to dst_hold_and_use() to better specify
      the purpose of the function.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@googl.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0da4af00
  17. 08 10月, 2017 1 次提交
    • W
      ipv6: replace rwlock with rcu and spinlock in fib6_table · 66f5d6ce
      Wei Wang 提交于
      With all the preparation work before, we are now ready to replace rwlock
      with rcu and spinlock in fib6_table.
      That means now all fib6_node in fib6_table are protected by rcu. And
      when freeing fib6_node, call_rcu() is used to wait for the rcu grace
      period before releasing the memory.
      When accessing fib6_node, corresponding rcu APIs need to be used.
      And all previous sessions protected by the write lock will now be
      protected by the spin lock per table.
      All previous sessions protected by read lock will now be protected by
      rcu_read_lock().
      
      A couple of things to note here:
      1. As part of the work of replacing rwlock with rcu, the linked list of
      fn->leaf now has to be rcu protected as well. So both fn->leaf and
      rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are
      used when manipulating them.
      
      2. For fn->rr_ptr, first of all, it also needs to be rcu protected now
      and is tagged with __rcu and rcu APIs are used in corresponding places.
      Secondly, fn->rr_ptr is changed in rt6_select() which is a reader
      thread. This makes the issue a bit complicated. We think a valid
      solution for it is to let rt6_select() grab the tb6_lock if it decides
      to change it. As it is not in the normal operation and only happens when
      there is no valid neighbor cache for the route, we think the performance
      impact should be low.
      
      3. fib6_walk_continue() has to be called with tb6_lock held even in the
      route dumping related functions, e.g. inet6_dump_fib(),
      fib6_tables_dump() and ipv6_route_seq_ops. It is because
      fib6_walk_continue() makes modifications to the walker structure, and so
      are fib6_repair_tree() and fib6_del_route(). In order to do proper
      syncing between them, we need to let fib6_walk_continue() hold the lock.
      We may be able to do further improvement on the way we do the tree walk
      to get rid of the need for holding the spin lock. But not for now.
      
      4. When fib6_del_route() removes a route from the tree, we no longer
      mark rt->dst.rt6_next to NULL to make simultaneous reader be able to
      further traverse the list with rcu. However, rt->dst.rt6_next is only
      valid within this same rcu period. No one should access it later.
      
      5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be
      performed before we publish this route (either by linking it to fn->leaf
      or insert it in the list pointed by fn->leaf) just to be safe because as
      soon as we publish the route, some read thread will be able to access it.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      66f5d6ce
  18. 22 9月, 2017 1 次提交
    • E
      net: prevent dst uses after free · 222d7dbd
      Eric Dumazet 提交于
      In linux-4.13, Wei worked hard to convert dst to a traditional
      refcounted model, removing GC.
      
      We now want to make sure a dst refcount can not transition from 0 back
      to 1.
      
      The problem here is that input path attached a not refcounted dst to an
      skb. Then later, because packet is forwarded and hits skb_dst_force()
      before exiting RCU section, we might try to take a refcount on one dst
      that is about to be freed, if another cpu saw 1 -> 0 transition in
      dst_release() and queued the dst for freeing after one RCU grace period.
      
      Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
      always perform the complete check against dst refcount, and not assume
      it is not zero.
      
      Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
      
      [  989.919496]  skb_dst_force+0x32/0x34
      [  989.919498]  __dev_queue_xmit+0x1ad/0x482
      [  989.919501]  ? eth_header+0x28/0xc6
      [  989.919502]  dev_queue_xmit+0xb/0xd
      [  989.919504]  neigh_connected_output+0x9b/0xb4
      [  989.919507]  ip_finish_output2+0x234/0x294
      [  989.919509]  ? ipt_do_table+0x369/0x388
      [  989.919510]  ip_finish_output+0x12c/0x13f
      [  989.919512]  ip_output+0x53/0x87
      [  989.919513]  ip_forward_finish+0x53/0x5a
      [  989.919515]  ip_forward+0x2cb/0x3e6
      [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
      [  989.919518]  ip_rcv_finish+0x2e2/0x321
      [  989.919519]  ip_rcv+0x26f/0x2eb
      [  989.919522]  ? vlan_do_receive+0x4f/0x289
      [  989.919523]  __netif_receive_skb_core+0x467/0x50b
      [  989.919526]  ? tcp_gro_receive+0x239/0x239
      [  989.919529]  ? inet_gro_receive+0x226/0x238
      [  989.919530]  __netif_receive_skb+0x4d/0x5f
      [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
      [  989.919533]  napi_gro_receive+0x45/0x81
      [  989.919536]  ixgbe_poll+0xc8a/0xf09
      [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
      [  989.919540]  net_rx_action+0xf4/0x266
      [  989.919543]  __do_softirq+0xa8/0x19d
      [  989.919545]  irq_exit+0x5d/0x6b
      [  989.919546]  do_IRQ+0x9c/0xb5
      [  989.919548]  common_interrupt+0x93/0x93
      [  989.919548]  </IRQ>
      
      Similarly dst_clone() can use dst_hold() helper to have additional
      debugging, as a follow up to commit 44ebe791 ("net: add debug
      atomic_inc_not_zero() in dst_hold()")
      
      In net-next we will convert dst atomic_t to refcount_t for peace of
      mind.
      
      Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reported-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Bisected-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Acked-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      222d7dbd
  19. 19 8月, 2017 1 次提交
  20. 18 6月, 2017 8 次提交
    • W
      net: add debug atomic_inc_not_zero() in dst_hold() · 44ebe791
      Wei Wang 提交于
      This patch is meant to add a debug warning on the situation where dst is
      being held during its destroy phase. This could potentially cause double
      free issue on the dst.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44ebe791
    • W
      net: reorder all the dst flags · 1eb04e7c
      Wei Wang 提交于
      As some dst flags are removed, reorder the dst flags to fill in the
      blanks.
      Note: these flags are not exposed into user space. So it is safe to
      reorder.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1eb04e7c
    • W
      net: remove DST_NOCACHE flag · a4c2fd7f
      Wei Wang 提交于
      DST_NOCACHE flag check has been removed from dst_release() and
      dst_hold_safe() in a previous patch because all the dst are now ref
      counted properly and can be released based on refcnt only.
      Looking at the rest of the DST_NOCACHE use, all of them can now be
      removed or replaced with other checks.
      So this patch gets rid of all the DST_NOCACHE usage and remove this flag
      completely.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a4c2fd7f
    • W
      net: remove DST_NOGC flag · b2a9c0ed
      Wei Wang 提交于
      Now that all the components have been changed to release dst based on
      refcnt only and not depend on dst gc anymore, we can remove the
      temporary flag DST_NOGC.
      
      Note that we also need to remove the DST_NOCACHE check in dst_release()
      and dst_hold_safe() because now all the dst are released based on refcnt
      and behaves as DST_NOCACHE.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2a9c0ed
    • W
      net: remove dst gc related code · 5b7c9a8f
      Wei Wang 提交于
      This patch removes all dst gc related code and all the dst free
      functions
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b7c9a8f
    • W
      xfrm: take refcnt of dst when creating struct xfrm_dst bundle · 52df157f
      Wei Wang 提交于
      During the creation of xfrm_dst bundle, always take ref count when
      allocating the dst. This way, xfrm_bundle_create() will form a linked
      list of dst with dst->child pointing to a ref counted dst child. And
      the returned dst pointer is also ref counted. This makes the link from
      the flow cache to this dst now ref counted properly.
      As the dst is always ref counted properly, we can safely mark
      DST_NOGC flag so dst_release() will release dst based on refcnt only.
      And dst gc is no longer needed and all dst_free() and its related
      function calls should be replaced with dst_release() or
      dst_release_immediate().
      
      The special handling logic for dst->child in dst_destroy() can be
      replaced with a simple dst_release_immediate() call on the child to
      release the whole list linked by dst->child pointer.
      Previously used DST_NOHASH flag is not needed anymore as well. The
      reason that DST_NOHASH is used in the existing code is mainly to prevent
      the dst inserted in the fib tree to be wrongly destroyed during the
      deletion of the xfrm_dst bundle. So in the existing code, DST_NOHASH
      flag is marked in all the dst children except the one which is in the
      fib tree.
      However, with this patch series to remove dst gc logic and release dst
      only based on ref count, it is safe to release all the children from a
      xfrm_dst bundle as long as the dst children are all ref counted
      properly which is already the case in the existing code.
      So, this patch removes the use of DST_NOHASH flag.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52df157f
    • W
      net: introduce a new function dst_dev_put() · 4a6ce2b6
      Wei Wang 提交于
      This function should be called when removing routes from fib tree after
      the dst gc is no longer in use.
      We first mark DST_OBSOLETE_DEAD on this dst to make sure next
      dst_ops->check() fails and returns NULL.
      Secondly, as we no longer keep the gc_list, we need to properly
      release dst->dev right at the moment when the dst is removed from
      the fib/fib6 tree.
      It does the following:
      1. change dst->input and output pointers to dst_discard/dst_dscard_out to
         discard all packets
      2. replace dst->dev with loopback interface
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a6ce2b6
    • W
      net: introduce DST_NOGC in dst_release() to destroy dst based on refcnt · 5f56f409
      Wei Wang 提交于
      The current mechanism of freeing dst is a bit complicated. dst has its
      ref count and when user grabs the reference to the dst, the ref count is
      properly taken in most cases except in IPv4/IPv6/decnet/xfrm routing
      code due to some historic reasons.
      
      If the reference to dst is always taken properly, we should be able to
      simplify the logic in dst_release() to destroy dst when dst->__refcnt
      drops from 1 to 0. And this should be the only condition to determine
      if we can call dst_destroy().
      And as dst is always ref counted, there is no need for a dst garbage
      list to hold the dst entries that already get removed by the routing
      code but are still held by other users. And the task to periodically
      check the list to free dst if ref count become 0 is also not needed
      anymore.
      
      This patch introduces a temporary flag DST_NOGC(no garbage collector).
      If it is set in the dst, dst_release() will call dst_destroy() when
      dst->__refcnt drops to 0. dst_hold_safe() will also check for this flag
      and do atomic_inc_not_zero() similar as DST_NOCACHE to avoid double free
      issue.
      This temporary flag is mainly used so that we can make the transition
      component by component without breaking other parts.
      This flag will be removed after all components are properly transitioned.
      
      This patch also introduces a new function dst_release_immediate() which
      destroys dst without waiting on the rcu when refcnt drops to 0. It will
      be used in later patches.
      
      Follow-up patches will correct all the places to properly take ref count
      on dst and mark DST_NOGC. dst_release() or dst_release_immediate() will
      be used to release the dst instead of dst_free() and its related
      functions.
      And final clean-up patch will remove the DST_NOGC flag.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5f56f409
  21. 27 5月, 2017 1 次提交
    • E
      ipv4: add reference counting to metrics · 3fb07daf
      Eric Dumazet 提交于
      Andrey Konovalov reported crashes in ipv4_mtu()
      
      I could reproduce the issue with KASAN kernels, between
      10.246.7.151 and 10.246.7.152 :
      
      1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
      
      2) At the same time run following loop :
      while :
      do
       ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
       ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
      done
      
      Cong Wang attempted to add back rt->fi in commit
      82486aa6 ("ipv4: restore rt->fi for reference counting")
      but this proved to add some issues that were complex to solve.
      
      Instead, I suggested to add a refcount to the metrics themselves,
      being a standalone object (in particular, no reference to other objects)
      
      I tried to make this patch as small as possible to ease its backport,
      instead of being super clean. Note that we believe that only ipv4 dst
      need to take care of the metric refcount. But if this is wrong,
      this patch adds the basic infrastructure to extend this to other
      families.
      
      Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
      for his efforts on this problem.
      
      Fixes: 2860583f ("ipv4: Kill rt->fi")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3fb07daf
  22. 18 5月, 2017 1 次提交
    • A
      net: make struct dst_entry::dev first member · 66727145
      Alexey Dobriyan 提交于
      struct dst_entry::dev is used most often. Move it so it can be
      accessed without imm8 offset on x86_64.
      
      	add/remove: 0/0 grow/shrink: 9/239 up/down: 52/-413 (-361)
      	function                                     old     new   delta
      	dst_rcu_free                                 126     138     +12
      	fnhe_flush_routes                            211     219      +8
      	rt_set_nexthop                               747     754      +7
      	rt_cache_route                                85      91      +6
      	rt6_release                                  209     215      +6
      	dst_release                                  107     111      +4
      	dst_destroy_rcu                               29      33      +4
      	dn_dst_check_expire                          329     333      +4
      	dn_insert_route                              484     485      +1
      	xfrm_resolve_and_create_bundle              2991    2990      -1
      					...
      	ip_route_me_harder                          1163    1157      -6
      	__ip_append_data.isra                       2730    2724      -6
      	ip6_forward                                 3052    3045      -7
      	callforward_do_filter                        659     651      -8
      	dst_gc_task                                  571     549     -22
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      66727145
  23. 12 2月, 2017 1 次提交
  24. 08 2月, 2017 1 次提交