1. 29 6月, 2019 1 次提交
    • J
      net: sched: refactor reinsert action · 720f22fe
      John Hurley 提交于
      The TC_ACT_REINSERT return type was added as an in-kernel only option to
      allow a packet ingress or egress redirect. This is used to avoid
      unnecessary skb clones in situations where they are not required. If a TC
      hook returns this code then the packet is 'reinserted' and no skb consume
      is carried out as no clone took place.
      
      This return type is only used in act_mirred. Rather than have the reinsert
      called from the main datapath, call it directly in act_mirred. Instead of
      returning TC_ACT_REINSERT, change the type to the new TC_ACT_CONSUMED
      which tells the caller that the packet has been stolen by another process
      and that no consume call is required.
      
      Moving all redirect calls to the act_mirred code is in preparation for
      tracking recursion created by act_mirred.
      Signed-off-by: NJohn Hurley <john.hurley@netronome.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      720f22fe
  2. 28 6月, 2019 1 次提交
    • L
      batman-adv: mcast: detect, distribute and maintain multicast router presence · 61caf3d1
      Linus Lüssing 提交于
      To be able to apply our group aware multicast optimizations to packets
      with a scope greater than link-local we need to not only keep track of
      multicast listeners but also multicast routers.
      
      With this patch a node detects the presence of multicast routers on
      its segment by checking if
      /proc/sys/net/ipv{4,6}/conf/<bat0|br0(bat)>/mc_forwarding is set for one
      thing. This option is enabled by multicast routing daemons and needed
      for the kernel's multicast routing tables to receive and route packets.
      
      For another thing if a bridge is configured on top of bat0 then the
      presence of an IPv6 multicast router behind this bridge is currently
      detected by checking for an IPv6 multicast "All Routers Address"
      (ff02::2). This should later be replaced by querying the bridge, which
      performs proper, RFC4286 compliant Multicast Router Discovery (our
      simplified approach includes more hosts than necessary, most notably
      not just multicast routers but also unicast ones and is not applicable
      for IPv4).
      
      If no multicast router is detected then this is signalized via the new
      BATADV_MCAST_WANT_NO_RTR4 and BATADV_MCAST_WANT_NO_RTR6
      multicast tvlv flags.
      Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: NSven Eckelmann <sven@narfation.org>
      Signed-off-by: NSimon Wunderlich <sw@simonwunderlich.de>
      61caf3d1
  3. 27 6月, 2019 4 次提交
  4. 26 6月, 2019 8 次提交
  5. 25 6月, 2019 3 次提交
    • S
      ipv6: Dump route exceptions if requested · 1e47b483
      Stefano Brivio 提交于
      Since commit 2b760fcf ("ipv6: hook up exception table to store dst
      cache"), route exceptions reside in a separate hash table, and won't be
      found by walking the FIB, so they won't be dumped to userspace on a
      RTM_GETROUTE message.
      
      This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
      have no function anymore:
      
       # ip -6 route get fc00:3::1
       fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
       # ip -6 route get fc00:4::1
       fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
       # ip -6 route list cache
       # ip -6 route flush cache
       # ip -6 route get fc00:3::1
       fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
       # ip -6 route get fc00:4::1
       fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium
      
      because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
      by listing all the routes, and deleting them with RTM_DELROUTE one by one.
      
      If cached routes are requested using the RTM_F_CLONED flag together with
      strict checking, or if no strict checking is requested (and hence we can't
      consistently apply filters), look up exceptions in the hash table
      associated with the current fib6_info in rt6_dump_route(), and, if present
      and not expired, add them to the dump.
      
      We might be unable to dump all the entries for a given node in a single
      message, so keep track of how many entries were handled for the current
      node in fib6_walker, and skip that amount in case we start from the same
      partially dumped node.
      
      When a partial dump restarts, as the starting node might change when
      'sernum' changes, we have no guarantee that we need to skip the same
      amount of in-node entries. Therefore, we need two counters, and we need to
      zero the in-node counter if the node from which the dump is resumed
      differs.
      
      Note that, with the current version of iproute2, this only fixes the
      'ip -6 route list cache': on a flush command, iproute2 doesn't pass
      RTM_F_CLONED and, due to this inconsistency, 'ip -6 route flush cache' is
      still unable to fetch the routes to be flushed. This will be addressed in
      a patch for iproute2.
      
      To flush cached routes, a procfs entry could be introduced instead: that's
      how it works for IPv4. We already have a rt6_flush_exception() function
      ready to be wired to it. However, this would not solve the issue for
      listing.
      
      Versions of iproute2 and kernel tested:
      
                          iproute2
      kernel             4.14.0   4.15.0   4.19.0   5.0.0   5.1.0    5.1.0, patched
       3.18    list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.4     list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.9     list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.14    list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.15    list
               flush
       4.19    list
               flush
       5.0     list
               flush
       5.1     list
               flush
       with    list        +        +        +        +       +            +
       fix     flush       +        +        +                             +
      
      v7:
        - Explain usage of "skip" counters in commit message (suggested by
          David Ahern)
      
      v6:
        - Rebase onto net-next, use recently introduced nexthop walker
        - Make rt6_nh_dump_exceptions() a separate function (suggested by David
          Ahern)
      
      v5:
        - Use dump_routes and dump_exceptions from filter, ignore NLM_F_MATCH,
          update test results (flushing works with iproute2 < 5.0.0 now)
      
      v4:
        - Split NLM_F_MATCH and strict check handling in separate patches
        - Filter routes using RTM_F_CLONED: if it's not set, only return
          non-cached routes, and if it's set, only return cached routes:
          change requested by David Ahern and Martin Lau. This implies that
          iproute2 needs a separate patch to be able to flush IPv6 cached
          routes. This is not ideal because we can't fix the breakage caused
          by 2b760fcf entirely in kernel. However, two years have passed
          since then, and this makes it more tolerable
      
      v3:
        - More descriptive comment about expired exceptions in rt6_dump_route()
        - Swap return values of rt6_dump_route() (suggested by Martin Lau)
        - Don't zero skip_in_node in case we don't dump anything in a given pass
          (also suggested by Martin Lau)
        - Remove check on RTM_F_CLONED altogether: in the current UAPI semantic,
          it's just a flag to indicate the route was cloned, not to filter on
          routes
      
      v2: Add tracking of number of entries to be skipped in current node after
          a partial dump. As we restart from the same node, if not all the
          exceptions for a given node fit in a single message, the dump will
          not terminate, as suggested by Martin Lau. This is a concrete
          possibility, setting up a big number of exceptions for the same route
          actually causes the issue, suggested by David Ahern.
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Fixes: 2b760fcf ("ipv6: hook up exception table to store dst cache")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e47b483
    • S
      ipv4: Dump route exceptions if requested · ee28906f
      Stefano Brivio 提交于
      Since commit 4895c771 ("ipv4: Add FIB nexthop exceptions."), cached
      exception routes are stored as a separate entity, so they are not dumped
      on a FIB dump, even if the RTM_F_CLONED flag is passed.
      
      This implies that the command 'ip route list cache' doesn't return any
      result anymore.
      
      If the RTM_F_CLONED is passed, and strict checking requested, retrieve
      nexthop exception routes and dump them. If no strict checking is
      requested, filtering can't be performed consistently: dump everything in
      that case.
      
      With this, we need to add an argument to the netlink callback in order to
      track how many entries were already dumped for the last leaf included in
      a partial netlink dump.
      
      A single additional argument is sufficient, even if we traverse logically
      nested structures (nexthop objects, hash table buckets, bucket chains): it
      doesn't matter if we stop in the middle of any of those, because they are
      always traversed the same way. As an example, s_i values in [], s_fa
      values in ():
      
        node (fa) #1 [1]
          nexthop #1
          bucket #1 -> #0 in chain (1)
          bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4)
          bucket #3 -> #0 in chain (5) -> #1 in chain (6)
      
          nexthop #2
          bucket #1 -> #0 in chain (7) -> #1 in chain (8)
          bucket #2 -> #0 in chain (9)
        --
        node (fa) #2 [2]
          nexthop #1
          bucket #1 -> #0 in chain (1) -> #1 in chain (2)
          bucket #2 -> #0 in chain (3)
      
      it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2)
      for "node #2": walking flattens all that.
      
      It would even be possible to drop the distinction between the in-tree
      (s_i) and in-node (s_fa) counter, but a further improvement might
      advise against this. This is only as accurate as the existing tracking
      mechanism for leaves: if a partial dump is restarted after exceptions
      are removed or expired, we might skip some non-dumped entries.
      
      To improve this, we could attach a 'sernum' attribute (similar to the
      one used for IPv6) to nexthop entities, and bump this counter whenever
      exceptions change: having a distinction between the two counters would
      make this more convenient.
      
      Listing of exception routes (modified routes pre-3.5) was tested against
      these versions of kernel and iproute2:
      
                          iproute2
      kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
       3.5-rc4         +        +        +        +       +
       4.4
       4.9
       4.14
       4.15
       4.19
       5.0
       5.1
       fixed           +        +        +        +       +
      
      v7:
         - Move loop over nexthop objects to route.c, and pass struct fib_info
           and table ID to it, not a struct fib_alias (suggested by David Ahern)
         - While at it, note that the NULL check on fa->fa_info is redundant,
           and the check on RTNH_F_DEAD is also not consistent with what's done
           with regular route listing: just keep it for nhc_flags
         - Rename entry point function for dumping exceptions to
           fib_dump_info_fnhe(), and rearrange arguments for consistency with
           fib_dump_info()
         - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle
           one bucket at a time
         - Expand commit message to describe why we can have a single "skip"
           counter for all exceptions stored in bucket chains in nexthop objects
           (suggested by David Ahern)
      
      v6:
         - Rebased onto net-next
         - Loop over nexthop paths too. Move loop over fnhe buckets to route.c,
           avoids need to export rt_fill_info() and to touch exceptions from
           fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that
           (suggested by David Ahern)
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee28906f
    • S
      fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED · 564c91f7
      Stefano Brivio 提交于
      The following patches add back the ability to dump IPv4 and IPv6 exception
      routes, and we need to allow selection of regular routes or exceptions.
      
      Use RTM_F_CLONED as filter to decide whether to dump routes or exceptions:
      iproute2 passes it in dump requests (except for IPv6 cache flush requests,
      this will be fixed in iproute2) and this used to work as long as
      exceptions were stored directly in the FIB, for both IPv4 and IPv6.
      
      Caveat: if strict checking is not requested (that is, if the dump request
      doesn't go through ip_valid_fib_dump_req()), we can't filter on protocol,
      tables or route types.
      
      In this case, filtering on RTM_F_CLONED would be inconsistent: we would
      fix 'ip route list cache' by returning exception routes and at the same
      time introduce another bug in case another selector is present, e.g. on
      'ip route list cache table main' we would return all exception routes,
      without filtering on tables.
      
      Keep this consistent by applying no filters at all, and dumping both
      routes and exceptions, if strict checking is not requested. iproute2
      currently filters results anyway, and no unwanted results will be
      presented to the user. The kernel will just dump more data than needed.
      
      v7: No changes
      
      v6: Rebase onto net-next, no changes
      
      v5: New patch: add dump_routes and dump_exceptions flags in filter and
          simply clear the unwanted one if strict checking is enabled, don't
          ignore NLM_F_MATCH and don't set filter_set if NLM_F_MATCH is set.
          Skip filtering altogether if no strict checking is requested:
          selecting routes or exceptions only would be inconsistent with the
          fact we can't filter on tables.
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      564c91f7
  6. 24 6月, 2019 6 次提交
    • D
      net/tls: fix page double free on TX cleanup · 9354544c
      Dirk van der Merwe 提交于
      With commit 94850257 ("tls: Fix tls_device handling of partial records")
      a new path was introduced to cleanup partial records during sk_proto_close.
      This path does not handle the SW KTLS tx_list cleanup.
      
      This is unnecessary though since the free_resources calls for both
      SW and offload paths will cleanup a partial record.
      
      The visible effect is the following warning, but this bug also causes
      a page double free.
      
          WARNING: CPU: 7 PID: 4000 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110
          RIP: 0010:sk_stream_kill_queues+0x103/0x110
          RSP: 0018:ffffb6df87e07bd0 EFLAGS: 00010206
          RAX: 0000000000000000 RBX: ffff8c21db4971c0 RCX: 0000000000000007
          RDX: ffffffffffffffa0 RSI: 000000000000001d RDI: ffff8c21db497270
          RBP: ffff8c21db497270 R08: ffff8c29f4748600 R09: 000000010020001a
          R10: ffffb6df87e07aa0 R11: ffffffff9a445600 R12: 0000000000000007
          R13: 0000000000000000 R14: ffff8c21f03f2900 R15: ffff8c21f03b8df0
          Call Trace:
           inet_csk_destroy_sock+0x55/0x100
           tcp_close+0x25d/0x400
           ? tcp_check_oom+0x120/0x120
           tls_sk_proto_close+0x127/0x1c0
           inet_release+0x3c/0x60
           __sock_release+0x3d/0xb0
           sock_close+0x11/0x20
           __fput+0xd8/0x210
           task_work_run+0x84/0xa0
           do_exit+0x2dc/0xb90
           ? release_sock+0x43/0x90
           do_group_exit+0x3a/0xa0
           get_signal+0x295/0x720
           do_signal+0x36/0x610
           ? SYSC_recvfrom+0x11d/0x130
           exit_to_usermode_loop+0x69/0xb0
           do_syscall_64+0x173/0x180
           entry_SYSCALL_64_after_hwframe+0x3d/0xa2
          RIP: 0033:0x7fe9b9abc10d
          RSP: 002b:00007fe9b19a1d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
          RAX: fffffffffffffe00 RBX: 0000000000000006 RCX: 00007fe9b9abc10d
          RDX: 0000000000000002 RSI: 0000000000000080 RDI: 00007fe948003430
          RBP: 00007fe948003410 R08: 00007fe948003430 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000246 R12: 00005603739d9080
          R13: 00007fe9b9ab9f90 R14: 00007fe948003430 R15: 0000000000000000
      
      Fixes: 94850257 ("tls: Fix tls_device handling of partial records")
      Signed-off-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9354544c
    • W
      ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF · 7d9e5f42
      Wei Wang 提交于
      For tx path, in most cases, we still have to take refcnt on the dst
      cause the caller is caching the dst somewhere. But it still is
      beneficial to make use of RT6_LOOKUP_F_DST_NOREF flag while doing the
      route lookup. It is cause this flag prevents manipulating refcnt on
      net->ipv6.ip6_null_entry when doing fib6_rule_lookup() to traverse each
      routing table. The null_entry is a shared object and constant updates on
      it cause false sharing.
      
      We converted the current major lookup function ip6_route_output_flags()
      to make use of RT6_LOOKUP_F_DST_NOREF.
      
      Together with the change in the rx path, we see noticable performance
      boost:
      I ran synflood tests between 2 hosts under the same switch. Both hosts
      have 20G mlx NIC, and 8 tx/rx queues.
      Sender sends pure SYN flood with random src IPs and ports using trafgen.
      Receiver has a simple TCP listener on the target port.
      Both hosts have multiple custom rules:
      - For incoming packets, only local table is traversed.
      - For outgoing packets, 3 tables are traversed to find the route.
      The packet processing rate on the receiver is as follows:
      - Before the fix: 3.78Mpps
      - After the fix:  5.50Mpps
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d9e5f42
    • W
      ipv6: honor RT6_LOOKUP_F_DST_NOREF in rule lookup logic · d64a1f57
      Wei Wang 提交于
      This patch specifically converts the rule lookup logic to honor this
      flag and not release refcnt when traversing each rule and calling
      lookup() on each routing table.
      Similar to previous patch, we also need some special handling of dst
      entries in uncached list because there is always 1 refcnt taken for them
      even if RT6_LOOKUP_F_DST_NOREF flag is set.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d64a1f57
    • W
      ipv6: introduce RT6_LOOKUP_F_DST_NOREF flag in ip6_pol_route() · 0e09edcc
      Wei Wang 提交于
      This new flag is to instruct the route lookup function to not take
      refcnt on the dst entry. The user which does route lookup with this flag
      must properly use rcu protection.
      ip6_pol_route() is the major route lookup function for both tx and rx
      path.
      In this function:
      Do not take refcnt on dst if RT6_LOOKUP_F_DST_NOREF flag is set, and
      directly return the route entry. The caller should be holding rcu lock
      when using this flag, and decide whether to take refcnt or not.
      
      One note on the dst cache in the uncached_list:
      As uncached_list does not consume refcnt, one refcnt is always returned
      back to the caller even if RT6_LOOKUP_F_DST_NOREF flag is set.
      Uncached dst is only possible in the output path. So in such call path,
      caller MUST check if the dst is in the uncached_list before assuming
      that there is no refcnt taken on the returned dst.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NMahesh Bandewar <maheshb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e09edcc
    • Q
      inet: fix compilation warnings in fqdir_pre_exit() · 08003d0b
      Qian Cai 提交于
      The linux-next commit "inet: fix various use-after-free in defrags
      units" [1] introduced compilation warnings,
      
      ./include/net/inet_frag.h:117:1: warning: 'inline' is not at beginning
      of declaration [-Wold-style-declaration]
       static void inline fqdir_pre_exit(struct fqdir *fqdir)
       ^~~~~~
      In file included from ./include/net/netns/ipv4.h:10,
                       from ./include/net/net_namespace.h:20,
                       from ./include/linux/netdevice.h:38,
                       from ./include/linux/icmpv6.h:13,
                       from ./include/linux/ipv6.h:86,
                       from ./include/net/ipv6.h:12,
                       from ./include/rdma/ib_verbs.h:51,
                       from ./include/linux/mlx5/device.h:37,
                       from ./include/linux/mlx5/driver.h:51,
                       from
      drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:37:
      
      [1] https://lore.kernel.org/netdev/20190618180900.88939-3-edumazet@google.com/Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      08003d0b
    • T
      mtd: spi-nor: use 16-bit WRR command when QE is set on spansion flashes · 191f5c2e
      Tudor Ambarus 提交于
      SPI memory devices from different manufacturers have widely
      different configurations for Status, Control and Configuration
      registers. JEDEC 216C defines a new map for these common register
      bits and their functions, and describes how the individual bits may
      be accessed for a specific device. For the JEDEC 216B compliant
      flashes, we can partially deduce Status and Configuration registers
      functions by inspecting the 16th DWORD of BFPT. Older flashes that
      don't declare the SFDP tables (SPANSION FL512SAIFG1 311QQ063 A ©11
      SPANSION) let the software decide how to interact with these registers.
      
      The commit dcb4b22e ("spi-nor: s25fl512s supports region locking")
      uncovered a probe error for s25fl512s, when the Quad Enable bit CR[1]
      was set to one in the bootloader. When this bit is one, only the Write
      Status (01h) command with two data byts may be used, the 01h command with
      one data byte is not recognized and hence the error when trying to clear
      the block protection bits.
      
      Fix the above by using the Write Status (01h) command with two data bytes
      when the Quad Enable bit is one.
      
      Backward compatibility should be fine. The newly introduced
      spi_nor_spansion_clear_sr_bp() is tightly coupled with the
      spansion_quad_enable() function. Both assume that the Write Register
      with 16 bits, together with the Read Configuration Register (35h)
      instructions are supported.
      
      Fixes: dcb4b22e ("spi-nor: s25fl512s supports region locking")
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NTudor Ambarus <tudor.ambarus@microchip.com>
      Tested-by: NJonas Bonn <jonas@norrbonn.se>
      Tested-by: NGeert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: NVignesh Raghavendra <vigneshr@ti.com>
      Tested-by: NVignesh Raghavendra <vigneshr@ti.com>
      Signed-off-by: NMiquel Raynal <miquel.raynal@bootlin.com>
      191f5c2e
  7. 23 6月, 2019 1 次提交
    • A
      net: fastopen: robustness and endianness fixes for SipHash · 438ac880
      Ard Biesheuvel 提交于
      Some changes to the TCP fastopen code to make it more robust
      against future changes in the choice of key/cookie size, etc.
      
      - Instead of keeping the SipHash key in an untyped u8[] buffer
        and casting it to the right type upon use, use the correct
        type directly. This ensures that the key will appear at the
        correct alignment if we ever change the way these data
        structures are allocated. (Currently, they are only allocated
        via kmalloc so they always appear at the correct alignment)
      
      - Use DIV_ROUND_UP when sizing the u64[] array to hold the
        cookie, so it is always of sufficient size, even if
        TCP_FASTOPEN_COOKIE_MAX is no longer a multiple of 8.
      
      - Drop the 'len' parameter from the tcp_fastopen_reset_cipher()
        function, which is no longer used.
      
      - Add endian swabbing when setting the keys and calculating the hash,
        to ensure that cookie values are the same for a given key and
        source/destination address pair regardless of the endianness of
        the server.
      
      Note that none of these are functional changes wrt the current
      state of the code, with the exception of the swabbing, which only
      affects big endian systems.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      438ac880
  8. 22 6月, 2019 2 次提交
  9. 21 6月, 2019 1 次提交
    • A
      netfilter: fix nf_conntrack_bridge/ipv6 link error · 43a38c3f
      Arnd Bergmann 提交于
      When CONFIG_IPV6 is disabled, the bridge netfilter code
      produces a link error:
      
      ERROR: "br_ip6_fragment" [net/bridge/netfilter/nf_conntrack_bridge.ko] undefined!
      ERROR: "nf_ct_frag6_gather" [net/bridge/netfilter/nf_conntrack_bridge.ko] undefined!
      
      The problem is that it assumes that whenever IPV6 is not a loadable
      module, we can call the functions direction. This is clearly
      not true when IPV6 is disabled.
      
      There are two other functions defined like this in linux/netfilter_ipv6.h,
      so change them all the same way.
      
      Fixes: 764dd163 ("netfilter: nf_conntrack_bridge: add support for IPv6")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      43a38c3f
  10. 20 6月, 2019 4 次提交
    • A
      netfilter: synproxy: fix building syncookie calls · 8527fa6c
      Arnd Bergmann 提交于
      When either CONFIG_IPV6 or CONFIG_SYN_COOKIES are disabled, the kernel
      fails to build:
      
      include/linux/netfilter_ipv6.h:180:9: error: implicit declaration of function '__cookie_v6_init_sequence'
            [-Werror,-Wimplicit-function-declaration]
              return __cookie_v6_init_sequence(iph, th, mssp);
      include/linux/netfilter_ipv6.h:194:9: error: implicit declaration of function '__cookie_v6_check'
            [-Werror,-Wimplicit-function-declaration]
              return __cookie_v6_check(iph, th, cookie);
      net/ipv6/netfilter.c:237:26: error: use of undeclared identifier '__cookie_v6_init_sequence'; did you mean 'cookie_init_sequence'?
      net/ipv6/netfilter.c:238:21: error: use of undeclared identifier '__cookie_v6_check'; did you mean '__cookie_v4_check'?
      
      Fix the IS_ENABLED() checks to match the function declaration
      and definitions for these.
      
      Fixes: 3006a522 ("netfilter: synproxy: remove module dependency on IPv6 SYNPROXY")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8527fa6c
    • J
      page_pool: fix compile warning when CONFIG_PAGE_POOL is disabled · 497ad9f5
      Jesper Dangaard Brouer 提交于
      Kbuild test robot reported compile warning:
       warning: no return statement in function returning non-void
      in function page_pool_request_shutdown, when CONFIG_PAGE_POOL is disabled.
      
      The fix makes the code a little more verbose, with a descriptive variable.
      
      Fixes: 99c07c43 ("xdp: tracking page_pool resources and safe removal")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      497ad9f5
    • E
      inet: clear num_timeout reqsk_alloc() · 85f9aa75
      Eric Dumazet 提交于
      KMSAN caught uninit-value in tcp_create_openreq_child() [1]
      This is caused by a recent change, combined by the fact
      that TCP cleared num_timeout, num_retrans and sk fields only
      when a request socket was about to be queued.
      
      Under syncookie mode, a temporary request socket is used,
      and req->num_timeout could contain garbage.
      
      Lets clear these three fields sooner, there is really no
      point trying to defer this and risk other bugs.
      
      [1]
      
      BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
      CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x191/0x1f0 lib/dump_stack.c:113
       kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611
       __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304
       tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
       tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152
       tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209
       cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252
       tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
       tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
       tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
       ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
       ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
       dst_input include/net/dst.h:439 [inline]
       ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
       __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
       __netif_receive_skb net/core/dev.c:5095 [inline]
       process_backlog+0x721/0x1410 net/core/dev.c:5906
       napi_poll net/core/dev.c:6329 [inline]
       net_rx_action+0x738/0x1940 net/core/dev.c:6395
       __do_softirq+0x4ad/0x858 kernel/softirq.c:293
       do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
       </IRQ>
       do_softirq kernel/softirq.c:338 [inline]
       __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
       local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
       rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
       ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
       ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
       dst_output include/net/dst.h:433 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
       inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
       __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
       tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
       tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
       __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
       tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
       tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
       inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
       inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
       __sock_release net/socket.c:601 [inline]
       sock_close+0x156/0x490 net/socket.c:1273
       __fput+0x4c9/0xba0 fs/file_table.c:280
       ____fput+0x37/0x40 fs/file_table.c:313
       task_work_run+0x22e/0x2a0 kernel/task_work.c:113
       tracehook_notify_resume include/linux/tracehook.h:185 [inline]
       exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
       prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
       syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
       do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      RIP: 0033:0x401d50
      Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00
      RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
      RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50
      RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c
      R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0
      R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline]
       kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160
       kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177
       kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781
       reqsk_alloc include/net/request_sock.h:84 [inline]
       inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384
       cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173
       tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
       tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
       tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
       ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
       ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
       dst_input include/net/dst.h:439 [inline]
       ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
       __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
       __netif_receive_skb net/core/dev.c:5095 [inline]
       process_backlog+0x721/0x1410 net/core/dev.c:5906
       napi_poll net/core/dev.c:6329 [inline]
       net_rx_action+0x738/0x1940 net/core/dev.c:6395
       __do_softirq+0x4ad/0x858 kernel/softirq.c:293
       do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
       do_softirq kernel/softirq.c:338 [inline]
       __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
       local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
       rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
       ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
       ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
       dst_output include/net/dst.h:433 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
       inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
       __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
       tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
       tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
       __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
       tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
       tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
       inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
       inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
       __sock_release net/socket.c:601 [inline]
       sock_close+0x156/0x490 net/socket.c:1273
       __fput+0x4c9/0xba0 fs/file_table.c:280
       ____fput+0x37/0x40 fs/file_table.c:313
       task_work_run+0x22e/0x2a0 kernel/task_work.c:113
       tracehook_notify_resume include/linux/tracehook.h:185 [inline]
       exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
       prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
       syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
       do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      
      Fixes: 336c39a0 ("tcp: undo init congestion window on false SYNACK timeout")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      85f9aa75
    • K
      net: sched: act_ctinfo: tidy UAPI definition · 16e5a266
      Kevin Darbyshire-Bryant 提交于
      Remove some enums from the UAPI definition that were only used
      internally and are NOT part of the UAPI.
      Signed-off-by: NKevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16e5a266
  11. 19 6月, 2019 9 次提交
    • L
      netfilter: nf_tables: enable set expiration time for set elements · 79ebb5bb
      Laura Garcia Liebana 提交于
      Currently, the expiration of every element in a set or map
      is a read-only parameter generated at kernel side.
      
      This change will permit to set a certain expiration date
      per element that will be required, for example, during
      stateful replication among several nodes.
      
      This patch handles the NFTA_SET_ELEM_EXPIRATION in order
      to configure the expiration parameter per element, or
      will use the timeout in the case that the expiration
      is not set.
      Signed-off-by: NLaura Garcia Liebana <nevola@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      79ebb5bb
    • E
      inet: fix various use-after-free in defrags units · d5dd8879
      Eric Dumazet 提交于
      syzbot reported another issue caused by my recent patches. [1]
      
      The issue here is that fqdir_exit() is initiating a work queue
      and immediately returns. A bit later cleanup_net() was able
      to free the MIB (percpu data) and the whole struct net was freed,
      but we had active frag timers that fired and triggered use-after-free.
      
      We need to make sure that timers can catch fqdir->dead being set,
      to bailout.
      
      Since RCU is used for the reader side, this means
      we want to respect an RCU grace period between these operations :
      
      1) qfdir->dead = 1;
      
      2) netns dismantle (freeing of various data structure)
      
      This patch uses new new (struct pernet_operations)->pre_exit
      infrastructure to ensures a full RCU grace period
      happens between fqdir_pre_exit() and fqdir_exit()
      
      This also means we can use a regular work queue, we no
      longer need rcu_work.
      
      Tested:
      
      $ time for i in {1..1000}; do unshare -n /bin/false;done
      
      real	0m2.585s
      user	0m0.160s
      sys	0m2.214s
      
      [1]
      
      BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
      Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860
      
      CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
       call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
       expire_timers kernel/time/timer.c:1366 [inline]
       __run_timers kernel/time/timer.c:1685 [inline]
       __run_timers kernel/time/timer.c:1653 [inline]
       run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
       __do_softirq+0x25c/0x94c kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:374 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:414
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
       </IRQ>
      RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
      Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
      RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
      RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
      RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
      R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
      R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
       tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
       tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
       tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
       tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
       security_file_ioctl+0x77/0xc0 security/security.c:1370
       ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4592c9
      Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
      RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
      RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
      R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff
      
      Allocated by task 9047:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
       slab_post_alloc_hook mm/slab.h:437 [inline]
       slab_alloc mm/slab.c:3326 [inline]
       kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
       kmem_cache_zalloc include/linux/slab.h:732 [inline]
       net_alloc net/core/net_namespace.c:386 [inline]
       copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
       create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
       ksys_unshare+0x440/0x980 kernel/fork.c:2692
       __do_sys_unshare kernel/fork.c:2760 [inline]
       __se_sys_unshare kernel/fork.c:2758 [inline]
       __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 2541:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kmem_cache_free+0x86/0x260 mm/slab.c:3698
       net_free net/core/net_namespace.c:402 [inline]
       net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
       net_drop_ns net/core/net_namespace.c:408 [inline]
       cleanup_net+0x538/0x960 net/core/net_namespace.c:571
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      The buggy address belongs to the object at ffff88808b9fe100
       which belongs to the cache net_namespace of size 6784
      The buggy address is located 560 bytes inside of
       6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
      The buggy address belongs to the page:
      page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
      flags: 0x1fffc0000010200(slab|head)
      raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
      raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
       ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 3c8fc878 ("inet: frags: rework rhashtable dismantle")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5dd8879
    • E
      netns: add pre_exit method to struct pernet_operations · d7d99872
      Eric Dumazet 提交于
      Current struct pernet_operations exit() handlers are highly
      discouraged to call synchronize_rcu().
      
      There are cases where we need them, and exit_batch() does
      not help the common case where a single netns is dismantled.
      
      This patch leverages the existing synchronize_rcu() call
      in cleanup_net()
      
      Calling optional ->pre_exit() method before ->exit() or
      ->exit_batch() allows to benefit from a single synchronize_rcu()
      call.
      
      Note that the synchronize_rcu() calls added in this patch
      are only in error paths or slow paths.
      
      Tested:
      
      $ time for i in {1..1000}; do unshare -n /bin/false;done
      
      real	0m2.612s
      user	0m0.171s
      sys	0m2.216s
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7d99872
    • J
      page_pool: add tracepoints for page_pool with details need by XDP · 32c28f7e
      Jesper Dangaard Brouer 提交于
      The xdp tracepoints for mem id disconnect don't carry information about, why
      it was not safe_to_remove.  The tracepoint page_pool:page_pool_inflight in
      this patch can be used for extract this info for further debugging.
      
      This patchset also adds tracepoint for the pages_state_* release/hold
      transitions, including a pointer to the page.  This can be used for stats
      about in-flight pages, or used to debug page leakage via keeping track of
      page pointer and combining this with kprobe for __put_page().
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32c28f7e
    • J
      xdp: add tracepoints for XDP mem · f033b688
      Jesper Dangaard Brouer 提交于
      These tracepoints make it easier to troubleshoot XDP mem id disconnect.
      
      The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is
      placed at the last stable place for the pointer to struct xdp_mem_allocator,
      just before it's scheduled for RCU removal. It also extract info on
      'safe_to_remove' and 'force'.
      
      Detailed info about in-flight pages is not available at this layer. The next
      patch will added tracepoints needed at the page_pool layer for this.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f033b688
    • J
      xdp: tracking page_pool resources and safe removal · 99c07c43
      Jesper Dangaard Brouer 提交于
      This patch is needed before we can allow drivers to use page_pool for
      DMA-mappings. Today with page_pool and XDP return API, it is possible to
      remove the page_pool object (from rhashtable), while there are still
      in-flight packet-pages. This is safely handled via RCU and failed lookups in
      __xdp_return() fallback to call put_page(), when page_pool object is gone.
      In-case page is still DMA mapped, this will result in page note getting
      correctly DMA unmapped.
      
      To solve this, the page_pool is extended with tracking in-flight pages. And
      XDP disconnect system queries page_pool and waits, via workqueue, for all
      in-flight pages to be returned.
      
      To avoid killing performance when tracking in-flight pages, the implement
      use two (unsigned) counters, that in placed on different cache-lines, and
      can be used to deduct in-flight packets. This is done by mapping the
      unsigned "sequence" counters onto signed Two's complement arithmetic
      operations. This is e.g. used by kernel's time_after macros, described in
      kernel commit 1ba3aab3 and 5a581b36, and also explained in RFC1982.
      
      The trick is these two incrementing counters only need to be read and
      compared, when checking if it's safe to free the page_pool structure. Which
      will only happen when driver have disconnected RX/alloc side. Thus, on a
      non-fast-path.
      
      It is chosen that page_pool tracking is also enabled for the non-DMA
      use-case, as this can be used for statistics later.
      
      After this patch, using page_pool requires more strict resource "release",
      e.g. via page_pool_release_page() that was introduced in this patchset, and
      previous patches implement/fix this more strict requirement.
      
      Drivers no-longer call page_pool_destroy(). Drivers already call
      xdp_rxq_info_unreg() which call xdp_rxq_info_unreg_mem_model(), which will
      attempt to disconnect the mem id, and if attempt fails schedule the
      disconnect for later via delayed workqueue.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: NIlias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      99c07c43
    • J
      page_pool: introduce page_pool_free and use in mlx5 · e54cfd7e
      Jesper Dangaard Brouer 提交于
      In case driver fails to register the page_pool with XDP return API (via
      xdp_rxq_info_reg_mem_model()), then the driver can free the page_pool
      resources more directly than calling page_pool_destroy(), which does a
      unnecessarily RCU free procedure.
      
      This patch is preparing for removing page_pool_destroy(), from driver
      invocation.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e54cfd7e
    • J
      xdp: page_pool related fix to cpumap · 6bf071bf
      Jesper Dangaard Brouer 提交于
      When converting an xdp_frame into an SKB, and sending this into the network
      stack, then the underlying XDP memory model need to release associated
      resources, because the network stack don't have callbacks for XDP memory
      models.  The only memory model that needs this is page_pool, when a driver
      use the DMA-mapping feature.
      
      Introduce page_pool_release_page(), which basically does the same as
      page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
      interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to
      save the function call overhead for others. Have cpumap call
      xdp_release_frame() before xdp_scrub_frame().
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bf071bf
    • I
      net: page_pool: add helper function to unmap dma addresses · a25d50bf
      Ilias Apalodimas 提交于
      On a previous patch dma addr was stored in 'struct page'.
      Use that to unmap DMA addresses used by network drivers
      Signed-off-by: NIlias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a25d50bf