1. 29 June, 2019 1 commit
    • J
      net: sched: refactor reinsert action · 720f22fe
      Committed by John Hurley
      The TC_ACT_REINSERT return type was added as an in-kernel only option to
      allow a packet ingress or egress redirect. This is used to avoid
      unnecessary skb clones in situations where they are not required. If a TC
      hook returns this code then the packet is 'reinserted' and no skb consume
      is carried out as no clone took place.
      
      This return type is only used in act_mirred. Rather than have the reinsert
      called from the main datapath, call it directly in act_mirred. Instead of
      returning TC_ACT_REINSERT, change the type to the new TC_ACT_CONSUMED
      which tells the caller that the packet has been stolen by another process
      and that no consume call is required.
      
      Moving all redirect calls to the act_mirred code is in preparation for
      tracking recursion created by act_mirred.
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
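The caller-side contract described in the commit above can be illustrated with a minimal sketch. It uses toy types, not the kernel's: `toy_skb`, `toy_verdict`, `toy_mirred_redirect()` and `toy_handle_verdict()` are hypothetical names, and the enum values are placeholders rather than the real TC_ACT_* constants.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; not the kernel's sk_buff or TC_ACT_* values. */
enum toy_verdict { TOY_ACT_OK, TOY_ACT_SHOT, TOY_ACT_CONSUMED };

struct toy_skb {
	int consume_calls;	/* how often the caller released the packet */
};

/* Models an action that redirects the packet itself and keeps ownership. */
static enum toy_verdict toy_mirred_redirect(struct toy_skb *skb)
{
	(void)skb;		/* the action now owns the packet */
	return TOY_ACT_CONSUMED;
}

/* Caller side: returns true if the packet continues up the stack. */
static bool toy_handle_verdict(struct toy_skb *skb, enum toy_verdict v)
{
	switch (v) {
	case TOY_ACT_OK:
		return true;
	case TOY_ACT_SHOT:
		skb->consume_calls++;	/* caller still owns the packet */
		return false;
	case TOY_ACT_CONSUMED:
		return false;		/* taken elsewhere: no release here */
	}
	return false;
}
```

The point of the sketch is only the last case: a "consumed" verdict means the caller neither continues with the packet nor releases it.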
  2. 27 June, 2019 2 commits
  3. 25 June, 2019 3 commits
    • S
      ipv6: Dump route exceptions if requested · 1e47b483
      Committed by Stefano Brivio
      Since commit 2b760fcf ("ipv6: hook up exception table to store dst
      cache"), route exceptions reside in a separate hash table, and won't be
      found by walking the FIB, so they won't be dumped to userspace on a
      RTM_GETROUTE message.
      
      This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
      no longer have any effect:
      
       # ip -6 route get fc00:3::1
       fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
       # ip -6 route get fc00:4::1
       fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
       # ip -6 route list cache
       # ip -6 route flush cache
       # ip -6 route get fc00:3::1
       fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
       # ip -6 route get fc00:4::1
       fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium
      
      because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
      by listing all the routes, and deleting them with RTM_DELROUTE one by one.
      
      If cached routes are requested using the RTM_F_CLONED flag together with
      strict checking, or if no strict checking is requested (and hence we can't
      consistently apply filters), look up exceptions in the hash table
      associated with the current fib6_info in rt6_dump_route(), and, if present
      and not expired, add them to the dump.
      
      We might be unable to dump all the entries for a given node in a single
      message, so keep track of how many entries were handled for the current
      node in fib6_walker, and skip that amount in case we start from the same
      partially dumped node.
      
      When a partial dump restarts, as the starting node might change when
      'sernum' changes, we have no guarantee that we need to skip the same
      amount of in-node entries. Therefore, we need two counters, and we need to
      zero the in-node counter if the node from which the dump is resumed
      differs.
      
      Note that, with the current version of iproute2, this only fixes the
      'ip -6 route list cache': on a flush command, iproute2 doesn't pass
      RTM_F_CLONED and, due to this inconsistency, 'ip -6 route flush cache' is
      still unable to fetch the routes to be flushed. This will be addressed in
      a patch for iproute2.
      
      To flush cached routes, a procfs entry could be introduced instead: that's
      how it works for IPv4. We already have a rt6_flush_exception() function
      ready to be wired to it. However, this would not solve the issue for
      listing.
      
      Versions of iproute2 and kernel tested:
      
                          iproute2
      kernel             4.14.0   4.15.0   4.19.0   5.0.0   5.1.0    5.1.0, patched
       3.18    list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.4     list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.9     list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.14    list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.15    list
               flush
       4.19    list
               flush
       5.0     list
               flush
       5.1     list
               flush
       with    list        +        +        +        +       +            +
       fix     flush       +        +        +                             +
      
      v7:
        - Explain usage of "skip" counters in commit message (suggested by
          David Ahern)
      
      v6:
        - Rebase onto net-next, use recently introduced nexthop walker
        - Make rt6_nh_dump_exceptions() a separate function (suggested by David
          Ahern)
      
      v5:
        - Use dump_routes and dump_exceptions from filter, ignore NLM_F_MATCH,
          update test results (flushing works with iproute2 < 5.0.0 now)
      
      v4:
        - Split NLM_F_MATCH and strict check handling in separate patches
        - Filter routes using RTM_F_CLONED: if it's not set, only return
          non-cached routes, and if it's set, only return cached routes:
          change requested by David Ahern and Martin Lau. This implies that
          iproute2 needs a separate patch to be able to flush IPv6 cached
          routes. This is not ideal because we can't fix the breakage caused
          by 2b760fcf entirely in kernel. However, two years have passed
          since then, and this makes it more tolerable
      
      v3:
        - More descriptive comment about expired exceptions in rt6_dump_route()
        - Swap return values of rt6_dump_route() (suggested by Martin Lau)
        - Don't zero skip_in_node in case we don't dump anything in a given pass
          (also suggested by Martin Lau)
        - Remove check on RTM_F_CLONED altogether: in the current UAPI semantic,
          it's just a flag to indicate the route was cloned, not to filter on
          routes
      
      v2: Add tracking of number of entries to be skipped in current node after
          a partial dump. As we restart from the same node, if not all the
          exceptions for a given node fit in a single message, the dump will
          not terminate, as suggested by Martin Lau. This is a concrete
          possibility, setting up a big number of exceptions for the same route
          actually causes the issue, suggested by David Ahern.
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Fixes: 2b760fcf ("ipv6: hook up exception table to store dst cache")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
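The two-counter resume logic described in the IPv6 commit above can be sketched with a toy model. This is not the kernel's fib6_walker: `toy_node`, `toy_walker` and `toy_dump_pass()` are hypothetical, and a "message" is modeled simply as room for a fixed number of entries.

```c
#include <stddef.h>

struct toy_node {
	int id;
	int nr_entries;		/* exceptions hanging off this node */
};

struct toy_walker {
	int last_node_id;	/* node where the previous pass stopped, -1 if none */
	int skip_in_node;	/* entries of that node already dumped */
};

/*
 * Dump entries starting at nodes[start], at most 'room' per call.
 * Returns the number of entries emitted and stores in *next the index
 * of the node to resume from (n when the dump is complete).
 */
static int toy_dump_pass(const struct toy_node *nodes, size_t n, size_t start,
			 struct toy_walker *w, int room, size_t *next)
{
	int emitted = 0;
	size_t i;

	/* Resuming from a different node: the in-node counter is stale. */
	if (start < n && nodes[start].id != w->last_node_id)
		w->skip_in_node = 0;

	for (i = start; i < n; i++) {
		int left = nodes[i].nr_entries - w->skip_in_node;

		if (left < 0)
			left = 0;
		if (left > room - emitted) {
			/* Message full: remember the node and the progress in it. */
			w->skip_in_node += room - emitted;
			w->last_node_id = nodes[i].id;
			*next = i;
			return room;
		}
		emitted += left;
		w->skip_in_node = 0;
	}
	w->last_node_id = -1;
	*next = n;
	return emitted;
}
```

With two nodes holding 5 and 3 entries and room for 3 entries per pass, three passes emit 3, 3 and 2 entries; resuming at a node other than the one recorded in the walker starts that node from its first entry.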
    • S
      ipv4: Dump route exceptions if requested · ee28906f
      Committed by Stefano Brivio
      Since commit 4895c771 ("ipv4: Add FIB nexthop exceptions."), cached
      exception routes are stored as a separate entity, so they are not dumped
      on a FIB dump, even if the RTM_F_CLONED flag is passed.
      
      This implies that the command 'ip route list cache' doesn't return any
      result anymore.
      
      If RTM_F_CLONED is passed and strict checking is requested, retrieve
      nexthop exception routes and dump them. If no strict checking is
      requested, filtering can't be performed consistently: dump everything in
      that case.
      
      With this, we need to add an argument to the netlink callback in order to
      track how many entries were already dumped for the last leaf included in
      a partial netlink dump.
      
      A single additional argument is sufficient, even if we traverse logically
      nested structures (nexthop objects, hash table buckets, bucket chains): it
      doesn't matter if we stop in the middle of any of those, because they are
      always traversed the same way. As an example, s_i values in [], s_fa
      values in ():
      
        node (fa) #1 [1]
          nexthop #1
          bucket #1 -> #0 in chain (1)
          bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4)
          bucket #3 -> #0 in chain (5) -> #1 in chain (6)
      
          nexthop #2
          bucket #1 -> #0 in chain (7) -> #1 in chain (8)
          bucket #2 -> #0 in chain (9)
        --
        node (fa) #2 [2]
          nexthop #1
          bucket #1 -> #0 in chain (1) -> #1 in chain (2)
          bucket #2 -> #0 in chain (3)
      
      it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2)
      for "node #2": walking flattens all that.
      
      It would even be possible to drop the distinction between the in-tree
      (s_i) and in-node (s_fa) counter, but a further improvement might
      advise against this. This is only as accurate as the existing tracking
      mechanism for leaves: if a partial dump is restarted after exceptions
      are removed or expired, we might skip some non-dumped entries.
      
      To improve this, we could attach a 'sernum' attribute (similar to the
      one used for IPv6) to nexthop entities, and bump this counter whenever
      exceptions change: having a distinction between the two counters would
      make this more convenient.
      
      Listing of exception routes (modified routes pre-3.5) was tested against
      these versions of kernel and iproute2:
      
                          iproute2
      kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
       3.5-rc4         +        +        +        +       +
       4.4
       4.9
       4.14
       4.15
       4.19
       5.0
       5.1
       fixed           +        +        +        +       +
      
      v7:
         - Move loop over nexthop objects to route.c, and pass struct fib_info
           and table ID to it, not a struct fib_alias (suggested by David Ahern)
         - While at it, note that the NULL check on fa->fa_info is redundant,
           and the check on RTNH_F_DEAD is also not consistent with what's done
           with regular route listing: just keep it for nhc_flags
         - Rename entry point function for dumping exceptions to
           fib_dump_info_fnhe(), and rearrange arguments for consistency with
           fib_dump_info()
         - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle
           one bucket at a time
         - Expand commit message to describe why we can have a single "skip"
           counter for all exceptions stored in bucket chains in nexthop objects
           (suggested by David Ahern)
      
      v6:
         - Rebased onto net-next
         - Loop over nexthop paths too. Move loop over fnhe buckets to route.c,
           avoids need to export rt_fill_info() and to touch exceptions from
           fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that
           (suggested by David Ahern)
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
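The argument in the IPv4 commit above, that a single skip counter is enough for nested nexthops, buckets and chains, can be sketched as follows. The structures are toy stand-ins (`toy_nexthop`, `toy_dump_exceptions()` are hypothetical names); the only property borrowed from the commit message is that the nesting is always traversed in the same order.

```c
#define TOY_MAX_BUCKETS 4

struct toy_nexthop {
	int nr_buckets;
	int chain_len[TOY_MAX_BUCKETS];	/* exceptions chained in each bucket */
};

/*
 * Walk nexthops -> buckets -> chains in a fixed order, numbering the
 * entries 1, 2, 3, ... Entries numbered <= fa_start were sent in an
 * earlier pass and are skipped. 'out' receives the numbers of the
 * entries emitted in this pass (at most 'room').
 * Returns the number of the last entry sent so far, i.e. the next fa_start.
 */
static int toy_dump_exceptions(const struct toy_nexthop *nh, int nr_nh,
			       int fa_start, int room, int *out)
{
	int index = 0, emitted = 0;

	for (int n = 0; n < nr_nh; n++) {
		for (int b = 0; b < nh[n].nr_buckets; b++) {
			for (int c = 0; c < nh[n].chain_len[b]; c++) {
				index++;
				if (index <= fa_start)
					continue;	/* dumped in a previous pass */
				if (emitted == room)
					return fa_start + emitted;
				out[emitted++] = index;
			}
		}
	}
	return fa_start + emitted;
}
```

Using the chain lengths from the "node #1" example (1, 3, 2 for the first nexthop and 2, 1 for the second), a dump with room for four entries per pass resumes correctly at entries 5 and 9, even though the second pass crosses from one nexthop to the next.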
    • S
      fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED · 564c91f7
      Committed by Stefano Brivio
      The following patches add back the ability to dump IPv4 and IPv6 exception
      routes, and we need to allow selection of regular routes or exceptions.
      
      Use RTM_F_CLONED as filter to decide whether to dump routes or exceptions:
      iproute2 passes it in dump requests (except for IPv6 cache flush requests,
      this will be fixed in iproute2) and this used to work as long as
      exceptions were stored directly in the FIB, for both IPv4 and IPv6.
      
      Caveat: if strict checking is not requested (that is, if the dump request
      doesn't go through ip_valid_fib_dump_req()), we can't filter on protocol,
      tables or route types.
      
      In this case, filtering on RTM_F_CLONED would be inconsistent: we would
      fix 'ip route list cache' by returning exception routes and at the same
      time introduce another bug in case another selector is present, e.g. on
      'ip route list cache table main' we would return all exception routes,
      without filtering on tables.
      
      Keep this consistent by applying no filters at all, and dumping both
      routes and exceptions, if strict checking is not requested. iproute2
      currently filters results anyway, and no unwanted results will be
      presented to the user. The kernel will just dump more data than needed.
      
      v7: No changes
      
      v6: Rebase onto net-next, no changes
      
      v5: New patch: add dump_routes and dump_exceptions flags in filter and
          simply clear the unwanted one if strict checking is enabled, don't
          ignore NLM_F_MATCH and don't set filter_set if NLM_F_MATCH is set.
          Skip filtering altogether if no strict checking is requested:
          selecting routes or exceptions only would be inconsistent with the
          fact we can't filter on tables.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
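The selection rule described in the commit above (clear the unwanted flag under strict checking, apply no filter otherwise) can be sketched like this. `toy_filter`, `toy_select_dump()` and `TOY_RTM_F_CLONED` are hypothetical; the flag value is a placeholder, the real RTM_F_CLONED is defined in linux/rtnetlink.h.

```c
#include <stdbool.h>

/* Placeholder value for illustration; not taken from the UAPI header. */
#define TOY_RTM_F_CLONED 0x200u

struct toy_filter {
	bool dump_routes;
	bool dump_exceptions;
};

static struct toy_filter toy_select_dump(unsigned int rtm_flags,
					 bool strict_check)
{
	struct toy_filter f = { .dump_routes = true, .dump_exceptions = true };

	/* Without strict checking no filter can be applied consistently. */
	if (!strict_check)
		return f;

	if (rtm_flags & TOY_RTM_F_CLONED)
		f.dump_routes = false;		/* only exceptions requested */
	else
		f.dump_exceptions = false;	/* only regular routes requested */
	return f;
}
```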
  4. 24 June, 2019 5 commits
    • D
      net/tls: fix page double free on TX cleanup · 9354544c
      Committed by Dirk van der Merwe
      With commit 94850257 ("tls: Fix tls_device handling of partial records")
      a new path was introduced to clean up partial records during sk_proto_close.
      This path does not handle the SW KTLS tx_list cleanup.
      
      This is unnecessary though since the free_resources calls for both
      SW and offload paths will clean up a partial record.
      
      The visible effect is the following warning, but this bug also causes
      a page double free.
      
          WARNING: CPU: 7 PID: 4000 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110
          RIP: 0010:sk_stream_kill_queues+0x103/0x110
          RSP: 0018:ffffb6df87e07bd0 EFLAGS: 00010206
          RAX: 0000000000000000 RBX: ffff8c21db4971c0 RCX: 0000000000000007
          RDX: ffffffffffffffa0 RSI: 000000000000001d RDI: ffff8c21db497270
          RBP: ffff8c21db497270 R08: ffff8c29f4748600 R09: 000000010020001a
          R10: ffffb6df87e07aa0 R11: ffffffff9a445600 R12: 0000000000000007
          R13: 0000000000000000 R14: ffff8c21f03f2900 R15: ffff8c21f03b8df0
          Call Trace:
           inet_csk_destroy_sock+0x55/0x100
           tcp_close+0x25d/0x400
           ? tcp_check_oom+0x120/0x120
           tls_sk_proto_close+0x127/0x1c0
           inet_release+0x3c/0x60
           __sock_release+0x3d/0xb0
           sock_close+0x11/0x20
           __fput+0xd8/0x210
           task_work_run+0x84/0xa0
           do_exit+0x2dc/0xb90
           ? release_sock+0x43/0x90
           do_group_exit+0x3a/0xa0
           get_signal+0x295/0x720
           do_signal+0x36/0x610
           ? SYSC_recvfrom+0x11d/0x130
           exit_to_usermode_loop+0x69/0xb0
           do_syscall_64+0x173/0x180
           entry_SYSCALL_64_after_hwframe+0x3d/0xa2
          RIP: 0033:0x7fe9b9abc10d
          RSP: 002b:00007fe9b19a1d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
          RAX: fffffffffffffe00 RBX: 0000000000000006 RCX: 00007fe9b9abc10d
          RDX: 0000000000000002 RSI: 0000000000000080 RDI: 00007fe948003430
          RBP: 00007fe948003410 R08: 00007fe948003430 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000246 R12: 00005603739d9080
          R13: 00007fe9b9ab9f90 R14: 00007fe948003430 R15: 0000000000000000
      
      Fixes: 94850257 ("tls: Fix tls_device handling of partial records")
      Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF · 7d9e5f42
      Committed by Wei Wang
      For the tx path, in most cases we still have to take a refcnt on the dst
      because the caller is caching the dst somewhere. But it is still
      beneficial to use the RT6_LOOKUP_F_DST_NOREF flag while doing the
      route lookup, because this flag avoids manipulating the refcnt on
      net->ipv6.ip6_null_entry when fib6_rule_lookup() traverses each
      routing table. The null_entry is a shared object, and constant updates to
      it cause false sharing.
      
      We converted the current major lookup function ip6_route_output_flags()
      to make use of RT6_LOOKUP_F_DST_NOREF.
      
      Together with the change in the rx path, we see a noticeable performance
      boost:
      I ran synflood tests between 2 hosts under the same switch. Both hosts
      have 20G mlx NIC, and 8 tx/rx queues.
      Sender sends pure SYN flood with random src IPs and ports using trafgen.
      Receiver has a simple TCP listener on the target port.
      Both hosts have multiple custom rules:
      - For incoming packets, only local table is traversed.
      - For outgoing packets, 3 tables are traversed to find the route.
      The packet processing rate on the receiver is as follows:
      - Before the fix: 3.78Mpps
      - After the fix:  5.50Mpps
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      ipv6: honor RT6_LOOKUP_F_DST_NOREF in rule lookup logic · d64a1f57
      Committed by Wei Wang
      This patch specifically converts the rule lookup logic to honor this
      flag and not release refcnt when traversing each rule and calling
      lookup() on each routing table.
      Similar to previous patch, we also need some special handling of dst
      entries in uncached list because there is always 1 refcnt taken for them
      even if RT6_LOOKUP_F_DST_NOREF flag is set.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      ipv6: introduce RT6_LOOKUP_F_DST_NOREF flag in ip6_pol_route() · 0e09edcc
      Committed by Wei Wang
      This new flag is to instruct the route lookup function to not take
      refcnt on the dst entry. The user which does route lookup with this flag
      must properly use rcu protection.
      ip6_pol_route() is the major route lookup function for both tx and rx
      path.
      In this function:
      Do not take refcnt on dst if RT6_LOOKUP_F_DST_NOREF flag is set, and
      directly return the route entry. The caller should be holding rcu lock
      when using this flag, and decide whether to take refcnt or not.
      
      One note on the dst cache in the uncached_list:
      As uncached_list does not consume refcnt, one refcnt is always returned
      back to the caller even if RT6_LOOKUP_F_DST_NOREF flag is set.
      Uncached dst is only possible in the output path. So in such call path,
      caller MUST check if the dst is in the uncached_list before assuming
      that there is no refcnt taken on the returned dst.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
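The note on the uncached_list in the commit above can be illustrated with a toy reference-count model. Nothing here is kernel API: `toy_dst`, `toy_lookup()`, `toy_lookup_and_hold()` and `TOY_LOOKUP_F_DST_NOREF` are hypothetical, and the RCU protection the commit requires is only mentioned in a comment, not modeled.

```c
#include <stdbool.h>

#define TOY_LOOKUP_F_DST_NOREF 0x1	/* hypothetical flag value */

struct toy_dst {
	int refcnt;
	bool uncached;		/* models "dst is in the uncached list" */
};

/*
 * Lookup: a reference is skipped only for NOREF lookups that return a
 * dst outside the uncached list. Uncached dsts always carry one reference.
 */
static struct toy_dst *toy_lookup(struct toy_dst *dst, int flags)
{
	if (!(flags & TOY_LOOKUP_F_DST_NOREF) || dst->uncached)
		dst->refcnt++;
	return dst;
}

/*
 * Caller that wants to keep the dst. In the kernel this section would
 * have to run under RCU protection; that is not modeled here.
 */
static struct toy_dst *toy_lookup_and_hold(struct toy_dst *dst)
{
	struct toy_dst *res = toy_lookup(dst, TOY_LOOKUP_F_DST_NOREF);

	/* Check the uncached case before assuming no reference was taken. */
	if (!res->uncached)
		res->refcnt++;
	return res;
}
```

In both cases the caller ends up holding exactly one reference; skipping the check would leave an uncached dst with two.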
    • Q
      inet: fix compilation warnings in fqdir_pre_exit() · 08003d0b
      Committed by Qian Cai
      The linux-next commit "inet: fix various use-after-free in defrags
      units" [1] introduced compilation warnings,
      
      ./include/net/inet_frag.h:117:1: warning: 'inline' is not at beginning
      of declaration [-Wold-style-declaration]
       static void inline fqdir_pre_exit(struct fqdir *fqdir)
       ^~~~~~
      In file included from ./include/net/netns/ipv4.h:10,
                       from ./include/net/net_namespace.h:20,
                       from ./include/linux/netdevice.h:38,
                       from ./include/linux/icmpv6.h:13,
                       from ./include/linux/ipv6.h:86,
                       from ./include/net/ipv6.h:12,
                       from ./include/rdma/ib_verbs.h:51,
                       from ./include/linux/mlx5/device.h:37,
                       from ./include/linux/mlx5/driver.h:51,
                       from
      drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:37:
      
      [1] https://lore.kernel.org/netdev/20190618180900.88939-3-edumazet@google.com/
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: David S. Miller <davem@davemloft.net>
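The warning quoted above concerns only the position of 'inline' in the declaration. A minimal illustration, with a hypothetical function `toy_pre_exit()` standing in for fqdir_pre_exit():

```c
/*
 * Order that produces "'inline' is not at beginning of declaration"
 * (left commented out here):
 *
 *   static void inline toy_pre_exit(int *dead) { *dead = 1; }
 */

/* Order without the warning: storage class, 'inline', then return type. */
static inline void toy_pre_exit(int *dead)
{
	*dead = 1;
}
```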
  5. 23 June, 2019 1 commit
    • A
      net: fastopen: robustness and endianness fixes for SipHash · 438ac880
      Committed by Ard Biesheuvel
      Some changes to the TCP fastopen code to make it more robust
      against future changes in the choice of key/cookie size, etc.
      
      - Instead of keeping the SipHash key in an untyped u8[] buffer
        and casting it to the right type upon use, use the correct
        type directly. This ensures that the key will appear at the
        correct alignment if we ever change the way these data
        structures are allocated. (Currently, they are only allocated
        via kmalloc so they always appear at the correct alignment)
      
      - Use DIV_ROUND_UP when sizing the u64[] array to hold the
        cookie, so it is always of sufficient size, even if
        TCP_FASTOPEN_COOKIE_MAX is no longer a multiple of 8.
      
      - Drop the 'len' parameter from the tcp_fastopen_reset_cipher()
        function, which is no longer used.
      
      - Add endian swabbing when setting the keys and calculating the hash,
        to ensure that cookie values are the same for a given key and
        source/destination address pair regardless of the endianness of
        the server.
      
      Note that none of these are functional changes wrt the current
      state of the code, with the exception of the swabbing, which only
      affects big endian systems.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
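Two of the points in the commit above, sizing the u64[] array with a round-up division and reading key material in a fixed byte order, can be sketched in portable C. The names are hypothetical (`TOY_DIV_ROUND_UP`, `TOY_COOKIE_MAX`, `toy_cookie`, `toy_get_le64()`), the cookie size of 16 is an assumed value, and little-endian is assumed as the fixed byte order for the sake of the example.

```c
#include <stdint.h>
#include <stddef.h>

#define TOY_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define TOY_COOKIE_MAX		16	/* assumed cookie size in bytes */

/* Enough 64-bit words for the cookie, even if the size is not a multiple of 8. */
struct toy_cookie {
	uint64_t val[TOY_DIV_ROUND_UP(TOY_COOKIE_MAX, sizeof(uint64_t))];
};

/* Read 8 bytes as a little-endian value, whatever the host byte order. */
static uint64_t toy_get_le64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}
```

Because the value is assembled byte by byte, the same input bytes yield the same 64-bit number on little-endian and big-endian hosts, which is the property the swabbing in the commit is after.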
  6. 20 June, 2019 3 commits
    • J
      page_pool: fix compile warning when CONFIG_PAGE_POOL is disabled · 497ad9f5
      Committed by Jesper Dangaard Brouer
      Kbuild test robot reported compile warning:
       warning: no return statement in function returning non-void
      in function page_pool_request_shutdown, when CONFIG_PAGE_POOL is disabled.
      
      The fix makes the code a little more verbose, with a descriptive variable.
      
      Fixes: 99c07c43 ("xdp: tracking page_pool resources and safe removal")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
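The kind of fix described above, a descriptive variable so that every configuration reaches a return statement, can be sketched as follows. This is an illustration of the pattern, not the kernel code: `TOY_CONFIG_POOL`, `toy_pool`, `toy_pool_shutdown_impl()` and `toy_pool_request_shutdown()` are hypothetical names.

```c
#include <stdbool.h>

struct toy_pool {
	int inflight;		/* pages still outstanding */
};

#ifdef TOY_CONFIG_POOL
static bool toy_pool_shutdown_impl(struct toy_pool *pool)
{
	return pool->inflight == 0;
}
#endif

static inline bool toy_pool_request_shutdown(struct toy_pool *pool)
{
	bool safe_to_remove = false;

#ifdef TOY_CONFIG_POOL
	safe_to_remove = toy_pool_shutdown_impl(pool);
#else
	(void)pool;		/* feature compiled out: nothing to query */
#endif
	return safe_to_remove;	/* reached whether or not the option is set */
}
```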
    • E
      inet: clear num_timeout reqsk_alloc() · 85f9aa75
      Committed by Eric Dumazet
      KMSAN caught uninit-value in tcp_create_openreq_child() [1]
      This is caused by a recent change, combined with the fact
      that TCP cleared num_timeout, num_retrans and sk fields only
      when a request socket was about to be queued.
      
      Under syncookie mode, a temporary request socket is used,
      and req->num_timeout could contain garbage.
      
      Let's clear these three fields sooner, there is really no
      point trying to defer this and risk other bugs.
      
      [1]
      
      BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
      CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x191/0x1f0 lib/dump_stack.c:113
       kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611
       __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304
       tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
       tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152
       tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209
       cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252
       tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
       tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
       tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
       ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
       ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
       dst_input include/net/dst.h:439 [inline]
       ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
       __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
       __netif_receive_skb net/core/dev.c:5095 [inline]
       process_backlog+0x721/0x1410 net/core/dev.c:5906
       napi_poll net/core/dev.c:6329 [inline]
       net_rx_action+0x738/0x1940 net/core/dev.c:6395
       __do_softirq+0x4ad/0x858 kernel/softirq.c:293
       do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
       </IRQ>
       do_softirq kernel/softirq.c:338 [inline]
       __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
       local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
       rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
       ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
       ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
       dst_output include/net/dst.h:433 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
       inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
       __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
       tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
       tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
       __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
       tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
       tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
       inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
       inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
       __sock_release net/socket.c:601 [inline]
       sock_close+0x156/0x490 net/socket.c:1273
       __fput+0x4c9/0xba0 fs/file_table.c:280
       ____fput+0x37/0x40 fs/file_table.c:313
       task_work_run+0x22e/0x2a0 kernel/task_work.c:113
       tracehook_notify_resume include/linux/tracehook.h:185 [inline]
       exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
       prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
       syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
       do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      RIP: 0033:0x401d50
      Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00
      RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
      RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50
      RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c
      R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0
      R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline]
       kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160
       kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177
       kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781
       reqsk_alloc include/net/request_sock.h:84 [inline]
       inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384
       cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173
       tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
       tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
       tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
       ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
       ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
       dst_input include/net/dst.h:439 [inline]
       ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
       __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
       __netif_receive_skb net/core/dev.c:5095 [inline]
       process_backlog+0x721/0x1410 net/core/dev.c:5906
       napi_poll net/core/dev.c:6329 [inline]
       net_rx_action+0x738/0x1940 net/core/dev.c:6395
       __do_softirq+0x4ad/0x858 kernel/softirq.c:293
       do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
       do_softirq kernel/softirq.c:338 [inline]
       __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
       local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
       rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
       ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
       ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
       dst_output include/net/dst.h:433 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
       inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
       __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
       tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
       tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
       __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
       tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
       tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
       inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
       inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
       __sock_release net/socket.c:601 [inline]
       sock_close+0x156/0x490 net/socket.c:1273
       __fput+0x4c9/0xba0 fs/file_table.c:280
       ____fput+0x37/0x40 fs/file_table.c:313
       task_work_run+0x22e/0x2a0 kernel/task_work.c:113
       tracehook_notify_resume include/linux/tracehook.h:185 [inline]
       exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
       prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
       syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
       do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      
      Fixes: 336c39a0 ("tcp: undo init congestion window on false SYNACK timeout")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      85f9aa75
    • K
      net: sched: act_ctinfo: tidy UAPI definition · 16e5a266
      Kevin Darbyshire-Bryant committed
      Remove some enums from the UAPI definition that were only used
      internally and are NOT part of the UAPI.
      Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16e5a266
  7. 19 Jun, 2019 18 commits
    • L
      netfilter: nf_tables: enable set expiration time for set elements · 79ebb5bb
      Laura Garcia Liebana committed
      Currently, the expiration of every element in a set or map
      is a read-only parameter generated on the kernel side.
      
      This change makes it possible to set a specific expiration
      per element, which is required, for example, during
      stateful replication among several nodes.
      
      This patch handles NFTA_SET_ELEM_EXPIRATION in order
      to configure the expiration parameter per element, and
      falls back to the timeout when the expiration
      is not set.
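
The fallback rule described above can be sketched as follows. This is a minimal illustration, not the nf_tables code: the function name `toy_elem_deadline`, the use of 0 to mean "no expiration attribute supplied", and the abstract time unit are all assumptions made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the absolute deadline of a set element.
 * 'expiration' is the per-element remaining lifetime requested by user
 * space; 0 stands for "attribute not supplied" in this sketch. In that
 * case the element's timeout is used instead.
 */
static uint64_t toy_elem_deadline(uint64_t now, uint64_t timeout,
                                  uint64_t expiration)
{
    uint64_t remaining = expiration ? expiration : timeout;

    return now + remaining;
}
```

With a 60-unit timeout, an element restored during replication with 15 units left would get a deadline of `now + 15`, while a fresh element without the attribute would get `now + 60`.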
      Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      79ebb5bb
    • E
      inet: fix various use-after-free in defrags units · d5dd8879
      Eric Dumazet committed
      syzbot reported another issue caused by my recent patches. [1]
      
      The issue here is that fqdir_exit() is initiating a work queue
      and immediately returns. A bit later cleanup_net() was able
      to free the MIB (percpu data) and the whole struct net was freed,
      but we had active frag timers that fired and triggered use-after-free.
      
      We need to make sure that timers can catch fqdir->dead being set,
      so that they bail out.
      
      Since RCU is used for the reader side, this means
      we want to respect an RCU grace period between these operations:
      
      1) fqdir->dead = 1;
      
      2) netns dismantle (freeing of various data structures)
      
      This patch uses the new (struct pernet_operations)->pre_exit
      infrastructure to ensure a full RCU grace period
      happens between fqdir_pre_exit() and fqdir_exit().
      
      This also means we can use a regular work queue; we no
      longer need rcu_work.
      
      Tested:
      
      $ time for i in {1..1000}; do unshare -n /bin/false;done
      
      real	0m2.585s
      user	0m0.160s
      sys	0m2.214s
      
      [1]
      
      BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
      Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860
      
      CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
       call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
       expire_timers kernel/time/timer.c:1366 [inline]
       __run_timers kernel/time/timer.c:1685 [inline]
       __run_timers kernel/time/timer.c:1653 [inline]
       run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
       __do_softirq+0x25c/0x94c kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:374 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:414
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
       </IRQ>
      RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
      Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
      RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
      RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
      RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
      R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
      R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
       tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
       tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
       tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
       tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
       security_file_ioctl+0x77/0xc0 security/security.c:1370
       ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4592c9
      Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
      RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
      RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
      R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff
      
      Allocated by task 9047:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
       slab_post_alloc_hook mm/slab.h:437 [inline]
       slab_alloc mm/slab.c:3326 [inline]
       kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
       kmem_cache_zalloc include/linux/slab.h:732 [inline]
       net_alloc net/core/net_namespace.c:386 [inline]
       copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
       create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
       ksys_unshare+0x440/0x980 kernel/fork.c:2692
       __do_sys_unshare kernel/fork.c:2760 [inline]
       __se_sys_unshare kernel/fork.c:2758 [inline]
       __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 2541:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kmem_cache_free+0x86/0x260 mm/slab.c:3698
       net_free net/core/net_namespace.c:402 [inline]
       net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
       net_drop_ns net/core/net_namespace.c:408 [inline]
       cleanup_net+0x538/0x960 net/core/net_namespace.c:571
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      The buggy address belongs to the object at ffff88808b9fe100
       which belongs to the cache net_namespace of size 6784
      The buggy address is located 560 bytes inside of
       6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
      The buggy address belongs to the page:
      page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
      flags: 0x1fffc0000010200(slab|head)
      raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
      raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
       ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 3c8fc878 ("inet: frags: rework rhashtable dismantle")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5dd8879
    • E
      netns: add pre_exit method to struct pernet_operations · d7d99872
      Eric Dumazet committed
      Current struct pernet_operations exit() handlers are strongly
      discouraged from calling synchronize_rcu().
      
      There are cases where we need them, and exit_batch() does
      not help the common case where a single netns is dismantled.
      
      This patch leverages the existing synchronize_rcu() call
      in cleanup_net().
      
      Calling an optional ->pre_exit() method before ->exit() or
      ->exit_batch() lets handlers benefit from a single synchronize_rcu()
      call.
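
The ordering this relies on can be modelled in plain C. The sketch below is a toy, single-threaded model and not kernel code: every `toy_` name is hypothetical, and `toy_synchronize_rcu()` is only a stub that counts calls and records what state it observed. It shows all `pre_exit` hooks running first, then one shared grace period, then all `exit` hooks.

```c
#include <stddef.h>

/* Toy counterpart of struct pernet_operations; both hooks are optional. */
struct toy_pernet_ops {
    void (*pre_exit)(void);
    void (*exit)(void);
};

static int toy_dead;              /* set by the pre_exit hook */
static int toy_freed;             /* set by the exit hook */
static int toy_grace_periods;     /* how many grace periods were paid for */
static int toy_dead_before_grace; /* was 'dead' visible, and nothing freed? */

static void toy_mark_dead(void)  { toy_dead = 1; }
static void toy_free_state(void) { toy_freed = 1; }

/* Stub: a real grace period waits for all pre-existing RCU readers. */
static void toy_synchronize_rcu(void)
{
    toy_grace_periods++;
    toy_dead_before_grace = toy_dead && !toy_freed;
}

/* All pre_exit hooks, one grace period, then all exit hooks. */
static void toy_cleanup_net(const struct toy_pernet_ops *ops, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (ops[i].pre_exit)
            ops[i].pre_exit();

    toy_synchronize_rcu();

    for (i = 0; i < n; i++)
        if (ops[i].exit)
            ops[i].exit();
}
```

However many subsystems register a `pre_exit` hook, the model pays for a single grace period, which is the benefit the commit message describes.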
      
      Note that the synchronize_rcu() calls added in this patch
      are only in error paths or slow paths.
      
      Tested:
      
      $ time for i in {1..1000}; do unshare -n /bin/false;done
      
      real	0m2.612s
      user	0m0.171s
      sys	0m2.216s
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7d99872
    • J
      xdp: add tracepoints for XDP mem · f033b688
      Jesper Dangaard Brouer committed
      These tracepoints make it easier to troubleshoot XDP mem id disconnect.
      
      The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is
      placed at the last stable place for the pointer to struct xdp_mem_allocator,
      just before it's scheduled for RCU removal. It also extracts info on
      'safe_to_remove' and 'force'.
      
      Detailed info about in-flight pages is not available at this layer. The next
      patch will add the tracepoints needed at the page_pool layer for this.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f033b688
    • J
      xdp: tracking page_pool resources and safe removal · 99c07c43
      Jesper Dangaard Brouer committed
      This patch is needed before we can allow drivers to use page_pool for
      DMA-mappings. Today, with page_pool and the XDP return API, it is possible to
      remove the page_pool object (from the rhashtable) while there are still
      in-flight packet-pages. This is safely handled via RCU, and failed lookups in
      __xdp_return() fall back to calling put_page() when the page_pool object is gone.
      If the page is still DMA mapped, this results in the page not getting
      correctly DMA unmapped.
      
      To solve this, the page_pool is extended with tracking of in-flight pages, and
      the XDP disconnect system queries the page_pool and waits, via workqueue, for all
      in-flight pages to be returned.
      
      To avoid killing performance when tracking in-flight pages, the implementation
      uses two (unsigned) counters that are placed on different cache lines and
      can be used to deduce the number of in-flight packets. This is done by mapping
      the unsigned "sequence" counters onto signed two's complement arithmetic
      operations. This is, e.g., used by the kernel's time_after macros, described in
      kernel commits 1ba3aab3 and 5a581b36, and also explained in RFC1982.
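
The counter arithmetic can be sketched in a few lines of C. The struct and function names below are hypothetical and are not taken from the page_pool code, and the cache-line separation of the two counters is left out. The sketch only illustrates how subtracting two wrapping unsigned counters and reading the result as a signed value yields the in-flight count, even after the counters wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sequence counters. */
struct toy_pool_counters {
    uint32_t hold_cnt;    /* incremented for every page handed out */
    uint32_t release_cnt; /* incremented for every page returned */
};

/*
 * Signed distance between the counters. Unsigned subtraction wraps
 * modulo 2^32, so the result stays correct across counter wrap-around
 * as long as fewer than 2^31 pages are in flight.
 */
static int32_t toy_pool_inflight(const struct toy_pool_counters *c)
{
    return (int32_t)(c->hold_cnt - c->release_cnt);
}

/* Only needed on the slow path, when the pool is being torn down. */
static bool toy_pool_safe_to_remove(const struct toy_pool_counters *c)
{
    return toy_pool_inflight(c) == 0;
}
```

The fast path only ever increments one counter, and the subtraction happens when someone asks whether the pool can be freed.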
      
      The trick is that these two incrementing counters only need to be read and
      compared when checking whether it is safe to free the page_pool structure,
      which only happens when the driver has disconnected the RX/alloc side. Thus,
      this is on a non-fast-path.
      
      It is chosen that page_pool tracking is also enabled for the non-DMA
      use-case, as this can be used for statistics later.
      
      After this patch, using page_pool requires stricter resource "release",
      e.g. via page_pool_release_page(), which was introduced in this patchset;
      previous patches implement/fix this stricter requirement.
      
      Drivers no longer call page_pool_destroy(). Drivers already call
      xdp_rxq_info_unreg(), which calls xdp_rxq_info_unreg_mem_model(), which will
      attempt to disconnect the mem id and, if the attempt fails, schedule the
      disconnect for later via a delayed workqueue.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99c07c43
    • J
      page_pool: introduce page_pool_free and use in mlx5 · e54cfd7e
      Jesper Dangaard Brouer committed
      In case the driver fails to register the page_pool with the XDP return API (via
      xdp_rxq_info_reg_mem_model()), the driver can free the page_pool
      resources more directly than by calling page_pool_destroy(), which performs an
      unnecessary RCU free procedure.
      
      This patch prepares for removing page_pool_destroy() from driver
      invocation.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e54cfd7e
    • J
      xdp: page_pool related fix to cpumap · 6bf071bf
      Jesper Dangaard Brouer committed
      When converting an xdp_frame into an SKB and sending it into the network
      stack, the underlying XDP memory model needs to release associated
      resources, because the network stack doesn't have callbacks for XDP memory
      models.  The only memory model that needs this is page_pool, when a driver
      uses the DMA-mapping feature.
      
      Introduce page_pool_release_page(), which basically does the same as
      page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
      interface for calling it if the memory model matches MEM_TYPE_PAGE_POOL, to
      save the function call overhead for others. Have cpumap call
      xdp_release_frame() before xdp_scrub_frame().
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bf071bf
    • I
      net: page_pool: add helper function to unmap dma addresses · a25d50bf
      Ilias Apalodimas committed
      In a previous patch, the DMA address was stored in 'struct page'.
      Use that to unmap DMA addresses used by network drivers.
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a25d50bf
    • I
      net: page_pool: add helper function to retrieve dma addresses · 0afdeeed
      Ilias Apalodimas committed
      In a previous patch, the DMA address was stored in 'struct page'.
      Use that to retrieve DMA addresses used by network drivers.
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0afdeeed
    • T
      treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 · d2912cb1
      Thomas Gleixner committed
      Based on 2 normalized pattern(s):
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license version 2 as
        published by the free software foundation
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license version 2 as
        published by the free software foundation #
      
      extracted by the scancode license scanner the SPDX license identifier
      
        GPL-2.0-only
      
      has been chosen to replace the boilerplate/reference in 4122 file(s).
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Enrico Weigelt <info@metux.net>
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Allison Randal <allison@lohutok.net>
      Cc: linux-spdx@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d2912cb1
    • T
      treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 234 · caab277b
      Thomas Gleixner committed
      Based on 1 normalized pattern(s):
      
        this program is free software you can redistribute it and or modify
        it under the terms of the gnu general public license version 2 as
        published by the free software foundation this program is
        distributed in the hope that it will be useful but without any
        warranty without even the implied warranty of merchantability or
        fitness for a particular purpose see the gnu general public license
        for more details you should have received a copy of the gnu general
        public license along with this program if not see http www gnu org
        licenses
      
      extracted by the scancode license scanner the SPDX license identifier
      
        GPL-2.0-only
      
      has been chosen to replace the boilerplate/reference in 503 file(s).
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
      Reviewed-by: Allison Randal <allison@lohutok.net>
      Reviewed-by: Enrico Weigelt <info@metux.net>
      Cc: linux-spdx@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190602204653.811534538@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      caab277b
    • J
      net: flow_offload: implement support for meta key · 9558a83a
      Jiri Pirko committed
      Implement support for the previously added flow dissector meta key.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9558a83a
    • J
      flow_dissector: add support for ingress ifindex dissection · 82828b88
      Jiri Pirko committed
      Add a new meta key that contains the ingress ifindex value, and add a function
      to dissect it from the skb. The key and function are prepared to cover
      the dissection of other potential skb metadata values.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      82828b88
    • X
      ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL · 6f6a8622
      Xin Long committed
      A fix similar to the patch "ip_tunnel: allow not to count pkts on tstats by
      setting skb's dev to NULL" is also needed by ip6_tunnel.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f6a8622
    • I
      ipv6: Stop sending in-kernel notifications for each nexthop · d5382fef
      Ido Schimmel committed
      Both listeners - mlxsw and netdevsim - of IPv6 FIB notifications are now
      ready to handle IPv6 multipath notifications.
      
      Therefore, stop ignoring such notifications in both drivers and stop
      sending notification for each added / deleted nexthop.
      
      v2:
      * Remove 'multipath_rt' from 'struct fib6_entry_notifier_info'
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5382fef
    • I
      ipv6: Extend notifier info for multipath routes · d4b96c7b
      Ido Schimmel committed
      Extend the IPv6 FIB notifier info with the number of sibling routes being
      notified.
      
      This will later allow listeners to process one notification for a
      multipath route instead of N, where N is the number of nexthops.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4b96c7b
    • I
      netlink: Add field to skip in-kernel notifications · c82481f7
      Ido Schimmel committed
      The struct includes a 'skip_notify' flag that indicates if netlink
      notifications to user space should be suppressed. As explained in commit
      3b1137fe ("net: ipv6: Change notifications for multipath add to
      RTA_MULTIPATH"), this is useful to suppress per-nexthop RTM_NEWROUTE
      notifications when an IPv6 multipath route is added / deleted. Instead,
      one notification is sent for the entire multipath route.
      
      This concept is also useful for in-kernel notifications. Sending one
      in-kernel notification for the addition / deletion of an IPv6 multipath
      route - instead of one per-nexthop - provides a significant increase in
      the insertion / deletion rate to underlying devices.
      
      Add a 'skip_notify_kernel' flag to suppress in-kernel notifications.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c82481f7
    • I
      netlink: Document all fields of 'struct nl_info' · 3de205cd
      Ido Schimmel committed
      Some fields were not documented. Add documentation.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3de205cd
  8. 18 Jun, 2019 1 commit
    • A
      net: ipv4: move tcp_fastopen server side code to SipHash library · c681edae
      Ard Biesheuvel committed
      Using a bare block cipher in non-crypto code is almost always a bad idea,
      not only for security reasons (and we've seen some examples of this in
      the kernel in the past), but also for performance reasons.
      
      In the TCP fastopen case, we call into the bare AES block cipher one or
      two times (depending on whether the connection is IPv4 or IPv6). On most
      systems, this results in a call chain such as
      
        crypto_cipher_encrypt_one(ctx, dst, src)
          crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
            aesni_encrypt
              kernel_fpu_begin();
              aesni_enc(ctx, dst, src); // asm routine
              kernel_fpu_end();
      
      It is highly unlikely that the use of special AES instructions has a
      benefit in this case, especially since we are doing the above twice
      for IPv6 connections, instead of using a transform which can process
      the entire input in one go.
      
      We could switch to the cbcmac(aes) shash, which would at least get
      rid of the duplicated overhead in *some* cases (i.e., today, only
      arm64 has an accelerated implementation of cbcmac(aes), while x86 will
      end up using the generic cbcmac template wrapping the AES-NI cipher,
      which basically ends up doing exactly the above). However, in the given
      context, it makes more sense to use a light-weight MAC algorithm that
      is more suitable for the purpose at hand, such as SipHash.
      
      Since the output size of SipHash already matches our chosen value for
      TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
      sizes, this greatly simplifies the code as well.
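
To make the "one call over the whole input, 64-bit output" point concrete, here is a compact userspace SipHash-2-4 written for illustration. It is not the kernel's siphash library, and `toy_fastopen_cookie_v4()` with its choice of hashed fields (source and destination address) is an assumption for the example, not a copy of the fastopen code.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ROTL64(x, b) (((x) << (b)) | ((x) >> (64 - (b))))

/* Read 8 bytes as a little-endian 64-bit word. */
static uint64_t toy_load_le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

static void toy_sipround(uint64_t v[4])
{
    v[0] += v[1]; v[1] = TOY_ROTL64(v[1], 13); v[1] ^= v[0];
    v[0] = TOY_ROTL64(v[0], 32);
    v[2] += v[3]; v[3] = TOY_ROTL64(v[3], 16); v[3] ^= v[2];
    v[0] += v[3]; v[3] = TOY_ROTL64(v[3], 21); v[3] ^= v[0];
    v[2] += v[1]; v[1] = TOY_ROTL64(v[1], 17); v[1] ^= v[2];
    v[2] = TOY_ROTL64(v[2], 32);
}

/* SipHash-2-4: 128-bit key, arbitrary-length input, 64-bit output. */
static uint64_t toy_siphash24(const uint8_t *in, size_t len,
                              const uint8_t key[16])
{
    uint64_t k0 = toy_load_le64(key), k1 = toy_load_le64(key + 8);
    uint64_t v[4] = {
        k0 ^ 0x736f6d6570736575ULL, k1 ^ 0x646f72616e646f6dULL,
        k0 ^ 0x6c7967656e657261ULL, k1 ^ 0x7465646279746573ULL,
    };
    uint64_t b = (uint64_t)len << 56; /* length goes in the top byte */
    size_t full = len & ~(size_t)7;   /* bytes covered by whole words */

    for (size_t i = 0; i < full; i += 8) {
        uint64_t m = toy_load_le64(in + i);

        v[3] ^= m;
        toy_sipround(v);
        toy_sipround(v);
        v[0] ^= m;
    }
    for (size_t i = full; i < len; i++) /* trailing 0..7 bytes */
        b |= (uint64_t)in[i] << (8 * (i - full));
    v[3] ^= b;
    toy_sipround(v);
    toy_sipround(v);
    v[0] ^= b;

    v[2] ^= 0xff;
    for (int i = 0; i < 4; i++)
        toy_sipround(v);
    return v[0] ^ v[1] ^ v[2] ^ v[3];
}

/* Hypothetical cookie: one hash over both IPv4 addresses. */
static uint64_t toy_fastopen_cookie_v4(uint32_t saddr, uint32_t daddr,
                                       const uint8_t key[16])
{
    uint8_t buf[8];

    for (int i = 0; i < 4; i++) {
        buf[i] = (uint8_t)(saddr >> (8 * i));
        buf[4 + i] = (uint8_t)(daddr >> (8 * i));
    }
    return toy_siphash24(buf, sizeof(buf), key);
}
```

An IPv6 variant would only pass a longer buffer to the same function, which is the simplification the message refers to. There is no second cipher invocation and no FPU save/restore around each call.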
      
      NOTE: Server farms backing a single server IP for load balancing purposes
            and sharing a single fastopen key will be adversely affected by
            this change unless all systems in the pool receive their kernel
            upgrades at the same time.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c681edae
  9. 17 Jun, 2019 3 commits
    • F
      netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY · d7f9b2f1
      Fernando Fernandez Mancera committed
      Add common functions into nf_synproxy_core.c to prepare for nftables support.
      The prototypes of the functions used by {ipt, ip6t}_SYNPROXY are in the new
      file nf_synproxy.h.
      Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d7f9b2f1
    • C
      netfilter: bridge: port sysctls to use brnf_net · ff6d090d
      Christian Brauner committed
      This ports the sysctls to use struct brnf_net.
      
      With this patch we make it possible to namespace the br_netfilter module in
      the following patch.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      ff6d090d
    • F
      netfilter: conntrack: small conntrack lookup optimization · 87e389b4
      Florian Westphal committed
      ____nf_conntrack_find() performs checks on the conntrack objects in
      this order:
      
      1. if (nf_ct_is_expired(ct))
      
      This fetches ct->timeout, in third cache line.
      
      The hnnode that is used to store the list pointers resides in the first
      (origin) or second (reply tuple) cache lines.
      
      This test rarely passes, but it's necessary to reap obsolete entries.
      
      2. if (nf_ct_is_dying(ct))
      
      This fetches ct->status, also in third cache line.
      
      The test is useless, and can be removed:
        Consider:
           cpu0                                           cpu1
          ct = ____nf_conntrack_find()
          atomic_inc_not_zero(ct) -> ok
          nf_ct_key_equal -> ok
          is_dying -> DYING bit not set, ok
                                                          set_bit(ct, DYING);
      						    ... unhash ... etc.
          return ct
          -> returning a ct with dying bit set, despite
          having a test for it.
      
      This (unlikely) case is fine - refcount prevents ct from getting free'd.
      
      3. if (nf_ct_key_equal(h, tuple, zone, net))
      
      nf_ct_key_equal checks in following order:
      
      1. Tuple equal (first or second cacheline)
      2. Zone equal (third cacheline)
      3. confirmed bit set (->status, third cacheline)
      4. net namespace match (third cacheline).
      
      Swapping "timeout" and "cpu" places timeout in the first cacheline.
      This has two advantages:
      
      1. For a conntrack that won't even match the original tuple,
         we will now only fetch the first and maybe the second cacheline
         instead of always accessing the 3rd one as well.
      
      2.  in case of TCP ct->timeout changes frequently because we
          reduce/increase it when there are packets outstanding in the network.
      
      The first cacheline contains both the reference count and the ct spinlock,
      i.e. moving timeout there avoids writes to 3rd cacheline.
      
      The restart sequence in __nf_conntrack_find() is removed. If we found a
      candidate but then fail to increment the refcount, or discover that the tuple
      has changed (object recycling), just pretend we did not find an entry.
      
      A second lookup won't find anything until another CPU adds a new conntrack
      with identical tuple into the hash table, which is very unlikely.
      
      We have the confirmation-time checks (when we hold hash lock) that deal
      with identical entries and even perform clash resolution in some cases.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      87e389b4
  10. 16 Jun, 2019 3 commits
    • E
      tcp: add tcp_min_snd_mss sysctl · 5f3e2bf0
      Eric Dumazet committed
      Some TCP peers announce a very small MSS option in their SYN and/or
      SYN/ACK messages.
      
      This forces the stack to send packets with a very high network/cpu
      overhead.
      
      Linux has enforced a minimal value of 48. Since this value includes
      the size of TCP options, and that the options can consume up to 40
      bytes, this means that each segment can include only 8 bytes of payload.
      
      In some cases, it can be useful to increase the minimal value
      to a saner value.
      
      We keep the default at 48 (TCP_MIN_SND_MSS), for compatibility
      reasons.
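
The effect of the floor can be sketched with two small helpers. The names below are hypothetical and this is not the kernel's MSS code: the first helper applies a configurable minimum to the MSS a peer announced, and the second shows the worst-case payload left once the maximum of 40 option bytes is subtracted.

```c
#include <stdint.h>

enum { TOY_MAX_TCP_OPTION_SPACE = 40 }; /* options can use up to 40 bytes */

/* Apply the configured floor (the sysctl value) to the peer's MSS. */
static uint32_t toy_clamp_snd_mss(uint32_t peer_mss, uint32_t min_snd_mss)
{
    return peer_mss < min_snd_mss ? min_snd_mss : peer_mss;
}

/* Payload bytes per segment when the options use all 40 bytes. */
static uint32_t toy_worst_case_payload(uint32_t mss)
{
    if (mss < TOY_MAX_TCP_OPTION_SPACE)
        return 0;
    return mss - TOY_MAX_TCP_OPTION_SPACE;
}
```

With the default floor of 48, a peer announcing a tiny MSS is clamped to 48 and can still force segments carrying only 8 bytes of payload. Raising the floor, for example to 536, would leave 496 bytes in the same worst case.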
      
      Note that TCP_MAXSEG socket option enforces a minimal value
      of (TCP_MIN_MSS). David Miller increased this minimal value
      in commit c39508d6 ("tcp: Make TCP_MAXSEG minimum more correct.")
      from 64 to 88.
      
      We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
      
      CVE-2019-11479 -- tcp mss hardcoded to 48
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Jonathan Looney <jtl@netflix.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f3e2bf0
    • E
      tcp: limit payload size of sacked skbs · 3b4929f6
      Eric Dumazet committed
      Jonathan Looney reported that TCP can trigger the following crash
      in tcp_shifted_skb() :
      
      	BUG_ON(tcp_skb_pcount(skb) < pcount);
      
      This can happen if the remote peer has advertized the smallest
      MSS that linux TCP accepts : 48
      
      An skb can hold 17 fragments, and each fragment can hold 32KB
      on x86, or 64KB on PowerPC.
      
      This means that the 16bit width of TCP_SKB_CB(skb)->tcp_gso_segs
      can overflow.
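
The overflow can be checked with simple arithmetic. The sketch below uses hypothetical names and assumes 8 bytes of payload per segment, which is the worst case for an MSS of 48 when TCP options take up 40 bytes (as explained in the tcp_min_snd_mss entry of this log).

```c
#include <stdint.h>

enum {
    TOY_MAX_FRAGS      = 17,        /* fragments per skb */
    TOY_FRAG_BYTES_X86 = 32 * 1024, /* bytes per fragment on x86 */
    TOY_MIN_PAYLOAD    = 8,         /* assumed: MSS 48 minus 40 option bytes */
};

/* Number of segments needed to carry a fully packed skb. */
static uint32_t toy_worst_case_segs(uint32_t frags, uint32_t frag_bytes,
                                    uint32_t payload_per_seg)
{
    return (frags * frag_bytes) / payload_per_seg;
}

/* Does the segment count still fit in a 16-bit field? */
static int toy_fits_u16(uint32_t segs)
{
    return segs <= UINT16_MAX;
}
```

On x86 this gives 17 * 32768 / 8 = 69632 segments, which is above the 65535 limit of a 16-bit counter. Stored in a u16, the value would wrap to 4096.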
      
      Note that tcp_sendmsg() builds skbs with less than 64KB
      of payload, so this problem needs SACK to be enabled.
      SACK blocks allow TCP to coalesce multiple skbs in the retransmit
      queue, thus filling the 17 fragments to maximal capacity.
      
      CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
      
      Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jonathan Looney <jtl@netflix.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b4929f6
    • J
      net: sched: remove NET_CLS_IND config option · a5148626
      Jiri Pirko committed
      This config option makes only a couple of lines optional:
      two small helpers and an int in a couple of cls structs.
      
      Remove the config option and always compile this in.
      This saves users from unexpected surprises when they add
      a filter with an ingress device match, which is silently ignored
      if the config option is not set.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5148626