1. 06 6月, 2019 2 次提交
  2. 05 6月, 2019 10 次提交
    • F
      xfrm: remove init_flags indirection from xfrm_state_afinfo · e4681747
      Florian Westphal 提交于
      There is only one implementation of this function; just call it directly.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      e4681747
    • F
      xfrm: remove init_temprop indirection from xfrm_state_afinfo · 5c1b9ab3
      Florian Westphal 提交于
      same as previous patch: just place this in the caller, no need to
      have an indirection for a structure initialization.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      5c1b9ab3
    • F
      xfrm: remove init_tempsel indirection from xfrm_state_afinfo · bac95935
      Florian Westphal 提交于
      Simple initialization, handle it in the caller.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      bac95935
    • D
      ipv6: Plumb support for nexthop object in a fib6_info · f88d8ea6
      David Ahern 提交于
      Add struct nexthop and nh_list list_head to fib6_info. nh_list is the
      fib6_info side of the nexthop <-> fib_info relationship. Since a fib6_info
      referencing a nexthop object can not have 'sibling' entries (the old way
      of doing multipath routes), the nh_list is a union with fib6_siblings.
      
      Add f6i_list list_head to 'struct nexthop' to track fib6_info entries
      using a nexthop instance. Update __remove_nexthop_fib to walk f6_list
      and delete fib entries using the nexthop.
      
      Add a few nexthop helpers for use when a nexthop is added to fib6_info:
      - nexthop_fib6_nh - return first fib6_nh in a nexthop object
      - fib6_info_nh_dev moved to nexthop.h and updated to use nexthop_fib6_nh
        if the fib6_info references a nexthop object
      - nexthop_path_fib6_result - similar to ipv4, select a path within a
        multipath nexthop object. If the nexthop is a blackhole, set
        fib6_result type to RTN_BLACKHOLE, and set the REJECT flag
      
      Update the fib6_info references to check for nh and take a different path
      as needed:
      - rt6_qualify_for_ecmp - if a fib entry uses a nexthop object it can NOT
        be coalesced with other fib entries into a multipath route
      - rt6_duplicate_nexthop - use nexthop_cmp if either fib6_info references
        a nexthop
      - addrconf (host routes), RA's and info entries (anything configured via
        ndisc) does not use nexthop objects
      - fib6_info_destroy_rcu - put reference to nexthop object
      - fib6_purge_rt - drop fib6_info from f6i_list
      - fib6_select_path - update to use the new nexthop_path_fib6_result when
        fib entry uses a nexthop object
      - rt6_device_match - update to catch use of nexthop object as a blackhole
        and set fib6_type and flags.
      - ip6_route_info_create - don't add space for fib6_nh if fib entry is
        going to reference a nexthop object, take a reference to nexthop object,
        disallow use of source routing
      - rt6_nlmsg_size - add space for RTA_NH_ID
      - add rt6_fill_node_nexthop to add nexthop data on a dump
      
      As with ipv4, most of the changes push existing code into the else branch
      of whether the fib entry uses a nexthop object.
      
      Update the nexthop code to walk f6i_list on a nexthop deleted to remove
      fib entries referencing it.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f88d8ea6
    • D
      ipv4: Plumb support for nexthop object in a fib_info · 4c7e8084
      David Ahern 提交于
      Add 'struct nexthop' and nh_list list_head to fib_info. nh_list is the
      fib_info side of the nexthop <-> fib_info relationship.
      
      Add fi_list list_head to 'struct nexthop' to track fib_info entries
      using a nexthop instance. Add __remove_nexthop_fib and add it to
      __remove_nexthop to walk the new list_head and mark those fib entries
      as dead when the nexthop is deleted.
      
      Add a few nexthop helpers for use when a nexthop is added to fib_info:
      - nexthop_cmp to determine if 2 nexthops are the same
      - nexthop_path_fib_result to select a path for a multipath
        'struct nexthop'
      - nexthop_fib_nhc to select a specific fib_nh_common within a
        multipath 'struct nexthop'
      
      Update existing fib_info_nhc to use nexthop_fib_nhc if a fib_info uses
      a 'struct nexthop', and mark fib_info_nh as only used for the non-nexthop
      case.
      
      Update the fib_info functions to check for fi->nh and take a different
      path as needed:
      - free_fib_info_rcu - put the nexthop object reference
      - fib_release_info - remove the fib_info from the nexthop's fi_list
      - nh_comp - use nexthop_cmp when either fib_info references a nexthop
        object
      - fib_info_hashfn - use the nexthop id for the hashing vs the oif of
        each fib_nh in a fib_info
      - fib_nlmsg_size - add space for the RTA_NH_ID attribute
      - fib_create_info - verify nexthop reference can be taken, verify
        nexthop spec is valid for fib entry, and add fib_info to fi_list for
        a nexthop
      - fib_select_multipath - use the new nexthop_path_fib_result to select a
        path when nexthop objects are used
      - fib_table_lookup - if the 'struct nexthop' is a blackhole nexthop, treat
        it the same as a fib entry using 'blackhole'
      
      The bulk of the changes are in fib_semantics.c and most of that is
      moving the existing change_nexthops into an else branch.
      
      Update the nexthop code to walk fi_list on a nexthop deleted to remove
      fib entries referencing it.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c7e8084
    • D
      ipv4: Prepare for fib6_nh from a nexthop object · dcb1ecb5
      David Ahern 提交于
      Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes
      to use a fib6_nh based nexthop. In the end, only code not using a
      nexthop object in a fib_info should directly access fib_nh in a fib_info
      without checking the famiy and going through fib_nh_common. Those
      functions will be marked when it is not directly evident.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dcb1ecb5
    • D
      ipv4: Use accessors for fib_info nexthop data · 5481d73f
      David Ahern 提交于
      Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the
      fib_dev macro which is an alias for the first nexthop. Replacements:
      
        fi->fib_dev    --> fib_info_nh(fi, 0)->fib_nh_dev
        fi->fib_nh     --> fib_info_nh(fi, 0)
        fi->fib_nh[i]  --> fib_info_nh(fi, i)
        fi->fib_nhs    --> fib_info_num_path(fi)
      
      where fib_info_nh(fi, i) returns fi->fib_nh[nhsel] and fib_info_num_path
      returns fi->fib_nhs.
      
      Move the existing fib_info_nhc to nexthop.h and define the new ones
      there. A later patch adds a check if a fib_info uses a nexthop object,
      and defining the helpers in nexthop.h avoid circular header
      dependencies.
      
      After this all remaining open coded references to fi->fib_nhs and
      fi->fib_nh are in:
      - fib_create_info and helpers used to lookup an existing fib_info
        entry, and
      - the netdev event functions fib_sync_down_dev and fib_sync_up.
      
      The latter two will not be reused for nexthops, and the fib_create_info
      will be updated to handle a nexthop in a fib_info.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5481d73f
    • J
      net/tls: don't pass version to tls_advance_record_sn() · fb0f886f
      Jakub Kicinski 提交于
      All callers pass prot->version as the last parameter
      of tls_advance_record_sn(), yet tls_advance_record_sn()
      itself needs a pointer to prot.  Pass prot from callers.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb0f886f
    • J
      net/tls: reorganize struct tls_context · f0aaa2c9
      Jakub Kicinski 提交于
      struct tls_context is slightly badly laid out.  If we reorder things
      right we can save 16 bytes (320 -> 304) but also make all fast path
      data fit into two cache lines (one read only and one read/write,
      down from four cache lines).
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0aaa2c9
    • J
      devlink: allow driver to update progress of flash update · 191ed202
      Jiri Pirko 提交于
      Introduce a function to be called from drivers during flash. It sends
      notification to userspace about flash update progress.
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      191ed202
  3. 04 6月, 2019 3 次提交
    • E
      net: fix use-after-free in kfree_skb_list · b7034146
      Eric Dumazet 提交于
      syzbot reported nasty use-after-free [1]
      
      Lets remove frag_list field from structs ip_fraglist_iter
      and ip6_fraglist_iter. This seens not needed anyway.
      
      [1] :
      BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
      Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947
      
      CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
       ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
       __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
       ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
       dst_output include/net/dst.h:433 [inline]
       ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
       ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
       ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
       rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
       rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
       inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:671
       ___sys_sendmsg+0x803/0x920 net/socket.c:2292
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
       __do_sys_sendmsg net/socket.c:2339 [inline]
       __se_sys_sendmsg net/socket.c:2337 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x44add9
      Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
      RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
      RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
      R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf
      
      Allocated by task 8947:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
       slab_post_alloc_hook mm/slab.h:437 [inline]
       slab_alloc_node mm/slab.c:3269 [inline]
       kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
       __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
       alloc_skb include/linux/skbuff.h:1058 [inline]
       __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
       ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
       rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
       inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:671
       ___sys_sendmsg+0x803/0x920 net/socket.c:2292
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
       __do_sys_sendmsg net/socket.c:2339 [inline]
       __se_sys_sendmsg net/socket.c:2337 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 8947:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kmem_cache_free+0x86/0x260 mm/slab.c:3698
       kfree_skbmem net/core/skbuff.c:625 [inline]
       kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
       __kfree_skb net/core/skbuff.c:682 [inline]
       kfree_skb net/core/skbuff.c:699 [inline]
       kfree_skb+0xf0/0x390 net/core/skbuff.c:693
       kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
       __dev_xmit_skb net/core/dev.c:3551 [inline]
       __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
       dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
       neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
       neigh_output include/net/neighbour.h:511 [inline]
       ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
       ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
       __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
       ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
       dst_output include/net/dst.h:433 [inline]
       ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
       ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
       ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
       rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
       rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
       inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:671
       ___sys_sendmsg+0x803/0x920 net/socket.c:2292
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
       __do_sys_sendmsg net/socket.c:2339 [inline]
       __se_sys_sendmsg net/socket.c:2337 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff888085a3cbc0
       which belongs to the cache skbuff_head_cache of size 224
      The buggy address is located 0 bytes inside of
       224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
      The buggy address belongs to the page:
      page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
      flags: 0x1fffc0000000200(slab)
      raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
      raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
      >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                 ^
       ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
      
      Fixes: 0feca619 ("net: ipv6: add skbuff fraglist splitter")
      Fixes: c8b17be0 ("net: ipv4: add skbuff fraglist splitter")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7034146
    • E
      flow_offload: include linux/kernel.h from flow_offload.h · fa85999f
      Edward Cree 提交于
      flow_stats_update() uses max_t, so ensure we have that defined.
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa85999f
    • S
      flow_dissector: remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag · 1cc26450
      Stanislav Fomichev 提交于
      This flag is not used by any caller, remove it.
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cc26450
  4. 01 6月, 2019 2 次提交
  5. 31 5月, 2019 11 次提交
    • J
      ipvs: add function to find tunnels · 2aa3c9f4
      Julian Anastasov 提交于
      Add ip_vs_find_tunnel() to match tunnel headers
      by family, address and optional port. Use it to
      properly find the tunnel real server used in
      received ICMP errors.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      2aa3c9f4
    • J
      ipvs: allow rs_table to contain different real server types · 1da40ab6
      Julian Anastasov 提交于
      Before now rs_table was used only for NAT real servers.
      Change it to allow TUN real severs from different types,
      possibly hashed with different port key.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      1da40ab6
    • M
      sctp: deduplicate identical skb_checksum_ops · c3e933a5
      Matteo Croce 提交于
      The same skb_checksum_ops struct is defined twice in two different places,
      leading to code duplication. Declare it as a global variable into a common
      header instead of allocating it on the stack on each function call.
      bloat-o-meter reports a slight code shrink.
      
      add/remove: 1/1 grow/shrink: 0/10 up/down: 128/-1282 (-1154)
      Function                                     old     new   delta
      sctp_csum_ops                                  -     128    +128
      crc32c_csum_ops                               16       -     -16
      sctp_rcv                                    6616    6583     -33
      sctp_packet_pack                            4542    4504     -38
      nf_conntrack_sctp_packet                    4980    4926     -54
      execute_masked_set_action                   6453    6389     -64
      tcf_csum_sctp                                575     428    -147
      sctp_gso_segment                            1292    1126    -166
      sctp_csum_check                              579     412    -167
      sctp_snat_handler                            957     772    -185
      sctp_dnat_handler                           1321    1132    -189
      l4proto_manip_pkt                           2536    2313    -223
      Total: Before=359297613, After=359296459, chg -0.00%
      Reviewed-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NMatteo Croce <mcroce@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3e933a5
    • P
      netfilter: bridge: add connection tracking system · 3c171f49
      Pablo Neira Ayuso 提交于
      This patch adds basic connection tracking support for the bridge,
      including initial IPv4 support.
      
      This patch register two hooks to deal with the bridge forwarding path,
      one from the bridge prerouting hook to call nf_conntrack_in(); and
      another from the bridge postrouting hook to confirm the entry.
      
      The conntrack bridge prerouting hook defragments packets before passing
      them to nf_conntrack_in() to look up for an existing entry, otherwise a
      new entry is allocated and it is attached to the skbuff. The conntrack
      bridge postrouting hook confirms new conntrack entries, ie. if this is
      the first packet seen, then it adds the entry to the hashtable and (if
      needed) it refragments the skbuff into the original fragments, leaving
      the geometry as is if possible. Exceptions are linearized skbuffs, eg.
      skbuffs that are passed up to nfqueue and conntrack helpers, as well as
      cloned skbuff for the local delivery (eg. tcpdump), also in case of
      bridge port flooding (cloned skbuff too).
      
      The packet defragmentation is done through the ip_defrag() call.  This
      forces us to save the bridge control buffer, reset the IP control buffer
      area and then restore it after call. This function also bumps the IP
      fragmentation statistics, it would be probably desiderable to have
      independent statistics for the bridge defragmentation/refragmentation.
      The maximum fragment length is stored in the control buffer and it is
      used to refragment the skbuff from the postrouting path.
      
      The new fraglist splitter and fragment transformer APIs are used to
      implement the bridge refragmentation code. The br_ip_fragment() function
      drops the packet in case the maximum fragment size seen is larger than
      the output port MTU.
      
      This patchset follows the principle that conntrack should not drop
      packets, so users can do it through policy via invalid state matching.
      
      Like br_netfilter, there is no refragmentation for packets that are
      passed up for local delivery, ie. prerouting -> input path. There are
      calls to nf_reset() already in several spots in the stack since time ago
      already, eg. af_packet, that show that skbuff fraglist handling from the
      netif_rx path is supported already.
      
      The helpers are called from the postrouting hook, before confirmation,
      from there we may see packet floods to bridge ports. Then, although
      unlikely, this may result in exercising the helpers many times for each
      clone. It would be good to explore how to pass all the packets in a list
      to the conntrack hook to do this handle only once for this case.
      
      Thanks to Florian Westphal for handing me over an initial patchset
      version to add support for conntrack bridge.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c171f49
    • P
      netfilter: nf_conntrack: allow to register bridge support · d035f19f
      Pablo Neira Ayuso 提交于
      This patch adds infrastructure to register and to unregister bridge
      support for the conntrack module via nf_ct_bridge_register() and
      nf_ct_bridge_unregister().
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d035f19f
    • P
      net: ipv6: split skbuff into fragments transformer · 8a6a1f17
      Pablo Neira Ayuso 提交于
      This patch exposes a new API to refragment a skbuff. This allows you to
      split either a linear skbuff or to force the refragmentation of an
      existing fraglist using a different mtu. The API consists of:
      
      * ip6_frag_init(), that initializes the internal state of the transformer.
      * ip6_frag_next(), that allows you to fetch the next fragment. This function
        internally allocates the skbuff that represents the fragment, it pushes
        the IPv6 header, and it also copies the payload for each fragment.
      
      The ip6_frag_state object stores the internal state of the splitter.
      
      This code has been extracted from ip6_fragment(). Symbols are also
      exported to allow to reuse this iterator from the bridge codepath to
      build its own refragmentation routine by reusing the existing codebase.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a6a1f17
    • P
      net: ipv4: split skbuff into fragments transformer · 065ff79f
      Pablo Neira Ayuso 提交于
      This patch exposes a new API to refragment a skbuff. This allows you to
      split either a linear skbuff or to force the refragmentation of an
      existing fraglist using a different mtu. The API consists of:
      
      * ip_frag_init(), that initializes the internal state of the transformer.
      * ip_frag_next(), that allows you to fetch the next fragment. This function
        internally allocates the skbuff that represents the fragment, it pushes
        the IPv4 header, and it also copies the payload for each fragment.
      
      The ip_frag_state object stores the internal state of the splitter.
      
      This code has been extracted from ip_do_fragment(). Symbols are also
      exported to allow to reuse this iterator from the bridge codepath to
      build its own refragmentation routine by reusing the existing codebase.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      065ff79f
    • P
      net: ipv6: add skbuff fraglist splitter · 0feca619
      Pablo Neira Ayuso 提交于
      This patch adds the skbuff fraglist split iterator. This API provides an
      iterator to transform the fraglist into single skbuff objects, it
      consists of:
      
      * ip6_fraglist_init(), that initializes the internal state of the
        fraglist iterator.
      * ip6_fraglist_prepare(), that restores the IPv6 header on the fragment.
      * ip6_fraglist_next(), that retrieves the fragment from the fraglist and
        updates the internal state of the iterator to point to the next
        fragment in the fraglist.
      
      The ip6_fraglist_iter object stores the internal state of the iterator.
      
      This code has been extracted from ip6_fragment(). Symbols are also
      exported to allow to reuse this iterator from the bridge codepath to
      build its own refragmentation routine by reusing the existing codebase.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0feca619
    • P
      net: ipv4: add skbuff fraglist splitter · c8b17be0
      Pablo Neira Ayuso 提交于
      This patch adds the skbuff fraglist splitter. This API provides an
      iterator to transform the fraglist into single skbuff objects, it
      consists of:
      
      * ip_fraglist_init(), that initializes the internal state of the
        fraglist splitter.
      * ip_fraglist_prepare(), that restores the IPv4 header on the
        fragments.
      * ip_fraglist_next(), that retrieves the fragment from the fraglist and
        it updates the internal state of the splitter to point to the next
        fragment skbuff in the fraglist.
      
      The ip_fraglist_iter object stores the internal state of the iterator.
      
      This code has been extracted from ip_do_fragment(). Symbols are also
      exported to allow to reuse this iterator from the bridge codepath to
      build its own refragmentation routine by reusing the existing codebase.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8b17be0
    • J
      tcp: add backup TFO key infrastructure · 9092a76d
      Jason Baron 提交于
      We would like to be able to rotate TFO keys while minimizing the number of
      client cookies that are rejected. Currently, we have only one key which can
      be used to generate and validate cookies, thus if we simply replace this
      key clients can easily have cookies rejected upon rotation.
      
      We propose having the ability to have both a primary key and a backup key.
      The primary key is used to generate as well as to validate cookies.
      The backup is only used to validate cookies. Thus, keys can be rotated as:
      
      1) generate new key
      2) add new key as the backup key
      3) swap the primary and backup key, thus setting the new key as the primary
      
      We don't simply set the new key as the primary key and move the old key to
      the backup slot because the ip may be behind a load balancer and we further
      allow for the fact that all machines behind the load balancer will not be
      updated simultaneously.
      
      We make use of this infrastructure in subsequent patches.
      Suggested-by: NIgor Lubashev <ilubashe@akamai.com>
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9092a76d
    • S
      udp: Avoid post-GRO UDP checksum recalculation · f2696099
      Sean Tranchetti 提交于
      Currently, when resegmenting an unexpected UDP GRO packet, the full UDP
      checksum will be calculated for every new SKB created by skb_segment()
      because the netdev features passed in by udp_rcv_segment() lack any
      information about checksum offload capabilities.
      
      Usually, we have no need to perform this calculation again, as
        1) The GRO implementation guarantees that any packets making it to the
           udp_rcv_segment() function had correct checksums, and, more
           importantly,
        2) Upon the successful return of udp_rcv_segment(), we immediately pull
           the UDP header off and either queue the segment to the socket or
           hand it off to a new protocol handler.
      
      Unless userspace has set the IP_CHECKSUM sockopt to indicate that they
      want the final checksum values, we can pass the needed netdev feature
      flags to __skb_gso_segment() to avoid checksumming each segment in
      skb_segment().
      
      Fixes: cf329aa4 ("udp: cope with UDP GRO packet misdirection")
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: NSean Tranchetti <stranche@codeaurora.org>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2696099
  6. 30 5月, 2019 2 次提交
    • I
      net: phylink: Add struct phylink_config to PHYLINK API · 44cc27e4
      Ioana Ciornei 提交于
      The phylink_config structure will encapsulate a pointer to a struct
      device and the operation type requested for this instance of PHYLINK.
      This patch does not make any functional changes, it just transitions the
      PHYLINK internals and all its users to the new API.
      
      A pointer to a phylink_config structure will be passed to
      phylink_create() instead of the net_device directly. Also, the same
      phylink_config pointer will be passed back to all phylink_mac_ops
      callbacks instead of the net_device. Using this mechanism, a PHYLINK
      user can get the original net_device using a structure such as
      'to_net_dev(config->dev)' or directly the structure containing the
      phylink_config using a container_of call.
      
      At the moment, only the PHYLINK_NETDEV is defined as a valid operation
      type for PHYLINK. In this mode, a valid reference to a struct device
      linked to the original net_device should be passed to PHYLINK through
      the phylink_config structure.
      
      This API changes is mainly driven by the necessity of adding a new
      operation type in PHYLINK that disconnects the phy_device from the
      net_device and also works when the net_device is lacking.
      Signed-off-by: NIoana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NMaxime Chevallier <maxime.chevallier@bootlin.com>
      Tested-by: NMaxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44cc27e4
    • K
      net: sched: Introduce act_ctinfo action · 24ec483c
      Kevin 'ldir' Darbyshire-Bryant 提交于
      ctinfo is a new tc filter action module.  It is designed to restore
      information contained in firewall conntrack marks to other packet fields
      and is typically used on packet ingress paths.  At present it has two
      independent sub-functions or operating modes, DSCP restoration mode &
      skb mark restoration mode.
      
      The DSCP restore mode:
      
      This mode copies DSCP values that have been placed in the firewall
      conntrack mark back into the IPv4/v6 diffserv fields of relevant
      packets.
      
      The DSCP restoration is intended for use and has been found useful for
      restoring ingress classifications based on egress classifications across
      links that bleach or otherwise change DSCP, typically home ISP Internet
      links.  Restoring DSCP on ingress on the WAN link allows qdiscs such as
      but by no means limited to CAKE to shape inbound packets according to
      policies that are easier to set & mark on egress.
      
      Ingress classification is traditionally a challenging task since
      iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
      lookups, hence are unable to see internal IPv4 addresses as used on the
      typical home masquerading gateway.  Thus marking the connection in some
      manner on egress for later restoration of classification on ingress is
      easier to implement.
      
      Parameters related to DSCP restore mode:
      
      dscpmask - a 32 bit mask of 6 contiguous bits and indicate bits of the
      conntrack mark field contain the DSCP value to be restored.
      
      statemask - a 32 bit mask of (usually) 1 bit length, outside the area
      specified by dscpmask.  This represents a conditional operation flag
      whereby the DSCP is only restored if the flag is set.  This is useful to
      implement a 'one shot' iptables based classification where the
      'complicated' iptables rules are only run once to classify the
      connection on initial (egress) packet and subsequent packets are all
      marked/restored with the same DSCP.  A mask of zero disables the
      conditional behaviour ie. the conntrack mark DSCP bits are always
      restored to the ip diffserv field (assuming the conntrack entry is found
      & the skb is an ipv4/ipv6 type)
      
      e.g. dscpmask 0xfc000000 statemask 0x01000000
      
      |----0xFC----conntrack mark----000000---|
      | Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
      | DSCP       | unused | flag  |unused   |
      |-----------------------0x01---000000---|
            |                   |
            |                   |
            ---|             Conditional flag
               v             only restore if set
      |-ip diffserv-|
      | 6 bits      |
      |-------------|
      
      The skb mark restore mode (cpmark):
      
      This mode copies the firewall conntrack mark to the skb's mark field.
      It is completely the functional equivalent of the existing act_connmark
      action with the additional feature of being able to apply a mask to the
      restored value.
      
      Parameters related to skb mark restore mode:
      
      mask - a 32 bit mask applied to the firewall conntrack mark to mask out
      bits unwanted for restoration.  This can be useful where the conntrack
      mark is being used for different purposes by different applications.  If
      not specified and by default the whole mark field is copied (i.e.
      default mask of 0xffffffff)
      
      e.g. mask 0x00ffffff to mask out the top 8 bits being used by the
      aforementioned DSCP restore mode.
      
      |----0x00----conntrack mark----ffffff---|
      | Bits 31-24 |                          |
      | DSCP & flag|      some value here     |
      |---------------------------------------|
      			|
      			|
      			v
      |------------skb mark-------------------|
      |            |                          |
      |  zeroed    |                          |
      |---------------------------------------|
      
      Overall parameters:
      
      zone - conntrack zone
      
      control - action related control (reclassify | pipe | drop | continue |
      ok | goto chain <CHAIN_INDEX>)
      Signed-off-by: NKevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Reviewed-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24ec483c
  7. 29 5月, 2019 7 次提交
    • D
      nexthop: Add support for nexthop groups · 430a0491
      David Ahern 提交于
      Allow the creation of nexthop groups which reference other nexthop
      objects to create multipath routes:
      
                            +--------------+
         +------------+   +--------------+ |
         | nh  nh_grp --->| nh_grp_entry |-+
         +------------+   +---------|----+
           ^                |       |    +------------+
           +----------------+       +--->| nh, weight |
              nh_parent                  +------------+
      
      A group entry points to a nexthop with a weight for that hop within the
      group. The nexthop has a list_head, grp_list, for tracking which groups
      it is a member of and the group entry has a reference back to the parent.
      The grp_list is used when a nexthop is deleted - to efficiently remove
      it from groups using it.
      
      If a nexthop group spec is given, no other attributes can be set. Each
      nexthop id in a group spec must already exist.
      
      Similar to single nexthops, the specification of a nexthop group can be
      updated so that data is managed with rcu locking.
      
      Add path selection function to account for multiple paths and add
      ipv{4,6}_good_nh helpers to know that if a neighbor entry exists it is
      in a good state.
      
      Update NETDEV event handling to rebalance multipath nexthop groups if
      a nexthop is deleted due to a link event (down or unregister).
      
      When a nexthop is removed any groups using it are updated. Groups using a
      nexthop a tracked via a grp_list.
      
      Nexthop dumps can be limited to groups only by adding NHA_GROUPS to the
      request.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      430a0491
    • D
      nexthop: Add support for lwt encaps · b513bd03
      David Ahern 提交于
      Add support for NHA_ENCAP and NHA_ENCAP_TYPE. Leverages the existing code
      for lwtunnel within fib_nh_common, so the only change needed is handling
      the attributes in the nexthop code.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b513bd03
    • D
      nexthop: Add support for IPv6 gateways · 53010f99
      David Ahern 提交于
      Handle IPv6 gateway in a nexthop spec. If nh_family is set to AF_INET6,
      NHA_GATEWAY is expected to be an IPv6 address. Add ipv6 option to gw in
      nh_config to hold the address, add fib6_nh to nh_info to leverage the
      ipv6 initialization and cleanup code. Update nh_fill_node to dump the v6
      address.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53010f99
    • D
      nexthop: Add support for IPv4 nexthops · 597cfe4f
      David Ahern 提交于
      Add support for IPv4 nexthops. If nh_family is set to AF_INET, then
      NHA_GATEWAY is expected to be an IPv4 address.
      
      Register for netdev events to be notified of admin up/down changes as
      well as deletes. A hash table is used to track nexthop per devices to
      quickly convert device events to the affected nexthops.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      597cfe4f
    • D
      net: Initial nexthop code · ab84be7e
      David Ahern 提交于
      Barebones start point for nexthops. Implementation for RTM commands,
      notifications, management of rbtree for holding nexthops by id, and
      kernel side data structures for nexthops and nexthop config.
      
      Nexthops are maintained in an rbtree sorted by id. Similar to routes,
      nexthops are configured per namespace using netns_nexthop struct added
      to struct net.
      
      Nexthop notifications are sent when a nexthop is added or deleted,
      but NOT if the delete is due to a device event or network namespace
      teardown (which also involves device events). Applications are
      expected to use the device down event to flush nexthops and any
      routes used by the nexthops.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab84be7e
    • E
      inet: frags: fix use-after-free read in inet_frag_destroy_rcu · dc93f46b
      Eric Dumazet 提交于
      As caught by syzbot [1], the rcu grace period that is respected
      before fqdir_rwork_fn() proceeds and frees fqdir is not enough
      to prevent inet_frag_destroy_rcu() being run after the freeing.
      
      We need a proper rcu_barrier() synchronization to replace
      the one we had in inet_frags_fini()
      
      We also have to fix a potential problem at module removal :
      inet_frags_fini() needs to make sure that all queued work queues
      (fqdir_rwork_fn) have completed, otherwise we might
      call kmem_cache_destroy() too soon and get another use-after-free.
      
      [1]
      BUG: KASAN: use-after-free in inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
      Read of size 8 at addr ffff88806ed47a18 by task swapper/1/0
      
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0-rc1+ #2
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
       __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
       rcu_do_batch kernel/rcu/tree.c:2092 [inline]
       invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline]
       rcu_core+0xba5/0x1500 kernel/rcu/tree.c:2291
       __do_softirq+0x25c/0x94c kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:374 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:414
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
       </IRQ>
      RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
      Code: ff ff 48 89 df e8 f2 95 8c fa eb 82 e9 07 00 00 00 0f 00 2d e4 45 4b 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d d4 45 4b 00 fb f4 <c3> 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 8e 18 42 fa e8 99
      RSP: 0018:ffff8880a98e7d78 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
      RAX: 1ffffffff1164e11 RBX: ffff8880a98d4340 RCX: 0000000000000000
      RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffff8880a98d4bbc
      RBP: ffff8880a98e7da8 R08: ffff8880a98d4340 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      R13: ffffffff88b27078 R14: 0000000000000001 R15: 0000000000000000
       arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
       default_idle_call+0x36/0x90 kernel/sched/idle.c:94
       cpuidle_idle_call kernel/sched/idle.c:154 [inline]
       do_idle+0x377/0x560 kernel/sched/idle.c:263
       cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:354
       start_secondary+0x34e/0x4c0 arch/x86/kernel/smpboot.c:267
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
      
      Allocated by task 8877:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
       kmem_cache_alloc_trace+0x151/0x750 mm/slab.c:3555
       kmalloc include/linux/slab.h:547 [inline]
       kzalloc include/linux/slab.h:742 [inline]
       fqdir_init include/net/inet_frag.h:115 [inline]
       ipv6_frags_init_net+0x48/0x460 net/ipv6/reassembly.c:513
       ops_init+0xb3/0x410 net/core/net_namespace.c:130
       setup_net+0x2d3/0x740 net/core/net_namespace.c:316
       copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
       create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
       ksys_unshare+0x440/0x980 kernel/fork.c:2692
       __do_sys_unshare kernel/fork.c:2760 [inline]
       __se_sys_unshare kernel/fork.c:2758 [inline]
       __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 17:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kfree+0xcf/0x220 mm/slab.c:3755
       fqdir_rwork_fn+0x33/0x40 net/ipv4/inet_fragment.c:154
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      The buggy address belongs to the object at ffff88806ed47a00
       which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 24 bytes inside of
       512-byte region [ffff88806ed47a00, ffff88806ed47c00)
      The buggy address belongs to the page:
      page:ffffea0001bb51c0 refcount:1 mapcount:0 mapping:ffff8880aa400940 index:0x0
      flags: 0x1fffc0000000200(slab)
      raw: 01fffc0000000200 ffffea000282a788 ffffea0001bb53c8 ffff8880aa400940
      raw: 0000000000000000 ffff88806ed47000 0000000100000006 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88806ed47900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88806ed47980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff88806ed47a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                  ^
       ffff88806ed47a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88806ed47b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 3c8fc878 ("inet: frags: rework rhashtable dismantle")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc93f46b
    • E
      inet: frags: uninline fqdir_init() · 6b73d197
      Eric Dumazet 提交于
      fqdir_init() is not fast path and is getting bigger.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b73d197
  8. 27 5月, 2019 3 次提交
    • E
      inet: frags: rework rhashtable dismantle · 3c8fc878
      Eric Dumazet 提交于
      syszbot found an interesting use-after-free [1] happening
      while IPv4 fragment rhashtable was destroyed at netns dismantle.
      
      While no insertions can possibly happen at the time a dismantling
      netns is destroying this rhashtable, timers can still fire and
      attempt to remove elements from this rhashtable.
      
      This is forbidden, since rhashtable_free_and_destroy() has
      no synchronization against concurrent inserts and deletes.
      
      Add a new fqdir->dead flag so that timers do not attempt
      a rhashtable_remove_fast() operation.
      
      We also have to respect an RCU grace period before starting
      the rhashtable_free_and_destroy() from process context,
      thus we use rcu_work infrastructure.
      
      This is a refinement of a prior rough attempt to fix this bug :
      https://marc.info/?l=linux-netdev&m=153845936820900&w=2
      
      Since the rhashtable cleanup is now deferred to a work queue,
      netns dismantles should be slightly faster.
      
      [1]
      BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:194 [inline]
      BUG: KASAN: use-after-free in rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
      Read of size 8 at addr ffff8880a6497b70 by task kworker/0:0/5
      
      CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.2.0-rc1+ #2
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events rht_deferred_worker
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       __read_once_size include/linux/compiler.h:194 [inline]
       rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
       rht_deferred_worker+0x111/0x2030 lib/rhashtable.c:411
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      Allocated by task 32687:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
       __do_kmalloc_node mm/slab.c:3620 [inline]
       __kmalloc_node+0x4e/0x70 mm/slab.c:3627
       kmalloc_node include/linux/slab.h:590 [inline]
       kvmalloc_node+0x68/0x100 mm/util.c:431
       kvmalloc include/linux/mm.h:637 [inline]
       kvzalloc include/linux/mm.h:645 [inline]
       bucket_table_alloc+0x90/0x480 lib/rhashtable.c:178
       rhashtable_init+0x3f4/0x7b0 lib/rhashtable.c:1057
       inet_frags_init_net include/net/inet_frag.h:109 [inline]
       ipv4_frags_init_net+0x182/0x410 net/ipv4/ip_fragment.c:683
       ops_init+0xb3/0x410 net/core/net_namespace.c:130
       setup_net+0x2d3/0x740 net/core/net_namespace.c:316
       copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
       create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
       ksys_unshare+0x440/0x980 kernel/fork.c:2692
       __do_sys_unshare kernel/fork.c:2760 [inline]
       __se_sys_unshare kernel/fork.c:2758 [inline]
       __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 7:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kfree+0xcf/0x220 mm/slab.c:3755
       kvfree+0x61/0x70 mm/util.c:460
       bucket_table_free+0x69/0x150 lib/rhashtable.c:108
       rhashtable_free_and_destroy+0x165/0x8b0 lib/rhashtable.c:1155
       inet_frags_exit_net+0x3d/0x50 net/ipv4/inet_fragment.c:152
       ipv4_frags_exit_net+0x73/0x90 net/ipv4/ip_fragment.c:695
       ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154
       cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      The buggy address belongs to the object at ffff8880a6497b40
       which belongs to the cache kmalloc-1k of size 1024
      The buggy address is located 48 bytes inside of
       1024-byte region [ffff8880a6497b40, ffff8880a6497f40)
      The buggy address belongs to the page:
      page:ffffea0002992580 refcount:1 mapcount:0 mapping:ffff8880aa400ac0 index:0xffff8880a64964c0 compound_mapcount: 0
      flags: 0x1fffc0000010200(slab|head)
      raw: 01fffc0000010200 ffffea0002916e88 ffffea000218fe08 ffff8880aa400ac0
      raw: ffff8880a64964c0 ffff8880a6496040 0000000100000005 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8880a6497a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880a6497a80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      >ffff8880a6497b00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                                   ^
       ffff8880a6497b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880a6497c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 648700f7 ("inet: frags: use rhashtables for reassembly units")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c8fc878
    • E
      net: dynamically allocate fqdir structures · 4907abc6
      Eric Dumazet 提交于
      Following patch will add rcu grace period before fqdir
      rhashtable destruction, so we need to dynamically allocate
      fqdir structures to not force expensive synchronize_rcu() calls
      in netns dismantle path.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4907abc6
    • E
      net: add a net pointer to struct fqdir · a39aca67
      Eric Dumazet 提交于
      fqdir will soon be dynamically allocated.
      
      We need to reach the struct net pointer from fqdir,
      so add it, and replace the various container_of() constructs
      by direct access to the new field.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a39aca67