1. 11 8月, 2018 4 次提交
    • M
      bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Martin KaFai Lau 提交于
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
      from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
      point,  it is not always clear where the bpf context can be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
      aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
      clear if the lower layer is only ipv4 and ipv6 in the future and
      will it not touch the cb[] again before transiting to the upper
      layer.
      
      For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
      instead of IP[6]CB and it may still modify the cb[] after calling
      the udp[46]_lib_lookup_skb().  Because of the above reason, if
      sk->cb is used for the bpf ctx, saving-and-restoring is needed
      and likely the whole 48 bytes cb[] has to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and setting the needed
      values in there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
      specific usage at this point and it is also inline with the current
      sock_reuseport.c implementation (i.e. no protocol specific requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      adding to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP as its
      discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it does , it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2dbb9b9e
    • M
      bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Martin KaFai Lau 提交于
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for the
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make decision.  In this case, decide which
      SO_REUSEPORT sk to serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in the later patch) to learn
      the application level's connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection info contact/API stays within the
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old-fds.  [Note that
      deleting a fd from a bpf map does not necessary mean the fd is closed]
      
      During map_update_elem(),
      Only SO_REUSEPORT sk (i.e. which has already been added
      to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
      "bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
      ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT.
      There is no need to re-create a similar usage in the BPF side.
      
      When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
      it will notify the bpf map to remove it from the map also.  It is
      done through "bpf_sk_reuseport_detach()" and it will only be called
      if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
      
      The map_update()/map_delete() has to be in-sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used
      by "reuse->socks[]" has to be used here also. Care has
      been taken to ensure the lock is only acquired when the
      adding sk passes some strict tests. and
      freeing the map does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64bits but it goes against the logical userspace
      expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
      It may catch user in surprise if we enforce value_size=8 while
      userspace still pass a 32bits fd during update.  Supporting different
      value_size between lookup and update seems unintuitive also.
      
      We also need to consider what if other existing fd based maps want
      to return 64bits value from syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, and
      assuming user will usually use value_size=4.  The syscall's lookup
      will return ENOSPC on value_size=4.  It will will only
      return 64bits value from sock_gen_cookie() when user consciously
      choose value_size=8 (as a signal that lookup is desired) which then
      requires a 64bits value in both lookup and update.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5dc4c4b7
    • M
      net: Add ID (if needed) to sock_reuseport and expose reuseport_lock · 736b4602
      Martin KaFai Lau 提交于
      A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
      allows a SO_REUSEPORT sk to be added to a bpf map.  When a sk
      is removed from reuse->socks[], it also needs to be removed from
      the bpf map.  Also, when adding a sk to a bpf map, the bpf
      map needs to ensure it is indeed in a reuse->socks[].
      Hence, reuseport_lock is needed by the bpf map to ensure its
      map_update_elem() and map_delete_elem() operations are in-sync with
      the reuse->socks[].  The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
      acquire the reuseport_lock after ensuring the adding sk is already
      in a reuseport group (i.e. reuse->socks[]).  The map_lookup_elem()
      will be lockless.
      
      This patch also adds an ID to sock_reuseport.  A later patch
      will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
      a bpf prog to select a sk from a bpf map.  It is inflexible to
      statically enforce a bpf map can only contain the sk belonging to
      a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
      verification time. For example, think about the the map-in-map situation
      where the inner map can be dynamically changed in runtime and the outer
      map may have inner maps belonging to different reuseport groups.
      Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
      type) selects a sk,  this selected sk has to be checked to ensure it
      belongs to the requesting reuseport group (i.e. the group serving
      that IP:PORT).
      
      The "sk->sk_reuseport_cb" pointer cannot be used for this checking
      purpose because the pointer value will change after reuseport_grow().
      Instead of saving all checking conditions like the ones
      preced calling "reuseport_add_sock()" and compare them everytime a
      bpf_prog is run, a 32bits ID is introduced to survive the
      reuseport_grow().  The ID is only acquired if any of the
      reuse->socks[] is added to the newly introduced
      "BPF_MAP_TYPE_REUSEPORT_ARRAY" map.
      
      If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used,  the changes in this
      patch is a no-op.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      736b4602
    • M
      tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket · 40a1227e
      Martin KaFai Lau 提交于
      Although the actual cookie check "__cookie_v[46]_check()" does
      not involve sk specific info, it checks whether the sk has recent
      synq overflow event in "tcp_synq_no_recent_overflow()".  The
      tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
      when it has sent out a syncookie (through "tcp_synq_overflow()").
      
      The above per sk "recent synq overflow event timestamp" works well
      for non SO_REUSEPORT use case.  However, it may cause random
      connection request reject/discard when SO_REUSEPORT is used with
      syncookie because it fails the "tcp_synq_no_recent_overflow()"
      test.
      
      When SO_REUSEPORT is used, it usually has multiple listening
      socks serving TCP connection requests destinated to the same local IP:PORT.
      There are cases that the TCP-ACK-COOKIE may not be received
      by the same sk that sent out the syncookie.  For example,
      if reuse->socks[] began with {sk0, sk1},
      1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
         was updated.
      2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
         and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
         There are other ordering that will trigger the similar situation
         below but the idea is the same.
      3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
         "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
         all syncookies sent by sk1 will be handled (and rejected)
         by sk2 while sk1 is still alive.
      
      The userspace may create and remove listening SO_REUSEPORT sockets
      as it sees fit.  E.g. Adding new thread (and SO_REUSEPORT sock) to handle
      incoming requests, old process stopping and new process starting...etc.
      With or without SO_ATTACH_REUSEPORT_[CB]BPF,
      the sockets leaving and joining a reuseport group makes picking
      the same sk to check the syncookie very difficult (if not impossible).
      
      The later patches will allow bpf prog more flexibility in deciding
      where a sk should be located in a bpf map and selecting a particular
      SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
      replace the whole bpf reuseport_array in one map_update() by using
      map-in-map.  Getting the syncookie check working smoothly across
      socks in the same "reuse->socks[]" is important.
      
      A partial solution is to set the newly added sk's ts_recent_stamp
      to the max ts_recent_stamp of a reuseport group but that will require
      to iterate through reuse->socks[]  OR
      pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
      joining a reuseport group.  However, neither of them will solve the
      existing sk getting moved around the reuse->socks[] and that
      sk may not have ts_recent_stamp updated, unlikely under continuous
      synflood but not impossible.
      
      This patch opts to treat the reuseport group as a whole when
      considering the last synq overflow timestamp since
      they are serving the same IP:PORT from the userspace
      (and BPF program) perspective.
      
      "synq_overflow_ts" is added to "struct sock_reuseport".
      The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
      will update/check reuse->synq_overflow_ts if the sk is
      in a reuseport group.  Similar to the reuseport decision in
      __inet_lookup_listener(), both sk->sk_reuseport and
      sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
      Update on "synq_overflow_ts" happens at roughly once
      every second.
      
      A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
      No meaningful performance change is observed.  Before and
      after the change is ~9Mpps in IPv4.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      40a1227e
  2. 10 8月, 2018 4 次提交
  3. 08 8月, 2018 1 次提交
  4. 07 8月, 2018 1 次提交
  5. 06 8月, 2018 2 次提交
  6. 05 8月, 2018 1 次提交
  7. 04 8月, 2018 1 次提交
  8. 03 8月, 2018 3 次提交
  9. 31 7月, 2018 6 次提交
  10. 30 7月, 2018 2 次提交
  11. 29 7月, 2018 2 次提交
    • T
      bpf: use GFP_ATOMIC instead of GFP_KERNEL in bpf_parse_prog() · 71eb5255
      Taehee Yoo 提交于
      bpf_parse_prog() is protected by rcu_read_lock().
      so that GFP_KERNEL is not allowed in the bpf_parse_prog().
      
      [51015.579396] =============================
      [51015.579418] WARNING: suspicious RCU usage
      [51015.579444] 4.18.0-rc6+ #208 Not tainted
      [51015.579464] -----------------------------
      [51015.579488] ./include/linux/rcupdate.h:303 Illegal context switch in RCU read-side critical section!
      [51015.579510] other info that might help us debug this:
      [51015.579532] rcu_scheduler_active = 2, debug_locks = 1
      [51015.579556] 2 locks held by ip/1861:
      [51015.579577]  #0: 00000000a8c12fd1 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x2e0/0x910
      [51015.579711]  #1: 00000000bf815f8e (rcu_read_lock){....}, at: lwtunnel_build_state+0x96/0x390
      [51015.579842] stack backtrace:
      [51015.579869] CPU: 0 PID: 1861 Comm: ip Not tainted 4.18.0-rc6+ #208
      [51015.579891] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [51015.579911] Call Trace:
      [51015.579950]  dump_stack+0x74/0xbb
      [51015.580000]  ___might_sleep+0x16b/0x3a0
      [51015.580047]  __kmalloc_track_caller+0x220/0x380
      [51015.580077]  kmemdup+0x1c/0x40
      [51015.580077]  bpf_parse_prog+0x10e/0x230
      [51015.580164]  ? kasan_kmalloc+0xa0/0xd0
      [51015.580164]  ? bpf_destroy_state+0x30/0x30
      [51015.580164]  ? bpf_build_state+0xe2/0x3e0
      [51015.580164]  bpf_build_state+0x1bb/0x3e0
      [51015.580164]  ? bpf_parse_prog+0x230/0x230
      [51015.580164]  ? lock_is_held_type+0x123/0x1a0
      [51015.580164]  lwtunnel_build_state+0x1aa/0x390
      [51015.580164]  fib_create_info+0x1579/0x33d0
      [51015.580164]  ? sched_clock_local+0xe2/0x150
      [51015.580164]  ? fib_info_update_nh_saddr+0x1f0/0x1f0
      [51015.580164]  ? sched_clock_local+0xe2/0x150
      [51015.580164]  fib_table_insert+0x201/0x1990
      [51015.580164]  ? lock_downgrade+0x610/0x610
      [51015.580164]  ? fib_table_lookup+0x1920/0x1920
      [51015.580164]  ? lwtunnel_valid_encap_type.part.6+0xcb/0x3a0
      [51015.580164]  ? rtm_to_fib_config+0x637/0xbd0
      [51015.580164]  inet_rtm_newroute+0xed/0x1b0
      [51015.580164]  ? rtm_to_fib_config+0xbd0/0xbd0
      [51015.580164]  rtnetlink_rcv_msg+0x331/0x910
      [ ... ]
      
      Fixes: 3a0af8fd ("bpf: BPF for lightweight tunnel infrastructure")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      71eb5255
    • D
      bpf: fix bpf_skb_load_bytes_relative pkt length check · 3eee1f75
      Daniel Borkmann 提交于
      The len > skb_headlen(skb) cannot be used as a maximum upper bound
      for the packet length since it does not have any relation to the full
      linear packet length when filtering is used from upper layers (e.g.
      in case of reuseport BPF programs) as by then skb->data, skb->len
      already got mangled through __skb_pull() and others.
      
      Fixes: 4e1ec56c ("bpf: add skb_load_bytes_relative helper")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      3eee1f75
  12. 27 7月, 2018 2 次提交
  13. 25 7月, 2018 2 次提交
  14. 24 7月, 2018 1 次提交
  15. 23 7月, 2018 1 次提交
    • R
      rtnetlink: add rtnl_link_state check in rtnl_configure_link · 5025f7f7
      Roopa Prabhu 提交于
      rtnl_configure_link sets dev->rtnl_link_state to
      RTNL_LINK_INITIALIZED and unconditionally calls
      __dev_notify_flags to notify user-space of dev flags.
      
      current call sequence for rtnl_configure_link
      rtnetlink_newlink
          rtnl_link_ops->newlink
          rtnl_configure_link (unconditionally notifies userspace of
                               default and new dev flags)
      
      If a newlink handler wants to call rtnl_configure_link
      early, we will end up with duplicate notifications to
      user-space.
      
      This patch fixes rtnl_configure_link to check rtnl_link_state
      and call __dev_notify_flags with gchanges = 0 if already
      RTNL_LINK_INITIALIZED.
      
      Later in the series, this patch will help the following sequence
      where a driver implementing newlink can call rtnl_configure_link
      to initialize the link early.
      
      makes the following call sequence work:
      rtnetlink_newlink
          rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
                                                      link and notifies
                                                      user-space of default
                                                      dev flags)
          rtnl_configure_link (updates dev flags if requested by user ifm
                               and notifies user-space of new dev flags)
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5025f7f7
  16. 22 7月, 2018 1 次提交
    • E
      net: skb_segment() should not return NULL · ff907a11
      Eric Dumazet 提交于
      syzbot caught a NULL deref [1], caused by skb_segment()
      
      skb_segment() has many "goto err;" that assume the @err variable
      contains -ENOMEM.
      
      A successful call to __skb_linearize() should not clear @err,
      otherwise a subsequent memory allocation error could return NULL.
      
      While we are at it, we might use -EINVAL instead of -ENOMEM when
      MAX_SKB_FRAGS limit is reached.
      
      [1]
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      CPU: 0 PID: 13285 Comm: syz-executor3 Not tainted 4.18.0-rc4+ #146
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_gso_segment+0x3dc/0x1780 net/ipv4/tcp_offload.c:106
      Code: f0 ff ff 0f 87 1c fd ff ff e8 00 88 0b fb 48 8b 75 d0 48 b9 00 00 00 00 00 fc ff df 48 8d be 90 00 00 00 48 89 f8 48 c1 e8 03 <0f> b6 14 08 48 8d 86 94 00 00 00 48 89 c6 83 e0 07 48 c1 ee 03 0f
      RSP: 0018:ffff88019b7fd060 EFLAGS: 00010206
      RAX: 0000000000000012 RBX: 0000000000000020 RCX: dffffc0000000000
      RDX: 0000000000040000 RSI: 0000000000000000 RDI: 0000000000000090
      RBP: ffff88019b7fd0f0 R08: ffff88019510e0c0 R09: ffffed003b5c46d6
      R10: ffffed003b5c46d6 R11: ffff8801dae236b3 R12: 0000000000000001
      R13: ffff8801d6c581f4 R14: 0000000000000000 R15: ffff8801d6c58128
      FS:  00007fcae64d6700(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004e8664 CR3: 00000001b669b000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tcp4_gso_segment+0x1c3/0x440 net/ipv4/tcp_offload.c:54
       inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
       inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
       skb_mac_gso_segment+0x3b5/0x740 net/core/dev.c:2792
       __skb_gso_segment+0x3c3/0x880 net/core/dev.c:2865
       skb_gso_segment include/linux/netdevice.h:4099 [inline]
       validate_xmit_skb+0x640/0xf30 net/core/dev.c:3104
       __dev_queue_xmit+0xc14/0x3910 net/core/dev.c:3561
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
       neigh_hh_output include/net/neighbour.h:473 [inline]
       neigh_output include/net/neighbour.h:481 [inline]
       ip_finish_output2+0x1063/0x1860 net/ipv4/ip_output.c:229
       ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:276 [inline]
       ip_output+0x223/0x880 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:444 [inline]
       ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
       iptunnel_xmit+0x567/0x850 net/ipv4/ip_tunnel_core.c:91
       ip_tunnel_xmit+0x1598/0x3af1 net/ipv4/ip_tunnel.c:778
       ipip_tunnel_xmit+0x264/0x2c0 net/ipv4/ipip.c:308
       __netdev_start_xmit include/linux/netdevice.h:4148 [inline]
       netdev_start_xmit include/linux/netdevice.h:4157 [inline]
       xmit_one net/core/dev.c:3034 [inline]
       dev_hard_start_xmit+0x26c/0xc30 net/core/dev.c:3050
       __dev_queue_xmit+0x29ef/0x3910 net/core/dev.c:3569
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
       neigh_direct_output+0x15/0x20 net/core/neighbour.c:1403
       neigh_output include/net/neighbour.h:483 [inline]
       ip_finish_output2+0xa67/0x1860 net/ipv4/ip_output.c:229
       ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:276 [inline]
       ip_output+0x223/0x880 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:444 [inline]
       ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
       ip_queue_xmit+0x9df/0x1f80 net/ipv4/ip_output.c:504
       tcp_transmit_skb+0x1bf9/0x3f10 net/ipv4/tcp_output.c:1168
       tcp_write_xmit+0x1641/0x5c20 net/ipv4/tcp_output.c:2363
       __tcp_push_pending_frames+0xb2/0x290 net/ipv4/tcp_output.c:2536
       tcp_push+0x638/0x8c0 net/ipv4/tcp.c:735
       tcp_sendmsg_locked+0x2ec5/0x3f00 net/ipv4/tcp.c:1410
       tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1447
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:641 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:651
       __sys_sendto+0x3d7/0x670 net/socket.c:1797
       __do_sys_sendto net/socket.c:1809 [inline]
       __se_sys_sendto net/socket.c:1805 [inline]
       __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1805
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x455ab9
      Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fcae64d5c68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007fcae64d66d4 RCX: 0000000000455ab9
      RDX: 0000000000000001 RSI: 0000000020000200 RDI: 0000000000000013
      RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014
      R13: 00000000004c1145 R14: 00000000004d1818 R15: 0000000000000006
      Modules linked in:
      Dumping ftrace buffer:
         (ftrace buffer empty)
      
      Fixes: ddff00d4 ("net: Move skb_has_shared_frag check out of GRE code and into segmentation")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff907a11
  17. 21 7月, 2018 4 次提交
  18. 20 7月, 2018 2 次提交
    • O
      flow_dissector: Dissect tos and ttl from the tunnel info · 5544adb9
      Or Gerlitz 提交于
      Add dissection of the tos and ttl from the ip tunnel headers
      fields in case a match is needed on them.
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: NRoi Dayan <roid@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5544adb9
    • T
      net/page_pool: Fix inconsistent lock state warning · 4905bd9a
      Tariq Toukan 提交于
      Fix the warning below by calling the ptr_ring_consume_bh,
      which uses spin_[un]lock_bh.
      
      [  179.064300] ================================
      [  179.069073] WARNING: inconsistent lock state
      [  179.073846] 4.18.0-rc2+ #18 Not tainted
      [  179.078133] --------------------------------
      [  179.082907] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      [  179.089637] swapper/21/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      [  179.095478] 00000000963d1995 (&(&r->consumer_lock)->rlock){+.?.}, at:
      __page_pool_empty_ring+0x61/0x100
      [  179.105988] {SOFTIRQ-ON-W} state was registered at:
      [  179.111443]   _raw_spin_lock+0x35/0x50
      [  179.115634]   __page_pool_empty_ring+0x61/0x100
      [  179.120699]   page_pool_destroy+0x32/0x50
      [  179.125204]   mlx5e_free_rq+0x38/0xc0 [mlx5_core]
      [  179.130471]   mlx5e_close_channel+0x20/0x120 [mlx5_core]
      [  179.136418]   mlx5e_close_channels+0x26/0x40 [mlx5_core]
      [  179.142364]   mlx5e_close_locked+0x44/0x50 [mlx5_core]
      [  179.148509]   mlx5e_close+0x42/0x60 [mlx5_core]
      [  179.153936]   __dev_close_many+0xb1/0x120
      [  179.158749]   dev_close_many+0xa2/0x170
      [  179.163364]   rollback_registered_many+0x148/0x460
      [  179.169047]   rollback_registered+0x56/0x90
      [  179.174043]   unregister_netdevice_queue+0x7e/0x100
      [  179.179816]   unregister_netdev+0x18/0x20
      [  179.184623]   mlx5e_remove+0x2a/0x50 [mlx5_core]
      [  179.190107]   mlx5_remove_device+0xe5/0x110 [mlx5_core]
      [  179.196274]   mlx5_unregister_interface+0x39/0x90 [mlx5_core]
      [  179.203028]   cleanup+0x5/0xbfc [mlx5_core]
      [  179.208031]   __x64_sys_delete_module+0x16b/0x240
      [  179.213640]   do_syscall_64+0x5a/0x210
      [  179.218151]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  179.224218] irq event stamp: 334398
      [  179.228438] hardirqs last  enabled at (334398): [<ffffffffa511d8b7>]
      rcu_process_callbacks+0x1c7/0x790
      [  179.239178] hardirqs last disabled at (334397): [<ffffffffa511d872>]
      rcu_process_callbacks+0x182/0x790
      [  179.249931] softirqs last  enabled at (334386): [<ffffffffa509732e>] irq_enter+0x5e/0x70
      [  179.259306] softirqs last disabled at (334387): [<ffffffffa509741c>] irq_exit+0xdc/0xf0
      [  179.268584]
      [  179.268584] other info that might help us debug this:
      [  179.276572]  Possible unsafe locking scenario:
      [  179.276572]
      [  179.283877]        CPU0
      [  179.286954]        ----
      [  179.290033]   lock(&(&r->consumer_lock)->rlock);
      [  179.295546]   <Interrupt>
      [  179.298830]     lock(&(&r->consumer_lock)->rlock);
      [  179.304550]
      [  179.304550]  *** DEADLOCK ***
      
      Fixes: ff7d6b27 ("page_pool: refurbish version of page_pool code")
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4905bd9a