1. 07 Sep, 2018 1 commit
  2. 06 Sep, 2018 1 commit
    • packet: add sockopt to ignore outgoing packets · fa788d98
      Committed by Vincent Whitchurch
      Currently, the only way to ignore outgoing packets on a packet socket is
      via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
      AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
      if the filter run from packet_rcv() would reject them.  So the presence
      of a packet socket on the interface takes away the benefits of
      MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
      packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
      cloned, but the cost for that is much lower.)
      
      Add a socket option to allow AF_PACKET sockets to ignore outgoing
      packets to solve this.  Note that the *BSDs already have something
      similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
      
      The first intended user is lldpd.
      Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
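A minimal userspace sketch of how a program such as lldpd might enable the new option. The option name PACKET_IGNORE_OUTGOING and its value 23 are taken from the uapi header linux/if_packet.h, not from the commit message above; the helper name packet_ignore_outgoing() is hypothetical.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Fallback for older userspace headers; 23 is the value in the uapi header. */
#ifndef PACKET_IGNORE_OUTGOING
#define PACKET_IGNORE_OUTGOING 23
#endif

/* Hypothetical helper: ask the kernel not to loop outgoing packets into
 * this AF_PACKET socket. Returns 0 on success, or -errno on failure. */
static int packet_ignore_outgoing(int fd)
{
    int one = 1;

    if (setsockopt(fd, SOL_PACKET, PACKET_IGNORE_OUTGOING,
                   &one, sizeof(one)) < 0)
        return -errno;
    return 0;
}
```

The fd would normally come from socket(AF_PACKET, SOCK_RAW, ...), which requires CAP_NET_RAW. On kernels that predate this commit the call is expected to fail, so callers may want to keep their BPF filter as a fallback.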
  3. 03 Sep, 2018 1 commit
    • bpf: Fix bpf_msg_pull_data() · 9db39f4d
      Committed by Tushar Dave
      Helper bpf_msg_pull_data() mistakenly reuses variable 'offset' while
      linearizing multiple scatterlist elements. Variable 'offset' is used
      to find the first starting scatterlist element,
          i.e. msg->data = sg_virt(&sg[first_sg]) + start - offset
      
      Use different variable name while linearizing multiple scatterlist
      elements so that value contained in variable 'offset' won't get
      overwritten.
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  4. 01 Sep, 2018 2 commits
  5. 30 Aug, 2018 8 commits
    • notifier: Remove notifier header file wherever not used · 13ba17be
      Committed by Mukesh Ojha
      The conversion of the hotplug notifiers to a state machine left the
      notifier.h includes around in some places. Remove them.
      Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
    • ethtool: drop get_settings and set_settings callbacks · 9b300495
      Committed by Michal Kubecek
      Since [gs]et_settings ethtool_ops callbacks have been deprecated in
      February 2016, all in tree NIC drivers have been converted to provide
      [gs]et_link_ksettings() and out of tree drivers have had enough time to do
      the same.
      
      Drop get_settings() and set_settings() and implement both ETHTOOL_[GS]SET
      and ETHTOOL_[GS]LINKSETTINGS only using [gs]et_link_ksettings().
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rtnl: return early from rtnl_unregister_all when protocol isn't registered · f707ef61
      Committed by Sabrina Dubroca
      rtnl_unregister_all(PF_INET6) gets called from inet6_init in cases when
      no handler has been registered for PF_INET6 yet, for example if
      ip6_mr_init() fails. Abort and avoid a NULL pointer deref in that case.
      
      Example of panic (triggered by faking a failure of
       register_pernet_subsys):
      
          general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
          [...]
          RIP: 0010:rtnl_unregister_all+0x17e/0x2a0
          [...]
          Call Trace:
           ? rtnetlink_net_init+0x250/0x250
           ? sock_unregister+0x103/0x160
           ? kernel_getsockopt+0x200/0x200
           inet6_init+0x197/0x20d
      
      Fixes: e2fddf5e ("[IPV6]: Make af_inet6 to check ip6_route_init return value.")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: export xdp_rxq_info_unreg_mem_model · dce5bd61
      Committed by Björn Töpel
      Export __xdp_rxq_info_unreg_mem_model as xdp_rxq_info_unreg_mem_model,
      so it can be used from netdev drivers. Also, add additional checks for
      the memory type.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY · b0d1beef
      Committed by Björn Töpel
      This commit adds proper MEM_TYPE_ZERO_COPY support for
      convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
      xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
      MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
      make sense to implement a more sophisticated thread-safe alloc/free
      scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
      required in the fast-path.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix sg shift repair start offset in bpf_msg_pull_data · a8cf76a9
      Committed by Daniel Borkmann
      When we perform the sg shift repair for the scatterlist ring, we
      currently start out at i = first_sg + 1. However, this is not
      correct since the first_sg could point to the sge sitting at slot
      MAX_SKB_FRAGS - 1, and a subsequent i = MAX_SKB_FRAGS will access
      the scatterlist ring (sg) out of bounds. Add the sk_msg_iter_var()
      helper for iterating through the ring, and apply the same rule
      for advancing to the next ring element as we do elsewhere. Later
      work will use this helper also in other places.
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
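The ring-advance rule that this commit factors into sk_msg_iter_var() can be sketched as follows. This is a userspace illustration only: ring_next() and RING_SLOTS are hypothetical names, and the size 16 is an assumption standing in for MAX_SKB_FRAGS.

```c
#include <assert.h>

/* Assumed ring size for illustration; the kernel uses MAX_SKB_FRAGS. */
#define RING_SLOTS 16

/* Advance a ring index by one slot, wrapping to 0 instead of running
 * past the last valid slot. */
static unsigned int ring_next(unsigned int i)
{
    assert(i < RING_SLOTS);
    i++;
    if (i == RING_SLOTS)
        i = 0;
    return i;
}
```

Starting the repair loop at ring_next(first_sg) rather than first_sg + 1 keeps the index inside the ring even when first_sg is the last slot.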
    • bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data · 2e43f95d
      Committed by Daniel Borkmann
      If first_sg and last_sg wraps around in the scatterlist ring, then we
      need to account for that in the shift as well. E.g. crafting such msgs
      where this is the case leads to a hang as shift becomes negative. E.g.
      consider the following scenario:
      
        first_sg := 14     |=>    shift := -12     msg->sg_start := 10
        last_sg  :=  3     |                       msg->sg_end   :=  5
      
      round  1:  i := 15, move_from :=   3, sg[15] := sg[  3]
      round  2:  i :=  0, move_from := -12, sg[ 0] := sg[-12]
      round  3:  i :=  1, move_from := -11, sg[ 1] := sg[-11]
      round  4:  i :=  2, move_from := -10, sg[ 2] := sg[-10]
      [...]
      round 13:  i := 11, move_from :=  -1, sg[11] := sg[ -1]
      round 14:  i := 12, move_from :=   0, sg[12] := sg[  0]
      round 15:  i := 13, move_from :=   1, sg[13] := sg[  1]
      round 16:  i := 14, move_from :=   2, sg[14] := sg[  2]
      round 17:  i := 15, move_from :=   3, sg[15] := sg[  3]
      [...]
      
      This means we will loop forever and never hit the msg->sg_end condition
      to break out of the loop. When we see that the ring wraps around, then
      the shift should be MAX_SKB_FRAGS - first_sg + last_sg - 1. Meaning,
      the remainder slots from the tail of the ring and the head until last_sg
      combined.
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
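The shift computation described in this commit can be worked through with a small sketch. The function name sg_shift() is hypothetical, and ring_slots stands in for MAX_SKB_FRAGS; the value 16 used in the note below is an assumption read off the example in the message, where the index wraps from 15 to 0.

```c
#include <assert.h>

/* Number of slots to shift by when repairing the ring after the elements
 * first_sg..last_sg have been merged into first_sg. */
static int sg_shift(int first_sg, int last_sg, int ring_slots)
{
    assert(first_sg >= 0 && first_sg < ring_slots);
    assert(last_sg >= 0 && last_sg < ring_slots);

    if (first_sg > last_sg) {
        /* Wrapped: tail slots after first_sg plus head slots before last_sg. */
        return ring_slots - first_sg + last_sg - 1;
    }
    return last_sg - first_sg - 1;
}
```

With first_sg = 14, last_sg = 3 and 16 slots, the unwrapped formula gives 3 - 14 - 1 = -12, the negative shift from the scenario above, while the wrapped formula gives 16 - 14 + 3 - 1 = 4.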
    • bpf: fix msg->data/data_end after sg shift repair in bpf_msg_pull_data · 0e06b227
      Committed by Daniel Borkmann
      In the current code, msg->data is set as sg_virt(&sg[i]) + start - offset
      and msg->data_end relative to it as msg->data + bytes. Using iterator i
      to point to the updated starting scatterlist element holds true for some
      cases, however not for all where we'd end up pointing out of bounds. It
      is /correct/ for these ones:
      
      1) When first finding the starting scatterlist element (sge) where we
         find that the page is already privately owned by the msg and where
         the requested bytes and headroom fit into the sge's length.
      
      However, it's /incorrect/ for the following ones:
      
      2) After we made the requested area private and updated the newly allocated
         page into first_sg slot of the scatterlist ring; when we find that no
         shift repair of the ring is needed where we bail out updating msg->data
         and msg->data_end. At that point i will point to last_sg, which in this
         case is the next elem of first_sg in the ring. The sge at that point
         might as well be invalid (e.g. i == msg->sg_end), which we use for
         setting the range of sg_virt(&sg[i]). The correct one would have been
         first_sg.
      
      3) Similar as in 2) but when we find that a shift repair of the ring is
         needed. In this case we fix up all sges and stop once we've reached the
         end. In this case i will point to the new msg->sg_end,
         and the sge at that point will be invalid. Again here the requested
         range sits in first_sg.
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  6. 29 Aug, 2018 1 commit
    • bpf: fix several offset tests in bpf_msg_pull_data · 5b24109b
      Committed by Daniel Borkmann
      While recently going over bpf_msg_pull_data(), I noticed three
      issues which are fixed in here:
      
      1) When we attempt to find the first scatterlist element (sge)
         for the start offset, we add len to the offset before we check
         for start < offset + len, whereas it should come after when
         we iterate to the next sge to accumulate the offsets. For
         example, given a start offset of 12 with a sge length of 8
         for the first sge in the list would lead us to determine this
         sge as the first sge thinking it covers first 16 bytes where
         start is located, whereas start sits in subsequent sges so
         we would end up pulling in the wrong data.
      
      2) After figuring out the starting sge, we have a short-cut test
         in !msg->sg_copy[i] && bytes <= len. This checks whether it's
         not needed to make the page at the sge private where we can
         just exit by updating msg->data and msg->data_end. However,
         the length test is not fully correct. bytes <= len checks
         whether the requested bytes (end - start offsets) fit into the
         sge's length. The part that is missing is that start must not
         be sge length aligned. Meaning, the start offset into the sge
         needs to be accounted as well on top of the requested bytes
         as otherwise we can access the sge out of bounds. For example
         the sge could have length of 8, our requested bytes could have
         length of 8, but at a start offset of 4, so we also would need
         to pull in 4 bytes of the next sge, when we jump to the out
         label we do set msg->data to sg_virt(&sg[i]) + start - offset
         and msg->data_end to msg->data + bytes which would be oob.
      
      3) The subsequent bytes < copy test for finding the last sge has
         the same issue as in point 2) but also it tests for less than
         rather than less or equal to. Meaning if the sge length is of
         8 and requested bytes of 8 while having the start aligned with
         the sge, we would unnecessarily go and pull in the next sge as
         well to make it private.
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
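Points 1) and 2) of this commit can be illustrated with a plain array of element lengths in place of the scatterlist. This is a sketch under that assumption; find_first_sge() and fits_in_sge() are hypothetical names, not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Find the element containing byte `start`. On success, return its index
 * and store the number of bytes preceding it in *offset. Return -1 if
 * `start` lies beyond the last element. */
static int find_first_sge(const unsigned int *len, size_t n,
                          unsigned int start, unsigned int *offset)
{
    unsigned int off = 0;

    for (size_t i = 0; i < n; i++) {
        if (start < off + len[i]) {   /* test against this element first */
            *offset = off;
            return (int)i;
        }
        off += len[i];                /* accumulate only when moving on */
    }
    return -1;
}

/* True if `bytes` beginning at `start` lie entirely inside an element of
 * length `len` whose first byte is at `offset`. */
static bool fits_in_sge(unsigned int start, unsigned int offset,
                        unsigned int bytes, unsigned int len)
{
    return bytes + (start - offset) <= len;
}
```

For elements of length 8 each, a start of 12 resolves to the second element with offset 8. A request for 8 bytes at start 4 does not fit a single 8-byte element, which is the out-of-bounds case from point 2).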
  7. 28 Aug, 2018 1 commit
    • bpf: fix build error with clang · 3f6e138d
      Committed by Stefan Agner
      Building the newly introduced BPF_PROG_TYPE_SK_REUSEPORT leads to
      a compile time error when building with clang:
      net/core/filter.o: In function `sk_reuseport_convert_ctx_access':
        ../net/core/filter.c:7284: undefined reference to `__compiletime_assert_7284'
      
      It seems that clang has issues resolving hweight_long at compile
      time. Since SK_FL_PROTO_MASK is a constant, we can use the interface
      for known constant arguments which works fine with clang.
      
      Fixes: 2dbb9b9e ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  8. 22 Aug, 2018 1 commit
  9. 18 Aug, 2018 1 commit
    • bpf: fix redirect to map under tail calls · f6069b9a
      Committed by Daniel Borkmann
      Commits 109980b8 ("bpf: don't select potentially stale ri->map
      from buggy xdp progs") and 7c300131 ("bpf: fix ri->map_owner
      pointer on bpf_prog_realloc") tried to mitigate that buggy programs
      using bpf_redirect_map() helper call do not leave stale maps behind.
      Idea was to add a map_owner cookie into the per CPU struct redirect_info
      which was set to prog->aux by the prog making the helper call as a
      proof that the map is not stale since the prog is implicitly holding
      a reference to it. This owner cookie could later on get compared with
      the program calling into BPF whether they match and therefore the
      redirect could proceed with processing the map safely.
      
      In (obvious) hindsight, this approach breaks down when tail calls are
      involved since the original caller's prog->aux pointer does not have
      to match the one from one of the progs out of the tail call chain,
      and therefore the xdp buffer will be dropped instead of redirected.
      A way around that would be to fix the issue differently (which also
      allows to remove related work in fast path at the same time): once
      the life-time of a redirect map has come to its end we use its map
      free callback where we need to wait on synchronize_rcu() for current
      outstanding xdp buffers and remove such a map pointer from the
      redirect info if found to be present. At that time no program is
      using this map anymore so we simply invalidate the map pointers to
      NULL iff they previously pointed to that instance while making sure
      that the redirect path only reads out the map once.
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
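The invalidation step described above (set a pointer to NULL iff it still points at the map being freed, with readers loading the pointer only once) can be sketched in userspace with C11 atomics. Everything here is a stand-in: the slots array models the per-CPU redirect info, struct demo_map and the function names are hypothetical, and the synchronize_rcu() wait mentioned in the message is not modelled.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_SLOTS 4   /* stand-in for the number of possible CPUs */

struct demo_map { int id; };                        /* hypothetical map object */
struct redirect_slot { _Atomic(struct demo_map *) map; };

static struct redirect_slot slots[NR_SLOTS];

/* Clear every slot that still points at `dying`; leave other slots alone. */
static void clear_redirect_map(struct demo_map *dying)
{
    for (size_t cpu = 0; cpu < NR_SLOTS; cpu++) {
        struct demo_map *expected = dying;

        /* Swap in NULL only if the slot still holds `dying`. */
        atomic_compare_exchange_strong(&slots[cpu].map, &expected,
                                       (struct demo_map *)NULL);
    }
}

/* Reader side: load the pointer once and use only that snapshot. */
static struct demo_map *read_redirect_map(size_t cpu)
{
    return atomic_load(&slots[cpu].map);
}
```

Because the owner check is replaced by invalidation at map-free time, a program reached through a tail call sees the same pointer as the original caller would.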
  10. 17 Aug, 2018 1 commit
    • net/xdp: Fix suspicious RCU usage warning · 21b172ee
      Committed by Tariq Toukan
      Fix the warning below by calling rhashtable_lookup_fast.
      Also, make some code movements for better quality and human
      readability.
      
      [  342.450870] WARNING: suspicious RCU usage
      [  342.455856] 4.18.0-rc2+ #17 Tainted: G           O
      [  342.462210] -----------------------------
      [  342.467202] ./include/linux/rhashtable.h:481 suspicious rcu_dereference_check() usage!
      [  342.476568]
      [  342.476568] other info that might help us debug this:
      [  342.476568]
      [  342.486978]
      [  342.486978] rcu_scheduler_active = 2, debug_locks = 1
      [  342.495211] 4 locks held by modprobe/3934:
      [  342.500265]  #0: 00000000e23116b2 (mlx5_intf_mutex){+.+.}, at:
      mlx5_unregister_interface+0x18/0x90 [mlx5_core]
      [  342.511953]  #1: 00000000ca16db96 (rtnl_mutex){+.+.}, at: unregister_netdev+0xe/0x20
      [  342.521109]  #2: 00000000a46e2c4b (&priv->state_lock){+.+.}, at: mlx5e_close+0x29/0x60
      [mlx5_core]
      [  342.531642]  #3: 0000000060c5bde3 (mem_id_lock){+.+.}, at: xdp_rxq_info_unreg+0x93/0x6b0
      [  342.541206]
      [  342.541206] stack backtrace:
      [  342.547075] CPU: 12 PID: 3934 Comm: modprobe Tainted: G           O      4.18.0-rc2+ #17
      [  342.556621] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
      [  342.565606] Call Trace:
      [  342.568861]  dump_stack+0x78/0xb3
      [  342.573086]  xdp_rxq_info_unreg+0x3f5/0x6b0
      [  342.578285]  ? __call_rcu+0x220/0x300
      [  342.582911]  mlx5e_free_rq+0x38/0xc0 [mlx5_core]
      [  342.588602]  mlx5e_close_channel+0x20/0x120 [mlx5_core]
      [  342.594976]  mlx5e_close_channels+0x26/0x40 [mlx5_core]
      [  342.601345]  mlx5e_close_locked+0x44/0x50 [mlx5_core]
      [  342.607519]  mlx5e_close+0x42/0x60 [mlx5_core]
      [  342.613005]  __dev_close_many+0xb1/0x120
      [  342.617911]  dev_close_many+0xa2/0x170
      [  342.622622]  rollback_registered_many+0x148/0x460
      [  342.628401]  ? __lock_acquire+0x48d/0x11b0
      [  342.633498]  ? unregister_netdev+0xe/0x20
      [  342.638495]  rollback_registered+0x56/0x90
      [  342.643588]  unregister_netdevice_queue+0x7e/0x100
      [  342.649461]  unregister_netdev+0x18/0x20
      [  342.654362]  mlx5e_remove+0x2a/0x50 [mlx5_core]
      [  342.659944]  mlx5_remove_device+0xe5/0x110 [mlx5_core]
      [  342.666208]  mlx5_unregister_interface+0x39/0x90 [mlx5_core]
      [  342.673038]  cleanup+0x5/0xbfc [mlx5_core]
      [  342.678094]  __x64_sys_delete_module+0x16b/0x240
      [  342.683725]  ? do_syscall_64+0x1c/0x210
      [  342.688476]  do_syscall_64+0x5a/0x210
      [  342.693025]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: 8d5d8852 ("xdp: rhashtable with allocator ID to pointer mapping")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  11. 15 Aug, 2018 2 commits
  12. 13 Aug, 2018 1 commit
    • bpf: Introduce bpf_skb_ancestor_cgroup_id helper · 77236281
      Committed by Andrey Ignatov
      == Problem description ==
      
      It's useful to be able to identify cgroup associated with skb in TC so
      that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
      helper can help with this.
      
      Though in real life cgroup hierarchy and hierarchy to apply a policy to
      don't map 1:1.
      
      It's often the case that there is a container and corresponding cgroup,
      but there are many more sub-cgroups inside container, e.g. because it's
      delegated to containerized application to control resources for its
      subsystems, or to separate application inside container from infra that
      belongs to containerization system (e.g. sshd).
      
      At the same time it may be useful to apply a policy to container as a
      whole.
      
      If multiple containers like this are run on a host (which is often the
      case) and many of them have sub-cgroups, it may not be possible to apply
      per-container policy in TC with existing helpers such as
      bpf_skb_under_cgroup or bpf_skb_cgroup_id:
      
      * bpf_skb_cgroup_id will return id of immediate cgroup associated with
        skb, i.e. if it's a sub-cgroup inside container, it can't be used to
        identify container's cgroup;
      
      * bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
        i.e. if there are N containers on a host and a policy has to be
        applied to M of them (0 <= M <= N), it'd require M calls to
        bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &
        load new BPF program.
      
      == Solution ==
      
      The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
      used to get id of cgroup v2 that is an ancestor of cgroup associated
      with skb at specified level of cgroup hierarchy.
      
      That way admin can place all containers on one level of cgroup hierarchy
      (which is good practice in general and already used in many
      configurations) and identify specific cgroup on this level no matter
      what sub-cgroup skb is associated with.
      
      E.g. if there is a cgroup hierarchy:
        root/
        root/container1/
        root/container1/app11/
        root/container1/app11/sub-app-a/
        root/container1/app12/
        root/container2/
        root/container2/app21/
        root/container2/app22/
        root/container2/app22/sub-app-b/
      
      then, for an skb associated with root/container1/app11/sub-app-a/, it's
      possible to get the ancestor at level 1, which is container1, and apply the
      policy for this container, or apply another policy if it's container2.
      
      Policies can be kept e.g. in a hash map where key is a container cgroup
      id and value is an action.
      
      Levels where container cgroups are created are usually known in advance,
      whereas the cgroup hierarchy inside a container may be hard to predict,
      especially when its creation is delegated to the containerized
      application.
      
      == Implementation details ==
      
      The helper gets ancestor by walking parents up to specified level.
      
      Another option would be to get different kind of "id" from
      cgroup->ancestor_ids[level] and use it with idr_find() to get struct
      cgroup for ancestor. But that would require a radix lookup, which doesn't
      seem to be better (at least it's not obviously better).
      
      Format of return value of the new helper is same as that of
      bpf_skb_cgroup_id.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
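The "walk parents up to the specified level" approach from the implementation notes can be sketched with a simplified tree. struct cgroup_node and ancestor_cgroup_id() are hypothetical stand-ins for the kernel structures, and returning 0 when no ancestor exists at that level is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

struct cgroup_node {                 /* hypothetical stand-in for struct cgroup */
    const struct cgroup_node *parent;
    int level;                       /* root is level 0 */
    uint64_t id;
};

/* Return the id of the ancestor of `cg` at `level`, or 0 if `cg` sits
 * above that level (assumed failure value for this sketch). */
static uint64_t ancestor_cgroup_id(const struct cgroup_node *cg, int level)
{
    if (cg == NULL || level < 0 || cg->level < level)
        return 0;
    while (cg->level > level)
        cg = cg->parent;             /* one step up per iteration */
    return cg->id;
}
```

For the hierarchy in the message, an skb in root/container1/app11/sub-app-a/ resolves to container1 at level 1. That id could then serve as the key of a policy hash map.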
  13. 12 Aug, 2018 1 commit
  14. 11 Aug, 2018 5 commits
    • bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection · 8217ca65
      Committed by Martin KaFai Lau
      This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
      SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
      the earlier patch.  "bpf_run_sk_reuseport()" will return -ECONNREFUSED
      when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
      The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
      handle this case and return NULL immediately instead of continuing the
      sk search from its hashtable.
      
      It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
      BPF_PROG_TYPE_SK_REUSEPORT.  The "sk_reuseport_attach_bpf()" will check
      if the attaching bpf prog is in the new SK_REUSEPORT or the existing
      SOCKET_FILTER type and then check different things accordingly.
      
      One level of "__reuseport_attach_prog()" call is removed.  The
      "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
      back to "reuseport_attach_prog()" in sock_reuseport.c.  sock_reuseport.c
      seems to have more knowledge on those test requirements than filter.c.
      In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
      the old_prog (if any) is also directly freed instead of returning the
      old_prog to the caller and asking the caller to free.
      
      The sysctl_optmem_max check is moved back to the
      "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
      As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
      bounded by the usual "bpf_prog_charge_memlock()" during load time
      instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Committed by Martin KaFai Lau
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At SO_REUSEPORT sk lookup time, the skb is in the middle of transiting
      from a lower layer (ipv4/ipv6) to an upper layer (udp/tcp).  At this
      point, it is not always clear where the bpf context can be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
      aside the difference between ipv4-vs-ipv6 and udp-vs-tcp, it is not
      clear whether the lower layer will remain only ipv4 and ipv6 in the
      future, or whether it will leave the cb[] untouched before transiting
      to the upper layer.
      
      For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
      instead of IP[6]CB and it may still modify the cb[] after calling
      the udp[46]_lib_lookup_skb().  Because of the above reason, if
      skb->cb is used for the bpf ctx, saving-and-restoring is needed
      and likely the whole 48 bytes cb[] has to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and setting the needed
      values in there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
      specific usage at this point and it is also in line with the current
      sock_reuseport.c implementation (i.e. no protocol specific requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantics similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bound to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      adding to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP at its
      discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it has, it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
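The verdict handling described in this commit and in 8217ca65 (SK_DROP yields -ECONNREFUSED and no socket; SK_PASS uses the selected sk if one was set, otherwise the kernel picks from reuse->socks[]) can be modelled as a small decision function. This is a userspace model only: struct demo_sock and pick_reuseport_sk() are hypothetical, and the modulo fallback is an assumption that does not reproduce the kernel's own selection.

```c
#include <errno.h>
#include <stddef.h>

enum verdict { SK_DROP = 0, SK_PASS = 1 };   /* same values as uapi enum sk_action */

struct demo_sock { int id; };                /* hypothetical stand-in for a socket */

/* Decide which socket serves an incoming request, given the program's
 * verdict and the socket it selected (if any). */
static struct demo_sock *pick_reuseport_sk(enum verdict v,
                                           struct demo_sock *selected,
                                           struct demo_sock **socks, size_t n,
                                           unsigned int hash, int *err)
{
    *err = 0;
    if (v == SK_DROP) {
        *err = -ECONNREFUSED;        /* caller stops searching and returns NULL */
        return NULL;
    }
    if (selected != NULL)
        return selected;             /* program chose a valid socket */

    /* Fallback when nothing was selected: simple modulo for illustration. */
    return n ? socks[hash % n] : NULL;
}
```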
    • bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Committed by Martin KaFai Lau
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for the
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make a decision.  In this case, the
      decision is which SO_REUSEPORT sk should serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in the later patch) to learn
      the application level's connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection info contract/API stays within the
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old-fds.  [Note that
      deleting a fd from a bpf map does not necessarily mean the fd is closed]
      
      During map_update_elem(), only a SO_REUSEPORT sk (i.e. one which has
      already been added to a reuse->socks[]) can be used.  That means a
      SO_REUSEPORT sk that is "bind()" for UDP or "bind()+listen()" for TCP.
      These conditions are ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT.
      There is no need to re-create a similar usage in the BPF side.
      
      When a SO_REUSEPORT sk is deleted from the "reuse->socks[]" (e.g. "close()"),
      it will notify the bpf map to remove it from the map also.  It is
      done through "bpf_sk_reuseport_detach()" and it will only be called
      if >=1 of the "reuse->socks[]" has ever been added to a bpf map.
      
      The map_update()/map_delete() has to be in-sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used
      by "reuse->socks[]" has to be used here also.  Care has
      been taken to ensure the lock is only acquired when the
      adding sk passes some strict tests, and freeing the map
      does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64 bits, which goes against the logical userspace
      expectation of a 32-bit sizeof(fd) (as other fd-based bpf maps also use).
      It may catch users by surprise if we enforce value_size=8 while
      userspace still passes a 32-bit fd during update.  Supporting different
      value_size between lookup and update seems unintuitive also.
      
      We also need to consider what happens if other existing fd-based maps
      want to return a 64-bit value from the syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, assuming
      users will usually use value_size=4.  The syscall's lookup
      will return ENOSPC on value_size=4.  It will only
      return the 64-bit value from sock_gen_cookie() when the user consciously
      chooses value_size=8 (as a signal that lookup is desired), which then
      requires a 64-bit value in both lookup and update.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      5dc4c4b7
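
The value_size rule described in the commit above can be pictured with a small userspace model. This is only a sketch, not kernel code: every `model_` name is hypothetical, the stored 64-bit number merely stands in for what sock_gen_cookie() would return, and the EINVAL, E2BIG and ENOENT codes are assumptions chosen for illustration. Only the ENOSPC result for value_size=4 comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_ENTRIES 8

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_reuseport_array {
    uint32_t value_size;                 /* 4 or 8, fixed at creation */
    uint32_t max_entries;
    int      in_use[MODEL_MAX_ENTRIES];  /* 1 if a socket is stored here */
    uint64_t cookie[MODEL_MAX_ENTRIES];  /* stand-in for sock_gen_cookie() */
};

/* Accept only the two value sizes named in the commit message. */
static int model_map_create(struct model_reuseport_array *map,
                            uint32_t value_size, uint32_t max_entries)
{
    if (value_size != 4 && value_size != 8)
        return -EINVAL;                  /* assumed error code */
    if (max_entries == 0 || max_entries > MODEL_MAX_ENTRIES)
        return -EINVAL;                  /* assumed error code */
    memset(map, 0, sizeof(*map));
    map->value_size = value_size;
    map->max_entries = max_entries;
    return 0;
}

/* Stand-in for map_update_elem(); the real update takes a socket fd. */
static int model_map_store(struct model_reuseport_array *map,
                           uint32_t index, uint64_t cookie)
{
    if (index >= map->max_entries)
        return -E2BIG;                   /* assumed error code */
    map->in_use[index] = 1;
    map->cookie[index] = cookie;
    return 0;
}

/* Syscall-side lookup: only a value_size=8 map hands out the cookie. */
static int model_syscall_lookup(const struct model_reuseport_array *map,
                                uint32_t index, uint64_t *value)
{
    if (map->value_size != 8)
        return -ENOSPC;                  /* behaviour stated in the commit */
    if (index >= map->max_entries || !map->in_use[index])
        return -ENOENT;                  /* assumed error code */
    *value = map->cookie[index];
    return 0;
}
```

In this model a map created with value_size=4 can still be populated, but every syscall-side lookup fails with -ENOSPC, while a value_size=8 map returns the stored 64-bit value.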
    • M
      net: Add ID (if needed) to sock_reuseport and expose reuseport_lock · 736b4602
      Martin KaFai Lau committed
      A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
      allows a SO_REUSEPORT sk to be added to a bpf map.  When a sk
      is removed from reuse->socks[], it also needs to be removed from
      the bpf map.  Also, when adding a sk to a bpf map, the bpf
      map needs to ensure it is indeed in a reuse->socks[].
      Hence, reuseport_lock is needed by the bpf map to ensure its
      map_update_elem() and map_delete_elem() operations are in-sync with
      the reuse->socks[].  The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
      acquire the reuseport_lock after ensuring the adding sk is already
      in a reuseport group (i.e. reuse->socks[]).  The map_lookup_elem()
      will be lockless.
      
      This patch also adds an ID to sock_reuseport.  A later patch
      will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
      a bpf prog to select a sk from a bpf map.  It is inflexible to
      statically enforce a bpf map can only contain the sk belonging to
      a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
      verification time.  For example, think about the map-in-map situation
      where the inner map can be dynamically changed in runtime and the outer
      map may have inner maps belonging to different reuseport groups.
      Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
      type) selects a sk, this selected sk has to be checked to ensure it
      belongs to the requesting reuseport group (i.e. the group serving
      that IP:PORT).
      
      The "sk->sk_reuseport_cb" pointer cannot be used for this checking
      purpose because the pointer value will change after reuseport_grow().
      Instead of saving all checking conditions like the ones
      preceding the call to "reuseport_add_sock()" and comparing them every
      time a bpf_prog is run, a 32-bit ID is introduced to survive the
      reuseport_grow().  The ID is only acquired if any of the
      reuse->socks[] is added to the newly introduced
      "BPF_MAP_TYPE_REUSEPORT_ARRAY" map.
      
      If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used, the changes in this
      patch are a no-op.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      736b4602
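
Why an ID is used instead of the "sk->sk_reuseport_cb" pointer can be sketched in userspace. The model below is an illustration under assumptions, not the kernel implementation: all `model_` names are hypothetical, "grow" is modelled as allocate-copy-free, and a plain counter stands in for whatever ID allocator the kernel uses. It shows a group whose storage is replaced while its lazily acquired ID is carried over.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sock_reuseport. */
struct model_reuse {
    unsigned int id;         /* 0 means "no ID acquired yet" */
    unsigned int max_socks;
    unsigned int num_socks;
    int socks[];             /* socket handles */
};

static unsigned int model_next_id = 1;   /* assumed trivial ID allocator */

static struct model_reuse *model_reuse_alloc(unsigned int max_socks)
{
    struct model_reuse *r;

    r = calloc(1, sizeof(*r) + max_socks * sizeof(r->socks[0]));
    if (r)
        r->max_socks = max_socks;
    return r;
}

/* The ID is acquired lazily, the first time a member sk goes into a map. */
static unsigned int model_reuse_get_id(struct model_reuse *r)
{
    if (r->id == 0)
        r->id = model_next_id++;
    return r->id;
}

/* Growing replaces the allocation, so pointers to the old group go stale,
 * but the ID is copied into the new one. */
static struct model_reuse *model_reuse_grow(struct model_reuse *old)
{
    struct model_reuse *bigger = model_reuse_alloc(old->max_socks * 2);

    if (!bigger)
        return NULL;
    memcpy(bigger->socks, old->socks,
           old->num_socks * sizeof(old->socks[0]));
    bigger->num_socks = old->num_socks;
    bigger->id = old->id;
    free(old);
    return bigger;
}

/* Check a selected sk's recorded group ID against the requesting group. */
static int model_sk_in_group(const struct model_reuse *requesting,
                             unsigned int selected_id)
{
    return requesting->id != 0 && requesting->id == selected_id;
}
```

A sk that recorded the group ID before the grow still passes model_sk_in_group() afterwards, which a saved pointer comparison could not guarantee.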
    • M
      tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket · 40a1227e
      Martin KaFai Lau committed
      Although the actual cookie check "__cookie_v[46]_check()" does
      not involve sk specific info, it checks whether the sk has a recent
      synq overflow event in "tcp_synq_no_recent_overflow()".  The
      tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
      when it has sent out a syncookie (through "tcp_synq_overflow()").
      
      The above per sk "recent synq overflow event timestamp" works well
      for non SO_REUSEPORT use case.  However, it may cause random
      connection request reject/discard when SO_REUSEPORT is used with
      syncookie because it fails the "tcp_synq_no_recent_overflow()"
      test.
      
      When SO_REUSEPORT is used, there are usually multiple listening
      socks serving TCP connection requests destined to the same local IP:PORT.
      There are cases where the TCP-ACK-COOKIE may not be received
      by the same sk that sent out the syncookie.  For example,
      if reuse->socks[] began with {sk0, sk1},
      1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
         was updated.
      2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
         and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
         Other orderings can trigger a similar situation
         below, but the idea is the same.
      3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
         "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
         all syncookies sent by sk1 will be handled (and rejected)
         by sk2 while sk1 is still alive.
      
      The userspace may create and remove listening SO_REUSEPORT sockets
      as it sees fit.  E.g. adding a new thread (and SO_REUSEPORT sock) to handle
      incoming requests, an old process stopping and a new process starting, etc.
      With or without SO_ATTACH_REUSEPORT_[CB]BPF,
      the sockets leaving and joining a reuseport group makes picking
      the same sk to check the syncookie very difficult (if not impossible).
      
      The later patches will allow bpf prog more flexibility in deciding
      where a sk should be located in a bpf map and selecting a particular
      SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
      replace the whole bpf reuseport_array in one map_update() by using
      map-in-map.  Getting the syncookie check working smoothly across
      socks in the same "reuse->socks[]" is important.
      
      A partial solution is to set the newly added sk's ts_recent_stamp
      to the max ts_recent_stamp of a reuseport group, but that requires
      iterating through reuse->socks[].  Another is to
      pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
      joining a reuseport group.  However, neither of them handles an
      existing sk getting moved around the reuse->socks[] when that
      sk may not have ts_recent_stamp updated, which is unlikely under
      continuous synflood but not impossible.
      
      This patch opts to treat the reuseport group as a whole when
      considering the last synq overflow timestamp since
      they are serving the same IP:PORT from the userspace
      (and BPF program) perspective.
      
      "synq_overflow_ts" is added to "struct sock_reuseport".
      The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
      will update/check reuse->synq_overflow_ts if the sk is
      in a reuseport group.  Similar to the reuseport decision in
      __inet_lookup_listener(), both sk->sk_reuseport and
      sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
      Update on "synq_overflow_ts" happens at roughly once
      every second.
      
      A synflood test was done with 16 rx-queues and 16 reuseport sockets.
      No meaningful performance change is observed.  Throughput is ~9Mpps
      in IPv4 both before and after the change.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      40a1227e
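
The difference between a per-sk and a group-wide "last synq overflow" timestamp, as described in the commit above, can be modelled in a few lines. This is a sketch under assumptions rather than the kernel code: the `model_` names are hypothetical, time is counted in whole seconds instead of jiffies, and the 120-second validity window is an assumed stand-in for TCP_SYNCOOKIE_VALID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed validity window in seconds; stand-in for TCP_SYNCOOKIE_VALID. */
#define MODEL_SYNCOOKIE_VALID 120UL

/* Hypothetical stand-in for struct sock_reuseport. */
struct model_group {
    unsigned long synq_overflow_ts;
};

/* Hypothetical stand-in for a listening sk. */
struct model_sk {
    unsigned long ts_recent_stamp;
    struct model_group *group;   /* NULL when not in a reuseport group */
};

/* Pick the group-wide timestamp when the sk belongs to a group. */
static unsigned long *model_overflow_ts(struct model_sk *sk)
{
    return sk->group ? &sk->group->synq_overflow_ts
                     : &sk->ts_recent_stamp;
}

/* Called when a syncookie is sent; refreshes at most once per second. */
static void model_synq_overflow(struct model_sk *sk, unsigned long now)
{
    unsigned long *last = model_overflow_ts(sk);

    if (now - *last >= 1)
        *last = now;
}

/* True means "no recent overflow", so a returning cookie ACK is rejected. */
static bool model_synq_no_recent_overflow(struct model_sk *sk,
                                          unsigned long now)
{
    return now - *model_overflow_ts(sk) > MODEL_SYNCOOKIE_VALID;
}
```

With `group == NULL` on both socks, a cookie sent by one sk is rejected by the other. Once both point at the same model_group, either of them accepts the returning ACK inside the validity window.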
  15. 10 Aug, 2018 5 commits
    • T
      xdp: Helpers for disabling napi_direct of xdp_return_frame · 2539650f
      Toshiaki Makita committed
      We need some mechanism to disable napi_direct on calling
      xdp_return_frame_rx_napi() from some context.
      When veth gets support of XDP_REDIRECT, it will redirect packets which
      are redirected from other devices. On redirection veth will reuse
      xdp_mem_info of the redirection source device to make return_frame work.
      But in this case .ndo_xdp_xmit() called from veth redirection uses
      xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
      is not called directly from the rxq which owns the xdp_mem_info.
      
      This approach introduces a flag in bpf_redirect_info to indicate that
      napi_direct should be disabled even when _rx_napi variant is used as
      well as helper functions to use it.
      
      A NAPI handler that wants to use this flag needs to call
      xdp_set_return_frame_no_direct() before processing packets, and call
      xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
      exiting NAPI.
      
      v4:
      - Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
        avoid per-frame copy cost.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2539650f
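
The set/clear discipline described in the commit above can be pictured with a userspace model. It is a sketch only: the `model_` names and the flag value are hypothetical, a single global struct stands in for bpf_redirect_info, and the poll loop merely counts how many frames would take the direct path.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_RI_F_NO_DIRECT 0x1u        /* assumed flag bit */

/* Hypothetical stand-in for bpf_redirect_info; one global instance here. */
struct model_redirect_info {
    unsigned int kern_flags;
};

static struct model_redirect_info model_ri;

static void model_set_no_direct(void)
{
    model_ri.kern_flags |= MODEL_RI_F_NO_DIRECT;
}

static void model_clear_no_direct(void)
{
    model_ri.kern_flags &= ~MODEL_RI_F_NO_DIRECT;
}

static bool model_no_direct(void)
{
    return (model_ri.kern_flags & MODEL_RI_F_NO_DIRECT) != 0;
}

/* The direct path is taken only if the _rx_napi variant was requested
 * and the flag does not veto it. */
static bool model_use_napi_direct(bool rx_napi_variant)
{
    return rx_napi_variant && !model_no_direct();
}

/* Shape of a NAPI poll that may handle frames redirected from other
 * devices: set the flag first, clear it last.  Returns how many frames
 * would have used the direct path. */
static int model_napi_poll(int budget, bool handles_redirected_frames)
{
    int direct = 0;

    if (handles_redirected_frames)
        model_set_no_direct();
    for (int i = 0; i < budget; i++)
        if (model_use_napi_direct(true))
            direct++;
    /* In the kernel, xdp_do_flush_map() would run before the clear. */
    if (handles_redirected_frames)
        model_clear_no_direct();
    return direct;
}
```

Storing the flag in one place that is consulted on every return, rather than in each frame's memory info, matches the v4 note about avoiding a per-frame copy cost.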
    • T
      bpf: Make redirect_info accessible from modules · 0b19cc0a
      Toshiaki Makita committed
      We are going to add kern_flags field in redirect_info for kernel
      internal use.
      In order to avoid function call to access the flags, make redirect_info
      accessible from modules. Also as it is now non-static, add prefix bpf_
      to redirect_info.
      
      v6:
      - Fix sparse warning around EXPORT_SYMBOL.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      0b19cc0a
    • T
      net: Export skb_headers_offset_update · b0768a86
      Toshiaki Makita committed
      This is needed for veth XDP which does skb_copy_expand()-like operation.
      
      v2:
      - Drop skb_copy_header part because it has already been exported now.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      b0768a86
    • B
      Revert "xdp: add NULL pointer check in __xdp_return()" · eb91e4d4
      Björn Töpel committed
      This reverts commit 36e0f12b.
      
      The reverted commit adds a WARN to check against NULL entries in the
      mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
      driver) fast path is required to make a paired
      xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
      addition, a driver using a different allocation scheme than the
      default MEM_TYPE_PAGE_SHARED is required to additionally call
      xdp_rxq_info_reg_mem_model.
      
      For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
      that the mem_id_ht rhashtable has a properly inserted allocator id. If
      not, this would be a driver bug. A NULL pointer kernel OOPS is
      preferred to the WARN.
      Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      eb91e4d4
    • A
      net: allow to call netif_reset_xps_queues() under cpus_read_lock · 4d99f660
      Andrei Vagin committed
      The definition of static_key_slow_inc() has cpus_read_lock in place. In the
      virtio_net driver, XPS queues are initialized after setting the queue:cpu
      affinity in virtnet_set_affinity() which is already protected within
      cpus_read_lock. Lockdep prints a warning when we are trying to acquire
      cpus_read_lock when it is already held.
      
      This patch adds an ability to call __netif_set_xps_queue under
      cpus_read_lock().
      Acked-by: Jason Wang <jasowang@redhat.com>
      
      ============================================
      WARNING: possible recursive locking detected
      4.18.0-rc3-next-20180703+ #1 Not tainted
      --------------------------------------------
      swapper/0/1 is trying to acquire lock:
      00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20
      
      but task is already holding lock:
      00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(cpu_hotplug_lock.rw_sem);
        lock(cpu_hotplug_lock.rw_sem);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by swapper/0/1:
       #0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
       #1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
       #2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60
      
      v2: move cpus_read_lock() out of __netif_set_xps_queue()
      
      Cc: "Nambiar, Amritha" <amritha.nambiar@intel.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Fixes: 8af2c06f ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4d99f660
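
The locking split described in the commit above ("move cpus_read_lock() out of __netif_set_xps_queue()") can be sketched with a toy lock model. Everything here is hypothetical: the `model_` names, the depth counter that stands in for cpu_hotplug_lock, and in particular the idea of a thin wrapper that takes the lock for callers that do not already hold it, which is an assumption about how such a split is typically arranged rather than something the commit message spells out.

```c
#include <assert.h>

static int model_read_depth;        /* how often the model lock is held */
static int model_recursion_seen;    /* set if acquired while already held */

static void model_cpus_read_lock(void)
{
    if (model_read_depth > 0)
        model_recursion_seen = 1;   /* what lockdep would complain about */
    model_read_depth++;
}

static void model_cpus_read_unlock(void)
{
    assert(model_read_depth > 0);
    model_read_depth--;
}

/* Inner function: the caller must already hold the lock. */
static int model_set_xps_queue_locked(int queue)
{
    assert(model_read_depth > 0);
    (void)queue;                    /* real work elided in this model */
    return 0;
}

/* Assumed wrapper for callers that do not hold the lock. */
static int model_set_xps_queue(int queue)
{
    int ret;

    model_cpus_read_lock();
    ret = model_set_xps_queue_locked(queue);
    model_cpus_read_unlock();
    return ret;
}

/* A caller in the style of the virtio_net path: it already holds the
 * lock while setting affinity, so it uses the inner function directly. */
static int model_driver_init_queues(int nqueues)
{
    int ret = 0;

    model_cpus_read_lock();
    for (int q = 0; q < nqueues && ret == 0; q++)
        ret = model_set_xps_queue_locked(q);
    model_cpus_read_unlock();
    return ret;
}
```

Calling the wrapper from inside a section that already holds the model lock sets model_recursion_seen, which corresponds to the recursive acquisition reported in the lockdep splat above.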
  16. 08 Aug, 2018 1 commit
  17. 07 Aug, 2018 1 commit
  18. 06 Aug, 2018 2 commits
  19. 05 Aug, 2018 1 commit
  20. 04 Aug, 2018 1 commit
  21. 03 Aug, 2018 2 commits