1. 10 May, 2020 5 commits
    • bpf: Create file bpf iterator · 367ec3e4
      Yonghong Song committed
      To produce a file bpf iterator, the fd must correspond
      to a link_fd associated with a trace/iter program.
      When the pinned file is opened, a seq_file will be
      generated.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175906.2475893-1-yhs@fb.com
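
A minimal userspace sketch of the pin-then-open flow described above. It assumes a kernel with bpf_iter support, a mounted bpffs, and a link fd that already refers to a tracing/iter link. The helper names `pin_iter_link` and `read_pinned_iter` are hypothetical; only the `bpf(2)` syscall, `BPF_OBJ_PIN`, and the `union bpf_attr` fields come from the kernel UAPI.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Pin an existing bpf_iter link fd at `path` (expected to be on a bpffs
 * mount). Returns 0 on success, -1 with errno set on failure. */
static int pin_iter_link(int link_fd, const char *path)
{
    union bpf_attr attr;

    assert(path != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)path;
    attr.bpf_fd = (uint32_t)link_fd;
    return (int)syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}

/* Open the pinned file and read up to `len` bytes of iterator output.
 * Returns the number of bytes read, or -1 on error. */
static long read_pinned_iter(const char *path, char *buf, size_t len)
{
    long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_RDONLY); /* per the commit, open generates a seq_file */
    if (fd < 0)
        return -1;
    while ((size_t)total < len) {
        n = read(fd, buf + total, len - (size_t)total);
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0) /* end of iterator output */
            break;
        total += n;
    }
    close(fd);
    return total;
}
```

Each open of the pinned path should start a fresh iteration, so reading it with `cat` is the usual way to consume the output.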
    • bpf: Create anonymous bpf iterator · ac51d99b
      Yonghong Song committed
      A new bpf command BPF_ITER_CREATE is added.
      
      The anonymous bpf iterator is seq_file based.
      The seq_file private data are referenced by targets.
      The bpf_iter infrastructure allocates additional space
      at seq_file->private before the space used by targets
      to store some metadata, e.g.,
        prog:       prog to run
        session_id: a unique id for each opened seq_file
        seq_num:    how many times bpf programs are queried in this session
        done_stop:  an internal state to decide whether bpf program
                    should be called in seq_ops->stop() or not
      
      The seq_num will start from 0 for valid objects.
      The bpf program may see the same seq_num more than once if
       - seq_file buffer overflow happens and the same object
         is retried by bpf_seq_read(), or
       - the bpf program explicitly requests a retry of the
         same object
      
      Since modules are not supported for bpf_iter, all target
      registration happens at __init time, so there is no
      need to change bpf_iter_unreg_target(), as it is used
      mostly in the error path of the init function, at which
      time no bpf iterators have been created yet.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175905.2475770-1-yhs@fb.com
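
The "extra space before the target's private data" layout can be illustrated with a small userspace model. This is a sketch only: the struct name `iter_meta_model` and the helpers are hypothetical, the field names are taken from the commit message, and the kernel's actual struct may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model: infrastructure metadata sits in front of the area
 * handed to the target as its private data. */
struct iter_meta_model {
    const void *prog;           /* prog to run (opaque in this model) */
    uint64_t session_id;        /* unique id for each opened seq_file */
    uint64_t seq_num;           /* how many times the prog was queried */
    int done_stop;              /* internal state for seq_ops->stop() */
    uint64_t target_private[];  /* target data; uint64_t keeps it aligned */
};

/* Allocate metadata plus `target_size` bytes and return the pointer the
 * target would see. seq_num starts at 0 because calloc() zeroes memory. */
static void *iter_priv_alloc(size_t target_size, uint64_t session_id)
{
    struct iter_meta_model *m = calloc(1, sizeof(*m) + target_size);

    if (!m)
        return NULL;
    m->session_id = session_id;
    return m->target_private;
}

/* Step back from the target's pointer to the enclosing metadata. */
static struct iter_meta_model *iter_meta_of(void *target_private)
{
    assert(target_private != NULL);
    return (struct iter_meta_model *)((unsigned char *)target_private -
            offsetof(struct iter_meta_model, target_private));
}

static void iter_priv_free(void *target_private)
{
    if (target_private)
        free(iter_meta_of(target_private));
}
```

With this layout the target code never needs to know about the metadata, while the infrastructure can always recover it from the pointer stored in the seq_file.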
    • bpf: Support bpf tracing/iter programs for BPF_LINK_CREATE · de4e05ca
      Yonghong Song committed
      Given a bpf program, the steps to create an anonymous bpf iterator are:
        - create a bpf_iter_link, which combines bpf program and the target.
          In the future, there could be more information recorded in the link.
          A link_fd will be returned to the user space.
        - create an anonymous bpf iterator with the given link_fd.
      
      The bpf_iter_link can also be pinned to a bpffs mount to
      create a file based bpf iterator.

      The benefits of using bpf_iter_link:
        - using bpf link simplifies design and implementation as bpf link
          is used for other tracing bpf programs.
        - for file based bpf iterator, bpf_iter_link provides a standard
          way to replace underlying bpf programs.
        - for both anonymous and file based iterators, bpf link query
          capability can be leveraged.
      
      The patch adds support for tracing/iter programs to BPF_LINK_CREATE.
      A new link type BPF_LINK_TYPE_ITER is added to facilitate link
      querying. Currently, only prog_id is needed, so no additional
      in-kernel show_fdinfo() or fill_link_info() hook is needed for
      the BPF_LINK_TYPE_ITER link.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175901.2475084-1-yhs@fb.com
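
The two steps listed in this commit (create a link, then create an anonymous iterator from the link_fd) might look like the following from user space. This is a sketch under assumptions: the program behind `prog_fd` was already loaded as a tracing program with attach type BPF_TRACE_ITER, the kernel headers are new enough to define BPF_LINK_CREATE and BPF_ITER_CREATE, and the helper names `sys_bpf` and `iter_fd_from_prog` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the bpf(2) syscall. */
static int sys_bpf(int cmd, union bpf_attr *attr)
{
    return (int)syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/*
 * Step 1: create a bpf_iter_link for an already loaded tracing/iter prog.
 * Step 2: create an anonymous iterator from the returned link_fd.
 * Returns the iterator fd, or -1 on error. On success *link_fd_out holds
 * the link fd; the caller is responsible for closing both fds.
 */
static int iter_fd_from_prog(int prog_fd, int *link_fd_out)
{
    union bpf_attr attr;
    int link_fd, iter_fd;

    assert(link_fd_out != NULL);

    memset(&attr, 0, sizeof(attr));
    attr.link_create.prog_fd = (__u32)prog_fd;
    attr.link_create.attach_type = BPF_TRACE_ITER;
    link_fd = sys_bpf(BPF_LINK_CREATE, &attr);
    if (link_fd < 0)
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.iter_create.link_fd = (__u32)link_fd;
    iter_fd = sys_bpf(BPF_ITER_CREATE, &attr);
    if (iter_fd < 0) {
        close(link_fd);
        return -1;
    }
    *link_fd_out = link_fd;
    return iter_fd;
}
```

Reading from the returned iterator fd with `read()` is what drives the bpf program over the target's objects.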
    • bpf: Allow loading of a bpf_iter program · 15d83c4d
      Yonghong Song committed
      A bpf_iter program is a tracing program with attach type
      BPF_TRACE_ITER. The load attribute
        attach_btf_id
      is checked by the verifier against a particular kernel function,
      which represents a target, e.g., __bpf_iter__bpf_map
      for the target bpf_map, which is implemented later.
      
      The program return value must be 0 or 1 for now.
        0 : successful, except potential seq_file buffer overflow
            which is handled by seq_file reader.
        1 : request to restart the same object
      
      In the future, other return values may be used for filtering or
      terminating the iterator.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175900.2474947-1-yhs@fb.com
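
The meaning of the two return values can be shown with a toy driver loop. This is a userspace model and not kernel code: `run_iter_model`, `retry_once_prog`, and the context struct are hypothetical names, and returning -1 for other values is an assumption of the model (the commit only states that the return value must be 0 or 1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a bpf_iter program: receives the current seq_num and object. */
typedef int (*iter_prog_fn)(uint64_t seq_num, int obj, void *ctx);

/* Visit objs[0..n). A return of 0 moves on to the next object; a return
 * of 1 replays the same object with the same seq_num. Returns the number
 * of program invocations, or -1 on any other return value. */
static long run_iter_model(const int *objs, size_t n, iter_prog_fn prog,
                           void *ctx)
{
    uint64_t seq_num = 0;
    long calls = 0;
    size_t i = 0;

    assert(prog != NULL && (objs != NULL || n == 0));
    while (i < n) {
        int ret = prog(seq_num, objs[i], ctx);

        calls++;
        if (ret == 1)
            continue;   /* retry: same object, same seq_num */
        if (ret != 0)
            return -1;  /* model-only handling of unexpected values */
        i++;
        seq_num++;
    }
    return calls;
}

/* Example program: records every seq_num it sees and asks for exactly
 * one retry when it first meets `retry_obj`. */
struct retry_once_ctx {
    int retry_obj;
    int retried;
    uint64_t seen[8];
    size_t nr_seen;
};

static int retry_once_prog(uint64_t seq_num, int obj, void *ctx)
{
    struct retry_once_ctx *c = ctx;

    if (c->nr_seen < 8)
        c->seen[c->nr_seen++] = seq_num;
    if (obj == c->retry_obj && !c->retried) {
        c->retried = 1;
        return 1;
    }
    return 0;
}
```

Running the model over three objects with one retry requested on the second gives four invocations, and the program sees seq_num 1 twice.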
    • bpf: Implement an interface to register bpf_iter targets · ae24345d
      Yonghong Song committed
      The target can call bpf_iter_reg_target() to register itself.
      The needed information:
        target:            target name
        seq_ops:           the seq_file operations for the target
        init_seq_private:  target callback to initialize seq_priv during file open
        fini_seq_private:  target callback to clean up seq_priv during file release
        seq_priv_size:     the private_data size needed by the seq_file
                           operations
      
      The target name represents a target which provides a seq_ops
      for iterating objects.
      
      The target can provide two callback functions, init_seq_private
      and fini_seq_private, called during file open/release time.
      For example, for /proc/net/{tcp6, ipv6_route, netlink, ...}, the net
      namespace needs to be set up properly during file open and
      released properly during file release.
      
      Function bpf_iter_unreg_target() is also implemented to unregister
      a particular target.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175859.2474669-1-yhs@fb.com
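
A registry of the kind described above could be modelled as follows. This is a userspace sketch, not the kernel implementation: the `_model` names and the fixed-size table are hypothetical, the seq_ops field is left out for brevity, and the kernel functions `bpf_iter_reg_target()` and `bpf_iter_unreg_target()` may have different signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical registration record mirroring the fields listed above. */
struct target_model {
    const char *target;                    /* target name */
    int (*init_seq_private)(void *priv);   /* called at file open */
    void (*fini_seq_private)(void *priv);  /* called at file release */
    size_t seq_priv_size;                  /* private data size */
};

#define MAX_TARGETS_MODEL 8

static struct target_model targets_model[MAX_TARGETS_MODEL];
static size_t nr_targets_model;

/* Register a target. Returns 0 on success, -1 if the table is full. */
static int reg_target_model(const struct target_model *t)
{
    assert(t != NULL && t->target != NULL);
    if (nr_targets_model == MAX_TARGETS_MODEL)
        return -1;
    targets_model[nr_targets_model++] = *t;
    return 0;
}

/* Look a target up by name. Returns NULL if it is not registered. */
static const struct target_model *find_target_model(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < nr_targets_model; i++)
        if (strcmp(targets_model[i].target, name) == 0)
            return &targets_model[i];
    return NULL;
}

/* Unregister by name. Returns 0 on success, -1 if the name is unknown. */
static int unreg_target_model(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < nr_targets_model; i++) {
        if (strcmp(targets_model[i].target, name) == 0) {
            /* Fill the hole with the last entry; order is not kept. */
            targets_model[i] = targets_model[nr_targets_model - 1];
            nr_targets_model--;
            return 0;
        }
    }
    return -1;
}
```

In this model a target such as `bpf_map` would register once at init time and be looked up by name when a program is attached.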
  2. 05 May, 2020 1 commit
    • bpf: Avoid gcc-10 stringop-overflow warning in struct bpf_prog · d26c0cc5
      Arnd Bergmann committed
      gcc-10 warns about accesses to zero-length arrays:
      
      kernel/bpf/core.c: In function 'bpf_patch_insn_single':
      cc1: warning: writing 8 bytes into a region of size 0 [-Wstringop-overflow=]
      In file included from kernel/bpf/core.c:21:
      include/linux/filter.h:550:20: note: at offset 0 to object 'insnsi' with size 0 declared here
        550 |   struct bpf_insn  insnsi[0];
            |                    ^~~~~~
      
      In this case, we really want to have two flexible-array members,
      but that is not possible. Removing the union to make insnsi a
      flexible-array member while leaving insns as a zero-length array
      fixes the warning, as nothing writes to the other one in that way.
      
      This trick only works on linux-3.18 or higher, as older versions
      had additional members in the union.
      
      Fixes: 60a3b225 ("net: bpf: make eBPF interpreter images read-only")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200430213101.135134-6-arnd@arndb.de
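
For readers unfamiliar with the distinction, the sketch below shows a C99 flexible-array member, which is the form `insnsi` takes after this change. The struct and helper names are hypothetical and heavily simplified. The kernel struct also keeps a zero-length `insns` array next to it, which is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for one instruction. */
struct insn_model {
    uint8_t code;
    uint8_t regs;
    int16_t off;
    int32_t imm;
};

/* The trailing array is declared with [] (flexible-array member), so the
 * compiler treats its size as unknown rather than as zero bytes. */
struct prog_model {
    uint32_t len;
    struct insn_model insnsi[];
};

/* Allocate a program with room for `len` instructions. */
static struct prog_model *prog_alloc_model(uint32_t len)
{
    struct prog_model *p;

    p = calloc(1, sizeof(*p) + (size_t)len * sizeof(p->insnsi[0]));
    if (p)
        p->len = len;
    return p;
}

/* Overwrite one instruction in place, with a bounds check. */
static void prog_patch_model(struct prog_model *p, uint32_t idx,
                             struct insn_model insn)
{
    assert(p != NULL && idx < p->len);
    p->insnsi[idx] = insn;
}
```

Writes through a flexible-array member are sized against the allocation rather than against a declared length of zero, which is the kind of access the quoted warning objects to.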
  3. 03 May, 2020 1 commit
  4. 02 May, 2020 1 commit
    • bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS · d46edd67
      Song Liu committed
      Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
      Typical userspace tools use kernel.bpf_stats_enabled as follows:
      
        1. Enable kernel.bpf_stats_enabled;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Disable kernel.bpf_stats_enabled.
      
      The problem with this approach is that only one userspace tool can toggle
      this sysctl. If multiple tools toggle the sysctl at the same time, the
      measurement may be inaccurate.
      
      To fix this problem while keeping backward compatibility, introduce a new
      bpf command BPF_ENABLE_STATS. On success, this command enables stats and
      returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently,
      only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
      command to support other types of stats in the future.
      
      With BPF_ENABLE_STATS, a userspace tool would have the following flow:

        1. Get an fd with BPF_ENABLE_STATS, and make sure it is valid;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Close the fd.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
  5. 01 May, 2020 5 commits
  6. 29 April, 2020 12 commits
  7. 28 April, 2020 1 commit
  8. 27 April, 2020 4 commits
  9. 26 April, 2020 2 commits
  10. 25 April, 2020 2 commits
  11. 24 April, 2020 4 commits
    • net: napi: add hard irqs deferral feature · 6f8b12d6
      Eric Dumazet committed
      Back in commit 3b47d303 ("net: gro: add a per device gro flush timer")
      we added the ability to arm one high resolution timer, which we used
      to keep incomplete packets in the GRO engine a bit longer, hoping that further
      frames might be added to them.
      
      Since then, we added the napi_complete_done() interface, and commit
      364b6055 ("net: busy-poll: return busypolling status to drivers")
      allowed drivers to avoid re-arming NIC interrupts if we made a promise
      that their NAPI poll() handler would be called in the near future.
      
      This infrastructure can be leveraged, thanks to a new device parameter,
      which allows arming the napi hrtimer instead of re-arming the device
      hard IRQ.

      We have noticed that on some servers with 32 RX queues or more, the chit-chat
      between the NIC and the host caused by IRQ delivery and re-arming could hurt
      throughput by ~20% on a 100Gbit NIC.
      
      In contrast, hrtimers are using local (percpu) resources and might have lower
      cost.
      
      The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
      as gro_flush_timeout (/sys/class/net/ethX/).
      
      By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
      
      This patch does not change the prior behavior of gro_flush_timeout
      if used alone: NIC hard irqs should be rearmed as before.

      One concrete usage can be:
      
      echo 20000 >/sys/class/net/eth1/gro_flush_timeout
      echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
      
      If at least one packet is retired, then we will reset napi counter
      to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
      of the queue.
      
      On busy queues, this should avoid NIC hard IRQs, whereas before this patch IRQ
      avoidance was only possible if napi->poll() was exhausting its budget
      and not calling napi_complete_done().

      This feature can also be used to work around some non-optimal NIC irq
      coalescing strategies.
      
      Having the ability to insert XX usec delays between each napi->poll()
      can increase cache efficiency, since we increase batch sizes.
      
      It also keeps serving CPUs from idling too long, reducing tail latencies.
      Co-developed-by: Luigi Rizzo <lrizzo@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Update transobj.c to new cmd interface · e0b4b472
      Leon Romanovsky committed
      Do a mass update of transobj.c to reuse the newly introduced
      mlx5_cmd_exec_in*() interfaces.
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
    • net/mlx5: Update cq.c to new cmd interface · d1f62050
      Leon Romanovsky committed
      Do a mass update of cq.c to reuse the newly introduced
      mlx5_cmd_exec_in*() interfaces.
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
    • net/mlx5: Update vport.c to new cmd interface · 5d1c9a11
      Leon Romanovsky committed
      Do a mass update of vport.c to reuse the newly introduced
      mlx5_cmd_exec_in*() interfaces.
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
  12. 23 April, 2020 2 commits