1. 03 Mar 2019, 1 commit
    • bpf: Sample HBM BPF program to limit egress bw · 187d0738
      Authored by brakmo
      A cgroup skb BPF program to limit cgroup output bandwidth.
      It uses a modified virtual token bucket queue to limit average
      egress bandwidth. The implementation uses credits instead of tokens.
      Negative credits imply that queueing would have happened (this is
      a virtual queue, so no queueing is done by it; however, queueing may
      still occur at the actual qdisc, which is not used for rate limiting).
      
      This implementation uses 3 thresholds, one to start marking packets and
      the other two to drop packets:
                                       CREDIT
             - <--------------------------|------------------------> +
                   |    |          |      0
                   |  Large pkt    |
                   |  drop thresh  |
        Small pkt drop             Mark threshold
            thresh
      
      The effect of marking depends on the type of packet:
      a) If the packet is ECN enabled, then the packet is ECN ce marked.
         The current mark threshold is tuned for DCTCP.
      b) Otherwise, the packet is dropped if it is a large packet.
      
      If the credit is below the drop threshold, the packet is dropped.
      Note that dropping a packet through the BPF program does not trigger CWR
      (Congestion Window Reduction) in TCP packets. A future patch will add
      support for triggering CWR.
      
      This BPF program actually uses 2 drop thresholds, one threshold
      for larger packets (>= 120 bytes) and another for smaller packets. This
      protects smaller packets such as SYNs, ACKs, etc.
      
      The default bandwidth limit is set at 1Gbps, but this can be changed by
      a user program through a shared BPF map. In addition, by default this BPF
      program does not limit connections using loopback. This behavior can be
      overridden by the user program. There is also an option to calculate
      some statistics, such as the percentage of packets marked or dropped,
      which the user program can access.
      
      A later patch provides such a program (hbm.c).
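      For illustration, here is a hedged sketch of the credit arithmetic
      described above, in cgroup-skb form. The rate constant, thresholds,
      map layout, and section name are invented for the sketch, and
      synchronization of the shared queue state is omitted; the actual
      program from this patch differs.

        /* Hedged sketch of the virtual-credit scheme described above.
         * Constants, map layout and section name are illustrative only;
         * the real program also synchronizes updates to shared state.
         */
        #include <uapi/linux/bpf.h>
        #include "bpf_helpers.h"

        /* 1 Gbps == 1 bit per ns == 1/8 byte per ns of elapsed time */
        #define CREDIT_PER_NS(delta)	((__s64)(delta) >> 3)
        #define MAX_CREDIT		(100 * 1500)	/* burst allowance, bytes */
        #define LARGE_PKT_THRESH	120		/* bytes */
        #define DROP_THRESH_LARGE	(-40 * 1500)	/* drop large pkts below this */
        #define DROP_THRESH_SMALL	(-80 * 1500)	/* drop any pkt below this */

        struct hbm_vqueue {
        	__u64 lasttime;	/* ns timestamp of the last credit update */
        	__s64 credit;	/* bytes; negative means queueing would occur */
        };

        struct bpf_map_def SEC("maps") queue_state = {
        	.type        = BPF_MAP_TYPE_ARRAY,
        	.key_size    = sizeof(__u32),
        	.value_size  = sizeof(struct hbm_vqueue),
        	.max_entries = 1,
        };

        SEC("cgroup_skb/egress")
        int hbm_egress(struct __sk_buff *skb)
        {
        	__u32 key = 0;
        	__u64 now = bpf_ktime_get_ns();
        	struct hbm_vqueue *q;
        	__s64 credit;

        	q = bpf_map_lookup_elem(&queue_state, &key);
        	if (!q)
        		return 1;			/* allow */

        	/* Refill credit for the elapsed time, capped at the burst
        	 * size, then charge this packet against it.
        	 */
        	credit = q->credit + CREDIT_PER_NS(now - q->lasttime);
        	if (credit > MAX_CREDIT)
        		credit = MAX_CREDIT;
        	credit -= skb->len;

        	q->lasttime = now;
        	q->credit = credit;

        	/* Two drop thresholds: small packets (SYNs, ACKs) survive longer */
        	if (credit < DROP_THRESH_SMALL)
        		return 0;			/* drop any packet */
        	if (skb->len >= LARGE_PKT_THRESH && credit < DROP_THRESH_LARGE)
        		return 0;			/* drop large packet */

        	/* Below the mark threshold the real program ECN-CE-marks
        	 * ECN-capable packets; marking is omitted in this sketch.
        	 */
        	return 1;				/* allow */
        }

        char _license[] SEC("license") = "GPL";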
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 01 Mar 2019, 2 commits
  3. 26 Feb 2019, 1 commit
  4. 02 Feb 2019, 1 commit
  5. 16 Jan 2019, 1 commit
    • samples/bpf: workaround clang asm goto compilation errors · 6bf3bbe1
      Authored by Yonghong Song
      x86 compilation has required asm goto support since 4.17.
      Since clang does not support asm goto, at 4.17
      commit b1ae32db ("x86/cpufeature: Guard asm_volatile_goto usage
      for BPF compilation") worked around the issue by permitting an
      alternative implementation without asm goto for clang.
      
      At 5.0, more asm goto usages appeared.
        [yhs@148 x86]$ egrep -r asm_volatile_goto
        include/asm/cpufeature.h:     asm_volatile_goto("1: jmp 6f\n"
        include/asm/jump_label.h:     asm_volatile_goto("1:"
        include/asm/jump_label.h:     asm_volatile_goto("1:"
        include/asm/rmwcc.h:  asm_volatile_goto (fullop "; j" #cc " %l[cc_label]"     \
        include/asm/uaccess.h:        asm_volatile_goto("\n"                          \
        include/asm/uaccess.h:        asm_volatile_goto("\n"                          \
        [yhs@148 x86]$
      
      When compiling the samples/bpf directory, most bpf programs failed
      to compile with error messages like:
        In file included from /home/yhs/work/bpf-next/samples/bpf/xdp_sample_pkts_kern.c:2:
        In file included from /home/yhs/work/bpf-next/include/linux/ptrace.h:6:
        In file included from /home/yhs/work/bpf-next/include/linux/sched.h:15:
        In file included from /home/yhs/work/bpf-next/include/linux/sem.h:5:
        In file included from /home/yhs/work/bpf-next/include/uapi/linux/sem.h:5:
        In file included from /home/yhs/work/bpf-next/include/linux/ipc.h:9:
        In file included from /home/yhs/work/bpf-next/include/linux/refcount.h:72:
        /home/yhs/work/bpf-next/arch/x86/include/asm/refcount.h:70:9: error: 'asm goto' constructs are not supported yet
              return GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
                     ^
        /home/yhs/work/bpf-next/arch/x86/include/asm/rmwcc.h:67:2: note: expanded from macro 'GEN_BINARY_SUFFIXED_RMWcc'
              __GEN_RMWcc(op " %[val], %[var]\n\t" suffix, var, cc,           \
              ^
        /home/yhs/work/bpf-next/arch/x86/include/asm/rmwcc.h:21:2: note: expanded from macro '__GEN_RMWcc'
              asm_volatile_goto (fullop "; j" #cc " %l[cc_label]"             \
              ^
        /home/yhs/work/bpf-next/include/linux/compiler_types.h:188:37: note: expanded from macro 'asm_volatile_goto'
        #define asm_volatile_goto(x...) asm goto(x)
      
      Most of these do not even provide an alternative
      implementation, and it is also not practical to make changes
      at each call site.
      
      This patch works around the asm goto issue by redefining the macro as below:
        #define asm_volatile_goto(x...) asm volatile("invalid use of asm_volatile_goto")
      
      If asm_volatile_goto is not used by bpf programs, which is typically
      the case, nothing bad will happen. If asm_volatile_goto is used by a
      bpf program, which is incorrect, the compiler will issue an error,
      since "invalid use of asm_volatile_goto" is not valid assembly code.
      
      With this patch, all bpf programs under samples/bpf can pass compilation.
      
      Note that bpf programs under tools/testing/selftests/bpf/ compiled fine as
      they do not access kernel internal headers.
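      The override can live in a small header that is force-included
      (e.g. via clang -include) ahead of each bpf program source; a
      sketch along these lines, with the file and guard names assumed:

        /* Sketch of the workaround header; guard name is illustrative. */
        #ifndef __ASM_GOTO_WORKAROUND_H
        #define __ASM_GOTO_WORKAROUND_H

        /* Pull in the kernel's asm_volatile_goto definition first, so
         * the redefinition below wins for everything included later.
         */
        #include <linux/types.h>

        #ifdef asm_volatile_goto
        #undef asm_volatile_goto
        #define asm_volatile_goto(x...) \
        	asm volatile("invalid use of asm_volatile_goto")
        #endif

        #endif /* __ASM_GOTO_WORKAROUND_H */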
      
      Fixes: e769742d ("Revert "x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs"")
      Fixes: 18fe5822 ("x86, asm: change the GEN_*_RMWcc() macros to not quote the condition")
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  6. 23 Dec 2018, 2 commits
  7. 21 Nov 2018, 1 commit
  8. 01 Sep 2018, 1 commit
  9. 27 Jul 2018, 3 commits
  10. 18 Jul 2018, 2 commits
  11. 27 Jun 2018, 1 commit
  12. 25 May 2018, 1 commit
  13. 15 May 2018, 3 commits
  14. 14 May 2018, 1 commit
  15. 11 May 2018, 4 commits
  16. 04 May 2018, 1 commit
  17. 29 Apr 2018, 1 commit
  18. 27 Apr 2018, 1 commit
  19. 19 Apr 2018, 1 commit
  20. 29 Mar 2018, 1 commit
  21. 26 Feb 2018, 1 commit
    • samples/bpf: Add program for CPU state statistics · c5350777
      Authored by Leo Yan
      A CPU is active when it has running tasks on it, and the CPUFreq
      governor can select different operating points (OPPs) according to
      the workload; we use 'pstate' to denote a CPU state that has running
      tasks at one specific OPP.  On the other hand, a CPU is idle when
      only the idle task is on it, and the CPUIdle governor can select one
      specific idle state to power off hardware logic; we use 'cstate' to
      denote the CPU idle state.
      
      Based on the trace events 'cpu_idle' and 'cpu_frequency' we can
      compute the duration statistics for every state.  Every time a CPU
      enters or exits an idle state, the trace event 'cpu_idle' is
      recorded; the trace event 'cpu_frequency' records CPU OPP changes,
      so it is easy to know how long the CPU stays at a specified OPP,
      during which the CPU cannot be in any idle state.
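      As an illustration of the cstate half of this idea, here is a
      minimal sketch in the classic samples/bpf style. The map names and
      sizes and the idle-exit sentinel are assumptions; the actual patch
      also tracks pstates via 'cpu_frequency'.

        /* Hedged sketch: account idle time per CPU from the
         * power:cpu_idle tracepoint. Map names/sizes are illustrative.
         */
        #include <uapi/linux/bpf.h>
        #include "bpf_helpers.h"

        #define MAX_CPU		8
        #define PWR_EVENT_EXIT	0xffffffff	/* state when leaving idle */

        struct bpf_map_def SEC("maps") idle_start = {	/* entry timestamps */
        	.type        = BPF_MAP_TYPE_ARRAY,
        	.key_size    = sizeof(__u32),
        	.value_size  = sizeof(__u64),
        	.max_entries = MAX_CPU,
        };

        struct bpf_map_def SEC("maps") idle_ns = {	/* accumulated idle time */
        	.type        = BPF_MAP_TYPE_ARRAY,
        	.key_size    = sizeof(__u32),
        	.value_size  = sizeof(__u64),
        	.max_entries = MAX_CPU,
        };

        /* Record layout after the common tracepoint header, per
         * /sys/kernel/debug/tracing/events/power/cpu_idle/format
         */
        struct cpu_idle_args {
        	__u64 pad;
        	__u32 state;
        	__u32 cpu_id;
        };

        SEC("tracepoint/power/cpu_idle")
        int trace_cpu_idle(struct cpu_idle_args *ctx)
        {
        	__u32 cpu = ctx->cpu_id;
        	__u64 now = bpf_ktime_get_ns();
        	__u64 *start, *total;

        	if (cpu >= MAX_CPU)
        		return 0;

        	if (ctx->state != PWR_EVENT_EXIT) {
        		/* Entering idle: remember when */
        		start = bpf_map_lookup_elem(&idle_start, &cpu);
        		if (start)
        			*start = now;
        		return 0;
        	}

        	/* Exiting idle: charge the elapsed time to this CPU */
        	start = bpf_map_lookup_elem(&idle_start, &cpu);
        	if (!start || *start == 0)
        		return 0;
        	total = bpf_map_lookup_elem(&idle_ns, &cpu);
        	if (total)
        		*total += now - *start;
        	*start = 0;
        	return 0;
        }

        char _license[] SEC("license") = "GPL";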
      
      This patch utilizes the mentioned trace events for pstate and
      cstate statistics.  To achieve more accurate profiling data, the
      program uses the sequence below to ensure CPU running/idle time is
      not missed:

      - Before profiling, the user space program wakes up all CPUs once,
        so it avoids missing accounted time for CPUs that have stayed in
        an idle state for a long time; the program forces
        'scaling_max_freq' to the lowest frequency and then restores it to
        the highest frequency, which ensures the frequency is set to the
        lowest value and can easily be changed to a higher frequency once
        the workload starts;

      - The user space program reads the map data and updates the
        statistics every 5s, the same as the other sample bpf programs,
        to avoid a large overhead introduced by the bpf program itself;

      - When a signal is sent to terminate the program, the signal handler
        wakes up all CPUs, sets the lowest frequency, and restores the
        highest frequency in 'scaling_max_freq'; this is exactly the same
        as the first step, so it avoids missing accounted pstate and
        cstate time during the last stage.  Finally, it reports the latest
        statistics.
      
      The program has been tested on a Hikey board with eight CA53 CPUs;
      below is one example of the resulting statistics, whose format
      mainly follows Jesper Dangaard Brouer's suggestion.

      Jesper suggested getting printf to pretty-print with thousands
      separators using %' and setlocale(LC_NUMERIC, "en_US"). Three
      different arm64 GCC toolchains were tried (5.4.0 20160609,
      6.2.1 20161016, 6.3.0 20170516), but none of them support the printf
      flag character %' on the arm64 platform, so the numbers are printed
      without grouping.
      
      CPU states statistics:
      state(ms)  cstate-0    cstate-1    cstate-2    pstate-0    pstate-1    pstate-2    pstate-3    pstate-4
      CPU-0      767         6111        111863      561         31          756         853         190
      CPU-1      241         10606       107956      484         125         646         990         85
      CPU-2      413         19721       98735       636         84          696         757         89
      CPU-3      84          11711       79989       17516       909         4811        5773        341
      CPU-4      152         19610       98229       444         53          649         708         1283
      CPU-5      185         8781        108697      666         91          671         677         1365
      CPU-6      157         21964       95825       581         67          566         684         1284
      CPU-7      125         15238       102704      398         20          665         786         1197
      
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  22. 03 Feb 2018, 1 commit
    • libbpf: add error reporting in XDP · bbf48c18
      Authored by Eric Leblond
      Parse the netlink extended ack attributes to get the error message
      returned by the card. The code is partially taken from libnl.
      
      We add netlink.h to the uapi includes of tools. We also need to
      avoid including the userspace netlink header for the sample to
      build successfully, so nlattr.h has a define to avoid the
      inclusion. Using a direct define could have been an issue as
      NLMSGERR_ATTR_MAX can change in the future.
      
      We also define SOL_NETLINK if it is not already defined, to avoid
      having to copy socket.h for a fixed value.
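      For illustration, a simplified sketch of extracting the
      extended-ack string from an NLMSG_ERROR reply. It assumes the
      kernel capped the payload (NLM_F_CAPPED), so no echo of the
      original request precedes the attributes; the code in the patch
      handles the general case.

        /* Simplified sketch: print NLMSGERR_ATTR_MSG from an extended
         * ack. Assumes NLM_F_CAPPED (no echoed request before the
         * attributes).
         */
        #include <stdio.h>
        #include <linux/netlink.h>

        static void print_extack_msg(const struct nlmsghdr *nh)
        {
        	const struct nlmsgerr *err = NLMSG_DATA(nh);
        	const struct nlattr *attr;
        	int rem;

        	if (!(nh->nlmsg_flags & NLM_F_ACK_TLVS))
        		return;		/* no extended ack attributes */

        	attr = (const struct nlattr *)((const char *)err + sizeof(*err));
        	rem = nh->nlmsg_len - NLMSG_HDRLEN - sizeof(*err);

        	while (rem >= (int)sizeof(*attr)) {
        		int len = attr->nla_len;

        		if (len < (int)sizeof(*attr) || len > rem)
        			break;	/* malformed attribute */
        		if ((attr->nla_type & NLA_TYPE_MASK) == NLMSGERR_ATTR_MSG)
        			fprintf(stderr, "netlink error: %s\n",
        				(const char *)attr + NLA_HDRLEN);
        		rem -= NLA_ALIGN(len);
        		attr = (const struct nlattr *)((const char *)attr +
        					       NLA_ALIGN(len));
        	}
        }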
      Signed-off-by: Eric Leblond <eric@regit.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  23. 27 Jan 2018, 1 commit
  24. 11 Jan 2018, 1 commit
    • samples/bpf: xdp2skb_meta shows transferring info from XDP to SKB · 36e04a2d
      Authored by Jesper Dangaard Brouer
      This creates a bpf sample that shows how to use the XDP 'data_meta'
      infrastructure, created by Daniel Borkmann.  Very few drivers support
      this feature, but I wanted a functional sample to begin with, when
      working on adding driver support.
      
      XDP data_meta is about creating a communication channel between BPF
      programs.  These can be XDP tail-progs, but also other SKB-based BPF
      hooks, like in this case the TC clsact hook. In this sample I show
      that XDP can store info named "mark", and TC/clsact chooses to use
      this info and store it into skb->mark.
      
      It is a bit annoying that the XDP and TC samples use different
      tools/libs when attaching their BPF hooks.  As the XDP and TC
      programs need to cooperate and agree on a struct layout, it is
      best/easiest if the two programs can be contained within the same
      BPF restricted-C file.

      As the bpf-loader, I chose not to use bpf_load.c (or libbpf), but
      instead wrote a bash shell script named xdp2skb_meta.sh, which
      demonstrates how to use the iproute cmdline tools 'tc' and 'ip' for
      loading BPF programs.  To make it easy for first-time users, the
      shell script has command line parsing and supports --verbose and
      --dry-run modes, if you just want to see/learn the tc+ip command
      syntax:
      
       # ./xdp2skb_meta.sh --dev ixgbe2 --dry-run
       # Dry-run mode: enable VERBOSE and don't call TC+IP
       tc qdisc del dev ixgbe2 clsact
       tc qdisc add dev ixgbe2 clsact
       tc filter add dev ixgbe2 ingress prio 1 handle 1 bpf da obj ./xdp2skb_meta_kern.o sec tc_mark
       # Flush XDP on device: ixgbe2
       ip link set dev ixgbe2 xdp off
       ip link set dev ixgbe2 xdp obj ./xdp2skb_meta_kern.o sec xdp_mark
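      The kernel-side idea, as a hedged sketch with both programs in one
      restricted-C file (the struct layout, the mark value, and the
      bounds checks are illustrative; see xdp2skb_meta_kern.c from the
      patch for the real program):

        /* Hedged sketch of the data_meta handshake: the XDP program
         * stores a mark in the metadata area, the TC clsact program
         * copies it into skb->mark.
         */
        #include <uapi/linux/bpf.h>
        #include <uapi/linux/pkt_cls.h>
        #include "bpf_helpers.h"

        struct meta_info {
        	__u32 mark;
        };

        SEC("xdp_mark")
        int _xdp_mark(struct xdp_md *ctx)
        {
        	struct meta_info *meta;
        	void *data;

        	/* Grow the metadata area in front of the packet */
        	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
        		return XDP_ABORTED;

        	data = (void *)(long)ctx->data;
        	meta = (void *)(long)ctx->data_meta;
        	if ((void *)(meta + 1) > data)	/* bounds check for verifier */
        		return XDP_ABORTED;

        	meta->mark = 42;	/* illustrative value */
        	return XDP_PASS;
        }

        SEC("tc_mark")
        int _tc_mark(struct __sk_buff *skb)
        {
        	void *data = (void *)(long)skb->data;
        	struct meta_info *meta = (void *)(long)skb->data_meta;

        	if ((void *)(meta + 1) > data)	/* no metadata for this skb */
        		return TC_ACT_OK;

        	skb->mark = meta->mark;
        	return TC_ACT_OK;
        }

        char _license[] SEC("license") = "GPL";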
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  25. 06 Jan 2018, 1 commit
    • samples/bpf: program demonstrating access to xdp_rxq_info · 0fca931a
      Authored by Jesper Dangaard Brouer
      This sample program can be used for monitoring and reporting how many
      packets per sec (pps) are received per NIC RX queue index and which
      CPU processed the packet. In itself it is a useful tool for quickly
      identifying RSS imbalance issues, see below.
      
      The default XDP action is XDP_PASS, in order to provide a monitor
      mode. For benchmarking purposes it is possible to specify other XDP
      actions on the cmdline via --action.
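      For illustration, a hedged sketch of the core counting logic (map
      shape and section name are assumptions; the real sample also tracks
      which CPU processed the packet and the configured action):

        /* Hedged sketch: count packets per NIC RX queue index using the
         * new rx_queue_index field of struct xdp_md.
         */
        #include <uapi/linux/bpf.h>
        #include "bpf_helpers.h"

        #define MAX_RXQ	64

        struct bpf_map_def SEC("maps") rxq_cnt = {
        	.type        = BPF_MAP_TYPE_PERCPU_ARRAY,
        	.key_size    = sizeof(__u32),
        	.value_size  = sizeof(__u64),
        	.max_entries = MAX_RXQ,
        };

        SEC("xdp_rxq_info")
        int xdp_prog(struct xdp_md *ctx)
        {
        	__u32 key = ctx->rx_queue_index;
        	__u64 *cnt;

        	if (key >= MAX_RXQ)
        		return XDP_PASS;

        	cnt = bpf_map_lookup_elem(&rxq_cnt, &key);
        	if (cnt)
        		(*cnt)++;	/* per-CPU map: no atomic needed */

        	/* Default to monitor mode; user space can pick another action */
        	return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";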
      
      The output below shows an imbalanced RSS case where most RXQs
      deliver to CPU-0 while CPU-2 only gets packets from a single RXQ.
      Looking at things from a CPU level, the two CPUs are processing
      approx. the same amount, BUT looking at the rx_queue_index level it
      is clear that RXQ-2 receives much better service than the other
      RXQs, which all share CPU-0.
      
      Running XDP on dev:i40e1 (ifindex:3) action:XDP_PASS
      XDP stats       CPU     pps         issue-pps
      XDP-RX CPU      0       900,473     0
      XDP-RX CPU      2       906,921     0
      XDP-RX CPU      total   1,807,395
      
      RXQ stats       RXQ:CPU pps         issue-pps
      rx_queue_index    0:0   180,098     0
      rx_queue_index    0:sum 180,098
      rx_queue_index    1:0   180,098     0
      rx_queue_index    1:sum 180,098
      rx_queue_index    2:2   906,921     0
      rx_queue_index    2:sum 906,921
      rx_queue_index    3:0   180,098     0
      rx_queue_index    3:sum 180,098
      rx_queue_index    4:0   180,082     0
      rx_queue_index    4:sum 180,082
      rx_queue_index    5:0   180,093     0
      rx_queue_index    5:sum 180,093
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  26. 13 Dec 2017, 1 commit
  27. 18 Nov 2017, 1 commit
  28. 11 Nov 2017, 2 commits
  29. 08 Nov 2017, 1 commit