1. 23 Jul, 2021 6 commits
  2. 21 Jul, 2021 6 commits
  3. 20 Jul, 2021 2 commits
  4. 17 Jul, 2021 21 commits
    • A
      Merge branch 'libbpf: BTF typed dump cleanups' · 78e4a955
      Andrii Nakryiko committed
      Alan Maguire says:
      
      ====================
      
      Fix issues with libbpf BTF typed dump code.  Patch 1 addresses handling
      of unaligned data. Patch 2 fixes issues Andrii noticed when compiling
      on ppc64le.  Patch 3 simplifies typed dump by getting rid of allocation
      of dump data structure which tracks dump state etc.
      
      Changes since v1:
      
       - Andrii suggested using a function instead of a macro for checking
         alignment of data, and pointed out that we need to consider dump
         ptr size versus native pointer size (patch 1)
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      78e4a955
    • A
      libbpf: Btf typed dump does not need to allocate dump data · add192f8
      Alan Maguire committed
      By using the stack for this small structure, we avoid the need
      for freeing memory in error paths.
      Suggested-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/1626475617-25984-4-git-send-email-alan.maguire@oracle.com
      add192f8
    • A
      libbpf: Fix compilation errors on ppc64le for btf dump typed data · 04eb4dff
      Alan Maguire committed
      __s64 can be defined as either long or long long, depending on the
      architecture. On ppc64le it's defined as long, giving this error:
      
       In file included from btf_dump.c:22:
      btf_dump.c: In function 'btf_dump_type_data_check_overflow':
      libbpf_internal.h:111:22: error: format '%lld' expects argument of
      type 'long long int', but argument 3 has type '__s64' {aka 'long int'}
      [-Werror=format=]
        111 |  libbpf_print(level, "libbpf: " fmt, ##__VA_ARGS__); \
            |                      ^~~~~~~~~~
      libbpf_internal.h:114:27: note: in expansion of macro '__pr'
        114 | #define pr_warn(fmt, ...) __pr(LIBBPF_WARN, fmt, ##__VA_ARGS__)
            |                           ^~~~
      btf_dump.c:1992:3: note: in expansion of macro 'pr_warn'
       1992 |   pr_warn("unexpected size [%lld] for id [%u]\n",
            |   ^~~~~~~
      btf_dump.c:1992:32: note: format string is defined here
       1992 |   pr_warn("unexpected size [%lld] for id [%u]\n",
            |                             ~~~^
            |                                |
            |                                long long int
            |                             %ld
      
      Cast to size_t and use %zu instead.
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/1626475617-25984-3-git-send-email-alan.maguire@oracle.com
      04eb4dff
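The portability issue above can be reproduced outside libbpf. The following is a minimal sketch, not the actual btf_dump.c change: `s64_t` is a hypothetical stand-in for `__s64`, and `format_size_warning()` is a hypothetical helper that writes to a buffer instead of calling pr_warn().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for __s64. Like __s64, int64_t may be 'long' on
 * some 64-bit ABIs and 'long long' on others, so neither %ld nor %lld is
 * portable for it. */
typedef int64_t s64_t;

/* Format the warning without depending on how s64_t is defined: cast the
 * value to size_t and print it with %zu, which is valid on every ABI. */
static int format_size_warning(char *buf, size_t buf_sz, s64_t size,
			       unsigned int id)
{
	assert(buf != NULL && buf_sz > 0);
	return snprintf(buf, buf_sz, "unexpected size [%zu] for id [%u]\n",
			(size_t)size, id);
}
```

The cast assumes the value is a non-negative size; a negative value would print as a large unsigned number.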
    • A
      libbpf: Clarify/fix unaligned data issues for btf typed dump · 8d44c357
      Alan Maguire committed
      If data is packed, data structures can store it outside of usual
      boundaries.  For example a 4-byte int can be stored on an unaligned
      boundary in a case like this:
      
      struct s {
      	char f1;
      	int f2;
      } __attribute__((packed));
      
      ...the int is stored at an offset of one byte.  Some platforms have
      problems dereferencing data that is not aligned with its size, and
      code exists to handle most cases of this for BTF typed data display.
      However pointer display was missed, and a simple helper function,
      "ptr_is_aligned(data, data_sz)", to test alignment would help clarify this code.
      Suggested-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/1626475617-25984-2-git-send-email-alan.maguire@oracle.com
      8d44c357
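The alignment check described in this commit can be sketched as follows. This is an illustration under stated assumptions rather than the libbpf code: `ptr_is_aligned()` follows the signature quoted in the message, while `read_int()` is a hypothetical caller showing one way to fall back to memcpy() when the pointer is unaligned. The packed attribute is GCC/Clang syntax.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the example in the commit message: f2 sits at offset 1. */
struct s {
	char f1;
	int f2;
} __attribute__((packed));

/* True if 'data' lies on a boundary that is a multiple of data_sz. */
static bool ptr_is_aligned(const void *data, size_t data_sz)
{
	assert(data_sz != 0);
	return ((uintptr_t)data) % data_sz == 0;
}

/* Hypothetical caller: dereference directly only when the pointer is
 * aligned; otherwise copy the bytes into an aligned local first. */
static int read_int(const void *data)
{
	int v;

	if (ptr_is_aligned(data, sizeof(v)))
		return *(const int *)data;
	memcpy(&v, data, sizeof(v));
	return v;
}
```

Calling `read_int()` on the address of `f2` inside a `struct s` returns the stored value regardless of where the struct happens to be placed in memory.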
    • A
      Merge branch 'libbpf: BTF dumper support for typed data' · 068dfc65
      Andrii Nakryiko committed
      Alan Maguire says:
      
      ====================
      
      Add a libbpf dumper function that supports dumping a representation
      of data passed in using the BTF id associated with the data in a
      manner similar to the bpf_snprintf_btf helper.
      
      Default output format is identical to that dumped by bpf_snprintf_btf()
      (bar using tabs instead of spaces for indentation, but the indent string
      can be customized also); for example, a "struct sk_buff" representation
      would look like this:
      
      (struct sk_buff){
              (union){
                      (struct){
                              .next = (struct sk_buff *)0xffffffffffffffff,
                              .prev = (struct sk_buff *)0xffffffffffffffff,
                              (union){
                                      .dev = (struct net_device *)0xffffffffffffffff,
                                      .dev_scratch = (long unsigned int)18446744073709551615,
                              },
              },
      ...
      
      Patch 1 implements the dump functionality in a manner similar
      to that in kernel/bpf/btf.c, but with a view to fitting into
      libbpf more naturally.  For example, rather than using flags,
      boolean dump options are used to control output.  In addition,
      rather than combining checks for display (such as is this
      field zero?) and actual display - as is done for the kernel
      code - the code is organized to separate zero and overflow
      checks from type display.
      
      Patch 2 adds ASSERT_STRNEQ() for use in the following BTF dumper
      tests.
      
      Patch 3 consists of selftests that utilize a dump printf function
      to snprintf the dump output to a string for comparison with
      expected output.  Tests deliberately mirror those in
      snprintf_btf helper test to keep output consistent, but
      also cover overflow handling, var/section display.
      
      Changes since v5 [1]
       - readjust dump options to avoid unnecessary padding (Andrii, patch 1).
       - tidied up bitfield data checking/retrieval using Andrii's suggestions.
         Removed code where we adjust data pointer prior to calling bitfield
         functions as this adjustment is not needed, provided we use the type
         size as the number of bytes to iterate over when retrieving the
         full value, to which we then apply bit shifting operations to retrieve
         the bitfield value.  With these changes, the *_int_bits() functions were no longer needed
         (Andrii, patch 1).
       - coalesced the "is zero" checking for ints, floats and pointers
         into btf_dump_base_type_check_zero(), using a memcmp() of the
         size of the data.  This can be derived from t->size for ints
         and floats, and pointer size is retrieved from dump's ptr_sz
         field (Andrii, patch 1).
       - Added alignment-aware handling for int, enum, float retrieval.
         Packed data structures can force ints, enums and floats to be
         aligned on different boundaries; for example, the
      
      struct p {
              char f1;
              int f2;
      } __attribute__((packed));
      
         ...will have the int f2 field offset at byte 1, rather than at
         byte 4 for an unpacked structure.  The problem is directly
         dereferencing that as an int is problematic on some platforms.
         For ints and enums, we can reuse bitfield retrieval to get the
         value for display, while for floats we use a local union of the
         floating-point types and memcpy into it, ensuring we can then
         dereference pointers into that union which will have safe alignment
         (Andrii, patch 1).
       - added comments to explain why we increment depth prior to displaying
         opening parens, and decrement it prior to displaying closing parens
         for structs, unions and arrays.  The reason is that we don't want
         to have a trailing newline when displaying a type.  The logic that
         handles this says "don't show a newline when the depth we're at is 0".
         For this to work for opening parens then we need to bump depth before
         showing opening parens + newline, and when we close out structure
         we need to show closing parens after reducing depth so that we don't
         append a newline to a top-level structure. So as a result we have
      
      struct foo {\n
       struct bar {\n
       }\n
      }
      
       - silently truncate provided indent string with strncat() if > 31 bytes
         (Andrii, patch 1).
       - fixed ASSERT_STRNEQ() macro to show only n bytes of string
         (Andrii, patch 2).
       - fixed strncat() of type data string to avoid stack corruption
         (Andrii, patch 3).
       - removed early returns from dump type tests (Andrii, patch 3).
       - have tests explicitly specify prefix (enum, struct, union)
         (Andrii, patch 3).
       - switch from CHECK() to ASSERT_* where possible (Andrii, patch 3).
      
      Changes since v4 [2]
      - Andrii kindly provided code to unify emitting a prepended cast
        (for example "(int)") with existing code, and this had the nice
        benefit of adding array indices in type specifications (Andrii,
        patches 1, 3)
      - Fixed indent_str option to make it a const char *, stored in a
        fixed-length buffer internally (Andrii, patch 1)
      - Reworked bit shift logic to minimize endian-specific interactions,
        and use same macros as found elsewhere in libbpf to determine endianness
        (Andrii, patch 1)
      - Fixed type emitting to ensure that a trailing '\n' is not displayed;
        newlines are added during struct/array display, but for a single type
        the last character is no longer a newline (Andrii, patches 1, 3)
      - Added support for ASSERT_STRNEQ() macro (Andrii, patch 2)
      - Split tests into subtests for int, char, enum etc rather than one
        "dump type data" subtest (Andrii, patch 3)
      - Made better use of ASSERT* macros (Andrii, patch 3)
      - Got rid of some other TEST_* macros that were unneeded (Andrii, patch 3)
      - Switched to using "struct fs_context" to verify enum bitfield values
        (Andrii, patch 3)
      
      Changes since v3 [3]
      - Retained separation of emitting of type name cast prefixing
        type values from existing functionality such as btf_dump_emit_type_chain()
        since initial code-shared version had so many exceptions it became
        hard to read.  For example, we don't emit a type name if the type
        to be displayed is an array member, we also always emit "forward"
        definitions for structs/unions that aren't really forward definitions
        (we just want a "struct foo" output for "(struct foo){.bar = ...").
        We also always ignore modifiers const/volatile/restrict as they
        clutter output when emitting large types.
      - Added configurable 4-char indent string option; defaults to tab
        (Andrii)
      - Added support for BTF_KIND_FLOAT and associated tests (Andrii)
      - Added support for BTF_KIND_FUNC_PROTO function pointers to
        improve output of "ops" structures; for example:
      
      (struct file_operations){
              .owner = (struct module *)0xffffffffffffffff,
              .llseek = (loff_t(*)(struct file *, loff_t, int))0xffffffffffffffff,
              ...
        Added associated test also (Andrii)
      - Added handling for enum bitfields and associated test (Andrii)
      - Allocation of "struct btf_dump_data" done on-demand (Andrii)
      - Removed ".field = " output from function emitting type name and
        into caller (Andrii)
      - Removed BTF_INT_OFFSET() support (Andrii)
      - Use libbpf_err() to set errno for error cases (Andrii)
      - btf_dump_dump_type_data() returns size written, which is used
        when returning successfully from btf_dump__dump_type_data()
        (Andrii)
      
      Changes since v2 [4]
      - Renamed function to btf_dump__dump_type_data, reorganized
        arguments such that opts are last (Andrii)
      - Modified code to separate questions about display such
        as have we overflowed?/is this field zero? from actual
        display of typed data, such that we ask those questions
        separately from the code that actually displays typed data
        (Andrii)
      - Reworked code to handle overflow - where we do not provide
        enough data for the type we wish to display - by returning
        -E2BIG and attempting to present as much data as possible.
        Such a mode of operation allows for tracers which retrieve
        partial data (such as first 1024 bytes of a
        "struct task_struct" say), and want to display that partial
        data, while also knowing that it is not the full type.
       Such tracers can then denote this (perhaps via "..." or
        similar).
      - Explored reusing existing type emit functions, such as
        passing in a type id stack with a single type id to
        btf_dump_emit_type_chain() to support the display of
        typed data where a "cast" is prepended to the data to
        denote its type; "(int)1", "(struct foo){", etc.
        However the task of emitting a
        ".field_name = (typecast)" did not match well with model
        of walking the stack to display innermost types first
        and made the resultant code harder to read.  Added a
        dedicated btf_dump_emit_type_name() function instead which
        is only ~70 lines (Andrii)
      - Various cleanups around bitfield macros, unneeded member
        iteration macros, avoiding compiler complaints when
        displaying int data by casting to long long, etc (Andrii)
      - Use DECLARE_LIBBPF_OPTS() in defining opts for tests (Andrii)
      - Added more type tests, overflow tests, var tests and
        section tests.
      
      Changes since RFC [5]
      - The initial approach explored was to share the kernel code
        with libbpf using #defines to paper over the different needs;
        however it makes more sense to try and fit in with libbpf
        code style for maintenance.  A comment in the code points at
        the implementation in kernel/bpf/btf.c and notes that any
        issues found in it should be fixed there or vice versa;
        mirroring the tests should help with this also
        (Andrii)
      
      [1] https://lore.kernel.org/bpf/1624092968-5598-1-git-send-email-alan.maguire@oracle.com/
      [2] https://lore.kernel.org/bpf/CAEf4BzYtbnphCkhz0epMKE4zWfvSOiMpu+-SXp9hadsrRApuZw@mail.gmail.com/T/
      [3] https://lore.kernel.org/bpf/1622131170-8260-1-git-send-email-alan.maguire@oracle.com/
      [4] https://lore.kernel.org/bpf/1610921764-7526-1-git-send-email-alan.maguire@oracle.com/
      [5] https://lore.kernel.org/bpf/1610386373-24162-1-git-send-email-alan.maguire@oracle.com/
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      068dfc65
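One item from the v5 changelog above, the coalesced "is zero" check for ints, floats and pointers, can be sketched as a single memcmp() against a zeroed buffer. This is a minimal sketch, not the libbpf function: `base_type_is_zero()` and `MAX_BASE_SZ` are hypothetical names, and the 16-byte limit is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: no base type handled here is larger than 16 bytes. */
#define MAX_BASE_SZ 16

/* One check for ints, floats and pointers: compare data_sz bytes of the
 * value against zeroes. memcmp() has no alignment requirement, so this
 * also works for fields inside packed structures. */
static bool base_type_is_zero(const void *data, size_t data_sz)
{
	static const unsigned char zeroes[MAX_BASE_SZ]; /* zero-initialized */

	assert(data != NULL && data_sz <= MAX_BASE_SZ);
	return memcmp(data, zeroes, data_sz) == 0;
}
```

A byte-wise comparison treats a floating-point negative zero as non-zero, since its sign bit is set; whether that matters depends on the caller.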
    • A
      selftests/bpf: Add dump type data tests to btf dump tests · 70a9241f
      Alan Maguire committed
      Test various type data dumping operations by comparing expected
      format with the dumped string; an snprintf-style printf function
      is used to record the string dumped.  Also verify overflow handling
      where the data passed does not cover the full size of a type,
      such as would occur if a tracer has a portion of the 8k
      "struct task_struct".
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/1626362126-27775-4-git-send-email-alan.maguire@oracle.com
      70a9241f
    • A
    • A
      libbpf: BTF dumper support for typed data · 920d16af
      Alan Maguire committed
      Add a BTF dumper for typed data, so that the user can dump a typed
      version of the data provided.
      
      The API is
      
      int btf_dump__dump_type_data(struct btf_dump *d, __u32 id,
                                   void *data, size_t data_sz,
                                   const struct btf_dump_type_data_opts *opts);
      
      ...where the id is the BTF id of the data pointed to by the "void *"
      argument; for example the BTF id of "struct sk_buff" for a
      "struct sk_buff *" data pointer.  Options supported are
      
       - a starting indent level (indent_lvl)
       - a user-specified indent string which will be printed once per
         indent level; if NULL, tab is chosen but any string <= 32 chars
         can be provided.
       - a set of boolean options to control dump display, similar to those
         used for BPF helper bpf_snprintf_btf().  Options are
              - compact : omit newlines and other indentation
              - skip_names: omit member names
              - emit_zeroes: show zero-value members
      
      Default output format is identical to that dumped by bpf_snprintf_btf(),
      for example a "struct sk_buff" representation would look like this:
      
      (struct sk_buff){
      	(union){
      		(struct){
      			.next = (struct sk_buff *)0xffffffffffffffff,
      			.prev = (struct sk_buff *)0xffffffffffffffff,
      		(union){
      			.dev = (struct net_device *)0xffffffffffffffff,
      			.dev_scratch = (long unsigned int)18446744073709551615,
      		},
      	},
      ...
      
      If the data structure is larger than the *data_sz*
      number of bytes that are available in *data*, as much
      of the data as possible will be dumped and -E2BIG will
      be returned.  This is useful as tracers will sometimes
      not be able to capture all of the data associated with
      a type; for example a "struct task_struct" is ~16k.
      Being able to specify that only a subset is available is
      important for such cases.  On success, the amount of data
      dumped is returned.
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/1626362126-27775-2-git-send-email-alan.maguire@oracle.com
      920d16af
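The partial-data behaviour described above (dump what is available, then return -E2BIG) can be illustrated with a toy dumper. This sketch does not use libbpf: `dump_int_array()` is a hypothetical function whose "type" is simply an array of `nr_elems` ints, and the comma-separated output format is invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy model of the contract: the full "type" is nr_elems ints, but only
 * data_sz bytes of it are available. Dump every complete element that
 * data_sz covers into 'out'. Returns the number of characters written
 * when the whole type was dumped, -E2BIG when the data was too short
 * (the partial dump is still left in 'out'), or -ENOSPC when 'out' is
 * too small. */
static int dump_int_array(char *out, size_t out_sz, const void *data,
			  size_t data_sz, size_t nr_elems)
{
	size_t avail = data_sz / sizeof(int);
	size_t n = avail < nr_elems ? avail : nr_elems;
	size_t i, len = 0;

	assert(out != NULL && out_sz > 0);
	out[0] = '\0';
	for (i = 0; i < n; i++) {
		int v, w;

		/* memcpy() so that unaligned input is fine too */
		memcpy(&v, (const char *)data + i * sizeof(int), sizeof(v));
		w = snprintf(out + len, out_sz - len, "%s%d", i ? "," : "", v);
		if (w < 0 || (size_t)w >= out_sz - len)
			return -ENOSPC;
		len += (size_t)w;
	}
	return n < nr_elems ? -E2BIG : (int)len;
}
```

A tracer that captured only the first two of four elements would get -E2BIG back together with the text for those two elements, and could append "..." to mark the truncation.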
    • A
      Merge branch 'Add btf_custom_path in bpf_obj_open_opts' · 334faa5c
      Andrii Nakryiko committed
      Shuyi Cheng says:
      
      ====================
      
      This patch set adds the ability to point to a custom BTF for the
      purposes of BPF CO-RE relocations. This is useful for using BPF CO-RE
      on old kernels that don't yet natively support kernel (vmlinux) BTF
      and thus libbpf needs application's help in locating kernel BTF
      generated separately from the kernel itself. This was already possible
      to do through bpf_object__load's attribute struct, but that makes it
      inconvenient to use with BPF skeleton, which only allows specifying
      bpf_object_open_opts during the open step. Thus, add the ability to
      override vmlinux BTF at open time.
      
      Patch #1 adds libbpf changes.
      Patch #2 fixes pre-existing memory leak detected during the code review.
      Patch #3 switches existing selftests to using open_opts for custom BTF.
      
      Changelog:
      ----------
      
      v3: https://lore.kernel.org/bpf/CAEf4BzY2cdT44bfbMus=gei27ViqGE1BtGo6XrErSsOCnqtVJg@mail.gmail.com/T/#m877eed1d4cf0a1d3352d3f3d6c5ff158be45c542
      v3->v4:
       - Follow Andrii's suggestion to modify cover letter description.
       - Delete function bpf_object__load_override_btf.
       - Follow Dan's suggestion to add fixes tag and modify commit msg to patch #2.
       - Add patch #3 to switch existing selftests to using open_opts.
      
      v2: https://lore.kernel.org/bpf/CAEf4Bza_ua+tjxdhyy4nZ8Boeo+scipWmr_1xM1pC6N5wyuhAA@mail.gmail.com/T/#mf9cf86ae0ffa96180ac29e4fd12697eb70eccd0f
      v2->v3:
        - Load the BTF specified by btf_custom_path to btf_vmlinux_override
          instead of btf_vmlinux.
        - Fix the memory leak that may be introduced by the second version
          of the patch.
        - Add a new patch to fix the possible memory leak caused by
          obj->kconfig.
      
      v1: https://lore.kernel.org/bpf/CAEf4BzaGjEC4t1OefDo11pj2-HfNy0BLhs_G2UREjRNTmb2u=A@mail.gmail.com/t/#m4d9f7c6761fbd2b436b5dfe491cd864b70225804
      v1->v2:
        - Change custom_btf_path to btf_custom_path.
        - If the length of btf_custom_path of bpf_obj_open_opts is too long,
          return ERR_PTR(-ENAMETOOLONG).
        - Add `custom BTF is in addition to vmlinux BTF` with btf_custom_path field.
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      334faa5c
    • S
      selftests/bpf: Switch existing selftests to using open_opts for custom BTF · f0b7d119
      Shuyi Cheng committed
      This patch mainly replaces the bpf_object_load_attr of
      the core_autosize.c and core_reloc.c files with bpf_object_open_opts.
      Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/1626180159-112996-4-git-send-email-chengshuyi@linux.alibaba.com
      f0b7d119
    • S
      libbpf: Fix the possible memory leak on error · 18353c87
      Shuyi Cheng committed
      If the strdup() fails then we need to call bpf_object__close(obj) to
      avoid a resource leak.
      
      Fixes: 166750bc ("libbpf: Support libbpf-provided extern variables")
      Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/1626180159-112996-3-git-send-email-chengshuyi@linux.alibaba.com
      18353c87
    • S
      libbpf: Introduce 'btf_custom_path' to 'bpf_obj_open_opts' · 1373ff59
      Shuyi Cheng committed
      btf_custom_path allows developers to load custom BTF which libbpf will
      subsequently use for CO-RE relocation instead of vmlinux BTF.
      
      Having btf_custom_path in bpf_object_open_opts one can directly use the
      skeleton's <objname>_bpf__open_opts() API to pass in the btf_custom_path
      parameter, as opposed to using bpf_object__load_xattr() which is slated to be
      deprecated ([0]).
      
      This work continues previous work started by another developer ([1]).
      
        [0] https://lore.kernel.org/bpf/CAEf4BzbJZLjNoiK8_VfeVg_Vrg=9iYFv+po-38SMe=UzwDKJ=Q@mail.gmail.com/#t
        [1] https://yhbt.net/lore/all/CAEf4Bzbgw49w2PtowsrzKQNcxD4fZRE6AKByX-5-dMo-+oWHHA@mail.gmail.com/
      
      Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/1626180159-112996-2-git-send-email-chengshuyi@linux.alibaba.com
      1373ff59
    • R
      bpf, doc: Add heading and example for extensions in cbpf · 88865347
      Roy, UjjaL committed
      Add new heading for extensions to make it more readable. Also, add one
      more example of filtering interface index for better understanding.
      Signed-off-by: Roy, UjjaL <royujjal@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/CAADnVQJ=DoRDcVkaXmY3EmNdLoO7gq1mkJOn5G=00wKH8qUtZQ@mail.gmail.com
      88865347
    • A
      bpf: Add ambient BPF runtime context stored in current · c7603cfa
      Andrii Nakryiko committed
      b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
      helper") fixed the problem with cgroup-local storage use in BPF by
      pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
      possible BPF program preemptions and nested executions.
      
      While this seems to work well in practice, it introduces a new and unnecessary
      failure mode in which not all BPF programs might be executed if we fail to
      find an unused slot for cgroup storage, however unlikely it is. It might also
      not be so unlikely when/if we allow sleepable cgroup BPF programs in the
      future.
      
      Further, the way that cgroup storage is implemented as ambiently-available
      property during entire BPF program execution is a convenient way to pass extra
      information to BPF program and helpers without requiring user code to pass
      around extra arguments explicitly. So it would be good to have a generic
      solution that can allow implementing this without arbitrary restrictions.
      Ideally, such a solution would work for both preemptable and sleepable BPF
      programs in exactly the same way.
      
      This patch introduces such a solution, bpf_run_ctx. It adds one pointer field
      (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
      macros in such a way that it always stays valid throughout BPF program
      execution. BPF program preemption is handled by remembering previous
      current->bpf_ctx value locally while executing nested BPF program and
      restoring old value after nested BPF program finishes. This is handled by two
      helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
      supposed to be used before and after BPF program runs, respectively.
      
      Restoring old value of the pointer handles preemption, while bpf_run_ctx
      pointer being a property of current task_struct naturally solves this problem
      for sleepable BPF programs by "following" BPF program execution as it is
      scheduled in and out of CPU. It would even allow CPU migration of BPF
      programs, even though it's not currently allowed by BPF infra.
      
      This patch cleans up cgroup local storage handling as a first application. The
      design itself is generic, though, with bpf_run_ctx being an empty struct that
      is supposed to be embedded into a specific struct for a given BPF program type
      (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
      this mechanism for other uses within tracing BPF programs.
      
      To verify that this change doesn't revert the fix to the original cgroup
      storage issue, I ran the same repro as in the original report ([0]) and didn't
      get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
      bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
      
        [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
      
      Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
      c7603cfa
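The save/restore discipline described in this commit can be modelled in user space. The sketch below is not kernel code: `current_ctx` stands in for current->bpf_ctx, the `struct run_ctx` placeholder has a dummy member because standard C does not allow empty structs, and `run_nested()` is a hypothetical driver that simulates one program preempting another.

```c
#include <assert.h>
#include <stddef.h>

/* Generic context; the kernel's bpf_run_ctx is described as empty. */
struct run_ctx { char unused; };

/* Program-type-specific context embedding the generic one. */
struct cg_run_ctx {
	struct run_ctx run_ctx;
	int storage;		/* stand-in for the cgroup storage pointer */
};

/* Stand-in for current->bpf_ctx (a plain global in this sketch). */
static struct run_ctx *current_ctx;

/* Install a new context and hand back the previous one. */
static struct run_ctx *set_run_ctx(struct run_ctx *new_ctx)
{
	struct run_ctx *old = current_ctx;

	current_ctx = new_ctx;
	return old;
}

/* Restore whatever context was active before the program ran. */
static void reset_run_ctx(struct run_ctx *old_ctx)
{
	current_ctx = old_ctx;
}

/* What a helper would do: recover the embedding struct from the pointer. */
static int current_storage(void)
{
	struct cg_run_ctx *ctx;

	assert(current_ctx != NULL);
	ctx = (struct cg_run_ctx *)((char *)current_ctx -
				    offsetof(struct cg_run_ctx, run_ctx));
	return ctx->storage;
}

/* Outer program is "preempted" by an inner one. Returns the storage the
 * outer program sees after the inner one finished. */
static int run_nested(int outer_storage, int inner_storage, int *seen_inner)
{
	struct cg_run_ctx outer = { .storage = outer_storage };
	struct cg_run_ctx inner = { .storage = inner_storage };
	struct run_ctx *old_outer, *old_inner;
	int seen_outer;

	old_outer = set_run_ctx(&outer.run_ctx);
	old_inner = set_run_ctx(&inner.run_ctx);	/* nested program starts */
	*seen_inner = current_storage();
	reset_run_ctx(old_inner);			/* nested program ends */
	seen_outer = current_storage();
	reset_run_ctx(old_outer);
	return seen_outer;
}
```

Resetting to NULL instead of the saved pointer after the nested run would leave the outer program without a context, which mirrors the failure the commit message describes when bpf_reset_run_ctx(NULL) is used.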
    • P
      netdevsim: Add multi-queue support · d4861fc6
      Peilin Ye committed
      Currently netdevsim only supports a single queue per port, which is
      insufficient for testing multi-queue TC schedulers e.g. sch_mq.  Extend
      the current sysfs interface so that users can create ports with multiple
      queues:
      
      $ echo "[ID] [PORT_COUNT] [NUM_QUEUES]" > /sys/bus/netdevsim/new_device
      
      As an example, echoing "2 4 8" creates 4 ports, with 8 queues per port.
      Note, this is compatible with the current interface, with default number
      of queues set to 1.  For example, echoing "2 4" creates 4 ports with 1
      queue per port; echoing "2" simply creates 1 port with 1 queue.
      Reviewed-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4861fc6
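The backward-compatible parsing described above (one, two or three numbers, with defaults of 1) can be sketched with sscanf(). This is a user-space illustration, not the driver code: `struct new_dev_args` and `parse_new_device()` are hypothetical names, and range checks on the values are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Parsed form of "[ID] [PORT_COUNT] [NUM_QUEUES]". */
struct new_dev_args {
	unsigned int id;
	unsigned int port_count;
	unsigned int num_queues;
};

/* Accepts "ID", "ID PORT_COUNT" or "ID PORT_COUNT NUM_QUEUES". Missing
 * fields default to 1, so the older one- and two-field forms keep
 * working. Returns 0 on success, -EINVAL if no ID could be parsed. */
static int parse_new_device(const char *buf, struct new_dev_args *args)
{
	int n;

	assert(buf != NULL && args != NULL);
	n = sscanf(buf, "%u %u %u", &args->id, &args->port_count,
		   &args->num_queues);
	switch (n) {
	case 1:
		args->port_count = 1;
		/* fall through */
	case 2:
		args->num_queues = 1;
		/* fall through */
	case 3:
		return 0;
	default:
		return -EINVAL;
	}
}
```

With this sketch, "2 4 8" yields 4 ports with 8 queues each, "2 4" yields 4 ports with 1 queue each, and "2" yields 1 port with 1 queue, matching the examples in the commit message.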
    • M
      openvswitch: Introduce per-cpu upcall dispatch · b83d23a2
      Mark Gray committed
      The Open vSwitch kernel module uses the upcall mechanism to send
      packets from kernel space to user space when it misses in the kernel
      space flow table. The upcall sends packets via a Netlink socket.
      Currently, a Netlink socket is created for every vport. In this way,
      there is a 1:1 mapping between a vport and a Netlink socket.
      When a packet is received by a vport, if it needs to be sent to
      user space, it is sent via the corresponding Netlink socket.
      
      This mechanism, with various iterations of the corresponding user
      space code, has seen some limitations and issues:
      
      * On systems with a large number of vports, there is a correspondingly
      large number of Netlink sockets which can limit scaling.
      (https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
      * Packet reordering on upcalls.
      (https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
      * A thundering herd issue.
      (https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
      
      This patch introduces an alternative, feature-negotiated, upcall
      mode using a per-cpu dispatch rather than a per-vport dispatch.
      
      In this mode, the Netlink socket to be used for the upcall is
      selected based on the CPU of the thread that is executing the upcall.
      In this way, it resolves the issues above as:
      
      a) The number of Netlink sockets scales with the number of CPUs
      rather than the number of vports.
      b) Ordering per-flow is maintained as packets are distributed to
      CPUs based on mechanisms such as RSS and flows are distributed
      to a single user space thread.
      c) Packets from a flow can only wake up one user space thread.
      
      The corresponding user space code can be found at:
      https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385139.html
      
      Bugzilla: https://bugzilla.redhat.com/1844576
      Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
      Acked-by: Flavio Leitner <fbl@sysclose.org>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b83d23a2
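The per-CPU selection described above amounts to indexing a table of Netlink port IDs by CPU number. The sketch below is a user-space model under stated assumptions, not the openvswitch code: `struct upcall_pids` and `upcall_portid_for_cpu()` are hypothetical names, and wrapping with modulo when fewer sockets than CPUs are registered is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Table of Netlink port IDs registered by user space, one per handler. */
struct upcall_pids {
	size_t n_pids;
	const uint32_t *pids;
};

/* Pick the port ID for the CPU that is executing the upcall.
 * Assumptions of this sketch: CPUs beyond the table size wrap around
 * with modulo, and 0 means "no socket registered". */
static uint32_t upcall_portid_for_cpu(const struct upcall_pids *table,
				      unsigned int cpu)
{
	assert(table != NULL);
	if (table->n_pids == 0)
		return 0;
	return table->pids[cpu % table->n_pids];
}
```

Because the socket depends only on the CPU, the table size follows the CPU count rather than the vport count, which is the scaling property listed under (a).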
    • B
      bnx2x: remove unused variable 'cur_data_offset' · 919d5279
      Bill Wendling committed
      Fix the clang build warning:
      
        drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:1862:13: error: variable 'cur_data_offset' set but not used [-Werror,-Wunused-but-set-variable]
              dma_addr_t cur_data_offset;
      Signed-off-by: Bill Wendling <morbo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      919d5279
    • C
      net: switchdev: Simplify 'mlxsw_sp_mc_write_mdb_entry()' · a99f030b
      Christophe JAILLET committed
      Use 'bitmap_alloc()/bitmap_free()' instead of hand-writing it.
      This makes the code less verbose.
      
      Also, use 'bitmap_alloc()' instead of 'bitmap_zalloc()' because the bitmap
      is fully overwritten by a 'bitmap_copy()' call just after its allocation.
      
      While at it, remove an extra and unneeded space.
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a99f030b
    • Y
      net/sched: Remove unnecessary if statement · f79a3bcb
      Yajun Deng committed
      The 'if (err > 0)' case is already dealt with in rtnetlink_send()
      and rtnl_unicast(), so remove the unnecessary if statement.
      
      v2: use the raw name rtnetlink_send().
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f79a3bcb
    • Y
      rtnetlink: use nlmsg_notify() in rtnetlink_send() · cfdf0d9a
      Yajun Deng committed
      netlink_{broadcast, unicast} don't deal with the 'if (err > 0)' case,
      but nlmsg_{multicast, unicast} do, and nlmsg_notify() wraps them.
      So use nlmsg_notify() instead, so that the caller no longer has to deal
      with the 'if (err > 0)' case itself.
      
      v2: using nlmsg_notify() is sufficient.
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfdf0d9a
    • H
      gve: fix the wrong AdminQ buffer overflow check · 63a9192b
      Haiyue Wang committed
      The 'tail' pointer is also a free-running count, so it needs to be masked
      as 'adminq_prod_cnt' is, to become an index into the AdminQ buffer.
      
      Fixes: 5cdad90d ("gve: Batch AQ commands for creating and destroying queues.")
      Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      63a9192b
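The masking issue can be shown with two free-running counters and a power-of-two ring. This is a minimal sketch of the idea, not the gve driver code: both function names are hypothetical, and the "full" condition used here (the next producer slot equals the consumer slot) is an assumption about how such a check is typically written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* prod_cnt and tail are free-running counters; mask is ring_size - 1 for
 * a power-of-two ring. Both counters must be masked before they are
 * compared as ring indices. */
static bool adminq_is_full(uint32_t prod_cnt, uint32_t tail, uint32_t mask)
{
	assert(((mask + 1) & mask) == 0);	/* mask + 1 is a power of two */
	return ((prod_cnt + 1) & mask) == (tail & mask);
}

/* The pattern being fixed: a masked index compared with a raw counter.
 * Once tail exceeds mask the comparison can never be true. */
static bool adminq_is_full_unmasked_tail(uint32_t prod_cnt, uint32_t tail,
					 uint32_t mask)
{
	return ((prod_cnt + 1) & mask) == tail;
}
```

With an 8-entry ring (mask 7), prod_cnt 16 and tail 9, the next producer slot is 1 and the consumer slot is also 1, so the ring is full; the unmasked variant compares 1 with 9 and misses it.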
  5. 16 Jul, 2021 5 commits