1. 21 Nov 2022, 1 commit
    • bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs · 3f00c523
      Committed by David Vernet
      Kfuncs currently support specifying the KF_TRUSTED_ARGS flag to signal
      to the verifier that it should enforce that a BPF program passes it a
      "safe", trusted pointer. Currently, "safe" means that the pointer is
      either PTR_TO_CTX, or is refcounted. There may be cases, however, where
      the kernel passes a BPF program a safe / trusted pointer to an object
      that the BPF program wishes to use as a kptr, but because the object
      does not yet have a ref_obj_id from the perspective of the verifier, the
      program would be unable to pass it to a KF_ACQUIRE | KF_TRUSTED_ARGS
      kfunc.
      
      The solution is to expand the set of pointers that are considered
      trusted according to KF_TRUSTED_ARGS, so that programs can invoke kfuncs
      with these pointers without getting rejected by the verifier.
      
      There is already a PTR_UNTRUSTED flag that is set in some scenarios,
      such as when a BPF program reads a kptr directly from a map
      without performing a bpf_kptr_xchg() call. These pointers of course can
      and should be rejected by the verifier. Unfortunately, however,
      PTR_UNTRUSTED does not cover all the cases for safety that need to
      be addressed to adequately protect kfuncs. Specifically, pointers
      obtained by a BPF program "walking" a struct are _not_ considered
      PTR_UNTRUSTED according to BPF. For example, say that we were to add a
      kfunc called bpf_task_acquire(), with KF_ACQUIRE | KF_TRUSTED_ARGS, to
      acquire a struct task_struct *. If we only used PTR_UNTRUSTED to signal
      that a task was unsafe to pass to a kfunc, the verifier would mistakenly
      allow the following unsafe BPF program to be loaded:
      
      SEC("tp_btf/task_newtask")
      int BPF_PROG(unsafe_acquire_task,
                   struct task_struct *task,
                   u64 clone_flags)
      {
              struct task_struct *acquired, *nested;
      
              nested = task->last_wakee;
      
              /* Would not be rejected by the verifier. */
              acquired = bpf_task_acquire(nested);
              if (!acquired)
                      return 0;
      
              bpf_task_release(acquired);
              return 0;
      }
      
      To address this, this patch defines a new type flag called PTR_TRUSTED
      which tracks whether a PTR_TO_BTF_ID pointer is safe to pass to a
      KF_TRUSTED_ARGS kfunc or a BPF helper function. PTR_TRUSTED pointers are
      passed directly from the kernel as a tracepoint or struct_ops callback
      argument. Any nested pointer that is obtained from walking a PTR_TRUSTED
      pointer is no longer PTR_TRUSTED. From the example above, the struct
      task_struct *task argument is PTR_TRUSTED, but the 'nested' pointer
      obtained from 'task->last_wakee' is not PTR_TRUSTED.
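
      In contrast, passing the trusted tracepoint argument itself is accepted.
      A sketch of the safe counterpart, assuming the same hypothetical
      bpf_task_acquire() / bpf_task_release() kfuncs as above:

      SEC("tp_btf/task_newtask")
      int BPF_PROG(safe_acquire_task,
                   struct task_struct *task,
                   u64 clone_flags)
      {
              struct task_struct *acquired;

              /* 'task' comes straight from the tracepoint, so it is
               * PTR_TRUSTED and the KF_TRUSTED_ARGS check passes.
               */
              acquired = bpf_task_acquire(task);
              if (!acquired)
                      return 0;

              bpf_task_release(acquired);
              return 0;
      }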
      
      A subsequent patch will add kfuncs for storing a task kfunc as a kptr,
      and then another patch will add selftests to validate.
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221120051004.3605026-3-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3f00c523
  2. 18 Nov 2022, 6 commits
    • bpf: Introduce bpf_obj_new · 958cf2e2
      Committed by Kumar Kartikeya Dwivedi
      Introduce type safe memory allocator bpf_obj_new for BPF programs. The
      kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments
      to kfuncs still requires having them in prototype, unlike BPF helpers
      which always take 5 arguments and have them checked using bpf_func_proto
      in verifier, ignoring unset argument types.
      
      Introduce __ign suffix to ignore a specific kfunc argument during type
      checks, then use this to introduce support for passing type metadata to
      the bpf_obj_new_impl kfunc.
      
      The user passes the BTF ID of the type it wants to allocate in program BTF,
      the verifier then rewrites the first argument as the size of this type,
      after performing some sanity checks (to ensure it exists and it is a
      struct type).
      
      The second argument is also fixed up and passed by the verifier. This is
      the btf_struct_meta for the type being allocated. It would be needed
      mostly for the offset array which is required for zero initializing
      special fields while leaving the rest of the storage in an uninitialized state.
      
      It would also be needed in the next patch to perform proper destruction
      of the object's special fields.
      
      Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free,
      using the any context BPF memory allocator introduced recently. To this
      end, a global instance of the BPF memory allocator is initialized on
      boot to be used for this purpose. This 'bpf_global_ma' serves all
      allocations for bpf_obj_new. In the future, bpf_obj_new variants will
      allow specifying a custom allocator.
      
      Note that now that bpf_obj_new can be used to allocate objects that can
      be linked to BPF linked list (when future linked list helpers are
      available), we need to also free the elements using bpf_mem_free.
      However, since the draining of elements is done outside the
      bpf_spin_lock, we need to do migrate_disable around the call since
      bpf_list_head_free can be called from map free path where migration is
      enabled. Otherwise, when called from BPF programs migration is already
      disabled.
      
      A convenience macro is included in the bpf_experimental.h header to hide
      over the ugly details of the implementation, leading to user code
      looking similar to a language level extension which allocates and
      constructs fields of a user type.
      
      struct bar {
      	struct bpf_list_node node;
      };
      
      struct foo {
      	struct bpf_spin_lock lock;
      	struct bpf_list_head head __contains(bar, node);
      };
      
      void prog(void) {
      	struct foo *f;
      
      	f = bpf_obj_new(typeof(*f));
      	if (!f)
      		return;
      	...
      }
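
      The convenience macro itself can be a thin wrapper that resolves the
      program-local BTF ID of the type and lets the verifier fix up the size
      and btf_struct_meta arguments; a minimal sketch (the canonical
      definition lives in the selftests' bpf_experimental.h and may differ):

      #define bpf_obj_new(type) \
              ((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL))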
      
      A key piece of this story is still missing, i.e. the free function,
      which will come in the next patch.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      958cf2e2
    • bpf: Rewrite kfunc argument handling · 00b85860
      Committed by Kumar Kartikeya Dwivedi
      As we continue to add more features, argument types, kfunc flags, and
      different extensions to kfuncs, the code to verify the correctness of
      the kfunc prototype wrt the passed in registers has become ad-hoc and
      ugly to read. To make life easier, and make a very clear split between
      different stages of argument processing, move all the code into
      verifier.c and refactor into easier to read helpers and functions.
      
      This also makes sharing code within the verifier easier with kfunc
      argument processing. This will be more and more useful in later patches
      as we are now moving to implement very core BPF helpers as kfuncs, to
      keep them experimental before baking into UAPI.
      
      Remove all kfunc related bits now from btf_check_func_arg_match, as
      users have been converted away to refactored kfunc argument handling.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-12-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      00b85860
    • bpf: Verify ownership relationships for user BTF types · 865ce09a
      Committed by Kumar Kartikeya Dwivedi
      Ensure that there can be no ownership cycles among different types by
      way of having owning objects that can hold some other type as their
      element. For instance, a map value can only hold allocated objects, but
      these are allowed to have another bpf_list_head. To prevent unbounded
      recursion while freeing resources, elements of a bpf_list_head in local
      kptrs can never have a bpf_list_head that is part of a list in a map
      value. Later patches will add dedicated BTF selftests to verify this.
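
      As an illustrative sketch (type names are made up), a pair of types that
      own lists of each other would be rejected, since 'a' can itself be a
      list element while also owning a list whose element type owns a list:

      struct a {
      	struct bpf_list_node node;
      	struct bpf_list_head head __contains(b, node);	/* rejected: cycle */
      };

      struct b {
      	struct bpf_list_node node;
      	struct bpf_list_head head __contains(a, node);
      };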
      
      Also, to make runtime destruction easier, once btf_struct_metas is fully
      populated, we can stash the metadata of the value type directly in the
      metadata of the list_head fields, as that allows easier access to the
      value type's layout to destruct it at runtime from the btf_field entry
      of the list head itself.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      865ce09a
    • bpf: Recognize lock and list fields in allocated objects · 8ffa5cc1
      Committed by Kumar Kartikeya Dwivedi
      Allow specifying bpf_spin_lock, bpf_list_head, and bpf_list_node fields
      in an allocated object.
      
      Also update btf_struct_access to reject direct access to these special
      fields.
      
      A bpf_list_head allows implementing map-in-map style use cases, where an
      allocated object with bpf_list_head is linked into a list in a map
      value. This would require embedding a bpf_list_node, support for which
      is also included. The bpf_spin_lock is used to protect the bpf_list_head
      and other data.
      
      Strictly speaking, holding a bpf_spin_lock is not required while touching
      the bpf_list_head in such objects, since having access to the object
      implies complete ownership of it; the locking constraint is nevertheless
      kept and may be conditionally lifted in the future.
      
      Note that the specification of such types can be done just like map
      values, e.g.:
      
      struct bar {
      	struct bpf_list_node node;
      };
      
      struct foo {
      	struct bpf_spin_lock lock;
      	struct bpf_list_head head __contains(bar, node);
      	struct bpf_list_node node;
      };
      
      struct map_value {
      	struct bpf_spin_lock lock;
      	struct bpf_list_head head __contains(foo, node);
      };
      
      To recognize such types in user BTF, we build a btf_struct_metas array
      of metadata items corresponding to each BTF ID. This is done once during
      the btf_parse stage, so that it does not have to be repeated each time
      the verification process needs to inspect the metadata.
      
      Moreover, the computed metadata needs to be passed to some helpers in
      future patches which requires allocating them and storing them in the
      BTF that is pinned by the program itself, so that valid access can be
      assumed to such data during program runtime.
      
      A key thing to note is that once a btf_struct_meta is available for a
      type, both the btf_record and btf_field_offs should be available. It is
      critical that btf_field_offs is available in case special fields are
      present, as we extensively rely on special fields being zeroed out in
      map values and allocated objects in later patches. The code ensures that
      by bailing out in case of errors and ensuring both are available
      together. If the record is not available, the special fields won't be
      recognized, so not having both is also fine (in terms of being a
      verification error and not a runtime bug).
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      8ffa5cc1
    • bpf: Introduce allocated objects support · 282de143
      Committed by Kumar Kartikeya Dwivedi
      Introduce support for representing pointers to objects allocated by the
      BPF program, i.e. PTR_TO_BTF_ID that point to a type in program BTF.
      This is indicated by the presence of MEM_ALLOC type flag in reg->type to
      avoid having to check btf_is_kernel when trying to match argument types
      in helpers.
      
      Whenever walking such types, any pointers being walked will always yield
      a SCALAR instead of pointer. In the future we might permit kptr inside
      such allocated objects (either kernel or program allocated), and it will
      then form a PTR_TO_BTF_ID of the respective type.
      
      For now, such allocated objects will always be referenced in verifier
      context, hence ref_obj_id == 0 for them is a bug. It is allowed to write
      to such objects, as long as the special fields are not touched
      (support for which will be added in subsequent patches). Note that once
      such a pointer is marked PTR_UNTRUSTED, it is no longer allowed to write
      to it.
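
      For example, a program can initialize the plain fields of such an object
      freely. A sketch, assuming the bpf_obj_new()/bpf_obj_drop() kfuncs added
      elsewhere in this series and a made-up struct node_data:

      struct node_data {
      	long key;
      	struct bpf_list_node node;	/* special field */
      };

      SEC("tp_btf/task_newtask")
      int BPF_PROG(alloc_and_init, struct task_struct *task, u64 clone_flags)
      {
              struct node_data *n;

              n = bpf_obj_new(typeof(*n));	/* MEM_ALLOC PTR_TO_BTF_ID */
              if (!n)
                      return 0;
              n->key = 42;	/* writing a plain field is allowed */
              /* writing n->node directly would be rejected */
              bpf_obj_drop(n);
              return 0;
      }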
      
      No PROBE_MEM handling is therefore done for loads into this type unless
      PTR_UNTRUSTED is part of the register type, since they can never be in
      an undefined state, and their lifetime will always be valid.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-6-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      282de143
    • bpf: Pass map file to .map_update_batch directly · 3af43ba4
      Committed by Hou Tao
      Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get
      the target map file, then it invokes generic_map_update_batch() to do
      batch update. generic_map_update_batch() will get the target map file
      by using fdget(batch.map_fd) again and pass it to bpf_map_update_value().
      
      The problem is that the map file returned by the second fdget() may be
      NULL or a totally different file compared with the map file in
      bpf_map_do_batch(). The reason is that the first fdget() only guarantees
      the liveness of the struct file rather than of the file descriptor, and
      the file descriptor may be released by a concurrent close() through
      pick_file().

      It doesn't cause any problem for now, because maps with batch update
      support don't use the map file in their .map_fd_get_ptr() ops. But it is
      better to fix the potential access of an invalid map file.

      Using __bpf_map_get() again in generic_map_update_batch() can not fix
      the problem, because batch.map_fd may be closed and reopened, and the
      returned map file may be different from the map file obtained in
      bpf_map_do_batch(), so just pass the map file directly to
      .map_update_batch() in bpf_map_do_batch().
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20221116075059.1551277-1-houtao@huaweicloud.com
      3af43ba4
  3. 16 Nov 2022, 1 commit
  4. 15 Nov 2022, 6 commits
    • bpf: Refactor btf_struct_access · 6728aea7
      Committed by Kumar Kartikeya Dwivedi
      Instead of having to pass multiple arguments that describe the register,
      pass the bpf_reg_state into the btf_struct_access callback. Currently,
      all call sites simply reuse the btf and btf_id of the reg they want to
      check the access of. The only exception to this pattern is the callsite
      in check_ptr_to_map_access, hence for that case create a dummy reg to
      simulate PTR_TO_BTF_ID access.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221114191547.1694267-8-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6728aea7
    • bpf: Rename MEM_ALLOC to MEM_RINGBUF · 894f2a8b
      Committed by Kumar Kartikeya Dwivedi
      Currently, the verifier uses the MEM_ALLOC type tag to specially tag memory
      returned from bpf_ringbuf_reserve helper. However, this is currently
      only used for this purpose and there is an implicit assumption that it
      only refers to ringbuf memory (e.g. the check for ARG_PTR_TO_ALLOC_MEM
      in check_func_arg_reg_off).
      
      Hence, rename MEM_ALLOC to MEM_RINGBUF to indicate this special
      relationship and instead open the use of MEM_ALLOC for more generic
      allocations made for user types.
      
      Also, since ARG_PTR_TO_ALLOC_MEM_OR_NULL is unused, simply drop it.
      
      Finally, update selftests using 'alloc_' verifier string to 'ringbuf_'.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221114191547.1694267-7-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      894f2a8b
    • bpf: Rename RET_PTR_TO_ALLOC_MEM · 2de2669b
      Committed by Kumar Kartikeya Dwivedi
      Currently, the verifier has two return types, RET_PTR_TO_ALLOC_MEM, and
      RET_PTR_TO_ALLOC_MEM_OR_NULL, however the former is confusingly named to
      imply that it carries MEM_ALLOC, while only the latter does. This causes
      confusion during code review leading to conclusions like that the return
      value of RET_PTR_TO_DYNPTR_MEM_OR_NULL (which is RET_PTR_TO_ALLOC_MEM |
      PTR_MAYBE_NULL) may be consumable by bpf_ringbuf_{submit,commit}.
      
      Rename it to make it clear MEM_ALLOC needs to be tacked on top of
      RET_PTR_TO_MEM.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221114191547.1694267-6-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2de2669b
    • bpf: Support bpf_list_head in map values · f0c5941f
      Committed by Kumar Kartikeya Dwivedi
      Add the support on the map side to parse, recognize, verify, and build
      metadata table for a new special field of the type struct bpf_list_head.
      To parameterize the bpf_list_head for a certain value type and the
      list_node member it will accept in that value type, we use BTF
      declaration tags.
      
      The definition of bpf_list_head in a map value will be done as follows:
      
      struct foo {
      	struct bpf_list_node node;
      	int data;
      };
      
      struct map_value {
      	struct bpf_list_head head __contains(foo, node);
      };
      
      Then, the bpf_list_head only allows adding to the list 'head' using the
      bpf_list_node 'node' for the type struct foo.
      
      The 'contains' annotation is a BTF declaration tag of the form
      "contains:name:node", where 'name' is used to look up the type in the
      map BTF, with its kind hardcoded to BTF_KIND_STRUCT during the lookup,
      and 'node' names the member in that type which has the type
      struct bpf_list_node and is actually used for linking into the linked
      list. For now, the kind of the contained type is hardcoded as struct.
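
      The __contains() annotation used above can be expressed as a BTF
      declaration tag roughly as follows (a sketch; the canonical macro ships
      with the BPF selftests):

      #define __contains(name, node) \
              __attribute__((btf_decl_tag("contains:" #name ":" #node)))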
      
      This allows building intrusive linked lists in BPF, using container_of
      to obtain pointer to entry, while being completely type safe from the
      perspective of the verifier. The verifier knows exactly the type of the
      nodes, and knows that list helpers return that type at some fixed offset
      where the bpf_list_node member used for this list exists. The verifier
      also uses this information to disallow adding types that are not
      accepted by a certain list.
      
      For now, no elements can be added to such lists. Support for that is
      coming in future patches, hence draining and freeing items is done with
      a TODO that will be resolved in a future patch.
      
      Note that the bpf_list_head_free function moves the list out to a local
      variable under the lock and releases it, doing the actual draining of
      the list items outside the lock. While this helps with not holding the
      lock for too long pessimizing other concurrent list operations, it is
      also necessary for deadlock prevention: unless every function called in
      the critical section would be notrace, a fentry/fexit program could
      attach and call bpf_map_update_elem again on the map, leading to the
      same lock being acquired if the key matches, and hence to a deadlock.
      While this requires some special effort on the part of the BPF programmer to
      trigger and is highly unlikely to occur in practice, it is always better
      if we can avoid such a condition.
      
      While notrace would prevent this, doing the draining outside the lock
      has advantages of its own, hence it is used to also fix the deadlock
      related problem.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f0c5941f
    • bpf: Fix copy_map_value, zero_map_value · e5feed0f
      Committed by Kumar Kartikeya Dwivedi
      The current offset needs to also skip over the already copied region in
      addition to the size of the next field. This case manifests where there
      are gaps between adjacent special fields.
      
      It was observed that for a map value with size 48, having fields at:
      off:  0, 16, 32
      size: 4, 16, 16
      
      The current code does:
      
      memcpy(dst + 0, src + 0, 0)
      memcpy(dst + 4, src + 4, 12)
      memcpy(dst + 20, src + 20, 12)
      memcpy(dst + 36, src + 36, 12)
      
      With the fix, it is done correctly as:
      
      memcpy(dst + 0, src + 0, 0)
      memcpy(dst + 4, src + 4, 12)
      memcpy(dst + 32, src + 32, 0)
      memcpy(dst + 48, src + 48, 0)
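
      In other words, after copying the gap that precedes a special field, the
      running offset must be advanced past both that gap and the special field
      itself. A simplified sketch of the corrected loop (field names follow
      struct btf_field_offs; details may differ):

      	u32 curr_off = 0;

      	for (i = 0; i < foffs->cnt; i++) {
      		u32 next_off = foffs->field_off[i];
      		u32 sz = next_off - curr_off;

      		memcpy(dst + curr_off, src + curr_off, sz);
      		/* skip the region just copied *and* the special field */
      		curr_off += foffs->field_sz[i] + sz;
      	}
      	/* copy whatever remains after the last special field */
      	memcpy(dst + curr_off, src + curr_off, size - curr_off);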
      
      Fixes: 4d7d7f69 ("bpf: Adapt copy_map_value for multiple offset case")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221114191547.1694267-4-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e5feed0f
    • bpf: Remove BPF_MAP_OFF_ARR_MAX · 2d577252
      Committed by Kumar Kartikeya Dwivedi
      In f71b2f64 ("bpf: Refactor map->off_arr handling"), map->off_arr
      was refactored to be btf_field_offs. The number of field offsets is
      equal to the maximum number of possible fields, limited by
      BTF_FIELDS_MAX. Hence, reuse BTF_FIELDS_MAX, as spin_lock and timer no
      longer need to be handled specially for offset sorting; also fix the
      comment and remove the incorrect WARN_ON, since rec->cnt can never
      exceed this value. The reason for keeping a separate constant was that
      it was always 2 more than the total number of kptrs, which is no
      longer the case.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221114191547.1694267-3-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2d577252
  5. 04 Nov 2022, 3 commits
    • bpf: Refactor map->off_arr handling · f71b2f64
      Committed by Kumar Kartikeya Dwivedi
      Refactor map->off_arr handling into generic functions that can work on
      their own without hardcoding map-specific code. The btf_field_offs
      structure is now returned from btf_parse_field_offs, which can be reused
      later for types in program BTF.
      
      All functions like copy_map_value, zero_map_value call generic
      underlying functions so that they can also be reused later for copying
      to values allocated in programs which encode specific fields.
      
      Later, some helper functions will also require access to this
      btf_field_offs structure to be able to skip over special fields at
      runtime.
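
      For reference, the structure produced by this refactor is roughly shaped
      as follows (a sketch; the exact definition in the kernel headers may
      differ):

      struct btf_field_offs {
      	u32 cnt;			/* number of special fields */
      	u32 field_off[BTF_FIELDS_MAX];	/* sorted offsets of the fields */
      	u8  field_sz[BTF_FIELDS_MAX];	/* size of each field */
      };
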
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-9-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f71b2f64
    • bpf: Consolidate spin_lock, timer management into btf_record · db559117
      Committed by Kumar Kartikeya Dwivedi
      Now that kptr_off_tab has been refactored into btf_record, and can hold
      more than one specific field type, accommodate bpf_spin_lock and
      bpf_timer as well.
      
      While they don't require any more metadata than offset, having all
      special fields in one place allows us to share the same code for
      allocated user defined types and handle both map values and these
      allocated objects in a similar fashion.
      
      As an optimization, we still keep spin_lock_off and timer_off offsets in
      the btf_record structure, just to avoid having to find the btf_field
      struct each time their offset is needed. This is mostly needed to
      manipulate such objects in a map value at runtime. It's ok to hardcode
      just one offset as more than one field is disallowed.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      db559117
    • bpf: Refactor kptr_off_tab into btf_record · aa3496ac
      Committed by Kumar Kartikeya Dwivedi
      To prepare the BPF verifier to handle special fields in both map values
      and program allocated types coming from program BTF, we need to refactor
      the kptr_off_tab handling code into something more generic and reusable
      across both cases to avoid code duplication.
      
      Later patches also require passing this data to helpers at runtime, so
      that they can work on user defined types, initialize them, destruct
      them, etc.
      
      The main observation is that both map values and such allocated types
      point to a type in program BTF, hence they can be handled similarly. We
      can prepare a field metadata table for both cases and store them in
      struct bpf_map or struct btf depending on the use case.
      
      Hence, refactor the code into generic btf_record and btf_field member
      structs. The btf_record represents the fields of a specific btf_type in
      user BTF. The cnt indicates the number of special fields we successfully
      recognized, and field_mask is a bitmask of fields that were found, to
      enable quick determination of availability of a certain field.
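
      A simplified sketch of the resulting layout (kptr-only at this point in
      the series; later patches add further field kinds and cached offsets):

      struct btf_field {
      	u32 offset;			/* offset of the field in the value */
      	enum btf_field_type type;	/* e.g. BPF_KPTR_UNREF, BPF_KPTR_REF */
      	struct btf_field_kptr kptr;	/* kptr-specific metadata */
      };

      struct btf_record {
      	u32 cnt;			/* number of recognized special fields */
      	u32 field_mask;			/* bitmask of the field types found */
      	struct btf_field fields[];
      };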
      
      Subsequently, refactor the rest of the code to work with these generic
      types, remove assumptions about kptr and kptr_off_tab, rename variables
      to more meaningful names, etc.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      aa3496ac
  6. 26 Oct 2022, 5 commits
    • bpf: Implement cgroup storage available to non-cgroup-attached bpf progs · c4bcfb38
      Committed by Yonghong Song
      Similar to sk/inode/task storage, implement cgroup local storage.
      
      There already exists a local storage implementation for cgroup-attached
      bpf programs.  See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
      bpf_get_local_storage(). But there are use cases where non-cgroup-attached
      bpf progs want to access cgroup local storage data. For example, a tc
      egress prog has access to the sk and the cgroup. It is possible to use
      sk local storage to emulate cgroup local storage by storing data in the
      socket, but this is wasteful as there could be lots of sockets belonging
      to a particular cgroup. Alternatively, a separate map can be created with
      the cgroup id as the key.
      But this will introduce additional overhead to manipulate the new map.
      A cgroup local storage, similar to existing sk/inode/task storage,
      should help for this use case.
      
      The life-cycle of the storage is managed together with the life-cycle of
      the cgroup struct, i.e. the storage is destroyed along with the owning
      cgroup by a call to bpf_cgrp_storage_free() when the cgroup itself
      is deleted.
      
      The userspace map operations can be done by using a cgroup fd as a key
      passed to the lookup, update and delete operations.
      
      Typically, the following code is used to get the current cgroup:
          struct task_struct *task = bpf_get_current_task_btf();
          ... task->cgroups->dfl_cgrp ...
      and in structure task_struct definition:
          struct task_struct {
              ....
              struct css_set __rcu            *cgroups;
              ....
          }
      With a sleepable program, accessing task->cgroups is not protected by
      rcu_read_lock. So the current implementation only supports non-sleepable
      programs; supporting sleepable programs will be the next step, together
      with adding rcu_read_lock protection for rcu-tagged structures.
      
      Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
      storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
      for cgroup storage available to non-cgroup-attached bpf programs. The old
      cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
      The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
      functionality. While old cgroup storage pre-allocates storage memory, the new
      mechanism can also pre-allocate with a user space bpf_map_update_elem() call
      to avoid potential run-time memory allocation failure.
      Therefore, the new cgroup storage can provide all the functionality of
      the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is
      aliased to BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate that the
      old cgroup storage can be deprecated, since the new one provides the
      same functionality.
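
      A minimal usage sketch from a non-sleepable program (map, program, and
      attach point names are illustrative):

      struct {
              __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
              __uint(map_flags, BPF_F_NO_PREALLOC);
              __type(key, int);
              __type(value, long);
      } cgrp_counter SEC(".maps");

      SEC("tp_btf/sys_enter")
      int BPF_PROG(count_per_cgroup, struct pt_regs *regs, long id)
      {
              struct task_struct *task = bpf_get_current_task_btf();
              long *cnt;

              cnt = bpf_cgrp_storage_get(&cgrp_counter, task->cgroups->dfl_cgrp,
                                         NULL, BPF_LOCAL_STORAGE_GET_F_CREATE);
              if (cnt)
                      __sync_fetch_and_add(cnt, 1);
              return 0;
      }
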
      Acked-by: David Vernet <void@manifault.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c4bcfb38
    • bpf: Add new bpf_task_storage_delete proto with no deadlock detection · 8a7dac37
      Committed by Martin KaFai Lau
      The bpf_lsm and bpf_iter programs do not recur in a way that would
      cause a deadlock.
      The situation is similar to the bpf_pid_task_storage_delete_elem()
      which is called from the syscall map_delete_elem.  It does not need
      deadlock detection.  Otherwise, it will cause unnecessary failure
      when calling the bpf_task_storage_delete() helper.
      
      This patch adds bpf_task_storage_delete proto that does not do deadlock
      detection.  It will be used by bpf_lsm and bpf_iter programs.
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20221025184524.3526117-8-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      8a7dac37
    • bpf: Add new bpf_task_storage_get proto with no deadlock detection · 4279adb0
      Committed by Martin KaFai Lau
      The bpf_lsm and bpf_iter programs do not recur in a way that would
      cause a deadlock.
      The situation is similar to the bpf_pid_task_storage_lookup_elem()
      which is called from the syscall map_lookup_elem.  It does not need
      deadlock detection.  Otherwise, it will cause unnecessary failure
      when calling the bpf_task_storage_get() helper.
      
      This patch adds bpf_task_storage_get proto that does not do deadlock
      detection.  It will be used by bpf_lsm and bpf_iter programs.
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20221025184524.3526117-6-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4279adb0
    • bpf: Append _recur naming to the bpf_task_storage helper proto · 0593dd34
      Committed by Martin KaFai Lau
      This patch adds the "_recur" naming to the bpf_task_storage_{get,delete}
      proto.  In a later patch, they will only be used by the tracing
      programs, which require deadlock detection because a tracing
      prog may use bpf_task_storage_{get,delete} recursively and cause a
      deadlock.
      
      A following patch will add a different helper proto for the non-tracing
      programs because they do not need the deadlock prevention.
      This patch does the rename to prepare for these future proto
      additions.
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20221025184524.3526117-3-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0593dd34
    • bpf: Remove prog->active check for bpf_lsm and bpf_iter · 271de525
      Committed by Martin KaFai Lau
      The commit 64696c40 ("bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline")
      removed the prog->active check for struct_ops progs.  The bpf_lsm
      and bpf_iter also use trampolines.  Like struct_ops, the bpf_lsm
      and bpf_iter have fixed hooks for the prog to attach.  The
      kernel does not call the same hook in a recursive way.
      This patch also removes the prog->active check for
      bpf_lsm and bpf_iter.
      
      A later patch has a test to reproduce the recursion issue
      for a sleepable bpf_lsm program.
      
      This patch appends the '_recur' naming to the existing
      enter and exit functions that track the prog->active counter.
      New __bpf_prog_{enter,exit}[_sleepable] functions are
      added to skip the prog->active tracking. The '_struct_ops'
      version is also removed.
      
      It also moves the decision on picking the enter and exit function to
      the new bpf_trampoline_{enter,exit}().  It returns the '_recur' ones
      for all tracing progs to use.  For bpf_lsm, bpf_iter,
      struct_ops (no prog->active tracking after 64696c40), and
      bpf_lsm_cgroup (no prog->active tracking after 69fd337a),
      it will return the functions that don't track the prog->active.
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20221025184524.3526117-2-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      271de525
  7. 21 Oct 2022, 1 commit
  8. 30 Sep 2022, 1 commit
    • bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline · 64696c40
      Committed by Martin KaFai Lau
      The struct_ops prog allows using bpf to implement the functions in
      a struct (e.g. one defined by a kernel module).  The current usage is
      to implement tcp congestion control (tcp_congestion_ops).  The kernel
      does not call the tcp-cc's ops (i.e. the bpf prog) in a recursive way.
      
      The struct_ops is sharing the tracing-trampoline's enter/exit
      function which tracks prog->active to avoid recursion.  It is
      needed for tracing progs.  However, it turns out the struct_ops
      bpf prog will hit this prog->active check and be unnecessarily
      skipped.  E.g. the '.ssthresh' may run in_task() and then be
      interrupted by a softirq that runs the same '.ssthresh'.
      Skipping that '.ssthresh' run will end up returning a random
      value to the caller.
      
      The patch adds __bpf_prog_{enter,exit}_struct_ops for the
      struct_ops trampoline.  They do not track the prog->active
      to detect recursion.
      
      One exception is when the tcp_congestion's '.init' ops is doing
      bpf_setsockopt(TCP_CONGESTION) and then recurs to the same
      '.init' ops.  This will be addressed in the following patches.
      
      Fixes: ca06f55b ("bpf: Add per-program recursion prevention mechanism")
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20220929070407.965581-2-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      64696c40
  9. 29 Sep 2022, 1 commit
  10. 27 Sep 2022, 2 commits
  11. 22 Sep 2022, 4 commits
    • bpf: Prevent bpf program recursion for raw tracepoint probes · 05b24ff9
      Committed by Jiri Olsa
      We got a report from syzbot [1] about warnings that were caused by a
      bpf program attached to the contention_begin raw tracepoint triggering
      the same tracepoint by using the bpf_trace_printk helper, which takes
      the trace_printk_lock lock.
      
       Call Trace:
        <TASK>
        ? trace_event_raw_event_bpf_trace_printk+0x5f/0x90
        bpf_trace_printk+0x2b/0xe0
        bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
        bpf_trace_run2+0x26/0x90
        native_queued_spin_lock_slowpath+0x1c6/0x2b0
        _raw_spin_lock_irqsave+0x44/0x50
        bpf_trace_printk+0x3f/0xe0
        bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
        bpf_trace_run2+0x26/0x90
        native_queued_spin_lock_slowpath+0x1c6/0x2b0
        _raw_spin_lock_irqsave+0x44/0x50
        bpf_trace_printk+0x3f/0xe0
        bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
        bpf_trace_run2+0x26/0x90
        native_queued_spin_lock_slowpath+0x1c6/0x2b0
        _raw_spin_lock_irqsave+0x44/0x50
        bpf_trace_printk+0x3f/0xe0
        bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
        bpf_trace_run2+0x26/0x90
        native_queued_spin_lock_slowpath+0x1c6/0x2b0
        _raw_spin_lock_irqsave+0x44/0x50
        __unfreeze_partials+0x5b/0x160
        ...
      
      This can be reproduced by attaching a bpf program as a raw tracepoint
      on the contention_begin tracepoint. The bpf prog calls the
      bpf_trace_printk helper. Then, by running perf bench, the spin lock
      code is forced to take the slow path and call the contention_begin
      tracepoint.
      
      Fix this by skipping execution of the bpf program if it is already
      running, using the bpf prog 'active' field, which is currently used
      by trampoline programs for the same reason.
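
      Conceptually, the raw tracepoint run path gains a per-CPU recursion
      guard along these lines (a sketch of the idea, not the exact diff):

      	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
      		/* already running on this CPU: count the miss and bail out */
      		bpf_prog_inc_misses_counter(prog);
      		goto out;
      	}
      	rcu_read_lock();
      	(void) bpf_prog_run(prog, args);
      	rcu_read_unlock();
      out:
      	this_cpu_dec(*(prog->active));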
      
      Move bpf_prog_inc_misses_counter to syscall.c, because trampoline.c
      is compiled in only for the CONFIG_BPF_JIT option.
      Reviewed-by: Stanislav Fomichev <sdf@google.com>
      Reported-by: syzbot+2251879aa068ad9c960d@syzkaller.appspotmail.com
      [1] https://lore.kernel.org/bpf/YxhFe3EwqchC%2FfYf@krava/T/#t
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20220916071914.7156-1-jolsa@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      05b24ff9
    • bpf: Add bpf_lookup_*_key() and bpf_key_put() kfuncs · f3cf4134
      Committed by Roberto Sassu
      Add the bpf_lookup_user_key(), bpf_lookup_system_key() and bpf_key_put()
      kfuncs, to respectively search a key with a given key handle serial number
      and flags, obtain a key from a pre-determined ID defined in
      include/linux/verification.h, and clean up.
      
      Introduce system_keyring_id_check() to validate the keyring ID parameter of
      bpf_lookup_system_key().
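
      A rough usage sketch from a sleepable LSM program (the serial number and
      hook are illustrative):

      SEC("lsm.s/bpf")
      int BPF_PROG(key_example, int cmd, union bpf_attr *attr, unsigned int size)
      {
              struct bpf_key *bkey;

              /* look up a user key by serial number, with no lookup flags */
              bkey = bpf_lookup_user_key(0x12345678, 0);
              if (!bkey)
                      return 0;
              /* ... use the key ... */
              bpf_key_put(bkey);
              return 0;
      }
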
      Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20220920075951.929132-8-roberto.sassu@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f3cf4134
    • bpf: Export bpf_dynptr_get_size() · 51df4865
      Committed by Roberto Sassu
      Export bpf_dynptr_get_size(), so that kernel code dealing with eBPF dynamic
      pointers can obtain the real size of data carried by this data structure.
      Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
      Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
      Acked-by: KP Singh <kpsingh@kernel.org>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220920075951.929132-6-roberto.sassu@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      51df4865
    • bpf: Add bpf_user_ringbuf_drain() helper · 20571567
      Committed by David Vernet
      In a prior change, we added a new BPF_MAP_TYPE_USER_RINGBUF map type which
      will allow user-space applications to publish messages to a ring buffer
      that is consumed by a BPF program in kernel-space. In order for this
      map-type to be useful, it will require a BPF helper function that BPF
      programs can invoke to drain samples from the ring buffer, and invoke
      callbacks on those samples. This change adds that capability via a new BPF
      helper function:
      
      bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx,
                             u64 flags)
      
      BPF programs may invoke this function to run callback_fn() on a series of
      samples in the ring buffer. callback_fn() has the following signature:
      
      long callback_fn(struct bpf_dynptr *dynptr, void *context);
      
      Samples are provided to the callback in the form of struct bpf_dynptr *'s,
      which the program can read using BPF helper functions for querying
      struct bpf_dynptr's.
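
      A rough sketch of a program using the helper (map, struct, and attach
      point are illustrative):

      struct {
              __uint(type, BPF_MAP_TYPE_USER_RINGBUF);
              __uint(max_entries, 256 * 1024);
      } user_rb SEC(".maps");

      struct user_msg { long value; };

      static long handle_msg(struct bpf_dynptr *dynptr, void *ctx)
      {
              struct user_msg msg;

              if (bpf_dynptr_read(&msg, sizeof(msg), dynptr, 0, 0))
                      return 0;	/* skip short samples, keep draining */
              /* ... process msg ... */
              return 0;		/* a nonzero return stops the drain early */
      }

      SEC("tp_btf/sys_enter")
      int drain_user_samples(void *ctx)
      {
              bpf_user_ringbuf_drain(&user_rb, handle_msg, NULL, 0);
              return 0;
      }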
      
      In order to support bpf_user_ringbuf_drain(), a new PTR_TO_DYNPTR register
      type is added to the verifier to reflect a dynptr that was allocated by
      a helper function and passed to a BPF program. Unlike PTR_TO_STACK
      dynptrs which are allocated on the stack by a BPF program, PTR_TO_DYNPTR
      dynptrs need not use reference tracking, as the BPF helper is trusted to
      properly free the dynptr before returning. The verifier currently only
      supports PTR_TO_DYNPTR registers that are also DYNPTR_TYPE_LOCAL.
      
      Note that while the corresponding user-space libbpf logic will be added
      in a subsequent patch, this patch does contain an implementation of the
      .map_poll() callback for BPF_MAP_TYPE_USER_RINGBUF maps. This
      .map_poll() callback guarantees that an epoll-waiting user-space
      producer will receive at least one event notification whenever at least
      one sample is drained in an invocation of bpf_user_ringbuf_drain(),
      provided that the function is not invoked with the BPF_RB_NO_WAKEUP
      flag. If the BPF_RB_FORCE_WAKEUP flag is provided, a wakeup
      notification is sent even if no sample was drained.
      Signed-off-by: David Vernet <void@manifault.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220920000100.477320-3-void@manifault.com
      20571567
  12. 17 Sep 2022, 1 commit
  13. 11 Sep 2022, 1 commit
  14. 08 Sep 2022, 4 commits
  15. 07 Sep 2022, 1 commit
  16. 26 Aug 2022, 1 commit
    • bpf: Introduce cgroup iter · d4ccaf58
      Committed by Hao Luo
      Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
      
       - walking a cgroup's descendants in pre-order.
       - walking a cgroup's descendants in post-order.
       - walking a cgroup's ancestors.
       - processing only the given cgroup.
      
      When attaching cgroup_iter, one can set a cgroup for the iter_link
      created from the attachment. This cgroup is passed as a file descriptor
      or cgroup id and serves as the starting point of the walk. If no
      cgroup is specified, the starting point will be the cgroup v2 root.
      
      For walking descendants, one can specify the order: either pre-order or
      post-order. For walking ancestors, the walk starts at the specified
      cgroup and ends at the root.
      
      One can also terminate the walk early by returning 1 from the iter
      program.
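
      A rough sketch of such an iterator program (the output format is
      illustrative):

      SEC("iter/cgroup")
      int cgroup_id_dumper(struct bpf_iter__cgroup *ctx)
      {
              struct seq_file *seq = ctx->meta->seq;
              struct cgroup *cgrp = ctx->cgroup;

              /* the final call of a session passes a NULL cgroup */
              if (!cgrp)
                      return 0;

              BPF_SEQ_PRINTF(seq, "cgroup id: %llu\n", cgrp->kn->id);
              return 0;	/* returning 1 would terminate the walk early */
      }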
      
      Note that because walking the cgroup hierarchy holds cgroup_mutex, the iter
      program is called with cgroup_mutex held.
      
      Currently only one session is supported, which means that, depending on
      the volume of data the bpf program intends to send to user space, the
      number of cgroups that can be walked is limited. For example, given the
      current buffer size of 8 * PAGE_SIZE, if the program sends 64B of data
      for each cgroup, and assuming PAGE_SIZE is 4kb, the total number of
      cgroups that can be walked is 512. This is a limitation of cgroup_iter.
      If the output data is larger than the kernel buffer size, then after all
      data in the kernel buffer is consumed by user space, the subsequent
      read() syscall will signal EOPNOTSUPP. To work around this, the user may
      have to update their program to reduce the volume of data sent to the
      output, for example by skipping some uninteresting cgroups. In the
      future, we may extend bpf_iter flags to allow customizing the buffer
      size.
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d4ccaf58
  17. 24 Aug 2022, 1 commit