1. 21 May 2022, 1 commit
  2. 14 May 2022, 1 commit
  3. 12 May 2022, 2 commits
  4. 11 May 2022, 7 commits
  5. 27 April 2022, 3 commits
    • net: atm: remove support for ZeitNet ZN122x ATM devices · 052e1f01
      Committed by Jakub Kicinski
      This driver received nothing but automated fixes in the last 15 years.
      Since it's using virt_to_bus it's unlikely to be used on any modern
      platform.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      052e1f01
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Committed by Eric Dumazet
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows to move the cost of skb
      frees outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      the skb payload has been consumed, meaning that the BH handler has no
      chance to pick the skb before the recvmsg() thread. This issue is more
      visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if the BH handler picks the skbs, they are still picked
      on the cpu on which the user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) case where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu (a sketch of this scheme follows
      this entry).
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that in the future we can add a small per-cpu cache
      if we see any contention on sd->defer_lock.
      
      Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      the page recycling strategy used by the NIC driver (its page pool
      capacity being too small compared to the number of skbs/pages held
      in socket receive queues).
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workloads.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() show high cost for
      skb freeing related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      68822bdf
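      A minimal sketch of the deferral scheme described above. This is not
      the kernel's exact code: the structure, field, and function names
      (skb_defer_queue, defer_skb_free, drain_deferred_skbs) are
      illustrative, lock initialisation is omitted, and the IPI plumbing is
      reduced to a comment.

      	#include <linux/percpu.h>
      	#include <linux/skbuff.h>
      	#include <linux/smp.h>
      	#include <linux/spinlock.h>

      	struct skb_defer_queue {
      		spinlock_t lock;	/* "sd->defer_lock" in the text */
      		struct sk_buff *head;	/* singly-linked list of deferred skbs */
      		int count;
      	};

      	static DEFINE_PER_CPU(struct skb_defer_queue, skb_defer_queues);

      	/* recvmsg() path: queue the skb back on the cpu that allocated it. */
      	static void defer_skb_free(struct sk_buff *skb, int alloc_cpu)
      	{
      		struct skb_defer_queue *q = &per_cpu(skb_defer_queues, alloc_cpu);

      		if (alloc_cpu == smp_processor_id()) {	/* v2: no point deferring */
      			__kfree_skb(skb);
      			return;
      		}

      		spin_lock_bh(&q->lock);
      		skb->next = q->head;
      		WRITE_ONCE(q->head, skb);	/* lockless readers pair with READ_ONCE() */
      		q->count++;
      		spin_unlock_bh(&q->lock);
      		/* If alloc_cpu is slow to run net_rx_action(), an IPI raising
      		 * NET_RX_SOFTIRQ on that cpu would be sent here. */
      	}

      	/* End of net_rx_action() on the owning cpu: free everything queued. */
      	static void drain_deferred_skbs(struct skb_defer_queue *q)
      	{
      		struct sk_buff *skb, *next;

      		spin_lock_irq(&q->lock);
      		skb = q->head;
      		q->head = NULL;
      		q->count = 0;
      		spin_unlock_irq(&q->lock);

      		for (; skb; skb = next) {
      			next = skb->next;
      			__kfree_skb(skb);
      		}
      	}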
    • bpf: Compute map_btf_id during build time · c317ab71
      Committed by Menglong Dong
      For now, the field 'map_btf_id' in 'struct bpf_map_ops' for all map
      types is computed during vmlinux-btf init:
      
        btf_parse_vmlinux() -> btf_vmlinux_map_ids_init()
      
      It looks up the btf_type according to the 'map_btf_name' field in
      'struct bpf_map_ops'. This process can instead be done at build time,
      thanks to Jiri's resolve_btfids (a sketch follows this entry).
      
      selftest of map_ptr has passed:
      
        $96 map_ptr:OK
        Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c317ab71
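      As a rough illustration of the build-time approach: a map type can
      carry a pointer into a BTF_ID list that resolve_btfids patches when
      vmlinux is linked, instead of looking the type up by name during boot.
      The ops name below is illustrative and the exact member layout may
      differ from the commit.

      	#include <linux/bpf.h>
      	#include <linux/btf_ids.h>

      	/* resolve_btfids fills this one-element array with the BTF ID of
      	 * 'struct bpf_htab' at link time, so no runtime lookup by
      	 * 'map_btf_name' is needed any more. */
      	BTF_ID_LIST_SINGLE(htab_map_btf_ids, struct, bpf_htab)

      	const struct bpf_map_ops example_htab_map_ops = {
      		/* ... the usual map callbacks ... */
      		.map_btf_id = &htab_map_btf_ids[0],
      	};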
  6. 26 April 2022, 10 commits
    • bpf: Make BTF type match stricter for release arguments · 2ab3b380
      Committed by Kumar Kartikeya Dwivedi
      The current behavior of btf_struct_ids_match for release arguments is
      that when the type match fails, it retries with the first member type
      again (recursively). Since the offset is already 0, this is akin to just
      casting the pointer in normal C, since if the type matches it was just
      embedded inside the parent struct as an object. However, we want to
      reject such cases for release function type matching, be it kfunc or
      BPF helpers.
      
      An example is the following:
      
      struct foo {
      	struct bar b;
      };
      
      struct foo *v = acq_foo();
      rel_bar(&v->b); // btf_struct_ids_match fails btf_types_are_same, then
      		// retries with first member type and succeeds, while
      		// it should fail.
      
      Hence, don't walk the struct and only rely on btf_types_are_same for
      strict mode. All users of strict mode must be dealing with zero offset
      anyway, since otherwise they would want the struct to be walked.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220424214901.2743946-10-memxor@gmail.com
      2ab3b380
    • bpf: Teach verifier about kptr_get kfunc helpers · a1ef1959
      Committed by Kumar Kartikeya Dwivedi
      We introduce a new style of kfunc helpers, namely *_kptr_get, which
      take a pointer to the map value that points to a referenced kernel
      pointer contained in the map. Since this is referenced, only
      bpf_kptr_xchg from the BPF side and xchg from the kernel side are
      allowed to change the current value, and each pointer that resides in
      that location will be referenced and RCU protected (this must be kept
      in mind while adding kernel types embeddable as referenced kptr in BPF
      maps).
      
      This means that if we do the load of the pointer value in an RCU read
      section and find a live pointer, then as long as we hold the RCU read
      lock, it won't be freed by a parallel xchg + release operation. This
      allows us to implement a safe refcount increment scheme. Hence, enforce
      that the first argument of all such kfuncs is a proper PTR_TO_MAP_VALUE
      pointing at the right offset to the referenced pointer (a sketch of the
      pattern follows this entry).
      
      For the rest of the arguments, they are subjected to typical kfunc
      argument checks, hence allowing some flexibility in passing more intent
      into how the reference should be taken.
      
      For instance, in case of struct nf_conn, it is not freed until RCU grace
      period ends, but can still be reused for another tuple once refcount has
      dropped to zero. Hence, a bpf_ct_kptr_get helper not only needs to call
      refcount_inc_not_zero, but also do a tuple match after incrementing the
      reference, and when it fails to match it, put the reference again and
      return NULL.
      
      This can be implemented easily if we allow passing additional parameters
      to the bpf_ct_kptr_get kfunc, like a struct bpf_sock_tuple * and a
      tuple__sz pair.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220424214901.2743946-9-memxor@gmail.com
      a1ef1959
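      A minimal sketch of the kernel-side *_kptr_get pattern described above,
      assuming a hypothetical refcounted 'struct widget' and kfunc name; this
      is not code from the commit.

      	#include <linux/rcupdate.h>
      	#include <linux/refcount.h>

      	struct widget {
      		refcount_t ref;
      		/* ... payload ... */
      	};

      	/* First argument must be a PTR_TO_MAP_VALUE pointing at the kptr offset. */
      	struct widget *bpf_widget_kptr_get(struct widget **kptrp)
      	{
      		struct widget *w;

      		rcu_read_lock();
      		w = READ_ONCE(*kptrp);			/* racy load of the stored kptr */
      		if (w && !refcount_inc_not_zero(&w->ref))
      			w = NULL;			/* object already on its way out */
      		rcu_read_unlock();
      		return w;				/* NULL or a new reference */
      	}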
    • bpf: Wire up freeing of referenced kptr · 14a324f6
      Committed by Kumar Kartikeya Dwivedi
      A destructor kfunc can be defined as void func(type *), where type may
      be void or any other pointer type as per convenience.
      
      In this patch, we ensure that the type is sane and capture the function
      pointer into off_desc of ptr_off_tab for the specific pointer offset,
      with the invariant that the dtor pointer is always set when 'kptr_ref'
      tag is applied to the pointer's pointee type, which is indicated by the
      flag BPF_MAP_VALUE_OFF_F_REF.
      
      Note that only BTF IDs whose destructor kfunc is registered become
      allowed BTF IDs for embedding as a referenced kptr. Hence this serves
      both to find the dtor kfunc BTF ID and to act as a check against the
      whitelist of BTF IDs allowed for this purpose.
      
      Finally, wire up the actual freeing of the referenced pointer, if any,
      at all available offsets, so that no references are leaked after the
      BPF map goes away when a BPF program previously moved ownership of a
      referenced pointer into it.
      
      The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
      will free any existing referenced kptr. The same applies to the LRU map's
      bpf_lru_push_free/htab_lru_push_free functions, which are extended to
      reset unreferenced and free referenced kptr.
      
      Note that unlike BPF timers, kptr is not reset or freed when map uref
      drops to zero.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
      14a324f6
    • bpf: Populate pairs of btf_id and destructor kfunc in btf · 5ce937d6
      Committed by Kumar Kartikeya Dwivedi
      To support storing referenced PTR_TO_BTF_ID in maps, we require
      associating a specific BTF ID with a 'destructor' kfunc. This is because
      we need to release a live referenced pointer at a certain offset in map
      value from the map destruction path, otherwise we end up leaking
      resources.
      
      Hence, introduce support for passing an array of btf_id, kfunc_btf_id
      pairs that denote a BTF ID and its associated release function. Then,
      add an accessor 'btf_find_dtor_kfunc' which can be used to look up the
      destructor kfunc of a certain BTF ID. If found, we can use it to free
      the object from the map free path (a registration sketch follows this
      entry).
      
      The registration of these pairs also serves as a whitelist of structures
      which are allowed as referenced PTR_TO_BTF_ID in a BPF map, because
      without finding the destructor kfunc, we will bail and return an error.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220424214901.2743946-7-memxor@gmail.com
      5ce937d6
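      A sketch of what such a registration could look like, using the
      btf_id_dtor_kfunc pairs and accessor described above. The 'widget'
      type and widget_release() kfunc are hypothetical.

      	#include <linux/btf.h>
      	#include <linux/btf_ids.h>
      	#include <linux/init.h>
      	#include <linux/kernel.h>
      	#include <linux/module.h>

      	BTF_ID_LIST(widget_dtor_ids)
      	BTF_ID(struct, widget)		/* btf_id: type usable as referenced kptr */
      	BTF_ID(func, widget_release)	/* kfunc_btf_id: its destructor kfunc */

      	static int __init widget_register_dtors(void)
      	{
      		struct btf_id_dtor_kfunc widget_dtors[] = {
      			{
      				.btf_id		= widget_dtor_ids[0],
      				.kfunc_btf_id	= widget_dtor_ids[1],
      			},
      		};

      		/* Without this, a map embedding a referenced kptr to 'struct widget'
      		 * is rejected, since btf_find_dtor_kfunc() finds no destructor. */
      		return register_btf_id_dtor_kfuncs(widget_dtors,
      						   ARRAY_SIZE(widget_dtors),
      						   THIS_MODULE);
      	}
      	late_initcall(widget_register_dtors);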
    • bpf: Adapt copy_map_value for multiple offset case · 4d7d7f69
      Committed by Kumar Kartikeya Dwivedi
      Since now there might be at most 10 offsets that need handling in
      copy_map_value, the manual shuffling and special casing are no longer
      going to work. Hence, let's generalise the copy_map_value function by
      using a sorted array of offsets to skip regions that must be avoided
      while copying into and out of a map value.
      
      When the map is created, we populate the offset array in struct bpf_map.
      copy_map_value then uses this sorted offset array to memcpy while
      skipping the timer, spin lock, and kptr fields (a sketch of this loop
      follows this entry). The array is allocated separately because in most
      cases none of these special fields are present in the map value, so we
      save space in the common case by not embedding the entire object inside
      struct bpf_map.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220424214901.2743946-6-memxor@gmail.com
      4d7d7f69
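      The copy loop can be sketched as follows; the structure below is
      illustrative, not the kernel's exact representation of the offset
      array.

      	#include <linux/string.h>
      	#include <linux/types.h>

      	struct skip_region {
      		u32 off;	/* start of a field that must not be copied */
      		u32 size;	/* size of that field */
      	};

      	/* 'skip' is sorted by offset; copy only the gaps between regions. */
      	static void copy_map_value_sketch(void *dst, const void *src, u32 value_size,
      					  const struct skip_region *skip, u32 nr_skip)
      	{
      		u32 copied = 0, i;

      		for (i = 0; i < nr_skip; i++) {
      			memcpy(dst + copied, src + copied, skip[i].off - copied);
      			copied = skip[i].off + skip[i].size;
      		}
      		memcpy(dst + copied, src + copied, value_size - copied);
      	}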
    • bpf: Prevent escaping of kptr loaded from maps · 6efe152d
      Committed by Kumar Kartikeya Dwivedi
      While we can guarantee that, even for an unreferenced kptr, the case
      where the object it points to is being freed etc. can be handled by the
      verifier's exception handling (normal loads are patched to PROBE_MEM
      loads), we still cannot allow the user to pass these pointers to BPF
      helpers and kfuncs, because the same exception handling won't be done
      for accesses inside the kernel. The same is true if a referenced pointer
      is loaded using a normal load instruction. Since the reference is not
      guaranteed to be held while the pointer is used, it must be marked as
      untrusted.
      
      Hence introduce a new type flag, PTR_UNTRUSTED, which is used to mark
      all registers loading unreferenced and referenced kptr from BPF maps,
      and ensure they can never escape the BPF program and into the kernel by
      way of calling stable/unstable helpers.
      
      In check_ptr_to_btf_access, the !type_may_be_null check to reject type
      flags is still correct, as apart from PTR_MAYBE_NULL, only MEM_USER,
      MEM_PERCPU, and PTR_UNTRUSTED may be set for PTR_TO_BTF_ID. The first
      two are checked inside the function and rejected using a proper error
      message, but we still want to allow dereference in the untrusted case.
      
      Also, we make sure to inherit PTR_UNTRUSTED when a chain of pointers is
      walked, so that this flag is never dropped once it has been set on a
      PTR_TO_BTF_ID (i.e. the trusted to untrusted transition can only go in
      one direction).
      
      In convert_ctx_accesses, extend the switch case to consider untrusted
      PTR_TO_BTF_ID in addition to normal PTR_TO_BTF_ID for PROBE_MEM
      conversion for BPF_LDX.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220424214901.2743946-5-memxor@gmail.com
      6efe152d
    • bpf: Allow storing referenced kptr in map · c0a5a21c
      Committed by Kumar Kartikeya Dwivedi
      Extending the code in previous commits, introduce referenced kptr
      support, which needs to be tagged using the 'kptr_ref' tag instead.
      Unlike unreferenced kptr, referenced kptr have a lot more restrictions.
      In addition to the type matching, only a newly introduced bpf_kptr_xchg
      helper is allowed to modify the map value at that offset. This transfers
      the referenced pointer being stored into the map, releasing the
      reference state for the program, and returns the old value, creating a
      new reference state for the returned pointer (a usage sketch follows
      this entry).
      
      Similar to the unreferenced pointer case, the return value for this case
      will also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned
      pointer must eventually either be released by calling the corresponding
      release function, or be transferred into another map.
      
      It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
      the value, and obtain the old value if any.
      
      BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future
      commit will permit using BPF_LDX for such pointers, while attempting to
      make it safe, since the lifetime of the object won't be guaranteed.
      
      There are valid reasons to enforce the restriction of permitting only
      bpf_kptr_xchg to operate on referenced kptr. The pointer value must be
      consistent in face of concurrent modification, and any prior values
      contained in the map must also be released before a new one is moved
      into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg
      returns the old value, which the verifier would require the user to
      either free or move into another map, and releases the reference held
      for the pointer being moved in.
      
      In the future, direct BPF_XCHG instruction may also be permitted to work
      like bpf_kptr_xchg helper.
      
      Note that process_kptr_func doesn't have to call
      check_helper_mem_access, since we already disallow rdonly/wronly flags
      for map, which is what check_map_access_type checks, and we already
      ensure the PTR_TO_MAP_VALUE refers to kptr by obtaining its off_desc,
      so check_map_access is also not required.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220424214901.2743946-4-memxor@gmail.com
      c0a5a21c
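      A BPF-side sketch of the ownership transfer described above, showing
      the NULL-exchange direction. The 'struct widget' type, the
      bpf_widget_release() kfunc, the map layout, and the program section are
      all hypothetical; bpf_kptr_xchg is the helper introduced by this commit.

      	#include <vmlinux.h>
      	#include <bpf/bpf_helpers.h>

      	/* Hypothetical kernel type and release kfunc, for illustration only. */
      	struct widget;
      	extern void bpf_widget_release(struct widget *w) __ksym;

      	#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))

      	struct map_value {
      		struct widget __kptr_ref *w;
      	};

      	struct {
      		__uint(type, BPF_MAP_TYPE_ARRAY);
      		__uint(max_entries, 1);
      		__type(key, int);
      		__type(value, struct map_value);
      	} widgets SEC(".maps");

      	SEC("tc")
      	int steal_widget(struct __sk_buff *ctx)
      	{
      		struct map_value *v;
      		struct widget *old;
      		int key = 0;

      		v = bpf_map_lookup_elem(&widgets, &key);
      		if (!v)
      			return 0;

      		/* Exchange NULL in: ownership of the stored widget (if any)
      		 * moves to the program, which must then release it. */
      		old = bpf_kptr_xchg(&v->w, NULL);
      		if (old)
      			bpf_widget_release(old);
      		return 0;
      	}

      	char LICENSE[] SEC("license") = "GPL";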
    • bpf: Tag argument to be released in bpf_func_proto · 8f14852e
      Committed by Kumar Kartikeya Dwivedi
      Add a new type flag for bpf_arg_type that, when set, tells the verifier
      that for a release function, that argument's register will be the one
      for which meta.ref_obj_id will be set, and which will then be released
      using release_reference. To capture the regno, introduce a new field
      release_regno in bpf_call_arg_meta.
      
      This would be required in the next patch, where we may either pass NULL
      or a refcounted pointer as an argument to the release function
      bpf_kptr_xchg. Just releasing only when meta.ref_obj_id is set is not
      enough, as there is a case where the type of argument needed matches,
      but the ref_obj_id is set to 0. Hence, we must enforce that whenever
      meta.ref_obj_id is zero, the register that is to be released can only
      be NULL for a release function.
      
      Since we now indicate whether an argument is to be released in
      bpf_func_proto itself, the is_release_function helper has lost its
      utility, hence refactor the code to work without it and just rely on
      meta.release_regno to know when to release state for a ref_obj_id.
      Still, the restriction of one release argument and only one ref_obj_id
      passed to a BPF helper or kfunc remains. This may be lifted in the
      future.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220424214901.2743946-3-memxor@gmail.com
      8f14852e
    • bpf: Allow storing unreferenced kptr in map · 61df10c7
      Committed by Kumar Kartikeya Dwivedi
      This commit introduces a new pointer type 'kptr' which can be embedded
      in a map value to hold a PTR_TO_BTF_ID stored by a BPF program during
      its invocation. When storing such a kptr, BPF program's PTR_TO_BTF_ID
      register must have the same type as in the map value's BTF, and loading
      a kptr marks the destination register as PTR_TO_BTF_ID with the correct
      kernel BTF and BTF ID.
      
      Such kptr are unreferenced, i.e. by the time another invocation of the
      BPF program loads this pointer, the object which the pointer points to
      may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
      patched to PROBE_MEM loads by the verifier, it would be safe to allow
      the user to still access such an invalid pointer, but passing such
      pointers into BPF helpers and kfuncs should not be permitted. A future
      patch in this series will close this gap.
      
      The flexibility offered by allowing programs to dereference such invalid
      pointers while being safe at runtime frees the verifier from doing
      complex lifetime tracking. As long as the user can ensure that the
      object remains valid, it can ensure that the data it reads from the
      kernel object is valid.
      
      The user indicates that a certain pointer must be treated as kptr
      capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
      a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
      information is recorded in the object BTF which will be passed into the
      kernel by way of map's BTF information. The name and kind from the map
      value BTF is used to look up the in-kernel type, and the actual BTF and
      BTF ID is recorded in the map struct in a new kptr_off_tab member. For
      now, only storing pointers to structs is permitted.
      
      An example of this specification is shown below:
      
      	#define __kptr __attribute__((btf_type_tag("kptr")))
      
      	struct map_value {
      		...
      		struct task_struct __kptr *task;
      		...
      	};
      
      Then, in a BPF program, the user may store a PTR_TO_BTF_ID with type
      task_struct into the map, and then load it later (a usage sketch
      follows this entry).
      
      Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
      the verifier cannot know whether the value is NULL or not statically, it
      must treat all potential loads at that map value offset as loading a
      possibly NULL pointer.
      
      Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL)
      are allowed instructions that can access such a pointer. On BPF_LDX, the
      destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
      it is checked whether the source register type is a PTR_TO_BTF_ID with
      same BTF type as specified in the map BTF. The access size must always
      be BPF_DW.
      
      For the map in map support, the kptr_off_tab for the outer map is copied
      from the inner map's kptr_off_tab. It was chosen to do a deep copy
      instead of introducing a refcount to kptr_off_tab, because the copy only
      needs to be done when parameterizing using inner_map_fd in the map in
      map case, hence would be unnecessary for all other users.
      
      It is not permitted to use MAP_FREEZE command and mmap for BPF map
      having kptrs, similar to the bpf_timer case. A kptr also requires that
      BPF program has both read and write access to the map (hence both
      BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG are disallowed).
      
      Note that check_map_access must be called from both
      check_helper_mem_access and for the BPF instructions, hence the kptr
      check must distinguish between ACCESS_DIRECT and ACCESS_HELPER, and
      reject ACCESS_HELPER cases. We rename stack_access_src to bpf_access_src
      and reuse it for this purpose.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220424214901.2743946-2-memxor@gmail.com
      61df10c7
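      A short BPF-side usage sketch of the unreferenced case, reusing the
      map_value layout from the example above; the tracepoint hook and map
      name are illustrative.

      	#include <vmlinux.h>
      	#include <bpf/bpf_helpers.h>
      	#include <bpf/bpf_tracing.h>

      	#define __kptr __attribute__((btf_type_tag("kptr")))

      	struct map_value {
      		struct task_struct __kptr *task;
      	};

      	struct {
      		__uint(type, BPF_MAP_TYPE_ARRAY);
      		__uint(max_entries, 1);
      		__type(key, int);
      		__type(value, struct map_value);
      	} tasks SEC(".maps");

      	SEC("tp_btf/task_newtask")
      	int BPF_PROG(record_task, struct task_struct *task, u64 clone_flags)
      	{
      		struct map_value *v;
      		struct task_struct *t;
      		int key = 0;

      		v = bpf_map_lookup_elem(&tasks, &key);
      		if (!v)
      			return 0;

      		v->task = task;	/* BPF_STX: plain store, no reference taken */

      		t = v->task;	/* BPF_LDX: result is PTR_TO_BTF_ID_OR_NULL */
      		if (t)
      			bpf_printk("stored pid %d", t->pid);
      		return 0;
      	}

      	char LICENSE[] SEC("license") = "GPL";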
    • bpf: Use bpf_prog_run_array_cg_flags everywhere · d9d31cf8
      Committed by Stanislav Fomichev
      Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and
      use it everywhere. check_return_code already enforces sane
      return ranges for all cgroup types. (only egress and bind hooks have
      uncanonical return ranges, the rest is using [0, 1])
      
      No functional changes.
      
      v2:
      - 'func_ret & 1' under explicit test (Andrii & Martin)
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220425220448.3669032-1-sdf@google.com
      d9d31cf8
  7. 25 April 2022, 3 commits
  8. 23 April 2022, 3 commits
  9. 22 April 2022, 2 commits
    • ipv4: Avoid using RTO_ONLINK with ip_route_connect(). · 67e1e2f4
      Committed by Guillaume Nault
      Now that ip_rt_fix_tos() doesn't reset ->flowi4_scope unconditionally,
      we don't have to rely on the RTO_ONLINK bit to properly set the scope
      of a flowi4 structure. We can just set ->flowi4_scope explicitly and
      avoid using RTO_ONLINK in ->flowi4_tos.
      
      This patch converts callers of ip_route_connect(). Instead of setting
      the tos parameter with RT_CONN_FLAGS(sk), as all callers do, we can:
      
        1- Drop the tos parameter from ip_route_connect(): its value was
           entirely based on sk, which is also passed as parameter.
      
        2- Set ->flowi4_scope depending on the SOCK_LOCALROUTE socket option
           instead of always initialising it with RT_SCOPE_UNIVERSE (let's
           define ip_sock_rt_scope() for this purpose).
      
        3- Avoid overloading ->flowi4_tos with RTO_ONLINK: since the scope is
           now properly initialised, we don't need to tell ip_rt_fix_tos() to
           adjust ->flowi4_scope for us. So let's define ip_sock_rt_tos(),
           which is the same as RT_CONN_FLAGS() but without the RTO_ONLINK
           bit overload (both helpers are sketched after this entry).
      
      Note:
        In the original ip_route_connect() code, __ip_route_output_key()
        might clear the RTO_ONLINK bit of fl4->flowi4_tos (because of
        ip_rt_fix_tos()). Therefore flowi4_update_output() had to reuse the
        original tos variable. Now that we don't set RTO_ONLINK any more,
        this is not a problem and we can use fl4->flowi4_tos in
        flowi4_update_output().
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67e1e2f4
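      The two helpers could look roughly like this; a sketch based on the
      description above, not necessarily the exact kernel definitions.

      	#include <net/inet_sock.h>
      	#include <net/route.h>
      	#include <net/sock.h>

      	static inline int ip_sock_rt_scope(const struct sock *sk)
      	{
      		if (sock_flag(sk, SOCK_LOCALROUTE))
      			return RT_SCOPE_LINK;	/* what RTO_ONLINK used to request */

      		return RT_SCOPE_UNIVERSE;
      	}

      	static inline int ip_sock_rt_tos(const struct sock *sk)
      	{
      		/* RT_CONN_FLAGS() without the RTO_ONLINK bit overload. */
      		return RT_TOS(inet_sk(sk)->tos);
      	}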
    • ipv6: Remove __ipv6_only_sock(). · 89e9c728
      Committed by Kuniyuki Iwashima
      Since commit 9fe516ba ("inet: move ipv6only in sock_common"),
      ipv6_only_sock() and __ipv6_only_sock() are the same macro.  Let's
      remove the latter.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      89e9c728
  10. 21 April 2022, 1 commit
  11. 20 April 2022, 6 commits
    • net/sched: flower: Add number of vlan tags filter · b4000312
      Committed by Boris Sukholitko
      These are the bookkeeping parts of the new num_of_vlans filter.
      Defines, dump, load and set are handled here.
      Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4000312
    • flow_dissector: Add number of vlan tags dissector · 34951fcf
      Committed by Boris Sukholitko
      Our customers in the fiber telecom world have network configurations
      where they would like to control their traffic according to the number
      of tags appearing in the packet.
      
      For example, TR247 GPON conformance test suite specification mostly
      talks about untagged, single, double tagged packets and gives lax
      guidelines on the vlan protocol vs. number of vlan tags.
      
      This is different from common IT networks, where the 802.1Q and 802.1ad
      protocols usually describe single and double tagged packets. The GPON
      configurations that we work with have an arbitrary mix of the above
      protocols and numbers of vlan tags in the packet.
      
      The goal is to make the following TC commands possible:
      
      tc filter add dev eth1 ingress flower \
        num_of_vlans 1 vlan_prio 5 action drop
      
      From our logs, we have redirect rules such as:
      
      tc filter add dev $GPON ingress flower num_of_vlans $N \
           action mirred egress redirect dev $DEV
      
      where N can range from 0 to 3 and $DEV is a function of $N.
      
      Also there are rules setting skb mark based on the number of vlans:
      
      tc filter add dev $GPON ingress flower num_of_vlans $N vlan_prio \
          $P action skbedit mark $M
      
      This new dissector allows extracting the number of vlan tags existing in
      the packet.
      Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34951fcf
    • bpf: Fix usage of trace RCU in local storage. · dcf456c9
      Committed by KP Singh
      bpf_{sk,task,inode}_storage_free() do not need to use
      call_rcu_tasks_trace as no BPF program should be accessing the owner
      as it's being destroyed. The only other reader at this point is
      bpf_local_storage_map_free() which uses normal RCU.
      
      The only paths that need trace RCU are:
      
      * bpf_local_storage_{delete,update} helpers
      * map_{delete,update}_elem() syscalls
      
      Fixes: 0fe4b381 ("bpf: Allow bpf_local_storage to be used by sleepable programs")
      Signed-off-by: KP Singh <kpsingh@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20220418155158.2865678-1-kpsingh@kernel.org
      dcf456c9
    • vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP · 559089e0
      Committed by Song Liu
      Huge page backed vmalloc memory could benefit performance in many cases.
      However, some users of vmalloc may not be ready to handle huge pages for
      various reasons: hardware constraints, potential page splits, etc.
      VM_NO_HUGE_VMAP was introduced to allow vmalloc users to opt out of huge
      pages.  However, it is not easy to track down all the users that require
      the opt-out, as the allocations are passed down different stacks and may
      cause issues in different layers.
      
      To address this issue, replace VM_NO_HUGE_VMAP with an opt-in flag,
      VM_ALLOW_HUGE_VMAP, so that users that benefit from huge pages can ask
      for them specifically.
      
      Also, remove vmalloc_no_huge() and add the opt-in helper vmalloc_huge()
      (a usage sketch follows this entry).
      
      Fixes: fac54e2b ("x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP")
      Link: https://lore.kernel.org/netdev/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/"
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      559089e0
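      Usage becomes an explicit opt-in; a minimal sketch (the caller name is
      made up):

      	#include <linux/gfp.h>
      	#include <linux/vmalloc.h>

      	static void *alloc_big_table(unsigned long size)
      	{
      		/* Opt in: this caller can cope with a huge-page backed mapping.
      		 * A plain vmalloc(size) now stays on base pages by default. */
      		return vmalloc_huge(size, GFP_KERNEL);
      	}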
    • fs: fix acl translation · 705191b0
      Committed by Christian Brauner
      Last cycle we extended the idmapped mounts infrastructure to support
      idmapped mounts of idmapped filesystems (no such filesystem exists yet).
      Since then, the meaning of an idmapped mount is a mount whose idmapping
      is different from the filesystem's idmapping.
      
      While doing that work we missed adapting the acl translation helpers.
      They still assume that checking for the identity mapping is enough. But
      they need to use the no_idmapping() helper instead.
      
      Note, POSIX ACLs are always translated right at the userspace-kernel
      boundary using the caller's current idmapping and the initial idmapping.
      The order depends on whether we're coming from or going to userspace.
      The filesystem's idmapping doesn't matter at the border.
      
      Consequently, if a non-idmapped mount is passed we need to make sure to
      always pass the initial idmapping as the mount's idmapping and not the
      filesystem idmapping.  Since it's irrelevant here it would yield invalid
      ids and prevent setting acls for filesystems that are mountable in a
      userns and support posix acls (tmpfs and fuse).
      
      I reproduced the regression reported in [1] and verified that this patch
      fixes it.  A regression test will be added to xfstests in parallel.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
      Fixes: bd303368 ("fs: support mapped mounts of mapped filesystems")
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org> # 5.17
      Cc: <regressions@lists.linux.dev>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      705191b0
    • bpf: Move rcu lock management out of BPF_PROG_RUN routines · 055eb955
      Committed by Stanislav Fomichev
      Commit 7d08c2c9 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros
      into functions") switched a bunch of BPF_PROG_RUN macros to inline
      routines. This changed the semantics a bit. Due to macro argument
      expansion, it used to be:
      
      	rcu_read_lock();
      	array = rcu_dereference(cgrp->bpf.effective[atype]);
      	...
      
      Now, with inline routines, we have:
      	array_rcu = rcu_dereference(cgrp->bpf.effective[atype]);
      	/* array_rcu can be kfree'd here */
      	rcu_read_lock();
      	array = rcu_dereference(array_rcu);
      
      I'm assuming in practice the rcu subsystem isn't fast enough to trigger
      this, but let's use the rcu API properly (the corrected ordering is
      sketched after this entry).
      
      Also, rename to lower case to not confuse with macros. Additionally,
      drop and expand BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY.
      
      See [1] for more context.
      
        [1] https://lore.kernel.org/bpf/CAKH8qBs60fOinFdxiiQikK_q0EcVxGvNTQoWvHLEUGbgcj1UYg@mail.gmail.com/T/#u
      
      v2
      - keep rcu locks inside by passing cgroup_bpf
      
      Fixes: 7d08c2c9 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions")
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20220414161233.170780-1-sdf@google.com
      055eb955
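      A sketch of the corrected ordering, with the RCU read lock taken before
      the effective array is dereferenced; the function name is illustrative
      and the program-running loop is reduced to a comment.

      	#include <linux/bpf-cgroup.h>
      	#include <linux/rcupdate.h>

      	static int run_cgroup_progs_sketch(struct cgroup_bpf *cgrp_bpf,
      					   enum cgroup_bpf_attach_type atype,
      					   void *ctx)
      	{
      		const struct bpf_prog_array *array;
      		int ret = 0;

      		rcu_read_lock();
      		array = rcu_dereference(cgrp_bpf->effective[atype]);
      		/* ... run each program in 'array' while still under the lock ... */
      		rcu_read_unlock();
      		return ret;
      	}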
  12. 19 April 2022, 1 commit