1. 26 September 2020 (2 commits)
    • bpf: Add comment to document BTF type PTR_TO_BTF_ID_OR_NULL · ba5f4cfe
      John Fastabend authored
      The meaning of PTR_TO_BTF_ID_OR_NULL differs slightly from other types
      denoted with the *_OR_NULL type. For example the types PTR_TO_SOCKET
      and PTR_TO_SOCKET_OR_NULL can be used for branch analysis because the
      type PTR_TO_SOCKET is guaranteed to _not_ have a null value.
      
      In contrast, PTR_TO_BTF_ID and PTR_TO_BTF_ID_OR_NULL have slightly
      different meanings. A PTR_TO_BTF_ID may be a NULL pointer, but it
      is safe to read this pointer in the program context because the
      program context will handle any faults. The fallout is that for
      PTR_TO_BTF_ID the verifier can assume reads are safe, but cannot
      use the type in branch analysis. Additionally, authors need to be
      extra careful when passing PTR_TO_BTF_ID into helpers. In general,
      helpers consuming type PTR_TO_BTF_ID will need to assume it may
      be null.
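
      For illustration only (a hypothetical in-kernel helper, not code from
      this patch), a helper that takes a PTR_TO_BTF_ID argument has to
      tolerate NULL itself:

      #include <linux/filter.h>
      #include <linux/tcp.h>

      /* hypothetical helper: the verifier does not prove tp non-NULL */
      BPF_CALL_1(bpf_example_read_lsndtime, struct tcp_sock *, tp)
      {
              if (!tp)
                      return 0;
              return tp->lsndtime;
      }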
      
      Since the above is not obvious to readers without this background
      knowledge, let's add a comment to the type definition.
      
      Editorial comment: as networking and tracing programs get closer
      and more tightly merged, we may need to consider a new type that we
      can ensure is non-null for branch analysis and for passing into
      helpers.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Lorenz Bauer <lmb@cloudflare.com>
      ba5f4cfe
    • bpf: Enable bpf_skc_to_* sock casting helper to networking prog type · 1df8f55a
      Martin KaFai Lau authored
      There is a constant need to add more fields into the bpf_tcp_sock
      for the bpf programs running at tc, sock_ops, etc.
      
      A current workaround is to use bpf_probe_read_kernel().  However,
      other than requiring another helper call to read each field and missing
      CO-RE, it is also not as intuitive to use as directly reading
      "tp->lsndtime", for example.  While the prog already has the perfmon cap
      needed for bpf_probe_read_kernel(), it will be much easier if it can
      directly read from the tcp_sock.
      
      This patch tries to do that by using the existing casting-helpers
      bpf_skc_to_*() whose func_proto returns a btf_id.  For example, the
      func_proto of bpf_skc_to_tcp_sock returns the btf_id of the
      kernel "struct tcp_sock".
      
      These helpers are also added to is_ptr_cast_function().
      It ensures the returned reg (BPF_REG_0) will also carry the ref_obj_id,
      which keeps ref-tracking working properly.
      
      The bpf_skc_to_* helpers are made available to most of the bpf prog
      types in filter.c. The bpf_skc_to_* helpers will be limited by
      perfmon cap.
      
      This patch adds an ARG_PTR_TO_BTF_ID_SOCK_COMMON.  A helper accepting
      this arg can accept a btf-id-ptr (PTR_TO_BTF_ID + &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON])
      or a legacy-ctx-convert-skc-ptr (PTR_TO_SOCK_COMMON).  The bpf_skc_to_*()
      helpers are changed to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that
      they will accept a pointer obtained from skb->sk.
      
      Instead of specifying both arg_type and arg_btf_id in the same func_proto,
      which is how the current ARG_PTR_TO_BTF_ID works, the arg_btf_id of
      the new ARG_PTR_TO_BTF_ID_SOCK_COMMON is specified in
      compatible_reg_types[] in verifier.c.  The reason is that the arg_btf_id
      is always the same.  Discussion in this thread:
      https://lore.kernel.org/bpf/20200922070422.1917351-1-kafai@fb.com/
      
      The ARG_PTR_TO_BTF_ID_ part gives a clear expectation that the helper is
      expecting a PTR_TO_BTF_ID which could be NULL.  This is the same
      behavior as the existing helper taking ARG_PTR_TO_BTF_ID.
      
      The _SOCK_COMMON part means the helper is also expecting the legacy
      SOCK_COMMON pointer.
      
      By excluding the _OR_NULL part, the bpf prog cannot call the helper
      with a literal NULL, which doesn't make sense in most cases.
      e.g. bpf_skc_to_tcp_sock(NULL) will be rejected.  All PTR_TO_*_OR_NULL
      regs have to do a NULL check first before being passed into the helper,
      or else the bpf prog will be rejected.  This behavior is nothing new and
      is consistent with the current expectation during bpf-prog-load.
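
      A hedged sketch of what this enables (not code from the patch; section
      names and includes follow libbpf conventions), showing a tc program
      casting skb->sk and reading a tcp_sock field directly after the usual
      NULL checks:

      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>

      SEC("tc")
      int log_lsndtime(struct __sk_buff *skb)
      {
              struct bpf_sock *sk = skb->sk;
              struct tcp_sock *tp;

              if (!sk)        /* PTR_TO_SOCK_COMMON_OR_NULL must be checked */
                      return 0;

              tp = bpf_skc_to_tcp_sock(sk);
              if (!tp)        /* the returned pointer may be NULL */
                      return 0;

              /* direct read of a kernel struct field, no bpf_probe_read_kernel() */
              bpf_printk("lsndtime=%u", tp->lsndtime);
              return 0;
      }

      char _license[] SEC("license") = "GPL";

      Loading such a program still requires the perfmon cap, per the note
      above.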
      
      [ ARG_PTR_TO_BTF_ID_SOCK_COMMON will be used to replace
        ARG_PTR_TO_SOCK* of other existing helpers later such that
        those existing helpers can take the PTR_TO_BTF_ID returned by
        the bpf_skc_to_*() helpers.
      
        The only special case is bpf_sk_lookup_assign() which can accept a
        literal NULL ptr.  It has to be handled specially in another follow
        up patch if there is a need (e.g. by renaming ARG_PTR_TO_SOCKET_OR_NULL
        to ARG_PTR_TO_BTF_ID_SOCK_COMMON_OR_NULL). ]
      
      [ When converting the older helpers that take ARG_PTR_TO_SOCK* in
        the later patch, if the kernel does not support BTF,
        ARG_PTR_TO_BTF_ID_SOCK_COMMON will behave like ARG_PTR_TO_SOCK_COMMON
        because no reg->type could have PTR_TO_BTF_ID in this case.
      
        It is not a concern for the newer-btf-only helper like the bpf_skc_to_*()
        here though because these helpers must require BTF vmlinux to begin
        with. ]
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200925000350.3855720-1-kafai@fb.com
      1df8f55a
  2. 24 September 2020 (2 commits)
  3. 22 September 2020 (5 commits)
  4. 21 September 2020 (5 commits)
  5. 20 September 2020 (3 commits)
  6. 19 September 2020 (4 commits)
  7. 18 September 2020 (9 commits)
    • ieee80211: redefine S1G bits with GENMASK · 37050e3a
      Thomas Pedersen authored
      The S1G capability fields were defined by ORing BIT()s
      together, and expecting a custom macro to use the _SHIFT
      definitions. Use the Linux kernel GENMASK for the
      definitions now, and FIELD_{GET,PREP} to access the fields
      in the future.

      Take the chance to rename e.g. S1G_CAPAB_B0 to the more
      compact S1G_CAP0.
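
      A minimal sketch of the before/after style (the field below is
      illustrative, not a specific define from the patch):

      #include <linux/bitfield.h>
      #include <linux/bits.h>
      #include <linux/types.h>

      /* old style: bits ORed together plus a separate _SHIFT definition */
      #define S1G_CAPAB_B0_EXAMPLE            (BIT(1) | BIT(2))
      #define S1G_CAPAB_B0_EXAMPLE_SHIFT      1

      /* new style: a GENMASK() field accessed with FIELD_GET()/FIELD_PREP() */
      #define S1G_CAP0_EXAMPLE                GENMASK(2, 1)

      static inline u8 example_get(const u8 *capab)
      {
              return FIELD_GET(S1G_CAP0_EXAMPLE, capab[0]);
      }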
      Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
      Link: https://lore.kernel.org/r/20200908190323.15814-2-thomas@adapt-ip.com
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      37050e3a
    • bpf: Add abnormal return checks. · 09b28d76
      Alexei Starovoitov authored
      LD_[ABS|IND] instructions may return from the function early. The bpf_tail_call
      pseudo instruction is either fallthrough or return. Allow them in
      subprograms only when the subprograms are BTF annotated and have scalar return
      types. Allow ld_abs and tail_call in the main program even if it calls into
      subprograms. In the past that was not ok to do for ld_abs, since it was JITed
      with a special exit sequence. Since bpf_gen_ld_abs() was introduced, ld_abs
      looks like a normal exit insn from the JIT point of view, so it's safe to allow
      them in the main program.
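
      A sketch of the shape this permits, assuming libbpf conventions (map
      layout and section names are illustrative; the subprogram has a scalar
      return type as required):

      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>

      struct {
              __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
              __uint(max_entries, 1);
              __uint(key_size, sizeof(__u32));
              __uint(value_size, sizeof(__u32));
      } jmp_table SEC(".maps");

      /* BTF-annotated subprogram with a scalar return type; it may now
       * contain a tail call. */
      static __attribute__((noinline)) int subprog_tail(struct __sk_buff *skb)
      {
              bpf_tail_call(skb, &jmp_table, 0);
              return 1;
      }

      SEC("tc")
      int entry(struct __sk_buff *skb)
      {
              return subprog_tail(skb);
      }

      char _license[] SEC("license") = "GPL";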
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      09b28d76
    • bpf, x64: rework pro/epilogue and tailcall handling in JIT · ebf7d1f5
      Maciej Fijalkowski authored
      This commit serves two purposes:
      1) it optimizes BPF prologue/epilogue generation
      2) it makes it possible to have tailcalls within a BPF subprogram
      
      Both points are related to each other since without 1), 2) could not be
      achieved.
      
      In [1], Alexei says:
      "The prologue will look like:
      nop5
      xor eax,eax  // two new bytes if bpf_tail_call() is used in this
                   // function
      push rbp
      mov rbp, rsp
      sub rsp, rounded_stack_depth
      push rax // zero init tail_call counter
      variable number of push rbx,r13,r14,r15
      
      Then bpf_tail_call will pop variable number rbx,..
      and final 'pop rax'
      Then 'add rsp, size_of_current_stack_frame'
      jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
      rbp, rsp'
      
      This way new function will set its own stack size and will init tail
      call
      counter with whatever value the parent had.
      
      If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
      Instead it would need to have 'nop2' in there."
      
      Implement that suggestion.
      
      Since the layout of the stack has changed, tail call counter handling can
      no longer rely on popping it to rbx, as was done for the constant prologue
      case, with rbx later overwritten by the actual value of rbx that was
      pushed to the stack. Therefore, let's use one of the registers (%rcx) that
      is considered volatile/caller-saved and pop the value of the tail call
      counter into it in the epilogue.
      
      Drop the BUILD_BUG_ON in emit_prologue and in
      emit_bpf_tail_call_indirect where instruction layout is not constant
      anymore.
      
      Introduce a new poke target, 'tailcall_bypass', in the poke descriptor
      that is dedicated to skipping the register pops and stack unwind that are
      generated right before the actual jump to the target program.
      For the case when the target program is not present, the BPF program will
      skip the pop instructions and the nop5 dedicated for jmpq $target. An
      example of such a state when only R6 of the callee-saved registers is
      used by the program:
      
      ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
      ffffffffc0513aa6:       5b                      pop    %rbx
      ffffffffc0513aa7:       58                      pop    %rax
      ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
      
      When the target program is inserted, the jump that was there to skip
      pops/nop5 will become the nop5, so the CPU will go over the pops and do
      the actual tailcall.

      One might ask why there simply cannot be pushes after the nop5.
      In the following example snippet:
      
      ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
      (...)
      ffffffffc0370332:       5b                      pop    %rbx
      ffffffffc0370333:       58                      pop    %rax
      ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
      ffffffffc0370347:       50                      push   %rax
      ffffffffc0370348:       53                      push   %rbx
      ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
      ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
      
      There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall
      and the jump target is not present. ctx is in the %rbx register and the
      BPF subprogram that we will call into at ffffffffc037034c relies on it,
      e.g. it will pick up ctx from there. Such a code layout is therefore
      broken, as we would overwrite the content of %rbx with the value that was
      pushed in the prologue. That is the reason for the 'bypass' approach.
      
      Special care needs to be taken during the install/update/remove of
      tailcall target. In case when target program is not present, the CPU
      must not execute the pop instructions that precede the tailcall.
      
      To address that, the following states can be defined:
      A nop, unwind, nop
      B nop, unwind, tail
      C skip, unwind, nop
      D skip, unwind, tail
      
      A is forbidden (leads to incorrectness). The state transitions between
      tailcall install/update/remove will work as follows:
      
      First install tail call f: C->D->B(f)
       * poke the tailcall, after that get rid of the skip
      Update tail call f to f': B(f)->B(f')
       * poke the tailcall (poke->tailcall_target) and do NOT touch the
         poke->tailcall_bypass
      Remove tail call: B(f')->C(f')
       * poke->tailcall_bypass is poked back to jump, then we wait for an RCU
         grace period so that other programs will finish their execution and
         after that we are safe to remove the poke->tailcall_target
      Install new tail call (f''): C(f')->D(f'')->B(f'').
       * same as first step
      
      This way CPU can never be exposed to "unwind, tail" state.
      
      Last but not least, when tailcalls get mixed with bpf2bpf calls, it
      would be possible to encounter an endless loop due to clearing the
      tailcall counter if, for example, we used a subprogram-based variant of
      the tailcall3-like program from BPF selftests, meaning the tailcall
      would be present within a BPF subprogram.
      
      This test, broken down to particular steps, would do:
      entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
      func0 -> call subprog_tail
      (we are NOT skipping the first 11 bytes of prologue and this subprogram
      has a tailcall, therefore we clear the counter...)
      subprog -> do the same thing as entry
      
      and then loop forever.
      
      To address this, the idea is to go through the call chain of bpf2bpf progs
      and look for a tailcall presence throughout whole chain. If we saw a single
      tail call then each node in this call chain needs to be marked as a subprog
      that can reach the tailcall. We would later feed the JIT with this info
      and:
      - set eax to 0 only when tailcall is reachable and this is the entry prog
      - if tailcall is reachable but there's no tailcall in insns of currently
        JITed prog then push rax anyway, so that it will be possible to
        propagate further down the call chain
      - finally if tailcall is reachable, then we need to precede the 'call'
        insn with mov rax, [rbp - (stack_depth + 8)]
      
      Tail call related cases from test_verifier kselftest are also working
      fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
      work properly as well.
      
      [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ebf7d1f5
    • bpf: Limit caller's stack depth 256 for subprogs with tailcalls · 7f6e4312
      Maciej Fijalkowski authored
      Protect against a potential stack overflow that might happen when bpf2bpf
      calls get combined with tailcalls. Limit the caller's stack depth for
      such a case down to 256 so that the worst-case scenario would result in an
      8k stack size (32, which is the tailcall limit, * 256 = 8k).
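
      Roughly, the verifier-side check this describes is of the following
      shape (a paraphrase of the limit, not necessarily the literal diff):

      /* in the verifier's stack-depth walk over bpf2bpf frames: idx != 0
       * means a callee frame, depth is the stack used by previous frames */
      if (idx && subprog[idx].has_tail_call && depth >= 256)
              return -EACCES;   /* 32 tail calls * 256 bytes = 8k worst case */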
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7f6e4312
    • netdev: Remove unused functions · 2492c205
      YueHaibing authored
      There are no callers in the tree, so these can be removed.
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2492c205
    • bpf: rename poke descriptor's 'ip' member to 'tailcall_target' · cf71b174
      Maciej Fijalkowski authored
      Reflect the actual purpose of poke->ip and rename it to
      poke->tailcall_target so that it will not be confused with another
      poke target that will be introduced in the next commit.
      
      While at it, do the same thing with poke->ip_stable - rename it to
      poke->tailcall_target_stable.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cf71b174
    • bpf: propagate poke descriptors to subprograms · a748c697
      Maciej Fijalkowski authored
      Previously, there was no need for poke descriptors to be present in a
      subprogram's bpf_prog_aux struct since tailcalls were simply not allowed
      in them. Each subprog is JITed independently, so in order to enable
      JITing subprograms that use tailcalls, do the following:
      
      - in fixup_bpf_calls() store the index of tailcall insn onto the generated
        poke descriptor,
      - in case when insn patching occurs, adjust the tailcall insn idx from
        bpf_patch_insn_data,
      - then in jit_subprogs() check whether the given poke descriptor belongs
        to the current subprog by checking if that previously stored absolute
        index of tail call insn is in the scope of the insns of given subprog,
      - update the insn->imm with the new poke descriptor slot so that the
        proper poke descriptor will be grabbed while JITing
      
      This way each of the main program's poke descriptors is distributed
      across the subprograms' poke descriptor arrays, so the main program's
      descriptors can be untracked from the prog array map.
      
      Also add the subprog's aux struct to the BPF map's poke_progs list by
      calling map_poke_track() on it.
      
      In case of any error, call map_poke_untrack() on the subprog's aux
      structs that have already been registered to the prog array map.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a748c697
    • mm: allow a controlled amount of unfairness in the page lock · 5ef64cc8
      Linus Torvalds authored
      Commit 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic") made
      the page locking entirely fair, in that if a waiter came in while the
      lock was held, the lock would be transferred to the lockers strictly in
      order.
      
      That was intended to finally get rid of the long-reported watchdog
      failures that involved the page lock under extreme load, where a process
      could end up waiting essentially forever, as other page lockers stole
      the lock from under it.
      
      It also improved some benchmarks, but it ended up causing huge
      performance regressions on others, simply because fair lock behavior
      doesn't end up giving out the lock as aggressively, causing better
      worst-case latency, but potentially much worse average latencies and
      throughput.
      
      Instead of reverting that change entirely, this introduces a controlled
      amount of unfairness, with a sysctl knob to tune it if somebody needs
      to.  But the default value should hopefully be good for any normal load,
      allowing a few rounds of lock stealing, but enforcing the strict
      ordering before the lock has been stolen too many times.
      
      There is also a hint from Matthieu Baerts that the fair page coloring
      may end up exposing an ABBA deadlock that is hidden by the usual
      optimistic lock stealing, and while the unfairness doesn't fix the
      fundamental issue (and I'm still looking at that), it avoids it in
      practice.
      
      The amount of unfairness can be modified by writing a new value to the
      'sysctl_page_lock_unfairness' variable (default value of 5, exposed
      through /proc/sys/vm/page_lock_unfairness), but that is hopefully
      something we'd use mainly for debugging rather than being necessary for
      any deep system tuning.
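
      For illustration only (not part of the patch), the knob can be adjusted
      from userspace like any other procfs sysctl; per the description above,
      the default is 5, and 0 presumably allows no rounds of lock stealing:

      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
              int fd = open("/proc/sys/vm/page_lock_unfairness", O_WRONLY);

              if (fd < 0) {
                      perror("open");
                      return 1;
              }
              /* write 0 to request strictly fair page-lock handoff */
              if (write(fd, "0\n", 2) < 0)
                      perror("write");
              close(fd);
              return 0;
      }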
      
      This whole issue has exposed just how critical the page lock can be, and
      how contended it gets under certain loads.  And the main contention
      doesn't really seem to be anything related to IO (which was the origin
      of this lock), but for things like just verifying that the page file
      mapping is stable while faulting the page into a page table.
      
      Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
      Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
      Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
      Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
      Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ef64cc8
    • arm64: paravirt: Initialize steal time when cpu is online · 75df529b
      Andrew Jones authored
      Steal time initialization requires mapping a memory region which
      invokes a memory allocation. Doing this at CPU starting time results
      in the following trace when CONFIG_DEBUG_ATOMIC_SLEEP is enabled:
      
      BUG: sleeping function called from invalid context at mm/slab.h:498
      in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #1
      Call trace:
       dump_backtrace+0x0/0x208
       show_stack+0x1c/0x28
       dump_stack+0xc4/0x11c
       ___might_sleep+0xf8/0x130
       __might_sleep+0x58/0x90
       slab_pre_alloc_hook.constprop.101+0xd0/0x118
       kmem_cache_alloc_node_trace+0x84/0x270
       __get_vm_area_node+0x88/0x210
       get_vm_area_caller+0x38/0x40
       __ioremap_caller+0x70/0xf8
       ioremap_cache+0x78/0xb0
       memremap+0x9c/0x1a8
       init_stolen_time_cpu+0x54/0xf0
       cpuhp_invoke_callback+0xa8/0x720
       notify_cpu_starting+0xc8/0xd8
       secondary_start_kernel+0x114/0x180
      CPU1: Booted secondary processor 0x0000000001 [0x431f0a11]
      
      However we don't need to initialize steal time at CPU starting time.
      We can simply wait until CPU online time, just sacrificing a bit of
      accuracy by returning zero for steal time until we know better.
      
      While at it, add __init to the functions that are only called by
      pv_time_init() which is __init.
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Fixes: e0685fa2 ("arm64: Retrieve stolen time as paravirtualized guest")
      Cc: stable@vger.kernel.org
      Reviewed-by: Steven Price <steven.price@arm.com>
      Link: https://lore.kernel.org/r/20200916154530.40809-1-drjones@redhat.com
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      75df529b
  8. 17 September 2020 (3 commits)
    • rcu-tasks: Fix grace-period/unlock race in RCU Tasks Trace · ba3a86e4
      Paul E. McKenney authored
      The more intense grace-period processing resulting from the 50x RCU
      Tasks Trace grace-period speedups exposed the following race condition:
      
      o	Task A running on CPU 0 executes rcu_read_lock_trace(),
      	entering a read-side critical section.
      
      o	When Task A eventually invokes rcu_read_unlock_trace()
      	to exit its read-side critical section, this function
      	notes that the ->trc_reader_special.s flag is zero and
      	therefore will set ->trc_reader_nesting to zero
      	using WRITE_ONCE().  But before that happens...
      
      o	The RCU Tasks Trace grace-period kthread running on some other
      	CPU interrogates Task A, but this fails because this task is
      	currently running.  This kthread therefore sends an IPI to CPU 0.
      
      o	CPU 0 receives the IPI, and thus invokes trc_read_check_handler().
      	Because Task A has not yet cleared its ->trc_reader_nesting
      	counter, this function sees that Task A is still within its
      	read-side critical section.  This function therefore sets the
      	->trc_reader_nesting.b.need_qs flag, AKA the .need_qs flag.
      
      	Except that Task A has already checked the .need_qs flag, which
      	is part of the ->trc_reader_special.s flag.  The .need_qs flag
      	therefore remains set until Task A's next rcu_read_unlock_trace().
      
      o	Task A now invokes synchronize_rcu_tasks_trace(), which cannot
      	start a new grace period until the current grace period completes.
      	And thus cannot return until after that time.
      
      	But Task A's .need_qs flag is still set, which prevents the current
      	grace period from completing.  And because Task A is blocked, it
      	will never execute rcu_read_unlock_trace() until its call to
      	synchronize_rcu_tasks_trace() returns.
      
      	We are therefore deadlocked.
      
      This race is improbable, but 80 hours of rcutorture made it happen twice.
      The race was possible before the grace-period speedup, but roughly 50x
      less probable.  Several thousand hours of rcutorture would have been
      necessary to have a reasonable chance of making this happen before this
      50x speedup.
      
      This commit therefore eliminates this deadlock by setting
      ->trc_reader_nesting to a large negative number before checking the
      .need_qs and zeroing (or decrementing with respect to its initial
      value) ->trc_reader_nesting.  For its part, the IPI handler's
      trc_read_check_handler() function adds a check for negative values,
      deferring evaluation of the task in this case.  Taken together, these
      changes avoid this deadlock scenario.
      
      Fixes: 276c4104 ("rcu-tasks: Split ->trc_reader_need_end")
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: <bpf@vger.kernel.org>
      Cc: <stable@vger.kernel.org> # 5.7.x
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      ba3a86e4
    • fs: fix cast in fsparam_u32hex() macro · ffbc3dd1
      Alexey Dobriyan authored
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ffbc3dd1
    • cpuidle: Allow cpuidle drivers to take over RCU-idle · 8747f202
      Peter Zijlstra authored
      Some drivers have to do significant work, some of which relies on RCU
      still being active. Instead of using RCU_NONIDLE in the drivers and
      flipping RCU back on, allow drivers to take over RCU-idle duty.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
      Tested-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      8747f202
  9. 16 September 2020 (6 commits)
    • locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count · e6b1a44e
      Hou Tao authored
      The __this_cpu*() accessors are (in general) IRQ-unsafe which, given
      that percpu-rwsem is a blocking primitive, should be just fine.
      
      However, file_end_write() is used from IRQ context and will cause
      load-store issues on architectures where the per-cpu accessors are not
      natively irq-safe.
      
      Fix it by using the IRQ-safe this_cpu_*() for operations on
      read_count. This will generate more expensive code on a number of
      platforms, which might cause a performance regression for some of the
      other percpu-rwsem users.
      
      If any such is reported, we can consider alternative solutions.
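
      In spirit, the change looks like the following simplified stand-in (not
      the literal patch):

      #include <linux/percpu.h>
      #include <linux/percpu-rwsem.h>

      /* simplified stand-in for the reader-count update */
      static inline void reader_count_inc(struct percpu_rw_semaphore *sem)
      {
              /* was: __this_cpu_inc(*sem->read_count);  not IRQ-safe everywhere */
              this_cpu_inc(*sem->read_count);         /* IRQ-safe update */
      }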
      
      Fixes: 70fe2f48 ("aio: fix freeze protection of aio writes")
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lkml.kernel.org/r/20200915140750.137881-1-houtao1@huawei.com
      e6b1a44e
    • serial: core: fix console port-lock regression · e0830dbf
      Johan Hovold authored
      Fix the port-lock initialisation regression introduced by commit
      a3cb39d2 ("serial: core: Allow detach and attach serial device for
      console") by making sure that the lock is again initialised during
      console setup.
      
      The console may be registered before the serial controller has been
      probed in which case the port lock needs to be initialised during
      console setup by a call to uart_set_options(). The console-detach
      changes introduced a regression in several drivers by effectively
      removing that initialisation by not initialising the lock when the port
      is used as a console (which is always the case during console setup).
      
      Add back the early lock initialisation and instead use a new
      console-reinit flag to handle the case where a console is being
      re-attached through sysfs.
      
      The question whether the console-detach interface should have been added
      in the first place is left for another discussion.
      
      Note that the console-enabled check in uart_set_options() is not
      redundant because of kgdboc, which can end up reinitialising an already
      enabled console (see commit 42b6a1ba ("serial_core: Don't
      re-initialize a previously initialized spinlock.")).
      
      Fixes: a3cb39d2 ("serial: core: Allow detach and attach serial device for console")
      Cc: stable <stable@vger.kernel.org>     # 5.7
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Link: https://lore.kernel.org/r/20200909143101.15389-3-johan@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e0830dbf
    • bpf: Mutex protect used_maps array and count · 984fe94f
      YiFei Zhu authored
      To support modifying the used_maps array, we use a mutex to protect
      the use of the counter and the array. The mutex is initialized right
      after the prog aux is allocated, and destroyed right before prog
      aux is freed. This way we guarantee it's initialized for both cBPF
      and eBPF.
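
      A sketch of the resulting access pattern (the mutex field name is an
      assumption based on the description above, not verified against the
      patch):

      #include <linux/bpf.h>
      #include <linux/mutex.h>

      /* sketch: every reader/writer of the array and counter holds the mutex */
      static void example_walk_used_maps(struct bpf_prog *prog)
      {
              u32 i;

              mutex_lock(&prog->aux->used_maps_mutex);
              for (i = 0; i < prog->aux->used_map_cnt; i++)
                      (void)prog->aux->used_maps[i];
              mutex_unlock(&prog->aux->used_maps_mutex);
      }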
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-2-sdf@google.com
      984fe94f
    • ethtool: add standard pause stats · 9a27a330
      Jakub Kicinski authored
      Currently drivers have to report their pause frames statistics
      via ethtool -S, and there is a wide variety of names used for
      these statistics.
      
      Add the two statistics defined in IEEE 802.3x to the standard
      API. Create a new ethtool request header flag for including
      statistics in the response to GET commands.
      
      Always create the ETHTOOL_A_PAUSE_STATS nest in replies when
      flag is set. Testing if driver declares the op is not a reliable
      way of checking if any stats will actually be included and therefore
      we don't want to give the impression that presence of
      ETHTOOL_A_PAUSE_STATS indicates driver support.
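
      A sketch of the driver-side hook this standardizes (struct and callback
      names as I understand the ethtool core; the driver bits are
      hypothetical):

      #include <linux/ethtool.h>
      #include <linux/netdevice.h>

      struct foo_priv {                       /* hypothetical driver state */
              u64 tx_pause;
              u64 rx_pause;
      };

      /* report the two IEEE 802.3x counters through the standard API */
      static void foo_get_pause_stats(struct net_device *dev,
                                      struct ethtool_pause_stats *stats)
      {
              struct foo_priv *priv = netdev_priv(dev);

              stats->tx_pause_frames = priv->tx_pause;
              stats->rx_pause_frames = priv->rx_pause;
      }

      static const struct ethtool_ops foo_ethtool_ops = {
              .get_pause_stats        = foo_get_pause_stats,
      };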
      
      Note that this patch does not include PFC counters, which may fit
      better in dcbnl? But mostly I don't need them/have a setup to test
      them so I haven't looked deeply into exposing them :)
      
      v3:
       - add a helper for "uninitializing" stats, rather than a cryptic
         memset() (Andrew)
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a27a330
    • net/mlx5e: Add CQE compression support for multi-strides packets · b7cf0806
      Ofer Levi authored
      Add CQE compression support for completions of packets that span
      multiple strides in a Striding RQ, per the HW capability.
      In our memory model, we use small strides (256B as of today) for the
      non-linear SKB mode. This feature allows CQE compression to work also
      for multiple strides packets. In this case decompressing the mini CQE
      array will use stride index provided by HW as part of the mini CQE.
      Before this feature, compression was possible only for single-strided
      packets, i.e. for packets of size up to 256 bytes when in non-linear
      mode, and the index was maintained by SW.
      This feature is supported for ConnectX-5 and above.
      
      Feature performance test:
      This was whitebox-tested: we reduced the PCI speed from 125Gb/s to
      62.5Gb/s to overload the PCI bus, and manipulated the mlx5 driver to drop
      incoming packets before building the SKB in order to achieve low CPU
      utilization. The outcome is low CPU utilization and a bottleneck on PCI
      only.
      Test setup:
      Server: Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz server, 32 cores
      NIC: ConnectX-6 DX.
      Sender side generates 300 byte packets at full pci bandwidth.
      Receiver side configuration:
      Single channel, one cpu processing with one ring allocated. Cpu utilization
      is ~20% while pci bandwidth is fully utilized.
      For the generated traffic and interface MTU of 4500B (to activate the
      non-linear SKB mode), packet rate improvement is about 19% from ~17.6Mpps
      to ~21Mpps.
      Without this feature, counters show no CQE compression blocks for
      this setup, while with the feature, counters show ~20.7Mpps compressed CQEs
      in ~500K compression blocks.
      Signed-off-by: Ofer Levi <oferle@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      b7cf0806
    • net/mlx5: Always use container_of to find mdev pointer from clock struct · fb609b51
      Eran Ben Elisha authored
      The clock struct is part of struct mlx5_core_dev. The code was
      inconsistent: in some cases it used container_of() and in others
      clock->mdev.

      Align the code to use container_of() and remove the clock->mdev pointer.
      While here, fix reverse xmas tree coding style.
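
      For reference, a sketch of the pattern the code converges on (the clock
      member name follows from the description, treat it as an assumption):

      #include <linux/mlx5/driver.h>

      /* recover the parent mlx5_core_dev from the embedded clock struct */
      static struct mlx5_core_dev *clock_to_mdev(struct mlx5_clock *clock)
      {
              return container_of(clock, struct mlx5_core_dev, clock);
      }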
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      fb609b51
  10. 15 September 2020 (1 commit)