1. 07 Aug, 2021 3 commits
  2. 29 Jul, 2021 3 commits
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · fc16a532
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2021-07-29
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 9 non-merge commits during the last 14 day(s) which contain
      a total of 20 files changed, 446 insertions(+), 138 deletions(-).
      
      The main changes are:
      
      1) Fix UBSAN out-of-bounds splat for showing XDP link fdinfo, from Lorenz Bauer.
      
      2) Fix insufficient Spectre v4 mitigation in BPF runtime, from Daniel Borkmann,
         Piotr Krysiuk and Benedict Schlueter.
      
      3) Batch of fixes for BPF sockmap found under stress testing, from John Fastabend.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc16a532
    • D
      bpf: Fix leakage due to insufficient speculative store bypass mitigation · 2039f26f
      Daniel Borkmann committed
      Spectre v4 gadgets make use of memory disambiguation, which is a set of
      techniques that execute memory access instructions, that is, loads and
      stores, out of program order; Intel's optimization manual, section 2.4.4.5:
      
        A load instruction micro-op may depend on a preceding store. Many
        microarchitectures block loads until all preceding store addresses are
        known. The memory disambiguator predicts which loads will not depend on
        any previous stores. When the disambiguator predicts that a load does
        not have such a dependency, the load takes its data from the L1 data
        cache. Eventually, the prediction is verified. If an actual conflict is
        detected, the load and all succeeding instructions are re-executed.
      
      af86ca4e ("bpf: Prevent memory disambiguation attack") tried to mitigate
      this attack by sanitizing the memory locations through preemptive "fast"
      (low latency) stores of zero prior to the actual "slow" (high latency) store
      of a pointer value such that upon dependency misprediction the CPU then
      speculatively executes the load of the pointer value and retrieves the zero
      value instead of the attacker controlled scalar value previously stored at
      that location, meaning, subsequent access in the speculative domain is then
      redirected to the "zero page".
      
      The sanitized preemptive store of zero prior to the actual "slow" store is
      done through a simple ST instruction based on r10 (frame pointer) with
      relative offset to the stack location that the verifier has been tracking
      on the original used register for STX, which does not have to be r10. Thus,
      there are no memory dependencies for this store, since it's only using r10
      and immediate constant of zero; hence af86ca4e /assumed/ a low latency
      operation.
      
      However, a recent attack demonstrated that this mitigation is not sufficient
      since the preemptive store of zero could also be turned into a "slow" store
      and is thus bypassed as well:
      
        [...]
        // r2 = oob address (e.g. scalar)
        // r7 = pointer to map value
        31: (7b) *(u64 *)(r10 -16) = r2
        // r9 will remain "fast" register, r10 will become "slow" register below
        32: (bf) r9 = r10
        // JIT maps BPF reg to x86 reg:
        //  r9  -> r15 (callee saved)
        //  r10 -> rbp
        // train store forward prediction to break dependency link between both r9
        // and r10 by evicting them from the predictor's LRU table.
        33: (61) r0 = *(u32 *)(r7 +24576)
        34: (63) *(u32 *)(r7 +29696) = r0
        35: (61) r0 = *(u32 *)(r7 +24580)
        36: (63) *(u32 *)(r7 +29700) = r0
        37: (61) r0 = *(u32 *)(r7 +24584)
        38: (63) *(u32 *)(r7 +29704) = r0
        39: (61) r0 = *(u32 *)(r7 +24588)
        40: (63) *(u32 *)(r7 +29708) = r0
        [...]
        543: (61) r0 = *(u32 *)(r7 +25596)
        544: (63) *(u32 *)(r7 +30716) = r0
        // prepare call to bpf_ringbuf_output() helper. the latter will cause rbp
        // to spill to stack memory while r13/r14/r15 (all callee saved regs) remain
        // in hardware registers. rbp becomes slow due to push/pop latency. below is
        // disasm of bpf_ringbuf_output() helper for better visual context:
        //
        // ffffffff8117ee20: 41 54                 push   r12
        // ffffffff8117ee22: 55                    push   rbp
        // ffffffff8117ee23: 53                    push   rbx
        // ffffffff8117ee24: 48 f7 c1 fc ff ff ff  test   rcx,0xfffffffffffffffc
        // ffffffff8117ee2b: 0f 85 af 00 00 00     jne    ffffffff8117eee0 <-- jump taken
        // [...]
        // ffffffff8117eee0: 49 c7 c4 ea ff ff ff  mov    r12,0xffffffffffffffea
        // ffffffff8117eee7: 5b                    pop    rbx
        // ffffffff8117eee8: 5d                    pop    rbp
        // ffffffff8117eee9: 4c 89 e0              mov    rax,r12
        // ffffffff8117eeec: 41 5c                 pop    r12
        // ffffffff8117eeee: c3                    ret
        545: (18) r1 = map[id:4]
        547: (bf) r2 = r7
        548: (b7) r3 = 0
        549: (b7) r4 = 4
        550: (85) call bpf_ringbuf_output#194288
        // instruction 551 inserted by verifier    \
        551: (7a) *(u64 *)(r10 -16) = 0            | /both/ are now slow stores here
        // storing map value pointer r7 at fp-16   | since value of r10 is "slow".
        552: (7b) *(u64 *)(r10 -16) = r7           /
        // following "fast" read to the same memory location, but due to dependency
        // misprediction it will speculatively execute before insn 551/552 completes.
        553: (79) r2 = *(u64 *)(r9 -16)
        // in speculative domain contains attacker controlled r2. in non-speculative
        // domain this contains r7, and thus accesses r7 +0 below.
        554: (71) r3 = *(u8 *)(r2 +0)
        // leak r3
      
      As can be seen, the current speculative store bypass mitigation which the
      verifier inserts at line 551 is insufficient since /both/ the zero
      sanitation write and the store of the map value pointer are high latency
      instructions due to the prior memory access via push/pop of r10 (rbp), in
      contrast to the low latency read in line 553 via r9 (r15), which stays in
      hardware registers. Thus, architecturally, fp-16 is r7; microarchitecturally,
      however, fp-16 can still be r2.
      
      The initial idea to address this issue was to track spilled pointer loads
      from stack and enforce their load via LDX through r10 as well so that /both/
      the preemptive store of zero /as well as/ the load use the /same/ register
      such that a dependency is created between the store and load. However, this
      option is not sufficient either since it can be bypassed as well under
      speculation. An updated attack with pointer spill/fills now _all_ based on
      r10 would look as follows:
      
        [...]
        // r2 = oob address (e.g. scalar)
        // r7 = pointer to map value
        [...]
        // longer store forward prediction training sequence than before.
        2062: (61) r0 = *(u32 *)(r7 +25588)
        2063: (63) *(u32 *)(r7 +30708) = r0
        2064: (61) r0 = *(u32 *)(r7 +25592)
        2065: (63) *(u32 *)(r7 +30712) = r0
        2066: (61) r0 = *(u32 *)(r7 +25596)
        2067: (63) *(u32 *)(r7 +30716) = r0
        // store the speculative load address (scalar) this time after the store
        // forward prediction training.
        2068: (7b) *(u64 *)(r10 -16) = r2
        // preoccupy the CPU store port by running sequence of dummy stores.
        2069: (63) *(u32 *)(r7 +29696) = r0
        2070: (63) *(u32 *)(r7 +29700) = r0
        2071: (63) *(u32 *)(r7 +29704) = r0
        2072: (63) *(u32 *)(r7 +29708) = r0
        2073: (63) *(u32 *)(r7 +29712) = r0
        2074: (63) *(u32 *)(r7 +29716) = r0
        2075: (63) *(u32 *)(r7 +29720) = r0
        2076: (63) *(u32 *)(r7 +29724) = r0
        2077: (63) *(u32 *)(r7 +29728) = r0
        2078: (63) *(u32 *)(r7 +29732) = r0
        2079: (63) *(u32 *)(r7 +29736) = r0
        2080: (63) *(u32 *)(r7 +29740) = r0
        2081: (63) *(u32 *)(r7 +29744) = r0
        2082: (63) *(u32 *)(r7 +29748) = r0
        2083: (63) *(u32 *)(r7 +29752) = r0
        2084: (63) *(u32 *)(r7 +29756) = r0
        2085: (63) *(u32 *)(r7 +29760) = r0
        2086: (63) *(u32 *)(r7 +29764) = r0
        2087: (63) *(u32 *)(r7 +29768) = r0
        2088: (63) *(u32 *)(r7 +29772) = r0
        2089: (63) *(u32 *)(r7 +29776) = r0
        2090: (63) *(u32 *)(r7 +29780) = r0
        2091: (63) *(u32 *)(r7 +29784) = r0
        2092: (63) *(u32 *)(r7 +29788) = r0
        2093: (63) *(u32 *)(r7 +29792) = r0
        2094: (63) *(u32 *)(r7 +29796) = r0
        2095: (63) *(u32 *)(r7 +29800) = r0
        2096: (63) *(u32 *)(r7 +29804) = r0
        2097: (63) *(u32 *)(r7 +29808) = r0
        2098: (63) *(u32 *)(r7 +29812) = r0
        // overwrite scalar with dummy pointer; same as before, also including the
        // sanitation store with 0 from the current mitigation by the verifier.
        2099: (7a) *(u64 *)(r10 -16) = 0         | /both/ are now slow stores here
        2100: (7b) *(u64 *)(r10 -16) = r7        | since store unit is still busy.
        // load from stack intended to bypass stores.
        2101: (79) r2 = *(u64 *)(r10 -16)
        2102: (71) r3 = *(u8 *)(r2 +0)
        // leak r3
        [...]
      
      Looking at the CPU microarchitecture, the scheduler might issue loads (such
      as seen in line 2101) before stores (line 2099,2100) because the load execution
      units become available while the store execution unit is still busy with the
      sequence of dummy stores (line 2069-2098). And so the load may use the prior
      stored scalar from r2 at address r10 -16 for speculation. The updated attack
      may work less reliably on CPU microarchitectures where loads and stores share
      execution resources.
      
      This concludes that the sanitizing with zero stores from af86ca4e ("bpf:
      Prevent memory disambiguation attack") is insufficient. Moreover, the detection
      of stack reuse from af86ca4e where previously data (STACK_MISC) has been
      written to a given stack slot where a pointer value is now to be stored does
      not have sufficient coverage as precondition for the mitigation either; for
      several reasons outlined as follows:
      
       1) Stack content from prior program runs could still be preserved and is
          therefore not "random", best example is to split a speculative store
          bypass attack between tail calls, program A would prepare and store the
          oob address at a given stack slot and then tail call into program B which
          does the "slow" store of a pointer to the stack with subsequent "fast"
          read. From program B PoV such stack slot type is STACK_INVALID, and
          therefore also must be subject to mitigation.
      
       2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr)
          condition, for example, the previous content of that memory location could
          also be a pointer to map or map value. Without the fix, a speculative
          store bypass is not mitigated in such precondition and can then lead to
          a type confusion in the speculative domain leaking kernel memory near
          these pointer types.
      
      While brainstorming on various alternative mitigation possibilities, we also
      stumbled upon a retrospective from Chrome developers [0]:
      
        [...] For variant 4, we implemented a mitigation to zero the unused memory
        of the heap prior to allocation, which cost about 1% when done concurrently
        and 4% for scavenging. Variant 4 defeats everything we could think of. We
        explored more mitigations for variant 4 but the threat proved to be more
        pervasive and dangerous than we anticipated. For example, stack slots used
        by the register allocator in the optimizing compiler could be subject to
        type confusion, leading to pointer crafting. Mitigating type confusion for
        stack slots alone would have required a complete redesign of the backend of
        the optimizing compiler, perhaps man years of work, without a guarantee of
        completeness. [...]
      
      From the BPF side, the problem space is reduced; however, options are rather
      limited. One idea that has been explored was to xor-obfuscate pointer spills
      to the BPF stack:
      
        [...]
        // preoccupy the CPU store port by running sequence of dummy stores.
        [...]
        2106: (63) *(u32 *)(r7 +29796) = r0
        2107: (63) *(u32 *)(r7 +29800) = r0
        2108: (63) *(u32 *)(r7 +29804) = r0
        2109: (63) *(u32 *)(r7 +29808) = r0
        2110: (63) *(u32 *)(r7 +29812) = r0
        // overwrite scalar with dummy pointer; xored with random 'secret' value
        // of 943576462 before store ...
        2111: (b4) w11 = 943576462
        2112: (af) r11 ^= r7
        2113: (7b) *(u64 *)(r10 -16) = r11
        2114: (79) r11 = *(u64 *)(r10 -16)
        2115: (b4) w2 = 943576462
        2116: (af) r2 ^= r11
        // ... and restored with the same 'secret' value with the help of AX reg.
        2117: (71) r3 = *(u8 *)(r2 +0)
        [...]
      
      While the above would not prevent speculation, it would make data leakage
      infeasible by directing it to random locations. In order to be effective
      and prevent type confusion under speculation, such random secret would have
      to be regenerated for each store. The additional complexity involved for a
      tracking mechanism that prevents jumps such that restoring spilled pointers
      would not get corrupted is not worth the gain for unprivileged. Hence, the
      fix in here eventually opted for emitting a non-public BPF_ST | BPF_NOSPEC
      instruction which the x86 JIT translates into a lfence opcode. Inserting the
      latter in between the store and load instruction is one of the mitigation
      options [1]. The x86 instruction manual notes:
      
        [...] An LFENCE that follows an instruction that stores to memory might
        complete before the data being stored have become globally visible. [...]
      
      The latter meaning that the preceding store instruction finished execution
      and the store is at minimum guaranteed to be in the CPU's store queue, but
      it's not guaranteed to be in that CPU's L1 cache at that point (globally
      visible). The latter would only be guaranteed via sfence. So the load which
      is guaranteed to execute after the lfence for that local CPU would have to
      rely on store-to-load forwarding. [2], in section 2.3 on store buffers says:
      
        [...] For every store operation that is added to the ROB, an entry is
        allocated in the store buffer. This entry requires both the virtual and
        physical address of the target. Only if there is no free entry in the store
        buffer, the frontend stalls until there is an empty slot available in the
        store buffer again. Otherwise, the CPU can immediately continue adding
        subsequent instructions to the ROB and execute them out of order. On Intel
        CPUs, the store buffer has up to 56 entries. [...]
      
      One small upside on the fix is that it lifts constraints from af86ca4e
      where the sanitize_stack_off relative to r10 must be the same when coming
      from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX
      or BPF_ST instruction. This happens either when we store a pointer or data
      value to the BPF stack for the first time, or upon later pointer spills.
      The former needs to be enforced since otherwise stale stack data could be
      leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST |
      BPF_NOSPEC mapping is currently optimized away, but others could emit a
      speculation barrier as well if necessary. For real-world unprivileged
      programs e.g. generated by LLVM, pointer spill/fill is only generated upon
      register pressure and LLVM only tries to do that for pointers which are not
      used often. The main impact on programs will be the initial BPF_ST | BPF_NOSPEC
      sanitation for the STACK_INVALID case when the first write to a stack slot
      occurs e.g. upon map lookup. In future we might refine ways to mitigate
      the latter cost.
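
The rule for when the barrier gets emitted can be sketched with a toy model.
This is a simplified illustration in Python, not the verifier's code: the
StackStore type, the slot-state names as plain strings, and the 8-byte slot
granularity are assumptions made for the example.

```python
from dataclasses import dataclass

# Toy slot states, named after the verifier's stack slot types.
STACK_INVALID = "invalid"   # slot never written by this program
STACK_MISC = "misc"         # slot holds scalar data
STACK_SPILL = "spill"       # slot holds a spilled pointer


@dataclass(frozen=True)
class StackStore:
    """Hypothetical stand-in for a BPF_STX/BPF_ST to the BPF stack."""
    slot: int          # stack offset, e.g. -16 for fp-16
    is_pointer: bool   # True when the stored register holds a pointer


def annotate_nospec(stores):
    """Return the store sequence with "nospec" markers inserted.

    A marker follows a store when the slot has not been written before
    (STACK_INVALID) or when the stored value is a pointer (a spill).
    """
    slot_state = {}
    out = []
    for st in stores:
        first_write = slot_state.get(st.slot, STACK_INVALID) == STACK_INVALID
        out.append(st)
        if first_write or st.is_pointer:
            out.append("nospec")
        slot_state[st.slot] = STACK_SPILL if st.is_pointer else STACK_MISC
    return out


program = [
    StackStore(slot=-16, is_pointer=False),  # first write: barrier
    StackStore(slot=-16, is_pointer=False),  # scalar over scalar: no barrier
    StackStore(slot=-16, is_pointer=True),   # pointer spill: barrier
]
annotated = annotate_nospec(program)
```

In this model only the first write to a slot and pointer spills pay for a
barrier, which matches the cost discussion above.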
      
        [0] https://arxiv.org/pdf/1902.05178.pdf
        [1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/
        [2] https://arxiv.org/pdf/1905.05725.pdf
      
      Fixes: af86ca4e ("bpf: Prevent memory disambiguation attack")
      Fixes: f7cf25b2 ("bpf: track spill/fill of constants")
      Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
      Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      2039f26f
    • D
      bpf: Introduce BPF nospec instruction for mitigating Spectre v4 · f5e81d11
      Daniel Borkmann committed
      In case of JITs, each of the JIT backends compiles the BPF nospec instruction
      /either/ to a machine instruction which emits a speculation barrier /or/ to
      /no/ machine instruction in case the underlying architecture is not affected
      by Speculative Store Bypass or has different mitigations in place already.
      
      This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
      instruction for mitigation. In case of arm64, we rely on the firmware mitigation
      as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
      it works for all of the kernel code with no need to provide any additional
      instructions here (hence only comment in arm64 JIT). Other archs can follow
      as needed. The BPF nospec instruction is specifically targeting Spectre v4
      since i) we don't use a serialization barrier for the Spectre v1 case, and
      ii) mitigation instructions for v1 and v4 might be different on some archs.
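
The per-architecture choice can be pictured as a small lookup. The sketch
below is only an illustration in Python: jit_nospec() and the architecture
strings are hypothetical names, and the only hard fact it relies on is the
x86 encoding of lfence (0F AE E8).

```python
# x86 machine encoding of the lfence instruction.
LFENCE = bytes([0x0F, 0xAE, 0xE8])


def jit_nospec(arch: str) -> bytes:
    """Return the machine bytes a toy JIT would emit for BPF nospec."""
    if arch == "x86_64":
        return LFENCE
    # arm64 relies on the firmware mitigation, so nothing is emitted.
    # Other architectures emit nothing in this sketch as well.
    return b""
```

For example, jit_nospec("x86_64") yields the three lfence bytes while
jit_nospec("arm64") yields an empty byte string.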
      
      The BPF nospec is required for a future commit, where the BPF verifier does
      annotate intermediate BPF programs with speculation barriers.
      Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
      Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      f5e81d11
  3. 28 Jul, 2021 24 commits
    • W
      sis900: Fix missing pci_disable_device() in probe and remove · 89fb62fd
      Wang Hai committed
      Replace pci_enable_device() with pcim_enable_device() so that
      pci_disable_device() and pci_release_regions() are called
      automatically on release.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wang Hai <wanghai38@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      89fb62fd
    • Z
      net: let flow have same hash in two directions · 1e60cebf
      zhang kai committed
      Use the same source and destination ip/port ordering for the flow hash
      calculation so that both directions of a flow get the same hash.
      Signed-off-by: zhang kai <zhangkaiheb@126.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e60cebf
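
One common way to get a direction-independent flow hash is to put the two
endpoints into a canonical order before hashing. The Python sketch below
illustrates that idea only; symmetric_flow_hash() is a hypothetical helper,
and SHA-256 is a stand-in for whatever hash the kernel actually uses.

```python
import hashlib


def symmetric_flow_hash(src_ip, src_port, dst_ip, dst_port, proto):
    """Hash a flow so that A->B and B->A produce the same 32-bit value."""
    # Canonical order: the smaller (ip, port) endpoint always goes first.
    lo, hi = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{proto}|{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


forward = symmetric_flow_hash("10.0.0.1", 12345, "10.0.0.2", 80, "tcp")
reverse = symmetric_flow_hash("10.0.0.2", 80, "10.0.0.1", 12345, "tcp")
```

Here forward and reverse are equal because both calls hash the same
canonical key.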
    • K
      nfc: nfcsim: fix use after free during module unload · 5e7b30d2
      Krzysztof Kozlowski committed
      There is a use after free memory corruption during module exit:
       - nfcsim_exit()
        - nfcsim_device_free(dev0)
          - nfc_digital_unregister_device()
            This iterates over command queue and frees all commands,
          - dev->up = false
          - nfcsim_link_shutdown()
            - nfcsim_link_recv_wake()
              This wakes the sleeping thread nfcsim_link_recv_skb().
      
       - nfcsim_link_recv_skb()
         Wake from wait_event_interruptible_timeout(),
         call directly the dev->cb callback even though (dev->up == false),
         - digital_send_cmd_complete()
           Dereference of "struct digital_cmd" cmd which was freed earlier by
           nfc_digital_unregister_device().
      
      This causes memory corruption shortly after (with unrelated stack
      trace):
      
        nfc nfc0: NFC: nfcsim_recv_wq: Device is down
        llcp: nfc_llcp_recv: err -19
        nfc nfc1: NFC: nfcsim_recv_wq: Device is down
        BUG: unable to handle page fault for address: ffffffffffffffed
        Call Trace:
         fsnotify+0x54b/0x5c0
         __fsnotify_parent+0x1fe/0x300
         ? vfs_write+0x27c/0x390
         vfs_write+0x27c/0x390
         ksys_write+0x63/0xe0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      KASAN report:
      
        BUG: KASAN: use-after-free in digital_send_cmd_complete+0x16/0x50
        Write of size 8 at addr ffff88800a05f720 by task kworker/0:2/71
        Workqueue: events nfcsim_recv_wq [nfcsim]
        Call Trace:
         dump_stack_lvl+0x45/0x59
         print_address_description.constprop.0+0x21/0x140
         ? digital_send_cmd_complete+0x16/0x50
         ? digital_send_cmd_complete+0x16/0x50
         kasan_report.cold+0x7f/0x11b
         ? digital_send_cmd_complete+0x16/0x50
         ? digital_dep_link_down+0x60/0x60
         digital_send_cmd_complete+0x16/0x50
         nfcsim_recv_wq+0x38f/0x3d5 [nfcsim]
         ? nfcsim_in_send_cmd+0x4a/0x4a [nfcsim]
         ? lock_is_held_type+0x98/0x110
         ? finish_wait+0x110/0x110
         ? rcu_read_lock_sched_held+0x9c/0xd0
         ? rcu_read_lock_bh_held+0xb0/0xb0
         ? lockdep_hardirqs_on_prepare+0x12e/0x1f0
      
      This flow of calling digital_send_cmd_complete() callback on driver exit
      is specific to nfcsim which implements reading and sending work queues.
      Since the NFC digital device was unregistered, the callback should not
      be called.
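
The ordering problem can be reproduced with a toy model. The Python sketch
below is an illustration only: ToyDevice and its methods are hypothetical,
and the "freed" flag merely stands in for memory released by
nfc_digital_unregister_device().

```python
class ToyDevice:
    """Toy model of a device whose receive worker completes a command."""

    def __init__(self):
        self.up = True
        self.pending_cmd = {"freed": False}
        self.completed = []

    def unregister(self):
        # Unregistering frees queued commands, then marks the device down.
        self.pending_cmd["freed"] = True
        self.up = False

    def recv_work(self, check_up):
        # With the check, a woken worker bails out once the device is down.
        if check_up and not self.up:
            return "device down"
        if self.pending_cmd["freed"]:
            raise RuntimeError("use after free")
        self.completed.append(self.pending_cmd)
        return "completed"


fixed = ToyDevice()
fixed.unregister()
result = fixed.recv_work(check_up=True)
```

With check_up=True the worker returns early. With check_up=False the same
sequence touches the freed command, which is the situation KASAN reported.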
      
      Fixes: 204bddcb ("NFC: nfcsim: Make use of the Digital layer")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e7b30d2
    • W
      tulip: windbond-840: Fix missing pci_disable_device() in probe and remove · 76a16be0
      Wang Hai committed
      Replace pci_enable_device() with pcim_enable_device() so that
      pci_disable_device() and pci_release_regions() are called
      automatically on release.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wang Hai <wanghai38@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76a16be0
    • M
      sctp: fix return value check in __sctp_rcv_asconf_lookup · 557fb586
      Marcelo Ricardo Leitner committed
      As Ben Hutchings noticed, this check should have been inverted: the call
      returns true in case of success.
      Reported-by: Ben Hutchings <ben@decadent.org.uk>
      Fixes: 0c5dc070 ("sctp: validate from_addr_param return")
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      557fb586
    • T
      nfc: s3fwrn5: fix undefined parameter values in dev_err() · 46573e3a
      Tang Bin committed
      In the function s3fwrn5_fw_download(), 'ret' is not assigned, so the
      correct error value should be passed to dev_err().
      
      Fixes: a0302ff5 ("nfc: s3fwrn5: remove unnecessary label")
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46573e3a
    • D
      Merge tag 'mlx5-fixes-2021-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 9d0279d0
      David S. Miller committed
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2021-07-27
      
      This series introduces some fixes to mlx5 driver.
      Please pull and let me know if there is any problem.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d0279d0
    • C
      net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32 · 740452e0
      Chris Mi committed
      The offending refactor commit uses u16 chain wrongly. Actually, it
      should be u32.
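
The effect of storing a 32-bit chain number in a 16-bit field can be shown
with plain masking. This Python sketch is only an illustration: as_u16()
and as_u32() are hypothetical helpers and the chain value is made up.

```python
def as_u16(value: int) -> int:
    """Truncate to 16 bits, as storing into a u16 field would."""
    return value & 0xFFFF


def as_u32(value: int) -> int:
    """Truncate to 32 bits, as storing into a u32 field would."""
    return value & 0xFFFFFFFF


chain = 0x10001               # made-up chain number above the u16 range
truncated = as_u16(chain)     # upper bits are lost
preserved = as_u32(chain)     # value survives intact
```

With a u16 field, chain 0x10001 becomes indistinguishable from chain 1.
With a u32 field the two stay distinct.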
      
      Fixes: c620b772 ("net/mlx5: Refactor tc flow attributes structure")
      CC: Ariel Levkovich <lariel@nvidia.com>
      Signed-off-by: Chris Mi <cmi@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      740452e0
    • D
      net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev() · b1c2f631
      Dima Chumak committed
      The result of __dev_get_by_index() is not checked for NULL and then gets
      dereferenced immediately.
      
      Also, __dev_get_by_index() must be called while holding either RTNL lock
      or @dev_base_lock, which isn't satisfied by mlx5e_hairpin_get_mdev() or
      its callers. This makes the underlying hlist_for_each_entry() loop not
      safe, and can have adverse effects in itself.
      
      Fix by using dev_get_by_index() and handling nullptr return value when
      ifindex device is not found. Update mlx5e_hairpin_get_mdev() callers to
      check for possible PTR_ERR() result.
      
      Fixes: 77ab67b7 ("net/mlx5e: Basic setup of hairpin object")
      Addresses-Coverity: ("Dereference null return value")
      Signed-off-by: Dima Chumak <dchumak@nvidia.com>
      Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      b1c2f631
    • A
      net/mlx5: Unload device upon firmware fatal error · 7f331bf0
      Aya Levin committed
      When fw_fatal reporter reports an error, the firmware is not responding.
      Unload the device to ensure that the driver closes all its resources,
      even if recovery is not due (user disabled auto-recovery or reporter is
      in grace period). On successful recovery the device is loaded back up.
      
      Fixes: b3bd076f ("net/mlx5: Report devlink health on FW fatal issues")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      7f331bf0
    • A
      net/mlx5e: Fix page allocation failure for ptp-RQ over SF · 678b1ae1
      Aya Levin committed
      Set the correct pci-device pointer to the ptp-RQ. This allows access to
      dma_mask and avoids allocation request with wrong pci-device.
      
      Fixes: a099da8f ("net/mlx5e: Add RQ to PTP channel")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      678b1ae1
    • A
      net/mlx5e: Fix page allocation failure for trap-RQ over SF · 497008e7
      Aya Levin committed
      Set the correct device pointer to the trap-RQ, to allow access to
      dma_mask and avoid allocation request with the wrong pci-dev.
      
      WARNING: CPU: 1 PID: 12005 at kernel/dma/mapping.c:151 dma_map_page_attrs+0x139/0x1c0
      ...
      Call Trace:
      <IRQ>
      ? __page_pool_alloc_pages_slow+0x5a/0x210
      mlx5e_post_rx_wqes+0x258/0x400 [mlx5_core]
      mlx5e_trap_napi_poll+0x44/0xc0 [mlx5_core]
      __napi_poll+0x24/0x150
      net_rx_action+0x22b/0x280
      __do_softirq+0xc7/0x27e
      do_softirq+0x61/0x80
      </IRQ>
      __local_bh_enable_ip+0x4b/0x50
      mlx5e_handle_action_trap+0x2dd/0x4d0 [mlx5_core]
      blocking_notifier_call_chain+0x5a/0x80
      mlx5_devlink_trap_action_set+0x8b/0x100 [mlx5_core]
      
      Fixes: 5543e989 ("net/mlx5e: Add trap entity to ETH driver")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      497008e7
    • A
      net/mlx5e: Consider PTP-RQ when setting RX VLAN stripping · a759f845
      Aya Levin committed
      Add PTP-RQ to the loop when setting rx-vlan-offload feature via ethtool.
      On PTP-RQ's creation, set rx-vlan-offload into its parameters.
      
      Fixes: a099da8f ("net/mlx5e: Add RQ to PTP channel")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      a759f845
    • M
      net/mlx5e: Add NETIF_F_HW_TC to hw_features when HTB offload is available · 9841d58f
      Maxim Mikityanskiy committed
      If a feature flag is only present in features, but not in hw_features,
      the user can't reset it. Although hw_features may contain NETIF_F_HW_TC
      by the point where the driver checks whether HTB offload is supported,
      this flag is controlled by another condition that may not hold. Set it
      explicitly to make sure the user can disable it.
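
The relation between features and hw_features can be sketched as bitmask
arithmetic. The Python below is an illustration under assumptions:
apply_user_request() is a hypothetical helper, and the bit position chosen
for NETIF_F_HW_TC is made up for the example.

```python
NETIF_F_HW_TC = 1 << 0  # hypothetical bit position, for illustration only


def apply_user_request(features: int, hw_features: int, wanted: int) -> int:
    """Return the new feature set after a user request.

    Bits outside hw_features keep their current value. Only bits inside
    hw_features follow what the user asked for.
    """
    return (features & ~hw_features) | (wanted & hw_features)


# Flag set in features but missing from hw_features: the request is ignored.
stuck = apply_user_request(NETIF_F_HW_TC, hw_features=0, wanted=0)

# Flag also present in hw_features: the user can turn it off.
cleared = apply_user_request(NETIF_F_HW_TC, hw_features=NETIF_F_HW_TC, wanted=0)
```

In the first call the flag stays set even though the user asked to clear
it. In the second call it is cleared.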
      
      Fixes: 214baf22 ("net/mlx5e: Support HTB offload")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      9841d58f
    • T
      net/mlx5e: RX, Avoid possible data corruption when relaxed ordering and LRO combined · e2351e51
      Tariq Toukan committed
      When HW aggregates packets for an LRO session, it writes the payload
      of two consecutive packets of a flow contiguously, so that they usually
      share a cacheline.
      
      The first byte of a packet's payload is written immediately after
      the last byte of the preceding packet.
      In this flow, there are two consecutive write requests to the shared
      cacheline:
      1. Regular write for the earlier packet.
      2. Read-modify-write for the following packet.
      
      In case of relaxed-ordering on, these two writes might be re-ordered.
      Using the end padding optimization (to avoid partial write for the last
      cacheline of a packet) becomes problematic if the two writes occur
      out-of-order, as the padding would overwrite payload that belongs to
      the following packet, causing data corruption.
      
      Avoid this by disabling the end padding optimization when both
      LRO and relaxed-ordering are enabled.
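
The corruption can be reproduced in a toy model of the receive buffer. The
Python sketch below is an illustration only: the 64-byte cacheline, the
packet sizes, and the helper names are assumptions, and zero bytes stand in
for whatever the hardware writes as end padding.

```python
CACHELINE = 64  # assumed cacheline size for this model


def write_packet(buf, offset, payload, pad_to_cacheline):
    """Write payload at offset; optionally zero-pad up to the cacheline end."""
    end = offset + len(payload)
    buf[offset:end] = payload
    if pad_to_cacheline:
        padded_end = -(-end // CACHELINE) * CACHELINE  # round up
        buf[end:padded_end] = bytes(padded_end - end)
    return end


def aggregate(order, pad):
    """Apply the two writes in the given order; return the second payload."""
    buf = bytearray(256)
    first = b"A" * 100
    second = b"B" * 50
    writes = {"first": (0, first), "second": (len(first), second)}
    for name in order:
        offset, data = writes[name]
        # Only the earlier packet's tail is padded in this toy model.
        write_packet(buf, offset, data, pad and name == "first")
    return bytes(buf[100:150])


in_order = aggregate(["first", "second"], pad=True)
reordered = aggregate(["second", "first"], pad=True)
no_padding = aggregate(["second", "first"], pad=False)
```

In order, the second payload lands after the padding and stays intact. When
the writes are swapped, the first packet's padding wipes the start of the
second payload. Without padding, the order no longer matters.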
      
      Fixes: 17347d54 ("net/mlx5e: Add support for PCI relaxed ordering")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      e2351e51
    • R
      net/mlx5: E-Switch, handle devcom events only for ports on the same device · dd3fddb8
      Roi Dayan committed
      This is the same check that LAG mode uses to decide whether to enable lag.
      It fixes adding peer miss rules when lag is not supported, and even
      incorrect rules in socket direct mode.
      
      Also fix the incorrect comment on mlx5_get_next_phys_dev(), as flow #1
      doesn't exist.
      
      Fixes: ac004b83 ("net/mlx5e: E-Switch, Add peer miss rules")
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Reviewed-by: Maor Dickman <maord@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      dd3fddb8
    • M
      net/mlx5: E-Switch, Set destination vport vhca id only when merged eswitch is supported · c6719725
      Maor Dickman committed
      The "destination vport vhca id is valid" flag is set only when merged
      eswitch is supported. Change the destination vport vhca id value so that
      it is likewise set only when merged eswitch is supported.
      
      Fixes: e4ad91f2 ("net/mlx5e: Split offloaded eswitch TC rules for port mirroring")
      Signed-off-by: Maor Dickman <maord@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      c6719725
    • M
      net/mlx5e: Disable Rx ntuple offload for uplink representor · 90b22b9b
      Maor Dickman committed
      Rx ntuple offload is not supported in switchdev mode.
      Trying to enable it causes a kernel panic.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000008
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 80000001065a5067 P4D 80000001065a5067 PUD 106594067 PMD 0
       Oops: 0000 [#1] SMP PTI
       CPU: 7 PID: 1089 Comm: ethtool Not tainted 5.13.0-rc7_for_upstream_min_debug_2021_06_23_16_44 #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:mlx5e_arfs_enable+0x70/0xd0 [mlx5_core]
       Code: 44 24 10 00 00 00 00 48 c7 44 24 18 00 00 00 00 49 63 c4 48 89 e2 44 89 e6 48 69 c0 20 08 00 00 48 89 ef 48 03 85 68 ac 00 00 <48> 8b 40 08 48 89 44 24 08 e8 d2 aa fd ff 48 83 05 82 96 18 00 01
       RSP: 0018:ffff8881047679e0 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: 0000004000000000 RCX: 0000004000000000
       RDX: ffff8881047679e0 RSI: 0000000000000000 RDI: ffff888115100880
       RBP: ffff888115100880 R08: ffffffffa00f6cb0 R09: ffff888104767a18
       R10: ffff8881151000a0 R11: ffff888109479540 R12: 0000000000000000
       R13: ffff888104767bb8 R14: ffff888115100000 R15: ffff8881151000a0
       FS:  00007f41a64ab740(0000) GS:ffff8882f5dc0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 0000000104cbc005 CR4: 0000000000370ea0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        set_feature_arfs+0x1e/0x40 [mlx5_core]
        mlx5e_handle_feature+0x43/0xa0 [mlx5_core]
        mlx5e_set_features+0x139/0x1b0 [mlx5_core]
        __netdev_update_features+0x2b3/0xaf0
        ethnl_set_features+0x176/0x3a0
        ? __nla_parse+0x22/0x30
        genl_family_rcv_msg_doit+0xe2/0x140
        genl_rcv_msg+0xde/0x1d0
        ? features_reply_size+0xe0/0xe0
        ? genl_get_cmd+0xd0/0xd0
        netlink_rcv_skb+0x4e/0xf0
        genl_rcv+0x24/0x40
        netlink_unicast+0x1f6/0x2b0
        netlink_sendmsg+0x225/0x450
        sock_sendmsg+0x33/0x40
        __sys_sendto+0xd4/0x120
        ? __sys_recvmsg+0x4e/0x90
        ? exc_page_fault+0x219/0x740
        __x64_sys_sendto+0x25/0x30
        do_syscall_64+0x3f/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f41a65b0cba
       Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
       RSP: 002b:00007ffd8d688358 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
       RAX: ffffffffffffffda RBX: 00000000010f42a0 RCX: 00007f41a65b0cba
       RDX: 0000000000000058 RSI: 00000000010f43b0 RDI: 0000000000000003
       RBP: 000000000047ae60 R08: 00007f41a667c000 R09: 000000000000000c
       R10: 0000000000000000 R11: 0000000000000246 R12: 00000000010f4340
       R13: 00000000010f4350 R14: 00007ffd8d688400 R15: 00000000010f42a0
       Modules linked in: mlx5_vdpa vhost_iotlb vdpa xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core ptp pps_core fuse
       CR2: 0000000000000008
       ---[ end trace c66523f2aba94b43 ]---
      
      Fixes: 7a9fb35e ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode")
      Signed-off-by: Maor Dickman <maord@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      90b22b9b
    • M
      net/mlx5: Fix flow table chaining · 8b54874e
      Maor Gottlieb committed
      Fix a bug that occurs when a flow table is created in a priority that
      already has other flow tables, as shown in the diagram below.
      If the new flow table (FT-B) has the lowest level in the priority,
      we need to connect the flow tables from the previous priority (p0)
      to this new table. In addition, when this flow table (FT-B) is
      destroyed, we need to connect the flow tables from the previous
      priority (p0) to the next level flow table (FT-C) in the same
      priority as the destroyed table (if one exists).
      
                             ---------
                             |root_ns|
                             ---------
                                  |
                  --------------------------------
                  |               |              |
             ----------      ----------      ---------
             |p(prio)-x|     |   p-y  |      |   p-n |
             ----------      ----------      ---------
                  |               |
           ----------------  ------------------
           |ns(e.g bypass)|  |ns(e.g. kernel) |
           ----------------  ------------------
                  |            |           |
              -------        ------       ----
              |  p0 |        | p1 |       |p2|
              -------        ------       ----
                 |             |    \
              --------       ------- ------
              | FT-A |       |FT-B | |FT-C|
              --------       ------- ------
      
      Fixes: f90edfd2 ("net/mlx5_core: Connect flow tables")
      Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      8b54874e
    • A
      Merge branch 'sockmap fixes picked up by stress tests' · f1fdee33
      Andrii Nakryiko committed
      John Fastabend says:
      
      ====================
      
      Running stress tests with recent patch to remove an extra lock in sockmap
      resulted in a couple new issues popping up. It seems only one of them
      is actually related to the patch:
      
      799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
      
      The other two issues had existed long before, but I guess the timing
      with the serialization we had before was too tight to get any of
      our tests or deployments to hit it.
      
      With the attached series, stress testing sockmap+TCP with workloads that
      create lots of short-lived connections no longer produces splats like the
      one below on the upstream bpf branch.
      
      [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
      [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
      [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
      [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
      [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
      [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
      [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
      [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
      [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
      [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
      [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
      [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
      [224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
      [224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
      [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [224913.936007] Call Trace:
      [224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
      [224913.936033]  __tcp_close+0x620/0x790
      [224913.936047]  tcp_close+0x20/0x80
      [224913.936056]  inet_release+0x8f/0xf0
      [224913.936070]  __sock_release+0x72/0x120
      
      v3: make sock_drop inline in skmsg.h
      v2: init skb to null and fix a space/tab issue. Added Jakub's acks.
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      f1fdee33
    • J
      bpf, sockmap: Fix memleak on ingress msg enqueue · 9635720b
      John Fastabend committed
      If the backlog handler is running during a tear down operation, we may
      enqueue data on the ingress msg queue while tear down is trying to free it.
      
       sk_psock_backlog()
         sk_psock_handle_skb()
           skb_psock_skb_ingress()
             sk_psock_skb_ingress_enqueue()
               sk_psock_queue_msg(psock,msg)
                                                 spin_lock(ingress_lock)
                                                  sk_psock_zap_ingress()
                                                   _sk_psock_purge_ingerss_msg()
                                                    _sk_psock_purge_ingress_msg()
                                                  -- free ingress_msg list --
                                                 spin_unlock(ingress_lock)
                 spin_lock(ingress_lock)
                 list_add_tail(msg,ingress_msg) <- entry on list with no one
                                                   left to free it.
                 spin_unlock(ingress_lock)
      
      To fix this, we only enqueue from the backlog if the ENABLED bit is set.
      The tear down logic clears the bit with ingress_lock held, so we won't
      enqueue the msg in the last step.
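
The locking rule described above can be sketched in user space. This is only an illustration under stated assumptions: `struct example_psock` and the `example_*` functions are hypothetical stand-ins, a pthread mutex plays the role of the ingress_lock spinlock, and a counter plays the role of the ingress_msg list. It is not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the psock ingress state; not the kernel struct. */
struct example_psock {
	pthread_mutex_t ingress_lock; /* models the ingress_lock spinlock */
	bool enabled;                 /* models the ENABLED bit */
	size_t queued;                /* models the length of the ingress_msg list */
};

static void example_psock_init(struct example_psock *p)
{
	int rc = pthread_mutex_init(&p->ingress_lock, NULL);

	assert(rc == 0);
	(void)rc;
	p->enabled = true;
	p->queued = 0;
}

/*
 * Backlog side: enqueue only while the ENABLED bit is still set.
 * Returns false when nothing was queued, in which case the caller
 * keeps ownership of the message and must free it itself.
 */
static bool example_queue_msg(struct example_psock *p)
{
	bool queued = false;

	pthread_mutex_lock(&p->ingress_lock);
	if (p->enabled) {
		p->queued++;
		queued = true;
	}
	pthread_mutex_unlock(&p->ingress_lock);
	return queued;
}

/* Tear-down side: clear the bit and purge the list under the same lock. */
static void example_teardown(struct example_psock *p)
{
	pthread_mutex_lock(&p->ingress_lock);
	p->enabled = false;
	p->queued = 0;
	pthread_mutex_unlock(&p->ingress_lock);
}
```

Because the flag check and the list insertion happen under the same lock that tear down takes to clear the flag, an enqueue attempted after `example_teardown()` returns false instead of leaving an entry with no one left to free it.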
      
      Fixes: 799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210727160500.1713554-4-john.fastabend@gmail.com
      9635720b
    • J
      bpf, sockmap: On cleanup we additionally need to remove cached skb · 476d9801
      John Fastabend committed
      If a socket is closed while the receive thread is under memory
      pressure, the thread may have cached an skb. We need to ensure these
      skbs are freed along with the normal ingress_skb queue.
      
      Before 799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear
      down and backlog processing both held sock_lock for the common case of
      socket close or unhash. The two could therefore never run in parallel,
      so all those kernels would need is the kfree.
      
      The latest kernels include commit 799aa7f98d5e, and this requires a
      bit more work. Without ingress_lock guarding reads and writes of
      state->skb, two races are possible. The tear down could run before the
      state update, leaking memory. Worse, the backlog could read the state
      while interleaved with the tear down, so that tear down frees
      state->skb while the backlog side already holds a reference to it. To
      resolve such races we wrap the accesses in ingress_lock on both sides,
      serializing the tear down and backlog cases. In both cases this only
      happens after an EAGAIN error, so having an extra lock in place
      is likely fine. The normal path will skip the locks.
      
      Note, we check state->skb before grabbing the lock. This works because
      we can only enqueue with the mutex we already hold, which avoids a race
      on adding state->skb after the check. If the tear down path is running
      at the same time, that is also fine: if it removes state->skb, we
      simply set skb=NULL and the subsequent goto is skipped. This
      slight complication avoids locking in the normal case.
      
      With this fix we no longer see this warning splat from tcp side on
      socket close when we hit the above case with redirect to ingress self.
      
      [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
      [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
      [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
      [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
      [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
      [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
      [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
      [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
      [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
      [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
      [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
      [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
      [224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
      [224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
      [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [224913.936007] Call Trace:
      [224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
      [224913.936033]  __tcp_close+0x620/0x790
      [224913.936047]  tcp_close+0x20/0x80
      [224913.936056]  inet_release+0x8f/0xf0
      [224913.936070]  __sock_release+0x72/0x120
      [224913.936083]  sock_close+0x14/0x20
      
      Fixes: a136678c ("bpf: sk_msg, zap ingress queue on psock down")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210727160500.1713554-3-john.fastabend@gmail.com
      476d9801
    • J
      bpf, sockmap: Zap ingress queues after stopping strparser · 343597d5
      John Fastabend committed
      We don't want strparser to run and pass skbs into skmsg handlers when
      the psock is null. We just sk_drop them in this case. When removing
      a live socket from a map, this means extra drops that we do not need to
      incur. Move the zap below the strparser close to avoid this condition.
      
      This way we first stop the stream parser, preventing it from processing
      packets, and then delete the psock.
      
      Fixes: a136678c ("bpf: sk_msg, zap ingress queue on psock down")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210727160500.1713554-2-john.fastabend@gmail.com
      343597d5
    • Y
      net: hns3: change the method of obtaining default ptp cycle · 8373cd38
      Yufeng Mo committed
      The ptp cycle is related to the hardware, so it may cause compatibility
      issues if a fixed value is used in driver. Therefore, the method of
      obtaining this value is changed to read from the register rather than
      use a fixed value in driver.
      
      Fixes: 0bf5eb78 ("net: hns3: add support for PTP")
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8373cd38
  4. 27 Jul, 2021 4 commits
    • T
      nfc: s3fwrn5: fix undefined parameter values in dev_err() · 801e541c
      Tang Bin committed
      In the function s3fwrn5_fw_download(), 'ret' is not assigned,
      so the correct value should be passed to the dev_err() function.
      
      Fixes: a0302ff5 ("nfc: s3fwrn5: remove unnecessary label")
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      801e541c
    • P
      net: llc: fix skb_over_panic · c7c9d210
      Pavel Skripkin committed
      Syzbot reported skb_over_panic() in llc_pdu_init_as_xid_cmd(). The
      problem was incorrect LLC header manipulation.
      
      Syzbot's reproducer tries to send an XID packet. llc_ui_sendmsg()
      performs the following steps:
      
      	1. skb allocation with size = len + header size
      		len is passed from userspace and header size
      		is 3 since addr->sllc_xid is set.
      
      	2. skb_reserve() for header_len = 3
      	3. filling all other space with memcpy_from_msg()
      
      At this point the skb is fully loaded; only the headers still need to
      be filled in.
      
      Then the code reaches llc_sap_action_send_xid_c(). This function pushes 3
      bytes for the LLC PDU header and initializes it. Next comes
      llc_pdu_init_as_xid_cmd(). It initializes the next 3 bytes *AFTER* the
      LLC PDU header and calls skb_push(skb, 3). This looks wrong for 2 reasons:
      
      	1. The bytes right after the LLC header are user data, so this
      	   function was overwriting the payload.
      
      	2. The skb_push(skb, 3) call can cause skb_over_panic() since
      	   all free space was filled in llc_ui_sendmsg(). (This can
      	   happen if the user passes len 686: 686 + 14 (eth header) + 3 (LLC
      	   header) = 703. SKB_DATA_ALIGN(703) = 704)
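
The arithmetic in point 2 can be checked with a small user-space sketch. All names below are hypothetical, and the 64-byte alignment is an assumption: in the kernel, SKB_DATA_ALIGN rounds up to SMP_CACHE_BYTES, which depends on the architecture. The sketch only models buffer sizes, not real skb handling.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size; the kernel uses SMP_CACHE_BYTES instead. */
#define EXAMPLE_CACHE_BYTES 64u

#define EXAMPLE_ETH_HLEN 14u /* Ethernet header */
#define EXAMPLE_LLC_HLEN 3u  /* LLC PDU header */
#define EXAMPLE_XID_LEN  3u  /* extra XID bytes added afterwards */

/* Round len up to a multiple of align (align must be a power of two). */
static size_t example_data_align(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}

/* Bytes left in the aligned buffer once headers and payload are in place. */
static size_t example_spare_room(size_t payload_len)
{
	size_t used = payload_len + EXAMPLE_ETH_HLEN + EXAMPLE_LLC_HLEN;

	return example_data_align(used, EXAMPLE_CACHE_BYTES) - used;
}

/* Nonzero if growing the packet by the XID bytes would exceed the buffer. */
static int example_xid_grow_overflows(size_t payload_len)
{
	return example_spare_room(payload_len) < EXAMPLE_XID_LEN;
}
```

With a payload length of 686, `example_spare_room()` yields 1 byte, so 3 more bytes cannot fit. Many other lengths leave enough alignment slack, which is one plausible reason the problem only shows up for particular sizes.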
      
      So, in this patch I added 2 new private constants: LLC_PDU_TYPE_U_XID
      and LLC_PDU_LEN_U_XID. LLC_PDU_LEN_U_XID is used to correctly reserve
      header size to handle the LLC + XID case. LLC_PDU_TYPE_U_XID is used by
      the llc_pdu_header_init() function to push 6 bytes instead of 3. And
      finally I removed the skb_push() call from llc_pdu_init_as_xid_cmd().
      
      These changes should not affect other parts of LLC, since after
      all steps we just transmit the buffer.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-and-tested-by: syzbot+5e5a981ad7cc54c4b2b4@syzkaller.appspotmail.com
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7c9d210
    • S
      octeontx2-af: Do NIX_RX_SW_SYNC twice · fcef709c
      Sunil Goutham committed
      NIX_RX_SW_SYNC ensures all existing transactions are finished and
      pkts are written to LLC/DRAM; queues should be torn down after a
      successful SW_SYNC. Due to a HW erratum, in some rare scenarios
      an existing transaction might end after the SW_SYNC operation. To
      ensure the operation is fully done, do the SW_SYNC twice.
      Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fcef709c
    • S
      bnxt_en: Fix static checker warning in bnxt_fw_reset_task() · 758684e4
      Somnath Kotur committed
      Now that we return when bnxt_open() fails in bnxt_fw_reset_task(),
      there is no need to check for 'rc' value again before invoking
      bnxt_reenable_sriov().
      
      Fixes: 3958b1da ("bnxt_en: fix error path of FW reset")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      758684e4
  5. 26 Jul, 2021 6 commits
    • L
      net/qla3xxx: fix schedule while atomic in ql_wait_for_drvr_lock and ql_adapter_reset · 92766c46
      Letu Ren committed
      When calling the 'ql_wait_for_drvr_lock' and 'ql_adapter_reset', the driver
      has already acquired the spin lock, so the driver should not call 'ssleep'
      in atomic context.
      
      This bug can be fixed by using 'mdelay' instead of 'ssleep'.
      Reported-by: Letu Ren <fantasquex@gmail.com>
      Signed-off-by: Letu Ren <fantasquex@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92766c46
    • C
      sctp: delete addr based on sin6_scope_id · 2ebda027
      Chen Shen committed
      sctp_inet6addr_event deletes 'addr' from 'local_addr_list' when a
      netdev is set down, but it can delete the wrong entry (it matches
      the first one with the same ipaddr but a different 'ifindex') if
      several netdevs with the same link-local ipaddr have already been added.
      It should delete the entry based on both 'sin6_addr' and
      'sin6_scope_id'. Otherwise, the endpoint will call 'sctp_sf_ootb' if it
      can't find the corresponding association when it receives a 'heartbeat',
      and will finally reply with 'abort'.
      
      For example:
      1.when Linux starts up
      the entries in local_addr_list:
      ifindex:35 addr:fe80::40:43ff:fe80:0 (eths0.201)
      ifindex:36 addr:fe80::40:43ff:fe80:0 (eths0.209)
      ifindex:37 addr:fe80::40:43ff:fe80:0 (eths0.210)
      
      the route table:
      local fe80::40:43ff:fe80:0 dev eths0.201
      local fe80::40:43ff:fe80:0 dev eths0.209
      local fe80::40:43ff:fe80:0 dev eths0.210
      
      2.after 'ifconfig eths0.209 down'
      the entries in local_addr_list:
      ifindex:36 addr:fe80::40:43ff:fe80:0 (eths0.209)
      ifindex:37 addr:fe80::40:43ff:fe80:0 (eths0.210)
      
      the route table:
      local fe80::40:43ff:fe80:0 dev eths0.201
      local fe80::40:43ff:fe80:0 dev eths0.210
      
      3.asoc not found for src:[fe80::40:43ff:fe80:0]:37381 dst:[:1]:53335
      ::1->fe80::40:43ff:fe80:0 HEARTBEAT
      fe80::40:43ff:fe80:0->::1 ABORT
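
The difference between the two matching rules can be sketched in user space. The struct and function names below are hypothetical simplifications, not the SCTP data structures; the sketch rebuilds the scenario above, with the same link-local address on ifindex 35, 36 and 37 and the device with ifindex 36 going down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one local_addr_list entry. */
struct example_local_addr {
	unsigned char addr[16]; /* models sin6_addr */
	unsigned int scope_id;  /* models sin6_scope_id (the ifindex) */
};

/* Old behaviour: the first entry with the same address wins. */
static int example_match_addr_only(const struct example_local_addr *list,
				   size_t n,
				   const struct example_local_addr *key)
{
	for (size_t i = 0; i < n; i++)
		if (memcmp(list[i].addr, key->addr, sizeof(key->addr)) == 0)
			return (int)i;
	return -1;
}

/* Fixed behaviour: address and scope id must both match. */
static int example_match_addr_and_scope(const struct example_local_addr *list,
					size_t n,
					const struct example_local_addr *key)
{
	for (size_t i = 0; i < n; i++)
		if (memcmp(list[i].addr, key->addr, sizeof(key->addr)) == 0 &&
		    list[i].scope_id == key->scope_id)
			return (int)i;
	return -1;
}

/* Returns the ifindex of the entry that would be removed for ifindex 36. */
static unsigned int example_removed_ifindex(int use_scope)
{
	/* fe80::40:43ff:fe80:0 in network byte order */
	static const unsigned char ll[16] = {
		0xfe, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x40, 0x43, 0xff, 0xfe, 0x80, 0x00, 0x00,
	};
	struct example_local_addr list[3];
	struct example_local_addr key;
	int idx;

	for (size_t i = 0; i < 3; i++) {
		memcpy(list[i].addr, ll, sizeof(ll));
		list[i].scope_id = 35 + (unsigned int)i;
	}
	memcpy(key.addr, ll, sizeof(ll));
	key.scope_id = 36;

	idx = use_scope ? example_match_addr_and_scope(list, 3, &key)
			: example_match_addr_only(list, 3, &key);
	assert(idx >= 0);
	return list[idx].scope_id;
}
```

Matching on the address alone removes the ifindex 35 entry, which corresponds to the stale list shown in step 2; matching on both fields removes the ifindex 36 entry as intended.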
      Signed-off-by: Chen Shen <peterchenshen@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2ebda027
    • M
      net: stmmac: add est_irq_status callback function for GMAC 4.10 and 5.10 · 94cbe7db
      Mohammad Athari Bin Ismail committed
      Assign dwmac5_est_irq_status to est_irq_status callback function for
      GMAC 4.10 and 5.10. With this, EST related interrupts could be handled
      properly.
      
      Fixes: e49aa315 ("net: stmmac: EST interrupts handling and error reporting")
      Cc: <stable@vger.kernel.org> # 5.13.x
      Signed-off-by: Mohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com>
      Acked-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94cbe7db
    • D
      Merge branch 'sctp-pmtu-probe' · 832df96d
      David S. Miller committed
      Xin Long says:
      
      ====================
      sctp: improve the pmtu probe in Search Complete state
      
      Timo recently suggested using the loss of (data) packets as the
      indication to send a pmtu probe in Search Complete state, which
      should also be implied by RFC8899. This patchset changes
      the current behavior, which probes with the current pmtu all the
      time.
      
      v1->v2:
        - see Patch 2/2.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      832df96d
    • X
      sctp: send pmtu probe only if packet loss in Search Complete state · eacf078c
      Xin Long committed
      This patch introduces last_rtx_chunks into sctp_transport to detect
      whether any packet retransmission/loss has happened, by checking it
      against the asoc's rtx_data_chunks in sctp_transport_pl_send().
      
      If it has, namely, transport->last_rtx_chunks != asoc->rtx_data_chunks,
      the pmtu probe will be sent out. Otherwise, when in Search Complete
      state, increment pl.raise_count and return.
      
      With this patch, Search Complete state, which lasts a long period,
      no longer needs to keep probing the current pmtu unless there is data
      packet loss. This will save quite some traffic.
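
The decision described above can be sketched as a small user-space function. This is a simplified illustration, not the kernel's sctp_transport_pl_send(): the struct, enum and function names are hypothetical, other conditions the real code may check are left out, and refreshing the snapshot when a probe is sent is an assumption made so the example is self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of probe states. */
enum example_pl_state { EXAMPLE_PL_SEARCH, EXAMPLE_PL_COMPLETE };

/* Hypothetical stand-in for the per-transport probe bookkeeping. */
struct example_transport {
	enum example_pl_state state;
	unsigned int raise_count;     /* models pl.raise_count */
	unsigned int last_rtx_chunks; /* snapshot of the asoc counter */
};

/*
 * Returns true if a pmtu probe should be sent now.
 * asoc_rtx_data_chunks models asoc->rtx_data_chunks, a counter that
 * grows whenever data chunks are retransmitted.
 */
static bool example_pl_send(struct example_transport *t,
			    unsigned int asoc_rtx_data_chunks)
{
	assert(t != NULL);

	if (t->state == EXAMPLE_PL_COMPLETE &&
	    t->last_rtx_chunks == asoc_rtx_data_chunks) {
		/* No loss since the last snapshot: skip the probe. */
		t->raise_count++;
		return false;
	}

	/* Assumption: remember the counter value seen at this probe. */
	t->last_rtx_chunks = asoc_rtx_data_chunks;
	return true;
}
```

In Search Complete state the function only counts the skipped interval until the retransmission counter moves, at which point a probe goes out again.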
      
      v1->v2:
        - add the missing Fixes tag.
      
      Fixes: 0dac127c ("sctp: do black hole detection in search complete state")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eacf078c
    • X
      sctp: improve the code for pmtu probe send and recv update · 058e6e0e
      Xin Long committed
      This patch does 3 things:
      
        - make sctp_transport_pl_send() and sctp_transport_pl_recv()
          return bool to indicate whether more probes need to be sent.
      
        - pr_debug() only when a probe really needs to be sent.
      
        - count pl.raise_count in sctp_transport_pl_send() instead of
          sctp_transport_pl_recv(), and only increment it for the
          1st probe of a given size.
      
      These are preparations for the next patch to make probes happen
      only when there's packet loss in Search Complete state.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      058e6e0e