1. 01 Jun 2022, 1 commit
  2. 25 May 2022, 1 commit
    • KVM: Fully serialize gfn=>pfn cache refresh via mutex · 93984f19
      Sean Christopherson committed
      Protect gfn=>pfn cache refresh with a mutex to fully serialize refreshes.
      The refresh logic doesn't protect against
      
      - concurrent unmaps, or refreshes with different GPAs (which may or may not
        happen in practice, for example if a cache is only used under vcpu->mutex;
        but it's allowed in the code)
      
      - a false negative on the memslot generation.  If the first refresh sees
        a stale memslot generation, it will refresh the hva and generation before
        moving on to the hva=>pfn translation.  If it then drops gpc->lock, a
        different user of the cache can come along, acquire gpc->lock, see that
        the memslot generation is fresh, and skip the hva=>pfn update due to the
        userspace address also matching (because it too was updated).
      
      The refresh path can already sleep during hva=>pfn resolution, so wrap
      the refresh with a mutex to ensure that any given refresh runs to
      completion before other callers can start their refresh.
      
      Cc: stable@vger.kernel.org
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429210025.3293691-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 16 May 2022, 1 commit
  4. 15 May 2022, 2 commits
  5. 12 May 2022, 3 commits
  6. 10 May 2022, 1 commit
  7. 09 May 2022, 2 commits
  8. 08 May 2022, 2 commits
  9. 06 May 2022, 2 commits
    • net: mscc: ocelot: mark traps with a bool instead of keeping them in a list · e1846cff
      Vladimir Oltean committed
      Since the blamed commit, VCAP filters can appear on more than one list.
      If their action is "trap", they are chained on ocelot->traps via
      filter->trap_list. This is in addition to their normal placement on the
      VCAP block->rules list head.
      
      Therefore, when we free a VCAP filter, we must remove it from all lists
      it is a member of, including ocelot->traps.
      
      There are at least 2 bugs which are direct consequences of this design
      decision.
      
      First is the incorrect usage of list_empty(), meant to denote whether
      "filter" is chained into ocelot->traps via filter->trap_list.
      This does not do the correct thing, because list_empty() checks whether
      "head->next == head", but in our case, head->next == head->prev == NULL.
      So we dereference NULL pointers and die when we call list_del().
      
      Second is the fact that not all places that should remove the filter
      from ocelot->traps do so. One example is ocelot_vcap_block_remove_filter(),
      which is where we have the main kfree(filter). By keeping freed filters
      in ocelot->traps we end up in a use-after-free in
      felix_update_trapping_destinations().
      
      Attempting to fix all the buggy patterns is a whack-a-mole game which
      makes the driver unmaintainable. Actually this is what the previous
      patch version attempted to do:
      https://patchwork.kernel.org/project/netdevbpf/patch/20220503115728.834457-3-vladimir.oltean@nxp.com/
      
      but it introduced another set of bugs, because there are other places
      where VCAP filters are created, not just ocelot_vcap_filter_create():
      
      - ocelot_trap_add()
      - felix_tag_8021q_vlan_add_rx()
      - felix_tag_8021q_vlan_add_tx()
      
      Relying on the convention that all those code paths must call
      INIT_LIST_HEAD(&filter->trap_list) is not going to scale.
      
      So let's do what should have been done in the first place and keep a
      bool in struct ocelot_vcap_filter which denotes whether we are looking
      at a trapping rule or not. Iterating now happens over the main VCAP IS2
      block->rules. The advantage is that we no longer risk having stale
      references to a freed filter, since it is only present in that list.
      
      Fixes: e42bd4ed ("net: mscc: ocelot: keep traps in a list")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Fix features skip in for_each_netdev_feature() · 85db6352
      Tariq Toukan committed
      The find_next_netdev_feature() macro gets the "remaining length",
      not bit index.
      Passing "bit - 1" for the following iteration is wrong as it skips
      the adjacent bit. Pass "bit" instead.
      
      Fixes: 3b89ea9c ("net: Fix for_each_netdev_feature on Big endian")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20220504080914.1918-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  10. 05 May 2022, 2 commits
  11. 04 May 2022, 7 commits
    • KVM: arm64: vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision · 49a1a2c7
      Marc Zyngier committed
      Since advertising GICR_CTLR.{IR,CES} is directly observable from
      a guest, we need to make it selectable from userspace.
      
      For that, bump the default GICD_IIDR revision and let userspace
      downgrade it to the previous default. For GICv2, the two distributor
      revisions are strictly equivalent.
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220405182327.205520-5-maz@kernel.org
    • KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation · 4645d11f
      Marc Zyngier committed
      Since GICv4.1, it has become legal for an implementation to advertise
      GICR_{INVLPIR,INVALLR,SYNCR} while having an ITS, allowing for a more
      efficient invalidation scheme (no guest command queue contention when
      multiple CPUs are generating invalidations).
      
      Provide the invalidation registers as a primitive to their ITS
      counterpart. Note that we don't advertise them to the guest yet
      (the architecture allows an implementation to do this).
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Link: https://lore.kernel.org/r/20220405182327.205520-4-maz@kernel.org
    • KVM: arm64: vgic-v3: Expose GICR_CTLR.RWP when disabling LPIs · 94828468
      Marc Zyngier committed
      When disabling LPIs, a guest needs to poll GICR_CTLR.RWP in order
      to be sure that the write has taken effect. We have so far reported it
      as 0, as we didn't advertise that LPIs could be turned off in the
      first place.
      
      Start tracking this state during which LPIs are being disabled,
      and expose the 'in progress' state via the RWP bit.
      
      We also take this opportunity to disallow enabling LPIs and programming
      GICR_{PEND,PROP}BASER while LPI disabling is in progress, as allowed by
      the architecture (UNPRED behaviour).
      
      We don't advertise the feature to the guest yet (which is allowed by
      the architecture).
      Reviewed-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220405182327.205520-3-maz@kernel.org
    • irqchip/gic-v3: Exposes bit values for GICR_CTLR.{IR, CES} · 34453c2e
      Marc Zyngier committed
      As we're about to expose GICR_CTLR.{IR,CES} to guests, populate
      the include file with the architectural values.
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Link: https://lore.kernel.org/r/20220405182327.205520-2-maz@kernel.org
    • KVM: arm64: Implement PSCI SYSTEM_SUSPEND · bfbab445
      Oliver Upton committed
      ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND" describes a PSCI call that allows
      software to request that a system be placed in the deepest possible
      low-power state. Effectively, software can use this to suspend itself to
      RAM.
      
      Unfortunately, there really is no good way to implement a system-wide
      PSCI call in KVM. Any precondition checks done in the kernel will need
      to be repeated by userspace since there is no good way to protect a
      critical section that spans an exit to userspace. SYSTEM_RESET and
      SYSTEM_OFF are equally plagued by this issue, although no users have
      seemingly cared for the relatively long time these calls have been
      supported.
      
      The solution is to just make the whole implementation userspace's
      problem. Introduce a new system event, KVM_SYSTEM_EVENT_SUSPEND, that
      indicates to userspace a calling vCPU has invoked PSCI SYSTEM_SUSPEND.
      Additionally, add a CAP to get buy-in from userspace for this new exit
      type.
      
      Only advertise the SYSTEM_SUSPEND PSCI call if userspace has opted in.
      If a vCPU calls SYSTEM_SUSPEND, punt straight to userspace. Provide
      explicit documentation of userspace's responsibilities for the exit and
      point to the PSCI specification to describe the actual PSCI call.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220504032446.4133305-8-oupton@google.com
    • KVM: arm64: Add support for userspace to suspend a vCPU · 7b33a09d
      Oliver Upton committed
      Introduce a new MP state, KVM_MP_STATE_SUSPENDED, which indicates a vCPU
      is in a suspended state. In the suspended state the vCPU will block
      until a wakeup event (pending interrupt) is recognized.
      
      Add a new system event type, KVM_SYSTEM_EVENT_WAKEUP, to indicate to
      userspace that KVM has recognized one such wakeup event. It is the
      responsibility of userspace to then make the vCPU runnable, or leave it
      suspended until the next wakeup event.
      Signed-off-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220504032446.4133305-7-oupton@google.com
    • KVM: arm64: Setup a framework for hypercall bitmap firmware registers · 05714cab
      Raghavendra Rao Ananta committed
      KVM regularly introduces new hypercall services to the guests without
      any consent from the userspace. This means, the guests can observe
      hypercall services in and out as they migrate across various host
      kernel versions. This could be a major problem if the guest
      discovered a hypercall, started using it, and after getting migrated
      to an older kernel realizes that it's no longer available. Depending
      on how the guest handles the change, there's a potential chance that
      the guest would just panic.
      
      As a result, there's a need for the userspace to elect the services
      that it wishes the guest to discover. It can elect these services
      based on the kernels spread across its (migration) fleet. To remedy
      this, extend the existing firmware pseudo-registers, such as
      KVM_REG_ARM_PSCI_VERSION, but by creating a new COPROC register space
      for all the hypercall services available.
      
      These firmware registers are categorized based on the service call
      owners, but unlike the existing firmware pseudo-registers, they hold
      the features supported in the form of a bitmap.
      
      During the VM initialization, the registers are set to upper-limit of
      the features supported by the corresponding registers. It's expected
      that the VMMs discover the features provided by each register via
      GET_ONE_REG, and write back the desired values using SET_ONE_REG.
      KVM allows this modification only until the VM has started.
      
      Some of the standard features are not mapped to any bits of the
      registers. But since they can recreate the original problem of
      making it available without userspace's consent, they need to
      be explicitly added to the case-list in
      kvm_hvc_call_default_allowed(). Any function-id that's not enabled
      via the bitmap, or not listed in kvm_hvc_call_default_allowed, will
      be returned as SMCCC_RET_NOT_SUPPORTED to the guest.
      
      Older userspace code can simply ignore the feature and the
      hypercall services will be exposed unconditionally to the guests,
      thus ensuring backward compatibility.
      
      In this patch, the framework adds the register only for ARM's standard
      secure services (owner value 4). Currently, this includes support only
      for ARM True Random Number Generator (TRNG) service, with bit-0 of the
      register representing mandatory features of v1.0. Other services will
      be added in upcoming patches.
      Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
      Reviewed-by: Gavin Shan <gshan@redhat.com>
      [maz: reduced the scope of some helpers, tidy-up bitmap max values,
       dropped error-only fast path]
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220502233853.1233742-3-rananta@google.com
  12. 03 May 2022, 1 commit
  13. 02 May 2022, 1 commit
  14. 01 May 2022, 1 commit
  15. 30 Apr 2022, 2 commits
    • KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT · d495f942
      Paolo Bonzini committed
      When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
      member that at the time was unused.  Unfortunately this extensibility
      mechanism has several issues:
      
      - x86 is not writing the member, so it would not be possible to use it
        on x86 except for new events
      
      - the member is not aligned to 64 bits, so the definition of the
        uAPI struct differs for 32-bit userspace on a 64-bit kernel.  This is a
        problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
        usage of flags was only introduced in 5.18.
      
      Since padding has to be introduced, place a new field in there
      that tells if the flags field is valid.  To allow further extensibility,
      in fact, change flags to an array of 16 values, and store how many
      of the values are valid.  The availability of the new ndata field
      is tied to a system capability; all architectures are changed to
      fill in the field.
      
      To avoid breaking compilation of userspace that was using the flags
      field, provide a userspace-only union to overlap flags with data[0].
      The new field is placed at the same offset for both 32- and 64-bit
      userspace.
      
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • SUNRPC: Ensure gss-proxy connects on setup · 892de36f
      Trond Myklebust committed
      For reasons best known to the author, gss-proxy does not implement a
      NULL procedure, and returns RPC_PROC_UNAVAIL. However we still want to
      ensure that we connect to the service at setup time.
      So add a quirk-flag specially for this case.
      
      Fixes: 1d658336 ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth")
      Cc: stable@vger.kernel.org
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
  16. 28 Apr 2022, 2 commits
    • elf: Fix the arm64 MTE ELF segment name and value · c35fe2a6
      Catalin Marinas committed
      Unfortunately, the name/value choice for the MTE ELF segment type
      (PT_ARM_MEMTAG_MTE) was pretty poor: LOPROC+1 is already in use by
      PT_AARCH64_UNWIND, as defined in the AArch64 ELF ABI
      (https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst).
      
      Update the ELF segment type value to LOPROC+2 and also change the define
      to PT_AARCH64_MEMTAG_MTE to match the AArch64 ELF ABI namespace. The
      AArch64 ELF ABI document is being updated accordingly (the segment type
      was not previously mentioned in the document).
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Fixes: 761b9b36 ("elf: Introduce the ARM MTE ELF segment type")
      Cc: Will Deacon <will@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Machado <luis.machado@arm.com>
      Cc: Richard Earnshaw <Richard.Earnshaw@arm.com>
      Link: https://lore.kernel.org/r/20220425151833.2603830-1-catalin.marinas@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • hex2bin: make the function hex_to_bin constant-time · e5be1576
      Mikulas Patocka committed
      The function hex2bin is used to load cryptographic keys into device
      mapper targets dm-crypt and dm-integrity.  It should take constant time,
      independent of the processed data, so that concurrently running
      unprivileged code can't infer any information about the keys via
      microarchitectural covert channels.
      
      This patch changes the function hex_to_bin so that it contains no
      branches and no memory accesses.
      
      Note that this shouldn't cause performance degradation because the size
      of the new function is the same as the size of the old function (on
      x86-64) - and the new function causes no branch misprediction penalties.
      
      I compile-tested this function with gcc on aarch64 alpha arm hppa hppa64
      i386 ia64 m68k mips32 mips64 powerpc powerpc64 riscv sh4 s390x sparc32
      sparc64 x86_64 and with clang on aarch64 arm hexagon i386 mips32 mips64
      powerpc powerpc64 s390x sparc32 sparc64 x86_64 to verify that there are
      no branches in the generated code.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 27 Apr 2022, 3 commits
    • net: Use this_cpu_inc() to increment net->core_stats · 6510ea97
      Sebastian Andrzej Siewior committed
      The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
      netdev_core_stats_alloc() to return a per-CPU pointer.
      netdev_core_stats_alloc() will allocate memory on its first invocation
      which breaks on PREEMPT_RT because it requires non-atomic context for
      memory allocation.
      
      This can be avoided by enabling preemption in netdev_core_stats_alloc()
      assuming the caller always disables preemption.
      
      It might be better to replace local_inc() with this_cpu_inc() now that
      dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
      not rely on already disabled preemption. This results in less
      instructions on x86-64:
      local_inc:
      |          incl %gs:__preempt_count(%rip)  # __preempt_count
      |          movq    488(%rdi), %rax # _1->core_stats, _22
      |          testq   %rax, %rax      # _22
      |          je      .L585   #,
      |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
      |  .L586:
      |          testq   %rax, %rax      # _27
      |          je      .L587   #,
      |          incq (%rax)            # _6->a.counter
      |  .L587:
      |          decl %gs:__preempt_count(%rip)  # __preempt_count
      
      this_cpu_inc(), this patch:
      |         movq    488(%rdi), %rax # _1->core_stats, _5
      |         testq   %rax, %rax      # _5
      |         je      .L591   #,
      | .L585:
      |         incq %gs:(%rax) # _18->rx_dropped
      
      Use unsigned long as type for the counter. Use this_cpu_inc() to
      increment the counter. Use a plain read of the counter.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Bluetooth: hci_sync: Cleanup hci_conn if it cannot be aborted · 9b3628d7
      Luiz Augusto von Dentz committed
      This attempts to cleanup the hci_conn if it cannot be aborted as
      otherwise it would likely result in having the controller and host
      stack out of sync with respect to connection handle.
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • Bluetooth: hci_event: Fix checking for invalid handle on error status · c86cc5a3
      Luiz Augusto von Dentz committed
      Commit d5ebaa7c introduced checks for the handle range
      (e.g. HCI_CONN_HANDLE_MAX), but controllers like the Intel AX200 don't
      seem to respect the valid range in case of an error status:
      
      > HCI Event: Connect Complete (0x03) plen 11
              Status: Page Timeout (0x04)
              Handle: 65535
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
              Link type: ACL (0x01)
              Encryption: Disabled (0x00)
      [1644965.827560] Bluetooth: hci0: Ignoring HCI_Connection_Complete for invalid handle
      
      Because of this it is impossible to clean up the connections properly,
      since the stack would attempt to cancel a connection which is no longer
      in progress, causing the following trace:
      
      < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
      = bluetoothd: src/profile.c:record_cb() Unable to get Hands-Free Voice
      	gateway SDP record: Connection timed out
      > HCI Event: Command Complete (0x0e) plen 10
            Create Connection Cancel (0x01|0x0008) ncmd 1
              Status: Unknown Connection Identifier (0x02)
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
      < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
      
      Fixes: d5ebaa7c ("Bluetooth: hci_event: Ignore multiple conn complete events")
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
  18. 26 Apr 2022, 2 commits
    • xsk: Fix possible crash when multiple sockets are created · ba3beec2
      Maciej Fijalkowski committed
      Fix a crash that happens if an Rx only socket is created first, then a
      second socket is created that is Tx only and bound to the same umem as
      the first socket and also the same netdev and queue_id together with the
      XDP_SHARED_UMEM flag. In this specific case, the tx_descs array page
      pool was not created by the first socket as it was an Rx only socket.
      When the second socket is bound it needs this tx_descs array of this
      shared page pool as it has a Tx component, but unfortunately it was
      never allocated, leading to a crash. Note that this array is only used
      for zero-copy drivers using the batched Tx APIs, currently only ice and
      i40e.
      
      [ 5511.150360] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [ 5511.158419] #PF: supervisor write access in kernel mode
      [ 5511.164472] #PF: error_code(0x0002) - not-present page
      [ 5511.170416] PGD 0 P4D 0
      [ 5511.173347] Oops: 0002 [#1] PREEMPT SMP PTI
      [ 5511.178186] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E     5.18.0-rc1+ #97
      [ 5511.187245] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [ 5511.198418] RIP: 0010:xsk_tx_peek_release_desc_batch+0x198/0x310
      [ 5511.205375] Code: c0 83 c6 01 84 c2 74 6d 8d 46 ff 23 07 44 89 e1 48 83 c0 14 48 c1 e1 04 48 c1 e0 04 48 03 47 10 4c 01 c1 48 8b 50 08 48 8b 00 <48> 89 51 08 48 89 01 41 80 bd d7 00 00 00 00 75 82 48 8b 19 49 8b
      [ 5511.227091] RSP: 0018:ffffc90000003dd0 EFLAGS: 00010246
      [ 5511.233135] RAX: 0000000000000000 RBX: ffff88810c8da600 RCX: 0000000000000000
      [ 5511.241384] RDX: 000000000000003c RSI: 0000000000000001 RDI: ffff888115f555c0
      [ 5511.249634] RBP: ffffc90000003e08 R08: 0000000000000000 R09: ffff889092296b48
      [ 5511.257886] R10: 0000ffffffffffff R11: ffff889092296800 R12: 0000000000000000
      [ 5511.266138] R13: ffff88810c8db500 R14: 0000000000000040 R15: 0000000000000100
      [ 5511.274387] FS:  0000000000000000(0000) GS:ffff88903f800000(0000) knlGS:0000000000000000
      [ 5511.283746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5511.290389] CR2: 0000000000000008 CR3: 00000001046e2001 CR4: 00000000003706f0
      [ 5511.298640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5511.306892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 5511.315142] Call Trace:
      [ 5511.317972]  <IRQ>
      [ 5511.320301]  ice_xmit_zc+0x68/0x2f0 [ice]
      [ 5511.324977]  ? ktime_get+0x38/0xa0
      [ 5511.328913]  ice_napi_poll+0x7a/0x6a0 [ice]
      [ 5511.333784]  __napi_poll+0x2c/0x160
      [ 5511.337821]  net_rx_action+0xdd/0x200
      [ 5511.342058]  __do_softirq+0xe6/0x2dd
      [ 5511.346198]  irq_exit_rcu+0xb5/0x100
      [ 5511.350339]  common_interrupt+0xa4/0xc0
      [ 5511.354777]  </IRQ>
      [ 5511.357201]  <TASK>
      [ 5511.359625]  asm_common_interrupt+0x1e/0x40
      [ 5511.364466] RIP: 0010:cpuidle_enter_state+0xd2/0x360
      [ 5511.370211] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 e9 00 7b ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 72 02 00 00 31 ff e8 02 0c 80 ff fb 45 85 f6 <0f> 88 11 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
      [ 5511.391921] RSP: 0018:ffffffff82a03e60 EFLAGS: 00000202
      [ 5511.397962] RAX: ffff88903f800000 RBX: 0000000000000001 RCX: 000000000000001f
      [ 5511.406214] RDX: 0000000000000000 RSI: ffffffff823400b9 RDI: ffffffff8234c046
      [ 5511.424646] RBP: ffff88810a384800 R08: 000005032a28c046 R09: 0000000000000008
      [ 5511.443233] R10: 000000000000000b R11: 0000000000000006 R12: ffffffff82bcf700
      [ 5511.461922] R13: 000005032a28c046 R14: 0000000000000001 R15: 0000000000000000
      [ 5511.480300]  cpuidle_enter+0x29/0x40
      [ 5511.494329]  do_idle+0x1c7/0x250
      [ 5511.507610]  cpu_startup_entry+0x19/0x20
      [ 5511.521394]  start_kernel+0x649/0x66e
      [ 5511.534626]  secondary_startup_64_no_verify+0xc3/0xcb
      [ 5511.549230]  </TASK>
      
      Detect such case during bind() and allocate this memory region via newly
      introduced xp_alloc_tx_descs(). Also, use kvcalloc instead of kcalloc as
      for other buffer pool allocations, so that it matches the kvfree() from
      xp_destroy().
      
      Fixes: d1bc532e ("i40e: xsk: Move tmp desc array from driver to pool")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20220425153745.481322-1-maciej.fijalkowski@intel.com
    • bug: Have __warn() prototype defined unconditionally · 1fa568e2
      Shida Zhang committed
      The __warn() prototype is declared in CONFIG_BUG scope but the function
      definition in panic.c is unconditional. The IBT enablement started using
      it unconditionally but a CONFIG_X86_KERNEL_IBT=y, CONFIG_BUG=n .config
      will trigger a
      
        arch/x86/kernel/traps.c: In function ‘__exc_control_protection’:
        arch/x86/kernel/traps.c:249:17: error: implicit declaration of function \
        	  ‘__warn’; did you mean ‘pr_warn’? [-Werror=implicit-function-declaration]
      
      Pull up the declarations so that they're unconditionally visible too.
      
        [ bp: Rewrite commit message. ]
      
      Fixes: 991625f3 ("x86/ibt: Add IBT feature, MSR and #CP handling")
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lore.kernel.org/r/20220426032007.510245-1-starzhangzsd@gmail.com
  19. 25 Apr 2022, 3 commits
    • tcp: make sure treq->af_specific is initialized · ba5a4fdd
      Eric Dumazet committed
      syzbot complained about a recent change in the TCP stack,
      hitting a NULL pointer [1].

      TCP request sockets have an af_specific pointer, which
      was used before the blamed change only for SYNACK generation
      in non-SYNCOOKIE mode.

      TCP request sockets momentarily created when the third packet
      comes from the client in SYNCOOKIE mode were not using
      treq->af_specific.
      
      Make sure this field is populated, in the same way normal
      TCP requests sockets do in tcp_conn_request().
      
      [1]
      TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies.  Check SNMP counters.
      general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
      CPU: 1 PID: 3695 Comm: syz-executor864 Not tainted 5.18.0-rc3-syzkaller-00224-g5fd1fe48 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_create_openreq_child+0xe16/0x16b0 net/ipv4/tcp_minisocks.c:534
      Code: 48 c1 ea 03 80 3c 02 00 0f 85 e5 07 00 00 4c 8b b3 28 01 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7e 08 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 c9 07 00 00 48 8b 3c 24 48 89 de 41 ff 56 08 48
      RSP: 0018:ffffc90000de0588 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff888076490330 RCX: 0000000000000100
      RDX: 0000000000000001 RSI: ffffffff87d67ff0 RDI: 0000000000000008
      RBP: ffff88806ee1c7f8 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff87d67f00 R11: 0000000000000000 R12: ffff88806ee1bfc0
      R13: ffff88801b0e0368 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f517fe58700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffcead76960 CR3: 000000006f97b000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       tcp_v6_syn_recv_sock+0x199/0x23b0 net/ipv6/tcp_ipv6.c:1267
       tcp_get_cookie_sock+0xc9/0x850 net/ipv4/syncookies.c:207
       cookie_v6_check+0x15c3/0x2340 net/ipv6/syncookies.c:258
       tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1131 [inline]
       tcp_v6_do_rcv+0x1148/0x13b0 net/ipv6/tcp_ipv6.c:1486
       tcp_v6_rcv+0x3305/0x3840 net/ipv6/tcp_ipv6.c:1725
       ip6_protocol_deliver_rcu+0x2e9/0x1900 net/ipv6/ip6_input.c:422
       ip6_input_finish+0x14c/0x2c0 net/ipv6/ip6_input.c:464
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip6_input+0x9c/0xd0 net/ipv6/ip6_input.c:473
       dst_input include/net/dst.h:461 [inline]
       ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ipv6_rcv+0x27f/0x3b0 net/ipv6/ip6_input.c:297
       __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5405
       __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5519
       process_backlog+0x3a0/0x7c0 net/core/dev.c:5847
       __napi_poll+0xb3/0x6e0 net/core/dev.c:6413
       napi_poll net/core/dev.c:6480 [inline]
       net_rx_action+0x8ec/0xc60 net/core/dev.c:6567
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
      
      Fixes: 5b0b9e4c ("tcp: md5: incorrect tcp_header_len for incoming connections")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      ba5a4fdd
    • E
      tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 4bfe744f
By Eric Dumazet
      I had this bug sitting for too long in my pile, it is time to fix it.
      
      Thanks to Doug Porter for reminding me of it!
      
      We had various attempts in the past, including commit
      0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
but the issue is that the TCP stack currently only generates
EPOLLOUT from the input path, when tp->snd_una has advanced
and skb(s) have been cleaned from the rtx queue.
      
      If a flow has a big RTT, and/or receives SACKs, it is possible
      that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
      and no more data can be sent until tp->snd_una finally advances.
      
What is needed is to also check whether EPOLLOUT should be generated
whenever tp->snd_nxt is advanced, from the output path.
      
This bug triggers more often after an idle period, as
we do not receive an ACK for at least one RTT. tcp_notsent_lowat
could be a fraction of what CWND and the pacing rate would allow
to be sent during this RTT.
      
      In a followup patch, I will remove the bogus call
      to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
from tcp_check_space(). The fact that we have decided to generate
an EPOLLOUT does not mean the application has immediately
refilled the transmit queue. This optimistic call
might have been the reason the bug seemed not too serious.
      
      Tested:
      
      200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
      
      $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
      $ cat bench_rr.sh
      SUM=0
      for i in {1..10}
      do
       V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
       echo $V
       SUM=$(($SUM + $V))
      done
      echo SUM=$SUM
      
      Before patch:
      $ bench_rr.sh
      130000000
      80000000
      140000000
      140000000
      140000000
      140000000
      130000000
      40000000
      90000000
      110000000
      SUM=1140000000
      
      After patch:
      $ bench_rr.sh
      430000000
      590000000
      530000000
      450000000
      450000000
      350000000
      450000000
      490000000
      480000000
      460000000
SUM=4680000000  # This is 410% of the value before the patch.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Doug Porter <dsp@fb.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      4bfe744f
    • P
      ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode · 31c417c9
By Peilin Ye
      As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in
      collect_md mode is racy for [IP6]GRE[TAP] devices.  Consider the
      following sequence of events:
      
      1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link
         add ... external".  "ip" ignores "[o]seq" if "external" is specified,
         so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e.
         it uses lockless TX);
      2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g.
         bpf_skb_set_tunnel_key() in an eBPF program attached to this device;
      3. gre_fb_xmit() or __gre6_xmit() processes these skb's:
      
      	gre_build_header(skb, tun_hlen,
      			 flags, protocol,
      			 tunnel_id_to_key32(tun_info->key.tun_id),
      			 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
      					      : 0);   ^^^^^^^^^^^^^^^^^
      
      Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may
      try to do this tunnel->o_seqno++ in parallel, which is racy.  Fix it by
      making o_seqno atomic_t.
      
      As mentioned by Eric Dumazet in commit b790e01a ("ip_gre: lockless
      xmit"), making o_seqno atomic_t increases "chance for packets being out
      of order at receiver" when NETIF_F_LLTX is on.
      
      Maybe a better fix would be:
      
      1. Do not ignore "oseq" in external mode.  Users MUST specify "oseq" if
         they want the kernel to allow sequencing of outgoing packets;
      2. Reject all outgoing TUNNEL_SEQ packets if the device was not created
         with "oseq".
      
      Unfortunately, that would break userspace.
      
      We could now make [IP6]GRE[TAP] devices always NETIF_F_LLTX, but let us
      do it in separate patches to keep this fix minimal.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 77a5196a ("gre: add sequence number for collect md mode.")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      31c417c9
20. 23 Apr 2022, 1 commit