1. 04 May 2022, 1 commit
  2. 30 April 2022, 1 commit
    • KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT · d495f942
      Authored by Paolo Bonzini
      When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
      member that at the time was unused.  Unfortunately this extensibility
      mechanism has several issues:
      
      - x86 is not writing the member, so it would not be possible to use it
        on x86 except for new events
      
      - the member is not aligned to 64 bits, so the definition of the
        uAPI struct is incorrect for 32-bit userspace on 64-bit kernels.
        This is a problem for RISC-V, which supports CONFIG_KVM_COMPAT;
        fortunately, usage of flags was only introduced in 5.18.
      
      Since padding has to be introduced, place a new field in there
      that tells if the flags field is valid.  To allow further extensibility,
      in fact, change flags to an array of 16 values, and store how many
      of the values are valid.  The availability of the new ndata field
      is tied to a system capability; all architectures are changed to
      fill in the field.
      
      To avoid breaking compilation of userspace that was using the flags
      field, provide a userspace-only union to overlap flags with data[0].
      The new field is placed at the same offset for both 32- and 64-bit
      userspace.
      
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d495f942
  3. 28 April 2022, 2 commits
    • elf: Fix the arm64 MTE ELF segment name and value · c35fe2a6
      Authored by Catalin Marinas
      Unfortunately, the name/value choice for the MTE ELF segment type
      (PT_ARM_MEMTAG_MTE) was pretty poor: LOPROC+1 is already in use by
      PT_AARCH64_UNWIND, as defined in the AArch64 ELF ABI
      (https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst).
      
      Update the ELF segment type value to LOPROC+2 and also change the define
      to PT_AARCH64_MEMTAG_MTE to match the AArch64 ELF ABI namespace. The
      AArch64 ELF ABI document is being updated accordingly (the segment type
      was not previously mentioned in the document).
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Fixes: 761b9b36 ("elf: Introduce the ARM MTE ELF segment type")
      Cc: Will Deacon <will@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Machado <luis.machado@arm.com>
      Cc: Richard Earnshaw <Richard.Earnshaw@arm.com>
      Link: https://lore.kernel.org/r/20220425151833.2603830-1-catalin.marinas@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      c35fe2a6
    • hex2bin: make the function hex_to_bin constant-time · e5be1576
      Authored by Mikulas Patocka
      The function hex2bin is used to load cryptographic keys into the device
      mapper targets dm-crypt and dm-integrity.  It should take constant time
      independent of the processed data, so that concurrently running
      unprivileged code cannot infer any information about the keys via
      microarchitectural covert channels.
      
      This patch changes the function hex_to_bin so that it contains no
      branches and no memory accesses.
      
      Note that this shouldn't cause performance degradation because the size
      of the new function is the same as the size of the old function (on
      x86-64) - and the new function causes no branch misprediction penalties.
      
      I compile-tested this function with gcc on aarch64 alpha arm hppa hppa64
      i386 ia64 m68k mips32 mips64 powerpc powerpc64 riscv sh4 s390x sparc32
      sparc64 x86_64 and with clang on aarch64 arm hexagon i386 mips32 mips64
      powerpc powerpc64 s390x sparc32 sparc64 x86_64 to verify that there are
      no branches in the generated code.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5be1576
  4. 27 April 2022, 3 commits
    • net: Use this_cpu_inc() to increment net->core_stats · 6510ea97
      Authored by Sebastian Andrzej Siewior
      The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
      netdev_core_stats_alloc() to return a per-CPU pointer.
      netdev_core_stats_alloc() will allocate memory on its first invocation
      which breaks on PREEMPT_RT because it requires non-atomic context for
      memory allocation.
      
      This can be avoided by enabling preemption in netdev_core_stats_alloc()
      assuming the caller always disables preemption.
      
      It might be better to replace local_inc() with this_cpu_inc() now that
      dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
      not rely on already-disabled preemption. This results in fewer
      instructions on x86-64:
      local_inc:
      |          incl %gs:__preempt_count(%rip)  # __preempt_count
      |          movq    488(%rdi), %rax # _1->core_stats, _22
      |          testq   %rax, %rax      # _22
      |          je      .L585   #,
      |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
      |  .L586:
      |          testq   %rax, %rax      # _27
      |          je      .L587   #,
      |          incq (%rax)            # _6->a.counter
      |  .L587:
      |          decl %gs:__preempt_count(%rip)  # __preempt_count
      
      this_cpu_inc(), this patch:
      |         movq    488(%rdi), %rax # _1->core_stats, _5
      |         testq   %rax, %rax      # _5
      |         je      .L591   #,
      | .L585:
      |         incq %gs:(%rax) # _18->rx_dropped
      
      Use unsigned long as type for the counter. Use this_cpu_inc() to
      increment the counter. Use a plain read of the counter.
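      The resulting increment path can be sketched as a kernel-style macro (a simplification; the helper and field names are taken from the commit text, not the exact header):

```c
/* Sketch: with this_cpu_inc() the local_t wrapper and the extra
 * preempt-count manipulation are gone; a plain per-CPU increment remains. */
#define dev_core_stats_inc_sketch(dev, FIELD)				\
do {									\
	struct net_device_core_stats __percpu *p;			\
									\
	p = dev_core_stats(dev);	/* may allocate on first use */	\
	if (p)								\
		this_cpu_inc(p->FIELD);					\
} while (0)
```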
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6510ea97
    • Bluetooth: hci_sync: Cleanup hci_conn if it cannot be aborted · 9b3628d7
      Authored by Luiz Augusto von Dentz
      This attempts to cleanup the hci_conn if it cannot be aborted as
      otherwise it would likely result in having the controller and host
      stack out of sync with respect to connection handle.
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      9b3628d7
    • Bluetooth: hci_event: Fix checking for invalid handle on error status · c86cc5a3
      Authored by Luiz Augusto von Dentz
      Commit d5ebaa7c introduced checks for the handle range
      (e.g. HCI_CONN_HANDLE_MAX), but controllers like the Intel AX200 don't
      seem to respect the valid range in case of an error status:
      
      > HCI Event: Connect Complete (0x03) plen 11
              Status: Page Timeout (0x04)
              Handle: 65535
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
              Link type: ACL (0x01)
              Encryption: Disabled (0x00)
      [1644965.827560] Bluetooth: hci0: Ignoring HCI_Connection_Complete for invalid handle
      
      Because of this, it is impossible to clean up the connections properly,
      since the stack would attempt to cancel a connection that is no longer
      in progress, causing the following trace:
      
      < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
      = bluetoothd: src/profile.c:record_cb() Unable to get Hands-Free Voice
      	gateway SDP record: Connection timed out
      > HCI Event: Command Complete (0x0e) plen 10
            Create Connection Cancel (0x01|0x0008) ncmd 1
              Status: Unknown Connection Identifier (0x02)
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
      < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6
              Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment&
      	Sound Products Inc)
      
      Fixes: d5ebaa7c ("Bluetooth: hci_event: Ignore multiple conn complete events")
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      c86cc5a3
  5. 26 April 2022, 2 commits
    • xsk: Fix possible crash when multiple sockets are created · ba3beec2
      Authored by Maciej Fijalkowski
      Fix a crash that happens if an Rx only socket is created first, then a
      second socket is created that is Tx only and bound to the same umem as
      the first socket and also the same netdev and queue_id together with the
      XDP_SHARED_UMEM flag. In this specific case, the tx_descs array page
      pool was not created by the first socket as it was an Rx only socket.
      When the second socket is bound, it needs the tx_descs array of this
      shared pool since it has a Tx component, but unfortunately the array
      was never allocated, leading to a crash. Note that this array is only
      used by zero-copy drivers using the batched Tx APIs, currently only
      ice and i40e.
      
      [ 5511.150360] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [ 5511.158419] #PF: supervisor write access in kernel mode
      [ 5511.164472] #PF: error_code(0x0002) - not-present page
      [ 5511.170416] PGD 0 P4D 0
      [ 5511.173347] Oops: 0002 [#1] PREEMPT SMP PTI
      [ 5511.178186] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E     5.18.0-rc1+ #97
      [ 5511.187245] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [ 5511.198418] RIP: 0010:xsk_tx_peek_release_desc_batch+0x198/0x310
      [ 5511.205375] Code: c0 83 c6 01 84 c2 74 6d 8d 46 ff 23 07 44 89 e1 48 83 c0 14 48 c1 e1 04 48 c1 e0 04 48 03 47 10 4c 01 c1 48 8b 50 08 48 8b 00 <48> 89 51 08 48 89 01 41 80 bd d7 00 00 00 00 75 82 48 8b 19 49 8b
      [ 5511.227091] RSP: 0018:ffffc90000003dd0 EFLAGS: 00010246
      [ 5511.233135] RAX: 0000000000000000 RBX: ffff88810c8da600 RCX: 0000000000000000
      [ 5511.241384] RDX: 000000000000003c RSI: 0000000000000001 RDI: ffff888115f555c0
      [ 5511.249634] RBP: ffffc90000003e08 R08: 0000000000000000 R09: ffff889092296b48
      [ 5511.257886] R10: 0000ffffffffffff R11: ffff889092296800 R12: 0000000000000000
      [ 5511.266138] R13: ffff88810c8db500 R14: 0000000000000040 R15: 0000000000000100
      [ 5511.274387] FS:  0000000000000000(0000) GS:ffff88903f800000(0000) knlGS:0000000000000000
      [ 5511.283746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5511.290389] CR2: 0000000000000008 CR3: 00000001046e2001 CR4: 00000000003706f0
      [ 5511.298640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5511.306892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 5511.315142] Call Trace:
      [ 5511.317972]  <IRQ>
      [ 5511.320301]  ice_xmit_zc+0x68/0x2f0 [ice]
      [ 5511.324977]  ? ktime_get+0x38/0xa0
      [ 5511.328913]  ice_napi_poll+0x7a/0x6a0 [ice]
      [ 5511.333784]  __napi_poll+0x2c/0x160
      [ 5511.337821]  net_rx_action+0xdd/0x200
      [ 5511.342058]  __do_softirq+0xe6/0x2dd
      [ 5511.346198]  irq_exit_rcu+0xb5/0x100
      [ 5511.350339]  common_interrupt+0xa4/0xc0
      [ 5511.354777]  </IRQ>
      [ 5511.357201]  <TASK>
      [ 5511.359625]  asm_common_interrupt+0x1e/0x40
      [ 5511.364466] RIP: 0010:cpuidle_enter_state+0xd2/0x360
      [ 5511.370211] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 e9 00 7b ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 72 02 00 00 31 ff e8 02 0c 80 ff fb 45 85 f6 <0f> 88 11 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
      [ 5511.391921] RSP: 0018:ffffffff82a03e60 EFLAGS: 00000202
      [ 5511.397962] RAX: ffff88903f800000 RBX: 0000000000000001 RCX: 000000000000001f
      [ 5511.406214] RDX: 0000000000000000 RSI: ffffffff823400b9 RDI: ffffffff8234c046
      [ 5511.424646] RBP: ffff88810a384800 R08: 000005032a28c046 R09: 0000000000000008
      [ 5511.443233] R10: 000000000000000b R11: 0000000000000006 R12: ffffffff82bcf700
      [ 5511.461922] R13: 000005032a28c046 R14: 0000000000000001 R15: 0000000000000000
      [ 5511.480300]  cpuidle_enter+0x29/0x40
      [ 5511.494329]  do_idle+0x1c7/0x250
      [ 5511.507610]  cpu_startup_entry+0x19/0x20
      [ 5511.521394]  start_kernel+0x649/0x66e
      [ 5511.534626]  secondary_startup_64_no_verify+0xc3/0xcb
      [ 5511.549230]  </TASK>
      
      Detect such case during bind() and allocate this memory region via newly
      introduced xp_alloc_tx_descs(). Also, use kvcalloc instead of kcalloc as
      for other buffer pool allocations, so that it matches the kvfree() from
      xp_destroy().
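      The bind-time allocation described above can be sketched like this (the function name comes from the commit message; the surrounding field names are assumptions):

```c
/* Allocate the per-pool Tx descriptor array lazily at bind time, so a
 * Tx socket sharing a umem created by an Rx-only socket gets one too.
 * kvcalloc() pairs with the kvfree() in xp_destroy(). */
static int xp_alloc_tx_descs(struct xsk_buff_pool *pool, struct xdp_sock *xs)
{
	pool->tx_descs = kvcalloc(xs->tx->nentries, sizeof(*pool->tx_descs),
				  GFP_KERNEL);
	if (!pool->tx_descs)
		return -ENOMEM;

	return 0;
}
```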
      
      Fixes: d1bc532e ("i40e: xsk: Move tmp desc array from driver to pool")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20220425153745.481322-1-maciej.fijalkowski@intel.com
      ba3beec2
    • bug: Have __warn() prototype defined unconditionally · 1fa568e2
      Authored by Shida Zhang
      The __warn() prototype is declared in CONFIG_BUG scope but the function
      definition in panic.c is unconditional. The IBT enablement started using
      it unconditionally but a CONFIG_X86_KERNEL_IBT=y, CONFIG_BUG=n .config
      will trigger a
      
        arch/x86/kernel/traps.c: In function ‘__exc_control_protection’:
        arch/x86/kernel/traps.c:249:17: error: implicit declaration of function \
        	  ‘__warn’; did you mean ‘pr_warn’? [-Werror=implicit-function-declaration]
      
      Pull up the declarations so that they're unconditionally visible too.
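      The shape of the fix, sketched against include/asm-generic/bug.h (simplified; the real header carries more declarations around the #ifdef):

```c
/* Declarations visible regardless of CONFIG_BUG, so unconditional users
 * such as the IBT #CP handler always see a prototype. */
struct warn_args;
struct pt_regs;

void __warn(const char *file, int line, void *caller, unsigned taint,
	    struct pt_regs *regs, struct warn_args *args);

#ifdef CONFIG_BUG
/* WARN()/WARN_ON() machinery stays conditional on CONFIG_BUG ... */
#endif
```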
      
        [ bp: Rewrite commit message. ]
      
      Fixes: 991625f3 ("x86/ibt: Add IBT feature, MSR and #CP handling")
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lore.kernel.org/r/20220426032007.510245-1-starzhangzsd@gmail.com
      1fa568e2
  6. 25 April 2022, 3 commits
    • tcp: make sure treq->af_specific is initialized · ba5a4fdd
      Authored by Eric Dumazet
      syzbot complained about a recent change in the TCP stack,
      hitting a NULL pointer [1].
      
      TCP request sockets have an af_specific pointer, which
      was used before the blamed change only for SYNACK generation
      in non-SYNCOOKIE mode.
      
      TCP request sockets momentarily created when the third packet
      comes from the client in SYNCOOKIE mode were not using
      treq->af_specific.
      
      Make sure this field is populated, in the same way normal
      TCP request sockets do in tcp_conn_request().
      
      [1]
      TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies.  Check SNMP counters.
      general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
      CPU: 1 PID: 3695 Comm: syz-executor864 Not tainted 5.18.0-rc3-syzkaller-00224-g5fd1fe48 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_create_openreq_child+0xe16/0x16b0 net/ipv4/tcp_minisocks.c:534
      Code: 48 c1 ea 03 80 3c 02 00 0f 85 e5 07 00 00 4c 8b b3 28 01 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7e 08 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 c9 07 00 00 48 8b 3c 24 48 89 de 41 ff 56 08 48
      RSP: 0018:ffffc90000de0588 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff888076490330 RCX: 0000000000000100
      RDX: 0000000000000001 RSI: ffffffff87d67ff0 RDI: 0000000000000008
      RBP: ffff88806ee1c7f8 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff87d67f00 R11: 0000000000000000 R12: ffff88806ee1bfc0
      R13: ffff88801b0e0368 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f517fe58700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffcead76960 CR3: 000000006f97b000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       tcp_v6_syn_recv_sock+0x199/0x23b0 net/ipv6/tcp_ipv6.c:1267
       tcp_get_cookie_sock+0xc9/0x850 net/ipv4/syncookies.c:207
       cookie_v6_check+0x15c3/0x2340 net/ipv6/syncookies.c:258
       tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1131 [inline]
       tcp_v6_do_rcv+0x1148/0x13b0 net/ipv6/tcp_ipv6.c:1486
       tcp_v6_rcv+0x3305/0x3840 net/ipv6/tcp_ipv6.c:1725
       ip6_protocol_deliver_rcu+0x2e9/0x1900 net/ipv6/ip6_input.c:422
       ip6_input_finish+0x14c/0x2c0 net/ipv6/ip6_input.c:464
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip6_input+0x9c/0xd0 net/ipv6/ip6_input.c:473
       dst_input include/net/dst.h:461 [inline]
       ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ipv6_rcv+0x27f/0x3b0 net/ipv6/ip6_input.c:297
       __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5405
       __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5519
       process_backlog+0x3a0/0x7c0 net/core/dev.c:5847
       __napi_poll+0xb3/0x6e0 net/core/dev.c:6413
       napi_poll net/core/dev.c:6480 [inline]
       net_rx_action+0x8ec/0xc60 net/core/dev.c:6567
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
      
      Fixes: 5b0b9e4c ("tcp: md5: incorrect tcp_header_len for incoming connections")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Francesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba5a4fdd
    • tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 4bfe744f
      Authored by Eric Dumazet
      I had this bug sitting for too long in my pile, it is time to fix it.
      
      Thanks to Doug Porter for reminding me of it!
      
      We had various attempts in the past, including commit
      0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
      but the issue is that the TCP stack currently only generates
      EPOLLOUT from the input path, when tp->snd_una has advanced
      and skb(s) have been cleaned from the rtx queue.
      
      If a flow has a big RTT, and/or receives SACKs, it is possible
      that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
      and no more data can be sent until tp->snd_una finally advances.
      
      What is needed is to also check whether POLLOUT needs to be generated
      whenever tp->snd_nxt is advanced, from the output path.
      
      This bug triggers more often after an idle period, as
      we do not receive ACKs for at least one RTT. tcp_notsent_lowat
      could be a fraction of what the CWND and pacing rate would allow
      to send during this RTT.
      
      In a followup patch, I will remove the bogus call
      to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
      from tcp_check_space(). The fact that we have decided to generate
      an EPOLLOUT does not mean the application has immediately
      refilled the transmit queue. This optimistic call
      might have been the reason the bug seemed not too serious.
      
      Tested:
      
      200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
      
      $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
      $ cat bench_rr.sh
      SUM=0
      for i in {1..10}
      do
       V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
       echo $V
       SUM=$(($SUM + $V))
      done
      echo SUM=$SUM
      
      Before patch:
      $ bench_rr.sh
      130000000
      80000000
      140000000
      140000000
      140000000
      140000000
      130000000
      40000000
      90000000
      110000000
      SUM=1140000000
      
      After patch:
      $ bench_rr.sh
      430000000
      590000000
      530000000
      450000000
      450000000
      350000000
      450000000
      490000000
      480000000
      460000000
      SUM=4680000000  # This is 410% of the value before the patch.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Doug Porter <dsp@fb.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bfe744f
    • ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode · 31c417c9
      Authored by Peilin Ye
      As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in
      collect_md mode is racy for [IP6]GRE[TAP] devices.  Consider the
      following sequence of events:
      
      1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link
         add ... external".  "ip" ignores "[o]seq" if "external" is specified,
         so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e.
         it uses lockless TX);
      2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g.
         bpf_skb_set_tunnel_key() in an eBPF program attached to this device;
      3. gre_fb_xmit() or __gre6_xmit() processes these skb's:
      
      	gre_build_header(skb, tun_hlen,
      			 flags, protocol,
      			 tunnel_id_to_key32(tun_info->key.tun_id),
      			 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
      					      : 0);
      			                              ^^^^^^^^^^^^^^^^^
      
      Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may
      try to do this tunnel->o_seqno++ in parallel, which is racy.  Fix it by
      making o_seqno atomic_t.
      
      As mentioned by Eric Dumazet in commit b790e01a ("ip_gre: lockless
      xmit"), making o_seqno atomic_t increases "chance for packets being out
      of order at receiver" when NETIF_F_LLTX is on.
      
      Maybe a better fix would be:
      
      1. Do not ignore "oseq" in external mode.  Users MUST specify "oseq" if
         they want the kernel to allow sequencing of outgoing packets;
      2. Reject all outgoing TUNNEL_SEQ packets if the device was not created
         with "oseq".
      
      Unfortunately, that would break userspace.
      
      We could now make [IP6]GRE[TAP] devices always NETIF_F_LLTX, but let us
      do it in separate patches to keep this fix minimal.
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Fixes: 77a5196a ("gre: add sequence number for collect md mode.")
      Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      31c417c9
  7. 23 April 2022, 2 commits
  8. 22 April 2022, 7 commits
    • oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup · e4a38402
      Authored by Nico Pache
      The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
      can be targeted by the oom reaper.  This mapping is used to store the
      futex robust list head; the kernel does not keep a copy of the robust
      list and instead references a userspace address to maintain the
      robustness during a process death.
      
      A race can occur between exit_mm and the oom reaper that allows the oom
      reaper to free the memory of the futex robust list before the exit path
      has handled the futex death:
      
          CPU1                               CPU2
          --------------------------------------------------------------------
          page_fault
          do_exit "signal"
          wake_oom_reaper
                                              oom_reaper
                                              oom_reap_task_mm (invalidates mm)
          exit_mm
          exit_mm_release
          futex_exit_release
          futex_cleanup
          exit_robust_list
          get_user (EFAULT- can't access memory)
      
      If the get_user EFAULT's, the kernel will be unable to recover the
      waiters on the robust_list, leaving userspace mutexes hung indefinitely.
      
      Delay the OOM reaper, allowing more time for the exit path to perform
      the futex cleanup.
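      The delayed queueing can be sketched like this (the constant and helper names are assumptions based on the commit text, not verified against the merged code):

```c
#define OOM_REAPER_DELAY (2 * HZ)	/* give the exit path a head start */

/* Instead of waking the reaper immediately, arm a timer so exit_mm()
 * has time to walk the futex robust list before the mm is reaped. */
static void queue_oom_reaper_sketch(struct task_struct *tsk)
{
	if (test_and_set_bit(MMF_OOM_REAP_QUEUED,
			     &tsk->signal->oom_mm->flags))
		return;

	get_task_struct(tsk);
	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
	add_timer(&tsk->oom_reaper_timer);
}
```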
      
      Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
      
      Based on a patch by Michal Hocko.
      
      Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
      Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
      Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
      Signed-off-by: Joel Savitz <jsavitz@redhat.com>
      Signed-off-by: Nico Pache <npache@redhat.com>
      Co-developed-by: Joel Savitz <jsavitz@redhat.com>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Herton R. Krzesinski <herton@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joel Savitz <jsavitz@redhat.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4a38402
    • mm, hugetlb: allow for "high" userspace addresses · 5f24d5a5
      Authored by Christophe Leroy
      This is a fix for commit f6795053 ("mm: mmap: Allow for "high"
      userspace addresses") for hugetlb.
      
      This patch adds support for "high" userspace addresses that are
      optionally supported on the system and have to be requested via a hint
      mechanism ("high" addr parameter to mmap).
      
      Architectures such as powerpc and x86 achieve this by making changes to
      their architecture-specific versions of the hugetlb_get_unmapped_area()
      function. However, arm64 uses the generic version of that function.
      
      So take into account arch_get_mmap_base() and arch_get_mmap_end() in
      hugetlb_get_unmapped_area().  To allow that, move those two macros out
      of mm/mmap.c into include/linux/sched/mm.h
      
      If these macros are not defined in architectural code then they default
      to (TASK_SIZE) and (base) so should not introduce any behavioural
      changes to architectures that do not define them.
      
      For the time being, only ARM64 is affected by this change.
      
      Catalin (ARM64) said
       "We should have fixed hugetlb_get_unmapped_area() as well when we added
        support for 52-bit VA. The reason for commit f6795053 was to
        prevent normal mmap() from returning addresses above 48-bit by default
        as some user-space had hard assumptions about this.
      
        It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
        but I doubt anyone would notice. It's more likely that the current
        behaviour would cause issues, so I'd rather have them consistent.
      
        Basically when arm64 gained support for 52-bit addresses we did not
        want user-space calling mmap() to suddenly get such high addresses,
        otherwise we could have inadvertently broken some programs (similar
        behaviour to x86 here). Hence we added commit f6795053. But we
        missed hugetlbfs which could still get such high mmap() addresses. So
        in theory that's a potential regression that should have been addressed
        at the same time as commit f6795053 (and before arm64 enabled
        52-bit addresses)"
      
      Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
      Fixes: f6795053 ("mm: mmap: Allow for "high" userspace addresses")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>	[5.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f24d5a5
    • memcg: sync flush only if periodic flush is delayed · 9b301615
      Authored by Shakeel Butt
      Daniel Dao has reported [1] a regression on workloads that may trigger a
      lot of refaults (anon and file).  The underlying issue is that flushing
      rstat is expensive.  Although rstat flushes are batched with (nr_cpus *
      MEMCG_BATCH) stat updates, it seems there are workloads which genuinely
      do more stat updates than the batch value within a short amount of
      time.  Since the rstat flush can happen in performance-critical
      codepaths like page faults, such workloads can suffer greatly.
      
      This patch fixes this regression by making the rstat flushing
      conditional in the performance critical codepaths.  More specifically,
      the kernel relies on the async periodic rstat flusher to flush the stats
      and only if the periodic flusher is delayed by more than twice the
      amount of its normal time window then the kernel allows rstat flushing
      from the performance critical codepaths.
      
      Now the question: what are the side-effects of this change? The worst
      that can happen is that the refault codepath will see 4-second-old
      lruvec stats and may cause false (or missed) activations of the
      refaulted page, which may under- or overestimate the working set size.
      That is not very concerning, as the kernel can already miss or do
      false activations.
      
      There are two more codepaths whose flushing behavior is not changed by
      this patch, and we may need to come back to them in the future.  One is
      the writeback stats used by dirty throttling and the second is the
      deactivation heuristic in reclaim.  For now we are keeping an eye on
      them; if there are reports of regressions due to these codepaths, we
      will reevaluate.
      
      Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
      Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
      Fixes: 1f828223 ("memcg: flush lruvec stats in the refault")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Daniel Dao <dqminh@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Frank Hofmann <fhofmann@cloudflare.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b301615
    • N
      mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb() · 405ce051
      Naoya Horiguchi 提交于
      There is a race condition between memory_failure_hugetlb() and hugetlb
      free/demotion which causes the PageHWPoison flag to be set on the wrong
      page.  One simple consequence is that the wrong processes can be
      killed, but another (more serious) one is that the actual error is left
      unhandled, so nothing prevents later access to it, which might lead to
      more serious results like consuming corrupted data.
      
      Think about the below race window:
      
        CPU 1                                   CPU 2
        memory_failure_hugetlb
        struct page *head = compound_head(p);
                                                hugetlb page might be freed to
                                                buddy, or even changed to another
                                                compound page.
      
        get_hwpoison_page -- page is not what we want now...
      
      The current code first does rough prechecks and then reconfirms after
      taking the refcount, but it was found that this makes the code overly
      complicated, so move the prechecks into a single hugetlb_lock range.
      
      A newly introduced function, try_memory_failure_hugetlb(), always takes
      hugetlb_lock (even for non-hugetlb pages).  That can be improved, but
      memory_failure() is rare in principle, so should not be a big problem.
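The restructured flow can be sketched as a minimal single-threaded userspace model. Every name here (`try_poison_hugetlb`, `page_model`, ...) is illustrative, not the kernel's; the point it demonstrates is that the precheck ("is this still a hugetlb page?") and the refcount grab now happen inside one hugetlb_lock range, so the page cannot be freed or demoted between them.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for hugetlb_lock; asserts model correct lock pairing. */
static int lock_held;
static void hugetlb_lock(void)   { assert(!lock_held); lock_held = 1; }
static void hugetlb_unlock(void) { assert(lock_held);  lock_held = 0; }

struct page_model {
    bool is_hugetlb;
    int refcount;
    bool hwpoison;
};

/* Returns 1 if the hugetlb page was pinned and poisoned, 0 if it is no
 * longer a hugetlb page (the caller falls back to the normal path). */
static int try_poison_hugetlb(struct page_model *p)
{
    int ret = 0;

    hugetlb_lock();
    if (p->is_hugetlb && p->refcount > 0) {
        p->refcount++;       /* pin before acting on the page */
        p->hwpoison = true;  /* poison the page we actually checked */
        ret = 1;
    }
    hugetlb_unlock();
    return ret;
}
```

In the buggy flow, the check and the pin were done in separate lock ranges, leaving the window shown in the CPU 1 / CPU 2 diagram above.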
      
      Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
      Fixes: 761ad8d7 ("mm: hwpoison: introduce memory_failure_hugetlb()")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      405ce051
    • M
      KVM: SEV: add cache flush to solve SEV cache incoherency issues · 683412cc
      Mingwei Zhang 提交于
      Flush the CPU caches when memory is reclaimed from an SEV guest (where
      reclaim also includes it being unmapped from KVM's memslots).  Due to lack
      of coherency for SEV encrypted memory, failure to flush results in silent
      data corruption if userspace is malicious/broken and doesn't ensure SEV
      guest memory is properly pinned and unpinned.
      
      Cache coherency is not enforced across the VM boundary in SEV (AMD APM
      vol. 2, Section 15.34.7).  Confidential cachelines generated by
      confidential VM guests have to be explicitly flushed on the host side.
      If a memory page containing dirty confidential cachelines is released
      by the VM and reallocated to another user, the cachelines may corrupt
      the new user at a later time.

      KVM takes a shortcut by assuming all confidential memory remains
      pinned until the end of the VM's lifetime.  Therefore, KVM does not
      flush caches at mmu_notifier invalidation events.  Because of this
      incorrect assumption and the lack of cache flushing, malicious
      userspace can crash the host kernel by creating a malicious VM that
      continuously allocates/releases unpinned confidential memory pages
      while the VM is running.
      
      Add cache flush operations to the mmu_notifier operations to ensure
      that any physical memory leaving the guest VM gets flushed.  In
      particular, hook the mmu_notifier_invalidate_range_start and
      mmu_notifier_release events and flush caches accordingly.  The hooks
      run after releasing the mmu lock to avoid contention with other vCPUs.
      
      Cc: stable@vger.kernel.org
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reported-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-4-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      683412cc
    • S
      KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused · 2031f287
      Sean Christopherson 提交于
      Add wrappers to acquire/release KVM's SRCU lock when stashing the index
      in vcpu->srcu_idx, along with rudimentary detection of illegal usage,
      e.g. re-acquiring SRCU and thus overwriting vcpu->srcu_idx.  Because
      the SRCU index is (currently) either 0 or 1, illegal nesting bugs can
      go unnoticed for quite some time and only cause problems when the
      nested lock happens to get a different index.
      
      Wrap the WARNs in PROVE_RCU=y, and make them ONCE; otherwise KVM will
      likely yell so loudly that it will bring the kernel to its knees.
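The intent of the wrappers can be sketched as a userspace model. The real helpers wrap srcu_read_lock()/srcu_read_unlock() on kvm->srcu; the mock functions and the depth counter below are illustrative stand-ins, with assertions modeling the WARN-on-misuse checks.

```c
#include <assert.h>

struct vcpu_model {
    int srcu_idx;    /* stashed SRCU index, only ever 0 or 1 */
    int srcu_depth;  /* models the illegal-nesting detection */
};

/* Mock SRCU primitives; the real index comes from srcu_read_lock(). */
static int mock_srcu_read_lock(void) { return 0; }
static void mock_srcu_read_unlock(int idx) { (void)idx; }

static void vcpu_srcu_read_lock(struct vcpu_model *vcpu)
{
    /* Models WARN_ONCE: re-acquiring would silently clobber the
     * stashed index, which is the bug class described above. */
    assert(vcpu->srcu_depth == 0);
    vcpu->srcu_idx = mock_srcu_read_lock();
    vcpu->srcu_depth = 1;
}

static void vcpu_srcu_read_unlock(struct vcpu_model *vcpu)
{
    /* Models WARN_ONCE: unlock without a matching lock. */
    assert(vcpu->srcu_depth == 1);
    vcpu->srcu_depth = 0;
    mock_srcu_read_unlock(vcpu->srcu_idx);
}
```

Because a real SRCU index is only 0 or 1, an unbalanced lock/unlock can look correct most of the time; the depth check turns that silent corruption into an immediate, visible complaint.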

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Fabiano Rosas <farosas@linux.ibm.com>
      Message-Id: <20220415004343.2203171-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2031f287
    • B
      usb: typec: tcpm: Fix undefined behavior due to shift overflowing the constant · 8d084b2e
      Borislav Petkov 提交于
      Fix:
      
        drivers/usb/typec/tcpm/tcpm.c: In function ‘run_state_machine’:
        drivers/usb/typec/tcpm/tcpm.c:4724:3: error: case label does not reduce to an integer constant
           case BDO_MODE_TESTDATA:
           ^~~~
      
      See https://lore.kernel.org/r/YkwQ6%2BtIH8GQpuct@zn.tnic for the gory
      details as to why it triggers with older gccs only.
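The failure class can be demonstrated with illustrative constants (these are not tcpm.c's actual values): shifting a bit into position 31 of a signed int overflows the constant, which is undefined behavior, so older gccs refuse to treat the expression as an integer constant usable in a case label. An unsigned literal keeps the shift well-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative BDO mode constants.  Writing (8 << 28) would shift a
 * set bit into bit 31 of a signed int: undefined behavior, and not a
 * valid integer constant expression for a case label on older gccs.
 * The U suffix makes the shift operate on an unsigned type. */
#define BDO_MODE_SHIFT     28
#define BDO_MODE_TESTDATA  (8U << BDO_MODE_SHIFT)
#define BDO_MODE_MASK      (0xfU << BDO_MODE_SHIFT)

static int is_testdata_mode(uint32_t bdo)
{
    switch (bdo & BDO_MODE_MASK) {
    case BDO_MODE_TESTDATA:  /* well-defined constant, so this compiles */
        return 1;
    default:
        return 0;
    }
}
```

The tcpm fix is exactly this shape of change: make the shifted constants unsigned so the case labels are valid constant expressions on all supported compilers.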
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: linux-usb@vger.kernel.org
      Link: https://lore.kernel.org/r/20220405151517.29753-8-bp@alien8.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d084b2e
  9. 21 Apr, 2022 (2 commits)
  10. 20 Apr, 2022 (3 commits)
  11. 18 Apr, 2022 (1 commit)
  12. 16 Apr, 2022 (3 commits)
  13. 14 Apr, 2022 (1 commit)
    • J
      vfio/pci: Fix vf_token mechanism when device-specific VF drivers are used · 1ef3342a
      Jason Gunthorpe 提交于
      get_pf_vdev() tries to check if a PF is a VFIO PF by looking at the driver:
      
             if (pci_dev_driver(physfn) != pci_dev_driver(vdev->pdev)) {
      
      However now that we have multiple VF and PF drivers this is no longer
      reliable.
      
      This means that security tests related to vf_token can be skipped by
      mixing and matching different VFIO PCI drivers.
      
      Instead of trying to use the driver core to find the PF devices maintain a
      linked list of all PF vfio_pci_core_device's that we have called
      pci_enable_sriov() on.
      
      When registering a VF, just search the list to see if the PF is present
      and record the match permanently in the struct.  PCI core locking
      prevents a PF from passing pci_disable_sriov() while VF drivers are
      attached, so the VFIO-owned PF becomes a static property of the VF.
      
      In common cases where vfio does not own the PF the global list remains
      empty and the VF's pointer is statically NULL.
      
      This also fixes a lockdep splat from recursive locking of the
      vfio_group::device_lock between vfio_device_get_from_name() and
      vfio_device_get_from_dev(). If the VF and PF share the same group this
      would deadlock.
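The tracking scheme described above can be sketched as a userspace model; the names (`vfio_track_pf`, `vfio_pf_is_tracked`, `pf_entry`, ...) are illustrative, not the driver's. PFs that the VFIO driver enabled SR-IOV on go onto a global list, and registering a VF searches that list once.

```c
#include <assert.h>
#include <stdbool.h>

/* One entry per PF on which the VFIO driver enabled SR-IOV. */
struct pf_entry {
    const void *pf_dev;
    struct pf_entry *next;
};

static struct pf_entry *vfio_pf_list;  /* empty when vfio owns no PF */

/* Called around enabling SR-IOV: remember this PF as VFIO-owned. */
static void vfio_track_pf(struct pf_entry *e, const void *pf_dev)
{
    e->pf_dev = pf_dev;
    e->next = vfio_pf_list;
    vfio_pf_list = e;
}

/* Called once at VF registration; the answer can be recorded
 * permanently because the PF cannot disable SR-IOV while VF
 * drivers remain attached. */
static bool vfio_pf_is_tracked(const void *pf_dev)
{
    for (struct pf_entry *e = vfio_pf_list; e; e = e->next)
        if (e->pf_dev == pf_dev)
            return true;
    return false;
}
```

Compared with the old `pci_dev_driver()` comparison, this works regardless of which of the multiple VF/PF drivers is bound, and it avoids the driver-core lookups that produced the lockdep splat.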
      
      Fixes: ff53edf6 ("vfio/pci: Split the pci_driver code out of vfio_pci_core.c")
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Link: https://lore.kernel.org/r/0-v3-876570980634+f2e8-vfio_vf_token_jgg@nvidia.com
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      1ef3342a
  14. 13 Apr, 2022 (5 commits)
  15. 12 Apr, 2022 (4 commits)