1. 15 Mar, 2021 6 commits
    • Merge tag 'perf_urgent_for_v5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 75013c6c
      Linus Torvalds committed
      Pull perf fixes from Borislav Petkov:
      
       - Make sure PMU internal buffers are flushed for per-CPU events too and
         properly handle PID/TID for large PEBS.
      
       - Handle the case properly when there's no PMU and therefore return an
         empty list of perf MSRs for VMX to switch instead of reading random
         garbage from the stack.
      
      * tag 'perf_urgent_for_v5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/perf: Use RET0 as default for guest_get_msrs to handle "no PMU" case
        perf/x86/intel: Set PERF_ATTACH_SCHED_CB for large PEBS and LBR
        perf/core: Flush PMU internal buffers for per-CPU events
    • Merge tag 'efi-urgent-for-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 836d7f05
      Linus Torvalds committed
      Pull EFI fix from Ard Biesheuvel via Borislav Petkov:
       "Fix an oversight in the handling of EFI_RT_PROPERTIES_TABLE, which was
        added in v5.10, but failed to take the SetVirtualAddressMap() RT service
        into account"
      
      * tag 'efi-urgent-for-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi: stub: omit SetVirtualAddressMap() if marked unsupported in RT_PROP table
    • Merge tag 'x86_urgent_for_v5.12_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0a7c10df
      Linus Torvalds committed
      Pull x86 fixes from Borislav Petkov:
      
       - A couple of SEV-ES fixes and robustifications: verify usermode stack
         pointer in NMI is not coming from the syscall gap, correctly track
         IRQ states in the #VC handler and access user insn bytes atomically
         in same handler as latter cannot sleep.
      
       - Balance 32-bit fast syscall exit path to do the proper work on exit
         and thus not confuse audit and ptrace frameworks.
      
       - Two fixes for the ORC unwinder going "off the rails" into KASAN
         redzones and when ORC data is missing.
      
      * tag 'x86_urgent_for_v5.12_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/sev-es: Use __copy_from_user_inatomic()
        x86/sev-es: Correctly track IRQ states in runtime #VC handler
        x86/sev-es: Check regs->sp is trusted before adjusting #VC IST stack
        x86/sev-es: Introduce ip_within_syscall_gap() helper
        x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
        x86/unwind/orc: Silence warnings caused by missing ORC data
        x86/unwind/orc: Disable KASAN checking in the ORC unwinder, part 2
    • Merge tag 'powerpc-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · c3c7579f
      Linus Torvalds committed
      Pull powerpc fixes from Michael Ellerman:
       "Some more powerpc fixes for 5.12:
      
         - Fix wrong instruction encoding for lis in ppc_function_entry(),
           which could potentially lead to missed kprobes.
      
         - Fix SET_FULL_REGS on 32-bit and 64e, which prevented ptrace of
           non-volatile GPRs immediately after exec.
      
         - Clean up a missed SRR specifier in the recent interrupt rework.
      
         - Don't treat unrecoverable_exception() as an interrupt handler, it's
           called from other handlers so shouldn't do the interrupt entry/exit
           accounting itself.
      
         - Fix build errors caused by missing declarations for
           [en/dis]able_kernel_vsx().
      
        Thanks to Christophe Leroy, Daniel Axtens, Geert Uytterhoeven, Jiri
        Olsa, Naveen N. Rao, and Nicholas Piggin"
      
      * tag 'powerpc-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/traps: unrecoverable_exception() is not an interrupt handler
        powerpc: Fix missing declaration of [en/dis]able_kernel_vsx()
        powerpc/64s/exception: Clean up a missed SRR specifier
        powerpc: Fix inverted SET_FULL_REGS bitop
        powerpc/64s: Use symbolic macros for function entry encoding
        powerpc/64s: Fix instruction encoding for lis in ppc_function_entry()
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 9d0c8e79
      Linus Torvalds committed
      Pull KVM fixes from Paolo Bonzini:
       "More fixes for ARM and x86"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: LAPIC: Advancing the timer expiration on guest initiated write
        KVM: x86/mmu: Skip !MMU-present SPTEs when removing SP in exclusive mode
        KVM: kvmclock: Fix vCPUs > 64 can't be online/hotpluged
        kvm: x86: annotate RCU pointers
        KVM: arm64: Fix exclusive limit for IPA size
        KVM: arm64: Reject VM creation when the default IPA size is unsupported
        KVM: arm64: Ensure I-cache isolation between vcpus of a same VM
        KVM: arm64: Don't use cbz/adr with external symbols
        KVM: arm64: Fix range alignment when walking page tables
        KVM: arm64: Workaround firmware wrongly advertising GICv2-on-v3 compatibility
        KVM: arm64: Rename __vgic_v3_get_ich_vtr_el2() to __vgic_v3_get_gic_config()
        KVM: arm64: Don't access PMSELR_EL0/PMUSERENR_EL0 when no PMU is available
        KVM: arm64: Turn kvm_arm_support_pmu_v3() into a static key
        KVM: arm64: Fix nVHE hyp panic host context restore
        KVM: arm64: Avoid corrupting vCPU context register in guest exit
        KVM: arm64: nvhe: Save the SPE context early
        kvm: x86: use NULL instead of using plain integer as pointer
        KVM: SVM: Connect 'npt' module param to KVM's internal 'npt_enabled'
        KVM: x86: Ensure deadline timer has truly expired before posting its IRQ
    • Merge branch 'akpm' (patches from Andrew) · 50eb842f
      Linus Torvalds committed
      Merge misc fixes from Andrew Morton:
       "28 patches.
      
        Subsystems affected by this series: mm (memblock, pagealloc, hugetlb,
        highmem, kfence, oom-kill, madvise, kasan, userfaultfd, memcg, and
        zram), core-kernel, kconfig, fork, binfmt, MAINTAINERS, kbuild, and
        ia64"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (28 commits)
        zram: fix broken page writeback
        zram: fix return value on writeback_store
        mm/memcg: set memcg when splitting page
        mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument
        ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign
        ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
        mm/userfaultfd: fix memory corruption due to writeprotect
        kasan: fix KASAN_STACK dependency for HW_TAGS
        kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC
        mm/madvise: replace ptrace attach requirement for process_madvise
        include/linux/sched/mm.h: use rcu_dereference in in_vfork()
        kfence: fix reports if constant function prefixes exist
        kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations
        kfence: fix printk format for ptrdiff_t
        linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*
        MAINTAINERS: exclude uapi directories in API/ABI section
        binfmt_misc: fix possible deadlock in bm_register_write
        mm/highmem.c: fix zero_user_segments() with start > end
        hugetlb: do early cow when page pinned on src mm
        mm: use is_cow_mapping() across tree where proper
        ...
  2. 14 Mar, 2021 34 commits
    • Merge tag 'char-misc-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 88fe4924
      Linus Torvalds committed
      Pull char/misc driver fixes from Greg KH:
       "Here are some small misc/char driver fixes to resolve some reported
        problems:
      
         - habanalabs driver fixes
      
         - Acrn build fixes (reported many times)
      
         - pvpanic module table export fix
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'char-misc-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        misc/pvpanic: Export module FDT device table
        misc: fastrpc: restrict user apps from sending kernel RPC messages
        virt: acrn: Correct type casting of argument of copy_from_user()
        virt: acrn: Use EPOLLIN instead of POLLIN
        virt: acrn: Use vfs_poll() instead of f_op->poll()
        virt: acrn: Make remove_cpu sysfs invisible with !CONFIG_HOTPLUG_CPU
        cpu/hotplug: Fix build error of using {add,remove}_cpu() with !CONFIG_SMP
        habanalabs: fix debugfs address translation
        habanalabs: Disable file operations after device is removed
        habanalabs: Call put_pid() when releasing control device
        drivers: habanalabs: remove unused dentry pointer for debugfs files
        habanalabs: mark hl_eq_inc_ptr() as static
    • Merge tag 'staging-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · be61af33
      Linus Torvalds committed
      Pull staging driver fixes from Greg KH:
       "Here are some small staging driver fixes for reported problems. They
        include:
      
         - wfx header file cleanup patch reverted as it could cause problems
      
         - comedi driver endian fixes
      
         - buffer overflow problems for staging wifi drivers
      
         - build dependency issue for rtl8192e driver
      
        All have been in linux-next for a while with no reported problems"
      
      * tag 'staging-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (23 commits)
        Revert "staging: wfx: remove unused included header files"
        staging: rtl8188eu: prevent ->ssid overflow in rtw_wx_set_scan()
        staging: rtl8188eu: fix potential memory corruption in rtw_check_beacon_data()
        staging: rtl8192u: fix ->ssid overflow in r8192_wx_set_scan()
        staging: comedi: pcl726: Use 16-bit 0 for interrupt data
        staging: comedi: ni_65xx: Use 16-bit 0 for interrupt data
        staging: comedi: ni_6527: Use 16-bit 0 for interrupt data
        staging: comedi: comedi_parport: Use 16-bit 0 for interrupt data
        staging: comedi: amplc_pc236_common: Use 16-bit 0 for interrupt data
        staging: comedi: pcl818: Fix endian problem for AI command data
        staging: comedi: pcl711: Fix endian problem for AI command data
        staging: comedi: me4000: Fix endian problem for AI command data
        staging: comedi: dmm32at: Fix endian problem for AI command data
        staging: comedi: das800: Fix endian problem for AI command data
        staging: comedi: das6402: Fix endian problem for AI command data
        staging: comedi: adv_pci1710: Fix endian problem for AI command data
        staging: comedi: addi_apci_1500: Fix endian problem for command sample
        staging: comedi: addi_apci_1032: Fix endian problem for COS sample
        staging: ks7010: prevent buffer overflow in ks_wlan_set_scan()
        staging: rtl8712: Fix possible buffer overflow in r8712_sitesurvey_cmd
        ...
    • Merge tag 'tty-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · cc14086f
      Linus Torvalds committed
      Pull tty/serial fixes from Greg KH:
       "Here are some small tty and serial driver fixes to resolve some
        reported problems:
      
         - led tty trigger fixes based on review and were acked by the led
           maintainer
      
         - revert a max310x serial driver patch as it was causing problems
      
         - revert a pty change as it was also causing problems
      
        All of these have been in linux-next for a while with no reported
        problems"
      
      * tag 'tty-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        Revert "drivers:tty:pty: Fix a race causing data loss on close"
        Revert "serial: max310x: rework RX interrupt handling"
        leds: trigger/tty: Use led_set_brightness_sync() from workqueue
        leds: trigger: Fix error path to not unlock the unlocked mutex
    • Merge tag 'usb-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 5c7bdbf8
      Linus Torvalds committed
      Pull USB fixes from Greg KH:
       "Here are a small number of USB fixes for 5.12-rc3 to resolve a bunch
        of reported issues:
      
         - usbip fixups for issues found by syzbot
      
         - xhci driver fixes and quirk additions
      
         - gadget driver fixes
      
         - dwc3 QCOM driver fix
      
         - usb-serial new ids and fixes
      
         - usblp fix for a long-time issue
      
         - cdc-acm quirk addition
      
         - other tiny fixes for reported problems
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'usb-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (25 commits)
        xhci: Fix repeated xhci wake after suspend due to uncleared internal wake state
        usb: xhci: Fix ASMedia ASM1042A and ASM3242 DMA addressing
        xhci: Improve detection of device initiated wake signal.
        usb: xhci: do not perform Soft Retry for some xHCI hosts
        usbip: fix vudc usbip_sockfd_store races leading to gpf
        usbip: fix vhci_hcd attach_store() races leading to gpf
        usbip: fix stub_dev usbip_sockfd_store() races leading to gpf
        usbip: fix vudc to check for stream socket
        usbip: fix vhci_hcd to check for stream socket
        usbip: fix stub_dev to check for stream socket
        usb: dwc3: qcom: Add missing DWC3 OF node refcount decrement
        USB: usblp: fix a hang in poll() if disconnected
        USB: gadget: udc: s3c2410_udc: fix return value check in s3c2410_udc_probe()
        usb: renesas_usbhs: Clear PIPECFG for re-enabling pipe with other EPNUM
        usb: dwc3: qcom: Honor wakeup enabled/disabled state
        usb: gadget: f_uac1: stop playback on function disable
        usb: gadget: f_uac2: always increase endpoint max_packet_size by one audio slot
        USB: gadget: u_ether: Fix a configfs return code
        usb: dwc3: qcom: add ACPI device id for sc8180x
        Goodix Fingerprint device is not a modem
        ...
    • Merge tag 'erofs-for-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 42062343
      Linus Torvalds committed
      Pull erofs fix from Gao Xiang:
       "Fix an urgent regression introduced by commit baa2c7c9 ("block:
        set .bi_max_vecs as actual allocated vector number"), which could
        cause unexpected hangs since Linux 5.12-rc1.
      
        Resolve it by avoiding using bio->bi_max_vecs completely"
      
      * tag 'erofs-for-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        erofs: fix bio->bi_max_vecs behavior change
    • Merge tag 'kbuild-fixes-v5.12-2' of... · e83bad7f
      Linus Torvalds committed
      Merge tag 'kbuild-fixes-v5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - avoid 'make image_name' invoking syncconfig
      
       - fix a couple of bugs in scripts/dummy-tools
      
       - fix LLD_VENDOR and locale issues in scripts/ld-version.sh
      
       - rebuild GCC plugins when the compiler is upgraded
      
       - allow LTO to be enabled with KASAN_HW_TAGS
      
       - allow LTO to be enabled without LLVM=1
      
      * tag 'kbuild-fixes-v5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kbuild: fix ld-version.sh to not be affected by locale
        kbuild: remove meaningless parameter to $(call if_changed_rule,dtc)
        kbuild: remove LLVM=1 test from HAS_LTO_CLANG
        kbuild: remove unneeded -O option to dtc
        kbuild: dummy-tools: adjust to scripts/cc-version.sh
        kbuild: Allow LTO to be selected with KASAN_HW_TAGS
        kbuild: dummy-tools: support MPROFILE_KERNEL checks for ppc
        kbuild: rebuild GCC plugins when the compiler is upgraded
        kbuild: Fix ld-version.sh script if LLD was built with LLD_VENDOR
        kbuild: dummy-tools: fix inverted tests for gcc
        kbuild: add image_name to no-sync-config-targets
    • zram: fix broken page writeback · 2766f182
      Minchan Kim committed
      Commit 0d835962 ("zram: support page writeback") introduced two
      problems.  First, it overwrites writeback_store's return value with
      kstrtol's return value, which is zero on success, so userspace sees
      zero as the return value of the write syscall even though the data was
      written successfully.
      
      Second, it breaks the index handling in the loop: the index is no
      longer incremented.  As a result only the first block index can be
      written back, so the user cannot write back all idle pages in the zram
      device and loses the chance to save memory.
      
      This patch fixes those issues.
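
The two problems described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_writeback_store`, `toy_submit` and the `written` array are hypothetical names and do not reproduce the zram driver's code. It shows a helper's status being kept in a separate variable, so that success still reports the byte count, and an index that advances on every iteration.

```c
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

#define TOY_NR_PAGES 8

/* Hypothetical stand-in for writing back one block; returns 0 on success. */
static int toy_submit(int *written, size_t index)
{
	written[index] = 1;
	return 0;
}

/*
 * Toy model of a sysfs-style store handler: on success it must return
 * the number of bytes consumed (len), not the 0 that a helper returned.
 */
static ssize_t toy_writeback_store(int *written, size_t len)
{
	ssize_t ret = (ssize_t)len;	/* value reported to write() on success */
	size_t index;

	/* The index advances on every iteration, so every page is visited. */
	for (index = 0; index < TOY_NR_PAGES; index++) {
		int err = toy_submit(written, index);	/* separate variable */

		if (err) {
			ret = err;	/* only a real error replaces len */
			break;
		}
	}
	return ret;
}
```

Keeping the helper status in `err` rather than reusing `ret` is the design point: a successful helper call can then never turn the handler's return value into zero.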
      
      Link: https://lkml.kernel.org/r/20210312173949.2197662-2-minchan@kernel.org
      Fixes: 0d835962 ("zram: support page writeback")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Amos Bianchi <amosbianchi@google.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: John Dias <joaodias@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: fix return value on writeback_store · 57e0076e
      Minchan Kim committed
      writeback_store's return value is overwritten by submit_bio_wait's
      return value.  Thus, writeback_store returns zero whenever there was no
      IO error.  In the end, the write syscall from userspace sees zero as its
      return value, which can make the process stall because it keeps
      retrying the write until it succeeds.
      
      Link: https://lkml.kernel.org/r/20210312173949.2197662-1-minchan@kernel.org
      Fixes: 3b82a051 ("drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: John Dias <joaodias@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: set memcg when splitting page · e1baddf8
      Zhou Guanghui committed
      As described in the split_page() comment, the sub-pages of a
      non-compound high-order page must be freed individually.  If the memcg
      of the first page is valid, the tail pages cannot be uncharged when
      they are freed.
      
      For example, when alloc_pages_exact is used to allocate 1MB of
      contiguous physical memory, 2MB is charged (kmemcg is enabled and
      __GFP_ACCOUNT is set).  When make_alloc_exact frees the unused 1MB and
      free_pages_exact frees the requested 1MB, only 4KB (one page) is
      actually uncharged.
      
      Therefore, the memcg of the tail pages needs to be set when splitting
      a page.
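
The accounting problem can be sketched with a toy model. All names here (`toy_memcg`, `toy_page`, `toy_charge`, `toy_split_page_memcg`, `toy_free_page`) are hypothetical and much simpler than the kernel's structures; reference counting on the memcg is deliberately left out. The sketch only shows why a tail page that does not know its memcg cannot be uncharged, and how copying the head page's memcg at split time addresses that.

```c
#include <stddef.h>

/* Hypothetical toy types; not the kernel's struct page or mem_cgroup. */
struct toy_memcg { long charged_pages; };
struct toy_page { struct toy_memcg *memcg; };

/* Charge nr pages but, as for a non-compound allocation, tag only the head. */
static void toy_charge(struct toy_page *head, size_t nr, struct toy_memcg *cg)
{
	cg->charged_pages += (long)nr;
	head[0].memcg = cg;
	for (size_t i = 1; i < nr; i++)
		head[i].memcg = NULL;
}

/* At split time, copy the head page's memcg to every tail page. */
static void toy_split_page_memcg(struct toy_page *head, size_t nr)
{
	for (size_t i = 1; i < nr; i++)
		head[i].memcg = head[0].memcg;
}

/* Free one page: it can only be uncharged if it knows its memcg. */
static void toy_free_page(struct toy_page *page)
{
	if (page->memcg) {
		page->memcg->charged_pages--;
		page->memcg = NULL;
	}
}
```

In this model, freeing four pages individually without the split step uncharges only the head page, while calling `toy_split_page_memcg()` first lets all four be uncharged.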
      
      Michel:
      
      There are at least two explicit users of __GFP_ACCOUNT with
      alloc_pages_exact added recently.  See 7efe8ef2 ("KVM: arm64:
      Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c4196218
      ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
      just a theoretical issue.
      
      Link: https://lkml.kernel.org/r/20210304074053.65527-3-zhouguanghui1@huawei.com
      Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Rui Xiang <rui.xiang@huawei.com>
      Cc: Tianhong Ding <dingtianhong@huawei.com>
      Cc: Weilong Chen <chenweilong@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument · be6c8982
      Zhou Guanghui committed
      Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly pass
      in the number of pages as an argument.
      
      In this way, the interface name is more generic and can be used by
      other potential users.  In addition, the complete memcg information
      (memcg and flags) needs to be set on the tail pages.
      
      Link: https://lkml.kernel.org/r/20210304074053.65527-2-zhouguanghui1@huawei.com
      Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Tianhong Ding <dingtianhong@huawei.com>
      Cc: Weilong Chen <chenweilong@huawei.com>
      Cc: Rui Xiang <rui.xiang@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign · 61bf318e
      Sergei Trofimovich committed
      In https://bugs.gentoo.org/769614 Dmitry noticed that
      `ptrace(PTRACE_GET_SYSCALL_INFO)` does not return the sign of the error
      properly.
      
      The bug is a mismatch between the get and set helpers for errors:
      
      static inline long syscall_get_error(struct task_struct *task,
                                           struct pt_regs *regs)
      {
              return regs->r10 == -1 ? regs->r8:0;
      }
      
      static inline long syscall_get_return_value(struct task_struct *task,
                                                  struct pt_regs *regs)
      {
              return regs->r8;
      }
      
      static inline void syscall_set_return_value(struct task_struct *task,
                                                  struct pt_regs *regs,
                                                  int error, long val)
      {
              if (error) {
                      /* error < 0, but ia64 uses > 0 return value */
                      regs->r8 = -error;
                      regs->r10 = -1;
              } else {
                      regs->r8 = val;
                      regs->r10 = 0;
              }
      }
      
      Tested on v5.10 on rx3600 machine (ia64 9040 CPU).
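
The mismatch in the helpers quoted above can be reproduced in a small userspace model. `toy_regs` and the `toy_*` functions are hypothetical names; the first getter mirrors the quoted code, while the second is only an assumption about one way to make the pair symmetric (negating `r8` on the error path), not a copy of the actual patch.

```c
/* Hypothetical subset of the register state used by the quoted helpers. */
struct toy_regs { long r8; long r10; };

/* Mirrors the quoted setter: errors are stored as a positive r8. */
static void toy_set_return_value(struct toy_regs *regs, int error, long val)
{
	if (error) {
		regs->r8 = -error;	/* error < 0, stored as > 0 */
		regs->r10 = -1;
	} else {
		regs->r8 = val;
		regs->r10 = 0;
	}
}

/* Mirrors the quoted getter: hands back the positive r8 unchanged. */
static long toy_get_error_quoted(const struct toy_regs *regs)
{
	return regs->r10 == -1 ? regs->r8 : 0;
}

/* Assumed symmetric variant: undo the negation done by the setter. */
static long toy_get_error_symmetric(const struct toy_regs *regs)
{
	return regs->r10 == -1 ? -regs->r8 : 0;
}
```

Setting an error of -22 and reading it back through the quoted getter yields 22, which is the lost sign the report describes; the symmetric variant yields -22.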
      
      Link: https://lkml.kernel.org/r/20210221002554.333076-2-slyfox@gentoo.org
      Link: https://bugs.gentoo.org/769614
      Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
      Reported-by: Dmitry V. Levin <ldv@altlinux.org>
      Reviewed-by: Dmitry V. Levin <ldv@altlinux.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls · 0ceb1ace
      Sergei Trofimovich committed
      In https://bugs.gentoo.org/769614 Dmitry noticed that
      `ptrace(PTRACE_GET_SYSCALL_INFO)` does not work for syscalls called via
      glibc's syscall() wrapper.
      
      ia64 has two ways to call syscalls from userspace: via `break` and via
      `eps` instructions.
      
      The difference is in stack layout:
      
      1. `eps` creates simple stack frame: no locals, in{0..7} == out{0..8}
      2. `break` uses userspace stack frame: may be locals (glibc provides
         one), in{0..7} == out{0..8}.
      
      Both work fine in the syscall handling code itself.
      
      But `ptrace(PTRACE_GET_SYSCALL_INFO)` uses unwind mechanism to
      re-extract syscall arguments but it does not account for locals.
      
      The change always skips the locals registers. It should not change the
      `eps` path, as the kernel's handler already enforces locals=0, and it
      fixes the `break` path.
      
      Tested on v5.10 on rx3600 machine (ia64 9040 CPU).
      
      Link: https://lkml.kernel.org/r/20210221002554.333076-1-slyfox@gentoo.org
      Link: https://bugs.gentoo.org/769614
      Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
      Reported-by: Dmitry V. Levin <ldv@altlinux.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/userfaultfd: fix memory corruption due to writeprotect · 6ce64428
      Nadav Amit committed
      Userfaultfd self-test fails occasionally, indicating a memory corruption.
      
      Analyzing this problem indicates that there is a real bug since mmap_lock
      is only taken for read in mwriteprotect_range() and defers flushes, and
      since there is insufficient consideration of concurrent deferred TLB
      flushes in wp_page_copy().  Although the PTE is flushed from the TLBs in
      wp_page_copy(), this flush takes place after the copy has already been
      performed, and therefore changes of the page are possible between the time
      of the copy and the time in which the PTE is flushed.
      
      To make matters worse, memory-unprotection using userfaultfd also poses a
      problem.  Although memory unprotection is logically a promotion of PTE
      permissions, and therefore should not require a TLB flush, the current
      userfaultfd code might actually cause a demotion of the architectural PTE
      permission: when userfaultfd_writeprotect() unprotects a memory region, it
      unintentionally *clears* the RW-bit if it was already set.  Note that
      unprotecting a PTE that is not write-protected is a valid use-case: the
      userfaultfd monitor might ask to unprotect a region that holds both
      write-protected and write-unprotected PTEs.
      
      The scenario that happens in selftests/vm/userfaultfd is as follows:
      
      cpu0				cpu1			cpu2
      ----				----			----
      							[ Writable PTE
      							  cached in TLB ]
      userfaultfd_writeprotect()
      [ write-*unprotect* ]
      mwriteprotect_range()
      mmap_read_lock()
      change_protection()
      
      change_protection_range()
      ...
      change_pte_range()
      [ *clear* “write”-bit ]
      [ defer TLB flushes ]
      				[ page-fault ]
      				...
      				wp_page_copy()
      				 cow_user_page()
      				  [ copy page ]
      							[ write to old
      							  page ]
      				...
      				 set_pte_at_notify()
      
      A similar scenario can happen:
      
      cpu0		cpu1		cpu2		cpu3
      ----		----		----		----
      						[ Writable PTE
      				  		  cached in TLB ]
      userfaultfd_writeprotect()
      [ write-protect ]
      [ deferred TLB flush ]
      		userfaultfd_writeprotect()
      		[ write-unprotect ]
      		[ deferred TLB flush]
      				[ page-fault ]
      				wp_page_copy()
      				 cow_user_page()
      				 [ copy page ]
      				 ...		[ write to page ]
      				set_pte_at_notify()
      
      This race exists since commit 292924b2 ("userfaultfd: wp: apply
      _PAGE_UFFD_WP bit").  Yet, as Yu Zhao pointed out, these races became
      apparent since commit 09854ba9 ("mm: do_wp_page() simplification"), which
      made wp_page_copy() more likely to take place, specifically if
      page_count(page) > 1.
      
      To resolve the aforementioned races, check whether there are pending
      flushes on uffd-write-protected VMAs, and if there are, perform a flush
      before doing the COW.
      
      Further optimizations will follow to avoid unnecessary PTE
      write-protection and TLB flushes during uffd-write-unprotect.
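
The resolution described above (flush before copying when flushes are pending on a uffd-write-protected VMA) can be sketched as a toy model. The types and functions below (`toy_mm`, `toy_vma`, `toy_flush_tlb`, `toy_wp_fault`) are hypothetical and stand in for the kernel's mm, VMA and TLB machinery; this is not the patch itself, only the ordering it describes.

```c
#include <stdbool.h>

/* Hypothetical toy state; not the kernel's mm_struct or vm_area_struct. */
struct toy_mm {
	int tlb_flush_pending;	/* number of deferred flushes in flight */
	int flushes_done;	/* flushes performed by the fault path */
};

struct toy_vma {
	struct toy_mm *mm;
	bool uffd_wp;		/* VMA is uffd-write-protected */
};

static void toy_flush_tlb(struct toy_mm *mm)
{
	mm->flushes_done++;
}

/*
 * Toy write-protect fault: if another CPU deferred a flush on a
 * uffd-write-protected VMA, flush before the page would be copied, so
 * no stale writable TLB entry can modify the old page during the copy.
 * Returns true if a flush was performed.
 */
static bool toy_wp_fault(struct toy_vma *vma)
{
	bool flushed = false;

	if (vma->uffd_wp && vma->mm->tlb_flush_pending > 0) {
		toy_flush_tlb(vma->mm);
		flushed = true;
	}
	/* ... the page copy would happen here (omitted in this model) ... */
	return flushed;
}
```

The pending counter is left untouched on purpose: in this model the CPU that deferred the flush still owns its own bookkeeping, and the fault path only adds the early flush.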
      
      Link: https://lkml.kernel.org/r/20210304095423.3825684-1-namit@vmware.com
      Fixes: 09854ba9 ("mm: do_wp_page() simplification")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Suggested-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Tested-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>	[5.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan: fix KASAN_STACK dependency for HW_TAGS · d9b571c8
      Andrey Konovalov committed
      There's a runtime failure when running HW_TAGS-enabled kernel built with
      GCC on hardware that doesn't support MTE.  GCC-built kernels always have
      CONFIG_KASAN_STACK enabled, even though stack instrumentation isn't
      supported by HW_TAGS.  Having that config enabled causes KASAN to issue
      MTE-only instructions to unpoison kernel stacks, which causes the failure.
      
      Fix the issue by disallowing CONFIG_KASAN_STACK when HW_TAGS is used.
      
      (The commit that introduced CONFIG_KASAN_HW_TAGS specified proper
       dependency for CONFIG_KASAN_STACK_ENABLE but not for CONFIG_KASAN_STACK.)
      
      Link: https://lkml.kernel.org/r/59e75426241dbb5611277758c8d4d6f5f9298dac.1615215441.git.andreyknvl@google.com
      Fixes: 6a63a63f ("kasan: introduce CONFIG_KASAN_HW_TAGS")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reported-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: <stable@vger.kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC · f9d79e8d
      Andrey Konovalov committed
      Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called
      after debug_pagealloc_unmap_pages(). This causes a crash when
      debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an
      unmapped page.
      
      This patch puts kasan_free_nondeferred_pages() before
      debug_pagealloc_unmap_pages() and arch_free_page(), which can also make
      the page unavailable.
      
      Link: https://lkml.kernel.org/r/24cd7db274090f0e5bc3adcdc7399243668e3171.1614987311.git.andreyknvl@google.com
      Fixes: 94ab5b61 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9d79e8d
    • S
      mm/madvise: replace ptrace attach requirement for process_madvise · 96cfe2c0
      Suren Baghdasaryan committed
      process_madvise currently requires ptrace attach capability.
      PTRACE_MODE_ATTACH gives one process complete control over another
      process.  It effectively removes the security boundary between the two
      processes (in one direction).  Granting ptrace attach capability even to a
      system process is considered dangerous since it creates an attack surface.
      This severely limits the usage of this API.
      
      The operations process_madvise can perform do not affect the correctness
      of the operation of the target process; they only affect where the data is
      physically located (and therefore, how fast it can be accessed).  What we
      want is the ability for one process to influence another process in order
      to optimize performance across the entire system while leaving the
      security boundary intact.
      
      Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and
      CAP_SYS_NICE.  PTRACE_MODE_READ to prevent leaking ASLR metadata and
      CAP_SYS_NICE for influencing process performance.
      
      Link: https://lkml.kernel.org/r/20210303185807.2160264-1-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96cfe2c0
    • M
      include/linux/sched/mm.h: use rcu_dereference in in_vfork() · 149fc787
      Matthew Wilcox (Oracle) committed
      Fix a sparse warning by using rcu_dereference().  Technically this is a
      bug and a sufficiently aggressive compiler could reload the `real_parent'
      pointer outside the protection of the rcu lock (and access freed memory),
      but I think it's pretty unlikely to happen.
      
      Link: https://lkml.kernel.org/r/20210221194207.1351703-1-willy@infradead.org
      Fixes: b18dc5f2 ("mm, oom: skip vforked tasks from being selected")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      149fc787
    • M
      kfence: fix reports if constant function prefixes exist · 0aa41cae
      Marco Elver committed
      Some architectures prefix all functions with a constant string ('.' on
      ppc64).  Add ARCH_FUNC_PREFIX, which may optionally be defined in
      <asm/kfence.h>, so that get_stack_skipnr() can work properly.
      
      Link: https://lkml.kernel.org/r/f036c53d-7e81-763c-47f4-6024c6c5f058@csgroup.eu
      Link: https://lkml.kernel.org/r/20210304144000.1148590-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0aa41cae
    • M
      kfence, slab: fix cache_alloc_debugcheck_after() for bulk allocations · df3ae2c9
      Marco Elver committed
      cache_alloc_debugcheck_after() performs checks on an object, including
      adjusting the returned pointer.  None of this should apply to KFENCE
      objects.  While for non-bulk allocations, the checks are skipped when we
      allocate via KFENCE, for bulk allocations cache_alloc_debugcheck_after()
      is called via cache_alloc_debugcheck_after_bulk().
      
      Fix it by skipping cache_alloc_debugcheck_after() for KFENCE objects.
      
      Link: https://lkml.kernel.org/r/20210304205256.2162309-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df3ae2c9
    • M
      kfence: fix printk format for ptrdiff_t · 702b16d7
      Marco Elver committed
      Use %td for ptrdiff_t.
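
For illustration only, here is a user-space sketch of the same idea with standard C `snprintf()`; it is not the KFENCE code. The names `pool`, `out` and `format_offset()` are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdio.h>

static char pool[64];   /* hypothetical stand-in for an object pool */
static char out[32];    /* scratch buffer for the formatted text */

/* Return the distance between two addresses inside pool, as text. */
static const char *format_offset(size_t from, size_t to)
{
	/* Pointer subtraction yields a signed ptrdiff_t. */
	ptrdiff_t off = &pool[to] - &pool[from];

	/* %td is the C99 length modifier plus conversion for ptrdiff_t. */
	snprintf(out, sizeof(out), "%td", off);
	return out;
}
```

Because ptrdiff_t is signed and its width depends on the ABI, a conversion such as `%d` or `%lu` can mismatch on some architectures, which is the kind of warning the patch avoids.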
      
      Link: https://lkml.kernel.org/r/3abbe4c9-16ad-c168-a90f-087978ccd8f7@csgroup.eu
      Link: https://lkml.kernel.org/r/20210303121157.3430807-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      702b16d7
    • A
      linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP* · 97e49102
      Arnd Bergmann committed
      Separating compiler-clang.h from compiler-gcc.h inadvertently dropped the
      definitions of the three HAVE_BUILTIN_BSWAP macros, which requires falling
      back to the open-coded version and hoping that the compiler detects it.
      
      Since all versions of clang support the __builtin_bswap interfaces, add
      back the flags and have the headers pick these up automatically.
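
As a rough sketch of the two forms being compared (not the kernel's swab helpers), the open-coded byte swap and the builtin look like this. The function names are hypothetical; `__builtin_bswap32()` is the builtin provided by both GCC and clang.

```c
#include <stdint.h>

/* Open-coded swap: always correct, but only fast if the compiler
 * recognizes the shift-and-mask idiom. */
static uint32_t swab32_open_coded(uint32_t x)
{
	return ((x & 0x000000ffu) << 24) |
	       ((x & 0x0000ff00u) <<  8) |
	       ((x & 0x00ff0000u) >>  8) |
	       ((x & 0xff000000u) >> 24);
}

/* Builtin form: the compiler emits its own byte-swap sequence directly. */
static uint32_t swab32_builtin(uint32_t x)
{
	return __builtin_bswap32(x);
}
```

Both functions return the same value for every input; the difference is only in how much pattern matching the compiler has to do.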
      
      This results in a 4% improvement of compilation speed for arm defconfig.
      
      Note: it might also be worth revisiting which architectures set
      CONFIG_ARCH_USE_BUILTIN_BSWAP for one compiler or the other, today this is
      set on six architectures (arm32, csky, mips, powerpc, s390, x86), while
      another ten architectures define custom helpers (alpha, arc, ia64, m68k,
      mips, nios2, parisc, sh, sparc, xtensa), and the rest (arm64, h8300,
      hexagon, microblaze, nds32, openrisc, riscv) just get the unoptimized
      version and rely on the compiler to detect it.
      
      A long time ago, the compiler builtins were architecture specific, but
      nowadays, all compilers that are able to build the kernel have correct
      implementations of them, though some may not be as optimized as the inline
      asm versions.
      
      The patch that dropped the optimization landed in v4.19, so as discussed
      it would be fairly safe to backport this revert to the 4.19/5.4/5.10
      stable kernels, but there is a remaining risk for regressions, and it
      has no known side-effects besides compile speed.
      
      Link: https://lkml.kernel.org/r/20210226161151.2629097-1-arnd@kernel.org
      Link: https://lore.kernel.org/lkml/20210225164513.3667778-1-arnd@kernel.org/
      Fixes: 815f0ddb ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Miguel Ojeda <ojeda@kernel.org>
      Acked-by: Nick Desaulniers <ndesaulniers@google.com>
      Acked-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Arvind Sankar <nivedita@alum.mit.edu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97e49102
    • V
      MAINTAINERS: exclude uapi directories in API/ABI section · f0b15b60
      Vlastimil Babka committed
      Commit 7b4693e6 ("MAINTAINERS: add uapi directories to API/ABI
      section") added include/uapi/ and arch/*/include/uapi/ so that patches
      modifying them CC linux-api.  However that was already done in the past
      and resulted in too much noise and thus later removed, as explained in
      b14fd334 ("MAINTAINERS: trim the file triggers for ABI/API")
      
      To prevent another round of addition and removal in the future, change the
      entries to X: (explicit exclusion) for documentation purposes, although
      they are not subdirectories of broader included directories, as there is
      apparently no defined way to add plain comments in subsystem sections.
      
      Link: https://lkml.kernel.org/r/20210301100255.25229-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
      Acked-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0b15b60
    • L
      binfmt_misc: fix possible deadlock in bm_register_write · e7850f4d
      Lior Ribak committed
      There is a deadlock in bm_register_write:
      
      First, in the beginning of the function, a lock is taken on the binfmt_misc
      root inode with inode_lock(d_inode(root)).
      
      Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call
      open_exec on the user-provided interpreter.
      
      open_exec will call a path lookup, and if the path lookup process includes
      the root of binfmt_misc, it will try to take a shared lock on its inode
      again.  Since the inode is already locked, the code gets stuck in a deadlock.
      
      To reproduce the bug:
      $ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" > /proc/sys/fs/binfmt_misc/register
      
      Backtrace of where the deadlock occurs (frame #5):
      0  schedule () at ./arch/x86/include/asm/current.h:15
      1  0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=<optimized out>, state=state@entry=2) at kernel/locking/rwsem.c:992
      2  0xffffffff81b5150a in __down_read_common (state=2, sem=<optimized out>) at kernel/locking/rwsem.c:1213
      3  __down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1222
      4  down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1355
      5  0xffffffff811ee22a in inode_lock_shared (inode=<optimized out>) at ./include/linux/fs.h:783
      6  open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177
      7  path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366
      8  0xffffffff811efe1c in do_filp_open (dfd=<optimized out>, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396
      9  0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=<optimized out>, flags@entry=0) at fs/exec.c:913
      10 0xffffffff811e4a92 in open_exec (name=<optimized out>) at fs/exec.c:948
      11 0xffffffff8124aa84 in bm_register_write (file=<optimized out>, buffer=<optimized out>, count=19, ppos=<optimized out>) at fs/binfmt_misc.c:682
      12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF
      ", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603
      13 0xffffffff811defda in ksys_write (fd=<optimized out>, buf=0xa758d0 ":iiiii:E::ii::i:CF
      ", count=19) at fs/read_write.c:658
      14 0xffffffff81b49813 in do_syscall_64 (nr=<optimized out>, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46
      15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
      
      To solve the issue, the open_exec call is moved to before the write
      lock is taken by bm_register_write.
      
      Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com
      Fixes: 948b701a ("binfmt_misc: add persistent opened binary handler for containers")
      Signed-off-by: Lior Ribak <liorribak@gmail.com>
      Acked-by: Helge Deller <deller@gmx.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7850f4d
    • O
      mm/highmem.c: fix zero_user_segments() with start > end · 184cee51
      OGAWA Hirofumi committed
      zero_user_segments() is used from __block_write_begin_int(), for example
      like the following:
      
      	zero_user_segments(page, 4096, 1024, 512, 918)
      
      But the new zero_user_segments() implementation for HIGHMEM +
      TRANSPARENT_HUGEPAGE doesn't handle the "start > end" case correctly, and
      hits BUG_ON().  (We could fix __block_write_begin_int() instead, but this
      calling pattern is old and has multiple users.)
      
      Also it calls kmap_atomic() unnecessarily while start == end == 0.
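
A minimal user-space sketch of the behavior callers rely on, assuming a single 4 KiB page and plain `memset()` in place of the kernel's mapping helpers. The names `zero_segments()`, `demo_zeroed_bytes()` and `DEMO_PAGE_SIZE` are hypothetical; this is not the patched kernel function.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096   /* assumption: one 4 KiB page */

/* Zero [start1, end1) and [start2, end2); an empty or inverted
 * range is skipped instead of being treated as an error. */
static void zero_segments(unsigned char *page,
			  size_t start1, size_t end1,
			  size_t start2, size_t end2)
{
	if (end1 > start1)
		memset(page + start1, 0, end1 - start1);
	if (end2 > start2)
		memset(page + start2, 0, end2 - start2);
}

/* Replay the call from the commit message and count the zeroed bytes. */
static size_t demo_zeroed_bytes(void)
{
	static unsigned char page[DEMO_PAGE_SIZE];
	size_t i, zeroed = 0;

	memset(page, 0xff, sizeof(page));
	zero_segments(page, 4096, 1024, 512, 918);
	for (i = 0; i < sizeof(page); i++)
		zeroed += (page[i] == 0);
	return zeroed;
}
```

With the arguments from the commit message, the first segment is inverted and must be ignored, so only the second segment, bytes 512 up to but not including 918, is cleared. Without the `end > start` guards, `end1 - start1` would wrap around as an unsigned value.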
      
      Link: https://lkml.kernel.org/r/87v9ab60r4.fsf@mail.parknet.co.jp
      Fixes: 0060ef3b ("mm: support THPs in zero_user_segments")
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      184cee51
    • P
      hugetlb: do early cow when page pinned on src mm · 4eae4efa
      Peter Xu committed
      This is the last missing piece of the COW-during-fork effort for the case
      where pinned pages are found.  See 70e806e4 ("mm: Do early cow for
      pinned pages during fork() for ptes", 2020-09-27) for more information:
      we do a similar thing here, but for hugetlb rather than for ptes.
      
      Note that after Jason's recent work in 57efa1fe ("mm/gup: prevent
      gup_fast from racing with COW during fork", 2020-12-15), which is safer and
      easier to understand, the whole of copy_page_range() is now safe against
      gup-fast, so we no longer need the wr-protect trick that was proposed in
      70e806e4.
      
      Link: https://lkml.kernel.org/r/20210217233547.93892-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Gal Pressman <galpress@amazon.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Roland Scheidegger <sroland@vmware.com>
      Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
      Cc: Wei Zhang <wzam@amazon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4eae4efa
    • P
      mm: use is_cow_mapping() across tree where proper · ca6eb14d
      Peter Xu committed
      After is_cow_mapping() is exported in mm.h, replace some manual checks
      elsewhere throughout the tree with the new helper.
      
      Link: https://lkml.kernel.org/r/20210217233547.93892-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
      Cc: Roland Scheidegger <sroland@vmware.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Gal Pressman <galpress@amazon.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Wei Zhang <wzam@amazon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca6eb14d
    • P
      mm: introduce page_needs_cow_for_dma() for deciding whether cow · 97a7e473
      Peter Xu committed
      We've got quite a few places (pte, pmd, pud) that explicitly checked
      against whether we should break the cow right now during fork().  It's
      easier to provide a helper, especially before we work the same thing on
      hugetlbfs.
      
      Since we'll reference is_cow_mapping() in mm.h, move it there too.
      Actually it suits mm.h more since internal.h is mm/ only, but mm.h is
      exported to the whole kernel.  With that we should expect another patch to
      use is_cow_mapping() whenever we can across the kernel since we do use it
      quite a lot but it's always done with raw code against VM_* flags.
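
To show what "raw code against VM_* flags" looks like next to the helper, here is a self-contained sketch. The flag values are an assumption recalled from include/linux/mm.h and are redefined locally so the example compiles outside the kernel; only the bit pattern matters.

```c
#include <stdbool.h>

/* Assumption: local copies of the kernel flag bits, for illustration. */
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* A mapping is copy-on-write when it is private (VM_SHARED clear)
 * and may become writable (VM_MAYWRITE set). */
static bool is_cow_mapping(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```

Callers that previously spelled out the mask-and-compare expression inline can call the helper instead, which keeps the definition of "COW mapping" in one place.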
      
      Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Gal Pressman <galpress@amazon.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Roland Scheidegger <sroland@vmware.com>
      Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
      Cc: Wei Zhang <wzam@amazon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a7e473
    • P
      hugetlb: break earlier in add_reservation_in_range() when we can · ca7e0457
      Peter Xu committed
      All the regions maintained in the hugetlb reserved map are inclusive on "from"
      but exclusive on "to".  We can break earlier even if rg->from==t because
      it already means no possible intersection.
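
The half-open reasoning can be checked with a small stand-alone sketch. It assumes regions are kept sorted by their start; `regions_intersect()` and `can_stop_scanning()` are hypothetical names, not functions from mm/hugetlb.c.

```c
#include <stdbool.h>

/* Two half-open ranges [f, t) and [rf, rt) overlap only if each one
 * starts strictly before the other ends. */
static bool regions_intersect(long f, long t, long rf, long rt)
{
	return f < rt && rf < t;
}

/* Hypothetical loop-exit test for a list sorted by start: a region that
 * begins at or after t cannot overlap [f, t), nor can any later one. */
static bool can_stop_scanning(long rg_from, long t)
{
	return rg_from >= t;
}
```

For example, [0, 10) and [10, 20) share no element even though 10 appears in both descriptions, so stopping when `rg_from == t` loses nothing.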
      
      This does not need a Fixes in all cases because when it happens
      (rg->from==t) we'll not break out of the loop while we should, however the
      next thing we'd do is still add the last file_region we'd need and quit
      the loop in the next round.  So this change is not a bugfix (since the old
      code should still run okay iiuc), but we'd better still touch it up to
      make it logically sane.
      
      Link: https://lkml.kernel.org/r/20210217233547.93892-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Gal Pressman <galpress@amazon.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Roland Scheidegger <sroland@vmware.com>
      Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
      Cc: Wei Zhang <wzam@amazon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca7e0457
    • P
      hugetlb: dedup the code to add a new file_region · 2103cf9c
      Peter Xu committed
      Patch series "mm/hugetlb: Early cow on fork, and a few cleanups", v5.
      
      As reported by Gal [1], we still miss the code clip to handle early cow
      for hugetlb case, which is true.  Again, it still feels odd to fork()
      after using a few huge pages, especially if they're privately mapped to
      me.  However, I do agree with Gal and Jason that we should still have
      that since that'll complete the early cow on fork effort at least, and
      it'll still fix issues where buffers are not well under control and not
      easy to apply MADV_DONTFORK.
      
      The first two patches (1-2) are some cleanups I noticed when reading into
      the hugetlb reserve map code.  I think it's good to have but they're not
      necessary for fixing the fork issue.
      
      The last two patches (3-4) are the real fix.
      
      I tested this with a fork() after some vfio-pci assignment, so I'm pretty
      sure the page copy path could trigger well (page will be accounted right
      after the fork()), but I didn't do data check since the card I assigned is
      some random nic.
      
        https://github.com/xzpeter/linux/tree/fork-cow-pin-huge
      
      [1] https://lore.kernel.org/lkml/27564187-4a08-f187-5a84-3df50009f6ca@amazon.com/
      
      Introduce hugetlb_resv_map_add() helper to add a new file_region rather
      than duplicating similar code twice in add_reservation_in_range().
      
      Link: https://lkml.kernel.org/r/20210217233547.93892-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210217233547.93892-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Gal Pressman <galpress@amazon.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Wei Zhang <wzam@amazon.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Roland Scheidegger <sroland@vmware.com>
      Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2103cf9c
    • F
      mm/fork: clear PASID for new mm · 82e69a12
      Fenghua Yu committed
      When a new mm is created, its PASID should be cleared, i.e.  the PASID is
      initialized to its init state 0 on both ARM and X86.
      
      This patch was part of the series introducing mm->pasid, but got lost
      along the way [1].  It still makes sense to have it, because each address
      space has a different PASID.  And the IOMMU code in
      iommu_sva_alloc_pasid() expects the pasid field of a new mm struct to be
      cleared.
      
      [1] https://lore.kernel.org/linux-iommu/YDgh53AcQHT+T3L0@otcwcpicx3.sc.intel.com/
      
      Link: https://lkml.kernel.org/r/20210302103837.2562625-1-jean-philippe@linaro.org
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Cc: Jacob Pan <jacob.jun.pan@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82e69a12
    • M
      mm/page_alloc.c: refactor initialization of struct page for holes in memory layout · 0740a50b
      Mike Rapoport committed
      There could be struct pages that are not backed by actual physical memory.
      This can happen when the actual memory bank is not a multiple of
      SECTION_SIZE or when an architecture does not register memory holes
      reserved by the firmware as memblock.memory.
      
      Such pages are currently initialized using init_unavailable_mem() function
      that iterates through PFNs in holes in memblock.memory and if there is a
      struct page corresponding to a PFN, the fields of this page are set to
      default values and it is marked as Reserved.
      
      init_unavailable_mem() does not take into account zone and node the page
      belongs to and sets both zone and node links in struct page to zero.
      
      Before commit 73a6e474 ("mm: memmap_init: iterate over memblock
      regions rather that check each PFN") the holes inside a zone were
      re-initialized during memmap_init() and got their zone/node links right.
      However, after that commit nothing updates the struct pages representing
      such holes.
      
      On a system that has firmware reserved holes in a zone above ZONE_DMA, for
      instance in a configuration below:
      
      	# grep -A1 E820 /proc/iomem
      	7a17b000-7a216fff : Unknown E820 type
      	7a217000-7bffffff : System RAM
      
      unset zone link in struct page will trigger
      
      	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
      
      in set_pfnblock_flags_mask() when called with a struct page from a range
      other than E820_TYPE_RAM because there are pages in the range of
      ZONE_DMA32 but the unset zone link in struct page makes them appear as a
      part of ZONE_DMA.
      
      Interleave initialization of the unavailable pages with the normal
      initialization of memory map, so that zone and node information will be
      properly set on struct pages that are not backed by the actual memory.
      
      With this change the pages for holes inside a zone will get proper
      zone/node links and the pages that are not spanned by any node will get
      links to the adjacent zone/node.  The holes between nodes will be
      prepended to the zone/node above the hole and the trailing pages in the
      last section will be appended to the zone/node below.
      
      [akpm@linux-foundation.org: don't initialize static to zero, use %llu for u64]
      
      Link: https://lkml.kernel.org/r/20210225224351.7356-2-rppt@kernel.org
      Fixes: 73a6e474 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reported-by: Qian Cai <cai@lca.pw>
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Łukasz Majczak <lma@semihalf.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Sarvela, Tomi P" <tomi.p.sarvela@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0740a50b
    • M
      init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM · ea29b20a
      Masahiro Yamada committed
      I read the commit log of the following two:
      
      - bc083a64 ("init/Kconfig: make COMPILE_TEST depend on !UML")
      - 334ef6ed ("init/Kconfig: make COMPILE_TEST depend on !S390")
      
      Both are talking about HAS_IOMEM dependency missing in many drivers.
      
      So, 'depends on HAS_IOMEM' seems the direct, sensible solution to me.
      
      This does not change the behavior of UML. UML still cannot enable
      COMPILE_TEST because it does not provide HAS_IOMEM.
      
      The current dependency for S390 is too strong. Under the condition of
      CONFIG_PCI=y, S390 provides HAS_IOMEM, hence can enable COMPILE_TEST.
      
      I also removed the meaningless 'default n'.
      
      Link: https://lkml.kernel.org/r/20210224140809.1067582-1-masahiroy@kernel.org
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Arnd Bergmann <arnd@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KP Singh <kpsingh@google.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Terrell <terrelln@fb.com>
      Cc: Quentin Perret <qperret@google.com>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: "Enrico Weigelt, metux IT consult" <lkml@metux.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea29b20a
    • A
      stop_machine: mark helpers __always_inline · cbf78d85
      Arnd Bergmann committed
      With clang-13, some functions only get partially inlined, with a
      specialized version referring to a global variable.  This triggers a
      harmless build-time check for the intel-rng driver:
      
      WARNING: modpost: drivers/char/hw_random/intel-rng.o(.text+0xe): Section mismatch in reference from the function stop_machine() to the function .init.text:intel_rng_hw_init()
      The function stop_machine() references
      the function __init intel_rng_hw_init().
      This is often because stop_machine lacks a __init
      annotation or the annotation of intel_rng_hw_init is wrong.
      
      In this instance, an easy workaround is to force the stop_machine()
      function to be inline, along with related interfaces that did not show the
      same behavior at the moment, but theoretically could.
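
The workaround can be illustrated outside the kernel. The sketch below is a user-space analogy, not the kernel change itself: `force_inline`, `run_with_cpus_stopped()` and `fake_hw_init()` are hypothetical names, and the attribute spelling assumes GCC or Clang. Because the wrapper is always inlined, the compiler has no reason to emit a standalone copy of it that would hold a reference to the callback from a different section.

```c
/* Hypothetical stand-in for the kernel's __always_inline (GCC/Clang syntax). */
#define force_inline inline __attribute__((__always_inline__))

typedef int (*stop_fn_t)(void *arg);

/*
 * Hypothetical wrapper playing the role of stop_machine(): it only forwards
 * to the callback. Forced inlining means the call to fn() is emitted in the
 * caller's body rather than in an out-of-line copy of the wrapper.
 */
static force_inline int run_with_cpus_stopped(stop_fn_t fn, void *arg)
{
	return fn(arg);
}

/* Hypothetical init-time callback; in the kernel this would carry __init. */
int fake_hw_init(void *arg)
{
	int *initialized = arg;

	*initialized = 1;
	return 0;
}
```

In the kernel case, the caller is itself an init-time function, so once the wrapper is folded into it, the reference to the __init callback stays within init code and modpost has nothing to flag.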
      
      The combination of the two patches listed below triggers the behavior in
      clang-13, but individually these commits are correct.
      
      Link: https://lkml.kernel.org/r/20210225130153.1956990-1-arnd@kernel.org
      Fixes: fe5595c0 ("stop_machine: Provide stop_machine_cpuslocked()")
      Fixes: ee527cd3 ("Use stop_machine_run in the Intel RNG driver")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbf78d85
    • A
      memblock: fix section mismatch warning · 34dc2efb
      Arnd Bergmann committed
      The inlining logic in clang-13 is rewritten to often not inline some
      functions that were inlined by all earlier compilers.
      
      In case of the memblock interfaces, this exposed a harmless bug of a
      missing __init annotation:
      
      WARNING: modpost: vmlinux.o(.text+0x507c0a): Section mismatch in reference from the function memblock_bottom_up() to the variable .meminit.data:memblock
      The function memblock_bottom_up() references
      the variable __meminitdata memblock.
      This is often because memblock_bottom_up lacks a __meminitdata
      annotation or the annotation of memblock is wrong.
      
      Interestingly, these annotations were present originally, but got removed
      with the explanation that the __init annotation prevents the function from
      getting inlined.  I checked this again and found that while this is the
      case with clang, gcc (version 7 through 10, did not test others) does
      inline the functions regardless.
      
      As the previous change was apparently intended to help the clang builds,
      reverting it to help the newer clang versions seems appropriate as well.
      gcc builds don't seem to care either way.
      
      Link: https://lkml.kernel.org/r/20210225133808.2188581-1-arnd@kernel.org
      Fixes: 5bdba520 ("mm: memblock: drop __init from memblock functions to make it inline")
      Reference: 2cfb3665 ("include/linux/memblock.h: add __init to memblock_set_bottom_up()")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Aslan Bakirov <aslan@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34dc2efb