1. 27 December 2019, 40 commits
• driver core: add per device iommu param · 6d1740b5
  Jacob Pan authored
      hulk inclusion
      category: feature
      bugzilla: 14369
      CVE: NA
      -------------------
      
DMA faults can be detected by the IOMMU at the device level. Adding a
pointer to struct device allows the IOMMU subsystem to report relevant
faults back to the device driver for further handling.
For directly assigned devices (or user-space drivers), the guest OS is
responsible for handling and responding to per-device IOMMU faults.
Therefore we need a fault reporting mechanism that propagates faults
beyond the IOMMU subsystem.
      
      There are two other IOMMU data pointers under struct device today, here
      we introduce iommu_param as a parent pointer such that all device IOMMU
      data can be consolidated here. The idea was suggested here by Greg KH
      and Joerg. The name iommu_param is chosen here since iommu_data has
      been used.
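
A minimal sketch of the change described above (member placement is illustrative; only the new pointer is what the patch adds):

    struct iommu_param;                         /* defined by the IOMMU core */

    struct device {
            /* ... existing members, including the other IOMMU pointers ... */
            struct iommu_param      *iommu_param;   /* new: consolidated per-device IOMMU data */
            /* ... */
    };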
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lkml.org/lkml/2017/10/6/81
Signed-off-by: Fang Lijun <fanglijun3@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6d1740b5
• iommu: introduce device fault data · 5c398a27
  Jacob Pan authored
      hulk inclusion
      category: feature
      bugzilla: 14369
      CVE: NA
      -------------------
      
Device faults detected by the IOMMU can be reported outside the IOMMU
subsystem for further processing. This patch intends to provide generic
device fault data such that device drivers can be informed of IOMMU
faults without model-specific knowledge.
      
      The proposed format is the result of discussion at:
      https://lkml.org/lkml/2017/11/10/291
      Part of the code is based on Jean-Philippe Brucker's patchset
      (https://patchwork.kernel.org/patch/9989315/).
      
The assumption is that the model-specific IOMMU driver can filter and
handle most of the internal faults if the cause is within the IOMMU
driver's control. Therefore, the fault reasons that can be reported are
grouped and generalized based on common specifications such as PCI ATS.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Fang Lijun <fanglijun3@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
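
Illustrative only, not the layout from the patch: a generic, model-independent fault record of the kind described, with reason codes generalized from specifications such as PCI ATS:

    enum iommu_fault_reason {
            IOMMU_FAULT_REASON_UNKNOWN = 0,
            IOMMU_FAULT_REASON_PASID_INVALID,       /* e.g. PASID out of range */
            IOMMU_FAULT_REASON_PERMISSION,          /* access rights violated */
    };

    struct iommu_fault_event {
            enum iommu_fault_reason reason;
            u64     addr;           /* faulting I/O virtual address */
            u32     pasid;          /* valid only if flags says so */
            u32     flags;          /* e.g. "PASID valid", "last in group" */
    };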
      5c398a27
• iommu: introduce iommu invalidate API function · 51c7a125
  Liu, Yi L authored
      hulk inclusion
      category: feature
      bugzilla: 14369
      CVE: NA
      -------------------
      
      When an SVM capable device is assigned to a guest, the first level page
      tables are owned by the guest and the guest PASID table pointer is
      linked to the device context entry of the physical IOMMU.
      
      Host IOMMU driver has no knowledge of caching structure updates unless
      the guest invalidation activities are passed down to the host. The
      primary usage is derived from emulated IOMMU in the guest, where QEMU
      can trap invalidation activities before passing them down to the
      host/physical IOMMU.
Since the invalidation data are obtained from user space and will be
written into the physical IOMMU, we must allow security checks at
various layers. Therefore, a generic invalidation data format is
proposed here; model-specific IOMMU drivers need to convert it into
their own format.
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Fang Lijun <fanglijun3@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      51c7a125
• iommu: introduce bind_pasid_table API function · d32e9e95
  Jacob Pan authored
      hulk inclusion
      category: feature
      bugzilla: 14369
      CVE: NA
      -------------------
      
      Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
      use in the guest:
      https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
      
As part of the proposed architecture, when an SVM-capable PCI
device is assigned to a guest, nested mode is turned on. The guest owns
the first-level page tables (requests with PASID), which perform
GVA->GPA translation. Second-level page tables are owned by the host
for GPA->HPA translation of requests both with and without PASID.
      
      A new IOMMU driver interface is therefore needed to perform tasks as
      follows:
      * Enable nested translation and appropriate translation type
      * Assign guest PASID table pointer (in GPA) and size to host IOMMU
      
      This patch introduces new API functions to perform bind/unbind guest PASID
      tables. Based on common data, model specific IOMMU drivers can be extended
      to perform the specific steps for binding pasid table of assigned devices.
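
A hedged sketch of the API surface the text describes (prototypes are illustrative; the exact argument types in the patch may differ):

    /* Bind the guest PASID table (GPA and size packed in the config) to the
     * device's context entry; the model-specific driver does the real work. */
    int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
                               struct pasid_table_config *cfg);

    /* Tear the binding down again, e.g. on device unassignment. */
    void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev);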
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
[Backported to 4.19
-add SPDX-License-Identifier]
Signed-off-by: Fang Lijun <fanglijun3@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d32e9e95
• KVM: fix spectrev1 gadgets · 579b95fc
  Paolo Bonzini authored
      [ Upstream commit 1d487e9bf8ba66a7174c56a0029c54b1eca8f99c ]
      
      These were found with smatch, and then generalized when applicable.
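
For context, a typical fix for such a gadget clamps an attacker-influenced index with array_index_nospec() before the dependent load; a minimal sketch (names are illustrative, not taken from the patch):

    #include <linux/nospec.h>

    /* idx comes from the guest/userspace and selects an entry in a fixed table */
    if (idx >= ARRAY_SIZE(table))
            return -EINVAL;
    idx = array_index_nospec(idx, ARRAY_SIZE(table));   /* no speculative OOB access */
    val = table[idx];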
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      579b95fc
• x86/reboot, efi: Use EFI reboot for Acer TravelMate X514-51T · 4778affe
  Jian-Hong Pan authored
      [ Upstream commit 0082517fa4bce073e7cf542633439f26538a14cc ]
      
      Upon reboot, the Acer TravelMate X514-51T laptop appears to complete the
      shutdown process, but then it hangs in BIOS POST with a black screen.
      
      The problem is intermittent - at some points it has appeared related to
      Secure Boot settings or different kernel builds, but ultimately we have
      not been able to identify the exact conditions that trigger the issue to
      come and go.
      
      Besides, the EFI mode cannot be disabled in the BIOS of this model.
      
      However, after extensive testing, we observe that using the EFI reboot
      method reliably avoids the issue in all cases.
      
      So add a boot time quirk to use EFI reboot on such systems.
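
A minimal sketch of what such a boot-time quirk looks like (table and callback names are illustrative; the quirk matches on the DMI product name):

    static int __init set_efi_reboot(const struct dmi_system_id *d)
    {
            if (reboot_type != BOOT_EFI && !efi_runtime_disabled()) {
                    reboot_type = BOOT_EFI;
                    pr_info("%s: selecting EFI reboot method\n", d->ident);
            }
            return 0;
    }

    static const struct dmi_system_id reboot_dmi_table[] __initconst = {
            {       /* Acer TravelMate X514-51T */
                    .callback = set_efi_reboot,
                    .ident = "Acer TravelMate X514-51T",
                    .matches = {
                            DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
                            DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate X514-51T"),
                    },
            },
            { }
    };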
      
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=203119
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: Daniel Drake <drake@endlessm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Cc: linux@endlessm.com
      Link: http://lkml.kernel.org/r/20190412080152.3718-1-jian-hong@endlessm.com
      [ Fix !CONFIG_EFI build failure, clarify the code and the changelog a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4778affe
• bfq: update internal depth state when queue depth changes · 3c8256e7
  Jens Axboe authored
      commit 77f1e0a52d26242b6c2dba019f6ebebfb9ff701e upstream
      
      A previous commit moved the shallow depth and BFQ depth map calculations
      to be done at init time, moving it outside of the hotter IO path. This
potentially causes hangs if the user changes the depth of the scheduler
map by writing to the 'nr_requests' sysfs file for that device.
      
      Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
      the depth changes, so that the scheduler can update its internal state.
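
A hedged sketch of the hook: the elevator ops gain a depth_updated() callback that blk-mq invokes from the nr_requests update path, so BFQ can recompute its shallow depths (body abridged):

    /* New elevator callback, called by blk-mq after the scheduler tags are
     * resized via the 'nr_requests' sysfs file. */
    static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
    {
            struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
            struct blk_mq_tags *tags = hctx->sched_tags;

            /* recompute shallow depths against the (possibly resized) sbitmap */
            bfq_update_depths(bfqd, &tags->bitmap_tags);
    }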
Signed-off-by: Eric Wheeler <bfq@linux.ewheeler.net>
Tested-by: Kai Krakow <kai@kaishome.de>
Reported-by: Paolo Valente <paolo.valente@linaro.org>
Fixes: f0635b8a ("bfq: calculate shallow depths at init time")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3c8256e7
• cpu/speculation: Add 'mitigations=' cmdline option · 9b878097
  Josh Poimboeuf authored
      commit 98af8452945c55652de68536afdde3b520fec429 upstream
      
      Keeping track of the number of mitigations for all the CPU speculation
      bugs has become overwhelming for many users.  It's getting more and more
      complicated to decide which mitigations are needed for a given
      architecture.  Complicating matters is the fact that each arch tends to
      have its own custom way to mitigate the same vulnerability.
      
      Most users fall into a few basic categories:
      
      a) they want all mitigations off;
      
      b) they want all reasonable mitigations on, with SMT enabled even if
         it's vulnerable; or
      
      c) they want all reasonable mitigations on, with SMT disabled if
         vulnerable.
      
      Define a set of curated, arch-independent options, each of which is an
      aggregation of existing options:
      
      - mitigations=off: Disable all mitigations.
      
      - mitigations=auto: [default] Enable all the default mitigations, but
        leave SMT enabled, even if it's vulnerable.
      
      - mitigations=auto,nosmt: Enable all the default mitigations, disabling
        SMT if needed by a mitigation.
      
      Currently, these options are placeholders which don't actually do
      anything.  They will be fleshed out in upcoming patches.
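
Using the last variant just means appending the option to the kernel command line; an illustrative boot line (not part of the patch):

    BOOT_IMAGE=/vmlinuz-4.19 root=/dev/sda2 ro quiet mitigations=auto,nosmt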
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86)
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux-s390@vger.kernel.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-arch@vger.kernel.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9b878097
• x86/speculation/mds: Add sysfs reporting for MDS · 6896d064
  Thomas Gleixner authored
      commit 8a4b06d391b0a42a373808979b5028f5c84d9c6a upstream
      
      Add the sysfs reporting file for MDS. It exposes the vulnerability and
      mitigation state similar to the existing files for the other speculative
      hardware vulnerabilities.
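
A hedged sketch of the reporting hook, following the pattern of the existing vulnerability files under /sys/devices/system/cpu/vulnerabilities/ (the state-formatting helper is not shown):

    /* arch/x86/kernel/cpu/bugs.c */
    ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
    {
            return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
    }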
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6896d064
• mm: Be allowed to alloc CDM node memory for MPOL_BIND · 1f3b5458
  Lijun Fang authored
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -----------------
      
CDM nodes should not be part of mems_allowed. However, allocation from
a CDM node must be allowed when mpol->mode is MPOL_BIND.
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1f3b5458
• mm: Exclude CDM marked VMAs from auto NUMA · 90cd24b6
  Anshuman Khandual authored
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      Kernel cannot track device memory accesses behind VMAs containing CDM
      memory. Hence all the VM_CDM marked VMAs should not be part of the auto
      NUMA migration scheme. This patch also adds a new function is_cdm_vma()
      to detect any VMA marked with flag VM_CDM.
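
A minimal sketch of the helper, assuming the VM_CDM flag introduced by this series (the stub for !CONFIG_COHERENT_DEVICE is implied):

    static inline bool is_cdm_vma(struct vm_area_struct *vma)
    {
            return !!(vma->vm_flags & VM_CDM);  /* VMA backed by coherent device memory */
    }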
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      90cd24b6
• mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND) and page fault · e1ddb9d2
  Anshuman Khandual authored
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      Mark all the applicable VMAs with VM_CDM explicitly during mbind(MPOL_BIND)
      call if the user provided nodemask has a CDM node.
      
      Mark the corresponding VMA with VM_CDM flag if the allocated page happens
      to be from a CDM node. This can be expensive from performance stand point.
      There are multiple checks to avoid an expensive page_to_nid lookup but it
      can be optimized further.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e1ddb9d2
• mm: Define coherent device memory (CDM) node · 4886e905
  Anshuman Khandual authored
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      There are certain devices like specialized accelerator, GPU cards, network
      cards, FPGA cards etc which might contain onboard memory which is coherent
      along with the existing system RAM while being accessed either from the CPU
      or from the device. They share some similar properties with that of normal
      system RAM but at the same time can also be different with respect to
      system RAM.
      
      User applications might be interested in using this kind of coherent device
      memory explicitly or implicitly along side the system RAM utilizing all
      possible core memory functions like anon mapping (LRU), file mapping (LRU),
      page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations
      etc. To achieve this kind of tight integration with core memory subsystem,
the device onboard coherent memory must be represented as a memory-only
NUMA node. At the same time the arch must export some kind of function
to identify this node as coherent device memory rather than any other
regular CPU-less, memory-only NUMA node.
      
      After achieving the integration with core memory subsystem coherent device
      memory might still need some special consideration inside the kernel. There
      can be a variety of coherent memory nodes with different expectations from
      the core kernel memory. But right now only one kind of special treatment is
      considered which requires certain isolation.
      
      Now consider the case of a coherent device memory node type which requires
      isolation. This kind of coherent memory is onboard an external device
      attached to the system through a link where there is always a chance of a
      link failure taking down the entire memory node with it. More over the
      memory might also have higher chance of ECC failure as compared to the
      system RAM. Hence allocation into this kind of coherent memory node should
      be regulated. Kernel allocations must not come here. Normal user space
      allocations too should not come here implicitly (without user application
      knowing about it). This summarizes isolation requirement of certain kind of
      coherent device memory node as an example. There can be different kinds of
      isolation requirement also.
      
Some coherent memory devices might not require isolation at all. There
might also be other coherent memory devices which require some other
special treatment after becoming part of the core memory
representation. For now, we only look into isolation-seeking coherent
device memory nodes, not the other ones.
      
      To implement the integration as well as isolation, the coherent memory node
      must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
      the node_states[] array. During memory hotplug operations, the new nodemask
      N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent device
      memory nodes. This also creates the following new sysfs based interface to
      list down all the coherent memory nodes of the system.
      
      	/sys/devices/system/node/is_cdm_node
      
      Architectures must export function arch_check_node_cdm() which identifies
      any coherent device memory node in case they enable CONFIG_COHERENT_DEVICE.
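
A hedged sketch of the arch hook (the nodemask used for the lookup is hypothetical; each architecture decides how it recognizes its device-memory nodes):

    #ifdef CONFIG_COHERENT_DEVICE
    int arch_check_node_cdm(int nid)
    {
            /* e.g. compare against firmware/devicetree-provided node info;
             * cdm_nodes below is a hypothetical per-arch nodemask. */
            return node_isset(nid, cdm_nodes);
    }
    #endif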
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
[Backported to 4.19
-remove set or clear node state for memory_hotplug
-separate CONFIG_COHERENT and CPUSET]
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4886e905
• linux/kernel.h: Use parentheses around argument in u64_to_user_ptr() · a74603f0
  Jann Horn authored
      [ Upstream commit a0fe2c6479aab5723239b315ef1b552673f434a3 ]
      
      Use parentheses around uses of the argument in u64_to_user_ptr() to
      ensure that the cast doesn't apply to part of the argument.
      
      There are existing uses of the macro of the form
      
        u64_to_user_ptr(A + B)
      
      which expands to
      
        (void __user *)(uintptr_t)A + B
      
      (the cast applies to the first operand of the addition, the addition
      is a pointer addition). This happens to still work as intended, the
      semantic difference doesn't cause a difference in behavior.
      
      But I want to use u64_to_user_ptr() with a ternary operator in the
      argument, like so:
      
        u64_to_user_ptr(A ? B : C)
      
      This currently doesn't work as intended.
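
With the fix, the macro parenthesizes every use of its argument, roughly:

    #define u64_to_user_ptr(x) (            \
    {                                       \
            typecheck(u64, (x));            \
            (void __user *)(uintptr_t)(x);  \
    }                                       \
    )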
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qiaowei Ren <qiaowei.ren@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190329214652.258477-1-jannh@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a74603f0
• fs/dcache.c: avoid softlock since too many negative dentry · 7ba5d5d0
  yangerkun authored
      euler inclusion
      category: bugfix
      bugzilla: 15743
      CVE: NA
      ---------------------------
      
Parallel threads add negative dentries under the root dir. Some time
later, 'systemctl daemon-reload' reports a softlockup, since
__fsnotify_update_child_dentry_flags needs to update every child of the
root dentry without distinguishing whether it is active or not. It
wastes a long time while holding the d_lock of the root dentry, and
other threads trying to spin_lock d_lock run over their time.

Limiting the number of negative dentries under a dir can avoid this.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7ba5d5d0
• block: add info when opening a write opend block device exclusively · 9a8887a9
  zhangyi (F) authored
      euler inclusion
      category: feature
      bugzilla: 14367
      CVE: NA
      ---------------------------
      
Just like opening an exclusively opened block device for write,
exclusively opening a block device which has been opened for write by
some other process may also lead to potential data corruption. This
patch records the write openers and gives a hint if that happens.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9a8887a9
• pciehp: fix a race between pciehp and removing operations by sysfs · 762c10db
  Xiongfeng Wang authored
      hulk inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      
When I ran a stress test of PCIe hotplug and removal operations through
sysfs, I got a hung task, and the following call trace was printed.
      
       INFO: task irq/746-pciehp:41551 blocked for more than 120 seconds.
             Tainted: P        W  OE     4.19.25-
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       irq/746-pciehp  D    0 41551      2 0x00000228
       Call trace:
        __switch_to+0x94/0xe8
        __schedule+0x270/0x8b0
        schedule+0x2c/0x88
        schedule_preempt_disabled+0x14/0x20
        __mutex_lock.isra.1+0x1fc/0x540
        __mutex_lock_slowpath+0x24/0x30
        mutex_lock+0x80/0xa8
        pci_lock_rescan_remove+0x20/0x28
        pciehp_configure_device+0x30/0x140
        pciehp_handle_presence_or_link_change+0x35c/0x4b0
        pciehp_ist+0x1cc/0x1d0
        irq_thread_fn+0x30/0x80
        irq_thread+0x128/0x200
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
       INFO: task bash:6424 blocked for more than 120 seconds.
             Tainted: P        W  OE     4.19.25-
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       bash            D    0  6424   2231 0x00000200
       Call trace:
        __switch_to+0x94/0xe8
        __schedule+0x270/0x8b0
        schedule+0x2c/0x88
        schedule_timeout+0x224/0x448
        wait_for_common+0x198/0x2a0
        wait_for_completion+0x28/0x38
        kthread_stop+0x60/0x190
        __free_irq+0x1c0/0x348
        free_irq+0x40/0x88
        pcie_shutdown_notification+0x54/0x80
        pciehp_remove+0x30/0x50
        pcie_port_remove_service+0x3c/0x58
        device_release_driver_internal+0x1b4/0x250
        device_release_driver+0x28/0x38
        bus_remove_device+0xd4/0x160
        device_del+0x128/0x348
        device_unregister+0x24/0x78
        remove_iter+0x48/0x58
        device_for_each_child+0x6c/0xb8
        pcie_port_device_remove+0x2c/0x48
        pcie_portdrv_remove+0x5c/0x68
        pci_device_remove+0x48/0xd8
        device_release_driver_internal+0x1b4/0x250
        device_release_driver+0x28/0x38
        pci_stop_bus_device+0x84/0xb8
        pci_stop_and_remove_bus_device_locked+0x24/0x40
        remove_store+0xa4/0xb8
        dev_attr_store+0x44/0x60
        sysfs_kf_write+0x58/0x80
        kernfs_fop_write+0xe8/0x1f0
        __vfs_write+0x60/0x190
        vfs_write+0xac/0x1c0
        ksys_write+0x6c/0xd8
        __arm64_sys_write+0x24/0x30
        el0_svc_common+0xa0/0x180
        el0_svc_handler+0x38/0x78
        el0_svc+0x8/0xc
      
When we remove a slot through sysfs,
'pci_stop_and_remove_bus_device_locked()' is called. This function
takes the global mutex 'pci_rescan_remove_lock' and removes the
slot. If the irq thread 'pciehp_ist' is still running, we wait
until it exits.

If a pciehp interrupt happens immediately after we remove the slot
through sysfs, but before we free the pciehp irq in
'pci_stop_and_remove_bus_device_locked()', 'pciehp_ist' will hang
because the global mutex 'pci_rescan_remove_lock' is held by the
sysfs operation. But the sysfs operation is waiting for the pciehp irq
thread 'pciehp_ist' to end. Then a hung task occurs.

So these two kinds of operation, removal through the attention button
and removal through /sys/devices/pci***/remove, should not be executed
at the same time. This patch adds a global variable to mark that one of
these operations is in progress. When this variable is set, if another
operation is requested, it is rejected.
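
A hedged sketch of the mutual exclusion (variable name and return value are illustrative):

    /* Set while either the attention-button path or the sysfs "remove" path
     * is tearing a slot down; the competing request is rejected. */
    static atomic_t pciehp_removal_busy = ATOMIC_INIT(0);

    if (atomic_cmpxchg(&pciehp_removal_busy, 0, 1) != 0)
            return -EBUSY;          /* the other removal path is already running */
    /* ... perform the hotplug/removal work ... */
    atomic_set(&pciehp_removal_busy, 0);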
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      762c10db
• Revert Kernel Warpdrive part:SPIMDEV, which depends on VFIO/VFIO_MDEV. · 9e68018c
  lingmingqiang authored
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      This reverts commit 6b546eb4c02210d7bc77aa64025cc6c84e5f1b30.
      
      Feature or Bugfix: Bugfix
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com>
Reviewed-by: hucheng.hu <hucheng.hu@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9e68018c
• fs/dcache.c: avoid panic while lockref of dentry overflow · 510fef4d
  yangerkun authored
      euler inclusion
      category: bugfix
      bugzilla: 14351
      CVE: NA
      ---------------------------
      
We use lockref for the dentry reference count without noticing that too
many negative dentries under one dir can overflow the lockref. This can
crash the system if it happens under the root dir.

Since there is no perfect solution, we just limit the max dentry count
to INT_MAX / 2. Also, it takes a long time to go from INT_MAX / 2 to
INT_MAX, so we do not need to do this check under the protection of the
dentry lock.

Also, we limit FILES_MAX to INT_MAX / 2, since many opens of the same
file can lead to overflow too.
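
A minimal sketch of the guard described (the macro and helper names follow the changelog's description; the real patch may differ):

    #define D_COUNT_MAX     (INT_MAX / 2)

    /* Refuse to take another reference once the count is halfway to overflow;
     * checked without d_lock since the remaining headroom is huge. */
    static inline bool d_count_may_overflow(const struct dentry *dentry)
    {
            return d_count(dentry) >= D_COUNT_MAX;
    }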
      
      Changelog:
      v1->v2: add a function to do check / add a Macro to mean INT_MAX / 2
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      510fef4d
• clk: x86: Add system specific quirk to mark clocks as critical · a1b70161
  David Müller authored
      commit 7c2e07130090ae001a97a6b65597830d6815e93e upstream.
      
      Since commit 648e9218 ("clk: x86: Stop marking clocks as
      CLK_IS_CRITICAL"), the pmc_plt_clocks of the Bay Trail SoC are
      unconditionally gated off. Unfortunately this will break systems where these
      clocks are used for external purposes beyond the kernel's knowledge. Fix it
      by implementing a system specific quirk to mark the necessary pmc_plt_clks as
      critical.
      
      Fixes: 648e9218 ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL")
Signed-off-by: David Müller <dave.mueller@gmx.ch>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a1b70161
• fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock · e2e782c9
  Kirill Smelkov authored
      
      [ Upstream commit 10dce8af34226d90fa56746a934f8da5dcdba3df ]
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
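
As a concrete illustration of steps 2 and 3, a driver whose read and write ignore *ppos would switch its opener like this (driver name hypothetical):

    static int foo_open(struct inode *inode, struct file *file)
    {
            /* was: return nonseekable_open(inode, file);
             * stream_open() additionally clears FMODE_ATOMIC_POS, so read and
             * write no longer serialize on the f_pos lock. */
            return stream_open(inode, file);
    }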
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e2e782c9
• USB: core: Fix bug caused by duplicate interface PM usage counter · 947050f1
  Alan Stern authored
      commit c2b71462d294cf517a0bc6e4fd6424d7cee5596f upstream.
      
      The syzkaller fuzzer reported a bug in the USB hub driver which turned
      out to be caused by a negative runtime-PM usage counter.  This allowed
      a hub to be runtime suspended at a time when the driver did not expect
      it.  The symptom is a WARNING issued because the hub's status URB is
      submitted while it is already active:
      
      	URB 0000000031fb463e submitted while active
      	WARNING: CPU: 0 PID: 2917 at drivers/usb/core/urb.c:363
      
      The negative runtime-PM usage count was caused by an unfortunate
      design decision made when runtime PM was first implemented for USB.
      At that time, USB class drivers were allowed to unbind from their
      interfaces without balancing the usage counter (i.e., leaving it with
      a positive count).  The core code would take care of setting the
      counter back to 0 before allowing another driver to bind to the
      interface.
      
      Later on when runtime PM was implemented for the entire kernel, the
      opposite decision was made: Drivers were required to balance their
      runtime-PM get and put calls.  In order to maintain backward
      compatibility, however, the USB subsystem adapted to the new
      implementation by keeping an independent usage counter for each
      interface and using it to automatically adjust the normal usage
      counter back to 0 whenever a driver was unbound.
      
      This approach involves duplicating information, but what is worse, it
      doesn't work properly in cases where a USB class driver delays
      decrementing the usage counter until after the driver's disconnect()
      routine has returned and the counter has been adjusted back to 0.
      Doing so would cause the usage counter to become negative.  There's
      even a warning about this in the USB power management documentation!
      
      As it happens, this is exactly what the hub driver does.  The
      kick_hub_wq() routine increments the runtime-PM usage counter, and the
      corresponding decrement is carried out by hub_event() in the context
      of the hub_wq work-queue thread.  This work routine may sometimes run
      after the driver has been unbound from its interface, and when it does
      it causes the usage counter to go negative.
      
      It is not possible for hub_disconnect() to wait for a pending
      hub_event() call to finish, because hub_disconnect() is called with
      the device lock held and hub_event() acquires that lock.  The only
      feasible fix is to reverse the original design decision: remove the
      duplicate interface-specific usage counter and require USB drivers to
      balance their runtime PM gets and puts.  As far as I know, all
      existing drivers currently do this.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-and-tested-by: syzbot+7634edaea4d0b341c625@syzkaller.appspotmail.com
CC: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      947050f1
• i2c: Allow recovery of the initial IRQ by an I2C client device. · 420637ba
  Jim Broadus authored
      commit 93b6604c5a669d84e45fe5129294875bf82eb1ff upstream.
      
      A previous change allowed I2C client devices to discover new IRQs upon
      reprobe by clearing the IRQ in i2c_device_remove. However, if an IRQ was
      assigned in i2c_new_device, that information is lost.
      
      For example, the touchscreen and trackpad devices on a Dell Inspiron laptop
      are I2C devices whose IRQs are defined by ACPI extended IRQ types. The
      client device structures are initialized during an ACPI walk. After
      removing the i2c_hid device, modprobe fails.
      
      This change caches the initial IRQ value in i2c_new_device and then resets
      the client device IRQ to the initial value in i2c_device_remove.
      
      Fixes: 6f108dd70d30 ("i2c: Clear client->irq in i2c_device_remove")
Signed-off-by: Jim Broadus <jbroadus@gmail.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
[wsa: this is an easy to backport fix for the regression. We will
refactor the code to handle irq assignments better in general.]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      420637ba
• merge the driver code to hulk branch and format rectificaiton · 299b4bc2
  lingmingqiang authored
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      Feature or Bugfix: Bugfix
      
      1. [42feaced] HPRE Crypto warning clean
      Warning -- Suspicious Truncation in arithmetic expression combining
      with pointer
      
      2. [4b3f837d] crypto/zip: Fix to get queue from possible zip functions
      Original commit message:
The current code just gets a queue from the closest function and
returns failure if the closest function has no available queue. In this
patch, we first sort all available functions, then get a queue from the
sorted functions one by one if a closer function has no available queue.
      
      3. [7250b1a9] crypto/qm: Export function to get free qp number for acc
      
      4. [86eeda2b] crypto/hisilicon/qm: Fix static check warning
      Reduce loop complexity of qm_qp_ctx_cfg function.
      
      5. [f1c558c0] Fix static check warning
      
      6. [dfdfef8f] crypto/hisilicon/qm: Fix QM task timeout bug
      There is a domain segment in eqc/aeqc should be assignd value
      in D06 ES, Which is reserved in D06 CS.
      
7. [4bf721fe] Bugfix for two kill signals received by user processes
When a process receives two kill signals, file->ops->flush will be
called twice, so uacce->ops->flush will be called twice too. Currently,
flush cannot be called again on the same uacce_queue file, or a core
dump will show. So a status field is added to the uacce queue; while
flush and release operations are running, the queue status is checked
atomically. If the queue is already being released, do nothing.
      
      8. [20bd4257] uacce/dummy: Fix dummy compile problem
      Original commit message:
As we move flags, api_ver and gf_pg_start from uacce_ops to uacce, also
fix the dummy driver to work together with the current uacce.
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com>
      
      Changes to be committed:
      	modified:   drivers/crypto/hisilicon/Kconfig
      	modified:   drivers/crypto/hisilicon/hpre/hpre_crypto.c
      	modified:   drivers/crypto/hisilicon/qm.c
      	modified:   drivers/crypto/hisilicon/qm.h
      	modified:   drivers/crypto/hisilicon/zip/zip_main.c
      	modified:   drivers/uacce/Kconfig
      	modified:   drivers/uacce/dummy_drv/dummy_wd_dev.c
      	modified:   drivers/uacce/dummy_drv/dummy_wd_v2.c
      	modified:   drivers/uacce/uacce.c
      	modified:   include/linux/uacce.h
Reviewed-by: hucheng.hu <hucheng.hu@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      299b4bc2
• cgroup/files: use task_get_css() to get a valid css during dup_fd() · a8453b17
  Hou Tao authored
      euler inclusion
      category: bugfix
      bugzilla: 14007
      CVE: NA
      -------------------------------------------------
      
      Process fork and cgroup migration can happen simultaneously, and
      in the following case use-after-free of css_set is possible:
      
      CPU 0: process fork    CPU 1: cgroup migration
      
      dup_fd                 __cgroup1_procs_write(threadgroup=false)
        files_cgroup_assign
          // task A
          task_lock
          task_cgroup(current, files_cgrp_id)
            css_set = task_css_set_check()
      
       			 cgroup_migrate_execute
        			   files_cgroup_can_attach
      			   css_set_move_task
      			     put_css_set_locked()
        			   files_cgroup_attach
      			     // task B which is in the same
      			     // thread group as task A
      			     task_lock
      			 cgroup_migrate_finish
      			   // the css_set will be freed
      			   put_css_set_locked()
      
            // use-after-free
            css_set->subsys[files_cgrp_id]
      
      Fix it by using task_get_css() instead to get a valid css.
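
A hedged sketch of the fix on the fork side: pin the css with task_get_css() instead of dereferencing the (possibly already freed) css_set entry:

    struct cgroup_subsys_state *css;

    /* task_get_css() retries until it holds a reference on a css that is
     * still attached to the task, so a concurrent migration cannot free
     * it underneath us. */
    css = task_get_css(current, files_cgrp_id);
    /* ... charge/assign the duplicated files to css ... */
    css_put(css);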
      
      Fixes: 52cc1eccf6de ("cgroups: Resource controller for open files")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: luojiajun <luojiajun3@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a8453b17
• cgroup: undo unnecessary updates to struct cgroup_subsys & cftype · d72c2468
  Hou Tao authored
      euler inclusion
      category: bugfix
      bugzilla: 14007
      CVE: NA
      
      -------------------------------------------------
      
      These updates are leftovers from CentOS 7.x, and are not needed on
      hulk-4.19, so kill them.
      
      Fixes: 52cc1eccf6de ("cgroups: Resource controller for open files")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: luojiajun <luojiajun3@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d72c2468
• ptrace: take into account saved_sigmask in PTRACE{GET, SET}SIGMASK · dd19bcdb
  Andrei Vagin authored
      [ Upstream commit fcfc2aa0185f4a731d05a21e9f359968fdfd02e7 ]
      
There are a few system calls (pselect, ppoll, etc.) which replace a
task's sigmask while they are running in kernel space.
      
      When a task calls one of these syscalls, the kernel saves a current
      sigmask in task->saved_sigmask and sets a syscall sigmask.
      
      On syscall-exit-stop, ptrace traps a task before restoring the
      saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and
      PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by
      saved_sigmask, when the task returns to user-space.
      
      This patch fixes this problem.  PTRACE_GETSIGMASK returns saved_sigmask
      if it's set.  PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag.
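
A hedged sketch of the PTRACE_GETSIGMASK half (the flag test is illustrative; architectures track the restore-sigmask state differently):

    sigset_t *mask;

    if (test_tsk_restore_sigmask(child))    /* tracee is inside pselect/ppoll etc. */
            mask = &child->saved_sigmask;   /* report the mask it will get back */
    else
            mask = &child->blocked;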
      
      Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com
      Fixes: 29000cae ("ptrace: add ability to get/set signal-blocked mask")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      dd19bcdb
• tracing: Fix buffer_ref pipe ops · d82ba528
  Jann Horn authored
      commit b987222654f84f7b4ca95b3a55eca784cb30235b upstream.
      
      This fixes multiple issues in buffer_pipe_buf_ops:
      
       - The ->steal() handler must not return zero unless the pipe buffer has
         the only reference to the page. But generic_pipe_buf_steal() assumes
         that every reference to the pipe is tracked by the page's refcount,
         which isn't true for these buffers - buffer_pipe_buf_get(), which
         duplicates a buffer, doesn't touch the page's refcount.
         Fix it by using generic_pipe_buf_nosteal(), which refuses every
         attempted theft. It should be easy to actually support ->steal, but the
         only current users of pipe_buf_steal() are the virtio console and FUSE,
         and they also only use it as an optimization. So it's probably not worth
         the effort.
       - The ->get() and ->release() handlers can be invoked concurrently on pipe
         buffers backed by the same struct buffer_ref. Make them safe against
         concurrency by using refcount_t.
       - The pointers stored in ->private were only zeroed out when the last
         reference to the buffer_ref was dropped. As far as I know, this
         shouldn't be necessary anyway, but if we do it, let's always do it.
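
A minimal sketch of the second point: the buffer_ref reference count becomes a refcount_t so concurrent ->get()/->release() calls are safe:

    struct buffer_ref {
            struct ring_buffer      *buffer;
            void                    *page;
            int                     cpu;
            refcount_t              refcount;       /* was a plain int */
    };

    /* ->get():     refcount_inc(&ref->refcount);
     * ->release(): if (refcount_dec_and_test(&ref->refcount)) free the page */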
      
      Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      Fixes: 73a757e6 ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      
      Conflicts:
        kernel/trace/trace.c
        include/linux/pipe_fs_i.h
      [yyl: adjust context]
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d82ba528
• x86/kprobes: Verify stack frame on kretprobe · dd26d152
  Masami Hiramatsu authored
      commit 3ff9c075cc767b3060bdac12da72fc94dd7da1b8 upstream.
      
      Verify the stack frame pointer on kretprobe trampoline handler,
      If the stack frame pointer does not match, it skips the wrong
      entry and tries to find correct one.
      
      This can happen if user puts the kretprobe on the function
      which can be used in the path of ftrace user-function call.
      Such functions should not be probed, so this adds a warning
      message that reports which function should be blacklisted.
Tested-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/155094059185.6137.15527904013362842072.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      dd26d152
• failover: allow name change on IFF_UP slave interfaces · 29b6215a
  Si-Wei Liu authored
      [ Upstream commit 8065a779 ]
      
      When a netdev appears through hot plug then gets enslaved by a failover
      master that is already up and running, the slave will be opened
      right away after getting enslaved. Today there's a race that userspace
      (udev) may fail to rename the slave if the kernel (net_failover)
      opens the slave earlier than when the userspace rename happens.
      Unlike bond or team, the primary slave of failover can't be renamed by
      userspace ahead of time, since the kernel initiated auto-enslavement is
      unable to, or rather, is never meant to be synchronized with the rename
      request from userspace.
      
       The failover slave interfaces are not designed to be operated
       directly by userspace apps: IP configuration, filter rules for
       passing network traffic, etc., should all be done on the master
       interface. In general, userspace apps only care about the
       name of the master interface, while slave names are less important
       as long as admin users can see reliable names that may carry
       other information describing the netdev. For example, they can infer
       that "ens3nsby" is a standby slave of "ens3", while for a
       name like "eth0" they can't tell which master it belongs to.
      
      Historically the name of IFF_UP interface can't be changed because
      there might be admin script or management software that is already
      relying on such behavior and assumes that the slave name can't be
      changed once UP. But failover is special: with the in-kernel
      auto-enslavement mechanism, the userspace expectation for device
      enumeration and bring-up order is already broken. Previously initramfs
      and various userspace config tools were modified to bypass failover
      slaves because of auto-enslavement and duplicate MAC address. Similarly,
      in case that users care about seeing reliable slave name, the new type
      of failover slaves needs to be taken care of specifically in userspace
      anyway.
      
       It's less risky to lift the rename restriction on a failover slave
       which is already UP. Although this change may potentially
       break userspace components (most likely configuration scripts or
       management software) that assume the slave name can't be changed
       while UP, this is a relatively limited and controllable set among all
       userspace components, which can be fixed specifically to listen for
       the rename events on failover slaves. Userspace components interacting
       with slaves are expected to be changed to operate on the failover
       master interface instead, as the failover slave is dynamic in nature
       and may come and go at any point. The goal is to make the role of
       failover slaves less relevant, and userspace components should only
       deal with the failover master in the long run.
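       
       The shape of the change in dev_change_name() is roughly the following
       (sketch; IFF_LIVE_RENAME_OK is the new priv_flag the failover slave
       sets, details may differ from the final diff):
       
         /* before this change, any IFF_UP device refused a rename */
         if (dev->flags & IFF_UP &&
             likely(!(dev->priv_flags & IFF_LIVE_RENAME_OK)))
                 return -EBUSY;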
      
      Fixes: 30c8bd5a ("net: Introduce generic failover module")
      Signed-off-by: NSi-Wei Liu <si-wei.liu@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Acked-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      29b6215a
    • J
      net: phy: marvell: add new default led configure for m88e151x · 1e8a0462
       Jian Shen committed
      mainline-next inclusion
      from mainline-5.1-rc5
      commit a93f7fe134543649cf2e2d8fc2c50a8f4d742915
      category: bugfix
      bugzilla: NA
      CVE: NA
      -------------------------------------------------
      
       The default m88e151x LED configuration is 0x1177, which uses LED[0]
       for the 1000M link, LED[1] for the 100M link, and LED[2] for activity.
       But some boards, which use LED[0] for link and LED[1] for activity,
       prefer 0x1040. To be compatible with this case, this patch defines a
       new dev_flag and sets it before connecting the PHY in the HNS3
       driver. When the PHY initializes, the new LED configuration is used
       if this dev_flag is set.
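       
       A minimal sketch of the PHY side (illustrative; the flag and register
       macro names are approximations of the upstream patch):
       
         /* marvell.c, during LED initialization */
         if (phydev->dev_flags & MARVELL_PHY_LED0_LINK_LED1_ACTIVE)
                 def_config = MII_88E1510_PHY_LED0_LINK_LED1_ACTIVE; /* 0x1040 */
         else
                 def_config = MII_88E1510_PHY_LED_DEF;               /* 0x1177 */
         err = phy_write_paged(phydev, MII_MARVELL_LED_PAGE,
                               MII_PHY_LED_CTRL, def_config);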
      Signed-off-by: NJian Shen <shenjian15@huawei.com>
      Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
       Reviewed-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      1e8a0462
    • M
      fs: prevent page refcount overflow in pipe_buf_get · 49ecd39f
       Matthew Wilcox committed
      mainline inclusion
      from mainline-5.1-rc5
      commit 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb
      category: 13690
      bugzilla: NA
      CVE: CVE-2019-11487
      
      There are four commits to fix this CVE:
        fs: prevent page refcount overflow in pipe_buf_get
        mm: prevent get_user_pages() from overflowing page refcount
        mm: add 'try_get_page()' helper function
        mm: make page ref count overflow check tighter and more explicit
      
      -------------------------------------------------
      
      Change pipe_buf_get() to return a bool indicating whether it succeeded
      in raising the refcount of the page (if the thing in the pipe is a page).
      This removes another mechanism for overflowing the page refcount.  All
      callers converted to handle a failure.
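       
       The new calling convention looks roughly like this (sketch, not the
       full diff):
       
         static inline __must_check bool pipe_buf_get(struct pipe_inode_info *pipe,
                                                      struct pipe_buffer *buf)
         {
                 return buf->ops->get(pipe, buf);   /* ->get now returns bool */
         }
       
         /* page-backed buffers use the non-overflowing helper */
         bool generic_pipe_buf_get(struct pipe_inode_info *pipe,
                                   struct pipe_buffer *buf)
         {
                 return try_get_page(buf->page);
         }
       
         /* callers must now check the result, e.g. in the splice/tee paths: */
         if (!pipe_buf_get(pipe, buf))
                 goto out;       /* refcount saturated, fail the operation */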
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Nzhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      49ecd39f
    • L
      mm: make page ref count overflow check tighter and more explicit · 5026c0ae
       Linus Torvalds committed
      mainline inclusion
      from mainline-5.1-rc5
      commit f958d7b528b1b40c44cfda5eabe2d82760d868c3
      category: 13690
      bugzilla: NA
      CVE: CVE-2019-11487
      
      There are four commits to fix this CVE:
        fs: prevent page refcount overflow in pipe_buf_get
        mm: prevent get_user_pages() from overflowing page refcount
        mm: add 'try_get_page()' helper function
        mm: make page ref count overflow check tighter and more explicit
      
      -------------------------------------------------
      
      We have a VM_BUG_ON() to check that the page reference count doesn't
      underflow (or get close to overflow) by checking the sign of the count.
      
      That's all fine, but we actually want to allow people to use a "get page
      ref unless it's already very high" helper function, and we want that one
      to use the sign of the page ref (without triggering this VM_BUG_ON).
      
      Change the VM_BUG_ON to only check for small underflows (or _very_ close
      to overflowing), and ignore overflows which have strayed into negative
      territory.
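       
       The resulting check is roughly the following (sketch of the upstream
       change; treat the exact constant and macro name as illustrative):
       
         #define page_ref_zero_or_close_to_overflow(page) \
                 ((unsigned int) page_ref_count(page) + 127u <= 127u)
       
         static inline void get_page(struct page *page)
         {
                 page = compound_head(page);
                 /* getting a page requires an already elevated _refcount */
                 VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
                 page_ref_inc(page);
         }
       
       The unsigned trick makes the VM_BUG_ON fire only when the count is 0
       or within 127 below 0 (a real underflow), while a count that an
       overflow has pushed far into negative territory no longer trips it.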
      Acked-by: NMatthew Wilcox <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Nzhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5026c0ae
    • Y
      dm mpath: fix missing call of path selector type->end_io · bfa273a7
       Yufen Yu committed
      euler inclusion
      category: bugfix
      bugzilla: 13971
      CVE: NA
      
      -------------------------------------------------
      
       After commit 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via
       blk_insert_cloned_request feedback"), map_request() will requeue the tio
       when the issued clone request returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
       
       Thus, if the device driver status is error, a tio may be requeued multiple
       times until the return value is not DM_MAPIO_REQUEUE. That means type->start_io
       may be called multiple times, while type->end_io is only called when the IO
       completes.
       
       In fact, even without the commit, a setup_clone() error can also make the
       tio requeue and miss the call to type->end_io.
       
       Take the service-time path selector as an example: it selects a path based
       on in_flight_size, which is increased by st_start_io() and decreased by
       st_end_io(). A missing call of end_io can lead to an in_flight_size error
       and let the selector make the wrong choice. In addition, the queue-length
       path selector will also be affected.
       
       To fix the problem, we call type->end_io in ->release_clone_rq before
       the tio is requeued. We pass map_info to ->release_clone_rq() for the
       requeue path, and pass NULL for the other paths.
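       
       A sketch of the resulting requeue-side cleanup (illustrative; based on
       the equivalent mainline fix, helper names may differ in this tree):
       
         static void multipath_release_clone(struct request *clone,
                                             union map_info *map_context)
         {
                 if (unlikely(map_context)) {
                         /* requeue path: balance the earlier ->start_io() */
                         struct dm_mpath_io *mpio = get_mpio(map_context);
                         struct pgpath *pgpath = mpio->pgpath;
       
                         if (pgpath && pgpath->pg->ps.type->end_io)
                                 pgpath->pg->ps.type->end_io(&pgpath->pg->ps,
                                                             &pgpath->path,
                                                             mpio->nr_bytes);
                 }
                 blk_put_request(clone);
         }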
      
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Reviewed-by: NMiao Xie <miaoxie@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      bfa273a7
    • J
      iommu/iova: Separate atomic variables to improve performance · d1473d2a
       Jinyu Qi committed
      mainline inclusion
      from linux-next
      commit: 14bd9a607f9082e7b5690c27e69072f2aeae0de4
      category: feature
      feature: IOMMU performance
      bugzilla: NA
      CVE: NA
      
      --------------------------------------------------
      
       In struct iova_domain, there are three atomic variables: the former two
       are TLB flush counters which use the atomic_add operation, and the other
       is used for the flush timer, which uses a cmpxchg operation.
       These variables are in the same cache line, so they cause some
       performance loss when many cores call the queue_iova
       function. Let's isolate the two types of atomic variables into different
       cache lines to reduce cache line conflicts.
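       
       Illustration of the intended layout (the actual patch achieves the
       separation by reordering fields in struct iova_domain; the explicit
       alignment annotations below are only to show the idea):
       
         struct iova_domain {
                 ...
                 /* bumped with atomic64_add() on every queued flush */
                 atomic64_t  fq_flush_start_cnt  ____cacheline_aligned_in_smp;
                 atomic64_t  fq_flush_finish_cnt;
       
                 /* toggled with atomic_cmpxchg() when (re)arming the timer */
                 atomic_t    fq_timer_on         ____cacheline_aligned_in_smp;
                 ...
         };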
      
      Cc: Joerg Roedel <joro@8bytes.org>
      Signed-off-by: NJinyu Qi <jinyuqi@huawei.com>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d1473d2a
    • G
      iommu/iova: Optimise attempts to allocate iova from 32bit address range · 42a55ea3
       Ganapatrao Kulkarni committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: bee60e94a1e20ec0b8ffdafae270731d8fda4551
      category: feature
      feature: IOMMU performance
      bugzilla: NA
      CVE: NA
      
      --------------------------------------------------
      
       As an optimisation for PCI devices, a first attempt is always made
       to allocate the iova from the SAC address range. This leads
       to unnecessary attempts when there are no free ranges
       available. Add a fix to track the recently failed iova address size and
       allow further attempts only if the requested size is smaller than the
       failed size. The size is updated whenever any replenishment happens.
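       
       The tracking works roughly like this (sketch; the field name follows
       the upstream patch, surrounding code is simplified):
       
         /* in __alloc_and_insert_iova_range() */
         if (limit_pfn <= iovad->dma_32bit_pfn &&
             size >= iovad->max32_alloc_size)
                 goto iova32_full;               /* don't even try again */
         ...
         iova32_full:
                 iovad->max32_alloc_size = size; /* remember the failed size */
       
         /* when an iova below dma_32bit_pfn is freed, the limit is reset so
          * that allocations of that size can be attempted again */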
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NGanapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      42a55ea3
    • A
      coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping · bf98085f
       Andrea Arcangeli committed
      mainline inclusion
      from mainline-5.1-rc6
      commit 04f5866e41fb70690e28397487d8bd8eea7d712a
      category: 13690
      bugzilla: NA
      CVE: CVE-2019-3892
      
      -------------------------------------------------
      
      The core dumping code has always run without holding the mmap_sem for
       writing, despite that being the only way to ensure that the entire vma
      layout will not change from under it.  Only using some signal
      serialization on the processes belonging to the mm is not nearly enough.
      This was pointed out earlier.  For example in Hugh's post from Jul 2017:
      
        https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
      
        "Not strictly relevant here, but a related note: I was very surprised
         to discover, only quite recently, how handle_mm_fault() may be called
         without down_read(mmap_sem) - when core dumping. That seems a
         misguided optimization to me, which would also be nice to correct"
      
       In particular, because growsdown and growsup can move the
       vm_start/vm_end, the various loops the core dump does around the vma
       will not be consistent if page faults can happen concurrently.
      
      Pretty much all users calling mmget_not_zero()/get_task_mm() and then
      taking the mmap_sem had the potential to introduce unexpected side
      effects in the core dumping code.
      
      Adding mmap_sem for writing around the ->core_dump invocation is a
      viable long term fix, but it requires removing all copy user and page
      faults and to replace them with get_dump_page() for all binary formats
      which is not suitable as a short term fix.
      
      For the time being this solution manually covers the places that can
      confuse the core dump either by altering the vma layout or the vma flags
      while it runs.  Once ->core_dump runs under mmap_sem for writing the
      function mmget_still_valid() can be dropped.
      
      Allowing mmap_sem protected sections to run in parallel with the
      coredump provides some minor parallelism advantage to the swapoff code
      (which seems to be safe enough by never mangling any vma field and can
      keep doing swapins in parallel to the core dumping) and to some other
      corner case.
      
      In order to facilitate the backporting I added "Fixes: 86039bd3"
      however the side effect of this same race condition in /proc/pid/mem
      should be reproducible since before 2.6.12-rc2 so I couldn't add any
      other "Fixes:" because there's no hash beyond the git genesis commit.
      
      Because find_extend_vma() is the only location outside of the process
      context that could modify the "mm" structures under mmap_sem for
      reading, by adding the mmget_still_valid() check to it, all other cases
      that take the mmap_sem for reading don't need the new check after
      mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
      context also doesn't need the new check, because all tasks under core
      dumping are frozen.
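       
       The helper itself is tiny (sketch of the check and the intended usage
       pattern; the core dump attaches mm->core_state before it starts):
       
         static inline bool mmget_still_valid(struct mm_struct *mm)
         {
                 return likely(!mm->core_state);
         }
       
         /* after mmget_not_zero()/get_task_mm(), before touching vmas: */
         down_write(&mm->mmap_sem);
         if (mmget_still_valid(mm)) {
                 /* safe to alter the vma layout / vma flags here */
         }
         up_write(&mm->mmap_sem);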
      
      Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
      Fixes: 86039bd3 ("userfaultfd: add new syscall to provide memory externalization")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NJann Horn <jannh@google.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NJann Horn <jannh@google.com>
      Acked-by: NJason Gunthorpe <jgg@mellanox.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
        drivers/infiniband/core/uverbs_main.c
      [yyl: adjust context]
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Nzhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      bf98085f
    • M
      arm64: CI Code scanning warning clean · 33cef3a3
       Mingqiang Ling committed
      driver inclusion
      category: bugfix
      bugzilla: 13683
      CVE: NA
      
      -------------------------------------------------
      
       Clean up warnings reported by CI code scanning.
       
       Feature or Bugfix: Bugfix
      Signed-off-by: Nxuzaibo <xuzaibo@huawei.com>
      Reviewed-by: Nwangzhou <wangzhou1@hisilicon.com>
      Signed-off-by: NMingqiang Ling <lingmingqiang@huawei.com>
      Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      33cef3a3
    • T
      arm64: Fix static check warning for qm/zip · ac6bd719
       tanshukun committed
      driver inclusion
      category: bugfix
      bugzilla: 13683
      CVE: NA
      
      -------------------------------------------------
      
       Fix static check warnings for the qm/zip modules.
       
       Feature or Bugfix: Bugfix
      Signed-off-by: Ntanshukun (A) <tanshukun1@huawei.com>
      Reviewed-by: Nwangzhou <wangzhou1@hisilicon.com>
      Signed-off-by: NMingqiang Ling <lingmingqiang@huawei.com>
      Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      ac6bd719
    • M
      arm64: uacce: Remove some items in uacce_ops to uacce · b9a8af79
       Mingqiang Ling committed
      driver inclusion
      category: bugfix
      bugzilla: 13683
      CVE: NA
      
      -------------------------------------------------
      
       Original commit message:
      
       This patch moves api_ver, flags, and qf_pg_start from uacce_ops to uacce,
       and deletes owner from uacce_ops as we already have owner in cdev.
      
      These items indeed belong to uacce.
      Signed-off-by: NZhou Wang <wangzhou1@hisilicon.com>
      Reviewed-by: Nxuzaibo <xuzaibo@huawei.com>
      Reviewed-by: Nfanghao <fanghao11@huawei.com>
      Signed-off-by: NMingqiang Ling <lingmingqiang@huawei.com>
      Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      b9a8af79