- 27 December 2019, 40 commits
-
Committed by Jacob Pan
hulk inclusion category: feature bugzilla: 14369 CVE: NA ------------------- DMA faults can be detected by the IOMMU at device level. Adding a pointer to struct device allows the IOMMU subsystem to report relevant faults back to the device driver for further handling. For directly assigned devices (or user-space drivers), the guest OS holds responsibility for handling and responding to per-device IOMMU faults. Therefore we need a fault reporting mechanism to propagate faults beyond the IOMMU subsystem. There are two other IOMMU data pointers under struct device today; here we introduce iommu_param as a parent pointer such that all device IOMMU data can be consolidated there. The idea was suggested by Greg KH and Joerg. The name iommu_param is chosen since iommu_data has already been used. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Link: https://lkml.org/lkml/2017/10/6/81 Signed-off-by: Fang Lijun <fanglijun3@huawei.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
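A minimal sketch of the consolidation this describes, with illustrative field names (the exact layout of the backported structures may differ):

    /* parent container for all per-device IOMMU data (sketch) */
    struct iommu_param {
            struct mutex lock;
            struct iommu_fault_param *fault_param;  /* driver-registered fault reporting data */
    };

    struct device {
            /* ... existing fields ... */
            struct iommu_group  *iommu_group;
            struct iommu_fwspec *iommu_fwspec;
            struct iommu_param  *iommu_param;       /* new: consolidated IOMMU data */
            /* ... */
    };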
-
Committed by Jacob Pan
hulk inclusion category: feature bugzilla: 14369 CVE: NA ------------------- Device faults detected by the IOMMU can be reported outside the IOMMU subsystem for further processing. This patch intends to provide generic device fault data such that device drivers can be informed of IOMMU faults without model-specific knowledge. The proposed format is the result of discussion at: https://lkml.org/lkml/2017/11/10/291 Part of the code is based on Jean-Philippe Brucker's patchset (https://patchwork.kernel.org/patch/9989315/). The assumption is that model-specific IOMMU drivers can filter and handle most of the internal faults if the cause is within the IOMMU driver's control. Therefore, the fault reasons that can be reported are grouped and generalized based on common specifications such as PCI ATS. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Fang Lijun <fanglijun3@huawei.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
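A rough sketch of what such a model-agnostic fault record could carry; the field names are assumptions for illustration, not the exact structure added by this backport:

    /* generic description of a device fault, independent of the IOMMU model */
    struct iommu_fault_event_sketch {
            u32     type;   /* unrecoverable DMA fault vs. page request */
            u32     reason; /* generalized cause, e.g. PCI ATS-style codes */
            u64     addr;   /* faulting address */
            u32     pasid;  /* PASID, if the request carried one */
            u32     perm;   /* attempted access permissions */
    };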
-
Committed by Liu, Yi L
hulk inclusion category: feature bugzilla: 14369 CVE: NA ------------------- When an SVM capable device is assigned to a guest, the first level page tables are owned by the guest and the guest PASID table pointer is linked to the device context entry of the physical IOMMU. The host IOMMU driver has no knowledge of caching structure updates unless the guest invalidation activities are passed down to the host. The primary usage is derived from emulated IOMMU in the guest, where QEMU can trap invalidation activities before passing them down to the host/physical IOMMU. Since the invalidation data are obtained from user space and will be written into the physical IOMMU, we must allow security checks at various layers. Therefore, a generic invalidation data format is proposed here; model-specific IOMMU drivers need to convert it into their own format. Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Fang Lijun <fanglijun3@huawei.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Jacob Pan
hulk inclusion category: feature bugzilla: 14369 CVE: NA ------------------- Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use in the guest: https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html As part of the proposed architecture, when an SVM capable PCI device is assigned to a guest, nested mode is turned on. The guest owns the first level page tables (requests with PASID), which perform GVA->GPA translation. Second level page tables are owned by the host for GPA->HPA translation for requests both with and without PASID. A new IOMMU driver interface is therefore needed to perform tasks as follows: * Enable nested translation and appropriate translation type * Assign guest PASID table pointer (in GPA) and size to host IOMMU This patch introduces new API functions to perform bind/unbind of guest PASID tables. Based on common data, model-specific IOMMU drivers can be extended to perform the specific steps for binding the PASID table of assigned devices. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> [Backported to 4.19 -add SPDX-License-Identifier] Signed-off-by: Fang Lijun <fanglijun3@huawei.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
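A sketch of what the new entry points could look like; the exact signatures and the contents of pasid_table_config are assumptions based on the changelog, not the verbatim backport:

    /* bind the guest PASID table described by cfg to dev's context entry */
    int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
                               struct pasid_table_config *cfg);

    /* tear the binding down again, e.g. when the device is released from the guest */
    void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev);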
-
Committed by Paolo Bonzini
[ Upstream commit 1d487e9bf8ba66a7174c56a0029c54b1eca8f99c ] These were found with smatch, and then generalized when applicable. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Jian-Hong Pan
[ Upstream commit 0082517fa4bce073e7cf542633439f26538a14cc ] Upon reboot, the Acer TravelMate X514-51T laptop appears to complete the shutdown process, but then it hangs in BIOS POST with a black screen. The problem is intermittent - at some points it has appeared related to Secure Boot settings or different kernel builds, but ultimately we have not been able to identify the exact conditions that trigger the issue to come and go. Besides, the EFI mode cannot be disabled in the BIOS of this model. However, after extensive testing, we observe that using the EFI reboot method reliably avoids the issue in all cases. So add a boot time quirk to use EFI reboot on such systems. Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=203119 Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com> Signed-off-by: Daniel Drake <drake@endlessm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Cc: linux@endlessm.com Link: http://lkml.kernel.org/r/20190412080152.3718-1-jian-hong@endlessm.com [ Fix !CONFIG_EFI build failure, clarify the code and the changelog a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
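Such quirks usually hang off a DMI match table in the x86 reboot code; a sketch of that shape, with the match strings treated as assumptions rather than the exact ones used by the patch:

    static int __init set_efi_reboot(const struct dmi_system_id *d)
    {
            if (reboot_type != BOOT_EFI && !efi_runtime_disabled()) {
                    reboot_type = BOOT_EFI;
                    pr_info("%s: selecting EFI reboot method\n", d->ident);
            }
            return 0;
    }

    static const struct dmi_system_id reboot_dmi_table[] __initconst = {
            {       /* Acer TravelMate X514-51T */
                    .callback = set_efi_reboot,
                    .ident = "Acer TravelMate X514-51T",
                    .matches = {
                            DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
                            DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate X514-51T"),
                    },
            },
            { }
    };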
-
Committed by Jens Axboe
commit 77f1e0a52d26242b6c2dba019f6ebebfb9ff701e upstream A previous commit moved the shallow depth and BFQ depth map calculations to be done at init time, moving it outside of the hotter IO path. This potentially causes hangs if the user changes the depth of the scheduler map, by writing to the 'nr_requests' sysfs file for that device. Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if the depth changes, so that the scheduler can update its internal state. Signed-off-by: Eric Wheeler <bfq@linux.ewheeler.net> Tested-by: Kai Krakow <kai@kaishome.de> Reported-by: Paolo Valente <paolo.valente@linaro.org> Fixes: f0635b8a ("bfq: calculate shallow depths at init time") Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
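The hook is an elevator callback that blk-mq invokes after nr_requests changes; a sketch of the shape, with the name taken from the changelog and the wiring treated as an assumption:

    struct elevator_mq_ops {
            /* ... existing hooks ... */
            void (*depth_updated)(struct blk_mq_hw_ctx *hctx);      /* new hook */
    };

    /* called by blk-mq after the queue depth (nr_requests) has changed */
    static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
    {
            /* recompute the BFQ shallow-depth map for the new depth */
    }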
-
Committed by Josh Poimboeuf
commit 98af8452945c55652de68536afdde3b520fec429 upstream Keeping track of the number of mitigations for all the CPU speculation bugs has become overwhelming for many users. It's getting more and more complicated to decide which mitigations are needed for a given architecture. Complicating matters is the fact that each arch tends to have its own custom way to mitigate the same vulnerability. Most users fall into a few basic categories: a) they want all mitigations off; b) they want all reasonable mitigations on, with SMT enabled even if it's vulnerable; or c) they want all reasonable mitigations on, with SMT disabled if vulnerable. Define a set of curated, arch-independent options, each of which is an aggregation of existing options: - mitigations=off: Disable all mitigations. - mitigations=auto: [default] Enable all the default mitigations, but leave SMT enabled, even if it's vulnerable. - mitigations=auto,nosmt: Enable all the default mitigations, disabling SMT if needed by a mitigation. Currently, these options are placeholders which don't actually do anything. They will be fleshed out in upcoming patches. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
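The command-line handling for such an aggregate switch is small; a sketch of the likely shape (the real parsing lives in generic cpu code, so treat the details as illustrative):

    enum cpu_mitigations {
            CPU_MITIGATIONS_OFF,
            CPU_MITIGATIONS_AUTO,
            CPU_MITIGATIONS_AUTO_NOSMT,
    };

    static enum cpu_mitigations cpu_mitigations __ro_after_init = CPU_MITIGATIONS_AUTO;

    static int __init mitigations_parse_cmdline(char *arg)
    {
            if (!strcmp(arg, "off"))
                    cpu_mitigations = CPU_MITIGATIONS_OFF;
            else if (!strcmp(arg, "auto"))
                    cpu_mitigations = CPU_MITIGATIONS_AUTO;
            else if (!strcmp(arg, "auto,nosmt"))
                    cpu_mitigations = CPU_MITIGATIONS_AUTO_NOSMT;
            return 0;
    }
    early_param("mitigations", mitigations_parse_cmdline);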
-
Committed by Thomas Gleixner
commit 8a4b06d391b0a42a373808979b5028f5c84d9c6a upstream Add the sysfs reporting file for MDS. It exposes the vulnerability and mitigation state similar to the existing files for the other speculative hardware vulnerabilities. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
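The vulnerability files all follow the same pattern: a weak default show function that arch code can override, wired up as a read-only attribute under /sys/devices/system/cpu/vulnerabilities/. A sketch under that assumption:

    ssize_t __weak cpu_show_mds(struct device *dev,
                                struct device_attribute *attr, char *buf)
    {
            return sprintf(buf, "Not affected\n");  /* arch code overrides this */
    }

    static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
    /* read back as /sys/devices/system/cpu/vulnerabilities/mds */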
-
Committed by Lijun Fang
euler inclusion category: feature bugzilla: 11082 CVE: NA ----------------- CDM nodes should not be part of mems_allowed. However, allocation from a CDM node must still be allowed when mpol->mode is MPOL_BIND. Signed-off-by: Lijun Fang <fanglijun3@huawei.com> Reviewed-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Anshuman Khandual
euler inclusion category: feature bugzilla: 11082 CVE: NA ------------------- The kernel cannot track device memory accesses behind VMAs containing CDM memory. Hence all the VM_CDM marked VMAs should not be part of the auto NUMA migration scheme. This patch also adds a new function is_cdm_vma() to detect any VMA marked with the flag VM_CDM. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Lijun Fang <fanglijun3@huawei.com> Reviewed-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
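Given the description, the helper is essentially a flag test; a sketch, assuming VM_CDM only exists when the coherent-device support is configured:

    #ifdef CONFIG_COHERENT_DEVICE
    static inline bool is_cdm_vma(struct vm_area_struct *vma)
    {
            return !!(vma->vm_flags & VM_CDM);      /* VMA backed by coherent device memory */
    }
    #else
    static inline bool is_cdm_vma(struct vm_area_struct *vma)
    {
            return false;
    }
    #endif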
-
Committed by Anshuman Khandual
euler inclusion category: feature bugzilla: 11082 CVE: NA ------------------- Mark all the applicable VMAs with VM_CDM explicitly during an mbind(MPOL_BIND) call if the user-provided nodemask has a CDM node. Mark the corresponding VMA with the VM_CDM flag if the allocated page happens to be from a CDM node. This can be expensive from a performance standpoint. There are multiple checks to avoid an expensive page_to_nid lookup, but it can be optimized further. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Lijun Fang <fanglijun3@huawei.com> Reviewed-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Anshuman Khandual
euler inclusion category: feature bugzilla: 11082 CVE: NA ------------------- There are certain devices like specialized accelerators, GPU cards, network cards, FPGA cards etc. which might contain onboard memory which is coherent along with the existing system RAM while being accessed either from the CPU or from the device. They share some similar properties with normal system RAM but at the same time can also differ from it. User applications might be interested in using this kind of coherent device memory explicitly or implicitly alongside the system RAM, utilizing all possible core memory functions like anon mapping (LRU), file mapping (LRU), page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations etc. To achieve this kind of tight integration with the core memory subsystem, the device onboard coherent memory must be represented as a memory only NUMA node. At the same time the arch must export a function to identify this node as coherent device memory and not any other regular cpu-less memory only NUMA node. After achieving the integration with the core memory subsystem, coherent device memory might still need some special consideration inside the kernel. There can be a variety of coherent memory nodes with different expectations from the core kernel memory. But right now only one kind of special treatment is considered, which requires certain isolation. Now consider the case of a coherent device memory node type which requires isolation. This kind of coherent memory is onboard an external device attached to the system through a link where there is always a chance of a link failure taking down the entire memory node with it. Moreover the memory might also have a higher chance of ECC failure as compared to the system RAM. Hence allocation into this kind of coherent memory node should be regulated. Kernel allocations must not come here. Normal user space allocations too should not come here implicitly (without the user application knowing about it). This summarizes the isolation requirement of a certain kind of coherent device memory node as an example. There can be different kinds of isolation requirements also. Some coherent memory devices might not require isolation altogether. Then there might be other coherent memory devices which might require some other special treatment after being part of the core memory representation. For now, we will look into isolation-seeking coherent device memory nodes, not the other ones. To implement the integration as well as isolation, the coherent memory node must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside the node_states[] array. During memory hotplug operations, the new nodemask N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent device memory nodes. This also creates the following new sysfs based interface to list down all the coherent memory nodes of the system. /sys/devices/system/node/is_cdm_node Architectures must export the function arch_check_node_cdm() which identifies any coherent device memory node in case they enable CONFIG_COHERENT_DEVICE. Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: zhong jiang <zhongjiang@huawei.com> [Backported to 4.19 -remove set or clear node state for memory_hotplug -separate CONFIG_COHERENT and CPUSET] Signed-off-by: Lijun Fang <fanglijun3@huawei.com> Reviewed-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
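A compact sketch of the pieces named above (arch hook plus nodemask query); the query helper name is an assumption used for illustration:

    #ifdef CONFIG_COHERENT_DEVICE
    /* arch hook: non-zero if nid is backed by coherent device memory */
    extern int arch_check_node_cdm(int nid);

    /* query the new node state, which is kept alongside N_MEMORY */
    static inline bool is_cdm_node_sketch(int nid)
    {
            return node_isset(nid, node_states[N_COHERENT_DEVICE]);
    }
    #endif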
-
Committed by Jann Horn
[ Upstream commit a0fe2c6479aab5723239b315ef1b552673f434a3 ] Use parentheses around uses of the argument in u64_to_user_ptr() to ensure that the cast doesn't apply to part of the argument. There are existing uses of the macro of the form u64_to_user_ptr(A + B) which expands to (void __user *)(uintptr_t)A + B (the cast applies to the first operand of the addition, the addition is a pointer addition). This happens to still work as intended, the semantic difference doesn't cause a difference in behavior. But I want to use u64_to_user_ptr() with a ternary operator in the argument, like so: u64_to_user_ptr(A ? B : C) This currently doesn't work as intended. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Cc: Andrei Vagin <avagin@openvz.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: NeilBrown <neilb@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qiaowei Ren <qiaowei.ren@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190329214652.258477-1-jannh@google.com Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
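The fix itself is just parenthesizing the macro argument; a sketch of the before/after shape (close to the include/linux/kernel.h definition, but treat it as illustrative):

    /* before: the cast binds only to the first token of the argument */
    #define u64_to_user_ptr(x) ({           \
            typecheck(u64, x);              \
            (void __user *)(uintptr_t)x;    \
    })

    /* after: parentheses make A + B and A ? B : C behave as expected */
    #define u64_to_user_ptr(x) ({           \
            typecheck(u64, (x));            \
            (void __user *)(uintptr_t)(x);  \
    })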
-
Committed by yangerkun
euler inclusion category: bugfix bugzilla: 15743 CVE: NA --------------------------- Parallel threads add negative dentries under the root dir. Some time later, 'systemctl daemon-reload' will report a soft lockup, since __fsnotify_update_child_dentry_flags needs to update all children under the root dentry without distinguishing whether they are active or not. This wastes a long time while holding the d_lock of the root dentry, and other threads trying to spin_lock d_lock will run overtime. Limiting the number of negative dentries under a dir can avoid this. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by zhangyi (F)
euler inclusion category: feature bugzilla: 14367 CVE: NA --------------------------- Just like opening an exclusively-opened block device for write, exclusively opening a block device that has already been opened for write by some other process may also lead to potential data corruption. This patch records the write openers and gives a hint if that happens. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Xiongfeng Wang
hulk inclusion category: bugfix bugzilla: NA CVE: NA ------------------------------------------------- When I run a stress test of pcie hotplug and removal operations via sysfs, I get a hung task, and the following call trace is printed. INFO: task irq/746-pciehp:41551 blocked for more than 120 seconds. Tainted: P W OE 4.19.25- "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. irq/746-pciehp D 0 41551 2 0x00000228 Call trace: __switch_to+0x94/0xe8 __schedule+0x270/0x8b0 schedule+0x2c/0x88 schedule_preempt_disabled+0x14/0x20 __mutex_lock.isra.1+0x1fc/0x540 __mutex_lock_slowpath+0x24/0x30 mutex_lock+0x80/0xa8 pci_lock_rescan_remove+0x20/0x28 pciehp_configure_device+0x30/0x140 pciehp_handle_presence_or_link_change+0x35c/0x4b0 pciehp_ist+0x1cc/0x1d0 irq_thread_fn+0x30/0x80 irq_thread+0x128/0x200 kthread+0x134/0x138 ret_from_fork+0x10/0x18 INFO: task bash:6424 blocked for more than 120 seconds. Tainted: P W OE 4.19.25- "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. bash D 0 6424 2231 0x00000200 Call trace: __switch_to+0x94/0xe8 __schedule+0x270/0x8b0 schedule+0x2c/0x88 schedule_timeout+0x224/0x448 wait_for_common+0x198/0x2a0 wait_for_completion+0x28/0x38 kthread_stop+0x60/0x190 __free_irq+0x1c0/0x348 free_irq+0x40/0x88 pcie_shutdown_notification+0x54/0x80 pciehp_remove+0x30/0x50 pcie_port_remove_service+0x3c/0x58 device_release_driver_internal+0x1b4/0x250 device_release_driver+0x28/0x38 bus_remove_device+0xd4/0x160 device_del+0x128/0x348 device_unregister+0x24/0x78 remove_iter+0x48/0x58 device_for_each_child+0x6c/0xb8 pcie_port_device_remove+0x2c/0x48 pcie_portdrv_remove+0x5c/0x68 pci_device_remove+0x48/0xd8 device_release_driver_internal+0x1b4/0x250 device_release_driver+0x28/0x38 pci_stop_bus_device+0x84/0xb8 pci_stop_and_remove_bus_device_locked+0x24/0x40 remove_store+0xa4/0xb8 dev_attr_store+0x44/0x60 sysfs_kf_write+0x58/0x80 kernfs_fop_write+0xe8/0x1f0 __vfs_write+0x60/0x190 vfs_write+0xac/0x1c0 ksys_write+0x6c/0xd8 __arm64_sys_write+0x24/0x30 el0_svc_common+0xa0/0x180 el0_svc_handler+0x38/0x78 el0_svc+0x8/0xc When we remove a slot via sysfs, 'pci_stop_and_remove_bus_device_locked()' will be called. This function takes the global mutex 'pci_rescan_remove_lock' and removes the slot. If the irq thread 'pciehp_ist' is still running, we will wait until it exits. If a pciehp interrupt happens immediately after we remove the slot via sysfs, but before we free the pciehp irq in 'pci_stop_and_remove_bus_device_locked()', 'pciehp_ist' will hang because the global mutex 'pci_rescan_remove_lock' is held by the sysfs operation. But the sysfs operation is waiting for the pciehp irq thread 'pciehp_ist' to end. Then a hung task occurs. So these two kinds of operation, removal through the attention button and removal through /sys/devices/pci***/remove, should not be executed at the same time. This patch adds a global variable to mark that one of these operations is under processing. When this variable is set, if another operation is requested, it will be rejected. Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by lingmingqiang
driver inclusion category: bugfix bugzilla: NA CVE: NA This reverts commit 6b546eb4c02210d7bc77aa64025cc6c84e5f1b30. Feature or Bugfix: Bugfix Signed-off-by: lingmingqiang <lingmingqiang@huawei.com> Reviewed-by: hucheng.hu <hucheng.hu@huawei.com> Reviewed-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by yangerkun
euler inclusion category: bugfix bugzilla: 14351 CVE: NA --------------------------- We use lockref for the dentry reference count without noticing that a large number of negative dentries under one dir can overflow the lockref. This can lead to a system crash if it happens under the root dir. Since there is no perfect solution, we just limit the maximum dentry count to INT_MAX / 2. Going from INT_MAX / 2 to INT_MAX would also take a long time, so we do not need to do this check under the protection of the dentry lock. We also limit FILES_MAX to INT_MAX / 2, since many opens of the same file can lead to an overflow too. Changelog: v1->v2: add a function to do the check / add a macro to mean INT_MAX / 2 Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by David Müller
commit 7c2e07130090ae001a97a6b65597830d6815e93e upstream. Since commit 648e9218 ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL"), the pmc_plt_clocks of the Bay Trail SoC are unconditionally gated off. Unfortunately this will break systems where these clocks are used for external purposes beyond the kernel's knowledge. Fix it by implementing a system specific quirk to mark the necessary pmc_plt_clks as critical. Fixes: 648e9218 ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL") Signed-off-by: David Müller <dave.mueller@gmx.ch> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Kirill Smelkov
fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock [ Upstream commit 10dce8af34226d90fa56746a934f8da5dcdba3df ] Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added locking for file.f_pos access and in particular made concurrent read and write not possible - now both those functions take f_pos lock for the whole run, and so if e.g. a read is blocked waiting for data, write will deadlock waiting for that read to complete. This caused regression for stream-like files where previously read and write could run simultaneously, but after that patch could not do so anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes to /proc/xen/xenbus") which fixes such regression for particular case of /proc/xen/xenbus. The patch that added f_pos lock in 2014 did so to guarantee POSIX thread safety for read/write/lseek and added the locking to file descriptors of all regular files. In 2014 that thread-safety problem was not new as it was already discussed earlier in 2006. However even though 2006'th version of Linus's patch was adding f_pos locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding the stream-like objects like pipes and sockets)", the 2014 version - the one that actually made it into the tree as 9c225f26 - is doing so irregardless of whether a file is seekable or not. See https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/ https://lwn.net/Articles/180387 https://lwn.net/Articles/180396 for historic context. The reason that it did so is, probably, that there are many files that are marked non-seekable, but e.g. their read implementation actually depends on knowing current position to correctly handle the read. Some examples: kernel/power/user.c snapshot_read fs/debugfs/file.c u32_array_read fs/fuse/control.c fuse_conn_waiting_read + ... drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read arch/s390/hypfs/inode.c hypfs_read_iter ... Despite that, many nonseekable_open users implement read and write with pure stream semantics - they don't depend on passed ppos at all. And for those cases where read could wait for something inside, it creates a situation similar to xenbus - the write could be never made to go until read is done, and read is waiting for some, potentially external, event, for potentially unbounded time -> deadlock. 
Besides xenbus, there are 14 such places in the kernel that I've found with semantic patch (see below): drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write() drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write() drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write() drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write() net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write() drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write() drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write() drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write() net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write() drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write() drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write() drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write() drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write() drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write() In addition to the cases above another regression caused by f_pos locking is that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can no longer implement bidirectional stream-like files - for the same reason as above e.g. read can deadlock write locking on file.f_pos in the kernel. FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse: implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and write routines not depending on current position at all, and with both read and write being potentially blocking operations: See https://github.com/libfuse/osspd https://lwn.net/Articles/308445 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510 Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as "somewhat pipe-like files ..." with read handler not using offset. However that test implements only read without write and cannot exercise the deadlock scenario: https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216 I've actually hit the read vs write deadlock for real while implementing my FUSE filesystem where there is /head/watch file, for which open creates separate bidirectional socket-like stream in between filesystem and its user with both read and write being later performed simultaneously. And there it is semantically not easy to split the stream into two separate read-only and write-only channels: https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169 Let's fix this regression. The plan is: 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would break many in-kernel nonseekable_open users which actually use ppos in read/write handlers. 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. Read and write on such file descriptors would never use nor change ppos. 
And with that property on stream-like files read and write will be running without taking f_pos lock - i.e. read and write could be running simultaneously. 3. With semantic patch search and convert to stream_open all in-kernel nonseekable_open users for which read and write actually do not depend on ppos and where there is no other methods in file_operations which assume @offset access. 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via stream_open if that bit is present in filesystem open reply. It was tempting to change fs/fuse/ open handler to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from v3.14+ (the kernel where 9c225f26 first appeared). This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in their open handler and this way avoid the deadlock on all kernel versions. This should work because fs/fuse/ ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. This patch adds stream_open, converts /proc/xen/xenbus to it and adds semantic patch to automatically locate in-kernel places that are either required to be converted due to read vs write deadlock, or that are just safe to be converted because read and write do not use ppos and there are no other funky methods in file_operations. Regarding semantic patch I've verified each generated change manually - that it is correct to convert - and each other nonseekable_open instance left - that it is either not correct to convert there, or that it is not converted due to current stream_open.cocci limitations. The script also does not convert files that should be valid to convert, but that currently have .llseek = noop_llseek or generic_file_llseek for unknown reason despite file being opened with nonseekable_open (e.g. drivers/input/mousedev.c) Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yongzhi Pan <panyongzhi@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Nikolaus Rath <Nikolaus@rath.org> Cc: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
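For drivers whose read/write genuinely ignore *ppos, opting in is a one-line change in the open handler; a sketch of the intended usage (device name and handler are hypothetical):

    static int mydev_open(struct inode *inode, struct file *file)
    {
            /* mark the fd as stream-like: read/write never use or move f_pos,
             * so the VFS skips the f_pos lock and they may run concurrently */
            return stream_open(inode, file);        /* instead of nonseekable_open() */
    }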
-
Committed by Alan Stern
commit c2b71462d294cf517a0bc6e4fd6424d7cee5596f upstream. The syzkaller fuzzer reported a bug in the USB hub driver which turned out to be caused by a negative runtime-PM usage counter. This allowed a hub to be runtime suspended at a time when the driver did not expect it. The symptom is a WARNING issued because the hub's status URB is submitted while it is already active: URB 0000000031fb463e submitted while active WARNING: CPU: 0 PID: 2917 at drivers/usb/core/urb.c:363 The negative runtime-PM usage count was caused by an unfortunate design decision made when runtime PM was first implemented for USB. At that time, USB class drivers were allowed to unbind from their interfaces without balancing the usage counter (i.e., leaving it with a positive count). The core code would take care of setting the counter back to 0 before allowing another driver to bind to the interface. Later on when runtime PM was implemented for the entire kernel, the opposite decision was made: Drivers were required to balance their runtime-PM get and put calls. In order to maintain backward compatibility, however, the USB subsystem adapted to the new implementation by keeping an independent usage counter for each interface and using it to automatically adjust the normal usage counter back to 0 whenever a driver was unbound. This approach involves duplicating information, but what is worse, it doesn't work properly in cases where a USB class driver delays decrementing the usage counter until after the driver's disconnect() routine has returned and the counter has been adjusted back to 0. Doing so would cause the usage counter to become negative. There's even a warning about this in the USB power management documentation! As it happens, this is exactly what the hub driver does. The kick_hub_wq() routine increments the runtime-PM usage counter, and the corresponding decrement is carried out by hub_event() in the context of the hub_wq work-queue thread. This work routine may sometimes run after the driver has been unbound from its interface, and when it does it causes the usage counter to go negative. It is not possible for hub_disconnect() to wait for a pending hub_event() call to finish, because hub_disconnect() is called with the device lock held and hub_event() acquires that lock. The only feasible fix is to reverse the original design decision: remove the duplicate interface-specific usage counter and require USB drivers to balance their runtime PM gets and puts. As far as I know, all existing drivers currently do this. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-and-tested-by: syzbot+7634edaea4d0b341c625@syzkaller.appspotmail.com CC: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Jim Broadus
commit 93b6604c5a669d84e45fe5129294875bf82eb1ff upstream. A previous change allowed I2C client devices to discover new IRQs upon reprobe by clearing the IRQ in i2c_device_remove. However, if an IRQ was assigned in i2c_new_device, that information is lost. For example, the touchscreen and trackpad devices on a Dell Inspiron laptop are I2C devices whose IRQs are defined by ACPI extended IRQ types. The client device structures are initialized during an ACPI walk. After removing the i2c_hid device, modprobe fails. This change caches the initial IRQ value in i2c_new_device and then resets the client device IRQ to the initial value in i2c_device_remove. Fixes: 6f108dd70d30 ("i2c: Clear client->irq in i2c_device_remove") Signed-off-by: Jim Broadus <jbroadus@gmail.com> Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com> [wsa: this is an easy to backport fix for the regression. We will refactor the code to handle irq assignments better in general.] Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by lingmingqiang
driver inclusion category: bugfix bugzilla: NA CVE: NA Feature or Bugfix: Bugfix
1. [42feaced] HPRE crypto warning clean: Warning -- Suspicious Truncation in arithmetic expression combining with pointer
2. [4b3f837d] crypto/zip: Fix to get queue from possible zip functions. Original commit message: Current code just gets a queue from the closest function and returns failure if the closest function has no available queue. In this patch, we first sort all available functions, then get a queue from the sorted functions one by one if a closer function has no available queue.
3. [7250b1a9] crypto/qm: Export function to get free qp number for acc
4. [86eeda2b] crypto/hisilicon/qm: Fix static check warning. Reduce the loop complexity of the qm_qp_ctx_cfg function.
5. [f1c558c0] Fix static check warning
6. [dfdfef8f] crypto/hisilicon/qm: Fix QM task timeout bug. There is a domain segment in eqc/aeqc that should be assigned a value in D06 ES, which is reserved in D06 CS.
7. [4bf721fe] Fix the bug where two kill signals are received by user processes. When two kill signals are received by a process, file->ops->flush will be called twice, and so will uacce->ops->flush. Currently, flush cannot be called again on the same uacce_queue file, or a core dump occurs. So a status field is added to the uacce queue; as flush and release operations run, the queue status is checked atomically. If the queue is already released, do nothing.
8. [20bd4257] uacce/dummy: Fix dummy compile problem. Original commit message: As we move flags, api_ver, qf_pg_start from uacce_ops to uacce, also fix the dummy driver to work together with the current uacce.
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com> Changes to be committed: modified: drivers/crypto/hisilicon/Kconfig modified: drivers/crypto/hisilicon/hpre/hpre_crypto.c modified: drivers/crypto/hisilicon/qm.c modified: drivers/crypto/hisilicon/qm.h modified: drivers/crypto/hisilicon/zip/zip_main.c modified: drivers/uacce/Kconfig modified: drivers/uacce/dummy_drv/dummy_wd_dev.c modified: drivers/uacce/dummy_drv/dummy_wd_v2.c modified: drivers/uacce/uacce.c modified: include/linux/uacce.h Reviewed-by: hucheng.hu <hucheng.hu@huawei.com> Reviewed-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Hou Tao
euler inclusion category: bugfix bugzilla: 14007 CVE: NA ------------------------------------------------- Process fork and cgroup migration can happen simultaneously, and in the following case use-after-free of css_set is possible: CPU 0: process fork CPU 1: cgroup migration dup_fd __cgroup1_procs_write(threadgroup=false) files_cgroup_assign // task A task_lock task_cgroup(current, files_cgrp_id) css_set = task_css_set_check() cgroup_migrate_execute files_cgroup_can_attach css_set_move_task put_css_set_locked() files_cgroup_attach // task B which is in the same // thread group as task A task_lock cgroup_migrate_finish // the css_set will be freed put_css_set_locked() // use-after-free css_set->subsys[files_cgrp_id] Fix it by using task_get_css() instead to get a valid css. Fixes: 52cc1eccf6de ("cgroups: Resource controller for open files") Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: luojiajun <luojiajun3@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
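task_get_css() pins the css with a reference before it is used, which is the property the fix relies on; a sketch of that usage (the function body is illustrative, not the exact controller code):

    static void files_cgroup_assign_sketch(void)
    {
            struct cgroup_subsys_state *css;

            /* takes a reference, so a concurrent migration cannot free it under us */
            css = task_get_css(current, files_cgrp_id);
            /* ... charge/record against css ... */
            css_put(css);
    }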
Committed by Hou Tao
euler inclusion category: bugfix bugzilla: 14007 CVE: NA ------------------------------------------------- These updates are leftovers from CentOS 7.x, and are not needed on hulk-4.19, so kill them. Fixes: 52cc1eccf6de ("cgroups: Resource controller for open files") Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: luojiajun <luojiajun3@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Andrei Vagin
[ Upstream commit fcfc2aa0185f4a731d05a21e9f359968fdfd02e7 ] There are a few system calls (pselect, ppoll, etc) which replace a task sigmask while they are running in kernel space. When a task calls one of these syscalls, the kernel saves the current sigmask in task->saved_sigmask and sets a syscall sigmask. On syscall-exit-stop, ptrace traps a task before restoring the saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by saved_sigmask when the task returns to user space. This patch fixes this problem. PTRACE_GETSIGMASK returns saved_sigmask if it's set. PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag. Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com Fixes: 29000cae ("ptrace: add ability to get/set signal-blocked mask") Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Jann Horn
commit b987222654f84f7b4ca95b3a55eca784cb30235b upstream. This fixes multiple issues in buffer_pipe_buf_ops: - The ->steal() handler must not return zero unless the pipe buffer has the only reference to the page. But generic_pipe_buf_steal() assumes that every reference to the pipe is tracked by the page's refcount, which isn't true for these buffers - buffer_pipe_buf_get(), which duplicates a buffer, doesn't touch the page's refcount. Fix it by using generic_pipe_buf_nosteal(), which refuses every attempted theft. It should be easy to actually support ->steal, but the only current users of pipe_buf_steal() are the virtio console and FUSE, and they also only use it as an optimization. So it's probably not worth the effort. - The ->get() and ->release() handlers can be invoked concurrently on pipe buffers backed by the same struct buffer_ref. Make them safe against concurrency by using refcount_t. - The pointers stored in ->private were only zeroed out when the last reference to the buffer_ref was dropped. As far as I know, this shouldn't be necessary anyway, but if we do it, let's always do it. Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Fixes: 73a757e6 ("ring-buffer: Return reader page back into existing ring buffer") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Conflicts: kernel/trace/trace.c include/linux/pipe_fs_i.h [yyl: adjust context] Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Masami Hiramatsu
commit 3ff9c075cc767b3060bdac12da72fc94dd7da1b8 upstream. Verify the stack frame pointer in the kretprobe trampoline handler. If the stack frame pointer does not match, skip the wrong entry and try to find the correct one. This can happen if the user puts the kretprobe on a function which can be used in the path of an ftrace user-function call. Such functions should not be probed, so this adds a warning message that reports which function should be blacklisted. Tested-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/155094059185.6137.15527904013362842072.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Si-Wei Liu
[ Upstream commit 8065a779 ] When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. It's less risky to lift up the rename restriction on failover slave which is already UP. Although it's possible this change may potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to listen for the rename events on failover slaves. Userspace component interacting with slaves is expected to be changed to operate on failover master interface instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and userspace components should only deal with failover master in the long run. Fixes: 30c8bd5a ("net: Introduce generic failover module") Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Jian Shen
mainline-next inclusion from mainline-5.1-rc5 commit a93f7fe134543649cf2e2d8fc2c50a8f4d742915 category: bugfix bugzilla: NA CVE: NA ------------------------------------------------- The default m88e151x LED configuration is 0x1177, which uses LED[0] for 1000M link, LED[1] for 100M link, and LED[2] for active. But some boards, which use LED[0] for link and LED[1] for active, prefer 0x1040. To be compatible with this case, this patch defines a new dev_flag and sets it before connecting the PHY in the HNS3 driver. When the PHY is initialized, the new LED configuration is used if this dev_flag is set. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Matthew Wilcox
mainline inclusion from mainline-5.1-rc5 commit 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb category: 13690 bugzilla: NA CVE: CVE-2019-11487 There are four commits to fix this CVE: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit ------------------------------------------------- Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
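A sketch of the changed calling convention (the wrapper mirrors the shape of include/linux/pipe_fs_i.h, but treat it as illustrative):

    /* now returns false if the reference could not be taken */
    static inline bool pipe_buf_get(struct pipe_inode_info *pipe,
                                    struct pipe_buffer *buf)
    {
            return buf->ops->get(pipe, buf);
    }

    /* callers must check the result, e.g.: */
    if (!pipe_buf_get(pipe, buf))
            return -EFAULT;     /* taking another reference would overflow the refcount */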
-
Committed by Linus Torvalds
mainline inclusion from mainline-5.1-rc5 commit f958d7b528b1b40c44cfda5eabe2d82760d868c3 category: 13690 bugzilla: NA CVE: CVE-2019-11487 There are four commits to fix this CVE: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit ------------------------------------------------- We have a VM_BUG_ON() to check that the page reference count doesn't underflow (or get close to overflow) by checking the sign of the count. That's all fine, but we actually want to allow people to use a "get page ref unless it's already very high" helper function, and we want that one to use the sign of the page ref (without triggering this VM_BUG_ON). Change the VM_BUG_ON to only check for small underflows (or _very_ close to overflowing), and ignore overflows which have strayed into negative territory. Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
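The resulting check is very close to the upstream helper; a sketch (still illustrative):

    /*
     * Trips only on small underflows (count between -127 and 0); a count
     * that has strayed far into negative territory from an attempted
     * overflow is deliberately ignored, so "get page ref unless it's
     * already very high" helpers can rely on the sign of the count.
     */
    static inline int page_ref_zero_or_close_to_overflow(struct page *page)
    {
            return ((unsigned int) page_ref_count(page) + 127u) <= 127u;
    }

    static inline void get_page(struct page *page)
    {
            page = compound_head(page);
            VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
            page_ref_inc(page);
    }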
-
Committed by Yufen Yu
euler inclusion category: bugfix bugzilla: 13971 CVE: NA ------------------------------------------------- After commit 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"), map_request() will requeue the tio when the issued clone request returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. Thus, if the device driver status is error, a tio may be requeued multiple times until the return value is not DM_MAPIO_REQUEUE. That means type->start_io may be called multiple times, while type->end_io is only called when the IO completes. In fact, even without that commit, a setup_clone() error can also make the tio requeue and miss the call to type->end_io. Take the service-time path selector as an example: it selects a path based on in_flight_size, which is increased by st_start_io() and decreased by st_end_io(). A missing end_io call can lead to an in_flight_size error and let the selector make the wrong choice. In addition, the queue-length path selector will also be affected. To fix the problem, we call type->end_io in ->release_clone_rq before the tio is requeued. map_info is passed to ->release_clone_rq() for the requeue path, and NULL for the other paths. Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Jinyu Qi
mainline inclusion from linux-next commit: 14bd9a607f9082e7b5690c27e69072f2aeae0de4 category: feature feature: IOMMU performance bugzilla: NA CVE: NA -------------------------------------------------- In struct iova_domain, there are three atomic variables: the former two are TLB flush counters which use an atomic_add operation, and the other one is used for the flush timer and uses a cmpxchg operation. These variables are in the same cache line, so they cause some performance loss when many cores call the queue_iova function. Let's isolate the two types of atomic variables into different cache lines to reduce cache line conflicts. Cc: Joerg Roedel <joro@8bytes.org> Signed-off-by: Jinyu Qi <jinyuqi@huawei.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
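A sketch of the intended layout (field names follow the mainline iova code; the explicit alignment annotation is an assumption used to illustrate the separation):

    struct iova_domain {
            /* ... */
            /* bumped with atomic adds on every queue_iova() */
            atomic64_t      fq_flush_start_cnt;
            atomic64_t      fq_flush_finish_cnt;

            /* cmpxchg'd when arming the flush timer; keep it off the
             * counters' cache line to avoid cache line ping-pong */
            atomic_t        fq_timer_on ____cacheline_aligned_in_smp;
            /* ... */
    };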
-
Committed by Ganapatrao Kulkarni
mainline inclusion from mainline-4.20-rc1 commit: bee60e94a1e20ec0b8ffdafae270731d8fda4551 category: feature feature: IOMMU performance bugzilla: NA CVE: NA -------------------------------------------------- As an optimisation for PCI devices, a first attempt is always made to allocate the IOVA from the SAC address range. This leads to unnecessary attempts when there are no free ranges available. Add a fix to track the most recently failed IOVA address size and allow further attempts only if the requested size is smaller than the failed size. The size is updated whenever any replenish happens. Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Andrea Arcangeli
mainline inclusion from mainline-5.1-rc6 commit 04f5866e41fb70690e28397487d8bd8eea7d712a category: 13690 bugzilla: NA CVE: CVE-2019-3892 ------------------------------------------------- The core dumping code has always run without holding the mmap_sem for writing, despite that is the only way to ensure that the entire vma layout will not change from under it. Only using some signal serialization on the processes belonging to the mm is not nearly enough. This was pointed out earlier. For example in Hugh's post from Jul 2017: https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils "Not strictly relevant here, but a related note: I was very surprised to discover, only quite recently, how handle_mm_fault() may be called without down_read(mmap_sem) - when core dumping. That seems a misguided optimization to me, which would also be nice to correct" In particular because the growsdown and growsup can move the vm_start/vm_end the various loops the core dump does around the vma will not be consistent if page faults can happen concurrently. Pretty much all users calling mmget_not_zero()/get_task_mm() and then taking the mmap_sem had the potential to introduce unexpected side effects in the core dumping code. Adding mmap_sem for writing around the ->core_dump invocation is a viable long term fix, but it requires removing all copy user and page faults and to replace them with get_dump_page() for all binary formats which is not suitable as a short term fix. For the time being this solution manually covers the places that can confuse the core dump either by altering the vma layout or the vma flags while it runs. Once ->core_dump runs under mmap_sem for writing the function mmget_still_valid() can be dropped. Allowing mmap_sem protected sections to run in parallel with the coredump provides some minor parallelism advantage to the swapoff code (which seems to be safe enough by never mangling any vma field and can keep doing swapins in parallel to the core dumping) and to some other corner case. In order to facilitate the backporting I added "Fixes: 86039bd3" however the side effect of this same race condition in /proc/pid/mem should be reproducible since before 2.6.12-rc2 so I couldn't add any other "Fixes:" because there's no hash beyond the git genesis commit. Because find_extend_vma() is the only location outside of the process context that could modify the "mm" structures under mmap_sem for reading, by adding the mmget_still_valid() check to it, all other cases that take the mmap_sem for reading don't need the new check after mmget_not_zero()/get_task_mm(). The expand_stack() in page fault context also doesn't need the new check, because all tasks under core dumping are frozen. 
Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com Fixes: 86039bd3 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Jann Horn <jannh@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: drivers/infiniband/core/uverbs_main.c [yyl: adjust context] Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
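The helper this change revolves around is tiny; a sketch close to the upstream definition (treat as illustrative), used after mmget_not_zero()/get_task_mm() in the mmap_sem-for-reading sections mentioned above:

    /*
     * False once a core dump has started on this mm: callers that only
     * hold mmap_sem for reading must then back off instead of touching
     * the vma layout or vma flags.
     */
    static inline bool mmget_still_valid(struct mm_struct *mm)
    {
            return likely(!mm->core_state);
    }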
-
Committed by Mingqiang Ling
driver inclusion category: bugfix bugzilla: 13683 CVE: NA ------------------------------------------------- CI Code scanning warning clean Feature or Bugfix: Bugfix Signed-off-by: xuzaibo <xuzaibo@huawei.com> Reviewed-by: wangzhou <wangzhou1@hisilicon.com> Signed-off-by: Mingqiang Ling <lingmingqiang@huawei.com> Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by tanshukun
driver inclusion category: bugfix bugzilla: 13683 CVE: NA ------------------------------------------------- Fix static check warning for qm/zip modules Feature or Bugfix: Bugfix Signed-off-by: tanshukun (A) <tanshukun1@huawei.com> Reviewed-by: wangzhou <wangzhou1@hisilicon.com> Signed-off-by: Mingqiang Ling <lingmingqiang@huawei.com> Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Mingqiang Ling
driver inclusion category: bugfix bugzilla: 13683 CVE: NA ------------------------------------------------- Original commit message: This patch moves api_ver, flags, qf_pg_start from uacce_ops to uacce, and deletes owner in uacce_ops as we already have owner in cdev. These items indeed belong to uacce. Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com> Reviewed-by: xuzaibo <xuzaibo@huawei.com> Reviewed-by: fanghao <fanghao11@huawei.com> Signed-off-by: Mingqiang Ling <lingmingqiang@huawei.com> Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-