1. 27 December 2019, 40 commits
• driver core: add per device iommu param · 6d1740b5
  Jacob Pan authored
      hulk inclusion
      category: feature
      bugzilla: 14369
      CVE: NA
      -------------------
      
DMA faults can be detected by the IOMMU at the device level. Adding a
pointer to struct device allows the IOMMU subsystem to report relevant
faults back to the device driver for further handling.
For directly assigned devices (or user-space drivers), the guest OS is
responsible for handling and responding to per-device IOMMU faults.
Therefore we need a fault reporting mechanism that propagates faults
beyond the IOMMU subsystem.
      
      There are two other IOMMU data pointers under struct device today, here
      we introduce iommu_param as a parent pointer such that all device IOMMU
      data can be consolidated here. The idea was suggested here by Greg KH
      and Joerg. The name iommu_param is chosen here since iommu_data has
      been used.
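
A minimal sketch of the change described above (member placement is illustrative; only the new pointer is what the patch adds):

    struct iommu_param;                         /* defined by the IOMMU core */

    struct device {
            /* ... existing members, including the other IOMMU pointers ... */
            struct iommu_param      *iommu_param;   /* new: consolidated per-device IOMMU data */
            /* ... */
    };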
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Link: https://lkml.org/lkml/2017/10/6/81
Signed-off-by: Fang Lijun <fanglijun3@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6d1740b5
• iommu: introduce device fault data · 5c398a27
  Jacob Pan authored
      hulk inclusion
      category: feature
      bugzilla: 14369
      CVE: NA
      -------------------
      
Device faults detected by the IOMMU can be reported outside the IOMMU
subsystem for further processing. This patch intends to provide generic
device fault data such that device drivers can be informed of IOMMU
faults without model-specific knowledge.
      
      The proposed format is the result of discussion at:
      https://lkml.org/lkml/2017/11/10/291
      Part of the code is based on Jean-Philippe Brucker's patchset
      (https://patchwork.kernel.org/patch/9989315/).
      
The assumption is that the model-specific IOMMU driver can filter and
handle most of the internal faults if the cause is within the IOMMU
driver's control. Therefore, the fault reasons that can be reported are
grouped and generalized based on common specifications such as PCI ATS.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Fang Lijun <fanglijun3@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
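
Illustrative only, not the layout from the patch: a generic, model-independent fault record of the kind described, with reason codes generalized from specifications such as PCI ATS:

    enum iommu_fault_reason {
            IOMMU_FAULT_REASON_UNKNOWN = 0,
            IOMMU_FAULT_REASON_PASID_INVALID,       /* e.g. PASID out of range */
            IOMMU_FAULT_REASON_PERMISSION,          /* access rights violated */
    };

    struct iommu_fault_event {
            enum iommu_fault_reason reason;
            u64     addr;           /* faulting I/O virtual address */
            u32     pasid;          /* valid only if flags says so */
            u32     flags;          /* e.g. "PASID valid", "last in group" */
    };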
      5c398a27
• iommu: introduce iommu invalidate API function · 51c7a125
  Liu, Yi L authored
      hulk inclusion
      category: feature
      bugzilla: 14369
      CVE: NA
      -------------------
      
      When an SVM capable device is assigned to a guest, the first level page
      tables are owned by the guest and the guest PASID table pointer is
      linked to the device context entry of the physical IOMMU.
      
      Host IOMMU driver has no knowledge of caching structure updates unless
      the guest invalidation activities are passed down to the host. The
      primary usage is derived from emulated IOMMU in the guest, where QEMU
      can trap invalidation activities before passing them down to the
      host/physical IOMMU.
Since the invalidation data are obtained from user space and will be
written into the physical IOMMU, we must allow security checks at
various layers. Therefore, a generic invalidation data format is
proposed here; model-specific IOMMU drivers need to convert it into
their own format.
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Fang Lijun <fanglijun3@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      51c7a125
• iommu: introduce bind_pasid_table API function · d32e9e95
  Jacob Pan authored
      hulk inclusion
      category: feature
      bugzilla: 14369
      CVE: NA
      -------------------
      
      Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
      use in the guest:
      https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
      
As part of the proposed architecture, when an SVM-capable PCI
device is assigned to a guest, nested mode is turned on. The guest owns
the first-level page tables (requests with PASID), which perform
GVA->GPA translation. Second-level page tables are owned by the host
for GPA->HPA translation of requests both with and without PASID.
      
      A new IOMMU driver interface is therefore needed to perform tasks as
      follows:
      * Enable nested translation and appropriate translation type
      * Assign guest PASID table pointer (in GPA) and size to host IOMMU
      
      This patch introduces new API functions to perform bind/unbind guest PASID
      tables. Based on common data, model specific IOMMU drivers can be extended
      to perform the specific steps for binding pasid table of assigned devices.
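
A hedged sketch of the API surface the text describes (prototypes are illustrative; the exact argument types in the patch may differ):

    /* Bind the guest PASID table (GPA and size packed in the config) to the
     * device's context entry; the model-specific driver does the real work. */
    int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
                               struct pasid_table_config *cfg);

    /* Tear the binding down again, e.g. on device unassignment. */
    void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev);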
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
[Backported to 4.19
-add SPDX-License-Identifier]
Signed-off-by: Fang Lijun <fanglijun3@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d32e9e95
• KVM: fix spectrev1 gadgets · 579b95fc
  Paolo Bonzini authored
      [ Upstream commit 1d487e9bf8ba66a7174c56a0029c54b1eca8f99c ]
      
      These were found with smatch, and then generalized when applicable.
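
For context, a typical fix for such a gadget clamps an attacker-influenced index with array_index_nospec() before the dependent load; a minimal sketch (names are illustrative, not taken from the patch):

    #include <linux/nospec.h>

    /* idx comes from the guest/userspace and selects an entry in a fixed table */
    if (idx >= ARRAY_SIZE(table))
            return -EINVAL;
    idx = array_index_nospec(idx, ARRAY_SIZE(table));   /* no speculative OOB access */
    val = table[idx];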
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      579b95fc
• x86/reboot, efi: Use EFI reboot for Acer TravelMate X514-51T · 4778affe
  Jian-Hong Pan authored
      [ Upstream commit 0082517fa4bce073e7cf542633439f26538a14cc ]
      
      Upon reboot, the Acer TravelMate X514-51T laptop appears to complete the
      shutdown process, but then it hangs in BIOS POST with a black screen.
      
      The problem is intermittent - at some points it has appeared related to
      Secure Boot settings or different kernel builds, but ultimately we have
      not been able to identify the exact conditions that trigger the issue to
      come and go.
      
      Besides, the EFI mode cannot be disabled in the BIOS of this model.
      
      However, after extensive testing, we observe that using the EFI reboot
      method reliably avoids the issue in all cases.
      
      So add a boot time quirk to use EFI reboot on such systems.
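
A minimal sketch of what such a boot-time quirk looks like (table and callback names are illustrative; the quirk matches on the DMI product name):

    static int __init set_efi_reboot(const struct dmi_system_id *d)
    {
            if (reboot_type != BOOT_EFI && !efi_runtime_disabled()) {
                    reboot_type = BOOT_EFI;
                    pr_info("%s: selecting EFI reboot method\n", d->ident);
            }
            return 0;
    }

    static const struct dmi_system_id reboot_dmi_table[] __initconst = {
            {       /* Acer TravelMate X514-51T */
                    .callback = set_efi_reboot,
                    .ident = "Acer TravelMate X514-51T",
                    .matches = {
                            DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
                            DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate X514-51T"),
                    },
            },
            { }
    };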
      
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=203119
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: Daniel Drake <drake@endlessm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Cc: linux@endlessm.com
      Link: http://lkml.kernel.org/r/20190412080152.3718-1-jian-hong@endlessm.com
      [ Fix !CONFIG_EFI build failure, clarify the code and the changelog a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4778affe
• bfq: update internal depth state when queue depth changes · 3c8256e7
  Jens Axboe authored
      commit 77f1e0a52d26242b6c2dba019f6ebebfb9ff701e upstream
      
      A previous commit moved the shallow depth and BFQ depth map calculations
      to be done at init time, moving it outside of the hotter IO path. This
potentially causes hangs if the user changes the depth of the scheduler
map by writing to the 'nr_requests' sysfs file for that device.
      
      Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
      the depth changes, so that the scheduler can update its internal state.
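
A hedged sketch of the hook: the elevator ops gain a depth_updated() callback that blk-mq invokes from the nr_requests update path, so BFQ can recompute its shallow depths (body abridged):

    /* New elevator callback, called by blk-mq after the scheduler tags are
     * resized via the 'nr_requests' sysfs file. */
    static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
    {
            struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
            struct blk_mq_tags *tags = hctx->sched_tags;

            /* recompute shallow depths against the (possibly resized) sbitmap */
            bfq_update_depths(bfqd, &tags->bitmap_tags);
    }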
Signed-off-by: Eric Wheeler <bfq@linux.ewheeler.net>
Tested-by: Kai Krakow <kai@kaishome.de>
Reported-by: Paolo Valente <paolo.valente@linaro.org>
Fixes: f0635b8a ("bfq: calculate shallow depths at init time")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3c8256e7
• cpu/speculation: Add 'mitigations=' cmdline option · 9b878097
  Josh Poimboeuf authored
      commit 98af8452945c55652de68536afdde3b520fec429 upstream
      
      Keeping track of the number of mitigations for all the CPU speculation
      bugs has become overwhelming for many users.  It's getting more and more
      complicated to decide which mitigations are needed for a given
      architecture.  Complicating matters is the fact that each arch tends to
      have its own custom way to mitigate the same vulnerability.
      
      Most users fall into a few basic categories:
      
      a) they want all mitigations off;
      
      b) they want all reasonable mitigations on, with SMT enabled even if
         it's vulnerable; or
      
      c) they want all reasonable mitigations on, with SMT disabled if
         vulnerable.
      
      Define a set of curated, arch-independent options, each of which is an
      aggregation of existing options:
      
      - mitigations=off: Disable all mitigations.
      
      - mitigations=auto: [default] Enable all the default mitigations, but
        leave SMT enabled, even if it's vulnerable.
      
      - mitigations=auto,nosmt: Enable all the default mitigations, disabling
        SMT if needed by a mitigation.
      
      Currently, these options are placeholders which don't actually do
      anything.  They will be fleshed out in upcoming patches.
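
Using the last variant just means appending the option to the kernel command line; an illustrative boot line (not part of the patch):

    BOOT_IMAGE=/vmlinuz-4.19 root=/dev/sda2 ro quiet mitigations=auto,nosmt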
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86)
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux-s390@vger.kernel.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-arch@vger.kernel.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9b878097
• x86/speculation/mds: Add sysfs reporting for MDS · 6896d064
  Thomas Gleixner authored
      commit 8a4b06d391b0a42a373808979b5028f5c84d9c6a upstream
      
      Add the sysfs reporting file for MDS. It exposes the vulnerability and
      mitigation state similar to the existing files for the other speculative
      hardware vulnerabilities.
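
A hedged sketch of the reporting hook, following the pattern of the existing vulnerability files under /sys/devices/system/cpu/vulnerabilities/ (the state-formatting helper is not shown):

    /* arch/x86/kernel/cpu/bugs.c */
    ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
    {
            return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
    }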
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6896d064
• mm: Be allowed to alloc CDM node memory for MPOL_BIND · 1f3b5458
  Lijun Fang authored
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -----------------
      
CDM nodes should not be part of mems_allowed. However, allocation from
a CDM node must be allowed when mpol->mode is MPOL_BIND.
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1f3b5458
• mm: Exclude CDM marked VMAs from auto NUMA · 90cd24b6
  Anshuman Khandual authored
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      Kernel cannot track device memory accesses behind VMAs containing CDM
      memory. Hence all the VM_CDM marked VMAs should not be part of the auto
      NUMA migration scheme. This patch also adds a new function is_cdm_vma()
      to detect any VMA marked with flag VM_CDM.
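
A minimal sketch of the helper, assuming the VM_CDM flag introduced by this series (the stub for !CONFIG_COHERENT_DEVICE is implied):

    static inline bool is_cdm_vma(struct vm_area_struct *vma)
    {
            return !!(vma->vm_flags & VM_CDM);  /* VMA backed by coherent device memory */
    }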
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      90cd24b6
• mm: Tag VMA with VM_CDM flag explicitly during mbind(MPOL_BIND) and page fault · e1ddb9d2
  Anshuman Khandual authored
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      Mark all the applicable VMAs with VM_CDM explicitly during mbind(MPOL_BIND)
      call if the user provided nodemask has a CDM node.
      
      Mark the corresponding VMA with VM_CDM flag if the allocated page happens
      to be from a CDM node. This can be expensive from performance stand point.
      There are multiple checks to avoid an expensive page_to_nid lookup but it
      can be optimized further.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e1ddb9d2
• mm: Define coherent device memory (CDM) node · 4886e905
  Anshuman Khandual authored
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      There are certain devices like specialized accelerator, GPU cards, network
      cards, FPGA cards etc which might contain onboard memory which is coherent
      along with the existing system RAM while being accessed either from the CPU
      or from the device. They share some similar properties with that of normal
      system RAM but at the same time can also be different with respect to
      system RAM.
      
      User applications might be interested in using this kind of coherent device
      memory explicitly or implicitly along side the system RAM utilizing all
      possible core memory functions like anon mapping (LRU), file mapping (LRU),
      page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations
      etc. To achieve this kind of tight integration with core memory subsystem,
the device onboard coherent memory must be represented as a memory-only
NUMA node. At the same time the arch must export some kind of function
to identify this node as coherent device memory rather than any other
regular CPU-less, memory-only NUMA node.
      
      After achieving the integration with core memory subsystem coherent device
      memory might still need some special consideration inside the kernel. There
      can be a variety of coherent memory nodes with different expectations from
      the core kernel memory. But right now only one kind of special treatment is
      considered which requires certain isolation.
      
      Now consider the case of a coherent device memory node type which requires
      isolation. This kind of coherent memory is onboard an external device
      attached to the system through a link where there is always a chance of a
      link failure taking down the entire memory node with it. More over the
      memory might also have higher chance of ECC failure as compared to the
      system RAM. Hence allocation into this kind of coherent memory node should
      be regulated. Kernel allocations must not come here. Normal user space
      allocations too should not come here implicitly (without user application
      knowing about it). This summarizes isolation requirement of certain kind of
      coherent device memory node as an example. There can be different kinds of
      isolation requirement also.
      
Some coherent memory devices might not require isolation at all. There
might also be other coherent memory devices which require some other
special treatment after becoming part of the core memory
representation. For now, we only look into isolation-seeking coherent
device memory nodes, not the other ones.
      
      To implement the integration as well as isolation, the coherent memory node
      must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
      the node_states[] array. During memory hotplug operations, the new nodemask
      N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent device
      memory nodes. This also creates the following new sysfs based interface to
      list down all the coherent memory nodes of the system.
      
      	/sys/devices/system/node/is_cdm_node
      
      Architectures must export function arch_check_node_cdm() which identifies
      any coherent device memory node in case they enable CONFIG_COHERENT_DEVICE.
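
A hedged sketch of the arch hook (the nodemask used for the lookup is hypothetical; each architecture decides how it recognizes its device-memory nodes):

    #ifdef CONFIG_COHERENT_DEVICE
    int arch_check_node_cdm(int nid)
    {
            /* e.g. compare against firmware/devicetree-provided node info;
             * cdm_nodes below is a hypothetical per-arch nodemask. */
            return node_isset(nid, cdm_nodes);
    }
    #endif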
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
[Backported to 4.19
-remove set or clear node state for memory_hotplug
-separate CONFIG_COHERENT and CPUSET]
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4886e905
• linux/kernel.h: Use parentheses around argument in u64_to_user_ptr() · a74603f0
  Jann Horn authored
      [ Upstream commit a0fe2c6479aab5723239b315ef1b552673f434a3 ]
      
      Use parentheses around uses of the argument in u64_to_user_ptr() to
      ensure that the cast doesn't apply to part of the argument.
      
      There are existing uses of the macro of the form
      
        u64_to_user_ptr(A + B)
      
      which expands to
      
        (void __user *)(uintptr_t)A + B
      
      (the cast applies to the first operand of the addition, the addition
      is a pointer addition). This happens to still work as intended, the
      semantic difference doesn't cause a difference in behavior.
      
      But I want to use u64_to_user_ptr() with a ternary operator in the
      argument, like so:
      
        u64_to_user_ptr(A ? B : C)
      
      This currently doesn't work as intended.
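
With the fix, the macro parenthesizes every use of its argument, roughly:

    #define u64_to_user_ptr(x) (            \
    {                                       \
            typecheck(u64, (x));            \
            (void __user *)(uintptr_t)(x);  \
    }                                       \
    )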
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qiaowei Ren <qiaowei.ren@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190329214652.258477-1-jannh@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a74603f0
• fs/dcache.c: avoid softlock since too many negative dentry · 7ba5d5d0
  yangerkun authored
      euler inclusion
      category: bugfix
      bugzilla: 15743
      CVE: NA
      ---------------------------
      
Parallel threads add negative dentries under the root dir. Some time
later, 'systemctl daemon-reload' reports a softlockup, since
__fsnotify_update_child_dentry_flags needs to update every child of the
root dentry without distinguishing whether it is active or not. It
wastes a long time while holding the d_lock of the root dentry, and
other threads trying to spin_lock d_lock run over their time.

Limiting the number of negative dentries under a dir can avoid this.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7ba5d5d0
• block: add info when opening a write opend block device exclusively · 9a8887a9
  zhangyi (F) authored
      euler inclusion
      category: feature
      bugzilla: 14367
      CVE: NA
      ---------------------------
      
Just like opening an exclusively opened block device for write,
exclusively opening a block device which has been opened for write by
some other process may also lead to potential data corruption. This
patch records the write openers and gives a hint if that happens.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9a8887a9
• pciehp: fix a race between pciehp and removing operations by sysfs · 762c10db
  Xiongfeng Wang authored
      hulk inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      
When I ran a stress test of PCIe hotplug and removal operations through
sysfs, I got a hung task, and the following call trace was printed.
      
       INFO: task irq/746-pciehp:41551 blocked for more than 120 seconds.
             Tainted: P        W  OE     4.19.25-
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       irq/746-pciehp  D    0 41551      2 0x00000228
       Call trace:
        __switch_to+0x94/0xe8
        __schedule+0x270/0x8b0
        schedule+0x2c/0x88
        schedule_preempt_disabled+0x14/0x20
        __mutex_lock.isra.1+0x1fc/0x540
        __mutex_lock_slowpath+0x24/0x30
        mutex_lock+0x80/0xa8
        pci_lock_rescan_remove+0x20/0x28
        pciehp_configure_device+0x30/0x140
        pciehp_handle_presence_or_link_change+0x35c/0x4b0
        pciehp_ist+0x1cc/0x1d0
        irq_thread_fn+0x30/0x80
        irq_thread+0x128/0x200
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
       INFO: task bash:6424 blocked for more than 120 seconds.
             Tainted: P        W  OE     4.19.25-
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       bash            D    0  6424   2231 0x00000200
       Call trace:
        __switch_to+0x94/0xe8
        __schedule+0x270/0x8b0
        schedule+0x2c/0x88
        schedule_timeout+0x224/0x448
        wait_for_common+0x198/0x2a0
        wait_for_completion+0x28/0x38
        kthread_stop+0x60/0x190
        __free_irq+0x1c0/0x348
        free_irq+0x40/0x88
        pcie_shutdown_notification+0x54/0x80
        pciehp_remove+0x30/0x50
        pcie_port_remove_service+0x3c/0x58
        device_release_driver_internal+0x1b4/0x250
        device_release_driver+0x28/0x38
        bus_remove_device+0xd4/0x160
        device_del+0x128/0x348
        device_unregister+0x24/0x78
        remove_iter+0x48/0x58
        device_for_each_child+0x6c/0xb8
        pcie_port_device_remove+0x2c/0x48
        pcie_portdrv_remove+0x5c/0x68
        pci_device_remove+0x48/0xd8
        device_release_driver_internal+0x1b4/0x250
        device_release_driver+0x28/0x38
        pci_stop_bus_device+0x84/0xb8
        pci_stop_and_remove_bus_device_locked+0x24/0x40
        remove_store+0xa4/0xb8
        dev_attr_store+0x44/0x60
        sysfs_kf_write+0x58/0x80
        kernfs_fop_write+0xe8/0x1f0
        __vfs_write+0x60/0x190
        vfs_write+0xac/0x1c0
        ksys_write+0x6c/0xd8
        __arm64_sys_write+0x24/0x30
        el0_svc_common+0xa0/0x180
        el0_svc_handler+0x38/0x78
        el0_svc+0x8/0xc
      
When we remove a slot through sysfs,
'pci_stop_and_remove_bus_device_locked()' is called. This function
takes the global mutex 'pci_rescan_remove_lock' and removes the
slot. If the irq thread 'pciehp_ist' is still running, we wait
until it exits.

If a pciehp interrupt happens immediately after we remove the slot
through sysfs, but before we free the pciehp irq in
'pci_stop_and_remove_bus_device_locked()', 'pciehp_ist' will hang
because the global mutex 'pci_rescan_remove_lock' is held by the
sysfs operation. But the sysfs operation is waiting for the pciehp irq
thread 'pciehp_ist' to end. Then a hung task occurs.

So these two kinds of operation, removal through the attention button
and removal through /sys/devices/pci***/remove, should not be executed
at the same time. This patch adds a global variable to mark that one of
these operations is in progress. When this variable is set, if another
operation is requested, it is rejected.
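
A hedged sketch of the mutual exclusion (variable name and return value are illustrative):

    /* Set while either the attention-button path or the sysfs "remove" path
     * is tearing a slot down; the competing request is rejected. */
    static atomic_t pciehp_removal_busy = ATOMIC_INIT(0);

    if (atomic_cmpxchg(&pciehp_removal_busy, 0, 1) != 0)
            return -EBUSY;          /* the other removal path is already running */
    /* ... perform the hotplug/removal work ... */
    atomic_set(&pciehp_removal_busy, 0);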
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      762c10db
• Revert Kernel Warpdrive part:SPIMDEV, which depends on VFIO/VFIO_MDEV. · 9e68018c
  lingmingqiang authored
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      This reverts commit 6b546eb4c02210d7bc77aa64025cc6c84e5f1b30.
      
      Feature or Bugfix: Bugfix
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com>
Reviewed-by: hucheng.hu <hucheng.hu@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9e68018c
• fs/dcache.c: avoid panic while lockref of dentry overflow · 510fef4d
  yangerkun authored
      euler inclusion
      category: bugfix
      bugzilla: 14351
      CVE: NA
      ---------------------------
      
We use lockref for the dentry reference count without noticing that too
many negative dentries under one dir can overflow the lockref. This can
crash the system if it happens under the root dir.

Since there is no perfect solution, we just limit the max dentry count
to INT_MAX / 2. Also, it takes a long time to go from INT_MAX / 2 to
INT_MAX, so we do not need to do this check under the protection of the
dentry lock.

Also, we limit FILES_MAX to INT_MAX / 2, since many opens of the same
file can lead to overflow too.
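
A minimal sketch of the guard described (the macro and helper names follow the changelog's description; the real patch may differ):

    #define D_COUNT_MAX     (INT_MAX / 2)

    /* Refuse to take another reference once the count is halfway to overflow;
     * checked without d_lock since the remaining headroom is huge. */
    static inline bool d_count_may_overflow(const struct dentry *dentry)
    {
            return d_count(dentry) >= D_COUNT_MAX;
    }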
      
      Changelog:
      v1->v2: add a function to do check / add a Macro to mean INT_MAX / 2
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      510fef4d
• clk: x86: Add system specific quirk to mark clocks as critical · a1b70161
  David Müller authored
      commit 7c2e07130090ae001a97a6b65597830d6815e93e upstream.
      
      Since commit 648e9218 ("clk: x86: Stop marking clocks as
      CLK_IS_CRITICAL"), the pmc_plt_clocks of the Bay Trail SoC are
      unconditionally gated off. Unfortunately this will break systems where these
      clocks are used for external purposes beyond the kernel's knowledge. Fix it
      by implementing a system specific quirk to mark the necessary pmc_plt_clks as
      critical.
      
      Fixes: 648e9218 ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL")
Signed-off-by: David Müller <dave.mueller@gmx.ch>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a1b70161
• fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock · e2e782c9
  Kirill Smelkov authored
      
      [ Upstream commit 10dce8af34226d90fa56746a934f8da5dcdba3df ]
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
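
As a concrete illustration of steps 2 and 3, a driver whose read and write ignore *ppos would switch its opener like this (driver name hypothetical):

    static int foo_open(struct inode *inode, struct file *file)
    {
            /* was: return nonseekable_open(inode, file);
             * stream_open() additionally clears FMODE_ATOMIC_POS, so read and
             * write no longer serialize on the f_pos lock. */
            return stream_open(inode, file);
    }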
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e2e782c9
• USB: core: Fix bug caused by duplicate interface PM usage counter · 947050f1
  Alan Stern authored
      commit c2b71462d294cf517a0bc6e4fd6424d7cee5596f upstream.
      
      The syzkaller fuzzer reported a bug in the USB hub driver which turned
      out to be caused by a negative runtime-PM usage counter.  This allowed
      a hub to be runtime suspended at a time when the driver did not expect
      it.  The symptom is a WARNING issued because the hub's status URB is
      submitted while it is already active:
      
      	URB 0000000031fb463e submitted while active
      	WARNING: CPU: 0 PID: 2917 at drivers/usb/core/urb.c:363
      
      The negative runtime-PM usage count was caused by an unfortunate
      design decision made when runtime PM was first implemented for USB.
      At that time, USB class drivers were allowed to unbind from their
      interfaces without balancing the usage counter (i.e., leaving it with
      a positive count).  The core code would take care of setting the
      counter back to 0 before allowing another driver to bind to the
      interface.
      
      Later on when runtime PM was implemented for the entire kernel, the
      opposite decision was made: Drivers were required to balance their
      runtime-PM get and put calls.  In order to maintain backward
      compatibility, however, the USB subsystem adapted to the new
      implementation by keeping an independent usage counter for each
      interface and using it to automatically adjust the normal usage
      counter back to 0 whenever a driver was unbound.
      
      This approach involves duplicating information, but what is worse, it
      doesn't work properly in cases where a USB class driver delays
      decrementing the usage counter until after the driver's disconnect()
      routine has returned and the counter has been adjusted back to 0.
      Doing so would cause the usage counter to become negative.  There's
      even a warning about this in the USB power management documentation!
      
      As it happens, this is exactly what the hub driver does.  The
      kick_hub_wq() routine increments the runtime-PM usage counter, and the
      corresponding decrement is carried out by hub_event() in the context
      of the hub_wq work-queue thread.  This work routine may sometimes run
      after the driver has been unbound from its interface, and when it does
      it causes the usage counter to go negative.
      
      It is not possible for hub_disconnect() to wait for a pending
      hub_event() call to finish, because hub_disconnect() is called with
      the device lock held and hub_event() acquires that lock.  The only
      feasible fix is to reverse the original design decision: remove the
      duplicate interface-specific usage counter and require USB drivers to
      balance their runtime PM gets and puts.  As far as I know, all
      existing drivers currently do this.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-and-tested-by: syzbot+7634edaea4d0b341c625@syzkaller.appspotmail.com
CC: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      947050f1
• i2c: Allow recovery of the initial IRQ by an I2C client device. · 420637ba
  Jim Broadus authored
      commit 93b6604c5a669d84e45fe5129294875bf82eb1ff upstream.
      
      A previous change allowed I2C client devices to discover new IRQs upon
      reprobe by clearing the IRQ in i2c_device_remove. However, if an IRQ was
      assigned in i2c_new_device, that information is lost.
      
      For example, the touchscreen and trackpad devices on a Dell Inspiron laptop
      are I2C devices whose IRQs are defined by ACPI extended IRQ types. The
      client device structures are initialized during an ACPI walk. After
      removing the i2c_hid device, modprobe fails.
      
      This change caches the initial IRQ value in i2c_new_device and then resets
      the client device IRQ to the initial value in i2c_device_remove.
      
      Fixes: 6f108dd70d30 ("i2c: Clear client->irq in i2c_device_remove")
Signed-off-by: Jim Broadus <jbroadus@gmail.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
[wsa: this is an easy to backport fix for the regression. We will
refactor the code to handle irq assignments better in general.]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      420637ba
• merge the driver code to hulk branch and format rectificaiton · 299b4bc2
  lingmingqiang authored
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      Feature or Bugfix: Bugfix
      
      1. [42feaced] HPRE Crypto warning clean
      Warning -- Suspicious Truncation in arithmetic expression combining
      with pointer
      
      2. [4b3f837d] crypto/zip: Fix to get queue from possible zip functions
      Original commit message:
The current code just gets a queue from the closest function and
returns failure if the closest function has no available queue. In this
patch, we first sort all available functions, then get a queue from the
sorted functions one by one if a closer function has no available queue.
      
      3. [7250b1a9] crypto/qm: Export function to get free qp number for acc
      
      4. [86eeda2b] crypto/hisilicon/qm: Fix static check warning
      Reduce loop complexity of qm_qp_ctx_cfg function.
      
      5. [f1c558c0] Fix static check warning
      
      6. [dfdfef8f] crypto/hisilicon/qm: Fix QM task timeout bug
      There is a domain segment in eqc/aeqc should be assignd value
      in D06 ES, Which is reserved in D06 CS.
      
7. [4bf721fe] Bugfix for two kill signals received by user processes
When a process receives two kill signals, file->ops->flush will be
called twice, so uacce->ops->flush will be called twice too. Currently,
flush cannot be called again on the same uacce_queue file, or a core
dump will show. So a status field is added to the uacce queue; while
flush and release operations are running, the queue status is checked
atomically. If the queue is already being released, do nothing.
      
      8. [20bd4257] uacce/dummy: Fix dummy compile problem
      Original commit message:
As we move flags, api_ver and gf_pg_start from uacce_ops to uacce, also
fix the dummy driver to work together with the current uacce.
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com>
      
      Changes to be committed:
      	modified:   drivers/crypto/hisilicon/Kconfig
      	modified:   drivers/crypto/hisilicon/hpre/hpre_crypto.c
      	modified:   drivers/crypto/hisilicon/qm.c
      	modified:   drivers/crypto/hisilicon/qm.h
      	modified:   drivers/crypto/hisilicon/zip/zip_main.c
      	modified:   drivers/uacce/Kconfig
      	modified:   drivers/uacce/dummy_drv/dummy_wd_dev.c
      	modified:   drivers/uacce/dummy_drv/dummy_wd_v2.c
      	modified:   drivers/uacce/uacce.c
      	modified:   include/linux/uacce.h
Reviewed-by: hucheng.hu <hucheng.hu@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      299b4bc2
• cgroup/files: use task_get_css() to get a valid css during dup_fd() · a8453b17
  Hou Tao authored
      euler inclusion
      category: bugfix
      bugzilla: 14007
      CVE: NA
      -------------------------------------------------
      
      Process fork and cgroup migration can happen simultaneously, and
      in the following case use-after-free of css_set is possible:
      
      CPU 0: process fork    CPU 1: cgroup migration
      
      dup_fd                 __cgroup1_procs_write(threadgroup=false)
        files_cgroup_assign
          // task A
          task_lock
          task_cgroup(current, files_cgrp_id)
            css_set = task_css_set_check()
      
       			 cgroup_migrate_execute
        			   files_cgroup_can_attach
      			   css_set_move_task
      			     put_css_set_locked()
        			   files_cgroup_attach
      			     // task B which is in the same
      			     // thread group as task A
      			     task_lock
      			 cgroup_migrate_finish
      			   // the css_set will be freed
      			   put_css_set_locked()
      
            // use-after-free
            css_set->subsys[files_cgrp_id]
      
      Fix it by using task_get_css() instead to get a valid css.
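
A hedged sketch of the fix on the fork side: pin the css with task_get_css() instead of dereferencing the (possibly already freed) css_set entry:

    struct cgroup_subsys_state *css;

    /* task_get_css() retries until it holds a reference on a css that is
     * still attached to the task, so a concurrent migration cannot free
     * it underneath us. */
    css = task_get_css(current, files_cgrp_id);
    /* ... charge/assign the duplicated files to css ... */
    css_put(css);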
      
      Fixes: 52cc1eccf6de ("cgroups: Resource controller for open files")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: luojiajun <luojiajun3@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a8453b17
• cgroup: undo unnecessary updates to struct cgroup_subsys & cftype · d72c2468
  Hou Tao authored
      euler inclusion
      category: bugfix
      bugzilla: 14007
      CVE: NA
      
      -------------------------------------------------
      
      These updates are leftovers from CentOS 7.x, and are not needed on
      hulk-4.19, so kill them.
      
      Fixes: 52cc1eccf6de ("cgroups: Resource controller for open files")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: luojiajun <luojiajun3@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d72c2468
• ptrace: take into account saved_sigmask in PTRACE{GET, SET}SIGMASK · dd19bcdb
  Andrei Vagin authored
      [ Upstream commit fcfc2aa0185f4a731d05a21e9f359968fdfd02e7 ]
      
There are a few system calls (pselect, ppoll, etc.) which replace a
task's sigmask while they are running in kernel space.
      
      When a task calls one of these syscalls, the kernel saves a current
      sigmask in task->saved_sigmask and sets a syscall sigmask.
      
      On syscall-exit-stop, ptrace traps a task before restoring the
      saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and
      PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by
      saved_sigmask, when the task returns to user-space.
      
      This patch fixes this problem.  PTRACE_GETSIGMASK returns saved_sigmask
      if it's set.  PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag.
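
A hedged sketch of the PTRACE_GETSIGMASK half (the flag test is illustrative; architectures track the restore-sigmask state differently):

    sigset_t *mask;

    if (test_tsk_restore_sigmask(child))    /* tracee is inside pselect/ppoll etc. */
            mask = &child->saved_sigmask;   /* report the mask it will get back */
    else
            mask = &child->blocked;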
      
      Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com
      Fixes: 29000cae ("ptrace: add ability to get/set signal-blocked mask")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      dd19bcdb
• tracing: Fix buffer_ref pipe ops · d82ba528
  Jann Horn authored
      commit b987222654f84f7b4ca95b3a55eca784cb30235b upstream.
      
      This fixes multiple issues in buffer_pipe_buf_ops:
      
       - The ->steal() handler must not return zero unless the pipe buffer has
         the only reference to the page. But generic_pipe_buf_steal() assumes
         that every reference to the pipe is tracked by the page's refcount,
         which isn't true for these buffers - buffer_pipe_buf_get(), which
         duplicates a buffer, doesn't touch the page's refcount.
         Fix it by using generic_pipe_buf_nosteal(), which refuses every
         attempted theft. It should be easy to actually support ->steal, but the
         only current users of pipe_buf_steal() are the virtio console and FUSE,
         and they also only use it as an optimization. So it's probably not worth
         the effort.
       - The ->get() and ->release() handlers can be invoked concurrently on pipe
         buffers backed by the same struct buffer_ref. Make them safe against
         concurrency by using refcount_t.
       - The pointers stored in ->private were only zeroed out when the last
         reference to the buffer_ref was dropped. As far as I know, this
         shouldn't be necessary anyway, but if we do it, let's always do it.
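
A minimal sketch of the second point: the buffer_ref reference count becomes a refcount_t so concurrent ->get()/->release() calls are safe:

    struct buffer_ref {
            struct ring_buffer      *buffer;
            void                    *page;
            int                     cpu;
            refcount_t              refcount;       /* was a plain int */
    };

    /* ->get():     refcount_inc(&ref->refcount);
     * ->release(): if (refcount_dec_and_test(&ref->refcount)) free the page */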
      
      Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      Fixes: 73a757e6 ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      
      Conflicts:
        kernel/trace/trace.c
        include/linux/pipe_fs_i.h
      [yyl: adjust context]
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d82ba528
• x86/kprobes: Verify stack frame on kretprobe · dd26d152
  Masami Hiramatsu authored
      commit 3ff9c075cc767b3060bdac12da72fc94dd7da1b8 upstream.
      
      Verify the stack frame pointer on kretprobe trampoline handler,
      If the stack frame pointer does not match, it skips the wrong
      entry and tries to find correct one.
      
      This can happen if user puts the kretprobe on the function
      which can be used in the path of ftrace user-function call.
      Such functions should not be probed, so this adds a warning
      message that reports which function should be blacklisted.
Tested-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/155094059185.6137.15527904013362842072.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      dd26d152
• failover: allow name change on IFF_UP slave interfaces · 29b6215a
  Si-Wei Liu authored
      [ Upstream commit 8065a779 ]
      
      When a netdev appears through hot plug then gets enslaved by a failover
      master that is already up and running, the slave will be opened
      right away after getting enslaved. Today there's a race that userspace
      (udev) may fail to rename the slave if the kernel (net_failover)
      opens the slave earlier than when the userspace rename happens.
      Unlike bond or team, the primary slave of failover can't be renamed by
      userspace ahead of time, since the kernel initiated auto-enslavement is
      unable to, or rather, is never meant to be synchronized with the rename
      request from userspace.
      
       The failover slave interfaces are not designed to be operated
       directly by userspace apps: IP configuration, filter rules for
       passing network traffic, etc., should all be done on the master
       interface. In general, userspace apps only care about the
       name of the master interface, while slave names are less important
       as long as admin users can see reliable names that may carry
       other information describing the netdev. For example, they can infer
       that "ens3nsby" is a standby slave of "ens3", while for a
       name like "eth0" they can't tell which master it belongs to.
      
      Historically the name of IFF_UP interface can't be changed because
      there might be admin script or management software that is already
      relying on such behavior and assumes that the slave name can't be
      changed once UP. But failover is special: with the in-kernel
      auto-enslavement mechanism, the userspace expectation for device
      enumeration and bring-up order is already broken. Previously initramfs
      and various userspace config tools were modified to bypass failover
      slaves because of auto-enslavement and duplicate MAC address. Similarly,
      in case that users care about seeing reliable slave name, the new type
      of failover slaves needs to be taken care of specifically in userspace
      anyway.
      
       It's less risky to lift the rename restriction on a failover slave
       which is already UP. Although this change may potentially
       break userspace components (most likely configuration scripts or
       management software) that assume the slave name can't be changed
       while UP, this is a relatively limited and controllable set among all
       userspace components, which can be fixed specifically to listen for
       the rename events on failover slaves. Userspace components interacting
       with slaves are expected to be changed to operate on the failover
       master interface instead, as the failover slave is dynamic in nature
       and may come and go at any point. The goal is to make the role of
       failover slaves less relevant, and userspace components should only
       deal with the failover master in the long run.
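       
       The shape of the change in dev_change_name() is roughly the following
       (sketch; IFF_LIVE_RENAME_OK is the new priv_flag the failover slave
       sets, details may differ from the final diff):
       
         /* before this change, any IFF_UP device refused a rename */
         if (dev->flags & IFF_UP &&
             likely(!(dev->priv_flags & IFF_LIVE_RENAME_OK)))
                 return -EBUSY;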
      
      Fixes: 30c8bd5a ("net: Introduce generic failover module")
      Signed-off-by: NSi-Wei Liu <si-wei.liu@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Acked-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      29b6215a
    • J
      net: phy: marvell: add new default led configure for m88e151x · 1e8a0462
       Jian Shen committed
      mainline-next inclusion
      from mainline-5.1-rc5
      commit a93f7fe134543649cf2e2d8fc2c50a8f4d742915
      category: bugfix
      bugzilla: NA
      CVE: NA
      -------------------------------------------------
      
       The default m88e151x LED configuration is 0x1177, which uses LED[0]
       for the 1000M link, LED[1] for the 100M link, and LED[2] for activity.
       But some boards, which use LED[0] for link and LED[1] for activity,
       prefer 0x1040. To be compatible with this case, this patch defines a
       new dev_flag and sets it before connecting the PHY in the HNS3
       driver. When the PHY initializes, the new LED configuration is used
       if this dev_flag is set.
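       
       A minimal sketch of the PHY side (illustrative; the flag and register
       macro names are approximations of the upstream patch):
       
         /* marvell.c, during LED initialization */
         if (phydev->dev_flags & MARVELL_PHY_LED0_LINK_LED1_ACTIVE)
                 def_config = MII_88E1510_PHY_LED0_LINK_LED1_ACTIVE; /* 0x1040 */
         else
                 def_config = MII_88E1510_PHY_LED_DEF;               /* 0x1177 */
         err = phy_write_paged(phydev, MII_MARVELL_LED_PAGE,
                               MII_PHY_LED_CTRL, def_config);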
      Signed-off-by: NJian Shen <shenjian15@huawei.com>
      Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
       Reviewed-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      1e8a0462
    • M
      fs: prevent page refcount overflow in pipe_buf_get · 49ecd39f
       Matthew Wilcox committed
      mainline inclusion
      from mainline-5.1-rc5
      commit 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb
      category: 13690
      bugzilla: NA
      CVE: CVE-2019-11487
      
      There are four commits to fix this CVE:
        fs: prevent page refcount overflow in pipe_buf_get
        mm: prevent get_user_pages() from overflowing page refcount
        mm: add 'try_get_page()' helper function
        mm: make page ref count overflow check tighter and more explicit
      
      -------------------------------------------------
      
      Change pipe_buf_get() to return a bool indicating whether it succeeded
      in raising the refcount of the page (if the thing in the pipe is a page).
      This removes another mechanism for overflowing the page refcount.  All
      callers converted to handle a failure.
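       
       The new calling convention looks roughly like this (sketch, not the
       full diff):
       
         static inline __must_check bool pipe_buf_get(struct pipe_inode_info *pipe,
                                                      struct pipe_buffer *buf)
         {
                 return buf->ops->get(pipe, buf);   /* ->get now returns bool */
         }
       
         /* page-backed buffers use the non-overflowing helper */
         bool generic_pipe_buf_get(struct pipe_inode_info *pipe,
                                   struct pipe_buffer *buf)
         {
                 return try_get_page(buf->page);
         }
       
         /* callers must now check the result, e.g. in the splice/tee paths: */
         if (!pipe_buf_get(pipe, buf))
                 goto out;       /* refcount saturated, fail the operation */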
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Nzhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      49ecd39f
    • L
      mm: make page ref count overflow check tighter and more explicit · 5026c0ae
       Linus Torvalds committed
      mainline inclusion
      from mainline-5.1-rc5
      commit f958d7b528b1b40c44cfda5eabe2d82760d868c3
      category: 13690
      bugzilla: NA
      CVE: CVE-2019-11487
      
      There are four commits to fix this CVE:
        fs: prevent page refcount overflow in pipe_buf_get
        mm: prevent get_user_pages() from overflowing page refcount
        mm: add 'try_get_page()' helper function
        mm: make page ref count overflow check tighter and more explicit
      
      -------------------------------------------------
      
      We have a VM_BUG_ON() to check that the page reference count doesn't
      underflow (or get close to overflow) by checking the sign of the count.
      
      That's all fine, but we actually want to allow people to use a "get page
      ref unless it's already very high" helper function, and we want that one
      to use the sign of the page ref (without triggering this VM_BUG_ON).
      
      Change the VM_BUG_ON to only check for small underflows (or _very_ close
      to overflowing), and ignore overflows which have strayed into negative
      territory.
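       
       The resulting check is roughly the following (sketch of the upstream
       change; treat the exact constant and macro name as illustrative):
       
         #define page_ref_zero_or_close_to_overflow(page) \
                 ((unsigned int) page_ref_count(page) + 127u <= 127u)
       
         static inline void get_page(struct page *page)
         {
                 page = compound_head(page);
                 /* getting a page requires an already elevated _refcount */
                 VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
                 page_ref_inc(page);
         }
       
       The unsigned trick makes the VM_BUG_ON fire only when the count is 0
       or within 127 below 0 (a real underflow), while a count that an
       overflow has pushed far into negative territory no longer trips it.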
      Acked-by: NMatthew Wilcox <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Nzhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5026c0ae
    • Y
      dm mpath: fix missing call of path selector type->end_io · bfa273a7
       Yufen Yu committed
      euler inclusion
      category: bugfix
      bugzilla: 13971
      CVE: NA
      
      -------------------------------------------------
      
       After commit 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via
       blk_insert_cloned_request feedback"), map_request() will requeue the tio
       when the issued clone request returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
       
       Thus, if the device driver status is error, a tio may be requeued multiple
       times until the return value is not DM_MAPIO_REQUEUE. That means type->start_io
       may be called multiple times, while type->end_io is only called when the IO
       completes.
       
       In fact, even without the commit, a setup_clone() error can also make the
       tio requeue and miss the call to type->end_io.
       
       Take the service-time path selector as an example: it selects a path based
       on in_flight_size, which is increased by st_start_io() and decreased by
       st_end_io(). A missing call of end_io can lead to an in_flight_size error
       and let the selector make the wrong choice. In addition, the queue-length
       path selector will also be affected.
       
       To fix the problem, we call type->end_io in ->release_clone_rq before
       the tio is requeued. We pass map_info to ->release_clone_rq() for the
       requeue path, and pass NULL for the other paths.
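       
       A sketch of the resulting requeue-side cleanup (illustrative; based on
       the equivalent mainline fix, helper names may differ in this tree):
       
         static void multipath_release_clone(struct request *clone,
                                             union map_info *map_context)
         {
                 if (unlikely(map_context)) {
                         /* requeue path: balance the earlier ->start_io() */
                         struct dm_mpath_io *mpio = get_mpio(map_context);
                         struct pgpath *pgpath = mpio->pgpath;
       
                         if (pgpath && pgpath->pg->ps.type->end_io)
                                 pgpath->pg->ps.type->end_io(&pgpath->pg->ps,
                                                             &pgpath->path,
                                                             mpio->nr_bytes);
                 }
                 blk_put_request(clone);
         }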
      
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Reviewed-by: NMiao Xie <miaoxie@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      bfa273a7
    • J
      iommu/iova: Separate atomic variables to improve performance · d1473d2a
       Jinyu Qi committed
      mainline inclusion
      from linux-next
      commit: 14bd9a607f9082e7b5690c27e69072f2aeae0de4
      category: feature
      feature: IOMMU performance
      bugzilla: NA
      CVE: NA
      
      --------------------------------------------------
      
       In struct iova_domain, there are three atomic variables: the former two
       are TLB flush counters which use the atomic_add operation, and the other
       is used for the flush timer, which uses a cmpxchg operation.
       These variables are in the same cache line, so they cause some
       performance loss when many cores call the queue_iova
       function. Let's isolate the two types of atomic variables into different
       cache lines to reduce cache line conflicts.
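       
       Illustration of the intended layout (the actual patch achieves the
       separation by reordering fields in struct iova_domain; the explicit
       alignment annotations below are only to show the idea):
       
         struct iova_domain {
                 ...
                 /* bumped with atomic64_add() on every queued flush */
                 atomic64_t  fq_flush_start_cnt  ____cacheline_aligned_in_smp;
                 atomic64_t  fq_flush_finish_cnt;
       
                 /* toggled with atomic_cmpxchg() when (re)arming the timer */
                 atomic_t    fq_timer_on         ____cacheline_aligned_in_smp;
                 ...
         };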
      
      Cc: Joerg Roedel <joro@8bytes.org>
      Signed-off-by: NJinyu Qi <jinyuqi@huawei.com>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d1473d2a
    • G
      iommu/iova: Optimise attempts to allocate iova from 32bit address range · 42a55ea3
       Ganapatrao Kulkarni committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: bee60e94a1e20ec0b8ffdafae270731d8fda4551
      category: feature
      feature: IOMMU performance
      bugzilla: NA
      CVE: NA
      
      --------------------------------------------------
      
       As an optimisation for PCI devices, a first attempt is always made
       to allocate the iova from the SAC address range. This leads
       to unnecessary attempts when there are no free ranges
       available. Add a fix to track the recently failed iova address size and
       allow further attempts only if the requested size is smaller than the
       failed size. The size is updated whenever any replenishment happens.
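       
       The tracking works roughly like this (sketch; the field name follows
       the upstream patch, surrounding code is simplified):
       
         /* in __alloc_and_insert_iova_range() */
         if (limit_pfn <= iovad->dma_32bit_pfn &&
             size >= iovad->max32_alloc_size)
                 goto iova32_full;               /* don't even try again */
         ...
         iova32_full:
                 iovad->max32_alloc_size = size; /* remember the failed size */
       
         /* when an iova below dma_32bit_pfn is freed, the limit is reset so
          * that allocations of that size can be attempted again */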
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NGanapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      42a55ea3
    • A
      coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping · bf98085f
       Andrea Arcangeli committed
      mainline inclusion
      from mainline-5.1-rc6
      commit 04f5866e41fb70690e28397487d8bd8eea7d712a
      category: 13690
      bugzilla: NA
      CVE: CVE-2019-3892
      
      -------------------------------------------------
      
      The core dumping code has always run without holding the mmap_sem for
       writing, despite that being the only way to ensure that the entire vma
      layout will not change from under it.  Only using some signal
      serialization on the processes belonging to the mm is not nearly enough.
      This was pointed out earlier.  For example in Hugh's post from Jul 2017:
      
        https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
      
        "Not strictly relevant here, but a related note: I was very surprised
         to discover, only quite recently, how handle_mm_fault() may be called
         without down_read(mmap_sem) - when core dumping. That seems a
         misguided optimization to me, which would also be nice to correct"
      
       In particular, because growsdown and growsup can move the
       vm_start/vm_end, the various loops the core dump does around the vma
       will not be consistent if page faults can happen concurrently.
      
      Pretty much all users calling mmget_not_zero()/get_task_mm() and then
      taking the mmap_sem had the potential to introduce unexpected side
      effects in the core dumping code.
      
      Adding mmap_sem for writing around the ->core_dump invocation is a
      viable long term fix, but it requires removing all copy user and page
      faults and to replace them with get_dump_page() for all binary formats
      which is not suitable as a short term fix.
      
      For the time being this solution manually covers the places that can
      confuse the core dump either by altering the vma layout or the vma flags
      while it runs.  Once ->core_dump runs under mmap_sem for writing the
      function mmget_still_valid() can be dropped.
      
      Allowing mmap_sem protected sections to run in parallel with the
      coredump provides some minor parallelism advantage to the swapoff code
      (which seems to be safe enough by never mangling any vma field and can
      keep doing swapins in parallel to the core dumping) and to some other
      corner case.
      
      In order to facilitate the backporting I added "Fixes: 86039bd3"
      however the side effect of this same race condition in /proc/pid/mem
      should be reproducible since before 2.6.12-rc2 so I couldn't add any
      other "Fixes:" because there's no hash beyond the git genesis commit.
      
      Because find_extend_vma() is the only location outside of the process
      context that could modify the "mm" structures under mmap_sem for
      reading, by adding the mmget_still_valid() check to it, all other cases
      that take the mmap_sem for reading don't need the new check after
      mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
      context also doesn't need the new check, because all tasks under core
      dumping are frozen.
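       
       The helper itself is tiny (sketch of the check and the intended usage
       pattern; the core dump attaches mm->core_state before it starts):
       
         static inline bool mmget_still_valid(struct mm_struct *mm)
         {
                 return likely(!mm->core_state);
         }
       
         /* after mmget_not_zero()/get_task_mm(), before touching vmas: */
         down_write(&mm->mmap_sem);
         if (mmget_still_valid(mm)) {
                 /* safe to alter the vma layout / vma flags here */
         }
         up_write(&mm->mmap_sem);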
      
      Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
      Fixes: 86039bd3 ("userfaultfd: add new syscall to provide memory externalization")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NJann Horn <jannh@google.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NJann Horn <jannh@google.com>
      Acked-by: NJason Gunthorpe <jgg@mellanox.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
        drivers/infiniband/core/uverbs_main.c
      [yyl: adjust context]
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Nzhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      bf98085f
    • M
      arm64: CI Code scanning warning clean · 33cef3a3
       Mingqiang Ling committed
      driver inclusion
      category: bugfix
      bugzilla: 13683
      CVE: NA
      
      -------------------------------------------------
      
       Clean up warnings reported by CI code scanning.
       
       Feature or Bugfix: Bugfix
      Signed-off-by: Nxuzaibo <xuzaibo@huawei.com>
      Reviewed-by: Nwangzhou <wangzhou1@hisilicon.com>
      Signed-off-by: NMingqiang Ling <lingmingqiang@huawei.com>
      Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      33cef3a3
    • T
      arm64: Fix static check warning for qm/zip · ac6bd719
       tanshukun committed
      driver inclusion
      category: bugfix
      bugzilla: 13683
      CVE: NA
      
      -------------------------------------------------
      
       Fix static check warnings for the qm/zip modules.
       
       Feature or Bugfix: Bugfix
      Signed-off-by: Ntanshukun (A) <tanshukun1@huawei.com>
      Reviewed-by: Nwangzhou <wangzhou1@hisilicon.com>
      Signed-off-by: NMingqiang Ling <lingmingqiang@huawei.com>
      Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      ac6bd719
    • M
      arm64: uacce: Remove some items in uacce_ops to uacce · b9a8af79
       Mingqiang Ling committed
      driver inclusion
      category: bugfix
      bugzilla: 13683
      CVE: NA
      
      -------------------------------------------------
      
       Original commit message:
      
       This patch moves api_ver, flags, and qf_pg_start from uacce_ops to uacce,
       and deletes owner from uacce_ops as we already have owner in cdev.
      
      These items indeed belong to uacce.
      Signed-off-by: NZhou Wang <wangzhou1@hisilicon.com>
      Reviewed-by: Nxuzaibo <xuzaibo@huawei.com>
      Reviewed-by: Nfanghao <fanghao11@huawei.com>
      Signed-off-by: NMingqiang Ling <lingmingqiang@huawei.com>
      Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      b9a8af79