- 27 December 2019: 40 commits
-
-
Submitted by Daniel Jordan
hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

With ktask helper threads running at MAX_NICE, it's possible for one or
more of them to begin chunks of the task and then have their CPU time
constrained by higher priority threads. The main ktask thread, running at
normal priority, may finish all available chunks of the task and then wait
on the MAX_NICE helpers to finish the last in-progress chunks, for longer
than it would have if no helpers were used.

Avoid this by having the main thread assign its priority to each unfinished
helper one at a time so that on a heavily loaded system, exactly one thread
in a given ktask call is running at the main thread's priority. At least
one thread to ensure forward progress, and at most one thread to limit
excessive multithreading.

Since the workqueue interface, on which ktask is built, does not provide
access to worker threads, ktask can't adjust their priorities directly, so
add a new interface to allow a previously-queued work item to run at a
different priority than the one controlled by the corresponding workqueue's
'nice' attribute. The worker assigned to the work item will run the work at
the given priority, temporarily overriding the worker's priority.

The interface is flush_work_at_nice, which ensures the given work item's
assigned worker runs at the specified nice level and waits for the work
item to finish.

An alternative choice would have been to simply requeue the work item to a
pool with workers of the new priority, but this doesn't seem feasible
because a worker may have already started executing the work and there's
currently no way to interrupt it midway through. The proposed interface
solves this issue because a worker's priority can be adjusted while it's
executing the work.

TODO: flush_work_at_nice is a proof-of-concept only, and it may be desired
to have the interface set the work's nice without also waiting for it to
finish. It's implemented in the flush path for this RFC because it was
fairly simple to write ;-)

I ran tests similar to the ones in the last patch with a couple of
differences:

- The non-ktask workload uses 8 CPUs instead of 7 to compete with the main
  ktask thread as well as the ktask helpers, so that when the main thread
  finishes, its CPU is completely occupied by the non-ktask workload,
  meaning MAX_NICE helpers can't run as often.
- The non-ktask workload starts before the ktask workload, rather than
  after, to maximize the chance that it starves helpers.

Runtimes in seconds.

Case 1: Synthetic, worst-case CPU contention

  ktask_test - a tight loop doing integer multiplication to max out on CPU;
               used for testing only, does not appear in this series
  stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200")

              8_ktask_thrs          8_ktask_thrs
              w/o_renice (stdev)    with_renice (stdev)    1_ktask_thr (stdev)
              ------------------------------------------------------------
  ktask_test  41.98 ( 0.22)         25.15 ( 2.98)          30.40 ( 0.61)
  stress-ng   44.79 ( 1.11)         46.37 ( 0.69)          53.29 ( 1.91)

Without renicing, ktask_test finishes just after stress-ng does because
stress-ng needs to free up CPUs for the helpers to finish (ktask_test shows
a shorter runtime than stress-ng because ktask_test was started later).
Renicing lets ktask_test finish 40% sooner, and running the same amount of
work in ktask_test with 1 thread instead of 8 finishes in a comparable
amount of time, though longer than "with_renice" because MAX_NICE threads
still get some CPU time, and the effect over 8 threads adds up.

stress-ng's total runtime gets a little longer going from no renicing to
renicing, as expected, because each reniced ktask thread takes more CPU
time than before when the helpers were starved.

Running with one ktask thread, stress-ng's reported walltime goes up
because that single thread interferes with fewer stress-ng threads, but
with more impact, causing a greater spread in the time it takes for
individual stress-ng threads to finish. Averages of the per-thread
stress-ng times from "with_renice" to "1_ktask_thr" come out roughly the
same, though, 43.81 and 43.89 respectively. So the total runtime of
stress-ng across all threads is unaffected, but the time stress-ng takes to
finish running its threads completely actually improves by spreading the
ktask_test work over more threads.

Case 2: Real-world CPU contention

  ktask_vfio - VFIO page pin a 32G kvm guest
  usemem     - faults in 86G of anonymous THP per thread, PAGE_SIZE stride;
               used to mimic the page clearing that dominates in ktask_vfio
               so that usemem competes for the same system resources

              8_ktask_thrs          8_ktask_thrs
              w/o_renice (stdev)    with_renice (stdev)    1_ktask_thr (stdev)
              --------------------------------------------------------------
  ktask_vfio  18.59 ( 0.19)         14.62 ( 2.03)          16.24 ( 0.90)
  usemem      47.54 ( 0.89)         48.18 ( 0.77)          49.70 ( 1.20)

These results are similar to case 1's, though the differences between times
are not quite as pronounced because ktask_vfio ran shorter compared to
usemem.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
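A minimal sketch of how a caller might use the flush_work_at_nice() interface described above. The prototype is assumed from the description (work item plus nice level, waits for completion); it is an RFC interface, not an existing mainline API, and the structure and function names here are illustrative.

```c
#include <linux/workqueue.h>

/* assumed prototype of the proposed RFC interface */
void flush_work_at_nice(struct work_struct *work, long nice);

struct chunk_work {
	struct work_struct work;
	/* per-chunk state would live here */
};

static void chunk_fn(struct work_struct *work)
{
	/* process one chunk of the task */
}

static void run_helpers(struct chunk_work *helpers, int nr, long main_nice)
{
	int i;

	for (i = 0; i < nr; i++) {
		INIT_WORK(&helpers[i].work, chunk_fn);
		queue_work(system_unbound_wq, &helpers[i].work);
	}

	/*
	 * Main thread has finished its own chunks: promote each unfinished
	 * helper, one at a time, to the main thread's priority, so at most
	 * one helper runs above MAX_NICE on a loaded system.
	 */
	for (i = 0; i < nr; i++)
		flush_work_at_nice(&helpers[i].work, main_nice);
}
```
-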
Submitted by Daniel Jordan
hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

Tasks can fail midway through their work. To recover, the finished chunks
of work need to be undone in a task-specific way.

Allow ktask clients to pass an "undo" callback that is responsible for
undoing one chunk of work. To avoid multiple levels of error handling, do
not allow the callback to fail. For simplicity and because it's a slow
path, undoing is not multithreaded.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Daniel Jordan
hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

A single CPU can spend an excessive amount of time in the kernel operating
on large amounts of data. Often these situations arise during
initialization- and destruction-related tasks, where the data involved
scales with system size. These long-running jobs can slow startup and
shutdown of applications and the system itself while extra CPUs sit idle.

To ensure that applications and the kernel continue to perform well as core
counts and memory sizes increase, harness these idle CPUs to complete such
jobs more quickly.

ktask is a generic framework for parallelizing CPU-intensive work in the
kernel. The API is generic enough to add concurrency to many different
kinds of tasks--for example, zeroing a range of pages or evicting a list of
inodes--and aims to save its clients the trouble of splitting up the work,
choosing the number of threads to use, maintaining an efficient concurrency
level, starting these threads, and load balancing the work between them.

The Documentation patch earlier in this series, from which the above was
swiped, has more background.

Inspired by work from Pavel Tatashin, Steve Sistare, and Jonathan Adams.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Suggested-by: Pavel Tatashin <Pavel.Tatashin@microsoft.com>
Suggested-by: Steve Sistare <steven.sistare@oracle.com>
Suggested-by: Jonathan Adams <jwadams@google.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Daniel Borkmann
mainline inclusion
from mainline-v4.20-rc5
commit e2c95a61656d29ceaac97b6a975c8a1f26e26f15
category: bugfix
bugzilla: 5654
CVE: NA

Make fetching of the BPF call address from ppc64 JIT generic. ppc64 was
using a slightly different variant rather than through the insns' imm field
encoding as the target address would not fit into that space. Therefore,
the target subprog number was encoded into the insns' offset and fetched
through fp->aux->func[off]->bpf_func instead. Given there are other JITs
with this issue and the mechanism of fetching the address is JIT-generic,
move it into the core as a helper instead. On the JIT side, we get
information on whether the retrieved address is a fixed one, that is, not
changing through JIT passes, or a dynamic one. For the former, JITs can
optimize their imm emission because this doesn't change jump offsets
throughout the JIT process.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Tested-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Xuefeng Wang <wxf.wang@hisilicon.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Wanpeng Li
commit 17e433b5 upstream.

After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
five years old bug is exposed. Running the ebizzy benchmark in three 80
vCPUs VMs on one 80 pCPUs Skylake server, a lot of rcu_sched stall warnings
splat in the VMs after stress testing:

  INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
  Call Trace:
    flush_tlb_mm_range+0x68/0x140
    tlb_flush_mmu.part.75+0x37/0xe0
    tlb_finish_mmu+0x55/0x60
    zap_page_range+0x142/0x190
    SyS_madvise+0x3cd/0x9c0
    system_call_fastpath+0x1c/0x21

swait_active() remains true before finish_swait() is called in
kvm_vcpu_block(); voluntarily preempted vCPUs are taken into account by the
kvm_vcpu_on_spin() loop, which greatly increases the probability that the
condition kvm_arch_vcpu_runnable(vcpu) is checked and can be true. When
APICv is enabled, the yield-candidate vCPU's VMCS RVI field leaks (by
vmx_sync_pir_to_irr()) into the spinning-on-a-taken-lock vCPU's current
VMCS.

This patch fixes it by checking conservatively a subset of events.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Gary R Hook
commit 9f00baf7 upstream.

AES GCM encryption allows for authsize values of 4, 8, and 12-16 bytes.
Validate the requested authsize, and retain it to save in the request
context.

Fixes: 36cf515b ("crypto: ccp - Enable support for AES GCM on v5 CCPs")
Cc: <stable@vger.kernel.org>
Signed-off-by: Gary R Hook <gary.hook@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Jeremy Linton
mainline inclusion
from mainline-5.3-rc1
commit d24a0c7099b3
category: feature
bugzilla: 16072
CVE: NA

---------------------------

ACPI 6.3 adds additional fields to the MADT GICC structure to describe SPE
PPI's. We pick these out of the cached reference to the madt_gicc structure
similarly to the core PMU code. We then create a platform device referring
to the IRQ and let the user/module loader decide whether to load the SPE
driver.

Tested-by: Hanjun Gou <gouhanjun@huawei.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Jeremy Linton
mainline inclusion
from mainline-5.3-rc1
commit 56855a99
category: feature
bugzilla: 16072
CVE: NA

---------------------------

ACPI 6.3 adds a flag to indicate that child nodes are all identical cores.
This is useful to authoritatively determine if a set of (possibly offline)
cores are identical or not.

Since the flag doesn't give us a unique id we can generate one and use it
to create bitmaps of sibling nodes, or simply in a loop to determine if a
subset of cores are identical.

Tested-by: Hanjun Gou <gouhanjun@huawei.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by lingmingqiang
driver inclusion
category: bugfix
bugzilla: NA
CVE: NA

In this patch, we try to reserve more DMA memory for user space
applications. First, we introduce the SS (share static region) slice, which
covers a continuous physical address range; an SS region of a Warpdrive
queue can have multiple slices. Before mapping to a user space VMA, the
slices are sorted in increasing physical-address order and those whose
physical addresses are continuous are merged. After reserving the memory,
several IOCTL system calls are issued to pass the slices' physical address
information to user space.

Signed-off-by: yumeng <yumeng18@huawei.com>
Reviewed-by: xuzaibo <xuzaibo@huawei.com>
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com>
Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Tejun Heo
commit c03cd7738a83b13739f00546166969342c8ff014 upstream.

CSS_TASK_ITER_PROCS currently iterates live group leaders; however, this
means that a process with dying leader and live threads will be skipped.
IOW, cgroup.procs might be empty while cgroup.threads isn't, which is
confusing to say the least.

Fix it by making cset track dying tasks and include dying leaders with live
threads in PROCS iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Tejun Heo
commit b636fd38dc40113f853337a7d2a6885ad23b8811 upstream.

When a task is moved out of a cset, task iterators pointing to the task are
advanced using the normal css_task_iter_advance() call. This is fine but
we'll be tracking dying tasks on csets and thus moving tasks from
cset->tasks to (to be added) cset->dying_tasks. When we remove a task from
cset->tasks, if we advance the iterators, they may move over to the next
cset before we had the chance to add the task back on the dying list, which
can allow the task to escape iteration.

This patch separates out skipping from advancing. Skipping only moves the
affected iterators to the next pointer rather than fully advancing it and
the following advancing will recognize that the cursor has already been
moved forward and do the rest of advancing. This ensures that when a task
moves from one list to another in its cset, as long as it moves in the
right direction, it's always visible to iteration.

This doesn't cause any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Arnd Bergmann
[ Upstream commit 055d8824 ]

Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
linux-2.5.69 along with hundreds of other commands, but was always broken
since only the structure is compatible, but the command number is not, due
to the size being sizeof(size_t), or at first
sizeof(sizeof((struct sockaddr_pppox)), which is different on 64-bit
architectures.

Guillaume Nault adds:

  And the implementation was broken until 2016 (see 29e73269 ("pppoe: fix
  reference counting in PPPoE proxy")), and nobody ever noticed. I should
  probably have removed this ioctl entirely instead of fixing it. Clearly,
  it has never been used.

Fix it by adding a compat_ioctl handler for all pppoe variants that
translates the command number and then calls the regular ioctl function.
All other ioctl commands handled by pppoe are compatible between 32-bit and
64-bit, and require compat_ptr() conversion.

This should apply to all stable kernels.

Acked-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ariel Levkovich
[ Upstream commit 90bb769291161cf25a818d69cf608c181654473e ]

This patch prevents a race between the user-invoked cached counters query
and the neighbor last-usage updater.

The cached flow counter stats can be queried by calling
"mlx5_fc_query_cached", which provides the number of bytes and packets that
passed via this flow since the last time this counter was queried. It does
so by subtracting the last saved stats from the current, cached stats and
then updating the last saved stats with the cached stats. It also provides
the lastuse value for that flow.

Since "mlx5e_tc_update_neigh_used_value" needs to retrieve the last usage
time of encapsulation flows, it calls the flow counter query method
periodically and asynchronously to user queries of the flow counter using
cls_flower. This call causes the driver to update the last reported bytes
and packets from the cache, and therefore future user queries of the flow
stats will return a lower than expected number of bytes and packets, since
the last saved stats in the driver were updated asynchronously to the last
saved stats in cls_flower.

This causes wrong stats presentation of encapsulation flows to the user.

Since the neighbor usage updater only needs the lastuse stats from the
cached counter, the fix is to use a dedicated lastuse query call that
returns the lastuse value without syncing the cached stats with the last
saved stats.

Fixes: f6dfb4c3 ("net/mlx5e: Update neighbour 'used' state using HW flow rules counters")
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Edward Srouji
[ Upstream commit 7a32f296 ]

Fix modify_cq_in alignment to match the device specification. After this
fix the 'cq_umem_valid' field will be in the right offset.

Cc: <stable@vger.kernel.org> # 4.19
Fixes: bd371975 ("net/mlx5: Update mlx5_ifc with DEVX UID bits")
Signed-off-by: Edward Srouji <edwards@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Dan Williams
commit 00289cd8 upstream.

The libnvdimm subsystem arranges for devices to be destroyed as a result of
a sysfs operation. Since device_unregister() cannot be called from an
actively running sysfs attribute of the same device, libnvdimm arranges for
device_unregister() to be performed in an out-of-line async context.

The driver core maintains a 'dead' state for coordinating its own racing
async registration / de-registration requests. Rather than add local 'dead'
state tracking infrastructure to libnvdimm device objects, export the
existing state tracking via a new kill_device() helper.

The kill_device() helper simply marks the device as dead, i.e. that it is
on its way to device_del(), or returns that the device was already dead.
This can be used in advance of calling device_unregister() for subsystems
like libnvdimm that might need to handle multiple user threads racing to
delete a device.

This refactoring does not change any behavior, but it is a pre-requisite
for follow-on fixes and therefore marked for -stable.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Fixes: 4d88a97a ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
Cc: <stable@vger.kernel.org>
Tested-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/156341207332.292348.14959761496009347574.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
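A minimal sketch of the usage pattern the commit message describes, for a subsystem where several user threads may race to delete the same device. The return convention (boolean "was not already dead") and the locking shown here are assumed from the description; consult the driver core for the real helper.

```c
#include <linux/device.h>

static void subsys_delete_device(struct device *dev)
{
	bool killed;

	device_lock(dev);
	killed = kill_device(dev);	/* mark as dead; assumed false if already dead */
	device_unlock(dev);

	if (!killed)
		return;			/* another thread is already tearing it down */

	/* only the winner of the race performs the actual unregistration */
	device_unregister(dev);
}
```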
-
Submitted by tanshukun
driver inclusion
category: bugfix
bugzilla: NA
CVE: NA

Feature or Bugfix: Bugfix

Signed-off-by: tanshukun (A) <tanshukun1@huawei.com>
Reviewed-by: wangzhou <wangzhou1@hisilicon.com>
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com>
Reviewed-by: lingmingqiang <lingmingqiang@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by lingmingqiang
driver inclusion
category: bugfix
bugzilla: NA
CVE: NA

Feature or Bugfix: Feature

Signed-off-by: tanghui20 <tanghui20@huawei.com>
Reviewed-by: wangzhou <wangzhou1@hisilicon.com>
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com>
Reviewed-by: lingmingqiang <lingmingqiang@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by lingmingqiang
driver inclusion
category: bugfix
bugzilla: NA
CVE: NA

Feature or Bugfix: Feature

Signed-off-by: tanshukun (A) <tanshukun1@huawei.com>
Reviewed-by: xuzaibo <xuzaibo@huawei.com>
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com>
Reviewed-by: lingmingqiang <lingmingqiang@huawei.com>
Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Mikko Rapeli
[ Upstream commit f90fb3c7e2c13ae829db2274b88b845a75038b8a ]

Only users of upc_req in kernel side fs/coda/psdev.c and fs/coda/upcall.c
already include linux/coda_psdev.h.

Suggested by Jan Harkes <jaharkes@cs.cmu.edu> in
https://lore.kernel.org/lkml/20150531111913.GA23377@cs.cmu.edu/

Fixes these include/uapi/linux/coda_psdev.h compilation errors in
userspace:

  linux/coda_psdev.h:12:19: error: field `uc_chain' has incomplete type
    struct list_head uc_chain;
                     ^
  linux/coda_psdev.h:13:2: error: unknown type name `caddr_t'
    caddr_t uc_data;
    ^
  linux/coda_psdev.h:14:2: error: unknown type name `u_short'
    u_short uc_flags;
    ^
  linux/coda_psdev.h:15:2: error: unknown type name `u_short'
    u_short uc_inSize;  /* Size is at most 5000 bytes */
    ^
  linux/coda_psdev.h:16:2: error: unknown type name `u_short'
    u_short uc_outSize;
    ^
  linux/coda_psdev.h:17:2: error: unknown type name `u_short'
    u_short uc_opcode;  /* copied from data to save lookup */
    ^
  linux/coda_psdev.h:19:2: error: unknown type name `wait_queue_head_t'
    wait_queue_head_t uc_sleep;  /* process' wait queue */
    ^

Link: http://lkml.kernel.org/r/9f99f5ce6a0563d5266e6cf7aa9585aac2cae971.1558117389.git.jaharkes@cs.cmu.edu
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Sam Protsenko <semen.protsenko@linaro.org>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Zhouyang Jia <jiazhouyang09@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Sam Protsenko
[ Upstream commit b2a57e33 ]

The kernel is a self-contained project and can be built with a bare-metal
toolchain. But a bare-metal toolchain doesn't define __linux__. Because of
this, the u_quad_t type is not defined when using a bare-metal toolchain
and the codafs build fails. This patch fixes it by defining the u_quad_t
type unconditionally.

Link: http://lkml.kernel.org/r/3cbb40b0a57b6f9923a9d67b53473c0b691a3eaa.1558117389.git.jaharkes@cs.cmu.edu
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Mikko Rapeli <mikko.rapeli@iki.fi>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Zhouyang Jia <jiazhouyang09@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Arnd Bergmann
[ Upstream commit dfd6f9ad ]

clang gets confused by an uninitialized variable in what looks to it like a
never executed code path:

  arch/x86/kernel/acpi/boot.c:618:13: error: variable 'polarity' is uninitialized when used here [-Werror,-Wuninitialized]
                  polarity = polarity ? ACPI_ACTIVE_LOW : ACPI_ACTIVE_HIGH;
                             ^~~~~~~~
  arch/x86/kernel/acpi/boot.c:606:32: note: initialize the variable 'polarity' to silence this warning
          int rc, irq, trigger, polarity;
                                        ^
                                         = 0
  arch/x86/kernel/acpi/boot.c:617:12: error: variable 'trigger' is uninitialized when used here [-Werror,-Wuninitialized]
                  trigger = trigger ? ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE;
                            ^~~~~~~
  arch/x86/kernel/acpi/boot.c:606:22: note: initialize the variable 'trigger' to silence this warning
          int rc, irq, trigger, polarity;
                              ^
                               = 0

This is unfortunately a design decision in clang and won't be fixed.
Changing the acpi_get_override_irq() macro to an inline function reliably
avoids the issue.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
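For readers unfamiliar with the shape of this kind of fix, the toy code below shows the general pattern only: a function-like macro stub replaced by an equivalent static inline. It is not the ACPI code itself, the names are invented, and whether any particular compiler warns on the toy version is not claimed here; the commit above states that the inline form reliably avoids the clang diagnostic for acpi_get_override_irq().

```c
#include <stdio.h>

/* macro stub: expands to a constant and never touches *trig / *pol */
#define lookup_override_macro(gsi, trig, pol)	(-1)

/* inline stub: identical behaviour, but a real function body the
 * compiler's uninitialized-use analysis can reason about */
static inline int lookup_override_inline(unsigned int gsi, int *trig, int *pol)
{
	(void)gsi; (void)trig; (void)pol;
	return -1;
}

int main(void)
{
	int trig, pol;

	if (lookup_override_inline(9, &trig, &pol) >= 0)
		printf("trigger=%d polarity=%d\n", trig, pol);
	else
		printf("no interrupt override\n");
	return 0;
}
```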
-
Submitted by Bart Van Assche
commit cd84a62e upstream.

The RQF_PREEMPT flag is used for three purposes:

- In the SCSI core, for making sure that power management requests are
  executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.

Rename "preempt-only" into "pm-only" because the primary purpose of this
mode is power management. Since the power management core may but does not
have to resume a runtime suspended device before performing system-wide
suspend and since a later patch will set "pm-only" mode as long as a block
device is runtime suspended, make it possible to set "pm-only" mode from
more than one context. Since with this change scsi_device_quiesce() is no
longer idempotent, make that function return early if it is called for a
quiesced queue.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Jann Horn
commit cb361d8c upstream.

The old code used RCU annotations and accessors inconsistently for
->numa_group, which can lead to use-after-frees and NULL dereferences. Let
all accesses to ->numa_group use proper RCU helpers to prevent such issues.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 8c8a743c ("sched/numa: Use {cpu, pid} to create task groups for shared faults")
Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
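A generic sketch of the consistent-RCU-accessor discipline the fix enforces for ->numa_group. The struct and function names below are illustrative, not the scheduler's actual helpers; only the RCU primitives themselves are real kernel APIs.

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct numa_grp {
	struct rcu_head rcu;
	int refcount;
};

struct task_like {
	struct numa_grp __rcu *numa_group;	/* only touched via RCU helpers */
};

static int read_group_ref(struct task_like *t)
{
	struct numa_grp *grp;
	int ref = -1;

	rcu_read_lock();
	grp = rcu_dereference(t->numa_group);	/* reader side */
	if (grp)
		ref = grp->refcount;
	rcu_read_unlock();
	return ref;
}

static void replace_group(struct task_like *t, struct numa_grp *new)
{
	struct numa_grp *old;

	/* caller is assumed to hold the write-side lock for this pointer */
	old = rcu_dereference_protected(t->numa_group, true);
	rcu_assign_pointer(t->numa_group, new);	/* publish the new group */
	if (old)
		kfree_rcu(old, rcu);		/* free after a grace period */
}
```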
-
Submitted by Jann Horn
commit 16d51a59 upstream.

When going through execve(), zero out the NUMA fault statistics instead of
freeing them.

During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace. I believe
that it would also be possible for a use-after-free read to occur through a
race between a NUMA fault and execve(): task_numa_fault() can lead to
task_numa_compare(), which invokes task_weight() on the currently running
task of a different CPU.

Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Joerg Roedel
commit 201c1db9 upstream.

The stub function for !CONFIG_IOMMU_IOVA needs to be 'static inline'.

Fixes: effa4678 ('iommu/vt-d: Don't queue_iova() if there is no flush queue')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Dmitry Safonov
commit effa4678 upstream.

Intel VT-d driver was reworked to use common deferred flushing
implementation. Previously there was one global per-cpu flush queue,
afterwards - one per domain.

Before deferring a flush, the queue should be allocated and initialized.
Currently only domains with IOMMU_DOMAIN_DMA type initialize their flush
queue. It's probably worth to init it for static or unmanaged domains too,
but it may be arguable - I'm leaving it to iommu folks.

Prevent queuing an iova flush if the domain doesn't have a queue. The
defensive check seems to be worth to keep even if queue would be
initialized for all kinds of domains. And is easy backportable.

On 4.19.43 stable kernel it has a user-visible effect: previously for
devices in si domain there were crashes, on sata devices:

  BUG: spinlock bad magic on CPU#6, swapper/0/1
   lock: 0xffff88844f582008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
  CPU: 6 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #1
  Call Trace:
   <IRQ>
   dump_stack+0x61/0x7e
   spin_bug+0x9d/0xa3
   do_raw_spin_lock+0x22/0x8e
   _raw_spin_lock_irqsave+0x32/0x3a
   queue_iova+0x45/0x115
   intel_unmap+0x107/0x113
   intel_unmap_sg+0x6b/0x76
   __ata_qc_complete+0x7f/0x103
   ata_qc_complete+0x9b/0x26a
   ata_qc_complete_multiple+0xd0/0xe3
   ahci_handle_port_interrupt+0x3ee/0x48a
   ahci_handle_port_intr+0x73/0xa9
   ahci_single_level_irq_intr+0x40/0x60
   __handle_irq_event_percpu+0x7f/0x19a
   handle_irq_event_percpu+0x32/0x72
   handle_irq_event+0x38/0x56
   handle_edge_irq+0x102/0x121
   handle_irq+0x147/0x15c
   do_IRQ+0x66/0xf2
   common_interrupt+0xf/0xf
  RIP: 0010:__do_softirq+0x8c/0x2df

The same for usb devices that use ehci-pci:

  BUG: spinlock bad magic on CPU#0, swapper/0/1
   lock: 0xffff88844f402008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #4
  Call Trace:
   <IRQ>
   dump_stack+0x61/0x7e
   spin_bug+0x9d/0xa3
   do_raw_spin_lock+0x22/0x8e
   _raw_spin_lock_irqsave+0x32/0x3a
   queue_iova+0x77/0x145
   intel_unmap+0x107/0x113
   intel_unmap_page+0xe/0x10
   usb_hcd_unmap_urb_setup_for_dma+0x53/0x9d
   usb_hcd_unmap_urb_for_dma+0x17/0x100
   unmap_urb_for_dma+0x22/0x24
   __usb_hcd_giveback_urb+0x51/0xc3
   usb_giveback_urb_bh+0x97/0xde
   tasklet_action_common.isra.4+0x5f/0xa1
   tasklet_action+0x2d/0x30
   __do_softirq+0x138/0x2df
   irq_exit+0x7d/0x8b
   smp_apic_timer_interrupt+0x10f/0x151
   apic_timer_interrupt+0xf/0x20
   </IRQ>
  RIP: 0010:_raw_spin_unlock_irqrestore+0x17/0x39

Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: <stable@vger.kernel.org> # 4.14+
Fixes: 13cf0174 ("iommu/vt-d: Make use of iova deferred flushing")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
[v4.14-port notes: o minor conflict with untrusted IOMMU devices check under if-condition]
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Linus Torvalds
commit d7852fbd upstream.

It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
work because it installs a temporary credential that gets allocated and
freed for each system call.

The allocation and freeing overhead is mostly benign, but because
credentials can be accessed under the RCU read lock, the freeing involves a
RCU grace period.

Which is not a huge deal normally, but if you have a lot of access() calls,
this causes a fair amount of secondary damage: instead of having a nice
alloc/free pattern that hits in hot per-CPU slab caches, you have all those
delayed free's, and on big machines with hundreds of cores, the RCU
overhead can end up being enormous.

But it turns out that all of this is entirely unnecessary. Exactly because
access() only installs the credential as the thread-local subjective
credential, the temporary cred pointer doesn't actually need to be RCU
free'd at all. Once we're done using it, we can just free it synchronously
and avoid all the RCU overhead.

So add a 'non_rcu' flag to 'struct cred', which can be set by users that
know they only use it in non-RCU context (there are other potential users
for this). We can make it a union with the rcu freeing list head that we
need for the RCU case, so this doesn't need any extra storage.

Note that this also makes 'get_current_cred()' clear the new non_rcu flag,
in case we have filesystems that take a long-term reference to the cred and
then expect the RCU delayed freeing afterwards. It's not entirely clear
that this is required, but it makes for clear semantics: the subjective
cred remains non-RCU as long as you only access it synchronously using the
thread-local accessors, but you _can_ use it as a generic cred if you want
to.

It is possible that we should just remove the whole RCU markings for ->cred
entirely. Only ->real_cred is really supposed to be accessed through RCU,
and the long-term cred copies that nfs uses might want to explicitly
re-enable RCU freeing if required, rather than have get_current_cred() do
it implicitly.

But this is a "minimal semantic changes" change for the immediate problem.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Glauber <jglauber@marvell.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
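An illustrative sketch of the general technique: an object that is normally freed after an RCU grace period carries a flag saying it was only ever used synchronously, letting the owner free it immediately. The struct layout and names are generic placeholders, not the actual struct cred.

```c
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct sync_or_rcu_obj {
	int data;
	bool non_rcu;			/* set by users that never publish it to RCU readers */
	union {
		int spare;		/* storage reused while the object is live */
		struct rcu_head rcu;	/* only needed when freeing via RCU */
	};
};

static void obj_rcu_free(struct rcu_head *head)
{
	kfree(container_of(head, struct sync_or_rcu_obj, rcu));
}

static void put_obj(struct sync_or_rcu_obj *obj)
{
	if (obj->non_rcu)
		kfree(obj);			/* never visible under rcu_read_lock() */
	else
		call_rcu(&obj->rcu, obj_rcu_free);	/* wait for a grace period */
}
```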
-
Submitted by Thierry Reding
[ Upstream commit 1e390478 ]

Recent versions of the DMA API debug code have started to warn about
violations of the maximum DMA segment size. This is because the segment
size defaults to 64 KiB, which can easily be exceeded in large buffer
allocations such as used in DRM/KMS for framebuffers.

Technically the Tegra SMMU and ARM SMMU don't have a maximum segment size
(they map individual pages irrespective of whether they are contiguous or
not), so the choice of 4 MiB is a bit arbitrary here. The maximum segment
size is a 32-bit unsigned integer, though, so we can't set it to the
correct maximum size, which would be the size of the aperture.

Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
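A hedged sketch of the standard way a driver raises its DMA segment-size limit so that DMA API debug stops warning about large contiguous buffers. Whether the patch above uses exactly this helper is not stated in the text; the dma_parms mechanism and dma_set_max_seg_size() are the usual route, and the driver struct here is invented for the example.

```c
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/sizes.h>

struct mydrv {
	struct device *dev;
	struct device_dma_parameters dma_parms;
};

static void mydrv_init_dma_limits(struct mydrv *m)
{
	/* segment-size limits live in dev->dma_parms; provide storage for them */
	m->dev->dma_parms = &m->dma_parms;

	/* 4 MiB: comfortably above framebuffer-sized scatterlist segments */
	dma_set_max_seg_size(m->dev, SZ_4M);
}
```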
-
Submitted by Mikulas Patocka
mainline inclusion
from mainline-5.0-rc1
commit 1226b8dd
category: bugfix
bugzilla: 18695
CVE: NA

---------------------------

Now when part_round_stats is gone, we can switch to per-cpu in-flight
counters.

We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.

The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter must
be exact, so it needs local_t.

Conflicts:
  include/linux/genhd.h

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
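A minimal sketch of per-CPU in-flight accounting with local_t, as described above. The structure and function names are illustrative stand-ins for the block layer's partition statistics; only the per-CPU and local_t primitives are real kernel APIs.

```c
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <asm/local.h>

struct inflight_stat {
	local_t in_flight[2];		/* [0] = read, [1] = write (illustrative) */
};

static DEFINE_PER_CPU(struct inflight_stat, inflight_stats);

static void inc_in_flight(int rw)
{
	struct inflight_stat *s = get_cpu_ptr(&inflight_stats);

	/* local_t stays correct even if an interrupt on this CPU re-enters */
	local_inc(&s->in_flight[rw]);
	put_cpu_ptr(&inflight_stats);
}

static void dec_in_flight(int rw)
{
	struct inflight_stat *s = get_cpu_ptr(&inflight_stats);

	local_dec(&s->in_flight[rw]);
	put_cpu_ptr(&inflight_stats);
}

static unsigned int read_in_flight(int rw)
{
	long sum = 0;
	int cpu;

	/* summing the per-CPU counters yields the exact global value */
	for_each_possible_cpu(cpu)
		sum += local_read(&per_cpu(inflight_stats, cpu).in_flight[rw]);
	return sum < 0 ? 0 : sum;
}
```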
-
Submitted by Ross Zwisler
commit 6ba0e7dc64a5adcda2fbe65adc466891795d639e upstream.

Currently both journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() operate on the entire address space of
each of the inodes associated with a given journal entry. The consequence
of this is that if we have an inode where we are constantly appending dirty
pages we can end up waiting for an indefinite amount of time in
journal_finish_inode_data_buffers() while we wait for all the pages under
writeback to be written out.

The easiest way to cause this type of workload is to just dd from /dev/zero
to a file until it fills the entire filesystem. This can cause
journal_finish_inode_data_buffers() to wait for the duration of the entire
dd operation.

We can improve this situation by scoping each of the inode dirty ranges
associated with a given transaction. We do this via the jbd2_inode
structure so that the scoping is contained within jbd2 and so that it
follows the lifetime and locking rules for that structure.

This allows us to limit the writeback & wait in
journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() respectively to the dirty range for a
given struct jbd2_inode, keeping us from waiting forever if the inode in
question is still being appended to.

Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ross Zwisler
commit aa0bfcd939c30617385ffa28682c062d78050eba upstream.

In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon which
to wait and does not clear errors from the address space.

Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
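A hedged sketch of how a caller such as a journalling filesystem might use the new helper to wait only on a tracked dirty range. The signature is assumed to mirror filemap_fdatawait_range() (mapping plus start/end byte offsets); the wrapper and its range fields are invented for the example.

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

static int wait_on_tracked_range(struct inode *inode,
				 loff_t dirty_start, loff_t dirty_end)
{
	struct address_space *mapping = inode->i_mapping;

	if (dirty_end < dirty_start)
		return 0;	/* nothing tracked for this transaction */

	/*
	 * Wait only for writeback inside the tracked range, and leave the
	 * address-space error flags in place for the caller's own error
	 * reporting to pick up later.
	 */
	return filemap_fdatawait_range_keep_errors(mapping, dirty_start,
						   dirty_end);
}
```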
-
Submitted by Alexander Shishkin
commit 8a58ddae upstream.

So far, we tried to disallow grouping exclusive events for the fear of
complications they would cause with moving between contexts. Specifically,
moving a software group to a hardware context would violate the exclusivity
rules if both groups contain matching exclusive events.

This attempt was, however, unsuccessful: the check that we have in the
perf_event_open() syscall is both wrong (looks at wrong PMU) and
insufficient (group leader may still be exclusive), as can be illustrated
by running:

  $ perf record -e '{intel_pt//,cycles}' uname
  $ perf record -e '{cycles,intel_pt//}' uname

ultimately successfully.

Furthermore, we are completely free to trigger the exclusivity violation
by:

  perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'

even though the helpful perf record will not allow that, the ABI will.

The warning later in the perf_event_open() path will also not trigger,
because it's also wrong.

Fix all this by validating the original group before moving, getting rid of
broken safeguards and placing a useful one to perf_install_in_context().

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mathieu.poirier@linaro.org
Cc: will.deacon@arm.com
Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Marek Szyprowski
[ Upstream commit 6282edb72bed5324352522d732080d4c1b9dfed6 ]

Exynos SoCs based on CA7/CA15 have 2 timer interfaces: the custom Exynos
MCT (Multi Core Timer) and standard ARM Architected Timers.

There are use cases where both timer interfaces are used simultaneously.
One such example is using Exynos MCT for the main system timer and ARM
Architected Timers for KVM and virtualized guests (KVM requires arch
timers).

The Exynos Multi-Core Timer driver (exynos_mct) must however be started
before ARM Architected Timers (arch_timer), because they both share some
common hardware blocks (the global system counter) and turning on MCT is
needed to get the ARM Architected Timer working properly.

To ensure that Exynos MCT is selected as the main system timer, increase
the MCT timer rating. To ensure the proper starting order of both timers
during the suspend/resume cycle, increase the MCT hotplug priority over ARM
Architected Timers.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Waiman Long
[ Upstream commit 6da9f775 ]

When debugging options are turned on, the rcu_read_lock() function might
not be inlined. This results in lockdep's print_lock() function printing
"rcu_read_lock+0x0/0x70" instead of rcu_read_lock()'s caller. For example:

  [   10.579995] =============================
  [   10.584033] WARNING: suspicious RCU usage
  [   10.588074] 4.18.0.memcg_v2+ #1 Not tainted
  [   10.593162] -----------------------------
  [   10.597203] include/linux/rcupdate.h:281 Illegal context switch in RCU read-side critical section!
  [   10.606220]
  [   10.606220] other info that might help us debug this:
  [   10.606220]
  [   10.614280]
  [   10.614280] rcu_scheduler_active = 2, debug_locks = 1
  [   10.620853] 3 locks held by systemd/1:
  [   10.624632]  #0: (____ptrval____) (&type->i_mutex_dir_key#5){.+.+}, at: lookup_slow+0x42/0x70
  [   10.633232]  #1: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70
  [   10.640954]  #2: (____ptrval____) (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x70

These "rcu_read_lock+0x0/0x70" strings are not providing any useful
information. This commit therefore forces inlining of the rcu_read_lock()
function so that rcu_read_lock()'s caller is instead shown.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
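A generic illustration of the technique: a tiny wrapper that must be inlined so that diagnostics blame its caller rather than the wrapper. The actual patch applies __always_inline to rcu_read_lock() itself; the helper names below are invented for the example.

```c
#include <linux/compiler.h>

static void lock_bookkeeping(void)
{
	/* lockdep/debug bookkeeping would happen here */
}

/*
 * Plain "inline" is only a hint; with debug options the compiler may emit
 * this out of line, and lockdep output then shows take_lock_hint+0x0/...
 */
static inline void take_lock_hint(void)
{
	lock_bookkeeping();
}

/* __always_inline guarantees the real caller appears in lockdep output */
static __always_inline void take_lock_forced(void)
{
	lock_bookkeeping();
}
```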
-
Submitted by Peter Zijlstra
The only guarantee provided by wake_q_add() is that a wakeup will happen
after it; it does _NOT_ guarantee the wakeup will be delayed until the
matching wake_up_q().

If wake_q_add() fails the cmpxchg(), a concurrent wakeup is pending and
that can happen at any time after the cmpxchg(). This means we should not
rely on the wakeup happening at wake_up_q(), but should be ready for
wake_q_add() to issue the wakeup.

The delay, if provided (most likely), should only result in more efficient
behaviour.

Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
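A minimal wake_q usage sketch for the rule stated above: once wake_q_add() has been called, the wakeup may happen at any point, not only at the later wake_up_q(), so all state the woken task will inspect must be consistent beforehand. The surrounding lock and function are illustrative; only the wake_q primitives are real kernel APIs.

```c
#include <linux/sched.h>
#include <linux/sched/wake_q.h>
#include <linux/spinlock.h>

static void queue_wakeup(spinlock_t *lock, struct task_struct *waiter)
{
	DEFINE_WAKE_Q(wake_q);

	spin_lock(lock);
	/*
	 * Everything the waiter will check on wakeup must already be
	 * consistent here: after wake_q_add() the waiter may be running,
	 * even though wake_up_q() has not been called yet.
	 */
	wake_q_add(&wake_q, waiter);
	spin_unlock(lock);

	wake_up_q(&wake_q);	/* the usual, but not the only possible, wakeup point */
}
```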
-
Submitted by Vinod Koul
[ Upstream commit 8f9fab480c7a87b10bb5440b5555f370272a5d59 ]

DIV_ROUND_UP_ULL adds the two arguments and then invokes
DIV_ROUND_DOWN_ULL. But on a 32-bit system the addition of two 32-bit
values can overflow. DIV_ROUND_DOWN_ULL does it correctly and stashes the
addition into an unsigned long long, so cast the result to unsigned long
long here to avoid the overflow condition.

[akpm@linux-foundation.org: DIV_ROUND_UP_ULL must be an rval]
Link: http://lkml.kernel.org/r/20190625100518.30753-1-vkoul@kernel.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
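A stand-alone demonstration of the overflow class described above. The macros mimic only the shape of the kernel helpers (they are not copied from the kernel source): the broken variant performs the "round up" addition in 32-bit arithmetic before the 64-bit division, the fixed variant widens to unsigned long long before adding.

```c
#include <stdio.h>

#define DIV_DOWN_ULL(ll, d)       (((unsigned long long)(ll)) / (d))
#define DIV_UP_ULL_BROKEN(ll, d)  DIV_DOWN_ULL((ll) + (d) - 1, (d))
#define DIV_UP_ULL_FIXED(ll, d)   DIV_DOWN_ULL((unsigned long long)(ll) + (d) - 1, (d))

int main(void)
{
	unsigned int big = 0xFFFFFFF0u;	/* near UINT_MAX, as a 32-bit caller might pass */
	unsigned int div = 0x100u;

	/* with 32-bit operands, big + div - 1 wraps before the widening cast */
	printf("broken: %llu\n", DIV_UP_ULL_BROKEN(big, div));	/* prints 0 */
	printf("fixed : %llu\n", DIV_UP_ULL_FIXED(big, div));	/* prints 16777216 */
	return 0;
}
```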
-
Submitted by James Morse
commit 83b44fe343b5abfcb1b2261289bd0cfcfcfd60a8 upstream.

The cacheinfo structures are alloced/freed by cpu online/offline callbacks.
Originally these were only used by sysfs to expose the cache topology to
user space. Without any in-kernel dependencies CPUHP_AP_ONLINE_DYN was an
appropriate choice.

resctrl has started using these structures to identify CPUs that share a
cache. It updates its 'domain' structures from cpu online/offline
callbacks. These depend on the cacheinfo structures
(resctrl_online_cpu()->domain_add_cpu()->get_cache_id()->
get_cpu_cacheinfo()). These also run as CPUHP_AP_ONLINE_DYN.

Now that there is an in-kernel dependency, move the cacheinfo work earlier
so we know it's done before resctrl's CPUHP_AP_ONLINE_DYN work runs.

Fixes: 2264d9c7 ("x86/intel_rdt: Build structures for each resource based on cache topology")
Cc: <stable@vger.kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: James Morse <james.morse@arm.com>
Link: https://lore.kernel.org/r/20190624173656.202407-1-james.morse@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Shaokun Zhang
mainline inclusion
from mainline-v5.2
commit: 9a83c84c
category: performance
bugzilla: NA
CVE: NA

--------------------------------------------------

Add coherency_max_size variable to record the maximum cache line size for
different cache levels. If it is available, we will synchronize it as cache
line size, otherwise we will use CTR_EL0.CWG reporting in cache_line_size()
for arm64.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Vishnu DASA
commit 1c2eb5b2 upstream.

The VMCI handle array has an integer overflow in
vmci_handle_arr_append_entry when it tries to expand the array. This can be
triggered from a guest, since the doorbell link hypercall doesn't impose a
limit on the number of doorbell handles that a VM can create in the
hypervisor, and these handles are stored in a handle array.

In this change, we introduce a mandatory max capacity for handle
arrays/lists to avoid excessive memory usage.

Signed-off-by: Vishnu Dasa <vdasa@vmware.com>
Reviewed-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
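A generic sketch of the mitigation described above: an append helper for a growable handle array that enforces a hard maximum capacity and clamps the growth arithmetic, so a guest cannot grow the array without bound. The types, limit, and allocation scheme are illustrative, not VMCI's actual implementation.

```c
#include <stdint.h>
#include <stdlib.h>

#define HANDLE_ARR_MAX_CAP  256u	/* mandatory upper bound (illustrative) */

struct handle_arr {
	uint32_t size;
	uint32_t capacity;
	uint64_t entries[];
};

static struct handle_arr *handle_arr_create(uint32_t cap)
{
	struct handle_arr *arr;

	if (cap == 0 || cap > HANDLE_ARR_MAX_CAP)
		return NULL;
	arr = calloc(1, sizeof(*arr) + (size_t)cap * sizeof(arr->entries[0]));
	if (arr)
		arr->capacity = cap;
	return arr;
}

/* returns 0 on success, -1 if the array is full or growth fails */
static int handle_arr_append(struct handle_arr **arrp, uint64_t handle)
{
	struct handle_arr *arr = *arrp;

	if (arr->size == arr->capacity) {
		struct handle_arr *bigger;
		uint32_t new_cap;

		if (arr->capacity >= HANDLE_ARR_MAX_CAP)
			return -1;			/* refuse to grow further */

		new_cap = arr->capacity * 2;
		if (new_cap > HANDLE_ARR_MAX_CAP)
			new_cap = HANDLE_ARR_MAX_CAP;	/* clamp, no wraparound */

		bigger = realloc(arr, sizeof(*arr) +
				 (size_t)new_cap * sizeof(arr->entries[0]));
		if (!bigger)
			return -1;
		bigger->capacity = new_cap;
		arr = *arrp = bigger;
	}

	arr->entries[arr->size++] = handle;
	return 0;
}
```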
-
Submitted by Mingqiang Ling
driver inclusion
category: bugfix
bugzilla: NA
CVE: NA

This patch adds a state machine for uacce and QM. The related state machine
and lock design documents can be found at:

  https://github.com/hisilicon/dev-docs/blob/master/warpdrive/state_model.rst
  https://github.com/hisilicon/dev-docs/blob/master/warpdrive/uacce_lock.rst

This patch also solves the problems below:

- Remove the uacce_queue pool in uacce and let the low level driver
  maintain the queue pool. The locks for uacce (uacce_mutex, uacce->q_lock)
  are therefore also removed.
- Provide an ioctl to put a hardware queue, as a hardware queue that is put
  in the uacce .release callback will only be put later in the kernel.
- Modify the reset logic of uacce. The UACCE_ST_RST state of uacce has been
  deleted. The current logic is: before doing a hardware reset, a SIGIO is
  sent to the process bound to the uacce_queue, and the low level driver
  waits for the user to close all fds before doing uacce_unregister; after
  the hardware reset is done, the low level driver can register to the
  uacce subsystem again.
- Modify the way the reset signal is sent: use send_sig_info to send the
  signal. Known issue, as noted in the comments of function
  uacce_send_sig_to_client: "This function can be called in low level
  driver, which may bring a race with uacce_fops_release. The problem is
  this function may be called when q is NULL. Low level driver should avoid
  this by locking hardware queue pool and check if there is related
  hardware queue before calling this function."

Modify the sec/rde module code to adapt to the uacce/qm changes.

Signed-off-by: tanshukun (A) <tanshukun1@huawei.com>
Reviewed-by: wangzhou <wangzhou1@hisilicon.com>
Signed-off-by: Mingqiang Ling <lingmingqiang@huawei.com>
Signed-off-by: lingmingqiang <lingmingqiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-