- 02 September 2020, 36 commits
-
-
Committed by Christoph Hellwig
fix #29327388 commit 3aef3cae4342c1d8137a1c0782cbb66f1be3943c upstream Return the currently active bvec segment, potentially spanning multiple pages. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Christoph Hellwig
to #28991349 commit e20ba6e1da029136ded295f33076483d65ddf50a upstream Having another indirect call in the fast path doesn't really help in our post-spectre world. Also, having too many queue types is just going to create confusion, so I'd rather manage them centrally. Note that the queue type naming and ordering changes a bit - the first index now is the default queue for everything not explicitly marked, the optional ones are read and poll queues. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
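The centrally managed queue-type layout described here corresponds to an enum along these lines (a sketch based on the upstream commit; treat the exact comments as illustrative):

```c
/*
 * Queue types managed by the block layer: index 0 is the default queue
 * for everything not explicitly marked; read and poll queues are optional.
 */
enum hctx_type {
	HCTX_TYPE_DEFAULT,	/* all I/O not otherwise accounted for */
	HCTX_TYPE_READ,		/* just for READ I/O */
	HCTX_TYPE_POLL,		/* polled I/O of any kind */

	HCTX_MAX_TYPES,
};
```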
-
Committed by Jens Axboe
to #28991349 commit 4b04cc6a8f86c4842314def22332de1f15de8523 upstream Adds support for defining a variable number of poll queues, currently configurable with the 'poll_queues' module parameter. Defaults to a single poll queue. And now we finally have poll support without triggering interrupts! Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
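For illustration, the module-parameter plumbing for such a knob looks roughly like this (a minimal sketch; the real NVMe driver validates the value against the available interrupt vectors, which is elided here):

```c
#include <linux/module.h>
#include <linux/moduleparam.h>

/* Number of hardware queues dedicated to polled IO; set via
 * 'poll_queues=N' on the module command line or through
 * /sys/module/<mod>/parameters/poll_queues. */
static unsigned int poll_queues = 1;
module_param(poll_queues, uint, 0644);
MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
```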
-
Committed by Jens Axboe
to #28991349 commit 843477d4cc5c4bb4e346c561ecd3b9d0bd67e8c8 upstream Add a queue offset to the tag map. This enables users to map iteratively, for each queue map type they support. Bump the maximum number of supported maps to 2; we're now fully able to support more than one map. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit ea4f995ee8b8f0578b3319949f2edd5d812fdb0a upstream We call blk_mq_map_queue() a lot, at least twice for each request per IO, sometimes more. Since we now have an indirect call in that function as well, cache the mapping so we don't have to re-call blk_mq_map_queue() for the same request multiple times. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit b3c661b15d5ab11d982e58bee23e05c1780528a1 upstream Add support for the tag set carrying multiple queue maps, and for the driver to inform blk-mq how many it wishes to support through setting set->nr_maps. This adds an mq_ops helper for drivers that support more than one map, mq_ops->rq_flags_to_type(). The function takes request/bio flags and a CPU, and returns a queue map index for that. We then use the type information in blk_mq_map_queue() to index the map set. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
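A rough sketch of how the new hook feeds into queue mapping (simplified, with the field access shortened; the actual helpers in blk-mq.h are split differently and the name below is illustrative):

```c
/* Resolve a request's queue map ("type"), then look up the hardware
 * queue for the submitting CPU inside that map. */
static inline struct blk_mq_hw_ctx *
blk_mq_map_queue_sketch(struct request_queue *q, unsigned int flags,
			unsigned int cpu)
{
	unsigned int type = 0;

	/* drivers with set->nr_maps > 1 supply this hook */
	if (q->mq_ops->rq_flags_to_type)
		type = q->mq_ops->rq_flags_to_type(q, flags);

	return q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]];
}
```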
-
Committed by Jens Axboe
to #28991349 commit f31967f0e455d08d3ea1d2f849bf62dafc92dbf4 upstream The mapping used to depend on just the CPU location, but now it's a tuple of (type, cpu) instead. This is a prep patch for allowing a single software queue to map to multiple hardware queues. No functional changes in this patch. This changes the software queue count to an unsigned short to save a bit of space. We can still support 64K-1 CPUs, which should be enough. Add a check to catch a wrap. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit ed76e329d74a4b15ac0f5fd3adbd52ec0178a134 upstream This is in preparation for allowing multiple sets of maps per queue, if so desired. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit a8908939af569ce2419f43fd56eeaf003bc3d85d upstream It's just a pointer to set->mq_map, use that instead. Move the assignment a bit earlier, so we always know it's valid. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit 6da4b3ab9a6e9b1b5f90322ab3fa3a7dd18edb19 upstream A driver may have a need to allocate multiple sets of MSI/MSI-X interrupts, and have them appropriately affinitized. Add support for defining a number of sets in the irq_affinity structure, of varying sizes, and get each set affinitized correctly across the machine. [ tglx: Minor changelog tweaks ] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Cc: linux-block@vger.kernel.org Link: https://lkml.kernel.org/r/20181102145951.31979-5-ming.lei@redhat.com Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
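The extended affinity descriptor is essentially this shape (a sketch of struct irq_affinity after this change; the comments are mine):

```c
/* Description of how to spread MSI/MSI-X vectors across CPUs. */
struct irq_affinity {
	int pre_vectors;   /* don't apply affinity to the first N vectors */
	int post_vectors;  /* don't apply affinity to the last N vectors */
	int nr_sets;       /* number of interrupt sets; 0 means one set */
	int *sets;         /* size of each set, nr_sets entries */
};
```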
-
Committed by Hou Tao
to #29361128 commit 3d24430694077313c75c6b89f618db09943621e4 upstream. Currently rq->data_len will be decreased by partial completion or zeroed by completion, so when blk_stat_add() is invoked, data_len will be zero and there will never be samples in poll_cb because blk_mq_poll_stats_bkt() will return -1 if data_len is zero. We could move blk_stat_add() back to __blk_mq_complete_request(), but that would make the effort of trying to call ktime_get_ns() once in vain. Instead we can reuse the throtl_size field, use it for both block stats and block throttle, and adjust the logic in blk_mq_poll_stats_bkt() accordingly. Fixes: 4bc6339a ("block: move blk_stat_add() to __blk_mq_end_request()") Tested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Tejun Heo
to #29361128 commit 54c52e10dc9b939084a7e6e3d32ce8fd8dee7898 upstream. The use_delay mechanism was introduced by blk-iolatency to hold memory allocators accountable for the reclaim and other shared IOs they cause. The duration of the delay is dynamically balanced between iolatency increasing the value on each target miss and it auto-decaying as time passes and threads get delayed on it. While this works well for iolatency, iocost's control model isn't compatible with it. There are no repeated "violation" events which can be balanced against auto-decaying. iocost instead knows how much a given cgroup is over budget and wants to prevent that cgroup from issuing IOs while over budget. Until now, iocost has been adding the cost of force-issued IOs. However, this doesn't reflect the amount which is already over budget and is simply not enough to counter the auto-decaying, allowing an anon-memory-leaking low-priority cgroup to go over its allotted share of IOs. As auto-decaying doesn't make much sense for iocost, this patch introduces a different mode of operation for use_delay - when blkcg_set_delay() is used instead of blkcg_add/use_delay(), the delay duration is not auto-decayed until it is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the delay duration synchronized to the budget overage amount. With this change, iocost can effectively police cgroups which generate a significant amount of force-issued IOs. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
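A condensed sketch of the two modes (the real helpers in blk-cgroup.h use a negative use_delay count as the "no auto-decay" marker, with cmpxchg care that is omitted here; the _sketch names are mine):

```c
#include <linux/atomic.h>

/* iocost mode: pin the delay; it is never auto-decayed, only
 * overwritten by the next set or cleared explicitly. */
static inline void blkcg_set_delay_sketch(struct blkcg_gq *blkg, u64 delay_nsec)
{
	atomic_set(&blkg->use_delay, -1);	/* negative = no decay */
	atomic64_set(&blkg->delay_nsec, delay_nsec);
}

static inline void blkcg_clear_delay_sketch(struct blkcg_gq *blkg)
{
	atomic_set(&blkg->use_delay, 0);
	atomic64_set(&blkg->delay_nsec, 0);
}
```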
-
Committed by Yihao Wu
to #28739709 /proc/loadavg can reflect the waiting tasks over a period of time to some extent. But becoming an SLI requires better precision and quicker response. Furthermore, I/O blocking is not a concern here, and bandwidth control is excluded from cpu_stress. This patch adds a new interface, /proc/cpu_stress. It's based on task runtime tracking, so we don't need to deal with complex state transitions. And because task runtime tracking is done in most scheduler events, the precision is quite sufficient. Like loadavg, cpu_stress has 3 average windows too (1, 5, and 15 minutes). Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
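Since this is an out-of-tree interface, the exact implementation isn't shown here, but three such windows can be maintained with the same fixed-point exponential decay the kernel's loadavg uses (a sketch: the constants are the standard kernel loadavg values, while how 'active' is derived from task runtime tracking is specific to this patch, and the function name is illustrative):

```c
#define FSHIFT	11			/* bits of fixed-point precision */
#define FIXED_1	(1 << FSHIFT)		/* 1.0 in fixed point */
#define EXP_1	1884			/* 1/exp(5s/1min) in fixed point */
#define EXP_5	2014			/* 1/exp(5s/5min) */
#define EXP_15	2037			/* 1/exp(5s/15min) */

/* Fold one 5-second sample of 'active' (stressed time, fixed point)
 * into a decaying average, exactly like calc_load() does for loadavg. */
static unsigned long calc_stress_avg(unsigned long avg, unsigned long exp,
				     unsigned long active)
{
	unsigned long newload = avg * exp + active * (FIXED_1 - exp);

	if (active >= avg)
		newload += FIXED_1 - 1;	/* round up while rising */
	return newload / FIXED_1;
}
```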
-
Committed by David Howells
task #29263287 commit 3f19b2ab97a97b413c24b66c67ae16daa4f56c35 upstream Make the inode hash table RCU searchable so that searches that want to access or modify an inode without taking a ref on that inode can do so without taking the inode hash table lock. The main thing this requires is some RCU annotation on the list manipulation operations. Inodes are already freed by RCU in most cases. Users of this interface must take care as the inode may be still under construction or may be being torn down around them. There are at least three instances where this can be of use: (1) Testing whether the inode number iunique() is going to return is currently unique (the iunique_lock is still held). (2) Ext4 date stamp updating. (3) AFS callback breaking. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> cc: linux-ext4@vger.kernel.org cc: linux-afs@lists.infradead.org [jeffle: resolve collision in afs_break_one_callback due to code base changes] Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
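The lookup side then follows the usual RCU list pattern (a minimal sketch with an illustrative name; the real callers also have to filter out inodes whose state shows they are being created or destroyed):

```c
#include <linux/fs.h>
#include <linux/rculist.h>

/* Caller must hold rcu_read_lock(). The returned inode is unreferenced
 * and may still be under construction or teardown. */
static struct inode *find_inode_by_ino_rcu_sketch(struct super_block *sb,
						  struct hlist_head *head,
						  unsigned long ino)
{
	struct inode *inode;

	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_ino == ino && inode->i_sb == sb)
			return inode;
	}
	return NULL;
}
```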
-
Committed by Dave Jiang
to #27305291 commit 89fa9d8ea7bdfa841d19044485cec5f4171069e5 upstream. With Intel DSM 1.8 [1], two new security DSMs are introduced: enable/update master passphrase and master secure erase. The master passphrase allows a secure erase to be performed without the user passphrase that is set on the NVDIMM. The master_update and master_erase commands are added to the sysfs knob in order to initiate the DSMs. They are similar in operation to update and erase. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 7d988097c546187ada602cc9bccd0f03d473eb8f upstream. Add support for the NVDIMM_FAMILY_INTEL "overwrite" capability as described by the Intel DSM spec v1.7. This will allow triggering of overwrite on Intel NVDIMMs. The overwrite operation can take tens of minutes. When the overwrite DSM is issued successfully, the NVDIMMs will be inaccessible. The kernel will do backoff polling to detect when the overwrite process is completed. According to the DSM spec v1.7, 128G NVDIMMs can take up to 15 minutes to perform overwrite, and larger DIMMs will take longer. Given that overwrite puts the DIMM in an indeterminate state until it completes, introduce the NDD_SECURITY_OVERWRITE flag to prevent other operations from executing when overwrite is happening. The NDD_WORK_PENDING flag is added to denote that there is a device reference on the nvdimm device for an async workqueue thread context. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 64e77c8c047fb91ea8c7800c1238108a72f0bf9c upstream. Add support to issue a secure erase DSM to the Intel nvdimm. The required passphrase is acquired from an encrypted key in the kernel user keyring. To trigger the action, "erase <keyid>" is written to the "security" sysfs attribute. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 03b65b22ada8115a7a7bfdf0789f6a94adfd6070 upstream. Add support to disable passphrase (security) for the Intel nvdimm. The passphrase used for disabling is pulled from an encrypted-key in the kernel user keyring. The action is triggered by writing "disable <keyid>" to the sysfs attribute "security". Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 4c6926a23b76ea23403976290cd45a7a143f6500 upstream. Add support to unlock the dimm via the kernel key management APIs. The passphrase is expected to be pulled from userspace through keyutils. The key management and sysfs attributes are libnvdimm generic. Encrypted keys are used to protect the nvdimm passphrase at rest. The master key can be a trusted-key sealed in a TPM (preferred), or an encrypted-key (more flexible, but with more exposure to a potential attacker). Signed-off-by: Dave Jiang <dave.jiang@intel.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 37833fb7989a9d3c3e26354e6878e682c340d718 upstream. Add support for freezing security on the Intel nvdimm. This locks out any changes to security for the DIMM until a hard reset of the DIMM is performed. This is triggered by writing "freeze" to the generic nvdimm/nmemX "security" sysfs attribute. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit f2989396553a0bd13f4b25f567a3dee3d722ce40 upstream. Some NVDIMMs, like the ones defined by the NVDIMM_FAMILY_INTEL command set, expose a security capability to lock the DIMMs at poweroff and require a passphrase to unlock them. The security model is derived from ATA security. In anticipation of other DIMMs implementing a similar scheme, and to abstract the core security implementation away from the device-specific details, introduce nvdimm_security_ops. Initially only a status retrieval operation, ->state(), is defined, along with the base infrastructure and definitions for future operations. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 76ef5e17252789da79db78341851922af0c16181 upstream. Export the lookup_user_key() symbol in order to allow the nvdimm passphrase update to retrieve user-injected keys. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit d6548ae4d16dc231dec22860c9c472bcb991fb15 upstream. The generated dimm id is needed for the sysfs attribute as well as being used as the identifier/description for the security key. Since it's constant and should never change, store it as a member of struct nvdimm. As nvdimm_create() continues to grow parameters relative to NFIT driver requirements, do not require other implementations to keep pace. Introduce __nvdimm_create() to carry the new parameters and keep nvdimm_create() with the long-standing default api. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [ Shile: fixed conflict in drivers/acpi/nfit/nfit.h ] Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit b3ed2ce024c36054e51cca2eb31a1cdbe4a5f11e upstream. Add command definitions for the security commands defined in Intel DSM specification v1.8 [1]. This includes "get security state", "set passphrase", "unlock unit", "freeze lock", "secure erase", "overwrite", "overwrite query", "master passphrase enable/disable", and "master erase". Since this adds several Intel definitions, move the relevant bits to their own header. These commands mutate physical data, but that manipulation is not cache coherent. The requirement to flush and invalidate caches makes these commands unsuitable to be called from userspace, so extra logic is added to detect and block these commands from being submitted via the ioctl command submission path. Lastly, the commands may contain sensitive key material that should not be dumped in a standard debug session. Update the nvdimm-command payload-dump facility to move security command payloads behind a default-off compile time switch. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [ Shile: fixed conflicts: This patch updated the file "drivers/acpi/nfit/intel.h". The header file is introduced by commit 0ead111 ("acpi, nfit: Collect shutdown status") in upstream, which also updates the test files. So let's fetch this part to fix the conflict: - tools/testing/nvdimm/test/nfit.c - tools/testing/nvdimm/test/nfit_test.h ] Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Mel Gorman
to #28825456 commit 73444bc4d8f92e46a20cb6bd3342fc2ea75c6787 upstream. syzbot reported the following regression in the latest merge window and it was confirmed by Qian Cai that a similar bug was visible from a different context. ====================================================== WARNING: possible circular locking dependency detected 4.20.0+ #297 Not tainted ------------------------------------------------------ syz-executor0/8529 is trying to acquire lock: 000000005e7fb829 (&pgdat->kswapd_wait){....}, at: __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120 but task is already holding lock: 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock include/linux/spinlock.h:329 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk mm/page_alloc.c:2548 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist mm/page_alloc.c:3021 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist mm/page_alloc.c:3050 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue mm/page_alloc.c:3072 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491 It appears to be a false positive in that the only way the lock ordering should be inverted is if kswapd is waking itself and the wakeup allocates debugging objects which should already be allocated if it's kswapd doing the waking. Nevertheless, the possibility exists and so it's best to avoid the problem. This patch flags a zone as needing a kswapd using the, surprisingly, unused zone flag field. The flag is read without the lock held to do the wakeup. It's possible that the flag-setting context is not the same as the flag-clearing context, or for small races to occur. However, each race possibility is harmless and there is no visible degradation in fragmentation treatment. While zone->flags could have continued to be unused, there is potential for moving some existing fields into the flags field instead. Particularly read-mostly ones like zone->initialized and zone->contiguous. Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Qian Cai <cai@lca.pw> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: include/linux/mmzone.h Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
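The fix boils down to splitting the "note" and "wake" steps across the lock boundary, roughly like this (a sketch; ZONE_BOOSTED_WATERMARK is the flag the patch uses, while the function names and condensed bodies here are illustrative):

```c
/* Under zone->lock: only record that kswapd should be woken. */
static void note_kswapd_wanted(struct zone *zone)
{
	set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
}

/* After zone->lock is dropped: the flag is read lock-free, and only
 * here is the waitqueue lock taken for the actual wakeup. */
static void maybe_wake_kswapd(struct zone *zone)
{
	if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
	}
}
```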
-
Committed by Mel Gorman
to #28825456 commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream. An external fragmentation event was previously described as: When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general, but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing event occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark, and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this patch whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and the allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
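The boost arithmetic itself is small (a sketch; watermark field naming varies across kernel versions, the trigger and reclaim sides are elided, and the function name is illustrative):

```c
/* Each external fragmentation event boosts the zone watermark by one
 * pageblock, capped at watermark_boost_factor (in 1/10000ths) of the
 * high watermark; kswapd clears the boost once it has reclaimed. */
static void boost_watermark_sketch(struct zone *zone)
{
	unsigned long max_boost;

	if (!watermark_boost_factor)
		return;

	max_boost = mult_frac(zone->watermark[WMARK_HIGH],
			      watermark_boost_factor, 10000);
	max_boost = max(pageblock_nr_pages, max_boost);

	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
				    max_boost);
}
```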
-
Committed by Josef Bacik
to #28718400 commit a75d4c33377277b6034dd1e2663bce444f952c14 upstream. Patch series "drop the mmap_sem when doing IO in the fault path", v6. Now that we have proper isolation in place with cgroups2 we have started going through and fixing the various priority inversions. Most are all gone now, but this one is sort of weird since it's not necessarily a priority inversion that happens within the kernel, but rather because of something userspace does. We have giant applications that we want to protect, and parts of these giant applications do things like watch the system state to determine how healthy the box is for load balancing and such. This involves running 'ps' or other such utilities. These utilities will often walk /proc/<pid>/whatever, and these files can sometimes need to down_read(&task->mmap_sem). Not usually a big deal, but we noticed when we are stress testing that sometimes our protected application has latency spikes trying to get the mmap_sem for tasks that are in lower priority cgroups. This is because any down_write() on a semaphore essentially turns it into a mutex, so even if we currently have it held for reading, any new readers will not be allowed on to keep from starving the writer. This is fine, except a lower priority task could be stuck doing IO because it has been throttled to the point that its IO is taking much longer than normal. But because a higher priority group depends on this completing it is now stuck behind lower priority work. In order to avoid this particular priority inversion we want to use the existing retry mechanism to stop from holding the mmap_sem at all if we are going to do IO. This already exists in the read case sort of, but needed to be extended for more than just grabbing the page lock. With io.latency we throttle at submit_bio() time, so the readahead stuff can block and even page_cache_read can block, so all these paths need to have the mmap_sem dropped. The other big thing is ->page_mkwrite. btrfs is particularly shitty here because we have to reserve space for the dirty page, which can be a very expensive operation. We use the same retry method as the read path, and simply cache the page and verify the page is still set up properly the next pass through ->page_mkwrite(). I've tested these patches with xfstests and there are no regressions. This patch (of 3): If we do not have a page at filemap_fault time we'll do this weird forced page_cache_read thing to populate the page, and then drop it again and loop around and find it. This makes for 2 ways we can read a page in filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP flag so that pagecache_get_page() will return an unlocked page that's in pagecache. Then use the normal page locking and readpage logic already in filemap_fault. This simplifies the no-page-in-page-cache case significantly. [akpm@linux-foundation.org: fix comment text] [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case] Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com Signed-off-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/filemap.c Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Jeremy Linton
fix #26734090 commit d24a0c7099b32b6981d7f126c45348e381718350 upstream ACPI 6.3 adds additional fields to the MADT GICC structure to describe SPE PPIs. We pick these out of the cached reference to the madt_gicc structure, similarly to the core PMU code. We then create a platform device referring to the IRQ and let the user/module loader decide whether to load the SPE driver. Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Committed by Marc Zyngier
task #25552995 commit f226650494c6aa87526d12135b7de8b8c074f3de upstream. The GICv3 architecture specification is incredibly misleading when it comes to PMR and the requirement for a DSB. It turns out that this DSB is only required if the CPU interface sends an Upstream Control message to the redistributor in order to update the RD's view of PMR. This message is only sent when ICC_CTLR_EL1.PMHE is set, which isn't the case in Linux. It can still be set from EL3, so some special care is required. But the upshot is that in the (hopefully large) majority of the cases, we can drop the DSB altogether. This relies on a new static key being set if the boot CPU has PMHE set. The drawback is that this static key has to be exported to modules. Cc: Will Deacon <will@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Julien Thierry <julien.thierry.kdev@gmail.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Zou Cao <zoucao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
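The resulting fast path is tiny (a sketch close to the helper that ends up in the arm64 tree; the _sketch name is mine):

```c
/* Set at boot if ICC_CTLR_EL1.PMHE is set, i.e. PMR writes generate
 * Upstream Control messages and need a DSB to be observed by the RD. */
DECLARE_STATIC_KEY_FALSE(gic_pmr_sync);

static inline void pmr_sync_sketch(void)
{
	if (static_branch_unlikely(&gic_pmr_sync))
		dsb(sy);	/* rare: only when PMHE was set from EL3 */
}
```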
-
Committed by Julien Thierry
task #25552995 commit 13b210ddf474d9f3368766008a89fe82a6f90b48 upstream Currently, irqflags are saved before calling runtime services and checked for mismatch on return. Provide a pair of overridable macros to save and restore (if needed) the state that needs to be preserved on return from a runtime service. This allows checking for flags that are not necessarily related to irqflags. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: linux-efi@vger.kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Zou Cao <zoucao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
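The override pair defaults to plain irqflags handling (this matches the shape of the macros added to the EFI runtime wrappers; architectures that need to preserve extra state define their own pair):

```c
/* Default: save/restore only the irqflags around a runtime service. */
#ifndef arch_efi_save_flags
#define arch_efi_save_flags(state_flags)	local_save_flags(state_flags)
#define arch_efi_restore_flags(state_flags)	local_irq_restore(state_flags)
#endif
```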
-
Committed by Julien Thierry
task #25552995 commit 6e4933a006616343f66c4702dc4fc56bb25e7b02 upstream NMI handling code should be executed between calls to nmi_enter and nmi_exit. Add a separate domain handler to properly set up NMI context when handling an interrupt requested as an NMI. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Zou Cao <zoucao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
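Condensed, the NMI domain handler differs from the plain IRQ one only in its bracketing (a sketch close to the upstream handler; the _sketch name is mine):

```c
int handle_domain_nmi_sketch(struct irq_domain *domain, unsigned int hwirq,
			     struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);
	unsigned int irq;
	int ret = 0;

	nmi_enter();		/* enter NMI context before the handler */

	irq = irq_find_mapping(domain, hwirq);
	if (likely(irq))
		generic_handle_irq(irq);
	else
		ret = -EINVAL;	/* ack_bad_irq() is not NMI-safe */

	nmi_exit();
	set_irq_regs(old_regs);
	return ret;
}
```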
-
Committed by Julien Thierry
task #25552995 commit 2dcf1fbcad352baaa5f47b17e57c5743c8eedbad upstream Provide flow handlers that are NMI safe for interrupts and percpu_devid interrupts. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Zou Cao <zoucao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Committed by Julien Thierry
task #25552995 commit 4b078c3f1a26487c39363089ba0d5c6b09f2a89f upstream Add support for percpu_devid interrupts treated as NMIs. Percpu_devid NMIs need to be set up/torn down on each CPU they target. The same restrictions as for global NMIs still apply for percpu_devid NMIs. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Zou Cao <zoucao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com> [ Shile: fixed conflicts in include/linux/interrupt.h ] Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
-
Committed by Julien Thierry
task #25552995 commit b525903c254dab2491410f0f23707691b7c2c317 upstream Add functionality to allocate interrupt lines that will deliver IRQs as Non-Maskable Interrupts. These allocations are only successful if the irqchip provides the necessary support and allows NMI delivery for the interrupt line. Interrupt lines allocated for NMI delivery must be enabled/disabled through enable_nmi/disable_nmi_nosync to keep their state consistent. To treat a PERCPU IRQ as an NMI, the interrupt must not be shared nor threaded, the irqchip directly managing the IRQ must be the root irqchip, and the irqchip cannot sit behind a slow bus. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Zou Cao <zoucao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
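Driver-side usage would look roughly like this (a sketch; my_nmi_handler and setup_my_nmi are hypothetical names, while request_nmi()/enable_nmi() are the interfaces this patch introduces):

```c
static irqreturn_t my_nmi_handler(int irq, void *dev_id)
{
	/* runs in NMI context: must not take locks shared with IRQs */
	return IRQ_HANDLED;
}

static int setup_my_nmi(unsigned int irq, void *dev)
{
	int ret;

	/* fails unless the irqchip allows NMI delivery on this line */
	ret = request_nmi(irq, my_nmi_handler, 0, "my-nmi", dev);
	if (ret)
		return ret;

	enable_nmi(irq);	/* NMI lines use the *_nmi enable/disable pair */
	return 0;
}
```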
-
Committed by Julien Thierry
task #25552995 commit 2130b789b3ef6a518b9c9c6f245642620e2b0c0c upstream. LPIs use the same priority value as other GIC interrupts. Make the GIC default priority definition visible to the ITS implementation and use this same definition for LPI priorities. Tested-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Julien Thierry <julien.thierry@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Zou Cao <zoucao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Committed by Robin Murphy
fix #27432135 commit 6af588fed39178c8e118fcf9cb6664e58a1fbe88 upstream While iommu_get_domain_for_dev() is the robust way for arbitrary IOMMU API callers to retrieve the domain pointer, for DMA ops domains it doesn't scale well for large systems and multi-queue devices, since the momentary refcount adjustment will lead to exclusive cacheline contention when multiple CPUs are operating in parallel on different mappings for the same device. In the case of DMA ops domains, however, this refcounting is actually unnecessary, since they already imply that the group exists and is managed by platform code and IOMMU internals (by virtue of iommu_group_get_for_dev()) such that a reference will already be held for the lifetime of the device. Thus we can avoid the bottleneck by providing a fast lookup specifically for the DMA code to retrieve the default domain it already knows it has set up - a simple read-only dereference plays much nicer with cache-coherency protocols. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Acked-by: zou cao <zoucao@linux.alibaba.com>
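The fast lookup then degenerates to a plain dereference (this matches the shape of the helper the patch adds; no refcount is taken because the group already holds the default domain for the device's lifetime):

```c
/* Only valid for devices whose group is managed by the DMA API. */
struct iommu_domain *iommu_get_dma_domain_sketch(struct device *dev)
{
	return dev->iommu_group->default_domain;
}
```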
-
- 29 June 2020, 4 commits
-
-
Committed by Pavel Begunkov
to #28736503 commit 9dafdfc2f0a3ae551711098de3d7b621a469f11a upstream Export do_tee() for use in io_uring. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by James Morse
fix #28612342 commit f9f05395f384ee858520b6c65d7e3e436af20c53 upstream If the GHES notification type is SDEI, register the provided event using the SDEI-GHES helper. SDEI events may be one of two types, normal and critical. Critical events can interrupt normal events, so these must have separate fixmap slots and locks in case both event types are in use. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Committed by James Morse
fix #28612342 commit f96935d3bc38a5f4b5188b6470a10e3fb8c3f0cc upstream APEI's Generic Hardware Error Source structures do not describe whether the SDEI event is shared or private, as this information is discoverable via the API. GHES needs to know whether an event is normal or critical to avoid sharing locks or fixmap entries, but GHES shouldn't have to know about the SDEI API. Add a helper to register the GHES using the appropriate normal or critical callback. Signed-off-by: James Morse <james.morse@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Committed by James Morse
fix #28612342 commit 062022315e8ad9e0628515dfc756ab54b5fdb26b upstream The GHES code calls memory_failure_queue() from IRQ context to schedule work on the current CPU so that memory_failure() can sleep. For synchronous memory errors, the arch code needs to know that any signals memory_failure() will trigger are pending before it returns to user-space, possibly when exiting from the IRQ. Add a helper to kick the memory failure queue, to ensure the scheduled work has happened. This has to be called from process context, so it may have been migrated from the original CPU. Pass the CPU the work was queued on. Change memory_failure_work_func() to permit being called on the 'wrong' CPU. Signed-off-by: James Morse <james.morse@arm.com> Tested-by: Tyler Baicar <baicar@os.amperecomputing.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
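The kick helper is essentially "flush the per-CPU work now" (a sketch close to the upstream helper; memory_failure_cpu is the per-CPU queue mm/memory-failure.c already maintains, and the _sketch name is mine):

```c
/* Process context only: flush memory_failure() work queued on @cpu so
 * any resulting signals are pending before returning to user space. */
void memory_failure_queue_kick_sketch(int cpu)
{
	struct memory_failure_cpu *mf_cpu;

	mf_cpu = &per_cpu(memory_failure_cpu, cpu);
	cancel_work_sync(&mf_cpu->work);
	/* memory_failure_work_func() now tolerates the 'wrong' CPU,
	 * so run it directly to drain the queue */
	memory_failure_work_func(&mf_cpu->work);
}
```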
-