- 18 Mar, 2020: 40 commits
-
Submitted by Joao Martins

commit 11c59eae6633b8a7e77b8ee1cf908964d80c78cd upstream

The recently introduced haltpoll driver is largely only useful with the haltpoll governor. To allow drivers to associate with a particular idle behaviour, add a @governor property to 'struct cpuidle_driver' and thus allow a cpuidle driver to switch to a *preferred* governor on idle driver registration. We save the previous governor, and when an idle driver is unregistered we switch back to that.

The @governor can be overridden by the cpuidle.governor= boot parameter or alternatively be ignored if the governor doesn't exist.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
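A minimal sketch of how a driver can express its preferred governor through the new @governor field; the idle-state table and other fields are omitted, and this is an illustration of the described mechanism rather than the exact upstream diff:

    #include <linux/cpuidle.h>
    #include <linux/module.h>

    static struct cpuidle_driver haltpoll_driver = {
            .name     = "haltpoll",
            .owner    = THIS_MODULE,
            /* preferred governor; the cpuidle core switches to it on
             * cpuidle_register_driver() and restores the old one on unregister */
            .governor = "haltpoll",
    };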
-
Submitted by Marcelo Tosatti

commit 97d3eb9da84cae0548359b0aecb8619faad003b7 upstream

When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus past the online ones and thus fail to register the idle driver. This is because cpuidle_add_sysfs() will return with -ENODEV as a consequence of get_cpu_device() returning no device for a non-existing CPU.

Instead switch to cpuidle_register_driver() and manually register each of the present cpus through cpuhp_setup_state() callbacks, as well as future ones that get onlined or offlined. This mimics similar logic that intel_idle does.

Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
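A rough sketch of the per-CPU registration scheme described above, using a dynamic hotplug state; the callback names and the state label are illustrative, not necessarily the ones the driver uses:

    #include <linux/cpuhotplug.h>
    #include <linux/cpuidle.h>

    static enum cpuhp_state haltpoll_hp_state;

    static int haltpoll_cpu_online(unsigned int cpu)
    {
            /* allocate and register this CPU's cpuidle device here */
            return 0;
    }

    static int haltpoll_cpu_offline(unsigned int cpu)
    {
            /* unregister this CPU's cpuidle device here */
            return 0;
    }

    static int __init haltpoll_hp_init(void)
    {
            int ret;

            /* runs the online callback for every present CPU now, and for
             * any CPU that is onlined later; the offline callback runs on
             * CPU offline */
            ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "cpuidle/haltpoll:online",
                                    haltpoll_cpu_online, haltpoll_cpu_offline);
            if (ret < 0)
                    return ret;
            haltpoll_hp_state = ret;
            return 0;
    }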
-
Submitted by Marcelo Tosatti

commit 2cffe9f6b96fece065ee8522673c90e92ef2085d upstream

The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle driver, allows guest vcpus to poll for a specified amount of time before halting. This provides the following benefits to host side polling:

1) The POLL flag is set while polling is performed, which allows a remote vCPU to avoid sending an IPI (and the associated cost of handling the IPI) when performing a wakeup.

2) The VM-exit cost can be avoided.

The downside of guest side polling is that polling is performed even with other runnable tasks in the host.

Results comparing halt_poll_ns and a server/client application where a small packet is ping-ponged:

host                                        --> 31.33
halt_poll_ns=300000 / no guest busy spin    --> 33.40 (93.8%)
halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73 (95.7%)

For the SAP HANA benchmarks (where idle_spin is a parameter of the previous version of the patch, results should be the same):

hpns == halt_poll_ns

                          idle_spin=0/   idle_spin=800/   idle_spin=0/
                          hpns=200000    hpns=0           hpns=800000
DeleteC06T03 (100 thread) 1.76           1.71 (-3%)       1.78 (+1%)
InsertC16T02 (100 thread) 2.14           2.07 (-3%)       2.18 (+1.8%)
DeleteC00T01 (1 thread)   1.34           1.28 (-4.5%)     1.29 (-3.7%)
UpdateC00T03 (1 thread)   4.72           4.18 (-12%)      4.53 (-5%)

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Submitted by Marcelo Tosatti

commit fa86ee90eb1111267de67cb4272b5ce711f18cbb upstream

Add a cpuidle driver that calls the architecture default_idle routine. To be used in conjunction with the haltpoll governor.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
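A simplified sketch of what such a driver's idle-state enter callback amounts to: hand the CPU to the architecture's default idle routine and report the state entered. This assumes default_idle() is visible to the driver (the case on x86, where this driver is used) and omits the polling-flag handling of the real code:

    static int default_enter_idle(struct cpuidle_device *dev,
                                  struct cpuidle_driver *drv, int index)
    {
            /* let the architecture halt the CPU until the next wakeup */
            default_idle();
            return index;
    }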
-
Submitted by Marcelo Tosatti

commit 7d4daeedd575bbc3c40c87fc6708a8b88c50fe7e upstream

Since this field is shared by all governors, move it to the cpuidle device structure.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Submitted by Rafael J. Wysocki

commit eb40a380bff28f84b6583bba6786b46ef26ef548 upstream

It is not necessary to update data->last_state_idx in menu_select(), as it is only used in menu_update(), which only runs when data->needs_update is set, and that is set only when updating data->last_state_idx in menu_reflect().

Accordingly, drop the update of data->last_state_idx from menu_select() and get rid of the (now redundant) "out" label from it.

No intentional behavior changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Submitted by Marcelo Tosatti

commit 259231a045616c4101d023a8f4dcc8379af265a6 upstream

Add a poll_limit_ns variable to the cpuidle_device structure. Calculate and configure it in the new cpuidle_poll_time function, in case it is zero. Individual governors are allowed to override this value.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
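A sketch of what cpuidle_poll_time() computes per the description above: cache the limit in poll_limit_ns and derive it from the target residency of the first enabled state deeper than state 0; details are simplified relative to the real helper:

    static u64 cpuidle_poll_time_sketch(struct cpuidle_driver *drv,
                                        struct cpuidle_device *dev)
    {
            u64 limit_ns = TICK_NSEC;   /* fallback if every deeper state is disabled */
            int i;

            if (dev->poll_limit_ns)
                    return dev->poll_limit_ns;   /* already calculated or overridden */

            for (i = 1; i < drv->state_count; i++) {
                    if (dev->states_usage[i].disable)
                            continue;
                    /* target_residency is in microseconds here */
                    limit_ns = (u64)drv->states[i].target_residency * NSEC_PER_USEC;
                    break;
            }

            dev->poll_limit_ns = limit_ns;
            return dev->poll_limit_ns;
    }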
-
Submitted by Doug Smythies

commit 1617971c6616c87185cbc78fa1a86dfc70dd16b6 upstream

The default time is declared in units of microseconds, but is used as nanoseconds, resulting in significant accounting errors for idle state 0 time when all idle states deeper than 0 are disabled. Under these unusual conditions, we don't really care about the poll time limit anyhow.

Fixes: 800fb34a99ce ("cpuidle: poll_state: Disregard disable idle states")
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Submitted by Rafael J. Wysocki

commit 61cb5758d3c46bc1ba87694fefc0d9653613ce6b upstream

Add the cpuidle.governor= command line parameter to allow the default cpuidle governor to be replaced. That is useful, for example, if someone running a tickful kernel wants to use the menu governor on it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
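As an illustrative example (not taken from the patch itself): booting with cpuidle.governor=menu on the kernel command line selects the menu governor instead of whichever governor would otherwise be chosen by default.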
-
Submitted by Rafael J. Wysocki

commit 800fb34a99ce7d22dca839c90f869c7a12b50f70 upstream

When computing the limit of time to spend in the loop in poll_idle(), use the target residency of the first enabled idle state deeper than state 0 instead of always using the target residency of state 1. This helps when state 1 is disabled for diagnostics, for instance.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Submitted by Rafael J. Wysocki

commit 01bad1c6896db021db82042e71c2bf1f97cc026b upstream

If need_resched() returns "false", breaking out of the loop in poll_idle() will cause a new idle state to be selected, so in fact it usually doesn't make sense to spin in it longer than the target residency of the second state. [Note that the "polling" state is used only if there is at least one "real" state defined in addition to it, so the second state is always there.] On the other hand, breaking out of it early (say in case the next state is disabled) shouldn't hurt, as it is polling anyway.

For this reason, make the loop in poll_idle() break if the CPU has been spinning longer than the target residency of the second state (the "polling" state can only be state[0]).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
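A condensed sketch of the bounded poll loop this describes; the real poll_idle() also amortizes the clock reads over many cpu_relax() iterations, which is omitted here, and limit_ns stands in for the second state's target residency:

    static void poll_idle_loop_sketch(u64 limit_ns)
    {
            u64 start = local_clock();

            while (!need_resched()) {
                    cpu_relax();
                    if (local_clock() - start > limit_ns)
                            break;  /* spun past the next state's target residency */
            }
    }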
-
Submitted by Xiaoguang Wang

commit c7849be9cc2dd2754c48ddbaca27c2de6d80a95d upstream.

Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending"), if we already have events pending, we won't enter the poll loop. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if the app has been terminated and doesn't reap pending events which are already in the cq ring, and there are some reqs in poll_list, io_sq_thread will enter __io_iopoll_check(), find pending events, and then return; this loop will never have a chance to exit.

I have seen this issue in fio stress tests. To fix this issue, let io_sq_thread call io_iopoll_getevents() with argument 'min' being zero, and remove __io_iopoll_check().

Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending")
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Caspar Zhang

Cloud Kernel is the official name of our project; this patch unifies the project names used in docs and comments.

Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Wenwei Tao

Some users want to know the killed task's cgroup info in a global OOM; this message helps them make higher-level decisions.

Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Wenwei Tao

Enable oom.group on cgroup-v1.

Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Wenwei Tao

Add "memory.priority" and "memory.use_priority_oom" descriptions.

Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Submitted by Wenwei Tao

Under memory pressure, reclaim and oom can happen; with multiple cgroups in one system, we might want some of their memory or tasks to survive the reclaim and oom while there are other candidates. The @memory.low and @memory.min have made that possible during reclaim; this patch introduces memcg priority oom to meet the same requirement during oom.

The priority ranges from 0 to 12; the higher the number, the higher the priority. When oom happens, the victim is always chosen from a low-priority memcg. This works both for memcg oom and global oom, and it can be enabled/disabled through @memory.use_priority_oom (for global oom, through the root memcg's @memory.use_priority_oom); it is disabled by default.

Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
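As an illustrative usage example (mount point and group names assumed, not taken from the patch): writing 12 to a latency-sensitive group's memory.priority, 0 to a batch group's memory.priority, and 1 to the root memcg's memory.use_priority_oom would make a global oom pick its victim from the batch group first.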
-
Submitted by Wenwei Tao

Account the number of tasks in the css and its descendants; this is preparation for the incoming memcg priority patch. In memcg priority oom, we will select a victim cgroup which has victim tasks in it. We need to know whether the memcg and its descendants have tasks before the selection can move on.

Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Xunlei Pang

This file collects all the interfaces specific to Alibaba Cloud Kernel. Add "memory.wmark_min_adj", "memory.exstat", and "zombie memcgs reaper" descriptions.

Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Xunlei Pang

Accessing the original memory.stat turned out to be a heavy operation which has caused many real production problems. Introduce the new cgroup file memory.exstat, which stands for "extra/extended memory.stat" and contains dedicated statistics from Alibaba Cloud Kernel. memory.exstat is supposed to provide hierarchical statistics.

Export its first field, "wmark_min_throttled_ms"; more will be added later, such as direct reclaim, direct compaction, etc.

Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Xunlei Pang

In a co-location environment, there is more or less some memory overcommitment, so BATCH tasks may break the shared global min watermark, resulting in all types of applications falling into the direct reclaim slow path and hurting the RT of LS tasks. (NOTE: BATCH tasks tolerate big latency spikes, even in seconds, as long as their overall throughput is not hurt, while LS tasks are very Latency-Sensitive; they may time out or fail in case of a sudden latency spike that lasts hundreds of ms typically.)

Actually BATCH tasks are not sensitive to memory latency; they can be assigned a strict min watermark which is different from that of LS tasks (which can be assigned a lenient min watermark accordingly), thus isolating each other in case of global memory allocation. This is kind of like the idea behind ALLOC_HARDER for rt_task(), see gfp_to_alloc_flags().

memory.wmark_min_adj stands for memcg global WMARK_MIN adjustment; it is used to realize the separate min watermarks mentioned above for memcgs. Its valid value is within [-25, 50]; specifically, a negative value is relative to [0, WMARK_MIN], and a positive value is relative to [WMARK_MIN, WMARK_LOW]. For example:

  -25 means "WMARK_MIN + (WMARK_MIN - 0) * (-25%)"
   50 means "WMARK_MIN + (WMARK_LOW - WMARK_MIN) * 50%"

Note that the minimum -25 is what ALLOC_HARDER uses, which is safe for us to adopt, and the maximum 50 is an experienced value.

A negative memory.wmark_min_adj means high QoS requirements: the memcg can allocate below the global WMARK_MIN, which is kind of like the idea behind ALLOC_HARDER, see gfp_to_alloc_flags(). A positive memory.wmark_min_adj means low QoS requirements: when an allocation breaks the memcg min watermark, it would traditionally trigger direct reclaim, and we trigger throttling instead to further prevent such tasks from disturbing others. With this interface, we can assign positive values to BATCH memcgs and negative values to LS memcgs.

The default value of memory.wmark_min_adj is 0, inherited from its parent. Note that the final effective wmark_min_adj considers all the hierarchical values: it is the maximal (most conservative) wmark_min_adj along the hierarchy, excluding intermediate default values (zero).

Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
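A worked example of the formulas above (numbers invented for illustration): with WMARK_MIN = 4000 pages and WMARK_LOW = 5000 pages for a zone, wmark_min_adj = -25 gives an effective minimum of 4000 + 4000 * (-25%) = 3000 pages, letting the memcg allocate below the global WMARK_MIN, while wmark_min_adj = 50 gives 4000 + (5000 - 4000) * 50% = 4500 pages, so the memcg is throttled before it can push the zone down to the global minimum.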
-
Submitted by Xunlei Pang

After a memcg is deleted, page caches still reference it, causing a large number of dead (zombie) memcgs in the system. This slows down access to "/sys/fs/cgroup/cpu/memory.stat", etc., due to tons of iterations, further causing various latencies. This patch introduces two ways to reclaim these zombie memcgs.

1) Background kthread reaper

Introduce a kernel thread "memcg_zombie_reaper" to reclaim zombie memcgs in the background periodically. Several knobs are also added to control the reaper scan frequency:
- /sys/kernel/mm/memcg_reaper/scan_interval: the scan period in seconds. Default 5s.
- /sys/kernel/mm/memcg_reaper/pages_scan: the scan rate of pages per scan. Default 1310720 (5GiB for 4KiB pages).
- /sys/kernel/mm/memcg_reaper/verbose: output some zombie memcg information for debug purposes. Default off.
- /sys/kernel/mm/memcg_reaper/reap_background: "on/off" switch. Default "0", meaning off; write "1" to switch it on.

2) One-shot trigger by users

- /sys/kernel/mm/memcg_reaper/reap: write "1" to trigger one round of zombie memcg reaping, but without any guarantee; you may need to launch multiple rounds as needed.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Xiaoguang Wang

When CONFIG_BLK_DEV_THROTTLING is enabled, even though we may not set a block cgroup's blk-throttle bps or iops limits, every bio still enters blk_throtl_bio() first, and this bug results in the corresponding blkcg_gq's refcnt increasing by 1 for every bio. atomic_t is an 'int' type, and if the user continually issues batches of bios, this refcnt will overflow, which will trigger a WARNING in blkg_get() or blkg_put().

Fixes: bc0cc360 ("alinux: blk-throttle: fix tg NULL pointer dereference")
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Xiaoguang Wang

Since commit "b1b4705d54ab ext4: introduce direct I/O read using iomap infrastructure", we can easily make ext4 support the iopoll method: just use iomap_dio_iopoll().

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
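A sketch of what wiring this up amounts to: point the ->iopoll hook of ext4's file operations at the generic iomap helper. Surrounding fields are abbreviated; treat this as an illustration of the idea rather than the exact diff:

    static const struct file_operations ext4_file_operations_sketch = {
            .read_iter  = ext4_file_read_iter,
            .write_iter = ext4_file_write_iter,
            .iopoll     = iomap_dio_iopoll,   /* poll for completion of iomap DIO bios */
            /* ... remaining hooks unchanged ... */
    };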
-
Submitted by Ritesh Harjani

commit bc6385dab125d20870f0eb9ca9e589f43abb3f56 upstream.

We were using shared locking only in case of the dioread_nolock mount option for DIO overwrites. This mount condition is not needed anymore with the current code, since:

1. There is no race between buffered writes & DIO overwrites, since buffered writes take the exclusive lock and DIO overwrites take the shared lock. The DIO path will also make sure to flush and wait for any dirty page cache data.

2. There is no race between buffered reads & DIO overwrites, since no block allocation is possible with DIO overwrites, so no stale data exposure should happen. The same is the case between DIO reads & DIO overwrites.

3. Other paths like truncate are also protected, since we wait there for any DIO in flight to be over.

Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-4-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Ritesh Harjani

commit aa9714d0e39788d0688474c9d5f6a9a36159599f upstream.

Earlier there was no shared lock in the DIO read path. But this patch (16c54688: "ext4: Allow parallel DIO reads") simplified some of the locking mechanism while still allowing for parallel DIO reads by adding a shared lock in the inode DIO read path.

But this created a problem with mixed read/write workloads. It is due to the fact that in the DIO write path, we first start with the exclusive lock and only when we determine that it is an overwrite IO do we downgrade the lock. This causes the problem, since we still have shared locking in DIO reads.

So, this patch tries to fix this issue by starting with the shared lock and then switching to the exclusive lock only when required, based on ext4_dio_write_checks(). Other than that, it also simplifies the cases below:

1. Simplified the ext4_unaligned_aio API to ext4_unaligned_io. The previous API was abused in the sense that it was not really checking for AIO anywhere, and it also used to check for extending writes. So this API was renamed and simplified to ext4_unaligned_io(), which actually only checks if the IO is really unaligned. Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of the partial block, and that requires serialization against other direct IOs in the same block. So we take an exclusive inode lock for any unaligned DIO. In case of AIO we also need to wait for any outstanding IOs to complete, so that the conversion from unwritten to written is completed before anyone tries to map the overlapping block. Hence we take the exclusive inode lock and also wait in inode_dio_wait() for the unaligned DIO case. Please note that since we are anyway taking an exclusive lock for unaligned IO, inode_dio_wait() becomes a no-op in case of non-AIO DIO.

2. Added ext4_extending_io(). This checks if the IO is extending the file.

3. Added ext4_dio_write_checks(). In this we start with the shared inode lock and only switch to the exclusive lock if required. So in most cases with aligned, non-extending, dioread_nolock overwrites, it tries to write with a shared lock. If not, then we restart the operation in ext4_dio_write_checks() after acquiring the exclusive lock.

Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
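A rough sketch of the "start shared, upgrade only if needed" pattern described in point 3; the function and the needs_exclusive_lock() predicate are hypothetical placeholders, not the actual ext4 helpers:

    /* hypothetical predicate: does this write need the exclusive lock?
     * (unaligned, extending, or not a pure overwrite) */
    static bool needs_exclusive_lock(struct kiocb *iocb, struct iov_iter *from)
    {
            return false;   /* placeholder; the real checks live in ext4_dio_write_checks() */
    }

    static ssize_t dio_write_locking_sketch(struct kiocb *iocb, struct iov_iter *from)
    {
            struct inode *inode = file_inode(iocb->ki_filp);
            bool shared = true;

    restart:
            if (shared)
                    inode_lock_shared(inode);
            else
                    inode_lock(inode);

            if (shared && needs_exclusive_lock(iocb, from)) {
                    /* drop the shared lock and retry with the exclusive one */
                    inode_unlock_shared(inode);
                    shared = false;
                    goto restart;
            }

            /* ... perform the write, then drop whichever lock is held ... */
            return 0;
    }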
-
Submitted by Ritesh Harjani

commit f629afe3369e9885fd6e9cc7a4f514b6a65cf9e9 upstream.

Apparently our current rwsem code doesn't like doing the trylock-then-lock-for-real scheme. So change our dax read/write methods to just do the trylock for the RWF_NOWAIT case. This seems to fix an AIM7 regression in some scalable filesystems, up to ~25% in some cases. Claimed in commit 942491c9 ("xfs: fix AIM7 regression").

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
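A sketch of the resulting IOCB_NOWAIT handling: take the shared lock with a single trylock and bail with -EAGAIN instead of sleeping; the function name is illustrative and the actual read work is elided:

    static ssize_t dax_read_nowait_sketch(struct kiocb *iocb, struct inode *inode)
    {
            if (iocb->ki_flags & IOCB_NOWAIT) {
                    if (!inode_trylock_shared(inode))
                            return -EAGAIN;     /* don't sleep for RWF_NOWAIT */
            } else {
                    inode_lock_shared(inode);
            }

            /* ... perform the dax read under the shared lock ... */

            inode_unlock_shared(inode);
            return 0;
    }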
-
Submitted by Matthew Bobrowski

commit 378f32bab3714f04c4e0c3aee4129f6703805550 upstream.

This patch introduces a new direct I/O write path which makes use of the iomap infrastructure. All direct I/O writes are now passed from the ->write_iter() callback through to the new direct I/O handler ext4_dio_write_iter(). This function is responsible for calling into the iomap infrastructure via iomap_dio_rw().

Code snippets from the existing direct I/O write code within ext4_file_write_iter(), such as checking whether the I/O request is unaligned asynchronous I/O or whether the write will result in an overwrite, have effectively been moved out and into the new direct I/O ->write_iter() handler. The block mapping flags that are eventually passed down to ext4_map_blocks() from the *_get_block_*() suite of routines have been taken out and introduced within ext4_iomap_alloc().

For inode extension cases, ext4_handle_inode_extension() is effectively the function responsible for performing such metadata updates. This is called after iomap_dio_rw() has returned, so that we can safely determine whether we need to potentially truncate any allocated blocks that may have been prepared for this direct I/O write. We don't perform the inode extension or truncate operations from the ->end_io() handler, as we don't have the original I/O 'length' available there. The ->end_io() handler, however, is responsible for converting allocated unwritten extents to written extents.

In the instance of a short write, we fall back and complete the remainder of the I/O using buffered I/O via ext4_buffered_write_iter().

The existing buffer_head direct I/O implementation has been removed, as it's now redundant.

[ Fix up ext4_dio_write_iter() per Jan's comments at https://lore.kernel.org/r/20191105135932.GN22379@quack2.suse.cz -- TYT ]

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/e55db6f12ae6ff017f36774135e79f3e7b0333da.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Christoph Hellwig

commit 838c4f3d7515efe9d0e32c846fb5d102b6d8a29d upstream.

Add a new iomap_dio_ops structure that for now just contains the end_io handler. This avoids storing the function pointer in a mutable structure, which is a possible exploit vector for kernel code execution, and prepares for adding a submit_io handler that btrfs needs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
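A sketch of how a filesystem would use the new ops table: define a const iomap_dio_ops with its end_io hook and pass it to iomap_dio_rw() instead of a bare function pointer. The my_* names are illustrative, not from any particular filesystem:

    static int my_dio_end_io(struct kiocb *iocb, ssize_t size, int error,
                             unsigned int flags)
    {
            /* e.g. convert unwritten extents, update i_size on success */
            return error;
    }

    /* const, so the function pointer lives in read-only data */
    static const struct iomap_dio_ops my_dio_ops = {
            .end_io = my_dio_end_io,
    };

    /* callers then pass &my_dio_ops as the ops argument of iomap_dio_rw() */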
-
Submitted by Matthew Bobrowski

commit 3eaf9cc62f447a742b26fa601993e94406aa1ea1 upstream.

When the filesystem is created without a journal, we eventually call into __generic_file_fsync() in order to write out all the modified in-core data to the permanent storage device. This function happens to try and obtain an inode_lock() while synchronizing the file's buffers and associated metadata.

Generally, this is fine; however, it becomes a problem when there is higher level code that has already obtained an inode_lock(), as this leads to a recursive lock situation. This case is especially true when porting across direct I/O to the iomap infrastructure, as we obtain an inode_lock() early on in the I/O within ext4_dio_write_iter() and hold it until the I/O has been completed. Consequently, to not run into this specific issue, we move away from calling into __generic_file_fsync() and perform the necessary synchronization tasks within ext4_sync_file().

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/3495f35ef67f2021b567e28e6f59222e583689b8.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Matthew Bobrowski

commit 0b9f230b94dd7457802264dc4c16921b3527dcf1 upstream.

Lift the inode extension/orphan list handling code out from ext4_iomap_alloc() and apply it within ext4_dax_write_iter().

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/fd5c84db25d5d0da87d97ed4c36fd844f57da759.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Matthew Bobrowski

commit 569342dc2485392e95b6a626281708c25014ba37 upstream.

In preparation for implementing the iomap direct I/O modifications, the inode extension/truncate code needs to be moved out from the ext4_iomap_end() callback. For direct I/O, if the current code remained, it would behave incorrectly. Updating the inode size prior to converting unwritten extents would potentially allow a racing direct I/O read to find unwritten extents before they are converted correctly.

The inode extension/truncate code now resides within a new helper ext4_handle_inode_extension(). This function has been designed so that it can accommodate both DAX and direct I/O extension/truncate operations.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d41ffa26e20b15b12895812c3cad7c91a6a59bc6.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Matthew Bobrowski

commit b1b4705d54abedfd69dcdf42779c521aa1e0fbd3 upstream.

This patch introduces a new direct I/O read path which makes use of the iomap infrastructure. The new function ext4_do_read_iter() is responsible for calling into the iomap infrastructure via iomap_dio_rw(). If the read operation performed on the inode is not supported, which is checked via ext4_dio_supported(), then we simply fall back and complete the I/O using buffered I/O.

The existing direct I/O read code path has been removed, as it is now redundant.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f98a6f73fadddbfbad0fc5ed04f712ca0b799f37.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Matthew Bobrowski

commit 09edf4d381957b144440bac18a4769c53063b943 upstream.

As part of the ext4_iomap_begin() cleanups that precede this patch, we also split up the IOMAP_REPORT branch into a completely separate ->iomap_begin() callback named ext4_iomap_begin_report(). Again, the rationale for this change is to reduce the overall clutter within ext4_iomap_begin().

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5c97a569e26ddb6696e3d3ac9fbde41317e029a0.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Goldwyn Rodrigues

commit c039b99792726346ad46ff17c5a5bcb77a5edac4 upstream.

The srcmap is used to identify where the read is to be performed from. It is passed to ->iomap_begin, which can fill it in if we need to read data for partially written blocks from a different location than the write target. The srcmap is only supported for buffered writes so far.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as srcmap if not set, adjust length down to srcmap end as well]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Matthew Bobrowski

commit f063db5ee989aafe2dc9d571b5538f2a1f1cbad2 upstream.

In preparation for porting the ext4 direct I/O path over to the iomap infrastructure, split up the IOMAP_WRITE branch that's currently within ext4_iomap_begin() into a separate helper ext4_alloc_iomap(). This way, when we add in the necessary code for direct I/O, we don't end up with ext4_iomap_begin() becoming a monstrous twisty maze.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/50eef383add1ea529651640574111076c55aca9f.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Matthew Bobrowski

commit c8fdfe294187455b70e42a15df35a3e1882f332d upstream.

Separate the iomap field population code that is currently within ext4_iomap_begin() into a separate helper ext4_set_iomap(). The intent of this function is self-explanatory; however, the rationale behind taking this step is to reduce the overall clutter that we currently have within the ext4_iomap_begin() callback.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1ea34da65eecffcddffb2386668ae06134e8deaf.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Matthew Bobrowski

commit 2e9b51d78229d5145725a481bb5464ebc0a3f9b2 upstream.

This patch addresses what Dave Chinner had discovered and fixed within commit 7684e2c4384d. This change does not have any user-visible impact for ext4, as none of the current users of ext4_iomap_begin() that extend files depend on IOMAP_F_DIRTY.

When doing a direct IO that spans the current EOF, and there are written blocks beyond EOF that extend beyond the current write, the only metadata update that needs to be done is a file size extension. However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that there are IO completion metadata updates required, and hence we may fail to correctly sync file size extensions made in IO completion when O_DSYNC writes are being used and the hardware supports FUA.

Hence when setting IOMAP_F_DIRTY, we need to also take into account whether the iomap spans the current EOF. If it does, then we need to mark it dirty so that IO completion will call generic_write_sync() to flush the inode size update to stable storage correctly.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8b43ee9ee94bee5328da56ba0909b7d2229ef150.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
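A sketch of the extra condition this describes, as it would sit inside the ->iomap_begin path; this is a simplification of the real check, which also considers other pending datasync-relevant metadata:

    /* a write spanning EOF will update i_size on completion, so mark the
     * mapping dirty to force generic_write_sync() for O_DSYNC/FUA writes */
    if (offset + length > i_size_read(inode))
            iomap->flags |= IOMAP_F_DIRTY;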
-
Submitted by Matthew Bobrowski

commit 548feebec7e93e58b647dba70b3303dcb569c914 upstream.

This patch updates the lock pattern in ext4_direct_IO_read() to not block on the inode lock in cases of IOCB_NOWAIT direct I/O reads. The locking condition implemented here is similar to that of 942491c9 ("xfs: fix AIM7 regression").

Fixes: 16c54688 ("ext4: Allow parallel DIO reads")
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c5d5e759f91747359fbd2c6f9a36240cf75ad79f.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Matthew Bobrowski

commit 53e5cca56795a301bbe8465781dab084f7ae8d54 upstream.

For the direct I/O changes that follow in this patch series, we need to accommodate the case where the block mapping flags passed through to ext4_map_blocks() result in m_flags having both the EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In order for any allocated unwritten extents to be converted correctly in the ->end_io() handler, the iomap->type must be set to IOMAP_UNWRITTEN for cases where the EXT4_MAP_UNWRITTEN bit has been set within m_flags. Hence the reason why we need to reshuffle this conditional statement around.

This change is a no-op for DAX, as the block mapping flags passed through to ext4_map_blocks(), i.e. EXT4_GET_BLOCKS_CREATE_ZERO, never result in both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1309ad80d31a637b2deed55a85283d582a54a26a.1572949325.git.mbobrowski@mbobrowski.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
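A condensed sketch of the reordered conditional this describes: test the unwritten bit first so that mixed MAPPED|UNWRITTEN mappings end up as IOMAP_UNWRITTEN; hole/delalloc handling in the real code is reduced to a single fallback here:

    if (map.m_flags & EXT4_MAP_UNWRITTEN)
            iomap->type = IOMAP_UNWRITTEN;   /* ->end_io() must convert it later */
    else if (map.m_flags & EXT4_MAP_MAPPED)
            iomap->type = IOMAP_MAPPED;
    else
            iomap->type = IOMAP_HOLE;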
-