提交 · 09c2c359be546df45be0b158ea1d3cc8ea83c876 · openanolis / cloud-kernel

21 4月, 2017 5 次提交

bdi: Drop 'parent' argument from bdi_register[_va]() · 7c4cc300

由 Jan Kara 提交于 4月 12, 2017

Drop 'parent' argument of bdi_register() and bdi_register_va().  It is
always NULL.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

7c4cc300

block: Remove unused functions · 2e82b84c

由 Jan Kara 提交于 4月 12, 2017

Now that all backing_dev_info structure are allocated separately, we can
drop some unused functions.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

2e82b84c

bdi: Export bdi_alloc_node() and bdi_put() · 62bf42ad

由 Jan Kara 提交于 4月 12, 2017

MTD will want to call bdi_alloc_node() and bdi_put() directly. Export
these functions.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

62bf42ad

block: Unregister bdi on last reference drop · 5af110b2

由 Jan Kara 提交于 4月 12, 2017

Most users will want to unregister bdi when dropping last reference to a
bdi. Only a few users (like block devices) want to play more complex
tricks with bdi registration and unregistration. So unregister bdi when
the last reference to bdi is dropped and just make sure we don't
unregister the bdi the second time if it is already unregistered.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

5af110b2

bdi: Provide bdi_register_va() and bdi_alloc() · baf7a616

由 Jan Kara 提交于 4月 12, 2017

Add function that registers bdi and takes va_list instead of variable
number of arguments.

Add bdi_alloc() as simple wrapper for NUMA-unaware users allocating BDI.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

baf7a616

23 3月, 2017 6 次提交

bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() · b1c51afc

由 Jan Kara 提交于 3月 23, 2017

Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

b1c51afc

bdi: Do not wait for cgwbs release in bdi_unregister() · 4514451e

由 Jan Kara 提交于 3月 23, 2017

Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is however unnecessary now when
cgwb->bdi is a proper refcounted reference (thus bdi cannot get
released before all cgwbs are released) and when cgwb_bdi_destroy()
shuts down writeback directly.
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

4514451e

bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() · 5318ce7d

由 Jan Kara 提交于 3月 23, 2017

Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
which also means that writeback has been shutdown on them. Since this
wait is going away, directly shutdown writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shutdown.
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

5318ce7d

bdi: Unify bdi->wb_list handling for root wb_writeback · e8cb72b3

由 Jan Kara 提交于 3月 23, 2017

Currently root wb_writeback structure is added to bdi->wb_list in
bdi_init() and never removed. That is different from all other
wb_writeback structures which get added to the list when created and
removed from it before wb_shutdown().

So move list addition of root bdi_writeback to bdi_register() and list
removal of all wb_writeback structures to wb_shutdown(). That way a
wb_writeback structure is on bdi->wb_list if and only if it can handle
writeback and it will make it easier for us to handle shutdown of all
wb_writeback structures in bdi_unregister().
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

e8cb72b3

bdi: Make wb->bdi a proper reference · 810df54a

由 Jan Kara 提交于 3月 23, 2017

Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

810df54a

bdi: Mark congested->bdi as internal · b7d680d7

由 Jan Kara 提交于 3月 23, 2017

congested->bdi pointer is used only to be able to remove congested
structure from bdi->cgwb_congested_tree on structure release. Moreover
the pointer can become NULL when we unregister the bdi. Rename the field
to __bdi and add a comment to make it more explicit this is internal
stuff of memcg writeback code and people should not use the field as
such use will be likely race prone.

We do not bother with converting congested->bdi to a proper refcounted
reference. It will be slightly ugly to special-case bdi->wb.congested to
avoid effectively a cyclic reference of bdi to itself and the reference
gets cleared from bdi_unregister() making it impossible to reference
a freed bdi.
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

b7d680d7

09 3月, 2017 2 次提交

bdi: Fix use-after-free in wb_congested_put() · df23de55

由 Jan Kara 提交于 3月 08, 2017

bdi_writeback_congested structures get created for each blkcg and bdi
regardless whether bdi is registered or not. When they are created in
unregistered bdi and the request queue (and thus bdi) is then destroyed
while blkg still holds reference to bdi_writeback_congested structure,
this structure will be referencing freed bdi and last wb_congested_put()
will try to remove the structure from already freed bdi.

With commit 165a5e22 "block: Move bdi_unregister() to
del_gendisk()", SCSI started to destroy bdis without calling
bdi_unregister() first (previously it was calling bdi_unregister() even
for unregistered bdis) and thus the code detaching
bdi_writeback_congested in cgwb_bdi_destroy() was not triggered and we
started hitting this use-after-free bug. It is enough to boot a KVM
instance with virtio-scsi device to trigger this behavior.

Fix the problem by detaching bdi_writeback_congested structures in
bdi_exit() instead of bdi_unregister(). This is also more logical as
they can get attached to bdi regardless whether it ever got registered
or not.

Fixes: 165a5e22Signed-off-by: NJan Kara <jack@suse.cz>
Tested-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

df23de55

block: Allow bdi re-registration · b6f8fec4

由 Jan Kara 提交于 3月 08, 2017

SCSI can call device_add_disk() several times for one request queue when
a device in unbound and bound, creating new gendisk each time. This will
lead to bdi being repeatedly registered and unregistered. This was not a
big problem until commit 165a5e22 "block: Move bdi_unregister() to
del_gendisk()" since bdi was only registered repeatedly (bdi_register()
handles repeated calls fine, only we ended up leaking reference to
gendisk due to overwriting bdi->owner) but unregistered only in
blk_cleanup_queue() which didn't get called repeatedly. After
165a5e22 we were doing correct bdi_register() - bdi_unregister()
cycles however bdi_unregister() is not prepared for it. So make sure
bdi_unregister() cleans up bdi in such a way that it is prepared for
a possible following bdi_register() call.

An easy way to provoke this behavior is to enable
CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
scsi disk which immediately hangs without this fix.

Fixes: 165a5e22Signed-off-by: NJan Kara <jack@suse.cz>
Tested-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

b6f8fec4

23 2月, 2017 1 次提交

mm/backing-dev.c: use rb_entry() · bc71226b

由 Geliang Tang 提交于 2月 22, 2017

To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Link: http://lkml.kernel.org/r/671275de093d93ddc7c6f77ddc0d357149691a39.1484306840.git.geliangtang@gmail.comSigned-off-by: NGeliang Tang <geliangtang@gmail.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bc71226b

09 2月, 2017 1 次提交

block: fix double-free in the failure path of cgwb_bdi_init() · 5f478e4e

由 Tejun Heo 提交于 2月 08, 2017

When !CONFIG_CGROUP_WRITEBACK, bdi has single bdi_writeback_congested
at bdi->wb_congested.  cgwb_bdi_init() allocates it with kzalloc() and
doesn't do further initialization.  This usually works fine as the
reference count gets bumped to 1 by wb_init() and the put from
wb_exit() releases it.

However, when wb_init() fails, it puts the wb base ref automatically
freeing the wb and the explicit kfree() in cgwb_bdi_init() error path
ends up trying to free the same pointer the second time causing a
double-free.

Fix it by explicitly initilizing the refcnt to 1 and putting the base
ref from cgwb_bdi_destroy().
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NDmitry Vyukov <dvyukov@google.com>
Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: NJens Axboe <axboe@fb.com>

5f478e4e

02 2月, 2017 1 次提交

block: Dynamically allocate and refcount backing_dev_info · d03f6cdc

由 Jan Kara 提交于 2月 02, 2017

Instead of storing backing_dev_info inside struct request_queue,
allocate it dynamically, reference count it, and free it when the last
reference is dropped. Currently only request_queue holds the reference
but in the following patch we add other users referencing
backing_dev_info.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

d03f6cdc

08 11月, 2016 1 次提交

writeback: track if we're sleeping on progress in balance_dirty_pages() · b57d74af

由 Jens Axboe 提交于 9月 01, 2016

Note in the bdi_writeback structure whenever a task ends up sleeping
waiting for progress. We can use that information in the lower layers
to increase the priority of writes.
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NJan Kara <jack@suse.cz>

b57d74af

05 8月, 2016 1 次提交

block: fix bdi vs gendisk lifetime mismatch · df08c32c

由 Dan Williams 提交于 7月 31, 2016

The name for a bdi of a gendisk is derived from the gendisk's devt.
However, since the gendisk is destroyed before the bdi it leaves a
window where a new gendisk could dynamically reuse the same devt while a
bdi with the same name is still live.  Arrange for the bdi to hold a
reference against its "owner" disk device while it is registered.
Otherwise we can hit sysfs duplicate name collisions like the following:

 WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

 Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
  0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
  ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
  0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
 Call Trace:
  [<ffffffff8134caec>] dump_stack+0x63/0x87
  [<ffffffff8108c351>] __warn+0xd1/0xf0
  [<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
  [<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
  [<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
  [<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
  [<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
  [<ffffffff8134ff55>] kobject_add+0x75/0xd0
  [<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
  [<ffffffff8148b0a5>] device_add+0x125/0x610
  [<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
  [<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
  [<ffffffff811b775c>] bdi_register+0x8c/0x180
  [<ffffffff811b7877>] bdi_register_dev+0x27/0x30
  [<ffffffff813317f5>] add_disk+0x175/0x4a0

Cc: <stable@vger.kernel.org>
Reported-by: NYi Zhang <yizhan@redhat.com>
Tested-by: NYi Zhang <yizhan@redhat.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

Fixed up missing 0 return in bdi_register_owner().
Signed-off-by: NJens Axboe <axboe@fb.com>

df08c32c

29 7月, 2016 1 次提交

mm, vmscan: move LRU lists to node · 599d0c95

由 Mel Gorman 提交于 7月 28, 2016

This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic. Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes. It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies. For example, the scans are
per-zone but using per-node counters. We also mark a node as congested
when a zone is congested. This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

When pages are skipped, they are immediately added back onto the LRU
list. If lowmem reclaim persisted for long periods of time, the same
highmem pages get continually scanned. The idea would be that lowmem
keeps those pages on a separate list until a reclaim for highmem pages
arrives that splices the highmem pages back onto the LRU. It potentially
could be implemented similar to the UNEVICTABLE list.

That would reduce the skip rate with the potential corner case is that
highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

This will break LRU ordering but may be preferable and faster during
memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

599d0c95

21 5月, 2016 1 次提交

mm: throttle on IO only when there are too many dirty and writeback pages · ede37713

由 Michal Hocko 提交于 5月 20, 2016

wait_iff_congested has been used to throttle allocator before it retried
another round of direct reclaim to allow the writeback to make some
progress and prevent reclaim from looping over dirty/writeback pages
without making any progress.

We used to do congestion_wait before commit 0e093d99 ("writeback: do
not sleep on the congestion queue if there are no congested BDIs or if
significant congestion is not being encountered in the current zone")
but that led to undesirable stalls and sleeping for the full timeout
even when the BDI wasn't congested.  Hence wait_iff_congested was used
instead.

But it seems that even wait_iff_congested doesn't work as expected.  We
might have a small file LRU list with all pages dirty/writeback and yet
the bdi is not congested so this is just a cond_resched in the end and
can end up triggering pre mature OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than a half of the reclaimable pages on the zone which might be
usable for our target allocation.  This shouldn't reintroduce stalls
fixed by 0e093d99 because congestion_wait is called only when we are
getting hopeless when sleeping is a better choice than OOM with many
pages under IO.

We have to preserve logic introduced by commit 373ccbe5 ("mm,
vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
progress") into the __alloc_pages_slowpath now that wait_iff_congested
is not used anymore.  As the only remaining user of wait_iff_congested
is shrink_inactive_list we can remove the WQ specific short sleep from
wait_iff_congested because the sleep is needed to be done only once in
the allocation retry cycle.

[mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
 Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ede37713

01 4月, 2016 1 次提交

writeback: fix the wrong congested state variable definition · c877ef8a

由 Kaixu Xia 提交于 3月 31, 2016

The right variable definition should be wb_congested_state that
include WB_async_congested and WB_sync_congested. So fix it.
Signed-off-by: NKaixu Xia <xiakaixu@huawei.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

c877ef8a

18 3月, 2016 1 次提交

mm: convert printk(KERN_<LEVEL> to pr_<level> · 1170532b

由 Joe Perches 提交于 3月 17, 2016

Most of the mm subsystem uses pr_<level> so make it consistent.

Miscellanea:

 - Realign arguments
 - Add missing newline to format
 - kmemleak-test.c has a "kmemleak: " prefix added to the
   "Kmemleak testing" logging message via pr_fmt
Signed-off-by: NJoe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1170532b

12 2月, 2016 1 次提交

mm/backing-dev.c: fix error path in wb_init() · 078c6c3a

由 Rasmus Villemoes 提交于 2月 11, 2016

We need to use post-decrement to get percpu_counter_destroy() called on
&wb->stat[0].  Moreover, the pre-decremebt would cause infinite
out-of-bounds accesses if the setup code failed at i==0.
Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

078c6c3a

06 2月, 2016 1 次提交

mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress · 564e81a5

由 Tetsuo Handa 提交于 2月 05, 2016

Jan Stancek has reported that system occasionally hanging after "oom01"
testcase from LTP triggers OOM.  Guessing from a result that there is a
kworker thread doing memory allocation and the values between "Node 0
Normal free:" and "Node 0 Normal:" differs when hanging, vmstat is not
up-to-date for some reason.

According to commit 373ccbe5 ("mm, vmstat: allow WQ concurrency to
discover memory reclaim doesn't make any progress"), it meant to force
the kworker thread to take a short sleep, but it by error used
schedule_timeout(1).  We missed that schedule_timeout() in state
TASK_RUNNING doesn't do anything.

Fix it by using schedule_timeout_uninterruptible(1) which forces the
kworker thread to take a short sleep in order to make sure that vmstat
is up-to-date.

Fixes: 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: NJan Stancek <jstancek@redhat.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Cristopher Lameter <clameter@sgi.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

564e81a5

15 1月, 2016 1 次提交

mm: memcontrol: export root_mem_cgroup · 7d828602

由 Johannes Weiner 提交于 1月 14, 2016

A later patch will need this symbol in files other than memcontrol.c, so
export it now and replace mem_cgroup_root_css at the same time.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7d828602

13 12月, 2015 1 次提交

mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress · 373ccbe5

由 Michal Hocko 提交于 12月 11, 2015

Tetsuo Handa has reported that the system might basically livelock in
OOM condition without triggering the OOM killer.

The issue is caused by internal dependency of the direct reclaim on
vmstat counter updates (via zone_reclaimable) which are performed from
the workqueue context.  If all the current workers get assigned to an
allocation request, though, they will be looping inside the allocator
trying to reclaim memory but zone_reclaimable can see stalled numbers so
it will consider a zone reclaimable even though it has been scanned way
too much.  WQ concurrency logic will not consider this situation as a
congested workqueue because it relies that worker would have to sleep in
such a situation.  This also means that it doesn't try to spawn new
workers or invoke the rescuer thread if the one is assigned to the
queue.

In order to fix this issue we need to do two things.  First we have to
let wq concurrency code know that we are in trouble so we have to do a
short sleep.  In order to prevent from issues handled by 0e093d99
("writeback: do not sleep on the congestion queue if there are no
congested BDIs or if significant congestion is not being encountered in
the current zone") we limit the sleep only to worker threads which are
the ones of the interest anyway.

The second thing to do is to create a dedicated workqueue for vmstat and
mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
have a spare worker thread for it.
Signed-off-by: NMichal Hocko <mhocko@suse.com>
Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Cristopher Lameter <clameter@sgi.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

373ccbe5

07 11月, 2015 1 次提交

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc

由 Mel Gorman 提交于 11月 06, 2015

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d0164adc

21 10月, 2015 1 次提交

writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() · e27c5b9d

由 Tejun Heo 提交于 10月 13, 2015

a20135ff ("writeback: don't drain bdi_writeback_congested on bdi
destruction") added rbtree_postorder_for_each_entry_safe() which is
used to remove all entries; however, according to Cody, the iterator
isn't safe against operations which may rebalance the tree. Fix it by
switching to repeatedly removing rb_first() until empty.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NCody P Schafer <dev@codyps.com>
Fixes: a20135ff ("writeback: don't drain bdi_writeback_congested on bdi destruction")
Link: http://lkml.kernel.org/g/1443997973-1700-1-git-send-email-dev@codyps.comSigned-off-by: NJens Axboe <axboe@fb.com>

e27c5b9d

15 10月, 2015 1 次提交

block: don't release bdi while request_queue has live references · b02176f3

由 Tejun Heo 提交于 9月 08, 2015

bdi's are initialized in two steps, bdi_init() and bdi_register(), but
destroyed in a single step by bdi_destroy() which, for a bdi embedded
in a request_queue, is called during blk_cleanup_queue() which makes
the queue invisible and starts the draining of remaining usages.

A request_queue's user can access the congestion state of the embedded
bdi as long as it holds a reference to the queue. As such, it may
access the congested state of a queue which finished
blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
Because the congested state was embedded in backing_dev_info which in
turn is embedded in request_queue, accessing the congested state after
bdi_destroy() was called was fine. The bdi was destroyed but the
memory region for the congested state remained accessible till the
queue got released.

a13f35e8 ("writeback: don't embed root bdi_writeback_congested in
bdi_writeback") changed the situation. Now, the root congested state
which is expected to be pinned while request_queue remains accessible
is separately reference counted and the base ref is put during
bdi_destroy(). This means that the root congested state may go away
prematurely while the queue is between bdi_dstroy() and
blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

The root cause of this problem is that bdi doesn't distinguish the two
steps of destruction, unregistration and release, and now the root
congested state actually requires a separate release step. To fix the
issue, this patch separates out bdi_unregister() and bdi_exit() from
bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
simple wrapper calling the two steps back-to-back.

While at it, the prototype of bdi_destroy() is moved right below
bdi_setup_and_register() so that the counterpart operations are
located together.
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Cc: stable@vger.kernel.org # v4.2+
Reported-and-tested-by: NAndrey Konovalov <andreyknvl@google.com>
Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.comReviewed-by: NJan Kara <jack@suse.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

b02176f3

13 10月, 2015 1 次提交

writeback: bdi_writeback iteration must not skip dying ones · b817525a

由 Tejun Heo 提交于 10月 02, 2015

bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi->wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-and-tested-by: NArtem Bityutskiy <dedekind1@gmail.com>
Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

b817525a

19 8月, 2015 1 次提交

blkcg: rename subsystem name from blkio to io · c165b3e3

由 Tejun Heo 提交于 8月 18, 2015

blkio interface has become messy over time and is currently the
largest.  In addition to the inconsistent naming scheme, it has
multiple stat files which report more or less the same thing, a number
of debug stat files which expose internal details which shouldn't have
been part of the public interface in the first place, recursive and
non-recursive stats and leaf and non-leaf knobs.

Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
don't make any sense on the unified hierarchy as only leaf cgroups can
contain processes.  cgroups is going through a major interface
revision with the unified hierarchy involving significant fundamental
usage changes and given that a significant portion of the interface
doesn't make sense anymore, it's a good time to reorganize the
interface.

As the first step, this patch renames the external visible subsystem
name from "blkio" to "io".  This is more concise, matches the other
two major subsystem names, "cpu" and "memory", and better suited as
blkcg will be involved in anything writeback related too whether an
actual block device is involved or not.

As the subsystem legacy_name is set to "blkio", the only userland
visible change outside the unified hierarchy is that blkcg is reported
as "io" instead of "blkio" in the subsystem initialized message during
boot.  On the unified hierarchy, blkcg now appears as "io".
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@fb.com>

c165b3e3

18 8月, 2015 1 次提交

inode: rename i_wb_list to i_io_list · c7f54084

由 Dave Chinner 提交于 3月 04, 2015

There's a small consistency problem between the inode and writeback
naming. Writeback calls the "for IO" inode queues b_io and
b_more_io, but the inode calls these the "writeback list" or
i_wb_list. This makes it hard to an new "under writeback" list to
the inode, or call it an "under IO" list on the bdi because either
way we'll have writeback on IO and IO on writeback and it'll just be
confusing. I'm getting confused just writing this!

So, rename the inode "for IO" list variable to i_io_list so we can
add a new "writeback list" in a subsequent patch.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NDave Chinner <dchinner@redhat.com>

c7f54084

02 7月, 2015 2 次提交

writeback: don't drain bdi_writeback_congested on bdi destruction · a20135ff

由 Tejun Heo 提交于 7月 01, 2015

52ebea74 ("writeback: make backing_dev_info host cgroup-specific
bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
(bdi_writeback's).  As the congested state needs to be per-wb and
referenced from blkcg side and multiple wbs, the patch made all
non-root cong's (bdi_writeback_congested's) reference counted and
indexed on bdi.

When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
non-root cong's; however, this can hang indefinitely because wb's can
also be referenced from blkcg_gq's which are destroyed after bdi
destruction is complete.

This patch fixes the bug by updating bdi destruction to not wait for
cong's to drain.  A cong is unlinked from bdi->cgwb_congested_tree on
bdi destuction regardless of its reference count as the bdi may go
away any point after destruction.  wb_congested_put() checks whether
the cong is already unlinked on release.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NJon Christopherson <jon@jons.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
Fixes: 52ebea74 ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
Tested-by: NJon Christopherson <jon@jons.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

a20135ff

writeback: don't embed root bdi_writeback_congested in bdi_writeback · a13f35e8

由 Tejun Heo 提交于 7月 02, 2015

52ebea74 ("writeback: make backing_dev_info host cgroup-specific
bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
(bdi_writeback's).  As the congested state needs to be per-wb and
referenced from blkcg side and multiple wbs, the patch made all
non-root cong's (bdi_writeback_congested's) reference counted and
indexed on bdi.

When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
non-root cong's; however, this can hang indefinitely because wb's can
also be referenced from blkcg_gq's which are destroyed after bdi
destruction is complete.

To fix the bug, bdi destruction will be updated to not wait for cong's
to drain, which naturally means that cong's may outlive the associated
bdi.  This is fine for non-root cong's but is problematic for the root
cong's which are embedded in their bdi's as they may end up getting
dereferenced after the containing bdi's are freed.

This patch makes root cong's behave the same as non-root cong's.  They
are no longer embedded in their bdi's but allocated separately during
bdi initialization, indexed and reference counted the same way.

* As cong handling is the same for all wb's, wb->congested
  initialization is moved into wb_init().

* When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
  bdi->wb_congested is now a pointer pointing to the root cong
  allocated during bdi init and minimal refcnting operations are
  implemented.

* The above makes root wb init paths diverge depending on
  CONFIG_CGROUP_WRITEBACK.  root wb init is moved to cgwb_bdi_init().

This patch in itself shouldn't cause any consequential behavior
differences but prepares for the actual fix.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NJon Christopherson <jon@jons.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681Tested-by: NJon Christopherson <jon@jons.org>

Added <linux/slab.h> include to backing-dev.h for kfree() definition.
Signed-off-by: NJens Axboe <axboe@fb.com>

a13f35e8

05 6月, 2015 1 次提交

bdi: fix wrong error return value in cgwb_create() · 5857cd63

由 Tejun Heo 提交于 6月 05, 2015

On wb_congested_get_create() failure, cgwb_create() forgot to set @ret
to -ENOMEM ending up returning 0.  Fix it so that it returns -ENOMEM.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

5857cd63

02 6月, 2015 5 次提交

writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() · 21c6321f

由 Tejun Heo 提交于 5月 28, 2015

Currently, majority of cgroup writeback support including all the
above functions are implemented in include/linux/backing-dev.h and
mm/backing-dev.c; however, the portion closely related to writeback
logic implemented in include/linux/writeback.h and mm/page-writeback.c
will expand to support foreign writeback detection and correction.

This patch moves wb[_try]_get() and wb_put() to
include/linux/backing-dev-defs.h so that they can be used from
writeback.h and inode_{attach|detach}_wb() to writeback.h and
page-writeback.c.

This is pure reorganization and doesn't introduce any functional
changes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

21c6321f

writeback: implement memcg wb_domain · 841710aa

由 Tejun Heo 提交于 5月 22, 2015

Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets the proportinately sized portion of the memory in the
whole system.

For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg.  IOW, a wb will belong to two writeback domains - the global
and memcg domains.

The previous patches laid the groundwork to support the two wb_domains
and this patch implements memcg wb_domain.  memcg->cgwb_domain is
initialized on css online and destroyed on css release,
wb->memcg_completions is added, and __wb_writeout_inc() is updated to
increment completions against both global and memcg wb_domains.

The following patches will update balance_dirty_pages() and its
subroutines to actually consider memcg wb_domain for throttling.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

841710aa

writeback: clean up wb_dirty_limit() · 0d960a38

由 Tejun Heo 提交于 5月 22, 2015

The function name wb_dirty_limit(), its argument @dirty and the local
variable @wb_dirty are mortally confusing given that the function
calculates per-wb threshold value not dirty pages, especially given
that @dirty and @wb_dirty are used elsewhere for dirty pages.

Let's rename the function to wb_calc_thresh() and wb_dirty to
wb_thresh.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

0d960a38

writeback: implement bdi_wait_for_completion() · cc395d7f

由 Tejun Heo 提交于 5月 22, 2015

If the completion of a wb_writeback_work can be waited upon by setting
its ->done to a struct completion and waiting on it; however, for
cgroup writeback support, it's necessary to issue multiple work items
to multiple bdi_writebacks and wait for the completion of all.

This patch implements wb_completion which can wait for multiple work
items and replaces the struct completion with it.  It can be defined
using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
waited for by wb_wait_for_completion().

Nobody currently issues multiple work items and this patch doesn't
introduce any behavior changes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

cc395d7f

writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account · 95a46c65

由 Tejun Heo 提交于 5月 22, 2015

bdi_has_dirty_io() used to only reflect whether the root wb
(bdi_writeback) has dirty inodes.  For cgroup writeback support, it
needs to take all active wb's into account.  If any wb on the bdi has
dirty inodes, bdi_has_dirty_io() should return true.

To achieve that, as inode_wb_list_{move|del}_locked() now keep track
of the dirty state transition of each wb, the number of dirty wbs can
be counted in the bdi; however, bdi is already aggregating
wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
dip below 1.  bdi_has_dirty_io() can simply test whether
bdi->tot_write_bandwidth is zero or not.

While this bumps the value of wb->avg_write_bandwidth to one when it
used to be zero, this shouldn't cause any meaningful behavior
difference.

bdi_has_dirty_io() is made an inline function which tests whether
->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
value are added to inode_wb_list_{move|del}_locked().
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

95a46c65

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功