1. 16 Apr 2015, 1 commit
    • aio: fix serial draining in exit_aio() · dc48e56d
      Committed by Jens Axboe
      exit_aio() currently serializes killing io contexts. Killing each
      context ends up doing a percpu_ref_kill(), which in turn has to wait
      for an RCU grace period. This can take a long time, depending on the
      number of contexts. And there is no point in doing them serially when
      we could be waiting for all of them in one fell swoop.
      
      This patch makes my fio thread offload test case exit in 0.2s instead
      of almost 6s.
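      A minimal sketch of the batched approach (the struct and helper names
      follow my reading of the patch and should be treated as illustrative):

          struct ctx_rq_wait {
              struct completion comp;
              atomic_t count;
          };

          void exit_aio(struct mm_struct *mm)
          {
              struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
              struct ctx_rq_wait wait;
              int i, nr = 0;

              if (!table)
                  return;

              /* Count live contexts so the completion fires when the last
               * of them drains. */
              for (i = 0; i < table->nr; i++)
                  if (table->table[i])
                      nr++;

              init_completion(&wait.comp);
              atomic_set(&wait.count, nr);

              /* Start all the kills up front; the RCU grace periods behind
               * the percpu_ref_kill() calls now overlap instead of running
               * back to back. */
              for (i = 0; i < table->nr; i++)
                  if (table->table[i])
                      kill_ioctx(mm, table->table[i], &wait);

              /* Pay for a single wait instead of one per context. */
              if (nr)
                  wait_for_completion(&wait.comp);
          }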
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      dc48e56d
  2. 20 Feb 2015, 1 commit
  3. 04 Feb 2015, 1 commit
    • aio: annotate aio_read_event_ring for sleep patterns · 9c9ce763
      Committed by Dave Chinner
      Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw
      warnings like the following due to being called from wait_event
      context:
      
       WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90()
       do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810d85a3>] prepare_to_wait_event+0x63/0x110
       Modules linked in:
       CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8
        ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8
        ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300
       Call Trace:
        [<ffffffff81daf2bd>] dump_stack+0x4c/0x65
        [<ffffffff8109beda>] warn_slowpath_common+0x8a/0xc0
        [<ffffffff8109bf56>] warn_slowpath_fmt+0x46/0x50
        [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110
        [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110
        [<ffffffff810bdfcf>] __might_sleep+0x7f/0x90
        [<ffffffff81db8344>] mutex_lock+0x24/0x45
        [<ffffffff81216b7c>] aio_read_events+0x4c/0x290
        [<ffffffff81216fac>] read_events+0x1ec/0x220
        [<ffffffff810d8650>] ? prepare_to_wait_event+0x110/0x110
        [<ffffffff810fdb10>] ? hrtimer_get_res+0x50/0x50
        [<ffffffff8121899d>] SyS_io_getevents+0x4d/0xb0
        [<ffffffff81dba5a9>] system_call_fastpath+0x12/0x17
       ---[ end trace bde69eaf655a4fea ]---
      
      There is not actually a bug here, so annotate the code to tell the
      debug logic that everything is just fine and not to fire a false
      positive.
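      The annotation itself is a one-liner; a sketch of the shape of the
      fix (sched_annotate_sleep() is the scheduler's helper for exactly
      this kind of known-safe false positive):

          static long aio_read_events_ring(struct kioctx *ctx,
                                           struct io_event __user *event, long nr)
          {
              /* We are called from wait_event context (state != TASK_RUNNING),
               * but blocking on ctx->ring_lock here cannot lose a wakeup, so
               * tell the CONFIG_DEBUG_ATOMIC_SLEEP checks to stand down. */
              sched_annotate_sleep();
              mutex_lock(&ctx->ring_lock);
              /* ... copy events out of the ring ... */
          }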
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      9c9ce763
  4. 21 Jan 2015, 2 commits
  5. 14 Dec 2014, 2 commits
    • aio: Skip timer for io_getevents if timeout=0 · 5f785de5
      Committed by Fam Zheng
      In this case, it is basically polling. Let's not involve the timer at
      all, because that would hurt performance for application event loops.
      
      In an arbitrary test I've done, the io_getevents syscall elapsed time
      dropped from 50000+ nanoseconds to a few hundred.
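      The change is small; in sketch form, read_events() simply bypasses
      the hrtimer-based wait when the caller asked for a zero timeout:

          if (until == 0)
              /* Pure poll: take whatever is already in the ring. */
              aio_read_events(ctx, min_nr, nr, event, &ret);
          else
              wait_event_interruptible_hrtimeout(ctx->wait,
                      aio_read_events(ctx, min_nr, nr, event, &ret),
                      until);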
      Signed-off-by: Fam Zheng <famz@redhat.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      5f785de5
    • aio: Make it possible to remap aio ring · e4a0d3e7
      Committed by Pavel Emelyanov
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump tasks'
      state and restore one back exactly as it was. One of the tasks' state
      bits is the set of rings created with the io_setup() call. There are
      (almost) no problems dumping them; the problem is restoring them -- if I
      dump a task with an aio ring originally mapped at address A, I want to
      restore it back at exactly the same address A. Unfortunately, io_setup() does
      not allow for that -- it mmaps the ring at whatever place mm finds
      appropriate (it calls do_mmap_pgoff() with zero address and without
      the MAP_FIXED flag).
      
      To make restore possible I'm going to mremap() the freshly created ring
      into the address A (under which it was seen before dump). The problem is
      that the ring's virtual address is passed back to the user-space as the
      context ID and this ID is then used as search key by all the other io_foo()
      calls. Reworking this ID to be just some integer doesn't seem to work,
      as libaio already uses this value as a pointer through which it
      accesses the aio metadata in memory.
      
      So, to make restore work we need to make sure that
      
      a) ring is mapped at desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on the aio region update the
      kioctx's user_id and mmap_base values, as sketched below.
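      A sketch of the remap handler (the exact hook it attaches to moved
      between f_op->mremap and vm_ops->mremap across kernel versions; the
      body is the interesting part):

          static int aio_ring_mremap(struct vm_area_struct *vma)
          {
              struct mm_struct *mm = vma->vm_mm;
              struct kioctx_table *table;
              int i, res = -EINVAL;

              spin_lock(&mm->ioctx_lock);
              rcu_read_lock();
              table = rcu_dereference(mm->ioctx_table);
              for (i = 0; i < table->nr; i++) {
                  struct kioctx *ctx = table->table[i];

                  if (ctx && ctx->aio_ring_file == vma->vm_file) {
                      /* Keep the context ID in sync with the new location
                       * so io_submit() and friends still find the ring. */
                      ctx->user_id = ctx->mmap_base = vma->vm_start;
                      res = 0;
                      break;
                  }
              }
              rcu_read_unlock();
              spin_unlock(&mm->ioctx_lock);
              return res;
          }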
      
      Here is the second issue mentioned at the beginning of this message.
      If (regardless of the C/R dances I do) someone creates an io context
      with io_setup(), then mremap()-s the ring and then destroys the context,
      the kill_ioctx() routine will call munmap() on the wrong (old) address.
      This will result in a) the aio ring remaining in memory and b) some
      other vma getting unexpectedly unmapped.
      
      What do you think?
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      e4a0d3e7
  6. 07 Nov 2014, 1 commit
    • aio: fix incorrect dirty pages accounting when truncating AIO ring buffer · 835f252c
      Committed by Gu Zheng
      https://bugzilla.kernel.org/show_bug.cgi?id=86831
      
      Markus reported that shutting down mysqld (with AIO support, on an
      ext3-formatted hard drive) leads to a negative number of dirty pages
      (an underrun of the counter). The negative number results in a drastic
      reduction of write performance, because the page cache is not used: the
      kernel thinks there are still 2^32 dirty pages outstanding.
      
      Adding a warning in __dec_zone_state catches this easily:
      
      static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
      {
           atomic_long_dec(&zone->vm_stat[item]);
      +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
      +                 atomic_long_read(&zone->vm_stat[item]) < 0);
           atomic_long_dec(&vm_stat[item]);
      }
      
      [   21.341632] ------------[ cut here ]------------
      [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224()
      [   21.355296] Modules linked in: wutbox_cp sata_mv
      [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
      [   21.366793] Workqueue: events free_ioctx
      [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>] (show_stack+0x20/0x24)
      [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>] (dump_stack+0x24/0x28)
      [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>] (warn_slowpath_common+0x84/0x9c)
      [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>] (warn_slowpath_null+0x2c/0x34)
      [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>] (cancel_dirty_page+0x164/0x224)
      [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>] (truncate_inode_page+0x8c/0x158)
      [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>] (truncate_inode_pages_range+0x11c/0x53c)
      [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
      [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>] (truncate_setsize+0x5c/0x74)
      [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>] (put_aio_ring_file.isra.14+0x34/0x90)
      [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from [<c013b424>] (aio_free_ring+0x20/0xcc)
      [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>] (free_ioctx+0x24/0x44)
      [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>] (process_one_work+0x134/0x47c)
      [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>] (worker_thread+0x130/0x414)
      [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>] (kthread+0xd4/0xec)
      [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>] (ret_from_fork+0x14/0x20)
      [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
      
      The cause is that we set the aio ring file pages as *DIRTY* via
      SetPageDirty (bypassing the VFS dirty-page increment) at init time,
      while aio fs uses *default_backing_dev_info* as the backing dev, which
      does not disable the dirty-page accounting capability. So truncating
      the aio ring file contributes to the dirty-page accounting (a VFS
      dirty-page decrement), and the counter underruns.
      
      The original goal was to keep these pages in memory (unable to be
      reclaimed or swapped) for their lifetime by marking them dirty. But on
      further thought, we have already pinned the pages by elevating their
      refcount, which achieves that goal on its own, so the SetPageDirty
      seems unnecessary.
      
      To fix the issue, use __set_page_dirty_no_writeback instead of the nop
      .set_page_dirty, and drop the SetPageDirty (don't manually set the
      dirty flags, don't disable set_page_dirty(); rely on the default
      behaviour).
      
      With the above change, the dirty-page accounting works correctly. But
      as we know, aio fs is an anonymous filesystem that should never cause
      any real writeback, so we can skip the dirty-page (writeback)
      accounting altogether by disabling that capability. We therefore
      introduce an aio-private backing dev info (with the
      ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) to replace the
      default one, as sketched below.
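      A sketch of the replacement backing_dev_info, using the capability
      flags of that kernel generation (names from memory; treat the details
      as illustrative):

          static struct backing_dev_info aio_fs_backing_dev_info = {
              .name         = "aiofs",
              .state        = 0,
              /* Anonymous fs, never a real writeback target: opt out of
               * dirty and writeback accounting entirely. */
              .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
          };

      with the ring inode pointed at it (and at an a_ops whose
      .set_page_dirty is __set_page_dirty_no_writeback) at creation time.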
      Reported-by: Markus Königshaus <m.koenigshaus@wut.de>
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: stable <stable@vger.kernel.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      835f252c
  7. 25 Sep 2014, 1 commit
    • percpu_ref: add PERCPU_REF_INIT_* flags · 2aad2a86
      Committed by Tejun Heo
      With the recent addition of percpu_ref_reinit(), percpu_ref now can be
      used as a persistent switch which can be turned on and off repeatedly
      where turning off maps to killing the ref and waiting for it to drain;
      however, there currently isn't a way to initialize a percpu_ref in its
      off (killed and drained) state, which can be inconvenient for certain
      persistent switch use cases.
      
      Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
      selection of operation mode; however, currently a newly initialized
      percpu_ref is always in percpu mode making it impossible to avoid the
      latency overhead of switching to atomic mode.
      
      This patch adds @flags to percpu_ref_init() and implements the
      following flags.
      
      * PERCPU_REF_INIT_ATOMIC	: start ref in atomic mode
      * PERCPU_REF_INIT_DEAD		: start ref killed and drained
      
      These flags should be able to serve the above two use cases.
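      Illustrative usage for the persistent-switch case (my_switch and
      my_release are placeholder names):

          /* Start in the off position: killed and drained. */
          ret = percpu_ref_init(&my_switch, my_release,
                                PERCPU_REF_INIT_DEAD, GFP_KERNEL);
          if (ret)
              return ret;

          percpu_ref_reinit(&my_switch);    /* switch on */
          /* ... */
          percpu_ref_kill(&my_switch);      /* switch off: kill and drain */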
      
      v2: target_core_tpg.c conversion was missing.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      2aad2a86
  8. 08 Sep 2014, 1 commit
    • percpu-refcount: add @gfp to percpu_ref_init() · a34375ef
      Committed by Tejun Heo
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_refs too.
      
      This patch doesn't make any functional difference.
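      As of this commit the signature is percpu_ref_init(ref, release, gfp);
      the @flags argument only arrives with the later PERCPU_REF_INIT_*
      patch above. A sketch of the case the mask enables (hypothetical
      caller):

          /* Caller that cannot sleep: previously impossible, because the
           * embedded percpu allocation was always GFP_KERNEL. */
          if (percpu_ref_init(&ref, my_release, GFP_NOWAIT))
              return -ENOMEM;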
      
      v2: blk-mq conversion was missing.  Updated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      a34375ef
  9. 05 Sep 2014, 1 commit
  10. 03 Sep 2014, 1 commit
    • aio: add missing smp_rmb() in read_events_ring · 2ff396be
      Committed by Jeff Moyer
      We ran into a case on ppc64 running mariadb where io_getevents would
      return zeroed out I/O events.  After adding instrumentation, it became
      clear that there was some missing synchronization between reading the
      tail pointer and the events themselves.  This small patch fixes the
      problem in testing.
      
      Thanks to Zach for helping to look into this, and suggesting the fix.
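      The essence of the fix in aio_read_events_ring(), as a sketch:

          ring = kmap_atomic(ctx->ring_pages[0]);
          head = ring->head;
          tail = ring->tail;
          kunmap_atomic(ring);

          /* Pairs with the write barrier on the completion side: make sure
           * the tail read above happens before we read the events it says
           * are there, or we may copy out stale (zeroed) events. */
          smp_rmb();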
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      2ff396be
  11. 25 Aug 2014, 1 commit
    • aio: fix reqs_available handling · d856f32a
      Committed by Benjamin LaHaise
      As reported by Dan Aloni, commit f8567a38 ("aio: fix aio request
      leak when events are reaped by userspace") introduces a regression when
      user code attempts to perform io_submit() with more events than are
      available in the ring buffer.  Reverting that commit would reintroduce a
      regression when user space event reaping is used.
      
      Fixing this bug is a bit more involved than the previous attempts to fix
      this regression.  Since we do not have a single point at which we can
      count events as being reaped by user space and io_getevents(), we have
      to track event completion by looking at the number of events left in the
      event ring.  So long as there are as many events in the ring buffer as
      there have been completion events generated, we cannot call
      put_reqs_available().  The code to check for this is now placed in
      refill_reqs_available(), sketched below.
      
      A test program from Dan, modified by me, that verifies this bug is
      available at http://www.kvack.org/~bcrl/20140824-aio_bug.c .
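      A sketch of the counting logic, close to the patch as merged:

          static void refill_reqs_available(struct kioctx *ctx, unsigned head,
                                            unsigned tail)
          {
              unsigned events_in_ring, completed;

              /* Clamp head: it lives in the shared ring, which userspace
               * can write to. */
              head %= ctx->nr_events;
              if (head <= tail)
                  events_in_ring = tail - head;
              else
                  events_in_ring = ctx->nr_events - (head - tail);

              /* Only completions no longer visible in the ring have truly
               * been reaped; the rest must keep their reservations. */
              completed = ctx->completed_events;
              if (completed > events_in_ring)
                  completed -= events_in_ring;
              else
                  completed = 0;

              if (!completed)
                  return;

              ctx->completed_events -= completed;
              put_reqs_available(ctx, completed);
          }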
      Reported-by: Dan Aloni <dan@kernelim.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Acked-by: Dan Aloni <dan@kernelim.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a38 was backported to
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d856f32a
  12. 24 Jul 2014, 4 commits
  13. 22 Jul 2014, 1 commit
  14. 15 Jul 2014, 1 commit
  15. 28 Jun 2014, 2 commits
    • percpu-refcount: require percpu_ref to be exited explicitly · 9a1049da
      Committed by Tejun Heo
      Currently, a percpu_ref undoes percpu_ref_init() automatically by
      freeing the allocated percpu area when the percpu_ref is killed.
      While seemingly convenient, this has the following niggles.
      
      * It's impossible to re-init a released reference counter without
        going through re-allocation.
      
      * In the similar vein, it's impossible to initialize a percpu_ref
        count with static percpu variables.
      
      * We need and have an explicit destructor anyway for failure paths -
        percpu_ref_cancel_init().
      
      This patch removes the automatic percpu counter freeing in
      percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
      generic destructor, now named percpu_ref_exit().  percpu_ref_destroy()
      was considered, but it gets confusing alongside percpu_ref_kill(),
      while "exit" clearly indicates that it is the counterpart of
      percpu_ref_init().
      
      All percpu_ref_cancel_init() users are updated to invoke
      percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
      added to the destruction path of all percpu_ref users.
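      The resulting lifecycle, using aio's ->users ref as a running example
      (the two-argument init is the signature of this era):

          if (percpu_ref_init(&ctx->users, free_ioctx_users))
              goto err;                       /* setup */

          percpu_ref_kill(&ctx->users);       /* stop new refs and drain */

          percpu_ref_exit(&ctx->users);       /* explicit teardown; pairs
                                                 with percpu_ref_init() */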
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Li Zefan <lizefan@huawei.com>
      9a1049da
    • percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc() · 55c6c814
      Committed by Tejun Heo
      ioctx_alloc() reaches inside percpu_ref and directly frees
      ->pcpu_count in its failure path, which is quite gross.  percpu_ref
      has been providing a proper interface to do this,
      percpu_ref_cancel_init(), for quite some time now.  Let's use that
      instead.
      
      This patch doesn't introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      55c6c814
  16. 25 Jun 2014, 4 commits
    • aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx() · 855ef0de
      Committed by Oleg Nesterov
      ioctx_add_table() is the writer; it does not need rcu_read_lock() to
      protect ->ioctx_table. It relies on mm->ioctx_lock, and the RCU locks
      just add confusion.
      
      For the same reason it doesn't need rcu_dereference(): it must see any
      updates previously done under the same ->ioctx_lock. We could use
      rcu_dereference_protected(), but the patch uses rcu_dereference_raw();
      the function is simple enough.
      
      The same for kill_ioctx(), although it does not update the pointer.
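      The writer-side pattern this leaves behind, in sketch form:

          spin_lock(&mm->ioctx_lock);
          /* Writer: ->ioctx_lock is held, so a raw dereference is safe,
           * and rcu_read_lock() would only suggest a protection that is
           * not needed here. */
          table = rcu_dereference_raw(mm->ioctx_table);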
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      855ef0de
    • aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock() · 4b70ac5f
      Committed by Oleg Nesterov
      On 04/30, Benjamin LaHaise wrote:
      >
      > > -		ctx->mmap_size = 0;
      > > -
      > > -		kill_ioctx(mm, ctx, NULL);
      > > +		if (ctx) {
      > > +			ctx->mmap_size = 0;
      > > +			kill_ioctx(mm, ctx, NULL);
      > > +		}
      >
      > Rather than indenting and moving the two lines changing mmap_size and the
      > kill_ioctx() call, why not just do "if (!ctx) ... continue;"?  That reduces
      > the number of lines changed and avoid excessive indentation.
      
      OK. To me the code looks better/simpler with "if (ctx)", but this is subjective
      of course, I won't argue.
      
      The patch still removes the empty line between mmap_size = 0 and
      kill_ioctx(), since we reset mmap_size only for kill_ioctx(). But feel
      free to remove this change.
      
      -------------------------------------------------------------------------------
      Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
      
      1. We can read ->ioctx_table only once and do not need rcu_read_lock()
         or even rcu_dereference().
      
         This mm has no users; nobody else can play with ->ioctx_table.
         Otherwise the code would be buggy anyway: if we needed
         rcu_read_lock() in a loop because ->ioctx_table could be updated,
         then kfree(table) would be obviously wrong.
      
      2. Update the comment. "exit_mmap(mm) is coming" is a good reason to
         avoid munmap(), but another reason is that we simply can't do
         vm_munmap() unless current->mm == mm, which is not true in general;
         the caller is mmput().
      
      3. We do not really need to nullify mm->ioctx_table before returning;
         the current code probably does this to catch potential problems.
         But in that case RCU_INIT_POINTER(NULL) looks better. (See the
         sketch below.)
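      Putting the three points together, the loop ends up roughly as:

          struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
          int i;

          if (!table)
              return;

          for (i = 0; i < table->nr; ++i) {
              struct kioctx *ctx = table->table[i];

              if (!ctx)
                  continue;
              /* No munmap() here: exit_mmap(mm) is coming, and we could
               * not vm_munmap() anyway since current->mm may not equal mm
               * (the caller is mmput()). */
              ctx->mmap_size = 0;
              kill_ioctx(mm, ctx, NULL);
          }

          RCU_INIT_POINTER(mm->ioctx_table, NULL);
          kfree(table);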
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      4b70ac5f
    • aio: fix kernel memory disclosure in io_getevents() introduced in v3.10 · edfbbf38
      Committed by Benjamin LaHaise
      A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
      by commit a31ad380.  The changes made to
      aio_read_events_ring() failed to correctly limit the index into
      ctx->ring_pages[], allowing an attacker to cause the subsequent kmap()
      of an arbitrary page, with a copy_to_user() copying its contents into
      userspace.
      This vulnerability has been assigned CVE-2014-0206.  Thanks to Mateusz and
      Petr for disclosing this issue.
      
      This patch applies to v3.12+.  A separate backport is needed for 3.10/3.11.
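      The essence of the fix, as a sketch: head and tail are read from the
      shared ring, which userspace maps and can scribble on, so clamp them
      before they are used to index ctx->ring_pages[]:

          head %= ctx->nr_events;
          tail %= ctx->nr_events;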
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: stable@vger.kernel.org
      edfbbf38
    • aio: fix aio request leak when events are reaped by userspace · f8567a38
      Committed by Benjamin LaHaise
      The aio cleanups and optimizations by kmo that were merged into the 3.10
      tree added a regression for userspace event reaping.  Specifically, the
      reference counts are not decremented if the event is reaped in userspace,
      leading to the application being unable to submit further aio requests.
      This patch applies to 3.12+.  A separate backport is required for 3.10/3.11.
      This issue was uncovered as part of CVE-2014-0206.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      f8567a38
  17. 07 May 2014, 1 commit
    • new methods: ->read_iter() and ->write_iter() · 293bc982
      Committed by Al Viro
      Beginning to introduce those.  Just the callers for now, and it's
      clumsier than it'll eventually become; once we finish converting
      aio_read and aio_write instances, the things will get nicer.
      
      For now, these guys are in parallel to ->aio_read() and ->aio_write();
      they take an iocb and an iov_iter, with everything in the iov_iter
      already validated.  The file offset is passed in iocb->ki_pos, and
      iov/nr_segs live in the iov_iter.
      
      The main concerns in this series are stack footprint and the ability
      to split the damn thing cleanly.
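      The shape of the new methods next to the old ones:

          struct file_operations {
              /* ... */
              ssize_t (*aio_read)(struct kiocb *, const struct iovec *,
                                  unsigned long, loff_t);
              ssize_t (*aio_write)(struct kiocb *, const struct iovec *,
                                   unsigned long, loff_t);
              /* new: the iovec and offset travel in iov_iter and kiocb */
              ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
              ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
              /* ... */
          };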
      
      [fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      293bc982
  18. 01 May 2014, 1 commit
  19. 30 Apr 2014, 2 commits
  20. 23 Apr 2014, 1 commit
  21. 17 Apr 2014, 1 commit
    • aio: block io_destroy() until all context requests are completed · e02ba72a
      Committed by Anatol Pomozov
      io_destroy() deletes an aio context and all resources related to it.
      It makes sense that no IO operations connected to the context should be
      running after the context is destroyed: once the context is removed we
      have no way to get request status or call io_getevents().
      
      The man page for io_destroy says that this function may block until
      all the context's requests are completed. Before kernel 3.11,
      io_destroy() did indeed block, but since the aio refactoring in 3.11
      this is no longer true.
      
      Here is pseudo-code showing a testcase for the race condition
      discovered in 3.11:
      
        initialize io_context
        io_submit(read to buffer)
        io_destroy()
      
        // context is destroyed so we can free the resources
        free(buffers);
      
        // if the buffer is reallocated by some other user, they'll be
        // surprised to find it still being filled by an outstanding
        // operation from the destroyed io_context
      
      The fix is straightforward: add a completion struct and wait on it in
      io_destroy(); complete() is called when the number of in-flight
      requests reaches zero.
      
      If two or more io_destroy() calls are made for the same context
      simultaneously, only the first one waits for IO completion; the
      behaviour of the other calls is undefined.
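      A sketch of the waiting side, close to the patch (kill_ioctx() grows
      a completion parameter, and the request-draining path calls
      complete() when the in-flight count hits zero):

          SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
          {
              struct kioctx *ioctx = lookup_ioctx(ctx);

              if (likely(ioctx)) {
                  struct completion requests_done =
                      COMPLETION_INITIALIZER_ONSTACK(requests_done);
                  long ret;

                  /* Pass the completion down so the context can signal us
                   * once its in-flight request count reaches zero. */
                  ret = kill_ioctx(current->mm, ioctx, &requests_done);
                  percpu_ref_put(&ioctx->users);

                  if (!ret)
                      wait_for_completion(&requests_done);
                  return ret;
              }
              return -EINVAL;
          }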
      
      Tested: ran the http://pastebin.com/LrPsQ4RL testcase for several
        hours and did not see the race condition anymore.
      Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      e02ba72a
  22. 28 Mar 2014, 1 commit
    • aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration · fa8a53c3
      Committed by Benjamin LaHaise
      As reported by Tang Chen, Gu Zheng and Yasuaki Ishimatsu, the following
      issues exist in the aio ring page migration support.
      
      As a result, for example, we have the following problem:
      
                  thread 1                      |              thread 2
                                                |
      aio_migratepage()                         |
       |-> take ctx->completion_lock            |
       |-> migrate_page_copy(new, old)          |
       |   *NOW*, ctx->ring_pages[idx] == old   |
                                                |
                                                |    *NOW*, ctx->ring_pages[idx] == old
                                                |    aio_read_events_ring()
                                                |     |-> ring = kmap_atomic(ctx->ring_pages[0])
                                                |     |-> ring->head = head;          *HERE, write to the old ring page*
                                                |     |-> kunmap_atomic(ring);
                                                |
       |-> ctx->ring_pages[idx] = new           |
       |   *BUT NOW*, the content of            |
       |    ring_pages[idx] is old.             |
       |-> release ctx->completion_lock         |
      
      As above, the new ring page will not be updated.
      
      Fix this issue, as well as prevent races in aio_ring_setup() by holding
      the ring_lock mutex during kioctx setup and page migration.  This avoids
      the overhead of taking another spinlock in aio_read_events_ring() as Tang's
      and Gu's original fix did, pushing the overhead into the migration code.
      
      Note that to handle the nesting of ring_lock inside of mmap_sem, the
      migratepage operation uses mutex_trylock().  Page migration is not a 100%
      critical operation in this case, so the occasional failure can be
      tolerated.  This issue was reported by Sasha Levin.
      
      Based on feedback from Linus, avoid the extra taking of ctx->completion_lock.
      Instead, make page migration fully serialised by mapping->private_lock, and
      have aio_free_ring() simply disconnect the kioctx from the mapping by calling
      put_aio_ring_file() before touching ctx->ring_pages[].  This simplifies the
      error handling logic in aio_migratepage(), and should improve robustness.
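      The locking shape this settles on in aio_migratepage(), sketched with
      error handling elided:

          /* mapping->private_lock serialises us against aio_free_ring()
           * disconnecting the kioctx from the mapping. */
          spin_lock(&mapping->private_lock);
          ctx = mapping->private_data;
          /* ... validate ctx and the old page ... */

          /* ring_lock nests inside mmap_sem, so a blocking mutex_lock()
           * could deadlock; failing the occasional migration with -EAGAIN
           * is an acceptable trade. */
          if (!mutex_trylock(&ctx->ring_lock)) {
              rc = -EAGAIN;
              goto out;
          }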
      
      v4: always do mutex_unlock() in cases when kioctx setup fails.
      Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: stable@vger.kernel.org
      fa8a53c3
  23. 23 Dec 2013, 1 commit
    • aio: clean up and fix aio_setup_ring page mapping · 3dc9acb6
      Committed by Linus Torvalds
      Since commit 36bc08cc ("fs/aio: Add support to aio ring pages
      migration") the aio ring setup code has used a special per-ring backing
      inode for the page allocations, rather than just using random anonymous
      pages.
      
      However, rather than remembering the pages as it allocated them, it
      would allocate the pages, insert them into the file mapping (dirty, so
      that they couldn't be free'd), and then forget about them.  And then to
      look them up again, it would mmap the mapping, and then use
      "get_user_pages()" to get back an array of the pages we just created.
      
      Now, not only is that incredibly inefficient, it also leaked all the
      pages if the mmap failed (which could happen due to an excessive number
      of mappings, for example).
      
      So clean it all up, making it much more straightforward.  Also remove
      some left-overs of the previous (broken) mm_populate() usage that was
      removed in commit d6c355c7 ("aio: fix race in ring buffer page
      lookup introduced by page migration support") but left the pointless and
      now misleading MAP_POPULATE flag around.
      Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dc9acb6
  24. 22 Dec 2013, 2 commits
    • aio/migratepages: make aio migrate pages sane · 8e321fef
      Committed by Benjamin LaHaise
      The arbitrary restriction on page counts offered by the core
      migrate_page_move_mapping() code results in rather suspicious looking
      fiddling with page reference counts in the aio_migratepage() operation.
      To fix this, make migrate_page_move_mapping() take an extra_count parameter
      that allows aio to tell the code about its own reference count on the page
      being migrated.
      
      While cleaning up aio_migratepage(), make it validate that the old page
      being passed in is actually what aio_migratepage() expects to prevent
      misbehaviour in the case of races.
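      With the new parameter, aio can simply declare its extra reference
      (sketch against the signature of that era; the trailing 1 is aio's
      own reference on the ring page):

          rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1);
          if (rc != MIGRATEPAGE_SUCCESS)
              goto out_unlock;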
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      8e321fef
    • aio: fix kioctx leak introduced by "aio: Fix a trinity splat" · 1881686f
      Committed by Benjamin LaHaise
      Commit e34ecee2 reworked the percpu reference counting to correct a
      bug trinity found.  Unfortunately, the change led to kioctxes being
      leaked because there was no final reference count to put.  Add that
      reference count back in to fix things.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      1881686f
  25. 06 Dec 2013, 1 commit
  26. 20 Nov 2013, 2 commits
  27. 13 Nov 2013, 1 commit
  28. 09 Nov 2013, 1 commit