1. 04 4月, 2016 1 次提交
  2. 05 9月, 2015 1 次提交
  3. 16 4月, 2015 1 次提交
    • J
      aio: fix serial draining in exit_aio() · dc48e56d
      Jens Axboe 提交于
      exit_aio() currently serializes killing io contexts. Each context
      killing ends up having to do percpu_ref_kill(), which in turns has
      to wait for an RCU grace period. This can take a long time, depending
      on the number of contexts. And there's no point in doing them serially,
      when we could be waiting for all of them in one fell swoop.
      
      This patches makes my fio thread offload test case exit 0.2s instead
      of almost 6s.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dc48e56d
  4. 12 4月, 2015 8 次提交
  5. 07 4月, 2015 2 次提交
    • A
      ioctx_alloc(): fix vma (and file) leak on failure · deeb8525
      Al Viro 提交于
      If we fail past the aio_setup_ring(), we need to destroy the
      mapping.  We don't need to care about anybody having found ctx,
      or added requests to it, since the last failure exit is exactly
      the failure to make ctx visible to lookups.
      
      Reproducer (based on one by Joe Mario <jmario@redhat.com>):
      
      void count(char *p)
      {
      	char s[80];
      	printf("%s: ", p);
      	fflush(stdout);
      	sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid());
      	system(s);
      }
      
      int main()
      {
      	io_context_t *ctx;
      	int created, limit, i, destroyed;
      	FILE *f;
      
      	count("before");
      	if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL)
      		perror("opening aio-max-nr");
      	else if (fscanf(f, "%d", &limit) != 1)
      		fprintf(stderr, "can't parse aio-max-nr\n");
      	else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL)
      		perror("allocating aio_context_t array");
      	else {
      		for (i = 0, created = 0; i < limit; i++) {
      			if (io_setup(1000, ctx + created) == 0)
      				created++;
      		}
      		for (i = 0, destroyed = 0; i < created; i++)
      			if (io_destroy(ctx[i]) == 0)
      				destroyed++;
      		printf("created %d, failed %d, destroyed %d\n",
      			created, limit - created, destroyed);
      		count("after");
      	}
      }
      Found-by: NJoe Mario <jmario@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      deeb8525
    • A
      fix mremap() vs. ioctx_kill() race · b2edffdd
      Al Viro 提交于
      teach ->mremap() method to return an error and have it fail for
      aio mappings in process of being killed
      
      Note that in case of ->mremap() failure we need to undo move_page_tables()
      we'd already done; we could call ->mremap() first, but then the failure of
      move_page_tables() would require undoing whatever _successful_ ->mremap()
      has done, which would be a lot more headache in general.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b2edffdd
  6. 14 3月, 2015 2 次提交
    • C
      fs: split generic and aio kiocb · 04b2fa9f
      Christoph Hellwig 提交于
      Most callers in the kernel want to perform synchronous file I/O, but
      still have to bloat the stack with a full struct kiocb.  Split out
      the parts needed in filesystem code from those in the aio code, and
      only allocate those needed to pass down argument on the stack.  The
      aio code embedds the generic iocb in the one it allocates and can
      easily get back to it by using container_of.
      
      Also add a ->ki_complete method to struct kiocb, this is used to call
      into the aio code and thus removes the dependency on aio for filesystems
      impementing asynchronous operations.  It will also allow other callers
      to substitute their own completion callback.
      
      We also add a new ->ki_flags field to work around the nasty layering
      violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add
      eventfd notification about ffs events").
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      04b2fa9f
    • C
      fs: don't allow to complete sync iocbs through aio_complete · 599bd19b
      Christoph Hellwig 提交于
      The AIO interface is fairly complex because it tries to allow
      filesystems to always work async and then wakeup a synchronous
      caller through aio_complete.  It turns out that basically no one
      was doing this to avoid the complexity and context switches,
      and we've already fixed up the remaining users and can now
      get rid of this case.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      599bd19b
  7. 13 3月, 2015 1 次提交
  8. 20 2月, 2015 1 次提交
  9. 04 2月, 2015 1 次提交
    • D
      aio: annotate aio_read_event_ring for sleep patterns · 9c9ce763
      Dave Chinner 提交于
      Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw
      warnings like the following due to being called from wait_event
      context:
      
       WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90()
       do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810d85a3>] prepare_to_wait_event+0x63/0x110
       Modules linked in:
       CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8
        ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8
        ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300
       Call Trace:
        [<ffffffff81daf2bd>] dump_stack+0x4c/0x65
        [<ffffffff8109beda>] warn_slowpath_common+0x8a/0xc0
        [<ffffffff8109bf56>] warn_slowpath_fmt+0x46/0x50
        [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110
        [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110
        [<ffffffff810bdfcf>] __might_sleep+0x7f/0x90
        [<ffffffff81db8344>] mutex_lock+0x24/0x45
        [<ffffffff81216b7c>] aio_read_events+0x4c/0x290
        [<ffffffff81216fac>] read_events+0x1ec/0x220
        [<ffffffff810d8650>] ? prepare_to_wait_event+0x110/0x110
        [<ffffffff810fdb10>] ? hrtimer_get_res+0x50/0x50
        [<ffffffff8121899d>] SyS_io_getevents+0x4d/0xb0
        [<ffffffff81dba5a9>] system_call_fastpath+0x12/0x17
       ---[ end trace bde69eaf655a4fea ]---
      
      There is not actually a bug here, so annotate the code to tell the
      debug logic that everything is just fine and not to fire a false
      positive.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      9c9ce763
  10. 21 1月, 2015 2 次提交
  11. 14 12月, 2014 2 次提交
    • F
      aio: Skip timer for io_getevents if timeout=0 · 5f785de5
      Fam Zheng 提交于
      In this case, it is basically a polling. Let's not involve timer at all
      because that would hurt performance for application event loops.
      
      In an arbitrary test I've done, io_getevents syscall elapsed time
      reduces from 50000+ nanoseconds to a few hundereds.
      Signed-off-by: NFam Zheng <famz@redhat.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      5f785de5
    • P
      aio: Make it possible to remap aio ring · e4a0d3e7
      Pavel Emelyanov 提交于
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump tasks'
      state and restore one back exactly as it was. One of the tasks' state
      bits is rings set up with io_setup() call. There's (almost) no problems
      in dumping them, there's a problem restoring them -- if I dump a task
      with aio ring originally mapped at address A, I want to restore one
      back at exactly the same address A. Unfortunately, the io_setup() does
      not allow for that -- it mmaps the ring at whatever place mm finds
      appropriate (it calls do_mmap_pgoff() with zero address and without
      the MAP_FIXED flag).
      
      To make restore possible I'm going to mremap() the freshly created ring
      into the address A (under which it was seen before dump). The problem is
      that the ring's virtual address is passed back to the user-space as the
      context ID and this ID is then used as search key by all the other io_foo()
      calls. Reworking this ID to be just some integer doesn't seem to work, as
      this value is already used by libaio as a pointer using which this library
      accesses memory for aio meta-data.
      
      So, to make restore work we need to make sure that
      
      a) ring is mapped at desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on aio region update the
      kioctx's user_id and mmap_base values.
      
      Here appears the 2nd issue I mentioned in the beginning of this mail.
      If (regardless of the C/R dances I do) someone creates an io context
      with io_setup(), then mremap()-s the ring and then destroys the context,
      the kill_ioctx() routine will call munmap() on wrong (old) address.
      This will result in a) aio ring remaining in memory and b) some other
      vma get unexpectedly unmapped.
      
      What do you think?
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Acked-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      e4a0d3e7
  12. 07 11月, 2014 1 次提交
    • G
      aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer · 835f252c
      Gu Zheng 提交于
      https://bugzilla.kernel.org/show_bug.cgi?id=86831
      
      Markus reported that when shutting down mysqld (with AIO support,
      on a ext3 formatted Harddrive) leads to a negative number of dirty pages
      (underrun to the counter). The negative number results in a drastic reduction
      of the write performance because the page cache is not used, because the kernel
      thinks it is still 2 ^ 32 dirty pages open.
      
      Add a warn trace in __dec_zone_state will catch this easily:
      
      static inline void __dec_zone_state(struct zone *zone, enum
      	zone_stat_item item)
      {
           atomic_long_dec(&zone->vm_stat[item]);
      +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
      	atomic_long_read(&zone->vm_stat[item]) < 0);
           atomic_long_dec(&vm_stat[item]);
      }
      
      [   21.341632] ------------[ cut here ]------------
      [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
      cancel_dirty_page+0x164/0x224()
      [   21.355296] Modules linked in: wutbox_cp sata_mv
      [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
      [   21.366793] Workqueue: events free_ioctx
      [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
      (show_stack+0x20/0x24)
      [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
      (dump_stack+0x24/0x28)
      [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
      (warn_slowpath_common+0x84/0x9c)
      [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
      (warn_slowpath_null+0x2c/0x34)
      [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
      (cancel_dirty_page+0x164/0x224)
      [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
      (truncate_inode_page+0x8c/0x158)
      [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
      (truncate_inode_pages_range+0x11c/0x53c)
      [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
      [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
      [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
      (truncate_setsize+0x5c/0x74)
      [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
      (put_aio_ring_file.isra.14+0x34/0x90)
      [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
      [<c013b424>] (aio_free_ring+0x20/0xcc)
      [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
      (free_ioctx+0x24/0x44)
      [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
      (process_one_work+0x134/0x47c)
      [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
      (worker_thread+0x130/0x414)
      [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
      (kthread+0xd4/0xec)
      [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
      (ret_from_fork+0x14/0x20)
      [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
      
      The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty
      (bypasses the VFS dirty pages increment) when init, and aio fs uses
      *default_backing_dev_info* as the backing dev, which does not disable
      the dirty pages accounting capability.
      So truncating aio ring file will contribute to accounting dirty pages (VFS
      dirty pages decrement), then error occurs.
      
      The original goal is keeping these pages in memory (can not be reclaimed
      or swapped) in life-time via marking it dirty. But thinking more, we have
      already pinned pages via elevating the page's refcount, which can already
      achieve the goal, so the SetPageDirty seems unnecessary.
      
      In order to fix the issue, using the __set_page_dirty_no_writeback instead
      of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually
      set the dirty flags, don't disable set_page_dirty(), rely on default behaviour).
      
      With the above change, the dirty pages accounting can work well. But as we
      known, aio fs is an anonymous one, which should never cause any real write-back,
      we can ignore the dirty pages (write back) accounting by disabling the dirty
      pages (write back) accounting capability. So we introduce an aio private
      backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to
      replace the default one.
      Reported-by: NMarkus Königshaus <m.koenigshaus@wut.de>
      Signed-off-by: NGu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: stable <stable@vger.kernel.org>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      835f252c
  13. 25 9月, 2014 1 次提交
    • T
      percpu_ref: add PERCPU_REF_INIT_* flags · 2aad2a86
      Tejun Heo 提交于
      With the recent addition of percpu_ref_reinit(), percpu_ref now can be
      used as a persistent switch which can be turned on and off repeatedly
      where turning off maps to killing the ref and waiting for it to drain;
      however, there currently isn't a way to initialize a percpu_ref in its
      off (killed and drained) state, which can be inconvenient for certain
      persistent switch use cases.
      
      Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
      selection of operation mode; however, currently a newly initialized
      percpu_ref is always in percpu mode making it impossible to avoid the
      latency overhead of switching to atomic mode.
      
      This patch adds @flags to percpu_ref_init() and implements the
      following flags.
      
      * PERCPU_REF_INIT_ATOMIC	: start ref in atomic mode
      * PERCPU_REF_INIT_DEAD		: start ref killed and drained
      
      These flags should be able to serve the above two use cases.
      
      v2: target_core_tpg.c conversion was missing.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      2aad2a86
  14. 08 9月, 2014 1 次提交
    • T
      percpu-refcount: add @gfp to percpu_ref_init() · a34375ef
      Tejun Heo 提交于
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_refs too.
      
      This patch doesn't make any functional difference.
      
      v2: blk-mq conversion was missing.  Updated.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      a34375ef
  15. 05 9月, 2014 1 次提交
  16. 03 9月, 2014 1 次提交
    • J
      aio: add missing smp_rmb() in read_events_ring · 2ff396be
      Jeff Moyer 提交于
      We ran into a case on ppc64 running mariadb where io_getevents would
      return zeroed out I/O events.  After adding instrumentation, it became
      clear that there was some missing synchronization between reading the
      tail pointer and the events themselves.  This small patch fixes the
      problem in testing.
      
      Thanks to Zach for helping to look into this, and suggesting the fix.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      2ff396be
  17. 25 8月, 2014 1 次提交
    • B
      aio: fix reqs_available handling · d856f32a
      Benjamin LaHaise 提交于
      As reported by Dan Aloni, commit f8567a38 ("aio: fix aio request
      leak when events are reaped by userspace") introduces a regression when
      user code attempts to perform io_submit() with more events than are
      available in the ring buffer.  Reverting that commit would reintroduce a
      regression when user space event reaping is used.
      
      Fixing this bug is a bit more involved than the previous attempts to fix
      this regression.  Since we do not have a single point at which we can
      count events as being reaped by user space and io_getevents(), we have
      to track event completion by looking at the number of events left in the
      event ring.  So long as there are as many events in the ring buffer as
      there have been completion events generate, we cannot call
      put_reqs_available().  The code to check for this is now placed in
      refill_reqs_available().
      
      A test program from Dan and modified by me for verifying this bug is available
      at http://www.kvack.org/~bcrl/20140824-aio_bug.c .
      Reported-by: NDan Aloni <dan@kernelim.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Acked-by: NDan Aloni <dan@kernelim.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a38 was backported to
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d856f32a
  18. 24 7月, 2014 4 次提交
  19. 22 7月, 2014 1 次提交
  20. 15 7月, 2014 1 次提交
  21. 28 6月, 2014 2 次提交
    • T
      percpu-refcount: require percpu_ref to be exited explicitly · 9a1049da
      Tejun Heo 提交于
      Currently, a percpu_ref undoes percpu_ref_init() automatically by
      freeing the allocated percpu area when the percpu_ref is killed.
      While seemingly convenient, this has the following niggles.
      
      * It's impossible to re-init a released reference counter without
        going through re-allocation.
      
      * In the similar vein, it's impossible to initialize a percpu_ref
        count with static percpu variables.
      
      * We need and have an explicit destructor anyway for failure paths -
        percpu_ref_cancel_init().
      
      This patch removes the automatic percpu counter freeing in
      percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
      generic destructor now named percpu_ref_exit().  percpu_ref_destroy()
      is considered but it gets confusing with percpu_ref_kill() while
      "exit" clearly indicates that it's the counterpart of
      percpu_ref_init().
      
      All percpu_ref_cancel_init() users are updated to invoke
      percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
      added to the destruction path of all percpu_ref users.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Li Zefan <lizefan@huawei.com>
      9a1049da
    • T
      percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc() · 55c6c814
      Tejun Heo 提交于
      ioctx_alloc() reaches inside percpu_ref and directly frees
      ->pcpu_count in its failure path, which is quite gross.  percpu_ref
      has been providing a proper interface to do this,
      percpu_ref_cancel_init(), for quite some time now.  Let's use that
      instead.
      
      This patch doesn't introduce any behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      55c6c814
  22. 25 6月, 2014 4 次提交
    • O
      aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx() · 855ef0de
      Oleg Nesterov 提交于
      ioctx_add_table() is the writer, it does not need rcu_read_lock() to
      protect ->ioctx_table. It relies on mm->ioctx_lock and rcu locks just
      add the confusion.
      
      And it doesn't need rcu_dereference() by the same reason, it must see
      any updates previously done under the same ->ioctx_lock. We could use
      rcu_dereference_protected() but the patch uses rcu_dereference_raw(),
      the function is simple enough.
      
      The same for kill_ioctx(), although it does not update the pointer.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      855ef0de
    • O
      aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock() · 4b70ac5f
      Oleg Nesterov 提交于
      On 04/30, Benjamin LaHaise wrote:
      >
      > > -		ctx->mmap_size = 0;
      > > -
      > > -		kill_ioctx(mm, ctx, NULL);
      > > +		if (ctx) {
      > > +			ctx->mmap_size = 0;
      > > +			kill_ioctx(mm, ctx, NULL);
      > > +		}
      >
      > Rather than indenting and moving the two lines changing mmap_size and the
      > kill_ioctx() call, why not just do "if (!ctx) ... continue;"?  That reduces
      > the number of lines changed and avoid excessive indentation.
      
      OK. To me the code looks better/simpler with "if (ctx)", but this is subjective
      of course, I won't argue.
      
      The patch still removes the empty line between mmap_size = 0 and kill_ioctx(),
      we reset mmap_size only for kill_ioctx(). But feel free to remove this change.
      
      -------------------------------------------------------------------------------
      Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
      
      1. We can read ->ioctx_table only once and we do not read rcu_read_lock()
         or even rcu_dereference().
      
         This mm has no users, nobody else can play with ->ioctx_table. Otherwise
         the code is buggy anyway, if we need rcu_read_lock() in a loop because
         ->ioctx_table can be updated then kfree(table) is obviously wrong.
      
      2. Update the comment. "exit_mmap(mm) is coming" is the good reason to avoid
         munmap(), but another reason is that we simply can't do vm_munmap() unless
         current->mm == mm and this is not true in general, the caller is mmput().
      
      3. We do not really need to nullify mm->ioctx_table before return, probably
         the current code does this to catch the potential problems. But in this
         case RCU_INIT_POINTER(NULL) looks better.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      4b70ac5f
    • B
      aio: fix kernel memory disclosure in io_getevents() introduced in v3.10 · edfbbf38
      Benjamin LaHaise 提交于
      A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
      by commit a31ad380.  The changes made to
      aio_read_events_ring() failed to correctly limit the index into
      ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
      an arbitrary page with a copy_to_user() to copy the contents into userspace.
      This vulnerability has been assigned CVE-2014-0206.  Thanks to Mateusz and
      Petr for disclosing this issue.
      
      This patch applies to v3.12+.  A separate backport is needed for 3.10/3.11.
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: stable@vger.kernel.org
      edfbbf38
    • B
      aio: fix aio request leak when events are reaped by userspace · f8567a38
      Benjamin LaHaise 提交于
      The aio cleanups and optimizations by kmo that were merged into the 3.10
      tree added a regression for userspace event reaping.  Specifically, the
      reference counts are not decremented if the event is reaped in userspace,
      leading to the application being unable to submit further aio requests.
      This patch applies to 3.12+.  A separate backport is required for 3.10/3.11.
      This issue was uncovered as part of CVE-2014-0206.
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      f8567a38